EIP-7939: Create a new opcode for counting leading zeros (CLZ)

Count leading zeros (CLZ) is a native instruction in many processor architectures
(even in RISC architectures like ARM).

It is a basic building block used in many important fixed point and binary operations:

  • lnWad
  • powWad (due to lnWad)
  • sqrt
  • log2
  • most significant bit (msb)
  • least significant bit (lsb)
    (can be easily implemented with CLZ by isolating the smallest bit via x & -x)

Code examples:

For more applications see: Find first set - Wikipedia

Emulating this opcode is very wasteful in terms of runtime gas and bytecode size
(due to the need to store large constants like 0xffffffffffffffffffffffffffffffff or De Bruijn lookups like 0x0009010a0d15021d0b0e10121619031e080c141c0f111807131b17061a05041f).

Given that most processors have this opcode implemented natively, which can be used to efficiently implement the opcode in EVM, it feels even more wasteful not to have it available in the EVM.

Having this opcode in the EVM will enable cheaper runtime gas, cheaper validator computation, and smaller bytecode.


For the implementation of CLZ, I’d propose a gas cost of 3
(to keep in line with the gas cost of shl, shr, not, xor, etc).

This CLZ opcode will take in a single variable on the stack, and output a single variable onto the stack.

For the case of 0 as the input, CLZ returns 256 (the bit width of the input) to stay consistent with how it behaves in most processors, and for lesser bytecode required to handle the output.


CLZ has the following advantages over other bit counting operations:

  • log2(x) can be computed via 255 - clz(x), which automatically reverts for the case of 0 if using checked arithmetic. This makes it preferred over find first set (FFS).

  • Count trailing zeros (CTZ) is way cheaper to emulate with current Yul opcodes compared to CLZ. See the LibBit.sol. Inclusion of CLZ will make CTZ easy to compute, but not vice versa.

Basically, CLZ is the most common bit counting operation in processors because other bit counting operations can be easily implemented with CLZ.

4 Likes

Can you leave an example of an efficient ctz implementation using clz?

Well, not automatically, but with code generated by the compiler :slight_smile:

Would you be able to rewrite the solady code using the hypothetical clz opcode? It would be a good way to see the complexity reduction.

Here is a lookup table for that.

521 bytes long.

Test it on evm.codes !

You just need to open an instruction’s info, then click on " Reproduce in playground", because if you click on the menu’s playground link it will only accept Solidity code.

Here is the raw table in hex:

0000010002ea0000038bebab0099000004008c00ec36ac0000d09a000078004a050000008dd60000ed003700ad7100dd00e5d1429ba4009100fe796000004b00067f0000004700da8e00d70000000000ee12003d38000000ae0072000000de000025e632d2a043009c1ca5000000920000c7ff007a2061f1005200f84cb600150766800000a90000000048000000db408f5e0000d800003b0000003000000000eff6130000003e00392e000000000000af000000730000b100bc0000dfc1000000002600e7963375d36ea100440000009d001db3a6000000000000be930000000057c800000000e17b0e21c3625af2000000530000cbf9004d00b7000000162808006700810000e9008aaa980000003500cf0077490000d500000070dce441a390fd5f00007e0046d900000000113c00000000000024319f001b000000c6001ff051f7b5146500a8000000003f5d00003a002f0000f50000002d000000000000b0bb00c000000095746d00000000b2000000bd0000560000e00dc259000000ca0000000027000000e889970034ce7600d4006fe3a2fc007d45000010000000239e1a00c51e50b464a700005c000000f4002c000000babf00946c000000000055000c5800c9000000008800cd0000e2fb7c000f002219c44f63005b00f32b00b9006b0000540b00000087cc00fa0000184e00002ab86a000a0086000017002969098500006884008382

Please put this EIP onto the table on Github!

https://github.com/search?q=repo%3Aethereum%2FEIPs%20zeroes&type=code

https://github.com/search?q=repo%3Aethereum%2FEIPs+leading+zeroes&type=code

https://github.com/search?q=repo%3Aethereum%2FEIPs+clz&type=code

:roll_eyes: :frowning_face:

Some updates.

As of the latest Solady at the time of writing, solady/src/utils/LibBit.sol at 24b6272d83fba9a4233bafafbcd7f879e3b0948b · Vectorized/solady · GitHub

CLZ: 184 gas
FFS: 136 gas

Having a specialized CLZ opcode can reduce the amount of gas to about 8 gas for CLZ, and about 30 gas for FFS (due to the need to isolate the least significant bit before CLZ).

CLZ will be useful in sqrt, cbrt, logarithms, bitmap lookups.

As of now, the codebase that will benefit the most if CLZ were to be shipped immediately will be Uniswap V4. But since the Uniswap V4 codebase has already been submitted for final audits, and the singleton will be launched soon, the benefits of having this opcode in EVM may be limited going forward.

After considering the amount of EVM fragmentation this may cause, I am hesistant on pushing core devs to implement this opcode.

1 Like

Would be less work than an ADD, but have some stack work so I’d suggest CLZ would be 3 gas as NOT is also 3 gas

I’ve changed my mind.

I’m in full support of clz now.

I have recently found a very good use case of using clz for generalized calldata compression (Solady LibZip.cdCompress and LibZip.cdDecompress). Having clz will greatly speed up this algorithm, as well as many other related bytes/string algorithms. I expect many future contracts to use LibZip.cdCompress and LibZip.cdDecompress for compressing/decompressing variables to and from storage.

X thread on how to use it to automagically pack storage variables.

https://x.com/optimizoor/status/1914238227386704053?s=61

Why is 0 256 rather than 255?

Since 1 would be 254? :thinking:

Ignore this I can’t count

Updated the EIP with a SP1 rv32im optimized example.

inline uint32_t clz(uint32_t x) {
    uint32_t r = 0;
    if (!(x & 0xFFFF0000)) { r += 16; x <<= 16; }
    if (!(x & 0xFF000000)) { r += 8;  x <<= 8;  }
    if (!(x & 0xF0000000)) { r += 4;  x <<= 4;  }
    if (!(x & 0xC0000000)) { r += 2;  x <<= 2;  }
    if (!(x & 0x80000000)) { r += 1; }
    return r;
}

// `x` is a uint256 bit number represented with 8 uint32 limbs.
// This implementation is optimized for SP1 proving via rv32im.
// For regular compute, it performs similarly to `add`.
inline uint32_t clz(uint32_t x[8]) {
    if (x[7] != 0) return clz(x[7]);
    if (x[6] != 0) return 32 + clz(x[6]);
    if (x[5] != 0) return 64 + clz(x[5]);
    if (x[4] != 0) return 96 + clz(x[4]);
    if (x[3] != 0) return 128 + clz(x[3]);
    if (x[2] != 0) return 160 + clz(x[2]);
    if (x[1] != 0) return 192 + clz(x[1]);
    if (x[0] != 0) return 224 + clz(x[0]);
    return 256;
}

When optimized, clz is cheaper to prove than add.

(it’s cheaper to compute too, if we don’t care about proving)

void add256(
    const uint32_t x[8],
    const uint32_t y[8],
    uint32_t out[8]
) {
    uint32_t carry, sum;

    sum = x[0] + y[0];
    carry = (sum < x[0]);
    out[0] = sum;

    sum = x[1] + y[1] + carry;
    carry = (sum < x[1]) || (carry && sum == x[1]);
    out[1] = sum;

    sum = x[2] + y[2] + carry;
    carry = (sum < x[2]) || (carry && sum == x[2]);
    out[2] = sum;

    sum = x[3] + y[3] + carry;
    carry = (sum < x[3]) || (carry && sum == x[3]);
    out[3] = sum;

    sum = x[4] + y[4] + carry;
    carry = (sum < x[4]) || (carry && sum == x[4]);
    out[4] = sum;

    sum = x[5] + y[5] + carry;
    carry = (sum < x[5]) || (carry && sum == x[5]);
    out[5] = sum;

    sum = x[6] + y[6] + carry;
    carry = (sum < x[6]) || (carry && sum == x[6]);
    out[6] = sum;

    sum = x[7] + y[7] + carry;
    carry = (sum < x[7]) || (carry && sum == x[7]);
    out[7] = sum;
}

The original Solidity implementation is actually notoriously expensive to prove in SP1, because of the dynamic shr.

This further increases the motivation for clz.

TL;DR: This instruction is useful in software and straight-forward to implement in EVM.

Language & Hardware support

The CLZ-like instruction is supported in many system programming languages.

The CLZ-like instruction is usually available in modern hardware.

  • x86-64: partial
  • x86-64-v3: yes
  • armv8: yes
  • rv64im: no
  • rv64imzbb: yes

EVM implementation context

  1. This hardware instruction is probably already used in EVM implementations, e.g. in division related opcodes: DIV, MOD, ADDMOD, MULMOD or in EXP.
  2. This EVM instruction should also help with EVM implementation of expmod or EC point multiplication precompiles.
  3. Implementation should be straight forward and linear in number of words (4). The best approach seems to be to look for the non-zero most significant word and apply the native clz operation to it. This is optimal for architectures having no/partial clz.

Gas cost

EIPs proposing new features to EVM (including this one) usually select the gas cost which is consistent with the current gas model but don’t necessarily reflect the actual execution performance. In this case, the comparison to the ADD instruction is a bit off because ADD pops two items from the stack while CLZ only one. Better comparison would be with the NOT instruction. However, because both ADD and NOT have the gas cost of 3, the same proposed gas cost for CLZ looks right.

On the other hand, in the context of EIP-7904: General Repricing, the gas cost of CLZ should be 1.

My recommendation is to keep the proposed gas cost of 3 and change it only in case the EIP-7904 is activated before or at the same time as this EIP.

Nit picks

Adding a clz opcode will also lead to cheaper ZK proving costs.

The CLZ instruction will be available only for newly deployed contract (at least 2 years from now). And it is unlikely to see big migrations just because there is a new instruction available.

We have benchmarked the clz implementation against the add implementation in the intx library.

If you have done any benchmarks please publish the source code anyhow.

Questions

  1. Should we consider adding the sibling instruction CTZ (count tailing zeros)?
1 Like

CTZ can be very easily implemented by isolating the least significant bit via -x & x and applying CLZ.

To preserve the opcode namespace, I’d propose only implementing CLZ.

1 Like