EIP-2537 (BLS12 precompile) discussion thread

I have finished fully implementing EIP-2537 in Constantine (see "EIP-2537 - BLS12-381 precompiles for the EVM", Pull Request #368 in mratsim/constantine on GitHub) and can now provide accurate benchmarks that take into account (de)serialization and other miscellaneous costs.

First on AMD Ryzen 7840U (2023, low power / ultraportable version of their flagship 7840U, TDP 28W down to 15W, 3.3GHz turbo 5.1GHz) with assembly and Clang.

Second on i5-5257U (MacBook Pro 2015, low-power dual-core Broadwell, TDP 15W down to 7.5W, 2.3GHz turbo 2.9GHz) without assembly (it doesn't have the ADOX/ADCX instructions either), but we do use the add-with-carry intrinsics. This is a 9-year-old x86 CPU.

Finally on a Raspberry Pi 4 (2019, TDP 10W, 1.5GHz) without assembly and without any add-with-carry intrinsics. The lack of add-with-carry makes bigint operations up to 3x more costly, as each add-with-carry needs to be replaced by the main addition, a comparison/carry check, and adding the carry.
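
To illustrate the three steps that replace a single add-with-carry, here is a minimal sketch in Python. It is not Constantine's code: the 64-bit limb size, the little-endian limb order and the function names are assumptions chosen for the illustration.

```python
MASK64 = (1 << 64) - 1  # assumption: limbs are 64-bit machine words

def add_with_carry_emulated(a, b, carry_in):
    """Add two 64-bit limbs and a carry bit using only wrapping adds and compares."""
    s = (a + b) & MASK64          # 1. main addition, wraps like a machine word
    c1 = 1 if s < a else 0        # 2. comparison detects overflow of the main addition
    t = (s + carry_in) & MASK64   # 3. add the incoming carry
    c2 = 1 if t < s else 0        #    adding the carry can itself overflow
    return t, c1 | c2             # c1 and c2 are never both set

def bigint_add(x, y):
    """Add two little-endian limb lists of equal length; returns (limbs, carry_out)."""
    out, carry = [], 0
    for a, b in zip(x, y):
        limb, carry = add_with_carry_emulated(a, b, carry)
        out.append(limb)
    return out, carry

# The carry ripples through both limbs: (2^128 - 1) + 1 = 2^128
print(bigint_add([MASK64, MASK64], [1, 0]))
```

With a hardware add-with-carry instruction or intrinsic, each loop iteration is a single operation, which is where the extra cost on targets without it comes from.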

See also the compiler codegen on Compiler Explorer and "Tracking compiler inefficiencies" (Issue #357 in mratsim/constantine on GitHub).


Price suggestions

This remark is confirmed by end-to-end tests.
Besides MSMs, all operations achieve at least 50~60Mgas/s on a 9-year-old, very low-powered CPU (15W!) without assembly, and even 230Mgas/s for mapping to G2. This strongly suggests that those operations are overcosted.

I think that a 9-year-old, 15W CPU is significantly outdated, and achieving 20Mgas/s without assembly on it should be fine.

Hence I suggest the following price changes:

| Primitive | Current price (gas) | Suggested price (gas) | Ryzen 7840U 28W 3.3GHz turbo 5.1GHz, clang+ASM, current (Mgas/s) | Ryzen 7840U 28W 3.3GHz turbo 5.1GHz, clang+ASM, repriced (Mgas/s) | Intel i5-5257U 15W 2.3GHz turbo 2.8GHz, clang+noASM+intrinsics, current (Mgas/s) | Intel i5-5257U 15W 2.3GHz turbo 2.8GHz, clang+noASM+intrinsics, repriced (Mgas/s) |
|---|---|---|---|---|---|---|
| BLS12_G1ADD | 500 | 125 | 185.60 | 46.4 | 79.24 | 19.81 |
| BLS12_G1MUL | 12000 | 3800 | 144.80 | 45.9 | 52.94 | 16.8 |
| BLS12_G2ADD | 800 | 170 | 218.28 | 46.325 | 86.30 | 18.33 |
| BLS12_G2MUL | 45000 | 6100 | 346.23 | 46.9 | 113.23 | 15.35 |
| BLS12_MAP_FP_TO_G1 | 5500 | 1600 | 161.09 | 46.86 | 59.91 | 17.43 |
| BLS12_MAP_FP2_TO_G2 | 75000 | 5000 | 702.17 | 46.8 | 231.85 | 15.46 |
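
The repriced columns follow from the measured ones: the time per call does not change, so the Mgas/s figure scales with the ratio of suggested to current price. A minimal sketch that reproduces them from the table values (the dictionary layout and function name are mine, only the numbers come from the table):

```python
# (current price, suggested price, Ryzen Mgas/s, i5 Mgas/s), copied from the table above
BENCH = {
    "BLS12_G1ADD":         (500,   125,  185.60, 79.24),
    "BLS12_G1MUL":         (12000, 3800, 144.80, 52.94),
    "BLS12_G2ADD":         (800,   170,  218.28, 86.30),
    "BLS12_G2MUL":         (45000, 6100, 346.23, 113.23),
    "BLS12_MAP_FP_TO_G1":  (5500,  1600, 161.09, 59.91),
    "BLS12_MAP_FP2_TO_G2": (75000, 5000, 702.17, 231.85),
}

def repriced_throughput(measured_mgas_s, current_price, suggested_price):
    """Same wall-clock time per call, so Mgas/s scales with the price ratio."""
    return measured_mgas_s * suggested_price / current_price

for name, (cur, new, ryzen, i5) in BENCH.items():
    print(f"{name:20s} Ryzen {repriced_throughput(ryzen, cur, new):6.2f} Mgas/s"
          f"   i5 {repriced_throughput(i5, cur, new):6.2f} Mgas/s")
```

Small differences in the last digit against the table come from rounding of the inputs.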

The rationale for these changes is to target 30Mgas/s on a CPU 2~3 years older than my own. CPUs progress on average by 10~15%/year, so assuming that such a CPU has about 66% of the performance of mine is reasonable, which puts the target on my CPU at around 45Mgas/s. That said, my CPU is still a low-power CPU, so instead of targeting 45Mgas/s on it we can target 40Mgas/s or even 35Mgas/s.
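
As a rough check of that arithmetic, the sketch below assumes yearly gains compound, which is my reading of where the 66% comes from and not something stated above. The function names are made up for the example.

```python
def relative_perf(years_older, yearly_gain):
    """Performance of an older CPU relative to a newer one, assuming compounding gains."""
    return 1.0 / (1.0 + yearly_gain) ** years_older

def required_on_newer(target_mgas_s, years_older, yearly_gain):
    """Throughput the newer CPU must reach so that the older one still hits the target."""
    return target_mgas_s / relative_perf(years_older, yearly_gain)

# 3 years at 15%/year is the pessimistic end, 2 years at 10%/year the optimistic end
for years, gain in [(3, 0.15), (2, 0.10)]:
    print(f"{years} years at {gain:.0%}/year: older CPU at {relative_perf(years, gain):.0%}, "
          f"target on newer CPU {required_on_newer(30, years, gain):.1f} Mgas/s")
```

Under these assumptions the target on the newer CPU lands between roughly 36 and 46 Mgas/s, which brackets the 35~45Mgas/s range discussed above.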

In any case it is clear that G2 functions are severely overcosted, especially mapping to G2.

Pairings

For pairings, the cost of a multi-pairing grows consistently with the number of pairs k, but the price is 5x the actual cost.

Hence I suggest moving from 43000*k + 65000 to 8600*k + 13000.
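
A small sketch of the two formulas, where k is the number of pairs in the call (the function names are mine). It only shows that the suggested formula is the current one divided by 5, for both the per-pair and the constant term.

```python
def pairing_gas_current(k):
    """Current pairing price for k pairs."""
    return 43000 * k + 65000

def pairing_gas_suggested(k):
    """Suggested pairing price for k pairs."""
    return 8600 * k + 13000

for k in (1, 2, 4, 8):
    cur, new = pairing_gas_current(k), pairing_gas_suggested(k)
    print(f"k={k}: current {cur} gas, suggested {new} gas, ratio {cur / new:.1f}x")
```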

MSMs

The discounts are too steep and need adjustment.

| Primitive | Current cost (gas) | Suggested cost (gas) | Ryzen 7840U 28W 3.3GHz turbo 5.1GHz, clang+ASM, current (Mgas/s) | Ryzen 7840U 28W 3.3GHz turbo 5.1GHz, clang+ASM, repriced (Mgas/s) | Intel i5-5257U 15W 2.3GHz turbo 2.8GHz, clang+noASM+intrinsics, current (Mgas/s) | Intel i5-5257U 15W 2.3GHz turbo 2.8GHz, clang+noASM+intrinsics, repriced (Mgas/s) |
|---|---|---|---|---|---|---|
| G1MSM 2 | 2 * 12000 * 888 / 1000 = 21312 | 2 * 3800 * 999 / 1000 = 7592 | 121.74 | 43.37 | 39.46 | 14.06 |
| G1MSM 4 | 4 * 12000 * 641 / 1000 = 30768 | 4 * 3800 * 919 / 1000 = 13969 | 102.15 | 46.38 | 31.77 | 14.42 |
| G1MSM 8 | 8 * 12000 * 453 / 1000 = 43488 | 8 * 3800 * 813 / 1000 = 24715 | 81.67 | 46.41 | 28.10 | 15.97 |
| G1MSM 16 | 16 * 12000 * 334 / 1000 = 64128 | 16 * 3800 * 728 / 1000 = 44262 | 67.23 | 46.40 | 22.78 | 15.72 |
| G1MSM 32 | 32 * 12000 * 268 / 1000 = 102912 | 32 * 3800 * 668 / 1000 = 81229 | 59.01 | 46.58 | 20.42 | 16.12 |
| G1MSM 64 | 64 * 12000 * 221 / 1000 = 169728 | 64 * 3800 * 630 / 1000 = 153216 | 51.40 | 46.40 | 18.18 | 16.41 |
| G1MSM 128 | 128 * 12000 * 174 / 1000 = 267264 | 128 * 3800 * 602 / 1000 = 292813 | 42.37 | 46.42 | 15.22 | 16.65 |
| G2MSM 2 | 2 * 45000 * 888 / 1000 = 79920 | 2 * 6100 * 999 / 1000 = 12188 | 274.23 | 41.82 | 85.47 | 13.03 |
| G2MSM 4 | 4 * 45000 * 641 / 1000 = 115380 | 4 * 6100 * 919 / 1000 = 22424 | 168.31 | 32.71 | 69.02 | 13.41 |
| G2MSM 8 | 8 * 45000 * 453 / 1000 = 163080 | 8 * 6100 * 813 / 1000 = 39674 | 177.98 | 42.77 | 55.77 | 13.57 |
| G2MSM 16 | 16 * 45000 * 334 / 1000 = 240480 | 16 * 6100 * 728 / 1000 = 71053 | 150.39 | 44.43 | 48.60 | 14.36 |
| G2MSM 32 | 32 * 45000 * 268 / 1000 = 385920 | 32 * 6100 * 668 / 1000 = 130394 | 136.80 | 46.22 | 44.21 | 14.94 |
| G2MSM 64 | 64 * 45000 * 221 / 1000 = 636480 | 64 * 6100 * 630 / 1000 = 245952 | 119.28 | 46.09 | 39.68 | 15.33 |
| G2MSM 128 | 128 * 45000 * 174 / 1000 = 1002240 | 128 * 6100 * 602 / 1000 = 470042 | 101.59 | 47.64 | 33.94 | 15.92 |
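
The cost columns follow the pattern k * multiplication price * discount / 1000. Below is a minimal sketch that reproduces them, using only the discount values that appear in the table (other sizes are not covered). One assumption to flag: the suggested costs in the table look rounded to the nearest integer, so the sketch rounds to nearest; check the EIP text for the rounding rule it actually specifies.

```python
# Discounts per number of points, copied from the table above
CURRENT_DISCOUNT = {2: 888, 4: 641, 8: 453, 16: 334, 32: 268, 64: 221, 128: 174}
SUGGESTED_DISCOUNT = {2: 999, 4: 919, 8: 813, 16: 728, 32: 668, 64: 630, 128: 602}
MULTIPLIER = 1000

def msm_gas(k, mul_price, discount_table):
    """MSM price for k points, rounded to nearest (assumption matching the table)."""
    return (k * mul_price * discount_table[k] + MULTIPLIER // 2) // MULTIPLIER

for k in sorted(CURRENT_DISCOUNT):
    g1_cur = msm_gas(k, 12000, CURRENT_DISCOUNT)
    g1_new = msm_gas(k, 3800, SUGGESTED_DISCOUNT)
    g2_cur = msm_gas(k, 45000, CURRENT_DISCOUNT)
    g2_new = msm_gas(k, 6100, SUGGESTED_DISCOUNT)
    print(f"k={k:3d}  G1MSM {g1_cur:7d} -> {g1_new:6d}   G2MSM {g2_cur:7d} -> {g2_new:6d}")
```

Dividing each result by k gives the effective price per point, which makes the steepness of the current discount curve easy to see.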

Comparison with EIP-4844 KZG

EIP-4844 introduced a point evaluation precompile that is mostly a double pairing, priced at 50000 gas (see EIP-4844: Shard Blob Transactions).

On my machine that is 1283.318 operations/s, so 64.17Mgas/s.
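
For reference, the conversion from a benchmark result in operations/s to Mgas/s is just a multiplication by the gas price of the operation. A one-function sketch (the function name is mine):

```python
def mgas_per_second(ops_per_second, gas_per_op):
    """Convert a benchmark result in operations/s into Mgas/s."""
    return ops_per_second * gas_per_op / 1e6

# Point evaluation precompile: 1283.318 operations/s at 50000 gas each
print(f"{mgas_per_second(1283.318, 50000):.2f} Mgas/s")
```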

Comparison with EIP-196 and EIP-197

  • BN254 ECADD is at 500
  • BN254 ECMUL is at 40000
  • BN254 PAIRING is at 80000*k + 100000

I can add benchmarks later, but they seem mispriced by a factor of over 10x for multiplication and about 10x for pairings.
