EIP-2537 (BLS12 precompile) discussion thread

mratsim · May 1, 2024, 11:26pm

I finished fully implemented EIP-2537 in Constantine in EIP-2537 - BLS12-381 precompiles for the EVM by mratsim · Pull Request #368 · mratsim/constantine · GitHub and now can provide accurate benchmarks that take into account (de)serialization and other miscellaneous costs.

First on AMD Ryzen 7840U (2023, low power / ultraportable version of their flagship 7840U, TDP 28W down to 15W, 3.3GHz turbo 5.1GHz) with assembly and Clang.

Second on i5-5257U (Macbook Pro 2015, low power dual core Broadwell, TDP 15W down to 7.5W, 2.3GHz turbo 2.9GHz) without assembly (it doesn’t have ADOX/ADCX instructions as well), but we do use the add-with-carry intrinsics. This is a 9 year old x86 CPU.

Finally on a Raspberry Pi 4 (2019, TDP 10W, 1.5GHz) without assembly, and without any add-with-carry intrinsics. No add-with-carry makes bigint operations up to 3x more costly has add-with-carry need to be replaced by main addition, comparison/carry check, adding the carry.

See also compiler codegen Compiler Explorer and Tracking compiler inefficiencies · Issue #357 · mratsim/constantine · GitHub

Price suggestions

This remark is confirmed with end-to-end test.
Besides MSMs all operations can achieve 60Mgas/s on a 9 year old very low-powered CPU (15W!), without assembly. and even 230Mgas/s for mapping to G2. This strongly suggests that those operations are overcosted.

I think that 9 year old, 15W CPU is significantly outdated and achieving 20Mgas/s without assembly on it should be fine.

Hence I suggest the following price changes:

Primitive	Current price	Suggested price	Ryzen 7840U 28W 3.3GHz turbo 5.1GHz - clang+ASM - current	Ryzen 7840U 28W 3.3GHz turbo 5.1GHz - clang+ASM - repriced	Intel i5-5257U 15W 2.3GHz turbo 2.8GHz - clang+noASM+intrinsics - current	Intel i5-5257U 15W 2.3GHz turbo 2.8GHz - clang+noASM+intrinsics - repriced
BLS12_G1ADD	500	125	185.60 MGas/s	46.4	79.24 MGas/s	19.81
BLS12_G1MUL	12000	3800	144.80 MGas/s	45.9	52.94 MGas/s	16.8
BLS12_G2ADD	800	170	218.28 MGas/s	46.325	86.30 MGas/s	18.33
BLS12_G2MUL	45000	6100	346.23 MGas/s	46.9	113.23 MGas/s	15.35
BLS12_MAP_FP_TO_G1	5500	1600	161.09 MGas/s	46.86	59.91 MGas/s	17.43
BLS12_MAP_FP2_TO_G2	75000	5000	702.17 MGas/s	46.8	231.85 MGas/s	15.46

The rationale for those changes is targeting a CPU 2~3 years older than my own for 30Mgas/s, CPUs are progressing on average by 10~15%/year, something that is 66% perf is reasonable (so around 45Mgas/s). But that said that CPU is still a low-power CPU, so instead of targeting 45Mgas on it we can target 40Mgas or even 35Mgas.

In any case it is clear that G2 functions are severely overcosted, especially mapping to G2.

Pairings

For pairings, the evolution of cost for multipairing is stable but the price is 5x the actual cost.

Hence I suggest moving from 43000*k + 65000 to 8600*k + 13000

MSMs

The discounts are too steep and need adjustment.

Primitive	Current cost	Suggested cost	Ryzen 7840U 28W 3.3GHz turbo 5.1GHz - clang+ASM - current	Ryzen 7840U 28W 3.3GHz turbo 5.1GHz - clang+ASM - repriced	Intel i5-5257U 15W 2.3GHz turbo 2.8GHz - clang+noASM+intrinsics - current	Intel i5-5257U 15W 2.3GHz turbo 2.8GHz - clang+noASM+intrinsics - repriced
G1MSM 2	2 * 12000 * 888 / 1000 = 21312	2 * 3800 * 999 / 1000 = 7592	121.74 Mgas/s	43.37 Mgas/s	39.46 Mgas/s	14.06 Mgas/s
G1MSM 4	4 * 12000 * 641 / 1000 = 30768	4 * 3800 * 919 / 1000 = 13969	102.15 Mgas/s	46.38 Mgas/s	31.77 Mgas/s	14.42 Mgas/s
G1MSM 8	8 * 12000 * 453 / 1000 = 43488	8 * 3800 * 813 / 1000 = 24715	81.67 Mgas/s	46.41 Mgas/s	28.10 Mgas/s	15.97 Mgas/s
G1MSM 16	16 * 12000 * 334 / 1000 = 64128	16 * 3800 * 728 / 1000 = 44262	67.23 Mgas/s	46.40 Mgas/s	22.78 Mgas/s	15.72 Mgas/s
G1MSM 32	32 * 12000 * 268 / 1000 = 102912	32 * 3800 * 668 / 1000 = 81229	59.01 Mgas/s	46.58 Mgas/s	20.42 Mgas/s	16.12 Mgas/s
G1MSM 64	64 * 12000 * 221 / 1000 = 169728	64 * 3800 * 630 / 1000 = 153216	51.40 Mgas/s	46.40 Mgas/s	18.18 Mgas/s	16.41 Mgas/s
G1MSM 128	128 * 12000 * 174 / 1000 = 267264	128 * 3800 * 602 / 1000 = 292813	42.37 Mgas/s	46.42 Mgas/s	15.22 Mgas/s	16.65 Mgas/s
G2MSM 2	2 * 45000 * 888 / 1000 = 79920	2 * 6100 * 999 / 1000 = 12188	274.23 Mgas/s	41.82 Mgas/s	85.47 Mgas/s	13.03 Mgas/s
G2MSM 4	4 * 45000 * 641 / 1000 = 115380	4 * 6100 * 919 / 1000 = 22424	168.31 Mgas/s	32.71 Mgas/s	69.02 Mgas/s	13.41 Mgas/s
G2MSM 8	8 * 45000 * 453 / 1000 = 163080	8 * 6100 * 813 / 1000 = 39674	177.98 Mgas/s	42.77 Mgas/s	55.77 Mgas/s	13.57 Mgas/s
G2MSM 16	16 * 45000 * 334 / 1000 = 240480	16 * 6100 * 728 / 1000 = 71053	150.39 Mgas/s	44.43 Mgas/s	48.60 Mgas/s	14.36 Mgas/s
G2MSM 32	32 * 45000 * 268 / 1000 = 385920	32 * 6100 * 668 / 1000 = 130394	136.80 Mgas/s	46.22 Mgas/s	44.21 Mgas/s	14.94 Mgas/s
G2MSM 64	64 * 45000 * 221 / 1000 = 636480	64 * 6100 * 630 / 1000 = 245952	119.28 Mgas/s	46.09 Mgas/s	39.68 Mgas/s	15.33 Mgas/s
G2MSM 128	128 * 45000 * 174 / 1000 = 1002240	128 * 6100 * 602 / 1000 = 470042	101.59 Mgas/s	47.64 Mgas/s	33.94 Mgas/s	15.92 Mgas/s

Comparison with EIP-4844 KZG

EIP-4844 introduced a point evaluation precompile that is mostly a double pairing at 50000 gas: EIP-4844: Shard Blob Transactions

On my machine that is 1283.318 operations/s so 64.17Mgas/s

Comparison with EIP-196 and EIP-197

BN254 ECADD is at 500
BN254 ECMUL is at 40000
BN254 PAIRING is at 80000*k + 100000

I can introduce benchmarks later but they seem mispriced by a factor over 10x for multiplication and about 10x for pairings.