EIP-2537 (BLS12 precompile) discussion thread

mratsim · April 6, 2024, 1:09pm

I’m currently adding EIP-2537 support to Constantine and at the same-time providing metering for the worst-case scenario using the optimized constant-time routines and detailed operation counts.

See benchmarks: EIP-2537 - BLS12-381 precompiles for the EVM by mratsim · Pull Request #368 · mratsim/constantine · GitHub

and detailed operation counts https://github.com/mratsim/constantine/blob/eip2537/metering/eip2537.md#metering (note: the metering slows down execution by ~5x)

However, while subgroup checks are mandatory for pairings, they are not for:

elliptic curve addition
elliptic curve multiplication
multiscalar multiplication (MSM)

For addition, this is not a problem for Jacobian coordinates, however if using Projective coordinates, in particular the complete formulas from Complete addition formulas for prime order elliptic curves there might be an assumption of being in the prime order subgroup. This needs test vectors with a point not in subgroup.

For multiplication, endomorphism acceleration is widely used in BLS12-381 libraries, for a minimum of ~30% speedup. It will give wrong result if the point is not in the correct subgroup.
The spec needs to spell out that implementation should not use endomorphism acceleration.
(Or that endomorphism acceleration is mandatory, and a test should be done to check that all implementations, give the same wrong result.)
This needs test vectors with a point not in subgroup.

For multi-scalar multiplication, we will have the same issue as scalar multiplication. This needs test vectors with points not in subgroup.

To be clear because 33% one way is 50% another, it’s 50us vs 75us per operation on my machine.

As an aside, the EIP should pick either the additive notation or the multiplicative notation but not mix both, i.e. replace the name “multiexponentiation” with the “multi-scalar-multiplication” (MSM).

I can open 2 PRs for what I think is the sensible choice:

Do not use endomorphism acceleration.
Rename to multi-scalar-multiplication.
edit: PR open Update EIP-2537: Rename multiexponentiation to MSM by mratsim · Pull Request #8393 · ethereum/EIPs · GitHub

Tagging @asanso re test vectors.

ethyla · April 7, 2024, 1:45pm

You saved me a lot of wasted time, thanks.
I picked it up again this week and now have a working implementation. It passes the rfc tests against a modified geth.
Though some more cleanup and optimizations are needed.

asanso · April 8, 2024, 9:41am

Thanks a lot @mratsim for the great analysis.
W,r.t the subgroup check I totally agree with you about the GLV optimization.
Now my take would be to enforce subgroup check for all the operations (see PR 8322).
The analysis done by @MariusVanDerWijden shows though that this would not be possible for BLS12_G1ADD and BLS12_G2ADD (too expensive) but feasible for the other operations.
Now it might be a bit “ugly” to force subgroup check for all but BLS12_G1ADD and BLS12_G2ADD but IMHO it’s the best way to go. Thoughts ?

mratsim · April 10, 2024, 9:23am

I’ll split my reply into an analysis and factual part and then my own opinion piece, then open up on something related to EIP-7667 on gas cost.

Analysis

I see 3 different paths:

No change approach. no subgroup check, price like double-and-add and Pippenger today and add test vectors. For MSMs, in both Gnark and Constantine, endomorphism acceleration was not a problem or was even slower due to probably decomposition overhead + memory bandwidth / hardware cache limitations (at least on x86, Apple memory likely behaves differently).
Misuse resistance approach. See @asanso, enforce subgroup check except on addition.
Modular approach or performance approach, provide subgroup_check and clear_cofactor as separate primitives. People can cache subgroup_checks and ensure their cost is only paid once. Issue: they need to learn what a subgroup and cofactor is.

Now, for a more informed decision, we’re okay with the pricing of regular scalar-multiplication as double-and-add, but is endomorphism acceleration important for MSM?

When optimizing for G1, on my machines beyond 2^10 points (1024), endomorphisms were slower and so are deactivated: constantine/constantine/math/elliptic/ec_multi_scalar_mul.nim at 976c8bb215a3f0b21ce3d05f894eb506072a6285 · mratsim/constantine · GitHub

One extra remark is that on G1, endomorphism cut the scalar size from 255 bits to 128 bits. On G2, you cut the scalar size from 255 bit to 64 bit, further benchmarks confirm that similarly, there is a threshold, where endomorphism acceleration doesn’t really help.
With endomorphism:

No endomorphism:

Opinion

1 If not properly tested, the “no change” approach is likely to lead to state split if people start using point on curve but not in subgroup. If properly tested, the only difficulty is if some libraries do not allow selection of scalar multiplication algorithm and use endomorphism by default.

3 is similar to 1 in the happy path. In the unhappy path, will users respect a “Before calling the scalar-mul precompile, input MUST be subgroup-checked or coming from clear_cofactor or mapping_to_curve)”? If not, libraries can’t use the endomorphism accelerated path and we might as well use 1.

2 sounds like the reasonable choice.

Source code: constantine/constantine/math/elliptic/ec_multi_scalar_mul.nim at 976c8bb215a3f0b21ce3d05f894eb506072a6285 · mratsim/constantine · GitHub
Now for pricing the subgroup checks, I’ll use Scott, https://eprint.iacr.org/2021/1130.pdf, but there is a later note from El Housni:

G1: P is in the 𝔾1 subgroup iff ϕ(P) == -u², ϕ is cubic root of unity endomorphism and costs 1 fp_mul, u is the curve parameter and is 64-bit with low hamming weight, hence about ~130 fp_muls for subgroup checks. A 255-bit double-and-add is on average 255+127.5 fp_mul so the ratio subgroup_check/scalar_mul is about 1/3.
G2: P is in the 𝔾2 subgroup iff ψ(P) == u ψ is Frobenius endomorphism and costs 1 fp2_mul, u is the curve parameter and is 64-bit with low hamming weight, hence about ~70 fp2_muls for subgroup checks. A 255-bit double-and-add is on average 255+127.5 fp_mul so the ratio is about 1/5.

El Housni latest paper on subgroup checks:

Gas costs

Better discussed here probably: EIP-7667: Raise gas costs of hash functions - #6 by mratsim

but here is the gist, some operations are very expensive in zkEVMs and zkVMs like hash functions, and the Verge roadmap endgame is having a snarkified EVM (text: Annotated Ethereum Roadmap and https://twitter.com/VitalikButerin/status/1741190491578810445)

So EIP-7667 proposes to raise gas cost of hash functions. But should BN254 and BLS12-381 also reflect that?

shamatar · April 10, 2024, 9:45am

I think decision on whether to enforce or not subgroup checks should be biased on the use modes, namely “is there any use case that doesn’t eventually want to have everything in the main subgroup?”. So far it looks like if we do clear cofactors in “map to curve” implementations (both fp → G1 and fp_2 → G2) then all the cases would be happy to work with points in the subgroup.

So, at this moment I would be 99% for inclusion of subgroup checks into all relevant precompiles (so everything except additions and map-to-curve). Also, it’ll eventually someone from mistake or misuse.

mratsim · April 10, 2024, 1:15pm

The only legitimate use-case I have for no subgroup check is if several consecutive operations are needed and the subgroup check cost can only be paid once.

For example, verifying KZG commitments would incur that: constantine/constantine/commitments/kzg_polynomial_commitments.nim at 976c8bb215a3f0b21ce3d05f894eb506072a6285 · mratsim/constantine · GitHub.

But this is only a matter of gas pricing.

The non-legitimate use-case would be malicious:

double-spend attacks on a on-chain BLS signature protocol like monero: Disclosure of a Major Bug in CryptoNote Based Currencies | Monero - secure, private, untraceable,
or causing a chain split if some clients use endomorphisms and get the wrong result on a point not in the proper subgroup.

mratsim · April 10, 2024, 1:18pm

Some benches from Rollcall discussion

gas costs in ratio of G1 scalarmul

original: eip2537_constantine_bench.csv · GitHub

Reproduction

git clone https://github.com/mratsim/constantine
cd constantine
git checkout eip2537
CC=clang nimble bench_eip2537_subgroup_checks_impact

asanso · April 15, 2024, 1:31pm

One other things that came out chatting with @dankrad is this. `POINT_EVALUATION_PRECOMPILE_GAS in EIP-4844 is about 50000. Now the POINT_EVALUATION_PRECOMPILE is about 2 pairings. But if we do the equivalent of 2 pairings in EIP-2537 is 151000. Isn’t it really inconsistent?

Thoughts @mratsim @shamatar @ralexstokes ?

mratsim · April 18, 2024, 9:19am

In my noisy bench (browser, discord, telegram running), I can confirm your observations.

This is the details of operation cost for BLS12-381 primitives

And this is for EIP-4844:

Pairing is 415us, verify_kzg_proof is 779us.
verify_kzg_proof is indeed a 2-way multi-pairing: constantine/constantine/ethereum_eip4844_kzg.nim at 976c8bb215a3f0b21ce3d05f894eb506072a6285 · mratsim/constantine · GitHub

2*415/779 = 1.065x hence I concur that having a ratio of 3 between both is way to big of a difference.

However, if we focus at multipairing closely in Constantine, a dedicated multi-pairing function grows much slower.

It grows by about 120us per extra pair so ~30%.

Note: every 8 pairs, there is a slight cost bump as my multi-pairing batch size is 8
That said, in my analysis the cost bump is less than 0.1% per pairing

github.com/mratsim/constantine

constantine/math/pairings/miller_accumulators.nim

976c8bb


      
          const MillerAccumMax = 8
          # Max buffer size before triggering a Miller Loop.
          # Assuming pairing costs 100, with 50 for Miller Loop and 50 for Final exponentiation.
          #
          # N unbatched pairings would cost              N*100
          # N maximally batched pairings would cost      N*50 + 50
          # N AccumMax batched pairings would cost       N*50 + N/MillerAccumMax*(Fpᵏ mul) + 50
          #
          # Fpᵏ mul costs 0.7% of a Miller Loop and so is negligeable.
          # By choosing AccumMax = 8, we amortized the cost to below 0.1% per pairing.

I can do a repricing proposal based on 30 Mgas/s on my PC.
If wanted I can also compare to BN254 precompile.

mratsim · April 18, 2024, 9:58am

here is a repricing, using the number above.

Disclaimers:

CPU is Ryzen Pro 7840U, mobile 8-core CPU. The low-power version of their flagship 7840HS. It’s a 15W to 30W CPU (while flagship can go up to 54W): https://www.amd.com/fr/products/apu/amd-ryzen-7-pro-7840u
Turbo was not disabled
Library is Constantine, it has significant optimizations that do not always exist in BLST, Gnark, arkworks, bellman, …
Unless noted, library is constant-time, meaning perf is always worst-case.
For vartime scalarmul operation about half the bits are set (random initialization) while the spec use the worst-case all-bits set.
The raw cryptographic timing are done and do not account for overhead of converting from EVM ABI to internal ABI, checking that a number is lower than the modulus, etc …

Operation + Algorithm	Curve	Throughput (ops/s)	Gas cost (target 30 Mgas/s)	Latency (ns/op)	CPU cycle count (approx)
EC Add G1	EC_Prj[Fp[BLS12_381]]	2732240.437	10.98	366	1206
EC Add vartime G1	EC_Prj[Fp[BLS12_381]]	4975124.378	6.03	201	664
EC Double G1	EC_Prj[Fp[BLS12_381]]	4219409.283	7.11	237	781
EC Add G1	EC_Jac[Fp[BLS12_381]]	2188183.807	13.71	457	1507
EC Add vartime G1	EC_Jac[Fp[BLS12_381]]	5780346.821	5.19	173	572
EC Double G1	EC_Jac[Fp[BLS12_381]]	4566210.046	6.57	219	723
EC Add G2	EC_Prj[Fp2[BLS12_381]]	943396.226	31.80	1060	3494
EC Add vartime G2	EC_Prj[Fp2[BLS12_381]]	1869158.879	16.05	535	1763
EC Double G2	EC_Prj[Fp2[BLS12_381]]	1564945.227	19.17	639	2107
EC Add G2	EC_Jac[Fp2[BLS12_381]]	810372.771	37.02	1234	4066
EC Add vartime G2	EC_Jac[Fp2[BLS12_381]]	2207505.519	13.59	453	1493
EC Double G2	EC_Jac[Fp2[BLS12_381]]	2000000	15.00	500	1648
EC ScalarMul 255-bit G1	EC_Prj[Fp[BLS12_381]]	16451.427	1 823.55	60785	200212
EC ScalarMul vartime 255-bit G1	EC_Prj[Fp[BLS12_381]]	19455.253	1 542.00	51400	169301
EC ScalarMul 255-bit G1	EC_Jac[Fp[BLS12_381]]	16266.511	1 844.28	61476	202489
EC ScalarMul vartime 255-bit G1	EC_Jac[Fp[BLS12_381]]	20193.861	1 485.60	49520	163106
EC ScalarMul 255-bit G2	EC_Prj[Fp2[BLS12_381]]	8268.768	3 628.11	120937	398339
EC ScalarMul vartime 255-bit G2	EC_Prj[Fp2[BLS12_381]]	9304.836	3 224.13	107471	353983
EC ScalarMul 255-bit G2	EC_Jac[Fp2[BLS12_381]]	8717.332	3 441.42	114714	377842
EC ScalarMul vartime 255-bit G2	EC_Jac[Fp2[BLS12_381]]	10393.498	2 886.42	96214	316905
Miller Loop BLS12	BLS12_381	5241.695	5 723.34	190778	628376
Final Exponentiation BLS12	BLS12_381	4379.089	6 850.74	228358	752155
Pairing BLS12	BLS12_381	2408.223	12 457.32	415244	1367710
Hash to G1 (SSWU - Draft #14)	BLS12_381	23078.698	1 299.90	43330	142721
Hash to G2 (SSWU - Draft #14)	BLS12_381	7504.015	3 997.86	133262	438933
Subgroup check	EC_Jac[Fp[BLS12_381]]	30690.851	977.49	32583	107323
Subgroup check	EC_Jac[Fp2[BLS12_381]]	26205.451	1 144.80	38160	125690
Miller Loop BLS12	BLS12_381	5222.423	5 744.46	191482	630693
Final Exponentiation BLS12	BLS12_381	4385.138	6 841.29	228043	751118
Pairing BLS12	BLS12_381	2408.269	12 457.08	415236	1367683
Pairing BLS12 batched: 1	BLS12_381	2408.014	12 458.40	415280	1367824
Pairing BLS12 non-batched: 2	BLS12_381	1200.394	24 991.79	833060	2743882
Pairing BLS12 batched: 2	BLS12_381	1882.654	15 934.95	531165	1749524
Pairing BLS12 non-batched: 3	BLS12_381	801.33	37 437.76	1247925	4110348
Pairing BLS12 batched: 3	BLS12_381	1530.618	19 599.93	653331	2151906
Pairing BLS12 batched: 4	BLS12_381	1301.712	23 046.57	768219	2530312
Pairing BLS12 batched: 5	BLS12_381	1124.613	26 675.84	889195	2928783
Pairing BLS12 batched: 6	BLS12_381	992.763	30 218.69	1007290	3317752
Pairing BLS12 batched: 7	BLS12_381	885.784	33 868.30	1128943	3718451
Pairing BLS12 batched: 8	BLS12_381	804.54	37 288.39	1242946	4093950
verify_kzg_proof (EIP-4844)	BLS12_381 serial	1284.248	23 359.97	778666	2564722

mratsim · April 18, 2024, 10:41am

Now the gas price I suggest assuming we want to allow mobile CPUs from 10 years ago to be competitive.

Old benchmark of Constantine on i5-5257U (mobile Broadwell 2.7GHz without ADCX, ADOX instructions for bigint): Multipairing by mratsim · Pull Request #165 · mratsim/constantine · GitHub

The CPU is about 2.5x slower (single-threaded) than 7840U with Constantine. We assume that all modern BLS12-381 backend are within 20% of Constantine, so multiply ideal gas cost on 7840U by 3 and round up generously. Also we use the constant-time jacobian coordinates as reference as every library but Constantine, Halo2 and Icicle use Jacobian coordinates.

For pairing cost, the formula used is:
<Miller Loop> * k * 0.8 + <Final exponentiation>, the 0.8 is due to acceleration tricks when batching multiple miller loops.

Operation + Algorithm	Curve	Throughput (ops/s)	Gas cost (target 30 Mgas/s)	Current gas	Suggested gas (CPU 2015, 2.5x slower, framework 20% slower, => x3)	Subgroup checks cost	Total gas cost
EC Add G1	Prj G1	2732240.437	10.98
EC Add vartime G1	Prj G1	4975124.378	6.03
EC Double G1	Prj G1	4219409.283	7.11
EC Add G1	Jac G1	2188183.807	13.71	500	40		40
EC Add vartime G1	Jac G1	5780346.821	5.19
EC Double G1	Jac G1	4566210.046	6.57
EC Add G2	Prj G2	943396.226	31.80
EC Add vartime G2	Prj G2	1869158.879	16.05
EC Double G2	Prj G2	1564945.227	19.17
EC Add G2	Jac G2	810372.771	37.02	800	110		110
EC Add vartime G2	Jac G2	2207505.519	13.59
EC Double G2	Jac G2	2000000	15.00
Subgroup check	Jac G1	16451.427	1 823.55		3000		3000
Subgroup check	Jac G2	19455.253	1 542.00		3500		3500
EC ScalarMul 255-bit G1	Prj G1	16266.511	1 844.28
EC ScalarMul vartime 255-bit G1	Prj G1	20193.861	1 485.60
EC ScalarMul 255-bit G1	Jac G1	8268.768	3 628.11	12000	5500	3000	8500
EC ScalarMul vartime 255-bit G1	Jac G1	9304.836	3 224.13
EC ScalarMul 255-bit G2	Prj G2	8717.332	3 441.42
EC ScalarMul vartime 255-bit G2	Prj G2	10393.498	2 886.42
EC ScalarMul 255-bit G2	Jac G2	5241.695	5 723.34	45000	10300	3500	13800
EC ScalarMul vartime 255-bit G2	Jac G2	4379.089	6 850.74
Miller Loop BLS12		2408.223	12 457.32
Final Exponentiation BLS12		23078.698	1 299.90
Pairing BLS12		7504.015	3 997.86
Hash to G1 (SSWU - Draft #14)		30690.851	977.49
Hash to G2 (SSWU - Draft #14)		26205.451	1 144.80
Miller Loop BLS12		5222.423	5 744.46		18000
Final Exponentiation BLS12		4385.138	6 841.29		20500
Pairing BLS12		2408.269	12 457.08	108000	38500	6500	45000
Pairing BLS12 batched: 1		2408.014	12 458.40
Pairing BLS12 non-batched: 2		1200.394	24 991.79
Pairing BLS12 batched: 2		1882.654	15 934.95	151000	49300	13000	62300
Pairing BLS12 non-batched: 3		801.33	37 437.76
Pairing BLS12 batched: 3		1530.618	19 599.93	194000	63700	19500	83200
Pairing BLS12 batched: 4		1301.712	23 046.57	237000	78100	26000	104100
Pairing BLS12 batched: 5		1124.613	26 675.84	280000	92500	32500	125000
Pairing BLS12 batched: 6		992.763	30 218.69	323000	106900	39000	145900
Pairing BLS12 batched: 7		885.784	33 868.30	366000	121300	45500	166800
Pairing BLS12 batched: 8		804.54	37 288.39	409000	135700	52000	187700
verify_kzg_proof (EIP-4844)		1284.248	23 359.97	50000	70000	(priced in)	70000

QEDK · April 21, 2024, 6:01pm

Agree with the premise of this, we should bring the pricing of these in-line with current KZG proof costs.

mratsim · April 27, 2024, 7:25am

I notice that unlike EIP-196, EIP-2537 does not allow truncated / virtually padded with zeros.

Encoding

Field elements and scalars are encoded as 32 byte big-endian numbers. Curve points are encoded as two field elements (x, y), where the point at infinity is encoded as (0, 0).

Tuples of objects are encoded as their concatenation.

For both precompiled contracts, if the input is shorter than expected, it is assumed to be virtually padded with zeros at the end (i.e. compatible with the semantics of the CALLDATALOAD opcode). If the input is longer than expected, surplus bytes at the end are ignored.

The length of the returned data is always as specified (i.e. it is not “unpadded”).

Is this intentional? I wonder if we can make a retroactive EIP to enshrine the same for EIP-196.

mratsim · May 1, 2024, 11:26pm

I finished fully implemented EIP-2537 in Constantine in EIP-2537 - BLS12-381 precompiles for the EVM by mratsim · Pull Request #368 · mratsim/constantine · GitHub and now can provide accurate benchmarks that take into account (de)serialization and other miscellaneous costs.

First on AMD Ryzen 7840U (2023, low power / ultraportable version of their flagship 7840U, TDP 28W down to 15W, 3.3GHz turbo 5.1GHz) with assembly and Clang.

Second on i5-5257U (Macbook Pro 2015, low power dual core Broadwell, TDP 15W down to 7.5W, 2.3GHz turbo 2.9GHz) without assembly (it doesn’t have ADOX/ADCX instructions as well), but we do use the add-with-carry intrinsics. This is a 9 year old x86 CPU.

Finally on a Raspberry Pi 4 (2019, TDP 10W, 1.5GHz) without assembly, and without any add-with-carry intrinsics. No add-with-carry makes bigint operations up to 3x more costly has add-with-carry need to be replaced by main addition, comparison/carry check, adding the carry.

See also compiler codegen Compiler Explorer and Tracking compiler inefficiencies · Issue #357 · mratsim/constantine · GitHub

Price suggestions

This remark is confirmed with end-to-end test.
Besides MSMs all operations can achieve 60Mgas/s on a 9 year old very low-powered CPU (15W!), without assembly. and even 230Mgas/s for mapping to G2. This strongly suggests that those operations are overcosted.

I think that 9 year old, 15W CPU is significantly outdated and achieving 20Mgas/s without assembly on it should be fine.

Hence I suggest the following price changes:

Primitive	Current price	Suggested price	Ryzen 7840U 28W 3.3GHz turbo 5.1GHz - clang+ASM - current	Ryzen 7840U 28W 3.3GHz turbo 5.1GHz - clang+ASM - repriced	Intel i5-5257U 15W 2.3GHz turbo 2.8GHz - clang+noASM+intrinsics - current	Intel i5-5257U 15W 2.3GHz turbo 2.8GHz - clang+noASM+intrinsics - repriced
BLS12_G1ADD	500	125	185.60 MGas/s	46.4	79.24 MGas/s	19.81
BLS12_G1MUL	12000	3800	144.80 MGas/s	45.9	52.94 MGas/s	16.8
BLS12_G2ADD	800	170	218.28 MGas/s	46.325	86.30 MGas/s	18.33
BLS12_G2MUL	45000	6100	346.23 MGas/s	46.9	113.23 MGas/s	15.35
BLS12_MAP_FP_TO_G1	5500	1600	161.09 MGas/s	46.86	59.91 MGas/s	17.43
BLS12_MAP_FP2_TO_G2	75000	5000	702.17 MGas/s	46.8	231.85 MGas/s	15.46

The rationale for those changes is targeting a CPU 2~3 years older than my own for 30Mgas/s, CPUs are progressing on average by 10~15%/year, something that is 66% perf is reasonable (so around 45Mgas/s). But that said that CPU is still a low-power CPU, so instead of targeting 45Mgas on it we can target 40Mgas or even 35Mgas.

In any case it is clear that G2 functions are severely overcosted, especially mapping to G2.

Pairings

For pairings, the evolution of cost for multipairing is stable but the price is 5x the actual cost.

Hence I suggest moving from 43000*k + 65000 to 8600*k + 13000

MSMs

The discounts are too steep and need adjustment.

Primitive	Current cost	Suggested cost	Ryzen 7840U 28W 3.3GHz turbo 5.1GHz - clang+ASM - current	Ryzen 7840U 28W 3.3GHz turbo 5.1GHz - clang+ASM - repriced	Intel i5-5257U 15W 2.3GHz turbo 2.8GHz - clang+noASM+intrinsics - current	Intel i5-5257U 15W 2.3GHz turbo 2.8GHz - clang+noASM+intrinsics - repriced
G1MSM 2	2 * 12000 * 888 / 1000 = 21312	2 * 3800 * 999 / 1000 = 7592	121.74 Mgas/s	43.37 Mgas/s	39.46 Mgas/s	14.06 Mgas/s
G1MSM 4	4 * 12000 * 641 / 1000 = 30768	4 * 3800 * 919 / 1000 = 13969	102.15 Mgas/s	46.38 Mgas/s	31.77 Mgas/s	14.42 Mgas/s
G1MSM 8	8 * 12000 * 453 / 1000 = 43488	8 * 3800 * 813 / 1000 = 24715	81.67 Mgas/s	46.41 Mgas/s	28.10 Mgas/s	15.97 Mgas/s
G1MSM 16	16 * 12000 * 334 / 1000 = 64128	16 * 3800 * 728 / 1000 = 44262	67.23 Mgas/s	46.40 Mgas/s	22.78 Mgas/s	15.72 Mgas/s
G1MSM 32	32 * 12000 * 268 / 1000 = 102912	32 * 3800 * 668 / 1000 = 81229	59.01 Mgas/s	46.58 Mgas/s	20.42 Mgas/s	16.12 Mgas/s
G1MSM 64	64 * 12000 * 221 / 1000 = 169728	64 * 3800 * 630 / 1000 = 153216	51.40 Mgas/s	46.40 Mgas/s	18.18 Mgas/s	16.41 Mgas/s
G1MSM 128	128 * 12000 * 174 / 1000 = 267264	128 * 3800 * 602 / 1000 = 292813	42.37 Mgas/s	46.42 Mgas/s	15.22 Mgas/s	16.65 Mgas/s
G2MSM 2	2 * 45000 * 888 / 1000 = 79920	2 * 6100 * 999 / 1000 = 12188	274.23 Mgas/s	41.82 Mgas/s	85.47 Mgas/s	13.03 Mgas/s
G2MSM 4	4 * 45000 * 641 / 1000 = 115380	4 * 6100 * 919 / 1000 = 22424	168.31 Mgas/s	32.71 Mgas/s	69.02 Mgas/s	13.41 Mgas/s
G2MSM 8	8 * 45000 * 453 / 1000 = 163080	8 * 6100 * 813 / 1000 = 39674	177.98 Mgas/s	42.77 Mgas/s	55.77 Mgas/s	13.57 Mgas/s
G2MSM 16	16 * 45000 * 334 / 1000 = 240480	16 * 6100 * 728 / 1000 = 71053	150.39 Mgas/s	44.43 Mgas/s	48.60 Mgas/s	14.36 Mgas/s
G2MSM 32	32 * 45000 * 268 / 1000 = 385920	32 * 6100 * 668 / 1000 = 130394	136.80 Mgas/s	46.22 Mgas/s	44.21 Mgas/s	14.94 Mgas/s
G2MSM 64	64 * 45000 * 221 / 1000 = 636480	64 * 6100 * 630 / 1000 = 245952	119.28 Mgas/s	46.09 Mgas/s	39.68 Mgas/s	15.33 Mgas/s
G2MSM 128	128 * 45000 * 174 / 1000 = 1002240	128 * 6100 * 602 / 1000 = 470042	101.59 Mgas/s	47.64 Mgas/s	33.94 Mgas/s	15.92 Mgas/s

Comparison with EIP-4844 KZG

EIP-4844 introduced a point evaluation precompile that is mostly a double pairing at 50000 gas: EIP-4844: Shard Blob Transactions

On my machine that is 1283.318 operations/s so 64.17Mgas/s

Comparison with EIP-196 and EIP-197

BN254 ECADD is at 500
BN254 ECMUL is at 40000
BN254 PAIRING is at 80000*k + 100000

I can introduce benchmarks later but they seem mispriced by a factor over 10x for multiplication and about 10x for pairings.

mratsim · May 5, 2024, 5:33pm

Bench update with SHA256 and BN254 precompiles

poojaranjan · July 19, 2024, 2:51pm

EIP-2537: Precompile for BLS12-381 curve operations with @ralexstokes

jwasinger · August 16, 2024, 8:08pm

~~I’m sure this has been covered previously, but I can’t find it:~~

~~Why is the ABI encoding for Fp elements aligned along 64 bytes instead of 48 bytes?~~

it’s discussed in the EIP text.

chfast · September 11, 2024, 12:40pm

Is there a mathematical formula for discount?

chfast · October 3, 2024, 8:08pm

BLS precompile `g1msm` gas cost analysis

We’ve implemented EIP-2537: Precompile for BLS12-381 curve operations in evmone using blst v0.3.13 library. Then, we’ve benchmark multi-scalar multiplication g1msm in this implementation and compared timings with the specified gas schedule.

Benchmarks have been done on 2 different CPU architectures:

x86-64 Intel Ultra 7 155U Performance Core 4.0GHz
arm64 Apple M3 Pro

Timings have been normalized to single point multiplication (precompile g1mul). I.e. single naive multiplication is 1.

We also benchmark variant of g1msm with subgroup checks disabled (“nocheck”).

The chart also includes the line “gas” of by-the-spec gas cost. It is “discounted” up to 128, then linear with slope 174/1000.

Finally, we’ve added the “check only” line which is difference between default and “nocheck” series.

Conclusions

The real cost of the precompile is close to linear with slope 100/256 and is significantly higher than the estimated by the gas cost. While, the cost of the precompile without subgroup checks is sub-linear and “relatively” matches the gas cost.

In order to approximate the cost of the subgroup check component of the operation, we analyze the difference of time taken by the “check” and “nocheck” variant. It is visible from the chart that slope of the data is constant for various $n$ (contrary to that of the “check” and “nocheck” slopes which gradually decrease). So we have a sub-linear (piecewise linear) component of the MS itself and a linear component of the subgroup check, which aligns with the theoretical expectations

In other words, the g1msm precompile is compound of linear and sub-linear operations, therefore it total complexity is linear.

The gas schedule, specifically the discount table, was proposed in the initial commit BLS12-381 curve operations from 2020 and hadn’t been updated since.

The subgroup check was added two years later in 2022 by the Update EIP-2537: Rephrased subgroup check part.

We should get under consideration once again if the subgroup check is needed for point multiplication. The algorithms (mul and msm) works well on every point from the curve. This check seems to be added only because of not very clear security consideration. Moreover, making this check for every call is unnecessary when the programmer is sure that the input is in proper subgroup. I.e it’s a result of previous operation which does not change the subgroup.

By @chfast @rodiazet @pdobacz

shamatar · October 10, 2024, 5:40pm

The original gas table implied/required that MSM will be implemented via peppinger algorithm, that has roughly N/logN assymptotics, and the original table was derived from empirical benchmarks for such case, that included subgroup checks. I will also note that subgroup checks were implemented with trivial multiplication by group order, while any more modern sane implementation should use fast subgroup checks