EIP-2537 (BLS12 precompile) discussion thread

I’m currently adding EIP-2537 support to Constantine and, at the same time, providing metering for the worst-case scenario using the optimized constant-time routines, with detailed operation counts.

See benchmarks: EIP-2537 - BLS12-381 precompiles for the EVM by mratsim · Pull Request #368 · mratsim/constantine · GitHub
and detailed operation counts constantine/metering/eip2537.md at eip2537 · mratsim/constantine · GitHub (note: the metering slows down execution by ~5x)

However, while subgroup checks are mandatory for pairings, they are not for:

  1. elliptic curve addition
  2. elliptic curve multiplication
  3. multiscalar multiplication (MSM)

For addition, this is not a problem with Jacobian coordinates. However, with projective coordinates, in particular the complete formulas from “Complete addition formulas for prime order elliptic curves”, there might be an assumption that inputs are in the prime-order subgroup. This needs test vectors with a point not in the subgroup.

For multiplication, endomorphism acceleration is widely used in BLS12-381 libraries, for a minimum of ~30% speedup. It will give a wrong result if the point is not in the correct subgroup.
The spec needs to spell out that implementations should not use endomorphism acceleration.
(Or that endomorphism acceleration is mandatory, and a test should check that all implementations give the same wrong result.)
This needs test vectors with a point not in the subgroup.

For multi-scalar multiplication, we will have the same issue as with scalar multiplication. This needs test vectors with points not in the subgroup.

To be clear, because 33% one way is 50% the other way, this is 50 µs vs 75 µs per operation on my machine.
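To make the failure mode concrete, here is a toy sketch of why a GLV-style split silently fails off-subgroup. This uses plain modular arithmetic on an additive group standing in for the curve (all values are illustrative, and the naive `k // lam` split stands in for a real lattice decomposition; this is not Constantine’s code or BLS12-381):

```python
# Toy additive group Z_n with n = h * r (cofactor h, prime subgroup order r),
# standing in for the curve group. "Points" are residues mod n; the subgroup
# of order r is the set of multiples of h.
h, r = 4, 1000003             # hypothetical cofactor and prime subgroup order
n = h * r

# A GLV-style endomorphism phi acts as scalar multiplication by lam on the
# subgroup. We pick lam_full ≡ lam (mod r) but different mod h, like a curve
# endomorphism that only "looks like" multiplication by lam inside the subgroup.
lam = 7                       # hypothetical eigenvalue on the subgroup
lam_full = next(c for c in range(lam, n, r) if c % h != lam % h)

def phi(P):
    return lam_full * P % n

def plain_mul(k, P):          # reference: plain double-and-add equivalent
    return k * P % n

def glv_mul(k, P):
    # Crude split k = k1 + k2*lam, then combine with the endomorphism.
    k2 = k // lam
    k1 = k - k2 * lam
    return (k1 * P + k2 * phi(P)) % n

k = 123459
P_in = 5 * h % n              # in the order-r subgroup
P_out = P_in + 1              # in the group but outside the subgroup

assert glv_mul(k, P_in) == plain_mul(k, P_in)    # agree in-subgroup
assert glv_mul(k, P_out) != plain_mul(k, P_out)  # silently wrong outside
```

The point is that `phi(P) == lam * P` only holds on the prime-order subgroup, so any implementation that substitutes `phi` for scalar multiplication returns a wrong (but deterministic-per-implementation) answer for off-subgroup inputs.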

As an aside, the EIP should pick either the additive or the multiplicative notation but not mix both, i.e. replace the name “multiexponentiation” with “multi-scalar multiplication” (MSM).

I can open 2 PRs for what I think is the sensible choice:

Tagging @asanso re test vectors.


You saved me a lot of wasted time, thanks.
I picked it up again this week and now have a working implementation. It passes the RFC tests against a modified geth.
Though some more cleanup and optimizations are needed.

Thanks a lot @mratsim for the great analysis.
W.r.t. the subgroup check, I totally agree with you about the GLV optimization.
Now my take would be to enforce subgroup check for all the operations (see PR 8322).
The analysis done by @MariusVanDerWijden shows, though, that this would not be possible for BLS12_G1ADD and BLS12_G2ADD (too expensive) but is feasible for the other operations.
Now, it might be a bit “ugly” to force subgroup checks for all but BLS12_G1ADD and BLS12_G2ADD, but IMHO it’s the best way to go. Thoughts?

I’ll split my reply into an analysis/factual part and my own opinion, then open up on something related to EIP-7667 on gas costs.

Analysis

I see 3 different paths:

  1. No-change approach: no subgroup check, price like double-and-add and Pippenger as today, and add test vectors. For MSMs, in both Gnark and Constantine, endomorphism acceleration was not a problem, or was even slower, probably due to decomposition overhead plus memory bandwidth / hardware cache limitations (at least on x86; Apple memory likely behaves differently).
  2. Misuse-resistance approach: per @asanso, enforce subgroup checks except on addition.
  3. Modular (performance) approach: provide subgroup_check and clear_cofactor as separate primitives. People can cache subgroup checks and ensure their cost is paid only once. Issue: they need to learn what a subgroup and a cofactor are.
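Approach 3 amounts to something like the following caching pattern on the caller side. This is a toy sketch: `subgroup_check` and `scalar_mul_unchecked` are hypothetical stand-ins, not a real precompile API, and the counter just makes the “paid once” property visible:

```python
# Sketch of approach 3: subgroup_check exposed as a separate primitive, with
# results cached so its cost is paid once per distinct point, not per call.
check_count = 0

def subgroup_check(point: bytes) -> bool:
    global check_count
    check_count += 1            # stand-in for the expensive subgroup check
    return True                 # toy: accept everything

def scalar_mul_unchecked(point: bytes, scalar: int):
    return (point, scalar)      # stand-in for the fast endomorphism path

_checked: set = set()           # cache of points known to be in the subgroup

def scalar_mul(point: bytes, scalar: int):
    if point not in _checked:
        if not subgroup_check(point):
            raise ValueError("point not in the prime-order subgroup")
        _checked.add(point)
    # Safe to use the endomorphism-accelerated path now.
    return scalar_mul_unchecked(point, scalar)

P = b"\x01" * 96                # dummy serialized point
scalar_mul(P, 3)
scalar_mul(P, 5)                # second call reuses the cached check
assert check_count == 1
```

The design trade-off is exactly the one named above: the caller must understand what a subgroup check is and when the cache entry is valid.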

Now, for a more informed decision: we’re okay with pricing regular scalar multiplication as double-and-add, but is endomorphism acceleration important for MSM?

When optimizing for G1, on my machines beyond 2^10 points (1024), endomorphisms were slower and so are deactivated: constantine/constantine/math/elliptic/ec_multi_scalar_mul.nim at 976c8bb215a3f0b21ce3d05f894eb506072a6285 · mratsim/constantine · GitHub

One extra remark: on G1, endomorphisms cut the scalar size from 255 bits to 128 bits. On G2, they cut it from 255 bits to 64 bits. Further benchmarks confirm that, similarly, there is a threshold beyond which endomorphism acceleration doesn’t really help.
With endomorphism: [benchmark chart]

No endomorphism: [benchmark chart]

Opinion

1. If not properly tested, the “no change” approach is likely to lead to a state split if people start using points on the curve but not in the subgroup. If properly tested, the only difficulty is if some libraries do not allow selecting the scalar multiplication algorithm and use endomorphisms by default.

3. is similar to 1 in the happy path. In the unhappy path, will users respect a “before calling the scalar-mul precompile, input MUST be subgroup-checked or come from clear_cofactor or map_to_curve” rule? If not, libraries can’t use the endomorphism-accelerated path and we might as well use 1.

2 sounds like the reasonable choice.

Source code: constantine/constantine/math/elliptic/ec_multi_scalar_mul.nim at 976c8bb215a3f0b21ce3d05f894eb506072a6285 · mratsim/constantine · GitHub
Now, for pricing the subgroup checks, I’ll use Scott, https://eprint.iacr.org/2021/1130.pdf, though there is a later note from El Housni:

  • G1: P is in the 𝔾1 subgroup iff ϕ(P) == -u²·P, where ϕ is the cubic-root-of-unity endomorphism and costs 1 fp_mul, and u is the curve parameter, 64-bit with low Hamming weight; hence about ~130 fp_muls for the subgroup check. A 255-bit double-and-add is on average 255+127.5 fp_muls, so the ratio subgroup_check/scalar_mul is about 1/3.
  • G2: P is in the 𝔾2 subgroup iff ψ(P) == u·P, where ψ is the Frobenius endomorphism and costs 1 fp2_mul, and u is the curve parameter, 64-bit with low Hamming weight; hence about ~70 fp2_muls for the subgroup check. A 255-bit double-and-add is on average 255+127.5 fp2_muls, so the ratio is about 1/5.
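A quick back-of-the-envelope check of those ratios, using the same unit-cost estimates as the bullet points above (these are the thread’s rough operation counts, not measurements):

```python
# Rough operation counts: unit = one base-field mul for G1, one Fp2 mul for G2.
double_and_add = 255 + 127.5   # average cost of a 255-bit double-and-add

g1_check = 130                 # ~130 fp_muls for the phi-based G1 check
g2_check = 70                  # ~70 fp2_muls for the psi-based G2 check

assert round(g1_check / double_and_add, 2) == 0.34   # close to 1/3
assert round(g2_check / double_and_add, 2) == 0.18   # close to 1/5
```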

El Housni’s latest paper on subgroup checks:

Gas costs

Better discussed here probably: EIP-7667: Raise gas costs of hash functions - #6 by mratsim

but here is the gist: some operations, like hash functions, are very expensive in zkEVMs and zkVMs, and the Verge roadmap endgame is a snarkified EVM (see: Annotated Ethereum Roadmap and https://twitter.com/VitalikButerin/status/1741190491578810445).

So EIP-7667 proposes to raise gas cost of hash functions. But should BN254 and BLS12-381 also reflect that?

I think the decision on whether to enforce subgroup checks should be based on the usage modes, namely: “is there any use case that doesn’t eventually want everything in the main subgroup?”. So far it looks like, if we clear cofactors in “map to curve” implementations (both fp → G1 and fp2 → G2), then all the use cases would be happy to work with points in the subgroup.

So, at this moment I would be 99% for inclusion of subgroup checks into all relevant precompiles (everything except additions and map-to-curve). It’ll also eventually save someone from a mistake or misuse.


The only legitimate use case I have for no subgroup check is when several consecutive operations are needed and the subgroup check cost should be paid only once.

For example, verifying KZG commitments would incur that: constantine/constantine/commitments/kzg_polynomial_commitments.nim at 976c8bb215a3f0b21ce3d05f894eb506072a6285 · mratsim/constantine · GitHub.

But this is only a matter of gas pricing.

The non-legitimate use-case would be malicious:

Some benches from Rollcall discussion


Gas costs as a ratio of G1 scalar multiplication.

original: eip2537_constantine_bench.csv · GitHub

Reproduction

```shell
git clone https://github.com/mratsim/constantine
cd constantine
git checkout eip2537
CC=clang nimble bench_eip2537_subgroup_checks_impact
```

One other thing that came up chatting with @dankrad is this: `POINT_EVALUATION_PRECOMPILE_GAS` in EIP-4844 is about 50000, and the POINT_EVALUATION_PRECOMPILE is about 2 pairings. But the equivalent of 2 pairings in EIP-2537 costs 151000. Isn’t that really inconsistent?

Thoughts @mratsim @shamatar @ralexstokes ?

In my noisy bench (browser, discord, telegram running), I can confirm your observations.

These are the details of operation costs for BLS12-381 primitives:

And this is for EIP-4844:

Pairing is 415us, verify_kzg_proof is 779us.
verify_kzg_proof is indeed a 2-way multi-pairing: constantine/constantine/ethereum_eip4844_kzg.nim at 976c8bb215a3f0b21ce3d05f894eb506072a6285 · mratsim/constantine · GitHub

2*415/779 = 1.065x, hence I concur that a ratio of 3 between the two is way too big a difference.

However, if we look closely at multipairing in Constantine, a dedicated multi-pairing function grows much more slowly.

It grows by about 120 µs per extra pair, i.e. ~30%.

Note: every 8 pairs there is a slight cost bump, as my multi-pairing batch size is 8.
That said, in my analysis the cost bump is less than 0.1% per pairing.

I can do a repricing proposal based on 30 Mgas/s on my PC.
If wanted I can also compare to BN254 precompile.


Here is a repricing, using the numbers above.

Disclaimers:

  • CPU is Ryzen Pro 7840U, mobile 8-core CPU. The low-power version of their flagship 7840HS. It’s a 15W to 30W CPU (while flagship can go up to 54W): https://www.amd.com/fr/products/apu/amd-ryzen-7-pro-7840u
  • Turbo was not disabled
  • Library is Constantine, it has significant optimizations that do not always exist in BLST, Gnark, arkworks, bellman, …
  • Unless noted, the library is constant-time, meaning perf is always worst-case.
  • For the vartime scalarmul operation, about half the bits are set (random initialization), while the spec uses the worst case of all bits set.
  • The raw cryptographic timings do not account for the overhead of converting from the EVM ABI to the internal ABI, checking that a number is lower than the modulus, etc.
| Operation + Algorithm | Curve | Throughput (ops/s) | Gas cost (target 30 Mgas/s) | Latency (ns/op) | CPU cycle count (approx) |
|---|---|---|---|---|---|
| EC Add G1 | EC_Prj[Fp[BLS12_381]] | 2732240.437 | 10.98 | 366 | 1206 |
| EC Add vartime G1 | EC_Prj[Fp[BLS12_381]] | 4975124.378 | 6.03 | 201 | 664 |
| EC Double G1 | EC_Prj[Fp[BLS12_381]] | 4219409.283 | 7.11 | 237 | 781 |
| EC Add G1 | EC_Jac[Fp[BLS12_381]] | 2188183.807 | 13.71 | 457 | 1507 |
| EC Add vartime G1 | EC_Jac[Fp[BLS12_381]] | 5780346.821 | 5.19 | 173 | 572 |
| EC Double G1 | EC_Jac[Fp[BLS12_381]] | 4566210.046 | 6.57 | 219 | 723 |
| EC Add G2 | EC_Prj[Fp2[BLS12_381]] | 943396.226 | 31.80 | 1060 | 3494 |
| EC Add vartime G2 | EC_Prj[Fp2[BLS12_381]] | 1869158.879 | 16.05 | 535 | 1763 |
| EC Double G2 | EC_Prj[Fp2[BLS12_381]] | 1564945.227 | 19.17 | 639 | 2107 |
| EC Add G2 | EC_Jac[Fp2[BLS12_381]] | 810372.771 | 37.02 | 1234 | 4066 |
| EC Add vartime G2 | EC_Jac[Fp2[BLS12_381]] | 2207505.519 | 13.59 | 453 | 1493 |
| EC Double G2 | EC_Jac[Fp2[BLS12_381]] | 2000000 | 15.00 | 500 | 1648 |
| EC ScalarMul 255-bit G1 | EC_Prj[Fp[BLS12_381]] | 16451.427 | 1 823.55 | 60785 | 200212 |
| EC ScalarMul vartime 255-bit G1 | EC_Prj[Fp[BLS12_381]] | 19455.253 | 1 542.00 | 51400 | 169301 |
| EC ScalarMul 255-bit G1 | EC_Jac[Fp[BLS12_381]] | 16266.511 | 1 844.28 | 61476 | 202489 |
| EC ScalarMul vartime 255-bit G1 | EC_Jac[Fp[BLS12_381]] | 20193.861 | 1 485.60 | 49520 | 163106 |
| EC ScalarMul 255-bit G2 | EC_Prj[Fp2[BLS12_381]] | 8268.768 | 3 628.11 | 120937 | 398339 |
| EC ScalarMul vartime 255-bit G2 | EC_Prj[Fp2[BLS12_381]] | 9304.836 | 3 224.13 | 107471 | 353983 |
| EC ScalarMul 255-bit G2 | EC_Jac[Fp2[BLS12_381]] | 8717.332 | 3 441.42 | 114714 | 377842 |
| EC ScalarMul vartime 255-bit G2 | EC_Jac[Fp2[BLS12_381]] | 10393.498 | 2 886.42 | 96214 | 316905 |
| Miller Loop BLS12 | BLS12_381 | 5241.695 | 5 723.34 | 190778 | 628376 |
| Final Exponentiation BLS12 | BLS12_381 | 4379.089 | 6 850.74 | 228358 | 752155 |
| Pairing BLS12 | BLS12_381 | 2408.223 | 12 457.32 | 415244 | 1367710 |
| Hash to G1 (SSWU - Draft #14) | BLS12_381 | 23078.698 | 1 299.90 | 43330 | 142721 |
| Hash to G2 (SSWU - Draft #14) | BLS12_381 | 7504.015 | 3 997.86 | 133262 | 438933 |
| Subgroup check | EC_Jac[Fp[BLS12_381]] | 30690.851 | 977.49 | 32583 | 107323 |
| Subgroup check | EC_Jac[Fp2[BLS12_381]] | 26205.451 | 1 144.80 | 38160 | 125690 |
| Miller Loop BLS12 | BLS12_381 | 5222.423 | 5 744.46 | 191482 | 630693 |
| Final Exponentiation BLS12 | BLS12_381 | 4385.138 | 6 841.29 | 228043 | 751118 |
| Pairing BLS12 | BLS12_381 | 2408.269 | 12 457.08 | 415236 | 1367683 |
| Pairing BLS12 batched: 1 | BLS12_381 | 2408.014 | 12 458.40 | 415280 | 1367824 |
| Pairing BLS12 non-batched: 2 | BLS12_381 | 1200.394 | 24 991.79 | 833060 | 2743882 |
| Pairing BLS12 batched: 2 | BLS12_381 | 1882.654 | 15 934.95 | 531165 | 1749524 |
| Pairing BLS12 non-batched: 3 | BLS12_381 | 801.33 | 37 437.76 | 1247925 | 4110348 |
| Pairing BLS12 batched: 3 | BLS12_381 | 1530.618 | 19 599.93 | 653331 | 2151906 |
| Pairing BLS12 batched: 4 | BLS12_381 | 1301.712 | 23 046.57 | 768219 | 2530312 |
| Pairing BLS12 batched: 5 | BLS12_381 | 1124.613 | 26 675.84 | 889195 | 2928783 |
| Pairing BLS12 batched: 6 | BLS12_381 | 992.763 | 30 218.69 | 1007290 | 3317752 |
| Pairing BLS12 batched: 7 | BLS12_381 | 885.784 | 33 868.30 | 1128943 | 3718451 |
| Pairing BLS12 batched: 8 | BLS12_381 | 804.54 | 37 288.39 | 1242946 | 4093950 |
| verify_kzg_proof (EIP-4844) | BLS12_381 serial | 1284.248 | 23 359.97 | 778666 | 2564722 |

Now, the gas prices I suggest assume we want to allow mobile CPUs from 10 years ago to be competitive.

Old benchmark of Constantine on i5-5257U (mobile Broadwell 2.7GHz without ADCX, ADOX instructions for bigint): Multipairing by mratsim · Pull Request #165 · mratsim/constantine · GitHub

The CPU is about 2.5x slower (single-threaded) than the 7840U with Constantine. We assume that all modern BLS12-381 backends are within 20% of Constantine, so we multiply the ideal gas cost on the 7840U by 3 and round up generously. We also use the constant-time Jacobian coordinates as reference, as every library but Constantine, Halo2 and Icicle uses Jacobian coordinates.

For the pairing cost, the formula used is:
<Miller Loop> * k * 0.8 + <Final Exponentiation>, where the 0.8 is due to acceleration tricks when batching multiple Miller loops.
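Plugging the measured 30 Mgas/s-equivalent costs from the benchmark table earlier into that formula gives a sanity check against the measured batched timings (the helper name is mine; the constants and the 0.8 factor are from this thread):

```python
# Multi-pairing gas model: k Miller loops (batched, ~0.8 discount) plus one
# final exponentiation. Costs are the measured gas-equivalents at a
# 30 Mgas/s target on the Ryzen 7840U.
MILLER_LOOP = 5744.46
FINAL_EXP = 6841.29

def multipairing_gas(k: int) -> float:
    return MILLER_LOOP * k * 0.8 + FINAL_EXP

# The model lands within ~1% of the measured "batched: 2" row (15 934.95).
assert abs(multipairing_gas(2) - 15934.95) / 15934.95 < 0.02
```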

| Operation + Algorithm | Curve | Throughput (ops/s) | Gas cost (target 30 Mgas/s) | Current gas | Suggested gas (CPU 2015, 2.5x slower, framework 20% slower => x3) | Subgroup check cost | Total gas cost |
|---|---|---|---|---|---|---|---|
| EC Add G1 Prj | G1 | 2732240.437 | 10.98 | | | | |
| EC Add vartime G1 Prj | G1 | 4975124.378 | 6.03 | | | | |
| EC Double G1 Prj | G1 | 4219409.283 | 7.11 | | | | |
| EC Add G1 Jac | G1 | 2188183.807 | 13.71 | 500 | 40 | | 40 |
| EC Add vartime G1 Jac | G1 | 5780346.821 | 5.19 | | | | |
| EC Double G1 Jac | G1 | 4566210.046 | 6.57 | | | | |
| EC Add G2 Prj | G2 | 943396.226 | 31.80 | | | | |
| EC Add vartime G2 Prj | G2 | 1869158.879 | 16.05 | | | | |
| EC Double G2 Prj | G2 | 1564945.227 | 19.17 | | | | |
| EC Add G2 Jac | G2 | 810372.771 | 37.02 | 800 | 110 | | 110 |
| EC Add vartime G2 Jac | G2 | 2207505.519 | 13.59 | | | | |
| EC Double G2 Jac | G2 | 2000000 | 15.00 | | | | |
| Subgroup check Jac | G1 | 16451.427 | 1 823.55 | | 3000 | | 3000 |
| Subgroup check Jac | G2 | 19455.253 | 1 542.00 | | 3500 | | 3500 |
| EC ScalarMul 255-bit G1 Prj | G1 | 16266.511 | 1 844.28 | | | | |
| EC ScalarMul vartime 255-bit G1 Prj | G1 | 20193.861 | 1 485.60 | | | | |
| EC ScalarMul 255-bit G1 Jac | G1 | 8268.768 | 3 628.11 | 12000 | 5500 | 3000 | 8500 |
| EC ScalarMul vartime 255-bit G1 Jac | G1 | 9304.836 | 3 224.13 | | | | |
| EC ScalarMul 255-bit G2 Prj | G2 | 8717.332 | 3 441.42 | | | | |
| EC ScalarMul vartime 255-bit G2 Prj | G2 | 10393.498 | 2 886.42 | | | | |
| EC ScalarMul 255-bit G2 Jac | G2 | 5241.695 | 5 723.34 | 45000 | 10300 | 3500 | 13800 |
| EC ScalarMul vartime 255-bit G2 Jac | G2 | 4379.089 | 6 850.74 | | | | |
| Miller Loop | BLS12 | 2408.223 | 12 457.32 | | | | |
| Final Exponentiation | BLS12 | 23078.698 | 1 299.90 | | | | |
| Pairing | BLS12 | 7504.015 | 3 997.86 | | | | |
| Hash to G1 (SSWU - Draft #14) | | 30690.851 | 977.49 | | | | |
| Hash to G2 (SSWU - Draft #14) | | 26205.451 | 1 144.80 | | | | |
| Miller Loop | BLS12 | 5222.423 | 5 744.46 | | 18000 | | |
| Final Exponentiation | BLS12 | 4385.138 | 6 841.29 | | 20500 | | |
| Pairing | BLS12 | 2408.269 | 12 457.08 | 108000 | 38500 | 6500 | 45000 |
| Pairing batched: 1 | BLS12 | 2408.014 | 12 458.40 | | | | |
| Pairing non-batched: 2 | BLS12 | 1200.394 | 24 991.79 | | | | |
| Pairing batched: 2 | BLS12 | 1882.654 | 15 934.95 | 151000 | 49300 | 13000 | 62300 |
| Pairing non-batched: 3 | BLS12 | 801.33 | 37 437.76 | | | | |
| Pairing batched: 3 | BLS12 | 1530.618 | 19 599.93 | 194000 | 63700 | 19500 | 83200 |
| Pairing batched: 4 | BLS12 | 1301.712 | 23 046.57 | 237000 | 78100 | 26000 | 104100 |
| Pairing batched: 5 | BLS12 | 1124.613 | 26 675.84 | 280000 | 92500 | 32500 | 125000 |
| Pairing batched: 6 | BLS12 | 992.763 | 30 218.69 | 323000 | 106900 | 39000 | 145900 |
| Pairing batched: 7 | BLS12 | 885.784 | 33 868.30 | 366000 | 121300 | 45500 | 166800 |
| Pairing batched: 8 | BLS12 | 804.54 | 37 288.39 | 409000 | 135700 | 52000 | 187700 |
| verify_kzg_proof (EIP-4844) | | 1284.248 | 23 359.97 | 50000 | 70000 (priced in) | | 70000 |

Agree with the premise of this, we should bring the pricing of these in-line with current KZG proof costs.

I notice that, unlike EIP-196, EIP-2537 does not allow input that is truncated / virtually padded with zeros.

> Encoding
>
> Field elements and scalars are encoded as 32 byte big-endian numbers. Curve points are encoded as two field elements (x, y), where the point at infinity is encoded as (0, 0).
>
> Tuples of objects are encoded as their concatenation.
>
> For both precompiled contracts, if the input is shorter than expected, it is assumed to be virtually padded with zeros at the end (i.e. compatible with the semantics of the CALLDATALOAD opcode). If the input is longer than expected, surplus bytes at the end are ignored.
>
> The length of the returned data is always as specified (i.e. it is not “unpadded”).
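The EIP-196 semantics quoted above boil down to something like this sketch (helper name is mine):

```python
def normalize_input(data: bytes, expected_len: int) -> bytes:
    """EIP-196-style input handling: short input is virtually zero-padded
    at the end, surplus bytes are ignored (CALLDATALOAD semantics)."""
    return data.ljust(expected_len, b"\x00")[:expected_len]

assert normalize_input(b"\x01\x02", 4) == b"\x01\x02\x00\x00"               # padded
assert normalize_input(b"\x01\x02\x03\x04\x05", 4) == b"\x01\x02\x03\x04"   # truncated
```

EIP-2537, by contrast, rejects input whose length does not match exactly, which is the difference being questioned here.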

Is this intentional? I wonder if we can make a retroactive EIP to enshrine the same for EIP-196.

I finished fully implementing EIP-2537 in Constantine in EIP-2537 - BLS12-381 precompiles for the EVM by mratsim · Pull Request #368 · mratsim/constantine · GitHub and can now provide accurate benchmarks that take into account (de)serialization and other miscellaneous costs.

First, on AMD Ryzen 7840U (2023, low-power / ultraportable version of their flagship 7840HS, TDP 28W down to 15W, 3.3GHz, turbo 5.1GHz) with assembly and Clang.

Second, on i5-5257U (MacBook Pro 2015, low-power dual-core Broadwell, TDP 15W down to 7.5W, 2.3GHz, turbo 2.9GHz) without assembly (it lacks the ADOX/ADCX instructions anyway), but with the add-with-carry intrinsics. This is a 9-year-old x86 CPU.

Finally, on a Raspberry Pi 4 (2019, TDP 10W, 1.5GHz) without assembly and without any add-with-carry intrinsics. No add-with-carry makes bigint operations up to 3x more costly, as each add-with-carry must be replaced by a main addition, a comparison/carry check, and adding the carry.
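That add-with-carry penalty can be sketched on limb-wise bigint addition. This is a pure-Python illustration of the operation sequence (not Constantine’s Nim code): the second variant spells out the main-add / compare / add-carry steps that replace a single hardware ADC per limb:

```python
MASK = (1 << 64) - 1  # 64-bit limbs, little-endian limb order

def add_limbs_adc(a, b):
    """With hardware add-with-carry: one ADC per limb, carry propagates free."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & MASK)
        carry = s >> 64
    return out, carry

def add_limbs_no_adc(a, b):
    """Without add-with-carry: main addition, comparison to detect the carry,
    then a separate addition of the incoming carry -- roughly 3x the ops."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = (x + y) & MASK          # main addition
        c1 = 1 if s < x else 0      # comparison / carry check
        s2 = (s + carry) & MASK     # add the incoming carry
        c2 = 1 if s2 < s else 0     # and check for carry again
        out.append(s2)
        carry = c1 | c2
    return out, carry

a = [MASK, 1]
b = [1, 0]
assert add_limbs_adc(a, b) == add_limbs_no_adc(a, b) == ([0, 2], 0)
```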

See also compiler codegen Compiler Explorer and Tracking compiler inefficiencies · Issue #357 · mratsim/constantine · GitHub


Price suggestions

This remark is confirmed with end-to-end tests.
Besides MSMs, all operations can achieve 60 Mgas/s on a 9-year-old, very low-power CPU (15W!) without assembly, and even 230 Mgas/s for mapping to G2. This strongly suggests that those operations are overcosted.

I think that a 9-year-old, 15W CPU is significantly outdated, and achieving 20 Mgas/s without assembly on it should be fine.

Hence I suggest the following price changes:

| Primitive | Current price | Suggested price | Ryzen 7840U 28W 3.3GHz turbo 5.1GHz - clang+ASM - current | Ryzen 7840U - clang+ASM - repriced | Intel i5-5257U 15W 2.3GHz turbo 2.8GHz - clang+noASM+intrinsics - current | Intel i5-5257U - clang+noASM+intrinsics - repriced |
|---|---|---|---|---|---|---|
| BLS12_G1ADD | 500 | 125 | 185.60 MGas/s | 46.4 | 79.24 MGas/s | 19.81 |
| BLS12_G1MUL | 12000 | 3800 | 144.80 MGas/s | 45.9 | 52.94 MGas/s | 16.8 |
| BLS12_G2ADD | 800 | 170 | 218.28 MGas/s | 46.325 | 86.30 MGas/s | 18.33 |
| BLS12_G2MUL | 45000 | 6100 | 346.23 MGas/s | 46.9 | 113.23 MGas/s | 15.35 |
| BLS12_MAP_FP_TO_G1 | 5500 | 1600 | 161.09 MGas/s | 46.86 | 59.91 MGas/s | 17.43 |
| BLS12_MAP_FP2_TO_G2 | 75000 | 5000 | 702.17 MGas/s | 46.8 | 231.85 MGas/s | 15.46 |

The rationale for those changes: target 30 Mgas/s on a CPU 2~3 years older than my own. CPUs progress on average by 10~15%/year, so something at 66% of my CPU’s performance is reasonable (around 45 Mgas/s on mine). But that reference CPU is still a low-power CPU, so instead of targeting 45 Mgas/s on it we can target 40 Mgas/s or even 35 Mgas/s.

In any case it is clear that G2 functions are severely overcosted, especially mapping to G2.

Pairings

For pairings, the evolution of the multipairing cost is stable, but the price is 5x the actual cost.

Hence I suggest moving from 43000*k + 65000 to 8600*k + 13000.

MSMs

The discounts are too steep and need adjustment.

| Primitive | Current cost | Suggested cost | Ryzen 7840U 28W 3.3GHz turbo 5.1GHz - clang+ASM - current | Ryzen 7840U - clang+ASM - repriced | Intel i5-5257U 15W 2.3GHz turbo 2.8GHz - clang+noASM+intrinsics - current | Intel i5-5257U - clang+noASM+intrinsics - repriced |
|---|---|---|---|---|---|---|
| G1MSM 2 | 2 * 12000 * 888 / 1000 = 21312 | 2 * 3800 * 999 / 1000 = 7592 | 121.74 Mgas/s | 43.37 Mgas/s | 39.46 Mgas/s | 14.06 Mgas/s |
| G1MSM 4 | 4 * 12000 * 641 / 1000 = 30768 | 4 * 3800 * 919 / 1000 = 13969 | 102.15 Mgas/s | 46.38 Mgas/s | 31.77 Mgas/s | 14.42 Mgas/s |
| G1MSM 8 | 8 * 12000 * 453 / 1000 = 43488 | 8 * 3800 * 813 / 1000 = 24715 | 81.67 Mgas/s | 46.41 Mgas/s | 28.10 Mgas/s | 15.97 Mgas/s |
| G1MSM 16 | 16 * 12000 * 334 / 1000 = 64128 | 16 * 3800 * 728 / 1000 = 44262 | 67.23 Mgas/s | 46.40 Mgas/s | 22.78 Mgas/s | 15.72 Mgas/s |
| G1MSM 32 | 32 * 12000 * 268 / 1000 = 102912 | 32 * 3800 * 668 / 1000 = 81229 | 59.01 Mgas/s | 46.58 Mgas/s | 20.42 Mgas/s | 16.12 Mgas/s |
| G1MSM 64 | 64 * 12000 * 221 / 1000 = 169728 | 64 * 3800 * 630 / 1000 = 153216 | 51.40 Mgas/s | 46.40 Mgas/s | 18.18 Mgas/s | 16.41 Mgas/s |
| G1MSM 128 | 128 * 12000 * 174 / 1000 = 267264 | 128 * 3800 * 602 / 1000 = 292813 | 42.37 Mgas/s | 46.42 Mgas/s | 15.22 Mgas/s | 16.65 Mgas/s |
| G2MSM 2 | 2 * 45000 * 888 / 1000 = 79920 | 2 * 6100 * 999 / 1000 = 12188 | 274.23 Mgas/s | 41.82 Mgas/s | 85.47 Mgas/s | 13.03 Mgas/s |
| G2MSM 4 | 4 * 45000 * 641 / 1000 = 115380 | 4 * 6100 * 919 / 1000 = 22424 | 168.31 Mgas/s | 32.71 Mgas/s | 69.02 Mgas/s | 13.41 Mgas/s |
| G2MSM 8 | 8 * 45000 * 453 / 1000 = 163080 | 8 * 6100 * 813 / 1000 = 39674 | 177.98 Mgas/s | 42.77 Mgas/s | 55.77 Mgas/s | 13.57 Mgas/s |
| G2MSM 16 | 16 * 45000 * 334 / 1000 = 240480 | 16 * 6100 * 728 / 1000 = 71053 | 150.39 Mgas/s | 44.43 Mgas/s | 48.60 Mgas/s | 14.36 Mgas/s |
| G2MSM 32 | 32 * 45000 * 268 / 1000 = 385920 | 32 * 6100 * 668 / 1000 = 130394 | 136.80 Mgas/s | 46.22 Mgas/s | 44.21 Mgas/s | 14.94 Mgas/s |
| G2MSM 64 | 64 * 45000 * 221 / 1000 = 636480 | 64 * 6100 * 630 / 1000 = 245952 | 119.28 Mgas/s | 46.09 Mgas/s | 39.68 Mgas/s | 15.33 Mgas/s |
| G2MSM 128 | 128 * 45000 * 174 / 1000 = 1002240 | 128 * 6100 * 602 / 1000 = 470042 | 101.59 Mgas/s | 47.64 Mgas/s | 33.94 Mgas/s | 15.92 Mgas/s |

Comparison with EIP-4844 KZG

EIP-4844 introduced a point evaluation precompile that is mostly a double pairing at 50000 gas: EIP-4844: Shard Blob Transactions

On my machine that is 1283.318 operations/s, i.e. 64.17 Mgas/s.

Comparison with EIP-196 and EIP-197

  • BN254 ECADD is at 500
  • BN254 ECMUL is at 40000
  • BN254 PAIRING is at 80000*k + 100000

I can provide benchmarks later, but they seem mispriced by a factor of over 10x for multiplication and about 10x for pairings.


Bench update with SHA256 and BN254 precompiles


EIP-2537: Precompile for BLS12-381 curve operations with @ralexstokes

I’m sure this has been covered previously, but I can’t find it:

Why is the ABI encoding for Fp elements aligned along 64 bytes instead of 48 bytes?

It’s discussed in the EIP text.

Is there a mathematical formula for the discount?

BLS precompile g1msm gas cost analysis

We’ve implemented EIP-2537: Precompile for BLS12-381 curve operations in evmone using the blst v0.3.13 library. Then we benchmarked the multi-scalar multiplication g1msm in this implementation and compared timings with the specified gas schedule.

Benchmarks have been run on 2 different CPU architectures:

  • x86-64 Intel Ultra 7 155U Performance Core 4.0GHz
  • arm64 Apple M3 Pro

Timings have been normalized to a single point multiplication (precompile g1mul), i.e. a single naive multiplication is 1.

We also benchmarked a variant of g1msm with subgroup checks disabled (“nocheck”).

The chart also includes a “gas” line showing the by-the-spec gas cost. It is “discounted” up to 128 points, then linear with slope 174/1000.
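For reference, the by-the-spec g1msm gas line can be sketched as follows. Note this only includes the discount values sampled earlier in this thread, while the actual EIP defines a full per-k table up to 128; the function name is mine:

```python
# Sketch of the EIP-2537 g1msm gas schedule: per-point base cost times a
# per-mille discount, with the discount bottoming out at 174/1000 past 128
# points (linear from there on). Only the sampled discounts quoted earlier
# in the thread are listed; the EIP has an entry for every k up to 128.
G1MUL_GAS = 12000
DISCOUNTS = {2: 888, 4: 641, 8: 453, 16: 334, 32: 268, 64: 221, 128: 174}
MAX_DISCOUNT = 174

def g1msm_gas(k: int) -> int:
    discount = DISCOUNTS.get(k, MAX_DISCOUNT) if k <= 128 else MAX_DISCOUNT
    return k * G1MUL_GAS * discount // 1000

assert g1msm_gas(2) == 21312            # matches 2 * 12000 * 888 / 1000
assert g1msm_gas(128) == 267264         # matches 128 * 12000 * 174 / 1000
assert g1msm_gas(256) == 256 * 12000 * 174 // 1000  # linear past 128
```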

Finally, we’ve added a “check only” line, which is the difference between the default and “nocheck” series.

Conclusions

The real cost of the precompile is close to linear with slope 100/256 and is significantly higher than estimated by the gas cost, while the cost of the precompile without subgroup checks is sub-linear and “relatively” matches the gas cost.

In order to approximate the cost of the subgroup-check component of the operation, we analyze the difference in time taken by the “check” and “nocheck” variants. It is visible from the chart that the slope of this data is constant for various $n$ (contrary to the “check” and “nocheck” slopes, which gradually decrease). So we have a sub-linear (piecewise-linear) component from the MSM itself and a linear component from the subgroup check, which aligns with the theoretical expectations.

In other words, the g1msm precompile is a compound of linear and sub-linear operations, therefore its total complexity is linear.

The gas schedule, specifically the discount table, was proposed in the initial “BLS12-381 curve operations” commit from 2020 and hasn’t been updated since.

The subgroup check was added two years later, in 2022, by the “Update EIP-2537: Rephrased subgroup check” change.

We should reconsider whether the subgroup check is needed for point multiplication. The algorithms (mul and msm) work well on every point of the curve. This check seems to have been added only because of a not-very-clear security consideration. Moreover, making this check on every call is unnecessary when the programmer is sure the input is in the proper subgroup, e.g. because it is the result of a previous operation which does not change the subgroup.

By @chfast @rodiazet @pdobacz


The original gas table implied/required that MSM would be implemented via the Pippenger algorithm, which has roughly N/log N asymptotics, and the original table was derived from empirical benchmarks for that case, which included subgroup checks. I will also note that subgroup checks were implemented as trivial multiplication by the group order, while any more modern sane implementation should use fast subgroup checks.
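The N/log N behavior can be seen in a textbook bucket-method operation count (a generic model I’m sketching here, not the one actually used to derive the original table; doublings and subgroup checks are omitted):

```python
# Textbook Pippenger/bucket-method cost model for an n-point MSM with b-bit
# scalars and window width c: ceil(b/c) windows, each costing roughly
# n bucket insertions plus 2^c bucket-reduction additions.
from math import ceil

def pippenger_adds(n: int, b: int = 255) -> int:
    # Pick the window width that minimizes the total addition count.
    return min(ceil(b / c) * (n + (1 << c)) for c in range(1, 25))

# The per-point cost shrinks as n grows -- the ~n/log n behavior that the
# discount table encodes.
per_point = [pippenger_adds(n) / n for n in (2, 16, 128, 1024)]
assert all(x > y for x, y in zip(per_point, per_point[1:]))
```

A linear component (like a per-point subgroup check) added on top of this dominates for large n, which matches the observation above that the checked precompile becomes linear overall.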