However they can be implemented more efficiently with AVX512 or AVX512IFMA
They can’t
If you have an implementation with AVX512 or AVX512IFMA, feel free to send it over and I will benchmark it against my library, or you can run the benchmark yourself:
And to further clarify myself (apologies for not making it clear earlier):
For the record, while RISC-V does not have the adc instruction directly, there are ways to achieve a similar effect:
Solution 1: by chaining add and sltu, one can get instruction chains similar to add w/ adc (see the sketch after Solution 2): https://godbolt.org/z/v165TYKqb
Solution 2: in CKB-VM, we have abused the idea of macro-op fusion (please use an archive service to look for https://riscv.org/wp-content/uploads/2016/07/Tue1130celio-fusion-finalV2.pdf) and introduced adc to CKB-VM: https://github.com/nervosnetwork/rfcs/blob/master/rfcs/0033-ckb-vm-version-1/0033-ckb-vm-version-1.md#42-mop
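To illustrate Solution 1, here is a minimal sketch in C (my own example, not the code behind the godbolt link; the helper name addc is hypothetical) of the add/sltu pattern a compiler emits for add-with-carry on RISC-V:

```c
#include <stdint.h>

/* Hypothetical helper: full add-with-carry on a 64-bit limb. On rv64
   each `<` comparison lowers to a single sltu, so the whole body
   becomes an add/sltu chain resembling x86's add/adc. */
static inline uint64_t addc(uint64_t a, uint64_t b, uint64_t carry_in,
                            uint64_t *carry_out) {
    uint64_t s = a + b;        /* add  */
    uint64_t c = s < a;        /* sltu: carry out of the first add */
    s += carry_in;             /* add  */
    c += s < carry_in;         /* sltu: carry out of the second add */
    *carry_out = c;            /* always 0 or 1 */
    return s;
}
```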
For 256-bit arithmetic, RISC-V has a 2x longer trace than ARMv7: Compiler Explorer (using Clang 13, because later versions of Clang stopped generating code for >128-bit integers with the C _ExtInt/_BitInt extension).
Solution 2: in CKB-VM, we have abused the idea of macro-op fusion
This is something that can indeed be implemented in an interpreter to reduce VM traces, though the precompile approach is already ubiquitous and more performant for most bottlenecks (cryptography, u256, hashes).
Yes. But I disagree that it’s the best we can do. (: The way to do it is to assume the worst case, and then extend the model to lower those costs in cases when we know it’s safe. If you assume the best/average case then you’re just opening yourself up to denial of service attacks. The fact that you haven’t had any problems so far only means that no one bothered to attack your chain yet, not that your gas metering is safe.
And what makes you think that on this hypothetical future RISC-V CPU the performance counters will be deterministic? (:
In fact, I can guarantee you that they will not be deterministic. The non-determinism of the performance counters is not an arbitrary decision that the hardware designers made; it’s a direct consequence of making a high performance, superscalar, pipelined, out-of-order CPU core with multi-level caches.
Yes, you can in theory add such performance counters to a CPU, but that means the CPU will have to be slow to keep the counters deterministic, so what’s the point of running on a slow CPU with deterministic hardware gas metering if we can get faster performance by doing the gas metering in software on a fast CPU?
This comparison is based on the rv32gc ISA. If you switch to rv64gc on the same page, the trace is actually smaller than ARMv7's. I'm not certain why most RISC-V based zkVMs prefer rv32 over rv64, but larger registers naturally lead to fewer operations when handling 256-bit values. In the future, I hope to see more zkVMs based on RISC-V with 64-bit registers. RISC-V even has rv128 well defined, which would lead to even fewer operations, although I'm not advocating for it given its limited support by current compilers.
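To make the register-width point concrete, here is a rough sketch (my own illustration, assuming a straightforward four-limb representation) of 256-bit addition: on rv64 it takes four add/carry steps, while on rv32 the same value needs eight 32-bit limbs and roughly twice the instructions:

```c
#include <stdint.h>

/* Hypothetical 256-bit type as four 64-bit limbs; on rv32 the same
   value needs eight 32-bit limbs, roughly doubling the trace. */
typedef struct { uint64_t limb[4]; } u256;

static u256 add256(u256 a, u256 b) {
    u256 r;
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {           /* 8 iterations on rv32 */
        uint64_t s = a.limb[i] + b.limb[i]; /* add */
        uint64_t c = s < a.limb[i];         /* sltu */
        r.limb[i] = s + carry;              /* add */
        c += r.limb[i] < carry;             /* sltu */
        carry = c;
    }
    return r;
}
```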
An ISA optimized for ZK would actually optimize for reducing those data movements.
I think this choice represents a fundamental trade-off between performance and tooling maturity.
Optimizing purely for speed would suggest creating a custom ISA tailored to ZK constraints, but this requires significant research, implementation, auditing and tooling development - ultimately creating higher barriers to adoption and taking more time to fully develop it.
Alternatively, using a stable, well-defined ISA like RISC-V leverages years of development and growing adoption. I believe this approach sacrifices some performance but significantly reduces implementation effort through existing tooling.
For a blockchain base layer intended to be open and widely adopted, I believe leveraging RISC-V’s openness and mature tooling ecosystem maximizes adoption. While custom ISAs may have their place in specific projects, for Ethereum specifically, I would prefer the path that offers greater developer freedom and broader adoption - which is the goal of open standard ISAs like RISC-V.
RISC-V is simple, yes — but not optimized for zero-knowledge. Custom ZK-friendly ISAs can outperform it significantly. There's nothing inherently ZK-native about RISC-V other than being "not as bad as EVM." It has a smaller opcode count compared to WASM … but this also does not really matter, as à la carte opcode selection in proving schemes exists.
I wonder how is the problem of
This is exactly the point.
There are billions of dollars invested in the Solidity ecosystem—from developer tools to security audits. The security of smart contracts depends on how widely the code is used.
Hundreds of thousands of people have learned Solidity.
Rust is a complex, general-purpose language, which makes security incredibly hard to achieve. There are simply too many moving parts.
Usually, once a community is formed, people stick to the language no matter what. It’s simply network effect.
The chances of success here are extremely low. People like Solidity, and there’s no reason for them to switch. Rust is not a magic bullet for anything.
Just a short message to say we - StarkWare - will provide our thoughts on this very important topic in a few weeks. We have a lot to say because the considerations regarding which VM to pick have been deeply and hotly discussed by our team internally a few years ago before embarking on the Cairo VM, which was where we landed.
FWIW, I was probably the most staunchly opposed to that approach at the start, and was in favor of providing proofs for some off-the-shelf VM, for the exact reasons you’d think: better to focus on proving something already built, than to “boil the ocean” and go with something entirely new.
By now, what was entirely new back then isn't so anymore, and we've gained a lot of insights, which we'll share. Briefly, we went for an architecture that optimizes for safety (as a blockchain VM), STARK-proof efficiency, and STARK soundness. This has allowed us to already formally prove the soundness of the Cairo VM and extend that to a few mathematical syscalls.
What does this imply for Ethereum? I don't know yet, because I need to pick the brains of all the smarter folks at StarkWare who know way more than me about how to build a VM that's safe, ergonomic, usable and efficient for blockchain and STARK-proving. But I assure you we have a lot to say.
It would really help if @vbuterin or someone on leadership of EF specifies what exactly are the desiderata and considerations for picking a next VM for Ethereum. After all, Ethereum has the unique position to determine The Gold Standard for blockchain VMs, i.e., Ethereum can say “here’s what I want” and that may become the standard.
This is a bit similar to an earlier process we led on behalf of the EF, that of recommending STARK-friendly hash functions. Would be great to have a similar effort for STARK-friendly VM.
Yes. But I disagree that it’s the best we can do. (: The way to do it is to assume the worst case, and then extend the model to lower those costs in cases when we know it’s safe. If you assume the best/average case then you’re just opening yourself up to denial of service attacks. The fact that you haven’t had any problems so far only means that no one bothered to attack your chain yet, not that your gas metering is safe.
Please correct me if I'm wrong, but my general feeling is that we are approaching the same solution from two sides. One could start from the optimistic side and gradually increase cycle costs at places where the metering turns out to be wrong. And yes, it is totally possible to start from the other side, where the worst case (assuming every memory load results in a cache miss) is used first, and the cost is only lowered when it is safe. I don't think this undermines the main point: modelling the VM like a real CPU enables us to build a much more accurate cycle model.
And I totally agree that CKB might still be at a stage where no one has bothered to attack it; we are definitely watching things and are prepared to revise our cycle metering model when needed.
In fact, I can guarantee you that they will not be deterministic. The non-determinism of the performance counters is not an arbitrary decision that the hardware designers made; it’s a direct consequence of making a high performance, superscalar, pipelined, out-of-order CPU core with multi-level caches.
Yes, you can in theory add such performance counters to a CPU, but that means the CPU will have to be slow to keep the counters deterministic, so what’s the point of running on a slow CPU with deterministic hardware gas metering if we can get faster performance by doing the gas metering in software on a fast CPU?
I totally agree with you: for a superscalar, pipelined, out-of-order CPU core, performance counters will indeed be non-deterministic. However, I think there is a middle ground between the following two:
It's not one side or the other; there may well exist cases where a suitable solution can be found in between the two.
That being said, real hardware as a blockchain VM can be quite a niche case; I can drop this argument if you still think otherwise.
@edubart has helped with the first part (thank you!) so I will skip it.
This is something that can indeed be implemented in an interpreter to reduce VM traces, though the precompile approach is already ubiquitous and more performant for most bottlenecks (cryptography, u256, hashes).
If you know the exact algorithms people will use, and only those, then yes, precompiles are the ideal solution. In fact, people in the RISC-V world do it all the time: all kinds of custom instructions are added to different RISC-V cores, just to solve a particular task.
That being said, all those macro-op fusions, as well as the V & P extensions mentioned in the earlier post, help optimize programs in a more general way: the optimizations are introduced to the VM only once (or much less often than we add new precompiles), and a whole series of cryptographic algorithms compiled down to RISC-V can all embrace those optimizations.
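For illustration, here is what a fused add/sltu pair could look like in an interpreter's dispatch loop. This is my own minimal sketch; all struct, opcode and function names are hypothetical, not CKB-VM's actual internals:

```c
#include <stdint.h>

/* Hypothetical decoded-instruction representation. */
typedef struct { int op, rd, rs1, rs2; } Insn;
enum { OP_ADD, OP_SLTU };

/* If `add rd, rs1, rs2` is immediately followed by `sltu rc, rd, rs1`
   (and rd does not alias rs1), retire both as one fused add-with-carry
   step. Returns the number of instructions consumed (2 if fused). */
static int try_fuse_adc(const Insn *cur, const Insn *next, uint64_t *regs) {
    if (cur->op == OP_ADD && next->op == OP_SLTU &&
        next->rs1 == cur->rd && next->rs2 == cur->rs1 &&
        cur->rd != cur->rs1) {
        uint64_t a = regs[cur->rs1], b = regs[cur->rs2];
        regs[cur->rd]  = a + b;        /* the add */
        regs[next->rd] = (a + b) < a;  /* the carry bit, fused in */
        return 2;
    }
    return 0;
}
```

A cycle model on top of this can then charge the fused pair as a single operation, which is exactly what reduces the VM trace.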
I do believe all the discussions on new precompiles we see here and in the EIP repo already speak to the fact that a more general approach might be yearned for, rather than simply adding new precompiles. I still remember when https://ethereum-magicians.org/t/eip-7212-precompiled-for-secp256r1-curve-support/14789 was being discussed…
It does not sacrifice just "some performance": it starts with a 4x~7x lag from the load-compute-store overhead. If (3) is the most important bottleneck to solve (from Vitalik's original post), then I'd argue that RISC-V is not good enough.
There are ISAs that have been developed for a long time, some used in production, like Cairo or Valida, that can be targeted from C, Rust, Nim and WASM today (Zig, for example, could easily be added tomorrow, and C++ needs some work on the stdlib). See the Valida spec here (Technical Design: VM | Lita Docs; disclaimer: I'm working on Valida).
On the tooling side, Valida at the very least is using LLVM toolchains, can reuse usual tooling like gdb/lldb for debugging, and even perf for optimization.
It was simpler. Two years ago, zkVMs were unproven; it was unclear whether they were feasible compared to zkEVMs with each EVM opcode being a circuit: zkevm-circuits/zkevm-circuits/src at main · privacy-scaling-explorations/zkevm-circuits · GitHub
ZK-EVM provers today already work by proving over implementations of the EVM compiled down to RISC-V
Is RISC-V really the best architecture for proofs or can we make a better one? RISC-V wasn’t designed for ZK. If there is a better architecture in 2 years, how much of our RISC-V development would transfer to the next architecture?
then the gas schedule will reflect proving times
The gas itself has to be calculated during runtime to solve the halting problem. Gas accounting is a non-negligible part of both execution and proof. Many current operations have variable gas costs. It is possible that computing these variable costs will increase proof complexity, which increases gas complexity, and so on. This is a chicken-and-egg problem: we have to charge gas for the gas computation itself in order to know how much gas to charge.
Instead, perhaps the EVM/EOF/EWASM can all be JIT’d into (RISC-V) using separate gas schedules. (RISC-V) would be the base execution layer and everything would transpile into it, injecting gas calculations at each step. Every hard fork would need their own transpilers, but these should never change.
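To make the gas-injection idea concrete, here is a minimal sketch of what a transpiler could emit (all names are hypothetical; this is not any existing implementation): the static cost of each basic block is summed at translation time, and one check is injected at the block's entry:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical VM state carrying the remaining gas budget. */
typedef struct { int64_t gas_left; } VmState;

/* Injected at the entry of every transpiled basic block: charge the
   block's precomputed static cost up front, so no per-instruction
   metering is needed inside the block. */
static inline void charge_block(VmState *vm, int64_t block_cost) {
    vm->gas_left -= block_cost;
    if (vm->gas_left < 0)
        abort();  /* stand-in for the VM's out-of-gas trap */
}
```

Blocks containing variable-cost operations would still need extra runtime terms, which is where the chicken-and-egg concern above bites.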
Please, don’t.
I believe the main problem is not the ISA per se, but the code generation/optimization performed using a CPU cost model.
Problematic ISA aspects can be (at least partially) alleviated via binary recompilation. But re-optimizing code with a ZK cost profile will be more complicated; e.g., static analysis of RISC-V code will be more difficult (due to aliasing).
I’d say that EVM+EOF is much better in that respect, since it’s higher level, easier to analyze statically (EOF is designed to be so) and does not include low-level CPU-oriented optimizations.
This is misleading. The RISC-V spec PDF has so many pages because it includes all of the ratified optional extensions up until this point. The baseline RISC-V instruction set is super simple. It's so simple that I implemented a RISC-V interpreter in a single day, and a recompiler (which recompiles RISC-V into native code and runs it) in two days for my initial PolkaVM prototype. (Try doing that with WASM, especially since for WASM you also need to write a register allocator.) And the great thing about RISC-V is that you can choose which exact extensions you want to use.
WASM definitely doesn't win against RISC-V when it comes to complexity. 172 instructions in the 1.0 WASM spec and 220 instructions in the 2.0 spec is not simple (up to a total of 541 last time I counted, if we include all proposed WASM extensions), especially compared to RISC-V's 47 instructions in the base spec. And, again, with RISC-V you have the choice - you can only use the base spec! With WASM this is not supported, and compilers are already slowly sunsetting support for WASM 1.0 (e.g. the signext extension is now enabled by default; in Polkadot we have to forcibly disable it when building our WASM runtimes, but eventually that will stop being supported and we'll be screwed). With RISC-V you can use the baseline target with only 47 instructions forever, and decide yourself when/if you want to opt into any extensions; with WASM eventually you'll be forced to either fork all of the compilers and toolchains, or get on the complexity treadmill and implement all of the instructions that the newest WASM spec implements.
This is an entirely orthogonal issue. There’s nothing in RISC-V that prevents you from making the program code unreadable, or making self-modifying code impossible, or splitting the address space, or being able to check the code before execution. It’s just an issue of how much effort you want to invest. It’s possible to decide what you want to use from vanilla RISC-V and what you want to change; we went all the way with PolkaVM and got rid of almost every downside of RISC-V, but you can pick and choose:
So strictly speaking I agree with you: out of the box, WASM is better than RISC-V for the reasons you've mentioned, but you can fix those problems by not naively taking RISC-V as-is as it's used for native hardware, and instead tweaking certain aspects of it. It depends on how much of a perfectionist you are, and whether you want to maximize "keep it as close to RISC-V as possible", or maximize "be as fast as possible and as simple as possible while still being able to use vanilla RISC-V toolchains", or something in-between.
For Polkadot we were already using WASM, so we wanted to migrate to something that’s strictly superior on all fronts, which is why we went all the way and tried to change/improve on almost every aspect of RISC-V that can be changed to make it better for our use case.
RISC-V is also 100% deterministic, and it's actually easier to make RISC-V deterministic than WASM. For example, with RISC-V you can just not support floating point (and get the compilers to emit softfloats instead, so programs can still use them); with WASM you don't have that choice, and you have to deal with issues like NaN canonicalization. Or, as another example, it's a lot easier to make RISC-V's stack deterministic, because unlike with WASM, on RISC-V the stack is not a special thing and just lives in the normal address space.
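As a concrete illustration of the softfloat point (my own example; __addsf3 is the standard compiler-rt/libgcc soft-float routine): when compiling for a RISC-V target without the F extension, a plain float addition lowers to a deterministic library call instead of a hardware instruction:

```c
/* Built for e.g. rv32im (no hardware floating point), this compiles to
   a call to the soft-float routine __addsf3, so the result is bit-exact
   on every host - no NaN canonicalization needed. */
float add_f32(float a, float b) {
    return a + b;
}
```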
Note that this is a downside if you want to write a high performance recompiler that achieves near-native performance. The ideal you'd want to achieve for performance is "one bytecode instruction == one native instruction", as this makes writing a recompiler very simple (because you don't need any complex logic, no register allocation, no op fusion, no interprocedural analysis, etc.) and is the reason why PolkaVM can recompile programs into native code in O(n) time, do it faster than you can calculate a hash, and still have the resulting code run at near-native speed, while wasmtime's recompilation is not O(n), needs a recompiler whose complexity is many orders of magnitude higher, and needs hundreds of times more resources to generate similarly fast code as PolkaVM.
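To make "one bytecode instruction == one native instruction" concrete, here is a toy sketch (all names and encodings are hypothetical, not PolkaVM's actual code) of why such a recompiler runs as a single O(n) pass:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical guest instruction: opcode plus three register fields,
   with guest registers statically pinned to host registers. */
typedef struct { uint8_t op, rd, rs1, rs2; } GuestInsn;

/* Stub host encoder; a real recompiler would write the machine bytes
   of exactly one native instruction here. */
static size_t emit_host_add(uint8_t *out, int rd, int rs1, int rs2) {
    (void)rd; (void)rs1; (void)rs2;
    out[0] = 0x90; /* placeholder byte standing in for one host insn */
    return 1;
}

/* O(n) translation: a single linear pass, one host instruction per
   guest instruction - no register allocation, fusion, or analysis. */
size_t recompile(const GuestInsn *code, size_t n, uint8_t *out) {
    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        switch (code[i].op) {
        case 0: /* guest ADD */
            off += emit_host_add(out + off, code[i].rd,
                                 code[i].rs1, code[i].rs2);
            break;
        /* ...one case per guest opcode... */
        }
    }
    return off;
}
```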
Thank you for your post! Now I see
I don’t think there has been much consideration of how a transition to RISC-V would look. Here’s my brief take that shows how EOF is a valuable part of that transition:
From the linked article:
The code to reproduce these experiments is available at GitHub - succinctlabs/overhead: Measuring revm and wasmi overhead for SP1 programs.
But this repo doesn't exist or is private, so I'm guessing I'm the only one trying to reproduce these numbers.
I'm assuming the "Fibonacci-100k (u128)" benchmark computes the 100,000th element of the Fibonacci sequence with 128-bit precision. By the way, this number has almost 70k bits.
So I've written a simple EVM implementation of Fibonacci (disassembled below, after the EOF variant):
5f195f355f60015b90810183830192600757035f5260205ff3
and an EOF variant (just replaced JUMPDEST/JUMPI with RJUMPI):
ef00010100040200010018ff000000008000065f195f355f600190810183830192e1fff6035f5260205ff3
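For readability, here is my disassembly of the legacy bytecode above (my own decoding; please double-check before relying on it). The EOF variant only drops the JUMPDEST and replaces PUSH1 0x07 / JUMPI with RJUMPI -10:

```
0x00  5f    PUSH0
0x01  19    NOT
0x02  5f    PUSH0
0x03  35    CALLDATALOAD   ; loop count from calldata
0x04  5f    PUSH0
0x05  6001  PUSH1 0x01
0x07  5b    JUMPDEST       ; loop head
0x08  90    SWAP1
0x09  81    DUP2
0x0a  01    ADD
0x0b  83    DUP4
0x0c  83    DUP4
0x0d  01    ADD
0x0e  92    SWAP3
0x0f  6007  PUSH1 0x07
0x11  57    JUMPI          ; loop while the counter is non-zero
0x12  03    SUB
0x13  5f    PUSH0
0x14  52    MSTORE
0x15  6020  PUSH1 0x20
0x17  5f    PUSH0
0x18  f3    RETURN         ; return the 32-byte result
```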
Using evmone and “native” CPU (x86) execution takes 5.0M cycles (legacy EVM) and 3.8M cycles (EOF).
I don't know how to execute this in SP1 yet, but to get an upper-bound estimate I've compiled evmone to riscv32 and counted host CPU cycles of the execution in the qemu emulator: 127M cycles (legacy EVM) and 114M (EOF).
This is still one order of magnitude lower than the number in the paper (1002M cycles).
You wrote:
What do you mean? There is no self-modifying code. It seems you mean that initcode can generate contract code on the fly, i.e. it should be called "generation", not "modification".