EIP-2315 "Simple Subroutines for the EVM" - Analysis

I like this proposal a lot! I hope the cross-sub-jump analysis can be done efficiently, because I think disallowing “crossing returnsubs” makes testing much easier, as suggested by the authors. Returning from a sub will be a syntactic instead of a semantic action.

One edge case that would have to be covered by a test: Jumping to a beginsub that is directly followed by a beginsub, which could be implemented differently than flowing into a beginsub.

I’m pretty sure the Solidity compiler never generated code that jumped over non-code. Anything that is not code is at the end of the bytecode.

I think these two things are confusing, taken together:

Every BEGINSUB position starts new subroutine ending the “current” one.

(this implies, to me, that there’s some form of “implicit returnsub” on a BEGINSUB?)
and

Execution of BEGINSUB causes exception

Your example program:

What would happen after 0e? Doesn't it suddenly walk into a new subroutine and crash?
If there is no implicit returnsub, shouldn't you put a RETURNSUB at e.g. 0f in the example?

Update: commented again before reading the entire message :man_facepalming:

There is a statement in “Fallthrough to next subroutine”, but I agree the “spec” part of “JUMPs across subroutine” is not clear:

Execution of BEGINSUB causes exception (OOG: all gas consumed) and terminates execution.

I’m still worried about making subroutines syntactic. The intent was pure mechanism at the level of EVM assembly - retaining the current control flow operations and adding a Forth-style return stack. At first I only had JUMPSUBs to any JUMPDEST. It was trying to write some example code that way that inspired BEGINSUB - it seemed the minimum necessary structure.

For an interpreter this doesn’t present much trouble, but I can see how it would cause problems for LLVM and similar tools. So I’d ask, without time to analyze, for the minimal structure needed for the purpose.

And I suspect extended dup and swap could help.

Lazy JUMPDEST analysis is indeed possible currently, and problematic at the same time — it is rather complex to implement and easily defeated by worst cases. It may only help average workloads, so I was never keen to implement it and check it in practice. I slightly regret mentioning it in this context at all. However, having subroutines strictly partition the code opens up new (at least theoretical) strategies.


It turns out I was originally wrong. We are able to perform static analysis of all deployed EVM code and check whether the BEGINSUB instruction is present. We actually should extend this and collect information about the “usage” of all unassigned opcodes; it should be helpful for some future EIPs.
But in case BEGINSUB is present, it would be very difficult to check whether a possible jump over it exists.
Requiring code versioning is definitely not something we want to use here.


I actually was not considering EIP-615 at all. @gcolvin proposed the MVP of subroutines, which was a good decision on its own and also a good starting point for discussion. We propose two additional “restrictions” which (in our opinion) provide some additional benefits to this feature. But it stays within the EVM look & feel:

  • it is backward compatible with existing contracts (one issue remains here),
  • no code versioning is needed,
  • no code validation is needed at deploy time,
  • all execution aborts happen only when you execute a malformed piece of code.

I think it is good to stay within these boundaries.


I must agree here. How to perform jumpdest/subroutine analysis, and how to later store this information, is a problem on its own. And it looks like it will get much harder with the proposed change. I think we at least need to inspect that further and propose a recommended implementation. I would prefer to discuss that on the side, as there is a lot to say about this problem.

Some quick ideas though:

  1. You can store the information in a byte array of the size of the code. We need 1 bit for “is it a JUMPDEST”, 1 bit for “is it a BEGINSUB”, and we have 6 bits left to encode (using variadic length encoding) the length of a subroutine; a sketch of this encoding follows below. This has an 8x larger memory footprint than the bitset used in geth.
  2. To bound worst cases, we can limit the max code size to 2x the max deployable code size (this will affect the input size for CREATE, CREATE2 and “create” transactions). EIP-1985 comes to mind.
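
A minimal Go sketch of what the per-position encoding in idea 1 could look like (the flag names and the simplified, single-byte length handling are assumptions, not taken from geth):

package main

import "fmt"

// Hypothetical per-byte layout for idea 1 above; the bit assignments are
// illustrative, not taken from any client:
//   bit 0: this position is a JUMPDEST
//   bit 1: this position is a BEGINSUB
//   bits 2-7: subroutine length (a real implementation would spill longer
//             lengths into following bytes with a variadic encoding)
const (
	flagJumpdest = 1 << 0
	flagBeginsub = 1 << 1
)

// markSubroutine records a subroutine of the given length starting at pc.
func markSubroutine(analysis []byte, pc, length int) {
	analysis[pc] |= flagBeginsub
	if length < 1<<6 { // only the simple, single-byte case is sketched here
		analysis[pc] |= byte(length << 2)
	}
}

func main() {
	code := make([]byte, 64) // placeholder code
	analysis := make([]byte, len(code))
	markSubroutine(analysis, 0, 42)
	fmt.Printf("analysis[0] = %08b\n", analysis[0])
}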

The first sentence is about analysis, not execution. I changed it to:

Every BEGINSUB opcode position in the code marks the beginning of a new subroutine and the ending of the “previous” subroutine.

It would crash only if “no subroutine fallthrough” is applied. But I put the missing RETURNSUB in the example not to be distracted by this.

Yes, when you put it like this, I agree

I really am coming to prefer the approach of subroutines as pure mechanism. Just new control-flow operators, with no syntax and no changes to any other operators.

Unlimited SWAP and DUP instructions would probably be useful in this programming model.

Can the imposition of further syntax and invariants – for things like validation, merklization, and clean compilation – be handled with custom smart contracts and init code?

I have a variant of the proposal.

  • A subroutine may only start at a PC divisible by 32: 0, 32, 64, etc.
  • BEGINSUB, if executed at any other position, causes an error.
  • If a BEGINSUB is at PC % 32 == 0, then we say that a new subroutine starts there.

With these changes, the following effects happen:

  • We keep the old code/data bitmap (A). Size: codelen / 8.
  • We create one more bitmap (B) during jumpdest analysis. Bit 0 represents PC=0, bit 1 represents PC=32, and so on. If a new subroutine starts there, we place a 1 in the bitmap. Size of B is codelen / 32 / 8.

Now, whenever we need to check if a JUMP is valid, we do

  • Check A as before,
  • Check LOC is JUMPDEST as before,
  • Seek through B from PC to LOC (a sketch of this seek follows below). Each bit checked in this seek effectively covers 32 bytes of code. On a 64-bit platform, large jumps can be checked 64 bits at a time, effectively covering 32*64 bytes of code per check, so a jump across a gigantic 1MB subroutine can be validated with ~490 checks.
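
A minimal Go sketch of the seek through B, assuming one bit per 32-byte boundary (the names are mine, and a real implementation would compare whole 64-bit words for large spans):

package main

import "fmt"

// jumpAllowed performs the third check from the list above: bitmap b has one
// bit per 32-byte code boundary, set when a subroutine begins there. Any set
// bit between pc (exclusive) and the jump target loc (inclusive) means the
// jump would cross or land on a subroutine boundary.
func jumpAllowed(b []uint64, pc, loc uint64) bool {
	lo, hi := pc, loc
	if lo > hi {
		lo, hi = hi, lo
	}
	first := lo/32 + 1 // first boundary strictly after lo
	last := hi / 32    // last boundary at or before hi
	for i := first; i <= last; i++ {
		if b[i/64]&(1<<(i%64)) != 0 {
			return false
		}
	}
	return true
}

func main() {
	b := make([]uint64, 2) // covers 128 boundaries = 4 KiB of code
	b[0] |= 1 << 3         // a subroutine begins at PC 96
	fmt.Println(jumpAllowed(b, 0, 64))  // true: no boundary crossed
	fmt.Println(jumpAllowed(b, 0, 100)) // false: crosses the BEGINSUB at 96
}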

This means that compilers will have to be smart about allocating the subroutines, and use some padding here and there, and maybe inline small stuff instead of subroutining it.

For code merklization, that might be a good thing too, since it puts a floor on the size of a code chunk.

Furthermore:

  • this decreases the chance of causing backward compatibility problems: any BEGINSUB op not at a 32-byte boundary is still ‘just an invalid opcode’, and the only case that would cause problems would be
    • a jump across a data section where a BEGINSUB happens to be at PC % 32 == 0.

PS: The choice of 32 is pretty arbitrary.

I still expect (without proof) that EIP-615’s restrictions can be validated on the simpler instruction set without restrictions.

And yes, that might make the LLVM output harder work. Current clients are interpreters, so aren’t in trouble.

Some preliminary results about performing jumpdest analysis and executing JUMPs.

Building the jumpdest analysis(es) required:

Old implementation versus new, where we also need to build the secondary smaller bitmap. It incurs an extra cost in all cases:

JumpdestAnalysis_1200k/zeroes-6          801µs ± 2%  1981µs ± 1%  +147.15%  (p=0.016 n=5+4)
JumpdestAnalysis_1200k/jumpdests-6       803µs ± 1%  1958µs ± 3%  +143.75%  (p=0.016 n=4+5)
JumpdestAnalysis_1200k/push32-6          397µs ± 2%   462µs ± 2%   +16.58%  (p=0.008 n=5+5)
JumpdestAnalysis_1200k/push1-6          1.85ms ± 0%  1.94ms ± 2%    +5.08%  (p=0.016 n=4+5)
JumpdestAnalysis_1200k/beginsub-6       1.56ms ± 7%  2.09ms ± 4%   +33.50%  (p=0.008 n=5+5)
JumpdestAnalysis_1200k/beginsub_push-6  1.88ms ± 2%  1.97ms ± 2%    +4.71%  (p=0.008 n=5+5)

Code is 1.2 Mb.
zeroes: zero-filled code
jumpdests: all jumpdest
push32 : all push32
push1: all push1
beginsub: all beginsub
beginsub_push: all push1, but every 32nd is a BEGINSUB

Executing a JUMP

This benchmark evaluates the validity of a JUMP across the entire body of code, from 0 to the end (1.2 Mb).
Code consists of all JUMPDEST.

BenchmarkJumpdestValidation/jumpdests-6         	151248043	         7.57 ns/op
BenchmarkJumpdestValidation/jumpdests-with-subroutines-6         	   25620	     46575 ns/op

The ‘old’ way is very fast, at sub-8ns, and does not depend on the size of the jump nor the size of the code.
The chosen code is a worst case for the seek-based approach, since there are no
subroutines to abort the check early. It takes 46µs. In this case, I have not optimized it to check
64 bits at a time, but if we assume that it can be optimized by a factor of 64 for large jumps, that would
still land at 727ns, which is almost a factor of 100 slower than previously.

Note, though, that for a ‘regular’ contract, limited to 24K, the figures look different:

BenchmarkJumpdestValidation/jumpdests-6         	154333724	         7.59 ns/op
BenchmarkJumpdestValidation/jumpdests-with-subroutines-6         	 1000000	      1070 ns/op

If we apply the same factor-of-64 divisor, we wind up with 7.6ns vs 16.7ns, which is much more reasonable.

This indeed makes the potential number of subroutines manageable.

I can think of an alternative implementation of the subroutine map (B).

  • We keep an array []uint32 of subroutine sizes at valid PC values (0, 32, 64, …).
  • The subroutine size includes the BEGINSUB, so the size of a subroutine cannot be 0.
  • A zero value means there is no subroutine at the given PC.
  • The size of B is codelen/8 (because codelen/32 * (32/8))

The JUMPSUB validation:

  • Check LOC % 32 == 0,
  • Check B[LOC / 32] > 0,
  • Remember current subroutine span:
    • current_subroutine_begin = LOC
    • current_subroutine_end = current_subroutine_begin + B[LOC / 32]

The JUMP validation:

  • Check if LOC is JUMPDEST (as pre-EIP)
  • Check current_subroutine_begin <= LOC < current_subroutine_end

There are also other possible encodings of this variant of B:

  • Use uint16 but divide the subroutine size by 32. Then the subroutine size limit is 2**16 * 32 == 2M.
  • Use uint8 and variadic length number encoding like LEB128.

I would consider factor values: 16, 32, 64.
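
A minimal Go sketch of the []uint32 variant and the two checks above (names are illustrative, not taken from any client):

package main

import "fmt"

// sizes[i] holds the length of the subroutine starting at PC == 32*i,
// including its BEGINSUB; zero means no subroutine starts there.
type subMap struct {
	sizes []uint32
}

// jumpsubTarget validates a JUMPSUB to loc and returns the span of the
// entered subroutine.
func (m *subMap) jumpsubTarget(loc uint32) (begin, end uint32, ok bool) {
	if loc%32 != 0 {
		return 0, 0, false
	}
	size := m.sizes[loc/32]
	if size == 0 {
		return 0, 0, false
	}
	return loc, loc + size, true
}

// jumpValid validates a plain JUMP against the current subroutine span.
// The JUMPDEST check against the ordinary code/data bitmap is assumed to
// have happened already.
func jumpValid(begin, end, loc uint32) bool {
	return begin <= loc && loc < end
}

func main() {
	m := &subMap{sizes: make([]uint32, 4)} // 128 bytes of code
	m.sizes[1] = 40                        // subroutine of 40 bytes at PC 32
	begin, end, ok := m.jumpsubTarget(32)
	fmt.Println(ok, jumpValid(begin, end, 64)) // true true (64 < 72)
}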


With Pawel's suggested algo (the uint16 version), I get this (against the current master implementation):

name                                    old time/op  new time/op  delta
JumpdestAnalysis_1200k/zeroes-6          842µs ±11%  2594µs ± 0%  +208.00%  (p=0.016 n=5+4)
JumpdestAnalysis_1200k/jumpdests-6       885µs ±21%  2612µs ± 1%  +194.98%  (p=0.008 n=5+5)
JumpdestAnalysis_1200k/push32-6          420µs ±12%   453µs ±23%      ~     (p=0.222 n=5+5)
JumpdestAnalysis_1200k/push1-6          2.00ms ± 5%  1.86ms ±41%      ~     (p=0.151 n=5+5)
JumpdestAnalysis_1200k/beginsub-6       1.68ms ± 6%  2.67ms ± 1%   +59.20%  (p=0.016 n=5+4)
JumpdestAnalysis_1200k/beginsub_push-6  1.99ms ±10%  1.63ms ± 7%   -18.19%  (p=0.008 n=5+5)

The worst case (beginsub) is roughly on par with the seek-algo.
It’s a lot better on /beginsub_push than the seek-algo, since it’s faster to set a byte than to flip a bit (I guess?).

I am not going to do an evaluation of the speed of validating the JUMP: it’s obviously O(1), and can’t be much slower
than the current jump-instructions.

Code is here btw: https://github.com/holiman/go-ethereum/tree/subroutines2

It would really be nice if we could have this EIP without the restriction on where to start a subroutine. Note that this EIP does not bring a very big benefit for compilers. If the subroutine does not have many return values, it might even be better to use a regular jump instead of wasting 16 bytes on average when using the subroutine mechanism.

I think it should be possible to extend the current analysis to also allow beginsubs at arbitrary positions, the following might be a start:

We have a bitmap where bit i is 1 if and only if there is a beginsub opcode somewhere between PC 32 * i and 32 * (i + 1) - 1. With this data structure, we essentially do the analysis proposed by @holiman, just with an additional linear search inside the 32 bytes of code (problem to solve: pushdata):

The subroutine size data structure proposed by @chfast can be similarly modified: For each chunk of 32 bytes, we store the (largest) number of following chunks that do not contain a beginsub opcode.

For a jumpsub, we check if the target is a valid beginsub by first looking at the 32-byte-granular beginsub bitmap. If there is no match, we OOG. If there is a match, we check the code itself.

For jumps, we keep track of the PC of the beginsub of the current subroutine, let this be O. Let E be the number of 32-byte chunks following the chunk of O that do not contain a beginsub opcode, as read from the data structure.

For a jump with target T, we check if T is less than O and OOG if yes. Let C be floor(T / 32), the target chunk. If C = floor(O / 32), we perform a linear search for beginsub on the at most 32 bytes of the chunk and OOG if we find one. If C > floor(O / 32) + E, we OOG. If C = floor(O / 32) + E, we do a linear search on the 32 bytes of chunk C and OOG if and only if there is a beginsub between the beginning of the chunk and T.
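
A minimal Go sketch of both the modified size data structure and the jump check just described (the helper names and the BEGINSUB opcode value are assumptions for illustration; pushdata handling is ignored):

package main

import "fmt"

const BEGINSUB = 0x5c // opcode value assumed for illustration

// freeChunksAfter builds, for each 32-byte chunk, the number of following
// chunks that contain no BEGINSUB opcode.
func freeChunksAfter(code []byte) []int {
	chunks := (len(code) + 31) / 32
	hasBeginsub := make([]bool, chunks)
	for pc, op := range code {
		if op == BEGINSUB {
			hasBeginsub[pc/32] = true
		}
	}
	free := make([]int, chunks)
	for i := chunks - 2; i >= 0; i-- {
		if !hasBeginsub[i+1] {
			free[i] = free[i+1] + 1
		}
	}
	return free
}

// beginsubBetween linearly scans code[from:to] for a BEGINSUB.
func beginsubBetween(code []byte, from, to int) bool {
	for pc := from; pc < to; pc++ {
		if code[pc] == BEGINSUB {
			return true
		}
	}
	return false
}

// jumpValid checks a JUMP to target t from the subroutine whose BEGINSUB is
// at o, using the per-chunk counts in free. The ordinary JUMPDEST check is
// assumed to happen separately.
func jumpValid(code []byte, free []int, o, t int) bool {
	if t < o {
		return false
	}
	co, ct := o/32, t/32
	e := free[co]
	switch {
	case ct == co:
		// Same chunk: search strictly after o, up to t, so the subroutine's
		// own BEGINSUB does not trigger a false positive.
		return !beginsubBetween(code, o+1, t)
	case ct > co+e:
		return false
	case ct == co+e:
		// Boundary chunk: only a BEGINSUB before t matters.
		return !beginsubBetween(code, ct*32, t)
	default:
		return true
	}
}

func main() {
	code := make([]byte, 128)
	code[0], code[96] = BEGINSUB, BEGINSUB
	free := freeChunksAfter(code)              // [2 1 0 0]
	fmt.Println(jumpValid(code, free, 0, 80))  // true: target in chunk 2
	fmt.Println(jumpValid(code, free, 0, 100)) // false: crosses the BEGINSUB at 96
}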

I have the feeling that there should already be generic data structures and algorithms that can answer the question “does the array slice from x to y contain a zero” and that we can use without many modifications…

Hm, I’m a bit sceptical about this, tbh. The original EIP added one more data structure: the return stack. With this suggestion plus Pawel's implementation idea, we will need one more: a stack of ‘current subroutine offsets’, to remove the linear factor from evaluating a JUMP.

With the idea proposed by @chriseth, we’d add back a penalty to the analysis itself, and add a penalty to the jump evaluation. A JUMP costs 8 gas, so it needs to be pretty slim in order to not be exploitable.

I mean, it might be doable. But is the 32-byte offset really that much of a problem?

@holiman @chfast maybe we should have a call about this. I do not see in which way my idea adds a penalty that is not present in Pawel’s. Also note that “current subroutine offset” can be removed from my idea by creating a reverse “subroutine sizes” data structure in parallel to the regular one.

If we design the subroutine call such that the return address is seen as the first operand of the callee function (because it is very easy to calculate the offset of the return value), we will need some stack manipulations to bring the effective operands (the actual operands of the callee function) to the top of the stack for computation. These are hidden overheads if the compiler is implemented that way. Overall, this EIP reduces the pressure on the stack (by removing one element from it). The scheduler should be able to generate better instruction sequences by reducing the number of stack manipulation opcodes needed.

Or, if you implement it in such a way that the return address value is seen as the last operand of the function, then this EIP's performance impact will be reduced.

I think this is exactly the right solution. Have any unfixed conceptual problems been found? How hard is it to implement?

[quote=“gcolvin, post:30, topic:4229, full:true”]
I think this is exactly the right solution. Have any unfixed conceptual problems been found? How hard is it to implement?
[/quote]

To be clear, the section that you refer to is “prevent walking into a subroutine”. This is pretty trivial to implement.

So, does that mean you are not in favour of the “prevent jumping into another subroutine” restriction? That one is the more difficult one, but one which I think is fully doable, both from an EVM perspective and from a backwards-compatibility perspective.

I am not favoring more than the minimal “prevent walking into another subroutine”. It’s a clear improvement that presents no difficulties. Beyond that I’m much less convinced, as I’ve stated at intervals above.