EIP-2315 "Simple Subroutines for the EVM" - Analysis

I have a variant of the proposal.

  • A subroutine may only start at PCs divisible by 32: 0, 32, 64, etc.
  • A BEGINSUB executed at any other position causes an error.
  • If a BEGINSUB sits at a PC where PC % 32 == 0, we say that a new subroutine starts there.

With these changes, the following effects happen:

  • We keep the old code/data bitmap. Size: codelen / 8. (A)
  • We create one more bitmap during jumpdest analysis: bit 0 represents PC = 0, bit 1 represents PC = 32, and so on. If a new subroutine starts there, we set a 1 in the bitmap. (B) Size of B: codelen / 32 / 8.
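Roughly, in Go, the one-pass analysis could look something like this (a sketch only; the function name, opcode constants and push handling are my own assumptions, not the actual go-ethereum code):

// Sketch (not the go-ethereum implementation): build the old code/data bitmap (A),
// marking push-data bytes, and the new subroutine bitmap (B), marking 32-byte
// chunks whose first byte is a BEGINSUB.
const (
	opBeginSub byte = 0x5c // BEGINSUB opcode value from the EIP-2315 draft
	opPush1    byte = 0x60
	opPush32   byte = 0x7f
)

func analyze(code []byte) (bitmapA, bitmapB []byte) {
	bitmapA = make([]byte, (len(code)+7)/8)    // one bit per byte of code
	bitmapB = make([]byte, (len(code)/32)/8+1) // one bit per 32-byte chunk

	for pc := 0; pc < len(code); pc++ {
		op := code[pc]
		if op >= opPush1 && op <= opPush32 {
			n := int(op-opPush1) + 1
			// Mark the push-data bytes as data in A, and skip over them so
			// a BEGINSUB byte inside push data is never treated as code.
			for i := pc + 1; i <= pc+n && i < len(code); i++ {
				bitmapA[i/8] |= 1 << uint(i%8)
			}
			pc += n
			continue
		}
		if op == opBeginSub && pc%32 == 0 {
			chunk := pc / 32
			bitmapB[chunk/8] |= 1 << uint(chunk%8)
		}
	}
	return bitmapA, bitmapB
}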

Now, whenever we need to check if a JUMP is valid, we do

  • Check A as before,
  • Check LOC is JUMPDEST as before,
  • Seek through B from PC to LOC. This seek effectively covers 32 bytes of code per check. On a 64-bit platform, large jumps can be checked 64 bits at a time, effectively covering 32 * 64 bytes of code per check, so a jump across a gigantic 1 MB subroutine can be performed with ~490 checks.
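A sketch of what the seek could look like (hypothetical helper name, byte-at-a-time; widening it to 64-bit words is the optimization mentioned above):

// Sketch: check whether a JUMP from pc to loc crosses a subroutine boundary
// by scanning bitmap B chunk by chunk. A real implementation could read B in
// 64-bit words so that large jumps cost one check per 32*64 bytes of code.
func crossesSubroutine(bitmapB []byte, pc, loc int) bool {
	lo, hi := pc, loc
	if lo > hi {
		lo, hi = hi, lo
	}
	// Examine every chunk boundary strictly after lo and at or before hi.
	for chunk := lo/32 + 1; chunk <= hi/32; chunk++ {
		if bitmapB[chunk/8]&(1<<uint(chunk%8)) != 0 {
			return true // a subroutine starts between pc and loc
		}
	}
	return false
}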

This means that compilers will have to be smart about allocating subroutines, use some padding here and there, and maybe inline small things instead of making them subroutines.

For code merklization, that might be a good thing too, since it puts a floor on the size of a code chunk.

Furthermore:

  • this decreases the chance of causing backward-compatibility problems: any non-aligned BEGINSUB op is still 'just an invalid opcode', and the only case that would cause problems would be
    • a jump across a data section where a BEGINSUB happens to sit at PC % 32 == 0.

PS: The choice of 32 is pretty arbitrary.

I still expect (without proof) that EIP-615's restrictions can be validated on the simpler instruction set without restrictions.

And yes, that might make generating the LLVM output harder work. Current clients are interpreters, so they aren't in trouble.

Some preliminary results on performing jumpdest analysis and executing JUMPs.

Building the jumpdest analysis(es) required:

Old implementation versus new, where we also need to build the secondary smaller bitmap. It incurs an extra cost in all cases:

JumpdestAnalysis_1200k/zeroes-6          801µs ± 2%  1981µs ± 1%  +147.15%  (p=0.016 n=5+4)
JumpdestAnalysis_1200k/jumpdests-6       803µs ± 1%  1958µs ± 3%  +143.75%  (p=0.016 n=4+5)
JumpdestAnalysis_1200k/push32-6          397µs ± 2%   462µs ± 2%   +16.58%  (p=0.008 n=5+5)
JumpdestAnalysis_1200k/push1-6          1.85ms ± 0%  1.94ms ± 2%    +5.08%  (p=0.016 n=4+5)
JumpdestAnalysis_1200k/beginsub-6       1.56ms ± 7%  2.09ms ± 4%   +33.50%  (p=0.008 n=5+5)
JumpdestAnalysis_1200k/beginsub_push-6  1.88ms ± 2%  1.97ms ± 2%    +4.71%  (p=0.008 n=5+5)

Code is 1.2 MB.
zeroes: zero-filled code
jumpdests: all JUMPDEST
push32: all PUSH32
push1: all PUSH1
beginsub: all BEGINSUB
beginsub_push: all PUSH1, but every 32nd byte is a BEGINSUB

Executing a JUMP

This benchmark evaluates the validity of a JUMP across the entire body of code, from 0 to the end (1.2 MB).
Code consists of all JUMPDEST.

BenchmarkJumpdestValidation/jumpdests-6         	151248043	         7.57 ns/op
BenchmarkJumpdestValidation/jumpdests-with-subroutines-6         	   25620	     46575 ns/op

The 'old' way is very fast, at sub-8ns, and does not depend on the size of the jump nor the size of the code.
The code chosen is pathological for the seek-based approach, since there are no
subroutines to abort the check early. It takes 46µs. In this case, I have not optimized it to check
64 bits at a time, but if we assume that it can be optimized by a factor of 64 for large jumps, that would
still land at 727ns, which is almost a factor 100 slower than previously.

Note, though, that for a 'regular' contract, limited to 24K, the figures look different:

BenchmarkJumpdestValidation/jumpdests-6         	154333724	         7.59 ns/op
BenchmarkJumpdestValidation/jumpdests-with-subroutines-6         	 1000000	      1070 ns/op

If we apply the same 64 divisor, we wind up with 7.6ns vs 16.7ns, which is much more reasonable.

This indeed makes the potential number of subroutines manageable.

I can think of an alternative implementation of the subroutine map (B).

  • We keep an array []uint32 of subroutine sizes at valid PC values (0, 32, 64, …).
  • The subroutine size includes the BEGINSUB, so the size of a subroutine cannot be 0.
  • A zero value means there is no subroutine at the given PC.
  • The size of B is codelen / 8 (codelen / 32 entries of 4 bytes each).
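A possible construction of that array, as a sketch (function name is an assumption, the 0x5c opcode value comes from the draft, and push data is ignored for brevity):

// Sketch: build the []uint32 variant of B. sizes[i] is the length in bytes of
// the subroutine starting at PC = i*32, or 0 if no subroutine starts there.
func buildSubroutineSizes(code []byte) []uint32 {
	sizes := make([]uint32, len(code)/32+1)
	var starts []int
	for pc := 0; pc < len(code); pc += 32 {
		if code[pc] == 0x5c { // BEGINSUB (draft opcode value)
			starts = append(starts, pc)
		}
	}
	// Each subroutine runs from its BEGINSUB to the next aligned BEGINSUB,
	// or to the end of the code, so its size is always at least 1.
	for i, start := range starts {
		end := len(code)
		if i+1 < len(starts) {
			end = starts[i+1]
		}
		sizes[start/32] = uint32(end - start)
	}
	return sizes
}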

The JUMPSUB validation:

  • Check LOC % 32 == 0,
  • Check B[LOC / 32] > 0,
  • Remember current subroutine span:
    • current_subroutine_begin = LOC
    • current_subroutine_end = current_subroutine_begin + B[LOC / 32]

The JUMP validation:

  • Check if LOC is JUMPDEST (as pre-EIP)
  • Check current_subroutine_begin <= LOC < current_subroutine_end
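Both checks then become O(1); a sketch (type and function names are made up):

// Sketch: O(1) validation against the size array. The interpreter keeps track
// of the span of the subroutine it is currently executing in.
type subScope struct {
	begin, end uint64 // current subroutine span, end exclusive
}

// validJumpsub checks a JUMPSUB to loc and returns the new subroutine scope.
func validJumpsub(sizes []uint32, loc uint64) (subScope, bool) {
	if loc%32 != 0 {
		return subScope{}, false
	}
	size := uint64(sizes[loc/32])
	if size == 0 {
		return subScope{}, false // no subroutine starts at loc
	}
	return subScope{begin: loc, end: loc + size}, true
}

// validJump checks a plain JUMP to loc. The JUMPDEST check against the old
// code/data bitmap is done separately, as pre-EIP.
func validJump(scope subScope, loc uint64) bool {
	return scope.begin <= loc && loc < scope.end
}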

There are also other possible encodings of this variant of B:

  • Use uint16 but divide the subroutine size by 32. Then the maximum subroutine size is 2**16 * 32 == 2M.
  • Use uint8 and a variable-length number encoding like LEB128.

I would consider factor values: 16, 32, 64.
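For the LEB128 option above, a minimal unsigned-LEB128 encoder as an illustration (the textbook algorithm, not taken from any client code; whether the space saving is worth the decode cost is an open question):

// Sketch: unsigned LEB128, which could encode subroutine sizes compactly.
// Small sizes fit in one byte; sizes up to 16383 take two bytes, and so on.
func encodeULEB128(v uint64) []byte {
	var out []byte
	for {
		b := byte(v & 0x7f)
		v >>= 7
		if v != 0 {
			b |= 0x80 // high bit set: more bytes follow
		}
		out = append(out, b)
		if v == 0 {
			return out
		}
	}
}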


With Pawel's suggested algo (the uint16 version), I get this (against the current master implementation):

name                                    old time/op  new time/op  delta
JumpdestAnalysis_1200k/zeroes-6          842µs ±11%  2594µs ± 0%  +208.00%  (p=0.016 n=5+4)
JumpdestAnalysis_1200k/jumpdests-6       885µs ±21%  2612µs ± 1%  +194.98%  (p=0.008 n=5+5)
JumpdestAnalysis_1200k/push32-6          420µs ±12%   453µs ±23%      ~     (p=0.222 n=5+5)
JumpdestAnalysis_1200k/push1-6          2.00ms ± 5%  1.86ms ±41%      ~     (p=0.151 n=5+5)
JumpdestAnalysis_1200k/beginsub-6       1.68ms ± 6%  2.67ms ± 1%   +59.20%  (p=0.016 n=5+4)
JumpdestAnalysis_1200k/beginsub_push-6  1.99ms ±10%  1.63ms ± 7%   -18.19%  (p=0.008 n=5+5)

The worst case (beginsub) is roughly on par with the seek algo.
It's a lot better on /beginsub_push than the seek algo, since it's faster to set a byte than to flip a bit (I guess?).

I am not going to evaluate the speed of validating the JUMP: it's obviously O(1), and can't be much slower
than the current jump instructions.

Code is here btw: https://github.com/holiman/go-ethereum/tree/subroutines2

It would really be nice if we could have this EIP without the restriction on where to start a subroutine. Note that this EIP does not bring a very big benefit for compilers. If the subroutine does not have many return values, it might even be better to use a regular jump instead of wasting 16 bytes on average when using the subroutine mechanism.

I think it should be possible to extend the current analysis to also allow beginsubs at arbitrary positions, the following might be a start:

We have a bitmap where bit i is 1 if and only if there is a beginsub opcode somewhere between PC 32 * i and 32 * (i + 1) - 1. With this data structure, we essentially do the analysis proposed by @holiman, just with an additional linear search inside the 32 bytes of code (problem to solve: pushdata).
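A sketch of that coarse bitmap and the in-chunk scan (the function names are my own, and the pushdata problem is left unsolved, as noted above):

// Sketch: coarse bitmap where bit i is set iff chunk [32*i, 32*(i+1)) contains
// a BEGINSUB anywhere. The pushdata problem is deliberately ignored here:
// a 0x5c byte inside push data would also set the bit.
func buildChunkBitmap(code []byte) []byte {
	bm := make([]byte, len(code)/32/8+1)
	for pc, op := range code {
		if op == 0x5c { // BEGINSUB (draft opcode value)
			chunk := pc / 32
			bm[chunk/8] |= 1 << uint(chunk%8)
		}
	}
	return bm
}

// hasBeginSubIn linearly scans code[from:to] (at most 32 bytes) for a BEGINSUB.
func hasBeginSubIn(code []byte, from, to int) bool {
	for pc := from; pc < to && pc < len(code); pc++ {
		if code[pc] == 0x5c {
			return true
		}
	}
	return false
}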

The subroutine size data structure proposed by @chfast can be similarly modified: For each chunk of 32 bytes, we store the (largest) number of following chunks that do not contain a beginsub opcode.

For a jumpsub, we check if the target is a valid beginsub by first looking at the 32-byte-granular beginsub bitmap. If there is no match, we OOG. If there is a match, we check the code itself.

For jumps, we keep track of the PC of the beginsub of the current subroutine, let this be O. Let E be the number of 32-byte chunks following the chunk of O that do not contain a beginsub opcode, as read from the data structure.

For a jump with target T, we check if T is less than O and OOG if yes. Let C be floor(T / 32), the target chunk. If C = floor(O / 32), we perform a linear search for beginsub on the at most 32 bytes of the chunk and OOG if we find one. If C > floor(O / 32) + E, we OOG. If C = floor(O / 32) + E, we do a linear search on the 32 bytes of chunk C and OOG if and only if there is a beginsub between the beginning of the chunk and T.
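Putting the above together, a sketch of the jump check (chunkGap stands in for the "number of following BEGINSUB-free chunks" structure and hasBeginSubIn for the in-chunk scan; both names are hypothetical, and whether the same-chunk scan should start at O or O+1 is a detail this sketch glosses over):

// Sketch of the JUMP check described above. o is the PC of the current
// subroutine's BEGINSUB, chunkGap[c] is the number of chunks following chunk c
// that contain no BEGINSUB, and hasBeginSubIn scans at most 32 bytes of code.
func jumpAllowed(code []byte, chunkGap []int, o, t int) bool {
	if t < o {
		return false // target lies before the current subroutine: OOG
	}
	oc, c := o/32, t/32
	e := chunkGap[oc]
	switch {
	case c == oc:
		// Same chunk: no BEGINSUB may lie between o and t.
		return !hasBeginSubIn(code, o+1, t+1)
	case c > oc+e:
		return false // target is beyond the first chunk containing a BEGINSUB
	case c == oc+e:
		// Boundary chunk: OOG iff a BEGINSUB sits between its start and t.
		return !hasBeginSubIn(code, c*32, t+1)
	default:
		return true // target is strictly inside the BEGINSUB-free run
	}
}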

I have the feeling that there should already be generic data structures and algorithms that can answer the question "does the array slice from x to y contain a zero" and that we can use without many modifications...

Hm, I'm a bit sceptical about this, tbh. The original EIP added one more data structure: the return stack. With this suggestion plus Pawel's implementation idea, we would need one more: a stack of 'current subroutine offset', to remove the linear factor from evaluating a JUMP.

With the idea proposed by @chriseth, we'd add back a penalty to the analysis itself, and add a penalty to the jump evaluation. A JUMP costs 8 gas, so it needs to be pretty slim in order to not be exploitable.

I mean, it might be doable. But is the 32-byte offset really that much of a problem?

@holiman @chfast maybe we should have a call about this. I do not see in which way my idea adds a penalty that is not present in Pawel's. Also note that the "current subroutine offset" can be removed from my idea by creating a reverse "subroutine sizes" data structure in parallel to the regular one.

If we design the subroutine call such that the return address is seen as the first operand of the callee function (because it is very easy to calculate the offset of the return value), we will need some stack manipulations to bring the effective operands (the actual operands of the callee function) to the top of the stack before computing. These are hidden overheads if the compiler is implemented that way. This EIP overall reduces the pressure on the stack (by removing one element from it). The scheduler should be able to generate better instruction sequences by reducing the number of stack-manipulation opcodes needed.

Or, if you implement it in such a way that the return address value is seen as the last operand of the function, then this EIP's performance impact will be reduced.

I think this is exactly the right solution. Have any unfixed conceptual problems been found? How hard is it to implement?

[quote="gcolvin, post:30, topic:4229, full:true"]
I think this is exactly the right solution. Have any unfixed conceptual problems been found? How hard is it to implement?
[/quote]

To be clear, the section that you refer to is "prevent walking into subroutine". This is pretty trivial to implement.

So, does that mean you are not favouring "prevent jumping into another subroutine"? That one is the more difficult one, but I think it is fully doable, both from an EVM perspective and from a backwards-compatibility perspective.

I am not favoring more than the minimal "prevent walking into another subroutine". It's a clear improvement that presents no difficulties. Beyond that I'm much less convinced, as I've stated at intervals above.


@chriseth, @holiman,

Maybe that was missed, but I suggested to limit "create init" code to twice the deployed code size limit, i.e. 2 x 24576 == 49152. This should significantly bound the possible number of jumpdests and subroutines. Thoughts?

Yes, I think that's a very good suggestion. It should definitely help.

I'm editing the spec now. How would that go?

If this were to be applied to the network, I would really suggest making it a separate EIP, required by 2315; otherwise mixing different restrictions into a feature EIP sounds weird.

I could be convinced of the contrary, though.


What are the current thoughts about tail calls and e.g. @chfast's suggestion of TAILJUMPSUB? I've personally been wanting to implement tail-call optimizations in Solidity for a while (although I think there's no consensus about how beneficial they would be; however, potential more functional languages might suffer even more from lacking the possibility of such an optimization). So I'm wondering, since the current proposal reads as if that will no longer be possible when using subroutines.
(or wait: actually BEGINSUB JUMPDEST ... RETURNSUB and then jumping to that jumpdest from another subroutine would work? If so, that would encourage jumping into subroutines for this, which @axic wants to avoid; I need to read the proposal a bit more carefully)
EDIT: ok, reading the proposal a bit more carefully makes me realize that the only benefit tail-call optimizations would bring is saving one slot in the return_stack; apart from that, the combination of JUMPSUB RETURNSUB should do exactly what I want already. And the size of return_stack is probably a reasonable limitation, so further support for tail-call optimizations might really not be needed.

So far as preventing jumping into subroutines goes, my problem remains that it has no semantics. It's syntax, not mechanism, and the Rationale for this proposal is the least amount of mechanism needed to implement subroutines.

Throughout this discussion my conviction has grown stronger that syntactic restrictions should be no part of the specification of the opcodes.

  • Opcodes control state-machine transitions, and should be constrained only by the validity of the states.

This lack of syntax should pose no obstacles to compilers from high-level languages to EVM code; they can enforce whatever syntax they like. I understand that this matters when compiling from EVM to lower-level code, but I've generally seen compilers just "throw up their hands" and generate bad code when their input has too little structure. We can avoid this if we want...

We may later want to propose a separate EIP for a more-constrained EVM. It could enforce code structure at deployment time that is better suited for streaming compilation and other purposes. This would be along the lines of EIP-615, taking recent conversations and experience with EIP-2315 into account.