EIP-2315 Simple Subroutines for the EVM

Thanks for catching the opcodes out of order. I’ve fixed them and some other things. The overall results don’t change much.

JUMPDEST costs 1 gas regardless. It should probably be 0, but that only means that a tail-optimized JUMP-based routine is as cheap as an un-optimized JUMPSUB-based routine.

Yes, the current draft allows for walking into subroutines. Preventing it didn’t seem to provide safety worth sacrificing tail-call elimination. It doesn’t (yet?) allow jumping to subroutines, so I’ve fixed my example to use an extra gas for a leading JUMPDEST. I’m not sure that I can optimize things much more, but would be glad to be proven wrong.

More realistic examples would be good, but I don’t know the details of the Solidity code generator, and comparing maximally-optimized examples is as close to apples-to-apples as I know how to get.

Ok, allowing tail-call optimizations back definitely means that this does less harm to optimization than before. There are still head-call optimizations, but I’m not sure how significant they are.

Still, to account for actual real-world gas savings, one has to consider that all of these toy functions would of course be inlined in practice, because the call overhead far exceeds the cost of executing them in place. For that matter, even a function containing a single SSTORE would still be inlined - and if it weren’t, the savings would only be (20045-20029)/20045 = 0.08%.
I.e. control flow is only a fraction of the actual gas cost of a contract, so a fractional saving in control flow is minimal gain in practice.

So even if allowing tail-call optimizations back means this might in fact save a few gas on average, I’d still say that alone doesn’t warrant changing the EVM for it.

And allowing walking into or jumping to subroutines seems to me like it’s moving away from the independent goal of better analyzability (even though I’d still argue that none of this really makes a difference in analyzing anything anyways).

One more quick note about the tail call optimized example: I think 1+3+3+3+8+5+3+8 is 34 not 35 ;-).
Ah, no, 35 is correct, but it’s now missing a JUMPDEST.
But yeah, in general, I’d say there are a few more cases that one might want to look at; but given the change to the spec, I would no longer say that breaking optimization opportunities is a strong argument against this EIP.

@ekpyron I’m using a non-tail-call “toy function” that uses only 11 gas for computation to isolate the overhead of making calls from the overhead of making computations - I show that RETURNSUB improves gas use by 32%, and that an example that uses over 120 gas for computation is still 10% better. I also show that RETURNSUB gives superior tail call optimization. I don’t know if that is “realistic.”

Not all calls are tail calls, and not all tail-calls can be inlined without increasing contract size. So I don’t see that improving normal calls with a fairly easy (and already implemented) change to the EVM - adding a common facility - is a waste.

I’m not hard-set on making BEGINSUB execution invalid - tail-call optimization can be done either way - but code generators wouldn’t execute it by mistake, and humans hand-coding assembly should also be that skillful. And my analysis is that it doesn’t provide any extra safety, and neither does restricting “jumping into functions.”

I have in mind some of the same safety characteristics as EIP-615 – no stack underflows, no invalid jumps or invalid instructions, and linear-time, acyclic traversal of the control-flow graph. EIP-615 makes these guarantees difficult to violate and easy to enforce by eliminating dynamic jumps. EIP-2315 only makes it possible to avoid the use of dynamic jumps. Currently they are necessary because returns must be implemented by jumping dynamically to an address on the stack (or in memory), which RETURNSUB renders unnecessary.

Just to argue the point: you can still have calls like this:

function F(a) {
  G(a)
  ...[F]...
}
function G(a) {
  H(a)
  ...[G]...
}
function H(a) {
  ...[H]...
}

F(42)
...

in which I can pull out the heads, i.e. generate

ret					3
42					3
F					3
G					3
H					3
jump				8
ret:
...

F:
	jumpdest		1
	...[F]...
	jump			8
G:
	jumpdest		1
	...[G]...
	jump			8
H:
	jumpdest		1
	...[H]...
	jump			8

50 gas

compared to:

42					3
F					3
jumpsub				10
...

F:
	beginsub
	G				3
	jumpsub			10
	...[F]...
	returnsub		5
G:
	beginsub
	H				3
	jumpsub			10
	...[G]...
	returnsub		5
H:
	beginsub
	...[H]...
	returnsub		5

57 gas

and this I’m pretty sure you cannot beat using the subroutines, since it actually relies on the “dynamic” return jumps (which I’d not call dynamic, since they still always have fixed values for each call - they are just on the normal stack instead of hidden in the return stack).
That’s of course also highly artificial and other cases will behave differently. It’s possible to construct artificial cases that tip either way, if one really wants to. But I still maintain that gas savings alone are not a strong argument here.
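For reference, the two totals can be tallied mechanically. A sketch in Go, using the per-opcode costs annotated in the listings, and counting BEGINSUB at 0 to match the totals given:

```go
package main

import "fmt"

// Gas tallies for the two call sequences above: PUSH=3, JUMP=8,
// JUMPDEST=1, JUMPSUB=10, RETURNSUB=5, BEGINSUB counted at 0.
func jumpGas() int {
	return 3 + 3 + 3 + 3 + 3 + 8 + // push ret, 42, F, G, H; jump
		3*(1+8) // F, G, H bodies: JUMPDEST + closing JUMP each
}

func jumpsubGas() int {
	return 3 + 3 + 10 + // push 42, F; jumpsub
		2*(3+10+5) + // F and G: push target, jumpsub, returnsub
		5 // H: returnsub only
}

func main() {
	fmt.Println(jumpGas(), jumpsubGas()) // 50 57
}
```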

EIP-2315 doesn’t prevent you from using JUMP in the way that you have. But indeed it is a highly artificial example, not likely to be generated by Solidity or most any other compiler. (I don’t think you could write it that way in Wasm either.) So I’m trying to use examples that are not artificial - just one routine calling another.

The raw improvement in gas cost and the simplification of calling conventions seem obvious to me.

  • 47 gas, 11 instructions reduced to 31 gas, 8 instructions – for an optimal ordinary routine.
  • 32 gas, 7 instructions reduced to 29 gas, 7 instructions – for a tail-optimized routine.

Whether that is enough of a difference to make a difference depends on the program being compiled and how well it is optimized, which is not just a matter of the “typical” Solidity program.

Anyway, given the schedule of upcoming upgrades this EIP has no hope of going in for many months, and would be most valuable as part of a more integrated set of EIPs. Thus I’ve opened discussions at EVM evolution Working Group Formation - #3 by gcolvin.


It’s not that artificial actually - it wouldn’t be directly generated nor handwritten like that, but an optimizer can easily recognize the patterns of “PUSH(F) JUMP” and “F: [something cheap] JUMP” and partially inline it depending on the precise cost of “something cheap”, the number of such patterns occurring, etc. (which becomes more complicated with several competing calling conventions). The artificial part is rather that I specifically chose one such example that actually has the gas cost tip this way, while others will not :-).
But yeah, if this EIP has additional merit as natural part of an integrated set of EIPs things can easily look quite different in general!

It seems everyone has been assuming otherwise. Why are they wrong?

Why the 0x00ff?

Taking my own bait. Extending the turbo-geth code to handle JUMPSUB (original spec - to JUMPDESTs) I get the following sketch. I can’t see any problem – this jumpdest analysis should find and mark all of the immediate data for the PUSH* opcodes and any others we introduce, so at runtime these will not be considered valid destinations.

	for pc := 0; pc < len(code); {
		op := OpCode(code[pc])
		pc++
		if op >= PUSH1 && op <= PUSH32 {
			numbits := int(op - PUSH1 + 1) // count of immediate data bytes
			// ... set bits marking those bytes as data
			pc += numbits
		} else if op == JUMPSUB {
			// ... set bits for the 2-byte immediate destination
			pc += 2
		} // else if op == any other op taking immediate data
	}
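Filling in the elided parts, a minimal self-contained version of the sketch might look like this. The bitmap is a plain []bool rather than geth’s packed bitvec, and the opcode value for this immediate-taking JUMPSUB is an assumption for illustration:

```go
package main

import "fmt"

type OpCode byte

const (
	JUMPDEST OpCode = 0x5b
	JUMPSUB  OpCode = 0x5e // value assumed here; takes a 2-byte immediate
	PUSH1    OpCode = 0x60
	PUSH32   OpCode = 0x7f
)

// dataPositions marks every byte of code that is immediate data, so that
// at runtime those positions are never treated as valid jump destinations.
func dataPositions(code []byte) []bool {
	isData := make([]bool, len(code))
	for pc := 0; pc < len(code); {
		op := OpCode(code[pc])
		pc++
		imm := 0
		if op >= PUSH1 && op <= PUSH32 {
			imm = int(op-PUSH1) + 1
		} else if op == JUMPSUB {
			imm = 2
		} // else: no immediate data
		for i := 0; i < imm && pc < len(code); i++ {
			isData[pc] = true
			pc++
		}
	}
	return isData
}

func main() {
	// PUSH2 0x5b5b followed by a real JUMPDEST: only the last 0x5b is valid.
	fmt.Println(dataPositions([]byte{0x61, 0x5b, 0x5b, 0x5b})) // [false true true false]
}
```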

But way back over here I find an example of the problem I don’t really understand, from @gumb0

@chriseth put it as:

If the argument to a newly-introduced multi-byte opcode is the byte value of JUMPDEST, then this was a valid jump destination before the hard fork, but it is no longer a valid jump destination after the hard fork.

I think this can happen only where the code preceding the JUMPDEST wasn’t executable – at some point I could scan the blockchain to see whether the danger has ever manifested. This danger could be eliminated going forward by demanding that unreachable code be 0 (or INVALID, or something) so that new opcodes will never occur outside of immediate data.

And this is the problem with my analysis algorithm - it can’t tell whether it’s seeing old data or a new bytecode. Perhaps it can do better.

Not sure exactly what you are suggesting or how it helps.

What strikes me this morning is that if the problem is that the immediate data might look like a JUMPDEST, the answer is just to never use 0x5b == 0b01011011 as immediate data. That could be arranged by always setting the high bit of each immediate data byte to 1 and masking it away before use. That does reduce the number of bits for specifying immediate jumps to 14, and 2^14 is only 16384 – not enough to span an entire code array. We can either live with this restriction, use three bytes of immediate data, or use a UTF-8-style bit-ladder.
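A sketch of that high-bit scheme, with encodeOffset/decodeOffset as hypothetical helper names; since every immediate byte has its high bit forced on, neither byte can ever equal 0x5b (whose high bit is 0):

```go
package main

import "fmt"

// encodeOffset splits a 14-bit jump offset across two immediate bytes,
// each with its high bit forced to 1 so it cannot collide with JUMPDEST.
func encodeOffset(off uint16) [2]byte {
	if off >= 1<<14 {
		panic("offset exceeds 14 bits")
	}
	return [2]byte{0x80 | byte(off>>7), 0x80 | byte(off&0x7f)}
}

// decodeOffset masks the forced high bits away before use.
func decodeOffset(b [2]byte) uint16 {
	return uint16(b[0]&0x7f)<<7 | uint16(b[1]&0x7f)
}

func main() {
	b := encodeOffset(0x1b5b) // any 14-bit value round-trips
	fmt.Printf("%#x %#x -> %#x\n", b[0], b[1], decodeOffset(b))
}
```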

The remaining problem would be a JUMPDEST occurring before the new opcode, which in the past would have aborted with an error, but instead would execute. I think this is just another case of the fact that adding a bytecode always risks causing an invalid program to become valid - a risk we have taken many times before.

A few months down the road and I still like this idea. I suspect some base-64 variation on VLQ Variable-length quantity - Wikipedia would be best, setting bit 7 on to mark immediate data bytes (and prevent collisions with JUMPDEST), using bit 6 as a continuation bit, and the remaining 6 bits for data. (Various schemes can be used for negative numbers, including using the leading data bit as a sign bit and using 2’s complement.) It’s backwards-compatible, not needing any sort of versioning. And it’s future-proof, allowing for the size of the bytecode addressed to increase over time. It does require more decoding time than a fixed two-byte offset. It may use more or less space, depending on whether the offset requires more than 12 or less than 7 bits to represent.
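A sketch of such a bit-ladder, with encodeVLQ/decodeVLQ as hypothetical helper names, handling unsigned offsets only for simplicity:

```go
package main

import "fmt"

// encodeVLQ emits a VLQ-style immediate: every byte has bit 7 set (so it
// can never equal 0x5b / JUMPDEST), bit 6 is a continuation flag, and the
// low 6 bits carry data, most-significant group first.
func encodeVLQ(v uint64) []byte {
	// Collect 6-bit groups, least significant first.
	groups := []byte{byte(v & 0x3f)}
	for v >>= 6; v != 0; v >>= 6 {
		groups = append(groups, byte(v&0x3f))
	}
	// Emit most significant first; set bit 7 on every byte and bit 6
	// (the continuation flag) on every byte except the last.
	out := make([]byte, 0, len(groups))
	for i := len(groups) - 1; i >= 0; i-- {
		b := 0x80 | groups[i]
		if i > 0 {
			b |= 0x40
		}
		out = append(out, b)
	}
	return out
}

// decodeVLQ reads one encoded quantity, returning it and the byte count.
func decodeVLQ(data []byte) (v uint64, n int) {
	for n < len(data) {
		b := data[n]
		n++
		v = v<<6 | uint64(b&0x3f)
		if b&0x40 == 0 { // continuation flag clear: last byte
			break
		}
	}
	return v, n
}

func main() {
	enc := encodeVLQ(16384) // needs three bytes in this scheme
	v, n := decodeVLQ(enc)
	fmt.Printf("% x -> %d (%d bytes)\n", enc, v, n)
}
```

It may use more or fewer bytes than a fixed two-byte immediate, as noted: one byte up to 6 bits of offset, two bytes up to 12, three up to 18.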

And a little further down the line, and I’ve moved the validation code to its own PR, as I realized it does not depend on these new opcodes, only on deprecating dynamic use of JUMP and JUMPI.

Since this EIP is up for discussion for Shanghai again, maybe it’s good to clarify the following to avoid misunderstandings:

It is still my position (and I suspect this is shared by the Solidity team in general) that gas savings cannot serve as justification for this EIP, and that the gas analysis in the EIP is a bit misleading. For example, the subroutine examples use RJUMP, which is per se independent of this EIP and would also reduce the gas costs of the non-subroutine versions if used there; the swap for the return address can be nicely integrated into the overall stack layout of more complex functions, in which a globally swap-free stack layout cannot be achieved most of the time anyways, so this is no major issue for us; etc. But I actually don’t think there is a lot of merit in arguing these points in detail. More importantly, any gas savings this EIP may theoretically bring will be insignificant in practice, since internal function calls occur at very low frequency in real-world Solidity-generated bytecode (among other reasons due to inlining). So the overall real-world gas advantage of subroutines (for Solidity-generated bytecode) will be negligible in practice.

That is not to say that I oppose this EIP. I just want to make sure that if it is considered, that it is considered for the right reasons. In combination with EIP 4200, the subroutines of this EIP can serve to completely eliminate the need for dynamic jumps, which may have significant merit independently of gas savings - but this merit has to be argued for independently and different approaches like EIP 4750 have to be considered. And such arguments and considerations will be largely independent of the use in Solidity (resp. of any theoretical gas savings).

And also: sorry if I have missed or misunderstood any particular developments - we have quite a high workload these days and I did not find the time to review the updated EIP in all detail, but maybe the general sentiment expressed above is helpful.

Thanks for your helpful review.

The gas savings I show are very large for low level code, and I don’t think it’s at all irrelevant or misleading to show that. Quite the opposite.

And I don’t think the tail-optimization example I give is at all misleading. I’m properly comparing using the subroutines and static jumps that I require here – both RJUMP and RJUMPSUB – to the current practice of using dynamic JUMP. (Note that EIP-2315 requires EIP-4200).

It’s good that inlining Solidity code helps. And it’s interesting that Solidity does not usually generate many small routines from the code it’s usually given. (That may change given these instructions.) But Solidity is not the only compiler, and people do hand-code assembly when performance is critical - even when mostly writing Solidity.

As this proposal states, it’s the whole package of EIP-2315, EIP-4200, and EIP-3779 that provides for a complete, static, and verifiably safe control-flow facility. But even without EIP-3779 I do think my arguments for performance improvements are valid, and that improvements as large as I report are well worth it for the sort of code that needs them. It’s the critical code I’m targeting here.

Note: EIP-4750 is a very new PR that I just now took a glance at. Off-hand it looks to have excess runtime overhead – larger return-stack slots to support runtime checks for stack height – that EIP-2315 moves to validation time. And by restricting code sections to a single function it makes optimizations more difficult. I might be misreading, will study it more later.


I would not expect that looking at hand-written assembly code for performance-critical situations will yield significantly different results in that regard. Any function call - no matter the calling convention - will incur overhead, and especially in hand-optimized, performance-critical code I would very rarely expect to find dense calls, since especially in such situations inlining (whether done by automatic optimization or by hand) will in general result in lower overall gas cost. But yeah: really all I want to say is that I suggest putting more focus on the other aspects of this EIP (and its combination with the others) rather than the gas savings for its evaluation.

I don’t fully follow you here. I agree performance is not the only argument, but I do think it’s a stronger one than you give it credit for :wink:

Turing designed his calling convention to be cheap enough that inlining was less often necessary – inlining to avoid an RJUMPSUB/RETURNSUB would save 8 gas, compared to saving about 26 gas for the current approach. That moves the line for the speed/space tradeoff a fair distance.

It’s generally good software engineering to keep subroutines small and reusable, so making them a lot cheaper to call has to be a win.


Of course you always want to have small and reusable subroutines initially. That’s why even low-level code is not written directly as opcodes, but rather e.g. in Yul, in which you still have functions you can subject to inlining by optimization. In case you don’t want to run any optimization, but optimize solely by hand, you usually optimize for every single piece of gas, so the size of the function call overhead won’t matter, you will still inline by hand.
What I’m saying is that the usual case is to have few and relatively large functions in deployed bytecode. If somebody wants to refute that: Solidity annotates jumps into and out of functions, so for any deployed contract with sources available it should be straightforward to calculate (a good approximation of) the fraction of overall gas cost spent on function calls (this applies to both Solidity-generated code and inline assembly, i.e. the “hand-optimized” cases).
Or to put it differently: most meaningful transactions will involve at least one state change, so 20000 gas. Even if you save 20 gas per function call, you need hundreds of calls to make a difference that comes even close to the percentages in the EIP’s gas analysis.
Also the main advantage of inlining itself often is not really saving the function call overhead, but rather allowing to fuse and optimize the code further based on the concrete arguments and surroundings.
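That back-of-the-envelope arithmetic can be made concrete. A sketch assuming 25 gas per call today, 20 gas saved per call, and a single 20000-gas SSTORE as the only other cost of the transaction:

```go
package main

import "fmt"

// relativeSaving returns the fraction of total transaction gas saved by
// making n internal calls 20 gas cheaper each (25 -> 5), against a
// baseline of one 20000-gas SSTORE and nothing else.
func relativeSaving(n int) float64 {
	before := 20000.0 + 25.0*float64(n)
	after := 20000.0 + 5.0*float64(n)
	return (before - after) / before
}

func main() {
	for _, n := range []int{1, 10, 100} {
		fmt.Printf("n=%3d: %.2f%%\n", n, 100*relativeSaving(n))
	}
}
```

Under these assumptions even 100 calls save under 9% of the transaction’s gas, and a single call saves about 0.1%.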

So I’m rather worried about gas savings being an exaggeration rather than not giving it too much credit :-).

But yeah, I think I made my point about this now, so I’ll leave it at that for the time being.

I think I see your point of view as a Solidity compiler writer. (E.g. inlining isn’t just about pulling small routines in line to avoid call overhead, it’s also about pulling in large routines to subject them to further optimization. We went down that road 100% in C++ – many libraries exist entirely as inline code.) I still think that if you had these operations you could make good use of them, and would appreciate it if you could think on that a little.

My point of view is that of a machine designer who is relatively agnostic about where the code is coming from – it’s hard to fully anticipate all of the ways a general-purpose machine will be used. So I look at the history of computing machines and find that most of them provide subroutine operations going all the way back to Turing, 1946. And I look at the history of virtual machines going back to Forth, 1970 and farther and find they are mostly stack machines that use Turing’s design: a stack of return addresses. And I can’t see any good reason for our VM not to do the same.

So… I’m proposing two opcodes that follow an industry-standard design and provide the most efficient call/return mechanism I know of. Calling takes 5 gas for an RJUMPSUB, returning takes 3 gas for a RETURNSUB. 8 gas total.

By contrast, currently: Calling takes 14 gas for a PUSH <return address> PUSH <destination> JUMP. Returning takes 11 gas for a SWAP JUMP. 25 gas total.

So, 25 - 8 = 17 gas, and 17 ÷ 25 = 68% savings. YMMV.

That seems to me to be too large of a difference to ignore, even if maximally efficient subroutines aren’t on the critical path for Solidity.

(Yes, sometimes the SWAP isn’t needed: 22-8=14, 14/22=64%. And if we also had RJUMP then calling could only take PUSH <return address> RJUMP <destination>. 17-8=9, 9/17=53%. The savings are still substantial.)
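Spelled out as a quick check (a trivial sketch; the 25-, 22-, and 17-gas baselines are the three calling conventions described above):

```go
package main

import "fmt"

// savings computes the percentage saved by the 8-gas subroutine
// call/return (RJUMPSUB 5 + RETURNSUB 3) against a JUMP-based baseline.
func savings(before, after int) float64 {
	return 100 * float64(before-after) / float64(before)
}

func main() {
	fmt.Printf("%.0f%%\n", savings(25, 8)) // PUSH PUSH JUMP / SWAP JUMP
	fmt.Printf("%.0f%%\n", savings(22, 8)) // when the SWAP isn't needed
	fmt.Printf("%.0f%%\n", savings(17, 8)) // PUSH RJUMP calling
}
```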