EIP-4444: Bound Historical Data in Execution Clients

Discussion thread for EIP-4444.

A link to the EIP itself? (I don’t see it in the github repo)

Was hoping to get it merged quickly and update with eips.ethereum.org link, but it didn’t happen. I’ve updated it just now with the PR link.

It's worth saying when the specification of this EIP comes into effect. An important consideration is that this change depends on the Merge. Before the Merge, historical block headers are essential for the sync and bootstrapping process: verifying the PoW seal of every block in the chain is the only way to prove the chain is valid. The network upgrade to PoS shifts this paradigm and makes the historical data of the EL chain no longer a prerequisite for the node bootstrapping process.

I don’t think that this spec should be so prescriptive, i.e. saying MUST NOT wrt serving ancient data. The reason is that the sync process of some clients may depend on historic data, and they will need time to prepare for such a big change. For instance, Erigon executes all blocks since genesis and doesn’t use any state-downloading techniques to get in sync with the network; they will probably want to serve ancient blocks even if other clients stop doing it.

IMO, the purpose of this EIP in the context of the Merge is to send a clear signal to infra, users, and clients that the invariant of storing the history is going to be broken at some point in the future. If we don’t want this for the Merge then it could be prescriptive, but it’s worth considering that it will take a lot of time for network participants to prepare for such a change.


I disagree here. If we don’t make this prescriptive then we will have a gradual degradation in historic block and receipt sync while users/clients still rely on it, until it just becomes broken and a bunch of users are confused and upset.

Making this a MUST NOT serve makes this a harder EIP to implement because it will require preparing dapps (especially receipts) and users for this breaking change, but then it will be completed in a clean way. If instead it’s a SHOULD NOT or MAY NOT, we create a path for dapps to slowly become broken on some indefinite timeline (because many will just continue to rely on the functionality for as long as it seems to work).

iiuc, Erigon has had a torrent block downloader in production for quite a while (it is way faster) and is expecting this breaking change at some point.

I agree with this, but to do so effectively, I think the strategy is to specify the EIP as it will be in its final form and begin communicating about it now, rather than actually introducing the breaking change simultaneously with the Merge. Imo, this is going to take 12+ months to properly communicate and execute on the dapp/community side, but this EIP can use the Merge and weak subjectivity in its rationale to tie it to this shift to PoS even though it won’t be fully implemented at the point of the shift.

Preserving the history of Ethereum is fundamental

Yes.

We suggest that a specialized “full sync” client is built. The client is a shim that pieces together different releases of execution engines and can import historical blocks to validate the entire Ethereum chain from genesis and generate all other historical data.

You don’t say who would build or maintain this client. And it’s not clear to me how the shimming would work. Existing clients go to some effort to efficiently manage the differences between releases, given most of the code hasn’t changed.

There are censorship/availability risks if there is a lack of incentives to keep historical data.

there is a risk that more dapps will rely on centralized services for acquiring historical data.

Yes. And I don’t think mitigating these risks is “out-of-scope for this proposal”.


Let’s say there are upgrades X, Y, and Z. During X, clients support X; during Y, clients support Y, and so on. After each fork, the code to run that fork is removed (save for the transition period of the fork). So if you want to validate the entire chain, you would start the client version that supports X and import all the X blocks. Then it would shut down and you would run the client version that supports Y with the same data directory, again importing all the blocks. Proceed with this and eventually you’ll be executing the tip of the chain with everything validated. The two caveats are i) after the Merge, you’ll also need to get the beacon history and drive clients with that, and ii) you may need to run glue code between client versions if there are breaking changes to the state / history storage.
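The sequential import described above can be sketched as a small planning routine. The client names, fork boundaries, and structure below are illustrative stand-ins, not real release names or a real shim:

```python
# Hypothetical (client binary, first block of its fork) pairs, in fork order.
# Each binary only knows how to execute the fork range it was released for.
FORK_CLIENTS = [
    ("client-frontier", 0),
    ("client-homestead", 1_150_000),
    ("client-byzantium", 4_370_000),
]

def client_for_block(number):
    """Return the client release whose fork range contains `number`."""
    chosen = None
    for binary, first_block in FORK_CLIENTS:
        if number >= first_block:
            chosen = binary
    return chosen

def sync_plan(tip):
    """Ordered list of (binary, from, to) import runs covering genesis..tip.

    The shim would run each entry against the same data directory, shutting
    one client down before starting the next; glue code for any on-disk
    schema change between releases (caveat ii above) would slot in between.
    """
    plan = []
    for i, (binary, first) in enumerate(FORK_CLIENTS):
        if first > tip:
            break
        last = FORK_CLIENTS[i + 1][1] - 1 if i + 1 < len(FORK_CLIENTS) else tip
        plan.append((binary, first, min(last, tip)))
    return plan
```

The planning step is the easy part; the hard part the post alludes to is the glue code between incompatible datadir formats.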

I think there are pretty straightforward solutions to replicated storage of static data that we can tap into for this (e.g. IPFS, torrent, data mirrors, etc).

Copying from the PR thread:

@djrtwo: Note, that discussions with @karalabe have pointed to making this a MUST because otherwise users will still rely on it while quality of this feature degrades until it is unusable. A MUST, instead, forces users and dapps to actually “upgrade” how they utilize this at the point of this shipping, forcing everyone’s hand so it doesn’t silently get worse and worse.

@djrtwo: If we go that path, then this spec should specify that devp2p does return errors on requests outside of the specified range of epochs/time

I feel like SHOULD is an adequate level of force for clients to do this? Geth continues to make up a large portion of the network, and if they stop serving the data, it’s going to be mostly unavailable (and users will mostly be forced to adapt). That said, I don’t have a strong argument for one or the other. The one caveat is that most devp2p messages wouldn’t know how to distinguish requests for “non-existent data” from “expired data” because they are by hash. Only GetBlockHeaders is done via block number and could therefore return an error.

If we were to go with SHOULD and GetBlockHeaders returns an empty response instead of an error, I think the main fail case will be that clients trying to header-sync the old way will be confused, since they don’t get any headers back. This seems acceptable and avoids a new wire protocol version.
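A minimal sketch of what such a SHOULD-style handler might look like, assuming a hypothetical prune point and a dict-like header store; this is not any actual client's code:

```python
HISTORY_PRUNE_POINT = 15_537_394  # hypothetical cutoff, e.g. the Merge block

def handle_get_block_headers(db, start_number, count):
    """Answer a by-number GetBlockHeaders request under SHOULD semantics.

    Headers below the prune point are simply omitted, so a peer asking for
    expired history gets an empty (or truncated) response rather than a
    protocol error -- indistinguishable from the data never having existed.
    """
    headers = []
    for n in range(start_number, start_number + count):
        if n < HISTORY_PRUNE_POINT:
            continue  # expired: skip silently instead of erroring
        header = db.get(n)
        if header is None:
            break  # genuinely unknown block: stop, as with any miss
        headers.append(header)
    return headers
```

An old-style header-syncing peer starting from genesis would receive `[]` here and stall, which is exactly the "confused but not broken protocol" fail case described above.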

@axic: Actually back in April 2020 or so, within the Ipsilon team we thought about making a proposal to just hardcode given hashes for given blocks in an EIP. The idea would be to hardcode the hashes for past hard forks.

However, then regenesis was proposed, which in practice does the same, but programmatically.

Since regenesis as a concept is delayed every so often, and given this EIP, would such a proposal make sense now?

In my mind, regenesis tackles a slightly different problem (and when I say regenesis, I generally mean this version). This EIP is about removing the need to store historical data whereas regenesis is a mechanism primarily aimed at reducing the state data. Regenesis, as far as I understand it, does not prescribe that clients should discontinue holding historical headers / bodies / etc.

I am curious to understand better what you mean about hard coding past fork blocks. I think you’re referring to a type of weak subjectivity checkpoint?

Generally a question that keeps coming up is how to deal with the difference between non-existent and expired data. Geth set a precedent in v1.10.0 by changing the txlookup index to prune entries older than 1 year by default. This means that if a user calls eth_getTransactionByHash with a valid tx hash from Byzantium, it will return an empty response.

Is this going to be acceptable behavior for other data like blocks? And is it acceptable over the wire? It seems like we’re leaning towards “yes”.
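The ambiguity can be illustrated with a toy index, using a hypothetical retention window modeled loosely on Geth's ~one-year default; this is a sketch, not Geth's actual implementation:

```python
RETENTION_BLOCKS = 2_350_000  # roughly a year of blocks (illustrative value)

def prune_tx_index(tx_index, head_number):
    """Drop index entries older than the retention window."""
    cutoff = head_number - RETENTION_BLOCKS
    return {h: entry for h, entry in tx_index.items()
            if entry["block"] >= cutoff}

def get_transaction_by_hash(tx_index, tx_hash):
    """Hash-keyed lookup against a pruned index.

    Once old entries are dropped, a miss is ambiguous: the hash may be
    invalid, or the entry may simply have expired. Both cases collapse
    into the same empty response -- the by-hash interface cannot tell
    the two apart.
    """
    return tx_index.get(tx_hash)
```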

My understanding of RFC 2119 is that what you are describing is exactly what SHOULD is used for. SHOULD is a way for a specification to say “this behavior is what is best for the ecosystem/user, but it isn’t something that is strictly enforced and if you don’t follow it nothing is going to outright break”.

That’s what I feared :wink: It means all of these clients have to be maintained indefinitely.

Then I’d like a bit more discussion of what the solutions are, and who is responsible for fixing what is going to get broken.

A. This is only necessary if one wants to validate the entire blockchain from genesis, which I argue is an uncommon operation at best, and I suspect eventually will simply be something that no one does.

B. The old clients don’t have to be maintained, they only need to continue to exist. No updates need to be applied to them.

Preserving the history of Ethereum is fundamental and we believe there are various out-of-band ways to achieve this.

It should be stated what preserving the history of Ethereum is fundamental to, and how important state history preservation is relative to other properties of the protocol.

And if it is as fundamental as stated, why do the authors not propose an alternative, sustainable mechanism for it?

Adding this: not that the current situation is indefinitely sustainable, but the current requirement sufficiently preserves and provides state history. The burden put on network users is heavy and growing, but there needs to be a realistic plan for how to maintain this widely-used aspect of the Ethereum network.

Yes. The Ethereum blockchain is, fundamentally, an immutable record of transactions – value transfers and valid computations. It seems to me that there should be some standard protocol for that, however the history is stored.


Additionally, if we use MUST NOT then the peer that breaks this requirement SHOULD be disconnected and penalised.

I agree completely that “preserving this widely-used aspect of the Ethereum network” is of utmost importance. But why does it have to be a capability that the node maintains?

It seems to me that maintaining this ability in the node is exactly the problem. If the data is immutable, the entire state and the entire history of the chain can be written once to some content-addressable store (such as IPFS) and, as long as someone preserves that data, anyone can get it. An Ethereum node would not even have to be involved. All one would need is the hash of where the immutable data is stored.

Fresh data can be written as an ‘addendum’, so there would have to be some sort of published manifest of the original hash and the periodic ongoing hashes. I would argue that the hash of the manifest should be part of the chain, but, short of that, the community would have to maintain it (perhaps by publishing the hash to a smart contract).
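The manifest idea might look roughly like this, with sha256 standing in for a real content address (IPFS would use a CID) and all function names hypothetical:

```python
import hashlib
import json

def content_hash(data: bytes) -> str:
    """Stand-in for a content address; a real store would return a CID."""
    return hashlib.sha256(data).hexdigest()

def new_manifest(genesis_archive: bytes) -> dict:
    """Manifest anchoring the original, write-once history archive."""
    return {"original": content_hash(genesis_archive), "addenda": []}

def publish_addendum(manifest: dict, fresh_data: bytes) -> dict:
    """Append the hash of a new period of history to the manifest."""
    manifest["addenda"].append(content_hash(fresh_data))
    return manifest

def manifest_hash(manifest: dict) -> str:
    """The one hash the community (or a smart contract) needs to preserve."""
    return content_hash(json.dumps(manifest, sort_keys=True).encode())
```

Only `manifest_hash` needs to be published somewhere durable; everything else is recoverable from the content-addressed store.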

My point is that because the data is immutable, and because we have content-addressed storage to store it in, there’s literally no need to continue providing the ability to regenerate this data from genesis. The only outcome of regenerating from genesis would be to arrive at the same IPFS hash as you already have.

On top of that, there’s no reason the clients have to maintain this capability, and the entire purpose of this EIP is to remove that requirement. This might possibly open a whole new area of innovation related to providing access to this historical data – which I think would allow for amazingly more interesting dApps than we currently have (because of the need for a node to get to it).

Furthermore, if the historical data is chunked into manageable pieces and properly indexed by chunk (with a bloom filter in front of each chunk), each individual user could easily download and pin only the portion of the database that they are interested in, thereby distributing this historical data throughout the community as a natural by-product of using it. (See TrueBlocks.)
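A toy version of the chunk-plus-bloom-filter idea, with made-up sizes and keys; a real deployment would tune the filter parameters to the chunk size and tolerable false-positive rate:

```python
import hashlib

class ChunkBloom:
    """Tiny Bloom filter guarding one chunk of historical data (a sketch)."""

    def __init__(self, size_bits=8192, hashes=3):
        self.size = size_bits
        self.hashes = hashes
        self.bits = 0  # arbitrary-precision int used as a bit array

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        # False => definitely absent; True => probably present.
        return all(self.bits >> pos & 1 for pos in self._positions(key))

def chunks_to_download(blooms, address: str):
    """Indexes of the chunks a user must fetch/pin to cover one address."""
    return [i for i, bloom in enumerate(blooms) if bloom.might_contain(address)]
```

The point is that a user interested in one address downloads only the chunks whose filters match, and by pinning them re-serves exactly the slice of history they use.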

I agree that people are uploading a lot of stuff to the blockchain, especially with the rise of NFTs, but also unoptimized token contracts that are causing state bloat. But did anyone think about other examples and use cases of the blockchain? What if people uploaded important documents like birth certificates to the blockchain, given that the mission of a blockchain is to be a ledger which stores information on-chain forever? Suddenly those people won’t be able to access their documents because some devs thought it’s a good idea to delete blockchain state after some time… Another great example is NFTs, especially NFTs made before ERC-721, i.e. from 2017 and earlier, like CryptoPunks. Those will be gone forever.

From a developer perspective, I’m sure there is plenty of data that is not important and doesn’t need to be stored.

A better idea would probably be to store data on full nodes and have light nodes, or to think about different ways to make the infrastructure more efficient without having to delete and lose data.

Don’t get me wrong, I’m just trying to think realistically from non-core-dev perspective and I’m against this EIP.

Ethereum was never designed to be a permanent data storage system. Something like FileCoin is much better suited for long term data storage, and they have incentives built into the protocol to ensure that the cost of long term storage is paid for by those seeking it.

Also, this EIP removes history but not state. State expiry is also an active area of research, but out of scope for this thread.