I have a few pain points I’d like to get addressed. These are not so much theoretical protocol niceties as actual nasty issues that cause parts of node implementations to behave suboptimally.
eth/63 has a GetNodeData(hash) method (or some variation of this). This is used to retrieve either a trie node identified by its hash, or bytecode identified by its hash. In theory this is a nice, flexible thing. In practice, it is far too flexible.
This method makes the assumption that nodes store all code and all trie nodes as hash->value mappings. This assumption actually forces nodes to do this, even though it makes no sense. The false assumption was that nodes will deduplicate data, and that this hash->value mapping is the optimal layout.
Nodes may not want to deduplicate data so aggressively: storage tries across multiple accounts can share the same data with the same hash (though in practice they rarely do). However, if nodes implement pruning, they need to duplicate this data again, because a pruning algorithm won’t be able to track references across a potentially unbounded number of accounts.
This is a problem, because GetNodeData assumes the node can retrieve a trie node by its hash, whereas if the trie node is duplicated, we also need the account it belongs to in order to retrieve it.
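To make the pruning problem concrete, here is a minimal Go sketch. Everything in it is hypothetical: the `DedupStore` and `ScopedStore` types are invented for illustration and are not taken from any client, and SHA-256 stands in for Keccak-256 only to keep the example dependency-free.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Hash is a 32-byte node hash. SHA-256 is a stand-in for Keccak-256 here,
// chosen only so the sketch needs nothing beyond the standard library.
type Hash [32]byte

// Address is a 20-byte account address.
type Address [20]byte

// DedupStore keeps a single copy of each node, shared by every account.
type DedupStore map[Hash][]byte

// Put stores the node under its hash; the owning account is not recorded.
func (s DedupStore) Put(_ Address, node []byte) Hash {
	h := Hash(sha256.Sum256(node))
	s[h] = node
	return h
}

// Prune deletes a node on behalf of one account. Without reference tracking
// it cannot tell whether another account still needs the same node.
func (s DedupStore) Prune(_ Address, h Hash) { delete(s, h) }

// ScopedKey identifies a node within one account's storage trie.
type ScopedKey struct {
	Account Address
	Hash    Hash
}

// ScopedStore keeps a separate copy of each node per account.
type ScopedStore map[ScopedKey][]byte

// Put stores the node under (account, hash).
func (s ScopedStore) Put(a Address, node []byte) Hash {
	h := Hash(sha256.Sum256(node))
	s[ScopedKey{Account: a, Hash: h}] = node
	return h
}

// Prune only touches the copy owned by the given account.
func (s ScopedStore) Prune(a Address, h Hash) {
	delete(s, ScopedKey{Account: a, Hash: h})
}

func main() {
	alice, bob := Address{0x01}, Address{0x02}
	node := []byte("identical storage trie node")

	dedup := DedupStore{}
	h := dedup.Put(alice, node)
	dedup.Put(bob, node)
	dedup.Prune(alice, h) // also removes the node bob relies on
	_, ok := dedup[h]
	fmt.Println("dedup store, bob's node survives:", ok)

	scoped := ScopedStore{}
	scoped.Put(alice, node)
	scoped.Put(bob, node)
	scoped.Prune(alice, h)
	_, ok = scoped[ScopedKey{Account: bob, Hash: h}]
	fmt.Println("scoped store, bob's node survives:", ok)
}
```

The scoped layout makes pruning safe, but the key is now (account, hash), so a bare hash is no longer enough to find the node.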
- Parity hacks around this issue by xor-ing the account into the hash’s last 20 bytes. When retrieving a trie node by hash, they iterate the database over the first 12 bytes as a prefix, and if multiple results are found, they hash on the fly to check which one is correct.
- Geth’s PoC pruning code currently appends the account to the hash and uses a similar iteration mechanism to pull the data from disk. For us this is problematic because storing them in <account><hash> order instead of <hash><account> would give us proximity (an account’s nodes stored next to each other), but break iterability. Similarly, this fetch-by-hash requirement puts a huge burden on in-memory caches, which need extra indexing structures to allow translating account-scoped trie nodes into “global deduped trie nodes”.
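The xor-and-scan workaround described above can be sketched roughly as follows. This is an illustration of the idea only, not Parity’s or Geth’s actual code: the `Store` type, the `mangle` helper and the sorted slice standing in for an ordered on-disk database are all assumptions, and SHA-256 again stands in for Keccak-256.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

type Hash [32]byte
type Address [20]byte

// mangle XORs the account address into the last 20 bytes of the node hash.
// The first 12 bytes stay untouched and act as a prefix shared by all
// per-account copies of the same node.
func mangle(h Hash, a Address) Hash {
	for i := range a {
		h[12+i] ^= a[i]
	}
	return h
}

type entry struct {
	key   Hash
	value []byte
}

// Store is a toy ordered key-value store: a sorted slice standing in for a
// disk database that supports ordered iteration.
type Store struct{ entries []entry }

// seek returns the index of the first entry whose key is >= k.
func (s *Store) seek(k Hash) int {
	return sort.Search(len(s.entries), func(i int) bool {
		return bytes.Compare(s.entries[i].key[:], k[:]) >= 0
	})
}

// Put stores a node under its account-mangled key and returns the plain hash.
func (s *Store) Put(a Address, node []byte) Hash {
	h := Hash(sha256.Sum256(node))
	k := mangle(h, a)
	i := s.seek(k)
	if i < len(s.entries) && s.entries[i].key == k {
		return h // already present for this account
	}
	s.entries = append(s.entries, entry{})
	copy(s.entries[i+1:], s.entries[i:])
	s.entries[i] = entry{key: k, value: node}
	return h
}

// GetScoped is the cheap path: with the account known, the key is exact.
func (s *Store) GetScoped(a Address, h Hash) ([]byte, bool) {
	k := mangle(h, a)
	if i := s.seek(k); i < len(s.entries) && s.entries[i].key == k {
		return s.entries[i].value, true
	}
	return nil, false
}

// GetByHash is the context-free path: scan every key sharing the 12-byte
// prefix and rehash each candidate until one matches.
func (s *Store) GetByHash(h Hash) ([]byte, bool) {
	i := sort.Search(len(s.entries), func(i int) bool {
		return bytes.Compare(s.entries[i].key[:12], h[:12]) >= 0
	})
	for ; i < len(s.entries) && bytes.Equal(s.entries[i].key[:12], h[:12]); i++ {
		if Hash(sha256.Sum256(s.entries[i].value)) == h {
			return s.entries[i].value, true
		}
	}
	return nil, false
}

func main() {
	s := &Store{}
	node := []byte("shared storage trie node")
	h := s.Put(Address{0xaa}, node)
	s.Put(Address{0xbb}, node)

	_, byHash := s.GetByHash(h)
	_, scoped := s.GetScoped(Address{0xbb}, h)
	fmt.Println(len(s.entries), byHash, scoped)
}
```

The contrast between `GetScoped` (one exact seek) and `GetByHash` (a prefix scan plus a rehash per candidate) is the extra cost that a context-free lookup imposes on a store that keeps nodes per account.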
My request is that GetNodeData be extended with a context parameter, clearly stating which account a particular trie node should be pulled from. If a node dedupes everything as now, it can simply ignore the context and still work cleanly. If a node stores them contextually, the extra data can help speed access up a lot. During a fast sync you know the context either way when downloading the data, so there’s no overhead there either.
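As a rough sketch of what this could look like on the serving side, assuming a hypothetical `NodeRequest` item that carries the account next to the hash: none of these names come from the eth protocol or any client, the wire encoding is left out entirely, and how to express "no account" for nodes of the main state trie is left open.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

type Hash [32]byte
type Address [20]byte

// NodeRequest is a hypothetical item of an extended GetNodeData message:
// the node hash plus the account whose storage trie it belongs to.
type NodeRequest struct {
	Account Address // the proposed context parameter
	Hash    Hash
}

// NodeSource is what a request handler needs from the local database.
type NodeSource interface {
	Node(req NodeRequest) ([]byte, bool)
}

// dedupDB stores one global copy per hash and simply ignores the context.
type dedupDB map[Hash][]byte

func (db dedupDB) Node(req NodeRequest) ([]byte, bool) {
	v, ok := db[req.Hash]
	return v, ok
}

// scopedDB stores nodes per account and uses the context as part of the key,
// so no prefix scan or rehashing is needed.
type scopedDB map[NodeRequest][]byte

func (db scopedDB) Node(req NodeRequest) ([]byte, bool) {
	v, ok := db[req]
	return v, ok
}

// Serve answers a batch of requests, skipping nodes it does not have.
func Serve(src NodeSource, reqs []NodeRequest) [][]byte {
	var out [][]byte
	for _, r := range reqs {
		if v, ok := src.Node(r); ok {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	acct := Address{0x01}
	node := []byte("storage trie node")
	h := Hash(sha256.Sum256(node)) // SHA-256 as a stand-in for Keccak-256
	reqs := []NodeRequest{{Account: acct, Hash: h}}

	// Both storage layouts answer the same contextual request.
	fmt.Println(len(Serve(dedupDB{h: node}, reqs)))
	fmt.Println(len(Serve(scopedDB{{Account: acct, Hash: h}: node}, reqs)))
}
```

The same handler works against both layouts, which is the point of the request: the deduplicating store loses nothing, and the contextual store gets an exact key.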
I would also make separate calls for retrieving a trie node and for retrieving code. It makes things cleaner, and it still allows deduping code and potentially storing it differently.
I forgot the other one, damn. Will post here when I remember it.