Here’s my initial post-mortem and concern/task list following the Constantinople Ropsten hard fork, in no particular order.
-
A consensus bug in Parity was discovered (https://github.com/paritytech/parity-ethereum/pull/9746). We need to understand why this consensus bug occurred in the first place, and particularly why it wasn’t caught by the tests. @ethchris suggests that we may need clearer EIP specs including pseudocode (https://twitter.com/ethchris/status/1052503731072315392). Apparently in this case there was some confusion over the meaning of terms like “transaction” and “execution frame” that may have contributed to the bug (cf. ethereum/AllCoreDevs - Gitter).
-
There were no miners on the new fork. Why? We need to better understand, and have more control over, how mining happens on a PoW testnet. Do the Foundation, Parity, or other core dev teams need to have more miners running on a PoW testnet? Do we need to have miners on standby, fully synced, ready to jump on after a fork and assert the correct chain? Do we need to enlist the help of altruistic miners like @atlanticcrypto in this effort? How do we coordinate?
-
Parity has a limit on how far back nodes can automatically reorg (cf. @5chdn at ethereum/AllCoreDevs - Gitter). We should better understand this mechanism. Is this supposed to be a limit on “on-chain governance” (allowing nodes to automatically come to consensus on the canonical chain) that triggers the need for meatspace/developer intervention? Or is it more about resource constraints? What’s the limit and why is it set as such? Why does Parity have this limit, but not Geth? UPDATE: it appears that both Geth and Parity have such a limit.
-
Geth has a `debug.setHead` command that allows you to manually force it onto the right chain; it appears that Parity does not have such a feature. Is this desirable?
-
It’s possible for an upgraded node in fast sync mode (Geth or Parity, I believe) to fast sync over a bad block which caused a fork and keep following the wrong chain. This is clearly a shortcoming of fast sync, but we should discuss it in the context of forks and long reorgs - is there some way to communicate a hint to such nodes that they’re on the wrong chain?
-
Similarly, after a fork has occurred and there are many chains (there appear to be as many as four Ropsten chains right now, cf. ethereum/AllCoreDevs - Gitter), it’s very difficult for a node syncing from scratch to find the right chain. For one thing, it constantly tries to peer with nodes on the wrong chain; in this case it was necessary for nodes to turn off discovery entirely and manually enter a set of peers to get caught up to the right chain. Perhaps a “fork ID” as suggested by @MicahZoltu would help. Alternatively, some sort of “beacon” (not to be confused with the Eth 2.0 beacon chain, sorry for the poor choice of terms) which gave nodes a verifiable hint about the current canonical chain would help, as would a blacklist of bad blocks or chains. Some challenges here: @cdetrio points out that the P2P layer makes transmitting this information hard (ethereum/AllCoreDevs - Gitter), there’s the question of centralization here (who controls the beacon?), and it’s a possible DoS vector in the wrong hands.
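
To make the “fork ID” idea a bit more concrete, here is a minimal sketch of one way such an identifier could work. It assumes the ID is a checksum over the genesis hash plus the block numbers of every fork the node has already applied; this is an illustrative assumption, not a description of what @MicahZoltu actually proposed. The names `fork_id` and `compatible`, and the genesis hash and block numbers, are all hypothetical.

```python
import zlib
from typing import Iterable


def fork_id(genesis_hash: bytes, passed_fork_blocks: Iterable[int]) -> str:
    """Checksum the genesis hash plus every fork block already applied.

    Hypothetical scheme for illustration: two nodes that disagree about
    which forks are active end up with different IDs.
    """
    checksum = zlib.crc32(genesis_hash)
    # Sort and de-duplicate so the ID does not depend on config ordering.
    for block in sorted(set(passed_fork_blocks)):
        checksum = zlib.crc32(block.to_bytes(8, "big"), checksum)
    return f"{checksum:08x}"


def compatible(local_id: str, remote_id: str) -> bool:
    """Peer only with nodes advertising the same fork ID."""
    return local_id == remote_id


# Made-up genesis hash and fork heights, purely for illustration.
GENESIS = bytes.fromhex("41" * 32)
upgraded_node = fork_id(GENESIS, [100, 200])  # applied both forks
stale_node = fork_id(GENESIS, [100])          # never applied the second fork

print(compatible(upgraded_node, stale_node))
```

A check like this would let a syncing node drop peers that never upgraded without downloading any of their blocks. It would not, on its own, tell apart two upgraded nodes that share the same fork configuration but diverged because of a consensus bug, so some form of bad-block hint would still be needed for that case.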
-
Communication - the AllCoreDevs channel on Gitter served as our primary means of communication throughout this fork - but it’s disorganized, has no threading, and it’s difficult to find a canonical source of information (e.g., which is the current head? what’s the current block number? what’s the status of each client? what series of commands do I need to run to sync a new node to the current head? etc.). I propose the creation of a core devs “War Room” where this information can be managed going into and during an upgrade or other emergent situation. Hudson set this up, which is a great example: https://notes.ethereum.org/s/SJE9O2ksQ#
-
General hard fork strategy - the core devs seem to be all over the map here. @AlexeyAkhunov argues that we should roll Ropsten back to Constantinople and try the fork again. @sorpaas disagrees. Is there a minimum amount of time we need between finalizing/releasing the client code and scheduling a hard fork? Can we make sure that hard forks happen on Wednesdays rather than on Saturdays? Do we need a certain set of people to be “on call” for the fork? Is there a minimum amount of time we need to see an upgrade running successfully on a testnet before we schedule a mainnet hard fork? Which testnet? Does it need to be an “active”, PoW-based testnet like Ropsten? Do we need some escape hatch, e.g., we could send a transaction to call off or postpone a testnet hard fork if a bug is discovered?
-
Fork monitor - apparently @Arachnid and @cdetrio worked on this before. This might be something like a modified version of EthStats. We would definitely find this useful for the #Ewasm testnet. Would this be of value?
-
There’s an issue in Geth where it spends a lot of resources exploring chains with bad blocks that appear to have a higher TD (cf. ethereum/AllCoreDevs - Gitter). According to @holiman, it may explore the same chain or same set of blocks multiple times, esp. when it’s not running in archive mode (and thus doesn’t retain all the block data). We may need some efficient way for Geth to remember bad ancestors to prevent this issue (ethereum/AllCoreDevs - Gitter).
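
As a rough illustration of what “remembering bad ancestors” could look like, here is a minimal sketch of a bounded cache of known-bad block hashes. It is not how Geth implements or plans to implement this; the class `BadBlockCache`, its methods, and the capacity are all hypothetical. The idea is that once a block fails validation, any block whose parent is in the cache can be rejected without re-executing the chain behind it.

```python
from collections import OrderedDict


class BadBlockCache:
    """Bounded LRU set of hashes that are invalid or descend from an invalid block."""

    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self._bad = OrderedDict()

    def mark_bad(self, block_hash: str) -> None:
        # Insert (or refresh) the hash, then evict the oldest entries if full.
        self._bad[block_hash] = True
        self._bad.move_to_end(block_hash)
        while len(self._bad) > self.capacity:
            self._bad.popitem(last=False)

    def is_bad(self, block_hash: str) -> bool:
        return block_hash in self._bad

    def should_import(self, block_hash: str, parent_hash: str) -> bool:
        """Cheap pre-check run before any expensive block validation."""
        if self.is_bad(block_hash):
            return False
        if self.is_bad(parent_hash):
            # A child of a bad block is itself bad; remember it so that
            # its own descendants are rejected just as cheaply.
            self.mark_bad(block_hash)
            return False
        return True


cache = BadBlockCache(capacity=2)
cache.mark_bad("b1")                           # b1 failed full validation
rejected = cache.should_import("b2", "b1")     # b2 builds on b1
unrelated = cache.should_import("c1", "a0")    # no known-bad ancestor
```

Because the cache is bounded, a sufficiently long bad chain can push older entries out, so this trades a small amount of memory for avoiding most repeated work rather than all of it.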
-
Ropsten has effectively been unusable, or at least very difficult to use, for about four days as of this writing, and we still have several active forks. This raises the question: what is a testnet? What is its purpose? Do we need stable, “production” testnets and “staging” testnets? @LefterisJP probably has thoughts here.