ERC-7585: MixHash and Public Data Storage Proofs

waterflier · December 28, 2023, 1:32am

Abstract

This proposal introduces a design for minimum value selection storage proofs on Merkle trees. The design consists of two main components:

A hashing algorithm termed MixHash, aimed to replace the commonly used Keccak256 and SHA256 algorithms.
Public data storage proofs. This enables anyone to present a proof to a public network, verifying their possession of a copy of specific public data marked by MixHash.

Additionally, the proposal discusses the practical implementation of this design in various scenarios and suggests some improvements to the ERC-721 and ERC-1155 standards.

Motivation

The ERC-721 and ERC-1155 standards are widely used in the NFT fields. However, the current standards do not provide a mechanism for verifying the existence of public data. This is a major obstacle to the development of many applications, such as decentralized data markets, decentralized data storage, and decentralized data oracles.

Specification

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119 and RFC 8174.

MixHash

MixHash is a Merkle tree root hash value that incorporates data length information. Its structure is as follows:

     +-----------256 bits MixHash-----------+
High |-2-|----62----|----------192----------| Low

2   bits: Hash algorithm selection, where 0b00 represents SHA256, and 0b10 represents Keccak256. (0b01, 0b11 are reserved)
62  bits: File size. Hence, MixHash can support file sizes up to 2^62-1.
192 bits: The lower 192 bits of the Merkel root node value constructed by the designated hash algorithm.

Given a file, we can construct a MixHash through the following defined steps:

File MUST Split into 1KB chunks. MUST Pad zeros to the end of the last chunk if needed.
Calculate the hash for each chunk and the low 128bits is the Merkle Tree leaf value.
Construct a Merkle tree , root node hash algorithm is 256bits, other node use low 128bits of the 256bits hash result.
Return the combination of hash type, the file size, and the low 192 bits of the Merkle tree root node hash.

MixHash retains a length of 256 bits, so replacing the widely used Keccak256 and SHA256 with MixHash incurs no additional cost. Although including the file length in the upper 62 bits compromises security to some extent, the 192-bit hash length is already sufficient for defending against hash collisions.

The following is the pseudo code for generating Mixhash:

def generateMixHash(blockHeight,hashType,file):
    chunk_hash_array = []
    for chunk in file:
        if len(chunk) < 1024:
            chunk = chunk + b'\x00' * (1024-len(chunk))
        chunk_hash_array.append(getChunkHash(chunk,hashType))
    merkle_tree_root = getMerkleTreeRoot(chunk_hash_array,hash_type)
    return mix_hash(hash_type, len(file), merkle_tree_root)

Public Data Storage Proofs

When MixHash is used to identify a piece of public data, anyone can construct a storage proof to demonstrate possession of a copy of that data. Here is a typical process for using a public data storage proof:

Users eligible to submit storage proofs for rewards are referred to as Suppliers.
A Supplier prepares a storage proof for data D (with MixHash mix_hash_d) based on a block at height h. A 256-bit nonce value for the proof is derived from this block (usually directly using the block’s hash).
To generate a correct storage proof, the Supplier needs to traverse every 1KB chunk of D to find the optimal leaf node m. This is done by attempting to append the nonce value to the end of each chunk to minimize the new Merkle tree root hash. After determining m, the path m_path and leaf node value m_leaf_data of m are extracted.
The Supplier constructs the storage proof for data D at block time h using {mix_hash_d, h, m, m_path, m_leaf_data} and submits it to the public network.
The public network can validate the correctness of m, m_path, and m_leaf_data based on mix_hash_d: verifying that m is indeed a chunk of D. The timeliness of the proof can be verified through h. After passing both correctness and timeliness checks, the public network calculates proof_result_m based on the nonce value and existing proof information, and saves it.
The public network does not have enough information to verify the optimality of the proof, but other Suppliers with the full data set can submit a better {mix_hash_d, h, better_m, better_m_path, better_m_leaf_data} to challenge the published storage proof.
The public network can determine the success of the challenge by comparing proof_result_m and proof_result_better_m. A successful challenge indicates the old storage proof was forged. If no one challenges the published storage proof within a certain timeframe, it can be considered correct from a game-theoretic perspective.
To support healthy competition, the public network should design an appropriate economic model, rewarding users who provide correct storage proofs and penalizing those who submit false ones.

With an understanding of the above process, let us describe the generation and verification of storage proofs more precisely using pseudocode.

# generate proof off chain
def generateProof(mixHash, blockHeight,file) 
    nonce = getNonce(blockHeight)
    hash_type = getHashType(mixHash)
    chunk_hash_array = buildChunkHashArray(file,hash_type)

    min_index = 0
    min_merkle_tree_root = MAX_UINT256
    min_chunk = None

    m_index = 0
    for chunk in file:
      new_chunk = chunk + nonce
      chunk_hash_array[m_index] = getChunkHash(new_chunk,hash_type)
      merkle_tree_root = getMerkleTreeRoot(chunk_hash_array,hash_type)
      chunk_hash_array[m_index] = getChunkHash(chunk,hash_type)
      if (merkle_tree_root < min_merkle_tree_root):
        min_merkle_tree_root = merkle_tree_root
        min_index = m_index
        min_chunk = chunk
      m_index = m_index + 1

// verify on chain
function verifyDataProof(mixHash, blockHeight, m_index, m_path, m_leaf_data) {
    if(current_block_height - blockHeight > MAX_BLOCK_DISTANCE) {
       revert("proof expired");
    }
    hash_type = getHashType(mixHash);
    merkle_tree_root = getMerkleTreeRootFromPath(m_path,m_leaf_data,hash_type);
    if(low192(merkle_tree_root) != low192(mixHash)) {
       revert("invalid proof");
    }

    nonce = getNonce(blockHeight);
    proof_result = getMerkleTreeRootFromPath(m_path,m_leaf_data.append(nonce),hash_type);
    last_proof_result,last_prover = getProofResult(mixHash, blockHeight);
    if(proof_result < last_proof_result) {
      emit ProofPunish(last_prover);
      updateProofResult(mixHash, blockHeight, proof_result, msg.sender);
    } 
}

To minimize the size of the storage proof as much as possible, we have optimized the implementation of getMerkleTreeRoot: besides the RootHash, the hash values of other nodes are truncated to the lower 128 bits. This approach effectively compresses the hash value of a complete Merkle tree to half its size. The full implementation details can be found in the subsequent Reference Implementation section.

Defending Sourcing Attack

As can be seen from the process described above, the core of constructing public data storage proofs is based on a public, non-repeating nonce value generated at a specific moment. It requires traversing the entire content of the file within a designated time to construct a correct proof. Without restrictions, this process is vulnerable to external data source attacks: Suppliers do not store data locally but obtain it through network requests when constructing storage proofs. How does our design prevent such attacks?

Time-Limited Response: Suppliers must submit storage proofs within a specified time. On a typical public network like Ethereum, the block time is about 15 seconds. A typical maximum block interval could be 2 (MAX_BLOCK_DISTANCE = 2), meaning Suppliers must complete the construction and submission of the storage proof within 30 seconds. This duration is insufficient for most data sources to complete transmission, thus Suppliers must store data locally to have a chance to construct storage proofs within the allotted time.
Economic Game Theory: The economic model based on public data storage proofs usually rewards the first Supplier to submit a correct storage proof. This means that, from a game-theoretic standpoint, the inherent delay in using external data sources to construct storage proofs reduces the likelihood of successful submission. Economically, it’s less profitable than the expected gains from storing data locally. The economic model incentivizes Suppliers to store data locally.

Success Rate of Defending Sourcing Attack

Using a strategy combining block interval limitations and priority for first-time submissions is often effective in defending against external data source attacks. The effectiveness of this approach primarily relies on the difference in speed between reading files from local storage and retrieving files from the network. We can define the success rate `R`` of defending against external data source attacks using the following formula:

R = (TNetwork - TLocal) / AvgProofTime

The larger the AvgProofTime, the lower the success rate of defending against Sourcing Attack. Currently, the most significant factor affecting AvgProofTime is the average time for on-chain transactions. For example, in the BTC network, the time for 2 blocks is approximately 20 minutes. With such a large AvgProofTime, the success rate `R`` decreases rapidly, making sourcing attacks more likely to succeed. We can introduce a dynamically adjustable Proof of Work (PoW) mechanism to further defend against Sourcing Attack. This modification alters the formula as follows:

R = (TNetwork - TLocal) / (AvgProofTime-AvgPoWTime)

With the introduction of the Proof of Work (PoW) concept, the strategy for submitting storage proofs becomes: constructing and submitting storage proofs within a specified time while endeavoring to complete as much PoW computation as possible. In the valid proof time window, the storage proof with the greater amount of PoW computation prevails. Such a mechanism can effectively defend against external data source attacks, especially when AvgProofTime is large.

Integrating a PoW mechanism into the design of public data storage proofs is not complex. A simple implementation could modify the second step to:

2. To generate a correct storage proof, the Supplier needs to traverse all 1KB chunks of D to find the optimal leaf node `m`. The method involves attempting to append the nonce and a self-constructed noise value to the end of each chunk to minimize the new Merkle tree root hash and, according to PoW difficulty requirements, ensuring that the last x bits of the constructed `proof_result_m` are zero. After determining `m` and the noise, the path `m_path` and the leaf node value `m_leaf_data` of `m` are extracted.

The pseudocode adjusted according to the above modifications is as follows:

POW_DIFFICULTY = 16
function generateProofwithPow(mixHash, blockHeight,file) {
    nonce = getNonce(blockHeight);
    hash_type = getHashType(mixHash);
    chunk_hash_array = getChunkHashArray(file,hash_type);

    min_index = 0
    min_merkle_tree_root = MAX_UINT256
    min_chunk = None

    m_index = 0;
    noise = 0;
    while(true) {
      for chunk in file {
        new_chunk = chunk + nonce + noise;
        chunk_hash_array[m_index] = getChunkHash(new_chunk,hash_type);
        merkle_tree_root = getMerkleTreeRoot(chunk_hash_array,hash_type);
        chunk_hash_array[m_index] = getChunkHash(chunk,hash_type);
        if (merkle_tree_root < min_merkle_tree_root) {
          min_merkle_tree_root = merkle_tree_root;
          min_index = m_index;
          min_chunk = chunk;
        }
        m_index ++;
      }
      if(last_zero_bits(min_merkle_tree_root) >= POW_DIFFICULTY) {
        break;
      }
      noise++
    }
    m_path = getMerkleTreePath(chunk_hash_array, min_index);
    return strorage_proof(mixHash, blockHeight, min_index, m_path, min_chunk,noise);
}

Applying this mechanism increases the cost of generating storage proofs, which deviates from our initial intent to reduce the widespread effective storage of public data. Moreover, heavily relying on a PoW-based economic model may allow Suppliers with significant advantages in PoW through specialized hardware to disrupt the basic participatory nature of the game, reducing the widespread distribution of public data. Therefore, it is advised not to enable the PoW mechanism unless absolutely necessary.

Limitations

The storage proofs discussed in this paper are not suitable for storing very small files, as small files inherently struggle to defend against external data source attacks.
Public data storage proofs do not address the issue of whether the data is genuinely public. Therefore, it is important to verify the public nature of MixHash in specific scenarios (which is often not easy). Allowing Suppliers to submit storage proofs for any MixHash and receive rewards would lead to a situation where Suppliers create data only they possess and exploit this to gain rewards through constructed attacks, ultimately leading to the collapse of the entire ecosystem.

ERC Extension Suggestion: Tracking High-Value Public Data by MixHash

We can use the existing Ethereum ecosystem to confirm whether a MixHash is public data and track its value. For any contracts related to unstructured data, the ERCPublicDataOwner interface can be implemented. This interface determines whether a specific MixHash is associated with the current contract and attempts to return an Owner address corresponding to a MixHash. Additionally, for the existing and widely recognized NFT ecosystem, we suggest that new ERC-721 and ERC-1155 contracts implement a new extension interface ERC721MixHashVerify. This interface can explicitly associate an NFT with a MixHash. The specific interface definition is as follows:

/// @title ERCPublicDataOwner Standard, query Owner of the specified MixHash
///  Note: the ERC-165 identifier for this interface is <ERC-Number>.
interface ERCPublicDataOwner {
    /**
        @notice Queries Owner of public data determined by Mixhash
        @param  mixHash    Mixhash you want to query
        @return            If it is an identified public data, return the Owner address, otherwise 0x0 will be returned
    */
    function getPublicDataOwner(bytes32 mixHash) external view returns (address);
}

The ERC721MixHashVerfiy extension is OPTIONAL for ERC-721 smart contracts or ERC-1155 smart contracts. This extension can help establish a relationship between specified NFT and MixHash.

/// @title ERC721MixHashVerfiy Extension, optional extension
///  Note: the ERC-165 identifier for this interface is <ERC-Number>.
interface ERC721MixHashVerfiy{
    /**
        @notice Is the tokenId of the NFT is the Mixhash?
        @return           True if the tokenId is MixHash, false if not
    */
    function tokenIdIsMixHash() external view returns (bool); 
    
    /**
        @notice Queries NFT's MixHash
        @param  _tokenId  NFT to be querying
        @return           The target NFT corresponds to MixHash, if it is not Mixhash, it returns 0x0
    */
    function tokenDataHash(uint256 _tokenId) external view returns (bytes32);
}

Rationale

Storage proofs (often referred to as space-time proofs) have long been a subject of interest, with numerous implementations and related projects existing.

Compared to existing copy proofs based on zero-knowledge proofs, our storage proof is based on “Nash Consensus,” with its core principles being:
a. The public network (on-chain) cannot verify the optimality of a proof but relies on economic game theory. This significantly reduces the costs of construction and verification.
b. Data without value typically lacks game value and is naturally eliminated from the system. There is no commitment to elusive perpetual storage.
It can be fully implemented through smart contracts (although the GAS cost of the current reference implementation is somewhat high), separating storage proof from the economic model.
For public data, we do not strictly defend against Sybil attacks. A Sybil attack refers to a Supplier using multiple identities to commit to storing multiple copies of data D (e.g., n copies) while actually storing less (like just one copy) but providing n storage proofs, thereby succeeding in the attack. Strictly preventing Sybil attacks essentially means attaching more additional costs to data storage. The core of our storage proof is to increase the probability of the existence of public data copies through a combination of storage proofs and different economic models, rather than needing to strictly define how many copies exist. Therefore, from the perspective of the design of public data storage proofs, we do not need to defend against Sybil attacks.

Backwards Compatibility

Using HashType allows storage proofs to be compatible with EVM-compatible public blockchain systems, as well as BTC-Like public blockchain systems. In fact, MixHash could become a new cross-chain value anchor: it can track the value of the same data represented by MixHash across different public blockchain networks using different models, achieving the aggregation of cross-chain values. Considering the need for backward compatibility, we have set the default HashType of MixHash to SHA256. Two categories of HashType remain unused, leaving ample room for future expansion.

Security Considerations

This storage proof revolves around public data. In demonstrating storage proofs, it often involves sending 1KB segments of the data to the public network. Therefore, please do not use the storage proof design presented in this paper for private data.

The design of MixHash can support storage proofs for private files, but this requires some adjustments in the processing of the original data and the construction of the storage proof. A detailed discussion on the design of storage proofs for private files is beyond the scope of this paper. In fact, some of the projects mentioned in the Reference Implementation section use both public data storage proofs and private data storage proofs.

Test Cases

PublicDataProofDemo includes test cases written using Hardhat.

Reference Implementation

PublicDataProofDemo (GitHub - buckyos/PublicDataProof)

A standard reference implementation

DMC public data inscription

Based on public data storage certification, a complete economic model and gameplay has been designed on ETH network and BTC inscription network

Learn more background and existing attempts

DMC Main Chain
CYFS

Copyright

Copyright and related rights waived via [CC0]

Pull Request URL is the following:

github.com/ethereum/ERCs

Add ERC: MixHash and Public Data Storage Proofs

master ← buckyos:master

opened 01:56AM - 28 Dec 23 UTC

waterflier

+283 -0

This proposal introduces a design for `minimum value selection` storage proofs o…n Merkle trees. The design consists of two main components: 1. A hashing algorithm termed MixHash, aimed to replace the commonly used Keccak256 and SHA256 algorithms. 2. Public data storage proofs. This enables anyone to present a proof to a public network, verifying their possession of a copy of specific public data marked by MixHash. Additionally, the proposal discusses the practical implementation of this design in various scenarios and suggests some improvements to the ERC-721 and ERC-1155 standards.

abcoathup · December 28, 2023, 10:06pm

You can just include the link to the ERC PR, as the text of the ERC may change.

waterflier · December 29, 2023, 12:05am

Thank you for your suggestions.

Before posting the topic, I looked at how others have approached similar issues. I aim to include comprehensive content in the topic to facilitate the discussion, allowing participants to avoid jumping between different sections. When there are updates to the ERC TEXT, I will make sure to synchronize them promptly.

waterflier · January 4, 2024, 12:09am

I did not see your specific question.

The provided content is pseudocode, the basic idea is to continuously find a random number, noise, while keeping the nonce unchanged. This is done so that the new combination meets both the minimum value requirement and the difficulty criteria.

fulldecent · January 18, 2024, 10:01pm

Could you please make a PR to move this document to “DRAFT” status?

Also I don’t see why it is presumed that this initiative will get the number 7585.

abcoathup · January 19, 2024, 12:13am

The original PR is draft status and hasn’t been merged yet.

github.com/ethereum/ERCs

Add ERC: MixHash and Public Data Storage Proofs

master ← buckyos:master

opened 01:56AM - 28 Dec 23 UTC

waterflier

+283 -0

This proposal introduces a design for `minimum value selection` storage proofs o…n Merkle trees. The design consists of two main components: 1. A hashing algorithm termed MixHash, aimed to replace the commonly used Keccak256 and SHA256 algorithms. 2. Public data storage proofs. This enables anyone to present a proof to a public network, verifying their possession of a copy of specific public data marked by MixHash. Additionally, the proposal discusses the practical implementation of this design in various scenarios and suggests some improvements to the ERC-721 and ERC-1155 standards.

7585 was the number issued by an ERC editor.

john4real · July 1, 2024, 9:29am

Can you explain in detail how ERC-7585 secures private data? And if I upload private files in foggie client, how are my files backed up and how are they secured?

waterflier · July 1, 2024, 10:37pm

The core of ERC7585 is how to verify a storage proof on EVM based on the Merkle tree.

Therefore, if it is private data, the original data should be encrypted (the secret key needs to be saved locally), and then the encrypted data should be saved with the provider. Then the storage proof challenge is performed around the encrypted data.

Remember, never save unencrypted private data on the Internet~

john4real · August 4, 2024, 1:27pm

that helps a lot, thanks

fulldecent · August 27, 2024, 3:24am

So if we’re looking at a file backup scenario, here is an example. On macOS it saves your backups into a sparse bundle disk image. If you specified a password with your backup, then those files are encrypted and safe to upload publicly.

In this use case your only concern is backing up and ensuring the existence of data. And the fact that others could see some small parts of the encrypted data (for the mix proofs) should be of little concern to you.