Model Submissions for Ethereum Deep Funding

Hello model builders,

Consider this thread your home for sharing everything related to your submissions to the deep funding challenge, which assigns weights to the Ethereum dependency graph.

To be eligible for the $30,000 prize pool, you need to post a detailed write-up of your model submission. You may submit it even after June 30th, 2025, once the contest deadline has passed, but it must be posted before prize distribution is finalized. Another $20,000 is set aside for community members who improve the design of the competition, as suggested here.

We will give additional points to submissions with open-source code and fully reproducible results. We encourage you to be visual in your submissions, share the Jupyter notebooks or code used in your submission, explain how the same model performs differently on different parts of the Ethereum graph, and share information that is collectively valuable to other participants.

Since write-ups can be made after submissions close, other participants cannot copy your methodology. You can take cues for write-ups from other competitions we have held and also get some inspiration for building your own model.

The format of submissions is open-ended; express yourself however you like. You can share as much or as little as you want, but you need to write something here to be considered for prizes.

More details at deepfunding.org

Good luck predictoooors


Quantifying Contributions of Open Source Projects to the Ethereum Universe

Overview

Ethereum, as a decentralized and rapidly evolving ecosystem, is built on the back of countless open-source projects. From core protocol implementations and smart contract frameworks to tooling, middleware, and developer libraries, the growth of the Ethereum universe is directly tied to the strength and progress of its open-source foundation.

Despite this, there is currently no widely adopted method to quantitatively evaluate the impact of individual open-source projects within Ethereum. This lack of visibility impairs the ability of stakeholders—including the Ethereum Foundation, DAOs, developers, researchers, and funders—to identify which projects are truly foundational and deserving of support, auditing, or recognition.

This initiative proposes a data-driven framework for quantifying the contributions of open-source repositories to Ethereum using a combination of ecosystem relevance, technical dependencies, development activity, and on-chain influence. The goal is to build a transparent, scalable, and objective system to rank the importance of repositories across the Ethereum universe.


Why Quantification Matters

Funding Allocation: Improve the accuracy and fairness of grants, retroactive public goods funding, and quadratic funding.

Ecosystem Security: Identify critical libraries and infrastructure projects that require audits and monitoring.

Developer Recognition: Highlight unsung contributors and undervalued repos with high ecosystem leverage.

Governance Insights: Support DAO tooling and decision-making with data-driven repository influence scores.

Sustainability: Ensure long-term viability of critical infrastructure by recognizing and supporting maintainers.


Core Evaluation Dimensions

To quantify contributions effectively, the model should evaluate repositories along multiple, weighted dimensions:

  1. Development Activity

Commit frequency, pull requests, issue resolution

Contributor diversity and project longevity

  2. Ecosystem Dependency

How many other repos depend on it (import graphs, dev toolchains)

Used in major L2s, DeFi protocols, wallets, or clients

  3. On-Chain Impact

Smart contracts linked to repo deployed on-chain

Volume of interactions, transaction count, or TVL influenced

  4. Protocol Alignment

Inclusion in Ethereum Improvement Proposals (EIPs)

Alignment with Ethereum’s roadmap (e.g., scalability, account abstraction, L2s)

  5. Community Footprint

Mentions in dev discussions (e.g., EthResearch, Reddit, Twitter)

Citations in academic or technical Ethereum publications


Quantification Methodology

The proposed methodology involves:

Repository Indexing: Identify a comprehensive list (~15,000) of Ethereum-relevant open-source repositories.

Data Aggregation: Pull data from GitHub, The Graph, GHTorrent, npm, smart contract registries (e.g., Etherscan), and social platforms.

Metrics Standardization: Normalize and weight features across categories (e.g., activity, adoption, dependency).

Modeling: Use rule-based scoring or machine learning models (e.g., gradient boosting, GNNs) to compute a unified contribution score.

Result: A ranked list of repositories with associated weights reflecting their quantified contributions to Ethereum.
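
As a rough illustration of the standardization and rule-based scoring steps above, the sketch below min-max normalizes per-dimension metrics and combines them with fixed weights. The metric names, weights, and numbers are hypothetical placeholders for illustration, not the proposed model's actual configuration.

import pandas as pd

# Hypothetical per-dimension weights; real weights would need tuning and community input.
DIMENSION_WEIGHTS = {
    "development_activity": 0.25,
    "ecosystem_dependency": 0.30,
    "onchain_impact": 0.20,
    "protocol_alignment": 0.15,
    "community_footprint": 0.10,
}

def min_max_normalize(column: pd.Series) -> pd.Series:
    """Scale one metric column into [0, 1]; constant columns map to 0."""
    span = column.max() - column.min()
    if span == 0:
        return pd.Series(0.0, index=column.index)
    return (column - column.min()) / span

def contribution_scores(metrics: pd.DataFrame) -> pd.Series:
    """Normalize each dimension and combine them into one weighted score per repository."""
    normalized = metrics.apply(min_max_normalize)
    weights = pd.Series(DIMENSION_WEIGHTS)
    return (normalized[weights.index] * weights).sum(axis=1).sort_values(ascending=False)

# Made-up numbers for three repositories, purely to show the shape of the computation.
metrics = pd.DataFrame(
    {
        "development_activity": [1200.0, 300.0, 4500.0],
        "ecosystem_dependency": [900.0, 50.0, 2000.0],
        "onchain_impact": [5e9, 1e7, 2e8],
        "protocol_alignment": [12.0, 1.0, 5.0],
        "community_footprint": [3000.0, 150.0, 800.0],
    },
    index=["go-ethereum", "example-lib", "solidity"],
)
print(contribution_scores(metrics))

A learned model (e.g., gradient boosting or a GNN) would replace the fixed weights with coefficients fitted against juror comparison data.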


Output Example

go-ethereum: 0.98

solidity: 0.95

OpenZeppelin/contracts: 0.89

ethers.js: 0.86

foundry-rs/foundry: 0.82

Lido-finance/lido-dao: 0.74

Uniswap/v3-core: 0.72

eth-infinitism/account-abstraction: 0.67

Scores are illustrative


Potential Applications

Grant Program Optimization (EF, Gitcoin, ARB Grants)

Retroactive Airdrops and Rewards (e.g., Optimism RPGF)

Reputation Systems for Devs and DAOs

Ecosystem Risk Mapping

Dynamic Leaderboards and Dashboards


Challenges and Limitations

Attribution Complexity: Linking code to impact is non-trivial and may involve indirect relationships.

Gaming and Bias: Repos could be gamed through artificial commits or inflated usage.

Subjectivity in Weighting: Choosing the right weights across dimensions can influence final scores; requires transparency and community input.

Temporal Dynamics: Repo relevance changes over time and needs continuous updates.

Project Background

I took on this contest to analyze the contribution of various code repositories within the Ethereum ecosystem. In essence, it involved scoring and ranking these repositories to determine their importance to the overall ecosystem. What seemed straightforward at first turned out to have quite a few hidden complexities.
My name is ewohirojuso and my email is ewohirojuso66@gmail.com.

Overall Architecture

The entire system is divided into several core modules:

  • ArrayManager: Handles GPU acceleration.
  • FastCompute: Performs high-performance computations.
  • GraphConstructor: Builds dependency graphs.
  • TopologyFeatures: Extracts network topological features.
  • ImpactModel: Implements machine learning prediction models.
  • WeightEngine: Calculates weights.

The code structure is clear, with each module having a single responsibility. GPU acceleration was used primarily because of the large volume of data, which would be too slow to process on a standard CPU.


Design and Implementation of the Feature Standardization System

Of everything developed here, I believe the most valuable aspect to share is the feature standardization system. Initially, I didn’t give it much thought, assuming it was just basic data preprocessing. However, I soon discovered its surprising depth.

Problems Encountered

  • Vast differences in original data feature distributions: Some feature values ranged from 0-1 (e.g., win rate), while others were in the tens or hundreds of thousands (e.g., star count), and some were logarithmic (e.g., PageRank values). Feeding this raw data directly into the model resulted in abysmal performance.
  • Traditional Z-score standardization was not effective: Due to numerous outliers, traditional Z-score standardization didn’t work well here. For instance, a sudden surge in star count for a particular repository, acting as an outlier, would severely skew the entire distribution.

Solution

I implemented a robust standardization method based on the median and Interquartile Range (IQR):

from collections import defaultdict
from typing import Dict

import numpy as np
import pandas as pd

def _initialize_feature_scalers(self) -> Dict[str, Dict[str, float]]:
    """Collect per-feature statistics (median, P25, P75) across all repositories."""
    feature_statistics = defaultdict(list)
    
    # Collect all finite feature values, skipping NaN/Inf
    for repo_features in self.all_features.values():
        for feature_name, value in repo_features.items():
            if not (pd.isna(value) or np.isinf(value)):
                feature_statistics[feature_name].append(float(value))
    
    scalers = {}
    for feature_name, values in feature_statistics.items():
        if len(values) > 0:
            values_array = np.array(values)
            scalers[feature_name] = {
                'median': np.median(values_array),
                'p75': np.percentile(values_array, 75),
                'p25': np.percentile(values_array, 25)
            }
    return scalers

The core idea is to replace the mean with the median and the standard deviation with the IQR:

def _standardize_feature_value(self, feature_name: str, value: float) -> float:
    """Robustly standardize one feature value with median/IQR, then soft-clip with tanh."""
    if feature_name not in self.feature_scalers:
        return value
        
    scaler = self.feature_scalers[feature_name]
    
    median = scaler['median']
    iqr = scaler['p75'] - scaler['p25'] + 1e-8  # epsilon prevents division by zero
    
    standardized = (value - median) / iqr
    
    # Soft clipping to avoid extreme outliers; output stays within (-2, 2)
    return np.tanh(standardized * 0.5) * 2.0

Why This Design?

  • Robustness: The median and IQR are insensitive to outliers and won’t be skewed by individual extreme values.
  • Soft Clipping: Using the tanh function for soft clipping limits the influence of extreme values while preserving their relative ordering.
  • Feature Type Adaptation: Different types of features are combined with different weighting strategies.

Actual Results

After switching to this standardization method, the model’s cross-validation score increased from just over 0.2 to over 0.4, a significant improvement. This was particularly effective for features with clear outliers, such as the star count of repositories.

Some Pitfalls Encountered

  • Division by Zero: IQR can be 0 (if all values are the same), so a small epsilon must be added to prevent division by zero.
  • Handling Missing Values: NaN and Inf values must be filtered out first, otherwise they will contaminate the statistics.
  • Feature Alignment: Ensure all repositories have the same set of features, filling in missing ones with default values.
# Tips for handling division by zero and missing values
iqr = scaler['p75'] - scaler['p25'] + 1e-8  # Prevent division by zero
standardized = (value - median) / iqr
return np.tanh(standardized * 0.5) * 2.0    # Limit to [-2,2] range

Other Technical Highlights

GPU Acceleration

I used CuPy to accelerate matrix operations, primarily during feature computation and model training. For large-scale network analysis, GPUs are indeed significantly faster than CPUs.
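
As a minimal sketch of this kind of GPU/CPU-agnostic setup (not the actual ArrayManager code), the snippet below uses CuPy's NumPy-compatible API with a CPU fallback; the cosine-similarity computation is just an illustrative workload.

import numpy as np

try:
    import cupy as cp  # GPU array library with a NumPy-compatible API
    xp = cp
except ImportError:
    xp = np  # fall back to NumPy when CuPy or a GPU is unavailable

def pairwise_cosine(features: np.ndarray) -> np.ndarray:
    """Compute a dense cosine-similarity matrix on GPU when possible, CPU otherwise."""
    x = xp.asarray(features, dtype=xp.float32)
    norms = xp.linalg.norm(x, axis=1, keepdims=True) + 1e-8
    unit = x / norms
    sims = unit @ unit.T
    # Copy the result back to host memory as a plain NumPy array.
    return cp.asnumpy(sims) if xp is not np else sims

sims = pairwise_cosine(np.random.rand(1000, 64))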

Machine Learning Model

Ultimately, XGBoost was chosen, mainly because it’s less demanding on feature engineering and has existing GPU support. Several other models were tried, but none performed as well as XGBoost.
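
A minimal sketch of how an XGBoost regressor might be trained and cross-validated on pairwise features follows; the placeholder data, hyperparameters, and tree_method choice are illustrative assumptions, not the submitted configuration.

import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

# X: one row per (repo A, repo B) comparison built from standardized feature differences.
# y: the relative-contribution target derived from the comparison data.
# Both are random placeholders here, just to make the snippet runnable.
X = np.random.rand(5000, 40)
y = np.random.rand(5000)

model = XGBRegressor(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    tree_method="hist",  # the histogram method also has a GPU-backed variant in XGBoost
)

scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print("CV R^2:", scores.mean())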

Weight Allocation Strategy

This part is central to the entire system. I adopted a multi-level weight allocation approach: seed project weights, originality scores, and dependency relationship weights. Each layer has its own calculation logic.
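
The exact allocation formulas are not spelled out here, so the following is only a hypothetical sketch of how seed weights (Level 1), originality shares (Level 2), and per-dependency splits (Level 3) could compose into final per-repository weights.

from typing import Dict

def allocate_weights(
    seed_weights: Dict[str, float],                    # Level 1: weight of each seed project
    originality: Dict[str, float],                     # Level 2: share a seed keeps for itself (0..1)
    dependency_splits: Dict[str, Dict[str, float]],    # Level 3: per-seed split across its dependencies
) -> Dict[str, float]:
    """Compose the three levels and accumulate a final weight per repository."""
    final: Dict[str, float] = {}
    for seed, seed_w in seed_weights.items():
        own_share = originality.get(seed, 0.5)
        final[seed] = final.get(seed, 0.0) + seed_w * own_share
        remaining = seed_w * (1.0 - own_share)
        for dep, dep_share in dependency_splits.get(seed, {}).items():
            final[dep] = final.get(dep, 0.0) + remaining * dep_share
    return final

# Tiny worked example with made-up numbers.
print(allocate_weights(
    seed_weights={"web3.py": 0.6, "ethers.js": 0.4},
    originality={"web3.py": 0.7, "ethers.js": 0.8},
    dependency_splits={"web3.py": {"eth-abi": 0.5, "eth-utils": 0.5},
                       "ethers.js": {"noble-hashes": 1.0}},
))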


Throughout this project, the biggest takeaway was the critical importance of data preprocessing. Feature standardization might seem like a minor issue, but it profoundly impacts the final results. Often, poor model performance isn’t due to the algorithm itself, but rather how the data is fed into the model.

Furthermore, GPU acceleration is truly valuable, especially when dealing with large-scale graph data. However, careful memory management is crucial, as GPU memory is much more precious than CPU memory.

The code has been uploaded, and I welcome any discussions. This type of ecosystem analysis project is quite interesting, offering both engineering challenges and opportunities for algorithmic optimization.

Author: summer
renzhichua1@gmail.com
Version/Date: v2.12 / 2025-07-02


1. Data-Driven and Systematized Engineering

For the complex task of quantifying contributions to the Ethereum ecosystem, our core design philosophy is to build a robust, reproducible, and deeply data-driven, systematized solution. We believe that excellent results do not stem from a single clever algorithm alone; they rely on meticulous engineering at every stage, from data processing and feature construction to model training and weight allocation.

Our solution aims to minimize reliance on hard-coded rules and subjective assumptions, instead building upon the following principles:

  • Maximize Data Utilization: We not only leverage raw training data but also extract implicit relationships from existing data through Data Augmentation techniques to enhance the model’s generalization ability.
  • Deep Feature Engineering: We believe that the quality of features directly determines the upper limit of model performance. Therefore, we have constructed a comprehensive system of multi-dimensional features, including network topology, team quality, historical performance, contributor profiles, and temporal evolution.
  • Layered Machine Learning: For the multi-level (Level 1, 2, 3) weight allocation tasks in the competition, we adopted different strategies. Notably, we elevated the Level 2 (originality) weight allocation, traditionally dependent on heuristic rules, into an independent machine learning task.

This report will focus on elaborating the key innovations in our scheme: machine learning modeling for Level 2 weights, and Enhanced Feature Representation for relation prediction.

2. Key Innovations: Layered Machine Learning and Enhanced Feature Representation

Our system features deep innovation in two key areas, aiming to replace fixed rules with learned patterns, thereby improving overall accuracy and robustness.

Machine Learning Modeling for Level 2 (Originality) Weights

“Originality” is a highly subjective concept. Traditional approaches typically rely on heuristic rules based on dependency count (e.g., more dependencies mean less originality). We believe this method is too crude and fails to capture complex realities.

To address this, we designed the ImprovedLevel2Allocator module, which transforms originality scoring itself into a supervised learning problem.

The key idea is: If a project frequently wins in direct comparisons with the libraries it depends on, this strongly indicates that it generates significant added value itself, i.e., it has high originality.

Our implementation steps are as follows:

  1. Signal Extraction: We filter all direct comparison records between “parent projects” and their “child dependencies” from the training data. For example, when web3.py is compared to its dependency eth-abi, the result (win/loss and multiplier) becomes a strong signal for measuring web3.py’s originality.
  2. Feature Construction: We build a set of feature vectors specifically for predicting originality for each core project (seed repo). Key features include:
    • Win rate against dependencies (win_rate_against_deps): This is the most crucial metric, directly reflecting the project’s ability to surpass its underlying dependencies.
    • Dependency count and depth (dependency_count, max_dependency_depth): Serve as basic penalty terms.
    • Comprehensive quality metrics: Reusing features from the main feature library such as team quality, network influence, and historical performance.
  3. Model Training: We train a GradientBoostingRegressor with these features as input and a “true” originality score (estimated by combining multiple signals) as the label. Cross-validation results show that the model’s R² score reached a respectable level, demonstrating the feasibility of this approach.

Through this method, we upgraded the evaluation of originality from a simple “rule engine” to an “intelligent model” capable of learning complex patterns from data, whose judgments are far richer and more precise than simple dependency counting.
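
A compressed sketch of the supervised setup described above follows, assuming a small feature table with the columns named in the write-up and placeholder labels; the real ImprovedLevel2Allocator is more involved than this.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# One row per seed repository, using the originality features named above (values are placeholders).
features = pd.DataFrame({
    "win_rate_against_deps": rng.random(200),
    "dependency_count":      rng.integers(1, 60, 200),
    "max_dependency_depth":  rng.integers(1, 6, 200),
    "team_quality":          rng.random(200),
    "network_influence":     rng.random(200),
})
# Label: an estimated "true" originality score combining several signals (placeholder here).
labels = rng.random(200)

model = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
print("CV R^2:", cross_val_score(model, features, labels, cv=5, scoring="r2").mean())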

Enhanced Feature Representation and Data Augmentation for Relation Prediction

When predicting the relative contribution between two repositories (Repo A vs. Repo B), how their differences are presented to the model is crucial. Simply subtracting the two repositories’ feature vectors (feature_A - feature_B) loses a lot of scale and proportional information.

To address this, in AdvancedEnsemblePredictor, we designed the _create_enhanced_feature_vector function to generate a “feature group” for each original feature, containing multiple comparison methods:

  • Simple difference (A - B): Captures absolute differences.
  • Safe ratio (A / B): Captures relative proportions, crucial for multiplicative features (e.g., star count).
  • Log ratio (log(A) - log(B)): Insensitive to data scale, effectively handles long-tail distributed features, and aligns formally with the competition’s Logit Cost Function.
  • Normalized difference ((A - B) / (A + B)): Constrains differences to the [-1, 1] range, eliminating the influence of units.
  • Domain-specific transformations: Applying the most suitable transformations for different feature types (e.g., network centrality, count values, scores), such as logarithmic transformation, square root transformation, etc.

This enhanced representation greatly enriches the information provided to the gradient boosting tree model, allowing it to understand the relationship between two entities from different angles, thereby making more accurate predictions.
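
A simplified sketch of the feature-group idea behind _create_enhanced_feature_vector follows; the epsilon handling and the exact set of transforms are assumptions for illustration, not the production code.

import math
from typing import Dict

def enhanced_feature_group(a: float, b: float, eps: float = 1e-8) -> Dict[str, float]:
    """Expand one raw feature of repos A and B into several complementary comparisons."""
    a_safe, b_safe = a + eps, b + eps
    return {
        "diff": a - b,                                     # absolute difference
        "ratio": a_safe / b_safe,                          # relative proportion
        "log_ratio": math.log(a_safe) - math.log(b_safe),  # scale-insensitive; assumes non-negative features
        "norm_diff": (a - b) / (abs(a) + abs(b) + eps),    # bounded to roughly [-1, 1]
    }

def pairwise_vector(feats_a: Dict[str, float], feats_b: Dict[str, float]) -> Dict[str, float]:
    """Build the enhanced A-vs-B vector by applying the feature group to every shared feature."""
    vector: Dict[str, float] = {}
    for name in feats_a.keys() & feats_b.keys():
        for suffix, value in enhanced_feature_group(feats_a[name], feats_b[name]).items():
            vector[f"{name}__{suffix}"] = value
    return vector

print(pairwise_vector({"stars": 12000, "pagerank": 0.03}, {"stars": 800, "pagerank": 0.01}))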

Furthermore, we introduced transitivity-based data augmentation. If the training data shows A > B and B > C, we generate a weakly labeled sample A > C and add it to the training set. This effectively expands the training data volume, helping the model learn more globally consistent ranking relationships.
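
A minimal sketch of transitivity-based augmentation, assuming comparisons are stored as (winner, loser) pairs; the reduced weight given to inferred samples is an illustrative choice, not the exact scheme used.

from itertools import product
from typing import List, Set, Tuple

def augment_by_transitivity(pairs: List[Tuple[str, str]]) -> List[Tuple[str, str, float]]:
    """Given observed (winner, loser) pairs, add weakly labeled A > C samples whenever A > B and B > C."""
    observed: Set[Tuple[str, str]] = set(pairs)
    augmented = [(a, b, 1.0) for a, b in pairs]  # original samples keep full weight
    inferred: Set[Tuple[str, str]] = set()
    for (a, b1), (b2, c) in product(pairs, repeat=2):
        if b1 == b2 and a != c and (a, c) not in observed and (a, c) not in inferred:
            inferred.add((a, c))
            augmented.append((a, c, 0.5))        # inferred sample with a reduced weight
    return augmented

print(augment_by_transitivity([("web3.py", "eth-abi"), ("eth-abi", "eth-typing")]))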

3. Conclusion

Our solution is an end-to-end, highly engineered machine learning framework. It extends the application of machine learning from “single prediction tasks” to “systematic decision processes” through layered modeling (building a dedicated model for Level 2 originality) and enhanced feature representation (providing richer information to the main predictor). We believe that this approach, emphasizing data-driven methods, system robustness, and engineering details, is a solid path towards more accurate and stable quantification results. That’s all. If you have any questions, please contact me.