Model Submissions for Ethereum Deep Funding

Hello model builders,

Consider this thread your home for sharing everything related to your submissions in the Deep Funding challenge, which assigns weights to the Ethereum dependency graph.

In order to be eligible for the $25,000 prize pool, you need to make a detailed write-up of your model submissions. You may submit this even after September 7th 2025, once the contest deadline is over; however, it needs to be submitted by September 20th.

$10,000 is awarded objectively based on leaderboard placement, while $10,000 is based on the quality of the write-ups here. $5,000 is set aside for the jury to evaluate your models and give feedback.

We will give additional points to submissions with open-source code and fully reproducible results. We encourage you to be visual in your submissions, share the Jupyter notebooks or code used in the submission, explain differences in performance of the same model on different parts of the Ethereum graph, and share information that is collectively valuable to other participants.

Since write-ups can be made after submissions close, other participants cannot copy your methodology. You can take cues for write-ups from other competitions we have held and also get some inspiration for building your own model.

The format of submissions is open ended and free for you to express yourself the way you like. You can share as much or as little as you like, but you need to write something here to be considered for prizes.

More details at deepfunding.org

Good luck predictoooors


Quantifying Contributions of Open Source Projects to the Ethereum Universe

Overview

Ethereum, as a decentralized and rapidly evolving ecosystem, is built on the back of countless open-source projects. From core protocol implementations and smart contract frameworks to tooling, middleware, and developer libraries, the growth of the Ethereum universe is directly tied to the strength and progress of its open-source foundation.

Despite this, there is currently no widely adopted method to quantitatively evaluate the impact of individual open-source projects within Ethereum. This lack of visibility impairs the ability of stakeholders—including the Ethereum Foundation, DAOs, developers, researchers, and funders—to identify which projects are truly foundational and deserving of support, auditing, or recognition.

This initiative proposes a data-driven framework for quantifying the contributions of open-source repositories to Ethereum using a combination of ecosystem relevance, technical dependencies, development activity, and on-chain influence. The goal is to build a transparent, scalable, and objective system to rank the importance of repositories across the Ethereum universe.


Why Quantification Matters

Funding Allocation: Improve the accuracy and fairness of grants, retroactive public goods funding, and quadratic funding.

Ecosystem Security: Identify critical libraries and infrastructure projects that require audits and monitoring.

Developer Recognition: Highlight unsung contributors and undervalued repos with high ecosystem leverage.

Governance Insights: Support DAO tooling and decision-making with data-driven repository influence scores.

Sustainability: Ensure long-term viability of critical infrastructure by recognizing and supporting maintainers.


Core Evaluation Dimensions

To quantify contributions effectively, the model should evaluate repositories along multiple, weighted dimensions:

  1. Development Activity

Commit frequency, pull requests, issue resolution

Contributor diversity and project longevity

  2. Ecosystem Dependency

How many other repos depend on it (import graphs, dev toolchains)

Used in major L2s, DeFi protocols, wallets, or clients

  3. On-Chain Impact

Smart contracts linked to repo deployed on-chain

Volume of interactions, transaction count, or TVL influenced

  4. Protocol Alignment

Inclusion in Ethereum Improvement Proposals (EIPs)

Alignment with Ethereum’s roadmap (e.g., scalability, account abstraction, L2s)

  5. Community Footprint

Mentions in dev discussions (e.g., EthResearch, Reddit, Twitter)

Citations in academic or technical Ethereum publications


Quantification Methodology

The proposed methodology involves:

Repository Indexing: Identify a comprehensive list (~15,000) of Ethereum-relevant open-source repositories.

Data Aggregation: Pull data from GitHub, The Graph, GHTorrent, npm, smart contract registries (e.g., Etherscan), and social platforms.

Metrics Standardization: Normalize and weight features across categories (e.g., activity, adoption, dependency).

Modeling: Use rule-based scoring or machine learning models (e.g., gradient boosting, GNNs) to compute a unified contribution score.

Result: A ranked list of repositories with associated weights reflecting their quantified contributions to Ethereum.
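
As a concrete illustration of the scoring step, here is a hedged sketch of a rule-based weighted score (the category weights, feature names, and min-max normalization are illustrative assumptions, not part of the proposal):

import numpy as np

# Illustrative category weights; the proposal leaves the actual weighting open
CATEGORY_WEIGHTS = {"activity": 0.25, "dependency": 0.30, "on_chain": 0.20,
                    "protocol": 0.15, "community": 0.10}

def contribution_score(repo_metrics, all_metrics):
    # repo_metrics: {category: raw value} for one repo
    # all_metrics: list of such dicts for every indexed repo
    score = 0.0
    for category, weight in CATEGORY_WEIGHTS.items():
        values = np.array([m[category] for m in all_metrics])
        lo, hi = values.min(), values.max()
        normalized = (repo_metrics[category] - lo) / (hi - lo + 1e-9)  # min-max normalize
        score += weight * normalized
    return score  # higher = larger quantified contribution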


Output Example

go-ethereum: 0.98

solidity: 0.95

OpenZeppelin/contracts: 0.89

ethers.js: 0.86

foundry-rs/foundry: 0.82

Lido-finance/lido-dao: 0.74

Uniswap/v3-core: 0.72

eth-infinitism/account-abstraction: 0.67

Scores are illustrative


Potential Applications

Grant Program Optimization (EF, Gitcoin, ARB Grants)

Retroactive Airdrops and Rewards (e.g., Optimism RPGF)

Reputation Systems for Devs and DAOs

Ecosystem Risk Mapping

Dynamic Leaderboards and Dashboards


Challenges and Limitations

Attribution Complexity: Linking code to impact is non-trivial and may involve indirect relationships.

Gaming and Bias: Repos could be gamed through artificial commits or inflated usage.

Subjectivity in Weighting: Choosing the right weights across dimensions can influence final scores; requires transparency and community input.

Temporal Dynamics: Repo relevance changes over time and needs continuous updates.

Project Background

I took on this contest to analyze the contribution of various code repositories within the Ethereum ecosystem. In essence, it involved scoring and ranking these repositories to determine their importance to the overall ecosystem. What seemed straightforward at first turned out to have quite a few hidden complexities.
My name is ewohirojuso and my email is ewohirojuso66@gmail.com.

Overall Architecture

The entire system is divided into several core modules:

  • ArrayManager: Handles GPU acceleration.
  • FastCompute: For high-performance computations.
  • GraphConstructor: Builds dependency graphs.
  • TopologyFeatures: Extracts network topological features.
  • ImpactModel: Implements machine learning prediction models.
  • WeightEngine: Calculates weights.

The code structure is clear, with each module having a single responsibility. GPU acceleration was used primarily because of the large volume of data, which would be too slow to process on a standard CPU.


Design and Implementation of the Feature Standardization System

In this development, I believe the most valuable aspect to share is the feature standardization system. Initially, I didn’t give it much thought, assuming it was just basic data preprocessing. However, I soon discovered its surprising depth.

Problems Encountered

  • Vast differences in original data feature distributions: Some feature values ranged from 0-1 (e.g., win rate), while others were in the tens or hundreds of thousands (e.g., star count), and some were logarithmic (e.g., PageRank values). Feeding this raw data directly into the model resulted in abysmal performance.
  • Traditional Z-score standardization was not effective: Due to numerous outliers, traditional Z-score standardization didn’t work well here. For instance, a sudden surge in star count for a particular repository, acting as an outlier, would severely skew the entire distribution.

Solution

I implemented a robust standardization method based on the median and Interquartile Range (IQR):

from collections import defaultdict
from typing import Dict

import numpy as np
import pandas as pd

def _initialize_feature_scalers(self) -> Dict[str, Dict[str, float]]:
    feature_statistics = defaultdict(list)
    
    # Collect all finite feature values, grouped by feature name
    for repo_features in self.all_features.values():
        for feature_name, value in repo_features.items():
            if not (pd.isna(value) or np.isinf(value)):
                feature_statistics[feature_name].append(float(value))
    
    # Store robust statistics (median and quartiles) for each feature
    scalers = {}
    for feature_name, values in feature_statistics.items():
        if len(values) > 0:
            values_array = np.array(values)
            scalers[feature_name] = {
                'median': np.median(values_array),
                'p75': np.percentile(values_array, 75),
                'p25': np.percentile(values_array, 25)
            }
    return scalers

The core idea is to replace the mean with the median and the standard deviation with the IQR:

def _standardize_feature_value(self, feature_name: str, value: float) -> float:
    if feature_name not in self.feature_scalers:
        return value
        
    scaler = self.feature_scalers[feature_name]
    
    median = scaler['median']
    iqr = scaler['p75'] - scaler['p25'] + 1e-8
    
    standardized = (value - median) / iqr
    
    # Soft clipping to avoid extreme outliers
    return np.tanh(standardized * 0.5) * 2.0

Why This Design?

  • Robustness: The median and IQR are insensitive to outliers and won’t be skewed by individual extreme values.
  • Soft Clipping: Using the tanh function for soft clipping limits the influence of extreme values while preserving their relative magnitudes.
  • Feature Type Adaptation: Different types of features are combined with different weighting strategies.

Actual Results

After switching to this standardization method, the model’s cross-validation score increased from just over 0.2 to over 0.4, a significant improvement. This was particularly effective for features with clear outliers, such as the star count of repositories.

Some Pitfalls Encountered

  • Division by Zero: IQR can be 0 (if all values are the same), so a small epsilon must be added to prevent division by zero.
  • Handling Missing Values: NaN and Inf values must be filtered out first, otherwise they will contaminate the statistics.
  • Feature Alignment: Ensure all repositories have the same set of features, filling in missing ones with default values.
# Tips for handling division by zero and missing values
iqr = scaler['p75'] - scaler['p25'] + 1e-8  # Prevent division by zero
standardized = (value - median) / iqr
return np.tanh(standardized * 0.5) * 2.0    # Limit to [-2,2] range

Other Technical Highlights

GPU Acceleration

We used CuPy for matrix operation acceleration, primarily during feature computation and model training. For large-scale network analysis, GPUs are indeed significantly faster than CPUs.
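
As a minimal illustration (not the project's actual code), CuPy can be used as a drop-in replacement for NumPy for the heavy matrix work, falling back to the CPU when no GPU is available:

import numpy as np

try:
    import cupy as cp   # GPU array backend, if available
    xp = cp
except ImportError:
    cp = None
    xp = np             # CPU fallback

def pairwise_similarity(features):
    # features: (n_repos, n_features) array of standardized features
    X = xp.asarray(features, dtype=xp.float32)
    S = X @ X.T                                  # dense similarity matrix on GPU/CPU
    return cp.asnumpy(S) if xp is cp else S      # move the result back to host memory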

Machine Learning Model

Ultimately, XGBoost was chosen, mainly because it’s less demanding on feature engineering and has existing GPU support. Several other models were tried, but none performed as well as XGBoost.
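
A hedged sketch of this kind of setup (hyperparameters are illustrative, X_train/y_train stand in for the standardized features and training targets, and the GPU flag depends on the XGBoost version):

from xgboost import XGBRegressor

def fit_gpu_xgb(X_train, y_train):
    # Illustrative configuration; the write-up does not list the actual hyperparameters
    model = XGBRegressor(
        n_estimators=500,
        max_depth=6,
        learning_rate=0.05,
        subsample=0.8,
        tree_method="hist",
        device="cuda",   # XGBoost >= 2.0; on 1.x use tree_method="gpu_hist" instead
    )
    model.fit(X_train, y_train)
    return model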

Weight Allocation Strategy

This part is central to the entire system. We adopted a multi-level weight allocation approach: seed project weights, originality scores, and dependency relationship weights. Each layer has its unique calculation logic.
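
The write-up keeps this layer at a high level, but the composition could look roughly like the following hypothetical sketch, where seed_weight (Level 1), originality (Level 2), and dep_weights (Level 3) are assumed inputs from the three layers:

def allocate(seed_weight, originality, dep_weights):
    # seed_weight: Level-1 weight assigned to the repo
    # originality: Level-2 share of value the repo keeps for itself, in [0, 1]
    # dep_weights: Level-3 relative weights over its direct dependencies
    allocation = {"__self__": seed_weight * originality}
    total = sum(dep_weights.values()) or 1.0
    for dep, w in dep_weights.items():
        allocation[dep] = seed_weight * (1.0 - originality) * w / total
    return allocation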


Throughout this project, the biggest takeaway was the critical importance of data preprocessing. Feature standardization might seem like a minor issue, but it profoundly impacts the final results. Often, poor model performance isn’t due to the algorithm itself, but rather how the data is fed into the model.

Furthermore, GPU acceleration is truly valuable, especially when dealing with large-scale graph data. However, careful memory management is crucial, as GPU memory is much more precious than CPU memory.

The code has been uploaded, and I welcome any discussions. This type of ecosystem analysis project is quite interesting, offering both engineering challenges and opportunities for algorithmic optimization.

Author: summer
renzhichua1@gmail.com
Version/Date: v2.12 / 2025-07-02


1. Data-Driven and Systematized Engineering

Facing the complex task of quantifying contributions to the Ethereum ecosystem, our core design philosophy is to build a robust, reproducible, and deeply data-driven systematized solution. We believe that excellent results do not just stem from a single clever algorithm, but rather rely on meticulously engineered design at every stage, from data processing and feature construction to model training and weight allocation.

Our solution aims to minimize reliance on hard-coded rules and subjective assumptions, instead building upon the following principles:

  • Maximize Data Utilization: We not only leverage raw training data but also extract implicit relationships from existing data through Data Augmentation techniques to enhance the model’s generalization ability.
  • Deep Feature Engineering: We believe that the quality of features directly determines the upper limit of model performance. Therefore, we have constructed a comprehensive system of multi-dimensional features, including network topology, team quality, historical performance, contributor profiles, and temporal evolution.
  • Layered Machine Learning: For the multi-level (Level 1, 2, 3) weight allocation tasks in the competition, we adopted different strategies. Notably, we elevated the Level 2 (originality) weight allocation, traditionally dependent on heuristic rules, into an independent machine learning task.

This report will focus on elaborating the key innovations in our scheme: machine learning modeling for Level 2 weights, and Enhanced Feature Representation for relation prediction.

2. Key Innovations: Layered Machine Learning and Enhanced Feature Representation

Our system features deep innovation in two key areas, aiming to replace fixed rules with learned patterns, thereby improving overall accuracy and robustness.

Machine Learning Modeling for Level 2 (Originality) Weights

“Originality” is a highly subjective concept. Traditional approaches typically rely on heuristic rules based on dependency count (e.g., more dependencies mean less originality). We believe this method is too crude and fails to capture complex realities.

To address this, we designed the ImprovedLevel2Allocator module, which transforms originality scoring itself into a supervised learning problem.

The key idea is: If a project frequently wins in direct comparisons with the libraries it depends on, this strongly indicates that it generates significant added value itself, i.e., it has high originality.

Our implementation steps are as follows:

  1. Signal Extraction: We filter all direct comparison records between “parent projects” and their “child dependencies” from the training data. For example, when web3.py is compared to its dependency eth-abi, the result (win/loss and multiplier) becomes a strong signal for measuring web3.py’s originality.
  2. Feature Construction: We build a set of feature vectors specifically for predicting originality for each core project (seed repo). Key features include:
    • Win rate against dependencies (win_rate_against_deps): This is the most crucial metric, directly reflecting the project’s ability to surpass its underlying dependencies.
    • Dependency count and depth (dependency_count, max_dependency_depth): Serve as basic penalty terms.
    • Comprehensive quality metrics: Reusing features from the main feature library such as team quality, network influence, and historical performance.
  3. Model Training: We use a GradientBoostingRegressor model, with these features as input and a “true” originality score (estimated by combining multiple signals) as the label, for model training. Cross-validation results show that the model’s R² score reached a respectable level, proving the feasibility of this approach.

Through this method, we upgraded the evaluation of originality from a simple “rule engine” to an “intelligent model” capable of learning complex patterns from data, whose judgments are far richer and more precise than simple dependency counting.
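
A minimal sketch of this step, assuming a feature matrix X with columns such as win_rate_against_deps and dependency_count and a label vector y of estimated originality scores (all names here are illustrative):

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def fit_originality_model(X, y):
    # X: one row per seed repo (win_rate_against_deps, dependency_count,
    #    max_dependency_depth, team/network/history quality metrics, ...)
    # y: estimated "true" originality scores combining multiple signals
    model = GradientBoostingRegressor(n_estimators=300, max_depth=3, learning_rate=0.05)
    cv_r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()   # report model fit
    model.fit(X, y)
    return model, cv_r2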

Enhanced Feature Representation and Data Augmentation for Relation Prediction

When predicting the relative contribution between two repositories (Repo A vs. Repo B), how their differences are presented to the model is crucial. Simply subtracting the two repositories' feature vectors (feature_A - feature_B) loses a lot of scale and proportional information.

To address this, in AdvancedEnsemblePredictor, we designed the _create_enhanced_feature_vector function to generate a “feature group” for each original feature, containing multiple comparison methods:

  • Simple difference (A - B): Captures absolute differences.
  • Safe ratio (A / B): Captures relative proportions, crucial for multiplicative features (e.g., star count).
  • Log ratio (log(A) - log(B)): Insensitive to data scale, effectively handles long-tail distributed features, and aligns formally with the competition’s Logit Cost Function.
  • Normalized difference ((A - B) / (A + B)): Constrains differences to the [-1, 1] range, eliminating the influence of units.
  • Domain-specific transformations: Applying the most suitable transformations for different feature types (e.g., network centrality, count values, scores), such as logarithmic transformation, square root transformation, etc.

This enhanced representation greatly enriches the information provided to the gradient boosting tree model, allowing it to understand the relationship between two entities from different angles, thereby making more accurate predictions.
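
A sketch of such a feature group for a single raw feature measured on both repos (the epsilon guard and the set of transformations shown are illustrative; the actual _create_enhanced_feature_vector covers more feature-type-specific transforms):

import numpy as np

def enhanced_pair_features(a, b, eps=1e-8):
    # a, b: the same raw feature value for Repo A and Repo B (assumed non-negative)
    return {
        "diff": a - b,                                   # absolute difference
        "ratio": (a + eps) / (b + eps),                  # safe ratio
        "log_ratio": np.log(a + eps) - np.log(b + eps),  # scale-insensitive
        "norm_diff": (a - b) / (a + b + eps),            # bounded to [-1, 1]
    }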

Furthermore, we introduced transitivity-based data augmentation. If the training data shows A > B and B > C, we generate a weakly labeled sample A > C and add it to the training set. This effectively expands the training data volume, helping the model learn more globally consistent ranking relationships.
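
And a sketch of the transitivity-based augmentation, assuming comparisons are available as (winner, loser, multiplier) tuples; combining multipliers by multiplication is one plausible rule, not necessarily the exact one used:

from collections import defaultdict

def augment_transitive(comparisons):
    # comparisons: iterable of (winner, loser, multiplier) from the training data
    beats = defaultdict(dict)
    for winner, loser, m in comparisons:
        beats[winner][loser] = m
    augmented = []
    for a, direct in beats.items():
        for b, m_ab in direct.items():
            for c, m_bc in beats.get(b, {}).items():
                if c != a and c not in direct:
                    augmented.append((a, c, m_ab * m_bc))  # weakly labeled sample A > C
    return augmented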

3. Conclusion

Our solution is an end-to-end, highly engineered machine learning framework. It successfully extends the depth of machine learning application from “single prediction tasks” to “systematic decision processes” through layered modeling (building a dedicated model for Level 2 originality) and enhanced feature representation (providing richer information to the main predictor). We believe that this approach, emphasizing data-driven methods, system robustness, and engineering details, is a solid path towards more accurate and stable quantification results. That’s all. If you have any questions, please contact me.


1. Problem

Translating contributions to the Ethereum ecosystem into weights is an interesting problem. We saw this as an impact quantification problem, to assess the value each repo brings to Ethereum.

2. Our Approach

We rephrased the question as: “What is the dollar value that each repo generates for Ethereum?” We took inspiration from the Relentless Monetization evaluation technique by the Robin Hood Foundation, which is used to measure the benefit-cost ratio for NGOs, and noted that projects like VoiceDeck also use it to measure the impact of journalism. We therefore decided to use LLMs like GPT-5 and Gemini to help generate a gross and net benefit calculation for each of the 45 repos.

3. Key Learnings

  1. Relentless Monetization accounts for the cost in order to give a benefit-cost ratio (which translates to “for every 1 dollar, Y dollars’ worth of value”). We needed a way to just get an amount for the overall benefit generated, as the cost to develop each repo was unavailable.
  2. In the end, we had to resort to using POML to structure the system prompt for the LLMs. This enabled deterministic responses from the LLMs, which was particularly important for the subsequent benefit report generation steps. You can view the system prompt here: [GitHub - dipanshuhappy/impact-quantifier-system-prompt] or use the GPT plugin here: [ChatGPT - Impact Quantifier].
  3. Gemini 2.5 Pro had a more conservative approach, while GPT-5 had a decent approach. To view the prompt responses for the repositories, you can look at:
  4. Gemini took a more holistic and broad approach to assessing benefit than GPT-5.

4. Solution

The LLM can access the internet to gather the relevant context of the repository, such as its GitHub link, and then runs the following process:

Step 1: Define Outcomes

Clear outcomes of the project are defined. This includes listing both tangible and intangible outcomes, drawing especially on the README files of each repo.

For each outcome, the number of beneficiaries reached and the benefit per beneficiary (in dollar value) is defined or estimated by the LLM.

Step 2: Measurement of Causal Effect

This step attempts to quantify what percentage of these outcomes can be fairly attributed as a result of the repository exclusively, as opposed to other factors or network effects.

Note: This technique emphasizes referencing related studies, papers, and international reports and citing them as proxy sources for quantifying attribution and benefits. Providing clear evidence of data and numbers at every step is the most critical aspect of this method.

Step 3: Calculating Gross Benefit

For each of the listed outcomes, the benefit per outcome is calculated by:

  • Outcome 1 = (Number of Beneficiaries) × (Benefit per Beneficiary)
  • Outcome 2 = (Number of Beneficiaries) × (Benefit per Beneficiary)
  • Outcome 3 = (Number of Beneficiaries) × (Benefit per Beneficiary)
  • …
  • Outcome N = (Number of Beneficiaries) × (Benefit per Beneficiary)

The Gross Benefit is the summation of all benefits thus calculated per outcome.

Gross Benefit = Sum(Benefit per Outcome_i) for i=1 to N

Step 4: Counterfactual Analysis

This calculates the net incremental benefit of the project by adjusting for the loss or gain in benefits if the repository had not existed. For example, the counterfactual for a repo like viem not existing is that developers would simply use ethers.js instead.

Net Benefit = Gross Benefit - Counterfactual

Step 5: Discounted Future Benefits

Finally, the net benefit amount is discounted to account for the decreasing value of a dollar in the years following the repository’s creation.

Discounted Net Benefit = Net Benefit / (1 + r)^t

The discounted net benefit/net present value of the benefit thus calculated is taken as the outcomes (in dollars) generated per repository.

After getting the gross benefit values for every repository, the next step was to normalize them into weights adding up to 1. Submitting results from Gemini 2.5 Pro and GPT-5 individually each yielded an error of roughly 10. Simply taking the average of the two sets of results was about 35% better than either individual approach, giving an error of only 6.8.
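
A small sketch of this normalization and averaging step (the dollar figures below are placeholders, not the actual LLM outputs):

def to_weights(benefits):
    # benefits: {repo: discounted net benefit in dollars}
    total = sum(benefits.values())
    return {repo: value / total for repo, value in benefits.items()}

gemini = to_weights({"go-ethereum": 9.0e9, "solidity": 7.5e9})    # placeholder values
gpt5 = to_weights({"go-ethereum": 1.2e10, "solidity": 6.0e9})
ensemble = {repo: (gemini[repo] + gpt5[repo]) / 2 for repo in gemini}   # still sums to 1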

5. Conclusion

This solution demonstrated how Relentless Monetization can be coupled with an LLM in order to provide a value score for a repository. It does have its limitations, but I believe these can be accounted for with the right context engineering and fine-tuning, as well as by incorporating other concrete impact metrics into the weight calculations. Moreover, combining different models like Gemini and GPT yielded better scores than using each one individually. Further refinement is still in progress and I will share updates as the competition progresses.

Deep Funding: Weighing OSS Contributions to Ethereum

What I built

I predict weights for the 45 seed repositories (weights sum to 1) using a simple hybrid: a juror-aligned core trained on the pairwise votes in train.csv, plus a handful of GitHub activity/health signals. Then I add a light category layer so the output is readable (execution/consensus/infra/tooling/security/specs/apps/other). Everything normalizes to 1 and is exported as repo,parent,weight.


Why this setup

The scoring is about how well your weights match human juror comparisons (squared error on log weight ratios with their multiplier). So I anchored on the juror data first, then used GitHub signals to keep things robust and explainable.

  • Juror‑aligned core (a minimal code sketch follows this list)
    • Bradley–Terry (BT): From train.csv I convert each A vs B (+ multiplier) into a log-ratio target and solve a regularized system. Softmax → BT weights on the simplex.
    • Pairwise logistic (feature deltas): For each training pair I compute feature(B) − feature(A) and fit a logistic model to predict the juror’s pick. Coefficients become a per-repo utility (normalized).
  • GitHub-derived signals (all normalized, light log transforms)
    • Vitality: stars, forks, contributors, 365-day commits, PR acceptance, issue volume.
    • Momentum: extra weight on recent commits (180/365 blend).
    • QF-style breadth: √contributors × √commits_365 (breadth without raw size dominance).
    • Contributor overlap (CCI): how much a repo’s contributors overlap with other seed repos (normalized by its own size).
    • Semantic proxy (super lightweight): avg commit-message length + total issue comments (both log-scaled). It’s a practical text signal, not embeddings.
  • Categories (for interpretability)
    I assign each repo to execution / consensus / infra / tooling / security / specs / apps / other using simple rules over topics, README, and description. I apply mild multipliers per category, cap “other” so it can’t dominate, and do a gentle shrink toward a balanced mix. This keeps the chart sane without steamrolling the learned signal.
  • Ensemble
    Additive blend of: BT, logistic, vitality, QF, momentum, CCI, semantic → apply category layer → cap/shrink → normalize to sum = 1.
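
Here is the minimal sketch referenced above of the juror-aligned core: BT as a ridge-regularized least-squares fit of log-ratio targets, plus the pairwise logistic on feature deltas (the pair encoding and sklearn usage are my assumptions, not a verbatim excerpt from the notebook):

import numpy as np
from sklearn.linear_model import LogisticRegression

def bt_weights(pairs, n, lam=1e-3):
    # pairs: (i, j, t) with t = ±ln(multiplier), positive when the juror preferred j
    A = np.zeros((len(pairs), n))
    y = np.zeros(len(pairs))
    for k, (i, j, t) in enumerate(pairs):
        A[k, j], A[k, i], y[k] = 1.0, -1.0, t
    s = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)   # regularized log-strengths
    z = np.exp(s - s.max())
    return z / z.sum()                                        # softmax → weights on the simplex

def pairwise_logistic_utility(feature_deltas, juror_picked_b):
    # feature_deltas: rows of feature(B) - feature(A); juror_picked_b: 1 if juror chose B
    clf = LogisticRegression(max_iter=1000).fit(feature_deltas, juror_picked_b)
    return clf.coef_.ravel()   # coefficients turn per-repo features into a utility score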

What went into the CSV

  • Only train.csv is used for fitting BT and the logistic.
  • test.csv (the 45 repos) are scored with the trained pieces + GitHub features.
  • Weights are normalized to exactly 1.
  • No leaderboard peeking, no use of public/private test labels (obviously).

Sanity check (tiny diagnostic)

I added a one-cell ablation at the end to check that components actually help on train (same log-ratio cost form as the challenge). Results from my run:

  • Base train cost: 5.7603 (lower is better)
  • Removing BT: +0.1110
  • Removing CCI: +0.0374
  • Removing LOGI: +0.0326
  • Removing SEM: +0.0000
  • Removing QF: +0.0000
  • Removing VITAL: −0.0113
  • Removing MOM: −0.1110

Takeaway: the juror-anchored bits (BT, LOGI) are doing the heavy lifting; CCI adds a small lift; momentum (and a bit of vitality) look over-weighted for this split; QF/semantic were neutral here (likely redundant with the core). I’m leaving the weights as-is for the submission to avoid over-tuning on public data, but I’m flagging momentum/vitality/semantic as the first places I’d tighten next.


What the results look like (qualitative)

  • The top range is mostly clients (execution + consensus), plus Solidity/Vyper/specs, and important tooling/infra.
  • Category pie looks believable: clients + infra/tooling dominate; “other” is small (by design), so the story is legible.

Limitations (being honest)

  • Category assignment is heuristic; some repos will still land in other. I bounded it so it can’t drown the distribution.
  • The “semantic” piece is a proxy, not real embeddings over issues/PRs/README.
  • Contributor overlap doesn’t weight recency or contribution intensity; it’s a first pass.
  • API reality: GitHub rate limits and missing metadata happen. I cache, but depth varies by repo.

If I had more time (roadmap)

  • Replace the semantic proxy with embeddings over README/issues/PRs and use those for category + similarity.
  • Move to a weighted bipartite centrality on the repo–contributor graph (with recency/volume).
  • Light manual pass or embedding-based classifier to refine categories for high-weight repos.
  • Re-tune ensemble weights using ablation guidance (and keep BT/LOGI as the backbone).

That’s it. I wanted something juror-anchored, transparent, and defensible under a quick read. Happy to take feedback and improve it. :slight_smile:

[Submission] Robust WLS+IRLS Bradley-Terry for Ethereum Deep Funding L1

Author: Casuwyt Periay (alertcat)

Contact: alertcat7212@gmail.com


TL;DR

I fit the Level‑1 task with a direct weighted least squares (WLS) solver of the official objective. I add three robustness layers:

  1. Juror‑aware reweighting (per‑juror MAD scale).

  2. Huber IRLS on pairwise residuals (downweights outliers).

  3. Temperature + prior calibration by cross‑validation (CV).

The method is deterministic, fast, and improved the public leaderboard score to 4.8701.


1) Problem framing

Let each repo i have a log‑strength s_i. The submitted weights are w_i = exp(s_i) / sum_j exp(s_j).

For each juror comparison k involving repos a_k and b_k, and multiplier m_k >= 1, define

  • If juror chose b_k: t_k = +ln(m_k)

  • If juror chose a_k: t_k = -ln(m_k)

Official raw cost (same as leaderboard inner term):


L_raw(s) = mean over k of ( (s_{b_k} - s_{a_k}) - t_k )^2

So fitting s by least squares on pairwise differences is exactly optimizing the leaderboard objective (before the final softmax normalization).


2) Core model: closed‑form WLS via a Laplacian

I solve the weighted problem


minimize sum_k w_k * ( (s_b - s_a) - t_k )^2

by building a weighted Laplacian system L s = r. For each pair k with nodes a, b and target t, do:


L[a,a] += w_k

L[b,b] += w_k

L[a,b] -= w_k

L[b,a] -= w_k

r[a] += -w_k * t

r[b] += +w_k * t

Add a small ridge lambda * I (e.g., lambda = 1e-4) for numerical stability, pin one node to 0 (gauge fixing), solve, then recenter s to mean 0.

This yields a deterministic least‑squares solution in microseconds for 45 nodes.
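
A minimal NumPy sketch of this construction and solve (the function and variable names are mine, not excerpts from solution_pro.py):

import numpy as np

def solve_wls(pairs, n, ridge=1e-4):
    # pairs: iterable of (a, b, t, w) with node indices a, b,
    # target t = ±ln(multiplier), and per-sample weight w
    L = np.zeros((n, n))
    r = np.zeros(n)
    for a, b, t, w in pairs:
        L[a, a] += w; L[b, b] += w
        L[a, b] -= w; L[b, a] -= w
        r[a] -= w * t
        r[b] += w * t
    L += ridge * np.eye(n)           # small ridge for numerical stability
    L[0, :] = 0.0; L[0, 0] = 1.0     # pin node 0 to fix the gauge
    r[0] = 0.0
    s = np.linalg.solve(L, r)
    return s - s.mean()              # recenter log-strengths to mean 0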


3) Robustness

3.1 Base per‑sample weights

Use clipped log‑multipliers to avoid extreme votes dominating:


w_base_k = clip( abs( ln(m_k) ), 0.25, 2.0 ) + 1e-3

3.2 Juror‑aware multiplier

Compute a robust scale per juror j:


scale_j = 1.4826 * median( abs( t - median(t) ) ) # MAD

g_j = clip( median(scale) / scale_j, 0.5, 2.0 )

Set w_k = w_base_k * g_{juror(k)}.

3.3 Huber IRLS

Run ~6 rounds of IRLS. Each round, compute residuals


r_k = (s_b - s_a) - t_k

and update per‑sample weights with the Huber influence (delta = 1.0 by default):


omega_k = 1 if abs(r_k) <= delta

omega_k = delta / abs(r_k) otherwise

w_k <- max( w_k * omega_k, 1e-3 )

Then re‑solve the WLS system with the updated w_k.
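
A sketch of the IRLS loop on top of the solver sketched in Section 2 (again illustrative, reusing the solve_wls helper from that sketch):

def huber_irls(pairs, n, delta=1.0, rounds=6):
    # pairs: list of [a, b, t, w]; the per-sample weights w are updated each round
    pairs = [list(p) for p in pairs]
    s = solve_wls(pairs, n)
    for _ in range(rounds):
        for p in pairs:
            a, b, t, w = p
            resid = (s[b] - s[a]) - t
            omega = 1.0 if abs(resid) <= delta else delta / abs(resid)
            p[3] = max(w * omega, 1e-3)          # Huber downweighting with a floor
        s = solve_wls(pairs, n)                  # re-solve with updated weights
    return s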


4) Calibration to submission weights

Map s to final weights with temperature‑scaled softmax plus a tiny uniform prior:


w_i(T, eps) = (1 - eps) * exp( s_i / T ) / sum_j exp( s_j / T ) + eps * (1 / N)

Choose (T, eps) by CV (Section 5).
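
As a small sketch, the calibration map from log-strengths to submission weights (T and eps here are placeholders for the CV-selected values):

import numpy as np

def calibrate_weights(s, T=1.0, eps=0.01):
    # temperature-scaled softmax mixed with a uniform prior
    z = np.exp((s - s.max()) / T)            # subtract max for numerical stability
    return (1.0 - eps) * z / z.sum() + eps / len(s)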


5) Cross‑validation (CV)

I pick:

  • Huber delta and ridge lambda for the WLS+IRLS solver;

  • Temperature T and uniform mix eps for the final softmax;

by 5‑fold CV minimizing the same raw cost:


mean over k in validation fold of ( (s_b - s_a) - t_k )^2

This avoids both “too peaky” (small T) and “too flat” (large T or large eps) solutions.


6) Optional small ensemble

Optionally add:

  • Colley rating using weights clip(abs(ln m), 0.25, 2.0);

  • A tiny BT‑Logit trained with those sample weights.

Blend with a small simplex grid search. By default I submit only the calibrated WLS model (best for me).


7) Implementation highlights

  • Single script: solution_pro.py.

  • Inputs: data/train.csv, data/test.csv.

  • Output: submission.csv with columns repo,parent,weight (weights sum to 1).

  • Dependencies: numpy, pandas (PyTorch optional).

  • Deterministic: no stochastic optimizer in the core solver; fixed seed for CV splits.

How to run:


python solution_pro.py --data_dir data --out submission.csv --device auto

# Optional:

# --include_colley

# --include_bt


8) Why this fits the competition

  • The training loss is literally the leaderboard’s inner term:

min_s sum_k w_k * ( (s_b - s_a) - t_k )^2.

  • Robust yet conservative weighting: clipped |ln m|, juror MAD scaling, Huber IRLS.

  • CV‑chosen temperature and prior to avoid over/under‑confidence.

  • CUDA acceleration significantly speeds up computations.


9) Diagnostics & sanity checks

A small quick_check.py computes, for any repo:

  • number of pairs,

  • wins/losses,

  • net sum of (+/- ln m) across its pairs.

This flags cases where a repo shows weak pairwise evidence but somehow spikes in weight (often due to mishandling of multipliers).
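
A hedged sketch of such a diagnostic (the train.csv column names repo_a, repo_b, choice, and multiplier are assumptions, not confirmed from quick_check.py):

import numpy as np
import pandas as pd

def quick_check(train_csv, repo):
    df = pd.read_csv(train_csv)
    sub = df[(df["repo_a"] == repo) | (df["repo_b"] == repo)]
    wins = int((sub["choice"] == repo).sum())
    # signed evidence: +ln(m) when this repo wins, -ln(m) when it loses
    net = float((np.where(sub["choice"] == repo, 1.0, -1.0) * np.log(sub["multiplier"])).sum())
    print(f"{repo}: pairs={len(sub)}, wins={wins}, losses={len(sub) - wins}, net_ln_m={net:.3f}")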


10) Results snapshot

  • Public leaderboard: 4.8701 (top‑10 at the time of writing).

  • Qualitative ordering aligned with L1 evidence (e.g., strong solidity, go‑ethereum, major clients, core tooling).


11) Ablations (what mattered)

  • Closed‑form WLS vs. SGD/Adam: deterministic and stable on tiny data.

  • Juror MAD scaling: prevents high‑variance jurors from dominating.

  • Huber IRLS: dampens a few inconsistent or extreme comparisons.

  • CV calibration: controls peaky or flat distributions.


12) Limitations & future work

  • Juror‑grouped CV would be closer to the private split if juror IDs are fully available.

  • Feature priors (e.g., usage telemetry, EIP references) can be added as quadratic penalties on s for L2 and beyond.

  • A full hierarchical juror model is an elegant but heavier alternative; MAD+IRLS is a strong practical baseline.


Appendix A — Minimal usage


python solution_pro.py --data_dir data --out submission.csv --device auto

Optional extras:


python solution_pro.py --data_dir data --out submission.csv --include_colley --include_bt

Key defaults (from CV):

  • Huber delta and ridge lambda from small grid by 5‑fold CV.

  • Temperature T and prior eps likewise by CV.


Thanks to the organizers and jurors!