Deepfunding Ethereum Challenge
Hey there, David Gasquez over here! I’m excited to share my approach to the Quantifying Contributions of Open Source Projects to the Ethereum Universe challenge. This write-up focuses specifically on the “assigning weights to 45 open source repositories relative to Ethereum” competition track, not the other ones!
Competition
The goal was to assign weights indicating the relative contribution of 45 core open-source repositories to the Ethereum universe. Ideally, we’d leverage machine learning, LLMs, or anything that could make it easy to scale human expertise. Before jumping into the models themselves, let’s quickly see what the data looks like!
Exploration
I did a small analysis of the competition train.csv, containing 407 labeled comparisons across 321 unique repo matchups, covering 47 repositories and 37 jurors. Here are some insights:
- 252 matchups (78.5%) were seen by a single juror. Only 69 matchups have multiple opinions.
- The median juror handled 12 comparisons (min 2, max 23), leaving a long tail of lightly sampled jurors.
- The six busiest jurors cast 28% of all votes, so their behavior disproportionately shapes scores.
- Only around 22% of matchups receive validation beyond a single opinion.
- Order bias is negligible in aggregate (A vs B is the same as B vs A), but some jurors (e.g., L1Juror16, L1Juror19) show biases/preferences (weak evidence without more data) for one side.
- When two or more jurors review the same matchup, 58% reach unanimous decisions; the remainder split 2–1, highlighting a small set of contentious comparisons worth re-checking.
- Multiplier usage varies greatly across jurors. Heavy-handed jurors (especially L1Juror8 with 17 decisions) can swing repo scores by a lot!
- There are no logical per-juror inconsistencies (e.g., a juror saying A>B, then B>C but also C>A)! Something I was worried about. At the same time, there are some inconsistencies when taking into account the intensity (A is 3x better than B, B is 2x better than C, but A is only 2x better than C). This suggests jurors are not very good at quantifying the intensity of their preferences.
- While some jurors agree on which repository is better, they disagree substantially on how much better.
- The graph of comparisons is sparse in some regions, which leads to high uncertainty in scores for some clusters. Repos like alloy-rs/alloy act as anchors, and a few comparisons have a disproportionate impact on scores.
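Most of these coverage numbers fall out of a short groupby. Here is a minimal sketch of how I computed them, on toy data and with assumed column names (`juror`, `repo_a`, `repo_b`) since the real train.csv schema may differ:

```python
import pandas as pd

# Toy stand-in for train.csv; column names are assumptions.
df = pd.DataFrame({
    "juror":  ["J1", "J1", "J2", "J2", "J3"],
    "repo_a": ["a", "a", "a", "b", "c"],
    "repo_b": ["b", "c", "b", "c", "d"],
})

# Canonicalize matchups so (a, b) and (b, a) count as the same pair.
pairs = df.apply(lambda r: tuple(sorted((r.repo_a, r.repo_b))), axis=1)

# How many distinct jurors saw each matchup?
jurors_per_pair = df.assign(pair=pairs).groupby("pair")["juror"].nunique()
single_juror = (jurors_per_pair == 1).sum()
print(f"{single_juror}/{len(jurors_per_pair)} matchups seen by one juror")
```

The same grouped frame also gives the per-juror comparison counts and the unanimity stats with one more groupby.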
Score Resiliency
One thing I was curious about was how much the scores would change if we had a different set of jurors or matchups. This is important because we want the scores to be robust and not overly sensitive to who voted or which comparisons were made. The simplest way to test this is to do a “leave one out” analysis: remove a juror or random comparisons, derive the weights from train.csv using the same method as the competition, and then check how much the scores change compared to the original.
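The leave-one-out loop itself is straightforward. A minimal sketch, using win rate as a stand-in scorer (the competition derives weights with gradient descent instead, but the leave-one-out logic is identical) and assumed column names:

```python
import pandas as pd

# Toy comparisons; in practice, load train.csv. Column names are assumptions.
df = pd.DataFrame({
    "juror":  ["J1", "J1", "J2", "J2", "J3", "J3"],
    "winner": ["a",  "b",  "a",  "c",  "a",  "b"],
    "loser":  ["b",  "c",  "b",  "a",  "c",  "c"],
})

def win_rates(data):
    # Stand-in scorer: wins / appearances.
    wins = data["winner"].value_counts()
    seen = pd.concat([data["winner"], data["loser"]]).value_counts()
    return (wins.reindex(seen.index, fill_value=0) / seen).sort_index()

base = win_rates(df)
for juror in df["juror"].unique():
    # Refit without this juror and measure how far the scores move.
    loo = win_rates(df[df["juror"] != juror]).reindex(base.index, fill_value=0)
    print(juror, round(float((loo - base).abs().mean()), 3))
```

Swapping `win_rates` for the competition's own fitting method gives the juror-impact table below.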
Looking at the juror impact, we can see that some have a much larger impact than others when they are removed. Here are the top 5 most “impactful” jurors.
| Juror | Comparisons | Mean Change | Std Change |
|---|---|---|---|
| L1Juror8 | 17 | 33.54% | 29.31% |
| L1Juror13 | 23 | 5.45% | 2.64% |
| L1Juror27 | 21 | 4.86% | 2.41% |
| L1Juror28 | 20 | 4.74% | 2.28% |
| L1Juror29 | 21 | 4.35% | 2.41% |
The table shows that L1Juror8’s 17 comparisons (4.2% of the data) have a disproportionate impact due to the extreme variance in the scores.
When checking which repositories are most affected by removing a juror, we see that some repositories are more sensitive than others. Here are the top 5 most “variance-sensitive” repositories.
This points to a potentially large impact on the scores when new comparisons are added. It makes me suspect the final weights are quite sensitive to the specific training data shared. The final private leaderboard scores will shake things up a lot!
Once the complete competition dataset is out, I’ll do a proper analysis of the score resiliency. It would be interesting to know things like “Removing L1JurorXX changes ethereum/go-ethereum’s weight from N → M” or “Removing the repository XXXX moves ethereum/go-ethereum from N → M”.
Approaches
For the actual modeling, I tried a few different approaches. I started with a simple baseline before moving to more complex models. I wrote a small custom cross-validation script to evaluate the models based on the pairwise cost function defined in the competition. This way, I could see how well each model was able to predict the juror comparisons. I ran both random and per-juror cross-validation.
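The per-juror split is the important part: a random row split leaks juror behavior between train and test folds. A minimal sketch of the hold-one-juror-out split, on toy data with an assumed schema:

```python
import pandas as pd

# Hypothetical schema; the real train.csv columns may differ.
df = pd.DataFrame({
    "juror": ["J1"] * 3 + ["J2"] * 3 + ["J3"] * 3,
    "pair":  list(range(9)),
})

# Per-juror CV: hold out all comparisons from one juror at a time, so the
# model is always scored on a juror it never saw during fitting.
folds = []
for juror in df["juror"].unique():
    test = df[df["juror"] == juror]
    train = df[df["juror"] != juror]
    folds.append((train, test))
    print(f"hold out {juror}: train={len(train)} test={len(test)}")
```

Each fold then gets scored with the competition's pairwise cost function; random-split CV runs the same loop over shuffled row indices instead.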
Baseline
The baseline model uses simple gradient descent to minimize a cost function based on logit differences. Everything comes from train.csv, with no external data. This is an effective way to get a first estimate of the weights, though the model is quite sensitive to the specific training data!
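A minimal sketch of that baseline: gradient descent on log-weights, with a squared-error cost between the modeled log-ratio and the juror's stated multiplier (the competition's exact cost function differs in its details, and the triples here are toy data):

```python
import numpy as np

# Toy (winner_idx, loser_idx, multiplier) triples standing in for train.csv.
comparisons = [(0, 1, 3.0), (1, 2, 2.0), (0, 2, 5.0)]
n_repos = 3
logw = np.zeros(n_repos)  # optimize in log space, normalize at the end

lr = 0.05
for _ in range(2000):
    grad = np.zeros(n_repos)
    for a, b, m in comparisons:
        # residual between the modeled and the stated log-ratio
        r = (logw[a] - logw[b]) - np.log(m)
        grad[a] += 2 * r
        grad[b] -= 2 * r
    logw -= lr * grad
    logw -= logw.mean()  # weights are only relative; fix the gauge freedom

weights = np.exp(logw)
weights /= weights.sum()
print(weights.round(3))
```

Note that the three toy comparisons are intentionally inconsistent (3× and 2× should compose to 6×, not 5×), so the optimizer settles on a least-squares compromise, just like with real jurors.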
ML
Once I had the baseline working, I tried a classic ML approach: take as much data per repository as possible and train a model to predict the weights. In this case, I used a few models (Regression, Random Forest, SVM). As for the features, I used:
- GitHub data such as stars, forks, issues, PRs, contributors, commits, etc.
- Top-contributor overlap
- For almost all of the repositories, I generated embeddings with Qwen3-Embedding-8B (one of the best right now!). I mostly used these to compute distances between repositories and the competition instructions, juror outputs, and other repositories. This way, I could capture some of the semantic meaning of the repositories in a few features and avoid using the large vector directly.
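The embedding trick is worth spelling out: rather than feeding a large vector to the model, each repo's embedding is collapsed into a handful of scalar similarities. A hedged sketch with random vectors standing in for the Qwen3-Embedding-8B outputs:

```python
import numpy as np

# Random stand-ins for the real embeddings; names are illustrative only.
rng = np.random.default_rng(0)
repo_embs = {name: rng.normal(size=8)
             for name in ["go-ethereum", "alloy", "solidity"]}
instructions_emb = rng.normal(size=8)  # embedding of the competition brief

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

features = {
    name: {
        "sim_to_instructions": cosine(emb, instructions_emb),
        # mean similarity to every other repo captures a rough "centrality"
        "mean_sim_to_repos": float(np.mean(
            [cosine(emb, other) for n, other in repo_embs.items() if n != name]
        )),
    }
    for name, emb in repo_embs.items()
}
print(features["go-ethereum"])
```

These scalars then sit alongside the GitHub stats in the feature matrix for the regression / random forest / SVM models.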
Even with the custom cross-validation, I did not get a very good result. The model was not able to generalize well to new data. That makes sense given the few data points we have. I think with more data, this approach could work much better.
Arbitron
A few weeks before the competition ended, I worked on a project inspired partly by this competition itself. The idea is to use a bunch of LLMs to do pairwise comparisons of repositories. The model is given two repositories, a few tools (search online, check GitHub), and then asked to compare them based on “their contribution to the Ethereum ecosystem”.
So, I created a new Arbitron contest with the 45 repositories and their description/stats and asked the models to compare them. Using nine different LLMs (GPT-5, Claude, Qwen, Mistral, etc.), I collected 612 comparisons, to which I applied the same cost function as the jurors.
This turned out to be a very effective way to get a new set of comparisons. The comparisons by themselves score quite well in cross-validation and on the public leaderboard! I held #1 on the public leaderboard most of the time without using any training data at all! When the new repositories were added, I only had to update a bunch of YAML files and get a few agents to compare them.
Since we had some training data, I ended up mixing the Arbitron comparisons with the juror ones. This way, I could get the best of both worlds. The Arbitron comparisons connect the graph while the juror ones add some guidance. Surprisingly, adding the juror comparisons to the Arbitron ones did not improve the score on the leaderboard!
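The mixing step can be as simple as concatenating the two comparison sets with a per-source weight, so juror votes can count more (or less) than LLM votes downstream. A hypothetical sketch (the column names and the 0.5 weight are illustrative assumptions, not the values I actually used):

```python
import pandas as pd

# Two comparison sets with the same schema (assumed column names).
jurors = pd.DataFrame({"winner": ["a"], "loser": ["b"], "mult": [3.0]})
arbitron = pd.DataFrame({"winner": ["b"], "loser": ["c"], "mult": [2.0]})

# Tag each row with a source weight; downstream, each comparison's cost
# term in the optimization is scaled by this weight.
mixed = pd.concat(
    [jurors.assign(weight=1.0), arbitron.assign(weight=0.5)],
    ignore_index=True,
)
print(mixed)
```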
Postprocessing
I did a few final touches to the weights:
- I shared the best weights with GPT-5 and Claude and asked them to “Act as an expert on the Ethereum ecosystem and adjust the weights”.
- I built a few “ensembles” by blending two or more weight CSVs in log space.
- I realized that smoothing Arbitron weights gave a slight improvement and started applying it to the juror weights.
- I post-processed the weights to fit a lognormal distribution (a heavy-tailed distribution).
The only thing that ended up working was the ensembling on log space. The rest did not improve the score on the final leaderboard.
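For reference, blending in log space is just a weighted geometric mean of the weight vectors, renormalized. A minimal sketch with made-up weight vectors:

```python
import numpy as np

def blend_log_space(w1, w2, alpha=0.5):
    # Weighted average of log-weights = geometric mean for alpha=0.5.
    logw = alpha * np.log(w1) + (1 - alpha) * np.log(w2)
    w = np.exp(logw)
    return w / w.sum()  # renormalize so the weights sum to 1

w_arbitron = np.array([0.5, 0.3, 0.2])
w_jurors = np.array([0.6, 0.25, 0.15])
blended = blend_log_space(w_arbitron, w_jurors)
print(blended.round(3))
```

Compared to averaging in linear space, the geometric mean is less dominated by whichever source assigns the larger weight, which matters for heavy-tailed weight distributions like these.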
Learnings
One of the biggest challenges when building ML/AI models for this competition was the lack of training data. Pure ML models didn’t work well at the repo-weighting task due to the limited samples.
The Arbitron approach is powerful. It doesn’t need any training data and produced a very good result. It would be interesting to see how well it scales with more repositories and more comparisons, especially on the cost side. It is not cheap to run a few hundred comparisons with multiple LLMs!
Even with Arbitron and the best post-processing, you don’t get much further than the baseline! The simple gradient-descent approach on train.csv, minimizing the cost function, would have given anyone a top-3 spot on the leaderboard. This means the competition could have been approached purely as an optimization problem. No need for fancy models or LLMs. Just optimize the weights to minimize the cost function. Of course, that approach doesn’t generalize at all.
I also suspect the training data has quite a bit of noise. Some jurors have a much larger impact on the scores than others just because of the specific comparisons they were assigned. I’d love to do a proper analysis on the resiliency and also think about better ways to collect the data and derive the weights.
This competition has been very exciting, and I learned a lot from it. Since the start, I’ve spent a lot of time thinking about the problem, suggesting alternatives, sharing feedback, and even writing a bunch of potential ideas for future competitions, some of which already made it into the current one!
Kudos to the Deepfunding team for organizing it! I know it wasn’t easy to organize a competition like this one and there were many challenges along the way. I’m looking forward to competing in the next one.
Feel free to reach out with any questions or suggestions!
Update: I’ve had some time to write up an idea for a meta-mechanism and to run an analysis on all the available data (private pairs too!), and published it on GitHub. Check it out! It has some interesting insights on the data and potential alternatives!