Model Submissions for Ethereum Deep Funding

Deep funding submission
Pond username: duemelin
Score: 0.004 on final leaderboard

Here is my writeup:


The background story here is one of first principles. Instead of trying to guess what jurors are thinking by adding more (potentially noisy) external data, this approach trusts the only real signal we have: the competition’s own data and its explicit scoring formula.


The Background Story: A First-Principles Approach

In a competition with noisy, subjective data, the biggest risk is overfitting. It’s easy to build a complex model with dozens of features that perfectly explains the training data but fails spectacularly on the real test set. My core hypothesis was that most external data (GitHub stars, commit counts, etc.) are noisy proxies for what the jurors actually value.

The competition, however, gives us one piece of pure signal: the exact mathematical formula it uses for scoring.

So, I adopted a first-principles approach. Instead of adding more data to the problem, I stripped it down to its mathematical core. The goal was not to build a psychological profile of each juror, but to find the set of weights that is the most mathematically consistent with the aggregate judgments we were given. This transforms the problem from “predicting human taste” to “solving a system of constraints.” The model is simple, deterministic, and directly optimizes for the one thing that matters: the leaderboard score.


The Write-Up

Here is the write-up I posted to the forum, incorporating the results from the script's output.


Submission: A First-Principles Approach via Direct Optimization

Participant: Alex
Final Internal Score (MSE): 4.2208

1. The Philosophy: Signal Over Noise

My approach is built on a simple premise: in a data-sparse environment, the most robust model is often the one that makes the fewest assumptions. Instead of incorporating external OSINT data, which can introduce noise and lead to overfitting, this model focuses exclusively on the ground truth provided: the train.csv file and the competition’s explicit cost function.

The core philosophy is to treat this as a direct optimization problem, not a feature-engineering one.

2. The Model: A Bradley-Terry Framework

The model is a classic Bradley-Terry implementation. It assumes each of the 45 repositories has a latent “strength” score (s_i). The model’s prediction for the comparison between repo A and repo B is simply the difference in their scores: s_B - s_A.

The competition’s evaluation metric is the mean squared error between this predicted log-ratio and the juror’s stated log-ratio (ln(multiplier)). Therefore, my model’s objective function is identical to the competition’s scoring function.

3. Implementation: One Script, No Dependencies

The entire process is contained in a single Python script that has no external data dependencies.

  1. Data Preparation: It maps each of the 45 repository URLs to a unique integer index. It then processes train.csv into a list of comparisons, keeping only those between the 45 target repos.
  2. Optimization: It uses the L-BFGS-B algorithm from the SciPy library to find the 45 latent strength scores that directly minimize the mean squared error across all 374 valid comparisons.
  3. Weight Generation: The final, optimized scores are converted into a probability distribution (summing to 1.0) using a softmax function.
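The three steps above can be condensed into a single function. This is a minimal sketch, not the exact script: the `(idx_a, idx_b, log_ratio)` comparison encoding is an assumption about how train.csv is preprocessed.

```python
import numpy as np
from scipy.optimize import minimize

def fit_direct(comparisons, num_repos):
    """comparisons: (idx_a, idx_b, log_ratio) triples, where log_ratio is
    ln(multiplier), signed so a positive value means repo B was preferred."""
    idx_a = np.array([a for a, _, _ in comparisons])
    idx_b = np.array([b for _, b, _ in comparisons])
    y = np.array([r for _, _, r in comparisons])

    def mse(s):
        # The model's prediction for each comparison is s_B - s_A,
        # so the objective is identical to the competition's scoring MSE
        return np.mean((s[idx_b] - s[idx_a] - y) ** 2)

    res = minimize(mse, x0=np.zeros(num_repos), method="L-BFGS-B")
    scores = res.x
    # Softmax converts the latent scores into weights summing to 1.0
    w = np.exp(scores - scores.max())
    return w / w.sum()
```

Because the latent scores are only identified up to a constant shift, the softmax (which is shift-invariant) is a natural way to turn them into a probability distribution.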

The process is deterministic, fast, and completely transparent.

4. Results & Interpretation

The optimization was successful, converging to a final internal MSE of 4.2208. The resulting weights reveal what the model learned about the jurors’ collective philosophy, purely from their pairwise choices:

Top 15 Repositories by Weight:

                                                      repo    parent    weight
9                         https://github.com/ethereum/eips  ethereum  0.167716
13                 https://github.com/ethereum/go-ethereum  ethereum  0.156800
16                    https://github.com/argotorg/solidity  ethereum  0.105199
8              https://github.com/ethereum/consensus-specs  ethereum  0.062509
11              https://github.com/ethereum/execution-apis  ethereum  0.059294
30       https://github.com/safe-global/safe-smart-account  ethereum  0.036416
19                  https://github.com/ethers-io/ethers.js  ethereum  0.035686
32                      https://github.com/sigp/lighthouse  ethereum  0.035314
20                   https://github.com/foundry-rs/foundry  ethereum  0.032588
27  https://github.com/openzeppelin/openzeppelin-contracts  ethereum  0.032228
42              https://github.com/ethpandaops/checkpointz  ethereum  0.030794
25             https://github.com/nethermindeth/nethermind  ethereum  0.029315
3                    https://github.com/chainsafe/lodestar  ethereum  0.024123
29                  https://github.com/prysmaticlabs/prysm  ethereum  0.020804
28                     https://github.com/paradigmxyz/reth  ethereum  0.017781

Key Insight: Without being told anything about the projects, the model learned to assign immense value to core protocol infrastructure and foundational standards. The top 5 projects (eips, go-ethereum, solidity, consensus-specs, execution-apis) represent the absolute bedrock of Ethereum. The model correctly inferred their importance simply by observing how consistently they “won” their comparisons against other projects.

5. Conclusion

In a data-sparse and subjective competition, directly optimizing the known objective function is a powerful and robust strategy. It avoids the risk of overfitting to external features and provides a clear, defensible set of weights based purely on the provided ground truth. I learned about the contest late and wish I had made more submissions.


How to Run This Model on Google Colab

Here are the step-by-step instructions to replicate this submission.

Step 1: Set Up Your Colab Environment

  1. Go to colab.research.google.com and create a new notebook.

Step 2: Upload the Competition Data

  1. On the left-hand sidebar, click the folder icon.
  2. Click the “Upload to session storage” icon (it looks like a page with an upward arrow).
  3. Upload two required files from the competition:
    • train.csv
    • test.csv
  4. Wait for them to finish uploading. You should see them in the file list.

Step 3: Paste and Run the Code

  1. Copy the entire Python script from here (the “Direct Optimization with a Bradley-Terry Model” script).
  2. Paste it into a single code cell in your Colab notebook.
  3. Run the cell by clicking the play button or pressing Shift + Enter.

Step 4: Verify the Output

  1. The script will run for a few seconds. You should see output in your notebook similar to the following:
    • Mapped 45 unique repositories...
    • Processed 374 valid comparisons...
    • Optimization successful! Final cost (MSE): 4.2208
    • A list of the top 15 repositories.
  2. In the file browser on the left, a new file named submission_direct_optimization.csv will appear.

Step 5: Download and Submit

  1. Click the three dots next to submission_direct_optimization.csv and select Download.
  2. You can now submit this file to the competition platform.

Name: kwachunling
Submission score: top 10 on provisional

Methodology

- A Graph Neural Network based Approach

1. The Philosophy: Value is Relational, Not Intrinsic

Previous methodologies I had tried and submitted shared a fundamental flaw: they treated the 45 repositories as a list of independent entities. They modeled features, but failed to model the structure of the problem itself.

My approach begins from a different first principle: the Ethereum ecosystem is a graph. Repositories are nodes, and the jurors’ comparisons are the weighted, directed edges that define the flow of influence and value. A project’s importance is not a property of its isolated features, but an emergent property of its position within this network topology.

To respect this structure, I employed a Graph Neural Network (GNN), a class of model designed specifically to learn from relational data.

2. The Methodology: Modeling the Flow of Information

Instead of training a model to predict a score from a feature table, the GNN learns a project’s score by “passing messages” across the comparison graph.

  1. Graph Construction:

    • Nodes: The 45 repositories. Each node is initialized with a feature vector derived from our OSINT data (stars, forks, age, etc.).
    • Edges: Each juror comparison (A, B) with log-ratio m becomes a directed edge from A -> B with m as an attribute. These edges define the “network of judgment.”
  2. The GNN Model (Graph Convolutional Network):

    • I implemented a two-layer Graph Convolutional Network (GCN) using PyTorch Geometric.
    • Message Passing: In each layer, every node aggregates feature information from its neighbors (the other repos it was compared to). A node’s representation is iteratively updated based on its local neighborhood.
    • Learning: The model learns to perform this aggregation in a way that the final output score for each node, when compared pairwise, minimizes the competition’s log-ratio MSE. It learns not just what a project’s features are, but what they mean in the context of the projects it was judged against.

3. Implementation Details

The entire process is contained in a single script that uses the PyTorch Geometric library.

  • Node Features: I used the simple, robust enriched_feature_set.csv (without text embeddings) to initialize the nodes. These features were standardized before being fed into the graph.
  • Training: The GNN was trained for 200 epochs using an Adam optimizer. The loss function was a direct implementation of the competition’s MSE, calculated on the difference between the scores of the connected nodes for each edge.

4. Results & Conclusion

This approach forces the model to learn a more holistic and context-aware representation of value. For example, a repository with mediocre features might still achieve a high score if it consistently “wins” its comparisons against other highly-scored projects. The GNN captures this “strength-of-schedule” concept automatically. While it did score in the top 10 on the provisional leaderboard, it did not obtain any points on the final one.

By modeling the ecosystem as it is, a network of influence, we can learn a more robust and generalizable set of weights that respects the relational nature of value. This GNN submission represents a fundamentally different, and I believe more accurate, way of understanding the structure of the jurors’ collective intelligence. However, its performance on the final hidden test set shows that it overfit the training data.

code below

# ==============================================================================
#      DEEP FUNDING: THE NETWORK SCIENTIST'S MODEL (GRAPH NEURAL NETWORK)
#
# Author: N. Ode
#
# Philosophy: This submission rejects the premise that repositories are
# independent entities. The Ethereum ecosystem is a graph, and the jurors'
# comparisons define the edges of influence between nodes. A project's value
# is determined by its position within this network.
#
# This model uses a Graph Neural Network (GNN) to learn the importance of
# each repository. The GNN's "message passing" mechanism allows it to learn
# a node's score not just from its own features, but from the features and
# scores of the neighbors it's compared against.
#
# ==============================================================================

# ------------------------------------------------------------------------------
# Part 0: Environment Setup
# ------------------------------------------------------------------------------
print("Part 0: Installing and importing necessary libraries...")
# We need specialized libraries for graph machine learning
!pip install torch torch-scatter torch-sparse torch-cluster torch-spline-conv torch-geometric -f https://data.pyg.org/whl/torch-2.3.0+cu121.html -q
import pandas as pd
import numpy as np
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from sklearn.preprocessing import StandardScaler
import warnings

warnings.filterwarnings('ignore')
print("Libraries imported successfully.\n")

# ------------------------------------------------------------------------------
# Part 1: Graph Construction
# ------------------------------------------------------------------------------
print("="*60)
print("Part 1: Constructing the Ecosystem Graph...")
print("="*60)

# --- Load Data ---
try:
    train_df = pd.read_csv("/content/train.csv")
    test_df = pd.read_csv("/content/test.csv")
    # This model uses the simpler, robust feature set
    features_df = pd.read_csv("enriched_feature_set.csv")
    print("  Successfully loaded all necessary files.")
except FileNotFoundError as e:
    print(f"  FATAL ERROR: A required file was not found: {e}.")
    exit()

# --- 1. Define Nodes and Node Features ---
all_repos = test_df['repo'].unique()
repo_to_idx = {repo: i for i, repo in enumerate(all_repos)}
num_nodes = len(all_repos)

# Align feature rows to the node ordering (assumes features_df has a 'repo'
# column matching the URLs in test.csv; without this, features and nodes
# may be silently misaligned)
features_df = features_df.set_index('repo').reindex(all_repos).reset_index()

# Select and scale the features for our nodes
numerical_features = features_df.select_dtypes(include=np.number).columns
scaler = StandardScaler()
node_features = scaler.fit_transform(features_df[numerical_features])
node_features_tensor = torch.tensor(node_features, dtype=torch.float)
print(f"  Created {num_nodes} nodes with {node_features.shape[1]}-dimensional feature vectors.")

# --- 2. Define Edges and Edge Attributes ---
edge_sources, edge_targets, edge_attrs = [], [], []
for _, row in train_df.iterrows():
    repo_a, repo_b = row['repo_a'], row['repo_b']
    if repo_a in repo_to_idx and repo_b in repo_to_idx:
        idx_a, idx_b = repo_to_idx[repo_a], repo_to_idx[repo_b]
        log_ratio = np.log(row['multiplier']) if row['choice'] == 2 else -np.log(row['multiplier'])
        
        # Add edges in both directions
        edge_sources.extend([idx_a, idx_b])
        edge_targets.extend([idx_b, idx_a])
        # The attribute for A->B is log(B/A), for B->A it's log(A/B)
        edge_attrs.extend([log_ratio, -log_ratio])

edge_index_tensor = torch.tensor([edge_sources, edge_targets], dtype=torch.long)
edge_attr_tensor = torch.tensor(edge_attrs, dtype=torch.float).view(-1, 1)
print(f"  Created {len(edge_sources)} directed edges representing the juror comparisons.")

# --- 3. Assemble the Graph Data Object ---
graph_data = Data(x=node_features_tensor, edge_index=edge_index_tensor, edge_attr=edge_attr_tensor)

# ------------------------------------------------------------------------------
# Part 2: Defining the Graph Neural Network
# ------------------------------------------------------------------------------
class GNNRanker(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels=1):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.final_layer = torch.nn.Linear(hidden_channels, out_channels)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        # Message Passing Layers
        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.conv2(x, edge_index)
        x = F.relu(x)
        # Final prediction layer
        scores = self.final_layer(x)
        return scores.squeeze()

# ------------------------------------------------------------------------------
# Part 3: Training the GNN
# ------------------------------------------------------------------------------
print("\n" + "="*60)
print("Part 3: Training the Graph Neural Network...")
print("="*60)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = GNNRanker(in_channels=graph_data.num_node_features, hidden_channels=32).to(device)
graph_data = graph_data.to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01, weight_decay=5e-4)

def train():
    model.train()
    optimizer.zero_grad()
    
    # Get the scores for all nodes from the model
    all_scores = model(graph_data)
    
    # Get the scores for the start and end node of each comparison edge
    source_scores = all_scores[graph_data.edge_index[0]]
    target_scores = all_scores[graph_data.edge_index[1]]
    
    # The model's prediction for the log-ratio is the difference in scores
    predicted_log_ratios = target_scores - source_scores
    
    # The ground truth is stored in the edge attributes
    true_log_ratios = graph_data.edge_attr.squeeze()
    
    # Calculate the loss (Mean Squared Error, same as the competition)
    loss = F.mse_loss(predicted_log_ratios, true_log_ratios)
    
    loss.backward()
    optimizer.step()
    return loss.item()

for epoch in range(200):
    loss = train()
    if epoch % 20 == 0:
        print(f"  Epoch {epoch:03d}, Loss: {loss:.4f}")

# ------------------------------------------------------------------------------
# Part 4: Generating the Final Submission
# ------------------------------------------------------------------------------
print("\n" + "="*60)
print("Part 4: Generating Final Submission File...")
print("="*60)

model.eval()
with torch.no_grad():
    final_scores = model(graph_data).cpu().numpy()

# Convert the final scores into weights via softmax
final_weights = np.exp(final_scores - np.max(final_scores)) / np.sum(np.exp(final_scores - np.max(final_scores)))

# Create the submission DataFrame
submission_df = pd.DataFrame({'repo': all_repos, 'parent': 'ethereum', 'weight': final_weights})
submission_df.to_csv("submission_gnn.csv", index=False)
print("  SUCCESS: 'submission_gnn.csv' has been created.")
print("\n--- Top 15 Repositories from GNN Model ---")
print(submission_df.sort_values('weight', ascending=False).head(15).to_string())
print("\n" + "="*60)
print("PIPELINE COMPLETE")
print("="*60)

Greetings from Bhutan!

First of all, we would like to thank the organizers for hosting this competition. While we were late to join, it was a fun problem to think about. Moreover, reading others’ submissions and associated writing in this area has been intriguing. We have tried to look back through the communications, repos, and datasets to fully understand the context, but we may have missed some important information, so it’s entirely possible some things we discuss here have been resolved elsewhere. Overall, we will try to add something new to the conversation, as there has already been much good discussion above on things like sybil vulnerabilities, the dependency graph shifting, juror variability in scale and participation level, GitHub features, etc.

General Approach

Our team found out about this competition late, and by the time we started, there were only about 5 hours left. We had to be strategic about where to focus our efforts. At the time, we did not realize there was a public repo with GitHub data already in place, and decided it would be infeasible to build a strong database of our own in time. Even if we could, learning on the GitHub features was the obvious strategy, and those who had analyzed that data for many days would surely come out ahead. So we focused on what could be done with the juror rating dataset (at this point the public test data had already been merged).

Briefly summarizing our procedure:

  1. Tried to understand the objective, the loss function, the dataset, how they work together, do they make sense?
  2. Computed a baseline
  3. Looked for properties of the mechanism / dataset that could be used to adjust the baseline. We tried 2 genres of adjustments, both worsened the cv score (although it’s definitely possible that these failed due to implementation issues).

Surprisingly, when we submitted our baseline computations, we ended up first on the leaderboard. We knew we had overfit using the validation data, but it was surprising that no one else seemed to have tried this.

Detailed Approach

Baseline = Ridge in Log-Space

We started with the Bradley-Terry (BT) model, as it is the best-known pairwise-comparison model. The typical BT formulation doesn’t quite fit here, since it is built for binary outcomes rather than a continuous scoring system, but there is a continuous variant that is equivalent to ridge regression in log space. We implemented both the closed form and a more manual iterative solver, which let us try our improvement hypotheses.
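The closed form can be sketched as follows, assuming comparisons are encoded as signed log-ratios with a ±1 incidence matrix; the regularization strength `lam` is illustrative, not our tuned value.

```python
import numpy as np

def ridge_log_space(comparisons, num_repos, lam=1.0):
    """Closed-form ridge in log space: each comparison (a, b, y), with
    y the signed ln(multiplier), contributes one row of an incidence matrix."""
    X = np.zeros((len(comparisons), num_repos))
    y = np.empty(len(comparisons))
    for i, (a, b, r) in enumerate(comparisons):
        X[i, b] = 1.0   # the prediction for this row is s_b - s_a
        X[i, a] = -1.0
        y[i] = r
    # The ridge penalty keeps the system well-posed: without it, the
    # scores are only identified up to an additive shift
    s = np.linalg.solve(X.T @ X + lam * np.eye(num_repos), X.T @ y)
    return s
```

With a small `lam`, the solution approaches the least-squares fit of the score differences to the observed log-ratios.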

Improvement Attempt 1: Juror Irregularities

Jurors are predicting on different scales and participating at different rates. The test data is also known to have the same ratios of juror participation, due to the 66-33 split. Also, we know that jurors will not receive the same repos within one group of 5 ratings. So the thinking was that there may be some way to conclude that certain information from infrequent jurors should be ignored, since you will not be getting many ratings from them in the test set.

The other idea was to remove inconsistent (“triangle”) jurors, by assigning consistency scores and tuning a threshold for removing these jurors.

As a few others have stated, we should view jurors as having fundamentally different preferences, and this is why we thought that removing a juror’s data might be removing a preference that is biased away from the mean of the remaining jurors. However, it seems that the lack of data in general won out, as keeping all data gave us the best score in both cases.
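The triangle-consistency score can be sketched roughly like this; the `(juror, a, b, y)` row encoding and the absolute-violation metric are assumptions for illustration, not our exact implementation.

```python
import numpy as np
from collections import defaultdict
from itertools import combinations

def juror_consistency(rows):
    """rows: list of (juror, a, b, y), where y is the signed ln(multiplier)
    for the comparison a vs b. Returns the mean absolute triangle violation
    per juror; lower means more internally consistent."""
    by_juror = defaultdict(dict)
    for juror, a, b, y in rows:
        by_juror[juror][(a, b)] = y
        by_juror[juror][(b, a)] = -y
    scores = {}
    for juror, pairs in by_juror.items():
        nodes = {n for pair in pairs for n in pair}
        violations = []
        for a, b, c in combinations(sorted(nodes), 3):
            if (a, b) in pairs and (b, c) in pairs and (a, c) in pairs:
                # A perfectly transitive juror satisfies y_ab + y_bc == y_ac
                violations.append(abs(pairs[(a, b)] + pairs[(b, c)] - pairs[(a, c)]))
        if violations:
            scores[juror] = float(np.mean(violations))
    return scores
```

A removal threshold on this score is then a single hyperparameter that can be tuned by cross-validation.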

Improvement Attempt 2: Scheduling Leak? Rating-Frequency Correlation

We noticed that if we assigned a win-percentage rating to each repo, the correlation between this win rate and the number of comparisons was statistically significant: a strong positive correlation. Although the source code says that the pairings are generated randomly, it seemed something odd was going on. If the juror comparison algorithm was favoring more important repos, then inconsistencies in this relationship would indicate that the balance lies in the test set. So if a repo was performing very strongly but didn’t have many comparisons, we should shift its weight downwards, as it must have lost some battles in the test set.

This also didn’t work.
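The correlation check we ran can be sketched as follows; the `(a, b, y)` comparison encoding (y > 0 meaning repo B won) is an assumption for illustration.

```python
import numpy as np
from collections import defaultdict
from scipy.stats import pearsonr

def winrate_vs_count(comparisons):
    """comparisons: list of (a, b, y); y > 0 means repo b 'won'. Returns the
    Pearson correlation between each repo's win rate and its number of
    appearances, along with the p-value."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for a, b, y in comparisons:
        winner = b if y > 0 else a
        wins[winner] += 1
        games[a] += 1
        games[b] += 1
    repos = sorted(games)
    win_rate = np.array([wins[r] / games[r] for r in repos])
    counts = np.array([games[r] for r in repos], dtype=float)
    r, p = pearsonr(win_rate, counts)
    return r, p
```

A significantly positive `r` is what suggested the scheduling might not be uniform over repos.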

Reflections, Thoughts on the Mechanism and Competition

Here we will lay out a list of things we noticed, and then put our focus on one issue in particular.

Random Notes

  • Using timing as a feature. We can see differences in how long different jurors are taking to make their judgments (because often they do the 5 in one sitting). More time could mean lower variance in their answers, for example.
  • Look more into difference in # of comparisons per repo and how that affects results.
  • Is the goal purely to allocate funds as rewards, in order to incentivize, or just as a moral matter? If the goal is to best push the ecosystem forward, then the question of “contribution to the ecosystem” is not necessarily the same thing as our goal. Rather, we should partially fund towards the highest expected future marginal contribution to the ecosystem. So a very impactful project that is aging, with lots of competition, should be downweighted.
  • To all the LLM users: we should consider that this approach may not be scalable at the moment.
  • Is allocation happening based on these winners? But then allocation is dependent on the test data selection. That seems not ideal, but surely I’m missing something.
  • Clustering approaches: we wish we had explored these. Some had success, and we think they can be improved even further.

Focus: Train/Test Split Misbalanced = Failure in Credible Neutrality

I ran the deepfunding/jury-evaluation app locally and read through the source code. The data is not processed into train/test there; that is done after it has been sent to the Google Docs, and I couldn’t find the documentation for how this split is orchestrated.

Even if it is available, somehow we ended up with several jurors’ data sets being split in favor of train or test rather than the promised 2-to-1 split. The competition overview clearly states:

“For level 1 comparisons between seed nodes, randomization is at the juror level. If a juror makes 30 judgments, they are divided equally between training, private and public test sets.”

But we see things like juror 1 having 8 training comparisons to 2 test, and juror 40 being about 50-50. With the small training set and a broken assumption, this can seriously affect results. More so if you consider the greater effects of arbitrary train-test selection (not saying this happened, but in the spirit of credible neutrality, we should make sure this isn’t a thought that crosses the mind).

Ok, that’s all for now. Thanks for reading.

Thanks for all your patience! The committee has completed review on all writeups and we now have final results!

I’ll start with the top 5 write-ups: Zoul, David Gasquez, Ash, Omniacs and Stuffer, who each receive $2000, along with a special $500 prize to Allan Niemerg for his original approach. In the next post, I’ll also share jury feedback on the other participants who submitted a writeup but didn’t win a prize.

Note: the committee primarily took into consideration insightfulness of write-ups, not model performance which has separate prizes based on leaderboard performance.

  1. @zoul404

Juror 1: Good data exploration; juror reliability weighting similar to David Gasquez.

Good coverage of approaches with a similar and expected result of poor ML performance due to data scarcity.

Extremely good methodology section and explanation of Bradley-Terry as the correct choice for pair-wise comparisons, and its superiority to both ML and multiple features.

Juror 2:

  • Excellent write-up, technical depth and good arguments
  • Like the pros-cons comparison between BT and ML
  • Attention to juror heterogeneity
  • Good explanation/transparency on limitations of BT (i.e. can’t predict unseen repos)

Juror 3

  • This definitely represented a lot of effort, and some good application of proven statistical methods
  • That said, I don’t think they had any intuition or sufficient context to determine if the tools they selected were the right ones for this job
  • Overall, this felt too academic but strong nonetheless

Juror 4:

In My Top 5

The writeup here is strong. I’m biased towards the in-depth mathematical explanations. They lend credibility to specific design choices.

The visuals are great, as well. Definitely a strong submission.

Juror 5

This is a great write-up which really explores some issues with juror data in a reasonable, interesting way. Top 5

@kalen / David Gasquez

Juror 1

Excellent data exploration, especially with respect to resiliency when modifying the juror selection/matchups.

Good coverage of approaches (ML vs LLM) and reasoning behind ML’s weakness due to data scarcity.

Very interesting result that using LLM committee mimicking jurors led to very high ranking even without training data!

Superb effort.

Juror 2

Great submission

  • Good EDA
  • So far, only one to use vector embeds for the repos themselves (as an extra feature). Great idea.
  • Also like the acknowledgement of issues with dataset (not many data points, juror imbalance, etc) and further suggestions

Juror 3

  • Top 5
  • I know David well … and know that he knows this contest and the context very well.
  • That said, there’s a clear evolution in his thinking, I like how he tries multiple approaches and compares findings

Juror 4

In My Top 5

Very insightful data exploration, giving helpful perspective on the data available – especially the disproportionate impact of a single juror. The visualizations are effective.

The use of different approaches (classical ML and LLM) led to intriguing comparisons. I appreciate the thorough process description.

I’m intrigued to see how well this approach would work on other data sets.

Juror 5

David clearly knows his stuff. At this point, after judging a few of these competitions, I’m pretty convinced that he’s one of the best people to help lead further explorations into deep funding.
I think this is the best writeup.

@AshDev

Juror 1

A detailed journey through the contestant’s thought process, very well described even if telegraphically, through bullet points. One would think that such curt bullet points would take away from a write-up, but they were actually very economical and to the point. However, some details were missing, especially around the (interesting) ensemble approach of Bradley-Terry and LightGBM models for the juror and repo data. It could have benefited from those details, but was well described otherwise.

Juror 2

  • Very detailed approach, combination of visuals + formulas, good explanation
  • I expected some more discussion around parameters of models, different models, etc

Juror 3

  • Top 5
  • Excellent write-up, and the best of the “I chose an algorithm and continued to optimize it” submissions I’ve come across
  • It’s clear they put time and care into the process and had a very strong submission
  • The repo and code are also well documented (for a data challenge), which I enjoyed skimming through

Juror 4

In My Top 5

I found this easy to read, and appreciated the mix of writing styles (terse bullet points in some places, reflective paragraphs in others), and the effective use of visuals.

The personal narrative was effective for me, especially since the final model was so effective

Juror 5

I think this is an above-average writeup with some good takeaways, even if I wasn’t always able to understand the terse writing style. In my top 5.

@TheOmniacs

juror 1

A very detailed and well documented writeup. It tells the story very well and uses a lot of community conversations and other images apart from data and visualizations. It uses a very casual tone, which was fine for me.

juror 2

  • Engaging write-up
  • It would have been nice to see some more ML modelling and/or scores and quantitative-based iterative process
  • They made some interesting attempts at gaming the submission process and creating dummy accounts, but these were not directly helpful for the competition scores

juror 3

  • Top 5
  • Enjoyed seeing the combination of approaches they employed and their engagement throughout the competition
  • Checked out the code and there’s a lot of reusable stuff here (like the processing pipeline).

juror 4

In My Top 5

This is a detailed and engaging writeup, which addresses more than just the modeling techniques used. It’s nice to see a contestant giving personal reflection on how the contest’s parameters influenced their efforts. This writeup offers actionable insight for future contests.

“Grey hatting for good” – appreciate the clear narrative framing, with strong voice.

juror 5

This is a very entertaining post that might help the community reflect on how these competitions are run and what can be done better. I loved reading the writeup, but given the author’s willing admission to the “sin” of overfitting on the leaderboard, it feels hard to recommend this one for one of the top spots. Honestly though, it could be at the lower end of the top 5.

@Stuffer

Juror 1

Interesting ‘forensic’ approach with detailed data exploration.

Intriguing human-in-the-loop approach to the problem, I did not anticipate this. Well motivated in the writeup.

Overall very well described and motivated given the unique approach

Juror 2

  • Good juror analysis, specially NLP analysis of reasoning
  • Creative approach of human-in-the-loop
  • Not top-tier with respect to model being developed

Juror 3

  • I liked the narrative, going from EDA to data scraping to fill in some gaps, to interpreting the results
  • This is outside the top five for me because it’s not very generalizable and feels overfit to the immediate problem set on hand. But a good, practical solution!

Juror 4

In my Top 5

I really liked the thorough EDA and creative visualizations, which led to an actual cognitive model.

The “human-in-the-loop” model is a really interesting approach, both creative and seemingly successful.

Juror 5

Love the clarity of the visuals combined with easy-to-understand takeaways right below. The human-in-the-loop approach is also really cool imo. A great writeup, maybe second place behind David’s

@niemerg

(special note: the jury loved the ideas presented and awarded a special $500 prize outside the pot for this reason, but uniformly felt the writeup could have been better presented)

juror 1

Very short writeup which leans heavily on multiple other articles and basically does not provide details in the writeup itself - unsure how to evaluate that

juror 2

  • I like the LLM-first approach it took as in “let’s have 1 persona per juror”. If that works perfectly, he could have had the best model. However, this is gaming the system, since the goal was to develop a model to calculate repo weights. Credits for creativity.
  • Lacked more quantitative metrics and discussion

I liked the code (link) and think the approach is interesting

juror 3

  • Top 5
  • I read the blog and checked out the code, which go into more detail than the forum post (that said, I wish the forum had a tl;dr of his conclusions)
  • I also think that this is a much more generalizable approach than training against a very limited set of juror data points and repo features
  • I agree with other reviewers that he buries the lead

juror 4

I’m willing to accept the framing that the writeup consists of multiple parts.

It’s clear that a lot of effort was spent and insight gained by the author through the process. That said, it’s a bit hard to process because it’s so verbal and lacks details with regard to the exact approach.

juror 5

The tools that he wrote seem interesting, but the core writeup isn’t too interesting. I guess if we’re allowed to consider the tools as part of the writeup this one is above average, but if not it’s below average.

Congrats to the winners! Pond will now be able to start processing payouts to all participants


We’ll now share the feedback to the other participants. Even if not awarded prizes, we hope this feedback helps you in your journey as a model builder!

@innotech-ai

juror 1

This writeup starts off promisingly, describing their approach in a storytelling manner, but unfortunately it seems to be basically incomplete.

juror 2

  • Interesting analysis on modelling juror behavior
  • I like their baseline-first strategy
  • Didn’t achieve great results

juror 3

I liked how they tried to think outside the box, and was reading eagerly hoping they had something interesting to show for it … but they didn’t

juror 4

Writeup reflects the short timeframe that the modeler had. I would like to see their results, with a bit more time.

The techniques used here are pretty standard, nothing really sticks out.

juror 5

I think it’s awesome that these folks jumped in with just 5 hours left in the competition and gave it their all, but the writeup is hard to follow.

@kwachunling

juror 1

Very short writeup with no data exploration and only weak motivation for their (albeit very interesting) approach.

Would have liked a lot more discussion of the idea of looking at the ecosystem as a graph, and the corresponding use of GNNs.

Left me wanting more.

juror 2

  • Great approach using GNN! Conceptually makes sense to model dependencies as graphs.
  • Write-up focuses on the GNN choice and the code, but didn’t explain what actions were taken after it failed to generalize (changing hyperparameters, etc.)

juror 3

This feels very LLM-y, I don’t think they had any intuition for how to adjust the hyperparameters or interpret the quality of their predictions

juror 4

love the idea of using a GNN here, based on the notion that “the Ethereum ecosystem is a graph” (in fact, this idea is basically baked into the structure of the data).

Unfortunately, most of the writeup is technical and focuses on how the GNN is implemented. It would be nice to see more about the outcomes – what did the GNN predict, and/or what its performance tells us about the jurors’ mindset.

juror 5

I really appreciate the argument that we need to think about these codebases in a more network-oriented way, rather than as isolated entities. Ultimately it’s not clear to me if that positionality necessarily means that GNNs are the right tool, but still a good submission

@dipanshuhappy

juror 1

This is another one which doesn’t try to predict based on juror behaviour but decides to use some framework utilized by Robinhood. There is no motivation for this choice, and the description of the application of this technique is also light on details

juror 2

  • Creative framing of the problem using Relentless Monetization
  • LLMs were used (and it makes sense to some extent), but the description lacked details on what worked/didn’t work and comparisons with simpler ML-based approaches

juror 3

  • clever approach … I’d love to see the data they got back from their prompting … and the comparisons among LLMs
  • I’m not convinced this is the RIGHT prompt, but this is definitely an outside the box approach

juror 4

I like the novel application of an existing technique (“Relentless Monetization”, who knew?), but it feels disconnected from any juror data.

Using LLMs is an intriguing idea, but lacks explainability. I would want to see a model that generalizes to a different version of the data set.

juror 5

This writeup is not convincing. I think using LLMs this heavily without exploring their drawbacks isn’t great, especially when paired with another idea that is untested (in this space, at least): “relentless monetization”. Below average.

@lisabeyy

juror 1

Short writeup with very few details. It does cover the entire process and reasoning, and is honest and clear. Despite being short, it at least motivates the choices made and explains the results.

juror 2

  • Good motivation and explanation of utilized techniques
  • Nice ablation study (kind of feature importance although not from model itself)
  • Didn’t incorporate juror heterogeneity into the model

juror 3

  • Smart, I like how they had a good intuition for how to approach the problem and how to sanity check the results
  • They did a good amount of fine tuning as well
  • Would have liked to see this person team up with Zoul

juror 4

appreciate the detailed writeup, as well as the humanity/humility in the author’s prose. There’s clear evidence of authentic engagement with the data, with feature engineering and other methodological choices well-explained.

juror 5

This is an average writeup, things are explained well enough but it’s also very terse and doesn’t really increase my understanding of deep funding.

@alertcat

juror 1

While detailed in terms of steps and mathematical details, it is too terse and telegraphic in nature to be a good writeup. Very little text description.

juror 2

  • This shows that the author understood the mathematical nature of the competition very well, and also presented proofs of why his model fits the competition criteria
  • Pure mathematical model, fast and doesn’t require training
  • Write-up could be made less technical and easier to read

juror 3

  • Similar feedback to lisabeyy – they knew a good tool for the job and applied it well
  • I personally like these straightforward technical approaches
  • That said, I don’t like BT-based approaches because they are based on juror features as opposed to repo features

juror 4

The writeup probably doesn’t do justice to the approach. The details are interesting to me (hard-math-y person), but almost read like filler.

I would like to see way more effort expanding the last few sections, which offer results and interpretability. It feels off to me to treat this as a pure optimization problem when the data set is so rich with potential insights.

juror 5

I wish this writeup had more natural-language explanations/ commentary. The approach might be cool, but I think the writeup is average. It’s not very illuminating, besides letting me know that maybe “direct weighted least squares” is an interesting technique to look into.

@rhythm2211

juror 1

Parts 1 & 2 of the writeup are very detailed, covering the exploratory data analysis, but then they seem to run out of steam on parts 3 & 4, which are very short paragraphs with almost no details. I would consider this submission incomplete.

juror 2

  • Liked the extra effort to fetch missing features via the GitHub API
  • Write-up well written, detailed, and easy to follow
  • Didn’t find many metrics (maybe inside the Jupyter notebook)
  • I expected some more BT discussion

juror 3

  • Sigh. There’s potential here, but this feels like exploratory data analysis with no idea at how to arrive at a conclusion.
  • The notebook feels like a bunch of reused snippets from other competitions and LLM codegen (pandas is imported 43 times)
  • That said, the data viz is pretty and is the type of stuff you’d expect a junior data scientist to cook up while trying to learn about a dataset

juror 4

I liked the visuals, but didn’t always agree with the interpretation. (In particular, comparisons-per-juror didn’t seem terribly imbalanced to me.)

What’s missing (in my view) is a connection between the lessons learned from the EDA and the eventual modeling efforts. The modeling details are very sparse, in terms of both design choices and insights gained.

juror 5

I think that going with a model-per-juror approach is pretty interesting, but I don’t understand the claims about juror inconsistency. In particular, if a juror tends to use higher ratings, we shouldn’t be surprised that the standard deviation of their ratings is higher, and in any case I don’t think standard deviation and “inconsistency” are the same thing. Inconsistency to me denotes making the same choice differently over time (i.e. sometimes I prefer A to B and sometimes I prefer B to A), whereas having a high standard deviation is like saying “I like A a lot more than B, and B a little more than C”, which isn’t the same.
Average or below average.

@khawaish1902

juror 1

Detailed writeup, but relies too heavily on code, math and visualizations without enough text to explain. But it is at least detailed, and covers the entire effort.

juror 2

  • Detailed write-up, problem understanding and clear approach
  • Nice that used several models and compared them
  • Limited quantitative validation and discussion

juror 3

  • Top 5
  • Good, I like how the person tried a variety of different approaches and compared them … this reflects a maturity of thought that has been absent in most other submissions
  • Their conclusion is also spot on: “Direct optimization of the competition objective outperformed sophisticated feature engineering when training data was limited.”

juror 4

The writeup has a lot of information, but doesn’t really offer reflection or insight.

My question is: “What do we know after this modeling effort that we didn’t know before – about the ecosystem data itself, or modeling with ecosystem data in the future?”

juror 5

Not sure that some of the visuals are really necessary (breakdowns of which languages projects used? Text clouds? These don’t increase my understanding of the problem space). Aside from that, I couldn’t find much new / exciting info in this writeup.

@Oleh_RCL

juror 1

This is a strange writeup which seems to misunderstand the purpose of the contest, which was to mimic and thereby scale the jurors’ decisions, not to try to ‘do better’ than them! It is very brief and offers just a broad overview with basically no details.

juror 2

  • Discussion makes sense, different models, data enrichment
  • I would have liked some technical analysis of the features and the rationale for the choices

juror 3

  • Puzzling. The write up is sparse and the model is a single file that runs on local/private data with no output charts, so I have no way of reviewing any of their decisions
  • It could be good but I have no way of verifying

juror 4

I think the writeup is clear, and the process intriguing – for instance, augmenting the data with social media and GitHub commits.

The problem I have is that the code is massive and hard-to-parse. This would be a much stronger writeup if the prose elements clearly connected to the code, and/or if the code itself were easier to read.

juror 5

An above-average writeup, but I don’t think there’s too much new here that pushes our understanding of deepfunding forward. Feature engineering and model ensembles are pretty par for the course in these competitions. I do think using PageRank is a cool idea, though.

@Limonada

juror 1

Detailed explanation of the chosen approach, but lacking the data exploration and objective justification of the approach compared to other approaches. Would have been good to see how they went from the data exploration, to a few approaches leading to the final choice.

juror 2

  • Also suspect the write-up is LLM-generated
  • It lacks a bit of technical demonstration, as other write-ups have, like feature importance or code snippets

juror 3

Frustrating. This represents a novel approach to the problem and I would love to see code / charts / residuals / ANYTHING technical but there is NOTHING linked in the submission

juror 4

The writing here feels particularly flat, lacking detail. I suspect it’s LLM-generated. The methodology is actually quite interesting, but there needs to be more explanation. It would be great to have something more: visuals of the architecture, insights from the final model – something.

juror 5

I appreciate the conceptual framing of PGF as an ongoing process. That feels like an important/unique contribution. But the writeup is also a little heavy on the jargon and a little light on data/statistics. An illustrative example would have also helped me decode some of the author’s claims. Still above average, though.

@jpegy

juror 1

Somewhat breezy and not very strongly motivated or supported. Definitely a concerted approach trying to generate synthetic data to solve the data scarcity problem, but would have liked a statistical analysis to support the thesis. Very honest about the sub-optimality, which was the redeeming factor.

juror 2

  • Didn’t go into too much detail on the approach taken
  • Rightfully acknowledged problems with the dataset; interesting to use LLMs to generate synthetic data to tackle that

juror 3

  • Great write-up and would have been really interested to see more data / outputs from their attempts.
  • I feel like there’s a notebook / repo or set of charts that could have easily put this submission into the top 5 for me

juror 4

Excellent reflection on work. The approach might be promising if further refined.

It’s hard for me personally to work with text-only writeups: surely there’s some visual, number, etc. that could be given in summary? Anyone can write any words they wish, and I feel credibility is added by including some alternate (and perhaps harder-to-generate) representations.

juror 5

This is an above average write-up that explains some of the pitfalls of over-indexing on LLM generated data. Even though the idea of using a bunch of LLM data feels sketchy / weird to me, I’m glad jpegy did it because it increases our understanding of how things work both with LLMs and with the competition. Above average.

@MavMus

juror 1

Extremely disjointed and short descriptions, with large code blocks expected to explain themselves. Very low effort.

juror 2

  • explanations, not very good structure
  • Reasonable ideas that should have been better explored

juror 3

  • A bit light, though I appreciate their attempt to enrich with some additional data about the utility of the repo
  • Feels like a modest weekend effort but nothing stands out

juror 4

Clear description of the process, but more care needs to be taken to ensure reproducibility and interpretability when working with LLMs.

The writeup would benefit from at least one diagram to give an alternate representation of the model architecture

juror 5

I didn’t find this writeup to be very insightful, and didn’t think the code was explained well. Below average.

@Todser

juror 1

This one is a head-scratcher. It literally describes a multi-turn session with a chatbot where the contestant tried to guess the scores themselves. They got very bad results, which is not surprising. This does not feel like a serious submission, and the attempts to make the writeup entertaining do not compensate, in my opinion.

juror 2

I don’t see a model/code or an attempt at building one, just reflection on how to theoretically tackle this project

juror 3

  • I really really wish this person had some structured data they captured from each of these iterations, because I think this is an intriguing approach … he is basically brute forcing the RLHF approach on a very narrow use case
  • Not Top 5 but definitely a creative person at work!

juror 4

Intriguing idea, and I think it’s possible this “Iterative Hypothesis Testing” concept could be successfully adapted to a chatbot/agentic analysis approach (perhaps combined with more traditional, quantifiable machine learning methods).

It seems like the approach was reasonably successful, but it’s kind of ad hoc – not likely to generalize well without more refinement.

juror 5

I appreciate the author’s interest in just thinking through the problem via simple heuristics and trying to maintain some level of human understandability, and I think the failure of this approach does shine a light on how hard deepfunding is (or how careful we need to be about collecting juror data!). However, I think this is the only thing going for this project, making it an average or slightly below average writeup.

@clesaege

juror 1

While an interesting and very well described approach of designing a prediction market, I wish this submission contained the basic data exploration which then could motivate the innovative choice of the contestant. As such, while it is indeed very well explained and structured like a very brief research paper, I felt it was a high level abstract idea and not as much of a roll-up-the-sleeves and dig into the data approach. I think that would have been very useful and could have motivated at least in part the approach taken.

juror 2

  • I am obviously biased here, but it’s been proven that, given proper incentives, PMs can act as great aggregators of information. Thus one could use such a mechanism to find out the dependencies (similar to an ensemble of the best models)
  • Some problems on PMs however persist, e.g. low liquidity for participation, expert knowledge required, etc
  • Overall I like the idea very much, but as a submission I’m ranking ML/LLM approaches higher because they were more closely aligned with the goals

juror 3

  • Not sure?
  • Cool! But I’m not sure how to score this. Wasn’t Seer also a part of the round design?
  • If not, then this is definitely a top 5 approach. If so, then I’d rather clear the way for independent / non-affiliated submissions

juror 4

In My Top 5

Intriguing use of prediction markets, to compete with the AI/Machine Learning models.

It’s somewhat difficult to compare, due to the difference in nature. The writeup is thorough, and the visuals help tell the story well.

juror 5

Prediction markets definitely have their strengths but I can’t get around the clear performativity issues that would come from using prediction markets as the sole mechanism to give away money (as opposed to testing them against a dataset). If you can spend money to adjust the weights of your project in a market, and then those weights determine some payout to that project, that seems like a clear opportunity for gaming the system. Still a cool writeup though, and I appreciate the ideas. Top 5

@duemelin

juror 1

This is clearly LLM-generated!

Maybe that is not a deal-breaker in an ML/AI competition, but the contestant has not even bothered to remove the LLM’s voice:

“Here is a ready-to-post write-up for the forum, incorporating the results from your script’s output.”

juror 2

  • LLM-generated
  • If the code was there, then maybe I could have provided a better analysis

juror 3

Not much to add to what other reviewers have said about this submission

juror 4

The comment “Here is a ready-to-post write-up for the forum, incorporating the results from your script’s output.” seems a bit strange. LLM output?

To be honest, I think the overall writeup is fine. There isn’t much in the way of technical innovation or subject matter insight, though. I think “strip away context and do pure mathematical optimization” is an OK approach, if the optimization is well-done. This needs more explanation of the underlying optimization technique.

juror 5

I definitely did not love how the author forgot to remove the trimmings from the LLM response he copy-pasted into this forum post. Aside from that, it didn’t feel particularly insightful.
Below average.
