ERC-8183: Agentic Commerce

@pablocactus, there are three things specifically that we identified:

  1. Sybil resistance - the warmup period and minimum stake make it expensive to spin up disposable evaluator wallets to game the weighted selection

  2. Misaligned incentives - the stake is slashable, so a consistently bad evaluator loses real value. For the deterrent to be effective, the loss has to exceed the gain from corruption

  3. Collusion - the pseudo-random stake-weighted assignment means the client can’t pre-select their evaluator

Your AHS approach addresses something different: a historical behavioural signal rather than economic commitment. The two aren’t competing, because one punishes bad outcomes after the fact while the other informs selection beforehand.

On the hybrid question, the spec only mandates the on-chain attestation. How the evaluator reaches that decision is out of scope by design. Off-chain scoring informing complete()/reject() is a valid pattern.

Thanks for the clear breakdown. The before/after framing is useful for how we position the two approaches. Good to have confirmation the hybrid pattern is in scope. The Sybil resistance point is interesting: our D1 solvency dimension picks up some wallet age/history signals that partially address disposable wallets, but it’s not a substitute for economic commitment. Worth exploring whether the AHS score could feed into stake-weighting as a complementary signal rather than a replacement. Will keep watching the thread.

@pablocactus the hybrid pattern fits cleanly into the verification schema @ThoughtProof is drafting as a companion to ERC-8210. Off-chain scoring like your AHS maps to one of the four assessor types in the schema:

"assessors": [
  { "type": "rule", "id": "ahs-d1-d2-d3", "verdict": "APPROVE", "confidence": 0.87 }
]

The scoring runs off-chain, the verdict gets pinned (IPFS or on-chain bytes), and the resulting CID is what gets passed as reason to complete()/reject() on ERC-8183 — or as evidence to fileClaim() on ERC-8210 if a dispute happens later. Same schema, two consumption points.
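To make the mapping concrete, here is a minimal TypeScript sketch of how an off-chain AHS result could be packaged into the assessor schema above. The `AhsResult` shape is an assumption for illustration; only the assessor entry format comes from the schema snippet, and the IPFS pinning step (which would produce the CID passed as `reason`) is left out.

```typescript
// Sketch only: package a hypothetical off-chain scoring result into the
// assessor schema shown above. Pinning the payload and obtaining the CID
// is out of scope here.

interface AhsResult {
  grade: "APPROVE" | "REJECT";
  confidence: number; // 0..1
}

interface AssessorEntry {
  type: "rule";
  id: string;
  verdict: "APPROVE" | "REJECT";
  confidence: number;
}

function toAssessorPayload(result: AhsResult): { assessors: AssessorEntry[] } {
  return {
    assessors: [
      {
        type: "rule",
        id: "ahs-d1-d2-d3", // assessor id from the schema example above
        verdict: result.grade,
        confidence: result.confidence,
      },
    ],
  };
}

// The serialized payload is what would get pinned; the resulting CID is
// what complete()/reject() or fileClaim() would consume.
const payload = toAssessorPayload({ grade: "APPROVE", confidence: 0.87 });
console.log(JSON.stringify(payload));
```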

On the AHS-feeding-stake-weighting idea from your last post: this is actually where the three approaches in this thread become composable rather than alternatives:

  • Before the job (selection): AHS scoring informs which evaluator gets chosen

  • During the job (commitment): stake establishes economic commitment

  • After the job (accountability): slashing closes the loop on bad calls

None of the three replaces the others — they operate at different points in the lifecycle, and the verification schema is what carries the proof from one phase to the next.

I’ll add this as a second composition example in the multi-hop reference scenarios for ERC-8210 (alongside the `EvaluatorSlashed` → `fileClaim` flow we agreed on with @Bakugo32). Two concrete examples of how independent specs compose without coupling.


This is the clearest articulation of the composition I’ve seen. Before/during/after maps cleanly to the three approaches without any overlap. The assessor schema is interesting; the ahs-d1-d2-d3 typing makes AHS pluggable as a named rule type rather than a one-off integration. If ERC-8210 formalises that assessor registry, there’s a natural path for AHM to publish scores as verifiable assessor outputs rather than just API responses. Happy to be included as a reference example; worth coordinating directly if you want accurate D1/D2/D3 dimension descriptions for the spec.


@pablocactus glad the framing landed. The assessor registry idea is interesting — would you mind sketching it out as a concrete proposal so the others in the thread can weigh in? That seems like the right way to validate it before anyone commits to scope. In the meantime I’ll keep moving on the multi-hop reference scenarios PR for ERC-8210 along the lines we’ve already aligned on (upstream field + EvaluatorSlashed → fileClaim). Happy to revisit how AHS-style hybrid scoring fits in once there’s broader input on the registry approach.


@ThoughtProof, governance executed this morning and the protocol is fully live. We just ran a complete end-to-end cycle (create → fund → auto-assign → submit → complete) on Base Sepolia.

You’re good to start testing. Here’s a minimal reference implementation we wrote covering the full evaluator flow:
https://github.com/Demsys/agent-settlement-protocol/blob/main/docs/evaluator-starter-kit.ts

It covers staking, watching for assignments via watchForAssignments(), and calling complete() on-chain with your wallet. The only part to replace is yourEvaluationLogic().
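For anyone wiring this up, here is an illustrative sketch of what a yourEvaluationLogic() replacement might look like. The `Job` shape, the scoring helper, and the 0.5 threshold are all assumptions for the example; the starter kit’s actual types may differ, and a real evaluator would pin its reasoning and pass the CID as the reason argument.

```typescript
// Illustrative only: one way to fill in the starter kit's
// yourEvaluationLogic() slot. Types and threshold are assumptions.

interface Job {
  id: bigint;
  provider: string;
  deliverableURI: string;
}

type Verdict = { action: "complete" | "reject"; reason: string };

// Stand-in for whatever off-chain scoring you run (rules, an API, a model).
async function scoreDeliverable(uri: string): Promise<number> {
  return uri.length > 0 ? 0.9 : 0.1; // placeholder heuristic
}

async function yourEvaluationLogic(job: Job): Promise<Verdict> {
  const score = await scoreDeliverable(job.deliverableURI);
  // Arbitrary cutoff for the sketch; in practice the reason would be a
  // CID pointing at the pinned evaluation record, not a raw score.
  return score >= 0.5
    ? { action: "complete", reason: `score=${score.toFixed(2)}` }
    : { action: "reject", reason: `score=${score.toFixed(2)}` };
}
```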

Thanks for the starter kit @Bakugo32, this maps cleanly onto AHM’s existing evaluator pattern. Our yourEvaluationLogic() replacement calls /ahs/route/{address} and routes on grade. Main blocker is VRT for staking: is there a faucet, or do we request a mint?

Starter-kit received and looks super clean. Thanks for putting this together!

We already staked our 100 VRT on Base Sepolia yesterday (Tx: 0x01f129…), so we should be past the warmup period and eligible for assignments.

We’re currently wiring the AssignmentWatcher directly into the thoughtproof-sdk. Just adding some production-hardening on our end (nonce management for concurrent job resolutions, idempotency, and API retry logic) to ensure the node is rock solid.

Our automated reasoning evaluator will be live on Sepolia shortly!
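The hardening steps mentioned above (idempotency plus retry logic) can be sketched in a few lines. This is illustrative only, with hypothetical function names, not part of thoughtproof-sdk: the point is that complete()/reject() must never be submitted twice for the same job, even when API calls are retried.

```typescript
// Sketch of idempotent resolution with bounded retries. Names are
// hypothetical; a production node would persist the resolved set.

const resolved = new Set<string>(); // jobs we've already answered (idempotency)

async function withRetries<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  let lastErr: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await fn();
    } catch (e) {
      lastErr = e;
      // Exponential backoff between attempts: 100ms, 200ms, 400ms...
      await new Promise(r => setTimeout(r, 2 ** i * 100));
    }
  }
  throw lastErr;
}

async function resolveOnce(jobId: string, resolve: () => Promise<void>) {
  if (resolved.has(jobId)) return; // never double-submit complete()/reject()
  await withRetries(resolve);
  resolved.add(jobId);
}
```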

Hey! Just confirmed on-chain that job #1 is officially assigned to your wallet 0xB4B9...B9C0. You’re the first external evaluator on the protocol, congrats. To complete the evaluation, call complete() or reject() on AgentJobManager. We put together a minimal reference to get you started: docs/evaluator-starter-kit.ts in the repo, as mentioned.

Let us know if anything is unclear. This is the first real end-to-end cycle so your feedback matters a lot.

Hey! To become an evaluator on testnet you need VRT (our protocol token) to stake. Here’s the full flow:

1. Get VRT; we just shipped a faucet:

curl -X POST https://agent-settlement-protocol-production.up.railway.app/v1/faucet/vrt \
  -H "Content-Type: application/json" \
  -d '{"address": "0xYOUR_WALLET", "amount": "500"}'

2. Stake — call stake(amount) on EvaluatorRegistry. Minimum is 1 VRT, more stake = higher probability of being assigned jobs.

3. Wait ~24h for the warmup period to pass before your address becomes eligible.

After that, jobs will auto-assign to you when funded. Drop your wallet address if you want us to check your status anytime.

Hey @Bakugo32 - got VRT from the faucet, just need a small amount of Base Sepolia ETH for gas to complete the stake. Evaluator address: 0x35eeDdcbE5E1AE01396Cb93Fc8606cE4C713d7BC — any chance of a drip?

Done, we just sent 0.005 ETH to 0x35eeDdcbE5E1AE01396Cb93Fc8606cE4C713d7BC, which should be enough for staking plus a few transactions. Let us know once you’re staked.


Hi all, dropping in from the ERC-8210 (Agent Assurance) thread, where a community discussion uncovered something I think is worth raising here as well.

While reviewing AAP’s interaction with 8183, Pablo from RNWY pointed out a concrete attack vector: if the Provider and Evaluator are the same actor behind two different addresses, the 8183 lifecycle runs to a clean Completed state, escrow releases to the Provider, and the Client receives nothing. The audit trail looks normal. The reputation layer (ERC-8004) cannot pick this up either, because there is no on-chain event signaling that the Client was harmed. Full discussion is in our thread here: ERC-8210: Agent Assurance

The mirror case is also worth mentioning: when evaluator = client (which the spec explicitly permits in the Roles section as a “no third-party attester” configuration), an honest Provider can deliver in good faith and the Client-acting-as-Evaluator can simply reject, refunding escrow to the Client. The Provider has no recourse.

I want to be clear that we don’t see this as a flaw in 8183. The decision to leave role-independence verification to the application layer is a reasonable scope choice, and it keeps the protocol minimal and composable, which is one of 8183’s strengths. But this also means the assumption needs to be acknowledged and addressed somewhere, and it’s worth thinking about whether some of that “somewhere” could live closer to the protocol than it currently does.

A few directions that came up in our discussion, sharing here in case any of them are useful:

An optional arbitration role above the Evaluator. Conceptually similar to how AAP separates the Claims Resolver from the Evaluator: a structurally independent party that can re-examine an Evaluator’s decision when challenged. This would give honest Clients a protocol-recognized path to dispute a Completed Job, which is the gap the current reputation-only model cannot close in collusion scenarios. We’re not proposing a specific design, just flagging it as a direction worth considering.

A standardized independence signal interface. Something like assessIndependence(addrA, addrB) returns (independent: bool, confidence: uint8, signals: bytes) that any trust scoring provider (RNWY, AgentProof, others) could implement. 8183 hooks could optionally call this at setProvider or createJob time to enforce or recommend independence between the proposed roles. This keeps trust evaluation outside the core spec while giving implementers a consistent way to plug in.
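Transcribed into TypeScript for an off-chain trust provider, the suggested signature might look as follows. The interface names and the heuristic below are illustrative, not part of any spec; a real provider (RNWY, AgentProof, others) would look at funding graphs, timing, and behavioural history rather than the trivial same-address check used here.

```typescript
// Sketch of the assessIndependence(addrA, addrB) signal interface
// suggested above, as an off-chain provider API. Illustrative only.

interface IndependenceResult {
  independent: boolean;
  confidence: number; // 0-255, mirroring the uint8 in the suggested signature
  signals: string;    // provider-specific evidence (bytes on-chain)
}

interface IndependenceOracle {
  assessIndependence(addrA: string, addrB: string): IndependenceResult;
}

// Naive reference implementation: flags only the trivially dependent case
// where both roles resolve to the same address.
const naiveOracle: IndependenceOracle = {
  assessIndependence(addrA, addrB) {
    const same = addrA.toLowerCase() === addrB.toLowerCase();
    return {
      independent: !same,
      confidence: same ? 255 : 10, // certain when identical, weak otherwise
      signals: same ? "same-address" : "no-shared-signals-checked",
    };
  },
};
```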

Explicit acknowledgement in the spec rationale. Even if the protocol layer doesn’t change, naming the role-independence assumption explicitly in the spec would help downstream implementers (like AAP) reason about what they can and cannot rely on. The current spec implicitly assumes independence without saying so, and that assumption shapes what every layer above 8183 can or cannot guarantee.

On AAP’s side, we’re treating this as something AAP itself needs to address as well. The next AAP revision will likely extend the EvaluatorDispute CoverageType (or introduce a new one) to provide a recovery path when role independence is violated post-completion, so honest victims can recover from the responsible party’s locked collateral even if the underlying 8183 state shows a clean Completed.

Happy to discuss any of this further in either thread. Thanks for the work on 8183, the protocol stack we’re building on top of it (8004 for identity, 8183 for commerce, 8210 for assurance) genuinely depends on it.


Staked. Evaluator address: 0x35eeDdcbE5E1AE01396Cb93Fc8606cE4C713d7BC
approve: 0xdc483c0de7c2ff6db56873f666e6b48993160576be57958ebb0f1e00e183b442
stake: 0x157875f13a90fdd75d594b9bf36a5257c824fdde9611281834ba6d49faaa3323

Glad to see it, your stake has been confirmed on-chain. You’re in the evaluator pool now. There’s a 24h warmup period before you’re eligible for job assignments; that’s the anti-Sybil mechanism. After that, jobs will start auto-assigning to you when funded. Welcome to the network, @pablocactus.


Hi agenttech, thanks for bringing this here. I think it’s worth discussing at the 8183 layer.

The root of the problem is clear: evaluator != provider as an address check is enforceable on-chain; evaluator != provider as an actor check is not. Two addresses funded from the same wallet are structurally indistinguishable from two genuinely independent parties at the contract level. The state machine can enforce structural separation, not behavioral independence. That boundary is real and worth naming explicitly in the spec rationale, as you suggest.

One design choice that meaningfully changes the threat model is auto-assignment. When the evaluator is drawn pseudo-randomly from a pool of staked participants rather than designated by the client, Case B (client = evaluator) is closed structurally: the client simply has no influence over who evaluates their job. Case A (provider = evaluator) becomes probabilistic rather than trivial. The malicious actor needs to stake, pass a warmup period, and then get randomly selected on the specific job they’re also the provider for. That’s not a complete solution, but the attack surface is materially different from a model where the client picks the evaluator freely.

The auto-assignment path does raise its own open question worth documenting here: if the random draw lands on the provider’s address, what is the correct behavior? Revert is auditable but penalizes the client for a collision they didn’t cause. Re-draw preserves liveness but adds complexity and a potential loop in small pools. We haven’t seen this discussed yet, but it feels like something every auto-assignment implementation will eventually hit, and a recommendation in the rationale would help.

On the deeper problem (same actor, two addresses), the assessIndependence() direction Pablo raised is the correct complement. Not a replacement for on-chain enforcement, but the layer that on-chain enforcement structurally cannot reach. The contract enforces structure; an independence oracle evaluates behavior. Both are needed.


Bakugo32, thanks for this. The “structure vs behavior” framing crystallizes in one sentence what I was reaching for in several. Contracts can enforce structural separation but not behavioral independence. That’s the boundary, and it does deserve to be named explicitly in the spec rationale.

The auto-assignment design you describe is a direction I hadn’t fully considered, and it’s worth unpacking.

On auto-assignment

Drawing the Evaluator pseudo-randomly from a pool of staked participants instead of letting the Client designate one materially changes the threat model. I agree with your assessment:

  • Case B (Client = Evaluator) is structurally closed. The Client has no influence over who evaluates their Job, so the entire class of self-evaluation attacks is shut down at the entry point
  • Case A (Provider = Evaluator) shifts from a trivially reachable attack to a probabilistic one. The malicious actor has to stake, pass a warmup period, and then get randomly selected on the specific Job where they are also the Provider. Both the cost and the determinism of the attack go up significantly

On the assignment-collision open question

The problem you flagged is a genuine engineering tradeoff that every auto-assignment implementation will eventually run into. My intuition leans toward re-draw over revert, for two reasons:

  1. Revert pushes the cost of the collision onto an innocent Client. The Client did nothing wrong; they simply hit a randomness edge case, and now their Job creation fails. This violates the principle that no-fault parties shouldn’t be penalized for system behavior
  2. Re-draw preserves liveness and lets the Client move forward with their workflow without retrying or giving up

But re-draw needs bounds; otherwise small pools or high Provider concentration could send it into a loop. A reasonable design might look like:

  • Cap the retry count (e.g. 3 attempts)
  • Exclude previously drawn addresses on each retry
  • After N consecutive collisions, fall back to revert and emit a specific event (e.g. EvaluatorAssignmentFailed) so the Client knows it’s a pool composition issue, not something they caused
  • That same event doubles as a governance signal that the pool needs to grow

A short recommendation along these lines in the 8183 rationale would help every auto-assignment implementation avoid stepping on the same rake.

On the structure vs behavior layering

I fully agree with your closing point. Contracts handle structure, independence oracles handle behavior, and the two are complementary rather than substitutes. This layering can actually be generalized across the entire agent economy stack:

  • 8183 enforces as much as it can at the structure layer (auto-assignment being a good example)
  • IRiskHook and assessIndependence-style extensions handle behavior evaluation
  • AAP handles post-hoc recovery at the claim resolution layer, giving honest victims a path to redress when the first two layers don’t catch the attack in time

This layered view will show up in AAP v2’s Inherited Assumptions section as the explicit interface contract between AAP and 8183.

Thanks for pushing this discussion to a more precise level.


Thanks for the detailed breakdown agenttech. The bounded re-draw design makes sense - cap the retries, exclude previously drawn addresses, fall back to revert with EvaluatorAssignmentFailed.

The exclusion of already-drawn addresses matters most during the bootstrapping phase that every network goes through, where the eligible pool is small enough that retrying without exclusion could exhaust the set without progress. The event doubling as a governance signal is a property we hadn’t considered and it’s a good one.

On the layering (8183 structure, IRiskHook behavior, AAP recovery): seeing it written out that explicitly confirms what we were intuiting from the implementation side. Each layer has a hard boundary on what it can guarantee, and naming those boundaries in AAP v2’s Inherited Assumptions section will prevent implicit trust assumptions from creeping into downstream implementations.

We’ll implement the re-draw pattern and document it as a reference - having alignment with the spec side on the approach makes the implementation more useful as a shared reference point.


Quick update: the multi-hop reference scenarios PR for ERC-8210 is up (ERC-8210: Add multi-hop workflow reference scenarios by cmayorga · Pull Request #1653 · ethereum/ERCs · GitHub), with all three composition examples we discussed here:

1. Multi-hop dependency tracking via the `upstream` field

2. `EvaluatorSlashed` → `fileClaim` (with @Bakugo32)

3. Hybrid off-chain scoring with shared `reasoningCID` (inspired by the AHS approach @pablocactus described in this thread)

@pablocactus, heads up: I’ve credited you in the scenario 3 docs by your forum handle since I don’t know your GitHub username. Happy to update the reference if you let me know.

CI is green, self-contained Foundry project, no dependency on the test vectors PR. Feedback welcome.


Glad we converged on this. The bootstrapping phase is the right framing for why exclusion matters most: that’s exactly when small-pool effects bite hardest, and without exclusion the same Provider address could keep getting drawn until the retry cap fires. By the time the network reaches steady state the issue largely resolves itself, but the spec needs to handle the early phase explicitly because that’s when most implementations will be tested.

Looking forward to seeing the re-draw pattern land in Demsys’s implementation. Once it’s in, it would be great to reference your contract as the canonical implementation example for stake-weighted Evaluator selection in the AAP v2 spec rationale (specifically in the Inherited Assumptions section, where we discuss what the structure layer can enforce). Having a real, deployed reference rather than a hypothetical pattern makes the section much more useful for downstream implementers.

Will also incorporate the EvaluatorAssignmentFailed event signature into the AAP v2 IRiskHook discussion, since hooks observing this event become a natural feedback channel for monitoring pool health.