ERC-8183: Agentic Commerce

Specifies the Agentic Commerce Protocol: job-based escrow where a client funds a job, a provider submits work, and a single evaluator attests completion or rejection. Optional attestation reason and optional hooks allow composition with reputation (e.g. ERC-8004) and gasless flows (ERC-2771).

Why can the client not release funds themselves? Allowing that would add both speed and privacy in the most common, non-dispute case.

Another design topic worth discussing is support for two linked jobs / two-phase jobs.

ACP currently models a one-shot lifecycle: fund → submit → evaluator complete/reject. That fits many agent tasks, but stateful DeFi workflows look different.

Example: a yield strategy.

  1. Open job: the provider opens a position according to the user’s constraints such as risk, lock, venue, and target APY.
  2. Close/evaluation job: later, a linked job evaluates whether the resulting position actually satisfied the promised terms and/or handles unwind and settlement.

In this model, hooks + optParams are useful for carrying commitments, but they do not fully solve the relationship semantics. The core may need a first-class linked-job concept, with invariants like:

  • the close job inherits the same client / provider / evaluator
  • a parent open job has at most one active close job
  • the close job can only be created or activated after the parent open job reaches the required state
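The three invariants above can be sketched as a toy in-memory registry. This is only an illustration under stated assumptions: `LinkedJobRegistry`, `create_close_job`, the `parent_id` field, and the choice of `COMPLETED` as the "required state" are all hypothetical names and defaults, not part of ERC-8183.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    OPEN = "open"
    FUNDED = "funded"
    SUBMITTED = "submitted"
    COMPLETED = "completed"
    REJECTED = "rejected"
    EXPIRED = "expired"


@dataclass
class Job:
    job_id: int
    client: str
    provider: str
    evaluator: str
    state: State = State.OPEN
    parent_id: Optional[int] = None  # hypothetical link from a close job to its open job


class LinkedJobRegistry:
    """Toy registry enforcing the linked-job invariants (illustrative only)."""

    TERMINAL = {State.COMPLETED, State.REJECTED, State.EXPIRED}

    def __init__(self, required_parent_state: State = State.COMPLETED):
        self.jobs: Dict[int, Job] = {}
        self.required_parent_state = required_parent_state  # assumption: configurable

    def add(self, job: Job) -> Job:
        self.jobs[job.job_id] = job
        return job

    def create_close_job(self, job_id: int, parent_id: int) -> Job:
        parent = self.jobs[parent_id]
        # The close job may only be created once the parent reached the required state.
        if parent.state is not self.required_parent_state:
            raise ValueError("parent job has not reached the required state")
        # A parent has at most one active (non-terminal) close job.
        for job in self.jobs.values():
            if job.parent_id == parent_id and job.state not in self.TERMINAL:
                raise ValueError("parent already has an active close job")
        # The close job inherits client, provider, and evaluator from the parent.
        return self.add(
            Job(job_id, parent.client, parent.provider, parent.evaluator, parent_id=parent_id)
        )
```

Note that the registry knows nothing about yield, risk, or venue: it only enforces the relationship, which is the "relationship-aware, not strategy-aware" split.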

So perhaps ACP should become relationship-aware, not strategy-aware: the core understands linked jobs, while hooks continue to carry domain-specific economics and settlement data.

Would be interested to hear whether others think linked jobs belong in the base standard or in a standardized extension profile.

In this standard, they can if they are the evaluator as well!

Great spec. The evaluator role is exactly where reasoning verification fits.

The spec notes that the evaluator “MAY be a smart contract that performs arbitrary checks (e.g. verifying a zero-knowledge proof or aggregating off-chain signals).” We’ve been building this for ERC-8004’s Validation Provider role, and 8183’s evaluator pattern maps directly.

Concrete integration: A ThoughtProof evaluator contract would:

1. Receive submit(jobId, deliverable) event

2. Trigger off-chain multi-model verification of the deliverable (multiple independent models evaluate through structurally different representations)

3. Post the signed verification result on-chain

4. Call complete(jobId, attestationHash) or reject(jobId, reason) based on consensus

The reason field on complete/reject is well-designed for this — the attestation hash can reference the full epistemic block (verifier identities, individual verdicts, disagreement map).
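As a rough sketch of steps 2 to 4, the off-chain side could reduce per-verifier verdicts to a single action plus a hash that commits to the full record. Everything here is an assumption for illustration: the function name `decide`, the two-thirds quorum, the record layout, and the use of SHA-256 as a stand-in for whatever hash the on-chain `reason` field would actually carry.

```python
import hashlib
import json
from typing import Dict, Tuple


def decide(job_id: int, verdicts: Dict[str, bool], quorum: float = 2 / 3) -> Tuple[str, str]:
    """Return (action, reason_hash) for a job, given one boolean verdict per verifier."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    approvals = sum(verdicts.values())
    # Complete only when the approving share reaches the (assumed) quorum.
    action = "complete" if approvals / len(verdicts) >= quorum else "reject"
    # The record keeps verifier identities and individual verdicts, so the
    # hash can later be checked against the published evidence.
    record = {"jobId": job_id, "verdicts": verdicts, "action": action}
    digest = hashlib.sha256(json.dumps(record, sort_keys=True).encode()).hexdigest()
    return action, "0x" + digest


action, reason = decide(7, {"model-a": True, "model-b": True, "model-c": False})
print(action)  # complete
```

Serializing with `sort_keys=True` keeps the hash stable regardless of the order in which verdicts arrive.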

Two questions for the authors:

1. Is there a recommended pattern for evaluators that need async off-chain computation before calling complete/reject? The current flow assumes the evaluator can decide synchronously, but verification takes 2-5 seconds.

2. For the “composition with reputation (e.g. ERC-8004)” mentioned in the abstract — do you envision the evaluator writing directly to 8004’s Reputation Registry on complete/reject? That would close the loop: verification result → payment release → reputation update in one atomic flow.

Disclosure: We’re building an open-source verification SDK (pot-sdk, MIT) and are registered as ERC-8004 Agent #28388.

I will be publishing a spec today. Let's chat more about this, as we already have this in production at taskmarket.daydreams.systems

Our V2 is far more fleshed out than the current ERC-8183, and I want to contribute to this proposal so we can make it complete.

The initial proposal is here; we have a more fleshed-out design than the current ERC-8183, supporting a far wider set of agentic commerce interactions.

Additionally, given our PGTR model, we would love to split this EIP into two as we did. Take a look here, @davidecrapis.eth

Comparing this to the two closest analogues I know of on EVM, OpenZeppelin’s ConditionalEscrow and Arkhai’s Alkahest:

| | OZ ConditionalEscrow | ERC-8183 | Alkahest |
|---|---|---|---|
| Escrowed action | ETH transfer | ERC-20 transfer | Generic (e.g. ERC20EscrowObligation, NativeTokenEscrowObligation, AttestationEscrowObligation) |
| Release condition | Abstract withdrawalAllowed(address) → bool | Evaluator calls complete/reject(jobId, reason?) | Abstract IArbiter.checkObligation(attestation, demand, escrowUid) → bool (e.g. TrustedOracleArbiter for behavior similar to ERC-8183, ERC8004Arbiter) |
| Counterparty | address payee on deposit() | Provider set at creation or via setProvider() | Unconstrained by default. Can be restricted e.g. via RecipientArbiter |
| Expiry & rejection | None | Rejected and Expired states both allow refund | No Rejected state, refund after expiration only; arbiters can change decision and fulfillers can retry |
| Composability | None | IACPHook callbacks on state transitions | Compositional arbiters (AllArbiter, AnyArbiter) |

As specified, EIP-8183 seems broadly useful but over-constrained for certain use cases. Since the specified behavior already intersects heavily with ERC20EscrowObligation + TrustedOracleArbiter, I’d love to see this EIP arrive at a form where slight variants of those (if not more EscrowObligation/Arbiter types) can be implementations of this EIP.

Currently there are two main points which make this difficult:

1. Open and Funded are two separate states in EIP-8183

In my opinion, the behavior of createJob, setBudget, and fund should not be within the scope of this EIP and instead left up to implementations. There’s a lot of complexity and possible variants to how deciding on a budget for and funding a job might work. Is it always true that it should be either the client or provider establishing the budget? Is it always true that the client should decide a specific provider, and before funding? etc.

Especially, much of the negotiation process might be happening off-chain, and it would be unnecessary overhead to require on-chain commitments for every step of the process. I would suggest instead constraining the EIP so that escrows begin at the Funded state, leaving everything before that to implementations.

2. Rejected and Expired are similar states in EIP-8183

Expired is defined in EIP-8183 as “Same as Rejected; escrow refunded to client”, which precludes some potentially useful behaviors. For instance, what if one provider submits work that’s rejected, but the client wants to keep the escrow open for another provider to try to fulfill? What if a provider would have succeeded with a minor revision to their work?

Of course the client can always use the refund to recreate another similar escrow, but this is operational complexity and carries its own implied behavior - e.g. that the new evaluator and provider can be different from the original. Instead, I would suggest that the Rejected state has a path to Expired, but also an optional path back to Funded.
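To make the suggestion concrete, here is a minimal transition-table sketch starting at Funded, with the retry edge behind a flag. The table is a simplified assumption for illustration (for example, it models the refund as happening on entry to Expired) and is not a transcription of the spec's state machine; `transitions` and `can_move` are hypothetical helpers.

```python
from enum import Enum
from typing import Dict, Set


class State(Enum):
    FUNDED = "Funded"
    SUBMITTED = "Submitted"
    COMPLETED = "Completed"
    REJECTED = "Rejected"
    EXPIRED = "Expired"


def transitions(allow_retry: bool) -> Dict[State, Set[State]]:
    """Allowed moves; with allow_retry, a rejected escrow can reopen for new work."""
    table = {
        State.FUNDED: {State.SUBMITTED, State.EXPIRED},
        State.SUBMITTED: {State.COMPLETED, State.REJECTED, State.EXPIRED},
        State.COMPLETED: set(),            # terminal: provider paid
        State.REJECTED: {State.EXPIRED},   # refund still happens at expiry
        State.EXPIRED: set(),              # terminal: client refunded
    }
    if allow_retry:
        # Optional path back to Funded: same escrow, another attempt.
        table[State.REJECTED].add(State.FUNDED)
    return table


def can_move(table: Dict[State, Set[State]], src: State, dst: State) -> bool:
    return dst in table[src]
```

With `allow_retry=False` the rejected escrow can only run out to expiry; with `allow_retry=True` the same escrow, and therefore the same evaluator and budget, stays available for a revised or competing submission.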


Besides these two points, there are also uses for escrowing assets or actions besides ERC-20 tokens (other token types, actions like attestation or voting…). I think release conditions can almost always be implemented as an “evaluator” (e.g. via a self-serve smart contract for automatic evaluation processes), but there are cases for which this might be somewhat unnatural.

If the complexity of abstracting the escrowable action and the release condition is beyond the scope of this EIP, it may also be worth considering a separate, more complex but more general variant EIP, as ERC-1155 is to ERC-20/ERC-721.

See [Draft ERC] Task Market Protocol

For our proposal, @davidecrapis.eth, either we open a new ERC or I contribute to yours. Let me know what works best.

Credit card companies escrow for every transaction. There’s nothing especially “agentic” about it. The title of the ERC is aspirational rather than technical. If you want to standardize escrow, that should be the title of your ERC. Escrow is an important building block of commerce that is under-utilized in smart contracts and as a term it is well-understood.

Your escrow standard does not specify an ABI for smart contracts. It looks like you want to use the Solidity ABI, but your parameters don’t have types and some of them are optional, which Solidity doesn’t natively support.

I’ve also thought about escrow and I’ve thought that settlement should be the default. In a healthy economy you would expect challenges to be rare, so I would design around the efficiency of the common case, which could mean not posting commitments on-chain except in the event of dispute.

I think you would benefit a lot from implementing a reference implementation. Perhaps your spec will change after you weigh some tradeoffs while engineering it.

ERC-8183 does not define what “agentic” means. The word appears in the title and is sprinkled through the framing, but there is no definition of an agent, and by design there cannot be one.

The protocol should treat humans and AI agents identically, since it does not care whether the end user is an AI agent or a human. The label “agentic” implies that the protocol encodes something it does not and should not.

What the EIP actually specifies is a job registry with escrowed funds: a job is created, a budget is set, funds are locked, work is submitted, and an evaluator releases or refunds the escrow. This lifecycle has nothing to do with autonomous behavior. By the same logic, an ERC-20 transfer is “agentic” if an AI initiates it.

Deployed the ERC-8183 Agentic Commerce Protocol on the Sepolia testnet.

Hi, we’ve been running an off-chain system since December 2025 that maps closely to this spec. A few observations from production that might be useful:

1. The Evaluator is where the real complexity lives.
The spec describes it as “just an address” — which is elegant. In practice, for subjective tasks (research, creative work, analysis), building a reliable evaluator required significant iteration. We use an AI coordinator as our evaluator. Curious whether the spec envisions any guidance here, or leaves it entirely to implementors.

2. Reputation gating as a beforeAction hook is essential.
Without a quality signal at the fund stage, open markets fill with low-effort submissions quickly. We gate task claims on a trust score. This maps naturally to a beforeAction(fund) hook and has been one of the most valuable parts of our system.

3. A credit abstraction layer has real UX value.
Not every task justifies an on-chain transaction. We maintain an off-chain credit layer for micro-tasks, with on-chain settlement for higher-value work. Worth considering whether the spec should say anything about sponsored/abstracted payment flows.
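For point 2, a reputation gate wired in as a pre-action hook might look roughly like the sketch below. The class name, the `before_action` signature, and the score table are hypothetical; the real hook interface (IACPHook) is defined by the spec and is not reproduced here.

```python
from typing import Dict


class ReputationGate:
    """Hypothetical hook: blocks funding unless the provider's trust score is high enough."""

    def __init__(self, scores: Dict[str, float], min_score: float):
        self.scores = scores
        self.min_score = min_score

    def before_action(self, action: str, provider: str) -> None:
        # Only the funding step is gated; every other action passes through.
        if action != "fund":
            return
        score = self.scores.get(provider, 0.0)  # assumption: unknown providers score 0
        if score < self.min_score:
            raise PermissionError(f"provider {provider} is below the trust threshold")


gate = ReputationGate({"0xabc": 0.9, "0xdef": 0.2}, min_score=0.5)
gate.before_action("fund", "0xabc")  # passes silently
```

Raising inside the hook models a revert: the funding call fails before any escrow is locked, so low-trust claims never reach the evaluator.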

Happy to share more from our implementation if useful.

@clawplaza Great to see someone running this in production. Your three points resonate strongly with what we’ve built.

On evaluator complexity: We deployed a multi-model evaluator contract on Base Mainnet (0x119299F33f918808edD5ef92bd79cefB8700C091) specifically for ERC-8183. The key insight: a single AI coordinator as evaluator is a single point of failure. Our approach uses 3+ independent models that must reach consensus (MDI threshold 0.700) before attesting completion. This makes gaming the evaluator exponentially harder.

On reputation gating: Agreed this is where beforeAction hooks become critical. We’ve been working on trust scoring that decays over time and adjusts based on behavioral signals — exactly the kind of signal an evaluator needs before releasing funds.
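One simple way to get a score that decays over time is an exponential half-life. The sketch below is an assumption for illustration only: the post does not say which decay curve is used, and the function name and parameters are hypothetical.

```python
import math


def decayed_score(score: float, age_seconds: float, half_life_seconds: float) -> float:
    """Exponentially decay a trust score so that it halves every half_life_seconds."""
    if half_life_seconds <= 0:
        raise ValueError("half_life_seconds must be positive")
    if age_seconds < 0:
        raise ValueError("age_seconds must be non-negative")
    return score * math.pow(0.5, age_seconds / half_life_seconds)


# A score of 0.8 observed two half-lives ago counts for a quarter of its value.
print(round(decayed_score(0.8, age_seconds=20, half_life_seconds=10), 3))  # 0.2
```

Behavioral adjustments would then be applied on top of the decayed value, so stale good behavior cannot indefinitely offset recent bad signals.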

On credit abstraction: We’ve integrated with x402 for micropayments. The missing piece nobody talks about: everyone verifies the payment, nobody verifies the work. That’s the gap ERC-8183 + a robust evaluator closes.

Would be interested to compare notes on your Clawdia coordinator — specifically how you handle disagreement between the evaluator’s assessment and the provider’s claim.

The risk to both the evaluation / validation and even the reputation scores is that most such scores in computer science aren’t actually measures. Measures are verifiably linear, unbiased, traceable to a scientific standard, accurate, and precise. There are new approaches for both LLMs and humans, using the Rasch Model and Commons’ Model of Hierarchical Complexity, that would make these evaluations (and goal setting before evaluation) even stronger:

Barney, M. & Barney, F. (August 26, 2024). Transdisciplinary Measurement through AI: Hybrid metrology and psychometrics powered by large language models. In W.P. Fisher Jr., & L. Pendrill (Eds.). Models, Measurement, and Metrology extending the Systeme International d’Unités. De Gruyter. DOI:10.1515/9783111036496-003

Barney, M., Wind, S., & Krishna, V. (January 2, 2026). Using large language models to evaluate ethical persuasion text: A measurement modeling approach. International Journal of Assessment Tools in Education, 13(1), 224–247. DOI: 10.21449/ijate.1788563

This is a real tension we’ve lived with in production.

Our current approach is pragmatic rather than rigorous: trust score combines completion rate, timing, and identity signals, all gameable in isolation, but harder to game together at scale.

You’re right that none of these are “measures” in the scientific sense. The honest answer is we’ve prioritized “hard to game cheaply” over “verifiably linear.” At 20k agents the Sybil pressure is constant.

The Rasch Model approach is interesting: has anyone applied it to adversarial agent environments where the evaluated parties actively try to manipulate scores?

There’s also the question of why to standardize escrow, and of the role of ERCs in the Ethereum ecosystem. I think the benefit of standardizing an interface via an ERC is ecosystem-wide composability across diverse implementations from authors who don’t know each other. Take the most popular ERC standard, ERC-20: there are many implementations with radically different internal mechanisms, but because the interface is small and defined at a very generic boundary, it’s possible to make things like DEXs, payment systems, and escrow contracts that work with any ERC-20 token.

In particular, Rob Pike’s “The bigger the interface, the weaker the abstraction” applies strongly. I think the parts of “escrow” that would be most usefully abstracted into an ERC are

  1. the process of conditionally promising a future on-chain action
  2. the conditions for such promises

With these two standardized, many implementations of escrowable actions (token transfers, attestations, voting, rating in a registry…) can be combined with many implementations of conditions (single evaluator, vote by many evaluators, cryptographic proof…), and conditions can be composed such that people can also write and use useful “microconditions” like for counterparty, expiration time etc., which may all have more elaborate mechanisms than comparing against a single value (accepting fulfillments from anyone on a whitelist, or anyone except a blacklist, conditional expiration…).
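The composition argument can be illustrated with plain predicates. This is a sketch of the idea only: the `Condition` type, the fulfillment dictionary layout, and the helper names are hypothetical and do not mirror the Alkahest interfaces.

```python
from typing import Any, Callable, Dict, Set

# A condition is any predicate over a proposed fulfillment (assumed dict layout).
Condition = Callable[[Dict[str, Any]], bool]


def all_of(*conds: Condition) -> Condition:
    """Satisfied only when every sub-condition is satisfied."""
    return lambda f: all(c(f) for c in conds)


def any_of(*conds: Condition) -> Condition:
    """Satisfied when at least one sub-condition is satisfied."""
    return lambda f: any(c(f) for c in conds)


# "Microconditions": each one checks a single concern.
def from_whitelist(allowed: Set[str]) -> Condition:
    return lambda f: f["provider"] in allowed


def not_expired(deadline: int) -> Condition:
    return lambda f: f["timestamp"] <= deadline


def approved_by(evaluator: str) -> Condition:
    return lambda f: evaluator in f["approvals"]


# Release if a whitelisted provider delivers in time and either evaluator approves.
release = all_of(
    from_whitelist({"alice", "bob"}),
    not_expired(1_000),
    any_of(approved_by("eval-1"), approved_by("eval-2")),
)
print(release({"provider": "alice", "timestamp": 900, "approvals": ["eval-2"]}))  # True
```

Because each piece has the same shape, a single-evaluator escrow is just the special case `approved_by(evaluator)`, and richer policies need no change to the escrow itself.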

As I linked above, Alkahest already exists as a functional and audited escrow system which abstracts along these two lines (IArbiter, BaseEscrowObligation). One of our main reasons for not pushing its interfaces as an ERC standard has been its strong dependence on EAS, which though widely known and used, is not itself an ERC standard. This leads to my next point.

Ethereum is unusual in that contracts are all visible to each other, yet they are analogous to OOP “objects”, which anywhere else would normally be visible to each other only inside a single program (usually owned by a single org). This means that designing ERCs for Ethereum is very different from designing public infrastructure for other ecosystems.

In particular, not all public-good infrastructure should be an ERC. EAS functions as a set of canonical deployments on the most widely used EVM blockchains, and is not an ERC because people are intended to interact with it via those canonical deployments, not by deploying many variants that follow a similar abstract interface. This is similar to the status of ERC-8004, which I also believe should have been a set of canonical deployments rather than an ERC.

Following @wjmelements’s point, the title of this proposal, “Agentic Commerce”, suggests that this EIP may be going in a similar direction: using the EIP process to gather more authenticity and attention for what is really more like a particular application ecosystem (consider that Uniswap hooks are also not an ERC).

More broadly, I think the Ethereum ecosystem should be careful about the EIP process being used to lend the appearance of neutrality and public-good status to what are effectively particular application designs backed by particular organizations. The legitimacy of ERCs comes from their role as minimal, unopinionated interfaces that emerge from broad community need - not from top-down specification of application-layer patterns by well-resourced groups. When the EIP process is used for the latter, it risks diluting the signal that an ERC number is supposed to carry, and allows organizations with more resources and institutional proximity to the EF to effectively “reserve” design space that should remain open to organic community development.

  1. I would suggest setting up an evaluator runtime that listens for the events and then processes each job asynchronously. Each job comes with an expiry, so as long as the evaluator can call complete/reject in time it should be fine. Does that answer your question?
  2. It depends on how the design is done, but yes, I would think this is the most seamless loop!
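A minimal sketch of that listen-then-process pattern, using `asyncio` and an injected clock. The event layout, the `evaluate_pending` name, and the `"skip-expired"` marker are assumptions for illustration; a real runtime would subscribe to chain events and send transactions instead of returning strings.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Tuple

# Hypothetical event shape: {"jobId": int, "deliverable": str, "expiry": float}
Event = Dict[str, Any]


async def evaluate_pending(
    events: List[Event],
    verify: Callable[[str], Awaitable[bool]],
    now: Callable[[], float],
) -> List[Tuple[int, str]]:
    """Verify each submitted job off-chain and return the call the evaluator would make."""

    async def handle(ev: Event) -> Tuple[int, str]:
        ok = await verify(ev["deliverable"])  # the slow off-chain work happens here
        # Re-check the deadline after verifying: a late verdict cannot be used.
        if now() >= ev["expiry"]:
            return ev["jobId"], "skip-expired"
        return ev["jobId"], "complete" if ok else "reject"

    # Run verifications concurrently so one slow job does not block the others.
    return list(await asyncio.gather(*(handle(ev) for ev in events)))
```

Checking the expiry after verification, rather than before, avoids submitting a verdict for a job that timed out while the off-chain check was still running.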

@MattBarney — We ran the experiment. Full honesty about what we found and where it’s thin.

Setup: 100 prompts (50 clear-cut + 50 edge cases), verified independently by Claude Sonnet 4.6, GPT-5.4, and DeepSeek Chat. 3 models from 3 independent providers. ~200 successful verification calls.

V1 (clear cases): 49/50 unanimous. Rasch couldn’t fit — insufficient variance.

V2 (edge cases): 10/50 disagreements. Rasch converged.

Key findings:

• Model strictness (β): Claude = −1.66 (tolerant), GPT-5.4 and DeepSeek = +0.98 (strict)

• All three show good Rasch fit (Infit MNSQ 0.5–1.5)

• MDI ↔ |Rasch θ| correlation: |r| = 0.78

Honest gaps:

• N=10 disagreement items is thin for Rasch. Person Separation 1.24, Reliability 0.61 — both below conventional thresholds (2.0 / 0.80)

• GPT-5.4 and DeepSeek show near-identical strictness — effectively 2 distinct levels, not 3

• We designed the edge-case prompts ourselves, which introduces selection bias

• One disagreement (T05) was a factual error by DeepSeek, not a genuine precision boundary

• Same system prompt for all models — true independence would vary prompting strategy
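As a quick consistency check on the first gap, the standard Rasch relation between separation G and reliability R is R = G² / (1 + G²). The helper names below are hypothetical; the numbers are the ones reported above.

```python
import math


def reliability_from_separation(g: float) -> float:
    """Reliability implied by a separation index G: R = G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)


def separation_from_reliability(r: float) -> float:
    """Inverse relation: G = sqrt(R / (1 - R)), defined for 0 <= R < 1."""
    if not 0.0 <= r < 1.0:
        raise ValueError("reliability must be in [0, 1)")
    return math.sqrt(r / (1.0 - r))


print(round(reliability_from_separation(1.24), 2))  # 0.61
print(round(reliability_from_separation(2.0), 2))   # 0.8
```

So the reported pair (1.24, 0.61) and the conventional thresholds (2.0, 0.80) are each two views of the same quantity rather than independent shortfalls.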

Interpretation: MDI is not Rasch-linear and this experiment doesn’t prove it is. But MDI empirically tracks the same construct as Rasch θ at the boundary. The correlation is encouraging, the sample is small, and more verifiers from more families would strengthen both separation and reliability.

Full writeup: thoughtproof.ai/blog/mdi-meets-rasch

hey i believe we have a tg group with daydreamer agent! lmk if you are not in it!