ERC-8183: Agentic Commerce

@miratisu — Thanks, this clarifies the async pattern.

We tested the full evaluator flow this morning:

1. Receive deliverable from JobSubmitted event

2. Run multi-model verification (Claude, GPT-5.4, DeepSeek) in parallel

3. Two confirmed, one rejected (insufficient evidence). MDI 0.667, below threshold → reject()

4. Signed by our ERC-8004 Agent #28388 signer, attestation hash computed
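
The decision step above can be sketched roughly as follows. This is a minimal TypeScript sketch, assuming the MDI score is simply the fraction of models that confirm the deliverable (an assumption that happens to match 2 of 3 = 0.667). The names `Verdict`, `agreementScore`, and `decide` are hypothetical and not part of pot-sdk or the evaluator contract.

```typescript
// Hypothetical verdict shape; not the pot-sdk type.
type Verdict = { model: string; confirmed: boolean };

// Assumption: the score is the fraction of models that confirmed.
function agreementScore(verdicts: Verdict[]): number {
  if (verdicts.length === 0) throw new Error("no verdicts");
  const confirmed = verdicts.filter((v) => v.confirmed).length;
  return confirmed / verdicts.length;
}

// Complete only when the score reaches the threshold; otherwise reject.
function decide(verdicts: Verdict[], threshold: number): "complete" | "reject" {
  return agreementScore(verdicts) >= threshold ? "complete" : "reject";
}

// Two confirmations and one rejection, as in the test run described above.
const verdicts: Verdict[] = [
  { model: "model-a", confirmed: true },
  { model: "model-b", confirmed: true },
  { model: "model-c", confirmed: false },
];
const score = agreementScore(verdicts);
const decision = decide(verdicts, 0.7);
```

With three models, a 0.700 threshold effectively requires unanimity, since 2 of 3 falls just short.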

On reputation: agreed — extending complete() to write to the 8004 Reputation Registry is the natural next step. Verification, payment, reputation in one flow.

We have the evaluator deployed on both Base Mainnet (0x119299F3…C091) and Base Sepolia (0x04Ae807b…52e7). Happy to test against a live ACP job contract on either.

The multi-model consensus approach is fascinating — MDI threshold 0.700 is an interesting choice, curious how you arrived at that number.

On disagreement: we have a mechanism called The Arbiters — a panel of AI agents, not a human committee. When evaluator assessment conflicts with a provider’s claim, it goes to The Arbiters. They deliberate using on-platform data only: transaction records, trust history, behavioral patterns. No human can override a verdict. More on the design: work.clawplaza.ai/launch

On reputation gating: we already implement time-decaying trust scores — agents lose points automatically when they go inactive. Behavioral decay is live, not planned.

On x402: we’ve already integrated it for credit top-ups — HTTP 402 payment flow on top of USDC on Base. Agreed that payment verification without work verification is only half the problem solved.

@clawplaza — Good question on the 0.700 threshold. It’s not arbitrary.

pot-sdk uses a failure-cost model (v0.6) where different domains require different confidence levels:

negligible (chatbot): 0.50

low (blog post): 0.60

moderate (default): 0.70

high (medical/legal): 0.80

critical (life safety): 0.90

The design principle: different domains need different consensus thresholds. A chatbot answer and a medical diagnosis shouldn’t require the same level of multi-model agreement.

As of v1.2.0, the on-chain evaluator supports per-contract threshold configuration — operators can assign a failure-cost tier (setContractTier) or a custom value (setContractThreshold) to each job contract. A $0.01 text summary and a $500 smart contract audit don’t need the same bar. Default falls back to 0.700 (moderate) when no override is set.
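
To make the fallback order concrete, here is a minimal off-chain sketch in TypeScript. The tier values are the ones listed above; the `ThresholdConfig` class and the rule that a custom value takes precedence over a tier are assumptions for illustration, and the on-chain contract may resolve overrides differently.

```typescript
// Tier values as listed in this post (failure-cost model).
const TIER_THRESHOLDS = {
  negligible: 0.5,
  low: 0.6,
  moderate: 0.7,
  high: 0.8,
  critical: 0.9,
} as const;
type Tier = keyof typeof TIER_THRESHOLDS;

// Hypothetical off-chain mirror of the per-contract configuration.
class ThresholdConfig {
  private tiers = new Map<string, Tier>();
  private custom = new Map<string, number>();

  setContractTier(contract: string, tier: Tier): void {
    this.tiers.set(contract.toLowerCase(), tier);
  }

  setContractThreshold(contract: string, value: number): void {
    if (!(value > 0 && value <= 1)) throw new Error("threshold out of range");
    this.custom.set(contract.toLowerCase(), value);
  }

  // Assumption: custom value wins over tier; default is "moderate" (0.7).
  thresholdFor(contract: string): number {
    const key = contract.toLowerCase();
    const custom = this.custom.get(key);
    if (custom !== undefined) return custom;
    const tier = this.tiers.get(key);
    return TIER_THRESHOLDS[tier ?? "moderate"];
  }
}
```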

The tiers are pragmatic starting points, not empirically calibrated. Domain-specific calibration will come with real evaluation data — which is part of why we’re keen to connect to a live job flow.

On “The Arbiters” — interesting architecture. Curious how your time-decaying trust scores interact with per-job evaluation results.

Evaluator role from a live reputation provider’s perspective

We operate Sentinel, an independent reputation provider that has been scoring ACP agents on Virtuals’ marketplace since February 2026 and writing on-chain attestations to ERC-8004 registries on Ethereum, Base, and Solana. We’ve processed dozens of agent reputation queries and have direct operational experience with the evaluator pattern described in ERC-8183.

A few observations from practice that may be useful:

1. The evaluator role is the right abstraction

ACP currently bundles evaluation into the platform itself — Virtuals handles job acceptance, delivery validation, and payment release as a single entity. ERC-8183 separating this into a distinct evaluator address is a significant improvement. In our experience, clients and providers both want the option of a neutral third party, but most jobs don’t need one. Having evaluator = client as the default with the option to delegate to a specialist evaluator is the correct design.

2. Evaluator-as-smart-contract is where this gets powerful

The spec mentions the evaluator MAY be a smart contract that performs arbitrary checks before calling complete or reject. We’d like to see this pattern emphasized. An evaluator contract that reads ERC-8004 reputation data before making decisions creates a composability loop: job outcomes feed reputation, reputation informs future job evaluations. This is the flywheel that makes the whole stack (x402 + 8004 + 8183) self-reinforcing.

A concrete example from our work: when a buyer agent queries Sentinel for a reputation report, we check the target agent’s success rate, job history, and activity status. An ERC-8183 evaluator contract could do the same check automatically — rejecting job submissions from providers whose ERC-8004 reputation score falls below a threshold, or requiring a minimum feedback count before releasing funds.
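
A rough sketch of that gating logic, written in TypeScript for readability rather than as contract code. The `ReputationSnapshot` shape, the 0..100 score scale, and the `gateProvider` function are hypothetical; they do not reflect the ERC-8004 registry interface or Sentinel's scoring.

```typescript
// Hypothetical snapshot of what an evaluator might read from a reputation registry.
interface ReputationSnapshot {
  score: number;         // assumed to be normalized to 0..100 in this sketch
  feedbackCount: number; // number of feedback entries recorded
}

interface GatePolicy {
  minScore: number;
  minFeedbackCount: number;
}

type GateResult = { allowed: true } | { allowed: false; reason: string };

// Check history depth first, then the score itself.
function gateProvider(rep: ReputationSnapshot, policy: GatePolicy): GateResult {
  if (rep.feedbackCount < policy.minFeedbackCount) {
    return { allowed: false, reason: "insufficient feedback history" };
  }
  if (rep.score < policy.minScore) {
    return { allowed: false, reason: "reputation score below threshold" };
  }
  return { allowed: true };
}
```

Checking the feedback count before the score avoids treating a high score built on one or two entries as meaningful.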

3. Attestation reason should link to ERC-8004 feedback

The reason field on complete and reject is described as optional. We’d suggest making it strongly recommended (SHOULD rather than MAY) and defining a standard format that maps directly to ERC-8004’s giveFeedback parameters. If every job completion automatically generates an ERC-8004 feedback entry, reputation becomes a natural byproduct of commerce rather than an extra step.

Proposed: the reason field could contain keccak256(abi.encode(agentRegistry, agentId, value, tag1, tag2, feedbackURI)) — the same data needed for an ERC-8004 giveFeedback call. An after-hook could then write this feedback automatically.

4. Question on linked jobs / multi-step workflows

Responding to @gpt3_eth’s point about linked jobs: we see this need in our own review service. Our agent_review_trial offering is a two-phase workflow — Sentinel first pays the target agent to run a test job, then evaluates the result and delivers a report to the buyer. Currently we handle this as two separate ACP jobs with internal state tracking. A first-class linked-job primitive would simplify this significantly.

However, we’d argue this belongs in a standardized extension (using the hook system) rather than the base protocol. The minimal surface of ERC-8183 is its strength.

5. Evaluator liveness: a two-layer architecture

One concern from operating as a de facto evaluator on ACP: if the evaluator goes offline, jobs get stuck in Submitted state until expiry. The spec handles this correctly with claimRefund after expiredAt, but for the evaluator-as-a-service model to gain adoption, clients need stronger guarantees than “you’ll get refunded eventually.”

From our operational experience, we’d suggest a two-layer evaluator pattern:

Layer 1: Evaluator smart contract (on-chain). The evaluator address on the job points to a contract, not an EOA. This contract holds evaluation data (score, reasoning hash, accept/reject decision) posted by the off-chain evaluator service. A permissionless executeEvaluation(jobId) function allows anyone — the client, the provider, or a keeper — to trigger the actual complete/reject call once the evaluation data is available on-chain. This means the evaluator service doesn’t need to be online at the exact moment someone wants to finalize — it just needs to have posted its assessment before expiredAt.

Layer 2: Off-chain evaluation service (Sentinel, or any evaluator). This service monitors for Submitted events, reviews the deliverable, checks provider reputation via ERC-8004, and writes the evaluation result to the evaluator contract. It can run redundant instances for availability. If it goes down briefly, no jobs are lost — they just wait until evaluation data appears, then anyone can trigger finalization.

This separation has a nice property: it decouples evaluation judgment from evaluation execution. The off-chain service provides the judgment. The on-chain contract and permissionless finalization provide guaranteed execution. It also means evaluation data could come from multiple sources (reputation score from one service, deliverable quality check from another) aggregated in the contract before triggering a decision.

The hook system in the spec already supports this pattern — an afterAction hook on submit could automatically request evaluation from registered evaluator contracts. But it might be worth calling out the evaluator-contract pattern explicitly in the spec as a recommended architecture for production evaluators, rather than having evaluators be EOAs that must be online to call complete.
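
To illustrate the split between posting an evaluation and executing it, here is a small in-memory model in TypeScript. It is a sketch under stated assumptions, not contract code: `EvaluatorContractModel`, `postEvaluation`, and the `JobContract` interface are hypothetical names, and only `executeEvaluation(jobId)` comes from the description above.

```typescript
type Decision = "complete" | "reject";

interface Evaluation {
  decision: Decision;
  score: number;
  reasoningHash: string;
}

// Hypothetical stand-in for the job contract's complete/reject entry points.
interface JobContract {
  complete(jobId: number, reason: string): void;
  reject(jobId: number, reason: string): void;
}

class EvaluatorContractModel {
  private evaluations = new Map<number, Evaluation>();
  private finalized = new Set<number>();
  private service: string;
  private job: JobContract;

  constructor(service: string, job: JobContract) {
    this.service = service;
    this.job = job;
  }

  // Layer 2 writes here; only the registered evaluation service may post.
  postEvaluation(caller: string, jobId: number, evaluation: Evaluation): void {
    if (caller !== this.service) throw new Error("not the evaluation service");
    if (this.finalized.has(jobId)) throw new Error("already finalized");
    this.evaluations.set(jobId, evaluation);
  }

  // Layer 1 finalization: callable by anyone once evaluation data exists.
  executeEvaluation(jobId: number): Decision {
    const evaluation = this.evaluations.get(jobId);
    if (evaluation === undefined) throw new Error("no evaluation posted");
    if (this.finalized.has(jobId)) throw new Error("already finalized");
    this.finalized.add(jobId);
    if (evaluation.decision === "complete") {
      this.job.complete(jobId, evaluation.reasoningHash);
    } else {
      this.job.reject(jobId, evaluation.reasoningHash);
    }
    return evaluation.decision;
  }
}
```

The caller check sits only on `postEvaluation`; `executeEvaluation` takes no caller at all, which is what makes finalization permissionless in this model.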


We’re actively building toward becoming an ERC-8183 evaluator and would be happy to contribute to the reference implementation and testing. Our ERC-8004 identities: Ethereum #27911, Base #21020, Solana agent #393.

Great to see this moving toward a formal EIP. We are building Social Intel at socialintel.dev - an Instagram lead data API with live x402 payments on Base. How does ERC-8183 relate to our x402 micropayment flow?

Current model: instant lookups via x402 (less than $0.10, sub-second). No escrow needed. Works for deterministic single-query data.

Where ERC-8183 fits: batch research jobs where clients need an independent evaluator for row count and field completeness. Natural fit for the job primitive.

Question on evaluator design: For structured JSON APIs, is there a pattern for schema-based evaluation in the job schema itself, rather than a separate evaluator contract address?

Also curious about the x402 plus ERC-8183 relay pattern. We would be interested in contributing - already live with x402 on Base.

We are working on the base implementation contract and will share it as soon as it is ready! Excited to have the evaluator hook hooked up to the job journey!

You’re right on the spec clarity. Looking at our implementation:

The “optional” fields are handled as empty values in Solidity-native ways:

• optParams as bytes calldata — passed through to hooks, empty bytes if unused

• hook as required address — but address(0) means “no hook”

• reason as required bytes32 — but bytes32(0) means “no reason provided”
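
Off-chain callers then need to map those sentinel values back to "absent". A minimal TypeScript sketch, assuming the values arrive as hex strings the way common Ethereum client libraries return them; `decodeOptionals` is a hypothetical helper, not part of the reference implementation.

```typescript
const ZERO_ADDRESS = "0x" + "0".repeat(40);
const ZERO_BYTES32 = "0x" + "0".repeat(64);

interface DecodedOptionals {
  hook: string | null;
  reason: string | null;
  optParams: string | null;
}

// Translate Solidity-style empty values into explicit nulls.
function decodeOptionals(hook: string, reason: string, optParams: string): DecodedOptionals {
  return {
    hook: hook.toLowerCase() === ZERO_ADDRESS ? null : hook,
    reason: reason.toLowerCase() === ZERO_BYTES32 ? null : reason,
    // Empty bytes are commonly rendered as "0x".
    optParams: optParams === "0x" || optParams === "" ? null : optParams,
  };
}
```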

Reference implementation is deploying this week. Thanks for the feedback! We welcome more once the contracts are out and people can actually poke at them.

Hey @mlegls, appreciate the detailed comparison with Alkahest and OZ ConditionalEscrow. A few thoughts on the technical points, but curious what you think:

On Open/Funded split:

You’re right that negotiation can happen off-chain. The separation exists for bidding flows (provider = address(0) at creation, set later via setProvider()). But I hear you — implementations that don’t need bidding could batch create+setProvider+fund.

Would it help if the spec explicitly said “pre-Funded behavior is optional; implementations may start jobs at Funded”? Core would still define the state machine for interoperability, but not force the ceremony. What do you think?

On Rejected being terminal:

This is where we might disagree. The “retry with revision” flow happens before submit — provider iterates while Funded, gets feedback, refines. Once they hit submit, they’re committing to “this is my best work.”

Your scenario (client wants Provider B after rejecting Provider A) feels like a new engagement to me — new terms, potentially new budget. But I could be wrong. Do you see a clean way to allow Rejected → Funded without breaking the commitment model for providers?
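
For reference, the transitions discussed so far can be written as a small table. This TypeScript sketch only covers transitions mentioned in this thread; the state names `Completed` and `Expired` are assumptions, the expiry check on `claimRefund` is omitted, and the spec may allow transitions not shown here.

```typescript
type JobState = "Open" | "Funded" | "Submitted" | "Completed" | "Rejected" | "Expired";
type Action = "fund" | "submit" | "complete" | "reject" | "claimRefund";

// Only transitions named in this discussion; Rejected has no outgoing edges.
const TRANSITIONS: Record<JobState, Partial<Record<Action, JobState>>> = {
  Open: { fund: "Funded" },
  Funded: { submit: "Submitted" },
  // claimRefund additionally requires the expiry time to have passed (not modeled).
  Submitted: { complete: "Completed", reject: "Rejected", claimRefund: "Expired" },
  Completed: {},
  Rejected: {},
  Expired: {},
};

function next(state: JobState, action: Action): JobState {
  const target = TRANSITIONS[state][action];
  if (target === undefined) throw new Error(action + " not allowed in " + state);
  return target;
}

function isTerminal(state: JobState): boolean {
  return Object.keys(TRANSITIONS[state]).length === 0;
}
```

Allowing Rejected → Funded would mean adding an outgoing edge to the `Rejected` row, which is exactly the change under debate.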

On Alkahest alignment:

Agreed we’d benefit from shared interfaces. Would love to explore what that looks like — especially around the IArbiter pattern you mentioned. Any thoughts on where those interfaces should live?

@sebastianwall Great to see someone running x402 payments on Base in production. Your question about schema-based evaluation in the job schema is exactly what we’ve been building toward.

On the x402 → ERC-8183 relay pattern:

The flow we’ve implemented:

1. Agent sends HTTP request → gets 402 + payment details

2. Agent pays via x402 (USDC on Base)

3. Provider submits deliverable → JobSubmitted event on ERC-8183

4. Our evaluator service listens for that event, runs multi-model verification on the deliverable against the job description, and calls complete() or reject() before expiry

The evaluator is a separate service that acts as an event listener — it doesn’t need to be in the request path. As @miratisu mentioned, it just needs to respond before expiredAt.

On schema-based evaluation:

We use tiered verification: the job’s escrowed budget determines how many models verify and which ones. A $0.10 data query gets 2 cheap models (~$0.001 cost). A $10K contract gets 4-5 models including frontier ones. The evaluation prompt is built from the job description + deliverable, both wrapped in XML tags to resist prompt injection from the deliverable content.

For structured data like your Instagram lead API responses, you could include a JSON schema in the job description. The verifiers then check: did the response match the schema? Are the fields plausible? Is it fabricated or real? Multi-model disagreement on “is this data real” is a strong signal.
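
Before any model is consulted, a cheap deterministic pre-check could cover row count and field completeness. The sketch below is a minimal TypeScript example that assumes a hypothetical, simplified schema format (required field names mapped to primitive types) rather than full JSON Schema; `checkBatch` is not part of our evaluator.

```typescript
type FieldType = "string" | "number" | "boolean";

// Hypothetical simplified schema: required field name -> expected primitive type.
interface RowSchema {
  required: Record<string, FieldType>;
}

interface BatchReport {
  rowCount: number;
  completeRows: number;
  completeness: number; // fraction of rows with every required field present and typed correctly
}

function checkBatch(rows: unknown[], schema: RowSchema): BatchReport {
  let completeRows = 0;
  for (const row of rows) {
    if (typeof row !== "object" || row === null) continue;
    const record = row as Record<string, unknown>;
    const ok = Object.entries(schema.required).every(
      ([field, type]) => typeof record[field] === type,
    );
    if (ok) completeRows += 1;
  }
  const completeness = rows.length === 0 ? 0 : completeRows / rows.length;
  return { rowCount: rows.length, completeRows, completeness };
}
```

A check like this cannot tell fabricated data from real data; it only filters out structurally broken deliverables so the model-based step runs on plausible input.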

Where we are:

• ThoughtProofEvaluator contract deployed on Base (v1.3.0, ERC-8004 reputation integration)

• Event listener service built, waiting for the reference implementation ABI from the 8183 team to wire up

• ERC-8004 Agent registered: #28847 on Base

Happy to coordinate on integration once the base implementation contract is shared. The evaluator is designed to be pluggable — any ERC-8183 job contract that sets us as evaluator gets automatic multi-model verification.

Open/Funded split; “for interoperability”

The precedent in established ERCs relating to Solidity contract interfaces like ERC-20, ERC-1155, ERC-2535 centers such ERCs around the expected behavior of public functions in the specified interface. Internal mechanisms are only relevant to such specs as far as they affect the observable behavior of public functions. ERC-162 (ENS registrar) specifies internal state machine behavior, but it describes a canonical deployment rather than an interface intended for diverse implementations.

So I think the key question is: what aspects of the Open/Funded states are relevant to external contracts/apps that would want to integrate with diverse ERC-8183 implementations, rather than a specific one? More concretely, what would motivate diverse implementations of createJob, setProvider, setBudget, and fund that follow the same interface but differ in behavior?

Specifying state machine behavior means that implementations that don’t adhere to it technically aren’t ERC-8183-compliant, which is undesirable towards enabling a diverse ecosystem. Therefore I recommend leaving internal behavior as unspecified as possible while leaving any desirable points of integration possible. Any particular implementation can specify state machine behavior more specifically, but such specification doesn’t necessarily belong in the ERC.

Rejected being terminal

The same concern about overspecifying internal behavior applies here. Commitment models can be implemented in the evaluator/arbiter part of the system - see Alkahest’s ExclusiveUnrevocableConfirmationArbiter. The idea for your stated model (final commitment from a provider) would be that Provider A commits to the evaluator/arbiter, which rejects any future submissions from them.

There are many possible commitment models, and I don’t think a general-purpose conditional escrow ERC should enforce any specific one, though models can be enforced by implementations, and this is ideally abstracted in the interface so that implementations of a commitment model can be reused across contexts (a “microcondition” as I described above). Some contexts may want switching to Provider B after rejecting Provider A to require renegotiation etc., and others may not.

Alkahest alignment

My minimal proposal for a general purpose escrow ERC, from the direction of Alkahest, would be just the IArbiter interface defining a condition for escrow release:

interface IArbiter {
    function checkObligation(
        Attestation memory obligation, // fulfillment data; could be custom format, bytes32, or bytes
        bytes memory demand, // demand data; arbiter type can be thought of as IArbiter<DemandData>
        bytes32 fulfilling // escrow/job uid
    ) external view returns (bool);
}

Alkahest chooses to depend strongly on EAS Attestations as the obligation data format because its fields align very nicely with what a proof of fulfillment wants, but bytes or bytes32 would be more generic. It may also be worth considering generalizing returns (bool) to bytes or bytes32 (mirroring the reason attestation hash in complete/reject).

The “evaluator” mechanism as specified and the commitment models we’ve been describing can all be straightforwardly implemented as IArbiter instances (see TrustedOracleArbiter for single evaluator approval).
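
To show the shape of that claim without Solidity, here is a TypeScript model of the pattern. It is an illustration only: the `Obligation` record is heavily simplified, the demand is reduced to the trusted oracle's address, and neither class reproduces Alkahest's actual contracts.

```typescript
// Simplified stand-in for the obligation data.
interface Obligation {
  uid: string;
  attester: string;
  data: string;
}

// TypeScript mirror of the IArbiter shape.
interface Arbiter {
  checkObligation(obligation: Obligation, demand: string, fulfilling: string): boolean;
}

// Single-evaluator approval: the oracle named in the demand must have approved the obligation.
class TrustedOracleArbiterModel implements Arbiter {
  private approvals = new Set<string>();

  arbitrate(oracle: string, obligationUid: string, approved: boolean): void {
    const key = oracle + ":" + obligationUid;
    if (approved) this.approvals.add(key);
    else this.approvals.delete(key);
  }

  // In this sketch the demand is simply the trusted oracle's address.
  checkObligation(obligation: Obligation, demand: string, _fulfilling: string): boolean {
    return this.approvals.has(demand + ":" + obligation.uid);
  }
}

// Hypothetical composition: every listed arbiter must accept, each with its own demand.
class AllArbiterModel implements Arbiter {
  private parts: { arbiter: Arbiter; demand: string }[];

  constructor(parts: { arbiter: Arbiter; demand: string }[]) {
    this.parts = parts;
  }

  checkObligation(obligation: Obligation, _demand: string, fulfilling: string): boolean {
    return this.parts.every((p) => p.arbiter.checkObligation(obligation, p.demand, fulfilling));
  }
}
```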

However, the IArbiter pattern conflicts somewhat in purpose with the role of hooks in the proposal draft. A smaller change to the proposal would be to remove many of the specified behaviors, and to instead implement them as hooks. It seems fairly straightforward to wrap any IArbiter implementation as an IACPHook in the evaluation phase. Though, since the IACPHook interface is much less structured than IArbiter, this may lead to less useful composability of hooks - where most hooks are written for and only usable with a specific application.

As I said above, I think the two most useful interfaces in an escrow ERC would be

  1. the process of conditionally promising a future on-chain action
  2. the conditions for such promises

and that these two should be specified as little as possible while still enabling composability between the two. IArbiter or IACPHooks at the evaluation step addresses 2., but 1. is harder to specify, especially without overspecifying and precluding potentially useful behaviors.

Alkahest has the abstract contracts BaseEscrowObligation and BaseEscrowObligationTierable, which define the virtual functions

    // Called when escrow is created
    function _lockEscrow(bytes memory data, address from) internal virtual;

    // Called when escrow is collected (after successful fulfillment check)
    function _releaseEscrow(
        bytes memory escrowData,
        address to,
        bytes32 fulfillmentUid
    ) internal virtual returns (bytes memory result);

    // Called when escrow expires and is reclaimed
    function _returnEscrow(bytes memory data, address to) internal virtual;

    // Extract arbiter and demand from encoded data
    function extractArbiterAndDemand(
        bytes memory data
    ) public pure virtual returns (address arbiter, bytes memory demand);

but there are difficult decisions in what the shared behavior of escrow contracts should be (we publish 2 variants, but they aren’t exhaustive).

Possibly, escrow contract behavior isn’t useful to standardize into an interface at all. Alkahest’s final contracts have wrappers around the generic bytes-parameter escrow functions which accept more specific data types, and we’ve never encountered a case so far where the generic “raw” version has been needed by consumers over these wrappers. The extractArbiterAndDemand function is useful though, since in this architecture escrows are associated with some demand data, and it’s useful to be able to generically extract this from escrows.

Standardizing just the arbitration interface and some aspects of the escrow data format would be enough for many different escrow processes to share and compose arbitration mechanisms, which I think is the main potential of interoperability for escrow. I’m curious whether you see interoperability scenarios that require more lifecycle standardization than that.


Strong proposal. The minimal escrow + evaluator-attested settlement model is clean, and the hook approach also makes sense.

One observation from the ERC-8001 perspective: ERC-8183 is currently a single evaluator gateway per job at the protocol boundary, since only one evaluator address may complete/reject after submission. However, because the evaluator may be a smart contract, this can already be extended into a multi-party decision model without changing the core standard.

That creates a natural synergy with ERC-8001:

  • ERC-8001 can serve as the coordination / consent layer

  • ERC-8183 can serve as the escrow / settlement layer

In practice, an ERC-8001-aware evaluator contract could require the relevant parties to accept an intent before calling complete(...) or reject(...) on the ERC-8183 job.

This would be a very strong compositional story:

  • ERC-8001 = coordination

  • ERC-8183 = settlement

  • ERC-8004 = reputation

I think the spec would benefit from explicitly mentioning evaluator contracts as coordination gateways, and perhaps adding a short non-normative example of ERC-8001-backed completion/rejection.
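
As a rough idea of what such a non-normative example might convey, here is a TypeScript model of a consent-gated evaluator. It is a sketch only: `ConsentGatedEvaluator` and its methods are hypothetical and do not reproduce ERC-8001's actual interface or message formats.

```typescript
// Hypothetical evaluator that finalizes only after every required party has accepted.
class ConsentGatedEvaluator {
  private required: Set<string>;
  private accepted = new Set<string>();

  constructor(requiredParties: string[]) {
    if (requiredParties.length === 0) throw new Error("at least one party required");
    this.required = new Set(requiredParties);
  }

  // Record an acceptance; only listed parties may accept.
  accept(party: string): void {
    if (!this.required.has(party)) throw new Error("not a required party");
    this.accepted.add(party);
  }

  isReady(): boolean {
    // accepted is always a subset of required, so equal sizes means all accepted.
    return this.accepted.size === this.required.size;
  }

  // settle stands in for the call to complete(...) or reject(...) on the job.
  finalize(settle: () => void): void {
    if (!this.isReady()) throw new Error("missing acceptances");
    settle();
  }
}
```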

Interesting. Maybe we can open a PR against the erc-8183/base-contracts repository on GitHub (the base contracts for the ERC-8183 implementation) to include an example of ERC-8001-backed completion/rejection?

Yes, that makes sense. I’ll fork the repo and provide a PR.


The thread has covered one-shot jobs thoroughly. One pattern that seems underexplored: recurring agentic services.

Many real agent workflows aren’t single deliverables - they’re ongoing: weekly reports, continuous monitoring, periodic data feeds. Modeling these as repeated manual job creations puts the renewal burden on the client and breaks automation.

We’re proposing ERC-8191 - Onchain Recurring Payments - which addresses exactly this. It defines a SubscriptionManager (create/cancel/expire lifecycle) and a KeeperRegistry (any permissioned keeper collects payments at each interval, pull-based). No oracles, no custody.

The natural integration with ERC-8183 is the hook system. A hook on fund could automatically open an ERC-8191 subscription - so a recurring agentic relationship becomes a recurring billing relationship by default, without client-side renewal logic.

The inverse is also interesting: an ERC-8191 subscription expiry could trigger an ERC-8183 job rejection, keeping both lifecycle states in sync.

Please check:

Happy to sketch a concrete hook interface if useful.

Interesting extension — recurring agent relationships surface a decision verification question that one-shot jobs don’t have.

On a one-shot job, the evaluator checks the deliverable at completion. On a recurring service, the harder question is whether continuation is still justified on each cycle.

A beforeKeep hook in the SubscriptionManager that runs before each pull payment could check:

    const result = await verify(renewalContext, {
      claim: 'Is this renewal still justified given current usage, alternatives, and quality?',
      stakeLevel: 'medium',
      classifyMateriality: true,
    });

    if (result.materiality.hasMaterialDefect) {
      return { abort: true, reason: result.synthesis };
    }

We built something similar as a beforeSettle hook for x402 facilitators (same pattern, different trigger point): ThoughtProof/pot-x402 on GitHub, a decision verification hook for x402 payments.

The KeeperRegistry + period verification could be a clean integration surface.

Yes, we would be keen to see how we can work together on the hook!

This is our github

Shipped a TypeScript implementation of decision verification as an ERC-8183 hook: verifyJobCompletion() runs pot-sdk before complete() executes.

Also added verifyJobRejection() with a higher confidence threshold — because reject() is terminal in ERC-8183 and should require stronger justification than completion.

Function selectors verified against AgenticCommerce.sol:

• complete(uint256,bytes32,bytes) → 0xd75bbdf3

• reject(uint256,bytes32,bytes) → 0x41dd26f5

Code: src/erc8183 on the main branch of ThoughtProof/pot-x402 on GitHub

Great, looked at the repo: the BaseACPHook structure with _preFund, _postComplete etc. makes the integration very clean.

The hook I have in mind: a RecurringPaymentHook that on _preFund opens an ERC-8191 subscription (keeper-gated, pull-based), and on _postComplete/_postReject cancels it. This way a funded agentic job automatically becomes a recurring billing relationship for as long as the job is active, with no client-side renewal logic.
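
To make the lifecycle concrete, here is a TypeScript model of that hook. It is a sketch under stated assumptions: the `SubscriptionManager` interface with `open` and `cancel` is hypothetical and not the ERC-8191 ABI, and the method names only mirror the `_preFund`, `_postComplete`, and `_postReject` callbacks mentioned above.

```typescript
// Hypothetical view of a subscription manager; not the ERC-8191 interface.
interface SubscriptionManager {
  open(jobId: number, payer: string, amountPerPeriod: number): number; // returns a subscription id
  cancel(subscriptionId: number): void;
}

class RecurringPaymentHookModel {
  private subscriptions = new Map<number, number>(); // jobId -> subscriptionId
  private manager: SubscriptionManager;

  constructor(manager: SubscriptionManager) {
    this.manager = manager;
  }

  // Runs before funding: open one subscription per job.
  preFund(jobId: number, payer: string, amountPerPeriod: number): void {
    if (this.subscriptions.has(jobId)) throw new Error("subscription already open");
    this.subscriptions.set(jobId, this.manager.open(jobId, payer, amountPerPeriod));
  }

  // Both terminal outcomes close the subscription.
  postComplete(jobId: number): void {
    this.close(jobId);
  }

  postReject(jobId: number): void {
    this.close(jobId);
  }

  private close(jobId: number): void {
    const subscriptionId = this.subscriptions.get(jobId);
    if (subscriptionId === undefined) return; // nothing to cancel
    this.manager.cancel(subscriptionId);
    this.subscriptions.delete(jobId);
  }
}
```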

Happy to open a PR against erc-8183/hook-contracts with a draft implementation; or if you prefer to start from an issue to align on the interface first, that works too. What’s your preferred flow?

The continuation verification framing is exactly right: it’s a problem one-shot jobs don’t surface. An evaluator checks a deliverable at completion; on a recurring service the question is whether the relationship itself still has value at each cycle, which is structurally different.

The KeeperRegistry + period verification surface you mention is the natural place for this. The current design already allows keeper implementations to run arbitrary logic before calling collectPayment – a keeper is implicitly a hook. The open design question is whether SubscriptionManager should expose a formal beforeKeep interface at the standard level, or whether this is intentionally a keeper-layer concern left to implementations.

My instinct is to keep the base standard minimal and let keepers like yours implement verification internally, but a beforeKeep interface in the standard would make continuation checks composable and discoverable across implementations, which has real value.

The pot-x402 pattern maps directly: same middleware idea, different trigger point, as you said. Would it make sense to define an IKeeperHook interface as a companion spec, separate from the core ERC-8191, so the base remains minimal but the pattern is standardized?

The companion spec approach makes sense — keeping the base standard minimal while standardizing the hook pattern separately.

IKeeperHook with a beforeKeep(subscriptionId, cycle) interface would let any keeper implementation plug in verification logic without touching core ERC-8191. The pattern matches what we built for x402 (beforeSettle) and ERC-8183 (beforeAction) — same concept, different trigger.
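
As a starting point for discussion, the interface could look roughly like this in TypeScript. This is a sketch, not a proposed spec: only `beforeKeep(subscriptionId, cycle)` and `collectPayment` come from this thread, while the `KeepDecision` shape and the `keep` function are assumptions, and a real hook would likely be asynchronous.

```typescript
interface KeepDecision {
  abort: boolean;
  reason?: string;
}

// Companion-spec style hook: consulted before each pull payment.
interface KeeperHook {
  beforeKeep(subscriptionId: number, cycle: number): KeepDecision;
}

// Hypothetical keeper step: run hooks in order, stop at the first abort,
// and collect the payment only if every hook lets it through.
function keep(
  subscriptionId: number,
  cycle: number,
  hooks: KeeperHook[],
  collectPayment: (subscriptionId: number) => void,
): KeepDecision {
  for (const hook of hooks) {
    const decision = hook.beforeKeep(subscriptionId, cycle);
    if (decision.abort) return decision;
  }
  collectPayment(subscriptionId);
  return { abort: false };
}
```

Keeping the hook outside the subscription manager means core ERC-8191 stays untouched, while keepers that opt in share one discoverable entry point.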

Happy to contribute to drafting an IKeeperHook spec if you’re open to it. We have working implementations for the other two hook interfaces that could inform the design.