ERC-8183: Agentic Commerce

Bakugo32 · April 27, 2026, 11:06pm

@pablocactus the INSUFFICIENT confidence → complete() call on job #3 highlights a real protocol gap: there’s no way to signal “I cannot assess this job” without forcing a verdict or accepting a slash on expiry.

Worth discussing as a community: does this belong in the spec (a third evaluator action such as abstain(), re-assignment, or partial slash), or is confidence-based routing in evaluator middleware the right and sufficient fix? Curious whether abstain() would have been the right call on job #3, or whether complete() was correct and the fix belongs in middleware.

termixai · April 28, 2026, 1:00am

We propose our new business agreement based on 8183: TermiX-official/aacp-whitepaper (Agent Autonomous Commerce Protocol) on GitHub.

pablocactus · April 28, 2026, 2:16pm

@Bakugo32 - thanks for the question. The binary terminal states aren’t a gap, they’re a feature. Five reasons I’d argue against abstain() at the protocol level.

1. The integrator loses information, not gains it. If Job #3 had ended in abstain(), the integrator would have received a job in a third terminal state with no verdict, no grade, and no reasoning to act on. As it stands, they got a complete() with an INSUFFICIENT confidence flag and a hashed rationale on-chain. That’s strictly more useful: a verdict to act on, plus honest metadata about how much weight to put on it. The binary forces the verdict + metadata pattern, which is more informative than a protocol-level “we abstained, figure it out.”

2. The protocol absorbing methodology gaps removes the pressure to close them. AHM’s INSUFFICIENT case on Job #3 forced a public reckoning: either fudge the verdict or own the gap and commit to fixing it. We chose to own it, and confidence-based routing is now the next build precisely because there was no easy out. With abstain() available, the same case becomes a one-line call with no public reasoning, no accountability, and no incentive to refine the underlying methodology. Every evaluator’s quality improves more slowly because the protocol has paved over the friction that drove them to improve.

3. “Cannot assess” is contextual, not absolute. A wallet AHM flags as INSUFFICIENT under D1/D2/D3 methodology might be perfectly assessable under a different methodology, such as output-quality scoring, social-graph signals, or behavioural fingerprinting. The protocol can’t and shouldn’t know which methodology applies. Baking one evaluator’s blind spots into a protocol terminal state is a layering violation; the right answer is for the market to choose evaluators whose methodology fits the case.

4. abstain() shifts the problem rather than solving it. Whether it triggers re-assignment (the next evaluator hits the same INSUFFICIENT case and abstains too, producing an infinite loop) or partial slash (which creates an economic incentive to abstain rather than commit, the same independence problem), the net effect is added cycles without resolution.

5. The middleware layer handles this more cleanly. Confidence-based routing is the next refinement of PR #112’s configurable policy infrastructure. INSUFFICIENT verdicts get routed to escrow rather than reject; integrators can configure their own confidence thresholds; methodologically sophisticated evaluators differentiate from less sophisticated ones in a way that’s visible and competitive. None of that requires protocol changes.

The legitimate concern in your post, that evaluators have no honest way to signal “I cannot assess” without forcing a verdict or eating an expiry slash, is real, but I’d argue it’s an evaluator-economics problem rather than a protocol-primitives problem. Softer market mechanisms (off-chain “no verdict” signals in evaluator profiles, integrator-side reputation tracking) can address it without giving evaluators a protocol-level off-ramp from doing hard work.

So my answer to your direct question: complete() was the right call on Job #3 and the fix belongs in evaluator middleware. The protocol stays clean, the methodology gap gets owned publicly, and the market does the differentiation work the protocol shouldn’t.
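A minimal sketch of the confidence-based routing described in point 5, as it might look on the integrator side. Only two things come from the thread: INSUFFICIENT verdicts route to escrow, and integrators set their own threshold. The other confidence levels, the `RoutingPolicy` type, and the `"release"` / `"escrow"` / `"reject"` action names are assumptions made for illustration, not part of ERC-8183 or of AHM's middleware.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    # Only INSUFFICIENT is named in the thread; the other levels are assumed.
    INSUFFICIENT = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class EvaluatorOutput:
    verdict: str            # "complete" or "reject": the two protocol-level outcomes
    confidence: Confidence  # off-chain metadata attached by the evaluator


@dataclass(frozen=True)
class RoutingPolicy:
    # Hypothetical integrator setting: minimum confidence needed to act on a verdict.
    min_confidence: Confidence = Confidence.MEDIUM


def route(output: EvaluatorOutput, policy: RoutingPolicy) -> str:
    """Return the integrator-side action for an evaluator output."""
    if output.verdict not in ("complete", "reject"):
        raise ValueError(f"unknown verdict: {output.verdict!r}")
    # Below the threshold, hold funds instead of acting on either verdict.
    if output.confidence < policy.min_confidence:
        return "escrow"
    return "release" if output.verdict == "complete" else "reject"
```

The protocol-level verdict stays binary here; the threshold only changes what the integrator does with it.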

pablocactus · April 28, 2026, 3:06pm

@Bakugo32 - Treasury proposal reads well. Three observations.

Charging on reject() is the right call. Tying the fee to complete() only would create exactly the financial incentive toward complete that erodes verdict independence, and INSUFFICIENT cases like Job #3 would feel that pull most strongly. Charging on both terminal states keeps the evaluator economics aligned with honest assessment regardless of outcome. The framing of fees as “cost of accessing the protocol’s evaluation infrastructure, not a success fee” is the right mental model.

80/20 split looks right for now, but worth revisiting once mainnet evaluator volume settles. The 0.02 USDC evaluator share on a 5 USDC job at 0.5% feeRate covers ~$0.01–0.05 gas, which works for the current Base Sepolia cost structure. Two things worth flagging for mainnet: (a) gas costs at peak Base congestion can spike well above $0.05, which would push small-budget jobs into evaluator-loss territory; (b) the evaluator share doesn’t yet account for the off-chain compute cost of running an evaluation pipeline (RPC calls, scoring computation, infrastructure, monitoring). For AHM that’s non-trivial.. running AHS scoring on a wallet involves ~30+ RPC calls plus Nansen enrichment. The fixed-fee-per-evaluation pre-mainnet target you mention is probably the right direction.
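The arithmetic behind the 0.02 USDC figure can be checked with a short sketch. It assumes amounts are held as integers in 6-decimal USDC base units and that rates are expressed in basis points; the function names are hypothetical, and the actual contract's rounding may differ.

```python
from typing import Tuple

BPS_DENOMINATOR = 10_000
EVALUATOR_SHARE_BPS = 8_000  # 80% of the fee to the evaluator, the rest to the protocol


def split_fee(budget_units: int, fee_rate_bps: int) -> Tuple[int, int]:
    """Return (evaluator_share, protocol_share) in USDC base units (6 decimals)."""
    fee = budget_units * fee_rate_bps // BPS_DENOMINATOR
    evaluator = fee * EVALUATOR_SHARE_BPS // BPS_DENOMINATOR
    return evaluator, fee - evaluator


def break_even_budget(gas_cost_units: int, fee_rate_bps: int) -> int:
    """Smallest budget whose evaluator share reaches gas_cost_units.

    Ceiling division; ignores rounding from the two integer divisions in split_fee.
    """
    numerator = gas_cost_units * BPS_DENOMINATOR * BPS_DENOMINATOR
    return -(-numerator // (fee_rate_bps * EVALUATOR_SHARE_BPS))


# 5 USDC job at 0.5% (50 bps): 0.025 USDC fee, 0.02 to the evaluator, 0.005 to the protocol.
print(split_fee(5_000_000, 50))       # (20000, 5000)
# Budget needed for the evaluator share to cover $0.05 of gas at the same fee rate.
print(break_even_budget(50_000, 50))  # 12500000, i.e. 12.5 USDC
```

At a 0.5% fee rate, a job has to be around 12.5 USDC before the evaluator share alone covers $0.05 of gas, which is the loss territory flagged in (a).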

MIN_BUDGET raise is sensible - without it, evaluators can’t reliably cover gas at low feeRate values, which would push them to refuse low-budget jobs and fragment the market. Suggest pegging MIN_BUDGET dynamically to a multiple of the gas estimate at current network conditions rather than a fixed USDC value, so it scales with mainnet gas dynamics rather than needing governance updates.
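One way to read the gas-pegged suggestion, as an off-chain sketch. The gas figure, the ETH/USDC price input, the multiple, and the function names are all assumptions; an on-chain version would also need a trusted price source, which this does not address.

```python
WEI_PER_ETH = 10**18


def gas_cost_usdc_units(gas_units: int, gas_price_wei: int, eth_price_usdc_units: int) -> int:
    """Estimated gas cost of one evaluation, in USDC base units (6 decimals)."""
    return gas_units * gas_price_wei * eth_price_usdc_units // WEI_PER_ETH


def dynamic_min_budget(
    gas_units: int,
    gas_price_wei: int,
    eth_price_usdc_units: int,
    multiple: int = 50,
    floor_units: int = 0,
) -> int:
    """MIN_BUDGET as a multiple of the current gas estimate, never below a fixed floor."""
    cost = gas_cost_usdc_units(gas_units, gas_price_wei, eth_price_usdc_units)
    return max(floor_units, multiple * cost)


# Hypothetical inputs: 100k gas, 0.1 gwei gas price, ETH at 2,000 USDC.
print(dynamic_min_budget(100_000, 10**8, 2_000 * 10**6))  # 1000000, i.e. 1 USDC
# Same job at 1 gwei: the minimum scales with gas, with no governance update.
print(dynamic_min_budget(100_000, 10**9, 2_000 * 10**6))  # 10000000, i.e. 10 USDC
```

The `floor_units` argument lets a fixed minimum coexist with the dynamic one, so a quiet network never drops MIN_BUDGET below a governance-set value.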

Treasury.sol design looks clean. buybackAndBurn() as governance-controlled stub on testnet with Aerodrome integration on mainnet is a reasonable phased approach. The BuybackExecuted(address indexed token, uint256 tokenSpent, uint256 vrtBurned) event signature is well-shaped for downstream consumers like agent reputation systems wanting to track protocol-level economic activity.

Happy to engage further on the INTERFACES.md detail if useful.

Bakugo32 · April 29, 2026, 5:05pm

@pablocactus argued better than I could have. Agreed on all five points. Protocol stays binary, the fix belongs in evaluator middleware. Closing this one.

Bakugo32 · April 29, 2026, 5:11pm

@pablocactus thanks for the detailed read. Three things we’re taking from this: MIN_BUDGET raised to 1 USDC for the upcoming deployment (fixed, governance-adjustable); dynamic gas-pegged MIN_BUDGET noted as a mainnet consideration; the off-chain compute cost point reinforces the fixed-fee-per-evaluation pre-mainnet target. We would welcome your eyes on INTERFACES.md before we finalize and will share the diff once drafted. We will implement once you’ve had a look.

pablocactus · April 29, 2026, 8:05pm

Appreciated, Bakugo. Glad the layered framing landed.

pablocactus · April 29, 2026, 8:06pm

Good call on the MIN_BUDGET raise: a fixed 1 USDC buys headroom until the dynamic gas-pegged approach settles into something more elegant for mainnet. Will dig into INTERFACES.md properly when you share the diff.

ThoughtProof · May 2, 2026, 7:58am

@pablocactus @Bakugo32 — catching up on the Job #3 / INSUFFICIENT-confidence thread.

I agree with the conclusion here: this belongs in evaluator middleware, not as a third ERC-8183 terminal state. The useful signal from Job #3 is that it cleanly separates two trust dimensions: AHM saw insufficient behavioral history because the wallet had no transaction record, but that does not make the agent’s reasoning/process dimension unevaluable.

That is what we’ve been calling Plan-Level Verification (PLV): checking the reasoning trace step by step — whether each step is evidence-supported, whether critical steps have provenance gaps, and whether the final action follows from the trace. The output is not another terminal state; it is evaluator metadata: per-step verdicts, confidence, and criticality.

So I’d frame PLV as a middleware-level signal that composes with AHM’s behavioral score for confidence-based routing. AHM answers “what do we know about this wallet/agent from behavior?” PLV answers “was this specific decision process supportable?” Both are useful, neither needs to change the ERC-8183 primitive.

If useful, I can run PLV over the Job #3 deliverable as a follow-up artifact. But I’d keep the protocol binary and put the richer semantics above it.
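To make the shape of that metadata concrete, here is a rough sketch of per-step audit records and one possible way to summarise them. This is not ThoughtProof's implementation: the field names, the aggregation rule, and the `"ALLOW"` / `"HOLD"` labels are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StepAudit:
    claim: str                # what the reasoning step asserts
    evidence_supported: bool  # is the step backed by cited evidence?
    provenance_gap: bool      # is the source of that evidence missing or unverifiable?
    critical: bool            # does the final action depend on this step?
    confidence: float         # auditor confidence in this per-step verdict, 0.0 to 1.0


def summarize(steps: List[StepAudit], action_follows_from_trace: bool) -> Dict[str, object]:
    """Collapse per-step audits into evaluator metadata (assumed aggregation rule)."""
    # A critical step blocks if it lacks evidence or its evidence has no provenance.
    blocking = [
        s.claim for s in steps
        if s.critical and (not s.evidence_supported or s.provenance_gap)
    ]
    decision = "ALLOW" if action_follows_from_trace and not blocking else "HOLD"
    return {
        "decision": decision,
        "blocking_steps": blocking,
        "min_confidence": min((s.confidence for s in steps), default=0.0),
    }
```

The result is metadata for a routing layer to consume alongside a behavioural score; nothing in it maps to a new ERC-8183 terminal state.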

pablocactus · May 2, 2026, 8:59am

@ThoughtProof - the substance lands. You’re right that AHM’s INSUFFICIENT signal on Job #3 reflected empty behavioural history, not unevaluable reasoning. Those are distinct dimensions of trust, and the ERC-8183 binary primitive shouldn’t be asked to carry either of them.

AHM’s architecture is built around composing multiple trust dimensions into a single evaluator output: D1 wallet hygiene, D2 behavioural patterns, D3 infrastructure health, D4 output quality (via AHM Verify’s adversarial pipeline). Reasoning-process verification, the territory PLV is targeting, is a dimension AHM doesn’t currently score, and your framing of it (per-step verdicts, evidence support, provenance gaps) articulates that dimension very well.

Where this goes from here is interesting. AHM’s roadmap could fold reasoning-process scoring in directly as a fifth dimension, or compose with external signals like PLV at the routing layer, or both. Worth understanding what PLV produces in practice before committing to either pattern.

The Job #3 deliverable is on-chain on Arc testnet (EvaluatorVerdict event, contract 0x754893efB1B173694Cd1C2DaDdE136021169ACc6). Happy for you to run PLV over it; it would be useful to see what reasoning-process signal surfaces on a case where AHM’s behavioural signal was deliberately bounded.

@Bakugo32 - same meta-question as last time, different middleware: should 8183 v2 document the composition pattern (multiple evaluator signals feeding confidence-based routing without modifying the binary primitive), or stay deliberately silent and let middleware-layer conventions emerge? The substantive question of whether any individual signal becomes a terminal state is settled; this is just about whether the spec acknowledges composition explicitly.

— Pablo / AHM

ThoughtProof · May 2, 2026, 10:45am

@pablocactus I did a first PLV pass over the public Job #3 reasoning trail and the Base Sepolia lifecycle.

What I can verify from the public/on-chain path:

ThoughtProof submitted Job #3 on Base Sepolia: 0x60ab143227d85b531bfd24ac0fbe7a24523e698f92631473b0017a524f258297
Deliverable hash: 0x48658db0d818937e667c6924e42bec80be21bab19c9beab5b0ebb4487d69dd97
AHM completed the job: 0x2a33b40e4dccba3bef4bd9223e59fa54b9fd77ccd160f7c5f0fc123fcb708200
Reason hash: 0xbe9c3ba2eca135824a330c89b78889dbe0588a365d217d966a929ed59bf50915

Preliminary PLV read: the reasoning path is internally faithful.

The key point is that the AHM result treated zero transaction history as a confidence boundary, not as adverse evidence. Given that premise, complete() is more faithful than reject(): rejecting on a verdict the evaluator itself marks as insufficient-confidence would be unsupported.

So the composition pattern looks clean:

AHM: behavioural score + confidence boundary
PLV: process-level audit of whether the evaluator handled that boundary faithfully
ERC-8183: still binary at the primitive layer

One caveat: I don’t want to overclaim the PLV artifact yet. I used the public thread + Base Sepolia lifecycle as the evidence trail. If you can point me to the exact verdict-job3.json / evidenceURI behind the reason hash, I’ll run the canonical PLV pass against that and post the structured result.

pablocactus · May 2, 2026, 2:48pm

@ThoughtProof - happy to provide. Here’s Job #3’s verdict artefact:

{
  "job": 3,
  "verdict": "complete",
  "score": 58,
  "grade": "D",
  "confidence": "INSUFFICIENT",
  "note": "Completed despite D grade — INSUFFICIENT confidence flag indicates zero transaction history, not adverse signals. Acting on an INSUFFICIENT verdict is not defensible. Confidence-based routing fix pending."
}

keccak256(file) == 0xbe9c3ba2eca135824a330c89b78889dbe0588a365d217d966a929ed59bf50915 - content-addressable to the on-chain reason, so you can verify the artefact matches what AHM committed at completion time.
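For anyone who wants to repeat that check without extra dependencies, below is a dependency-free sketch of Ethereum-style Keccak-256 plus a comparison helper. Two caveats: `hashlib.sha3_256` in the Python standard library is NIST SHA-3, which pads differently and gives a different digest, so it cannot be substituted; and the hash is over the exact bytes of the file, so the JSON as rendered in this thread may not be byte-identical to what was hashed. The helper name `matches_onchain_reason` is hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def _rol64(value: int, shift: int) -> int:
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def _keccak_f1600(state: bytearray) -> bytearray:
    """Keccak-f[1600] permutation over a 200-byte state (24 rounds)."""
    lanes = [[int.from_bytes(state[8 * (x + 5 * y):8 * (x + 5 * y) + 8], "little")
              for y in range(5)] for x in range(5)]
    lfsr = 1
    for _ in range(24):
        # theta: mix each column's parity into its neighbours
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate lanes and move them to new positions
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], _rol64(current, (t + 1) * (t + 2) // 2)
        # chi: non-linear step along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: round constant generated by an 8-bit LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8 * (x + 5 * y):8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return out


def keccak256(data: bytes) -> bytes:
    """Keccak-256 with the original 0x01 padding, as used by Ethereum."""
    rate = 136  # bytes: 1088-bit rate, 512-bit capacity
    state = bytearray(200)
    offset = 0
    block_size = 0
    while offset < len(data):
        block_size = min(len(data) - offset, rate)
        for i in range(block_size):
            state[i] ^= data[offset + i]
        offset += block_size
        if block_size == rate:
            state = _keccak_f1600(state)
            block_size = 0
    state[block_size] ^= 0x01
    state[rate - 1] ^= 0x80
    return bytes(_keccak_f1600(state)[:32])


def matches_onchain_reason(artifact: bytes, reason_hash_hex: str) -> bool:
    """Compare keccak256(artifact) with a 0x-prefixed or bare hex hash."""
    expected = reason_hash_hex.strip().lower()
    if expected.startswith("0x"):
        expected = expected[2:]
    return keccak256(artifact).hex() == expected
```

Usage would be along the lines of `matches_onchain_reason(open("verdict-job3.json", "rb").read(), "0xbe9c…")` with the full reason hash from the completion transaction, run against the original file rather than a copy pasted from the forum.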

Run the PLV pass and post the result. Happy for it to be a worked example of the composition pattern (AHM = signal + boundary, PLV = process audit, ERC-8183 = binary primitive at the protocol layer).

ThoughtProof · May 2, 2026, 4:07pm

Ran the PLV pass on Job #3.

Result: ALLOW. Cross-model review agreed.

PLV’s read: AHM’s INSUFFICIENT flag was handled correctly as an evidence-boundary, not as an adverse behavioral signal. On that basis, complete() was process-faithful; reject() would have acted on a verdict AHM itself marked as not defensible.

Artifact SHA-256: 3599a5cc80408874a169d6dab7abc4ff36142131ee456bcda95a08cbd8daafa4
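The published digest lets a reader confirm they are looking at the same PLV artifact. A small sketch using the standard library, assuming the artifact is available as a file or byte string (the thread does not say where it is hosted, and the function names are illustrative):

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def artifact_matches(data: bytes, published_hex: str) -> bool:
    """True if the bytes hash to the published digest (case-insensitive, optional 0x)."""
    published = published_hex.strip().lower()
    if published.startswith("0x"):
        published = published[2:]
    return sha256_hex(data) == published
```

As with the on-chain reason hash, the comparison only holds for the exact bytes that were hashed; re-serialised or reformatted JSON will produce a different digest.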

pablocactus · May 3, 2026, 12:55pm

Appreciated @ThoughtProof. “Evidence-boundary, not adverse behavioral signal” is the right phrasing, sharper than the working language AHM has been using. Worth standardising the language across implementations rather than letting each evaluator describe it differently.

The methodological choice itself was AHM’s call after @Bakugo32’s open question on whether limited wallet history should map to the same scoring treatment as poor wallet behavioural data. PLV’s concurrence on the Job #3 outcome confirms the call was operationally sound.

Composition pattern has a worked example on-chain. The natural place for this to inform v2 is wherever the spec documents middleware-layer evidence verification. Happy to draft if that helps, @Bakugo32.

ThoughtProof · May 3, 2026, 3:56pm

Appreciate this, @pablocactus.

Agreed: the key standard language is that limited evidence ≠ adverse evidence. AHM made the behavioral-confidence call; PLV only checked whether the evaluator reasoning preserved that boundary rather than overclaiming.

Happy to help turn Job #3 into a small v2 note/test vector for middleware-layer evidence verification.

Bakugo32 · May 4, 2026, 9:16am

@pablocactus on the composition question: informative, not normative. ERC-8183 v2 should document the AHM + PLV + ERC-8183 pattern as a worked example of how middleware layers compose against the binary primitive, not as a required implementation pattern. The protocol stays minimal; the spec gains a reference artifact that implementers can use or ignore. Job #3 is the cleanest worked case we have. If you and ThoughtProof are willing to draft that note, we’d welcome it as a v2 contribution and will facilitate the review process.

Bakugo32 · May 4, 2026, 9:29am

@pablocactus INTERFACES.md diff is ready. Key changes: EVALUATOR_SHARE_BPS = 8_000; treasury replacing feeRecipient; FeeDistributed event emitted on both complete() and reject(), with a fee flow table; a Treasury section with a BuybackQueued stub and a reserved BuybackExecuted signature. Available directly in the repo. Let us know if anything looks off before we write the Solidity.

pablocactus · May 4, 2026, 9:32am

Thanks @Bakugo32 - informative-not-normative is the right call. It keeps ERC-8183 minimal while giving implementers a worked reference for how middleware layers compose against the binary primitive.
@ThoughtProof let’s coordinate on a draft and come back to the thread when it’s ready for review.

Bakugo32 · May 4, 2026, 2:05pm

@ThoughtProof @pablocactus redeployment incoming (Base Sepolia)

We’re deploying a new set of contracts to Base Sepolia shortly. A few things that affect you directly:

What’s changing

Treasury.sol is now a standalone contract. It receives the protocol’s 20% fee share on every settled job (complete and reject).
Fee split : 80% evaluator / 20% treasury — both verdicts, not just complete(). The evaluator is compensated for evaluation work regardless of outcome. This closes the perverse incentive to always approve.
MIN_BUDGET raised from 0.01 USDC to 1 USDC — covers gas overhead on small jobs.
feeRecipient is now treasury throughout (state variable, events, governance functions).

All of this is already documented in INTERFACES.md — the Treasury section and the FeeDistributed event in particular.

What you need to do

New contract addresses will be posted here immediately after deployment. Once they’re live, you (or we, on your behalf) will need to restake your 100 VRT on the new EvaluatorRegistry address. The old contracts will still exist on-chain, but we won’t be creating new jobs there.

Governance delay

EvaluatorRegistry.executeJobManager() requires a 2-day timelock after deploy. Auto-assigned evaluator jobs will revert during that window — expected behavior, not a bug. We’ll post when the window closes and the protocol is fully operational.
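For client code that should avoid submitting auto-assigned jobs during the window, a small off-chain guard could look like the following. It assumes the 2-day delay is counted from the deployment timestamp, as described above; the function names and the example timestamp are hypothetical, and the contract's own timelock state remains the authority.

```python
from datetime import datetime, timedelta, timezone

TIMELOCK = timedelta(days=2)  # governance delay described in the post


def timelock_end(deployed_at: datetime) -> datetime:
    """Earliest moment the timelocked call can execute."""
    return deployed_at + TIMELOCK


def auto_assignment_active(now: datetime, deployed_at: datetime) -> bool:
    """False while auto-assigned evaluator jobs are expected to revert."""
    return now >= timelock_end(deployed_at)


# Hypothetical deployment time, for illustration only.
deployed = datetime(2026, 5, 4, 15, 0, tzinfo=timezone.utc)
print(timelock_end(deployed).isoformat())  # 2026-05-06T15:00:00+00:00
```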

Bakugo32 · May 4, 2026, 3:45pm

Redeployment complete — new contract addresses (Base Sepolia)

New contracts are live. Here are the addresses :

AgentJobManager : 0xC07CE789206CBEEC3A41D5CedBdA93B1024aaDdd
EvaluatorRegistry : 0x4F4aa58A715B6a5357da0EA067C405803f489BD1
Treasury : 0x3CB5FB6C2b986e9Aa545da958E77b847C6FF677D
ProtocolToken (VRT) : 0x4c4468567eE753d1b27Cf02b5896b4af71c40719
MockUSDC : 0xC87bde7b470e23Db5558fb4eE4073908dA21Bb0B

What changed :

Treasury.sol is now deployed and wired as the 20% fee destination
Fee distributed on both complete() and reject() — 80% evaluator / 20% treasury
MIN_BUDGET raised to 1 USDC
Full interface spec: https://github.com/Demsys/agent-settlement-protocol/blob/main/INTERFACES.md

@ThoughtProof @pablocactus the old EvaluatorRegistry is no longer in use. We’ll airdrop 100 VRT on the new ProtocolToken to both of you so you can restake. For information, full staking guide : https://github.com/Demsys/agent-settlement-protocol/blob/main/docs/EVALUATOR.md

Auto-assignment won’t be active until the 2-day governance delay clears (~2026-05-06).