ERC-8183: Agentic Commerce

@pablocactus the INSUFFICIENT confidence → complete() call on job #3 highlights a real protocol gap: there’s no way to signal “I cannot assess this job” without forcing a verdict or accepting a slash on expiry.

Worth discussing as a community: does this belong in the spec (a third evaluator action: abstain(), re-assignment, partial slash), or is confidence-based routing in evaluator middleware the right and sufficient fix? Curious whether abstain() would have been the right call on job #3, or whether complete() was correct and the fix belongs in middleware.

We propose our new business agreement based on 8183: TermiX-official/aacp-whitepaper (Agent Autonomous Commerce Protocol) on GitHub.

@Bakugo32 - thanks for the question. The binary terminal states aren’t a gap, they’re a feature. Five reasons I’d argue against abstain() at the protocol level.

1. The integrator loses information rather than gaining it. If Job #3 had ended in abstain(), the integrator would have received a job in a third terminal state with no verdict, no grade, and no reasoning to act on. As it stands, they got a complete() with an INSUFFICIENT confidence flag and a hashed rationale on-chain. That’s strictly more useful: a verdict to act on, plus honest metadata about how much weight to put on it. The binary forces the verdict + metadata pattern, which is more informative than a protocol-level “we abstained, figure it out.”

2. The protocol absorbing methodology gaps removes the pressure to close them. AHM’s INSUFFICIENT case on Job #3 forced a public reckoning: either fudge the verdict or own the gap and commit to fixing it. We chose to own it, and confidence-based routing is now the next build precisely because there was no easy out. With abstain() available, the same case becomes a one-line call with no public reasoning, no accountability, and no incentive to refine the underlying methodology. Every evaluator’s quality improves more slowly because the protocol has paved over the friction that drove them to improve.

3. “Cannot assess” is contextual, not absolute. A wallet AHM flags as INSUFFICIENT under D1/D2/D3 methodology might be perfectly assessable under a different methodology: output-quality scoring, social-graph signals, or behavioural fingerprinting. The protocol can’t and shouldn’t know which methodology applies. Baking one evaluator’s blind spots into a protocol terminal state is a layering violation; the right answer is for the market to choose evaluators whose methodology fits the case.

4. abstain() shifts the problem rather than solving it. If it triggers re-assignment, the next evaluator hits the same INSUFFICIENT case and abstains too, which can loop indefinitely. If it triggers a partial slash, it creates an economic incentive to abstain rather than commit, which is the same independence problem. Either way, the net effect is added cycles without resolution.

5. The middleware layer handles this more cleanly. Confidence-based routing is the next refinement of PR #112’s configurable policy infrastructure. INSUFFICIENT verdicts get routed to escrow rather than reject; integrators can configure their own confidence thresholds; methodologically sophisticated evaluators differentiate from less sophisticated ones in a way that’s visible and competitive. None of that requires protocol changes.
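To make point 5 concrete, here is a minimal sketch of what integrator-side confidence routing could look like. It is an illustration under stated assumptions, not the PR #112 implementation: the `Confidence` levels other than INSUFFICIENT, the `RoutingPolicy` type, and the `release` / `refund` / `escrow` action names are all hypothetical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Confidence(IntEnum):
    # Only INSUFFICIENT appears in this thread; LOW and HIGH are assumed levels.
    INSUFFICIENT = 0
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class Verdict:
    job_id: int
    outcome: str            # "complete" or "reject": the two binary terminal states
    confidence: Confidence  # evaluator metadata carried alongside the verdict


@dataclass(frozen=True)
class RoutingPolicy:
    # Integrator-chosen threshold; verdicts below it are held rather than acted on.
    min_confidence: Confidence = Confidence.LOW


def route(verdict: Verdict, policy: RoutingPolicy) -> str:
    """Decide what the integrator does with a verdict, off-chain and above the protocol."""
    if verdict.outcome not in ("complete", "reject"):
        raise ValueError(f"unknown outcome: {verdict.outcome!r}")
    if verdict.confidence < policy.min_confidence:
        return "escrow"  # hold funds for review instead of acting on a weak verdict
    return "release" if verdict.outcome == "complete" else "refund"


job3 = Verdict(job_id=3, outcome="complete", confidence=Confidence.INSUFFICIENT)
print(route(job3, RoutingPolicy()))  # escrow
```

The on-chain state stays binary in this sketch; only the integrator's reaction to the verdict depends on the confidence metadata.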

The legitimate concern in your post, that evaluators have no honest way to signal “I cannot assess” without forcing a verdict or eating an expiry slash, is real, but I’d argue it’s an evaluator-economics problem rather than a protocol-primitives problem. Softer market mechanisms (off-chain “no verdict” signals in evaluator profiles, integrator-side reputation tracking) can address it without giving evaluators a protocol-level off-ramp from doing hard work.

So my answer to your direct question: complete() was the right call on Job #3 and the fix belongs in evaluator middleware. The protocol stays clean, the methodology gap gets owned publicly, and the market does the differentiation work the protocol shouldn’t.

@Bakugo32 - Treasury proposal reads well. Three observations.

Charging on reject() is the right call. Tying the fee to complete() only would create exactly the financial incentive toward complete that erodes verdict independence, and INSUFFICIENT cases like Job #3 would feel that pull most strongly. Charging on both terminal states keeps the evaluator economics aligned with honest assessment regardless of outcome. The framing of fees as “cost of accessing the protocol’s evaluation infrastructure, not a success fee” is the right mental model.

80/20 split looks right for now, but worth revisiting once mainnet evaluator volume settles. The 0.02 USDC evaluator share on a 5 USDC job at 0.5% feeRate covers ~$0.01–0.05 gas, which works for the current Base Sepolia cost structure. Two things worth flagging for mainnet: (a) gas costs at peak Base congestion can spike well above $0.05, which would push small-budget jobs into evaluator-loss territory; (b) the evaluator share doesn’t yet account for the off-chain compute cost of running an evaluation pipeline (RPC calls, scoring computation, infrastructure, monitoring). For AHM that’s non-trivial: running AHS scoring on a wallet involves ~30+ RPC calls plus Nansen enrichment. The fixed-fee-per-evaluation pre-mainnet target you mention is probably the right direction.
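The arithmetic behind those figures can be checked with a short sketch. The assumptions here are mine: that the 80% side of the split goes to the evaluator (inferred from the 0.02 USDC figure), that feeRate is expressed in basis points, and that gas is priced in USDC at $1. The function name `split_fee` is hypothetical.

```python
USDC = 10**6  # USDC has 6 decimals, so amounts are integer micro-USDC


def split_fee(budget: int, fee_rate_bps: int, evaluator_pct: int = 80):
    """Return (evaluator_share, treasury_share) in micro-USDC for one job."""
    fee = budget * fee_rate_bps // 10_000
    evaluator = fee * evaluator_pct // 100
    return evaluator, fee - evaluator


# 5 USDC job at 0.5% feeRate (50 basis points)
evaluator, treasury = split_fee(5 * USDC, fee_rate_bps=50)
print(evaluator / USDC, treasury / USDC)  # 0.02 0.005

# Evaluator margin at both ends of the quoted gas range (assumed micro-USDC gas cost)
for gas in (10_000, 50_000):  # $0.01 and $0.05
    print(f"gas {gas / USDC}: margin {(evaluator - gas) / USDC}")
```

Under these assumptions the 0.02 USDC share clears gas at $0.01 but falls $0.03 short at $0.05, which is the small-budget loss scenario flagged in (a).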

MIN_BUDGET raise is sensible: without it, evaluators can’t reliably cover gas at low feeRate values, which would push them to refuse low-budget jobs and fragment the market. Suggest pegging MIN_BUDGET dynamically to a multiple of the gas estimate at current network conditions rather than a fixed USDC value, so it scales with mainnet gas dynamics rather than needing governance updates.
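As a rough sketch of the gas-pegged idea, the break-even budget can be derived from the requirement that the evaluator share covers some multiple of the gas estimate: budget × feeRate × evaluator share ≥ multiple × gas. The function name, its parameters, and the 80% evaluator share are assumptions for illustration, and integer rounding dust is ignored.

```python
def min_budget(gas_estimate: int, fee_rate_bps: int,
               evaluator_pct: int = 80, safety_multiple: int = 1) -> int:
    """Smallest budget (micro-USDC) whose evaluator share covers safety_multiple x gas.

    Solves budget * fee_rate_bps / 10_000 * evaluator_pct / 100 >= target for budget.
    """
    target = gas_estimate * safety_multiple
    numerator = target * 10_000 * 100
    denominator = fee_rate_bps * evaluator_pct
    return -(-numerator // denominator)  # ceiling division on integers


# Gas estimate of 0.02 USDC at a 0.5% feeRate: break-even budget in micro-USDC
print(min_budget(gas_estimate=20_000, fee_rate_bps=50))  # 5000000

# Same inputs with a 2x safety margin over the gas estimate
print(min_budget(gas_estimate=20_000, fee_rate_bps=50, safety_multiple=2))  # 10000000
```

At a 0.5% feeRate and an 80% share, break-even works out to 250 times the gas estimate, so the multiple chosen in practice would depend heavily on where feeRate lands.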

Treasury.sol design looks clean. buybackAndBurn() as governance-controlled stub on testnet with Aerodrome integration on mainnet is a reasonable phased approach. The BuybackExecuted(address indexed token, uint256 tokenSpent, uint256 vrtBurned) event signature is well-shaped for downstream consumers like agent reputation systems wanting to track protocol-level economic activity.
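For downstream consumers, decoding that event from a raw log is straightforward under the standard ABI layout: the indexed `token` sits in the second topic, and the two `uint256` values are packed into 64 bytes of data. The sketch below assumes that layout; the helper name and the sample values are hypothetical, and the first topic is a placeholder rather than the real signature hash.

```python
def decode_buyback_executed(topics: list, data: bytes) -> dict:
    """Decode one BuybackExecuted log from raw 32-byte topics and the data payload."""
    if len(topics) != 2 or len(data) != 64:
        raise ValueError("unexpected log shape for BuybackExecuted")
    return {
        # Indexed address: the last 20 bytes of the left-padded second topic
        "token": "0x" + topics[1][-20:].hex(),
        # Non-indexed uint256 values: big-endian 32-byte words in declaration order
        "tokenSpent": int.from_bytes(data[:32], "big"),
        "vrtBurned": int.from_bytes(data[32:], "big"),
    }


# Hypothetical log values, for illustration only
topic0 = bytes(32)  # placeholder for keccak256("BuybackExecuted(address,uint256,uint256)")
token_topic = bytes.fromhex("00" * 12 + "11" * 20)
payload = (25_000).to_bytes(32, "big") + (10**18).to_bytes(32, "big")

print(decode_buyback_executed([topic0, token_topic], payload))
```

A production consumer would also check the first topic against the event signature hash before decoding.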

Happy to engage further on the INTERFACES.md detail if useful.


@pablocactus argued better than I could have. Agreed on all five points. Protocol stays binary, the fix belongs in evaluator middleware. Closing this one.

@pablocactus thanks for the detailed read. Three things we’re taking from this: MIN_BUDGET raised to 1 USDC for the upcoming deployment (fixed, governance-adjustable); dynamic gas-pegged MIN_BUDGET noted as a mainnet consideration; and the off-chain compute cost point reinforces the fixed-fee-per-evaluation pre-mainnet target. Would welcome your eyes on INTERFACES.md before we finalize; we will share the diff once drafted and implement once you’ve had a look.

Appreciated, Bakugo. Glad the layered framing landed.

Good call on the MIN_BUDGET raise. 1 USDC fixed buys headroom for the gas dynamic to settle into something more elegant for mainnet. Will dig into INTERFACES.md properly when you share the diff.

@pablocactus @Bakugo32 — catching up on the Job #3 / INSUFFICIENT-confidence thread.

I agree with the conclusion here: this belongs in evaluator middleware, not as a third ERC-8183 terminal state. The useful signal from Job #3 is that it cleanly separates two trust dimensions: AHM saw insufficient behavioral history because the wallet had no transaction record, but that does not make the agent’s reasoning/process dimension unevaluable.

That is what we’ve been calling Plan-Level Verification (PLV): checking the reasoning trace step by step — whether each step is evidence-supported, whether critical steps have provenance gaps, and whether the final action follows from the trace. The output is not another terminal state; it is evaluator metadata: per-step verdicts, confidence, and criticality.
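To show the shape of that metadata, here is a minimal sketch of per-step verdicts collapsed into a summary. This is not PLV's actual schema: the field names, the 0.0 to 1.0 confidence scale, the `ALLOW` / `BLOCK` labels, and the aggregation rule are all assumptions chosen to mirror the three checks described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepVerdict:
    step: str
    supported: bool        # is the step backed by evidence?
    critical: bool         # does the final action depend on this step?
    has_provenance: bool   # can the evidence be traced to a source?
    confidence: float      # assumed scale: 0.0 to 1.0


def summarize(steps, final_action_follows: bool) -> dict:
    """Collapse per-step verdicts into evaluator metadata (not a terminal state)."""
    critical_gaps = [s.step for s in steps
                     if s.critical and not (s.supported and s.has_provenance)]
    ok = final_action_follows and not critical_gaps
    return {
        "result": "ALLOW" if ok else "BLOCK",
        "critical_gaps": critical_gaps,
        "min_confidence": min((s.confidence for s in steps), default=0.0),
    }


# Hypothetical trace, for illustration only
trace = [
    StepVerdict("read wallet history", True, True, True, 0.9),
    StepVerdict("treat empty history as a boundary", True, True, True, 0.8),
    StepVerdict("note pending routing fix", True, False, False, 0.6),
]
print(summarize(trace, final_action_follows=True))
```

A non-critical step with a provenance gap lowers the reported confidence in this sketch but does not change the result; only gaps on critical steps do.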

So I’d frame PLV as a middleware-level signal that composes with AHM’s behavioral score for confidence-based routing. AHM answers “what do we know about this wallet/agent from behavior?” PLV answers “was this specific decision process supportable?” Both are useful, neither needs to change the ERC-8183 primitive.

If useful, I can run PLV over the Job #3 deliverable as a follow-up artifact. But I’d keep the protocol binary and put the richer semantics above it.

@ThoughtProof - the substance lands. You’re right that AHM’s INSUFFICIENT signal on Job #3 reflected empty behavioural history, not unevaluable reasoning. Those are distinct dimensions of trust, and the ERC-8183 binary primitive shouldn’t be asked to carry either of them.

AHM’s architecture is built around composing multiple trust dimensions into a single evaluator output: D1 wallet hygiene, D2 behavioural patterns, D3 infrastructure health, D4 output quality (via AHM Verify’s adversarial pipeline). Reasoning-process verification, the territory PLV is targeting, is a dimension AHM doesn’t currently score, and your framing of it (per-step verdicts, evidence support, provenance gaps) articulates that dimension very well.

Where this goes from here is interesting. AHM’s roadmap could fold reasoning-process scoring in directly as a fifth dimension, or compose with external signals like PLV at the routing layer, or both. Worth understanding what PLV produces in practice before committing to either pattern.

The Job #3 deliverable is on-chain on Arc testnet (EvaluatorVerdict event, contract 0x754893efB1B173694Cd1C2DaDdE136021169ACc6). Happy for you to run PLV over it; it would be useful to see what reasoning-process signal surfaces on a case where AHM’s behavioural signal was deliberately bounded.

@Bakugo32 - same meta-question as last time, different middleware: should 8183 v2 document the composition pattern (multiple evaluator signals feeding confidence-based routing without modifying the binary primitive), or stay deliberately silent and let middleware-layer conventions emerge? The substantive question of whether any individual signal becomes a terminal state is settled; this is just about whether the spec acknowledges composition explicitly.
— Pablo / AHM

@pablocactus I did a first PLV pass over the public Job #3 reasoning trail and the Base Sepolia lifecycle.

What I can verify from the public/on-chain path:

  • ThoughtProof submitted Job #3 on Base Sepolia:
    0x60ab143227d85b531bfd24ac0fbe7a24523e698f92631473b0017a524f258297
  • deliverable hash:
    0x48658db0d818937e667c6924e42bec80be21bab19c9beab5b0ebb4487d69dd97
  • AHM completed the job:
    0x2a33b40e4dccba3bef4bd9223e59fa54b9fd77ccd160f7c5f0fc123fcb708200
  • reason hash:
    0xbe9c3ba2eca135824a330c89b78889dbe0588a365d217d966a929ed59bf50915

Preliminary PLV read: the reasoning path is internally faithful.

The key point is that the AHM result treated zero transaction history as a confidence boundary, not as adverse evidence. Given that premise, complete() is more faithful than reject(): rejecting on a verdict the evaluator itself marks as insufficient-confidence would be unsupported.

So the composition pattern looks clean:

AHM: behavioural score + confidence boundary
PLV: process-level audit of whether the evaluator handled that boundary faithfully
ERC-8183: still binary at the primitive layer

One caveat: I don’t want to overclaim the PLV artifact yet. I used the public thread + Base Sepolia lifecycle as the evidence trail. If you can point me to the exact verdict-job3.json / evidenceURI behind the reason hash, I’ll run the canonical PLV pass against that and post the structured result.

@ThoughtProof - happy to provide. Here’s Job #3’s verdict artefact:

```json
{
  "job": 3,
  "verdict": "complete",
  "score": 58,
  "grade": "D",
  "confidence": "INSUFFICIENT",
  "note": "Completed despite D grade — INSUFFICIENT confidence flag indicates zero transaction history, not adverse signals. Acting on an INSUFFICIENT verdict is not defensible. Confidence-based routing fix pending."
}
```

keccak256(file) == 0xbe9c3ba2eca135824a330c89b78889dbe0588a365d217d966a929ed59bf50915. The file is content-addressable to the on-chain reason hash, so you can verify that the artefact matches what AHM committed at completion time.
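For anyone wanting to repeat the check, here is a self-contained sketch. Python's `hashlib.sha3_256` is not the same function as Ethereum's keccak256 (the padding byte differs), so the example carries a compact pure-Python Keccak-256; in practice a maintained library such as pycryptodome or eth-hash would be the safer choice. The `matches_commitment` helper and the sample bytes are hypothetical, and the hash must be taken over the exact file bytes, not over a re-serialized copy of the JSON.

```python
import json

MASK64 = (1 << 64) - 1


def _rol64(value: int, shift: int) -> int:
    shift %= 64
    return ((value << shift) | (value >> (64 - shift))) & MASK64


def _keccak_f1600(state: bytearray) -> bytearray:
    """Keccak-f[1600] permutation over a 200-byte state."""
    lanes = [[int.from_bytes(state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8], "little")
              for y in range(5)] for x in range(5)]
    lfsr = 1
    for _ in range(24):
        # theta: mix each column parity into neighbouring columns
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi: rotate lanes and move them to new positions
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], _rol64(current, (t + 1) * (t + 2) // 2)
        # chi: non-linear step along each row
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: round constant generated by an 8-bit LFSR
        for j in range(7):
            lfsr = ((lfsr << 1) ^ ((lfsr >> 7) * 0x71)) % 256
            if lfsr & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    out = bytearray(200)
    for x in range(5):
        for y in range(5):
            out[8 * (x + 5 * y): 8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return out


def keccak256(data: bytes) -> bytes:
    """Ethereum-style Keccak-256 (padding byte 0x01, where SHA3-256 uses 0x06)."""
    rate = 136  # bytes: 1088-bit rate, 512-bit capacity
    padded = bytearray(data)
    padded += b"\x01" + b"\x00" * (rate - len(data) % rate - 1)
    padded[-1] ^= 0x80
    state = bytearray(200)
    for offset in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[offset + i]
        state = _keccak_f1600(state)
    return bytes(state[:32])


def matches_commitment(file_bytes: bytes, onchain_hash: str) -> bool:
    """Compare keccak256 of the raw file bytes with a 0x-prefixed on-chain hash."""
    expected = onchain_hash.lower()
    if expected.startswith("0x"):
        expected = expected[2:]
    return keccak256(file_bytes).hex() == expected


# Hypothetical artefact bytes; the real verdict file's exact bytes are not reproduced here.
raw = b'{"job": 3, "verdict": "complete"}'
commitment = "0x" + keccak256(raw).hex()
reserialized = json.dumps(json.loads(raw), indent=2).encode()

print(matches_commitment(raw, commitment))           # True
print(matches_commitment(reserialized, commitment))  # False: whitespace changed the bytes
```

To check the real artefact, the same helper would be called with the bytes of the verdict file as published and the reason hash from the completion transaction.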

Run the PLV pass and post the result. Happy for it to be a worked example of the composition pattern (AHM = signal + boundary, PLV = process audit, ERC-8183 = binary primitive at the protocol layer).

Ran the PLV pass on Job #3.

Result: ALLOW. Cross-model review agreed.

PLV’s read: AHM’s INSUFFICIENT flag was handled correctly as an evidence-boundary, not as an adverse behavioral signal. On that basis, complete() was process-faithful; reject() would have acted on a verdict AHM itself marked as not defensible.

Artifact SHA-256: 3599a5cc80408874a169d6dab7abc4ff36142131ee456bcda95a08cbd8daafa4