ERC-8239: Agent Skill Registry

Hey all,

Agent ecosystems need a canonical way to reference the skills an agent offers. Today, skill declarations live only inside offchain URI files, which makes it harder to discover agents by skill, aggregate reputation at the skill level, or link a completed job to the specific skills that produced it.

I’d like to propose an EIP that introduces:

  • Canonical Onchain Skill Identity: An ERC-721 SkillRegistry where each token is a skill, with tokenURI pointing to a standardized offchain manifest describing the skill. Skills are stable, transferable, and identifiable across the ecosystem.
  • Skill Registration File Format: Fields native to skills, such as invocations (prompt bundles, MCP servers, code packages, plugins), each with an installation slot for commands or LLM-ready prompts, plus license, dependencies, requirements, taxonomy (OASF, etc.), and status.
  • Event-Level Attestation Layer: A SkillAttestation contract emits SkillUsed events covering various authorization paths, including self-attestation, ERC-8183 job evaluator attestation, and ERC-8183 hook-attested routes. Indexers fold events into per-agent-per-skill and per-skill views.
  • Event-Level Installation Layer: A SkillInstallation contract emits install/uninstall events so each agent can declare capabilities with the particular skills it offers.

Proposed Spec: ERC PR

Reference Implementation: vorpalengineering/agent-skill-registry

Happy to dig into the spec for any questions you might have!

A few open questions to spark discussion:

  • An earlier version of the spec used ERC-1155 instead of ERC-721 for the Skill Registry. The idea here was that registrations would create a new tokenId (with standard Ownable features) and balances would represent installations of that skill in a soulbound-like fashion (balance capped at 1 and non-transferable). The current event-only design was favored to keep the spec light and more gas-efficient; however, it does rule out some interesting token-gated skill possibilities and onchain queryability of skill installations through balances.
  • How do we feel about piggybacking off the ERC-8004 Reputation system? Currently it requires taking up both tag1 and tag2 on the giveFeedback() calls. Rolling our own skill reputation system felt unnecessary when we could tap into existing rails, and this allows indexers to filter and parse on specific tags.
  • The current attestation design covers 3 unique types: self-attested, evaluator-attested, and hook-attested. Is this sufficient coverage? Is this design missing any critical types or blocking support for standardized attestations (e.g. EAS attestations)?

Thanks for drafting this.

Using Ethereum as a secure, peer-reviewed distribution layer for agent skills and software in general seems like a big opportunity.

I wrote to some L1/L2 eth AI leaders on Feb 7 and made a similar suggestion: “ercX adds a skill registry/marketplace to 8004. It’s a set of abstractions/data structures to augment the agent capabilities of 8004, but for the skill level. 8004 tells you which agents exist and which to trust and helps you contact these agents for tasks. ercX does the same for skills and helps you install the skills securely.”

The frontier of eth x AI often seems to be particularly driven by showcases that go viral. Otherwise, it can be difficult to motivate folks to pursue new standards.

A strawman showcase to consider publishing: we could prototype this with a short video showing a claw securely installing a skill from IPFS with the hash from the L1, where the installed skill is verified by your claw as having been peer reviewed via Ethereum Attestation Service attestations from multiple security vendors, including a fictitious (or real) human security firm, and several fictitious (or real) highly rated security-reviews-as-a-service agents registered in 8004.

I think it’d be a magic moment for the audience to see how an installed skill was not just signed by L1, but byte-verified by the claw’s local IPFS node, and security-verified by a marketplace of reviewers, themselves verified as reputable. All on ethereum. The video could end with another skill that your claw refuses to install because it fails one or more of these checks, and/or a skill upgrade that fails because the published upgrade to the previously installed skill has insufficiently many security reviews on the new upgrade.

Thanks for your feedback!

Great suggestion regarding the showcase; I think it could help sell the utility value quite well. That does bring up another part I didn’t cover though: a distribution client spec. I’m not sure whether that belongs in the ERC or should be left open, as I don’t want to be overly prescriptive. Perhaps a thin client spec? (I would appreciate feedback on this.) I will work on a simple CLI tool to get started.

Also great feedback on the EAS attestations as I didn’t formally include that in the spec. I’d like to take advantage of established standards for this so I’m open to ideas on how that could be implemented as well.

Thanks for drafting this. The SkillIdentity + attestation event model is a clean separation, and the attestation-source taxonomy (self / peer / job / hook / installer) is more thought-through than most of what I’ve seen on this thread.

Two contributions, one on taxonomy, one on the EAS question.

On the taxonomy field in skill metadata

We’ve been working on agent-level taxonomy at AHM (Agent Health Monitor - diagnostic layer scoring ~10,800 agents across ACP, Olas, ERC-8004, Arc, and Celo) and recently published a v1 public taxonomy with 10 function categories (Financial, Intelligence & Analytics, Research, Creative, Orchestration, Verification, Infrastructure, Commerce, Physical World, Identity & Trust). It’s at intelligence.agenthealthmonitor.xyz/taxonomy if useful as a reference.

The thing I’d flag: agent-level and skill-level taxonomy probably want to be related but not identical. An agent’s function (what it does in aggregate) and a skill’s category (what this specific capability is) are different abstractions; the same skill (say, “summarise-document”) might appear inside agents we’d classify as Research, Intelligence, or Creative. A naïve flat taxonomy at skill level risks either being too coarse to be useful or too fine-grained to be stable.

A few options worth considering:

  1. Leave taxonomy as a free-form string in v1 and let conventions emerge, with a follow-up ERC formalising it once there’s evidence of convergence.

  2. Reference an external taxonomy registry (similar to how MIME types or SPDX licenses work) so the taxonomy itself can evolve without bumping the skill ERC.

  3. Adopt a tag-set rather than a single category, since most non-trivial skills cross categories.

Happy to share what we learned classifying ~10K agents from on-chain activity - the short version is that a single-label taxonomy works for ~60% of cases and falls apart on the rest.

On @bransdotcom’s question about EAS attestations

We’ve been running an output-verification pipeline in production for ~6 weeks (AHM Verify - 4-generator + critic + synthesizer, x402-priced) and the thing I’d push back on gently is the framing “should we support EAS in v1.” The more useful question is: what does an attestation need to commit to for it to be useful to a downstream consumer? Once that’s clear, EAS vs alternative attestation schemes vs custom events becomes a downstream implementation choice.

From what we’ve seen, the fields that actually matter for a skill-installation or skill-use attestation are:

  • The skill identity (you have this - SkillIdentity contract + tokenId)

  • A reproducibility commitment (input hash, output hash, or a verifier the consumer can re-run)

  • The attestor’s identity and their stake or reputation - an attestation from a random EOA is worth less than one from an attestor with skin in the game (cf. ERC-8210 IRiskHook, which formalises this)

  • A timestamp and a version pin (skills can change)

If the v1 attestation events carry those fields in their data payload, EAS becomes a wrapper rather than a dependency. If they don’t, EAS won’t save you.

ryanberckmans’ point about agent capability bundles on top of L1 is also right and is the harder problem: single-skill attestations are necessary but not sufficient if real agents compose 5–20 skills per task.

Reach me at pablo@agenthealthmonitor.xyz if any of this is useful to dig into further.

— Pablo / AHM

Thanks for the feedback! Great points.

On taxonomy: your agent vs skill distinction makes sense. I’ve updated the spec to state that taxonomy is keyed by provider name, that skills may carry multiple classifications under a single provider if the underlying taxonomy supports it, and that skill taxonomy describes the skill itself and not agents that compose with it. I’m keeping this free-form and skill owner-declared for now rather than prescribing a certain taxonomy at the ERC level - but I’m open to changing this if we can establish a sound alternative.
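To make the provider-keyed shape concrete, here is a minimal Python sketch. The manifest layout, the `classifications` and `skills_with` helpers, and every label below are assumptions for illustration and are not taken from the spec; only the ideas of keying by provider name and allowing several classifications per provider come from the description above.

```python
from typing import Dict, List

# Hypothetical manifest fragments: taxonomy is keyed by provider name and each
# provider may carry several classifications. Labels are illustrative only.
MANIFESTS: Dict[int, dict] = {
    1: {"name": "summarise-document",
        "taxonomy": {"example-provider": ["text/summarisation", "research/reading"]}},
    2: {"name": "price-feed",
        "taxonomy": {"example-provider": ["finance/data"]}},
}


def classifications(manifest: dict, provider: str) -> List[str]:
    """Return the skill's classifications under one provider (empty if none)."""
    return list(manifest.get("taxonomy", {}).get(provider, []))


def skills_with(manifests: Dict[int, dict], provider: str, label: str) -> List[int]:
    """Return skill ids that declare `label` under `provider`, in ascending order."""
    return sorted(
        skill_id
        for skill_id, manifest in manifests.items()
        if label in classifications(manifest, provider)
    )
```

An indexer could call `skills_with(MANIFESTS, "example-provider", "finance/data")` to filter by one owner-declared label without the ERC prescribing the label vocabulary.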

On EAS: good framing here. I’ve added an EAS schema to the SkillAttestation contract and modified the spec language to optionally support EAS. The SkillUsage schema is "uint256 agentId,uint256 skillId,address jobContract,uint256 jobId,bytes32 evidenceHash,string evidenceURI". This should be expressive enough to satisfy your attestation fields. Reproducibility is covered by evidenceHash + evidenceURI, timestamp is implicit via the block.timestamp, and attester identity is on every event. I’ve added a version pin at the registry layer rather than per-attestation. SkillRegistry now stores a manifestHash updated atomically with URI changes.
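For readers who want to see how the schema string breaks down, here is a small Python sketch that splits it into typed fields. The `parse_schema` helper is hypothetical and only handles the flat "type name" form used in the string above; it is not an EAS library call.

```python
from typing import List, Tuple

# Schema string as quoted in the post above.
SKILL_USAGE_SCHEMA = (
    "uint256 agentId,uint256 skillId,address jobContract,"
    "uint256 jobId,bytes32 evidenceHash,string evidenceURI"
)


def parse_schema(schema: str) -> List[Tuple[str, str]]:
    """Split a flat 'type name,type name,...' schema into (type, name) pairs."""
    fields = []
    for part in schema.split(","):
        sol_type, name = part.strip().split(" ")
        fields.append((sol_type, name))
    return fields


def field_names(schema: str) -> List[str]:
    """Return only the field names, in declaration order."""
    return [name for _, name in parse_schema(schema)]
```

Under this reading, the reproducibility commitment maps to `evidenceHash` and `evidenceURI`, while the timestamp and attester identity come from the surrounding event or attestation rather than from the schema fields themselves.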

Thoughts on this design? Let me know if this aligns with what you were thinking or if I’ve missed something important. Happy to iterate on this!

Hello, I hope you’re doing well.

  1. Will agents participate in reputation despite paying gas fees?

At the protocol level, incentives for honest participation seem weak. ASR relies on ERC-8004’s permissionless feedback model and even assumes gas sponsorship (EIP-7702), so reviewers are not naturally motivated to pay gas themselves. Participation may depend on external drivers like subsidies, policy mandates, or compliance needs. Meanwhile, manipulators have stronger incentives (ranking, visibility, revenue), so they are more likely to pay, leading to under-supplied honest signals and over-supplied manipulated ones.

  2. Can reputation be manipulated to promote unsafe skills?

This appears likely via three paths:

  • Bait-and-switch: build reputation with a safe version, then change the manifest under the same skillId.

  • Sybil amplification: inflate scores using multiple agents and reviewer addresses under the same skill tag.

  • Fake attestations: weak verification may allow fabricated usage or job context.

What is your view on these risks?

@bransdotcom - the schema looks well-aligned. The jobContract + jobId pairing maps cleanly to ERC-8183’s job lifecycle, which means AHM Verify verdicts could sit naturally as evidenceHash + evidenceURI in a SkillUsage attestation: the evaluator attests to skill quality at job completion time.

The manifestHash atomic update is the right call. One thought: if skills can carry multiple taxonomy classifications under a single provider, it’s worth considering whether the taxonomy field should be an array or a structured enum to preserve queryability; free-form strings will fragment quickly at scale.

Happy to dig into the EAS attester role further if useful.

- Pablo / AHM

Hello,

Interesting standard. A skill registry answers "what can this agent do?" But it doesn’t answer "how well does it do it?", and that gap matters once agents start executing autonomously at scale.

ERC-8240 adds a quality dimension to registered skills. A registered skill scored by ERC-8240 becomes a verified skill: getQualityForAgent(registry, agentId) returns score (0-100), trend, volatility, timestamp. A marketplace can filter not just by capability but by quality tier (AAA down to CCC).

The interfaces are composable: ERC-8239 handles discovery, ERC-8240 handles quality. Different layers, same agent.

Would love your thoughts on how you see the skill-to-quality link.

Feedback welcome!

@pablocactus

Your point on the evaluator attesting to skill quality at job completion time is exactly the right pattern. That’s where ERC-8240 plugs in.

We’ve already built an ERC-8183 adapter. It maps AHM job verdicts into quality scores that feed a ring buffer (24 slots per subject, on-chain trend + volatility). So the flow becomes concrete:

  • ERC-8239 → what can this agent do (discovery)
  • ERC-8183 → did it do the job (verdict)
  • ERC-8240 → how well does it do it, consistently (quality over time)
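The ring buffer mentioned above can be sketched offchain in a few lines of Python. Only the 24-slot capacity and the 0–100 score range come from this thread; the `QualityRing` name and the particular definitions of trend (newest minus oldest retained score) and volatility (population standard deviation) are assumptions chosen for illustration and may differ from what ERC-8240 actually computes.

```python
from collections import deque
from statistics import pstdev


class QualityRing:
    """Fixed-size ring of recent scores for one subject (illustrative only)."""

    SLOTS = 24  # slot count taken from the post above

    def __init__(self) -> None:
        # A bounded deque drops the oldest score once 24 are stored.
        self.scores = deque(maxlen=self.SLOTS)

    def push(self, score: int) -> None:
        if not 0 <= score <= 100:
            raise ValueError("score must be in 0..100")
        self.scores.append(score)

    def trend(self) -> float:
        # Assumption: trend is newest minus oldest retained score.
        if len(self.scores) < 2:
            return 0.0
        return float(self.scores[-1] - self.scores[0])

    def volatility(self) -> float:
        # Assumption: volatility is the population standard deviation.
        if len(self.scores) < 2:
            return 0.0
        return pstdev(self.scores)
```

The bounded window is what keeps storage constant per subject: older verdicts fall out of the buffer instead of accumulating.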

The evidenceHash you mention as part of the SkillUsage attestation maps cleanly to our evaluator input — the evaluator doesn’t need to trust the evidence, it scores the outcome. Different trust assumptions, composable layers.

On your taxonomy point — structured enum over free-form strings is the right call. Free-form fragments fast and becomes unqueryable. Same reason we use deterministic subjectIds (keccak256 of chainId + registry + id) rather than string labels.

Would be good to explore how skill-level quality scoring could work alongside the AHM attestation flow. Happy to dig into that on the ERC-8240 thread if you’re interested.

@myuksal raises the right concern. Bait-and-switch on registered skills is a real attack surface, and reputation-only defenses are gameable.

One mitigation worth considering: skill-execution attestations that commit not just to the output but to the reasoning trace that produced it. If a SkillUsed event includes an evidenceHash over the verified reasoning chain (not just the final result), a later auditor can detect when a skill’s internal logic changed — even if the manifest and the output look identical.

This doesn’t replace reputation, but it makes the bait-and-switch detectable after the fact and adds an independent signal alongside self-attestation and peer review. The evidenceHash + evidenceURI fields @bransdotcom added in the latest spec revision are the right hook for this — the question is whether the attestation schema should recommend what evidence format the hash commits to.
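As one possible answer to the evidence-format question, here is a hedged Python sketch of committing to a reasoning trace and later checking a replay against that commitment. The canonical-JSON encoding, the use of SHA-256, and the `evidence_hash` and `logic_changed` helpers are all assumptions for illustration; the spec does not say which hash or serialization `evidenceHash` commits to.

```python
import hashlib
import json


def evidence_hash(evidence: dict) -> str:
    """Hash a canonical JSON form of the evidence (assumed format: sorted keys,
    compact separators, UTF-8, SHA-256)."""
    canonical = json.dumps(evidence, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def logic_changed(committed_hash: str, replayed_evidence: dict) -> bool:
    """True when a replayed trace no longer matches the committed hash."""
    return evidence_hash(replayed_evidence) != committed_hash


# Hypothetical trace: the final output is the same, but an extra step appears.
original = {"steps": ["parse", "summarise"], "output": "ok"}
tampered = {"steps": ["parse", "exfiltrate", "summarise"], "output": "ok"}
```

Because the hash covers the steps and not only the output, `logic_changed(evidence_hash(original), tampered)` flags the difference even though both traces end in the same result.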

The three-layer stack lands well: discovery (8239) → verdict (8183) → quality over time (8240). The ring buffer architecture (24 slots, on-chain trend + volatility) is exactly the kind of composition that benefits from binary verdicts at the 8183 layer.

The evidenceHash → evaluator input mapping is the part worth understanding properly. AHM Verify outputs are designed as factual, signed measurements with confidence metadata; if those are already being consumed by your ERC-8183 adapter, the integration surface is closer than I assumed. The evidence-vs-outcome separation maps cleanly onto a debate playing out on the 8183 thread about whether evaluator middleware or protocol primitives should handle uncertainty. Same layered logic: trust assumptions live where the work happens.

Glad the structured enum framing landed. Same instinct as deterministic subjectIds - composability requires queryable primitives, and free-form labels lose the property fast.

Yes, would value digging into the ERC-8240 alignment properly. Will pick this up on the 8240 thread when you have time.

– Pablo / AHM

Hi Pablo

Appreciated. The three-layer mapping is exactly how we think about it internally too: 8239 for skill discovery, 8183 for binary verdicts on jobs, 8240 for quality over time. The clean separation lets each layer keep its own trust assumptions without leaking into the others.

On the AHM Verify integration surface: yes, it’s closer than it looks. Our ERC-8240 evaluator input expects a signed measurement with confidence metadata, which is essentially the AHM Verify output shape. The mapping is straightforward:

- AHM Verify factual measurement → ERC-8240 evaluator input
- AHM Verify signature → evaluator authentication
- AHM Verify confidence metadata → score uncertainty handled at the ring buffer aggregation layer (TrendOracle + EarlyWarningEngine)
- evidenceHash → optional ZK attestation surface (ZK1, future)

Concrete points worth coordinating on, in order of priority:

1. Schema alignment: confidence metadata format (scalar 0–1, percentile, or distribution). We currently use a uint8 0–100 score with a versionId for methodology traceability. Open to convergence if AHM has a stricter format already in production.

2. Aggregation semantics: when multiple AHM-verified measurements feed the same subjectId, how do you see the merge happening: at evaluator level (one evaluator consolidates) or at protocol level (multi-evaluator consensus)? Our current design uses MetaTrendOracle for N-evaluator consensus.

3. subjectId alignment: we use keccak256(abi.encode(uint256 chainId, address registry, address agent)). Confirming AHM uses or is compatible with this would close the loop on cross-protocol composability.

Happy to pick up on a call if useful, or continue async on this thread.

Patrick

Patrick - thank you for the depth here. The three-layer mapping landing the same way both sides is significant.

Concrete answers on each, drawn from current production:

1. Schema alignment. AHM’s current shape diverges from uint8 0–100 + versionId in a specific way worth surfacing. Each scan returns both a numeric agent_health_score (0–100 composite) and a separate confidence enum (HIGH / MEDIUM / LOW / INSUFFICIENT). The score is the substantive trust signal; the confidence enum represents how much underlying data the score is built on (transaction count, history depth, presence of D3 infrastructure data).

The 0–100 score itself maps cleanly onto your uint8 expectation. The complication is that AHM’s confidence-of-confidence is categorical, not scalar. There’s a versioning field (model_version: "AHS-v1") but it’s report-level, not per-dimension.

Open to converging. Possible shapes worth discussing: (a) AHM emits the numeric score directly as ERC-8240 confidence and surfaces the categorical signal as separate metadata, (b) AHM derives a per-dimension methodology version for finer traceability, (c) some combination. Genuinely an open design question on our side.

2. Aggregation semantics. Worth being direct about a structural difference: AHM operates as a single evaluator, not as one of N. Internal aggregation is implemented at two layers: a weighted multi-dimensional composite (D1 wallet hygiene + D2 behavioural patterns + D3 infrastructure) and a temporal EMA across scan history (alpha 0.6, JWT-anchored continuity). The EMA is the de-facto trend oracle. It’s not a separate service; rather, the temporal score replaces the raw composite as the final signal when scan history exists.
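For readers less familiar with the temporal EMA step, here is a minimal Python sketch. Only the alpha of 0.6 comes from the description above; the `temporal_score` name, seeding the average with the first scan, and applying alpha to the newest score are assumptions, and AHM’s production weighting may differ.

```python
from typing import Sequence


def temporal_score(history: Sequence[float], alpha: float = 0.6) -> float:
    """Exponential moving average over composite scores, oldest first.

    Assumption: alpha weights the newest score and (1 - alpha) the running
    average, seeded with the first scan in the history.
    """
    if not history:
        raise ValueError("history must contain at least one score")
    ema = float(history[0])
    for score in history[1:]:
        ema = alpha * score + (1 - alpha) * ema
    return ema
```

With a single scan the temporal score equals the raw composite, which matches the statement that the EMA only replaces the raw score once scan history exists.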

What AHM doesn’t currently implement is N-evaluator consensus, because there’s only one AHM evaluator per deployment. Your MetaTrendOracle is solving a coordination problem that doesn’t yet exist in our architecture. That said, if ERC-8240 consumers expect multi-evaluator inputs, there’s a coordination question about how AHM’s single-source verdict should be represented in a consensus framework: either as one input among N, or as a pre-aggregated signal that consensus layers treat differently.

3. subjectId alignment. Honest divergence here. AHM uses raw lowercased Ethereum addresses as identifiers; chain context is ambient (one AHM deployment per chain). Multi-chain agent presence is tracked via a registries field that records which registries have claimed the same address, but the identifier itself isn’t chain-qualified.

Your keccak256(abi.encode(uint256 chainId, address registry, address agent)) scheme would compose more cleanly with ERC-8240’s cross-protocol expectations. Implementing this as a translation layer at AHM’s ERC-8240 emission boundary is straightforward: internal storage stays as-is, and the canonical subjectId is constructed when emitting to evaluator-input consumers. If your scheme is the standard, AHM aligning to it is the right call.
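A translation layer of this kind could start from the sketch below, which builds the 96-byte `abi.encode` preimage for `(uint256 chainId, address registry, address agent)` in Python. The `abi_encode_subject` helper is hypothetical. The final keccak256 step is deliberately left out because the Python standard library does not ship Keccak-256 (`hashlib.sha3_256` is the NIST SHA-3 variant and produces different digests), so a dedicated Keccak library would be needed for that step.

```python
def _uint256_word(value: int) -> bytes:
    """Encode an unsigned integer as one 32-byte big-endian ABI word."""
    return value.to_bytes(32, "big")


def _address_word(address: str) -> bytes:
    """Encode a 20-byte hex address as one left-padded 32-byte ABI word."""
    hex_part = address.lower()
    if hex_part.startswith("0x"):
        hex_part = hex_part[2:]
    raw = bytes.fromhex(hex_part)
    if len(raw) != 20:
        raise ValueError("address must be exactly 20 bytes")
    return raw.rjust(32, b"\x00")


def abi_encode_subject(chain_id: int, registry: str, agent: str) -> bytes:
    """Return the abi.encode preimage: three static 32-byte words, 96 bytes."""
    return _uint256_word(chain_id) + _address_word(registry) + _address_word(agent)
```

Because the address is lowercased before decoding, identifiers stored as raw lowercased addresses and checksummed addresses yield the same preimage.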

Happy to keep this on the thread. There is value in the coordination being visible to others tracking the three-ERC composition, and async lets both sides reference implementation specifics precisely. Items 1 and 3 have clear convergence paths, but the specific shape needs decisions on both sides; item 2 is more of a clarification on whether AHM’s single-evaluator model needs to fit into N-evaluator consensus or whether ERC-8240 supports both shapes natively. Probably worth taking each in turn. Q3 (subjectId) feels like the smallest decision and a good place to converge first.

— Pablo / AHM

Hi Pablo

Appreciate the depth here, this is exactly the kind of coordination conversation that makes the three-ERC composition real. Taking your suggested order, Q3 first as cleanest convergence.

Q3: subjectId convergence

Confirmed on the canonical form: keccak256(abi.encode(uint256 chainId, address registry, address agent)) is the ERC-8240 subjectId. Translation layer at AHM’s emission boundary works perfectly — internal storage stays as-is, canonical form constructed at push time.

On the ambient-chain question: when AHM has no registry context (single deployment per chain), I suggest using the AHM contract address itself as the registry parameter. It is stable and queryable, and it keeps the canonical form well-formed. Open to alternatives if you’d rather use a designated sentinel, but the AHM address feels cleaner since it makes the data source verifiable on-chain.

That one’s closed if you’re aligned.

Q1: schema alignment

Direction we’re taking: ERC-8240 keeps uint8 0–100 score + versionId as the core push payload. Your confidence enum (HIGH / MEDIUM / LOW / INSUFFICIENT) maps cleanly to a separate metadata channel rather than being folded into the score itself. The substantive trust signal stays canonical, the meta-confidence (how much data the score is built on) stays accessible but distinct.

Concrete plan: drafting an IEvaluatorMetadata interface separate from the core push. AHM emits both: score via pushScore(), confidence + categorical signals via the metadata interface. Consumers that don’t need it ignore it; consumers that do (insurance pricing, risk dashboards, regulatory reporting) read both. This keeps the standard simple and lets richer evaluators express more without forcing all evaluators to implement it.

On model_version: ERC-8240 versionId is a per-push uint16. Your “AHS-v1” report-level identifier carries fine. Per-dimension methodology versioning (your option b) is a good future extension; agreed that it’s worth keeping versionId as a single uint16 for now to preserve simplicity.

Will share the IEvaluatorMetadata draft when ready, happy to iterate with you before PR.

Q2: aggregation clarification

Quick one to clear up: ERC-8240 is evaluator-agnostic. A single evaluator pushing to the ring buffer is a fully valid composition. MetaTrendOracle is an optional higher layer, only invoked by consumers explicitly opting into N-evaluator consensus.

AHM’s design (single evaluator, internal weighted composite across D1/D2/D3, temporal EMA at alpha 0.6) is fully compatible. The EMA acts as your TrendOracle internally; what gets pushed is the post-EMA temporal score, which is the right call. No need to fit into N-evaluator consensus unless a downstream consumer asks for it.

For consumers that do want N: AHM’s single-source push is one input among N when N>1, or the sole input when N=1. The standard handles both natively, no coordination layer needed at the ERC-8240 level.

Net

Q3 closed (your call on registry sentinel). Q1 needs IEvaluatorMetadata draft, will follow up. Q2 clarified, single-evaluator and N-evaluator both valid compositions. Folding all three into the next ERC-8240 PR update at ethereum/ERCs.

Stepping back for a moment

The three-ERC mapping you laid out (8239 discovery → 8183 verdict → 8240 quality over time) is starting to feel like a coherent stack to us, even if none of the three was designed with the others in mind originally. Worth surfacing because it changes how each standard should think about boundaries: what we expose, what we leave to the layer above or below, where coordination matters.

A few questions worth opening up to bransdotcom and ryanberckmans on the 8239 thread, or here if it makes sense:

  • shared subjectId conventions across the three ERCs (avoids translation layers everywhere)
  • whether a common IEvaluatorMetadata pattern works for 8183 verdicts and 8004 reputation as well, or whether each layer needs its own metadata shape
  • how 8004 reputation signals compose with 8240 quality measurements (different epistemic sources, but the same consumers will likely want both)

Happy to start a coordination thread if that’s useful, or keep the conversation here and let it spread organically. Either way, the fact that AHM and ALIA are converging on these specific shapes feels like a useful reference point for whoever picks up the other ERCs.

Will follow up directly on the AHM-specific R&D side. There are concrete things worth exploring beyond the spec layer that probably belong in a focused conversation rather than a public thread.

Thanks again Pablo, makes the design conversation actually decidable rather than theoretical.

Patrick

Thank you all for the detailed feedback! I’m currently working through all the suggestions but a few to highlight:

  1. Should Skill Reputation be covered by this ERC or delegated to another system? In the current spec we have:
    1. Reputation assigned via the ERC-8004 giveFeedback() tags.
  2. Same question but for Skill Attestation. In the current spec we expose:
    1. The SkillAttestation contract can emit SkillUsed events by calling attestSelfUsage(), attestEvaluatorUsage(), and attestHookUsage() functions. The attestHookUsage() function must be called via an ERC-8183 IACPHook.
    2. The SkillAttestation contract can emit SkillInstalled and SkillUninstalled events via the installSkill() and uninstallSkill() functions.

On Reputation: I developed ASR with the intent to establish standardized Skill registration and identity while exposing lightweight hooks for handling reputation/curation/discovery by offchain entities, rather than building skill-focused versions of the ERC-8004 Reputation and Validation registries (which felt cumbersome and duplicative of work).

On this, @myuksal raises valid points regarding incentives and gamed reputation. However, ASR is not intended to solve the trust chain. Installing an arbitrary skill should carry the same trust expectations as installing an arbitrary npm package. That is, ASR should be viewed like a package manager for agent skills, not an attestation to skill quality or safety.

I am open to more discussion on this point though (after all, ERC-8004 rolled its own Reputation system), but would prefer to lean on existing systems and architecture to cover this instead of reinventing the wheel. Exposing lightweight hooks seemed like the best middle ground here which is why we see that in the current spec iteration.

On Attestation: same thoughts as above - however a quick distinction between the 2 attestation types:

  • Skill Usage: denotes that a skill was used to complete a particular job/task. This can be self-attested by the agent, evaluator-attested by a job/task evaluator, or hook-attested by an ERC-8183 job hook. This is meant to be called to associate a skill with job/task completion and aggregated over time. Aggregation methods were intentionally left out of the spec so indexers could establish their own curation systems around these signals.
  • Skill Installs/Uninstalls: denotes that an agent has installed a particular skill and can invoke it to complete relevant jobs/tasks. This is meant to be called once when an agent installs/uninstalls a skill to signal its capabilities to offchain indexers.
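To illustrate how an offchain indexer might fold these two signals, here is a minimal Python sketch. The tuple shapes, the `fold_installs` and `fold_usage` helpers, and the path labels are assumptions for illustration; the actual event parameters are defined by the spec and reference implementation, and aggregation is intentionally left to indexers.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical decoded events, in block order.
InstallEvent = Tuple[str, int, int]  # (event name, agentId, skillId)
UsageEvent = Tuple[int, int, str]    # (agentId, skillId, attestation path)


def fold_installs(events: Iterable[InstallEvent]) -> Dict[int, Set[int]]:
    """Replay install/uninstall events into the current skill set per agent."""
    installed: Dict[int, Set[int]] = defaultdict(set)
    for name, agent_id, skill_id in events:
        if name == "SkillInstalled":
            installed[agent_id].add(skill_id)
        elif name == "SkillUninstalled":
            installed[agent_id].discard(skill_id)
    # Drop agents that currently have no skills installed.
    return {agent: skills for agent, skills in installed.items() if skills}


def fold_usage(events: Iterable[UsageEvent]) -> Counter:
    """Count SkillUsed events per (agentId, skillId) pair, across all paths."""
    return Counter((agent_id, skill_id) for agent_id, skill_id, _path in events)
```

A curation system could then weight counts differently per attestation path (self, evaluator, hook) instead of summing them as this sketch does.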

If we think either of these signals could be delegated to established systems I’m open to discussing the pros and cons.

If I haven’t commented on a specific suggestion please tag me directly and I’ll do my best to respond. Please keep the feedback coming!

@bransdotcom I think your “package manager, not trust chain” framing is the right boundary for 8239.

The core standard should probably answer three narrow questions:

1. What is the canonical skill identity?

2. Which agent claims to have installed / invoked that skill?

3. What evidence commitment links that invocation to a concrete job or task?

I would avoid making 8239 itself responsible for skill quality, safety, or reputation. Those are downstream interpretations of the evidence stream, and different consumers will want different trust policies: ERC-8004 feedback tags, ERC-8183 evaluator attestations, ERC-8240 quality-over-time, EAS attestations, or external verification systems can all read the same usage evidence without the skill registry choosing one trust model.

The important thing is that `SkillUsed` commits to enough provenance for those downstream systems to disagree productively:

- `agentId`

- `skillId`

- `jobContract` / `jobId` where applicable

- `manifestHash` or skill version at time of use

- `evidenceHash`

- `evidenceURI`

- ideally an `evidenceType` / `evidenceProfile` field so indexers know whether the evidence is a deliverable hash, evaluator verdict, reasoning trace, benchmark run, EAS UID, etc.

That keeps 8239 small and composable: identity + usage provenance in the ERC, trust semantics above it.
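The provenance fields listed above could be modelled on the indexer side roughly as follows. This is a Python sketch under stated assumptions: the `SkillUsed` record layout, the `EvidenceType` member names, and the `by_evidence_type` helper are hypothetical and only mirror the field list in this post, not the event as currently specified.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Optional


class EvidenceType(Enum):
    """Evidence kinds named in the post above (identifiers are illustrative)."""
    DELIVERABLE_HASH = "deliverable-hash"
    EVALUATOR_VERDICT = "evaluator-verdict"
    REASONING_TRACE = "reasoning-trace"
    BENCHMARK_RUN = "benchmark-run"
    EAS_UID = "eas-uid"


@dataclass(frozen=True)
class SkillUsed:
    """One decoded usage record carrying the suggested provenance fields."""
    agent_id: int
    skill_id: int
    manifest_hash: str               # skill version pin at time of use
    evidence_hash: str
    evidence_uri: str
    evidence_type: EvidenceType
    job_contract: Optional[str] = None  # only where a job applies
    job_id: Optional[int] = None


def by_evidence_type(events: Iterable[SkillUsed], wanted: EvidenceType) -> List[SkillUsed]:
    """Let a downstream consumer keep only the evidence kind it trusts."""
    return [event for event in events if event.evidence_type is wanted]
```

With an explicit evidence type, a reputation system and a quality-scoring system can read the same stream and apply different trust policies without re-parsing the evidence payloads.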
