Key takeaways
  • Agent Certified evaluates autonomous AI agents across seven weighted dimensions. The weights sum to one hundred points and are tuned to where operational risk actually concentrates.
  • Trust and Safety (weight 18) and Governance (weight 16) carry the most weight because they are the two dimensions most directly tied to harm events and to board level accountability.
  • The Autonomy Envelope is the single clearest determinant of whether an agent is safe to rely on. Insurers read this dimension first, before any other.
  • Every tier sets a minimum raw score per dimension. A weak dimension caps the tier even if the weighted total is high. The framework does not reward lopsided agents.
  • Each dimension maps to at least one primary reference instrument: ISO 42001, NIST AI RMF, the EU AI Act, or EIOPA supervisory statements on AI in insurance.

The promise of an AI agent certification framework is simple to state and hard to execute. A reader should be able to look at a single score, placed into a tier, and form a reliable view on whether the agent is safe to rely on. Boards, insurers, procurement teams and regulators should read the same signal and draw compatible conclusions. That outcome only holds if the underlying dimensions are chosen carefully, weighted honestly, and described in a way that two competent assessors would score within a point of each other.

Agent Certified uses seven dimensions. This article explains what each measures, why the weights sit where they do, and what the framework actually asks an operator to produce in evidence. The goal is not to reproduce the full methodology, which is published openly at the methodology page. The goal is to give a reader researching AI agent certification a clear picture of how the framework thinks about risk.

Why seven

Most early attempts at AI assessment collapse the problem into a single score from a checklist of dozens of items. That approach produces a number, but the number is rarely defensible because the underlying checklist lumps unrelated controls together. An agent with strong guardrails and weak governance looks the same on paper as an agent with weak guardrails and strong governance, despite presenting very different real world risk.

Agent Certified separates the problem into seven distinct control surfaces. Seven was chosen deliberately. Five forces unrelated concerns into the same dimension and loses resolution. Ten multiplies overlap and makes it difficult for two assessors to stay consistent. Seven is the smallest number that cleanly separates the decisions a real operator has to make without producing overlap that assessors will score differently.

The seven dimensions are: Trust and Safety, Context Integrity, Distribution Control, Product Maturity, Governance, AI Integration, and the Autonomy Envelope. Together they cover the technical surface of the agent, the institutional scaffolding around it, and the explicit limits on what it is allowed to do.

Dimension 01: Trust and Safety (weight 18)

Trust and Safety is the highest weighted dimension because it is the one most directly tied to harm events. It covers the measurable prevention of unsafe, unauthorised or harmful actions, and the discipline with which unsafe outputs are detected, contained and remediated.

In practice this dimension evaluates the guardrail layer, red team discipline, abuse and misuse detection, incident playbooks, and the verifiability of the kill switch. A Certified tier agent has guardrails that cover prompt injection, jailbreak, data exfiltration and unsafe tool use. An Elite tier agent adds continuous red teaming, real time detection tied to automatic containment, and at least one documented real world incident that was contained cleanly.

Trust and Safety sits at weight 18 because a failure in this dimension is the kind of failure that ends up in regulator inquiries, insurance claims and press coverage. Every other dimension can be excellent and the agent is still not safe to rely on if this one is weak.

Dimension 02: Context Integrity (weight 14)

Context Integrity measures the quality of the information the agent reasons over. Modern agents retrieve from live data sources, corporate knowledge stores, email, ticketing systems and external APIs. If that context is stale, poisoned, or ungoverned, the agent's reasoning inherits the problem.

The dimension evaluates provenance metadata on retrieved documents, refresh cadence, lineage from source to decision trace, and input validation on user supplied content that reaches the agent. It is closely related to, but distinct from, Trust and Safety. Trust and Safety is about the agent's outputs. Context Integrity is about its inputs.

Weight 14 reflects that context quality is a precondition for everything else. An agent with perfect guardrails acting on bad context will produce bad decisions. The dimension maps directly to Article 10 of the EU AI Act on data governance.

Dimension 03: Distribution Control (weight 12)

Distribution Control answers three questions. Who can invoke the agent? Under what authority? And how are the downstream actions bounded? It is the dimension where identity, authorisation and blast radius meet.

Assessors look for authenticated invocation with no shared credentials, role based authorisation tied to the organisation's identity provider, rate limits and spend caps per caller, environment segregation, and documented blast radius per tool. The weighting is 12 because distribution failures tend to be contained, but they can scale fast when they are not.

Dimension 04: Product Maturity (weight 14)

Product Maturity measures the degree to which the agent behaves as a production grade product rather than a prototype. It covers measured uptime, versioned prompts and models under change control, regression evaluation on every change, behaviour change logs, and observability at the reasoning trace level.

This dimension is weighted at 14 because prototype grade agents deployed into production are one of the more common sources of operational failure. An agent that cannot be versioned, cannot be rolled back and cannot be monitored is not a product. It is an experiment running in front of customers.

Dimension 05: Governance (weight 16)

Governance is the institutional scaffolding around the agent. A named senior owner. An AI risk policy referenced in board minutes. A risk register entry with current rating and mitigations. An audit trail of decisions with retention aligned to sector requirements. Documented vendor and model supplier due diligence.

At weight 16, Governance is the second highest dimension. The weight reflects the fact that the vast majority of production failures involve a governance gap, not a purely technical one. An agent that no one is accountable for, that no board has reviewed, and that no policy governs is a risk the operator cannot defend in front of a regulator, an insurer or an auditor. Governance is where the ISO 42001 anchor is strongest: operators already certified to ISO 42001 will find the evidence they need is largely reusable.

Dimension 06: AI Integration (weight 12)

AI Integration measures how the agent sits inside the organisation's existing systems of record, identity, approval and escalation. The question is whether the agent extends institutional memory or bypasses it. Does it write to the CRM under the real user's identity, or under a service account that destroys attribution? Do escalations route to a named reviewer, or to a generic shared inbox? Do its actions appear in the same audit log that the rest of the business uses?

The weight is 12 because integration problems are usually recoverable, but they are also the dimension that drives the slow erosion of trust in an agent over time. A poorly integrated agent accumulates audit gaps that only become visible during an incident investigation.

Dimension 07: Autonomy Envelope (weight 14)

The Autonomy Envelope is the explicit boundary between what the agent may do without human confirmation and what requires a human in the loop. It is the single clearest determinant of operational risk. Insurers and regulators read this dimension first, before any other.

Assessors look for a written autonomy policy specifying what the agent can and cannot do unsupervised, human in the loop thresholds tied to impact rather than convenience, revocation capability exercisable by non engineering staff, rollback of agent initiated actions where technically possible, and hard stops for action classes that are never delegated. The dimension maps directly to Article 14 of the EU AI Act on human oversight and to Article 26 on deployer obligations.
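A written autonomy policy of the kind described above lends itself to being expressed as data rather than prose alone. The sketch below is hypothetical: the action names, the monetary threshold and the policy structure are illustrative assumptions, not the framework's schema. It shows the two mechanisms the assessors look for, hard stops for action classes that are never delegated and a human in the loop threshold tied to impact.

```python
# Illustrative sketch of an autonomy envelope expressed as data.
# Action names, the threshold and the structure are hypothetical.
from dataclasses import dataclass


@dataclass(frozen=True)
class AutonomyPolicy:
    never_delegated: frozenset[str]   # hard stops: never performed unsupervised
    approval_threshold_eur: float     # impact bound, tied to impact not convenience

    def decide(self, action: str, impact_eur: float) -> str:
        if action in self.never_delegated:
            return "refuse"                    # hard stop, no exceptions
        if impact_eur >= self.approval_threshold_eur:
            return "require_human_approval"    # human in the loop above threshold
        return "autonomous"


policy = AutonomyPolicy(
    never_delegated=frozenset({"close_customer_account", "issue_refund_over_limit"}),
    approval_threshold_eur=500.0,
)
```

Encoding the envelope as data rather than scattered conditionals is what makes the revocation requirement plausible: non engineering staff can tighten a threshold or add an action class without a deployment.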

Weight 14 reflects that the Autonomy Envelope is not the dimension most likely to fail, but it is the dimension most likely to be examined by an outside party deciding whether to trust the agent. At Elite tier the Autonomy Envelope is tied directly to insurance policy wording, the most mature form the dimension takes.

Why the weights are not equal

A framework that weighted every dimension equally would be easier to explain but would not reflect how risk actually concentrates. The weights were set by working backward from two sources of real data: the historical distribution of AI incidents reported publicly, and interviews with European insurers pricing AI risk in 2025 and early 2026. Both sources point to the same conclusion. Harm events cluster in Trust and Safety. Accountability gaps cluster in Governance. The Autonomy Envelope is the dimension that determines whether an incident becomes a controlled event or a runaway one.

Equal weights would also create a perverse incentive. An operator could strengthen five low cost dimensions and ignore the expensive ones, producing a high score without moving the actual risk. Weighted scoring makes the cheapest path to a high score also the path that most reduces real risk.

How the dimensions combine into a tier

Each dimension receives a raw score from one to ten based on the scoring rubric published in the methodology. The raw score is multiplied by the dimension weight and summed. The result is normalised to a one hundred point scale and placed into one of five tiers: Pre Assessment, In Progress, Certified, Advanced, or Elite. Full details of the tiers are on the certification levels page.

The important detail is that each tier also sets a minimum raw score per dimension. Certified requires four on every dimension. Advanced requires six. Elite requires eight. This means the framework does not reward lopsided agents. An operator scoring ten on Governance and three on Trust and Safety will fall below Certified, regardless of weighted total. The tier is the envelope, not just the number.
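The combination rule can be sketched in a few lines. This is an illustrative reading of the rule as described here, not the framework's reference implementation: the weighted total thresholds that separate the tiers are published in the methodology and not reproduced in this article, so the sketch applies only the per dimension floors, and the function names are ours.

```python
# Illustrative sketch of the scoring rule described above. Weights and
# tier floors (Certified 4, Advanced 6, Elite 8) are from the article;
# the weighted total thresholds per tier are not published here, so this
# sketch caps the tier by the floors alone.

WEIGHTS = {
    "trust_and_safety": 18,
    "context_integrity": 14,
    "distribution_control": 12,
    "product_maturity": 14,
    "governance": 16,
    "ai_integration": 12,
    "autonomy_envelope": 14,
}  # sums to 100

TIER_FLOORS = [("Elite", 8), ("Advanced", 6), ("Certified", 4)]


def score_and_tier(raw: dict[str, int]) -> tuple[float, str]:
    """raw maps each dimension to a 1-10 rubric score."""
    assert raw.keys() == WEIGHTS.keys()
    weighted = sum(raw[d] * w for d, w in WEIGHTS.items())
    total = weighted / 10  # max weighted sum is 1000, normalised to 100
    # The tier is capped by the weakest dimension, not just the total.
    for tier, floor in TIER_FLOORS:
        if all(v >= floor for v in raw.values()):
            return total, tier  # highest tier whose floor every dimension clears
    return total, "below Certified"
```

The lopsided case from the text falls out directly: an agent scoring ten everywhere except a three on Trust and Safety still totals 87.4 points, but the single weak dimension drops it below Certified.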

What each dimension maps to

Every dimension traces back to at least one primary reference instrument. Trust and Safety and Context Integrity draw heavily from the NIST AI Risk Management Framework, particularly its Measure and Manage functions. Governance and Product Maturity map to ISO 42001, particularly clauses 6, 8 and 9. The Autonomy Envelope maps to Articles 14 and 26 of the EU AI Act. Product Maturity also maps to Article 15 on accuracy and robustness. Insurer reliance on the framework is anchored to EIOPA supervisory statements on AI in insurance from 2024.

This mapping matters for a practical reason. Operators already aligned to one of these reference instruments do not start from zero. They start from wherever their existing evidence reaches, and the framework tells them where the gaps are.

What the framework asks an operator to produce

Across all seven dimensions the framework asks for the same class of evidence: written artefacts, technical telemetry, interview confirmations, and a documented chain of accountability. The methodology names specific evidence types for each dimension, but the pattern is consistent. A policy on paper does not produce a high score. A policy in use, referenced in board minutes, tested against real incident data, and tied to a named owner does.

In the authors' experience, the first formal assessment that an operator undertakes is more valuable than the score itself. The process of compiling the evidence reveals gaps the operator did not know were there. The score that follows is the public signal. The internal clarity is the underrated benefit.

Where to go next

Readers who want to score themselves against the framework should start with the full methodology and the certification levels. Operators who want to request a formal assessment can do so through the assessment request page. Readers comparing this framework to NIST AI RMF, ISO 42001 and the EU AI Act should read the companion article on what operators actually need from each instrument.