Trust & Safety
Definition
The measurable prevention of unsafe, unauthorised or harmful actions by the agent in production, and the discipline with which unsafe outputs are detected, contained and remediated.
Assessed Criteria
- Documented guardrail layer with coverage of prompt injection, jailbreak, data exfiltration and unsafe tool use
- Periodic red-team exercises, with findings tracked through to remediation
- Abuse and misuse detection telemetry available to the operating team in real time
- Incident playbooks for containment, notification and rollback
- Verifiable kill switch at the agent and at the tool access layer
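
As an illustration of the last criterion, the following is a minimal sketch of a kill switch enforced at both the agent and the tool access layer. It assumes every tool call is routed through a single gateway. The names KillSwitch, ToolGateway and AgentHalted are hypothetical and chosen only for this example; a production system would typically back the flags with shared, audited state rather than in-process memory.

```python
import threading
from typing import Any, Callable, Dict


class AgentHalted(RuntimeError):
    """Raised when a kill switch blocks a tool call."""


class KillSwitch:
    """Thread-safe flag; once tripped, it blocks actions until reset."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._tripped = threading.Event()

    def trip(self) -> None:
        self._tripped.set()

    def reset(self) -> None:
        self._tripped.clear()

    @property
    def is_tripped(self) -> bool:
        return self._tripped.is_set()


class ToolGateway:
    """Hypothetical single entry point for all tool calls made by an agent."""

    def __init__(self, agent_switch: KillSwitch,
                 tools: Dict[str, Callable[..., Any]]) -> None:
        self.agent_switch = agent_switch
        self._tools = dict(tools)
        # One switch per tool, so a single tool can be disabled in isolation.
        self.tool_switches = {
            name: KillSwitch(f"tool:{name}") for name in self._tools
        }

    def call(self, tool_name: str, *args: Any, **kwargs: Any) -> Any:
        # Agent-level switch: halts every tool call.
        if self.agent_switch.is_tripped:
            raise AgentHalted(f"agent halted by {self.agent_switch.name}")
        # Tool-level switch: halts only the named tool.
        switch = self.tool_switches[tool_name]
        if switch.is_tripped:
            raise AgentHalted(f"tool disabled by {switch.name}")
        return self._tools[tool_name](*args, **kwargs)
```

Checking both flags on every call is what makes the switch verifiable: a test can trip either flag and confirm that the next call is refused, without depending on the agent's own cooperation.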