AI SAFE² v3.0: From Detection to Adversarial Depth
How v3.0 Adds 33 New Controls, Ships Production Enforcement Infrastructure, and Establishes the Agentic Control Plane as a Board-Level Governance Concept
Published by Cyber Strategy Institute | AI Governance Research | 2026
Framework: github.com/CyberStrategyInstitute/ai-safe2-framework
This Release: v3.0 Release Tag
All Releases: github.com/…/releases
Interactive Dashboard: cyberstrategyinstitute.github.io/ai-safe2-framework/dashboard/
TL;DR — What Changed in v3.0
23 new pillar-level controls | 10 new cross-pillar governance controls (CP.1–CP.10) | 1 duplicate removed | 3 existing controls expanded | Production-grade Control Gateway with multi-provider enforcement and HMAC-chained audit logging | First framework to integrate OWASP AIVSS v0.8 amplification scoring into a GRC risk formula | Agentic Control Plane formalized as board-level governance evidence | 32 compliance frameworks mapped | New threat category: T11 Multi-Turn Behavioral Conditioning
Why v3.0, and Why Now
AI SAFE² v2.1 was published as the industry’s most comprehensive agentic AI governance framework, introducing five critical gap fillers: Swarm and Distributed Agentic Controls, Context and Fingerprinting, Supply Chain Risk and Model Signing, Non-Human Identity, and Universal GRC Tagging and Memory Security. Organizations that implemented v2.1 had coverage no other framework could match across ISO 42001, NIST AI RMF, MITRE ATLAS, MIT AI Risk Repository, and OWASP Top 10 LLM simultaneously.
But 2025 and early 2026 changed the threat landscape faster than any prior period. Four things happened that made v3.0 necessary.
The adversarial attack taxonomy matured.
OWASP AIVSS v0.8, the Arcanum PI Taxonomy, and MITRE ATLAS’s October 2025 agent-specific technique release gave defenders a rigorous, production-grounded taxonomy of how AI systems actually get attacked — not theoretical, but validated. v2.1 had excellent framework coverage but lacked controls grounded in this new adversarial precision layer.
Agentic deployments hit production at scale.
AWS Bedrock, Azure AI Foundry, and n8n-based workflow agents moved from pilot to production across the enterprise. With production came production incidents: Bedrock Guardrail poisoning via UpdateGuardrail API, Azure indirect prompt injection in tool outputs, n8n sandbox escape vulnerabilities (CVE-2026-25049 and predecessors) exposing AI API credentials. v2.1 had agentic architecture coverage; it did not have platform-specific attack surface coverage.
The control plane problem became a board-level problem.
As agents began orchestrating other agents, making financial decisions, and operating with persistent memory across sessions, the question shifted from “do we have AI security controls?” to “who governs the AI control plane, and how do we evidence that to a board or regulator?” CSA’s Agentic Control Plane framework and NEXUS-A2A’s production implementation gave us the vocabulary and evidence base to formalize this.
The enforcement gap became indefensible.
Every governance framework — including AI SAFE² v2.1 — defined controls that lived outside the execution boundary. Documentation, system prompts, and periodic audits do not fire at the moment an LLM call is made. The gap between where controls were placed and where attackers operate was the fundamental problem. v3.0 closes it with a production-grade Control Gateway that enforces at the execution boundary.
AI SAFE² v3.0 answers four questions v2.1 left open: Where exactly are the attack surfaces? What is the agentic risk score? Who governs the control plane? Where does enforcement actually happen?
The result is a framework that advances along four axes:
- Adversarial precision: controls grounded in red-team engineering, not framework mapping alone
- Agentic scoring integration: first framework to operationalize OWASP AIVSS v0.8 amplification factors in a GRC risk formula
- Control plane governance: ACT tiers and the Agentic Control Plane as systemic governance architecture, evidenceable to boards and regulators
- Execution-boundary enforcement: production-grade gateway that gates every LLM request before it executes
Governance Framework Structure: What Stayed, What Changed
The five-pillar S-A-F-E-E architecture is unchanged. All 99+ subtopics from v2.0 are maintained. All 35+ v2.1 sub-domains are maintained. v3.0 is additive — with one housekeeping correction — not a rewrite. Organizations that have already implemented v2.1 are upgrading, not starting over.
Here is how the total control count grew:
Layer | v2.0 Count | v2.1 Count | v3.0 Count |
Core subtopics (10 topics) | 99+ | 99+ | 99+ (unchanged) |
Gap-filler sub-domains | 0 | 35+ | 35+ (unchanged) |
New pillar-level controls | 0 | 0 | 23 new |
Cross-pillar Governance OS | 0 | Proposed | 10 formalized (CP.1–CP.10) |
Modifications to existing controls | 0 | 0 | 3 expanded + risk scoring |
Removed (housekeeping) | 0 | 0 | 1 exact duplicate removed |
Total: 161 controls (151 pillar + 10 CP Governance OS)
The One Removal: Fixing a Duplicate
v3.0 removes one control: P1.T1.8, Format Normalization and Encoding Validation. It was an exact duplicate of P1.T1.6. Both entries listed the same bullets: standardize input formats, validate character encodings, prevent encoding-based attacks, normalize text and binary inputs. Practitioners implementing both were duplicating work and getting confused about which control to report against.
The one additional bullet from P1.T1.8 — NFKC normalization and invisible character injection prevention language — has been merged into P1.T1.6 and expanded significantly in the new P1.T1.2 evasion technique bullets. P1.T1.9 (Supply Chain Artifact Validation) is renumbered to P1.T1.8.
This is the only removal in v3.0. No substantive security control has been eliminated. The framework only grew more complete.
Three Modifications to Existing Controls for Agent Governance
1. P1.T1.2 – Malicious Prompt Filtering: Add the Arcanum Evasion Layer
P1.T1.2 already covered prompt injection detection, semantic manipulation analysis, and jailbreak monitoring. What it did not cover were the specific evasion techniques that adversaries use to bypass those detections. The Arcanum PI Taxonomy’s Attack Evasions category documents a validated set of techniques in production use. v3.0 adds four new bullets:
- Homoglyph substitution detection (Unicode Technical Standard #39, Unicode Security Mechanisms)
- Invisible character injection detection (zero-width joiners, zero-width non-joiners)
- Multi-turn conditioning pattern detection (tracking semantic intent drift across conversation turns)
- Encoded payload detection (Base64, ROT13, hex, and other encoding schemes)
Why this matters: without these specific detection capabilities, a detection pipeline that scores well on standard benchmarks can be trivially bypassed in production by an attacker who knows the Arcanum taxonomy. This is the gap between framework coverage and adversarial depth.
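Three of the four evasion checks above can be sketched with the Python standard library. This is a minimal illustration, not the framework's detection logic: the function names, the invisible-character set, and the 16-character Base64 threshold are all assumptions chosen for the example. NFKC normalization only folds compatibility forms such as fullwidth letters; detecting cross-script homoglyphs (for example Cyrillic letters posing as Latin) would additionally need the confusables data from UTS #39, which the standard library does not ship.

```python
import base64
import binascii
import re
import unicodedata

# Zero-width and similar invisible code points (illustrative, not exhaustive).
INVISIBLE_CHARS = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

# Runs of 16+ Base64-alphabet characters, optionally padded (threshold is an assumption).
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def find_invisible_chars(text: str) -> list[int]:
    """Return the positions of invisible characters in the text."""
    return [i for i, ch in enumerate(text) if ch in INVISIBLE_CHARS]


def changed_by_nfkc(text: str) -> bool:
    """True if NFKC normalization alters the text (compatibility forms present)."""
    return unicodedata.normalize("NFKC", text) != text


def decode_base64_runs(text: str) -> list[str]:
    """Decode Base64-looking runs that yield printable ASCII, for re-scanning."""
    decoded = []
    for match in BASE64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            candidate = raw.decode("ascii")
        except (binascii.Error, ValueError):
            # Not valid Base64, or not ASCII once decoded: ignore this run.
            continue
        if candidate.isprintable():
            decoded.append(candidate)
    return decoded
```

Decoded payloads would then be fed back through the same prompt filters as the original input, so an encoded instruction is judged on its decoded content.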
2. P2.T3.10 – Vulnerability Scanning: Add Cloud AI Platform Surfaces
v2.1’s vulnerability scanning subtopic covered general AI infrastructure. What it missed were the platform-specific attack surfaces that production deployments actually expose. v3.0 adds three new bullets:
- AWS Bedrock: audit UpdateGuardrail API call history via CloudTrail; scan Knowledge Base data source configurations via UpdateDataSource API event monitoring; review Lambda role permissions attached to AI workloads. The Bedrock Guardrail poisoning via UpdateGuardrail is a confirmed active attack path with proof of concept.
- MCP server scanning: scan for MCP servers bound to all network interfaces (NeighborJack-style vulnerabilities exposing all stored credentials)
- A2A protocol review: validate agent card configurations in Google A2A and equivalent protocols for unauthorized tool definition injection
3. Risk Scoring Formula: The AIVSS Agent Amplification Factor
The existing risk formula works well for static AI systems:
Combined Risk Score = CVSS Base Score + (100 – Pillar Score) / 10
For agentic deployments, however, CVSS alone does not capture the amplification effects of execution autonomy, persistent state, multi-agent interaction, and self-modification. OWASP AIVSS v0.8 defines 10 such amplification factors as a validated scoring standard.
v3.0 extends the formula for ACT-2 and above agentic deployments:
Combined Risk Score = CVSS Base Score + ((100 – Pillar Score) / 10) + (AAF / 10)
Where AAF is the Agent Amplification Factor (0–10), calculated by scoring the 10 OWASP AIVSS amplification factors for each deployed agent: Execution Autonomy, External Tool Control Surface, Natural Language Interface, Contextual Awareness, Behavioral Non-Determinism, Opacity and Reflexivity, Persistent State Retention, Dynamic Identity, Multi-Agent Interactions, and Self-Modification.
Each factor is scored: 0 (architecturally prevented), 5 (governed), 10 (uncontrolled). AIVSS severity bands: Critical (9.0–10.0), High (7.0–8.9), Medium (4.0–6.9), Low (0–3.9). Publish AAF alongside Pillar Score in the AI SAFE² governance dashboard.
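The extended formula can be worked through in a few lines of Python. The text above does not say how the ten factor scores are combined into a single AAF in the 0–10 range, so this sketch assumes a simple arithmetic mean; that aggregation, the snake_case factor keys, and the function names are assumptions for illustration only.

```python
from statistics import mean
from typing import Optional

# The ten OWASP AIVSS amplification factors named above (keys are illustrative).
AIVSS_FACTORS = (
    "execution_autonomy", "external_tool_control_surface",
    "natural_language_interface", "contextual_awareness",
    "behavioral_non_determinism", "opacity_and_reflexivity",
    "persistent_state_retention", "dynamic_identity",
    "multi_agent_interactions", "self_modification",
)
# 0 = architecturally prevented, 5 = governed, 10 = uncontrolled.
ALLOWED_SCORES = {0, 5, 10}


def agent_amplification_factor(scores: dict[str, int]) -> float:
    """Aggregate per-factor scores into an AAF of 0-10 (mean is an assumption)."""
    missing = set(AIVSS_FACTORS) - scores.keys()
    if missing:
        raise ValueError(f"missing factor scores: {sorted(missing)}")
    invalid = {k: v for k, v in scores.items() if v not in ALLOWED_SCORES}
    if invalid:
        raise ValueError(f"scores must be 0, 5, or 10: {invalid}")
    return float(mean(scores[name] for name in AIVSS_FACTORS))


def combined_risk_score(cvss_base: float, pillar_score: float,
                        aaf: Optional[float] = None) -> float:
    """CVSS + (100 - Pillar Score) / 10, plus AAF / 10 for ACT-2+ deployments."""
    score = cvss_base + (100 - pillar_score) / 10
    if aaf is not None:
        score += aaf / 10
    return score
```

For example, an agent with every factor scored 5 (governed) has an AAF of 5.0 under this assumption; with a CVSS base of 7.5 and a Pillar Score of 80, the combined score is 7.5 + 2.0 + 0.5 = 10.0.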
AI SAFE² v3.0 is the first governance framework to incorporate OWASP AIVSS v0.8 amplification factors into a GRC risk scoring formula. No other framework has done this.
23 New Pillar-Level Controls: Pillar by Pillar
Pillar 1: Sanitize and Isolate — Six New Controls
Six new controls address the gap between syntactic input validation (v2.1) and the full adversarial attack surface of a deployed agentic system.
P1.T1.10 – Indirect Injection Surface Coverage (CRITICAL)
The highest-severity single addition in v3.0. OWASP AIVSS scores Tool Misuse (which encompasses indirect injection) at 9.9/10 — the highest score in the taxonomy. v2.1 covered direct prompt injection extensively. It did not systematically enumerate and govern the indirect injection surfaces that agentic systems process: RAG documents, tool output metadata, email and calendar content ingested by agents, API responses, MCP tool descriptions.
v3.0 requires organizations to: enumerate all indirect injection surfaces, implement output sanitization at tool-response boundaries before LLM parsing, deploy LLM-aware DLP that inspects tool outputs (not only prompts), and maintain a surface registry documenting each surface, consuming agents, and sanitization control.
Why it matters in production: Azure AI Foundry red-team testing confirmed indirect prompt injection in tool outputs as the primary active test vector. The gap between “we test for prompt injection” and “we have tested all indirect injection surfaces” is precisely where production incidents happen.
S1.3 – Semantic Isolation Boundary Enforcement (CRITICAL)
v2.1’s multi-agent boundary enforcement (P1.T2.1) operated at the network and protocol layer. S1.3 adds the semantic layer: embedding-space partitioning per agent session, cross-agent context tagging to detect unauthorized state transfer, per-session vector store isolation in multi-tenant deployments. Addresses production cross-agent state contamination observed in Bedrock and Azure agentic testing.
S1.4 – Adversarial Input Fuzzing Pipeline (HIGH)
Integrates adversarial testing into CI/CD pipelines as a deployment gate, not a post-deployment exercise. Minimum coverage: 80% of Arcanum attack categories tested before ACT-2+ deployment. Aligned with Azure AI Foundry Red Teaming Agent’s shift-left approach.
S1.5 – Memory Governance Boundary Controls (CRITICAL)
v2.1 addressed memory poisoning detection. S1.5 adds proactive write governance: memory write policies classifying contents as ephemeral/session/persistent, memory content inspection before persistence, cross-session isolation, retention expiration, and explicit approval workflows for content escalated to persistent memory tier. Grounded in AIID-documented GitHub Copilot memory-based data leakage incident.
S1.6 – Cognitive Injection Sanitization Layer (HIGH)
Addresses semantic and cognitive-level bypass techniques that evade syntactic detection: few-shot conditioning detection, role-play bypass detection (persona assignment, authority impersonation), and a cognitive evasion pattern registry updated from AIID telemetry. S1.6 is the primary detection control for T11 Multi-Turn Behavioral Conditioning (see Threat Matrix section below).
S1.7 – No-Code and Low-Code Agent Platform Security Controls (CRITICAL)
The most distinctive control in v3.0. No other major AI governance framework addresses no-code platform security. This gap is actively dangerous: CVE-2026-25049 (n8n, critical severity, February 2026) demonstrates that no-code workflow platforms used for AI orchestration have severe sandbox escape vulnerabilities that expose all stored AI API credentials, cloud keys, and OAuth tokens.
S1.7 requires credential isolation per AI workload, hypervisor-level expression sandboxing (not application-layer), network segmentation for AI API connections, audit logging of expression evaluation events, and API key rotation within 48 hours of any platform CVE disclosure.
S1.7 is a first-mover control. CVE-2026-25049 is active now, and no competing framework has an equivalent. Organizations running n8n, Zapier, or Power Automate for AI workflow orchestration are exposed without it.
Pillar 2: Audit and Inventory — Four New Controls
A2.3 – Model Lineage Provenance Ledger (HIGH)
Implements a directed acyclic graph (DAG) lineage model tracking base model to fine-tuned variants to deployment versions with cryptographic signing at each edge. Addresses OWASP AIVSS Risk #8 (Agent Supply Chain, 9.7/10) and MITRE ATLAS October 2025’s new Modify AI Agent Configuration persistence technique. Extends lineage to configuration states via OpenSSF OMS integration.
A2.4 – Dynamic Agent State Inventory (HIGH)
Maintains a real-time registry of all active agent sessions — memory tier, tool connections, authorization scope — plus the key new fields: owner_of_record (human, team, or business unit responsible), hear_agent_of_record (named HEAR for ACT-3/4), and control_plane_id (link to governing Agentic Control Plane). Any agent without an owner_of_record is a governance failure requiring resolution before production promotion.
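A registry entry and its promotion check might look like the sketch below. Only owner_of_record, hear_agent_of_record, and control_plane_id are field names taken from the text; the remaining field names, the class name, and the promotion_blockers helper are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentSessionRecord:
    """One row of the dynamic agent state inventory (shape is illustrative)."""
    agent_id: str
    act_tier: int                      # 1-4, per the ACT tier system
    memory_tier: str                   # e.g. "ephemeral", "session", "persistent"
    tool_connections: list[str] = field(default_factory=list)
    authorization_scope: list[str] = field(default_factory=list)
    owner_of_record: Optional[str] = None
    hear_agent_of_record: Optional[str] = None
    control_plane_id: Optional[str] = None


def promotion_blockers(record: AgentSessionRecord) -> list[str]:
    """List the governance failures that should block production promotion."""
    blockers = []
    if not record.owner_of_record:
        blockers.append("owner_of_record is missing")
    if record.act_tier >= 3 and not record.hear_agent_of_record:
        blockers.append("hear_agent_of_record is required for ACT-3/4")
    return blockers
```

An empty list means the record passes these two checks; a real registry would likely validate control_plane_id and scope fields as well.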
A2.5 – Semantic Execution Trace Logging (CRITICAL)
Full semantic execution traces: reasoning chain, retrieved context chunks with chunk IDs and source URLs, tool selection rationale, decision confidence scores. Chain-of-thought audit trails for ACT-3 and ACT-4 agents. Addresses OWASP AIVSS Risk #9 (Agent Untraceability, 8.3/10). The Control Gateway implements A2.5 at the infrastructure layer via HMAC-SHA256 chained audit logs written to an append-only store the agent cannot modify.
A2.6 – RAG Corpus Diff Tracking (HIGH)
Detects both benign and adversarial RAG corpus modifications via content-diff auditing on updates, chunk-level provenance tracking, corpus changelogs with approval workflows, and retrieval pattern change alerting. Based on a confirmed production failure mode where a SaaS security questionnaire system began returning wrong answers after a document reorganization shifted retrieval rankings.
Pillar 3: Fail-Safe and Recovery — Four New Controls
F3.2 – Agent Recursion Limit Governor (CRITICAL)
Hard recursion depth limits by ACT tier: ACT-1 max 5 calls, ACT-2 max 20, ACT-3 requires human approval beyond threshold. The governor adds call-chain logging, graduated backpressure before hard termination, and partial task state preservation on abort. These limits are enforced at the API gateway layer, not in the system prompt, so they cannot be bypassed via prompt injection. Grounded in an AIID incident in which an agentic financial trading system misinterpreted operational prompts and could not be stopped.
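A gateway-side depth check along these lines could be sketched as follows. The ACT-1 and ACT-2 limits come from the text; the ACT-3 approval threshold and the point at which backpressure starts are not specified there, so ACT3_APPROVAL_THRESHOLD and BACKPRESSURE_RATIO are placeholder values, and the Decision names are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    THROTTLE = "throttle"                      # graduated backpressure
    TERMINATE = "terminate"                    # hard stop
    NEEDS_HUMAN_APPROVAL = "needs_human_approval"


HARD_LIMITS = {1: 5, 2: 20}      # ACT-1 and ACT-2 limits from the text
ACT3_APPROVAL_THRESHOLD = 50     # assumption: the text gives no number
BACKPRESSURE_RATIO = 0.8         # assumption: throttle at 80% of the limit


def check_call_depth(act_tier: int, depth: int,
                     human_approved: bool = False) -> Decision:
    """Decide whether a call at the given chain depth may proceed."""
    if act_tier in HARD_LIMITS:
        limit = HARD_LIMITS[act_tier]
        if depth > limit:
            return Decision.TERMINATE
        if depth >= limit * BACKPRESSURE_RATIO:
            return Decision.THROTTLE
        return Decision.ALLOW
    if act_tier == 3:
        if depth > ACT3_APPROVAL_THRESHOLD and not human_approved:
            return Decision.NEEDS_HUMAN_APPROVAL
        return Decision.ALLOW
    raise ValueError(f"no recursion policy defined for ACT-{act_tier}")
```

Because the check takes only the tier and the observed depth, it can run in the gateway without trusting anything the agent says about itself.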
F3.3 – Swarm Quorum Abort Mechanism (HIGH)
Decentralized quorum-based abort: when 25%+ of agents in a swarm report anomalous conditions, trigger coordinated shutdown without centralized intervention. Sybil-resistant via cryptographic attestation-weighted voting. Addresses MAESTRO’s Sybil attack scenario where false agent identities gain disproportionate consensus influence.
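The quorum rule can be illustrated with a small weighted vote. This sketch assumes the 25% threshold is measured against total attested weight and that agents without a verified attestation carry no weight at all; both are interpretations, and should_abort is a hypothetical name. Verifying the attestations themselves is out of scope here.

```python
QUORUM_FRACTION = 0.25  # "25%+" threshold from the control text


def should_abort(attested_weights: dict[str, float],
                 anomaly_reports: set[str]) -> bool:
    """Return True when attested agents reporting anomalies reach quorum.

    attested_weights maps agent IDs with verified attestations to their
    voting weight. Reports from IDs not in this map are ignored, which is
    what keeps fabricated identities from swaying the vote.
    """
    total = sum(attested_weights.values())
    if total <= 0:
        return False  # no attested voters, so no quorum can form
    reporting = sum(weight for agent, weight in attested_weights.items()
                    if agent in anomaly_reports)
    return reporting / total >= QUORUM_FRACTION
```

With eight equally weighted attested agents, two anomaly reports reach the threshold, while one genuine report plus any number of unattested identities does not.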
F3.4 – Behavioral Drift Baseline and Rollback (HIGH)
Behavioral baselines per agent (response distributions, tool selection patterns, retrieval source distributions, confidence metrics), statistical drift detection, behavioral snapshots for rollback, automated rollback to last known-good state on drift-triggered abort. Addresses both retrieval-loop drift (benign production failure mode) and adversarial memory-based behavioral modification (MITRE ATLAS Memory technique). F3.4 is also the primary rollback control for T11 Multi-Turn Behavioral Conditioning campaigns.
F3.5 – Multi-Agent Cascade Containment (CRITICAL)
Extends blast radius containment beyond the agent network to SaaS platforms and automation tools. When an agent is quarantined, its pre-authorized integrations are also suspended or revoked. Maintains a blast-radius dependency graph and propagates circuit breakers across A2A and SaaS integrations. Based on an AIID supply chain poisoning incident in which the compromise of a code review agent cascaded into the distribution of malware-infected updates.
Pillar 4: Engage and Monitor — Five New Controls
The monitoring pillar receives five new controls, second only to Pillar 1, reflecting that adversarial AI systems require adversarial-specific detection pipelines, not just statistical anomaly detection.
M4.4 – Adversarial Behavior Detection Pipeline (CRITICAL)
Deploys adversarial ML classifiers trained on known attack patterns, distinct from statistical anomaly detection. Tracks Attack Success Rate (ASR) as a continuous monitoring KPI. Integrates with threat feeds from Zenity Labs, Adversa AI, and MITRE ATLAS updates. Separates adversarial detection pipelines from performance monitoring to prevent poisoning the detection layer itself.
M4.5 – Tool-Misuse Detection Controls (CRITICAL)
Tool invocation behavioral profiling, anomaly alerting for unexpected tools or invocation sequences, output payload inspection for injection artifacts before returning content to the LLM reasoning loop, and tool squatting detection. Addresses OWASP AIVSS Risk #1 (Tool Misuse, 9.9/10 — the highest-scored risk in the AIVSS taxonomy) and MITRE ATLAS October 2025’s new Exfiltration via AI Agent Tool Invocation technique.
M4.6 – Emergent Behavior Anomaly Detection (HIGH)
Monitors for capability emergence (outputs exceeding documented capability boundaries), cross-agent emergent collaboration, self-modification attempts, and information-theoretic output novelty scoring. The critical addition: systematic decision bias — consistent skew in tool selection, escalation patterns, or user outcomes — MUST be treated as a security-relevant signal, not only an ethics concern.
M4.7 – Jailbreak and Injection Telemetry Layer (HIGH)
Structured attack attempt logging with technique type, evasion class, and intent classification metadata. Attack telemetry dashboard visualizing attack volume, technique distribution, and trend lines. Feeds attack data into P5 evolution controls and ISAC threat intelligence sharing.
M4.8 – Cloud AI Platform-Specific Monitoring (CRITICAL)
Platform-specific telemetry for AWS Bedrock (UpdateGuardrail API calls, UpdateDataSource invocations, cross-account role assumption events in CloudTrail), Azure AI Foundry (Foundry Control Plane governance telemetry), and n8n/automation platforms (base URL changes, expression evaluation events, outbound destination changes). No other framework has platform-specific AI monitoring controls at this level of specificity.
Pillar 5: Evolve and Educate — Four New Controls
E5.1 – Continuous Adversarial Evaluation Cadence (CRITICAL)
Shifts adversarial evaluation from periodic exercise to continuous practice: automated adversarial scanning at every CI/CD deployment gate, minimum weekly for production agents, covering all 10 OWASP Agentic AI Core Risks per cycle. ASR as a formal deployment gate KPI with maximum acceptable thresholds per agent tier. Aligned with Azure AI Foundry Red Teaming Agent’s shift-left model.
E5.2 – Capability Emergence Review Process (HIGH)
Governance review process for emergent agent capabilities that arise without deliberate upgrade events: a cross-functional capability emergence review board, defined emergence thresholds, four-tier classification (document / security review / board approval / suspend), and alignment with CSA Agentic Control Plane identity-first design.
E5.3 – Evaluation-Safe Pattern Library (MEDIUM)
Library of validated, security-reviewed implementation patterns for all AI SAFE² controls, with platform-specific reference implementations for Bedrock, Azure AI Foundry, LangGraph, AutoGen, and n8n. Released as open-source reference implementations via the AI SAFE² GitHub repository.
E5.4 – Red-Team Artifact Repository (HIGH)
Institutional red-team knowledge repository with defined artifact schema from the Arcanum PI Taxonomy: attack type, affected agent/model, AIVSS score, detection signature, remediation control. Required artifact creation as a deliverable from every red-team exercise. Repository integrated into continuous adversarial evaluation cadence (E5.1).
The Ten Cross-Pillar Controls: CP.1 through CP.10
The most architecturally significant additions in v3.0 are the ten cross-pillar controls. These are not pillar-specific subtopics — they are framework-wide governance concepts that span all five pillars and define how the framework operates as a system.
CP.1 and CP.2: New Tags for Adversarial Incident Characterization
CP.1 defines the Agent Failure Mode Taxonomy (AFMT) classifying failures by origin (adversarial/benign/emergent), scope (single-agent/multi-agent/cascade), and impact (CIA + a new fourth dimension: autonomy, capturing failures that cause unintended autonomous action).
CP.1 also introduces a mandatory tagging requirement: every agentic incident must be tagged with cognitive_surface=(model|memory|both) and memory_persistence=(session|cross_session). These tags separate ordinary prompt failures from belief and memory drift — a distinction that changes incident response, root cause analysis, and long-term control evolution.
CP.2 elevates the Adversarial ML Threat Model (AMLTM) from a mapping tag to a mandatory governance artifact for all ACT-2+ deployments. CP.2 also introduces temporal_profile=(immediate|delayed_days|delayed_weeks|chronic), enabling temporal risk aggregation and distinguishing burst exploits from long-horizon conditioning campaigns — the precise mechanism that makes T11 Multi-Turn Behavioral Conditioning detectable at the governance layer.
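The mandatory tags lend themselves to a closed vocabulary enforced at incident intake. The tag names and allowed values below are taken from the text; the class names, the parse helper, and the rule that anything other than "immediate" counts as time-shifted are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class CognitiveSurface(Enum):
    MODEL = "model"
    MEMORY = "memory"
    BOTH = "both"


class MemoryPersistence(Enum):
    SESSION = "session"
    CROSS_SESSION = "cross_session"


class TemporalProfile(Enum):
    IMMEDIATE = "immediate"
    DELAYED_DAYS = "delayed_days"
    DELAYED_WEEKS = "delayed_weeks"
    CHRONIC = "chronic"


@dataclass(frozen=True)
class IncidentTags:
    cognitive_surface: CognitiveSurface
    memory_persistence: MemoryPersistence
    temporal_profile: TemporalProfile

    @classmethod
    def parse(cls, raw: dict[str, str]) -> "IncidentTags":
        """Build tags from raw strings; unknown values raise ValueError."""
        return cls(
            CognitiveSurface(raw["cognitive_surface"]),
            MemoryPersistence(raw["memory_persistence"]),
            TemporalProfile(raw["temporal_profile"]),
        )

    def is_time_shifted(self) -> bool:
        """Assumption: any non-immediate profile is a time-shifted incident."""
        return self.temporal_profile is not TemporalProfile.IMMEDIATE
```

Rejecting unknown values at parse time keeps later aggregation, such as counting chronic incidents per quarter, from silently dropping mistyped tags.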
CP.3: Formalizing the ACT Capability Tiers
CP.3 formalizes the Agent Capability Tier (ACT) system as four fully specified tiers with defined mandatory controls at each level:
Tier | Name | Definition | State and Tools | Required Controls |
ACT-1 | Assisted | Human reviews all outputs before action | Read-only; no persistent state | Standard P1-P5 |
ACT-2 | Supervised | Agent acts with human checkpoints for critical actions | Limited tools; session state | AAF scoring; AMLTM |
ACT-3 | Autonomous | Agent operates; post-hoc review | Broad tools; persistent state; multi-agent | F3.2; M4.4; CP.2; owner_of_record; HEAR required |
ACT-4 | Orchestrator | Agent controls other agents; cross-system authority | Full enterprise-scale tools | All ACT-3 + CP.4 + CP.8 + CP.9 ARG + CP.10 HEAR |
CP.4: The Agentic Control Plane as a Board-Level Governance Concept
CP.4 is arguably the most strategically significant addition in v3.0. It defines the Agentic Control Plane — the set of cross-cutting controls governing agent identity, dynamic permission enforcement, orchestration boundaries, and runtime behavioral trust — as an explicit AI SAFE² governance concept.
The practical implication: boards and regulators should treat the combination of Non-Human Identities and agent orchestration as the primary control plane for autonomous AI. AI SAFE² ACT tiers and CP.4 controls are the canonical way to evidence governance of that control plane, including ownership, delegation, and runtime authorization for every agent and orchestration flow.
CP.4 also establishes that protocol-level meshes (A2A, MCP, ACP, and equivalent protocols) should be treated as Agentic Control Planes and evaluated against CP.3 through CP.7 — not as isolated tools or plugins. This has significant implications for vendor assessment and supply chain governance.
CP.5 through CP.8: Platform Profiles, Incident Loops, Deception, and Catastrophic Risk
CP.5 mandates Platform Security Profiles as companion documents for each AI platform (Bedrock, Azure AI Foundry, n8n, LangGraph, AutoGen, CrewAI) mapping AI SAFE² controls to platform-specific CVEs, implementation guidance, and monitoring telemetry. Profiles are version-pinned and updated on CVE cadence.
CP.6 mandates quarterly AIID agentic incident reviews and a 30-day Incident-Informed Control Review (IICR) process triggered by any new relevant AIID incident.
CP.7 adds Deception and Active Defense — AI-specific canary tokens in RAG corpora, honeypot tool endpoints, and adversarial credential traps in agent memory. AIDEFEND explicitly lists Deceive as a defensive tactic. No other major AI governance framework currently includes active deception controls. Canary documents seeded throughout RAG corpora detect retrieval by unauthorized agents and catch indirect injection attempts before they succeed.
CP.8 defines Catastrophic Risk Thresholds (CRTs): specific behavioral indicators that trigger emergency suspension regardless of business continuity impact. Example catastrophic paths include agentic ransomware leveraging NHI and orchestration for full kill-chain execution with legitimate credentials, protocol-layer supply chain compromise of widely deployed A2A/MCP servers, and persistent cognitive or bias failures impacting safety-critical decisions. Required as a condition of ACT-3 or ACT-4 deployment approval.
CP.9: Agent Replication Governance — the governance gap no other framework addresses ★ First in Field
Agent replication is the first identity-multiplying threat in enterprise AI. When an agent can clone itself, four security assumptions simultaneously collapse: one identity, one permission set, one execution context, one audit trail. Replication grows at machine speed. One agent can become a thousand identities before any detection system fires.
NIST, ISO, OWASP, and enterprise IAM have zero standards for replication authorization, permission inheritance, lineage tracking, or distributed swarm kill switches as of this writing. AI SAFE² v3.0 is the only governance framework to define these requirements.
CP.9 requires: explicit replication authority in deployment manifests enforced at the gateway layer; ephemeral credentials with scope narrowing at every delegation hop; cryptographic lineage tokens on every spawned agent; and a kill switch architecture that severs the full delegation tree at the gateway layer, revoking all descendant credentials within 500 milliseconds. ACT-3 agents are limited to a maximum of 2 delegation hops, and ACT-4 agents to a maximum of 3. Stopping one agent in a swarm means nothing if replicas are already active. The kill switch must eliminate the hive’s atmosphere, not shoot individual bees.
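The hop limits, scope narrowing, and tree-wide kill switch can be modeled with a small in-memory delegation tree. This is a sketch under stated assumptions: AgentNode, spawn, and kill_tree are hypothetical names, "scope narrowing" is interpreted as "the child's scopes must be a subset of the parent's", and neither the cryptographic lineage tokens nor the 500-millisecond revocation deadline is modeled.

```python
from dataclasses import dataclass, field

MAX_DELEGATION_HOPS = {3: 2, 4: 3}  # ACT-3 and ACT-4 limits from the text


class ReplicationError(Exception):
    """Raised when a spawn request violates replication policy."""


@dataclass
class AgentNode:
    agent_id: str
    scopes: frozenset[str]
    hops: int = 0                    # distance from the root agent
    revoked: bool = False
    children: list["AgentNode"] = field(default_factory=list)


def spawn(parent: AgentNode, child_id: str, scopes: set[str],
          act_tier: int) -> AgentNode:
    """Create a child agent if hop and scope policy allow it."""
    if parent.revoked:
        raise ReplicationError("parent credentials are revoked")
    if act_tier not in MAX_DELEGATION_HOPS:
        raise ReplicationError(f"ACT-{act_tier} has no replication authority")
    if parent.hops + 1 > MAX_DELEGATION_HOPS[act_tier]:
        raise ReplicationError("delegation hop limit exceeded")
    if not scopes <= parent.scopes:
        raise ReplicationError("child scopes must not exceed parent scopes")
    child = AgentNode(child_id, frozenset(scopes), parent.hops + 1)
    parent.children.append(child)
    return child


def kill_tree(root: AgentNode) -> list[str]:
    """Revoke the root and every descendant; return the revoked agent IDs."""
    revoked, stack = [], [root]
    while stack:
        node = stack.pop()
        node.revoked = True
        revoked.append(node.agent_id)
        stack.extend(node.children)
    return revoked
```

Keeping the tree at the gateway, rather than inside any agent, is what lets a single call to kill_tree reach replicas the original agent may no longer know about.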
CP.10: The HEAR Doctrine — Human Ethical Agent of Record ★ First in Field
No swarm can operate above Sovereignty Level S3 without a formally designated Human Ethical Agent of Record. The HEAR is a named individual — not a team, not a role — with cryptographic signing authority, real-time interrupt capability, and legal accountability for all Class-H actions taken by any agent within their designated deployment boundary.
A Class-H action is any irreversible, financially material, security-control-modifying, physical-infrastructure-crossing, or cross-organizational commitment action. Class-H actions require the agent to pause, present the semantic consequence in plain language to the HEAR, receive a cryptographic signature from the HEAR’s registered private key, and log the authorization before execution. If the HEAR is unreachable or the signature fails, the action does not proceed. There is no automatic approval path.
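The pause, present, sign, log, execute sequence can be sketched as a single gate function. The category labels are snake_case renderings of the list above; authorize_action and its callback parameters are hypothetical. Signature verification is deliberately left as a caller-supplied function, since the text specifies a signature from the HEAR's registered private key and asymmetric verification is not available in the Python standard library.

```python
from typing import Callable, Iterable, Optional

CLASS_H_CATEGORIES = frozenset({
    "irreversible",
    "financially_material",
    "security_control_modifying",
    "physical_infrastructure_crossing",
    "cross_organizational_commitment",
})


def authorize_action(
    summary: str,
    categories: Iterable[str],
    request_signature: Callable[[str], Optional[bytes]],
    verify_signature: Callable[[bytes, bytes], bool],
    audit_log: list,
) -> bool:
    """Return True only if the action may execute.

    request_signature presents the plain-language summary to the HEAR and
    returns a signature, or None if the HEAR is unreachable.
    verify_signature checks the signature against the HEAR's registered key.
    """
    if not (set(categories) & CLASS_H_CATEGORIES):
        return True  # not Class-H: no HEAR sign-off required by this gate
    signature = request_signature(summary)
    if signature is None:
        audit_log.append(("denied", summary, "HEAR unreachable"))
        return False
    if not verify_signature(summary.encode("utf-8"), signature):
        audit_log.append(("denied", summary, "signature invalid"))
        return False
    audit_log.append(("authorized", summary, signature.hex()))
    return True
```

Every path that fails to produce a valid signature returns False, so there is no code path that approves a Class-H action by default.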
CP.10 satisfies EU AI Act Articles 9 and 14 (human oversight for high-risk AI), SEC Cybersecurity Disclosure accountability requirements, SOC 2 CC7.4, and GDPR Article 22 automated decision safeguards. It is the only framework control that creates named individual accountability at the execution layer for autonomous AI.
60% of organizations cannot terminate a misbehaving agent. CP.10 HEAR Doctrine exists because governance committees and board-level oversight are too slow for machine-speed incidents. The kill switch must be held by a named human who can use it unilaterally when it matters.
New Threat Category: T11 — Multi-Turn Behavioral Conditioning
v3.0 adds a new threat category to the AISM Agent Threat Control Matrix. T11 addresses an attack class that v2.1 had no formal coverage for: conditioning attacks that operate across multiple sessions to gradually shift agent behavior, bypassing all per-message injection filters.
What makes T11 distinct from T1 Prompt Injection: T1 is detectable at the input boundary on a per-message basis. T11 operates across many sessions — planting behavioral patterns through few-shot implanting, role confusion, contextual anchoring, and persona drift. No individual message appears adversarial. Detection requires semantic analysis across multiple sessions, behavioral baseline comparison over time, and governance artifacts that model time-shifted attack campaigns.
Primary detection and response controls: S1.6 (Cognitive Injection Sanitization), F3.4 (Behavioral Drift Baseline and Rollback), A2.5 (cross-session trace logging), and CP.2’s temporal_profile field — which was specifically designed to capture delayed and chronic attack patterns that instantaneous controls miss.
Gateway Enforcement: Production-Grade Control at the Execution Boundary
The Control Gateway is the enforcement component that makes the rest of the framework operationally real. Every other AI governance framework — and AI SAFE² v2.1 — places controls in documentation, system prompts, and periodic audits. These sit outside the execution boundary. Attackers operate inside it. The gateway closes that gap by enforcing at the one point in the architecture where deterministic control is architecturally possible: the moment a request is made to an LLM provider.
The problem was never missing controls. It was where they were placed.
Multi-Provider Enforcement
Five providers supported via unified adapter architecture in gateway/provider_adapters.py:
Provider | Type |
Anthropic | Cloud – Claude models |
OpenAI / Codex | Cloud – GPT models |
Google Gemini | Cloud – Gemini models |
Ollama | Local – self-hosted, air-gapped capable |
OpenRouter | Aggregator – multi-model routing |
One config change switches providers. Enforcement policy — rate limiting, validation, logging, HITL circuit breaking — stays identical across all of them. A team that adds a second provider does not build a second governance layer.
Heartbeat-Linked Integrity Validation
GENESIS_HASH is derived from the SHA-256 of the gateway configuration at startup and validated on every heartbeat. A missing, stale, or tampered hash triggers a hard stop: no fallback, and no graceful degradation into an unvalidated execution path. Governance that proceeds when integrity cannot be confirmed is not governance.
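The startup hash and heartbeat check could be approximated as below. This is not the gateway's actual code: hashing a canonical JSON serialization of the configuration is an assumption about how the configuration is fed to SHA-256, the function and exception names are hypothetical, and the "stale" case (which would need a timestamp) is not modeled.

```python
import hashlib
import hmac
import json
from typing import Optional


class IntegrityError(RuntimeError):
    """Raised to force a hard stop when integrity cannot be confirmed."""


def compute_genesis_hash(config: dict) -> str:
    """SHA-256 over a canonical (sorted-key, compact) JSON form of the config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def validate_heartbeat(current_config: dict,
                       genesis_hash: Optional[str]) -> None:
    """Raise IntegrityError unless the live config still matches the hash."""
    if not genesis_hash:
        raise IntegrityError("GENESIS_HASH is missing")
    current = compute_genesis_hash(current_config)
    # Constant-time comparison; bytes avoid type errors on non-ASCII input.
    if not hmac.compare_digest(current.encode("utf-8"),
                               genesis_hash.encode("utf-8")):
        raise IntegrityError("gateway configuration does not match GENESIS_HASH")
```

The function returns nothing on success and raises on every failure, so a caller cannot accidentally continue on a falsy return value.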
HMAC-SHA256 Chained Audit Logs
Every request logged. Every provider tracked. Every response recorded. Each log entry includes a hash of the previous entry. Chain break triggers safe mode automatically. This is A2.5 (Semantic Execution Trace Logging) implemented at the infrastructure layer — the append-only, tamper-evident audit trail required for ACT-2+ deployments, without application-layer instrumentation.
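A hash-chained log of this kind can be sketched with the standard hmac module. In this illustration each entry's MAC covers the previous entry's MAC plus the entry payload; the entry layout, the all-zero genesis link, and the function names are assumptions rather than the gateway's actual format.

```python
import hashlib
import hmac
import json

GENESIS_LINK = "0" * 64  # placeholder "previous MAC" for the first entry


def _entry_mac(key: bytes, prev_mac: str, payload: dict) -> str:
    """HMAC-SHA256 over the previous MAC and the canonical JSON payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    message = prev_mac.encode("utf-8") + body.encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def append_entry(log: list[dict], key: bytes, payload: dict) -> dict:
    """Append a new entry linked to the last one and return it."""
    prev_mac = log[-1]["mac"] if log else GENESIS_LINK
    entry = {"payload": payload, "prev_mac": prev_mac,
             "mac": _entry_mac(key, prev_mac, payload)}
    log.append(entry)
    return entry


def verify_chain(log: list[dict], key: bytes) -> bool:
    """Return False if any entry was altered, removed, or reordered."""
    prev_mac = GENESIS_LINK
    for entry in log:
        if entry["prev_mac"] != prev_mac:
            return False
        expected = _entry_mac(key, prev_mac, entry["payload"])
        if not hmac.compare_digest(expected, entry["mac"]):
            return False
        prev_mac = entry["mac"]
    return True
```

A False result from verify_chain is the chain-break condition; in the gateway described above, that is what would trigger safe mode. Truncation of the newest entries is not detectable from the chain alone and would need an externally stored head MAC.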
Runtime-Aware Risk Scoring
Formula: Action × Sensitivity × Historical Context, with modifiers for detected prompt injection (+5) and A2A impersonation detection (+3). This is a per-request implementation of the Combined Risk Score: not a periodic assessment but a live calculation that adjusts to what is happening in the execution stream.
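As a worked instance of the per-request formula, the sketch below multiplies the three inputs and applies the two additive modifiers. The text does not define the numeric scale of each input, so the arguments are treated as plain numbers supplied by the caller, and the function name is hypothetical.

```python
PROMPT_INJECTION_MODIFIER = 5   # from the formula description
A2A_IMPERSONATION_MODIFIER = 3  # from the formula description


def runtime_risk_score(action: float, sensitivity: float,
                       historical_context: float,
                       prompt_injection_detected: bool = False,
                       a2a_impersonation_detected: bool = False) -> float:
    """Action x Sensitivity x Historical Context, plus detection modifiers."""
    score = action * sensitivity * historical_context
    if prompt_injection_detected:
        score += PROMPT_INJECTION_MODIFIER
    if a2a_impersonation_detected:
        score += A2A_IMPERSONATION_MODIFIER
    return score
```

With inputs of 2, 1.5, and 1.0 the base score is 3.0; a detected prompt injection raises it to 8.0, and adding A2A impersonation raises it to 11.0.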
4-Tier HITL Circuit Breaker
Tier | Behavior |
LOW | Proceed + log |
MEDIUM | Proceed + enhanced logging |
HIGH | Queue for async human review |
CRITICAL | Hard stop – requires out-of-band HMAC 2FA before proxy |
The CRITICAL tier is the gateway implementation of the CP.10 Class-H action protocol. No token, no proxy. The HEAR doctrine is enforced infrastructure, not a policy document.
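The tier table can be sketched as a single decision function. This is a hedged illustration: the `Tier` enum, the return strings, the in-memory review queue, and the token format (HMAC-SHA256 over the request ID) are hypothetical, and how a risk score maps to a tier is deliberately left out because no thresholds are given here.

```python
import hashlib
import hmac
from enum import Enum
from typing import List, Optional


class Tier(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


def approval_token(key: bytes, request_id: str) -> str:
    # Hypothetical out-of-band token: HMAC-SHA256 over the request ID.
    return hmac.new(key, request_id.encode("utf-8"), hashlib.sha256).hexdigest()


def decide(tier: Tier, request_id: str, key: bytes,
           review_queue: List[str], token: Optional[str] = None) -> str:
    if tier is Tier.LOW:
        return "proceed"
    if tier is Tier.MEDIUM:
        return "proceed_enhanced_logging"
    if tier is Tier.HIGH:
        review_queue.append(request_id)  # held for async human review
        return "queued"
    # CRITICAL: hard stop unless a valid token for this exact request is presented.
    if token is not None and hmac.compare_digest(token, approval_token(key, request_id)):
        return "proceed"
    return "blocked"
```

Binding the token to the request ID means an approval issued for one request cannot be replayed against another.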
Bidirectional Enforcement
Requests gated outbound. Responses inspected inbound. Provider formats normalized via provider_adapters.py before detection logic runs — no provider-specific detection gaps regardless of which model is behind the gateway.
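A sketch of normalize-then-detect on the inbound side. The response shapes are simplified versions of the OpenAI chat-completion and Anthropic messages formats; the function names and the single regex rule are hypothetical and far narrower than real detection logic.

```python
import re


def normalize_response(provider: str, raw: dict) -> str:
    # Reduce provider-specific payloads to plain text before any detection runs.
    if provider == "openai":
        return raw["choices"][0]["message"]["content"]
    if provider == "anthropic":
        return "".join(block.get("text", "")
                       for block in raw["content"]
                       if block.get("type") == "text")
    raise ValueError(f"no adapter for provider: {provider}")


# One illustrative rule; it only ever sees normalized text.
SUSPICIOUS = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


def inspect_response(provider: str, raw: dict) -> bool:
    return bool(SUSPICIOUS.search(normalize_response(provider, raw)))
```

Because the rule only sees the normalized string, the same payload is flagged whichever provider returned it.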
NEXUS-A2A v0.2 Compatibility
Header detection for agent identity passthrough, delegation chain logging for CP.9 lineage tracking, and passthrough enforcement mode ship enabled by default. Full NEXUS-A2A enforcement activates with config flag nexus_a2a_enforcement: true. For teams not yet running NEXUS-A2A, the hooks are present and logging.
QA: 48 Passing Tests
Test coverage: multi-provider adapters, GENESIS_HASH integrity validation, HMAC chain integrity, HITL circuit breaker tiers, risk scoring with modifiers, NEXUS-A2A passthrough, and bidirectional inspection. The gateway ships production-ready.
Detection finds threats after they execute. Enforcement determines whether they execute at all.
Tool Ecosystem: From Framework to Deployed Infrastructure
v3.0 ships a complete tool ecosystem alongside the framework itself. These are not companion resources — they are the delivery mechanism that collapses the gap between framework definition and operational deployment.
Interactive Dashboard
Live: cyberstrategyinstitute.github.io/ai-safe2-framework/dashboard/
161 controls, zero install, runs entirely in browser. Persona-routed lenses (Executive, Architect, Builder, GRC, Researcher, Explorer), ACT Tier Classifier (6 questions → tier + mandatory controls + HEAR/CP.9/CP.8 flags), live risk calculator with CVSS + Pillar + AAF, 32-framework compliance crosswalk, dark/light mode.
Controls Schema
161 controls, v3.0 schema, 32 compliance frameworks. Machine-readable. New fields: builder_problem, act_minimum, version_added, first_in_field.
curl https://raw.githubusercontent.com/CyberStrategyInstitute/ai-safe2-framework/main/dashboard/public/data/controls.json
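Once downloaded, the schema can be filtered with a few lines of Python. The sketch below runs on an inline sample rather than the real file: the top-level list layout, the `id` key, and every field value shown are assumptions for illustration only, so check them against the actual controls.json before relying on them.

```python
import json

# Illustrative sample only; NOT copied from the real controls.json.
SAMPLE_JSON = """
[
  {"id": "CP.10", "version_added": "3.0", "act_minimum": "ACT-3", "first_in_field": true},
  {"id": "A2.5", "version_added": "3.0", "act_minimum": "ACT-2", "first_in_field": false},
  {"id": "P1.T1.6", "version_added": "2.1", "act_minimum": "ACT-1", "first_in_field": false}
]
"""


def added_in(controls: list, version: str) -> list:
    # Control IDs whose version_added field matches the requested version.
    return [c["id"] for c in controls if c.get("version_added") == version]


def first_in_field(controls: list) -> list:
    # Control IDs flagged with first_in_field.
    return [c["id"] for c in controls if c.get("first_in_field")]


controls = json.loads(SAMPLE_JSON)
```

Calling `added_in(controls, "3.0")` on the sample returns the two sample entries tagged with that version.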
MCP Server – MCP Security
The MCP server provides 7 tools and 51 passing tests, with dual transport (stdio for Claude Code, HTTPS for remote). It brings all 161 controls and 32 frameworks into any MCP-compatible AI coding assistant. Tools: lookup_control, risk_score, compliance_map, code_review, agent_classify, get_governance_resource, get_workflow_prompt.
Scanner v3.0
40+ rules, AST analysis, SARIF output, ACT tier estimation from code structure, 32 framework mapping. New v3.0 rules: indirect injection surface detection (P1.T1.10), memory governance gap detection (S1.5), gateway-layer recursion limit absence (F3.2), HEAR requirement detection for ACT-3/4 code patterns.
AI Builder Pre-Flight Checklist
35 structured questions across 7 categories: Input Defense, Data Governance, Human Oversight, Fail-Safe Design, Audit & Logging, Compliance, ACT Tier Gates. Maps directly to AI SAFE² v3.0 controls. Free download at cyberstrategyinstitute.com/ai-safe2/.
GRC Coverage: v2.1 vs v3.0
The headline GRC story in v3.0 is OWASP AIVSS v0.8: from unmapped to 100% coverage with scoring integration. Beyond that, every framework in the reviewed set improved. Total: 32 compliance frameworks mapped, up from approximately 14 in v2.1.
Framework | v2.1 | v3.0 | What Drove the Increase |
OWASP AIVSS v0.8 | Not mapped | 100% | First integration of all 10 core risks + AAF scoring formula |
MITRE ATLAS (Oct 2025) | 98% | 100% | 14 new agent techniques (Oct 2025 release) fully mapped to new controls |
MAESTRO (CSA 7-Layer) | ~55% | 95% | Layers 5-7 coverage added via M4.6, M4.7, A2.5, E5.1 |
Arcanum PI Taxonomy | ~65% | 95% | Evasion techniques (P1.T1.2), indirect surfaces (P1.T1.10), cognitive layer (S1.6) |
AIDEFEND (7 Tactics) | ~60% | 90% | Deceive tactic (CP.7), Evict (F3.5), Harden shift-left (S1.4) |
CSA Zero Trust for LLMs | ~50% | 90% | Micro-perimeter per agent (S1.3); policy-as-code controls (CP.3, CP.4) |
CSA Agentic Control Plane | ~40% | 85% | CP.4 covers identity, authorization, orchestration, runtime trust |
MIT AI Risk Repository v4 | ~85% | 95% | Catastrophic risk pathways (CP.8), CBRN risks extended |
AIID Agentic Incidents | ~70% | 90% | CP.6 incident feedback loop; M4.8 platform monitoring; CP.8 catastrophic paths |
DORA | Not mapped | Full | CP.10, F3.2–F3.5, gateway enforcement |
SEC Cybersecurity Disclosure | Not mapped | Full | CP.10 HEAR named accountability, A2.5 audit trails |
ISO 42001, NIST AI RMF, and OWASP Top 10 LLM remain at 100% coverage, Google SAIF at 95%, and CSETv1 at 92%, all unchanged from v2.1.
How v3.0 Compares to Competing Frameworks – Universal Rosetta Stone
AI governance frameworks proliferated rapidly in 2024-2025. The landscape now includes NIST AI RMF, ISO 42001, OWASP Top 10 LLM, OWASP AIVSS, MAESTRO, AIDEFEND, and various cloud provider security guidelines. AI SAFE² v3.0’s competitive position:
Capability | SAFE² v3.0 | NIST AI RMF | OWASP LLM | OWASP AIVSS | MAESTRO |
AIVSS AAF scoring integrated | YES (first) | No | No | Defines only | No |
Platform-specific controls (Bedrock, Azure, n8n) | YES (new) | No | No | No | No |
Active deception controls (CP.7) | YES (first) | No | No | No | No |
No-code platform security (S1.7) | YES (first) | No | No | No | No |
Formal ACT capability tiers | YES | No | No | Partial | Partial |
Agentic Control Plane governance | YES | Partial | No | No | Partial |
Agent protocol assessment (A2A, MCP) | YES | No | No | No | No |
AIID incident feedback loop | YES | No | No | No | No |
Agent Replication Governance (CP.9) | YES (first) | No | No | No | No |
HEAR Doctrine: named kill-switch authority (CP.10) | YES (first) | No | No | No | No |
Production enforcement gateway | YES (first) | No | No | No | No |
T11 Multi-Turn Behavioral Conditioning coverage | YES (first) | No | No | No | No |
ISO 42001 + NIST AI RMF + HIPAA + PCI-DSS + SOC 2 coverage | YES (100%) | 100% | Partial | No | No |
AI SAFE² v3.0 introduces nine capabilities no competing framework has: AIVSS scoring integration, no-code platform controls, active deception layer, cloud AI platform monitoring, Agent Replication Governance, HEAR Doctrine, production enforcement gateway, T11 threat category, and first-in-class NEXUS-A2A compatibility. These are not incremental improvements. They are the governance architecture the field has been missing.
Implementation Guidance: How to Upgrade Your AI Agents from v2.1
Organizations already implementing v2.1 should follow this prioritized upgrade path.
Week 1–2: Fix the One Removal
- Remove P1.T1.8 from all compliance tracking, control registers, and audit checklists
- Confirm the unique content from P1.T1.8 is reflected in your P1.T1.6 implementation (NFKC normalization, invisible character detection)
- Renumber P1.T1.9 to P1.T1.8 in all documentation
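For the NFKC normalization and invisible character detection mentioned in the list above, a minimal sketch using Python's standard `unicodedata` module follows. Treating every Unicode format character (category `Cf`) as "invisible" is an assumption; a production control may use a narrower or broader set. The function names are hypothetical.

```python
import unicodedata


def find_invisible(text: str) -> list:
    # Report position and code point of every format (Cf) character,
    # which includes zero-width spaces, joiners, and the BOM.
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if unicodedata.category(ch) == "Cf"]


def normalize_input(text: str) -> str:
    # Drop format characters, then fold compatibility forms
    # (for example fullwidth letters) to their canonical equivalents.
    visible = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return unicodedata.normalize("NFKC", visible)
```

NFKC folds compatibility variants such as fullwidth Latin letters, but it does not map cross-script homoglyphs (a Cyrillic "а" stays Cyrillic), so homoglyph evasion needs a separate check.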
Week 1–4: Apply the Three Modifications
- P1.T1.2: Test your existing prompt injection detection against the four Arcanum evasion categories (homoglyph, invisible characters, multi-turn conditioning, encoded payloads). Most detection pipelines will fail at least two of these.
- T3.10: If you run Bedrock or Azure AI Foundry, immediately audit your CloudTrail for UpdateGuardrail and UpdateDataSource API calls. Scan for MCP servers bound to all interfaces.
- Risk scoring: Calculate AAF scores for your top three highest-autonomy deployed agents. This will immediately reveal which agents are materially higher risk than their CVSS scores suggest.
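The CloudTrail audit in the list above can start as a simple filter over exported records. The sketch assumes the records have already been downloaded and parsed into dictionaries with the standard CloudTrail fields `eventName`, `eventTime`, and `userIdentity.arn`; the function name and the sample records are hypothetical.

```python
WATCHED_EVENTS = {"UpdateGuardrail", "UpdateDataSource"}


def flag_events(records: list) -> list:
    # Keep only the watched API calls, with who made them and when.
    return [
        {
            "time": r.get("eventTime"),
            "name": r["eventName"],
            "actor": r.get("userIdentity", {}).get("arn"),
        }
        for r in records
        if r.get("eventName") in WATCHED_EVENTS
    ]


# Hypothetical sample records for illustration.
sample = [
    {"eventName": "UpdateGuardrail", "eventTime": "2026-01-05T10:00:00Z",
     "userIdentity": {"arn": "arn:aws:iam::111111111111:user/example"}},
    {"eventName": "ListGuardrails", "eventTime": "2026-01-05T10:01:00Z",
     "userIdentity": {"arn": "arn:aws:iam::111111111111:user/example"}},
]
```

Each flagged entry is a starting point for review, not proof of compromise: the question for each one is whether the actor was authorized to change that guardrail or data source.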
Month 1–3: Prioritize Critical-Rated New Controls
In priority order:
- P1.T1.10 (Indirect Injection Surface Coverage): Enumerate every indirect injection surface. Most organizations will discover surfaces they had not formally tracked.
- S1.5 (Memory Governance Boundary Controls): Audit what is currently being written to persistent agent memory and whether write policies exist.
- A2.5 (Semantic Execution Trace Logging): Implement chain-of-thought audit trails for any ACT-3 or ACT-4 agents before the next governance review. Consider deploying the Control Gateway to satisfy this at the infrastructure layer.
- F3.2 (Agent Recursion Limit Governor): Implement hard recursion limits at the gateway layer, not the system prompt. Fast, high-impact.
- 5 (Tool-Misuse Detection): Establish tool invocation baselines and anomaly detection. Start with your highest-autonomy agents.
- M4.8 (Cloud AI Platform-Specific Monitoring): Stand up Bedrock and Azure AI Foundry platform-specific monitoring pipelines.
- S1.7 (No-Code Platform Security): Audit n8n, Zapier, Power Automate, and Workato instances for credential isolation and expression sandboxing.
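The recursion limit item in the list above calls for enforcement at the gateway rather than in the system prompt. One way to sketch that is to have the gateway compute call depth itself from parent request IDs instead of trusting a depth value supplied by the agent. The class, its method names, and the default limit are hypothetical.

```python
from typing import Dict, Optional


class RecursionLimitExceeded(RuntimeError):
    pass


class RecursionGovernor:
    def __init__(self, max_depth: int = 3):
        self.max_depth = max_depth
        self._depth: Dict[str, int] = {}  # request_id -> depth computed by the gateway

    def register(self, request_id: str, parent_id: Optional[str] = None) -> int:
        if parent_id is None:
            depth = 1  # top-level request
        elif parent_id in self._depth:
            depth = self._depth[parent_id] + 1
        else:
            raise ValueError(f"unknown parent request: {parent_id}")
        if depth > self.max_depth:
            raise RecursionLimitExceeded(f"depth {depth} exceeds limit {self.max_depth}")
        self._depth[request_id] = depth
        return depth
```

A request that would exceed the limit is never recorded, so it cannot serve as a parent for further calls.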
Month 1–3: Deploy the Control Gateway
The Control Gateway is the single highest-leverage action in v3.0. It implements F3.2, A2.5, and CP.10 Class-H blocking at the infrastructure layer without additional application changes. Deploy against your highest-autonomy agents first.
Source and setup: gateway/README.md.
Month 3–6: Implement Cross-Pillar Governance Controls
- CP.3: Classify all deployed agents by ACT tier
- CP.4: Document the Agentic Control Plane for your organization, including NHI inventory, agent orchestration graph, and governance evidence for board reporting
- 4: Ensure every deployed agent has an owner_of_record, hear_agent_of_record, and control_plane_id in your agent state inventory
- CP.6: Establish a quarterly AIID review process
- CP.8: Define Catastrophic Risk Thresholds as a condition of any new ACT-3 or ACT-4 deployment approval
- CP.9: Audit all agentic deployments for replication capability. Implement gateway-enforced replication limits and ephemeral credential issuance for any agent authorized to spawn sub-agents
- CP.10: Designate a named HEAR for every ACT-3 and ACT-4 deployment. Register the HEAR in A2.4. Establish the cryptographic signing capability before the next deployment approval
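The inventory requirement in the list above (every agent carries an owner_of_record, hear_agent_of_record, and control_plane_id) lends itself to an automated completeness check. The sketch below assumes the inventory is a list of dictionaries; the `agent_id` key, the function names, and the sample values are hypothetical.

```python
REQUIRED_FIELDS = ("owner_of_record", "hear_agent_of_record", "control_plane_id")


def missing_fields(agent: dict) -> list:
    # A field counts as missing if it is absent or empty.
    return [field for field in REQUIRED_FIELDS if not agent.get(field)]


def audit_inventory(agents: list) -> dict:
    # Map each non-compliant agent ID to the fields it still lacks.
    gaps = {}
    for agent in agents:
        missing = missing_fields(agent)
        if missing:
            gaps[agent["agent_id"]] = missing
    return gaps


# Hypothetical inventory for illustration.
inventory = [
    {"agent_id": "billing-agent", "owner_of_record": "j.doe",
     "hear_agent_of_record": "a.smith", "control_plane_id": "cp-01"},
    {"agent_id": "triage-agent", "owner_of_record": "j.doe",
     "hear_agent_of_record": "", "control_plane_id": "cp-01"},
]
```

Running a check like this in a deployment pipeline turns the requirement into a gate: an agent with a gap does not ship.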
Organizations that have already implemented v2.1 can treat v3.0 as a targeted upgrade, not a framework change. The pillars are unchanged. The subtopic numbering is stable (with the one documented renumbering). The new controls are additive.
AI Security & Governance Documentation Index
Resource | Location |
Framework overview | README.md → github.com/CyberStrategyInstitute/ai-safe2-framework |
Full v3.0 features & benefits | guides/v3-release-overview.md |
Interactive dashboard | dashboard/README.md |
MCP server setup | skills/mcp/README.md |
Scanner | scanner/README.md |
Gateway | gateway/README.md |
NEXUS-A2A protocol | Coming Soon |
Controls schema | dashboard/public/data/controls.json |
Integrations | INTEGRATIONS.md |
Threat matrix | AISM/AISM-Agent-Threat-Control-Matrix.md |
Release notes | RELEASE-NOTES-v3.0.0.md |
All releases | github.com/CyberStrategyInstitute/ai-safe2-framework/releases |
Get the AI SAFE² v3.0 Implementation Toolkit
The AI SAFE² v3.0 Framework is open source and freely available at github.com/CyberStrategyInstitute/ai-safe2-framework. The full framework document, all research notes, the interactive dashboard, the AISM crosswalk, the Pre-Flight Checklist, and the scanner are published there at no cost.
For organizations that need to move faster than a do-it-yourself GitHub implementation, the AI SAFE² Implementation Toolkit provides:
- 161-Point Audit Scorecard with auto-calculated combined risk scoring including the v3.0 AAF formula
- Enterprise AI Governance Policy template covering ACT tier assignments, HEAR designation requirements, and CP.9 replication authorization language
- AI SAFE² v3.0 Framework Document with full control text and all 10 cross-pillar governance controls
- Vendor Risk Questionnaire updated for v3.0 including protocol-layer supply chain assessment (CP.5)
- 30-Day Implementation Roadmap taking your organization from v2.1 or greenfield to v3.0 compliance
- Risk Command Center — board decision instrument with hexagonal posture radar, compliance readiness by framework, and auto-generated board brief
Consultants charge $5,000 to $15,000 for equivalent framework implementation work. The Toolkit is $97, one time.
Every week you run ACT-3 or ACT-4 agents without a designated HEAR, without CP.9 replication governance, and without a gateway enforcing at the execution boundary, you are operating at a risk level no other governance framework even has language to describe. That gap closes with AI SAFE² v3.0.