Engineering · 10 min read

The Compliance Team Isn't Coming

Jeff Toffoli
Hurricane Florence at Category 4, from the ISS. The Saffir-Simpson scale lets a county emergency manager grade a storm from numbers anyone can read. Public-facing AI needs the same move. Photo: NASA / astronaut Ricky Arnold, public domain.

Air Canada lost in court because its chatbot quoted a bereavement-fare refund the airline didn't actually offer. The tribunal didn't accept "the chatbot made it up" as a defense. The price was binding. The customer got the refund. Air Canada got the case study every AI deployment lawyer now cites.

That wasn't a model failure. The model did what models do. It was a deployment failure — a public-facing AI, speaking with the implied authority of a brand, making representations no one had cleared. And the safety scaffolding that's supposed to catch this — the deployment review board, the compliance team, the third-party auditor, the NIST AI RMF binder — didn't exist, because Air Canada deployed its customer service like every Fortune 500 deploys customer service: as a vendor integration with a configuration screen.

Now picture the same deployment posture at every plumber, clinic, real estate office, and local campaign in the country. That's the actual frontier of public-facing AI in 2026. And the frameworks that govern it — NIST AI RMF, the EU AI Act, OWASP Agentic Top 10, NVIDIA Frontier Risk, AWS Agentic Security Scoping Matrix — were written for organizations with compliance teams, deployment review boards, and budgets for outside auditors. None of that exists for the businesses actually doing the deploying.

This is the distribution problem. The risk has moved downstream of the infrastructure that's supposed to govern it.

The Saffir-Simpson scale solved a structurally similar problem in 1971. Hurricane risk wasn't going down. It was hitting people who weren't meteorologists. Bob Simpson and Herbert Saffir wrote a scale a county emergency manager could apply from wind speed alone — fast, before the storm landed. That's the move public-facing AI needs: a scoring system a non-expert can apply, at the moment a configuration changes, before the deployment ships.

We built it. Eight dimensions, a compound score, five tiers, hard stops on the combinations history says shouldn't ship. The runtime refuses writes that would cross a hard stop without explicit acknowledgment.

Who It's For

The plumber whose missed-call line just got a text-back agent. The clinic whose receptionist is now an AI intake bot. The real estate agent whose lead qualifier sends booking links. The local campaign with an AI on its town-hall reply line.

None of them have a compliance team. None of them have a deployment review board. None of them are reading the EU AI Act. They have a configuration screen and a question: is this safe to turn on?

The framework gives them the same answer a county emergency manager gets from a Category 4 reading. Not because the manager is a meteorologist. Because someone built a scale they can read.

The Framework

Eight dimensions, scored 1 to 5: autonomy (how much the AI does on its own), action capability (what it can change in the world), consequence severity (how bad the worst mistake is), reversibility (whether mistakes can be undone), audience exposure (who's reaching the AI), domain sensitivity (how regulated the field is), identity representation (what authority its statements carry), data sensitivity (what kind of information it touches). Each is anchored in a real framework — NIST, EU AI Act, OWASP, UC Berkeley Agentic Standards, NVIDIA Frontier Risk, AWS Agentic Security Scoping Matrix — and translated into criteria a non-expert can read off the deployment.

They don't move in lockstep, but real-world changes routinely shift several at once. Enabling voice answering raises action capability and identity representation. Adding email-send raises action capability and audience exposure. That's the whole reason the compound score exists.

The Action Capability slide from the Quallaa risk-scoring deck — five levels from 'Read and respond' to 'Transact and commit', with a low-to-high risk bar underneath
One dimension, five levels, a one-sentence example at each. The full deck (eight dimensions, same shape) is at /risk-scoring.

The compound score is a weighted geometric mean — consequence severity counts double; autonomy, action capability, and domain sensitivity each count 1.5×. The geometric mean punishes high dimensions: a deployment that scores 5 on consequence severity and 1 on everything else does not get a "low risk" overall. The score maps to five tiers, Low to Critical, with measures that compound as risk does. AI disclosure and basic logging at Tier 1. Independent safety audit and a documented kill switch at Tier 5. The measures aren't aspirational. They're conditions the runtime expects to see on the capability enables that put the deployment at that tier.
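As a rough illustration of the arithmetic, a weighted geometric mean with the weights described above might look like the sketch below. The dimension keys, the `WEIGHTS` table, and `compoundScore` are hypothetical names, the weight of 1 for the four remaining dimensions is an assumption, and none of this is taken from lib/risk/score.ts.

```typescript
// Hypothetical dimension keys: the article names the dimensions, not their identifiers.
type Dimension =
  | "autonomy"
  | "actionCapability"
  | "consequenceSeverity"
  | "reversibility"
  | "audienceExposure"
  | "domainSensitivity"
  | "identityRepresentation"
  | "dataSensitivity";

type Scores = Record<Dimension, number>; // each level is 1 to 5

// Weights as described in the text; a weight of 1 for the rest is an assumption.
const WEIGHTS: Record<Dimension, number> = {
  consequenceSeverity: 2,
  autonomy: 1.5,
  actionCapability: 1.5,
  domainSensitivity: 1.5,
  reversibility: 1,
  audienceExposure: 1,
  identityRepresentation: 1,
  dataSensitivity: 1,
};

// Weighted geometric mean: exp( sum(w_i * ln(s_i)) / sum(w_i) ).
function compoundScore(scores: Scores): number {
  let weightedLogSum = 0;
  let weightTotal = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    const level = scores[dim];
    if (!(level >= 1 && level <= 5)) {
      throw new RangeError(`${dim} must be between 1 and 5, got ${level}`);
    }
    weightedLogSum += WEIGHTS[dim] * Math.log(level);
    weightTotal += WEIGHTS[dim];
  }
  return Math.exp(weightedLogSum / weightTotal);
}
```

A deployment at level 3 on every dimension scores 3, and raising consequence severity by one level moves the score more than raising reversibility by one level, because of the doubled weight. Under this plain formulation, a profile of 5 on consequence severity and 1 everywhere else comes out at roughly 1.36, so how that profile is tiered depends on thresholds and handling the article does not list.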

The full deck is at /risk-scoring, and the math lives in lib/risk/score.ts. The code and the deck are maintained as one artifact, so the published levels and the scoring logic stay in step.

Hard Stops

The compound score is the average case. Hard stops are the failure modes — combinations of dimensions that, by themselves, demand specific measures regardless of where the average lands. They come from incident analysis, not theory.

Consequence ≥ 4 AND Autonomy ≥ 4 is the Air Canada combination, structurally: a system making consequential representations without a human in the loop. It is hard-stopped unless a human is in the loop on consequential actions.

Domain Sensitivity ≥ 4 alone triggers compliance review. That covers healthcare, financial services, and legal advice. The regulated domains don't negotiate.

Identity ≥ 4 AND Consequence ≥ 3 forces explicit AI disclosure at first contact. If the AI carries the authority of the business and the mistakes are expensive, the audience gets to know they're talking to AI.

A capability enable that would newly cross one of these gets refused at the runtime. The required measures get surfaced to the owner. The owner can acknowledge them and proceed — but the system won't let them not look.
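The three hard stops, and the rule that only a newly crossed stop blocks a change, can be sketched roughly like this. The identifiers (`HARD_STOPS`, `gateEnable`, the stop ids) are hypothetical, the measure strings paraphrase the text, and the real runtime's behavior is not shown in this article.

```typescript
// Only the dimensions the hard stops read; key names are hypothetical.
interface Scores {
  autonomy: number;
  consequenceSeverity: number;
  domainSensitivity: number;
  identityRepresentation: number;
}

interface HardStop {
  id: string;
  requiredMeasure: string;
  triggered: (s: Scores) => boolean;
}

// The three combinations named in the text.
const HARD_STOPS: HardStop[] = [
  {
    id: "consequence-autonomy",
    requiredMeasure: "Human in the loop on consequential actions",
    triggered: (s) => s.consequenceSeverity >= 4 && s.autonomy >= 4,
  },
  {
    id: "regulated-domain",
    requiredMeasure: "Compliance review",
    triggered: (s) => s.domainSensitivity >= 4,
  },
  {
    id: "identity-consequence",
    requiredMeasure: "Explicit AI disclosure at first contact",
    triggered: (s) => s.identityRepresentation >= 4 && s.consequenceSeverity >= 3,
  },
];

function triggeredStops(s: Scores): HardStop[] {
  return HARD_STOPS.filter((stop) => stop.triggered(s));
}

type GateResult =
  | { allowed: true }
  | { allowed: false; requiredMeasures: string[] };

// Refuse a change that newly crosses a hard stop unless the owner has acknowledged it.
function gateEnable(
  before: Scores,
  after: Scores,
  acknowledged: ReadonlySet<string>,
): GateResult {
  const alreadyCrossed = new Set(triggeredStops(before).map((stop) => stop.id));
  const unacknowledged = triggeredStops(after).filter(
    (stop) => !alreadyCrossed.has(stop.id) && !acknowledged.has(stop.id),
  );
  return unacknowledged.length === 0
    ? { allowed: true }
    : { allowed: false, requiredMeasures: unacknowledged.map((stop) => stop.requiredMeasure) };
}
```

In this sketch, a refusal carries the required measures back to the caller, so the owner sees them before choosing to acknowledge and retry. A stop that was already crossed before the change does not block it again.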

Where the Engine Runs

Not in a separate dashboard. Inside the trust-crossing handshake — the moment Quallaa is about to cross a trust boundary on the customer's behalf.

A trust-crossing card from the Quallaa backend — title 'Quallaa wants to enable a capability', a one-sentence reason, the capability name (Reschedule and cancel), the risk delta (Tier 3 → Tier 3, 3.02 → 3.15), and Decline / Approve buttons
The trust-crossing handshake. The capability the AI is about to gain, the reason, the risk delta in dimensions and tier, and an explicit Approve. The approve is what authorizes the write — not a settings toggle.

The handshake shows what's changing, names which dimensions move and by how much, projects the new tier, and surfaces the measures the system would expect at that tier. The owner approves or declines. That's the gate.
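One way to picture the data behind that card is the sketch below. Every name in it (`RiskSnapshot`, `HandshakeCard`, `buildCard`, `resolve`) is hypothetical, and the scores and tiers are passed in already computed, because the tier thresholds are not given here.

```typescript
// Hypothetical dimension keys: the article names the dimensions, not their identifiers.
type Dimension =
  | "autonomy"
  | "actionCapability"
  | "consequenceSeverity"
  | "reversibility"
  | "audienceExposure"
  | "domainSensitivity"
  | "identityRepresentation"
  | "dataSensitivity";

type Levels = Record<Dimension, number>;

interface DimensionDelta {
  dimension: Dimension;
  from: number;
  to: number;
}

// Score and tier are inputs here; this sketch does not compute them.
interface RiskSnapshot {
  levels: Levels;
  score: number;
  tier: number;
}

interface HandshakeCard {
  capability: string;
  reason: string;
  moved: DimensionDelta[];
  riskDelta: string;
}

// List only the dimensions whose level changes.
function movedDimensions(before: Levels, after: Levels): DimensionDelta[] {
  return (Object.keys(after) as Dimension[])
    .filter((d) => before[d] !== after[d])
    .map((d) => ({ dimension: d, from: before[d], to: after[d] }));
}

function buildCard(
  capability: string,
  reason: string,
  before: RiskSnapshot,
  after: RiskSnapshot,
): HandshakeCard {
  return {
    capability,
    reason,
    moved: movedDimensions(before.levels, after.levels),
    riskDelta:
      `Tier ${before.tier} → Tier ${after.tier}, ` +
      `${before.score.toFixed(2)} → ${after.score.toFixed(2)}`,
  };
}

type Decision = "approve" | "decline";

// The write runs only on an explicit approve; a decline leaves everything untouched.
function resolve<T>(decision: Decision, write: () => T): T | undefined {
  return decision === "approve" ? write() : undefined;
}
```

Passing the write in as a callback keeps the approval and the change in one step: nothing is written unless `resolve` receives an explicit approve.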

There's also a scorecard surface for owners who want to think in the framework directly — the Deployment Risk panel inside /quallaa, an eight-dimension editor with the compound score, tier, and triggered hard stops updating live as you move levels. Owners who want the conversation get the conversation. Owners who want the abstract get the abstract. Same engine underneath.

Why It's Open

The framework is v1, not the end. The dimensions, the weights, the tier thresholds, the hard-stop combinations are documented and run predictably from inputs the owner controls. Score example deployments yourself at /risk-scoring. If something looks wrong for your domain, tell us. The thresholds get sharper from incident data, not from theory.

What Replaces Them

Lab-scale frameworks aren't going to migrate downstream to the plumber and the clinic on their own. They were never structured to. What replaces them is engines like this one — scorable by non-experts, automatic, running inside the act that creates the risk, built into the products the businesses are already using.

Air Canada was the warning. The frontier is downstream.
