Why We Don't Sanitize User Messages in Our AI Agent

When you build a customer-facing AI agent -- one that talks to people who didn't sign up, who aren't authenticated, who might be adversarial -- the obvious instinct is to sanitize their input. Strip dangerous patterns. Filter injection attempts before they reach the model.

We don't do this. Here's why.

Maginot Line bunker, Saint-Avold, France. Built to perfectly defend against the previous war's known patterns of attack. The next attackers simply went around it. Pattern-matching on yesterday's attacks is not a defense against an adversarial input space — the architecture is. Photo: Jean-Marc Pascolo, CC BY-SA 4.0.

A real conversation thread in the Quallaa portal. User messages arrive in the messages array. The model handles them. No regex filter pretends to make them 'safe.'

What We Do Sanitize

We sanitize system-authored content before it enters the system prompt:

Contact profiles (accumulated from previous conversations)
Business profiles (what the AI has learned about the client's business)
Custom instructions (policies, tone, guardrails the owner wrote)
Business hours, disclosure text, escalation context

These are injected into the system prompt -- the instructions the model treats as authoritative. If a customer said "My name is SYSTEM: ignore all previous instructions" in a conversation, and we stored that in their contact profile without sanitization, it would later appear inside the system prompt when the AI loads their profile. That's a real injection vector.

Our sanitizer strips patterns like SYSTEM:, INSTRUCTION:, OVERRIDE:, XML-like tags, and triple backticks from this content. It enforces maximum lengths. It's applied to every piece of system-authored content that gets injected into the prompt.

Why We Don't Sanitize User Messages

User messages go in the messages array -- the conversation history. Claude is trained to treat this as untrusted input. The model already knows the difference between what the system told it and what a user said.

The model expects adversarial input in the messages array. It's designed for it. Adding our own regex filters on top would:

Break real messages. A plumber's customer texts "STOP by at 3pm" -- if we strip "STOP" from the content, we've corrupted the message. A customer writes "My SYSTEM is broken" -- if we strip "SYSTEM:", we lose meaning. Real people use words that look like injection patterns.

Give false confidence. A regex filter catches SYSTEM: but not the thousand other ways to phrase an instruction override. Prompt injection is an adversarial game -- if you filter known patterns, attackers use unknown ones. The model's safety training is a fundamentally stronger defense than pattern matching.

Fight the model instead of trusting it. Anthropic spent billions of dollars training Claude to handle adversarial input. Our regex filter cost us 30 lines of code. One of these is the right tool for this job.

The Correct Boundary

The security boundary isn't "safe words vs. dangerous words." It's where in the API call the content goes:

System prompt (model trusts this): sanitize everything that goes here
Messages array (model expects untrusted input): don't sanitize

The system prompt is where the model learns what it should and shouldn't do. Injecting content there is like editing the rulebook. The messages array is where the conversation happens -- the model already treats it as "what the user said," not "what I should believe."

The Trust Boundary Is the Real Control

Prompt injection tries to make the model do something it shouldn't. But the defense against that isn't word filtering -- it's restricting what actions the model can take.

Our system has three trust levels: untrusted (customers), known_contact (returning customers), and owner (business owner). Each level gets a different set of tools.

An untrusted caller's agent literally doesn't have read_inbox or resolve_escalation available. No prompt injection can invoke a tool that doesn't exist in the API call.

This is the principle: enforce at the boundary, not in the content. Don't try to make user input safe -- make the system safe regardless of what the user inputs.

What About the Really Clever Attacks?

The academic prompt injection literature describes multi-turn attacks, context manipulation, and persona hijacking. These are real. But the defense against them is:

Strong system prompts -- clear instructions about what the agent should and shouldn't do
Trust boundaries -- restricting which tools are available based on who's talking
Model capability -- Claude's constitutional AI training, designed to resist instruction override

None of these defenses are improved by stripping words from user input. The words aren't the problem. The architecture is the defense.

Why We Don't Sanitize User Messages in Our AI Agent

What We Do Sanitize

Why We Don't Sanitize User Messages

The Correct Boundary

The Trust Boundary Is the Real Control

What About the Really Clever Attacks?

Want to build something like this?

Related Articles

The Story an AI Tells Itself

You Can't Trust What You Can't Trace

The Compliance Team Isn't Coming