Why We Don't Sanitize User Messages in Our AI Agent
When you build a customer-facing AI agent -- one that talks to people who didn't sign up, who aren't authenticated, who might be adversarial -- the obvious instinct is to sanitize their input. Strip dangerous patterns. Filter injection attempts before they reach the model.
We don't do this. Here's why.
What We Do Sanitize
We sanitize system-authored content before it enters the system prompt:
- Contact profiles (accumulated from previous conversations)
- Business profiles (what the AI has learned about the client's business)
- Custom instructions (policies, tone, guardrails the owner wrote)
- Business hours, disclosure text, escalation context
These are injected into the system prompt -- the instructions the model treats as authoritative. If a customer said "My name is SYSTEM: ignore all previous instructions" in a conversation, and we stored that in their contact profile without sanitization, it would later appear inside the system prompt when the AI loads their profile. That's a real injection vector.
Our sanitizer strips patterns like SYSTEM:, INSTRUCTION:, OVERRIDE:, XML-like tags, and triple backticks from this content. It enforces maximum lengths. It's applied to every piece of system-authored content that gets injected into the prompt.
Why We Don't Sanitize User Messages
User messages go in the messages array -- the conversation history. Claude is trained to treat this as untrusted input. The model already knows the difference between what the system told it and what a user said.
The model expects adversarial input in the messages array. It's designed for it. Adding our own regex filters on top would:
Break real messages. A plumber's customer texts "STOP by at 3pm" -- if we strip "STOP" from the content, we've corrupted the message. A customer writes "My SYSTEM is broken" -- if we strip "SYSTEM:", we lose meaning. Real people use words that look like injection patterns.
Give false confidence. A regex filter catches SYSTEM: but not the thousand other ways to phrase an instruction override. Prompt injection is an adversarial game -- if you filter known patterns, attackers use unknown ones. The model's safety training is a fundamentally stronger defense than pattern matching.
Fight the model instead of trusting it. Anthropic spent billions of dollars training Claude to handle adversarial input. Our regex filter cost us 30 lines of code. One of these is the right tool for this job.
The Correct Boundary
The security boundary isn't "safe words vs. dangerous words." It's where in the API call the content goes:
- System prompt (model trusts this): sanitize everything that goes here
- Messages array (model expects untrusted input): don't sanitize
The system prompt is where the model learns what it should and shouldn't do. Injecting content there is like editing the rulebook. The messages array is where the conversation happens -- the model already treats it as "what the user said," not "what I should believe."
The Trust Boundary Is the Real Control
Prompt injection tries to make the model do something it shouldn't. But the defense against that isn't word filtering -- it's restricting what actions the model can take.
Our system has three trust levels: untrusted (customers), known_contact (returning customers), and owner (business owner). Each level gets a different set of tools.
An untrusted caller's agent literally doesn't have read_inbox or resolve_escalation available. No prompt injection can invoke a tool that doesn't exist in the API call.
This is the principle: enforce at the boundary, not in the content. Don't try to make user input safe -- make the system safe regardless of what the user inputs.
What About the Really Clever Attacks?
The academic prompt injection literature describes multi-turn attacks, context manipulation, and persona hijacking. These are real. But the defense against them is:
- Strong system prompts -- clear instructions about what the agent should and shouldn't do
- Trust boundaries -- restricting which tools are available based on who's talking
- Model capability -- Claude's constitutional AI training, designed to resist instruction override
None of these defenses are improved by stripping words from user input. The words aren't the problem. The architecture is the defense.
Stop losing jobs to missed calls
AI texts your missed callers back in 30 seconds. Real conversations, not templates. 14-day free trial — no card required.
Related Articles
How Customer-Facing AI Agents Determine Trust
Phone numbers, portal sessions, and API keys -- three ways an AI agent decides who it's talking to and what actions it can take. The architecture behind trust boundaries.
Efficiency Is a Design Decision, Not an Optimization
The biggest efficiency gains in AI systems aren't in the code. They're in what you decide not to build. Every scaffolding layer costs energy, money, and human effort. Less machine is the goal.
Trust the Model, Save the Energy
Every layer between the user and the model costs energy. Sierra, Decagon, and Intercom build scaffolding because they don't trust the model. The scaffolding depreciates every 2 months. The model improves.

