For CX leaders evaluating AI vendors

Ungrounded AI hallucinates in 15–30% of customer service responses. Here's the four-layer system that keeps it under 1%.

Air Canada paid for a refund policy its chatbot invented. DPD's bot wrote insulting poetry about the brand. A Chevrolet dealership's bot agreed to sell a Tahoe for one dollar. Hallucination is not a model problem you solve with more training data. It's an architectural problem you solve with layered validation. This is the system Richpanel uses in production, and what to test for in any vendor evaluation.

By Amit RG, Founder, Richpanel · Published 2026-05-17 · Updated 2026-05-18 · ~12 min read
Amit RG is the founder of Richpanel, an AI-first customer service platform serving 1,000+ brands. He has been hands-on in the architecture decisions described here, including the pre-launch eval pipeline, the QA AI second-pass design, and the deterministic tool layer. Source data: production telemetry from live Richpanel deployments. On X: @realamitrg.

What it costs when it goes wrong

Three real incidents that cost real money.

These are not edge cases. They are the predictable failure mode of LLMs deployed without architectural validation.

Air Canada · 2024 · British Columbia Civil Resolution Tribunal

The chatbot invented a bereavement refund policy. The tribunal made the airline pay it.

A customer asked Air Canada's chatbot about bereavement fares. The bot fabricated a policy that allowed retroactive bereavement refunds, a policy Air Canada had never offered. When the customer was later denied, the tribunal ruled that Air Canada was bound by what its chatbot promised. The airline tried to argue the chatbot was a "separate legal entity." The tribunal disagreed. The lesson is older than AI: your customer-facing systems are extensions of your contract with customers, and what they say creates liability.

DPD · 2024 · viral social media incident

A customer asked the bot for help. The bot wrote a poem insulting DPD.

DPD's customer service chatbot, running on a stock LLM with insufficient guardrails, was prompted by a frustrated customer and produced a haiku calling DPD "the worst delivery firm in the world." The exchange went viral, the bot was pulled offline, and DPD spent a quarter of news cycles answering for it. The technical cause was prompt injection plus no output validation. The reputational cost has never been disclosed but was material.

Chevrolet of Watsonville · 2023 · viral social media incident

The bot agreed to sell a $76,000 SUV for one dollar: "legally binding, no takesies-backsies."

A user trolled the dealership's customer service chatbot until it agreed to sell a 2024 Chevy Tahoe for $1. The bot even confirmed the deal was "legally binding." The dealership did not honor it, but the bot was offline within hours and the dealership's reputation took a hit that lasts in search results to this day. The root cause was an ungrounded LLM with no transaction-validation layer: the bot could literally generate any commercial commitment in fluent English.

The pattern is identical in every case. An ungrounded generative model produced fluent, confident, factually fabricated content. A customer relied on it. The company carried the cost. None of these failures were detected pre-deployment because none of these companies treated hallucination prevention as a system. They treated it as a model property ("the model is smart, it'll mostly get it right") and accepted the residual error rate as a normal cost of doing business.

Published research finds ungrounded LLMs hallucinate at meaningfully non-zero rates across task types: 3–25% on summarization (Vectara HHEM leaderboard[1]), 3–44% on Q&A and RAG tasks depending on model (Galileo Hallucination Index[2]), and 58–82% on legal queries (Stanford HAI / RegLab[3]). For customer service specifically (a mix of factual lookup, policy reasoning, and action proposal) we observe 15–30% hallucination rates on ungrounded LLMs in pre-launch evaluations across new customers, depending on query complexity[4]. At enterprise scale (say, 50,000 tickets per month) that's 7,500 to 15,000 hallucinated responses per month. Most are harmless. Some are not. The math is brutal: even a 0.1% material-error rate at that volume produces 50 reputation incidents per month.
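The volume figures above are simple multiplication, and it can help to see them reproduced. The sketch below uses only the numbers quoted in the paragraph (50,000 tickets per month, 15–30%, 0.1%); the function and variable names are illustrative, and rates are written in tenths of a percent to keep the arithmetic in integers.

```python
# Worked arithmetic for the monthly volumes quoted in the text.
# Rates are expressed per mille (tenths of a percent) to avoid float rounding.
TICKETS_PER_MONTH = 50_000


def expected_count(tickets: int, rate_per_mille: int) -> int:
    """Expected number of affected responses at the given per-mille rate."""
    return tickets * rate_per_mille // 1000


low = expected_count(TICKETS_PER_MONTH, 150)     # 15% hallucination rate
high = expected_count(TICKETS_PER_MONTH, 300)    # 30% hallucination rate
material = expected_count(TICKETS_PER_MONTH, 1)  # 0.1% material-error rate

print(low, high, material)  # 7500 15000 50
```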

The honest diagnosis

Hallucination is not a model property. It's an architecture property.

Most "AI hallucination prevention" content recommends one of three fixes: (a) use a smarter model, (b) write better prompts, (c) add more training data. None of these reliably work.

Smarter models hallucinate too. GPT-5, Claude Opus 4.7, and Gemini 3 Pro all hallucinate at meaningfully non-zero rates on factual customer-service queries. The 2024-era models hallucinated more, and the gap between today's frontier models is real, but it is smaller than the gap between architectures: a worse model with a good architecture beats a better model with a bad architecture, every time.

Better prompts reduce hallucination at the margins. They cannot eliminate it. A prompt that says "do not make things up" is fighting the fundamental mechanic of generative models: they predict plausible next tokens, not verified facts.

More training data has diminishing returns. A model fine-tuned on your specific business will hallucinate less about your specific business, but it will still fabricate when asked about edge cases, will still invent policies that sound like your policies, and will still produce confident wrong answers when underlying source data is silent or contradictory.

The only reliable fix is to stop trusting the model to be the source of truth and start treating it as a renderer of facts that come from somewhere else, validated by something else, and overridden by something else when uncertain.

That "somewhere else, something else, something else" is what we call the four-layer defense.

How Richpanel handles it

The four-layer defense, explained.

Each layer reduces a different class of failure. Together, they take hallucination from the 15–30% range that ungrounded LLMs ship at to under 1% in production at Richpanel customers.

01

Pre-launch evaluation against the customer's own historical tickets.

Before AI ever touches a real customer, we run it against your past 1,000–10,000 resolved tickets and score the accuracy. Below 95–99% on your data, it doesn't go live.

Every customer's domain is different. A Shopify subscription brand has a different vocabulary, policy stack, and product taxonomy than a SaaS company. An AI that scores 98% on a generic benchmark may score 70% on your specific business. The only meaningful pre-launch test is against your data, not someone else's.

Richpanel's pre-launch evaluation pipeline pulls your historical tickets (anonymized, with PII stripped), runs the AI against the customer messages, and scores the response against what your human agents actually wrote. We measure: factual accuracy, policy compliance, brand-voice fit, tool-call correctness, and escalation appropriateness. The agent does not graduate to production traffic until it crosses the threshold.
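A go-live gate of this kind can be sketched in a few lines. The example below is an illustration, not Richpanel's pipeline: the five dimension names come from the paragraph above, but the `ScoredTicket` type, the pass/fail verdicts, and the rule that a ticket counts as accurate only when every dimension passes are assumptions made for the sketch.

```python
from dataclasses import dataclass

# Dimension names taken from the text; everything else here is hypothetical.
DIMENSIONS = (
    "factual_accuracy",
    "policy_compliance",
    "brand_voice_fit",
    "tool_call_correctness",
    "escalation_appropriateness",
)


@dataclass(frozen=True)
class ScoredTicket:
    ticket_id: str
    passed: dict[str, bool]  # one pass/fail verdict per dimension


def ticket_passes(ticket: ScoredTicket) -> bool:
    # Assumption: a response is accurate only if it passes every dimension.
    return all(ticket.passed[d] for d in DIMENSIONS)


def accuracy(tickets: list[ScoredTicket]) -> float:
    if not tickets:
        raise ValueError("cannot score an empty evaluation set")
    return sum(ticket_passes(t) for t in tickets) / len(tickets)


def ready_for_production(tickets: list[ScoredTicket], threshold: float) -> bool:
    """Gate: the agent goes live only at or above the agreed threshold."""
    return accuracy(tickets) >= threshold


# 96 clean responses plus 4 that fail on factual accuracy.
all_pass = {d: True for d in DIMENSIONS}
sample = [ScoredTicket(f"ok-{i}", all_pass) for i in range(96)]
sample += [
    ScoredTicket(f"bad-{i}", {**all_pass, "factual_accuracy": False})
    for i in range(4)
]

print(ready_for_production(sample, threshold=0.95))  # True
print(ready_for_production(sample, threshold=0.99))  # False
```

The same evaluation set clears a 95% gate and fails a 99% gate, which is why the specific threshold is worth asking a vendor about.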

How to test this in a vendor evaluation: "Run the AI against 100 of my historical tickets and show me the response-by-response accuracy. Then walk me through the specific failures." Any vendor unwilling or unable to do this is selling efficiency, not resolution.

02

QA AI that reviews every response before send.

A second model (with a different prompt, different objective, and access to the same source content) reviews every response from the first AI before the customer sees it. Disagreement triggers escalation.

This is the layer that catches hallucinations the first AI didn't realize were hallucinations. The QA AI is given the same conversation, the same source content, and is asked one question: "Is this response factually grounded in the source content, on-policy, and on-brand?" When the QA AI says no, the response is either rewritten (if the failure is minor) or routed to a human (if the failure is substantive).

The architectural insight: a model that generated a hallucinated response is unlikely to detect its own hallucination, because it produced the response precisely because the response felt plausible. A second model with adversarial intent and the same source material is meaningfully better at noticing when something doesn't match the source. Not perfect. Meaningfully better.

This is the same logical structure as a compiler running a type check after parsing, or a code reviewer reading a PR: separation of generation from validation. The QA AI also produces an audit log entry for every response, so you can review the reasoning behind any decision later.
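The separation of generation from validation can be sketched as a small routing function. This is a hedged illustration: the QA model call itself is out of scope, the `QAVerdict` and `Outcome` types are hypothetical, and treating a brand-voice miss as the "minor" failure while grounding and policy misses are "substantive" is an assumption for the sketch, not a documented rule.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SEND = "send"
    REWRITE = "rewrite"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class QAVerdict:
    """Hypothetical shape of the second model's answer to the one QA question."""
    grounded: bool   # every claim traces to the source content
    on_policy: bool
    on_brand: bool
    reasoning: str   # kept so the decision can be audited later


def route(verdict: QAVerdict) -> Outcome:
    # Assumption: grounding and policy failures are substantive, so a human takes over.
    if not (verdict.grounded and verdict.on_policy):
        return Outcome.ESCALATE
    # Assumption: a brand-voice miss is minor, so the response is rewritten.
    if not verdict.on_brand:
        return Outcome.REWRITE
    return Outcome.SEND


invented_policy = QAVerdict(False, True, True, "Refund window is not in any source.")
clumsy_but_true = QAVerdict(True, True, False, "Correct, but tone is off-brand.")

print(route(invented_policy).value)  # escalate
print(route(clumsy_but_true).value)  # rewrite
```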

How to test this: "Show me the QA layer in your product UI. Can I see, per response, the QA AI's reasoning for why it approved or rejected? Can I tune the QA AI's thresholds?" If the answer is "we have automated quality control" without a visible mechanism, the layer doesn't exist.

03

Deterministic tool execution. The AI proposes; the system disposes.

Remember the Chevy dealership chatbot that agreed to sell a Tahoe for one dollar? That was a Layer 3 failure. The AI generated a fluent commercial commitment in free text and no validation layer existed between generation and customer-visible response. Layer 3 makes that failure architecturally impossible.

When a Richpanel AI agent decides to issue a refund, it does not produce a string like "I'll refund you $76,000." It produces a structured tool call: refund_order(order_id, amount_cents, reason_code) where amount_cents is bounded by the order total and your refund policy, and reason_code is from an enumerated list. The AI proposes the parameters. The system validates them against typed constraints before any commitment is rendered to the customer.

The deterministic layer means that even if the AI hallucinates a refund decision in natural language, the system cannot execute the wrong thing. A refund call that exceeds policy thresholds is blocked at the tool layer, not at the model layer. A cancellation that should have been a pause is escalated to a human because the AI's reasoning chain didn't satisfy the cancellation tool's preconditions. Refunds, cancellations, order edits, subscription changes, address updates: every action your AI is allowed to take maps to a typed tool definition with policy-bounded parameters. There is no free-text path to a transaction.
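A minimal sketch of what validating the `refund_order(order_id, amount_cents, reason_code)` call could look like follows. Only the tool name and its three parameters come from the text; the `Order` type, the reason codes, the `policy_max_cents` parameter, and the `ToolCallRejected` exception are hypothetical stand-ins.

```python
from dataclasses import dataclass
from enum import Enum


class ReasonCode(Enum):
    # Hypothetical enumerated reasons; a real list would come from tenant policy.
    DAMAGED_ITEM = "damaged_item"
    LATE_DELIVERY = "late_delivery"
    WRONG_ITEM = "wrong_item"


class ToolCallRejected(Exception):
    """Raised when a proposed tool call violates a typed constraint."""


@dataclass(frozen=True)
class Order:
    order_id: str
    total_cents: int


def validate_refund_order(
    order: Order, amount_cents: int, reason_code: str, policy_max_cents: int
) -> dict:
    """Check the model's proposed parameters before anything reaches the customer."""
    try:
        reason = ReasonCode(reason_code)
    except ValueError:
        raise ToolCallRejected(f"unknown reason_code: {reason_code!r}") from None
    if isinstance(amount_cents, bool) or not isinstance(amount_cents, int):
        raise ToolCallRejected("amount_cents must be an integer")
    if amount_cents <= 0:
        raise ToolCallRejected("amount_cents must be positive")
    # The refund is bounded by both the order total and the refund policy.
    limit = min(order.total_cents, policy_max_cents)
    if amount_cents > limit:
        raise ToolCallRejected(f"amount_cents {amount_cents} exceeds limit {limit}")
    return {
        "tool": "refund_order",
        "order_id": order.order_id,
        "amount_cents": amount_cents,
        "reason_code": reason.value,
    }


order = Order(order_id="A-1001", total_cents=4_500)
approved = validate_refund_order(order, 4_500, "damaged_item", policy_max_cents=10_000)
print(approved["amount_cents"])  # 4500

try:
    # A hallucinated $76,000 refund never becomes a commitment.
    validate_refund_order(order, 7_600_000, "damaged_item", policy_max_cents=10_000)
except ToolCallRejected as err:
    print("blocked:", err)
```

The design point is that the check is ordinary code: no prompt, however adversarial, can talk it into accepting an out-of-bounds amount.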

How to test this: "Show me the tool definitions your AI uses. What are the typed parameters? What are the validation constraints? What gets blocked at the tool layer vs. at the model layer?" If the AI just generates free-text responses and your team copy-pastes the refund amount somewhere, there is no deterministic layer.

04

Human fallback. The AI must know what it doesn't know.

Confidence thresholds are explicit. When the AI's confidence drops below threshold, when the QA AI rejects a response, or when the conversation hits any of a defined set of escalation triggers, a human picks it up. The AI does not guess to keep its resolution rate up.

The deepest failure mode in customer service AI is an AI that has been incentivized, by design or by metric, to never escalate. A bot that resolves 95% of tickets sounds great until you find that 30% of "resolutions" were the customer giving up. This is the deflection-vs-resolution gap that every honest practitioner now talks about.

Richpanel measures resolution, not deflection. A conversation is resolved when the customer's actual issue is solved, validated, and confirmed, not when the customer stops talking. When the AI cannot confidently resolve, it hands off with full context: the conversation history, what the AI considered, why it escalated, what data it pulled, and what action (if any) it was about to take. The human agent picks up with everything they need, not from zero.

Escalation triggers include: low model confidence, QA AI rejection, sensitive intent detection (account closure, complaint, legal language), tool failure or ambiguity, customer explicitly asking for a human, and any case where the AI would otherwise be guessing. Each trigger is configurable per-tenant.
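The trigger list above maps naturally onto a plain rule check. The sketch below is illustrative only: the trigger categories come from the text, while the `TurnSignals` fields, the intent labels, and the 0.8 default threshold are hypothetical values standing in for per-tenant configuration.

```python
from dataclasses import dataclass

# Sensitive intents named in the text; the label strings are hypothetical.
SENSITIVE_INTENTS = frozenset({"account_closure", "complaint", "legal_language"})


@dataclass(frozen=True)
class TurnSignals:
    model_confidence: float  # 0.0 to 1.0
    qa_rejected: bool
    intent: str
    tool_failed: bool
    asked_for_human: bool


def escalation_reasons(signals: TurnSignals, confidence_threshold: float = 0.8) -> list[str]:
    """Return every trigger that fired; an empty list means the AI may proceed."""
    reasons = []
    if signals.model_confidence < confidence_threshold:
        reasons.append("low_model_confidence")
    if signals.qa_rejected:
        reasons.append("qa_rejection")
    if signals.intent in SENSITIVE_INTENTS:
        reasons.append("sensitive_intent")
    if signals.tool_failed:
        reasons.append("tool_failure")
    if signals.asked_for_human:
        reasons.append("customer_requested_human")
    return reasons


routine = TurnSignals(0.93, False, "order_status", False, False)
risky = TurnSignals(0.55, False, "account_closure", False, True)

print(escalation_reasons(routine))  # []
print(escalation_reasons(risky))
# ['low_model_confidence', 'sensitive_intent', 'customer_requested_human']
```

Returning the list of reasons rather than a bare boolean gives the human agent the "why it escalated" context described in the handoff.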

How to test this: "Show me the escalation logic in the product. What's the published confidence threshold? Can I tune the escalation triggers? What does the handoff look like from the human agent's perspective: do they see the AI's reasoning, or do they start from scratch?" The handoff fidelity is the deepest signal of whether the platform was built for resolution or for vanity-metric deflection.

The math, layer by layer

What each layer is worth.

Approximate hallucination rates at each architectural maturity, on customer-service-shaped queries (factual lookup + policy reasoning + action proposal). Each cell is a Richpanel synthesis from internal pre-launch evals across customer onboardings, calibrated against published benchmarks for the underlying technique.[4]

Architecture | Hallucination rate
Ungrounded LLM (raw model, no retrieval) | 15–30%
+ Layer 1: Naive RAG (retrieval-augmented generation)[5] | 5–10%
+ Layer 2: RAG + QA AI second-pass validation | 2–4%
+ Layer 3: + Deterministic tool execution | 1–2%
Richpanel: all four layers + human fallback | <1%

The four-layer system is not magic. Each layer makes the next one cheaper to operate. Pre-launch evaluation surfaces 80% of the failure modes before any customer sees them. QA AI catches the next tier: hallucinations the pre-launch eval didn't anticipate. Deterministic tool execution prevents the failures that are most expensive even if they're rare. Human fallback handles the long tail.

The economics work because each layer's failure mode is different. You don't need every layer to be perfect. You need every layer to fail in a different direction so a failure in one layer doesn't propagate.

What this looks like in production

Aeons, the premium UK supplements brand founded by Dr. Sarah Brewer, trusted by 100,000+ customers, ran 4,881 fully autonomous AI replies in 42 days at 4.43 / 5 CSAT, higher than their own human team's 4.25.

In autoMode, 60% of all inbound messages were answered by AI without human review. A 5–10% hallucination rate would have shown up in those 185 customer satisfaction ratings within days. It didn't; the AI's CSAT is the proof. Read the full case study →

Where the category sits

Which platforms ship which layers, as of May 2026.

Drawn from each vendor's public documentation, demo materials, and product UI. Where a vendor's capability is implicit but not separately documented in the UI, we note "implicit." Where a vendor sells the capability as a separately-priced product, we note that, because operationally, a paid-add-on you didn't buy is the same as no capability.

Platform | L1 Pre-launch eval | L2 QA AI on every response | L3 Deterministic tools | L4 Human fallback
Richpanel | Published threshold per-customer (95–99% on your historical tickets before go-live) | Native, every response, visible reasoning logs | Typed tool calls, policy-bounded parameters | Configurable triggers, full-context handoff
Intercom Fin | Publishes aggregate accuracy across customer base; per-customer eval is not visible in product | Workflow review at flow-design time; not a separate per-response check | Workflows with structured actions | Escalation to human
Ada | Reasoning Engine grounds responses in connected systems | Not separately surfaced in UI | Custom action library across systems | Yes
Decagon | Strong; emphasized for SaaS/fintech accuracy | Multi-step reasoning chains, audit trail | Authenticated action execution | Yes
Fini | "Reasoning-first" architecture validates before answering | Implicit in reasoning chain; not a separate model | Authenticated actions in Shopify, Recharge, Loop | Yes
Sierra | Experience Manager framework supports pre-launch experiments per-customer; specific accuracy thresholds not publicly disclosed | Agent OS observability layer; per-response second-model review not separately documented in UI | Structured action framework; emphasis on bring-your-own enterprise systems | Yes
Maven AGI | Limited publicly documented methodology | Generative answer engine with citation grounding; not a separate model pass | Enterprise action library across CRM, billing, and identity systems | Yes
Zendesk Advanced AI | Not publicly documented | Zendesk QA (formerly Klaus), separately priced product | Triggers + automations + actions; AI-generated text is less constrained | Yes

Each platform name links to the vendor's primary product page used to source the row. Cells reflect each platform's public documentation, product pages, and changelog as of May 2026. Where deep-link URLs (e.g. specific feature pages, changelog entries) were used in addition to the linked landing page, they are listed in Sources & References below. If your reading of any cell is materially different from current product reality, email amit@richpanel.com and we'll update it; the goal is to be accurate, not gratuitously negative. The version history of this table is published at the bottom of the page so changes are auditable.

The honest read: the category is converging on Layers 3 and 4. Most serious platforms now have deterministic actions and human fallback. The differentiation is in Layers 1 and 2: the gap between platforms that publish per-customer pre-launch thresholds and run second-pass QA on every response and platforms that gesture at "safeguards" or sell QA as a separate product is wide and growing.

The honest caveat

What no architecture can prevent, and what we do about it.

Even a perfectly designed system cannot prevent every failure mode. Honest practitioners admit these:

The honest claim is not "Richpanel never hallucinates." The honest claim is that Richpanel's architecture is designed so that the hallucinations that do occur are caught before they reach the customer, and the ones that aren't caught are bounded by deterministic guardrails. Under 1% in production is what that architecture delivers today.

For your vendor evaluation

Eight questions to ask every AI customer service vendor.

Use these in your RFP, on your demo calls, and in your reference checks. Vendors who cannot answer them concretely should be disqualified, regardless of how good the demo looks.

1. What's your published pre-launch accuracy threshold?

Demand a specific percentage and ask to see the test methodology. "We have safeguards" is not an answer.

2. Will you run a pre-launch eval on my actual historical tickets?

Any vendor that can't or won't do this is selling demos, not production.

3. Show me your QA layer in the product UI.

If quality control isn't a visible artifact in the product, with reasoning logs, thresholds, and per-response audit trail, it doesn't exist.

4. What's the difference between deflection and resolution in your reporting?

If the vendor conflates them, walk away. Deflection-optimized AI is the worst possible state.

5. Show me your tool definitions and validation constraints.

If the AI generates free-text refund amounts instead of structured tool calls with typed parameters, there's no deterministic layer.

6. Walk me through escalation in the human agent UI.

How much context does the human see? AI reasoning? Source content? Tool history? Handoff fidelity is the deepest signal of platform maturity.

7. What's your published resolution rate guarantee, and what's the remedy if you miss it?

If there's no number and no remedy, the vendor's pricing isn't aligned with your outcome. Service credits are not enough; real guarantees include termination rights and refunds.

8. Connect me with three customers similar to me who went live within your stated timeframe.

Reference customers are the most reliable signal of vendor quality. More reliable than analyst rankings, more reliable than demo polish.

Richpanel is happy to answer every one of these in writing. The product is built around the answers: pre-launch eval pipeline, visible QA AI, deterministic tools, structured escalation, named resolution guarantee with full refund remedy, and a customer reference list available on the demo call.

Frequently asked

Hallucination defense, in plain English.

Can I bring my own preferred model, or are you locked to one provider?

Multi-model is default. Richpanel runs on Claude (Opus 4.7, Sonnet 4.6, Haiku 4.5), GPT-5, and Gemini 3 Pro. Customers choose per-workspace, and we can route different ticket intents to different models when there's a clear performance reason (Claude tends to win on reasoning-heavy escalations; Gemini Flash on high-volume short-form). The QA AI runs on a different model from the response model by default, which catches a class of failures that single-model setups miss.

What does the QA AI actually see: the full conversation, or just the last message?

Full conversation, plus the source content the response AI retrieved, plus the tool calls the response AI proposed, plus the relevant SOPs and policy fragments. The QA AI is not graded on coherence alone. It's graded on whether the proposed response is factually traceable to source content. A response that's well-written but invents a policy fails QA. A response that's clumsy but factually correct passes (and gets noted for brand-voice rewriting, not blocked).

How is the pre-launch eval different from a regression test set?

It's a regression test set for production traffic. We pull a stratified sample of your historical tickets (across intent types, with PII stripped), run the AI against the customer message, and score the response against what your human agent actually wrote. The eval is automated, repeatable, and re-run whenever (a) we update the underlying model, (b) you change a policy, or (c) you onboard a new channel. We keep a holdout set untouched so we can detect drift over time independently of the training-traffic measurement.

What happens to the audit logs? Can we export them?

Yes. Every response is logged with: the customer message, the source content the AI retrieved, the proposed response, the QA AI's reasoning (approve / rewrite / escalate), the final response sent, the tool calls executed (if any), and the human handoff trail (if any). Logs are exportable via API in JSON, retained for 13 months by default, and longer on request. SOC 2 Type II covers the storage. For HIPAA workspaces, logs are tenant-isolated.
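To make the logged fields concrete, here is a hedged sketch of what one exported record might look like. The list of fields follows the answer above, but the field names, the `AuditLogEntry` type, and the JSON layout are assumptions; the actual export schema is not described in this article.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional

QA_DECISIONS = ("approve", "rewrite", "escalate")


@dataclass
class AuditLogEntry:
    """Hypothetical record shape covering the fields the text says are logged."""
    customer_message: str
    retrieved_sources: list[str]
    proposed_response: str
    qa_decision: str                 # one of QA_DECISIONS
    qa_reasoning: str
    final_response: Optional[str]    # None if nothing was sent before handoff
    tool_calls: list[dict] = field(default_factory=list)
    handoff_trail: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.qa_decision not in QA_DECISIONS:
            raise ValueError(f"unknown qa_decision: {self.qa_decision!r}")

    def to_json(self) -> str:
        return json.dumps(asdict(self), ensure_ascii=False)


entry = AuditLogEntry(
    customer_message="Where is my order?",
    retrieved_sources=["kb/shipping-times"],
    proposed_response="Your order ships within 2 business days.",
    qa_decision="approve",
    qa_reasoning="Claim matches kb/shipping-times.",
    final_response="Your order ships within 2 business days.",
)

exported = json.loads(entry.to_json())
print(exported["qa_decision"], exported["tool_calls"])  # approve []
```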

How fast is "end-to-end" with all four layers running?

Median response time is under 10 seconds end-to-end on Richpanel, including layered validation. Pre-launch evaluation runs once before go-live (not per-conversation). Retrieval, response generation, QA pass, and tool execution all happen in the live path. The QA AI runs in parallel with the response model where possible, so the marginal latency is roughly 1–2 seconds versus a single-model, no-validation setup. Worth it; the failure modes the QA AI catches would be far more expensive to fix downstream than 2 extra seconds upfront.

If the AI escalates a lot, doesn't that defeat the point of automation?

Only if you're optimizing the wrong metric. Escalation rate goes down as the CX Manager AI captures new patterns, your knowledge base improves, and pre-launch eval coverage expands, which is what every Richpanel customer has seen on a 60–120 day curve. Early-stage escalation rates of 30–40% are normal; mature deployments run 10–20%. The point is to escalate the right tickets to humans (low-confidence, policy-conflict, sensitive intent), not to never escalate. A platform that resolves 95% of tickets but where 30% of "resolutions" are customers giving up is operationally worse than a platform that resolves 65% and escalates the rest cleanly.
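The closing comparison is easier to judge with the arithmetic written out. The sketch below applies the two scenarios from the paragraph to an illustrative volume of 50,000 tickets per month (the enterprise-scale figure used earlier in this article); the assumption that the second platform has no silent abandonment follows from "escalates the rest cleanly".

```python
TICKETS = 50_000  # illustrative monthly volume

# Platform A: reports 95% "resolved", but 30% of those are customers giving up.
a_reported = TICKETS * 95 // 100        # 47,500 reported resolutions
a_abandoned = a_reported * 30 // 100    # 14,250 customers who simply gave up
a_resolved = a_reported - a_abandoned   # 33,250 genuinely resolved
a_escalated = TICKETS - a_reported      # 2,500 reached a human

# Platform B: resolves 65% and hands the rest to humans with context.
b_resolved = TICKETS * 65 // 100        # 32,500 genuinely resolved
b_escalated = TICKETS - b_resolved      # 17,500 reached a human
b_abandoned = 0                         # assumption: clean escalation, nobody dropped

print(a_resolved, a_abandoned, a_escalated)  # 33250 14250 2500
print(b_resolved, b_abandoned, b_escalated)  # 32500 0 17500
```

Genuine resolutions are nearly identical (33,250 versus 32,500). The difference is that the first platform leaves 14,250 customers per month with an unsolved issue and no human in the loop.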

How we measure what we measure

Methodology, in detail.

Every number in this article is either cited to a public benchmark or generated by the methodology below. We publish this so the rate-progression table and the <1% production claim are auditable, not aspirational.

Pre-launch eval pipeline

Sample size
Stratified random sample of 1,000–10,000 historical resolved tickets per tenant, weighted by intent category so rare intents (refund disputes, account closures, sensitive complaints) aren't under-represented.
What gets scored
Five dimensions per response, each on a 0–3 rubric: (1) factual accuracy: every claim must trace to retrieved source content; (2) policy compliance: the response respects published policy as of the ticket date; (3) brand voice fit; (4) tool-call correctness: if an action was proposed, was it the right action with the right parameters; (5) escalation appropriateness: did the AI escalate when it should have, and not when it shouldn't.
Who labels
Two-stage: an LLM judge (running on a different model family from the response model) produces an initial score and natural-language reasoning. A human SME from the customer's own CX team reviews 100% of the LLM judge's “low confidence” cases and 10% of “high confidence” cases as a calibration check. Inter-rater agreement is logged per cohort.
What counts as a “hallucination”
A response containing one or more factual claims that cannot be traced to retrieved source content (knowledge base article, policy doc, prior conversation, structured tenant data), regardless of whether the claim happens to be correct. A response that confidently states the correct refund window without citing a policy source is scored as hallucination, because the architecture is the test, not the lucky guess.
Holdout discipline
10% of the historical ticket sample is set aside as a sealed holdout, never used for prompt tuning, knowledge-base shaping, or eval-set training. The holdout is re-scored on every model update and policy change, so drift is detected independently of training-traffic measurement.
What we count as the “production rate”
The <1% production figure is the QA AI's flagged-hallucination rate on the live response stream after Layer 2 review, sampled across all live Richpanel tenants over the trailing 30 days. It is not a self-reported number from a single account.
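The holdout discipline described above can be sketched as a seeded, per-intent split. This is an assumption-laden illustration: the 10% share comes from the text, but stratifying the holdout by intent, guaranteeing at least one holdout ticket per intent, and the ticket dictionary shape are choices made for the sketch.

```python
import random
from collections import defaultdict


def split_holdout(tickets: list[dict], holdout_fraction: float = 0.10, seed: int = 0):
    """Split tickets into (working, holdout), drawing the holdout share per intent."""
    rng = random.Random(seed)  # fixed seed so the sealed holdout is reproducible
    by_intent = defaultdict(list)
    for ticket in tickets:
        by_intent[ticket["intent"]].append(ticket)

    working, holdout = [], []
    for intent in sorted(by_intent):
        group = list(by_intent[intent])
        rng.shuffle(group)
        # Assumption: keep at least one ticket per intent in the holdout
        # so rare intents stay represented when drift is measured.
        k = max(1, round(len(group) * holdout_fraction))
        holdout.extend(group[:k])
        working.extend(group[k:])
    return working, holdout


tickets = [{"id": i, "intent": "order_status"} for i in range(90)]
tickets += [{"id": 90 + i, "intent": "refund_dispute"} for i in range(10)]

working, holdout = split_holdout(tickets)
print(len(working), len(holdout))  # 90 10
print(sorted({t["intent"] for t in holdout}))  # ['order_status', 'refund_dispute']
```

Only the `working` set would be used for prompt tuning and knowledge-base shaping; the `holdout` set is re-scored on model updates and policy changes.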

The rate-progression table earlier in this article uses approximate ranges (e.g. “2–4%”) rather than point estimates because the same architecture performs differently on different tenants. A heavy-policy business (insurance, healthcare) hallucinates differently from a heavy-catalog business (apparel, supplements). The ranges bracket what we've observed across both.

Sources & references

Where the numbers come from.

Inline citations [1]–[5] in this article map to the entries below. Each entry links to the public artifact used; if a link breaks, the citation text identifies the work so you can find the current canonical URL.

  1. Vectara Hughes Hallucination Evaluation Model (HHEM) Leaderboard. Public, continuously-updated benchmark of hallucination rates for major LLMs on a standardized summarization task. Establishes that even frontier models hallucinate at non-zero rates. github.com/vectara/hallucination-leaderboard
  2. Galileo AI, Hallucination Index. Cross-model hallucination benchmark across Q&A, summarization, and RAG tasks. Cited for the observation that hallucination rates vary widely (3–44%) across model families and task types. galileo.ai
  3. Dahl, Magesh, Surani, Ho (2024). “Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models.” Stanford HAI / RegLab. Quantifies hallucination rates of 58–82% on legal queries by ungrounded LLMs, cited as the upper-bound illustration of how badly ungrounded models behave on policy-heavy domains. hai.stanford.edu
  4. Richpanel internal pre-launch evaluations (2024–2026). The 15–30% customer-service hallucination range and the rate-progression table are synthesized from pre-launch evals run across new-customer onboardings, using the methodology disclosed in this article's “Methodology” section. Underlying tenant data is anonymized and bound by NDA; aggregate ranges are publishable. Methodology audit available on request via amit@richpanel.com.
  5. Lewis et al. (2020). “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” The foundational RAG paper from Meta AI. Cited as the canonical reference for what “Layer 1: Naive RAG” means in the rate-progression table. arxiv.org/abs/2005.11401

Version history: v1.1 (2026-05-18) added named author, inline citations, methodology disclosure, sources list, and Aeons production data point; v1.0 (2026-05-17) was the initial publication.

See the four-layer system running on your business.

30 minutes. We connect Richpanel to your actual store, run our pre-launch eval against 100 of your historical tickets, and show you the per-response accuracy plus the QA AI's reasoning on every one. No slide deck.

Book my 30-min demo →