Three real incidents that cost real money.
These are not edge cases. They are the predictable failure mode of LLMs deployed without architectural validation.
The chatbot invented a bereavement refund policy. The tribunal made the airline pay it.
A customer asked Air Canada's chatbot about bereavement fares. The bot fabricated a policy that allowed retroactive bereavement refunds, a policy Air Canada had never offered. When the customer was later denied, the tribunal ruled that Air Canada was bound by what its chatbot promised. The airline tried to argue the chatbot was a "separate legal entity." The tribunal disagreed. The lesson is older than AI: your customer-facing systems are extensions of your contract with customers, and what they say creates liability.
A customer asked the bot for help. The bot wrote a poem insulting DPD.
DPD's customer service chatbot, running on a stock LLM with insufficient guardrails, was prompted by a frustrated customer and produced a poem calling DPD "the worst delivery firm in the world." The exchange went viral, the bot was pulled offline, and DPD spent the following news cycles answering for it. The technical cause was prompt injection combined with a lack of output validation. The reputational cost has never been disclosed, but it was likely material.
The bot agreed to sell a $76,000 SUV for one dollar: "legally binding, no takesies-backsies."
A user trolled a Chevrolet dealership's customer service chatbot until it agreed to sell a 2024 Chevy Tahoe for $1. The bot even confirmed the deal was "legally binding." The dealership did not honor it, but the bot was offline within hours and the dealership's reputation took a hit that persists in search results to this day. The root cause was an ungrounded LLM with no transaction-validation layer: the bot could generate any commercial commitment in fluent English, and nothing checked that commitment before it reached the customer.
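To make "transaction-validation layer" concrete, here is a minimal sketch of a check that sits between the model and the customer. Everything in it is an illustrative assumption rather than any vendor's actual implementation: the names ProposedOffer, LIST_PRICES, and validate_offer are hypothetical, the 10% maximum discount is an invented business rule, and the list price simply reuses the $76,000 figure from the incident above.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ProposedOffer:
    """A commercial commitment extracted from the model's draft reply (hypothetical type)."""
    sku: str
    price: Decimal


# Hypothetical system of record. In practice this would be a pricing service
# or database, never the model's own output.
LIST_PRICES = {"tahoe-2024": Decimal("76000")}


def validate_offer(
    offer: ProposedOffer,
    list_prices: dict[str, Decimal],
    max_discount: Decimal = Decimal("0.10"),  # assumed business rule
) -> tuple[bool, str]:
    """Return (allowed, reason). Block any offer the business never authorized."""
    list_price = list_prices.get(offer.sku)
    if list_price is None:
        return False, "unknown SKU"
    floor = list_price * (1 - max_discount)
    if offer.price < floor:
        return False, f"price {offer.price} is below the authorized floor {floor}"
    return True, "ok"


# The $1 offer is rejected before it is ever shown to the customer.
allowed, reason = validate_offer(ProposedOffer("tahoe-2024", Decimal("1")), LIST_PRICES)
print(allowed, reason)
```

The design point is that the decision comes from deterministic code reading a system of record, so the model's fluency has no bearing on whether a commitment is released.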
The pattern is identical in every case. An ungrounded generative model produced fluent, confident, factually fabricated content. A customer relied on it. The company carried the cost. None of these failures were detected pre-deployment because none of these companies treated hallucination prevention as a system. They treated it as a model property ("the model is smart, it'll mostly get it right") and accepted the residual error rate as a normal cost of doing business.
Published research finds that ungrounded LLMs hallucinate at meaningfully non-zero rates across task types: 3–25% on summarization (Vectara HHEM leaderboard[1]), 3–44% on Q&A and RAG tasks depending on model (Galileo Hallucination Index[2]), and 58–82% on legal queries (Stanford HAI / RegLab[3]). For customer service specifically (a mix of factual lookup, policy reasoning, and action proposal), we observe 15–30% hallucination rates on ungrounded LLMs in pre-launch evaluations across new customers, depending on query complexity[4]. At enterprise scale (say, 50,000 tickets per month), that is 7,500 to 15,000 hallucinated responses per month. Most are harmless. Some are not. The math is brutal: even a 0.1% material-error rate across that ticket volume produces 50 potential reputation incidents per month.
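The arithmetic behind those volume figures can be checked in a few lines. This is a minimal sketch using only the numbers quoted above; the function name expected_count is a hypothetical helper, and the 0.1% material-error rate is applied to total ticket volume, as the paragraph's own figure of 50 implies.

```python
def expected_count(tickets_per_month: int, rate: float) -> int:
    """Expected number of affected tickets per month at a given error rate."""
    return round(tickets_per_month * rate)


TICKETS_PER_MONTH = 50_000  # enterprise-scale example from the text

# 15-30% hallucination rate observed on ungrounded LLMs
low = expected_count(TICKETS_PER_MONTH, 0.15)
high = expected_count(TICKETS_PER_MONTH, 0.30)

# 0.1% of all tickets resulting in a material error
material = expected_count(TICKETS_PER_MONTH, 0.001)

print(f"Hallucinated responses per month: {low:,} to {high:,}")
print(f"Material incidents per month at 0.1%: {material}")
```

Substituting your own ticket volume and measured rates gives a rough planning number. It is an expected value, not a forecast of any particular month.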