AI Infrastructure January 9, 2026 9 min read By Maya Chen

Your AI Passed Every Eval and Still Fell Apart in Production. Here Is Why.

Evaluation datasets are clean by design. Production data is anything but. The gap between those two realities is where almost every incident in your AI deployment actually lives.

Last fall I consulted with a regional medical billing company on a pilot of an AI coding assistant. They ran the model against 90 days of historical encounter data. The eval results looked great. CPT code accuracy at 96 percent. ICD-10 specificity at 94 percent. The team got the go-ahead. Production rollout was set for the following month.

Three weeks in, accuracy was at 71 percent.

The model had not regressed. Nothing about it had changed between the eval set and production. What changed was the data.

Their evaluation set was built from clean, completed encounters that had already been through human coding review at least once. Every field filled. Every diagnosis documented in the structured fields the model was trained against. Every chart note complete.

Production was nothing like that. Encounters showed up with half-completed assessments. Diagnoses tucked into free-text fields the model had never been trained to read carefully. Contradictory entries where the medical assistant's note and the physician's structured diagnosis pointed to different conditions.

The model did what language models do when they hit missing or contradictory information. It improvised. It picked the statistically likely answer. It built a confident-sounding code. The audit caught it. The audit always catches it eventually. The question is how long that takes and what it costs you in the meantime.

This is the most common failure pattern for production AI, and it is the most preventable one. Pretty test data gives you pretty dashboards. Ugly real data is where your incidents actually live.

The core idea: A model that hits 95 percent on a clean evaluation set might hit 70 percent on production data that carries the missing fields, formatting noise, and contradictions of the real environment. The 25-point gap does not show up until the system is live, users are relying on it, and the eval metrics have already been presented to leadership.

Why Eval Data Lies About Production Performance

Evaluation datasets are curated. Almost always. By design.

They get built from completed, reviewed, clean records because clean records are the only ones with reliable ground truth. You need ground truth to score. You cannot score a model against data whose correct answer nobody is sure of.

That requirement creates a sampling bias most teams underestimate. The records that make it into an eval set are the ones that survived a human's attention long enough to be cleaned. Production data is everything else.

In the billing pilot, the eval set came from encounters that had already been through coding review. Those encounters had a documentation completeness rate above 90 percent. Production data had a completeness rate around 60 percent. The model was being asked to do the same task on a dataset that was, in measurable ways, a different dataset.

This bias is invisible to most evaluation pipelines because the pipeline reports the score against the evaluation set, not against any sample from production. The reported metric describes a version of reality the model will not encounter once it is live.
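One way to make the bias visible is to measure completeness on both sides before trusting the score. What follows is a minimal sketch, not the pilot's actual tooling. The function names, the record shape (a dict of field name to string or None), and the choice of required fields are all assumptions for illustration.

```python
from typing import Mapping, Optional, Sequence

# Assumed record shape: field name -> value, where None or "" means "not documented".
Record = Mapping[str, Optional[str]]


def completeness_rate(records: Sequence[Record], required_fields: Sequence[str]) -> float:
    """Fraction of required field slots that are populated across all records."""
    total_slots = len(records) * len(required_fields)
    if total_slots == 0:
        return 0.0
    filled = sum(
        1
        for record in records
        for name in required_fields
        if (record.get(name) or "").strip()
    )
    return filled / total_slots


def completeness_gap(
    eval_records: Sequence[Record],
    prod_records: Sequence[Record],
    required_fields: Sequence[str],
) -> float:
    """Eval completeness minus production completeness; a large positive value
    suggests the eval set is cleaner than what the model will see live."""
    return completeness_rate(eval_records, required_fields) - completeness_rate(
        prod_records, required_fields
    )
```

Run it on the eval set and on a raw, unreviewed production sample with the same required fields. If the two numbers are far apart, the reported accuracy is describing a different dataset.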

The Silent Improvisation Problem

When a language model hits incomplete data in production, it does not stop and ask. It does not flag uncertainty. It improvises.

A missing field gets filled with the most statistically likely value based on surrounding context. A contradiction gets resolved by whichever signal the model weights more heavily. A free-text note that does not match the structured format the model was trained on gets interpreted as best the model can.

This improvisation is silent. No error logged. No alert fired. The model generates a response and serves it with the same confidence it shows when the data is complete and clean.

The system looks like it is performing. Responses are fast. Error rates are low. The dashboard is green. Meanwhile, the model is guessing on a meaningful percentage of its inputs, and nobody is counting how often or how badly. In medical billing specifically, the patient or the payer almost always finds the wrong code before the AI team does.

25 pts: typical accuracy gap between clean evaluation data and the same model running against the messy production data of the same domain. The gap is the cost of evaluating on what is easy to evaluate.

How Do You Evaluate LLMs on Real-World Data?

The fix is uglier test data. Before any AI system goes to production, its evaluation dataset should include the same distribution of data-quality problems the production environment actually contains.

Missing fields, in the patterns real users actually leave them. If 40 percent of your intake forms have a blank insurance field in production, 40 percent of your eval set should have that field blank. And the missing-ness should be patterned the same way. More frequent for self-pay patients, less for established commercial patients. Random uniform blanks do not test the same thing.

Formatting noise. Abbreviations, misspellings, mixed date formats, inconsistent capitalization. The kind of variation that comes from real data entry at scale. The model needs to be evaluated against the messy formats it will see, not the structured formats the eval set defaults to.

Contradictory entries. The same data point recorded differently across different systems. A patient name spelled three ways across three records. A diagnosis present in the assessment but absent in the structured field. The eval set needs to include these so the score reflects the model's ability to disambiguate, not just its ability to copy.

Free-text noise. Real notes written by real people under real time pressure. Not the cleaned-up exemplars in training documentation. The kind of notes where "pt c/o SOB x3d, neg sob on exam, w/u in progress" is a complete sentence to the person who wrote it.

If your evaluation dataset does not include these conditions, your metrics describe a reality your AI will never face. The testing framework guide walks through how to actually build datasets this way.
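Two of the conditions above, patterned missing fields and formatting noise, can be sketched as a degradation pass over clean eval records. This is an illustrative sketch under stated assumptions, not the method from the testing framework guide. The function names, the segment key, and any blank rates you pass in are hypothetical. Real rates should be measured from your own production sample.

```python
import random
from typing import Dict, List, Mapping, Optional, Sequence

Record = Dict[str, Optional[str]]


def blank_field_by_segment(
    records: Sequence[Mapping[str, Optional[str]]],
    field_name: str,
    segment_key: str,
    blank_rates: Mapping[str, float],
    seed: int = 0,
) -> List[Record]:
    """Return copies of records with field_name blanked at a per-segment rate.

    blank_rates maps a segment value (e.g. a payer type) to the probability
    that the field is left empty for records in that segment. Segments that
    are not listed keep the field untouched.
    """
    rng = random.Random(seed)  # seeded so the degraded eval set is reproducible
    degraded: List[Record] = []
    for record in records:
        copy: Record = dict(record)  # never mutate the clean ground-truth record
        rate = blank_rates.get(copy.get(segment_key) or "", 0.0)
        if rng.random() < rate:
            copy[field_name] = None
        degraded.append(copy)
    return degraded


def add_case_noise(text: str, rng: random.Random, rate: float = 0.3) -> str:
    """Randomly upper- or lower-case whole words to mimic inconsistent data entry."""
    noisy_words = []
    for word in text.split(" "):
        if rng.random() < rate:
            word = word.upper() if rng.random() < 0.5 else word.lower()
        noisy_words.append(word)
    return " ".join(noisy_words)
```

The ground-truth labels stay attached to the clean originals, so the degraded copies can still be scored. The point of the per-segment rates is the one made above: uniform random blanks do not test the same thing as blanks that follow the pattern real users leave.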

What Is Data Drift in AI Systems?

Even if your initial evaluation dataset is production-realistic, the production data will shift over time. User behavior changes. New data sources are added. Business processes evolve. Seasonal patterns alter input distributions. New product or service categories introduce data the model was never evaluated against.

This shift is data drift. It is not a failure of the model. It is a feature of the environment.

A model calibrated for Q1 data may underperform on Q3 data because the seasonal mix of encounters changed. A code change in the EHR may alter which fields are populated and which are left blank. A new payer contract may add a class of encounter the eval set never represented.

Continuous evaluation against production samples is the only way to catch drift before it shows up in the experience users actually have. Periodic re-evaluation using real production data, not the original clean eval set, keeps the reported accuracy honest about current conditions.
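A cheap first signal for drift is whether the blank rate of each field has moved between a baseline window and a recent one. The sketch below assumes records are dicts and uses a hypothetical 10-point threshold. It watches one narrow kind of drift (field population) and is not a substitute for re-scoring the model on labeled production samples.

```python
from typing import Dict, Mapping, Optional, Sequence

Record = Mapping[str, Optional[str]]


def blank_rate(records: Sequence[Record], field_name: str) -> float:
    """Share of records where field_name is missing, None, or whitespace."""
    if not records:
        return 0.0
    blanks = sum(1 for r in records if not (r.get(field_name) or "").strip())
    return blanks / len(records)


def drifted_fields(
    baseline: Sequence[Record],
    recent: Sequence[Record],
    field_names: Sequence[str],
    threshold: float = 0.10,  # hypothetical cutoff; tune to your own data
) -> Dict[str, float]:
    """Map each field whose blank rate moved by more than threshold
    to the signed change (recent rate minus baseline rate)."""
    changes: Dict[str, float] = {}
    for name in field_names:
        delta = blank_rate(recent, name) - blank_rate(baseline, name)
        if abs(delta) > threshold:
            changes[name] = delta
    return changes
```

A field that suddenly goes blank more often, say after an EHR change, shows up here before it shows up in an audit.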

Why Do LLMs Hallucinate When They Lack Information?

Language models are trained to produce coherent, complete responses. When they hit missing information, they fill the gap with generated content that is statistically plausible but not grounded in actual data.

That behavior is a feature during creative tasks and a liability during factual ones. The model does not internally distinguish between "I am generating because I have the data to support it" and "I am generating because I need to produce a complete response and I am filling the gap."

Handling out-of-distribution data in LLM applications takes explicit uncertainty handling. The model needs to be instrumented to detect when it is operating with insufficient context, and then either flag that uncertainty to the operator or hand the case to a human reviewer instead of improvising silently.
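One simple form of that instrumentation is a rule-based gate in front of the model. To be clear, this is a swapped-in simplification: it checks the input for missing or conflicting fields and does not measure the model's own uncertainty. The field names, the `GateDecision` type, and the routing labels are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Mapping, Optional, Sequence, Tuple


@dataclass
class GateDecision:
    route: str  # "model" or "human_review"
    reasons: List[str] = field(default_factory=list)


def _normalized(encounter: Mapping[str, Optional[str]], name: str) -> str:
    """Lowercased, trimmed value; empty string when the field is absent."""
    return (encounter.get(name) or "").strip().lower()


def gate_encounter(
    encounter: Mapping[str, Optional[str]],
    required_fields: Sequence[str],
    paired_fields: Sequence[Tuple[str, str]] = (),
) -> GateDecision:
    """Send an encounter to human review when required context is missing
    or when two fields that should agree do not."""
    reasons: List[str] = []
    for name in required_fields:
        if not _normalized(encounter, name):
            reasons.append(f"missing:{name}")
    for left, right in paired_fields:
        a, b = _normalized(encounter, left), _normalized(encounter, right)
        if a and b and a != b:  # only a conflict when both sides are present
            reasons.append(f"conflict:{left}!={right}")
    return GateDecision("human_review" if reasons else "model", reasons)
```

The exact-match comparison is deliberately crude. In practice two notes can describe the same condition in different words, so a real check would need a mapping step before comparing. The design choice that matters is that the reasons are recorded, so an operator can see why a case was handed off.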

Logging the Improvisation

Even with production-realistic evaluation, the model will hit data conditions in production that were not in any test set. Real data is infinite in its variety. There will always be cases the eval set did not cover.

The safeguard is logging when the model improvises. Every time it fills a missing value instead of flagging it. Every time it resolves a contradiction by choosing one signal. Every time it interprets messy text that does not match training format. Those moments need to be captured.

If you can see when the model is guessing, you can tune it. Find the data conditions that cause the most improvisation. Flag low-confidence completions for human review. Build a feedback loop that improves the model's handling of messy data over time. Claire's billing and intake workflows log every field the model filled from inference rather than from a confirmed source, with a confidence score and a lineage trace back to whatever evidence (or lack of evidence) triggered the inference. Operators can sort by improvisation rate and find the specific conditions that need tightening. That sort is the difference between a 25-point production gap that surprises everyone and a 5-point gap nobody is surprised by.
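The shape of such a log can be sketched in a few lines. This is a hypothetical structure, not Claire's implementation. The `FieldEvent` fields, the condition labels, and the 0.8 review threshold are all assumptions for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class FieldEvent:
    encounter_id: str
    field_name: str
    source: str        # "confirmed" (read from the record) or "inferred" (model filled it)
    confidence: float  # assumed to be supplied by the serving layer, 0.0 to 1.0
    condition: str     # data condition at inference time, e.g. "missing_field"
    evidence: Optional[str] = None  # lineage: what text, if any, supported the value


def improvisation_rate_by_condition(events: Iterable[FieldEvent]) -> Dict[str, float]:
    """For each data condition, the share of logged fields the model inferred."""
    totals: Dict[str, int] = defaultdict(int)
    inferred: Dict[str, int] = defaultdict(int)
    for event in events:
        totals[event.condition] += 1
        if event.source == "inferred":
            inferred[event.condition] += 1
    return {condition: inferred[condition] / totals[condition] for condition in totals}


def needs_review(events: Iterable[FieldEvent], min_confidence: float = 0.8) -> List[FieldEvent]:
    """Inferred fields whose confidence falls below the review threshold."""
    return [e for e in events if e.source == "inferred" and e.confidence < min_confidence]
```

Sorting the output of `improvisation_rate_by_condition` from highest to lowest gives the list of data conditions to tighten first.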

Pretty test data gives you pretty dashboards. Ugly real data is where your incidents live. Test on the ugly data before you deploy. Log the improvisation after you deploy. And never confuse clean evaluation metrics with production readiness.


Maya Chen is the voice behind Maya Builds AI, a video and podcast series on enterprise AI infrastructure for the people building and operating these systems. Three new videos a week on YouTube. The podcast lands weekly on Spotify and Apple Podcasts. For the downstream consequences when the gap is not caught, this is what billing AI hallucinations look like in production.