Conversation State in Enterprise AI: Why Air Canada Was Liable for Its Chatbot and the Architecture That Prevents It
Legal and Technical Reference
Moffatt v. Air Canada: What the Chatbot Actually Got Wrong
In November 2022, Jake Moffatt, traveling to his grandmother's funeral, purchased an Air Canada flight after the airline's chatbot informed him that he could apply for a bereavement fare discount within 90 days of the original ticket purchase, even if he booked at the regular price first. He bought the ticket at full price, expecting to claim the bereavement discount retroactively as the chatbot had suggested. Air Canada later refused the discount, stating that bereavement fares must be requested at the time of booking.
The BC Civil Resolution Tribunal's February 2024 decision ordered Air Canada to pay $812.02 CAD (approximately $650 in damages, plus pre-judgment interest and tribunal fees), holding that the chatbot's incorrect statement was a negligent misrepresentation by Air Canada. The tribunal's analysis is significant for enterprise AI deployments: it rejected Air Canada's argument that the chatbot was a separate legal entity whose statements were not binding. The organization is responsible for what its AI systems communicate to customers.
The immediate technical failure was not context loss in the traditional sense — the chatbot appears to have retrieved and stated incorrect policy information. But the case illustrates the broader category of AI customer-communication failure that context management problems contribute to: a chatbot that cannot maintain context across a conversation may contradict itself between turns, apply policies inconsistently depending on how questions are framed, and fail to honor commitments or statements it made earlier in the conversation.
The Context Window Problem and Memory Architecture Patterns
Every LLM has a context window — the maximum number of tokens (roughly: words and word-fragments) it can process in a single inference call. The context window defines the "working memory" available to the model: everything the model knows about the current conversation must fit within this window. When a conversation exceeds the context window, earlier parts of the conversation must be dropped — and with them, any context, commitments, or information from those earlier turns.
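The trimming behavior this paragraph describes can be sketched directly. The snippet below drops the oldest turns until the remainder fits a token budget; the 4-characters-per-token estimate is a rough heuristic for illustration, and a production system would use the model's actual tokenizer (e.g. tiktoken for OpenAI models).

```python
# Sketch: drop oldest turns when a conversation exceeds a token budget.
# Token counts are approximated as len(text) // 4 for illustration only;
# real systems should count with the model's tokenizer.

def estimate_tokens(text: str) -> int:
    """Rough heuristic: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def fit_to_window(turns: list[str], budget: int) -> list[str]:
    """Keep the most recent turns whose combined estimate fits the budget."""
    kept, used = [], 0
    for turn in reversed(turns):          # walk newest-first
        cost = estimate_tokens(turn)
        if used + cost > budget:
            break                         # everything older is dropped
        kept.append(turn)
        used += cost
    return list(reversed(kept))           # restore chronological order

history = ["Hi, I need a bereavement fare." * 3,
           "Sure, tell me your travel dates." * 3,
           "March 3rd, returning March 10th."]
trimmed = fit_to_window(history, budget=30)
```

With a 30-token budget only the newest turn survives; the earlier turns, and anything the assistant committed to in them, are gone.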
Context windows have grown dramatically: GPT-3.5-turbo launched with 4,096 tokens; GPT-4-turbo reaches 128,000 tokens; Claude 3's largest models support 200,000 tokens; Gemini 1.5 Pro supports 1,000,000 tokens. These larger windows reduce but do not eliminate the context management problem in enterprise conversational AI. Long enterprise interactions — multi-session support threads, complex multi-step workflows, extended sales interactions — can exceed even 200,000 token windows. And cost is a factor: inference cost scales with context window size, so larger contexts significantly increase operating costs.
Three Primary Memory Architecture Patterns
Production enterprise conversational AI systems use three complementary memory patterns:
Pattern 1: In-Context Memory (Conversation Buffer)
The simplest pattern: maintain the full conversation history in the context window for each inference call. This provides perfect recall of everything in the window, but recall is capped by window size, cost scales linearly with conversation length, and there is no persistence across sessions (when the context window resets, all context is lost). LangChain's ConversationBufferMemory implements this pattern. For short, single-session interactions that fit comfortably in the window, this is sufficient — but for enterprise use cases with extended conversations or multi-session requirements, it fails.
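A minimal sketch of the buffer pattern, illustrative rather than LangChain's actual implementation: every turn is appended and the entire history is replayed on each inference call.

```python
# Minimal in-context buffer: the full history is replayed on every call.
class ConversationBuffer:
    def __init__(self):
        self.turns: list[dict] = []

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def as_prompt_messages(self) -> list[dict]:
        # Everything said so far is sent verbatim with each inference call,
        # so per-call cost grows with conversation length and nothing
        # survives a session reset.
        return list(self.turns)

buf = ConversationBuffer()
buf.add("user", "Can I claim a bereavement fare after booking?")
buf.add("assistant", "Let me check that policy for you.")
messages = buf.as_prompt_messages()
```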
Pattern 2: Vector Database Memory (Semantic Search)
Rather than keeping the full conversation in context, key information from past conversation turns is embedded as vectors and stored in a vector database (Pinecone, Weaviate, Chroma, pgvector in PostgreSQL). At each inference call, the current query is embedded and used to retrieve semantically relevant past context. This provides effectively unlimited memory depth — the vector database can contain years of interaction history. The limitation is retrieval precision: semantic search retrieves contextually similar content, but may miss specific factual statements from past turns that are not semantically related to the current query. LangChain's VectorStoreRetrieverMemory implements this pattern.
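The retrieval flow can be illustrated with a toy stand-in: here a bag-of-words counter plays the role of an embedding model, and a sorted list plays the role of Pinecone or pgvector. The mechanics are the same — embed the query, return the stored turns with the highest cosine similarity — and the example also hints at the precision limitation, since matching depends on semantic (here, lexical) overlap.

```python
import math
from collections import Counter

# Toy semantic memory: bag-of-words "embeddings" and cosine similarity
# stand in for a real embedding model plus vector database.

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class VectorMemory:
    def __init__(self):
        self.store: list[tuple[Counter, str]] = []

    def add(self, text: str) -> None:
        self.store.append((embed(text), text))

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.store, key=lambda item: cosine(q, item[0]),
                        reverse=True)
        return [text for _, text in ranked[:k]]

mem = VectorMemory()
mem.add("user prefers morning appointments")
mem.add("user confirmed email address on file")
hits = mem.retrieve("what time of day does the user prefer appointments")
```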
Pattern 3: Session Store Memory (Key-Value with TTL)
Redis or other key-value stores provide fast, persistent session storage for conversation metadata: user identity confirmed in this session, workflow stage (which step of a multi-step process the conversation is at), explicitly stated user preferences, and commitments made by the AI system. Redis enables sub-millisecond access to session state and TTL-based automatic expiration of sessions. This pattern is complementary to vector memory: session stores handle structured state (the workflow is at Step 3, the user has confirmed their email) while vector stores handle unstructured semantic context (the user mentioned preferring morning appointments two messages ago).
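The TTL semantics can be sketched as follows. A plain dict models what Redis provides natively (in redis-py, roughly `r.hset(...)` plus `r.expire(key, ttl)`); the structured fields shown — workflow stage, verified identity, AI commitments — are illustrative names, not a standard schema.

```python
import time

# Sketch of TTL-based session state; a dict stands in for Redis.
class SessionStore:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._data = {}  # session_id -> (expires_at, state)

    def put(self, session_id: str, state: dict) -> None:
        self._data[session_id] = (time.monotonic() + self.ttl, state)

    def get(self, session_id: str):
        entry = self._data.get(session_id)
        if entry is None:
            return None
        expires_at, state = entry
        if time.monotonic() > expires_at:   # session expired: drop it
            del self._data[session_id]
            return None
        return state

store = SessionStore(ttl_seconds=1800)
store.put("sess-42", {"workflow_step": 3,
                      "email_verified": True,
                      "commitments": ["quoted bereavement fare policy"]})
state = store.get("sess-42")
```

Tracking AI commitments as an explicit field, as sketched here, is what makes the consistency checks in the audit checklist below testable at all.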
Context Compression: Managing Token Budgets in Long Conversations
Context compression is the process of reducing the token count of conversation history while preserving the information necessary for coherent continuation. Several compression strategies are used in production enterprise AI systems:
Summarization-Based Compression
When the conversation history approaches the context window limit, earlier portions of the conversation are summarized into a compact representation. The summary replaces the original turns in the context window, freeing space for new turns. LangChain's ConversationSummaryMemory and ConversationSummaryBufferMemory implement this pattern. The risk is summarization fidelity: specific factual statements, commitments, or numbers mentioned earlier in the conversation may be generalized or omitted in the summary. For enterprise applications where specific commitments were made (as in the Air Canada case), summarization may compress out the very context needed to maintain consistency.
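The mechanics, and the fidelity risk, can be shown with a sketch. The summarizer here is a trivial stub (first eight words of each turn) standing in for the LLM call that LangChain's ConversationSummaryBufferMemory would make; in a real system, that LLM call is exactly where a specific commitment can be generalized away.

```python
# Sketch of summary-buffer compression: once history exceeds a limit,
# the oldest turns are collapsed into a single summary "turn".
# stub_summarize is a placeholder for an LLM summarization call.

def stub_summarize(turns: list[str]) -> str:
    return "Summary: " + " / ".join(" ".join(t.split()[:8]) for t in turns)

def compress(turns: list[str], keep_recent: int) -> list[str]:
    if len(turns) <= keep_recent:
        return turns
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    return [stub_summarize(older)] + recent

history = [
    "user: Can I apply for the bereavement discount after I book at full price?",
    "assistant: Yes, you can apply within 90 days of purchase.",  # the commitment
    "user: Great, I will book now.",
    "assistant: Anything else I can help with?",
]
compressed = compress(history, keep_recent=2)
```

In this toy case the "90 days" commitment happens to survive; a real LLM summary offers no such guarantee, which is the motivation for the selective retention pattern below.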
Selective Retention
Rather than summarizing uniformly, selective retention identifies high-importance turns — those containing specific commitments, numerical information, user decisions, or conflict-prone statements — and retains those verbatim while summarizing lower-importance turns. Importance scoring can be implemented using a classifier model or heuristic rules (messages containing dollar amounts, specific dates, policy references, or user confirmations are high importance). This approach better preserves the factual consistency that the Air Canada case makes clear is legally necessary.
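A minimal heuristic scorer might look like the following. The regex patterns (dollar amounts, dates, policy references, confirmation language) are illustrative; a production system would tune them for its domain or replace them with a classifier model.

```python
import re

# Heuristic importance scoring: turns matching any pattern are retained
# verbatim; the rest are eligible for summarization.
HIGH_IMPORTANCE = [
    re.compile(r"\$\d"),                         # dollar amounts
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),  # dates like 3/10/2024
    re.compile(r"\bpolicy\b", re.IGNORECASE),    # policy references
    re.compile(r"\b(confirm|agree|promise)", re.IGNORECASE),
]

def is_high_importance(turn: str) -> bool:
    return any(p.search(turn) for p in HIGH_IMPORTANCE)

def partition(turns: list[str]) -> tuple[list[str], list[str]]:
    """Split turns into (retain verbatim, eligible for summarization)."""
    keep = [t for t in turns if is_high_importance(t)]
    summarize = [t for t in turns if not is_high_importance(t)]
    return keep, summarize

keep, summarize = partition([
    "assistant: Our bereavement policy allows a refund within 90 days.",
    "user: Thanks, that helps.",
    "assistant: The discounted fare is $420.",
])
```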
Entity Extraction and Tracking
Entity extraction identifies and tracks named entities mentioned in the conversation: people, organizations, dates, monetary amounts, policy references, locations. These entities are stored in a structured format in the session store and injected into the context at each inference call regardless of summarization. Entity tracking ensures that even highly summarized conversation history preserves the factual anchors necessary for consistent AI responses about specific policy applications, commitments, or user-specific circumstances.
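A minimal tracker for two entity types is sketched below: regexes pull monetary amounts and ISO-style dates out of each turn and accumulate them in structured state that can be injected into every prompt regardless of summarization. Production systems would typically combine an NER model (e.g. spaCy) with domain-specific patterns and cover more entity types.

```python
import re

# Minimal entity tracker: amounts and dates are extracted per turn and
# accumulated in session state so they survive history compression.
MONEY = re.compile(r"\$\d+(?:\.\d{2})?")
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def extract_entities(turn: str) -> dict:
    return {"amounts": MONEY.findall(turn), "dates": DATE.findall(turn)}

def track(turns: list[str]) -> dict:
    tracked = {"amounts": [], "dates": []}
    for turn in turns:
        found = extract_entities(turn)
        for key in tracked:
            tracked[key].extend(found[key])
    return tracked

entities = track([
    "assistant: The bereavement fare is $420.50 if requested by 2024-03-10.",
    "user: I will book before 2024-03-10 then.",
])
```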
Conversation State Architecture Technical Audit Checklist
- Context Window Management — No Silent Truncation: Verify that conversation history is managed explicitly — not silently truncated by the LLM API. When approaching context window limits, implement explicit compression or summarization. Log when compression occurs. Verify compressed context preserves critical factual statements.
- Commitment Tracking — AI Statement Consistency: Implement structured tracking of commitments or representations made by the AI system during a conversation. Test that subsequent turns in the same conversation are consistent with prior AI statements. Test that context compression does not cause the AI to contradict prior statements.
- Redis Session State — TTL and Consistency: Verify Redis session store TTL is appropriate for expected conversation length. Verify session state is consistent across multiple API requests within a session. Implement session state versioning to detect state corruption. Test behavior when session expires during an active conversation.
- Vector Database Memory — Retrieval Accuracy Testing: Test semantic memory retrieval accuracy against a corpus of representative enterprise conversations. Measure recall of specific factual statements from earlier in the conversation. Verify retrieval accuracy meets requirements for the specific use case — customer commitments require higher recall than general preference tracking.
- Entity Extraction — Coverage of Critical Information: Test entity extraction against scenarios where specific numerical commitments, policy references, dates, and user decisions are mentioned. Verify entities are tracked correctly and injected into context for subsequent turns. Verify entity store is not vulnerable to injection attacks (entities should not contain executable content).
- Multi-Session Continuity — Cross-Session Memory: For use cases requiring multi-session continuity (customer support threads spanning multiple days), verify that relevant context from previous sessions is retrievable and correctly surfaced. Test that user identity is verified before previous session context is exposed.
- Context Compression Fidelity — Commitment Preservation: Test context compression against scenarios where specific commitments were made in the compressed portion of the conversation. Verify the compressed representation preserves sufficient fidelity for the AI to remain consistent with those commitments. High-stakes commitments should be retained verbatim, not summarized.
- Privacy — Session Data Isolation and Retention: Verify session state is isolated per user identity — no cross-user context leakage. Implement session data retention policy with automatic deletion after defined period. Verify session data is included in user data deletion fulfillment for GDPR/CCPA rights requests.
- Failure Mode — Graceful Degradation on Memory Failure: Test AI system behavior when memory retrieval fails (Redis unavailable, vector search timeout). System should degrade gracefully — acknowledging context limitation rather than providing hallucinated context from failed retrieval. Log memory failures for operational monitoring.
- Audit Trail — Conversation Reconstruction Capability: Verify that complete conversation history can be reconstructed from audit logs — including compressed and vectorized turns. The ability to reconstruct a conversation in an investigation (as the Air Canada case required) depends on retaining the full conversation record separately from the active context management layer.
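The graceful-degradation check in the list above can be sketched as a fallback wrapper around memory retrieval: on failure, the system logs the fault and substitutes an explicit "context unavailable" marker rather than letting the model improvise missing history. The names here (`fetch_session`, `RETRIEVAL_FALLBACK`) are illustrative, not from any library.

```python
import logging

logger = logging.getLogger("memory")

RETRIEVAL_FALLBACK = ("[Note: earlier conversation context is temporarily "
                      "unavailable; do not assume prior commitments.]")

def fetch_session(store: dict, session_id: str) -> str:
    # Stand-in for a Redis or vector-store lookup that can fail.
    state = store.get(session_id)
    if state is None:
        raise KeyError(session_id)
    return state

def context_for_prompt(store: dict, session_id: str) -> str:
    try:
        return fetch_session(store, session_id)
    except Exception:
        # Log for operational monitoring, then degrade explicitly.
        logger.warning("memory retrieval failed for %s", session_id)
        return RETRIEVAL_FALLBACK

ok = context_for_prompt({"s1": "user confirmed email"}, "s1")
degraded = context_for_prompt({}, "s1")
```

The fallback text is injected into the prompt so the model itself acknowledges the gap, rather than fabricating context that a retrieval failure erased.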