AI SLA Design for Enterprise: Uptime Tiers, Latency SLAs, Accuracy SLAs, and Incident Response

Key Reference Data

  • 99.9% SLA downtime allowance: 8.76 hours/year
  • 99.99% SLA downtime allowance: 52.6 minutes/year
  • Average enterprise AI SLA: 99.5% (too low)
  • Mean time to detect AI issues: 31 hours
OpenAI's SLA and the November 2023 Outage: What Enterprises Experienced

OpenAI experienced a major outage on November 8, 2023, affecting both ChatGPT and the API for approximately four hours. Enterprises running production customer-facing AI on OpenAI's API experienced complete service disruption. OpenAI's standard API terms do not include uptime SLA commitments — enterprises depend on OpenAI's historical track record, not contractual guarantees. This case illustrates the need for AI SLAs that go beyond the LLM provider's standard terms, including provider redundancy provisions and fallback mechanisms that activate when provider SLAs are not met.
Section 01

Uptime SLA Tiers: 99.9% vs 99.99%

The difference between 99.9% ('three nines') and 99.99% ('four nines') uptime SLA is significant in practice: 99.9% allows 8.76 hours of downtime per year, while 99.99% allows 52.6 minutes per year. For customer-facing AI applications in regulated industries — healthcare triage, financial transaction processing, customer service — the appropriate tier depends on the business impact of downtime. An AI system handling emergency healthcare inquiries requires 99.99%+ uptime; an AI handling low-urgency document processing can tolerate 99.9%. Define the SLA tier before vendor evaluation — it will determine both architecture and cost.

Achieving 99.99% uptime for AI systems requires: redundant LLM provider routing (if one provider is down, automatically route to another), regional failover (multiple deployment regions), continuous health monitoring with sub-30-second failover triggering, and robust fallback behavior (graceful degradation to a human queue when AI is unavailable). 99.99% uptime for AI is achievable, but it requires significant architectural investment.
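As a concrete sketch of the redundant-routing requirement, the Python below shows a minimal failover router. The provider objects and their complete() method are hypothetical stand-ins for real SDK clients, and the 5% error threshold over a 60-second window is an illustrative assumption, not a prescribed value.

```python
import time

class FailoverRouter:
    """Minimal sketch of redundant LLM provider routing. Provider clients
    and their complete() method are hypothetical; thresholds illustrative."""

    def __init__(self, primary, secondary, error_threshold=0.05, window_s=60):
        self.primary = primary
        self.secondary = secondary
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.outcomes = []  # (timestamp, success) samples for the primary

    def _primary_error_rate(self, now):
        recent = [ok for ts, ok in self.outcomes if now - ts <= self.window_s]
        return 0.0 if not recent else 1 - sum(recent) / len(recent)

    def complete(self, prompt):
        now = time.monotonic()
        # Route to the secondary while the primary's recent error rate is high.
        if self._primary_error_rate(now) > self.error_threshold:
            return self.secondary.complete(prompt)
        try:
            result = self.primary.complete(prompt)
            self.outcomes.append((now, True))
            return result
        except Exception:
            self.outcomes.append((now, False))
            # Hard failure: fall back to the secondary for this request.
            return self.secondary.complete(prompt)
```

A production version would also need the human-queue degradation path for the case where both providers fail; it is omitted here for brevity.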

Section 02

AI Accuracy SLAs: A New Frontier in Enterprise Agreements

Traditional software SLAs cover uptime and response time — deterministic metrics. AI systems introduce a novel SLA dimension: accuracy. An AI system that is available and fast but wrong is not meeting its service commitments. For regulated industries, accuracy SLAs define the minimum acceptable correctness rate for AI decisions. Examples: healthcare AI must have false negative rate below 5% for symptom severity classification; financial AI must have credit decision accuracy above 92% measured on labeled test set monthly; compliance AI must correctly identify regulatory violations with at least 90% precision.

Accuracy SLAs require: a labeled evaluation set for continuous monitoring, measurement methodology (LLM-as-judge, human evaluation, downstream outcome tracking), measurement frequency (daily automated, weekly human validation), and remediation obligations when accuracy falls below threshold (human fallback, model update, vendor notification).
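To make the measurement side concrete, here is a minimal Python sketch of an automated accuracy check against a labeled evaluation set. The function name is hypothetical, and the 90% precision floor simply echoes the compliance example above; real deployments would pick metrics and thresholds per use case.

```python
def accuracy_sla_report(labels, predictions, positive, min_precision=0.90):
    """Compute precision and recall for the positive class on a labeled
    evaluation set and flag an SLA breach. Threshold is illustrative."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return {
        "precision": precision,
        "recall": recall,
        "sla_breach": precision < min_precision,
    }
```

A breach flag would then drive the remediation obligations named above: human fallback, vendor notification, and a model update timeline.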

Checklist

AI SLA Design Implementation Checklist

  • Define Uptime SLA Tier by Use Case Criticality: Classify each AI use case as Life/Safety Critical (99.999%), Business Critical/Customer-Facing (99.99%), Business Operational (99.9%), or Back-Office/Batch (99.5%), and match the SLA tier to that criticality. Architect redundancy accordingly: 99.99% requires LLM provider redundancy and regional failover; 99.5% may be achievable with a single provider.
  • Define Latency SLA by Use Case: Define P50, P95, and P99 latency SLAs for each AI use case: real-time customer-facing (P99 <3s, streaming time to first token (TTFT) <500ms), semi-real-time (P99 <10s), and batch (P99 <60s). Include latency SLAs in vendor contracts. Monitor continuously and alert when P99 exceeds the SLA × 1.5.
  • Define Accuracy SLA and Measurement Methodology: Define the accuracy SLA and measurement methodology before deployment: metric (precision, recall, F1, accuracy), minimum threshold, measurement frequency (daily automated plus weekly human), labeled evaluation set management (update quarterly), and remediation obligations. Include the accuracy SLA in vendor agreements for AI-as-a-service deployments.
  • Include Provider Redundancy Requirements: For 99.99%+ uptime SLA compliance, include in architecture and vendor agreements: primary plus fallback LLM provider configuration, automatic failover trigger criteria (error rate above X% for Y minutes), maximum failover time (target: under 60 seconds), and testing requirements for the failover mechanism (quarterly failover drill).
  • Define Incident Response Time SLAs: Define incident response commitments by severity: P0 (AI system down) — response within 15 minutes, resolution within 1 hour; P1 (significant degradation) — response within 30 minutes, resolution within 4 hours; P2 (partial issue) — response within 2 hours, resolution within 24 hours. Include response time commitments in vendor contracts with service credit remedies.
  • Service Credit Remedies: Negotiate service credits for SLA breaches: typically a 10% monthly fee credit for missing the uptime SLA by up to 1%, 25% for missing by 1-5%, and 50% for missing by more than 5%. Service credits should be automatic (not requiring a claim). Include termination rights for repeated SLA breach (e.g., 3 months of breach in a 12-month period). Credits are not adequate for high-severity impacts; negotiate the right to terminate for material breach.
  • SLA Measurement and Reporting: Define the SLA measurement methodology in the contract: who measures (vendor self-report vs. third-party monitoring), measurement data retention (minimum 12 months for audit), reporting frequency (weekly availability report, monthly accuracy report), and dispute resolution (independent arbitration for measurement disputes). Implement your own monitoring to independently verify vendor-reported SLA metrics.
  • Degraded Mode Service Level: Define degraded mode service levels: what are the expected behavior and service level when AI availability falls below threshold? Define human fallback queue capacity and response time, customer communication during AI unavailability, and automatic rollback to the previous AI model version if a new deployment causes degradation.
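The "alert when P99 exceeds the SLA × 1.5" rule from the latency checklist item can be sketched in a few lines of Python. The nearest-rank percentile method and the 3-second SLA default are assumptions for illustration; a production system would read latencies from its metrics pipeline.

```python
import math

def p99(latencies_ms):
    """Nearest-rank P99 over a window of request latencies (in ms)."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.99 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

def latency_alert(latencies_ms, sla_ms=3000, factor=1.5):
    """True when observed P99 breaches SLA × factor (the checklist rule)."""
    return p99(latencies_ms) > sla_ms * factor
```

Alerting at SLA × 1.5 rather than at the SLA itself gives headroom for transient spikes while still catching sustained degradation before it becomes a contractual breach.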
FAQ

Frequently Asked Questions

What is the difference between 99.9% and 99.99% uptime SLAs?

99.9% uptime (three nines) allows 8.76 hours of downtime per year, or approximately 43.8 minutes per month. 99.99% uptime (four nines) allows 52.6 minutes of downtime per year, or approximately 4.4 minutes per month. 99.999% (five nines) allows 5.26 minutes per year. For AI systems, 'downtime' includes: complete unavailability, latency above P99 SLA threshold, and accuracy below minimum acceptable threshold — not just server unavailability. Define what constitutes 'downtime' for your AI system explicitly in the SLA.
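The downtime allowances quoted above follow from simple arithmetic on a 365-day (8,760-hour) year, sketched here in Python:

```python
def downtime_allowance_minutes(sla_percent, hours_per_year=365 * 24):
    """Minutes of downtime permitted per year by an uptime SLA.
    Assumes a 365-day (8,760-hour) year, matching the figures above."""
    return (1 - sla_percent / 100) * hours_per_year * 60
```

For example, 99.9% yields 525.6 minutes (8.76 hours) per year and 99.99% yields roughly 52.6 minutes, matching the allowances in the answer above.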

Should enterprise AI contracts include accuracy SLAs?

Yes, for regulated industries where AI decisions have material business or regulatory impact. Accuracy SLAs hold vendors accountable for model quality maintenance over time — without them, vendors have no contractual obligation to address model drift or quality degradation. Include: minimum accuracy threshold on your labeled evaluation set, measurement methodology and frequency, vendor notification obligations when accuracy breaches are detected, and remediation timeline requirements. Accuracy SLA verification requires your own labeled evaluation set and measurement process.

What incident response time is appropriate for enterprise AI SLAs?

Industry benchmark SLAs for enterprise AI: P0 (complete unavailability) — response within 15 minutes, restoration within 1 hour; P1 (material accuracy degradation or partial unavailability) — response within 30 minutes, restoration within 4 hours; P2 (minor degradation) — response within 2 hours, restoration within 24 hours. Compare these benchmarks against your current vendor SLA commitments. Many AI vendors offer response commitments that are weaker than these benchmarks in standard terms — negotiate to enterprise-appropriate levels.

How does provider redundancy work for AI SLA compliance?

Provider redundancy for AI SLA compliance: configure two LLM providers in the AI platform (e.g., OpenAI as primary, Azure OpenAI or Anthropic as secondary). Implement health monitoring that tracks primary provider error rate and latency. When error rate exceeds threshold (e.g., 5% of requests failing for 60 seconds), automatically route new requests to secondary provider. Route all requests back to primary after confirmed recovery. Test failover mechanism quarterly. Document failover behavior in your SLA as part of availability commitment.
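The trigger-and-recovery policy described here (fail over after a sustained error-rate breach, fail back after confirmed recovery) can be sketched as a small state machine. The 5%/60-second thresholds come from the example above; the probe-based recovery check and its five-probe default are assumptions for illustration.

```python
class FailoverStateMachine:
    """Tracks primary-provider health: fail over when the error rate stays
    above threshold for a sustained period, fail back after N consecutive
    healthy probes of the primary. Recovery policy is illustrative."""

    def __init__(self, error_threshold=0.05, sustain_s=60, recovery_probes=5):
        self.error_threshold = error_threshold
        self.sustain_s = sustain_s
        self.recovery_probes = recovery_probes
        self.failed_over = False
        self.breach_since = None   # when the error rate first crossed threshold
        self.healthy_streak = 0

    def observe(self, now, error_rate, probe_ok=None):
        """Feed one monitoring sample; return the provider to route to."""
        if not self.failed_over:
            if error_rate > self.error_threshold:
                if self.breach_since is None:
                    self.breach_since = now
                elif now - self.breach_since >= self.sustain_s:
                    self.failed_over = True
                    self.healthy_streak = 0
            else:
                self.breach_since = None
        else:
            # While failed over, count consecutive successful primary probes.
            if probe_ok:
                self.healthy_streak += 1
                if self.healthy_streak >= self.recovery_probes:
                    self.failed_over = False
                    self.breach_since = None
            else:
                self.healthy_streak = 0
        return "secondary" if self.failed_over else "primary"
```

Requiring the breach to be sustained avoids flapping on a single bad minute, and requiring several consecutive healthy probes before failing back implements the "confirmed recovery" condition.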

How does Claire's SLA compare to enterprise AI requirements?

Claire's enterprise SLA: 99.9% uptime SLA with 99.99% available for enterprise-tier customers, P99 latency under 3 seconds for user-facing endpoints with streaming TTFT under 500ms, multi-provider LLM redundancy with automatic failover under 60 seconds, P0 incident response within 15 minutes with 1-hour restoration target, monthly accuracy monitoring report against customer-provided evaluation set, and service credits of 10-50% of monthly fees for SLA breach. SLA terms are included in enterprise agreements — contact sales for enterprise SLA documentation.

Get an Enterprise AI SLA That Matches Your Requirements

Claire's enterprise agreements include uptime, latency, accuracy, and incident response SLAs designed for regulated industries.

Ask Claire about AI SLA design