Enterprise RAG Architecture: Chunking Strategies, Vector Database Selection, and Retrieval Accuracy
Key Reference Data
RAG Architecture Components
Retrieval-Augmented Generation (RAG) augments LLM responses with relevant context retrieved from a document corpus. A production RAG pipeline has five components: (1) Ingestion pipeline — documents are chunked, embedded, and stored in a vector database; (2) Query pipeline — user queries are embedded and used to retrieve relevant chunks; (3) Reranking step — retrieved chunks are scored by a cross-encoder and ranked by relevance; (4) Context assembly — top-ranked chunks are assembled into the LLM context; (5) Generation — the LLM generates a response grounded in the retrieved context.
Each component has critical design decisions that affect retrieval accuracy. Chunking strategy affects how information is divided; vector database selection affects retrieval speed and accuracy; embedding model selection affects semantic search quality; reranking affects final precision; and context assembly affects how much information fits in the LLM context window. Optimizing these components independently and together is the core challenge of production RAG engineering.
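The five stages above can be sketched end to end. This is a toy illustration, not any particular vendor's API: the character-frequency "embedding", in-memory store, and elided LLM call are stand-ins for the real embedding model, vector database, and generation step.

```python
from dataclasses import dataclass

def embed(text: str) -> list[float]:
    # Stand-in embedding: character-frequency vector. A real pipeline
    # calls an embedding model (e.g. text-embedding-3-large) here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

@dataclass
class VectorStore:
    chunks: list[str]

    def search(self, query_vec: list[float], k: int = 3) -> list[str]:
        # Brute-force similarity search; production systems use ANN indexes.
        scored = [(cosine(query_vec, embed(c)), c) for c in self.chunks]
        return [c for _, c in sorted(scored, reverse=True)[:k]]

# (1) Ingestion: chunk, embed, and store the corpus
store = VectorStore(chunks=["Refunds are issued within 14 days.",
                            "Shipping takes 3-5 business days."])
# (2) Query pipeline: embed the user question
q_vec = embed("How long do refunds take?")
# (3)-(4) Retrieval, (re)ranking, and context assembly
context = "\n".join(store.search(q_vec, k=1))
# (5) Generation: the LLM call is elided; `context` would be passed
#     in the prompt alongside the user question.
```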
Vector Database Comparison: Pinecone vs Weaviate vs pgvector
Pinecone (managed cloud service) provides the simplest operational experience: fully managed, automatic scaling, and production-tested at enterprise scale. Pinecone's managed metadata filtering and namespace-based multi-tenancy make it suitable for enterprise multi-tenant RAG. Pricing is per pod (compute) plus storage. Enterprise-tier includes SOC 2 Type II compliance. Latency at 1M vectors: typically under 100ms for ANN search.
Weaviate (open source + cloud managed) offers more architectural flexibility: hybrid search (combining vector similarity with BM25 keyword search), built-in reranking modules, and multi-modal support. Weaviate's open-source license allows self-hosted deployment for data residency requirements. Weaviate Cloud provides managed service. pgvector (PostgreSQL extension) is the lowest-complexity option for enterprises already running PostgreSQL: vector search within existing database infrastructure, no new operational component. pgvector's HNSW indexing provides production-quality ANN search. Suitable for up to ~10M vectors; Pinecone or Weaviate recommended above that scale.
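The scale thresholds above are driven partly by raw storage. A quick sizing estimate for float32 vectors (index structures and metadata add overhead on top of this):

```python
def raw_vector_storage_gb(num_vectors: int, dimensions: int) -> float:
    """Raw float32 vector storage in GB, excluding index overhead and metadata."""
    bytes_total = num_vectors * dimensions * 4  # 4 bytes per float32 component
    return bytes_total / 1e9

# text-embedding-3-large produces 3072-dimensional vectors (~12.3 KB each)
print(raw_vector_storage_gb(10_000_000, 3072))  # 122.88 GB of raw vectors
```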
RAG Architecture Implementation Checklist
- Chunking Strategy Selection: Choose a chunking strategy based on document type: fixed-size chunking (256-512 tokens, 50-100 token overlap) for uniform structured documents; semantic chunking (split on sentence boundaries and paragraph breaks) for narrative text; hierarchical chunking (parent-child relationships) for structured documents with sections. Test at least 3 chunk sizes on your document corpus before production.
- Embedding Model Selection: Select an embedding model matched to your domain: text-embedding-3-large (OpenAI) for general enterprise text; domain-specific fine-tuned models for specialized domains (medical, legal, financial). Measure retrieval recall on a domain-specific evaluation set for each candidate. The embedding model and stored vectors are coupled: changing embedding models requires re-embedding the entire corpus.
- Vector Database Selection and Sizing: Select a vector database based on vector count (<10M = pgvector viable; >10M = Pinecone/Weaviate), operational model (managed vs self-hosted), data residency requirements, multi-tenancy architecture, and metadata filtering needs. Size storage: each float32 vector = dimension count × 4 bytes (text-embedding-3-large: 3072 dimensions ≈ 12KB/vector; 10M vectors ≈ 120GB raw, before index overhead).
- Hybrid Search Implementation: Implement hybrid search combining vector similarity with BM25 keyword search. Pure vector search misses exact keyword matches (product codes, entity names, regulatory citations); pure keyword search misses semantic similarity. Hybrid search with configurable alpha weighting (vector vs keyword contribution) consistently outperforms either approach alone on enterprise document retrieval. Weaviate and Elasticsearch both support native hybrid search.
- Cross-Encoder Reranking: Add cross-encoder reranking to the retrieval pipeline: retrieve the top-50 candidates with vector search, rerank them with a cross-encoder (Cohere Rerank, BGE-reranker, or Jina Reranker), and pass the top-10 reranked chunks to the LLM. Reranking has been reported to cut hallucination rates by 60-70% in some enterprise evaluations; the 50-150ms of added latency is acceptable for most enterprise use cases.
- Multi-Tenant Data Isolation: For multi-tenant RAG deployments, enforce strict tenant isolation in the vector database: separate namespaces (Pinecone), tenant-level collections (Weaviate), or row-level security with a tenant filter (pgvector). Verify at query time that retrieved vectors belong to the requesting tenant, and test isolation by attempting cross-tenant retrieval in a QA environment.
- Retrieval Accuracy Evaluation: Establish a retrieval accuracy baseline before production launch: define an evaluation set of question-context pairs, then measure Recall@K (does the correct document appear in the top-K retrieved results?), MRR (Mean Reciprocal Rank), and context relevance (LLM-as-judge scoring of retrieved chunks). Target Recall@5 > 90% for enterprise RAG.
- RAG Pipeline Monitoring: Monitor production RAG metrics: retrieval latency (embedding + ANN search + reranking), cache hit rate for repeated queries, context relevance distribution (LLM-as-judge on production samples), answer faithfulness (does the answer contradict the retrieved context?), and chunk utilization (which chunks are actually used in generation). Alert on retrieval quality degradation.
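The Recall@K and MRR metrics from the evaluation step above are straightforward to compute. A minimal sketch, assuming each query has exactly one gold document id and `retrieved` holds ranked result ids per query:

```python
def recall_at_k(retrieved: list[list[str]], relevant: list[str], k: int) -> float:
    """Fraction of queries whose gold doc id appears in the top-k results."""
    hits = sum(rel in docs[:k] for docs, rel in zip(retrieved, relevant))
    return hits / len(relevant)

def mean_reciprocal_rank(retrieved: list[list[str]], relevant: list[str]) -> float:
    """Average of 1/rank of the gold doc (contributes 0 if not retrieved)."""
    total = 0.0
    for docs, rel in zip(retrieved, relevant):
        if rel in docs:
            total += 1.0 / (docs.index(rel) + 1)  # ranks are 1-based
    return total / len(relevant)

# Two evaluation queries: ranked retrieved ids, plus the gold id per query.
retrieved = [["d3", "d1", "d7"], ["d5", "d2", "d9"]]
relevant = ["d1", "d9"]
print(recall_at_k(retrieved, relevant, k=3))      # 1.0 (both gold docs in top-3)
print(mean_reciprocal_rank(retrieved, relevant))  # (1/2 + 1/3) / 2
```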
Frequently Asked Questions
What chunk size is optimal for enterprise RAG?
Research by LlamaIndex (2024) found that 256-512 token chunk sizes with 10-20% overlap consistently outperformed larger chunks for most enterprise document retrieval tasks. Larger chunks (1024+ tokens) include more context per retrieval but reduce precision: the relevant information is diluted by surrounding text. Smaller chunks (128 tokens or fewer) carry too little context per retrieval and can split answers across chunk boundaries. Test your specific document corpus with three chunk sizes (256, 512, 1024) before selecting, and use hierarchical chunking for long documents with distinct sections.
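Fixed-size chunking with overlap is a sliding window over the token stream. A minimal sketch, using whitespace-style token lists in place of a real model tokenizer (a 64-token overlap on a 512-token chunk is 12.5%, within the 10-20% range above):

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Sliding-window chunking: consecutive chunks share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # final window reached the end
            break
    return chunks

tokens = [f"t{i}" for i in range(1200)]  # toy "tokens"; real systems use the model tokenizer
chunks = chunk_fixed(tokens, size=512, overlap=64)
print(len(chunks))      # 3
print(len(chunks[0]))   # 512
```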
Should enterprises use Pinecone, Weaviate, or pgvector?
Recommendation by use case: pgvector for enterprises with existing PostgreSQL infrastructure, vector count under 5M, and low-complexity retrieval needs; Pinecone for enterprises requiring a fully managed cloud service with proven enterprise scale, no desire to manage infrastructure, and multi-tenancy requirements; Weaviate for enterprises needing self-hosted deployment (data residency), hybrid search (vector + keyword), or multi-modal AI. All three are production-viable — the decision is primarily operational, not functional.
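The alpha-weighted hybrid search mentioned above can be sketched as a convex combination of normalized vector and keyword scores. This is a simplified illustration (real engines such as Weaviate implement more sophisticated fusion); the convention here is alpha = 1.0 for pure vector search, alpha = 0.0 for pure keyword:

```python
def hybrid_score(vector_score: float, keyword_score: float, alpha: float = 0.5) -> float:
    """Alpha-weighted fusion; both inputs are assumed pre-normalized to [0, 1]."""
    return alpha * vector_score + (1 - alpha) * keyword_score

# A doc matching an exact product code scores high on keyword (normalized
# BM25) but low on vector similarity; fusion keeps it at the top.
docs = {
    "doc_semantic":   (0.82, 0.10),  # (vector_score, keyword_score)
    "doc_exact_code": (0.35, 0.95),
}
ranked = sorted(docs, key=lambda d: hybrid_score(*docs[d], alpha=0.4), reverse=True)
print(ranked)  # ['doc_exact_code', 'doc_semantic']
```

Tuning alpha per corpus matters: keyword-heavy corpora (part numbers, citations) favor lower alpha, while conversational text favors higher alpha.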
How does RAG accuracy compare to fine-tuning for enterprise knowledge retrieval?
For knowledge that changes frequently (policies, procedures, product catalogs, regulations), RAG consistently outperforms fine-tuning because: RAG can be updated by adding documents (no retraining), RAG can cite sources, and RAG does not suffer from hallucinations about updated information that fine-tuning 'remembers' incorrectly. For knowledge that is stable, specialized, and high-volume, fine-tuning a small model can outperform RAG in both accuracy and cost. The 2024 industry consensus: RAG first, fine-tuning for demonstrated gaps that RAG cannot close.
How should sensitive enterprise data be handled in a RAG vector database?
RAG vector databases store both raw documents and their vector embeddings. Both must be secured: encrypt at rest (AES-256) and in transit (TLS 1.3), implement access controls matching source document permissions, apply data retention and deletion procedures that cover both raw chunks and vectors (deleting a document requires deleting all its vector chunks), and exclude PII and classified documents from RAG corpora that may be accessible to broad user populations.
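The deletion requirement above is easy to get wrong: removing the source document while orphaning its chunks and vectors leaves sensitive data retrievable. A toy in-memory sketch of document-scoped deletion (the dict-of-tuples layout is illustrative, not any particular database's schema):

```python
# Chunk text and vectors keyed by (doc_id, chunk_index).
chunk_text = {("doc1", 0): "refund policy ...", ("doc1", 1): "exceptions ...",
              ("doc2", 0): "shipping terms ..."}
chunk_vecs = {("doc1", 0): [0.1, 0.2], ("doc1", 1): [0.3, 0.4],
              ("doc2", 0): [0.5, 0.6]}

def delete_document(doc_id: str) -> int:
    """Remove every chunk AND its vector for a document; return count removed."""
    keys = [k for k in chunk_text if k[0] == doc_id]
    for k in keys:
        del chunk_text[k]   # raw chunk text
        del chunk_vecs[k]   # its embedding
    return len(keys)

print(delete_document("doc1"))  # 2 chunks removed
print(len(chunk_vecs))          # 1 (only doc2's chunk remains)
```

In pgvector this maps to a `DELETE ... WHERE doc_id = ...` on the chunks table; managed stores offer metadata-filtered deletes for the same purpose.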
How does Claire's RAG architecture handle enterprise data quality requirements?
Claire's RAG implementation includes: configurable chunking strategies per document type, multi-provider embedding (OpenAI, Cohere, sentence-transformers), hybrid search (BM25 + vector), cross-encoder reranking via Cohere Rerank API, pgvector-based vector storage within the customer's database infrastructure (or Pinecone for managed scale), tenant-level namespace isolation, and document-level access control that filters retrieved chunks based on the querying user's permissions. Source citations are included in every RAG-grounded response.
Deploy Production-Ready RAG for Your Enterprise
Claire's RAG architecture includes hybrid search, cross-encoder reranking, and tenant isolation out of the box.