
Retrieval-Augmented Generation

~15–25%: hallucination rate for base LLMs on factual queries (Source: Stanford CRFM 2024)

~80%: reduction in factual errors when using RAG vs. a base LLM (Source: Meta AI Research 2023)

What is RAG?

RAG (Retrieval-Augmented Generation) is an architecture for AI applications that retrieves relevant information from a private knowledge base before the language model generates a response. Rather than relying solely on what the model learned during training (which has a fixed cutoff date and contains no private data), RAG pulls specific, current context first, then asks the model to generate based on that evidence.

The core problem RAG solves: foundation models are trained on public internet data, not your files. Without RAG, asking GPT-4 about your past client proposals, your current pricing, or your internal SOP library gets you a hallucinated guess. With RAG, the model reads the actual document before answering, so the response is grounded in evidence you can verify, not a plausible-sounding fabrication.

IBM defines RAG as "an architecture for optimizing the performance of an AI model by connecting it with external knowledge bases." AWS describes it as "the process of optimizing the output of a large language model, so it references an authoritative knowledge base outside of its training data sources before generating a response." The word authoritative matters: RAG is about connecting models to sources you trust and control.

The RAG pipeline

INDEXING (done once): your documents → chunker splits text → embedding model → vectors stored in the vector database
QUERY (at runtime): user query → embedded with the same model → nearest chunks retrieved → LLM generates a response grounded in those chunks

How each component works

1. Chunking: splitting documents into retrievable pieces

Documents are split into smaller segments before indexing. Chunk size is a critical hyperparameter: too large and chunks become too generic to match specific queries; too small and they lose context. A 200-token chunk about "our SEO retainer scope" retrieves better for "what's included in the retainer?" than a 2,000-token chunk covering the entire SOW. Common approaches: fixed-size chunking (split every N tokens) is simple but may cut mid-sentence; semantic chunking splits at natural boundaries (paragraphs, sections, headings) to preserve meaning. IBM: "When chunks are too large, the data points can become too general and fail to correspond directly to potential user queries. But if chunks are too small, the data points can lose semantic coherency."
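Both chunking approaches can be sketched in a few lines of Python. This is a simplified illustration: the "tokens" here are whitespace-split words rather than real model tokens, and the overlap parameter is a common refinement (repeating a few tokens between neighbors) not covered above.

```python
def chunk_fixed(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Fixed-size chunking: split every `size` tokens, with `overlap`
    tokens repeated between neighbors so a sentence cut at a boundary
    still appears whole in at least one chunk."""
    tokens = text.split()  # whitespace "tokens"; real systems count model tokens
    step = size - overlap
    return [" ".join(tokens[i:i + size]) for i in range(0, len(tokens), step)]


def chunk_semantic(text: str) -> list[str]:
    """Semantic chunking (simplified): split at paragraph boundaries so
    each chunk keeps a coherent unit of meaning."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]
```

In practice semantic chunking often also splits on headings and merges short paragraphs up to a size budget; the paragraph split above is the minimal version of the idea.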

2. Embedding: converting text to searchable numbers

An embedding model (e.g. OpenAI's text-embedding-3-small, Cohere Embed, or open-source models like nomic-embed-text) converts each chunk into a vector (a list of numbers representing the chunk's meaning). Semantically similar text lands close together in this high-dimensional space. "Acme Corp retainer scope" and "what services are included for Acme?" map to nearby coordinates, making retrieval possible. Critically important: the same embedding model must be used for both indexing and querying. Mismatched models produce meaningless comparisons.

3. Vector database: storing and searching embeddings at scale

A vector database stores your embeddings and answers similarity queries efficiently. When a user asks a question, it's embedded and the database returns the nearest matching chunks. Popular options: Pinecone (managed, production-grade, hybrid search), Weaviate (open-source, hybrid search built in), Chroma (open-source, easy local dev), pgvector (PostgreSQL extension, no new infrastructure if you're already on Postgres), FAISS (Meta's open-source library, fast local search), Qdrant (Rust-based, strong filtering and payload indexing). Managed options reduce ops overhead but add cost; self-hosted options give more control. IBM notes a security consideration: unencrypted vector stores are vulnerable to reverse-embedding attacks that reconstruct the original data.
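What a vector database does at query time reduces to nearest-neighbor search. The sketch below is the brute-force version; the products listed above replace the linear scan with approximate nearest-neighbor indexes (e.g. HNSW or IVF) so it stays fast at millions of vectors.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class TinyVectorStore:
    """Brute-force similarity search over (text, vector) pairs.
    Illustrative only: real stores add persistence, metadata filters,
    and approximate indexes instead of scanning every item."""

    def __init__(self) -> None:
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str, vector: list[float]) -> None:
        self.items.append((text, vector))

    def search(self, query_vec: list[float], k: int = 3) -> list[str]:
        scored = [(cosine(query_vec, v), t) for t, v in self.items]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [t for _, t in scored[:k]]
```

Usage mirrors the pipeline diagram: index once with `add`, then at runtime embed the query and call `search` for the top-k chunks.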

4. Retrieval, augmentation, and generation

At query time: the user's question is embedded, the vector database returns the top matching chunks, and those chunks are inserted into a structured prompt alongside the question. The LLM is explicitly instructed to answer from the provided context, and in a well-designed system, to say it doesn't know rather than fabricate when the context lacks the answer. Pinecone's template captures the pattern: "Using the CONTEXT provided, answer the QUESTION. Keep your answer grounded in the facts of the CONTEXT. If the CONTEXT doesn't contain the answer, say you don't know."
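The augmentation step itself is plain string assembly. The sketch below adapts the Pinecone grounding pattern quoted above; numbering the chunks makes it easy for the model to cite its sources in the answer.

```python
PROMPT_TEMPLATE = """Using the CONTEXT provided, answer the QUESTION.
Keep your answer grounded in the facts of the CONTEXT.
If the CONTEXT doesn't contain the answer, say you don't know.

CONTEXT:
{context}

QUESTION: {question}"""


def build_prompt(question: str, chunks: list[str]) -> str:
    """Insert retrieved chunks into the grounding template before the
    LLM call; the numbered chunks support per-source citations."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return PROMPT_TEMPLATE.format(context=context, question=question)
```

The resulting string is what actually goes to the LLM: retrieval happens outside the model, and the model only ever sees the question plus the evidence you chose to show it.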

RAG vs. fine-tuning

Both approaches make a model more useful for a specific domain. They work differently and the right choice depends on what you're trying to fix.

Dimension | RAG | Fine-tuning
--- | --- | ---
Cost | Low: no model retraining; pay for storage and retrieval calls | High: compute-intensive parameter updates, specialist skill required
Data freshness | Live: update the knowledge base any time, no retraining | Static: knowledge is baked into weights at training time
Transparency | Citable: you can see exactly which chunks produced each answer | Opaque: model behavior is a product of weights, not traceable sources
What it teaches | Specific facts from a knowledge base | Writing style, task format, domain vocabulary, reasoning patterns
Developer control | High: swap or update the knowledge base any time | Low: must retrain to change baked-in behavior
Best for | Answering questions from private, frequently-changing data | Teaching a consistent output format, tone, or reasoning style

They're not mutually exclusive. IBM: "RAG and fine-tuning are often contrasted but can be used in tandem. Fine-tuning increases a model's familiarity with the intended domain and output requirements, while RAG assists the model in generating relevant, high-quality outputs." A common pattern: fine-tune on output format and tone, use RAG to supply current factual context at inference time.

Advanced RAG patterns

Basic RAG (embed, store, retrieve, generate) is a starting point, not a ceiling. Production systems add steps that meaningfully improve retrieval quality.

Hybrid search

Combines semantic search (dense vectors: meaning) with keyword search (sparse vectors: exact terms). Semantic search finds documents that mean the same thing even in different phrasing. Keyword search catches domain-specific terms, product names, and acronyms that semantic search misses. Pinecone: "This becomes relevant when your users refer to internal, domain-specific language like acronyms, product names, or team names." Run both searches, merge and de-duplicate results, re-rank on unified score.
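One way to sketch the merge step: normalize each result list's scores to a common scale, then combine them with a weight. Min-max fusion below is one of several strategies (reciprocal rank fusion is another common choice); the weight `alpha` and the score dictionaries are illustrative.

```python
def hybrid_merge(dense: dict[str, float], sparse: dict[str, float],
                 alpha: float = 0.5) -> list[str]:
    """Merge semantic (dense) and keyword (sparse) result scores.
    Scores are min-max normalized per list so the two scales are
    comparable, then combined with weight alpha. Documents found by
    only one method still get ranked (missing score counts as 0)."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    d, s = normalize(dense), normalize(sparse)
    combined = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
                for doc in set(d) | set(s)}
    return sorted(combined, key=combined.get, reverse=True)
```

A document that scores well on both lists (e.g. a chunk that both means the right thing and contains the exact acronym) rises to the top, which is the point of hybrid search.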

Re-ranking

After initial retrieval returns 20–50 candidates, a cross-encoder re-ranker scores each candidate against the query for true relevance. Initial retrieval is fast but approximate: cosine similarity is a proxy for relevance, not a guarantee. Re-ranking is slower but precise: the top 5 chunks after re-ranking are far more likely to be genuinely relevant than the top 5 from raw vector similarity alone. Google Cloud describes re-ranking as a standard step in production search.
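The two-stage shape can be sketched as follows. The `overlap_score` function is a toy stand-in for a real cross-encoder (which would run a model over query and candidate together); only the pipeline structure carries over to production.

```python
def rerank(query: str, candidates: list[str], score_fn, top_n: int = 5) -> list[str]:
    """Second stage of a two-stage pipeline: re-score the 20-50 fast
    retrieval candidates with a slower, more precise scorer and keep
    the best top_n. In production, score_fn is a cross-encoder model."""
    return sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)[:top_n]


def overlap_score(query: str, text: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words that
    appear in the candidate text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0
```

Because only the shortlist reaches the expensive scorer, you get most of the precision of running the cross-encoder everywhere at a fraction of the latency.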

HyDE (Hypothetical Document Embeddings)

Instead of embedding the user's question directly, ask the LLM to generate a hypothetical answer first, then embed that answer to search for matching documents. A question's embedding often doesn't closely match any document, but a hypothetical answer embedding will, because the hypothetical answer uses the same vocabulary and structure as your actual documents. Useful when users ask questions in very different language than your knowledge base is written in.
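The control flow is a one-extra-step wrapper around ordinary retrieval. Everything passed in below is a placeholder: `llm_generate`, `embed`, and `store` stand for whatever LLM call, embedding model, and vector store your system already uses.

```python
def hyde_search(question: str, llm_generate, embed, store, k: int = 3) -> list[str]:
    """HyDE: generate a hypothetical answer, then embed and search with
    THAT instead of the raw question, so the query vector lands near
    documents written in answer-style vocabulary."""
    hypothetical = llm_generate(
        f"Write a short passage that plausibly answers: {question}"
    )
    return store.search(embed(hypothetical), k=k)
```

The hypothetical answer can be wrong on the facts; that doesn't matter, because it is only used to steer retrieval, and the final answer is generated from the real documents it finds.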

Agentic RAG

Rather than a single retrieve-then-generate pass, an AI agent orchestrates the retrieval process: decides what to search for, validates whether retrieved context is sufficient, queries again if not, and reasons over multiple documents before generating. Pinecone: "It's about deciding which questions to ask, which tools to use, when to use them, and then aggregating results to ground answers." Frameworks: LangChain, LlamaIndex, LangGraph.

Evaluating RAG quality

You can't improve what you can't measure. Before tuning chunk size, retrieval method, or prompt templates, establish a baseline: a test set of real questions with known correct answers drawn from your knowledge base. Pinecone: "Identifying a set of queries and their expected answers is critical to knowing if your application is working. Maintaining that evaluation set is also critical to knowing where to improve over time."

Faithfulness

Does the generated answer reflect only what's in the retrieved context, or does the model add unsupported claims? High faithfulness means the model stays grounded in the evidence. Google Cloud calls this "groundedness" in Vertex Eval.

Answer relevance

Is the response actually answering the question asked? A faithful response can still be irrelevant: the model correctly cites the retrieved text, but the retrieved text was about the wrong topic. Both retrieval and faithfulness have to succeed together.

Context precision

Of the chunks retrieved, how many were actually needed? Low precision means retrieval is pulling in noise, which increases token cost and raises the chance that the model latches onto irrelevant content instead of the correct answer.

Context recall

Did the retrieved chunks contain everything needed to answer the question? Low recall means the right documents are in your knowledge base but aren't being found, typically a chunking, embedding, or retrieval configuration problem.

RAGAS (github.com/explodinggradients/ragas) is an open-source framework that automates all four metrics using LLM-as-judge evaluations, with no hand-labeled ground truth required for faithfulness and answer relevance. Google Cloud's Vertex Eval Service covers the same ground with additional metrics: coherence, fluency, safety, and instruction-following.
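Context precision and recall reduce to simple ratios once you have labeled the relevant chunks for a test question. The hand-labeled sets below are what RAGAS estimates with an LLM judge instead of requiring them up front.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that were actually relevant.
    Low precision = retrieval is pulling in noise."""
    if not retrieved:
        return 0.0
    return sum(c in relevant for c in retrieved) / len(retrieved)


def context_recall(retrieved: list[str], needed: set[str]) -> float:
    """Fraction of the chunks needed for a correct answer that
    retrieval actually found. Low recall = the right documents exist
    but aren't being surfaced."""
    if not needed:
        return 1.0
    return len(needed & set(retrieved)) / len(needed)
```

Tracking both over your evaluation set tells you which knob to turn: low recall points at chunking, embedding, or retrieval configuration; low precision points at retrieving too many or too-generic chunks.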

Why agencies care about RAG

Client-specific answers

Index every SOW, proposal, and retainer scope. Ask "what did we agree to deliver for Acme in Q3?" and get an accurate answer with a citation, not a hallucinated summary. AWS gives an equivalent example: asking "How much annual leave do I have?" returns the leave policy document alongside the individual employee's own record. Same pattern, different data.

Knowledge that stays current

Your team's knowledge changes weekly: new client briefs, updated pricing, revised processes. RAG doesn't require retraining to absorb new information. Update the knowledge base and the assistant picks it up on the next query. IBM: "RAG models can also connect to APIs and gain access to real-time feeds." No model redeployment needed.

Auditable outputs

Each response comes with the source chunks that produced it. When a client questions a deliverable or a scope disagreement surfaces, you can trace the answer back to the exact document and passage. This is the primary operational advantage RAG has over fine-tuning: the reasoning chain is visible and verifiable.

When NOT to use RAG

RAG adds real complexity: a vector database to maintain, an embedding pipeline to keep synchronized with your documents, latency on every query, and retrieval quality to monitor continuously. For tasks where all the context fits in a single prompt (summarizing a pasted document, drafting an email, answering general questions), a standard LLM call is simpler and faster.

Google Cloud notes that a long context window "is a great way to provide source materials to the LLM" for smaller datasets; RAG only becomes necessary when your data is too large for a context window, changes frequently, or is too sensitive to paste into a prompt. Pinecone places RAG in a cost hierarchy: training from scratch → fine-tuning → context window stuffing → RAG (cheapest and most flexible, but still not free).

Build RAG when your use case genuinely requires answers grounded in a private, frequently-updated knowledge base. Don't build it because it sounds impressive.

Frequently Asked Questions

How is RAG different from fine-tuning a model?
Fine-tuning trains a model on your data, changing its weights permanently. RAG doesn't change the model at all: it gives the model access to your documents at query time. RAG is faster to set up, cheaper, and easier to update (add a new document without retraining). Fine-tuning is better for changing how a model behaves; RAG is better for giving a model access to specific knowledge.
Do I need to be a developer to use RAG?
Not anymore. Tools like Notion AI, Guru, and many AI assistants already use RAG under the hood: you just connect your workspace. Purpose-built RAG platforms like LlamaIndex Cloud or Vectara offer no-code setup. For more control, you'll want a developer, but the barrier is lower than it was even a year ago.
What documents work best with RAG?
Documents with clear, dense information work best: SOPs, brand guidelines, proposal templates, product specs, meeting notes, FAQ documents, and structured reports. RAG struggles with documents that require visual interpretation (charts, diagrams), very long unstructured PDFs, or content that's highly context-dependent.
How do I keep the knowledge base up to date?
Most RAG systems support document syncing: connect your Google Drive, Notion, or Confluence and the system re-indexes when files change. For the highest accuracy, set a regular review cadence: quarterly is usually enough for stable docs like SOPs, but client-specific documents should sync automatically.
Can RAG hallucinate and make things up?
RAG significantly reduces hallucination because the model generates responses grounded in retrieved content rather than relying on memory. But it can still fabricate if the retrieved content is ambiguous or if the model is asked something the documents don't cover. Always design RAG systems to say 'I don't have information about that' rather than guessing.
