RAG from Scratch in 2026: Build Your Intelligent Search System in Python with LangChain and ChromaDB
While you were reading this sentence, some AI system somewhere probably hallucinated an answer. The problem isn't new — but the dominant solution in 2026 has a name: RAG.
It's not hype. The global RAG market was worth $1.94 billion in 2025 and is projected to reach $9.86 billion by 2030, growing 38.4% annually (MarketsandMarkets). More importantly, 73% of enterprise AI deployments in 2026 use RAG or a hybrid architecture (Precision AI Academy).
But there's a catch. Most tutorials out there teach "Naive RAG" — which, according to the 2026 RAG Architecture Guide, fails at retrieval in approximately 40% of cases. The model responds confidently, but based on the wrong documents.
In this tutorial, you'll learn to build a complete RAG system from scratch — not the shallow one, but the one that actually works. We'll cover indexing, intelligent chunking, embeddings, semantic search, evaluation with RAGAS, and an upgrade to Advanced RAG with reranking. All in Python, with LangChain and ChromaDB.
At the end, there's a version that runs 100% free with Ollama. No API key, no credit card.
The Problem RAG Solves
Language models are incredibly good at generating text. But they have a serious problem: they don't know what they don't know. Ask about your company's internal documents, a 2025 report, or a recent scientific article — and the model will invent a convincing answer.
RAG solves this with a simple idea: before answering, the system retrieves the relevant documents for that question, and only then generates the answer based on them.
"RAG turns a general-purpose language model into a domain expert on your specific documents — without retraining anything." — Precision AI Academy
The result is significant: an 85% reduction in hallucination rate when implemented correctly, compared to pure prompting (Precision AI Academy). Companies deploying RAG report 30-70% efficiency gains in knowledge-intensive workflows (Techment).
How It Works: The 2-Phase Architecture
A RAG system has two distinct phases. Indexing runs offline, once per document set. Retrieval and generation run online, together, for each user query.
| Phase | What Happens | When It Runs |
|---|---|---|
| Indexing | Documents are loaded, split into chunks, converted to embeddings, and stored in the vector database | Offline (once per document set) |
| Retrieval | The user's question is converted to an embedding and used to find the most similar chunks | Online (per query) |
| Generation | The retrieved chunks are injected into the prompt as context, and the LLM generates the final answer | Online (per query) |
Think of RAG as a library with two employees: a librarian (retriever) who finds the right books on the shelves, and an assistant (LLM) who reads the books and answers your question. Alone, the assistant would make everything up. With the librarian, it only answers based on what it has read.
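To make the phase split concrete, here is a minimal, dependency-free sketch. It is not the LangChain pipeline built below: word overlap stands in for embeddings, and fake_llm is a hypothetical stub in place of a real model. The names build_index, retrieve, and answer are illustrative only.

```python
# Toy illustration of the two RAG phases. Word overlap stands in for
# embeddings and fake_llm stands in for a real model (both are assumptions
# made for illustration, not how the real pipeline scores or generates).

def tokenize(text):
    return set(text.lower().split())

# Phase 1 (offline): build the index once.
def build_index(docs):
    return [(doc, tokenize(doc)) for doc in docs]

# Phase 2a (online): retrieve the best-matching documents for a query.
def retrieve(index, query, k=1):
    query_words = tokenize(query)
    ranked = sorted(index, key=lambda item: len(item[1] & query_words), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# Phase 2b (online): inject the retrieved text into the prompt and generate.
def fake_llm(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"

def answer(index, query):
    context = "\n".join(retrieve(index, query))
    prompt = f"Context: {context}\nQuestion: {query}\nAnswer:"
    return prompt, fake_llm(prompt)

index = build_index([
    "ChromaDB stores embeddings on disk",
    "Ollama runs language models locally",
])
prompt, reply = answer(index, "where does chromadb store embeddings")
print(prompt)
```

Note that build_index runs once, while retrieve and answer run for every question. That is the same split the table above describes.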
Project Setup
Create a directory, isolate dependencies, and install the packages:
mkdir rag-from-scratch && cd rag-from-scratch
python -m venv .venv
source .venv/bin/activate
# On Windows: .venv\Scripts\activate
pip install langchain langchain-community langchain-chroma chromadb
pip install langchain-openai python-dotenv
pip install ragas datasets
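Place your OpenAI key in a .env file:
OPENAI_API_KEY=your-key-here
Then load it at the top of your script, before creating any OpenAI client, so the key ends up in the environment:
from dotenv import load_dotenv
load_dotenv()  # reads .env from the current directory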
No key? Skip to the "Free RAG with Ollama" section at the end. The code is almost the same.
Step-by-Step: Building the RAG
I'll divide the implementation into 8 code blocks. Each is self-contained and explained.
Step 1: Load the Documents
First, create a ./docs/ folder with some example .txt files — it could be technical documentation, articles, or reports. Then load everything:
from langchain_community.document_loaders import DirectoryLoader, TextLoader
loader = DirectoryLoader(
    path="./docs/",
    glob="**/*.txt",
    loader_cls=TextLoader,
)
documents = loader.load()
print(f"Documents loaded: {len(documents)}")
Step 2: Intelligent Chunking
Dividing documents into pieces is the most underestimated decision in the pipeline. Chunks that are too large dilute search relevance. Chunks that are too small lose context.
from langchain_text_splitters import RecursiveCharacterTextSplitter  # langchain.text_splitter is the legacy import path
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # characters, roughly 250 tokens
    chunk_overlap=200,  # overlap to avoid cutting ideas in half
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(documents)
print(f"Total chunks: {len(chunks)}")
The chunk_overlap of 200 characters repeats the tail of each chunk at the start of the next one, so an idea that straddles a boundary is less likely to be cut in half. It's a small detail that makes a big difference.
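To see what overlap does mechanically, here is a small sketch of fixed-window chunking on a short string. This is a simplification and an assumption on my part: RecursiveCharacterTextSplitter prefers to break at the separators it is given, whereas this hypothetical chunk_text function cuts purely by character position.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share
    chunk_overlap characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(text, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
```

Each printed piece starts with the last four characters of the previous one. With real documents the same repetition keeps a boundary sentence visible in both neighbouring chunks.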
"Naive RAG pipelines fail at retrieval roughly 40% of the time, generating a confident answer grounded in the wrong documents." — JobsByCulture RAG Architecture Guide 2026
Step 3: Create Embeddings
Each chunk becomes a numerical vector — an embedding — that represents its meaning. The closer one embedding is to another, the more similar the content:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
OpenAI's text-embedding-3-small model is inexpensive, produces 1536-dimensional vectors, and is a common default choice in 2026.
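"Closer" needs a concrete measure. Cosine similarity is one common choice, sketched below on hand-made 3-dimensional vectors. The vectors are hypothetical values chosen for illustration, and the distance function a vector store actually uses depends on how it is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print("cat vs kitten: ", round(cosine_similarity(cat, kitten), 3))
print("cat vs invoice:", round(cosine_similarity(cat, invoice), 3))
```

The first pair scores much higher than the second, which is the property semantic search relies on.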
Step 4: Store in ChromaDB
With the embeddings ready, we populate the vector database. ChromaDB is open-source, runs locally, and persists data to disk:
from langchain_chroma import Chroma
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
print("Vector database saved to ./chroma_db")
Step 5: Configure the Retriever
The retriever finds the most relevant chunks for each question. By default, top-k semantic search works well:
retriever = vectorstore.as_retriever(
search_type="similarity",
search_kwargs={"k": 4}
)
This returns the 4 chunks most similar to the question. Four is a good starting point — few enough not to pollute the context, enough to cover different aspects.
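Top-k selection itself is simple once every chunk has a score. The sketch below assumes the scores are already computed (the values are hypothetical) and only shows the selection step. The top_k helper is illustrative, not part of LangChain or ChromaDB.

```python
import heapq

def top_k(scored_chunks, k):
    """Return the k (chunk, score) pairs with the highest score, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])

# Hypothetical similarity scores for six chunks.
scored = [
    ("chunk A", 0.91), ("chunk B", 0.42), ("chunk C", 0.88),
    ("chunk D", 0.15), ("chunk E", 0.79), ("chunk F", 0.80),
]
selected = top_k(scored, k=4)
print([name for name, _ in selected])
```

Raising k pulls in lower-scoring chunks such as B and D, which is the "polluting the context" trade-off mentioned above.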
Step 6: Assemble the Complete Chain
Here we combine retrieval + prompt + LLM into a single pipeline:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
template = """You are an assistant specialized in answering based ONLY on the provided context. If the answer is not in the context, say you don't know.
Context: {context}
Question: {question}
Answer:"""
prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)
Notice the RunnablePassthrough(): the user's question passes directly to the prompt, while the retriever fetches and formats the context.
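If the pipe syntax feels opaque, the following plain-Python sketch mirrors the same data flow with ordinary functions. Everything here is a stand-in: fake_retriever and fake_llm are hypothetical stubs, and documents are plain strings rather than LangChain Document objects.

```python
def fake_retriever(question):
    # Stand-in for the vector search; always returns the same two chunks.
    return ["RAG retrieves documents before generating.",
            "ChromaDB stores the vectors."]

def format_docs(docs):
    return "\n\n---\n\n".join(docs)

def build_inputs(question):
    # Mirrors the dict stage: both branches receive the same question.
    return {
        "context": format_docs(fake_retriever(question)),  # retriever | format_docs
        "question": question,                               # RunnablePassthrough()
    }

def render_prompt(inputs):
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}\nAnswer:"

def fake_llm(prompt_text):
    # Stand-in for the chat model.
    return f"[stub answer for a {len(prompt_text)}-character prompt]"

def run_chain(question):
    return fake_llm(render_prompt(build_inputs(question)))

print(build_inputs("What is RAG?"))
print(run_chain("What is RAG?"))
```

The dict stage is the only place where the input fans out. From there on each step consumes the output of the previous one.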
Step 7: Ask Questions
The chain is ready. Let's test it:
question = "What is RAG and how does it reduce hallucinations in language models?"
answer = chain.invoke(question)
print(answer.content)
The flow is:
- Your question becomes an embedding and searches ChromaDB
- The 4 most similar chunks are retrieved
- They become the {context} in the template
- GPT-4o-mini generates the answer based only on that context
The retriever always returns 4 chunks, even when none of them is relevant. In that case the prompt instructs the model to say it doesn't know instead of hallucinating.
Step 8: Evaluate with RAGAS
Building the RAG is the easy part. Knowing if it works is what separates amateurs from professionals. RAGAS provides objective metrics:
| Metric | What it measures | Ideal Score |
|---|---|---|
| Faithfulness | Is the answer faithful to the retrieved context? | > 0.85 |
| Answer Relevancy | Does the answer actually address the question? | > 0.80 |
| Context Precision | Are the retrieved documents relevant? | > 0.75 |
| Context Recall | Does the context cover the necessary information? | > 0.80 |
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.metrics import context_precision, context_recall
from datasets import Dataset
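retrieved_docs = retriever.invoke(question)
dataset = Dataset.from_dict({
    "question": [question],
    "answer": [answer.content],
    "contexts": [[doc.page_content for doc in retrieved_docs]],
    # context_precision and context_recall compare against a reference answer.
    # Replace this placeholder with one you wrote yourself. The column names
    # follow the classic RAGAS schema and may differ in your installed version.
    "ground_truth": ["<your reference answer>"],
})
results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)
print("Faithfulness:", results["faithfulness"])
print("Answer Relevancy:", results["answer_relevancy"])
print("Context Precision:", results["context_precision"])
print("Context Recall:", results["context_recall"])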
If context_precision is low, the retriever is surfacing irrelevant chunks, which usually points to the chunking strategy or the embedding model. If faithfulness is low, the LLM is drifting away from the context: keep the temperature at 0 and tighten the prompt.
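One way to act on these numbers is a small check against the targets in the table above. The sketch below is an assumption-laden helper, not part of RAGAS: the diagnose function is hypothetical, the example scores are made up, and the hints for answer_relevancy and context_recall are my own suggestions rather than rules from the library.

```python
# Targets copied from the metrics table; a score must exceed its target.
THRESHOLDS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.80,
}

# Suggested first things to look at (heuristics, not RAGAS output).
HINTS = {
    "faithfulness": "answer strays from the context, tighten the prompt",
    "answer_relevancy": "answer misses the question, review the prompt",
    "context_precision": "irrelevant chunks retrieved, revisit chunking or embeddings",
    "context_recall": "needed information not retrieved, revisit chunking or k",
}

def diagnose(scores):
    """Return one message per metric that fails to exceed its target."""
    problems = []
    for name, limit in THRESHOLDS.items():
        value = scores.get(name)
        if value is not None and value <= limit:
            problems.append(f"{name} = {value:.2f} (target > {limit:.2f}): {HINTS[name]}")
    return problems

# Hypothetical scores for illustration only.
example_scores = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.84,
    "context_precision": 0.61,
    "context_recall": 0.83,
}
problems = diagnose(example_scores)
for problem in problems:
    print(problem)
```

Scores from a single question are noisy. In practice you would average over a set of test questions before drawing conclusions.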
Upgrade: Advanced RAG with Re-ranking
The Naive RAG we built so far works. But the 40% retrieval failure rate is a real risk. The most effective solution in 2026 is re-ranking: retrieve more documents than necessary and use a specialized model to reorder the most relevant ones.
Re-ranking improves answer quality by 15-30% on standard RAG benchmarks (JobsByCulture).
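# Requires: pip install sentence-transformers (used by HuggingFaceCrossEncoder).
# On newer LangChain releases the two langchain.retrievers imports below may have
# moved (for example to the langchain-classic package); check your installed version.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker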
from langchain_community.cross_encoders import HuggingFaceCrossEncoder
# Free re-ranking model (runs locally)
reranker_model = HuggingFaceCrossEncoder(
    model_name="BAAI/bge-reranker-v2-m3"
)
compressor = CrossEncoderReranker(
    model=reranker_model,
    top_n=4,
)
advanced_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(
        search_kwargs={"k": 20}
    ),
)
# Updated chain with re-ranking
advanced_chain = (
    {
        "context": advanced_retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
)
advanced_answer = advanced_chain.invoke(question)
print(advanced_answer.content)
The logic: retrieve 20 chunks, pass them through the re-ranker (which is slower but more accurate), and select the best 4. The extra computational cost is worth every penny — especially in technical domains where precision matters.
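The two-stage idea can be shown without any model at all. In the sketch below, cheap_score and precise_score are hypothetical stand-ins for the embedding search and the cross-encoder. They are toy heuristics chosen so that the second stage can separate candidates the first stage ties on.

```python
def cheap_score(query, chunk):
    """Stage 1 stand-in: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words)

def precise_score(query, chunk):
    """Stage 2 stand-in: also rewards the query appearing as an exact phrase."""
    bonus = 1.0 if query.lower() in chunk.lower() else 0.0
    return cheap_score(query, chunk) + bonus

def retrieve_then_rerank(query, chunks, fetch_k, top_n):
    # Stage 1: fast, rough ranking over everything; keep a generous shortlist.
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)[:fetch_k]
    # Stage 2: slower, sharper ranking over the shortlist only.
    reranked = sorted(candidates, key=lambda c: precise_score(query, c), reverse=True)
    return reranked[:top_n]

chunks = [
    "a database can store a vector of numbers",
    "a vector database indexes embeddings",
    "bread recipes for beginners",
    "database backups run nightly",
]
best = retrieve_then_rerank("vector database", chunks, fetch_k=3, top_n=1)
print(best)
```

The first two chunks tie in stage 1 because both contain the two query words. Only the second stage notices that one of them is actually about vector databases.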
Free RAG: Running Everything with Ollama
Don't want to pay for an API? The open-source ecosystem in 2026 is mature. With Ollama, both the embedding model and the LLM run on your machine, free of charge and, once the two models below are pulled, fully offline:
ollama pull nomic-embed-text
ollama pull llama3.2:3b
The code changes very little:
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
# Local embeddings
embeddings_local = OllamaEmbeddings(model="nomic-embed-text")
# Local LLM
llm_local = Ollama(model="llama3.2:3b", temperature=0)
# Identical pipeline, just swap the provider
vectorstore_local = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings_local,
    persist_directory="./chroma_db_local",
)
retriever_local = vectorstore_local.as_retriever(search_kwargs={"k": 4})
chain_local = (
    {"context": retriever_local | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm_local
)
# The Ollama LLM class returns a plain string, so there is no .content attribute here
print(chain_local.invoke("What is RAG?"))
Same architecture. Zero API cost. Perfect for prototyping, sensitive data, or simply learning without spending anything.
Comparison: Embedding Strategies
| Model | Dim. | Cost | Accuracy (MTEB) | Offline? |
|---|---|---|---|---|
| text-embedding-3-small | 1536 | Low (paid) | 62.3% | ❌ |
| text-embedding-3-large | 3072 | Medium (paid) | 64.6% | ❌ |
| nomic-embed-text | 768 | Free | 59.7% | ✅ |
| bge-large-en-v1.5 | 1024 | Free | 63.7% | ✅ |
For production, text-embedding-3-small offers the best cost-benefit trade-off. For prototyping or sensitive data, nomic-embed-text runs for free on your machine.
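Dimension count also affects how much space the index takes. The estimate below is a rough sketch under stated assumptions: vectors are stored as 32-bit floats, the corpus has 100,000 chunks (a made-up figure), and index overhead, metadata, and the chunk text itself are ignored.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as 32-bit floats

def index_size_mb(num_chunks, dimensions):
    """Raw vector storage in MiB, excluding index structures and metadata."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32 / (1024 ** 2)

# Dimensions taken from the comparison table.
models = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "nomic-embed-text": 768,
    "bge-large-en-v1.5": 1024,
}
for name, dims in models.items():
    print(f"{name}: {index_size_mb(100_000, dims):.0f} MB")
```

Storage grows linearly with dimensions, so doubling the vector size doubles the raw footprint for the same corpus.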
Next Steps
RAG is not a one-and-done project. What separates a mediocre implementation from an excellent one is iteration:
- Test different chunking strategies: 512 vs 1024 tokens, 10% vs 20% overlap
- Experiment with domain-specific embeddings: fine-tuned models outperform generic ones
- Implement a feedback loop: users mark answers as useful or useless, and those ratings become data for refining retrieval and prompts
- Add Graph RAG: for questions that connect multiple documents
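For the chunking experiments in the first bullet, it helps to enumerate the configurations up front. The sketch below only builds the grid. The sizes and ratios come from the bullet, while the dictionary keys are chosen to match the splitter parameters used earlier. Whether a size means tokens or characters depends on the length function your splitter uses.

```python
from itertools import product

chunk_sizes = [512, 1024]       # sizes from the first bullet
overlap_ratios = [0.10, 0.20]   # overlap as a fraction of the chunk size

# One configuration per (size, ratio) pair; overlap is rounded down.
configs = [
    {"chunk_size": size, "chunk_overlap": int(size * ratio)}
    for size, ratio in product(chunk_sizes, overlap_ratios)
]
for config in configs:
    print(config)
```

Each configuration would then be indexed separately and scored with the same evaluation set, so the comparison stays fair.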
"RAG is still the dominant architecture for grounding LLMs with external knowledge in 2026 — but the landscape has fractured into multiple distinct patterns." — Starmorph Blog
The AI knowledge management systems market reached $11.24 billion in 2026, growing 46.7% annually (Virtual Assistant VA). RAG is the foundation of it all. Knowing how to build, evaluate, and iterate is a skill that will be worth its weight in gold in the coming years.
Now you have a functional system, metrics to measure quality, and a clear path for evolution. The code is in your hands — just run it.
Want to go further? In the next tutorial, we'll explore Graph RAG — when semantic search isn't enough and you need to connect entities across multiple documents. That's where RAG really gets interesting.