
RAG from Scratch in 2026: Build Your Intelligent Search System in Python with LangChain and ChromaDB

NeuralPulse|May 24, 2026|15 min read

While you were reading this sentence, some AI system somewhere probably hallucinated an answer. The problem isn't new — but the dominant solution in 2026 has a name: RAG.

It's not hype. The global RAG market was valued at $1.94 billion in 2025 and is projected to reach $9.86 billion by 2030, growing 38.4% annually (MarketsandMarkets). More importantly: 73% of enterprise AI deployments in 2026 use RAG or a hybrid architecture (Precision AI Academy).

But there's a catch. Most tutorials out there teach "Naive RAG" — which, according to the 2026 RAG Architecture Guide, fails at retrieval in approximately 40% of cases. The model responds confidently, but based on the wrong documents.

In this tutorial, you'll learn to build a complete RAG system from scratch — not the shallow one, but the one that actually works. We'll cover indexing, intelligent chunking, embeddings, semantic search, evaluation with RAGAS, and an upgrade to Advanced RAG with reranking. All in Python, with LangChain and ChromaDB.

At the end, there's a version that runs 100% free with Ollama. No API key, no credit card.

The Problem RAG Solves

Language models are incredibly good at generating text. But they have a serious problem: they don't know what they don't know. Ask about your company's internal documents, a 2025 report, or a recent scientific article — and the model will invent a convincing answer.

RAG solves this with a simple idea: before answering, the system retrieves the relevant documents for that question, and only then generates the answer based on them.

"RAG turns a general-purpose language model into a domain expert on your specific documents — without retraining anything." — Precision AI Academy

The result is significant: an 85% reduction in hallucination rate when implemented correctly, compared to pure prompting (Precision AI Academy). Companies deploying RAG report 30-70% efficiency gains in knowledge-intensive workflows (Techment).

How It Works: The 2-Phase Architecture

A RAG system has two distinct phases. Indexing runs offline, once per document set (and again whenever the documents change). Retrieval and generation run online, together, for each user query.

Phase	What Happens	When It Runs
Indexing	Documents are loaded, split into chunks, converted to embeddings, and stored in the vector database	Offline (single processing)
Retrieval	The user's question is converted to an embedding and used to find the most similar chunks	Online (per query)
Generation	The retrieved chunks are injected into the prompt as context, and the LLM generates the final answer	Online (per query)

Think of RAG as a library with two employees: a librarian (retriever) who finds the right books on the shelves, and an assistant (LLM) who reads the books and answers your question. Alone, the assistant would make everything up. With the librarian, it only answers based on what it has read.
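To make the split between the offline and online phases concrete, here is a minimal toy sketch in plain Python. It is not how LangChain or ChromaDB work internally: word overlap stands in for embeddings, no LLM is called, and every function name here is hypothetical.

```python
# Toy illustration of the indexing / retrieval / generation split.
# Assumption: shared words stand in for embedding similarity.

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its words without trailing punctuation."""
    return {word.strip(".,?!").lower() for word in text.split()}

def build_index(docs: list[str]) -> list[tuple[str, set[str]]]:
    """Indexing (offline): store each document next to its word set."""
    return [(doc, tokenize(doc)) for doc in docs]

def retrieve(index: list[tuple[str, set[str]]], question: str, k: int = 1) -> list[str]:
    """Retrieval (online): rank documents by words shared with the question."""
    q_words = tokenize(question)
    ranked = sorted(index, key=lambda item: len(item[1] & q_words), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(context: list[str], question: str) -> str:
    """Generation (online): the retrieved text is injected into the prompt."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {question}\nAnswer:"

docs = [
    "ChromaDB stores embeddings on disk.",
    "Ollama runs language models locally.",
]
index = build_index(docs)                    # runs once, offline
question = "Where does ChromaDB store embeddings?"
top = retrieve(index, question)              # runs for every query
prompt_text = build_prompt(top, question)    # what an LLM would receive
print(prompt_text)
```

In the real pipeline the word sets become embedding vectors and the final string is sent to the model, but the division of labor stays the same.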

Project Setup

Create a directory, isolate dependencies, and install the packages:

mkdir rag-from-scratch && cd rag-from-scratch
python -m venv .venv
source .venv/bin/activate  
# On Windows: .venv\Scripts\activate

pip install langchain langchain-community langchain-chroma chromadb
pip install langchain-openai python-dotenv
pip install ragas datasets

Place your OpenAI key in a .env file:

OPENAI_API_KEY=your-key-here

No key? Skip to the "Free RAG with Ollama" section at the end. The code is almost the same.

Step-by-Step: Building the RAG

I'll divide the implementation into 8 steps. Each one is explained and builds on the variables defined in the previous steps.

Step 1: Load the Documents

First, create a ./docs/ folder with some example .txt files — it could be technical documentation, articles, or reports. Then load everything:

from langchain_community.document_loaders import DirectoryLoader, TextLoader

loader = DirectoryLoader(
    path="./docs/",
    glob="**/*.txt",
    loader_cls=TextLoader,
)
documents = loader.load()

print(f"Documents loaded: {len(documents)}")

Step 2: Intelligent Chunking

Dividing documents into pieces is the most underestimated decision in the pipeline. Chunks that are too large dilute search relevance. Chunks that are too small lose context.

# Provided by the langchain-text-splitters package
# (pip install langchain-text-splitters if the import fails)
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # measured in characters, roughly 250 tokens
    chunk_overlap=200,   # overlap to avoid cutting ideas in half
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(documents)

print(f"Total chunks: {len(chunks)}")

The chunk_overlap of 200 characters ensures important contexts aren't cut between two chunks. It's a small detail that makes a big difference.
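To see what overlap does, here is a deliberately simplified sketch: a fixed-size sliding window over characters. It is not the algorithm RecursiveCharacterTextSplitter uses (that one also tries to break on the listed separators); the function name and the tiny sizes are assumptions chosen so the output is easy to read.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
for piece in pieces:
    print(piece)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each chunk repeats the last 4 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.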

"Naive RAG pipelines fail at retrieval roughly 40% of the time, generating a confident answer grounded in the wrong documents." — JobsByCulture RAG Architecture Guide 2026

Step 3: Create Embeddings

Each chunk becomes a numerical vector — an embedding — that represents its meaning. The closer one embedding is to another, the more similar the content:

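from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings

load_dotenv()  # reads OPENAI_API_KEY from the .env file created during setup

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")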

OpenAI's text-embedding-3-small model costs pennies, returns 1536-dimensional vectors by default, and is a widely used default choice in 2026.
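"Closer" needs a definition. One common measure is cosine similarity, sketched below with made-up 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions). Treat this as an illustration of the idea only: the vector store may be configured with a different distance metric.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, invented for this example
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.0]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```

The two animal vectors point in nearly the same direction, so their similarity is close to 1, while the unrelated vector scores close to 0.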

Step 4: Store in ChromaDB

With the embeddings ready, we populate the vector database. ChromaDB is open-source, runs locally, and persists data to disk:

from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

print("Vector database saved to ./chroma_db")

Step 5: Configure the Retriever

The retriever finds the most relevant chunks for each question. By default, top-k semantic search works well:

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 4}
)

This returns the 4 chunks most similar to the question. Four is a good starting point — few enough not to pollute the context, enough to cover different aspects.
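Conceptually, top-k selection is just "score everything, keep the k best". A minimal sketch with the standard library, assuming the similarity scores have already been computed (the chunk ids and scores below are invented):

```python
import heapq

def top_k(scores: dict[str, float], k: int) -> list[str]:
    """Return the ids of the k highest-scoring chunks, best first."""
    return heapq.nlargest(k, scores, key=scores.get)

# Hypothetical similarity scores between one question and five chunks
similarity = {
    "chunk-a": 0.91,
    "chunk-b": 0.42,
    "chunk-c": 0.88,
    "chunk-d": 0.15,
    "chunk-e": 0.67,
}

print(top_k(similarity, 2))  # ['chunk-a', 'chunk-c']
```

Raising k pulls in lower-scoring chunks such as chunk-b, which is the "pollution" trade-off described above. A real vector database typically avoids scoring every chunk by using an approximate nearest-neighbor index.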

Step 6: Assemble the Complete Chain

Here we combine retrieval + prompt + LLM into a single pipeline:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

template = """You are an assistant specialized in answering based ONLY on the provided context. If the answer is not in the context, say you don't know.

Context: {context}

Question: {question}

Answer:"""

prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    # Join the retrieved chunks into one context string
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
)

Notice the RunnablePassthrough(): the user's question passes directly to the prompt, while the retriever fetches and formats the context.
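The dict at the start of the chain can look like magic. Roughly, it sends the same input to every branch and collects the results under each key. Here is a plain-Python sketch of that behavior, with no LangChain involved; fake_retriever, format_chunks, and run_parallel are hypothetical stand-ins.

```python
def fake_retriever(question: str) -> list[str]:
    """Hypothetical stand-in for the retriever: always returns two fixed chunks."""
    return ["RAG retrieves documents first.", "The LLM answers from them."]

def format_chunks(chunks: list[str]) -> str:
    """Same joining logic as format_docs, but on plain strings."""
    return "\n\n---\n\n".join(chunks)

def passthrough(value: str) -> str:
    """Mimics RunnablePassthrough: hand the input on unchanged."""
    return value

def run_parallel(branches: dict, user_input: str) -> dict:
    """Feed the same input to every branch and collect the results by key."""
    return {name: branch(user_input) for name, branch in branches.items()}

branches = {
    "context": lambda q: format_chunks(fake_retriever(q)),
    "question": passthrough,
}

prompt_inputs = run_parallel(branches, "What is RAG?")
print(prompt_inputs["question"])
print(prompt_inputs["context"])
```

The resulting dict has exactly the two keys the template expects, {context} and {question}, which is why it can be piped straight into the prompt.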

Step 7: Ask Questions

The chain is ready. Let's test it:

question = "What is RAG and how does it reduce hallucinations in language models?"
answer = chain.invoke(question)

print(answer.content)

The flow is:

1. Your question becomes an embedding and searches ChromaDB
2. The 4 most similar chunks are retrieved
3. They become the {context} in the template
4. GPT-4o-mini generates the answer based only on that context

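If the vector database doesn't have relevant documents, the prompt instructs the model to say it doesn't know instead of hallucinating. The retriever still returns its 4 nearest chunks even when none are relevant, so this safeguard depends on the model following the instruction.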

Step 8: Evaluate with RAGAS

Building the RAG is the easy part. Knowing if it works is what separates amateurs from professionals. RAGAS provides objective metrics:

Metric	What it measures	Ideal Score
Faithfulness	Is the answer faithful to the retrieved context?	> 0.85
Answer Relevancy	Does the answer actually address the question?	> 0.80
Context Precision	Are the retrieved documents relevant?	> 0.75
Context Recall	Does the context cover the necessary information?	> 0.80

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
from ragas.metrics import context_precision, context_recall
from datasets import Dataset

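retrieved_docs = retriever.invoke(question)

# context_precision and context_recall compare against a reference answer,
# so the dataset needs a ground_truth column. The text below is a placeholder:
# write your own reference answers for a real evaluation.
# Column names follow the classic RAGAS schema; newer releases may expect
# different names, so check the version you have installed.
dataset = Dataset.from_dict({
    "question": [question],
    "answer": [answer.content],
    "contexts": [[doc.page_content for doc in retrieved_docs]],
    "ground_truth": ["RAG retrieves relevant documents and passes them to the LLM as context, so the answer is grounded in source text."],
})

results = evaluate(
    dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
    ],
)

print("Faithfulness:", results["faithfulness"])
print("Answer Relevancy:", results["answer_relevancy"])
print("Context Precision:", results["context_precision"])
print("Context Recall:", results["context_recall"])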

If context_precision is low, the likely culprits are the chunking strategy or the embedding quality. If faithfulness is low, the LLM is ignoring the context: keep the temperature at or near 0 and tighten the prompt.
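A small helper can turn the target scores from the table into an automatic check. This is a sketch, not part of RAGAS: the function name is hypothetical and the example scores are invented to show the output shape.

```python
# Target scores taken from the metrics table above
TARGETS = {
    "faithfulness": 0.85,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.80,
}

def below_target(scores: dict[str, float]) -> list[str]:
    """List every metric whose score does not exceed its target (missing counts as 0)."""
    return [
        name for name, target in TARGETS.items()
        if scores.get(name, 0.0) <= target
    ]

# Invented scores, for illustration only
example_scores = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.81,
    "context_precision": 0.60,
    "context_recall": 0.78,
}

print(below_target(example_scores))  # ['context_precision', 'context_recall']
```

In this made-up run both failing metrics are on the retrieval side, which would point at chunking or embeddings rather than at the prompt.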

Upgrade: Advanced RAG with Re-ranking

The Naive RAG we built so far works. But the 40% retrieval failure rate is a real risk. The most effective solution in 2026 is re-ranking: retrieve more documents than necessary and use a specialized model to reorder the most relevant ones.

Re-ranking improves answer quality by 15-30% on standard RAG benchmarks (JobsByCulture).

# Requires the sentence-transformers package: pip install sentence-transformers
# Import paths below match LangChain 0.2/0.3; newer releases may ship these
# classes in the langchain-classic package instead.
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# Free re-ranking model (runs locally)
reranker_model = HuggingFaceCrossEncoder(
    model_name="BAAI/bge-reranker-v2-m3"
)

compressor = CrossEncoderReranker(
    model=reranker_model,
    top_n=4,
)

advanced_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(
        search_kwargs={"k": 20}
    ),
)

# Updated chain with re-ranking
advanced_chain = (
    {
        "context": advanced_retriever | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
)

advanced_answer = advanced_chain.invoke(question)
print(advanced_answer.content)

The logic: retrieve 20 chunks, pass them through the re-ranker (which is slower but more accurate), and select the best 4. The extra computational cost is worth every penny — especially in technical domains where precision matters.
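The two-stage idea is independent of any library. The sketch below uses two toy scoring functions, both invented for illustration: a cheap one that counts shared words, and a stricter one that stands in for the cross-encoder. Neither resembles a real re-ranking model; only the retrieve-wide-then-narrow structure carries over.

```python
def cheap_score(query: str, doc: str) -> int:
    """Stage 1 (fast, rough): number of distinct words shared with the query."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def precise_score(query: str, doc: str) -> float:
    """Stage 2 (toy stand-in for a cross-encoder): shared words per document word."""
    return cheap_score(query, doc) / len(doc.split())

def retrieve_then_rerank(query: str, docs: list[str], fetch_k: int, top_n: int) -> list[str]:
    """Fetch fetch_k candidates with the cheap score, keep the top_n by the precise score."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:top_n]

docs = [
    "how text is split into chunks and how reranking later reorders those chunks by relevance",
    "reranking reorders retrieved chunks",
    "ollama runs models locally",
    "embeddings are vectors",
]
query = "how reranking reorders chunks"

result = retrieve_then_rerank(query, docs, fetch_k=3, top_n=2)
print(result[0])  # reranking reorders retrieved chunks
```

The long document wins stage 1 because it shares more words with the query, but the stricter second pass promotes the short, focused one. That reordering is the whole point of the extra step.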

Free RAG: Running Everything with Ollama

Don't want to pay for an API? The open-source ecosystem in 2026 is mature. With Ollama, you run embeddings and LLM locally — 100% free, 100% offline:

ollama pull nomic-embed-text
ollama pull llama3.2:3b

The code changes very little:

from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama

# Local embeddings
embeddings_local = OllamaEmbeddings(model="nomic-embed-text")

# Local LLM
llm_local = Ollama(model="llama3.2:3b", temperature=0)

# Identical pipeline, just swap the provider
vectorstore_local = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings_local,
    persist_directory="./chroma_db_local",
)

retriever_local = vectorstore_local.as_retriever(search_kwargs={"k": 4})

chain_local = (
    {"context": retriever_local | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm_local
)

# Ollama is a plain-text LLM wrapper, so invoke() returns a string (no .content)
print(chain_local.invoke("What is RAG?"))

Same architecture. Zero API cost. Perfect for prototyping, sensitive data, or simply learning without spending anything.

Comparison: Embedding Strategies

Model	Dim.	Cost	Accuracy (MTEB)	Offline?
text-embedding-3-small	1536	$$ low	62.3%	❌
text-embedding-3-large	3072	$$ medium	64.6%	❌
nomic-embed-text	768	Free	59.7%	✅
bge-large-en-v1.5	1024	Free	63.7%	✅

For production, text-embedding-3-small offers the best cost-benefit trade-off of the four. For prototyping or sensitive data, nomic-embed-text runs for free on your machine.
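Dimension count also drives storage. A back-of-the-envelope estimate, assuming each vector is stored as raw float32 values (4 bytes each) and ignoring index overhead and metadata; the 100,000-chunk corpus is a hypothetical figure:

```python
# Dimensions taken from the comparison table above
DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "nomic-embed-text": 768,
    "bge-large-en-v1.5": 1024,
}
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

def index_size_mb(dim: int, n_chunks: int) -> float:
    """Raw vector storage in megabytes (10^6 bytes), without index overhead."""
    return dim * BYTES_PER_FLOAT32 * n_chunks / 1_000_000

n_chunks = 100_000  # hypothetical corpus size
for model, dim in DIMENSIONS.items():
    print(f"{model}: {index_size_mb(dim, n_chunks):.1f} MB")
```

Under these assumptions, 100,000 chunks take about 614 MB at 1536 dimensions and about 307 MB at 768, so halving the dimension halves the raw storage.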

Next Steps

RAG is not a one-and-done project. What separates a mediocre implementation from an excellent one is iteration:

Test different chunking strategies: 512 vs 1024 tokens, 10% vs 20% overlap
Experiment with domain-specific embeddings: fine-tuned models outperform generic ones
Implement a feedback loop: users mark answers as useful or useless → data to refine
Add Graph RAG: for questions that connect multiple documents

"RAG is still the dominant architecture for grounding LLMs with external knowledge in 2026 — but the landscape has fractured into multiple distinct patterns." — Starmorph Blog

The AI knowledge management systems market reached $11.24 billion in 2026, growing 46.7% annually (Virtual Assistant VA). RAG is the foundation of it all. Knowing how to build, evaluate, and iterate is a skill that will be worth its weight in gold in the coming years.

Now you have a functional system, metrics to measure quality, and a clear path for evolution. The code is in your hands — just run it.

Want to go further? In the next tutorial, we'll explore Graph RAG — when semantic search isn't enough and you need to connect entities across multiple documents. That's where RAG really gets interesting.
