
7 Steps to a Hallucination-Free Chatbot: CoT, Self-Consistency, and DSPy in Python

NeuralPulse | June 1, 2026 | 14 min read

You ask the chatbot a simple question, and it makes up an answer with complete confidence. It's not a bug; it's a consequence of how LLMs generate text. The problem is that 40% of chatbots in production still hallucinate often enough to cause damage, according to 2026 benchmarks (Source: Analytical Guide to LLM Evaluation, NeuralPulse).

The good news? You can drastically reduce this rate by combining three techniques that academic research has already validated: Structured Chain-of-Thought, Self-Consistency with voting, and automatic optimization with DSPy. This tutorial shows the 7 steps to build this system from scratch in Python.

"Chain-of-thought + self-consistency gives the biggest accuracy boost for reasoning tasks. For everything else, few-shot with 3-5 examples is the best cost-to-quality tradeoff." — SurePrompts Team, Every Prompt Engineering Technique Explained (2026)

Step 1: Understand why LLMs hallucinate (and what works against it)

Before writing code, it's worth understanding the enemy. LLMs are pattern-completing machines, not fact-checking ones. When they don't know the answer, they still produce the most plausible-looking continuation, because they were trained to sound convincing, not to be accurate.

The three techniques we'll use attack the problem from complementary angles:

Technique | What it does | Reported effect
Chain-of-Thought (CoT) | Forces the model to reason step-by-step before answering | +34% on multi-step reasoning tasks (Source: iBuidl Research, 2026, with GPT-5, Claude 4, and Gemini 2.5)
Self-Consistency | Generates multiple answers and chooses the most frequent (majority voting) | +12 to +18% over pure CoT on GSM8K (Source: Adaline.ai, based on Wang et al., 2022)
DSPy | Automatically optimizes prompts based on metrics, without manual tuning | -25% hallucination rate, +30 to +45% factual accuracy (Source: IEEE ESCI 2026, arXiv:2604.04869)

Together, these three techniques form a safety belt for your chatbot. Step 2 starts with the foundation of everything: the system prompt.

Step 2: Build a robust and lean system prompt

The system prompt is the first line of defense. iBuidl Research (2026) reported two important findings:

  1. System prompts above 800 tokens start losing effectiveness — the model dilutes adherence to instructions.
  2. Few-shot with just 3 examples raises structured output reliability from 71% to 94%.

In other words: less is more. A bloated system prompt harms more than it helps.
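To make the few-shot finding concrete, here is a minimal sketch of how a short system prompt and three examples could be assembled into a chat message list. The names `FEW_SHOT_EXAMPLES` and `build_messages` are illustrative assumptions, and the example pairs are invented for demonstration, not taken from the cited research.

```python
# Hypothetical few-shot examples: (user message, ideal assistant reply) pairs.
FEW_SHOT_EXAMPLES = [
    ("Context: The store opens at 9am.\nQuestion: When does the store open?",
     "REASONING: The context states the opening time.\nANSWER: 9am."),
    ("Context: The plan costs R$ 50 per month.\nQuestion: What is the yearly price?",
     "REASONING: 12 months at R$ 50 each gives R$ 600.\nANSWER: R$ 600 per year."),
    ("Context: The office is in Recife.\nQuestion: Who is the CEO?",
     "REASONING: The context does not mention the CEO.\n"
     "ANSWER: I don't have enough information to answer."),
]


def build_messages(system_prompt, examples, user_input, max_examples=3):
    """Builds a chat message list: system prompt, few-shot turns, then the real question."""
    messages = [{"role": "system", "content": system_prompt}]
    # Each example becomes a user turn followed by the assistant reply we want imitated.
    for user_text, assistant_text in examples[:max_examples]:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": user_input})
    return messages


messages = build_messages(
    "Answer only from the provided context.",
    FEW_SHOT_EXAMPLES,
    "Context: The report is due Friday.\nQuestion: When is the report due?",
)
print(len(messages))
```

Note that one of the three examples demonstrates a refusal, so the model also sees what "I don't know" should look like.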

# system_prompt.py: anti-hallucination base template

SYSTEM_PROMPT = """You are an AI assistant specialized in providing ACCURATE and VERIFIABLE answers.

MANDATORY RULES:
1. NEVER invent information. If you don't know, say "I don't have enough information to answer."
2. ALWAYS reason step-by-step before answering (use the reasoning format below).
3. BASE each claim on data from the provided context.
4. If the user asks for something outside your scope, politely decline.
5. KEEP answers concise: maximum 3 paragraphs unless more detail is explicitly requested.

Example response format:
REASONING: [Your step-by-step reasoning here]
ANSWER: [Your final answer based on the reasoning above]
"""

This template already applies two principles: it is lean (under 400 tokens) and structured (it separates reasoning from the answer). Keep it; we'll refine it in the next steps.
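A quick way to keep an eye on the 800-token ceiling is to estimate the prompt size before sending it. The sketch below assumes roughly 4 characters per token, which is only a rule of thumb for English text; a real tokenizer for your model will give different numbers. The function names are hypothetical.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. ASSUMPTION: about 4 characters per token (heuristic only)."""
    return math.ceil(len(text) / chars_per_token)


def check_prompt_budget(prompt: str, limit: int = 800) -> dict:
    """Reports whether the estimated prompt size stays within the token limit."""
    tokens = estimate_tokens(prompt)
    return {
        "estimated_tokens": tokens,
        "limit": limit,
        "within_budget": tokens <= limit,
    }


sample_prompt = "You are an AI assistant. NEVER invent information. " * 10
print(check_prompt_budget(sample_prompt))
```

Running this check in a unit test is a cheap guard against a system prompt that grows a little with every release.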

Step 3: Implement structured Chain-of-Thought

Chain-of-Thought is simple in theory and transformative in practice. Instead of asking for a direct answer, you force the model to explain its reasoning step by step before concluding. iBuidl Research documented a consistent 34% gain in accuracy with modern models (GPT-5, Claude 4, Gemini 2.5).

The secret is to structure the CoT — it can't be a generic "think step-by-step." You need a format the model follows strictly.

# cot_pipeline.py: structured Chain-of-Thought

from openai import OpenAI

client = OpenAI()


def answer_with_cot(question: str, context: str) -> dict:
    """Executes structured Chain-of-Thought and returns reasoning + answer."""
    messages = [
        {
            "role": "system",
            "content": """You are an analyst who reasons step-by-step.

Always follow this format:

STEP 1: Identify relevant data in the context
STEP 2: Relate the data to the question
STEP 3: Check if there is enough information
STEP 4: Formulate the answer based ONLY on verified data

If at ANY step you realize you don't have enough data, STOP and say so.""",
        },
        {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
    ]

    response = client.chat.completions.create(
        model="gpt-5.5",
        messages=messages,
        temperature=0.3,  # Low temperature for reasoning consistency
        max_tokens=1024,
    )

    return {
        "full_answer": response.choices[0].message.content,
        "tokens_used": response.usage.total_tokens,
    }


# Usage example
result = answer_with_cot(
    question="What was the company's revenue in Q3?",
    context="In Q1 we billed R$ 2M, in Q2 R$ 2.5M. The Q3 report will be released next week.",
)
print(result["full_answer"])

Notice what should happen: the model reaches STEP 3 ("Check if there is enough information") and recognizes that the Q3 figures haven't been released. Without CoT, it would likely invent a number. With the explicit check, it has a defined point at which to stop before hallucinating.
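Because the steps follow a fixed format, the response can also be inspected in code. The sketch below is one possible way to split a response into its numbered steps and flag the case where the model stopped before STEP 4. The helper names and the sample response are assumptions for illustration. It only detects a format-level stop, not a refusal phrased inside STEP 4.

```python
import re

# Captures "STEP <n>: <body>" up to the next step marker or the end of the text.
STEP_PATTERN = re.compile(r"STEP\s+(\d+):\s*(.*?)(?=STEP\s+\d+:|\Z)", re.DOTALL)


def parse_steps(text: str) -> dict[int, str]:
    """Maps each step number found in the response to its text."""
    return {int(num): body.strip() for num, body in STEP_PATTERN.findall(text)}


def stopped_early(text: str, expected_steps: int = 4) -> bool:
    """True when the highest step number present is below the expected count."""
    steps = parse_steps(text)
    return max(steps, default=0) < expected_steps


# Hypothetical model output for the Q3 revenue question.
sample_response = (
    "STEP 1: Q1 and Q2 revenue are given.\n"
    "STEP 2: The question asks about Q3.\n"
    "STEP 3: Q3 data is not available yet. I cannot answer."
)
print(parse_steps(sample_response))
print(stopped_early(sample_response))
```

A flag like this can be logged or used to route the conversation to a fallback message instead of showing raw reasoning to the user.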

Step 4: Add Self-Consistency with majority voting

Chain-of-Thought already reduces hallucinations, but it still has a problem: the reasoning can be consistent and wrong at the same time. That's where Self-Consistency comes in.

The idea is brilliant in its simplicity: instead of running the model once, run it N times (with higher temperature to generate variety) and choose the most frequent answer. Research from 2025-2026 shows gains of 12 to 18% additional accuracy over pure CoT (Source: Adaline.ai).
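A small calculation shows why voting can help. The sketch below makes strong simplifying assumptions that are not in the cited research: every sample is independently correct with the same probability p, and answers are simply right or wrong. Real samples from the same model are correlated, so actual gains will be smaller.

```python
from math import comb


def majority_correct_probability(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct."""
    threshold = n // 2 + 1  # smallest number of correct samples that wins the vote
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(threshold, n + 1)
    )


for n in (1, 3, 5):
    print(n, round(majority_correct_probability(0.7, n), 3))
# 1 0.7
# 3 0.784
# 5 0.837
```

The same formula also shows the limit of the technique: when a single sample is right less than half the time (for example p = 0.4), voting makes the result worse, not better.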

# self_consistency.py: majority voting over multiple CoT runs

from openai import OpenAI
from collections import Counter
import re

client = OpenAI()


def extract_answer(text: str) -> str:
    """Extracts the ANSWER block from the CoT-generated text."""
    match = re.search(r"ANSWER:\s*(.*?)(?:\n|$)", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()


def answer_with_consistency(
    question: str,
    context: str,
    n_samples: int = 5,
    base_temperature: float = 0.7,
) -> dict:
    """
    Generates N answers with Chain-of-Thought and chooses the most frequent.
    Higher temperature (0.7) encourages diversity in samples.
    """
    answers = []
    reasonings = []

    for _ in range(n_samples):
        messages = [
            {
                "role": "system",
                "content": """Think step-by-step and then answer.

STEP 1: Identify relevant data
STEP 2: Analyze the relationship with the question
STEP 3: Check if there is enough data
STEP 4: Conclude

Final format:
REASONING: [your reasoning]
ANSWER: [final answer]""",
            },
            {"role": "user", "content": f"Context: {context}\n\nQuestion: {question}"},
        ]

        resp = client.chat.completions.create(
            model="gpt-5.5",
            messages=messages,
            temperature=base_temperature,
            max_tokens=1024,
        )

        text = resp.choices[0].message.content
        reasonings.append(text)
        answers.append(extract_answer(text))

    # Majority voting: chooses the most frequent answer
    count = Counter(answers)
    final_answer = count.most_common(1)[0][0]

    return {
        "final_answer": final_answer,
        "all_answers": answers,
        "reasonings": reasonings,  # Full texts, useful for debugging the vote
        "voting": dict(count),
        "n_samples": n_samples,
    }


# Test: question with ambiguous information in context
result = answer_with_consistency(
    question="Is product X available?",
    context="Product X was launched in January. In March, production was paused for regulatory compliance. The forecast for return is next quarter, subject to ANVISA approval.",
    n_samples=5,
)

print(f"Final answer (voting): {result['final_answer']}")
print(f"Vote distribution: {result['voting']}")

The cost? 5 API calls instead of 1. But the gain in reliability is worth it — and you can adjust n_samples according to your token budget. For low-risk answers, 3 samples already make a difference.
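One practical detail: counting exact strings treats "No." and "no" as different votes. The sketch below normalizes answers before counting and reports how strongly the samples agree. The `min_agreement` threshold and the idea of abstaining below it are assumptions added here for illustration, not part of the original pipeline, and long free-form answers will rarely match even after normalization.

```python
import re
from collections import Counter


def normalize(answer: str) -> str:
    """Lowercases, drops punctuation, and collapses whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", answer.lower())
    return " ".join(cleaned.split())


def vote(answers: list[str], min_agreement: float = 0.6) -> dict:
    """Majority vote over normalized answers; abstains when agreement is too low."""
    if not answers:
        raise ValueError("answers must not be empty")
    normalized = [normalize(a) for a in answers]
    counts = Counter(normalized)
    winner_key, winner_votes = counts.most_common(1)[0]
    agreement = winner_votes / len(answers)
    # Return the first original answer that maps to the winning normalized form.
    winner = next(a for a, key in zip(answers, normalized) if key == winner_key)
    return {
        "answer": winner if agreement >= min_agreement else None,
        "agreement": agreement,
        "votes": dict(counts),
    }


print(vote(["No.", "no", "No", "Yes.", "NO "]))
print(vote(["Yes", "No", "Maybe"]))
```

When `answer` comes back as `None`, the chatbot can fall back to "I don't have enough information to answer" instead of picking an arbitrary winner.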

Step 5: Optimize everything with DSPy — the prompt machine that learns

Steps 3 and 4 work, but they have a problem: the prompts were written by hand. Every wording adjustment, every temperature, every template requires trial and error. That's where DSPy comes in.

Developed by Stanford NLP, DSPy reverses the logic: instead of you manually tuning the prompt, you declare what you want (input, output, metric) and the framework automatically optimizes the prompt. The paper published at IEEE ESCI 2026 shows: 30-45% improvement in factual accuracy and ~25% reduction in hallucinations (arXiv:2604.04869).

"DSPy represents a fundamental shift in how engineers interact with language models — you declare what you want and the framework handles the 'how' while you focus on the 'what'." — Rob Ragan, Starlog (May/2026)

# dspy_optimizer.py: automatic prompt optimization with DSPy

import dspy
from dspy.teleprompt import BootstrapFewShot

# Configures the LM (can be OpenAI, Anthropic, etc.)
lm = dspy.LM("openai/gpt-5.5", temperature=0.3)
dspy.settings.configure(lm=lm)


# Defines the module: you declare WHAT, not HOW
class AntiHallucinationChatbot(dspy.Module):
    def __init__(self):
        super().__init__()
        # DSPy will learn the best prompt for this module
        self.responder = dspy.ChainOfThought("context, question -> answer")

    def forward(self, context, question):
        return self.responder(context=context, question=question)


# Training data: example pairs with expected answer
TRAINING = [
    dspy.Example(
        context="Brazil has 26 states and 1 Federal District. Estimated population in 2026: 218 million.",
        question="How many states does Brazil have?",
        answer="26 states plus the Federal District.",
    ).with_inputs("context", "question"),
    dspy.Example(
        context="The company billed R$ 10M in 2025. The projection for 2026 is R$ 14M.",
        question="How much did the company bill in 2025?",
        answer="R$ 10 million.",
    ).with_inputs("context", "question"),
    dspy.Example(
        context="The TechConf event happens annually in São Paulo since 2019.",
        question="Where will TechConf 2026 be?",
        answer="I don't have enough information to answer. The context says the event happens in São Paulo, but does not confirm the 2026 edition.",
    ).with_inputs("context", "question"),
]


# Metric: penalizes answers that invent information
def accuracy_metric(example, pred, trace=None):
    if "don't have enough information" in pred.answer.lower():
        return "don't have enough information" in example.answer.lower()
    return pred.answer.lower().strip() == example.answer.lower().strip()


# Optimizer: BootstrapFewShot builds few-shot demonstrations from the training
# set and keeps the ones that pass the metric
optimizer = BootstrapFewShot(
    metric=accuracy_metric,
    max_bootstrapped_demos=3,  # Maximum of 3 examples (iBuidl validation)
    max_labeled_demos=3,
)
optimized_chatbot = optimizer.compile(AntiHallucinationChatbot(), trainset=TRAINING)

# Test with real data: the prompt is now optimized
response = optimized_chatbot(
    context="Google invested US$ 30 billion in AI infrastructure in 2026.",
    question="How much did Google invest in AI?",
)
print(f"Optimized answer: {response.answer}")

Note the max_bootstrapped_demos=3: it's not a coincidence. iBuidl research shows that 3 examples are the sweet spot between quality and cost. DSPy does not choose this value for you, so the code sets it explicitly.
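The exact-match metric above is strict: "The company billed R$ 10 million." would be scored as wrong against "R$ 10 million.". As one possible alternative, the sketch below accepts answers with enough word overlap while still requiring every number in the expected answer to appear in the prediction. The function names, the 0.6 threshold, and the number check are assumptions for illustration. Only the `(example, pred, trace=None)` signature comes from the code above, and `SimpleNamespace` objects stand in for DSPy examples and predictions.

```python
import re
from collections import Counter
from types import SimpleNamespace

REFUSAL = "don't have enough information"


def _tokens(text: str) -> list[str]:
    return re.findall(r"\w+", text.lower())


def _numbers(text: str) -> set[str]:
    return set(re.findall(r"\d+(?:[.,]\d+)*", text))


def token_f1(expected: str, predicted: str) -> float:
    """Word-overlap F1 between the expected and predicted answers."""
    exp, pred = _tokens(expected), _tokens(predicted)
    overlap = sum((Counter(exp) & Counter(pred)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(exp)
    return 2 * precision * recall / (precision + recall)


def soft_accuracy_metric(example, pred, trace=None, threshold=0.6):
    """Lenient metric: refusals must match, numbers must match, wording may vary."""
    expected_refusal = REFUSAL in example.answer.lower()
    predicted_refusal = REFUSAL in pred.answer.lower()
    if expected_refusal or predicted_refusal:
        return expected_refusal and predicted_refusal
    # Every number in the expected answer must appear in the prediction.
    if not _numbers(example.answer) <= _numbers(pred.answer):
        return False
    return token_f1(example.answer, pred.answer) >= threshold


expected = SimpleNamespace(answer="R$ 10 million.")
print(soft_accuracy_metric(expected, SimpleNamespace(answer="The company billed R$ 10 million.")))
print(soft_accuracy_metric(expected, SimpleNamespace(answer="R$ 14 million.")))
```

The number check matters here: without it, "R$ 14 million." shares enough words with the expected answer to pass the overlap threshold.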

Step 6: Manage the context window without losing information in the middle

You've already built the anti-hallucination pipeline, but there's a silent problem: as the conversation history grows, the model loses information in the middle of the context window.

The phenomenon is known as Lost in the Middle, documented by SurePrompts (2026):

Position in context | Information recall
First 20% | 90%+
Middle (20% to 80%) | 60-70%
End (last 20%) | 85%+

(Source: SurePrompts, Context Window Management Strategies, 2026)

This means that, in a long conversation, critical information in the middle tends to be forgotten — and the model hallucinates precisely where it shouldn't.
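One way to act on these recall numbers is to order context items so that the most important ones sit at the edges of the window. The sketch below assumes the items are already ranked from most to least relevant (how to rank them is outside its scope) and alternates them between the front and the back. The function name is hypothetical.

```python
def reorder_for_recall(items_by_relevance: list[str]) -> list[str]:
    """Puts the most relevant items at the start and end, the least relevant in the middle.

    ASSUMPTION: the input is sorted from most to least relevant.
    """
    front, back = [], []
    for index, item in enumerate(items_by_relevance):
        # Even ranks fill the beginning, odd ranks fill the end.
        if index % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant item
print(reorder_for_recall(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

Here the two most relevant items ("A" and "B") end up first and last, while the least relevant one ("E") lands in the middle, where recall is weakest.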

# context_manager.py: intelligent context management

from collections import deque


class ContextManager:
    """
    Keeps relevant context at the beginning and end of the window,
    minimizing the 'Lost in the Middle' effect.
    """

    def __init__(self, max_tokens: int = 6000):
        self.max_tokens = max_tokens
        self.history = deque()
        self.permanent_info = ""  # Fixed data (always at the beginning)

    def set_permanent_info(self, info: str):
        """Sets information that stays ALWAYS at the top of the context."""
        self.permanent_info = info

    def add_turn(self, question: str, answer: str):
        """Adds a conversation turn to the history."""
        self.history.append({
            "question": question,
            "answer": answer
        })

    def build_context(self, new_question: str) -> list[dict]:
        """
        Builds the message list following the strategy:
        - Permanent info at the beginning (always visible)
        - Most recent turns at the end (higher recall)
        - Middle tur