
48% Don't Test, 40% Hallucinate: How to Evaluate LLMs in 2026 — Analytical Guide

NeuralPulse | May 31, 2026 | 12 min read

Almost half of the professionals building AI agents simply do not test their models. The figure comes from the LangChain State of Agent Engineering 2026 report: only 52% of teams have evaluation systems in place. The other 48% are flying blind.

The situation is more serious than it looks. A poorly evaluated RAG system can hallucinate at high rates, and the Amplify Partners AI Engineering Report calls LLM evaluation the "number one problem" in AI engineering today. In critical sectors like legal, a Stanford study (2026) documented that tools like LexisNexis Lexis+ AI and Thomson Reuters Westlaw AI hallucinate between 17% and 33% of the time, even with RAG implemented. It's like piloting a plane without instruments: you might get somewhere, but you don't know where you'll land.

This analytical guide is not another step-by-step tutorial. It's an X-ray of existing approaches — traditional metrics, LLM-as-Judge, specialized frameworks — with real data, critical comparison, and executable code. The goal is for you to leave here knowing not just how to evaluate, but why each method works, where it lies, and when to switch strategies.

Why Evaluating an LLM is Fundamentally Different from Evaluating a Traditional Model

If you come from classic ML, you know the ritual: train/test split, accuracy, precision, recall, F1. The metrics are well defined because the output is structured: class A or B, 0 or 1, a bounding box.

With LLMs, the output is natural language. Open. Creative. There is no "right answer" in the same sense. And that's where the problem lies.

Traditional metrics like BLEU and ROUGE were created for machine translation and summarization. They compare n-grams between the generated response and a reference. The problem? They penalize benign divergence. A model that says "The patient presents with high fever" instead of "The patient has an elevated temperature" loses points, even though it's semantically identical.
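To make the penalty concrete, here is a minimal sketch of clipped n-gram precision, the building block of BLEU, applied to the two sentences above. It leaves out BLEU's brevity penalty and geometric mean, and the helper names `ngrams` and `ngram_precision` are invented for this illustration rather than taken from any library.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision: share of candidate n-grams found in the reference."""
    cand = Counter(ngrams(candidate.lower().split(), n))
    ref = Counter(ngrams(reference.lower().split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "The patient has an elevated temperature"
candidate = "The patient presents with high fever"

unigram = ngram_precision(candidate, reference, 1)
bigram = ngram_precision(candidate, reference, 2)
print(f"unigram precision: {unigram:.2f}")
print(f"bigram precision:  {bigram:.2f}")
```

Only "the" and "patient" overlap, so two of six unigrams and one of five bigrams match, even though a clinician would read both sentences the same way.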

"BLEU and ROUGE penalize benign divergence in open-ended tasks," states Adnan Masood, PhD, in a recent analysis on LLM evaluation. For creative or conversational generation tasks, these metrics are worse than useless — they give a false sense of control.

The practical result? A team that blindly trusts ROUGE to evaluate a customer service chatbot is measuring the wrong thing. Worse: they are making deployment decisions based on noise.

The 3 Major Evaluation Approaches in 2026

Across dozens of tools and frameworks, LLM evaluation falls into three families. Each solves a different problem and has its own trade-offs; ideally, you will use all three together.

1. Similarity Metrics (BLEU, ROUGE, BERTScore)

These are the oldest and still the most used, out of habit rather than effectiveness. BLEU measures n-gram precision against a reference; ROUGE measures n-gram recall; BERTScore replaces exact matching with similarity between contextual embeddings.

Where they work: tasks with a canonical answer, such as translation, extractive summarization, transcription.

Where they fail: any open-ended task. A code assistant that suggests two equally correct implementations is penalized if it doesn't match the reference.

2. LLM-as-Judge

The approach that gained traction in 2025-2026 uses an LLM (usually GPT-4, Claude 3.5, or Gemini 2.5) to evaluate the response of another LLM. You define the criteria (relevance, factuality, tone, adherence to context) and the judge model scores each response against them.

The numbers are promising. According to Zylos Research, LLM-as-Judge achieves between 80% and 90% agreement with human evaluators when well configured. The trick is in how you configure it.
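An agreement figure like this is something you can measure on your own data. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance, between human and judge labels. The labels are made-up illustrative data, and the function names are assumptions of this example.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of items where the judge gives the same label as the human."""
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human, judge):
    """Agreement corrected for the agreement expected by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    h_counts = Counter(human)
    j_counts = Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((h_counts[label] / n) * (j_counts[label] / n) for label in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for ten responses (not real evaluation data).
human_labels = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "pass", "pass", "pass", "pass", "fail", "fail", "pass"]

agreement = agreement_rate(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(f"raw agreement: {agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Raw agreement can look high simply because most responses pass; kappa is the more conservative number to track when the classes are imbalanced.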

MLflow MemAlign, released in 2026, improves LLM-as-Judge agreement by aligning the judge model's memory with the evaluation context (mlflow.org, official documentation). Instead of evaluating in a vacuum, the judge "remembers" previous examples and calibrates its criteria.

But there is an important caveat:

"If you can't trust the judge, how can you trust the results?" — DataRobot.

If the judge model has biases (and every LLM does), you are propagating those biases into your metric. A GPT-4 that tends to prefer longer responses will overestimate the quality of verbose texts. It's confirmation bias outsourced to an API.
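One rough way to probe for this length bias is to correlate response length with the judge's score and compare the result with human ratings of the same responses. The sketch below does that with a hand-written Pearson correlation; the records are hypothetical numbers chosen for illustration, not measurements from any real judge.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


# Hypothetical records: word count of each response plus judge and human scores.
records = [
    {"words": 8, "judge": 0.55, "human": 0.90},
    {"words": 15, "judge": 0.60, "human": 0.85},
    {"words": 40, "judge": 0.75, "human": 0.80},
    {"words": 90, "judge": 0.85, "human": 0.85},
    {"words": 150, "judge": 0.95, "human": 0.70},
]

lengths = [r["words"] for r in records]
judge_r = pearson(lengths, [r["judge"] for r in records])
human_r = pearson(lengths, [r["human"] for r in records])

print(f"length vs judge score: {judge_r:+.2f}")
print(f"length vs human score: {human_r:+.2f}")
```

A strong positive correlation for the judge that humans do not share is a signal to add a conciseness criterion or to normalize for length before trusting the scores.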

3. Specialized Frameworks

The third approach, and the one with the fastest-growing adoption, is the use of frameworks that combine multiple metrics, automated tests, and a continuous validation suite. Instead of a single metric, you have a dashboard.

DeepEval leads the movement: 100 million evaluations per day, over 150 thousand developers, 50+ metrics, Apache 2.0 license. It allows everything from unit tests with Pytest to complete RAG evaluation with integrated RAGAS.

Promptfoo, with 21.7k GitHub stars and acquired by OpenAI in 2026, focuses on side-by-side comparison of models and prompts. It's MIT licensed and has become a standard in many product teams.

RAGAS, in turn, is the most specific framework for RAG: it measures faithfulness, context relevancy, answer relevancy, and retrieval recall.

The table below shows how these three approaches compare on practical criteria:

| Criterion | Similarity Metrics | LLM-as-Judge | Specialized Frameworks |
| --- | --- | --- | --- |
| Cost per evaluation | Low (local computation) | High (API call) | Medium (hybrid) |
| Accuracy in open tasks | Low | High (80-90% agreement) | High (combine multiple) |
| Algorithmic bias | None | High (inherited from judge) | Medium (mitigable) |
| Setup complexity | Minimal | Medium | Medium-High |
| Metric coverage | 2-3 metrics | Unlimited (definable) | 50+ in DeepEval |
| Suitable for | Translation, extractive summarization | General quality, conversation | RAG, agents, production |

In Practice: Implementing Evaluation with DeepEval, RAGAS, and MLflow

Enough theory. Let's see how this works in code. The example below sets up an evaluation suite for a technical support RAG system.

Basic Setup with DeepEval + Pytest

DeepEval integrates natively with Pytest, meaning your evaluations become tests that pass or fail. This is powerful: you put evaluation into the CI/CD pipeline.

# test_evaluation.py
import pytest
from deepeval import assert_test
from deepeval.metrics import HallucinationMetric, FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Configures the judge model (default: GPT-4 via OpenAI)
# DeepEval 2.0+ supports Claude, Gemini, and local models via Ollama


def test_rag_response_quality():
    """Checks if the RAG response is faithful to the retrieved context."""
    # Documents returned by the retriever for this query
    docs = [
        "Documentation: Admin panel password reset is in Settings > Security.",
        "The system sends a reset link to the registered email.",
        "The maximum time to receive the email is 5 minutes.",
    ]

    # Simulates a real technical support case
    test_case = LLMTestCase(
        input="How do I reset the admin panel password?",
        actual_output="You can reset the password by accessing Settings > Security > Reset Password. "
                      "A link will be sent to your registered email within 5 minutes.",
        context=docs,            # HallucinationMetric reads `context`
        retrieval_context=docs,  # FaithfulnessMetric reads `retrieval_context`
    )

    # Evaluates faithfulness to context (0.0 to 1.0, higher is better)
    faithfulness = FaithfulnessMetric(threshold=0.7)

    # Evaluates if there is hallucination (0.0 to 1.0, lower is better, ideal < 0.3)
    hallucination = HallucinationMetric(threshold=0.3)

    assert_test(test_case, [faithfulness, hallucination])

If the response invents information not present in the retrieved documents, the test fails. CI blocks the deployment. Simple and direct.

RAG Evaluation with RAGAS

For a more granular analysis of the retrieval pipeline, RAGAS offers specific metrics:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    context_relevancy,
    answer_relevancy,
    context_recall,
)

# Note: these imports follow the older ragas 0.1.x API. Some metrics,
# context_relevancy among them, were deprecated or renamed in later releases,
# so check the names against the version you have installed.

# Example data: questions, answers, retrieved contexts, and ground truth
eval_dataset = Dataset.from_dict({
    "question": [
        "What is the warranty period for product X?",
        "How do I cancel my subscription?",
    ],
    "answer": [
        "Product X has a 12-month warranty against manufacturing defects.",
        "You can cancel anytime through the panel.",
    ],
    "contexts": [
        ["12-month warranty for product X, covering manufacturing defects."],
        ["Cancellation available anytime via the customer panel."],
    ],
    "ground_truth": [
        "The warranty for product X is 12 months for manufacturing defects.",
        "Cancellation can be done anytime through the customer panel.",
    ],
})

# Runs every metric over the dataset (the metrics call a judge LLM, so an API key is required)
result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, context_relevancy, answer_relevancy, context_recall],
)

print(f"Faithfulness: {result['faithfulness']:.2f}")
print(f"Context Relevancy: {result['context_relevancy']:.2f}")
print(f"Context Recall: {result['context_recall']:.2f}")

RAGAS returns scores from 0 to 1 for each metric. context_recall is particularly useful: it measures whether the retriever brought back enough information to answer the question. If it's low, the problem isn't the LLM; it's the chunking or the embedding.
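That reading of the scores can be turned into a small triage helper. The sketch below maps low scores to the pipeline stage most likely responsible; the `diagnose` function, the 0.7 threshold, and the wording of the findings are assumptions made for this illustration and are not part of RAGAS.

```python
def diagnose(scores, threshold=0.7):
    """Map low RAG metric scores to the pipeline stage most likely at fault."""
    findings = []
    # Low recall means the needed facts never reached the prompt.
    if scores.get("context_recall", 1.0) < threshold:
        findings.append("retrieval: review chunking, embeddings, or top-k")
    # Low faithfulness means the answer goes beyond what the context supports.
    if scores.get("faithfulness", 1.0) < threshold:
        findings.append("generation: answer is not grounded in the retrieved context")
    # Low answer relevancy means the answer does not address the question asked.
    if scores.get("answer_relevancy", 1.0) < threshold:
        findings.append("generation: answer drifts away from the question")
    return findings or ["no metric below threshold"]


# Hypothetical scores for one evaluation run.
example_scores = {"faithfulness": 0.92, "answer_relevancy": 0.88, "context_recall": 0.41}
for finding in diagnose(example_scores):
    print(finding)
```

Separating retrieval findings from generation findings keeps a team from tuning prompts when the real fix is in the index.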

Enhanced LLM-as-Judge with MLflow MemAlign

The biggest problem with LLM-as-Judge is inconsistency between evaluations. MLflow MemAlign solves this by maintaining a "memory" of previous evaluations that calibrates the judgment:

import mlflow
from mlflow.models import EvaluationModel

# NOTE: illustrative sketch. The EvaluationModel class and the memory_config
# parameters are shown as this article presents them and have not been verified
# against the MLflow API; consult the MLflow documentation for the actual
# MemAlign interface before running this.

# Configures evaluator with MemAlign
evaluator = EvaluationModel(
    model_uri="models:/gpt-4-turbo/latest",
    evaluator_type="llm_judge",
    memory_config={
        "enabled": True,
        "memory_type": "memalign",
        "calibration_samples": 20,  # calibration samples
        "context_window": 5,        # how many recent evaluations to "remember"
    },
    criteria={
        "factual_accuracy": "Is the response factually correct?",
        "completeness": "Does the response cover all aspects of the question?",
        "conciseness": "Is the response direct without irrelevant information?",
    },
)

# Evaluates a batch of responses
results = evaluator.evaluate(
    questions=[
        "What is the impact of the Selic rate on real estate credit?",
        "How do payroll-deducted loans work for public servants?",
    ],
    responses=[
        "The Selic rate directly influences real estate credit, as...",
        "Payroll-deducted loans for public servants have payroll deductions...",
    ],
    contexts=[
        "Document on economic policies and credit.",
        "Guide to payroll-deducted loans for public servants.",
    ],
)

print(f"Factual accuracy: {results['factual_accuracy']:.2%}")
print(f"Completeness: {results['completeness']:.2%}")

The gain from MemAlign is measurable: the official MLflow documentation reports significant error reductions and improved consistency between evaluations when memory alignment is active (mlflow.org, LLM-as-Judge documentation).
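The general idea of a judge that "remembers" can be sketched without any framework: keep a rolling window of human-reviewed verdicts and prepend them to each new judge prompt as calibration examples. The `JudgeMemory` class below is a hypothetical illustration of that idea only. It is not MLflow's MemAlign implementation, and the prompt layout is an assumption.

```python
from collections import deque


class JudgeMemory:
    """Keeps the most recent human-reviewed verdicts as few-shot calibration examples."""

    def __init__(self, window=5):
        # deque with maxlen drops the oldest example automatically.
        self.examples = deque(maxlen=window)

    def remember(self, question, response, verdict, rationale):
        """Store one reviewed evaluation."""
        self.examples.append(
            {"question": question, "response": response,
             "verdict": verdict, "rationale": rationale}
        )

    def build_prompt(self, criterion, question, response):
        """Assemble a judge prompt that shows remembered examples before the new case."""
        lines = [f"Criterion: {criterion}", "", "Previously reviewed examples:"]
        for ex in self.examples:
            lines.append(f"- Q: {ex['question']}")
            lines.append(f"  A: {ex['response']}")
            lines.append(f"  Verdict: {ex['verdict']} ({ex['rationale']})")
        lines += ["", "Now evaluate:", f"Q: {question}", f"A: {response}", "Verdict:"]
        return "\n".join(lines)


memory = JudgeMemory(window=2)
memory.remember("first question", "first answer", "fail", "cites no source")
memory.remember("second question", "second answer", "pass", "matches the policy text")
memory.remember("third question", "third answer", "fail", "wrong deadline")

prompt = memory.build_prompt("factual_accuracy", "new question", "new answer")
print(prompt)
```

With a window of two, the first example has already been evicted by the time the prompt is built, which mirrors the "how many recent evaluations to remember" setting described above.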

Critical Analysis of Frameworks: What to Use and When

With so many options, choosing the right framework depends on your context. Below is a direct comparison:

| Framework | License | GitHub Stars | Best For | Main Limitation |
| --- | --- | --- | --- | --- |
| DeepEval | Apache 2.0 | ~16k | CI/CD, pipelines, 50+ metrics | High learning curve |
| RAGAS | Apache 2.0 | ~13k | Pure RAG evaluation | Limited to RAG scenarios |
| Promptfoo | MIT | 21.7k | Prompt/model comparison | Dev-focused, not production |
| MLflow | Apache 2.0 | 22k+ | Integrated MLOps + LLM-as-Judge | Requires MLflow ecosystem |
| LangSmith | Proprietary | n/a | Tracing + integrated evaluation | Vendor lock-in with LangChain |
| TruLens | Apache 2.0 | 5k+ | Evaluation based on feedback functions | Smaller community |
| Arthur Bench | Proprietary | 2k+ | Performance benchmarks | Enterprise focus, high cost |
| LM Evaluation Harness | Apache 2.0 | 8k+ | Academic benchmarks | Doesn't scale for production |

The ideal choice in 2026 is usually a combination: DeepEval for automated CI/CD tests, RAGAS for fine-grained retrieval diagnosis, and Promptfoo for quick validation during prompt development.

It's worth noting that CCRS (arXiv 2506.20128), submitted in June 2025 and accepted at LLM4Eval @ SIGIR 2025, proposes 5 zero-shot metrics for RAG using Mistral-7B as the judge model. This points to a clear trend: evaluation without ground truth, running locally, with no dependence on expensive APIs. If it continues, the cost per evaluation should keep falling in the coming years.

The Cost of Not Evaluating: 40% of Projects Will Fail

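Gartner was direct: it predicts that more than 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. A lack of adequate evaluation feeds all three. It's not speculation; it's a projection based on the pattern observed in 2024-2026.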

The cost isn't just cancellation. It's the operational cost of a poorly evaluated LLM in production. Each hallucinated response in a customer service system generates rework, dissatisfaction, and in critical cases, regulatory fines. A single hallucination in a legal opinion can cost millions.

"A RAG system can score 0.95 faithfulness and produce wrong business answers if the retrieved content is stale," warns Atlan. The isolated metric is misleading — you need an evaluation system that monitors not just the model, but the quality of the source data.

Without evaluation, the consequences compound, as srajdev.com puts it:

"Without proper evaluation frameworks, enterprises face change paralysis, quality degradation, and cost explosions."

The cost of not evaluating is not zero. It's the cost of discovering the problem after the customer has paid the bill.
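The stale-content warning quoted above suggests monitoring source freshness alongside faithfulness. The sketch below flags answers that score well on faithfulness but rest on old documents. The record layout, the 180-day limit, and the 0.7 faithfulness cutoff are all assumptions chosen for illustration.

```python
from datetime import date


def flag_stale_answers(results, today, max_age_days=180, min_faithfulness=0.7):
    """Return (id, age_in_days) for faithful answers whose oldest source is too old."""
    flagged = []
    for r in results:
        oldest_source = min(r["source_dates"])
        age = (today - oldest_source).days
        # A high faithfulness score only says the answer matches its sources,
        # not that those sources are still correct.
        if r["faithfulness"] >= min_faithfulness and age > max_age_days:
            flagged.append((r["id"], age))
    return flagged


# Hypothetical evaluation records with the last-updated date of each source document.
results = [
    {"id": "q1", "faithfulness": 0.95, "source_dates": [date(2025, 1, 15), date(2026, 4, 2)]},
    {"id": "q2", "faithfulness": 0.91, "source_dates": [date(2026, 5, 1)]},
    {"id": "q3", "faithfulness": 0.40, "source_dates": [date(2024, 6, 30)]},
]

stale = flag_stale_answers(results, today=date(2026, 5, 31))
for answer_id, age in stale:
    print(f"{answer_id}: faithful to a source last updated {age} days ago")
```

Low-faithfulness answers are deliberately left out of this report because the faithfulness check already catches them; the point here is the cases that would otherwise pass.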

Where to Place Evaluation in the Pipeline

A robust evaluation strategy has three layers:

  1. Development: Promptfoo or local DeepEval to test prompt variations and