Python code interface with audio waves and a virtual chatbot

Transcription and Response Pipeline with Whisper and Llama 3: Local Implementation in Python

NeuralPulse|11 de junho de 2026|7 min read|Ler em Português

Imagine you're a developer at a small medical clinic. Your boss asks for a system that transcribes consultations and generates automatic summaries, but the budget for cloud APIs is zero. Or you're a privacy enthusiast who wants a voice assistant that never sends audio to external servers. For these cases, running Whisper and Llama 3 locally isn't just an option—it's the only viable one.

With Whisper and Llama 3, you can build a complete voice processing pipeline in Python. No paying per token. No sending data to third parties. And with latency acceptable for production.

The EchoKit, an open-source project shared on Hacker News, showed the way: a voice agent running on an ESP32 with a Rust server. But you don't need special hardware. A laptop with an 8GB GPU is enough.

In this tutorial, I'll guide you through building a pipeline that captures audio, transcribes it with Whisper large-v3, and generates responses with quantized Llama 3 8B. All local, free, and in Python.

The Complete Pipeline: From Audio to Response in Seconds

The flow is simple: microphone captures audio → Whisper transcribes → Llama 3 generates a response → text is synthesized into speech (optional). The magic is in the integration.

Whisper large-v3 achieves only 5% Word Error Rate (WER) in Portuguese, according to OpenAI's official benchmark (2025). This means transcription errors are rare. For a voice pipeline, it's more than enough.

Llama 3 8B, with 4-bit quantization, runs on GPUs with 8GB VRAM, as documented in the Hugging Face model card (2025). The 70B version requires more hardware, but for quick responses, the smaller model delivers quality.

Total latency ranges from 3 to 8 seconds, depending on audio length and prompt. Acceptable for a virtual assistant.

Table 1: Comparison of Whisper and Llama 3 Models for Voice Pipeline

Model	Size	Required VRAM	WER (Portuguese)	Average Latency (5s audio)
Whisper tiny	39 MB	1 GB	12%	1.2s
Whisper base	74 MB	1.5 GB	8%	1.8s
Whisper small	244 MB	2 GB	6%	2.5s
Whisper large-v3	1.5 GB	4 GB	5%	4.0s
Llama 3 8B (4-bit)	4.5 GB	6 GB	—	2.5s (generation)
Llama 3 70B (4-bit)	35 GB	40 GB	—	8s (generation)

Source: OpenAI Whisper benchmarks (2025) and Hugging Face model cards (2025).

For a voice pipeline, the combination of Whisper large-v3 + Llama 3 8B offers the best cost-benefit. High quality without requiring a dedicated server.

Step-by-Step: Implementation in Python

I'll use Python 3.10+. The main libraries are whisper, transformers, and sounddevice. Install everything with pip.

pip install openai-whisper transformers torch sounddevice scipy

1. Capture and Transcription with Whisper

First, capture audio from the microphone. Use sounddevice to record for 5 seconds. Then, save it as a numpy array.

import sounddevice as sd
import numpy as np
import whisper

def record_audio(duration=5, rate=16000): print("Speak now...") audio = sd.rec(int(duration * rate), samplerate=rate, channels=1) sd.wait() return np.squeeze(audio)

Load Whisper model

whisper_model = whisper.load_model("large-v3")

Record and transcribe

audio = record_audio() result = whisper_model.transcribe(audio, language="pt") user_text = result["text"] print(f"You said: {user_text}")

The language="pt" parameter improves accuracy in Portuguese. Without it, Whisper might confuse it with Spanish.

2. Response Generation with Llama 3

Now, load the quantized Llama 3 8B. Use the transformers library with 4-bit loading to save VRAM.

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16 )

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct") llama_model = AutoModelForCausalLM.from_pretrained( "meta-llama/Meta-Llama-3-8B-Instruct", quantization_config=quant_config, device_map="auto" )

def generate_response(prompt): inputs = tokenizer(prompt, return_tensors="pt").to("cuda") outputs = llama_model.generate(**inputs, max_new_tokens=150, temperature=0.7) response = tokenizer.decode(outputs[0], skip_special_tokens=True) return response

system_prompt = "You are a transcription assistant. Respond politely and objectively." full_prompt = f"{system_prompt}\nUser: {user_text}\nAssistant:" response = generate_response(full_prompt) print(f"Assistant: {response}")

4-bit quantization reduces the model from 16GB to ~4.5GB. It runs on GPUs like the RTX 3060 or higher.

ElevenLabs

Transforme texto em voz com IA realista. Perfeito para narracoes, podcasts e audiolivros.

Testar gratuito

3. Conversation Loop and Voice Output

For a fluid pipeline, put everything in a loop. The user speaks, the bot responds. For voice output, use pyttsx3 (offline TTS).

import pyttsx3

engine = pyttsx3.init() engine.setProperty('rate', 180) # Speech speed

while True: audio = record_audio() user_text = whisper_model.transcribe(audio, language="pt")["text"] if "bye" in user_text.lower(): break full_prompt = f"{system_prompt}\nUser: {user_text}\nAssistant:" response = generate_response(full_prompt) print(f"Assistant: {response}") engine.say(response) engine.runAndWait()

Offline TTS doesn't require internet. For higher quality, replace it with Bark or Coqui TTS, but they consume more GPU.

Optimizations for Production and Latency

Running locally is cheap, but requires adjustments. Here are three practical tips.

Reduce the Whisper Model for Simple Cases

If the pipeline is restricted to short commands (e.g., "what's my balance?"), use Whisper small. It has 6% WER and loads in 2 GB of VRAM. Latency drops to 2.5s.

Community reports point to errors mainly with very heavy accents. For short, direct commands, the small model gets the job done.

Use Response Caching for Frequent Questions

Llama 3 can generate generic responses for common questions. Create a cache of question-answer pairs. If the transcription matches a known question, return the predefined answer.

This cuts generation latency to milliseconds. And reduces GPU usage.

Consider EchoKit as Hardware Inspiration

The EchoKit, an open-source project with a Rust server, showed that an ESP32-S3 can run the entire pipeline. But its server does the heavy lifting: ASR with Whisper and LLM with Llama.

If you want a dedicated device, adapt this tutorial to send audio to a Python server. The ESP32 captures and the server processes. Less latency than a Raspberry Pi.

A fully local, fully controllable voice agent is viable with modest hardware — that is what projects like EchoKit demonstrate in practice.

Is It Worth It? Yes, with Caveats

Running Whisper and Llama 3 locally eliminates API costs. For a small business or MVP, it's a robust solution. Latency of 5 to 8 seconds is acceptable for voice processing.

But there are limits. Llama 3 8B can hallucinate on complex responses. For critical calls, a larger model (70B) or fine-tuning is necessary. And then hardware needs to scale up.

The future? Smaller and more efficient models. Whisper large-v3 is already state-of-the-art. And Llama 3 8B, with quantization, has become accessible for consumer GPUs. In 2026, the trend is for local voice pipelines to become standard in applications requiring privacy and low cost.

Final Thoughts

This tutorial showed how to build a complete voice pipeline with Whisper and Llama 3, all in Python and running locally. You learned to capture audio, transcribe with high accuracy, generate intelligent responses, and even synthesize speech, without relying on external services.

The combination of Whisper large-v3 with Llama 3 8B offers an ideal balance between quality and hardware requirements. With optimizations like response caching and smaller models for simple tasks, you can further reduce latency and resource consumption.

Remember that data privacy is a crucial benefit of this approach. No audio or text leaves your computer, ensuring compliance with regulations like the LGPD.

Now it's your turn: implement this pipeline, adapt it to your needs, and explore the possibilities. The ecosystem of local models is growing rapidly, and 2026 promises to be the year voice intelligence becomes truly accessible to everyone.

Also check out: How to Use AI to Create High-Quality Content in 2026 Also check out: From Dataset to Ollama: Fine-Tuning LLMs with Unsloth on Your GPU in 2026 Also check out: 48% Don't Test, 40% Hallucinate: How to Evaluate LLMs in 2026 — Analytical Guide

#whisper#llama-3#voice-pipeline#transcription#python#local#free#audio-processing#privacy

Python code running in a text editor with semantic similarity charts in the background

tutorials|5 min

Semantic Search with Python and Open-Source Models

Practical tutorial on embeddings for semantic search in Python using open-source models such as BGE-M3 and GTE-Qwen2. Runnable code and performance metrics.

13 de junho de 2026Read more