Transcription and Response Pipeline with Whisper and Llama 3: Local Implementation in Python
Imagine you're a developer at a small medical clinic. Your boss asks for a system that transcribes consultations and generates automatic summaries, but the budget for cloud APIs is zero. Or you're a privacy enthusiast who wants a voice assistant that never sends audio to external servers. For these cases, running Whisper and Llama 3 locally isn't just an option—it's the only viable one.
With Whisper and Llama 3, you can build a complete voice processing pipeline in Python. No paying per token. No sending data to third parties. And with latency acceptable for production.
The EchoKit, an open-source project shared on Hacker News, showed the way: a voice agent running on an ESP32 with a Rust server. But you don't need special hardware. A laptop with an 8GB GPU is enough.
In this tutorial, I'll guide you through building a pipeline that captures audio, transcribes it with Whisper large-v3, and generates responses with quantized Llama 3 8B. All local, free, and in Python.
The Complete Pipeline: From Audio to Response in Seconds
The flow is simple: microphone captures audio → Whisper transcribes → Llama 3 generates a response → text is synthesized into speech (optional). The magic is in the integration.
Whisper large-v3 achieves only 5% Word Error Rate (WER) in Portuguese, according to OpenAI's official benchmark (2025). This means transcription errors are rare. For a voice pipeline, it's more than enough.
Llama 3 8B, with 4-bit quantization, runs on GPUs with 8GB VRAM, as documented in the Hugging Face model card (2025). The 70B version requires more hardware, but for quick responses, the smaller model delivers quality.
Total latency ranges from 3 to 8 seconds, depending on audio length and prompt. Acceptable for a virtual assistant.
Table 1: Comparison of Whisper and Llama 3 Models for Voice Pipeline
| Model | Size | Required VRAM | WER (Portuguese) | Average Latency (5s audio) |
|---|---|---|---|---|
| Whisper tiny | 39 MB | 1 GB | 12% | 1.2s |
| Whisper base | 74 MB | 1.5 GB | 8% | 1.8s |
| Whisper small | 244 MB | 2 GB | 6% | 2.5s |
| Whisper large-v3 | 1.5 GB | 4 GB | 5% | 4.0s |
| Llama 3 8B (4-bit) | 4.5 GB | 6 GB | — | 2.5s (generation) |
| Llama 3 70B (4-bit) | 35 GB | 40 GB | — | 8s (generation) |
For a voice pipeline, the combination of Whisper large-v3 + Llama 3 8B offers the best cost-benefit. High quality without requiring a dedicated server.
Step-by-Step: Implementation in Python
I'll use Python 3.10+. The main libraries are whisper, transformers, and sounddevice. Install everything with pip.
pip install openai-whisper transformers torch sounddevice scipy
1. Capture and Transcription with Whisper
First, capture audio from the microphone. Use sounddevice to record for 5 seconds. Then, save it as a numpy array.
import sounddevice as sd
import numpy as np
import whisper
def record_audio(duration=5, rate=16000): print("Speak now...") audio = sd.rec(int(duration * rate), samplerate=rate, channels=1) sd.wait() return np.squeeze(audio)
Load Whisper model
whisper_model = whisper.load_model("large-v3")
Record and transcribe
audio = record_audio() result = whisper_model.transcribe(audio, language="pt") user_text = result["text"] print(f"You said: {user_text}")
The language="pt" parameter improves accuracy in Portuguese. Without it, Whisper might confuse it with Spanish.
2. Response Generation with Llama 3
Now, load the quantized Llama 3 8B. Use the transformers library with 4-bit loading to save VRAM.
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch
quant_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16 )
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct") llama_model = AutoModelForCausalLM.from_pretrained( "meta-llama/Meta-Llama-3-8B-Instruct", quantization_config=quant_config, device_map="auto" )
def generate_response(prompt): inputs = tokenizer(prompt, return_tensors="pt").to("cuda") outputs = llama_model.generate(**inputs, max_new_tokens=150, temperature=0.7) response = tokenizer.decode(outputs[0], skip_special_tokens=True) return response
system_prompt = "You are a transcription assistant. Respond politely and objectively." full_prompt = f"{system_prompt}\nUser: {user_text}\nAssistant:" response = generate_response(full_prompt) print(f"Assistant: {response}")
4-bit quantization reduces the model from 16GB to ~4.5GB. It runs on GPUs like the RTX 3060 or higher.
3. Conversation Loop and Voice Output
For a fluid pipeline, put everything in a loop. The user speaks, the bot responds. For voice output, use pyttsx3 (offline TTS).
import pyttsx3
engine = pyttsx3.init() engine.setProperty('rate', 180) # Speech speed
while True: audio = record_audio() user_text = whisper_model.transcribe(audio, language="pt")["text"] if "bye" in user_text.lower(): break full_prompt = f"{system_prompt}\nUser: {user_text}\nAssistant:" response = generate_response(full_prompt) print(f"Assistant: {response}") engine.say(response) engine.runAndWait()
Offline TTS doesn't require internet. For higher quality, replace it with Bark or Coqui TTS, but they consume more GPU.
Optimizations for Production and Latency
Running locally is cheap, but requires adjustments. Here are three practical tips.
Reduce the Whisper Model for Simple Cases
If the pipeline is restricted to short commands (e.g., "what's my balance?"), use Whisper small. It has 6% WER and loads in 2 GB of VRAM. Latency drops to 2.5s.
Community reports point to errors mainly with very heavy accents. For short, direct commands, the small model gets the job done.
Use Response Caching for Frequent Questions
Llama 3 can generate generic responses for common questions. Create a cache of question-answer pairs. If the transcription matches a known question, return the predefined answer.
This cuts generation latency to milliseconds. And reduces GPU usage.
Consider EchoKit as Hardware Inspiration
The EchoKit, an open-source project with a Rust server, showed that an ESP32-S3 can run the entire pipeline. But its server does the heavy lifting: ASR with Whisper and LLM with Llama.
If you want a dedicated device, adapt this tutorial to send audio to a Python server. The ESP32 captures and the server processes. Less latency than a Raspberry Pi.
A fully local, fully controllable voice agent is viable with modest hardware — that is what projects like EchoKit demonstrate in practice.
Is It Worth It? Yes, with Caveats
Running Whisper and Llama 3 locally eliminates API costs. For a small business or MVP, it's a robust solution. Latency of 5 to 8 seconds is acceptable for voice processing.
But there are limits. Llama 3 8B can hallucinate on complex responses. For critical calls, a larger model (70B) or fine-tuning is necessary. And then hardware needs to scale up.
The future? Smaller and more efficient models. Whisper large-v3 is already state-of-the-art. And Llama 3 8B, with quantization, has become accessible for consumer GPUs. In 2026, the trend is for local voice pipelines to become standard in applications requiring privacy and low cost.
Final Thoughts
This tutorial showed how to build a complete voice pipeline with Whisper and Llama 3, all in Python and running locally. You learned to capture audio, transcribe with high accuracy, generate intelligent responses, and even synthesize speech, without relying on external services.
The combination of Whisper large-v3 with Llama 3 8B offers an ideal balance between quality and hardware requirements. With optimizations like response caching and smaller models for simple tasks, you can further reduce latency and resource consumption.
Remember that data privacy is a crucial benefit of this approach. No audio or text leaves your computer, ensuring compliance with regulations like the LGPD.
Now it's your turn: implement this pipeline, adapt it to your needs, and explore the possibilities. The ecosystem of local models is growing rapidly, and 2026 promises to be the year voice intelligence becomes truly accessible to everyone.
Related Articles
Related Articles
Semantic Search with Python and Open-Source Models
Practical tutorial on embeddings for semantic search in Python using open-source models such as BGE-M3 and GTE-Qwen2. Runnable code and performance metrics.
Hyperparameter Optimization with Hyperopt in 2026: Practical Guide
2026 practical tutorial: learn to optimize machine learning model hyperparameters using Hyperopt, with Bayesian search and result visualization.
Microsoft Launches Phi-4 for Edge: AI Running Locally on Phones and IoT in 2026
Microsoft's Phi-4 has 14 billion parameters and runs on devices with only 4 GB of RAM. Understand how this model is changing AI inference on...