From Dataset to Ollama: Fine-Tuning LLMs with Unsloth on Your GPU in 2026
Did you know that the average company spent $1.2 million on AI API calls in 2025? That figure comes from a January 2026 a16z survey — and represents a 108% increase from the previous year. (Source: kargin-utkin.com)
That's why ML teams worldwide are migrating to local fine-tuning. Instead of paying per API call, they buy a GPU and fine-tune the model in-house. Fixed cost replaces variable cost.
The tool leading this change is called Unsloth. And it solves the main problem of local fine-tuning: the lack of VRAM.
In this tutorial, you will learn how to fine-tune LLMs like Llama 4, Qwen 3.6, and Gemma 4 on your own GPU — from downloading the base model to deploying it on Ollama. All with consumer hardware (8 to 24 GB of VRAM), using QLoRA and the Unsloth library. The result: training 2x faster and 70% less VRAM usage compared to the HuggingFace TRL baseline. (Source: dibi8.com)
What is Unsloth and why it became the standard
Unsloth started as an experimental library of optimized kernels for fine-tuning. In May 2026, it has 64,900+ stars on GitHub and 1.8 million monthly downloads on PyPI. (Source: github.com/unslothai/unsloth)
The secret lies in low-level engineering: the library rewrites attention and feed-forward kernels in manual CUDA, eliminating memory waste that frameworks like HuggingFace TRL treat as an "acceptable cost." The practical result: a 7B model that would normally need 24 GB of VRAM for fine-tuning runs comfortably in 10 GB with QLoRA.
"Unsloth accelerates training by 2x and reduces VRAM usage by up to 70% compared to the HuggingFace TRL baseline, enabling fine-tuning of 7B models on GPUs with only 8-10GB of VRAM via QLoRA." — Unsloth official documentation
In 2026, the library made a leap: it joined the PyTorch ecosystem, released version v0.1.41-beta with support for 500+ models (including Llama 4, Qwen 3.6, Gemma 4, DeepSeek, and gpt-oss), and introduced MoE training 12x faster. (Source: unsloth.ai)
Before you start: what you need in terms of hardware
The beauty of Unsloth is that it runs on GPUs you probably already have. Here's a practical table:
| Model | Minimum VRAM (QLoRA 4-bit) | Recommended GPU | Estimated Time (500 examples) |
|---|---|---|---|
| Llama 3.2 1B / Qwen 2.5 0.5B | 4 GB | GTX 1660 / RTX 3050 | ~3 min |
| Llama 3.2 3B / Qwen 2.5 3B | 6 GB | RTX 3060 12GB | ~5 min |
| Llama 3 8B / Qwen 2.5 7B | 10 GB | RTX 3080 / RTX 4060 Ti | ~12 min |
| Gemma 2 9B / Llama 3 13B | 14 GB | RTX 3090 24GB | ~20 min |
| Qwen 3 32B / DeepSeek Coder 33B | 20 GB | RTX 4090 24GB | ~35 min |
| Llama 3 70B / Qwen 3 72B | 24 GB+ | RTX 4090 + offloading | ~60 min |
The most used line today is the middle one: Llama 3 8B with 10 GB of VRAM. It's the sweet spot between model quality and hardware accessibility. A used RTX 3080 (10GB) costs around $700 in the US (May/2026) — less than two months of OpenAI API for a small team.
Software dependencies
Before anything else, you need:
- Python 3.10+ (I recommend 3.11)
- CUDA 12.1+ and cuDNN 8.9
- PyTorch 2.4+ with CUDA support
- Git LFS (to download large models)
If you are on Windows, use WSL2 with Ubuntu 22.04 — the experience is much more stable for fine-tuning.
Step 1: Setting up the environment
Open your terminal and create a virtual environment:
python -m venv unsloth-env
source unsloth-env/bin/activate
Install PyTorch with CUDA:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
And then Unsloth:
pip install unsloth
This command installs Unsloth and all dependencies — including the compiled CUDA kernels. The installation takes 2 to 5 minutes. If you prefer the latest version (which may include experimental kernels for MoE):
pip install git+https://github.com/unslothai/unsloth.git
Done. Environment set up.
Step 2: Choosing the base model
Unsloth supports 500+ models, but not all are worth the effort. My recommendations in May 2026:
- Llama 3.2 3B — lightweight, fast, ideal for prototyping. Meta's latest version.
- Qwen 2.5 7B — best cost-benefit in Portuguese. Alibaba invested heavily in PT data.
- Gemma 4 9B — Google DeepMind's newest model. Strong in logical reasoning.
- Llama 4 8B — native MoE support, but requires more VRAM.
For this tutorial, we'll use Llama 3.2 3B. It runs on 6 GB GPUs, fine-tuning finishes in minutes, and the quality is good enough to test the complete pipeline.
In code:
from unsloth import FastLanguageModel
import torch
model_name = "unsloth/Llama-3.2-3B-bnb-4bit"
model, tokenizer = FastLanguageModel.from_pretrained( model_name=model_name, max_seq_length=2048, dtype=None, load_in_4bit=True, )
The load_in_4bit=True parameter loads the model quantized to 4 bits — this is what reduces VRAM from ~16 GB to ~6 GB. Without it, you would need a server-grade GPU.
Step 3: Preparing the dataset
Fine-tuning without a good dataset is like cooking with expired ingredients. The result might come out, but no one will want to eat it.
The format expected by Unsloth is simple: a list of dictionaries with instruction, input, and output. Or, if you prefer, just conversations in ChatML format.
I'll use a concrete example: we will fine-tune the model to answer technical questions about Python in the NeuralPulse style — didactic, direct, and with examples.
dataset = [
{
"instruction": "Explain what a closure is in Python.",
"input": "",
"output": (
"A closure is a function that 'remembers' the scope where it was created, "
"even after that scope has ended. In practice: you define "
"a function inside another, and the inner one accesses variables from the outer one. "
"Example:\n\ndef multiplier(n):\n def inner(x):\n "
"return x * n\n return inner\n\ndouble = multiplier(2)\n"
"print(double(5)) # 10"
),
},
{
"instruction": "What is the difference between a list and a tuple?",
"input": "",
"output": (
"A list is mutable (you can add, remove, change items). "
"A tuple is immutable (once created, it doesn't change). Therefore, tuples are faster "
"and safer for data that shouldn't be altered. Use lists "
"for dynamic lists, tuples for constants and dictionary keys."
),
},
# Add 100-500 more examples here
]
With 100 to 500 well-written examples, you'll already see a difference. With 1,000+, the model's behavior changes significantly.
Now, load the data into the format expected by Unsloth:
from datasets import Dataset
Convert to Alpaca format
def format_alpaca(example): prompt = ( "Below is an instruction that describes a task. " "Write a response that completes the request.\n\n" f"### Instruction:\n{example['instruction']}\n\n" f"### Response:\n{example['output']}" ) return {"text": prompt}
dataset_hf = Dataset.from_list(dataset) dataset_hf = dataset_hf.map(format_alpaca)
Important tip: if you have data in JSON or CSV, HuggingFace datasets accepts load_dataset("json", data_files="path.json") directly. Unsloth doesn't enforce a rigid format — it just needs a text field with the complete prompt.
Step 4: Configuring QLoRA
This is where the magic happens. QLoRA combines two tricks:
- LoRA (Low-Rank Adaptation): instead of training all the network's parameters (which would be unfeasible on your GPU), it only trains small matrices attached to the attention layers. It's like putting "stickers" on the original model.
- 4-bit NormalFloat: quantizes the original model's weights to 4 bits, drastically reducing VRAM consumption.
Unsloth simplifies the configuration:
model = FastLanguageModel.get_peft_model(
model,
r=16, # LoRA rank — the higher, the more adaptation capacity
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
max_seq_length=2048,
use_rslora=False,
)
Parameters to pay attention to:
| Parameter | Recommended Value | Effect |
|---|---|---|
r (rank) | 8-32 | Rank 16 is the gold standard. More than 32 overfits. |
lora_alpha | 16-32 | Scale of adaptations. Keep it equal to the rank. |
lora_dropout | 0 | Dropout hinders fine-tuning with small datasets. Only use if you have 10k+ examples. |
max_seq_length | 2048-4096 | Maximum context. 2048 is sufficient for 90% of cases. |
Step 5: Running the training
With the model configured and the dataset ready, it's time to train:
from trl import SFTTrainer
from transformers import TrainingArguments
trainer = SFTTrainer( model=model, tokenizer=tokenizer, train_dataset=dataset_hf, dataset_text_field="text", max_seq_length=2048, dataset_num_proc=2, args=TrainingArguments( per_device_train_batch_size=2, gradient_accumulation_steps=4, warmup_steps=5, num_train_epochs=3, learning_rate=2e-4, fp16=not torch.cuda.is_bf16_supported(), bf16=torch.cuda.is_bf16_supported(), logging_steps=1, optim="adamw_8bit", weight_decay=0.01, lr_scheduler_type="linear", seed=42, output_dir="outputs", report_to="none", ), )
trainer.train()
With learning rate 2e-4, LoRA rank 16, and 500 examples, training on Llama 3.2 3B takes between 5 and 10 minutes on an RTX 4090. On an RTX 3060 (12 GB), about 15 minutes. (Source: vucense.com)
You will see the loss decreasing at each step. A final loss between 0.4 and 0.8 usually indicates good fine-tuning. Below 0.3, it could be a sign of overfitting — in that case, it's worth reducing the number of epochs or increasing the dataset.
Step 6: Exporting to GGUF
With the model trained, you want to use it outside of Python. The GGUF format (created by llama.cpp) is the standard for efficient local inference — and Unsloth exports directly to it:
model.save_pretrained_gguf(
"fine-tuned-model-gguf",
tokenizer,
quantization_method="q4_k_m",
)
This command saves the model in GGUF format with Q4_K_M quantization (a good balance between quality and size). The generated file is about 2-3 GB for a 3B model, compared to 6 GB for the original format.
If you want to test before exporting, Unsloth itself allows direct inference:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
["""Below is an instruction that describes a task.
Write a response that completes the request.
Instruction:
Explain what a closure is in Python.
Response:"""],
return_tensors="pt",
).to("cuda")
outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7) print(tokenizer.decode(outputs[0]))
Step 7: Deploying on Ollama
Ollama is the simplest tool for running LLMs locally. With the model in GGUF, deployment is a matter of minutes.
First, create a Modelfile:
FROM ./fine-tuned-model-gguf/consolidated-01.gguf
TEMPLATE """{{ .Prompt }}"""
PARAMETER temperature 0.7 PARAMETER top_p 0.9 PARAMETER num_ctx 2048
Then, create the model in Ollama and test it:
ollama create neuralpulse-python-tutor --f Modelfile
ollama run neuralpulse-python-tutor
Done. You typed ollama run and are now chatting with your fine-tuned model. No internet, no API key, no cost per call.
Cost: local vs cloud — when is it worth it?
The question everyone asks: "Is it financially worthwhile?"
Presenc.ai's research (2026) mapped the break-even point. (Source: presenc.ai)
| Scenario | Initial Investment | Monthly Cost | Calls/month | Cost per 1M tokens |
|---|---|---|---|---|
| Local (RTX 3090 + CPU 16GB RAM) | ~$1,200 (one-time) | ~$83 (power + maintenance) | Unlimited | ~$0.05 |
| OpenAI GPT-4o mini (API) | $0 | Variable | 10,000 | $0.15 |
| OpenAI GPT-4o (API) | $0 | Variable | 10,000 | $2.50 |
| Anthropic Claude 3.5 Sonnet (API) | $0 | Variable | 10,000 | $3.00 |
| Groq Llama 3 70B (API) | $0 | Variable | 10,000 | $0.59 |
The break-even point for a 7B model with moderate utilization (30%) is between 4 and 9 months. Above 10,000 calls per month, the local cost is drastically lower.
For teams doing iterative fine-tuning — test, adjust, repeat — the math closes even faster. Each experiment you would run in the cloud costs something. On your GPU, it only costs electricity.
Conclusion
Local fine-tuning of LLMs is no longer a lab thing with 8 A100 GPUs. With Unsloth + QLoRA, you take a 7-billion-parameter model, adapt it with your data, and put it into production — all on a consumer GPU, in under 15 minutes.
The ecosystem is mature: Unsloth solved the VRAM and speed bottleneck, GGUF standardized distribution, and Ollama simplified deployment. All that's left is for you to bring the data.
Related Articles
Function Calling in Practice: Python Tutorial for Chatbots with LLMs that Execute Actions in 2026
Learn how to implement function calling in Python with OpenAI, Anthropic Claude, and Google Gemini. Complete tutorial with code to integrate APIs, databases...
Fine-Tuning LLMs in 2026: LoRA vs QLoRA — Which Technique Delivers More for Less (with Code)
Practical and comparative guide to fine-tuning with LoRA and QLoRA for LLMs in 2026, with cost and performance benchmarks on consumer-grade GPUs. Includes Python code...
From Zero to Model: Build Your First Classifier with Python and scikit-learn (2026 Tutorial)
Hands-on tutorial: create your first ML classifier with Python and scikit-learn 1.8, using a real Porto Seguro dataset. Code, deployment, and production tips.