Computer motherboard with a graphics card (GPU) installed, representing hardware for LLM fine-tuning
tutorials

From Dataset to Ollama: Fine-Tuning LLMs with Unsloth on Your GPU in 2026

NeuralPulse|26 de maio de 2026|12 min read|Ler em Português

Did you know that the average company spent $1.2 million on AI API calls in 2025? That figure comes from a January 2026 a16z survey — and represents a 108% increase from the previous year. (Source: kargin-utkin.com)

That's why ML teams worldwide are migrating to local fine-tuning. Instead of paying per API call, they buy a GPU and fine-tune the model in-house. Fixed cost replaces variable cost.

The tool leading this change is called Unsloth. And it solves the main problem of local fine-tuning: the lack of VRAM.

In this tutorial, you will learn how to fine-tune LLMs like Llama 4, Qwen 3.6, and Gemma 4 on your own GPU — from downloading the base model to deploying it on Ollama. All with consumer hardware (8 to 24 GB of VRAM), using QLoRA and the Unsloth library. The result: training 2x faster and 70% less VRAM usage compared to the HuggingFace TRL baseline. (Source: dibi8.com)


What is Unsloth and why it became the standard

Unsloth started as an experimental library of optimized kernels for fine-tuning. In May 2026, it has 64,900+ stars on GitHub and 1.8 million monthly downloads on PyPI. (Source: github.com/unslothai/unsloth)

The secret lies in low-level engineering: the library rewrites attention and feed-forward kernels in manual CUDA, eliminating memory waste that frameworks like HuggingFace TRL treat as an "acceptable cost." The practical result: a 7B model that would normally need 24 GB of VRAM for fine-tuning runs comfortably in 10 GB with QLoRA.

"Unsloth accelerates training by 2x and reduces VRAM usage by up to 70% compared to the HuggingFace TRL baseline, enabling fine-tuning of 7B models on GPUs with only 8-10GB of VRAM via QLoRA." — Unsloth official documentation

In 2026, the library made a leap: it joined the PyTorch ecosystem, released version v0.1.41-beta with support for 500+ models (including Llama 4, Qwen 3.6, Gemma 4, DeepSeek, and gpt-oss), and introduced MoE training 12x faster. (Source: unsloth.ai)


Before you start: what you need in terms of hardware

The beauty of Unsloth is that it runs on GPUs you probably already have. Here's a practical table:

ModelMinimum VRAM (QLoRA 4-bit)Recommended GPUEstimated Time (500 examples)
Llama 3.2 1B / Qwen 2.5 0.5B4 GBGTX 1660 / RTX 3050~3 min
Llama 3.2 3B / Qwen 2.5 3B6 GBRTX 3060 12GB~5 min
Llama 3 8B / Qwen 2.5 7B10 GBRTX 3080 / RTX 4060 Ti~12 min
Gemma 2 9B / Llama 3 13B14 GBRTX 3090 24GB~20 min
Qwen 3 32B / DeepSeek Coder 33B20 GBRTX 4090 24GB~35 min
Llama 3 70B / Qwen 3 72B24 GB+RTX 4090 + offloading~60 min

The most used line today is the middle one: Llama 3 8B with 10 GB of VRAM. It's the sweet spot between model quality and hardware accessibility. A used RTX 3080 (10GB) costs around $700 in the US (May/2026) — less than two months of OpenAI API for a small team.

Software dependencies

Before anything else, you need:

  • Python 3.10+ (I recommend 3.11)
  • CUDA 12.1+ and cuDNN 8.9
  • PyTorch 2.4+ with CUDA support
  • Git LFS (to download large models)

If you are on Windows, use WSL2 with Ubuntu 22.04 — the experience is much more stable for fine-tuning.


Step 1: Setting up the environment

Open your terminal and create a virtual environment:

python -m venv unsloth-env
source unsloth-env/bin/activate

Install PyTorch with CUDA:

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

And then Unsloth:

pip install unsloth

This command installs Unsloth and all dependencies — including the compiled CUDA kernels. The installation takes 2 to 5 minutes. If you prefer the latest version (which may include experimental kernels for MoE):

pip install git+https://github.com/unslothai/unsloth.git

Done. Environment set up.


Step 2: Choosing the base model

Unsloth supports 500+ models, but not all are worth the effort. My recommendations in May 2026:

  • Llama 3.2 3B — lightweight, fast, ideal for prototyping. Meta's latest version.
  • Qwen 2.5 7B — best cost-benefit in Portuguese. Alibaba invested heavily in PT data.
  • Gemma 4 9B — Google DeepMind's newest model. Strong in logical reasoning.
  • Llama 4 8B — native MoE support, but requires more VRAM.

For this tutorial, we'll use Llama 3.2 3B. It runs on 6 GB GPUs, fine-tuning finishes in minutes, and the quality is good enough to test the complete pipeline.

In code:

from unsloth import FastLanguageModel
import torch

model_name = "unsloth/Llama-3.2-3B-bnb-4bit"

model, tokenizer = FastLanguageModel.from_pretrained( model_name=model_name, max_seq_length=2048, dtype=None, load_in_4bit=True, )

The load_in_4bit=True parameter loads the model quantized to 4 bits — this is what reduces VRAM from ~16 GB to ~6 GB. Without it, you would need a server-grade GPU.


Step 3: Preparing the dataset

Fine-tuning without a good dataset is like cooking with expired ingredients. The result might come out, but no one will want to eat it.

The format expected by Unsloth is simple: a list of dictionaries with instruction, input, and output. Or, if you prefer, just conversations in ChatML format.

I'll use a concrete example: we will fine-tune the model to answer technical questions about Python in the NeuralPulse style — didactic, direct, and with examples.

dataset = [
    {
        "instruction": "Explain what a closure is in Python.",
        "input": "",
        "output": (
            "A closure is a function that 'remembers' the scope where it was created, "
            "even after that scope has ended. In practice: you define "
            "a function inside another, and the inner one accesses variables from the outer one. "
            "Example:\n\ndef multiplier(n):\n    def inner(x):\n        "
            "return x * n\n    return inner\n\ndouble = multiplier(2)\n"
            "print(double(5))  # 10"
        ),
    },
    {
        "instruction": "What is the difference between a list and a tuple?",
        "input": "",
        "output": (
            "A list is mutable (you can add, remove, change items). "
            "A tuple is immutable (once created, it doesn't change). Therefore, tuples are faster "
            "and safer for data that shouldn't be altered. Use lists "
            "for dynamic lists, tuples for constants and dictionary keys."
        ),
    },
    # Add 100-500 more examples here
]

With 100 to 500 well-written examples, you'll already see a difference. With 1,000+, the model's behavior changes significantly.

Now, load the data into the format expected by Unsloth:

from datasets import Dataset

Convert to Alpaca format

def format_alpaca(example): prompt = ( "Below is an instruction that describes a task. " "Write a response that completes the request.\n\n" f"### Instruction:\n{example['instruction']}\n\n" f"### Response:\n{example['output']}" ) return {"text": prompt}

dataset_hf = Dataset.from_list(dataset) dataset_hf = dataset_hf.map(format_alpaca)

Important tip: if you have data in JSON or CSV, HuggingFace datasets accepts load_dataset("json", data_files="path.json") directly. Unsloth doesn't enforce a rigid format — it just needs a text field with the complete prompt.


Step 4: Configuring QLoRA

This is where the magic happens. QLoRA combines two tricks:

  1. LoRA (Low-Rank Adaptation): instead of training all the network's parameters (which would be unfeasible on your GPU), it only trains small matrices attached to the attention layers. It's like putting "stickers" on the original model.
  2. 4-bit NormalFloat: quantizes the original model's weights to 4 bits, drastically reducing VRAM consumption.

Unsloth simplifies the configuration:

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank — the higher, the more adaptation capacity
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
    max_seq_length=2048,
    use_rslora=False,
)

Parameters to pay attention to:

ParameterRecommended ValueEffect
r (rank)8-32Rank 16 is the gold standard. More than 32 overfits.
lora_alpha16-32Scale of adaptations. Keep it equal to the rank.
lora_dropout0Dropout hinders fine-tuning with small datasets. Only use if you have 10k+ examples.
max_seq_length2048-4096Maximum context. 2048 is sufficient for 90% of cases.

Step 5: Running the training

With the model configured and the dataset ready, it's time to train:

from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer( model=model, tokenizer=tokenizer, train_dataset=dataset_hf, dataset_text_field="text", max_seq_length=2048, dataset_num_proc=2, args=TrainingArguments( per_device_train_batch_size=2, gradient_accumulation_steps=4, warmup_steps=5, num_train_epochs=3, learning_rate=2e-4, fp16=not torch.cuda.is_bf16_supported(), bf16=torch.cuda.is_bf16_supported(), logging_steps=1, optim="adamw_8bit", weight_decay=0.01, lr_scheduler_type="linear", seed=42, output_dir="outputs", report_to="none", ), )

trainer.train()

With learning rate 2e-4, LoRA rank 16, and 500 examples, training on Llama 3.2 3B takes between 5 and 10 minutes on an RTX 4090. On an RTX 3060 (12 GB), about 15 minutes. (Source: vucense.com)

You will see the loss decreasing at each step. A final loss between 0.4 and 0.8 usually indicates good fine-tuning. Below 0.3, it could be a sign of overfitting — in that case, it's worth reducing the number of epochs or increasing the dataset.


Step 6: Exporting to GGUF

With the model trained, you want to use it outside of Python. The GGUF format (created by llama.cpp) is the standard for efficient local inference — and Unsloth exports directly to it:

model.save_pretrained_gguf(
    "fine-tuned-model-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

This command saves the model in GGUF format with Q4_K_M quantization (a good balance between quality and size). The generated file is about 2-3 GB for a 3B model, compared to 6 GB for the original format.

If you want to test before exporting, Unsloth itself allows direct inference:

FastLanguageModel.for_inference(model)
inputs = tokenizer(
    ["""Below is an instruction that describes a task.
Write a response that completes the request.

Instruction:

Explain what a closure is in Python.

Response:"""],

return_tensors="pt",

).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=256, temperature=0.7) print(tokenizer.decode(outputs[0]))


Step 7: Deploying on Ollama

Ollama is the simplest tool for running LLMs locally. With the model in GGUF, deployment is a matter of minutes.

First, create a Modelfile:

FROM ./fine-tuned-model-gguf/consolidated-01.gguf

TEMPLATE """{{ .Prompt }}"""

PARAMETER temperature 0.7 PARAMETER top_p 0.9 PARAMETER num_ctx 2048

Then, create the model in Ollama and test it:

ollama create neuralpulse-python-tutor --f Modelfile
ollama run neuralpulse-python-tutor

Done. You typed ollama run and are now chatting with your fine-tuned model. No internet, no API key, no cost per call.


Cost: local vs cloud — when is it worth it?

The question everyone asks: "Is it financially worthwhile?"

Presenc.ai's research (2026) mapped the break-even point. (Source: presenc.ai)

ScenarioInitial InvestmentMonthly CostCalls/monthCost per 1M tokens
Local (RTX 3090 + CPU 16GB RAM)~$1,200 (one-time)~$83 (power + maintenance)Unlimited~$0.05
OpenAI GPT-4o mini (API)$0Variable10,000$0.15
OpenAI GPT-4o (API)$0Variable10,000$2.50
Anthropic Claude 3.5 Sonnet (API)$0Variable10,000$3.00
Groq Llama 3 70B (API)$0Variable10,000$0.59

The break-even point for a 7B model with moderate utilization (30%) is between 4 and 9 months. Above 10,000 calls per month, the local cost is drastically lower.

For teams doing iterative fine-tuning — test, adjust, repeat — the math closes even faster. Each experiment you would run in the cloud costs something. On your GPU, it only costs electricity.


Conclusion

Local fine-tuning of LLMs is no longer a lab thing with 8 A100 GPUs. With Unsloth + QLoRA, you take a 7-billion-parameter model, adapt it with your data, and put it into production — all on a consumer GPU, in under 15 minutes.

The ecosystem is mature: Unsloth solved the VRAM and speed bottleneck, GGUF standardized distribution, and Ollama simplified deployment. All that's left is for you to bring the data.

#fine-tuning#unsloth#llm#qlora#ollama#tutorial
Compartilhar: