ML Deployment in Production: Docker, Kubernetes and the Real Cost of Scaling in 2026 (Step-by-Step Tutorial)

NeuralPulse | June 6, 2026 | 10 min read

78% of machine learning models never reach production. This data point from Gartner (2025) reveals a chasm between the data lab and the real world. The bottleneck is no longer model quality, but the infrastructure to serve predictions at scale.

This practical tutorial shows how to deploy an ML model using Docker and Kubernetes. We'll go from container to cluster, covering monitoring and a real cost breakdown for 2026.

Why Docker and Kubernetes Dominate Model Serving in 2026

Containerizing an ML model solves the classic "it works on my machine" problem. With Docker, you package the model, dependencies, and libraries into an immutable image. Kubernetes, in turn, orchestrates these containers in production.

The average deployment cost on Kubernetes is US$0.10 per hour per pod on AWS EKS (2026 data). It seems cheap, but it scales fast. A cluster with 10 pods running 24/7 costs US$720 per month. Without optimization, the budget blows up.
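The US$720 figure follows from simple arithmetic. The sketch below is a minimal illustration that assumes the flat US$0.10 per pod-hour rate quoted above and a 30-day month; the function name `monthly_pod_cost` is hypothetical.

```python
def monthly_pod_cost(pods: int, rate_per_pod_hour: float = 0.10,
                     hours_per_day: int = 24, days: int = 30) -> float:
    """Return the monthly cost in US$ for `pods` pods running continuously."""
    # Rounded to cents to avoid floating-point noise.
    return round(pods * rate_per_pod_hour * hours_per_day * days, 2)


print(monthly_pod_cost(10))  # 720.0
print(monthly_pod_cost(3))   # 216.0
```

Changing `pods` or `hours_per_day` shows how quickly the bill moves once replicas are added or left running overnight.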

The choice between AWS, Google Cloud, or Azure depends on your ecosystem. But the deployment pattern is the same: Dockerfile + Kubernetes manifest.

Step 1: Containerize the model with Docker

We'll use a regression model trained with scikit-learn. The goal is to expose a REST API with Flask.

app.py file:

from flask import Flask, request, jsonify
import pickle
import numpy as np

app = Flask(__name__)
# Load the trained model once, when the process starts
model = pickle.load(open('model.pkl', 'rb'))

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    # Reshape the flat feature list into a single-row 2D array
    features = np.array(data['features']).reshape(1, -1)
    prediction = model.predict(features)
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Dockerfile:

FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]

Build the image:

docker build -t ml-model:v1 .

Test locally:

docker run -p 5000:5000 ml-model:v1

If the API responds, the container is ready. Now, push it to a registry like Docker Hub or ECR.
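One way to check that the API responds is to send a prediction request from Python. The sketch below uses only the standard library. The URL assumes the `docker run` port mapping above, the helper names are hypothetical, and the feature values in the usage comment are placeholders, since the number of features depends on how the model was trained.

```python
import json
import urllib.request


def build_predict_request(features, url="http://localhost:5000/predict"):
    """Build a POST request carrying the JSON body that app.py expects."""
    body = json.dumps({"features": features}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, url="http://localhost:5000/predict"):
    """Send the request and return the decoded JSON response."""
    req = build_predict_request(features, url)
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Usage (requires the container to be running):
# print(predict([1.0, 2.0, 3.0]))
```

A JSON response with a `prediction` key means the container is serving the model correctly.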

"Containerizing the model is the first step to killing 'it works on my machine'. Without it, deploying to production is a gamble." — Priscila Lima, ML Engineer at Nubank, in an interview with NeuralPulse (2026).

Step 2: Deploy on Kubernetes with scalability

Create a deployment and a service in Kubernetes. The YAML file defines how many pods run, the port, and how to expose the service.

deployment.yaml file:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: ml-model
        image: youruser/ml-model:v1
        ports:
        - containerPort: 5000
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5000
  type: LoadBalancer

Apply it to the cluster:

kubectl apply -f deployment.yaml

Kubernetes creates 3 pods. If one fails, the deployment recreates it. The service distributes traffic among them.

To scale manually:

kubectl scale deployment ml-model-deployment --replicas=5

A better approach is the Horizontal Pod Autoscaler (HPA), which adjusts the number of pods based on CPU or memory usage.

hpa.yaml file:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

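With HPA, the deployment scales automatically between 2 and 10 pods. You only pay for what you use, provided the underlying node capacity also shrinks when pods are removed (for example, with a cluster autoscaler).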
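The scaling decision itself is a short formula. The Kubernetes documentation describes it as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below applies that formula to the 70% CPU target and the 2 to 10 replica bounds from hpa.yaml. It is a simplification: the real controller also handles pod readiness, missing metrics, and stabilization windows, and the function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_utilization: float,
                     target_utilization: float = 70,
                     min_replicas: int = 2, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single CPU metric."""
    ratio = current_utilization / target_utilization
    # Inside the tolerance band the controller leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds set by minReplicas and maxReplicas.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(3, 90))   # 4  (scale up)
print(desired_replicas(3, 72))   # 3  (within tolerance, no change)
print(desired_replicas(4, 20))   # 2  (scale down, floor at minReplicas)
print(desired_replicas(8, 140))  # 10 (capped at maxReplicas)
```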

Step 3: Monitoring and MLflow for version tracking

Monitoring the model in production is as important as the deployment itself. Use MLflow to version models and metrics. Integrate with Prometheus and Grafana for real-time metrics.

Install the MLflow Tracking Server:

pip install mlflow
mlflow server --host 0.0.0.0 --port 5001

In the training code, log each experiment:

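import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://localhost:5001")
with mlflow.start_run():
    mlflow.log_param("model_type", "regression")
    mlflow.log_metric("rmse", 0.23)  # illustrative value; log the RMSE you measured
    # `model` is the trained scikit-learn estimator from your training script
    mlflow.sklearn.log_model(model, "model")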

MLflow stores the model and metrics. For deployment, use the registered model instead of a local file.
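Loading the registered model could look like the sketch below. It assumes the model was registered under the hypothetical name `ml-model` (for instance by passing `registered_model_name="ml-model"` to `mlflow.sklearn.log_model`) and that the tracking server from above is reachable. The helper names are illustrative, not part of MLflow.

```python
def registry_model_uri(name: str, version: int) -> str:
    """Build an MLflow Model Registry URI of the form models:/<name>/<version>."""
    return f"models:/{name}/{version}"


def load_registered_model(name: str, version: int):
    """Load a registered model as a generic MLflow pyfunc model."""
    # Imported lazily so the URI helper works even without MLflow installed.
    import mlflow.pyfunc
    return mlflow.pyfunc.load_model(registry_model_uri(name, version))


# In app.py this could replace the pickle.load call, for example:
# import mlflow
# mlflow.set_tracking_uri("http://localhost:5001")
# model = load_registered_model("ml-model", 1)
```

Pinning an explicit version keeps each container image tied to a known model, which makes rollbacks straightforward.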

For monitoring, configure Prometheus to collect metrics from the pods. Create a dashboard in Grafana with:

Average request latency
Error rate
CPU and memory usage per pod
Number of predictions per second

If sustained load pushes CPU utilization past the 70% target, HPA adds pods. If errors increase, an alert should fire. Without monitoring, you're flying blind.
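To make the dashboard metrics concrete, the sketch below keeps three counters in memory and renders them in the Prometheus text exposition format that a scrape returns. It is an illustration only: in a real service the `prometheus_client` library would normally handle this, and the class and metric names here are hypothetical.

```python
class PredictionMetrics:
    """In-memory counters rendered in the Prometheus text exposition format."""

    def __init__(self):
        self.requests_total = 0
        self.errors_total = 0
        self.latency_seconds_sum = 0.0

    def observe(self, latency_seconds: float, ok: bool = True) -> None:
        """Record one prediction request and whether it succeeded."""
        self.requests_total += 1
        self.latency_seconds_sum += latency_seconds
        if not ok:
            self.errors_total += 1

    def render(self) -> str:
        """Return the body a /metrics endpoint would serve."""
        lines = [
            "# TYPE predict_requests_total counter",
            f"predict_requests_total {self.requests_total}",
            "# TYPE predict_errors_total counter",
            f"predict_errors_total {self.errors_total}",
            "# TYPE predict_latency_seconds_sum counter",
            f"predict_latency_seconds_sum {self.latency_seconds_sum}",
        ]
        return "\n".join(lines) + "\n"


metrics = PredictionMetrics()
metrics.observe(0.12)
metrics.observe(0.30, ok=False)
print(metrics.render())
```

From these counters, a dashboard can derive average latency (latency sum divided by request count), error rate (errors divided by requests), and predictions per second (the rate of change of the request counter).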

Cost Comparison: Docker standalone vs Kubernetes in 2026

The table below compares costs for serving a model with 100 requests per second (RPS) for 30 days.

Item	Docker (Single VM)	Kubernetes (3 pods)	Kubernetes (HPA, 2-10 pods)
VM/Cluster Cost	US$ 150/month (t3.medium)	US$ 72/month (EKS control plane)	US$ 72/month (EKS control plane)
Compute Cost	Included	US$ 216/month (3 pods)	US$ 144-432/month (2 pods minimum to 6 pods on average)
Storage Cost	US$ 10/month	US$ 30/month (registry)	US$ 30/month (registry)
Total Cost	US$ 160/month	US$ 318/month	US$ 246-534/month
Scalability	Manual	Manual	Automatic
Failure Recovery Time	10-30 min	< 1 min	< 1 min

Source: AWS Pricing Calculator, June 2026.
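The totals in the table can be reproduced with a few lines of Python. The sketch below uses only the monthly figures from the table; the constant and function names are hypothetical, and the per-pod cost assumes the same US$0.10 per pod-hour rate over a 30-day month.

```python
# Monthly figures in US$, taken from the table above.
EKS_CONTROL_PLANE = 72
REGISTRY_STORAGE = 30
COST_PER_POD = 72      # 216 / 3 pods, i.e. US$0.10 x 24 h x 30 days
DOCKER_VM = 150
DOCKER_STORAGE = 10


def docker_total() -> int:
    """Single-VM setup: the VM plus its storage."""
    return DOCKER_VM + DOCKER_STORAGE


def kubernetes_total(pods: int) -> int:
    """Control plane + per-pod compute + registry storage."""
    return EKS_CONTROL_PLANE + pods * COST_PER_POD + REGISTRY_STORAGE


print(docker_total())       # 160
print(kubernetes_total(3))  # 318
print(kubernetes_total(2))  # 246
print(kubernetes_total(6))  # 534
```

At the same rates, `kubernetes_total(10)` gives US$822 for a month spent entirely at the 10-pod ceiling, a case the table does not list.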

Docker standalone is cheaper for stable loads. But Kubernetes shines when traffic varies. With HPA, you pay for the average, not the peak.

When to use each?

Docker standalone: prototypes, models with predictable traffic, small teams.
Kubernetes: production with variable scale, multiple models, mature ML teams.

The wrong choice can be costly. A poorly configured Kubernetes cluster wastes resources. A Docker setup without orchestration breaks under request spikes.

Best Practices for ML Deployment in Production

Use model versioning: MLflow or DVC. Never replace a model without tracking the previous version.
Test the container locally: before pushing to the cluster, run docker run and make test requests.
Define resource limits: pods without CPU/memory limits can crash the cluster.
Implement health checks: Kubernetes uses probes to know if the pod is alive and ready.
Collect business metrics: besides latency and errors, monitor model accuracy in production. Real data degrades models.
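For the health-check practice, the service needs endpoints that Kubernetes probes can call. The sketch below shows framework-free handler logic that returns a body and an HTTP status. The `/healthz` and `/readyz` paths and the function names are assumptions; the deployment would also need `livenessProbe` and `readinessProbe` entries pointing at whichever paths are chosen.

```python
def liveness():
    """Liveness: the process is up and able to answer."""
    return {"status": "alive"}, 200


def readiness(model):
    """Readiness: accept traffic only once a model with predict() is loaded."""
    if model is None or not hasattr(model, "predict"):
        return {"status": "loading"}, 503
    return {"status": "ready"}, 200


# Hypothetical wiring into the Flask app from Step 1:
# @app.route('/healthz')
# def healthz():
#     return liveness()
#
# @app.route('/readyz')
# def readyz():
#     return readiness(model)
```

Separating the two checks lets Kubernetes restart a hung container while simply withholding traffic from one that is still loading its model.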

ML deployment doesn't end when the model goes up. It begins. With each new version, repeat the cycle. Containerize, deploy, monitor, improve.

The cost of not doing this is high. 78% of models never reach production (Gartner, 2025). The 22% that do often die from lack of maintenance.

With Docker and Kubernetes, you reduce this risk. But only if you follow the steps with discipline. The tutorial ends here. Production starts now.
