ML Deployment in Production: Docker, Kubernetes and the Real Cost of Scaling in 2026 (Step-by-Step Tutorial)
78% of machine learning models never reach production. This data point from Gartner (2025) reveals a chasm between the data lab and the real world. The bottleneck is no longer model quality, but the infrastructure to serve predictions at scale.
This practical tutorial shows how to deploy an ML model using Docker and Kubernetes. We'll go from container to cluster, covering monitoring and a real cost breakdown for 2026.
Why Docker and Kubernetes Dominate Model Serving in 2026
Containerizing an ML model solves the classic "it works on my machine" problem. With Docker, you package the model, dependencies, and libraries into an immutable image. Kubernetes, in turn, orchestrates these containers in production.
The average deployment cost on Kubernetes is US$0.10 per hour per pod on AWS EKS (2026 data). It seems cheap, but it scales fast. A cluster with 10 pods running 24/7 costs US$720 per month. Without optimization, the budget blows up.
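The arithmetic behind that monthly figure can be sketched in a few lines. The US$0.10 hourly rate is the article's number; the 30-day month and the function name `monthly_pod_cost` are assumptions made for illustration.

```python
# Rough monthly cost of always-on pods, using the article's flat rate.
# Assumptions: a 30-day month and the same hourly rate for every pod.
RATE_CENTS_PER_POD_HOUR = 10   # US$0.10, kept in cents to avoid float drift
HOURS_PER_MONTH = 24 * 30


def monthly_pod_cost(pods: int) -> float:
    """Return the monthly cost in US dollars for `pods` pods running 24/7."""
    return pods * RATE_CENTS_PER_POD_HOUR * HOURS_PER_MONTH / 100


for pods in (1, 3, 10):
    print(f"{pods:>2} pods -> US${monthly_pod_cost(pods):.2f}/month")
```

Each additional always-on pod adds US$72 per month at this rate, which is why replica counts deserve attention early.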
The choice between AWS, Google Cloud, or Azure depends on your ecosystem. But the deployment pattern is the same: Dockerfile + Kubernetes manifest.
Step 1: Containerize the model with Docker
We'll use a regression model trained with scikit-learn. The goal is to expose a REST API with Flask.
app.py file:
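from flask import Flask, request, jsonify
import pickle
import numpy as np

app = Flask(__name__)

# Load the trained scikit-learn model once, at startup
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body like {"features": [...]} describing a single sample
    data = request.get_json()
    features = np.array(data['features']).reshape(1, -1)
    prediction = model.predict(features)
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    # Bind to all interfaces so the port is reachable from outside the container
    app.run(host='0.0.0.0', port=5000)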
Dockerfile:
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 5000
CMD ["python", "app.py"]
Build the image:
docker build -t ml-model:v1 .
Test locally:
docker run -p 5000:5000 ml-model:v1
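With the container running, a small client script is one way to check that the endpoint answers. This is a minimal sketch using only the standard library. The URL assumes the `-p 5000:5000` port mapping above, and the three-value feature vector in the commented example is a placeholder that must match the number of features your model was trained on.

```python
import json
import urllib.request

# Assumed URL: matches the `docker run -p 5000:5000` mapping.
PREDICT_URL = "http://localhost:5000/predict"


def build_predict_request(features, url=PREDICT_URL):
    """Build (but do not send) a POST request carrying one sample as JSON."""
    body = json.dumps({"features": list(features)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(features, url=PREDICT_URL, timeout=5.0):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(features, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example (requires the container to be running):
# print(predict([1.0, 2.0, 3.0]))
```

The `Content-Type` header matters here because Flask's `request.get_json()` only parses bodies sent as JSON.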
If the API responds, the container is ready. Now tag the image as youruser/ml-model:v1 (the name the Kubernetes manifest in Step 2 expects) and push it to a registry like Docker Hub or ECR.
"Containerizing the model is the first step to killing 'it works on my machine'. Without it, deploying to production is a gamble." — Priscila Lima, ML Engineer at Nubank, in an interview with NeuralPulse (2026).
Step 2: Deploy on Kubernetes with scalability
Create a deployment and a service in Kubernetes. The YAML file defines how many pods run, the port, and how to expose the service.
deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
      - name: ml-model
        image: youruser/ml-model:v1
        ports:
        - containerPort: 5000
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5000
  type: LoadBalancer
Apply it to the cluster:
kubectl apply -f deployment.yaml
Kubernetes creates 3 pods. If one fails, the deployment recreates it. The service distributes traffic among them.
To scale manually:
kubectl scale deployment ml-model-deployment --replicas=5
Manual scaling does not react to traffic, though. A better option is the Horizontal Pod Autoscaler (HPA), which adjusts the number of pods based on CPU or memory usage.
hpa.yaml file:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
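With HPA, the deployment scales automatically between 2 and 10 pods, aiming to keep average CPU utilization near 70% of each pod's CPU request. You only pay for what you use, provided the node capacity behind the pods scales as well (for example, through a cluster autoscaler).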
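The Kubernetes documentation describes the core HPA rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies that rule with the bounds and target from hpa.yaml. It is a simplification: the function name is made up for illustration, the 10% tolerance reflects the documented default, and real controllers also account for stabilization windows and pod readiness, which are ignored here.

```python
import math


def desired_replicas(current_replicas, current_utilization,
                     target_utilization=70, min_replicas=2,
                     max_replicas=10, tolerance=0.1):
    """Approximate the HPA decision for a CPU utilization target.

    Utilization values are percentages of the pods' CPU request.
    """
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        wanted = current_replicas
    else:
        # Multiply before dividing to limit floating-point surprises.
        wanted = math.ceil(
            current_replicas * current_utilization / target_utilization
        )
    # Clamp to the minReplicas/maxReplicas bounds from the manifest.
    return max(min_replicas, min(max_replicas, wanted))


print(desired_replicas(3, 90))    # load above target: scale up
print(desired_replicas(8, 140))   # request exceeds the maximum: capped
print(desired_replicas(2, 10))    # idle: held at the minimum
```

Walking through a few inputs this way helps when choosing `averageUtilization`: a lower target scales out earlier and costs more, a higher one saves money but leaves less headroom for spikes.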
Step 3: Monitoring and MLflow for version tracking
Monitoring the model in production is as important as the deployment itself. Use MLflow to version models and metrics. Integrate with Prometheus and Grafana for real-time metrics.
Install the MLflow Tracking Server:
pip install mlflow
mlflow server --host 0.0.0.0 --port 5001
In the training code, log each experiment:
import mlflow
import mlflow.sklearn

mlflow.set_tracking_uri("http://localhost:5001")

# `model` is the scikit-learn estimator trained earlier in the script
with mlflow.start_run():
    mlflow.log_param("model_type", "regression")
    mlflow.log_metric("rmse", 0.23)
    mlflow.sklearn.log_model(model, "model")
MLflow stores the model and metrics. For deployment, use the registered model instead of a local file.
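Loading from the registry could look like the sketch below. It assumes the model was registered under the hypothetical name "ml-model", for instance by passing `registered_model_name="ml-model"` to `mlflow.sklearn.log_model`, since the training snippet above logs the model without registering it. The helper names are illustrative, and whether your tracking server's backend supports the registry depends on how it is configured.

```python
def registered_model_uri(name: str, version: int) -> str:
    """Build an MLflow Model Registry URI such as 'models:/ml-model/1'."""
    return f"models:/{name}/{version}"


def load_registered_model(name: str, version: int,
                          tracking_uri: str = "http://localhost:5001"):
    """Load a registered model as a generic pyfunc model.

    MLflow is imported lazily so this module can be imported (and the URI
    helper used) on machines where MLflow is not installed.
    """
    import mlflow
    import mlflow.pyfunc

    mlflow.set_tracking_uri(tracking_uri)
    return mlflow.pyfunc.load_model(registered_model_uri(name, version))


# In app.py, this could replace the pickle-based loading
# ("ml-model" and version 1 are placeholder values):
# model = load_registered_model("ml-model", 1)
# prediction = model.predict(features)
```

Pinning an explicit version keeps deployments reproducible: rolling back means changing one number rather than hunting for an old pickle file.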
For monitoring, configure Prometheus to collect metrics from the pods. Create a dashboard in Grafana with:
- Average request latency
- Error rate
- CPU and memory usage per pod
- Number of predictions per second
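If CPU load rises, the HPA adds pods. Note that the HPA defined above reacts to CPU only, so scaling directly on latency would require exposing it as a custom metric. If errors increase, an alert fires. Without monitoring, you're flying blind.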
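To make the dashboard items concrete, here is a standard-library sketch of the counters an API could expose for Prometheus to scrape. The metric names are invented for illustration, and in practice the `prometheus_client` package is the usual way to do this. The sketch only shows the shape of the data in the Prometheus text exposition format.

```python
import threading


class RequestMetrics:
    """Thread-safe counters for requests, errors, and total latency."""

    def __init__(self):
        self._lock = threading.Lock()
        self.requests = 0
        self.errors = 0
        self.latency_seconds_sum = 0.0

    def observe(self, latency_seconds: float, error: bool = False) -> None:
        """Record one handled request."""
        with self._lock:
            self.requests += 1
            self.errors += int(error)
            self.latency_seconds_sum += latency_seconds

    def render(self) -> str:
        """Render the counters in the Prometheus text exposition format."""
        with self._lock:
            lines = [
                "# TYPE ml_requests_total counter",
                f"ml_requests_total {self.requests}",
                "# TYPE ml_request_errors_total counter",
                f"ml_request_errors_total {self.errors}",
                "# TYPE ml_request_latency_seconds_sum counter",
                f"ml_request_latency_seconds_sum {self.latency_seconds_sum}",
            ]
        return "\n".join(lines) + "\n"


metrics = RequestMetrics()
metrics.observe(0.25)
metrics.observe(0.5, error=True)
print(metrics.render())
```

From counters like these, Grafana panels can derive the listed quantities with PromQL, for example average latency as `rate(ml_request_latency_seconds_sum[5m]) / rate(ml_requests_total[5m])` and predictions per second as `rate(ml_requests_total[5m])`.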
Cost Comparison: Docker standalone vs Kubernetes in 2026
The table below compares costs for serving a model with 100 requests per second (RPS) for 30 days.
| Item | Docker (Single VM) | Kubernetes (3 pods) | Kubernetes (HPA, 2-10 pods) |
|---|---|---|---|
| VM/Cluster Cost | US$ 150/month (t3.medium) | US$ 72/month (EKS control) | US$ 72/month (EKS control) |
| Compute Cost | Included | US$ 216/month (3 pods) | US$ 144-432/month (2-pod floor to 6-pod average) |
| Storage Cost | US$ 10/month | US$ 30/month (registry) | US$ 30/month (registry) |
| Total Cost | US$ 160/month | US$ 318/month | US$ 246-534/month |
| Scalability | Manual | Manual | Automatic |
| Failure Recovery Time | 10-30 min | < 1 min | < 1 min |
Source: AWS Pricing Calculator, June 2026.
Docker standalone is cheaper for stable loads. But Kubernetes shines when traffic varies. With HPA, you pay for the average, not the peak.
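The "average, not the peak" point can be checked with a quick calculation. The daily traffic pattern below is hypothetical, chosen so the average works out to the 6 pods used in the table. The US$0.10 pod-hour rate is the article's figure, and the sketch covers pod compute only, leaving out the control plane and registry lines.

```python
RATE_CENTS_PER_POD_HOUR = 10  # US$0.10 per pod-hour, kept in cents
DAYS = 30

# Hypothetical day: 8 quiet hours, 8 normal hours, 8 peak hours.
hourly_pods = [2] * 8 + [6] * 8 + [10] * 8


def monthly_cost_autoscaled(profile, days=DAYS):
    """Cost in US dollars when the pod count follows the hourly profile."""
    pod_hours_per_day = sum(profile)
    return pod_hours_per_day * RATE_CENTS_PER_POD_HOUR * days / 100


def monthly_cost_fixed(pods, days=DAYS):
    """Cost in US dollars when `pods` pods run around the clock."""
    return pods * 24 * RATE_CENTS_PER_POD_HOUR * days / 100


average_pods = sum(hourly_pods) / len(hourly_pods)
autoscaled = monthly_cost_autoscaled(hourly_pods)
peak_provisioned = monthly_cost_fixed(max(hourly_pods))

print(f"average pods: {average_pods}")
print(f"autoscaled: US${autoscaled:.2f}/month")
print(f"provisioned for peak: US${peak_provisioned:.2f}/month")
```

Under these assumptions, provisioning for the peak costs US$720 per month while following the load costs US$432. The gap shrinks as traffic becomes flatter, which is why a single VM remains competitive for stable loads.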
When to use each?
- Docker standalone: prototypes, models with predictable traffic, small teams.
- Kubernetes: production with variable scale, multiple models, mature ML teams.
The wrong choice can be costly. A poorly configured Kubernetes cluster wastes resources. A Docker setup without orchestration breaks under request spikes.
Best Practices for ML Deployment in Production
- Use model versioning: MLflow or DVC. Never replace a model without tracking the previous version.
- Test the container locally: before pushing to the cluster, run `docker run` and make test requests.
- Define resource limits: pods without CPU/memory limits can crash the cluster.
- Implement health checks: Kubernetes uses probes to know if the pod is alive and ready.
- Collect business metrics: besides latency and errors, monitor model accuracy in production. Real data degrades models.
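For the health-check item, the sketch below shows one way to add liveness and readiness endpoints to the Flask app from Step 1. The `/healthz` and `/readyz` paths and the `register_probes` helper are illustrative choices, not something Kubernetes requires. The Deployment would also need `livenessProbe` and `readinessProbe` entries with `httpGet` pointing at these paths on port 5000.

```python
def register_probes(app, is_model_loaded):
    """Attach /healthz (liveness) and /readyz (readiness) routes to a Flask app.

    `is_model_loaded` is a zero-argument callable that returns True once the
    model is in memory and able to serve predictions.
    """

    @app.route("/healthz")
    def healthz():
        # Liveness: the process is up and able to answer HTTP requests.
        return {"status": "ok"}, 200

    @app.route("/readyz")
    def readyz():
        # Readiness: accept traffic only once the model can serve predictions.
        if is_model_loaded():
            return {"status": "ready"}, 200
        return {"status": "loading"}, 503

    return healthz, readyz


# In app.py (assuming `app` and `model` from Step 1):
# register_probes(app, lambda: model is not None)
```

Separating the two probes lets Kubernetes restart a hung container without sending traffic to a pod that is still loading a large model.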
ML deployment doesn't end when the model goes up. It begins. With each new version, repeat the cycle. Containerize, deploy, monitor, improve.
The cost of not doing this is high. 78% of models never reach production (Gartner, 2025). The 22% that do often die from lack of maintenance.
With Docker and Kubernetes, you reduce this risk. But only if you follow the steps with discipline. The tutorial ends here. Production starts now.