
From Zero to Model: Build Your First Classifier with Python and scikit-learn (2026 Tutorial)

NeuralPulse|May 28, 2026|10 min read

You have Python installed, know the basics of pandas, and have heard of machine learning — but you've never actually gotten your hands dirty. This tutorial is for you.

The global ML market was worth US$79.29 billion in 2024 and is expected to reach US$503.40 billion by 2030, a CAGR of 36.08% (Source: Statista/DemandSage). 48% of companies already use ML in production, and 92% of leaders have invested heavily (Source: DemandSage/Business Wire). In Brazil, 68% of SMEs will have adopted some form of AI by 2026, compared to just 12% in 2023 (Source: CNI/Sebrae via EuthopIA). Those who master the basics of ML aren't just learning a skill; they're securing a place in a market growing over 35% per year.

The good news: you don't need a PhD or a R$40,000 GPU to get started. You need Python, scikit-learn 1.8, and about 30 minutes of focus. We'll build a classifier that predicts insurance claims, using a real Porto Seguro dataset available on Kaggle. By the end, you'll have a working model, exported and served via an API.

The Setup: Install Everything in 2 Minutes

First, make sure your environment is ready. You'll need Python 3.11 or higher (3.12 or 3.13 preferred) and the libraries below:

# Clean installation for the tutorial
pip install scikit-learn==1.8.0 pandas numpy matplotlib seaborn jupyter xgboost fastapi uvicorn joblib
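To confirm the install worked before going further, a quick version check can help. This is a minimal sketch using only the standard library; the helper name `installed_versions` is our own and not part of any of these packages.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as used in the pip command above
required = ["scikit-learn", "pandas", "numpy", "xgboost", "fastapi", "joblib"]
for name, ver in installed_versions(required).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any line that prints NOT INSTALLED points to a package to reinstall before continuing.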

73% of data scientists prefer Python for ML tasks (Source: Gitnux). It's no coincidence — the ecosystem is mature, the community is huge, and libraries like scikit-learn make the entire workflow consistent.

"Scikit-learn remains the ideal tool for those working with classical machine learning" — Analytics Insight, 2026

Create a new notebook (or a .py script, if you prefer) and let's get started.

The Dataset: Real Data from Porto Seguro

We'll use the well-known Porto Seguro's Safe Driver Prediction dataset, available on Kaggle. Porto Seguro, one of Brazil's largest insurers, released this dataset in 2017, and it became a benchmark for imbalanced binary classification problems. The company automated 85% of its claims analysis processes using ML (Source: Porto Seguro cases/Kaggle).

The problem is simple: given a set of driver and vehicle characteristics, predict whether that customer will file a claim in the next year.

If you want to download the original dataset, go to kaggle.com/c/porto-seguro-safe-driver-prediction. For this tutorial, we'll simulate a reduced version that captures the essence of the problem.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import xgboost as xgb
import joblib

# Generating synthetic data similar to the Porto Seguro dataset
np.random.seed(42)
n_samples = 100000

# Features similar to the real dataset
data = {
    'id': range(n_samples),
    'ps_ind_01': np.random.normal(0, 1, n_samples),  # driver age
    'ps_ind_02': np.random.randint(0, 2, n_samples),  # gender
    'ps_ind_03': np.random.choice(['A', 'B', 'C', 'D', 'E', 'F', 'G'], n_samples),  # category
    'ps_ind_04': np.random.randint(0, 10, n_samples),  # years of experience
    'ps_reg_01': np.random.normal(0.5, 0.2, n_samples).clip(0, 1),  # population density
    'ps_reg_02': np.random.exponential(2, n_samples),  # vehicle value
    'ps_car_01': np.random.normal(0, 2, n_samples),  # vehicle power
    'ps_car_02': np.random.choice(['coupe', 'sedan', 'hatch', 'suv', 'pickup', 'van'], n_samples),
    'ps_car_03': np.random.randint(0, 5, n_samples),  # number of airbags
    'ps_car_04': np.random.normal(2015, 5, n_samples).astype(int),  # vehicle year
    'ps_calc_01': np.random.beta(2, 5, n_samples),  # credit score
}

df = pd.DataFrame(data)

# Creating target with realistic correlations
# Claim risk rises with less experience, a more expensive car, fewer airbags, and a low score.
# Age lowers risk overall; the extra term only softens that effect beyond 3 std devs.
log_odds = (
    -0.5 * df['ps_ind_01']
    + 0.3 * (df['ps_ind_01'] > 3) * df['ps_ind_01']  # age > 3 std dev: partial risk bump
    + 0.2 * (df['ps_ind_04'] < 2)                    # less than 2 years of experience
    + 0.4 * df['ps_reg_02']                          # more expensive car
    - 0.3 * df['ps_car_03']                          # more airbags = less risk
    + 0.5 * (1 - df['ps_calc_01'])                   # low score = more risk
    + np.random.normal(0, 1, n_samples)              # noise
)
prob = 1 / (1 + np.exp(-log_odds))
df['target'] = (prob > 0.5).astype(int)

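# Downsampling claims to ~4% of rows (like the real dataset)
# The raw target has far more than 4% claims, so we keep every non-claim
# and sample 1 claim per 24 non-claims (1/25 = 4% positives).
mask_claim = df['target'] == 1
n_no_claim = (~mask_claim).sum()
df_balanced = pd.concat([
    df[mask_claim].sample(n=n_no_claim // 24, random_state=42),
    df[~mask_claim],
])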

print(f"Shape: {df_balanced.shape}")
print(f"Claim rate: {df_balanced['target'].mean():.2%}")
print(f"Numerical features: {df_balanced.select_dtypes(include=[np.number]).shape[1]}")
# exclude=[np.number] finds the text columns whether pandas stores them as object or as a string dtype
print(f"Categorical features: {df_balanced.select_dtypes(exclude=[np.number]).columns.tolist()}")

Run it and see the output. You'll notice something common in real problems: the classes are imbalanced. Only about 4% of the records are claims. This is the first challenge — and one of the most common in the daily life of ML practitioners.
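Why the imbalance matters can be shown with a deliberately useless baseline. The sketch below uses toy labels with the same 4% positive rate (the arrays are illustrative, not the tutorial's data) and scikit-learn's `DummyClassifier`, which always predicts the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Toy labels with a 4% positive rate: 40 claims, 960 non-claims
y = np.array([1] * 40 + [0] * 960)
X = np.zeros((len(y), 1))  # features are irrelevant for this baseline

# A "model" that always predicts the majority class (no claim)
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)
y_score = baseline.predict_proba(X)[:, 1]

accuracy = accuracy_score(y, y_pred)
recall = recall_score(y, y_pred)
auc = roc_auc_score(y, y_score)

print(f"Accuracy: {accuracy:.2%}")         # 96.00%
print(f"Recall on claims: {recall:.2%}")   # 0.00%
print(f"ROC AUC: {auc:.2f}")               # 0.50
```

A 96% accuracy here means nothing: the baseline never finds a single claim. That is why the rest of this tutorial scores models with ROC AUC rather than accuracy.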

Brazilian companies like Nubank (credit scoring and fraud detection), Itaú Unibanco (predictive analytics), and Magazine Luiza (recommendation and dynamic pricing) face exactly the same type of problem: real data, imbalanced, with mixed features (numerical and categorical).

Preprocessing: Where 80% of the Work Really Happens

If you ask any experienced data scientist what the most time-consuming part of an ML project is, the answer will be unanimous: preprocessing. Let's tackle it head-on.

# Separating features and target
X = df_balanced.drop(['target', 'id'], axis=1)
y = df_balanced['target']

# Identifying columns by type
num_cols = X.select_dtypes(include=[np.number]).columns.tolist()
# exclude=[np.number] picks up the text columns regardless of how pandas stores strings
cat_cols = X.select_dtypes(exclude=[np.number]).columns.tolist()

print(f"Numerical columns ({len(num_cols)}): {num_cols[:5]}...")
print(f"Categorical columns ({len(cat_cols)}): {cat_cols}")

Caution number 1, Data Leakage: Never fit transformations on the data before separating train and test. If you standardize using the mean of the entire dataset, information about the test set "leaks" into training and your evaluation becomes too optimistic. The correct order is: split first, fit the transformation on the training set only, then apply it to both sets.

# Train-test split BEFORE any transformation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train: {X_train.shape}, Test: {X_test.shape}")
print(f"Train claim rate: {y_train.mean():.2%}")
print(f"Test claim rate: {y_test.mean():.2%}")

I used stratify=y to maintain the same class proportion in both sets. Without this, you could end up with a test set that doesn't represent reality — and your metrics will lie to you.
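The effect of `stratify` is easy to check on toy labels. This is a small sketch with made-up arrays (not the tutorial's data) that splits the same 4%-positive labels with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

y = np.array([1] * 40 + [0] * 960)       # 4% positives
X = np.arange(len(y)).reshape(-1, 1)     # placeholder feature

# Same split size and seed; only the stratify argument differs
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(f"Stratified test claim rate:   {y_test_strat.mean():.2%}")  # 4.00%
print(f"Unstratified test claim rate: {y_test_plain.mean():.2%}")  # varies with the seed
```

With stratification the 200-row test set always gets exactly 8 claims. Without it, the count depends on the random draw, and with rare classes that swing can be large relative to the class size.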

# Standardization of numerical features
# Work on copies so the assignments below don't touch the original frame
X_train, X_test = X_train.copy(), X_test.copy()
scaler = StandardScaler()
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

# Encoding categorical variables
# Keep one fitted encoder per column so they can be saved for deployment
encoders = {}
for col in cat_cols:
    encoders[col] = LabelEncoder()
    X_train[col] = encoders[col].fit_transform(X_train[col])
    X_test[col] = encoders[col].transform(X_test[col])

print("Preprocessing complete!")
print(f"Mean after standardization: {X_train[num_cols].mean().mean():.6f}")
print(f"Standard deviation: {X_train[num_cols].std().mean():.6f}")

Tip from someone who's struggled with this: save the scaler and encoders with joblib to use during deployment. You'll need to transform new data in exactly the same way.
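One way to follow that tip is to bundle all preprocessing into a single fitted object and persist it. The sketch below is an alternative to the manual steps above, not the tutorial's exact code: it swaps in `ColumnTransformer` and `OrdinalEncoder` (scikit-learn's encoder for feature columns, which can also map unseen categories to a placeholder). The column names and toy values are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy training frame; column names are hypothetical
train = pd.DataFrame({
    "power": [1.0, 2.0, 3.0, 4.0],
    "body": ["sedan", "suv", "sedan", "hatch"],
})

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["power"]),
    # Categories never seen during fit are encoded as -1 instead of raising
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["body"]),
])
preprocessor.fit(train)  # fitted on training data only

# Persist the fitted object, then reload it as a deployment service would
path = os.path.join(tempfile.mkdtemp(), "preprocessor.joblib")
joblib.dump(preprocessor, path)
restored = joblib.load(path)

new_rows = pd.DataFrame({"power": [2.5], "body": ["van"]})  # "van" was never seen
transformed = restored.transform(new_rows)
print(transformed)  # power 2.5 is the training mean -> 0.0; unseen "van" -> -1.0
```

Saving one object instead of a scaler plus several encoders leaves fewer pieces to keep in sync between training and serving.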

The Heart of ML: Training and Comparing Models

Now comes the part everyone is waiting for. We'll train three different models and compare them using cross-validation. Each has its own characteristics:

Model | Pros | Cons | Ideal for
Logistic Regression | Fast, interpretable, works well with linearly separable data | Doesn't capture complex relationships | Baseline, small datasets
Random Forest | Handles non-linearities well, feature importance, resistant to overfitting | Can be slow with many trees, less interpretable than logistic regression | Tabular data, minimal feature engineering
XGBoost | State-of-the-art for tabular data, built-in regularization, wins 82% of Kaggle competitions | More parameters to tune, risk of overfitting if poorly configured | Maximum performance, competitions, production

One important point: 85% of models in production need to be retrained quarterly (Source: Gitnux). The model you train today won't work forever, and that's normal.

# Function to evaluate models with cross-validation
def evaluate_model(model, X, y, name, cv=5):
    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    print(f"{name}:")
    print(f"  Mean ROC AUC: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
    print(f"  Scores per fold: {[f'{s:.4f}' for s in scores]}")
    return scores.mean()

# Models with initial configurations
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
    'XGBoost': xgb.XGBClassifier(
        n_estimators=100,
        learning_rate=0.1,
        max_depth=6,
        random_state=42,
        eval_metric='logloss',
    ),
}

print("=== Model Comparison (Cross-Validation) ===\n")
results = {}
for name, model in models.items():
    score = evaluate_model(model, X_train, y_train, name)
    results[name] = score

best = max(results, key=results.get)
print(f"\n🏆 Best model (cross-validation): {best} with ROC AUC = {results[best]:.4f}")

"Cross-validation remains the cornerstone of reliable model evaluation" — Nerd Level Tech, 2026

Cross-validation with 5 folds trains the model 5 times, each time with a different chunk of data as validation. It's more computationally expensive, but it gives a much more honest estimate of how the model will perform on new data.
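To make the fold mechanics concrete, here is a small sketch on toy labels. For a classifier, an integer `cv` in `cross_val_score` resolves to `StratifiedKFold`, so each validation fold keeps roughly the same class ratio. The arrays below are illustrative assumptions, not the tutorial's data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 10 + [0] * 240)   # 250 rows, 4% positives
X = np.zeros((len(y), 1))            # placeholder feature

cv = StratifiedKFold(n_splits=5)
fold_sizes = []
validated = []
for fold, (train_idx, val_idx) in enumerate(cv.split(X, y), start=1):
    fold_sizes.append(len(val_idx))
    validated.extend(val_idx.tolist())
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}, "
          f"val positives={int(y[val_idx].sum())}")

# Every row is used for validation exactly once across the 5 folds
print(sorted(validated) == list(range(len(y))))  # True
```

Each model is therefore scored on data it did not train on, five times over, which is where the mean and spread reported by `evaluate_model` come from.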

Tuning the Winning Model with GridSearchCV

XGBoost wins 82% of Kaggle competitions (Source: Gitnux). It's no wonder — it implements gradient boosting with built-in regularization and handles tabular data very well. But it has many hyperparameters, and the default values are rarely optimal.

Let's do a grid search (GridSearchCV) to find the best parameters:

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [4, 6, 8],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
}

xgb_base = xgb.XGBClassifier(random_state=42, eval_metric='logloss')

grid_search = GridSearchCV(
    estimator=xgb_base,
    param_grid=param_grid,
    cv=3,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
)

# WARNING: This may take a few minutes depending on your hardware
grid_search.fit(X_train, y_train)

print("\nBest parameters found:")
for param, value in grid_search.best_params_.items():
    print(f"  {param}: {value}")
print(f"Best ROC AUC (CV): {grid_search.best_score_:.4f}")

A common pitfall here: GridSearchCV performs an exhaustive search. With 3 levels of max_depth, 3 of learning_rate, 2 of subsample, 2 of colsample_bytree, and 2 of n_estimators, that's 3 × 3 × 2 × 2 × 2 = 72 combinations. Multiplied by 3 folds = 216 training runs. If each takes 5 seconds, that's 18 minutes when the runs execute one after another (n_jobs=-1 spreads them across your CPU cores, so wall-clock time is usually lower). Use RandomizedSearchCV if the search space is larger.

Evaluation on the Test Set

Training is only half the story. The truth appears when we test the model on data it has never seen:

best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

# Main metrics
print("=== Metrics on Test Set ===\n")
print(classification_report(y_test, y_pred, target_names=['No Claim', 'Claim']))

roc_auc = roc_auc_score(y_test, y_proba)
print(f"ROC AUC Score: {roc_auc:.4f}")

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No Claim', 'Claim'],
            yticklabels=['No Claim', 'Claim'])
plt.title('Confusion Matrix - Claims Classifier')
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, label=f'XGBoost (AUC = {roc_auc:.4f})', linewidth=2)
plt.plot([0, 1], [0, 1], 'k--', label='Random')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve - Claims Classifier')
plt.legend()
plt.show()

In an insurance problem, the most important metric is precision for the positive class (claim). Why? Because each false positive costs money — you'll approach a customer who doesn't need anything, wasting sales or marketing resources. A false negative (not identifying a claim that will happen) is also costly, but generally less so than harassing thousands of customers unnecessarily.
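The balance between false positives and false negatives is controlled by the decision threshold, not only by the model. The sketch below uses hand-made labels and probabilities (purely illustrative) to show how raising the threshold above the default 0.5 used by `predict` trades recall for precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hand-made labels and predicted claim probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.55, 0.60, 0.70, 0.90])

results = {}
for threshold in (0.3, 0.5, 0.65):
    # Flag a customer as "Claim" when the probability reaches the threshold
    y_pred = (y_proba >= threshold).astype(int)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    results[threshold] = (precision, recall)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")

# threshold=0.30  precision=0.43  recall=1.00
# threshold=0.50  precision=0.75  recall=1.00
# threshold=0.65  precision=1.00  recall=0.67
```

On real data you would pick the threshold from `y_proba` on a validation set, based on what each kind of error costs the business.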

ROC AUC measures the model's ability to separate classes across all thresholds.
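Put another way, ROC AUC is the probability that a randomly chosen claim receives a higher score than a randomly chosen non-claim. A small sketch with made-up scores checks that this pairwise count matches `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores, for illustration only
y_true = np.array([0, 0, 1, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
wins = (pos[:, None] > neg[None, :]).sum()
ties = (pos[:, None] == neg[None, :]).sum()
pairwise_auc = (wins + 0.5 * ties) / (len(pos) * len(neg))
sklearn_auc = roc_auc_score(y_true, y_score)

print(f"Pairwise ranking estimate: {pairwise_auc:.4f}")  # 0.8889 (8 of 9 pairs)
print(f"roc_auc_score:             {sklearn_auc:.4f}")   # 0.8889
```

Because it depends only on how the scores rank the rows, ROC AUC does not change when you move the decision threshold.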
