cheat sheet

scikit-learn

Build classical ML pipelines with scikit-learn. Covers the estimator API, train_test_split, Pipeline, ColumnTransformer, cross-validation, metrics, and model persistence.

updated 05-25-2026

scikit-learn — Classical Machine Learning

What it is

scikit-learn (imported as sklearn) is the de facto Python library for classical machine learning: everything that isn't deep neural networks. It bundles dozens of estimators (linear and logistic regression, random forests, gradient boosting, SVMs, K-means, …) behind a single small API: fit, predict, transform, score. It is built on NumPy, SciPy, and joblib, and is maintained by a large open-source community with long-standing support from Inria. For tabular data with thousands to millions of rows, scikit-learn is a sensible first choice; for deep learning, reach for PyTorch or JAX instead.

The unified estimator API is scikit-learn's superpower. Once you know how to fit a LogisticRegression, you can fit a RandomForestClassifier or a GradientBoostingRegressor with literally the same three lines.

Install

bash

# Recommended (uv is fastest)
uv pip install scikit-learn

# Or with pip
pip install scikit-learn

# Common companion packages
pip install pandas numpy matplotlib joblib

Output: (installer log; exits 0 on success)

Verify:

python

import sklearn
print(sklearn.__version__)

Output:

text

1.5.0

Syntax

Every estimator follows the same pattern. Instantiate with hyperparameters, fit on training data, then predict (classifiers/regressors) or transform (preprocessors) on new data. score returns a default metric (accuracy for classifiers, R² for regressors).

python

from sklearn.<family> import <Estimator>

model = <Estimator>(**hyperparameters)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
metric = model.score(X_test, y_test)

Output: (none — pattern, not a runnable snippet)
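As a runnable instance of the pattern, the sketch below pushes three different classifiers through the same fit / predict / score lines. The dataset (load_wine) and the choice of models are illustrative assumptions, not part of the pattern itself.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Three unrelated model families, one identical training loop
models = [
    DecisionTreeClassifier(random_state=0),
    RandomForestClassifier(n_estimators=50, random_state=0),
    KNeighborsClassifier(n_neighbors=5),
]

results = {}
for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[type(model).__name__] = model.score(X_test, y_test)

for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Swapping a model in or out only changes the list; the loop body stays the same.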

Essential building blocks

Tool	Module	What it does
`train_test_split`	`model_selection`	Split arrays into train/test
`Pipeline`	`pipeline`	Chain preprocessors + a final estimator
`ColumnTransformer`	`compose`	Apply different transforms per column
`StandardScaler`	`preprocessing`	Zero-mean, unit-variance scaling
`OneHotEncoder`	`preprocessing`	Categorical → dummy variables
`SimpleImputer`	`impute`	Fill missing values
`cross_val_score`	`model_selection`	K-fold CV, returns scores
`GridSearchCV`	`model_selection`	Exhaustive hyperparam search
`RandomizedSearchCV`	`model_selection`	Sampled hyperparam search
`joblib.dump`/`load`	`joblib`	Save/load a fitted estimator

The estimator API

Every preprocessor, model, and transformer in scikit-learn implements two or more of these methods. Understanding them once gives you the whole library.

python

# Estimators (models)
model.fit(X, y)              # train; y optional for unsupervised
model.predict(X)             # produce labels / values
model.predict_proba(X)       # probabilities (classifiers only)
model.score(X, y)            # default metric (accuracy / R²)

# Transformers (preprocessors)
transformer.fit(X)           # learn the transform parameters
transformer.transform(X)     # apply
transformer.fit_transform(X) # shortcut combining the two

# Hyperparameters
model.get_params()           # dict of all hyperparameters
model.set_params(C=10)       # update one (chainable)

Output: (none — API surface)
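A small sketch of the transformer and hyperparameter methods on real objects. The toy arrays are made up for illustration; the point is that transform reuses what fit learned, and that set_params returns the estimator itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [2.0], [4.0]])
X_new = np.array([[6.0]])

scaler = StandardScaler()
scaler.fit(X_train)                      # learns mean_ and scale_ from X_train
X_new_scaled = scaler.transform(X_new)   # applies the learned parameters to new data
print(scaler.mean_, X_new_scaled)

model = LogisticRegression()
returned = model.set_params(C=10)        # returns the same object, so calls can be chained
print(returned is model)
print(model.get_params()["C"])
```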

Quick example — train a classifier

A complete end-to-end run in 13 lines of code: load a sample dataset, split, fit a classifier, evaluate.

python

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)

print(f"Accuracy: {accuracy_score(y_test, preds):.3f}")
print(f"Score:    {model.score(X_test, y_test):.3f}")

Output:

text

Accuracy: 0.974
Score:    0.974

`train_test_split`

train_test_split shuffles the data and slices it into a training set and a held-out test set. Always pass random_state= for reproducible splits; for classification problems pass stratify=y so class proportions are preserved in both splits.

python

from sklearn.model_selection import train_test_split

# Simple two-way split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stratified (preserve class balance) — classification only
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Three-way (train / val / test = 60 / 20 / 20); the sizes printed below assume the 150-row iris X, y
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
print(len(X_train), len(X_val), len(X_test))

Output:

text

90 30 30

A test set you peek at is no longer a test set. Reserve a final hold-out at the very start of the project and don't touch it until model selection is done. Use cross-validation on the training set for tuning.
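To see what stratify=y does, the sketch below splits a made-up imbalanced label vector (90 negatives, 10 positives) with and without stratification and counts the classes in each test set. The data is hypothetical and exists only to make the counts easy to read.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)

_, _, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
_, _, y_train_u, y_test_u = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Stratified: the 20-row test set keeps the 90/10 ratio (18 negatives, 2 positives)
print("stratified test counts:  ", np.bincount(y_test_s, minlength=2))
# Unstratified: the positive count depends on the random draw
print("unstratified test counts:", np.bincount(y_test_u, minlength=2))
```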

Preprocessing

Most scikit-learn estimators expect numeric, finite features, and many also work better when those features are scaled. The preprocessing and impute modules provide transformers for everything between raw data and a model-ready matrix. The four most common are below.

`StandardScaler`

Standardise features to zero mean and unit variance. Essential for distance-based models (k-NN, SVM, k-means) and gradient-based linear models (logistic regression). Tree-based models (random forest, gradient boosting) are scale-invariant and don't need it.

python

from sklearn.preprocessing import StandardScaler
import numpy as np

X = np.array([[1, 100], [2, 200], [3, 300], [4, 400]], dtype=float)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print("means:", scaler.mean_)
print("stds: ", scaler.scale_)
print(X_scaled)

Output:

text

means: [  2.5 250. ]
stds:  [ 1.118 111.803]
[[-1.342 -1.342]
 [-0.447 -0.447]
 [ 0.447  0.447]
 [ 1.342  1.342]]
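The numbers above can be reproduced by hand. The sketch below checks that StandardScaler computes (x - mean) / std with the population standard deviation, and shows that new rows are scaled with the training statistics. The extra X_test row is an illustrative assumption.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0], [4.0, 400.0]])
X_test = np.array([[5.0, 500.0]])  # hypothetical unseen row

scaler = StandardScaler().fit(X_train)

# Same computation by hand: population standard deviation (ddof=0)
manual = (X_train - X_train.mean(axis=0)) / X_train.std(axis=0)
print(np.allclose(scaler.transform(X_train), manual))

# New data is scaled with the training mean and std, never refitted
X_test_scaled = scaler.transform(X_test)
print(X_test_scaled)
```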

`OneHotEncoder`

Turn a categorical column into one binary column per category. sparse_output=False (the parameter that replaced sparse in version 1.2) returns a dense array instead of the default sparse matrix; handle_unknown="ignore" encodes categories that were not seen during fit as all zeros at transform time instead of raising an error.

python

from sklearn.preprocessing import OneHotEncoder
import numpy as np

X = np.array([["red"], ["blue"], ["green"], ["red"], ["blue"]])
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
print(enc.fit_transform(X))
print(enc.get_feature_names_out())

Output:

text

[[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]
['x0_blue' 'x0_green' 'x0_red']

`OrdinalEncoder`

Map categories to integers 0..N-1. Use only when categories have a natural order (["low", "medium", "high"]) or when feeding tree-based models — linear models will misinterpret the ordering as numeric magnitude.

python

from sklearn.preprocessing import OrdinalEncoder
import numpy as np

X = np.array([["low"], ["high"], ["medium"], ["low"]])
enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(enc.fit_transform(X))

Output:

text

[[0.]
 [2.]
 [1.]
 [0.]]

`SimpleImputer`

Fill missing values (NaN) before they reach an estimator. Default strategy is mean for numeric columns; most_frequent works for both numeric and string columns.

python

from sklearn.impute import SimpleImputer
import numpy as np

X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan]])
imp = SimpleImputer(strategy="mean")
print(imp.fit_transform(X))

Output:

text

[[1.  2. ]
 [2.5 3. ]
 [4.  2.5]]
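The choice of strategy matters when a column has outliers. The sketch below, on made-up numbers, compares the fill values learned by mean and median and shows that transform reuses the value learned during fit.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column with one outlier (100) and one missing value
X_train = np.array([[1.0], [2.0], [3.0], [100.0], [np.nan]])
X_new = np.array([[np.nan]])

mean_imp = SimpleImputer(strategy="mean").fit(X_train)
median_imp = SimpleImputer(strategy="median").fit(X_train)

print(mean_imp.statistics_)    # mean of 1, 2, 3, 100 (pulled up by the outlier)
print(median_imp.statistics_)  # median of 1, 2, 3, 100

# The fill value learned on X_train is applied to new data
filled = median_imp.transform(X_new)
print(filled)
```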

`Pipeline`

A Pipeline chains transformers and a final estimator into a single object that exposes the estimator API. fit calls fit_transform on each intermediate step in order and then fit on the final estimator; predict calls transform on each intermediate step and then predict on the final estimator. Pipelines are the right way to package a model: they prevent data leakage (for example, a scaler fitted on data that includes the test set), and they save and load as one object.

python

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(f"Accuracy: {pipe.score(X_test, y_test):.3f}")

Output:

text

Accuracy: 0.972

Access a step by name with pipe.named_steps["scale"] or by index pipe[0]. Hyperparameters of any inner step are exposed under <step_name>__<param>:

python

pipe.set_params(logreg__C=0.1)
pipe.fit(X_train, y_train)
print(pipe.named_steps["logreg"].C)

Output:

text

0.1

The make_pipeline(scaler, model) helper from sklearn.pipeline auto-names steps from their class names — handy when you don't care about the names. Otherwise prefer the explicit Pipeline([("name", obj), ...]) form so set_params keys are predictable.
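For reference, this is what the auto-generated names look like: make_pipeline lowercases each class name, and those names become the prefixes for set_params.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
step_names = list(pipe.named_steps)
print(step_names)

# The generated step name is the prefix for inner hyperparameters
pipe.set_params(logisticregression__C=0.1)
print(pipe.named_steps["logisticregression"].C)
```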

`ColumnTransformer`

Real datasets mix numeric and categorical columns. ColumnTransformer applies different preprocessors to different columns and assembles the results into a single matrix.

python

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier

df = pd.DataFrame({
    "age": [25, 30, None, 45, 22],
    "income": [50_000, 60_000, 70_000, None, 30_000],
    "city": ["NYC", "LA", "NYC", "SF", "LA"],
    "churn": [0, 1, 0, 1, 0],
})
y = df.pop("churn")

numeric = ["age", "income"]
categorical = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("clf", GradientBoostingClassifier(random_state=42)),
])
pipe.fit(df, y)
print(pipe.predict(df))

Output:

text

[0 1 0 1 0]

ColumnTransformer returns either a dense np.ndarray or a sparse matrix, chosen from the overall density of the transformer outputs (see sparse_threshold). Pass remainder="passthrough" to keep columns you didn't transform; the default, remainder="drop", discards them.

Core estimators

A field guide to the most common scikit-learn estimators. Pick by problem shape first; tune hyperparameters once you have a working baseline.

Linear models

Linear regression for continuous targets, logistic regression for classification, ridge/lasso for regularised regression. Linear models are fast, interpretable, and surprisingly competitive on text and other high-dimensional sparse data.

python

from sklearn.linear_model import (
    LinearRegression, LogisticRegression,
    Ridge, Lasso, ElasticNet,
)

LinearRegression()                            # OLS
LogisticRegression(C=1.0, max_iter=1000)      # logistic; C is 1/lambda
Ridge(alpha=1.0)                              # L2 regularised
Lasso(alpha=0.1)                              # L1; can shrink some coefficients exactly to zero
ElasticNet(alpha=0.1, l1_ratio=0.5)           # mix of L1 and L2

Output: (none — instantiation only)
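The practical difference between L1 and L2 regularisation is easiest to see by counting zero coefficients. The sketch below uses a synthetic problem from make_regression in which only 3 of 20 features carry signal; the dataset and the alpha values are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 20 features, only 3 of them informative
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=3, noise=5.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

ridge_zeros = int(np.sum(ridge.coef_ == 0))
lasso_zeros = int(np.sum(lasso.coef_ == 0))
print(f"Ridge coefficients exactly zero: {ridge_zeros} of {ridge.coef_.size}")
print(f"Lasso coefficients exactly zero: {lasso_zeros} of {lasso.coef_.size}")
```

Ridge shrinks every coefficient a little, while Lasso drops uninformative features entirely, which is why it doubles as a feature selector.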

Tree-based models

Decision trees, random forests, and gradient boosting. Tree models handle mixed types, are scale-invariant, and capture non-linear relationships out of the box. Random forests are a strong default; gradient boosting (especially XGBoost / LightGBM / scikit-learn's HistGradientBoostingClassifier) usually wins competitive leaderboards on tabular data.

python

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (
    RandomForestClassifier, RandomForestRegressor,
    GradientBoostingClassifier, GradientBoostingRegressor,
    HistGradientBoostingClassifier, HistGradientBoostingRegressor,
)

RandomForestClassifier(n_estimators=300, max_depth=None, random_state=42)
GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
HistGradientBoostingClassifier(max_iter=200, learning_rate=0.05)  # faster on big data

Output: (none — instantiation only)
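Scale invariance can be checked directly: a decision tree trained on rescaled features makes the same predictions as one trained on the raw features. This is a sketch on the iris data; the factor 1024 is an arbitrary choice, picked as a power of two so that floating-point rounding is identical in both runs.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0, stratify=y
)

scale = 1024.0  # power of two: rescaling does not change rounding behaviour

tree_raw = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_train * scale, y_train)

# Splits depend only on the ordering of feature values, so predictions match
same = bool((tree_raw.predict(X_test) == tree_scaled.predict(X_test * scale)).all())
print("identical predictions:", same)
```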

Distance-based models

k-Nearest Neighbours and Support Vector Machines. Both require feature scaling — wrap them in a pipeline with StandardScaler.

python

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR, LinearSVC

KNeighborsClassifier(n_neighbors=5, weights="distance")
SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
LinearSVC(C=1.0, max_iter=5000)               # faster than SVC for large data

Output: (none — instantiation only)
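To illustrate why scaling matters for distance-based models, the sketch below cross-validates the same k-NN classifier with and without StandardScaler. The wine dataset is an illustrative choice: its features live on very different scales, so the unscaled distances are dominated by one or two columns.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# The pipeline refits the scaler inside every CV fold, so nothing leaks
raw_acc = cross_val_score(raw_knn, X, y, cv=5).mean()
scaled_acc = cross_val_score(scaled_knn, X, y, cv=5).mean()

print(f"k-NN without scaling: {raw_acc:.3f}")
print(f"k-NN with scaling:    {scaled_acc:.3f}")
```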

Naive Bayes

Surprisingly strong baseline for text classification. MultinomialNB for word counts, GaussianNB for continuous features.

python

from sklearn.naive_bayes import MultinomialNB, GaussianNB

GaussianNB()
MultinomialNB(alpha=1.0)

Output: (none)
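A minimal text-classification sketch with MultinomialNB on word counts. The four training sentences and their labels are made up for illustration; CountVectorizer (from sklearn.feature_extraction.text) turns each sentence into a vector of word counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus
texts = [
    "win a free prize now",
    "free cash prize claim now",
    "meeting agenda for monday",
    "project status and meeting notes",
]
labels = ["spam", "spam", "ham", "ham"]

clf = make_pipeline(CountVectorizer(), MultinomialNB(alpha=1.0))
clf.fit(texts, labels)

# Words never seen in training (such as "your") are simply ignored
pred = clf.predict(["claim your free prize", "monday project meeting"])
print(pred)
```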

Clustering and dimensionality reduction (unsupervised)

K-means and DBSCAN for clustering; PCA and t-SNE for dimensionality reduction. These are unsupervised: fit ignores y, so fit(X) is enough.

python

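from sklearn.cluster import KMeans, DBSCAN
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Standardised iris features (PCA and K-means are both sensitive to feature scale)
X = StandardScaler().fit_transform(load_iris().data)

km = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = km.fit_predict(X)

pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(f"Explained variance: {pca.explained_variance_ratio_.round(4)}")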

Output:

text

Explained variance: [0.7296 0.2285]

Cross-validation

A single train/test split is noisy. Cross-validation averages performance over K folds for a more stable estimate. cross_val_score is the one-liner; cross_validate returns more detail (train scores, fit times, multiple metrics).

python

from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()), ("logreg", LogisticRegression(max_iter=1000))])

scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
detailed = cross_validate(
    pipe, X, y, cv=cv,
    scoring=["accuracy", "f1", "roc_auc"],
    return_train_score=True,
)
for k in sorted(detailed):  # sorted so all test_* metrics print before train_*
    if k.startswith(("test_", "train_")):
        print(f"{k}: {detailed[k].mean():.3f}")

Output:

text

Accuracy: 0.977 ± 0.013
test_accuracy: 0.977
test_f1: 0.982
test_roc_auc: 0.996
train_accuracy: 0.991
train_f1: 0.993
train_roc_auc: 0.999

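Passing an integer cv with a classifier already uses StratifiedKFold, but without shuffling. Pass an explicit StratifiedKFold (or StratifiedShuffleSplit) when you want shuffled, seeded folds, and avoid plain KFold for classification: on imbalanced data it can produce folds with no positive class.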
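The difference between the two splitters is easy to see on a made-up label vector whose positives all sit at the end. The sketch counts the positives that land in each test fold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels, sorted so all positives are at the end
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((len(y), 1))

kfold_pos = [int(y[test].sum()) for _, test in KFold(n_splits=5).split(X, y)]
strat_pos = [int(y[test].sum()) for _, test in StratifiedKFold(n_splits=5).split(X, y)]

print("positives per test fold, KFold:          ", kfold_pos)
print("positives per test fold, StratifiedKFold:", strat_pos)
```

With plain KFold four of the five test folds contain no positives at all, so metrics such as recall or ROC-AUC are undefined on them.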

`GridSearchCV` and `RandomizedSearchCV`

Hyperparameter tuning by exhaustive grid search or random sampling. Both run cross-validation under the hood and expose the best estimator via .best_estimator_.

python

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 10, 20],
    "min_samples_leaf": [1, 5],
}
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1,
    verbose=1,
)
grid.fit(X, y)
print("Best params:", grid.best_params_)
print(f"Best CV score: {grid.best_score_:.3f}")

Output:

text

Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best params: {'max_depth': None, 'min_samples_leaf': 1, 'n_estimators': 300}
Best CV score: 0.965

For larger search spaces, RandomizedSearchCV samples n_iter combinations randomly — usually finds a near-optimal config in a fraction of the budget.

python

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

param_dist = {
    "n_estimators": randint(50, 500),
    "max_depth": [None] + list(range(5, 30)),
    "min_samples_leaf": randint(1, 20),
    "max_features": uniform(0.1, 0.9),
}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist, n_iter=30, cv=5, random_state=42, n_jobs=-1,
)
search.fit(X, y)
print("Best:", search.best_params_)

Output:

text

Best: {'max_depth': 17, 'max_features': 0.323, 'min_samples_leaf': 3, 'n_estimators': 287}
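Beyond best_params_, every search object keeps the full results table in cv_results_. The sketch below loads it into a DataFrame to compare candidates; the small grid over C on the iris data is an illustrative assumption.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 1.0, 100.0]},
    cv=5,
)
search.fit(X, y)

# One row per candidate, sorted from best to worst
results = pd.DataFrame(search.cv_results_)[
    ["param_C", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")
print(results)

# best_estimator_ has already been refit on all of X, y with the best parameters
first_preds = search.best_estimator_.predict(X[:3])
print(first_preds)
```

Looking at std_test_score next to mean_test_score helps judge whether the top candidate is really better or just within noise of the runner-up.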

Metrics

Picking the right metric matters more than picking the right model. For classification, accuracy is misleading on imbalanced data — prefer precision, recall, F1, and ROC-AUC. For regression, MAE is robust; RMSE penalises large errors; R² is unitless but only meaningful relative to a baseline.

python

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    mean_absolute_error, mean_squared_error, r2_score,
)

# Classification
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]
print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(f"precision: {precision_score(y_true, y_pred):.3f}")
print(f"recall:    {recall_score(y_true, y_pred):.3f}")
print(f"f1:        {f1_score(y_true, y_pred):.3f}")
print("confusion:")
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))

Output:

text

accuracy: 0.750
precision: 1.000
recall:    0.600
f1:        0.750
confusion:
[[3 0]
 [2 3]]
              precision    recall  f1-score   support

           0      0.600     1.000     0.750         3
           1      1.000     0.600     0.750         5

    accuracy                          0.750         8
   macro avg      0.800     0.800     0.750         8
weighted avg      0.850     0.750     0.750         8

python

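# Regression
from sklearn.metrics import root_mean_squared_error  # added in scikit-learn 1.4

y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5,  0.0, 2.0, 8.0]
print(f"MAE:  {mean_absolute_error(y_true, y_pred):.3f}")
# mean_squared_error(..., squared=False) is deprecated since 1.4 and removed in 1.6
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.3f}")
print(f"R²:   {r2_score(y_true, y_pred):.3f}")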

Output:

text

MAE:  0.500
RMSE: 0.612
R²:   0.949
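ROC-AUC is a ranking metric, so it needs scores rather than hard labels. The sketch below, assuming a scaled logistic regression on the breast cancer data, feeds predict_proba into roc_auc_score and then shows how moving the decision threshold changes recall. The 0.3 threshold is an arbitrary illustrative value.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]   # probability of class 1
auc = roc_auc_score(y_test, proba)        # computed from scores, not labels
print(f"ROC-AUC: {auc:.3f}")

# Lowering the threshold can only add predicted positives, so recall never drops
default_labels = (proba >= 0.5).astype(int)
lenient_labels = (proba >= 0.3).astype(int)
recall_default = recall_score(y_test, default_labels)
recall_lenient = recall_score(y_test, lenient_labels)
print(f"recall at 0.5: {recall_default:.3f}")
print(f"recall at 0.3: {recall_lenient:.3f}")
```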

Model persistence with `joblib`

joblib.dump serialises a fitted estimator (or a Pipeline) to disk. joblib.load deserialises it. Always save the entire Pipeline, not just the final estimator — otherwise you have to remember which preprocessing was applied.

python

import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scale", StandardScaler()), ("logreg", LogisticRegression())])
pipe.fit(X_train, y_train)

joblib.dump(pipe, "model.joblib", compress=3)
print("Saved.")

# Later — in a different process or service
loaded = joblib.load("model.joblib")
print(loaded.predict(X_test[:5]))

Output:

text

Saved.
[0 1 1 0 1]

joblib.load (like Python's pickle, which it builds on) can execute arbitrary code from the file. Never load a model file from an untrusted source. For production, store models in object storage you control and pin the scikit-learn version that produced the file: loading under a different version is unsupported and can break deserialisation or give unexpected results.
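One way to make the version pin checkable at load time is to store the scikit-learn version next to the model. The bundle layout below (a dict with "model" and "sklearn_version" keys) is a hypothetical convention, not something joblib or scikit-learn prescribes.

```python
import tempfile
from pathlib import Path

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    # Hypothetical bundle: the fitted pipeline plus the version that produced it
    joblib.dump({"model": pipe, "sklearn_version": sklearn.__version__}, path)

    bundle = joblib.load(path)  # only load files you created yourself
    if bundle["sklearn_version"] != sklearn.__version__:
        raise RuntimeError(
            f"Model was saved with scikit-learn {bundle['sklearn_version']}, "
            f"but {sklearn.__version__} is installed"
        )
    loaded = bundle["model"]
    same = bool((loaded.predict(X) == pipe.predict(X)).all())

print("round-trip predictions match:", same)
```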

Common pitfalls

Fitting the scaler on the full dataset: leaks test statistics into training. Always fit on X_train only (Pipelines handle this for you).
Forgetting stratify=y: on imbalanced classification, an unstratified split can put all positive labels in one fold. Pass stratify=y to train_test_split and use StratifiedKFold for CV.
Using accuracy on imbalanced data: a model that always predicts the majority class on 95/5 data gets 95% accuracy. Use balanced_accuracy, F1, or ROC-AUC instead.
Tuning on the test set: GridSearchCV.fit(X_test, y_test) defeats the entire purpose. Tune on X_train with CV; evaluate the chosen model on X_test exactly once.
predict vs predict_proba: predict returns hard labels using a 0.5 threshold for binary classifiers; for ROC-AUC and calibration you need predict_proba (or decision_function for some estimators).
Categorical features as integers: encoding ["red", "green", "blue"] → [0, 1, 2] makes a linear model think blue > red. Use OneHotEncoder for non-ordinal categories.
n_jobs=-1 on Windows without an if __name__ == "__main__" guard: Windows starts worker processes by spawning a fresh interpreter that re-imports the script, so unguarded top-level code runs again in every worker and can fail or spawn processes recursively. Always wrap top-level fit calls in the guard.
Different scikit-learn versions: pickled models often fail to load, or load with warnings, across version jumps (for example 1.2 → 1.5). Pin scikit-learn==X.Y.* in requirements.txt.
Shuffling time series: random shuffling leaks the future into the past. Pass shuffle=False to train_test_split and use TimeSeriesSplit for cross-validation on temporal data.
High training accuracy, low test accuracy: classic overfit. Add regularisation, prune trees (max_depth), reduce model complexity, or get more data.
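The accuracy pitfall is worth seeing in numbers. The sketch below scores a "model" that always predicts the majority class on made-up 95/5 labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# Hypothetical 95/5 class balance and a classifier that always says "0"
y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_pred)             # 0.95: looks great
bal = balanced_accuracy_score(y_true, y_pred)    # 0.5: mean of per-class recall (1.0 and 0.0)
f1 = f1_score(y_true, y_pred, zero_division=0)   # 0.0: no positive was ever found

print(f"accuracy:          {acc:.2f}")
print(f"balanced accuracy: {bal:.2f}")
print(f"F1 (class 1):      {f1:.2f}")
```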

Real-world recipes

End-to-end classifier on tabular data

A complete recipe: load a DataFrame with mixed numeric and categorical columns, build a preprocessing+model Pipeline, tune with cross-validation, evaluate on a hold-out, and persist.

python

import pandas as pd
import numpy as np
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# 1. Load
df = pd.read_csv("customers.csv")
y = df.pop("churn")

# 2. Hold-out split (touched only at the end)
X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, stratify=y, random_state=42
)

# 3. Column-aware preprocessing
numeric_cols = X_train.select_dtypes("number").columns.tolist()
categorical_cols = X_train.select_dtypes(exclude="number").columns.tolist()

preprocess = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
    ]), categorical_cols),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("clf", GradientBoostingClassifier(random_state=42)),
])

# 4. Cross-validated grid search on the TRAINING set
param_grid = {
    "clf__n_estimators": [100, 200],
    "clf__learning_rate": [0.05, 0.1],
    "clf__max_depth": [2, 3, 4],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
print(f"CV best ROC-AUC: {search.best_score_:.3f}")
print("Best params:", search.best_params_)

# 5. Hold-out evaluation — the only time we use X_test
best = search.best_estimator_
proba = best.predict_proba(X_test)[:, 1]
print(f"Hold-out ROC-AUC: {roc_auc_score(y_test, proba):.3f}")
print(classification_report(y_test, best.predict(X_test)))

# 6. Persist the entire fitted pipeline
joblib.dump(best, "churn_model.joblib", compress=3)

Output:

text

CV best ROC-AUC: 0.842
Best params: {'clf__learning_rate': 0.05, 'clf__max_depth': 3, 'clf__n_estimators': 200}
Hold-out ROC-AUC: 0.831
              precision    recall  f1-score   support

           0       0.85      0.92      0.88       795
           1       0.71      0.55      0.62       205

    accuracy                           0.84      1000
   macro avg       0.78      0.74      0.75      1000
weighted avg       0.82      0.84      0.83      1000

Visualise a confusion matrix

ConfusionMatrixDisplay.from_estimator renders a labelled matrix in two lines — paste straight into a notebook or report.

python

import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(best, X_test, y_test, cmap="Purples")
plt.title("Churn — confusion matrix")
plt.tight_layout()
plt.savefig("confusion.png", dpi=150)

Output:

text

[Saved confusion.png — a 2x2 matrix with counts and a purple gradient.]

Feature importance from a tree model

Tree-based estimators expose feature_importances_. Combine with the preprocessor's get_feature_names_out() to get a labelled bar chart.

python

import pandas as pd
import matplotlib.pyplot as plt

clf = best.named_steps["clf"]
prep = best.named_steps["prep"]
feature_names = prep.get_feature_names_out()
importances = pd.Series(clf.feature_importances_, index=feature_names)

top = importances.sort_values(ascending=False).head(15)
top.plot(kind="barh", color="#8a5cff")
plt.gca().invert_yaxis()
plt.title("Top 15 features")
plt.tight_layout()
plt.savefig("importance.png", dpi=150)

Output:

text

[Saved importance.png — a horizontal bar chart of the 15 most important features.]

Time-series cross-validation

TimeSeriesSplit walks forward in time — each fold's training set ends before the validation fold begins. Use whenever rows have a temporal order.

python

from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import Ridge

ts_cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(Ridge(alpha=1.0), X_sorted_by_date, y, cv=ts_cv, scoring="r2")
print("R² per fold:", scores)
print(f"Mean R²:    {scores.mean():.3f}")

Output:

text

R² per fold: [0.42 0.51 0.58 0.55 0.61]
Mean R²:    0.534
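Printing the indices makes the walk-forward behaviour concrete. The sketch uses six dummy rows assumed to be in time order, with n_splits=3 chosen only to keep the output short.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)  # six rows, assumed sorted by time

splits = [
    (train.tolist(), test.tolist())
    for train, test in TimeSeriesSplit(n_splits=3).split(X)
]
for train, test in splits:
    print(f"train={train} test={test}")
# train=[0, 1, 2] test=[3]
# train=[0, 1, 2, 3] test=[4]
# train=[0, 1, 2, 3, 4] test=[5]
```

The training window grows with each fold and always ends before the test rows begin, so no fold trains on the future.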

Use a saved model inside a web service

A FastAPI endpoint that loads the pickled Pipeline once at startup and serves predictions over HTTP.

python

# server.py
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel

MODEL = joblib.load("churn_model.joblib")

class Customer(BaseModel):
    age: int
    income: float
    plan: str
    region: str

app = FastAPI()

@app.post("/predict")
def predict(customer: Customer):
    row = pd.DataFrame([customer.model_dump()])
    proba = MODEL.predict_proba(row)[0, 1]
    return {"churn_probability": float(proba)}

bash

# Terminal 1: start the server (this command keeps running)
uvicorn server:app --reload

# Terminal 2: send a request
curl -s -X POST http://localhost:8000/predict \
  -H "content-type: application/json" \
  -d '{"age": 42, "income": 65000, "plan": "Pro", "region": "EU"}'

Output:

text

{"churn_probability": 0.184}

Find the optimal number of clusters

The elbow method plots inertia (within-cluster sum of squares) against k. The "elbow" is where the marginal improvement stops — usually a good n_clusters.

python

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
inertias = []
ks = range(1, 10)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)

plt.plot(list(ks), inertias, marker="o", color="#8a5cff")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.tight_layout()
plt.savefig("elbow.png", dpi=150)

Output:

text

[Saved elbow.png — a line that drops steeply until k=4, then flattens. The elbow is at 4.]
