cheat sheet
scikit-learn
Build classical ML pipelines with scikit-learn. Covers the estimator API, train_test_split, Pipeline, ColumnTransformer, cross-validation, metrics, and model persistence.
scikit-learn — Classical Machine Learning
What it is
scikit-learn (imported as sklearn) is the de-facto Python library for classical machine learning — everything that isn't deep neural networks. It bundles dozens of estimators (linear and logistic regression, random forests, gradient boosting, SVMs, K-means, …) behind a single tiny API: fit, predict, transform, score. It is built on NumPy, SciPy, and joblib; maintained by a large community led by INRIA. For tabular data with thousands to millions of rows, scikit-learn is the right first answer; for deep learning, reach for PyTorch or JAX instead.
The unified estimator API is scikit-learn's superpower. Once you know how to fit a LogisticRegression, you can fit a RandomForestClassifier or a GradientBoostingRegressor with literally the same three lines.
Install
# Recommended (uv is fastest)
uv pip install scikit-learn
# Or with pip
pip install scikit-learn
# Common companion packages
pip install pandas numpy matplotlib joblib
Output: (none — exits 0 on success)
Verify:
import sklearn
print(sklearn.__version__)
Output:
1.5.0
Syntax
Every estimator follows the same pattern. Instantiate with hyperparameters, fit on training data, then predict (classifiers/regressors) or transform (preprocessors) on new data. score returns a default metric (accuracy for classifiers, R² for regressors).
from sklearn.<family> import <Estimator>
model = <Estimator>(**hyperparameters)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
metric = model.score(X_test, y_test)
Output: (none — pattern, not a runnable snippet)
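To make the pattern concrete, here is a minimal runnable sketch that pushes two unrelated classifiers through the same fit / predict / score calls. The iris dataset and the hyperparameter values are illustrative choices, not part of the pattern itself.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Two different model families, one identical calling convention.
models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=100, random_state=42),
]

scores = {}
for model in models:
    model.fit(X_train, y_train)        # same call for every estimator
    y_pred = model.predict(X_test)     # same call for every estimator
    scores[type(model).__name__] = model.score(X_test, y_test)

for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Swapping in another classifier only changes the entry in the models list; the loop body stays untouched.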
Essential building blocks
| Tool | Module | What it does |
|---|---|---|
| train_test_split | model_selection | Split arrays into train/test |
| Pipeline | pipeline | Chain preprocessors + a final estimator |
| ColumnTransformer | compose | Apply different transforms per column |
| StandardScaler | preprocessing | Zero-mean, unit-variance scaling |
| OneHotEncoder | preprocessing | Categorical → dummy variables |
| SimpleImputer | impute | Fill missing values |
| cross_val_score | model_selection | K-fold CV, returns scores |
| GridSearchCV | model_selection | Exhaustive hyperparam search |
| RandomizedSearchCV | model_selection | Sampled hyperparam search |
| joblib.dump/load | joblib | Save/load a fitted estimator |
The estimator API
Every preprocessor, model, and transformer in scikit-learn implements two or more of these methods. Understanding them once gives you the whole library.
# Estimators (models)
model.fit(X, y) # train; y optional for unsupervised
model.predict(X) # produce labels / values
model.predict_proba(X) # probabilities (classifiers only)
model.score(X, y) # default metric (accuracy / R²)
# Transformers (preprocessors)
transformer.fit(X) # learn the transform parameters
transformer.transform(X) # apply
transformer.fit_transform(X) # shortcut combining the two
# Hyperparameters
model.get_params() # dict of all hyperparameters
model.set_params(C=10) # update one (chainable)
Output: (none — API surface)
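Because the contract is only a handful of methods, a custom step can join the ecosystem by implementing them. The sketch below is a hypothetical transformer (the class name ClipOutliers and its quantile parameter are invented for illustration): it learns per-column bounds in fit, applies them in transform, and inherits fit_transform, get_params, and set_params from scikit-learn's base classes.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class ClipOutliers(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: clip each column to quantile bounds learned in fit."""

    def __init__(self, quantile=0.05):
        self.quantile = quantile  # hyperparameter, stored unchanged

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Learned state gets a trailing underscore, by scikit-learn convention.
        self.lower_ = np.quantile(X, self.quantile, axis=0)
        self.upper_ = np.quantile(X, 1 - self.quantile, axis=0)
        return self  # returning self allows chaining

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        return np.clip(X, self.lower_, self.upper_)


X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
clipper = ClipOutliers(quantile=0.25)
X_clipped = clipper.fit_transform(X)   # provided by TransformerMixin
print(X_clipped.ravel())               # [2. 2. 3. 4. 4.]
print(clipper.get_params())            # provided by BaseEstimator
clipper.set_params(quantile=0.1)       # update a hyperparameter in place
print(clipper.quantile)
```

Storing constructor arguments unchanged and putting everything learned from data into trailing-underscore attributes is what lets get_params, set_params, and cloning inside cross-validation work without extra code.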
Quick example — train a classifier
A complete end-to-end run in about a dozen lines: load a sample dataset, split, fit a classifier, evaluate.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.25, random_state=42, stratify=y
)
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, preds):.3f}")
print(f"Score: {model.score(X_test, y_test):.3f}")
Output:
Accuracy: 0.974
Score: 0.974
train_test_split
train_test_split shuffles the data and slices it into a training set and a held-out test set. Always pass random_state= for reproducible splits; for classification problems pass stratify=y so class proportions are preserved in both splits.
from sklearn.model_selection import train_test_split
# Simple two-way split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Stratified (preserve class balance) — classification only
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# Three-way (train / val / test)
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)
print(len(X_train), len(X_val), len(X_test))
Output:
90 30 30
A test set you peek at is no longer a test set. Reserve a final hold-out at the very start of the project and don't touch it until model selection is done. Use cross-validation on the training set for tuning.
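The effect of stratify=y described at the top of this section is easiest to see on imbalanced labels. The following sketch uses synthetic data (1000 rows, 10% positives, both chosen purely for illustration) and compares the positive rate in the test split with and without stratification.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, illustrative data: 1000 rows, 10% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 100 + [0] * 900)

# Unstratified: the positive rate in the test split drifts with the shuffle.
_, _, y_train_plain, y_test_plain = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Stratified: each split keeps the overall 10% positive rate.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"overall positive rate:      {y.mean():.3f}")
print(f"plain test positive rate:   {y_test_plain.mean():.3f}")
print(f"strat. test positive rate:  {y_test_strat.mean():.3f}")
```

With stratification the test split holds exactly 20 of the 100 positives; without it the count depends on the random seed.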
Preprocessing
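Most scikit-learn estimators expect numeric, finite features, and many (distance-based and gradient-based models in particular) also work best when those features are on a comparable scale. The preprocessing and impute modules provide transformers for everything between raw data and a model-ready matrix. The four most common are below.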
StandardScaler
Standardise features to zero mean and unit variance. Essential for distance-based models (k-NN, SVM, k-means) and gradient-based linear models (logistic regression). Tree-based models (random forest, gradient boosting) are scale-invariant and don't need it.
from sklearn.preprocessing import StandardScaler
import numpy as np
X = np.array([[1, 100], [2, 200], [3, 300], [4, 400]], dtype=float)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print("means:", scaler.mean_)
print("stds: ", scaler.scale_)
print(X_scaled)
Output:
means: [ 2.5 250. ]
stds: [ 1.118 111.803]
[[-1.342 -1.342]
[-0.447 -0.447]
[ 0.447 0.447]
[ 1.342 1.342]]
OneHotEncoder
Turn a categorical column into one binary column per category. Pass sparse_output=False (the parameter was named sparse before scikit-learn 1.2) to get a dense array; the default is a sparse matrix. handle_unknown="ignore" encodes categories that were not seen during fit as all-zero rows at transform time instead of raising an error.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
X = np.array([["red"], ["blue"], ["green"], ["red"], ["blue"]])
enc = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
print(enc.fit_transform(X))
print(enc.get_feature_names_out())
Output:
[[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]
[1. 0. 0.]]
['x0_blue' 'x0_green' 'x0_red']
OrdinalEncoder
Map categories to integers 0..N-1. Use only when categories have a natural order (["low", "medium", "high"]) or when feeding tree-based models — linear models will misinterpret the ordering as numeric magnitude.
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
X = np.array([["low"], ["high"], ["medium"], ["low"]])
enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
print(enc.fit_transform(X))
Output:
[[0.]
[2.]
[1.]
[0.]]
SimpleImputer
Fill missing values (NaN) before they reach an estimator. Default strategy is mean for numeric columns; most_frequent works for both numeric and string columns.
from sklearn.impute import SimpleImputer
import numpy as np
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan]])
imp = SimpleImputer(strategy="mean")
print(imp.fit_transform(X))
Output:
[[1. 2. ]
[2.5 3. ]
[4. 2.5]]
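The most_frequent strategy mentioned above also covers string columns. A minimal sketch, assuming the data arrives as a pandas DataFrame with NaN marking the missing entry (the city values are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"city": ["NYC", np.nan, "NYC", "LA"]})

# most_frequent fills the gap with the modal value of each column.
imp = SimpleImputer(strategy="most_frequent")
filled = imp.fit_transform(df)

print(filled.ravel())     # the NaN is replaced by "NYC"
print(imp.statistics_)    # the fill value learned per column
```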
Pipeline
A Pipeline chains transformers and a final estimator into a single object that exposes the estimator API. fit calls fit_transform on each intermediate step in order and then fit on the final estimator; predict calls transform on each intermediate step and then predict on the final one. Pipelines are the right way to package a model: they prevent data leakage (for example, a scaler whose statistics were computed on test rows), and they save and load as a single object.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
pipe = Pipeline([
("scale", StandardScaler()),
("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)
print(f"Accuracy: {pipe.score(X_test, y_test):.3f}")
Output:
Accuracy: 0.972
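One way to convince yourself that the pipeline avoids leakage is to inspect the fitted scaler. The sketch below rebuilds the same pipeline and checks that the scaler's learned means match the training rows only, not the full dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_train, y_train)

# The scaler inside the pipeline only ever saw X_train.
fitted_means = pipe.named_steps["scale"].mean_
uses_train_only = bool(np.allclose(fitted_means, X_train.mean(axis=0)))
uses_full_data = bool(np.allclose(fitted_means, X.mean(axis=0)))
print("matches train means:", uses_train_only)
print("matches full-data means:", uses_full_data)
```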
Access a step by name with pipe.named_steps["scale"] or by index pipe[0]. Hyperparameters of any inner step are exposed under <step_name>__<param>:
pipe.set_params(logreg__C=0.1)
pipe.fit(X_train, y_train)
print(pipe.named_steps["logreg"].C)
Output:
0.1
The make_pipeline(scaler, model) helper from sklearn.pipeline auto-names steps from their class names, which is handy when you don't care about the names. Otherwise prefer the explicit Pipeline([("name", obj), ...]) form so set_params keys are predictable.
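For reference, this short sketch shows the names make_pipeline generates (the lowercased class names) and how they appear in set_params keys.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

auto = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step names are the lowercased class names.
step_names = [name for name, _ in auto.steps]
print(step_names)   # ['standardscaler', 'logisticregression']

# The generated name is the prefix for nested hyperparameters.
auto.set_params(logisticregression__C=0.1)
print(auto.named_steps["logisticregression"].C)
```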
ColumnTransformer
Real datasets mix numeric and categorical columns. ColumnTransformer applies different preprocessors to different columns and assembles the results into a single matrix.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier
df = pd.DataFrame({
"age": [25, 30, None, 45, 22],
"income": [50_000, 60_000, 70_000, None, 30_000],
"city": ["NYC", "LA", "NYC", "SF", "LA"],
"churn": [0, 1, 0, 1, 0],
})
y = df.pop("churn")
numeric = ["age", "income"]
categorical = ["city"]
preprocess = ColumnTransformer([
("num", Pipeline([
("impute", SimpleImputer(strategy="median")),
("scale", StandardScaler()),
]), numeric),
("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])
pipe = Pipeline([
("prep", preprocess),
("clf", GradientBoostingClassifier(random_state=42)),
])
pipe.fit(df, y)
print(pipe.predict(df))
Output:
[0 1 0 1 0]
ColumnTransformer returns either a dense np.ndarray or a sparse matrix (auto-selected from the inputs). Pass remainder="passthrough" to keep columns you didn't transform, or remainder="drop" (default) to drop them.
Core estimators
A field guide to the most common scikit-learn estimators. Pick by problem shape first; tune hyperparameters once you have a working baseline.
Linear models
Linear regression for continuous targets, logistic regression for classification, ridge/lasso for regularised regression. Linear models are fast, interpretable, and surprisingly competitive on text and other high-dimensional sparse data.
from sklearn.linear_model import (
LinearRegression, LogisticRegression,
Ridge, Lasso, ElasticNet,
)
LinearRegression() # OLS
LogisticRegression(C=1.0, max_iter=1000) # logistic; C is 1/lambda
Ridge(alpha=1.0) # L2 regularised
Lasso(alpha=0.1) # L1; sets coefficients to zero
ElasticNet(alpha=0.1, l1_ratio=0.5) # mix of L1 and L2
Output: (none — instantiation only)
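The practical difference between the L1 and L2 penalties shows up in the fitted coefficients. The sketch below fits Ridge and Lasso on synthetic data from make_regression (sizes, noise level, and alpha are arbitrary illustrative values) and counts coefficients that are exactly zero. The exact counts depend on the data and on alpha, so none are asserted here.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: 10 features, only 3 of which carry signal.
X, y = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks coefficients; Lasso can set them to exactly zero.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
print("Ridge zero coefficients:", n_zero_ridge)
print("Lasso zero coefficients:", n_zero_lasso)
```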
Tree-based models
Decision trees, random forests, and gradient boosting. Tree models handle mixed types, are scale-invariant, and capture non-linear relationships out of the box. Random forests are a strong default; gradient boosting (especially XGBoost / LightGBM / scikit-learn's HistGradientBoostingClassifier) usually wins competitive leaderboards on tabular data.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.ensemble import (
RandomForestClassifier, RandomForestRegressor,
GradientBoostingClassifier, GradientBoostingRegressor,
HistGradientBoostingClassifier, HistGradientBoostingRegressor,
)
RandomForestClassifier(n_estimators=300, max_depth=None, random_state=42)
GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3)
HistGradientBoostingClassifier(max_iter=200, learning_rate=0.05) # faster on big data
Output: (none — instantiation only)
Distance-based models
k-Nearest Neighbours and Support Vector Machines. Both require feature scaling — wrap them in a pipeline with StandardScaler.
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.svm import SVC, SVR, LinearSVC
KNeighborsClassifier(n_neighbors=5, weights="distance")
SVC(kernel="rbf", C=1.0, gamma="scale", probability=True)
LinearSVC(C=1.0, max_iter=5000) # faster than SVC for large data
Output: (none — instantiation only)
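A quick way to see why scaling matters for distance-based models is to cross-validate the same k-NN classifier with and without a StandardScaler in front of it. The wine dataset is used here only because its features live on very different scales; the scores are not asserted and will vary by dataset.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw_knn = KNeighborsClassifier(n_neighbors=5)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))

# Same model, same folds; the only difference is the scaling step.
raw_scores = cross_val_score(raw_knn, X, y, cv=5)
scaled_scores = cross_val_score(scaled_knn, X, y, cv=5)

print(f"k-NN without scaling: {raw_scores.mean():.3f}")
print(f"k-NN with scaling:    {scaled_scores.mean():.3f}")
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the comparison itself is free of leakage.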
Naive Bayes
Surprisingly strong baseline for text classification. MultinomialNB for word counts, GaussianNB for continuous features.
from sklearn.naive_bayes import MultinomialNB, GaussianNB
GaussianNB()
MultinomialNB(alpha=1.0)
Output: (none)
Clustering and dimensionality reduction (unsupervised)
K-means and DBSCAN for clustering; PCA and t-SNE for dimensionality reduction. These learn from X alone; fit accepts a y argument only for API consistency and ignores it.
from sklearn.cluster import KMeans, DBSCAN
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
km = KMeans(n_clusters=4, random_state=42, n_init=10)
labels = km.fit_predict(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(f"Explained variance: {pca.explained_variance_ratio_}")
Output:
Explained variance: [0.7296 0.2285]
Cross-validation
A single train/test split is noisy. Cross-validation averages performance over K folds for a more stable estimate. cross_val_score is the one-liner; cross_validate returns more detail (train scores, fit times, multiple metrics).
from sklearn.model_selection import cross_val_score, cross_validate, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
pipe = Pipeline([("scale", StandardScaler()), ("logreg", LogisticRegression(max_iter=1000))])
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
detailed = cross_validate(
pipe, X, y, cv=cv,
scoring=["accuracy", "f1", "roc_auc"],
return_train_score=True,
)
for k, v in detailed.items():
if k.startswith("test_") or k.startswith("train_"):
print(f"{k}: {v.mean():.3f}")
Output:
Accuracy: 0.977 ± 0.013
test_accuracy: 0.977
train_accuracy: 0.991
test_f1: 0.982
train_f1: 0.993
test_roc_auc: 0.996
train_roc_auc: 0.999
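For classification, use a StratifiedKFold (or StratifiedShuffleSplit) whenever you pass an explicit splitter; plain KFold can produce folds with no positive class on imbalanced data. An integer cv with a classifier already uses StratifiedKFold, but without shuffling.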
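The guarantee StratifiedKFold gives can be checked directly by counting positives per test fold. The labels below are synthetic (10 positives out of 100) and the feature matrix is a placeholder, since only y influences the stratification.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 10 + [0] * 90)
X = np.zeros((len(y), 1))   # features are irrelevant for the split itself

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Every test fold receives its proportional share of the positive class.
positives_per_fold = [int(y[test_idx].sum()) for _, test_idx in cv.split(X, y)]
print(positives_per_fold)   # [2, 2, 2, 2, 2]
```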
GridSearchCV and RandomizedSearchCV
Hyperparameter tuning by exhaustive grid search or random sampling. Both run cross-validation under the hood and expose the best estimator via .best_estimator_.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
"n_estimators": [100, 300],
"max_depth": [None, 10, 20],
"min_samples_leaf": [1, 5],
}
grid = GridSearchCV(
RandomForestClassifier(random_state=42),
param_grid,
cv=5,
scoring="accuracy",
n_jobs=-1,
verbose=1,
)
grid.fit(X, y)
print("Best params:", grid.best_params_)
print(f"Best CV score: {grid.best_score_:.3f}")
Output:
Fitting 5 folds for each of 12 candidates, totalling 60 fits
Best params: {'max_depth': None, 'min_samples_leaf': 1, 'n_estimators': 300}
Best CV score: 0.965
For larger search spaces, RandomizedSearchCV samples n_iter combinations randomly — usually finds a near-optimal config in a fraction of the budget.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_dist = {
"n_estimators": randint(50, 500),
"max_depth": [None] + list(range(5, 30)),
"min_samples_leaf": randint(1, 20),
"max_features": uniform(0.1, 0.9),
}
search = RandomizedSearchCV(
RandomForestClassifier(random_state=42),
param_dist, n_iter=30, cv=5, random_state=42, n_jobs=-1,
)
search.fit(X, y)
print("Best:", search.best_params_)
Output:
Best: {'max_depth': 17, 'max_features': 0.323, 'min_samples_leaf': 3, 'n_estimators': 287}
Metrics
Picking the right metric matters more than picking the right model. For classification, accuracy is misleading on imbalanced data — prefer precision, recall, F1, and ROC-AUC. For regression, MAE is robust; RMSE penalises large errors; R² is unitless but only meaningful relative to a baseline.
from sklearn.metrics import (
accuracy_score, precision_score, recall_score, f1_score,
roc_auc_score, confusion_matrix, classification_report,
mean_absolute_error, mean_squared_error, r2_score,
)
# Classification
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 0, 1, 0]
print(f"accuracy: {accuracy_score(y_true, y_pred):.3f}")
print(f"precision: {precision_score(y_true, y_pred):.3f}")
print(f"recall: {recall_score(y_true, y_pred):.3f}")
print(f"f1: {f1_score(y_true, y_pred):.3f}")
print("confusion:")
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred, digits=3))
Output:
accuracy: 0.750
precision: 1.000
recall: 0.600
f1: 0.750
confusion:
[[3 0]
[2 3]]
precision recall f1-score support
0 0.600 1.000 0.750 3
1 1.000 0.600 0.750 5
accuracy 0.750 8
macro avg 0.800 0.800 0.750 8
weighted avg 0.850 0.750 0.750 8
# Regression
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]
print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
# squared=False was deprecated in scikit-learn 1.4; take the square root explicitly
print(f"RMSE: {mean_squared_error(y_true, y_pred) ** 0.5:.3f}")
print(f"R²: {r2_score(y_true, y_pred):.3f}")
Output:
MAE: 0.500
RMSE: 0.612
R²: 0.949
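The warning about accuracy on imbalanced data at the top of this section can be reproduced in a few lines. The sketch below scores a constant "always predict the majority class" output on synthetic 95/5 labels.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score, f1_score, recall_score,
)

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)   # a "model" that always predicts the majority class

acc = accuracy_score(y_true, y_pred)
bal_acc = balanced_accuracy_score(y_true, y_pred)
rec = recall_score(y_true, y_pred, zero_division=0)
f1 = f1_score(y_true, y_pred, zero_division=0)   # precision is 0/0 here, reported as 0

print(f"accuracy:          {acc:.3f}")      # 0.950
print(f"balanced accuracy: {bal_acc:.3f}")  # 0.500
print(f"recall:            {rec:.3f}")      # 0.000
print(f"f1:                {f1:.3f}")       # 0.000
```

Accuracy rewards the useless model, while balanced accuracy, recall, and F1 all expose that it never finds a positive.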
Model persistence with joblib
joblib.dump serialises a fitted estimator (or a Pipeline) to disk. joblib.load deserialises it. Always save the entire Pipeline, not just the final estimator — otherwise you have to remember which preprocessing was applied.
import joblib
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
pipe = Pipeline([("scale", StandardScaler()), ("logreg", LogisticRegression())])
pipe.fit(X_train, y_train)
joblib.dump(pipe, "model.joblib", compress=3)
print("Saved.")
# Later — in a different process or service
loaded = joblib.load("model.joblib")
print(loaded.predict(X_test[:5]))
Output:
Saved.
[0 1 1 0 1]
joblib.load (like Python's pickle) can execute arbitrary code from the file. Never load a model file from an untrusted source. For production, store models in object storage you control and pin the scikit-learn version that produced the file; version mismatches can break deserialisation or silently change behaviour.
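One possible way to act on the version-pinning advice is to record the scikit-learn version next to the model file and check it before unpickling. The sidecar file name and its JSON layout below are hypothetical conventions for this sketch, not a joblib or scikit-learn feature.

```python
import json
import os
import tempfile

import joblib
import sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("logreg", LogisticRegression(max_iter=1000)),
])
pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.joblib")
    meta_path = os.path.join(tmp, "model.meta.json")   # hypothetical sidecar file

    # Save the model and, separately, the version that produced it.
    joblib.dump(pipe, model_path, compress=3)
    with open(meta_path, "w") as fh:
        json.dump({"sklearn_version": sklearn.__version__}, fh)

    # Later: compare versions BEFORE unpickling anything.
    with open(meta_path) as fh:
        meta = json.load(fh)
    if meta["sklearn_version"] != sklearn.__version__:
        raise RuntimeError(
            f"model was saved with scikit-learn {meta['sklearn_version']}, "
            f"but {sklearn.__version__} is installed"
        )

    loaded = joblib.load(model_path)   # only load files you produced yourself
    round_trip_ok = bool((loaded.predict(X) == pipe.predict(X)).all())

print("round trip OK:", round_trip_ok)
```

Keeping the version in plain JSON means the check can run without deserialising the model at all.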
Common pitfalls
- Fitting the scaler on the full dataset: leaks test statistics into training. Always fit on X_train only (Pipelines handle this for you).
- Forgetting stratify=y: on imbalanced classification, an unstratified split can put all positive labels in one fold. Pass stratify=y to train_test_split and use StratifiedKFold for CV.
- Using accuracy on imbalanced data: a model that always predicts the majority class on 95/5 data gets 95% accuracy. Use balanced_accuracy, F1, or ROC-AUC instead.
- Tuning on the test set: GridSearchCV.fit(X_test, y_test) defeats the entire purpose. Tune on X_train with CV; evaluate the chosen model on X_test exactly once.
- predict vs predict_proba: predict returns hard labels using a 0.5 threshold for binary classifiers; for ROC-AUC and calibration you need predict_proba (or decision_function for some estimators).
- Categorical features as integers: encoding ["red", "green", "blue"] → [0, 1, 2] makes a linear model think blue > red. Use OneHotEncoder for non-ordinal categories.
- n_jobs=-1 on Windows in a script without if __name__ == "__main__": joblib starts worker processes that re-import the script; without the guard the top-level code runs again in every worker and can spawn processes recursively. Always wrap top-level fit calls.
- Different scikit-learn versions: pickled models often fail to load across version jumps (for example 1.2 → 1.5). Pin scikit-learn==X.Y.* in requirements.txt.
- Shuffling time series: random shuffling leaks the future into the past. Keep shuffle=False and use TimeSeriesSplit for cross-validation on temporal data.
- High training accuracy, low test accuracy: classic overfit. Add regularisation, prune trees (max_depth), reduce model complexity, or get more data.
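The predict vs predict_proba item is worth a concrete look. This sketch, using logistic regression on the breast-cancer dataset as an arbitrary example, shows that predict matches a 0.5 cut on the positive-class probability and that a custom threshold has to be applied by hand. The 0.3 threshold is purely illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)

proba_pos = clf.predict_proba(X_test)[:, 1]    # probability of class 1
labels_default = clf.predict(X_test)           # hard labels
labels_half = (proba_pos > 0.5).astype(int)    # manual 0.5 threshold
labels_low = (proba_pos >= 0.3).astype(int)    # illustrative lower threshold

print("predict == 0.5 cut:", np.array_equal(labels_default, labels_half))
print("positives at 0.5:", int(labels_half.sum()))
print("positives at 0.3:", int(labels_low.sum()))
```

Lowering the threshold can only add positive predictions, which trades precision for recall.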
Real-world recipes
End-to-end classifier on tabular data
A complete recipe: load a DataFrame with mixed numeric and categorical columns, build a preprocessing+model Pipeline, tune with cross-validation, evaluate on a hold-out, and persist.
import pandas as pd
import numpy as np
import joblib
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# 1. Load
df = pd.read_csv("customers.csv")
y = df.pop("churn")
# 2. Hold-out split (touched only at the end)
X_train, X_test, y_train, y_test = train_test_split(
df, y, test_size=0.2, stratify=y, random_state=42
)
# 3. Column-aware preprocessing
numeric_cols = X_train.select_dtypes("number").columns.tolist()
categorical_cols = X_train.select_dtypes(exclude="number").columns.tolist()
preprocess = ColumnTransformer([
("num", Pipeline([
("impute", SimpleImputer(strategy="median")),
("scale", StandardScaler()),
]), numeric_cols),
("cat", Pipeline([
("impute", SimpleImputer(strategy="most_frequent")),
("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
]), categorical_cols),
])
pipe = Pipeline([
("prep", preprocess),
("clf", GradientBoostingClassifier(random_state=42)),
])
# 4. Cross-validated grid search on the TRAINING set
param_grid = {
"clf__n_estimators": [100, 200],
"clf__learning_rate": [0.05, 0.1],
"clf__max_depth": [2, 3, 4],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc", n_jobs=-1)
search.fit(X_train, y_train)
print(f"CV best ROC-AUC: {search.best_score_:.3f}")
print("Best params:", search.best_params_)
# 5. Hold-out evaluation — the only time we use X_test
best = search.best_estimator_
proba = best.predict_proba(X_test)[:, 1]
print(f"Hold-out ROC-AUC: {roc_auc_score(y_test, proba):.3f}")
print(classification_report(y_test, best.predict(X_test)))
# 6. Persist the entire fitted pipeline
joblib.dump(best, "churn_model.joblib", compress=3)
Output:
CV best ROC-AUC: 0.842
Best params: {'clf__learning_rate': 0.05, 'clf__max_depth': 3, 'clf__n_estimators': 200}
Hold-out ROC-AUC: 0.831
precision recall f1-score support
0 0.85 0.92 0.88 795
1 0.71 0.55 0.62 205
accuracy 0.84 1000
macro avg 0.78 0.74 0.75 1000
weighted avg 0.82 0.84 0.83 1000
Visualise a confusion matrix
ConfusionMatrixDisplay.from_estimator renders a labelled matrix in two lines — paste straight into a notebook or report.
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(best, X_test, y_test, cmap="Purples")
plt.title("Churn — confusion matrix")
plt.tight_layout()
plt.savefig("confusion.png", dpi=150)
Output:
[Saved confusion.png — a 2x2 matrix with counts and a purple gradient.]
Feature importance from a tree model
Tree-based estimators expose feature_importances_. Combine with the preprocessor's get_feature_names_out() to get a labelled bar chart.
import pandas as pd
import matplotlib.pyplot as plt
clf = best.named_steps["clf"]
prep = best.named_steps["prep"]
feature_names = prep.get_feature_names_out()
importances = pd.Series(clf.feature_importances_, index=feature_names)
top = importances.sort_values(ascending=False).head(15)
top.plot(kind="barh", color="#8a5cff")
plt.gca().invert_yaxis()
plt.title("Top 15 features")
plt.tight_layout()
plt.savefig("importance.png", dpi=150)
Output:
[Saved importance.png — a horizontal bar chart of the 15 most important features.]
Time-series cross-validation
TimeSeriesSplit walks forward in time — each fold's training set ends before the validation fold begins. Use whenever rows have a temporal order.
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.linear_model import Ridge
ts_cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(Ridge(alpha=1.0), X_sorted_by_date, y, cv=ts_cv, scoring="r2")
print("R² per fold:", scores)
print(f"Mean R²: {scores.mean():.3f}")
Output:
R² per fold: [0.42 0.51 0.58 0.55 0.61]
Mean R²: 0.534
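To see the walk-forward behaviour without a real dataset, the sketch below prints the index sets TimeSeriesSplit produces for eight time-ordered rows (a toy size chosen so the folds are easy to read).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(8).reshape(-1, 1)   # eight rows, already sorted by time

ts_cv = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in ts_cv.split(X)]

for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
# train: [0, 1] test: [2, 3]
# train: [0, 1, 2, 3] test: [4, 5]
# train: [0, 1, 2, 3, 4, 5] test: [6, 7]
```

The training window grows with each fold and always ends before the test window starts, so no future row is ever used to predict the past.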
Use a saved model inside a web service
A FastAPI endpoint that loads the pickled Pipeline once at startup and serves predictions over HTTP.
# server.py
import joblib
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel
MODEL = joblib.load("churn_model.joblib")
class Customer(BaseModel):
age: int
income: float
plan: str
region: str
app = FastAPI()
@app.post("/predict")
def predict(customer: Customer):
row = pd.DataFrame([customer.model_dump()])
proba = MODEL.predict_proba(row)[0, 1]
return {"churn_probability": float(proba)}
uvicorn server:app --reload
curl -s -X POST http://localhost:8000/predict \
-H "content-type: application/json" \
-d '{"age": 42, "income": 65000, "plan": "Pro", "region": "EU"}'
Output:
{"churn_probability": 0.184}
Find the optimal number of clusters
The elbow method plots inertia (within-cluster sum of squares) against k. The "elbow" is where the marginal improvement stops — usually a good n_clusters.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=500, centers=4, random_state=42)
inertias = []
ks = range(1, 10)
for k in ks:
km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
inertias.append(km.inertia_)
plt.plot(list(ks), inertias, marker="o", color="#8a5cff")
plt.xlabel("k")
plt.ylabel("Inertia")
plt.title("Elbow method")
plt.tight_layout()
plt.savefig("elbow.png", dpi=150)
Output:
[Saved elbow.png — a line that drops steeply until k=4, then flattens. The elbow is at 4.]
See also
- sections/python/pandas: load and clean tabular data before it reaches scikit-learn.
- sections/python/numpy: the underlying array library; understand it for fancy indexing and broadcasting in custom transformers.
- sections/python/matplotlib: render confusion matrices, feature-importance bars, and elbow plots.
- sections/python/scipy: statistical distributions used in RandomizedSearchCV parameter spaces.
- sections/python/streamlit: wrap a fitted Pipeline into a one-page demo UI.