cheat sheet

scikit-learn

Package-level reference for scikit-learn — install, versioning, extras, and gotchas. The de-facto classical-ML library on PyPI.

updated 05-31-2026

What it is

scikit-learn (imported as sklearn, but distributed on PyPI as scikit-learn) is the canonical classical-ML library for Python: every estimator exposes the same small fit / predict / transform / score API. It covers linear and logistic regression, tree ensembles, gradient boosting, SVMs, K-means, PCA, model selection, preprocessing, and pipelines. It is built on numpy + scipy + joblib and maintained by a large open-source community with long-standing backing from INRIA.

On PyPI scikit-learn is among the most-installed ML libraries by download volume, used everywhere from research notebooks to production batch jobs. Reach for it whenever your data is tabular and your problem is "fit a model to features"; reach for PyTorch or JAX instead for deep learning.

Install

bash

pip install scikit-learn

Output: (none — exits 0 on success)

bash

uv add scikit-learn

Output: dependency resolved, lockfile updated; pulls numpy + scipy + joblib + threadpoolctl

bash

poetry add scikit-learn

Output: installed into the project venv

bash

pip install "scikit-learn[examples]"

Output: installs scikit-learn plus matplotlib / pandas / seaborn for the gallery examples

Note: the PyPI distribution name is scikit-learn but the import name is sklearn. Installing pip install sklearn (without the dash) historically resolved to a deprecated placeholder package — always use scikit-learn.
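To confirm that the two names point at the same install, you can compare the distribution metadata with the module's own version. This is a minimal sketch and assumes scikit-learn is already installed in the active environment; the variable name dist_version is only illustrative.

```python
from importlib.metadata import version  # standard library, Python 3.8+

import sklearn  # the import name

# "scikit-learn" is the distribution name pip knows about;
# "sklearn" is the module name Python imports.
dist_version = version("scikit-learn")
print(dist_version == sklearn.__version__)
```

If the lookup raises PackageNotFoundError, the distribution is not installed in this environment, even if some other sklearn module happens to be importable.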

Versioning & Python support

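scikit-learn supports a rolling window of recent Python versions (see the table below; the window is similar to, though not formally tied to, SPEC 0) and ships feature releases roughly twice a year. It does not follow strict SemVer: deprecated behaviour warns for two minor releases and is then removed in a minor release.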

sklearn line	Python support	Numpy / SciPy floor
1.3.x	3.8 – 3.12	numpy >= 1.17
1.4.x	3.9 – 3.12	numpy >= 1.19
1.5.x	3.9 – 3.13	numpy >= 1.19
1.6.x	3.9 – 3.13	numpy >= 1.19 (numpy 2.x supported)
1.7.x	3.10+	numpy >= 1.22 (numpy 2.x supported)

The 1.x line has been stable for several years — there is no announced 2.0 break as of late 2025.

Package metadata

Maintainer: scikit-learn consortium under INRIA Foundation, plus broad community
Project home: github.com/scikit-learn/scikit-learn
Docs: scikit-learn.org/stable
License: BSD-3-Clause
PyPI: pypi.org/project/scikit-learn
Governance: Technical Committee + Maintainers; SLEP (scikit-learn Enhancement Proposals)
First released: 2010 (the project began in 2007 as a Google Summer of Code project by David Cournapeau)
Downloads: > 80 M / month on PyPI as of late 2025

Optional dependencies & extras

scikit-learn's required deps are minimal — numpy, scipy, joblib, threadpoolctl. Extras and common companion packages cover the surrounding ecosystem:

bash

pip install scikit-learn pandas matplotlib seaborn jupyter joblib pyarrow

Output: installs the canonical modelling stack

Companion	Use
pandas	feature DataFrames in / out of pipelines
matplotlib + seaborn	plotting confusion matrices, ROC curves
joblib	model persistence (`joblib.dump`), parallelism backend
imbalanced-learn	over-/under-sampling, SMOTE — sklearn-compatible API
category_encoders	richer categorical encoders than `OneHotEncoder`
optuna / scikit-optimize	hyperparameter search beyond `GridSearchCV`
xgboost / lightgbm / catboost	gradient-boosted trees with the sklearn estimator API
shap	model explanations on top of any sklearn estimator

There are also scikit-learn[tests] and scikit-learn[docs] extras, but these are aimed at contributors, not end users.

Alternatives

Package	One-line trade-off
xgboost / lightgbm / catboost	usually faster and often more accurate than sklearn's classic GradientBoosting estimators; HistGradientBoosting narrows the gap
statsmodels	statistical inference (p-values, R-style formulas) rather than prediction
pycaret	high-level AutoML wrapper around sklearn
h2o.ai	distributed AutoML — JVM dependency
cuML (RAPIDS)	GPU-accelerated sklearn API, NVIDIA-only
PyTorch / JAX / TensorFlow	deep learning — wrong tool for purely tabular problems
flaml / autosklearn	AutoML on top of sklearn

Common gotchas

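Not GPU-accelerated. sklearn is essentially CPU-only; experimental Array API support lets a small subset of estimators run on GPU arrays, but it is opt-in and limited. For NVIDIA GPUs use cuML; for everyone else, sklearn's BLAS-backed estimators are already fast on modern CPUs.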
Distribution name vs import name. Install with scikit-learn, import as sklearn. The pre-2023 placeholder sklearn package on PyPI is now a hard error to install — good.
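Kernel methods scale poorly. GaussianProcessRegressor and KernelRidge build a dense N×N kernel matrix: 50k rows of float64 ⇒ 20 GB RAM. SVC(kernel="rbf") caches kernel rows rather than holding the full matrix, but its fit time grows at least quadratically with N. Use linear models or tree ensembles past a few tens of thousands of rows.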
n_jobs=-1 does not mean "best parallelism". sklearn uses joblib + loky; the underlying BLAS (OpenBLAS, MKL) also parallelises. Stacking the two oversubscribes cores. Use threadpoolctl to cap BLAS threads, or set n_jobs=1 for individual estimators inside a GridSearchCV(n_jobs=-1).
Random-state plumbing. Reproducibility requires random_state= on every estimator that has the parameter; default is None, which uses the global numpy RNG and is not reproducible across runs.
Model persistence: prefer joblib.dump / joblib.load over plain pickle. Both use the pickle protocol, but joblib stores the large numpy arrays inside fitted estimators more efficiently and supports compression. Neither is safe to load from an untrusted source.
Cross-version model loading is not guaranteed. Models saved with sklearn 1.3 may not load in 1.5. Pin the version in requirements.txt for any served model, or convert to ONNX for portability.
OneHotEncoder(sparse=...) was renamed to sparse_output= in 1.2 and the old keyword was removed in 1.4. Old tutorials that still pass sparse= now fail with a TypeError.
fit_transform on test data is a leak. Use fit_transform(X_train) then transform(X_test). Pipeline enforces this, so prefer Pipelines over loose transform calls.
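The 20 GB figure in the kernel-methods item is plain arithmetic: a dense float64 N×N matrix needs N² × 8 bytes. The helper below is a hypothetical illustration (dense_kernel_matrix_bytes is not a scikit-learn function) for estimating that footprint before fitting.

```python
def dense_kernel_matrix_bytes(n_samples: int, itemsize: int = 8) -> int:
    """Bytes needed for a dense n x n kernel matrix (float64 by default)."""
    return n_samples * n_samples * itemsize


for n in (5_000, 50_000, 200_000):
    gb = dense_kernel_matrix_bytes(n) / 1e9  # decimal gigabytes
    print(f"{n:>7,} rows -> {gb:,.1f} GB")
# prints:
#   5,000 rows -> 0.2 GB
#  50,000 rows -> 20.0 GB
# 200,000 rows -> 320.0 GB
```

The growth is quadratic, so multiplying the row count by ten multiplies the memory by a hundred.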

Real-world recipes

scikit-learn is the ML library that almost every team starts with; the recipes below are the packaging-level patterns — pipeline shape, persistence choice, parallelism — that you'd reach for in production. The companion sections/python/scikit-learn covers the estimator API depth.

Tabular pipeline with ColumnTransformer — the canonical sklearn pattern. Every preprocessing step is encapsulated; the whole pipeline pickles as a single object.

python

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 23, 39, 60],
    "income": [40_000, 70_000, 120_000, 90_000, 35_000, 80_000, 150_000],
    "region": ["NA", "EU", "NA", "APAC", "EU", "NA", "EU"],
    "churned": [0, 0, 1, 0, 1, 0, 1],
})

X = df.drop(columns=["churned"])
y = df["churned"]
# stratify=y keeps both classes in the tiny training split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

num_cols = ["age", "income"]
cat_cols = ["region"]
pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([
    ("pre", pre),
    ("clf", GradientBoostingClassifier(random_state=0)),
])
model.fit(X_train, y_train)
print(f"score: {model.score(X_test, y_test):.3f}")

Output: the test accuracy; every transformation is fitted on train and applied to test — no leakage

Cross-validated hyperparameter search:

python

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]},
    cv=cv,
    n_jobs=-1,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")

Output: the best parameter combination and its mean ROC-AUC across folds; n_jobs=-1 parallelises fits across cores via joblib
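When preprocessing is part of the model, the search is usually run over the whole Pipeline so that each fold refits the scaler on its own training portion. The sketch below assumes a StandardScaler + LogisticRegression pipeline and an arbitrary grid for C; the step names "scale" and "clf" are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "<step name>__<parameter>" addresses a parameter of a pipeline step
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.1, 1.0, 10.0]},
    cv=5,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
```

Because the scaler lives inside the estimator being cloned for each fold, no statistics from the validation portion reach the fitted scaler.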

Model persistence — production-shaped:

python

import joblib
import pandas as pd
from sklearn import __version__ as sk_version

# After training `model` as in the ColumnTransformer recipe above ...
artifact = {
    "model": model,
    "sklearn_version": sk_version,
    "training_data_schema": {"age": "int64", "income": "int64", "region": "category"},
}
joblib.dump(artifact, "churn-v1.joblib", compress=3)

# Loading the model at serve time
loaded = joblib.load("churn-v1.joblib")
assert loaded["sklearn_version"] == sk_version, "version mismatch: re-train"
# The ColumnTransformer selects columns by name, so predict needs a DataFrame
record = pd.DataFrame([{"age": 30, "income": 60_000, "region": "EU"}])
print(loaded["model"].predict(record))

Output: the prediction for a synthetic record; the artifact carries enough metadata to detect version drift before it produces wrong answers

Custom estimator that plugs into Pipelines:

python

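import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.estimator_checks import check_estimator
from sklearn.utils.validation import check_is_fitted, validate_data  # validate_data needs sklearn >= 1.6

# Mixins go to the left of BaseEstimator so their estimator tags are applied
class WinsorizeTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Validates X (2-D, numeric, finite) and records n_features_in_
        X = validate_data(self, X)
        self.lower_ = np.quantile(X, self.lower, axis=0)
        self.upper_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self)
        # reset=False checks X against the feature count seen in fit
        X = validate_data(self, X, reset=False)
        return np.clip(X, self.lower_, self.upper_)

check_estimator(WinsorizeTransformer())  # raises on the first failing check
print("estimator passes sklearn checks")

Output: estimator passes sklearn checks, provided every check succeeds (check_estimator raises on the first failure, and the exact set of checks varies by sklearn version). check_estimator is the conformance test for custom estimators; passing it means the class composes correctly with Pipelines and GridSearchCV. Most of the checks exercise input validation, which is why fit and transform go through validate_data rather than a bare np.asarray.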

Performance tuning

scikit-learn performance is dominated by BLAS (for the linear-algebra-heavy estimators) and joblib parallelism (for the embarrassingly-parallel ones). Both can be over-tuned in ways that slow things down — over-subscribing CPU cores is the most common mistake.

python

from threadpoolctl import threadpool_info
import sklearn

sklearn.show_versions()
print(threadpool_info())

Output: the BLAS backend and current thread caps — sklearn's show_versions() prints the BLAS info; threadpool_info() prints what's active in this process

Tuning levers, ordered by impact:

Lever	Mechanism	When it matters
BLAS thread cap	`threadpool_limits(limits=1)` inside parallel CV	when `n_jobs > 1` and BLAS oversubscribes
`n_jobs=-1` on outer search, `n_jobs=1` on estimator	one layer of parallelism, not two	`GridSearchCV(n_jobs=-1)` over RandomForest
`algorithm="ball_tree"` / `"kd_tree"` on neighbours	spatial index	low-to-moderate-dimensional k-NN; in high dimensions the trees degrade and `"brute"` is often faster
`partial_fit` for incremental fitting	streaming-fit estimators	data > RAM (SGDClassifier, MiniBatchKMeans)
`HistGradientBoostingClassifier` over `GradientBoostingClassifier`	histogram binning + native parallelism	large tabular ML
Feature slimming (lower `n_features_in_`)	drop uninformative columns before fit	wide feature matrices
Sparse input (`scipy.sparse`)	skip zero ops	high-dim TF-IDF / categorical

The parallelism stacking trap:

python

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"max_depth": [3, 5]}

# Trap: both layers ask for every core, so each search worker
# also starts a full set of forest threads -> oversubscription
trap = GridSearchCV(RandomForestClassifier(n_jobs=-1), param_grid, n_jobs=-1)

# Fix: parallelise one layer only (here the outer search)
search = GridSearchCV(RandomForestClassifier(n_jobs=1), param_grid, n_jobs=-1)

Output: nothing is printed. If you fit the first search while watching htop on an 8-core box, you'd see far more runnable threads than cores, with much of the CPU time lost to context switching; the second search keeps roughly one busy worker per core

The rule of thumb: pick one layer to parallelise. GridSearchCV(n_jobs=-1) with RandomForestClassifier(n_jobs=1) is usually faster than the reverse on a single machine.
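When the extra threads come from BLAS rather than from joblib, threadpoolctl can cap the BLAS pool for a block of code. This is a minimal sketch, assuming numpy is linked against a BLAS that threadpoolctl can detect (for example OpenBLAS or MKL); the matrix size is arbitrary.

```python
import numpy as np
from threadpoolctl import threadpool_info, threadpool_limits

a = np.random.default_rng(0).normal(size=(500, 500))

with threadpool_limits(limits=1, user_api="blas"):
    # Inside the block, every detected BLAS library is capped at one thread
    capped = [
        lib["num_threads"]
        for lib in threadpool_info()
        if lib["user_api"] == "blas"
    ]
    _ = a @ a  # this matrix product runs on a single BLAS thread

print(capped)
```

The cap applies to the current process and is lifted when the block exits, so it is best placed around the code that actually performs the linear algebra.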

Memory & dataset-size scaling

scikit-learn is in-RAM and CPU-only. Past ~10 GB of training data the answer is usually a different estimator (HistGradientBoosting, SGDClassifier with partial_fit) rather than a different library.

python

import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss", random_state=0)

# Stream batches without ever loading the full dataset
for _ in range(50):
    X_batch = rng.normal(size=(10_000, 100))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

print(model.coef_.shape)

Output: (1, 100) — the model trained on 500k synthetic rows in batches of 10k; constant memory regardless of total size

Out-of-core options:

partial_fit on SGDClassifier, SGDRegressor, MiniBatchKMeans, IncrementalPCA, several Naive Bayes variants — fit in batches.
HistGradientBoostingClassifier / Regressor — histogram binning makes large tabular ML feasible without leaving sklearn.
dask-ml — sklearn-shaped API parallelised across Dask workers.
cuML (RAPIDS) — sklearn API on NVIDIA GPUs; install separately from sklearn.
For very large feature matrices: HashingVectorizer instead of CountVectorizer / TfidfVectorizer to skip the vocabulary dict.

For genuinely huge problems, sklearn is the wrong tool — XGBoost / LightGBM / CatBoost for tabular at scale, PyTorch / JAX for neural nets.

Version migration guide

sklearn moves slower than pandas or numpy but its deprecations are strict — a FutureWarning for two minor versions, then removal. The recent breaks worth knowing:

1.2 → 1.3:

OneHotEncoder(sparse=...) is deprecated in favour of sparse_output= (introduced in 1.2); in 1.3 the old keyword still works but warns.
Some metrics keyword orderings tightened.
Default max_iter increased for several solvers.

1.3 → 1.4:

Removed sparse= (only sparse_output= works now).
set_output(transform="polars") added alongside the "pandas" option (available since 1.2), so transformers can return pandas or polars DataFrames with column names preserved.
feature_names_in_ enforced everywhere — fitting on a DataFrame and predicting on a numpy array now warns.

1.4 → 1.5:

class_weight="auto" is not accepted (use "balanced"); the old spelling predates the 1.x line but still appears in old tutorials.
partial_dependence(..., kind="legacy") removed.

1.5 → 1.6:

Full numpy 2.x compatibility.
Several model_selection.cross_validate returned-dict keys renamed.
Pipeline.set_output(transform="pandas") propagates to all transformers — easier debugging.
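Because removals are preceded by FutureWarning, one way to catch them before an upgrade is to escalate that warning to an error in tests. The sketch below uses only the standard library; legacy_call is a hypothetical stand-in for any code path that triggers a deprecation warning, not a scikit-learn function.

```python
import warnings


def legacy_call():
    # Hypothetical stand-in for a library call that uses a deprecated keyword
    warnings.warn("this keyword is deprecated", FutureWarning)
    return "ok"


with warnings.catch_warnings():
    # Escalate FutureWarning to an exception inside this block only
    warnings.simplefilter("error", category=FutureWarning)
    try:
        legacy_call()
        escalated = False
    except FutureWarning as exc:
        escalated = True
        print(f"caught: {exc}")
# prints: caught: this keyword is deprecated
```

The same effect is available for a whole test run with pytest -W error::FutureWarning, which fails any test that reaches a deprecated code path.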

python

import sklearn
print(sklearn.__version__)
sklearn.show_versions()

Output: the installed version, then a multi-line dump of dependent libraries (numpy, scipy, joblib, threadpoolctl) — useful to attach to bug reports

Cross-version pickle compatibility is NOT guaranteed. Models pickled with sklearn 1.3 may fail to load on 1.5 with an obscure AttributeError. The safest production patterns:

Pin sklearn in requirements.txt for any served model.
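Use skops.io.dump / skops.io.load: a security-focused alternative to joblib that avoids arbitrary pickle execution and refuses to load types it does not trust unless you explicitly allow them. It does not by itself guarantee cross-version compatibility.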
Export to ONNX via skl2onnx for true cross-version (and cross-language) inference. Loses the ability to retrain but the inference graph is stable.

Interop with adjacent ecosystems

scikit-learn lives at a hub in the ML ecosystem — its BaseEstimator API is the de-facto Python ML interface; many third-party libraries implement it for compatibility.

Library	Speaks sklearn API?	Notes
XGBoost	Yes (`XGBClassifier`, `XGBRegressor`)	drop-in for `GradientBoostingClassifier`
LightGBM	Yes (`LGBMClassifier`)	faster than sklearn's GBM, same API
CatBoost	Yes (`CatBoostClassifier`)	good with high-cardinality categoricals
imbalanced-learn	Yes — sklearn-compatible samplers	SMOTE, RandomUnderSampler, …
category_encoders	Yes — sklearn transformers	TargetEncoder, BinaryEncoder, …
optuna	Wraps sklearn estimators	smarter than `GridSearchCV`
shap	Consumes any sklearn model	model explanations
skops	Replaces joblib for serialisation	safer model persistence
cuML	Mimics sklearn API on GPU	NVIDIA-only
pandas	First-class input/output	`set_output(transform="pandas")` for DataFrame outputs

python

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import set_config

set_config(transform_output="pandas")   # transformers return DataFrames

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]})
pipe.fit(df, [0, 1, 0, 1])
print(pipe.named_steps["scale"].transform(df).head())

Output: a pandas DataFrame (not a numpy array) with column names preserved; set_config(transform_output="pandas"), available since 1.2, is the global way to keep DataFrames through a pipeline

Troubleshooting common errors

The errors below cover the recurring frictions; most are about preprocessing assumptions rather than the modeling layer itself.

ValueError: Input contains NaN — pipeline does not handle missing values. Add a SimpleImputer step before the model.
ValueError: Found unknown categories in column — OneHotEncoder saw a test-time category not present in fit. Set handle_unknown="ignore".
NotFittedError — you called predict before fit. Often because of a Pipeline where one step was replaced after-the-fact.
AttributeError loading an old pickle — sklearn version mismatch. Pin versions or switch to skops/ONNX.
UserWarning: X has feature names, but <Estimator> was fitted without feature names: fitted on numpy, predicted on DataFrame (the reverse mismatch produces a similar warning). Be consistent throughout.
GridSearch is mysteriously slow — BLAS threads + joblib stacking. Use threadpool_limits(limits=1) inside the search context.
SVC hangs on big data: kernel SVM fit time grows at least quadratically with the number of rows. Use LinearSVC for n > 10k or switch to a tree ensemble.
StratifiedKFold complains about class counts — too few samples per class for the requested fold count. Reduce n_splits or rebalance.
r2_score returns negative numbers — your model is worse than predicting the mean. Inspect data quality first.
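pipeline.fit is fast but pipeline.predict is slow: in serving this is usually per-call overhead (input validation, DataFrame construction) from predicting one row at a time, not the transformers themselves; OneHotEncoder does not re-learn categories at predict time. Batch rows into a single predict call where possible.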
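The first two fixes in the list (an imputer for NaN, handle_unknown="ignore" for unseen categories) can be sketched together. The toy data and column names below are hypothetical, and median imputation is just one reasonable assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric column with a gap, one categorical column
train = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, 23.0, 39.0],
    "region": ["NA", "EU", "NA", "APAC", "EU", "NA"],
})
y = [0, 0, 1, 1, 0, 1]

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fills NaN before scaling
    ("scale", StandardScaler()),
])
pre = ColumnTransformer([
    ("num", numeric, ["age"]),
    # An unseen category is encoded as all zeros instead of raising
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(train, y)

# Test-time row with a missing value and a category never seen in fit
test = pd.DataFrame({"age": [np.nan], "region": ["LATAM"]})
pred = model.predict(test)
print(pred)
```

Without the imputer the fit would stop with the NaN error, and without handle_unknown="ignore" the predict call would stop on the "LATAM" row.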

When NOT to use this

sklearn is the default for tabular classical ML; the cases below are where another tool wins.

Deep learning: PyTorch, JAX, TensorFlow. sklearn has MLPClassifier but it's a teaching tool, not a production neural-net stack.
Massive tabular ML: XGBoost / LightGBM / CatBoost. Faster and usually more accurate than sklearn's GBM.
AutoML: auto-sklearn, FLAML, PyCaret, H2O. sklearn provides the estimators; AutoML wraps the search.
Statistical inference (p-values, confidence intervals): statsmodels. sklearn focuses on prediction, not inference.
GPU inference: cuML (RAPIDS) for NVIDIA, ONNX runtime for cross-platform.
Online learning at scale: River for streaming-first ML; sklearn's partial_fit is batch-shaped.
Time-series-specific models: sktime, darts, statsforecast for richer time-series support than sklearn alone.
