cheat sheet

scikit-learn

Package-level reference for scikit-learn — install, versioning, extras, and gotchas. The de-facto classical-ML library on PyPI.

What it is

scikit-learn (imported as sklearn, but distributed on PyPI as scikit-learn) is the canonical classical-ML library for Python — every estimator behind the same tiny fit / predict / transform / score API. It covers linear and logistic regression, tree ensembles, gradient boosting, SVMs, K-means, PCA, model selection, preprocessing, and pipelines. It is built on numpy + scipy + joblib and maintained by a large open-source community led by INRIA.

On PyPI scikit-learn is the most-installed ML library by download volume — used everywhere from research notebooks to production batch jobs. Reach for it whenever your data is tabular and your problem is "fit a model to features"; reach for PyTorch or JAX instead for deep learning.

Install

bash
pip install scikit-learn

Output: (none — exits 0 on success)

bash
uv add scikit-learn

Output: dependency resolved, lockfile updated; pulls numpy + scipy + joblib + threadpoolctl

bash
poetry add scikit-learn

Output: installed into the project venv

bash
pip install "scikit-learn[examples]"

Output: installs scikit-learn plus matplotlib / pandas / seaborn for the gallery examples

Note: the PyPI distribution name is scikit-learn but the import name is sklearn. Running pip install sklearn used to pull a deprecated placeholder package and now fails outright, so always install scikit-learn.
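
The distribution-name versus import-name split can be checked from Python itself. The following is a minimal sketch, assuming scikit-learn is installed in the active environment; installed_version is a hypothetical helper written for this example, not part of sklearn.

```python
from importlib.metadata import PackageNotFoundError, version

import sklearn  # the import name


def installed_version(dist_name):
    """Return the installed version of a PyPI distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# The module is imported as `sklearn`, but the metadata lives under `scikit-learn`.
print("import name  sklearn      ->", sklearn.__version__)
print("distribution scikit-learn ->", installed_version("scikit-learn"))
```

Looking up the distribution name is the form to use in requirements files and dependency checks; the import name only matters inside code.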

Versioning & Python support

scikit-learn follows the SPEC 0 Python support window (latest three minor versions) and uses SemVer-style minor releases roughly twice a year. Deprecations are warned for two minor cycles before removal.
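
Because deprecations arrive as warnings before removal, one common way to catch them early is to promote FutureWarning to an error in tests. The sketch below uses only the standard library; legacy_call is a hypothetical stand-in for any library call that still passes a deprecated keyword.

```python
import warnings


def legacy_call():
    # Hypothetical stand-in for code that still uses a deprecated keyword.
    warnings.warn("`old_kw` is deprecated; use `new_kw` instead", FutureWarning)
    return "ok"


def run_strict(fn):
    """Run fn with FutureWarning promoted to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=FutureWarning)
        return fn()


try:
    run_strict(legacy_call)
    caught = None
except FutureWarning as exc:
    caught = str(exc)

print(caught)
```

Running the test suite this way on each upgrade surfaces deprecated usage while the old spelling still works, rather than after it has been removed.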

sklearn line | Python support | NumPy / SciPy floor
1.3.x | 3.8 – 3.12 | numpy >= 1.17
1.4.x | 3.9 – 3.12 | numpy >= 1.19
1.5.x | 3.9 – 3.13 | numpy >= 1.19
1.6.x+ | 3.10 – 3.13 | numpy 2.x supported

The 1.x line has been stable for several years — there is no announced 2.0 break as of late 2025.

Package metadata

  • Maintainer: scikit-learn consortium under INRIA Foundation, plus broad community
  • Project home: github.com/scikit-learn/scikit-learn
  • Docs: scikit-learn.org/stable
  • License: BSD-3-Clause
  • PyPI: pypi.org/project/scikit-learn
  • Governance: Technical Committee + Maintainers; SLEP (scikit-learn Enhancement Proposals)
  • First released: 2010 (the project began in 2007 as a Google Summer of Code project by David Cournapeau)
  • Downloads: > 80 M / month on PyPI as of late 2025

Optional dependencies & extras

scikit-learn's required deps are minimal — numpy, scipy, joblib, threadpoolctl. Extras and common companion packages cover the surrounding ecosystem:

bash
pip install scikit-learn pandas matplotlib seaborn jupyter joblib pyarrow

Output: installs the canonical modelling stack

Companion | Use
pandas | feature DataFrames in / out of pipelines
matplotlib + seaborn | plotting confusion matrices, ROC curves
joblib | model persistence (joblib.dump), parallelism backend
imbalanced-learn | over-/under-sampling, SMOTE; sklearn-compatible API
category_encoders | richer categorical encoders than OneHotEncoder
optuna / scikit-optimize | hyperparameter search beyond GridSearchCV
xgboost / lightgbm / catboost | gradient-boosted trees with the sklearn estimator API
shap | model explanations on top of any sklearn estimator

There are also scikit-learn[tests] and scikit-learn[docs] extras, but these are aimed at contributors, not end users.

Alternatives

Package | One-line trade-off
xgboost / lightgbm / catboost | better than sklearn's GBM on most tabular benchmarks
statsmodels | statistical inference (p-values, R-style formulas) rather than prediction
pycaret | high-level AutoML wrapper around sklearn
h2o.ai | distributed AutoML; JVM dependency
cuML (RAPIDS) | GPU-accelerated sklearn API, NVIDIA-only
PyTorch / JAX / TensorFlow | deep learning; wrong tool for purely tabular problems
flaml / auto-sklearn | AutoML on top of sklearn

Common gotchas

  • Not GPU-accelerated. sklearn is CPU-only by design. For NVIDIA GPUs use cuML; for everyone else, sklearn's BLAS-backed estimators are already fast on modern CPUs.
  • Distribution name vs import name. Install with scikit-learn, import as sklearn. The pre-2023 placeholder sklearn package on PyPI is now a hard error to install — good.
  • Large memory for kernel methods. SVC(kernel="rbf"), GaussianProcessRegressor, and KernelRidge build an N×N Gram matrix. 50k rows ⇒ 20 GB RAM. Use linear models or tree ensembles past a few thousand rows.
  • n_jobs=-1 does not mean "best parallelism". sklearn uses joblib + loky; the underlying BLAS (OpenBLAS, MKL) also parallelises. Stacking the two oversubscribes cores. Use threadpoolctl to cap BLAS threads, or set n_jobs=1 for individual estimators inside a GridSearchCV(n_jobs=-1).
  • Random-state plumbing. Reproducibility requires random_state= on every estimator that has the parameter; default is None, which uses the global numpy RNG and is not reproducible across runs.
  • Model persistence: prefer joblib.dump / joblib.load over the pickle module. Both work, but joblib serialises the large numpy arrays inside fitted estimators more efficiently. Neither format is safe to load from an untrusted source.
  • Cross-version model loading is not guaranteed. Models saved with sklearn 1.3 may not load in 1.5. Pin the version in requirements.txt for any served model, or convert to ONNX for portability.
  • OneHotEncoder(sparse=...) was renamed to sparse_output= in 1.2 and the old keyword was removed in 1.4. Old tutorials still use the deprecated name.
  • fit_transform on test data is a leak. Use fit_transform(X_train) then transform(X_test). Pipeline enforces this, so prefer Pipelines over loose transform calls.
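
The kernel-method memory figure above follows from simple arithmetic: a dense N×N matrix of 8-byte floats. A small sketch to reproduce it (gram_matrix_gb is a hypothetical helper, and it counts only the matrix itself, not solver workspace):

```python
def gram_matrix_gb(n_rows, bytes_per_value=8):
    """Memory for a dense n_rows x n_rows float64 matrix, in GB (10^9 bytes)."""
    return n_rows * n_rows * bytes_per_value / 1e9


for n in (5_000, 50_000, 200_000):
    print(f"{n:>7,} rows -> {gram_matrix_gb(n):,.1f} GB")

# prints:
#   5,000 rows -> 0.2 GB
#  50,000 rows -> 20.0 GB
# 200,000 rows -> 320.0 GB
```

Memory grows with the square of the row count, so a 10x larger dataset needs 100x the RAM.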

Real-world recipes

scikit-learn is the ML library that almost every team starts with; the recipes below are the packaging-level patterns — pipeline shape, persistence choice, parallelism — that you'd reach for in production. The companion sections/python/scikit-learn covers the estimator API depth.

Tabular pipeline with ColumnTransformer — the canonical sklearn pattern. Every preprocessing step is encapsulated; the whole pipeline pickles as a single object.

python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "age": [25, 32, 47, 51, 23, 39, 60],
    "income": [40_000, 70_000, 120_000, 90_000, 35_000, 80_000, 150_000],
    "region": ["NA", "EU", "NA", "APAC", "EU", "NA", "EU"],
    "churned": [0, 0, 1, 0, 1, 0, 1],
})

X = df.drop(columns=["churned"])
y = df["churned"]
# stratify=y keeps both classes in the tiny training split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

num_cols = ["age", "income"]
cat_cols = ["region"]
pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([
    ("pre", pre),
    ("clf", GradientBoostingClassifier(random_state=0)),
])
model.fit(X_train, y_train)
print(f"score: {model.score(X_test, y_test):.3f}")

Output: the test accuracy; every transformation is fitted on train and applied to test — no leakage

Cross-validated hyperparameter search:

python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]},
    cv=cv,
    n_jobs=-1,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")

Output: the best parameter combination and its mean ROC-AUC across folds; n_jobs=-1 parallelises fits across cores via joblib

Model persistence — production-shaped:

python
import joblib
import pandas as pd
from sklearn import __version__ as sk_version

# After training as above ...
artifact = {
    "model": model,
    "sklearn_version": sk_version,
    "training_data_schema": {"age": "int64", "income": "int64", "region": "category"},
}
joblib.dump(artifact, "churn-v1.joblib", compress=3)

# Loading the model at serve time
loaded = joblib.load("churn-v1.joblib")
if loaded["sklearn_version"] != sk_version:
    raise RuntimeError("sklearn version mismatch: re-train or pin the training version")

# The ColumnTransformer selects columns by name, so predict needs a DataFrame
record = pd.DataFrame([{"age": 30, "income": 60_000, "region": "EU"}])
print(loaded["model"].predict(record))

Output: the prediction for a synthetic record; the artifact carries enough metadata to detect version drift before it produces wrong answers

Custom estimator that plugs into Pipelines:

python
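import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.utils.estimator_checks import check_estimator
from sklearn.utils.validation import check_array, check_is_fitted

# Mixins go first and BaseEstimator last, the order sklearn's own estimators use
class WinsorizeTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # check_array rejects NaN/inf, 1-D, empty, and sparse input
        X = check_array(X)
        self.lower_ = np.quantile(X, self.lower, axis=0)
        self.upper_ = np.quantile(X, self.upper, axis=0)
        self.n_features_in_ = X.shape[1]
        return self

    def transform(self, X):
        check_is_fitted(self)
        X = check_array(X)
        if X.shape[1] != self.n_features_in_:
            raise ValueError(
                f"X has {X.shape[1]} features, but {type(self).__name__} "
                f"is expecting {self.n_features_in_} features as input."
            )
        return np.clip(X, self.lower_, self.upper_)

# Raises on the first failed check; the exact check suite varies by sklearn version
check_estimator(WinsorizeTransformer())
print("estimator passes sklearn checks")

Output: estimator passes sklearn checks, provided every check in the installed version succeeds. check_estimator is the conformance test for custom estimators; passing it means the class composes correctly with Pipelines and GridSearchCV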
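
What composing with Pipelines buys you in practice can be sketched as follows. QuantileClipper is a hypothetical, trimmed-down clipper redefined here so the block is self-contained, and the data is synthetic; treat it as an illustration rather than a reference implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.utils.validation import check_array, check_is_fitted


class QuantileClipper(TransformerMixin, BaseEstimator):
    """Hypothetical minimal transformer: clip each column to fitted quantiles."""

    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        X = check_array(X)
        self.lower_ = np.quantile(X, self.lower, axis=0)
        self.upper_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self)
        return np.clip(check_array(X), self.lower_, self.upper_)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
X[::25] *= 50  # inject a few extreme rows
y = (X[:, 0] > 0).astype(int)

pipe = Pipeline([
    ("clip", QuantileClipper(lower=0.05, upper=0.95)),
    ("clf", LogisticRegression()),
])

# cross_val_score clones the pipeline per fold, so the quantiles are
# re-fitted on each training fold only.
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.shape)

# Constructor arguments are exposed as step__param names for grid searches.
print(pipe.get_params()["clip__lower"])
```

Storing constructor arguments unchanged on self is what makes cloning and the clip__lower style of parameter addressing work.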

Performance tuning

scikit-learn performance is dominated by BLAS (for the linear-algebra-heavy estimators) and joblib parallelism (for the embarrassingly-parallel ones). Both can be over-tuned in ways that slow things down — over-subscribing CPU cores is the most common mistake.

python
from threadpoolctl import threadpool_info
import sklearn

sklearn.show_versions()
print(threadpool_info())

Output: the BLAS backend and current thread caps — sklearn's show_versions() prints the BLAS info; threadpool_info() prints what's active in this process

Tuning levers, ordered by impact:

Lever | Mechanism | When it matters
BLAS thread cap | threadpool_limits(limits=1) inside parallel CV | when n_jobs > 1 and BLAS oversubscribes
n_jobs=-1 on outer search, n_jobs=1 on estimator | one layer of parallelism, not two | GridSearchCV(n_jobs=-1) over RandomForest
algorithm="ball_tree" / "kd_tree" on neighbours | spatial index | k-NN in low to moderate dimensions (tree indexes degrade toward brute force in high dimensions)
partial_fit for incremental fitting | streaming-fit estimators | data > RAM (SGDClassifier, MiniBatchKMeans)
HistGradientBoostingClassifier over GradientBoostingClassifier | histogram binning + native parallelism | large tabular ML
Feature slimming (lower n_features_in_) | drop columns before fit | wide feature matrices
Sparse input (scipy.sparse) | skip zero ops | high-dim TF-IDF / categorical

The parallelism stacking trap:

python
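from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from threadpoolctl import threadpool_limits

# Trap: both layers ask for every core, so the machine is oversubscribed
greedy = GridSearchCV(RandomForestClassifier(n_jobs=-1), {"max_depth": [3, 5]}, n_jobs=-1)

# Fix: parallelise the outer search only; keep each forest single-threaded
rf = RandomForestClassifier(n_jobs=1, random_state=0)
search = GridSearchCV(rf, {"max_depth": [3, 5]}, n_jobs=-1)

X, y = load_breast_cancer(return_X_y=True)
with threadpool_limits(limits=1, user_api="blas"):  # caps BLAS threads in this process
    search.fit(X, y)

Output: nothing is printed. If you fit the greedy configuration instead and watch htop, expect more runnable threads than cores, with much of the CPU time going to context switching rather than useful work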

The rule of thumb: pick one layer to parallelise. GridSearchCV(n_jobs=-1) with RandomForestClassifier(n_jobs=1) is usually faster than the reverse on a single machine.

Memory & dataset-size scaling

scikit-learn is in-RAM and CPU-only. Past ~10 GB of training data the answer is usually a different estimator (HistGradientBoosting, SGDClassifier with partial_fit) rather than a different library.

python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss", random_state=0)

# Stream batches without ever loading the full dataset
for _ in range(50):
    X_batch = rng.normal(size=(10_000, 100))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))

print(model.coef_.shape)

Output: (1, 100) — the model trained on 500k synthetic rows in batches of 10k; constant memory regardless of total size

Out-of-core options:

  • partial_fit on SGDClassifier, SGDRegressor, MiniBatchKMeans, IncrementalPCA, several Naive Bayes variants — fit in batches.
  • HistGradientBoostingClassifier / Regressor — histogram binning makes large tabular ML feasible without leaving sklearn.
  • dask-ml — sklearn-shaped API parallelised across Dask workers.
  • cuML (RAPIDS) — sklearn API on NVIDIA GPUs; install separately from sklearn.
  • For very large feature matrices: HashingVectorizer instead of CountVectorizer / TfidfVectorizer to skip the vocabulary dict.

For genuinely huge problems, sklearn is the wrong tool — XGBoost / LightGBM / CatBoost for tabular at scale, PyTorch / JAX for neural nets.

Version migration guide

sklearn moves slower than pandas or numpy but its deprecations are strict — a FutureWarning for two minor versions, then removal. The recent breaks worth knowing:

1.2 → 1.3:

  • OneHotEncoder(sparse=...) → sparse_output= (the new keyword arrived in 1.2; the old one still works here but emits a FutureWarning).
  • Some metrics keyword orderings tightened.
  • Default max_iter increased for several solvers.

1.3 → 1.4:

  • Removed sparse= (only sparse_output= works now).
  • set_output(transform="polars") added, extending the set_output API introduced in 1.2: transformers can return pandas or polars DataFrames with column names preserved.
  • feature_names_in_ enforced everywhere — fitting on a DataFrame and predicting on a numpy array now warns.

1.4 → 1.5:

  • class_weight="auto" is no longer accepted on several estimators (use "balanced").
  • partial_dependence(..., kind="legacy") removed.

1.5 → 1.6:

  • Full numpy 2.x compatibility.
  • Several model_selection.cross_validate returned-dict keys renamed.
  • Pipeline.set_output(transform="pandas") propagates to all transformers, which makes debugging easier.

Checking what is installed before and after an upgrade:

python
import sklearn
print(sklearn.__version__)
sklearn.show_versions()

Output: the installed version, then a multi-line dump of dependent libraries (numpy, scipy, joblib, threadpoolctl) — useful to attach to bug reports
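
Code that has to run on both sides of the sparse= to sparse_output= rename can pick the keyword at runtime. This is one possible approach, sketched under the assumption that only this keyword differs; make_dense_encoder is a hypothetical helper.

```python
import inspect

import numpy as np
from sklearn.preprocessing import OneHotEncoder


def make_dense_encoder(**kwargs):
    """Build a dense-output OneHotEncoder using whichever keyword exists."""
    params = inspect.signature(OneHotEncoder.__init__).parameters
    dense_kw = "sparse_output" if "sparse_output" in params else "sparse"
    return OneHotEncoder(**{dense_kw: False}, **kwargs)


enc = make_dense_encoder(handle_unknown="ignore")
out = enc.fit_transform(np.array([["EU"], ["NA"], ["EU"]]))
print(type(out).__name__, out.shape)  # prints: ndarray (3, 2)
```

Inspecting the signature avoids parsing version strings, at the cost of one small shim that can be deleted once the oldest supported version is past the rename.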

Cross-version pickle compatibility is NOT guaranteed. Models pickled with sklearn 1.3 may fail to load on 1.5 with an obscure AttributeError. The safest production patterns:

  • Pin sklearn in requirements.txt for any served model.
  • Use skops.io.dump / skops.io.load: a security-focused alternative to joblib that refuses to load untrusted object types unless you explicitly allow them.
  • Export to ONNX via skl2onnx for true cross-version (and cross-language) inference. Loses the ability to retrain but the inference graph is stable.

Interop with adjacent ecosystems

scikit-learn lives at a hub in the ML ecosystem — its BaseEstimator API is the de-facto Python ML interface; many third-party libraries implement it for compatibility.

Library | Speaks sklearn API? | Notes
XGBoost | Yes (XGBClassifier, XGBRegressor) | drop-in for GradientBoostingClassifier
LightGBM | Yes (LGBMClassifier) | faster than sklearn's GBM, same API
CatBoost | Yes (CatBoostClassifier) | good with high-cardinality categoricals
imbalanced-learn | Yes, sklearn-compatible samplers | SMOTE, RandomUnderSampler, …
category_encoders | Yes, sklearn transformers | TargetEncoder, BinaryEncoder, …
optuna | Wraps sklearn estimators | smarter than GridSearchCV
shap | Consumes any sklearn model | model explanations
skops | Replaces joblib for serialisation | safer model persistence
cuML | Mimics sklearn API on GPU | NVIDIA-only
pandas | First-class input/output | set_output(transform="pandas") for DataFrame outputs

DataFrame output through a pipeline:

python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import set_config

set_config(transform_output="pandas")   # transformers return DataFrames

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]})
pipe.fit(df, [0, 1, 0, 1])
print(pipe.named_steps["scale"].transform(df).head())

Output: a pandas DataFrame (not a numpy array) with column names preserved; set_config(transform_output="pandas"), available since 1.2, is the global way to keep DataFrames through a pipeline

Troubleshooting common errors

The errors below cover the recurring frictions; most are about preprocessing assumptions rather than the modeling layer itself.

  • ValueError: Input contains NaN — pipeline does not handle missing values. Add a SimpleImputer step before the model.
  • ValueError: Found unknown categories [...] in column N during transform (OneHotEncoder saw a category at transform time that was absent at fit time). Set handle_unknown="ignore".
  • NotFittedError — you called predict before fit. Often because of a Pipeline where one step was replaced after-the-fact.
  • AttributeError loading an old pickle — sklearn version mismatch. Pin versions or switch to skops/ONNX.
  • UserWarning: X has feature names, but <Estimator> was fitted without feature names (fitted on a numpy array, predicted on a DataFrame). Use the same input type throughout.
  • GridSearch is mysteriously slow — BLAS threads + joblib stacking. Use threadpool_limits(limits=1) inside the search context.
  • SVC hangs on big data — kernel SVM builds an N×N Gram matrix. Use LinearSVC for n > 10k or switch to a tree ensemble.
  • StratifiedKFold complains about class counts — too few samples per class for the requested fold count. Reduce n_splits or rebalance.
  • r2_score returns negative numbers — your model is worse than predicting the mean. Inspect data quality first.
  • pipeline.fit is fast but pipeline.predict is slow: check for a lazy learner such as KNeighborsClassifier, which defers most of its work to predict time, and for predict being called one row at a time instead of on a batch.
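
The first two fixes in the list, an imputer for NaN and handle_unknown="ignore" for unseen categories, can be sketched together. The column names and toy values below are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 51.0, 23.0, 39.0],  # contains a missing value
    "region": ["US", "EU", "US", "EU", "EU", "US"],
})
y = [0, 0, 1, 1, 0, 1]

pre = ColumnTransformer([
    # Fill missing ages with the training median instead of failing on NaN.
    ("num", SimpleImputer(strategy="median"), ["age"]),
    # Unseen categories encode as all zeros instead of raising at predict time.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(train, y)

# A record with a missing age and a region never seen during fit.
new = pd.DataFrame({"age": [np.nan], "region": ["APAC"]})
pred = model.predict(new)
print(pred)
```

Both guards live inside the pipeline, so they are fitted on training data only and applied identically at predict time.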

When NOT to use this

sklearn is the default for tabular classical ML; the cases below are where another tool wins.

  • Deep learning: PyTorch, JAX, TensorFlow. sklearn has MLPClassifier but it's a teaching tool, not a production neural-net stack.
  • Massive tabular ML: XGBoost / LightGBM / CatBoost. Faster and usually more accurate than sklearn's GBM.
  • AutoML: auto-sklearn, FLAML, PyCaret, H2O. sklearn provides the estimators; AutoML wraps the search.
  • Statistical inference (p-values, confidence intervals): statsmodels. sklearn focuses on prediction, not inference.
  • GPU inference: cuML (RAPIDS) for NVIDIA, ONNX runtime for cross-platform.
  • Online learning at scale: River for streaming-first ML; sklearn's partial_fit is batch-shaped.
  • Time-series-specific models: sktime, darts, statsforecast for richer time-series support than sklearn alone.

See also