cheat sheet
scikit-learn
Package-level reference for scikit-learn — install, versioning, extras, and gotchas. The de-facto classical-ML library on PyPI.
What it is
scikit-learn (imported as sklearn, but distributed on PyPI as scikit-learn) is the canonical classical-ML library for Python — every estimator behind the same tiny fit / predict / transform / score API. It covers linear and logistic regression, tree ensembles, gradient boosting, SVMs, K-means, PCA, model selection, preprocessing, and pipelines. It is built on numpy + scipy + joblib and maintained by a large open-source community led by INRIA.
On PyPI scikit-learn is the most-installed ML library by download volume — used everywhere from research notebooks to production batch jobs. Reach for it whenever your data is tabular and your problem is "fit a model to features"; reach for PyTorch or JAX instead for deep learning.
Install
pip install scikit-learn
Output: pip's download and install log, ending with a "Successfully installed ..." line (exits 0 on success)
uv add scikit-learn
Output: dependency resolved, lockfile updated; pulls numpy + scipy + joblib + threadpoolctl
poetry add scikit-learn
Output: installed into the project venv
pip install "scikit-learn[examples]"
Output: installs scikit-learn plus matplotlib / pandas / seaborn for the gallery examples
Note: the PyPI distribution name is scikit-learn but the import name is sklearn. Running pip install sklearn (no hyphen) historically resolved to a deprecated placeholder package, and that placeholder now fails to install. Always use scikit-learn.
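The two names can be seen side by side with a minimal sketch, assuming scikit-learn is installed in the active environment. importlib.metadata looks packages up by distribution name, while the import statement uses the module name; the variable names here are illustrative only.

```python
from importlib.metadata import PackageNotFoundError, version

import sklearn  # import name

# The distribution name is what pip and importlib.metadata know about.
dist_version = version("scikit-learn")
print("distribution scikit-learn:", dist_version)
print("module sklearn:", sklearn.__version__)

# Looking up the import name as a distribution normally fails,
# because no distribution called "sklearn" is installed.
try:
    version("sklearn")
    sklearn_dist_installed = True
except PackageNotFoundError:
    sklearn_dist_installed = False
print("distribution sklearn installed:", sklearn_dist_installed)
```

The same distinction applies to requirements files and lockfiles: they list scikit-learn, never sklearn.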
Versioning & Python support
scikit-learn supports a rolling window of recent CPython releases (the table below lists the range per release line) and ships SemVer-style minor releases roughly twice a year. Deprecated behaviour emits a FutureWarning for two minor releases before it is removed.
| sklearn line | Python support | NumPy / SciPy floor |
|---|---|---|
| 1.3.x | 3.8 – 3.12 | numpy >= 1.17 |
| 1.4.x | 3.9 – 3.12 | numpy >= 1.19 |
| 1.5.x | 3.9 – 3.13 | numpy >= 1.19 |
| 1.6.x | 3.9 – 3.13 | numpy 2.x supported |
| 1.7.x | 3.10 – 3.13 | numpy 2.x supported |
The 1.x line has been stable for several years — there is no announced 2.0 break as of late 2025.
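Because deprecations surface as FutureWarning before removal, one way to catch them early is to promote that warning to an error in tests. The sketch below uses only the standard library; run_strict and uses_deprecated_api are hypothetical helpers, and the warning is raised by hand to stand in for a deprecated sklearn keyword.

```python
import warnings


def run_strict(func, *args, **kwargs):
    """Call func with every FutureWarning promoted to an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=FutureWarning)
        return func(*args, **kwargs)


def uses_deprecated_api():
    # Hypothetical stand-in for a call that passes a deprecated keyword.
    warnings.warn("this keyword will be removed in two releases", FutureWarning)
    return "ok"


try:
    run_strict(uses_deprecated_api)
    caught = None
except FutureWarning as exc:
    caught = str(exc)

print(caught)
```

The same effect is available process-wide with python -W error::FutureWarning, which is often simpler in CI.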
Package metadata
- Maintainer: scikit-learn consortium under INRIA Foundation, plus broad community
- Project home: github.com/scikit-learn/scikit-learn
- Docs: scikit-learn.org/stable
- License: BSD-3-Clause
- PyPI: pypi.org/project/scikit-learn
- Governance: Technical Committee + Maintainers; SLEP (scikit-learn Enhancement Proposals)
- First public release: 2010 (the project began in 2007 as a Google Summer of Code project by David Cournapeau)
- Downloads: > 80 M / month on PyPI as of late 2025
Optional dependencies & extras
scikit-learn's required deps are minimal — numpy, scipy, joblib, threadpoolctl. Extras and common companion packages cover the surrounding ecosystem:
pip install scikit-learn pandas matplotlib seaborn jupyter joblib pyarrow
Output: installs the canonical modelling stack
| Companion | Use |
|---|---|
| pandas | feature DataFrames in / out of pipelines |
| matplotlib + seaborn | plotting confusion matrices, ROC curves |
| joblib | model persistence (joblib.dump), parallelism backend |
| imbalanced-learn | over-/under-sampling, SMOTE — sklearn-compatible API |
| category_encoders | richer categorical encoders than OneHotEncoder |
| optuna / scikit-optimize | hyperparameter search beyond GridSearchCV |
| xgboost / lightgbm / catboost | gradient-boosted trees with the sklearn estimator API |
| shap | model explanations on top of any sklearn estimator |
There are also scikit-learn[tests] and scikit-learn[docs] extras, but these are aimed at contributors, not end users.
Alternatives
| Package | One-line trade-off |
|---|---|
| xgboost / lightgbm / catboost | better than sklearn's GBM on most tabular benchmarks |
| statsmodels | statistical inference (p-values, R-style formulas) rather than prediction |
| pycaret | high-level AutoML wrapper around sklearn |
| h2o.ai | distributed AutoML — JVM dependency |
| cuML (RAPIDS) | GPU-accelerated sklearn API, NVIDIA-only |
| PyTorch / JAX / TensorFlow | deep learning — wrong tool for purely tabular problems |
| flaml / autosklearn | AutoML on top of sklearn |
Common gotchas
- Not GPU-accelerated. sklearn targets the CPU by design (experimental Array API support covers only a handful of estimators). For NVIDIA GPUs use cuML; for everyone else, sklearn's BLAS-backed estimators are already fast on modern CPUs.
- Distribution name vs import name. Install with scikit-learn, import as sklearn. The pre-2023 placeholder sklearn package on PyPI is now a hard error to install, which is a good thing.
- Large memory for kernel methods. GaussianProcessRegressor and KernelRidge build an N×N Gram matrix: 50k rows in float64 ⇒ 20 GB RAM. SVC(kernel="rbf") keeps only a kernel cache, but its fit time still grows at least quadratically with N. Use linear models or tree ensembles past a few thousand rows.
- n_jobs=-1 does not mean "best parallelism". sklearn uses joblib + loky; the underlying BLAS (OpenBLAS, MKL) also parallelises. Stacking the two oversubscribes cores. Use threadpoolctl to cap BLAS threads, or set n_jobs=1 for individual estimators inside a GridSearchCV(n_jobs=-1).
- Random-state plumbing. Reproducibility requires random_state= on every estimator that has the parameter; the default is None, which uses the global numpy RNG and is not reproducible across runs.
- Model persistence is joblib.dump / joblib.load, not pickle directly. joblib is built on pickle but stores the large numpy arrays inside fitted estimators more efficiently; as with pickle, only load files you trust.
- Cross-version model loading is not guaranteed. Models saved with sklearn 1.3 may not load in 1.5. Pin the version in requirements.txt for any served model, or convert to ONNX for portability.
- OneHotEncoder(sparse=...) was renamed to sparse_output= in 1.2 and the old keyword was removed in 1.4. Old tutorials still use the deprecated name.
- fit_transform on test data is a leak. Use fit_transform(X_train) then transform(X_test). Pipeline enforces this, so prefer Pipelines over loose transform calls.
Real-world recipes
scikit-learn is the ML library that almost every team starts with; the recipes below are the packaging-level patterns — pipeline shape, persistence choice, parallelism — that you'd reach for in production. The companion sections/python/scikit-learn covers the estimator API depth.
Tabular pipeline with ColumnTransformer — the canonical sklearn pattern. Every preprocessing step is encapsulated; the whole pipeline pickles as a single object.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 23, 39, 60],
    "income": [40_000, 70_000, 120_000, 90_000, 35_000, 80_000, 150_000],
    "region": ["NA", "EU", "NA", "APAC", "EU", "NA", "EU"],
    "churned": [0, 0, 1, 0, 1, 0, 1],
})
X = df.drop(columns=["churned"])
y = df["churned"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
num_cols = ["age", "income"]
cat_cols = ["region"]
pre = ColumnTransformer([
    ("num", StandardScaler(), num_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])
model = Pipeline([
    ("pre", pre),
    ("clf", GradientBoostingClassifier(random_state=0)),
])
model.fit(X_train, y_train)
print(f"score: {model.score(X_test, y_test):.3f}")
Output: the test accuracy; every transformation is fitted on train and applied to test — no leakage
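What "no leakage" means for a single transformer can be sketched in a few lines. The toy arrays below are made up for illustration; the point is only which rows the scaler's statistics come from.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

# Correct: statistics come from the training rows only.
scaler = StandardScaler().fit(X_train)
train_only_mean = float(scaler.mean_[0])

# Leaky: the test row shifts the mean the features are scaled with.
leaky = StandardScaler().fit(np.vstack([X_train, X_test]))
leaky_mean = float(leaky.mean_[0])

print(train_only_mean, leaky_mean)  # 2.0 4.0
```

Inside a Pipeline the first pattern is what fit does automatically, including within each cross-validation fold.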
Cross-validated hyperparameter search:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]},
    cv=cv,
    n_jobs=-1,
    scoring="roc_auc",
)
search.fit(X, y)
print(search.best_params_, f"{search.best_score_:.3f}")
Output: the best parameter combination and its mean ROC-AUC across folds; n_jobs=-1 parallelises fits across cores via joblib
Model persistence — production-shaped:
import joblib
import pandas as pd
from sklearn import __version__ as sk_version
# After training the pipeline above as `model` ...
artifact = {
    "model": model,
    "sklearn_version": sk_version,
    "training_data_schema": {"age": "int64", "income": "int64", "region": "category"},
}
joblib.dump(artifact, "churn-v1.joblib", compress=3)
# Loading the model at serve time
loaded = joblib.load("churn-v1.joblib")
assert loaded["sklearn_version"] == sk_version, "version mismatch: re-train"
# The ColumnTransformer selects columns by name, so predict needs a DataFrame
record = pd.DataFrame([{"age": 30, "income": 60_000, "region": "EU"}])
print(loaded["model"].predict(record))
Output: the prediction for a synthetic record; the artifact carries enough metadata to detect version drift before it produces wrong answers
Custom estimator that plugs into Pipelines:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
# validate_data is the public validation helper in sklearn >= 1.6
from sklearn.utils.validation import check_is_fitted, validate_data

# Mixins go to the left of BaseEstimator so their estimator tags are applied
class WinsorizeTransformer(TransformerMixin, BaseEstimator):
    def __init__(self, lower=0.01, upper=0.99):
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Rejects NaN/inf and 1-D input; records n_features_in_ and feature_names_in_
        X = validate_data(self, X, reset=True)
        self.lower_ = np.quantile(X, self.lower, axis=0)
        self.upper_ = np.quantile(X, self.upper, axis=0)
        return self

    def transform(self, X):
        check_is_fitted(self)
        # reset=False checks the input against what fit saw
        X = validate_data(self, X, reset=False)
        return np.clip(X, self.lower_, self.upper_)

from sklearn.utils.estimator_checks import check_estimator
check_estimator(WinsorizeTransformer())
print("estimator passes sklearn checks")
Output: estimator passes sklearn checks, printed only when no conformance check raises. check_estimator is the conformance test for custom estimators; it expects proper input validation (hence validate_data rather than a bare np.asarray), and passing it means the class composes correctly with Pipelines and GridSearchCV
Performance tuning
scikit-learn performance is dominated by BLAS (for the linear-algebra-heavy estimators) and joblib parallelism (for the embarrassingly-parallel ones). Both can be over-tuned in ways that slow things down — over-subscribing CPU cores is the most common mistake.
from threadpoolctl import threadpool_info
import sklearn
sklearn.show_versions()
print(threadpool_info())
Output: the BLAS backend and current thread caps — sklearn's show_versions() prints the BLAS info; threadpool_info() prints what's active in this process
Tuning levers, ordered by impact:
| Lever | Mechanism | When it matters |
|---|---|---|
| BLAS thread cap | threadpool_limits(limits=1) inside parallel CV | when n_jobs > 1 and BLAS oversubscribes |
| n_jobs=-1 on outer search, n_jobs=1 on estimator | one layer of parallelism, not two | GridSearchCV(n_jobs=-1) over RandomForest |
| algorithm="ball_tree" / "kd_tree" on neighbours | spatial index | low- to moderate-dimensional k-NN (brute force usually wins in high dimensions) |
| partial_fit for incremental fitting | streaming-fit estimators | data > RAM (SGDClassifier, MiniBatchKMeans) |
| HistGradientBoostingClassifier over GradientBoostingClassifier | histogram binning + native parallelism | large tabular ML |
| Feature slimming (smaller n_features_in_) | drop columns before fit | wide feature matrices |
| Sparse input (scipy.sparse) | skip zero ops | high-dim TF-IDF / categorical |
The parallelism stacking trap:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from threadpoolctl import threadpool_limits
with threadpool_limits(limits=1, user_api="blas"):
    rf = RandomForestClassifier(n_jobs=-1)  # uses all cores
    search = GridSearchCV(rf, {"max_depth": [3, 5]}, n_jobs=-1)
    # both layers want all cores -> oversubscription
Output: no return; but watch htop — without the threadpool_limits guard you'd see CPU at 1200% on an 8-core box, with most of that being context-switching overhead
The rule of thumb: pick one layer to parallelise. GridSearchCV(n_jobs=-1) with RandomForestClassifier(n_jobs=1) is usually faster than the reverse on a single machine.
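To confirm that a BLAS cap is actually in effect, threadpoolctl can be queried before, inside, and after the guard. This is a small sketch; blas_threads is a hypothetical helper, and the list is simply empty on platforms where threadpoolctl does not detect a BLAS library.

```python
import numpy as np  # importing numpy loads its BLAS library
from threadpoolctl import threadpool_info, threadpool_limits


def blas_threads():
    """Return the thread count of each BLAS library loaded in this process."""
    return [
        lib["num_threads"]
        for lib in threadpool_info()
        if lib["user_api"] == "blas"
    ]


before = blas_threads()
with threadpool_limits(limits=1, user_api="blas"):
    inside = blas_threads()
    # Any BLAS-backed call in this block runs with the capped thread count.
    _ = np.ones((200, 200)) @ np.ones((200, 200))
after = blas_threads()

print("before:", before, "inside:", inside, "after:", after)
```

The cap is restored when the with block exits, so wrapping only the search.fit call leaves the rest of the process untouched.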
Memory & dataset-size scaling
scikit-learn is in-RAM and CPU-only. Past ~10 GB of training data the answer is usually a different estimator (HistGradientBoosting, SGDClassifier with partial_fit) rather than a different library.
import numpy as np
from sklearn.linear_model import SGDClassifier
rng = np.random.default_rng(0)
model = SGDClassifier(loss="log_loss", random_state=0)
# Stream batches without ever loading the full dataset
for _ in range(50):
    X_batch = rng.normal(size=(10_000, 100))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=np.array([0, 1]))
print(model.coef_.shape)
Output: (1, 100) — the model trained on 500k synthetic rows in batches of 10k; constant memory regardless of total size
Out-of-core options:
- partial_fit on SGDClassifier, SGDRegressor, MiniBatchKMeans, IncrementalPCA, several Naive Bayes variants: fit in batches.
- HistGradientBoostingClassifier / Regressor: histogram binning makes large tabular ML feasible without leaving sklearn.
- dask-ml: sklearn-shaped API parallelised across Dask workers.
- cuML (RAPIDS): sklearn API on NVIDIA GPUs; install separately from sklearn.
- For very large feature matrices: HashingVectorizer instead of CountVectorizer / TfidfVectorizer to skip the vocabulary dict.
For genuinely huge problems, sklearn is the wrong tool — XGBoost / LightGBM / CatBoost for tabular at scale, PyTorch / JAX for neural nets.
Version migration guide
sklearn moves slower than pandas or numpy but its deprecations are strict — a FutureWarning for two minor versions, then removal. The recent breaks worth knowing:
1.2 → 1.3:
- OneHotEncoder(sparse=...) → sparse_output= (the rename itself landed in 1.2; sparse= still works here with a FutureWarning).
- Some metrics keyword orderings tightened.
- Default max_iter increased for several solvers.
1.3 → 1.4:
- Removed sparse= (only sparse_output= works now).
- set_output(transform=...) lets transformers return pandas DataFrames preserving column names (available since 1.2; 1.4 adds transform="polars").
- feature_names_in_ enforced everywhere: fitting on a DataFrame and predicting on a numpy array now warns.
1.4 → 1.5:
- Several class_weight="auto" options removed (use "balanced").
- partial_dependence(..., kind="legacy") removed.
1.5 → 1.6:
- Full numpy 2.x compatibility.
- Several model_selection.cross_validate returned-dict keys renamed.
- Pipeline.set_output(transform="pandas") propagates to all transformers, which makes debugging easier.
import sklearn
print(sklearn.__version__)
sklearn.show_versions()
Output: the installed version, then a multi-line dump of dependent libraries (numpy, scipy, joblib, threadpoolctl) — useful to attach to bug reports
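Code that has to run on both sides of the sparse= to sparse_output= rename can pick the keyword from the installed version. This is a sketch only: dense_one_hot is a hypothetical helper, and the version parsing assumes the usual major.minor prefix in sklearn.__version__.

```python
import numpy as np
import sklearn
from sklearn.preprocessing import OneHotEncoder


def dense_one_hot(**kwargs):
    """Build a dense-output OneHotEncoder on either side of the 1.2 rename."""
    major, minor = (int(part) for part in sklearn.__version__.split(".")[:2])
    key = "sparse_output" if (major, minor) >= (1, 2) else "sparse"
    return OneHotEncoder(**{key: False}, **kwargs)


encoder = dense_one_hot(handle_unknown="ignore")
encoded = encoder.fit_transform(np.array([["EU"], ["NA"], ["EU"]]))
print(type(encoded).__name__, encoded.shape)
```

For a served model, pinning the sklearn version is still the simpler option; a gate like this is mainly useful in libraries that support several sklearn releases.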
Cross-version pickle compatibility is NOT guaranteed. Models pickled with sklearn 1.3 may fail to load on 1.5 with an obscure AttributeError. The safest production patterns:
- Pin sklearn in requirements.txt for any served model.
- Use skops.io.dump / skops.io.load, a security-and-portability-focused replacement for joblib that avoids pickle and makes you explicitly trust any non-default types before loading.
- Export to ONNX via skl2onnx for true cross-version (and cross-language) inference. This loses the ability to retrain, but the inference graph is stable.
Interop with adjacent ecosystems
scikit-learn sits at the hub of the Python ML ecosystem: its estimator API (fit / predict / transform on BaseEstimator subclasses) is the de-facto interface, and many third-party libraries implement it for compatibility.
| Library | Speaks sklearn API? | Notes |
|---|---|---|
| XGBoost | Yes (XGBClassifier, XGBRegressor) | drop-in for GradientBoostingClassifier |
| LightGBM | Yes (LGBMClassifier) | faster than sklearn's GBM, same API |
| CatBoost | Yes (CatBoostClassifier) | good with high-cardinality categoricals |
| imbalanced-learn | Yes — sklearn-compatible samplers | SMOTE, RandomUnderSampler, … |
| category_encoders | Yes — sklearn transformers | TargetEncoder, BinaryEncoder, … |
| optuna | Wraps sklearn estimators | smarter than GridSearchCV |
| shap | Consumes any sklearn model | model explanations |
| skops | Replaces joblib for serialisation | safer model persistence |
| cuML | Mimics sklearn API on GPU | NVIDIA-only |
| pandas | First-class input/output | set_output(transform="pandas") for DataFrame outputs |
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn import set_config
set_config(transform_output="pandas") # transformers return DataFrames
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]})
pipe.fit(df, [0, 1, 0, 1])
print(pipe.named_steps["scale"].transform(df).head())
Output: a pandas DataFrame (not a numpy array) with column names preserved; set_config(transform_output="pandas") is the 1.2+ way to keep DataFrames through a pipeline
Troubleshooting common errors
The errors below cover the recurring frictions; most are about preprocessing assumptions rather than the modeling layer itself.
- ValueError: Input contains NaN. The pipeline does not handle missing values; add a SimpleImputer step before the model.
- ValueError: Found unknown categories in column. OneHotEncoder saw a test-time category not present in fit; set handle_unknown="ignore".
- NotFittedError. You called predict before fit, often because a Pipeline step was replaced after the fact.
- AttributeError loading an old pickle. sklearn version mismatch; pin versions or switch to skops/ONNX.
- UserWarning: X has feature names, but <Estimator> was fitted without feature names. Fitted on numpy, predicted on DataFrame; be consistent throughout.
- GridSearch is mysteriously slow. BLAS threads + joblib stacking; wrap the search.fit call in threadpool_limits(limits=1).
- SVC hangs on big data. Kernel SVM fit time grows at least quadratically with the number of rows; use LinearSVC for n > 10k or switch to a tree ensemble.
- StratifiedKFold complains about class counts. Too few samples per class for the requested fold count; reduce n_splits or rebalance.
- r2_score returns negative numbers. Your model is worse than predicting the mean; inspect data quality first.
- pipeline.fit is fast but pipeline.predict is slow. The prediction path includes OneHotEncoder rebuilding categories; make sure categories= is fixed at fit time.
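The first two errors in the list are usually fixed inside the preprocessing step. The sketch below uses a made-up four-row frame and an arbitrary median strategy, and shows a pipeline that accepts both a missing value and an unseen category at predict time.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0],
    "region": ["NA", "EU", "NA", "EU"],
})
y = [0, 1, 0, 1]

# Impute before scaling so NaN never reaches the model.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
pre = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())]).fit(train, y)

# A NaN and a region never seen during fit: neither raises.
new_rows = pd.DataFrame({"age": [np.nan], "region": ["APAC"]})
pred = model.predict(new_rows)
print(pred.shape)  # (1,)
```

With handle_unknown="ignore" the unseen category is encoded as all zeros, so the prediction falls back on the remaining features.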
When NOT to use this
sklearn is the default for tabular classical ML; the cases below are where another tool wins.
- Deep learning: PyTorch, JAX, TensorFlow. sklearn has MLPClassifier but it's a teaching tool, not a production neural-net stack.
- Massive tabular ML: XGBoost / LightGBM / CatBoost. Faster and usually more accurate than sklearn's GBM.
- AutoML: auto-sklearn, FLAML, PyCaret, H2O. sklearn provides the estimators; AutoML wraps the search.
- Statistical inference (p-values, confidence intervals): statsmodels. sklearn focuses on prediction, not inference.
- GPU inference: cuML (RAPIDS) for NVIDIA, ONNX runtime for cross-platform.
- Online learning at scale: River for streaming-first ML; sklearn's partial_fit is batch-shaped.
- Time-series-specific models: sktime, darts, statsforecast for richer time-series support than sklearn alone.
See also
- sections/python/scikit-learn — full API tutorial (estimators, Pipeline, ColumnTransformer, CV)
- sections/python/numpy — the array foundation
- sections/python/scipy — the algorithm library scikit-learn is built on
- sections/python/pandas — feature DataFrames in and out of pipelines
- sections/packages-pip/pip-numpy — package-level foundation
- sections/packages-pip/pip-scipy — package-level companion