cheat sheet

modin

Package-level reference for modin — install, backend extras, versioning, and gotchas. Speeds up existing pandas code with a one-line import swap.

updated 05-31-2026

What it is

modin is a parallelising drop-in for pandas. It exposes the same module path (modin.pandas) and the same DataFrame API, then dispatches work across all CPU cores using one of three execution backends: Ray (default), Dask, or Unidist. Operations modin has not yet implemented fall back to pandas transparently.

On PyPI, modin occupies the niche of "speed up existing pandas without rewriting" — different from polars (new API, faster but rewrite required) and different from dask.dataframe (distributed-cluster orientation, larger overhead).

Install

bash

pip install "modin[ray]"

Output: installs modin + Ray; the recommended default

bash

pip install "modin[dask]"

Output: installs modin + Dask; pick when Dask is already in the stack

bash

pip install "modin[mpi]"

Output: installs modin + Unidist (lightweight MPI-backed engine)

bash

pip install "modin[all]"

Output: installs every backend at once — only useful for benchmark scripts

bash

uv add "modin[ray]"

Output: dependency resolved, lockfile updated

Versioning & Python support

modin tracks pandas closely: every modin release pins a compatible pandas range and updates the supported API surface. Versioning is SemVer-style, but because the project is still on a 0.x line, backwards-incompatible changes can arrive in minor releases. As of late 2025, modin is in the 0.3x.x line; despite the leading zero, the project has been production-stable for years.

Modin line	Pandas pinned	Python support
0.28.x	pandas 2.1	3.9 – 3.11
0.30.x	pandas 2.2	3.9 – 3.12
0.32.x	pandas 2.2 / 2.3	3.10 – 3.13

A modin upgrade can drag your pandas version with it — pin both in lockfiles if your downstream code is sensitive to pandas internals.
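A minimal sketch of the pinning step, assuming you want requirements-style lines generated from whatever is installed in the current environment. The helper name pin_lines is hypothetical; it only uses the standard library's importlib.metadata, so it runs even where modin is absent.

```python
from importlib import metadata


def pin_lines(packages=("modin", "pandas")):
    """Return one requirements-style line per package, pinned to the installed version."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            # Nothing to pin: emit a comment so the gap is visible in the output.
            lines.append(f"# {name} is not installed in this environment")
    return lines


if __name__ == "__main__":
    print("\n".join(pin_lines()))
```

Paste the printed lines into requirements.txt (or the equivalent lockfile input) so that a later modin upgrade cannot silently move pandas.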

Package metadata

Maintainer: UC Berkeley RISELab origin, now community-maintained with Intel and Ponder.io contributors
Project home: github.com/modin-project/modin
Docs: modin.readthedocs.io
License: Apache 2.0
PyPI: pypi.org/project/modin
Governance: open-source maintainers; corporate sponsors rotate
First released: 2018 (RISELab paper)
Downloads: ~1 M / month on PyPI as of late 2025

Optional dependencies & extras

modin's extras pick the execution engine, not the feature set. Picking one is mandatory: a bare pip install modin ships no backend, and modin raises an ImportError once it needs an engine.

Extra	Pulls in	When to use
`ray`	ray[default]	recommended default, best out-of-the-box scaling on a laptop
`dask`	dask[complete]	already running Dask, want a single cluster
`mpi`	unidist[mpi]	MPI clusters via Unidist, lighter footprint
`hdfs`	pyarrow + hdfs deps	distributed HDFS reads
`all`	all backends	rare; benchmarking only
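Before importing modin in a fresh environment, it can help to check which engines are actually importable. The sketch below is an illustration, not part of modin: the name available_engines and the engine-to-module mapping are assumptions made for this example.

```python
from importlib.util import find_spec

# Engine name -> top-level modules that must be importable.
# This mapping is an assumption for illustration, not something modin exports.
ENGINE_MODULES = {
    "ray": ("ray",),
    "dask": ("dask", "distributed"),
    "unidist": ("unidist",),
}


def available_engines():
    """List the engines whose backing modules can be found without importing them."""
    return [
        engine
        for engine, modules in ENGINE_MODULES.items()
        if all(find_spec(module) is not None for module in modules)
    ]


if __name__ == "__main__":
    found = available_engines()
    print("engines found:", ", ".join(found) if found else "none (install an extra)")
```

An empty list means no extra was installed, which is the situation where modin has nothing to dispatch to.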

Common companion packages alongside modin:

bash

pip install "modin[ray]" pandas pyarrow matplotlib jupyter

Output: installs modin (Ray backend) + the usual analytical stack — pandas is already pulled in, listed here for the pinned version

Alternatives

Package	One-line trade-off
pandas	single-threaded, simpler, smaller install
polars	faster than modin in most benchmarks, but new API
dask.dataframe	distributed across a cluster, higher orchestration overhead
cuDF (RAPIDS)	GPU-accelerated pandas-like API, NVIDIA-only
pyspark.pandas	pandas API on Spark — JVM dependency, cluster-oriented
duckdb	SQL-first; complements rather than replaces pandas

Common gotchas

Backend extras are required. pip install modin without [ray], [dask], or [mpi] leaves modin with no engine, and it raises an ImportError asking you to install one. The message is clear, but it still trips first-time installers.
Not all pandas APIs are accelerated. Unsupported ops fall back to single-threaded pandas and emit a "defaulting to pandas" UserWarning rather than failing. Check for those warnings first: that is where your missing speedup lives.
Ray daemon survives the script. Calling pd.read_csv once starts a Ray runtime; the worker pool lingers after python script.py exits unless you explicitly ray.shutdown(). CI runners with short timeouts notice.
Mixing modin.pandas and pandas. Passing a modin DataFrame to a function that does pd.DataFrame(...) (where pd is the real pandas) silently round-trips through pandas, losing partitioning. Use the same pd alias everywhere in a pipeline.
Memory amplification with Ray. Ray serialises partitions through Plasma object store — peak RAM can be 2–3× the resident pandas equivalent for short operations. Watch MODIN_MEMORY and RAY_OBJECT_STORE_MEMORY.
Windows + Ray. Ray on Windows is supported but historically the least-tested combination. WSL is the smoother path.
engine="modin" is not a thing. modin replaces the import; you do not pass it to pandas as an option. Write import modin.pandas as pd and that is the whole switch.
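For the lingering-Ray gotcha above, one possible approach is to register a shutdown hook at the top of the script. This is a sketch under the assumption that Ray is the engine; the helper name shutdown_ray_if_running is hypothetical, while ray.is_initialized() and ray.shutdown() are Ray's own calls.

```python
import atexit


def shutdown_ray_if_running():
    """Stop the Ray runtime if this process started one; return True if it was stopped."""
    try:
        import ray
    except ImportError:
        return False  # Ray is not installed, so there is nothing to stop.
    if ray.is_initialized():
        ray.shutdown()
        return True
    return False


# Run the hook when the interpreter exits, including after an unhandled exception.
atexit.register(shutdown_ray_if_running)
```

Registering the hook before the first modin call keeps the cleanup in one place instead of repeating ray.shutdown() at every exit path.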

Real-world recipes

modin's value proposition is "the one-line speedup" — the recipes below show the scenarios where that promise actually pays off. The companion sections/python/modin covers the API; here the focus is on packaging and backend choice.

Drop-in pandas accelerator on a fat-laptop CSV read — the canonical "this is why modin exists" pattern. A 4 GB CSV that takes 90 seconds in pandas often completes in 15-25 seconds in modin (Ray backend) on a modern 8-core machine.

python

import modin.pandas as pd

df = pd.read_csv("events.csv")
agg = df.groupby("user_id").agg({"revenue": "sum", "session_id": "nunique"})
print(agg.head())

Output: the same DataFrame.groupby result as pandas; under the hood Ray distributes the file read and the group-by across all CPU cores

Switching backends without changing application code:

python

import os
os.environ["MODIN_ENGINE"] = "dask"   # or "ray" / "unidist" / "python"

import modin.pandas as pd
df = pd.read_csv("events.csv")

Output: the import-time environment lookup picks the backend; same code runs on Ray today and Dask tomorrow
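If the engine should be overridable from the deployment environment, a small guard can apply a default only when MODIN_ENGINE is unset. The sketch below is an assumption-laden illustration: configure_engine and KNOWN_ENGINES are hypothetical names, and the accepted values are the four listed in the snippet above.

```python
import os

# Values taken from the comment in the snippet above; treat the set as an assumption.
KNOWN_ENGINES = {"ray", "dask", "unidist", "python"}


def configure_engine(default="ray", env=os.environ):
    """Keep an existing MODIN_ENGINE, fall back to `default`, and reject unknown names."""
    chosen = env.get("MODIN_ENGINE", default).strip().lower()
    if chosen not in KNOWN_ENGINES:
        raise ValueError(f"unknown MODIN_ENGINE value: {chosen!r}")
    env["MODIN_ENGINE"] = chosen
    return chosen


if __name__ == "__main__":
    # Call this before `import modin.pandas`, because the engine is read from the environment.
    print("engine:", configure_engine())
```

Failing fast on a typo is the design choice here: a misspelt engine name surfaces at startup instead of at the first DataFrame operation.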

Hybrid pipeline: modin for the heavy I/O, pandas for the model-fitting boundary — sklearn does not understand modin frames, so most pipelines materialise back to pandas before the fit call:

python

import modin.pandas as mpd
from sklearn.linear_model import LogisticRegression

df = mpd.read_csv("training.csv")
# Drop incomplete rows once so features and target stay row-aligned
clean = df[["age", "income", "is_premium", "churned"]].dropna()

# Pull back to pandas for the sklearn boundary
X = clean[["age", "income", "is_premium"]]._to_pandas()
y = clean["churned"]._to_pandas()

model = LogisticRegression().fit(X.to_numpy(), y.to_numpy())
print(model.coef_)

Output: model coefficients; the CSV read and dropna run on Ray, the fit runs on a single core (sklearn does not consume modin natively)

Performance tuning

modin performance is dominated by partition count and memory budget. The defaults are reasonable on a 16 GB / 8-core laptop; production tuning means matching partitions to cores and giving Ray enough object-store space.

python

import modin.config as cfg

cfg.NPartitions.put(8)              # one partition per core, typically
cfg.CpuCount.put(8)                 # cap CPUs modin uses
cfg.Engine.put("ray")               # explicit engine choice

Output: no return; subsequent operations honour the new settings

Tuning levers, ordered by impact:

Lever	Mechanism	When it helps
`MODIN_ENGINE` env / `cfg.Engine.put`	choose Ray / Dask / Unidist	already running one of these schedulers
`cfg.NPartitions.put(N)`	partition count	uneven CPU usage
`RAY_OBJECT_STORE_MEMORY` env	Plasma store sizing	OOMs during group-by / merge
`cfg.RangePartitioning.put(True)`	better skew handling	wide-skew group-by keys
`cfg.BenchmarkMode.put(True)`	sync execution for clean timings	benchmark scripts only
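The "one partition per core" rule of thumb can be derived from the machine instead of hard-coded. A minimal sketch, assuming os.cpu_count() reflects the cores you want modin to use; partitions_for and apply_partition_count are hypothetical helper names, and only cfg.NPartitions.put comes from the snippet above.

```python
import os


def partitions_for(cores=None, per_core=1):
    """Partition count for a machine: `per_core` partitions for each CPU core."""
    detected = cores or os.cpu_count() or 1
    return max(1, detected * per_core)


def apply_partition_count(n):
    """Set modin's NPartitions when modin is importable; report whether it was set."""
    try:
        import modin.config as cfg
    except ImportError:
        return False  # modin is not installed here, so nothing was changed
    cfg.NPartitions.put(n)
    return True


if __name__ == "__main__":
    n = partitions_for()
    print(f"{n} partitions, applied: {apply_partition_count(n)}")
```

In containers, os.cpu_count() may report the host's cores rather than the container's quota, so passing cores explicitly is the safer choice there.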

Confirming a step did not fall back to pandas:

python

import warnings
import modin.pandas as pd

with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)
    try:
        df = pd.read_csv("events.csv")
        result = df.pivot_table(index="user_id", values="revenue", aggfunc="sum")
        print(result.head())
    except UserWarning as exc:
        print(f"Modin defaulted to pandas at: {exc}")

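Output: the aggregated head if every step ran on modin; otherwise the message of the first warning raised. Note that this filter escalates every UserWarning, not only the defaulting to pandas ones, which are the single most common reason "modin didn't speed up my code"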
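A narrower variant escalates only the fallback warnings and leaves other UserWarnings alone. This sketch assumes the fallback message contains the text "defaulting to pandas", as in the warning quoted under Troubleshooting; the context-manager name fail_on_pandas_fallback is hypothetical. The warnings module matches the message pattern case-insensitively against the start of the message, hence the leading ".*".

```python
import warnings
from contextlib import contextmanager


@contextmanager
def fail_on_pandas_fallback():
    """Turn 'defaulting to pandas' UserWarnings into exceptions inside the block."""
    with warnings.catch_warnings():
        warnings.filterwarnings(
            "error",
            message=r".*defaulting to pandas",  # assumed wording of modin's fallback warning
            category=UserWarning,
        )
        yield


if __name__ == "__main__":
    try:
        with fail_on_pandas_fallback():
            # Stand-in for a modin call that falls back; replace with real pipeline code.
            warnings.warn("Defaulting to pandas implementation", UserWarning)
    except UserWarning as exc:
        print(f"fallback detected: {exc}")
```

Wrapping a test suite's pipeline calls in this block is one way to catch a new fallback introduced by a modin or pandas upgrade.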

Memory & dataset-size scaling

modin partitions the frame and distributes work across processes; on a single machine the partition data lives in Ray's Plasma object store (shared memory) or Dask's worker memory. Peak memory is typically 2-3x the equivalent pandas resident set because partitions hold copies during shuffles.

python

import modin.experimental.pandas as pd   # read_csv_glob lives in the experimental namespace
import modin.config as cfg

cfg.NPartitions.put(16)

# Read a 30 GB CSV directory in parallel
df = pd.read_csv_glob("data/*.csv")
print(df.shape)

Output: the full frame shape; the read is parallel across files, dispatched by Ray

For data that genuinely exceeds RAM:

Modin can run out-of-core on either engine: Dask spills partitions to disk under memory pressure, and current Ray releases spill objects to local disk once the object store fills. Size the store with RAY_OBJECT_STORE_MEMORY and adjust object_spilling_config if the default spill location or behaviour does not suit the workload.
The streaming pattern is pandas chunking through a modin reader: split files explicitly, process each, concatenate at the end.
Past ~100 GB on a single node, switch to polars-streaming or duckdb — modin's overhead per partition starts to eat the win.
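The split-process-combine pattern from the list above can be sketched without any DataFrame library: compute a partial aggregate per file, then merge the partials. The example uses the standard library's csv module so it runs anywhere; in a real pipeline the per-file step would be a modin read and group-by instead. The function names and the user_id / revenue columns (borrowed from the earlier recipe) are illustrative assumptions.

```python
import csv
from collections import Counter


def revenue_by_user(path):
    """Partial aggregate for one file: total revenue per user_id."""
    totals = Counter()
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            totals[row["user_id"]] += float(row["revenue"])
    return totals


def combine_files(paths):
    """Process files one at a time and add each partial result into a running total."""
    grand_total = Counter()
    for path in paths:
        # Counter.update adds values key by key, so only one file's partial is held at a time.
        grand_total.update(revenue_by_user(path))
    return grand_total
```

This only works for aggregates that combine associatively (sums, counts, min, max); something like nunique needs the distinct keys carried through to the final step.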

Version migration guide

modin's 0.x versioning understates its production stability — the leading zero is historical. The relevant migration story is modin upgrades dragging pandas with them and backend renames.

Recent changes worth knowing:

0.26 → 0.27: the unidist engine matured; MODIN_ENGINE=unidist became viable for HPC-shaped workloads.
0.28 → 0.30: dropped the omnisci backend (now hdk); if you were on omnisci, the rename matters at install time (pip install "modin[hdk]").
0.30 → 0.32: stricter pandas pin range; older modin won't install with pandas 2.3.
0.32+: dropped support for Python 3.9 — match your project's Python floor before upgrading.

python

import modin
import pandas

print("modin:", modin.__version__)
print("pandas:", pandas.__version__)   # which pandas modin is delegating to

Output: the two versions side by side — pin both in requirements.txt because modin's fallback path is pandas

API parity: every modin minor release closes gaps with the latest pandas. Check the defaulting to pandas warnings before assuming an op is accelerated — the parity matrix changes between releases.

Interop with adjacent ecosystems

modin's promise is full pandas-API compatibility, so interop generally means convert to pandas at the boundary with downstream libraries that do not understand modin partitions.

Convert from / to	How	Cost
modin → pandas	`mdf._to_pandas()`	Collects all partitions onto the driver process
modin → NumPy	`mdf.to_numpy()`	Collect + convert; not zero-copy
modin → Arrow	`pyarrow.Table.from_pandas(mdf._to_pandas())` (no direct path)	Two copies; usually go through pandas
modin → polars	round-trip via pandas/Arrow	Two copies; consider polars-only if heavy
modin → duckdb	materialise first, then query the local variable: `pdf = mdf._to_pandas()` followed by `duckdb.sql("SELECT * FROM pdf")`	One copy
modin → sklearn	`.to_numpy()` at fit-time	sklearn is single-process anyway
modin → Spark	`pyspark.pandas.from_pandas(mdf._to_pandas())`	Cross-cluster shuffle

python

import modin.pandas as mpd

mdf = mpd.read_csv("events.csv")

# Boundary to plain pandas
pdf = mdf._to_pandas()
print(type(pdf))

Output: <class 'pandas.core.frame.DataFrame'> — the partitions have been collected onto the driver process
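When a function may receive either a modin or a plain pandas object, a small duck-typed helper keeps the conversion in one place. This is a sketch, not a modin API: to_plain_pandas is a hypothetical name, and it relies on the private _to_pandas method used throughout this page, which may change between releases.

```python
def to_plain_pandas(obj):
    """Return a pandas object for modin input; pass anything else through unchanged."""
    convert = getattr(obj, "_to_pandas", None)  # private modin method, assumed present
    return convert() if callable(convert) else obj


class FakeModinFrame:
    """Stand-in used only to demonstrate the helper without installing modin."""

    def _to_pandas(self):
        return "materialised pandas frame"


if __name__ == "__main__":
    print(to_plain_pandas(FakeModinFrame()))  # converted through _to_pandas
    print(to_plain_pandas([1, 2, 3]))         # returned as-is
```

Calling the helper once at the library boundary avoids scattering private-method calls through the codebase.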

Troubleshooting common errors

The list below catalogues the failure modes that send people to the modin issue tracker. Each has a fast remediation; the underlying cause is usually backend configuration rather than the modin layer itself.

ImportError: No module named 'ray' — backend extra missing. Fix: pip install "modin[ray]".
UserWarning: Distributing <object> object. This may take some time. — you passed a pandas object into modin; it is distributing it. Either accept the one-time cost or read directly via mpd.read_*.
UserWarning: Defaulting to pandas implementation — the op has no modin kernel yet. Either accept the single-core penalty or rewrite the op to use accelerated primitives.
Ray workers OOM on group-by — Plasma object store too small. RAY_OBJECT_STORE_MEMORY=... in bytes, or pass object_store_memory= to ray.init.
PermissionError: [Errno 13] on Windows: Ray on Windows is fussier about temp dirs; set RAY_TMPDIR=C:\Users\...\Temp\ray.
Ray daemon survives the script: call ray.shutdown() explicitly at exit. CI runners hang otherwise.
SettingWithCopyWarning — modin inherits pandas semantics; same fix (.loc[mask, col] = ...).
Pickled modin DataFrame won't load — modin partition references break across machines. Always pickle via mdf._to_pandas() for cross-machine transport.

When NOT to use this

modin's narrow sweet spot — "speed up existing pandas code" — has a few cases where the trade-off is wrong.

You can rewrite to polars: polars is faster than modin in most benchmarks and has a cleaner API. If you have the time to migrate, polars is the more durable answer.
Small data (< 1 GB): the partition overhead dominates. Plain pandas is faster.
Cluster-scale (TB+): modin is single-node optimised. Use ray-data, pyspark.pandas, or dask.dataframe directly.
GPU inference: cuDF is the right pandas-shaped GPU library.
Heavy use of apply with Python UDFs: these almost always fall back to pandas, defeating the parallelism. Vectorise first, then evaluate modin.
You need every pandas API to work: modin's parity is partial. Audit the defaulting to pandas warnings before betting a production pipeline on it.
