cheat sheet
modin
Package-level reference for modin — install, backend extras, versioning, and gotchas. Speeds up existing pandas code with a one-line import swap.
What it is
modin is a parallelising drop-in for pandas. It exposes the same module path (modin.pandas) and the same DataFrame API, then dispatches work across all CPU cores using one of three execution backends: Ray (default), Dask, or Unidist. Operations modin has not yet implemented fall back to pandas transparently.
On PyPI, modin occupies the niche of "speed up existing pandas without rewriting" — different from polars (new API, faster but rewrite required) and different from dask.dataframe (distributed-cluster orientation, larger overhead).
Install
pip install "modin[ray]"
Output: installs modin + Ray; the recommended default
pip install "modin[dask]"
Output: installs modin + Dask; pick when Dask is already in the stack
pip install "modin[unidist]"
Output: installs modin + Unidist (lightweight MPI-backed engine)
pip install "modin[all]"
Output: installs every backend at once — only useful for benchmark scripts
uv add "modin[ray]"
Output: dependency resolved, lockfile updated
Versioning & Python support
modin tracks pandas closely — every modin release pins a compatible pandas range and updates the supported API surface. SemVer-style versioning; backwards-incompatible changes are batched into majors. As of late 2025, modin is in the 0.3x.x line; despite the leading zero, the project has been production-stable for years.
| Modin line | Pandas pinned | Python support |
|---|---|---|
| 0.28.x | pandas 2.1 | 3.9 – 3.11 |
| 0.30.x | pandas 2.2 | 3.9 – 3.12 |
| 0.32.x | pandas 2.2 / 2.3 | 3.10 – 3.13 |
A modin upgrade can drag your pandas version with it — pin both in lockfiles if your downstream code is sensitive to pandas internals.
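A minimal sketch of the "pin both" advice, using only the standard library. It reads the installed versions and prints requirements-style lines; the helper name pin_lines is hypothetical, and a package that is not installed is reported as a comment rather than an error.

```python
from importlib.metadata import PackageNotFoundError, version


def pin_lines(packages=("modin", "pandas")):
    """Return one requirements-style line per package, based on what is installed."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Keep a visible marker instead of failing, so the gap is obvious.
            lines.append(f"# {name} is not installed")
    return lines


print("\n".join(pin_lines()))
```

Paste the printed lines into requirements.txt (or compare them with your lockfile) after a modin upgrade to see whether pandas moved too.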
Package metadata
- Maintainer: UC Berkeley RISELab origin, now community-maintained with Intel and Ponder.io contributors
- Project home: github.com/modin-project/modin
- Docs: modin.readthedocs.io
- License: Apache 2.0
- PyPI: pypi.org/project/modin
- Governance: open-source maintainers; corporate sponsors rotate
- First released: 2018 (RISELab paper)
- Downloads: ~1 M / month on PyPI as of late 2025
Optional dependencies & extras
modin's extras pick the execution engine, not the feature set — picking one is mandatory (the default pip install modin ships no backend and errors at import).
| Extra | Pulls in | When to use |
|---|---|---|
| ray | ray[default] | recommended default, best out-of-the-box scaling on a laptop |
| dask | dask[complete] | already running Dask, want a single cluster |
| unidist | unidist[mpi] | MPI clusters, lighter footprint |
| hdfs | pyarrow + hdfs deps | distributed HDFS reads |
| all | all backends | rare; benchmarking only |
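Because the extra decides which engine is available, it can help to check what is importable before modin is first used. The sketch below is one way to do that with the standard library; the mapping from engine name to top-level module is an assumption made for illustration, and installed_engines is a hypothetical helper name.

```python
from importlib.util import find_spec

# Assumed mapping: engine name -> top-level module that its extra installs.
ENGINE_MODULES = {"ray": "ray", "dask": "dask", "unidist": "unidist"}


def installed_engines():
    """Return the engine names whose backing module can be imported."""
    return [
        engine
        for engine, module in ENGINE_MODULES.items()
        if find_spec(module) is not None
    ]


found = installed_engines()
print("available engines:", found or "none (install an extra such as modin[ray])")
```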
Common companion packages alongside modin:
pip install "modin[ray]" pandas pyarrow matplotlib jupyter
Output: installs modin (Ray backend) + the usual analytical stack — pandas is already pulled in, listed here for the pinned version
Alternatives
| Package | One-line trade-off |
|---|---|
| pandas | single-threaded, simpler, smaller install |
| polars | faster than modin in most benchmarks, but new API |
| dask.dataframe | distributed across a cluster, higher orchestration overhead |
| cuDF (RAPIDS) | GPU-accelerated pandas-like API, NVIDIA-only |
| pyspark.pandas | pandas API on Spark — JVM dependency, cluster-oriented |
| duckdb | SQL-first; complements rather than replaces pandas |
Common gotchas
- Backend extras are required. pip install modin without [ray], [dask], or [unidist] fails at import with "no execution engine configured". The error message is clear but trips first-time installers.
- Not all pandas APIs are accelerated. Unsupported ops fall back to single-threaded pandas with nothing more than a UserWarning. Set MODIN_ENGINE explicitly and check the "defaulting to pandas" warnings; that is where your missing speedup lives.
- Ray daemon survives the script. Calling pd.read_csv once starts a Ray runtime; the worker pool lingers after python script.py exits unless you explicitly call ray.shutdown(). CI runners with short timeouts notice.
- Mixing modin.pandas and pandas. Passing a modin DataFrame to a function that does pd.DataFrame(...) (where pd is the real pandas) silently round-trips through pandas, losing partitioning. Use the same pd alias everywhere in a pipeline.
- Memory amplification with Ray. Ray serialises partitions through the Plasma object store, so peak RAM can be 2–3× the resident pandas equivalent for short operations. Watch MODIN_MEMORY and RAY_OBJECT_STORE_MEMORY.
- Windows + Ray. Ray on Windows is supported but historically the least-tested combination. WSL is the smoother path.
- engine="modin" is not a thing. modin replaces the import; you do not pass it to pandas. Writing import modin.pandas as pd is the whole switch.
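For the lingering-runtime gotcha, one hedged option is to register a shutdown hook at interpreter exit. The sketch below uses ray.is_initialized() and ray.shutdown(); the function name shutdown_ray is hypothetical, and the import is guarded so the snippet also loads on machines without Ray.

```python
import atexit

try:
    import ray  # present only when the Ray backend is installed
except ImportError:
    ray = None


def shutdown_ray():
    """Stop the local Ray runtime if one is running; return True when a shutdown happened."""
    if ray is not None and ray.is_initialized():
        ray.shutdown()
        return True
    return False


# Run the cleanup automatically when the interpreter exits.
atexit.register(shutdown_ray)
```

Registering the hook near the top of a script keeps the cleanup in one place, which is easier to audit than scattered shutdown calls.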
Real-world recipes
modin's value proposition is "the one-line speedup" — the recipes below show the scenarios where that promise actually pays off. The companion sections/python/modin covers the API; here the focus is on packaging and backend choice.
Drop-in pandas accelerator on a fat-laptop CSV read — the canonical "this is why modin exists" pattern. A 4 GB CSV that takes 90 seconds in pandas often completes in 15-25 seconds in modin (Ray backend) on a modern 8-core machine.
import modin.pandas as pd
df = pd.read_csv("events.csv")
agg = df.groupby("user_id").agg({"revenue": "sum", "session_id": "nunique"})
print(agg.head())
Output: the same DataFrame.groupby result as pandas; under the hood Ray distributes the file read and the group-by across all CPU cores
Switching backends without changing application code:
import os
os.environ["MODIN_ENGINE"] = "dask" # or "ray" / "unidist" / "python"
import modin.pandas as pd
df = pd.read_csv("events.csv")
Output: the import-time environment lookup picks the backend; same code runs on Ray today and Dask tomorrow
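Since the engine is read from the environment at import time, a typo in MODIN_ENGINE only surfaces later. A small guard like the one below can validate the name first. This is a sketch: select_engine is a hypothetical helper, and the accepted names are simply the four listed in the snippet above.

```python
import os

# Engine names taken from the comment in the snippet above.
KNOWN_ENGINES = ("ray", "dask", "unidist", "python")


def select_engine(name, environ=None):
    """Validate an engine name and store it in MODIN_ENGINE; return the stored value."""
    environ = os.environ if environ is None else environ
    engine = name.strip().lower()
    if engine not in KNOWN_ENGINES:
        raise ValueError(f"unknown modin engine {name!r}; expected one of {KNOWN_ENGINES}")
    environ["MODIN_ENGINE"] = engine
    return engine


# Demonstrated on a plain dict so the real environment is left untouched.
demo_env = {}
select_engine("Dask", demo_env)
print(demo_env)
```

In a real script, call select_engine without the second argument before the first import of modin.pandas.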
Hybrid pipeline: modin for the heavy I/O, pandas for the model-fitting boundary — sklearn does not understand modin frames, so most pipelines materialise back to pandas before the fit call:
import modin.pandas as mpd
from sklearn.linear_model import LogisticRegression
df = mpd.read_csv("training.csv")
# Drop incomplete rows once so features and labels stay row-aligned
clean = df[["age", "income", "is_premium", "churned"]].dropna()
# Pull back to pandas for the sklearn boundary
X = clean[["age", "income", "is_premium"]]._to_pandas()
y = clean["churned"]._to_pandas()
model = LogisticRegression().fit(X.to_numpy(), y.to_numpy())
print(model.coef_)
Output: model coefficients; the CSV read and dropna run on Ray, the fit runs on a single core (sklearn does not consume modin natively)
Performance tuning
modin performance is dominated by partition count and memory budget. The defaults are reasonable on a 16 GB / 8-core laptop; production tuning means matching partitions to cores and giving Ray enough object-store space.
import modin.config as cfg
cfg.NPartitions.put(8) # one partition per core, typically
cfg.CpuCount.put(8) # cap CPUs modin uses
cfg.Engine.put("ray") # explicit engine choice
Output: no return; subsequent operations honour the new settings
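Hard-coding 8 only fits an 8-core machine. A hedged alternative is to derive the number from the host, as sketched below. The helper suggested_partitions and its reserve parameter are hypothetical; the modin.config calls are the same ones used above and are skipped when modin is not installed.

```python
import os


def suggested_partitions(reserve=0):
    """Return one partition per CPU core, minus `reserve`, never below 1."""
    cores = os.cpu_count() or 1  # cpu_count() can return None
    return max(1, cores - reserve)


n = suggested_partitions(reserve=1)  # leave one core for the rest of the system
print("partitions:", n)

try:
    import modin.config as cfg
except ImportError:
    cfg = None  # modin not installed; the number above is still informative

if cfg is not None:
    cfg.NPartitions.put(n)
    cfg.CpuCount.put(n)
```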
Tuning levers, ordered by impact:
| Lever | Mechanism | When it helps |
|---|---|---|
| MODIN_ENGINE env / cfg.Engine.put | choose Ray / Dask / Unidist | already running one of these schedulers |
| cfg.NPartitions.put(N) | partition count | uneven CPU usage |
| RAY_OBJECT_STORE_MEMORY env | Plasma store sizing | OOMs during group-by / merge |
| cfg.RangePartitioning.put(True) | better skew handling | wide-skew group-by keys |
| cfg.BenchmarkMode.put(True) | sync execution for clean timings | benchmark scripts only |
Confirming a step did not fall back to pandas:
import warnings
import modin.pandas as pd
with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)
    try:
        df = pd.read_csv("events.csv")
        result = df.pivot_table(index="user_id", values="revenue", aggfunc="sum")
        print(result.head())
    except UserWarning as exc:
        print(f"Modin defaulted to pandas at: {exc}")
Output: raises if any modin op silently fell back to pandas — the single most-common reason "modin didn't speed up my code"
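Turning warnings into errors stops at the first fallback. If you would rather let the pipeline finish and list every fallback afterwards, a recording context manager is one option. The sketch below uses only the standard warnings module; record_fallbacks is a hypothetical name, and the warning raised inside the demo is a simulated stand-in for a modin call.

```python
import warnings
from contextlib import contextmanager


@contextmanager
def record_fallbacks(marker="defaulting to pandas"):
    """Collect warning messages containing `marker` (case-insensitive) into a list."""
    hits = []
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # keep repeated warnings from being deduplicated
        try:
            yield hits
        finally:
            hits.extend(
                str(w.message) for w in caught if marker in str(w.message).lower()
            )


with record_fallbacks() as hits:
    # Simulated warning; real code would run modin DataFrame operations here.
    warnings.warn("Defaulting to pandas implementation", UserWarning)

print(len(hits), "fallback(s) recorded")
```

Note that warnings captured this way are not printed to stderr, so log the collected list if you still want them visible.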
Memory & dataset-size scaling
modin partitions the frame and distributes work across processes; on a single machine the partition data lives in Ray's Plasma object store (shared memory) or Dask's worker memory. Peak memory is typically 2-3x the equivalent pandas resident set because partitions hold copies during shuffles.
import modin.config as cfg
import modin.experimental.pandas as xpd
cfg.NPartitions.put(16)
# Read a 30 GB CSV directory in parallel; read_csv_glob expands the wildcard, plain read_csv expects a single path
df = xpd.read_csv_glob("data/*.csv")
print(df.shape)
Output: the full frame shape; the read is parallel across files, dispatched by Ray
For data that genuinely exceeds RAM:
- Modin is out-of-core when backed by Dask: Dask spills partitions to disk under memory pressure. With Ray, the object store is sized by RAY_OBJECT_STORE_MEMORY and spilling is controlled by object_spilling_config; recent Ray releases spill to local disk by default, so check both settings for the Ray version you run.
- The streaming pattern is manual chunking: split the input into files, process each file with a modin reader, and concatenate the partial results at the end.
- Past ~100 GB on a single node, switch to polars-streaming or duckdb — modin's overhead per partition starts to eat the win.
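The "process each file, combine at the end" pattern from the list above can be sketched without any DataFrame library. The example below uses the standard csv module as a stand-in for the per-file step; in a real pipeline that step would be a modin or pandas read plus group-by. The function names and the user_id / revenue column names are illustrative assumptions borrowed from the earlier recipes.

```python
import csv
from collections import Counter
from pathlib import Path


def revenue_by_user(path):
    """Aggregate one file: total revenue per user_id."""
    totals = Counter()
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            totals[row["user_id"]] += float(row["revenue"])
    return totals


def combine_directory(directory, pattern="*.csv"):
    """Process files one at a time and merge the partial totals."""
    combined = Counter()
    for path in sorted(Path(directory).glob(pattern)):
        combined.update(revenue_by_user(path))  # Counter.update adds to existing values
    return combined
```

Only one file's rows and the running totals are held in memory at a time, which is the property that matters when the full dataset does not fit in RAM.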
Version migration guide
modin's 0.x versioning understates its production stability — the leading zero is historical. The relevant migration story is modin upgrades dragging pandas with them and backend renames.
Recent changes worth knowing:
- 0.26 → 0.27: the unidist engine matured; MODIN_ENGINE=unidist became viable for HPC-shaped workloads.
- 0.28 → 0.30: dropped the omnisci backend (now hdk); if you were on omnisci, the rename matters at install time (pip install "modin[hdk]").
- 0.30 → 0.32: stricter pandas pin range; older modin won't install with pandas 2.3.
- 0.32+: dropped support for Python 3.9 — match your project's Python floor before upgrading.
import modin
import pandas
print("modin:", modin.__version__)
print("pandas:", pandas.__version__) # which pandas modin is delegating to
Output: the two versions side by side — pin both in requirements.txt because modin's fallback path is pandas
API parity: every modin minor release closes gaps with the latest pandas. Check the defaulting to pandas warnings before assuming an op is accelerated — the parity matrix changes between releases.
Interop with adjacent ecosystems
modin's promise is full pandas-API compatibility, so interop generally means convert to pandas at the boundary with downstream libraries that do not understand modin partitions.
| Convert from / to | How | Cost |
|---|---|---|
| modin → pandas | mdf._to_pandas() | Collects all partitions onto the driver process |
| modin → NumPy | mdf.to_numpy() | Collect + convert; not zero-copy |
| modin → Arrow | pyarrow.Table.from_pandas(mdf._to_pandas()) (no direct path) | Two copies; usually go through pandas |
| modin → polars | round-trip via pandas/Arrow | Two copies; consider polars-only if heavy |
| modin → duckdb | bind the collected frame to a local name, then query it: pdf = mdf._to_pandas(); duckdb.sql("SELECT * FROM pdf") | One copy |
| modin → sklearn | .to_numpy() at fit-time | sklearn is single-process anyway |
| modin → Spark | pyspark.pandas.from_pandas(mdf._to_pandas()) | Cross-cluster shuffle |
import modin.pandas as mpd
mdf = mpd.read_csv("events.csv")
# Boundary to plain pandas
pdf = mdf._to_pandas()
print(type(pdf))
Output: <class 'pandas.core.frame.DataFrame'> — the partitions have been collected onto the driver process
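When a function may receive either a modin or a pandas object, a small duck-typed adapter keeps the boundary conversion in one place. This is a sketch: as_pandas is a hypothetical helper, and it relies on the _to_pandas method used throughout this page, which is underscore-prefixed and may therefore change between modin releases. FakeFrame is a stand-in so the example runs without modin.

```python
def as_pandas(obj):
    """Return obj._to_pandas() when that method exists, otherwise obj unchanged."""
    convert = getattr(obj, "_to_pandas", None)
    return convert() if callable(convert) else obj


class FakeFrame:
    """Stand-in for a modin DataFrame, used only for this demonstration."""

    def _to_pandas(self):
        return "collected"


print(as_pandas(FakeFrame()))  # goes through _to_pandas
print(as_pandas([1, 2, 3]))    # no _to_pandas, passed through as-is
```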
Troubleshooting common errors
The list below catalogues the failure modes that send people to the modin issue tracker. Each has a fast remediation; the underlying cause is usually backend configuration rather than the modin layer itself.
- ImportError: No module named 'ray'. Cause: backend extra missing. Fix: pip install "modin[ray]".
- UserWarning: Distributing <object> object. This may take some time. Cause: you passed a pandas object into modin and it is being distributed. Fix: accept the one-time cost or read directly via mpd.read_*.
- UserWarning: Defaulting to pandas implementation. Cause: the op has no modin kernel yet. Fix: accept the single-core penalty or rewrite the op to use accelerated primitives.
- Ray workers OOM on group-by. Cause: Plasma object store too small. Fix: set RAY_OBJECT_STORE_MEMORY=... in bytes, or pass object_store_memory= to ray.init.
- PermissionError: [Errno 13] on Windows. Cause: Ray on Windows is fussier about temp dirs. Fix: set RAY_TMPDIR=C:\Users\...\Temp\ray.
- Ray daemon survives the script. Fix: call ray.shutdown() explicitly at exit; CI runners hang otherwise.
- SettingWithCopyWarning. Cause: modin inherits pandas semantics. Fix: same as in pandas (.loc[mask, col] = ...).
- Pickled modin DataFrame won't load. Cause: modin partition references break across machines. Fix: always pickle the result of mdf._to_pandas() for cross-machine transport.
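For the object-store OOM entry, the size has to be given in bytes, which is easy to get wrong by a factor of 1024. The sketch below converts from GiB and passes the value to ray.init via its object_store_memory argument; gib_to_bytes and init_ray are hypothetical helper names, and Ray is imported lazily so the conversion can be used without it.

```python
def gib_to_bytes(gib):
    """Convert a size in GiB to an integer number of bytes."""
    return int(gib * 1024 ** 3)


def init_ray(store_gib):
    """Start (or reuse) a local Ray runtime with an explicit object-store size."""
    import ray  # imported lazily so this module loads without Ray installed

    return ray.init(
        object_store_memory=gib_to_bytes(store_gib),
        ignore_reinit_error=True,
    )


print(gib_to_bytes(4))  # 4294967296
```

Call init_ray before the first modin operation, since an already running Ray runtime keeps the store size it was started with.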
When NOT to use this
modin's narrow sweet spot — "speed up existing pandas code" — has a few cases where the trade-off is wrong.
- You can rewrite to polars: polars is faster than modin in most benchmarks and has a cleaner API. If you have the time to migrate, polars is the more durable answer.
- Small data (< 1 GB): the partition overhead dominates. Plain pandas is faster.
- Cluster-scale (TB+): modin is single-node optimised. Use ray-data, pyspark.pandas, or dask.dataframe directly.
- GPU inference: cuDF is the right pandas-shaped GPU library.
- Heavy use of apply with Python UDFs: these almost always fall back to pandas, defeating the parallelism. Vectorise first, then evaluate modin.
- You need every pandas API to work: modin's parity is partial. Audit the "defaulting to pandas" warnings before betting a production pipeline on it.
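The size thresholds scattered through this page (under 1 GB, roughly 100 GB on one node, TB-scale clusters) can be folded into a rule-of-thumb helper. The sketch below only encodes those rough numbers; suggest_tool is a hypothetical function and the cut-offs are guidance from this cheat sheet, not measured limits.

```python
def suggest_tool(size_gb):
    """Map a dataset size in GB to the tool this cheat sheet leans towards."""
    if size_gb < 1:
        return "pandas"  # partition overhead dominates on small data
    if size_gb <= 100:
        return "modin"  # single-node sweet spot
    if size_gb < 1000:
        return "polars-streaming or duckdb"
    return "cluster tooling (ray-data, pyspark.pandas, dask.dataframe)"


for size in (0.2, 30, 250, 5000):
    print(size, "GB ->", suggest_tool(size))
```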
See also
- sections/python/modin — full API tutorial (groupby, backends, tuning)
- sections/python/pandas — the API modin re-implements
- sections/python/polars — alternative path when rewriting is acceptable
- sections/packages-pip/pip-pandas — package-level comparison
- sections/packages-pip/pip-polars — sibling fast-DataFrame library