cheat sheet

modin

Package-level reference for modin — install, backend extras, versioning, and gotchas. Speeds up existing pandas code with a one-line import swap.

What it is

modin is a parallelising drop-in for pandas. It exposes the same module path (modin.pandas) and the same DataFrame API, then dispatches work across all CPU cores using one of three execution backends: Ray (default), Dask, or Unidist. Operations modin has not yet implemented fall back to pandas transparently.

On PyPI, modin occupies the niche of "speed up existing pandas without rewriting" — different from polars (new API, faster but rewrite required) and different from dask.dataframe (distributed-cluster orientation, larger overhead).

Install

bash
pip install "modin[ray]"

Output: installs modin + Ray; the recommended default

bash
pip install "modin[dask]"

Output: installs modin + Dask; pick when Dask is already in the stack

bash
pip install "modin[unidist]"

Output: installs modin + Unidist (lightweight MPI-backed engine)

bash
pip install "modin[all]"

Output: installs every backend at once — only useful for benchmark scripts

bash
uv add "modin[ray]"

Output: dependency resolved, lockfile updated

Versioning & Python support

modin tracks pandas closely: every modin release pins a compatible pandas range and updates the supported API surface. Version numbers follow a 0.MINOR.PATCH pattern, and because the major number is still 0, backwards-incompatible changes arrive in minor releases. As of late 2025, modin is in the 0.3x.x line; despite the leading zero, the project has been production-stable for years.

| Modin line | Pandas pinned | Python support |
|---|---|---|
| 0.28.x | pandas 2.1 | 3.9 – 3.11 |
| 0.30.x | pandas 2.2 | 3.9 – 3.12 |
| 0.32.x | pandas 2.2 / 2.3 | 3.10 – 3.13 |

A modin upgrade can drag your pandas version with it — pin both in lockfiles if your downstream code is sensitive to pandas internals.
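To see which pandas range an installed modin declares, the standard library can read the package metadata directly. This is a minimal sketch; the helper name declared_requirements is made up for illustration, and it returns an empty list when the distribution is not installed.

```python
from importlib import metadata


def declared_requirements(dist_name: str, needle: str) -> list[str]:
    """Return the requirement strings of `dist_name` that mention `needle`."""
    try:
        requires = metadata.requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return []  # the distribution is not installed in this environment
    return [req for req in requires if needle.lower() in req.lower()]


# Prints the pandas constraint(s) modin was published with, or [] without modin.
print(declared_requirements("modin", "pandas"))
```

Comparing that constraint with the pandas version in your lockfile before upgrading shows whether the upgrade will move pandas too.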

Package metadata

  • Maintainer: UC Berkeley RISELab origin, now community-maintained with Intel and Ponder.io contributors
  • Project home: github.com/modin-project/modin
  • Docs: modin.readthedocs.io
  • License: Apache 2.0
  • PyPI: pypi.org/project/modin
  • Governance: open-source maintainers; corporate sponsors rotate
  • First released: 2018 (RISELab paper)
  • Downloads: ~1 M / month on PyPI as of late 2025

Optional dependencies & extras

modin's extras pick the execution engine, not the feature set. Picking one is effectively mandatory: a bare pip install modin ships no backend, and modin raises an ImportError once it needs an engine.

| Extra | Pulls in | When to use |
|---|---|---|
| ray | ray[default] | recommended default, best out-of-the-box scaling on a laptop |
| dask | dask[complete] | already running Dask, want a single cluster |
| unidist | unidist[mpi] | MPI clusters, lighter footprint |
| hdfs | pyarrow + hdfs deps | distributed HDFS reads |
| all | all backends | rare; benchmarking only |
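A quick pre-flight check can tell you which engine packages are present before modin is imported. The sketch below only looks up module specs and never imports the engines; the helper name installed_engines and the ENGINE_MODULES mapping are illustrative, not part of modin.

```python
import importlib.util

# Top-level import names of the three engine packages (assumed to match the extras).
ENGINE_MODULES = {"ray": "ray", "dask": "dask", "unidist": "unidist"}


def installed_engines(find_spec=importlib.util.find_spec) -> list[str]:
    """List engine names whose module can be located, without importing them."""
    return [
        name
        for name, module in ENGINE_MODULES.items()
        if find_spec(module) is not None
    ]


engines = installed_engines()
if engines:
    print("available engines:", ", ".join(engines))
else:
    print('no engine found; install one, e.g. pip install "modin[ray]"')
```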

Common companion packages alongside modin:

bash
pip install "modin[ray]" pandas pyarrow matplotlib jupyter

Output: installs modin (Ray backend) + the usual analytical stack — pandas is already pulled in, listed here for the pinned version

Alternatives

| Package | One-line trade-off |
|---|---|
| pandas | single-threaded, simpler, smaller install |
| polars | faster than modin in most benchmarks, but new API |
| dask.dataframe | distributed across a cluster, higher orchestration overhead |
| cuDF (RAPIDS) | GPU-accelerated pandas-like API, NVIDIA-only |
| pyspark.pandas | pandas API on Spark; JVM dependency, cluster-oriented |
| duckdb | SQL-first; complements rather than replaces pandas |

Common gotchas

  • Backend extras are required. pip install modin without [ray], [dask], or [unidist] leaves no execution engine installed, and modin raises an ImportError asking you to install one (at import or at first use, depending on the version). The message is clear but still trips first-time installers.
  • Not all pandas APIs are accelerated. Unsupported ops fall back to single-threaded pandas and emit a "defaulting to pandas" UserWarning. Watch for those warnings: that is where your missing speedup lives.
  • Ray processes can outlive the script. The first modin operation (for example pd.read_csv) starts a local Ray runtime; worker processes can linger after python script.py ends, particularly after an abnormal exit, unless you explicitly call ray.shutdown(). CI runners with short timeouts notice.
  • Mixing modin.pandas and pandas. Passing a modin DataFrame to a function that does pd.DataFrame(...) (where pd is the real pandas) silently round-trips through pandas, losing partitioning. Use the same pd alias everywhere in a pipeline.
  • Memory amplification with Ray. Ray serialises partitions through Plasma object store — peak RAM can be 2–3× the resident pandas equivalent for short operations. Watch MODIN_MEMORY and RAY_OBJECT_STORE_MEMORY.
  • Windows + Ray. Ray on Windows is supported but historically the least-tested combination. WSL is the smoother path.
  • engine="modin" is not a thing. modin replaces the import; you do not pass it to pandas. Writing import modin.pandas as pd is the whole switch.
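For the lingering-Ray gotcha above, one option is to register a shutdown hook at interpreter exit. This is a sketch under the assumption that Ray is the active engine; the function name shutdown_ray_if_running is illustrative, and the hook is a no-op when Ray is not installed or was never started.

```python
import atexit


def shutdown_ray_if_running() -> bool:
    """Shut down the local Ray runtime if Ray is installed and initialised.

    Returns True only when a shutdown was actually issued.
    """
    try:
        import ray  # present only when the Ray engine is installed
    except ImportError:
        return False
    if ray.is_initialized():
        ray.shutdown()
        return True
    return False


# Runs once when the interpreter exits normally.
atexit.register(shutdown_ray_if_running)
```

An atexit hook does not fire when the process is killed, so CI jobs may still want an explicit cleanup step.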

Real-world recipes

modin's value proposition is "the one-line speedup" — the recipes below show the scenarios where that promise actually pays off. The companion sections/python/modin covers the API; here the focus is on packaging and backend choice.

Drop-in pandas accelerator on a fat-laptop CSV read — the canonical "this is why modin exists" pattern. A 4 GB CSV that takes 90 seconds in pandas often completes in 15-25 seconds in modin (Ray backend) on a modern 8-core machine.

python
import modin.pandas as pd

df = pd.read_csv("events.csv")
agg = df.groupby("user_id").agg({"revenue": "sum", "session_id": "nunique"})
print(agg.head())

Output: the same DataFrame.groupby result as pandas; under the hood Ray distributes the file read and the group-by across all CPU cores

Switching backends without changing application code:

python
import os
os.environ["MODIN_ENGINE"] = "dask"   # or "ray" / "unidist" / "python"

import modin.pandas as pd
df = pd.read_csv("events.csv")

Output: the import-time environment lookup picks the backend; same code runs on Ray today and Dask tomorrow

Hybrid pipeline: modin for the heavy I/O, pandas for the model-fitting boundary — sklearn does not understand modin frames, so most pipelines materialise back to pandas before the fit call:

python
import modin.pandas as mpd
from sklearn.linear_model import LogisticRegression

df = mpd.read_csv("training.csv")

# Drop incomplete rows once, across features AND target, so X and y stay row-aligned
cols = ["age", "income", "is_premium", "churned"]
clean = df[cols].dropna()

# Pull back to pandas for the sklearn boundary (_to_pandas() is a private modin helper)
X = clean[["age", "income", "is_premium"]]._to_pandas()
y = clean["churned"]._to_pandas()

model = LogisticRegression().fit(X.to_numpy(), y.to_numpy())
print(model.coef_)

Output: model coefficients; the CSV read and dropna run on Ray, the fit runs on a single core (sklearn does not consume modin natively)

Performance tuning

modin performance is dominated by partition count and memory budget. The defaults are reasonable on a 16 GB / 8-core laptop; production tuning means matching partitions to cores and giving Ray enough object-store space.

python
import modin.config as cfg

cfg.NPartitions.put(8)              # one partition per core, typically
cfg.CpuCount.put(8)                 # cap CPUs modin uses
cfg.Engine.put("ray")               # explicit engine choice

Output: no return; subsequent operations honour the new settings
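The same settings can be derived from the machine instead of hard-coded and passed through environment variables, which modin reads when it is first imported. A minimal sketch: the helper modin_env_for is hypothetical, and the one-partition-per-core default simply mirrors the rule of thumb above.

```python
import os


def modin_env_for(cores: int, engine: str = "ray", partitions_per_core: int = 1) -> dict[str, str]:
    """Build modin environment settings for a machine with `cores` CPUs."""
    cores = max(1, cores)
    return {
        "MODIN_ENGINE": engine,
        "MODIN_CPUS": str(cores),
        "MODIN_NPARTITIONS": str(cores * partitions_per_core),
    }


settings = modin_env_for(os.cpu_count() or 1)

# Apply before the first `import modin.pandas`; setdefault keeps values
# that the caller has already exported in the shell.
for key, value in settings.items():
    os.environ.setdefault(key, value)

print(settings)
```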

Tuning levers, ordered by impact:

| Lever | Mechanism | When it helps |
|---|---|---|
| MODIN_ENGINE env / cfg.Engine.put | choose Ray / Dask / Unidist | already running one of these schedulers |
| cfg.NPartitions.put(N) | partition count | uneven CPU usage |
| RAY_OBJECT_STORE_MEMORY env | Plasma store sizing | OOMs during group-by / merge |
| cfg.RangePartitioning.put(True) | better skew handling | wide-skew group-by keys |
| cfg.BenchmarkMode.put(True) | sync execution for clean timings | benchmark scripts only |

Confirming a step did not fall back to pandas:

python
import warnings
import modin.pandas as pd

with warnings.catch_warnings():
    warnings.simplefilter("error", UserWarning)
    try:
        df = pd.read_csv("events.csv")
        result = df.pivot_table(index="user_id", values="revenue", aggfunc="sum")
        print(result.head())
    except UserWarning as exc:
        print(f"Modin defaulted to pandas at: {exc}")

Output: the pivot-table head when every step ran in modin; otherwise the text of the first UserWarning raised, typically a fallback to pandas, which is the single most common reason "modin didn't speed up my code"
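When you would rather log fallbacks than abort on the first one, the warnings can be recorded instead of turned into errors. The sketch below assumes the fallback message contains the phrase "defaulting to pandas" (matched case-insensitively); the helper name and the stand-in step are illustrative so the block runs without modin installed.

```python
import warnings
from typing import Any, Callable, List, Tuple


def run_and_collect_fallbacks(step: Callable[[], Any]) -> Tuple[Any, List[str]]:
    """Run `step` and return its result plus any pandas-fallback warning texts."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record repeated warnings too
        result = step()
    fallbacks = [
        str(w.message)
        for w in caught
        if issubclass(w.category, UserWarning)
        and "defaulting to pandas" in str(w.message).lower()
    ]
    return result, fallbacks


def fake_step() -> str:
    # Stand-in for a modin call; the warning text here is made up.
    warnings.warn("pivot_table defaulting to pandas implementation.", UserWarning)
    return "done"


result, fallbacks = run_and_collect_fallbacks(fake_step)
print(result, fallbacks)
```

In a real pipeline, step would be a lambda wrapping one modin operation, and a non-empty list marks that operation as a candidate for rewriting.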

Memory & dataset-size scaling

modin partitions the frame and distributes work across processes; on a single machine the partition data lives in Ray's Plasma object store (shared memory) or Dask's worker memory. Peak memory is typically 2-3x the equivalent pandas resident set because partitions hold copies during shuffles.

python
import modin.config as cfg
import modin.experimental.pandas as xpd   # experimental API; provides read_csv_glob

cfg.NPartitions.put(16)

# Read a 30 GB CSV directory in parallel; plain pd.read_csv does not expand globs
df = xpd.read_csv_glob("data/*.csv")
print(df.shape)

Output: the full frame shape; the read is parallel across files, dispatched by Ray
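The 2-3x rule of thumb above turns into a quick back-of-the-envelope check before a run. This is only arithmetic on the stated multiplier, not a measurement; both function names are illustrative.

```python
def peak_memory_estimate_gb(
    pandas_resident_gb: float, low: float = 2.0, high: float = 3.0
) -> tuple[float, float]:
    """Scale the pandas resident size by the 2-3x rule of thumb."""
    return pandas_resident_gb * low, pandas_resident_gb * high


def fits_in_ram(pandas_resident_gb: float, ram_gb: float) -> bool:
    """True when even the pessimistic (3x) estimate fits in memory."""
    _, worst_case = peak_memory_estimate_gb(pandas_resident_gb)
    return worst_case <= ram_gb


low, high = peak_memory_estimate_gb(4.0)
print(f"expect roughly {low:.0f}-{high:.0f} GB peak")  # expect roughly 8-12 GB peak
print(fits_in_ram(4.0, ram_gb=16.0))                   # True
```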

For data that genuinely exceeds RAM:

  • Out-of-core behaviour depends on the backend. Dask spills partitions to disk under memory pressure. Recent Ray versions also spill objects to disk once the object store fills; size the store with RAY_OBJECT_STORE_MEMORY and adjust the spill location through Ray's object_spilling_config.
  • The streaming pattern is manual chunking: split the input into files explicitly, process each file with a modin reader, and concatenate the partial results at the end.
  • Past ~100 GB on a single node, switch to polars-streaming or duckdb — modin's overhead per partition starts to eat the win.
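The split, process, concatenate pattern from the list above can be sketched as a small generic loop. The function name process_file_by_file and its callables are hypothetical; the demonstration uses plain lists as stand-ins for frames so it runs without modin.

```python
from typing import Any, Callable, Iterable, List


def process_file_by_file(
    paths: Iterable[str],
    read_one: Callable[[str], Any],
    reduce_one: Callable[[Any], Any],
    combine: Callable[[List[Any]], Any],
) -> Any:
    """Read, reduce, and discard one file at a time, then merge the partials."""
    partials = []
    for path in paths:
        frame = read_one(path)              # only one file's frame is alive at a time
        partials.append(reduce_one(frame))  # keep just the small per-file result
        del frame                           # drop the reference before the next read
    return combine(partials)


# Stand-in data: each "file" is a list of numbers.
fake_files = {"a.csv": [1, 2, 3], "b.csv": [10, 20]}
total = process_file_by_file(sorted(fake_files), fake_files.__getitem__, sum, sum)
print(total)  # 36
```

With modin, read_one would be pd.read_csv, reduce_one something like lambda df: df.groupby("user_id")["revenue"].sum(), and combine could be lambda parts: pd.concat(parts).groupby(level=0).sum().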

Version migration guide

modin's 0.x versioning understates its production stability; the leading zero is historical. In practice the migration story has two parts: modin upgrades drag pandas along with them, and backends occasionally get renamed.

Recent changes worth knowing:

  • 0.26 → 0.27: the unidist engine matured; MODIN_ENGINE=unidist became viable for HPC-shaped workloads.
  • 0.28 → 0.30: dropped the omnisci backend (now hdk); if you were on omnisci, the rename matters at install time (pip install "modin[hdk]").
  • 0.30 → 0.32: stricter pandas pin range; older modin won't install with pandas 2.3.
  • 0.32+: dropped support for Python 3.9; match your project's Python floor before upgrading.

python
import modin
import pandas

print("modin:", modin.__version__)
print("pandas:", pandas.__version__)   # which pandas modin is delegating to

Output: the two versions side by side — pin both in requirements.txt because modin's fallback path is pandas

API parity: every modin minor release closes gaps with the latest pandas. Check the defaulting to pandas warnings before assuming an op is accelerated — the parity matrix changes between releases.

Interop with adjacent ecosystems

modin's promise is full pandas-API compatibility, so interop generally means converting to pandas at the boundary with downstream libraries that do not understand modin partitions.

| Convert from / to | How | Cost |
|---|---|---|
| modin → pandas | mdf._to_pandas() | Collects all partitions onto the driver process |
| modin → NumPy | mdf.to_numpy() | Collect + convert; not zero-copy |
| modin → Arrow | pyarrow.Table.from_pandas(mdf._to_pandas()) (no direct path) | Two copies; usually go through pandas |
| modin → polars | round-trip via pandas/Arrow | Two copies; consider polars-only if heavy |
| modin → duckdb | bind the pandas frame to a name, pdf = mdf._to_pandas(), then duckdb.sql("SELECT * FROM pdf") | One copy |
| modin → sklearn | .to_numpy() at fit-time | sklearn is single-process anyway |
| modin → Spark | pyspark.pandas.from_pandas(mdf._to_pandas()) | Cross-cluster shuffle |

python
import modin.pandas as mpd

mdf = mpd.read_csv("events.csv")

# Boundary to plain pandas
pdf = mdf._to_pandas()
print(type(pdf))

Output: <class 'pandas.core.frame.DataFrame'> — the partitions have been collected onto the driver process
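Library code that may receive either kind of frame can normalise at the boundary. The sketch below detects modin objects by their defining module and, as far as I know, modin.utils.to_pandas is the public counterpart of _to_pandas(); treat that call as an assumption to verify against your modin version. The helper names are illustrative.

```python
from typing import Any


def is_modin_object(obj: Any) -> bool:
    """True when the object's class is defined inside the modin package."""
    module = type(obj).__module__ or ""
    return module == "modin" or module.startswith("modin.")


def as_pandas(obj: Any) -> Any:
    """Collect modin objects to pandas; pass everything else through untouched."""
    if is_modin_object(obj):
        from modin.utils import to_pandas  # imported lazily, only when needed

        return to_pandas(obj)
    return obj


# Non-modin inputs are returned unchanged.
print(as_pandas([1, 2, 3]))
```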

Troubleshooting common errors

The list below catalogues the failure modes that send people to the modin issue tracker. Each has a fast remediation; the underlying cause is usually backend configuration rather than the modin layer itself.

  • ModuleNotFoundError: No module named 'ray'. The backend extra is missing. Fix: pip install "modin[ray]".
  • UserWarning: Distributing <object> object. This may take some time. — you passed a pandas object into modin; it is distributing it. Either accept the one-time cost or read directly via mpd.read_*.
  • UserWarning: Defaulting to pandas implementation — the op has no modin kernel yet. Either accept the single-core penalty or rewrite the op to use accelerated primitives.
  • Ray workers OOM on group-by — Plasma object store too small. RAY_OBJECT_STORE_MEMORY=... in bytes, or pass object_store_memory= to ray.init.
  • PermissionError: [Errno 13] on Windows. Ray on Windows is fussier about temp dirs; set RAY_TMPDIR=C:\Users\...\Temp\ray.
  • Ray processes linger after the script. Call ray.shutdown() explicitly at exit; CI runners can hang otherwise.
  • SettingWithCopyWarning — modin inherits pandas semantics; same fix (.loc[mask, col] = ...).
  • Pickled modin DataFrame won't load — modin partition references break across machines. Always pickle via mdf._to_pandas() for cross-machine transport.

When NOT to use this

modin's narrow sweet spot — "speed up existing pandas code" — has a few cases where the trade-off is wrong.

  • You can rewrite to polars: polars is faster than modin in most benchmarks and has a cleaner API. If you have the time to migrate, polars is the more durable answer.
  • Small data (< 1 GB): the partition overhead dominates. Plain pandas is faster.
  • Cluster-scale (TB+): modin is single-node optimised. Use ray-data, pyspark.pandas, or dask.dataframe directly.
  • GPU inference: cuDF is the right pandas-shaped GPU library.
  • Heavy use of apply with Python UDFs: these almost always fall back to pandas, defeating the parallelism. Vectorise first, then evaluate modin.
  • You need every pandas API to work: modin's parity is partial. Audit the defaulting to pandas warnings before betting a production pipeline on it.
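The size thresholds scattered through this sheet can be folded into one rough decision helper. This only encodes the rules of thumb stated here (under 1 GB, past about 100 GB, TB scale); the function name and the exact cut-offs are illustrative, not a benchmark result.

```python
def suggest_tool(size_gb: float, can_rewrite: bool = False) -> str:
    """Map a dataset size to the tool this cheat sheet would reach for first."""
    if can_rewrite:
        return "polars"  # a rewrite is affordable, so take the faster engine
    if size_gb < 1:
        return "pandas"  # partition overhead dominates on small data
    if size_gb >= 1000:
        return "cluster tool (ray-data, pyspark.pandas, dask.dataframe)"
    if size_gb > 100:
        return "polars streaming or duckdb"
    return "modin"


for size in (0.2, 4, 250, 2000):
    print(size, "GB ->", suggest_tool(size))
```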
