cheat sheet
modin
Speed up pandas workloads across all CPU cores with a one-line import swap. Covers Ray and Dask backends, config tuning, pandas interop, and when modin wins vs polars.
modin — Drop-in pandas at Scale
What it is
Modin is a DataFrame library that parallelises pandas operations across all available CPU cores with a single import change. It replaces import pandas as pd with import modin.pandas as pd and intercepts the same API, distributing work across partitions using Ray (default), Dask, or Unidist as the execution engine. Code that already works with pandas runs on modin without modification; modin falls back to pandas transparently for unsupported operations. It is the lowest-friction path for speeding up existing pandas pipelines on multi-core machines.
Install
pip install modin[ray] # Ray backend (recommended, best out-of-the-box speed)
pip install modin[dask] # Dask backend (useful if Dask is already in your stack)
pip install modin[all] # all backends
Output: (none — exits 0 on success)
Quick example
import modin.pandas as pd # only this line changes
df = pd.read_csv("large.csv") # reads in parallel across cores
print(df.groupby("category")["value"].mean())
Output:
category
electronics 299.42
furniture 189.10
software 149.99
Name: value, dtype: float64
When / why to use it
- Existing pandas codebase you want to speed up without rewriting to polars or Dask.
- Multi-core machines (8+ cores) where single-threaded pandas leaves most hardware idle.
- Large CSVs or Parquet files where
read_csv/read_parquetis the bottleneck. - Teams that cannot retrain on a new DataFrame API.
- Ad-hoc exploration where polars' expression-based API feels unfamiliar.
Common pitfalls
Mixing modin and pandas DataFrames — passing a modin DataFrame to a function that calls
pd.DataFrame()with the stockpandasimport creates a plain pandas DataFrame, silently losing the distributed partitioning. Keep all objects in the same library within a pipeline.
Small DataFrames are slower — modin has coordination overhead from the execution engine. For DataFrames under ~100 MB, plain pandas is often faster. Modin's parallelism pays off only at scale.
inplace=Trueis unreliable — modin does not guarantee thatinplace=Truemodifies the original object. Reassign instead:df = df.dropna().
Unsupported operations fall back silently — modin falls back to pandas for ~15% of the API surface. The fallback is correct but single-threaded. Check
modin.utils.has_metadataor watch forUserWarning: Distributing...messages to understand what's running in parallel.
Set
MODIN_ENGINE=ray(ordask) as an environment variable before importing modin to control the backend without touching code. This makes it easy to switch backends per-environment.
df._to_pandas()extracts a plain pandas DataFrame from a modin object. Use it when you need to pass data to a library that does not accept modin DataFrames.
Richer example — multi-file CSV aggregation
import modin.pandas as pd
import glob
# Read and concat multiple large CSVs in parallel
files = glob.glob("data/logs_*.csv")
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)
# Group by date and action, count events
summary = (
df.groupby(["date", "action"])["user_id"]
.count()
.reset_index(name="events")
.sort_values("events", ascending=False)
)
print(summary.head(5))
Output:
date action events
3 2026-01-03 page_view 48291
1 2026-01-01 page_view 45982
7 2026-01-07 add_cart 38741
2 2026-01-02 search 33019
5 2026-01-05 page_view 31284
Backend selection
Modin supports three execution engines. The backend is selected once at import time (or via environment variable) and cannot be changed mid-session.
# Option 1 — environment variable (set before Python starts)
# MODIN_ENGINE=ray python script.py
# Option 2 — modin.config (set before first import of modin.pandas)
import modin.config as cfg
cfg.Engine.put("ray") # or "dask", "unidist"
import modin.pandas as pd
# Option 3 — Ray directly (start cluster before modin import)
import ray
ray.init(num_cpus=8) # or ray.init(address="auto") for remote cluster
import modin.pandas as pd
| Backend | Best for |
|---|---|
| Ray | Single machine, best default performance, easy install |
| Dask | Already using Dask, distributed cluster, lower memory overhead |
| Unidist | MPI environments, HPC clusters |
modin.config — tuning parameters
modin.config exposes runtime knobs for partition size, engine, and logging. Changes must be made before importing modin.pandas.
import modin.config as cfg
cfg.Engine.put("ray")
cfg.NPartitions.put(16) # override number of row partitions (default: nCPU)
cfg.RayRedisAddress.put(None) # use local Ray cluster
cfg.BenchmarkMode.put(False) # set True to disable fallback (raises instead of falling back)
import modin.pandas as pd
Check current values:
import modin.config as cfg
print(cfg.Engine.get()) # ray
print(cfg.NPartitions.get()) # 8 (on 8-core machine)
Interop with pandas
Converting between modin and pandas is explicit and inexpensive — both share Arrow-compatible memory when possible.
import modin.pandas as mpd
import pandas as pd
# modin → pandas
df_modin = mpd.read_csv("data.csv")
df_pandas = df_modin._to_pandas()
print(type(df_pandas)) # <class 'pandas.core.frame.DataFrame'>
# pandas → modin
df_back = mpd.DataFrame(df_pandas)
print(type(df_back)) # <class 'modin.pandas.dataframe.DataFrame'>
# Use modin in a function that expects pandas
import numpy as np
result = pd.DataFrame(np.corrcoef(df_modin["a"], df_modin["b"])) # works via __array__
Operations that fall back to pandas
Modin handles the most common operations natively. The following operations silently fall back to single-threaded pandas:
| Operation | Status |
|---|---|
read_csv, read_parquet, read_excel | Parallel ✅ |
groupby().agg() | Parallel ✅ |
merge / join | Parallel ✅ |
sort_values | Parallel ✅ |
apply(axis=0) (column-wise) | Parallel ✅ |
apply(axis=1) (row-wise) | Fallback ⚠️ |
iterrows, itertuples | Fallback ⚠️ |
to_sql | Fallback ⚠️ |
Custom groupby().apply() | Fallback ⚠️ |
MultiIndex operations | Partial ⚠️ |
Row-wise
apply(axis=1)is the most common performance trap. If you need per-row custom logic at scale, move to polars'map_elementsor vectorise with numpy instead.
Benchmarks — when modin wins
Modin's speedup is proportional to core count and dataset size. Rough guidelines:
| Scenario | modin vs pandas |
|---|---|
read_csv (2 GB, 8 cores) | ~4–6× faster |
groupby().agg() (50 M rows) | ~3–5× faster |
merge (two 10 M row tables) | ~2–4× faster |
sort_values (10 M rows) | ~2–3× faster |
apply(axis=1) | ~1× (fallback) |
| DataFrame under 10 MB | ~0.5–0.8× (overhead dominates) |
For datasets over ~10 GB or when you need strict performance guarantees, polars is typically faster than modin because its Rust/Arrow core has lower coordination overhead than modin's Ray-based partitioning.
Dropping back to pandas for a single operation
When you know an operation falls back anyway, it is cleaner to extract pandas explicitly for that step:
import modin.pandas as mpd
df = mpd.read_parquet("large.parquet") # parallel read
# Complex custom groupby — do it in pandas
pandas_df = df._to_pandas()
result = pandas_df.groupby("user_id").apply(
lambda g: g.nlargest(3, "value")
)
# Continue in modin
df_result = mpd.DataFrame(result.reset_index(drop=True))
Partition inspection
import modin.pandas as pd
df = pd.read_csv("data.csv")
# Number of row partitions
from modin.config import NPartitions
print(NPartitions.get())
# Underlying pandas partitions (for debugging)
parts = df._query_compiler._modin_frame._partitions
print(f"Partition grid: {len(parts)} rows × {len(parts[0])} cols")
Backend comparison — ray vs dask vs unidist
Each backend has different operational characteristics. Match the backend to your environment and existing infrastructure rather than chasing micro-benchmarks.
| Aspect | Ray | Dask | Unidist |
|---|---|---|---|
| Install footprint | ~150 MB | ~25 MB | ~10 MB (MPI required separately) |
| Single-machine speed | Best on most workloads | ~10–20% slower than Ray | Comparable to Ray with MPI tuning |
| Distributed setup | ray.init(address="...") | Client(address="...") | mpirun-driven launch |
| Memory overhead per worker | ~200 MB | ~80 MB | ~50 MB |
| Plays well with | Ray Tune, Ray Serve, RLlib | dask.delayed, dask.dataframe, Coiled | OpenMPI/MPICH clusters, HPC schedulers |
| Best when | Default; ML stack; want one cluster for training+ETL | Already on Dask; need lower RAM ceiling | HPC environment with InfiniBand and Slurm |
# Side-by-side micro-benchmark scaffold
import os, time
def bench(engine: str, path: str):
os.environ["MODIN_ENGINE"] = engine
# Force reimport — clearing modin from sys.modules is required
import importlib, sys
for mod in [m for m in sys.modules if m.startswith("modin")]:
del sys.modules[mod]
import modin.pandas as pd
t0 = time.perf_counter()
df = pd.read_parquet(path)
n = len(df.groupby("category")["value"].sum())
return engine, time.perf_counter() - t0, n
# Run in a fresh subprocess for each engine — engine swap mid-process is brittle
print(bench("ray", "/data/events.parquet"))
print(bench("dask", "/data/events.parquet"))
Output:
('ray', 4.21, 17)
('dask', 5.07, 17)
Engine swapping in one process is fragile. Modin caches initialised executor handles. Spawn a fresh
subprocessper engine when benchmarking — clearingsys.modulesworks most of the time but is not officially supported.
Performance tuning
Modin scales well only when partition count and engine resources match the dataset and machine. Default behaviour is "one partition per CPU", which is rarely optimal.
import modin.config as cfg
import os
# 1. Partition count — for very large frames, oversubscribe slightly
cfg.NPartitions.put(os.cpu_count() * 2)
# 2. Min partition size — avoid creating tiny partitions on small frames
cfg.MinPartitionSize.put(32) # 32 MB per partition floor
# 3. Async read mode — overlap I/O with compute on multi-file reads
cfg.AsyncReadMode.put(True)
# 4. Storage format — pyarrow is faster than the pandas default for groupby
cfg.StorageFormat.put("pandas") # or "hdk" for in-memory analytics engine
# 5. Progress bar — show parallel task progress
cfg.ProgressBar.put(True)
import modin.pandas as pd
Output: (none — exits 0 on success)
Tuning checklist for slow modin workloads:
- Confirm parallelism is engaged. Watch for
UserWarning: Distributing <object>— its absence on a heavy op means modin fell back to pandas. - Inspect partition count.
cfg.NPartitions.get()should be ≥ CPU count for parallelism, but each partition should hold ≥ 32 MB to amortise overhead. - Profile the engine.
ray.timeline()(Ray) andclient.profile()(Dask) show where wall time goes. - Pin object refs. For repeatedly-used DataFrames, call
_to_pandas()once and reuse the pandas object rather than re-distributing on every access. - Use Parquet over CSV. Modin's parallel Parquet reader is 3–5× faster than its parallel CSV reader on the same data, mostly because Parquet stores row counts in metadata so partitioning is cheap.
Troubleshooting common errors
| Error / symptom | Cause | Fix |
|---|---|---|
ImportError: ray is not installed | Default backend selected without extras | pip install "modin[ray]" or set MODIN_ENGINE=dask |
UserWarning: Distributing <object> ... defaulting to pandas | Operation unsupported by modin | Expected for the ~15% fallback surface; rewrite using a supported op |
| Ray dashboard shows zero tasks during heavy op | Operation fell back to pandas silently | Search release notes for the op name; consider _to_pandas() for the step |
ray.exceptions.RayOutOfMemoryError | Partitions too large for worker RAM | Increase NPartitions, decrease MinPartitionSize, or set object_store_memory on ray.init() |
AttributeError: 'DataFrame' object has no attribute '_to_pandas' | You imported plain pandas, not modin | Check the import line at top of the file |
| First op takes 5+ seconds before any compute | Ray/Dask cluster startup | Pre-warm with ray.init() at app start; cache between runs |
Engine 'ray' is not initialized after fork | Forking with Ray running breaks workers | Initialise Ray after os.fork(); or use spawn-context multiprocessing |
| Memory grows monotonically across iterations | Ray object store fragmentation | Periodic ray.internal_kv._internal_kv_reset() or restart the driver between batches |
Migration from pandas
The full drop-in promise holds for most code, but a handful of patterns need adjustment for correctness and performance.
Step-by-step migration checklist:
- Replace the import.
import pandas as pd→import modin.pandas as pd. Do this at the top of every file, not at module level inside packages — modin must be initialised before forks. - Pin the backend. Export
MODIN_ENGINE=rayin your shell or setcfg.Engine.put("ray")before the firstimport modin.pandas. - Audit
inplace=Truecalls. Replace each with reassignment — modin does not guarantee in-place semantics. - Eliminate
.apply(axis=1). Row-wise apply is the most common silent fallback. Rewrite as vectorised numpy/arrow ops or asgroupby().agg(...)with named aggregators. - Re-test edge cases. Float precision, NaN handling, and sort stability differ slightly across pandas versions; modin tracks the most recent stable pandas API.
- Profile before claiming a win. On datasets under 100 MB, modin can be slower than pandas; the speedup only pays back at ~500 MB+.
# Before — single-threaded pandas
import pandas as pd
df = pd.read_csv("events.csv")
df.dropna(inplace=True) # ❌ in-place
df["score"] = df.apply(lambda r: r.a + r.b, axis=1) # ❌ row-wise apply
# After — modin-friendly equivalent
import modin.pandas as pd
df = pd.read_csv("events.csv")
df = df.dropna() # ✅ reassign
df["score"] = df["a"] + df["b"] # ✅ vectorised
Output: (none — exits 0 on success)
Real-world recipes
A handful of full pipelines that exercise modin's strengths — parallel I/O, parallel groupby, and pandas-compat for the long tail.
1. Multi-GB log aggregation
import modin.pandas as pd
import glob
paths = sorted(glob.glob("/var/log/app/events_*.parquet"))
df = pd.concat([pd.read_parquet(p) for p in paths])
# Two-level aggregation across ~50 M rows
hourly = (
df.assign(hour=df["timestamp"].dt.floor("h"))
.groupby(["hour", "endpoint"])
.agg(requests=("request_id", "count"),
p95_latency=("latency_ms", lambda s: s.quantile(0.95)))
.reset_index()
)
hourly.to_parquet("/var/log/app/hourly_rollup.parquet")
print(hourly.head(3))
Output:
hour endpoint requests p95_latency
0 2026-05-30 00:00:00 /v1/users 18421 142.0
1 2026-05-30 00:00:00 /v1/orders 3211 278.0
2 2026-05-30 01:00:00 /v1/users 17984 138.0
2. Joining two large tables
import modin.pandas as pd
users = pd.read_parquet("users.parquet") # 8 M rows
orders = pd.read_parquet("orders.parquet") # 80 M rows
joined = orders.merge(users, on="user_id", how="left")
spend = joined.groupby("country")["amount"].sum().sort_values(ascending=False)
print(spend.head(5))
Output:
country
US 18421094.10
DE 4218390.22
GB 3197102.55
FR 2811944.33
JP 2502887.10
Name: amount, dtype: float64
3. Mixed modin + pandas for unsupported ops
import modin.pandas as mpd
import pandas as pd
df = mpd.read_parquet("events.parquet") # parallel read
# Custom groupby.apply → fallback. Do it once in pandas explicitly.
pdf = df._to_pandas()
top3_per_user = pdf.groupby("user_id").apply(
lambda g: g.nlargest(3, "score")
).reset_index(drop=True)
back = mpd.DataFrame(top3_per_user) # back to parallel
out = back.groupby("country")["score"].mean()
print(out.head(3))
Output:
country
DE 87.4
FR 86.9
GB 85.2
Name: score, dtype: float64
4. Ray cluster — distributed read across nodes
import ray
ray.init(address="ray://head-node:10001") # connect to remote cluster
import modin.pandas as pd
df = pd.read_parquet("s3://my-bucket/events/year=2026/")
print(df.groupby("region")["amount"].sum())
Output:
region
emea 19821304.50
apac 8210444.10
amer 14328990.77
Name: amount, dtype: float64
Ecosystem integrations
Modin is API-compatible with pandas, so any library accepting a pandas DataFrame accepts a modin DataFrame via the __array__ and __dataframe__ protocols — but some integrations are worth calling out.
| Library | Integration notes |
|---|---|
| Ray Data | ray.data.from_modin(df) and ds.to_modin() round-trip without re-distribution |
| dask.dataframe | When MODIN_ENGINE=dask, df._to_pandas() falls back via the dask scheduler |
| DuckDB | duckdb.from_df(df._to_pandas()); DuckDB outperforms modin on most analytical SQL |
| cuDF | GPU-pandas; no direct interop — convert via df._to_pandas() then cudf.from_pandas(...) |
| PyArrow | df.to_arrow() returns a pyarrow.Table; zero-copy if storage format is "hdk" |
| scikit-learn | Pass df._to_pandas() — sklearn does not understand modin's deferred partitions |
| polars | pl.from_pandas(df._to_pandas()); for >10 GB workloads, polars typically beats modin |
When NOT to use this
Modin's parallel coordination overhead is real, and there are workloads where it costs more than it earns.
- DataFrames under ~100 MB on disk. Plain pandas is consistently faster. The ratio approaches 1× around 500 MB.
- Workloads dominated by
apply(axis=1)or customgroupby().apply(). These fall back to single-threaded pandas anyway. Modin adds setup cost with no parallel return. - Tight latency budgets. Ray and Dask take ~2–5 seconds to start their first executor. Cold-start sensitive code paths (web request handlers, serverless functions) are a poor fit.
- You need polars' lazy planner. Polars compiles a lazy plan and pushes filters down; modin executes eagerly like pandas.
- You need strict GPU acceleration. Use cuDF directly.
- Strict pandas API parity is required. Modin tracks pandas but lags by a few months on new API surface; some niche operations don't roundtrip.
Quick reference
| Task | Code |
|---|---|
| Import swap | import modin.pandas as pd |
| Set backend | modin.config.Engine.put("ray") |
| Set partitions | modin.config.NPartitions.put(16) |
| To pandas | df._to_pandas() |
| From pandas | modin.pandas.DataFrame(pandas_df) |
| Read CSV parallel | pd.read_csv("file.csv") |
| Read Parquet parallel | pd.read_parquet("file.parquet") |
| groupby (parallel) | df.groupby("k").agg({"v": "sum"}) |
| merge (parallel) | pd.merge(left, right, on="id") |
| Check engine | modin.config.Engine.get() |
| Start Ray manually | ray.init(num_cpus=N) before import |