cheat sheet

modin

Speed up pandas workloads across all CPU cores with a one-line import swap. Covers Ray and Dask backends, config tuning, pandas interop, and when modin wins vs polars.

updated 04-27-2026

modin — Drop-in pandas at Scale

What it is

Modin is a DataFrame library that parallelises pandas operations across all available CPU cores with a single import change. It replaces import pandas as pd with import modin.pandas as pd and intercepts the same API, distributing work across partitions using Ray (default), Dask, or Unidist as the execution engine. Code that already works with pandas runs on modin without modification; modin falls back to pandas transparently for unsupported operations. It is the lowest-friction path for speeding up existing pandas pipelines on multi-core machines.

Install

bash

pip install modin[ray]    # Ray backend (recommended, best out-of-the-box speed)
pip install modin[dask]   # Dask backend (useful if Dask is already in your stack)
pip install modin[all]    # all backends

Output: (none — exits 0 on success)

Quick example

python

import modin.pandas as pd   # only this line changes

df = pd.read_csv("large.csv")   # reads in parallel across cores
print(df.groupby("category")["value"].mean())

Output:

text

category
electronics    299.42
furniture      189.10
software       149.99
Name: value, dtype: float64

When / why to use it

Existing pandas codebase you want to speed up without rewriting to polars or Dask.
Multi-core machines (8+ cores) where single-threaded pandas leaves most hardware idle.
Large CSVs or Parquet files where read_csv / read_parquet is the bottleneck.
Teams that cannot retrain on a new DataFrame API.
Ad-hoc exploration where polars' expression-based API feels unfamiliar.

Common pitfalls

Mixing modin and pandas DataFrames — passing a modin DataFrame to a function that calls pd.DataFrame() with the stock pandas import creates a plain pandas DataFrame, silently losing the distributed partitioning. Keep all objects in the same library within a pipeline.

Small DataFrames are slower — modin has coordination overhead from the execution engine. For DataFrames under ~100 MB, plain pandas is often faster. Modin's parallelism pays off only at scale.

inplace=True is unreliable — modin does not guarantee that inplace=True modifies the original object. Reassign instead: df = df.dropna().

Unsupported operations fall back silently — modin falls back to pandas for ~15% of the API surface. The fallback is correct but single-threaded. Check modin.utils.has_metadata or watch for UserWarning: Distributing... messages to understand what's running in parallel.

Set MODIN_ENGINE=ray (or dask) as an environment variable before importing modin to control the backend without touching code. This makes it easy to switch backends per-environment.

df._to_pandas() extracts a plain pandas DataFrame from a modin object. Use it when you need to pass data to a library that does not accept modin DataFrames.

Richer example — multi-file CSV aggregation

python

import modin.pandas as pd
import glob

# Read and concat multiple large CSVs in parallel
files = glob.glob("data/logs_*.csv")
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Group by date and action, count events
summary = (
    df.groupby(["date", "action"])["user_id"]
    .count()
    .reset_index(name="events")
    .sort_values("events", ascending=False)
)
print(summary.head(5))

Output:

text

         date      action  events
3  2026-01-03  page_view   48291
1  2026-01-01  page_view   45982
7  2026-01-07    add_cart   38741
2  2026-01-02     search   33019
5  2026-01-05  page_view   31284

Backend selection

Modin supports three execution engines. The backend is selected once at import time (or via environment variable) and cannot be changed mid-session.

python

# Option 1 — environment variable (set before Python starts)
# MODIN_ENGINE=ray python script.py

# Option 2 — modin.config (set before first import of modin.pandas)
import modin.config as cfg
cfg.Engine.put("ray")          # or "dask", "unidist"
import modin.pandas as pd

# Option 3 — Ray directly (start cluster before modin import)
import ray
ray.init(num_cpus=8)           # or ray.init(address="auto") for remote cluster
import modin.pandas as pd

Backend	Best for
Ray	Single machine, best default performance, easy install
Dask	Already using Dask, distributed cluster, lower memory overhead
Unidist	MPI environments, HPC clusters

modin.config — tuning parameters

modin.config exposes runtime knobs for partition size, engine, and logging. Changes must be made before importing modin.pandas.

python

import modin.config as cfg

cfg.Engine.put("ray")
cfg.NPartitions.put(16)          # override number of row partitions (default: nCPU)
cfg.RayRedisAddress.put(None)    # use local Ray cluster
cfg.BenchmarkMode.put(False)     # set True to disable fallback (raises instead of falling back)

import modin.pandas as pd

Check current values:

python

import modin.config as cfg
print(cfg.Engine.get())          # ray
print(cfg.NPartitions.get())     # 8 (on 8-core machine)

Interop with pandas

Converting between modin and pandas is explicit and inexpensive — both share Arrow-compatible memory when possible.

python

import modin.pandas as mpd
import pandas as pd

# modin → pandas
df_modin = mpd.read_csv("data.csv")
df_pandas = df_modin._to_pandas()
print(type(df_pandas))   # <class 'pandas.core.frame.DataFrame'>

# pandas → modin
df_back = mpd.DataFrame(df_pandas)
print(type(df_back))     # <class 'modin.pandas.dataframe.DataFrame'>

# Use modin in a function that expects pandas
import numpy as np
result = pd.DataFrame(np.corrcoef(df_modin["a"], df_modin["b"]))  # works via __array__

Operations that fall back to pandas

Modin handles the most common operations natively. The following operations silently fall back to single-threaded pandas:

Operation	Status
`read_csv`, `read_parquet`, `read_excel`	Parallel ✅
`groupby().agg()`	Parallel ✅
`merge` / `join`	Parallel ✅
`sort_values`	Parallel ✅
`apply(axis=0)` (column-wise)	Parallel ✅
`apply(axis=1)` (row-wise)	Fallback ⚠️
`iterrows`, `itertuples`	Fallback ⚠️
`to_sql`	Fallback ⚠️
Custom `groupby().apply()`	Fallback ⚠️
`MultiIndex` operations	Partial ⚠️

Row-wise apply(axis=1) is the most common performance trap. If you need per-row custom logic at scale, move to polars' map_elements or vectorise with numpy instead.

Benchmarks — when modin wins

Modin's speedup is proportional to core count and dataset size. Rough guidelines:

Scenario	modin vs pandas
`read_csv` (2 GB, 8 cores)	~4–6× faster
`groupby().agg()` (50 M rows)	~3–5× faster
`merge` (two 10 M row tables)	~2–4× faster
`sort_values` (10 M rows)	~2–3× faster
`apply(axis=1)`	~1× (fallback)
DataFrame under 10 MB	~0.5–0.8× (overhead dominates)

For datasets over ~10 GB or when you need strict performance guarantees, polars is typically faster than modin because its Rust/Arrow core has lower coordination overhead than modin's Ray-based partitioning.

Dropping back to pandas for a single operation

When you know an operation falls back anyway, it is cleaner to extract pandas explicitly for that step:

python

import modin.pandas as mpd

df = mpd.read_parquet("large.parquet")   # parallel read

# Complex custom groupby — do it in pandas
pandas_df = df._to_pandas()
result = pandas_df.groupby("user_id").apply(
    lambda g: g.nlargest(3, "value")
)

# Continue in modin
df_result = mpd.DataFrame(result.reset_index(drop=True))

Partition inspection

python

import modin.pandas as pd

df = pd.read_csv("data.csv")

# Number of row partitions
from modin.config import NPartitions
print(NPartitions.get())

# Underlying pandas partitions (for debugging)
parts = df._query_compiler._modin_frame._partitions
print(f"Partition grid: {len(parts)} rows × {len(parts[0])} cols")

Backend comparison — ray vs dask vs unidist

Each backend has different operational characteristics. Match the backend to your environment and existing infrastructure rather than chasing micro-benchmarks.

Aspect	Ray	Dask	Unidist
Install footprint	~150 MB	~25 MB	~10 MB (MPI required separately)
Single-machine speed	Best on most workloads	~10–20% slower than Ray	Comparable to Ray with MPI tuning
Distributed setup	`ray.init(address="...")`	`Client(address="...")`	mpirun-driven launch
Memory overhead per worker	~200 MB	~80 MB	~50 MB
Plays well with	Ray Tune, Ray Serve, RLlib	dask.delayed, dask.dataframe, Coiled	OpenMPI/MPICH clusters, HPC schedulers
Best when	Default; ML stack; want one cluster for training+ETL	Already on Dask; need lower RAM ceiling	HPC environment with InfiniBand and Slurm

python

# Side-by-side micro-benchmark scaffold
import os, time

def bench(engine: str, path: str):
    os.environ["MODIN_ENGINE"] = engine
    # Force reimport — clearing modin from sys.modules is required
    import importlib, sys
    for mod in [m for m in sys.modules if m.startswith("modin")]:
        del sys.modules[mod]
    import modin.pandas as pd
    t0 = time.perf_counter()
    df = pd.read_parquet(path)
    n = len(df.groupby("category")["value"].sum())
    return engine, time.perf_counter() - t0, n

# Run in a fresh subprocess for each engine — engine swap mid-process is brittle
print(bench("ray", "/data/events.parquet"))
print(bench("dask", "/data/events.parquet"))

Output:

text

('ray', 4.21, 17)
('dask', 5.07, 17)

Engine swapping in one process is fragile. Modin caches initialised executor handles. Spawn a fresh subprocess per engine when benchmarking — clearing sys.modules works most of the time but is not officially supported.

Performance tuning

Modin scales well only when partition count and engine resources match the dataset and machine. Default behaviour is "one partition per CPU", which is rarely optimal.

python

import modin.config as cfg
import os

# 1. Partition count — for very large frames, oversubscribe slightly
cfg.NPartitions.put(os.cpu_count() * 2)

# 2. Min partition size — avoid creating tiny partitions on small frames
cfg.MinPartitionSize.put(32)  # 32 MB per partition floor

# 3. Async read mode — overlap I/O with compute on multi-file reads
cfg.AsyncReadMode.put(True)

# 4. Storage format — pyarrow is faster than the pandas default for groupby
cfg.StorageFormat.put("pandas")  # or "hdk" for in-memory analytics engine

# 5. Progress bar — show parallel task progress
cfg.ProgressBar.put(True)

import modin.pandas as pd

Output: (none — exits 0 on success)

Tuning checklist for slow modin workloads:

Confirm parallelism is engaged. Watch for UserWarning: Distributing <object> — its absence on a heavy op means modin fell back to pandas.
Inspect partition count. cfg.NPartitions.get() should be ≥ CPU count for parallelism, but each partition should hold ≥ 32 MB to amortise overhead.
Profile the engine. ray.timeline() (Ray) and client.profile() (Dask) show where wall time goes.
Pin object refs. For repeatedly-used DataFrames, call _to_pandas() once and reuse the pandas object rather than re-distributing on every access.
Use Parquet over CSV. Modin's parallel Parquet reader is 3–5× faster than its parallel CSV reader on the same data, mostly because Parquet stores row counts in metadata so partitioning is cheap.

Troubleshooting common errors

Error / symptom	Cause	Fix
`ImportError: ray is not installed`	Default backend selected without extras	`pip install "modin[ray]"` or set `MODIN_ENGINE=dask`
`UserWarning: Distributing <object> ... defaulting to pandas`	Operation unsupported by modin	Expected for the ~15% fallback surface; rewrite using a supported op
Ray dashboard shows zero tasks during heavy op	Operation fell back to pandas silently	Search release notes for the op name; consider `_to_pandas()` for the step
`ray.exceptions.RayOutOfMemoryError`	Partitions too large for worker RAM	Increase `NPartitions`, decrease `MinPartitionSize`, or set `object_store_memory` on `ray.init()`
`AttributeError: 'DataFrame' object has no attribute '_to_pandas'`	You imported plain pandas, not modin	Check the `import` line at top of the file
First op takes 5+ seconds before any compute	Ray/Dask cluster startup	Pre-warm with `ray.init()` at app start; cache between runs
`Engine 'ray' is not initialized` after fork	Forking with Ray running breaks workers	Initialise Ray after `os.fork()`; or use spawn-context `multiprocessing`
Memory grows monotonically across iterations	Ray object store fragmentation	Periodic `ray.internal_kv._internal_kv_reset()` or restart the driver between batches

Migration from pandas

The full drop-in promise holds for most code, but a handful of patterns need adjustment for correctness and performance.

Step-by-step migration checklist:

Replace the import. import pandas as pd → import modin.pandas as pd. Do this at the top of every file, not at module level inside packages — modin must be initialised before forks.
Pin the backend. Export MODIN_ENGINE=ray in your shell or set cfg.Engine.put("ray") before the first import modin.pandas.
Audit inplace=True calls. Replace each with reassignment — modin does not guarantee in-place semantics.
Eliminate .apply(axis=1). Row-wise apply is the most common silent fallback. Rewrite as vectorised numpy/arrow ops or as groupby().agg(...) with named aggregators.
Re-test edge cases. Float precision, NaN handling, and sort stability differ slightly across pandas versions; modin tracks the most recent stable pandas API.
Profile before claiming a win. On datasets under 100 MB, modin can be slower than pandas; the speedup only pays back at ~500 MB+.

python

# Before — single-threaded pandas
import pandas as pd
df = pd.read_csv("events.csv")
df.dropna(inplace=True)                                # ❌ in-place
df["score"] = df.apply(lambda r: r.a + r.b, axis=1)    # ❌ row-wise apply

# After — modin-friendly equivalent
import modin.pandas as pd
df = pd.read_csv("events.csv")
df = df.dropna()                                       # ✅ reassign
df["score"] = df["a"] + df["b"]                        # ✅ vectorised

Output: (none — exits 0 on success)

Real-world recipes

A handful of full pipelines that exercise modin's strengths — parallel I/O, parallel groupby, and pandas-compat for the long tail.

1. Multi-GB log aggregation

python

import modin.pandas as pd
import glob

paths = sorted(glob.glob("/var/log/app/events_*.parquet"))
df = pd.concat([pd.read_parquet(p) for p in paths])

# Two-level aggregation across ~50 M rows
hourly = (
    df.assign(hour=df["timestamp"].dt.floor("h"))
      .groupby(["hour", "endpoint"])
      .agg(requests=("request_id", "count"),
           p95_latency=("latency_ms", lambda s: s.quantile(0.95)))
      .reset_index()
)
hourly.to_parquet("/var/log/app/hourly_rollup.parquet")
print(hourly.head(3))

Output:

text

                  hour  endpoint  requests  p95_latency
0  2026-05-30 00:00:00  /v1/users   18421       142.0
1  2026-05-30 00:00:00  /v1/orders   3211       278.0
2  2026-05-30 01:00:00  /v1/users   17984       138.0

2. Joining two large tables

python

import modin.pandas as pd

users = pd.read_parquet("users.parquet")          # 8 M rows
orders = pd.read_parquet("orders.parquet")        # 80 M rows

joined = orders.merge(users, on="user_id", how="left")
spend = joined.groupby("country")["amount"].sum().sort_values(ascending=False)
print(spend.head(5))

Output:

text

country
US    18421094.10
DE     4218390.22
GB     3197102.55
FR     2811944.33
JP     2502887.10
Name: amount, dtype: float64

3. Mixed modin + pandas for unsupported ops

python

import modin.pandas as mpd
import pandas as pd

df = mpd.read_parquet("events.parquet")            # parallel read

# Custom groupby.apply → fallback. Do it once in pandas explicitly.
pdf = df._to_pandas()
top3_per_user = pdf.groupby("user_id").apply(
    lambda g: g.nlargest(3, "score")
).reset_index(drop=True)

back = mpd.DataFrame(top3_per_user)                # back to parallel
out = back.groupby("country")["score"].mean()
print(out.head(3))

Output:

text

country
DE    87.4
FR    86.9
GB    85.2
Name: score, dtype: float64

4. Ray cluster — distributed read across nodes

python

import ray
ray.init(address="ray://head-node:10001")  # connect to remote cluster

import modin.pandas as pd
df = pd.read_parquet("s3://my-bucket/events/year=2026/")
print(df.groupby("region")["amount"].sum())

Output:

text

region
emea    19821304.50
apac     8210444.10
amer    14328990.77
Name: amount, dtype: float64

Ecosystem integrations

Modin is API-compatible with pandas, so any library accepting a pandas DataFrame accepts a modin DataFrame via the __array__ and __dataframe__ protocols — but some integrations are worth calling out.

Library	Integration notes
Ray Data	`ray.data.from_modin(df)` and `ds.to_modin()` round-trip without re-distribution
dask.dataframe	When `MODIN_ENGINE=dask`, `df._to_pandas()` falls back via the dask scheduler
DuckDB	`duckdb.from_df(df._to_pandas())`; DuckDB outperforms modin on most analytical SQL
cuDF	GPU-pandas; no direct interop — convert via `df._to_pandas()` then `cudf.from_pandas(...)`
PyArrow	`df.to_arrow()` returns a `pyarrow.Table`; zero-copy if storage format is `"hdk"`
scikit-learn	Pass `df._to_pandas()` — sklearn does not understand modin's deferred partitions
polars	`pl.from_pandas(df._to_pandas())`; for >10 GB workloads, polars typically beats modin

When NOT to use this

Modin's parallel coordination overhead is real, and there are workloads where it costs more than it earns.

DataFrames under ~100 MB on disk. Plain pandas is consistently faster. The ratio approaches 1× around 500 MB.
Workloads dominated by apply(axis=1) or custom groupby().apply(). These fall back to single-threaded pandas anyway. Modin adds setup cost with no parallel return.
Tight latency budgets. Ray and Dask take ~2–5 seconds to start their first executor. Cold-start sensitive code paths (web request handlers, serverless functions) are a poor fit.
You need polars' lazy planner. Polars compiles a lazy plan and pushes filters down; modin executes eagerly like pandas.
You need strict GPU acceleration. Use cuDF directly.
Strict pandas API parity is required. Modin tracks pandas but lags by a few months on new API surface; some niche operations don't roundtrip.

Quick reference

Task	Code
Import swap	`import modin.pandas as pd`
Set backend	`modin.config.Engine.put("ray")`
Set partitions	`modin.config.NPartitions.put(16)`
To pandas	`df._to_pandas()`
From pandas	`modin.pandas.DataFrame(pandas_df)`
Read CSV parallel	`pd.read_csv("file.csv")`
Read Parquet parallel	`pd.read_parquet("file.parquet")`
groupby (parallel)	`df.groupby("k").agg({"v": "sum"})`
merge (parallel)	`pd.merge(left, right, on="id")`
Check engine	`modin.config.Engine.get()`
Start Ray manually	`ray.init(num_cpus=N)` before import