cheat sheet

modin

Speed up pandas workloads across all CPU cores with a one-line import swap. Covers Ray and Dask backends, config tuning, pandas interop, and when modin wins vs polars.

modin — Drop-in pandas at Scale

What it is

Modin is a DataFrame library that parallelises pandas operations across all available CPU cores with a single import change. It replaces import pandas as pd with import modin.pandas as pd and intercepts the same API, distributing work across partitions using Ray (default), Dask, or Unidist as the execution engine. Code that already works with pandas runs on modin without modification; modin falls back to pandas transparently for unsupported operations. It is the lowest-friction path for speeding up existing pandas pipelines on multi-core machines.

Install

bash
pip install modin[ray]    # Ray backend (recommended, best out-of-the-box speed)
pip install modin[dask]   # Dask backend (useful if Dask is already in your stack)
pip install modin[all]    # all backends

Output: (none — exits 0 on success)

Quick example

python
import modin.pandas as pd   # only this line changes

df = pd.read_csv("large.csv")   # reads in parallel across cores
print(df.groupby("category")["value"].mean())

Output:

text
category
electronics    299.42
furniture      189.10
software       149.99
Name: value, dtype: float64

When / why to use it

  • Existing pandas codebase you want to speed up without rewriting to polars or Dask.
  • Multi-core machines (8+ cores) where single-threaded pandas leaves most hardware idle.
  • Large CSVs or Parquet files where read_csv / read_parquet is the bottleneck.
  • Teams that cannot retrain on a new DataFrame API.
  • Ad-hoc exploration where polars' expression-based API feels unfamiliar.

Common pitfalls

Mixing modin and pandas DataFrames — passing a modin DataFrame to a function that calls pd.DataFrame() with the stock pandas import creates a plain pandas DataFrame, silently losing the distributed partitioning. Keep all objects in the same library within a pipeline.

Small DataFrames are slower — modin has coordination overhead from the execution engine. For DataFrames under ~100 MB, plain pandas is often faster. Modin's parallelism pays off only at scale.

inplace=True is unreliable — modin does not guarantee that inplace=True modifies the original object. Reassign instead: df = df.dropna().

Unsupported operations fall back silently — modin falls back to pandas for ~15% of the API surface. The fallback is correct but single-threaded. Check modin.utils.has_metadata or watch for UserWarning: Distributing... messages to understand what's running in parallel.

Set MODIN_ENGINE=ray (or dask) as an environment variable before importing modin to control the backend without touching code. This makes it easy to switch backends per-environment.

df._to_pandas() extracts a plain pandas DataFrame from a modin object. Use it when you need to pass data to a library that does not accept modin DataFrames.

Richer example — multi-file CSV aggregation

python
import modin.pandas as pd
import glob

# Read and concat multiple large CSVs in parallel
files = glob.glob("data/logs_*.csv")
df = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

# Group by date and action, count events
summary = (
    df.groupby(["date", "action"])["user_id"]
    .count()
    .reset_index(name="events")
    .sort_values("events", ascending=False)
)
print(summary.head(5))

Output:

text
         date      action  events
3  2026-01-03  page_view   48291
1  2026-01-01  page_view   45982
7  2026-01-07    add_cart   38741
2  2026-01-02     search   33019
5  2026-01-05  page_view   31284

Backend selection

Modin supports three execution engines. The backend is selected once at import time (or via environment variable) and cannot be changed mid-session.

python
# Option 1 — environment variable (set before Python starts)
# MODIN_ENGINE=ray python script.py

# Option 2 — modin.config (set before first import of modin.pandas)
import modin.config as cfg
cfg.Engine.put("ray")          # or "dask", "unidist"
import modin.pandas as pd

# Option 3 — Ray directly (start cluster before modin import)
import ray
ray.init(num_cpus=8)           # or ray.init(address="auto") for remote cluster
import modin.pandas as pd
BackendBest for
RaySingle machine, best default performance, easy install
DaskAlready using Dask, distributed cluster, lower memory overhead
UnidistMPI environments, HPC clusters

modin.config — tuning parameters

modin.config exposes runtime knobs for partition size, engine, and logging. Changes must be made before importing modin.pandas.

python
import modin.config as cfg

cfg.Engine.put("ray")
cfg.NPartitions.put(16)          # override number of row partitions (default: nCPU)
cfg.RayRedisAddress.put(None)    # use local Ray cluster
cfg.BenchmarkMode.put(False)     # set True to disable fallback (raises instead of falling back)

import modin.pandas as pd

Check current values:

python
import modin.config as cfg
print(cfg.Engine.get())          # ray
print(cfg.NPartitions.get())     # 8 (on 8-core machine)

Interop with pandas

Converting between modin and pandas is explicit and inexpensive — both share Arrow-compatible memory when possible.

python
import modin.pandas as mpd
import pandas as pd

# modin → pandas
df_modin = mpd.read_csv("data.csv")
df_pandas = df_modin._to_pandas()
print(type(df_pandas))   # <class 'pandas.core.frame.DataFrame'>

# pandas → modin
df_back = mpd.DataFrame(df_pandas)
print(type(df_back))     # <class 'modin.pandas.dataframe.DataFrame'>

# Use modin in a function that expects pandas
import numpy as np
result = pd.DataFrame(np.corrcoef(df_modin["a"], df_modin["b"]))  # works via __array__

Operations that fall back to pandas

Modin handles the most common operations natively. The following operations silently fall back to single-threaded pandas:

OperationStatus
read_csv, read_parquet, read_excelParallel ✅
groupby().agg()Parallel ✅
merge / joinParallel ✅
sort_valuesParallel ✅
apply(axis=0) (column-wise)Parallel ✅
apply(axis=1) (row-wise)Fallback ⚠️
iterrows, itertuplesFallback ⚠️
to_sqlFallback ⚠️
Custom groupby().apply()Fallback ⚠️
MultiIndex operationsPartial ⚠️

Row-wise apply(axis=1) is the most common performance trap. If you need per-row custom logic at scale, move to polars' map_elements or vectorise with numpy instead.

Benchmarks — when modin wins

Modin's speedup is proportional to core count and dataset size. Rough guidelines:

Scenariomodin vs pandas
read_csv (2 GB, 8 cores)~4–6× faster
groupby().agg() (50 M rows)~3–5× faster
merge (two 10 M row tables)~2–4× faster
sort_values (10 M rows)~2–3× faster
apply(axis=1)~1× (fallback)
DataFrame under 10 MB~0.5–0.8× (overhead dominates)

For datasets over ~10 GB or when you need strict performance guarantees, polars is typically faster than modin because its Rust/Arrow core has lower coordination overhead than modin's Ray-based partitioning.

Dropping back to pandas for a single operation

When you know an operation falls back anyway, it is cleaner to extract pandas explicitly for that step:

python
import modin.pandas as mpd

df = mpd.read_parquet("large.parquet")   # parallel read

# Complex custom groupby — do it in pandas
pandas_df = df._to_pandas()
result = pandas_df.groupby("user_id").apply(
    lambda g: g.nlargest(3, "value")
)

# Continue in modin
df_result = mpd.DataFrame(result.reset_index(drop=True))

Partition inspection

python
import modin.pandas as pd

df = pd.read_csv("data.csv")

# Number of row partitions
from modin.config import NPartitions
print(NPartitions.get())

# Underlying pandas partitions (for debugging)
parts = df._query_compiler._modin_frame._partitions
print(f"Partition grid: {len(parts)} rows × {len(parts[0])} cols")

Backend comparison — ray vs dask vs unidist

Each backend has different operational characteristics. Match the backend to your environment and existing infrastructure rather than chasing micro-benchmarks.

AspectRayDaskUnidist
Install footprint~150 MB~25 MB~10 MB (MPI required separately)
Single-machine speedBest on most workloads~10–20% slower than RayComparable to Ray with MPI tuning
Distributed setupray.init(address="...")Client(address="...")mpirun-driven launch
Memory overhead per worker~200 MB~80 MB~50 MB
Plays well withRay Tune, Ray Serve, RLlibdask.delayed, dask.dataframe, CoiledOpenMPI/MPICH clusters, HPC schedulers
Best whenDefault; ML stack; want one cluster for training+ETLAlready on Dask; need lower RAM ceilingHPC environment with InfiniBand and Slurm
python
# Side-by-side micro-benchmark scaffold
import os, time

def bench(engine: str, path: str):
    os.environ["MODIN_ENGINE"] = engine
    # Force reimport — clearing modin from sys.modules is required
    import importlib, sys
    for mod in [m for m in sys.modules if m.startswith("modin")]:
        del sys.modules[mod]
    import modin.pandas as pd
    t0 = time.perf_counter()
    df = pd.read_parquet(path)
    n = len(df.groupby("category")["value"].sum())
    return engine, time.perf_counter() - t0, n

# Run in a fresh subprocess for each engine — engine swap mid-process is brittle
print(bench("ray", "/data/events.parquet"))
print(bench("dask", "/data/events.parquet"))

Output:

text
('ray', 4.21, 17)
('dask', 5.07, 17)

Engine swapping in one process is fragile. Modin caches initialised executor handles. Spawn a fresh subprocess per engine when benchmarking — clearing sys.modules works most of the time but is not officially supported.

Performance tuning

Modin scales well only when partition count and engine resources match the dataset and machine. Default behaviour is "one partition per CPU", which is rarely optimal.

python
import modin.config as cfg
import os

# 1. Partition count — for very large frames, oversubscribe slightly
cfg.NPartitions.put(os.cpu_count() * 2)

# 2. Min partition size — avoid creating tiny partitions on small frames
cfg.MinPartitionSize.put(32)  # 32 MB per partition floor

# 3. Async read mode — overlap I/O with compute on multi-file reads
cfg.AsyncReadMode.put(True)

# 4. Storage format — pyarrow is faster than the pandas default for groupby
cfg.StorageFormat.put("pandas")  # or "hdk" for in-memory analytics engine

# 5. Progress bar — show parallel task progress
cfg.ProgressBar.put(True)

import modin.pandas as pd

Output: (none — exits 0 on success)

Tuning checklist for slow modin workloads:

  1. Confirm parallelism is engaged. Watch for UserWarning: Distributing <object> — its absence on a heavy op means modin fell back to pandas.
  2. Inspect partition count. cfg.NPartitions.get() should be ≥ CPU count for parallelism, but each partition should hold ≥ 32 MB to amortise overhead.
  3. Profile the engine. ray.timeline() (Ray) and client.profile() (Dask) show where wall time goes.
  4. Pin object refs. For repeatedly-used DataFrames, call _to_pandas() once and reuse the pandas object rather than re-distributing on every access.
  5. Use Parquet over CSV. Modin's parallel Parquet reader is 3–5× faster than its parallel CSV reader on the same data, mostly because Parquet stores row counts in metadata so partitioning is cheap.

Troubleshooting common errors

Error / symptomCauseFix
ImportError: ray is not installedDefault backend selected without extraspip install "modin[ray]" or set MODIN_ENGINE=dask
UserWarning: Distributing <object> ... defaulting to pandasOperation unsupported by modinExpected for the ~15% fallback surface; rewrite using a supported op
Ray dashboard shows zero tasks during heavy opOperation fell back to pandas silentlySearch release notes for the op name; consider _to_pandas() for the step
ray.exceptions.RayOutOfMemoryErrorPartitions too large for worker RAMIncrease NPartitions, decrease MinPartitionSize, or set object_store_memory on ray.init()
AttributeError: 'DataFrame' object has no attribute '_to_pandas'You imported plain pandas, not modinCheck the import line at top of the file
First op takes 5+ seconds before any computeRay/Dask cluster startupPre-warm with ray.init() at app start; cache between runs
Engine 'ray' is not initialized after forkForking with Ray running breaks workersInitialise Ray after os.fork(); or use spawn-context multiprocessing
Memory grows monotonically across iterationsRay object store fragmentationPeriodic ray.internal_kv._internal_kv_reset() or restart the driver between batches

Migration from pandas

The full drop-in promise holds for most code, but a handful of patterns need adjustment for correctness and performance.

Step-by-step migration checklist:

  1. Replace the import. import pandas as pdimport modin.pandas as pd. Do this at the top of every file, not at module level inside packages — modin must be initialised before forks.
  2. Pin the backend. Export MODIN_ENGINE=ray in your shell or set cfg.Engine.put("ray") before the first import modin.pandas.
  3. Audit inplace=True calls. Replace each with reassignment — modin does not guarantee in-place semantics.
  4. Eliminate .apply(axis=1). Row-wise apply is the most common silent fallback. Rewrite as vectorised numpy/arrow ops or as groupby().agg(...) with named aggregators.
  5. Re-test edge cases. Float precision, NaN handling, and sort stability differ slightly across pandas versions; modin tracks the most recent stable pandas API.
  6. Profile before claiming a win. On datasets under 100 MB, modin can be slower than pandas; the speedup only pays back at ~500 MB+.
python
# Before — single-threaded pandas
import pandas as pd
df = pd.read_csv("events.csv")
df.dropna(inplace=True)                                # ❌ in-place
df["score"] = df.apply(lambda r: r.a + r.b, axis=1)    # ❌ row-wise apply

# After — modin-friendly equivalent
import modin.pandas as pd
df = pd.read_csv("events.csv")
df = df.dropna()                                       # ✅ reassign
df["score"] = df["a"] + df["b"]                        # ✅ vectorised

Output: (none — exits 0 on success)

Real-world recipes

A handful of full pipelines that exercise modin's strengths — parallel I/O, parallel groupby, and pandas-compat for the long tail.

1. Multi-GB log aggregation

python
import modin.pandas as pd
import glob

paths = sorted(glob.glob("/var/log/app/events_*.parquet"))
df = pd.concat([pd.read_parquet(p) for p in paths])

# Two-level aggregation across ~50 M rows
hourly = (
    df.assign(hour=df["timestamp"].dt.floor("h"))
      .groupby(["hour", "endpoint"])
      .agg(requests=("request_id", "count"),
           p95_latency=("latency_ms", lambda s: s.quantile(0.95)))
      .reset_index()
)
hourly.to_parquet("/var/log/app/hourly_rollup.parquet")
print(hourly.head(3))

Output:

text
                  hour  endpoint  requests  p95_latency
0  2026-05-30 00:00:00  /v1/users   18421       142.0
1  2026-05-30 00:00:00  /v1/orders   3211       278.0
2  2026-05-30 01:00:00  /v1/users   17984       138.0

2. Joining two large tables

python
import modin.pandas as pd

users = pd.read_parquet("users.parquet")          # 8 M rows
orders = pd.read_parquet("orders.parquet")        # 80 M rows

joined = orders.merge(users, on="user_id", how="left")
spend = joined.groupby("country")["amount"].sum().sort_values(ascending=False)
print(spend.head(5))

Output:

text
country
US    18421094.10
DE     4218390.22
GB     3197102.55
FR     2811944.33
JP     2502887.10
Name: amount, dtype: float64

3. Mixed modin + pandas for unsupported ops

python
import modin.pandas as mpd
import pandas as pd

df = mpd.read_parquet("events.parquet")            # parallel read

# Custom groupby.apply → fallback. Do it once in pandas explicitly.
pdf = df._to_pandas()
top3_per_user = pdf.groupby("user_id").apply(
    lambda g: g.nlargest(3, "score")
).reset_index(drop=True)

back = mpd.DataFrame(top3_per_user)                # back to parallel
out = back.groupby("country")["score"].mean()
print(out.head(3))

Output:

text
country
DE    87.4
FR    86.9
GB    85.2
Name: score, dtype: float64

4. Ray cluster — distributed read across nodes

python
import ray
ray.init(address="ray://head-node:10001")  # connect to remote cluster

import modin.pandas as pd
df = pd.read_parquet("s3://my-bucket/events/year=2026/")
print(df.groupby("region")["amount"].sum())

Output:

text
region
emea    19821304.50
apac     8210444.10
amer    14328990.77
Name: amount, dtype: float64

Ecosystem integrations

Modin is API-compatible with pandas, so any library accepting a pandas DataFrame accepts a modin DataFrame via the __array__ and __dataframe__ protocols — but some integrations are worth calling out.

LibraryIntegration notes
Ray Dataray.data.from_modin(df) and ds.to_modin() round-trip without re-distribution
dask.dataframeWhen MODIN_ENGINE=dask, df._to_pandas() falls back via the dask scheduler
DuckDBduckdb.from_df(df._to_pandas()); DuckDB outperforms modin on most analytical SQL
cuDFGPU-pandas; no direct interop — convert via df._to_pandas() then cudf.from_pandas(...)
PyArrowdf.to_arrow() returns a pyarrow.Table; zero-copy if storage format is "hdk"
scikit-learnPass df._to_pandas() — sklearn does not understand modin's deferred partitions
polarspl.from_pandas(df._to_pandas()); for >10 GB workloads, polars typically beats modin

When NOT to use this

Modin's parallel coordination overhead is real, and there are workloads where it costs more than it earns.

  • DataFrames under ~100 MB on disk. Plain pandas is consistently faster. The ratio approaches 1× around 500 MB.
  • Workloads dominated by apply(axis=1) or custom groupby().apply(). These fall back to single-threaded pandas anyway. Modin adds setup cost with no parallel return.
  • Tight latency budgets. Ray and Dask take ~2–5 seconds to start their first executor. Cold-start sensitive code paths (web request handlers, serverless functions) are a poor fit.
  • You need polars' lazy planner. Polars compiles a lazy plan and pushes filters down; modin executes eagerly like pandas.
  • You need strict GPU acceleration. Use cuDF directly.
  • Strict pandas API parity is required. Modin tracks pandas but lags by a few months on new API surface; some niche operations don't roundtrip.

Quick reference

TaskCode
Import swapimport modin.pandas as pd
Set backendmodin.config.Engine.put("ray")
Set partitionsmodin.config.NPartitions.put(16)
To pandasdf._to_pandas()
From pandasmodin.pandas.DataFrame(pandas_df)
Read CSV parallelpd.read_csv("file.csv")
Read Parquet parallelpd.read_parquet("file.parquet")
groupby (parallel)df.groupby("k").agg({"v": "sum"})
merge (parallel)pd.merge(left, right, on="id")
Check enginemodin.config.Engine.get()
Start Ray manuallyray.init(num_cpus=N) before import