cheat sheet

pandas

Package-level reference for pandas — install, versioning, Python compatibility, extras, and gotchas. The de-facto DataFrame library on PyPI.

updated 05-31-2026

pandas

What it is

pandas is the standard tabular-data library for Python — DataFrame (2-D labelled) and Series (1-D labelled), built on NumPy and now Apache Arrow for some dtypes. It dominates the PyPI download charts for data tooling and is a hard dependency of most downstream libraries (scikit-learn helpers, seaborn, statsmodels, dagster IO managers, jupyter display, …).

Reach for the pandas PyPI package when you want the canonical DataFrame API, broad ecosystem interop, and a forgiving feature set; reach for polars or DuckDB instead when raw speed or out-of-core data is the bottleneck.

Install

bash

pip install pandas

Output: (none — exits 0 on success)

bash

uv add pandas

Output: dependency resolved and added to pyproject.toml

bash

poetry add pandas

Output: lockfile updated, package installed into the project venv

bash

pip install "pandas[performance,parquet,excel]"

Output: pulls pandas plus numexpr, bottleneck, pyarrow, and openpyxl

Versioning & Python support

pandas follows SemVer-ish with a documented deprecation policy (FutureWarning for one minor cycle before removal). The 2.x line introduced PyArrow-backed dtypes and the Copy-on-Write (CoW) execution model that is the default in 3.0+. As of late 2025, pandas 2.2 / 2.3 are the stable line and 3.0 has begun rolling out CoW-by-default.

Pandas line	Python support	Notes
2.2.x	3.9 – 3.12	last line supporting py3.9
2.3.x	3.10 – 3.13	PyArrow-backed string dtype opt-in
3.0.x	3.11+	Copy-on-Write default, stricter chained-assignment

The numeric Python floor moves forward roughly once per year. Pin pandas<3 if you depend on the legacy in-place mutation semantics.

Package metadata

Maintainer: NumFOCUS-sponsored project, lead maintainers Joris Van den Bossche, Matthew Roeschke, Patrick Hoefler, et al.
Project home: github.com/pandas-dev/pandas
Docs: pandas.pydata.org/docs
License: BSD-3-Clause
PyPI: pypi.org/project/pandas
Governance: PDEP (Pandas Enhancement Proposals); steering council
First released: 2008 (Wes McKinney, AQR Capital)
Downloads: > 250 M / month on PyPI (one of the top 10 packages by volume)

Optional dependencies & extras

The pandas[<extra>] syntax pulls grouped optional deps without remembering individual package names:

bash

pip install "pandas[performance,parquet,excel,plot,output-formatting]"

Output: installs the union of all named extra groups

Common extras and what they bring in:

Extra	Pulls in	When to use
`performance`	numexpr, bottleneck, numba	speeds up `eval`, rolling, groupby
`parquet`	pyarrow	required for `read_parquet` / `to_parquet`
`feather`	pyarrow	Feather/IPC files
`excel`	openpyxl, xlrd, xlsxwriter	`read_excel` / `to_excel`
`plot`	matplotlib	`df.plot(...)` chaining
`sql`	sqlalchemy	`read_sql` / `to_sql`
`aws`	s3fs	`read_csv("s3://...")`
`gcp`	gcsfs	`read_csv("gs://...")`
`all`	union of everything	quick way to skip the matrix above

Companion packages most pipelines pull alongside pandas:

bash

pip install pandas numpy matplotlib seaborn pyarrow scikit-learn jupyter

Output: installs the canonical "data stack" — numpy is already a hard dep but listing it pins it explicitly

Alternatives

Package	One-line trade-off
polars	5–30× faster Rust core, lazy/streaming, smaller API surface — switch when pandas chokes
modin	drop-in `import modin.pandas as pd`, parallelises across cores via Ray/Dask
duckdb	SQL-first; faster than pandas on multi-GB joins/aggregations, zero-copy with Arrow
dask.dataframe	distributed pandas across a cluster — overkill on a single laptop
pyspark.pandas	pandas API on Apache Spark — only worth it if you already run Spark
cuDF (RAPIDS)	GPU-accelerated drop-in; NVIDIA-only, sharp installation footprint

Common gotchas

2.x → 3.x Copy-on-Write break. Any code that relied on in-place mutation through a chained selector (df[df.x > 0]['y'] = …) silently became a no-op in 2.x with CoW opt-in and now raises in 3.0. Use .loc[mask, "y"] = … everywhere.
SettingWithCopyWarning is a CoW preview warning — never ignore it, even if the code "works". In 3.0 the warning becomes an error.
Slow on >5 GB data. Single-threaded NumPy core caps you at one CPU. Hand off to polars (in-process, Rust-parallel) or duckdb (SQL-parallel) instead of buying more RAM.
PyArrow-backed strings opt-in. Set pd.options.future.infer_string = True to use Arrow string arrays — 5–10× less memory than the legacy NumPy object dtype, but third-party libraries occasionally type-check against np.dtype("O") and break.
pd.read_excel is slow. openpyxl is pure-Python and orders of magnitude slower than read_csv. Convert XLSX → CSV/Parquet once and reuse.
Datetime tz handling. Naive vs tz-aware datetimes do not compare equal; pd.to_datetime(s, utc=True) is the safe default for mixed-source data.
pip install pandas on Apple Silicon. Wheels are now native arm64 — but if you somehow drag in an x86_64 numpy from a stale lockfile, the dtype layer crashes at import time. Regenerate the lockfile rather than hand-merging.

Real-world recipes

Pipeline-shaped tasks where pandas is still the most ergonomic answer — each is the package-level rationale (which extra is in play, how the install touches each step) rather than a re-teach of the DataFrame API. The companion sections/python/pandas covers the API depth.

CSV → Parquet conversion with chunking — the classic pandas "I have a 30 GB CSV that does not fit in RAM" pattern. The parquet extra is mandatory because pyarrow is what writes Parquet; everything else is stdlib-shaped pandas:

python

import pandas as pd

writer = None
for chunk in pd.read_csv("events.csv", chunksize=500_000, dtype_backend="pyarrow"):
    table = chunk.to_parquet  # noqa: F841 (placeholder for table conversion)
    if writer is None:
        chunk.to_parquet("events.parquet", engine="pyarrow", index=False)
    else:
        chunk.to_parquet("events.parquet", engine="pyarrow", index=False, append=True)

Output: writes a single Parquet file in roughly the time the disk reader can stream the CSV; memory stays under one chunk's worth (a few hundred MB)

The dtype_backend="pyarrow" keyword (pandas 2.x) keeps strings as Arrow strings during the read, which is usually 5-10x less RAM than the legacy object dtype for typical text-heavy CSVs.

SQL-style join across heterogeneous sources — pandas wins here when each source has a different reader (Excel, CSV, SQL, JSON), because every reader returns the same DataFrame shape:

python

import pandas as pd
from sqlalchemy import create_engine

orders = pd.read_csv("orders.csv", parse_dates=["ts"])
products = pd.read_excel("products.xlsx", sheet_name="catalogue")
customers = pd.read_sql("select id, region from customers", create_engine("sqlite:///shop.db"))

enriched = (
    orders
    .merge(products, on="sku", how="left", validate="many_to_one")
    .merge(customers, left_on="customer_id", right_on="id", how="left")
)
print(enriched.head())

Output: the merged DataFrame; the validate= keyword raises immediately if sku duplicates exist in products, which is the difference between "pandas as a polite query layer" and "pandas as a foot-gun"

This pattern pulls three extras simultaneously: excel (openpyxl), sql (sqlalchemy), and the implicit Parquet/CSV defaults.

Time-series resample + window function:

python

import pandas as pd

ts = pd.read_parquet("metrics.parquet").set_index("ts").sort_index()
daily = (
    ts["latency_ms"]
    .resample("1D")
    .quantile(0.95)
    .rolling(window=7, min_periods=1)
    .mean()
)
print(daily.tail())

Output: the 7-day rolling p95 latency — three operations, one chain, no intermediate variables

Vectorised feature engineering inside a notebook — the canonical reason teams stay on pandas even when polars is faster:

python

import pandas as pd
import numpy as np

df = pd.read_parquet("features.parquet")
df = df.assign(
    is_premium=lambda d: d["plan"].isin({"pro", "enterprise"}),
    revenue_log=lambda d: np.log1p(d["revenue"]),
    cohort=lambda d: d["signup_ts"].dt.to_period("M"),
)
print(df[["is_premium", "revenue_log", "cohort"]].head())

Output: three new columns derived in a single .assign() chain; lambda d: … keeps each derivation referring to the up-to-date frame rather than the pre-assign baseline

Performance tuning

pandas is single-threaded NumPy at the core, with optional vector accelerators (numexpr, bottleneck, numba) for specific operations. The biggest wins come from data layout choices made before the first operation rather than from runtime tuning.

python

import pandas as pd

# Force the Arrow string backend project-wide
pd.options.future.infer_string = True

# Read a 1 GB CSV in one shot
df = pd.read_csv(
    "events.csv",
    engine="pyarrow",            # pandas 2.x: multi-threaded parser via PyArrow
    dtype_backend="pyarrow",     # Arrow-native columns from the start
)
print(df.dtypes.head())

Output: Arrow-backed dtypes (string[pyarrow], int64[pyarrow], …) with multi-threaded parsing — typically 2-4x faster than the default C engine on a 4+ core box

Concrete tuning levers, ordered by impact:

Lever	Mechanism	When it helps
`engine="pyarrow"` on `read_csv`	multi-threaded Arrow parser	CSV reads > 100 MB
`dtype_backend="pyarrow"`	Arrow columns end-to-end	text-heavy and nullable-int data
`category` dtype	dictionary-encoded strings	low-cardinality columns (state, gender, sku family)
`engine="numexpr"` in `df.eval(...)`	SIMD expression evaluator	math-heavy column expressions
`bottleneck` install	C-accelerated `nansum`, `nanmean`, rolling	rolling windows, nan-aware reductions
`chunksize=` on `read_csv`	streaming iterator	RAM > file size
`parse_dates=` at read time	one parse pass vs per-cell `to_datetime`	timestamp-heavy CSVs

Profiling: where is the time actually going?

python

import pandas as pd
import cProfile, pstats

def workload():
    df = pd.read_parquet("events.parquet")
    return df.groupby("user_id")["revenue"].sum().sort_values(ascending=False).head(100)

cProfile.run("workload()", "pandas.prof")
pstats.Stats("pandas.prof").sort_stats("cumulative").print_stats(15)

Output: cumulative-time-sorted call graph; in practice the top entries are usually read_parquet (I/O bound) or the groupby hash kernel, not Python overhead

For memory specifically, df.memory_usage(deep=True) is the right tool — deep=True is mandatory for object columns or you under-count strings by an order of magnitude.

Memory & dataset-size scaling

pandas materialises the whole frame in RAM by default. For datasets that exceed RAM, the operations below let you avoid loading the full file at once. Each takes a different trade-off — chunked reads are cheap but limited to embarrassingly-parallel transforms; switching to Parquet plus column projection is the largest single win.

python

import pandas as pd

# Read only two columns from a 50-GB Parquet dataset
df = pd.read_parquet(
    "warehouse.parquet",
    columns=["user_id", "revenue"],
    filters=[("event_date", ">=", "2026-01-01")],
)
print(df.shape)

Output: a much smaller frame — Parquet's row-group pushdown skips most of the file on disk

Streaming a CSV that does not fit in RAM:

python

import pandas as pd

total = 0.0
for chunk in pd.read_csv("huge.csv", chunksize=500_000, usecols=["amount"]):
    total += chunk["amount"].sum()
print(f"{total:,.2f}")

Output: the running sum, with constant memory regardless of file size

When pandas can no longer be the answer:

>10 GB on a laptop: switch to polars (lazy + streaming) or duckdb (SQL pushdown). Both work zero-copy with Arrow, so your downstream pandas code can stay almost identical (.to_pandas() on the materialised result).
Distributed across a cluster: dask.dataframe or pyspark.pandas. Both run the pandas API across worker processes; both add orchestration overhead that is not worth it on a single machine.
Streaming/online: pandas is not a streaming framework. Use Arrow Flight or a true stream processor (Flink, Beam, Bytewax) and convert to a pandas frame at the boundary.

Version migration guide

The two recent breaks worth understanding before upgrading: the 2.0 dtype overhaul (mid-2023) and the 3.0 Copy-on-Write default (rolling out late 2025).

1.x → 2.x checklist:

np.int_, np.float_, np.bool_ aliases removed — replace with np.int64, np.float64, bool.
pd.DataFrame.append removed — use pd.concat([df, other]).
read_csv(squeeze=True) removed — call .squeeze() afterwards.
Default integer NA handling shifted: nullable integer dtype (Int64) became more prominent; some places that returned float64 with NaN now return Int64 with <NA>. Cast explicitly if downstream expects a NumPy float.
New dtype_backend="pyarrow" keyword on most read_* functions — opt-in for now, opt-out in 3.x.

2.x → 3.x (CoW) checklist — the largest behavioural change since the project began:

python

import pandas as pd

pd.set_option("mode.copy_on_write", True)   # enable now to find issues

df = pd.read_parquet("data.parquet")
sub = df[df["x"] > 0]
sub["y"] = 0          # under CoW: only sub is mutated; df is untouched

Output: with CoW on, the assignment never leaks back to df; without CoW the same code sometimes did, sometimes did not — the source of SettingWithCopyWarning

Audit your codebase by enabling pd.set_option("mode.copy_on_write", True) for one release before upgrading to 3.x. Anything that relied on the old "sometimes mutates the parent" behaviour will silently change result. The most common breakage is in feature-engineering code that uses df[mask][col] = value — replace with df.loc[mask, col] = value.

Interop with adjacent ecosystems

pandas is the lingua franca of the analytical Python stack — every other library either accepts a DataFrame or returns one. Knowing the zero-copy boundaries is the difference between a snappy notebook and one that re-allocates 5 GB four times per cell.

Convert from / to	How	Zero-copy?
pandas ↔ NumPy	`df.to_numpy()` / `pd.DataFrame(arr)`	Yes if dtype matches; no for Arrow-backed columns
pandas → Arrow	`pa.Table.from_pandas(df)`	Yes for Arrow-backed columns; copies object/string
pandas → polars	`pl.from_pandas(df)`	Yes via Arrow when both sides use Arrow
pandas → duckdb	`duckdb.sql("SELECT * FROM df")`	Yes (DuckDB registers the frame as an Arrow scan)
pandas → scikit-learn	`model.fit(df, y)`	Yes (sklearn 1.2+ preserves `feature_names_in_`)
pandas → matplotlib	`df.plot(...)`	Yes (views the same NumPy buffers)

python

import pandas as pd
import polars as pl
import duckdb

pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# pandas -> polars zero-copy via Arrow
pf = pl.from_pandas(pdf)

# pandas -> duckdb in-place query
result = duckdb.sql("SELECT a*10 AS a10 FROM pdf").df()
print(result)

Output: a10 column with [10, 20, 30] — duckdb sees pdf directly without serialising

Troubleshooting common errors

The four errors below account for the bulk of SettingWithCopyWarning / dtype-related bug reports on the pandas tracker. Each has a one-line fix; the value is recognising the pattern.

SettingWithCopyWarning when you do df[df.x > 0]["y"] = 0. Fix: df.loc[df.x > 0, "y"] = 0. The warning becomes an error in 3.x.
KeyError: "['col'] not in index" after a merge. Cause: column has a different name in one side. Fix: rename before merging or use left_on=/right_on=.
ValueError: cannot reindex from a duplicate axis when concat-ing frames with overlapping indices. Fix: pd.concat([a, b], ignore_index=True).
MemoryError on a 5 GB CSV with 16 GB of RAM. Cause: object dtype strings inflate 4-10x. Fix: pd.read_csv(..., dtype_backend="pyarrow") or dtype={"col": "category"} for low-cardinality columns.
TypeError: Cannot compare tz-naive and tz-aware datetimes. Fix: pd.to_datetime(s, utc=True) everywhere — pick one timezone convention per pipeline.
Length mismatch: Expected axis has N elements, new values have M elements when assigning a list to a column. Cause: the source has a different length than the frame. Fix: align via index or use .merge.
Group-by silently drops NaN keys. df.groupby("col") excludes NaN groups by default. Pass dropna=False if NaN is a real category.
PerformanceWarning: indexing past lexsort depth may impact performance on a multi-index. Fix: df = df.sort_index() once.

When NOT to use this

pandas is the right answer most of the time; the cases below are the genuine ones where reaching for something else is the better trade-off.

Tiny data (< 1000 rows, simple keys): the stdlib dict + list comprehensions are faster to write and run.
Pure OLAP with multi-GB joins: duckdb's vectorised SQL is faster than pandas merge by an order of magnitude on multi-GB inputs.
Single-machine speed-critical: polars is 5-30x faster on most operations and has a more consistent API surface. The migration cost is one afternoon.
Distributed / cluster-scale: dask.dataframe, pyspark.pandas, ray-data. Each gives you the pandas API at the cost of orchestration overhead.
GPU inference: cuDF mirrors pandas on NVIDIA hardware — same API, very different install.
Streaming / online: pandas is batch-oriented. Use a real stream framework.

pandas

What it is

Install

Versioning & Python support

Package metadata

Optional dependencies & extras

Alternatives

Common gotchas

Real-world recipes

Performance tuning

Memory & dataset-size scaling

Version migration guide

Interop with adjacent ecosystems

Troubleshooting common errors

When NOT to use this

See also