cheat sheet
pandas
Package-level reference for pandas — install, versioning, Python compatibility, extras, and gotchas. The de-facto DataFrame library on PyPI.
pandas
What it is
pandas is the standard tabular-data library for Python — DataFrame (2-D labelled) and Series (1-D labelled), built on NumPy and now Apache Arrow for some dtypes. It dominates the PyPI download charts for data tooling and is a hard dependency of most downstream libraries (scikit-learn helpers, seaborn, statsmodels, dagster IO managers, jupyter display, …).
Reach for the pandas PyPI package when you want the canonical DataFrame API, broad ecosystem interop, and a forgiving feature set; reach for polars or DuckDB instead when raw speed or out-of-core data is the bottleneck.
Install
pip install pandas
Output: (none — exits 0 on success)
uv add pandas
Output: dependency resolved and added to pyproject.toml
poetry add pandas
Output: lockfile updated, package installed into the project venv
pip install "pandas[performance,parquet,excel]"
Output: pulls pandas plus numexpr, bottleneck, pyarrow, and openpyxl
Versioning & Python support
pandas follows SemVer-ish with a documented deprecation policy (FutureWarning for one minor cycle before removal). The 2.x line introduced PyArrow-backed dtypes and the Copy-on-Write (CoW) execution model that is the default in 3.0+. As of late 2025, pandas 2.2 / 2.3 are the stable line and 3.0 has begun rolling out CoW-by-default.
| Pandas line | Python support | Notes |
|---|---|---|
| 2.2.x | 3.9 – 3.12 | last line supporting py3.9 |
| 2.3.x | 3.10 – 3.13 | PyArrow-backed string dtype opt-in |
| 3.0.x | 3.11+ | Copy-on-Write default, stricter chained-assignment |
The numeric Python floor moves forward roughly once per year. Pin pandas<3 if you depend on the legacy in-place mutation semantics.
Package metadata
- Maintainer: NumFOCUS-sponsored project, lead maintainers Joris Van den Bossche, Matthew Roeschke, Patrick Hoefler, et al.
- Project home: github.com/pandas-dev/pandas
- Docs: pandas.pydata.org/docs
- License: BSD-3-Clause
- PyPI: pypi.org/project/pandas
- Governance: PDEP (Pandas Enhancement Proposals); steering council
- First released: 2008 (Wes McKinney, AQR Capital)
- Downloads: > 250 M / month on PyPI (one of the top 10 packages by volume)
Optional dependencies & extras
The pandas[<extra>] syntax pulls grouped optional deps without remembering individual package names:
pip install "pandas[performance,parquet,excel,plot,output-formatting]"
Output: installs the union of all named extra groups
Common extras and what they bring in:
| Extra | Pulls in | When to use |
|---|---|---|
performance | numexpr, bottleneck, numba | speeds up eval, rolling, groupby |
parquet | pyarrow | required for read_parquet / to_parquet |
feather | pyarrow | Feather/IPC files |
excel | openpyxl, xlrd, xlsxwriter | read_excel / to_excel |
plot | matplotlib | df.plot(...) chaining |
sql | sqlalchemy | read_sql / to_sql |
aws | s3fs | read_csv("s3://...") |
gcp | gcsfs | read_csv("gs://...") |
all | union of everything | quick way to skip the matrix above |
Companion packages most pipelines pull alongside pandas:
pip install pandas numpy matplotlib seaborn pyarrow scikit-learn jupyter
Output: installs the canonical "data stack" — numpy is already a hard dep but listing it pins it explicitly
Alternatives
| Package | One-line trade-off |
|---|---|
| polars | 5–30× faster Rust core, lazy/streaming, smaller API surface — switch when pandas chokes |
| modin | drop-in import modin.pandas as pd, parallelises across cores via Ray/Dask |
| duckdb | SQL-first; faster than pandas on multi-GB joins/aggregations, zero-copy with Arrow |
| dask.dataframe | distributed pandas across a cluster — overkill on a single laptop |
| pyspark.pandas | pandas API on Apache Spark — only worth it if you already run Spark |
| cuDF (RAPIDS) | GPU-accelerated drop-in; NVIDIA-only, sharp installation footprint |
Common gotchas
- 2.x → 3.x Copy-on-Write break. Any code that relied on in-place mutation through a chained selector (
df[df.x > 0]['y'] = …) silently became a no-op in 2.x with CoW opt-in and now raises in 3.0. Use.loc[mask, "y"] = …everywhere. SettingWithCopyWarningis a CoW preview warning — never ignore it, even if the code "works". In 3.0 the warning becomes an error.- Slow on >5 GB data. Single-threaded NumPy core caps you at one CPU. Hand off to polars (in-process, Rust-parallel) or duckdb (SQL-parallel) instead of buying more RAM.
- PyArrow-backed strings opt-in. Set
pd.options.future.infer_string = Trueto use Arrow string arrays — 5–10× less memory than the legacy NumPy object dtype, but third-party libraries occasionally type-check againstnp.dtype("O")and break. pd.read_excelis slow. openpyxl is pure-Python and orders of magnitude slower thanread_csv. Convert XLSX → CSV/Parquet once and reuse.- Datetime tz handling. Naive vs tz-aware datetimes do not compare equal;
pd.to_datetime(s, utc=True)is the safe default for mixed-source data. pip install pandason Apple Silicon. Wheels are now native arm64 — but if you somehow drag in an x86_64 numpy from a stale lockfile, the dtype layer crashes at import time. Regenerate the lockfile rather than hand-merging.
Real-world recipes
Pipeline-shaped tasks where pandas is still the most ergonomic answer — each is the package-level rationale (which extra is in play, how the install touches each step) rather than a re-teach of the DataFrame API. The companion sections/python/pandas covers the API depth.
CSV → Parquet conversion with chunking — the classic pandas "I have a 30 GB CSV that does not fit in RAM" pattern. The parquet extra is mandatory because pyarrow is what writes Parquet; everything else is stdlib-shaped pandas:
import pandas as pd
writer = None
for chunk in pd.read_csv("events.csv", chunksize=500_000, dtype_backend="pyarrow"):
table = chunk.to_parquet # noqa: F841 (placeholder for table conversion)
if writer is None:
chunk.to_parquet("events.parquet", engine="pyarrow", index=False)
else:
chunk.to_parquet("events.parquet", engine="pyarrow", index=False, append=True)
Output: writes a single Parquet file in roughly the time the disk reader can stream the CSV; memory stays under one chunk's worth (a few hundred MB)
The dtype_backend="pyarrow" keyword (pandas 2.x) keeps strings as Arrow strings during the read, which is usually 5-10x less RAM than the legacy object dtype for typical text-heavy CSVs.
SQL-style join across heterogeneous sources — pandas wins here when each source has a different reader (Excel, CSV, SQL, JSON), because every reader returns the same DataFrame shape:
import pandas as pd
from sqlalchemy import create_engine
orders = pd.read_csv("orders.csv", parse_dates=["ts"])
products = pd.read_excel("products.xlsx", sheet_name="catalogue")
customers = pd.read_sql("select id, region from customers", create_engine("sqlite:///shop.db"))
enriched = (
orders
.merge(products, on="sku", how="left", validate="many_to_one")
.merge(customers, left_on="customer_id", right_on="id", how="left")
)
print(enriched.head())
Output: the merged DataFrame; the validate= keyword raises immediately if sku duplicates exist in products, which is the difference between "pandas as a polite query layer" and "pandas as a foot-gun"
This pattern pulls three extras simultaneously: excel (openpyxl), sql (sqlalchemy), and the implicit Parquet/CSV defaults.
Time-series resample + window function:
import pandas as pd
ts = pd.read_parquet("metrics.parquet").set_index("ts").sort_index()
daily = (
ts["latency_ms"]
.resample("1D")
.quantile(0.95)
.rolling(window=7, min_periods=1)
.mean()
)
print(daily.tail())
Output: the 7-day rolling p95 latency — three operations, one chain, no intermediate variables
Vectorised feature engineering inside a notebook — the canonical reason teams stay on pandas even when polars is faster:
import pandas as pd
import numpy as np
df = pd.read_parquet("features.parquet")
df = df.assign(
is_premium=lambda d: d["plan"].isin({"pro", "enterprise"}),
revenue_log=lambda d: np.log1p(d["revenue"]),
cohort=lambda d: d["signup_ts"].dt.to_period("M"),
)
print(df[["is_premium", "revenue_log", "cohort"]].head())
Output: three new columns derived in a single .assign() chain; lambda d: … keeps each derivation referring to the up-to-date frame rather than the pre-assign baseline
Performance tuning
pandas is single-threaded NumPy at the core, with optional vector accelerators (numexpr, bottleneck, numba) for specific operations. The biggest wins come from data layout choices made before the first operation rather than from runtime tuning.
import pandas as pd
# Force the Arrow string backend project-wide
pd.options.future.infer_string = True
# Read a 1 GB CSV in one shot
df = pd.read_csv(
"events.csv",
engine="pyarrow", # pandas 2.x: multi-threaded parser via PyArrow
dtype_backend="pyarrow", # Arrow-native columns from the start
)
print(df.dtypes.head())
Output: Arrow-backed dtypes (string[pyarrow], int64[pyarrow], …) with multi-threaded parsing — typically 2-4x faster than the default C engine on a 4+ core box
Concrete tuning levers, ordered by impact:
| Lever | Mechanism | When it helps |
|---|---|---|
engine="pyarrow" on read_csv | multi-threaded Arrow parser | CSV reads > 100 MB |
dtype_backend="pyarrow" | Arrow columns end-to-end | text-heavy and nullable-int data |
category dtype | dictionary-encoded strings | low-cardinality columns (state, gender, sku family) |
engine="numexpr" in df.eval(...) | SIMD expression evaluator | math-heavy column expressions |
bottleneck install | C-accelerated nansum, nanmean, rolling | rolling windows, nan-aware reductions |
chunksize= on read_csv | streaming iterator | RAM > file size |
parse_dates= at read time | one parse pass vs per-cell to_datetime | timestamp-heavy CSVs |
Profiling: where is the time actually going?
import pandas as pd
import cProfile, pstats
def workload():
df = pd.read_parquet("events.parquet")
return df.groupby("user_id")["revenue"].sum().sort_values(ascending=False).head(100)
cProfile.run("workload()", "pandas.prof")
pstats.Stats("pandas.prof").sort_stats("cumulative").print_stats(15)
Output: cumulative-time-sorted call graph; in practice the top entries are usually read_parquet (I/O bound) or the groupby hash kernel, not Python overhead
For memory specifically, df.memory_usage(deep=True) is the right tool — deep=True is mandatory for object columns or you under-count strings by an order of magnitude.
Memory & dataset-size scaling
pandas materialises the whole frame in RAM by default. For datasets that exceed RAM, the operations below let you avoid loading the full file at once. Each takes a different trade-off — chunked reads are cheap but limited to embarrassingly-parallel transforms; switching to Parquet plus column projection is the largest single win.
import pandas as pd
# Read only two columns from a 50-GB Parquet dataset
df = pd.read_parquet(
"warehouse.parquet",
columns=["user_id", "revenue"],
filters=[("event_date", ">=", "2026-01-01")],
)
print(df.shape)
Output: a much smaller frame — Parquet's row-group pushdown skips most of the file on disk
Streaming a CSV that does not fit in RAM:
import pandas as pd
total = 0.0
for chunk in pd.read_csv("huge.csv", chunksize=500_000, usecols=["amount"]):
total += chunk["amount"].sum()
print(f"{total:,.2f}")
Output: the running sum, with constant memory regardless of file size
When pandas can no longer be the answer:
- >10 GB on a laptop: switch to polars (lazy + streaming) or duckdb (SQL pushdown). Both work zero-copy with Arrow, so your downstream pandas code can stay almost identical (
.to_pandas()on the materialised result). - Distributed across a cluster: dask.dataframe or pyspark.pandas. Both run the pandas API across worker processes; both add orchestration overhead that is not worth it on a single machine.
- Streaming/online: pandas is not a streaming framework. Use Arrow Flight or a true stream processor (Flink, Beam, Bytewax) and convert to a pandas frame at the boundary.
Version migration guide
The two recent breaks worth understanding before upgrading: the 2.0 dtype overhaul (mid-2023) and the 3.0 Copy-on-Write default (rolling out late 2025).
1.x → 2.x checklist:
np.int_,np.float_,np.bool_aliases removed — replace withnp.int64,np.float64,bool.pd.DataFrame.appendremoved — usepd.concat([df, other]).read_csv(squeeze=True)removed — call.squeeze()afterwards.- Default integer NA handling shifted: nullable integer dtype (
Int64) became more prominent; some places that returnedfloat64with NaN now returnInt64with<NA>. Cast explicitly if downstream expects a NumPy float. - New
dtype_backend="pyarrow"keyword on mostread_*functions — opt-in for now, opt-out in 3.x.
2.x → 3.x (CoW) checklist — the largest behavioural change since the project began:
import pandas as pd
pd.set_option("mode.copy_on_write", True) # enable now to find issues
df = pd.read_parquet("data.parquet")
sub = df[df["x"] > 0]
sub["y"] = 0 # under CoW: only sub is mutated; df is untouched
Output: with CoW on, the assignment never leaks back to df; without CoW the same code sometimes did, sometimes did not — the source of SettingWithCopyWarning
Audit your codebase by enabling pd.set_option("mode.copy_on_write", True) for one release before upgrading to 3.x. Anything that relied on the old "sometimes mutates the parent" behaviour will silently change result. The most common breakage is in feature-engineering code that uses df[mask][col] = value — replace with df.loc[mask, col] = value.
Interop with adjacent ecosystems
pandas is the lingua franca of the analytical Python stack — every other library either accepts a DataFrame or returns one. Knowing the zero-copy boundaries is the difference between a snappy notebook and one that re-allocates 5 GB four times per cell.
| Convert from / to | How | Zero-copy? |
|---|---|---|
| pandas ↔ NumPy | df.to_numpy() / pd.DataFrame(arr) | Yes if dtype matches; no for Arrow-backed columns |
| pandas → Arrow | pa.Table.from_pandas(df) | Yes for Arrow-backed columns; copies object/string |
| pandas → polars | pl.from_pandas(df) | Yes via Arrow when both sides use Arrow |
| pandas → duckdb | duckdb.sql("SELECT * FROM df") | Yes (DuckDB registers the frame as an Arrow scan) |
| pandas → scikit-learn | model.fit(df, y) | Yes (sklearn 1.2+ preserves feature_names_in_) |
| pandas → matplotlib | df.plot(...) | Yes (views the same NumPy buffers) |
import pandas as pd
import polars as pl
import duckdb
pdf = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
# pandas -> polars zero-copy via Arrow
pf = pl.from_pandas(pdf)
# pandas -> duckdb in-place query
result = duckdb.sql("SELECT a*10 AS a10 FROM pdf").df()
print(result)
Output: a10 column with [10, 20, 30] — duckdb sees pdf directly without serialising
Troubleshooting common errors
The four errors below account for the bulk of SettingWithCopyWarning / dtype-related bug reports on the pandas tracker. Each has a one-line fix; the value is recognising the pattern.
SettingWithCopyWarningwhen you dodf[df.x > 0]["y"] = 0. Fix:df.loc[df.x > 0, "y"] = 0. The warning becomes an error in 3.x.KeyError: "['col'] not in index"after a merge. Cause: column has a different name in one side. Fix: rename before merging or useleft_on=/right_on=.ValueError: cannot reindex from a duplicate axiswhen concat-ing frames with overlapping indices. Fix:pd.concat([a, b], ignore_index=True).MemoryErroron a 5 GB CSV with 16 GB of RAM. Cause: object dtype strings inflate 4-10x. Fix:pd.read_csv(..., dtype_backend="pyarrow")ordtype={"col": "category"}for low-cardinality columns.TypeError: Cannot compare tz-naive and tz-aware datetimes. Fix:pd.to_datetime(s, utc=True)everywhere — pick one timezone convention per pipeline.Length mismatch: Expected axis has N elements, new values have M elementswhen assigning a list to a column. Cause: the source has a different length than the frame. Fix: align via index or use.merge.- Group-by silently drops NaN keys.
df.groupby("col")excludes NaN groups by default. Passdropna=Falseif NaN is a real category. PerformanceWarning: indexing past lexsort depth may impact performanceon a multi-index. Fix:df = df.sort_index()once.
When NOT to use this
pandas is the right answer most of the time; the cases below are the genuine ones where reaching for something else is the better trade-off.
- Tiny data (< 1000 rows, simple keys): the stdlib
dict+ list comprehensions are faster to write and run. - Pure OLAP with multi-GB joins: duckdb's vectorised SQL is faster than pandas merge by an order of magnitude on multi-GB inputs.
- Single-machine speed-critical: polars is 5-30x faster on most operations and has a more consistent API surface. The migration cost is one afternoon.
- Distributed / cluster-scale: dask.dataframe, pyspark.pandas, ray-data. Each gives you the pandas API at the cost of orchestration overhead.
- GPU inference: cuDF mirrors pandas on NVIDIA hardware — same API, very different install.
- Streaming / online: pandas is batch-oriented. Use a real stream framework.
See also
- sections/python/pandas — full API tutorial (DataFrame ops, groupby, merge, reshape)
- sections/python/polars — faster alternative when pandas hits its single-core ceiling
- sections/python/duckdb — SQL on DataFrames and Parquet
- sections/packages-pip/pip-numpy — the numerical foundation pandas is built on
- sections/packages-pip/pip-polars — package-level comparison