cheat sheet
polars
Package-level reference for polars — install, versioning, extras, and gotchas. The Rust-powered Arrow-native alternative to pandas.
polars
What it is
polars is a DataFrame library written in Rust with a Python binding. Its memory layout is Apache Arrow, its execution engine is column-parallel SIMD, and its public API is built around lazy expression plans (pl.col("x").mean().over("group")) that can be optimised, predicate-pushed, and streamed.
On PyPI the package occupies the "fast pandas replacement" slot — the most popular non-pandas DataFrame library by download count. It is the default recommendation when pandas hits its single-threaded ceiling or when data outgrows RAM.
Install
pip install polars
Output: (none — exits 0 on success)
uv add polars
Output: dependency resolved and added to pyproject.toml
pip install "polars[all]"
Output: installs polars plus pyarrow, pandas-interop, deltalake, fsspec, xlsx2csv, …
pip install polars-lts-cpu
Output: installs the no-AVX2 build for older / VM CPUs that crash on the default wheel
Versioning & Python support
polars follows SemVer and ships frequently — multiple minor releases per quarter is normal. Breaking changes are batched into majors with an explicit deprecation cycle one minor before removal. As of late 2025, polars is on the 1.x line; the 0.x → 1.0 migration in 2024 stabilised the public Python API.
| Polars line | Python support | Notes |
|---|---|---|
| 0.20.x | 3.8 – 3.12 | legacy; many tutorials still target it |
| 1.x | 3.9 – 3.13 | current stable, post-1.0 stability guarantees |
The Rust core also exposes a Python build flag — polars (default) and polars-lts-cpu (built without AVX2/AVX-512 for older x86 hosts). Pick exactly one; they share the import name polars.
Package metadata
- Maintainer: Ritchie Vink (creator) and Polars Inc. / Polars Cloud team
- Project home: github.com/pola-rs/polars
- Docs: docs.pola.rs
- License: MIT
- PyPI: pypi.org/project/polars
- Governance: core team at Polars Inc. plus open-source maintainers
- First released: 2020 (Rust crate); Python binding shortly after
- Downloads: > 30 M / month on PyPI as of late 2025
Optional dependencies & extras
pip install "polars[pyarrow,pandas,fsspec,xlsx,deltalake,timezone]"
Output: installs polars plus the listed groups; each extra maps to a concrete dep
| Extra | Pulls in | When to use |
|---|---|---|
pyarrow | pyarrow | zero-copy interop with Arrow tables, faster Parquet |
pandas | pandas, pyarrow | df.to_pandas() / pl.from_pandas() |
numpy | numpy | already a hard dep, listed for clarity |
fsspec | fsspec | cloud-storage URLs (s3://…, gcs://…) |
xlsx | xlsx2csv, openpyxl | read_excel / write_excel |
deltalake | deltalake | read/write Delta Lake tables |
iceberg | pyiceberg | Apache Iceberg interop |
timezone | tzdata | full IANA tz database on Windows |
connectorx | connectorx | fast DB → polars via Rust |
style | great-tables | rich HTML table rendering |
all | union of all extras | one-shot install |
Common companion packages:
pip install polars pyarrow duckdb matplotlib jupyter
Output: installs the typical analytical stack — pyarrow for interop, duckdb for SQL over the same Arrow
Alternatives
| Package | One-line trade-off |
|---|---|
| pandas | larger ecosystem, slower single-threaded core |
| duckdb | SQL-first; complements polars (both use Arrow, zero-copy) |
| modin | drop-in pandas API, parallelises pandas — different mental model |
| pyspark | distributed JVM cluster; only worth it past TB-scale |
| dask.dataframe | distributed pandas; slower than polars on a single node |
| cudf (RAPIDS) | GPU-accelerated; NVIDIA-only |
| vaex | out-of-core DataFrames; smaller community than polars |
Common gotchas
- Lazy vs eager API confusion.
pl.DataFrame(...)is eager;pl.LazyFrame(...)andpl.scan_*are lazy. Lazy plans need.collect()to materialise. Mixing the two without thinking causesAttributeErroron operations that exist only on the other type. - No native CSV "streaming-write".
write_csvmaterialises the frame in memory. For streaming output usesink_csvon aLazyFrame. - Version churn on the Rust core. Minor releases occasionally rename expression-method internals; pin polars in production pipelines and read the CHANGELOG before a
--upgrade. pl.read_csvschema inference scans only the first N rows. Cast types explicitly (schema_overrides={"id": pl.UInt32}) for production loads.- AVX2 crash on older CPUs / VMs. Default wheel uses AVX2; on QEMU / older Xeons it segfaults at import. Install
polars-lts-cpuinstead — same import name, no AVX requirement. - String dtype is Arrow-native —
.cast(pl.Utf8)notastype("object"). Code copy-pasted from pandas withobjectdtype assumptions breaks. fork+ Rust threads = deadlock.polarsplusmultiprocessingdefaults toforkon Linux, which can hang on the Rust thread pool. Usespawnstart method, or run polars work in the parent process.
Real-world recipes
The package-level recipes below show the install footprint each pattern requires rather than re-teaching the polars API (the companion sections/python/polars covers that). Each is a one-screen pipeline that exercises a distinct extra.
Lazy scan of a Parquet dataset directory — the canonical polars sweet spot. scan_parquet is lazy by construction; the query optimiser pushes filters and projections into the file reader before any data is touched.
import polars as pl
q = (
pl.scan_parquet("warehouse/events/*.parquet")
.filter(pl.col("event_date") >= pl.date(2026, 1, 1))
.group_by("user_id")
.agg(
pl.col("revenue").sum().alias("total_revenue"),
pl.col("session_id").n_unique().alias("sessions"),
)
.sort("total_revenue", descending=True)
.limit(100)
)
print(q.explain())
df = q.collect()
Output: explain() prints the optimised plan (projection + predicate pushed into the file scan); collect() returns the top-100 frame having read only the necessary row groups
The Parquet reader is built in — no extras needed. For S3 / GCS URLs add the fsspec extra and use pl.scan_parquet("s3://bucket/...").
Streaming a 50 GB CSV with sink_parquet — write a Parquet output without ever materialising the frame in RAM. This is what polars-the-tool is for: data outgrowing memory on a single machine.
import polars as pl
(
pl.scan_csv("huge.csv")
.filter(pl.col("status") == "active")
.with_columns(
revenue_log=pl.col("revenue").log1p(),
)
.sink_parquet("active.parquet", compression="zstd")
)
Output: a Parquet file on disk, written in streaming chunks; peak RSS stays at a few hundred MB regardless of input size
Group-by with window expressions — polars window functions are first-class, not stitched on top of group-by like pandas:
import polars as pl
df = pl.read_parquet("orders.parquet")
ranked = df.with_columns(
rank_within_user=pl.col("order_amount").rank("dense", descending=True).over("user_id"),
user_total=pl.col("order_amount").sum().over("user_id"),
).filter(pl.col("rank_within_user") <= 3)
print(ranked.head())
Output: top-3 orders per user with their cumulative spend; over() is the polars window keyword
Join across heterogeneous lazy sources — joining a CSV stream and a Parquet directory without materialising either:
import polars as pl
events = pl.scan_csv("events.csv")
products = pl.scan_parquet("products/*.parquet")
joined = (
events
.join(products, on="sku", how="left")
.group_by("category")
.agg(pl.col("revenue").sum())
.sort("revenue", descending=True)
.collect(streaming=True) # use the streaming engine
)
Output: category-by-revenue, with streaming=True letting the engine spill chunks rather than holding everything in RAM
Performance tuning
polars defaults are already aggressive — query optimiser, columnar layout, all available cores. The remaining tuning levers are about telling the engine what to skip rather than telling it to go faster.
import polars as pl
# Inspect the planner output before collecting
plan = (
pl.scan_parquet("warehouse/*.parquet")
.filter(pl.col("ts") > pl.datetime(2026, 1, 1))
.group_by("user_id")
.agg(pl.col("revenue").sum())
)
print(plan.explain(optimized=True))
Output: the optimised logical plan — confirms filter pushdown into the scan and projection pruning
Tuning levers, ordered by impact:
| Lever | Mechanism | When it helps |
|---|---|---|
scan_* over read_* | lazy, optimised, lazy I/O | most pipelines; switch by default |
pl.Config.set_streaming_chunk_size | streaming engine spill size | RAM-limited streaming pipelines |
collect(streaming=True) | streaming executor | datasets > RAM |
Explicit schema_overrides= | skip schema inference scan | large CSVs with known schema |
pl.col(...).cast(pl.Categorical) | dictionary-encoded strings | low-cardinality columns |
POLARS_MAX_THREADS env | cap thread count | shared CI runners, oversubscribed boxes |
polars-lts-cpu wheel | no-AVX2 build | older Xeons, QEMU/VM hosts |
Inspecting the query plan after the fact:
import polars as pl
plan = (
pl.scan_parquet("data.parquet")
.filter(pl.col("status") == "active")
.group_by("region")
.agg(pl.col("revenue").sum())
)
print("Logical:\n", plan.explain(optimized=False))
print()
print("Optimised:\n", plan.explain(optimized=True))
Output: two plans side by side — the optimised version typically merges projection + filter + scan into a single Parquet pushdown step
For benchmarks, polars ships a pl.profile() helper that returns each plan stage's runtime — much more useful than wall-clock comparisons against pandas.
Memory & dataset-size scaling
polars distinguishes three execution modes: eager (loads everything, NumPy-like), lazy (builds a plan, executes via collect()), and streaming (chunked execution via collect(streaming=True) or sink_*). The streaming mode is what makes polars usable on data larger than RAM.
import polars as pl
# Stream a 200 GB CSV through to compressed Parquet
(
pl.scan_csv("logs/*.csv")
.filter(pl.col("level").is_in(["ERROR", "FATAL"]))
.group_by(["service", pl.col("ts").dt.truncate("1h")])
.agg(pl.len().alias("count"))
.sink_parquet("errors.parquet", compression="zstd")
)
Output: writes a Parquet file in streaming fashion; RSS stays bounded; the engine spills chunks to /tmp when its memory budget is exceeded
Sizing the streaming engine:
import polars as pl
# Adjust the streaming chunk size (default ~50k rows; raise on big RAM, lower on tiny boxes)
pl.Config.set_streaming_chunk_size(250_000)
Output: no return value; subsequent streaming queries use the new chunk size
The streaming engine is a moving target — operations that have a streaming kernel one release may need a regular collect the next. Read pl.collect_all and pl.explain output to confirm streaming actually applies to your query.
When polars is no longer enough: if a single node is too small even with streaming, the next step is duckdb (SQL pushdown, similar columnar engine, often complementary in the same pipeline) or a cluster scheduler (dask, ray-data). polars itself does not distribute across nodes today.
Version migration guide
The 0.x → 1.0 jump in 2024 was the most consequential change. Once on 1.x, minor versions stay backwards-compatible but the expression language keeps evolving with each release.
0.x → 1.x checklist:
DataFrame.with_column()→with_columns()(always plural now).Series.alias()returnsSeries, notExpr— convert viapl.col("x").alias("y")if you wanted an expression.groupby→group_by(snake case). Both worked during the deprecation window; only the new form works on 1.x.pl.col("x").apply(...)→.map_elements(...)for per-row Python callbacks;.map_batches(...)for whole-Series UDFs.pl.read_csv(infer_schema_length=...)defaults raised — explicitschema_overrides=is safer.pl.Utf8is still the dtype name for strings; the literal"str"was deprecated.q.collect(streaming=True)replaced the oldercollect_streaming().
1.x minor-to-minor — read the changelog before every --upgrade in production. Recent 1.x releases have:
- Reworked the struct dtype (
pl.Struct(...)) — field order and naming semantics tightened. - Tightened the lazy/streaming boundary — some kernels that used to fall back to eager now raise.
- Added new I/O surfaces (Iceberg, Delta-Sharing) behind extras.
Pin to a known-good polars==1.x.y for any production pipeline; do not float minors. Bullet-pointed upgrade notes live in CHANGELOG.md.
Interop with adjacent ecosystems
polars is Arrow-native, which makes zero-copy conversions cheap with anything else that speaks Arrow. The interop matrix below tells you when bytes actually move.
| Convert from / to | How | Zero-copy? |
|---|---|---|
| polars ↔ Arrow | df.to_arrow() / pl.from_arrow(table) | Yes |
| polars ↔ pandas | df.to_pandas() / pl.from_pandas(pdf) | Yes via Arrow when the pandas side uses Arrow-backed dtypes; copies otherwise |
| polars ↔ NumPy | df.to_numpy() / pl.from_numpy(arr) | Partial — single dtype, contiguous arrays only |
| polars ↔ duckdb | duckdb.sql("SELECT * FROM df") | Yes — DuckDB registers the polars frame as an Arrow scan |
| polars → scikit-learn | model.fit(df.to_numpy(), y) | Copy — sklearn wants NumPy |
| polars → matplotlib | df.to_pandas().plot(...) | Copy — matplotlib does not consume polars directly |
import polars as pl
import duckdb
df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
# polars -> duckdb -> polars round-trip, zero-copy
result = duckdb.sql("SELECT a*10 AS a10 FROM df").pl()
print(result)
Output: a polars DataFrame with a10 = [10, 20, 30]; both libraries operate on the same Arrow buffers without serialisation
Troubleshooting common errors
The polars error messages are unusually good — they often print the failing expression and a hint at what to do next. The list below is the package-level catalogue of recurring frictions.
SchemaError: invalid series dtype: expected Float, got Int64— almost always an expression that mixes int and float without casting. Fix:pl.col("x").cast(pl.Float64)at the offending node.ComputeError: cannot extract from null valuein struct field access. Fix:pl.col("s").struct.field("k").fill_null(...).PanicExceptionon collect — a Rust-side panic, usually means a bug. File an issue with a repro; many of these were fixed in recent 1.x releases.SIGILL/ illegal instruction at import — AVX2 not available. Installpolars-lts-cpuinstead.AttributeError: 'LazyFrame' object has no attribute 'head'etc. — confusing eager vs lazy.LazyFrame.collect().head()or use the lazy-equivalent verbs that exist (fetch,limit).pyo3_runtime.PanicExceptionwithforkon Linux —multiprocessingdefault start method isfork, deadlocks the Rust thread pool.mp.set_start_method("spawn")at program start.OutOfMemoryErrordespite streaming=True — some operations have no streaming kernel and force materialisation. Inspect withexplain(streaming=True)to confirm streaming applies.ColumnNotFoundErrorafter a join — column was renamed by suffix (_right). Alwayssuffix=explicitly on joins and rename immediately.
When NOT to use this
polars shines on single-machine analytical workloads; the cases below are the genuine ones where another tool is the better fit.
- Tiny data (< 10k rows): the install footprint and import time outweigh the speed win. pandas is fine.
- You need a years-old, fully stable API: polars is moving fast; pandas is the steadier choice for a slow-moving production codebase.
- SQL is already your team's lingua franca: duckdb gives you the same Arrow-native engine with SQL on top.
- Distributed compute: polars does not span nodes. Use dask, ray-data, or push down to a warehouse (Snowflake, BigQuery).
- Deep ecosystem coupling (statsmodels, seaborn, sklearn extensions): these all consume pandas; round-tripping every step into polars is friction. Use pandas at the boundary.
- GPU acceleration: polars has a CUDA execution path (via cudf) but it is still experimental. cuDF directly is the more stable path on NVIDIA today.
See also
- sections/python/polars — full API tutorial (filter, group_by, joins, LazyFrame)
- sections/python/pandas — the API polars is benchmarked against
- sections/python/duckdb — Arrow-native SQL that pairs zero-copy with polars
- sections/packages-pip/pip-pandas — package-level comparison
- sections/packages-pip/pip-duckdb — sibling Arrow-native analytics