cheat sheet

polars

Package-level reference for polars — install, versioning, extras, and gotchas. The Rust-powered Arrow-native alternative to pandas.

What it is

polars is a DataFrame library written in Rust with Python bindings. Its memory layout follows Apache Arrow, its execution engine is multi-threaded and vectorised (SIMD), and its public API is built around expressions such as pl.col("x").mean().over("group"). In lazy mode those expressions form a query plan that can be optimised (predicate and projection pushdown) and streamed.

On PyPI the package occupies the "fast pandas replacement" slot — the most popular non-pandas DataFrame library by download count. It is the default recommendation when pandas hits its single-threaded ceiling or when data outgrows RAM.

Install

bash
pip install polars

Output: (none — exits 0 on success)

bash
uv add polars

Output: dependency resolved and added to pyproject.toml

bash
pip install "polars[all]"

Output: installs polars plus pyarrow, pandas-interop, deltalake, fsspec, xlsx2csv, …

bash
pip install polars-lts-cpu

Output: installs the no-AVX2 build for older / VM CPUs that crash on the default wheel

Versioning & Python support

polars follows SemVer and ships frequently — multiple minor releases per quarter is normal. Breaking changes are batched into majors with an explicit deprecation cycle one minor before removal. As of late 2025, polars is on the 1.x line; the 0.x → 1.0 migration in 2024 stabilised the public Python API.

| Polars line | Python support | Notes |
| --- | --- | --- |
| 0.20.x | 3.8 – 3.12 | legacy; many tutorials still target it |
| 1.x | 3.9 – 3.13 | current stable, post-1.0 stability guarantees |

polars is published on PyPI under more than one distribution name: polars (the default build) and polars-lts-cpu (built without AVX2/AVX-512 for older x86 hosts). Install exactly one; both provide the import name polars.
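Because both builds install the same import name, it can help to check which distribution an environment actually contains. The following is a minimal sketch using only the standard library; the helper name installed_polars_builds is illustrative, and the list of distribution names is limited to the two mentioned above.

```python
from importlib import metadata

# Distribution names taken from the text above; both provide `import polars`.
KNOWN_BUILDS = ("polars", "polars-lts-cpu")


def installed_polars_builds() -> list[str]:
    """Return the known polars distributions found in this environment."""
    found = []
    for name in KNOWN_BUILDS:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            continue
        found.append(name)
    return found


builds = installed_polars_builds()
if len(builds) > 1:
    print("more than one polars distribution is installed:", builds)
else:
    print("polars distributions found:", builds)
```

An empty list simply means neither distribution is installed in the active environment.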

Package metadata

  • Maintainer: Ritchie Vink (creator) and Polars Inc. / Polars Cloud team
  • Project home: github.com/pola-rs/polars
  • Docs: docs.pola.rs
  • License: MIT
  • PyPI: pypi.org/project/polars
  • Governance: core team at Polars Inc. plus open-source maintainers
  • First released: 2020 (Rust crate); Python binding shortly after
  • Downloads: > 30 M / month on PyPI as of late 2025

Optional dependencies & extras

bash
pip install "polars[pyarrow,pandas,fsspec,excel,deltalake,timezone]"

Output: installs polars plus the listed groups; each extra maps to a concrete dep

| Extra | Pulls in | When to use |
| --- | --- | --- |
| pyarrow | pyarrow | zero-copy interop with Arrow tables, faster Parquet |
| pandas | pandas, pyarrow | df.to_pandas() / pl.from_pandas() |
| numpy | numpy | df.to_numpy() / pl.from_numpy(); optional, not a hard dependency |
| fsspec | fsspec | cloud-storage URLs (s3://…, gcs://…) |
| excel | xlsx2csv, openpyxl, and other Excel engines | read_excel / write_excel |
| deltalake | deltalake | read/write Delta Lake tables |
| iceberg | pyiceberg | Apache Iceberg interop |
| timezone | tzdata | full IANA tz database on Windows |
| connectorx | connectorx | fast DB → polars via Rust |
| style | great-tables | rich HTML table rendering |
| all | union of all extras | one-shot install |

Common companion packages:

bash
pip install polars pyarrow duckdb matplotlib jupyter

Output: installs the typical analytical stack — pyarrow for interop, duckdb for SQL over the same Arrow

Alternatives

| Package | One-line trade-off |
| --- | --- |
| pandas | larger ecosystem, slower single-threaded core |
| duckdb | SQL-first; complements polars (both use Arrow, zero-copy) |
| modin | drop-in pandas API, parallelises pandas; different mental model |
| pyspark | distributed JVM cluster; only worth it past TB-scale |
| dask.dataframe | distributed pandas; slower than polars on a single node |
| cudf (RAPIDS) | GPU-accelerated; NVIDIA-only |
| vaex | out-of-core DataFrames; smaller community than polars |

Common gotchas

  • Lazy vs eager API confusion. pl.DataFrame(...) is eager; pl.LazyFrame(...) and pl.scan_* are lazy. Lazy plans need .collect() to materialise. Mixing the two without thinking causes AttributeError on operations that exist only on the other type.
  • write_csv is not a streaming writer. It is a DataFrame method, so the whole frame is materialised in memory before anything is written. For streaming output use sink_csv on a LazyFrame.
  • Version churn on the Rust core. Minor releases occasionally rename expression-method internals; pin polars in production pipelines and read the CHANGELOG before a --upgrade.
  • pl.read_csv schema inference scans only the first N rows. Cast types explicitly (schema_overrides={"id": pl.UInt32}) for production loads.
  • AVX2 crash on older CPUs / VMs. The default wheel uses AVX2; on QEMU / older Xeons it dies with an illegal-instruction error (SIGILL) at import. Install polars-lts-cpu instead: same import name, no AVX requirement.
  • String dtype is Arrow-native. Use .cast(pl.String) (pl.Utf8 is an alias), not astype("object"). Code copy-pasted from pandas with object-dtype assumptions breaks.
  • fork + Rust threads = deadlock. polars plus multiprocessing defaults to fork on Linux, which can hang on the Rust thread pool. Use spawn start method, or run polars work in the parent process.

Real-world recipes

The package-level recipes below show the install footprint each pattern requires rather than re-teaching the polars API (the companion sections/python/polars covers that). Each is a one-screen pipeline that exercises a distinct extra.

Lazy scan of a Parquet dataset directory — the canonical polars sweet spot. scan_parquet is lazy by construction; the query optimiser pushes filters and projections into the file reader before any data is touched.

python
import polars as pl

q = (
    pl.scan_parquet("warehouse/events/*.parquet")
    .filter(pl.col("event_date") >= pl.date(2026, 1, 1))
    .group_by("user_id")
    .agg(
        pl.col("revenue").sum().alias("total_revenue"),
        pl.col("session_id").n_unique().alias("sessions"),
    )
    .sort("total_revenue", descending=True)
    .limit(100)
)
print(q.explain())
df = q.collect()

Output: explain() prints the optimised plan (projection + predicate pushed into the file scan); collect() returns the top-100 frame having read only the necessary row groups

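The Parquet reader is built in; no extras needed. For S3 / GCS URLs, pl.scan_parquet("s3://bucket/...") goes through polars' native cloud readers in current releases; the fsspec extra is the fallback for eager readers and other filesystems.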

Streaming a 50 GB CSV with sink_parquet — write a Parquet output without ever materialising the frame in RAM. This is what polars-the-tool is for: data outgrowing memory on a single machine.

python
import polars as pl

(
    pl.scan_csv("huge.csv")
    .filter(pl.col("status") == "active")
    .with_columns(
        revenue_log=pl.col("revenue").log1p(),
    )
    .sink_parquet("active.parquet", compression="zstd")
)

Output: a Parquet file on disk, written in streaming chunks; peak RSS stays at a few hundred MB regardless of input size

Group-by with window expressions — polars window functions are first-class, not stitched on top of group-by like pandas:

python
import polars as pl

df = pl.read_parquet("orders.parquet")
ranked = df.with_columns(
    rank_within_user=pl.col("order_amount").rank("dense", descending=True).over("user_id"),
    user_total=pl.col("order_amount").sum().over("user_id"),
).filter(pl.col("rank_within_user") <= 3)
print(ranked.head())

Output: top-3 orders per user alongside each user's total spend; over() is the polars window keyword

Join across heterogeneous lazy sources — joining a CSV stream and a Parquet directory without materialising either:

python
import polars as pl

events = pl.scan_csv("events.csv")
products = pl.scan_parquet("products/*.parquet")

joined = (
    events
    .join(products, on="sku", how="left")
    .group_by("category")
    .agg(pl.col("revenue").sum())
    .sort("revenue", descending=True)
    .collect(streaming=True)   # use the streaming engine
)

Output: category-by-revenue, with streaming=True letting the engine spill chunks rather than holding everything in RAM

Performance tuning

polars defaults are already aggressive — query optimiser, columnar layout, all available cores. The remaining tuning levers are about telling the engine what to skip rather than telling it to go faster.

python
import polars as pl

# Inspect the planner output before collecting
plan = (
    pl.scan_parquet("warehouse/*.parquet")
    .filter(pl.col("ts") > pl.datetime(2026, 1, 1))
    .group_by("user_id")
    .agg(pl.col("revenue").sum())
)
print(plan.explain(optimized=True))

Output: the optimised logical plan — confirms filter pushdown into the scan and projection pruning

Tuning levers, ordered by impact:

| Lever | Mechanism | When it helps |
| --- | --- | --- |
| scan_* over read_* | lazy plan, query optimiser, deferred I/O | most pipelines; switch by default |
| pl.Config.set_streaming_chunk_size | rows per streaming chunk | RAM-limited streaming pipelines |
| collect(streaming=True) | streaming executor | datasets > RAM |
| Explicit schema_overrides= | skip schema inference scan | large CSVs with known schema |
| pl.col(...).cast(pl.Categorical) | dictionary-encoded strings | low-cardinality columns |
| POLARS_MAX_THREADS env | cap thread count | shared CI runners, oversubscribed boxes |
| polars-lts-cpu wheel | no-AVX2 build | older Xeons, QEMU/VM hosts |

Inspecting the query plan after the fact:

python
import polars as pl

plan = (
    pl.scan_parquet("data.parquet")
    .filter(pl.col("status") == "active")
    .group_by("region")
    .agg(pl.col("revenue").sum())
)
print("Logical:\n", plan.explain(optimized=False))
print()
print("Optimised:\n", plan.explain(optimized=True))

Output: two plans side by side — the optimised version typically merges projection + filter + scan into a single Parquet pushdown step

For benchmarks, LazyFrame.profile() runs the query and returns the result together with a timings frame for each plan node, which is much more useful than wall-clock comparisons against pandas.

Memory & dataset-size scaling

polars distinguishes three execution modes: eager (loads everything, NumPy-like), lazy (builds a plan, executes via collect()), and streaming (chunked execution via collect(streaming=True) or sink_*). The streaming mode is what makes polars usable on data larger than RAM.

python
import polars as pl

# Stream a 200 GB CSV through to compressed Parquet
(
    pl.scan_csv("logs/*.csv", try_parse_dates=True)  # parse ts as Datetime so .dt works
    .filter(pl.col("level").is_in(["ERROR", "FATAL"]))
    .group_by(["service", pl.col("ts").dt.truncate("1h")])
    .agg(pl.len().alias("count"))
    .sink_parquet("errors.parquet", compression="zstd")
)

Output: writes a Parquet file in streaming fashion; RSS stays bounded; the engine spills chunks to /tmp when its memory budget is exceeded

Sizing the streaming engine:

python
import polars as pl

# Adjust the streaming chunk size (by default polars derives it from the schema and thread-pool size; raise on big RAM, lower on tiny boxes)
pl.Config.set_streaming_chunk_size(250_000)

Output: no return value; subsequent streaming queries use the new chunk size

The streaming engine is a moving target: an operation that has a streaming kernel in one release may fall back to the in-memory engine in the next. Inspect the plan with LazyFrame.explain(streaming=True) to confirm streaming actually applies to your query.

When polars is no longer enough: if a single node is too small even with streaming, the next step is duckdb (SQL pushdown, similar columnar engine, often complementary in the same pipeline) or a cluster scheduler (dask, ray-data). polars itself does not distribute across nodes today.

Version migration guide

The 0.x → 1.0 jump in 2024 was the most consequential change. Once on 1.x, minor versions stay backwards-compatible but the expression language keeps evolving with each release.

0.x → 1.x checklist:

  • DataFrame.with_column() → with_columns() (always plural now).
  • Series.alias() returns Series, not Expr; convert via pl.col("x").alias("y") if you wanted an expression.
  • groupby → group_by (snake case). Both worked during the deprecation window; only the new form works on 1.x.
  • pl.col("x").apply(...) → .map_elements(...) for per-row Python callbacks; .map_batches(...) for whole-Series UDFs.
  • pl.read_csv(infer_schema_length=...) defaults raised — explicit schema_overrides= is safer.
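  • pl.String is the canonical dtype name for strings; pl.Utf8 remains available as an alias.
  • q.collect(streaming=True) replaced the older collect_streaming(); newer 1.x releases in turn deprecate the flag in favour of q.collect(engine="streaming").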

1.x minor-to-minor — read the changelog before every --upgrade in production. Recent 1.x releases have:

  • Reworked the struct dtype (pl.Struct(...)) — field order and naming semantics tightened.
  • Tightened the lazy/streaming boundary — some kernels that used to fall back to eager now raise.
  • Added new I/O surfaces (Iceberg, Delta-Sharing) behind extras.

Pin to a known-good polars==1.x.y for any production pipeline; do not float minors. Bullet-pointed upgrade notes live in CHANGELOG.md.

Interop with adjacent ecosystems

polars is Arrow-native, which makes zero-copy conversions cheap with anything else that speaks Arrow. The interop matrix below tells you when bytes actually move.

| Convert from / to | How | Zero-copy? |
| --- | --- | --- |
| polars ↔ Arrow | df.to_arrow() / pl.from_arrow(table) | Yes |
| polars ↔ pandas | df.to_pandas() / pl.from_pandas(pdf) | Yes via Arrow when the pandas side uses Arrow-backed dtypes; copies otherwise |
| polars ↔ NumPy | df.to_numpy() / pl.from_numpy(arr) | Partial: single dtype, contiguous arrays only |
| polars ↔ duckdb | duckdb.sql("SELECT * FROM df") | Yes: DuckDB registers the polars frame as an Arrow scan |
| polars → scikit-learn | model.fit(df.to_numpy(), y) | Copy: sklearn wants NumPy |
| polars → matplotlib | df.to_pandas().plot(...) | Copy: matplotlib does not consume polars directly |

python
import polars as pl
import duckdb

df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# polars -> duckdb -> polars round-trip, zero-copy
result = duckdb.sql("SELECT a*10 AS a10 FROM df").pl()
print(result)

Output: a polars DataFrame with a10 = [10, 20, 30]; both libraries operate on the same Arrow buffers without serialisation

Troubleshooting common errors

The polars error messages are unusually good — they often print the failing expression and a hint at what to do next. The list below is the package-level catalogue of recurring frictions.

  • SchemaError: invalid series dtype: expected Float, got Int64 — almost always an expression that mixes int and float without casting. Fix: pl.col("x").cast(pl.Float64) at the offending node.
  • ComputeError: cannot extract from null value in struct field access. Fix: pl.col("s").struct.field("k").fill_null(...).
  • PanicException on collect — a Rust-side panic, usually means a bug. File an issue with a repro; many of these were fixed in recent 1.x releases.
  • SIGILL / illegal instruction at import — AVX2 not available. Install polars-lts-cpu instead.
  • AttributeError: 'LazyFrame' object has no attribute 'shape' (or another eager-only attribute): the eager and lazy APIs were mixed up. Call LazyFrame.collect() first, or stay lazy with the verbs that do exist on LazyFrame (head, limit).
  • pyo3_runtime.PanicException with fork on Linux — multiprocessing default start method is fork, deadlocks the Rust thread pool. mp.set_start_method("spawn") at program start.
  • OutOfMemoryError despite streaming=True — some operations have no streaming kernel and force materialisation. Inspect with explain(streaming=True) to confirm streaming applies.
  • ColumnNotFoundError after a join — column was renamed by suffix (_right). Always suffix= explicitly on joins and rename immediately.

When NOT to use this

polars shines on single-machine analytical workloads; the cases below are the genuine ones where another tool is the better fit.

  • Tiny data (< 10k rows): the install footprint and import time outweigh the speed win. pandas is fine.
  • You need a years-old, fully stable API: polars is moving fast; pandas is the steadier choice for a slow-moving production codebase.
  • SQL is already your team's lingua franca: duckdb gives you the same Arrow-native engine with SQL on top.
  • Distributed compute: polars does not span nodes. Use dask, ray-data, or push down to a warehouse (Snowflake, BigQuery).
  • Deep ecosystem coupling (statsmodels, seaborn, sklearn extensions): these all consume pandas; round-tripping every step into polars is friction. Use pandas at the boundary.
  • GPU acceleration: polars has a CUDA execution path (via cudf) but it is still experimental. cuDF directly is the more stable path on NVIDIA today.

See also