cheat sheet

polars

High-performance DataFrames with a lazy expression API. Covers read/write, select, filter, group_by, joins, LazyFrame, datetime, string operations, and pandas interop.

updated 04-27-2026

polars — Fast DataFrames

What it is

Polars is a DataFrame library written in Rust that exposes a Python API. It uses Apache Arrow as its memory layout, processes data column-by-column with SIMD instructions, and executes lazy query plans in parallel across all CPU cores. Polars is significantly faster than pandas on most workloads and handles datasets that exceed RAM through streaming. It is the recommended alternative to pandas when performance or memory is a bottleneck.

Install

bash

pip install polars

Output: (none — exits 0 on success)

For Parquet, Excel, and timezone support:

bash

pip install polars[all]

Output: (none — exits 0 on success)

Quick example

python

import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["Eng", "Eng", "HR", "HR"],
    "score": [92, 78, 85, 91],
})
print(df.filter(pl.col("score") > 80))

Output:

text

shape: (3, 3)
┌───────┬──────┬───────┐
│ name  ┆ dept ┆ score │
│ ---   ┆ ---  ┆ ---   │
│ str   ┆ str  ┆ i64   │
╞═══════╪══════╪═══════╡
│ Alice ┆ Eng  ┆ 92    │
│ Carol ┆ HR   ┆ 85    │
│ Dave  ┆ HR   ┆ 91    │
└───────┴──────┴───────┘

When / why to use it

Processing CSV, Parquet, or JSON files faster than pandas with a single-line change.
Datasets that are too large for pandas (polars streams through them without loading into RAM with streaming=True).
Multi-core machines where pandas' GIL-constrained single-threaded ops leave cores idle.
Pipelines where lazy evaluation (LazyFrame) lets polars optimise the entire query plan before executing it.
When you want to avoid pandas' inconsistent mutation semantics (SettingWithCopyWarning).

Common pitfalls

Expressions, not index-based mutation — polars has no .iloc or .loc. All transformations return new DataFrames via expressions. df["col"] = value raises TypeError; use df.with_columns(pl.lit(value).alias("col")) instead.

collect() must be called on LazyFrames — a LazyFrame is a query plan. Nothing executes until you call .collect(). Forgetting it returns a LazyFrame object, not a DataFrame.

Column names are case-sensitive — pl.col("Score") and pl.col("score") are different. Unlike pandas, polars never coerces column names.

Use df.glimpse() during exploration — it prints one row per column with dtype, much easier to read than df.head() on wide DataFrames.

pl.col("*") selects all columns. pl.col(pl.Float64, pl.Int64) selects all numeric columns by dtype — useful in with_columns for type-based transforms.

Reading and writing files

Polars has symmetric read_* / write_* and scan_* functions for all major formats. The scan_* family returns a LazyFrame and only reads data that the query plan actually needs, making it ideal for large files.

python

import polars as pl

# CSV
df = pl.read_csv("data.csv")
df.write_csv("output.csv")

# Parquet (fastest format for polars pipelines)
df = pl.read_parquet("data.parquet")
df.write_parquet("output.parquet")

# JSON (newline-delimited)
df = pl.read_ndjson("records.jsonl")

# Lazy scan — reads only needed columns/rows
lf = pl.scan_csv("large.csv")
result = lf.filter(pl.col("status") == "active").select("id", "name").collect()

select, filter, with_columns

The core expression-based API. select picks and transforms columns, returning a new DataFrame with only those columns. filter keeps rows matching a boolean expression. with_columns appends or replaces columns while keeping all originals.

python

import polars as pl

df = pl.DataFrame({
    "product": ["A", "B", "C", "D"],
    "price": [10.0, 25.0, 8.0, 40.0],
    "qty": [100, 50, 200, 30],
})

# select — compute revenue, keep only two columns
print(df.select(
    pl.col("product"),
    (pl.col("price") * pl.col("qty")).alias("revenue"),
))

Output:

text

shape: (4, 2)
┌─────────┬─────────┐
│ product ┆ revenue │
│ ---     ┆ ---     │
│ str     ┆ f64     │
╞═════════╪═════════╡
│ A       ┆ 1000.0  │
│ B       ┆ 1250.0  │
│ C       ┆ 1600.0  │
│ D       ┆ 1200.0  │
└─────────┴─────────┘

python

# filter + with_columns
result = (
    df.filter(pl.col("price") < 30)
    .with_columns(
        (pl.col("price") * pl.col("qty")).alias("revenue"),
        pl.col("product").str.to_lowercase().alias("product_lower"),
    )
)
print(result)

Output:

text

shape: (3, 5)
┌─────────┬───────┬─────┬─────────┬───────────────┐
│ product ┆ price ┆ qty ┆ revenue ┆ product_lower │
│ ---     ┆ ---   ┆ --- ┆ ---     ┆ ---           │
│ str     ┆ f64   ┆ i64 ┆ f64     ┆ str           │
╞═════════╪═══════╪═════╪═════════╪═══════════════╡
│ A       ┆ 10.0  ┆ 100 ┆ 1000.0  ┆ a             │
│ B       ┆ 25.0  ┆ 50  ┆ 1250.0  ┆ b             │
│ C       ┆ 8.0   ┆ 200 ┆ 1600.0  ┆ c             │
└─────────┴───────┴─────┴─────────┴───────────────┘

group_by and aggregation

group_by splits the DataFrame by one or more key columns and agg computes one or more aggregation expressions per group. Multiple aggregations are specified in a single agg(...) call.

python

import polars as pl

sales = pl.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "amount": [100, 200, 150, 250, 120],
})

summary = (
    sales.group_by("region", "product")
    .agg(
        pl.col("amount").sum().alias("total"),
        pl.col("amount").count().alias("orders"),
        pl.col("amount").mean().alias("avg"),
    )
    .sort("region", "product")
)
print(summary)

Output:

text

shape: (4, 5)
┌────────┬─────────┬───────┬────────┬───────┐
│ region ┆ product ┆ total ┆ orders ┆ avg   │
│ ---    ┆ ---     ┆ ---   ┆ ---    ┆ ---   │
│ str    ┆ str     ┆ i64   ┆ u32    ┆ f64   │
╞════════╪═════════╪═══════╪════════╪═══════╡
│ East   ┆ A       ┆ 220   ┆ 2      ┆ 110.0 │
│ East   ┆ B       ┆ 200   ┆ 1      ┆ 200.0 │
│ West   ┆ A       ┆ 150   ┆ 1      ┆ 150.0 │
│ West   ┆ B       ┆ 250   ┆ 1      ┆ 250.0 │
└────────┴─────────┴───────┴────────┴───────┘

Joins

Polars supports the standard join strategies. All joins are expressed as method calls on the left DataFrame with the right DataFrame as the first argument.

python

import polars as pl

employees = pl.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
salaries  = pl.DataFrame({"id": [1, 2, 4], "salary": [75000, 82000, 91000]})

# inner join (default)
print(employees.join(salaries, on="id", how="inner"))

Output:

text

shape: (2, 3)
┌─────┬───────┬────────┐
│ id  ┆ name  ┆ salary │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ str   ┆ i64    │
╞═════╪═══════╪════════╡
│ 1   ┆ Alice ┆ 75000  │
│ 2   ┆ Bob   ┆ 82000  │
└─────┴───────┴────────┘

python

# left join — keep all employees
print(employees.join(salaries, on="id", how="left"))

Output:

text

shape: (3, 3)
┌─────┬───────┬────────┐
│ id  ┆ name  ┆ salary │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ str   ┆ i64    │
╞═════╪═══════╪════════╡
│ 1   ┆ Alice ┆ 75000  │
│ 2   ┆ Bob   ┆ 82000  │
│ 3   ┆ Carol ┆ null   │
└─────┴───────┴────────┘

python

# anti join — employees with no salary record
print(employees.join(salaries, on="id", how="anti"))

Output:

text

shape: (1, 2)
┌─────┬───────┐
│ id  ┆ name  │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 3   ┆ Carol │
└─────┴───────┘

LazyFrame — deferred execution and query optimisation

A LazyFrame records transformation steps without executing them. When you call .collect(), polars compiles all steps into an optimised physical plan — it pushes filters early (predicate pushdown), prunes unused columns (projection pushdown), and parallelises independent branches. For large files, use scan_csv / scan_parquet instead of read_* so the optimizer can skip reading columns you don't need.

python

import polars as pl

# scan_parquet never reads the file until collect()
lf = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
    .sort("total", descending=True)
)

# Inspect the optimised plan before running
print(lf.explain())

# Execute
result = lf.collect()
print(result)

For datasets larger than RAM, add streaming=True to collect():

python

result = lf.collect(streaming=True)

Datetime expressions

Polars stores timestamps as Datetime (not Python datetime) and exposes time operations through the .dt namespace on expressions.

python

import polars as pl

df = pl.DataFrame({
    "ts": ["2026-01-01", "2026-03-15", "2026-06-30"],
    "value": [10, 25, 18],
}).with_columns(pl.col("ts").str.to_date().alias("date"))

print(
    df.with_columns(
        pl.col("date").dt.year().alias("year"),
        pl.col("date").dt.month().alias("month"),
        pl.col("date").dt.day_of_week().alias("dow"),  # 0=Monday
    )
)

Output:

text

shape: (3, 5)
┌────────────┬───────┬──────┬───────┬─────┐
│ date       ┆ value ┆ year ┆ month ┆ dow │
│ ---        ┆ ---   ┆ ---  ┆ ---   ┆ --- │
│ date       ┆ i64   ┆ i32  ┆ i8    ┆ i8  │
╞════════════╪═══════╪══════╪═══════╪═════╡
│ 2026-01-01 ┆ 10    ┆ 2026 ┆ 1     ┆ 3   │
│ 2026-03-15 ┆ 25    ┆ 2026 ┆ 3     ┆ 6   │
│ 2026-06-30 ┆ 18    ┆ 2026 ┆ 6     ┆ 1   │
└────────────┴───────┴──────┴───────┴─────┘

Group by calendar period with group_by_dynamic:

python

df2 = pl.DataFrame({
    "date": pl.date_range(pl.date(2026, 1, 1), pl.date(2026, 3, 31), "1d", eager=True),
    "sales": range(90),
})

monthly = df2.group_by_dynamic("date", every="1mo").agg(pl.col("sales").sum())
print(monthly.head(3))

Output:

text

shape: (3, 2)
┌────────────┬───────┐
│ date       ┆ sales │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2026-01-01 ┆ 465   │
│ 2026-02-01 ┆ 1246  │
│ 2026-03-01 ┆ 2015  │
└────────────┴───────┘

String namespace

The .str namespace mirrors pandas' .str accessor but is expressed as pl.col("col").str.<method>() inside with_columns or select.

python

import polars as pl

df = pl.DataFrame({"raw": ["  Hello World  ", "foo_bar", "CAPS LOCK"]})
print(
    df.with_columns(
        pl.col("raw").str.strip_chars().alias("stripped"),
        pl.col("raw").str.to_lowercase().alias("lower"),
        pl.col("raw").str.replace("_", " ").alias("underscores_removed"),
        pl.col("raw").str.contains("CAPS").alias("has_caps"),
        pl.col("raw").str.len_chars().alias("length"),
    )
)

Output:

text

shape: (3, 6)
┌──────────────────┬───────────────┬───────────┬─────────────────────┬──────────┬────────┐
│ raw              ┆ stripped      ┆ lower     ┆ underscores_removed ┆ has_caps ┆ length │
│ ---              ┆ ---           ┆ ---       ┆ ---                 ┆ ---      ┆ ---    │
│ str              ┆ str           ┆ str       ┆ str                 ┆ bool     ┆ u32    │
╞══════════════════╪═══════════════╪═══════════╪═════════════════════╪══════════╪════════╡
│   Hello World    ┆ Hello World   ┆   hello … ┆   Hello World       ┆ false    ┆ 15     │
│ foo_bar          ┆ foo_bar       ┆ foo_bar   ┆ foo bar             ┆ false    ┆ 7      │
│ CAPS LOCK        ┆ CAPS LOCK     ┆ caps lock ┆ CAPS LOCK           ┆ true     ┆ 9      │
└──────────────────┴───────────────┴───────────┴─────────────────────┴──────────┴────────┘

List namespace

When a column holds lists (pl.List dtype), the .list namespace provides vectorised operations without exploding the rows.

python

import polars as pl

df = pl.DataFrame({"scores": [[90, 85, 92], [78, 80], [95, 88, 76, 91]]})
print(
    df.with_columns(
        pl.col("scores").list.len().alias("count"),
        pl.col("scores").list.mean().alias("avg"),
        pl.col("scores").list.max().alias("max"),
    )
)

Output:

text

shape: (3, 4)
┌────────────────┬───────┬──────┬─────┐
│ scores         ┆ count ┆ avg  ┆ max │
│ ---            ┆ ---   ┆ ---  ┆ --- │
│ list[i64]      ┆ u32   ┆ f64  ┆ i64 │
╞════════════════╪═══════╪══════╪═════╡
│ [90, 85, 92]   ┆ 3     ┆ 89.0 ┆ 92  │
│ [78, 80]       ┆ 2     ┆ 79.0 ┆ 80  │
│ [95, 88, 76, … ┆ 4     ┆ 87.5 ┆ 95  │
└────────────────┴───────┴──────┴─────┘

Interop with pandas and PyArrow

to_pandas() converts a polars DataFrame to pandas. from_pandas() goes the other way. Both paths use the shared Arrow memory buffer when possible, making conversion near-zero-copy.

python

import polars as pl
import pandas as pd

df_pl = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

df_pd = df_pl.to_pandas()
print(type(df_pd), df_pd.dtypes.to_dict())

df_pl2 = pl.from_pandas(df_pd)
print(df_pl2.dtypes)

Output:

text

<class 'pandas.core.frame.DataFrame'> {'a': dtype('int64'), 'b': dtype('float64')}
[Int64, Float64]

Quick reference

Task	Code
Filter rows	`df.filter(pl.col("x") > 5)`
Select columns	`df.select("a", "b")`
Add/replace column	`df.with_columns(expr.alias("name"))`
Rename	`df.rename({"old": "new"})`
Sort	`df.sort("col", descending=True)`
Drop nulls	`df.drop_nulls(subset=["col"])`
Fill nulls	`df.with_columns(pl.col("c").fill_null(0))`
Group + agg	`df.group_by("k").agg(pl.col("v").sum())`
Inner join	`left.join(right, on="id", how="inner")`
Lazy plan	`pl.scan_csv("f.csv").filter(...).collect()`
Streaming	`.collect(streaming=True)`
To pandas	`df.to_pandas()`
From pandas	`pl.from_pandas(df_pd)`
Schema	`df.schema`
Shape	`df.shape`
Glimpse	`df.glimpse()`