cheat sheet

polars

High-performance DataFrames with a lazy expression API. Covers read/write, select, filter, group_by, joins, LazyFrame, datetime, string operations, and pandas interop.

polars — Fast DataFrames

What it is

Polars is a DataFrame library written in Rust that exposes a Python API. It uses Apache Arrow as its memory layout, processes data column-by-column with SIMD instructions, and executes lazy query plans in parallel across all CPU cores. Polars is significantly faster than pandas on most workloads and handles datasets that exceed RAM through streaming. It is the recommended alternative to pandas when performance or memory is a bottleneck.

Install

bash
pip install polars

Output: (none — exits 0 on success)

For Parquet, Excel, and timezone support:

bash
pip install polars[all]

Output: (none — exits 0 on success)

Quick example

python
import polars as pl

df = pl.DataFrame({
    "name": ["Alice", "Bob", "Carol", "Dave"],
    "dept": ["Eng", "Eng", "HR", "HR"],
    "score": [92, 78, 85, 91],
})
print(df.filter(pl.col("score") > 80))

Output:

text
shape: (3, 3)
┌───────┬──────┬───────┐
│ name  ┆ dept ┆ score │
│ ---   ┆ ---  ┆ ---   │
│ str   ┆ str  ┆ i64   │
╞═══════╪══════╪═══════╡
│ Alice ┆ Eng  ┆ 92    │
│ Carol ┆ HR   ┆ 85    │
│ Dave  ┆ HR   ┆ 91    │
└───────┴──────┴───────┘

When / why to use it

  • Processing CSV, Parquet, or JSON files faster than pandas with a single-line change.
  • Datasets that are too large for pandas (polars streams through them without loading into RAM with streaming=True).
  • Multi-core machines where pandas' GIL-constrained single-threaded ops leave cores idle.
  • Pipelines where lazy evaluation (LazyFrame) lets polars optimise the entire query plan before executing it.
  • When you want to avoid pandas' inconsistent mutation semantics (SettingWithCopyWarning).

Common pitfalls

Expressions, not index-based mutation — polars has no .iloc or .loc. All transformations return new DataFrames via expressions. df["col"] = value raises TypeError; use df.with_columns(pl.lit(value).alias("col")) instead.

collect() must be called on LazyFrames — a LazyFrame is a query plan. Nothing executes until you call .collect(). Forgetting it returns a LazyFrame object, not a DataFrame.

Column names are case-sensitivepl.col("Score") and pl.col("score") are different. Unlike pandas, polars never coerces column names.

Use df.glimpse() during exploration — it prints one row per column with dtype, much easier to read than df.head() on wide DataFrames.

pl.col("*") selects all columns. pl.col(pl.Float64, pl.Int64) selects all numeric columns by dtype — useful in with_columns for type-based transforms.

Reading and writing files

Polars has symmetric read_* / write_* and scan_* functions for all major formats. The scan_* family returns a LazyFrame and only reads data that the query plan actually needs, making it ideal for large files.

python
import polars as pl

# CSV
df = pl.read_csv("data.csv")
df.write_csv("output.csv")

# Parquet (fastest format for polars pipelines)
df = pl.read_parquet("data.parquet")
df.write_parquet("output.parquet")

# JSON (newline-delimited)
df = pl.read_ndjson("records.jsonl")

# Lazy scan — reads only needed columns/rows
lf = pl.scan_csv("large.csv")
result = lf.filter(pl.col("status") == "active").select("id", "name").collect()

select, filter, with_columns

The core expression-based API. select picks and transforms columns, returning a new DataFrame with only those columns. filter keeps rows matching a boolean expression. with_columns appends or replaces columns while keeping all originals.

python
import polars as pl

df = pl.DataFrame({
    "product": ["A", "B", "C", "D"],
    "price": [10.0, 25.0, 8.0, 40.0],
    "qty": [100, 50, 200, 30],
})

# select — compute revenue, keep only two columns
print(df.select(
    pl.col("product"),
    (pl.col("price") * pl.col("qty")).alias("revenue"),
))

Output:

text
shape: (4, 2)
┌─────────┬─────────┐
│ product ┆ revenue │
│ ---     ┆ ---     │
│ str     ┆ f64     │
╞═════════╪═════════╡
│ A       ┆ 1000.0  │
│ B       ┆ 1250.0  │
│ C       ┆ 1600.0  │
│ D       ┆ 1200.0  │
└─────────┴─────────┘
python
# filter + with_columns
result = (
    df.filter(pl.col("price") < 30)
    .with_columns(
        (pl.col("price") * pl.col("qty")).alias("revenue"),
        pl.col("product").str.to_lowercase().alias("product_lower"),
    )
)
print(result)

Output:

text
shape: (3, 5)
┌─────────┬───────┬─────┬─────────┬───────────────┐
│ product ┆ price ┆ qty ┆ revenue ┆ product_lower │
│ ---     ┆ ---   ┆ --- ┆ ---     ┆ ---           │
│ str     ┆ f64   ┆ i64 ┆ f64     ┆ str           │
╞═════════╪═══════╪═════╪═════════╪═══════════════╡
│ A       ┆ 10.0  ┆ 100 ┆ 1000.0  ┆ a             │
│ B       ┆ 25.0  ┆ 50  ┆ 1250.0  ┆ b             │
│ C       ┆ 8.0   ┆ 200 ┆ 1600.0  ┆ c             │
└─────────┴───────┴─────┴─────────┴───────────────┘

group_by and aggregation

group_by splits the DataFrame by one or more key columns and agg computes one or more aggregation expressions per group. Multiple aggregations are specified in a single agg(...) call.

python
import polars as pl

sales = pl.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "amount": [100, 200, 150, 250, 120],
})

summary = (
    sales.group_by("region", "product")
    .agg(
        pl.col("amount").sum().alias("total"),
        pl.col("amount").count().alias("orders"),
        pl.col("amount").mean().alias("avg"),
    )
    .sort("region", "product")
)
print(summary)

Output:

text
shape: (4, 5)
┌────────┬─────────┬───────┬────────┬───────┐
│ region ┆ product ┆ total ┆ orders ┆ avg   │
│ ---    ┆ ---     ┆ ---   ┆ ---    ┆ ---   │
│ str    ┆ str     ┆ i64   ┆ u32    ┆ f64   │
╞════════╪═════════╪═══════╪════════╪═══════╡
│ East   ┆ A       ┆ 220   ┆ 2      ┆ 110.0 │
│ East   ┆ B       ┆ 200   ┆ 1      ┆ 200.0 │
│ West   ┆ A       ┆ 150   ┆ 1      ┆ 150.0 │
│ West   ┆ B       ┆ 250   ┆ 1      ┆ 250.0 │
└────────┴─────────┴───────┴────────┴───────┘

Joins

Polars supports the standard join strategies. All joins are expressed as method calls on the left DataFrame with the right DataFrame as the first argument.

python
import polars as pl

employees = pl.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
salaries  = pl.DataFrame({"id": [1, 2, 4], "salary": [75000, 82000, 91000]})

# inner join (default)
print(employees.join(salaries, on="id", how="inner"))

Output:

text
shape: (2, 3)
┌─────┬───────┬────────┐
│ id  ┆ name  ┆ salary │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ str   ┆ i64    │
╞═════╪═══════╪════════╡
│ 1   ┆ Alice ┆ 75000  │
│ 2   ┆ Bob   ┆ 82000  │
└─────┴───────┴────────┘
python
# left join — keep all employees
print(employees.join(salaries, on="id", how="left"))

Output:

text
shape: (3, 3)
┌─────┬───────┬────────┐
│ id  ┆ name  ┆ salary │
│ --- ┆ ---   ┆ ---    │
│ i64 ┆ str   ┆ i64    │
╞═════╪═══════╪════════╡
│ 1   ┆ Alice ┆ 75000  │
│ 2   ┆ Bob   ┆ 82000  │
│ 3   ┆ Carol ┆ null   │
└─────┴───────┴────────┘
python
# anti join — employees with no salary record
print(employees.join(salaries, on="id", how="anti"))

Output:

text
shape: (1, 2)
┌─────┬───────┐
│ id  ┆ name  │
│ --- ┆ ---   │
│ i64 ┆ str   │
╞═════╪═══════╡
│ 3   ┆ Carol │
└─────┴───────┘

LazyFrame — deferred execution and query optimisation

A LazyFrame records transformation steps without executing them. When you call .collect(), polars compiles all steps into an optimised physical plan — it pushes filters early (predicate pushdown), prunes unused columns (projection pushdown), and parallelises independent branches. For large files, use scan_csv / scan_parquet instead of read_* so the optimizer can skip reading columns you don't need.

python
import polars as pl

# scan_parquet never reads the file until collect()
lf = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 100)
    .group_by("region")
    .agg(pl.col("amount").sum().alias("total"))
    .sort("total", descending=True)
)

# Inspect the optimised plan before running
print(lf.explain())

# Execute
result = lf.collect()
print(result)

For datasets larger than RAM, add streaming=True to collect():

python
result = lf.collect(streaming=True)

Datetime expressions

Polars stores timestamps as Datetime (not Python datetime) and exposes time operations through the .dt namespace on expressions.

python
import polars as pl

df = pl.DataFrame({
    "ts": ["2026-01-01", "2026-03-15", "2026-06-30"],
    "value": [10, 25, 18],
}).with_columns(pl.col("ts").str.to_date().alias("date"))

print(
    df.with_columns(
        pl.col("date").dt.year().alias("year"),
        pl.col("date").dt.month().alias("month"),
        pl.col("date").dt.day_of_week().alias("dow"),  # 0=Monday
    )
)

Output:

text
shape: (3, 5)
┌────────────┬───────┬──────┬───────┬─────┐
│ date       ┆ value ┆ year ┆ month ┆ dow │
│ ---        ┆ ---   ┆ ---  ┆ ---   ┆ --- │
│ date       ┆ i64   ┆ i32  ┆ i8    ┆ i8  │
╞════════════╪═══════╪══════╪═══════╪═════╡
│ 2026-01-01 ┆ 10    ┆ 2026 ┆ 1     ┆ 3   │
│ 2026-03-15 ┆ 25    ┆ 2026 ┆ 3     ┆ 6   │
│ 2026-06-30 ┆ 18    ┆ 2026 ┆ 6     ┆ 1   │
└────────────┴───────┴──────┴───────┴─────┘

Group by calendar period with group_by_dynamic:

python
df2 = pl.DataFrame({
    "date": pl.date_range(pl.date(2026, 1, 1), pl.date(2026, 3, 31), "1d", eager=True),
    "sales": range(90),
})

monthly = df2.group_by_dynamic("date", every="1mo").agg(pl.col("sales").sum())
print(monthly.head(3))

Output:

text
shape: (3, 2)
┌────────────┬───────┐
│ date       ┆ sales │
│ ---        ┆ ---   │
│ date       ┆ i64   │
╞════════════╪═══════╡
│ 2026-01-01 ┆ 465   │
│ 2026-02-01 ┆ 1246  │
│ 2026-03-01 ┆ 2015  │
└────────────┴───────┘

String namespace

The .str namespace mirrors pandas' .str accessor but is expressed as pl.col("col").str.<method>() inside with_columns or select.

python
import polars as pl

df = pl.DataFrame({"raw": ["  Hello World  ", "foo_bar", "CAPS LOCK"]})
print(
    df.with_columns(
        pl.col("raw").str.strip_chars().alias("stripped"),
        pl.col("raw").str.to_lowercase().alias("lower"),
        pl.col("raw").str.replace("_", " ").alias("underscores_removed"),
        pl.col("raw").str.contains("CAPS").alias("has_caps"),
        pl.col("raw").str.len_chars().alias("length"),
    )
)

Output:

text
shape: (3, 6)
┌──────────────────┬───────────────┬───────────┬─────────────────────┬──────────┬────────┐
│ raw              ┆ stripped      ┆ lower     ┆ underscores_removed ┆ has_caps ┆ length │
│ ---              ┆ ---           ┆ ---       ┆ ---                 ┆ ---      ┆ ---    │
│ str              ┆ str           ┆ str       ┆ str                 ┆ bool     ┆ u32    │
╞══════════════════╪═══════════════╪═══════════╪═════════════════════╪══════════╪════════╡
│   Hello World    ┆ Hello World   ┆   hello … ┆   Hello World       ┆ false    ┆ 15     │
│ foo_bar          ┆ foo_bar       ┆ foo_bar   ┆ foo bar             ┆ false    ┆ 7      │
│ CAPS LOCK        ┆ CAPS LOCK     ┆ caps lock ┆ CAPS LOCK           ┆ true     ┆ 9      │
└──────────────────┴───────────────┴───────────┴─────────────────────┴──────────┴────────┘

List namespace

When a column holds lists (pl.List dtype), the .list namespace provides vectorised operations without exploding the rows.

python
import polars as pl

df = pl.DataFrame({"scores": [[90, 85, 92], [78, 80], [95, 88, 76, 91]]})
print(
    df.with_columns(
        pl.col("scores").list.len().alias("count"),
        pl.col("scores").list.mean().alias("avg"),
        pl.col("scores").list.max().alias("max"),
    )
)

Output:

text
shape: (3, 4)
┌────────────────┬───────┬──────┬─────┐
│ scores         ┆ count ┆ avg  ┆ max │
│ ---            ┆ ---   ┆ ---  ┆ --- │
│ list[i64]      ┆ u32   ┆ f64  ┆ i64 │
╞════════════════╪═══════╪══════╪═════╡
│ [90, 85, 92]   ┆ 3     ┆ 89.0 ┆ 92  │
│ [78, 80]       ┆ 2     ┆ 79.0 ┆ 80  │
│ [95, 88, 76, … ┆ 4     ┆ 87.5 ┆ 95  │
└────────────────┴───────┴──────┴─────┘

Interop with pandas and PyArrow

to_pandas() converts a polars DataFrame to pandas. from_pandas() goes the other way. Both paths use the shared Arrow memory buffer when possible, making conversion near-zero-copy.

python
import polars as pl
import pandas as pd

df_pl = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})

df_pd = df_pl.to_pandas()
print(type(df_pd), df_pd.dtypes.to_dict())

df_pl2 = pl.from_pandas(df_pd)
print(df_pl2.dtypes)

Output:

text
<class 'pandas.core.frame.DataFrame'> {'a': dtype('int64'), 'b': dtype('float64')}
[Int64, Float64]

Quick reference

TaskCode
Filter rowsdf.filter(pl.col("x") > 5)
Select columnsdf.select("a", "b")
Add/replace columndf.with_columns(expr.alias("name"))
Renamedf.rename({"old": "new"})
Sortdf.sort("col", descending=True)
Drop nullsdf.drop_nulls(subset=["col"])
Fill nullsdf.with_columns(pl.col("c").fill_null(0))
Group + aggdf.group_by("k").agg(pl.col("v").sum())
Inner joinleft.join(right, on="id", how="inner")
Lazy planpl.scan_csv("f.csv").filter(...).collect()
Streaming.collect(streaming=True)
To pandasdf.to_pandas()
From pandaspl.from_pandas(df_pd)
Schemadf.schema
Shapedf.shape
Glimpsedf.glimpse()