cheat sheet
polars
High-performance DataFrames with a lazy expression API. Covers read/write, select, filter, group_by, joins, LazyFrame, datetime, string operations, and pandas interop.
polars — Fast DataFrames
What it is
Polars is a DataFrame library written in Rust that exposes a Python API. It uses Apache Arrow as its memory layout, processes data column-by-column with SIMD instructions, and executes lazy query plans in parallel across all CPU cores. Polars is significantly faster than pandas on most workloads and handles datasets that exceed RAM through streaming. It is the recommended alternative to pandas when performance or memory is a bottleneck.
Install
pip install polars
Output: (none — exits 0 on success)
For Parquet, Excel, and timezone support:
pip install polars[all]
Output: (none — exits 0 on success)
Quick example
import polars as pl
df = pl.DataFrame({
"name": ["Alice", "Bob", "Carol", "Dave"],
"dept": ["Eng", "Eng", "HR", "HR"],
"score": [92, 78, 85, 91],
})
print(df.filter(pl.col("score") > 80))
Output:
shape: (3, 3)
┌───────┬──────┬───────┐
│ name ┆ dept ┆ score │
│ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 │
╞═══════╪══════╪═══════╡
│ Alice ┆ Eng ┆ 92 │
│ Carol ┆ HR ┆ 85 │
│ Dave ┆ HR ┆ 91 │
└───────┴──────┴───────┘
When / why to use it
- Processing CSV, Parquet, or JSON files faster than pandas with a single-line change.
- Datasets that are too large for pandas (polars streams through them without loading into RAM with
streaming=True). - Multi-core machines where pandas' GIL-constrained single-threaded ops leave cores idle.
- Pipelines where lazy evaluation (
LazyFrame) lets polars optimise the entire query plan before executing it. - When you want to avoid pandas' inconsistent mutation semantics (
SettingWithCopyWarning).
Common pitfalls
Expressions, not index-based mutation — polars has no
.ilocor.loc. All transformations return new DataFrames via expressions.df["col"] = valueraisesTypeError; usedf.with_columns(pl.lit(value).alias("col"))instead.
collect()must be called on LazyFrames — aLazyFrameis a query plan. Nothing executes until you call.collect(). Forgetting it returns aLazyFrameobject, not aDataFrame.
Column names are case-sensitive —
pl.col("Score")andpl.col("score")are different. Unlike pandas, polars never coerces column names.
Use
df.glimpse()during exploration — it prints one row per column with dtype, much easier to read thandf.head()on wide DataFrames.
pl.col("*")selects all columns.pl.col(pl.Float64, pl.Int64)selects all numeric columns by dtype — useful inwith_columnsfor type-based transforms.
Reading and writing files
Polars has symmetric read_* / write_* and scan_* functions for all major formats. The scan_* family returns a LazyFrame and only reads data that the query plan actually needs, making it ideal for large files.
import polars as pl
# CSV
df = pl.read_csv("data.csv")
df.write_csv("output.csv")
# Parquet (fastest format for polars pipelines)
df = pl.read_parquet("data.parquet")
df.write_parquet("output.parquet")
# JSON (newline-delimited)
df = pl.read_ndjson("records.jsonl")
# Lazy scan — reads only needed columns/rows
lf = pl.scan_csv("large.csv")
result = lf.filter(pl.col("status") == "active").select("id", "name").collect()
select, filter, with_columns
The core expression-based API. select picks and transforms columns, returning a new DataFrame with only those columns. filter keeps rows matching a boolean expression. with_columns appends or replaces columns while keeping all originals.
import polars as pl
df = pl.DataFrame({
"product": ["A", "B", "C", "D"],
"price": [10.0, 25.0, 8.0, 40.0],
"qty": [100, 50, 200, 30],
})
# select — compute revenue, keep only two columns
print(df.select(
pl.col("product"),
(pl.col("price") * pl.col("qty")).alias("revenue"),
))
Output:
shape: (4, 2)
┌─────────┬─────────┐
│ product ┆ revenue │
│ --- ┆ --- │
│ str ┆ f64 │
╞═════════╪═════════╡
│ A ┆ 1000.0 │
│ B ┆ 1250.0 │
│ C ┆ 1600.0 │
│ D ┆ 1200.0 │
└─────────┴─────────┘
# filter + with_columns
result = (
df.filter(pl.col("price") < 30)
.with_columns(
(pl.col("price") * pl.col("qty")).alias("revenue"),
pl.col("product").str.to_lowercase().alias("product_lower"),
)
)
print(result)
Output:
shape: (3, 5)
┌─────────┬───────┬─────┬─────────┬───────────────┐
│ product ┆ price ┆ qty ┆ revenue ┆ product_lower │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ i64 ┆ f64 ┆ str │
╞═════════╪═══════╪═════╪═════════╪═══════════════╡
│ A ┆ 10.0 ┆ 100 ┆ 1000.0 ┆ a │
│ B ┆ 25.0 ┆ 50 ┆ 1250.0 ┆ b │
│ C ┆ 8.0 ┆ 200 ┆ 1600.0 ┆ c │
└─────────┴───────┴─────┴─────────┴───────────────┘
group_by and aggregation
group_by splits the DataFrame by one or more key columns and agg computes one or more aggregation expressions per group. Multiple aggregations are specified in a single agg(...) call.
import polars as pl
sales = pl.DataFrame({
"region": ["East", "East", "West", "West", "East"],
"product": ["A", "B", "A", "B", "A"],
"amount": [100, 200, 150, 250, 120],
})
summary = (
sales.group_by("region", "product")
.agg(
pl.col("amount").sum().alias("total"),
pl.col("amount").count().alias("orders"),
pl.col("amount").mean().alias("avg"),
)
.sort("region", "product")
)
print(summary)
Output:
shape: (4, 5)
┌────────┬─────────┬───────┬────────┬───────┐
│ region ┆ product ┆ total ┆ orders ┆ avg │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ u32 ┆ f64 │
╞════════╪═════════╪═══════╪════════╪═══════╡
│ East ┆ A ┆ 220 ┆ 2 ┆ 110.0 │
│ East ┆ B ┆ 200 ┆ 1 ┆ 200.0 │
│ West ┆ A ┆ 150 ┆ 1 ┆ 150.0 │
│ West ┆ B ┆ 250 ┆ 1 ┆ 250.0 │
└────────┴─────────┴───────┴────────┴───────┘
Joins
Polars supports the standard join strategies. All joins are expressed as method calls on the left DataFrame with the right DataFrame as the first argument.
import polars as pl
employees = pl.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
salaries = pl.DataFrame({"id": [1, 2, 4], "salary": [75000, 82000, 91000]})
# inner join (default)
print(employees.join(salaries, on="id", how="inner"))
Output:
shape: (2, 3)
┌─────┬───────┬────────┐
│ id ┆ name ┆ salary │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪═══════╪════════╡
│ 1 ┆ Alice ┆ 75000 │
│ 2 ┆ Bob ┆ 82000 │
└─────┴───────┴────────┘
# left join — keep all employees
print(employees.join(salaries, on="id", how="left"))
Output:
shape: (3, 3)
┌─────┬───────┬────────┐
│ id ┆ name ┆ salary │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 │
╞═════╪═══════╪════════╡
│ 1 ┆ Alice ┆ 75000 │
│ 2 ┆ Bob ┆ 82000 │
│ 3 ┆ Carol ┆ null │
└─────┴───────┴────────┘
# anti join — employees with no salary record
print(employees.join(salaries, on="id", how="anti"))
Output:
shape: (1, 2)
┌─────┬───────┐
│ id ┆ name │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 3 ┆ Carol │
└─────┴───────┘
LazyFrame — deferred execution and query optimisation
A LazyFrame records transformation steps without executing them. When you call .collect(), polars compiles all steps into an optimised physical plan — it pushes filters early (predicate pushdown), prunes unused columns (projection pushdown), and parallelises independent branches. For large files, use scan_csv / scan_parquet instead of read_* so the optimizer can skip reading columns you don't need.
import polars as pl
# scan_parquet never reads the file until collect()
lf = (
pl.scan_csv("sales.csv")
.filter(pl.col("amount") > 100)
.group_by("region")
.agg(pl.col("amount").sum().alias("total"))
.sort("total", descending=True)
)
# Inspect the optimised plan before running
print(lf.explain())
# Execute
result = lf.collect()
print(result)
For datasets larger than RAM, add streaming=True to collect():
result = lf.collect(streaming=True)
Datetime expressions
Polars stores timestamps as Datetime (not Python datetime) and exposes time operations through the .dt namespace on expressions.
import polars as pl
df = pl.DataFrame({
"ts": ["2026-01-01", "2026-03-15", "2026-06-30"],
"value": [10, 25, 18],
}).with_columns(pl.col("ts").str.to_date().alias("date"))
print(
df.with_columns(
pl.col("date").dt.year().alias("year"),
pl.col("date").dt.month().alias("month"),
pl.col("date").dt.day_of_week().alias("dow"), # 0=Monday
)
)
Output:
shape: (3, 5)
┌────────────┬───────┬──────┬───────┬─────┐
│ date ┆ value ┆ year ┆ month ┆ dow │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ i32 ┆ i8 ┆ i8 │
╞════════════╪═══════╪══════╪═══════╪═════╡
│ 2026-01-01 ┆ 10 ┆ 2026 ┆ 1 ┆ 3 │
│ 2026-03-15 ┆ 25 ┆ 2026 ┆ 3 ┆ 6 │
│ 2026-06-30 ┆ 18 ┆ 2026 ┆ 6 ┆ 1 │
└────────────┴───────┴──────┴───────┴─────┘
Group by calendar period with group_by_dynamic:
df2 = pl.DataFrame({
"date": pl.date_range(pl.date(2026, 1, 1), pl.date(2026, 3, 31), "1d", eager=True),
"sales": range(90),
})
monthly = df2.group_by_dynamic("date", every="1mo").agg(pl.col("sales").sum())
print(monthly.head(3))
Output:
shape: (3, 2)
┌────────────┬───────┐
│ date ┆ sales │
│ --- ┆ --- │
│ date ┆ i64 │
╞════════════╪═══════╡
│ 2026-01-01 ┆ 465 │
│ 2026-02-01 ┆ 1246 │
│ 2026-03-01 ┆ 2015 │
└────────────┴───────┘
String namespace
The .str namespace mirrors pandas' .str accessor but is expressed as pl.col("col").str.<method>() inside with_columns or select.
import polars as pl
df = pl.DataFrame({"raw": [" Hello World ", "foo_bar", "CAPS LOCK"]})
print(
df.with_columns(
pl.col("raw").str.strip_chars().alias("stripped"),
pl.col("raw").str.to_lowercase().alias("lower"),
pl.col("raw").str.replace("_", " ").alias("underscores_removed"),
pl.col("raw").str.contains("CAPS").alias("has_caps"),
pl.col("raw").str.len_chars().alias("length"),
)
)
Output:
shape: (3, 6)
┌──────────────────┬───────────────┬───────────┬─────────────────────┬──────────┬────────┐
│ raw ┆ stripped ┆ lower ┆ underscores_removed ┆ has_caps ┆ length │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ str ┆ str ┆ bool ┆ u32 │
╞══════════════════╪═══════════════╪═══════════╪═════════════════════╪══════════╪════════╡
│ Hello World ┆ Hello World ┆ hello … ┆ Hello World ┆ false ┆ 15 │
│ foo_bar ┆ foo_bar ┆ foo_bar ┆ foo bar ┆ false ┆ 7 │
│ CAPS LOCK ┆ CAPS LOCK ┆ caps lock ┆ CAPS LOCK ┆ true ┆ 9 │
└──────────────────┴───────────────┴───────────┴─────────────────────┴──────────┴────────┘
List namespace
When a column holds lists (pl.List dtype), the .list namespace provides vectorised operations without exploding the rows.
import polars as pl
df = pl.DataFrame({"scores": [[90, 85, 92], [78, 80], [95, 88, 76, 91]]})
print(
df.with_columns(
pl.col("scores").list.len().alias("count"),
pl.col("scores").list.mean().alias("avg"),
pl.col("scores").list.max().alias("max"),
)
)
Output:
shape: (3, 4)
┌────────────────┬───────┬──────┬─────┐
│ scores ┆ count ┆ avg ┆ max │
│ --- ┆ --- ┆ --- ┆ --- │
│ list[i64] ┆ u32 ┆ f64 ┆ i64 │
╞════════════════╪═══════╪══════╪═════╡
│ [90, 85, 92] ┆ 3 ┆ 89.0 ┆ 92 │
│ [78, 80] ┆ 2 ┆ 79.0 ┆ 80 │
│ [95, 88, 76, … ┆ 4 ┆ 87.5 ┆ 95 │
└────────────────┴───────┴──────┴─────┘
Interop with pandas and PyArrow
to_pandas() converts a polars DataFrame to pandas. from_pandas() goes the other way. Both paths use the shared Arrow memory buffer when possible, making conversion near-zero-copy.
import polars as pl
import pandas as pd
df_pl = pl.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
df_pd = df_pl.to_pandas()
print(type(df_pd), df_pd.dtypes.to_dict())
df_pl2 = pl.from_pandas(df_pd)
print(df_pl2.dtypes)
Output:
<class 'pandas.core.frame.DataFrame'> {'a': dtype('int64'), 'b': dtype('float64')}
[Int64, Float64]
Quick reference
| Task | Code |
|---|---|
| Filter rows | df.filter(pl.col("x") > 5) |
| Select columns | df.select("a", "b") |
| Add/replace column | df.with_columns(expr.alias("name")) |
| Rename | df.rename({"old": "new"}) |
| Sort | df.sort("col", descending=True) |
| Drop nulls | df.drop_nulls(subset=["col"]) |
| Fill nulls | df.with_columns(pl.col("c").fill_null(0)) |
| Group + agg | df.group_by("k").agg(pl.col("v").sum()) |
| Inner join | left.join(right, on="id", how="inner") |
| Lazy plan | pl.scan_csv("f.csv").filter(...).collect() |
| Streaming | .collect(streaming=True) |
| To pandas | df.to_pandas() |
| From pandas | pl.from_pandas(df_pd) |
| Schema | df.schema |
| Shape | df.shape |
| Glimpse | df.glimpse() |