cheat sheet
pandas
Load, filter, transform, and aggregate tabular data with pandas. Covers DataFrame creation, read_csv, groupby, merge, and the SettingWithCopy pitfall.
pandas — DataFrames
What it is
pandas provides two main structures: DataFrame (2-D labeled table) and Series (1-D labeled array). It is the standard library for reading, cleaning, transforming, and analyzing tabular data in Python.
Install
pip install pandas
Output: (none — exits 0 on success)
Quick example
import pandas as pd
df = pd.DataFrame({
"name": ["Alice", "Bob", "Charlie"],
"dept": ["Eng", "Eng", "HR"],
"score": [92, 78, 85],
})
print(df[df["score"] > 80])
Output:
name dept score
0 Alice Eng 92
2 Charlie HR 85
When / why to use it
- Reading CSV, Excel, JSON, or SQL tables:
pd.read_csv(),pd.read_excel(),pd.read_sql(). - Cleaning and reshaping data before analysis or ML.
- Exploratory data analysis with
df.describe(),groupby,pivot_table. - Joining multiple datasets:
pd.merge().
Common pitfalls
SettingWithCopyWarning — modifying a slice of a DataFrame may not modify the original. Use
.locfor assignment:df.loc[df["score"] > 80, "grade"] = "A" # ✅ correct df[df["score"] > 80]["grade"] = "A" # ❌ may silently do nothing
inplace=Trueis a trap — many operations acceptinplace=Truebut it doesn't save memory and makes code harder to chain. Prefer reassignment:df = df.dropna().
Use
df.dtypesanddf.info()early to understand what you loaded. Numeric columns imported asobject(string) will break aggregations silently.
Reading and writing files
pd.read_csv is the most common entry point, but pandas also reads Excel, JSON, Parquet, SQL, HTML tables, and more through the same read_* / to_* symmetry. Always pass dtype= or parse_dates= explicitly when the inferred types are wrong — silent type mismatches (e.g. numeric IDs read as object) cause subtle aggregation bugs downstream.
import pandas as pd
# CSV
df = pd.read_csv("data.csv")
df.to_csv("output.csv", index=False)
# Excel
df = pd.read_excel("report.xlsx", sheet_name="Sheet1")
# JSON
df = pd.read_json("records.json", orient="records")
# from a SQL query
import sqlite3
conn = sqlite3.connect("app.db")
df = pd.read_sql("SELECT * FROM users WHERE active = 1", conn)
Richer example — groupby and merge
import pandas as pd
sales = pd.DataFrame({
"product": ["A", "B", "A", "B", "A"],
"region": ["East", "East", "West", "West", "East"],
"amount": [100, 200, 150, 250, 120],
})
# Group by region and product, aggregate
summary = (
sales.groupby(["region", "product"])["amount"]
.agg(total="sum", orders="count")
.reset_index()
)
print(summary)
print()
# Merge with a product lookup table
products = pd.DataFrame({"product": ["A", "B"], "name": ["Widget", "Gadget"]})
merged = summary.merge(products, on="product")
print(merged[["region", "name", "total", "orders"]])
Output:
region product total orders
0 East A 220 2
1 East B 200 1
2 West A 150 1
3 West B 250 1
region name total orders
0 East Widget 220 2
1 East Gadget 200 1
2 West Widget 150 1
3 West Gadget 250 1
Useful operations quick reference
| Task | Code |
|---|---|
| Filter rows | df[df["col"] > 5] |
| Select columns | df[["a", "b"]] |
| Add column | df["new"] = df["a"] * 2 |
| Drop rows with NaN | df.dropna(subset=["col"]) |
| Fill NaN | df["col"].fillna(0) |
| Sort | df.sort_values("col", ascending=False) |
| Rename columns | df.rename(columns={"old": "new"}) |
| Apply function | df["col"].apply(lambda x: x.strip()) |
| Pivot | df.pivot_table(values="sales", index="region", columns="product", aggfunc="sum") |
| Value counts | df["status"].value_counts() |
| Describe | df.describe() |
| Shape | df.shape |
Combining DataFrames
pd.concat stacks DataFrames vertically (default) or horizontally by aligning on the index; use it when the schemas are identical. pd.merge performs SQL-style joins on one or more key columns and is the right tool when the two tables have different shapes and a shared key. DataFrame.join is a convenience wrapper around merge that joins on the index by default.
import pandas as pd
# concat — stack vertically (same columns)
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
result = pd.concat([df1, df2], ignore_index=True)
print(result)
Output:
A B
0 1 3
1 2 4
2 5 7
3 6 8
# concat — side by side (same rows)
result = pd.concat([df1, df2], axis=1)
print(result)
Output:
A B A B
0 1 3 5 7
1 2 4 6 8
# merge — inner join (default)
employees = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
salaries = pd.DataFrame({"id": [1, 2, 4], "salary": [75000, 82000, 91000]})
result = pd.merge(employees, salaries, on="id")
print(result)
Output:
id name salary
0 1 Alice 75000
1 2 Bob 82000
# merge — left join (keep all from left)
result = pd.merge(employees, salaries, on="id", how="left")
print(result)
Output:
id name salary
0 1 Alice 75000.0
1 2 Bob 82000.0
2 3 Carol NaN
# merge — outer join (keep all rows from both)
result = pd.merge(employees, salaries, on="id", how="outer")
print(result)
Output:
id name salary
0 1 Alice 75000.0
1 2 Bob 82000.0
2 3 Carol NaN
3 4 NaN 91000.0
Reshaping
melt converts wide format (one column per variable) to long format (one row per observation), which most statistical and plotting libraries expect. pivot_table goes the other way, aggregating a long DataFrame into a wide summary matrix. stack / unstack move column levels to the row index and back, useful for multi-level indexed data.
# melt — wide to long
df = pd.DataFrame({"name": ["Alice", "Bob"], "math": [90, 85], "english": [88, 92]})
melted = df.melt(id_vars="name", var_name="subject", value_name="score")
print(melted)
Output:
name subject score
0 Alice math 90
1 Bob math 85
2 Alice english 88
3 Bob english 92
# pivot_table — long to wide with aggregation
print(pd.pivot_table(melted, values="score", index="name", columns="subject"))
Output:
subject english math
name
Alice 88 90
Bob 92 85
# stack / unstack — move column level to row index
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])
print(df.stack())
Output:
x A 1
B 3
y A 2
B 4
dtype: int64
DateTime handling
pd.to_datetime converts string or numeric columns to datetime64[ns] dtype, enabling the .dt accessor for extracting components (.dt.year, .dt.day_of_week) and time-aware operations. Set the column as the index to unlock resample() for period-based aggregation (hourly, daily, monthly) and rolling() for moving-window statistics.
# Parse dates
df = pd.DataFrame({"date": ["2026-01-01", "2026-01-15", "2026-02-01"], "value": [10, 25, 18]})
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes)
Output:
date datetime64[ns]
value int64
dtype: object
# dt accessor — extract components
print(df["date"].dt.year.tolist())
print(df["date"].dt.month_name().tolist())
print(df["date"].dt.day_of_week.tolist()) # 0=Monday
Output:
[2026, 2026, 2026]
['January', 'January', 'February']
[3, 3, 6]
# resample — aggregate by time period
df = df.set_index("date")
print(df.resample("ME").sum()) # month-end
Output:
value
date
2026-01-31 35
2026-02-28 18
# rolling window
df["rolling_2"] = df["value"].rolling(window=2).mean()
print(df)
Output:
value rolling_2
date
2026-01-01 10 NaN
2026-01-15 25 17.5
2026-02-01 18 21.5
String accessor
The .str accessor exposes vectorized string methods on object-dtype (or StringDtype) Series, avoiding explicit .apply(lambda x: ...) loops. Most Python str methods are available as .str.<method>(), plus regex support (str.contains, str.extract, str.replace with regex=True) and list-like indexing after str.split.
s = pd.Series([" Alice Smith ", "bob jones", "CAROL WHITE"])
print(s.str.strip())
print(s.str.lower())
print(s.str.upper())
print(s.str.title())
print(s.str.contains("jones", case=False))
print(s.str.replace(r"\s+", "_", regex=True))
print(s.str.split(" ").str[0]) # first word
print(s.str.len())
Output:
0 Alice Smith
1 bob jones
2 CAROL WHITE
dtype: object
0 alice smith
1 bob jones
2 carol white
dtype: object
0 ALICE SMITH
1 BOB JONES
2 CAROL WHITE
dtype: object
0 Alice Smith
1 Bob Jones
2 Carol White
dtype: object
0 False
1 True
2 False
dtype: bool
0 Alice_Smith
1 bob_jones
2 CAROL_WHITE
dtype: object
0 Alice
1 bob
2 CAROL
dtype: object
0 15
1 9
2 11
dtype: int64
apply vs transform vs agg
In a groupby context: agg reduces each group to a scalar (or set of scalars) and is the right choice for summary statistics. transform returns a Series with the same length as the original DataFrame, enabling group-level values (like a group mean) to be added back as a new column. apply is the escape hatch — it passes each group as a DataFrame to an arbitrary function, making it the most flexible but also the slowest.
df = pd.DataFrame({"dept": ["Eng", "Eng", "Ops"], "salary": [75000, 82000, 68000]})
# apply — custom function per group (returns reduced result)
print(df.groupby("dept")["salary"].apply(lambda x: x.max() - x.min()))
Output:
dept
Eng 7000
Ops 0
Name: salary, dtype: int64
# transform — returns same-shape result (for feature engineering)
df["dept_avg"] = df.groupby("dept")["salary"].transform("mean")
print(df)
Output:
dept salary dept_avg
0 Eng 75000 78500.0
1 Eng 82000 78500.0
2 Ops 68000 68000.0
# agg — multiple aggregations at once
print(df.groupby("dept")["salary"].agg(["mean", "min", "max", "count"]))
Output:
mean min max count
dept
Eng 78500 75000 82000 2
Ops 68000 68000 68000 1
Nullable dtypes and Parquet I/O
Pandas' nullable extension types (Int64, Float64, StringDtype, boolean) use pd.NA as the missing-value sentinel instead of float('nan'), which means integer and boolean columns can represent missingness without being upcast to float. Parquet preserves these dtypes on round-trip (unlike CSV), making it the preferred format when schema fidelity and file size matter.
# Nullable integer (can hold NaN unlike int64)
s = pd.array([1, 2, None, 4], dtype="Int64")
print(pd.Series(s))
Output:
0 1
1 2
2 <NA>
3 4
dtype: Int64
# Parquet round-trip
df.to_parquet("data.parquet", index=False)
df2 = pd.read_parquet("data.parquet")
print(df2.dtypes)
Output:
dept object
salary int64
dtype: object
groupby internals
groupby is a two-step pipeline: a split phase that builds an index of group labels to row positions, and an apply-combine phase that runs an aggregation, transformation, or filter and stitches the results back together. The split is lazy — df.groupby("k") doesn't compute anything; the real work happens when you call .agg(), .transform(), .apply(), or .filter(). Knowing which one you need has a huge impact on memory and speed.
import pandas as pd
df = pd.DataFrame({
"team": ["A", "A", "B", "B", "C"],
"score": [10, 15, 8, 12, 20],
"minutes": [40, 35, 30, 45, 50],
})
g = df.groupby("team")
# 1. agg: scalar per group → smaller result
print(g.agg(total_score=("score", "sum"), avg_min=("minutes", "mean")))
Output:
total_score avg_min
team
A 25 37.5
B 20 37.5
C 20 50.0
# 2. transform: same-shape result aligned to the original rows
df["team_avg"] = g["score"].transform("mean")
print(df)
Output:
team score minutes team_avg
0 A 10 40 12.5
1 A 15 35 12.5
2 B 8 30 10.0
3 B 12 45 10.0
4 C 20 50 20.0
# 3. filter: keep entire groups that pass a test
print(g.filter(lambda x: x["score"].mean() > 11))
Output:
team score minutes team_avg
0 A 10 40 12.5
1 A 15 35 12.5
4 C 20 50 20.0
Use named aggregations (
agg(total=("col", "sum"))) rather than passing a dict — the result has flat column names and is much easier to chain.
observed=Trueon a categorical groupby skips empty combinations. In pandas 2.x the default flips toTrue; setting it explicitly silences a deprecation warning and avoids surprise empty rows.
Multi-index DataFrames
A MultiIndex (hierarchical index) stores multiple labels per row position, useful after a groupby aggregation or set_index on two or more columns. You can select a single level with .xs("Eng", level="dept"), an entire branch with df.loc[("Eng", "Alice")], or all entries under a top-level value with df.loc["Eng"]. To go back to flat columns call .reset_index().
import pandas as pd
df = pd.DataFrame({
"dept": ["Eng", "Eng", "HR", "HR"],
"role": ["Manager", "IC", "Manager", "IC"],
"headcount": [5, 30, 2, 8],
}).set_index(["dept", "role"])
print(df)
print()
print(df.loc["Eng"]) # all Eng rows
print()
print(df.xs("Manager", level="role")) # Managers across all depts
Output:
headcount
dept role
Eng Manager 5
IC 30
HR Manager 2
IC 8
headcount
role
Manager 5
IC 30
headcount
dept
Eng 5
HR 2
Sort a MultiIndex with
df.sort_index()before slicing — unsorted multi-indexes raisePerformanceWarningand can produce wrong slices inloc.
query() and eval() for readable filters
df.query("col > 5 and name == 'Alice'") uses a pandas-flavoured expression language with implicit column references, making complex filters more readable than chained boolean masks. df.eval("revenue = price * qty") is the same idea for column creation. Both compile the expression with numexpr when available, which can be 2–5x faster than the pure-Python equivalent on large DataFrames.
import pandas as pd
df = pd.DataFrame({
"name": ["Alice", "Bob", "Carol"],
"dept": ["Eng", "Eng", "HR"],
"salary": [120000, 85000, 95000],
})
# Reference an external variable with @
threshold = 100000
print(df.query("salary > @threshold and dept == 'Eng'"))
Output:
name dept salary
0 Alice Eng 120000
df = df.eval("bonus = salary * 0.10")
print(df)
Output:
name dept salary bonus
0 Alice Eng 120000 12000.0
1 Bob Eng 85000 8500.0
2 Carol HR 95000 9500.0
Dtype optimisation — category and PyArrow
Object-dtype columns (Python strings) are slow and memory-heavy because every value is a separate PyObject. Converting low-cardinality strings to category can cut memory by 10x and speed up groupby/joins. Pandas 2.x supports PyArrow-backed dtypes ("string[pyarrow]", "int64[pyarrow]") which are zero-copy compatible with polars and store missing values natively.
import pandas as pd
df = pd.DataFrame({"city": ["NYC"] * 1000 + ["LA"] * 500 + ["SF"] * 200})
print(f"object: {df['city'].memory_usage(deep=True)} bytes")
df["city"] = df["city"].astype("category")
print(f"category: {df['city'].memory_usage(deep=True)} bytes")
Output:
object: 99828 bytes
category: 1956 bytes
# PyArrow-backed string dtype — fast, supports NA
s = pd.Series(["Alice", None, "Bob"], dtype="string[pyarrow]")
print(s.dtype)
print(s.isna())
Output:
string[pyarrow]
0 False
1 True
2 False
dtype: bool
Pass
dtype_backend="pyarrow"topd.read_csv/pd.read_parquetto get PyArrow-backed dtypes for the whole DataFrame in one shot.
read_* performance tuning
pd.read_csv is convenient but rarely the fastest path. Speed it up by passing dtype= to skip type inference, usecols= to limit columns, parse_dates=[...] to avoid a second pass, and engine="pyarrow" for the Arrow-backed parser. For Parquet, polars scan_parquet or pyarrow.parquet.read_table are typically 2–10x faster than pandas' default.
import pandas as pd
# Slow default
df = pd.read_csv("big.csv")
# Fast: tell pandas exactly what to do
df = pd.read_csv(
"big.csv",
dtype={"id": "int32", "amount": "float32", "status": "category"},
usecols=["id", "amount", "status", "ts"],
parse_dates=["ts"],
engine="pyarrow", # uses Arrow CSV reader
dtype_backend="pyarrow",
)
For a one-time conversion to Parquet (much faster on subsequent reads):
pd.read_csv("big.csv").to_parquet("big.parquet", compression="zstd")
Merge semantics in detail
pd.merge supports five join strategies via how=: inner (default, intersection), left, right, outer (union), and cross (cartesian product). Use validate="one_to_one" / "one_to_many" etc. to catch unexpected key duplication — silent many-to-many joins are a common source of inflated row counts. The indicator=True flag adds a _merge column showing which side each row came from, invaluable for debugging.
import pandas as pd
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
right = pd.DataFrame({"id": [2, 3, 4], "salary": [80000, 95000, 70000]})
# Outer join with indicator
merged = left.merge(right, on="id", how="outer", indicator=True)
print(merged)
Output:
id name salary _merge
0 1 Alice NaN left_only
1 2 Bob 80000.0 both
2 3 Carol 95000.0 both
3 4 NaN 70000.0 right_only
# Validate one-to-one
dup = pd.DataFrame({"id": [1, 1], "x": [10, 20]})
try:
left.merge(dup, on="id", validate="one_to_one")
except pd.errors.MergeError as e:
print(f"caught: {e}")
Output:
caught: Merge keys are not unique in right dataset; not a one-to-one merge
mergematches on column values exactly — leading/trailing whitespace and dtype mismatches silently drop rows. Rundf["id"].dtypeon both sides before merging andstr.strip()string keys.
Categorical with ordered levels
A Categorical dtype with explicit ordered levels (e.g. "low" < "medium" < "high") lets you sort and compare in semantic order rather than alphabetic order. Useful for survey responses, risk ratings, and ordinal labels in plots.
import pandas as pd
df = pd.DataFrame({"risk": ["low", "high", "medium", "low", "high"]})
df["risk"] = pd.Categorical(df["risk"], categories=["low", "medium", "high"], ordered=True)
print(df.sort_values("risk"))
print()
print((df["risk"] >= "medium").tolist())
Output:
risk
0 low
3 low
2 medium
1 high
4 high
[False, True, True, False, True]
Method chaining and pipe()
Pandas operations are designed to chain — most return a new DataFrame. df.pipe(func, *args) extends this to custom functions that take a DataFrame and return a DataFrame, keeping the call site flat instead of nesting func(df, ...). Combined with .assign() for new columns, you get pure, readable pipelines that are easy to refactor.
import pandas as pd
def add_full_name(df):
return df.assign(full=df["first"] + " " + df["last"])
result = (
pd.DataFrame({"first": ["Alice", "Bob"], "last": ["Smith", "Jones"], "age": [30, 25]})
.pipe(add_full_name)
.query("age >= 28")
.assign(decade=lambda d: (d["age"] // 10) * 10)
[["full", "decade"]]
)
print(result)
Output:
full decade
0 Alice Smith 30
Window functions — rolling, expanding, ewm
rolling(window=N) produces a fixed-length moving window; expanding() accumulates from the start of the series; ewm(span=N) is an exponentially-weighted variant that gives more weight to recent observations. All three return a window object on which you call aggregations (.mean(), .std(), .quantile(0.95)).
import pandas as pd
s = pd.Series([1, 3, 2, 5, 4, 7, 6, 9, 8, 11])
print("rolling(3) mean:", s.rolling(3).mean().tolist())
print("expanding max: ", s.expanding().max().tolist())
print("ewm(span=3): ", s.ewm(span=3).mean().round(2).tolist())
Output:
rolling(3) mean: [nan, nan, 2.0, 3.333..., 3.666..., 5.333..., 5.666..., 7.333..., 7.666..., 9.333...]
expanding max: [1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0, 11.0]
ewm(span=3): [1.0, 2.33, 2.14, 3.93, 3.97, 5.71, 5.85, 7.71, 7.85, 9.78]
Pivot, crosstab, and get_dummies
pivot_table aggregates a long DataFrame into a wide matrix; crosstab is a shortcut for frequency tables (counts/proportions of two categorical variables); get_dummies one-hot encodes categorical columns into a wide indicator matrix, the standard preprocessing step before fitting a linear model in scikit-learn.
import pandas as pd
df = pd.DataFrame({
"region": ["East", "East", "West", "West", "East"],
"product": ["A", "B", "A", "B", "A"],
"amount": [100, 200, 150, 250, 120],
})
print(df.pivot_table(values="amount", index="region", columns="product", aggfunc="sum", fill_value=0))
print()
print(pd.crosstab(df["region"], df["product"]))
print()
print(pd.get_dummies(df["product"], prefix="product"))
Output:
product A B
region
East 220 200
West 150 250
product A B
region
East 2 1
West 1 1
product_A product_B
0 True False
1 False True
2 True False
3 False True
4 True False
Comparison with polars
Pandas and polars solve the same problem but with very different trade-offs.
| Concern | pandas | polars |
|---|---|---|
| Backend | NumPy / Arrow (2.x) | Arrow + Rust |
| Execution | Eager only | Eager + lazy with query optimisation |
| Threading | Single-threaded (GIL) | Multi-threaded by default |
| Memory | ~10 GB datasets uncomfortable | Streaming for > RAM datasets |
| API | Index-based (loc, iloc) + columns | Expression-based, no index |
| Mutation | SettingWithCopyWarning traps | All transforms return new frames |
| Ecosystem | Massive (matplotlib, scikit-learn integration) | Newer; growing fast |
When your pandas pipeline is slow or memory-bound, rewriting it as polars
LazyFramechained calls usually beats parallelising or chunking pandas. See polars for the full migration guide.
Common pitfalls (reference)
Chained indexing —
df[df.col > 0]["other"] = ...may not modifydf. Always usedf.loc[mask, "other"] = ....
mergerow explosion — many-to-many joins multiply rows silently. Usevalidate=to assert your expectation.
objectdtype confusion — mixed-type columns or string columns show asobject. Convert tocategoryor"string"for speed and clearer semantics.
apply(axis=1)is slow — it iterates rows in Python. Replace with vectorised expressions,df.assign(...), ornp.where.
Time-zone naivety —
pd.to_datetimereturns tz-naive timestamps by default. Always passutc=Truefor log timestamps, then convert withtz_convertfor display.
Real-world recipes
Cohort analysis: monthly retention
Group users by their first-event month, count distinct users active in each subsequent month, divide.
import pandas as pd
events = pd.DataFrame({
"user_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
"event_date": pd.to_datetime([
"2026-01-05", "2026-01-15", "2026-02-10",
"2026-01-20", "2026-03-05",
"2026-02-01", "2026-02-15", "2026-03-10", "2026-04-01",
]),
})
events["cohort"] = events.groupby("user_id")["event_date"].transform("min").dt.to_period("M")
events["period"] = events["event_date"].dt.to_period("M")
events["months_since"] = (events["period"] - events["cohort"]).apply(lambda x: x.n)
cohort = (
events.groupby(["cohort", "months_since"])["user_id"].nunique()
.unstack(fill_value=0)
)
print(cohort)
Output:
months_since 0 1 2
cohort
2026-01 2 1 1
2026-02 1 1 1
Top N per group
groupby(...).head(N) after sorting gives the top N rows per group — a one-liner replacement for window functions.
import pandas as pd
df = pd.DataFrame({
"dept": ["Eng", "Eng", "Eng", "HR", "HR"],
"name": ["A", "B", "C", "D", "E"],
"salary": [120000, 95000, 110000, 85000, 92000],
})
top2 = (
df.sort_values(["dept", "salary"], ascending=[True, False])
.groupby("dept")
.head(2)
)
print(top2)
Output:
dept name salary
0 Eng A 120000
2 Eng C 110000
4 HR E 92000
3 HR D 85000
Detect dataframe drift
Compare two snapshots of the same schema to find rows that changed.
import pandas as pd
old = pd.DataFrame({"id": [1, 2, 3], "status": ["new", "open", "closed"]})
new = pd.DataFrame({"id": [1, 2, 3], "status": ["new", "closed", "closed"]})
merged = old.merge(new, on="id", suffixes=("_old", "_new"))
changed = merged.query("status_old != status_new")
print(changed)
Output:
id status_old status_new
1 2 open closed
Pivot then chart with Streamlit
Pair pandas with Streamlit to build interactive dashboards — Streamlit's st.dataframe and st.bar_chart accept a DataFrame directly.
import pandas as pd
import streamlit as st
df = pd.read_csv("sales.csv")
pivot = df.pivot_table("amount", "region", "product", aggfunc="sum", fill_value=0)
st.bar_chart(pivot)
See also
- polars — faster, Arrow-native DataFrames with lazy execution.
- numpy — underlying array library;
.to_numpy()for zero-copy access. - matplotlib — pandas
.plot()is a thin wrapper. - scikit-learn — accepts DataFrames directly in modern versions.
- streamlit — turn a DataFrame into an interactive web UI.