cheat sheet

pandas

Load, filter, transform, and aggregate tabular data with pandas. Covers DataFrame creation, read_csv, groupby, merge, and the SettingWithCopy pitfall.

updated 04-25-2026

pandas — DataFrames

What it is

pandas provides two main structures: DataFrame (2-D labeled table) and Series (1-D labeled array). It is the standard library for reading, cleaning, transforming, and analyzing tabular data in Python.

Install

bash

pip install pandas

Output: (none — exits 0 on success)

Quick example

python

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "dept": ["Eng", "Eng", "HR"],
    "score": [92, 78, 85],
})
print(df[df["score"] > 80])

Output:

text

      name dept  score
0    Alice  Eng     92
2  Charlie   HR     85

When / why to use it

Reading CSV, Excel, JSON, or SQL tables: pd.read_csv(), pd.read_excel(), pd.read_sql().
Cleaning and reshaping data before analysis or ML.
Exploratory data analysis with df.describe(), groupby, pivot_table.
Joining multiple datasets: pd.merge().

Common pitfalls

SettingWithCopyWarning — modifying a slice of a DataFrame may not modify the original. Use .loc for assignment:
python
df.loc[df["score"] > 80, "grade"] = "A"  # ✅ correct
df[df["score"] > 80]["grade"] = "A"       # ❌ may silently do nothing

inplace=True is a trap — many operations accept inplace=True but it doesn't save memory and makes code harder to chain. Prefer reassignment: df = df.dropna().

Use df.dtypes and df.info() early to understand what you loaded. Numeric columns imported as object (string) will break aggregations silently.

Reading and writing files

pd.read_csv is the most common entry point, but pandas also reads Excel, JSON, Parquet, SQL, HTML tables, and more through the same read_* / to_* symmetry. Always pass dtype= or parse_dates= explicitly when the inferred types are wrong — silent type mismatches (e.g. numeric IDs read as object) cause subtle aggregation bugs downstream.

python

import pandas as pd

# CSV
df = pd.read_csv("data.csv")
df.to_csv("output.csv", index=False)

# Excel
df = pd.read_excel("report.xlsx", sheet_name="Sheet1")

# JSON
df = pd.read_json("records.json", orient="records")

# from a SQL query
import sqlite3
conn = sqlite3.connect("app.db")
df = pd.read_sql("SELECT * FROM users WHERE active = 1", conn)

Richer example — groupby and merge

python

import pandas as pd

sales = pd.DataFrame({
    "product": ["A", "B", "A", "B", "A"],
    "region": ["East", "East", "West", "West", "East"],
    "amount": [100, 200, 150, 250, 120],
})

# Group by region and product, aggregate
summary = (
    sales.groupby(["region", "product"])["amount"]
    .agg(total="sum", orders="count")
    .reset_index()
)
print(summary)
print()

# Merge with a product lookup table
products = pd.DataFrame({"product": ["A", "B"], "name": ["Widget", "Gadget"]})
merged = summary.merge(products, on="product")
print(merged[["region", "name", "total", "orders"]])

Output:

text

  region product  total  orders
0   East       A    220       2
1   East       B    200       1
2   West       A    150       1
3   West       B    250       1

  region    name  total  orders
0   East  Widget    220       2
1   East  Gadget    200       1
2   West  Widget    150       1
3   West  Gadget    250       1

Useful operations quick reference

Task	Code
Filter rows	`df[df["col"] > 5]`
Select columns	`df[["a", "b"]]`
Add column	`df["new"] = df["a"] * 2`
Drop rows with NaN	`df.dropna(subset=["col"])`
Fill NaN	`df["col"].fillna(0)`
Sort	`df.sort_values("col", ascending=False)`
Rename columns	`df.rename(columns={"old": "new"})`
Apply function	`df["col"].apply(lambda x: x.strip())`
Pivot	`df.pivot_table(values="sales", index="region", columns="product", aggfunc="sum")`
Value counts	`df["status"].value_counts()`
Describe	`df.describe()`
Shape	`df.shape`

Combining DataFrames

pd.concat stacks DataFrames vertically (default) or horizontally by aligning on the index; use it when the schemas are identical. pd.merge performs SQL-style joins on one or more key columns and is the right tool when the two tables have different shapes and a shared key. DataFrame.join is a convenience wrapper around merge that joins on the index by default.

python

import pandas as pd

# concat — stack vertically (same columns)
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:

text

python

# concat — side by side (same rows)
result = pd.concat([df1, df2], axis=1)
print(result)

Output:

text

   A  B  A  B
0  1  3  5  7
1  2  4  6  8

python

# merge — inner join (default)
employees = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
salaries = pd.DataFrame({"id": [1, 2, 4], "salary": [75000, 82000, 91000]})
result = pd.merge(employees, salaries, on="id")
print(result)

Output:

text

   id   name  salary
0   1  Alice   75000
1   2    Bob   82000

python

# merge — left join (keep all from left)
result = pd.merge(employees, salaries, on="id", how="left")
print(result)

Output:

text

   id   name   salary
0   1  Alice  75000.0
1   2    Bob  82000.0
2   3  Carol      NaN

python

# merge — outer join (keep all rows from both)
result = pd.merge(employees, salaries, on="id", how="outer")
print(result)

Output:

text

   id   name   salary
0   1  Alice  75000.0
1   2    Bob  82000.0
2   3  Carol      NaN
3   4    NaN  91000.0

Reshaping

melt converts wide format (one column per variable) to long format (one row per observation), which most statistical and plotting libraries expect. pivot_table goes the other way, aggregating a long DataFrame into a wide summary matrix. stack / unstack move column levels to the row index and back, useful for multi-level indexed data.

python

# melt — wide to long
df = pd.DataFrame({"name": ["Alice", "Bob"], "math": [90, 85], "english": [88, 92]})
melted = df.melt(id_vars="name", var_name="subject", value_name="score")
print(melted)

Output:

text

    name  subject  score
0  Alice     math     90
1    Bob     math     85
2  Alice  english     88
3    Bob  english     92

python

# pivot_table — long to wide with aggregation
print(pd.pivot_table(melted, values="score", index="name", columns="subject"))

Output:

text

subject  english  math
name                  
Alice         88    90
Bob           92    85

python

# stack / unstack — move column level to row index
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])
print(df.stack())

Output:

text

x  A    1
   B    3
y  A    2
   B    4
dtype: int64

DateTime handling

pd.to_datetime converts string or numeric columns to datetime64[ns] dtype, enabling the .dt accessor for extracting components (.dt.year, .dt.day_of_week) and time-aware operations. Set the column as the index to unlock resample() for period-based aggregation (hourly, daily, monthly) and rolling() for moving-window statistics.

python

# Parse dates
df = pd.DataFrame({"date": ["2026-01-01", "2026-01-15", "2026-02-01"], "value": [10, 25, 18]})
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes)

Output:

text

date     datetime64[ns]
value             int64
dtype: object

python

# dt accessor — extract components
print(df["date"].dt.year.tolist())
print(df["date"].dt.month_name().tolist())
print(df["date"].dt.day_of_week.tolist())   # 0=Monday

Output:

text

[2026, 2026, 2026]
['January', 'January', 'February']
[3, 3, 6]

python

# resample — aggregate by time period
df = df.set_index("date")
print(df.resample("ME").sum())   # month-end

Output:

text

            value
date             
2026-01-31     35
2026-02-28     18

python

# rolling window
df["rolling_2"] = df["value"].rolling(window=2).mean()
print(df)

Output:

text

            value  rolling_2
date                        
2026-01-01     10        NaN
2026-01-15     25       17.5
2026-02-01     18       21.5

String accessor

The .str accessor exposes vectorized string methods on object-dtype (or StringDtype) Series, avoiding explicit .apply(lambda x: ...) loops. Most Python str methods are available as .str.<method>(), plus regex support (str.contains, str.extract, str.replace with regex=True) and list-like indexing after str.split.

python

s = pd.Series(["  Alice Smith  ", "bob jones", "CAROL WHITE"])

print(s.str.strip())
print(s.str.lower())
print(s.str.upper())
print(s.str.title())
print(s.str.contains("jones", case=False))
print(s.str.replace(r"\s+", "_", regex=True))
print(s.str.split(" ").str[0])   # first word
print(s.str.len())

Output:

text

0    Alice Smith
1      bob jones
2    CAROL WHITE
dtype: object

0    alice smith
1      bob jones
2    carol white
dtype: object

0      ALICE SMITH
1        BOB JONES
2      CAROL WHITE
dtype: object

0      Alice Smith
1        Bob Jones
2      Carol White
dtype: object

0    False
1     True
2    False
dtype: bool

0    Alice_Smith
1      bob_jones
2    CAROL_WHITE
dtype: object

0     Alice
1       bob
2     CAROL
dtype: object

0    15
1     9
2    11
dtype: int64

apply vs transform vs agg

In a groupby context: agg reduces each group to a scalar (or set of scalars) and is the right choice for summary statistics. transform returns a Series with the same length as the original DataFrame, enabling group-level values (like a group mean) to be added back as a new column. apply is the escape hatch — it passes each group as a DataFrame to an arbitrary function, making it the most flexible but also the slowest.

python

df = pd.DataFrame({"dept": ["Eng", "Eng", "Ops"], "salary": [75000, 82000, 68000]})

# apply — custom function per group (returns reduced result)
print(df.groupby("dept")["salary"].apply(lambda x: x.max() - x.min()))

Output:

text

dept
Eng    7000
Ops       0
Name: salary, dtype: int64

python

# transform — returns same-shape result (for feature engineering)
df["dept_avg"] = df.groupby("dept")["salary"].transform("mean")
print(df)

Output:

text

  dept  salary  dept_avg
0  Eng   75000   78500.0
1  Eng   82000   78500.0
2  Ops   68000   68000.0

python

# agg — multiple aggregations at once
print(df.groupby("dept")["salary"].agg(["mean", "min", "max", "count"]))

Output:

text

        mean    min    max  count
dept                             
Eng   78500  75000  82000      2
Ops   68000  68000  68000      1

Nullable dtypes and Parquet I/O

Pandas' nullable extension types (Int64, Float64, StringDtype, boolean) use pd.NA as the missing-value sentinel instead of float('nan'), which means integer and boolean columns can represent missingness without being upcast to float. Parquet preserves these dtypes on round-trip (unlike CSV), making it the preferred format when schema fidelity and file size matter.

python

# Nullable integer (can hold NaN unlike int64)
s = pd.array([1, 2, None, 4], dtype="Int64")
print(pd.Series(s))

Output:

text

0       1
1       2
2    <NA>
3       4
dtype: Int64

python

# Parquet round-trip
df.to_parquet("data.parquet", index=False)
df2 = pd.read_parquet("data.parquet")
print(df2.dtypes)

Output:

text

dept      object
salary     int64
dtype: object

groupby internals

groupby is a two-step pipeline: a split phase that builds an index of group labels to row positions, and an apply-combine phase that runs an aggregation, transformation, or filter and stitches the results back together. The split is lazy — df.groupby("k") doesn't compute anything; the real work happens when you call .agg(), .transform(), .apply(), or .filter(). Knowing which one you need has a huge impact on memory and speed.

python

import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C"],
    "score": [10, 15, 8, 12, 20],
    "minutes": [40, 35, 30, 45, 50],
})

g = df.groupby("team")

# 1. agg: scalar per group → smaller result
print(g.agg(total_score=("score", "sum"), avg_min=("minutes", "mean")))

Output:

text

      total_score  avg_min
team                      
A              25     37.5
B              20     37.5
C              20     50.0

python

# 2. transform: same-shape result aligned to the original rows
df["team_avg"] = g["score"].transform("mean")
print(df)

Output:

text

  team  score  minutes  team_avg
0    A     10       40      12.5
1    A     15       35      12.5
2    B      8       30      10.0
3    B     12       45      10.0
4    C     20       50      20.0

python

# 3. filter: keep entire groups that pass a test
print(g.filter(lambda x: x["score"].mean() > 11))

Output:

text

  team  score  minutes  team_avg
0    A     10       40      12.5
1    A     15       35      12.5
4    C     20       50      20.0

Use named aggregations (agg(total=("col", "sum"))) rather than passing a dict — the result has flat column names and is much easier to chain.

observed=True on a categorical groupby skips empty combinations. In pandas 2.x the default flips to True; setting it explicitly silences a deprecation warning and avoids surprise empty rows.

Multi-index DataFrames

A MultiIndex (hierarchical index) stores multiple labels per row position, useful after a groupby aggregation or set_index on two or more columns. You can select a single level with .xs("Eng", level="dept"), an entire branch with df.loc[("Eng", "Alice")], or all entries under a top-level value with df.loc["Eng"]. To go back to flat columns call .reset_index().

python

import pandas as pd

df = pd.DataFrame({
    "dept": ["Eng", "Eng", "HR", "HR"],
    "role": ["Manager", "IC", "Manager", "IC"],
    "headcount": [5, 30, 2, 8],
}).set_index(["dept", "role"])

print(df)
print()
print(df.loc["Eng"])               # all Eng rows
print()
print(df.xs("Manager", level="role"))  # Managers across all depts

Output:

text

              headcount
dept role              
Eng  Manager          5
     IC              30
HR   Manager          2
     IC               8

         headcount
role              
Manager          5
IC              30

      headcount
dept           
Eng           5
HR            2

Sort a MultiIndex with df.sort_index() before slicing — unsorted multi-indexes raise PerformanceWarning and can produce wrong slices in loc.

query() and eval() for readable filters

df.query("col > 5 and name == 'Alice'") uses a pandas-flavoured expression language with implicit column references, making complex filters more readable than chained boolean masks. df.eval("revenue = price * qty") is the same idea for column creation. Both compile the expression with numexpr when available, which can be 2–5x faster than the pure-Python equivalent on large DataFrames.

python

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "dept": ["Eng", "Eng", "HR"],
    "salary": [120000, 85000, 95000],
})

# Reference an external variable with @
threshold = 100000
print(df.query("salary > @threshold and dept == 'Eng'"))

Output:

text

    name dept  salary
0  Alice  Eng  120000

python

df = df.eval("bonus = salary * 0.10")
print(df)

Output:

text

    name dept  salary    bonus
0  Alice  Eng  120000  12000.0
1    Bob  Eng   85000   8500.0
2  Carol   HR   95000   9500.0

Dtype optimisation — category and PyArrow

Object-dtype columns (Python strings) are slow and memory-heavy because every value is a separate PyObject. Converting low-cardinality strings to category can cut memory by 10x and speed up groupby/joins. Pandas 2.x supports PyArrow-backed dtypes ("string[pyarrow]", "int64[pyarrow]") which are zero-copy compatible with polars and store missing values natively.

python

import pandas as pd

df = pd.DataFrame({"city": ["NYC"] * 1000 + ["LA"] * 500 + ["SF"] * 200})
print(f"object: {df['city'].memory_usage(deep=True)} bytes")

df["city"] = df["city"].astype("category")
print(f"category: {df['city'].memory_usage(deep=True)} bytes")

Output:

text

object: 99828 bytes
category: 1956 bytes

python

# PyArrow-backed string dtype — fast, supports NA
s = pd.Series(["Alice", None, "Bob"], dtype="string[pyarrow]")
print(s.dtype)
print(s.isna())

Output:

text

string[pyarrow]
0    False
1     True
2    False
dtype: bool

Pass dtype_backend="pyarrow" to pd.read_csv / pd.read_parquet to get PyArrow-backed dtypes for the whole DataFrame in one shot.

read_* performance tuning

pd.read_csv is convenient but rarely the fastest path. Speed it up by passing dtype= to skip type inference, usecols= to limit columns, parse_dates=[...] to avoid a second pass, and engine="pyarrow" for the Arrow-backed parser. For Parquet, polars scan_parquet or pyarrow.parquet.read_table are typically 2–10x faster than pandas' default.

python

import pandas as pd

# Slow default
df = pd.read_csv("big.csv")

# Fast: tell pandas exactly what to do
df = pd.read_csv(
    "big.csv",
    dtype={"id": "int32", "amount": "float32", "status": "category"},
    usecols=["id", "amount", "status", "ts"],
    parse_dates=["ts"],
    engine="pyarrow",                  # uses Arrow CSV reader
    dtype_backend="pyarrow",
)

For a one-time conversion to Parquet (much faster on subsequent reads):

python

pd.read_csv("big.csv").to_parquet("big.parquet", compression="zstd")

Merge semantics in detail

pd.merge supports five join strategies via how=: inner (default, intersection), left, right, outer (union), and cross (cartesian product). Use validate="one_to_one" / "one_to_many" etc. to catch unexpected key duplication — silent many-to-many joins are a common source of inflated row counts. The indicator=True flag adds a _merge column showing which side each row came from, invaluable for debugging.

python

import pandas as pd

left  = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
right = pd.DataFrame({"id": [2, 3, 4], "salary": [80000, 95000, 70000]})

# Outer join with indicator
merged = left.merge(right, on="id", how="outer", indicator=True)
print(merged)

Output:

text

   id   name   salary      _merge
0   1  Alice      NaN   left_only
1   2    Bob  80000.0        both
2   3  Carol  95000.0        both
3   4    NaN  70000.0  right_only

python

# Validate one-to-one
dup = pd.DataFrame({"id": [1, 1], "x": [10, 20]})
try:
    left.merge(dup, on="id", validate="one_to_one")
except pd.errors.MergeError as e:
    print(f"caught: {e}")

Output:

text

caught: Merge keys are not unique in right dataset; not a one-to-one merge

merge matches on column values exactly — leading/trailing whitespace and dtype mismatches silently drop rows. Run df["id"].dtype on both sides before merging and str.strip() string keys.

Categorical with ordered levels

A Categorical dtype with explicit ordered levels (e.g. "low" < "medium" < "high") lets you sort and compare in semantic order rather than alphabetic order. Useful for survey responses, risk ratings, and ordinal labels in plots.

python

import pandas as pd

df = pd.DataFrame({"risk": ["low", "high", "medium", "low", "high"]})
df["risk"] = pd.Categorical(df["risk"], categories=["low", "medium", "high"], ordered=True)

print(df.sort_values("risk"))
print()
print((df["risk"] >= "medium").tolist())

Output:

text

     risk
0     low
3     low
2  medium
1    high
4    high

[False, True, True, False, True]

Method chaining and pipe()

Pandas operations are designed to chain — most return a new DataFrame. df.pipe(func, *args) extends this to custom functions that take a DataFrame and return a DataFrame, keeping the call site flat instead of nesting func(df, ...). Combined with .assign() for new columns, you get pure, readable pipelines that are easy to refactor.

python

import pandas as pd

def add_full_name(df):
    return df.assign(full=df["first"] + " " + df["last"])

result = (
    pd.DataFrame({"first": ["Alice", "Bob"], "last": ["Smith", "Jones"], "age": [30, 25]})
    .pipe(add_full_name)
    .query("age >= 28")
    .assign(decade=lambda d: (d["age"] // 10) * 10)
    [["full", "decade"]]
)
print(result)

Output:

text

          full  decade
0  Alice Smith      30

Window functions — rolling, expanding, ewm

rolling(window=N) produces a fixed-length moving window; expanding() accumulates from the start of the series; ewm(span=N) is an exponentially-weighted variant that gives more weight to recent observations. All three return a window object on which you call aggregations (.mean(), .std(), .quantile(0.95)).

python

import pandas as pd

s = pd.Series([1, 3, 2, 5, 4, 7, 6, 9, 8, 11])
print("rolling(3) mean:", s.rolling(3).mean().tolist())
print("expanding max: ", s.expanding().max().tolist())
print("ewm(span=3):   ", s.ewm(span=3).mean().round(2).tolist())

Output:

text

rolling(3) mean: [nan, nan, 2.0, 3.333..., 3.666..., 5.333..., 5.666..., 7.333..., 7.666..., 9.333...]
expanding max:  [1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0, 11.0]
ewm(span=3):    [1.0, 2.33, 2.14, 3.93, 3.97, 5.71, 5.85, 7.71, 7.85, 9.78]

Pivot, crosstab, and get_dummies

pivot_table aggregates a long DataFrame into a wide matrix; crosstab is a shortcut for frequency tables (counts/proportions of two categorical variables); get_dummies one-hot encodes categorical columns into a wide indicator matrix, the standard preprocessing step before fitting a linear model in scikit-learn.

python

import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "amount": [100, 200, 150, 250, 120],
})

print(df.pivot_table(values="amount", index="region", columns="product", aggfunc="sum", fill_value=0))
print()
print(pd.crosstab(df["region"], df["product"]))
print()
print(pd.get_dummies(df["product"], prefix="product"))

Output:

text

product    A    B
region            
East     220  200
West     150  250

product  A  B
region        
East     2  1
West     1  1

   product_A  product_B
0       True      False
1      False       True
2       True      False
3      False       True
4       True      False

Comparison with polars

Pandas and polars solve the same problem but with very different trade-offs.

Concern	pandas	polars
Backend	NumPy / Arrow (2.x)	Arrow + Rust
Execution	Eager only	Eager + lazy with query optimisation
Threading	Single-threaded (GIL)	Multi-threaded by default
Memory	~10 GB datasets uncomfortable	Streaming for > RAM datasets
API	Index-based (`loc`, `iloc`) + columns	Expression-based, no index
Mutation	`SettingWithCopyWarning` traps	All transforms return new frames
Ecosystem	Massive (matplotlib, scikit-learn integration)	Newer; growing fast

When your pandas pipeline is slow or memory-bound, rewriting it as polars LazyFrame chained calls usually beats parallelising or chunking pandas. See polars for the full migration guide.

Common pitfalls (reference)

Chained indexing — df[df.col > 0]["other"] = ... may not modify df. Always use df.loc[mask, "other"] = ....

merge row explosion — many-to-many joins multiply rows silently. Use validate= to assert your expectation.

object dtype confusion — mixed-type columns or string columns show as object. Convert to category or "string" for speed and clearer semantics.

apply(axis=1) is slow — it iterates rows in Python. Replace with vectorised expressions, df.assign(...), or np.where.

Time-zone naivety — pd.to_datetime returns tz-naive timestamps by default. Always pass utc=True for log timestamps, then convert with tz_convert for display.

Real-world recipes

Cohort analysis: monthly retention

Group users by their first-event month, count distinct users active in each subsequent month, divide.

python

import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "event_date": pd.to_datetime([
        "2026-01-05", "2026-01-15", "2026-02-10",
        "2026-01-20", "2026-03-05",
        "2026-02-01", "2026-02-15", "2026-03-10", "2026-04-01",
    ]),
})

events["cohort"] = events.groupby("user_id")["event_date"].transform("min").dt.to_period("M")
events["period"] = events["event_date"].dt.to_period("M")
events["months_since"] = (events["period"] - events["cohort"]).apply(lambda x: x.n)

cohort = (
    events.groupby(["cohort", "months_since"])["user_id"].nunique()
    .unstack(fill_value=0)
)
print(cohort)

Output:

text

months_since  0  1  2
cohort               
2026-01       2  1  1
2026-02       1  1  1

Top N per group

groupby(...).head(N) after sorting gives the top N rows per group — a one-liner replacement for window functions.

python

import pandas as pd

df = pd.DataFrame({
    "dept": ["Eng", "Eng", "Eng", "HR", "HR"],
    "name": ["A", "B", "C", "D", "E"],
    "salary": [120000, 95000, 110000, 85000, 92000],
})

top2 = (
    df.sort_values(["dept", "salary"], ascending=[True, False])
    .groupby("dept")
    .head(2)
)
print(top2)

Output:

text

  dept name  salary
0  Eng    A  120000
2  Eng    C  110000
4   HR    E   92000
3   HR    D   85000

Detect dataframe drift

Compare two snapshots of the same schema to find rows that changed.

python

import pandas as pd

old = pd.DataFrame({"id": [1, 2, 3], "status": ["new", "open", "closed"]})
new = pd.DataFrame({"id": [1, 2, 3], "status": ["new", "closed", "closed"]})

merged = old.merge(new, on="id", suffixes=("_old", "_new"))
changed = merged.query("status_old != status_new")
print(changed)

Output:

text

   id status_old status_new
1   2       open     closed

Pivot then chart with Streamlit

Pair pandas with Streamlit to build interactive dashboards — Streamlit's st.dataframe and st.bar_chart accept a DataFrame directly.

python

import pandas as pd
import streamlit as st

df = pd.read_csv("sales.csv")
pivot = df.pivot_table("amount", "region", "product", aggfunc="sum", fill_value=0)
st.bar_chart(pivot)

pandas — DataFrames

What it is

Install

Quick example

When / why to use it

Common pitfalls

Reading and writing files

Richer example — groupby and merge

Useful operations quick reference

Combining DataFrames

Reshaping

DateTime handling

String accessor

apply vs transform vs agg

Nullable dtypes and Parquet I/O

groupby internals

Multi-index DataFrames

query() and eval() for readable filters

Dtype optimisation — category and PyArrow

read_* performance tuning

Merge semantics in detail

Categorical with ordered levels

Method chaining and pipe()

Window functions — rolling, expanding, ewm

Pivot, crosstab, and get_dummies

Comparison with polars

Common pitfalls (reference)

Real-world recipes

Cohort analysis: monthly retention

Top N per group

Detect dataframe drift

Pivot then chart with Streamlit

See also