cheat sheet

pandas

Load, filter, transform, and aggregate tabular data with pandas. Covers DataFrame creation, read_csv, groupby, merge, and the SettingWithCopy pitfall.

pandas — DataFrames

What it is

pandas provides two main structures: DataFrame (2-D labeled table) and Series (1-D labeled array). It is the standard library for reading, cleaning, transforming, and analyzing tabular data in Python.

Install

bash
pip install pandas

Output: (none — exits 0 on success)

Quick example

python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Charlie"],
    "dept": ["Eng", "Eng", "HR"],
    "score": [92, 78, 85],
})
print(df[df["score"] > 80])

Output:

text
      name dept  score
0    Alice  Eng     92
2  Charlie   HR     85

When / why to use it

  • Reading CSV, Excel, JSON, or SQL tables: pd.read_csv(), pd.read_excel(), pd.read_sql().
  • Cleaning and reshaping data before analysis or ML.
  • Exploratory data analysis with df.describe(), groupby, pivot_table.
  • Joining multiple datasets: pd.merge().

Common pitfalls

SettingWithCopyWarning — modifying a slice of a DataFrame may not modify the original. Use .loc for assignment:

python
df.loc[df["score"] > 80, "grade"] = "A"  # ✅ correct
df[df["score"] > 80]["grade"] = "A"       # ❌ may silently do nothing

inplace=True is a trap — many operations accept inplace=True but it doesn't save memory and makes code harder to chain. Prefer reassignment: df = df.dropna().

Use df.dtypes and df.info() early to understand what you loaded. Numeric columns imported as object (string) will break aggregations silently.

Reading and writing files

pd.read_csv is the most common entry point, but pandas also reads Excel, JSON, Parquet, SQL, HTML tables, and more through the same read_* / to_* symmetry. Always pass dtype= or parse_dates= explicitly when the inferred types are wrong — silent type mismatches (e.g. numeric IDs read as object) cause subtle aggregation bugs downstream.

python
import pandas as pd

# CSV
df = pd.read_csv("data.csv")
df.to_csv("output.csv", index=False)

# Excel
df = pd.read_excel("report.xlsx", sheet_name="Sheet1")

# JSON
df = pd.read_json("records.json", orient="records")

# from a SQL query
import sqlite3
conn = sqlite3.connect("app.db")
df = pd.read_sql("SELECT * FROM users WHERE active = 1", conn)

Richer example — groupby and merge

python
import pandas as pd

sales = pd.DataFrame({
    "product": ["A", "B", "A", "B", "A"],
    "region": ["East", "East", "West", "West", "East"],
    "amount": [100, 200, 150, 250, 120],
})

# Group by region and product, aggregate
summary = (
    sales.groupby(["region", "product"])["amount"]
    .agg(total="sum", orders="count")
    .reset_index()
)
print(summary)
print()

# Merge with a product lookup table
products = pd.DataFrame({"product": ["A", "B"], "name": ["Widget", "Gadget"]})
merged = summary.merge(products, on="product")
print(merged[["region", "name", "total", "orders"]])

Output:

text
  region product  total  orders
0   East       A    220       2
1   East       B    200       1
2   West       A    150       1
3   West       B    250       1

  region    name  total  orders
0   East  Widget    220       2
1   East  Gadget    200       1
2   West  Widget    150       1
3   West  Gadget    250       1

Useful operations quick reference

TaskCode
Filter rowsdf[df["col"] > 5]
Select columnsdf[["a", "b"]]
Add columndf["new"] = df["a"] * 2
Drop rows with NaNdf.dropna(subset=["col"])
Fill NaNdf["col"].fillna(0)
Sortdf.sort_values("col", ascending=False)
Rename columnsdf.rename(columns={"old": "new"})
Apply functiondf["col"].apply(lambda x: x.strip())
Pivotdf.pivot_table(values="sales", index="region", columns="product", aggfunc="sum")
Value countsdf["status"].value_counts()
Describedf.describe()
Shapedf.shape

Combining DataFrames

pd.concat stacks DataFrames vertically (default) or horizontally by aligning on the index; use it when the schemas are identical. pd.merge performs SQL-style joins on one or more key columns and is the right tool when the two tables have different shapes and a shared key. DataFrame.join is a convenience wrapper around merge that joins on the index by default.

python
import pandas as pd

# concat — stack vertically (same columns)
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df2 = pd.DataFrame({"A": [5, 6], "B": [7, 8]})
result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:

text
   A  B
0  1  3
1  2  4
2  5  7
3  6  8
python
# concat — side by side (same rows)
result = pd.concat([df1, df2], axis=1)
print(result)

Output:

text
   A  B  A  B
0  1  3  5  7
1  2  4  6  8
python
# merge — inner join (default)
employees = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
salaries = pd.DataFrame({"id": [1, 2, 4], "salary": [75000, 82000, 91000]})
result = pd.merge(employees, salaries, on="id")
print(result)

Output:

text
   id   name  salary
0   1  Alice   75000
1   2    Bob   82000
python
# merge — left join (keep all from left)
result = pd.merge(employees, salaries, on="id", how="left")
print(result)

Output:

text
   id   name   salary
0   1  Alice  75000.0
1   2    Bob  82000.0
2   3  Carol      NaN
python
# merge — outer join (keep all rows from both)
result = pd.merge(employees, salaries, on="id", how="outer")
print(result)

Output:

text
   id   name   salary
0   1  Alice  75000.0
1   2    Bob  82000.0
2   3  Carol      NaN
3   4    NaN  91000.0

Reshaping

melt converts wide format (one column per variable) to long format (one row per observation), which most statistical and plotting libraries expect. pivot_table goes the other way, aggregating a long DataFrame into a wide summary matrix. stack / unstack move column levels to the row index and back, useful for multi-level indexed data.

python
# melt — wide to long
df = pd.DataFrame({"name": ["Alice", "Bob"], "math": [90, 85], "english": [88, 92]})
melted = df.melt(id_vars="name", var_name="subject", value_name="score")
print(melted)

Output:

text
    name  subject  score
0  Alice     math     90
1    Bob     math     85
2  Alice  english     88
3    Bob  english     92
python
# pivot_table — long to wide with aggregation
print(pd.pivot_table(melted, values="score", index="name", columns="subject"))

Output:

text
subject  english  math
name                  
Alice         88    90
Bob           92    85
python
# stack / unstack — move column level to row index
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["x", "y"])
print(df.stack())

Output:

text
x  A    1
   B    3
y  A    2
   B    4
dtype: int64

DateTime handling

pd.to_datetime converts string or numeric columns to datetime64[ns] dtype, enabling the .dt accessor for extracting components (.dt.year, .dt.day_of_week) and time-aware operations. Set the column as the index to unlock resample() for period-based aggregation (hourly, daily, monthly) and rolling() for moving-window statistics.

python
# Parse dates
df = pd.DataFrame({"date": ["2026-01-01", "2026-01-15", "2026-02-01"], "value": [10, 25, 18]})
df["date"] = pd.to_datetime(df["date"])
print(df.dtypes)

Output:

text
date     datetime64[ns]
value             int64
dtype: object
python
# dt accessor — extract components
print(df["date"].dt.year.tolist())
print(df["date"].dt.month_name().tolist())
print(df["date"].dt.day_of_week.tolist())   # 0=Monday

Output:

text
[2026, 2026, 2026]
['January', 'January', 'February']
[3, 3, 6]
python
# resample — aggregate by time period
df = df.set_index("date")
print(df.resample("ME").sum())   # month-end

Output:

text
            value
date             
2026-01-31     35
2026-02-28     18
python
# rolling window
df["rolling_2"] = df["value"].rolling(window=2).mean()
print(df)

Output:

text
            value  rolling_2
date                        
2026-01-01     10        NaN
2026-01-15     25       17.5
2026-02-01     18       21.5

String accessor

The .str accessor exposes vectorized string methods on object-dtype (or StringDtype) Series, avoiding explicit .apply(lambda x: ...) loops. Most Python str methods are available as .str.<method>(), plus regex support (str.contains, str.extract, str.replace with regex=True) and list-like indexing after str.split.

python
s = pd.Series(["  Alice Smith  ", "bob jones", "CAROL WHITE"])

print(s.str.strip())
print(s.str.lower())
print(s.str.upper())
print(s.str.title())
print(s.str.contains("jones", case=False))
print(s.str.replace(r"\s+", "_", regex=True))
print(s.str.split(" ").str[0])   # first word
print(s.str.len())

Output:

text
0    Alice Smith
1      bob jones
2    CAROL WHITE
dtype: object

0    alice smith
1      bob jones
2    carol white
dtype: object

0      ALICE SMITH
1        BOB JONES
2      CAROL WHITE
dtype: object

0      Alice Smith
1        Bob Jones
2      Carol White
dtype: object

0    False
1     True
2    False
dtype: bool

0    Alice_Smith
1      bob_jones
2    CAROL_WHITE
dtype: object

0     Alice
1       bob
2     CAROL
dtype: object

0    15
1     9
2    11
dtype: int64

apply vs transform vs agg

In a groupby context: agg reduces each group to a scalar (or set of scalars) and is the right choice for summary statistics. transform returns a Series with the same length as the original DataFrame, enabling group-level values (like a group mean) to be added back as a new column. apply is the escape hatch — it passes each group as a DataFrame to an arbitrary function, making it the most flexible but also the slowest.

python
df = pd.DataFrame({"dept": ["Eng", "Eng", "Ops"], "salary": [75000, 82000, 68000]})

# apply — custom function per group (returns reduced result)
print(df.groupby("dept")["salary"].apply(lambda x: x.max() - x.min()))

Output:

text
dept
Eng    7000
Ops       0
Name: salary, dtype: int64
python
# transform — returns same-shape result (for feature engineering)
df["dept_avg"] = df.groupby("dept")["salary"].transform("mean")
print(df)

Output:

text
  dept  salary  dept_avg
0  Eng   75000   78500.0
1  Eng   82000   78500.0
2  Ops   68000   68000.0
python
# agg — multiple aggregations at once
print(df.groupby("dept")["salary"].agg(["mean", "min", "max", "count"]))

Output:

text
        mean    min    max  count
dept                             
Eng   78500  75000  82000      2
Ops   68000  68000  68000      1

Nullable dtypes and Parquet I/O

Pandas' nullable extension types (Int64, Float64, StringDtype, boolean) use pd.NA as the missing-value sentinel instead of float('nan'), which means integer and boolean columns can represent missingness without being upcast to float. Parquet preserves these dtypes on round-trip (unlike CSV), making it the preferred format when schema fidelity and file size matter.

python
# Nullable integer (can hold NaN unlike int64)
s = pd.array([1, 2, None, 4], dtype="Int64")
print(pd.Series(s))

Output:

text
0       1
1       2
2    <NA>
3       4
dtype: Int64
python
# Parquet round-trip
df.to_parquet("data.parquet", index=False)
df2 = pd.read_parquet("data.parquet")
print(df2.dtypes)

Output:

text
dept      object
salary     int64
dtype: object

groupby internals

groupby is a two-step pipeline: a split phase that builds an index of group labels to row positions, and an apply-combine phase that runs an aggregation, transformation, or filter and stitches the results back together. The split is lazy — df.groupby("k") doesn't compute anything; the real work happens when you call .agg(), .transform(), .apply(), or .filter(). Knowing which one you need has a huge impact on memory and speed.

python
import pandas as pd

df = pd.DataFrame({
    "team": ["A", "A", "B", "B", "C"],
    "score": [10, 15, 8, 12, 20],
    "minutes": [40, 35, 30, 45, 50],
})

g = df.groupby("team")

# 1. agg: scalar per group → smaller result
print(g.agg(total_score=("score", "sum"), avg_min=("minutes", "mean")))

Output:

text
      total_score  avg_min
team                      
A              25     37.5
B              20     37.5
C              20     50.0
python
# 2. transform: same-shape result aligned to the original rows
df["team_avg"] = g["score"].transform("mean")
print(df)

Output:

text
  team  score  minutes  team_avg
0    A     10       40      12.5
1    A     15       35      12.5
2    B      8       30      10.0
3    B     12       45      10.0
4    C     20       50      20.0
python
# 3. filter: keep entire groups that pass a test
print(g.filter(lambda x: x["score"].mean() > 11))

Output:

text
  team  score  minutes  team_avg
0    A     10       40      12.5
1    A     15       35      12.5
4    C     20       50      20.0

Use named aggregations (agg(total=("col", "sum"))) rather than passing a dict — the result has flat column names and is much easier to chain.

observed=True on a categorical groupby skips empty combinations. In pandas 2.x the default flips to True; setting it explicitly silences a deprecation warning and avoids surprise empty rows.

Multi-index DataFrames

A MultiIndex (hierarchical index) stores multiple labels per row position, useful after a groupby aggregation or set_index on two or more columns. You can select a single level with .xs("Eng", level="dept"), an entire branch with df.loc[("Eng", "Alice")], or all entries under a top-level value with df.loc["Eng"]. To go back to flat columns call .reset_index().

python
import pandas as pd

df = pd.DataFrame({
    "dept": ["Eng", "Eng", "HR", "HR"],
    "role": ["Manager", "IC", "Manager", "IC"],
    "headcount": [5, 30, 2, 8],
}).set_index(["dept", "role"])

print(df)
print()
print(df.loc["Eng"])               # all Eng rows
print()
print(df.xs("Manager", level="role"))  # Managers across all depts

Output:

text
              headcount
dept role              
Eng  Manager          5
     IC              30
HR   Manager          2
     IC               8

         headcount
role              
Manager          5
IC              30

      headcount
dept           
Eng           5
HR            2

Sort a MultiIndex with df.sort_index() before slicing — unsorted multi-indexes raise PerformanceWarning and can produce wrong slices in loc.

query() and eval() for readable filters

df.query("col > 5 and name == 'Alice'") uses a pandas-flavoured expression language with implicit column references, making complex filters more readable than chained boolean masks. df.eval("revenue = price * qty") is the same idea for column creation. Both compile the expression with numexpr when available, which can be 2–5x faster than the pure-Python equivalent on large DataFrames.

python
import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "dept": ["Eng", "Eng", "HR"],
    "salary": [120000, 85000, 95000],
})

# Reference an external variable with @
threshold = 100000
print(df.query("salary > @threshold and dept == 'Eng'"))

Output:

text
    name dept  salary
0  Alice  Eng  120000
python
df = df.eval("bonus = salary * 0.10")
print(df)

Output:

text
    name dept  salary    bonus
0  Alice  Eng  120000  12000.0
1    Bob  Eng   85000   8500.0
2  Carol   HR   95000   9500.0

Dtype optimisation — category and PyArrow

Object-dtype columns (Python strings) are slow and memory-heavy because every value is a separate PyObject. Converting low-cardinality strings to category can cut memory by 10x and speed up groupby/joins. Pandas 2.x supports PyArrow-backed dtypes ("string[pyarrow]", "int64[pyarrow]") which are zero-copy compatible with polars and store missing values natively.

python
import pandas as pd

df = pd.DataFrame({"city": ["NYC"] * 1000 + ["LA"] * 500 + ["SF"] * 200})
print(f"object: {df['city'].memory_usage(deep=True)} bytes")

df["city"] = df["city"].astype("category")
print(f"category: {df['city'].memory_usage(deep=True)} bytes")

Output:

text
object: 99828 bytes
category: 1956 bytes
python
# PyArrow-backed string dtype — fast, supports NA
s = pd.Series(["Alice", None, "Bob"], dtype="string[pyarrow]")
print(s.dtype)
print(s.isna())

Output:

text
string[pyarrow]
0    False
1     True
2    False
dtype: bool

Pass dtype_backend="pyarrow" to pd.read_csv / pd.read_parquet to get PyArrow-backed dtypes for the whole DataFrame in one shot.

read_* performance tuning

pd.read_csv is convenient but rarely the fastest path. Speed it up by passing dtype= to skip type inference, usecols= to limit columns, parse_dates=[...] to avoid a second pass, and engine="pyarrow" for the Arrow-backed parser. For Parquet, polars scan_parquet or pyarrow.parquet.read_table are typically 2–10x faster than pandas' default.

python
import pandas as pd

# Slow default
df = pd.read_csv("big.csv")

# Fast: tell pandas exactly what to do
df = pd.read_csv(
    "big.csv",
    dtype={"id": "int32", "amount": "float32", "status": "category"},
    usecols=["id", "amount", "status", "ts"],
    parse_dates=["ts"],
    engine="pyarrow",                  # uses Arrow CSV reader
    dtype_backend="pyarrow",
)

For a one-time conversion to Parquet (much faster on subsequent reads):

python
pd.read_csv("big.csv").to_parquet("big.parquet", compression="zstd")

Merge semantics in detail

pd.merge supports five join strategies via how=: inner (default, intersection), left, right, outer (union), and cross (cartesian product). Use validate="one_to_one" / "one_to_many" etc. to catch unexpected key duplication — silent many-to-many joins are a common source of inflated row counts. The indicator=True flag adds a _merge column showing which side each row came from, invaluable for debugging.

python
import pandas as pd

left  = pd.DataFrame({"id": [1, 2, 3], "name": ["Alice", "Bob", "Carol"]})
right = pd.DataFrame({"id": [2, 3, 4], "salary": [80000, 95000, 70000]})

# Outer join with indicator
merged = left.merge(right, on="id", how="outer", indicator=True)
print(merged)

Output:

text
   id   name   salary      _merge
0   1  Alice      NaN   left_only
1   2    Bob  80000.0        both
2   3  Carol  95000.0        both
3   4    NaN  70000.0  right_only
python
# Validate one-to-one
dup = pd.DataFrame({"id": [1, 1], "x": [10, 20]})
try:
    left.merge(dup, on="id", validate="one_to_one")
except pd.errors.MergeError as e:
    print(f"caught: {e}")

Output:

text
caught: Merge keys are not unique in right dataset; not a one-to-one merge

merge matches on column values exactly — leading/trailing whitespace and dtype mismatches silently drop rows. Run df["id"].dtype on both sides before merging and str.strip() string keys.

Categorical with ordered levels

A Categorical dtype with explicit ordered levels (e.g. "low" < "medium" < "high") lets you sort and compare in semantic order rather than alphabetic order. Useful for survey responses, risk ratings, and ordinal labels in plots.

python
import pandas as pd

df = pd.DataFrame({"risk": ["low", "high", "medium", "low", "high"]})
df["risk"] = pd.Categorical(df["risk"], categories=["low", "medium", "high"], ordered=True)

print(df.sort_values("risk"))
print()
print((df["risk"] >= "medium").tolist())

Output:

text
     risk
0     low
3     low
2  medium
1    high
4    high

[False, True, True, False, True]

Method chaining and pipe()

Pandas operations are designed to chain — most return a new DataFrame. df.pipe(func, *args) extends this to custom functions that take a DataFrame and return a DataFrame, keeping the call site flat instead of nesting func(df, ...). Combined with .assign() for new columns, you get pure, readable pipelines that are easy to refactor.

python
import pandas as pd

def add_full_name(df):
    return df.assign(full=df["first"] + " " + df["last"])

result = (
    pd.DataFrame({"first": ["Alice", "Bob"], "last": ["Smith", "Jones"], "age": [30, 25]})
    .pipe(add_full_name)
    .query("age >= 28")
    .assign(decade=lambda d: (d["age"] // 10) * 10)
    [["full", "decade"]]
)
print(result)

Output:

text
          full  decade
0  Alice Smith      30

Window functions — rolling, expanding, ewm

rolling(window=N) produces a fixed-length moving window; expanding() accumulates from the start of the series; ewm(span=N) is an exponentially-weighted variant that gives more weight to recent observations. All three return a window object on which you call aggregations (.mean(), .std(), .quantile(0.95)).

python
import pandas as pd

s = pd.Series([1, 3, 2, 5, 4, 7, 6, 9, 8, 11])
print("rolling(3) mean:", s.rolling(3).mean().tolist())
print("expanding max: ", s.expanding().max().tolist())
print("ewm(span=3):   ", s.ewm(span=3).mean().round(2).tolist())

Output:

text
rolling(3) mean: [nan, nan, 2.0, 3.333..., 3.666..., 5.333..., 5.666..., 7.333..., 7.666..., 9.333...]
expanding max:  [1.0, 3.0, 3.0, 5.0, 5.0, 7.0, 7.0, 9.0, 9.0, 11.0]
ewm(span=3):    [1.0, 2.33, 2.14, 3.93, 3.97, 5.71, 5.85, 7.71, 7.85, 9.78]

Pivot, crosstab, and get_dummies

pivot_table aggregates a long DataFrame into a wide matrix; crosstab is a shortcut for frequency tables (counts/proportions of two categorical variables); get_dummies one-hot encodes categorical columns into a wide indicator matrix, the standard preprocessing step before fitting a linear model in scikit-learn.

python
import pandas as pd

df = pd.DataFrame({
    "region": ["East", "East", "West", "West", "East"],
    "product": ["A", "B", "A", "B", "A"],
    "amount": [100, 200, 150, 250, 120],
})

print(df.pivot_table(values="amount", index="region", columns="product", aggfunc="sum", fill_value=0))
print()
print(pd.crosstab(df["region"], df["product"]))
print()
print(pd.get_dummies(df["product"], prefix="product"))

Output:

text
product    A    B
region            
East     220  200
West     150  250

product  A  B
region        
East     2  1
West     1  1

   product_A  product_B
0       True      False
1      False       True
2       True      False
3      False       True
4       True      False

Comparison with polars

Pandas and polars solve the same problem but with very different trade-offs.

Concernpandaspolars
BackendNumPy / Arrow (2.x)Arrow + Rust
ExecutionEager onlyEager + lazy with query optimisation
ThreadingSingle-threaded (GIL)Multi-threaded by default
Memory~10 GB datasets uncomfortableStreaming for > RAM datasets
APIIndex-based (loc, iloc) + columnsExpression-based, no index
MutationSettingWithCopyWarning trapsAll transforms return new frames
EcosystemMassive (matplotlib, scikit-learn integration)Newer; growing fast

When your pandas pipeline is slow or memory-bound, rewriting it as polars LazyFrame chained calls usually beats parallelising or chunking pandas. See polars for the full migration guide.

Common pitfalls (reference)

Chained indexingdf[df.col > 0]["other"] = ... may not modify df. Always use df.loc[mask, "other"] = ....

merge row explosion — many-to-many joins multiply rows silently. Use validate= to assert your expectation.

object dtype confusion — mixed-type columns or string columns show as object. Convert to category or "string" for speed and clearer semantics.

apply(axis=1) is slow — it iterates rows in Python. Replace with vectorised expressions, df.assign(...), or np.where.

Time-zone naivetypd.to_datetime returns tz-naive timestamps by default. Always pass utc=True for log timestamps, then convert with tz_convert for display.

Real-world recipes

Cohort analysis: monthly retention

Group users by their first-event month, count distinct users active in each subsequent month, divide.

python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "event_date": pd.to_datetime([
        "2026-01-05", "2026-01-15", "2026-02-10",
        "2026-01-20", "2026-03-05",
        "2026-02-01", "2026-02-15", "2026-03-10", "2026-04-01",
    ]),
})

events["cohort"] = events.groupby("user_id")["event_date"].transform("min").dt.to_period("M")
events["period"] = events["event_date"].dt.to_period("M")
events["months_since"] = (events["period"] - events["cohort"]).apply(lambda x: x.n)

cohort = (
    events.groupby(["cohort", "months_since"])["user_id"].nunique()
    .unstack(fill_value=0)
)
print(cohort)

Output:

text
months_since  0  1  2
cohort               
2026-01       2  1  1
2026-02       1  1  1

Top N per group

groupby(...).head(N) after sorting gives the top N rows per group — a one-liner replacement for window functions.

python
import pandas as pd

df = pd.DataFrame({
    "dept": ["Eng", "Eng", "Eng", "HR", "HR"],
    "name": ["A", "B", "C", "D", "E"],
    "salary": [120000, 95000, 110000, 85000, 92000],
})

top2 = (
    df.sort_values(["dept", "salary"], ascending=[True, False])
    .groupby("dept")
    .head(2)
)
print(top2)

Output:

text
  dept name  salary
0  Eng    A  120000
2  Eng    C  110000
4   HR    E   92000
3   HR    D   85000

Detect dataframe drift

Compare two snapshots of the same schema to find rows that changed.

python
import pandas as pd

old = pd.DataFrame({"id": [1, 2, 3], "status": ["new", "open", "closed"]})
new = pd.DataFrame({"id": [1, 2, 3], "status": ["new", "closed", "closed"]})

merged = old.merge(new, on="id", suffixes=("_old", "_new"))
changed = merged.query("status_old != status_new")
print(changed)

Output:

text
   id status_old status_new
1   2       open     closed

Pivot then chart with Streamlit

Pair pandas with Streamlit to build interactive dashboards — Streamlit's st.dataframe and st.bar_chart accept a DataFrame directly.

python
import pandas as pd
import streamlit as st

df = pd.read_csv("sales.csv")
pivot = df.pivot_table("amount", "region", "product", aggfunc="sum", fill_value=0)
st.bar_chart(pivot)

See also

  • polars — faster, Arrow-native DataFrames with lazy execution.
  • numpy — underlying array library; .to_numpy() for zero-copy access.
  • matplotlib — pandas .plot() is a thin wrapper.
  • scikit-learn — accepts DataFrames directly in modern versions.
  • streamlit — turn a DataFrame into an interactive web UI.