cheat sheet

DuckDB

Run fast analytical SQL queries in-process with DuckDB. Covers Python API, CSV/Parquet ingestion, pandas interop, Arrow, window functions, and persistent databases.

DuckDB — Embedded Analytics Database

What it is

DuckDB is an embedded analytical database — it runs inside your Python process with no server, no installation, and no daemon. It executes SQL against in-memory tables, local CSV and Parquet files, pandas DataFrames, and Arrow tables. DuckDB is columnar (like Redshift or BigQuery) and optimised for analytical queries (aggregations, window functions, large scans), making it dramatically faster than SQLite for those workloads. It is the standard tool for "run SQL on a file" workflows in Python.

Install

bash
pip install duckdb

Output: (none — exits 0 on success)

Quick example

python
import duckdb

result = duckdb.sql("SELECT 42 AS answer, 'hello' AS greeting")
print(result)

Output:

text
┌────────┬──────────┐
│ answer │ greeting │
│  int32 │ varchar  │
╞════════╪══════════╡
│     42 │ hello    │
└────────┴──────────┘

When / why to use it

  • Running SQL analytics on CSV, Parquet, or JSON files without loading them into a database first.
  • Replacing pandas groupby/merge with SQL for complex aggregations — DuckDB is often 10–100× faster on multi-GB files.
  • Interactive data exploration where you want SQL syntax but don't want to spin up a server.
  • ETL pipelines that need window functions, CTEs, and QUALIFY without installing PostgreSQL.
  • Reading from S3 / GCS / HTTP directly via the httpfs extension.
  • Converting data between formats (CSV → Parquet, Parquet → pandas) in a single SQL statement.

Common pitfalls

Default connection is process-scopedduckdb.sql() uses a shared in-memory connection. If you run DuckDB in threads or multiple modules, create an explicit connection with duckdb.connect() to avoid contention.

In-memory tables are not persistedCREATE TABLE in the default in-memory connection disappears when the process exits. Use duckdb.connect("path/to/file.db") for persistence.

Parallel writes are not supported — DuckDB allows many concurrent readers but only one writer at a time. For concurrent write workloads, use PostgreSQL instead.

duckdb.sql("SELECT * FROM 'data.csv'") reads files directly without creating a table first — this is the fastest path for ad-hoc exploration.

result.df() converts a DuckDB query result to a pandas DataFrame. result.pl() converts to polars. result.arrow() returns a PyArrow Table.

Connection types

python
import duckdb

# In-memory (default) — data lost when connection closes
con = duckdb.connect()

# Persistent — data saved to disk
con = duckdb.connect("analytics.db")

# Read-only (safe for concurrent reads)
con = duckdb.connect("analytics.db", read_only=True)

# Default module-level connection (convenience — same as duckdb.connect())
duckdb.sql("SELECT 1")   # uses the shared in-memory connection

Querying CSV files directly

DuckDB can read CSV files without loading them into a table. The file path is used directly in the FROM clause.

python
import duckdb

# Single file
result = duckdb.sql("SELECT * FROM 'sales.csv' LIMIT 5")
print(result)

# Auto-detect schema
result = duckdb.sql("DESCRIBE SELECT * FROM 'sales.csv'")
print(result)

# Glob — multiple files at once
result = duckdb.sql("SELECT COUNT(*) FROM 'data/logs_*.csv'")
print(result)

# CSV with options
result = duckdb.sql("""
    SELECT region, SUM(amount) AS total
    FROM read_csv('sales.csv', delim=',', header=true, dateformat='%Y-%m-%d')
    GROUP BY region
    ORDER BY total DESC
""")
print(result)

Output:

text
┌──────────┬──────────┐
│  region  │  total   │
│ varchar  │  double  │
╞══════════╪══════════╡
│ East     │ 48291.00 │
│ West     │ 33019.00 │
└──────────┴──────────┘

Querying Parquet files

python
import duckdb

# Read Parquet directly (much faster than CSV)
result = duckdb.sql("""
    SELECT category, COUNT(*) AS count, AVG(price) AS avg_price
    FROM 'products.parquet'
    GROUP BY category
    ORDER BY count DESC
    LIMIT 5
""")
print(result)

# Glob multiple Parquet shards
result = duckdb.sql("""
    SELECT year, SUM(revenue) FROM 'data/year=*/events.parquet'
    GROUP BY year ORDER BY year
""")

CREATE TABLE and CRUD

python
import duckdb

con = duckdb.connect()   # in-memory

con.sql("""
    CREATE TABLE employees (
        id      INTEGER PRIMARY KEY,
        name    VARCHAR NOT NULL,
        dept    VARCHAR,
        salary  DOUBLE
    )
""")

con.sql("""
    INSERT INTO employees VALUES
        (1, 'Alice Dev',   'Engineering', 95000),
        (2, 'Bob Builder', 'Design',      78000),
        (3, 'Carol Test',  'QA',          82000)
""")

result = con.sql("SELECT * FROM employees WHERE salary > 80000 ORDER BY salary DESC")
print(result)

# UPDATE and DELETE
con.sql("UPDATE employees SET salary = 99000 WHERE name = 'Alice Dev'")
con.sql("DELETE FROM employees WHERE dept = 'Design'")

print(con.sql("SELECT COUNT(*) AS remaining FROM employees").fetchone())

Output:

text
┌─────┬────────────┬─────────────┬──────────┐
│ id  │    name    │    dept     │  salary  │
│ i32 │  varchar   │   varchar   │  double  │
╞═════╪════════════╪═════════════╪══════════╡
│   3 │ Carol Test │ QA          │  82000.0 │
│   1 │ Alice Dev  │ Engineering │  95000.0 │
└─────┴────────────┴─────────────┴──────────┘
(2,)

Pandas and polars interop

DuckDB can query pandas DataFrames and polars DataFrames as if they were tables, by name, without copying data.

python
import duckdb
import pandas as pd

df = pd.DataFrame({
    "product": ["A", "B", "A", "B", "C"],
    "region":  ["East", "East", "West", "West", "East"],
    "sales":   [100, 200, 150, 250, 80],
})

# Query the DataFrame by variable name (zero-copy)
result = duckdb.sql("""
    SELECT product, SUM(sales) AS total
    FROM df
    GROUP BY product
    ORDER BY total DESC
""")
print(result)

Output:

text
┌─────────┬───────┐
│ product │ total │
│ varchar │ int64 │
╞═════════╪═══════╡
│ B       │   450 │
│ A       │   250 │
│ C       │    80 │
└─────────┴───────┘
python
# Convert result back to pandas or polars
df_result = result.df()    # pandas DataFrame
pl_result  = result.pl()   # polars DataFrame
arrow_tbl  = result.arrow() # PyArrow Table

# Register a DataFrame as a named view
con = duckdb.connect()
con.register("sales_view", df)
print(con.sql("SELECT COUNT(*) FROM sales_view").fetchone())

Output:

text
(5,)

Window functions

DuckDB has full window function support, including ROW_NUMBER, RANK, LAG, LEAD, NTILE, and running aggregates — identical syntax to PostgreSQL.

python
import duckdb

con = duckdb.connect()
con.sql("""
    CREATE TABLE sales AS
    SELECT * FROM (VALUES
        ('Alice', 'Q1', 100),
        ('Alice', 'Q2', 150),
        ('Alice', 'Q3', 120),
        ('Bob',   'Q1', 200),
        ('Bob',   'Q2', 180),
        ('Bob',   'Q3', 220)
    ) AS t(rep, quarter, amount)
""")

result = con.sql("""
    SELECT
        rep,
        quarter,
        amount,
        SUM(amount)  OVER (PARTITION BY rep ORDER BY quarter) AS running_total,
        LAG(amount)  OVER (PARTITION BY rep ORDER BY quarter) AS prev_amount,
        RANK()       OVER (ORDER BY amount DESC)               AS global_rank
    FROM sales
    ORDER BY rep, quarter
""")
print(result)

Output:

text
┌───────┬─────────┬────────┬───────────────┬─────────────┬─────────────┐
│  rep  │ quarter │ amount │ running_total │ prev_amount │ global_rank │
│  str  │   str   │ int32  │    int32      │   int32     │   int64     │
╞═══════╪═════════╪════════╪═══════════════╪═════════════╪═════════════╡
│ Alice │ Q1      │    100 │           100 │        NULL │           5 │
│ Alice │ Q2      │    150 │           250 │         100 │           3 │
│ Alice │ Q3      │    120 │           370 │         150 │           4 │
│ Bob   │ Q1      │    200 │           200 │        NULL │           2 │
│ Bob   │ Q2      │    180 │           380 │         200 │           3 │
│ Bob   │ Q3      │    220 │           600 │         180 │           1 │
└───────┴─────────┴────────┴───────────────┴─────────────┴─────────────┘

CTEs and complex queries

DuckDB supports standard CTEs (WITH), recursive CTEs, and QUALIFY (filter on window function results without a subquery).

python
import duckdb

result = duckdb.sql("""
    WITH
    monthly AS (
        SELECT
            strftime(order_date, '%Y-%m') AS month,
            SUM(amount) AS revenue
        FROM 'orders.csv'
        GROUP BY month
    ),
    ranked AS (
        SELECT
            month,
            revenue,
            RANK() OVER (ORDER BY revenue DESC) AS rank
        FROM monthly
    )
    SELECT * FROM ranked WHERE rank <= 3
""")
print(result)

Python API — fetchall, fetchone, fetchdf

python
import duckdb

con = duckdb.connect()
con.sql("CREATE TABLE t AS SELECT range(5) AS n")

# fetchall — list of tuples
rows = con.sql("SELECT * FROM t").fetchall()
print(rows)         # [(0,), (1,), (2,), (3,), (4,)]

# fetchone — single row tuple
row = con.sql("SELECT MAX(n) FROM t").fetchone()
print(row[0])       # 4

# fetchdf — pandas DataFrame
df = con.sql("SELECT * FROM t").df()
print(df)

# fetchnumpy — dict of numpy arrays
arrays = con.sql("SELECT * FROM t").fetchnumpy()
print(arrays["n"])

# Parameterised queries (prevent SQL injection)
name = "Alice"
con.sql("SELECT * FROM employees WHERE name = ?", params=[name])

Output:

text
[(0,), (1,), (2,), (3,), (4,)]
4
   n
0  0
1  1
2  2
3  3
4  4

Extensions — httpfs, JSON, spatial

DuckDB extensions add capabilities like S3/HTTP file reading, JSON parsing, and geospatial operations.

python
import duckdb

con = duckdb.connect()

# httpfs — read files from S3, GCS, HTTP
con.install_extension("httpfs")
con.load_extension("httpfs")
con.sql("SET s3_region='us-east-1'")
result = con.sql("SELECT COUNT(*) FROM 's3://my-bucket/data/*.parquet'")

# JSON — parse JSON columns
result = con.sql("""
    SELECT
        json_extract_string('{"name":"Alice","age":30}', '$.name') AS name,
        json_extract_integer('{"name":"Alice","age":30}', '$.age')  AS age
""")
print(result)

Output:

text
┌─────────┬───────┐
│  name   │  age  │
│ varchar │ int32 │
╞═════════╪═══════╡
│ Alice   │    30 │
└─────────┴───────┘

Export — CSV and Parquet

python
import duckdb

con = duckdb.connect("analytics.db")

# Export query result to CSV
con.sql("COPY (SELECT * FROM sales WHERE region = 'East') TO 'east_sales.csv' (HEADER, DELIMITER ',')")

# Export to Parquet (recommended for large files)
con.sql("COPY (SELECT * FROM sales) TO 'sales.parquet' (FORMAT PARQUET)")

# Create table from CSV and immediately export to Parquet
con.sql("COPY (SELECT * FROM read_csv_auto('raw.csv')) TO 'clean.parquet' (FORMAT PARQUET)")

Quick reference

TaskCode
In-memory connectionduckdb.connect()
Persistent DBduckdb.connect("file.db")
Run SQLduckdb.sql("SELECT ...")
Query CSVFROM 'file.csv' or read_csv('file.csv')
Query ParquetFROM 'file.parquet'
Query globFROM 'dir/*.parquet'
Query DataFrameFROM df (by variable name)
Register viewcon.register("name", df)
To pandasresult.df()
To polarsresult.pl()
To Arrowresult.arrow()
Fetch allresult.fetchall() → list of tuples
Fetch oneresult.fetchone() → tuple
Parameterisedcon.sql("SELECT * WHERE x = ?", params=[val])
Export CSVCOPY (SELECT ...) TO 'out.csv' (HEADER)
Export ParquetCOPY (SELECT ...) TO 'out.parquet' (FORMAT PARQUET)
Load extensioncon.install_extension("httpfs"); con.load_extension("httpfs")