cheat sheet
DuckDB
Run fast analytical SQL queries in-process with DuckDB. Covers Python API, CSV/Parquet ingestion, pandas interop, Arrow, window functions, and persistent databases.
DuckDB — Embedded Analytics Database
What it is
DuckDB is an embedded analytical database — it runs inside your Python process with no server, no installation, and no daemon. It executes SQL against in-memory tables, local CSV and Parquet files, pandas DataFrames, and Arrow tables. DuckDB is columnar (like Redshift or BigQuery) and optimised for analytical queries (aggregations, window functions, large scans), making it dramatically faster than SQLite for those workloads. It is the standard tool for "run SQL on a file" workflows in Python.
Install
pip install duckdb
Output: (none — exits 0 on success)
Quick example
import duckdb
result = duckdb.sql("SELECT 42 AS answer, 'hello' AS greeting")
print(result)
Output:
┌────────┬──────────┐
│ answer │ greeting │
│ int32 │ varchar │
╞════════╪══════════╡
│ 42 │ hello │
└────────┴──────────┘
When / why to use it
- Running SQL analytics on CSV, Parquet, or JSON files without loading them into a database first.
- Replacing pandas groupby/merge with SQL for complex aggregations — DuckDB is often 10–100× faster on multi-GB files.
- Interactive data exploration where you want SQL syntax but don't want to spin up a server.
- ETL pipelines that need window functions, CTEs, and QUALIFY without installing PostgreSQL.
- Reading from S3 / GCS / HTTP directly via the
httpfsextension. - Converting data between formats (CSV → Parquet, Parquet → pandas) in a single SQL statement.
Common pitfalls
Default connection is process-scoped —
duckdb.sql()uses a shared in-memory connection. If you run DuckDB in threads or multiple modules, create an explicit connection withduckdb.connect()to avoid contention.
In-memory tables are not persisted —
CREATE TABLEin the default in-memory connection disappears when the process exits. Useduckdb.connect("path/to/file.db")for persistence.
Parallel writes are not supported — DuckDB allows many concurrent readers but only one writer at a time. For concurrent write workloads, use PostgreSQL instead.
duckdb.sql("SELECT * FROM 'data.csv'")reads files directly without creating a table first — this is the fastest path for ad-hoc exploration.
result.df()converts a DuckDB query result to a pandas DataFrame.result.pl()converts to polars.result.arrow()returns a PyArrow Table.
Connection types
import duckdb
# In-memory (default) — data lost when connection closes
con = duckdb.connect()
# Persistent — data saved to disk
con = duckdb.connect("analytics.db")
# Read-only (safe for concurrent reads)
con = duckdb.connect("analytics.db", read_only=True)
# Default module-level connection (convenience — same as duckdb.connect())
duckdb.sql("SELECT 1") # uses the shared in-memory connection
Querying CSV files directly
DuckDB can read CSV files without loading them into a table. The file path is used directly in the FROM clause.
import duckdb
# Single file
result = duckdb.sql("SELECT * FROM 'sales.csv' LIMIT 5")
print(result)
# Auto-detect schema
result = duckdb.sql("DESCRIBE SELECT * FROM 'sales.csv'")
print(result)
# Glob — multiple files at once
result = duckdb.sql("SELECT COUNT(*) FROM 'data/logs_*.csv'")
print(result)
# CSV with options
result = duckdb.sql("""
SELECT region, SUM(amount) AS total
FROM read_csv('sales.csv', delim=',', header=true, dateformat='%Y-%m-%d')
GROUP BY region
ORDER BY total DESC
""")
print(result)
Output:
┌──────────┬──────────┐
│ region │ total │
│ varchar │ double │
╞══════════╪══════════╡
│ East │ 48291.00 │
│ West │ 33019.00 │
└──────────┴──────────┘
Querying Parquet files
import duckdb
# Read Parquet directly (much faster than CSV)
result = duckdb.sql("""
SELECT category, COUNT(*) AS count, AVG(price) AS avg_price
FROM 'products.parquet'
GROUP BY category
ORDER BY count DESC
LIMIT 5
""")
print(result)
# Glob multiple Parquet shards
result = duckdb.sql("""
SELECT year, SUM(revenue) FROM 'data/year=*/events.parquet'
GROUP BY year ORDER BY year
""")
CREATE TABLE and CRUD
import duckdb
con = duckdb.connect() # in-memory
con.sql("""
CREATE TABLE employees (
id INTEGER PRIMARY KEY,
name VARCHAR NOT NULL,
dept VARCHAR,
salary DOUBLE
)
""")
con.sql("""
INSERT INTO employees VALUES
(1, 'Alice Dev', 'Engineering', 95000),
(2, 'Bob Builder', 'Design', 78000),
(3, 'Carol Test', 'QA', 82000)
""")
result = con.sql("SELECT * FROM employees WHERE salary > 80000 ORDER BY salary DESC")
print(result)
# UPDATE and DELETE
con.sql("UPDATE employees SET salary = 99000 WHERE name = 'Alice Dev'")
con.sql("DELETE FROM employees WHERE dept = 'Design'")
print(con.sql("SELECT COUNT(*) AS remaining FROM employees").fetchone())
Output:
┌─────┬────────────┬─────────────┬──────────┐
│ id │ name │ dept │ salary │
│ i32 │ varchar │ varchar │ double │
╞═════╪════════════╪═════════════╪══════════╡
│ 3 │ Carol Test │ QA │ 82000.0 │
│ 1 │ Alice Dev │ Engineering │ 95000.0 │
└─────┴────────────┴─────────────┴──────────┘
(2,)
Pandas and polars interop
DuckDB can query pandas DataFrames and polars DataFrames as if they were tables, by name, without copying data.
import duckdb
import pandas as pd
df = pd.DataFrame({
"product": ["A", "B", "A", "B", "C"],
"region": ["East", "East", "West", "West", "East"],
"sales": [100, 200, 150, 250, 80],
})
# Query the DataFrame by variable name (zero-copy)
result = duckdb.sql("""
SELECT product, SUM(sales) AS total
FROM df
GROUP BY product
ORDER BY total DESC
""")
print(result)
Output:
┌─────────┬───────┐
│ product │ total │
│ varchar │ int64 │
╞═════════╪═══════╡
│ B │ 450 │
│ A │ 250 │
│ C │ 80 │
└─────────┴───────┘
# Convert result back to pandas or polars
df_result = result.df() # pandas DataFrame
pl_result = result.pl() # polars DataFrame
arrow_tbl = result.arrow() # PyArrow Table
# Register a DataFrame as a named view
con = duckdb.connect()
con.register("sales_view", df)
print(con.sql("SELECT COUNT(*) FROM sales_view").fetchone())
Output:
(5,)
Window functions
DuckDB has full window function support, including ROW_NUMBER, RANK, LAG, LEAD, NTILE, and running aggregates — identical syntax to PostgreSQL.
import duckdb
con = duckdb.connect()
con.sql("""
CREATE TABLE sales AS
SELECT * FROM (VALUES
('Alice', 'Q1', 100),
('Alice', 'Q2', 150),
('Alice', 'Q3', 120),
('Bob', 'Q1', 200),
('Bob', 'Q2', 180),
('Bob', 'Q3', 220)
) AS t(rep, quarter, amount)
""")
result = con.sql("""
SELECT
rep,
quarter,
amount,
SUM(amount) OVER (PARTITION BY rep ORDER BY quarter) AS running_total,
LAG(amount) OVER (PARTITION BY rep ORDER BY quarter) AS prev_amount,
RANK() OVER (ORDER BY amount DESC) AS global_rank
FROM sales
ORDER BY rep, quarter
""")
print(result)
Output:
┌───────┬─────────┬────────┬───────────────┬─────────────┬─────────────┐
│ rep │ quarter │ amount │ running_total │ prev_amount │ global_rank │
│ str │ str │ int32 │ int32 │ int32 │ int64 │
╞═══════╪═════════╪════════╪═══════════════╪═════════════╪═════════════╡
│ Alice │ Q1 │ 100 │ 100 │ NULL │ 5 │
│ Alice │ Q2 │ 150 │ 250 │ 100 │ 3 │
│ Alice │ Q3 │ 120 │ 370 │ 150 │ 4 │
│ Bob │ Q1 │ 200 │ 200 │ NULL │ 2 │
│ Bob │ Q2 │ 180 │ 380 │ 200 │ 3 │
│ Bob │ Q3 │ 220 │ 600 │ 180 │ 1 │
└───────┴─────────┴────────┴───────────────┴─────────────┴─────────────┘
CTEs and complex queries
DuckDB supports standard CTEs (WITH), recursive CTEs, and QUALIFY (filter on window function results without a subquery).
import duckdb
result = duckdb.sql("""
WITH
monthly AS (
SELECT
strftime(order_date, '%Y-%m') AS month,
SUM(amount) AS revenue
FROM 'orders.csv'
GROUP BY month
),
ranked AS (
SELECT
month,
revenue,
RANK() OVER (ORDER BY revenue DESC) AS rank
FROM monthly
)
SELECT * FROM ranked WHERE rank <= 3
""")
print(result)
Python API — fetchall, fetchone, fetchdf
import duckdb
con = duckdb.connect()
con.sql("CREATE TABLE t AS SELECT range(5) AS n")
# fetchall — list of tuples
rows = con.sql("SELECT * FROM t").fetchall()
print(rows) # [(0,), (1,), (2,), (3,), (4,)]
# fetchone — single row tuple
row = con.sql("SELECT MAX(n) FROM t").fetchone()
print(row[0]) # 4
# fetchdf — pandas DataFrame
df = con.sql("SELECT * FROM t").df()
print(df)
# fetchnumpy — dict of numpy arrays
arrays = con.sql("SELECT * FROM t").fetchnumpy()
print(arrays["n"])
# Parameterised queries (prevent SQL injection)
name = "Alice"
con.sql("SELECT * FROM employees WHERE name = ?", params=[name])
Output:
[(0,), (1,), (2,), (3,), (4,)]
4
n
0 0
1 1
2 2
3 3
4 4
Extensions — httpfs, JSON, spatial
DuckDB extensions add capabilities like S3/HTTP file reading, JSON parsing, and geospatial operations.
import duckdb
con = duckdb.connect()
# httpfs — read files from S3, GCS, HTTP
con.install_extension("httpfs")
con.load_extension("httpfs")
con.sql("SET s3_region='us-east-1'")
result = con.sql("SELECT COUNT(*) FROM 's3://my-bucket/data/*.parquet'")
# JSON — parse JSON columns
result = con.sql("""
SELECT
json_extract_string('{"name":"Alice","age":30}', '$.name') AS name,
json_extract_integer('{"name":"Alice","age":30}', '$.age') AS age
""")
print(result)
Output:
┌─────────┬───────┐
│ name │ age │
│ varchar │ int32 │
╞═════════╪═══════╡
│ Alice │ 30 │
└─────────┴───────┘
Export — CSV and Parquet
import duckdb
con = duckdb.connect("analytics.db")
# Export query result to CSV
con.sql("COPY (SELECT * FROM sales WHERE region = 'East') TO 'east_sales.csv' (HEADER, DELIMITER ',')")
# Export to Parquet (recommended for large files)
con.sql("COPY (SELECT * FROM sales) TO 'sales.parquet' (FORMAT PARQUET)")
# Create table from CSV and immediately export to Parquet
con.sql("COPY (SELECT * FROM read_csv_auto('raw.csv')) TO 'clean.parquet' (FORMAT PARQUET)")
Quick reference
| Task | Code |
|---|---|
| In-memory connection | duckdb.connect() |
| Persistent DB | duckdb.connect("file.db") |
| Run SQL | duckdb.sql("SELECT ...") |
| Query CSV | FROM 'file.csv' or read_csv('file.csv') |
| Query Parquet | FROM 'file.parquet' |
| Query glob | FROM 'dir/*.parquet' |
| Query DataFrame | FROM df (by variable name) |
| Register view | con.register("name", df) |
| To pandas | result.df() |
| To polars | result.pl() |
| To Arrow | result.arrow() |
| Fetch all | result.fetchall() → list of tuples |
| Fetch one | result.fetchone() → tuple |
| Parameterised | con.sql("SELECT * WHERE x = ?", params=[val]) |
| Export CSV | COPY (SELECT ...) TO 'out.csv' (HEADER) |
| Export Parquet | COPY (SELECT ...) TO 'out.parquet' (FORMAT PARQUET) |
| Load extension | con.install_extension("httpfs"); con.load_extension("httpfs") |