cheat sheet
duckdb
Package-level reference for duckdb — install, versioning, extensions, and gotchas. In-process columnar OLAP for Python.
duckdb
What it is
duckdb is an embedded OLAP database — SQL engine, columnar storage, vectorised executor — packaged as a Python library you pip install. There is no server process; queries run in-process against in-memory tables, local CSV/Parquet/JSON files, pandas DataFrames, polars DataFrames, and Arrow tables, with zero-copy where possible.
On PyPI it occupies the niche between SQLite (row-store transactional) and a standalone analytics warehouse — fast, embedded, zero-config, designed for analytical queries on local data.
Install
pip install duckdb
Output: (none — exits 0 on success)
uv add duckdb
Output: dependency resolved, added to pyproject.toml
poetry add duckdb
Output: lockfile updated, installed into the project venv
pip install --upgrade duckdb
Output: in-place upgrade — re-open any persistent .duckdb files written by an older binary
Versioning & Python support
duckdb uses a fast-moving CalVer-ish minor + storage-format version scheme. Releases are roughly monthly; major (0.x → 1.x) jumps signal stable on-disk format guarantees. As of late 2025 the project is on the 1.x line with cross-version storage compatibility.
| DuckDB line | Python support | On-disk format |
|---|---|---|
| 0.10.x | 3.7 – 3.11 | legacy; requires EXPORT DATABASE / re-IMPORT to upgrade |
| 0.11.x | 3.8 – 3.12 | improved storage; still legacy |
| 1.x | 3.9 – 3.13 | forward-compatible storage format |
The Python binding bundles the C++ engine — there is no separate libduckdb to install. Wheels exist for Linux (manylinux/musllinux), macOS (universal2), and Windows.
Package metadata
- Maintainer: DuckDB Foundation; primary authors Mark Raasveldt and Hannes Mühleisen (CWI Amsterdam)
- Project home: github.com/duckdb/duckdb
- Docs: duckdb.org/docs
- License: MIT
- PyPI: pypi.org/project/duckdb
- Governance: non-profit DuckDB Foundation; commercial sponsor MotherDuck
- First released: 2018 (research project); 1.0 in 2024
- Downloads: > 20 M / month on PyPI as of late 2025
Optional dependencies & extras
duckdb itself ships with no PyPI extras — additional functionality is loaded as DuckDB extensions from inside SQL:
INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_parquet('s3://bucket/data.parquet');
Output: httpfs extension downloaded into ~/.duckdb/extensions/<version>/ and registered
Common extensions:
| Extension | What it enables |
|---|---|
httpfs | read/write http://, s3://, gcs://, azure:// URLs |
parquet | bundled by default since 1.0 |
json | bundled; full JSON read/write |
iceberg | read Apache Iceberg tables |
delta | read Delta Lake tables |
postgres | scan a live Postgres instance from DuckDB |
sqlite | scan a SQLite file as DuckDB tables |
mysql | scan a live MySQL instance |
excel | read XLSX files natively |
spatial | GIS — geometry types, ST_* functions |
PyPI companion packages most pipelines pull alongside duckdb:
pip install duckdb pandas polars pyarrow jupyter
Output: installs duckdb + the DataFrame libraries it interops with zero-copy
Alternatives
| Package | One-line trade-off |
|---|---|
| SQLite | row-store OLTP — slow on analytical queries |
| pandas | imperative API; slower joins/aggregations on multi-GB data |
| polars | lazy DataFrame; complementary, not a replacement (use both) |
| PostgreSQL | server-based; needs a daemon, much heavier setup |
| ClickHouse | server-based OLAP, scales further but operational overhead |
| Apache Datafusion | Rust-native query engine; less Python-polished |
Common gotchas
- In-process only. duckdb is not a server. Concurrent processes opening the same
.duckdbfile fight for the write lock; use a single writer or switch to ClickHouse/Postgres for true concurrency. - Storage-format compatibility. Files written by 0.x can require export/import to load in 1.x. Within 1.x, format is forward-compatible. Pin the duckdb version in lockfiles for any persistent database.
- Default connection is process-shared.
duckdb.sql(...)uses one global in-memory connection. For threads or independent state useduckdb.connect()and pass the connection explicitly. - pandas/polars interop is zero-copy on Arrow — but not always.
df = duckdb.sql("...").df()materialises into pandas (copies). Use.arrow()or.pl()to stay in Arrow and avoid the round-trip. - Extensions download on first
INSTALL. Air-gapped environments needduckdb_extension_directorypre-seeded or the extension built in. - httpfs S3 credentials. duckdb does not read
~/.aws/credentialsautomatically — useSET s3_access_key_idands3_secret_access_key, orCREATE SECRET(1.0+). pip install duckdbon Apple Silicon Rosetta. Wheels are universal2; under Rosetta you accidentally pull the x86_64 slice and run 3× slower than necessary. Verify withimport duckdb; duckdb.__file__.
Real-world recipes
duckdb is the "one-process SQL warehouse" — its sweet spot is querying mixed local files (CSV, Parquet, JSON), pandas / polars frames, and remote object storage from a single Python script. The companion sections/python/duckdb covers SQL/API depth; the recipes below focus on the packaging-level setup each pattern requires.
Multi-format federation in one query — duckdb's killer feature: join a Parquet directory, a CSV, a JSON dump, and a pandas frame in a single SQL statement.
import duckdb
import pandas as pd
users = pd.DataFrame({"user_id": [1, 2, 3], "tier": ["free", "pro", "free"]})
con = duckdb.connect(":memory:")
con.register("users_df", users)
result = con.sql("""
SELECT u.tier, COUNT(*) AS event_count, SUM(e.revenue) AS total
FROM read_parquet('events/*.parquet') e
JOIN users_df u USING (user_id)
WHERE e.event_date >= DATE '2026-01-01'
GROUP BY u.tier
ORDER BY total DESC
""").df()
print(result)
Output: the aggregation; no copy of users is made — DuckDB scans it via Arrow as a virtual table
S3 / GCS query with the httpfs extension:
import duckdb
con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("CREATE SECRET aws (TYPE s3, KEY_ID 'AKIA...', SECRET '...', REGION 'us-east-1');")
q = """
SELECT date_trunc('day', ts) AS day, count(*) AS n
FROM read_parquet('s3://bucket/events/2026/*/*.parquet')
GROUP BY day
ORDER BY day
"""
df = con.sql(q).df()
print(df.head())
Output: the per-day count; httpfs streams just the necessary row groups from S3, never downloading whole files
Parquet pushdown for column projection + filter:
import duckdb
q = """
SELECT user_id, SUM(revenue) AS total
FROM read_parquet('warehouse/events/*.parquet')
WHERE event_date >= '2026-01-01' AND status = 'active'
GROUP BY user_id
ORDER BY total DESC
LIMIT 10
"""
duckdb.sql(q).show()
Output: the top-10 users by revenue; DuckDB's optimiser pushes the filter and projection into the Parquet reader so only the necessary columns and row groups are decoded
Persistent on-disk database:
import duckdb
con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS orders AS SELECT * FROM read_csv_auto('orders.csv')")
con.execute("CREATE INDEX IF NOT EXISTS idx_orders_user ON orders(user_id)")
con.close()
Output: a single-file warehouse.duckdb on disk, queryable by any DuckDB client at the same major version
Postgres scan (no ETL needed):
import duckdb
con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'host=db.example.com dbname=prod user=ro' AS pg (TYPE postgres)")
df = con.sql("SELECT id, name FROM pg.public.users WHERE created_at > '2026-01-01'").df()
print(df.head())
Output: a DataFrame containing live Postgres data; no dump-and-reload step required
Performance tuning
duckdb's vectorised executor is fast by default. The remaining tuning levers are about what to skip (column projection, predicate pushdown) and how to spend RAM (thread count, temp-disk spill).
import duckdb
con = duckdb.connect()
con.execute("SET threads TO 8") # cap CPU usage
con.execute("SET memory_limit = '12GB'") # spill to disk past this
con.execute("SET temp_directory = '/fast-ssd/duckdb-tmp'")
Output: no return value; subsequent queries respect the settings
Tuning levers in order of impact:
| Lever | Mechanism | When it helps |
|---|---|---|
| Use Parquet over CSV | columnar pushdown | most analytical workloads |
SELECT col1, col2 (no SELECT *) | projection pushdown | wide tables, narrow queries |
Explicit WHERE ts >= ... on partitioned data | predicate pushdown | partitioned Parquet trees |
SET threads TO N | thread cap | shared/CI runners |
SET memory_limit = '...' | spill threshold | low-RAM containers |
SET preserve_insertion_order=false | drop ordering for shuffles | huge group-by / sort |
pragma enable_progress_bar | progress UI | long interactive queries |
Inspecting the query plan:
import duckdb
q = """
SELECT region, sum(revenue)
FROM read_parquet('events/*.parquet')
WHERE event_date >= '2026-01-01'
GROUP BY region
"""
duckdb.sql("EXPLAIN " + q).show()
duckdb.sql("EXPLAIN ANALYZE " + q).show()
Output: the physical plan (EXPLAIN) and the same plan annotated with per-operator timings (EXPLAIN ANALYZE) — the difference shows where time is going
Memory & dataset-size scaling
duckdb is out-of-core by design — it spills to disk past the configured memory limit and works on datasets much larger than RAM. The main scaling lever is configuring spill location and memory budget; you rarely need to chunk data manually.
import duckdb
con = duckdb.connect()
con.execute("SET memory_limit = '4GB'")
con.execute("SET temp_directory = '/var/tmp/duckdb'")
# Aggregate 500 GB of Parquet on a 4 GB-RAM box
con.sql("""
COPY (
SELECT user_id, SUM(revenue) AS total
FROM read_parquet('warehouse/**/*.parquet')
GROUP BY user_id
) TO 'user_totals.parquet' (FORMAT 'parquet', COMPRESSION 'zstd');
""")
Output: writes user_totals.parquet; spills intermediate hash tables to /var/tmp/duckdb as needed
Streaming arrow output without materialising:
import duckdb
import pyarrow as pa
con = duckdb.connect()
reader = con.execute("SELECT * FROM read_parquet('huge.parquet')").fetch_record_batch(rows_per_batch=100_000)
for batch in reader:
# process one batch at a time — never holds the full table in RAM
print(batch.num_rows)
Output: prints the rows per batch as DuckDB streams Arrow record batches; useful for piping into downstream processors
For genuinely massive workloads (TB+), the right pairing is DuckDB + MotherDuck (the hosted DuckDB service from the same authors) or DuckDB + an Iceberg/Delta table layout so multiple processes can read the same dataset.
Version migration guide
DuckDB ships roughly monthly. The on-disk storage format compatibility story is the most important migration concern.
Format compatibility timeline:
- 0.8.x → 0.9.x → 0.10.x: required
EXPORT DATABASE/IMPORT DATABASEround-trips between minors. Painful for any persistent file. - 1.0 (2024): forward-compatible storage format declared. Files written by 1.0 can be opened by every later 1.x release.
- 1.x → 1.x: SQL surface and extension API can change, but storage stays compatible. Read the changelog for breaking SQL keywords or function renames.
import duckdb
print(duckdb.__version__)
con = duckdb.connect("warehouse.duckdb")
print(con.execute("PRAGMA database_size").fetchone())
Output: the installed version and the on-disk size — sanity-check both before upgrading
Common SQL-level migrations within 1.x:
read_csv_autois still alive butread_csvwithauto_detect=trueis the preferred form.- The
STRUCT_PACKkeyword for inline structs became more strict on field naming. LISTandARRAYare aliases; older code mixed them — pick one.- Several extensions moved from "bundled" to "auto-loadable" — first-run pulls them from the extension repository unless
autoinstall_known_extensionsis false (offline-safe).
Pin the duckdb version in production lockfiles. Persistent .duckdb files written by a future version cannot be opened by an older binary even within 1.x.
Interop with adjacent ecosystems
DuckDB speaks Arrow natively, which is why it pairs so well with pandas and polars. The interop matrix below tells you when a query result is zero-copy versus when it triggers a serialisation.
| Result method | Returns | Zero-copy? |
|---|---|---|
.df() | pandas DataFrame | Yes if backed by Arrow string types; copy otherwise |
.pl() | polars DataFrame | Yes |
.arrow() | pyarrow Table | Yes |
.fetch_record_batch() | streaming Arrow batches | Yes, streamed |
.fetchall() | list of tuples | Copy (Python objects) |
.fetchone() | one tuple | Copy |
.fetchnumpy() | dict of NumPy arrays | Yes for numeric, copy for strings |
import duckdb
import pandas as pd
import polars as pl
pdf = pd.DataFrame({"a": [1, 2, 3]})
plf = pl.DataFrame({"a": [4, 5, 6]})
# Both frames are accessible inside SQL with zero copy
result = duckdb.sql("SELECT a FROM pdf UNION ALL SELECT a FROM plf").pl()
print(result)
Output: a polars frame with [1, 2, 3, 4, 5, 6]; DuckDB sees both source frames as Arrow scans
Crossing into the Jupyter ecosystem: duckdb.sql(query) returns a DuckDBPyRelation that renders nicely in notebooks. Use %load_ext sql + %sql duckdb:/// for cell-magic SQL.
Troubleshooting common errors
The errors below are the recurring frictions; most resolve with a one-line setting change.
IO Error: Cannot open file '...': No such file— relative path mismatch. DuckDB resolves paths relative to the process CWD, not the script directory. Use absolute paths.Catalog Error: Table with name X does not exist— you usedduckdb.sql(...)(which uses a fresh in-memory connection) and registered the frame on a different connection. Use one explicitcon = duckdb.connect(...)for the whole script.Permission deniedwriting to a.duckdbfile — another process holds the write lock. DuckDB allows multiple readers but only one writer.HTTP 403from httpfs — wrong region or stale credentials. UseCREATE SECRET ... TYPE s3(1.0+) rather than legacySET s3_*for clean error messages.Cannot load extension '...': not on the autoload list— an extension is community-maintained. SetSET allow_community_extensions = truefirst.- Stale
~/.duckdb/extensions/<version>/after upgrading — extensions are version-keyed; delete the old version directory or runINSTALL <ext> -- updateto refresh. - Storage version error opening a
.duckdbfile with an older binary — upgrade the binary or useEXPORT DATABASEfrom the newer one andIMPORT DATABASEon the older. - OOM despite
memory_limit— some operators (large window functions, recursive CTEs) ignore the limit. AddSET temp_directory = '/fast-disk/...'and ensure that disk has room.
When NOT to use this
duckdb is the right answer surprisingly often; the cases below are where another tool wins.
- Concurrent writers / transactional workloads: DuckDB is one-writer. Use Postgres, SQLite, or a true OLTP system.
- Streaming ingest: duckdb is batch-oriented; use Flink, Beam, or a streaming framework.
- Cluster-scale (multi-TB): duckdb is single-process. Use Snowflake, BigQuery, Clickhouse, or a distributed query engine for petabyte-scale.
- Pure DataFrame ergonomics: if everyone on the team prefers fluent APIs over SQL, polars is a better daily-driver.
- Vector / time-series specialist workloads: duckdb has good support but specialised stores (DuckDB + VSS extension vs Pinecone; TimescaleDB; ClickHouse) sometimes win on shape-specific operations.
- You need cross-region latency-bounded reads: DuckDB does not know about region affinity; a cloud warehouse with regional storage will do better.
See also
- sections/python/duckdb — full API tutorial (SQL, pandas/polars interop, Arrow)
- sections/python/pandas — DataFrames that duckdb queries zero-copy
- sections/python/polars — Arrow-native DataFrame, perfect pair with duckdb
- sections/packages-pip/pip-pandas — package-level comparison
- sections/packages-pip/pip-polars — sibling Arrow-native analytics