cheat sheet

duckdb

Package-level reference for duckdb — install, versioning, extensions, and gotchas. In-process columnar OLAP for Python.

duckdb

What it is

duckdb is an embedded OLAP database — SQL engine, columnar storage, vectorised executor — packaged as a Python library you pip install. There is no server process; queries run in-process against in-memory tables, local CSV/Parquet/JSON files, pandas DataFrames, polars DataFrames, and Arrow tables, with zero-copy where possible.

On PyPI it occupies the niche between SQLite (row-store transactional) and a standalone analytics warehouse — fast, embedded, zero-config, designed for analytical queries on local data.

Install

bash
pip install duckdb

Output: (none — exits 0 on success)

bash
uv add duckdb

Output: dependency resolved, added to pyproject.toml

bash
poetry add duckdb

Output: lockfile updated, installed into the project venv

bash
pip install --upgrade duckdb

Output: in-place upgrade — re-open any persistent .duckdb files written by an older binary

Versioning & Python support

duckdb uses a fast-moving CalVer-ish minor + storage-format version scheme. Releases are roughly monthly; major (0.x → 1.x) jumps signal stable on-disk format guarantees. As of late 2025 the project is on the 1.x line with cross-version storage compatibility.

DuckDB linePython supportOn-disk format
0.10.x3.7 – 3.11legacy; requires EXPORT DATABASE / re-IMPORT to upgrade
0.11.x3.8 – 3.12improved storage; still legacy
1.x3.9 – 3.13forward-compatible storage format

The Python binding bundles the C++ engine — there is no separate libduckdb to install. Wheels exist for Linux (manylinux/musllinux), macOS (universal2), and Windows.

Package metadata

  • Maintainer: DuckDB Foundation; primary authors Mark Raasveldt and Hannes Mühleisen (CWI Amsterdam)
  • Project home: github.com/duckdb/duckdb
  • Docs: duckdb.org/docs
  • License: MIT
  • PyPI: pypi.org/project/duckdb
  • Governance: non-profit DuckDB Foundation; commercial sponsor MotherDuck
  • First released: 2018 (research project); 1.0 in 2024
  • Downloads: > 20 M / month on PyPI as of late 2025

Optional dependencies & extras

duckdb itself ships with no PyPI extras — additional functionality is loaded as DuckDB extensions from inside SQL:

sql
INSTALL httpfs;
LOAD httpfs;
SELECT * FROM read_parquet('s3://bucket/data.parquet');

Output: httpfs extension downloaded into ~/.duckdb/extensions/<version>/ and registered

Common extensions:

ExtensionWhat it enables
httpfsread/write http://, s3://, gcs://, azure:// URLs
parquetbundled by default since 1.0
jsonbundled; full JSON read/write
icebergread Apache Iceberg tables
deltaread Delta Lake tables
postgresscan a live Postgres instance from DuckDB
sqlitescan a SQLite file as DuckDB tables
mysqlscan a live MySQL instance
excelread XLSX files natively
spatialGIS — geometry types, ST_* functions

PyPI companion packages most pipelines pull alongside duckdb:

bash
pip install duckdb pandas polars pyarrow jupyter

Output: installs duckdb + the DataFrame libraries it interops with zero-copy

Alternatives

PackageOne-line trade-off
SQLiterow-store OLTP — slow on analytical queries
pandasimperative API; slower joins/aggregations on multi-GB data
polarslazy DataFrame; complementary, not a replacement (use both)
PostgreSQLserver-based; needs a daemon, much heavier setup
ClickHouseserver-based OLAP, scales further but operational overhead
Apache DatafusionRust-native query engine; less Python-polished

Common gotchas

  • In-process only. duckdb is not a server. Concurrent processes opening the same .duckdb file fight for the write lock; use a single writer or switch to ClickHouse/Postgres for true concurrency.
  • Storage-format compatibility. Files written by 0.x can require export/import to load in 1.x. Within 1.x, format is forward-compatible. Pin the duckdb version in lockfiles for any persistent database.
  • Default connection is process-shared. duckdb.sql(...) uses one global in-memory connection. For threads or independent state use duckdb.connect() and pass the connection explicitly.
  • pandas/polars interop is zero-copy on Arrow — but not always. df = duckdb.sql("...").df() materialises into pandas (copies). Use .arrow() or .pl() to stay in Arrow and avoid the round-trip.
  • Extensions download on first INSTALL. Air-gapped environments need duckdb_extension_directory pre-seeded or the extension built in.
  • httpfs S3 credentials. duckdb does not read ~/.aws/credentials automatically — use SET s3_access_key_id and s3_secret_access_key, or CREATE SECRET (1.0+).
  • pip install duckdb on Apple Silicon Rosetta. Wheels are universal2; under Rosetta you accidentally pull the x86_64 slice and run 3× slower than necessary. Verify with import duckdb; duckdb.__file__.

Real-world recipes

duckdb is the "one-process SQL warehouse" — its sweet spot is querying mixed local files (CSV, Parquet, JSON), pandas / polars frames, and remote object storage from a single Python script. The companion sections/python/duckdb covers SQL/API depth; the recipes below focus on the packaging-level setup each pattern requires.

Multi-format federation in one query — duckdb's killer feature: join a Parquet directory, a CSV, a JSON dump, and a pandas frame in a single SQL statement.

python
import duckdb
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2, 3], "tier": ["free", "pro", "free"]})

con = duckdb.connect(":memory:")
con.register("users_df", users)

result = con.sql("""
    SELECT u.tier, COUNT(*) AS event_count, SUM(e.revenue) AS total
    FROM read_parquet('events/*.parquet') e
    JOIN users_df u USING (user_id)
    WHERE e.event_date >= DATE '2026-01-01'
    GROUP BY u.tier
    ORDER BY total DESC
""").df()
print(result)

Output: the aggregation; no copy of users is made — DuckDB scans it via Arrow as a virtual table

S3 / GCS query with the httpfs extension:

python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("CREATE SECRET aws (TYPE s3, KEY_ID 'AKIA...', SECRET '...', REGION 'us-east-1');")

q = """
SELECT date_trunc('day', ts) AS day, count(*) AS n
FROM read_parquet('s3://bucket/events/2026/*/*.parquet')
GROUP BY day
ORDER BY day
"""
df = con.sql(q).df()
print(df.head())

Output: the per-day count; httpfs streams just the necessary row groups from S3, never downloading whole files

Parquet pushdown for column projection + filter:

python
import duckdb

q = """
SELECT user_id, SUM(revenue) AS total
FROM read_parquet('warehouse/events/*.parquet')
WHERE event_date >= '2026-01-01' AND status = 'active'
GROUP BY user_id
ORDER BY total DESC
LIMIT 10
"""
duckdb.sql(q).show()

Output: the top-10 users by revenue; DuckDB's optimiser pushes the filter and projection into the Parquet reader so only the necessary columns and row groups are decoded

Persistent on-disk database:

python
import duckdb

con = duckdb.connect("warehouse.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS orders AS SELECT * FROM read_csv_auto('orders.csv')")
con.execute("CREATE INDEX IF NOT EXISTS idx_orders_user ON orders(user_id)")
con.close()

Output: a single-file warehouse.duckdb on disk, queryable by any DuckDB client at the same major version

Postgres scan (no ETL needed):

python
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'host=db.example.com dbname=prod user=ro' AS pg (TYPE postgres)")
df = con.sql("SELECT id, name FROM pg.public.users WHERE created_at > '2026-01-01'").df()
print(df.head())

Output: a DataFrame containing live Postgres data; no dump-and-reload step required

Performance tuning

duckdb's vectorised executor is fast by default. The remaining tuning levers are about what to skip (column projection, predicate pushdown) and how to spend RAM (thread count, temp-disk spill).

python
import duckdb

con = duckdb.connect()
con.execute("SET threads TO 8")           # cap CPU usage
con.execute("SET memory_limit = '12GB'")  # spill to disk past this
con.execute("SET temp_directory = '/fast-ssd/duckdb-tmp'")

Output: no return value; subsequent queries respect the settings

Tuning levers in order of impact:

LeverMechanismWhen it helps
Use Parquet over CSVcolumnar pushdownmost analytical workloads
SELECT col1, col2 (no SELECT *)projection pushdownwide tables, narrow queries
Explicit WHERE ts >= ... on partitioned datapredicate pushdownpartitioned Parquet trees
SET threads TO Nthread capshared/CI runners
SET memory_limit = '...'spill thresholdlow-RAM containers
SET preserve_insertion_order=falsedrop ordering for shuffleshuge group-by / sort
pragma enable_progress_barprogress UIlong interactive queries

Inspecting the query plan:

python
import duckdb

q = """
SELECT region, sum(revenue)
FROM read_parquet('events/*.parquet')
WHERE event_date >= '2026-01-01'
GROUP BY region
"""
duckdb.sql("EXPLAIN " + q).show()
duckdb.sql("EXPLAIN ANALYZE " + q).show()

Output: the physical plan (EXPLAIN) and the same plan annotated with per-operator timings (EXPLAIN ANALYZE) — the difference shows where time is going

Memory & dataset-size scaling

duckdb is out-of-core by design — it spills to disk past the configured memory limit and works on datasets much larger than RAM. The main scaling lever is configuring spill location and memory budget; you rarely need to chunk data manually.

python
import duckdb

con = duckdb.connect()
con.execute("SET memory_limit = '4GB'")
con.execute("SET temp_directory = '/var/tmp/duckdb'")

# Aggregate 500 GB of Parquet on a 4 GB-RAM box
con.sql("""
COPY (
    SELECT user_id, SUM(revenue) AS total
    FROM read_parquet('warehouse/**/*.parquet')
    GROUP BY user_id
) TO 'user_totals.parquet' (FORMAT 'parquet', COMPRESSION 'zstd');
""")

Output: writes user_totals.parquet; spills intermediate hash tables to /var/tmp/duckdb as needed

Streaming arrow output without materialising:

python
import duckdb
import pyarrow as pa

con = duckdb.connect()
reader = con.execute("SELECT * FROM read_parquet('huge.parquet')").fetch_record_batch(rows_per_batch=100_000)
for batch in reader:
    # process one batch at a time — never holds the full table in RAM
    print(batch.num_rows)

Output: prints the rows per batch as DuckDB streams Arrow record batches; useful for piping into downstream processors

For genuinely massive workloads (TB+), the right pairing is DuckDB + MotherDuck (the hosted DuckDB service from the same authors) or DuckDB + an Iceberg/Delta table layout so multiple processes can read the same dataset.

Version migration guide

DuckDB ships roughly monthly. The on-disk storage format compatibility story is the most important migration concern.

Format compatibility timeline:

  • 0.8.x → 0.9.x → 0.10.x: required EXPORT DATABASE/IMPORT DATABASE round-trips between minors. Painful for any persistent file.
  • 1.0 (2024): forward-compatible storage format declared. Files written by 1.0 can be opened by every later 1.x release.
  • 1.x → 1.x: SQL surface and extension API can change, but storage stays compatible. Read the changelog for breaking SQL keywords or function renames.
python
import duckdb

print(duckdb.__version__)
con = duckdb.connect("warehouse.duckdb")
print(con.execute("PRAGMA database_size").fetchone())

Output: the installed version and the on-disk size — sanity-check both before upgrading

Common SQL-level migrations within 1.x:

  • read_csv_auto is still alive but read_csv with auto_detect=true is the preferred form.
  • The STRUCT_PACK keyword for inline structs became more strict on field naming.
  • LIST and ARRAY are aliases; older code mixed them — pick one.
  • Several extensions moved from "bundled" to "auto-loadable" — first-run pulls them from the extension repository unless autoinstall_known_extensions is false (offline-safe).

Pin the duckdb version in production lockfiles. Persistent .duckdb files written by a future version cannot be opened by an older binary even within 1.x.

Interop with adjacent ecosystems

DuckDB speaks Arrow natively, which is why it pairs so well with pandas and polars. The interop matrix below tells you when a query result is zero-copy versus when it triggers a serialisation.

Result methodReturnsZero-copy?
.df()pandas DataFrameYes if backed by Arrow string types; copy otherwise
.pl()polars DataFrameYes
.arrow()pyarrow TableYes
.fetch_record_batch()streaming Arrow batchesYes, streamed
.fetchall()list of tuplesCopy (Python objects)
.fetchone()one tupleCopy
.fetchnumpy()dict of NumPy arraysYes for numeric, copy for strings
python
import duckdb
import pandas as pd
import polars as pl

pdf = pd.DataFrame({"a": [1, 2, 3]})
plf = pl.DataFrame({"a": [4, 5, 6]})

# Both frames are accessible inside SQL with zero copy
result = duckdb.sql("SELECT a FROM pdf UNION ALL SELECT a FROM plf").pl()
print(result)

Output: a polars frame with [1, 2, 3, 4, 5, 6]; DuckDB sees both source frames as Arrow scans

Crossing into the Jupyter ecosystem: duckdb.sql(query) returns a DuckDBPyRelation that renders nicely in notebooks. Use %load_ext sql + %sql duckdb:/// for cell-magic SQL.

Troubleshooting common errors

The errors below are the recurring frictions; most resolve with a one-line setting change.

  • IO Error: Cannot open file '...': No such file — relative path mismatch. DuckDB resolves paths relative to the process CWD, not the script directory. Use absolute paths.
  • Catalog Error: Table with name X does not exist — you used duckdb.sql(...) (which uses a fresh in-memory connection) and registered the frame on a different connection. Use one explicit con = duckdb.connect(...) for the whole script.
  • Permission denied writing to a .duckdb file — another process holds the write lock. DuckDB allows multiple readers but only one writer.
  • HTTP 403 from httpfs — wrong region or stale credentials. Use CREATE SECRET ... TYPE s3 (1.0+) rather than legacy SET s3_* for clean error messages.
  • Cannot load extension '...': not on the autoload list — an extension is community-maintained. Set SET allow_community_extensions = true first.
  • Stale ~/.duckdb/extensions/<version>/ after upgrading — extensions are version-keyed; delete the old version directory or run INSTALL <ext> -- update to refresh.
  • Storage version error opening a .duckdb file with an older binary — upgrade the binary or use EXPORT DATABASE from the newer one and IMPORT DATABASE on the older.
  • OOM despite memory_limit — some operators (large window functions, recursive CTEs) ignore the limit. Add SET temp_directory = '/fast-disk/...' and ensure that disk has room.

When NOT to use this

duckdb is the right answer surprisingly often; the cases below are where another tool wins.

  • Concurrent writers / transactional workloads: DuckDB is one-writer. Use Postgres, SQLite, or a true OLTP system.
  • Streaming ingest: duckdb is batch-oriented; use Flink, Beam, or a streaming framework.
  • Cluster-scale (multi-TB): duckdb is single-process. Use Snowflake, BigQuery, Clickhouse, or a distributed query engine for petabyte-scale.
  • Pure DataFrame ergonomics: if everyone on the team prefers fluent APIs over SQL, polars is a better daily-driver.
  • Vector / time-series specialist workloads: duckdb has good support but specialised stores (DuckDB + VSS extension vs Pinecone; TimescaleDB; ClickHouse) sometimes win on shape-specific operations.
  • You need cross-region latency-bounded reads: DuckDB does not know about region affinity; a cloud warehouse with regional storage will do better.

See also