cheat sheet

fsspec

Package-level reference for fsspec on PyPI — install, protocols, chained URIs, caching layers, and integration with pandas / dask.

updated 05-31-2026

What it is

fsspec (filesystem-spec) is the unified file-access abstraction underneath pandas, dask, xarray, intake, zarr, mlflow, and many other PyData-stack libraries. It provides one URI-driven API for reading and writing to local disk, HTTP(S), S3, GCS, Azure Blob, HDFS, FTP, SFTP, SSH, GitHub, Memory, ZIP, TAR — anything for which an implementation has been written or can be installed as a plugin.

Reach for fsspec when you want code that doesn't care where its data lives, when you need transparent caching of remote files, or when you're already using a library that consumes fsspec URIs (pandas read_parquet("s3://...") goes through fsspec). Direct use is also common for build/deploy tooling that needs to handle local and cloud paths uniformly.
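As a minimal sketch of that location independence, the hypothetical helper read_head below accepts any URL fsspec understands; only the URL differs between the local and in-memory cases. The file names and the memory:// path are illustrative, not part of any real project.

```python
import os
import tempfile

import fsspec


def read_head(url: str, n: int = 64) -> bytes:
    """Return the first n bytes of whatever `url` points at."""
    with fsspec.open(url, "rb") as f:
        return f.read(n)


# Same helper, two backends: a local temp file and the in-memory filesystem.
tmp_dir = tempfile.mkdtemp()
local_path = os.path.join(tmp_dir, "notes.txt")
with open(local_path, "wb") as f:
    f.write(b"hello from disk")

with fsspec.open("memory://demo-intro/notes.txt", "wb") as f:
    f.write(b"hello from memory")

local_head = read_head(local_path)
memory_head = read_head("memory://demo-intro/notes.txt")
```

Swapping the argument for an s3:// or https:// URL should need no change to read_head, provided the matching backend package is installed.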

Install

bash

pip install fsspec

Output: (none — exits 0 on success)

bash

uv add fsspec

Output: resolved + added to pyproject.toml

bash

poetry add fsspec

Output: updated lockfile + virtualenv install

Backends are separate packages — install the ones you need:

bash

pip install "fsspec[s3]"        # → s3fs
pip install "fsspec[gcs]"       # → gcsfs
pip install "fsspec[abfs]"      # → adlfs (Azure Blob / Data Lake)
pip install "fsspec[ssh]"       # → paramiko
pip install "fsspec[http]"      # → aiohttp
pip install "fsspec[github]"    # GitHub raw-file access
pip install "fsspec[full]"      # everything

Output: core fsspec plus the named backend(s).

Versioning & Python support

Current line is the 2024.x / 2025.x calendar-versioned series (e.g., 2025.5.1).
Supports Python 3.9+ on recent releases.
Release cadence is monthly; bumps are usually minor and additive but occasionally tighten the protocol interface.
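s3fs and gcsfs are released in lockstep with fsspec and pin the matching fsspec version, so keep them on the same calendar release (s3fs==2025.5.* with fsspec==2025.5.*) to avoid registry-API drift. adlfs follows its own release schedule; check the fsspec requirement it declares instead.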
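One way to catch version skew early is to compare installed versions at startup. The sketch below uses only the stdlib importlib.metadata; the helper names calendar_release and find_skew are hypothetical, and the default backend list is an assumption you would adjust to what you actually install.

```python
from importlib.metadata import PackageNotFoundError, version


def calendar_release(v: str) -> tuple:
    """Reduce '2025.5.1' to ('2025', '5'), the year.month part."""
    return tuple(v.split(".")[:2])


def find_skew(backends=("s3fs", "gcsfs")) -> dict:
    """Map each installed backend whose year.month differs from fsspec's to its version."""
    try:
        core = version("fsspec")
    except PackageNotFoundError:
        return {}  # fsspec itself is not installed; nothing to compare
    skewed = {}
    for name in backends:
        try:
            found = version(name)
        except PackageNotFoundError:
            continue  # backend not installed
        if calendar_release(found) != calendar_release(core):
            skewed[name] = found
    return skewed
```

An empty dict from find_skew() means every installed backend in the list shares fsspec's calendar release.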

Package metadata

Maintainer: Martin Durant + community (originally Anaconda)
Project home: github.com/fsspec/filesystem_spec
Docs: filesystem-spec.readthedocs.io
PyPI: pypi.org/project/fsspec
License: BSD-3-Clause
Governance: community; fsspec GitHub org houses sibling backend packages
First released: 2018
Downloads: hundreds of millions per month — transitive dep of pandas, dask, pyarrow, others

Optional dependencies & extras

The core fsspec package has no runtime deps. Backend extras pull in the corresponding implementation packages:

fsspec[s3] → s3fs (which depends on aiobotocore).
fsspec[gcs] → gcsfs (uses google-auth).
fsspec[abfs] → adlfs (Azure SDK).
fsspec[ssh]/fsspec[sftp] → paramiko.
FTP → stdlib ftplib; no extra is needed.
fsspec[http] → aiohttp (some older releases also listed requests).
fsspec[arrow] → pyarrow for Arrow-backed filesystems.
fsspec[full] → everything above.

Filesystems can also be registered by third-party packages via entry points — see huggingface_hub's hf://, dvc's dvc://, pyiceberg's table reads.

Alternatives

Package	Trade-off
`boto3` / `google-cloud-storage` / `azure-storage-blob`	Cloud-vendor SDKs; full feature surface, vendor-locked code.
`s3fs`	S3 only. It is the fsspec S3 backend, so `s3fs.S3FileSystem` gives the same API without the URL dispatch.
`cloudpathlib`	`pathlib.Path`-style API for cloud URIs. Less plumbing, fewer backends.
`smart_open`	Single-file open() abstraction; lighter and older; doesn't do listing or globbing.
`pyfilesystem2`	Older filesystem-abstraction project; less PyData adoption.
`pathlib` (stdlib)	Local paths only.

Common gotchas

Backend installs are mandatory. fsspec.open("s3://bucket/key") raises ImportError: Install s3fs to access S3 if you didn't pip install "fsspec[s3]".
Version skew between fsspec and backends causes "AsyncFileSystem has no attribute …". Pin both to the same calendar release.
fsspec.open() defaults to binary mode ("rb"). Use mode="rt" for text and pass encoding= explicitly.
Chained protocols (simplecache::s3://...) only work in supported orderings. Cache layer goes on the OUTSIDE (simplecache::s3://), not inside.
AbstractFileSystem.ls() is cached per filesystem instance. If the remote changes during the run, invalidate_cache() or pass refresh=True.
S3 anonymous reads need anon=True. Otherwise s3fs tries to use ambient AWS creds and fails on missing-credentials, not on permission.
open_files() returns a list of OpenFile objects that aren't open yet — they open on __enter__ of each, not on creation. This is intentional for delayed I/O in dask.
Async vs sync. AsyncFileSystem subclasses (S3FileSystem, GCSFileSystem) expose each operation twice: a sync method (cat, ls) and an underscore-prefixed coroutine (_cat, _ls). In async code, await the underscore-prefixed versions.
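The binary-default and lazy-open gotchas above can be seen without any network access. This is a small sketch against the in-memory backend; the memory://demo-gotchas path is illustrative.

```python
import fsspec

# Seed an in-memory file so the example needs no network or credentials.
with fsspec.open("memory://demo-gotchas/greeting.txt", "wb") as f:
    f.write("héllo\n".encode("utf-8"))

# fsspec.open() returns an OpenFile; nothing is opened until the `with` block.
lazy = fsspec.open("memory://demo-gotchas/greeting.txt")  # default mode "rb"
with lazy as f:
    raw = f.read()  # bytes

# Text mode needs an explicit "rt" and, ideally, an explicit encoding.
with fsspec.open("memory://demo-gotchas/greeting.txt", "rt", encoding="utf-8") as f:
    text = f.read()  # str

# open_files() returns a list of OpenFile objects that are still unopened.
pending = fsspec.open_files("memory://demo-gotchas/*.txt", "rb")
```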

Real-world recipes

The recipes below show local + remote paths through one API, chained caches, and the directory-listing surface that's common for build tooling.

Recipe 1 — Open a file via the abstraction (local OR remote, same code).

python

import fsspec

# Local path
with fsspec.open("file:///tmp/notes.txt", "rt") as f:
    print(f.read()[:200])

# S3 path — same API
with fsspec.open("s3://my-bucket/notes.txt", "rt", anon=False) as f:
    print(f.read()[:200])

Output: first 200 characters of each file — identical Python-level API, different transports.

Recipe 2 — Read directly from S3 via the protocol prefix.

python

import fsspec, pandas as pd

# Pandas knows fsspec — pass an S3 URI directly.
df = pd.read_parquet("s3://my-bucket/path/data.parquet", storage_options={"anon": False})
print(df.shape)

Output: (rows, cols) — pandas resolved the URI through fsspec without explicit setup.

Recipe 3 — HTTP read.

python

import fsspec
with fsspec.open("https://example.com/large.csv", "rb") as f:
    head = f.read(512)
print(head[:80])

Output: first 80 bytes of the remote CSV — the HTTP backend supports partial reads via Range headers.

Recipe 4 — Chained protocols: cache an S3 file to local disk on first read.

python

import fsspec

# `simplecache` wraps any backend; reads are pinned to disk on first access.
with fsspec.open("simplecache::s3://my-bucket/large.parquet",
                 s3={"anon": False},
                 simplecache={"cache_storage": "/var/cache/fsspec"}) as f:
    data = f.read()

Output: cached copy at /var/cache/fsspec/<hash> after the first read; subsequent reads are local-disk speed. Other cache layers: filecache (whole files, with metadata and expiry), blockcache (caches only the blocks actually read, for random access).
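To try the chaining syntax offline, the same pattern can be pointed at the in-memory backend as a stand-in for S3. This is a sketch under that assumption; cached_read is a hypothetical helper and the cache directory is a throwaway temp dir.

```python
import os
import tempfile

import fsspec

# Stand-in for the remote object: an in-memory file instead of s3://.
with fsspec.open("memory://demo-cache/large.bin", "wb") as f:
    f.write(b"x" * 1024)

cache_dir = tempfile.mkdtemp()  # hypothetical cache location


def cached_read(url: str) -> bytes:
    """Read `url` through a simplecache layer that stores copies in cache_dir."""
    with fsspec.open(f"simplecache::{url}",
                     simplecache={"cache_storage": cache_dir}) as f:
        return f.read()


first = cached_read("memory://demo-cache/large.bin")   # copies the file into cache_dir
second = cached_read("memory://demo-cache/large.bin")  # served from the local copy
cached_files = os.listdir(cache_dir)
```

Each layer of the chain takes its options from a keyword argument named after its protocol, which is why the cache settings sit under simplecache=.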

Recipe 5 — Directory listing across backends.

python

import fsspec
fs = fsspec.filesystem("s3", anon=False)
for path in fs.ls("my-bucket/path/", detail=False):
    print(path)

Output: one S3 key per line. detail=True returns dicts with size, LastModified, etc.

Recipe 6 — Globbing remote paths.

python

import fsspec
fs = fsspec.filesystem("s3", anon=False)
for p in fs.glob("my-bucket/year=2026/**/*.parquet"):
    print(p)

Output: every parquet file under any subdir of year=2026/. Globbing on remote backends incurs LIST calls; cache aggressively.
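The listing and globbing calls in Recipes 5 and 6 behave the same way on the in-memory backend, which makes them easy to experiment with. A minimal sketch, with an illustrative directory tree:

```python
import fsspec

fs = fsspec.filesystem("memory")

# Build a small partitioned tree (paths are illustrative).
for key in ("/demo-list/year=2026/m01/a.parquet",
            "/demo-list/year=2026/m02/b.parquet",
            "/demo-list/year=2026/m02/notes.csv"):
    fs.pipe(key, b"x")

# ls() lists one level; detail=True returns one info dict per entry.
months = fs.ls("/demo-list/year=2026", detail=False)
infos = fs.ls("/demo-list/year=2026/m02", detail=True)

# glob() walks the tree; '**' crosses directory levels.
parquet = sorted(fs.glob("/demo-list/year=2026/**/*.parquet"))
```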

Production deployment notes

Pin fsspec and backends to the same calendar release. API drift between e.g. fsspec==2025.5.0 and s3fs==2024.9.0 causes obscure attribute errors.
Configure credentials via env or instance metadata, not in code. S3: AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY or an IAM role. GCS: GOOGLE_APPLICATION_CREDENTIALS. Azure: AZURE_STORAGE_CONNECTION_STRING or DefaultAzureCredential.
Use simplecache or filecache for read-heavy workloads. Pinning hot files to local SSD cuts S3 GET costs and tail latency.
Be mindful of LIST costs. S3 LIST is per-1000-keys; for high-cardinality prefixes use partitioned paths and avoid recursive glob in tight loops.
Set the read block size to a sensible value (5-50 MB): default_block_size on S3FileSystem, block_size on HTTPFileSystem. The defaults suit small files but are suboptimal for large columnar reads.
Health-check the backend — read a small known file at startup so credentials and network reachability fail loudly at boot, not mid-request.
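The startup health check can be as small as reading one byte from a known object. The sketch below assumes a hypothetical helper name, check_storage, and uses the in-memory backend as a placeholder for the real probe URL.

```python
import fsspec


def check_storage(probe_url: str, **storage_options) -> None:
    """Read one byte from a small known object; raise if storage is unreachable."""
    try:
        with fsspec.open(probe_url, "rb", **storage_options) as f:
            f.read(1)
    except Exception as exc:  # credentials, network, or a missing probe object
        raise RuntimeError(f"storage health check failed for {probe_url!r}") from exc


# Usage sketch against the in-memory backend (swap in the real URL at deploy time).
with fsspec.open("memory://demo-health/probe", "wb") as f:
    f.write(b"ok")
check_storage("memory://demo-health/probe")
```

Calling this once during application startup turns a misconfigured credential into a boot failure instead of a mid-request error.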

Performance tuning

open_files() + dask processes many remote files in parallel without you writing a thread pool.
blockcache:: for random-access reads over HTTP — parquet/zarr workloads benefit massively because the cache layer turns range-reads into block-aligned reads.
anon=True for public buckets — skips the credential lookup roundtrip on AWS.
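Large reads: raise the filesystem's block size (default_block_size on S3FileSystem, block_size on HTTPFileSystem) or pass block_size= to fs.open(); fsspec's base default of 5 MB is small for multi-GB files.
List once, then iterate. One fs.ls(...) followed by set lookups is far cheaper than calling fs.exists() per file.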
Async loops: in an asyncio app, use the _* methods directly (await fs._cat(...)) to avoid the sync-bridge overhead.
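The "list once, then iterate" advice can be sketched as a single listing followed by set lookups. The file names here are illustrative and the in-memory backend stands in for a remote one.

```python
import fsspec

fs = fsspec.filesystem("memory")
for name in ("a.parquet", "b.parquet"):
    fs.pipe(f"/demo-perf/{name}", b"x")

wanted = ["a.parquet", "b.parquet", "c.parquet"]

# One listing, then in-memory set lookups (instead of one exists() call per file).
present = {path.rsplit("/", 1)[-1] for path in fs.ls("/demo-perf", detail=False)}
missing = [name for name in wanted if name not in present]
```

On a remote backend this replaces one request per wanted file with a single LIST of the prefix.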

Version migration guide

0.x (before the 2021 switch to calendar versioning): less plugin coverage; upgrade.
2023.x → 2024.x: AsyncFileSystem interfaces tightened; subclasses must implement the underscore-prefixed coroutine methods (_cat_file, _ls, and so on).
2024.x → 2025.x: storage_options validation strict-mode; unknown kwargs now error.
Backend-specific gotchas:
- s3fs >= 2024.x requires aiobotocore >= 2.7 which requires a compatible botocore. Pin the trio.
- gcsfs switched to google-auth modern token refresh in 2024.
- adlfs requires azure-storage-blob >= 12 and azure-identity.

python

import pandas as pd

# Old (< 2023): storage_options passed loosely
df = pd.read_parquet("s3://b/k", storage_options={"key": "...", "secret": "...", "anonymous": False})

# Current: anonymous → anon; strict kwargs
df = pd.read_parquet("s3://b/k", storage_options={"anon": False})

Output: same dataframe; cleaner kwargs surface.

Security considerations

Credentials in URIs leak in logs. Avoid s3://AKIA…:secret@bucket/key-style URIs; use env vars or a credentials provider.
anon=True only works for publicly readable data. Against a private bucket it does not fall back to your credentials; requests fail with a permission error.
Path traversal in custom backends. If you write a backend, sanitize ../ from user-supplied keys before stitching to your storage root.
HTTP backend follows redirects. Disable explicitly for security-sensitive reads or set an allowlist.
TLS verification on https:// is on by default; do NOT disable certificate verification in production.
Caching to shared disk can leak data across tenants. Use per-tenant cache paths.
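For the path-traversal point, one possible guard is to normalize the joined key and reject anything that lands outside the storage root. This is a sketch using only posixpath; safe_key and the bucket/tenant names are hypothetical, and a real backend may need stricter rules.

```python
import posixpath


def safe_key(root: str, user_key: str) -> str:
    """Join user_key under root, rejecting anything that escapes root."""
    root = posixpath.normpath(root)
    candidate = posixpath.normpath(posixpath.join(root, user_key.lstrip("/")))
    prefix = root.rstrip("/") + "/"
    if candidate != root and not candidate.startswith(prefix):
        raise ValueError(f"key escapes storage root: {user_key!r}")
    return candidate


ok = safe_key("bucket/tenant-a", "reports/q1.csv")
```

The same helper can build per-tenant cache paths, since each tenant's keys are confined to its own root.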

Testing & CI integration

Use memory:// for in-process testing — fully in-RAM filesystem.
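Use moto in server mode (for example ThreadedMotoServer from moto[server]) and point s3fs at it with endpoint_url for S3 integration tests without a real bucket. moto's in-process decorators patch botocore and do not reliably intercept the aiobotocore client that s3fs uses.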
Pin a known calendar release across all dev / CI / prod environments. Backend version drift is the #1 cause of test flake.

python

# tests/test_io.py
import fsspec, pytest

def test_memory_roundtrip():
    fs = fsspec.filesystem("memory")
    with fs.open("/tmp/notes.txt", "wb") as f:
        f.write(b"hello")
    with fs.open("/tmp/notes.txt", "rb") as f:
        assert f.read() == b"hello"

Output: test passes; in-process roundtrip works against the memory backend.
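One thing to keep in mind with memory:// is that its contents are shared by every instance in the process, so files can leak between tests. A possible mitigation, sketched here with a hypothetical fresh_root helper, is to give each test its own directory and remove it afterwards.

```python
import uuid

import fsspec


def fresh_root() -> str:
    """Hypothetical helper: a unique directory per test inside memory://."""
    return f"/test-{uuid.uuid4().hex}"


fs_a = fsspec.filesystem("memory")
fs_b = fsspec.filesystem("memory")

root = fresh_root()
fs_a.pipe(f"{root}/notes.txt", b"hello")
seen_by_b = fs_b.cat_file(f"{root}/notes.txt")  # same process-wide store

fs_a.rm(root, recursive=True)  # clean up only this test's files
still_there = fs_b.exists(f"{root}/notes.txt")
```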

Ecosystem integrations

pandas read_* / to_* — pandas passes URIs and storage_options straight to fsspec.
dask — built on fsspec; dask.dataframe.read_parquet handles globs and remote paths.
pyarrow.dataset / pyarrow.fs — has its own filesystem layer but consumes fsspec via PyFileSystem(FSSpecHandler(fs)).
zarr — uses fsspec as its primary I/O layer.
mlflow — artifact stores use fsspec for s3:// and gs:// URIs.
huggingface_hub — registers hf:// URIs for model and dataset access.
intake — data catalog system layered on fsspec.
xarray — zarr-backed remote datasets via fsspec.

Compatibility matrix

Python	`fsspec` line	Notes
3.8	`2024.5` and earlier	Dropped.
3.9	`2024.x`+	Current floor for recent releases.
3.10	`2024.x`+	Supported.
3.11	`2024.x`+	Supported.
3.12	`2024.6`+	Supported.
3.13	`2024.10`+	Supported, free-threaded build untested.

Troubleshooting common errors

Error / Symptom	Likely cause	Fix
`ImportError: Install s3fs to access S3`	Missing backend	`pip install "fsspec[s3]"`.
`AttributeError: 'S3FileSystem' object has no attribute '…'`	`fsspec`/`s3fs` version skew	Pin both to the same calendar release.
`PermissionError: 403` from S3	Anon mode against private bucket, or missing IAM	Pass `anon=False` AND configure AWS credentials.
`FileNotFoundError` for an existing file	`ls()` cache stale	`fs.invalidate_cache()` or pass `refresh=True`.
`aiohttp.ClientPayloadError` on `https://` reads	Server closed connection mid-stream	Retry; raise `block_size` on the HTTP filesystem so fewer requests are needed.
Listing is very slow	Recursive `glob` over a deep tree	Use partitioned prefixes; pre-list and cache.
`RuntimeError: This event loop is already running`	Sync fsspec call inside an async event loop	Create the filesystem with `asynchronous=True` and await the underscore-prefixed methods (`await fs._cat(...)`).
`simplecache` returns stale data	Cache never invalidated	`simplecache` has no expiry; switch to `filecache` with `expiry_time=` or `check_files=True`, or wipe `cache_storage`.

When NOT to use this

You only ever read local files. pathlib is simpler and pulls no deps.
You need vendor-specific features. Multipart-upload tuning on S3, customer-managed encryption keys on GCS, etc. — use the vendor SDK directly.
You're writing a one-off ETL script where the cloud bucket and the code live in the same project; the vendor SDK (boto3, google-cloud-storage) used directly may be clearer.
Strict typing matters. fsspec's API is dynamically typed; pathlib and the vendor SDKs have better stubs.

Worked example: cloud-and-local data loader with caching

A common pattern: a load(path) function that works for local/path.parquet, s3://bucket/key.parquet, and https://example.com/data.parquet uniformly, with on-disk caching of remote files.

Step 1 — single entry point that abstracts the source.

python

import fsspec, pandas as pd
from pathlib import Path

CACHE_DIR = Path("/var/cache/myapp")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

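def load_parquet(uri: str, **kw) -> pd.DataFrame:
    # Wrap remote URIs in simplecache; leave local paths alone.
    if "://" in uri and not uri.startswith("file://"):
        protocol = uri.split("://", 1)[0]
        cache_opts = kw.pop("simplecache", {"cache_storage": str(CACHE_DIR)})
        # In a chained URL each layer reads its options from a kwarg named
        # after its protocol, so nest the backend options under that name.
        kw = {protocol: kw, "simplecache": cache_opts}
        uri = f"simplecache::{uri}"
    with fsspec.open(uri, "rb", **kw) as f:
        return pd.read_parquet(f)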

Output: the function takes any path; first remote read pins a copy to disk, subsequent reads hit the cache.

Step 2 — write a companion save(path, df).

python

def save_parquet(uri: str, df: pd.DataFrame, **kw) -> None:
    with fsspec.open(uri, "wb", **kw) as f:
        df.to_parquet(f)

Output: symmetric API — write to local, S3, GCS via the same call.

Step 3 — bake credentials into per-environment config.

yaml

# config/prod.yaml
storage_options:
  anon: false
  client_kwargs:
    region_name: us-east-1
  config_kwargs:
    retries:
      max_attempts: 5
      mode: adaptive

Output: load this once at startup and pass **storage_options to load_parquet.

Step 4 — invalidate cache when source changes.

python

import shutil
def clear_cache():
    shutil.rmtree(CACHE_DIR, ignore_errors=True)
    CACHE_DIR.mkdir(parents=True, exist_ok=True)

Output: simple, deliberate. For production prefer filecache with expiry_time= instead of manual wipes.
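The filecache alternative mentioned above might look like the sketch below. It uses the in-memory backend as a stand-in for the remote source, a throwaway temp directory as the cache location, and an assumed one-hour expiry; all three are illustrative choices.

```python
import tempfile

import fsspec

# Stand-in for the remote source.
with fsspec.open("memory://demo-expiry/data.bin", "wb") as f:
    f.write(b"v1")

cache_dir = tempfile.mkdtemp()  # hypothetical cache location

# filecache keeps metadata next to the cached copy; expiry_time is in seconds.
with fsspec.open(
    "filecache::memory://demo-expiry/data.bin",
    filecache={"cache_storage": cache_dir, "expiry_time": 3600},
) as f:
    data = f.read()
```

With an expiry set, stale copies age out on their own, so the manual clear_cache() wipe becomes a fallback rather than the main mechanism.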

FAQ

Q: How do I list every parquet file under an S3 prefix? A: fs.glob("bucket/prefix/**/*.parquet") — be aware of LIST costs on large prefixes. For very large keyspaces, use S3 Inventory and read the manifest.

Q: Can fsspec read from a private GCS bucket without ambient creds? A: Yes — pass storage_options={"token": "/path/to/service-account.json"} (file path), or pass a dict / google.auth.credentials.Credentials object directly.

Q: How do I configure S3 to use a custom endpoint (MinIO, R2)? A: storage_options={"endpoint_url": "https://minio.example.com:9000", "anon": False, "key": "...", "secret": "..."} — works for any S3-compatible service.

Q: Does fsspec support seekable HTTP reads (Range)? A: Yes, via HTTPFileSystem. The HTTP backend issues range requests on f.seek + f.read; servers that don't honor Range: bytes=... fall back to streaming.

Q: How do I monitor S3 LIST/GET counts? A: Enable AWS CloudTrail data events on the bucket, or wrap the filesystem in your own wrapper that increments counters in _cat/_ls/_open. There's no built-in metric layer.

Q: Can multiple processes share the same simplecache directory? A: Yes. simplecache names each cached file after a hash of its source path, so concurrent processes fetching the same file write the same target. For high concurrency, use a fast local SSD.

Async usage patterns

Inside an asyncio application, prefer the _* methods on AsyncFileSystem subclasses (S3FileSystem, GCSFileSystem, HTTPFileSystem). The sync wrappers run the coroutine on fsspec's dedicated I/O thread and block the calling thread until it finishes, which stalls your event loop if you call them from a coroutine.

python

import asyncio, fsspec

async def fanout(keys):
    fs = fsspec.filesystem("s3", anon=False, asynchronous=True)
    bodies = await asyncio.gather(*(fs._cat(k) for k in keys))
    return bodies

asyncio.run(fanout(["bucket/a", "bucket/b", "bucket/c"]))

Output: three concurrent S3 GETs; no blocking the event loop. The asynchronous=True flag tells fsspec to skip its sync-bridge layer.

For mixed sync/async code, the safer pattern is one filesystem per side — instantiate fsspec.filesystem("s3") in sync code and fsspec.filesystem("s3", asynchronous=True) in async code, rather than sharing one instance across both worlds.
