cheat sheet
fsspec
Package-level reference for fsspec on PyPI — install, protocols, chained URIs, caching layers, and integration with pandas / dask.
What it is
fsspec (filesystem-spec) is the unified file-access abstraction underneath pandas, dask, xarray, intake, zarr, mlflow, and many other PyData-stack libraries. It provides one URI-driven API for reading and writing to local disk, HTTP(S), S3, GCS, Azure Blob, HDFS, FTP, SFTP, SSH, GitHub, Memory, ZIP, TAR — anything for which an implementation has been written or can be installed as a plugin.
Reach for fsspec when you want code that doesn't care where its data lives, when you need transparent caching of remote files, or when you're already using a library that consumes fsspec URIs (pandas read_parquet("s3://...") goes through fsspec). Direct use is also common for build/deploy tooling that needs to handle local and cloud paths uniformly.
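As a minimal sketch of "code that doesn't care where its data lives", the helper below counts lines in a text file given only a URL or path. The function name count_lines and the memory://whatitis/ path are made up for illustration; the in-memory backend stands in for a remote store so the snippet runs without credentials.

```python
import os
import tempfile

import fsspec


def count_lines(url: str) -> int:
    """Count lines in a text file, wherever the URL points."""
    with fsspec.open(url, "rt", encoding="utf-8") as f:
        return sum(1 for _ in f)


# In-memory file (the memory:// backend ships with core fsspec).
with fsspec.open("memory://whatitis/notes.txt", "wt", encoding="utf-8") as f:
    f.write("alpha\nbeta\ngamma\n")

# Local file, addressed by a plain path.
tmp_dir = tempfile.mkdtemp()
local_path = os.path.join(tmp_dir, "notes.txt")
with open(local_path, "w", encoding="utf-8") as f:
    f.write("one\ntwo\n")

memory_count = count_lines("memory://whatitis/notes.txt")
local_count = count_lines(local_path)
print(memory_count, local_count)
```

Swapping the URL for an s3:// or https:// one should leave count_lines unchanged, provided the matching backend is installed.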
Install
pip install fsspec
Output: (none — exits 0 on success)
uv add fsspec
Output: resolved + added to pyproject.toml
poetry add fsspec
Output: updated lockfile + virtualenv install
Backends are separate packages — install the ones you need:
pip install "fsspec[s3]" # → s3fs
pip install "fsspec[gcs]" # → gcsfs
pip install "fsspec[abfs]" # → adlfs (Azure Blob / Data Lake)
pip install "fsspec[ssh]" # → paramiko
pip install "fsspec[http]" # → aiohttp
pip install "fsspec[github]" # GitHub raw-file access
pip install "fsspec[full]" # everything
Output: core fsspec plus the named backend(s).
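To check which backends are actually importable in an environment, one option is to ask the registry for the implementation class and treat a failure as "not installed". This is a sketch; backend_available is a hypothetical helper name, not part of fsspec.

```python
import fsspec


def backend_available(protocol: str) -> bool:
    """Return True if the implementation for `protocol` can be imported."""
    try:
        fsspec.get_filesystem_class(protocol)
    except (ImportError, ValueError):
        # ImportError: known protocol, but the backend package is missing.
        # ValueError: protocol unknown to this fsspec install.
        return False
    return True


for protocol in ("memory", "file", "s3", "gcs"):
    print(protocol, backend_available(protocol))
```

The memory and file protocols ship with core fsspec; the result for s3 and gcs depends on which extras were installed.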
Versioning & Python support
- Current line is the 2024.x/2025.x calendar-versioned series (e.g., 2025.5.1).
- Supports Python 3.9+ on recent releases.
- Release cadence is monthly; bumps are usually minor and additive but occasionally tighten the protocol interface.
- s3fs and gcsfs are versioned in lockstep with fsspec; keep them on the same minor (s3fs==2025.5.* with fsspec==2025.5.*) to avoid registry-API drift. adlfs also uses calendar versions but releases on its own schedule, so check its declared fsspec requirement instead.
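The lockstep rule can be checked mechanically. The sketch below compares the year.month part of two calendar versions and, optionally, inspects installed packages through importlib.metadata. The helper names are hypothetical, and the parsing assumes plain YYYY.M.patch version strings.

```python
from importlib import metadata


def calendar_release(version: str) -> tuple:
    """Return the (year, month) part of a calendar version like '2025.5.1'."""
    year, month, *_ = version.split(".")
    return int(year), int(month)


def in_lockstep(core_version: str, backend_version: str) -> bool:
    """True when both versions share the same year.month release."""
    return calendar_release(core_version) == calendar_release(backend_version)


def installed_skew(backends=("s3fs", "gcsfs")) -> dict:
    """Map each installed backend that is out of step to its version."""
    try:
        core = metadata.version("fsspec")
    except metadata.PackageNotFoundError:
        return {}
    skewed = {}
    for name in backends:
        try:
            found = metadata.version(name)
        except metadata.PackageNotFoundError:
            continue  # backend not installed; nothing to compare
        if not in_lockstep(core, found):
            skewed[name] = found
    return skewed


print(in_lockstep("2025.5.1", "2025.5.0"))
print(in_lockstep("2025.5.0", "2024.9.0"))
```

Calling installed_skew() in a CI step and failing on a non-empty result is one way to catch drift before it shows up as an attribute error.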
Package metadata
- Maintainer: Martin Durant + community (originally Anaconda)
- Project home: github.com/fsspec/filesystem_spec
- Docs: filesystem-spec.readthedocs.io
- PyPI: pypi.org/project/fsspec
- License: BSD-3-Clause
- Governance: community; the fsspec GitHub org houses sibling backend packages
- First released: 2018
- Downloads: hundreds of millions per month, mostly as a dependency of dask, huggingface_hub, and similar libraries (pandas uses it as an optional dependency)
Optional dependencies & extras
The core fsspec package has no runtime deps. Backend extras pull in the corresponding implementation packages:
- fsspec[s3] → s3fs (which depends on aiobotocore).
- fsspec[gcs] → gcsfs (uses google-auth).
- fsspec[abfs] → adlfs (Azure SDK).
- fsspec[ssh] / fsspec[sftp] → paramiko.
- FTP → stdlib ftplib (no extra needed).
- fsspec[http] → aiohttp (older releases also listed requests).
- fsspec[arrow] → pyarrow for Arrow-backed filesystems.
- fsspec[full] → everything above.
Filesystems can also be registered by third-party packages via entry points — see huggingface_hub's hf://, dvc's dvc://, pyiceberg's table reads.
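Entry points do the registration at install time; the same registry can also be extended at runtime with fsspec.register_implementation. The sketch below registers the built-in in-memory filesystem under a second name. The protocol name "scratch" is invented for this illustration, and a real plugin would register its own AbstractFileSystem subclass.

```python
import fsspec
from fsspec.implementations.memory import MemoryFileSystem

# "scratch" is a made-up protocol name used only for this illustration.
# clobber=True keeps the call safe if the snippet is run more than once.
fsspec.register_implementation("scratch", MemoryFileSystem, clobber=True)

fs = fsspec.filesystem("scratch")
fs.pipe("/registry-demo/hello.txt", b"hi")

registered_name = type(fs).__name__
payload = fs.cat("/registry-demo/hello.txt")
print(registered_name, payload)
```

A packaged backend would normally declare the mapping in its own metadata rather than call register_implementation on import.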
Alternatives
| Package | Trade-off |
|---|---|
| boto3 / google-cloud-storage / azure-storage-blob | Cloud-vendor SDKs; full feature surface, vendor-locked code. |
| s3fs | Just S3; itself an fsspec implementation, so it can be used directly when S3 is the only target. |
| cloudpathlib | pathlib.Path-style API for cloud URIs. Less plumbing, fewer backends. |
| smart_open | Single-file open() abstraction; lighter and older; doesn't do listing or globbing. |
| pyfilesystem2 | Older filesystem-abstraction project; less PyData adoption. |
| pathlib (stdlib) | Local paths only. |
Common gotchas
- Backend installs are mandatory. fsspec.open("s3://bucket/key") raises ImportError: s3fs not installed if you didn't pip install fsspec[s3].
- Version skew between fsspec and backends causes "AsyncFileSystem has no attribute …". Pin both to the same calendar release.
- fsspec.open() returns a file-like in BINARY mode by default. Use mode="rt" for text and pass encoding= explicitly.
- Chained protocols (simplecache::s3://...) only work in supported orderings. The cache layer goes on the OUTSIDE (simplecache::s3://), not inside.
- AbstractFileSystem.ls() is cached per filesystem instance. If the remote changes during the run, call fs.invalidate_cache() or pass refresh=True to ls().
- S3 anonymous reads need anon=True. Otherwise s3fs tries to use ambient AWS creds and fails on missing credentials, not on permission.
- open_files() returns a list of OpenFile objects that aren't open yet; each one opens on its own __enter__, not on creation. This is intentional for delayed I/O in dask.
- Async vs sync. AsyncFileSystem subclasses (S3FileSystem, GCSFileSystem) expose both sync methods and underscore-prefixed coroutine versions; in async code call _cat, _ls, etc.
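Two of the gotchas above (lazy OpenFile handles and the binary default) can be seen with the in-memory backend. This is a small sketch; the memory://gotchas/ paths are invented for the demo.

```python
import fsspec
from fsspec.core import OpenFile

fs = fsspec.filesystem("memory")
fs.pipe("/gotchas/a.txt", "alpha".encode("utf-8"))
fs.pipe("/gotchas/b.txt", "caf\u00e9".encode("utf-8"))

# open_files() only builds OpenFile handles; nothing is opened yet.
handles = fsspec.open_files("memory://gotchas/*.txt", mode="rt", encoding="utf-8")
all_lazy = all(isinstance(h, OpenFile) for h in handles)

texts = []
for handle in handles:
    with handle as f:  # the underlying file opens here, on __enter__
        texts.append(f.read())

# Without an explicit mode, open() defaults to binary and yields bytes.
with fsspec.open("memory://gotchas/a.txt") as f:
    raw = f.read()

print(all_lazy, texts, raw)
```

Passing encoding= explicitly keeps text decoding independent of the platform default.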
Real-world recipes
The recipes below show local + remote paths through one API, chained caches, and the directory-listing surface that's common for build tooling.
Recipe 1 — Open a file via the abstraction (local OR remote, same code).
import fsspec

# Local path
with fsspec.open("file:///tmp/notes.txt", "rt") as f:
    print(f.read()[:200])

# S3 path — same API
with fsspec.open("s3://my-bucket/notes.txt", "rt", anon=False) as f:
    print(f.read()[:200])
Output: first 200 characters of each file — identical Python-level API, different transports.
Recipe 2 — Read directly from S3 via the protocol prefix.
import fsspec, pandas as pd
# Pandas knows fsspec — pass an S3 URI directly.
df = pd.read_parquet("s3://my-bucket/path/data.parquet", storage_options={"anon": False})
print(df.shape)
Output: (rows, cols) — pandas resolved the URI through fsspec without explicit setup.
Recipe 3 — HTTP read.
import fsspec
with fsspec.open("https://example.com/large.csv", "rb") as f:
    head = f.read(512)
print(head[:80])
Output: first 80 bytes of the remote CSV — the HTTP backend supports partial reads via Range headers.
Recipe 4 — Chained protocols: cache an S3 file to local disk on first read.
import fsspec
# `simplecache` wraps any backend; reads are pinned to disk on first access.
with fsspec.open("simplecache::s3://my-bucket/large.parquet",
                 s3={"anon": False},
                 simplecache={"cache_storage": "/var/cache/fsspec"}) as f:
    data = f.read()
Output: cached copy at /var/cache/fsspec/<hash> after the first read; subsequent reads are local-disk speed. Other cache layers: filecache (per-file), blockcache (block-level random-access cache).
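The same chaining can be tried without a bucket by letting the in-memory backend stand in for the remote store. This is a sketch under that assumption; the memory://cache-demo/ path and the temporary cache directory are illustrative only.

```python
import os
import tempfile

import fsspec

# Stand-in for a remote object: the in-memory backend.
fsspec.filesystem("memory").pipe("/cache-demo/large.bin", b"x" * 1024)

cache_dir = tempfile.mkdtemp()
url = "simplecache::memory://cache-demo/large.bin"
options = {"simplecache": {"cache_storage": cache_dir}}

with fsspec.open(url, "rb", **options) as f:
    first = f.read()  # copies the file into cache_dir, then reads it

with fsspec.open(url, "rb", **options) as f:
    second = f.read()  # served from the local copy

cached_files = os.listdir(cache_dir)
print(len(first), len(second), cached_files)
```

The cached file name is derived from the remote path, so the directory listing shows a hash rather than large.bin.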
Recipe 5 — Directory listing across backends.
import fsspec
fs = fsspec.filesystem("s3", anon=False)
for path in fs.ls("my-bucket/path/", detail=False):
    print(path)
Output: one S3 key per line. detail=True returns dicts with size, LastModified, etc.
Recipe 6 — Globbing remote paths.
import fsspec
fs = fsspec.filesystem("s3", anon=False)
for p in fs.glob("my-bucket/year=2026/**/*.parquet"):
    print(p)
Output: every parquet file under any subdir of year=2026/. Globbing on remote backends incurs LIST calls; cache aggressively.
Production deployment notes
- Pin fsspec and backends to the same calendar release. API drift between e.g. fsspec==2025.5.0 and s3fs==2024.9.0 causes obscure attribute errors.
- Configure credentials via env or instance metadata, not in code. S3: AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY or an IAM role. GCS: GOOGLE_APPLICATION_CREDENTIALS. Azure: AZURE_STORAGE_CONNECTION_STRING or DefaultAzureCredential.
- Use simplecache or filecache for read-heavy workloads. Pinning hot files to local SSD cuts S3 GET costs and tail latency.
- Be mindful of LIST costs. Each S3 LIST request returns at most 1,000 keys, so large prefixes cost many requests; for high-cardinality prefixes use partitioned paths and avoid recursive glob in tight loops.
- Set default_block_size on S3FileSystem (block_size on HTTPFileSystem) to a sensible value (5-50 MB). The default is OK for small files but suboptimal for large columnar reads.
- Health-check the backend: read a small known file at startup so credential and network problems fail loudly at boot, not mid-request.
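The startup health check from the last bullet could look roughly like this. check_storage is a hypothetical helper, and the memory:// URLs stand in for a small known object in the real bucket.

```python
import fsspec


def check_storage(url: str, **storage_options) -> None:
    """Fail fast if `url` cannot be opened and read."""
    try:
        with fsspec.open(url, "rb", **storage_options) as f:
            f.read(1)  # one byte is enough to exercise auth and reachability
    except Exception as exc:
        raise RuntimeError(f"storage health check failed for {url!r}") from exc


# Demo against the in-memory backend; in a service this would point at a
# small known object such as "s3://my-bucket/healthcheck.txt" (hypothetical).
fsspec.filesystem("memory").pipe("/health/ok.txt", b"ok")
check_storage("memory://health/ok.txt")

try:
    check_storage("memory://health/missing.txt")
    startup_ok = True
except RuntimeError:
    startup_ok = False
print(startup_ok)
```

Chaining the original exception with "from exc" keeps the backend's own error message visible in the traceback.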
Performance tuning
- open_files() + dask processes many remote files in parallel without you writing a thread pool.
- blockcache:: for random-access reads over HTTP: parquet/zarr workloads benefit massively because the cache layer turns range-reads into block-aligned reads.
- anon=True for public buckets: skips the credential lookup roundtrip on AWS.
- Large reads: raise default_block_size, or pass block_size= to open(); a 5 MB block is small for multi-GB files.
- List once, then iterate. A single fs.ls(...) followed by in-memory filtering is far cheaper than calling fs.exists() per file.
- Async loops: in an asyncio app, use the underscore-prefixed coroutine methods directly (await fs._cat(...)) to avoid the sync-bridge overhead.
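A sketch of the "list once, then iterate" advice, using the in-memory backend so it runs anywhere. The /perf-demo/ prefix and file names are invented; on S3 the single ls() replaces one existence check per file.

```python
import fsspec

fs = fsspec.filesystem("memory")
for name in ("a", "b", "c"):
    fs.pipe(f"/perf-demo/{name}.parquet", b"")

wanted = ["a.parquet", "c.parquet", "z.parquet"]

# One listing call, then cheap in-memory membership tests on base names.
present = {path.rsplit("/", 1)[-1] for path in fs.ls("/perf-demo", detail=False)}
found = [name for name in wanted if name in present]
missing = [name for name in wanted if name not in present]
print(found, missing)
```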
Version migration guide
- < 2022.x: old releases, including the pre-calendar-version 0.x line; less plugin coverage; upgrade.
- 2023.x → 2024.x: AsyncFileSystem interfaces tightened; subclasses must implement the underscore-prefixed coroutine methods.
- 2024.x → 2025.x: storage_options validation strict-mode; unknown kwargs now error.
- Backend-specific gotchas: s3fs >= 2024.x requires aiobotocore >= 2.7, which requires a compatible botocore; pin the trio. gcsfs switched to google-auth modern token refresh in 2024. adlfs requires azure-storage-blob >= 12 and azure-identity.
# Old (< 2023): storage_options passed loosely
df = pd.read_parquet("s3://b/k", storage_options={"key": "...", "secret": "...", "anonymous": False})
# Current: anonymous → anon; strict kwargs
df = pd.read_parquet("s3://b/k", storage_options={"anon": False})
Output: same dataframe; cleaner kwargs surface.
Security considerations
- Credentials in URIs leak in logs. Avoid s3://AKIA…:secret@bucket/key-style URIs; use env vars or a credentials provider.
- anon=True only works for publicly readable data. Pointing it at a private bucket fails with a permission error instead of falling back to your credentials.
- Path traversal in custom backends. If you write a backend, sanitize ../ from user-supplied keys before stitching to your storage root.
- HTTP backend follows redirects. Disable explicitly for security-sensitive reads or set an allowlist.
- TLS verification on https:// is on by default; do NOT disable it (for example with verify=False) in production.
- Caching to shared disk can leak data across tenants. Use per-tenant cache paths.
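For the path-traversal point, one possible approach is to normalize the joined key and reject anything that no longer sits under the storage root. safe_key and the tenant-a/data root are hypothetical names for this sketch; it uses only the standard library.

```python
import posixpath


def safe_key(root: str, user_key: str) -> str:
    """Join `user_key` under `root`, rejecting anything that escapes it."""
    root = posixpath.normpath(root)
    candidate = posixpath.normpath(posixpath.join(root, user_key.lstrip("/")))
    if candidate != root and not candidate.startswith(root + "/"):
        raise ValueError(f"key escapes storage root: {user_key!r}")
    return candidate


good = safe_key("tenant-a/data", "reports/2026.csv")
try:
    safe_key("tenant-a/data", "../tenant-b/secrets.csv")
    rejected = False
except ValueError:
    rejected = True
print(good, rejected)
```

Rejecting the key outright, rather than silently stripping ../ segments, avoids mapping two different user inputs onto the same object.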
Testing & CI integration
- Use memory:// for in-process testing: a fully in-RAM filesystem.
- Use moto[s3] (a boto3 mock) under s3fs for S3 integration tests without a real bucket.
- Pin a known calendar release across all dev / CI / prod environments. Backend version drift is the #1 cause of test flake.
# tests/test_io.py
import fsspec, pytest

def test_memory_roundtrip():
    fs = fsspec.filesystem("memory")
    with fs.open("/tmp/notes.txt", "wb") as f:
        f.write(b"hello")
    with fs.open("/tmp/notes.txt", "rb") as f:
        assert f.read() == b"hello"
Output: test passes; in-process roundtrip works against the memory backend.
Ecosystem integrations
- pandas read_* / to_*: pandas passes URIs and storage_options straight to fsspec.
- dask: built on fsspec; dask.dataframe.read_parquet handles globs and remote paths.
- pyarrow.dataset / pyarrow.fs: has its own filesystem layer but consumes fsspec via PyFileSystem(FSSpecHandler(fs)).
- zarr: uses fsspec as its primary I/O layer.
- mlflow: artifact stores use fsspec for s3:// and gs:// URIs.
- huggingface_hub: registers hf:// URIs for model and dataset access.
- intake: data catalog system layered on fsspec.
- xarray: zarr-backed remote datasets via fsspec.
Compatibility matrix
| Python | fsspec line | Notes |
|---|---|---|
| 3.8 | 2024.5 and earlier | Dropped. |
| 3.9 | 2024.x+ | Current floor for recent releases. |
| 3.10 | 2024.x+ | Supported. |
| 3.11 | 2024.x+ | Supported. |
| 3.12 | 2024.6+ | Supported. |
| 3.13 | 2024.10+ | Supported, free-threaded build untested. |
Troubleshooting common errors
| Error / Symptom | Likely cause | Fix |
|---|---|---|
| ImportError: install s3fs | Missing backend | pip install fsspec[s3]. |
| AttributeError: 'S3FileSystem' object has no attribute '_async_cat' | fsspec/s3fs version skew | Pin both to the same calendar release. |
| PermissionError: 403 from S3 | Anon mode against private bucket, or missing IAM | Pass anon=False AND configure AWS credentials. |
| FileNotFoundError for an existing file | ls() cache stale | fs.invalidate_cache() or pass refresh=True. |
| aiohttp.ClientPayloadError on https:// reads | Server closed connection mid-stream | Retry; raise the block size so fewer requests are needed. |
| Listing is very slow | Recursive glob over a deep tree | Use partitioned prefixes; pre-list and cache. |
| RuntimeError: This event loop is already running | Sync fsspec call inside an async event loop | Use the underscore-prefixed coroutine methods (await fs._cat(...)). |
| simplecache returns stale data | simplecache never checks for updates or expiry | Switch to filecache with expiry_time=, or wipe cache_storage. |
When NOT to use this
- You only ever read local files. pathlib is simpler and pulls no deps.
- You need vendor-specific features. Multipart-upload tuning on S3, customer-managed encryption keys on GCS, etc.: use the vendor SDK directly.
- You're writing a one-off ETL script where the cloud bucket and the code live in the same project; boto3 / google-cloud-storage directly may be clearer.
- Strict typing matters. fsspec's API is dynamically typed; pathlib and the vendor SDKs have better stubs.
Worked example: cloud-and-local data loader with caching
A common pattern: a load(path) function that works for local/path.parquet, s3://bucket/key.parquet, and https://example.com/data.parquet uniformly, with on-disk caching of remote files.
Step 1 — single entry point that abstracts the source.
import fsspec, pandas as pd
from pathlib import Path

CACHE_DIR = Path("/var/cache/myapp")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def load_parquet(uri: str, **kw) -> pd.DataFrame:
    # Wrap remote URIs in simplecache; leave local paths alone.
    if "://" in uri and not uri.startswith("file://"):
        uri = f"simplecache::{uri}"
        kw.setdefault("simplecache", {"cache_storage": str(CACHE_DIR)})
    with fsspec.open(uri, "rb", **kw) as f:
        return pd.read_parquet(f)
Output: the function takes any path; first remote read pins a copy to disk, subsequent reads hit the cache.
Step 2 — write a companion save(path, df).
def save_parquet(uri: str, df: pd.DataFrame, **kw) -> None:
    with fsspec.open(uri, "wb", **kw) as f:
        df.to_parquet(f)
Output: symmetric API — write to local, S3, GCS via the same call.
Step 3 — bake credentials into per-environment config.
# config/prod.yaml
storage_options:
  anon: false
  client_kwargs:
    region_name: us-east-1
  config_kwargs:
    retries:
      max_attempts: 5
      mode: adaptive
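Output: load this once at startup. Because load_parquet wraps remote URIs in simplecache::, pass the options keyed by the backend protocol (load_parquet(uri, s3=storage_options)) so they reach the S3 layer rather than the cache layer.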
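With chained URLs, each layer takes its options as a dict keyed by protocol name, as in Recipe 4 (s3={...}, simplecache={...}). A small helper can build both the chained URL and the keyed kwargs from a flat storage_options mapping. chained_open_args is a hypothetical name, and the sketch uses only the standard library, so it does not call fsspec itself.

```python
def chained_open_args(uri: str, storage_options: dict, cache_dir: str):
    """Return (url, kwargs) for fsspec.open(), adding simplecache for remote URIs."""
    if "://" not in uri or uri.startswith("file://"):
        return uri, {}  # local path: no cache layer, no backend options
    protocol = uri.split("://", 1)[0]
    kwargs = {
        protocol: dict(storage_options),  # options for the remote backend
        "simplecache": {"cache_storage": cache_dir},  # options for the cache layer
    }
    return f"simplecache::{uri}", kwargs


url, kwargs = chained_open_args(
    "s3://bucket/key.parquet", {"anon": False}, "/var/cache/myapp"
)
local_url, local_kwargs = chained_open_args("local/path.parquet", {}, "/var/cache/myapp")
print(url, kwargs)
print(local_url, local_kwargs)
```

The result would then be used as fsspec.open(url, "rb", **kwargs).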
Step 4 — invalidate cache when source changes.
import shutil

def clear_cache():
    shutil.rmtree(CACHE_DIR, ignore_errors=True)
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
Output: simple, deliberate. For production prefer filecache with expiry_time= instead of manual wipes.
FAQ
Q: How do I list every parquet file under an S3 prefix?
A: fs.glob("bucket/prefix/**/*.parquet") — be aware of LIST costs on large prefixes. For very large keyspaces, use S3 Inventory and read the manifest.
Q: Can fsspec read from a private GCS bucket without ambient creds?
A: Yes — pass storage_options={"token": "/path/to/service-account.json"} (file path), or pass a dict / google.auth.credentials.Credentials object directly.
Q: How do I configure S3 to use a custom endpoint (MinIO, R2)?
A: storage_options={"endpoint_url": "https://minio.example.com:9000", "anon": False, "key": "...", "secret": "..."} — works for any S3-compatible service.
Q: Does fsspec support seekable HTTP reads (Range)?
A: Yes, via HTTPFileSystem. The HTTP backend issues range requests on f.seek + f.read; servers that don't honor Range: bytes=... fall back to streaming.
Q: How do I monitor S3 LIST/GET counts?
A: Enable AWS CloudTrail data events on the bucket, or wrap the filesystem in your own wrapper that increments counters in _cat/_ls/_open. There's no built-in metric layer.
Q: Can multiple processes share the same simplecache directory?
A: Yes. simplecache names each cached file by a hash of the remote path and keeps no shared metadata file, so concurrent processes can point at the same directory. For high concurrency, use a fast local SSD.
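For the LIST/GET monitoring question above, a thin counting proxy is one way to get rough numbers. The CountingFS class is a hypothetical sketch: it counts calls to public filesystem methods, which is not the same as counting the underlying HTTP requests (one glob may issue many LISTs).

```python
from collections import Counter

import fsspec


class CountingFS:
    """Thin proxy that counts public method calls on a filesystem."""

    def __init__(self, fs):
        self._fs = fs
        self.calls = Counter()

    def __getattr__(self, name):
        # Only reached for attributes not set on the proxy itself.
        attr = getattr(self._fs, name)
        if not callable(attr):
            return attr

        def counted(*args, **kwargs):
            self.calls[name] += 1
            return attr(*args, **kwargs)

        return counted


fs = CountingFS(fsspec.filesystem("memory"))
fs.pipe("/metrics-demo/a.bin", b"abc")
fs.cat("/metrics-demo/a.bin")
fs.cat("/metrics-demo/a.bin")
fs.ls("/metrics-demo")
call_counts = dict(fs.calls)
print(call_counts)
```

For request-level accuracy, CloudTrail data events remain the more reliable source.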
Async usage patterns
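Inside an asyncio application, prefer the underscore-prefixed coroutine methods (_cat, _ls, etc.) on AsyncFileSystem subclasses (S3FileSystem, GCSFileSystem, HTTPFileSystem). The sync wrappers hand the coroutine to fsspec's own background event loop and block the calling thread until it finishes, which stalls your event loop when called from it.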
import asyncio, fsspec

async def fanout(keys):
    fs = fsspec.filesystem("s3", anon=False, asynchronous=True)
    bodies = await asyncio.gather(*(fs._cat(k) for k in keys))
    return bodies

asyncio.run(fanout(["bucket/a", "bucket/b", "bucket/c"]))
Output: three concurrent S3 GETs; no blocking the event loop. The asynchronous=True flag tells fsspec to skip its sync-bridge layer.
For mixed sync/async code, the safer pattern is one filesystem per side — instantiate fsspec.filesystem("s3") in sync code and fsspec.filesystem("s3", asynchronous=True) in async code, rather than sharing one instance across both worlds.
See also
- Concept: filesystem — POSIX-vs-cloud semantics
- Concept: HTTP — context for the HTTP backend and range reads