cheat sheet

fsspec

Package-level reference for fsspec on PyPI — install, protocols, chained URIs, caching layers, and integration with pandas / dask.

#pip #package #fs #storage | updated 05-31-2026

What it is

fsspec (filesystem-spec) is the unified file-access abstraction underneath pandas, dask, xarray, intake, zarr, mlflow, and many other PyData-stack libraries. It provides one URI-driven API for reading and writing to local disk, HTTP(S), S3, GCS, Azure Blob, HDFS, FTP, SFTP, SSH, GitHub, Memory, ZIP, TAR — anything for which an implementation has been written or can be installed as a plugin.

Reach for fsspec when you want code that doesn't care where its data lives, when you need transparent caching of remote files, or when you're already using a library that consumes fsspec URIs (pandas read_parquet("s3://...") goes through fsspec). Direct use is also common for build/deploy tooling that needs to handle local and cloud paths uniformly.

Install

bash
pip install fsspec

Output: (none — exits 0 on success)

bash
uv add fsspec

Output: resolved + added to pyproject.toml

bash
poetry add fsspec

Output: updated lockfile + virtualenv install

Backends are separate packages — install the ones you need:

bash
pip install "fsspec[s3]"        # → s3fs
pip install "fsspec[gcs]"       # → gcsfs
pip install "fsspec[abfs]"      # → adlfs (Azure Blob / Data Lake)
pip install "fsspec[ssh]"       # → paramiko
pip install "fsspec[http]"      # → aiohttp
pip install "fsspec[github]"    # GitHub raw-file access
pip install "fsspec[full]"      # everything

Output: core fsspec plus the named backend(s).
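To confirm which backend modules are importable before any remote path is opened, a minimal standard-library sketch follows. The helper name `missing_backends` is hypothetical; the module names are the ones the extras above pull in.

```python
from importlib.util import find_spec

# Import names of the backend packages pulled in by the extras above.
BACKEND_MODULES = ["s3fs", "gcsfs", "adlfs", "paramiko", "aiohttp"]


def missing_backends(modules=BACKEND_MODULES):
    """Return the module names that cannot be imported in this environment."""
    return [name for name in modules if find_spec(name) is None]


print("missing backends:", missing_backends() or "none")
```

Running this at install time or in CI surfaces a missing backend earlier than the first `fsspec.open()` call would.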

Versioning & Python support

  • Current line is the 2024.x / 2025.x calendar-versioned series (e.g., 2025.5.1).
  • Supports Python 3.9+ on recent releases.
  • Release cadence is monthly; bumps are usually minor and additive but occasionally tighten the protocol interface.
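  • s3fs and gcsfs are released in lockstep with fsspec and pin a matching fsspec version; keep them on the same release (s3fs==2025.5.* with fsspec==2025.5.*) to avoid registry-API drift. adlfs is also calendar-versioned but ships on its own schedule, so check its declared fsspec requirement instead of matching version numbers.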
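The "same calendar release" rule can be checked mechanically. The sketch below compares the year.month part of installed versions using only `importlib.metadata`; the helper names are hypothetical, and packages that are not installed are skipped.

```python
from importlib.metadata import PackageNotFoundError, version


def calendar_release(ver: str) -> tuple:
    """'2025.5.1' -> ('2025', '5'): the year.month part of a calendar version."""
    return tuple(ver.split(".")[:2])


def same_calendar_release(a: str, b: str) -> bool:
    """True when two version strings share the same year.month release."""
    return calendar_release(a) == calendar_release(b)


def check_lockstep(core: str = "fsspec", backends=("s3fs", "gcsfs")) -> dict:
    """Map each installed backend to True/False; absent packages are skipped."""
    try:
        core_version = version(core)
    except PackageNotFoundError:
        return {}
    result = {}
    for name in backends:
        try:
            result[name] = same_calendar_release(core_version, version(name))
        except PackageNotFoundError:
            continue
    return result


print(check_lockstep())
```

A `False` value for any backend is a hint to re-pin before chasing attribute errors at runtime.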

Package metadata

  • Maintainer: Martin Durant + community (originally Anaconda)
  • Project home: github.com/fsspec/filesystem_spec
  • Docs: filesystem-spec.readthedocs.io
  • PyPI: pypi.org/project/fsspec
  • License: BSD-3-Clause
  • Governance: community; fsspec GitHub org houses sibling backend packages
  • First released: 2018
  • Downloads: hundreds of millions per month — transitive dep of pandas, dask, pyarrow, others

Optional dependencies & extras

The core fsspec package has no runtime deps. Backend extras pull in the corresponding implementation packages:

  • fsspec[s3] → s3fs (which depends on aiobotocore).
  • fsspec[gcs] → gcsfs (uses google-auth).
  • fsspec[abfs] → adlfs (Azure SDK).
  • fsspec[ssh] / fsspec[sftp] → paramiko.
  • fsspec[ftp] → stdlib ftplib (no extra is actually needed).
  • fsspec[http] → aiohttp and requests.
  • fsspec[arrow] → pyarrow for Arrow-backed filesystems.
  • fsspec[full] → everything above.

Filesystems can also be registered by third-party packages via entry points — see huggingface_hub's hf://, dvc's dvc://, pyiceberg's table reads.
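To see which protocols third-party packages have registered in the current environment, the entry points can be listed with the standard library alone. This is a sketch that assumes the entry-point group is named `fsspec.specs`; the helper name is hypothetical.

```python
from importlib.metadata import entry_points


def registered_protocols(group: str = "fsspec.specs") -> list:
    """Return the sorted protocol names advertised under the given entry-point group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: entry points are selectable by group.
        selected = eps.select(group=group)
    else:
        # Python 3.9: entry_points() returns a dict of group -> entry points.
        selected = eps.get(group, [])
    return sorted({ep.name for ep in selected})


print(registered_protocols())
```

An empty list only means that no installed package advertises a filesystem this way; protocols built into fsspec itself are resolved separately.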

Alternatives

PackageTrade-off
boto3 / google-cloud-storage / azure-storage-blobCloud-vendor SDKs; full feature surface, vendor-locked code.
s3fsJust S3; lower abstraction than fsspec but compatible.
cloudpathlibpathlib.Path-style API for cloud URIs. Less plumbing, fewer backends.
smart_openSingle-file open() abstraction; lighter and older; doesn't do listing or globbing.
pyfilesystem2Older filesystem-abstraction project; less PyData adoption.
pathlib (stdlib)Local paths only.

Common gotchas

  1. Backend installs are mandatory. fsspec.open("s3://bucket/key") raises ImportError ("Install s3fs to access S3") if you didn't pip install "fsspec[s3]".
  2. Version skew between fsspec and backends causes "AsyncFileSystem has no attribute …". Pin both to the same calendar release.
  3. fsspec.open() returns a file-like in BINARY mode by default. Use mode="rt" for text and pass encoding= explicitly.
  4. Chained protocols (simplecache::s3://...) only work in supported orderings. Cache layer goes on the OUTSIDE (simplecache::s3://), not inside.
  5. AbstractFileSystem.ls() is cached per filesystem instance. If the remote changes during the run, invalidate_cache() or pass refresh=True.
  6. S3 anonymous reads need anon=True. Otherwise s3fs tries to use ambient AWS creds and fails on missing-credentials, not on permission.
  7. open_files() returns a list of OpenFile objects that aren't open yet — they open on __enter__ of each, not on creation. This is intentional for delayed I/O in dask.
  8. Async vs sync. AsyncFileSystem subclasses (S3FileSystem, GCSFileSystem) expose each operation twice: a blocking method (cat, ls) and an underscore-prefixed coroutine (_cat, _ls). In async code, await the underscore-prefixed versions.
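Gotchas 3 and 7 are easy to see against the in-memory backend, which needs no network or credentials. The sketch below uses a made-up `/gotcha-demo` directory.

```python
import fsspec

fs = fsspec.filesystem("memory")
fs.pipe_file("/gotcha-demo/a.txt", "héllo".encode("utf-8"))
fs.pipe_file("/gotcha-demo/b.txt", b"world")

# Gotcha 3: the default mode is binary, so read() returns bytes.
with fsspec.open("memory://gotcha-demo/a.txt") as f:
    raw = f.read()

# Ask for text explicitly and name the encoding.
with fsspec.open("memory://gotcha-demo/a.txt", "rt", encoding="utf-8") as f:
    text = f.read()

# Gotcha 7: open_files() expands the glob into OpenFile objects,
# and each one is only opened when it is entered as a context manager.
files = fsspec.open_files("memory://gotcha-demo/*.txt", "rb")
contents = []
for open_file in files:
    with open_file as f:
        contents.append(f.read())

print(type(raw).__name__, type(text).__name__, len(files))
```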

Real-world recipes

The recipes below show local + remote paths through one API, chained caches, and the directory-listing surface that's common for build tooling.

Recipe 1 — Open a file via the abstraction (local OR remote, same code).

python
import fsspec

# Local path
with fsspec.open("file:///tmp/notes.txt", "rt") as f:
    print(f.read()[:200])

# S3 path — same API
with fsspec.open("s3://my-bucket/notes.txt", "rt", anon=False) as f:
    print(f.read()[:200])

Output: first 200 characters of each file — identical Python-level API, different transports.

Recipe 2 — Read directly from S3 via the protocol prefix.

python
import fsspec, pandas as pd

# Pandas knows fsspec — pass an S3 URI directly.
df = pd.read_parquet("s3://my-bucket/path/data.parquet", storage_options={"anon": False})
print(df.shape)

Output: (rows, cols) — pandas resolved the URI through fsspec without explicit setup.

Recipe 3 — HTTP read.

python
import fsspec
with fsspec.open("https://example.com/large.csv", "rb") as f:
    head = f.read(512)
print(head[:80])

Output: first 80 bytes of the remote CSV — the HTTP backend supports partial reads via Range headers.

Recipe 4 — Chained protocols: cache an S3 file to local disk on first read.

python
import fsspec

# `simplecache` wraps any backend; reads are pinned to disk on first access.
with fsspec.open("simplecache::s3://my-bucket/large.parquet",
                 s3={"anon": False},
                 simplecache={"cache_storage": "/var/cache/fsspec"}) as f:
    data = f.read()

Output: cached copy at /var/cache/fsspec/<hash> after the first read; subsequent reads are local-disk speed. Other cache layers: filecache (per-file), blockcache (block-level random-access cache).
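The chained-URL convention in Recipe 4 can be wrapped in a small helper so callers pass flat backend options. This is a sketch: `cached_open_args` is a hypothetical name, and it assumes, as in the recipe, that each layer's options are keyed by its protocol name.

```python
def cached_open_args(uri: str, cache_dir: str, **backend_options):
    """Return (uri, kwargs) suitable for fsspec.open(uri, **kwargs).

    Remote URIs are wrapped in simplecache, with backend options nested
    under the backend's protocol name. Local paths pass through unchanged.
    """
    if "://" not in uri or uri.startswith("file://"):
        return uri, dict(backend_options)
    protocol = uri.split("://", 1)[0]
    kwargs = {
        protocol: dict(backend_options),
        "simplecache": {"cache_storage": cache_dir},
    }
    return f"simplecache::{uri}", kwargs


uri, kwargs = cached_open_args(
    "s3://my-bucket/large.parquet", "/var/cache/fsspec", anon=False
)
print(uri)
print(kwargs)
```

The result would then be passed on as `fsspec.open(uri, **kwargs)`; the helper itself does no I/O.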

Recipe 5 — Directory listing across backends.

python
import fsspec
fs = fsspec.filesystem("s3", anon=False)
for path in fs.ls("my-bucket/path/", detail=False):
    print(path)

Output: one S3 key per line. detail=True returns dicts with size, LastModified, etc.

Recipe 6 — Globbing remote paths.

python
import fsspec
fs = fsspec.filesystem("s3", anon=False)
for p in fs.glob("my-bucket/year=2026/**/*.parquet"):
    print(p)

Output: every parquet file under any subdir of year=2026/. Globbing on remote backends incurs LIST calls; cache aggressively.

Production deployment notes

  • Pin fsspec and backends to the same calendar release. API drift between e.g. fsspec==2025.5.0 and s3fs==2024.9.0 causes obscure attribute errors.
  • Configure credentials via env or instance metadata, not in code. S3: AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY or an IAM role. GCS: GOOGLE_APPLICATION_CREDENTIALS. Azure: AZURE_STORAGE_CONNECTION_STRING or DefaultAzureCredential.
  • Use simplecache or filecache for read-heavy workloads. Pinning hot files to local SSD cuts S3 GET costs and tail latency.
  • Be mindful of LIST costs. S3 LIST is per-1000-keys; for high-cardinality prefixes use partitioned paths and avoid recursive glob in tight loops.
  • Set the block size explicitly for large reads (block_size on HTTPFileSystem, default_block_size on S3FileSystem); 5-50 MB is a sensible range. The default is OK for small files but suboptimal for large columnar reads.
  • Health-check the backend — read a small known file at startup so credentials and network reachability fail loudly at boot, not mid-request.
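The LIST-cost point comes down to simple arithmetic. The sketch below counts the minimum number of LIST requests for a flat listing, taking the 1000-keys-per-request page size from the note above; the function names are hypothetical, and a recursive glob that walks directory by directory can issue more requests than this lower bound.

```python
def list_requests(n_keys: int, page_size: int = 1000) -> int:
    """Minimum LIST requests needed to enumerate n_keys (at least one, even if empty)."""
    return max(1, -(-n_keys // page_size))  # ceiling division


def list_requests_in_loop(n_keys: int, iterations: int, page_size: int = 1000) -> int:
    """Requests issued when the same prefix is re-listed on every loop iteration."""
    return iterations * list_requests(n_keys, page_size)


print(list_requests(2_500_000))               # one full listing
print(list_requests_in_loop(2_500_000, 100))  # the same listing repeated 100 times
```

Listing once and reusing the result keeps the request count at the first figure rather than the second.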

Performance tuning

  • open_files() + dask processes many remote files in parallel without you writing a thread pool.
  • blockcache:: for random-access reads over HTTP — parquet/zarr workloads benefit massively because the cache layer turns range-reads into block-aligned reads.
  • anon=True for public buckets — skips the credential lookup roundtrip on AWS.
  • Large reads: raise the filesystem's block size, or pass block_size= to open() for a single file; the default 5 MB is small for multi-GB files.
  • List once, then iterate. fs.ls(...) + slice is far cheaper than calling fs.exists() per file.
  • Async loops: in an asyncio app, use the _* methods directly (await fs._cat(...)) to avoid the sync-bridge overhead.
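The "list once, then iterate" advice can be sketched against the in-memory backend. On a remote filesystem each `exists()` call may be a network round trip, while one `ls()` plus set membership is a single listing. The `/perf-demo` paths are made up.

```python
import fsspec

fs = fsspec.filesystem("memory")
for name in ("a", "b", "c"):
    fs.pipe_file(f"/perf-demo/{name}.parquet", b"x")

wanted = ["/perf-demo/a.parquet", "/perf-demo/z.parquet"]

# One listing call, then pure in-process set lookups.
present = set(fs.ls("/perf-demo", detail=False))
found = [path for path in wanted if path in present]
missing = [path for path in wanted if path not in present]

print(found, missing)
```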

Version migration guide

  • < 2022.x: pre-calendar-version, less plugin coverage; upgrade.
  • 2023.x → 2024.x: AsyncFileSystem interfaces tightened; subclasses must implement the underscore-prefixed async methods.
  • 2024.x → 2025.x: storage_options validation strict-mode; unknown kwargs now error.
  • Backend-specific gotchas:
    • s3fs >= 2024.x requires aiobotocore >= 2.7 which requires a compatible botocore. Pin the trio.
    • gcsfs switched to google-auth modern token refresh in 2024.
    • adlfs requires azure-storage-blob >= 12 and azure-identity.

python
# Old (< 2023): storage_options passed loosely
df = pd.read_parquet("s3://b/k", storage_options={"key": "...", "secret": "...", "anonymous": False})

# Current: anonymous → anon; strict kwargs
df = pd.read_parquet("s3://b/k", storage_options={"anon": False})

Output: same dataframe; cleaner kwargs surface.

Security considerations

  • Credentials in URIs leak in logs. Avoid s3://AKIA…:secret@bucket/key-style URIs; use env vars or a credentials provider.
  • anon=True is for public, read-only access. Pointing anon mode at a private bucket does not fall back to your credentials; requests fail with a permission error.
  • Path traversal in custom backends. If you write a backend, sanitize ../ from user-supplied keys before stitching to your storage root.
  • HTTP backend follows redirects. Disable explicitly for security-sensitive reads or set an allowlist.
  • TLS verification on https:// is on by default; do NOT override verify=False in production.
  • Caching to shared disk can leak data across tenants. Use per-tenant cache paths.
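Two of the points above lend themselves to small standard-library helpers: redacting credentials from a URI before it is logged, and refusing keys that escape a storage root. Both function names are hypothetical. `safe_key` rejects a traversing key outright rather than rewriting it, which is a design choice of this sketch.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def redact_uri(uri: str) -> str:
    """Replace any userinfo ('key:secret@') in a URI with '***@' for logging."""
    parts = urlsplit(uri)
    if "@" not in parts.netloc:
        return uri
    host = parts.netloc.rpartition("@")[2]
    return urlunsplit(
        (parts.scheme, f"***@{host}", parts.path, parts.query, parts.fragment)
    )


def safe_key(root: str, user_key: str) -> str:
    """Join user_key under root, raising ValueError if the result escapes root."""
    root_norm = posixpath.normpath(root)
    joined = posixpath.normpath(posixpath.join(root_norm, user_key.lstrip("/")))
    if joined != root_norm and not joined.startswith(root_norm + "/"):
        raise ValueError(f"key escapes storage root: {user_key!r}")
    return joined


print(redact_uri("s3://AKIAEXAMPLE:secret@bucket/key"))
print(safe_key("bucket/tenants/a", "reports/q1.csv"))
```

The same `safe_key` pattern works for per-tenant cache paths: derive the cache directory from a fixed root plus the tenant identifier and reject anything that normalizes outside the root.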

Testing & CI integration

  • Use memory:// for in-process testing — fully in-RAM filesystem.
  • Use moto[s3] (a boto3 mock) under s3fs for S3 integration tests without a real bucket.
  • Pin a known calendar release across all dev / CI / prod environments. Backend version drift is the #1 cause of test flake.

python
# tests/test_io.py
import fsspec, pytest

def test_memory_roundtrip():
    fs = fsspec.filesystem("memory")
    with fs.open("/tmp/notes.txt", "wb") as f:
        f.write(b"hello")
    with fs.open("/tmp/notes.txt", "rb") as f:
        assert f.read() == b"hello"

Output: test passes; in-process roundtrip works against the memory backend.

Ecosystem integrations

  • pandas read_* / to_* — pandas passes URIs and storage_options straight to fsspec.
  • dask — built on fsspec; dask.dataframe.read_parquet handles globs and remote paths.
  • pyarrow.dataset / pyarrow.fs — has its own filesystem layer but consumes fsspec via PyFileSystem(FSSpecHandler(fs)).
  • zarr — uses fsspec as its primary I/O layer.
  • mlflow — artifact stores use fsspec for s3:// and gs:// URIs.
  • huggingface_hub — registers hf:// URIs for model and dataset access.
  • intake — data catalog system layered on fsspec.
  • xarray — zarr-backed remote datasets via fsspec.

Compatibility matrix

Python | fsspec line | Notes
3.8 | 2024.5 and earlier | Dropped.
3.9 | 2024.x+ | Current floor for recent releases.
3.10 | 2024.x+ | Supported.
3.11 | 2024.x+ | Supported.
3.12 | 2024.6+ | Supported.
3.13 | 2024.10+ | Supported, free-threaded build untested.

Troubleshooting common errors

Error / Symptom | Likely cause | Fix
ImportError: Install s3fs to access S3 | Missing backend | pip install "fsspec[s3]".
AttributeError: 'S3FileSystem' object has no attribute '_async_cat' | fsspec/s3fs version skew | Pin both to the same calendar release.
PermissionError: 403 from S3 | Anon mode against private bucket, or missing IAM | Pass anon=False AND configure AWS credentials.
FileNotFoundError for an existing file | ls() cache stale | fs.invalidate_cache() or pass refresh=True.
aiohttp.ClientPayloadError on https:// reads | Server closed connection mid-stream | Retry; raise block_size so fewer requests are needed.
Listing is very slow | Recursive glob over a deep tree | Use partitioned prefixes; pre-list and cache.
RuntimeError: This event loop is already running | Sync fsspec call inside an async event loop | Use the underscore-prefixed async methods (await fs._cat(...)).
simplecache returns stale data | Cache never invalidated (simplecache keeps no expiry metadata) | Switch to filecache with expiry_time=, or wipe cache_storage.

When NOT to use this

  • You only ever read local files. pathlib is simpler and pulls no deps.
  • You need vendor-specific features. Multipart-upload tuning on S3, customer-managed encryption keys on GCS, etc. — use the vendor SDK directly.
  • You're writing a one-off ETL script where the cloud bucket and the code live in the same project — boto3 / gcsfs directly may be clearer.
  • Strict typing matters. fsspec's API is dynamically typed; pathlib and the vendor SDKs have better stubs.

Worked example: cloud-and-local data loader with caching

A common pattern: a load(path) function that works for local/path.parquet, s3://bucket/key.parquet, and https://example.com/data.parquet uniformly, with on-disk caching of remote files.

Step 1 — single entry point that abstracts the source.

python
import fsspec, pandas as pd
from pathlib import Path

CACHE_DIR = Path("/var/cache/myapp")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

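def load_parquet(uri: str, **kw) -> pd.DataFrame:
    # Wrap remote URIs in simplecache; leave local paths alone.
    if "://" in uri and not uri.startswith("file://"):
        protocol = uri.split("://", 1)[0]
        # In a chained URL, bare kwargs go to the outermost layer (simplecache),
        # so route backend options under the backend's protocol key instead.
        kw = {protocol: kw, "simplecache": {"cache_storage": str(CACHE_DIR)}}
        uri = f"simplecache::{uri}"
    with fsspec.open(uri, "rb", **kw) as f:
        return pd.read_parquet(f)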

Output: the function takes any path; first remote read pins a copy to disk, subsequent reads hit the cache.

Step 2 — write a companion save(path, df).

python
def save_parquet(uri: str, df: pd.DataFrame, **kw) -> None:
    with fsspec.open(uri, "wb", **kw) as f:
        df.to_parquet(f)

Output: symmetric API — write to local, S3, GCS via the same call.

Step 3 — keep non-secret storage options in per-environment config (credentials still come from env vars or instance metadata).

yaml
# config/prod.yaml
storage_options:
  anon: false
  client_kwargs:
    region_name: us-east-1
  config_kwargs:
    retries:
      max_attempts: 5
      mode: adaptive

Output: load this once at startup and pass **storage_options to load_parquet.

Step 4 — invalidate cache when source changes.

python
import shutil
def clear_cache():
    shutil.rmtree(CACHE_DIR, ignore_errors=True)
    CACHE_DIR.mkdir(parents=True, exist_ok=True)

Output: simple, deliberate. For production prefer filecache with expiry_time= instead of manual wipes.
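A middle ground between a full wipe and switching to filecache is pruning cached files by age. This is not an fsspec feature, just a standard-library sketch. `prune_cache` is a hypothetical helper, and it is meant for a simplecache directory only, since simplecache keeps no metadata file that a deletion could desynchronize.

```python
import os
import tempfile
import time
from pathlib import Path


def prune_cache(cache_dir, max_age_seconds: float, now=None) -> int:
    """Delete regular files older than max_age_seconds; return how many were removed."""
    now = time.time() if now is None else now
    removed = 0
    for entry in Path(cache_dir).iterdir():
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            entry.unlink()
            removed += 1
    return removed


# Demo in a throwaway directory: one stale file, one fresh file.
demo_dir = Path(tempfile.mkdtemp())
stale = demo_dir / "stale"
stale.write_bytes(b"old")
os.utime(stale, (0, 0))  # pretend it was last modified in 1970
fresh = demo_dir / "fresh"
fresh.write_bytes(b"new")

removed = prune_cache(demo_dir, max_age_seconds=3600)
survivors = sorted(p.name for p in demo_dir.iterdir())
print(removed, survivors)
```

A pruned file is simply downloaded again on the next read.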

FAQ

Q: How do I list every parquet file under an S3 prefix? A: fs.glob("bucket/prefix/**/*.parquet") — be aware of LIST costs on large prefixes. For very large keyspaces, use S3 Inventory and read the manifest.

Q: Can fsspec read from a private GCS bucket without ambient creds? A: Yes — pass storage_options={"token": "/path/to/service-account.json"} (file path), or pass a dict / google.auth.credentials.Credentials object directly.

Q: How do I configure S3 to use a custom endpoint (MinIO, R2)? A: storage_options={"endpoint_url": "https://minio.example.com:9000", "anon": False, "key": "...", "secret": "..."} — works for any S3-compatible service.

Q: Does fsspec support seekable HTTP reads (Range)? A: Yes, via HTTPFileSystem. The HTTP backend issues range requests on f.seek + f.read; servers that don't honor Range: bytes=... fall back to streaming.

Q: How do I monitor S3 LIST/GET counts? A: Enable AWS CloudTrail data events on the bucket, or wrap the filesystem in your own wrapper that increments counters in _cat/_ls/_open. There's no built-in metric layer.
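One way to sketch the wrapper idea is a thin proxy that counts calls and delegates everything else. `CountingFS` is a hypothetical class, shown here against the in-memory backend. It counts only calls made through the proxy on the public methods, not the underlying network requests and not calls fsspec makes internally.

```python
from collections import Counter

import fsspec


class CountingFS:
    """Proxy that counts calls to selected methods and delegates the rest."""

    def __init__(self, fs, tracked=("ls", "cat_file", "open")):
        self._fs = fs
        self._tracked = set(tracked)
        self.calls = Counter()

    def __getattr__(self, name):
        # Only reached for attributes not set on the proxy itself.
        attr = getattr(self._fs, name)
        if name in self._tracked and callable(attr):
            def counted(*args, **kwargs):
                self.calls[name] += 1
                return attr(*args, **kwargs)
            return counted
        return attr


fs = CountingFS(fsspec.filesystem("memory"))
fs.pipe_file("/count-demo/x.bin", b"123")  # not tracked, just delegated
fs.ls("/count-demo")
fs.cat_file("/count-demo/x.bin")
fs.cat_file("/count-demo/x.bin")
print(dict(fs.calls))
```

In a service, the counter would typically be exported to whatever metrics system is already in place.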

Q: Can multiple processes share the same simplecache directory? A: Yes. simplecache names each cached file after a hash of its source path and keeps no metadata file, so concurrent writers produce the same file name for the same source. For high concurrency, use a fast local SSD.

Async usage patterns

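Inside an asyncio application, prefer the underscore-prefixed coroutine methods (_cat, _ls, ...) on AsyncFileSystem subclasses (S3FileSystem, GCSFileSystem, HTTPFileSystem). The sync wrappers hand each call to fsspec's own background event-loop thread and block the calling thread until it completes, which stalls your event loop if you call them from a coroutine.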

python
import asyncio, fsspec

async def fanout(keys):
    fs = fsspec.filesystem("s3", anon=False, asynchronous=True)
    bodies = await asyncio.gather(*(fs._cat(k) for k in keys))
    return bodies

asyncio.run(fanout(["bucket/a", "bucket/b", "bucket/c"]))

Output: three concurrent S3 GETs; no blocking the event loop. The asynchronous=True flag tells fsspec to skip its sync-bridge layer.

For mixed sync/async code, the safer pattern is one filesystem per side — instantiate fsspec.filesystem("s3") in sync code and fsspec.filesystem("s3", asynchronous=True) in async code, rather than sharing one instance across both worlds.

See also