Filesystems

The OS-layer abstraction that turns a flat block device into named, permissioned, hierarchical files — covering inodes, paths, permissions, journaling, and copy-on-write.

Definition

A filesystem is the operating-system layer that turns a flat array of disk sectors into a hierarchical namespace of named, permissioned files and directories. It owns the on-disk data structures (inodes, directory tables, block maps, journals) and the kernel code that lets open, read, write, rename, and stat behave like operations on a tree of objects rather than offsets into a raw device. Every persistent-data operation a program issues — from print("hello") redirected to a log to a database flushing a transaction — ultimately hits a filesystem implementation: ext4, XFS, Btrfs, ZFS, APFS, NTFS, ReFS, or a network one such as NFS or SMB.

Why it matters

Almost every persistence bug, "where did my disk space go?" mystery, and surprising-deletion incident traces back to a filesystem detail that the language-level API hid from you. A mv is instant on the same filesystem but secretly a full copy when it crosses a filesystem boundary. A df and du mismatch is usually a held-open deleted file. A web app that writes config and immediately reboots can lose data because no one called fsync. A "shouldn't this work?" path with mixed case fails on Linux and works on macOS and Windows. A daemon running as a different user can't read its own log because the parent directory lacks the execute (search) bit for that user. Understanding what the filesystem actually does (inodes, atomic operations, durability, permissions, case rules) is the difference between guessing and knowing.

How it works

At the bottom is a block device: a numbered sequence of fixed-size sectors. The filesystem superimposes three on-disk structures on top of it:

  1. Inodes — fixed-size records (256–1024 bytes on modern filesystems) holding everything about a file except its name: mode bits, owner UID/GID, size, timestamps (atime, mtime, ctime, sometimes btime), link count, and pointers (direct, indirect, or extent-based) to the data blocks. The inode is the file. Every operation that talks about "the file" — stat, chmod, truncate, fstat, read — ultimately works through an inode.
  2. Directories — themselves just files whose contents are a table of (name → inode-number) entries. A file's name is not a property of the file; it lives in the parent directory's table. This is why a single inode can have many names (hard links), why mv inside one filesystem is just two directory updates and one inode unchanged, and why deleting a file decrements the link count rather than freeing the data — the data lives until the last name and the last open file descriptor are gone.
  3. A journal or log (ext4, XFS, NTFS) or a copy-on-write tree (Btrfs, ZFS, APFS) — the mechanism that keeps metadata consistent after a crash. Journaling writes intended changes to a log first, then applies them; on crash the log is replayed. Copy-on-write never overwrites a live block — new versions go to fresh locations and a single atomic pointer-swap commits them — which is what makes instant snapshots cheap.
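
The split between names and inodes can be observed from user space. The following is a minimal sketch, assuming a POSIX-style filesystem that supports hard links; the file names and the function name hard_link_demo are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def hard_link_demo() -> dict:
    """Give one inode two names, drop one name, and report what stat() sees."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "first.txt")    # hypothetical names
        second = os.path.join(d, "second.txt")
        with open(first, "w") as f:
            f.write("hello")

        os.link(first, second)                  # add a second directory entry
        a, b = os.stat(first), os.stat(second)
        same_inode = (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino)
        links_before = a.st_nlink               # number of names for this inode

        os.unlink(first)                        # removes a name, not the data
        with open(second) as f:
            survived = f.read()
        links_after = os.stat(second).st_nlink

    return {
        "same_inode": same_inode,
        "links_before": links_before,
        "links_after": links_after,
        "survived": survived,
    }


if __name__ == "__main__":
    print(hard_link_demo())
```

Both names resolve to the same (device, inode) pair, and removing one name only lowers the link count; the data stays reachable through the remaining name.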

A path is a slash-separated walk down the tree. Each component is looked up in its parent directory's table; the walk ends at an inode (or fails). Mount points graft another filesystem's root onto a directory in the current tree, which is why mv across mounts is cp + rm (different inode spaces) and a df for /home and /var can report independent free space.
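
One way to check whether a move will stay inside one filesystem is to compare device IDs before renaming, and to treat EXDEV from the rename itself as the authoritative answer. This is a sketch under assumptions: the helper names same_filesystem and rename_strict are hypothetical, and comparing st_dev is only a heuristic (bind mounts of one filesystem, for example, share a device ID yet still refuse cross-mount renames).

```python
import errno
import os


def same_filesystem(src: str, dst_dir: str) -> bool:
    """Heuristic: equal st_dev means both paths live on the same filesystem."""
    return os.stat(src).st_dev == os.stat(dst_dir).st_dev


def rename_strict(src: str, dst: str) -> bool:
    """Rename atomically; return False instead of copying when the kernel says EXDEV."""
    try:
        os.rename(src, dst)
        return True
    except OSError as exc:
        if exc.errno == errno.EXDEV:   # source and destination are on different mounts
            return False
        raise
```

A caller that needs atomicity can refuse to proceed when rename_strict returns False rather than silently falling back to a copy.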

Three guarantees programs lean on:

  • Atomic rename within one filesystem: rename(old, new) either succeeds or leaves both unchanged; readers never see a half-renamed name. This is the backbone of the write-temp + rename idiom for atomic config updates.
  • Durability is not free — a successful write() only places data in the kernel's page cache. To survive a power loss the program must call fsync(fd) on the file and fsync(dir_fd) on the directory that holds the new name; modern ext4 and XFS implement this with a cache flush plus an FUA (force-unit-access) journal write.
  • Permissions gate every operation — POSIX mode bits (rwx for user/group/other) are the baseline; POSIX.1e ACLs add per-user/per-group entries and are stored as extended attributes in the system xattr namespace; xattrs themselves store arbitrary name=value metadata (Gatekeeper's com.apple.quarantine, SELinux contexts, file capabilities). Linux capabilities split root's powers into ~40 discrete units (CAP_NET_BIND_SERVICE, CAP_CHOWN, CAP_SYS_ADMIN) that ride in the security.capability xattr; a binary tagged with setcap cap_net_bind_service=+ep can bind to port 80 without ever running setuid-root. NFSv4 mounts use a richer Windows-style ACL grammar (per-ACE allow/deny, fine-grained rntcyD permissions, inheritance flags) that does not interoperate with POSIX setfacl / getfacl. Windows NTFS uses a completely different model — ACL-based DACLs with deny rules and inheritance — exposed through icacls rather than chmod.
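
The first two guarantees combine into the write-temp, fsync, rename, fsync-directory sequence. The sketch below assumes a POSIX system (opening a directory with os.open does not work on Windows); the function name atomic_write and the ".tmp" suffix are hypothetical conventions, and the temporary file must sit in the same directory as the target so that the rename stays on one filesystem.

```python
import os


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so that a crash leaves either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    tmp = path + ".tmp"  # hypothetical naming convention; same directory as the target

    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                      # os.write may write fewer bytes than asked
            written = os.write(fd, view)
            view = view[written:]
        os.fsync(fd)                     # push file contents out of the page cache
    finally:
        os.close(fd)

    os.replace(tmp, path)                # atomic rename within one filesystem

    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)                 # persist the directory entry for the new name
    finally:
        os.close(dir_fd)
```

If the process dies before os.replace, the stale temporary file is left behind; a real implementation would likely clean it up on the next start.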

Where filesystems differ in practice:

Filesystem | Crash safety | Case rule | Killer feature
ext4 | Metadata journal (data=ordered) | Sensitive | Boring, fast, ubiquitous on Linux
XFS | Metadata journal, delayed allocation | Sensitive | Throughput on large files, RHEL default
Btrfs | Copy-on-write | Sensitive | Subvolumes, writable snapshots, send/receive
ZFS | Copy-on-write, end-to-end checksums | Sensitive | Self-healing on mirrors, ARC, datasets
APFS | Copy-on-write | Insensitive (preserving) by default; sensitive opt-in | Atomic snapshots, clones, native crypto
NTFS | Metadata journal | Insensitive (preserving) by default; per-dir sensitive via fsutil | DACL ACLs, alternate data streams, reparse points
FAT32 / exFAT | None | Insensitive | Lingua franca for USB sticks; no permissions, no journal
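
Because the case rule depends on the volume (and on NTFS, on the directory), code that cares can probe it at runtime rather than guess from the OS name. A minimal sketch, assuming write access to the directory being probed; is_case_sensitive is a hypothetical helper name.

```python
import os
import tempfile


def is_case_sensitive(directory: str) -> bool:
    """Create a lowercase-named probe file and test whether its uppercase spelling resolves."""
    # NamedTemporaryFile deletes the probe when the with-block exits.
    with tempfile.NamedTemporaryFile(prefix="casecheck_", dir=directory) as probe:
        upper = os.path.join(directory, os.path.basename(probe.name).upper())
        # On a case-insensitive volume the uppercase path finds the same file.
        return not os.path.exists(upper)


if __name__ == "__main__":
    print(is_case_sensitive(tempfile.gettempdir()))
```

The result applies only to the probed directory; another mount on the same host may answer differently.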

Common pitfalls

  1. Assuming write() is durable. It isn't — kernel page cache can hold the bytes for seconds. Use the write temp + fsync + rename + fsync(dir) pattern for crash-safe updates: write to config.tmp, fsync(config.tmp), rename(config.tmp, config), fsync(parent_dir). Skip any step and a poorly-timed power loss can leave you with old data, new data, or a zero-byte file.
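  2. mv across filesystems silently turns into cp + rm. The bare rename(2) syscall fails with EXDEV when source and destination are on different filesystems; mv and wrappers such as Python's shutil.move then fall back to a copy followed by a delete behind your back, which loses atomicity and may take minutes instead of milliseconds. Lower-level calls such as Python's os.rename and Node's fs.rename do not fall back and surface the EXDEV error to the caller. Always check whether source and destination are on the same mount.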
  3. Case rules differ across hosts. A repo with both README.md and readme.md checks out fine on Linux/ext4 (case-sensitive), collapses to a single file on macOS/APFS (case-insensitive by default: the second file overwrites the first), and breaks Windows tooling in confusing ways. Pick one case and stick with it; for cross-platform projects, treat the filesystem as case-insensitive even when it is not.
  4. Running out of inodes with disk space to spare. Each ext4 filesystem is created with a fixed inode count; lots of tiny files (mail spools, npm node_modules, web caches) can exhaust them while df shows plenty of free blocks. df -i reports inode usage; if you are near the ceiling, the usual fix, short of deleting files, is to recreate the filesystem with mkfs and a higher inode density (mkfs.ext4 -i or -N).
  5. df and du disagree. Almost always a deleted-but-still-open file. The directory entry is gone (so du misses it) but the inode and blocks stay reserved until the process closes the descriptor (so df still counts them). lsof | grep deleted finds the culprit; restart or close the holding process to actually free the space.
  6. TOCTOU on path lookups. Code that does stat(path) then open(path) lets an attacker swap a symlink in between. Use the openat/fstatat family with AT_SYMLINK_NOFOLLOW, or open(O_NOFOLLOW) so the syscall both checks and opens atomically. Never trust a name twice.
  7. Forgetting that extended attributes don't travel. APFS and HFS+ store macOS xattrs natively; plain cp, scp, tar without --xattrs, and any trip through FAT32 or many SMB shares silently drop them. Use cp -p, rsync -X, or ditto -ck when Gatekeeper flags, Spotlight metadata, or ACLs matter.
  8. Path encoding mismatches. Linux paths are byte strings (any non-NUL byte is legal); HFS+ normalised names to an NFD-like form, while APFS preserves the bytes it is given but compares names normalisation-insensitively; Windows uses UTF-16. A filename that round-trips through email or a zip from another OS can come back with a different byte sequence and stop matching by ==.
  9. MAX_PATH on Windows. The 260-character ceiling still bites tools that haven't opted in via the LongPathsEnabled registry key or a per-app manifest, even on modern Windows 10/11. Deep node_modules trees and time-stamped log directories trip it constantly; many builtins (including mkdir and del) ignore the \\?\ long-path prefix.
  10. setfacl / getfacl returning empty on NFS. Those tools only speak POSIX.1e ACLs. NFSv4 mounts carry a completely different ACL grammar; running getfacl /mnt/nfs/file shows only the base mode bits with no extended entries even though the server has a rich ACE list set. Install nfs4-acl-tools and use nfs4_getfacl / nfs4_setfacl on those mounts; the silent empty output is the trap. A separate POSIX ACL trap is setfacl -m's mask auto-recalculation: every grant recomputes the mask as the union of the named-user, named-group, and owning-group entries, so a deliberately tight m::r quietly widens to rwx the next time you grant u:alice:rwx unless you pass -n (--no-mask).
  11. cp and rsync silently strip file capabilities. A binary's setcap cap_net_bind_service=+ep lives in the security.capability xattr; plain cp, moves across a filesystem that doesn't support xattrs, and any pipeline that round-trips through a tarball without --xattrs all drop it. The binary suddenly returns EACCES on bind(:80) at runtime with no build-time error. Use cp -a / rsync -X and re-run getcap in CI; for long-running services, prefer systemd's AmbientCapabilities= (with CapabilityBoundingSet= and NoNewPrivileges=true) so the privilege is granted at process start and the binary on disk stays plain — package upgrades won't quietly de-cap it.
  12. du is slow on huge trees, and df and du don't see the same thing. du walks the directory tree and sums file sizes; modern parallel rewrites (dust in Rust, gdu in Go) are noticeably faster on NVMe and render a visual tree without a | sort -rh | head pipeline. But neither tool fixes the classic df > du mystery, which is almost always a deleted-but-still-open file (lsof +L1 lists them) whose blocks stay allocated until the holding process closes the fd. Restart the process, or truncate via /proc/<pid>/fd/<n>.
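
The deleted-but-still-open state behind the df and du mismatch is easy to reproduce. The sketch below assumes a POSIX system (Windows normally refuses to delete an open file); unlinked_but_open_demo is a hypothetical name used only for illustration.

```python
import os
import tempfile


def unlinked_but_open_demo() -> dict:
    """Delete a file's only name while holding it open, then inspect the inode."""
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 4096)
        os.unlink(path)                  # the directory entry is gone: du cannot see it
        st = os.fstat(fd)                # the inode is still alive through the descriptor
        return {
            "name_exists": os.path.exists(path),
            "link_count": st.st_nlink,   # zero names left
            "size": st.st_size,          # data is still allocated, so df counts it
        }
    finally:
        os.close(fd)                     # only now can the filesystem free the blocks


if __name__ == "__main__":
    print(unlinked_but_open_demo())
```

A link count of zero with a non-zero size is the same state that lsof +L1 reports for a running process.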

Where to go next

Sibling concepts and the tool-specific cheat sheets where the filesystem layer shows up day to day.

  • /sections/os/filesystems — the deep-dive on inodes, links, journaling, and a head-to-head comparison of ext4, XFS, Btrfs, ZFS, APFS, and NTFS with mount options.
  • /sections/linux/permissions — mode bits, chmod, chown, setuid/setgid, sticky bit, umask, POSIX ACLs, NFSv4 ACLs, and the five Linux capability sets with setcap / getcap.
  • /sections/linux/df-du-duf — free-space and directory-size accounting with df, du, duf, and the modern parallel walkers dust (Rust) and gdu (Go), including the df/du mismatch story.
  • /sections/linux/systemd-units — service unit files including AmbientCapabilities=, CapabilityBoundingSet=, NoNewPrivileges=, and UMask= for filesystem-scoped service hardening.
  • /sections/linux/find — expression-based traversal of the filesystem with -name, -type, -mtime, -perm, -exec.
  • /sections/osx/xattr — macOS extended attributes: Gatekeeper quarantine, Spotlight metadata, ACL bridging.
  • /sections/osx/mdfind — Spotlight CLI for indexed-volume search across the macOS filesystem.
  • /sections/windows/attrib — NTFS/FAT attribute bits (H, S, R, A, L) and the cmd builtin that toggles them.
  • /sections/windows/robocopy — robust copy/mirror with ACL preservation, retries, and multithreading.
  • /sections/python/pathlib — object-oriented filesystem paths in Python, with cross-platform separator handling.
  • /sections/javascript/node-fs — Node.js fs / fs.promises for reads, writes, atomic renames, and watchers.

Sources

References consulted while writing this concept page.

  • Linux man-pages: xattr(7) — Canonical reference for the four xattr namespaces (user, trusted, security, system) and how POSIX ACLs piggy-back on system.
  • Evan Jones — Durability and Linux File APIs — Walk-through of why write() is not durable, why the directory fsync matters, and the write-temp + rename recipe used in the Common pitfalls section.
  • LWN: Filesystems and case-insensitivity — Kernel-side discussion of per-directory case-folding on ext4 and the cross-OS portability problems it tries to address.
  • Microsoft Dev Blogs — Per-directory case sensitivity and WSL — Authoritative source on fsutil file setCaseSensitiveInfo and the NTFS / WSL case-sensitivity bridge referenced in the comparison table.
  • SEI CERT POS35-C — Avoid symlink race conditions — Foundational guidance for the TOCTOU pitfall: use O_NOFOLLOW and the openat/fstatat family rather than re-resolving a path.
  • Linux Kernel docs — ext4 extended attributes — On-disk layout of xattrs and how ACL data is stored, confirming the mode-bits / POSIX-ACL / xattr layering described in How it works.
  • WunderTech — Btrfs vs ZFS comparison (2026) — Modern, current snapshot/CoW comparison that informed the journaling vs copy-on-write contrast and the "killer feature" column in the comparison table.
  • Steamsprocket — POSIX file semantics in Windows — Background on FILE_FLAG_POSIX_SEMANTICS and why NTFS-with-Windows-API behaves case-insensitively even though the on-disk filesystem is not.
  • Linux man-pages: capabilities(7) — Canonical reference for the five capability sets (permitted, effective, inheritable, bounding, ambient) and the security.capability xattr that backs file capabilities.
  • systemd.exec — AmbientCapabilities & CapabilityBoundingSet — Authoritative reference for granting capabilities to services without touching the binary on disk, plus NoNewPrivileges= and UMask= for service hardening.
  • Linux man-pages: nfs4_setfacl(1) — The NFSv4 ACL grammar (per-ACE allow/deny with rntcyD flags) that POSIX setfacl/getfacl cannot read, and the nfs4-acl-tools package that does.
  • Red Hat — Why setfacl/getfacl don't work on NFSv4 — The silent-empty-output trap referenced in the pitfalls section, with the workaround chain through nfs4-acl-tools.
  • bootandy/dust — Rust du alternative — Parallel directory walker that renders a colour bar-chart tree and respects .gitignore; the modern one-shot replacement for du | sort -rh | head.
  • dundee/gdu — Go du with TUI and CLI modes — Parallel walker with both ncdu-style interactive UI and a non-interactive mode for scripting; typically the fastest of the alternatives on NVMe.