Filesystems
The OS-layer abstraction that turns a flat block device into named, permissioned, hierarchical files — covering inodes, paths, permissions, journaling, and copy-on-write.
Definition
A filesystem is the operating-system layer that turns a flat array of disk sectors into a hierarchical namespace of named, permissioned files and directories. It owns the on-disk data structures (inodes, directory tables, block maps, journals) and the kernel code that lets open, read, write, rename, and stat behave like operations on a tree of objects rather than offsets into a raw device. Every persistent-data operation a program issues — from print("hello") redirected to a log to a database flushing a transaction — ultimately hits a filesystem implementation: ext4, XFS, Btrfs, ZFS, APFS, NTFS, ReFS, or a network one such as NFS or SMB.
Why it matters
Almost every persistence bug, "where did my disk space go?" mystery, and surprising-deletion incident traces back to a filesystem detail that the language-level API hid from you. A mv is instant within one filesystem but secretly a full copy across two. A df and du mismatch is usually a held-open deleted file. A web app that writes config and immediately reboots can lose data because no one called fsync. A "shouldn't this work?" path with mixed case fails on Linux and works on macOS and Windows. A daemon running as a different user can't read its own log because the parent directory's permission bits lack the execute (search) bit for that user. Understanding what the filesystem actually does (inodes, atomic operations, durability, permissions, case rules) is the difference between guessing and knowing.
How it works
At the bottom is a block device: a numbered sequence of fixed-size sectors. The filesystem superimposes three on-disk structures on top of it:
- Inodes are fixed-size records (256–1024 bytes on modern filesystems) holding everything about a file except its name: mode bits, owner UID/GID, size, timestamps (atime, mtime, ctime, sometimes btime), link count, and pointers (direct, indirect, or extent-based) to the data blocks. The inode is the file. Every operation that talks about "the file" (stat, chmod, truncate, fstat, read) ultimately works through an inode.
- Directories are themselves just files whose contents are a table of (name → inode-number) entries. A file's name is not a property of the file; it lives in the parent directory's table. This is why a single inode can have many names (hard links), why mv inside one filesystem is just two directory updates and one inode unchanged, and why deleting a file decrements the link count rather than freeing the data: the data lives until the last name and the last open file descriptor are gone.
- A journal or log (ext4, XFS, NTFS) or a copy-on-write tree (Btrfs, ZFS, APFS) is the mechanism that keeps metadata consistent after a crash. Journaling writes intended changes to a log first, then applies them; on crash the log is replayed. Copy-on-write never overwrites a live block: new versions go to fresh locations and a single atomic pointer-swap commits them, which is what makes instant snapshots cheap.
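The split between names and inodes can be observed from user space. The following is a minimal Python sketch, assuming a POSIX system whose temporary directory supports hard links; the function name and the file names `report.txt` and `alias.txt` are hypothetical, chosen only for illustration.

```python
import os
import tempfile


def demo_names_vs_inode() -> dict:
    """Show that names live in directories while the inode holds the file."""
    with tempfile.TemporaryDirectory() as d:
        first = os.path.join(d, "report.txt")   # hypothetical name
        second = os.path.join(d, "alias.txt")   # hypothetical second name

        with open(first, "w") as f:
            f.write("hello")

        # A hard link adds a second directory entry for the same inode.
        os.link(first, second)
        st_first = os.stat(first)
        st_second = os.stat(second)

        # Keep a descriptor open, then remove every name.
        fd = os.open(first, os.O_RDONLY)
        os.unlink(first)
        os.unlink(second)

        result = {
            "same_inode": st_first.st_ino == st_second.st_ino,
            "links_while_linked": st_first.st_nlink,
            # Typically 0 on Linux once no names remain.
            "links_after_unlink": os.fstat(fd).st_nlink,
            # The data is still readable through the open descriptor.
            "data_after_unlink": os.read(fd, 5),
        }
        os.close(fd)  # only now can the filesystem free the blocks
        return result


if __name__ == "__main__":
    print(demo_names_vs_inode())
```

The same mechanism explains a deleted log file that keeps consuming space: the space is returned only when the last descriptor is closed.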
A path is a slash-separated walk down the tree. Each component is looked up in its parent directory's table; the walk ends at an inode (or fails). Mount points graft another filesystem's root onto a directory in the current tree, which is why mv across mounts is cp + rm (different inode spaces) and a df for /home and /var can report independent free space.
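Whether a move will be a cheap rename or a full copy can be estimated before attempting it. The sketch below is one possible approach in Python, assuming a POSIX system; `same_filesystem` and `move_with_report` are hypothetical helper names. Comparing `st_dev` is only a hint (bind mounts and Btrfs subvolumes can disagree with it), so the code still treats `EXDEV` from the rename itself as the authoritative signal.

```python
import errno
import os
import shutil


def same_filesystem(src: str, dst_dir: str) -> bool:
    """Hint only: True when both paths report the same device ID."""
    return os.stat(src).st_dev == os.stat(dst_dir).st_dev


def move_with_report(src: str, dst: str) -> str:
    """Try an atomic rename; fall back to copy + delete only on EXDEV.

    Returns "renamed" or "copied" so the caller knows which path ran.
    """
    try:
        os.rename(src, dst)
        return "renamed"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # Different filesystem: this is a copy followed by a delete,
        # so it is neither atomic nor instant.
        shutil.move(src, dst)
        return "copied"
```

Returning which branch ran makes the loss of atomicity visible to the caller instead of hiding it behind a wrapper.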
Three guarantees programs lean on:
- Atomic rename within one filesystem: rename(old, new) either succeeds or leaves both unchanged; readers never see a half-renamed name. This is the backbone of the write-temp + rename idiom for atomic config updates.
- Durability is not free: a successful write() only places data in the kernel's page cache. To survive a power loss the program must call fsync(fd) on the file and fsync(dir_fd) on the directory that holds the new name; modern ext4 and XFS implement this with a cache flush plus an FUA (force-unit-access) journal write.
- Permissions gate every operation: POSIX mode bits (rwx for user/group/other) are the baseline; POSIX.1e ACLs add per-user/per-group entries and are stored as extended attributes in the system xattr namespace; xattrs themselves store arbitrary name=value metadata (Gatekeeper's com.apple.quarantine, SELinux contexts, file capabilities). Linux capabilities split root's powers into ~40 discrete units (CAP_NET_BIND_SERVICE, CAP_CHOWN, CAP_SYS_ADMIN) that ride in the security.capability xattr; a binary tagged with setcap cap_net_bind_service=+ep can bind to port 80 without ever running setuid-root. NFSv4 mounts use a richer Windows-style ACL grammar (per-ACE allow/deny, fine-grained rntcyD permissions, inheritance flags) that does not interoperate with POSIX setfacl/getfacl. Windows NTFS uses a completely different model (ACL-based DACLs with deny rules and inheritance) exposed through icacls rather than chmod.
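The first two guarantees combine into the write-temp + fsync + rename + fsync(dir) recipe. Here is a minimal Python sketch of that recipe, assuming a POSIX system (opening a directory to fsync it does not work on Windows); `atomic_write` and the `.tmp-` prefix are hypothetical names, not part of any library.

```python
import contextlib
import os
import tempfile


def atomic_write(path: str, data: bytes) -> None:
    """Replace `path` with `data` so readers see the old or new file, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))

    # The temp file must live in the same directory (hence the same
    # filesystem) as the target, otherwise the rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()              # Python buffer -> kernel page cache
            os.fsync(tmp.fileno())   # page cache -> stable storage
        os.replace(tmp_path, path)   # atomic rename over the target
    except BaseException:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise

    # Persist the directory entry that now points at the new inode.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that `mkstemp` creates the file with mode 0600, so a caller that needs the original permissions would have to apply them to the temp file before the rename.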
Where filesystems differ in practice:
| Filesystem | Crash safety | Case rule | Killer feature |
|---|---|---|---|
| ext4 | Metadata journal (data=ordered) | Sensitive | Boring, fast, ubiquitous on Linux |
| XFS | Metadata journal, delayed allocation | Sensitive | Throughput on large files, RHEL default |
| Btrfs | Copy-on-write | Sensitive | Subvolumes, writable snapshots, send/receive |
| ZFS | Copy-on-write, end-to-end checksums | Sensitive | Self-healing on mirrors, ARC, datasets |
| APFS | Copy-on-write | Insensitive (preserving) by default; sensitive opt-in | Atomic snapshots, clones, native crypto |
| NTFS | Metadata journal | Insensitive (preserving) by default; per-dir sensitive via fsutil | DACL ACLs, alternate data streams, reparse points |
| FAT32 / exFAT | None | Insensitive | Lingua franca for USB sticks; no permissions, no journal |
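Because the case rule is a property of the mounted filesystem (and on some systems of an individual directory), it is safer to probe it than to infer it from the operating system. The sketch below shows one way to do that in Python; `is_case_sensitive` and the `case-probe-` prefix are hypothetical names.

```python
import os
import tempfile


def is_case_sensitive(directory: str) -> bool:
    """Probe whether names inside `directory` are matched case-sensitively."""
    # mkstemp generates lower-case random names, and the prefix is
    # lower-case too, so the upper-cased name is a different byte string.
    fd, probe = tempfile.mkstemp(
        dir=directory, prefix="case-probe-", suffix=".tmp"
    )
    os.close(fd)
    try:
        head, name = os.path.split(probe)
        # If the upper-cased name resolves, lookups ignore case here.
        return not os.path.exists(os.path.join(head, name.upper()))
    finally:
        os.unlink(probe)


if __name__ == "__main__":
    print(is_case_sensitive(tempfile.gettempdir()))
```

The result describes only the directory that was probed; another mount, or another directory on NTFS or ext4 with per-directory case folding, may answer differently.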
Common pitfalls
- Assuming write() is durable. It isn't: the kernel page cache can hold the bytes for seconds. Use the write temp + fsync + rename + fsync(dir) pattern for crash-safe updates: write to config.tmp, fsync(config.tmp), rename(config.tmp, config), fsync(parent_dir). Skip any step and a poorly timed power loss can leave you with old data, new data, or a zero-byte file.
- mv across filesystems silently turns into cp + rm. It loses atomicity and may take minutes instead of milliseconds. The bare rename(2) syscall fails with EXDEV in this case; Python's shutil.move falls back to a copy behind your back, while thin wrappers such as Python's os.rename and Node's fs.rename surface the error. Always check whether source and destination are on the same mount.
- Case rules differ across hosts. A repo with both README.md and readme.md checks out fine on Linux/ext4 (case-sensitive), loses one of the two files on macOS/APFS (case-insensitive by default, so the second file overwrites the first), and breaks Windows tooling in confusing ways. Pick one case and stick with it; for cross-platform projects, treat the filesystem as case-insensitive even when it is not.
- Running out of inodes with disk space to spare. Each ext4 filesystem is created with a fixed inode count; lots of tiny files (mail spools, npm node_modules, web caches) can exhaust them while df shows plenty of free blocks. df -i reports inode usage; if you are near the ceiling, the only fix short of deleting files is mkfs with a higher inode density.
- df and du disagree. Almost always a deleted-but-still-open file. The directory entry is gone (so du misses it) but the inode and blocks stay reserved until the process closes the descriptor (so df still counts them). lsof | grep deleted finds the culprit; restart or close the holding process to actually free the space.
- TOCTOU on path lookups. Code that does stat(path) then open(path) lets an attacker swap a symlink in between. Use the openat/fstatat family with AT_SYMLINK_NOFOLLOW, or open(O_NOFOLLOW) so the syscall both checks and opens atomically. Never trust a name twice.
- Forgetting that extended attributes don't travel. APFS and HFS+ store macOS xattrs natively; plain cp, scp, tar without --xattrs, and any trip through FAT32 or many SMB shares silently drop them. Use cp -p, rsync -X, or ditto -ck when Gatekeeper flags, Spotlight metadata, or ACLs matter.
- Path encoding mismatches. Linux paths are byte strings (any non-NUL byte is legal); macOS applies Unicode normalisation rules (HFS+ stored names in an NFD variant, while APFS preserves the bytes but treats differently normalised spellings as the same name); Windows is UTF-16. A filename that round-trips through email or a zip from another OS can come back with a different byte sequence and stop matching by ==.
- MAX_PATH on Windows. The 260-character ceiling still bites tools that haven't opted in via the LongPathsEnabled registry key or a per-app manifest, even on modern Windows 10/11. Deep node_modules trees and time-stamped log directories trip it constantly; many builtins (including mkdir and del) ignore the \\?\ long-path prefix.
- setfacl/getfacl returning empty on NFS. Those tools only speak POSIX.1e ACLs. NFSv4 mounts carry a completely different ACL grammar; running getfacl /mnt/nfs/file shows only the base mode bits with no extended entries even though the server has a rich ACE list set. Install nfs4-acl-tools and use nfs4_getfacl/nfs4_setfacl on those mounts; the silent empty output is the trap. A related trap is setfacl -m's mask auto-recalculation: every grant recomputes the mask as the union of named-user/named-group/owning-group, so a deliberately tight m::r quietly widens to rwx the next time you grant u:alice:rwx unless you pass -n (--no-mask).
- cp and rsync silently strip file capabilities. A binary's setcap cap_net_bind_service=+ep lives in the security.capability xattr; plain cp or rsync (without -X), moves across a filesystem that doesn't support xattrs, and any pipeline that round-trips through a tarball without --xattrs all drop it. The binary suddenly returns EACCES on bind(:80) at runtime with no build-time error. Use cp -a / rsync -X and re-run getcap in CI; for long-running services, prefer systemd's AmbientCapabilities= (with CapabilityBoundingSet= and NoNewPrivileges=true) so the privilege is granted at process start and the binary on disk stays plain, and package upgrades won't quietly de-cap it.
- du is slow on huge trees, and df and du don't see the same thing. du walks the directory tree and sums file sizes; modern parallel rewrites (dust in Rust, gdu in Go) are noticeably faster on NVMe and render a visual tree without a | sort -rh | head pipeline. But neither tool fixes the classic df > du mystery: that is usually a deleted-but-still-open file (lsof +L1 lists them) whose blocks stay allocated until the holding process closes the fd. Restart the process, or truncate via /proc/<pid>/fd/<n>.
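The TOCTOU pitfall above comes down to checking the object you actually opened rather than the name. The following is a minimal Python sketch of that idea, assuming a POSIX system where `os.O_NOFOLLOW` is available; `open_regular_nofollow` is a hypothetical helper name. `O_NOFOLLOW` only refuses a symlink in the final path component, so it does not protect against swapped parent directories.

```python
import os
import stat


def open_regular_nofollow(path: str) -> int:
    """Open `path` read-only without following a final symlink.

    The file type is checked with fstat on the descriptor, so the check
    and the open refer to the same inode. Returns the open descriptor.
    """
    # O_NONBLOCK keeps the open from hanging if the name is a FIFO.
    flags = os.O_RDONLY | os.O_NOFOLLOW | os.O_NONBLOCK | os.O_CLOEXEC
    fd = os.open(path, flags)  # raises OSError if `path` is a symlink
    try:
        if not stat.S_ISREG(os.fstat(fd).st_mode):
            raise ValueError(f"{path!r} is not a regular file")
        return fd
    except BaseException:
        os.close(fd)
        raise
```

The caller owns the returned descriptor and should close it, for example by wrapping it with `os.fdopen(fd, "rb")` in a `with` block.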
Where to go next
Sibling concepts and the tool-specific cheat sheets where the filesystem layer shows up day to day.
- /sections/os/filesystems: the deep-dive on inodes, links, journaling, and a head-to-head comparison of ext4, XFS, Btrfs, ZFS, APFS, and NTFS with mount options.
- /sections/linux/permissions: mode bits, chmod, chown, setuid/setgid, sticky bit, umask, POSIX ACLs, NFSv4 ACLs, and the five Linux capability sets with setcap/getcap.
- /sections/linux/df-du-duf: free-space and directory-size accounting with df, du, duf, and the modern parallel walkers dust (Rust) and gdu (Go), including the df/du mismatch story.
- /sections/linux/systemd-units: service unit files including AmbientCapabilities=, CapabilityBoundingSet=, NoNewPrivileges=, and UMask= for filesystem-scoped service hardening.
- /sections/linux/find: expression-based traversal of the filesystem with -name, -type, -mtime, -perm, -exec.
- /sections/osx/xattr: macOS extended attributes: Gatekeeper quarantine, Spotlight metadata, ACL bridging.
- /sections/osx/mdfind: Spotlight CLI for indexed-volume search across the macOS filesystem.
- /sections/windows/attrib: NTFS/FAT attribute bits (H, S, R, A, L) and the cmd builtin that toggles them.
- /sections/windows/robocopy: robust copy/mirror with ACL preservation, retries, and multithreading.
- /sections/python/pathlib: object-oriented filesystem paths in Python, with cross-platform separator handling.
- /sections/javascript/node-fs: Node.js fs/fs.promises for reads, writes, atomic renames, and watchers.
Sources
References consulted while writing this concept page.
- Linux man-pages: xattr(7). Canonical reference for the four xattr namespaces (user, trusted, security, system) and how POSIX ACLs piggy-back on system.
- Evan Jones, Durability and Linux File APIs. Walk-through of why write() is not durable, why the directory fsync matters, and the write-temp + rename recipe used in the Common pitfalls section.
- LWN: Filesystems and case-insensitivity. Kernel-side discussion of per-directory case-folding on ext4 and the cross-OS portability problems it tries to address.
- Microsoft Dev Blogs, Per-directory case sensitivity and WSL. Authoritative source on fsutil file setCaseSensitiveInfo and the NTFS / WSL case-sensitivity bridge referenced in the comparison table.
- SEI CERT POS35-C, Avoid symlink race conditions. Foundational guidance for the TOCTOU pitfall: use O_NOFOLLOW and the openat/fstatat family rather than re-resolving a path.
- Linux Kernel docs, ext4 extended attributes. On-disk layout of xattrs and how ACL data is stored, confirming the mode-bits / POSIX-ACL / xattr layering described in How it works.
- WunderTech, Btrfs vs ZFS comparison (2026). Modern, current snapshot/CoW comparison that informed the journaling vs copy-on-write contrast and the "killer feature" column in the comparison table.
- Steamsprocket, POSIX file semantics in Windows. Background on FILE_FLAG_POSIX_SEMANTICS and why NTFS-with-Windows-API behaves case-insensitively even though the on-disk filesystem is not.
- Linux man-pages: capabilities(7). Canonical reference for the five capability sets (permitted, effective, inheritable, bounding, ambient) and the security.capability xattr that backs file capabilities.
- systemd.exec, AmbientCapabilities & CapabilityBoundingSet. Authoritative reference for granting capabilities to services without touching the binary on disk, plus NoNewPrivileges= and UMask= for service hardening.
- Linux man-pages: nfs4_setfacl(1). The NFSv4 ACL grammar (per-ACE allow/deny with rntcyD flags) that POSIX setfacl/getfacl cannot read, and the nfs4-acl-tools package that does.
- Red Hat, Why setfacl/getfacl don't work on NFSv4. The silent-empty-output trap referenced in the pitfalls section, with the workaround chain through nfs4-acl-tools.
- bootandy/dust, Rust du alternative. Parallel directory walker that renders a colour bar-chart tree and respects .gitignore; the modern one-shot replacement for du | sort -rh | head.
- dundee/gdu, Go du with TUI and CLI modes. Parallel walker with both ncdu-style interactive UI and a non-interactive mode for scripting; typically the fastest of the alternatives on NVMe.