cheat sheet
Filesystems
Core filesystem concepts every operator should know: inodes, directory structure, hard vs symbolic links, journaling, copy-on-write, and a head-to-head of ext4, XFS, Btrfs, ZFS, APFS, and NTFS with mount options and pitfalls.
Filesystems — Inodes, Links, Journaling, ext4/XFS/Btrfs/ZFS/APFS/NTFS
What it is
A filesystem is the layer between the block device and "files and directories" — the on-disk data structures and the kernel code that gives a flat array of sectors the shape of a hierarchical, named, permissioned namespace. Every Unix filesystem revolves around the inode: a fixed-size record describing one file's metadata (mode, owner, size, timestamps, block pointers) without any name attached. Directories are simply files whose contents are name-to-inode tables, which is why hard links, mount points, and case-sensitivity all behave the way they do. Reach for this article when something has gone strange with disk usage, when a df and du mismatch appears, when a cp runs slower than expected, or when you need to pick a filesystem for new storage.
Inodes — the on-disk file record
An inode is a fixed-size data structure (256–1024 bytes on modern filesystems) that records every property of a file except its name. Directories store name-to-inode pairs; the inode itself contains the mode, owner UID/GID, size, timestamps (atime, mtime, ctime, sometimes btime), link count, and pointers (direct, indirect, or extent-based) to the data blocks. When you delete a file you decrement its link count; when the count reaches zero and no process has the file open, the kernel frees the inode and its data blocks.
# Show inode number and metadata for a file
stat /etc/hostname
Output:
File: /etc/hostname
Size: 7 Blocks: 8 IO Block: 4096 regular file
Device: 803h/2051d Inode: 1572873 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2026-05-25 09:14:02.000000000 -0400
Modify: 2026-04-30 12:00:00.000000000 -0400
Change: 2026-04-30 12:00:00.000000000 -0400
Birth: 2026-04-30 12:00:00.000000000 -0400
# List inode numbers in a directory
ls -li /etc | head -5
Output:
1572873 -rw-r--r-- 1 root root 7 Apr 30 12:00 hostname
1572874 -rw-r--r-- 1 root root 264 Apr 30 12:00 hosts
1572880 drwxr-xr-x 2 root root 4096 May 24 10:00 cron.daily
1572881 lrwxrwxrwx 1 root root 21 Apr 30 12:00 localtime -> /usr/share/zoneinfo/UTC
1572883 -rw-r--r-- 1 root root 604 Apr 30 12:00 fstab
A file's name is not a property of the file itself; it lives in the parent directory's name table. That is why mv within a single filesystem is just a directory update (instant, atomic) while mv across filesystems is a full cp plus rm (slow, non-atomic, observable to readers mid-copy).
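The rename behaviour described above can be checked with a small sketch. It assumes GNU coreutils (stat -c) and a scratch directory from mktemp; the file names are made up for the demo.

```shell
# Sketch: a same-filesystem mv rewrites the directory entry, not the inode.
# Assumes GNU stat (-c %i prints the inode number).
workdir=$(mktemp -d)
echo "payload" > "$workdir/before.txt"

inode_before=$(stat -c %i "$workdir/before.txt")
mv "$workdir/before.txt" "$workdir/after.txt"
inode_after=$(stat -c %i "$workdir/after.txt")

echo "before=$inode_before after=$inode_after"   # the two numbers match
rm -rf "$workdir"
```

A cross-filesystem mv would instead produce a new inode on the destination, because the data has to be copied.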
Inode exhaustion
ext4 and its predecessors are created with a fixed number of inodes, so you can run out of inodes long before the disk is full, especially on volumes holding many small files (cache directories, mail spools, node_modules). df -i reports inode usage separately from block usage.
df -i /
Output:
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/nvme0n1p2 6553600 412034 6141566 7% /
If IUse% is near 100% you cannot create a new file even with terabytes of free space — the error is No space left on device. The fix is either to delete small files, switch to a filesystem that allocates inodes dynamically (XFS, Btrfs, ZFS), or remake the filesystem with mkfs.ext4 -N <count>.
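For monitoring, the IUse% column can be pulled out of df -i with a few lines of shell. This is a minimal sketch, assuming a df that accepts -i and -P; the helper name iuse_pct and the 90% threshold are illustrative choices, not part of any tool.

```shell
# Sketch: extract IUse% from df -iP output and compare it with a threshold.
iuse_pct() {
  # Reads df -iP output on stdin; prints column 5 of the first data row
  # without the % sign. Some filesystems report "-" instead of a number.
  awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

threshold=90   # arbitrary example value
pct=$(df -iP / | iuse_pct)

case "$pct" in
  ''|*[!0-9]*)
    echo "/: inode usage not reported" ;;
  *)
    if [ "$pct" -ge "$threshold" ]; then
      echo "/: inode usage ${pct}% is at or above ${threshold}%"
    else
      echo "/: inode usage ${pct}% is below ${threshold}%"
    fi ;;
esac
```

Filesystems that allocate inodes dynamically may show "-" or 0 here, which is why the sketch treats a non-numeric value as "not reported" rather than as an error.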
Directory structure
A directory is a special file whose contents are a list of (name, inode_number) pairs. The shape of that list differs by filesystem: ext4 uses HTrees (hashed B-trees) for fast lookup on large directories; XFS and Btrfs use B+ trees natively; ZFS uses extensible hashing. On every modern filesystem you can keep millions of files in one directory without the historical slowdown ext2 and HFS+ suffered.
# Count entries in a directory
ls -1A /etc | wc -l
# Find directories with the most entries
sudo find / -xdev -type d -exec sh -c 'echo "$(ls -A "$1" | wc -l) $1"' _ {} \; 2>/dev/null | sort -rn | head -5
Output:
182
54392 /var/lib/dpkg/info
4112 /usr/share/doc
1893 /usr/lib/x86_64-linux-gnu
512 /usr/bin
411 /etc
[!WARN] Even with HTree indexing, enumerating a million-entry directory still costs O(n) and stats every inode in turn; that is the slow part of ls, rm -rf, and tab completion. Shard large data sets into a two- or three-level hash tree (e.g. /cache/ab/cd/abcdef.dat) rather than dumping everything at the top.
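The sharding layout in the warning above can be sketched as a small helper. It assumes the shard levels come from the first four characters of the file name, which matches the /cache/ab/cd/abcdef.dat example; shard_path and the cache root are hypothetical names.

```shell
# Sketch: build a two-level shard path from the first four characters of a name.
shard_path() {
  cache_root=$1
  name=$2
  level1=$(printf '%s' "$name" | cut -c1-2)   # characters 1-2
  level2=$(printf '%s' "$name" | cut -c3-4)   # characters 3-4
  printf '%s/%s/%s/%s\n' "$cache_root" "$level1" "$level2" "$name"
}

shard_path /cache abcdef.dat   # prints /cache/ab/cd/abcdef.dat
```

Name prefixes only spread files evenly when the names themselves are well distributed (content hashes, random IDs). For human-chosen names, hashing the name first and sharding on the digest is the more usual choice.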
Hard vs symbolic links
Unix supports two distinct link types and confusing them is one of the top filesystem pitfalls. A hard link is an additional name pointing at the same inode — there is no "original" once one is made; all hard links are equal peers and the file vanishes only when the last is removed. A symbolic link (symlink, soft link) is its own inode whose contents are a path string the kernel follows at lookup time — it can point at anything, including a path that doesn't exist or one on another filesystem, but it adds a level of indirection and is itself a file you can stat.
| Property | Hard link | Symbolic link |
|---|---|---|
| Own inode | No — shares target's inode | Yes |
| Crosses filesystems | No | Yes |
| Links to a directory | Generally no (root + special tools only) | Yes |
| Survives target deletion | Yes (link count keeps file alive) | Becomes dangling |
| Visible as a link in ls -l | No (looks like a regular file) | Yes (l type, -> arrow) |
| Pointed-to path | N/A — same inode | Stored string |
| Tooling | ln target name | ln -s target name |
# Create both kinds
echo "hello" > /tmp/source.txt
ln /tmp/source.txt /tmp/hardlink.txt # hard link
ln -s /tmp/source.txt /tmp/symlink.txt # symbolic link
# Compare
ls -li /tmp/source.txt /tmp/hardlink.txt /tmp/symlink.txt
Output:
1810023 -rw-r--r-- 2 alice staff 6 May 25 09:14 /tmp/source.txt
1810023 -rw-r--r-- 2 alice staff 6 May 25 09:14 /tmp/hardlink.txt
1810024 lrwxrwxrwx 1 alice staff 15 May 25 09:14 /tmp/symlink.txt -> /tmp/source.txt
Note that source.txt and hardlink.txt share inode 1810023 and both report a link count of 2. The symlink has its own inode (1810024) and shows the target path.
# Delete the original — hard link survives, symlink dangles
rm /tmp/source.txt
cat /tmp/hardlink.txt # works
cat /tmp/symlink.txt # error
Output:
hello
cat: /tmp/symlink.txt: No such file or directory
Use hard links when you want the file to persist as long as any name references it: backup snapshots, deduplicated trees (cp -al, rsync --link-dest). Use symlinks when you want a path-shaped reference that updates when you replace the target (e.g. /usr/bin/python -> python3).
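One common use of the "path-shaped reference" property is repointing a symlink at a new target in a single step. The sketch below assumes GNU mv (the -T flag) and uses made-up directory names v1, v2, and current.

```shell
# Sketch: swap a "current" symlink from v1 to v2 without a window where it is missing.
# Assumes GNU mv; -T treats the destination as a plain name, so the existing
# link is replaced instead of the new link being moved into the directory it points at.
deploy=$(mktemp -d)
mkdir "$deploy/v1" "$deploy/v2"
ln -s v1 "$deploy/current"

ln -s v2 "$deploy/current.tmp"                  # build the new link under a temporary name
mv -T "$deploy/current.tmp" "$deploy/current"   # a rename replaces the old link in one step

current_target=$(readlink "$deploy/current")
echo "current -> $current_target"               # prints: current -> v2
rm -rf "$deploy"
```

Removing the old link and then creating the new one would also work, but it leaves a brief moment where the path does not resolve.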
Timestamps
Inodes track several timestamps; understanding which one updates when is essential for find -mtime, make, and incremental backups.
| Timestamp | Updated when |
|---|---|
| atime | The file's contents are read (subject to relatime/noatime) |
| mtime | The file's contents are modified |
| ctime | The file's metadata changes (mode, owner, link count) or contents change |
| btime (birth time) | The file is created; never updated (ext4, XFS, Btrfs, ZFS, APFS) |
Most modern Linux mounts use relatime (the default since 2.6.30): atime is updated only if the existing atime is older than the current mtime/ctime or older than 24 hours. Use noatime to skip atime updates entirely for read-heavy workloads — a measurable speedup for busy mail servers and caches.
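The difference between mtime and ctime is easy to observe directly. This sketch assumes GNU touch (-d) and stat (-c, where %Y is mtime and %Z is ctime in epoch seconds); the file name is arbitrary.

```shell
# Sketch: a metadata change moves ctime only; a content change moves mtime too.
tsdir=$(mktemp -d)
f="$tsdir/demo.txt"
echo "v1" > "$f"
touch -d '2020-01-01 00:00:00' "$f"    # push mtime into the past; ctime becomes "now"

mtime_start=$(stat -c %Y "$f")
chmod 600 "$f"                          # metadata-only change
mtime_after_chmod=$(stat -c %Y "$f")
ctime_after_chmod=$(stat -c %Z "$f")

echo "v2" >> "$f"                       # content change
mtime_after_write=$(stat -c %Y "$f")

echo "mtime: $mtime_start -> $mtime_after_chmod -> $mtime_after_write"
echo "ctime after chmod: $ctime_after_chmod"
rm -rf "$tsdir"
```

The first two mtime values are identical, while the third jumps to the time of the append. That behaviour is what find -mtime and make depend on.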
Journaling
A journal is a small region of the filesystem where the kernel records pending metadata changes (or full data) before applying them. On a crash the journal is replayed during the next mount, so the filesystem comes back consistent without a slow fsck pass. ext3, ext4, XFS, and NTFS all journal metadata; ZFS and Btrfs rely on copy-on-write instead, which reaches the same goal by never overwriting live blocks, so a crash leaves the previous consistent tree intact.
| Mode (ext4) | What's logged | Trade-off |
|---|---|---|
| data=journal | Both metadata and data | Slowest; strongest consistency |
| data=ordered (default) | Metadata, but data blocks written before commit | Default; safe and fast |
| data=writeback | Metadata only; data write ordering is not enforced | Fastest; brief window where metadata refers to stale data after a crash |
# Inspect a mount's journaling mode
mount | grep ' / '
Output:
/dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
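The output above carries no data= entry because ext4 normally lists only options that differ from the filesystem's defaults. A small parser makes that explicit. This is a sketch; ext4_data_mode is a hypothetical helper, and the fallback text assumes the filesystem default has not been changed away from ordered.

```shell
# Sketch: pull the data= journaling mode out of a mount option string.
ext4_data_mode() {
  # $1 is a comma-separated option string such as "rw,relatime,data=writeback"
  mode=$(printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^data=//p')
  # An absent data= entry means the filesystem default is in effect.
  printf '%s\n' "${mode:-default (normally ordered)}"
}

ext4_data_mode "rw,relatime,errors=remount-ro"   # prints: default (normally ordered)
ext4_data_mode "rw,noatime,data=writeback"       # prints: writeback
```

On a live system the option string could come from findmnt -no OPTIONS /.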
[!WARN] Journaling protects against a crash mid-write, not against bit rot, cosmic rays, or controller firmware bugs that ack a write that never reached the platter. For those, you need filesystem-level checksumming (Btrfs, ZFS, ReFS) or block-layer integrity (dm-integrity, T10-DIF).
Copy-on-write (CoW)
Copy-on-write filesystems never overwrite live data; every change writes new blocks and atomically updates the root pointer when the transaction commits. Crashes can only leave the filesystem at the previous consistent state. CoW enables free snapshots (just keep the old root around), free clones (share blocks until written), and end-to-end checksums (every block is fingerprinted on write).
# Take an instant snapshot on Btrfs
sudo btrfs subvolume snapshot /home /home-2026-05-25
# Same on ZFS
sudo zfs snapshot tank/home@2026-05-25
Output:
Create a snapshot of '/home' in '/home-2026-05-25'
The cost is write amplification and fragmentation: a 4 KiB modification to a large file produces a new block and pointer chain rather than updating in place. For database workloads, set nodatacow on Btrfs or recordsize=16K plus logbias=throughput on ZFS to mitigate.
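The "free clones" idea is exposed to everyday tools through reflink copies. The sketch below assumes GNU cp; --reflink=auto asks for a block-sharing clone where the filesystem supports it and quietly falls back to an ordinary copy elsewhere, so it runs on any filesystem. File names are illustrative.

```shell
# Sketch: request a clone, then show that writing to it leaves the source untouched.
clonedir=$(mktemp -d)
printf 'original contents\n' > "$clonedir/source.img"

cp --reflink=auto "$clonedir/source.img" "$clonedir/clone.img"

# Appending to the clone only allocates new blocks for the clone.
printf 'extra line\n' >> "$clonedir/clone.img"
source_lines=$(wc -l < "$clonedir/source.img")
clone_lines=$(wc -l < "$clonedir/clone.img")

echo "source=$source_lines clone=$clone_lines"   # prints: source=1 clone=2
rm -rf "$clonedir"
```

Using --reflink=always instead makes cp fail when a clone is not possible, which is a simple way to find out whether the filesystem supports it.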
Filesystem comparison
| FS | OS | Type | Max file | Max FS | Snapshots | Checksums | Best for |
|---|---|---|---|---|---|---|---|
| ext4 | Linux | Journaled extent-based | 16 TiB | 1 EiB | No (LVM only) | Metadata only | Default Linux root, broad compatibility |
| XFS | Linux | Journaled extent-based | 8 EiB | 8 EiB | No (reflinks since 5.0) | Metadata + CRC headers | Large files, NAS, high-throughput |
| Btrfs | Linux | CoW | 16 EiB | 16 EiB | Yes (subvolumes) | Yes (data + metadata) | Snapshots, RAID 0/1/10, single-disk workstations |
| ZFS | Linux/BSD/Solaris/macOS | CoW + integrated volume manager | 16 EiB | 2^128 bytes (theoretical) | Yes (datasets) | Yes (data + metadata) | Servers, NAS, integrity-critical data |
| APFS | macOS / iOS | CoW | 8 EiB | 8 EiB | Yes | Metadata only | macOS system + user volumes |
| NTFS | Windows | Journaled | 16 EiB | 16 EiB | VSS (volume-level) | Metadata only | Windows system volume |
# Identify a mounted filesystem's type
mount | grep ' /home '
findmnt /home -o FSTYPE,SOURCE,OPTIONS
Output:
FSTYPE SOURCE OPTIONS
ext4 /dev/nvme0n1p3 rw,relatime
When to pick each
- ext4: anywhere you want the default. Boring, fast, well-understood. Good for /, /home on most servers, and any storage that doesn't need snapshots.
- XFS: large single-file workloads (databases, log archives, video). Default on RHEL since 7. Cannot shrink in-place.
- Btrfs: single-disk workstations that want snapshots + send/receive, or simple RAID-1. Default on openSUSE and Fedora Workstation 33+. Avoid RAID-5/6 in production.
- ZFS: multi-disk servers and NAS where data integrity dominates. Outside Linux kernel mainline due to license; install via zfs-dkms.
- APFS: macOS only; nothing to choose, it's automatic since macOS 10.13.
- NTFS: Windows volumes; read on Linux is fine, write through ntfs-3g (FUSE) is acceptable, but ntfs-3g is slow vs. native. macOS reads natively but does not write.
Fragmentation
Fragmentation happens when a file's blocks are scattered across the device rather than contiguous, forcing more seeks per read. On SSDs it matters far less (no head movement), but heavy fragmentation can still increase the number of I/O requests and hurt throughput on cheap drives. Modern Linux filesystems (ext4, XFS, Btrfs) allocate extents — runs of contiguous blocks — which dramatically reduces fragmentation compared to ext2/3.
# Check ext4 fragmentation
sudo e4defrag -c /home
Output:
<Fragmented files> now/best
1. /home/alice/big.tar.gz 12/1
2. /home/alice/.cache/firefox/places.sqlite 7/1
...
Total/best extents 4823/4811
Average size per extent 512 KB
Fragmentation score 0
[0-30 no problem: defragmentation is not needed]
# Defragment one file or a tree
sudo e4defrag /home/alice/big.tar.gz
sudo e4defrag /home/alice
Output:
ext4 defragmentation for /home/alice/big.tar.gz
[1/1]/home/alice/big.tar.gz: 100% extents: 12 -> 1 [ OK ]
Success: [1/1]
Btrfs has btrfs filesystem defragment -r /mountpoint; XFS has xfs_fsr; APFS does it transparently in the background. NTFS uses defrag on Windows.
On an SSD, do not run a defrag tool on a schedule: it generates write amplification with no benefit. Use it only when a specific file or workload shows symptoms (e.g. filefrag reports tens of thousands of extents).
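To act on that symptom in a script, the extent count can be parsed out of filefrag's summary line. This sketch assumes the usual "path: N extents found" format; extent_count is a hypothetical helper, and the sample lines are illustrative rather than captured output.

```shell
# Sketch: extract the extent count from filefrag summary lines read on stdin.
extent_count() {
  # Matches both "1 extent found" and "N extents found".
  sed -n 's/^.*: \([0-9][0-9]*\) extents\{0,1\} found.*$/\1/p'
}

printf '%s\n' '/home/alice/big.tar.gz: 12 extents found' | extent_count   # prints: 12
printf '%s\n' '/home/alice/notes.txt: 1 extent found' | extent_count      # prints: 1
```

On a real file the input would come from filefrag itself, for example filefrag /home/alice/big.tar.gz | extent_count.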
Mount options
The mount command and /etc/fstab accept per-filesystem options that change the trade-off between safety, speed, and behaviour. The big ones are common to most Linux filesystems; the rest are FS-specific.
# Show current mount options for everything
mount | column -t
findmnt -t ext4,xfs,btrfs -o TARGET,SOURCE,FSTYPE,OPTIONS
Output (findmnt):
TARGET SOURCE FSTYPE OPTIONS
/ /dev/nvme0n1p2 ext4 rw,relatime,errors=remount-ro
/home /dev/nvme0n1p3 ext4 rw,relatime,nodev,nosuid
/data /dev/sdb1 xfs rw,relatime,attr2,inode64,logbufs=8
Generic mount options
| Option | Effect |
|---|---|
| ro / rw | Read-only / read-write |
| noatime | Skip atime updates entirely (fastest for reads) |
| relatime | atime updated only if older than mtime/ctime or 24h (Linux default) |
| nodev | Don't allow device files |
| nosuid | Ignore setuid/setgid bits; defence in depth for user-writable mounts |
| noexec | Disallow execute; useful on /tmp, /var/tmp, /dev/shm |
| nofail | Don't block boot if the device is missing (USB drives, NFS) |
| x-systemd.automount | Lazy-mount on first access via systemd |
| sync | Force every write to hit disk before returning (slow; for removable media) |
| discard | Issue TRIM on file deletion (SSD); prefer periodic fstrim instead |
| errors=remount-ro | Switch to read-only on FS error (default for / on ext4) |
# /etc/fstab entry — a hardened /tmp
tmpfs /tmp tmpfs nodev,nosuid,noexec,size=2G 0 0
# A USB drive that should not block boot
UUID=ABCD-1234 /mnt/usb exfat nofail,x-systemd.automount,nodev,nosuid 0 0
Output: (none — fstab configuration only)
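Whether a mount actually carries the hardening options from the fstab entry can be checked by comparing its option string against a required set. A minimal sketch follows; missing_opts is a hypothetical helper and the option strings are examples.

```shell
# Sketch: list which required mount options are absent from an option string.
missing_opts() {
  # $1: comma-separated options in effect; remaining arguments: options that must be present
  have=",$1,"
  shift
  missing=""
  for want in "$@"; do
    case "$have" in
      *",$want,"*) ;;                      # present, nothing to record
      *) missing="$missing $want" ;;
    esac
  done
  printf '%s\n' "${missing# }"             # strip the leading space
}

missing_opts "rw,relatime,nodev,nosuid" nodev nosuid noexec   # prints: noexec
missing_opts "rw,nodev,nosuid,noexec" nodev nosuid noexec     # prints an empty line
```

Wrapping the string in commas before matching avoids false positives such as "exec" matching inside "noexec".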
ext4-specific
| Option | Effect |
|---|---|
| data=ordered | Default; metadata journaled, data ordered before commit |
| data=writeback | Faster; brief stale-data window after crash |
| data=journal | Safest; journals data too, so slowest |
| barrier=1 | Default; force barriers, required for crash safety on caching disks |
| journal_async_commit | Async journal commits; slight speedup, slight risk |
| commit=N | Sync interval in seconds (default 5) |
XFS-specific
| Option | Effect |
|---|---|
| inode64 | Allow inode numbers above 32 bits anywhere on the volume (the default since Linux 3.7) |
| logbufs=N | Number of in-memory log buffers (default 8) |
| largeio | Optimal-I/O hints for large sequential reads |
| noquota | Disable quota accounting |
Btrfs-specific
| Option | Effect |
|---|---|
| subvol=NAME | Mount a specific subvolume as root |
| compress=zstd:3 | Transparent compression (also lzo, zlib) |
| nodatacow | Disable CoW for files created on this mount (DB / VM images); also turns off data checksums and compression. Per file or directory, use chattr +C instead |
| ssd | Enable SSD optimisations (auto-detected for most modern drives) |
| autodefrag | Defragment on write |
Common pitfalls
- df and du disagree: usually a process is holding a deleted file open. Find it with lsof | grep deleted or lsof +L1. See the lsof & ss cheatsheet for the recovery recipe (: > /proc/PID/fd/N).
- Out of inodes despite free disk: check df -i, then migrate small files off or recreate the filesystem with more inodes.
- rm of a "deleted" file doesn't free space: same root cause, an open file descriptor. Kill or signal the holder.
- Symlink loops: ln -s . loop creates one, and recursive tools that follow symlinks can descend into it repeatedly until they hit an ELOOP error or a path-length limit. Use find -L only when you mean it, or pass -maxdepth.
- Hard link across filesystems fails: ln returns Invalid cross-device link. Use ln -s (symbolic link) instead.
- Copying preserves links, sometimes: cp does not preserve hard links by default; use cp -a or cp --preserve=links. rsync needs -H.
- tar and zip differ on links: tar preserves hard links (deduplicates within the archive); zip does not. For backups, tar's default preserves them; pass --hard-dereference to store each link as a separate full copy.
- mv between filesystems is cp + rm: readers can observe a partial state and the operation is not atomic. Use rsync --remove-source-files for safer cross-FS moves.
- noexec on /tmp breaks installers: many shell installers (get.docker.com style) write to /tmp then exec. Either remount with exec briefly or set TMPDIR=/var/tmp (if that's exec-allowed).
- APFS volumes are case-insensitive by default: Foo.txt and foo.txt collide. Format the volume as APFS (Case-sensitive) if you need POSIX semantics (rare on macOS; sometimes needed for Linux-style repos).
- NTFS write from Linux is slow: ntfs-3g is FUSE-based. For frequent writes, format the disk as exFAT (cross-platform) instead.
- TRIM via discard mount option vs. weekly fstrim: discard issues a TRIM per delete and can saturate the controller on small-file workloads. Prefer fstrim.timer (weekly, batched).
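The "copying preserves links, sometimes" pitfall can be reproduced in a scratch directory. The sketch assumes GNU cp and stat (%h is the link count); all paths are made up.

```shell
# Sketch: compare how cp -r and cp -a treat a pair of hard links.
linkdir=$(mktemp -d)
mkdir "$linkdir/src"
echo "data" > "$linkdir/src/a"
ln "$linkdir/src/a" "$linkdir/src/b"       # a and b now share one inode

cp -r "$linkdir/src" "$linkdir/plain"      # each name becomes an independent file
cp -a "$linkdir/src" "$linkdir/archive"    # -a includes --preserve=links

plain_links=$(stat -c %h "$linkdir/plain/a")
archive_links=$(stat -c %h "$linkdir/archive/a")

echo "plain=$plain_links archive=$archive_links"   # prints: plain=1 archive=2
rm -rf "$linkdir"
```

With the plain copy the data is stored twice; with the archive copy the two names still share one inode.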
Real-world recipes
Find the largest directories on /
A perennial "disk full" first step. du walks the tree; --max-depth caps the recursion so you can drill down.
sudo du -hx --max-depth=1 / 2>/dev/null | sort -hr | head
Output:
17G /var
12G /home
5.4G /usr
1.2G /opt
880M /root
312M /tmp
180M /etc
-x keeps du on the same filesystem so virtual mounts (/proc, /sys, /dev) don't pollute the report.
Find files held open after deletion
When you've deleted a log file but disk hasn't been reclaimed.
sudo lsof +L1 | awk 'NR==1 || $NF=="(deleted)"'
Output:
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
nginx 1234 nginx 2w REG 259,2 1073741824 0 ... /var/log/nginx/access.log (deleted)
Reclaim without restart by truncating via /proc:
sudo sh -c ': > /proc/1234/fd/2'   # the redirection has to run inside the root shell
Output: (none — exits 0 on success)
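The mechanism behind this recipe, an inode that outlives its last name while a descriptor is open, can be reproduced without root. This is a sketch; file descriptor 3 and the temporary path are arbitrary choices.

```shell
# Sketch: a file stays readable through an open descriptor after rm.
held=$(mktemp)
echo "still here" > "$held"

exec 3< "$held"         # open the file for reading on fd 3
rm "$held"              # link count drops to zero, but fd 3 keeps the inode alive

if [ -e "$held" ]; then name_exists=yes; else name_exists=no; fi
contents=$(cat <&3)     # the data is still reachable through the descriptor
exec 3<&-               # closing the last descriptor lets the kernel free the blocks

echo "name exists: $name_exists, contents: $contents"
# prints: name exists: no, contents: still here
```

Until the descriptor is closed, df counts the blocks as used while du, which walks names, no longer sees them.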
Move /home onto a new disk preserving everything
The "I bought a bigger SSD" recipe. rsync -aHAX preserves modes, hard links, ACLs, and xattrs.
sudo mkdir -p /mnt/newhome
sudo mount /dev/sdb1 /mnt/newhome
sudo rsync -aHAX --info=progress2 /home/ /mnt/newhome/
sudo blkid /dev/sdb1 # capture the new UUID
# Edit /etc/fstab to mount UUID=... at /home
sudo umount /mnt/newhome
sudo mv /home /home.old
sudo mkdir /home && sudo mount /home
ls /home
Output:
alice bob shared
Convert a Btrfs subvolume to a snapshot
For taking a point-in-time copy before a risky upgrade.
sudo btrfs subvolume snapshot -r /home /home-2026-05-25
sudo btrfs subvolume list /
Output:
ID 256 gen 12042 top level 5 path home
ID 312 gen 12042 top level 5 path home-2026-05-25
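If the upgrade goes wrong, roll back by swapping the default subvolume. This only affects mounts that do not pass subvol= or subvolid=, and a snapshot taken with -r stays read-only until you take a writable snapshot of it or clear the flag with btrfs property set <path> ro false.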
sudo btrfs subvolume set-default 312 /
sudo reboot
Output: (none — exits 0 on success)
Audit a filesystem for fragmentation
Quick health-check on ext4. Anything above a score of 30 is worth defragmenting.
sudo e4defrag -c /home | tail -6
Output:
Average size per extent 512 KB
Fragmentation score 8
[0-30 no problem: defragmentation is not needed]
[31-55 little bit fragmented: defragmentation is recommended]
[56- needs defragmentation: run with -v option to find fragmented files]
This filesystem (/home) does not need defragmentation.
Mount a Btrfs root with compression
Once-and-for-all transparent compression for a low-write workstation. zstd:3 is the sweet spot.
# Append to the existing options column for the / line in /etc/fstab:
# UUID=... / btrfs defaults,subvol=@,compress=zstd:3,ssd 0 0
sudo mount -o remount,compress=zstd:3 /
# Re-compress existing files:
sudo btrfs filesystem defragment -r -czstd /
Output: (none — exits 0 on success)
Recover deleted files with debugfs (ext4)
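A last-resort recovery path that only works if the inode and data blocks haven't been overwritten. On ext4 the kernel usually clears an inode's extent map when the file is unlinked, so lsdel often lists nothing; treat this as a long shot. Always unmount or remount read-only first.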
sudo mount -o remount,ro /
sudo debugfs /dev/nvme0n1p2   # read-only mode is enough for lsdel and dump
debugfs: lsdel
Output (truncated):
Inode Owner Mode Size Blocks Time deleted
1820000 1000 100644 1048576 1024/1024 Sun May 25 09:14:02 2026
debugfs: dump <1820000> /tmp/recovered.bin
debugfs: quit
Recovery rate falls off a cliff once the disk has been written to. For routine recovery, use snapshots (Btrfs, ZFS) or backups.
Build a deduplicated nightly backup with hard links
Classic rsync-snapshot trick: each night looks like a full copy but only changed files consume new space.
yesterday=$(date -d 'yesterday' +%F)
today=$(date +%F)
rsync -aH --link-dest=/backup/$yesterday /home/ /backup/$today/
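du -sh /backup/$today
du -sh /backup/$yesterday
Output:
12G /backup/2026-05-25
12G /backup/2026-05-24
The numbers are misleading by design: measured separately, both trees look like 12 GB, but on disk they share most blocks via hard links. Passing both paths to a single du invocation counts each shared inode once, so the second tree then shows only its unique data.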
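The recipe assumes yesterday's snapshot exists; if a night was skipped, --link-dest points at nothing and rsync makes a full copy. One way around that is to link against the newest snapshot that is actually present. This is a sketch; latest_snapshot is a made-up helper, and it assumes snapshot directories are named YYYY-MM-DD so that lexical order equals date order.

```shell
# Sketch: find the newest existing snapshot directory, excluding today's.
latest_snapshot() {
  # $1: backup root, $2: name of today's snapshot
  for d in "$1"/*/; do
    [ -d "$d" ] || continue                 # skip the unexpanded glob when the root is empty
    base=$(basename "$d")
    [ "$base" = "$2" ] && continue
    printf '%s\n' "$base"
  done | sort | tail -n 1
}

root=$(mktemp -d)
mkdir "$root/2026-05-22" "$root/2026-05-24" "$root/2026-05-25"
previous=$(latest_snapshot "$root" 2026-05-25)
echo "previous snapshot: $previous"         # prints: previous snapshot: 2026-05-24
rm -rf "$root"
```

The result would then replace $yesterday in the rsync line, for example --link-dest="/backup/$previous".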
Force a stuck umount to drop
When the filesystem is busy because a process you can't easily kill is holding it.
fuser -vm /mnt/usb # who's holding it open?
sudo umount /mnt/usb || sudo umount -l /mnt/usb # lazy unmount as last resort
Output (fuser -vm):
USER PID ACCESS COMMAND
/mnt/usb: alice 4810 ..c.. vlc
alice 8821 ..c.. bash
-l (lazy) detaches the mount from the namespace immediately but keeps the underlying FS busy until the holders close their FDs. Prefer fixing the holders.
Tips
Always identify a disk by UUID= or LABEL= in /etc/fstab, never /dev/sdaN: device names are not stable across reboots when you add or remove drives. Grab the UUID with blkid /dev/sdaN.
findmnt is the modern view of mounts: it produces a tree and accepts column selectors like -o TARGET,SOURCE,FSTYPE,OPTIONS. It replaces both mount (no args) and cat /proc/mounts for human use.
[!WARN] Never run fsck on a mounted writable filesystem: it can corrupt the very thing it's trying to repair. Reboot into single-user mode, or mount -o remount,ro first, or use a live USB.