Filesystems

Core filesystem concepts every operator should know: inodes, directory structure, hard vs symbolic links, journaling, copy-on-write, and a head-to-head of ext4, XFS, Btrfs, ZFS, APFS, and NTFS with mount options and pitfalls.

Filesystems — Inodes, Links, Journaling, ext4/XFS/Btrfs/ZFS/APFS/NTFS

What it is

A filesystem is the layer between the block device and "files and directories" — the on-disk data structures and the kernel code that gives a flat array of sectors the shape of a hierarchical, named, permissioned namespace. Every Unix filesystem revolves around the inode: a fixed-size record describing one file's metadata (mode, owner, size, timestamps, block pointers) without any name attached. Directories are simply files whose contents are name-to-inode tables, which is why hard links, mount points, and case-sensitivity all behave the way they do. Reach for this article when something has gone strange with disk usage, when a df and du mismatch appears, when a cp runs slower than expected, or when you need to pick a filesystem for new storage.

Inodes — the on-disk file record

An inode is a fixed-size data structure (256–1024 bytes on modern filesystems) that records every property of a file except its name. Directories store name-to-inode pairs; the inode itself contains the mode, owner UID/GID, size, timestamps (atime, mtime, ctime, sometimes btime), link count, and pointers (direct, indirect, or extent-based) to the data blocks. When you delete a file you decrement its link count; when the count reaches zero and no process has the file open, the kernel frees the inode and its data blocks.

bash
# Show inode number and metadata for a file
stat /etc/hostname

Output:

text
  File: /etc/hostname
  Size: 7         Blocks: 8          IO Block: 4096   regular file
Device: 803h/2051d  Inode: 1572873     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2026-05-25 09:14:02.000000000 -0400
Modify: 2026-04-30 12:00:00.000000000 -0400
Change: 2026-04-30 12:00:00.000000000 -0400
 Birth: 2026-04-30 12:00:00.000000000 -0400
bash
# List inode numbers in a directory
ls -li /etc | head -5

Output:

text
1572873 -rw-r--r-- 1 root root    7 Apr 30 12:00 hostname
1572874 -rw-r--r-- 1 root root  264 Apr 30 12:00 hosts
1572880 drwxr-xr-x 2 root root 4096 May 24 10:00 cron.daily
1572881 lrwxrwxrwx 1 root root   21 Apr 30 12:00 localtime -> /usr/share/zoneinfo/UTC
1572883 -rw-r--r-- 1 root root  604 Apr 30 12:00 fstab

A file's name is not a property of the file itself — it lives in the parent directory's name table. That is why mv within a single filesystem is just a directory update (instant, atomic) while mv across filesystems is a full cp plus rm (slow, non-atomic, observable to readers mid-copy).
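A minimal sketch of that claim, assuming GNU stat and a throwaway directory from mktemp (the file names are illustrative): renaming a file within one filesystem leaves its inode number unchanged, because only the directory entry is rewritten.

```shell
# Sketch: a rename inside one filesystem touches the directory, not the inode.
# The temporary directory and file names are illustrative.
workdir=$(mktemp -d)
echo "payload" > "$workdir/old-name.txt"

inode_before=$(stat -c %i "$workdir/old-name.txt")
mv "$workdir/old-name.txt" "$workdir/new-name.txt"
inode_after=$(stat -c %i "$workdir/new-name.txt")

echo "before: $inode_before  after: $inode_after"
rm -rf "$workdir"
```

Both numbers should match. Across two filesystems the destination would instead get a freshly allocated inode.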

Inode exhaustion

ext4 and its predecessors are created with a fixed number of inodes, so you can run out of inodes long before the disk is full, especially on volumes holding many small files (cache directories, mail spools, node_modules). df -i reports inode usage separately from block usage.

bash
df -i /

Output:

text
Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/nvme0n1p2 6553600 412034 6141566    7% /

If IUse% is near 100% you cannot create a new file even with terabytes of free space — the error is No space left on device. The fix is either to delete small files, switch to a filesystem that allocates inodes dynamically (XFS, Btrfs, ZFS), or remake the filesystem with mkfs.ext4 -N <count>.
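Before deleting anything, it helps to know which directories hold the inodes. The sketch below is one possible single-pass tally using GNU find; it builds a small demo tree (the directory names are made up) so it can be run safely, and on a real system you would point find at the affected mount point instead.

```shell
# Build a small demo tree: 'many' holds 5 files, 'few' holds 1.
tree=$(mktemp -d)
mkdir "$tree/many" "$tree/few"
for i in 1 2 3 4 5; do : > "$tree/many/f$i"; done
: > "$tree/few/only"

# Print each entry's parent directory, then count entries per directory.
# -xdev keeps find on one filesystem, which is what df -i measures.
report=$(find "$tree" -mindepth 1 -xdev -printf '%h\n' | sort | uniq -c | sort -rn)
echo "$report"

top_count=$(echo "$report" | head -n 1 | awk '{print $1}')
top_dir=$(echo "$report" | head -n 1 | awk '{print $2}')
rm -rf "$tree"
```

Hard-linked files are counted once per name here, so the tally can overstate inode use in trees full of hard links.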

Directory structure

A directory is a special file whose contents are a list of (name, inode_number) pairs. The shape of that list differs by filesystem: ext4 uses HTrees (hashed B-trees) for fast lookup on large directories; XFS and Btrfs use B+ trees natively; ZFS uses extensible hashing. On every modern filesystem you can keep millions of files in one directory without the historical slowdown ext2 and HFS+ suffered.

bash
# Count entries in a directory
ls -1A /etc | wc -l

# Find directories with the most entries
sudo find / -xdev -type d -exec sh -c 'echo "$(ls -A "$1" | wc -l) $1"' _ {} \; 2>/dev/null | sort -rn | head -5

Output:

text
182
54392 /var/lib/dpkg/info
4112 /usr/share/doc
1893 /usr/lib/x86_64-linux-gnu
512 /usr/bin
182 /etc

[!WARN] Even with HTree indexing, enumerating a million-entry directory still costs O(n) and stats every inode in turn — that is the slow part of ls, rm -rf, and tab completion. Shard large data sets into a two- or three-level hash tree (e.g. /cache/ab/cd/abcdef.dat) rather than dumping everything at the top.
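The sharding layout suggested above can be sketched as a small helper. This is an illustration under assumptions: /cache is a placeholder root, shard_path is a hypothetical name, and SHA-256 is just one convenient hash whose hex digest spreads keys evenly across the first two directory levels.

```shell
# shard_path KEY: print a two-level sharded path for KEY.
# Assumptions: /cache is a placeholder root; sha256 is an arbitrary hash choice.
shard_path() {
    digest=$(printf '%s' "$1" | sha256sum | cut -d ' ' -f 1)
    level1=$(printf '%s' "$digest" | cut -c 1-2)
    level2=$(printf '%s' "$digest" | cut -c 3-4)
    printf '/cache/%s/%s/%s.dat\n' "$level1" "$level2" "$digest"
}

shard_path "user-42-avatar"
```

Two hex characters per level give 256 directories at each level, so a million files average roughly 15 entries per leaf directory.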

Hard links vs symbolic links

Unix supports two distinct link types, and confusing them is one of the top filesystem pitfalls. A hard link is an additional name pointing at the same inode. There is no "original" once one is made; all hard links are equal peers and the file vanishes only when the last is removed. A symbolic link (symlink, soft link) is its own inode whose contents are a path string the kernel follows at lookup time. It can point at anything, including a path that doesn't exist or one on another filesystem, but it adds a level of indirection and is itself a file you can stat.

| Property | Hard link | Symbolic link |
| --- | --- | --- |
| Own inode | No (shares the target's inode) | Yes |
| Crosses filesystems | No | Yes |
| Links to a directory | Generally no (root + special tools only) | Yes |
| Survives target deletion | Yes (link count keeps file alive) | Becomes dangling |
| Visible as a link in ls -l | No (looks like a regular file) | Yes (l type, -> arrow) |
| Pointed-to path | N/A (same inode) | Stored string |
| Tooling | ln target name | ln -s target name |

bash
# Create both kinds
echo "hello" > /tmp/source.txt
ln    /tmp/source.txt /tmp/hardlink.txt        # hard link
ln -s /tmp/source.txt /tmp/symlink.txt         # symbolic link

# Compare
ls -li /tmp/source.txt /tmp/hardlink.txt /tmp/symlink.txt

Output:

text
1810023 -rw-r--r-- 2 alice staff  6 May 25 09:14 /tmp/source.txt
1810023 -rw-r--r-- 2 alice staff  6 May 25 09:14 /tmp/hardlink.txt
1810024 lrwxrwxrwx 1 alice staff 15 May 25 09:14 /tmp/symlink.txt -> /tmp/source.txt

Note that source.txt and hardlink.txt share inode 1810023 and both report a link count of 2. The symlink has its own inode (1810024) and shows the target path.

bash
# Delete the original — hard link survives, symlink dangles
rm /tmp/source.txt
cat /tmp/hardlink.txt    # works
cat /tmp/symlink.txt     # error

Output:

text
hello
cat: /tmp/symlink.txt: No such file or directory

Use hard links when you want the file to persist as long as any name references it — backup snapshots, deduplicated trees (cp -al, rsync --link-dest). Use symlinks when you want a path-shaped reference that updates when you replace the target (e.g. /usr/bin/python -> python3).
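Two audits follow naturally from the distinction: listing every name that shares an inode, and finding symlinks whose target no longer resolves. The sketch below uses GNU find's -samefile and -xtype tests on a scratch directory; the file names are illustrative.

```shell
links=$(mktemp -d)
echo "hello" > "$links/source.txt"
ln    "$links/source.txt"  "$links/hardlink.txt"
ln -s "$links/source.txt"  "$links/symlink.txt"
ln -s "$links/missing.txt" "$links/dangling.txt"

# Every name that shares source.txt's inode (the symlink is not one of them).
same_inode=$(find "$links" -samefile "$links/source.txt" | sort)
echo "$same_inode"

# Symlinks whose target does not resolve.
dangling=$(find "$links" -xtype l)
echo "$dangling"
rm -rf "$links"
```

The first query should list hardlink.txt and source.txt; the second should list only dangling.txt.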

Timestamps

Inodes track several timestamps; understanding which one updates when is essential for find -mtime, make, and incremental backups.

| Timestamp | Updated when |
| --- | --- |
| atime | The file's contents are read (subject to relatime/noatime) |
| mtime | The file's contents are modified |
| ctime | The file's metadata changes (mode, owner, link count) or contents change |
| btime (birth time) | The file is created; never updated (ext4, XFS, Btrfs, ZFS, APFS) |

Most modern Linux mounts use relatime (the default since 2.6.30): atime is updated only if the existing atime is older than the current mtime/ctime or older than 24 hours. Use noatime to skip atime updates entirely for read-heavy workloads — a measurable speedup for busy mail servers and caches.
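The relatime rule is easier to reason about as a predicate. The function below is a sketch of the decision described above, not kernel code: needs_atime_update and decide are hypothetical names, all arguments are epoch seconds, and the exact boundary comparisons (greater-or-equal) are an assumption.

```shell
# needs_atime_update ATIME MTIME CTIME NOW   (all epoch seconds)
# Returns 0 (true) when a read under relatime would refresh atime.
needs_atime_update() {
    atime=$1; mtime=$2; ctime=$3; now=$4
    [ "$mtime" -ge "$atime" ] && return 0        # modified since the last recorded read
    [ "$ctime" -ge "$atime" ] && return 0        # metadata changed since the last recorded read
    [ $((now - atime)) -ge 86400 ] && return 0   # recorded atime is at least 24 hours old
    return 1
}

decide() { if needs_atime_update "$@"; then echo update; else echo skip; fi; }

recent_read=$(decide 1000 900 900 2000)    # read again shortly after the last read
after_write=$(decide 1000 1500 900 2000)   # file was written after the last read
stale=$(decide 1000 900 900 90000)         # more than a day since the last read
echo "$recent_read $after_write $stale"    # skip update update
```

Under noatime the answer is always "skip", which is where the saving on read-heavy workloads comes from.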

Journaling

A journal is a small region of the filesystem where the kernel records pending metadata changes (or full data) before applying them. On a crash the journal is replayed during the next mount, so the filesystem comes back consistent without a slow fsck pass. ext3, ext4, XFS, and NTFS all journal metadata; ZFS and Btrfs use copy-on-write instead, which is conceptually similar but stronger.

| Mode (ext4) | What's logged | Trade-off |
| --- | --- | --- |
| data=journal | Both metadata and data | Slowest; strongest consistency |
| data=ordered (default) | Metadata, but data blocks written before commit | Default; safe and fast |
| data=writeback | Metadata only; data ordering is free | Fastest; brief window where metadata refers to stale data after a crash |

bash
# Inspect a mount's journaling mode (recent kernels list data= only when it differs from the default, data=ordered)
mount | grep ' / '

Output:

text
/dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)

[!WARN] Journaling protects against a crash mid-write, not against bit rot, cosmic rays, or controller firmware bugs that ack a write that never reached the platter. For those, you need filesystem-level checksumming (Btrfs, ZFS, ReFS) or block-layer integrity (dm-integrity, T10-DIF).
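On a filesystem without data checksums, a userland digest manifest is a rough stand-in for detecting silent corruption. It is not equivalent to Btrfs or ZFS checksumming (it cannot repair anything and only runs when you ask), and the file names below are illustrative.

```shell
data=$(mktemp -d)
printf 'alpha\n' > "$data/a.txt"
printf 'beta\n'  > "$data/b.txt"

# Record one digest per file, then verify; --quiet prints only failures.
( cd "$data" && sha256sum a.txt b.txt > MANIFEST.sha256 )
( cd "$data" && sha256sum --check --quiet MANIFEST.sha256 ) && first_check=ok || first_check=corrupt

# Simulate silent corruption: same length, different content.
printf 'alphA\n' > "$data/a.txt"
( cd "$data" && sha256sum --check --quiet MANIFEST.sha256 2>/dev/null ) && second_check=ok || second_check=corrupt

echo "first: $first_check  second: $second_check"
rm -rf "$data"
```

The journal would replay cleanly in both cases; only the digest comparison notices that a.txt changed.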

Copy-on-write (CoW)

Copy-on-write filesystems never overwrite live data; every change writes new blocks and atomically updates the root pointer when the transaction commits. Crashes can only leave the filesystem at the previous consistent state. CoW enables free snapshots (just keep the old root around), free clones (share blocks until written), and end-to-end checksums (every block is fingerprinted on write).

bash
# Take an instant snapshot on Btrfs
sudo btrfs subvolume snapshot /home /home-2026-05-25

# Same on ZFS
sudo zfs snapshot tank/home@2026-05-25

Output:

text
Create a snapshot of '/home' in '/home-2026-05-25'

The cost is write amplification and fragmentation: a 4 KiB modification to a large file produces a new block and pointer chain rather than updating in place. For database workloads, set nodatacow on Btrfs or recordsize=16K plus logbias=throughput on ZFS to mitigate.
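The "free clones" property is available from userland through GNU cp. In this sketch, --reflink=auto asks for a block-sharing clone where the filesystem supports it (Btrfs, XFS with reflink enabled) and silently falls back to a normal copy elsewhere, so it runs anywhere; the file names are illustrative.

```shell
cow=$(mktemp -d)
printf 'version 1\n' > "$cow/original.img"

# Clone: shares blocks where reflinks are supported, copies otherwise.
cp --reflink=auto "$cow/original.img" "$cow/clone.img"

# Writing to the clone must never change the original.
printf 'version 2\n' > "$cow/clone.img"
original_now=$(cat "$cow/original.img")
clone_now=$(cat "$cow/clone.img")

echo "original: $original_now | clone: $clone_now"
rm -rf "$cow"
```

Use --reflink=always instead when you want the command to fail rather than fall back to a full copy.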

Filesystem comparison

| FS | OS | Type | Max file | Max FS | Snapshots | Checksums | Best for |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ext4 | Linux | Journaled extent-based | 16 TiB | 1 EiB | No (LVM only) | Metadata only | Default Linux root, broad compatibility |
| XFS | Linux | Journaled extent-based | 8 EiB | 8 EiB | No (reflinks since 5.0) | Metadata only (CRC32c) | Large files, NAS, high-throughput |
| Btrfs | Linux | CoW | 16 EiB | 16 EiB | Yes (subvolumes) | Yes (data + metadata) | Snapshots, RAID 0/1/10, single-disk workstations |
| ZFS | Linux/BSD/Solaris/macOS | CoW + integrated volume manager | 16 EiB | 256 ZiB | Yes (datasets) | Yes (data + metadata) | Servers, NAS, integrity-critical data |
| APFS | macOS / iOS | CoW | 8 EiB | 8 EiB | Yes | Metadata only | macOS system + user volumes |
| NTFS | Windows | Journaled | 16 EiB | 16 EiB | VSS (volume-level) | None | Windows system volume |

bash
# Identify a mounted filesystem's type
mount | grep ' /home '
findmnt /home -o FSTYPE,SOURCE,OPTIONS

Output:

text
FSTYPE SOURCE         OPTIONS
ext4   /dev/nvme0n1p3 rw,relatime

When to pick each

  • ext4: anywhere you want the default. Boring, fast, well-understood. Good for /, /home on most servers, and any storage that doesn't need snapshots.
  • XFS: large single-file workloads (databases, log archives, video). Default on RHEL since 7. Cannot shrink in-place.
  • Btrfs: single-disk workstations that want snapshots + send/receive, or simple RAID-1. Default on openSUSE and Fedora Workstation 33+. Avoid RAID-5/6 in production.
  • ZFS: multi-disk servers and NAS where data integrity dominates. Outside Linux kernel mainline due to license; install via zfs-dkms or your distribution's prebuilt module package.
  • APFS: Apple platforms only; nothing to choose, it's automatic since macOS 10.13.
  • NTFS: Windows volumes; read on Linux is fine, and write through ntfs-3g (FUSE) is acceptable but slow compared with a native driver such as the in-kernel ntfs3 (Linux 5.15+). macOS reads natively but does not write.

Fragmentation

Fragmentation happens when a file's blocks are scattered across the device rather than contiguous, forcing more seeks per read. On SSDs it matters far less (no head movement), but heavy fragmentation can still increase the number of I/O requests and hurt throughput on cheap drives. Modern Linux filesystems (ext4, XFS, Btrfs) allocate extents — runs of contiguous blocks — which dramatically reduces fragmentation compared to ext2/3.

bash
# Check ext4 fragmentation
sudo e4defrag -c /home

Output:

text
<Fragmented files>                             now/best
1. /home/alice/big.tar.gz                       12/1
2. /home/alice/.cache/firefox/places.sqlite     7/1
...
Total/best extents                              4823/4811
Average size per extent                         512 KB
Fragmentation score                             0
 [0-30 no problem: defragmentation is not needed]
bash
# Defragment one file or a tree
sudo e4defrag /home/alice/big.tar.gz
sudo e4defrag /home/alice

Output:

text
ext4 defragmentation for /home/alice/big.tar.gz
[1/1]/home/alice/big.tar.gz:        100%  extents: 12 -> 1   [ OK ]
 Success:                        [1/1]

Btrfs has btrfs filesystem defragment -r /mountpoint; XFS has xfs_fsr; APFS does it transparently in the background. NTFS uses defrag on Windows.

On an SSD, do not run a defrag tool on a schedule — it generates write amplification with no benefit. Use it only when a specific file or workload shows symptoms (e.g. filefrag reports tens of thousands of extents).

Mount options

The mount command and /etc/fstab accept per-filesystem options that change the trade-off between safety, speed, and behaviour. The big ones are common to most Linux filesystems; the rest are FS-specific.

bash
# Show current mount options for everything
mount | column -t
findmnt -t ext4,xfs,btrfs -o TARGET,SOURCE,FSTYPE,OPTIONS

Output (findmnt):

text
TARGET   SOURCE         FSTYPE OPTIONS
/        /dev/nvme0n1p2 ext4   rw,relatime,errors=remount-ro
/home    /dev/nvme0n1p3 ext4   rw,relatime,nodev,nosuid
/data    /dev/sdb1      xfs    rw,relatime,attr2,inode64,logbufs=8

Generic mount options

| Option | Effect |
| --- | --- |
| ro / rw | Read-only / read-write |
| noatime | Skip atime updates entirely (fastest for reads) |
| relatime | atime updated only if older than mtime/ctime or 24h (Linux default) |
| nodev | Don't allow device files |
| nosuid | Ignore setuid/setgid bits; defence in depth for user-writable mounts |
| noexec | Disallow execute; useful on /tmp, /var/tmp, /dev/shm |
| nofail | Don't block boot if the device is missing (USB drives, NFS) |
| x-systemd.automount | Lazy-mount on first access via systemd |
| sync | Force every write to hit disk before returning (slow; for removable media) |
| discard | Issue TRIM on file deletion (SSD); prefer periodic fstrim instead |
| errors=remount-ro | Switch to read-only on FS error (commonly set for / on ext4) |

bash
# /etc/fstab entry — a hardened /tmp
tmpfs   /tmp   tmpfs   nodev,nosuid,noexec,size=2G   0 0

# A USB drive that should not block boot
UUID=ABCD-1234   /mnt/usb   exfat   nofail,x-systemd.automount,nodev,nosuid   0 0

Output: (none — fstab configuration only)

ext4-specific

| Option | Effect |
| --- | --- |
| data=ordered | Default; metadata journaled, data ordered before commit |
| data=writeback | Faster; brief stale-data window after crash |
| data=journal | Safest; journals data too, so slowest |
| barrier=1 | Default; force barriers, required for crash safety on caching disks |
| journal_async_commit | Async journal commits; slight speedup, slight risk |
| commit=N | Sync interval in seconds (default 5) |

XFS-specific

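| Option | Effect |
| --- | --- |
| inode64 | Allow inode numbers above 32 bits (default since Linux 3.7; matters on filesystems larger than 1 TiB) |
| logbufs=N | Number of in-memory log buffers (default 8) |
| largeio | Optimal-I/O hints for large sequential reads |
| noquota | Disable quota accounting |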

Btrfs-specific

| Option | Effect |
| --- | --- |
| subvol=NAME | Mount a specific subvolume as root |
| compress=zstd:3 | Transparent compression (also lzo, zlib) |
| nodatacow | Disable CoW for the whole mount (DB / VM images); use chattr +C for a single file or directory |
| ssd | Enable SSD optimisations (auto-detected for most modern drives) |
| autodefrag | Defragment on write |

Common pitfalls

  1. df and du disagree: usually a process is holding a deleted file open. Find it with lsof | grep deleted or lsof +L1. See the lsof & ss cheatsheet for the recovery recipe (: > /proc/PID/fd/N).
  2. Out of inodes despite free disk: check df -i, then migrate small files off or recreate the filesystem with more inodes.
  3. rm of a "deleted" file doesn't free space: same root cause as item 1, an open file descriptor. Kill or signal the holder.
  4. Symlink loops: ln -s . loop creates a directory that appears to contain itself, and anything that follows symlinks keeps descending into it. Use find -L only when you mean it, or pass -maxdepth.
  5. Hard link across filesystems fails: ln returns Invalid cross-device link. Use ln -s (symbolic link) instead.
  6. Copying preserves links, sometimes: cp does not preserve hard links by default; use cp -a or cp --preserve=links. rsync needs -H.
  7. tar and zip differ on links: tar preserves hard links (stores the data once per archive); zip does not. For backups, use tar with --hard-dereference (store every name as a full copy) or the default (preserve them).
  8. mv between filesystems is cp + rm: readers can observe a partial state and the operation is not atomic. Use rsync --remove-source-files for safer cross-FS moves.
  9. noexec on /tmp breaks installers: many shell installers (get.docker.com style) write to /tmp then exec. Either remount with exec briefly or set TMPDIR=/var/tmp (if that's exec-allowed).
  10. APFS volumes are case-insensitive by default: Foo.txt and foo.txt collide. Format the volume as APFS (Case-sensitive) if you need POSIX semantics (rare on macOS; sometimes needed for Linux-style repos).
  11. NTFS write from Linux is slow: ntfs-3g is FUSE-based. For frequent writes, try the in-kernel ntfs3 driver (Linux 5.15+) or format the disk as exFAT (cross-platform) instead.
  12. TRIM via discard mount option vs. weekly fstrim: discard issues a TRIM per delete and can saturate the controller on small-file workloads. Prefer fstrim.timer (weekly, batched).
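Pitfall 6 is easy to verify on a scratch tree. The sketch below assumes GNU cp and stat; the directory and file names are illustrative. It copies a directory containing two hard-linked names once with cp -r and once with cp -a, then compares the link count of the copy.

```shell
src=$(mktemp -d); plain=$(mktemp -d); archive=$(mktemp -d)
echo "data" > "$src/a"
ln "$src/a" "$src/b"            # a and b are two names for one inode

cp -r "$src/." "$plain/"        # recursive copy without --preserve=links
cp -a "$src/." "$archive/"      # archive mode implies --preserve=links

plain_links=$(stat -c %h "$plain/a")
archive_links=$(stat -c %h "$archive/a")

echo "cp -r link count: $plain_links, cp -a link count: $archive_links"
rm -rf "$src" "$plain" "$archive"
```

A link count of 1 after cp -r means a and b became independent files, doubling the space used; cp -a keeps them as one inode with two names.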

Real-world recipes

Find the largest directories on /

A perennial "disk full" first step. du walks the tree; --max-depth caps the recursion so you can drill down.

bash
sudo du -hx --max-depth=1 / 2>/dev/null | sort -hr | head

Output:

text
17G    /var
12G    /home
5.4G   /usr
1.2G   /opt
880M   /root
312M   /tmp
180M   /etc

-x keeps du on the same filesystem so virtual mounts (/proc, /sys, /dev) don't pollute the report.

Find files held open after deletion

When you've deleted a log file but disk hasn't been reclaimed.

bash
sudo lsof +L1 | awk 'NR==1 || $NF=="(deleted)"'

Output:

text
COMMAND    PID  USER   FD   TYPE DEVICE     SIZE/OFF   NLINK NODE NAME
nginx     1234 nginx    2w   REG  259,2  1073741824       0  ... /var/log/nginx/access.log (deleted)

Reclaim without restart by truncating via /proc:

bash
# The redirect must be performed by root, so run it inside a root shell
sudo sh -c ': > /proc/1234/fd/2'

Output: (none — exits 0 on success)
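The reason space stays allocated is visible in a few lines of shell. In this sketch the current shell plays the role of the process holding the log open; file descriptor 3 is an arbitrary choice and the temporary file stands in for the deleted log.

```shell
held=$(mktemp)
echo "still here" > "$held"

exec 3< "$held"      # keep a read descriptor open on the file
rm "$held"           # last name removed: the link count drops to 0

[ -e "$held" ] && name_exists=yes || name_exists=no
content=$(cat <&3)   # the inode and its blocks are still reachable through fd 3
exec 3<&-            # closing the last descriptor lets the kernel free them

echo "name exists: $name_exists, content via fd: $content"
```

Until that final close, df keeps counting the blocks while du, which walks names, cannot see them.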

Move /home onto a new disk preserving everything

The "I bought a bigger SSD" recipe. rsync -aHAX preserves modes, hard links, ACLs, and xattrs.

bash
sudo mkdir -p /mnt/newhome                  # mount point must exist
sudo mount /dev/sdb1 /mnt/newhome
sudo rsync -aHAX --info=progress2 /home/ /mnt/newhome/
sudo blkid /dev/sdb1                        # capture the new UUID
# Edit /etc/fstab to mount UUID=... at /home
sudo umount /mnt/newhome
sudo mv /home /home.old
sudo mkdir /home && sudo mount /home
ls /home

Output:

text
alice  bob  shared

Snapshot a Btrfs subvolume before an upgrade

For taking a point-in-time copy before a risky upgrade.

bash
sudo btrfs subvolume snapshot -r /home /home-2026-05-25
sudo btrfs subvolume list /

Output:

text
ID 256 gen 12042 top level 5 path home
ID 312 gen 12042 top level 5 path home-2026-05-25

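If the upgrade goes wrong, roll back by moving the damaged subvolume aside and creating a writable snapshot of the read-only one in its place. This assumes /home is a subvolume nested under the mounted top level, as in the listing above. (btrfs subvolume set-default selects what gets mounted as the filesystem root, so it is the right tool only when the snapshot is of the root subvolume.)

bash
sudo mv /home /home-broken
sudo btrfs subvolume snapshot /home-2026-05-25 /home

Output:

text
Create a snapshot of '/home-2026-05-25' in '/home'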

Audit a filesystem for fragmentation

Quick health-check on ext4. Anything above a score of 30 is worth defragmenting.

bash
sudo e4defrag -c /home | tail -5

Output:

text
Average size per extent                  512 KB
Fragmentation score                      8
 [0-30 no problem: defragmentation is not needed]
 [31-55 little bit fragmented: defragmentation is recommended]
 [56- needs defragmentation: run with -v option to find fragmented files]
This filesystem (/home) does not need defragmentation.

Mount a Btrfs root with compression

Once-and-for-all transparent compression for a low-write workstation. zstd:3 is the sweet spot.

bash
# Append to the existing options column for the / line in /etc/fstab:
# UUID=...   /   btrfs   defaults,subvol=@,compress=zstd:3,ssd   0 0
sudo mount -o remount,compress=zstd:3 /
# Re-compress existing files:
sudo btrfs filesystem defragment -r -czstd /

Output: (none — exits 0 on success)

Recover deleted files with debugfs (ext4)

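A last-resort recovery path that only works if the inode and data blocks haven't been overwritten. Always unmount or remount read-only first, and open the device without -w so debugfs cannot modify it. On ext4, lsdel often comes back empty because deleting a file clears the extent pointers in its inode.

bash
sudo mount -o remount,ro /
sudo debugfs /dev/nvme0n1p2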
debugfs:  lsdel

Output (truncated):

text
 Inode  Owner  Mode    Size      Blocks   Time deleted
1820000  1000   100644  1048576   1024/1024 Mon May 25 09:14:02 2026
text
debugfs:  dump <1820000> /tmp/recovered.bin
debugfs:  quit

Recovery rate falls off a cliff once the disk has been written to. For routine recovery, use snapshots (Btrfs, ZFS) or backups.

Hard-link snapshot backups with rsync

Classic rsync-snapshot trick: each night looks like a full copy but only changed files consume new space.

bash
yesterday=$(date -d 'yesterday' +%F)
today=$(date +%F)
rsync -aH --link-dest=/backup/$yesterday /home/ /backup/$today/
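du -sh /backup/$today
du -sh /backup/$yesterday

Output:

text
12G    /backup/2026-05-25
12G    /backup/2026-05-24

The numbers are misleading by design: measured separately, both trees look like 12 GB, but on disk they share most blocks via hard links. Pass both paths to a single du invocation and it counts each shared inode once, so the second tree shows only the data unique to it.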
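The sharing can be checked without rsync. The sketch below imitates the --link-dest idea with GNU cp -al on a scratch tree (day1 and day2 are illustrative names): unchanged files stay hard-linked between days, while a changed file is replaced by a new inode.

```shell
backup=$(mktemp -d)
mkdir "$backup/day1"
echo "unchanged" > "$backup/day1/keep.txt"
echo "old"       > "$backup/day1/edit.txt"

# Day 2 starts as a tree of hard links to day 1 (no file data is copied).
cp -al "$backup/day1" "$backup/day2"

# A changed file must be replaced, not edited in place, or day 1 changes too.
rm "$backup/day2/edit.txt"
echo "new" > "$backup/day2/edit.txt"

keep_shared=no; edit_shared=no
[ "$backup/day1/keep.txt" -ef "$backup/day2/keep.txt" ] && keep_shared=yes
[ "$backup/day1/edit.txt" -ef "$backup/day2/edit.txt" ] && edit_shared=yes
day1_edit=$(cat "$backup/day1/edit.txt")

echo "keep.txt shared: $keep_shared, edit.txt shared: $edit_shared, day1 edit.txt: $day1_edit"
rm -rf "$backup"
```

rsync handles the replace-rather-than-edit step itself, which is why the nightly trees stay independent even though they share inodes.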

Force a stuck umount to drop

When the filesystem is busy because a process you can't easily kill is holding it.

bash
fuser -vm /mnt/usb               # who's holding it open?
sudo umount /mnt/usb || sudo umount -l /mnt/usb   # lazy unmount as last resort

Output (fuser -vm):

text
                     USER        PID ACCESS COMMAND
/mnt/usb:            alice       4810 ..c.. vlc
                     alice       8821 ..c.. bash

-l (lazy) detaches the mount from the namespace immediately but keeps the underlying FS busy until the holders close their FDs. Prefer fixing the holders.

Tips

Always identify a disk by UUID= or LABEL= in /etc/fstab, never /dev/sdaN — device names are not stable across reboots when you add or remove drives. Grab the UUID with blkid /dev/sdaN.

findmnt is the modern view of mounts: it produces a tree and accepts column selectors like -o TARGET,SOURCE,FSTYPE,OPTIONS. It replaces both mount (no args) and cat /proc/mounts for human use.

[!WARN] Never run fsck on a mounted writable filesystem — it can corrupt the very thing it's trying to repair. Reboot into single-user mode, or mount -o remount,ro first, or use a live USB.