cheat sheet

Filesystems

Core filesystem concepts every operator should know: inodes, directory structure, hard vs symbolic links, journaling, copy-on-write, and a head-to-head of ext4, XFS, Btrfs, ZFS, APFS, and NTFS with mount options and pitfalls.

updated 05-25-2026

Filesystems — Inodes, Links, Journaling, ext4/XFS/Btrfs/ZFS/APFS/NTFS

What it is

A filesystem is the layer between the block device and "files and directories": the on-disk data structures and kernel code that give a flat array of sectors the shape of a hierarchical, named, permissioned namespace. Every Unix filesystem revolves around the inode: a fixed-size record describing one file's metadata (mode, owner, size, timestamps, block pointers) without any name attached. Directories are simply files whose contents are name-to-inode tables, which is why hard links, mount points, and case-sensitivity all behave the way they do. Reach for this article when disk usage looks wrong, when df and du disagree, when a cp runs slower than expected, or when you need to pick a filesystem for new storage.

Inodes — the on-disk file record

An inode is a fixed-size data structure (256–1024 bytes on modern filesystems) that records every property of a file except its name. Directories store name-to-inode pairs; the inode itself contains the mode, owner UID/GID, size, timestamps (atime, mtime, ctime, sometimes btime), link count, and pointers (direct, indirect, or extent-based) to the data blocks. When you delete a file you decrement its link count; when the count reaches zero and no process has the file open, the kernel frees the inode and its data blocks.

bash

# Show inode number and metadata for a file
stat /etc/hostname

Output:

text

  File: /etc/hostname
  Size: 7         Blocks: 8          IO Block: 4096   regular file
Device: 803h/2051d  Inode: 1572873     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2026-05-25 09:14:02.000000000 -0400
Modify: 2026-04-30 12:00:00.000000000 -0400
Change: 2026-04-30 12:00:00.000000000 -0400
 Birth: 2026-04-30 12:00:00.000000000 -0400

bash

# List inode numbers in a directory
ls -li /etc | head -5

Output:

text

1572873 -rw-r--r-- 1 root root    7 Apr 30 12:00 hostname
1572874 -rw-r--r-- 1 root root  264 Apr 30 12:00 hosts
1572880 drwxr-xr-x 2 root root 4096 May 24 10:00 cron.daily
1572881 lrwxrwxrwx 1 root root   21 Apr 30 12:00 localtime -> /usr/share/zoneinfo/UTC
1572883 -rw-r--r-- 1 root root  604 Apr 30 12:00 fstab

A file's name is not a property of the file itself — it lives in the parent directory's name table. That is why mv within a single filesystem is just a directory update (instant, atomic) while mv across filesystems is a full cp plus rm (slow, non-atomic, observable to readers mid-copy).
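One way to see that the name lives in the directory rather than in the file is to rename a file and compare inode numbers. This is a minimal sketch, assuming GNU coreutils `stat` and a scratch directory from `mktemp`; the file names are placeholders.

```bash
# Scratch directory, so the rename stays within one filesystem
d=$(mktemp -d)
echo "data" > "$d/old-name.txt"

before=$(stat -c %i "$d/old-name.txt")   # inode number before the rename
mv "$d/old-name.txt" "$d/new-name.txt"   # only the directory entry changes
after=$(stat -c %i "$d/new-name.txt")    # same inode, new name

echo "inode before: $before, after: $after"
rm -rf "$d"
```

Both numbers should match, because the rename only rewrote the directory's name table.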

Inode exhaustion

ext4 (like ext2 and ext3) fixes the number of inodes when the filesystem is created; you can run out of inodes long before the disk is full, especially on volumes holding many small files (cache directories, mail spools, node_modules). df -i reports inode usage separately from block usage.

bash

df -i /

Output:

text

Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/nvme0n1p2 6553600 412034 6141566    7% /

If IUse% is near 100% you cannot create a new file even with terabytes of free space — the error is No space left on device. The fix is either to delete small files, switch to a filesystem that allocates inodes dynamically (XFS, Btrfs, ZFS), or remake the filesystem with mkfs.ext4 -N <count>.
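The check above is easy to script. The following sketch parses the IUse% column from `df -Pi` output and compares it with an alert level; the function name `parse_iuse` and the 90% threshold are assumptions chosen for illustration, and the sample text is the df -i output shown earlier.

```bash
# Extract the IUse% figure (second line, fifth column) from `df -Pi` output on stdin
parse_iuse() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Sample taken from the df -i output above
sample='Filesystem      Inodes  IUsed   IFree IUse% Mounted on
/dev/nvme0n1p2 6553600 412034 6141566    7% /'

pct=$(printf '%s\n' "$sample" | parse_iuse)
threshold=90   # hypothetical alert level

if [ "$pct" -ge "$threshold" ]; then
  echo "WARNING: inode usage at ${pct}%"
else
  echo "inode usage OK at ${pct}%"
fi

# Against a live mount: df -Pi / | parse_iuse
```

On filesystems that allocate inodes dynamically, df may print `-` in this column, so a real monitoring script would need to handle a non-numeric value.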

Directory structure

A directory is a special file whose contents are a list of (name, inode_number) pairs. The shape of that list differs by filesystem: ext4 uses HTrees (hashed B-trees) for fast lookup on large directories; XFS and Btrfs use B+ trees natively; ZFS uses extensible hashing. On every modern filesystem you can keep millions of files in one directory without the historical slowdown ext2 and HFS+ suffered.

bash

# Count entries in a directory
ls -1A /etc | wc -l

# Find directories with the most entries
sudo find / -xdev -type d -exec sh -c 'echo "$(ls -A "$1" | wc -l) $1"' _ {} \; 2>/dev/null | sort -rn | head -5

Output:

text

182
54392 /var/lib/dpkg/info
4112 /usr/share/doc
1893 /usr/lib/x86_64-linux-gnu
512  /usr/bin
182  /etc

[!WARN] Even with HTree indexing, enumerating a million-entry directory still costs O(n) and stats every inode in turn — that is the slow part of ls, rm -rf, and tab completion. Shard large data sets into a two- or three-level hash tree (e.g. /cache/ab/cd/abcdef.dat) rather than dumping everything at the top.
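The sharded layout in the warning above can be sketched as a small path helper. The name `shard_path` is hypothetical, and it assumes the key is at least four characters long and already evenly distributed (for example a hex digest).

```bash
# Build <root>/<chars 1-2>/<chars 3-4>/<name>.dat from a name of at least four characters
shard_path() {
  root=$1
  name=$2
  level1=$(printf '%s' "$name" | cut -c1-2)
  level2=$(printf '%s' "$name" | cut -c3-4)
  printf '%s/%s/%s/%s.dat\n' "$root" "$level1" "$level2" "$name"
}

path=$(shard_path /cache abcdef)
echo "$path"   # /cache/ab/cd/abcdef.dat
```

For arbitrary file names, hashing the name first (for instance with `sha256sum`) and sharding on the digest keeps the buckets evenly filled.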

Hard vs symbolic links

Unix supports two distinct link types and confusing them is one of the top filesystem pitfalls. A hard link is an additional name pointing at the same inode — there is no "original" once one is made; all hard links are equal peers and the file vanishes only when the last is removed. A symbolic link (symlink, soft link) is its own inode whose contents are a path string the kernel follows at lookup time — it can point at anything, including a path that doesn't exist or one on another filesystem, but it adds a level of indirection and is itself a file you can stat.

Property	Hard link	Symbolic link
Own inode	No — shares target's inode	Yes
Crosses filesystems	No	Yes
Links to a directory	Generally no (root + special tools only)	Yes
Survives target deletion	Yes (link count keeps file alive)	Becomes dangling
Visible as a link in `ls -l`	No — looks like a regular file	Yes (`l` type, `->` arrow)
Pointed-to path	N/A — same inode	Stored string
Tooling	`ln target name`	`ln -s target name`

bash

# Create both kinds
echo "hello" > /tmp/source.txt
ln    /tmp/source.txt /tmp/hardlink.txt        # hard link
ln -s /tmp/source.txt /tmp/symlink.txt         # symbolic link

# Compare
ls -li /tmp/source.txt /tmp/hardlink.txt /tmp/symlink.txt

Output:

text

1810023 -rw-r--r-- 2 alice staff  6 May 25 09:14 /tmp/source.txt
1810023 -rw-r--r-- 2 alice staff  6 May 25 09:14 /tmp/hardlink.txt
1810024 lrwxrwxrwx 1 alice staff 15 May 25 09:14 /tmp/symlink.txt -> /tmp/source.txt

Note that source.txt and hardlink.txt share inode 1810023 and both report a link count of 2. The symlink has its own inode (1810024) and shows the target path.

bash

# Delete the original — hard link survives, symlink dangles
rm /tmp/source.txt
cat /tmp/hardlink.txt    # works
cat /tmp/symlink.txt     # error

Output:

text

hello
cat: /tmp/symlink.txt: No such file or directory

Use hard links when you want the file to persist as long as any name references it — backup snapshots, deduplicated trees (cp -al, rsync --link-dest). Use symlinks when you want a path-shaped reference that updates when you replace the target (e.g. /usr/bin/python -> python3).
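Because a hard link looks like a regular file, it helps to have a way to find every name for an inode and every symlink that has gone stale. The sketch below assumes GNU `find` and `stat` and recreates the example above in a scratch directory.

```bash
d=$(mktemp -d)
echo "hello" > "$d/source.txt"
ln    "$d/source.txt" "$d/hardlink.txt"
ln -s "$d/source.txt" "$d/symlink.txt"

# Every name that shares source.txt's inode (the symlink has its own inode, so it is not listed)
peers=$(find "$d" -samefile "$d/source.txt" | sort)

rm "$d/source.txt"
links_left=$(stat -c %h "$d/hardlink.txt")   # link count after removing one name
dangling=$(find "$d" -xtype l)               # symlinks whose target no longer exists

echo "$peers"
echo "links left: $links_left"
echo "dangling: $dangling"
rm -rf "$d"
```

`find -links +1 -type f` is a related test that lists every regular file with more than one name.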

Timestamps

Inodes track several timestamps; understanding which one updates when is essential for find -mtime, make, and incremental backups.

Timestamp	Updated when
`atime`	The file's contents are read (subject to `relatime`/`noatime`)
`mtime`	The file's contents are modified
`ctime`	The file's metadata changes (mode, owner, link count) or contents change
`btime` (birth time)	The file is created; never updated (ext4, XFS, Btrfs, ZFS, APFS)

Most modern Linux mounts use relatime (the default since 2.6.30): atime is updated only if the existing atime is older than the current mtime/ctime or older than 24 hours. Use noatime to skip atime updates entirely for read-heavy workloads — a measurable speedup for busy mail servers and caches.
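The mtime/ctime distinction in the table is easy to confirm by hand. This sketch, assuming GNU `stat`, makes a metadata-only change and shows that ctime moves while mtime stays put.

```bash
f=$(mktemp)
echo "v1" > "$f"

m1=$(stat -c %y "$f")   # mtime before
c1=$(stat -c %z "$f")   # ctime before

sleep 1
chmod 644 "$f"          # metadata-only change: contents untouched

m2=$(stat -c %y "$f")   # mtime after
c2=$(stat -c %z "$f")   # ctime after

echo "mtime: $m1 -> $m2"
echo "ctime: $c1 -> $c2"
rm -f "$f"
```

This is why backup tools that rely on mtime alone can miss permission or ownership changes.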

Journaling

A journal is a small region of the filesystem where the kernel records pending metadata changes (or full data) before applying them. On a crash the journal is replayed during the next mount, so the filesystem comes back consistent without a slow fsck pass. ext3, ext4, XFS, and NTFS all journal metadata; ZFS and Btrfs rely on copy-on-write for consistency instead, which also gives them snapshots and data checksums.

Mode (ext4)	What's logged	Trade-off
`data=journal`	Both metadata and data	Slowest; strongest consistency
`data=ordered` (default)	Metadata, but data blocks written before commit	Default — safe and fast
`data=writeback`	Metadata only; data blocks may reach disk before or after the metadata that refers to them	Fastest; brief window where metadata refers to stale data after a crash

bash

# Inspect a mount's options (ext4 omits the default data=ordered from this list)
mount | grep ' / '

Output:

text

/dev/nvme0n1p2 on / type ext4 (rw,relatime,errors=remount-ro)
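Since the option list above carries no `data=` entry, a script that wants the journaling mode has to treat absence as the default. A minimal sketch follows; the helper name `data_mode` is hypothetical and it assumes the mount is ext4.

```bash
# Print the data= journaling mode from a comma-separated option string,
# falling back to ext4's default when none is listed
data_mode() {
  printf '%s\n' "$1" | tr ',' '\n' | sed -n 's/^data=//p' | grep . \
    || echo "ordered (default, not listed)"
}

default_case=$(data_mode 'rw,relatime,errors=remount-ro')
explicit_case=$(data_mode 'rw,noatime,data=writeback')

echo "$default_case"
echo "$explicit_case"

# Against a live mount: data_mode "$(findmnt -no OPTIONS /)"
```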

[!WARN] Journaling protects against a crash mid-write, not against bit rot, cosmic rays, or controller firmware bugs that ack a write that never reached the platter. For those, you need filesystem-level checksumming (Btrfs, ZFS, ReFS) or block-layer integrity (dm-integrity, T10-DIF).

Copy-on-write (CoW)

Copy-on-write filesystems never overwrite live data; every change writes new blocks and atomically updates the root pointer when the transaction commits. Crashes can only leave the filesystem at the previous consistent state. CoW enables free snapshots (just keep the old root around), free clones (share blocks until written), and end-to-end checksums (every block is fingerprinted on write).

bash

# Take an instant snapshot on Btrfs
sudo btrfs subvolume snapshot /home /home-2026-05-25

# Same on ZFS
sudo zfs snapshot tank/home@2026-05-25

Output (Btrfs; zfs snapshot prints nothing on success):

text

Create a snapshot of '/home' in '/home-2026-05-25'

The cost is write amplification and fragmentation: a 4 KiB modification to a large file produces a new block and pointer chain rather than updating in place. For database workloads, set nodatacow on Btrfs (which also turns off checksums and compression for the affected files) or recordsize=16K plus logbias=throughput on ZFS to mitigate.
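The "free clones" property is available per file through GNU `cp --reflink`. The sketch below uses `--reflink=auto`, which clones blocks where the filesystem supports it (Btrfs, XFS with reflink enabled) and silently falls back to an ordinary copy elsewhere; the file names are placeholders.

```bash
d=$(mktemp -d)
echo "original" > "$d/base.img"

# Clone if the filesystem can share blocks, otherwise do a normal copy
cp --reflink=auto "$d/base.img" "$d/clone.img"

# Writing to the clone allocates new blocks; the base file is untouched
echo "changed" > "$d/clone.img"

base=$(cat "$d/base.img")
clone=$(cat "$d/clone.img")
echo "base: $base, clone: $clone"
rm -rf "$d"
```

Use `--reflink=always` instead when you want the command to fail rather than fall back to a full copy.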

Filesystem comparison

FS	OS	Type	Max file	Max FS	Snapshots	Checksums	Best for
ext4	Linux	Journaled extent-based	16 TiB	1 EiB	No (LVM only)	Metadata only	Default Linux root, broad compatibility
XFS	Linux	Journaled extent-based	8 EiB	8 EiB	No (reflinks since 5.0)	Metadata + CRC headers	Large files, NAS, high-throughput
Btrfs	Linux	CoW	16 EiB	16 EiB	Yes (subvolumes)	Yes (data + metadata)	Snapshots, RAID 0/1/10, single-disk workstations
ZFS	Linux/BSD/Solaris/macOS	CoW + integrated volume manager	16 EiB	2^128 bytes	Yes (datasets)	Yes (data + metadata)	Servers, NAS, integrity-critical data
APFS	macOS / iOS	CoW	8 EiB	8 EiB	Yes	Metadata only	macOS system + user volumes
NTFS	Windows	Journaled	16 EiB	16 EiB	VSS (volume-level)	Metadata only	Windows system volume

bash

# Identify a mounted filesystem's type
mount | grep ' /home '
findmnt /home -o FSTYPE,SOURCE,OPTIONS

Output:

text

FSTYPE SOURCE         OPTIONS
ext4   /dev/nvme0n1p3 rw,relatime
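When findmnt is not available, GNU `stat -f` can report the filesystem type for any path without root. This is a small sketch; the helper name `fs_type` is an assumption.

```bash
# Print the filesystem type that holds the given path
fs_type() {
  stat -f -c %T "$1"
}

root_type=$(fs_type /)
here_type=$(fs_type .)
echo "/ is on $root_type; the current directory is on $here_type"
```

Note that `stat -f` identifies filesystems by magic number, so ext4 is reported as `ext2/ext3`.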

When to pick each

ext4 — anywhere you want the default. Boring, fast, well-understood. Good for /, /home on most servers, and any storage that doesn't need snapshots.
XFS — large single-file workloads (databases, log archives, video). Default on RHEL since 7. Cannot shrink in-place.
Btrfs — single-disk workstations that want snapshots + send/receive, or simple RAID-1. Default on openSUSE and Fedora Workstation 33+. Avoid RAID-5/6 in production.
ZFS — multi-disk servers and NAS where data integrity dominates. Outside Linux kernel mainline due to license; install via zfs-dkms.
APFS — macOS only; nothing to choose, it's automatic since macOS 10.13.
NTFS — Windows volumes; read on Linux is fine, and write through ntfs-3g (FUSE) works but is slow compared with a native driver. Kernels from 5.15 on also ship the in-kernel ntfs3 read-write driver. macOS reads natively but does not write.

Fragmentation

Fragmentation happens when a file's blocks are scattered across the device rather than contiguous, forcing more seeks per read. On SSDs it matters far less (no head movement), but heavy fragmentation can still increase the number of I/O requests and hurt throughput on cheap drives. Modern Linux filesystems (ext4, XFS, Btrfs) allocate extents — runs of contiguous blocks — which dramatically reduces fragmentation compared to ext2/3.

bash

# Check ext4 fragmentation
sudo e4defrag -c /home

Output:

text

<Fragmented files>                             now/best
1. /home/alice/big.tar.gz                       12/1
2. /home/alice/.cache/firefox/places.sqlite     7/1
...
Total/best extents                              4823/4811
Average size per extent                         512 KB
Fragmentation score                             0
 [0-30 no problem: defragmentation is not needed]

bash

# Defragment one file or a tree
sudo e4defrag /home/alice/big.tar.gz
sudo e4defrag /home/alice

Output:

text

ext4 defragmentation for /home/alice/big.tar.gz
[1/1]/home/alice/big.tar.gz:        100%  extents: 12 -> 1   [ OK ]
 Success:                        [1/1]

Btrfs has btrfs filesystem defragment -r /mountpoint; XFS has xfs_fsr; APFS does it transparently in the background. NTFS uses defrag on Windows.

On an SSD, do not run a defrag tool on a schedule — it generates write amplification with no benefit. Use it only when a specific file or workload shows symptoms (e.g. filefrag reports tens of thousands of extents).
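The "only when there are symptoms" rule can be turned into a quick check on filefrag's summary line. In this sketch the helper name `extent_count`, the sample line, and the 10000-extent limit are illustrative assumptions, not measured values.

```bash
# Pull the extent count out of a filefrag summary line read on stdin
# (format: "<path>: <N> extents found")
extent_count() {
  awk '{ print $(NF - 2) }'
}

sample='/home/alice/big.tar.gz: 12 extents found'   # illustrative line
extents=$(printf '%s\n' "$sample" | extent_count)
limit=10000                                         # hypothetical threshold

if [ "$extents" -gt "$limit" ]; then
  echo "worth defragmenting ($extents extents)"
else
  echo "leave it alone ($extents extents)"
fi

# Against a real file: filefrag /path/to/file | extent_count
```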

Mount options

The mount command and /etc/fstab accept per-filesystem options that change the trade-off between safety, speed, and behaviour. The big ones are common to most Linux filesystems; the rest are FS-specific.

bash

# Show current mount options for everything
mount | column -t
findmnt -t ext4,xfs,btrfs -o TARGET,SOURCE,FSTYPE,OPTIONS

Output (findmnt):

text

TARGET   SOURCE         FSTYPE OPTIONS
/        /dev/nvme0n1p2 ext4   rw,relatime,errors=remount-ro
/home    /dev/nvme0n1p3 ext4   rw,relatime,nodev,nosuid
/data    /dev/sdb1      xfs    rw,relatime,attr2,inode64,logbufs=8

Generic mount options

Option	Effect
`ro` / `rw`	Read-only / read-write
`noatime`	Skip atime updates entirely (fastest for reads)
`relatime`	atime updated only if older than mtime/ctime or 24h (Linux default)
`nodev`	Don't allow device files
`nosuid`	Ignore setuid/setgid bits — defence in depth for user-writable mounts
`noexec`	Disallow execute — useful on `/tmp`, `/var/tmp`, `/dev/shm`
`nofail`	Don't block boot if the device is missing (USB drives, NFS)
`x-systemd.automount`	Lazy-mount on first access via systemd
`sync`	Force every write to hit disk before returning (slow; for removable media)
`discard`	Issue TRIM on file deletion (SSD) — prefer periodic `fstrim` instead
`errors=remount-ro`	Switch to read-only on FS error (what Debian and Ubuntu installers set for `/` on ext4)

bash

# /etc/fstab entry — a hardened /tmp
tmpfs   /tmp   tmpfs   nodev,nosuid,noexec,size=2G   0 0

# A USB drive that should not block boot
UUID=ABCD-1234   /mnt/usb   exfat   nofail,x-systemd.automount,nodev,nosuid   0 0

Output: (none — fstab configuration only)
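To audit whether a mount actually carries the hardening options from the table, a whole-token match on the option string is enough. The sketch below uses a hypothetical helper `has_opt` and the /home option string from the findmnt output earlier.

```bash
# Succeed if the comma-separated option list in $1 contains exactly the option $2
has_opt() {
  case ",$1," in
    *",$2,"*) return 0 ;;
    *)        return 1 ;;
  esac
}

opts='rw,relatime,nodev,nosuid'   # /home line from the findmnt output
report=""
for o in nodev nosuid noexec; do
  if has_opt "$opts" "$o"; then
    report="$report$o=yes "
  else
    report="$report$o=no "
  fi
done
echo "$report"

# Against a live mount: opts=$(findmnt -no OPTIONS /tmp)
```

Wrapping both strings in commas avoids false matches such as `dev` inside `nodev`.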

ext4-specific

Option	Effect
`data=ordered`	Default; metadata journaled, data ordered before commit
`data=writeback`	Faster; brief stale-data window after crash
`data=journal`	Safest; journals data too — slowest
`barrier=1`	Default; force barriers — required for crash safety on caching disks
`journal_async_commit`	Async journal commits — slight speedup, slight risk
`commit=N`	Sync interval in seconds (default 5)

XFS-specific

Option	Effect
`inode64`	Allow inodes to be placed anywhere on the volume, giving inode numbers above 32 bits (default since Linux 3.7)
`logbufs=N`	Number of in-memory log buffers (default 8)
`largeio`	Optimal-I/O hints for large sequential reads
`noquota`	Disable quota accounting

Btrfs-specific

Option	Effect
`subvol=NAME`	Mount a specific subvolume as root
`compress=zstd:3`	Transparent compression (also `lzo`, `zlib`)
`nodatacow`	Disable CoW for newly created files on this mount (DB / VM images); use `chattr +C` for per-file or per-directory control
`ssd`	Enable SSD optimisations (auto-detected for most modern drives)
`autodefrag`	Defragment on write

Common pitfalls

df and du disagree — usually a process is holding a deleted file open. Find it with lsof | grep deleted or lsof +L1. See the lsof & ss cheatsheet for the recovery recipe (: > /proc/PID/fd/N).
Out of inodes despite free disk — check df -i, then move small files elsewhere or recreate the filesystem with more inodes.
rm of a "deleted" file doesn't free space — same root cause: an open file descriptor. Kill or signal the holder.
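Symlink loops — ln -s . loop creates a directory that contains itself. Plain find does not follow symlinks; find -L does, and GNU find reports the loop and moves on, but scripts that follow links naively can recurse without end. Use find -L only when you mean it, or pass -maxdepth.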
Hard link across filesystems fails — ln returns Invalid cross-device link. Use ln -s (symbolic link) instead.
Copying preserves links — sometimes — cp does not preserve hard links by default; use cp -a or cp --preserve=links. rsync needs -H.
tar and zip differ on links — tar preserves hard links (deduplicates within the archive); zip does not. For backups, use tar with --hard-dereference (follow them) or default (preserve them).
mv between filesystems is cp + rm — readers can observe a partial state and the operation is not atomic. Use rsync --remove-source-files for safer cross-FS moves.
noexec on /tmp breaks installers — many shell installers (get.docker.com style) write to /tmp then exec. Either remount with exec briefly or set TMPDIR=/var/tmp (if that's exec-allowed).
APFS volumes are case-insensitive by default — Foo.txt and foo.txt collide. Format the volume as APFS (Case-sensitive) if you need POSIX semantics (rare on macOS; sometimes needed for Linux-style repos).
NTFS write from Linux is slow — ntfs-3g is FUSE-based. For frequent writes, format the disk as exFAT (cross-platform) instead.
TRIM via discard mount option vs. weekly fstrim — discard issues a TRIM per delete and can saturate the controller on small-file workloads. Prefer fstrim.timer (weekly, batched).
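The hard-link pitfall with cp is easy to reproduce. This sketch, assuming GNU `cp` and `stat`, copies a directory containing two names for one inode with and without `-a` and compares the resulting link counts.

```bash
d=$(mktemp -d)
mkdir "$d/src"
echo "payload" > "$d/src/a"
ln "$d/src/a" "$d/src/b"            # a and b share one inode

cp -r "$d/src" "$d/plain"           # default: links are not preserved
cp -a "$d/src" "$d/archive"         # -a implies --preserve=links

plain_links=$(stat -c %h "$d/plain/a")
archive_links=$(stat -c %h "$d/archive/a")
echo "cp -r link count: $plain_links, cp -a link count: $archive_links"
rm -rf "$d"
```

With plain `cp -r` each name becomes an independent file, so the copy takes twice the space and the two names no longer track each other.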

Real-world recipes

Find the largest directories on /

A perennial "disk full" first step. du walks the tree; --max-depth caps the recursion so you can drill down.

bash

sudo du -hx --max-depth=1 / 2>/dev/null | sort -hr | head

Output:

text

37G    /
17G    /var
12G    /home
5.4G   /usr
1.2G   /opt
880M   /root
312M   /tmp
180M   /etc

-x keeps du on the same filesystem so virtual mounts (/proc, /sys, /dev) don't pollute the report.

Find files held open after deletion

When you've deleted a log file but disk hasn't been reclaimed.

bash

sudo lsof +L1 | awk 'NR==1 || $NF=="(deleted)"'

Output:

text

COMMAND    PID  USER   FD   TYPE DEVICE     SIZE/OFF   NLINK NODE NAME
nginx     1234 nginx    2w   REG  259,2  1073741824       0  ... /var/log/nginx/access.log (deleted)

Reclaim without restart by truncating via /proc:

bash

# The redirection must run as root, so wrap it in a root shell
sudo sh -c ': > /proc/1234/fd/2'

Output: (none — exits 0 on success)
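The deleted-but-open state can be reproduced safely in a scratch file. The sketch below assumes Linux with /proc mounted; descriptor 9 is an arbitrary choice.

```bash
f=$(mktemp)
echo "log line" > "$f"

exec 9<"$f"                          # hold the file open on descriptor 9
rm "$f"                              # the name is gone, the inode is not

target=$(readlink /proc/self/fd/9)   # readlink inherits fd 9; the path ends in " (deleted)"
content=$(cat <&9)                   # data is still readable through the descriptor

exec 9<&-                            # closing the last descriptor lets the kernel free the blocks

echo "$target"
echo "$content"
```

This is the same state lsof +L1 reports: a link count of zero with at least one descriptor still open.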

Move /home onto a new disk preserving everything

The "I bought a bigger SSD" recipe. rsync -aHAX preserves modes, hard links, ACLs, and xattrs.

bash

sudo mkdir -p /mnt/newhome
sudo mount /dev/sdb1 /mnt/newhome
sudo rsync -aHAX --info=progress2 /home/ /mnt/newhome/
sudo blkid /dev/sdb1                        # capture the new UUID
# Edit /etc/fstab to mount UUID=... at /home
sudo umount /mnt/newhome
sudo mv /home /home.old
sudo mkdir /home && sudo mount /home
ls /home

Output:

text

alice  bob  shared

Snapshot a Btrfs subvolume before a risky upgrade

For taking a point-in-time copy before a risky upgrade.

bash

sudo btrfs subvolume snapshot -r /home /home-2026-05-25
sudo btrfs subvolume list /

Output:

text

ID 256 gen 12042 top level 5 path home
ID 312 gen 12042 top level 5 path home-2026-05-25

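If the upgrade goes wrong, roll back by moving the damaged subvolume aside and putting a writable snapshot of the read-only copy in its place. Do this with no users logged in; it assumes /home is a nested subvolume rather than a separate mount.

bash

sudo mv /home /home-broken
sudo btrfs subvolume snapshot /home-2026-05-25 /home

Output:

text

Create a snapshot of '/home-2026-05-25' in '/home'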

Audit a filesystem for fragmentation

Quick health-check on ext4. Anything above a score of 30 is worth defragmenting.

bash

sudo e4defrag -c /home | tail -5

Output:

text

Average size per extent                  512 KB
Fragmentation score                      8
 [0-30 no problem: defragmentation is not needed]
 [31-55 little bit fragmented: defragmentation is recommended]
 [56- needs defragmentation: run with -v option to find fragmented files]
This filesystem (/home) does not need defragmentation.

Mount a Btrfs root with compression

Once-and-for-all transparent compression for a low-write workstation. zstd:3 is the sweet spot.

bash

# Append to the existing options column for the / line in /etc/fstab:
# UUID=...   /   btrfs   defaults,subvol=@,compress=zstd:3,ssd   0 0
sudo mount -o remount,compress=zstd:3 /
# Re-compress existing files:
sudo btrfs filesystem defragment -r -czstd /

Output: (none — exits 0 on success)

Recover deleted files with `debugfs` (ext4)

A last-resort recovery path that only works if the inode and data blocks haven't been overwritten. On ext4 the extent map is often cleared when a file is deleted, so lsdel may list nothing. Always unmount or remount read-only first.

bash

sudo mount -o remount,ro /
sudo debugfs /dev/nvme0n1p2
debugfs:  lsdel

Output (truncated):

text

 Inode  Owner  Mode    Size      Blocks   Time deleted
1820000  1000   100644  1048576   1024/1024 Sun May 25 09:14:02 2026

text

debugfs:  dump <1820000> /tmp/recovered.bin
debugfs:  quit

Recovery rate falls off a cliff once the disk has been written to. For routine recovery, use snapshots (Btrfs, ZFS) or backups.

Build a deduplicated nightly backup with hard links

Classic rsync-snapshot trick: each night looks like a full copy but only changed files consume new space.

bash

yesterday=$(date -d 'yesterday' +%F)
today=$(date +%F)
rsync -aH --link-dest=/backup/$yesterday /home/ /backup/$today/
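du -sh /backup/$today
du -sh /backup/$yesterday

Output:

text

12G    /backup/2026-05-25
12G    /backup/2026-05-24

Measured separately, both trees look like 12 GB, but on disk they share most blocks via hard links. Passing both paths to a single du invocation counts each inode once, so the second figure then shows only what is unique to that tree.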
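The sharing between two hard-linked backup trees can be observed on a small scale. This sketch uses `cp -al` as a stand-in for `rsync --link-dest` (both hard-link unchanged files) and assumes GNU coreutils; the directory names are placeholders.

```bash
b=$(mktemp -d)
mkdir "$b/day1"
head -c 1048576 /dev/urandom > "$b/day1/data.bin"   # 1 MiB of incompressible data
cp -al "$b/day1" "$b/day2"                          # hard-link every file into a second tree
sync                                                # make sure block counts are settled

# Measured alone, day2 looks like a full copy
alone_day2=$(du -sk "$b/day2" | awk '{ print $1 }')

# Measured together, du counts each inode once, so day2 shows only its unique cost
joint=$(du -sk "$b/day1" "$b/day2")
joint_day1=$(printf '%s\n' "$joint" | awk 'NR == 1 { print $1 }')
joint_day2=$(printf '%s\n' "$joint" | awk 'NR == 2 { print $1 }')

echo "day2 alone: ${alone_day2}K; together: day1 ${joint_day1}K, day2 ${joint_day2}K"
rm -rf "$b"
```

The second figure from the combined run is the real incremental cost of keeping the extra night.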

Force a stuck `umount` to drop

When the filesystem is busy because a process you can't easily kill is holding it.

bash

fuser -vm /mnt/usb               # who's holding it open?
sudo umount /mnt/usb || sudo umount -l /mnt/usb   # lazy unmount as last resort

Output (fuser -vm):

text

                     USER        PID ACCESS COMMAND
/mnt/usb:            alice       4810 ..c.. vlc
                     alice       8821 ..c.. bash

-l (lazy) detaches the mount from the namespace immediately but keeps the underlying FS busy until the holders close their FDs. Prefer fixing the holders.

Tips

Always identify a disk by UUID= or LABEL= in /etc/fstab, never /dev/sdaN — device names are not stable across reboots when you add or remove drives. Grab the UUID with blkid /dev/sdaN.

findmnt is the modern view of mounts: it produces a tree and accepts column selectors like -o TARGET,SOURCE,FSTYPE,OPTIONS. It replaces both mount (no args) and cat /proc/mounts for human use.

[!WARN] Never run fsck on a mounted writable filesystem — it can corrupt the very thing it's trying to repair. Reboot into single-user mode, or mount -o remount,ro first, or use a live USB.
