Memory Management

How operating systems give every process its own address space: virtual memory and paging, swap, the OOM killer, mmap, copy-on-write, the page cache, allocator choices (glibc, jemalloc, mimalloc), and how to read memory counters in top, ps, and free.

Memory Management — Virtual Memory, Paging, Swap, OOM, mmap

What it is

Operating-system memory management is the layer that turns the limited physical RAM in your machine into the illusion of a private, contiguous, oversized address space for each process. Every modern OS does this with virtual memory: each process has its own page table, which the CPU's MMU uses to translate virtual addresses to physical frames, while the kernel orchestrates allocation, eviction (to swap), file mapping (mmap), copy-on-write, and caching of recently used disk pages. Reach for this article when you need to understand why free reports almost no free memory (the page cache is holding it), why your process shows 4 GB of virtual memory while touching only 50 MB (reservation plus overcommit), why the OOM killer chose the wrong victim, or which allocator your application should use under load.

Virtual memory and paging

Every running process has its own virtual address space. On x86-64 with the usual 4-level page tables, user space gets a flat range of 128 TiB (47-bit addresses), divided into 4 KiB pages. The kernel maintains a per-process page table mapping virtual pages to physical frames; the MMU walks this table to translate each memory access and keeps recent translations in a small hardware cache called the TLB (translation lookaside buffer). Pages can be resident (backed by physical RAM), swapped out (backed by swap), file-backed (backed by an mmap'd file), or unmapped (accessing them raises SIGSEGV).
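
To make the page arithmetic concrete, here is a minimal Python sketch of how a virtual address splits into a page number and an offset within that page. The function name split_address and the example address are illustrative, and the 4 KiB page size is passed explicitly because other systems may use a different value.

```python
import mmap

def split_address(addr, page_size=mmap.PAGESIZE):
    """Return (virtual page number, byte offset within that page)."""
    return addr // page_size, addr % page_size

if __name__ == "__main__":
    # Arbitrary example address; 4096 is the usual x86-64 base page size.
    vpn, offset = split_address(0x55C4A8A00123, 4096)
    print(f"page number {vpn:#x}, offset {offset:#x}")
```

The page table maps the page number to a physical frame; the offset is carried over unchanged.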

bash
# Inspect the current shell's memory map ($$ expands to the shell's PID)
head /proc/$$/maps

Output:

text
55c4a8a00000-55c4a8a23000 r--p 00000000 fd:01 1572873  /usr/bin/bash
55c4a8a23000-55c4a8af0000 r-xp 00023000 fd:01 1572873  /usr/bin/bash
55c4a8af0000-55c4a8b30000 r--p 000f0000 fd:01 1572873  /usr/bin/bash
55c4a8b30000-55c4a8b34000 r--p 0012f000 fd:01 1572873  /usr/bin/bash
55c4a8b34000-55c4a8b3d000 rw-p 00133000 fd:01 1572873  /usr/bin/bash
55c4a8b3d000-55c4a8b48000 rw-p 00000000 00:00 0
7f9b1c000000-7f9b1c021000 rw-p 00000000 00:00 0
7f9b1c021000-7f9b20000000 ---p 00000000 00:00 0

Each row is a VMA (virtual memory area): start-end, permission flags (r-xp = read + execute, private), file offset, device, inode, and the backing file. Anonymous regions (malloc'd memory, thread stacks) show inode 0 and no path; the main heap and stack are labelled [heap] and [stack]. A ---p region has no permissions at all, so any access faults; such regions serve as guard pages or reserved address space.
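
The row layout described above can be parsed mechanically. The following is a small sketch, not a complete /proc parser; the VMA tuple and parse_maps_line are names chosen for this example.

```python
from typing import NamedTuple

class VMA(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    path: str

def parse_maps_line(line):
    """Parse one row of /proc/PID/maps into a VMA tuple."""
    # The path column is optional and may contain spaces, so split at most 5 times.
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(part, 16) for part in addr.split("-"))
    return VMA(start, end, perms, int(offset, 16), dev, int(inode), path)

if __name__ == "__main__":
    row = "55c4a8a23000-55c4a8af0000 r-xp 00023000 fd:01 1572873  /usr/bin/bash"
    vma = parse_maps_line(row)
    print(vma.path, vma.perms, (vma.end - vma.start) // 1024, "KiB")
```

Subtracting start from end gives the size of the region, which is often the first thing you want when hunting for a large mapping.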

Page-fault flow

When a process touches an address whose page isn't currently resident, the MMU raises a page fault. The kernel handler then chooses an action:

| Fault kind | Action |
| --- | --- |
| Minor fault: the page is already in RAM but not yet mapped in this process's page table | Update the table; no I/O |
| Major fault: the page must be loaded from disk (file-backed or swapped out) | Issue a disk read; block until the data arrives |
| Copy-on-write fault: the page is shared and the process tried to write | Allocate a new frame, copy, update the table |
| Protection fault: the access violates the mapping's permissions, or nothing is mapped there | Send SIGSEGV |

bash
# Swap and block-I/O activity over time (one sample per second, five samples)
vmstat 1 5

Output (excerpt):

text
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0  812M  118M   3.4G    0    0    24    18 1124 2010  4  1 94  0  0
 0  0      0  810M  118M   3.4G    0    0     0    32 1042 1894  3  1 96  0  0

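si/so are the amounts of memory swapped in from and out to disk per second (KiB/s by default); sustained non-zero values usually mean the working set no longer fits in RAM. bi/bo are blocks received from and sent to block devices per second.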
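
A process can also count its own faults. The sketch below, which assumes a Unix system where Python's resource module is available, touches freshly mapped anonymous pages and reports how many minor faults that caused. The function name is illustrative, and the exact count depends on the kernel (on Linux it is typically about one fault per page).

```python
import mmap
import resource

def count_minor_faults(pages=256):
    """Touch `pages` fresh anonymous pages and return the minor-fault delta."""
    size = pages * mmap.PAGESIZE
    region = mmap.mmap(-1, size)     # anonymous, zero-filled, nothing resident yet
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    for offset in range(0, size, mmap.PAGESIZE):
        region[offset] = 1           # first write to each page faults it in
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    region.close()
    return after - before

if __name__ == "__main__":
    print("minor faults while touching 256 pages:", count_minor_faults(256))
```

No disk I/O is involved, so these count as minor faults; ru_majflt in the same structure tracks the major ones.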

Swap

Swap is a region of disk to which the kernel evicts anonymous memory pages when physical RAM runs short. Touching a swapped-out page is a major page fault that waits on disk I/O, so swap is far slower than RAM. Even so, having some swap lets the kernel evict cold pages (long-idle anonymous allocations) to make room for hot ones (file cache, active working sets). The conventional wisdom that "swap is bad" is half right: thrashing on swap is bad, but a small amount of swap activity is healthy.

bash
# Show swap usage and configuration
free -h
swapon --show

Output:

text
              total        used        free      shared  buff/cache   available
Mem:           15Gi       5.2Gi       812Mi       128Mi       9.0Gi       9.4Gi
Swap:         8.0Gi       128Mi       7.9Gi

NAME           TYPE      SIZE USED PRIO
/dev/nvme0n1p4 partition   8G 128M   -2
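
To tell occasional swap use from thrashing, it helps to turn the kernel's cumulative swap counters into a rate. The sketch below assumes the Linux /proc/vmstat fields pswpin and pswpout (pages swapped in and out since boot); the helper names are made up for this example.

```python
import time

def swap_counters(vmstat_text):
    """Extract the cumulative pswpin/pswpout page counts from /proc/vmstat text."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters

def swap_rate(before, after, seconds):
    """Pages swapped in and out per second between two snapshots."""
    return {key: (after[key] - before[key]) / seconds
            for key in ("pswpin", "pswpout")}

if __name__ == "__main__":
    with open("/proc/vmstat") as f:
        first = swap_counters(f.read())
    time.sleep(1)
    with open("/proc/vmstat") as f:
        second = swap_counters(f.read())
    print(swap_rate(first, second, 1.0))
```

A rate that stays near zero with some swap in use is the healthy case; a rate that stays high on both counters is thrashing.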

Swappiness

vm.swappiness (0–200 since Linux 5.8, 0–100 on older kernels; default 60) controls how readily the kernel swaps out anonymous pages rather than evicting file-cache pages. A lower value keeps anonymous memory in RAM at the cost of dropping file cache; a higher value does the opposite.

bash
# Inspect and tune
cat /proc/sys/vm/swappiness
sudo sysctl -w vm.swappiness=10           # workstation default many people use
echo "vm.swappiness = 10" | sudo tee /etc/sysctl.d/99-swappiness.conf

Output:

text
60
vm.swappiness = 10

On a desktop or laptop, vm.swappiness=10 keeps interactive apps in RAM (no lag when you return to a backgrounded window). On a database server, the right setting is often 1 (swap only as a last resort before OOM, not proactively). On a node that deliberately runs more workloads than fit in RAM, the default 60 is usually a reasonable starting point.

zswap and zram

zswap is a compressed in-RAM cache that sits in front of a swap device: pages on their way to swap are compressed and kept in RAM, and are written to disk only when the compressed pool fills up. zram creates a compressed RAM-backed block device that is used as swap with no disk behind it, common on RAM-constrained systems (ChromeOS, Fedora Workstation since 33).

bash
# Check zswap status
cat /sys/module/zswap/parameters/enabled

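# Enable zram swap (Fedora-style). The unit is generated from zram-generator's
# config file, so it is started rather than enabled.
sudo dnf install zram-generator
sudo systemctl daemon-reload
sudo systemctl start systemd-zram-setup@zram0.service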
swapon --show

Output:

text
Y

NAME       TYPE      SIZE USED PRIO
/dev/zram0 partition  4G   0B  100
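
How well zram is working can be read from the device's mm_stat file. The sketch below assumes the field order given in the kernel's zram documentation (orig_data_size, compr_data_size, mem_used_total, then further fields) and a device at /sys/block/zram0; the function name is illustrative.

```python
def zram_summary(mm_stat_text):
    """Summarise the first three fields of /sys/block/zramN/mm_stat (bytes)."""
    fields = [int(value) for value in mm_stat_text.split()]
    orig, compressed, used = fields[0], fields[1], fields[2]
    return {
        "orig_bytes": orig,
        "compressed_bytes": compressed,
        "mem_used_bytes": used,
        # Uncompressed size divided by compressed size; 0.0 while the device is empty.
        "ratio": orig / compressed if compressed else 0.0,
    }

if __name__ == "__main__":
    with open("/sys/block/zram0/mm_stat") as f:   # assumes zram0 exists
        print(zram_summary(f.read()))
```

A ratio close to 1 means the swapped data is not compressing and zram is costing RAM without saving much.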

The OOM killer

When the kernel cannot satisfy an allocation and has nothing left to evict, the out-of-memory killer picks a process to terminate. The choice is based on an OOM score (/proc/PID/oom_score) derived mainly from the process's memory footprint (resident pages, swap usage, and page tables) relative to the memory available, shifted by a tunable adjustment (oom_score_adj, range -1000 to 1000). The kernel logs every OOM event to dmesg with a memory summary and the chosen victim.

bash
# Inspect OOM tunables for a process
cat /proc/self/oom_score
cat /proc/self/oom_score_adj

# Make a critical service immune to OOM
echo -1000 | sudo tee /proc/1234/oom_score_adj

# View past OOM events
sudo dmesg -T | grep -i "killed process"
sudo journalctl -k --grep "Out of memory"

Output (dmesg):

text
[Sun May 25 09:14:02 2026] Out of memory: Killed process 4521 (java) total-vm:8388608kB, anon-rss:5242880kB, file-rss:0kB
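
When collecting OOM events from many hosts, it can help to turn the log line into structured data. The regular expression below is a sketch written against the sample line above; real kernels append further fields after file-rss, which this pattern simply ignores, and the names OOM_RE and parse_oom_line are invented for the example.

```python
import re

OOM_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>.+?)\) "
    r"total-vm:(?P<total_vm_kb>\d+)kB, anon-rss:(?P<anon_rss_kb>\d+)kB, "
    r"file-rss:(?P<file_rss_kb>\d+)kB"
)

def parse_oom_line(line):
    """Return a dict describing the OOM victim, or None if the line does not match."""
    match = OOM_RE.search(line)
    if match is None:
        return None
    return {key: (value if key == "comm" else int(value))
            for key, value in match.groupdict().items()}

if __name__ == "__main__":
    sample = ("[Sun May 25 09:14:02 2026] Out of memory: Killed process 4521 (java) "
              "total-vm:8388608kB, anon-rss:5242880kB, file-rss:0kB")
    print(parse_oom_line(sample))
```

The anon-rss figure is usually the interesting one: it is the anonymous memory the kernel recovered by killing the process.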

systemd-level OOM protection

systemd lets you set OOM adjustments and explicit kill behaviour in a unit file. Pair with cgroup memory limits to bound a service before the system-wide OOM killer ever runs.

ini
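# systemd does not allow trailing comments, so each note sits on its own line.
[Service]
ExecStart=/opt/myapp/bin/myapp
# Hard ceiling: the kernel OOM-kills inside this cgroup if it is exceeded
MemoryMax=2G
# Throttle and reclaim aggressively above this
MemoryHigh=1.5G
# Prefer killing other things first
OOMScoreAdjust=-500
# If the OOM killer hits any process of the service, stop the whole service
OOMPolicy=stop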

[!WARN] Setting oom_score_adj=-1000 (immune) on too many processes is dangerous: at OOM time the kernel must kill something, and if no killable process remains it panics. Reserve -1000 for at most a handful of truly critical processes (sshd, init).

mmap and file-backed memory

mmap() maps a file (or anonymous memory) into the process's address space. After mapping, loads and stores to the mapped region act directly on the kernel's cached pages of the file, without the extra copy between kernel and user buffers that read/write require. mmap is how shared libraries are loaded, how some databases (for example SQLite's optional memory-mapped I/O, PostgreSQL's shared buffers, MongoDB's original storage engine) get cheap shared caches, and how processes obtain large zero-filled regions on demand.

c
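// Map a file read-only. Headers belong at the top of the file.
#include <fcntl.h>      // open, O_RDONLY
#include <stdio.h>      // perror
#include <sys/mman.h>   // mmap, MAP_FAILED
#include <sys/stat.h>   // fstat
int fd = open("/var/log/syslog", O_RDONLY);
struct stat st; fstat(fd, &st);
const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (p == MAP_FAILED) perror("mmap");   // failure is MAP_FAILED, not NULL
// Otherwise p[0..st.st_size-1] behaves like a read-only in-memory char array.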

MAP_PRIVATE means writes are copy-on-write (changes never reach the file); MAP_SHARED means writes go through to the file and are visible to every other shared mapping of it. MAP_ANONYMOUS (no fd) gives you a clean, zero-filled region for application use; glibc's malloc does this for large allocations (128 KiB and up by default).
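
The difference between private and shared mappings is easy to observe from Python, whose mmap module exposes them as ACCESS_COPY (private, copy-on-write) and ACCESS_WRITE (shared, written through). This is a minimal sketch using a temporary file; the function name is illustrative.

```python
import mmap
import os
import tempfile

def _read(path):
    with open(path, "rb") as f:
        return f.read()

def demo_private_vs_shared():
    """Return the file's contents after a private write and after a shared write."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello world")
        path = tmp.name
    try:
        with open(path, "r+b") as f:
            # Private mapping: the write lands in a copied page, not in the file.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_COPY) as private:
                private[0:5] = b"HELLO"
            after_private = _read(path)
            # Shared mapping: the write goes through to the file.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_WRITE) as shared:
                shared[0:5] = b"HELLO"
                shared.flush()
            after_shared = _read(path)
        return after_private, after_shared
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print(demo_private_vs_shared())
```

Only the second write changes the file on disk; the first one is discarded when the private mapping is closed.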

Demand paging

mmap does not read the file's contents up front. The kernel sets up the mapping with empty page-table entries, so the first access to each page faults; the fault handler then brings that page in from disk (4 KiB at a time in principle, usually with some readahead around it). The result: mapping a 10 GiB file is near-instant, and only the pages you touch, plus whatever readahead pulled in, occupy RAM. Touching every page costs about the same total I/O as reading the file sequentially. Use madvise(MADV_SEQUENTIAL) if you plan to scan linearly so the kernel reads ahead more aggressively.

bash
# Watch a process's RSS grow as it touches a mapped file (replace PID)
grep -E '^(VmRSS|VmSize|VmData)' /proc/PID/status

Output:

text
VmSize:    10485760 kB
VmRSS:        45120 kB
VmData:        2048 kB

VmSize (virtual) is huge — the whole mapped file. VmRSS (resident) is small — only the pages actually touched.
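
A linear scan over a mapped file with the sequential-access hint might look like the sketch below. It assumes Python 3.8 or later for mmap.madvise and skips the hint where the platform does not provide MADV_SEQUENTIAL; count_newlines and demo are names chosen for this example.

```python
import mmap
import os
import tempfile

def count_newlines(path):
    """Count newline bytes in a non-empty file by scanning a read-only mapping."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # Hint that access will be sequential so the kernel can read ahead.
        if hasattr(mapped, "madvise") and hasattr(mmap, "MADV_SEQUENTIAL"):
            mapped.madvise(mmap.MADV_SEQUENTIAL)
        count = 0
        pos = mapped.find(b"\n")
        while pos != -1:          # pages are faulted in only as the scan reaches them
            count += 1
            pos = mapped.find(b"\n", pos + 1)
        return count

def demo():
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"a\nb\nc\n")
        path = tmp.name
    try:
        return count_newlines(path)
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print(demo())
```

The hint is advisory: the result is the same without it, only the readahead behaviour may differ.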

Copy-on-write (CoW)

When a process forks, the kernel does not duplicate every memory page. Instead it copies the page tables, marks the writable private pages of both parent and child read-only, and lets both map the same frames; on the first write to such a page, the MMU faults, the kernel allocates a new frame, copies the data, and updates the writer's table. The cost of fork is therefore proportional to the size of the page tables, not to the amount of data they map, which is why fork-then-exec stays cheap even for processes with gigabytes of memory.

bash
# Demonstrate CoW: parent and child share most memory until they write
(
  python3 -c '
import os, time
big = bytearray(500 * 1024 * 1024)        # 500 MB
pid = os.fork()
if pid == 0:
    time.sleep(60)                          # child sleeps
else:
    time.sleep(60)                          # parent sleeps
' &
)
sleep 2 ; ps -eo pid,ppid,rss,comm | grep python3

Output:

text
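14201 14200 511048 python3
14202 14201 510996 python3

Both processes report about 500 MB of RSS, yet physical memory holds only one copy of the buffer: RSS counts a shared page in full for every process that maps it. PSS and USS (see Reading memory counters) reveal the sharing: right after the fork the child's private memory is only a few megabytes. As the child writes to the buffer, the kernel copies it page by page and the child's private share grows.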
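
The visible effect of copy-on-write is isolation: after a fork, a write in one process never shows up in the other. The sketch below demonstrates only that semantic (it does not measure memory), assumes a Unix system with os.fork, and uses an illustrative function name.

```python
import os

def fork_isolation_demo():
    """Return (parent's view, child's view) of a buffer the child modified."""
    data = bytearray(b"parent")
    read_end, write_end = os.pipe()
    pid = os.fork()
    if pid == 0:                          # child
        os.close(read_end)
        data[:] = b"child!"               # the write hits a private copy of the page
        os.write(write_end, bytes(data))
        os._exit(0)                       # leave without running parent cleanup
    os.close(write_end)                   # parent
    child_view = os.read(read_end, 64)
    os.close(read_end)
    os.waitpid(pid, 0)
    return bytes(data), child_view

if __name__ == "__main__":
    print(fork_isolation_demo())
```

In CPython, reference-count updates also write to pages, so a forked Python process un-shares memory somewhat faster than an equivalent C program would.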

Page cache

The kernel keeps file data it reads from disk in the page cache (reads that use O_DIRECT bypass it), so the next read of the same offset is served from RAM with no disk I/O. The page cache is unified with the file-mapping subsystem: mmap and read/write on the same file share the same pages. Free memory looks scarce on a healthy system precisely because the kernel puts otherwise idle RAM to work as page cache.

bash
free -h

Output:

text
              total        used        free      shared  buff/cache   available
Mem:           15Gi       5.2Gi       812Mi       128Mi       9.0Gi       9.4Gi

Most of the 9.0 GiB in buff/cache can be handed back to processes, because the kernel evicts clean cache pages on demand. The number to watch is available, not free.

Drop the page cache

For benchmarking — never for production tuning.

bash
sync                                 # flush dirty pages first
sudo sysctl -w vm.drop_caches=3      # 1=pagecache, 2=dentries+inodes, 3=both
free -h

Output:

text
              total        used        free      shared  buff/cache   available
Mem:           15Gi       5.1Gi       9.4Gi       128Mi       512Mi       9.5Gi

The cache is now nearly empty; file accesses will be slow until it warms up again.

If a workload suffers from cold-cache reads, posix_fadvise(POSIX_FADV_WILLNEED) or vmtouch -t /path pre-loads files into the page cache. Use vmtouch -l to lock them in memory so they are not evicted.

Memory allocators

User-space malloc() does not call the kernel for every allocation — it requests pages from the kernel via mmap or brk and partitions them into the small chunks malloc returns. The implementation of that partitioning is called the memory allocator. The default on Linux is glibc's ptmalloc; alternatives that often perform better under multi-threaded load are jemalloc and mimalloc.

| Allocator | Maintained by | Strengths | Where it shines |
| --- | --- | --- | --- |
| glibc ptmalloc | GNU | Default everywhere, well-known | Single-threaded or low-thread workloads |
| jemalloc | Meta (originally FreeBSD) | Excellent fragmentation behaviour, multi-arena | Multi-threaded servers (Redis, Cassandra, MariaDB) |
| mimalloc | Microsoft Research | Very fast small allocations, low metadata overhead | Latency-sensitive services |
| TCMalloc | Google | Per-thread caches, integrated with pprof | C++ services with heavy small-object churn |
| Hoard | Emery Berger | Cross-platform, scalable | Research / cross-OS |

Switching the allocator at runtime

LD_PRELOAD swaps the allocator without recompiling. Use this to A/B test against your default.

bash
# Try jemalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
  ./myapp

# Try mimalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so \
  ./myapp

Output: (none — exits 0 on success)

systemd-friendly equivalent (drop-in for the unit file):

ini
[Service]
Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
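
To confirm that a preload actually took effect, you can look for the allocator's shared library in the process's memory map. The sketch below is a heuristic: it recognises only the library names listed in KNOWN_ALLOCATORS (an assumption of this example) and cannot see an allocator that was linked statically.

```python
KNOWN_ALLOCATORS = ("jemalloc", "mimalloc", "tcmalloc")

def detect_allocator(maps_text):
    """Return the first known replacement allocator mapped, or None."""
    for line in maps_text.splitlines():
        for name in KNOWN_ALLOCATORS:
            if "lib" + name in line:
                return name
    return None

if __name__ == "__main__":
    with open("/proc/self/maps") as f:      # inspect this Python process
        print(detect_allocator(f.read()) or "no known replacement allocator mapped")
```

Run against /proc/PID/maps of the service, a None result after setting LD_PRELOAD usually means the path was wrong or the environment did not reach the process.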

Diagnose allocator-level fragmentation

For long-running services that grow RSS over time despite stable live data.

bash
# jemalloc: write a heap profile when the process exits
# (needs a jemalloc build with profiling enabled, i.e. --enable-prof)
MALLOC_CONF="prof:true,prof_final:true,prof_prefix:jeprof.out" ./myapp
jeprof --text ./myapp jeprof.out.*.heap | head

# mimalloc: set MIMALLOC_SHOW_STATS=1 to print statistics on exit
MIMALLOC_SHOW_STATS=1 LD_PRELOAD=.../libmimalloc.so ./myapp

Output:

text
Total: 1024.0 MB
   512.0  50.0%  50.0%   512.0  50.0% allocate_buffer
   256.0  25.0%  75.0%   256.0  25.0% parse_request
   128.0  12.5%  87.5%   128.0  12.5% cache_entry_new
    64.0   6.3%  93.8%    64.0   6.3% session_alloc
    32.0   3.1%  96.9%    32.0   3.1% log_buffer_grow

Reading memory counters

The fields in top, ps, free, and /proc/PID/status look interchangeable but mean different things. Confusing them is the most common source of misreported memory bugs.

| Counter | Meaning |
| --- | --- |
| VmSize / VSZ | Total virtual address space, including mappings that have never been touched. Rarely a meaningful number. |
| VmRSS / RSS | Resident set size: physical RAM currently mapped by the process. Shared pages (libraries, CoW pages) are counted in full for every process that maps them. |
| RES (top) | Same as RSS. |
| SHR (top) | Resident shareable pages: library and other file-backed mappings plus shared memory. |
| PSS (proportional set size) | Like RSS, but each shared page is divided by the number of processes sharing it. The fairest "how much RAM does this process really use?" number. |
| USS (unique set size) | Only this process's private pages. The lower bound: killing the process frees at least this much. |
| Swap | Pages currently in swap. |
| Anon | Anonymous (non-file-backed) pages. |
| File | File-backed pages. |

bash
# PSS / USS for a process
sudo cat /proc/1234/smaps_rollup

Output:

text
55c4a8a00000-7ffffffff000 ---p 00000000 00:00 0                          [rollup]
Rss:              412580 kB
Pss:              351748 kB
Shared_Clean:      89200 kB
Shared_Dirty:       2048 kB
Private_Clean:    218304 kB
Private_Dirty:    103028 kB
Anonymous:        103028 kB
Swap:                  0 kB

PSS is about 352 MB: the process's private pages plus its proportional share of the shared ones. That is what this process actually costs the system, with shared library pages fairly attributed.
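
USS does not appear as its own line in smaps_rollup, but it can be derived as Private_Clean + Private_Dirty. The sketch below does that for text in the smaps_rollup format; the function name and the round sample figures in the usage block are hypothetical.

```python
def summarize_smaps_rollup(text):
    """Return RSS, PSS and derived USS (all in kB) from smaps_rollup text."""
    kb = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Key:   value kB" rows; the address-range header is skipped.
        if sep and len(parts) == 2 and parts[1] == "kB":
            kb[key] = int(parts[0])
    uss = kb.get("Private_Clean", 0) + kb.get("Private_Dirty", 0)
    return {"rss": kb.get("Rss", 0), "pss": kb.get("Pss", 0), "uss": uss}

if __name__ == "__main__":
    sample = """\
55c4a8a00000-7ffffffff000 ---p 00000000 00:00 0    [rollup]
Rss:                1000 kB
Pss:                 700 kB
Shared_Clean:        400 kB
Private_Clean:       300 kB
Private_Dirty:       300 kB
"""
    print(summarize_smaps_rollup(sample))
```

For any process the three values satisfy USS <= PSS <= RSS, which is a quick sanity check on numbers copied from different tools.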

bash
# Top consumers by PSS (-r sorts largest first)
sudo apt install smem
smem -tk -s pss -r | head

Output:

text
  PID User     Command                         Swap      USS      PSS      RSS
 4521 java     java -jar app.jar                  0   312.4M   324.8M   380.2M
 1234 nginx    nginx: worker process              0   180.0M   195.4M   240.0M

Huge pages

Standard pages are 4 KiB; on x86-64, huge pages are 2 MiB or 1 GiB. They reduce TLB pressure for workloads that touch large contiguous ranges (databases, JVMs, ML training). Linux supports two flavours: static (hugetlbfs, reserved explicitly, typically at boot) and transparent (THP, where the kernel promotes runs of 4 KiB pages opportunistically).

bash
# Show huge-page state
cat /sys/kernel/mm/transparent_hugepage/enabled
grep -i huge /proc/meminfo

Output:

text
always [madvise] never
AnonHugePages:    65536 kB
HugePages_Total:       0
HugePages_Free:        0
Hugepagesize:       2048 kB

Many distros default to madvise (use huge pages only where the application asks via madvise(MADV_HUGEPAGE)); others ship always. Switch to always only after measuring, because THP can cause latency spikes for some workloads.
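
Two small helpers make the numbers above tangible: one extracts the active mode (the bracketed word) from the THP enabled file, the other counts how many pages, and therefore translations, cover a region at a given page size. Both function names are illustrative.

```python
import re

def active_thp_mode(text):
    """Return the bracketed (active) choice, e.g. 'madvise', or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None

def pages_needed(size_bytes, page_size):
    """Number of pages required to cover size_bytes (ceiling division)."""
    return -(-size_bytes // page_size)

if __name__ == "__main__":
    print(active_thp_mode("always [madvise] never"))
    one_gib = 1 << 30
    print(pages_needed(one_gib, 4 * 1024), "pages of 4 KiB per GiB")
    print(pages_needed(one_gib, 2 * 1024 * 1024), "pages of 2 MiB per GiB")
```

One GiB needs 262144 translations with 4 KiB pages but only 512 with 2 MiB pages, which is where the TLB relief comes from.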

bash
# Disable THP system-wide until the next reboot (databases often want this)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

Output:

text
never

Common pitfalls

  1. "Free memory" panic: free -h shows free=200M and you assume disaster. Look at available instead; the kernel reclaims cache pages on demand.
  2. VSZ is huge: ps shows a process at 20 GB. This is almost always meaningless, because it measures address space, not RAM. Check RSS (or better, PSS).
  3. OOM killed the wrong process: by default the OOM killer favours the process with the largest footprint (RSS plus swap). Tune oom_score_adj for critical services and their children, or move them into a cgroup with MemoryMax.
  4. Swapping kills latency: if vmstat shows sustained si/so, the working set no longer fits in physical memory. Add RAM, drop a workload, or lower swappiness on latency-critical hosts.
  5. mmap of a huge file makes my process look bloated: only if you read VmSize. VmRSS includes only the pages you've touched. PSS is the honest number.
  6. malloc returns but the kernel later kills you: Linux overcommits. With vm.overcommit_memory=0 (the default heuristic) or 1 (always allow), allocations can exceed RAM + swap, and the kill happens when the pages are first written. Set vm.overcommit_memory=2 for strict accounting if you can't tolerate this.
  7. Assuming an allocator swap won't help: many "my service uses too much memory" reports turn out to be fragmentation in the allocator. Switching allocators (for example from glibc malloc to jemalloc) can reclaim memory without code changes; figures of 10–20 % are often quoted, but measure on your own workload.
  8. /dev/shm is RAM: files in /dev/shm consume RAM directly (tmpfs). A runaway Chrome tab can fill it; size it explicitly in /etc/fstab.
  9. THP and databases: transparent huge pages can hurt PostgreSQL, MongoDB, and Redis under fork-heavy workloads (each CoW fault copies a whole huge page). Disable THP on those hosts.
  10. Locked pages exhaust the limit: mlock() is capped by RLIMIT_MEMLOCK. For a service that locks large regions (databases with mlock=true), set LimitMEMLOCK=infinity in its systemd unit.
  11. Container memory limit triggers OOM with no swap to fall back on: Kubernetes nodes typically run with swap disabled, so once a container hits memory.max the kernel kills inside it immediately. Tune MemoryHigh (soft) and MemoryMax (hard) so you throttle before dying.
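
For the strict-accounting mode mentioned in the overcommit pitfall, the kernel refuses allocations once committed memory would exceed CommitLimit. As a worked example, the sketch below computes that limit from the formula in the kernel documentation, assuming vm.overcommit_kbytes is unset and the default vm.overcommit_ratio of 50; the function name is illustrative.

```python
def commit_limit_kib(mem_total_kib, swap_total_kib, overcommit_ratio=50, hugetlb_kib=0):
    """CommitLimit under vm.overcommit_memory=2: a share of RAM plus all of swap."""
    return (mem_total_kib - hugetlb_kib) * overcommit_ratio // 100 + swap_total_kib

if __name__ == "__main__":
    GIB = 1024 * 1024                     # KiB per GiB
    limit = commit_limit_kib(15 * GIB, 8 * GIB)
    print(limit / GIB, "GiB")             # 15 GiB RAM and 8 GiB swap give 15.5 GiB
```

With the default ratio, a machine with little swap can commit far less than its RAM, so raise vm.overcommit_ratio (or add swap) before switching to mode 2.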

Real-world recipes

Where did my memory go?

A diagnostic that pairs free with the top RSS and PSS consumers.

bash
{
  echo "=== free -h ===" ; free -h
  echo "=== top 10 by RSS ==="
  ps -eo pid,user,rss,pcpu,comm --sort=-rss | head -11
  echo "=== top 10 by PSS (smem) ==="
  sudo smem -tk -s pss -r 2>/dev/null | head -11
  echo "=== swap users ==="
  for f in /proc/[0-9]*/status; do
    awk '/^Pid:/{p=$2} /^Name:/{n=$2} /^VmSwap:/&&$2>0{print $2, p, n}' "$f"
  done | sort -rn | head
}

Output (excerpt):

text
=== free -h ===
              total        used        free      shared  buff/cache   available
Mem:           15Gi       5.2Gi       812Mi       128Mi       9.0Gi       9.4Gi
=== top 10 by RSS ===
  PID USER       RSS %CPU COMMAND
 4521 alice  524288  12.3 java
 1234 nginx  240000   0.4 nginx
 9200 alice  118400   3.1 node

Cap a service's memory at 1 GB

systemd is the cleanest way. The kernel enforces the limit at the cgroup boundary.

bash
sudo systemctl edit myapp.service

Output: (none — opens editor, writes drop-in on save)

ini
[Service]
MemoryHigh=800M
MemoryMax=1G
OOMPolicy=stop

bash
sudo systemctl restart myapp.service
systemctl status myapp.service | grep -i memory

Output:

text
     Memory: 412.4M (high: 800.0M max: 1.0G)

Find what's swapping

When vmstat shows continuous si/so, you want to know which processes are the swap users.

bash
for f in /proc/[0-9]*/status; do
  awk '/^Pid:/{p=$2} /^Name:/{n=$2} /^VmSwap:/&&$2+0>0{printf "%10d kB %6d %s\n", $2, p, n}' "$f"
done | sort -rn | head

Output:

text
    412000 kB   4521 java
     88000 kB   9200 node
     32000 kB   1234 nginx

Free the page cache without restarting

Almost never the right answer in production (you'll just refill it on the next read), but useful when benchmarking.

bash
sync ; sudo sysctl -w vm.drop_caches=3

Output:

text
vm.drop_caches = 3

Reduce server swappiness for latency-sensitive workloads

Database servers and game servers usually want vm.swappiness=1 — the kernel only swaps when absolutely necessary.

bash
echo "vm.swappiness = 1" | sudo tee /etc/sysctl.d/99-db-swappiness.conf
sudo sysctl --system

Output:

text
* Applying /etc/sysctl.d/99-db-swappiness.conf ...
vm.swappiness = 1

Switch to jemalloc for a Redis server

Redis itself already bundles jemalloc in most Linux builds; for other applications, preload it through the unit file:

bash
sudo apt install libjemalloc2
sudo systemctl edit myapp.service

Output: (none — installs package and opens editor for the drop-in)

ini
[Service]
Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2

bash
sudo systemctl restart myapp.service
# Use the unit's main PID; pgrep -f can match several processes
sudo grep -m1 jemalloc /proc/$(systemctl show -p MainPID --value myapp.service)/maps

Output:

text
7f9b1c000000-7f9b1c1f0000 r--p 00000000 fd:01 1572881  /usr/lib/x86_64-linux-gnu/libjemalloc.so.2

Lock critical pages in RAM

For a process that must never page out (real-time audio, security daemons).

bash
sudo systemctl edit critical.service

Output: (none — opens editor, writes drop-in on save)

ini
[Service]
LimitMEMLOCK=infinity

In code: mlockall(MCL_CURRENT | MCL_FUTURE) after startup.
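
Before relying on mlockall, it is worth checking which limit the process actually received. This sketch reads RLIMIT_MEMLOCK through Python's resource module (available on Unix); the function name and the output format are choices made for this example.

```python
import resource

def memlock_limit():
    """Return the (soft, hard) RLIMIT_MEMLOCK values as readable strings."""
    def fmt(value):
        if value == resource.RLIM_INFINITY:
            return "unlimited"
        return f"{value // 1024} KiB"
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)
    return fmt(soft), fmt(hard)

if __name__ == "__main__":
    soft, hard = memlock_limit()
    print("soft:", soft, "hard:", hard)
```

Run inside the service (or under systemd-run with the same settings), both values should read unlimited once LimitMEMLOCK=infinity is in effect.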

Disable transparent huge pages for a database

PostgreSQL, MongoDB, and Redis usually want THP off. Add a unit that runs at boot.

bash
cat <<'EOF' | sudo tee /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
DefaultDependencies=no
After=sysinit.target local-fs.target
Before=basic.target

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
RemainAfterExit=yes

[Install]
WantedBy=basic.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now disable-thp.service
cat /sys/kernel/mm/transparent_hugepage/enabled

Output:

text
always madvise [never]

Watch the OOM killer in real time

For an investigation where you suspect imminent OOM.

bash
sudo journalctl -k -f --grep='oom-killer|Out of memory'

Output: (none until an event occurs; the command keeps streaming until you press Ctrl-C)

Tips

When in doubt about "how much RAM is this process really using?", use PSS, not RSS. smem -tk -s pss -r ranks the box's processes fairly, largest first, when many of them share libraries.

slabtop shows kernel slab cache use — useful when free says RAM is gone but no process accounts for it. The dentry and inode caches can hold gigabytes after a find / run; vm.drop_caches=2 flushes them.

[!WARN] Don't disable swap entirely on a Linux system to "make it faster" — without swap the kernel cannot evict cold anonymous pages, and that pushes the OOM killer closer to active workloads. A small swap (1–2 GB) gives the kernel headroom; the cost is paid only if you actually swap.