cheat sheet
Memory Management
How operating systems give every process its own address space: virtual memory and paging, swap, the OOM killer, mmap, copy-on-write, the page cache, allocator choices (glibc, jemalloc, mimalloc), and how to read memory counters in top, ps, and free.
Memory Management — Virtual Memory, Paging, Swap, OOM, mmap
What it is
Operating-system memory management is the layer that turns the limited, physical RAM in your machine into the illusion of a private, contiguous, oversized address space for each process. Every modern OS does this with virtual memory: a per-process page table, transparently translated to physical frames by the CPU's MMU, with the kernel orchestrating allocation, eviction (to swap), file mapping (mmap), copy-on-write, and caching of recently-used disk pages. Reach for this article when you need to understand why free says you have "no free memory" (the page cache does), why your process used 4 GB of "virtual" memory while only touching 50 MB (overcommit + reservation), why the OOM killer chose the wrong victim, or which allocator your application should use under load.
Virtual memory and paging
Every running process has its own virtual address space: a flat range that, on x86-64 with 4-level paging, gives user space 128 TiB of usable addresses (the low half of the 256 TiB, 48-bit space), divided into 4 KiB pages. The kernel maintains a per-process page table mapping virtual pages to physical frames; the MMU walks this table to translate memory accesses, and a small hardware cache called the TLB (translation lookaside buffer) remembers recent translations so most accesses skip the walk. Pages can be resident (backed by physical RAM), swapped out (backed by swap), file-backed (backed by an mmap'd file), or unmapped (accessing them faults with SIGSEGV).
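To make the translation concrete, here is a minimal sketch of how a virtual address splits into a page number and an offset, and how the page number breaks into per-level table indices. It assumes 4 KiB pages and the 4-level, 9-bits-per-level layout of x86-64; the function names are illustrative and not part of any kernel API.

```python
PAGE_SHIFT = 12                 # 4 KiB pages: the low 12 bits are the offset
PAGE_SIZE = 1 << PAGE_SHIFT
INDEX_BITS = 9                  # each x86-64 table level indexes 512 entries


def split_address(vaddr: int) -> tuple:
    """Return (virtual page number, offset within the page)."""
    return vaddr >> PAGE_SHIFT, vaddr & (PAGE_SIZE - 1)


def table_indices(vaddr: int, levels: int = 4) -> list:
    """Return the per-level page-table indices, top level first."""
    vpn = vaddr >> PAGE_SHIFT
    mask = (1 << INDEX_BITS) - 1
    return [(vpn >> (INDEX_BITS * level)) & mask
            for level in reversed(range(levels))]


vpn, offset = split_address(0x55C4A8A23ABC)
print(hex(vpn), hex(offset))            # 0x55c4a8a23 0xabc
print(table_indices(0x55C4A8A23ABC))    # [171, 274, 325, 35]
```

The MMU performs this walk in hardware; the TLB exists so that the four lookups are skipped for recently used pages.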
# Inspect a process's memory map
cat /proc/self/maps | head
Output:
55c4a8a00000-55c4a8a23000 r--p 00000000 fd:01 1572873 /usr/bin/bash
55c4a8a23000-55c4a8af0000 r-xp 00023000 fd:01 1572873 /usr/bin/bash
55c4a8af0000-55c4a8b30000 r--p 000f0000 fd:01 1572873 /usr/bin/bash
55c4a8b30000-55c4a8b34000 r--p 0012f000 fd:01 1572873 /usr/bin/bash
55c4a8b34000-55c4a8b3d000 rw-p 00133000 fd:01 1572873 /usr/bin/bash
55c4a8b3d000-55c4a8b48000 rw-p 00000000 00:00 0
7f9b1c000000-7f9b1c021000 rw-p 00000000 00:00 0
7f9b1c021000-7f9b20000000 ---p 00000000 00:00 0
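Each row is a VMA (virtual memory area): start-end, permission flags (r-xp = read+exec, private), file offset, device, inode, and the backing file. Anonymous regions (malloc'd memory, thread stacks) show inode 0 and no path; the main heap and stack are labelled [heap] and [stack]. A ---p region has no access rights, so any access faults: it is either a guard region or address space that has been reserved but not yet made usable (here, the unused tail of a malloc arena).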
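The column layout described above can be parsed mechanically. The sketch below is one way to do it, not an official parser; the VMA type and parse_maps_line helper are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class VMA(NamedTuple):
    start: int
    end: int
    perms: str
    offset: int
    dev: str
    inode: int
    path: str

    @property
    def size_kib(self) -> int:
        return (self.end - self.start) // 1024


def parse_maps_line(line: str) -> VMA:
    # The path column is optional and may contain spaces, so split at most 5 times.
    fields = line.split(None, 5)
    addr, perms, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr.split("-"))
    return VMA(start, end, perms, int(offset, 16), dev, int(inode), path)


sample = "55c4a8a23000-55c4a8af0000 r-xp 00023000 fd:01 1572873 /usr/bin/bash"
vma = parse_maps_line(sample)
print(vma.perms, vma.size_kib, vma.path)   # r-xp 820 /usr/bin/bash
```

Summing size_kib over every line of /proc/PID/maps gives the process's total virtual size, which is why that number says little about RAM use.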
Page-fault flow
When a process touches an address whose page isn't currently resident, the MMU raises a page fault. The kernel handler then chooses an action:
| Fault kind | Action |
|---|---|
| Minor fault: page is already in RAM (for example in the page cache) but not yet mapped in this process's page table | Update the table; no I/O |
| Major fault: page must be loaded (file-backed, demand-paged, or swap) | Issue disk read; block until data arrives |
| Copy-on-write fault: page is shared but the process tried to write | Allocate a new frame, copy, update the table |
| Protection fault | Send SIGSEGV |
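Minor faults are easy to observe from user space. The sketch below touches every page of a fresh anonymous mapping and reports how many minor faults the process accumulated while doing so, using the standard resource module. The helper name is illustrative, and the count is an assumption-laden measurement: with transparent huge pages active, one fault can populate 2 MiB, so the number may be far below the page count.

```python
import mmap
import resource


def minor_faults_while_touching(length: int) -> int:
    """Touch every page of a fresh anonymous mapping; return the minor-fault delta."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    with mmap.mmap(-1, length) as region:           # anonymous, zero-filled
        for off in range(0, length, mmap.PAGESIZE):
            region[off] = 1                         # first write faults the page in
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    return after - before


pages = 4096
faults = minor_faults_while_touching(pages * mmap.PAGESIZE)
print(f"touched {pages} pages, {faults} minor faults")
```

No disk I/O is involved here, which is what distinguishes these faults from the major kind; ru_majflt in the same structure counts the latter.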
# Swap and block I/O activity over time (sustained si/so or bi usually accompanies major faults)
vmstat 1 5
Output (excerpt):
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 812M 118M 3.4G 0 0 24 18 1124 2010 4 1 94 0 0
0 0 0 810M 118M 3.4G 0 0 0 32 1042 1894 3 1 96 0 0
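si/so are the amounts of memory swapped in from and out to disk per second (KiB by default); anything sustained above zero usually means the working set no longer fits in RAM. bi/bo are blocks read from and written to block devices per second.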
Swap
Swap is a region of disk the kernel uses to evict memory pages when physical RAM is full. Touching a swapped-out page triggers a major page fault that must read it back from disk, so swap is much slower than RAM. Still, having some swap lets the kernel evict cold pages (long-idle anonymous allocations) to make room for hot ones (file cache, active working sets). The conventional wisdom that "swap is bad" is half-right: thrashing on swap is bad, but a small amount of swap activity is healthy.
# Show swap usage and configuration
free -h
swapon --show
Output:
total used free shared buff/cache available
Mem: 15Gi 5.2Gi 812Mi 128Mi 9.0Gi 9.4Gi
Swap: 8.0Gi 128Mi 7.9Gi
NAME TYPE SIZE USED PRIO
/dev/nvme0n1p4 partition 8G 128M -2
Swappiness
vm.swappiness (0–200, default 60) controls how aggressively the kernel prefers swapping anonymous pages over evicting file-cache pages. A lower value keeps anonymous memory in RAM at the cost of dropping file cache; a higher value does the opposite.
# Inspect and tune
cat /proc/sys/vm/swappiness
sudo sysctl -w vm.swappiness=10 # workstation default many people use
echo "vm.swappiness = 10" | sudo tee /etc/sysctl.d/99-swappiness.conf
Output:
60
vm.swappiness = 10
On a desktop or laptop, vm.swappiness=10 keeps interactive apps in RAM (no lag when you return to a backgrounded window). On a database server, the right setting is often 1 (swap only to avoid OOM, never proactively). On a memory-overcommitted node, the default 60 is correct.
zswap and zram
zswap is a compressed RAM cache for swap pages — the kernel compresses cold pages instead of writing them to disk first, and only spills truly cold ones to swap. zram creates a compressed RAM-backed block device used as swap, common on RAM-constrained systems (ChromeOS, Fedora Workstation since 33).
# Check zswap status
cat /sys/module/zswap/parameters/enabled
# Enable zram swap (typical Fedora-style config; the device comes from a [zram0] section in zram-generator.conf)
sudo dnf install zram-generator
sudo systemctl daemon-reload
sudo systemctl start systemd-zram-setup@zram0.service
swapon --show
Output:
Y
NAME TYPE SIZE USED PRIO
/dev/zram0 partition 4G 0B 100
The OOM killer
When the kernel cannot satisfy an allocation and has nothing left to reclaim, the out-of-memory killer picks a process to terminate. The choice is based on an OOM score (/proc/PID/oom_score) computed from the process's own memory footprint (RSS, swap usage, and page tables) relative to available memory, plus a tunable adjustment (oom_score_adj, range -1000 to 1000). The kernel logs every OOM event to dmesg with a memory summary and the chosen victim.
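To see which processes the OOM killer would currently favour, the per-process files can be ranked directly. This is a rough sketch: oom_ranking is a hypothetical helper, and proc_root is a parameter only so the function can be pointed at a test directory.

```python
import os


def oom_ranking(proc_root: str = "/proc", top: int = 5) -> list:
    """Return [(oom_score, oom_score_adj, pid, name)], highest score first."""
    rows = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "oom_score")) as f:
                score = int(f.read())
            with open(os.path.join(base, "oom_score_adj")) as f:
                adj = int(f.read())
            with open(os.path.join(base, "comm")) as f:
                name = f.read().strip()
        except (OSError, ValueError):
            continue   # process exited, or the file is unreadable
        rows.append((score, adj, int(entry), name))
    return sorted(rows, reverse=True)[:top]


if os.path.isdir("/proc/self"):
    for score, adj, pid, name in oom_ranking():
        print(f"{score:>6} {adj:>6} {pid:>7} {name}")
```

The process at the top of this list is the likeliest victim at this moment; scores shift as memory use changes.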
# Inspect OOM tunables for a process
cat /proc/self/oom_score
cat /proc/self/oom_score_adj
# Make a critical service immune to OOM
echo -1000 | sudo tee /proc/1234/oom_score_adj
# View past OOM events
sudo dmesg -T | grep -i "killed process"
sudo journalctl -k --grep "Out of memory"
Output (dmesg):
[Sun May 25 09:14:02 2026] Out of memory: Killed process 4521 (java) total-vm:8388608kB, anon-rss:5242880kB, file-rss:0kB
systemd-level OOM protection
systemd lets you set OOM adjustments and explicit kill behaviour in a unit file. Pair with cgroup memory limits to bound a service before the system-wide OOM killer ever runs.
[Service]
ExecStart=/opt/myapp/bin/myapp
# Hard ceiling: the kernel OOM-kills inside this cgroup if it is exceeded
MemoryMax=2G
# Throttle and reclaim aggressively above this
MemoryHigh=1.5G
# Prefer killing other things first
OOMScoreAdjust=-500
# If a process in the unit is OOM-killed, stop the whole service
OOMPolicy=stop
[!WARN] Setting oom_score_adj=-1000 (immune) on too many processes is dangerous: at OOM time the kernel must kill something, and if every plausible victim is immune it kills an unrelated process instead or, with nothing killable left, panics. Reserve -1000 for at most a handful of truly critical processes (sshd, init).
mmap and file-backed memory
mmap() maps a file (or anonymous memory) into the process's address space. After mapping, reads and writes to the mapped region are translated by the kernel into reads and writes of the underlying file, without the extra copy between the page cache and a user-space buffer that read/write incur. mmap is how shared libraries are loaded, how databases (SQLite, PostgreSQL, MongoDB) get cheap shared caches, and how allocators obtain large zero-filled regions from the kernel.
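// Map a file read-only.
// Needs <fcntl.h>, <stdio.h>, <sys/mman.h>, <sys/stat.h>, <unistd.h>.
int fd = open("/var/log/syslog", O_RDONLY);
struct stat st;
if (fd < 0 || fstat(fd, &st) < 0) { perror("open/fstat"); return 1; }
const char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
if (p == MAP_FAILED) { perror("mmap"); return 1; }
close(fd); // the mapping stays valid after the descriptor is closed
// p[0..st.st_size-1] now behaves like an in-memory char array.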
MAP_PRIVATE means writes are CoW (changes don't reach the file); MAP_SHARED means writes go through to the file and are visible to every other mapping. MAP_ANONYMOUS (no fd) gives you a clean, zero-filled region for application use — this is what malloc does for large allocations.
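The private/shared distinction can be checked with Python's mmap module, where ACCESS_COPY corresponds to a private copy-on-write mapping and ACCESS_WRITE to a shared, write-through one. A minimal sketch follows; the helper name and the temporary file are illustrative.

```python
import mmap
import os
import tempfile


def write_through_mapping(path: str, access: int) -> bytes:
    """Overwrite the first byte via a mapping, then return the file's on-disk bytes."""
    with open(path, "r+b") as f:
        with mmap.mmap(f.fileno(), 0, access=access) as m:
            m[0:1] = b"X"
            if access == mmap.ACCESS_WRITE:
                m.flush()   # msync; makes the write-back explicit
    with open(path, "rb") as f:
        return f.read()


fd, path = tempfile.mkstemp()
os.write(fd, b"hello")
os.close(fd)

print(write_through_mapping(path, mmap.ACCESS_COPY))    # b'hello'  (private, CoW)
print(write_through_mapping(path, mmap.ACCESS_WRITE))   # b'Xello'  (shared)
os.remove(path)
```

The private write lands in a copied page that disappears with the mapping, so the file keeps its original contents.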
Demand paging
mmap does not read the file's contents up-front. The kernel installs the mapping without populating it; the first access to each page faults, and the kernel reads that page from disk (usually along with some neighbouring pages as readahead). The result: opening a 10 GiB file via mmap is instant; touching every page costs the same total I/O as reading sequentially, but only the pages you touch are loaded. Use madvise(MADV_SEQUENTIAL) if you plan to scan linearly so the kernel reads ahead more aggressively.
# Watch a process's RSS grow as it touches a mapped file
grep -E '^(VmRSS|VmSize|VmData)' /proc/PID/status
Output:
VmSize: 10485760 kB
VmRSS: 45120 kB
VmData: 2048 kB
VmSize (virtual) is huge — the whole mapped file. VmRSS (resident) is small — only the pages actually touched.
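The gap between the two counters is easier to track as a ratio. This small sketch parses status-style text and computes the resident fraction; vm_counters is a hypothetical helper, and the sample text simply repeats the figures shown above.

```python
def vm_counters(status_text: str) -> dict:
    """Map each Vm* field of /proc/PID/status text to its value in kB."""
    counters = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key.startswith("Vm"):
            counters[key] = int(rest.split()[0])
    return counters


sample = """VmSize: 10485760 kB
VmRSS: 45120 kB
VmData: 2048 kB"""

c = vm_counters(sample)
print(f"resident fraction: {c['VmRSS'] / c['VmSize']:.2%}")   # resident fraction: 0.43%
```

Feeding it the contents of /proc/PID/status at intervals shows the fraction climbing as more of the mapping is touched.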
Copy-on-write (CoW)
When a parent process forks, the kernel does not duplicate every memory page. Instead it marks the private writable pages of both parent and child as read-only and shared; on the first write to such a page, the MMU faults, the kernel allocates a new frame, copies the data, and updates the writer's page table. The cost of fork is therefore proportional to the size of the page tables, not to the amount of data they map, which is why fork-then-exec stays cheap relative to copying even for processes with gigabytes of memory.
# Demonstrate CoW: parent and child share most memory until they write
(
python3 -c '
import os, time
big = bytearray(500 * 1024 * 1024)  # 500 MB, zero-filled so every page is touched
pid = os.fork()
if pid == 0:
    time.sleep(60)  # child sleeps
else:
    time.sleep(60)  # parent sleeps
' &
)
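sleep 2 ; ps -eo pid,ppid,rss,comm | grep python3
grep -H '^Pss:' /proc/$(pgrep -n python3)/smaps_rollup   # newest python3 = the child
Output (illustrative):
14201 14200 511048 python3
14202 14201 510980 python3
/proc/14202/smaps_rollup:Pss:              256012 kB
Both processes report about 500 MB of RSS, because RSS counts every resident page a process maps, whether shared or not. No memory was duplicated, though: the child's PSS is roughly half its RSS, which shows the pages are still shared with the parent. As either process writes to that memory, the kernel copies pages one at a time and that process's private share grows.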
Page cache
The kernel caches every disk read in the page cache. The next read of the same offset returns from RAM with no disk I/O. The page cache is unified with the file-mapping subsystem — mmap shares pages with read/write on the same file. Free memory looks scarce on a healthy system precisely because the kernel uses it all for the page cache.
free -h
Output:
total used free shared buff/cache available
Mem: 15Gi 5.2Gi 812Mi 128Mi 9.0Gi 9.4Gi
Most of the 9.0 GiB in buff/cache can be handed to processes on request, because the kernel evicts clean cache pages on demand. The number to watch is available, not free.
Drop the page cache
For benchmarking — never for production tuning.
sync # flush dirty pages first
sudo sysctl -w vm.drop_caches=3 # 1=pagecache, 2=dentries+inodes, 3=both
free -h
Output:
total used free shared buff/cache available
Mem: 15Gi 5.1Gi 9.4Gi 128Mi 512Mi 9.5Gi
The cache is now nearly empty; the next file accesses will be slow until it warms up again.
If you have a sluggish file system, posix_fadvise(POSIX_FADV_WILLNEED) or vmtouch -t /path pre-loads files into the page cache. Pair with vmtouch -l to lock them so they don't get evicted.
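The posix_fadvise hint is available from Python as os.posix_fadvise on Linux and most Unix systems. A minimal sketch, with prefetch as an illustrative helper name:

```python
import os
import tempfile


def prefetch(path: str) -> int:
    """Hint the kernel to read `path` into the page cache; return the file size."""
    fd = os.open(path, os.O_RDONLY)
    try:
        size = os.fstat(fd).st_size
        # offset=0, len=0 means "from the start to the end of the file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
        return size
    finally:
        os.close(fd)


# Demo on a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\0" * 8192)
print(prefetch(tmp.name))   # 8192
os.remove(tmp.name)
```

The call is only advice: it returns as soon as readahead is queued, and the kernel is free to drop the pages again under memory pressure.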
Memory allocators
User-space malloc() does not call the kernel for every allocation — it requests pages from the kernel via mmap or brk and partitions them into the small chunks malloc returns. The implementation of that partitioning is called the memory allocator. The default on Linux is glibc's ptmalloc; alternatives that often perform better under multi-threaded load are jemalloc and mimalloc.
| Allocator | Maintained by | Strengths | Where it shines |
|---|---|---|---|
| glibc ptmalloc | GNU | Default everywhere, well-known | Single-threaded or low-thread workloads |
| jemalloc | Meta (originally FreeBSD) | Excellent fragmentation, multi-arena | Multi-threaded servers (Redis, Cassandra, MariaDB) |
| mimalloc | Microsoft Research | Very fast small-allocations, low metadata overhead | Latency-sensitive services |
| TCMalloc | Google | Per-thread caches, integrated with pprof | C++ services with heavy small-object churn |
| Hoard | Emery Berger | Cross-platform, scalable | Research / cross-OS |
Switching the allocator at runtime
LD_PRELOAD swaps the allocator without recompiling. Use this to A/B test against your default.
# Try jemalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 \
./myapp
# Try mimalloc
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libmimalloc.so \
./myapp
Output: (none — exits 0 on success)
systemd-friendly equivalent (drop-in for the unit file):
[Service]
Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
Diagnose allocator-level fragmentation
For long-running services that grow RSS over time despite stable live data.
# jemalloc: write a heap profile at exit (needs a jemalloc built with profiling support)
MALLOC_CONF="prof:true,prof_final:true,prof_prefix:jeprof.out" ./myapp
jeprof --text ./myapp jeprof.out.*.heap | head
# mimalloc: MIMALLOC_SHOW_STATS=1 prints allocation statistics on exit
MIMALLOC_SHOW_STATS=1 LD_PRELOAD=.../libmimalloc.so ./myapp
Output:
Total: 1024.0 MB
512.0 50.0% 50.0% 512.0 50.0% allocate_buffer
256.0 25.0% 75.0% 256.0 25.0% parse_request
128.0 12.5% 87.5% 128.0 12.5% cache_entry_new
64.0 6.3% 93.8% 64.0 6.3% session_alloc
32.0 3.1% 96.9% 32.0 3.1% log_buffer_grow
Reading memory counters
The fields in top, ps, free, and /proc/PID/status look interchangeable but mean different things. Confusing them is the most common source of misreported memory bugs.
| Counter | Meaning |
|---|---|
| VmSize / VSZ | Total virtual address space, including mmaps that haven't been touched. Usually not a meaningful number. |
| VmRSS / RSS | Resident set size: physical RAM the process has mapped. Shared pages (libraries, shared mappings) are counted in full in every process that maps them. |
| RES (top) | Same as RSS. |
| SHR (top) | Shareable pages: backing libraries, mmap'd files, shared anonymous regions. |
| PSS (proportional set size) | Like RSS, but each shared page is divided by the number of sharers. The fairest "how much RAM does this process really use?" number. |
| USS (unique set size) | Only this process's private pages. The lower bound: the amount freed by killing the process. |
| Swap | Pages currently in swap. |
| Anon | Anonymous (non-file-backed) pages. |
| File | File-backed pages. |
# PSS / USS for a process
sudo cat /proc/1234/smaps_rollup
Output:
55c4a8a00000-7ffffffff000 ---p 00000000 00:00 0 [rollup]
Rss: 412580 kB
Pss: 310120 kB
Shared_Clean: 189200 kB
Shared_Dirty: 2048 kB
Private_Clean: 118304 kB
Private_Dirty: 103028 kB
Anonymous: 103028 kB
Swap: 0 kB
PSS is 310 MB — that is what this process actually costs the system, with shared library pages fairly attributed.
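The counters relate arithmetically: USS is the sum of the private fields, RSS is private plus shared, and PSS sits between the two. The sketch below derives them from rollup-style text. The parse_rollup helper is hypothetical, and the sample values are illustrative figures chosen to be internally consistent, not measurements.

```python
def parse_rollup(text: str) -> dict:
    """Parse the 'Key: value kB' lines of smaps_rollup text into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "<number> kB" lines; this skips the address-range header.
        if sep and len(parts) == 2 and parts[1] == "kB":
            fields[key.strip()] = int(parts[0])
    return fields


sample = """55c4a8a00000-7ffffffff000 ---p 00000000 00:00 0 [rollup]
Rss: 412580 kB
Pss: 310120 kB
Shared_Clean: 189200 kB
Shared_Dirty: 2048 kB
Private_Clean: 118304 kB
Private_Dirty: 103028 kB"""

r = parse_rollup(sample)
uss = r["Private_Clean"] + r["Private_Dirty"]
shared = r["Shared_Clean"] + r["Shared_Dirty"]
print(uss, shared, uss + shared == r["Rss"])     # 221332 191248 True
print(f"{shared / (r['Pss'] - uss):.2f}")        # 2.15 (average sharers per shared page)
```

A PSS below USS would be impossible, which makes this a quick sanity check on any numbers copied from a report.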
# Top consumers by PSS
sudo apt install smem
smem -tk -r -s pss | head
Output:
PID User Command Swap USS PSS RSS
4521 java java -jar app.jar 0 312.4M 324.8M 380.2M
1234 nginx nginx: worker process 0 180.0M 195.4M 240.0M
Huge pages
Standard pages are 4 KiB; huge pages are 2 MiB or 1 GiB. They reduce TLB pressure for workloads that touch large contiguous ranges (databases, JVMs, ML training). Linux supports two flavours: static (hugetlbfs, allocated at boot) and transparent (THP, the kernel promotes 4 KiB pages opportunistically).
# Show huge-page state
cat /sys/kernel/mm/transparent_hugepage/enabled
grep -i huge /proc/meminfo
Output:
always [madvise] never
AnonHugePages: 65536 kB
HugePages_Total: 0
HugePages_Free: 0
Hugepagesize: 2048 kB
Many distros default to madvise (allocate huge pages only when the application asks via madvise(MADV_HUGEPAGE)); others ship with always. Switch to always only after measuring, because THP can cause latency spikes for some workloads.
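Under the madvise setting, an application opts in per region. Here is a sketch of that request using Python's mmap module (madvise is available from Python 3.8, and MADV_HUGEPAGE only on Linux); the helper name and the 2 MiB constant are assumptions for illustration.

```python
import mmap

HUGE_2M = 2 * 1024 * 1024


def hugepage_hinted_region(length: int) -> mmap.mmap:
    """Create an anonymous private mapping and ask the kernel to back it with THP."""
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    advice = getattr(mmap, "MADV_HUGEPAGE", None)   # Linux-only constant
    if advice is not None:
        try:
            region.madvise(advice)
        except OSError:
            pass   # kernel without THP support: stay on 4 KiB pages
    return region


region = hugepage_hinted_region(8 * HUGE_2M)
region[0] = 1          # first touch; the kernel may now install a 2 MiB page
print(len(region))     # 16777216
region.close()
```

The hint is best effort: whether a huge page is actually used depends on alignment and on free contiguous memory, and AnonHugePages in /proc/meminfo shows the outcome.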
# Disable THP system-wide until the next reboot (databases often want this)
echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
Output:
never
Common pitfalls
- "Free memory" panic: free -h shows free=200M and you assume disaster. Look at available instead; the kernel will reclaim cache pages on demand.
- VSZ is huge: ps shows a process at 20 GB. Almost always meaningless: it's the address space, not RAM. Check RSS (or better, PSS).
- OOM killed the wrong process: by default the OOM killer picks the process with the largest memory footprint (RSS plus swap and page tables), shifted by oom_score_adj. Tune oom_score_adj for critical services and their children, or bound memory-hungry services with a cgroup MemoryMax.
- Swapping kills latency: if vmstat shows sustained si/so, you are out of physical memory. Either add RAM, drop a workload, or lower swappiness on latency-critical hosts.
- mmap of a huge file makes my process look bloated: only if you read VmSize. VmRSS only includes pages you've touched. PSS is the honest number.
- malloc returns but the kernel later kills you: Linux overcommits. With vm.overcommit_memory=0 (the default, heuristic) or 1 (always allow), allocations can exceed RAM+swap; the kill happens later, when the pages are written. Set vm.overcommit_memory=2 for strict accounting if you can't tolerate this.
- Fragmentation mistaken for a leak: many "Redis uses too much memory" reports turn out to be allocator fragmentation. Switching from glibc malloc to jemalloc can reclaim memory without code changes (the commonly quoted figure is 10–20 %); measure before and after.
- /dev/shm is RAM: files in /dev/shm consume RAM directly (tmpfs). A runaway Chrome tab can fill it; size it explicitly in /etc/fstab.
- THP and databases: transparent huge pages can hurt PostgreSQL, MongoDB, and Redis under fork-heavy workloads (large CoW pages). Disable for those services.
- Locked pages exhaust the limit: mlock() is capped by RLIMIT_MEMLOCK. On a service that mlocks large regions (databases with mlock=true), set LimitMEMLOCK=infinity in the systemd unit.
- Container memory limit triggers OOM but no swap: Kubernetes nodes typically run with swap disabled; once the cgroup hits memory.max and nothing can be reclaimed, the kernel OOM-kills inside the container. Tune MemoryHigh (soft) and MemoryMax (hard) so you throttle before dying.
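The overcommit and VmSize pitfalls above can be observed directly: reserving address space raises VmSize at once, while RSS moves only when pages are written. A Linux-only sketch under stated assumptions: it reads /proc/self/status, the helper names are illustrative, and the sizes are arbitrary.

```python
import mmap


def vm_kb(field: str) -> int:
    """Read one Vm* counter (in kB) from /proc/self/status (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise KeyError(field)


def reserve_then_touch(length: int, touch: int) -> dict:
    """Map `length` bytes, write to the first `touch` bytes, report counter deltas."""
    size0, rss0 = vm_kb("VmSize"), vm_kb("VmRSS")
    region = mmap.mmap(-1, length, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
    size1, rss1 = vm_kb("VmSize"), vm_kb("VmRSS")
    for off in range(0, touch, mmap.PAGESIZE):
        region[off] = 1                       # each write faults one page in
    rss2 = vm_kb("VmRSS")
    region.close()
    return {
        "vmsize_growth_kb": size1 - size0,    # jumps by the whole reservation
        "rss_after_map_kb": rss1 - rss0,      # stays near zero
        "rss_after_touch_kb": rss2 - rss1,    # grows with the touched pages
    }


print(reserve_then_touch(256 * 1024 * 1024, 16 * 1024 * 1024))
```

The kernel's RSS counters are updated in batches, so the touched figure is approximate rather than an exact page count.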
Real-world recipes
Where did my memory go?
A diagnostic that pairs free with the top RSS and PSS consumers.
{
  echo "=== free -h ===" ; free -h
  echo "=== top 10 by RSS ==="
  ps -eo pid,user,rss,pcpu,comm --sort=-rss | head -11
  echo "=== top 10 by PSS (smem) ==="
  sudo smem -tk -r -s pss 2>/dev/null | head -11
  echo "=== swap users ==="
  for f in /proc/[0-9]*/status; do
    awk '/^Pid:/{p=$2} /^Name:/{n=$2} /^VmSwap:/&&$2>0{print $2, p, n}' "$f"
  done | sort -rn | head
}
Output (excerpt):
=== free -h ===
total used free shared buff/cache available
Mem: 15Gi 5.2Gi 812Mi 128Mi 9.0Gi 9.4Gi
=== top 10 by RSS ===
PID USER RSS %CPU COMMAND
4521 alice 524288 12.3 java
1234 nginx 240000 0.4 nginx
9200 alice 118400 3.1 node
Cap a service's memory at 1 GB
systemd is the cleanest way. The kernel enforces the limit at the cgroup boundary.
sudo systemctl edit myapp.service
Output: (none — opens editor, writes drop-in on save)
[Service]
MemoryHigh=800M
MemoryMax=1G
OOMPolicy=stop
sudo systemctl restart myapp.service
systemctl status myapp.service | grep -i memory
Output:
Memory: 412.4M (high: 800.0M max: 1.0G)
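The same limits are readable as plain files in the service's cgroup directory (on a cgroup v2 system, typically under /sys/fs/cgroup/system.slice/; that path is an assumption about the host layout). The sketch below computes the headroom left before the hard limit; the helper names are illustrative, and the demo uses a temporary directory as a stand-in for a real cgroup.

```python
import os
import tempfile


def read_cgroup_value(cgroup_dir: str, name: str):
    """Return the value in bytes, or None when the file says 'max' (unlimited)."""
    with open(os.path.join(cgroup_dir, name)) as f:
        raw = f.read().strip()
    return None if raw == "max" else int(raw)


def headroom(cgroup_dir: str):
    """Bytes left before memory.max is reached, or None if there is no limit."""
    limit = read_cgroup_value(cgroup_dir, "memory.max")
    if limit is None:
        return None
    return limit - read_cgroup_value(cgroup_dir, "memory.current")


# Stand-in directory with a 1 GiB limit and 400 MiB in use.
demo = tempfile.mkdtemp()
for name, value in (("memory.max", "1073741824"), ("memory.current", "419430400")):
    with open(os.path.join(demo, name), "w") as f:
        f.write(value + "\n")
print(headroom(demo))   # 654311424
```

Alerting on shrinking headroom gives warning before the cgroup OOM kill that MemoryMax triggers.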
Find what's swapping
When vmstat shows continuous si/so, you want to know which processes are the swap users.
for f in /proc/[0-9]*/status; do
awk '/^Pid:/{p=$2} /^Name:/{n=$2} /^VmSwap:/&&$2+0>0{printf "%10d kB %6d %s\n", $2, p, n}' "$f"
done | sort -rn | head
Output:
412000 kB 4521 java
88000 kB 9200 node
32000 kB 1234 nginx
Free the page cache without restarting
Almost never the right answer in production (you'll just refill it on the next read), but useful when benchmarking.
sync ; sudo sysctl -w vm.drop_caches=3
Output:
vm.drop_caches = 3
Reduce server swappiness for latency-sensitive workloads
Database servers and game servers usually want vm.swappiness=1 — the kernel only swaps when absolutely necessary.
echo "vm.swappiness = 1" | sudo tee /etc/sysctl.d/99-db-swappiness.conf
sudo sysctl --system
Output:
* Applying /etc/sysctl.d/99-db-swappiness.conf ...
vm.swappiness = 1
Switch to jemalloc for a Redis server
Redis ships with jemalloc baked in on most builds, but for ad-hoc apps:
sudo apt install libjemalloc2
sudo systemctl edit myapp.service
Output: (none — installs package and opens editor for the drop-in)
[Service]
Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
sudo systemctl restart myapp.service
sudo grep -m1 jemalloc /proc/$(systemctl show -p MainPID --value myapp.service)/maps
Output:
7f9b1c000000-7f9b1c1f0000 r--p 00000000 fd:01 1572881 /usr/lib/x86_64-linux-gnu/libjemalloc.so.2
Lock critical pages in RAM
For a process that must never page out (real-time audio, security daemons).
sudo systemctl edit critical.service
Output: (none — opens editor, writes drop-in on save)
[Service]
LimitMEMLOCK=infinity
In code: mlockall(MCL_CURRENT | MCL_FUTURE) after startup.
Disable transparent huge pages for a database
PostgreSQL, MongoDB, and Redis usually want THP off. Add a unit that runs at boot.
cat <<'EOF' | sudo tee /etc/systemd/system/disable-thp.service
[Unit]
Description=Disable Transparent Huge Pages
DefaultDependencies=no
After=sysinit.target local-fs.target
Before=basic.target
[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'
ExecStart=/bin/sh -c 'echo never > /sys/kernel/mm/transparent_hugepage/defrag'
RemainAfterExit=yes
[Install]
WantedBy=basic.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now disable-thp.service
cat /sys/kernel/mm/transparent_hugepage/enabled
Output:
always madvise [never]
Watch the OOM killer in real time
For an investigation where you suspect imminent OOM.
sudo journalctl -k -f --grep='oom-killer|Out of memory'
Output: (none until an OOM event occurs; matching kernel messages stream live until you press Ctrl-C)
Tips
When in doubt about "how much RAM is this process really using?", use PSS, not RSS. smem -tk -r -s pss ranks the box's processes fairly when many of them share libraries.
slabtop shows kernel slab cache use, which is useful when free says RAM is gone but no process accounts for it. The dentry and inode caches can hold gigabytes after a find / run; vm.drop_caches=2 flushes them.
[!WARN] Don't disable swap entirely on a Linux system to "make it faster" — without swap the kernel cannot evict cold anonymous pages, and that pushes the OOM killer closer to active workloads. A small swap (1–2 GB) gives the kernel headroom; the cost is paid only if you actually swap.