Networking Stack
How packets actually move: the OSI and TCP/IP layer models, the BSD socket API, TCP vs UDP, the three-way handshake, MTU/MSS, NAT and port translation, basic IP routing, and the full DNS resolution flow.
Networking Stack — OSI/TCP-IP, Sockets, TCP/UDP, MTU, NAT, DNS
What it is
The "networking stack" is the layered set of protocols and OS code that turns a flat byte stream from your application into electrical or radio signals on the wire, and vice versa. Each layer hides the one below it: your application sees a socket (socket(), connect(), send(), recv()); the OS adds TCP or UDP framing; IP adds source and destination addressing; the link layer adds MAC addresses; the physical layer puts the bits on copper, fibre, or air. Reach for this article when you need the canonical mental model: what each layer does, how a TCP handshake actually unfolds, why a path MTU change breaks one app and not another, and how getaddrinfo() ends up speaking to your DNS server.
The two layer models
The OSI seven-layer model is a teaching tool from 1984; the TCP/IP four-layer model is what the internet actually uses. Most engineers move between them fluently; understanding both lets you read RFCs and product docs without translation.
| OSI layer | TCP/IP layer | Examples |
|---|---|---|
| 7 — Application | Application | HTTP, SSH, DNS, SMTP, your web app |
| 6 — Presentation | Application | TLS, MIME, JSON, gzip |
| 5 — Session | Application | TLS resumption tickets, RPC sessions |
| 4 — Transport | Transport | TCP, UDP, QUIC, SCTP |
| 3 — Network | Internet | IPv4, IPv6, ICMP, IPsec |
| 2 — Data link | Link | Ethernet, Wi-Fi (802.11), ARP, NDP |
| 1 — Physical | Link | Cat-6 cabling, fibre, radio |
# Show the layered view for a single live connection
sudo ss -tnpie state established sport = :443
Output:
ESTAB 0 0 10.0.0.10:443 203.0.113.42:49182
cubic wscale:7,7 rto:204 rtt:3.21/1.5 cwnd:10 mss:1448
users:(("nginx",pid=1234,fd=12))
The output mixes layers: PID + fd (application), port 443/49182 (transport), 10.0.0.10/203.0.113.42 (network). Underneath, the link layer is whatever NIC 10.0.0.10 lives on, which ip route get 203.0.113.42 will tell you.
When somebody says "it's a layer 7 problem" they mean the application is misbehaving (bad request, slow handler). "Layer 4" means transport — connection rejected, TCP reset, congestion. "Layer 3" means IP routing or addressing. "Layer 1/2" means cable, NIC, switch, or VLAN. Knowing which layer to investigate first is half the troubleshoot.
The BSD socket API
Every network program on every modern OS ultimately speaks the same API. It dates from 4.2BSD (1983) and is the syscall surface for TCP, UDP, UNIX domain sockets, raw IP, and (on Linux) Netlink. A socket is an open file descriptor with two extra properties: an address family (AF_INET for IPv4, AF_INET6 for IPv6, AF_UNIX for local sockets) and a type (SOCK_STREAM for TCP, SOCK_DGRAM for UDP).
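// Minimal TCP client in C (most error handling omitted for brevity)
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
int main(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(80) };
    inet_pton(AF_INET, "10.0.0.1", &addr.sin_addr);
    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) return 1;
    write(s, "GET / HTTP/1.0\r\n\r\n", 18);
    char buf[4096];
    ssize_t n = read(s, buf, sizeof(buf));   // bytes received, 0 on EOF, -1 on error
    if (n > 0) write(STDOUT_FILENO, buf, (size_t)n);
    close(s);
}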
The same pattern in Python is much shorter:
import socket
with socket.create_connection(("10.0.0.1", 80)) as s:
    s.sendall(b"GET / HTTP/1.0\r\n\r\n")  # sendall() retries until every byte is written
    print(s.recv(4096))
Server-side syscalls
The server side of TCP adds three syscalls between socket() and the data exchange: bind(), listen(), and accept().
| Syscall | Purpose |
|---|---|
| socket() | Allocate the socket FD |
| bind() | Attach the socket to a local IP and port |
| listen() | Mark the socket passive; create the accept queue |
| accept() | Pop one completed connection from the queue, return a new FD |
| connect() | Client-side: initiate the three-way handshake |
| send()/recv() / read()/write() | Data transfer |
| shutdown() | Half-close one direction (SHUT_RD, SHUT_WR, SHUT_RDWR) |
| close() | Release the FD and tear down the connection |
| setsockopt() | Tune timeouts, keepalives, buffer sizes |
# Trace a server's network syscalls (plus close) while it handles one connection
sudo strace -f -e trace=network,close -p $(pgrep -n nginx) 2>&1 | head
Output (excerpt):
accept4(6, {sa_family=AF_INET, sin_port=htons(49182), sin_addr=...}, [16], SOCK_CLOEXEC) = 12
recvfrom(12, "GET / HTTP/1.1\r\nHost: ...", 8192, 0, NULL, NULL) = 412
sendto(12, "HTTP/1.1 200 OK\r\nDate: ...", 1024, MSG_NOSIGNAL, NULL, 0) = 1024
close(12) = 0
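The same syscall sequence can be sketched in Python, whose socket module wraps these calls closely. This is a minimal illustration rather than a production server: the function name serve_once, the loopback address, and the echo behaviour are assumptions made for the example.

```python
import socket
import threading

def serve_once(listener: socket.socket) -> None:
    """Accept one connection, echo one message back, then close it."""
    conn, _peer = listener.accept()        # accept(): pop one completed connection
    with conn:
        data = conn.recv(4096)             # recv(): read what the client sent
        conn.sendall(data)                 # send(): echo it back unchanged

# socket() + bind() + listen(); port 0 asks the kernel for any free port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.settimeout(5)                     # avoid hanging forever if nobody connects
listener.bind(("127.0.0.1", 0))
listener.listen(16)                        # backlog: size hint for the accept queue
port = listener.getsockname()[1]

worker = threading.Thread(target=serve_once, args=(listener,))
worker.start()

# Client side: connect() performs the three-way handshake.
with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
    client.sendall(b"ping")
    reply = client.recv(4096)

worker.join()
listener.close()
print(reply)
```

Note that accept() returns a new socket per connection while the listening socket stays open, which is why the strace excerpt shows a fresh FD (12) coming out of a call on FD 6.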
TCP vs UDP
TCP and UDP are both transport-layer protocols that sit on top of IP. They share addressing (source/dest IP + source/dest port) but differ in everything else.
| Property | TCP | UDP |
|---|---|---|
| Connection | Yes — explicit handshake | No — datagrams are independent |
| Reliability | Retransmits lost segments | Best effort; lost packets are lost |
| Ordering | In-order delivery | Out of order possible |
| Flow control | Receive window | None |
| Congestion control | Reno, Cubic, BBR | None (application's job) |
| Header overhead | 20 bytes (no options) | 8 bytes |
| Use cases | HTTP, SSH, SMTP, FTP, most things | DNS, NTP, VoIP, gaming, QUIC base, multicast |
# Capture a DNS lookup (UDP) and the start of the HTTPS connection that follows (TCP)
sudo tcpdump -nni eth0 -c 4 'port 53 or port 443'
Output:
09:14:02.001 IP 10.0.0.10.41234 > 1.1.1.1.53: 12345+ A? example.com. (29)
09:14:02.011 IP 1.1.1.1.53 > 10.0.0.10.41234: 12345 1/0/0 A 93.184.216.34 (45)
09:14:02.012 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [S], seq 1, win 64240
09:14:02.022 IP 93.184.216.34.443 > 10.0.0.10.49182: Flags [S.], seq 1, ack 2, win 65535
Lines 1-2 are UDP (DNS); lines 3-4 are TCP (HTTPS, mid-handshake). Notice DNS has no Flags; TCP has [S] (SYN), [S.] (SYN+ACK).
QUIC (HTTP/3) runs over UDP but implements its own reliability, ordering, and congestion control — so it looks like UDP on the wire and like TCP to the application. The OS gives it nothing TCP didn't already provide; the value is moving congestion control into user space where it's easier to evolve.
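To make the "no connection, independent datagrams" row of the table concrete, here is a small loopback sketch. The payloads and the choice of loopback are assumptions for illustration; on a real network either datagram could be lost or reordered, which is why the example makes no claim about arrival order.

```python
import socket

# Two UDP sockets on loopback; port 0 lets the kernel pick a free port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2)
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# No connect(), no handshake: each sendto() is one self-contained datagram.
sender.sendto(b"first", receiver.getsockname())
sender.sendto(b"second", receiver.getsockname())

# Each recvfrom() returns exactly one datagram, never a merged byte stream.
messages = [receiver.recvfrom(2048)[0] for _ in range(2)]
print(messages)

sender.close()
receiver.close()
```

With SOCK_STREAM the two writes could arrive as a single 11-byte read; with SOCK_DGRAM the message boundaries are preserved.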
The three-way handshake
TCP opens a connection with three packets:
- SYN: client sends a synchronise packet with an initial sequence number x
- SYN-ACK: server acks x+1 and sends its own initial sequence number y
- ACK: client acks y+1; both sides are now established
client server
| SYN seq=x |
| -----------------------------> |
| |
| SYN-ACK seq=y ack=x+1 |
| <----------------------------- |
| |
| ACK ack=y+1 |
| -----------------------------> |
| ESTABLISHED |
The handshake takes one round-trip time (RTT) — the speed-of-light cost of "talking to a faraway server" before any application data flows. TLS adds another 1-2 RTTs on top; HTTP/2 multiplexes many requests inside one TCP connection to amortise it; QUIC merges the TCP and TLS handshakes into a single RTT.
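# Observe a handshake on the wire
# Terminal 1: start the capture first (tcpdump blocks until it has seen 6 packets)
sudo tcpdump -nni any -c 6 'tcp and port 443 and host example.com'
# Terminal 2: trigger the connection
curl -s -o /dev/null https://example.com/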
Output:
09:14:02.001 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [S], seq 12345
09:14:02.022 IP 93.184.216.34.443 > 10.0.0.10.49182: Flags [S.], seq 98765, ack 12346
09:14:02.022 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [.], ack 1
09:14:02.025 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [P.], seq 1:518, length 517
09:14:02.046 IP 93.184.216.34.443 > 10.0.0.10.49182: Flags [.], ack 518
09:14:02.047 IP 93.184.216.34.443 > 10.0.0.10.49182: Flags [P.], seq 1:1448
Three handshake packets, then TLS ClientHello (517 bytes), then the server replies.
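Because connect() returns once the SYN-ACK arrives, timing it gives a rough estimate of one RTT. The sketch below is an approximation under stated assumptions: connect_time_ms is a hypothetical helper name, and the demo targets a throwaway loopback listener so it runs offline. Against a hostname the figure would also include DNS resolution time.

```python
import socket
import time

def connect_time_ms(host: str, port: int, timeout: float = 3.0) -> float:
    """Return how long the TCP connect took, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0

# Demo against a local listener; the kernel completes the handshake
# even though the program never calls accept().
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
elapsed_ms = connect_time_ms("127.0.0.1", listener.getsockname()[1])
listener.close()
print(f"handshake took {elapsed_ms:.3f} ms")
```

Pointing the same helper at a remote IP address and port 443 should give a number close to the gap between the SYN and SYN-ACK timestamps in a capture like the one above.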
Connection states
A TCP connection traces through a state machine the kernel maintains per socket. ss -tan and netstat -an print the current state.
| State | Meaning |
|---|---|
| LISTEN | Server socket waiting for incoming SYNs |
| SYN-SENT | Client sent SYN, awaiting SYN-ACK |
| SYN-RECV | Server received SYN, sent SYN-ACK, awaiting ACK |
| ESTABLISHED | Three-way handshake complete; data can flow |
| FIN-WAIT-1 | Sent FIN, awaiting ACK |
| FIN-WAIT-2 | Got ACK of our FIN, awaiting peer FIN |
| CLOSE-WAIT | Peer sent FIN; we must call close() |
| LAST-ACK | We sent FIN after CLOSE-WAIT, awaiting ACK |
| TIME-WAIT | We initiated close; waiting 2×MSL (usually 60 s) for lingering segments |
| CLOSED | Connection gone |
# Histogram TCP states on this box
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn
Output:
24 ESTAB
8 TIME-WAIT
4 LISTEN
2 CLOSE-WAIT
CLOSE-WAIT is your application's bug — the peer closed, but your code hasn't called close() yet. TIME-WAIT is normal; lots of it means many short connections (consider keep-alive).
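The fix for a CLOSE-WAIT leak is to notice the peer's FIN and close your own side. A minimal loopback sketch of that pattern follows; the payload and variable names are assumptions for illustration. An empty bytes object from recv() is how the FIN shows up in application code.

```python
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

client = socket.create_connection(listener.getsockname(), timeout=5)
conn, _peer = listener.accept()
conn.settimeout(5)

client.sendall(b"bye")
client.close()                   # peer sends FIN; conn moves to CLOSE-WAIT

received = b""
while True:
    chunk = conn.recv(4096)
    if not chunk:                # b"" signals the peer's FIN (end of stream)
        break
    received += chunk

conn.close()                     # our FIN: the socket leaves CLOSE-WAIT via LAST-ACK
listener.close()
print(received, conn.fileno())   # fileno() is -1 once the socket is closed
```

Code that breaks out of the read loop but never reaches the close() call is what leaves sockets parked in CLOSE-WAIT.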
MTU and MSS
MTU (Maximum Transmission Unit) is the largest IP packet a link can carry without fragmentation. Ethernet's classic MTU is 1500 bytes; jumbo frames raise it to 9000 on supported links; PPPoE drops it to 1492; VPNs lop off another 50–100 bytes for encryption overhead.
MSS (Maximum Segment Size) is TCP's per-segment payload — MTU - 40 for IPv4 (20 IP + 20 TCP) or MTU - 60 for IPv6. The two ends advertise MSS in the SYN; the smaller wins.
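The arithmetic can be sketched as follows. The helper names mss_for and negotiated_mss are hypothetical, and the calculation assumes headers without options, as in the paragraph above.

```python
IPV4_HEADER = 20   # bytes, without options
IPV6_HEADER = 40   # bytes, fixed header only
TCP_HEADER = 20    # bytes, without options

def mss_for(mtu: int, ipv6: bool = False) -> int:
    """Largest TCP payload that fits in one packet of the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER

def negotiated_mss(local_mtu: int, peer_mtu: int, ipv6: bool = False) -> int:
    """Each side advertises its own MSS in the SYN; the smaller value wins."""
    return min(mss_for(local_mtu, ipv6), mss_for(peer_mtu, ipv6))

for name, mtu in [("Ethernet", 1500), ("PPPoE", 1492), ("WireGuard", 1420)]:
    print(f"{name:10} MTU {mtu}: MSS {mss_for(mtu)} (IPv4), {mss_for(mtu, ipv6=True)} (IPv6)")

print(negotiated_mss(1500, 1492))   # 1452
```

The mss:1448 seen in the earlier ss output is consistent with this: 1460 bytes minus 12 bytes for the TCP timestamp option that most Linux connections carry.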
# Inspect MTU per interface (the brief view, ip -br link, omits the MTU)
ip -o link show | awk '{print $2, $4, $5}'
Output:
lo: mtu 65536
eth0: mtu 1500
wg0: mtu 1420
The WireGuard interface (wg0) has MTU 1420 to leave room for the tunnel's encapsulation and encryption overhead.
# Discover the path MTU to a destination
tracepath -n 8.8.8.8 | head
Output:
1?: [LOCALHOST] pmtu 1500
1: 10.0.0.1 1.234ms
2: 100.64.0.1 12.345ms pmtu 1492
3: 192.0.2.1 14.876ms
4: 8.8.8.8 21.123ms reached
Resume: pmtu 1492 hops 4 back 4
PMTU dropped from 1500 to 1492 at hop 2 — a PPPoE link.
[!WARN] If a firewall blocks the ICMP "Fragmentation Needed" messages used for Path MTU Discovery, you get a black hole: small packets work, big ones silently disappear, your TLS handshake hangs at the certificate. Symptoms are SSH that lets you log in but stalls on ls; curl that hangs after the headers. Lower the MSS with iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu.
IP addressing and routing
IPv4 addresses are 32 bits, written as four dotted-decimal octets (10.0.0.10). IPv6 is 128 bits, written as eight colon-separated hex groups (2001:db8::1). A netmask (or prefix length) partitions the address into network and host parts: 10.0.0.10/24 means the first 24 bits are the network (10.0.0.0/24).
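Python's standard ipaddress module can do the network/host split for you. A short sketch using the same 10.0.0.10/24 example; the extra test addresses are chosen only for illustration.

```python
import ipaddress

iface = ipaddress.ip_interface("10.0.0.10/24")
print(iface.network)                 # 10.0.0.0/24
print(iface.netmask)                 # 255.255.255.0
print(iface.network.num_addresses)   # 256

# Membership tests answer "is this host on my subnet?"
print(ipaddress.ip_address("10.0.0.200") in iface.network)   # True
print(ipaddress.ip_address("10.0.1.5") in iface.network)     # False

# The :: shorthand in IPv6 stands for a run of all-zero groups.
print(ipaddress.ip_address("2001:db8::1").exploded)
```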
Routing decides which interface a packet leaves on. The kernel maintains a routing table; each entry maps a destination prefix to a next hop (gateway IP and outgoing interface). The longest-match prefix wins.
ip route
ip route get 8.8.8.8
Output:
default via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.10 metric 100
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.10
192.168.10.0/24 via 10.0.0.254 dev eth0
8.8.8.8 via 10.0.0.1 dev eth0 src 10.0.0.10 uid 1000
The default route (0.0.0.0/0) catches everything not in a more specific entry; for 8.8.8.8 that's the gateway 10.0.0.1 on eth0.
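Longest-prefix matching can be sketched in a few lines over the three routes shown in the output. This is a toy model, not how the kernel does it: real routing tables use a trie and also weigh metrics, and the lookup function here is a hypothetical helper.

```python
import ipaddress

# Routes copied from the ip route output: (prefix, gateway or None, device).
ROUTES = [
    ("0.0.0.0/0", "10.0.0.1", "eth0"),
    ("10.0.0.0/24", None, "eth0"),            # directly connected, no gateway
    ("192.168.10.0/24", "10.0.0.254", "eth0"),
]

def lookup(destination: str):
    """Return the matching route with the longest prefix."""
    addr = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), via, dev)
        for prefix, via, dev in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    return max(candidates, key=lambda route: route[0].prefixlen)

for dest in ("8.8.8.8", "10.0.0.77", "192.168.10.5"):
    net, via, dev = lookup(dest)
    print(f"{dest:14} -> {net} via {via or 'direct'} dev {dev}")
```

Every destination matches 0.0.0.0/0, so the default route only wins when nothing more specific does.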
See the ip cheatsheet for the full iproute2 toolkit.
Special address ranges
Knowing the ranges lets you read a routing table at a glance.
| Range | Purpose |
|---|---|
| 127.0.0.0/8 | Loopback |
| 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 | RFC 1918 private IPv4 |
| 169.254.0.0/16 | Link-local (auto-assigned when DHCP fails) |
| 224.0.0.0/4 | Multicast |
| 100.64.0.0/10 | Carrier-grade NAT (CGNAT) |
| 0.0.0.0/0 | Default route ("everywhere else") |
| fe80::/10 | IPv6 link-local |
| fc00::/7 | IPv6 unique local |
| 2000::/3 | IPv6 global unicast |
| ff00::/8 | IPv6 multicast |
NAT — translating between address spaces
Network Address Translation rewrites IP addresses and ports as packets cross a boundary. The home-router case is Source NAT (SNAT) or Masquerade: outbound packets get the router's WAN IP swapped in for the LAN address, and a port-mapping table remembers the mapping so replies are translated back. SNAT lets many devices share one public IP — the foundation of the modern IPv4 internet.
# Inspect the NAT table on Linux (nftables)
sudo nft list table ip nat
Output (truncated):
table ip nat {
chain POSTROUTING {
type nat hook postrouting priority srcnat; policy accept;
oifname "eth0" masquerade
}
chain PREROUTING {
type nat hook prerouting priority dstnat; policy accept;
}
}
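The port-mapping table behind masquerading can be modelled as two dictionaries. The sketch below is a deliberately simplified toy: the class name SourceNat, the public address 198.51.100.7 (a documentation range), and the starting port are all assumptions, and a real NAT also tracks protocol, destination, and timeouts per entry.

```python
import itertools

class SourceNat:
    """Toy masquerade: maps (LAN ip, LAN port) to a port on one public IP."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self._ports = itertools.count(first_port)
        self._out = {}    # (lan_ip, lan_port) -> public_port
        self._back = {}   # public_port -> (lan_ip, lan_port)

    def outbound(self, lan_ip: str, lan_port: int):
        """Rewrite the source of an outgoing packet, reusing an existing mapping."""
        key = (lan_ip, lan_port)
        if key not in self._out:
            port = next(self._ports)
            self._out[key] = port
            self._back[port] = key
        return self.public_ip, self._out[key]

    def inbound(self, public_port: int):
        """Translate a reply back to the LAN host, or None if nothing maps there."""
        return self._back.get(public_port)

nat = SourceNat("198.51.100.7")
a = nat.outbound("192.168.1.20", 51000)
b = nat.outbound("192.168.1.21", 51000)   # same LAN port, different host
print(a, b)
print(nat.inbound(a[1]), nat.inbound(b[1]), nat.inbound(12345))
```

The final lookup returns None because no outbound packet ever created that mapping, which is the only reason unsolicited inbound traffic fails to get through a NAT.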
Port forwarding (DNAT)
Inbound NAT — also known as Destination NAT or port forwarding — exposes a port on the public IP to a service on the LAN. The router rewrites the destination address before routing.
# nftables: forward TCP/80 from the WAN IP to 192.168.1.50:8080
sudo nft add rule ip nat PREROUTING tcp dport 80 dnat to 192.168.1.50:8080
Output: (none — exits 0 on success)
# iptables (legacy): same idea
sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j DNAT --to-destination 192.168.1.50:8080
Output: (none — exits 0 on success)
NAT is a routing trick, not a security feature. Many people treat "behind NAT" as a firewall — it isn't. Run a real firewall (nftables, ufw, or your router's) and don't rely on NAT to keep services hidden.
DNS resolution flow
When your code calls getaddrinfo("example.com", "443", ...), a small cascade of lookups runs before any packet hits the wire. Each step has cache and failure modes; the more steps you skip, the faster the resolution.
- Application calls getaddrinfo() (POSIX) or equivalent (InetAddress.getByName in Java).
- libc consults nsswitch.conf, typically hosts: files dns (try local files first, then DNS).
- /etc/hosts is checked. If example.com is there, return immediately. No network traffic.
- /etc/resolv.conf is consulted to find a stub resolver. Modern systemd installs 127.0.0.53 (systemd-resolved).
- Stub resolver checks its cache. If hit, return; otherwise forward to an upstream resolver (1.1.1.1, 8.8.8.8, your ISP, your local DNS server).
- Upstream resolver queries the root → TLD → authoritative servers, possibly using its own cache, and returns the answer.
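From application code, the whole cascade above sits behind a single call. A small sketch using Python's wrapper around getaddrinfo(); the resolve helper is a hypothetical name, and what a given name returns depends entirely on the machine's own hosts file and resolver configuration.

```python
import socket

def resolve(host: str, port: int):
    """Return sorted unique (family, address) pairs from the system resolver."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({(family.name, sockaddr[0]) for family, _, _, _, sockaddr in infos})

# A numeric address short-circuits the cascade: no files, no DNS traffic.
# A name goes through nsswitch, /etc/hosts, and then DNS if still unresolved.
for name in ("127.0.0.1", "localhost"):
    try:
        print(name, resolve(name, 443))
    except socket.gaierror as exc:     # name not resolvable on this machine
        print(name, "failed:", exc)
```

Because this goes through libc, it reflects /etc/hosts overrides in the same way getent hosts does.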
# Trace the full flow for one name
getent hosts example.com # libc-level lookup, uses nsswitch
resolvectl query example.com # systemd-resolved stub
dig +trace example.com # iterative trace from the root
Output (getent hosts):
93.184.216.34 example.com
Output (dig +trace excerpt):
. 518400 IN NS a.root-servers.net.
;; Received 239 bytes from 1.1.1.1#53(1.1.1.1) in 8 ms
com. 172800 IN NS a.gtld-servers.net.
;; Received 1180 bytes from 198.41.0.4#53(a.root-servers.net) in 24 ms
example.com. 172800 IN NS a.iana-servers.net.
;; Received 717 bytes from 192.5.6.30#53(a.gtld-servers.net) in 60 ms
example.com. 86400 IN A 93.184.216.34
;; Received 56 bytes from 199.43.135.53#53(a.iana-servers.net) in 12 ms
Common DNS record types
| Type | Stores |
|---|---|
| A | IPv4 address |
| AAAA | IPv6 address |
| CNAME | Alias to another name |
| MX | Mail exchanger |
| TXT | Arbitrary text (SPF, DKIM, verification) |
| NS | Authoritative nameserver |
| SOA | Start of authority: zone metadata, serial |
| PTR | Reverse DNS (IP → name) |
| SRV | Service location (_https._tcp.example.com) |
| CAA | Which CAs may issue certs |
| HTTPS / SVCB | HTTPS service binding (alt-svc replacement) |
| DS / DNSKEY / RRSIG | DNSSEC chain |
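On the wire, a record type is just a 16-bit code in the question section of the query (A is 1, MX is 15, TXT is 16, AAAA is 28). As a rough sketch of what a resolver sends, the function below hand-encodes a single-question query. The name build_query and the fixed ID 12345 are illustrative choices; real resolvers randomise the ID and usually append an EDNS0 record.

```python
import struct

def build_query(name: str, qtype: int = 1, query_id: int = 12345) -> bytes:
    """Encode a single-question DNS query (qtype 1 = A, class 1 = IN)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)
    return header + question

packet = build_query("example.com")
print(len(packet), packet.hex())
```

The packet comes to 29 bytes, the same size tcpdump reported for the example.com query in the TCP vs UDP capture. Sending it with a UDP socket to port 53 of a resolver would return a binary answer in the same header-plus-records layout.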
# Look up specific record types
dig example.com A +short
dig example.com AAAA +short
dig example.com MX +short
dig example.com TXT +short
dig +noall +answer example.com any
Output:
93.184.216.34
2606:2800:220:1:248:1893:25c8:1946
0 .
"v=spf1 -all"
"This domain is for use in examples..."
Resolver configuration
/etc/resolv.conf is the legacy text file; on systems running systemd-resolved it's typically a symlink to /run/systemd/resolve/stub-resolv.conf and the real configuration is in resolved.conf or per-link via NetworkManager.
cat /etc/resolv.conf
resolvectl status | head
Output:
# This file is managed by man:systemd-resolved(8).
nameserver 127.0.0.53
options edns0 trust-ad
search local
Global
Protocols: LLMNR=resolve -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub
[!WARN] nslookup and host query DNS directly and bypass /etc/hosts and other NSS sources, so they're misleading for "what does my application see?". Use getent hosts NAME to mirror libc's getaddrinfo path, or resolvectl query on systemd boxes.
Common pitfalls
- Confusing TCP and UDP ports: they're separate namespaces. nc -lvnp 53 (TCP) and a UDP listener can coexist; firewall rules must list the protocol.
- CLOSE-WAIT accumulation: your application is the receiver of a FIN it hasn't acted on. Find and fix the leak; it will eventually exhaust FDs.
- TIME-WAIT is not a bug: it's the kernel ensuring lingering segments expire before the (IP, port) pair is reused. Tuning tcp_tw_reuse is rarely the right answer; fixing connection churn (use keep-alive) is.
- MTU black hole: ICMP-blocked links silently drop "too big" packets. Symptoms are stalls after TLS handshake or after large query results. Lower the MSS clamp on the gateway.
- DNS caching tricks you: recently changed records may take TTL seconds to propagate through caches. Use dig +trace or query an authoritative server directly to bypass intermediates.
- NAT looks like a firewall but isn't: outbound is unrestricted; inbound is only blocked because no port mapping exists. Add a real firewall.
- tcpdump output is distorted by offloading: modern NICs do TSO/LRO, so tcpdump sees giant segments that never appear on the wire. Run ethtool -K eth0 tso off gso off gro off lro off for a clean trace.
- DNS over TCP forgotten: replies that don't fit in a UDP response (more than 512 bytes without EDNS0, common with DNSSEC) are retried over TCP. Firewalls that only allow UDP/53 break on big answers.
- localhost may resolve to IPv6 ::1 rather than IPv4 127.0.0.1: some apps misbehave when only one is reachable. Bind explicitly to 0.0.0.0 and ::1 or use dual-stack ::.
- /etc/hosts entry overrides DNS silently: fine for development but easy to forget. getent hosts shows the truth.
- Connection refused vs connection timeout: refused = the host answered with RST (no service listening or firewall sent reject); timeout = no response (host down, firewall dropped). They mean different things; the action is different.
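The refused-versus-timeout distinction is visible directly in the exception type. A minimal sketch, assuming a hypothetical probe helper and a loopback port that nothing is listening on:

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connection attempt as open, refused, timeout, or other error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"      # peer answered with RST: nothing listening, or a reject rule
    except socket.timeout:
        return "timeout"      # no answer at all: host down or packets dropped
    except OSError as exc:
        return f"error: {exc}"

# Find a free loopback port by binding to port 0, then release it.
placeholder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
placeholder.bind(("127.0.0.1", 0))
free_port = placeholder.getsockname()[1]
placeholder.close()

print(probe("127.0.0.1", free_port))   # nothing listens there, so the kernel sends RST
```

A "refused" result comes back almost instantly, while "timeout" takes the full timeout to appear, which is often the quickest way to tell the two apart by hand.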
Real-world recipes
Curl a host through a specific interface
For multi-homed boxes where you need a specific egress.
curl --interface 10.0.0.10 https://example.com
ip route get 93.184.216.34
Output:
93.184.216.34 via 10.0.0.1 dev eth0 src 10.0.0.10
Watch a TCP handshake live
Capture the handshake in one terminal while the connection is opened from another.
# Terminal 1
sudo tcpdump -nni any -c 6 'host example.com and tcp'
# Terminal 2
curl -s https://example.com/ > /dev/null
Output:
09:14:02.001 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [S], seq 12345
09:14:02.022 IP 93.184.216.34.443 > 10.0.0.10.49182: Flags [S.], seq 98765, ack 12346
09:14:02.022 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [.], ack 1
Pinpoint a black-hole MTU
tracepath reports per-hop PMTU; if traffic stalls but ping works, this is the test.
tracepath -n example.com | head
Output:
1: 10.0.0.10 0.123ms pmtu 1500
1: 10.0.0.1 1.234ms
2: 100.64.0.1 12.345ms pmtu 1480
3: 93.184.216.34 21.123ms reached
Resume: pmtu 1480 hops 3 back 3
Resolve a name with DNS-over-TLS
Bypass the local resolver and ask Cloudflare directly with DoT (encrypted DNS).
kdig +tls @1.1.1.1 example.com
Output:
;; Question
;; example.com. IN A
;; Answer
example.com. 86400 IN A 93.184.216.34
Audit which processes hold every TCP listening socket
For a security review or capacity check. ss -tlnp returns instantly even on busy hosts.
sudo ss -tlnp | column -t
Output:
State Recv-Q Send-Q Local-Address:Port Peer-Address:Port Process
LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=800,fd=3))
LISTEN 0 511 0.0.0.0:80 0.0.0.0:* users:(("nginx",pid=1234,fd=6))
LISTEN 0 511 0.0.0.0:443 0.0.0.0:* users:(("nginx",pid=1234,fd=8))
LISTEN 0 128 127.0.0.1:5432 0.0.0.0:* users:(("postgres",pid=2100,fd=5))
LISTEN 0 4096 127.0.0.53:53 0.0.0.0:* users:(("systemd-resolve",pid=611,fd=13))
Set up a port forward for testing
For exposing a service running on a workstation to a colleague over the LAN.
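# Quick & dirty with socat
socat TCP-LISTEN:8080,reuseaddr,fork TCP:localhost:3000
# Kernel-level with iptables (root). The rule lasts until reboot or flush, and the service
# must listen on the LAN-facing address: REDIRECT in PREROUTING does not target 127.0.0.1
sudo iptables -t nat -A PREROUTING -p tcp --dport 8080 -j REDIRECT --to-ports 3000
Output: (none; socat blocks, iptables exits 0 on success)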
Diagnose a slow DNS resolver
Time each resolver in turn.
for ns in 1.1.1.1 8.8.8.8 9.9.9.9 127.0.0.53; do
    echo -n "$ns -> "
    dig +short +time=2 +tries=1 @"$ns" example.com | head -1
    dig @"$ns" example.com +stats 2>&1 | awk '/Query time/{print $4, $5}'
done
Output:
1.1.1.1 -> 93.184.216.34
8 msec
8.8.8.8 -> 93.184.216.34
22 msec
9.9.9.9 -> 93.184.216.34
24 msec
127.0.0.53 -> 93.184.216.34
2 msec
Capture only the first 3 seconds of a TLS handshake
For a focused capture without filling disk.
sudo timeout 3 tcpdump -nni any -w /tmp/tls.pcap 'tcp port 443'
ls -lh /tmp/tls.pcap
Output:
-rw-r--r-- 1 root root 412K May 25 09:14 /tmp/tls.pcap
Find which TCP connections are using the most retransmits
A flaky link or congested far end shows here.
sudo ss -tin state established | grep -B1 'retrans:' | head
Output:
ESTAB 0 0 10.0.0.10:443 203.0.113.42:49182
cubic wscale:7,7 rto:204 rtt:32.1/4.8 retrans:0/12 cwnd:8 mss:1448
retrans:0/12 means 0 currently outstanding, 12 retransmits over the connection's life — high values indicate a lossy path.
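To actually rank connections by lifetime retransmits, the text output can be post-processed. The sketch below assumes the layout ss -i uses, where each connection line is followed by an indented detail line. rank_by_retransmits is a hypothetical helper, and the second and third connections in the sample are invented for illustration.

```python
import re

RETRANS = re.compile(r"retrans:(\d+)/(\d+)")

def rank_by_retransmits(ss_output: str):
    """Pair each connection line with its total retransmits, highest first."""
    ranked = []
    connection = None
    for line in ss_output.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():                  # indented detail line from -i
            match = RETRANS.search(line)
            if match and connection:
                ranked.append((int(match.group(2)), connection))
        else:                                  # a new connection line
            connection = " ".join(line.split())
    return sorted(ranked, reverse=True)

# Sample in the shape of ss -tin output; only the first entry comes from the text above.
sample = (
    "ESTAB 0 0 10.0.0.10:443 203.0.113.42:49182\n"
    "\t cubic wscale:7,7 rto:204 rtt:32.1/4.8 retrans:0/12 cwnd:8 mss:1448\n"
    "ESTAB 0 0 10.0.0.10:443 203.0.113.50:50210\n"
    "\t cubic wscale:7,7 rto:204 rtt:3.2/1.5 cwnd:10 mss:1448\n"
    "ESTAB 0 0 10.0.0.10:443 203.0.113.60:51544\n"
    "\t cubic wscale:7,7 rto:208 rtt:41.0/9.9 retrans:1/40 cwnd:4 mss:1448\n"
)
ranking = rank_by_retransmits(sample)
for total, conn in ranking:
    print(total, conn)
```

Connections that have never retransmitted carry no retrans field at all, so they simply drop out of the ranking.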
Tips
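mtr (or mtr-tiny) combines traceroute and ping into a live, refreshing display of every hop's loss and latency. It's the right answer to "is the problem in my network or theirs?": loss that starts at hop N and persists through every later hop points at hop N or the link into it, while loss at a single intermediate hop that disappears further along is usually just ICMP rate limiting.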
Prefer ss -tnp over netstat -tnp: same answer, noticeably faster on hosts with many sockets, kernel-side filters, and netstat is deprecated on Linux. The only reason to keep netstat in muscle memory is for old BSDs and macOS, where ss doesn't exist.
[!WARN] Capturing on a busy interface without a filter (tcpdump -i eth0 with no expression) can cost gigabytes per minute and drop packets. Always pass a BPF expression (port 443, host 10.0.0.1, etc.) and consider -s 96 to truncate payloads when you only need headers.