cheat sheet

Networking Stack

How packets actually move: the OSI and TCP/IP layer models, the BSD socket API, TCP vs UDP, the three-way handshake, MTU/MSS, NAT and port translation, basic IP routing, and the full DNS resolution flow.

What it is

The "networking stack" is the layered set of protocols and OS code that turns a flat byte stream from your application into electrical, optical, or radio signals on the medium, and back again. Each layer hides the one below it: your application sees a socket (socket(), connect(), send(), recv()); the OS adds TCP or UDP framing; IP adds source and destination addressing; the link layer adds MAC addresses; the physical layer puts the bits on copper, fibre, or air. Reach for this article when you need the canonical mental model: what each layer does, how a TCP handshake actually unfolds, why a path MTU change breaks one app and not another, and how getaddrinfo() ends up speaking to your DNS server.

The two layer models

The OSI seven-layer model is a teaching tool from 1984; the TCP/IP four-layer model is what the internet actually uses. Most engineers move between them fluently; understanding both lets you read RFCs and product docs without translation.

| OSI layer | TCP/IP layer | Examples |
| --- | --- | --- |
| 7 (Application) | Application | HTTP, SSH, DNS, SMTP, your web app |
| 6 (Presentation) | Application | TLS, MIME, JSON, gzip |
| 5 (Session) | Application | TLS resumption tickets, RPC sessions |
| 4 (Transport) | Transport | TCP, UDP, QUIC, SCTP |
| 3 (Network) | Internet | IPv4, IPv6, ICMP, IPsec |
| 2 (Data link) | Link | Ethernet, Wi-Fi (802.11), ARP, NDP |
| 1 (Physical) | Link | Cat-6 cabling, fibre, radio |

bash
# Show the layered view for a single live connection
sudo ss -tnpie state established sport = :443

Output:

text
ESTAB 0 0 10.0.0.10:443 203.0.113.42:49182
     cubic wscale:7,7 rto:204 rtt:3.21/1.5 cwnd:10 mss:1448
     users:(("nginx",pid=1234,fd=12))

The output mixes layers: PID + fd (application), port 443/49182 (transport), 10.0.0.10/203.0.113.42 (network). Underneath, the link layer is whatever NIC 10.0.0.10 lives on, which ip route get 203.0.113.42 will tell you.

When somebody says "it's a layer 7 problem" they mean the application is misbehaving (bad request, slow handler). "Layer 4" means transport: connection rejected, TCP reset, congestion. "Layer 3" means IP routing or addressing. "Layer 1/2" means cable, NIC, switch, or VLAN. Knowing which layer to investigate first is half the troubleshooting.

The BSD socket API

Every network program on every modern OS ultimately speaks the same API. It dates from 4.2BSD (1983) and is the syscall surface for TCP, UDP, UNIX domain sockets, raw IP, and (on Linux) Netlink. A socket is an open file descriptor with two extra properties: an address family (IPv4, IPv6, UNIX) and a type (SOCK_STREAM for TCP, SOCK_DGRAM for UDP).

c
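// Minimal TCP client in C (error handling kept to a minimum)
#include <arpa/inet.h>   // inet_pton, htons
#include <netinet/in.h>  // struct sockaddr_in
#include <sys/socket.h>  // socket, connect
#include <unistd.h>      // read, write, close

int main(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr = { .sin_family = AF_INET, .sin_port = htons(80) };
    inet_pton(AF_INET, "10.0.0.1", &addr.sin_addr);
    if (s < 0 || connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) return 1;
    write(s, "GET / HTTP/1.0\r\n\r\n", 18);
    char buf[4096]; ssize_t n = read(s, buf, sizeof(buf));  // n <= 0: EOF or error
    close(s);
    return n < 0;
}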

The same pattern in Python takes only a few lines:

python
import socket
with socket.create_connection(("10.0.0.1", 80)) as s:
    s.sendall(b"GET / HTTP/1.0\r\n\r\n")  # sendall() loops until every byte is sent
    print(s.recv(4096))

Server-side syscalls

The server side of TCP adds three syscalls between socket() and the data exchange: bind(), listen(), and accept().

| Syscall | Purpose |
| --- | --- |
| socket() | Allocate the socket FD |
| bind() | Attach the socket to a local IP and port |
| listen() | Mark the socket passive; create the accept queue |
| accept() | Pop one completed connection from the queue, return a new FD |
| connect() | Client-side: initiate the three-way handshake |
| send()/recv(), read()/write() | Data transfer |
| shutdown() | Half-close one direction (SHUT_RD, SHUT_WR, SHUT_RDWR) |
| close() | Release the FD and tear down the connection |
| setsockopt() | Tune timeouts, keepalives, buffer sizes |

bash
# Trace a server's network syscalls (plus close) during one accept
sudo strace -f -e trace=%network,close -p $(pgrep -n nginx) 2>&1 | head

Output (excerpt):

text
accept4(6, {sa_family=AF_INET, sin_port=htons(49182), sin_addr=...}, [16], SOCK_CLOEXEC) = 12
recvfrom(12, "GET / HTTP/1.1\r\nHost: ...", 8192, 0, NULL, NULL) = 412
sendto(12, "HTTP/1.1 200 OK\r\nDate: ...", 1024, MSG_NOSIGNAL, NULL, 0) = 1024
close(12) = 0
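The server-side call sequence from the table above can be sketched in Python as a one-shot echo server on loopback. This is a minimal illustration, not how nginx does it: the helper names (start_listener, serve_once, echo_roundtrip) are made up for this example, and port 0 simply asks the kernel for any free port.

```python
import socket
import threading

def start_listener(host="127.0.0.1", port=0):
    """socket() + bind() + listen(); port 0 lets the kernel pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(16)                  # backlog: size of the accept queue
    return listener

def serve_once(listener):
    """accept() one client and echo its bytes until it half-closes."""
    conn, _peer = listener.accept()      # returns a new FD for this connection
    with conn:
        while True:
            data = conn.recv(4096)       # b"" means the peer sent FIN
            if not data:
                break
            conn.sendall(data)

def echo_roundtrip(payload):
    """Run the server for one connection and return what the client got back."""
    listener = start_listener()
    worker = threading.Thread(target=serve_once, args=(listener,), daemon=True)
    worker.start()
    chunks = []
    with socket.create_connection(listener.getsockname()) as client:
        client.sendall(payload)
        client.shutdown(socket.SHUT_WR)  # half-close: we are done sending
        while True:
            chunk = client.recv(4096)
            if not chunk:
                break
            chunks.append(chunk)
    worker.join(timeout=2)
    listener.close()
    return b"".join(chunks)

if __name__ == "__main__":
    print(echo_roundtrip(b"hello"))
```

The client's shutdown(SHUT_WR) is what lets the server's recv() return b"" while the client can still read the echoed bytes.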

TCP vs UDP

TCP and UDP are both transport-layer protocols that sit on top of IP. They share addressing (source/dest IP + source/dest port) but differ in everything else.

| Property | TCP | UDP |
| --- | --- | --- |
| Connection | Yes, explicit handshake | No, datagrams are independent |
| Reliability | Retransmits lost segments | Best effort; lost packets are lost |
| Ordering | In-order delivery | Out of order possible |
| Flow control | Receive window | None |
| Congestion control | Reno, Cubic, BBR | None (application's job) |
| Header overhead | 20 bytes (no options) | 8 bytes |
| Use cases | HTTP, SSH, SMTP, FTP, most things | DNS, NTP, VoIP, gaming, QUIC base, multicast |

bash
# Capture a DNS lookup (UDP) and the HTTPS connection that follows (TCP) side by side
sudo tcpdump -nni eth0 -c 4 'port 53 or port 443'

Output:

text
09:14:02.001 IP 10.0.0.10.41234 > 1.1.1.1.53: 12345+ A? example.com. (29)
09:14:02.011 IP 1.1.1.1.53 > 10.0.0.10.41234: 12345 1/0/0 A 93.184.216.34 (45)
09:14:02.012 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [S], seq 1, win 64240
09:14:02.022 IP 93.184.216.34.443 > 10.0.0.10.49182: Flags [S.], seq 1, ack 2, win 65535

Lines 1-2 are UDP (DNS); lines 3-4 are TCP (HTTPS, mid-handshake). Notice DNS has no Flags; TCP has [S] (SYN), [S.] (SYN+ACK).
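To see the "no connection, independent datagrams" row of the table in code, here is a small loopback sketch. The function name udp_roundtrip is hypothetical; the point is that the sender never calls connect() and each recvfrom() returns exactly one datagram.

```python
import socket

def udp_roundtrip(messages):
    """Send each message as its own datagram on loopback and collect them."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))       # port 0: kernel picks a free port
    receiver.settimeout(2)                # UDP gives no delivery guarantee
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for msg in messages:
            # No connect(), no handshake: the datagram just leaves.
            sender.sendto(msg, receiver.getsockname())
        received = []
        for _ in messages:
            data, _addr = receiver.recvfrom(2048)   # one call, one whole datagram
            received.append(data)
        return received
    finally:
        sender.close()
        receiver.close()

if __name__ == "__main__":
    print(udp_roundtrip([b"first", b"second"]))
```

On a real network any of these datagrams could be lost or reordered, which is why the receiver has a timeout; loopback merely makes loss unlikely.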

QUIC (HTTP/3) runs over UDP but implements its own reliability, ordering, and congestion control — so it looks like UDP on the wire and like TCP to the application. The OS gives it nothing TCP didn't already provide; the value is moving congestion control into user space where it's easier to evolve.

The three-way handshake

TCP opens a connection with three packets:

  1. SYN — client sends a synchronise packet with an initial sequence number x
  2. SYN-ACK — server acks x+1 and sends its own initial sequence number y
  3. ACK — client acks y+1; both sides are now established
text
client                          server
  |   SYN  seq=x                  |
  | -----------------------------> |
  |                                |
  |   SYN-ACK  seq=y  ack=x+1     |
  | <----------------------------- |
  |                                |
  |   ACK  ack=y+1                |
  | -----------------------------> |
  |        ESTABLISHED             |

The handshake takes one round-trip time (RTT) — the speed-of-light cost of "talking to a faraway server" before any application data flows. TLS adds another 1-2 RTTs on top; HTTP/2 multiplexes many requests inside one TCP connection to amortise it; QUIC merges the TCP and TLS handshakes into a single RTT.

bash
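# Observe a handshake on the wire: start the capture in the background, then connect
sudo -v   # cache credentials so the background tcpdump does not prompt
sudo tcpdump -nni any -c 6 'tcp and port 443 and host example.com' &
sleep 1; curl -s -o /dev/null https://example.com/
wait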

Output:

text
09:14:02.001 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [S], seq 12345
09:14:02.022 IP 93.184.216.34.443 > 10.0.0.10.49182: Flags [S.], seq 98765, ack 12346
09:14:02.022 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [.], ack 1
09:14:02.025 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [P.], seq 1:518, length 517
09:14:02.046 IP 93.184.216.34.443 > 10.0.0.10.49182: Flags [.], ack 518
09:14:02.047 IP 93.184.216.34.443 > 10.0.0.10.49182: Flags [P.], seq 1:1448

Three handshake packets, then TLS ClientHello (517 bytes), then the server replies.
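Because connect() returns once the SYN-ACK has arrived, timing it gives a rough estimate of one round trip. The sketch below assumes a helper name, connect_time_ms, that is not part of any library; pass an IP literal if you want to keep DNS resolution out of the measurement.

```python
import socket
import time

def connect_time_ms(host, port, timeout=3.0):
    """Return how long the TCP connect took, in milliseconds.

    create_connection() also resolves the host, so with a hostname the
    figure includes DNS time; with an IP literal it is roughly one RTT.
    """
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed * 1000.0

if __name__ == "__main__":
    # Self-contained demo against a throwaway loopback listener.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print(f"loopback connect: {connect_time_ms('127.0.0.1', port):.3f} ms")
    listener.close()
```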

Connection states

A TCP connection traces through a state machine the kernel maintains per socket. ss -tan and netstat -an print the current state.

| State | Meaning |
| --- | --- |
| LISTEN | Server socket waiting for incoming SYNs |
| SYN-SENT | Client sent SYN, awaiting SYN-ACK |
| SYN-RECV | Server received SYN, sent SYN-ACK, awaiting ACK |
| ESTABLISHED | Three-way handshake complete; data can flow |
| FIN-WAIT-1 | Sent FIN, awaiting ACK |
| FIN-WAIT-2 | Got ACK of our FIN, awaiting peer FIN |
| CLOSE-WAIT | Peer sent FIN; we must call close() |
| LAST-ACK | We sent FIN after CLOSE-WAIT, awaiting ACK |
| TIME-WAIT | We initiated close; waiting 2×MSL (usually 60 s) for lingering segments |
| CLOSED | Connection gone |

bash
# Histogram TCP states on this box
ss -tan | awk 'NR>1{print $1}' | sort | uniq -c | sort -rn

Output:

text
     24 ESTAB
      8 TIME-WAIT
      4 LISTEN
      2 CLOSE-WAIT

CLOSE-WAIT is your application's bug — the peer closed, but your code hasn't called close() yet. TIME-WAIT is normal; lots of it means many short connections (consider keep-alive).
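The usual fix for CLOSE-WAIT build-up is to treat an empty read as end-of-stream and close promptly. A minimal sketch, with a made-up helper name (read_until_peer_closes); the demo uses socketpair() for convenience, which is a UNIX-domain pair rather than TCP, but the "recv() returns b'' on peer close" behaviour is the same.

```python
import socket

def read_until_peer_closes(sock):
    """Read until recv() returns b"" (the peer closed), then close our side."""
    chunks = []
    try:
        while True:
            data = sock.recv(4096)
            if not data:          # on TCP: peer sent FIN, we are in CLOSE-WAIT
                break
            chunks.append(data)
    finally:
        sock.close()              # on TCP: sends our FIN, leaving CLOSE-WAIT
    return b"".join(chunks)

if __name__ == "__main__":
    ours, theirs = socket.socketpair()
    theirs.sendall(b"bye")
    theirs.close()                # the peer hangs up first
    print(read_until_peer_closes(ours))
```

Putting close() in a finally block (or using a with statement) is the design choice that matters: the descriptor is released even if the read loop raises.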

MTU and MSS

MTU (Maximum Transmission Unit) is the largest IP packet a link can carry without fragmentation. Ethernet's classic MTU is 1500 bytes; jumbo frames raise it to 9000 on supported links; PPPoE drops it to 1492; VPNs lop off another 50–100 bytes for encryption overhead.

MSS (Maximum Segment Size) is TCP's per-segment payload — MTU - 40 for IPv4 (20 IP + 20 TCP) or MTU - 60 for IPv6. The two ends advertise MSS in the SYN; the smaller wins.
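The arithmetic is simple enough to check in a few lines. The function names below are illustrative only, and the header sizes assume no IP or TCP options.

```python
IPV4_HEADER = 20   # bytes, no options
IPV6_HEADER = 40   # bytes, fixed header
TCP_HEADER = 20    # bytes, no options

def mss_for_mtu(mtu, ipv6=False):
    """Largest TCP payload that fits in one packet of the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER

def effective_mss(local_mtu, peer_advertised_mss, ipv6=False):
    """The smaller of our own MSS and the one the peer advertised in its SYN."""
    return min(mss_for_mtu(local_mtu, ipv6), peer_advertised_mss)

if __name__ == "__main__":
    print(mss_for_mtu(1500))             # 1460 (classic Ethernet, IPv4)
    print(mss_for_mtu(1500, ipv6=True))  # 1440
    print(mss_for_mtu(1492))             # 1452 (PPPoE)
    print(effective_mss(1500, 1380))     # 1380 (peer behind a 1420-byte tunnel)
```

The mss:1448 in the ss output earlier is this 1460 minus 12 bytes for the TCP timestamp option carried in every segment.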

bash
# Inspect MTU per interface (field 2 is the name, field 5 the MTU)
ip -o link | awk '{print $2, $5}'

Output:

text
lo: 65536
eth0: 1500
wg0: 1420

The WireGuard interface (wg0) has MTU 1420 to leave room for the tunnel's encapsulation and encryption overhead.

bash
# Discover the path MTU to a destination
tracepath -n 8.8.8.8 | head

Output:

text
 1?: [LOCALHOST]                      pmtu 1500
 1:  10.0.0.1                                          1.234ms
 2:  100.64.0.1                                       12.345ms pmtu 1492
 3:  192.0.2.1                                        14.876ms
 4:  8.8.8.8                                          21.123ms reached
     Resume: pmtu 1492 hops 4 back 4

PMTU dropped from 1500 to 1492 at hop 2 — a PPPoE link.

[!WARN] If a firewall blocks the ICMP "Fragmentation Needed" messages used for Path MTU Discovery, you get a black hole: small packets work, big ones silently disappear, your TLS handshake hangs at the certificate. Symptoms are SSH that lets you log in but stalls on ls; curl that hangs after the headers. Lower the MSS with iptables -t mangle -A POSTROUTING -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --clamp-mss-to-pmtu.

IP addressing and routing

IPv4 addresses are 32 bits, written as four dotted-decimal octets (10.0.0.10). IPv6 is 128 bits, written as eight colon-separated hex groups (2001:db8::1). A netmask (or prefix length) partitions the address into network and host parts: 10.0.0.10/24 means the first 24 bits are the network (10.0.0.0/24).
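Python's standard ipaddress module does the network/host split for you, which is handy for checking prefix arithmetic. A small sketch; split_interface is a name invented for this example.

```python
import ipaddress

def split_interface(cidr):
    """Return (network, netmask, number of addresses) for an address/prefix."""
    iface = ipaddress.ip_interface(cidr)
    return str(iface.network), str(iface.netmask), iface.network.num_addresses

if __name__ == "__main__":
    print(split_interface("10.0.0.10/24"))
    # ('10.0.0.0/24', '255.255.255.0', 256)

    net = ipaddress.ip_network("10.0.0.0/24")
    print(ipaddress.ip_address("10.0.0.200") in net)   # True
    print(ipaddress.ip_address("10.0.1.5") in net)     # False

    # IPv6 shorthand expands to eight 16-bit groups.
    print(ipaddress.ip_address("2001:db8::1").exploded)
    # 2001:0db8:0000:0000:0000:0000:0000:0001
```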

Routing decides which interface a packet leaves on. The kernel maintains a routing table; each entry maps a destination prefix to a next hop (gateway IP and outgoing interface). The longest-match prefix wins.

bash
ip route
ip route get 8.8.8.8

Output:

text
default via 10.0.0.1 dev eth0 proto dhcp src 10.0.0.10 metric 100
10.0.0.0/24 dev eth0 proto kernel scope link src 10.0.0.10
192.168.10.0/24 via 10.0.0.254 dev eth0

8.8.8.8 via 10.0.0.1 dev eth0 src 10.0.0.10 uid 1000

The default route (0.0.0.0/0) catches everything not in a more specific entry; for 8.8.8.8 that's the gateway 10.0.0.1 on eth0.
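Longest-prefix matching over the three routes shown above can be sketched as follows. This is only an illustration of the selection rule: the kernel uses a trie and also weighs metrics, and the ROUTES list and lookup function are names assumed for this example.

```python
import ipaddress

# (prefix, gateway or None for directly connected, device), from the table above
ROUTES = [
    ("0.0.0.0/0", "10.0.0.1", "eth0"),
    ("10.0.0.0/24", None, "eth0"),
    ("192.168.10.0/24", "10.0.0.254", "eth0"),
]

def lookup(dest, routes=ROUTES):
    """Return the (prefix, gateway, device) whose prefix is the longest match."""
    addr = ipaddress.ip_address(dest)
    matches = []
    for prefix, gateway, device in routes:
        network = ipaddress.ip_network(prefix)
        if addr in network:
            matches.append((network.prefixlen, prefix, gateway, device))
    if not matches:
        return None                       # no default route: unreachable
    _length, prefix, gateway, device = max(matches, key=lambda m: m[0])
    return prefix, gateway, device

if __name__ == "__main__":
    print(lookup("8.8.8.8"))        # ('0.0.0.0/0', '10.0.0.1', 'eth0')
    print(lookup("10.0.0.77"))      # ('10.0.0.0/24', None, 'eth0')
    print(lookup("192.168.10.9"))   # ('192.168.10.0/24', '10.0.0.254', 'eth0')
```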

See the ip cheatsheet for the full iproute2 toolkit.

Special address ranges

Knowing the ranges lets you read a routing table at a glance.

| Range | Purpose |
| --- | --- |
| 127.0.0.0/8 | Loopback |
| 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16 | RFC 1918 private IPv4 |
| 169.254.0.0/16 | Link-local (auto-assigned when DHCP fails) |
| 224.0.0.0/4 | Multicast |
| 100.64.0.0/10 | Carrier-grade NAT (CGNAT) |
| 0.0.0.0/0 | Default route ("everywhere else") |
| fe80::/10 | IPv6 link-local |
| fc00::/7 | IPv6 unique local |
| 2000::/3 | IPv6 global unicast |
| ff00::/8 | IPv6 multicast |
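Most of these ranges are recognised by Python's ipaddress module, so a quick classifier is a few lines. The classify function and its labels are hypothetical; the CGNAT range is checked explicitly because the module does not count it as private.

```python
import ipaddress

CGNAT = ipaddress.ip_network("100.64.0.0/10")

def classify(text):
    """Return a coarse label for an IPv4 or IPv6 address."""
    ip = ipaddress.ip_address(text)
    if ip.is_loopback:
        return "loopback"
    if ip.is_link_local:
        return "link-local"
    if ip.is_multicast:
        return "multicast"
    if ip.version == 4 and ip in CGNAT:
        return "cgnat"
    if ip.is_private:
        return "private"      # RFC 1918 for IPv4, fc00::/7 for IPv6
    return "other"

if __name__ == "__main__":
    for sample in ("127.0.0.1", "192.168.1.1", "169.254.10.1",
                   "100.64.0.1", "fe80::1", "fd00::1", "8.8.8.8"):
        print(sample, "->", classify(sample))
```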

NAT — translating between address spaces

Network Address Translation rewrites IP addresses and ports as packets cross a boundary. The home-router case is Source NAT (SNAT) or Masquerade: outbound packets get the router's WAN IP swapped in for the LAN address, and a port-mapping table remembers the mapping so replies are translated back. SNAT lets many devices share one public IP — the foundation of the modern IPv4 internet.
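The port-mapping table at the heart of SNAT can be modelled with two dictionaries. This is a deliberately simplified sketch: the SourceNat class is invented for illustration, and a real implementation also keys on protocol and remote endpoint, expires idle entries, and tries to preserve the original source port.

```python
import itertools

class SourceNat:
    """Toy model of a masquerading router's translation table."""

    def __init__(self, wan_ip, first_port=40000):
        self.wan_ip = wan_ip
        self._next_port = itertools.count(first_port)
        self._out = {}    # (lan_ip, lan_port) -> wan_port
        self._back = {}   # wan_port -> (lan_ip, lan_port)

    def outbound(self, lan_ip, lan_port):
        """Translate a LAN source to (wan_ip, wan_port), reusing known mappings."""
        key = (lan_ip, lan_port)
        if key not in self._out:
            wan_port = next(self._next_port)
            self._out[key] = wan_port
            self._back[wan_port] = key
        return self.wan_ip, self._out[key]

    def inbound(self, wan_port):
        """Map a reply's destination port back to the LAN host, or None."""
        return self._back.get(wan_port)

if __name__ == "__main__":
    nat = SourceNat("198.51.100.7")
    print(nat.outbound("192.168.1.10", 51000))   # ('198.51.100.7', 40000)
    print(nat.outbound("192.168.1.11", 51000))   # ('198.51.100.7', 40001)
    print(nat.inbound(40001))                    # ('192.168.1.11', 51000)
    print(nat.inbound(12345))                    # None: unsolicited, no mapping
```

The last line is why NAT is often mistaken for a firewall: inbound traffic without a mapping has nowhere to go, but that is a side effect rather than a policy.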

bash
# Inspect the NAT table on Linux (nftables)
sudo nft list table ip nat

Output (truncated):

text
table ip nat {
    chain POSTROUTING {
        type nat hook postrouting priority srcnat; policy accept;
        oifname "eth0" masquerade
    }
    chain PREROUTING {
        type nat hook prerouting priority dstnat; policy accept;
    }
}

Port forwarding (DNAT)

Inbound NAT — also known as Destination NAT or port forwarding — exposes a port on the public IP to a service on the LAN. The router rewrites the destination address before routing.

bash
# nftables: forward TCP/80 from the WAN IP to 192.168.1.50:8080
sudo nft add rule ip nat PREROUTING tcp dport 80 dnat to 192.168.1.50:8080

Output: (none — exits 0 on success)

bash
# iptables (legacy): same idea
sudo iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j DNAT --to-destination 192.168.1.50:8080

Output: (none — exits 0 on success)

NAT is a routing trick, not a security feature. Many people treat "behind NAT" as a firewall — it isn't. Run a real firewall (nftables, ufw, or your router's) and don't rely on NAT to keep services hidden.

DNS resolution flow

When your code calls getaddrinfo("example.com", "443", ...), a small cascade of lookups runs before any packet hits the wire. Each step has its own cache and failure modes; the earlier a step can answer, the faster the resolution.

  1. Application calls getaddrinfo() (POSIX) or equivalent (InetAddress.getByName in Java).
  2. libc consults /etc/nsswitch.conf, typically hosts: files dns (try local files first, then DNS).
  3. /etc/hosts is checked. If example.com is there, return immediately. No network traffic.
  4. /etc/resolv.conf is read to find the nameserver to query. On systemd-based distributions this is usually the local stub resolver at 127.0.0.53 (systemd-resolved).
  5. Stub resolver checks its cache. If hit, return; otherwise forward to an upstream resolver (1.1.1.1, 8.8.8.8, your ISP, your local DNS server).
  6. Upstream resolver queries the root → TLD → authoritative servers, possibly using its own cache, and returns the answer.
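From Python, the whole cascade above sits behind one call, socket.getaddrinfo(), which goes through the same libc path. A minimal sketch; the resolve wrapper is a hypothetical helper, and the demo uses an IP literal so it runs without touching the network.

```python
import socket

def resolve(host, port):
    """Return [(family label, address), ...] the way a TCP client would see them."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    results = []
    for family, _socktype, _proto, _canonname, sockaddr in infos:
        label = "IPv6" if family == socket.AF_INET6 else "IPv4"
        results.append((label, sockaddr[0]))   # sockaddr[0] is the address string
    return results

if __name__ == "__main__":
    # An IP literal short-circuits every lookup step; swap in a hostname
    # to exercise nsswitch, /etc/hosts, and DNS.
    print(resolve("127.0.0.1", 443))
```

Passing type=SOCK_STREAM keeps the result to one entry per address instead of one per socket type.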
bash
# Trace the full flow for one name
getent hosts example.com               # libc-level lookup, uses nsswitch
resolvectl query example.com           # systemd-resolved stub
dig +trace example.com                 # iterative trace from the root (bypasses libc and the stub)

Output (getent hosts):

text
93.184.216.34   example.com

Output (dig +trace excerpt):

text
.                       518400 IN NS  a.root-servers.net.
;; Received 239 bytes from 1.1.1.1#53(1.1.1.1) in 8 ms

com.                    172800 IN NS  a.gtld-servers.net.
;; Received 1180 bytes from 198.41.0.4#53(a.root-servers.net) in 24 ms

example.com.            172800 IN NS  a.iana-servers.net.
;; Received 717 bytes from 192.5.6.30#53(a.gtld-servers.net) in 60 ms

example.com.            86400  IN A   93.184.216.34
;; Received 56 bytes from 199.43.135.53#53(a.iana-servers.net) in 12 ms

Common DNS record types

| Type | Stores |
| --- | --- |
| A | IPv4 address |
| AAAA | IPv6 address |
| CNAME | Alias to another name |
| MX | Mail exchanger |
| TXT | Arbitrary text (SPF, DKIM, verification) |
| NS | Authoritative nameserver |
| SOA | Start of authority: zone metadata, serial |
| PTR | Reverse DNS (IP → name) |
| SRV | Service location (_https._tcp.example.com) |
| CAA | Which CAs may issue certs |
| HTTPS / SVCB | HTTPS service binding (alt-svc replacement) |
| DS / DNSKEY / RRSIG | DNSSEC chain |

bash
# Look up specific record types
dig example.com A +short
dig example.com AAAA +short
dig example.com MX +short
dig example.com TXT +short
dig +noall +answer example.com any   # many servers give only a minimal answer to ANY (RFC 8482)

Output:

text
93.184.216.34

2606:2800:220:1:248:1893:25c8:1946

0 .

"v=spf1 -all"
"This domain is for use in examples..."
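On the wire, the record type is just a 16-bit number in the question section. The sketch below builds a bare DNS query by hand to show where it goes; build_query and the QTYPES table are names made up for this example (the numeric codes themselves are the standard ones), and no EDNS0 option is added.

```python
import struct

# Standard type codes for some of the records in the table above.
QTYPES = {"A": 1, "NS": 2, "CNAME": 5, "SOA": 6, "PTR": 12, "MX": 15,
          "TXT": 16, "AAAA": 28, "SRV": 33, "CAA": 257}

def build_query(name, qtype="A", query_id=12345):
    """Return the raw bytes of a one-question DNS query with RD set."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", QTYPES[qtype], 1)   # class 1 = IN
    return header + question

if __name__ == "__main__":
    packet = build_query("example.com", "A")
    print(len(packet), packet.hex())
```

For example.com the result is 29 bytes (12 header + 13 name + 4 type/class), which matches the (29) shown on the DNS query line of the tcpdump output in the TCP vs UDP section.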

Resolver configuration

/etc/resolv.conf is the legacy text file; on systemd systems it's a symlink to /run/systemd/resolve/stub-resolv.conf and the real configuration is in /etc/systemd/resolved.conf or per-link via NetworkManager.

bash
cat /etc/resolv.conf
resolvectl status | head

Output:

text
# This file is managed by man:systemd-resolved(8).
nameserver 127.0.0.53
options edns0 trust-ad
search local

Global
       Protocols: LLMNR=resolve -mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

[!WARN] nslookup and host query DNS directly and bypass /etc/hosts and other NSS sources — they're misleading for "what does my application see?". Use getent hosts NAME to mirror libc's getaddrinfo path, or resolvectl query on systemd boxes.

Common pitfalls

  1. Confusing TCP and UDP ports — they're separate namespaces. nc -lvnp 53 (TCP) and a UDP listener can coexist; firewall rules must list the protocol.
  2. CLOSE-WAIT accumulation — your application is the receiver of a FIN it hasn't acted on. Find and fix the leak; it will eventually exhaust FDs.
  3. TIME-WAIT is not a bug — it's the kernel ensuring lingering segments expire before the (IP, port) pair is reused. Tuning tcp_tw_reuse is rarely the right answer; fixing connection churn (use keep-alive) is.
  4. MTU black hole: when ICMP is filtered, oversized packets are dropped and the sender never hears about it. Symptoms are stalls during the TLS handshake or after large query results. Clamp the MSS on the gateway.
  5. DNS caching tricks you — recently changed records may take TTL seconds to propagate through caches. dig +trace or query an authoritative server directly to bypass intermediates.
  6. NAT looks like a firewall but isn't — outbound is unrestricted; inbound is only blocked because no port mapping exists. Add a real firewall.
  7. tcpdump shows oversized segments because of offloading: modern NICs do TSO/GRO/LRO, so tcpdump sees segments far larger than the MTU. Run ethtool -K eth0 tso off gso off gro off lro off for a clean trace.
  8. DNS over TCP forgotten: replies that do not fit in a UDP datagram (512 bytes without EDNS0; DNSSEC answers are often bigger) are retried over TCP. Firewalls that only allow UDP/53 break on big answers.
  9. localhost may resolve to IPv6 ::1 before IPv4 127.0.0.1: some apps misbehave when only one is reachable. Bind explicitly to both 127.0.0.1 and ::1, or use dual-stack ::.
  10. /etc/hosts entry overrides DNS silently — fine for development but easy to forget. getent hosts shows the truth.
  11. Connection refused vs connection timeout — refused = the host answered with RST (no service listening or firewall sent reject); timeout = no response (host down, firewall dropped). They mean different things; the action is different.
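The refused-versus-timeout distinction in the last pitfall maps directly onto two different exceptions in Python. A small probe sketch; the probe function and its return strings are assumptions of this example, not a standard API.

```python
import socket

def probe(host, port, timeout=2.0):
    """Classify a TCP connect attempt as open, refused, timeout, or other error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"     # an RST came back: host is up, nothing listening
    except socket.timeout:
        return "timeout"     # silence: host down or a firewall dropped the SYN
    except OSError as exc:
        return f"error: {exc}"   # e.g. no route to host, name not resolved

if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    print(probe("127.0.0.1", port))   # listener is up
    listener.close()
    print(probe("127.0.0.1", port))   # same port, nothing listening now
```

The order of the except clauses matters: both specific exceptions are subclasses of OSError, so the generic handler has to come last.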

Real-world recipes

Curl a host through a specific interface

For multi-homed boxes where you need a specific egress.

bash
curl --interface 10.0.0.10 https://example.com
ip route get 93.184.216.34

Output:

text
93.184.216.34 via 10.0.0.1 dev eth0 src 10.0.0.10

Watch a TCP handshake live

A complete handshake from one terminal while a connection runs from another.

bash
# Terminal 1
sudo tcpdump -nni any -c 6 'host example.com and tcp'

# Terminal 2
curl -s https://example.com/ > /dev/null

Output:

text
09:14:02.001 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [S], seq 12345
09:14:02.022 IP 93.184.216.34.443 > 10.0.0.10.49182: Flags [S.], seq 98765, ack 12346
09:14:02.022 IP 10.0.0.10.49182 > 93.184.216.34.443: Flags [.], ack 1

Pinpoint a black-hole MTU

tracepath reports per-hop PMTU; if traffic stalls but ping works, this is the test.

bash
tracepath -n example.com | head

Output:

text
 1:  10.0.0.10                                              0.123ms pmtu 1500
 1:  10.0.0.1                                               1.234ms
 2:  100.64.0.1                                            12.345ms pmtu 1480
 3:  93.184.216.34                                         21.123ms reached
     Resume: pmtu 1480 hops 3 back 3

Resolve a name with DNS-over-TLS

Bypass the local resolver and ask Cloudflare directly with DoT (encrypted DNS).

bash
kdig +tls @1.1.1.1 example.com

Output:

text
;; Question
;; example.com.        IN  A
;; Answer
example.com.   86400   IN  A  93.184.216.34

Audit which processes hold every TCP listening socket

For a security review or capacity check. ss -tlnp returns instantly even on busy hosts.

bash
sudo ss -tlnp | column -t

Output:

text
State   Recv-Q Send-Q Local-Address:Port  Peer-Address:Port  Process
LISTEN  0      128    0.0.0.0:22          0.0.0.0:*          users:(("sshd",pid=800,fd=3))
LISTEN  0      511    0.0.0.0:80          0.0.0.0:*          users:(("nginx",pid=1234,fd=6))
LISTEN  0      511    0.0.0.0:443         0.0.0.0:*          users:(("nginx",pid=1234,fd=8))
LISTEN  0      128    127.0.0.1:5432      0.0.0.0:*          users:(("postgres",pid=2100,fd=5))
LISTEN  0      4096   127.0.0.53:53       0.0.0.0:*          users:(("systemd-resolve",pid=611,fd=13))

Set up a port forward for testing

For exposing a service running on a workstation to a colleague over the LAN.

bash
# Quick & dirty with socat
socat TCP-LISTEN:8080,reuseaddr,fork TCP:localhost:3000

# Kernel-level with iptables (root); lasts until reboot or until the rule is flushed
sudo iptables -t nat -A PREROUTING -p tcp --dport 8080 -j REDIRECT --to-ports 3000

Output: (none — socat blocks; iptables exits 0 on success)

Diagnose a slow DNS resolver

Time each resolver in turn.

bash
for ns in 1.1.1.1 8.8.8.8 9.9.9.9 127.0.0.53; do
  echo -n "$ns -> "
  dig +short +time=2 +tries=1 @"$ns" example.com | head -1
  dig @"$ns" example.com +stats 2>&1 | awk '/Query time/{print $4, $5}'
done

Output:

text
1.1.1.1 -> 93.184.216.34
8 msec
8.8.8.8 -> 93.184.216.34
22 msec
9.9.9.9 -> 93.184.216.34
24 msec
127.0.0.53 -> 93.184.216.34
2 msec

Capture only the first 3 seconds of a TLS handshake

For a focused capture without filling disk.

bash
sudo timeout 3 tcpdump -nni any -w /tmp/tls.pcap 'tcp port 443'
ls -lh /tmp/tls.pcap

Output:

text
-rw-r--r-- 1 root root 412K May 25 09:14 /tmp/tls.pcap

Find which TCP connections are using the most retransmits

A flaky link or congested far end shows here.

bash
sudo ss -tin state established | grep -B1 'retrans:' | head   # -B1 keeps the address line above each match

Output:

text
ESTAB 0 0 10.0.0.10:443 203.0.113.42:49182
     cubic wscale:7,7 rto:204 rtt:32.1/4.8 retrans:0/12 cwnd:8 mss:1448

retrans:0/12 means 0 currently outstanding, 12 retransmits over the connection's life — high values indicate a lossy path.
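If you want to rank connections rather than eyeball them, the retrans field is easy to pull out with a regular expression. A minimal sketch that works on one info line of ss output; parse_retrans is a hypothetical helper name.

```python
import re

# Matches "retrans:<outstanding>/<total>" but not the separate "bytes_retrans:" field.
RETRANS = re.compile(r"\bretrans:(\d+)/(\d+)")

def parse_retrans(info_line):
    """Return (outstanding, lifetime_total) or None if the field is absent."""
    match = RETRANS.search(info_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

if __name__ == "__main__":
    line = "cubic wscale:7,7 rto:204 rtt:32.1/4.8 retrans:0/12 cwnd:8 mss:1448"
    print(parse_retrans(line))   # (0, 12)
```

ss omits the field entirely for connections that have never retransmitted, which is why the function returns None rather than (0, 0) in that case.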

Tips

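mtr (or mtr-tiny) combines traceroute and ping into a live, refreshing display of every hop's loss and latency. It's the right answer to "is the problem in my network or theirs?": loss that starts at hop N and persists through every later hop implicates hop N or the link leading to it, while loss at a single intermediate hop that vanishes further along is usually just ICMP rate limiting on that router.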

Prefer ss -tnp over netstat -tnp on Linux: it gives the same answer, is noticeably faster on hosts with many sockets, supports kernel-side filters, and netstat (net-tools) is deprecated there. The only reason to keep netstat in muscle memory is for old BSDs and macOS, where ss doesn't exist.

[!WARN] Capturing on a busy interface without a filter (tcpdump -i eth0 with no expression) can cost gigabytes per minute and drop packets. Always pass a BPF expression (port 443, host 10.0.0.1, etc.) and consider -s 96 to truncate payloads when you only need headers.