cheat sheet
awk / gawk
Pattern-action language for structured text. Field splitting, built-in variables, arithmetic, string functions, arrays, BEGIN/END blocks, and practical data-processing recipes.
awk / gawk — Text Processing
What it is
awk is a pattern-action text processing language that has been part of Unix since 1977, originally created by Aho, Weinberger, and Kernighan at Bell Labs; gawk is the GNU implementation and the default awk on many Linux distributions. It automatically splits each input line into fields, making it ideal for processing structured text like log files, CSV, and command output without writing a full script. Reach for awk when you need to filter, transform, or aggregate field-delimited text in a pipeline; for full programming logic or JSON, python or jq are better choices.
Syntax
An awk program is a series of pattern { action } pairs written as a single-quoted string on the command line or stored in a file passed with -f. Options like -F (field separator) and -v (variable assignment) come before the program.
awk [OPTIONS] 'PROGRAM' [FILE...]
awk [OPTIONS] -f script.awk [FILE...]
awk -v VAR=value 'PROGRAM' [FILE...]
Output: (none; the lines above are usage synopses. awk exits 0 on success)
Within a rule, either the pattern or the action may be omitted:
- No pattern → action runs on every line
- No action → default is { print } (prints the matching line)
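A minimal sketch of the two defaults. The three-line input is made up for illustration and fed in through printf instead of a file:

```shell
# Hypothetical input: a name and a number per line.
input=$(printf 'alpha 1\nbeta 2\ngamma 3\n')

# Pattern only: the default action { print } echoes each matching line.
pattern_only=$(printf '%s\n' "$input" | awk '$2 >= 2')

# Action only: with no pattern, the action runs on every line.
action_only=$(printf '%s\n' "$input" | awk '{ print $1 }')

printf '%s\n' "$pattern_only" "$action_only"
```

The first command prints the beta and gamma lines unchanged; the second prints the first field of all three lines.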
Built-in variables
| Variable | Meaning |
|---|---|
| $0 | Entire current record (line) |
| $1 … $NF | Fields 1 through NF |
| NF | Number of fields in current record |
| NR | Total records read so far |
| FNR | Record number within current file |
| FS | Input field separator (default: whitespace) |
| OFS | Output field separator (default: space) |
| RS | Input record separator (default: \n) |
| ORS | Output record separator (default: \n) |
| FILENAME | Current input file name |
| ARGC / ARGV | Argument count / array |
BEGIN and END blocks
BEGIN runs once before any input is read — the right place to initialize variables or set FS/OFS. END runs once after the last record, making it ideal for printing totals or summaries. Neither block receives input records.
BEGIN { FS=","; OFS="\t" } # run before any input
{ print $2, $1 } # run per record
END { print "Total:", NR } # run after all input
Given data.csv containing name,dept,salary rows:
Output:
Engineering Alice
Ops Bob
Finance Carol
Total: 3
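To make the three-part program above runnable end to end, here is a sketch that supplies its own input. The rows are a hypothetical stand-in for data.csv, without a header line:

```shell
# Hypothetical stand-in for data.csv (name,dept,salary).
csv=$(printf 'Alice,Engineering,75000\nBob,Ops,62000\nCarol,Finance,91000\n')

report=$(printf '%s\n' "$csv" | awk '
BEGIN { FS = ","; OFS = "\t" }   # set separators before any input is read
      { print $2, $1 }           # per record: dept then name, joined by OFS
END   { print "Total:", NR }     # NR still holds the final record count
')

printf '%s\n' "$report"
```

Because OFS is a tab, every comma in a print statement becomes a tab in the output, including the one between Total: and the count.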
Field separator
FS controls how each input record is split into fields ($1, $2, …). Set it with -F on the command line or by assigning FS in a BEGIN block. A single character other than space is used literally; any value longer than one character is treated as a regex. The default, a single space, is special: it splits on runs of whitespace and ignores leading and trailing whitespace.
awk -F: '{print $1}' /etc/passwd # colon-separated
awk -F'\t' '{print $3}' data.tsv # tab-separated
awk -F', *' '{print $2}' file # comma + optional spaces
awk 'BEGIN{FS="|"} {print $1}' pipe.txt # pipe character
awk -F'[,;]' '{print $1, $2}' file # regex separator
Output:
root
daemon
bin
sys
nobody
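The difference between the default separator, a regex separator, and a single literal character shows up in how empty fields are counted. A small sketch with made-up input lines:

```shell
line='  a  b   c  '   # hypothetical input with leading, trailing, and repeated spaces

# Default FS: runs of whitespace separate fields; leading and trailing blanks are ignored.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# Regex FS: a comma followed by optional spaces.
second=$(printf 'x,  y,z\n' | awk -F', *' '{ print $2 }')

# Single literal character: every occurrence delimits, so the empty field between "::" counts.
colon_nf=$(printf 'a::b\n' | awk -F: '{ print NF }')

printf '%s %s %s\n' "$default_nf" "$second" "$colon_nf"
```

This prints 3 y 3: three fields under the default, y as the second comma-separated field, and three fields for a::b because the empty middle field is kept.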
Patterns
A pattern is a condition that gates whether a rule's action runs on a given record. It can be a regex (/re/), a comparison expression ($3 > 100), or a range (/START/,/END/). Omitting the pattern means the action runs on every record; omitting the action defaults to { print }.
awk '/error/' log # print lines matching regex
awk '!/^#/' config # skip comment lines
awk 'NR==1' file # first line only
awk 'NR>=10 && NR<=20' file # lines 10–20
awk '$3 > 100' data # field comparison
awk '$1 ~ /^foo/' file # field matches regex
awk '/START/,/END/' file # range: START to END (inclusive)
Output:
2026-01-15 10:23:45 ERROR connection refused
2026-01-15 10:45:01 ERROR timeout waiting for response
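Range patterns and combined conditions are easiest to see on a tiny input. The six-line log below is invented for the sketch:

```shell
log=$(printf 'header\nSTART\none\ntwo\nEND\nfooter\n')   # hypothetical input

# Range pattern: prints from the START line through the END line, both included.
block=$(printf '%s\n' "$log" | awk '/START/,/END/')

# Combined conditions: lines 2 through 4, minus any line containing START.
middle=$(printf '%s\n' "$log" | awk 'NR >= 2 && NR <= 4 && !/START/')

printf '%s\n' "$block" "---" "$middle"
```

The range yields START, one, two, END; the combined condition yields only one and two.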
Printf and output
printf gives column-aligned, formatted output using C-style format strings — use it instead of print when you need fixed-width fields or numeric precision. Awk can also redirect output directly to files or pipe it into shell commands without leaving the awk process.
awk '{printf "%-20s %5d\n", $1, $2}' file # formatted output
awk '{print $2 > "out.txt"}' file # redirect to file
awk '{print $1 >> "append.txt"}' file # append
awk '{print | "sort -rn"}' file # pipe to command
Output:
Alice 75000
Bob 62000
Carol 91000
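A short sketch of how width, alignment, and precision interact in printf. The name/salary pairs are hypothetical, and the | characters are only there to make the padding visible:

```shell
table=$(printf 'Alice 75000\nBob 62000\n' |
  awk '{ printf "%-8s|%7d|%9.2f\n", $1, $2, $2 / 12 }')   # left-aligned name, right-aligned numbers

printf '%s\n' "$table"
```

%-8s pads the name on the right to 8 characters, %7d right-aligns the integer in 7 columns, and %9.2f rounds the monthly figure to two decimals in 9 columns. Unlike print, printf adds no newline unless the format contains \n.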
String functions
| Function | Description |
|---|---|
| length(s) | Length of string (or $0 if no arg) |
| substr(s, i, n) | Substring from index i (1-based), length n |
| index(s, t) | First position of t in s (0 = not found) |
| split(s, a, sep) | Split s into array a using sep; returns the number of pieces |
| sub(r, s, t) | Replace first regex r match in t with s (t defaults to $0) |
| gsub(r, s, t) | Replace all regex r matches in t with s (t defaults to $0) |
| match(s, r) | Sets RSTART, RLENGTH; returns position or 0 |
| sprintf(fmt, ...) | Format string (like printf, returns string) |
| tolower(s) | Lowercase |
| toupper(s) | Uppercase |
| gensub(r, s, h, t) | gawk only: returns the result instead of modifying t; \\1 groups, h="g" for global |
awk '{print toupper($1), length($0)}' file
awk '{gsub(/foo/, "bar"); print}' file # replace in $0
awk '{sub(/^[ \t]+/, ""); print}' file # ltrim
awk '{gsub(/[ \t]+$/, ""); print}' file # rtrim
awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH)}' file
Output:
ALICE 28
BOB 22
CAROL 30
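Several of these functions are often combined to pick apart a single string. The sketch below uses only POSIX functions on a made-up key=value line:

```shell
result=$(printf 'user=alice;id=42\n' | awk '{
  n = split($0, parts, ";")              # parts[1] is "user=alice", parts[2] is "id=42"
  eq = index(parts[1], "=")              # 1-based position of "=", here 5
  name = substr(parts[1], eq + 1)        # everything after the "="
  if (match($0, /[0-9]+/)) id = substr($0, RSTART, RLENGTH)   # first run of digits
  copy = $0
  gsub(/;/, " ", copy)                   # gsub edits its third argument in place
  print n, toupper(name), id, length(copy), copy
}')

printf '%s\n' "$result"
```

This prints 2 ALICE 42 16 user=alice id=42. Note that sub and gsub return the number of replacements, not the new string, which is why the result is read back from copy.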
Numeric functions
These are standard POSIX awk functions, available in gawk and other implementations alike: int(), sqrt(), sin(), cos(), atan2(), log(), exp(), and rand()/srand() for random numbers. Basic arithmetic operators (+, -, *, /, %, ^) and printf format specifiers handle most numeric output needs.
awk '{print int($1), sqrt($2), $3^2}' data
awk 'BEGIN{srand()} {print int(rand()*100)}' /dev/stdin
awk '{printf "%.2f\n", $1/$2}' nums
Output:
3 4 25
7 9 144
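print and printf format numbers differently: print shows integral results without a decimal part, while printf gives explicit control over precision. A sketch with one hypothetical input line:

```shell
nums=$(printf '3.7 16 5\n' | awk '{
  print int($1), sqrt($2), $3 ^ 2     # int() truncates; integral values print without decimals
  printf "%.2f\n", $1 / $3            # printf sets the precision explicitly
}')

printf '%s\n' "$nums"
```

The first line of output is 3 4 25 and the second is 0.74.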
Arrays
Awk arrays are associative (hash maps): the index can be any string or number, and entries spring into existence on first use. They are unordered — iterate with for (key in array) — and a single array can accumulate counts, sums, or mappings across all input records.
# Frequency count
awk '{count[$1]++} END {for (k in count) print k, count[k]}' file
# Associative array from CSV: id → name
awk -F, 'NR>1 {map[$1]=$2} END {for (id in map) print id, map[id]}' data.csv
# Delete element
awk '{delete seen[$1]; seen[$1]=$2}' file
# Array test
awk '$1 in seen {print "dup:", $1} {seen[$1]=1}' file
Output:
Engineering 3
Ops 1
Finance 2
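Since for (key in array) visits keys in an unspecified order, a common approach is to pipe the report through sort when stable output matters. A sketch with invented department names:

```shell
counts=$(printf 'Engineering\nOps\nEngineering\nFinance\nEngineering\nFinance\n' |
  awk '{ count[$1]++ }                               # entry is created at 0 on first use, then incremented
       END { for (k in count) print k, count[k] }' |
  sort)                                              # for-in order is unspecified, so sort the report

printf '%s\n' "$counts"
```

After sorting, the report reads Engineering 3, Finance 2, Ops 1, one per line.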
Multi-file processing
When multiple files are passed, NR counts records across all files while FNR resets to 1 for each new file. The classic two-file idiom FNR==NR { … ; next } processes the first file into memory, then uses that data while reading the second.
# FNR vs NR
awk 'FNR==1 {print "--- File:", FILENAME}' file1 file2
# Keep lines of data.txt whose first field is listed in list.txt
awk 'FNR==NR {ids[$1]=1; next} $1 in ids' list.txt data.txt
Output:
--- File: file1
--- File: file2
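The two-file idiom can be tried without existing data. The sketch below writes two small hypothetical files into a scratch directory (list.txt holds keys, data.txt holds records) and removes them afterwards:

```shell
tmp=$(mktemp -d)                                   # scratch directory for the sample files
printf 'u2\nu3\n' > "$tmp/list.txt"                # keys to keep
printf 'u1 alice\nu2 bob\nu3 carol\n' > "$tmp/data.txt"

# While FNR == NR awk is still in the first file: remember each key, then skip to the next record.
# In the second file the bare pattern prints lines whose first field was remembered.
matched=$(awk 'FNR == NR { ids[$1] = 1; next } $1 in ids' "$tmp/list.txt" "$tmp/data.txt")

rm -rf "$tmp"
printf '%s\n' "$matched"
```

Only the u2 and u3 records are printed. One caveat: if the first file is empty, FNR == NR also holds for the second file, so the idiom silently treats the data file as the key list.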
Practical recipes
# Sum a column
awk '{sum+=$3} END {print sum}' data.txt
# Average
awk '{sum+=$1; n++} END {print sum/n}' numbers.txt
# Print columns in different order
awk '{print $3, $1, $2}' file.txt
# Skip header, process rest
awk 'NR>1 {print $2, $4}' report.csv
# Print unique lines in first-seen order (no sorting, unlike sort | uniq)
awk '!seen[$0]++' file.txt
# Print each repeated line once, at its second occurrence
awk 'seen[$0]++ == 1' file.txt
# Concatenate lines every N records
awk 'ORS= (NR%3 ? " " : "\n")' file # join every 3 lines
# Extract value from key=value
awk -F= '/^timeout/{print $2}' config.ini
# Column-align a colon file
awk -F: '{printf "%-15s %-10s %s\n", $1,$3,$7}' /etc/passwd
# Top N by field
awk '{print $5, $0}' access.log | sort -rn | head -10 | cut -d' ' -f2-
# Running total
awk '{running+=$1; print running, $0}' ledger.txt
# Filter by date field (YYYY-MM-DD in $2)
awk '$2 >= "2025-01-01" && $2 <= "2025-03-31"' events.log
# Accumulate by group, then report
awk -F, '{bytes[$1]+=$3} END {
for (h in bytes) printf "%s\t%.1f MB\n", h, bytes[h]/1048576
}' access.csv | sort -k2 -rn
# Transpose rows to columns
awk '{for(i=1;i<=NF;i++) col[i]=col[i] (NR>1?"\t":"") $i}
END {for(i=1;i<=NF;i++) print col[i]}' matrix.txt
Output:
228000
76000
75000 Alice Engineering
62000 Bob Ops
Engineering Alice
Ops Bob
10 2026-01-15 10:23:45 ERROR connection refused
10 10.00 MB
webserver01 15.2 MB
appserver02 8.7 MB
dbserver03 3.1 MB
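The sum and average recipes divide by a counter that stays empty when there is no input. A defensive variant, sketched here with a hypothetical salary column, guards the division and formats the result with printf:

```shell
stats=$(printf '75000\n62000\n91000\n' |
  awk '{ sum += $1; n++ }
       END { if (n > 0) printf "%d %.2f\n", sum, sum / n }')   # guard avoids division by zero

# Same program on empty input: the guard suppresses all output.
empty=$(printf '' | awk '{ sum += $1; n++ } END { if (n > 0) print sum / n }')

printf '%s\n' "$stats"
```

This prints 228000 76000.00, and the empty-input run prints nothing instead of failing with a division-by-zero error.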
Multiline records
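Setting RS="" switches awk into paragraph mode: one or more blank lines delimit records, and a newline always acts as a field separator in addition to FS (set FS="\n" to get exactly one field per line). For single-line CSV whose quoted fields contain commas, gawk's FPAT variable describes what a field looks like rather than what separates fields. FPAT does not cope with newlines inside quoted fields, because records are still split on RS.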
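As a minimal, portable sketch of paragraph mode, the made-up input below holds two records separated by two blank lines:

```shell
paras=$(printf 'name: alice\nrole: dev\n\n\nname: bob\nrole: ops\n' |
  awk 'BEGIN { RS = ""; FS = "\n" }                 # blank lines end a record; each line is a field
       { print NR ": " $1 " / " NF " fields" }')

printf '%s\n' "$paras"
```

Each paragraph arrives as one record with two fields, and the run of blank lines counts as a single separator, so the output is 1: name: alice / 2 fields followed by 2: name: bob / 2 fields.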
# Blank-line-separated records (like paragraphs)
awk 'BEGIN{RS=""; FS="\n"} /keyword/{print $1}' file
# CSV with quoted fields containing commas (gawk only; the quotes stay in the field)
gawk 'BEGIN{FPAT="([^,]*)|(\"[^\"]+\")"} {print $2}' data.csv
Output:
First line of matching paragraph
"Engineering Department, North"
Built-in CSV (gawk 5.3+)
Gawk 5.3.0 (Nov 2023) added native CSV parsing via the --csv option, mirroring the same feature in BWK awk ("The One True Awk"). It handles quoted fields, embedded commas, doubled "" quotes, and CRLF line endings, so standard RFC 4180 data no longer needs a hand-rolled FPAT. The mode forces FS="," and disables backslash-escape processing inside fields. The enclosing quotes are removed from field values and output is not re-quoted automatically, so BEGIN{OFS=","} only round-trips data whose fields contain no commas, quotes, or newlines.
# Parse a CSV with quoted commas and embedded quotes (gawk 5.3+)
gawk --csv '{print $1, $3}' data.csv
# Convert CSV to TSV
gawk --csv 'BEGIN{OFS="\t"} {$1=$1; print}' data.csv > data.tsv
# Skip header, sum a numeric column
gawk --csv 'NR>1 {sum+=$4} END {print sum}' sales.csv
Output:
Alice Engineering, North
Bob Ops, West
Carol Finance, HQ
Check with gawk --version: --csv requires gawk 5.3.0 or later. On older systems, fall back to the FPAT recipe above or use a dedicated tool like qsv.
Unicode escapes (gawk 5.3+)
Gawk 5.3.0 also introduced the \u escape sequence for inserting Unicode code points by hex value (1–8 digits), encoded as UTF-8 in the current locale. This makes it easier to emit non-ASCII symbols, box-drawing characters, and emoji from awk programs without literal multibyte bytes in source.
# Print a checkmark and warning symbol (gawk 5.3+)
gawk 'BEGIN{print "\u2713 ok"; print "\u26A0 warn"}'
# Box-drawing borders
gawk 'BEGIN{print "\u250C\u2500\u2500\u2510"; print "\u2514\u2500\u2500\u2518"}'
Output:
✓ ok
⚠ warn
┌──┐
└──┘
Modern alternatives
If you outgrow standard awk for performance or modern I/O, two actively used reimplementations expand the design space. frawk is a Rust-based JIT-compiled awk-alike with statically inferred types and built-in CSV/TSV support — typically several times faster than gawk on large files. goawk is a POSIX-compliant Go implementation with CSV mode (-i csv / -o csv), useful when you want a single static binary or are embedding awk in a Go program. Both accept most awk programs unchanged but differ in extensions and edge cases.
# frawk: same syntax, often faster on big files
frawk -F, '{sum+=$3} END {print sum}' huge.csv
# goawk: CSV input/output modes
goawk -i csv -o csv 'NR>1 {print $1, $4}' data.csv
Output: (none — performance/portability differences only)
One-liners reference
awk 'END{print NR}' file # count lines (wc -l)
awk '{print NF}' file # print field count per line
awk 'NF' file # remove blank lines
awk 'length>72' file # lines longer than 72 chars
awk '{$1=$1; print}' file # collapse whitespace, trim
awk '{print $NF}' file # print last field
awk '{print $(NF-1)}' file # print second-to-last field
awk 'NR%2==0' file # print even-numbered lines
awk 'NR==FNR{a[$0];next} $0 in a' f1 f2 # intersection of two files
awk 'NR==FNR{a[$0];next} !($0 in a)' f1 f2 # lines in f2 not in f1
Output:
42
3
4
5
The quick brown fox jumps over the lazy dog — this line exceeds seventy-two characters
/bin/bash
/bin/sh
line2
line4
alice
carol
bob
dave
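The $1=$1 one-liner works because assigning to any field makes awk rebuild $0 from its fields joined by OFS. A small sketch with a made-up input line, with OFS set to a dash so the effect is visible:

```shell
# Assigning to a field forces $0 to be rebuilt with OFS between fields.
rebuilt=$(printf '  a   b  c \n' | awk -v OFS='-' '{ $1 = $1; print }')

# Without an assignment, $0 is printed exactly as read and OFS is never applied.
untouched=$(printf 'a b c\n' | awk -v OFS='-' '{ print }')

printf '%s\n%s\n' "$rebuilt" "$untouched"
```

The first command prints a-b-c and the second prints a b c. This is also why changing OFS alone has no effect on print with no arguments.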
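gawk (GNU awk) extends POSIX awk with gensub(), FPAT for CSV, nextfile, co-processes (|&), and more. On many Linux systems awk is already gawk, but Debian and Ubuntu default to mawk and Alpine uses BusyBox awk; check with awk --version (gawk identifies itself as GNU Awk).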
Sources
- Gawk 5.3.0 released (LWN.net): CSV and \u Unicode escape additions.
- Gawk 5.3.2 announcement (info-gnu, April 2025): latest stable bug-fix release.
- The GNU Awk User's Guide: canonical reference for gawk extensions.
- frawk on GitHub: Rust-based JIT awk alternative with CSV support.