
awk / gawk

Pattern-action language for structured text. Field splitting, built-in variables, arithmetic, string functions, arrays, BEGIN/END blocks, and practical data-processing recipes.

awk / gawk — Text Processing

What it is

awk is a pattern-action text processing language that has been part of Unix since 1977, created by Aho, Weinberger, and Kernighan at Bell Labs. gawk is the GNU implementation and the default awk on many Linux distributions; other systems ship mawk, BusyBox awk, or BWK awk. awk automatically splits each input line into fields, which makes it well suited to structured text such as log files, simple CSV, and command output without writing a full script. Reach for awk when you need to filter, transform, or aggregate field-delimited text in a pipeline; for larger programs or JSON, python or jq are better choices.

Syntax

An awk program is a series of pattern { action } pairs written as a single-quoted string on the command line or stored in a file passed with -f. Options like -F (field separator) and -v (variable assignment) come before the program.

bash
awk [OPTIONS] 'PROGRAM' [FILE...]
awk [OPTIONS] -f script.awk [FILE...]
awk -v VAR=value 'PROGRAM' [FILE...]

Output: (none — exits 0 on success)

A program is a series of pattern { action } rules. Either part may be omitted, but not both:

  • No pattern → action runs on every line
  • No action → default is { print } (prints the matching line)
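
As a minimal sketch of the two shorthand forms, the snippet below runs a pattern-only rule and an action-only rule over the same input. The three sample lines and the shell variable names are made up for illustration.

```shell
# Hypothetical three-line sample; names are illustrative only.
sample=$(printf '%s\n' 'alpha 1' 'beta 22' 'gamma 333')

# Pattern only: the default action { print } emits each matching line.
pattern_only=$(printf '%s\n' "$sample" | awk '$2 > 10')

# Action only: with no pattern, the action runs on every line.
action_only=$(printf '%s\n' "$sample" | awk '{ print $1 }')

printf '%s\n' "$pattern_only" "$action_only"
```

The first command prints the beta and gamma lines; the second prints the first field of all three lines.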

Built-in variables

Variable       Meaning
$0             Entire current record (line)
$1 … $NF       Fields 1 through NF
NF             Number of fields in current record
NR             Total records read so far
FNR            Record number within current file
FS             Input field separator (default: whitespace)
OFS            Output field separator (default: space)
RS             Input record separator (default: \n)
ORS            Output record separator (default: \n)
FILENAME       Current input file name
ARGC / ARGV    Argument count / array
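
A small sketch of how a few of these variables behave on two made-up records (the sample data and shell variable names are assumptions for illustration):

```shell
# Two hypothetical records with different field counts.
records=$(printf '%s\n' 'a b c' 'd e')

# NR is the record number, NF the field count, $NF the last field.
vars_out=$(printf '%s\n' "$records" | awk '{ print NR, NF, $NF }')

# Assigning to a field rebuilds $0 using OFS as the joiner.
ofs_out=$(printf '%s\n' "$records" | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')

printf '%s\n' "$vars_out" "$ofs_out"
```

The first command prints "1 3 c" and "2 2 e"; the second prints "a-b-c" and "d-e".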

BEGIN and END blocks

BEGIN runs once before any input is read — the right place to initialize variables or set FS/OFS. END runs once after the last record, making it ideal for printing totals or summaries. Neither block receives input records.

awk
BEGIN { FS=","; OFS="\t" }   # run before any input
{ print $2, $1 }              # run per record
END   { print "Total:", NR }  # run after all input

Given data.csv containing name,dept,salary rows:

Output:

text
Engineering	Alice
Ops	Bob
Finance	Carol
Total:	3

Field separator

FS controls how each input record is split into fields ($1, $2, …). Set it with -F on the command line or by assigning FS in a BEGIN block. A single character other than a space is used literally; any longer value is treated as an extended regular expression. The default (a single space) splits on runs of blanks and ignores leading and trailing whitespace.

bash
awk -F: '{print $1}' /etc/passwd          # colon-separated
awk -F'\t' '{print $3}' data.tsv          # tab-separated
awk -F', *' '{print $2}' file             # comma + optional spaces
awk 'BEGIN{FS="|"} {print $1}' pipe.txt   # pipe character
awk -F'[,;]' '{print $1, $2}' file        # regex separator

Output:

text
root
daemon
bin
sys
nobody
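
The three splitting modes can be compared side by side. This is a sketch on made-up input strings, not output from a real file:

```shell
# Default FS: runs of blanks collapse and leading blanks are ignored.
default_fs=$(printf '  a   b\n' | awk '{ print NF, $1 }')

# Single-character FS: every comma delimits, so the empty field is kept.
single_fs=$(printf 'a,,b\n' | awk -F, '{ print NF }')

# Regex FS: a comma or semicolon followed by optional spaces.
regex_fs=$(printf 'a, b;c\n' | awk -F'[,;] *' '{ print $2 "|" $3 }')

printf '%s\n' "$default_fs" "$single_fs" "$regex_fs"
```

This prints "2 a", then "3", then "b|c". Note that the single-character separator keeps the empty field between the two commas, whereas the default separator never produces empty fields.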

Patterns

A pattern is a condition that gates whether a rule's action runs on a given record. It can be a regex (/re/), a comparison expression ($3 > 100), or a range (/START/,/END/). Omitting the pattern means the action runs on every record; omitting the action defaults to { print }.

bash
awk '/ERROR/'             log      # print lines matching regex (case-sensitive)
awk '!/^#/'              config    # skip comment lines
awk 'NR==1'              file      # first line only
awk 'NR>=10 && NR<=20'  file      # lines 10–20
awk '$3 > 100'           data      # field comparison
awk '$1 ~ /^foo/'        file      # field matches regex
awk '/START/,/END/'      file      # range: START to END (inclusive)

Output:

text
2026-01-15 10:23:45 ERROR connection refused
2026-01-15 10:45:01 ERROR timeout waiting for response
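
Range patterns include both boundary lines. The sketch below shows the inclusive range next to one common flag-based variant that drops the boundaries; the flag name f and the sample lines are illustrative assumptions.

```shell
# Hypothetical input with one START/END block.
block=$(printf '%s\n' 'skip' 'START' 'keep 1' 'END' 'skip again')

# Inclusive range: prints START, the body, and END.
range_out=$(printf '%s\n' "$block" | awk '/START/,/END/')

# Exclusive variant: clear the flag on END before testing it,
# and set it on START only after testing it.
inner_out=$(printf '%s\n' "$block" | awk '/END/ { f = 0 } f; /START/ { f = 1 }')

printf '%s\n' "$range_out" "$inner_out"
```

The range prints START, keep 1, and END; the flag variant prints only keep 1.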

Printf and output

printf gives column-aligned, formatted output using C-style format strings — use it instead of print when you need fixed-width fields or numeric precision. Awk can also redirect output directly to files or pipe it into shell commands without leaving the awk process.

bash
awk '{printf "%-20s %5d\n", $1, $2}' file   # formatted output
awk '{print $2 > "out.txt"}'         file   # redirect to file
awk '{print $1 >> "append.txt"}'     file   # append
awk '{print | "sort -rn"}'           file   # pipe to command

Output:

text
Alice                75000
Bob                  62000
Carol                91000
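
A short sketch of width and precision in printf, using made-up name/salary pairs. The pipe characters in the format string are only there to make the padding visible.

```shell
# %-8s left-aligns in 8 columns; %8.2f right-aligns with 2 decimals.
printf_out=$(printf '%s\n' 'Alice 75000' 'Bob 62000' |
    awk '{ printf "%-8s|%8.2f|\n", $1, $2 / 1000 }')

printf '%s\n' "$printf_out"
# Alice   |   75.00|
# Bob     |   62.00|
```

Unlike print, printf adds no newline of its own, so the format string has to end with \n.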

String functions

Function              Description
length(s)             Length of string (or $0 if no arg)
substr(s, i, n)       Substring from index i (1-based), length n
index(s, t)           First position of t in s (0 = not found)
split(s, a, sep)      Split s into array a using sep; returns the number of pieces
sub(r, s, t)          Replace first regex r match in t with s (t defaults to $0)
gsub(r, s, t)         Replace all regex r matches in t with s (t defaults to $0)
match(s, r)           Sets RSTART, RLENGTH; returns position or 0
sprintf(fmt, ...)     Format string (like printf, returns string)
tolower(s)            Lowercase
toupper(s)            Uppercase
gensub(r, s, h, t)    gawk only: returns a modified copy of t (t itself is unchanged); supports \1 group references (written "\\1" in a string literal); h="g" for global

bash
awk '{print toupper($1), length($0)}' file
awk '{gsub(/foo/, "bar"); print}'     file        # replace in $0
awk '{sub(/^[ \t]+/, ""); print}'     file        # ltrim
awk '{gsub(/[ \t]+$/, ""); print}'    file        # rtrim
awk 'match($0, /[0-9]+/) {print substr($0, RSTART, RLENGTH)}' file

Output:

text
ALICE 28
BOB 22
CAROL 30
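
Several of these functions are often combined. The following sketch splits a made-up key=value line by hand; the input string and the awk variable names are assumptions for illustration.

```shell
str_out=$(printf 'key=alpha-beta-gamma\n' | awk '{
    eq   = index($0, "=")               # position of the first "="
    key  = substr($0, 1, eq - 1)        # text before it
    rest = substr($0, eq + 1)           # text after it (n omitted: to end)
    n    = split(rest, part, "-")       # number of pieces
    hits = gsub(/a/, "A")               # edits $0, returns replacement count
    print key, n, part[2], hits, $0
}')

printf '%s\n' "$str_out"
# key 3 beta 5 key=AlphA-betA-gAmmA
```

sub() and gsub() modify their target in place and return the number of replacements, which is why hits holds 5 while $0 holds the edited line.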

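Numeric functions

Awk provides the standard math functions int(), sqrt(), sin(), cos(), atan2(), log(), exp(), and rand()/srand() for random numbers; all are POSIX, so they work in gawk, mawk, and BWK awk alike. Basic arithmetic operators (+, -, *, /, %, ^) and printf format specifiers handle most numeric output needs. print shows integral results without a decimal point and other values using OFMT (default "%.6g").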

bash
awk '{print int($1), sqrt($2), $3^2}' data
awk 'BEGIN{srand()} {print int(rand()*100)}' /dev/stdin
awk '{printf "%.2f\n", $1/$2}' nums

Output:

text
3 4 25
7 9 144
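
A quick sketch of how print and printf render numbers, run from a BEGIN block so no input file is needed:

```shell
# print shows integral values without decimals; printf controls precision.
num_out=$(awk 'BEGIN {
    print 7 / 2, int(7 / 2), 2 ^ 10, sqrt(16)
    printf "%.2f\n", 1 / 3
}')

printf '%s\n' "$num_out"
# 3.5 3 1024 4
# 0.33
```

Division is always floating point in awk, so int() is needed to truncate.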

Arrays

Awk arrays are associative (hash maps): the index can be any string or number, and entries spring into existence on first use. They are unordered — iterate with for (key in array) — and a single array can accumulate counts, sums, or mappings across all input records.

bash
# Frequency count
awk '{count[$1]++} END {for (k in count) print k, count[k]}' file

# Associative array from CSV: id → name
awk -F, 'NR>1 {map[$1]=$2} END {for (id in map) print id, map[id]}' data.csv

# Delete element
awk '{delete seen[$1]; seen[$1]=$2}' file

# Array test
awk '$1 in seen {print "dup:", $1} {seen[$1]=1}' file

Output:

text
Engineering 3
Ops 1
Finance 2
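
Because for (key in array) visits keys in an unspecified order, one simple way to get stable output is to pipe the result through sort. A sketch on made-up department names:

```shell
# Count occurrences of each word, then sort for a deterministic order.
counts=$(printf '%s\n' ops eng ops fin eng ops |
    awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' |
    sort)

printf '%s\n' "$counts"
# eng 2
# fin 1
# ops 3
```

Sorting outside awk keeps the program portable; gawk also has its own array-sorting extensions, which are not used here.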

Multi-file processing

When multiple files are passed, NR counts records across all files while FNR resets to 1 for each new file. The classic two-file idiom FNR==NR { … ; next } processes the first file into memory, then uses that data while reading the second.

bash
# FNR vs NR
awk 'FNR==1 {print "--- File:", FILENAME}' file1 file2

# Print lines of data.txt whose first field appears in list.txt
awk 'FNR==NR {ids[$1]=1; next} $1 in ids' list.txt data.txt

Output:

text
--- File: file1
--- File: file2
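
The two-file idiom can be tried end to end with throwaway files. In this sketch the file names, IDs, and names are all made up, and mktemp -d is assumed to be available:

```shell
# Build two small hypothetical input files in a temp directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'u1' 'u3' > "$tmpdir/list.txt"
printf '%s\n' 'u1 alice' 'u2 bob' 'u3 carol' > "$tmpdir/data.txt"

# First file: remember each ID and skip to the next record.
# Second file: print records whose first field was remembered.
joined=$(awk 'FNR == NR { ids[$1] = 1; next } $1 in ids' \
    "$tmpdir/list.txt" "$tmpdir/data.txt")

printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

This prints the u1 and u3 rows. One caveat: if the first file is empty, FNR == NR is also true while reading the second file, so the idiom silently treats the second file as the lookup list.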

Practical recipes

bash
# Sum a column
awk '{sum+=$3} END {print sum}' data.txt

# Average
awk '{sum+=$1; n++} END {print sum/n}' numbers.txt

# Print columns in different order
awk '{print $3, $1, $2}' file.txt

# Skip header, process rest
awk 'NR>1 {print $2, $4}' report.csv

# Print unique lines (keeps first-occurrence order; no sort needed)
awk '!seen[$0]++' file.txt

# Print each duplicated line once (at its second occurrence)
awk 'seen[$0]++ == 1' file.txt

# Concatenate lines every N records
awk 'ORS= (NR%3 ? " " : "\n")' file    # join every 3 lines

# Extract value from key=value
awk -F= '/^timeout/{print $2}' config.ini

# Column-align a colon file
awk -F: '{printf "%-15s %-10s %s\n", $1,$3,$7}' /etc/passwd

# Top N by field
awk '{print $5, $0}' access.log | sort -rn | head -10 | cut -d' ' -f2-

# Running total
awk '{running+=$1; print running, $0}' ledger.txt

# Filter by date field (YYYY-MM-DD in $2)
awk '$2 >= "2025-01-01" && $2 <= "2025-03-31"' events.log

# Accumulate by group, then report
awk -F, '{bytes[$1]+=$3} END {
  for (h in bytes) printf "%s\t%.1f MB\n", h, bytes[h]/1048576
}' access.csv | sort -k2 -rn

# Transpose rows to columns
awk '{for(i=1;i<=NF;i++) col[i]=col[i] (NR>1?"\t":"") $i}
     END {for(i=1;i<=NF;i++) print col[i]}' matrix.txt

Output:

text
228000
76000
75000 Alice Engineering
62000 Bob Ops
Engineering Alice
Ops Bob
10 2026-01-15 10:23:45 ERROR connection refused
10 10.00 MB
webserver01	15.2 MB
appserver02	8.7 MB
dbserver03	3.1 MB

Multiline records

Setting RS="" switches awk into paragraph mode, where blank lines delimit records; with FS="\n", each line of a paragraph becomes one field. For CSV whose quoted fields contain commas, gawk's FPAT variable matches field content by pattern rather than splitting on a delimiter. FPAT still reads one line per record, so it does not handle newlines inside quoted fields.
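
As a self-contained sketch of paragraph mode, the snippet below feeds two blank-line-separated records through awk. The name/role lines are made-up sample data.

```shell
# Two hypothetical paragraphs separated by a blank line.
para_out=$(printf 'name: a\nrole: x\n\nname: b\nrole: y\n' |
    awk 'BEGIN { RS = ""; FS = "\n" } { print NR ": " $1 " / " $2 }')

printf '%s\n' "$para_out"
# 1: name: a / role: x
# 2: name: b / role: y
```

Each paragraph is one record, so NR counts paragraphs rather than lines.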

bash
# Blank-line-separated records (like paragraphs)
awk 'BEGIN{RS=""; FS="\n"} /keyword/{print $1}' file

# CSV with quoted fields containing commas (gawk only; not newlines inside quotes)
gawk 'BEGIN{FPAT="([^,]*)|(\"[^\"]+\")"} {print $2}' data.csv

Output:

text
First line of matching paragraph
"Engineering Department, North"

Built-in CSV (gawk 5.3+)

Gawk 5.3.0 (Nov 2023) added native CSV parsing via the --csv option, mirroring the same feature in BWK awk ("The One True Awk"). It handles quoted fields, embedded commas, doubled "" quotes, and CRLF line endings, so standard RFC 4180 data no longer needs a hand-rolled FPAT. The mode forces FS="," and disables backslash-escape processing inside fields. Enclosing quotes are stripped from field values on input and gawk does not re-quote on output, so BEGIN{OFS=","} round-trips only data whose fields contain no commas, quotes, or newlines.

bash
# Parse a CSV with quoted commas and embedded quotes (gawk 5.3+)
gawk --csv '{print $1, $3}' data.csv

# Convert CSV to TSV
gawk --csv 'BEGIN{OFS="\t"} {$1=$1; print}' data.csv > data.tsv

# Skip header, sum a numeric column
gawk --csv 'NR>1 {sum+=$4} END {print sum}' sales.csv

Output:

text
Alice Engineering, North
Bob Ops, West
Carol Finance, HQ

Check with gawk --version. --csv requires gawk 5.3.0 or later. On older systems, fall back to the FPAT recipe above or use a dedicated tool like qsv.

Unicode escapes (gawk 5.3+)

Gawk 5.3.0 also introduced the \u escape sequence for inserting Unicode code points by hex value (1–8 digits), encoded as UTF-8 in the current locale. This makes it easier to emit non-ASCII symbols, box-drawing characters, and emoji from awk programs without literal multibyte bytes in source.

bash
# Print a checkmark and warning symbol (gawk 5.3+)
gawk 'BEGIN{print "\u2713 ok"; print "\u26A0 warn"}'

# Box-drawing borders
gawk 'BEGIN{print "\u250C\u2500\u2500\u2510"; print "\u2514\u2500\u2500\u2518"}'

Output:

text
✓ ok
⚠ warn
┌──┐
└──┘

Modern alternatives

If you outgrow standard awk for performance or modern I/O, two actively used reimplementations expand the design space. frawk is a Rust-based JIT-compiled awk-alike with statically inferred types and built-in CSV/TSV support — typically several times faster than gawk on large files. goawk is a POSIX-compliant Go implementation with CSV mode (-i csv / -o csv), useful when you want a single static binary or are embedding awk in a Go program. Both accept most awk programs unchanged but differ in extensions and edge cases.

bash
# frawk: same syntax, often faster on big files
frawk -F, '{sum+=$3} END {print sum}' huge.csv

# goawk: CSV input/output modes
goawk -i csv -o csv 'NR>1 {print $1, $4}' data.csv

Output: (none — performance/portability differences only)

One-liners reference

bash
awk 'END{print NR}' file              # count lines (wc -l)
awk '{print NF}'   file              # print field count per line
awk 'NF'           file              # remove blank lines
awk 'length>72'    file              # lines longer than 72 chars
awk '{$1=$1; print}' file            # collapse whitespace, trim
awk '{print $NF}'  file              # print last field
awk '{print $(NF-1)}' file           # print second-to-last field
awk 'NR%2==0'      file              # print even-numbered lines
awk 'NR==FNR{a[$0];next} $0 in a'  f1 f2  # intersection of two files
awk 'NR==FNR{a[$0];next} !($0 in a)' f1 f2 # lines in f2 not in f1

Output:

text
42
3
4
5
The quick brown fox jumps over the lazy dog — this line exceeds seventy-two characters
/bin/bash
/bin/sh
line2
line4
alice
carol
bob
dave

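gawk (GNU awk) extends POSIX awk with gensub(), FPAT for field-pattern splitting, nextfile, co-processes (|&), and more. awk is gawk on many Linux distributions, but Debian-based systems often default to mawk and BusyBox-based ones use BusyBox awk; check with awk --version.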
