Definition

Awk

Awk is a pattern-scanning and processing language for text files.

Records

A record is the unit of input that awk reads and processes at a time. By default, each record is one line: records are separated by the record separator RS, which defaults to the newline character.

Fields

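Fields are the parts of a record. Awk splits each record into fields using the field separator FS. By default, FS is a single space, which awk treats specially: fields are separated by runs of spaces and tabs, and leading and trailing whitespace is ignored.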

  • $0 refers to the whole current record.
  • $1, $2, ... refer to the first, second, and later fields in that same record.

The $c syntax is called a field reference. Here, c is the field number.
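As a small sketch of field references with a non-default separator, the following uses the standard -F option to split records on colons. The sample input and the wrapper name first_and_third are made up for illustration.

```shell
# Hypothetical helper: print the first and third colon-separated fields.
first_and_third() {
  awk -F: '{ print $1, $3 }'
}

printf 'alice:x:1000\nbob:x:1001\n' | first_and_third
```

For the two sample records, this prints alice 1000 and bob 1001.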

$0

{ print $0 }

This prints the whole current record.

$0 and $1

{ print $0 }
{ print $1 }

For the record hello world, $0 is hello world, $1 is hello, $2 is world, and $3 is empty.

Arrays

Awk uses associative arrays. An associative array stores values under keys, which can be arbitrary strings. You access a value with square brackets, such as m[1].

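In GNU awk (gawk), the three-argument form match($0, /.../, m) stores captured text in the array m. m[0] is the whole match, and m[1] is the text matched by the first parenthesised group. The third argument is a gawk extension and is not part of POSIX awk.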

m[1]

if (match($0, /hello ([^ ]+)/, m)) {
  print m[1]
}

For the record hello world, this prints world.

Variables

User-defined variables

Awk does not require explicit variable declarations. A variable is created when it is assigned for the first time. User-defined variables are not scoped to one record; they persist across records until they are reassigned.

An uninitialised variable evaluates to the empty string in string context and to 0 in numeric context.
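The two points above, persistence across records and the default value of an unset variable, can be sketched as follows. The names count_and_sum, show_unset, and unset_var are illustrative only.

```shell
# Hypothetical helper: count and sum persist from one record to the next.
count_and_sum() {
  awk '{ count++; sum += $1 } END { print count, sum }'
}

# Hypothetical helper: an unset variable is "" as a string and 0 as a number.
show_unset() {
  awk 'BEGIN { print "[" unset_var "]", unset_var + 0 }'
}

printf '3\n4\n5\n' | count_and_sum
show_unset
```

The first command prints 3 12, and the second prints [] 0.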

zon_url = m[1]

zon_url = m[1]

This assigns the first captured match to zon_url, so that it can be used later in the program.

NR

NR is the total number of records read so far across all input files.
It starts at 1 for the first record and increases by 1 each time awk reads a new record.

FNR

FNR is the number of the current record within the current input file.
It resets to 1 whenever awk starts a new file.

FNR == NR is true while awk is reading the first input file. It is false for later files, provided the first file contains at least one record.
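To see how the two counters diverge, a minimal sketch can print both for every record of two small files. The file names and the helper show_counters are assumptions made for this example.

```shell
# Hypothetical helper: print NR, FNR, and the record for each input file given.
show_counters() {
  awk '{ print NR, FNR, $0 }' "$@"
}

tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first.txt"
printf 'c\nd\n' > "$tmpdir/second.txt"
show_counters "$tmpdir/first.txt" "$tmpdir/second.txt"
rm -r "$tmpdir"
```

The output is 1 1 a, 2 2 b, 3 1 c, 4 2 d: NR keeps counting, while FNR restarts at 1 for the second file.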

Statements

Awk has statements that are written without parentheses. This includes keywords such as print and delete, which are not function calls.

Important

Statements like print and delete do not need parentheses, whereas ordinary function calls such as match(...) do.

print

print writes output. With no argument, it prints the current record $0, followed by the output record separator (a newline by default).

delete

delete removes an element from an array, for example delete m[1].
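A short sketch of delete in action follows. It uses the standard in operator, which is not covered above, to test whether a key exists; the array name seen and the helper delete_demo are illustrative.

```shell
# Hypothetical helper: delete one array element and check what remains.
delete_demo() {
  awk 'BEGIN {
    seen["x"] = 1
    seen["y"] = 2
    delete seen["x"]          # remove only the element with key "x"
    has_x = ("x" in seen)     # 0: the key is gone
    has_y = ("y" in seen)     # 1: other keys are untouched
    print has_x, has_y
  }'
}

delete_demo
```

This prints 0 1.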

Functions

Awk has built-in functions and user-defined functions.

match

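match(s, r) searches the string s for the POSIX extended regular expression r. If the search succeeds, it returns the 1-based position of the match in s, and it sets RSTART to that position and RLENGTH to the length of the match. If the search fails, it returns 0 and sets RSTART to 0 and RLENGTH to -1. GNU awk (gawk) also accepts a third argument a: match(s, r, a) stores the whole match in a[0] and the captured subexpressions in a[1], a[2], and so on. The three-argument form is a gawk extension.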

Important

match() is often used inside a condition, because a successful match returns a non-zero value.

match

if (match($0, /hello ([^ ]+)/, m)) {
  print m[0]
  print m[1]
}

For the record hello world, m[0] is hello world and m[1] is world.
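The example above uses the three-argument form of match(), which needs GNU awk. Where only the two-argument form is available, a rough equivalent can be built from RSTART, RLENGTH, and the standard substr() function. This is a sketch that assumes the fixed prefix "hello " is known in advance; the helper name is made up.

```shell
# Hypothetical helper: extract the word after "hello " without gawk extensions.
extract_after_hello() {
  awk '{
    if (match($0, /hello [^ ]+/)) {
      # "hello " is 6 characters, so the wanted word starts 6 past RSTART.
      print RSTART, RLENGTH, substr($0, RSTART + 6, RLENGTH - 6)
    }
  }'
}

printf 'say hello world now\n' | extract_after_hello
```

For the record say hello world now, this prints 5 11 world: the match starts at position 5, is 11 characters long, and the word after the prefix is world.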

User-defined functions

A user-defined function is declared with the function keyword. It can take parameters and return a value with return.

function

function double(x) {
  return x * 2
}

This defines a function that returns twice its argument.

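Awk has no dedicated syntax for local variables. By convention, locals are declared as extra parameters that the caller never passes, and they are set apart from the real parameters by extra spaces in the function header. The extra spaces are only a visual cue; awk ignores them.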

local variables

function flush(x,    i) {
  i = x
  print i
}

Here, x is a parameter and i is a local variable of flush.
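One way to check that such a variable really is local is to give a global variable the same name and confirm it survives the call. The function bump and the wrapper local_demo are hypothetical names used only for this sketch.

```shell
# Hypothetical helper: show that a local i does not clobber the global i.
local_demo() {
  awk '
    function bump(x,    i) {   # i is local because it is an extra parameter
      i = x + 1
      return i
    }
    BEGIN {
      i = 10                   # global i
      result = bump(1)
      print i, result          # the global i is unchanged
    }'
}

local_demo
```

This prints 10 2.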

Operators

Pattern Matching Operator

The pattern matching operator ~ tests whether the left-hand side matches a POSIX extended regular expression on the right-hand side.
The operator !~ is the negated form.

~

$0 ~ /^[[:space:]]*\{[[:space:]]*$/

This is true when the current record consists of a single literal {, optionally surrounded by whitespace.

!~

$0 !~ /foo/

This is true when the current record does not contain the text foo.
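Both operators also work on individual fields and as rule patterns. The following sketch, with made-up helper names and sample data, keeps records whose second field is all digits, and separately drops records that start with #.

```shell
# Hypothetical helper: print the first field when the second field is all digits.
numeric_second_field() {
  awk '$2 ~ /^[0-9]+$/ { print $1 }'
}

# Hypothetical helper: print every record that does not start with "#".
non_comment_lines() {
  awk '$0 !~ /^#/ { print }'
}

printf 'a 10\nb x\n# note 5\n' | numeric_second_field
printf 'a 10\nb x\n# note 5\n' | non_comment_lines
```

The first command prints a. The second prints a 10 and b x.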

Conditionals

if

if runs one block when a condition is true. If the condition is false, awk skips that block. You can also add else and else if branches.

A condition can be any expression whose value is treated as true or false.
For example, match($0, /re/, m) is used as a condition because it returns a non-zero value when the regular expression matches the current record.

if

if ($1 == "yes") {
  print $0
}

This prints the current record only when the first field is yes.

else

else runs when the if condition is false.

else

if ($1 == "yes") {
  print "yes"
} else {
  print "no"
}

This chooses one of two actions.

else if

else if adds another condition after an if or else if.

else if

if ($1 == "yes") {
  print "yes"
} else if ($1 == "maybe") {
  print "maybe"
}

This checks a second condition if the first one is false.

Ternary Operator

The ternary operator chooses between two values with a condition.

Ternary operator

print ($1 == "yes" ? "yes" : "no")

This prints yes when the condition is true, and no otherwise.

Pattern-Action Rules

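An awk program is a list of pattern-action rules. Each rule has a pattern, an action block in braces, or both. For each record, awk runs the action block of every rule whose pattern is true. If the pattern is omitted, the action runs for every record; if the action is omitted, awk prints the record.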

A rule can look like this:

pattern { action }

The brace-delimited blocks are action blocks. Rules can be written one after another, and awk evaluates them in order for each record.
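As a sketch of several rules running against the same input, the program below combines a rule that has only a pattern (awk then prints the matching record by default), a rule that has only an action (it runs for every record), and an END rule. The helper name and the data are illustrative.

```shell
# Hypothetical helper: three rules checked in order for each record.
rules_demo() {
  awk '
    $1 > 10                    # pattern only: the default action prints the record
    { n++ }                    # action only: runs for every record
    END { print n }
  '
}

printf '5 a\n12 b\n20 c\n' | rules_demo
```

This prints 12 b and 20 c, followed by the record count 3.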

FNR == NR { ... }

FNR == NR {
  print $0
}

This rule runs its action block only for records from the first input file.

next

next stops processing the current record and moves awk to the next one.
Without next, awk keeps checking later rules for the same record.

Important

next is useful when one rule handles a record completely and later rules should not see it.

next

{
  if ($1 == "skip") {
    next
  }
  print $0
}

This skips records whose first field is skip.

next in a rule

FNR == NR {
  print $0
  next
}

This prints records from the first input file and then stops awk from applying later rules to those same records.
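A common use of next is to filter out records early so that later rules never see them. In this sketch, with a made-up helper name and input, comment lines are dropped by the first rule and everything else reaches the second.

```shell
# Hypothetical helper: skip lines starting with "#", number the rest by NR.
skip_comments() {
  awk '
    /^#/ { next }              # comment lines stop here
    { print NR ": " $0 }       # only reached by non-comment records
  '
}

printf '# header\nalpha\n# note\nbeta\n' | skip_comments
```

This prints 2: alpha and 4: beta. The skipped records still count towards NR.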

BEGIN

BEGIN is a special pattern that runs before awk reads the first record. It is commonly used to initialise variables or print a header.

BEGIN

BEGIN {
  print "start"
}

This runs once, before any input is processed.

END

END is a special pattern that runs after awk has read the last record. It is commonly used to print summaries or to flush buffered output.

END

END {
  if (in_item) {
    flush()
  }
}

This runs once, after all input has been processed.
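BEGIN, an ordinary rule, and END can be combined into a small report. The sketch below is one possible arrangement; the helper name sum_report and the header text are assumptions.

```shell
# Hypothetical helper: print a header, echo each value, then print the total.
sum_report() {
  awk '
    BEGIN { print "values" }         # runs before the first record
    { sum += $1; print "  " $1 }     # runs for each record
    END { print "total", sum }       # runs after the last record
  '
}

printf '1\n2\n3\n' | sum_report
```

For the input 1, 2, 3, this prints the header values, the three indented numbers, and finally total 6.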

Examples

FNR == NR { ... }

This pattern runs only while awk reads the first input file.
FNR counts records within the current file, but NR counts records across all files. When awk starts the second file, FNR resets to 1 but NR keeps increasing. So FNR == NR is false for later files.

For example:

FNR == NR { print $0 }

This prints every record from the first file and skips the later files because the condition is no longer true.
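The same pattern is often paired with next to load a lookup table from the first file and then filter the second. The following is a sketch of that idiom, assuming the first file is not empty; the helper keep_known, the array known, and the file contents are all made up for illustration.

```shell
# Hypothetical helper: print records of file 2 whose first field appears in file 1.
keep_known() {
  awk '
    FNR == NR { known[$1] = 1; next }   # first file: remember each key
    $1 in known { print }               # second file: keep records with a known key
  ' "$1" "$2"
}

tmpdir=$(mktemp -d)
printf 'alice\ncarol\n' > "$tmpdir/allow.txt"
printf 'alice 1\nbob 2\ncarol 3\n' > "$tmpdir/data.txt"
keep_known "$tmpdir/allow.txt" "$tmpdir/data.txt"
rm -r "$tmpdir"
```

This prints alice 1 and carol 3. The next statement keeps records of the first file from reaching the second rule.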