awk Overview

awk is a tool for processing rectangular data, in other words, data that looks like a spreadsheet, with a regular structure on every row. Each awk script is essentially a one-liner, with an (optional) BEGIN section, a main section, and an (optional) END section. Often the entire awk script is only the main section.

So an awk script can look like this:

`awk 'BEGIN{ put-the-preliminary-stuff-here } { put-the-main-stuff-here } END{ put-the-ending-stuff-here }'

but, more commonly, there is no BEGIN or END section, so an awk program looks like:

`awk '{ put-the-main-stuff-here }'

Any code in the BEGIN section will run only one time, before awk starts processing the data set. Any code in the END section will run only one time, after awk finishes processing the data set. For these reasons, the important part of awk programs in the main portion. The main section of an awk script is executed on each row of the data. This main section runs one time for every line of the data set. If, for instance, the data set has 1 million lines, then the main portion of the awk script will run 1 million times, namely, 1 time per row. The main section usually consists of a transformation of the data or a retrieval of a portion of the data. This is a key thing to keep in mind with awk. When figuring out what to do with awk, I encourage students to do the following: Go row by row through your data, and think about what you want to do with each row of the data. Do you want to extract something? Keep track of something? Search for something? Usually you want to do the same thing on each row of your data set. With this mentality in mind, it becomes easier to write your first awk scripts.

The syntax for awk takes a little time to get used to. Once familiar with awk, it is possible make quick transformations of data using awk with just 1 line of work. Also, awk is really handy as a part of a larger bash pipeline of tools for transforming or extracting data.

Field separators

In awk, by default, the field separator is whitespace. In other words, any pieces of data that are separated by whitespace will be considered as separate fields. If two parts of a row of data have (for instance) spaces or tabs between them, they will be treated as separate pieces of the data on that row.

It is possible to specify other field separators, using the -F option.

For instance, in a csv file, the field separator should be a comma. So if you run awk on a csv file, you usually want to specify that the field separator is a comma, like this:

awk -F, myfile.csv '{print $5}'

In this example, we search through each row of data, looking for the 5th field. Each time we come to a comma in the row, we are in a new field of the data on that row. We are asking awk to print the data in the 5th field of each row.

As another example, in a tsv file (where tsv stands for tab-separated values), the data fields are separated by tabs. We can, for instance, print the 2nd field from each row of data.

awk -F"\t" myfile.tsv '{print $2}'

In the election data contained in the directory:

/anvil/projects/data/tdm/election

the data fields are separated by the pipe symbol. For instance, we can use

awk -F"\|" /anvil/projects/data/tdm/election/itcont2000.txt '{print $10}'

to print the 10th field of data from each row. This will print the state where each donor lives.

Please note that, when working with large data sets like the election data, awk is often used to print 1 result per row of data. In the election data, there are millions of rows. So it is usually helpful to use the awk command as part of a pipeline. For instance, we might just want to see how this command works on the first 10 rows of the data set, like this:

awk -F"\|" /anvil/projects/data/tdm/election/itcont2000.txt '{print $10}' | head

or we could write:

cat /anvil/projects/data/tdm/election/itcont2000.txt | awk -F"\|" '{print $10}'

FIXTHIS we could give an example showing how to do the previous example using cut and uniq and sort.

FIXTHIS might want to give an example with the flight data, which has comma-separated values

Examples

FIXTHIS see Section 20.10 of the Unix Power Tools book for example awk programs.

FIXTHIS put examples of NF and NR and some of the other variables listed on page 382 of Unix Power Tools.

Associative Arrays

FIXTHIS need to put some examples here, for instance, keep track of the number of election donations from each state; also show how this could be done using cut and uniq and sort.

FIXTHIS We could do the same with taxi cab rides, classified according to the number of passengers in the ride; we can also show how this could be done using cut and uniq and sort.

Commands in awk

FIXTHIS we could list some of the awk commands given on page 383 of the Unix Power Tools, including examples of some of these. For instance, some of the states in the election donations data are not capitalized. We could fix this by using the toupper command.

Using sed and awk together

The sed and awk tools are often used together, and there is even a book devoted to these two tools.

Information from the GNU webpage about awk

awk Books

  • The AWK Programming Language by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger (Addison-Wesley, 1988), available at Amazon

  • Effective awk Programming, 4th Edition, by Arnold Robbins (O’Reilly, 2015), available at O’Reilly or Amazon Note: This is essentially the very same text as the gawk book below, which is available for free.

  • Gawk: Effective AWK Programming, by Arnold D. Robbins, freely available online as the manual for gawk, available in several formats. In particular, the book can be legally downloaded for free in pdf format.

  • sed & awk, 2nd Edition, by Dale Dougherty and Arnold Robbins (O’Reilly 1997), availble at O’Reilly or Amazon