Introduction to awk

We will learn awk in this book entirely through examples. We typically run awk programs on data files, or pipe data into awk from another Unix utility such as cat, grep, or sed.

This book feeds data to awk in various ways; for now, we'll use a file called sales.csv.

Create a folder for all the data files/programs that you are going to use in this book. In that folder, make a file named sales.csv.

The content of the file should be as below:

month,sales
Jan,100
Feb,200
March,100
April,50
May,900
June,1000
July,10
August,20
September,456
October,134
November,934
December,545

Program #1

code ⏰  awk '{print}' sales.csv

Output: Prints the entire file

This is one of the shortest awk programs one can write. It mimics the cat command by printing the entire file to the terminal.

How did it work?

Syntax

awk <command> <file1> <file2> ... <filen>
  • <command>: the awk program, enclosed in single quotes ('). The code itself goes inside one or more pairs of curly brackets {}.

  • <file1 ... n>: one or more files on which the awk program will be executed. Files are optional: if none is given, awk reads its input from stdin.

Execution of an awk program

When an awk command is executed, it runs the code (enclosed in single quotes) once for each line of the input file(s).
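To see this per-line execution in action, here is a minimal sketch that pipes three lines into awk on stdin instead of naming a file. The input text is made up for illustration; the block prints a fixed message once for every line it reads.

```shell
# Three lines arrive on stdin (no file name is given), so the block runs three times.
printf 'Jan,100\nFeb,200\nMarch,100\n' | awk '{print "read one line"}'
# read one line
# read one line
# read one line
```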

Working with fields

We saw how to print entire lines to the command line. In this section, we'll see how to print fields selectively. In awk, a file consists of lines (records), and every line contains fields separated by a delimiter.

Let's say that we have this file:

month,sales
Jan,100
  • Record: month,sales is the first record and Jan,100 is the second, i.e. records are separated by a newline.

  • Field: month and sales are fields. Whatever separates two fields is the delimiter; here it is the comma.

Program #2

code ⏰  awk '{print $0}' sales.csv

The output of the above command is exactly the same as that of Program #1.

Program #3

code ⏰  awk -F',' '{print $1, $2}' sales.csv
month sales
Jan 100
Feb 200
March 100
April 50
May 900
June 1000
July 10
August 20
September 456
October 134
November 934
December 545
  • $0: the entire line being processed (awk processes a file line by line, from top to bottom)

  • $1: the first field of the line being processed

  • $2: the second field of the line being processed

  • $NF: the last field of the line being processed

  • -F',': sets the delimiter, in this case a comma. You can use whatever delimiter your data file contains; just enclose it in quotes.
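Because $NF always refers to the last field, it only differs from $2 once a line has more than two fields. A small sketch with a made-up three-column line (the third column is an assumption for illustration and is not part of sales.csv):

```shell
# The line has three fields, so $NF is the third one, not $2.
printf 'Jan,100,north\n' | awk -F',' '{print $1, $NF}'
# Jan north
```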

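Note:

This book writes the option without a space, as -F','. The form -F ',' (with a space between the option and the delimiter) also works in POSIX-compliant awk implementations, so it is a matter of style rather than correctness.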

Why do we need a delimiter?

code ⏰  awk '{print $1}' sales.csv

Run the above code and see the output. After analysing it, read the next paragraph.

By default, awk splits fields on whitespace (runs of spaces or tabs). The lines of sales.csv contain no whitespace, so if we process this comma-separated file without specifying a delimiter, awk treats the entire line as the first field.

In Program #3, when we printed $1, $2, you would have noticed that the month and the sales were printed with a space between them. This is because output fields have a separator as well, and it defaults to a single space.
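As a quick illustration of the default splitting, the sketch below runs awk without -F on made-up space-separated input, and then on a comma-separated line:

```shell
# No -F option: fields are split on whitespace, even when several spaces separate them.
printf 'Jan 100\nFeb    200\n' | awk '{print $1, $2}'
# Jan 100
# Feb 200

# Same idea on comma-separated input: there is no whitespace, so the whole line is $1.
printf 'Jan,100\n' | awk '{print $1}'
# Jan,100
```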

We set the output field separator (OFS) by adding -v OFS="," before the awk code block (which is enclosed in single quotes).

Now run the code below and play with the OFS value.

code ⏰  awk -F, -v OFS="," '{print $1, $2}' sales.csv

code ⏰  awk -F, -v OFS=":" '{print $1, $NF, $2}' sales.csv

code ⏰  awk -F"," -v OFS=";" '{print $1, $2}' sales.csv

The printing order does not matter: we can print any field in any position, any number of times.

code ⏰  awk -F"," -v OFS=";" '{print $2, $1, $2}' sales.csv
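One detail worth trying while playing with OFS: the separator is only inserted where the print statement has a comma. A small sketch on a made-up one-line input shows the difference; without the comma, awk simply concatenates the two fields.

```shell
# With a comma between the fields, OFS is placed between them.
printf 'Jan,100\n' | awk -F',' -v OFS=';' '{print $1, $2}'
# Jan;100

# Without the comma, the fields are joined directly and OFS is not used.
printf 'Jan,100\n' | awk -F',' -v OFS=';' '{print $1 $2}'
# Jan100
```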

You can run an awk program on multiple files as well; just give the file names after the program.
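A minimal sketch of the multi-file case follows. The file names q1.csv and q2.csv and their contents are hypothetical, created in a scratch directory only for this demonstration.

```shell
# Create two small, made-up data files in a temporary directory.
dir=$(mktemp -d)
printf 'Jan,100\nFeb,200\n' > "$dir/q1.csv"
printf 'April,50\n' > "$dir/q2.csv"

# awk reads q1.csv to the end, then continues with q2.csv.
awk -F',' '{print $1}' "$dir/q1.csv" "$dir/q2.csv"
# Jan
# Feb
# April
```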

Multiple code blocks

code ⏰  awk -F"," -v OFS=";" '{print $2, $1} {print $0}' sales.csv

The statements enclosed in {} form a block of code. For every input line, awk runs the blocks one after another, from left to right, before moving on to the next line. So when we run the above code, the output alternates: for each line we first get its fields in reverse order (joined by ;) and then the line as is. The first two lines of sales.csv, for example, produce sales;month, month,sales, 100;Jan and Jan,100, in that order.

You can have as many blocks as you want.

The two blocks above are redundant; we can put both print statements in one block.

code ⏰  awk -F"," -v OFS=";" '{print $2, $1; print $0}' sales.csv

Within a block, a semicolon can be used to separate statements.
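A quick way to check that the single-block form behaves like the two-block form is to capture both outputs and compare them. The sketch below does this on a made-up two-line sample piped in on stdin, so it does not depend on sales.csv; the variable names are chosen only for illustration.

```shell
# Made-up two-line sample.
sample='month,sales
Jan,100'

# Run the two-block program and the one-block program on the same input.
two_blocks=$(printf '%s\n' "$sample" | awk -F"," -v OFS=";" '{print $2, $1} {print $0}')
one_block=$(printf '%s\n' "$sample" | awk -F"," -v OFS=";" '{print $2, $1; print $0}')

# Compare the captured outputs.
if [ "$two_blocks" = "$one_block" ]; then
    echo "same output"
fi
```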

Links

-Next section