In this lesson, we will cover a few applications of regular expressions (or regex) that I use all the time. Regex are available in most programming languages, but to keep this lesson accessible to as many people as possible, we will focus on applications at the Bash shell. Specifically, we will cover how you can use grep, sed and awk to get a lot done without firing up a script, especially with the power of regex at your side.
Regular expressions are an extremely powerful tool for pattern matching. You might not realize it, but a lot of what we do is pattern matching, especially if you deal with text at all. The ability to describe a flexible pattern that the computer can then quickly look for in some arbitrary text opens up a world of possibilities. This lesson will focus on some of these possibilities. Notably, we will cover:
grep
sed
awk
These three tools alone justify learning Bash to make your life easier. In combination with regex, they are life-savers!
We will be using Jenny Bryan's cleaned-up version of the gapminder dataset. It contains 1704 rows and 6 columns. The dataset consists of the population, life expectancy and GDP per capita for 142 countries every 5 years between 1952 and 2007. You can easily download the data using curl as follows.
curl -sL bit.ly/gapm-data > gapminder.tsv
head gapminder.tsv
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28.801 8425333 779.4453145 Afghanistan Asia 1957 30.332 9240934 820.8530296 Afghanistan Asia 1962 31.997 10267083 853.10071 Afghanistan Asia 1967 34.02 11537966 836.1971382 Afghanistan Asia 1972 36.088 13079460 739.9811058 Afghanistan Asia 1977 38.438 14880372 786.11336 Afghanistan Asia 1982 39.854 12881816 978.0114388 Afghanistan Asia 1987 40.822 13867957 852.3959448 Afghanistan Asia 1992 41.674 16317921 649.3413952
At its simplest, grep can be used to filter lines based on a pattern. We can start with a plain, non-regex pattern. Here, we subset the file to lines that contain the word Canada.
grep Canada gapminder.tsv
Canada Americas 1952 68.75 14785584 11367.16112 Canada Americas 1957 69.96 17010154 12489.95006 Canada Americas 1962 71.3 18985849 13462.48555 Canada Americas 1967 72.13 20819767 16076.58803 Canada Americas 1972 72.88 22284500 18970.57086 Canada Americas 1977 74.21 23796400 22090.88306 Canada Americas 1982 75.76 25201900 22898.79214 Canada Americas 1987 76.86 26549700 26626.51503 Canada Americas 1992 77.95 28523502 26342.88426 Canada Americas 1997 78.61 30305843 28954.92589 Canada Americas 2002 79.77 31902268 33328.96507 Canada Americas 2007 80.653 33390141 36319.23501
You'll notice that the header line was removed, because it doesn't contain Canada. If we want to ensure that this file remains valid, we need to keep the header. There are multiple ways to do this.
First, you can grep the header and the Canada lines separately.
grep country gapminder.tsv
grep Canada gapminder.tsv
country continent year lifeExp pop gdpPercap Canada Americas 1952 68.75 14785584 11367.16112 Canada Americas 1957 69.96 17010154 12489.95006 Canada Americas 1962 71.3 18985849 13462.48555 Canada Americas 1967 72.13 20819767 16076.58803 Canada Americas 1972 72.88 22284500 18970.57086 Canada Americas 1977 74.21 23796400 22090.88306 Canada Americas 1982 75.76 25201900 22898.79214 Canada Americas 1987 76.86 26549700 26626.51503 Canada Americas 1992 77.95 28523502 26342.88426 Canada Americas 1997 78.61 30305843 28954.92589 Canada Americas 2002 79.77 31902268 33328.96507 Canada Americas 2007 80.653 33390141 36319.23501
However, you will notice that we are repeating ourselves (the grep command and the gapminder.tsv file name). Ideally, we want to follow the DRY (don't repeat yourself) principle.
N.B. An astute reader will notice that I can extract the header using head -1. Indeed, this would work here, but I am familiar with file formats (e.g. VCF, the variant call format) where the header is neither the first line nor a predictable number of lines into the file. In these cases, grep is more general.
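As an aside on DRY, grep itself can take several patterns at once via repeated -e options, so the header and the country match can live in a single command. A minimal sketch on a throwaway file (demo.tsv is a made-up name created here for illustration):

```shell
# grep accepts multiple patterns, one per -e option; a line matching
# any of them is printed (demo.tsv is a throwaway file for this demo)
printf 'country\tcontinent\nCanada\tAmericas\nItaly\tEurope\n' > demo.tsv
grep -e 'country' -e 'Canada' demo.tsv
```

This prints the header line and the Canada line, skipping Italy, without typing grep or the file name twice.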
The second approach involves the use of regex. In fact, grep stands for "globally search for a regular expression and print". However, because there have been multiple versions of regex over the years and we are used to the more modern versions, we will need to use a variant of grep that enables extended regex. You can either use grep -E or egrep. I will be using the latter.
Here, we can start using regex by using the | operator, which matches what's on the left or on the right. Whenever you use regular expressions, it is safer to quote the pattern using single quotes.
egrep 'country|Canada' gapminder.tsv
country continent year lifeExp pop gdpPercap Canada Americas 1952 68.75 14785584 11367.16112 Canada Americas 1957 69.96 17010154 12489.95006 Canada Americas 1962 71.3 18985849 13462.48555 Canada Americas 1967 72.13 20819767 16076.58803 Canada Americas 1972 72.88 22284500 18970.57086 Canada Americas 1977 74.21 23796400 22090.88306 Canada Americas 1982 75.76 25201900 22898.79214 Canada Americas 1987 76.86 26549700 26626.51503 Canada Americas 1992 77.95 28523502 26342.88426 Canada Americas 1997 78.61 30305843 28954.92589 Canada Americas 2002 79.77 31902268 33328.96507 Canada Americas 2007 80.653 33390141 36319.23501
If we wanted to include the US in our results, it's as simple as adding another | operator in our pattern.
egrep 'country|Canada|United States' gapminder.tsv
country continent year lifeExp pop gdpPercap Canada Americas 1952 68.75 14785584 11367.16112 Canada Americas 1957 69.96 17010154 12489.95006 Canada Americas 1962 71.3 18985849 13462.48555 Canada Americas 1967 72.13 20819767 16076.58803 Canada Americas 1972 72.88 22284500 18970.57086 Canada Americas 1977 74.21 23796400 22090.88306 Canada Americas 1982 75.76 25201900 22898.79214 Canada Americas 1987 76.86 26549700 26626.51503 Canada Americas 1992 77.95 28523502 26342.88426 Canada Americas 1997 78.61 30305843 28954.92589 Canada Americas 2002 79.77 31902268 33328.96507 Canada Americas 2007 80.653 33390141 36319.23501 United States Americas 1952 68.44 157553000 13990.48208 United States Americas 1957 69.49 171984000 14847.12712 United States Americas 1962 70.21 186538000 16173.14586 United States Americas 1967 70.76 198712000 19530.36557 United States Americas 1972 71.34 209896000 21806.03594 United States Americas 1977 73.38 220239000 24072.63213 United States Americas 1982 74.65 232187835 25009.55914 United States Americas 1987 75.02 242803533 29884.35041 United States Americas 1992 76.09 256894189 32003.93224 United States Americas 1997 76.81 272911760 35767.43303 United States Americas 2002 77.31 287675526 39097.09955 United States Americas 2007 78.242 301139947 42951.65309
grep can be as flexible as you need it to be. While it may be contrived, let's say we are interested in the data from 1977 for countries whose names start with S (and we want to keep the header). As always, there are multiple ways of approaching this problem.
First, we can use UNIX pipes to perform subsequent filters: one for the countries starting with S and another for the rows corresponding to 1977.
egrep '^(country|S)' gapminder.tsv | egrep '1977'
Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1967 67.946 1977600 4977.41854 Singapore Asia 1977 70.795 2325300 11210.08948 Singapore Asia 2002 78.77 4197776 36023.1054 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439 Spain Europe 1977 74.39 36439000 13236.92117 Sri Lanka Asia 1977 65.949 14116836 1348.775651 Sudan Africa 1977 47.8 17104986 2202.988423 Swaziland Africa 1977 52.537 551425 3781.410618 Sweden Europe 1977 75.44 8251648 18855.72521 Switzerland Europe 1977 75.39 6316424 26982.29052 Syria Asia 1977 61.195 7932503 3195.484582
Some of you might have noticed that Singapore comes up three times, while every country is supposed to show up at most once. Upon closer inspection, you can see why this is happening: the 1977 pattern is matching within other numbers on the line, such as the population, which is not something we want.
The immediate solution to this is to prevent matches of the year within other numbers. In regex, you can specify that word boundaries must be present before and after the number using \b.
egrep '^(country|S)' gapminder.tsv | egrep '\b1977\b'
Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439 Spain Europe 1977 74.39 36439000 13236.92117 Sri Lanka Asia 1977 65.949 14116836 1348.775651 Sudan Africa 1977 47.8 17104986 2202.988423 Swaziland Africa 1977 52.537 551425 3781.410618 Sweden Europe 1977 75.44 8251648 18855.72521 Switzerland Europe 1977 75.39 6316424 26982.29052 Syria Asia 1977 61.195 7932503 3195.484582
Indeed, this solves our problem, but we lost the header again. To get it back, we need to include the country pattern in both commands, which is slightly repetitive.
N.B. Our current solution to filtering for observations made in 1977 is imperfect, because we are filtering on the presence of 1977 anywhere in the line. Technically, if a country had a population or GDP per capita of 1977 at some point, this would be included in the output. Later, we will see how we can use awk to apply regex on specific columns.
egrep '^(country|S)' gapminder.tsv | egrep '(country|\b1977\b)'
country continent year lifeExp pop gdpPercap Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439 Spain Europe 1977 74.39 36439000 13236.92117 Sri Lanka Asia 1977 65.949 14116836 1348.775651 Sudan Africa 1977 47.8 17104986 2202.988423 Swaziland Africa 1977 52.537 551425 3781.410618 Sweden Europe 1977 75.44 8251648 18855.72521 Switzerland Europe 1977 75.39 6316424 26982.29052 Syria Asia 1977 61.195 7932503 3195.484582
Second, we can combine our patterns into one regex. Admittedly, there is no compelling advantage in doing so other than avoiding needless commands wherever possible. For this, we need to acknowledge that the year will always come after the country name, separated by some number of characters. We can specify "any number of any characters" in regex using .*.
egrep '^(country|S).*\b1977\b' gapminder.tsv
Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439 Spain Europe 1977 74.39 36439000 13236.92117 Sri Lanka Asia 1977 65.949 14116836 1348.775651 Sudan Africa 1977 47.8 17104986 2202.988423 Swaziland Africa 1977 52.537 551425 3781.410618 Sweden Europe 1977 75.44 8251648 18855.72521 Switzerland Europe 1977 75.39 6316424 26982.29052 Syria Asia 1977 61.195 7932503 3195.484582
Let's create a file with a list of countries of interest for the purposes of this demo.
echo -e 'Canada\nItaly\nAustralia\nUnited States\nEngland\nFrance' > countries.txt
cat countries.txt
Canada Italy Australia United States England France
Given this list, we can easily filter the gapminder dataset for observations made for these countries.
egrep -f countries.txt gapminder.tsv | head
Australia Oceania 1952 69.12 8691212 10039.59564 Australia Oceania 1957 70.33 9712569 10949.64959 Australia Oceania 1962 70.93 10794968 12217.22686 Australia Oceania 1967 71.1 11872264 14526.12465 Australia Oceania 1972 71.93 13177000 16788.62948 Australia Oceania 1977 73.49 14074100 18334.19751 Australia Oceania 1982 74.74 15184200 19477.00928 Australia Oceania 1987 76.32 16257249 21888.88903 Australia Oceania 1992 77.56 17481977 23424.76683 Australia Oceania 1997 78.83 18565243 26997.93657
So far, we've seen how we can use grep to subset the lines in a file according to a certain pattern. Another useful feature of grep is its quiet mode, which can be used in conjunction with Bash conditional expressions.
First, let's review Bash if statements.
if [[ 1 > 2 ]]; then
echo 'true'
else
echo 'false'
fi
false
Here, [[ 1 > 2 ]] is actually a command that evaluates the expression inside the double square brackets (note that, inside [[ ]], the > operator compares strings lexicographically; for numeric comparison, use -gt). This portion of the if statement in Bash can be any command. A command evaluates as true if its exit code is zero (i.e. the command was successful). Otherwise, it's considered false.
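The difference between the string and numeric comparison operators is worth a quick sketch (this uses the bash-specific [[ ]] syntax, as in the example above):

```shell
# -gt compares numbers: 10 is greater than 2, so this prints
if [[ 10 -gt 2 ]]; then echo 'numeric: 10 -gt 2 is true'; fi
# > compares strings: "10" sorts before "2" lexicographically,
# so this takes the else branch
if [[ 10 > 2 ]]; then echo 'string: true'; else echo 'string: 10 > 2 is false'; fi
```

This is why the original [[ 1 > 2 ]] example happens to give the expected answer: "1" also sorts before "2" as a string.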
To show the exit-code behavior, I will run the commands true and false, which respectively return exit codes 0 and 1.
N.B. $? is a useful variable that contains the exit code of the most recently run command.
if true; then
echo 'Exit code: ' $?
echo 'Considered true'
else
echo 'Exit code: ' $?
echo 'Considered false'
fi
Exit code: 0 Considered true
if false; then
echo 'Exit code: ' $?
echo 'Considered true'
else
echo 'Exit code: ' $?
echo 'Considered false'
fi
Exit code: 1 Considered false
Now, let's say you're interested in running a command only if a file contains some pattern. You can use grep in quiet mode inside an if statement, as follows.
if egrep -q 'Canada' countries.txt; then
echo 'Canada is in the countries.txt file :D'
else
echo 'Canada is not in the countries.txt file :('
fi
Canada is in the countries.txt file :D
if egrep -q 'Switzerland' countries.txt; then
echo 'Switzerland is in the countries.txt file :D'
else
echo 'Switzerland is not in the countries.txt file :('
fi
Switzerland is not in the countries.txt file :(
Here, we are just echoing some text, but you can do whatever you want once you know a file matches a pattern. A nice thing about quiet mode is that grep stops searching as soon as it encounters the first instance of the pattern.
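You can also inspect the exit codes directly, without an if statement; since grep -q prints nothing, the exit code (via $?) is its only result. A small sketch on inline text:

```shell
# grep -q suppresses all output; success or failure is reported
# solely through the exit code
printf 'Canada\nItaly\n' | grep -q 'Canada'; echo "match: $?"      # match: 0
printf 'Canada\nItaly\n' | grep -q 'Peru'; echo "no match: $?"     # no match: 1
```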
If you want to count how many lines in a file match a pattern, you can certainly pipe the output of grep to wc -l. You can be slightly more efficient by avoiding the extra command and using the -c option of grep.
grep 'Canada' gapminder.tsv | wc -l
12
grep -c 'Canada' gapminder.tsv
12
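Strictly speaking, both of these count matching lines, not individual matches: a line containing the pattern twice still counts once. If you need every occurrence, one common trick is -o, which prints each match on its own line, combined with wc -l. A sketch on inline text:

```shell
# the first line contains 'aa' twice, but counts as a single matching line
printf 'aa bb aa\ncc\n' | grep -c 'aa'          # 1 matching line
printf 'aa bb aa\ncc\n' | grep -o 'aa' | wc -l  # 2 occurrences
```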
Lastly, for all of the above grep commands, you can invert the search using the -v option. In other words, if you want all lines except for those containing "Canada" or "United States", you can simply do the following:
cat countries.txt
Canada Italy Australia United States England France
egrep -v 'Canada|United States' countries.txt
Italy Australia England France
Write an if statement in Bash that checks if there are any countries that start with the letter "Z" outside of Africa, and echoes the response accordingly.
So far, we've seen grep's amazing ability to subset lines in a file according to a pattern, which can be as complex as you can conjure. Now, we're going to introduce sed, which is probably best known for its ability to perform search-and-replace really easily at the command line.
Let's remind ourselves of what's in our gapminder.tsv file.
head gapminder.tsv
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28.801 8425333 779.4453145 Afghanistan Asia 1957 30.332 9240934 820.8530296 Afghanistan Asia 1962 31.997 10267083 853.10071 Afghanistan Asia 1967 34.02 11537966 836.1971382 Afghanistan Asia 1972 36.088 13079460 739.9811058 Afghanistan Asia 1977 38.438 14880372 786.11336 Afghanistan Asia 1982 39.854 12881816 978.0114388 Afghanistan Asia 1987 40.822 13867957 852.3959448 Afghanistan Asia 1992 41.674 16317921 649.3413952
To start off with a simple example to examine the structure of a sed command, we are going to replace every instance of "United States" with "USA". Here, we will count instances of each term before and after we apply sed to confirm the change.
In general, we need to ensure that modern regular expressions are enabled in sed. Unfortunately, this option varies based on your platform. It's -E in BSD sed (the default on Macs) and -r in GNU sed (the default on Linux), although recent versions of GNU sed also accept -E.
sed -E 's/United States/USA/' gapminder.tsv > gapminder.usa.tsv
echo 'Before sed'
grep -c 'United States' gapminder.tsv
grep -c 'USA' gapminder.tsv
echo 'After sed'
grep -c 'United States' gapminder.usa.tsv
grep -c 'USA' gapminder.usa.tsv
Before sed 12 0 After sed 0 12
As you can see, the search-and-replace worked. The general form of a sed search-and-replace is as follows:
sed -E 's/what_you_want_to_replace/what_you_want_to_replace_with/' input_file.txt > output_file.txt
Just in case you're still skeptical, we'll apply the same change on our small countries.txt file.
sed -E 's/United States/USA/' countries.txt
Canada Italy Australia USA England France
The initial s is necessary to indicate the search-and-replace (substitute) command within sed. There are other commands that we won't see today, such as insert (i) and delete (d). The slashes are used to delimit the what_you_want_to_replace from the what_you_want_to_replace_with. The delimiter can actually be any character you want, as long as you're consistent.
For example, you can use colons (:) instead.
sed -E 's:United States:USA:' countries.txt
Canada Italy Australia USA England France
The character you use is not that important. One thing to consider is that if the character you choose appears in the regex, you will need to escape it with backslashes. That's why I generally stick with slashes as my delimiter in sed commands, unless I'm dealing with file paths as my input text (which commonly include slashes), in which case I will switch to colons or vertical bars.
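To see why, here is the same path substitution written both ways (the paths are made up for illustration):

```shell
# with slashes as the delimiter, every slash in the path must be escaped
echo '/usr/local/bin/tool' | sed -E 's/\/usr\/local/\/opt/'   # /opt/bin/tool
# with colons as the delimiter, the paths can be written as-is
echo '/usr/local/bin/tool' | sed -E 's:/usr/local:/opt:'      # /opt/bin/tool
```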
Let's move on to a slightly more complex change. We are going to replace every period (.) with a comma (,), as if we wanted to send our data to a collaborator in France, where commas are used instead of periods in decimal numbers.

There is an important thing we need to handle: there might be multiple periods per line. By default, sed will only replace the first instance of a pattern on each line. If we want to replace every instance, we'll need to enable global mode by adding a g at the end of the sed command.
N.B. Recall that the period in regex has special meaning and matches any character. If we want to match an actual period, we need to escape it using a backslash.
sed -E 's/\./,/g' gapminder.tsv > gapminder.comma.tsv
head gapminder.comma.tsv
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28,801 8425333 779,4453145 Afghanistan Asia 1957 30,332 9240934 820,8530296 Afghanistan Asia 1962 31,997 10267083 853,10071 Afghanistan Asia 1967 34,02 11537966 836,1971382 Afghanistan Asia 1972 36,088 13079460 739,9811058 Afghanistan Asia 1977 38,438 14880372 786,11336 Afghanistan Asia 1982 39,854 12881816 978,0114388 Afghanistan Asia 1987 40,822 13867957 852,3959448 Afghanistan Asia 1992 41,674 16317921 649,3413952
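The effect of g is easiest to see side by side on a single line containing two decimal numbers (the values are taken from the first data row):

```shell
# without g, only the first period on the line is replaced
echo '28.801 779.445' | sed -E 's/\./,/'    # 28,801 779.445
# with g, every period on the line is replaced
echo '28.801 779.445' | sed -E 's/\./,/g'   # 28,801 779,445
```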
You can easily chain multiple search-and-replace commands by using the -e option.
sed -E -e 's/United States/USA/' -e 's/\./,/g' gapminder.tsv > gapminder.usa_and_comma.tsv
egrep 'country|USA' gapminder.usa_and_comma.tsv | head
country continent year lifeExp pop gdpPercap USA Americas 1952 68,44 157553000 13990,48208 USA Americas 1957 69,49 171984000 14847,12712 USA Americas 1962 70,21 186538000 16173,14586 USA Americas 1967 70,76 198712000 19530,36557 USA Americas 1972 71,34 209896000 21806,03594 USA Americas 1977 73,38 220239000 24072,63213 USA Americas 1982 74,65 232187835 25009,55914 USA Americas 1987 75,02 242803533 29884,35041 USA Americas 1992 76,09 256894189 32003,93224
Perhaps one of the most powerful features of sed and regex when doing search-and-replace is backreferences. They allow you to search for something and replace it with something that includes what was originally matched. I think the best way to explain this is to demonstrate backreferences in action. Our contrived example is to match the country name at the beginning of each line and duplicate it.
sed -E 's/^([^\t]+)/\1_\1/' gapminder.tsv > gapminder.double_country.tsv
head gapminder.double_country.tsv
country_country continent year lifeExp pop gdpPercap Afghanistan_Afghanistan Asia 1952 28.801 8425333 779.4453145 Afghanistan_Afghanistan Asia 1957 30.332 9240934 820.8530296 Afghanistan_Afghanistan Asia 1962 31.997 10267083 853.10071 Afghanistan_Afghanistan Asia 1967 34.02 11537966 836.1971382 Afghanistan_Afghanistan Asia 1972 36.088 13079460 739.9811058 Afghanistan_Afghanistan Asia 1977 38.438 14880372 786.11336 Afghanistan_Afghanistan Asia 1982 39.854 12881816 978.0114388 Afghanistan_Afghanistan Asia 1987 40.822 13867957 852.3959448 Afghanistan_Afghanistan Asia 1992 41.674 16317921 649.3413952
Use backreferences to get rid of all decimal digits. Don't worry about rounding up or down; just take the floor of the number.
The last tool we will cover today is awk. This tool combines the features of grep and sed and makes them more useful in the context of tabular data, such as our gapminder.tsv file consisting of six tab-delimited columns.
head gapminder.tsv
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28.801 8425333 779.4453145 Afghanistan Asia 1957 30.332 9240934 820.8530296 Afghanistan Asia 1962 31.997 10267083 853.10071 Afghanistan Asia 1967 34.02 11537966 836.1971382 Afghanistan Asia 1972 36.088 13079460 739.9811058 Afghanistan Asia 1977 38.438 14880372 786.11336 Afghanistan Asia 1982 39.854 12881816 978.0114388 Afghanistan Asia 1987 40.822 13867957 852.3959448 Afghanistan Asia 1992 41.674 16317921 649.3413952
The first thing you need to configure in awk is the field separator (FS), which is what separates the columns in each line. Typically, we use comma- or tab-delimited files; in this case, gapminder.tsv uses tabs. We also configure the output field separator (OFS) to be the same character. Notice that we use single quotes again to avoid unintended issues down the line.
awk 'BEGIN {FS=OFS="\t"}' gapminder.tsv
The BEGIN {} block contains awk commands that are run once at the beginning. Here, we only need to set the input and output field separators once. Because there are no commands that follow BEGIN {}, awk doesn't do anything. If we want to print lines, we can use print $0, where $0 refers to the entire line (all columns).
awk 'BEGIN {FS=OFS="\t"} {print $0}' gapminder.tsv | head
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28.801 8425333 779.4453145 Afghanistan Asia 1957 30.332 9240934 820.8530296 Afghanistan Asia 1962 31.997 10267083 853.10071 Afghanistan Asia 1967 34.02 11537966 836.1971382 Afghanistan Asia 1972 36.088 13079460 739.9811058 Afghanistan Asia 1977 38.438 14880372 786.11336 Afghanistan Asia 1982 39.854 12881816 978.0114388 Afghanistan Asia 1987 40.822 13867957 852.3959448 Afghanistan Asia 1992 41.674 16317921 649.3413952
Admittedly, this isn't very useful. You can refer to the first, second, third, etc. columns using $1, $2, $3, etc. So, if we want to print the country name, the year and the population, we can use awk as follows.
awk 'BEGIN {FS=OFS="\t"} {print $1, $3, $5}' gapminder.tsv | head
country year pop Afghanistan 1952 8425333 Afghanistan 1957 9240934 Afghanistan 1962 10267083 Afghanistan 1967 11537966 Afghanistan 1972 13079460 Afghanistan 1977 14880372 Afghanistan 1982 12881816 Afghanistan 1987 13867957 Afghanistan 1992 16317921
Again, this isn't very useful, because we can achieve the same effect using cut in Bash with much less typing.
cut -f1,3,5 gapminder.tsv | head
country year pop Afghanistan 1952 8425333 Afghanistan 1957 9240934 Afghanistan 1962 10267083 Afghanistan 1967 11537966 Afghanistan 1972 13079460 Afghanistan 1977 14880372 Afghanistan 1982 12881816 Afghanistan 1987 13867957 Afghanistan 1992 16317921
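That said, one thing awk can do that cut cannot is reorder columns on output. A quick sketch on an inline row (the values mirror the Canada 1952 row):

```shell
# cut always emits fields in their original order; awk prints them
# in whatever order you ask for
printf 'Canada\t1952\t14785584\n' | awk 'BEGIN {FS=OFS="\t"} {print $3, $1}'
```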
Things start getting interesting once you filter on specific columns or manipulate text within them. For instance, let's revisit our earlier task of filtering for rows that pertain to 1977. This can be done accurately by simply checking whether column 3 is equal to 1977. In this case, we don't have to worry about the digits "1977" appearing in other columns, such as the population.
awk 'BEGIN {FS=OFS="\t"} $3 == 1977 {print $0}' gapminder.tsv | head
Afghanistan Asia 1977 38.438 14880372 786.11336 Albania Europe 1977 68.93 2509048 3533.00391 Algeria Africa 1977 58.014 17152804 4910.416756 Angola Africa 1977 39.483 6162675 3008.647355 Argentina Americas 1977 68.481 26983828 10079.02674 Australia Oceania 1977 73.49 14074100 18334.19751 Austria Europe 1977 72.17 7568430 19749.4223 Bahrain Asia 1977 65.593 297410 19340.10196 Bangladesh Asia 1977 46.923 80428306 659.8772322 Belgium Europe 1977 72.8 9821800 19117.97448
Note that the {print $0} is actually optional when we specify a condition for filtering lines.
awk 'BEGIN {FS=OFS="\t"} $3 == 1977' gapminder.tsv | head
Afghanistan Asia 1977 38.438 14880372 786.11336 Albania Europe 1977 68.93 2509048 3533.00391 Algeria Africa 1977 58.014 17152804 4910.416756 Angola Africa 1977 39.483 6162675 3008.647355 Argentina Americas 1977 68.481 26983828 10079.02674 Australia Oceania 1977 73.49 14074100 18334.19751 Austria Europe 1977 72.17 7568430 19749.4223 Bahrain Asia 1977 65.593 297410 19340.10196 Bangladesh Asia 1977 46.923 80428306 659.8772322 Belgium Europe 1977 72.8 9821800 19117.97448
We can also combine multiple conditions using &&. Here, we will reproduce our earlier command in awk, filtering for 1977 data for countries whose names start with "S".
awk 'BEGIN {FS=OFS="\t"} $3 == 1977 && $1 ~ /^S/' gapminder.tsv | head
Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439
We now face a similar issue to before: the header is missing. We can address this in multiple ways; here, we will reuse our earlier approach of matching country in the first column.
awk 'BEGIN {FS=OFS="\t"} $3 == 1977 && $1 ~ /^S/ || $1 == "country"' gapminder.tsv | head
country continent year lifeExp pop gdpPercap Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513
In general, the structure of awk commands (within the single quotes) is as follows:
awk 'BEGIN {FS=OFS="\t"} CONDITION {ACTION} CONDITION {ACTION} {ACTION}' input.tsv > output.tsv
You can think of an awk command as a series of conditions and actions, where each action runs only if its preceding condition is true. In fact, BEGIN is a condition that is only true at the beginning of the file; hence, {FS=OFS="\t"} only gets run once at the outset. Any action that isn't preceded by a condition (like the last {ACTION} in the example command above) will run for every line.
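Besides BEGIN, awk also provides an END condition that is only true after the last line, which is handy for tallies. A sketch with two condition-action pairs on inline data (the rows here are made up, with a name in column 1 and a continent in column 2):

```shell
# print the name for Europe rows, count Asia rows along the way,
# and report the tally once at the end (END runs after the last line)
printf 'A\tAsia\nB\tEurope\nC\tAsia\n' |
  awk 'BEGIN {FS=OFS="\t"} $2 == "Asia" {n++} $2 == "Europe" {print $1} END {print "Asia rows: " n}'
```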
Tackle Challenge Question 3, but this time using awk. You should be able to simplify your approach.

Hint: You no longer need to know which continents appear in the file.
The header is missing from the output because the .*\b1977\b in the pattern requires that all lines (i.e. those starting with country or S) contain 1977. The solution is to move the .*\b1977\b inside the parentheses such that it only applies to lines starting with S.
egrep '^(country|S.*\b1977\b)' gapminder.tsv
country continent year lifeExp pop gdpPercap Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439 Spain Europe 1977 74.39 36439000 13236.92117 Sri Lanka Asia 1977 65.949 14116836 1348.775651 Sudan Africa 1977 47.8 17104986 2202.988423 Swaziland Africa 1977 52.537 551425 3781.410618 Sweden Europe 1977 75.44 8251648 18855.72521 Switzerland Europe 1977 75.39 6316424 26982.29052 Syria Asia 1977 61.195 7932503 3195.484582
if grep -v 'Africa' gapminder.tsv | grep -q '^Z'; then
echo "There is a country that starts with Z outside of Africa"
else
echo "There is no country that starts with Z outside of Africa"
fi
There is no country that starts with Z outside of Africa
sed -E 's/Africa|America|Asia|Europe|Oceania/Pangaea/' gapminder.tsv > gapminder.pangaea.tsv
head gapminder.pangaea.tsv
country continent year lifeExp pop gdpPercap Afghanistan Pangaea 1952 28.801 8425333 779.4453145 Afghanistan Pangaea 1957 30.332 9240934 820.8530296 Afghanistan Pangaea 1962 31.997 10267083 853.10071 Afghanistan Pangaea 1967 34.02 11537966 836.1971382 Afghanistan Pangaea 1972 36.088 13079460 739.9811058 Afghanistan Pangaea 1977 38.438 14880372 786.11336 Afghanistan Pangaea 1982 39.854 12881816 978.0114388 Afghanistan Pangaea 1987 40.822 13867957 852.3959448 Afghanistan Pangaea 1992 41.674 16317921 649.3413952
sed -E 's/([0-9]+)\.[0-9]+/\1/g' gapminder.tsv > gapminder.no_decimal.tsv
head gapminder.no_decimal.tsv
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28 8425333 779 Afghanistan Asia 1957 30 9240934 820 Afghanistan Asia 1962 31 10267083 853 Afghanistan Asia 1967 34 11537966 836 Afghanistan Asia 1972 36 13079460 739 Afghanistan Asia 1977 38 14880372 786 Afghanistan Asia 1982 39 12881816 978 Afghanistan Asia 1987 40 13867957 852 Afghanistan Asia 1992 41 16317921 649
awk 'BEGIN {FS=OFS="\t"} $1 != "country" {$2 = "Pangaea"} {print $0}' gapminder.tsv > gapminder.pangaea.2.tsv
head gapminder.pangaea.2.tsv
country continent year lifeExp pop gdpPercap Afghanistan Pangaea 1952 28.801 8425333 779.4453145 Afghanistan Pangaea 1957 30.332 9240934 820.8530296 Afghanistan Pangaea 1962 31.997 10267083 853.10071 Afghanistan Pangaea 1967 34.02 11537966 836.1971382 Afghanistan Pangaea 1972 36.088 13079460 739.9811058 Afghanistan Pangaea 1977 38.438 14880372 786.11336 Afghanistan Pangaea 1982 39.854 12881816 978.0114388 Afghanistan Pangaea 1987 40.822 13867957 852.3959448 Afghanistan Pangaea 1992 41.674 16317921 649.3413952
The above solution works fine, but you can make it a bit simpler (assuming your header is on the first line). NR in awk refers to the current line (record) number. Here, we are changing the second column for every line with a line number greater than 1 (i.e. any non-header line).
awk 'BEGIN {FS=OFS="\t"} NR > 1 {$2 = "Pangaea"} {print $0}' gapminder.tsv > gapminder.pangaea.3.tsv
head gapminder.pangaea.3.tsv
country continent year lifeExp pop gdpPercap Afghanistan Pangaea 1952 28.801 8425333 779.4453145 Afghanistan Pangaea 1957 30.332 9240934 820.8530296 Afghanistan Pangaea 1962 31.997 10267083 853.10071 Afghanistan Pangaea 1967 34.02 11537966 836.1971382 Afghanistan Pangaea 1972 36.088 13079460 739.9811058 Afghanistan Pangaea 1977 38.438 14880372 786.11336 Afghanistan Pangaea 1982 39.854 12881816 978.0114388 Afghanistan Pangaea 1987 40.822 13867957 852.3959448 Afghanistan Pangaea 1992 41.674 16317921 649.3413952