In this lesson, we will cover a few applications of regular expressions (or regex) that I use all the time. Regex are available in most programming languages, but to keep this lesson accessible to as many people as possible, we will focus on applications at the Bash shell. Specifically, we will cover how you can use grep, sed and awk to get a lot done without firing up a script, especially with the power of regex at your side.
Regular expressions are an extremely powerful tool for pattern matching. You might not realize it, but a lot of what we do is pattern matching, especially if you deal with text at all. The ability to describe a flexible pattern that the computer can then quickly look for in some arbitrary text opens up a world of possibilities. This lesson will focus on some of these possibilities. Notably, we will cover:
grep
sed
awk
These three tools alone justify learning Bash to make your life easier. In combination with regex, they are life-savers!
We will be using Jenny Bryan's cleaned-up version of the gapminder dataset. It contains 1704 rows and 6 columns. The dataset consists of the population, life expectancy and GDP per capita for 142 countries every 5 years between 1952 and 2007. You can easily download the data using curl as follows.
curl -sL bit.ly/gapm-data > gapminder.tsv
head gapminder.tsv
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28.801 8425333 779.4453145 Afghanistan Asia 1957 30.332 9240934 820.8530296 Afghanistan Asia 1962 31.997 10267083 853.10071 Afghanistan Asia 1967 34.02 11537966 836.1971382 Afghanistan Asia 1972 36.088 13079460 739.9811058 Afghanistan Asia 1977 38.438 14880372 786.11336 Afghanistan Asia 1982 39.854 12881816 978.0114388 Afghanistan Asia 1987 40.822 13867957 852.3959448 Afghanistan Asia 1992 41.674 16317921 649.3413952
At its simplest, grep can be used to filter lines based on a pattern. We can start with a plain, non-regex pattern. Here, we subset the file to lines that contain the word Canada.
grep Canada gapminder.tsv
Canada Americas 1952 68.75 14785584 11367.16112 Canada Americas 1957 69.96 17010154 12489.95006 Canada Americas 1962 71.3 18985849 13462.48555 Canada Americas 1967 72.13 20819767 16076.58803 Canada Americas 1972 72.88 22284500 18970.57086 Canada Americas 1977 74.21 23796400 22090.88306 Canada Americas 1982 75.76 25201900 22898.79214 Canada Americas 1987 76.86 26549700 26626.51503 Canada Americas 1992 77.95 28523502 26342.88426 Canada Americas 1997 78.61 30305843 28954.92589 Canada Americas 2002 79.77 31902268 33328.96507 Canada Americas 2007 80.653 33390141 36319.23501
You'll notice that the header line was removed, because it doesn't contain Canada. If we want to ensure that this file remains valid, we need to keep the header. There are multiple ways to do this.
First, you can grep the header and the Canada lines separately.
grep country gapminder.tsv
grep Canada gapminder.tsv
country continent year lifeExp pop gdpPercap Canada Americas 1952 68.75 14785584 11367.16112 Canada Americas 1957 69.96 17010154 12489.95006 Canada Americas 1962 71.3 18985849 13462.48555 Canada Americas 1967 72.13 20819767 16076.58803 Canada Americas 1972 72.88 22284500 18970.57086 Canada Americas 1977 74.21 23796400 22090.88306 Canada Americas 1982 75.76 25201900 22898.79214 Canada Americas 1987 76.86 26549700 26626.51503 Canada Americas 1992 77.95 28523502 26342.88426 Canada Americas 1997 78.61 30305843 28954.92589 Canada Americas 2002 79.77 31902268 33328.96507 Canada Americas 2007 80.653 33390141 36319.23501
However, you will notice that we are repeating ourselves (the grep command and the gapminder.tsv file name). Ideally, we want to follow the DRY (don't repeat yourself) principle.
N.B. An astute reader will notice that I can extract the header using head -1. Indeed, this would work here, but I am familiar with file formats (e.g. VCF, the variant call format) where the header is neither the first line nor a predictable number of lines into the file. In these cases, grep is more general.
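As an aside on DRY, grep itself can take several patterns at once via repeated -e options, so the header and the country match can live in a single command. A minimal sketch on a throwaway file (demo.tsv is a made-up name created here for illustration):

```shell
# grep accepts multiple patterns, one per -e option; a line matching
# any of them is printed (demo.tsv is a throwaway file for this demo)
printf 'country\tcontinent\nCanada\tAmericas\nItaly\tEurope\n' > demo.tsv
grep -e 'country' -e 'Canada' demo.tsv
```

This prints the header line and the Canada line, skipping Italy, without typing grep or the file name twice.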
The second approach involves the use of regex. In fact, grep stands for "globally search for a regular expression and print". However, because there have been multiple versions of regex over the years and we are used to the more modern versions, we will need to use a variant of grep that enables extended regex. You can either use grep -E or egrep. I will be using the latter.
Here, we can start using regex by using the | operator, which matches what's on the left or on the right. Whenever you use regular expressions, it is safer to quote the pattern using single quotes.
egrep 'country|Canada' gapminder.tsv
country continent year lifeExp pop gdpPercap Canada Americas 1952 68.75 14785584 11367.16112 Canada Americas 1957 69.96 17010154 12489.95006 Canada Americas 1962 71.3 18985849 13462.48555 Canada Americas 1967 72.13 20819767 16076.58803 Canada Americas 1972 72.88 22284500 18970.57086 Canada Americas 1977 74.21 23796400 22090.88306 Canada Americas 1982 75.76 25201900 22898.79214 Canada Americas 1987 76.86 26549700 26626.51503 Canada Americas 1992 77.95 28523502 26342.88426 Canada Americas 1997 78.61 30305843 28954.92589 Canada Americas 2002 79.77 31902268 33328.96507 Canada Americas 2007 80.653 33390141 36319.23501
If we wanted to include the US in our results, it's as simple as adding another | operator in our pattern.
egrep 'country|Canada|United States' gapminder.tsv
country continent year lifeExp pop gdpPercap Canada Americas 1952 68.75 14785584 11367.16112 Canada Americas 1957 69.96 17010154 12489.95006 Canada Americas 1962 71.3 18985849 13462.48555 Canada Americas 1967 72.13 20819767 16076.58803 Canada Americas 1972 72.88 22284500 18970.57086 Canada Americas 1977 74.21 23796400 22090.88306 Canada Americas 1982 75.76 25201900 22898.79214 Canada Americas 1987 76.86 26549700 26626.51503 Canada Americas 1992 77.95 28523502 26342.88426 Canada Americas 1997 78.61 30305843 28954.92589 Canada Americas 2002 79.77 31902268 33328.96507 Canada Americas 2007 80.653 33390141 36319.23501 United States Americas 1952 68.44 157553000 13990.48208 United States Americas 1957 69.49 171984000 14847.12712 United States Americas 1962 70.21 186538000 16173.14586 United States Americas 1967 70.76 198712000 19530.36557 United States Americas 1972 71.34 209896000 21806.03594 United States Americas 1977 73.38 220239000 24072.63213 United States Americas 1982 74.65 232187835 25009.55914 United States Americas 1987 75.02 242803533 29884.35041 United States Americas 1992 76.09 256894189 32003.93224 United States Americas 1997 76.81 272911760 35767.43303 United States Americas 2002 77.31 287675526 39097.09955 United States Americas 2007 78.242 301139947 42951.65309
grep can be as flexible as you need it to be. While it may be contrived, let's say we are interested in the data from 1977 for countries whose names start with S (and we want to keep the header). As always, there are multiple ways of approaching this problem.
First, we can use UNIX pipes to perform subsequent filters: one for the countries starting with S and another for the rows corresponding to 1977.
egrep '^(country|S)' gapminder.tsv | egrep '1977'
Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1967 67.946 1977600 4977.41854 Singapore Asia 1977 70.795 2325300 11210.08948 Singapore Asia 2002 78.77 4197776 36023.1054 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439 Spain Europe 1977 74.39 36439000 13236.92117 Sri Lanka Asia 1977 65.949 14116836 1348.775651 Sudan Africa 1977 47.8 17104986 2202.988423 Swaziland Africa 1977 52.537 551425 3781.410618 Sweden Europe 1977 75.44 8251648 18855.72521 Switzerland Europe 1977 75.39 6316424 26982.29052 Syria Asia 1977 61.195 7932503 3195.484582
Some of you might have noticed that Singapore comes up three times, while every country is supposed to show up at most once. Upon closer inspection, you can see why this is happening: the 1977 pattern is matching within other numbers on the line, such as the population, which is not something we want.
The immediate solution to this is to prevent matches of the year within other numbers. In regex, you can specify that word boundaries must be present before and after the number using \b.
egrep '^(country|S)' gapminder.tsv | egrep '\b1977\b'
Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439 Spain Europe 1977 74.39 36439000 13236.92117 Sri Lanka Asia 1977 65.949 14116836 1348.775651 Sudan Africa 1977 47.8 17104986 2202.988423 Swaziland Africa 1977 52.537 551425 3781.410618 Sweden Europe 1977 75.44 8251648 18855.72521 Switzerland Europe 1977 75.39 6316424 26982.29052 Syria Asia 1977 61.195 7932503 3195.484582
Indeed, this solves our problem, but we lost the header again. To get it back, we need to include the country pattern in both commands, which is slightly repetitive.
N.B. Our current solution to filtering for observations made in 1977 is imperfect, because we are filtering on the presence of 1977 anywhere in the line. Technically, if a country had a population or GDP per capita of 1977 at some point, this would be included in the output. Later, we will see how we can use awk to apply regex on specific columns.
egrep '^(country|S)' gapminder.tsv | egrep '(country|\b1977\b)'
country continent year lifeExp pop gdpPercap Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439 Spain Europe 1977 74.39 36439000 13236.92117 Sri Lanka Asia 1977 65.949 14116836 1348.775651 Sudan Africa 1977 47.8 17104986 2202.988423 Swaziland Africa 1977 52.537 551425 3781.410618 Sweden Europe 1977 75.44 8251648 18855.72521 Switzerland Europe 1977 75.39 6316424 26982.29052 Syria Asia 1977 61.195 7932503 3195.484582
Second, we can combine our patterns into one regex. Admittedly, there is no compelling advantage in doing so other than avoiding needless commands wherever possible. For this, we need to acknowledge that the year will always come after the country name, separated by some number of characters. We can specify "any number of any characters" in regex using .*.
egrep '^(country|S).*\b1977\b' gapminder.tsv
Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439 Spain Europe 1977 74.39 36439000 13236.92117 Sri Lanka Asia 1977 65.949 14116836 1348.775651 Sudan Africa 1977 47.8 17104986 2202.988423 Swaziland Africa 1977 52.537 551425 3781.410618 Sweden Europe 1977 75.44 8251648 18855.72521 Switzerland Europe 1977 75.39 6316424 26982.29052 Syria Asia 1977 61.195 7932503 3195.484582
Let's create a file with a list of countries of interest for the purposes of this demo.
echo -e 'Canada\nItaly\nAustralia\nUnited States\nEngland\nFrance' > countries.txt
cat countries.txt
Canada Italy Australia United States England France
Given this list, we can easily filter the gapminder dataset for observations made for these countries.
egrep -f countries.txt gapminder.tsv | head
Australia Oceania 1952 69.12 8691212 10039.59564 Australia Oceania 1957 70.33 9712569 10949.64959 Australia Oceania 1962 70.93 10794968 12217.22686 Australia Oceania 1967 71.1 11872264 14526.12465 Australia Oceania 1972 71.93 13177000 16788.62948 Australia Oceania 1977 73.49 14074100 18334.19751 Australia Oceania 1982 74.74 15184200 19477.00928 Australia Oceania 1987 76.32 16257249 21888.88903 Australia Oceania 1992 77.56 17481977 23424.76683 Australia Oceania 1997 78.83 18565243 26997.93657
So far, we've seen how we can use grep to subset the lines in a file according to a certain pattern. Another useful feature of grep is its quiet mode, which can be used in conjunction with Bash conditional expressions.
First, let's review Bash if statements.
if [[ 1 > 2 ]]; then
echo 'true'
else
echo 'false'
fi
false
Here, [[ 1 > 2 ]] is actually a command that evaluates the expression inside the double square brackets (note that, inside [[ ]], the > operator compares strings lexicographically; for numeric comparison, use -gt). This portion of the if statement in Bash can be any command. A command evaluates as true if its exit code is zero (i.e. the command was successful). Otherwise, it's considered false.
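The difference between the string and numeric comparison operators is worth a quick sketch (this uses the bash-specific [[ ]] syntax, as in the example above):

```shell
# -gt compares numbers: 10 is greater than 2, so this prints
if [[ 10 -gt 2 ]]; then echo 'numeric: 10 -gt 2 is true'; fi
# > compares strings: "10" sorts before "2" lexicographically,
# so this takes the else branch
if [[ 10 > 2 ]]; then echo 'string: true'; else echo 'string: 10 > 2 is false'; fi
```

This is why the original [[ 1 > 2 ]] example happens to give the expected answer: "1" also sorts before "2" as a string.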
To show the exit-code behavior, I will run the commands true and false, which respectively return exit codes 0 and 1.
N.B. $? is a useful variable that contains the exit code of the most recently run command.
if true; then
echo 'Exit code: ' $?
echo 'Considered true'
else
echo 'Exit code: ' $?
echo 'Considered false'
fi
Exit code: 0 Considered true
if false; then
echo 'Exit code: ' $?
echo 'Considered true'
else
echo 'Exit code: ' $?
echo 'Considered false'
fi
Exit code: 1 Considered false
Now, let's say you're interested in running a command only if a file contains some pattern. You can use grep in quiet mode inside an if statement, as follows.
if egrep -q 'Canada' countries.txt; then
echo 'Canada is in the countries.txt file :D'
else
echo 'Canada is not in the countries.txt file :('
fi
Canada is in the countries.txt file :D
if egrep -q 'Switzerland' countries.txt; then
echo 'Switzerland is in the countries.txt file :D'
else
echo 'Switzerland is not in the countries.txt file :('
fi
Switzerland is not in the countries.txt file :(
Here, we are just echoing some text, but you can do whatever you want once you know a file matches a pattern. A nice thing about quiet mode is that grep stops searching as soon as it encounters the first instance of the pattern.
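You can also inspect the exit codes directly, without an if statement; since grep -q prints nothing, the exit code (via $?) is its only result. A small sketch on inline text:

```shell
# grep -q suppresses all output; success or failure is reported
# solely through the exit code
printf 'Canada\nItaly\n' | grep -q 'Canada'; echo "match: $?"      # match: 0
printf 'Canada\nItaly\n' | grep -q 'Peru'; echo "no match: $?"     # no match: 1
```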
If you want to count how many lines in a file match a pattern, you can certainly pipe the output of grep to wc -l. You can be slightly more efficient by avoiding the extra command and using the -c option of grep.
grep 'Canada' gapminder.tsv | wc -l
12
grep -c 'Canada' gapminder.tsv
12
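Strictly speaking, both of these count matching lines, not individual matches: a line containing the pattern twice still counts once. If you need every occurrence, one common trick is -o, which prints each match on its own line, combined with wc -l. A sketch on inline text:

```shell
# the first line contains 'aa' twice, but counts as a single matching line
printf 'aa bb aa\ncc\n' | grep -c 'aa'          # 1 matching line
printf 'aa bb aa\ncc\n' | grep -o 'aa' | wc -l  # 2 occurrences
```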
Lastly, for all of the above grep commands, you can invert the search using the -v option. In other words, if you want all lines except for those containing "Canada" or "United States", you can simply do the following:
cat countries.txt
Canada Italy Australia United States England France
egrep -v 'Canada|United States' countries.txt
Italy Australia England France
Write an if statement in Bash that checks if there are any countries that start with the letter "Z" outside of Africa, and echoes the response accordingly.
So far, we've seen grep's amazing ability to subset lines in a file according to a pattern, which can be as complex as you can conjure. Now, we're going to introduce sed, which is probably best known for its ability to perform search-and-replace really easily at the command line.
Let's remind ourselves of what's in our gapminder.tsv file.
head gapminder.tsv
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28.801 8425333 779.4453145 Afghanistan Asia 1957 30.332 9240934 820.8530296 Afghanistan Asia 1962 31.997 10267083 853.10071 Afghanistan Asia 1967 34.02 11537966 836.1971382 Afghanistan Asia 1972 36.088 13079460 739.9811058 Afghanistan Asia 1977 38.438 14880372 786.11336 Afghanistan Asia 1982 39.854 12881816 978.0114388 Afghanistan Asia 1987 40.822 13867957 852.3959448 Afghanistan Asia 1992 41.674 16317921 649.3413952
To start off with a simple example to examine the structure of a sed command, we are going to replace every instance of "United States" with "USA". Here, we will count instances of each term before and after we apply sed to confirm the change.
In general, we need to ensure that modern regular expressions are enabled in sed. Unfortunately, this option varies based on your platform. It's -E in BSD sed (the default on Macs) and -r in GNU sed (the default on Linux), although recent versions of GNU sed also accept -E.
sed -E 's/United States/USA/' gapminder.tsv > gapminder.usa.tsv
echo 'Before sed'
grep -c 'United States' gapminder.tsv
grep -c 'USA' gapminder.tsv
echo 'After sed'
grep -c 'United States' gapminder.usa.tsv
grep -c 'USA' gapminder.usa.tsv
Before sed 12 0 After sed 0 12
As you can see, the search-and-replace worked. The general form of a sed search-and-replace is as follows:
sed -E 's/what_you_want_to_replace/what_you_want_to_replace_with/' input_file.txt > output_file.txt
Just in case you're still skeptical, we'll apply the same change on our small countries.txt file.
sed -E 's/United States/USA/' countries.txt
Canada Italy Australia USA England France
The initial s is necessary to indicate the search-and-replace (substitute) command within sed. There are other commands that we won't see today, such as insert (i) and delete (d). The slashes are used to delimit the what_you_want_to_replace from the what_you_want_to_replace_with. The delimiter can actually be any character you want, as long as you're consistent.
For example, you can use colons (:) instead.
sed -E 's:United States:USA:' countries.txt
Canada Italy Australia USA England France
The character you use is not that important. One thing to consider is that if the character you choose appears in the regex, you will need to escape it with backslashes. That's why I generally stick with slashes as my delimiter in sed commands, unless I'm dealing with file paths as my input text (which commonly include slashes), in which case I will switch to colons or vertical bars.
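To see why, here is the same path substitution written both ways (the paths are made up for illustration):

```shell
# with slashes as the delimiter, every slash in the path must be escaped
echo '/usr/local/bin/tool' | sed -E 's/\/usr\/local/\/opt/'   # /opt/bin/tool
# with colons as the delimiter, the paths can be written as-is
echo '/usr/local/bin/tool' | sed -E 's:/usr/local:/opt:'      # /opt/bin/tool
```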
Let's move on to a slightly more complex change. We are going to replace every period (.) with a comma (,), as if we wanted to send our data to a collaborator in France, where commas are used instead of periods in decimal numbers.

There is an important thing we need to handle: there might be multiple periods per line. By default, sed will only replace the first instance of a pattern on each line. If we want to replace every instance, we'll need to enable global mode by adding a g at the end of the sed command.
N.B. Recall that the period in regex has special meaning and matches any character. If we want to match an actual period, we need to escape it using a backslash.
sed -E 's/\./,/g' gapminder.tsv > gapminder.comma.tsv
head gapminder.comma.tsv
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28,801 8425333 779,4453145 Afghanistan Asia 1957 30,332 9240934 820,8530296 Afghanistan Asia 1962 31,997 10267083 853,10071 Afghanistan Asia 1967 34,02 11537966 836,1971382 Afghanistan Asia 1972 36,088 13079460 739,9811058 Afghanistan Asia 1977 38,438 14880372 786,11336 Afghanistan Asia 1982 39,854 12881816 978,0114388 Afghanistan Asia 1987 40,822 13867957 852,3959448 Afghanistan Asia 1992 41,674 16317921 649,3413952
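The effect of g is easiest to see side by side on a single line containing two decimal numbers (the values are taken from the first data row):

```shell
# without g, only the first period on the line is replaced
echo '28.801 779.445' | sed -E 's/\./,/'    # 28,801 779.445
# with g, every period on the line is replaced
echo '28.801 779.445' | sed -E 's/\./,/g'   # 28,801 779,445
```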
You can easily chain multiple search-and-replace commands by using the -e option.
sed -E -e 's/United States/USA/' -e 's/\./,/g' gapminder.tsv > gapminder.usa_and_comma.tsv
egrep 'country|USA' gapminder.usa_and_comma.tsv | head
country continent year lifeExp pop gdpPercap USA Americas 1952 68,44 157553000 13990,48208 USA Americas 1957 69,49 171984000 14847,12712 USA Americas 1962 70,21 186538000 16173,14586 USA Americas 1967 70,76 198712000 19530,36557 USA Americas 1972 71,34 209896000 21806,03594 USA Americas 1977 73,38 220239000 24072,63213 USA Americas 1982 74,65 232187835 25009,55914 USA Americas 1987 75,02 242803533 29884,35041 USA Americas 1992 76,09 256894189 32003,93224
Perhaps one of the most powerful features of sed and regex when doing search-and-replace is backreferences. They allow you to search for something and replace it with something that includes what was originally matched. I think the best way to explain this is to demonstrate backreferences in action. Our contrived example is to match the country name at the beginning of each line and duplicate it.
sed -E 's/^([^\t]+)/\1_\1/' gapminder.tsv > gapminder.double_country.tsv
head gapminder.double_country.tsv
country_country continent year lifeExp pop gdpPercap Afghanistan_Afghanistan Asia 1952 28.801 8425333 779.4453145 Afghanistan_Afghanistan Asia 1957 30.332 9240934 820.8530296 Afghanistan_Afghanistan Asia 1962 31.997 10267083 853.10071 Afghanistan_Afghanistan Asia 1967 34.02 11537966 836.1971382 Afghanistan_Afghanistan Asia 1972 36.088 13079460 739.9811058 Afghanistan_Afghanistan Asia 1977 38.438 14880372 786.11336 Afghanistan_Afghanistan Asia 1982 39.854 12881816 978.0114388 Afghanistan_Afghanistan Asia 1987 40.822 13867957 852.3959448 Afghanistan_Afghanistan Asia 1992 41.674 16317921 649.3413952
Use backreferences to get rid of all decimal digits. Don't worry about rounding up or down; just take the floor of the number.
The last tool we will cover today is awk. This tool combines the features of grep and sed and makes them more useful in the context of tabular data, such as our gapminder.tsv file consisting of six tab-delimited columns.
head gapminder.tsv
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28.801 8425333 779.4453145 Afghanistan Asia 1957 30.332 9240934 820.8530296 Afghanistan Asia 1962 31.997 10267083 853.10071 Afghanistan Asia 1967 34.02 11537966 836.1971382 Afghanistan Asia 1972 36.088 13079460 739.9811058 Afghanistan Asia 1977 38.438 14880372 786.11336 Afghanistan Asia 1982 39.854 12881816 978.0114388 Afghanistan Asia 1987 40.822 13867957 852.3959448 Afghanistan Asia 1992 41.674 16317921 649.3413952
The first thing you need to configure in awk is the field separator (FS), which is what separates the columns in each line. Typically, we use comma- or tab-delimited files; in this case, gapminder.tsv uses tabs. We also configure the output field separator (OFS) to be the same character. Notice that we use single quotes again to avoid unintended issues down the line.
awk 'BEGIN {FS=OFS="\t"}' gapminder.tsv
The BEGIN {} block contains awk commands that are run once at the beginning. Here, we only need to set the input and output field separators once. Because there are no commands that follow BEGIN {}, awk doesn't do anything. If we want to print lines, we can use print $0, where $0 refers to the entire line (all columns).
awk 'BEGIN {FS=OFS="\t"} {print $0}' gapminder.tsv | head
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28.801 8425333 779.4453145 Afghanistan Asia 1957 30.332 9240934 820.8530296 Afghanistan Asia 1962 31.997 10267083 853.10071 Afghanistan Asia 1967 34.02 11537966 836.1971382 Afghanistan Asia 1972 36.088 13079460 739.9811058 Afghanistan Asia 1977 38.438 14880372 786.11336 Afghanistan Asia 1982 39.854 12881816 978.0114388 Afghanistan Asia 1987 40.822 13867957 852.3959448 Afghanistan Asia 1992 41.674 16317921 649.3413952
Admittedly, this isn't very useful. You can refer to the first, second, third, etc. columns using $1, $2, $3, etc. So, if we want to print the country name, the year and the population, we can use awk as follows.
awk 'BEGIN {FS=OFS="\t"} {print $1, $3, $5}' gapminder.tsv | head
country year pop Afghanistan 1952 8425333 Afghanistan 1957 9240934 Afghanistan 1962 10267083 Afghanistan 1967 11537966 Afghanistan 1972 13079460 Afghanistan 1977 14880372 Afghanistan 1982 12881816 Afghanistan 1987 13867957 Afghanistan 1992 16317921
Again, this isn't very useful, because we can achieve the same effect using cut in Bash with much less typing.
cut -f1,3,5 gapminder.tsv | head
country year pop Afghanistan 1952 8425333 Afghanistan 1957 9240934 Afghanistan 1962 10267083 Afghanistan 1967 11537966 Afghanistan 1972 13079460 Afghanistan 1977 14880372 Afghanistan 1982 12881816 Afghanistan 1987 13867957 Afghanistan 1992 16317921
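That said, one thing awk can do that cut cannot is reorder columns on output. A quick sketch on an inline row (the values mirror the Canada 1952 row):

```shell
# cut always emits fields in their original order; awk prints them
# in whatever order you ask for
printf 'Canada\t1952\t14785584\n' | awk 'BEGIN {FS=OFS="\t"} {print $3, $1}'
```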
Things start getting interesting once you filter on specific columns or manipulate text within them. For instance, let's revisit our earlier task of filtering for rows that pertain to 1977. This can be done accurately by simply checking whether column 3 is equal to 1977. In this case, we don't have to worry about the digits "1977" appearing in other columns, such as the population.
awk 'BEGIN {FS=OFS="\t"} $3 == 1977 {print $0}' gapminder.tsv | head
Afghanistan Asia 1977 38.438 14880372 786.11336 Albania Europe 1977 68.93 2509048 3533.00391 Algeria Africa 1977 58.014 17152804 4910.416756 Angola Africa 1977 39.483 6162675 3008.647355 Argentina Americas 1977 68.481 26983828 10079.02674 Australia Oceania 1977 73.49 14074100 18334.19751 Austria Europe 1977 72.17 7568430 19749.4223 Bahrain Asia 1977 65.593 297410 19340.10196 Bangladesh Asia 1977 46.923 80428306 659.8772322 Belgium Europe 1977 72.8 9821800 19117.97448
Note that the {print $0} is actually optional when we specify a condition for filtering lines.
awk 'BEGIN {FS=OFS="\t"} $3 == 1977' gapminder.tsv | head
Afghanistan Asia 1977 38.438 14880372 786.11336 Albania Europe 1977 68.93 2509048 3533.00391 Algeria Africa 1977 58.014 17152804 4910.416756 Angola Africa 1977 39.483 6162675 3008.647355 Argentina Americas 1977 68.481 26983828 10079.02674 Australia Oceania 1977 73.49 14074100 18334.19751 Austria Europe 1977 72.17 7568430 19749.4223 Bahrain Asia 1977 65.593 297410 19340.10196 Bangladesh Asia 1977 46.923 80428306 659.8772322 Belgium Europe 1977 72.8 9821800 19117.97448
We can also combine multiple conditions using &&. Here, we will reproduce our earlier command in awk, filtering for 1977 data for countries whose names start with "S".
awk 'BEGIN {FS=OFS="\t"} $3 == 1977 && $1 ~ /^S/' gapminder.tsv | head
Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439
We now face a similar issue to before: the header is missing. We can address this in multiple ways; here, we will reuse our earlier approach of matching country in the first column.
awk 'BEGIN {FS=OFS="\t"} $3 == 1977 && $1 ~ /^S/ || $1 == "country"' gapminder.tsv | head
country continent year lifeExp pop gdpPercap Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513
In general, the structure of awk commands (within the single quotes) is as follows:
awk 'BEGIN {FS=OFS="\t"} CONDITION {ACTION} CONDITION {ACTION} {ACTION}' input.tsv > output.tsv
You can think of an awk command as a series of conditions and actions, where each action runs only if its preceding condition is true. In fact, BEGIN is a condition that is only true at the beginning of the file; hence, {FS=OFS="\t"} only gets run once at the outset. Any action that isn't preceded by a condition (like the last {ACTION} in the example command above) will run for every line.
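Besides BEGIN, awk also provides an END condition that is only true after the last line, which is handy for tallies. A sketch with two condition-action pairs on inline data (the rows here are made up, with a name in column 1 and a continent in column 2):

```shell
# print the name for Europe rows, count Asia rows along the way,
# and report the tally once at the end (END runs after the last line)
printf 'A\tAsia\nB\tEurope\nC\tAsia\n' |
  awk 'BEGIN {FS=OFS="\t"} $2 == "Asia" {n++} $2 == "Europe" {print $1} END {print "Asia rows: " n}'
```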
Tackle Challenge Question 3, but this time using awk. You should be able to simplify your approach.

Hint: You no longer need to know which continents appear in the file.
The header is missing from the output because the .*\b1977\b in the pattern requires that all lines (i.e. those starting with country or S) contain 1977. The solution is to move the .*\b1977\b inside the parentheses such that it only applies to lines starting with S.
egrep '^(country|S.*\b1977\b)' gapminder.tsv
country continent year lifeExp pop gdpPercap Sao Tome and Principe Africa 1977 58.55 86796 1737.561657 Saudi Arabia Asia 1977 58.69 8128505 34167.7626 Senegal Africa 1977 48.879 5260855 1561.769116 Serbia Europe 1977 70.3 8686367 12980.66956 Sierra Leone Africa 1977 36.788 3140897 1348.285159 Singapore Asia 1977 70.795 2325300 11210.08948 Slovak Republic Europe 1977 70.45 4827803 10922.66404 Slovenia Europe 1977 70.97 1746919 15277.03017 Somalia Africa 1977 41.974 4353666 1450.992513 South Africa Africa 1977 55.527 27129932 8028.651439 Spain Europe 1977 74.39 36439000 13236.92117 Sri Lanka Asia 1977 65.949 14116836 1348.775651 Sudan Africa 1977 47.8 17104986 2202.988423 Swaziland Africa 1977 52.537 551425 3781.410618 Sweden Europe 1977 75.44 8251648 18855.72521 Switzerland Europe 1977 75.39 6316424 26982.29052 Syria Asia 1977 61.195 7932503 3195.484582
if grep -v 'Africa' gapminder.tsv | grep -q '^Z'; then
echo "There is a country that starts with Z outside of Africa"
else
echo "There is no country that starts with Z outside of Africa"
fi
There is no country that starts with Z outside of Africa
sed -E 's/Africa|America|Asia|Europe|Oceania/Pangaea/' gapminder.tsv > gapminder.pangaea.tsv
head gapminder.pangaea.tsv
country continent year lifeExp pop gdpPercap Afghanistan Pangaea 1952 28.801 8425333 779.4453145 Afghanistan Pangaea 1957 30.332 9240934 820.8530296 Afghanistan Pangaea 1962 31.997 10267083 853.10071 Afghanistan Pangaea 1967 34.02 11537966 836.1971382 Afghanistan Pangaea 1972 36.088 13079460 739.9811058 Afghanistan Pangaea 1977 38.438 14880372 786.11336 Afghanistan Pangaea 1982 39.854 12881816 978.0114388 Afghanistan Pangaea 1987 40.822 13867957 852.3959448 Afghanistan Pangaea 1992 41.674 16317921 649.3413952
sed -E 's/([0-9]+)\.[0-9]+/\1/g' gapminder.tsv > gapminder.no_decimal.tsv
head gapminder.no_decimal.tsv
country continent year lifeExp pop gdpPercap Afghanistan Asia 1952 28 8425333 779 Afghanistan Asia 1957 30 9240934 820 Afghanistan Asia 1962 31 10267083 853 Afghanistan Asia 1967 34 11537966 836 Afghanistan Asia 1972 36 13079460 739 Afghanistan Asia 1977 38 14880372 786 Afghanistan Asia 1982 39 12881816 978 Afghanistan Asia 1987 40 13867957 852 Afghanistan Asia 1992 41 16317921 649
awk 'BEGIN {FS=OFS="\t"} $1 != "country" {$2 = "Pangaea"} {print $0}' gapminder.tsv > gapminder.pangaea.2.tsv
head gapminder.pangaea.2.tsv
country continent year lifeExp pop gdpPercap Afghanistan Pangaea 1952 28.801 8425333 779.4453145 Afghanistan Pangaea 1957 30.332 9240934 820.8530296 Afghanistan Pangaea 1962 31.997 10267083 853.10071 Afghanistan Pangaea 1967 34.02 11537966 836.1971382 Afghanistan Pangaea 1972 36.088 13079460 739.9811058 Afghanistan Pangaea 1977 38.438 14880372 786.11336 Afghanistan Pangaea 1982 39.854 12881816 978.0114388 Afghanistan Pangaea 1987 40.822 13867957 852.3959448 Afghanistan Pangaea 1992 41.674 16317921 649.3413952
The above solution works fine, but you can make it a bit simpler (assuming your header is on the first line). NR in awk refers to the current line (record) number. Here, we are changing the second column for every line with a line number greater than 1 (i.e. any non-header line).
awk 'BEGIN {FS=OFS="\t"} NR > 1 {$2 = "Pangaea"} {print $0}' gapminder.tsv > gapminder.pangaea.3.tsv
head gapminder.pangaea.3.tsv
country continent year lifeExp pop gdpPercap Afghanistan Pangaea 1952 28.801 8425333 779.4453145 Afghanistan Pangaea 1957 30.332 9240934 820.8530296 Afghanistan Pangaea 1962 31.997 10267083 853.10071 Afghanistan Pangaea 1967 34.02 11537966 836.1971382 Afghanistan Pangaea 1972 36.088 13079460 739.9811058 Afghanistan Pangaea 1977 38.438 14880372 786.11336 Afghanistan Pangaea 1982 39.854 12881816 978.0114388 Afghanistan Pangaea 1987 40.822 13867957 852.3959448 Afghanistan Pangaea 1992 41.674 16317921 649.3413952