SERHAT ÇEVİKEL
In the first part of this session, we will use the shell to explore and wrangle our data
This facility, known as "Binder", enables us to create "stateless" and "zero dependency" teaching environments which you can access from anywhere on the web. All it takes is a click on a link! And each click starts a new instance of a Docker image in the same state (programs, configurations, options, data, files, material, etc.)
As the Binder FAQ states:
How can mybinder.org be free to use? The Binder project has a grant from the Moore Foundation to sustain the cloud resources running mybinder.org. In the future we hope to see more public BinderHub services running that can form a collection of community resources for interactive cloud computing.
Gordon and Betty Moore Foundation web site says:
Gordon and Betty Moore established the foundation to create positive outcomes for future generations. In pursuit of that vision, we foster path-breaking scientific discovery, environmental conservation, patient care improvements and preservation of the special character of the San Francisco Bay Area.
SO SPECIAL THANKS GO TO GORDON AND BETTY MOORE FOUNDATION FOR PROVIDING EARLY FUNDING IN 2017 FOR SUCH A FACILITY
While it may seem like mybinder.org is a single website, it is in fact a federation of teams that deploy public BinderHubs to serve the community. This page lists the BinderHubs that currently help power mybinder.org.
You can learn more about the funders of the Binderhub Federation from: The BinderHub Federation
Here is a list of the current members of the BinderHub federation:
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.
When you hit:
parallel --citation
The notification goes as:
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication please cite:
@article{Tange2011a,
  title = {GNU Parallel - The Command-Line Power Tool},
  author = {O. Tange},
  address = {Frederiksberg, Denmark},
  journal = {;login: The USENIX Magazine},
  month = {Feb},
  number = {1},
  volume = {36},
  url = {http://www.gnu.org/s/parallel},
  year = {2011},
  pages = {42-47},
  doi = {http://dx.doi.org/10.5281/zenodo.16303}
}
(Feel free to use \nocite{Tange2011a})
This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.
If you send a copy of your published article to tange@gnu.org, it will be
mentioned in the release notes of next version of GNU Parallel.
So don't forget to cite Tange2011a when you use GNU Parallel in an academic study
And you may also want to review the links below to get an understanding of Markdown syntax, to create a better layout and view for your homeworks and/or your own Jupyter notebooks:
We have letters for drives/volumes and each drive has its own root, as such: "C:\", "D:\" ...
We have a single filesystem root "/" and all drives/volumes are "mounted" under this root
Directories are separated by backslashes "\"
Directories are separated by slashes "/". Backslash is an escape character to toggle between literal and special usages of some characters
Directory and filenames are case insensitive: "path" and "PATH" are the same
Directory and filenames are case sensitive: "path" and "PATH" are different and can co-exist
In Windows tradition, directory and filenames often contain spaces, since this does not matter in GUI usage
In Linux tradition, whitespaces are avoided as much as possible because they may cause trouble for shell commands
GUI's are the main interface for interaction with the OS
A true Linux/Unix "hacker" prefers the terminal/CLI/shell (the black screen) for many good reasons!
So:
In Stanley Kubrick's sci-fi classic 2001: A Space Odyssey, an AI interface to the main computer of the spaceship looked and talked like that:
Hopefully the shell we use, Bash, does not refuse our requests arbitrarily, as HAL does to Dave
In an SoS notebook, in order to invoke Bash (send a command to be interpreted with Bash), type the command into a cell with "Bash" selected at the right and hit Enter:
# Hello the commands you type and enter here are interpreted by Bash
# But statements that start with a hash sign "#" are comments not interpreted at all
# So that's a comment that tells about comments!
We can also invoke Bash through a terminal.
In *ubuntu distros, Ctrl+Alt+T is the default shortcut to invoke a terminal window. If you are on an *ubuntu system, try it yourself, and to close it type "exit"
Now let's open a terminal from inside Jupyter/binder by clicking on the plus sign - the new launcher - to the left of the toolbar and click on the black terminal icon with the $ sign:
Now please copy and paste the below command inside that terminal:
screen -S 1
A very powerful utility called "screen" is automatically opened inside the terminal
From now on, we can send commands to the terminal from this Jupyter notebook using the utility called "GNU Screen".
We will come to screen in more detail later; in short, it creates persistent terminal sessions that can be detached, reattached, and sent commands remotely
Note that it may take some time for the screen session inside the terminal to become operational. We will go on and return to the screen terminal in order to send some of our code there for execution ...
And this code gets the appropriate session name so that we can send remote commands to the terminal from here:
(content not important for the time being)
#scr=$(screen -ls 2> /dev/null | grep -Po "^.+(?=\.jupyter)" | head -1 | tr -d "\t")
scr=$(screen -ls | grep -P "Attached" | head -1 | grep -Po "^\t+.+?(?=\s|\t)" | tr -d "\t")
echo $scr
We do not usually memorize all usage details of all commands. We can access this info whenever we want:
To get a one-liner explanation:
whatis ls
To get most verbose info:
man ls
And best of both worlds, info on most common usages:
tldr ls
tldr tldr
Note that tldr is not a standard utility on *nix systems and you have to install it separately (e.g., as an npm package). It is a community project and may not cover many lesser-known utilities/programs
And for screen:
whatis screen
tldr screen
whatis ls
tldr ls
We see that, most common flags are:
Let's list the contents of the root directory, with details and time sorted:
ls -lt /
EXERCISE 1:
Now try yourself: List all the contents of the "usr" directory right under root, including hidden files, and reverse size sorted, one file per line and without details:
SOLUTION 1:
pass1=
encrypt="U2FsdGVkX1+dE49K4/94vw8AdJTGsjV5Bza7+BMfh70="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
whatis echo
tldr echo
">" redirect operator "redirects" the standard input (stdin) from a command into a file (and saves) ">>" appends, ">" overwrites
echo "Hello World" > ~/helloworld
echo "Hello World a second time" > ~/helloworld # this will overwrite the file
echo "Hello World a third time" >> ~/helloworld # this will append the file
ls ~/helloworld
We see the file is there, but how can we see its contents?
Cat "concatenates" or combines and prints the contents of the files provided as arguments
whatis cat
tldr cat
Let's view the main file for configuration, environment and options of the Bash shell for the current user:
cat ~/.bashrc
EXERCISE 2:
View the contents of the helloworld file at the home directory created recently
SOLUTION 2:
pass1=
encrypt="U2FsdGVkX1/+/ZVgeVyUxQ9lO6wDAKaPxNKIkrz8LigmUJPhOkJxvHhiw0/Zi7xw"
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Most of the time, we may need to get the row/word/character/byte count of the output of a previous command
whatis wc
tldr wc
See how many files/directories there are right under the root directory:
ls / | wc -l
We are ok with "ls /" and "wc -l" parts, but what about that strange "|" sign?
Unix pipe "|" is a structure in Unix-like OS's, that redirects the standard output (stdout) of a former command into the standard input (stdin) of a latter command
ls / | cat -n
The first process writes the contents of the root /, and its output is fed into the input of "cat", whose "-n" flag adds line numbers
So the command takes the raw input and adds line numbers. Now let's see how many files there are:
ls / | wc -l
The first process does the same thing; the second counts the lines in its input
Pipe is very powerful and simplifies complex data and workflows!
You'll create killer one-liners that do a lot of things using multiple pipes!!!
In our workflow we may refer to same paths and filenames many times, and some values and parameters may be selected once and referred to in many places
It is better that we define them once in an environment variable and refer by that variable name:
First, some built-ins:
echo $HOME # echoes the value of HOME env variable
echo ~ # tilde is a shorthand for HOME
echo $PATH # echoes the value of PATH env variable
$PATH is the list of locations where the OS looks for the executable file corresponding to a command
For example, when we just type "ls", we do not specify the exact location of the "ls" executable. The exact path of the executable file that the ls command invokes is:
which ls
To list the details of that file:
ls -l `which ls`
The backticks execute the command inside and provide the output as an argument. The same is done with:
ls -l $(which ls)
$(...) notation executes the command inside and substitutes its output in place, to be used as arguments
Apart from /bin/ls, for all ls executables in whole $PATH locations:
which -a ls
datadir=~/data
Now let's check the value (with "$"):
echo $datadir
ls $datadir
Now let's define the path to the directory for "imdb" database, just under datadir:
imdbdir=$datadir/imdb
echo $imdbdir
ls $imdbdir
*IMPORTANT!: FOR THE SUBSEQUENT CODE SNIPPETS TO WORK WELL \$datadir AND \$imdbdir ENVIRONMENT VARIABLES SHOULD BE DECLARED AS ABOVE. IF FOR ANY REASON THE BASH KERNEL IS RESTARTED, THESE VARIABLES MUST BE DECLARED AGAIN BY RUNNING THIS CODE CELL:
datadir=~/data
imdbdir=$datadir/imdb
EXERCISE 3:
Define the "imdbtsv" variable for the path to the tsv directory under $imdbdir, by using $imdbdir, check the value and contents
SOLUTION 3:
pass1=
encrypt="U2FsdGVkX1+eMbt2jVy8BK8OBgzJXwsKuew6aNDkBvUziV/0dZm7RhCnl2YJCX1c YZ0alY4WMRt7ptJUPuimyde+oTl/mPvXJCkF5xXYePE="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Let's view the contents of the tsv directory:
ls -lh $imdbdir/tsv
tsv files are "tab separated values" but here they are gzipped for portability concerns. We will inflate the files but first let's get info on the contents of this database:
Contents:
The database is separated into 6 tsv files:
title.basics.tsv: 365 MB, 4,534,355 lines, 9 fields. Contains the following information for titles:
title.crew.tsv: 133 MB, 4,534,355 lines, 3 fields. Contains the director and writer information for all the titles in IMDb. Fields include:
title.episode.tsv: 72 MB, 2,986,926 lines, 4 fields. Contains the tv episode information. Fields include:
title.principals.tsv: 281 MB, 4,008,569 lines, 2 fields. Contains the principal cast for titles:
title.ratings.tsv: 13 MB, 767,042 lines, 3 fields. Contains the IMDb rating and votes information for titles
name.basics.tsv: 474 MB, 8,155,448 lines, 6 fields. Contains the following information for names:
Note that title.basics.tsv is 365MB while title.basics.tsv.gz is 79MB. That's why large files are gzipped. Now we will gunzip them
We can view the contents of gzipped files:
whatis cat
tldr cat
whatis zcat
tldr zcat
"cat" command lets us print the contents of a text formatted file to terminal, zcat does that for gzipped files. However the files are big and doing so will only show the last part in a terminal, and here in jupyter it will enter into a long running process that will overuse the memory
We will invoke a separete terminal and view the files with a pager called less:
whatis less
tldr less
whatis zless
tldr zless
scr=$(screen -ls | grep -P "Attached" | head -1 | grep -Po "^\t+.+?(?=\s|\t)" | tr -d "\t")
screen -S $scr -X stuff "zless -N $imdbdir/tsv/name.basics.tsv.gz\n"
The -N flag shows line numbers
Now let's play a game in pairs:
My ones are 794034 and 200. Who are they? Hoo-Hah!
"less is more"! You'll love "less" in time. Hit "q" to exit ...
First we will create a separate directory for gunzipped tsv files, just under $imdbdir. However if the command is rerun, it should not raise an error:
whatis mkdir
tldr mkdir
The -p flag makes mkdir create parent directories as needed and do nothing if they already exist:
ls $imdbdir
mkdir -p $imdbdir/tsv2 && echo "$imdbdir/tsv2 created" ## second command executed only if first one is successful
ls $imdbdir
Now let's gunzip tsv.gz files, keeping the original gz files and not overwriting any gunzipped files if the command is run more than once:
yes n | gunzip -k $imdbdir/tsv/*.gz
whatis yes
tldr yes
The first part repeatedly outputs an "n" (for "no") for as long as the second part keeps asking "should I overwrite the existing file?"
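To see what "yes" emits on its own, here is a toy illustration (not part of the workflow), truncated with head so the infinite stream stops:

```shell
# "yes" repeats its argument forever; "head" cuts the stream after 3 lines
yes n | head -3    # prints three lines containing "n"
```

When head exits, yes receives a SIGPIPE and stops, which is exactly how the `yes n | gunzip` trick terminates as well.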
Now copy those gunzipped files into the new directory so that we don't mistakenly modify the original gz files.
whatis cp
tldr cp
cp -n $imdbdir/tsv/*.tsv $imdbdir/tsv2/
Note that "n" flag is "no-clobber" which stands for "do not overwrite existing files"
Now suppose I am fed up with this tsv2 directory and want to get rid of it completely:
whatis rm
tldr rm
We write a short shell function that checks whether the first argument ($1) exists and returns true or false. "-e" means "exists"
exists() if [ -e "$1" ]; then echo "true"; else echo "false"; fi
exists $imdbdir/tsv2
rm -r $imdbdir/tsv2 && echo "$imdbdir/tsv2 deleted"
exists $imdbdir/tsv2
Now rerun the above commands to recreate the tsv2 directory and copy the tsv files. Check the contents of the directory:
mkdir -p $imdbdir/tsv2 && echo "$imdbdir/tsv2 created" ## second command executed only if first one is successful
cp -n $imdbdir/tsv/*.tsv $imdbdir/tsv2/
ls $imdbdir/tsv2
EXERCISE 4:
SOLUTION 4:
pass1=
encrypt="U2FsdGVkX1/vzhgzmriq2SUeRoCJn1hpgV0lP8qU9HIotS6WVfVmnnxMf5w5LgS2 uhJDLm+T+kuLmwDbYRF0CJc1kckMA468XJ4QVNyXzr8TWW+oIMylnPX15OXi3FFz d891bcFNirD8Q0bYiaCYuOrUi20JScwax1Ua9xPZtNOiIwoQHQbmUJ6EsaHlPkAM XdQJg7eZe9+9+YJCwcSswXda2fCs4/NIP2OVEXgHrhNZIgQXYAZ+K6R82EvGvGf2 vFia5CFXlunH20QWvJUkw2Ro7I7lMHa/U2c7UdQzkdyy28AJqFQhWV5J49Am2dOW pEWorE7btQ1gQ8PznqBFq9WfoQOviioxuIRD+XqLqSXO+PMbqmtW5TZfbDvYtDLL 3c71Fsb+AicJ9Fw5WxI+ig=="
solution=$(echo "$encrypt" | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Let's see the initial or last lines of a file:
whatis head
tldr head
cat $imdbdir/tsv2/name.basics.tsv | head -10
But the columns are not aligned
whatis column
tldr column
cat $imdbdir/tsv2/name.basics.tsv | head -10 | column -t
Now I want to get a feel of how tall the tsv files are in terms of row count. For just one file:
cat $imdbdir/tsv2/name.basics.tsv | wc -l
This is hard to read for human beings
whatis numfmt
man numfmt
cat $imdbdir/tsv2/name.basics.tsv | wc -l | numfmt --to=si
But a single file does not suffice; let's do that for all tsv files
ls $imdbdir/tsv2/*
ls $imdbdir/tsv2/* | \
while read l;
do
cat "$l" | wc -l | numfmt --to=si;
done
That's better. We loop through the full paths of the contents of the tsv2 directory, assign each to the variable $l, and run the above pipe for each $l.
Better yet, we can use a for loop for this iteration:
for l in $imdbdir/tsv2/*; \
do
cat "$l" | wc -l | numfmt --to=si;
done
But wouldn't it be better if we also reported the respective filenames?
whatis printf
tldr printf
for l in $imdbdir/tsv2/*; \
do
printf "%s\t" $l;
cat "$l" | wc -l | numfmt --to=si;
done
We can use the more powerful "find" command instead of ls
whatis find
tldr find
find $imdbdir/tsv2 -mindepth 1 | \
while read l;
do
printf "%s\t" $l;
cat $l | wc -l | numfmt --to=si;
done
But we can replace the whole loop with something more terse (or concise), with aligned columns:
whatis xargs
tldr xargs
find $imdbdir/tsv2 -mindepth 1 | \
xargs -i sh -c 'printf "%s\t" {}; cat {} | wc -l | numfmt --to=si' | \
column -t
Or do the loop inside the "-exec" argument of find command:
find $imdbdir/tsv2 -mindepth 1 \
-exec sh -c 'printf "%s\t" {}; cat {} | wc -l | numfmt --to=si' \; | \
column -t
Or instead of serial processing, process multiple lines at the same time utilizing multiple CPU cores:
whatis parallel
tldr parallel
And to support the development of GNU Parallel project, please read:
parallel --citation
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 'printf "%s\t" {}; cat {} | wc -l | numfmt --to=si' | \
column -t
The -j flag sets the number of parallel jobs; -j0 means as many as possible. The -k flag keeps the original order of the inputs
Contrary to sequential single-threaded processing, in parallel processing without the -k flag the order of the output may differ, depending on the completion time of each job
You see there are many ways to do the same thing in Linux
EXERCISE 5:
Find the WORD counts (not line counts) of all gzipped files under the tsv directory, using any of the methods you like above. Check the "-name" option of find; the glob for the files is "*.gz"
IMHO, this is a "big data management" course, and in order to process larger datasets easily utilizing the max power of your processor with full throttle, "parallel" should be your way!
Note that in order to use text processing commands (cat, less etc.) on zipped files, we add the prefix "z" to those commands
SOLUTION 5:
pass1=
encrypt="U2FsdGVkX18fvmKge1qRd1u+3HhdU41PB1axe167aRS527dX9nIx2ckFe3wPmE0E vQ93TpCe1jTphNkfZ+qyzB1oiJvxdVUclalfFlQbzhy42p1pN46/DMHvLGBho6RI oXelzh1SLtqk2O9YU4v3mwFcf5LtTZiSyq237Lni/Y5/3ZB0nePZz6TCt3i6Lavz Id7SgTK7XNsdEjtjsESvig=="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
What if we want to get the column counts of each file?
Combining all we have learned above, we can read the first line and count the words!
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 'printf "%s\t" {}; head -1 {} | wc -w' | \
column -t
EXERCISE 6:
The gzip algorithm can compress files by up to two orders of magnitude.
In some cases it is not convenient to gunzip the files since they may fill up the drive.
So we may have to do the same thing on the gzipped files, off you go!
Note: there is no "z" version of head, but "head" can work on stdin
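As a hint, here is a toy illustration (with made-up data, not the IMDb files) of how a gzipped stream can be fed through zcat into head on stdin:

```shell
# build a tiny gzipped stream on the fly and read only its first line
printf 'header\nrow1\nrow2\n' | gzip -c | zcat | head -1    # prints: header
```

The same pattern works with the real .gz files: zcat decompresses to stdout, and head truncates the stream early.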
SOLUTION 6:
pass1=
encrypt="U2FsdGVkX19GPSAW6SDyYhMan0hv+ZxMF544aTgc3uqtBfkllke344tehTZK1lBo I9x/LzmrmFnoY+SWtvstpln4pkqKMycuFxVuV1oCwEAn0FKHbUOqqbKthawGVWJb yJYo7jztKKIO/5/VROag1QHWQ07ihGUeNRMkGrr9CJQge2+pQyR2W4DOl4oKrwA9 /fuZ2du4+1tmXWM/Q48RIA=="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Statistics are good to have but it would be better to review the structure of the files, for example the headers and first data rows:
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 'printf "%s\n" {}; head -2 {} | column -t; echo'
EXERCISE 7:
Did you realize that something was different here as compared to the previous ones? What is that?
SOLUTION 7:
pass1=
encrypt="U2FsdGVkX1+2dKu4xqou48SPsqsiI6GE3dC+Otn8k6ZOHn5yUo21hVBLeZJmm6VA 2yWJ7AD4JwQ6Onk1oF6mT69Lno0m1k3tGRqUfB2A/YVf4H1CxvnC/1hGnMxBOBHM ZV72KSZKzb2X3xUY9MX0Yj1Pr+gYmEzoOB1NntJ7SfI="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
If we want to get the last n lines, we use tail:
whatis tail
tldr tail
EXERCISE 8:
We want to print the first and last lines of each file.
Note that, in order to align both lines, they must be fed into the pipe together before the "column" command. Can you figure out a way to do this?
SOLUTION 8:
pass1=
encrypt="U2FsdGVkX18QvsiUoHDi8Xk/h3bH6K2gaaPXrVnxqxQ1eXDPTss1htcSDx5uOVW/ QNXlXUvh94Jk/yMPS0P+4LIKMhsnMPTDTAb7YdYF+DZ4jP285gQqBdezPyEg8qCc tNuDVOvwjjchyYm0pSoeDJHGAAgxrvapOQj3R2hi2L8EMCYAm5EDhMgfGgOnU0J5"
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Suppose we are concerned with only one column of a file. We have information on how the columns are separated and the position of the column.
We can do that easily with "awk". It is more than a command: a fully-fledged programming language and the Swiss army knife of text processing in Unix! It is mostly used for short scripts and especially for very, very powerful one-liners.
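A minimal taste of awk before we use it in earnest (the input here is toy data, not the IMDb files): awk splits each line into fields on the separator given with -F and exposes them as $1, $2, ...

```shell
# print the second tab-separated field of each input line
printf 'tconst\ttitleType\nt1\tmovie\n' | awk -F "\t" '{ print $2 }'
# prints:
# titleType
# movie
```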
I hope you watched the old AT&T video on Unix pipes, narrated by a bearded gentleman. Brian Kernighan is one of the most colossal figures in the history of computing, and the "K" in "AWK" stands for his surname
whatis awk
tldr awk
But first we have to know the position of a column with a certain heading, such as endYear:
head -1 $imdbdir/tsv2/title.basics.tsv
We can easily count that it is the 7th column. But suppose we want to do that programmatically.
We need a few more tools
whatis grep
tldr grep
grep is a very important tool for pattern searching in files. With the -n flag, it can return the line number of a match
However, the input is not separated into lines :(
First we have to convert tabs into newlines, with any of the two tools below:
whatis tr
tldr tr
head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n"
whatis sed
tldr sed
While tr is a more specific tool for transforming characters in a file or stdin, sed is more powerful and versatile
It can substitute pattern matches with new strings, extract certain lines, append, delete or insert lines
head -1 $imdbdir/tsv2/title.basics.tsv | sed 's/\t/\n/g'
The above sed command reads: substitute (s) each tab (\t) with a newline (\n), globally (g), i.e. for every occurrence on the line
"grep" stands for, Global/Regular Expressions/Print
awk is more versatile and handles more complex tasks but is harder to write; sed is roughly a subset of awk but easier to write; and tr is a subset of sed but even easier to use.
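To make the comparison concrete, here is the same tab-to-newline conversion done with all three tools on a toy input (the sed variant assumes GNU sed, which understands \t and \n in s///):

```shell
printf 'a\tb\tc\n' | tr "\t" "\n"          # translate characters one-for-one
printf 'a\tb\tc\n' | sed 's/\t/\n/g'       # substitute a pattern globally (GNU sed)
printf 'a\tb\tc\n' | awk -F "\t" '{ for (i = 1; i <= NF; i++) print $i }'  # loop over fields
# each prints a, b, c on separate lines
```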
IMHO, if an easier tool exists for a specific task, I go for it, so I prefer "tr" here:
Now, we should search for endYear and get its line number:
head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | grep -n endYear
Good, it is the 7th column. But programmatically we should get only that 7 and not the rest.
We can feed it into grep again as such (don't mind the pattern here):
head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | grep -n endYear | grep -Po "^\d+(?=:)"
It reads: print only the match (-o) of the Perl-style regex (-P): one or more digits (\d+) at the start of the line (^) that are followed by a colon (the lookahead (?=:))
Regex is a very powerful domain specific language to match patterns in text and as a big data manager you WILL surely need it in many tasks
We will return back to regex and make recommendations in later sessions
Maybe we can get a more concise way to extract the number? We can revert back to sed for this:
head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | sed -n '/endYear/='
In grep you cannot return the line number and suppress the match at the same time. But in sed you can return only the line number of the match
Now in order to use this column position inside awk, we can save it into a variable called CN for "column number"
CN=$(head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | sed -n '/endYear/=')
echo $CN
In more complex code, inserting echo (or printf) statements is important for subsequent debugging
So we can save output of any command into a variable with this syntax
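For example (the variable name here is a throwaway, just for illustration):

```shell
# capture the output of a whole pipeline into a variable
rootcount=$(ls / | wc -l)
echo "There are $rootcount entries right under /"
```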
Getting a column with an index is trivial for awk:
awk -F "\t" '{ print $7}' $imdbdir/tsv2/title.basics.tsv | head
"-F" sets the field separator to tab character.
But awk has its own variables and we have to pass this CN variable into awk with the -v flag.
We can use the same variable name or a different one.
awk -v CN=$CN -F "\t" '{ print $CN }' $imdbdir/tsv2/title.basics.tsv | head
Note that the $CN inside the awk expression is different from the one we defined before. Variables of awk are distinct from those of the shell from which awk is invoked.
Combined steps are:
CN=$(head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | sed -n '/endYear/=')
awk -v CN=$CN -F "\t" '{ print $CN }' $imdbdir/tsv2/title.basics.tsv | head
Or in a single statement:
awk -v CN=$(head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | sed -n '/endYear/=') \
-F "\t" '{ print $CN }' $imdbdir/tsv2/title.basics.tsv | head
Writing strictly one-liner code is not the best skill a coder should have.
Writing code that is easily understood by the coder h(er|im)self and collaborators, and easily debugged, is a more important skill
So I go for the two-liner version
But wait .... I refer to the filename twice! Suppose I had to do that many more times, referring to the same path again and again. Wouldn't it be better to save it into a variable and reuse?
filename=$imdbdir/tsv2/title.basics.tsv
CN=$(head -1 $filename | tr "\t" "\n" | sed -n '/endYear/=')
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | head
But for a more reusable code, we can also parametrize the pattern and the last command to get the head or tail
filename=$imdbdir/tsv2/title.basics.tsv
pattern=endYear
comnd=head
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | $comnd
And we can wrap all inside a shell function with parameters to be passed.
Suppose first parameter is the filename, second is the pattern and the last is the command (head or tail)
Parameters passed to the function can be referred by their positions as $1, $2 ...
# $imdbdir/tsv2/title.basics.tsv
# endYear
# head
headortail()
{
filename=$1
pattern=$2
comnd=$3
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | $comnd
}
Now we can reuse this code for many similar purposes. Let's start with our initial intent:
headortail $imdbdir/tsv2/title.basics.tsv endYear head
And get the last values of primaryName column from the name.basics.tsv file:
headortail $imdbdir/tsv2/name.basics.tsv primaryName tail
Note that we can also do the pattern match with awk, more powerfully than with sed or grep:
For example let's return the whole line where 7th field matches "endYear", prefixed with line number
filename=$imdbdir/tsv2/title.basics.tsv
awk -F "\t" '$7 == "endYear" { print NR": "$0 }' $filename
EXERCISE 9:
Write a function named linepattern that takes a file, a column pattern, a row pattern and returns the line number of match and the whole line
Note that you can pass multiple variables to awk by invoking "-v" multiple times, once for each variable
And string variables containing whitespace must be wrapped inside quotes
SOLUTION 9:
pass1=
encrypt="U2FsdGVkX18yoFtMKddz+N0MjU6twanensOn3zp1K/Bkv7lPtruYc23Ma6LoOjGk dlozrWW+rG3dgDq6TOYcgrpO1d4pIZomvyE01bWSDh1WXgXx0s+wTt8EPAIj/rg7 B5Nqcy4U/24rXJq59n6IkAOFrFFKCcDK3Z/D62FxI5al0tH5cyQVkWv+7NJSf+T7 Ctl3wN3Cjt4kHZXkL2YX7PfznQwOJb34BvWUd5Vl5XxTKmrbuS801+41XQdWUVJ5 kEXlxR3nmL3WPvLtfT/0Ux1VLjzJxzFtaUmvyDGigdoHIp/qXS+jv+I4pmXG8GWs"
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Let's call it for name.basics.tsv, primaryName and "Al Pacino"
linepattern $imdbdir/tsv2/name.basics.tsv primaryName "Al Pacino" | column -t
Now out of millions of rows, let's explore the unique values:
whatis sort
tldr sort
whatis uniq
tldr uniq
It is a coincidence that UNIQ is just next to our building!
filename=$imdbdir/tsv2/title.basics.tsv
pattern=endYear
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | sort | uniq
In order for "uniq" to work, stdin must be well sorted. But wait, sort command can also do uniq with "-u" flag:
filename=$imdbdir/tsv2/title.basics.tsv
pattern=endYear
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | sort -u
But for single-column values, that tall an output does not look good. We can get it arranged and printed in columns:
whatis pr
man pr
filename=$imdbdir/tsv2/title.basics.tsv
pattern=endYear
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | \
sort -u | pr -8 -t
EXERCISE 10:
Get the unique values of startYear column of title.basics.tsv but report only year values (just numbers)
You can reuse components from previous grep commands for regex
SOLUTION 10:
pass1=
encrypt="U2FsdGVkX19ZYi1B9oGWAmWy5lxS4X1HuF6mo6QMpjWeoHS34ZDvXidi0RTD1WpE wLZD8MkYgZmlD+t4jFS9iTtrrd5zXO74cehaQUVvquHcfFyI2F8MZOQpxvHdwEFj QP96UYpY9c+a/E/SuDbPEldCMrDS8Q/haRoaUKn4JWI51bbVhG3o0vwNjgVmBGKW xLYOEK/FHiVpKcuxb2u0THLrKquzIFddJkkOYXqLpjYQ8yxiL27D6eMLlcpc/+aQ QQObeW1jV5D0p7ewvG85Px6w0U9R3SFAA27RUnq+wmY="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
The dataset stretches far back!
Now let's say the counts of unique values are also of importance:
Sort in reverse order in terms of counts (only years)
counts=$(filename=$imdbdir/tsv2/title.basics.tsv
pattern=startYear
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | \
grep -Po "\d+" | sort | uniq -c)
echo "$counts" | sort -nr | column -c 100
And let's sort in increasing year order:
echo "$counts" | sort -n -k 2 | column -c 100
And how about a simple plot without leaving the terminal/shell?
echo "$counts" | sort -n -k 2 | \
gnuplot -e "set terminal dumb; plot '-' using 2:1 w points pt '*'"
And let's just focus on the period between 1980 and 2017:
echo "$counts" | sort -n -k 2 | awk -F " " '$2 <= 2017 && $2 >= 1980' | \
gnuplot -e "set terminal dumb; plot '-' using 2:1 w points pt '*'"
Suppose we use the awk code for filtering on multiple criteria:
filename=$imdbdir/tsv2/title.basics.tsv
head -1 $filename
Let's exclude the first two columns, keep the header, and filter for titles matching "Godfather" in primaryTitle with startYear after 1972 and before 1990:
CN1=$(head -1 $filename | tr "\t" "\n" | sed -n "/primaryTitle/=")
CN2=$(head -1 $filename | tr "\t" "\n" | sed -n "/startYear/=")
{ head -1 $filename | awk -F "\t" 'BEGIN {OFS = "\t"} { $1=$2="";print $0 }'; \
awk -v CN1=$CN1 -v CN2=$CN2 -F "\t" \
'BEGIN {OFS = "\t"} $CN1 ~ /Godfather/ && $CN2 > 1972 && $CN2 < 1990 { $1=$2="";print $0 }' $filename; } | \
column -t -s $'\t'
OFS is the output field separator, set so the output formats better for the column command
We delete the first two columns by setting them to the empty string
And we combine the header and the filtered rows with { line1; line2; } | further_commands, so they are formatted together by column
We combine multiple filters with &&
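A compact toy illustration of the column-deletion trick (note that the emptied fields keep their separators, so the leading tabs remain):

```shell
# blank out the first two fields; awk rebuilds $0 using OFS between all fields
printf 'a\tb\tc\td\n' | awk -F "\t" 'BEGIN { OFS = "\t" } { $1 = $2 = ""; print $0 }'
# prints a line starting with two tabs, then: c<TAB>d
```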
EXERCISE 11:
Delete the first 2 and isAdult columns, keep the header. Filter for movies that have a runtime less than 10 minutes and not "\N", whose genres include thriller, and whose start year is after 1990 and not "\N". Note that you have to escape the escape character in "\N" as such: "\\N"
Run head -1 for once and save in a variable
SOLUTION 11:
pass1=
encrypt="U2FsdGVkX1+S79PKqpQJiVEf5D6Y/snYFkTIjRPI+v/ZbkuGO+wB6rmkGH5AHHUg 7aEK+wO93DX02gFwYjkltP3OTIxg7wxJFjAfRYGMUriQK/DbcPMqlWRyAZWp2pXo vrI0smeerjV5d4wyMQDe33lYvbfFq6sC1cZvbmbQeOxv4ET8O+BapcDVt4ULXYC+ B75Uc1g0PrqM9fewr0IqQX+xCaqAAn4j8ly5nxhxVGgneTrErUeO6QP9oEx1FCvz vFXhCdm1uyqKtBOO8GB/OUQ5Mz6PvlNM87RNNuNhmrhnsjwqw97eC8uLQKssASor mi/5t9NebnQgHAf0IShq0L/KawPfnm/KmYXjb2DqSSu+9OM1ZyxpDAgjveD/TNbR C0Re7OgVfi6kg22LiZpRiHks4+clqnkjF6cCA54dFu5Qf49orXJNtB7HOakorRoD 0puOPrWRy754JLxLzgKnnE7B6+OJ99uFWQGLXuT5RKfdtBc6wWjXZkSNPVoK8vTV cihK+YKQum03N/kkZ6cxtTRnT+ZCG5zyZBOFxOS+w0qt9NeYd1wbG9thr/CSvkzJ dgycZl31MeOXzqUatZsQg9x13Aq6NVBusM8u1l2fw6bFCDlmSHdHyEp+UsLhfaTK GvDyAy8VI2AZQDIWTDoxbU6DdLdrWgsUT7TkImjFjn7WCFpRqS6OAAyYBEYfuko9 CH0VWfwJYiCs3O2HXd3p/V/aqPPs1z9ZitI+IvzodAjLRC2+YST5t3QKxVBKsf1y wqTuDpOOHMXmIYjChKD/6vfalCYxD7uza+DKVoPCUX0="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Even very strict filters can yield results!
More complex queries involving joins among tables are beyond the intent of the awk language. And referring to even more column names complicates the code.
Now we need to switch to SQL
But first we have to process the data to better fit PostgreSQL database server
Let's remember the fields of files again:
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 'printf "%s\n" {}; head -2 {} | column -t; echo'
See that the field that starts with "tt" appears in all but one of the files. This field is the unique code for each title
And the field that starts with "nm" appears in title.principals, title.crew and name.basics; it is the unique code for each person involved in movies (actor, director, or any other profession)
We can join the tables based on these common fields to create more complex results
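To preview what such a join looks like, here is a minimal sketch with the coreutils join command on made-up keys (real joins on these files would first need sorting and more care with missing keys):

```shell
#!/usr/bin/env bash
# Two toy tables sharing a tconst-like key in the first column (made-up data)
printf 'tt0000001\tTitleA\ntt0000002\tTitleB\n' > /tmp/toy_titles.tsv
printf 'tt0000001\tnm0000009\ntt0000002\tnm0000008\n' > /tmp/toy_cast.tsv

# join matches rows of two sorted files on their first field
join -t $'\t' /tmp/toy_titles.tsv /tmp/toy_cast.tsv
```

Each output row carries the key once, followed by the remaining fields of both files.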
But we have a problem! The principalCast field in the title.principals file combines all principal people of a movie into a single comma-separated field!
We would expect a separate row for each unique combination of a title and a person
Each tool has its own strengths. The shell was powerful until now, but here we need something specifically designed to transform data: R!
data.table is an expressive, efficient and very powerful package designed for handling very large datasets, with a syntax whose components correspond to SQL clauses:
When the dropdown box on the right shows "R", it means we are now sending our commands to the R interpreter, not to Bash
First load the library:
library(data.table)
Set the main directory path
prefix <- "~/data/imdb"
prefix
And the file path
in_path <- sprintf("%s/tsv2/title.principals.tsv", prefix)
in_path
Fast read the file into R using data.table:
principals <- data.table::fread(in_path)
And split and melt from the second column. Note that since the resulting object would be 25+ million lines, we limit the number of processed lines here in order to save time on Binder:
principals_molten <- principals[1:1e5, {
principalCast_list = strsplit(principalCast, ",");
.(
tconst = rep(tconst, sapply(principalCast_list, length)),
principalCast = unlist(principalCast_list)
)
}
]
What does it do?
principalCast_list = strsplit(principalCast, ",");
creates an intermediate object, an R list, each item of which is a character vector created from each original row by splitting the second field on commas
tconst = rep(tconst, sapply(principalCast_list, length))
for each list item, replicates the tconst field by the count of principalCast entries - one for each person
principalCast = unlist(principalCast_list)
Now that we have a vector of tconst - title code - that is replicated enough for the number of principal cast in a title, we can combine all list items into a long vector
Let's see it in action using only the first two lines:
In the first (second) row there are three (two) principal cast members concatenated with commas:
principals12 <- principals[1:2]
principals12
Now we split each row of the second column on commas:
principalCast_list = strsplit(principals12[,principalCast], ",")
principalCast_list
To view the list better:
listviewer::jsonedit(principalCast_list, mode = "form")
We want to end up with 5 rows: 3 + 2
lengths <- sapply(principalCast_list, length)
lengths
In the first title there are 3 principal cast members, and in the second title 2
tconst <- rep(principals12[,tconst], lengths)
tconst
Now the title codes are replicated with the relevant counts
principalCast <- unlist(principalCast_list)
principalCast
And the list is simplified into a single vector. tconst and principalCast are both of length 5, thus we can combine them together
combined <- list(tconst = tconst, principalCast = principalCast)
listviewer::jsonedit(combined, mode = "form")
And a list of equal-length vectors can be converted to a data.frame or, better, a data.table:
principals12_molten <- as.data.table(combined)
principals12_molten
Compare initial and final objects:
principals12
principals12_molten
lapply(list(principals[1:1e5], principals_molten), dim)
If we had run the code on the whole dataset, we would have started with a data.table of 4 million rows and ended up with one of 25 million rows in a matter of seconds on a powerful enough computer!
The single data.table statement combines all these steps and repeats them for all rows. The initial and final dimensions are compared above.
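For comparison, the same split-and-replicate step can be sketched in plain awk on a toy two-column file (the file path and rows below are made up):

```shell
#!/usr/bin/env bash
# A toy version of title.principals: tconst TAB comma-separated cast (made-up rows)
printf 'tt0000001\tnm1,nm2,nm3\ntt0000002\tnm4,nm5\n' > /tmp/toy_principals.tsv

# One output row per (tconst, person) pair, like the data.table melt above
awk -F "\t" 'BEGIN { OFS = "\t" }
{
  n = split($2, cast, ",")   # split the cast field on commas
  for (i = 1; i <= n; i++)   # replicate tconst once per person
    print $1, cast[i]
}' /tmp/toy_principals.tsv
```

Two input rows become five output rows (3 + 2).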
Now create a filename for output:
out_path <- sprintf("%s/tsv2/title.principals_melt_short.tsv", prefix)
out_path
And fast write the data.table into a tsv file:
fwrite(principals_molten, file = out_path, sep = "\t") # fast write new DT as tsv
Now back to bash, let's view the new data file:
head -20 $imdbdir/tsv2/title.principals_melt_short.tsv | column -t
Bonus: The parallel code below does the same thing but far less efficiently, so don't try it (the while false loop is there so that you do not execute the code). It is here for demonstration purposes only. You may skip it.
The shell version is far less concise and efficient than the R code
while false;
do
head -1 $imdbdir/tsv2/title.principals.tsv; tail -n+2 $imdbdir/tsv2/title.principals.tsv | \
parallel -j0 "line={}; \
tc=\${line%%$'\t'*}; \
nm=\${line##*$'\t'}; \
nml=\${nm//,/$'\n'}; \
ln=\$(echo -n \"\$nml\" | grep -c ^); \
paste <(yes \$tc | head -n \$ln) <(echo -n \"\$nml\") --delimiters \"\t\"" \
> $imdbdir/tsv2/title.principals_melt_shell.tsv
done
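The parameter and process substitution forms used above can be tried directly on a sample line (outside parallel there is no need for the backslash escapes; the IDs below are made up):

```shell
#!/usr/bin/env bash
line=$'tt0000001\tnm1,nm2,nm3'   # a sample tab-separated input line

tc=${line%%$'\t'*}    # delete from the first tab onward -> the tconst part
nm=${line##*$'\t'}    # delete up to the last tab -> the comma-separated nm part
nml=${nm//,/$'\n'}    # replace every comma with a newline
ln=$(echo -n "$nml" | grep -c ^)   # count the resulting lines

# Process substitution pairs the replicated tconst with each nm
paste <(yes "$tc" | head -n "$ln") <(echo "$nml") --delimiters $'\t'
```

This prints three tab-separated rows, one per nm, each starting with tt0000001.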
Inside the command to be executed by parallel:
{} stands for the input line
line={} assigns the input to the $line variable
tc=\${line%%$'\t'*} deletes the part including and after the tab character, so only the tconst portion remains. This uses a more advanced shell scripting topic called "parameter substitution"
nm=\${line##*$'\t'} deletes the part including and before the tab character, so only the nm part is left. Again parameter substitution
nml=\${nm//,/$'\n'} replaces the commas between the nm's with newlines, practically splitting them into separate lines
ln=\$(echo -n \"\$nml\" | grep -c ^) counts the number of lines of nm's, i.e. the number of times tconst should be repeated
yes \$tc | head -n \$ln repeats tconst exactly as many times as there are nm's
paste <(yes \$tc | head -n \$ln) <(echo -n \"\$nml\") --delimiters \"\t\" combines the two columns of tc and nm into tab-separated text. This uses a more advanced shell scripting topic called "process substitution"
For the subsequent code, it is better to clean the memory allocated to R, for better performance:
rm(list =ls())
EXERCISE 12:
In R, create a file title.genres_short.tsv under tsv2 in which there are two columns from title.basics.tsv, tconst and genres: genres split from commas and tconst replicated accordingly. Take only the first 1e5 (100k) lines. Compare the dimension of input and output objects/files
SOLUTION 12:
pass <- readline(prompt = "Please enter the password for the solution: ")
encrypt <- "U2FsdGVkX1+bbnopeIKcv3xn7U0pmH8WCpExJzguv8sSJBF+o0ipnhkCqyXVydjb 6RSgGNjWN/dEVp9Fq2Ht/B8v2rWKSEyOwtlfBQ0l4ecTRBGYWmK0AQRINUaTIXZS 2MCRKFA0fxjp9bENNr1hcldd68IcWYSLLnzVto7TFrBD4rmSSGXyedb4+9VBQafX FaY1C0EHuGsA0YW7/F5QXrv164or/Shm6+Zi1aaEpnl0HHviHdC1n5hXeiIyD6g7 iEHCsdVm68V1L1SI3ALHbf9nOnV56+Hgp59I5slx0N/DMmLtUumQ9pk8A0Lw8mAk 1ojJp2QzG5r89iZ58fVtxj5VDjr/pnQSKHBzfH0GRF+JHIKS90OAmcKzHh/Two6q Ckv0e101B5cargbkk5EKRwYmNqUzTBIzIqqCrqhzbGJS/z9maAHSmSTgqyFYxLnR QgMvuCfGHNiq0O+WR5F7piDVdmpHzG59yD4A/2lZMSe01jppyBXjW/+E5PocIie/ L9E3wxl/ObiRGc6dwCQ1W4DiaEqZJxUpraqANWLJ8Yc7dsPEOB4GBuZnOYP9WGYR sNdJ91nqhJb5eFmbzcOQRBJzEt4onc0dqBdZC3qZTTJzJ7WLShIwr35tOkcrtDxn zbHuFR7j4x4cUVHZcA3UVlG5QrKegLh64jA0UGjUML+Aq5hzjaCcJ+GQeOjkMjHk iLEgMvz+YdFuo9txUm9J8PlR+iOOLz5yI8xNU5YVLZmYvzXFSnGYwQKoAINmlY/7 5T0kAOYVi8HHtgNedmp0Ivs4i5hYPHwbKRDJ5Eh7ews="
solution <- system(sprintf("echo %s | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:%s 2> /dev/null", encrypt, pass), intern = T, ignore.stderr = T)
cat(solution, sep = "\n")
eval(parse(text = solution))
Let's view the output file from Bash:
head -20 $imdbdir/tsv2/title.genres_short.tsv | column -t
Now, in order to import the tsv's into PostgreSQL, we need to trim the header rows. With csv's, this step is not necessary
We will use tail for this purpose, let's see..
cat $imdbdir/tsv2/title.basics.tsv | wc -l | numfmt --to=si
The title.basics file has 4.6 million rows! More than most GUI text editors or spreadsheets can handle
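numfmt --to=si scales a raw count by powers of 1000 and appends a suffix; a quick check on a bare number:

```shell
# 4600000 is printed with an SI suffix
echo 4600000 | numfmt --to=si
```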
File starts like that:
head $imdbdir/tsv2/title.basics.tsv
And after tail:
tail -n+2 $imdbdir/tsv2/title.basics.tsv | head
We can redirect it to a new file
mkdir -p $imdbdir/tsv3
tail -n+2 $imdbdir/tsv2/title.basics.tsv > $imdbdir/tsv3/title.basics2.tsv
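On a toy stream, the effect of tail -n+2 is easy to verify:

```shell
# tail -n+2 prints everything from the 2nd line on, dropping the header
printf 'header\nrow1\nrow2\n' | tail -n+2
```

The header line disappears and both data rows remain.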
Now do it for all files:
The if clause checks whether the output file already exists; if it does, that file is skipped
basename extracts the filename from the path
${basenm/.tsv/2.tsv} inserts a 2 before .tsv
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 "basenm=\$(basename {});
if [ ! -e ${imdbdir}/tsv3/\${basenm/.tsv/2.tsv} ];
then
tail -n+2 {} > \
${imdbdir}/tsv3/\${basenm/.tsv/2.tsv};
fi
"
Now it is time to create a database on PostgreSQL server and import the data into!
There are some other utilities that you might find useful while working on *nix systems (Linux, macOS, FreeBSD, even Android)
To get details on ongoing processes on the operating system:
whatis ps
tldr ps
For example, list "python" processes:
ps aux | grep python
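One caveat when filtering ps output: the grep process itself matches its own pattern and shows up in the list. A common trick is to wrap one character of the pattern in brackets; the bracket expression still matches "python", but the literal command line containing "[p]ython" no longer matches itself:

```shell
# "[p]ython" matches the text "python" but not the string "[p]ython"
ps aux | grep "[p]ython"
```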
Get information on the hostname of the computer you are logged into:
whatis hostname
tldr hostname
hostname
"ls" does not report the recursive size of directories, for this you may use du (disk usage)
whatis du
tldr du
du -sh $imdbdir/*
To get information on mounted file systems, use df:
whatis df
tldr df
To list mounted file systems, with file types and human readable sizes:
df -hT
To mount/unmount drives on the root filesystem:
whatis mount
tldr mount
whatis umount
tldr umount
To check the integrity of a filesystem:
whatis fsck
tldr fsck
To get superuser privileges as the current user (if allowed) or execute a command as another user:
whatis sudo
tldr sudo
And to get a dynamic overview of the system resources, utilization and processes: top (by default) or better htop (to be installed separately)
whatis htop
tldr htop
Let's open htop on our terminal (press q to quit):
scr=$(screen -ls | grep -P "Attached" | head -1 | grep -Po "^\t+.+?(?=\s|\t)" | tr -d "\t")
screen -S $scr -X stuff 'htop\n'
For a very versatile, efficient, lightweight and extensible text and code editor, vim is the tool of power users:
whatis vim
tldr vim
scr=$(screen -ls | grep -P "Attached" | head -1 | grep -Po "^\t+.+?(?=\s|\t)" | tr -d "\t")
screen -S $scr -X stuff 'vim\n'
There are also many networking tools in *nix that make it easy and efficient to script and automate networking tasks and to monitor network activity, such as:
ip, ifconfig, route, netstat, nmap, ping, bwm-ng, nethogs, iftop
To get and parse data from the network useful tools are:
curl, wget, lynx, w3m, elinks. You can also use chrome headless from the command line
In order to induce you to "dirty your hands" with the shell tools, here is a little assignment that will count as 5 additional points if done right:
The great "sed" tool has a simple way to extract arbitrary lines by line number from a text stream:
For example, extract the 10th to 20th lines from input:
seq 100 | sed -n "10,20p"
sed can both read from stdin (input from the pipe) and a file as the last parameter
tldr sed
While seq creates a sequence of numbers:
whatis seq
tldr seq
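Line ranges in sed can also come from shell variables, for example:

```shell
# Extract lines $start through $end from a stream
start=3; end=5
seq 10 | sed -n "${start},${end}p"
```

This prints the numbers 3, 4 and 5, one per line.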
Your task is to write a wrapper function called windowsearch that takes four parameters:
SOLUTION BONUS 1:
pass1=
encrypt="U2FsdGVkX1/Y9ExlbpCUNWw72NzPn6D9LH0B3yzGkj3GMYpzObI/hn2Tq+EE+Iy3 7jJWVnuYZf1v4/e4tvoV/UaFzwtrXYmJOdUj86dm93vC5yTkzzL13USrujOKSwGD mFM+ZWikVnFmGWHBi+NaCvbrQ/0Bl+96dmp1Z7zPv6JvApA5DWLBNarG8il+Oc+3 fRdUkV0BscSY9GoWWloMTj/QvCOdfG7xfIhCe4JSCkI="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
To check your function, invoke it with the following parameters:
windowsearch 10 20 2 $imdbdir/tsv2/title.principals.tsv
should yield:
tt0000009 nm0085156,nm0063086,nm1309758,nm0183823
tt0000014 nm0166380,nm0525910,nm0244989
tt0000010 nm0525910
tt0000016 nm0525910
tt0000012 nm0525910,nm0525908
tt0000015 nm0721526
tt0000018 nm0804434,nm3692071
tt0000019 nm0932055
tt0000013 nm1715062,nm0525910,nm0525908
tt0000017 nm3691272,nm0804434,nm1587194,nm3692829
tt0000011 nm3692297,nm0804434
Follow the instructions in the instructor message on how to submit your solution