SERHAT ÇEVİKEL
In the first part of this session, we will use the shell to explore and wrangle our data
This facility, known as "Binder", enables us to create "stateless" and "zero dependency" teaching environments which you can access from anywhere on the web. All it takes is a click on a link! And each click starts a new instance of a Docker image in the same state (programs, configurations, options, data, files, material, etc.)
As the Binder FAQ states:
How can mybinder.org be free to use? The Binder project has a grant from the Moore Foundation to sustain the cloud resources running mybinder.org. In the future we hope to see more public BinderHub services running that can form a collection of community resources for interactive cloud computing.
Gordon and Betty Moore Foundation web site says:
Gordon and Betty Moore established the foundation to create positive outcomes for future generations. In pursuit of that vision, we foster path-breaking scientific discovery, environmental conservation, patient care improvements and preservation of the special character of the San Francisco Bay Area.
SO SPECIAL THANKS GO TO GORDON AND BETTY MOORE FOUNDATION FOR PROVIDING EARLY FUNDING IN 2017 FOR SUCH A FACILITY
While it may seem like mybinder.org is a single website, it is in fact a federation of teams that deploy public BinderHubs to serve the community. This page lists the BinderHubs that currently help power mybinder.org.
You can learn more about the funders of the Binderhub Federation from: The BinderHub Federation
Here is a list of the current members of the BinderHub federation:
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. A job can be a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables. A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel.
When you hit:
parallel --citation
The notification goes as:
Academic tradition requires you to cite works you base your article on.
When using programs that use GNU Parallel to process data for publication please cite:
@article{Tange2011a,
  title = {GNU Parallel - The Command-Line Power Tool},
  author = {O. Tange},
  address = {Frederiksberg, Denmark},
  journal = {;login: The USENIX Magazine},
  month = {Feb},
  number = {1},
  volume = {36},
  url = {http://www.gnu.org/s/parallel},
  year = {2011},
  pages = {42-47},
  doi = {http://dx.doi.org/10.5281/zenodo.16303}
}
(Feel free to use \nocite{Tange2011a})
This helps funding further development; AND IT WON'T COST YOU A CENT.
If you pay 10000 EUR you should feel free to use GNU Parallel without citing.
If you send a copy of your published article to tange@gnu.org, it will be
mentioned in the release notes of next version of GNU Parallel.
So don't forget to cite Tange2011a when you use GNU Parallel in an academic study
And you may also want to review the links below to get an understanding of Markdown syntax, to create a better layout and view for your homeworks and/or your own Jupyter notebooks:
We have letters for drives/volumes and each drive has its own root, as such: "C:\", "D:\" ...
We have a single filesystem root "/" and all drives/volumes are "mounted" under this root
Directories are separated by backslashes "\"
Directories are separated by slashes "/". Backslash is an escape character to toggle between literal and special usages of some characters
Directory and filenames are case insensitive: "path" and "PATH" are the same
Directory and filenames are case sensitive: "path" and "PATH" are different and can co-exist
In Windows tradition, directory and filenames often contain spaces, since this does not matter in GUI usage
In Linux tradition, whitespaces are avoided as much as possible because they may cause trouble for shell commands
GUI's are the main interface for interaction with the OS
A true Linux/Unix "hacker" prefers the terminal/CLI/shell (the black screen) for many good reasons!
So:
In Stanley Kubrick's sci-fi classic 2001: A Space Odyssey, an AI interface to the main computer of the spaceship looked and talked like that:
Hopefully the shell we use, Bash, does not refuse our requests arbitrarily, as HAL does to Dave
In an SoS notebook, in order to invoke Bash (send a command to be interpreted with Bash), type the command into a cell with "Bash" selected at the right and hit Enter:
# Hello the commands you type and enter here are interpreted by Bash
# But statements that start with a hash sign "#" are comments not interpreted at all
# So that's a comment that tells about comments!
We can also invoke Bash through a terminal.
In *ubuntu distros, Ctrl+Alt+T is the default shortcut to invoke a terminal window. If you are on an *ubuntu system, try it yourself, and to close it type "exit"
Now let's open a terminal from inside Jupyter/binder by clicking on the plus sign - the new launcher - to the left of the toolbar and click on the black terminal icon with the $ sign:
Now please copy and paste the below command inside that terminal:
screen -S 1
A very powerful utility called "screen" is automatically opened inside the terminal
From now on, we can send commands to the terminal from this Jupyter notebook using the utility called "GNU Screen".
We will come to screen in more detail later; in short, it creates persistent terminal sessions that can be detached, reattached, and sent commands remotely
Note that it may take some time for the screen session inside the terminal to become operational. We will go on and return to the screen terminal in order to send some of our code there for execution ...
And this code gets the appropriate session name so that we can send remote commands to the terminal from here:
(content not important for the time being)
#scr=$(screen -ls 2> /dev/null | grep -Po "^.+(?=\.jupyter)" | head -1 | tr -d "\t")
scr=$(screen -ls | grep -P "Attached" | head -1 | grep -Po "^\t+.+?(?=\s|\t)" | tr -d "\t")
echo $scr
We do not usually memorize all usage details of all commands. We can access this info whenever we want:
To get a one-liner explanation:
whatis ls
To get most verbose info:
man ls
And best of both worlds, info on most common usages:
tldr ls
tldr tldr
Note that tldr is not a standard utility on *nix systems and you have to install it separately (e.g., as an npm package). It is a community project and may not cover many lesser-known utilities/programs
And for screen:
whatis screen
tldr screen
whatis ls
tldr ls
We see that, most common flags are:
Let's list the contents of the root directory, with details and time sorted:
ls -lt /
EXERCISE 1:
Now try yourself: List all the contents of the "usr" directory right under root, including hidden files, and reverse size sorted, one file per line and without details:
SOLUTION 1:
pass1=
encrypt="U2FsdGVkX1+dE49K4/94vw8AdJTGsjV5Bza7+BMfh70="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
whatis echo
tldr echo
">" redirect operator "redirects" the standard input (stdin) from a command into a file (and saves) ">>" appends, ">" overwrites
echo "Hello World" > ~/helloworld
echo "Hello World a second time" > ~/helloworld # this will overwrite the file
echo "Hello World a third time" >> ~/helloworld # this will append the file
ls ~/helloworld
We see the file is there, but how can we see its contents?
Cat "concatenates" or combines and prints the contents of the files provided as arguments
whatis cat
tldr cat
Let's view the main file for configuration, environment and options of the Bash shell for the current user:
cat ~/.bashrc
EXERCISE 2:
View the contents of the helloworld file at the home directory created recently
SOLUTION 2:
pass1=
encrypt="U2FsdGVkX1/+/ZVgeVyUxQ9lO6wDAKaPxNKIkrz8LigmUJPhOkJxvHhiw0/Zi7xw"
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Most of the time, we may need to get the row/word/character/byte count of the output of a previous command
whatis wc
tldr wc
See how many files/directories there are right under the root directory:
ls / | wc -l
We are ok with "ls /" and "wc -l" parts, but what about that strange "|" sign?
Unix pipe "|" is a structure in Unix-like OS's, that redirects the standard output (stdout) of a former command into the standard input (stdin) of a latter command
ls / | cat -n
The first process writes the contents of the root /, and its output is fed into the input of "cat", whose "-n" flag adds line numbers
So the command takes the raw input and adds line numbers. Now let's see how many files there are:
ls / | wc -l
The first process does the same thing; the second counts the lines in its input
Pipe is very powerful and simplifies complex data and workflows!
You'll create killer one-liners that do a lot of things using multiple pipes!!!
In our workflow we may refer to same paths and filenames many times, and some values and parameters may be selected once and referred to in many places
It is better that we define them once in an environment variable and refer by that variable name:
First, some built-ins:
echo $HOME # echoes the value of HOME env variable
echo ~ # tilde is a shorthand for HOME
echo $PATH # echoes the value of PATH env variable
$PATH is the list of locations where the OS looks for the executable file corresponding to a command
For example, when we just type "ls", we do not specify the exact location of the "ls" executable. The exact path of the executable file that the ls command invokes is:
which ls
To list the details of that file:
ls -l `which ls`
The backticks execute the command inside and provide the output as an argument. The same is done with:
ls -l $(which ls)
$(...) notation executes the command inside and substitutes its output in place, to be used as arguments
Apart from /bin/ls, for all ls executables in whole $PATH locations:
which -a ls
datadir=~/data
Now let's check the value (with "$"):
echo $datadir
ls $datadir
Now let's define the path to the directory for "imdb" database, just under datadir:
imdbdir=$datadir/imdb
echo $imdbdir
ls $imdbdir
*IMPORTANT!: FOR THE SUBSEQUENT CODE SNIPPETS TO WORK WELL \$datadir AND \$imdbdir ENVIRONMENT VARIABLES SHOULD BE DECLARED AS ABOVE. IF FOR ANY REASON THE BASH KERNEL IS RESTARTED, THESE VARIABLES MUST BE DECLARED AGAIN BY RUNNING THIS CODE CELL:
datadir=~/data
imdbdir=$datadir/imdb
EXERCISE 3:
Define the "imdbtsv" variable for the path to the tsv directory under $imdbdir, by using $imdbdir, check the value and contents
SOLUTION 3:
pass1=
encrypt="U2FsdGVkX1+eMbt2jVy8BK8OBgzJXwsKuew6aNDkBvUziV/0dZm7RhCnl2YJCX1c YZ0alY4WMRt7ptJUPuimyde+oTl/mPvXJCkF5xXYePE="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Let's view the contents of the tsv directory:
ls -lh $imdbdir/tsv
tsv files are "tab separated values" but here they are gzipped for portability concerns. We will inflate the files but first let's get info on the contents of this database:
Contents:
The database is separated into 6 tsv files:
title.basics.tsv: 365 MB, 4,534,355 lines, 9 fields. Contains the following information for titles:
title.crew.tsv: 133 MB, 4,534,355 lines, 3 fields. Contains the director and writer information for all the titles in IMDb. Fields include:
title.episode.tsv: 72 MB, 2,986,926 lines, 4 fields. Contains the tv episode information. Fields include:
title.principals.tsv: 281 MB, 4,008,569 lines, 2 fields. Contains the principal cast for titles:
title.ratings.tsv: 13 MB, 767,042 lines, 3 fields. Contains the IMDb rating and votes information for titles
name.basics.tsv: 474 MB, 8,155,448 lines, 6 fields. Contains the following information for names:
Note that title.basics.tsv is 365MB while title.basics.tsv.gz is 79MB. That's why large files are gzipped. Now we will gunzip them
We can view the contents of gzipped files:
whatis cat
tldr cat
whatis zcat
tldr zcat
"cat" command lets us print the contents of a text formatted file to terminal, zcat does that for gzipped files. However the files are big and doing so will only show the last part in a terminal, and here in jupyter it will enter into a long running process that will overuse the memory
We will invoke a separete terminal and view the files with a pager called less:
whatis less
tldr less
whatis zless
tldr zless
scr=$(screen -ls | grep -P "Attached" | head -1 | grep -Po "^\t+.+?(?=\s|\t)" | tr -d "\t")
screen -S $scr -X stuff "zless -N $imdbdir/tsv/name.basics.tsv.gz\n"
The -N flag shows line numbers
Now let's play a game in pairs:
My ones are 794034 and 200. Who are they? Hoo-Hah!
"less is more"! You'll love "less" in time. Hit "q" to exit ...
First we will create a separate directory for gunzipped tsv files, just under $imdbdir. However if the command is rerun, it should not raise an error:
whatis mkdir
tldr mkdir
The -p flag makes mkdir create parent directories as needed and do nothing if they already exist:
ls $imdbdir
mkdir -p $imdbdir/tsv2 && echo "$imdbdir/tsv2 created" ## second command executed only if first one is successful
ls $imdbdir
Now let's gunzip tsv.gz files, keeping the original gz files and not overwriting any gunzipped files if the command is run more than once:
yes n | gunzip -k $imdbdir/tsv/*.gz
whatis yes
tldr yes
The first part repeatedly outputs an "n" (for "no") for as long as the second part keeps asking "should I overwrite the existing file?"
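To see what "yes" emits on its own, here is a toy illustration (not part of the workflow), truncated with head so the infinite stream stops:

```shell
# "yes" repeats its argument forever; "head" cuts the stream after 3 lines
yes n | head -3    # prints three lines containing "n"
```

When head exits, yes receives a SIGPIPE and stops, which is exactly how the `yes n | gunzip` trick terminates as well.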
Now copy those gunzipped files into the new directory so that we don't mistakenly modify the original gz files.
whatis cp
tldr cp
cp -n $imdbdir/tsv/*.tsv $imdbdir/tsv2/
Note that "n" flag is "no-clobber" which stands for "do not overwrite existing files"
Now suppose I am fed up with this tsv2 directory and want to get rid of it completely:
whatis rm
tldr rm
We write a short shell function that checks whether the first argument ($1) exists and returns true or false. "-e" means "exists"
exists() if [ -e "$1" ]; then echo "true"; else echo "false"; fi
exists $imdbdir/tsv2
rm -r $imdbdir/tsv2 && echo "$imdbdir/tsv2 deleted"
exists $imdbdir/tsv2
Now rerun the above commands to recreate the tsv2 directory and copy the tsv files. Check the contents of the directory:
mkdir -p $imdbdir/tsv2 && echo "$imdbdir/tsv2 created" ## second command executed only if first one is successful
cp -n $imdbdir/tsv/*.tsv $imdbdir/tsv2/
ls $imdbdir/tsv2
EXERCISE 4:
SOLUTION 4:
pass1=
encrypt="U2FsdGVkX1/vzhgzmriq2SUeRoCJn1hpgV0lP8qU9HIotS6WVfVmnnxMf5w5LgS2 uhJDLm+T+kuLmwDbYRF0CJc1kckMA468XJ4QVNyXzr8TWW+oIMylnPX15OXi3FFz d891bcFNirD8Q0bYiaCYuOrUi20JScwax1Ua9xPZtNOiIwoQHQbmUJ6EsaHlPkAM XdQJg7eZe9+9+YJCwcSswXda2fCs4/NIP2OVEXgHrhNZIgQXYAZ+K6R82EvGvGf2 vFia5CFXlunH20QWvJUkw2Ro7I7lMHa/U2c7UdQzkdyy28AJqFQhWV5J49Am2dOW pEWorE7btQ1gQ8PznqBFq9WfoQOviioxuIRD+XqLqSXO+PMbqmtW5TZfbDvYtDLL 3c71Fsb+AicJ9Fw5WxI+ig=="
solution=$(echo "$encrypt" | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Let's see the initial or last lines of a file:
whatis head
tldr head
cat $imdbdir/tsv2/name.basics.tsv | head -10
But the columns are not aligned
whatis column
tldr column
cat $imdbdir/tsv2/name.basics.tsv | head -10 | column -t
Now I want to get a feel of how tall the tsv files are in terms of row count. For just one file:
cat $imdbdir/tsv2/name.basics.tsv | wc -l
This is hard to read for human beings
whatis numfmt
man numfmt
cat $imdbdir/tsv2/name.basics.tsv | wc -l | numfmt --to=si
But a single file does not suffice; let's do that for all tsv files
ls $imdbdir/tsv2/*
ls $imdbdir/tsv2/* | \
while read l;
do
cat "$l" | wc -l | numfmt --to=si;
done
That's better. We loop through the full paths of the contents of the tsv2 directory, assign each to the variable $l, and run the above pipe for each $l.
Better yet, we can use a for loop for this iteration:
for l in $imdbdir/tsv2/*; \
do
cat "$l" | wc -l | numfmt --to=si;
done
But wouldn't it be better if we also reported the respective filenames?
whatis printf
tldr printf
for l in $imdbdir/tsv2/*; \
do
printf "%s\t" $l;
cat "$l" | wc -l | numfmt --to=si;
done
We can use the more powerful "find" command instead of ls
whatis find
tldr find
find $imdbdir/tsv2 -mindepth 1 | \
while read l;
do
printf "%s\t" $l;
cat $l | wc -l | numfmt --to=si;
done
But we can replace the whole loop with something more terse (or concise), with aligned columns:
whatis xargs
tldr xargs
find $imdbdir/tsv2 -mindepth 1 | \
xargs -i sh -c 'printf "%s\t" {}; cat {} | wc -l | numfmt --to=si' | \
column -t
Or do the loop inside the "-exec" argument of find command:
find $imdbdir/tsv2 -mindepth 1 \
-exec sh -c 'printf "%s\t" {}; cat {} | wc -l | numfmt --to=si' \; | \
column -t
Or instead of serial processing, process multiple lines at the same time utilizing multiple CPU cores:
whatis parallel
tldr parallel
And to support the development of GNU Parallel project, please read:
parallel --citation
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 'printf "%s\t" {}; cat {} | wc -l | numfmt --to=si' | \
column -t
The -j flag sets the number of parallel jobs; -j0 means as many as possible. The -k flag keeps the original order of the inputs
Contrary to sequential single-threaded processing, in parallel processing without the -k flag the order of the output may differ, depending on the completion time of each job
You see there are many ways to do the same thing in Linux
EXERCISE 5:
Find the WORD counts (not line counts) of all gzipped files under the tsv directory, using any of the methods you like above. Check the "-name" option of find; the glob for the files is "*.gz"
IMHO, this is a "big data management" course, and in order to process larger datasets easily utilizing the max power of your processor with full throttle, "parallel" should be your way!
Note that in order to use text processing commands (cat, less etc.) on zipped files, we add the prefix "z" to those commands
SOLUTION 5:
pass1=
encrypt="U2FsdGVkX18fvmKge1qRd1u+3HhdU41PB1axe167aRS527dX9nIx2ckFe3wPmE0E vQ93TpCe1jTphNkfZ+qyzB1oiJvxdVUclalfFlQbzhy42p1pN46/DMHvLGBho6RI oXelzh1SLtqk2O9YU4v3mwFcf5LtTZiSyq237Lni/Y5/3ZB0nePZz6TCt3i6Lavz Id7SgTK7XNsdEjtjsESvig=="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
What if we want to get the column counts of each file?
Combining all we have learned above, we can read the first line and count the words!
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 'printf "%s\t" {}; head -1 {} | wc -w' | \
column -t
EXERCISE 6:
The gzip algorithm can compress files by up to two orders of magnitude.
In some cases it is not convenient to gunzip the files since they may fill up the drive.
So we may have to do the same thing on the gzipped files, off you go!
Note: there is no "z" version of head, but "head" can work on stdin
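As a hint, here is a toy illustration (with made-up data, not the IMDb files) of how a gzipped stream can be fed through zcat into head on stdin:

```shell
# build a tiny gzipped stream on the fly and read only its first line
printf 'header\nrow1\nrow2\n' | gzip -c | zcat | head -1    # prints: header
```

The same pattern works with the real .gz files: zcat decompresses to stdout, and head truncates the stream early.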
SOLUTION 6:
pass1=
encrypt="U2FsdGVkX19GPSAW6SDyYhMan0hv+ZxMF544aTgc3uqtBfkllke344tehTZK1lBo I9x/LzmrmFnoY+SWtvstpln4pkqKMycuFxVuV1oCwEAn0FKHbUOqqbKthawGVWJb yJYo7jztKKIO/5/VROag1QHWQ07ihGUeNRMkGrr9CJQge2+pQyR2W4DOl4oKrwA9 /fuZ2du4+1tmXWM/Q48RIA=="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Statistics are good to have but it would be better to review the structure of the files, for example the headers and first data rows:
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 'printf "%s\n" {}; head -2 {} | column -t; echo'
EXERCISE 7:
Did you realize that something was different here as compared to the previous ones? What is that?
SOLUTION 7:
pass1=
encrypt="U2FsdGVkX1+2dKu4xqou48SPsqsiI6GE3dC+Otn8k6ZOHn5yUo21hVBLeZJmm6VA 2yWJ7AD4JwQ6Onk1oF6mT69Lno0m1k3tGRqUfB2A/YVf4H1CxvnC/1hGnMxBOBHM ZV72KSZKzb2X3xUY9MX0Yj1Pr+gYmEzoOB1NntJ7SfI="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
If we want to get the last n lines, we use tail:
whatis tail
tldr tail
EXERCISE 8:
We want to print the first and last lines of each file.
Note that, in order to align both lines, they must be fed into the pipe together before the "column" command. Can you figure out a way to do this?
SOLUTION 8:
pass1=
encrypt="U2FsdGVkX18QvsiUoHDi8Xk/h3bH6K2gaaPXrVnxqxQ1eXDPTss1htcSDx5uOVW/ QNXlXUvh94Jk/yMPS0P+4LIKMhsnMPTDTAb7YdYF+DZ4jP285gQqBdezPyEg8qCc tNuDVOvwjjchyYm0pSoeDJHGAAgxrvapOQj3R2hi2L8EMCYAm5EDhMgfGgOnU0J5"
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Suppose we are concerned with only one column of a file. We have information on how the columns are separated and the position of the column.
We can do that easily with "awk". It is more than a command: a fully-fledged programming language and the Swiss army knife of text processing in Unix! It is mostly used for short scripts and especially for very, very powerful one-liners.
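A minimal taste of awk before we use it in earnest (the input here is toy data, not the IMDb files): awk splits each line into fields on the separator given with -F and exposes them as $1, $2, ...

```shell
# print the second tab-separated field of each input line
printf 'tconst\ttitleType\nt1\tmovie\n' | awk -F "\t" '{ print $2 }'
# prints:
# titleType
# movie
```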
I hope you watched the old AT&T video on Unix pipes, narrated by a bearded gentleman. Brian Kernighan is one of the most colossal figures in the history of computing, and the "K" in "AWK" stands for his surname
whatis awk
tldr awk
But first we have to know the position of a column with a certain heading, such as endYear:
head -1 $imdbdir/tsv2/title.basics.tsv
We can easily count that it is the 7th column. But suppose we want to do that programmatically.
We need a few more tools
whatis grep
tldr grep
grep is a very important tool for pattern searching in files. With the -n flag, it can return the line number of a match
However, the input is not separated into lines :(
First we have to convert tabs into newlines, with any of the two tools below:
whatis tr
tldr tr
head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n"
whatis sed
tldr sed
While tr is a more specific tool for transforming characters in a file or stdin, sed is more powerful and versatile
It can substitute pattern matches with new strings, extract certain lines, append, delete or insert lines
head -1 $imdbdir/tsv2/title.basics.tsv | sed 's/\t/\n/g'
The above sed command reads: substitute (s) each tab (\t) with a newline (\n), globally (g), i.e. for every occurrence on the line
"grep" stands for, Global/Regular Expressions/Print
awk is more versatile and handles more complex tasks but is harder to write; sed is roughly a subset of awk but easier to write; and tr is a subset of sed but even easier to use.
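To make the comparison concrete, here is the same tab-to-newline conversion done with all three tools on a toy input (the sed variant assumes GNU sed, which understands \t and \n in s///):

```shell
printf 'a\tb\tc\n' | tr "\t" "\n"          # translate characters one-for-one
printf 'a\tb\tc\n' | sed 's/\t/\n/g'       # substitute a pattern globally (GNU sed)
printf 'a\tb\tc\n' | awk -F "\t" '{ for (i = 1; i <= NF; i++) print $i }'  # loop over fields
# each prints a, b, c on separate lines
```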
IMHO, if an easier tool exists for a specific task, I go for it, so I prefer "tr" here:
Now, we should search for endYear and get its line number:
head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | grep -n endYear
Good, it is the 7th column. But programmatically we should get only that 7 and not the rest.
We can feed it into grep again as such (don't mind the pattern here):
head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | grep -n endYear | grep -Po "^\d+(?=:)"
It reads: print only the match (-o) of the Perl-style regex (-P): one or more digits (\d+) at the start of the line (^) that are followed by a colon (the lookahead (?=:))
Regex is a very powerful domain specific language to match patterns in text and as a big data manager you WILL surely need it in many tasks
We will return back to regex and make recommendations in later sessions
Maybe we can get a more concise way to extract the number? We can revert back to sed for this:
head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | sed -n '/endYear/='
In grep you cannot return the line number and suppress the match at the same time. But in sed you can return only the line number of the match
Now in order to use this column position inside awk, we can save it into a variable called CN for "column number"
CN=$(head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | sed -n '/endYear/=')
echo $CN
In more complex code, inserting echo (or printf) statements is important for subsequent debugging
So we can save output of any command into a variable with this syntax
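For example (the variable name here is a throwaway, just for illustration):

```shell
# capture the output of a whole pipeline into a variable
rootcount=$(ls / | wc -l)
echo "There are $rootcount entries right under /"
```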
Getting a column with an index is trivial for awk:
awk -F "\t" '{ print $7}' $imdbdir/tsv2/title.basics.tsv | head
"-F" sets the field separator to tab character.
But awk has its own variables and we have to pass this CN variable into awk with the -v flag.
We can use the same variable name or a different one.
awk -v CN=$CN -F "\t" '{ print $CN }' $imdbdir/tsv2/title.basics.tsv | head
Note that the $CN inside the awk expression is different from the one we defined before. Variables of awk are distinct from those of the shell from which awk is invoked.
Combined steps are:
CN=$(head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | sed -n '/endYear/=')
awk -v CN=$CN -F "\t" '{ print $CN }' $imdbdir/tsv2/title.basics.tsv | head
Or in a single statement:
awk -v CN=$(head -1 $imdbdir/tsv2/title.basics.tsv | tr "\t" "\n" | sed -n '/endYear/=') \
-F "\t" '{ print $CN }' $imdbdir/tsv2/title.basics.tsv | head
Writing strictly one-liner code is not the best skill a coder should have.
Writing code that is easily understood by the coder h(er|im)self and collaborators, and easily debugged, is a more important skill
So I go for the two-liner version
But wait .... I refer to the filename twice! Suppose I had to do that many more times, referring to the same path again and again. Wouldn't it be better to save it into a variable and reuse?
filename=$imdbdir/tsv2/title.basics.tsv
CN=$(head -1 $filename | tr "\t" "\n" | sed -n '/endYear/=')
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | head
But for a more reusable code, we can also parametrize the pattern and the last command to get the head or tail
filename=$imdbdir/tsv2/title.basics.tsv
pattern=endYear
comnd=head
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | $comnd
And we can wrap all inside a shell function with parameters to be passed.
Suppose first parameter is the filename, second is the pattern and the last is the command (head or tail)
Parameters passed to the function can be referred by their positions as $1, $2 ...
# $imdbdir/tsv2/title.basics.tsv
# endYear
# head
headortail()
{
filename=$1
pattern=$2
comnd=$3
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | $comnd
}
Now we can reuse this code for many similar purposes. Let's start with our initial intent:
headortail $imdbdir/tsv2/title.basics.tsv endYear head
And get the last values of primaryName column from the name.basics.tsv file:
headortail $imdbdir/tsv2/name.basics.tsv primaryName tail
Note that we can also do the pattern match with awk, more powerfully than with sed or grep:
For example let's return the whole line where 7th field matches "endYear", prefixed with line number
filename=$imdbdir/tsv2/title.basics.tsv
awk -F "\t" '$7 == "endYear" { print NR": "$0 }' $filename
EXERCISE 9:
Write a function named linepattern that takes a file, a column pattern, a row pattern and returns the line number of match and the whole line
Note that you can pass multiple variables to awk by invoking "-v" multiple times, once for each variable
And string variables containing whitespace must be wrapped inside quotes
SOLUTION 9:
pass1=
encrypt="U2FsdGVkX18yoFtMKddz+N0MjU6twanensOn3zp1K/Bkv7lPtruYc23Ma6LoOjGk dlozrWW+rG3dgDq6TOYcgrpO1d4pIZomvyE01bWSDh1WXgXx0s+wTt8EPAIj/rg7 B5Nqcy4U/24rXJq59n6IkAOFrFFKCcDK3Z/D62FxI5al0tH5cyQVkWv+7NJSf+T7 Ctl3wN3Cjt4kHZXkL2YX7PfznQwOJb34BvWUd5Vl5XxTKmrbuS801+41XQdWUVJ5 kEXlxR3nmL3WPvLtfT/0Ux1VLjzJxzFtaUmvyDGigdoHIp/qXS+jv+I4pmXG8GWs"
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Let's call it for name.basics.tsv, primaryName and "Al Pacino"
linepattern $imdbdir/tsv2/name.basics.tsv primaryName "Al Pacino" | column -t
Now out of millions of rows, let's explore the unique values:
whatis sort
tldr sort
whatis uniq
tldr uniq
It is a coincidence that UNIQ is just next to our building!
filename=$imdbdir/tsv2/title.basics.tsv
pattern=endYear
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | sort | uniq
In order for "uniq" to work, stdin must be well sorted. But wait, sort command can also do uniq with "-u" flag:
filename=$imdbdir/tsv2/title.basics.tsv
pattern=endYear
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | sort -u
But for single-column values, that tall an output does not look good. We can get it arranged and printed in columns:
whatis pr
man pr
filename=$imdbdir/tsv2/title.basics.tsv
pattern=endYear
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | \
sort -u | pr -8 -t
EXERCISE 10:
Get the unique values of startYear column of title.basics.tsv but report only year values (just numbers)
You can reuse components from previous grep commands for regex
SOLUTION 10:
pass1=
encrypt="U2FsdGVkX19ZYi1B9oGWAmWy5lxS4X1HuF6mo6QMpjWeoHS34ZDvXidi0RTD1WpE wLZD8MkYgZmlD+t4jFS9iTtrrd5zXO74cehaQUVvquHcfFyI2F8MZOQpxvHdwEFj QP96UYpY9c+a/E/SuDbPEldCMrDS8Q/haRoaUKn4JWI51bbVhG3o0vwNjgVmBGKW xLYOEK/FHiVpKcuxb2u0THLrKquzIFddJkkOYXqLpjYQ8yxiL27D6eMLlcpc/+aQ QQObeW1jV5D0p7ewvG85Px6w0U9R3SFAA27RUnq+wmY="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
The dataset stretches far back!
Now let's say the counts of unique values are also of importance:
Sort in reverse order in terms of counts (only years)
counts=$(filename=$imdbdir/tsv2/title.basics.tsv
pattern=startYear
CN=$(head -1 $filename | tr "\t" "\n" | sed -n "/$pattern/=")
awk -v CN=$CN -F "\t" '{ print $CN }' $filename | \
grep -Po "\d+" | sort | uniq -c)
echo "$counts" | sort -nr | column -c 100
And let's sort in increasing year order:
echo "$counts" | sort -n -k 2 | column -c 100
And how about a simple plot without leaving the terminal/shell?
echo "$counts" | sort -n -k 2 | \
gnuplot -e "set terminal dumb; plot '-' using 2:1 w points pt '*'"
And let's just focus on the period between 1980 and 2017:
echo "$counts" | sort -n -k 2 | awk -F " " '$2 <= 2017 && $2 >= 1980' | \
gnuplot -e "set terminal dumb; plot '-' using 2:1 w points pt '*'"
Suppose we use the awk code for filtering on multiple criteria:
filename=$imdbdir/tsv2/title.basics.tsv
head -1 $filename
Let's exclude the first two columns, keep the header, and filter for titles matching "Godfather" in primaryTitle with startYear after 1972 and before 1990:
CN1=$(head -1 $filename | tr "\t" "\n" | sed -n "/primaryTitle/=")
CN2=$(head -1 $filename | tr "\t" "\n" | sed -n "/startYear/=")
{ head -1 $filename | awk -F "\t" 'BEGIN {OFS = "\t"} { $1=$2="";print $0 }'; \
awk -v CN1=$CN1 -v CN2=$CN2 -F "\t" \
'BEGIN {OFS = "\t"} $CN1 ~ /Godfather/ && $CN2 > 1972 && $CN2 < 1990 { $1=$2="";print $0 }' $filename; } | \
column -t -s $'\t'
OFS is the output field separator, set so the output formats better for the column command
We delete the first two columns by setting them to the empty string
And we combine the header and the filtered rows with { line1; line2; } | further_commands, so they are formatted together by column
We combine multiple filters with &&
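A compact toy illustration of the column-deletion trick (note that the emptied fields keep their separators, so the leading tabs remain):

```shell
# blank out the first two fields; awk rebuilds $0 using OFS between all fields
printf 'a\tb\tc\td\n' | awk -F "\t" 'BEGIN { OFS = "\t" } { $1 = $2 = ""; print $0 }'
# prints a line starting with two tabs, then: c<TAB>d
```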
EXERCISE 11:
Delete the first 2 and isAdult columns, keep the header. Filter for movies that have a runtime less than 10 minutes and not "\N", whose genres include thriller, and whose start year is after 1990 and not "\N". Note that you have to escape the escape character in "\N" as such: "\\N"
Run head -1 for once and save in a variable
SOLUTION 11:
pass1=
encrypt="U2FsdGVkX1+S79PKqpQJiVEf5D6Y/snYFkTIjRPI+v/ZbkuGO+wB6rmkGH5AHHUg 7aEK+wO93DX02gFwYjkltP3OTIxg7wxJFjAfRYGMUriQK/DbcPMqlWRyAZWp2pXo vrI0smeerjV5d4wyMQDe33lYvbfFq6sC1cZvbmbQeOxv4ET8O+BapcDVt4ULXYC+ B75Uc1g0PrqM9fewr0IqQX+xCaqAAn4j8ly5nxhxVGgneTrErUeO6QP9oEx1FCvz vFXhCdm1uyqKtBOO8GB/OUQ5Mz6PvlNM87RNNuNhmrhnsjwqw97eC8uLQKssASor mi/5t9NebnQgHAf0IShq0L/KawPfnm/KmYXjb2DqSSu+9OM1ZyxpDAgjveD/TNbR C0Re7OgVfi6kg22LiZpRiHks4+clqnkjF6cCA54dFu5Qf49orXJNtB7HOakorRoD 0puOPrWRy754JLxLzgKnnE7B6+OJ99uFWQGLXuT5RKfdtBc6wWjXZkSNPVoK8vTV cihK+YKQum03N/kkZ6cxtTRnT+ZCG5zyZBOFxOS+w0qt9NeYd1wbG9thr/CSvkzJ dgycZl31MeOXzqUatZsQg9x13Aq6NVBusM8u1l2fw6bFCDlmSHdHyEp+UsLhfaTK GvDyAy8VI2AZQDIWTDoxbU6DdLdrWgsUT7TkImjFjn7WCFpRqS6OAAyYBEYfuko9 CH0VWfwJYiCs3O2HXd3p/V/aqPPs1z9ZitI+IvzodAjLRC2+YST5t3QKxVBKsf1y wqTuDpOOHMXmIYjChKD/6vfalCYxD7uza+DKVoPCUX0="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
Even very strict filters can yield results!
More complex queries involving joins among tables are beyond the intent of the awk language. And referring to even more column names complicates the code.
Now we need to switch to SQL
But first we have to process the data to better fit PostgreSQL database server
Let's remember the fields of files again:
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 'printf "%s\n" {}; head -2 {} | column -t; echo'
See that the field that starts with "tt" appears in all but one of the files. This field is the unique code for each title
And the field that starts with "nm" appears in title.principals, title.crew and name.basics; it is the unique code for each person involved in movies (actor, director, or any other profession)
We can join the tables based on these common fields to create more complex results
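To preview what such a join looks like, here is a minimal sketch with the coreutils join command on made-up keys (real joins on these files would first need sorting and more care with missing keys):

```shell
#!/usr/bin/env bash
# Two toy tables sharing a tconst-like key in the first column (made-up data)
printf 'tt0000001\tTitleA\ntt0000002\tTitleB\n' > /tmp/toy_titles.tsv
printf 'tt0000001\tnm0000009\ntt0000002\tnm0000008\n' > /tmp/toy_cast.tsv

# join matches rows of two sorted files on their first field
join -t $'\t' /tmp/toy_titles.tsv /tmp/toy_cast.tsv
```

Each output row carries the key once, followed by the remaining fields of both files.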
But we have a problem! The principalCast field in the title.principals file combines all principal people of a movie into a single comma-separated field!
We would expect a separate row for each unique combination of a title and a person
Each tool has its own strengths. The shell was powerful until now, but here we need something specifically designed to transform data: R!
data.table is an expressive, efficient and very powerful package designed for handling very large datasets, with a syntax whose components correspond to SQL clauses:
When the dropdown box on the right shows "R", it means we are now sending our commands to the R interpreter, not to Bash
First load the library:
library(data.table)
Set the main directory path
prefix <- "~/data/imdb"
prefix
And the file path
in_path <- sprintf("%s/tsv2/title.principals.tsv", prefix)
in_path
Fast read the file into R using data.table:
principals <- data.table::fread(in_path)
And split and melt from the second column. Note that since the resulting object would be 25+ million lines, we limit the number of processed lines here in order to save time on Binder:
principals_molten <- principals[1:1e5, {
principalCast_list = strsplit(principalCast, ",");
.(
tconst = rep(tconst, sapply(principalCast_list, length)),
principalCast = unlist(principalCast_list)
)
}
]
What does it do?
principalCast_list = strsplit(principalCast, ",");
creates an intermediate object, an R list, each item of which is a character vector created from each original row by splitting the second field on commas
tconst = rep(tconst, sapply(principalCast_list, length))
for each list item, replicates the tconst field by the count of principalCast entries - one for each person
principalCast = unlist(principalCast_list)
Now that we have a vector of tconst - title code - that is replicated enough for the number of principal cast in a title, we can combine all list items into a long vector
Let's see it in action using only the first two lines:
In the first (second) row there are three (two) principal cast members concatenated with commas:
principals12 <- principals[1:2]
principals12
Now we split each row of the second column on commas:
principalCast_list = strsplit(principals12[,principalCast], ",")
principalCast_list
To view the list better:
listviewer::jsonedit(principalCast_list, mode = "form")
We want to end up with 5 rows: 3 + 2
lengths <- sapply(principalCast_list, length)
lengths
In the first title there are 3 principal cast members, and in the second title 2
tconst <- rep(principals12[,tconst], lengths)
tconst
Now the title codes are replicated with the relevant counts
principalCast <- unlist(principalCast_list)
principalCast
And the list is simplified into a single vector. tconst and principalCast are both of length 5, thus we can combine them together
combined <- list(tconst = tconst, principalCast = principalCast)
listviewer::jsonedit(combined, mode = "form")
And a list of equal-length vectors can be converted to a data.frame or, better, a data.table:
principals12_molten <- as.data.table(combined)
principals12_molten
Compare initial and final objects:
principals12
principals12_molten
lapply(list(principals[1:1e5], principals_molten), dim)
If we had run the code on the whole dataset, we would have started with a data.table of 4 million rows and ended up with one of 25 million rows in a matter of seconds on a powerful enough computer!
The single data.table statement combines all these steps and repeats them for all rows. The initial and final dimensions are compared above.
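For comparison, the same split-and-replicate step can be sketched in plain awk on a toy two-column file (the file path and rows below are made up):

```shell
#!/usr/bin/env bash
# A toy version of title.principals: tconst TAB comma-separated cast (made-up rows)
printf 'tt0000001\tnm1,nm2,nm3\ntt0000002\tnm4,nm5\n' > /tmp/toy_principals.tsv

# One output row per (tconst, person) pair, like the data.table melt above
awk -F "\t" 'BEGIN { OFS = "\t" }
{
  n = split($2, cast, ",")   # split the cast field on commas
  for (i = 1; i <= n; i++)   # replicate tconst once per person
    print $1, cast[i]
}' /tmp/toy_principals.tsv
```

Two input rows become five output rows (3 + 2).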
Now create a filename for output:
out_path <- sprintf("%s/tsv2/title.principals_melt_short.tsv", prefix)
out_path
And fast write the data.table into a tsv file:
fwrite(principals_molten, file = out_path, sep = "\t") # fast write new DT as tsv
Now back to bash, let's view the new data file:
head -20 $imdbdir/tsv2/title.principals_melt_short.tsv | column -t
Bonus: The parallel code below does the same thing but far less efficiently, so don't try it (the while false loop is there so that you do not execute the code). It is here for demonstration purposes only. You may skip it.
The shell version is far less concise and efficient than the R code
while false;
do
head -1 $imdbdir/tsv2/title.principals.tsv; tail -n+2 $imdbdir/tsv2/title.principals.tsv | \
parallel -j0 "line={}; \
tc=\${line%%$'\t'*}; \
nm=\${line##*$'\t'}; \
nml=\${nm//,/$'\n'}; \
ln=\$(echo -n \"\$nml\" | grep -c ^); \
paste <(yes \$tc | head -n \$ln) <(echo -n \"\$nml\") --delimiters \"\t\"" \
> $imdbdir/tsv2/title.principals_melt_shell.tsv
done
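The parameter and process substitution forms used above can be tried directly on a sample line (outside parallel there is no need for the backslash escapes; the IDs below are made up):

```shell
#!/usr/bin/env bash
line=$'tt0000001\tnm1,nm2,nm3'   # a sample tab-separated input line

tc=${line%%$'\t'*}    # delete from the first tab onward -> the tconst part
nm=${line##*$'\t'}    # delete up to the last tab -> the comma-separated nm part
nml=${nm//,/$'\n'}    # replace every comma with a newline
ln=$(echo -n "$nml" | grep -c ^)   # count the resulting lines

# Process substitution pairs the replicated tconst with each nm
paste <(yes "$tc" | head -n "$ln") <(echo "$nml") --delimiters $'\t'
```

This prints three tab-separated rows, one per nm, each starting with tt0000001.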
Inside the command to be executed by parallel:
{} stands for the input line
line={} assigns the input to the $line variable
tc=\${line%%$'\t'*} deletes the part including and after the tab character, so only the tconst portion remains. This uses a more advanced shell scripting topic called "parameter substitution"
nm=\${line##*$'\t'} deletes the part including and before the tab character, so only the nm part is left. Again parameter substitution
nml=\${nm//,/$'\n'} replaces the commas between the nm's with newlines, practically splitting them into separate lines
ln=\$(echo -n \"\$nml\" | grep -c ^) counts the number of lines of nm's, i.e. the number of times tconst should be repeated
yes \$tc | head -n \$ln repeats tconst exactly as many times as there are nm's
paste <(yes \$tc | head -n \$ln) <(echo -n \"\$nml\") --delimiters \"\t\" combines the two columns of tc and nm into tab-separated text. This uses a more advanced shell scripting topic called "process substitution"
For the subsequent code, it is better to clean the memory allocated to R, for better performance:
rm(list =ls())
EXERCISE 12:
In R, create a file title.genres_short.tsv under tsv2 in which there are two columns from title.basics.tsv, tconst and genres: genres split from commas and tconst replicated accordingly. Take only the first 1e5 (100k) lines. Compare the dimension of input and output objects/files
SOLUTION 12:
pass <- readline(prompt = "Please enter the password for the solution: ")
encrypt <- "U2FsdGVkX1+bbnopeIKcv3xn7U0pmH8WCpExJzguv8sSJBF+o0ipnhkCqyXVydjb 6RSgGNjWN/dEVp9Fq2Ht/B8v2rWKSEyOwtlfBQ0l4ecTRBGYWmK0AQRINUaTIXZS 2MCRKFA0fxjp9bENNr1hcldd68IcWYSLLnzVto7TFrBD4rmSSGXyedb4+9VBQafX FaY1C0EHuGsA0YW7/F5QXrv164or/Shm6+Zi1aaEpnl0HHviHdC1n5hXeiIyD6g7 iEHCsdVm68V1L1SI3ALHbf9nOnV56+Hgp59I5slx0N/DMmLtUumQ9pk8A0Lw8mAk 1ojJp2QzG5r89iZ58fVtxj5VDjr/pnQSKHBzfH0GRF+JHIKS90OAmcKzHh/Two6q Ckv0e101B5cargbkk5EKRwYmNqUzTBIzIqqCrqhzbGJS/z9maAHSmSTgqyFYxLnR QgMvuCfGHNiq0O+WR5F7piDVdmpHzG59yD4A/2lZMSe01jppyBXjW/+E5PocIie/ L9E3wxl/ObiRGc6dwCQ1W4DiaEqZJxUpraqANWLJ8Yc7dsPEOB4GBuZnOYP9WGYR sNdJ91nqhJb5eFmbzcOQRBJzEt4onc0dqBdZC3qZTTJzJ7WLShIwr35tOkcrtDxn zbHuFR7j4x4cUVHZcA3UVlG5QrKegLh64jA0UGjUML+Aq5hzjaCcJ+GQeOjkMjHk iLEgMvz+YdFuo9txUm9J8PlR+iOOLz5yI8xNU5YVLZmYvzXFSnGYwQKoAINmlY/7 5T0kAOYVi8HHtgNedmp0Ivs4i5hYPHwbKRDJ5Eh7ews="
solution <- system(sprintf("echo %s | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:%s 2> /dev/null", encrypt, pass), intern = T, ignore.stderr = T)
cat(solution, sep = "\n")
eval(parse(text = solution))
Let's view the output file from Bash:
head -20 $imdbdir/tsv2/title.genres_short.tsv | column -t
Now, in order to import the tsv's into PostgreSQL, we need to trim the header rows. With csv's, this step is not necessary
We will use tail for this purpose, let's see..
cat $imdbdir/tsv2/title.basics.tsv | wc -l | numfmt --to=si
The title.basics file has 4.6 million rows! More than most GUI text editors or spreadsheets can handle
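numfmt --to=si scales a raw count by powers of 1000 and appends a suffix; a quick check on a bare number:

```shell
# 4600000 is printed with an SI suffix
echo 4600000 | numfmt --to=si
```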
File starts like that:
head $imdbdir/tsv2/title.basics.tsv
And after tail:
tail -n+2 $imdbdir/tsv2/title.basics.tsv | head
We can redirect it to a new file
mkdir -p $imdbdir/tsv3
tail -n+2 $imdbdir/tsv2/title.basics.tsv > $imdbdir/tsv3/title.basics2.tsv
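On a toy stream, the effect of tail -n+2 is easy to verify:

```shell
# tail -n+2 prints everything from the 2nd line on, dropping the header
printf 'header\nrow1\nrow2\n' | tail -n+2
```

The header line disappears and both data rows remain.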
Now do it for all files:
The if clause checks whether the output file already exists; if it does, that file is skipped
basename extracts the filename from the path
${basenm/.tsv/2.tsv} inserts a 2 before .tsv
find $imdbdir/tsv2 -mindepth 1 | \
parallel -k -j0 "basenm=\$(basename {});
if [ ! -e ${imdbdir}/tsv3/\${basenm/.tsv/2.tsv} ];
then
tail -n+2 {} > \
${imdbdir}/tsv3/\${basenm/.tsv/2.tsv};
fi
"
Now it is time to create a database on PostgreSQL server and import the data into!
There are some other utilities that you might find useful while working on *nix systems (Linux, macOS, FreeBSD, even Android)
To get details on ongoing processes on the operating system:
whatis ps
tldr ps
For example, list "python" processes:
ps aux | grep python
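One caveat when filtering ps output: the grep process itself matches its own pattern and shows up in the list. A common trick is to wrap one character of the pattern in brackets; the bracket expression still matches "python", but the literal command line containing "[p]ython" no longer matches itself:

```shell
# "[p]ython" matches the text "python" but not the string "[p]ython"
ps aux | grep "[p]ython"
```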
Get information on the hostname of the computer you are logged into:
whatis hostname
tldr hostname
hostname
"ls" does not report the recursive size of directories, for this you may use du (disk usage)
whatis du
tldr du
du -sh $imdbdir/*
To get information on mounted file systems, use df:
whatis df
tldr df
To list mounted file systems, with file types and human readable sizes:
df -hT
To mount/unmount drives on the root filesystem:
whatis mount
tldr mount
whatis umount
tldr umount
To check the integrity of a filesystem:
whatis fsck
tldr fsck
To get superuser privileges as the current user (if allowed) or execute a command as another user:
whatis sudo
tldr sudo
And to get a dynamic overview of the system resources, utilization and processes: top (by default) or better htop (to be installed separately)
whatis htop
tldr htop
Let's open htop on our terminal (press q to quit):
scr=$(screen -ls | grep -P "Attached" | head -1 | grep -Po "^\t+.+?(?=\s|\t)" | tr -d "\t")
screen -S $scr -X stuff 'htop\n'
For a very versatile, efficient, lightweight and extensible text and code editor, vim is the tool of power users:
whatis vim
tldr vim
scr=$(screen -ls | grep -P "Attached" | head -1 | grep -Po "^\t+.+?(?=\s|\t)" | tr -d "\t")
screen -S $scr -X stuff 'vim\n'
There are also many networking tools in *nix that make it easy and efficient to script and automate networking tasks and to monitor network activity, such as:
ip, ifconfig, route, netstat, nmap, ping, bwm-ng, nethogs, iftop
To get and parse data from the network useful tools are:
curl, wget, lynx, w3m, elinks. You can also use chrome headless from the command line
In order to induce you to "dirty your hands" with the shell tools, here is a little assignment that will count as 5 additional points if done right:
The great "sed" tool has a simple way to extract arbitrary lines by line number from a text stream:
For example, extract the 10th to 20th lines from input:
seq 100 | sed -n "10,20p"
sed can both read from stdin (input from the pipe) and a file as the last parameter
tldr sed
While seq creates a sequence of numbers:
whatis seq
tldr seq
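Line ranges in sed can also come from shell variables, for example:

```shell
# Extract lines $start through $end from a stream
start=3; end=5
seq 10 | sed -n "${start},${end}p"
```

This prints the numbers 3, 4 and 5, one per line.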
Your task is to write a wrapper function called windowsearch that takes four parameters:
SOLUTION BONUS 1:
pass1=
encrypt="U2FsdGVkX1/Y9ExlbpCUNWw72NzPn6D9LH0B3yzGkj3GMYpzObI/hn2Tq+EE+Iy3 7jJWVnuYZf1v4/e4tvoV/UaFzwtrXYmJOdUj86dm93vC5yTkzzL13USrujOKSwGD mFM+ZWikVnFmGWHBi+NaCvbrQ/0Bl+96dmp1Z7zPv6JvApA5DWLBNarG8il+Oc+3 fRdUkV0BscSY9GoWWloMTj/QvCOdfG7xfIhCe4JSCkI="
solution=$(echo $encrypt | openssl enc -md sha256 -aes-128-cbc -a -d -salt -pass pass:$pass1 2> /dev/null)
echo "$solution"; echo
for l in "$solution"; do eval "${l}"; done
To check your function, invoke it with the following parameters:
windowsearch 10 20 2 $imdbdir/tsv2/title.principals.tsv
should yield:
tt0000009 nm0085156,nm0063086,nm1309758,nm0183823
tt0000014 nm0166380,nm0525910,nm0244989
tt0000010 nm0525910
tt0000016 nm0525910
tt0000012 nm0525910,nm0525908
tt0000015 nm0721526
tt0000018 nm0804434,nm3692071
tt0000019 nm0932055
tt0000013 nm1715062,nm0525910,nm0525908
tt0000017 nm3691272,nm0804434,nm1587194,nm3692829
tt0000011 nm3692297,nm0804434
Follow the instructions in the instructor message on how to submit your solution