Getting started with Bash Notebooks¶

Notebooks can be loaded with different underlying kernels: Bash, Python and R. Notebooks are useful for documenting interactive data analysis. They combine code cells with Markdown cells. A Markdown cell can contain text, math or headings.

You can create new Bash notebooks using the "New" dropdown list in the Jupyter File Browser and then selecting "Bash". Existing notebooks open when you click on them.

In Jupyter notebooks, you work with cells. You can create new cells, or insert them above or below existing cells, using the items in the Insert menu. Use the dropdown list in the Jupyter command bar to change the type of a cell. The two main types we're going to use are Markdown and Code. Markdown cells are useful for documentation, Code cells for running code. Markdown cells can be edited by double-clicking into them. Render them by pressing Shift-Enter.

Code cells are used to enter and execute code. Let's look at some examples.

We can first check which directory we are in, using the pwd (= Print Working Directory) command:

In [1]:

pwd

/Users/schiffels/dev/popgen_course

OK, so we're in the dev/popgen_course subfolder within my home folder /Users/schiffels. We can list the contents of that folder:

In [2]:

ls

03_Rmd_smartpca.Rmd
03_bashnb_smartpca.ipynb
04_Rmd_plotting_pca.Rmd
04_pynb_plotting_pca.ipynb
05_Rmd_fstatistics.Rmd
05_pynb_fstatistics.ipynb
0_Welcome.ipynb
1A_short_primer_on_jupyter.ipynb
1B_getting_started_with_bash_notebooks.ipynb
1C_getting_started_with_python_notebooks.ipynb
1D_getting_started_with_R_notebooks.ipynb
README.md
adm_f3_param.txt
adm_f3_popfile.txt
f3_outgroup_stats_Han.txt
f3_outgroup_stats_MA1.txt
f4_param.txt
f4_popfile.txt
img
outgroup_f3_param_Han.txt
outgroup_f3_param_MA1.txt
outgroup_f3_popfile_Han.txt
outgroup_f3_popfile_MA1.txt
pca.AllEurasia.eval
pca.AllEurasia.evec
pca.AllEurasia.params.txt
pca.WestEurasia.eval
pca.WestEurasia.evec
pca.WestEurasia.params.txt
population_frequencies.txt
supp
test
testDir

We can now create a new directory:

In [3]:

mkdir testDir

mkdir: testDir: File exists

and change into that directory:

In [4]:

cd testDir

and confirm that we are now in the new dir:

In [5]:

pwd

/Users/schiffels/dev/popgen_course/testDir

OK, let's go back and delete the subfolder again:

In [6]:

cd ..
rm -r testDir
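The "File exists" message above appeared because testDir was already present when mkdir ran. As a minimal sketch of the same create, enter, leave and delete cycle that can be re-run safely, the cell below works in a throwaway directory and uses mkdir -p, which does not complain about an existing directory. The variable names start_dir, workdir and inside are chosen for illustration only.

```shell
# Remember where we started so we can return there afterwards.
start_dir=$PWD

# Work inside a throwaway directory so nothing in the course folder is touched.
workdir=$(mktemp -d)
cd "$workdir"

# -p makes mkdir succeed silently even if the directory already exists.
mkdir -p testDir
mkdir -p testDir   # second call: no "File exists" error

cd testDir
inside=$(basename "$PWD")   # name of the directory we are now in
echo "$inside"

# Go back up and remove the subfolder again.
cd ..
rm -r testDir

# Return to the starting directory and clean up.
cd "$start_dir"
rm -r "$workdir"
```

Note that rm -r deletes a directory and everything in it without asking, so it is worth double-checking the path before running it.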

Here is a simple example of how to use echo:

In [7]:

echo "Hello, how are you?"

Hello, how are you?

OK, so let's try some more useful things with grep, which can be used to filter large text files by searching for patterns, in this case just the occurrence of the word "French":

In [9]:

grep French example_data/example.ind

           HGDP00511 M     French
           HGDP00512 M     French
           HGDP00513 F     French
           HGDP00514 F     French
           HGDP00515 M     French
           HGDP00516 F     French
           HGDP00517 F     French
           HGDP00518 M     French
           HGDP00519 M     French
           HGDP00522 M     French
           HGDP00523 F     French
           HGDP00524 F     French
           HGDP00525 M     French
           HGDP00526 F     French
           HGDP00527 F     French
           HGDP00528 M     French
           HGDP00529 F     French
           HGDP00531 F     French
           HGDP00533 M     French
           HGDP00534 F     French
           HGDP00535 F     French
           HGDP00536 F     French
           HGDP00537 F     French
           HGDP00538 M     French
           HGDP00539 F     French
     SouthFrench3326 M     French
     SouthFrench3947 M     French
     SouthFrench1323 M     French
     SouthFrench3951 M     French
     SouthFrench3068 M     French
     SouthFrench1112 M     French
     SouthFrench4018 M     French

Alright, so that lists all French individuals in that list. Now let's count them, by simply passing the flag -c:

In [14]:

grep -c French example_data/example.ind

32
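One thing to keep in mind is that grep matches the pattern anywhere in a line, not just in the population column. In the listing above this makes no difference, since every matching line has French as its population. The sketch below shows how the two can differ, using a tiny stand-in file in which the last row (Demo_001, FrenchDemo) is invented purely for illustration, and uses awk to compare the third column exactly.

```shell
# Small stand-in for example.ind; the last row is invented for illustration.
sample=$(mktemp)
printf '%s\n' \
  'HGDP00511 M French' \
  'SouthFrench3326 M French' \
  'Yuk_009 M Yukagir' \
  'Demo_001 F FrenchDemo' > "$sample"

# grep -c counts every line that contains the substring "French".
substring_count=$(grep -c French "$sample")

# awk compares the whole third field, so only exact matches are counted.
exact_count=$(awk '$3 == "French"' "$sample" | wc -l | tr -d ' ')

echo "substring matches: $substring_count"
echo "exact matches:     $exact_count"

rm "$sample"
```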

*Note:* So far we have seen the pwd, ls, mkdir, cd, rm, echo and grep commands. If you want to find out more about them, search for them online or read their manual pages (for example with man grep). They are among the most popular and widely used commands in Unix.

In Python3 notebooks you can plot things: Create a new python3 notebook, and run this boilerplate code in the first cell:

%matplotlib inline
import matplotlib.pyplot as plt

Then plot something, opening a second cell:

*Exercise:* Create a simple plot using plt.plot([1, 2, 3], [5, 2, 6])

Bash Pipes¶

OK. So this first notebook runs on Bash, which is more or less the lingua franca of Linux operating systems. Most of what you do on a Linux command line goes through Bash or a similar shell. One of the most useful techniques in Bash scripting is the Unix pipe. To illustrate it, consider the following.

Let's look at the structure of our ind file:

In [15]:

head example_data/example.ind

             Yuk_009 M    Yukagir
             Yuk_025 F    Yukagir
             Yuk_022 F    Yukagir
             Yuk_020 F    Yukagir
               MC_40 M    Chukchi
             Yuk_024 F    Yukagir
             Nesk_25 F Eskimo_Naukan
             Yuk_023 F    Yukagir
               MC_16 M    Chukchi
               MC_15 F    Chukchi

*Note:* By default, the head command lists the first 10 lines of a file.

Let's extract the population column:

In [16]:

head example_data/example.ind | awk '{print $3}'

Yukagir
Yukagir
Yukagir
Yukagir
Chukchi
Yukagir
Eskimo_Naukan
Yukagir
Chukchi
Chukchi

*Note:* The awk program is one of the most powerful programs for text-file processing in the Unix world. It is actually a full-fledged programming language in itself. Here we only use it in one of its simplest forms. The program {print $3} simply says "For every line of the input, print the third field".

*Note:* The pipe symbol | tells Unix to redirect the output of the program to its left into the program to its right as standard input.
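To see a pipe in isolation, here is a small sketch that feeds three rows (copied from the head output above) into awk, once through a pipe and once through a temporary file. Both routes give the same result; the pipe simply avoids the intermediate file. The variable names are chosen for illustration.

```shell
# Three rows from the head output, fed in via printf instead of a file.
populations=$(printf '%s\n' \
  'Yuk_009 M Yukagir' \
  'MC_40 M Chukchi' \
  'Nesk_25 F Eskimo_Naukan' | awk '{print $3}')

echo "$populations"

# The same result without a pipe needs a temporary file as the go-between.
tmp=$(mktemp)
printf '%s\n' \
  'Yuk_009 M Yukagir' \
  'MC_40 M Chukchi' \
  'Nesk_25 F Eskimo_Naukan' > "$tmp"
populations_via_file=$(awk '{print $3}' "$tmp")
rm "$tmp"
```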

Let's sort the output (notice that we now use cat instead of head at the start, and use head at the end):

In [17]:

cat example_data/example.ind | awk '{print $3}' | sort | head

Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Adygei

OK, so depending on your setup you may see some error messages at the end, because head stops reading after ten lines and ungracefully discards the rest of the data, but that's OK.

Now let's use uniq to get rid of population name duplicates:

In [18]:

cat example_data/example.ind | awk '{print $3}' | sort | uniq | head

Abkhasian
Adygei
Albanian
Aleut
Aleut_Tlingit
Altaian
Ami
Armenian
Atayal
Balkar

And now let's count:

In [19]:

cat example_data/example.ind | awk '{print $3}' | sort | uniq | wc -l

OK, so there are 120 populations in the dataset. And how many individuals?

In [20]:

wc -l example_data/example.ind

    1371 example_data/example.ind

So 1371 individuals in 120 populations, which is a bit more than 11 per population on average. Good to know!
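The average can also be computed directly in the shell. Bash itself only does integer arithmetic, so this sketch hands the division to awk. The two numbers are the ones reported above; the variable names are illustrative.

```shell
n_individuals=1371
n_populations=120

# awk handles floating-point division; LC_ALL=C keeps the decimal point a dot.
average=$(LC_ALL=C awk -v n="$n_individuals" -v p="$n_populations" \
  'BEGIN { printf "%.1f", n / p }')

echo "$average"   # prints 11.4
```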

*Note:* we learned some new Unix commands: awk, cat, head, sort, uniq and wc.

As a final step, let's modify our pipeline to output not just the unique populations, but also the number of individuals per population. Fortunately this is extremely easy, since the flag -c to the uniq command already does the job:

In [21]:

cat example_data/example.ind | awk '{print $3}' | sort | uniq -c | head

   9 Abkhasian
  16 Adygei
   6 Albanian
   7 Aleut
   4 Aleut_Tlingit
   7 Altaian
  10 Ami
  10 Armenian
   9 Atayal
  10 Balkar
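Two details of this pipeline are worth a quick sketch: uniq only merges adjacent duplicates, which is why sort has to come first, and a second sort -rn on the counts ranks the populations by size. The labels below are a tiny made-up sample (not the course data), and the variable names are illustrative.

```shell
# Population labels only; a tiny made-up sample, not the course data.
labels=$(printf '%s\n' Adygei Abkhasian Adygei Albanian Adygei Abkhasian)

# Sorting first puts identical labels next to each other for uniq -c.
counts=$(echo "$labels" | sort | uniq -c)
echo "$counts"

# A numeric reverse sort on the counts puts the largest population first.
largest=$(echo "$labels" | sort | uniq -c | sort -rn | head -n 1 | awk '{print $2}')
echo "$largest"

# Without the first sort, uniq -c reports runs of adjacent lines, not totals.
runs=$(echo "$labels" | uniq -c | wc -l | tr -d ' ')
echo "$runs"
```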

Nice. Let's put that list into a file that we can then import for plotting later.

In [21]:

cat example_data/example.ind | awk '{print $3}' | sort | uniq -c > population_frequencies.txt

OK, we have created a new file called population_frequencies.txt in our current directory. We have used the bash redirection symbol > to write the output of a command or pipeline into a file. The file should now contain the number of individuals per population. We can check this by running:

In [22]:

head population_frequencies.txt

      9 Abkhasian
     16 Adygei
      6 Albanian
      7 Aleut
      4 Aleut_Tlingit
      7 Altaian
     10 Ami
     10 Armenian
      9 Atayal
     10 Balkar
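A short sketch of how redirection behaves: > creates the file or overwrites its previous content, so re-running the pipeline above replaces the old file. The related operator >> appends instead. The temporary file and variable names here are for illustration only.

```shell
outfile=$(mktemp)

echo "first" > "$outfile"     # > creates the file or truncates it
echo "second" > "$outfile"    # the earlier content is gone
after_overwrite=$(cat "$outfile")
echo "$after_overwrite"

echo "third" >> "$outfile"    # >> appends instead of overwriting
line_count=$(wc -l < "$outfile" | tr -d ' ')
echo "$line_count"

rm "$outfile"
```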

OK, it seems to have worked. If you want to look at the file in a more interactive way, go back to your Jupyter File Browser and click on the file, which you should now see within your working directory. The file should open in a text editor that you can use to scroll around.

OK, now that we have a file to plot, let's try it out using a new python3 notebook. See the next notebook, called 02_pynb_getting_started in this series.