Notebooks can be loaded with different underlying kernels: bash, Python and R. Notebooks are useful for documenting interactive data analysis. They combine code cells with markdown cells. A markdown cell can contain text, math or headings.
You can create new bash notebooks using the "New" dropdown list in the Jupyter File Browser and then selecting "Bash". Existing notebooks open when you click on them.
In Jupyter notebooks, you work with cells. You can create new cells, or insert them above or below existing cells, using the items in the Insert menu. Use the dropdown list in the Jupyter command bar to change the type of a cell. The two main types we're going to use are Markdown and Code. Markdown cells are useful for documentation, Code cells for running code. To edit a Markdown cell, double-click into it; to render it, press Shift-Enter.
Code cells are used to enter and execute code. Let's look at some examples.
We can first check which directory we are in, using the pwd (= print working directory) command:
pwd
/Users/schiffels/dev/popgen_course
OK, so we're in the dev/popgen_course subfolder within my home folder /Users/schiffels. We can list the contents of that folder:
ls
03_Rmd_smartpca.Rmd 03_bashnb_smartpca.ipynb 04_Rmd_plotting_pca.Rmd 04_pynb_plotting_pca.ipynb 05_Rmd_fstatistics.Rmd 05_pynb_fstatistics.ipynb 0_Welcome.ipynb 1A_short_primer_on_jupyter.ipynb 1B_getting_started_with_bash_notebooks.ipynb 1C_getting_started_with_python_notebooks.ipynb 1D_getting_started_with_R_notebooks.ipynb README.md adm_f3_param.txt adm_f3_popfile.txt f3_outgroup_stats_Han.txt f3_outgroup_stats_MA1.txt f4_param.txt f4_popfile.txt img outgroup_f3_param_Han.txt outgroup_f3_param_MA1.txt outgroup_f3_popfile_Han.txt outgroup_f3_popfile_MA1.txt pca.AllEurasia.eval pca.AllEurasia.evec pca.AllEurasia.params.txt pca.WestEurasia.eval pca.WestEurasia.evec pca.WestEurasia.params.txt population_frequencies.txt supp test testDir
We can now create a new directory:
mkdir testDir
mkdir: testDir: File exists
and change into that directory:
cd testDir
and confirm that we are now in the new dir:
pwd
/Users/schiffels/dev/popgen_course/testDir
OK, let's go back and delete the subfolder again:
cd ..
rm -r testDir
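The "File exists" message above appeared because testDir was already present in the folder (it shows up in the ls listing). Below is a minimal sketch of the same create, enter, leave and delete cycle that can be re-run safely. It assumes a scratch directory made with mktemp (not part of the original walkthrough) and uses the -p flag of mkdir, which suppresses the error when the directory already exists.

```shell
# Assumption: we work in a scratch directory made by mktemp, not in the course folder.
scratch=$(mktemp -d)
cd "$scratch"

mkdir -p testDir      # -p: no error if testDir already exists
mkdir -p testDir      # so running it a second time is harmless

cd testDir
inside=$(basename "$(pwd)")   # last component of the current path
echo "now in: $inside"

cd ..
rm -r testDir         # remove the directory and anything in it

# Confirm that the directory is gone.
if [ -d testDir ]; then remaining=yes; else remaining=no; fi
echo "testDir still there? $remaining"
```

Be careful with rm -r in real folders: it deletes the directory and its contents without asking.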
Here is a simple example of how to use echo:
echo "Hello, how are you?"
Hello, how are you?
OK, so let's try some more useful things with grep, which can be used to filter large text files by searching for patterns, in this case just the occurrence of the word "French":
grep French example_data/example.ind
HGDP00511 M French
HGDP00512 M French
HGDP00513 F French
HGDP00514 F French
HGDP00515 M French
HGDP00516 F French
HGDP00517 F French
HGDP00518 M French
HGDP00519 M French
HGDP00522 M French
HGDP00523 F French
HGDP00524 F French
HGDP00525 M French
HGDP00526 F French
HGDP00527 F French
HGDP00528 M French
HGDP00529 F French
HGDP00531 F French
HGDP00533 M French
HGDP00534 F French
HGDP00535 F French
HGDP00536 F French
HGDP00537 F French
HGDP00538 M French
HGDP00539 F French
SouthFrench3326 M French
SouthFrench3947 M French
SouthFrench1323 M French
SouthFrench3951 M French
SouthFrench3068 M French
SouthFrench1112 M French
SouthFrench4018 M French
Alright, so that lists all French individuals in the file. Now let's count them, by simply passing the flag -c:
grep -c French example_data/example.ind
32
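Note that grep French matches the word anywhere on a line, including inside an individual's ID. Here is a small sketch of the difference, using a made-up four-line sample file (the last line, with ID French_01 and population Basque, is hypothetical and only there to show the effect) and awk, which is introduced further down in this notebook:

```shell
# Hypothetical sample in the same three-column layout as the .ind file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
HGDP00511 M French
HGDP00512 M French
SouthFrench3326 M French
French_01 F Basque
EOF

# Counts every line that contains "French" anywhere, IDs included.
anywhere=$(grep -c French "$sample")

# Counts only lines whose third column is exactly "French".
exact=$(awk '$3 == "French" {n++} END {print n + 0}' "$sample")

echo "lines containing French: $anywhere"
echo "individuals with population French: $exact"
rm "$sample"
```

In the course data both approaches may well agree; the column-based check just makes the intent explicit.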
*Note:* So far we have seen the pwd, mkdir, cd, rm, ls and grep commands. If you want to find out more about them, just search for them online; they are among the most popular and widely used commands/programs in Unix.
In Python 3 notebooks you can plot things: create a new Python 3 notebook, and run this boilerplate code in the first cell:
%matplotlib inline
import matplotlib.pyplot as plt
Then plot something, opening a second cell:
*Exercise:* Create a simple plot using plt.plot([1, 2, 3], [5, 2, 6])
OK. So this first notebook runs Bash, which is more or less the lingua franca of Linux command lines: most of what you type in a terminal is interpreted by a shell such as bash. One of the most useful techniques in bash scripts and commands is the Unix pipe. To illustrate it, consider the following.
Let's look at the structure of our ind file:
head example_data/example.ind
Yuk_009 M Yukagir
Yuk_025 F Yukagir
Yuk_022 F Yukagir
Yuk_020 F Yukagir
MC_40 M Chukchi
Yuk_024 F Yukagir
Nesk_25 F Eskimo_Naukan
Yuk_023 F Yukagir
MC_16 M Chukchi
MC_15 F Chukchi
*Note:* The head command prints the first 10 lines of a file by default.
Let's extract the population column:
head example_data/example.ind | awk '{print $3}'
Yukagir
Yukagir
Yukagir
Yukagir
Chukchi
Yukagir
Eskimo_Naukan
Yukagir
Chukchi
Chukchi
*Note:* The awk program is one of the most powerful programs for text-file processing in the Unix world. It is actually a full-fledged programming language itself. Here we only use it in one of its simplest forms. The program {print $3} simply says "For every line of the input, print out the third field".
*Note:* The pipe symbol | tells Unix to redirect the output of the program to its left into the program to its right as standard input.
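To make these two notes concrete, here is a minimal sketch on a three-line sample file (created on the fly with mktemp and printf, both assumptions for illustration). It shows that awk gives the same result whether it reads the file directly or receives it through a pipe, and that several fields can be printed at once.

```shell
# Hypothetical three-line sample in the layout of the .ind file.
sample=$(mktemp)
printf '%s\n' 'Yuk_009 M Yukagir' 'MC_40 M Chukchi' 'Nesk_25 F Eskimo_Naukan' > "$sample"

# awk reading the file itself ...
from_file=$(awk '{print $3}' "$sample")

# ... and awk reading the same lines from standard input via a pipe.
from_pipe=$(cat "$sample" | awk '{print $3}')

# Print the first and third field of the first two lines.
id_and_pop=$(head -n 2 "$sample" | awk '{print $1, $3}')

echo "$from_pipe"
echo "$id_and_pop"
rm "$sample"
```

By default awk splits each line on whitespace, which is why $1, $2 and $3 map onto the ID, sex and population columns here.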
Let's sort the output (notice that we now start with cat instead of head, and use head at the end):
cat example_data/example.ind | awk '{print $3}' | sort | head
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Adygei
OK, so you may see some error messages at the end (typically "Broken pipe"), because head stops reading after ten lines and the programs upstream are cut off while still writing. That's OK.
Now let's use uniq to get rid of population name duplicates:
cat example_data/example.ind | awk '{print $3}' | sort | uniq | head
Abkhasian
Adygei
Albanian
Aleut
Aleut_Tlingit
Altaian
Ami
Armenian
Atayal
Balkar
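The sort step before uniq matters: uniq only collapses duplicates that sit on adjacent lines. A minimal sketch on a made-up list of five population names (the sample file and variable names are assumptions for illustration):

```shell
# Hypothetical unsorted list of population names, one per line.
sample=$(mktemp)
printf '%s\n' Yukagir Chukchi Yukagir Yukagir Chukchi > "$sample"

# Without sort, only the two adjacent "Yukagir" lines are merged.
without_sort=$(uniq "$sample" | wc -l | tr -d ' ')

# With sort, equal names become neighbours and collapse fully.
with_sort=$(sort "$sample" | uniq | wc -l | tr -d ' ')

# sort -u does both steps in one command.
with_sort_u=$(sort -u "$sample" | wc -l | tr -d ' ')

echo "uniq alone: $without_sort, sort | uniq: $with_sort, sort -u: $with_sort_u"
rm "$sample"
```

The tr -d ' ' at the end strips the padding that some versions of wc put in front of the number.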
And now let's count:
cat example_data/example.ind | awk '{print $3}' | sort | uniq | wc -l
120
OK, so there are 120 populations in the dataset. And how many individuals?
wc -l example_data/example.ind
1371 example_data/example.ind
So 1371 individuals in 120 populations, i.e. about 11 per population on average. Good to know!
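Bash itself only does integer arithmetic, so one way to get the average with a decimal place is to hand the two numbers to awk. This is a sketch with the two counts from above typed in by hand; the variable names are made up for illustration.

```shell
n_individuals=1371   # from wc -l on the ind file
n_populations=120    # from sort | uniq | wc -l

# A BEGIN block runs without reading any input, so awk works as a calculator here.
# LC_ALL=C keeps the decimal separator a dot regardless of locale settings.
average=$(LC_ALL=C awk -v n="$n_individuals" -v p="$n_populations" \
    'BEGIN {printf "%.1f", n / p}')

echo "average individuals per population: $average"
```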
*Note:* We learned some new Unix commands: awk, cat, head, sort, uniq and wc.
As a final step, let's modify our pipeline to output not just the unique populations, but also the number of individuals per population. Fortunately this is extremely easy, since the flag -c to the uniq command already does the job:
cat example_data/example.ind | awk '{print $3}' | sort | uniq -c | head
9 Abkhasian
16 Adygei
6 Albanian
7 Aleut
4 Aleut_Tlingit
7 Altaian
10 Ami
10 Armenian
9 Atayal
10 Balkar
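A common variation, sketched here under the assumption that you want the largest populations first: pipe the counts through a second sort with -r (reverse) and -n (numeric). The six-line sample file below is made up for illustration.

```shell
# Hypothetical sample in the three-column layout of the .ind file.
sample=$(mktemp)
printf '%s\n' 'id1 M Yukagir' 'id2 F Chukchi' 'id3 F Yukagir' \
              'id4 M Yukagir' 'id5 F Chukchi' 'id6 M Aleut' > "$sample"

# Count individuals per population, then order by count, largest first.
ranked=$(awk '{print $3}' "$sample" | sort | uniq -c | sort -rn)
echo "$ranked"

# The name on the first line is the population with the most individuals.
largest=$(echo "$ranked" | head -n 1 | awk '{print $2}')
echo "largest population: $largest"
rm "$sample"
```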
Nice. Let's put that list into a file that we can then import for plotting later.
cat example_data/example.ind | awk '{print $3}' | sort | uniq -c > population_frequencies.txt
OK, we have created a new file called population_frequencies.txt in our current directory. We have used the bash redirection symbol > to write the output of a command or pipeline into a file. The file should now contain the number of individuals per population. We can check this by running:
head population_frequencies.txt
9 Abkhasian
16 Adygei
6 Albanian
7 Aleut
4 Aleut_Tlingit
7 Altaian
10 Ami
10 Armenian
9 Atayal
10 Balkar
OK, it seems to have worked. If you want to look at the file in a more interactive way, go back to your Jupyter File Browser and click on the file, which you should now see within your working directory. The file should open in a text editor that you can use to scroll around.
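One thing to keep in mind with >: it overwrites the target file without asking. Below is a minimal sketch of the difference between > and the appending form >>, using a temporary file (an assumption for illustration, so that population_frequencies.txt is left alone):

```shell
# Temporary file so that no real data is overwritten.
out=$(mktemp)

echo "first" > "$out"
echo "second" > "$out"      # > replaces the previous content
after_overwrite=$(cat "$out")

echo "third" >> "$out"      # >> adds a line at the end instead
n_lines=$(wc -l < "$out" | tr -d ' ')
last_line=$(tail -n 1 "$out")

echo "after overwrite: $after_overwrite"
echo "lines after append: $n_lines (last: $last_line)"
rm "$out"
```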
OK, now that we have a file to plot, let's try it out using a new Python 3 notebook. See the next notebook in this series, called 02_pynb_getting_started.