We first check which directory we are in, using the pwd
(print working directory) command:
pwd
/home/stephan
OK, so that's indeed our home folder. We can list the contents of that folder:
ls
bashnb_test.ipynb rnb_test.ipynb test.txt pynb_test.ipynb setting_up_bash.ipynb
We can now create a new directory:
mkdir testDir
and change into that directory:
cd testDir
and confirm that we are now in the new dir:
pwd
/home/stephan/testDir
We can also play with echo:
echo "Hello, how are you?"
Hello, how are you?
OK, so let's try some more useful things with grep:
grep French /data/pca/genotypes_small.ind
HGDP00511 M French
HGDP00512 M French
HGDP00513 F French
HGDP00514 F French
HGDP00515 M French
HGDP00516 F French
HGDP00517 F French
HGDP00518 M French
HGDP00519 M French
HGDP00522 M French
HGDP00523 F French
HGDP00524 F French
HGDP00525 M French
HGDP00526 F French
HGDP00527 F French
HGDP00528 M French
HGDP00529 F French
HGDP00531 F French
HGDP00533 M French
HGDP00534 F French
HGDP00535 F French
HGDP00536 F French
HGDP00537 F French
HGDP00538 M French
HGDP00539 F French
SouthFrench3326 M French
SouthFrench3947 M French
SouthFrench1323 M French
SouthFrench3951 M French
SouthFrench3068 M French
SouthFrench1112 M French
SouthFrench4018 M French
Alright, so that lists all French individuals. Now let's count them:
grep -c French /data/pca/genotypes_small.ind
32
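One thing to keep in mind is that grep matches the string anywhere in the line, not just in the population column. The following is a minimal sketch of the difference, using a tiny made-up sample file (the `French_like01` entry is invented for illustration and is not part of the dataset) rather than the real ind file:

```shell
# Build a tiny, made-up sample in the same three-column layout
# (ID, sex, population). These lines are illustrative only.
sample=$(mktemp)
printf '%s\n' \
  'HGDP00511 M French' \
  'SouthFrench3326 M French' \
  'French_like01 F Basque' \
  'Yuk_009 M Yukagir' > "$sample"

# grep -c counts every line that contains the string, wherever it occurs.
substring_count=$(grep -c French "$sample")

# awk compares only the third column, so the Basque line is not counted.
column_count=$(awk '$3 == "French"' "$sample" | wc -l | tr -d ' ')

echo "grep -c: $substring_count, third column only: $column_count"
rm -f "$sample"
```

In the real file both approaches may well give the same number; the column-based test is simply the stricter of the two.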
Let's look at the structure of our ind file:
head /data/pca/genotypes_small.ind
Yuk_009 M Yukagir
Yuk_025 F Yukagir
Yuk_022 F Yukagir
Yuk_020 F Yukagir
MC_40 M Chukchi
Yuk_024 F Yukagir
Yuk_023 F Yukagir
MC_16 M Chukchi
MC_15 F Chukchi
MC_18 M Chukchi
Let's extract just the population column (the third one):
head /data/pca/genotypes_small.ind | awk '{print $3}'
Yukagir
Yukagir
Yukagir
Yukagir
Chukchi
Yukagir
Yukagir
Chukchi
Chukchi
Chukchi
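To see how awk picks columns, here is a small sketch on a single line. The extra spaces in the sample line are deliberate and are an assumption for illustration, not a claim about how the real file is spaced: by default awk splits on any run of whitespace, so they do not matter.

```shell
# One sample line in the ind-file layout, with irregular spacing on purpose.
line='Yuk_009   M    Yukagir'

# $3 is the third whitespace-separated field.
third=$(printf '%s\n' "$line" | awk '{print $3}')

# $NF is the last field, whatever the number of columns.
last=$(printf '%s\n' "$line" | awk '{print $NF}')

# Several fields can be printed at once; the comma inserts a single space.
id_and_pop=$(printf '%s\n' "$line" | awk '{print $1, $3}')

echo "$third | $last | $id_and_pop"
```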
Let's sort it (notice that we now start with cat instead of head, and apply head only at the end):
cat /data/pca/genotypes_small.ind | awk '{print $3}' | sort | head
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Abkhasian
Adygei
sort: fflush failed: standard output: Broken pipe
sort: write error
OK, so there are some error messages at the end: head exits after printing its ten lines and closes the pipe while sort is still writing, so sort complains about a broken pipe. That's harmless here.
Now let's use uniq to get rid of duplicate population names:
cat /data/pca/genotypes_small.ind | awk '{print $3}' | sort | uniq | head
Abkhasian
Adygei
Albanian
Aleut
Aleut_Tlingit
Altaian
Ami
Armenian
Atayal
Balkar
And now let's count:
cat /data/pca/genotypes_small.ind | awk '{print $3}' | sort | uniq | wc -l
116
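Two variations on this pipeline are worth knowing: `sort -u` combines sorting and de-duplication, and `uniq -c` reports how often each name occurs. The sketch below uses a small made-up sample file instead of the real dataset, so the numbers are illustrative only:

```shell
# A small, made-up sample in the ind-file layout (ID, sex, population).
sample=$(mktemp)
printf '%s\n' \
  'Yuk_009 M Yukagir' \
  'MC_40 M Chukchi' \
  'Yuk_025 F Yukagir' \
  'MC_16 M Chukchi' \
  'Yuk_022 F Yukagir' > "$sample"

# sort -u sorts and removes duplicates in one step.
n_pops=$(awk '{print $3}' "$sample" | sort -u | wc -l | tr -d ' ')

# uniq -c prefixes each distinct line with its number of occurrences;
# the final awk step only normalises the spacing to "name count".
per_pop=$(awk '{print $3}' "$sample" | sort | uniq -c | awk '{print $2, $1}')

echo "populations: $n_pops"
echo "$per_pop"
rm -f "$sample"
```

Note that uniq only collapses adjacent duplicates, which is why the sort step has to come first.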
OK, so there are 116 populations in the dataset. And how many individuals?
wc -l /data/pca/genotypes_small.ind
1340 /data/pca/genotypes_small.ind
So that's 1340 individuals in 116 populations, or about 11.6 per population on average. Good to know!
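The average can also be computed on the command line. Shell arithmetic with `$(( ))` only handles integers, so this sketch uses awk for the division; the variable names are just illustrative, and the two numbers are the counts obtained above:

```shell
# Counts taken from the wc -l outputs above.
n_individuals=1340
n_populations=116

# awk does floating-point arithmetic; LC_ALL=C keeps "." as the decimal separator.
average=$(LC_ALL=C awk -v n="$n_individuals" -v p="$n_populations" \
  'BEGIN { printf "%.2f", n / p }')

echo "$average"
```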