SDIR=/data;
cd $SDIR/pca;
ls
AllEurasia.poplist.txt genotypes_small.ind WestEurasia.poplist.txt genotypes_small.geno genotypes_small.snp
Exploring the files. Here are the first 20 individuals:
head -20 genotypes_small.ind
Yuk_009 M Yukagir
Yuk_025 F Yukagir
Yuk_022 F Yukagir
Yuk_020 F Yukagir
MC_40 M Chukchi
Yuk_024 F Yukagir
Yuk_023 F Yukagir
MC_16 M Chukchi
MC_15 F Chukchi
MC_18 M Chukchi
Yuk_004 M Yukagir
MC_08 F Chukchi
Nov_005 M Nganasan
MC_25 F Chukchi
Yuk_019 F Yukagir
Yuk_011 M Yukagir
Sesk_47 M Chukchi1
MC_17 M Chukchi
Yuk_021 M Yukagir
MC_06 F Chukchi
And here are the first 20 SNP rows:
head -20 genotypes_small.snp
1_752566 1 0.020130 752566 G A
1_842013 1 0.022518 842013 T G
1_891021 1 0.024116 891021 G A
1_903426 1 0.024457 903426 C T
1_949654 1 0.025727 949654 A G
1_1018704 1 0.026288 1018704 A G
1_1045331 1 0.026665 1045331 G A
1_1048955 1 0.026674 1048955 A G
1_1061166 1 0.026711 1061166 T C
1_1108637 1 0.028311 1108637 G A
1_1120431 1 0.028916 1120431 G A
1_1156131 1 0.029335 1156131 T C
1_1157547 1 0.029356 1157547 T C
1_1158277 1 0.029367 1158277 G A
1_1161780 1 0.029391 1161780 C T
1_1170587 1 0.029450 1170587 C T
1_1205155 1 0.029735 1205155 A C
1_1211292 1 0.029785 1211292 C T
1_1235792 1 0.030045 1235792 C T
1_1254255 1 0.030111 1254255 G A
And here are the genotypes of the first 100 individuals at the first 20 SNPs (one row per SNP, one character per individual):
head -20 genotypes_small.geno | cut -c1-100
0101101211102210102021200100010200000011001000200001010110001100001111101001110200110100000111100010
2012121012210011122100111202201222121102222121121012202221211212202201101201220222122021220222220221
1100112001110021001001111000011200000111100001110001110100002100110111120000102200110100010010000000
0000112210222121221121100202221222122112112211202122222221022222111221102200112222122210220111121111
0000000000000000000000000000100000000000000000100010000000000000000000000000000100000000100001000000
1012100221102201101121110120110000010012002010200100010011100100011011101110120200010120101112120111
2222222222222222222222222222222222222222222222222222222222222222222222222222222222121221222222221222
2211222002212022102001212222212212222210122212121222112222221112122111222222122021221122222222211122
2211222002212022102001212202012212212210122212121122112221221112121111222122112021211112222111211111
2222222222102222202222222222222222222222222211222212122222122122222222222222222122221222222222212222
2212222212122222222222222222221222222222222220221122222222122221212222221222222202222222222222221222
1101100001000001001000000222010021200001202110101111110122100021211110001221120002110001212222122222
1221121221222211221222222121221222212222222222222222222211121221212122221202101222212222222222222222
1221121221222211221222222121221222212222222222222222222211121221212122221202101222212222222222222222
1221121221222211221222222121221222212222222222222222222211121221212122221202101222212222222222222222
2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
2222222222222222222222222222222222222222222222222222222222222222222222222222222222121212222222222222
1011111102100111001100200122221022211211222021212200120222112121221120012221222102020112222122222222
2222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222
2122121212102221202222222222222221122212222192222211122222222112222222222122222122221221222222222222
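To get a feel for what these characters mean, it can help to tally them. The sketch below assumes the files follow the EIGENSTRAT convention, in which 0, 1 and 2 give the number of reference alleles and 9 marks a missing genotype (one such 9 appears in the last row above). The function name tally_genotypes is hypothetical and not part of any toolkit.

```shell
# Hypothetical helper: count how often each genotype code occurs in
# the first N rows (default 20) of a .geno file.
# Assumes one character per individual and no separators within a row.
tally_genotypes() {
    geno_file=$1
    rows=${2:-20}
    head -n "$rows" "$geno_file" |
        fold -w 1 |               # put every character on its own line
        sort |
        uniq -c |                 # count identical characters
        awk '{print $2, $1}'      # print as "code count"
}
```

Calling tally_genotypes genotypes_small.geno would print one line per code found in the first 20 rows. A high share of 9s for a given dataset would be worth knowing about before running a PCA.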
Counting how many individuals and SNPs there are:
wc -l genotypes_small.ind
wc -l genotypes_small.snp
1340 genotypes_small.ind
593124 genotypes_small.snp
So there are 1340 individuals and 593124 SNPs. Now we check that the first row of the *.geno file indeed contains one column (character) per individual:
head -1 genotypes_small.geno | wc -c
1341
which is one more than 1340, because wc -c also counts the newline character at the end of the line. Now we count the number of rows in the *.geno file (this takes a few seconds, as the file is several hundred MB in size):
wc -l genotypes_small.geno
593124 genotypes_small.geno
Great, the numbers of rows and columns agree with the numbers of entries in the *.snp and *.ind files, respectively!
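The checks above only look at the width of the first row. As a minimal sketch, the same comparison can be wrapped in a function that tests every row in a single pass. The function name check_eigenstrat_dims and its messages are hypothetical; it assumes the three files share a common prefix, as they do here.

```shell
# Hypothetical helper: verify that PREFIX.geno has one row per line of
# PREFIX.snp and one character per line of PREFIX.ind.
check_eigenstrat_dims() {
    prefix=$1
    # tr strips the padding that some wc implementations add
    n_ind=$(wc -l < "$prefix.ind" | tr -d ' ')
    n_snp=$(wc -l < "$prefix.snp" | tr -d ' ')
    # length($0) excludes the trailing newline, unlike wc -c
    awk -v n_ind="$n_ind" -v n_snp="$n_snp" '
        length($0) != n_ind { bad++ }
        END {
            if (NR == n_snp && bad == 0) {
                print "OK: " n_ind " individuals x " n_snp " SNPs"
            } else {
                print "MISMATCH: " NR " rows (expected " n_snp "), " bad + 0 " rows of unexpected width"
                exit 1
            }
        }' "$prefix.geno"
}
```

Run as check_eigenstrat_dims genotypes_small from within the pca directory. The function returns a non-zero exit status on a mismatch, so it can be used as a guard at the top of a script. Like wc -l, it has to read the whole *.geno file, so it also takes a few seconds.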
Next, we count how many different populations there are. Let's first look at the first 20 populations in the sorted list, alongside the number of individuals in each group:
awk '{print $3}' genotypes_small.ind | sort | uniq -c | head -20
9 Abkhasian
16 Adygei
6 Albanian
7 Aleut
4 Aleut_Tlingit
7 Altaian
10 Ami
10 Armenian
9 Atayal
10 Balkar
29 Basque
25 BedouinA
19 BedouinB
10 Belarusian
6 BolshoyOleniOstrov
9 Borneo
10 Bulgarian
8 Cambodian
2 Canary_Islander
2 ChalmnyVarre
If you look further down the list, you will notice a number of populations with only one sample. Let's filter those out and count only the populations with at least two individuals:
awk '{print $3}' genotypes_small.ind | sort | uniq -c | awk '$1>1' | wc -l
113
OK, so there are 113 populations with more than one individual in this dataset.
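To see which populations were excluded by that filter, the comparison can be inverted. The sketch below does the counting in awk itself rather than with sort and uniq. The function name small_populations and its optional threshold argument are hypothetical, chosen for illustration.

```shell
# Hypothetical helper: list populations (third column of an .ind file)
# with fewer than MIN individuals; MIN defaults to 2, i.e. singletons.
small_populations() {
    ind_file=$1
    min=${2:-2}
    awk -v min="$min" '
        { n[$3]++ }                       # tally individuals per population
        END {
            for (p in n)
                if (n[p] < min) print n[p], p
        }' "$ind_file" |
        sort -k2,2                        # awk array order is unspecified
}
```

Calling small_populations genotypes_small.ind lists the single-sample populations, and small_populations genotypes_small.ind 5 would list every population with fewer than five individuals.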