Principal components analysis (PCA) is one of the most useful techniques to visualise genetic diversity in a dataset. The methodology is not restricted to genetic data, but in general allows breaking down high-dimensional datasets to two or more dimensions for visualisation in a two-dimensional space.
This lesson is also our first contact with the genotype data used in this and most of the following lessons. The dataset that we will work with contains 1,340 individuals, each represented by 593,124 single nucleotide polymorphisms (SNPs). Those SNPs have exactly two different alleles, and each individual has one of four possible values at each genotype: homozygous reference, heterozygous, homozygous alternative, or missing. Those four values are encoded 2, 1, 0 and 9 respectively.
The data is laid out as a matrix, with columns indicating individuals, and rows indicating SNPs. The data itself comes in the so-called "EIGENSTRAT" format, which is defined in the Eigensoft package used by many tools used in this workshop. In this format, a genotype dataset consists of three files, usually with the following file endings:
*.snp
: The file containing the SNP positions. It consists of six columns: SNP-name, chromosome, genetic positions, physical position, reference allele, alternative allele.*.ind
: The file containing the names of the individuals. It consists of three columns: Individual Name, Sex (encoded as M(ale), F(emale), or U(nknown)), and population name.*.geno
: The file containing the genotype matrix, with individuals laid out from left to right, and SNP positions laid out from top to bottom.In the following, we will explore the files using bash in this notebook.
The data that we want to analyse is stored at /data/popgen_course
. Let's list the contents of that directory:
ls /data/popgen_course
AllEurasia.poplist.txt genotypes_small.ind WestEurasia.poplist.txt genotypes_small.geno genotypes_small.snp
Let's explore those files a bit. Here are the first 20 individuals:
head -20 /data/popgen_course/genotypes_small.ind
Yuk_009 M Yukagir Yuk_025 F Yukagir Yuk_022 F Yukagir Yuk_020 F Yukagir MC_40 M Chukchi Yuk_024 F Yukagir Yuk_023 F Yukagir MC_16 M Chukchi MC_15 F Chukchi MC_18 M Chukchi Yuk_004 M Yukagir MC_08 F Chukchi Nov_005 M Nganasan MC_25 F Chukchi Yuk_019 F Yukagir Yuk_011 M Yukagir Sesk_47 M Chukchi1 MC_17 M Chukchi Yuk_021 M Yukagir MC_06 F Chukchi
And here the first 20 SNP rows:
head -20 /data/popgen_course/genotypes_small.snp
1_752566 1 0.020130 752566 G A 1_842013 1 0.022518 842013 T G 1_891021 1 0.024116 891021 G A 1_903426 1 0.024457 903426 C T 1_949654 1 0.025727 949654 A G 1_1018704 1 0.026288 1018704 A G 1_1045331 1 0.026665 1045331 G A 1_1048955 1 0.026674 1048955 A G 1_1061166 1 0.026711 1061166 T C 1_1108637 1 0.028311 1108637 G A 1_1120431 1 0.028916 1120431 G A 1_1156131 1 0.029335 1156131 T C 1_1157547 1 0.029356 1157547 T C 1_1158277 1 0.029367 1158277 G A 1_1161780 1 0.029391 1161780 C T 1_1170587 1 0.029450 1170587 C T 1_1205155 1 0.029735 1205155 A C 1_1211292 1 0.029785 1211292 C T 1_1235792 1 0.030045 1235792 C T 1_1254255 1 0.030111 1254255 G A
And here are the first 20 genotypes of the first 50 individuals:
head -20 /data/popgen_course/genotypes_small.geno | cut -c1-50
01011012111022101020212001000102000000110010002000 20121210122100111221001112022012221211022221211210 11001120011100210010011110000112000001111000011100 00001122102221212211211002022212221221121122112021 00000000000000000000000000001000000000000000001000 10121002211022011011211101201100000100120020102001 22222222222222222222222222222222222222222222222222 22112220022120221020012122222122122222101222121212 22112220022120221020012122020122122122101222121211 22222222221022222022222222222222222222222222112222 22122222121222222222222222222212222222222222202211 11011000010000010010000002220100212000012021101011 12211212212222112212222221212212222122222222222222 12211212212222112212222221212212222122222222222222 12211212212222112212222221212212222122222222222222 22222222222222222222222222222222222222222222222222 22222222222222222222222222222222222222222222222222 10111111021001110011002001222210222112112220212122 22222222222222222222222222222222222222222222222222 21221212121022212022222222222222211222122221922222
Counting how many individuals and SNPs there are:
wc -l /data/popgen_course/genotypes_small.ind
wc -l /data/popgen_course/genotypes_small.snp
1340 /data/popgen_course/genotypes_small.ind 593124 /data/popgen_course/genotypes_small.snp
And now we check that the first row of the *.geno
file indeed contains the same number of columns:
head -1 /data/popgen_course/genotypes_small.geno | wc -c
1341
which is one more, including the newline character at the end of the line. Now counting the number of rows in the *.geno
-file (this takes a few seconds, as the file is several hundred MB large):
wc -l /data/popgen_course/genotypes_small.geno
593124 /data/popgen_course/genotypes_small.geno
Great, the number of rows and columns agrees with the numbers indicated in the *.ind
and *.snp
file!
Now we're counting how many different populations there are. Let's first see the first 10 populations in the sorted list, alongside the number of individuals in each group:
awk '{print $3}' /data/popgen_course/genotypes_small.ind | sort | uniq -c | head -20
9 Abkhasian 16 Adygei 6 Albanian 7 Aleut 4 Aleut_Tlingit 7 Altaian 10 Ami 10 Armenian 9 Atayal 10 Balkar 29 Basque 25 BedouinA 19 BedouinB 10 Belarusian 6 BolshoyOleniOstrov 9 Borneo 10 Bulgarian 8 Cambodian 2 Canary_Islander 2 ChalmnyVarre
To understand how PCA works, consider a single individual and its representation by its 593,124 markers. Formally, each individual is a point in a 593,124-dimensional space, where each dimension can take only the three possible genotypes indicated above, or have missing data. To visualise this high-dimensional dataset, we would like to project it down to two dimensions. But as there are many ways to project the shadow of a three-dimensional object on a two dimensional plane, there are many (and even more) ways to project a 593,124-dimensional cloud of points to two dimensions. What PCA does is figuring out the "best" way to do this project in order to visualise the major components of variance in the data.
For actually running the analysis, we use a software called smartPCA
from the Eigensoft package. As many other tools from this and related packages, smartPCA
reads in a parameter file which specifies its input and output files and options. In our case, we want the parameter file to have the following content:
genotypename: /data/popgen_course/genotypes_small.geno
snpname: /data/popgen_course/genotypes_small.snp
indivname: /data/popgen_course/genotypes_small.ind
evecoutname: pca.WestEurasia.evec
evaloutname: pca.WestEurasia.eval
poplistname: /data/popgen_course/WestEurasia.poplist.txt
lsqproject: YES
numoutevec: 4
numthreads: 1
Here, the first three parameters specify the input genotype files. The next two rows specify two output file names, typically with ending *.evec
and *.eval
. The parameter line beginning with poplistname
contains a file with a list of populations used for calculating the principal components (see below). The option lsqproject
is important for applications including ancient DNA with lots of missing data, which I will not elaborate on. For the purpose of this workshop, you should use lsqproject: YES
. The next option numoutevec
specifies the number of principal components that we compute, the last option numthreads
the number of CPUs to use for this run. We use just one since we're working together on the same computer, so cannot afford everyone running on lots of CPUs.
The parameter named poplistname
is a very crucial one. It specifies the populations whose individuals are used to calculate the principal components. Why not just all of them you ask? For two reasons: First, there are simply too many of them and we don't want to use all of them, since the computation would take too long. More importantly, however, we generally try to avoid using ancient samples to compute principal components, to avoid specific ancient-DNA related artefacts affecting the computation. Finally, the list of populations to use for PCA should be informed by your question. If you're investigating African population structure, in makes no sense to put Asian or European individuals in your population list, since then the main axes of genetic differentiation would not be inside of Africa, but between Africans and Non-Africans.
So what happens to individuals that are not in populations listed in the population list? Well, fortunately, they are not just ignored, but "projected". This means that after the principal components have been computed, all individuals (not just the one in the list) are projected onto these principal components. That way, we can visualise ancient populations in the context of modern genetic variation. While that may sound a bit problematic at first (Some variation in ancient populations is not represented well by modern populations), but it turns out to be nevertheless one of the most useful tools for this purpose. The advantage of avoiding ancient-DNA artefacts and batch effects to affect the visualisation outweighs the disadvantage of missing some private genetic variation components in the ancient populations themselves. Of course, that argument breaks down once the analysed populations become too ancient and detached from modern genetic variation. But for our purposes it will work just fine.
For this workshop, I prepared two population lists::
/data/popgen_course/WestEurasia.poplist.txt
/data/popgen_course/AllEurasia.poplist.txt
As you can tell from the names of the files, they specify two sets of modern populations representing West Eurasia or all of Europe and Asia, respectively.
I recommend to look through both of the population lists and google some population names that you don't recognise to get a feeling for the ethnic groups represented here.
smartPCA
¶Now go ahead and open a new text file using your Jupyter Browser, you can name it anything you like. For the sake of a concrete name, let's call it pca.WestEurasia.params.txt
. Text files in Jupyter are opene in a text editor, so you can then simply copy-paste the above lines into the new file.
Let's see whether it worked, by printing out the contents of that file into your notebook:
cat pca.WestEurasia.params.txt
genotypename: /data/popgen_course/genotypes_small.geno snpname: /data/popgen_course/genotypes_small.snp indivname: /data/popgen_course/genotypes_small.ind evecoutname: pca.WestEurasia.evec evaloutname: pca.WestEurasia.eval poplistname: /data/popgen_course/WestEurasia.poplist.txt lsqproject: YES numoutevec: 4 numthreads: 1
Great, so that's our parameter file for running smartPCA
.
*Note:* that we specified two output files in our parameter file, here called pca.WestEurasia.evec
and pca.WestEurasia.eval
. You can actually put any names you want in there. But beware of relative vs. absolute paths. File names starting with /
are considered "absolute", that is, taken to go from the root of the file system. In contrast, filenames not starting with /
are considered "relative" to the current working directory. If you forgot which directory you're in, run pwd
.
*Note:* The option poplistname
is a crucial one. Here you need to specify which populations are used to compute the eigenvectors of the principal components analysis. In our case, I have prepared two population list files: /data/popgen_course/WestEurasia.poplist.txt
and /data/popgen_course/AllEurasia.poplist.txt
. Pick one of the two to carry on.
Good, now we can run smartPCA
. To do that, it's more convenient to use the terminal than a Notebook. So open a terminal and run
smartpca -p pca.WestEurasia.params.txt
This will typically run for about 30 minutes and output lots of logging output to the screen.
In a similar manner we can prepare a parameter file for the AllEurasia population list. This is how it should look:
cat pca.AllEurasia.params.txt
genotypename: /data/popgen_course/genotypes_small.geno snpname: /data/popgen_course/genotypes_small.snp indivname: /data/popgen_course/genotypes_small.ind evecoutname: pca.AllEurasia.evec evaloutname: pca.AllEurasia.eval poplistname: /data/popgen_course/AllEurasia.poplist.txt lsqproject: YES numoutevec: 4 numthreads: 1
And similar to the command above, we can run pca on the AllEurasia population list via:
smartpca -p pca.AllEurasia.params.txt
which will run slightly longer than the first one because there are more populations