Principal Component Plots¶

For this chapter, you will need the PCA results that we ran in the last chapter. I have included the output files of my runs in this repository, so you can use them if something didn't work in the previous chapter.

For making plots in Python, the most popular library is matplotlib. We will also make use of pandas. You can load them via:

In [16]:

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

In [5]:

cd ~/popgen_course

/home/stephan/popgen_course

In [6]:

pwd

Out[6]:

'/home/stephan/popgen_course'

In [7]:

ls

01_bashnb_getting_started.ipynb  pca.AllEurasia.params.txt
02_pynb_getting_started.ipynb    pca.WestEurasia.eval
03_bashnb_smartpca.ipynb         pca.WestEurasia.evec
04_pynb_plotting_pca.ipynb       pca.WestEurasia.params.txt
pca.AllEurasia.eval              population_frequencies.txt
pca.AllEurasia.evec              README.md

Let's have a look at the main results file from smartpca:

In [8]:

!head pca.WestEurasia.evec

           #eigvals:     6.289     3.095     2.693     2.010 
             Yuk_009     0.0123      0.1252      0.1147      0.0567          Yukagir
             Yuk_025     0.0120      0.1258      0.1168      0.0576          Yukagir
             Yuk_022     0.0136      0.1303      0.1186      0.0564          Yukagir
             Yuk_020     0.0170      0.1278      0.1176      0.0584          Yukagir
               MC_40     0.0183      0.1226      0.1123      0.0537          Chukchi
             Yuk_024     0.0144      0.1271      0.1124      0.0584          Yukagir
             Yuk_023     0.0124      0.1348      0.1238      0.0642          Yukagir
               MC_16     0.0144      0.1266      0.1169      0.0541          Chukchi
               MC_15     0.0146      0.1250      0.1119      0.0559          Chukchi

The first row contains the eigenvalues for the first 4 principal components (PCs), and all further rows contain the PC coordinates for each individual. The first column contains the name of each individual, the last column the population. To load this dataset with Python, we use the pandas package, which facilitates working with tabular data. To load data using pandas, we will use the read_csv() function. This function lets you pass column headers, which we have to define first:

In [13]:

column_names = ["Name", "PC1", "PC2", "PC3", "PC4", "Group"]

In [14]:

column_names

Out[14]:

['Name', 'PC1', 'PC2', 'PC3', 'PC4', 'Group']

We can then load the eigenvector (.evec) files from the two PCA runs:

In [17]:

pcaDat = pd.read_csv("pca.WestEurasia.evec",
                     delim_whitespace=True, skiprows=1, names=column_names)

In [25]:

pcaDat2 = pd.read_csv("pca.AllEurasia.evec",
                     delim_whitespace=True, skiprows=1, names=column_names)
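Both calls skip the first line of the file, which holds the eigenvalues. If you want those numbers as well, here is a minimal sketch. The variable names are illustrative, and the header line is copied from the output above rather than read from disk:

```python
# Header line as printed by `head pca.WestEurasia.evec` above.
header = "           #eigvals:     6.289     3.095     2.693     2.010 "

fields = header.split()          # split on any run of whitespace
if fields[0] != "#eigvals:":
    raise ValueError("unexpected header line")
eigvals = [float(x) for x in fields[1:]]

# Share of each eigenvalue relative to the four listed here only.
total = sum(eigvals)
for i, ev in enumerate(eigvals, start=1):
    print(f"PC{i}: eigenvalue {ev:.3f}, share among the listed PCs {ev / total:.1%}")
```

Note that these shares are relative to the four listed eigenvalues only. A proportion of the total variance would need the full list of eigenvalues, which should be in the corresponding .eval file.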

Looking at the data, we find that it is a matrix, with each individual on one row, and the columns denoting the first 4 principal components. The last column contains the population for each individual:

In [18]:

pcaDat

Out[18]:

	Name	PC1	PC2	PC3	PC4	Group
0	Yuk_009	0.0123	0.1252	0.1147	0.0567	Yukagir
1	Yuk_025	0.0120	0.1258	0.1168	0.0576	Yukagir
2	Yuk_022	0.0136	0.1303	0.1186	0.0564	Yukagir
3	Yuk_020	0.0170	0.1278	0.1176	0.0584	Yukagir
4	MC_40	0.0183	0.1226	0.1123	0.0537	Chukchi
5	Yuk_024	0.0144	0.1271	0.1124	0.0584	Yukagir
6	Yuk_023	0.0124	0.1348	0.1238	0.0642	Yukagir
7	MC_16	0.0144	0.1266	0.1169	0.0541	Chukchi
8	MC_15	0.0146	0.1250	0.1119	0.0559	Chukchi
9	MC_18	0.0175	0.1238	0.1167	0.0523	Chukchi
10	Yuk_004	0.0110	0.1273	0.1117	0.0573	Yukagir
11	MC_08	0.0187	0.1253	0.1185	0.0564	Chukchi
12	Nov_005	0.0152	0.1349	0.1285	0.0618	Nganasan
13	MC_25	0.0182	0.1258	0.1196	0.0532	Chukchi
14	Yuk_019	0.0161	0.1327	0.1229	0.0617	Yukagir
15	Yuk_011	0.0152	0.1217	0.1148	0.0569	Yukagir
16	Sesk_47	0.0167	0.1241	0.1177	0.0549	Chukchi1
17	MC_17	0.0180	0.1268	0.1147	0.0544	Chukchi
18	Yuk_021	0.0141	0.1329	0.1210	0.0653	Yukagir
19	MC_06	0.0159	0.1264	0.1135	0.0557	Chukchi
20	MC_38	0.0178	0.1240	0.1143	0.0534	Chukchi
21	MC_14	0.0165	0.1238	0.1114	0.0524	Chukchi
22	Ul5	0.0070	0.1306	0.1144	0.0540	Ulchi
23	Ul31	0.0056	0.1289	0.1182	0.0550	Ulchi
24	Ul65	0.0051	0.1331	0.1117	0.0599	Ulchi
25	Tuba12	0.0172	0.0906	0.0790	0.0362	Tubalar
26	Tuba20	0.0129	0.0894	0.0767	0.0308	Tubalar
27	Nel19	0.0273	0.0605	0.0608	0.0333	Yukagir
28	Nlk16	0.0217	0.0744	0.0753	0.0360	Even
29	Kor66	0.0148	0.1259	0.1157	0.0531	Koryak
...	...	...	...	...	...	...
1259	I0429	0.0413	0.0447	0.0440	0.0098	Yamnaya_Samara
1260	I0438	0.0384	0.0497	0.0399	0.0020	Yamnaya_Samara
1261	I0585	0.0770	-0.0424	0.0372	0.0355	WHG
1262	I0797	-0.0101	-0.0452	-0.0342	-0.0124	LBK_EN
1263	I0795	-0.0057	-0.0495	-0.0429	0.0098	LBK_EN
1264	I0022	-0.0133	-0.0433	-0.0356	-0.0089	LBK_EN
1265	I0026	-0.0142	-0.0438	-0.0430	-0.0027	LBK_EN
1266	I1507	0.0866	-0.0455	0.0393	0.0311	WHG
1267	I0025	-0.0103	-0.0449	-0.0404	-0.0023	LBK_EN
1268	I0443	0.0350	0.0401	0.0412	0.0028	Yamnaya_Samara
1269	I0054	-0.0054	-0.0413	-0.0410	-0.0124	LBK_EN
1270	I0046	-0.0066	-0.0446	-0.0386	-0.0092	LBK_EN
1271	I0048	-0.0128	-0.0367	-0.0388	-0.0129	LBK_EN
1272	I0056	-0.0067	-0.0472	-0.0388	-0.0054	LBK_EN
1273	I0057	-0.0113	-0.0442	-0.0357	-0.0008	LBK_EN
1274	I0100	-0.0063	-0.0455	-0.0410	-0.0051	LBK_EN
1275	I0659	-0.0084	-0.0437	-0.0431	-0.0099	LBK_EN
1276	I0821	-0.0071	-0.0428	-0.0380	-0.0103	LBK_EN
1277	I1550	-0.0107	-0.0386	-0.0402	-0.0039	LBK_EN
1278	BOO001	0.0399	0.0760	0.0915	0.0453	BolshoyOleniOstrov
1279	BOO002	0.0445	0.0735	0.0925	0.0379	BolshoyOleniOstrov
1280	BOO003	0.0466	0.0765	0.0862	0.0415	BolshoyOleniOstrov
1281	BOO004	0.0411	0.0723	0.0938	0.0419	BolshoyOleniOstrov
1282	BOO005	0.0461	0.0731	0.0909	0.0401	BolshoyOleniOstrov
1283	BOO006	0.0394	0.0917	0.1002	0.0438	BolshoyOleniOstrov
1284	CHV001	0.0441	0.0331	0.0587	0.0325	ChalmnyVarre
1285	CHV002	0.0442	0.0351	0.0610	0.0373	ChalmnyVarre
1286	JK1968	0.0398	0.0385	0.0661	0.0299	Levanluhta
1287	JK1970	0.0408	0.0466	0.0600	0.0363	Levanluhta
1288	JK2065	0.0392	-0.0065	0.0195	0.0043	JK2065

1289 rows × 6 columns
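To get a feeling for how many individuals each group contributes, you can count the entries in the Group column. The sketch below uses a handful of rows copied from the table above and loaded from a string, so it runs without the actual file. It uses sep=r"\s+" for whitespace-separated columns. In the notebook you would apply value_counts() directly to pcaDat["Group"]:

```python
import io
import pandas as pd

column_names = ["Name", "PC1", "PC2", "PC3", "PC4", "Group"]

# A few rows in the same layout as the .evec file (header line first).
sample = """\
#eigvals: 6.289 3.095 2.693 2.010
Yuk_009 0.0123 0.1252 0.1147 0.0567 Yukagir
Yuk_025 0.0120 0.1258 0.1168 0.0576 Yukagir
MC_40 0.0183 0.1226 0.1123 0.0537 Chukchi
MC_16 0.0144 0.1266 0.1169 0.0541 Chukchi
MC_15 0.0146 0.1250 0.1119 0.0559 Chukchi
"""

df = pd.read_csv(io.StringIO(sample), sep=r"\s+", skiprows=1, names=column_names)

# Number of individuals per group, largest group first.
counts = df["Group"].value_counts()
print(counts)
```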

We can quickly plot the first two PCs for all individuals:

In [20]:

plt.scatter(x=pcaDat["PC1"], y=pcaDat["PC2"])

Out[20]:

<matplotlib.collections.PathCollection at 0x7fe1f662def0>

In [21]:

plt.figure(figsize=(10, 10))
plt.scatter(x=pcaDat["PC1"], y=pcaDat["PC2"])
plt.xlabel("PC1");
plt.ylabel("PC2");

This is not very helpful, because we can't see where each population falls. We can highlight a few populations to get a better feeling for the data:

In [23]:

plt.scatter(x=pcaDat["PC1"], y=pcaDat["PC2"])
for pop in ["Finnish", "Sardinian", "Armenian", "BedouinB"]: #French, Finnish, Han, Ami, Nganasan
    d = pcaDat[pcaDat["Group"] == pop]
    plt.scatter(x=d["PC1"], y=d["PC2"], label=pop)

In [30]:

plt.figure(figsize=(10, 10))
plt.scatter(x=-pcaDat["PC1"], y=pcaDat["PC2"], label="")
for pop in ["Finnish", "Sardinian", "Armenian", "BedouinB"]:
    d = pcaDat[pcaDat["Group"] == pop]
    plt.scatter(x=-d["PC1"], y=d["PC2"], label=pop)
plt.legend()
plt.xlabel("PC1");
plt.ylabel("PC2");

In [33]:

plt.figure(figsize=(10, 10))
plt.scatter(x=-pcaDat2["PC1"], y=pcaDat2["PC2"], label="")
for pop in ["Finnish", "Sardinian", "Han", "Ami", "Nganasan"]:
    d = pcaDat2[pcaDat2["Group"] == pop]
    plt.scatter(x=-d["PC1"], y=d["PC2"], label=pop)
plt.legend()
plt.xlabel("PC1");
plt.ylabel("PC2");
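The last two plots negate PC1 before plotting. The sign of a principal component is arbitrary, so mirroring an axis only changes the orientation of the plot, not the distances between individuals. A minimal sketch that checks this with three (PC1, PC2) pairs taken from the table further up:

```python
import math
from itertools import combinations

# (PC1, PC2) for three individuals from the West Eurasian table.
points = {
    "Yuk_009": (0.0123, 0.1252),
    "I0585": (0.0770, -0.0424),
    "I0797": (-0.0101, -0.0452),
}

# Mirror the PC1 axis, as done with x=-pcaDat["PC1"] in the plots.
flipped = {name: (-pc1, pc2) for name, (pc1, pc2) in points.items()}

for a, b in combinations(points, 2):
    before = math.dist(points[a], points[b])
    after = math.dist(flipped[a], flipped[b])
    print(a, b, round(before, 4), round(after, 4))
```

Each pair should print the same distance twice, which is why flipping an axis is a purely cosmetic choice.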

Showing all populations¶

OK, but how do we systematically show all the populations? There are too many to separate them all by color alone, or by symbol alone, so we need to use combinations of colors and symbols. To do that, we first need to load the population list that we want to focus on, which is the same list as used for running the PCA. In the case of the West Eurasian PCA, you can load the file using:

In [34]:

pd.read_csv("/data/popgen_course/WestEurasia.poplist.txt",
                         names=["Population"]).sort_values(by="Population")

Out[34]:

	Population
1	Abkhasian
2	Adygei
3	Albanian
4	Armenian
5	Assyrian
6	Balkar
7	Basque
8	BedouinA
9	BedouinB
10	Belarusian
11	Bulgarian
12	Canary_Islander
13	Chechen
0	Chuvash
14	Croatian
15	Cypriot
16	Czech
17	Druze
18	English
19	Estonian
20	Finnish
21	French
22	Georgian
23	German
24	Greek
25	Hungarian
26	Icelandic
27	Iranian
28	Irish
29	Irish_Ulster
...	...
38	Jew_Tunisian
39	Jew_Turkish
40	Jew_Yemenite
41	Jordanian
42	Kumyk
44	Lebanese
43	Lebanese_Christian
45	Lebanese_Muslim
46	Lezgin
47	Lithuanian
48	Maltese
49	Mordovian
50	North_Ossetian
51	Norwegian
52	Orcadian
53	Palestinian
54	Polish
55	Romanian
56	Russian
57	Sardinian
58	Saudi
59	Scottish
60	Shetlandic
61	Sicilian
62	Sorb
64	Spanish
63	Spanish_North
65	Syrian
66	Turkish
67	Ukrainian

68 rows × 1 columns

Next, we need to associate a color number and a symbol number with each population. To keep things simple, I recommend cycling through all combinations automatically. This code snippet looks a bit like magic, but it does the job:

In [36]:

popListDat = pd.read_csv("/data/popgen_course/WestEurasia.poplist.txt",
                         names=["Population"]).sort_values(by="Population")
nPops = len(popListDat)
nCols = 8
nSymbols = int(nPops / nCols)
colorIndices = [int(i / nSymbols) for i in range(nPops)]
symbolIndices = [i % nSymbols for i in range(nPops)]
popListDat = popListDat.assign(colorIndex=colorIndices, symbolIndex=symbolIndices)

In [41]:

popListDat2 = pd.read_csv("/data/popgen_course/AllEurasia.poplist.txt",
                         names=["Population"]).sort_values(by="Population")
nPops = len(popListDat2)
nCols = 8
nSymbols = int(nPops / nCols)
colorIndices = [int(i / nSymbols) for i in range(nPops)]
symbolIndices = [i % nSymbols for i in range(nPops)]
popListDat2 = popListDat2.assign(colorIndex=colorIndices, symbolIndex=symbolIndices)

How do we know it worked? Let's look at popListDat:

In [39]:

popListDat

Out[39]:

	Population	colorIndex	symbolIndex
1	Abkhasian	0	0
2	Adygei	0	1
3	Albanian	0	2
4	Armenian	0	3
5	Assyrian	0	4
6	Balkar	0	5
7	Basque	0	6
8	BedouinA	0	7
9	BedouinB	1	0
10	Belarusian	1	1
11	Bulgarian	1	2
12	Canary_Islander	1	3
13	Chechen	1	4
0	Chuvash	1	5
14	Croatian	1	6
15	Cypriot	1	7
16	Czech	2	0
17	Druze	2	1
18	English	2	2
19	Estonian	2	3
20	Finnish	2	4
21	French	2	5
22	Georgian	2	6
23	German	2	7
24	Greek	3	0
25	Hungarian	3	1
26	Icelandic	3	2
27	Iranian	3	3
28	Irish	3	4
29	Irish_Ulster	3	5
...	...	...	...
38	Jew_Tunisian	4	6
39	Jew_Turkish	4	7
40	Jew_Yemenite	5	0
41	Jordanian	5	1
42	Kumyk	5	2
44	Lebanese	5	3
43	Lebanese_Christian	5	4
45	Lebanese_Muslim	5	5
46	Lezgin	5	6
47	Lithuanian	5	7
48	Maltese	6	0
49	Mordovian	6	1
50	North_Ossetian	6	2
51	Norwegian	6	3
52	Orcadian	6	4
53	Palestinian	6	5
54	Polish	6	6
55	Romanian	6	7
56	Russian	7	0
57	Sardinian	7	1
58	Saudi	7	2
59	Scottish	7	3
60	Shetlandic	7	4
61	Sicilian	7	5
62	Sorb	7	6
64	Spanish	7	7
63	Spanish_North	8	0
65	Syrian	8	1
66	Turkish	8	2
67	Ukrainian	8	3

68 rows × 3 columns
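The index arithmetic is easier to follow on a small example. The sketch below applies the same integer division and modulo to seven made-up population names. The function name assign_indices and the names A to G are illustrative only:

```python
def assign_indices(names, n_cols):
    """Mirror the notebook's arithmetic: the color changes every n_symbols names."""
    n_symbols = len(names) // n_cols  # same value as int(len(names) / n_cols)
    return [(name, i // n_symbols, i % n_symbols) for i, name in enumerate(names)]

toy_pops = ["A", "B", "C", "D", "E", "F", "G"]  # made-up names
table = assign_indices(toy_pops, n_cols=3)
for name, color_index, symbol_index in table:
    print(name, color_index, symbol_index)

# Every (color, symbol) pair should occur only once.
pairs = [(c, s) for _, c, s in table]
print("unique pairs:", len(set(pairs)) == len(pairs))
print("colors used:", max(c for c, _ in pairs) + 1)
```

With 7 names and n_cols=3, each color is shared by 7 // 3 = 2 names, so the seventh name spills over into a fourth color. The same effect is visible in the table above, where nCols = 8 still produces a colorIndex of 8 for the last four populations. The function also assumes there are at least n_cols names, since otherwise n_symbols would be zero.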

OK nice, we now have each population name associated with a unique combination of color-number and symbol-number. We can now plot all points with colors and symbols:

In [43]:

plt.figure(figsize=(10,10))
symbolVec = ["8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d", "8", "s", "p"]
colorVec = [u'#1f77b4', u'#ff7f0e', u'#2ca02c', u'#d62728', u'#9467bd',
            u'#8c564b', u'#e377c2', u'#7f7f7f', u'#bcbd22', u'#17becf']
for i, row in popListDat.iterrows():
    d = pcaDat[pcaDat.Group == row["Population"]]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c=colorVec[row["colorIndex"]],
                marker=symbolVec[row["symbolIndex"]], label=row["Population"])
plt.xlabel("PC1");
plt.ylabel("PC2");
plt.legend(loc=(1.1, 0), ncol=3)

Out[43]:

<matplotlib.legend.Legend at 0x7fe1f3eb4470>

In [44]:

plt.figure(figsize=(10,10))
symbolVec = ["8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d", "v", "<", ">", "^"]
colorVec = [u'#1f77b4', u'#ff7f0e', u'#2ca02c', u'#d62728', u'#9467bd',
            u'#8c564b', u'#e377c2', u'#7f7f7f', u'#bcbd22', u'#17becf']
for i, row in popListDat2.iterrows():
    d = pcaDat2[pcaDat2.Group == row["Population"]]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c=colorVec[row["colorIndex"]],
                marker=symbolVec[row["symbolIndex"]], label=row["Population"])
plt.xlabel("PC1");
plt.ylabel("PC2");
plt.legend(loc=(1.1, 0), ncol=3)

Out[44]:

<matplotlib.legend.Legend at 0x7fe1f3e5da90>
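Before plotting, it can be worth checking that the color and marker lists are long enough for the indices computed earlier, and that the markers actually in use are distinct. The following is a minimal sketch with a hypothetical helper called check. It uses the population counts shown in the tables of this chapter (68 and 119) and the 15-entry marker list from the West Eurasian cell, which repeats "8", "s" and "p" at its end:

```python
# The 15-entry marker list from the West Eurasian plotting cell.
symbol_vec = ["8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d", "8", "s", "p"]
n_colors_available = 10  # length of colorVec

def check(n_pops, n_cols, markers, n_colors):
    """Report whether the lists cover the indices produced by the cycling code."""
    n_symbols = n_pops // n_cols
    n_colors_used = (n_pops - 1) // n_symbols + 1  # largest colorIndex plus one
    return {
        "enough_colors": n_colors_used <= n_colors,
        "enough_markers": n_symbols <= len(markers),
        "markers_distinct": len(set(markers[:n_symbols])) == n_symbols,
    }

for label, n_pops in [("WestEurasia", 68), ("AllEurasia", 119)]:
    print(label, check(n_pops, 8, symbol_vec, n_colors_available))
```

For the 68 West Eurasian populations only the first 8 markers are used, so the repeats do no harm. For 119 populations, 14 markers are needed per color, and with this list two pairs of populations in each color would share a marker. A list without repeated entries avoids that.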

Adding ancient populations¶

Of course, until now we haven't included any of the actual ancient test individuals that we want to analyse, but with the plot commands above you can very easily add them, by simply adding a few manual plot commands before the legend, but outside of the for loop.

We add the following ancient populations to this plot:

Levanluhta (two individuals from Finland from the first millennium AD)
BolshoyOleniOstrov (a group of 3,500-year-old individuals from Northern Russia)
WHG (short for Western Hunter-Gatherers, about 8,000 years ago)
LBK_EN (short for Linearbandkeramik Early Neolithic, from about 6,000 years ago)
Yamnaya_Samara (a late Neolithic population from the Russian Steppe, about 4,800 years ago)

The first two populations are from a publication on ancient Fennoscandian genomes (Lamnidis et al. 2018), and are instructive to understand what PCA can be used for. The latter three populations are from two famous publications (Lazaridis et al. 2014 and Haak et al. 2015). It can be shown that modern European genetic diversity is formed by a mix of three ancestries represented by these ancient groups. To highlight these ancient populations, we plot them in black and using different symbols. While we're at it, we should also add the population called "Saami.DG":

In [11]:

plt.figure(figsize=(10,10))
symbolVec = ["8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d", "v", "<", ">", "^"]
colorVec = [u'#1f77b4', u'#ff7f0e', u'#2ca02c', u'#d62728', u'#9467bd',
            u'#8c564b', u'#e377c2', u'#7f7f7f', u'#bcbd22', u'#17becf']
for i, row in popListDat.iterrows():
    d = pcaDat[pcaDat.Group == row["Population"]]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c=colorVec[row["colorIndex"]],
                marker=symbolVec[row["symbolIndex"]], label=row["Population"])

for i, pop in enumerate(["Levanluhta", "Saami.DG", "BolshoyOleniOstrov", "Yamnaya_Samara", "LBK_EN", "WHG"]):
    d = pcaDat[pcaDat.Group == pop]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c="black", marker=symbolVec[i], label=pop)
plt.xlabel("PC1");
plt.ylabel("PC2");
plt.legend(loc=(1.1, 0), ncol=3)

Out[11]:

<matplotlib.legend.Legend at 0x7f5f946e6748>
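One thing to watch out for when adding populations by hand is that a misspelled name gives an empty selection rather than an error, so the points silently go missing from the plot. Here is a minimal sketch of a check you could run first. The set groups_in_data is a made-up subset, and "WGH" is a deliberate typo. In the notebook you would build the set from the Group column of pcaDat instead:

```python
# Made-up subset of group labels; in the notebook: set(pcaDat["Group"]).
groups_in_data = {
    "Levanluhta", "BolshoyOleniOstrov", "Yamnaya_Samara", "LBK_EN", "WHG", "Finnish",
}

# Names we intend to highlight; "WGH" is a deliberate typo for illustration.
requested = ["Levanluhta", "BolshoyOleniOstrov", "Yamnaya_Samara", "LBK_EN", "WGH"]

missing = [pop for pop in requested if pop not in groups_in_data]
if missing:
    print("not found in the data:", missing)
```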

OK, so what are we looking at? This is quite a rich plot, of course, and we won't discuss all the details here. I just want to highlight two things. First, you can see that most present-day Europeans are scattered in a relatively tight space in the center of a triangle spanned by WHG on the lower left, LBK_EN on the lower right (as seen from the European points) and Yamnaya_Samara at the top. Indeed, a widely accepted model for present-day Europeans assumes these three ancient source populations for all Europeans (Lazaridis et al. 2014 and Haak et al. 2015).

The second thing that is noteworthy here is that present-day people from Northeastern Europe, such as Finns, Saami and other Uralic-speaking populations, are "dragged" towards the ancient samples from Bolshoy Oleni Ostrov. Indeed, a recent model published by us assumes that "Siberian" genetic ancestry entered Europe around 4,000 years ago as a kind of fourth genetic component on top of the three other components discussed above, and is nowadays found in most Uralic speakers in Europe, including Finns, Saami and Estonians.

East-Eurasian PCA¶

We can make a similar plot using the all-Eurasian PCA that we have run:

In [44]:

popListDat = pd.read_csv("/data/popgen_course/AllEurasia.poplist.txt",
                         names=["Population"]).sort_values(by="Population")
nPops = len(popListDat)
nCols = 9
nSymbols = int(nPops / nCols)
colorIndices = [int(i / nSymbols) for i in range(nPops)]
symbolIndices = [i % nSymbols for i in range(nPops)]
popListDat = popListDat.assign(colorIndex=colorIndices, symbolIndex=symbolIndices)
popListDat

Out[44]:

	Population	colorIndex	symbolIndex
0	Abkhasian	0	0
1	Adygei	0	1
2	Albanian	0	2
3	Aleut	0	3
4	Aleut_Tlingit	0	4
5	Altaian	0	5
6	Ami	0	6
7	Armenian	0	7
8	Assyrian	0	8
9	Atayal	0	9
10	Avar	0	10
11	Azeri	0	11
12	Balkar	0	12
13	Basque	1	0
14	BedouinA	1	1
15	BedouinB	1	2
16	Belarusian	1	3
17	Borneo	1	4
18	Bulgarian	1	5
19	Buryat	1	6
20	Cambodian	1	7
21	Chechen	1	8
22	Chukchi	1	9
23	Chukchi1	1	10
24	Chuvash	1	11
25	Croatian	1	12
26	Cypriot	2	0
27	Czech	2	1
28	Dai	2	2
29	Daur	2	3
...	...	...	...
89	Saami.DG	6	11
90	Saami_WGA	6	12
91	Sardinian	7	0
92	Saudi	7	1
93	Scottish	7	2
94	Selkup	7	3
95	Semende	7	4
96	She	7	5
97	Sherpa.DG	7	6
98	Sicilian	7	7
99	Spanish	7	8
100	Spanish_North	7	9
101	Syrian	7	10
102	Tajik	7	11
103	Thai	7	12
104	Tibetan.DG	8	0
105	Tu	8	1
106	Tubalar	8	2
107	Tujia	8	3
108	Turkish	8	4
109	Turkmen	8	5
110	Tuvinian	8	6
111	Ukrainian	8	7
112	Ulchi	8	8
113	Uygur	8	9
114	Uzbek	8	10
115	Xibo	8	11
116	Yakut	8	12
117	Yi	9	0
118	Yukagir	9	1

119 rows × 3 columns

In [45]:

pcaDat = pd.read_csv("pca.AllEurasia.evec",
                     delim_whitespace=True, skiprows=1, names=column_names)

In [46]:

plt.figure(figsize=(10,10))
symbolVec = ["8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d", "v", "<", ">", "^"]
colorVec = [u'#1f77b4', u'#ff7f0e', u'#2ca02c', u'#d62728', u'#9467bd',
            u'#8c564b', u'#e377c2', u'#7f7f7f', u'#bcbd22', u'#17becf']
for i, row in popListDat.iterrows():
    d = pcaDat[pcaDat.Group == row["Population"]]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c=colorVec[row["colorIndex"]],
                marker=symbolVec[row["symbolIndex"]], label=row["Population"])

for i, pop in enumerate(["Levanluhta", "Saami.DG", "BolshoyOleniOstrov", "Yamnaya_Samara", "LBK_EN", "WHG"]):
    d = pcaDat[pcaDat.Group == pop]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c="black", marker=symbolVec[i], label=pop)

plt.xlabel("PC1");
plt.ylabel("PC2");
plt.legend(loc=(1.1, 0), ncol=3)

Out[46]:

<matplotlib.legend.Legend at 0x7f7086207a58>

This PCA looks quite different. Here, we have all Western Eurasian groups squished together on the left side of the plot, and on the right we have East Asian populations. The plot roughly reflects geography, with Northern East Asian people such as the Nganasan on the top right, and Southern East Asian people like the Taiwanese Ami on the lower right. Here we can now see that the ancient samples from Russia and Finland, as well as present-day Uralic populations, are actually distributed between East and West, in contrast to most other Europeans. This confirms that these groups in Europe have quite a distinctive East Asian genetic ancestry, and we found that it is best represented by the Nganasan (Lamnidis et al. 2018).