For this chapter, you will need the results of the PCA runs from the last chapter. I have included the output files of my runs in this repository, so you can use those if something didn't work in the previous chapter.
For making plots in Python, the most popular library is matplotlib. We will also make use of pandas. You can load them via:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
cd ~/popgen_course
/home/stephan/popgen_course
pwd
'/home/stephan/popgen_course'
ls
01_bashnb_getting_started.ipynb  pca.AllEurasia.params.txt
02_pynb_getting_started.ipynb    pca.WestEurasia.eval
03_bashnb_smartpca.ipynb         pca.WestEurasia.evec
04_pynb_plotting_pca.ipynb       pca.WestEurasia.params.txt
pca.AllEurasia.eval              population_frequencies.txt
pca.AllEurasia.evec              README.md
Let's have a look at the main results file from smartpca:
!head pca.WestEurasia.evec
#eigvals: 6.289 3.095 2.693 2.010
Yuk_009 0.0123 0.1252 0.1147 0.0567 Yukagir
Yuk_025 0.0120 0.1258 0.1168 0.0576 Yukagir
Yuk_022 0.0136 0.1303 0.1186 0.0564 Yukagir
Yuk_020 0.0170 0.1278 0.1176 0.0584 Yukagir
MC_40 0.0183 0.1226 0.1123 0.0537 Chukchi
Yuk_024 0.0144 0.1271 0.1124 0.0584 Yukagir
Yuk_023 0.0124 0.1348 0.1238 0.0642 Yukagir
MC_16 0.0144 0.1266 0.1169 0.0541 Chukchi
MC_15 0.0146 0.1250 0.1119 0.0559 Chukchi
The first row contains the eigenvalues for the first 4 principal components (PCs), and all further rows contain the PC coordinates for each individual. The first column contains the name of each individual, the last column the population. To load this dataset in Python, we use the pandas package, which facilitates working with tabular data. Specifically, we will use the read_csv() function. This function lets you specify column headers, which we have to define first:
column_names = ["Name", "PC1", "PC2", "PC3", "PC4", "Group"]
column_names
['Name', 'PC1', 'PC2', 'PC3', 'PC4', 'Group']
We can then load the .evec files from the two PCA runs:
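# sep=r"\s+" splits on any run of whitespace (same effect as the older,
# now deprecated delim_whitespace=True); skiprows=1 skips the #eigvals line.
pcaDat = pd.read_csv("pca.WestEurasia.evec",
                     sep=r"\s+", skiprows=1, names=column_names)
pcaDat2 = pd.read_csv("pca.AllEurasia.evec",
                      sep=r"\s+", skiprows=1, names=column_names)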
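Because the first row is skipped on loading, the eigenvalues never reach the DataFrame. If you want them, for example to annotate the plot axes, they can be read separately. The following is a minimal sketch, assuming the header has the form shown in the `head` output above; `parse_eigvals` is a hypothetical helper name (not part of the course code), and the sample text is copied from that output so the block runs without the file:

```python
import io


def parse_eigvals(handle):
    """Return the eigenvalues listed on the first line of an .evec file."""
    header = handle.readline().split()
    if not header or header[0] != "#eigvals:":
        raise ValueError("expected a '#eigvals:' header line")
    return [float(x) for x in header[1:]]


# Sample copied from the head output above; with the real file you would
# pass open("pca.WestEurasia.evec") instead of an io.StringIO object.
sample = io.StringIO(
    "#eigvals: 6.289 3.095 2.693 2.010\n"
    "Yuk_009 0.0123 0.1252 0.1147 0.0567 Yukagir\n"
)
eigvals = parse_eigvals(sample)
print(eigvals)
```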
Looking at the data, we find that it is a matrix, with each individual on one row, and the columns denoting the first 4 principal components. The last column contains the population for each individual:
pcaDat
| | Name | PC1 | PC2 | PC3 | PC4 | Group |
---|---|---|---|---|---|---|
0 | Yuk_009 | 0.0123 | 0.1252 | 0.1147 | 0.0567 | Yukagir |
1 | Yuk_025 | 0.0120 | 0.1258 | 0.1168 | 0.0576 | Yukagir |
2 | Yuk_022 | 0.0136 | 0.1303 | 0.1186 | 0.0564 | Yukagir |
3 | Yuk_020 | 0.0170 | 0.1278 | 0.1176 | 0.0584 | Yukagir |
4 | MC_40 | 0.0183 | 0.1226 | 0.1123 | 0.0537 | Chukchi |
5 | Yuk_024 | 0.0144 | 0.1271 | 0.1124 | 0.0584 | Yukagir |
6 | Yuk_023 | 0.0124 | 0.1348 | 0.1238 | 0.0642 | Yukagir |
7 | MC_16 | 0.0144 | 0.1266 | 0.1169 | 0.0541 | Chukchi |
8 | MC_15 | 0.0146 | 0.1250 | 0.1119 | 0.0559 | Chukchi |
9 | MC_18 | 0.0175 | 0.1238 | 0.1167 | 0.0523 | Chukchi |
10 | Yuk_004 | 0.0110 | 0.1273 | 0.1117 | 0.0573 | Yukagir |
11 | MC_08 | 0.0187 | 0.1253 | 0.1185 | 0.0564 | Chukchi |
12 | Nov_005 | 0.0152 | 0.1349 | 0.1285 | 0.0618 | Nganasan |
13 | MC_25 | 0.0182 | 0.1258 | 0.1196 | 0.0532 | Chukchi |
14 | Yuk_019 | 0.0161 | 0.1327 | 0.1229 | 0.0617 | Yukagir |
15 | Yuk_011 | 0.0152 | 0.1217 | 0.1148 | 0.0569 | Yukagir |
16 | Sesk_47 | 0.0167 | 0.1241 | 0.1177 | 0.0549 | Chukchi1 |
17 | MC_17 | 0.0180 | 0.1268 | 0.1147 | 0.0544 | Chukchi |
18 | Yuk_021 | 0.0141 | 0.1329 | 0.1210 | 0.0653 | Yukagir |
19 | MC_06 | 0.0159 | 0.1264 | 0.1135 | 0.0557 | Chukchi |
20 | MC_38 | 0.0178 | 0.1240 | 0.1143 | 0.0534 | Chukchi |
21 | MC_14 | 0.0165 | 0.1238 | 0.1114 | 0.0524 | Chukchi |
22 | Ul5 | 0.0070 | 0.1306 | 0.1144 | 0.0540 | Ulchi |
23 | Ul31 | 0.0056 | 0.1289 | 0.1182 | 0.0550 | Ulchi |
24 | Ul65 | 0.0051 | 0.1331 | 0.1117 | 0.0599 | Ulchi |
25 | Tuba12 | 0.0172 | 0.0906 | 0.0790 | 0.0362 | Tubalar |
26 | Tuba20 | 0.0129 | 0.0894 | 0.0767 | 0.0308 | Tubalar |
27 | Nel19 | 0.0273 | 0.0605 | 0.0608 | 0.0333 | Yukagir |
28 | Nlk16 | 0.0217 | 0.0744 | 0.0753 | 0.0360 | Even |
29 | Kor66 | 0.0148 | 0.1259 | 0.1157 | 0.0531 | Koryak |
... | ... | ... | ... | ... | ... | ... |
1259 | I0429 | 0.0413 | 0.0447 | 0.0440 | 0.0098 | Yamnaya_Samara |
1260 | I0438 | 0.0384 | 0.0497 | 0.0399 | 0.0020 | Yamnaya_Samara |
1261 | I0585 | 0.0770 | -0.0424 | 0.0372 | 0.0355 | WHG |
1262 | I0797 | -0.0101 | -0.0452 | -0.0342 | -0.0124 | LBK_EN |
1263 | I0795 | -0.0057 | -0.0495 | -0.0429 | 0.0098 | LBK_EN |
1264 | I0022 | -0.0133 | -0.0433 | -0.0356 | -0.0089 | LBK_EN |
1265 | I0026 | -0.0142 | -0.0438 | -0.0430 | -0.0027 | LBK_EN |
1266 | I1507 | 0.0866 | -0.0455 | 0.0393 | 0.0311 | WHG |
1267 | I0025 | -0.0103 | -0.0449 | -0.0404 | -0.0023 | LBK_EN |
1268 | I0443 | 0.0350 | 0.0401 | 0.0412 | 0.0028 | Yamnaya_Samara |
1269 | I0054 | -0.0054 | -0.0413 | -0.0410 | -0.0124 | LBK_EN |
1270 | I0046 | -0.0066 | -0.0446 | -0.0386 | -0.0092 | LBK_EN |
1271 | I0048 | -0.0128 | -0.0367 | -0.0388 | -0.0129 | LBK_EN |
1272 | I0056 | -0.0067 | -0.0472 | -0.0388 | -0.0054 | LBK_EN |
1273 | I0057 | -0.0113 | -0.0442 | -0.0357 | -0.0008 | LBK_EN |
1274 | I0100 | -0.0063 | -0.0455 | -0.0410 | -0.0051 | LBK_EN |
1275 | I0659 | -0.0084 | -0.0437 | -0.0431 | -0.0099 | LBK_EN |
1276 | I0821 | -0.0071 | -0.0428 | -0.0380 | -0.0103 | LBK_EN |
1277 | I1550 | -0.0107 | -0.0386 | -0.0402 | -0.0039 | LBK_EN |
1278 | BOO001 | 0.0399 | 0.0760 | 0.0915 | 0.0453 | BolshoyOleniOstrov |
1279 | BOO002 | 0.0445 | 0.0735 | 0.0925 | 0.0379 | BolshoyOleniOstrov |
1280 | BOO003 | 0.0466 | 0.0765 | 0.0862 | 0.0415 | BolshoyOleniOstrov |
1281 | BOO004 | 0.0411 | 0.0723 | 0.0938 | 0.0419 | BolshoyOleniOstrov |
1282 | BOO005 | 0.0461 | 0.0731 | 0.0909 | 0.0401 | BolshoyOleniOstrov |
1283 | BOO006 | 0.0394 | 0.0917 | 0.1002 | 0.0438 | BolshoyOleniOstrov |
1284 | CHV001 | 0.0441 | 0.0331 | 0.0587 | 0.0325 | ChalmnyVarre |
1285 | CHV002 | 0.0442 | 0.0351 | 0.0610 | 0.0373 | ChalmnyVarre |
1286 | JK1968 | 0.0398 | 0.0385 | 0.0661 | 0.0299 | Levanluhta |
1287 | JK1970 | 0.0408 | 0.0466 | 0.0600 | 0.0363 | Levanluhta |
1288 | JK2065 | 0.0392 | -0.0065 | 0.0195 | 0.0043 | JK2065 |
1289 rows × 6 columns
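Before plotting, it can help to check which groups are present and how many individuals each one has. Here is a small sketch using pandas' `value_counts()`; the rows are a handful copied from the table above and only stand in for the full `pcaDat`:

```python
import pandas as pd

# A few rows copied from the table above, standing in for the full pcaDat.
sample = pd.DataFrame(
    {
        "Name": ["Yuk_009", "MC_40", "Yuk_024", "I0585", "I1507", "I0797"],
        "PC1": [0.0123, 0.0183, 0.0144, 0.0770, 0.0866, -0.0101],
        "PC2": [0.1252, 0.1226, 0.1271, -0.0424, -0.0455, -0.0452],
        "Group": ["Yukagir", "Chukchi", "Yukagir", "WHG", "WHG", "LBK_EN"],
    }
)

# Number of individuals per group, largest groups first.
group_sizes = sample["Group"].value_counts()
print(group_sizes)

# Sorted list of distinct group labels, handy for picking names to highlight.
group_names = sorted(sample["Group"].unique())
print(group_names)
```

On the real data, replacing `sample` with `pcaDat` gives the same summary for all 1289 individuals.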
We can quickly plot the first two PCs for all individuals:
plt.scatter(x=pcaDat["PC1"], y=pcaDat["PC2"])
<matplotlib.collections.PathCollection at 0x7fe1f662def0>
plt.figure(figsize=(10, 10))
plt.scatter(x=pcaDat["PC1"], y=pcaDat["PC2"])
plt.xlabel("PC1");
plt.ylabel("PC2");
This is not very helpful, because we can't see where each population falls. We can highlight a few populations to get a better feeling for the plot:
plt.scatter(x=pcaDat["PC1"], y=pcaDat["PC2"])
for pop in ["Finnish", "Sardinian", "Armenian", "BedouinB"]:  # French, Finnish, Han, Ami, Nganasan
    d = pcaDat[pcaDat["Group"] == pop]
    plt.scatter(x=d["PC1"], y=d["PC2"], label=pop)
plt.figure(figsize=(10, 10))
# PC1 is negated to flip the x-axis; the sign of a principal component is arbitrary.
plt.scatter(x=-pcaDat["PC1"], y=pcaDat["PC2"], label="")
for pop in ["Finnish", "Sardinian", "Armenian", "BedouinB"]:
    d = pcaDat[pcaDat["Group"] == pop]
    plt.scatter(x=-d["PC1"], y=d["PC2"], label=pop)
plt.legend()
plt.xlabel("PC1");
plt.ylabel("PC2");
plt.figure(figsize=(10, 10))
plt.scatter(x=-pcaDat2["PC1"], y=pcaDat2["PC2"], label="")
for pop in ["Finnish", "Sardinian", "Han", "Ami", "Nganasan"]:
    d = pcaDat2[pcaDat2["Group"] == pop]
    plt.scatter(x=-d["PC1"], y=d["PC2"], label=pop)
plt.legend()
plt.xlabel("PC1");
plt.ylabel("PC2");
OK, but how do we systematically show all the populations? There are too many to distinguish by color alone or by symbol alone, so we need to use combinations of colors and symbols. To do that, we first need to load the list of populations that we want to focus on for now, which is the same list that was used for running the PCA. In the case of the West Eurasian PCA, you can load the file using:
pd.read_csv("/data/popgen_course/WestEurasia.poplist.txt",
            names=["Population"]).sort_values(by="Population")
| | Population |
---|---|
1 | Abkhasian |
2 | Adygei |
3 | Albanian |
4 | Armenian |
5 | Assyrian |
6 | Balkar |
7 | Basque |
8 | BedouinA |
9 | BedouinB |
10 | Belarusian |
11 | Bulgarian |
12 | Canary_Islander |
13 | Chechen |
0 | Chuvash |
14 | Croatian |
15 | Cypriot |
16 | Czech |
17 | Druze |
18 | English |
19 | Estonian |
20 | Finnish |
21 | French |
22 | Georgian |
23 | German |
24 | Greek |
25 | Hungarian |
26 | Icelandic |
27 | Iranian |
28 | Irish |
29 | Irish_Ulster |
... | ... |
38 | Jew_Tunisian |
39 | Jew_Turkish |
40 | Jew_Yemenite |
41 | Jordanian |
42 | Kumyk |
44 | Lebanese |
43 | Lebanese_Christian |
45 | Lebanese_Muslim |
46 | Lezgin |
47 | Lithuanian |
48 | Maltese |
49 | Mordovian |
50 | North_Ossetian |
51 | Norwegian |
52 | Orcadian |
53 | Palestinian |
54 | Polish |
55 | Romanian |
56 | Russian |
57 | Sardinian |
58 | Saudi |
59 | Scottish |
60 | Shetlandic |
61 | Sicilian |
62 | Sorb |
64 | Spanish |
63 | Spanish_North |
65 | Syrian |
66 | Turkish |
67 | Ukrainian |
68 rows × 1 columns
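It is worth confirming that every population in the list actually occurs in the PCA results, since a misspelled or absent name would simply select zero individuals later on. Here is a sketch with stand-in data: the group labels are copied from the tables above, and `NotInData` is a made-up name used only to show what a miss looks like:

```python
import pandas as pd

# Stand-in for the population list; "NotInData" is a made-up name used
# only to show what a missing population looks like.
pop_list = pd.DataFrame({"Population": ["Chukchi", "Yukagir", "NotInData"]})

# Stand-in for the Group column of pcaDat (labels copied from the table above).
pca = pd.DataFrame({"Group": ["Yukagir", "Yukagir", "Chukchi", "WHG"]})

# Mark which listed populations occur at least once in the PCA results.
present = pop_list["Population"].isin(pca["Group"])
missing = pop_list.loc[~present, "Population"].tolist()
print("not found in PCA results:", missing)
```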
Next, we need to associate a color number and a symbol number with each population. To keep things simple, I recommend simply cycling through all combinations automatically. This code snippet looks a bit like magic, but it does the job:
popListDat = pd.read_csv("/data/popgen_course/WestEurasia.poplist.txt",
                         names=["Population"]).sort_values(by="Population")
nPops = len(popListDat)
nCols = 8
# Number of populations that share one color; each gets a different symbol.
nSymbols = nPops // nCols
# The color changes every nSymbols populations; the symbol cycles within a color.
colorIndices = [i // nSymbols for i in range(nPops)]
symbolIndices = [i % nSymbols for i in range(nPops)]
popListDat = popListDat.assign(colorIndex=colorIndices, symbolIndex=symbolIndices)
# Same procedure for the all-Eurasian population list.
popListDat2 = pd.read_csv("/data/popgen_course/AllEurasia.poplist.txt",
                          names=["Population"]).sort_values(by="Population")
nPops = len(popListDat2)
nCols = 8
nSymbols = nPops // nCols
colorIndices = [i // nSymbols for i in range(nPops)]
symbolIndices = [i % nSymbols for i in range(nPops)]
popListDat2 = popListDat2.assign(colorIndex=colorIndices, symbolIndex=symbolIndices)
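To see what the snippet does, it helps to run the same arithmetic on a small example. The sketch below wraps it in a hypothetical helper, `style_indices` (not part of the course code), and returns the color and symbol index of each population as a pair:

```python
def style_indices(n_pops, n_cols):
    """Return (colorIndex, symbolIndex) pairs, one per population.

    Mirrors the snippet above: n_symbols consecutive populations share
    one color and are told apart by their symbol.
    """
    n_symbols = n_pops // n_cols  # // is integer (floor) division
    return [(i // n_symbols, i % n_symbols) for i in range(n_pops)]


# Toy case: 7 populations with n_cols = 3, i.e. 2 symbols per color.
pairs = style_indices(7, 3)
print(pairs)
# [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1), (3, 0)]
```

Note that when the number of populations is not a multiple of `nCols`, the color index can exceed `nCols - 1`. With 68 populations and `nCols = 8`, it runs from 0 to 8, so nine colors are needed.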
How do we know it worked? Let's look at popListDat:
popListDat
| | Population | colorIndex | symbolIndex |
---|---|---|---|
1 | Abkhasian | 0 | 0 |
2 | Adygei | 0 | 1 |
3 | Albanian | 0 | 2 |
4 | Armenian | 0 | 3 |
5 | Assyrian | 0 | 4 |
6 | Balkar | 0 | 5 |
7 | Basque | 0 | 6 |
8 | BedouinA | 0 | 7 |
9 | BedouinB | 1 | 0 |
10 | Belarusian | 1 | 1 |
11 | Bulgarian | 1 | 2 |
12 | Canary_Islander | 1 | 3 |
13 | Chechen | 1 | 4 |
0 | Chuvash | 1 | 5 |
14 | Croatian | 1 | 6 |
15 | Cypriot | 1 | 7 |
16 | Czech | 2 | 0 |
17 | Druze | 2 | 1 |
18 | English | 2 | 2 |
19 | Estonian | 2 | 3 |
20 | Finnish | 2 | 4 |
21 | French | 2 | 5 |
22 | Georgian | 2 | 6 |
23 | German | 2 | 7 |
24 | Greek | 3 | 0 |
25 | Hungarian | 3 | 1 |
26 | Icelandic | 3 | 2 |
27 | Iranian | 3 | 3 |
28 | Irish | 3 | 4 |
29 | Irish_Ulster | 3 | 5 |
... | ... | ... | ... |
38 | Jew_Tunisian | 4 | 6 |
39 | Jew_Turkish | 4 | 7 |
40 | Jew_Yemenite | 5 | 0 |
41 | Jordanian | 5 | 1 |
42 | Kumyk | 5 | 2 |
44 | Lebanese | 5 | 3 |
43 | Lebanese_Christian | 5 | 4 |
45 | Lebanese_Muslim | 5 | 5 |
46 | Lezgin | 5 | 6 |
47 | Lithuanian | 5 | 7 |
48 | Maltese | 6 | 0 |
49 | Mordovian | 6 | 1 |
50 | North_Ossetian | 6 | 2 |
51 | Norwegian | 6 | 3 |
52 | Orcadian | 6 | 4 |
53 | Palestinian | 6 | 5 |
54 | Polish | 6 | 6 |
55 | Romanian | 6 | 7 |
56 | Russian | 7 | 0 |
57 | Sardinian | 7 | 1 |
58 | Saudi | 7 | 2 |
59 | Scottish | 7 | 3 |
60 | Shetlandic | 7 | 4 |
61 | Sicilian | 7 | 5 |
62 | Sorb | 7 | 6 |
64 | Spanish | 7 | 7 |
63 | Spanish_North | 8 | 0 |
65 | Syrian | 8 | 1 |
66 | Turkish | 8 | 2 |
67 | Ukrainian | 8 | 3 |
68 rows × 3 columns
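Eyeballing the table works, but the same check can also be done programmatically. Here is a minimal sketch, assuming a DataFrame shaped like `popListDat`; the three rows are copied from the table above, and `colors` and `markers` are shortened stand-in lists for whatever colors and symbols you plan to plot with:

```python
import pandas as pd

# Three rows copied from the table above, standing in for popListDat.
pops = pd.DataFrame(
    {
        "Population": ["Abkhasian", "Adygei", "BedouinB"],
        "colorIndex": [0, 0, 1],
        "symbolIndex": [0, 1, 0],
    }
)

# Stand-in style lists; any lists of matplotlib colors and markers would do.
colors = ["#1f77b4", "#ff7f0e"]
markers = ["8", "s"]

# True if any (colorIndex, symbolIndex) pair occurs more than once.
has_duplicates = pops.duplicated(subset=["colorIndex", "symbolIndex"]).any()

# True if every index can be used to look up a color and a marker.
in_range = (pops["colorIndex"].max() < len(colors)
            and pops["symbolIndex"].max() < len(markers))

print("duplicate combinations:", has_duplicates)
print("indices within range:", in_range)
```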
OK nice, we now have each population name associated with a unique combination of color-number and symbol-number. We can now plot all points with colors and symbols:
plt.figure(figsize=(10,10))
symbolVec = ["8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d", "8", "s", "p"]
colorVec = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
            '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# One scatter call per population, styled by its color and symbol index.
for i, row in popListDat.iterrows():
    d = pcaDat[pcaDat.Group == row["Population"]]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c=colorVec[row["colorIndex"]],
                marker=symbolVec[row["symbolIndex"]], label=row["Population"])
plt.xlabel("PC1");
plt.ylabel("PC2");
plt.legend(loc=(1.1, 0), ncol=3)
<matplotlib.legend.Legend at 0x7fe1f3eb4470>
plt.figure(figsize=(10,10))
symbolVec = ["8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d", "8", "s", "p"]
colorVec = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
            '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# Same plot for the all-Eurasian PCA.
for i, row in popListDat2.iterrows():
    d = pcaDat2[pcaDat2.Group == row["Population"]]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c=colorVec[row["colorIndex"]],
                marker=symbolVec[row["symbolIndex"]], label=row["Population"])
plt.xlabel("PC1");
plt.ylabel("PC2");
plt.legend(loc=(1.1, 0), ncol=3)
<matplotlib.legend.Legend at 0x7fe1f3e5da90>
Of course, so far we haven't included any of the actual ancient test individuals that we want to analyse. With the plot commands above you can add them very easily, by adding a few manual plot commands before the legend, but outside of the for loop.
We add the following ancient populations to this plot: Levanluhta, BolshoyOleniOstrov, WHG, LBK_EN and Yamnaya_Samara.
The first two populations are from a publication on ancient Fennoscandian genomes (Lamnidis et al. 2018), and are instructive to understand what PCA can be used for. The latter three populations are from two famous publications (Lazaridis et al. 2014 and Haak et al. 2015). It can be shown that modern European genetic diversity is formed by a mix of three ancestries represented by these ancient groups. To highlight these ancient populations, we plot them in black and using different symbols. While we're at it, we should also add the population called "Saami.DG":
plt.figure(figsize=(10,10))
symbolVec = ["8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d", "v", "<", ">", "^"]
colorVec = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
            '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# The population label is stored in the "Group" column of pcaDat.
for i, row in popListDat.iterrows():
    d = pcaDat[pcaDat.Group == row["Population"]]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c=colorVec[row["colorIndex"]],
                marker=symbolVec[row["symbolIndex"]], label=row["Population"])
# Ancient groups (and Saami.DG) in black, each with its own symbol.
for i, pop in enumerate(["Levanluhta", "Saami.DG", "BolshoyOleniOstrov", "Yamnaya_Samara", "LBK_EN", "WHG"]):
    d = pcaDat[pcaDat.Group == pop]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c="black", marker=symbolVec[i], label=pop)
plt.xlabel("PC1");
plt.ylabel("PC2");
plt.legend(loc=(1.1, 0), ncol=3)
<matplotlib.legend.Legend at 0x7f5f946e6748>
OK, so what are we looking at? This is quite a rich plot, of course, and we won't discuss all the details here. I just want to highlight two things. First, you can see that most present-day Europeans are scattered in a relatively tight space in the center of a triangle spanned by WHG on the lower left, LBK_EN on the lower right (as seen from the European points) and Yamnaya_Samara at the top. Indeed, a widely accepted model for present-day Europeans assumes these three ancient source populations for all Europeans (Lazaridis et al. 2014 and Haak et al. 2015).
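To make the triangle concrete, one can compute the mean position of each of the three ancient groups in PC space. This is only a sketch on a handful of rows copied from the `pcaDat` table earlier in this chapter (PC1 is negated to match the plots); on the real data you would apply the same `groupby` to the full `pcaDat`:

```python
import pandas as pd

# Rows copied from the pcaDat table shown earlier (PC1 and PC2 only).
rows = [
    ("I0429", 0.0413, 0.0447, "Yamnaya_Samara"),
    ("I0438", 0.0384, 0.0497, "Yamnaya_Samara"),
    ("I0585", 0.0770, -0.0424, "WHG"),
    ("I1507", 0.0866, -0.0455, "WHG"),
    ("I0797", -0.0101, -0.0452, "LBK_EN"),
    ("I0795", -0.0057, -0.0495, "LBK_EN"),
]
sample = pd.DataFrame(rows, columns=["Name", "PC1", "PC2", "Group"])

# Flip PC1 so that the coordinates match the x-axis used in the plots.
sample["x"] = -sample["PC1"]

# Mean plot position of each group: the three corners of the triangle.
centroids = sample.groupby("Group")[["x", "PC2"]].mean()
print(centroids)
```

In this small sample, WHG lands at negative x and negative PC2 (lower left), LBK_EN at slightly positive x and negative PC2 (lower right), and Yamnaya_Samara at positive PC2 (top).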
The second noteworthy thing is that present-day people from Northeastern Europe, such as Finns, Saami and other Uralic-speaking populations, are "dragged" towards the ancient samples from Bolshoy Oleni Ostrov. Indeed, a recent model published by us assumes that "Siberian" genetic ancestry entered Europe around 4000 years ago as a kind of fourth genetic component on top of the three other components discussed above, and is nowadays found in most Uralic speakers in Europe, including Finns, Saami and Estonians.
We can make a similar plot using the all-Eurasian PCA that we have run:
popListDat = pd.read_csv("/data/popgen_course/AllEurasia.poplist.txt",
                         names=["Population"]).sort_values(by="Population")
nPops = len(popListDat)
nCols = 9
nSymbols = int(nPops / nCols)
colorIndices = [int(i / nSymbols) for i in range(nPops)]
symbolIndices = [i % nSymbols for i in range(nPops)]
popListDat = popListDat.assign(colorIndex=colorIndices, symbolIndex=symbolIndices)
popListDat
| | Population | colorIndex | symbolIndex |
---|---|---|---|
0 | Abkhasian | 0 | 0 |
1 | Adygei | 0 | 1 |
2 | Albanian | 0 | 2 |
3 | Aleut | 0 | 3 |
4 | Aleut_Tlingit | 0 | 4 |
5 | Altaian | 0 | 5 |
6 | Ami | 0 | 6 |
7 | Armenian | 0 | 7 |
8 | Assyrian | 0 | 8 |
9 | Atayal | 0 | 9 |
10 | Avar | 0 | 10 |
11 | Azeri | 0 | 11 |
12 | Balkar | 0 | 12 |
13 | Basque | 1 | 0 |
14 | BedouinA | 1 | 1 |
15 | BedouinB | 1 | 2 |
16 | Belarusian | 1 | 3 |
17 | Borneo | 1 | 4 |
18 | Bulgarian | 1 | 5 |
19 | Buryat | 1 | 6 |
20 | Cambodian | 1 | 7 |
21 | Chechen | 1 | 8 |
22 | Chukchi | 1 | 9 |
23 | Chukchi1 | 1 | 10 |
24 | Chuvash | 1 | 11 |
25 | Croatian | 1 | 12 |
26 | Cypriot | 2 | 0 |
27 | Czech | 2 | 1 |
28 | Dai | 2 | 2 |
29 | Daur | 2 | 3 |
... | ... | ... | ... |
89 | Saami.DG | 6 | 11 |
90 | Saami_WGA | 6 | 12 |
91 | Sardinian | 7 | 0 |
92 | Saudi | 7 | 1 |
93 | Scottish | 7 | 2 |
94 | Selkup | 7 | 3 |
95 | Semende | 7 | 4 |
96 | She | 7 | 5 |
97 | Sherpa.DG | 7 | 6 |
98 | Sicilian | 7 | 7 |
99 | Spanish | 7 | 8 |
100 | Spanish_North | 7 | 9 |
101 | Syrian | 7 | 10 |
102 | Tajik | 7 | 11 |
103 | Thai | 7 | 12 |
104 | Tibetan.DG | 8 | 0 |
105 | Tu | 8 | 1 |
106 | Tubalar | 8 | 2 |
107 | Tujia | 8 | 3 |
108 | Turkish | 8 | 4 |
109 | Turkmen | 8 | 5 |
110 | Tuvinian | 8 | 6 |
111 | Ukrainian | 8 | 7 |
112 | Ulchi | 8 | 8 |
113 | Uygur | 8 | 9 |
114 | Uzbek | 8 | 10 |
115 | Xibo | 8 | 11 |
116 | Yakut | 8 | 12 |
117 | Yi | 9 | 0 |
118 | Yukagir | 9 | 1 |
119 rows × 3 columns
# Reuse the column_names list defined earlier; sep=r"\s+" splits on whitespace.
pcaDat = pd.read_csv("pca.AllEurasia.evec",
                     sep=r"\s+", skiprows=1, names=column_names)
plt.figure(figsize=(10,10))
symbolVec = ["8", "s", "p", "P", "*", "h", "H", "+", "x", "X", "D", "d", "v", "<", ">", "^"]
colorVec = ['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
            '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
# The population label is stored in the "Group" column of pcaDat.
for i, row in popListDat.iterrows():
    d = pcaDat[pcaDat.Group == row["Population"]]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c=colorVec[row["colorIndex"]],
                marker=symbolVec[row["symbolIndex"]], label=row["Population"])
# Ancient groups (and Saami.DG) in black, each with its own symbol.
for i, pop in enumerate(["Levanluhta", "Saami.DG", "BolshoyOleniOstrov", "Yamnaya_Samara", "LBK_EN", "WHG"]):
    d = pcaDat[pcaDat.Group == pop]
    plt.scatter(x=-d["PC1"], y=d["PC2"], c="black", marker=symbolVec[i], label=pop)
plt.xlabel("PC1");
plt.ylabel("PC2");
plt.legend(loc=(1.1, 0), ncol=3)
<matplotlib.legend.Legend at 0x7f7086207a58>
This PCA looks quite different. Here, all West Eurasian groups are squished together on the left side of the plot, and the East Asian populations are on the right. The plot roughly reflects geography, with northern East Asian people such as the Nganasan on the top right, and southern East Asian people like the Taiwanese Ami on the lower right. We can now see that the ancient samples from Russia and Finland, as well as present-day Uralic populations, are distributed between East and West, unlike most other Europeans. This confirms that these groups in Europe have a distinctive East Asian genetic ancestry, and we found that it is best represented by the Nganasan (Lamnidis et al. 2018).