First, we need some boilerplate code to load the plotting library and to make plots appear inside your Jupyter notebook:
%matplotlib inline
import matplotlib.pyplot as plt
We also need a second library, called pandas, which makes it easier to load and manipulate tabular data.
import pandas as pd
Finding documentation for Python functions is easy. In Jupyter, put a question mark in front of the function name:
?pd.read_csv
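The `?` prefix only works inside Jupyter/IPython. As a minimal sketch of how the same docstring can be reached from plain Python, the standard-library `inspect.getdoc` (or the built-in `help`) can be used; the variable names below are purely illustrative.

```python
import inspect

import pandas as pd

# inspect.getdoc returns the cleaned-up docstring as a plain string;
# help(pd.read_csv) would print the same text in full.
doc = inspect.getdoc(pd.read_csv)

# Show only the first few lines to keep the output short
first_lines = doc.splitlines()[:3]
for line in first_lines:
    print(line)
```

Calling `help(pd.read_csv)` instead prints the complete documentation, which is long but lists every parameter.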
Now we read in the population frequency data:
dat = pd.read_csv("population_frequencies.txt", sep=r"\s+", names=["nr", "pop"])
and verify that it worked:
dat
 | nr | pop |
---|---|---|
0 | 9 | Abkhasian |
1 | 16 | Adygei |
2 | 6 | Albanian |
3 | 7 | Aleut |
4 | 4 | Aleut_Tlingit |
5 | 7 | Altaian |
6 | 10 | Ami |
7 | 10 | Armenian |
8 | 9 | Atayal |
9 | 10 | Balkar |
10 | 29 | Basque |
11 | 25 | BedouinA |
12 | 19 | BedouinB |
13 | 10 | Belarusian |
14 | 6 | BolshoyOleniOstrov |
15 | 9 | Borneo |
16 | 10 | Bulgarian |
17 | 8 | Cambodian |
18 | 2 | Canary_Islander |
19 | 2 | ChalmnyVarre |
20 | 9 | Chechen |
21 | 20 | Chukchi |
22 | 3 | Chukchi1 |
23 | 10 | Chuvash |
24 | 10 | Croatian |
25 | 8 | Cypriot |
26 | 10 | Czech |
27 | 10 | Dai |
28 | 9 | Daur |
29 | 4 | Dolgan |
... | ... | ... |
86 | 27 | Sardinian |
87 | 8 | Saudi |
88 | 4 | Scottish |
89 | 10 | Selkup |
90 | 10 | Semende |
91 | 10 | She |
92 | 2 | Sherpa.DG |
93 | 11 | Sicilian |
94 | 53 | Spanish |
95 | 5 | Spanish_North |
96 | 8 | Syrian |
97 | 8 | Tajik |
98 | 10 | Thai |
99 | 2 | Tibetan.DG |
100 | 10 | Tu |
101 | 22 | Tubalar |
102 | 10 | Tujia |
103 | 50 | Turkish |
104 | 7 | Turkmen |
105 | 10 | Tuvinian |
106 | 9 | Ukrainian |
107 | 25 | Ulchi |
108 | 10 | Uygur |
109 | 10 | Uzbek |
110 | 3 | WHG |
111 | 7 | Xibo |
112 | 20 | Yakut |
113 | 9 | Yamnaya_Samara |
114 | 10 | Yi |
115 | 19 | Yukagir |
116 rows × 2 columns
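To experiment with the loading step without the data file, the sketch below reads a few rows from an in-memory string. It assumes the file layout is one count followed by one population name per line, separated by whitespace (inferred from the table above); `sample` and `sample_dat` are hypothetical names. The `sep=r"\s+"` argument splits on any run of whitespace, which is the same behaviour as `delim_whitespace=True`.

```python
import io

import pandas as pd

# A few rows copied from the table above, in the assumed "count name" layout
sample = """9 Abkhasian
16 Adygei
6 Albanian
7 Aleut
"""

# names= supplies the column labels, since the data has no header line
sample_dat = pd.read_csv(io.StringIO(sample), sep=r"\s+", names=["nr", "pop"])

print(sample_dat)
print(sample_dat.dtypes)  # "nr" should be parsed as integers, "pop" as text
```

Because no header row exists in the data, leaving out `names=` would make pandas treat the first data line as column names.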
OK, so let's proceed with simple plotting:
plt.plot(dat["nr"])
[<matplotlib.lines.Line2D at 0x7f8e0c1af198>]
Not bad, but we'd like to sort the values. For that we use the sort_values method of the data frame:
?dat.sort_values
dat_sorted = dat.sort_values(by="nr")
dat_sorted
 | nr | pop |
---|---|---|
44 | 1 | Italian_South |
56 | 1 | JK2065 |
85 | 1 | Saami_WGA |
67 | 2 | Levanluhta |
84 | 2 | Saami.DG |
19 | 2 | ChalmnyVarre |
18 | 2 | Canary_Islander |
92 | 2 | Sherpa.DG |
99 | 2 | Tibetan.DG |
22 | 3 | Chukchi1 |
110 | 3 | WHG |
88 | 4 | Scottish |
29 | 4 | Dolgan |
4 | 4 | Aleut_Tlingit |
95 | 5 | Spanish_North |
50 | 6 | Jew_Iraqi |
60 | 6 | Korean |
45 | 6 | Itelmen |
14 | 6 | BolshoyOleniOstrov |
2 | 6 | Albanian |
73 | 6 | Mongola |
52 | 6 | Jew_Moroccan |
47 | 7 | Jew_Ashkenazi |
104 | 7 | Turkmen |
48 | 7 | Jew_Georgian |
53 | 7 | Jew_Tunisian |
3 | 7 | Aleut |
5 | 7 | Altaian |
111 | 7 | Xibo |
42 | 8 | Iranian |
... | ... | ... |
31 | 10 | English |
105 | 10 | Tuvinian |
13 | 10 | Belarusian |
78 | 11 | Norwegian |
76 | 11 | Nganasan |
93 | 11 | Sicilian |
41 | 12 | Icelandic |
79 | 13 | Orcadian |
65 | 14 | LBK_EN |
1 | 16 | Adygei |
115 | 19 | Yukagir |
12 | 19 | BedouinB |
21 | 20 | Chukchi |
112 | 20 | Yakut |
43 | 20 | Italian_North |
40 | 20 | Hungarian |
37 | 20 | Greek |
101 | 22 | Tubalar |
83 | 22 | Russian |
107 | 25 | Ulchi |
11 | 25 | BedouinA |
86 | 27 | Sardinian |
10 | 29 | Basque |
46 | 29 | Japanese |
35 | 32 | French |
82 | 38 | Palestinian |
30 | 39 | Druze |
38 | 43 | Han |
103 | 50 | Turkish |
94 | 53 | Spanish |
116 rows × 2 columns
x = range(len(dat_sorted))
y = dat_sorted["nr"]
plt.plot(x, y)
[<matplotlib.lines.Line2D at 0x7f8e0bff15c0>]
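Note that sorting keeps the original row labels (the left-most column in the sorted output), which is why the x positions are built with `range` rather than taken from the index. The following sketch, on a small hypothetical frame named `small`, shows this and two optional variations: `ascending=False` for largest-first order and `reset_index(drop=True)` to renumber the rows.

```python
import pandas as pd

# Hypothetical four-row frame using values from the table above
small = pd.DataFrame(
    {"nr": [9, 16, 6, 7], "pop": ["Abkhasian", "Adygei", "Albanian", "Aleut"]}
)

# Sorting reorders the rows but keeps each row's original label
small_sorted = small.sort_values(by="nr")
print(small_sorted.index.tolist())

# Largest first, then replace the old labels with 0..n-1
by_size = small.sort_values(by="nr", ascending=False).reset_index(drop=True)
print(by_size)
```

With the index reset, `plt.plot(by_size["nr"])` alone would already draw the values in sorted order.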
Now we just need to add tick labels and change the size of the plot:
dat_sorted = dat.sort_values(by="nr")
y = dat_sorted["nr"]
x = range(len(y))
xticks = dat_sorted["pop"]
plt.figure(figsize=(20,8))
plt.plot(x, y)
plt.xticks(x, xticks, rotation="vertical");
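For readers who want to try the complete plotting step without the data file, here is a self-contained sketch on a small hypothetical frame. The y-axis label simply reuses the column name `nr`, and the marker style and `tight_layout` call are optional additions rather than part of the steps above.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical four-row frame using values from the table above
small = pd.DataFrame(
    {"nr": [9, 16, 6, 7], "pop": ["Abkhasian", "Adygei", "Albanian", "Aleut"]}
)

small_sorted = small.sort_values(by="nr")
y = small_sorted["nr"]
x = range(len(y))
labels = small_sorted["pop"].tolist()

plt.figure(figsize=(6, 4))
plt.plot(x, y, marker="o")                  # one point per population
plt.xticks(x, labels, rotation="vertical")  # tick positions first, then their labels
plt.ylabel("nr")
plt.tight_layout()                          # keeps the rotated labels inside the figure
```

In a notebook the figure appears below the cell; in a standalone script, `plt.show()` or `plt.savefig(...)` would be needed to see it.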
OK, this was a very short introduction to Python and plotting. Clearly there is a lot more to learn, but hopefully this serves as a teaser. The matplotlib and pandas libraries are both well documented; check out their websites to find out more.