Learning Objectives
Lesson outline
Do you remember how to:
1 Read in the data from the csv file from yesterday?
import pandas as pd
world_data = pd.read_csv('https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv')
# If saved locally yesterday:
# world_data = pd.read_csv("world_data.csv")
2 How to select only the columns 'country' and 'year' from the data frame?
world_data[['country', 'year']].head() #head just to limit output
 | country | year |
---|---|---|
0 | Afghanistan | 1800 |
1 | Afghanistan | 1801 |
2 | Afghanistan | 1802 |
3 | Afghanistan | 1803 |
4 | Afghanistan | 1804 |
3 How to select a few rows together with the columns above?
world_data.loc[[1, 13, 24], ['country', 'year']]
 | country | year |
---|---|---|
1 | Afghanistan | 1801 |
13 | Afghanistan | 1813 |
24 | Afghanistan | 1824 |
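The difference between label-based and position-based selection is easy to miss when the index is the default 0, 1, 2, and so on. The following is a minimal sketch on a small made-up data frame (the `toy` frame and its values are hypothetical and not part of the gapminder data). It illustrates that `.loc` looks up index labels and column names, while `.iloc` looks up integer positions.

```python
import pandas as pd

# Hypothetical toy data; the index labels are deliberately not 0, 1, 2
toy = pd.DataFrame(
    {'country': ['A', 'B', 'C'],
     'year': [1800, 1801, 1802],
     'population': [10, 20, 30]},
    index=[10, 20, 30],
)

# .loc selects by index label and column name
by_label = toy.loc[[10, 30], ['country', 'year']]

# .iloc selects by integer position (rows 0 and 2, columns 0 and 1)
by_position = toy.iloc[[0, 2], [0, 1]]

# Both selections pick out the same rows and columns
print(by_label.equals(by_position))
```

With `world_data`, the two behave the same for the rows because the index labels happen to coincide with the row positions.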
4 How to select only data from year 1995?
world_data.loc[world_data['year'] == 1995]
 | country | year | population | region | sub_region | income_group | life_expectancy | income | children_per_woman | child_mortality | pop_density | co2_per_capita | years_in_school_men | years_in_school_women |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
195 | Afghanistan | 1995 | 17100000 | Asia | Southern Asia | Low | 51.1 | 881 | 7.61 | 150.0 | 26.20 | 0.0727 | 2.56 | 0.49 |
414 | Albania | 1995 | 3110000 | Europe | Southern Europe | Upper middle | 74.1 | 4130 | 2.59 | 32.9 | 113.00 | 0.6720 | 9.31 | 9.07 |
633 | Algeria | 1995 | 28900000 | Africa | Northern Africa | Upper middle | 72.3 | 9300 | 3.45 | 43.3 | 12.10 | 3.3000 | 5.67 | 4.84 |
852 | Angola | 1995 | 14300000 | Africa | Sub-Saharan Africa | Lower middle | 52.0 | 2970 | 6.92 | 223.0 | 11.40 | 0.7690 | 4.89 | 3.05 |
1071 | Antigua and Barbuda | 1995 | 73600 | Americas | Latin America and the Caribbean | High | 74.4 | 16500 | 2.21 | 19.5 | 167.00 | 3.7400 | 10.50 | 11.40 |
1290 | Argentina | 1995 | 35000000 | Americas | Latin America and the Caribbean | High | 73.1 | 13900 | 2.76 | 24.3 | 12.80 | 3.6600 | 9.53 | 10.00 |
1509 | Armenia | 1995 | 3220000 | Asia | Western Asia | Upper middle | 69.3 | 2170 | 1.80 | 38.7 | 113.00 | 1.0600 | 10.10 | 10.20 |
1728 | Australia | 1995 | 18100000 | Oceania | Australia and New Zealand | High | 78.3 | 30400 | 1.82 | 7.0 | 2.35 | 15.6000 | 11.80 | 11.80 |
1947 | Austria | 1995 | 7990000 | Europe | Western Europe | High | 76.7 | 33700 | 1.42 | 6.8 | 97.00 | 7.4800 | 10.90 | 10.50 |
2166 | Azerbaijan | 1995 | 7780000 | Asia | Western Asia | Upper middle | 64.9 | 3320 | 2.58 | 94.1 | 94.10 | 4.2900 | 10.70 | 10.30 |
2385 | Bahamas | 1995 | 280000 | Americas | Latin America and the Caribbean | High | 70.6 | 22100 | 2.51 | 18.7 | 28.00 | 6.0100 | 10.00 | 10.40 |
2604 | Bahrain | 1995 | 564000 | Asia | Western Asia | High | 70.3 | 43500 | 3.10 | 18.1 | 742.00 | 26.3000 | 7.41 | 7.54 |
2823 | Bangladesh | 1995 | 119000000 | Asia | Southern Asia | Lower middle | 61.7 | 1440 | 3.73 | 114.0 | 912.00 | 0.1920 | 4.34 | 2.75 |
3042 | Barbados | 1995 | 265000 | Americas | Latin America and the Caribbean | High | 73.7 | 12400 | 1.73 | 14.7 | 616.00 | 3.1300 | 7.53 | 7.82 |
3261 | Belarus | 1995 | 10100000 | Europe | Eastern Europe | Upper middle | 68.3 | 5450 | 1.47 | 15.7 | 50.00 | 5.9900 | 11.10 | 11.60 |
3480 | Belgium | 1995 | 10200000 | Europe | Western Europe | High | 76.9 | 32700 | 1.61 | 7.6 | 336.00 | 11.0000 | 11.40 | 11.60 |
3699 | Belize | 1995 | 207000 | Americas | Latin America and the Caribbean | Upper middle | 70.7 | 6210 | 4.11 | 29.5 | 9.07 | 1.8200 | 7.17 | 6.63 |
3918 | Benin | 1995 | 5910000 | Africa | Sub-Saharan Africa | Low | 56.5 | 1520 | 6.36 | 158.0 | 52.40 | 0.2250 | 3.62 | 1.48 |
4137 | Bhutan | 1995 | 515000 | Asia | Southern Asia | Lower middle | 62.9 | 2900 | 4.60 | 101.0 | 13.50 | 0.4840 | 4.41 | 1.78 |
4356 | Bolivia | 1995 | 7570000 | Americas | Latin America and the Caribbean | Lower middle | 64.3 | 4110 | 4.58 | 101.0 | 6.98 | 1.3000 | 8.11 | 6.64 |
4575 | Bosnia and Herzegovina | 1995 | 3840000 | Europe | Southern Europe | Upper middle | 68.9 | 1830 | 1.71 | 14.2 | 75.40 | 0.8920 | 8.46 | 7.82 |
4794 | Botswana | 1995 | 1570000 | Africa | Sub-Saharan Africa | Upper middle | 56.4 | 8900 | 3.95 | 71.6 | 2.77 | 1.9400 | 5.11 | 5.59 |
5013 | Brazil | 1995 | 162000000 | Americas | Latin America and the Caribbean | Upper middle | 69.7 | 11100 | 2.50 | 49.1 | 19.40 | 1.5900 | 5.99 | 6.46 |
5232 | Bulgaria | 1995 | 8380000 | Europe | Eastern Europe | Upper middle | 71.0 | 8450 | 1.34 | 19.2 | 77.20 | 6.9200 | 10.70 | 11.20 |
5451 | Burkina Faso | 1995 | 10100000 | Africa | Sub-Saharan Africa | Low | 50.7 | 869 | 6.84 | 195.0 | 36.90 | 0.0621 | 1.86 | 0.91 |
5670 | Burundi | 1995 | 5960000 | Africa | Sub-Saharan Africa | Low | 47.0 | 870 | 7.29 | 169.0 | 232.00 | 0.0400 | 3.45 | 2.31 |
5889 | Cambodia | 1995 | 10700000 | Asia | South-eastern Asia | Lower middle | 58.2 | 1100 | 4.69 | 120.0 | 60.40 | 0.1460 | 4.97 | 3.36 |
6108 | Cameroon | 1995 | 13500000 | Africa | Sub-Saharan Africa | Lower middle | 56.5 | 2260 | 5.98 | 166.0 | 28.50 | 0.3140 | 6.45 | 4.36 |
6327 | Canada | 1995 | 29300000 | Americas | Northern America | High | 78.0 | 32200 | 1.64 | 6.9 | 3.23 | 15.9000 | 13.60 | 13.60 |
6546 | Central African Republic | 1995 | 3350000 | Africa | Sub-Saharan Africa | Low | 46.2 | 858 | 5.62 | 175.0 | 5.38 | 0.0700 | 4.92 | 2.43 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
32607 | Sri Lanka | 1995 | 18200000 | Asia | Southern Asia | Lower middle | 72.2 | 4510 | 2.29 | 20.3 | 291.00 | 0.3240 | 8.57 | 8.44 |
32826 | Sudan | 1995 | 24100000 | Africa | Northern Africa | Lower middle | 60.4 | 1960 | 5.83 | 120.0 | 13.70 | 0.1780 | 5.70 | 3.37 |
33045 | Suriname | 1995 | 444000 | Americas | Latin America and the Caribbean | Upper middle | 70.3 | 9620 | 3.01 | 39.7 | 2.84 | 4.6500 | 7.25 | 6.89 |
33264 | Swaziland | 1995 | 961000 | Africa | Sub-Saharan Africa | Lower middle | 59.4 | 5610 | 4.80 | 83.7 | 55.90 | 0.4730 | 6.94 | 6.80 |
33483 | Sweden | 1995 | 8840000 | Europe | Northern Europe | High | 78.8 | 31100 | 1.73 | 4.8 | 21.50 | 6.2400 | 12.20 | 12.40 |
33702 | Switzerland | 1995 | 7020000 | Europe | Western Europe | High | 78.5 | 45900 | 1.52 | 6.4 | 178.00 | 5.5900 | 12.20 | 11.40 |
33921 | Syria | 1995 | 14300000 | Asia | Western Asia | Low | 72.1 | 4890 | 4.51 | 29.6 | 78.10 | 2.9000 | 7.18 | 5.05 |
34140 | Tajikistan | 1995 | 5760000 | Asia | Central Asia | Low | 63.9 | 1270 | 4.59 | 119.0 | 41.20 | 0.4250 | 10.50 | 9.52 |
34359 | Tanzania | 1995 | 30000000 | Africa | Sub-Saharan Africa | Low | 52.9 | 1370 | 5.88 | 164.0 | 33.80 | 0.1190 | 5.77 | 4.57 |
34578 | Thailand | 1995 | 59500000 | Asia | South-eastern Asia | Upper middle | 70.9 | 9380 | 1.87 | 29.2 | 116.00 | 2.7100 | 7.30 | 6.95 |
34797 | Timor-Leste | 1995 | 871000 | Asia | South-eastern Asia | Lower middle | 60.9 | 1560 | 6.38 | 139.0 | 58.60 | NaN | 5.49 | 4.09 |
35016 | Togo | 1995 | 4270000 | Africa | Sub-Saharan Africa | Low | 57.2 | 1200 | 5.76 | 134.0 | 78.60 | 0.2230 | 5.03 | 2.32 |
35235 | Tonga | 1995 | 96100 | Oceania | Polynesia | Upper middle | 68.4 | 4260 | 4.45 | 18.8 | 133.00 | 0.9920 | 9.61 | 9.49 |
35454 | Trinidad and Tobago | 1995 | 1260000 | Americas | Latin America and the Caribbean | High | 69.3 | 12900 | 1.96 | 28.0 | 245.00 | 13.6000 | 9.81 | 10.10 |
35673 | Tunisia | 1995 | 9110000 | Africa | Northern Africa | Lower middle | 73.0 | 6130 | 2.61 | 44.7 | 58.70 | 1.7300 | 8.20 | 5.28 |
35892 | Turkey | 1995 | 58500000 | Asia | Western Asia | Upper middle | 70.9 | 12300 | 2.76 | 54.8 | 76.00 | 2.9400 | 7.62 | 5.59 |
36111 | Turkmenistan | 1995 | 4210000 | Asia | Central Asia | Upper middle | 63.1 | 4600 | 3.51 | 87.5 | 8.95 | 8.0800 | 11.20 | 10.90 |
36330 | Uganda | 1995 | 20600000 | Africa | Sub-Saharan Africa | Low | 47.0 | 931 | 7.02 | 171.0 | 103.00 | 0.0457 | 5.72 | 3.69 |
36549 | Ukraine | 1995 | 50900000 | Europe | Eastern Europe | Lower middle | 66.6 | 5060 | 1.41 | 20.3 | 87.90 | 8.7600 | 11.20 | 11.50 |
36768 | United Arab Emirates | 1995 | 2450000 | Asia | Western Asia | High | 73.5 | 102000 | 3.42 | 13.1 | 29.30 | 28.8000 | 9.00 | 9.20 |
36987 | United Kingdom | 1995 | 58000000 | Europe | Northern Europe | High | 76.6 | 28600 | 1.76 | 7.2 | 240.00 | 9.2800 | 12.10 | 12.00 |
37206 | United States | 1995 | 266000000 | Americas | Northern America | High | 75.9 | 39500 | 1.98 | 9.5 | 29.00 | 19.3000 | 13.40 | 13.40 |
37425 | Uruguay | 1995 | 3220000 | Americas | Latin America and the Caribbean | High | 73.5 | 11500 | 2.40 | 20.8 | 18.40 | 1.4200 | 8.70 | 9.31 |
37644 | Uzbekistan | 1995 | 22900000 | Asia | Central Asia | Lower middle | 66.2 | 2240 | 3.53 | 70.5 | 53.70 | 4.5200 | 10.50 | 10.20 |
37863 | Vanuatu | 1995 | 168000 | Oceania | Melanesia | Lower middle | 62.3 | 2610 | 4.73 | 30.4 | 13.80 | 0.3920 | 6.70 | 5.86 |
38082 | Venezuela | 1995 | 22200000 | Americas | Latin America and the Caribbean | Upper middle | 73.0 | 15300 | 3.08 | 26.2 | 25.20 | 6.0100 | 7.87 | 8.22 |
38301 | Vietnam | 1995 | 75200000 | Asia | South-eastern Asia | Lower middle | 69.5 | 2040 | 2.71 | 39.0 | 243.00 | 0.3870 | 7.23 | 6.63 |
38520 | Yemen | 1995 | 15300000 | Asia | Western Asia | Low | 60.5 | 3530 | 7.53 | 112.0 | 29.00 | 0.6830 | 4.71 | 0.95 |
38739 | Zambia | 1995 | 9140000 | Africa | Sub-Saharan Africa | Lower middle | 46.5 | 2030 | 6.19 | 177.0 | 12.30 | 0.2380 | 6.73 | 5.13 |
38958 | Zimbabwe | 1995 | 11300000 | Africa | Sub-Saharan Africa | Low | 53.7 | 2480 | 4.43 | 90.1 | 29.30 | 1.3400 | 8.41 | 6.92 |
178 rows × 14 columns
5 Select only the rows where the region is Asia or Africa.
world_data.loc[world_data['region'].isin(['Asia', 'Africa'])]
 | country | year | population | region | sub_region | income_group | life_expectancy | income | children_per_woman | child_mortality | pop_density | co2_per_capita | years_in_school_men | years_in_school_women |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Afghanistan | 1800 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.00 | 469.0 | NaN | NaN | NaN | NaN |
1 | Afghanistan | 1801 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.00 | 469.0 | NaN | NaN | NaN | NaN |
2 | Afghanistan | 1802 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.00 | 469.0 | NaN | NaN | NaN | NaN |
3 | Afghanistan | 1803 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.00 | 469.0 | NaN | NaN | NaN | NaN |
4 | Afghanistan | 1804 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.00 | 469.0 | NaN | NaN | NaN | NaN |
5 | Afghanistan | 1805 | 3280000 | Asia | Southern Asia | Low | 28.2 | 603 | 7.00 | 469.0 | NaN | NaN | NaN | NaN |
6 | Afghanistan | 1806 | 3280000 | Asia | Southern Asia | Low | 28.1 | 603 | 7.00 | 470.0 | NaN | NaN | NaN | NaN |
7 | Afghanistan | 1807 | 3280000 | Asia | Southern Asia | Low | 28.1 | 603 | 7.00 | 470.0 | NaN | NaN | NaN | NaN |
8 | Afghanistan | 1808 | 3280000 | Asia | Southern Asia | Low | 28.1 | 603 | 7.00 | 470.0 | NaN | NaN | NaN | NaN |
9 | Afghanistan | 1809 | 3280000 | Asia | Southern Asia | Low | 28.1 | 603 | 7.00 | 470.0 | NaN | NaN | NaN | NaN |
10 | Afghanistan | 1810 | 3280000 | Asia | Southern Asia | Low | 28.1 | 604 | 7.00 | 470.0 | NaN | NaN | NaN | NaN |
11 | Afghanistan | 1811 | 3280000 | Asia | Southern Asia | Low | 28.1 | 604 | 7.00 | 470.0 | NaN | NaN | NaN | NaN |
12 | Afghanistan | 1812 | 3280000 | Asia | Southern Asia | Low | 28.1 | 604 | 7.00 | 470.0 | NaN | NaN | NaN | NaN |
13 | Afghanistan | 1813 | 3280000 | Asia | Southern Asia | Low | 28.1 | 604 | 7.00 | 470.0 | NaN | NaN | NaN | NaN |
14 | Afghanistan | 1814 | 3290000 | Asia | Southern Asia | Low | 28.1 | 604 | 7.00 | 470.0 | NaN | NaN | NaN | NaN |
15 | Afghanistan | 1815 | 3290000 | Asia | Southern Asia | Low | 28.1 | 604 | 7.00 | 470.0 | NaN | NaN | NaN | NaN |
16 | Afghanistan | 1816 | 3300000 | Asia | Southern Asia | Low | 28.1 | 604 | 7.00 | 471.0 | NaN | NaN | NaN | NaN |
17 | Afghanistan | 1817 | 3300000 | Asia | Southern Asia | Low | 28.0 | 604 | 7.00 | 471.0 | NaN | NaN | NaN | NaN |
18 | Afghanistan | 1818 | 3310000 | Asia | Southern Asia | Low | 28.0 | 604 | 7.00 | 471.0 | NaN | NaN | NaN | NaN |
19 | Afghanistan | 1819 | 3320000 | Asia | Southern Asia | Low | 28.0 | 604 | 7.00 | 471.0 | NaN | NaN | NaN | NaN |
20 | Afghanistan | 1820 | 3320000 | Asia | Southern Asia | Low | 28.0 | 604 | 7.00 | 471.0 | NaN | NaN | NaN | NaN |
21 | Afghanistan | 1821 | 3330000 | Asia | Southern Asia | Low | 28.0 | 607 | 7.00 | 471.0 | NaN | NaN | NaN | NaN |
22 | Afghanistan | 1822 | 3340000 | Asia | Southern Asia | Low | 28.0 | 609 | 7.00 | 471.0 | NaN | NaN | NaN | NaN |
23 | Afghanistan | 1823 | 3350000 | Asia | Southern Asia | Low | 28.0 | 611 | 7.00 | 471.0 | NaN | NaN | NaN | NaN |
24 | Afghanistan | 1824 | 3360000 | Asia | Southern Asia | Low | 28.0 | 613 | 7.00 | 471.0 | NaN | NaN | NaN | NaN |
25 | Afghanistan | 1825 | 3380000 | Asia | Southern Asia | Low | 27.9 | 615 | 7.00 | 471.0 | NaN | NaN | NaN | NaN |
26 | Afghanistan | 1826 | 3390000 | Asia | Southern Asia | Low | 27.9 | 617 | 7.00 | 473.0 | NaN | NaN | NaN | NaN |
27 | Afghanistan | 1827 | 3400000 | Asia | Southern Asia | Low | 27.9 | 619 | 7.00 | 473.0 | NaN | NaN | NaN | NaN |
28 | Afghanistan | 1828 | 3420000 | Asia | Southern Asia | Low | 27.9 | 621 | 7.00 | 473.0 | NaN | NaN | NaN | NaN |
29 | Afghanistan | 1829 | 3430000 | Asia | Southern Asia | Low | 27.9 | 623 | 7.00 | 473.0 | NaN | NaN | NaN | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
38952 | Zimbabwe | 1989 | 9900000 | Africa | Sub-Saharan Africa | Low | 62.7 | 2490 | 5.37 | 73.9 | 25.6 | 1.630 | 7.61 | 6.01 |
38953 | Zimbabwe | 1990 | 10200000 | Africa | Sub-Saharan Africa | Low | 61.7 | 2590 | 5.18 | 75.2 | 26.3 | 1.540 | 7.74 | 6.16 |
38954 | Zimbabwe | 1991 | 10400000 | Africa | Sub-Saharan Africa | Low | 61.0 | 2670 | 5.00 | 77.4 | 27.0 | 1.530 | 7.88 | 6.31 |
38955 | Zimbabwe | 1992 | 10700000 | Africa | Sub-Saharan Africa | Low | 59.4 | 2370 | 4.84 | 80.2 | 27.6 | 1.590 | 8.01 | 6.46 |
38956 | Zimbabwe | 1993 | 10900000 | Africa | Sub-Saharan Africa | Low | 57.6 | 2350 | 4.69 | 83.4 | 28.2 | 1.500 | 8.14 | 6.61 |
38957 | Zimbabwe | 1994 | 11100000 | Africa | Sub-Saharan Africa | Low | 55.8 | 2520 | 4.56 | 86.8 | 28.7 | 1.600 | 8.28 | 6.76 |
38958 | Zimbabwe | 1995 | 11300000 | Africa | Sub-Saharan Africa | Low | 53.7 | 2480 | 4.43 | 90.1 | 29.3 | 1.340 | 8.41 | 6.92 |
38959 | Zimbabwe | 1996 | 11500000 | Africa | Sub-Saharan Africa | Low | 52.2 | 2690 | 4.33 | 92.8 | 29.8 | 1.300 | 8.54 | 7.07 |
38960 | Zimbabwe | 1997 | 11700000 | Africa | Sub-Saharan Africa | Low | 50.8 | 2710 | 4.24 | 94.7 | 30.3 | 1.230 | 8.67 | 7.23 |
38961 | Zimbabwe | 1998 | 11900000 | Africa | Sub-Saharan Africa | Low | 49.1 | 2750 | 4.16 | 95.9 | 30.7 | 1.200 | 8.80 | 7.39 |
38962 | Zimbabwe | 1999 | 12100000 | Africa | Sub-Saharan Africa | Low | 47.8 | 2690 | 4.10 | 96.4 | 31.2 | 1.310 | 8.93 | 7.55 |
38963 | Zimbabwe | 2000 | 12200000 | Africa | Sub-Saharan Africa | Low | 46.7 | 2570 | 4.06 | 96.8 | 31.6 | 1.140 | 9.07 | 7.71 |
38964 | Zimbabwe | 2001 | 12400000 | Africa | Sub-Saharan Africa | Low | 46.2 | 2580 | 4.02 | 97.1 | 32.0 | 1.020 | 9.20 | 7.87 |
38965 | Zimbabwe | 2002 | 12500000 | Africa | Sub-Saharan Africa | Low | 45.6 | 2320 | 4.00 | 97.7 | 32.3 | 0.957 | 9.33 | 8.03 |
38966 | Zimbabwe | 2003 | 12600000 | Africa | Sub-Saharan Africa | Low | 45.3 | 1910 | 3.99 | 98.2 | 32.7 | 0.843 | 9.47 | 8.20 |
38967 | Zimbabwe | 2004 | 12800000 | Africa | Sub-Saharan Africa | Low | 45.1 | 1780 | 3.98 | 99.0 | 33.0 | 0.742 | 9.60 | 8.36 |
38968 | Zimbabwe | 2005 | 12900000 | Africa | Sub-Saharan Africa | Low | 45.3 | 1650 | 3.99 | 99.7 | 33.4 | 0.832 | 9.73 | 8.53 |
38969 | Zimbabwe | 2006 | 13100000 | Africa | Sub-Saharan Africa | Low | 45.7 | 1580 | 3.99 | 100.0 | 33.9 | 0.796 | 9.87 | 8.69 |
38970 | Zimbabwe | 2007 | 13300000 | Africa | Sub-Saharan Africa | Low | 46.4 | 1490 | 4.00 | 100.0 | 34.5 | 0.742 | 10.00 | 8.86 |
38971 | Zimbabwe | 2008 | 13600000 | Africa | Sub-Saharan Africa | Low | 46.7 | 1210 | 4.01 | 98.0 | 35.0 | 0.573 | 10.10 | 9.03 |
38972 | Zimbabwe | 2009 | 13800000 | Africa | Sub-Saharan Africa | Low | 47.5 | 1290 | 4.02 | 94.9 | 35.7 | 0.406 | 10.30 | 9.19 |
38973 | Zimbabwe | 2010 | 14100000 | Africa | Sub-Saharan Africa | Low | 49.6 | 1460 | 4.03 | 89.9 | 36.4 | 0.552 | 10.40 | 9.36 |
38974 | Zimbabwe | 2011 | 14400000 | Africa | Sub-Saharan Africa | Low | 51.9 | 1660 | 4.02 | 83.8 | 37.2 | 0.665 | 10.50 | 9.53 |
38975 | Zimbabwe | 2012 | 14700000 | Africa | Sub-Saharan Africa | Low | 54.1 | 1850 | 4.00 | 76.0 | 38.0 | 0.530 | 10.70 | 9.70 |
38976 | Zimbabwe | 2013 | 15100000 | Africa | Sub-Saharan Africa | Low | 55.6 | 1900 | 3.96 | 70.0 | 38.9 | 0.776 | 10.80 | 9.86 |
38977 | Zimbabwe | 2014 | 15400000 | Africa | Sub-Saharan Africa | Low | 57.0 | 1910 | 3.90 | 64.3 | 39.8 | 0.780 | 10.90 | 10.00 |
38978 | Zimbabwe | 2015 | 15800000 | Africa | Sub-Saharan Africa | Low | 58.3 | 1890 | 3.84 | 59.9 | 40.8 | NaN | 11.10 | 10.20 |
38979 | Zimbabwe | 2016 | 16200000 | Africa | Sub-Saharan Africa | Low | 59.3 | 1860 | 3.76 | 56.4 | 41.7 | NaN | NaN | NaN |
38980 | Zimbabwe | 2017 | 16500000 | Africa | Sub-Saharan Africa | Low | 59.8 | 1910 | 3.68 | 56.8 | 42.7 | NaN | NaN | NaN |
38981 | Zimbabwe | 2018 | 16900000 | Africa | Sub-Saharan Africa | Low | 60.2 | 1950 | 3.61 | 55.5 | 43.7 | NaN | NaN | NaN |
21681 rows × 14 columns
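The row filters above can also be combined. As a small sketch on hypothetical toy data (the `toy` frame stands in for `world_data`), each condition produces a boolean Series, and these can be joined with `&` (and) or `|` (or). When the conditions are written inline rather than stored in variables, each one needs its own parentheses.

```python
import pandas as pd

# Hypothetical toy data standing in for world_data
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'year': [1995, 2018, 1995, 2018, 1995, 2018],
    'region': ['Asia', 'Asia', 'Europe', 'Europe', 'Africa', 'Africa'],
})

# Each comparison yields a boolean Series with one value per row
in_region = toy['region'].isin(['Asia', 'Africa'])
in_2018 = toy['year'] == 2018

# Keep only the rows where both conditions are True
subset = toy.loc[in_region & in_2018]
print(subset)
```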
6 Calculate the total population in each region
world_data.groupby('region')['population'].sum()
region
Africa       59192998600
Americas     63837885500
Asia        330133218800
Europe       98766930400
Oceania       2422277600
Name: population, dtype: int64
7 Get the number of countries in each region for the year 2018.
world_data.loc[world_data['year'] == 2018].groupby('region').size()
region
Africa      52
Americas    31
Asia        47
Europe      39
Oceania      9
dtype: int64
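Note that summing `population` over the whole data frame adds up every yearly observation for each country, which is why the regional sums in question 6 are far larger than any real population. The sketch below uses hypothetical toy values to show one way of getting the totals for a single year, and of counting distinct countries without filtering on a year first.

```python
import pandas as pd

# Hypothetical toy data: two years of observations per country
toy = pd.DataFrame({
    'country': ['A', 'A', 'B', 'B', 'C', 'C'],
    'year': [2017, 2018, 2017, 2018, 2017, 2018],
    'region': ['Asia', 'Asia', 'Asia', 'Asia', 'Europe', 'Europe'],
    'population': [10, 11, 20, 22, 5, 6],
})

# Summing over all rows counts every year for every country
all_years = toy.groupby('region')['population'].sum()

# Restrict to one year first to get that year's regional totals
pop_2018 = toy.loc[toy['year'] == 2018].groupby('region')['population'].sum()

# Count distinct countries per region, however many years each one has
n_countries = toy.groupby('region')['country'].nunique()

print(all_years, pop_2018, n_countries, sep='\n\n')
```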
Although it's essential to quantitatively assess any conclusion drawn from the data, the human visual system is still one of the most advanced tools for detecting patterns in data, and it allows for quick exploration of complex relationships. Visualizations are also a highly efficient way of communicating insights drawn from the data. Therefore, it is important to know how to graphically represent the underlying data in a way that is suitable for humans to understand.
There are many plotting packages in Python, making it possible to create diverse visualizations such as interactive web graphics, 3D animations, statistical visualizations, and map-based plots. When starting out, it can be helpful to find an example of a plot that looks like what you want to create and then copy and modify that code. Examples of plots can be found in many excellent online Python plotting galleries, such as the example galleries in the matplotlib and seaborn documentation.
Our focus will be on two of the most useful packages for researchers: matplotlib, which is a robust, detail-oriented, low-level plotting interface, and seaborn, which provides high-level functions on top of matplotlib and allows the plotting calls to be expressed more in terms of what is being explored in the underlying data rather than what graphical elements to add to the plot. The high-level figures created by seaborn can be configured via the matplotlib parameters, so learning these packages in tandem is useful.
By default, plots are displayed in a separate window rather than within the notebook. To change this option and always display plots in the notebook, run the following line. Note that in newer versions of the notebook this might not be needed, but it's good to be explicit.
%matplotlib inline
# Note that this will only need to be done the first time you create a plot in a notebook
# all subsequent plots will show up as expected.
To facilitate the understanding of plotting concepts, the initial examples here will not include data frames, but instead have simple lists holding just a few data points.
To create a line plot, the plot() function from matplotlib can be used.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [1, 2, 4, 3]
plt.plot(x, y)
[<matplotlib.lines.Line2D at 0x7f4dc6a43dd8>]
Using plot() like this is not very explicit since a few things happen "under the hood", e.g. a figure is automatically created and it is assumed that the plot should go into the currently active region of this figure. This gives little control over exactly where to place the plots within a figure and how to make modifications to the plot after creating it, e.g. adding a title or labeling the axes.
To facilitate modifications to the plot, it is recommended to use the object-oriented plotting interface in matplotlib, where an empty figure and at least one axes object are explicitly created before a plot is added to it. This figure and its axes object are assigned to variable names which are used for plotting. In matplotlib, an axes object refers to what you would often call a subplot colloquially, and it is named "axes" because it consists of an x-axis and a y-axis by default.
fig, ax = plt.subplots()
Calling subplots() returns two objects, the figure and its axes object. Plots can be added to the axes object of the figure using the name we assigned to the returned axes object (ax by convention).
fig, ax = plt.subplots()
ax.plot(x, y)
[<matplotlib.lines.Line2D at 0x7f4dc69e1668>]
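To make the relationship between these objects more concrete, here is a small sketch that inspects what `subplots()` and `plot()` return. The variable name `line` is just an illustrative choice.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# plot() returns a list of Line2D objects; unpack the single line
line, = ax.plot([1, 2, 3, 4], [1, 2, 4, 3])

# The figure keeps a list of its axes, and each axes knows its parent figure
print(fig.axes[0] is ax)
print(ax.figure is fig)

# Because the line is an object, it can still be modified after plotting
line.set_linestyle('dashed')
```

Holding on to `fig` and `ax` is what makes later modifications, such as adding titles or saving the figure, straightforward.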
To create a scatter plot, use scatter() instead of plot().
fig, ax = plt.subplots()
ax.scatter(x, y)
<matplotlib.collections.PathCollection at 0x7f4dc6912f98>
Plots can also be combined in the same axes. The line style and marker color can be changed to make it easier to distinguish the elements in the combined plot.
fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')
[<matplotlib.lines.Line2D at 0x7f4dc68f9d30>]
And plot elements can be resized.
fig, ax = plt.subplots()
ax.scatter(x, y, color='red', s=100)
ax.plot(x, y, linestyle='dashed', linewidth=3)
[<matplotlib.lines.Line2D at 0x7f4dc685afd0>]
It is common to modify the plot after creating it, e.g. adding a title or labeling the axes.
fig, ax = plt.subplots()
ax.scatter(x, y, color='red')
ax.plot(x, y, linestyle='dashed')
ax.set_title('Line and scatter plot')
ax.set_xlabel('Measurement X')
Text(0.5,0,'Measurement X')
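When several properties need to be changed, the axes object also offers a `set()` method that accepts them as keyword arguments. This is a minimal sketch of that alternative; the label 'Measurement Y' is made up for illustration.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 2, 4, 3]

fig, ax = plt.subplots()
ax.plot(x, y, linestyle='dashed')

# Set several properties in one call instead of separate set_* calls
ax.set(title='Line plot', xlabel='Measurement X', ylabel='Measurement Y')
```

The individual `set_title()` and `set_xlabel()` methods remain useful when extra arguments such as a font size are needed.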
The scatter and line plot can easily be separated into two subplots within the same figure. Instead of assigning a single returned axes to ax, the two returned axes objects are assigned to ax1 and ax2 respectively.
fig, (ax1, ax2) = plt.subplots(1, 2)
# The default is (1, 1), that's why it does not need
# to be specified with only one subplot
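For grids with more than one row and more than one column, unpacking into individual names becomes unwieldy. A sketch of one common pattern, assuming a hypothetical 2 x 2 layout, is to keep the returned array of axes and loop over it.

```python
import matplotlib.pyplot as plt

# With several rows and columns, subplots() returns a 2D array of axes
fig, axes = plt.subplots(2, 2, figsize=(6, 6))

# .flat iterates over all axes row by row, regardless of the grid shape
for i, ax in enumerate(axes.flat):
    ax.plot([1, 2, 3, 4], [1, 2, 4, 3])
    ax.set_title(f'Panel {i + 1}')

# Individual axes can still be reached by row and column position
print(axes[1, 1].get_title())
```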
To prevent plot elements, such as the axis tick labels, from overlapping, the tight_layout() method of the figure can be used.
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.tight_layout()
The figure size can easily be controlled when it is created.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4)) # This refers to the size of the figure in inches when printed or in a PDF
fig.tight_layout()
Bringing it all together to separate the line and scatter plot.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 4))
ax1.scatter(x, y, color='red')
ax2.plot(x, y, linestyle='dashed')
ax1.set_title('Scatter plot')
ax2.set_title('Line plot')
fig.tight_layout()
Challenge 1
- There is a plethora of colors available to use in matplotlib. Change the color of the line and the dots in the figure using your favorite color from matplotlib's list of named colors.
- Use the documentation to also change the styling of the line in the line plot and the type of marker used in the scatter plot (you might need to search online for this).
Figures can be saved by calling the savefig() method and specifying the name of the file to create. The resolution of the figure can be controlled by the dpi parameter.
fig.savefig('scatter-and-line.png', dpi=300)
A PDF file can be saved by changing the extension in the specified file name. Since PDF is a vector file format, there is no need to specify the resolution.
fig.savefig('scatter-and-line.pdf')
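As a small sketch of the saving step, the example below writes a figure to a temporary directory and checks that the file was created. The temporary directory, the file name `line.png`, and the use of `bbox_inches='tight'` to trim surrounding whitespace are illustrative choices, not requirements.

```python
import os
import tempfile

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 2, 4, 3])

# Write into a temporary directory so the working directory stays clean
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'line.png')

# bbox_inches='tight' trims excess whitespace around the figure
fig.savefig(png_path, dpi=150, bbox_inches='tight')

print(os.path.exists(png_path))
```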
This concludes the customization section. The concepts taught here will be applied in the next section on how to choose a suitable plot type for data sets with many observations.
If the data frame from the previous lecture is not loaded, read it in first.
import pandas as pd
# world_data = pd.read_csv('../world-data-gapminder.csv')
# If not saved to disk yesterday
world_data = pd.read_csv('https://raw.githubusercontent.com/UofTCoders/2018-09-10-utoronto/gh-pages/data/world-data-gapminder.csv')
fig, ax = plt.subplots()
ax.scatter(x='year', y='population', data=world_data)
<matplotlib.collections.PathCollection at 0x7f4dbe565208>
This graph looks the way it does because one scatter dot has been added for each year for every country. To instead see how the world's total population has changed over the years, the population of all countries needs to be summed for each year. This can be done using the data frame techniques from the previous lecture.
# One could also do `as_index=False` with `groupby()`
world_pop = world_data.groupby('year')['population'].sum().reset_index()
fig, ax = plt.subplots()
ax.scatter(x='year', y='population', data=world_pop)
<matplotlib.collections.PathCollection at 0x7f4dbde57a90>
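The comment in the code above mentions `as_index=False` as an alternative to `reset_index()`. A minimal sketch on hypothetical toy data suggests that the two approaches give the same data frame:

```python
import pandas as pd

# Hypothetical toy data: two countries observed in two years
toy = pd.DataFrame({
    'year': [1800, 1800, 1801, 1801],
    'population': [10, 20, 11, 21],
})

# Option 1: aggregate, then turn the 'year' index back into a column
via_reset = toy.groupby('year')['population'].sum().reset_index()

# Option 2: ask groupby not to move 'year' into the index in the first place
via_as_index = toy.groupby('year', as_index=False)['population'].sum()

print(via_reset.equals(via_as_index))
```

Either way, `year` ends up as a regular column, which is what allows it to be referred to by name in the plotting call.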
This plot shows how the world population has been steadily increasing since the 1800s and dramatically picked up pace in the 1950s.
It is possible to use matplotlib in this way to explore visual relationships in data frames. However, it gets complicated once we want to include more variables, e.g. stratifying the data in subplots based on region and income level in the example above would involve writing double loops and keeping track of plot layout and grouping variables manually.
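To give a feeling for this manual bookkeeping, here is a sketch that splits a hypothetical toy data frame into one subplot per region using only matplotlib. Even with a single grouping variable, the pairing of groups and axes has to be tracked by hand; a second variable would add another loop level.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data standing in for the gapminder data
toy = pd.DataFrame({
    'year': [1800, 1900, 2000] * 2,
    'region': ['Asia'] * 3 + ['Europe'] * 3,
    'population': [10, 20, 40, 5, 8, 9],
})

regions = toy['region'].unique()
fig, axes = plt.subplots(1, len(regions), figsize=(8, 4), sharey=True)

# One subplot per region; groups and axes are matched up manually
for ax, region in zip(axes, regions):
    subset = toy.loc[toy['region'] == region]
    ax.plot(subset['year'], subset['population'])
    ax.set_title(region)

fig.tight_layout()
```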
seaborn
When visually exploring data with lots of variables, it is in many cases easier to think in terms of what is to be explored in the data, rather than what graphical elements are to be added to the plot. For example, instead of instructing the computer to "go through a data frame and plot any observations of country-X in blue, any observations of country-Y in red, etc", it can be easier to just type "color the data by country".
Facilitating semantic mappings of data variables to graphical elements is one of the goals of the seaborn plotting package.
Thanks to its functional way of interfacing with data, only minimal changes are required if the underlying data change or if the type of plot used for the visualization is switched. seaborn provides a language that facilitates thinking about data in ways that are conducive to exploratory data analysis and allows for the creation of publication-quality plots with minimal adjustments and tweaking.
The syntax of plotting with seaborn was already introduced briefly in the introductory lecture, and it is similar to how matplotlib plots data frames. For example, to make the same scatter plot as above:
import seaborn as sns
sns.scatterplot(x='year', y='population', data=world_pop)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db4dfee80>
In addition to providing a data-centric syntax, seaborn also facilitates visualization of common statistical aggregations. For example, when creating a line plot in seaborn, the default is to aggregate and average all observations with the same value on the x-axis, and to create a shaded region representing the 95% confidence interval for these observations.
sns.lineplot(x='year', y='population', data=world_data)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db46d5e48>
In this case, it would be more appropriate to have the shaded area describe the variation in the data, such as the standard deviation, rather than an inference about the reproducibility.
sns.lineplot(x='year', y='population', data=world_data, ci='sd')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4dbde0a668>
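To illustrate what this aggregation amounts to, the sketch below computes the mean and standard deviation for each x value with pandas and draws the band with matplotlib. It uses hypothetical toy values and is only a rough equivalent of the plot above, not seaborn's actual implementation. Newer seaborn releases also offer an `errorbar` parameter (for example `errorbar='sd'`) in place of `ci`; check the documentation for your installed version.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data: three observations per year
toy = pd.DataFrame({
    'year': [1800, 1800, 1800, 1900, 1900, 1900],
    'population': [10, 20, 30, 20, 40, 60],
})

# Mean and standard deviation of all observations sharing the same x value
summary = toy.groupby('year')['population'].agg(['mean', 'std']).reset_index()

fig, ax = plt.subplots()
ax.plot(summary['year'], summary['mean'])

# Shaded band of one standard deviation around the mean
ax.fill_between(
    summary['year'],
    summary['mean'] - summary['std'],
    summary['mean'] + summary['std'],
    alpha=0.3,
)
```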
To change from showing the average population per country per year to showing the total population for all countries per year, the estimator parameter can be used. Here, the shaded area is also removed with ci=None.
# The `estimator` parameter is currently non-functional for sns.scatterplot, but will be added soon
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db4341eb8>
Before continuing the exploration of the world population data, let's discuss how to customize the appearance of our plots. The object returned by lineplot() is a matplotlib axes, so all configuration available through matplotlib can be applied to the returned object by first assigning it to a variable name (ax by convention).
ax = sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
ax.set_title('World population since the 1800s', fontsize=16)
ax.set_xlabel('Year', fontsize=12)
Text(0.5,0,'Year')
In addition to all the customization available through the standard matplotlib syntax, seaborn also offers its own functions for changing the appearance of the plots.
In essence, these functions are shortcuts to facilitate changing many matplotlib parameters. For example, a more effective approach than setting individual font sizes or colors of graphical elements is to set the overall size and style for all graphs.
sns.set(context='talk', style='darkgrid', palette='pastel')
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db2273f60>
These functions are like changing the settings in a graphical program and will apply to all following plots.
Challenge 2
Find out which styles and contexts are available. Try some of them out and choose your favorite style and context. Hint: this information is available through both the built-in and the online documentation.
For the rest of this tutorial, the ticks style will be used.
sns.set(context='notebook', style='ticks', font_scale=1.4)
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db224e630>
# Unfortunately changing the font size reduces the number of ticklabels which in turn
# automatically changes the formatting of the scientific notation to `1e10` for some reason
# I filed an issue for this (https://github.com/matplotlib/matplotlib/issues/12072)
#
# The previous notation `1e9` can be returned by doing
# `ax.ticklabel_format(useOffset=10, axis='y')`, but it looks slightly different
For styles that include the frame around the plot, there is a special seaborn function to remove the top and right borders (again by modifying the underlying matplotlib parameters).
sns.lineplot(x='year', y='population', data=world_data, estimator='sum', ci=None)
sns.despine()
If the style options exposed through seaborn are not sufficient, it is possible to change all plot parameters directly through the matplotlib rc and style interfaces.
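As a brief sketch of the rc interface, the example below changes a few parameters temporarily with `rc_context()`. The particular parameters and values are only illustrative choices.

```python
import matplotlib.pyplot as plt

# Hypothetical settings chosen for illustration
custom_rc = {
    'lines.linewidth': 3,
    'axes.spines.top': False,
    'axes.spines.right': False,
}

# rc_context applies the settings only inside the with block
with plt.rc_context(custom_rc):
    fig, ax = plt.subplots()
    line, = ax.plot([1, 2, 3, 4], [1, 2, 4, 3])

# The line was created with the temporary line width
print(line.get_linewidth())
```

Assigning to `plt.rcParams` directly would instead change the settings for all subsequent plots in the session.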
As mentioned above, the strength of a descriptive plotting syntax is being able to describe the plot appearance in human-friendly vocabulary and have the computer assign variables to graphical objects accordingly. For example, to plot subsets of the data in different colors, the hue parameter can be used.
sns.lineplot(x='year', y='population', hue='income_group',
             data=world_data, ci=None, estimator='sum')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db21721d0>
This separation of the data shows that the population has risen the fastest in middle income countries.
The plot can be made more accessible by changing the style of each line to not only rely on color to separate them.
sns.lineplot(x='year', y='population', hue='income_group', style='income_group',
             data=world_data, ci=None, estimator='sum')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db21bed30>
Just like in the previous lecture, the values of the ordinal variable income_group are not listed in an intuitive order. A custom order can easily be specified by passing a list to the hue_order parameter, but this would have to be done for every plot. A better approach would be to encode the order in the data frame itself, using the top-level pandas function Categorical().
world_data['income_group'] = (
pd.Categorical(world_data['income_group'], ordered=True,
categories=['Low', 'Lower middle', 'Upper middle', 'High'])
)
world_data['income_group'].dtype
CategoricalDtype(categories=['Low', 'Lower middle', 'Upper middle', 'High'], ordered=True)
sns.lineplot(x='year', y='population', hue='income_group', style='income_group',
data=world_data, ci=None, estimator='sum')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db20abe48>
The legend now has the colors in the expected order. This modification also ensures that when making plots with income groups on the x- or y-axis, they will be plotted in the right order.
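The ordering is stored in the column itself, so it also affects pandas operations such as sorting and comparisons. A small sketch on a made-up data frame (the country names are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    'country': ['A', 'B', 'C', 'D'],  # hypothetical countries
    'income_group': ['High', 'Low', 'Upper middle', 'Lower middle'],
})
toy['income_group'] = pd.Categorical(
    toy['income_group'], ordered=True,
    categories=['Low', 'Lower middle', 'Upper middle', 'High'])

# Sorting follows the category order instead of the alphabetical order
sorted_groups = toy.sort_values('income_group')['income_group'].tolist()

# Ordered categories can be compared against one of their levels
above_lower_middle = toy.loc[toy['income_group'] > 'Lower middle', 'country'].tolist()
print(sorted_groups, above_lower_middle)
```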
It is difficult to explore multiple categorical relationships within one single plot. For example, to see how the income groups compare within each region, the hue
and style
variables could be used for different variables, but this makes the plot very difficult to interpret.
sns.lineplot(x='year', y='population', hue='income_group', style='region',
data=world_data, ci=None, estimator='sum')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4db2148978>
An effective approach for exploring multiple categorical variables in a data set is to plot so-called "small multiples" of the data, where the same type of plot is used for different subsets of the data. These plots are drawn in rows and columns forming a grid pattern, and can be referred to as a "lattice", "facet", or "trellis" plot.
Visualizing categorical variables in this manner is a key step in exploratory data analysis, and thus seaborn has a dedicated plot function for this, called relplot() (for "relational plot" since it visualizes the relationships between numerical variables). The syntax to relplot() is very similar to lineplot(), but we need to specify that the kind of plot we want is a line plot.
# Create the same plot as above
sns.relplot(x='year', y='population', hue='income_group', style='income_group', kind='line',
data=world_data, ci=None, estimator='sum')
<seaborn.axisgrid.FacetGrid at 0x7f4db1fe77f0>
The region
variable can now be mapped to different facets/subplots in a grid pattern.
# TODO switch this to some more interesting column if I have time
sns.relplot(x='year', y='population', data=world_data, estimator='sum',
kind='line', hue='income_group', col='region', ci=None)
<seaborn.axisgrid.FacetGrid at 0x7f4db1f4fd30>
It's a little hard to see because the figure is very wide and has been shrunk to fit in the notebook. To avoid this, relplot() can use the col_wrap parameter to distribute the facets over several rows. The height parameter sets the height of each facet, and aspect sets its width relative to the height (width = height * aspect).
sns.relplot(x='year', y='population', data=world_data, estimator='sum',
kind='line', hue='income_group', col='region', ci=None,
col_wrap=3, height=2.5, aspect=1.3)
<seaborn.axisgrid.FacetGrid at 0x7f4db1d1c470>
Facetting the plot by region reveals that the biggest absolute population increase happened among middle income countries in Asia. We will soon take a closer look at which countries these are.
The returned object from relplot() is a grid (a special kind of figure) with many axes, and can therefore not be placed within a preexisting figure. It is saved just as any matplotlib figure with savefig(), but has some special methods for easily changing plot aesthetics on each axes. Remember that names such as fig, ax, and here g, are only by convention, and any variable name could have been used.
g = sns.relplot(x='year', y='population', data=world_data,
kind='line', hue='income_group', col='region', ci=None,
col_wrap=3, height=2.5, aspect=1.3)
g.set_titles('{col_name}', y=0.95)
g.set_axis_labels(y_var='Population', x_var='Year')
g.savefig('grid-figure.png')
Finally, we might want to keep the color per income group, but draw one line per country. For this we can set units='country' and estimator=None (so don't aggregate, just draw one line per country with the raw values).
sns.relplot(x='year', y='population', data=world_data, estimator=None, units='country',
kind='line', hue='income_group', col='region', ci=None,
col_wrap=3, height=2.5, aspect=1.3)
<seaborn.axisgrid.FacetGrid at 0x7f4db19822b0>
Two countries in Asia stand out in terms of total population. To find out which these are, we can filter the data.
world_data.loc[world_data['year'] == 2018].nlargest(8, 'population')
country | year | population | region | sub_region | income_group | life_expectancy | income | children_per_woman | child_mortality | pop_density | co2_per_capita | years_in_school_men | years_in_school_women | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
7226 | China | 2018 | 1420000000 | Asia | Eastern Asia | Upper middle | 76.9 | 16000 | 1.64 | 9.95 | 151.0 | NaN | NaN | NaN |
15767 | India | 2018 | 1350000000 | Asia | Southern Asia | Lower middle | 69.1 | 6890 | 2.28 | 41.10 | 455.0 | NaN | NaN | NaN |
37229 | United States | 2018 | 327000000 | Americas | Northern America | High | 79.1 | 54900 | 1.90 | 6.06 | 35.7 | NaN | NaN | NaN |
15986 | Indonesia | 2018 | 267000000 | Asia | South-eastern Asia | Lower middle | 72.0 | 11700 | 2.31 | 25.00 | 147.0 | NaN | NaN | NaN |
5036 | Brazil | 2018 | 211000000 | Americas | Latin America and the Caribbean | Upper middle | 75.7 | 14300 | 1.70 | 14.20 | 25.2 | NaN | NaN | NaN |
26498 | Pakistan | 2018 | 201000000 | Asia | Southern Asia | Lower middle | 68.0 | 5220 | 3.35 | 76.80 | 260.0 | NaN | NaN | NaN |
25622 | Nigeria | 2018 | 196000000 | Africa | Sub-Saharan Africa | Lower middle | 66.1 | 5570 | 5.39 | 97.90 | 215.0 | NaN | NaN | NaN |
2846 | Bangladesh | 2018 | 166000000 | Asia | Southern Asia | Lower middle | 73.4 | 3720 | 2.05 | 32.00 | 1280.0 | NaN | NaN | NaN |
Challenge 3
- To find out the total amount of CO2 released into the atmosphere, use the co2_per_capita and population columns to create a new column: co2_total.
- Plot the total CO2 for the world and for each region.
- Create a facetted plot comparing total CO2 levels across income groups and regions.
world_data = world_data.rename(columns={'co2_per_capita': 'co2_per_capita'})
# Challenge 3 solutions
# 1.
world_data['co2_total'] = world_data['co2_per_capita'] * world_data['population']
# 2.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum')
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum', hue='region')
# 3.
sns.relplot(x='year', y='co2_total', data=world_data, kind='line', ci=None, estimator='sum',
hue='income_group', col='region', col_wrap=3, height=4)
# Discuss what these plots tell us:
# The world's total CO2 emissions are rapidly increasing. Europe and the Americas have been the highest emitters for
# many years, but have recently been overtaken by Asia, which is now producing around twice the amount of CO2 compared
# to Europe and America. But don't forget that we saw in the last lecture that the population in Asia is 5-6 times bigger
# than in Europe and America!
# It's important to look at the total production from a country because change within that single country has big
# potential of reaching many people. Not plotted here, but also important, is to explore which countries are high in CO2 per capita
# since these might have more room to reduce the production. Of course, reality is more complicated. Some countries
# might import goods that demand high CO2 production in their manufacturing country instead of producing them themselves,
# so they might "sponsor" the production in another country, but would not show up high in this list.
<seaborn.axisgrid.FacetGrid at 0x7f4dab605748>
To continue exploring the CO2 emissions we started to look at in the last challenge, let's use the other type of plot for comparing quantitative variables: scatterplot()
. This is the default in the relplot()
function.
As mentioned in the discussion above, in addition to considering the total amount of CO2 produced per country, it is important to explore the CO2 produced per citizen.
sns.relplot(x='co2_total', y='co2_per_capita', data=world_data)
<seaborn.axisgrid.FacetGrid at 0x7f4db14f2dd8>
This looks funky, and not quite as expected... The reason is that we have plotted multiple data points per country, one for each year. This can be confusing since we don't know which dot is for which year, and this plot is probably not what we want. We can filter the data to focus on a specific year. Unfortunately, there are no CO2 measurements available for the last few years. To find out in which years there are countries with CO2 measurements, drop the NAs in co2_per_capita and look at the min and max values.
world_data.dropna(subset=['co2_per_capita'])['year'].agg(['min', 'max'])
min    1800
max    2014
Name: year, dtype: int64
Now subset the data for the latest available year with CO2 measurements, 2014.
world_data_2014 = world_data.loc[world_data['year'] == 2014]
sns.relplot(x='co2_total', y='income', data=world_data_2014)
<seaborn.axisgrid.FacetGrid at 0x7f4db177efd0>
Here we can see that a few countries in the world have significantly higher total emissions than the rest, and that one country is rather high in both measurements.
Just as before it is possible to map plot semantics and facet the plot according to variables in the data set. scatterplot()
can also scale the dot size according to a variable in the data set.
# `sizes` controls the dots min and max size
sns.relplot(x='co2_total', y='co2_per_capita', hue='income_group', size='population',
data=world_data_2014, sizes=(40, 400))
<seaborn.axisgrid.FacetGrid at 0x7f4db0409f98>
Unsurprisingly, some of the countries with high total CO2 emissions are also the most populous countries. The trends between different regions can now be easily compared by facetting the data by region.
sns.relplot(x='co2_total', y='co2_per_capita', hue='income_group', size='population',
data=world_data_2014, sizes=(40, 400), col='region', col_wrap=3, height=4)
<seaborn.axisgrid.FacetGrid at 0x7f4db03f4f98>
Already here we can get a pretty good idea of which some of these countries are. The high emission middle income countries in Asia are likely China and India, while the American country high in both total emissions and emissions per capita must be the USA. However, there are some curious dots, such as the countries with high co2_per_capita in Asia and the Americas, that are harder to identify.
Challenge
Let's use some of the aggregation methods from yesterday to complement the plots we have just made.
- Find out which are the 10 countries with the highest CO2 emissions per capita.
- Find out which are the 10 countries with the highest total CO2 emissions.
- Which 10 countries have produced the most CO2 in total since the 1800s?
# 1.
world_data_2014.nlargest(10, 'co2_per_capita')
country | year | population | region | sub_region | income_group | life_expectancy | income | children_per_woman | child_mortality | pop_density | co2_per_capita | years_in_school_men | years_in_school_women | co2_total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
28465 | Qatar | 2014 | 2370000 | Asia | Western Asia | High | 80.5 | 121000 | 1.95 | 8.6 | 205.00 | 45.4 | 8.70 | 10.70 | 1.075980e+08 |
35473 | Trinidad and Tobago | 2014 | 1350000 | Americas | Latin America and the Caribbean | High | 73.0 | 31600 | 1.78 | 19.8 | 264.00 | 34.2 | 12.30 | 13.10 | 4.617000e+07 |
18610 | Kuwait | 2014 | 3780000 | Asia | Western Asia | High | 79.9 | 70800 | 2.01 | 9.1 | 212.00 | 25.2 | 11.90 | 12.30 | 9.525600e+07 |
2623 | Bahrain | 2014 | 1340000 | Asia | Western Asia | High | 76.7 | 44400 | 2.07 | 7.8 | 1760.00 | 23.4 | 9.66 | 10.40 | 3.135600e+07 |
36787 | United Arab Emirates | 2014 | 9070000 | Asia | Western Asia | High | 76.4 | 64100 | 1.78 | 7.9 | 109.00 | 23.3 | 12.30 | 13.00 | 2.113310e+08 |
29560 | Saudi Arabia | 2014 | 30800000 | Asia | Western Asia | High | 76.6 | 50000 | 2.64 | 13.8 | 14.30 | 19.5 | 12.30 | 9.44 | 6.006000e+08 |
20581 | Luxembourg | 2014 | 556000 | Europe | Western Europe | High | 81.9 | 93800 | 1.56 | 2.5 | 215.00 | 17.4 | 13.60 | 13.90 | 9.674400e+06 |
37225 | United States | 2014 | 318000000 | Americas | Northern America | High | 78.9 | 51800 | 1.95 | 6.8 | 34.70 | 16.5 | 14.50 | 14.90 | 5.247000e+09 |
1747 | Australia | 2014 | 23500000 | Oceania | Australia and New Zealand | High | 82.6 | 43400 | 1.87 | 4.0 | 3.06 | 15.4 | 14.00 | 14.40 | 3.619000e+08 |
26275 | Oman | 2014 | 3960000 | Asia | Western Asia | High | 77.2 | 40300 | 2.80 | 11.0 | 12.80 | 15.4 | 9.53 | 8.03 | 6.098400e+07 |
# 2.
world_data_2014.nlargest(10, 'co2_total')
# 3.
world_data.groupby('country')['co2_total'].sum().nlargest(10)
country
United States     3.760390e+11
China             1.747358e+11
Russia            1.085341e+11
Germany           8.609852e+10
United Kingdom    7.443773e+10
Japan             5.752009e+10
India             4.174063e+10
France            3.555696e+10
Canada            2.947731e+10
Ukraine           2.938304e+10
Name: co2_total, dtype: float64
In addition to what we have just seen, an interesting aspect to explore is how this relationship between per capita and total CO2 emissions has changed over time for different income groups. As we have seen before, this can be explored in a line graph, but an alternative approach which allows us to see the spread at each point in time is to subset certain years from the data and create a facet for each year:
world_data_1920_2018 = world_data.loc[world_data['year'].isin([1920, 1940, 1960, 1980, 2000, 2014])]
sns.relplot(x='co2_total', y='co2_per_capita', col='year', hue='income_group',
data=world_data_1920_2018, col_wrap=3, height=3.5)
<seaborn.axisgrid.FacetGrid at 0x7f4db01d3978>
In the exercises above, we chose suitable variables to illustrate the plotting concepts. Often when doing EDA, it will not be as easy to know what comparison to start with. Unless you have good reason for choosing to look at a particular relationship, starting by plotting the pairwise relationships of all quantitative variables can be helpful.
# Use 2014 data since we know CO2 has observations for that year
# This might take some time
sns.pairplot(world_data_2014)
<seaborn.axisgrid.PairGrid at 0x7f4db0048c50>
The year column is not that insightful since there is only one year in the data. Removing that column gives more space for the rest of the plots.
# TODO suggest add a 'mirror' keyword to pairplot
sns.pairplot(world_data_2014.drop(columns='year'))
<seaborn.axisgrid.PairGrid at 0x7f4da8aecb00>
Each plot on the diagonal shows the distribution of a single variable in a histogram. The plots below the diagonal show the relationship between two numerical variables in a scatter plot. The plots above the diagonal are mirror images of those below the diagonal.
Plotting all pairwise relationships like this gives a great overview of what to look into next. For example, the relationships we explored above between child mortality and children per woman or those between co2_per_capita and co2_total can also be seen here, as can other unexplored relationships. It is possible to quantify the strength of these relationships by checking the Pearson correlation coefficients between columns.
world_data_2014.drop(columns='year').corr()
population | life_expectancy | income | children_per_woman | child_mortality | pop_density | co2_per_capita | years_in_school_men | years_in_school_women | co2_total | |
---|---|---|---|---|---|---|---|---|---|---|
population | 1.000000 | 0.020899 | -0.039127 | -0.075136 | -0.012679 | 0.010329 | 0.009876 | -0.012609 | -0.055508 | 0.810722 |
life_expectancy | 0.020899 | 1.000000 | 0.656187 | -0.799298 | -0.874404 | 0.177470 | 0.466554 | 0.726919 | 0.732383 | 0.117341 |
income | -0.039127 | 0.656187 | 1.000000 | -0.530189 | -0.550647 | 0.277383 | 0.807494 | 0.581746 | 0.582572 | 0.097359 |
children_per_woman | -0.075136 | -0.799298 | -0.530189 | 1.000000 | 0.876623 | -0.144019 | -0.430218 | -0.751975 | -0.784130 | -0.148606 |
child_mortality | -0.012679 | -0.874404 | -0.550647 | 0.876623 | 1.000000 | -0.126336 | -0.442394 | -0.789018 | -0.818036 | -0.122293 |
pop_density | 0.010329 | 0.177470 | 0.277383 | -0.144019 | -0.126336 | 1.000000 | 0.120080 | 0.084184 | 0.080018 | -0.010954 |
co2_per_capita | 0.009876 | 0.466554 | 0.807494 | -0.430218 | -0.442394 | 0.120080 | 1.000000 | 0.441900 | 0.454274 | 0.159584 |
years_in_school_men | -0.012609 | 0.726919 | 0.581746 | -0.751975 | -0.789018 | 0.084184 | 0.441900 | 1.000000 | 0.964648 | 0.122927 |
years_in_school_women | -0.055508 | 0.732383 | 0.582572 | -0.784130 | -0.818036 | 0.080018 | 0.454274 | 0.964648 | 1.000000 | 0.088188 |
co2_total | 0.810722 | 0.117341 | 0.097359 | -0.148606 | -0.122293 | -0.010954 | 0.159584 | 0.122927 | 0.088188 | 1.000000 |
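To see what each number in this matrix means, the Pearson coefficient can be computed by hand for one pair of columns and compared with the pandas result. The sketch below uses made-up values, not the gapminder data. Note also that in recent pandas versions, calling corr() on a data frame that contains text columns may require passing numeric_only=True.

```python
import numpy as np
import pandas as pd

# Made-up values for five hypothetical countries
toy = pd.DataFrame({
    'income': [1000, 2000, 4000, 8000, 16000],
    'life_expectancy': [50, 58, 65, 72, 80],
})
x = toy['income']
y = toy['life_expectancy']

# Covariation of the two columns divided by the product of their spreads
dx = x - x.mean()
dy = y - y.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same coefficient as computed by pandas
r_pandas = x.corr(y)
print(r_manual, r_pandas)
```

The coefficient is always between -1 and 1, where values close to either end indicate a strong linear relationship and values close to 0 indicate a weak one.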
With so many columns, it is slow to process all the information as numbers. A higher bandwidth operation is to let our brain interpret colors for the strength of the relationships through a heatmap.
# This and the correlation above might be moved to lecture 4
sns.heatmap(world_data_2014.drop(columns='year').corr())
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d990a2be0>
The heatmap can be made more informative by changing to a diverging colormap, which is generally recommended when there is a natural midpoint (such as 0 in our case). Optionally, the heatmap can be annotated with the correlation coefficients.
# This and the correlation above might be moved to lecture 4
fig, ax = plt.subplots(figsize=(10, 6))
sns.heatmap(world_data_2014.drop(columns='year').corr(), annot=True, ax=ax, cmap='coolwarm')
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d98271ef0>
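A diverging colormap is most useful when its neutral middle color lines up with the natural midpoint of the data. The following sketch uses plain matplotlib to show how a normalization from -1 to 1 places a correlation of 0 at the center of 'coolwarm'. In the heatmap() call, the corresponding settings would be vmin=-1 and vmax=1 (or center=0), which is an optional tweak and not part of the code above.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

cmap = plt.get_cmap('coolwarm')
norm = Normalize(vmin=-1, vmax=1)  # correlations range from -1 to 1

# Map three correlation values to their position in the colormap and to a color
for r in (-1.0, 0.0, 1.0):
    position = float(norm(r))
    red, green, blue, alpha = cmap(position)
    print(r, position, round(red, 2), round(green, 2), round(blue, 2))

midpoint_position = float(norm(0.0))
low_color = cmap(0.0)
mid_color = cmap(midpoint_position)
high_color = cmap(1.0)
```

Negative correlations end up in the blue half, positive correlations in the red half, and values around 0 get the neutral color in between.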
There are more formal ways of interrogating the relationships between different variables, such as regression, which are outside the scope of this lecture. However, the pairwise scatter plot and correlation coefficient matrix are quick ways to get an informative overview of how the data frame columns relate to each other.
Let's zoom in on the relationship between income and life expectancy which appears to be quite strong.
# Make this a challenge where they learn how to find things on stackoverflow
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014)
This relationship appears to be log linear and can be visualized with the x-axis set to log-scale.
Challenge
- Find out how to change the x-axis to be log-scaled. Search online for how to change the scale of a matplotlib axes object. Remember that seaborn plots return matplotlib axes objects, so all matplotlib functions that modify the axes will work on this plot. Good sites to use are the documentation pages for the respective package, and stackoverflow. However, it is often fastest to type a well-chosen query into your favorite search engine.
- In the logged plot, color the dots according to the region of the observation.
# Challenge solutions
# 1.
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014)
ax.set_xscale('log')
# Challenge solutions
# 2.
ax = sns.scatterplot(x='income', y='life_expectancy', data=world_data_2014, hue='region')
ax.set_xscale('log')
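A numerical complement to the log-scaled axis is to compare the correlation of life expectancy with income against its correlation with the logarithm of income. This sketch uses made-up data constructed to be exactly log-linear, so it only illustrates the idea and says nothing about the real gapminder values.

```python
import numpy as np
import pandas as pd

# Made-up incomes that double at every step
income = np.array([500, 1000, 2000, 4000, 8000, 16000, 32000, 64000], dtype=float)
toy = pd.DataFrame({
    'income': income,
    # Constructed so that life expectancy is linear in log10(income)
    'life_expectancy': 30 + 10 * np.log10(income),
})

r_raw = toy['income'].corr(toy['life_expectancy'])
r_log = np.log10(toy['income']).corr(toy['life_expectancy'])
print(r_raw, r_log)
```

For a log-linear relationship, the correlation with the log-transformed variable is noticeably stronger than with the raw variable, which is the same pattern that straightens out in the plot when the x-axis is log-scaled.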
Another interesting relationship we could see from the pairplot was how child mortality is related to how many children are born per woman. A common misconception is that saving poor children will lead to overpopulation. However, using the same approach as for the CO2 data, we can section out years of the data and look at how this relationship has changed over time.
world_data_1920_2018 = world_data.loc[world_data['year'].isin([1920, 1940, 1960, 1980, 2000, 2018])]
sns.relplot(x='children_per_woman', y='child_mortality', col='year', hue='income_group',
data=world_data_1920_2018, col_wrap=3, height=3.5)
<seaborn.axisgrid.FacetGrid at 0x7f4d980e26a0>
Now it is clearer to see what is going on. Reducing child mortality is correlated with smaller family sizes. As more children survive, parents can feel more secure with a smaller family size. Ending poverty is also related to these variables, since most high income countries are found in the lower left corner of the plots (remember that the income group is classified based on income in 2018).
It is important to note that from a plot like this, it is not possible to tell causation, just correlation. However, in the gapminder video library there are a few videos on this topic, including this and this discussing how reducing poverty can help slow down population growth through decreased family sizes. Current estimates suggest that the world population will stabilize around 11 billion people and that the number of children per woman will be close to 2 worldwide in the year 2100.
When exploring a single quantitative variable, we can choose between plotting every data point (e.g. categorical scatterplots such as swarm plots and strip plots), an approximation of the distribution (e.g. histograms and violinplots), or several distribution statistics including measures of central tendency (e.g. boxplots and barplots).
A good place to start is to visualize the variable's distribution with distplot()
.
# Let's look at life expectancy during 2018
world_data_2018 = world_data.loc[world_data['year'] == 2018]
sns.distplot(world_data_2018['life_expectancy'])
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d93c7c630>
The line is a KDE (kernel density estimate) plot, as seen previously in the pairplot. This can be thought of as a smoothed histogram.
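To give an intuition for what a KDE is, the sketch below builds one by hand with numpy: a small Gaussian bump is placed on every observation and the bumps are averaged. The helper function and the data are made up for illustration, and the bandwidth here is the standard deviation of each bump, which may not map one-to-one to the bandwidth setting of a plotting function.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centered on each sample (hypothetical helper)."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    bumps = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return bumps.mean(axis=1)

# Made-up life expectancies, not taken from world_data
samples = np.array([52.0, 55.0, 61.0, 70.0, 72.0, 75.0, 80.0, 82.0])
grid = np.linspace(30, 100, 701)

narrow = gaussian_kde_1d(samples, grid, bandwidth=1.0)  # wiggly, follows each point
wide = gaussian_kde_1d(samples, grid, bandwidth=5.0)    # smooth, merges nearby points

# Both curves are densities, so the area under each is close to 1
step = grid[1] - grid[0]
area_narrow = narrow.sum() * step
area_wide = wide.sum() * step
print(area_narrow, area_wide)
```

A small bandwidth gives a curve with a separate peak for almost every observation, while a large bandwidth gives a lower and smoother curve that can hide real structure in the data.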
distplot()
can be customized to increase the number of bins and the bandwidth of the kernel. These are both calculated according to heuristics for what should be good numbers for the underlying data, but it is good to know how to control them.
sns.distplot(world_data_2018['life_expectancy'], bins=30, rug=True,
kde_kws={'bw':1, 'color':'black'})
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d98067e10>
The rug plot shows exactly where each data point is along the x-axis. To compare distributions between values of a categorical variable, violin plots are often used. These consist of two KDEs mirrored across the midline.
sns.violinplot(x='life_expectancy', y='income_group', data=world_data_2018)
<matplotlib.axes._subplots.AxesSubplot at 0x7f4d98f25828>
Since income_group was defined as an ordered categorical variable previously, this order is preserved when distributing the income groups along the y-axis.
There is notable variation in life expectancy between income groups: people in wealthier countries live longer. This variation contributes to the multimodality seen in the first distribution plot of the life expectancy for all countries in the world. However, there is also large overlap between income groups and variation within the groups, so more variables than just income affect life expectancy.
Dissecting multimodal distributions in this manner and trying to find underlying variables that explain why a distribution appears to consist of many small distributions (multimodal) is common practice in EDA. It looks like some income groups, e.g. "High", still consist of multimodal distributions. To explore these further, facetting can be used just as previously. The categorical equivalent of relplot is catplot (categorical plot).
sns.catplot(x='life_expectancy', y='income_group', data=world_data_2018)
<seaborn.axisgrid.FacetGrid at 0x7f4d93fcb4a8>
The default is a stripplot, a type of categorical scatterplot where the dots are randomly jittered to reduce overlap. This is very fast, but it is sometimes hard to see how many dots are in a group due to overlap of the graphical elements. A more ordered approach is to create another type of categorical scatterplot, a so-called swarmplot, where the dots are positioned so that they are guaranteed not to overlap.
sns.catplot(x='life_expectancy', y='income_group', data=world_data_2018, kind='swarm')
<seaborn.axisgrid.FacetGrid at 0x7f4d980ba668>
The swarm plot gives us a better sense of the overall appearance of the distribution than the stripplot, and we can see the same bimodality in the high income group as seen in the violinplot, but which was hard to see in the stripplot.
Now it is clear where the most points are. A drawback is that this method can be slow for large datasets. For really large datasets, even stripplot is slow and it is necessary to approximate the distributions with a violinplot instead of showing each observation. Or show some distribution statistics instead, such as with a boxplot (more on that later).
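As a small preview of the distribution statistics that a boxplot draws, the quartiles per group can be computed directly with pandas. The numbers below are made up and only meant to show the mechanics.

```python
import pandas as pd

# Made-up life expectancies for two income groups
toy = pd.DataFrame({
    'income_group': ['Low'] * 3 + ['High'] * 4,
    'life_expectancy': [55.0, 60.0, 65.0, 78.0, 80.0, 82.0, 84.0],
})

# describe() returns count, mean, std, min, quartiles and max per group
summary = toy.groupby('income_group')['life_expectancy'].describe()

# The box in a boxplot spans the 25% to 75% quartiles, with a line at the median (50%)
quartiles = summary[['25%', '50%', '75%']]
print(quartiles)
```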
Now we can facet the plot by income group to find out that differences between regions are often related to income level.
# Will update this to look prettier
sns.catplot(x='life_expectancy', y='region', data=world_data_2014, kind='box',
col='income_group', col_wrap=2)
<seaborn.axisgrid.FacetGrid at 0x7f4d93fcb6d8>
The variable levels are automatically ordered and it is easy to see how life expectancy generally grows with higher average income.
One can see that the income group might be more indicative of life expectancy than the region.
With the powerful grid plots it is now easy to see how the life expectancy distribution has changed over time. In contrast to a line plot with the average change over time, we can here see how the distribution itself changes, not just the average. While countries in general have increased their life expectancy, differences can be seen in how they have done it: Europe and the Americas have gone from having countries with both high and low life expectancy to tighter distributions where all countries have high life expectancy, while Africa has transitioned from most countries having low life expectancy through a period of very diverse life lengths depending on country.
# If both columns can be interpreted as numerical, the orient keyword can be added to be explicit
sns.catplot(x='life_expectancy', y='year', orient='horizontal', data=world_data_1920_2018, kind='violin',
col='region', col_wrap=3, color='lightgrey')
<seaborn.axisgrid.FacetGrid at 0x7f4d93ac6080>
Let's see how much of the variation during the transition in African life expectancy can be explained by geographically close regions performing differently. First, how many sub-regions are there in each region?
world_data_1920_2018.groupby('region')['sub_region'].nunique()
region
Africa      2
Americas    2
Asia        5
Europe      4
Oceania     4
Name: sub_region, dtype: int64
There are two in Africa. Which are they?
world_data_1920_2018.groupby('region')['sub_region'].unique()
region
Africa      [Northern Africa, Sub-Saharan Africa]
Americas    [Latin America and the Caribbean, Northern Ame...
Asia        [Southern Asia, Western Asia, South-eastern As...
Europe      [Southern Europe, Western Europe, Eastern Euro...
Oceania     [Australia and New Zealand, Melanesia, Microne...
Name: sub_region, dtype: object
Let's see if Sub-Saharan and Northern Africa have developed differently when it comes to life expectancy.
# The split parameter saves some space and looks slick
africa = world_data_1920_2018.loc[world_data_1920_2018['region'] == 'Africa']
sns.catplot(x='life_expectancy', y='year', orient='horizontal', data=africa, kind='violin',
hue='sub_region', palette='pastel', split=True)
<seaborn.axisgrid.FacetGrid at 0x7f4d93955f28>
world_data.dropna(subset=['years_in_school_women'])['year'].agg(['min', 'max'])
min    1970
max    2015
Name: year, dtype: int64
Challenge
- Subset the data frame for the years 1975, 1995, and 2015.
- Make a new column with the ratio of women's to men's years in school.
- Plot how this ratio has changed over time for each region and for each income group.
# Challenge solutions
# 1.
world_data_1970_2015 = world_data.loc[world_data['year'].isin([1975, 1995, 2015])].copy()
# 2.
world_data_1970_2015['women_men_school_ratio'] = world_data_1970_2015['years_in_school_women'] / world_data_1970_2015['years_in_school_men']
# world_data_1970_2015['women_men_school_ratio']
# 3a.
sns.catplot(y='women_men_school_ratio', x='year', data=world_data_1970_2015, hue='region', dodge=True, kind='point')
<seaborn.axisgrid.FacetGrid at 0x7f4d980d9f60>
# 3b.
sns.catplot(y='women_men_school_ratio', x='year', data=world_data_1970_2015, hue='income_group', dodge=True, kind='point')
<seaborn.axisgrid.FacetGrid at 0x7f4d93878710>
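By default, each dot in a point plot is the mean of the observations in that group. The values behind the dots can therefore be checked with a groupby. The sketch below does this on a made-up data frame with a single hypothetical region 'X'.

```python
import pandas as pd

# Made-up schooling data for two hypothetical countries in region 'X'
toy = pd.DataFrame({
    'year': [1975, 1975, 2015, 2015],
    'region': ['X', 'X', 'X', 'X'],
    'years_in_school_women': [2.0, 3.0, 8.0, 9.0],
    'years_in_school_men': [4.0, 4.0, 8.0, 10.0],
})
toy['women_men_school_ratio'] = toy['years_in_school_women'] / toy['years_in_school_men']

# Mean ratio per year and region, corresponding to the dots in the point plot
means = toy.groupby(['year', 'region'])['women_men_school_ratio'].mean()
print(means)
```

A ratio below 1 means that women on average spend fewer years in school than men, and a ratio of 1 means parity.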