In this project, we'll work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like:
In this guided project, we'll use SQL in Jupyter Notebook to explore and analyze data from this database, named factbook.db
%%capture
%load_ext sql
%sql sqlite:///factbook.db
'Connected: None@factbook.db'
Query database to get list of tables in it
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
We observe there are two tables in the factbook database, namely sqlite_sequence
, facts
Next we write a query to return first 5 rows of the facts table
%%sql
SELECT *
FROM facts
LIMIT 5;
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
Here are the descriptions for some of the columns:
Let's start by calculating some summary statistics and look for any outlier countries. We use aggregate functions like MAX, MIN to find e.g countrie(s) with the highest/least population.
%%sql
--Minimum/Maximum Population and Population_growth
SELECT MIN(population) as min_population,
MAX(population) as max_population,
MIN(population_growth) as min_population_growth,
MAX(population_growth) as max_population_growth
FROM facts;
Done.
min_population | max_population | min_population_growth | max_population_growth |
---|---|---|---|
0 | 7256490011 | 0.0 | 4.02 |
A few things stick out:
Let's use subqueries to zoom in on just these countries without using the specific values:
%%sql
--Least Populated Country
SELECT *
FROM facts
WHERE population = (SELECT MIN(population)
FROM facts);
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
250 | ay | Antarctica | None | 280000 | None | 0 | None | None | None | None |
%%sql
--Most Populated Country
SELECT *
FROM facts
WHERE population == (SELECT MAX(population)
FROM facts);
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
261 | xx | World | None | None | None | 7256490011 | 1.08 | 18.6 | 7.8 | None |
It seems like the table contains a row for the whole world, which explains the population of over 7.2 billion. It also seems like the table contains a row for Antarctica, which explains the population of 0. This seems to match the CIA Factbook page for Antarctica:
Now that we know this, we should recalculate the summary statistics we calculated earlier, while excluding the row for the whole world.
%%sql
--Minimum/Maximum Population and Population_growth
SELECT MIN(population) as min_population,
MAX(population) as max_population,
MIN(population_growth) as min_population_growth,
MAX(population_growth) as max_population_growth
FROM facts
WHERE name != 'World';
Done.
min_population | max_population | min_population_growth | max_population_growth |
---|---|---|---|
0 | 1367485388 | 0.0 | 4.02 |
%%sql
--Most Populated Country
SELECT *
FROM facts
WHERE population == (SELECT MAX(population)
FROM facts
WHERE name != 'World');
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
37 | ch | China | 9596960 | 9326410 | 270550 | 1367485388 | 0.45 | 12.49 | 7.53 | 0.44 |
Now, we find countries that are densely populated. To do this, we'll identify countries that have:
%%sql
--Average Population and Area
SELECT AVG(population) as avg_population,
AVG(area) as avg_area
FROM facts;
Done.
avg_population | avg_area |
---|---|
62094928.32231405 | 555093.546184739 |
%%sql
SELECT *
FROM facts
WHERE population > (SELECT AVG(population) as avg_population
FROM facts)
AND area < (SELECT AVG(area) as avg_area
FROM facts) ;
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
14 | bg | Bangladesh | 148460 | 130170 | 18290 | 168957745 | 1.6 | 21.14 | 5.61 | 0.46 |
65 | gm | Germany | 357022 | 348672 | 8350 | 80854408 | 0.17 | 8.47 | 11.42 | 1.24 |
85 | ja | Japan | 377915 | 364485 | 13430 | 126919659 | 0.16 | 7.93 | 9.51 | 0.0 |
138 | rp | Philippines | 300000 | 298170 | 1830 | 100998376 | 1.61 | 24.27 | 6.11 | 2.09 |
173 | th | Thailand | 513120 | 510890 | 2230 | 67976405 | 0.34 | 11.19 | 7.8 | 0.0 |
185 | uk | United Kingdom | 243610 | 241930 | 1680 | 64088222 | 0.54 | 12.17 | 9.35 | 2.54 |
192 | vm | Vietnam | 331210 | 310070 | 21140 | 94348835 | 0.97 | 15.96 | 5.93 | 0.3 |
%%sql
--Most densly populated countries
SELECT name, population, area,
CAST(population as float)/area *100 as population_density
FROM facts
ORDER BY population_density DESC
LIMIT 13;
Done.
name | population | area | population_density |
---|---|---|---|
Macau | 592731 | 28 | 2116896.4285714286 |
Monaco | 30535 | 2 | 1526750.0 |
Singapore | 5674472 | 697 | 814127.9770444763 |
Hong Kong | 7141106 | 1108 | 644504.1516245487 |
Gaza Strip | 1869055 | 360 | 519181.94444444444 |
Gibraltar | 29258 | 6 | 487633.3333333333 |
Bahrain | 1346613 | 760 | 177185.92105263157 |
Maldives | 393253 | 298 | 131964.09395973154 |
Malta | 413965 | 316 | 131001.58227848102 |
Bermuda | 70196 | 54 | 129992.59259259258 |
Sint Maarten | 39689 | 34 | 116732.35294117646 |
Bangladesh | 168957745 | 148460 | 113806.91432035565 |
Guernsey | 66080 | 78 | 84717.94871794872 |
$population\_density = \frac{population}{area} * 100$
Macau, Monaco, Singapore are the top three most densly populated country in the world.
%%sql
--Fastest Growing Country
SELECT *
FROM facts
WHERE population_growth == (SELECT MAX(population_growth)
FROM facts
WHERE name != 'World');
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
162 | od | South Sudan | 644329 | None | None | 12042910 | 4.02 | 36.91 | 8.18 | 11.47 |
%%sql
--Fastest Growing Countries
SELECT name, population,
population_growth
FROM facts
ORDER BY population_growth DESC
LIMIT 10;
Done.
name | population | population_growth |
---|---|---|
South Sudan | 12042910 | 4.02 |
Malawi | 17964697 | 3.32 |
Burundi | 10742276 | 3.28 |
Niger | 18045729 | 3.25 |
Uganda | 37101745 | 3.24 |
Qatar | 2194817 | 3.07 |
Burkina Faso | 18931686 | 3.03 |
Mali | 16955536 | 2.98 |
Cook Islands | 9838 | 2.95 |
Iraq | 37056169 | 2.93 |
%%sql
--Countries adding the most people next year, 2016
SELECT name, population, population_growth,
CAST(population_growth * population as Float)/100 as population_added
FROM facts
WHERE name != 'World'
ORDER BY population_added DESC
LIMIT 14;
Done.
name | population | population_growth | population_added |
---|---|---|---|
India | 1251695584 | 1.22 | 15270686.1248 |
China | 1367485388 | 0.45 | 6153684.246 |
Nigeria | 181562056 | 2.45 | 4448270.372 |
Pakistan | 199085847 | 1.46 | 2906653.3662 |
Ethiopia | 99465819 | 2.89 | 2874562.1691 |
Bangladesh | 168957745 | 1.6 | 2703323.92 |
United States | 321368864 | 0.78 | 2506677.1392 |
Indonesia | 255993674 | 0.92 | 2355141.8008000003 |
Congo, Democratic Republic of the | 79375136 | 2.45 | 1944690.8320000002 |
Philippines | 100998376 | 1.61 | 1626073.8536 |
Egypt | 88487396 | 1.79 | 1583924.3884 |
Brazil | 204259812 | 0.77 | 1572800.5524000002 |
Mexico | 121736809 | 1.18 | 1436494.3462 |
Tanzania | 51045882 | 2.79 | 1424180.1078 |
$population\_growth = \frac{population\_added}{population} * 100$
India is set to add about 15 million people next year - the most for any nation. With about 6 million and 4.5 million people set to join China and Nigeria respectively, next year.
%%sql
--Countries that have more water than land
SELECT name, area, area_land, area_water,
CAST(area_water as float)/area *100 as water_land_ratio
FROM facts
ORDER BY water_land_ratio DESC
LIMIT 12;
Done.
name | area | area_land | area_water | water_land_ratio |
---|---|---|---|---|
British Indian Ocean Territory | 54400 | 60 | 54340 | 99.88970588235294 |
Virgin Islands | 1910 | 346 | 1564 | 81.8848167539267 |
Puerto Rico | 13791 | 8870 | 4921 | 35.6826916104706 |
Bahamas, The | 13880 | 10010 | 3870 | 27.881844380403457 |
Guinea-Bissau | 36125 | 28120 | 8005 | 22.15916955017301 |
Malawi | 118484 | 94080 | 24404 | 20.596873839505754 |
Netherlands | 41543 | 33893 | 7650 | 18.41465469513516 |
Uganda | 241038 | 197100 | 43938 | 18.228661040997686 |
Eritrea | 117600 | 101000 | 16600 | 14.1156462585034 |
Liberia | 111369 | 96320 | 15049 | 13.512736937567905 |
Bangladesh | 148460 | 130170 | 18290 | 12.319816785666173 |
Gambia, The | 11300 | 10120 | 1180 | 10.442477876106194 |
Only two countries have more body of water than land - British Indian Ocean Territory(99.9% water) and Virgin Isalnds(82%)
%%sql
--Countries that have higher death rate than birth rate
SELECT name, birth_rate, death_rate
FROM facts
WHERE death_rate > birth_rate
ORDER BY death_rate DESC
Done.
name | birth_rate | death_rate |
---|---|---|
Ukraine | 10.72 | 14.46 |
Bulgaria | 8.92 | 14.44 |
Latvia | 10.0 | 14.31 |
Lithuania | 10.1 | 14.27 |
Russia | 11.6 | 13.69 |
Serbia | 9.08 | 13.66 |
Belarus | 10.7 | 13.36 |
Hungary | 9.16 | 12.73 |
Moldova | 12.0 | 12.59 |
Estonia | 10.51 | 12.4 |
Croatia | 9.45 | 12.18 |
Romania | 9.14 | 11.9 |
Germany | 8.47 | 11.42 |
Slovenia | 8.42 | 11.37 |
Greece | 8.66 | 11.09 |
Portugal | 9.27 | 11.02 |
Czech Republic | 9.63 | 10.34 |
Italy | 8.74 | 10.19 |
Poland | 9.74 | 10.19 |
Bosnia and Herzegovina | 8.87 | 9.75 |
Saint Pierre and Miquelon | 7.42 | 9.72 |
Japan | 7.93 | 9.51 |
Austria | 9.41 | 9.42 |
Monaco | 6.65 | 9.24 |
By intuition, most of the countries that have higher death rate than birth rate seem to be (Eastern) European countries. With Ukraine having the highest death rate on the list.