In this project, we'll use SQL in Jupyter Notebook to explore and analyze data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like:
%%capture
%load_ext sql
%sql sqlite:///factbook.db
We'll begin by getting a sense of what the data looks like.
%%sql
SELECT *
FROM facts
LIMIT 5;
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
Here are the descriptions for some of the columns:
Let's start by calculating some summary statistics and see what they tell us
%%sql
SELECT MIN(population),
MAX(population),
MIN(population_growth),
MAX(population_growth)
FROM facts;
* sqlite:///factbook.db Done.
MIN(population) | MAX(population) | MIN(population_growth) | MAX(population_growth) |
---|---|---|---|
0 | 7256490011 | 0.0 | 4.02 |
A few things stick out from the summary statistics in the last screen:
Let's use subqueries to zoom in on just these countries without using the specific values.
%%sql
SELECT *
FROM facts
WHERE population ==
(SELECT MIN(population)
FROM facts);
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
250 | ay | Antarctica | None | 280000 | None | 0 | None | None | None | None |
It seems like the table contains a row for Antarctica, which explains the population of 0. This seems to match the CIA Factbook page for Antarctica:
%%sql
SELECT *
FROM facts
WHERE population ==
(SELECT MAX(population)
FROM facts);
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
261 | xx | World | None | None | None | 7256490011 | 1.08 | 18.6 | 7.8 | None |
We also see that the table contains a row for the whole world, which explains the maximum population of over 7.2 billion we found earlier.
Now that we know this, we should recalculate the summary statistics we calculated earlier, while excluding the row for the whole world.
%%sql
SELECT MIN(population),
MAX(population),
MIN(population_growth),
MAX(population_growth)
FROM facts
WHERE name!="World"
* sqlite:///factbook.db Done.
MIN(population) | MAX(population) | MIN(population_growth) | MAX(population_growth) |
---|---|---|---|
0 | 1367485388 | 0.0 | 4.02 |
There's a country whose population closes in on 1.4 billion!
Let's explore density. Density depends on the population and the country's area. Let's look at the average values for these two columns.
We should take care of discarding the row for the whole planet.
%%sql
SELECT AVG(population) AS AVG_POPULATION,
AVG(area) AS AVG_AREA
FROM facts
WHERE name!="World"
* sqlite:///factbook.db Done.
AVG_POPULATION | AVG_AREA |
---|---|
32242666.56846473 | 555093.546184739 |
We see that the average population is around 32 million and the average area is 555 thousand square kilometers.
To finish, we'll build on the query above to find countries that are densely populated. We'll identify countries that have:
%%sql
SELECT *
FROM facts
WHERE population > (SELECT AVG(population)
FROM facts)
AND area < (SELECT AVG(area)
FROM facts);
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
14 | bg | Bangladesh | 148460 | 130170 | 18290 | 168957745 | 1.6 | 21.14 | 5.61 | 0.46 |
65 | gm | Germany | 357022 | 348672 | 8350 | 80854408 | 0.17 | 8.47 | 11.42 | 1.24 |
85 | ja | Japan | 377915 | 364485 | 13430 | 126919659 | 0.16 | 7.93 | 9.51 | 0.0 |
138 | rp | Philippines | 300000 | 298170 | 1830 | 100998376 | 1.61 | 24.27 | 6.11 | 2.09 |
173 | th | Thailand | 513120 | 510890 | 2230 | 67976405 | 0.34 | 11.19 | 7.8 | 0.0 |
185 | uk | United Kingdom | 243610 | 241930 | 1680 | 64088222 | 0.54 | 12.17 | 9.35 | 2.54 |
192 | vm | Vietnam | 331210 | 310070 | 21140 | 94348835 | 0.97 | 15.96 | 5.93 | 0.3 |
Some of these countries are generally known to be densely populated, so we have confidence in our results!
%%sql
SELECT name,MAX(population) AS Highest_population
FROM facts
WHERE name <> "World"
* sqlite:///factbook.db Done.
name | Highest_population |
---|---|
China | 1367485388 |
From the analysis we can see that *China* has the highest population with over *1.3 billion people*
%%sql
SELECT name,MAX(population_growth) AS highest_growth_rate
FROM facts
WHERE name <> "World"
* sqlite:///factbook.db Done.
name | highest_growth_rate |
---|---|
South Sudan | 4.02 |
*** South Sudan*** has the highest growth rate
%%sql
SELECT *, CAST(area_water AS FLOAT)/CAST(area_land AS FLOAT)AS water_to_land_ratio
FROM facts
WHERE name <> "World"
ORDER BY water_to_land_ratio DESC
LIMIT 5;
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | water_to_land_ratio |
---|---|---|---|---|---|---|---|---|---|---|---|
228 | io | British Indian Ocean Territory | 54400 | 60 | 54340 | None | None | None | None | None | 905.6666666666666 |
247 | vq | Virgin Islands | 1910 | 346 | 1564 | 103574 | 0.59 | 10.31 | 8.54 | 7.67 | 4.520231213872832 |
246 | rq | Puerto Rico | 13791 | 8870 | 4921 | 3598357 | 0.6 | 10.86 | 8.67 | 8.15 | 0.5547914317925592 |
12 | bf | Bahamas, The | 13880 | 10010 | 3870 | 324597 | 0.85 | 15.5 | 7.05 | 0.0 | 0.3866133866133866 |
71 | pu | Guinea-Bissau | 36125 | 28120 | 8005 | 1726170 | 1.91 | 33.38 | 14.33 | 0.0 | 0.2846728307254623 |
From the above, *British indian ocean* has the highest water to land mass ratio
%%sql
SELECT *, CAST(area_water AS FLOAT) > CAST(area_land AS FLOAT) AS water_greaterThan_land
FROM facts
WHERE name <> "World"
ORDER BY water_greaterThan_land DESC
LIMIT 5;
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | water_greaterThan_land |
---|---|---|---|---|---|---|---|---|---|---|---|
228 | io | British Indian Ocean Territory | 54400 | 60 | 54340 | None | None | None | None | None | 1 |
247 | vq | Virgin Islands | 1910 | 346 | 1564 | 103574 | 0.59 | 10.31 | 8.54 | 7.67 | 1 |
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 | 0 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 | 0 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 | 0 |
*British indian ocean* has more water than land
%%sql
SELECT name AS country,population,CAST(population_growth*population AS FLOAT) AS NEXTYR_POP
FROM facts
WHERE name <> "World"
ORDER BY NEXTYR_POP DESC
LIMIT 7;
* sqlite:///factbook.db Done.
country | population | NEXTYR_POP |
---|---|---|
India | 1251695584 | 1527068612.48 |
China | 1367485388 | 615368424.6 |
Nigeria | 181562056 | 444827037.20000005 |
Pakistan | 199085847 | 290665336.62 |
Ethiopia | 99465819 | 287456216.91 |
Bangladesh | 168957745 | 270332392.0 |
United States | 321368864 | 250667713.92000002 |
*India* will add more people to their population next year
%%sql
SELECT *, CAST(death_rate AS FLOAT)/CAST(birth_rate AS FLOAT) AS death_to_birthrate
FROM facts
WHERE name <> "World"
ORDER BY death_to_birthrate DESC
LIMIT 5;
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | death_to_birthrate |
---|---|---|---|---|---|---|---|---|---|---|---|
26 | bu | Bulgaria | 110879 | 108489 | 2390 | 7186893 | 0.58 | 8.92 | 14.44 | 0.29 | 1.6188340807174888 |
153 | ri | Serbia | 77474 | 77474 | 0 | 7176794 | 0.46 | 9.08 | 13.66 | 0.0 | 1.5044052863436124 |
96 | lg | Latvia | 64589 | 62249 | 2340 | 1986705 | 1.06 | 10.0 | 14.31 | 6.26 | 1.431 |
102 | lh | Lithuania | 65300 | 62680 | 2620 | 2884433 | 1.04 | 10.1 | 14.27 | 6.27 | 1.4128712871287128 |
75 | hu | Hungary | 93028 | 89608 | 3420 | 9897541 | 0.22 | 9.16 | 12.73 | 1.33 | 1.3897379912663756 |
*Bulgaria* has the highest death rate than birth rate
%%sql
SELECT name AS country,population,CAST(population/area AS FLOAT) AS population_to_area
FROM facts
WHERE name <> "World"
ORDER BY population_to_area DESC
LIMIT 7;
* sqlite:///factbook.db Done.
country | population | population_to_area |
---|---|---|
Macau | 592731 | 21168.0 |
Monaco | 30535 | 15267.0 |
Singapore | 5674472 | 8141.0 |
Hong Kong | 7141106 | 6445.0 |
Gaza Strip | 1869055 | 5191.0 |
Gibraltar | 29258 | 4876.0 |
Bahrain | 1346613 | 1771.0 |
*Macau* has the highest death rate than birth rate