In this project, we worked with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth and try to answer a couple of questions such as:
To start off, let's make use of the following code to connect our Jupyter Notebook to our database file:
%%capture
%load_ext sql
%sql sqlite:///factbook.db
'Connected: None@factbook.db'
To know the table with which we are to work, let us query the database to get this information directly.
%%sql
SELECT *
FROM sqlite_master
WHERE type='table'
Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
As seen in the output of the above cell, let us explore the facts table for our project.
%%sql
SELECT *
FROM facts
LIMIT 5;
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
Here are the descriptions for some of the columns:
name
- The name of the country.area
- The total land and sea area of the country.population
- The country's population.population_growth
- The country's population growth as a percentage.birth_rate
- The country's birth rate, or the number of births a year per 1,000 people.death_rate
- The country's death rate, or the number of death a year per 1,000 people.area
- The country's total area (both land and water).area_land
- The country's land area in square kilometers.area_water
- The country's waterarea in square kilometers.Let's start by calculating some summary statistics and see what they tell us.
%%sql
SELECT MAX(population) AS max_pop,
MAX(population_growth) AS max_growth,
MIN(population) AS min_pop,
MIN(population_growth) AS min_pop_growth
FROM facts;
Done.
max_pop | max_growth | min_pop | min_pop_growth |
---|---|---|---|
7256490011 | 4.02 | 0 | 0.0 |
We see a few interesting things in the summary statistics:
Let's look at these countries individually and decide if they are to be excluuded in our analysis going forward.
%%sql
SELECT *
FROM facts
WHERE population = (SELECT MIN(population)
FROM facts)
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
250 | ay | Antarctica | None | 280000 | None | 0 | None | None | None | None |
%%sql
SELECT *
FROM facts
WHERE population = (SELECT MAX(population)
FROM facts)
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
261 | xx | World | None | None | None | 7256490011 | 1.08 | 18.6 | 7.8 | None |
It seems like the table contains a row for the whole World, which explains the population of over 7.2 billion. It also seems like the table contains a row for Antarctica, which explains the population of 0. (Refer the below image for confirmation)
So, let us calculate the summary statistics again, this time excluding the entire 'World' row.
%%sql
SELECT MAX(population) AS max_pop,
MAX(population_growth) AS max_growth,
MIN(population) AS min_pop,
MIN(population_growth) AS min_pop_growth
FROM facts
WHERE name <> 'World'
Done.
max_pop | max_growth | min_pop | min_pop_growth |
---|---|---|---|
1367485388 | 4.02 | 0 | 0.0 |
Let's explore density. Density depends on the population and the country's area. Let's look at the average values for these two columns.
We should take care of discarding the row for the whole planet
%%sql
SELECT ROUND(AVG(population), 2) AS avg_pop,
ROUND(AVG(area), 2) AS avg_area
FROM facts
WHERE name <> 'World'
Done.
avg_pop | avg_area |
---|---|
32242666.57 | 555093.55 |
We see that the average population is around 32 million and the average area is 555 thousand square kilometers.
To finish, we'll build on the query above to find countries that are densely populated. We'll identify countries that have:
Above average values for population. Below average values for area.
%%sql
SELECT name, population, area
FROM facts
WHERE population>(SELECT AVG(population)
FROM facts
WHERE name <> 'World') AND area<(SELECT AVG(area)
FROM facts
WHERE name <> 'World')
Done.
name | population | area |
---|---|---|
Bangladesh | 168957745 | 148460 |
Germany | 80854408 | 357022 |
Iraq | 37056169 | 438317 |
Italy | 61855120 | 301340 |
Japan | 126919659 | 377915 |
Korea, South | 49115196 | 99720 |
Morocco | 33322699 | 446550 |
Philippines | 100998376 | 300000 |
Poland | 38562189 | 312685 |
Spain | 48146134 | 505370 |
Thailand | 67976405 | 513120 |
Uganda | 37101745 | 241038 |
United Kingdom | 64088222 | 243610 |
Vietnam | 94348835 | 331210 |
Further, let's calculate the population to area ratio.
%%sql
SELECT name, population, area, ROUND(CAST(population AS float)/CAST(area AS float), 2) AS pop_area_ratio
FROM facts
WHERE name <> 'World'
ORDER BY pop_area_ratio DESC
LIMIT 10;
Done.
name | population | area | pop_area_ratio |
---|---|---|---|
Macau | 592731 | 28 | 21168.96 |
Monaco | 30535 | 2 | 15267.5 |
Singapore | 5674472 | 697 | 8141.28 |
Hong Kong | 7141106 | 1108 | 6445.04 |
Gaza Strip | 1869055 | 360 | 5191.82 |
Gibraltar | 29258 | 6 | 4876.33 |
Bahrain | 1346613 | 760 | 1771.86 |
Maldives | 393253 | 298 | 1319.64 |
Malta | 413965 | 316 | 1310.02 |
Bermuda | 70196 | 54 | 1299.93 |
On comparing the above 2 cells, we would expect atleast 80-90% of the countries to match in both the queries. The reason that this did not happen is due to the comparitively low value of area which resulted in a higher population to area ratio in the second query.
Nevertheless, we can conclude that the countries in the first query are densly populated given their high population and low area.
%%sql
SELECT name, population AS current_pop, population_growth,
ROUND(population*population_growth, 2) AS Population_growth_next_year
FROM facts
WHERE name <> 'World'
ORDER BY Population_growth_next_year DESC
LIMIT 10;
Done.
name | current_pop | population_growth | Population_growth_next_year |
---|---|---|---|
India | 1251695584 | 1.22 | 1527068612.48 |
China | 1367485388 | 0.45 | 615368424.6 |
Nigeria | 181562056 | 2.45 | 444827037.2 |
Pakistan | 199085847 | 1.46 | 290665336.62 |
Ethiopia | 99465819 | 2.89 | 287456216.91 |
Bangladesh | 168957745 | 1.6 | 270332392.0 |
United States | 321368864 | 0.78 | 250667713.92 |
Indonesia | 255993674 | 0.92 | 235514180.08 |
Congo, Democratic Republic of the | 79375136 | 2.45 | 194469083.2 |
Philippines | 100998376 | 1.61 | 162607385.36 |
We observe that India will add the most number of people to its population in the next year. Though India may not have the highest population_growth amongst the top 10 countries, it is due to the giant value of the current population that puts India with the highest possibility to add the most number of people in the next year.
%%sql
SELECT name, area_water, area_land,
ROUND(CAST(area_water AS Float)/CAST(area_land AS Float), 2) AS Water_land_ratio
FROM facts
WHERE name <> 'World'
ORDER BY Water_land_ratio DESC
LIMIT 10;
Done.
name | area_water | area_land | Water_land_ratio |
---|---|---|---|
British Indian Ocean Territory | 54340 | 60 | 905.67 |
Virgin Islands | 1564 | 346 | 4.52 |
Puerto Rico | 4921 | 8870 | 0.55 |
Bahamas, The | 3870 | 10010 | 0.39 |
Guinea-Bissau | 8005 | 28120 | 0.28 |
Malawi | 24404 | 94080 | 0.26 |
Netherlands | 7650 | 33893 | 0.23 |
Uganda | 43938 | 197100 | 0.22 |
Eritrea | 16600 | 101000 | 0.16 |
Liberia | 15049 | 96320 | 0.16 |
We observe that British Indian Ocean Territory has a high proprtion of water on comparison to it's land as it an archipelago of 58 islands covering some 640,000 sq km of ocean. Source. We further observe that British Indian Ocean Territory and Virgin Islands are the only countries to have the area of water greater than land.
%%sql
SELECT name, death_rate, birth_rate
FROM facts
WHERE death_rate>birth_rate
ORDER BY death_rate DESC
LIMIT 10;
Done.
name | death_rate | birth_rate |
---|---|---|
Ukraine | 14.46 | 10.72 |
Bulgaria | 14.44 | 8.92 |
Latvia | 14.31 | 10.0 |
Lithuania | 14.27 | 10.1 |
Russia | 13.69 | 11.6 |
Serbia | 13.66 | 9.08 |
Belarus | 13.36 | 10.7 |
Hungary | 12.73 | 9.16 |
Moldova | 12.59 | 12.0 |
Estonia | 12.4 | 10.51 |
We observe that the above countries have a death rate greater than the birth rate. This is evident with the decline in population. As an example, see the variation in population of Ukraine in the past 2 decades. A similar trend can be seen in the case of other countries as well.