In this guided project from DataQuest we will look at the CIA's World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information such as the following:
population
— the global population.population_growth
— the annual population growth rate, as a percentage.area
— the total land and water area.birth_rate
— annual number of births per 1,000 peopleWe will use SQL (sqlite) to analyze data from this database. The raw database file can be downloaded locally here.
%%capture
%load_ext sql
%sql sqlite:///factbook.db
'Connected: None@factbook.db'
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
%%sql
SELECT *
FROM facts
LIMIT 10;
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
6 | ac | Antigua and Barbuda | 442 | 442 | 0 | 92436 | 1.24 | 15.85 | 5.69 | 2.21 |
7 | ar | Argentina | 2780400 | 2736690 | 43710 | 43431886 | 0.93 | 16.64 | 7.33 | 0.0 |
8 | am | Armenia | 29743 | 28203 | 1540 | 3056382 | 0.15 | 13.61 | 9.34 | 5.8 |
9 | as | Australia | 7741220 | 7682300 | 58920 | 22751014 | 1.07 | 12.15 | 7.14 | 5.65 |
10 | au | Austria | 83871 | 82445 | 1426 | 8665550 | 0.55 | 9.41 | 9.42 | 5.56 |
Fields:
id
— unique identifier for each record.code
— an abbreviation of the name of the country.name
— the name of the country.area
— the country's total area (both land and water).area_land
— the country's land area in square kilometers.area_water
— the country's water area in square kilometers.population
— the country's population.population_growth
— the country's annual population growth as a percentage.birth_rate
— the country's birth rate, or the number of births per year per 1,000 people.death_rate
— the country's death rate, or the number of death per year per 1,000 people.migration_rate
— the country's migration rate, or difference between the number of persons entering and leaving a country during the year per 1,000 people.%%sql
SELECT SUM(population) AS total_population,
COUNT(*) AS num_countries,
CAST(AVG(population) AS INT) AS mean_population
FROM facts;
Done.
total_population | num_countries | mean_population |
---|---|---|
15026972654 | 261 | 62094928 |
I know our population is exploding here on Earth, but I don't think we've quite reached the 15 billion mark...YET! Something is clearly breaking our calculation for total population.
Also, according to this website, there are now 195 independent sovereign nations in the world (not including the disputed but de facto independent Taiwan), plus some 60 dependent areas, and several disputed territories, like Kosovo. So something is a little off with respect to the number of countries we have listed in our database.
Therefore, it's no wonder that our calculated average of over 62 million people per country seems a little high.
Let's investigate these discrepencies by running another query on the database.
%%sql
SELECT MIN(population) AS min_population,
MAX(population) AS max_population,
MIN(population_growth) AS min_population_growth,
MAX(population_growth) AS max_population_growth
FROM facts;
Done.
min_population | max_population | min_population_growth | max_population_growth |
---|---|---|---|
0 | 7256490011 | 0.0 | 4.02 |
It seems odd that there is a country with a population of 0 and one with a population greater than 7.2 billion. Let's zoom in on these with the use of a subquery to figure out what's happening here.
%%sql
SELECT *
FROM facts
WHERE population == (SELECT MIN(population)
FROM facts
);
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
250 | ay | Antarctica | None | 280000 | None | 0 | None | None | None | None |
From the CIA's World Factbook on Antartica:
no indigenous inhabitants, but there are both permanent and summer-only staffed research stations
note: 53 countries have signed the 1959 Antarctic Treaty; 30 of those operate through their National Antarctic Program a number of seasonal-only (summer) and year-round research stations on the continent and its nearby islands south of 60 degrees south latitude (the region covered by the Antarctic Treaty); the population engaging in and supporting science or managing and protecting the Antarctic region varies from approximately 4,400 in summer to 1,100 in winter; in addition, approximately 1,000 personnel, including ship's crew and scientists doing onboard research, are present in the waters of the treaty region
Although not a country — it's a continent — Antarctica is included in our list of countries and thefore makes sense why we are getting a population of 0 in our query results. This should be kept front-of-mind when constructing queries using aggregate functions (e.g. AVERAGE) which could be influenced by this entry.
%%sql
SELECT *
FROM facts
WHERE population == (SELECT MAX(population)
FROM facts
);
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
261 | xx | World | None | None | None | 7256490011 | 1.08 | 18.6 | 7.8 | None |
Again we find another 'country' that isn't actually a country in the database: World
. We should keep this in mind when constructing queries involving population since it has the potential to seriously alter our results.
For the above mentioned reasons, we will make it a point to exclude these rows in our queries regarding population.
population
¶%%sql
SELECT *
FROM facts
WHERE population IS NULL;
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
198 | at | Ashmore and Cartier Islands | 5 | 5 | 0 | None | None | None | None | None |
201 | cr | Coral Sea Islands | 3 | 3 | 0 | None | None | None | None | None |
202 | hm | Heard Island and McDonald Islands | 412 | 412 | 0 | None | None | None | None | None |
208 | ip | Clipperton Island | 6 | 6 | 0 | None | None | None | None | None |
210 | fs | French Southern and Antarctic Lands | None | None | None | None | None | None | None | None |
222 | bv | Bouvet Island | 49 | 49 | 0 | None | None | None | None | None |
223 | jn | Jan Mayen | 377 | 377 | 0 | None | None | None | None | None |
228 | io | British Indian Ocean Territory | 54400 | 60 | 54340 | None | None | None | None | None |
240 | sx | South Georgia and South Sandwich Islands | 3903 | 3903 | 0 | None | None | None | None | None |
244 | bq | Navassa Island | 5 | 5 | 0 | None | None | None | None | None |
248 | wq | Wake Island | 6 | 6 | 0 | None | None | None | None | None |
249 | um | United States Pacific Island Wildlife Refuges | None | None | None | None | None | None | None | None |
252 | pf | Paracel Islands | 7 | 7 | 0 | None | None | None | None | None |
253 | pg | Spratly Islands | 5 | 5 | 0 | None | None | None | None | None |
256 | xq | Arctic Ocean | None | None | None | None | None | None | None | None |
257 | zh | Atlantic Ocean | None | None | None | None | None | None | None | None |
258 | xo | Indian Ocean | None | None | None | None | None | None | None | None |
259 | zn | Pacific Ocean | None | None | None | None | None | None | None | None |
260 | oo | Southern Ocean | None | None | None | None | None | None | None | None |
%%sql
SELECT COUNT(*) AS population_null_count
FROM facts
WHERE population IS NULL;
Done.
population_null_count |
---|
19 |
The above queries (combined) show why we have more records than we do countries + disputed regions in our database: it also includes records for non-countries: Antarctica, World, Arctic Ocean, Atlantic Ocean, Pacific Ocean, etc...
The database also includes 19 records that are missing a value for population. However, since aggregate functions automatically exclude null values, we do not need to concern ourselves with these records when calling MIN, MAX, and AVG on the population
field.
%%sql
SELECT MIN(population) AS min_population,
MAX(population) AS max_population,
MIN(population_growth) AS min_population_growth,
MAX(population_growth) AS max_population_growth
FROM facts
WHERE name <> 'World'
AND name <> 'Antarctica';
Done.
min_population | max_population | min_population_growth | max_population_growth |
---|---|---|---|
48 | 1367485388 | 0.0 | 4.02 |
With filters in place to remove Antarctica
and World
from our query, we see much more reasonable results for min and max populations.
Let's now do the same while calculating the average population and the average area per country.
population
and area
¶%%sql
SELECT CAST(AVG(population) AS INT) AS mean_population,
CAST(AVG(area) AS INT) AS mean_area,
CAST(AVG(population/area) AS INT) AS mean_density_by_country,
((SELECT CAST(AVG(population) AS INT)
FROM facts
WHERE name <> 'World'
AND name <> 'Antarctica') / (SELECT CAST(AVG(area) AS INT)
FROM facts
WHERE name <> 'World'
AND name <> 'Antarctica'))
AS mean_density_worldwide
FROM facts
WHERE name <> 'World'
AND name <> 'Antarctica';
Done.
mean_population | mean_area | mean_density_by_country | mean_density_worldwide |
---|---|---|---|
32377011 | 555093 | 419 | 58 |
Again, with the two outliers removed, we see much more reasonable results; the average population is around 32 million people and average area is around 555 thousand square kilometres.
Finally, let's use these figures to find densely populated countries: countries with an above average population and simultaneously a below average area.
%%sql
SELECT id, name, population, area, population/area AS density
FROM facts
WHERE population > (SELECT CAST(AVG(population) AS INT)
FROM facts
WHERE name <> 'World'
AND name <> 'Antarctica')
AND area_land < (SELECT CAST(AVG(area) AS INT)
FROM facts
WHERE name <> 'World'
AND name <> 'Antarctica')
ORDER BY density DESC;
Done.
id | name | population | area | density |
---|---|---|---|---|
14 | Bangladesh | 168957745 | 148460 | 1138 |
91 | Korea, South | 49115196 | 99720 | 492 |
138 | Philippines | 100998376 | 300000 | 336 |
85 | Japan | 126919659 | 377915 | 335 |
192 | Vietnam | 94348835 | 331210 | 284 |
185 | United Kingdom | 64088222 | 243610 | 263 |
65 | Germany | 80854408 | 357022 | 226 |
83 | Italy | 61855120 | 301340 | 205 |
182 | Uganda | 37101745 | 241038 | 153 |
173 | Thailand | 67976405 | 513120 | 132 |
139 | Poland | 38562189 | 312685 | 123 |
163 | Spain | 48146134 | 505370 | 95 |
80 | Iraq | 37056169 | 438317 | 84 |
120 | Morocco | 33322699 | 446550 | 74 |
This result seems reasonable since these countries are known for being densely populated.
Now we go beyond the scope of the guided project (this is where the real learning happens!) by answering some 'home-brewed' questions.
First and foremost: where are my water nations at? More specifically: which countries have the largest area of water compared to total area?
growth_rate
¶%%sql
SELECT id, name, population
FROM facts
WHERE name <> 'World'
ORDER BY population DESC
LIMIT 1;
Done.
id | name | population |
---|---|---|
37 | China | 1367485388 |
China has the largest population
with 1,367,485,388 people.
%%sql
SELECT id, name, population_growth
FROM facts
WHERE name <> 'World'
ORDER BY population_growth DESC
LIMIT 1;
Done.
id | name | population_growth |
---|---|---|
162 | South Sudan | 4.02 |
South Sudan has the largest population_growth
at 4.02% per year.
%%sql
SELECT id, name, area, area_land,
area_water, ROUND(area_water*100.0/area, 2) || '%' AS percent_water
FROM facts
ORDER BY area_water*100.0/area DESC
LIMIT 20;
Done.
id | name | area | area_land | area_water | percent_water |
---|---|---|---|---|---|
228 | British Indian Ocean Territory | 54400 | 60 | 54340 | 99.89% |
247 | Virgin Islands | 1910 | 346 | 1564 | 81.88% |
246 | Puerto Rico | 13791 | 8870 | 4921 | 35.68% |
12 | Bahamas, The | 13880 | 10010 | 3870 | 27.88% |
71 | Guinea-Bissau | 36125 | 28120 | 8005 | 22.16% |
106 | Malawi | 118484 | 94080 | 24404 | 20.6% |
125 | Netherlands | 41543 | 33893 | 7650 | 18.41% |
182 | Uganda | 241038 | 197100 | 43938 | 18.23% |
56 | Eritrea | 117600 | 101000 | 16600 | 14.12% |
99 | Liberia | 111369 | 96320 | 15049 | 13.51% |
14 | Bangladesh | 148460 | 130170 | 18290 | 12.32% |
63 | Gambia, The | 11300 | 10120 | 1180 | 10.44% |
196 | Taiwan | 35980 | 32260 | 3720 | 10.34% |
60 | Finland | 338145 | 303815 | 34330 | 10.15% |
77 | India | 3287263 | 2973193 | 314070 | 9.55% |
58 | Ethiopia | 1104300 | None | 104300 | 9.44% |
32 | Canada | 9984670 | 9093507 | 891163 | 8.93% |
168 | Sweden | 450295 | 410335 | 39960 | 8.87% |
38 | Colombia | 1138910 | 1038700 | 100210 | 8.8% |
25 | Brunei | 5765 | 5265 | 500 | 8.67% |
It should come as no surprise that the top few results are an ocean and a handful of islands. What is surprising is that we see two countries in Africa which are comprised of more than 20% water.
Despite being one of the largest countries in the world, my homeland (Canada) made the top 20! I knew there was an innate reason why I'm drawn to water!
Next, let's find the places where the death_rate
is higher than the birth_rate
.
death_rate
and birth_rate
¶%%sql
SELECT id, name, birth_rate, death_rate,
ROUND(death_rate - birth_rate, 2) AS difference, migration_rate, population_growth
FROM facts
WHERE birth_rate < death_rate
ORDER BY difference DESC
LIMIT 10;
Done.
id | name | birth_rate | death_rate | difference | migration_rate | population_growth |
---|---|---|---|---|---|---|
26 | Bulgaria | 8.92 | 14.44 | 5.52 | 0.29 | 0.58 |
153 | Serbia | 9.08 | 13.66 | 4.58 | 0.0 | 0.46 |
96 | Latvia | 10.0 | 14.31 | 4.31 | 6.26 | 1.06 |
102 | Lithuania | 10.1 | 14.27 | 4.17 | 6.27 | 1.04 |
183 | Ukraine | 10.72 | 14.46 | 3.74 | 2.25 | 0.6 |
75 | Hungary | 9.16 | 12.73 | 3.57 | 1.33 | 0.22 |
65 | Germany | 8.47 | 11.42 | 2.95 | 1.24 | 0.17 |
158 | Slovenia | 8.42 | 11.37 | 2.95 | 0.37 | 0.26 |
142 | Romania | 9.14 | 11.9 | 2.76 | 0.24 | 0.3 |
44 | Croatia | 9.45 | 12.18 | 2.73 | 1.39 | 0.13 |
According to our data, despite having more deaths than births, all of these countries still show a positive population_growth
. This seems at odds with the above query showing that only two of these countries have a migration_rate
greater than the difference between death and birth rates.
Checking the CIA Factbook website for Bulgaria shows that Bulgaria's net migration is in fact negative. The website also shows that many countries (~40) have a negative population_growth
whereas our database does not show any such countries (see below).
It is possible that the negative sign was dropped from many figures when the database was constructed. Tsk-tsk!
population_growth
or migration_rate
¶%%sql
SELECT id, name, birth_rate, death_rate, population_growth
FROM facts
WHERE population_growth < 0
OR migration_rate < 0;
Done.
id | name | birth_rate | death_rate | population_growth |
---|
Well that little discovery has sucked my will to explore this database any further! :(
Oh well. Spilt milk. Let's move on to learn something new from DataQuest! :)
Here is what our queries revealed about this (untrustworthy?) database:
population
population_growth
is 4.02% (South Sudan)population
but below average area
:death_rate
than birth_rate
:population_growth
or migration_rate
(I don't believe this...)