This project works with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information below. To name a few are:
name
- The name of the country.area
- The country's total area (both land and water).area_land
- The country's land area in square kilometers.area_water
- The country's waterarea in square kilometers.population
- The population as of 2015
.population_growth
- The annual population growth rate, as a percentage.birth_rate
- The country's birth rate, or the number of births a year per 1,000 people.death_rate
- The country's death rate, or the number of deaths a year per 1,000 people.The aim of this project is to use SQL in Jupyter Notebook to explore and analyze data from this database. To work on this project locally on your computer, you can download the SQLite factbook.db database by clicking the Download button on the right.
%%capture
%load_ext sql
%sql sqlite:///factbook.db
'Connected: None@factbook.db'
The code above connects the external database file to the jupyter notebook console.
In order to extract useful information such as table Name, the code below helps with that.
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
Based on the information gotten above, the database contains a table called facts
.
To explore the contents of the table, such as the various columns and rows contained in the facts
table, the code below can be used to achieve this, displaying only the first 5 rows of the table.
%%sql
SELECT *
FROM facts
LIMIT 5;
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
To begin analysis, first off I start by checking for outliers in the columns population
and population_growth
.
This is achieved by carrying out some statistical operations (such as calculating the MAX and MIN values) on the respective columns.
%%sql
SELECT MAX(population), MIN(population),
MAX(population_growth), MIN(population_growth)
FROM facts;
Done.
MAX(population) | MIN(population) | MAX(population_growth) | MIN(population_growth) |
---|---|---|---|
7256490011 | 0 | 4.02 | 0.0 |
Looking at the data gotten from the summary statistics performed, it points out that there's a country with a population of more than 7billion people which is the highest population amongst all the countries and also not very reasonable. Also, another with a record 0, which has the least population which quite frankly doesn't make any sense at all.
Moving further, I'll try exploring the data contained in the table to see if they're trends or meaning can be obtained from the data.
To do so I'll display the top 5 countries with the Maximum and Minimun number of people (population).
%%sql
SELECT id, name, population
FROM facts
ORDER BY population DESC
LIMIT 5;
Done.
id | name | population |
---|---|---|
261 | World | 7256490011 |
37 | China | 1367485388 |
77 | India | 1251695584 |
197 | European Union | 513949445 |
186 | United States | 321368864 |
%%sql
SELECT id, name, population
FROM facts
WHERE population >= 0
ORDER BY population ASC
LIMIT 5;
Done.
id | name | population |
---|---|---|
250 | Antarctica | 0 |
238 | Pitcairn Islands | 48 |
200 | Cocos (Keeling) Islands | 596 |
190 | Holy See (Vatican City) | 842 |
220 | Niue | 1190 |
Looking at the first table which displays the top 5 countries with the highest population, there is an obvious outlier which is given for the country name World and since it's known to the best my knowledge that World isn't a country, it is possible to assume that this value could've been gotten as a result of the summation of all the population values of the other countries. I'll exclude it in the next summary statistics taken, to see what the table looks like.
Also, the country name European Union is not a valid country name, therefore I'll also get exclude it.
Moving further to the second table displayed, its observed that a country named Antarctica (which is a continent and not a country) has a 0 count as its population. This can be used to validate the fact that there is a country in the table with a value 0.
Also, the so much minimal values for countries in the remaining rows, can be further investigated. For example, the Pitcairn Islands is known to be an abandoned country comprising of a group of volcanic islands in the South Pacific Ocean, east of French Polynesia. This validates the value gotten for the population of that country.
NB: Other countries can be looked into to determine the validity of their values.
Excluding the 3 countries identified previously, here's the top 5 countries with the highest population.
%%sql
SELECT id, name, population
FROM facts
WHERE name NOT IN ('World', 'European Union', 'Antarctica')
ORDER BY population DESC
LIMIT 5;
Done.
id | name | population |
---|---|---|
37 | China | 1367485388 |
77 | India | 1251695584 |
186 | United States | 321368864 |
78 | Indonesia | 255993674 |
24 | Brazil | 204259812 |
The most populated countries in the world based on the collected data are China with more than 1.3 billion, India following closely with more than 1.2 billion and U.S following reluctantly with about 321 billion people.
%%sql
SELECT MIN(population), MAX(population),
MIN(population_growth), MAX(population_growth)
FROM facts
WHERE name NOT IN ('World', 'European Union', 'Antarctica');
Done.
MIN(population) | MAX(population) | MIN(population_growth) | MAX(population_growth) |
---|---|---|---|
48 | 1367485388 | 0.0 | 4.02 |
Above shows what the new summary statistics look like.
Exluding the the 3 identified incorrectly named countries World, European Union and Antarctica, I'll then find the average of the remaining countries.
%%sql
SELECT AVG(population) AS 'avg_population', AVG(area) AS 'avg_area'
FROM facts
WHERE name NOT IN ('World', 'European Union', 'Antarctica');
Done.
avg_population | avg_area |
---|---|
30362063.589958157 | 539893.1895161291 |
The average population and area of all the countries excluding World, European Union and Antarctica is about 30 million and 530,000 square kilometers respectively.
Moving foward, it'll a good idea to explore how many countries have an area greater than the average area, which would be shown below.
NB: Excluding the the country names World, European Union and Antarctica.
%%sql
SELECT COUNT(*)
FROM facts
WHERE area > (SELECT AVG(area)
FROM facts)
AND name NOT IN ('World', 'European Union', 'Antarctica');
Done.
COUNT(*) |
---|
47 |
Now it's visible that they're 47 of 195 countries which have an average greater than the estimated average of all the countries.
I'll display a few of the corresponding countries below.
%%sql
SELECT id, name, area
FROM facts
WHERE area > (SELECT AVG(area)
FROM facts)
ORDER BY area DESC
LIMIT 5;
Done.
id | name | area |
---|---|---|
143 | Russia | 17098242 |
32 | Canada | 9984670 |
186 | United States | 9826675 |
37 | China | 9596960 |
24 | Brazil | 8515770 |
Given the values gotten, it's possible to identify the countries which are overpopulated by:
%%sql
SELECT id, name, population, area
FROM facts
WHERE population > (SELECT AVG(population)
FROM facts)
AND area < (SELECT AVG(area)
FROM facts)
ORDER BY id;
Done.
id | name | population | area |
---|---|---|---|
14 | Bangladesh | 168957745 | 148460 |
65 | Germany | 80854408 | 357022 |
85 | Japan | 126919659 | 377915 |
138 | Philippines | 100998376 | 300000 |
173 | Thailand | 67976405 | 513120 |
185 | United Kingdom | 64088222 | 243610 |
192 | Vietnam | 94348835 | 331210 |
The table above shows some of the countries suffering overpopulation, with the total number of residents overseeding the average population and area of its respective countries.
Now let's calculate population densities. To do so we will have to divide the country's population by its land area (hopefully eliminating the possiblity that there are people who live of water).
%%sql
SELECT id, name, population, area,
ROUND(CAST(population AS FLOAT)/area_land, 2) AS population_density
FROM facts
ORDER BY population_density DESC
LIMIT 5;
Done.
id | name | population | area | population_density |
---|---|---|---|---|
205 | Macau | 592731 | 28 | 21168.96 |
117 | Monaco | 30535 | 2 | 15267.5 |
156 | Singapore | 5674472 | 697 | 8259.78 |
204 | Hong Kong | 7141106 | 1108 | 6655.27 |
251 | Gaza Strip | 1869055 | 360 | 5191.82 |
Macau is a special administrative region in China, and is considered the most densely populated region in the world.
If we are following this way why not calculate the average population density and figure out how many (and what) countries have densitites more than its average? The oceans, the EU, the world, have no values for area_land and/or for population so we do not need to explicitly exclude them from calculation. On the other hand, the Antarctica has both values, so it's necessary to exclude it.
%%sql
SELECT ROUND(AVG(CAST(population AS FLOAT) / area_land), 2) AS avg_pop_density
FROM facts
WHERE name NOT IN ('Antarctica');
Done.
avg_pop_density |
---|
431.91 |
It's around 432 pop./km2, the density of the Netherlands, a state in Europe, a continent with one of the most highest densitites in the world.
Let's now figure out how many countries (and what) have densities more than the average value.
%%sql
SELECT COUNT(*) AS greater_avg_dens
FROM facts
WHERE
The following questions can further help in extracting meaningful information from the data contained in the table.
%%sql
SELECT id, name, MAX(population) AS 'maximun_population'
FROM facts
WHERE name != 'World';
Done.
id | name | maximun_population |
---|---|---|
37 | China | 1367485388 |
%%sql
SELECT id, name, MAX(population_growth) AS 'highest_population_growth'
FROM facts
WHERE name != 'World';
Done.
id | name | highest_population_growth |
---|---|---|
162 | South Sudan | 4.02 |
Stated above, are the countries with the highest population
and population growth
namely China and South Sudan.
%%sql
SELECT id, name, CAST(area_water AS FLOAT)/area_land AS ratio_water_to_land
FROM facts
ORDER BY ratio_water_to_land DESC
LIMIT 5;
Done.
id | name | ratio_water_to_land |
---|---|---|
228 | British Indian Ocean Territory | 905.6666666666666 |
247 | Virgin Islands | 4.520231213872832 |
246 | Puerto Rico | 0.5547914317925592 |
12 | Bahamas, The | 0.3866133866133866 |
71 | Guinea-Bissau | 0.2846728307254623 |
The diagram above shows British Indian Ocean Territory with the highest water to land ratio.
NB: All the countries listed above have islands which means they've bodies of water around them. This goes to validate the ratio_water_to_land column displayed.
%%sql
SELECT id, name, area_water, area_land
FROM facts
WHERE area_water > area_land;
Done.
id | name | area_water | area_land |
---|---|---|---|
228 | British Indian Ocean Territory | 54340 | 60 |
247 | Virgin Islands | 1564 | 346 |
The output shows only two countries with more water mass than land mass namely British Indian Ocean Territory and Virgin Islands.
To determine the countries with the prospects of adding more population, it is necessary to determine which countries are lessly populated and what their growth rate is, otherwise we might end up displaying result of countries that may turn out to be overpopulated.
This can be done by:
%%sql
SELECT id, name, population, population_growth, area
FROM facts
WHERE population < (SELECT AVG(population)
FROM facts)
AND area < (SELECT AVG(population)
FROM facts)
ORDER BY population_growth DESC
LIMIT 5;
Done.
id | name | population | population_growth | area |
---|---|---|---|---|
162 | South Sudan | 12042910 | 4.02 | 644329 |
106 | Malawi | 17964697 | 3.32 | 118484 |
29 | Burundi | 10742276 | 3.28 | 27830 |
182 | Uganda | 37101745 | 3.24 | 241038 |
141 | Qatar | 2194817 | 3.07 | 11586 |
Above shows the top 5 countries which tend to experience a high number of population increase.
%%sql
SELECT id, name, birth_rate, death_rate,
ROUND(death_rate - birth_rate, 3) AS death_birth_difference
FROM facts
WHERE death_rate > birth_rate
ORDER BY death_birth_difference
LIMIT 10;
Done.
id | name | birth_rate | death_rate | death_birth_difference |
---|---|---|---|---|
10 | Austria | 9.41 | 9.42 | 0.01 |
139 | Poland | 9.74 | 10.19 | 0.45 |
116 | Moldova | 12.0 | 12.59 | 0.59 |
47 | Czech Republic | 9.63 | 10.34 | 0.71 |
22 | Bosnia and Herzegovina | 8.87 | 9.75 | 0.88 |
83 | Italy | 8.74 | 10.19 | 1.45 |
85 | Japan | 7.93 | 9.51 | 1.58 |
140 | Portugal | 9.27 | 11.02 | 1.75 |
57 | Estonia | 10.51 | 12.4 | 1.89 |
143 | Russia | 11.6 | 13.69 | 2.09 |
%%sql
SELECT id, name, population, area,
(SELECT *
FROM facts
WHERE population =)
population/area AS ratio_population_to_area
FROM facts
WHERE population = 'None' OR area = 'None'
ORDER BY ratio_population_to_area;
(sqlite3.OperationalError) near ")": syntax error [SQL: SELECT id, name, population, area, (SELECT * FROM facts WHERE population =) population/area AS ratio_population_to_area FROM facts WHERE population = 'None' OR area = 'None' ORDER BY ratio_population_to_area;] (Background on this error at: http://sqlalche.me/e/e3q8)
%%sql
SELECT id, name, population, area
FROM facts
WHERE (population != 'None' OR area != 'None')
AND
(sqlite3.OperationalError) near "AND": syntax error [SQL: SELECT id, name, population, area FROM facts WHERE (population != 'None' OR area != 'None') AND] (Background on this error at: http://sqlalche.me/e/e3q8)