In this project, we're interested in finding information about the countries of the world, as presented in the CIA World Factbook. Using SQLite in Jupyter Notebook, we'll query the database to answer the following questions:
Before we can do that, we need to connect to our database.
%%capture
%load_ext sql
%sql sqlite:///factbook.db
'Connected: None@factbook.db'
Our first step is to get an overview of the data
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
We have a fairly simple database. The SQLite internal table sqlite_sequence is used to store information about SQLite: autoincrement columns. This means we only have one data table, called "facts". Let's take a look at the first five rows of the facts table.
%%sql
SELECT *
FROM facts
LIMIT 5
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
There are a total of 11 columns. The ones of most interest to us for our queries are:
With this knowledge, we can now begin our analysis.
Our first analysis step will be to find and calculate some summary statistics, namely the minimum and maximum populations and minimum and maximum population growth.
%%sql
SELECT MIN(population) AS minimum_pop,
MAX(population) AS max_pop,
MIN(population_growth) AS min_pop_growth,
MAX(population_growth) AS max_pop_growth
FROM facts
Done.
minimum_pop | max_pop | min_pop_growth | max_pop_growth |
---|---|---|---|
0 | 7256490011 | 0.0 | 4.02 |
We have a country with zero population and one with a populaton of 7.25 Billion (the approximate population of the world). Let's take a look at the countries with the highest and lowest populations
%%sql
SELECT name, population
FROM facts
ORDER BY population
DESC
LIMIT 5
Done.
name | population |
---|---|
World | 7256490011 |
China | 1367485388 |
India | 1251695584 |
European Union | 513949445 |
United States | 321368864 |
As suspected, there is a row for the population of the entire world. We'll have to take that into consideration for our query. Now let's take a look at the areas with the lowest populations.
%%sql
SELECT name, population
FROM facts
ORDER BY population
LIMIT 21 --Need 21 rows to get the first row with any population
Done.
name | population |
---|---|
Ashmore and Cartier Islands | None |
Coral Sea Islands | None |
Heard Island and McDonald Islands | None |
Clipperton Island | None |
French Southern and Antarctic Lands | None |
Bouvet Island | None |
Jan Mayen | None |
British Indian Ocean Territory | None |
South Georgia and South Sandwich Islands | None |
Navassa Island | None |
Wake Island | None |
United States Pacific Island Wildlife Refuges | None |
Paracel Islands | None |
Spratly Islands | None |
Arctic Ocean | None |
Atlantic Ocean | None |
Indian Ocean | None |
Pacific Ocean | None |
Southern Ocean | None |
Antarctica | 0 |
Pitcairn Islands | 48 |
The data include several rows with "None" for population. These include oceans, which make sense as no one is a citizen of the ocean, and several island chains. According to the CIA Factbook pages for several of these islands, they are all uninhabited or only house temporary research personnel.
There is also one row with zero population, which is Antartica. the CIA Factbook page for Antartica shows a similar status to the islands:
53 countries have signed the 1959 Antarctic Treaty; 30 of those operate through their National Antarctic Program a number of seasonal-only (summer) and year-round research stations on the continent and its nearby islands south of 60 degrees south latitude (the region covered by the Antarctic Treaty); the population engaging in and supporting science or managing and protecting the Antarctic region varies from approximately 4,400 in summer to 1,100 in winter; in addition, approximately 1,000 personnel, including ship's crew and scientists doing onboard research, are present in the waters of the treaty region.
Now that we know about the rows with the world and zero populations,we can exclude all of these rows to find the actual countries with the highest and lowest populations.
%%sql
SELECT name, max(population) as max_population
FROM facts
WHERE name != "World"; --
Done.
name | max_population |
---|---|
China | 1367485388 |
%%sql
SELECT name, min(population)
FROM facts
WHERE name != 'Antarctica'; -- min() ill ignore "None" by default
Done.
name | min(population) |
---|---|
Pitcairn Islands | 48 |
%%sql
SELECT name, max(population_growth) as max_population_growth
FROM facts
WHERE name != "World" AND name != "Antarctica";
Done.
name | max_population_growth |
---|---|
South Sudan | 4.02 |
%%sql
SELECT name, min(population_growth) as min_population_growth
FROM facts
WHERE name != "World" AND name != "Antarctica";
Done.
name | min_population_growth |
---|---|
Holy See (Vatican City) | 0.0 |
These numbers look much better, but we still have at least one country with zero growth. Let's take a deeper look at that.
%%sql
SELECT *
FROM facts
WHERE population_growth = 0
Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
190 | vt | Holy See (Vatican City) | 0 | 0 | 0 | 842 | 0.0 | None | None | None |
200 | ck | Cocos (Keeling) Islands | 14 | 14 | 0 | 596 | 0.0 | None | None | None |
207 | gl | Greenland | 2166086 | 2166086 | None | 57733 | 0.0 | 14.48 | 8.49 | 5.98 |
238 | pc | Pitcairn Islands | 47 | 47 | 0 | 48 | 0.0 | None | None | None |
We have 2 small islands with very small populations. It makes sense that the populations would not fluctuate much. The third "country" is Vatican city. As it's controlled by the Catholic Church, they can maintain the population so it would not necessarily be prone to normal population influences. The final country is Greenland. According to the Factbook, the birth rate is approximately 13.79 per 1000 people. This is balanced by the death rate of 8.96 per 1000 and immigration rate of -5.07 (14.03 total) for a net of -.24 people per 1000. It seems unlikely that the population is exactly the same from year to year, but accounting for polling methods, statistical errors and rounding, the number isn't entirely unreasonable.
We now turn our analysis to the average populations and land area of Earth's countries. We don't want to include Antartica or the whole world in this either, so we'll exclude them again.
%%sql
SELECT ROUND(AVG(population),2) AS avg_pop,
ROUND(AVG(area_land),2) AS avg_area_land
FROM facts
WHERE population != (SELECT --exclude row with world population
MAX(population)
FROM facts)
AND
population != (SELECT --Exclude Zero population (Antartica) and Null by default
MIN(population)
FROM facts)
Done.
avg_pop | avg_area_land |
---|---|
32377011.01 | 550643.76 |
Our final analysis step will be to find the most densely populated countries in the world. To do this, we'll look for countries with populations higher than the average and land areas lower than the average.
%%sql
SELECT name, population, area_land, population/area_land AS population_density
FROM facts
WHERE population > (SELECT AVG(population)
FROM facts)
AND
area_land < (SELECT AVG(area_land)
FROM facts)
ORDER BY population_density
DESC
Done.
name | population | area_land | population_density |
---|---|---|---|
Bangladesh | 168957745 | 130170 | 1297 |
Japan | 126919659 | 364485 | 348 |
Philippines | 100998376 | 298170 | 338 |
Vietnam | 94348835 | 310070 | 304 |
United Kingdom | 64088222 | 241930 | 264 |
Germany | 80854408 | 348672 | 231 |
Thailand | 67976405 | 510890 | 133 |
The drawback to this method is that it might exclude countries where the population is lower than average but has an extremely small land area or the land area is above average and the population is still very high. A quick search online for population densities shows that Bangladesh actually ranks around 10th in the world. Let's see what happens when we don't limit our analysis based on average population and area.
%%sql
SELECT name, population, area_land, population/area_land AS population_density
FROM facts
ORDER BY population_density
DESC
LIMIT 20
Done.
name | population | area_land | population_density |
---|---|---|---|
Macau | 592731 | 28 | 21168 |
Monaco | 30535 | 2 | 15267 |
Singapore | 5674472 | 687 | 8259 |
Hong Kong | 7141106 | 1073 | 6655 |
Gaza Strip | 1869055 | 360 | 5191 |
Gibraltar | 29258 | 6 | 4876 |
Bahrain | 1346613 | 760 | 1771 |
Maldives | 393253 | 298 | 1319 |
Malta | 413965 | 316 | 1310 |
Bermuda | 70196 | 54 | 1299 |
Bangladesh | 168957745 | 130170 | 1297 |
Sint Maarten | 39689 | 34 | 1167 |
Guernsey | 66080 | 78 | 847 |
Jersey | 97294 | 116 | 838 |
Taiwan | 23415126 | 32260 | 725 |
Barbados | 290604 | 430 | 675 |
Mauritius | 1339827 | 2030 | 660 |
Aruba | 112162 | 180 | 623 |
Lebanon | 6184701 | 10230 | 604 |
Saint Martin | 31754 | 54 | 588 |
Now Bangladesh, which was our highest density country is actually 11th. This lines up with the results of our search. Now Macau appears to be the country with the highest population at 21168 people per square kilometer, which also matches our online findings.
In this analysis, we reviewed some basic statistics about the countries of the world based on the CIA World Factbook. Due to some quirks of the data such as "countries" with zero population and an entry for the entire world, we had to mainipulate our queries a bit, but in the end we were able to answer all of the questions about the countries having the highest and lowest populations and population densities.
With 48 occupants, the Pitcairn Islands came in as the country with the lowest population. Unsurprizingly, the country with the highest population is China at 1,367,485,388 people. Turning to population growth, we actually found that Greenland has approximately zero growth and dug a little deeper into that to verify such an unexpected result. Showing the highest growth rate was South Sudan 4.02 people per 1000.
For our last question, we wanted to find countries that had the highest population. We looked at this in 2 ways. First, we found the average population and land areas (still excluding the rows with the world and zero populations). The average population came out to 32,377,011 people and the average land area is 550643.76 sqare km. looking at countries with above average populations and below average land areas, Bangladesh appeared to be the highest density country with 1297 people per square kilometer. This methid turned out to be flawed, however, as it exclueded countries with below average populations but extremely small land area and countries with larger than average land area and very high populations. Removing the restrictions we found that the highest density is actually Macau with 21168 people per square kilometer.