In this project, we will be using SQL to analyze data contained in CIA World Factbook database. This database contains information about all of the countries on Earth, such as:
id
- a unique numeric id for each countrycode
- a unique two-letter code for each countryname
- the name of each countrypopulation
- the global population.population growth
- the annual population growth rate, as a percentage.area
- the total land and water area in square kilometersbirth_rate
- the birth rate per 1,000 peopledeath_rate
- the death rate per 1,000 peoplemigration_rate
- the migration rate for each countryBefore we begin, we'll have to connect to our database using the following code:
%%capture
%load_ext sql
%sql sqlite:///factbook.db
Now we can begin to use SQL queries to peruse and explore our data! Let's see what we can find!
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
* sqlite:///factbook.db Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
Since we want to explore the information in the 'facts' database, we'll need to open that up:
%%sql
SELECT *
FROM facts
LIMIT 5;
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
Above we can see our database with all of the relevant information in an easy-to-read table. Now we can begin calculating some summary statistics and look for any outliers in our data.
SELECT MIN(population) as min_pop,
MAX(population) as max_pop,
MIN(population_growth) as min_growth,
MAX(population_growth) as max_growth
FROM facts;
File "<tokenize>", line 5 FROM facts; ^ IndentationError: unindent does not match any outer indentation level
We can see that our minimum population is 0, and our maximum population is 7.2+ BILLION. This is obviously not correct, as we're looking for information on individual countries, and this seems to be data relating to the entire world. Let's see if we can narrow our data down to find the culprit.
%%sql
SELECT *
FROM facts
WHERE population = (SELECT MIN(population)
FROM facts);
Here we can see that the entry for Antarctica reports a population of 0, which is why our previous query showed 0 as the minimum population. According to the CIA Factbook page for Antarctica, there are no indigenous inhabitants, only seasonal researchers who are staffed there are certain times of the year. Let's do the same query for the country that is reporting the maximum population:
%%sql
SELECT *
FROM facts
WHERE population = (SELECT MAX(population)
FROM facts);
Interestingly, there is a separate entry in our database for the entire World, which is why it lists the population for the entire planet for its population size. Let's re-run our previous query, but this time exclude the row including the world data. We can do this a number of ways, but I will simply query the population whose id != 261
(261 is the unique id for World):
%%sql
SELECT MIN(population) as min_pop,
MAX(population) as max_pop,
MIN(population_growth) as min_growth,
MAX(population_growth) as max_growth
FROM facts
WHERE id != 261;
Above we can see that we now have a more reasonable maximum population of ~1.36 billion people. This is most likely China, as it is one of the largest countries in the world. We can go ahead and verify that with another simple query:
%%sql
SELECT *
FROM facts
WHERE population = (SELECT MAX(population)
FROM facts
WHERE id != 261);
Now that we've figured out the outliers, we can move ahead and calculate the average population and area in our database.
%%sql
SELECT ROUND(AVG(population),3) as avg_pop,
ROUND(AVG(area),3) as avg_area
FROM facts
WHERE id != 261;
So we can see that the average population for a given country is ~32 million people, and the average area of a given country is ~555,000 sq. kilometers. Let's build upon this query one more time to find countries that meet both of the following criteria:
population
is above average (more than ~32 million).area
is below average (less than ~555,000).%%sql
SELECT id, code, name, population, area, ROUND(population/area,2) as pop_area_ratio
FROM facts
WHERE population > (SELECT AVG(population)
FROM facts
WHERE id != 261)
AND area < (SELECT AVG(area)
FROM facts
WHERE id != 261);
We can see from the above results that there are 14 countries that fit both of those criteria.
Let's go ahead and try to answer the following questions using our database:
%%sql
SELECT *
FROM facts
WHERE population = (SELECT MAX(population)
FROM facts
WHERE id != 261)
AND id != 261;
%%sql
SELECT *
FROM facts
WHERE population_growth = (SELECT MAX(population_growth)
FROM facts
WHERE id != 261)
AND id != 261;
We can see that China is the country with the highest population at ~1.37 billion people. South Sudan is the country with the highest population growth at a rate of 4.02%.
%%sql
SELECT name, area_land, area_water,
ROUND(CAST(area_water as float) / CAST(area_land as float), 2) as area_ratio
FROM facts
WHERE id != 261
ORDER BY area_ratio DESC
LIMIT 10;
%%sql
SELECT name, area_land, area_water
FROM facts
WHERE area_water > area_land
AND id != 261
ORDER BY area_water DESC
We can clearly see an outlier in the British Indian Ocean Territory, with an area_ratio of 900+, where the next highest ratio is the Virgin Islands at 4.52. This checks out because the British Indian Ocean Territory is an archipelago, so it's naturally surrounded by tons of ocean. Also, the two countries with the highest water-to-land ratios are also the only two countries who have more water than land.
For this, we take the current population and multiply it by the population_growth
(divided by 100 because its a percentage). The result will be the number of people that country will add to their population the following year.
%%sql
SELECT name, population AS current_pop, population_growth,
ROUND(population*(population_growth/100), 2) AS Population_growth_next_year
FROM facts
WHERE id != 261
ORDER BY Population_growth_next_year DESC
LIMIT 20;
According to our query, India will be adding the most people to its population - by about ~15.2 million people! That's more than twice the amount of the next highest country, China with an estimated population growth of ~6.2 million.
%%sql
SELECT name, birth_rate, population_growth
FROM facts
WHERE id != 261
ORDER BY birth_rate DESC
LIMIT 20;
Here we can see that Niger is the country with the highest birth rate. The top 20 countries with the highest birth rates are all African countries.
Which countries have a death rate that is higher than their birth rate?
%%sql
SELECT name, death_rate, birth_rate, ROUND(death_rate / birth_rate, 2) as death_birth_ratio
FROM facts
WHERE death_rate > birth_rate
ORDER BY death_rate DESC
LIMIT 20;
Here we can see that Ukraine has the highest death rate in the world, at 14.46 with a death/birth rate ratio of 1.35. Bulgaria has the highest death/birth rate ratio at 1.62.
%%sql
SELECT name, population, area, ROUND(population/area, 2) as pop_area_ratio
FROM facts
ORDER BY pop_area_ratio DESC
LIMIT 20;
When we calculated countries that had higher-than-average populations with lower-than-average area, our top results was Bangladesh, with a population-to-area ratio of 1138. If we remove the population and area criteria, we can see that there are 11 more countries with a much higher population-to-area ratio, with Macau being on top with a ratio of 21168. The top three countries in this list are all incredibly small countries, with insanely dense populations, thus contributing to their high ratios.
We have exemplified the use of SQL to quickly explore, aggregate, and summarize a large amount of data using very little code. We've shown just how powerful a tool SQL can be, especially when it comes to speed and readability of complex queries.