!conda install -yc conda-forge ipython-sql
Solving environment: ...working... done # All requested packages already installed.
==> WARNING: A newer version of conda exists. <== current version: 4.5.11 latest version: 4.8.5 Please update conda by running $ conda update -n base -c defaults conda
Learn
At Dataquest, we strongly advocate project-based learning. For this reason, we created a large number of guided projects. As you work on guided projects, you can either:
Jupyter Notebook — often called just Jupyter — is a popular tool that data professionals use to:
This combination of code and narrative text is a key element in almost any data science workflow. Jupyter makes it easy to combine these three elements, and this is what makes it a must-have tool for any data scientist.
The Jupyter interface on the right of the screen has a few sections, including:
To type and run code, we:
To work on this project in your own computer (if you want to), you should install Jupyter. To do this we recommend you install Anaconda — the self-described world's most popular data science platform.
The default installation of Anaconda installs Jupyter in addition to a plethora of other useful tools. To install it, follow the instructions for your operating system here.
You'll also need ipython-sql
which you can install by starting Jupyter and in a code cell running the following code:
You'll only have to do this once.
Guided projects do not have answer checking. You shouldn't feel bound to follow our instructions, in fact we highly encourage you to explore; we're just giving some guidance!
Learn
In this project, we'll work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like:
population
- The population as of 2015
.population_growth
- The annual population growth rate, as a percentage.area
- The total land and water area.In this guided project, we'll use SQL in Jupyter Notebook to explore and analyze data from this database. If you want to work on this project in your computer, you can download the SQLite factbook.db
database by clicking the Download button on the right.
We'll use the following code to connect our Jupyter Notebook to our database file:
%%capture
%load_ext sql
%sql sqlite:///factbook.db
You should add this code block as the first cell in your notebook.
You can find the solutions notebook for this guided project here or by clicking the key button on the right:
Let's get started!
Instructions
You should see the output: 'Connected: None@factbook.db
'
%%capture
%load_ext sql
%sql sqlite:///factbook.db
Learn
Throughout this course, we always introduced the name of the table and what the table looked like in the jobs.db
database we worked with. We can actually query the database to get this information directly:
SELECT *
FROM sqlite_master
WHERE type='table';
This will return the following output:
To run SQL queries in this project we add %sql
on its own line to the start of our query. So to execute the query above, we'll use this code:
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
Let's get started writing queries on our database!
Instructions
facts
table in the database.%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
* sqlite:///factbook.db Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
%%sql
SELECT *
FROM facts
LIMIT 5;
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
Learn
Here are the descriptions for some of the columns:
name
- The name of the country.area
- The country's total area (both land and water).area_land
- The country's land area in square kilometers.area_water
- The country's water area in square kilometers.population
- The country's population.population_growth
- The country's population growth as a percentage.birth_rate
- The country's birth rate, or the number of births a year per 1,000 people.death_rate
- The country's death rate, or the number of death a year per 1,000 people.Let's start by calculating some summary statistics and look for any outlier countries.
Instructions
%%sql
SELECT
MIN(population), MAX(population), MIN(population_growth), MAX(population_growth)
FROM facts;
* sqlite:///factbook.db Done.
MIN(population) | MAX(population) | MIN(population_growth) | MAX(population_growth) |
---|---|---|---|
0 | 7256490011 | 0.0 | 4.02 |
Learn
A few things stick out from the summary statistics in the last screen:
0
7256490011
(or more than 7.2 billion people)Let's use subqueries to zoom in on just these countries without using the specific values.
Instructions
%%sql
SELECT name, MIN(population)
FROM facts;
* sqlite:///factbook.db Done.
name | MIN(population) |
---|---|
Antarctica | 0 |
%%sql
SELECT name, MAX(population)
FROM facts;
* sqlite:///factbook.db Done.
name | MAX(population) |
---|---|
World | 7256490011 |
Learn
It seems like the table contains a row for the whole world, which explains the population of over 7.2 billion. It also seems like the table contains a row for Antarctica, which explains the population of 0. This seems to match the CIA Factbook page for Antarctica:
Now that we know this, we should recalculate the summary statistics we calculated earlier, while excluding the row for the whole world.
Instructions
%%sql
SELECT
MIN(population), MAX(Population), MIN(population_growth), MAX(population_growth)
FROM facts
WHERE name <> 'World';
* sqlite:///factbook.db Done.
MIN(population) | MAX(Population) | MIN(population_growth) | MAX(population_growth) |
---|---|---|---|
0 | 1367485388 | 0.0 | 4.02 |
%%sql
SELECT
AVG(population), AVG(area)
FROM facts
WHERE name <> 'World';
* sqlite:///factbook.db Done.
AVG(population) | AVG(area) |
---|---|
32242666.56846473 | 555093.546184739 |
Learn
To finish, we'll build on the query we wrote for the previous screen to find countries that are densely populated. We'll identify countries that have: + Above average values for population. + Below average values for area.
It's possible to write this query by copying the values you calculated on the previous screen, but try to find a way to do this without hard coding them!
Instructions
%%sql
SELECT name, population, area
FROM facts
WHERE population > (SELECT AVG(population) FROM facts)
AND area < (SELECT AVG(area) FROM facts);
* sqlite:///factbook.db Done.
name | population | area |
---|---|---|
Bangladesh | 168957745 | 148460 |
Germany | 80854408 | 357022 |
Japan | 126919659 | 377915 |
Philippines | 100998376 | 300000 |
Thailand | 67976405 | 513120 |
United Kingdom | 64088222 | 243610 |
Vietnam | 94348835 | 331210 |
That's it for the guided steps. Here are some next steps for you to explore:
population/area
ratio and how does it compare to list we found in the previous screen?Curious to see what other students have done on this project? Head over to our Community to check them out. While you are there, please remember to show some love and give your own feedback!
And of course, we welcome you to share your own project and show off your hard work. Head over to our Community to share your finished Guided Project!
1. What country has the most people?
%%sql
SELECT name AS Country, printf("%,d",MAX(population)) AS Population
FROM facts
WHERE name <> 'World';
* sqlite:///factbook.db Done.
Country | Population |
---|---|
China | 1,367,485,388 |
,
2. What country has the highest growth rate?
%%sql
SELECT name AS Country, round(MAX(population_growth)/100,5) AS PopulationGrowthRate
FROM facts
WHERE name <> 'World';
* sqlite:///factbook.db Done.
Country | PopulationGrowthRate |
---|---|
South Sudan | 0.0402 |
,
3. Which countries have the highest ratios of water to land?
%%sql
SELECT name AS Country, (area_water/area_land) AS Water_Land_Ratio
FROM facts
WHERE name <> 'World'
AND Water_Land_Ratio <> 'None'
AND Water_Land_Ratio > 0;
* sqlite:///factbook.db Done.
Country | Water_Land_Ratio |
---|---|
British Indian Ocean Territory | 905 |
Virgin Islands | 4 |
,
,
4.Which countries will add the most people to their population next year?
%%sql
SELECT name AS Country,
printf("%,d",population) AS population,
round(population_growth/100,5) AS population_growth,
CAST(population * population_growth/100 AS Integer) AS people_growth
FROM facts
WHERE name <> 'World'
ORDER BY people_growth DESC
LIMIT 3;
* sqlite:///factbook.db Done.
Country | population | population_growth | people_growth |
---|---|---|---|
India | 1,251,695,584 | 0.0122 | 15270686 |
China | 1,367,485,388 | 0.0045 | 6153684 |
Nigeria | 181,562,056 | 0.0245 | 4448270 |
, , ,
5. Which countries have a higher death rate than birth rate?
%%sql
SELECT name AS country,
birth_rate,
death_rate,
round((death_rate - birth_rate),2) AS rate_diff
FROM facts
WHERE death_rate > birth_rate
ORDER BY rate_diff DESC
LIMIT 3;
* sqlite:///factbook.db Done.
country | birth_rate | death_rate | rate_diff |
---|---|---|---|
Bulgaria | 8.92 | 14.44 | 5.52 |
Serbia | 9.08 | 13.66 | 4.58 |
Latvia | 10.0 | 14.31 | 4.31 |
,
,
,
6. What countries have the highest population/area
ratio and how does it compare to list we found in the previous screen?
%%sql
SELECT name,
printf("%,d",population) AS population,
printf("%,d",area) AS area,
population/area AS pop_area_ratio
FROM facts
ORDER BY pop_area_ratio DESC
LIMIT 3;
* sqlite:///factbook.db Done.
name | population | area | pop_area_ratio |
---|---|---|---|
Macau | 592,731 | 28 | 21168 |
Monaco | 30,535 | 2 | 15267 |
Singapore | 5,674,472 | 697 | 8141 |
, , , , ,