The goal of our work - analyze data with SQL query from CIA World Factbook. In our work we use dataset from Kaggle in the format sqllite database file. After downloading database for local HDD connect to it.
%%capture
%load_ext sql
%sql sqlite:///factbook.db
View general information about database using this query
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
* sqlite:///factbook.db Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
As we see from output of SQL query the database "factbook.dd" contains two tables:
also in the column "sql" we see SQL script for creating tables and we can see name and type of data of columns of the table "facts"
View first five rows and all columns with this query.
%%sql
SELECT
*
FROM facts
LIMIT 5;
* sqlite:///factbook.db Done.
id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate |
---|---|---|---|---|---|---|---|---|---|---|
1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 |
2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 |
3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 |
4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 |
5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 |
%%sql
SELECT *
FROM sqlite_master
WHERE type='table';
* sqlite:///factbook.db Done.
type | name | tbl_name | rootpage | sql |
---|---|---|---|---|
table | sqlite_sequence | sqlite_sequence | 3 | CREATE TABLE sqlite_sequence(name,seq) |
table | facts | facts | 47 | CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, "code" varchar(255) NOT NULL, "name" varchar(255) NOT NULL, "area" integer, "area_land" integer, "area_water" integer, "population" integer, "population_growth" float, "birth_rate" float, "death_rate" float, "migration_rate" float) |
Extract MIN and Max for total population of countries.
%%sql
SELECT
name, population AS "Max & Min population"
FROM
facts
WHERE
population = (SELECT MAX(population) FROM facts)
OR
population = (SELECT MIN(population) FROM facts)
ORDER BY "Max & Min population" DESC ;
* sqlite:///factbook.db Done.
name | Max & Min population |
---|---|
World | 7256490011 |
Antarctica | 0 |
Extract MAX and MIN values for population growth
%%sql
SELECT
name, population_growth as "Max & Min populatin growth"
FROM
facts
WHERE
population_growth = (SELECT MAX (population_growth) FROM facts)
OR
population_growth = (SELECT MIN (population_growth) FROM facts)
ORDER BY "Max & Min populatin growth" DESC;
* sqlite:///factbook.db Done.
name | Max & Min populatin growth |
---|---|
South Sudan | 4.02 |
Holy See (Vatican City) | 0.0 |
Cocos (Keeling) Islands | 0.0 |
Greenland | 0.0 |
Pitcairn Islands | 0.0 |
Extract MIN and Max for area of countries.
%%sql
SELECT
name, area as "Max & Min area"
FROM
facts
WHERE
area = (SELECT MAX (area) FROM facts)
OR
area = (SELECT MIN (area) FROM facts)
ORDER BY "Max & Min area" DESC;
* sqlite:///factbook.db Done.
name | Max & Min area |
---|---|
Russia | 17098242 |
Holy See (Vatican City) | 0 |
As we see from above for further correct analyze we have to except values "World" and "Antarctica" from name column because its aren't formally defined as sovereign states. Formally we have to except and "Holy See (Vatican City)" for population_growth columns because citizen of Vatican pulpiteer of celibacy.
Rewrite first query using Sqlite Common Table Expressions (CTE) for defining correct MAX and MIN values for population of countries.
%%sql
WITH cte_name_state_population (name, population) AS (
SELECT
name, population
FROM
facts
WHERE
name <> "World" AND name <> "Antarctica"
)
SELECT
name, population AS "Max & Min population"
FROM
cte_name_state_population
WHERE
population = (SELECT MAX(population) FROM cte_name_state_population)
OR
population = (SELECT MIN(population) FROM cte_name_state_population)
ORDER BY "Max & Min population" DESC ;
* sqlite:///factbook.db Done.
name | Max & Min population |
---|---|
China | 1367485388 |
Pitcairn Islands | 48 |
Rewrite second query for getting real MAX and MIN growth values for countries
%%sql
WITH cte_name_state_pop_growth (name, population_growth) AS (
SELECT
name, population_growth
FROM
facts
WHERE
name <> "World" AND name <> "Antarctica" AND name <> "Holy See (Vatican City)"
)
SELECT
name, population_growth as "Max & Min populatin growth"
FROM
cte_name_state_pop_growth
WHERE
population_growth = (SELECT MAX (population_growth) FROM cte_name_state_pop_growth)
OR
population_growth = (SELECT MIN (population_growth) FROM cte_name_state_pop_growth)
ORDER BY "Max & Min populatin growth" DESC;
* sqlite:///factbook.db Done.
name | Max & Min populatin growth |
---|---|
South Sudan | 4.02 |
Cocos (Keeling) Islands | 0.0 |
Greenland | 0.0 |
Pitcairn Islands | 0.0 |
Calculate number of countries, average for population and total population of countries.
%%sql
WITH cte_name_state_population (population) AS (
SELECT
population
FROM
facts
WHERE
name <> "World" AND name <> "Antarctica"
)
SELECT
COUNT() AS "Countries",
AVG(population) AS "Average population",
SUM(population) AS "Total population"
FROM
cte_name_state_population;
* sqlite:///factbook.db Done.
Countries | Average population | Total population |
---|---|---|
259 | 32377011.0125 | 7770482643 |
Calculate number of countries, average for population growth and total population grows of countries excluding Vatican with their celibate.
%%sql
WITH cte_name_state_pop_growth (name, population_growth) AS (
SELECT
name, population_growth
FROM
facts
WHERE
name <> "World" AND name <> "Antarctica" AND name <> "Holy See (Vatican City)"
)
SELECT
COUNT() AS "Countries",
AVG(population_growth) AS "Average population growth",
SUM(population_growth) AS "Total population growth"
FROM
cte_name_state_pop_growth;
* sqlite:///factbook.db Done.
Countries | Average population growth | Total population growth |
---|---|---|
258 | 1.2066239316239318 | 282.35 |
Calculate average values of area and densely for countries (exclude Vatican).
%%sql
WITH cte_name_state_area (name, area, population) AS (
SELECT
name, area, population
FROM
facts
WHERE
name <> "World" AND name <> "Antarctica" AND name <> "Holy See (Vatican City)"
)
SELECT
COUNT() AS "Countries",
AVG(area) AS "Average area",
SUM(area) AS "Total area",
SUM(population) AS "Total population",
CAST (population AS FLOAT) / CAST(area AS FLOAT) AS "Population Density"
FROM
cte_name_state_area;
* sqlite:///factbook.db Done.
Countries | Average area | Total area | Total population | Population Density |
---|---|---|---|---|
258 | 557331.8266129033 | 138218293 | 7770481801 | 49.92769728470018 |
Firs step, define more densely countries using this query. Us we now area of Vatican equal zero, because we except it from query.
%%sql
WITH cte_name_state_population_area (name, population, area) AS (
SELECT
name, population, area
FROM
facts
WHERE
name <> "World" AND name <> "Antarctica" AND name <> "Holy See (Vatican City)"
)
SELECT
name AS "Country",
CAST (population AS FLOAT) / CAST(area AS FLOAT) AS "Population Density",
population AS "Population",
area AS "Area"
FROM
cte_name_state_population_area
WHERE
population > (SELECT AVG(population) FROM cte_name_state_population_area)
AND
area < (SELECT AVG(area) FROM cte_name_state_population_area)
ORDER BY "Population Density" DESC;
* sqlite:///factbook.db Done.
Country | Population Density | Population | Area |
---|---|---|---|
Bangladesh | 1138.0691432035565 | 168957745 | 148460 |
Korea, South | 492.53104693140796 | 49115196 | 99720 |
Philippines | 336.6612533333333 | 100998376 | 300000 |
Japan | 335.8418136353413 | 126919659 | 377915 |
Vietnam | 284.8610700160019 | 94348835 | 331210 |
United Kingdom | 263.0771396904889 | 64088222 | 243610 |
Germany | 226.4689795026637 | 80854408 | 357022 |
Italy | 205.26687462666754 | 61855120 | 301340 |
Uganda | 153.92487906471177 | 37101745 | 241038 |
Thailand | 132.47662340193327 | 67976405 | 513120 |
Poland | 123.3259958104802 | 38562189 | 312685 |
Spain | 95.26907810119319 | 48146134 | 505370 |
Iraq | 84.5419388250969 | 37056169 | 438317 |
Morocco | 74.6225484268279 | 33322699 | 446550 |
Second step, define more densely countries using this query. Us we now area of Vatican equal zero, because we except it from query.
%%sql
WITH cte_name_state_population_area (name, population, area) AS (
SELECT
name, population, area
FROM
facts
WHERE
name <> "World" AND name <> "Antarctica" AND name <> "Holy See (Vatican City)"
)
SELECT
name AS "Country",
CAST (population AS FLOAT) / CAST(area AS FLOAT) AS "Population Density",
population AS "Population",
area AS "Area"
FROM
cte_name_state_population_area
WHERE
population < (SELECT AVG(population) FROM cte_name_state_population_area)
AND
area > (SELECT AVG(area) FROM cte_name_state_population_area)
ORDER BY "Population Density" DESC;
* sqlite:///factbook.db Done.
Country | Population Density | Population | Area |
---|---|---|---|
Madagascar | 40.56391461584455 | 23812681 | 587041 |
Venezuela | 32.09852530014802 | 29275460 | 912050 |
Mozambique | 31.65342265255573 | 25303113 | 799380 |
Peru | 23.688624324627146 | 30444999 | 1285216 |
Chile | 23.15594985861696 | 17508260 | 756102 |
Zambia | 20.01847683685482 | 15066266 | 752618 |
South Sudan | 18.690622337346294 | 12042910 | 644329 |
Somalia | 16.649044862676956 | 10616380 | 637657 |
Angola | 15.741840859870058 | 19625353 | 1246700 |
Mali | 13.671702446072866 | 16955536 | 1240192 |
Saudi Arabia | 12.909915383148268 | 27752316 | 2149690 |
Bolivia | 9.831666486130745 | 10800882 | 1098581 |
Central African Republic | 8.654377961552784 | 5391539 | 622984 |
Kazakhstan | 6.6634085654519435 | 18157122 | 2724900 |
Botswana | 3.752116961476974 | 2182719 | 581730 |
Libya | 3.644006956363595 | 6411776 | 1759540 |
Mauritania | 3.4895721354419327 | 3596702 | 1030700 |
Australia | 2.9389442490976876 | 22751014 | 7741220 |
Namibia | 2.6838875058838374 | 2212307 | 824292 |
Mongolia | 1.9134821202519505 | 2992908 | 1564116 |
Greenland | 0.026653143042335344 | 57733 | 2166086 |
In this works we know next facts about the world:
Unfortunately SQLlite have weakness with embedded statistical function like percentile_xx in Postgresql.
Created on Mar 20, 2021
@author: Vadim Maklakov, used some ideas from public Internet resources.
© 3-clause BSD License
Software environment: Debian 10.7, Python 3.8.7
required next preinstalled python modules: ipython-sql.