Analyzing CIA Factbook Data Using SQLite and Python

In this Project we're working with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. You can download the SQLite database, factbook.db, from this GitHub repo.

The Factbook contains demographic information like:

name - The name of the country.
area - The total land and water area.
area_land - The country's land area in square kilometers.
area_water - The country's waterarea in square kilometers.
population - The population as of 2015.
population_growth - The annual population growth rate, as a percentage.
birth_rate - The country's birth rate, or the number of births a year per 1,000 people.
death_rate - The country's death rate, or the number of death a year per 1,000 people.

Aim

Here, we'll explore the Python SQLite workflow to explore, analyze, and visualize data from this database

Overview

  • Importing pandas and sqlite3.
  • Connecting to factbook.db and use pandas.read_sql_query() to return information on the tables in the database.
  • We will run another query to return first 5 rows of the facts table in the database.
In [5]:
import sqlite3
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

conn = sqlite3.connect('factbook.db')
query = "SELECT * FROM sqlite_master WHERE type='table';"
pd.read_sql_query(query, conn)
Out[5]:
type name tbl_name rootpage sql
0 table facts facts 2 CREATE TABLE "facts" ("id" INTEGER PRIMARY KEY...
1 table sqlite_sequence sqlite_sequence 3 CREATE TABLE sqlite_sequence(name,seq)
In [6]:
query1 = "SELECT * FROM facts LIMIT 5"
pd.read_sql_query(query1, conn)
Out[6]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate created_at updated_at
0 1 af Afghanistan 652230 652230 0 32564342 2.32 38.57 13.89 1.51 2015-11-01 13:19:49.461734 2015-11-01 13:19:49.461734
1 2 al Albania 28748 27398 1350 3029278 0.30 12.92 6.58 3.30 2015-11-01 13:19:54.431082 2015-11-01 13:19:54.431082
2 3 ag Algeria 2381741 2381741 0 39542166 1.84 23.67 4.31 0.92 2015-11-01 13:19:59.961286 2015-11-01 13:19:59.961286
3 4 an Andorra 468 468 0 85580 0.12 8.13 6.96 0.00 2015-11-01 13:20:03.659945 2015-11-01 13:20:03.659945
4 5 ao Angola 1246700 1246700 0 19625353 2.78 38.78 11.49 0.46 2015-11-01 13:20:08.625072 2015-11-01 13:20:08.625072

Summary Statistics

We will write a single query that returns the:

- minimum population
- maximum population
- minimum population growth
- maximum population growth
In [7]:
query2 = '''SELECT MIN(population) 'min_pop', 
MAX(population) 'max_pop', 
MIN(population_growth) 'min_pop_growth', 
MAX(population_growth) 'max_pop_growth' 
FROM facts'''
pd.read_sql_query(query2, conn)
Out[7]:
min_pop max_pop min_pop_growth max_pop_growth
0 0 7256490011 0.0 4.02

Exploring Outliers

  • We will write a query that return the countries with a population of 0.
  • We will write a query that return the countries with a population of 7256490011.
In [8]:
query3 = '''SELECT * FROM facts WHERE population == 0 or population == 7256490011'''
pd.read_sql_query(query3, conn)
Out[8]:
id code name area area_land area_water population population_growth birth_rate death_rate migration_rate created_at updated_at
0 250 ay Antarctica None 280000.0 None 0 NaN NaN NaN None 2015-11-01 13:38:44.885746 2015-11-01 13:38:44.885746
1 261 xx World None NaN None 7256490011 1.08 18.6 7.8 None 2015-11-01 13:39:09.910721 2015-11-01 13:39:09.910721

Here, we can observe that country listed with zero population is Antartica, since there is nobody living there.
And, the country listed with 7.2 Billion people is actually the world itself where whole world's population gets counted in just a single country.

Histograms

  • Using just the non-outlier rows, we will generate a 2 by 2 grid of histograms for the following columns:

    • population
    • population_growth
    • birth_rate
    • death_rate
In [9]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

fig = plt.figure(figsize =(10, 5))
ax = fig.add_subplot(1, 1, 1)

query4 = '''SELECT population, population_growth, 
birth_rate, death_rate 
FROM facts
WHERE population != 0 and population != 7256490011;'''

pd.read_sql_query(query4, conn).hist(ax = ax)
plt.show()

Further Exploration

Which countries have the highest population density?

In [10]:
query5 = '''SELECT name, cast(population as float)/cast(area_land as float) 'population_density' 
FROM facts ORDER BY population_density desc'''

pd.read_sql_query(query5, conn)
Out[10]:
name population_density
0 Macau 21168.964286
1 Monaco 15267.500000
2 Singapore 8259.784571
3 Hong Kong 6655.271202
4 Gaza Strip 5191.819444
5 Gibraltar 4876.333333
6 Bahrain 1771.859211
7 Maldives 1319.640940
8 Malta 1310.015823
9 Bermuda 1299.925926
10 Bangladesh 1297.977606
11 Sint Maarten 1167.323529
12 Guernsey 847.179487
13 Jersey 838.741379
14 Taiwan 725.825356
15 Barbados 675.823256
16 Mauritius 660.013300
17 Aruba 623.122222
18 Lebanon 604.565103
19 Saint Martin 588.037037
20 San Marino 541.311475
21 Rwanda 513.285755
22 Korea, South 506.760173
23 Netherlands 500.041424
24 West Bank 493.859220
25 Nauru 454.285714
26 India 420.993721
27 Burundi 418.312928
28 Tuvalu 418.038462
29 Puerto Rico 405.677227
... ... ...
231 Greenland 0.026653
232 Antarctica 0.000000
233 Ethiopia NaN
234 South Sudan NaN
235 Sudan NaN
236 Holy See (Vatican City) NaN
237 European Union NaN
238 Ashmore and Cartier Islands NaN
239 Coral Sea Islands NaN
240 Heard Island and McDonald Islands NaN
241 Clipperton Island NaN
242 French Southern and Antarctic Lands NaN
243 Saint Barthelemy NaN
244 Bouvet Island NaN
245 Jan Mayen NaN
246 Akrotiri NaN
247 British Indian Ocean Territory NaN
248 Dhekelia NaN
249 South Georgia and South Sandwich Islands NaN
250 Navassa Island NaN
251 Wake Island NaN
252 United States Pacific Island Wildlife Refuges NaN
253 Paracel Islands NaN
254 Spratly Islands NaN
255 Arctic Ocean NaN
256 Atlantic Ocean NaN
257 Indian Ocean NaN
258 Pacific Ocean NaN
259 Southern Ocean NaN
260 World NaN

261 rows × 2 columns

Histogram of population densities.

In [11]:
fig1 = plt.figure(figsize=(10, 5))
ax1 = fig1.add_subplot(1, 1, 1)
pd.read_sql_query(query5, conn).hist(ax = ax1)
plt.show()

Which countries have the highest ratios of water to land?

In [12]:
query6 = '''SELECT name, cast(area_water as float)/cast(area_land as float) 
'water_land_ratio' FROM facts ORDER BY water_land_ratio desc'''
pd.read_sql_query(query6, conn)
Out[12]:
name water_land_ratio
0 British Indian Ocean Territory 905.666667
1 Virgin Islands 4.520231
2 Puerto Rico 0.554791
3 Bahamas, The 0.386613
4 Guinea-Bissau 0.284673
5 Malawi 0.259396
6 Netherlands 0.225710
7 Uganda 0.222922
8 Eritrea 0.164356
9 Liberia 0.156240
10 Bangladesh 0.140509
11 Gambia, The 0.116601
12 Taiwan 0.115313
13 Finland 0.112996
14 India 0.105634
15 Canada 0.098000
16 Sweden 0.097384
17 Colombia 0.096476
18 Brunei 0.094967
19 Guyana 0.092050
20 French Polynesia 0.088842
21 Nicaragua 0.086507
22 Burundi 0.083723
23 Iran 0.076130
24 United States 0.072551
25 Tanzania 0.069429
26 Vietnam 0.068178
27 Rwanda 0.067699
28 Estonia 0.067000
29 Norway 0.064151
... ... ...
231 Turks and Caicos Islands 0.000000
232 American Samoa 0.000000
233 Guam 0.000000
234 Navassa Island 0.000000
235 Northern Mariana Islands 0.000000
236 Wake Island 0.000000
237 Gaza Strip 0.000000
238 Paracel Islands 0.000000
239 Spratly Islands 0.000000
240 Western Sahara 0.000000
241 Ethiopia NaN
242 New Zealand NaN
243 South Sudan NaN
244 Sudan NaN
245 Holy See (Vatican City) NaN
246 European Union NaN
247 Greenland NaN
248 French Southern and Antarctic Lands NaN
249 Saint Barthelemy NaN
250 Saint Martin NaN
251 Akrotiri NaN
252 Dhekelia NaN
253 United States Pacific Island Wildlife Refuges NaN
254 Antarctica NaN
255 Arctic Ocean NaN
256 Atlantic Ocean NaN
257 Indian Ocean NaN
258 Pacific Ocean NaN
259 Southern Ocean NaN
260 World NaN

261 rows × 2 columns

Which countries have more water than land?

In [13]:
query7 = '''SELECT name, cast(area_water as float) - cast(area_land as float) 
'diff' FROM facts 
WHERE diff > 0
ORDER BY diff desc'''
pd.read_sql_query(query7, conn)
Out[13]:
name diff
0 British Indian Ocean Territory 54280.0
1 Virgin Islands 1218.0

British Indian Ocean Territory and Virgin Islands are the only two countries with more water area than land.

In [ ]: