#!/usr/bin/env python # coding: utf-8 # # Analyzing CIA Factbook Data Using SQL # In this project we will use SQL to analyze some of the data in the [CIA World Factbook](https://www.cia.gov/library/publications/the-world-factbook/). The purpose is to show how data stored in a database can be analyzed with SQL. # # The CIA World Factbook provides various data points about 267 world entities. Below are some examples of these data points: # * Demographics # * Enonomy # * Transportation # * Energy # * Communications # * Government # * Military # * Geography # * History # # The database file in this project contains a selection of these data points. We will use SQL to answer some basic questions about the data. Most of the time the answers will be observed in cell output immediately following code cells and will not be spelled out into separate sentences to avoid wordiness and duplication. # ### Navigation # [(1) Connect to Database](#Connect-to-Database-Stored-Locally) # # [(2) List Tables](#List-Tables-Currently-In-The-Database) # # [(3) Inspect a Table](#Inspect-the-'facts'-Table) # # [(4) Calculate Summary Statistics (MIN, MAX, SUM, AVG)](#Calculate-Summary-Statistics-For-Select-Fields) # # [(5) Filter Data Dynamically by Using a Subquery](#View-Country-Names-With-Minimum-and-Maximum-Population) # # [(6) Calculate Summary Statistics With a Condition](#Calculate-Summary-Statistics-For-Select-Fields-With-Filtering) # # [(7) Answer Other Questions About Data](#Answer-Other-Questions-About-Data) # ### Connect to Database Stored Locally # In[1]: get_ipython().run_cell_magic('capture', '', '%load_ext sql\n%sql sqlite:///factbook.db\n') # [To Top](#Navigation) # ### List Tables Currently In The Database # In[2]: get_ipython().run_cell_magic('sql', '', "\nSELECT *\n FROM sqlite_master\n WHERE type = 'table';\n") # [To Top](#Navigation) # ### Inspect the 'facts' Table # In[3]: get_ipython().run_cell_magic('sql', '', '\nSELECT *\n FROM facts\n LIMIT 5;\n') # This preview looks clean. The column headers are styled properly per database conventions. # The columns present in the database are: # * `id` - unique identifier for each entry in the database # * `code` - country code # * `name` - country name # * `area` - total land and water area in square kilometers # * `area_land` - total land area in square kilometers # * `area_water` - total water area in square kilometers # * `population` - total population # * `population_growth`- annual population growth as a percentage # * `birth_rate` - number of people born per 1,000 people per year # * `death_rate` - number of deaths per 1,000 people per year # * `migration_rate` - migration rate. The DataQuest project instructions do not have a better description for this column. See more below. # We need to attempt to understand the `migration_rate` field more. Searching the CIA World Factbook site does not yield useful results because the database used for this project seems to have been selected by someone else who did not include the explanation for this field, or it was omitted by the author of this dataquest project. CIA has references to "net migration", which can be a negative or a potitive value. In this case, `migration_rate` does not appear to be the same as "net migration" because there are no negative values here (see cell below). # In[4]: get_ipython().run_cell_magic('sql', '', '\nSELECT name, migration_rate\n FROM facts\n WHERE migration_rate < 0;\n') # [To Top](#Navigation) # ### Calculate Summary Statistics For Select Fields # In[5]: get_ipython().run_cell_magic('sql', '', '\nSELECT MIN(population),\n MAX(population),\n MIN(population_growth),\n MAX(population_growth)\n FROM facts;\n') # Minimum population is 0, which is either an error or is an entity that is not an inhabited area. # Maximum population is over 7 billion, which appears to be the total population of the world. # Minimum and maximum population growth seem plausible. # [To Top](#Navigation) # ### View Country Names With Minimum and Maximum Population # In[6]: get_ipython().run_cell_magic('sql', '', '\nSELECT name, population\n FROM facts\n WHERE population == (SELECT MIN(population)\n FROM facts);\n') # In[7]: get_ipython().run_cell_magic('sql', '', '\nSELECT name, population\n FROM facts\n WHERE population == (SELECT MAX(population)\n FROM facts);\n') # This query confirms our suspicion that not all entities in the database are countries. Some are continents or the entire world. # [To Top](#Navigation) # ### Calculate Summary Statistics For Select Fields With Filtering # In[8]: get_ipython().run_cell_magic('sql', '', '\nSELECT MIN(population),\n MAX(population),\n MIN(population_growth),\n MAX(population_growth)\n FROM facts\n WHERE population <> (SELECT MAX(population) /* filter out maximum population */\n FROM facts); \n') # In[9]: get_ipython().run_cell_magic('sql', '', '\nSELECT ROUND(AVG(population), 0), ROUND(AVG(area), 0)\n FROM facts\n WHERE population <> (SELECT MAX(population) /* filter out maximum population */\n FROM facts);\n') # [To Top](#Navigation) # ### Answer Other Questions About Data # ##### See list of countries with above average population and below average area # In[10]: get_ipython().run_cell_magic('sql', '', '\nSELECT name\n FROM facts\n WHERE population > (SELECT AVG(population)\n FROM facts)\n AND area < (SELECT AVG(area)\n FROM facts);\n \n') # ##### List 5 top countries by population (descending) # In[11]: get_ipython().run_cell_magic('sql', '', "\nSELECT name, population, population_growth\n FROM facts\n WHERE name NOT IN ('World', 'European Union')\n ORDER BY population DESC\n LIMIT 5;\n") # ##### List 5 top countries by population growth rate (descending) # In[12]: get_ipython().run_cell_magic('sql', '', "\nSELECT name, population_growth\n FROM facts\n WHERE name NOT IN ('World', 'European Union')\n ORDER BY population_growth DESC\n LIMIT 5;\n") # ##### List top 5 countries with the highest ratio of water to land # In[13]: get_ipython().run_cell_magic('sql', '', "\nSELECT name, \n ROUND(CAST(area_water AS Float) / CAST(area_land AS Float), 2) AS water_to_land_ratio\n FROM facts\n WHERE name <> 'British Indian Ocean Territory'\n ORDER BY water_to_land_ratio DESC\n LIMIT 5; \n") # ##### List all countries with more water than land # In[14]: get_ipython().run_cell_magic('sql', '', "\nSELECT name, area_land, area_water\n FROM facts\n WHERE area_land < area_water\n AND name <> 'British Indian Ocean Territory';\n") # ##### List 5 countries that will add most people to their population next year # In[15]: get_ipython().run_cell_magic('sql', '', "\nSELECT name, CAST(ROUND(population * population_growth / 100, 0) AS Int) AS population_added\n FROM facts\n WHERE name <> 'World'\n ORDER BY population_added DESC\n LIMIT 5;\n") # It's worth noting that India will add more than 15 million people as compared to China's 6 million. # ##### Quantify countries with higher death rate than birth rate, then list them # In[16]: get_ipython().run_cell_magic('sql', '', '\nSELECT COUNT(*) AS number_of_countries_with_death_rate_higher_than_birth_rate\n FROM facts\n WHERE birth_rate - death_rate < 0;\n') # In[17]: get_ipython().run_cell_magic('sql', '', '\nSELECT name, birth_rate, death_rate, ROUND(birth_rate - death_rate, 2) AS difference\n FROM facts\n WHERE difference < 0\n ORDER BY difference;\n') # ##### List 5 counties with the highest population density # In[18]: get_ipython().run_cell_magic('sql', '', '\nSELECT name, population, area_land, population / area_land AS population_density\n FROM facts\n ORDER BY population_density DESC\n LIMIT 5;\n') # These results are different from above where we listed [countries with above average population and below average areas.](#See-list-of-countries-with-above-average-population-and-below-average-area) This is because the highest density countries tend to be very small in land area and never above average in total population. # [To Top](#Navigation) # ## Conclusion # This project demonstrated simple SQL usage to answer basic questions about a dataset containing mostly numeric data.