#!/usr/bin/env python # coding: utf-8 # ## Analyzing CIA Factbook Data Using SQL # In this project, we'll work with data from the CIA World Factbook, a compendium of statistics about all of the countries on Earth. The Factbook contains demographic information like: # - population - The population as of 2015. # - population_growth - The annual population growth rate, as a percentage. # - area - The total land and water area. # # In this guided project, we'll use SQL in Jupyter Notebook to explore and analyze data from this database, named ```factbook.db``` # #### Connecting to the factbook database # In[1]: get_ipython().run_cell_magic('capture', '', '%load_ext sql\n%sql sqlite:///factbook.db\n') # Query database to get list of tables in it # In[2]: get_ipython().run_cell_magic('sql', '', "SELECT * \nFROM sqlite_master\nWHERE type='table';\n") # We observe there are two tables in the factbook database, namely ```sqlite_sequence```, ```facts``` # Next we write a query to return first 5 rows of the facts table # In[3]: get_ipython().run_cell_magic('sql', '', 'SELECT * \nFROM facts\nLIMIT 5;\n') # Here are the descriptions for some of the columns: # - name - The name of the country. # - area- The country's total area (both land and water). # - area_land - The country's land area in square kilometers. # - area_water - The country's waterarea in square kilometers. # - population - The country's population. # - population_growth- The country's population growth as a percentage. # - birth_rate - The country's birth rate, or the number of births a year per 1,000 people. # - death_rate - The country's death rate, or the number of death a year per 1,000 people. # #### Summary Statisitcs # Let's start by calculating some summary statistics and look for any outlier countries. We use aggregate functions like MAX, MIN to find e.g countrie(s) with the highest/least population. # In[4]: get_ipython().run_cell_magic('sql', '', '--Minimum/Maximum Population and Population_growth\n\nSELECT MIN(population) as min_population,\n MAX(population) as max_population,\n MIN(population_growth) as min_population_growth, \n MAX(population_growth) as max_population_growth\nFROM facts; \n') # A few things stick out: # # - There's a country with a population of 0 # - There's a country with a population of 7256490011 (or more than 7.2 billion people ... which by intuition is the population of the world itself) # # Let's use subqueries to zoom in on just these countries without using the specific values: # In[5]: get_ipython().run_cell_magic('sql', '', '--Least Populated Country\n\nSELECT *\nFROM facts\nWHERE population = (SELECT MIN(population) \n FROM facts);\n') # In[6]: get_ipython().run_cell_magic('sql', '', '--Most Populated Country\n\nSELECT *\nFROM facts\nWHERE population == (SELECT MAX(population) \n FROM facts);\n') # It seems like the table contains a row for the whole world, which explains the population of over 7.2 billion. It also seems like the table contains a row for Antarctica, which explains the population of 0. This seems to match the CIA Factbook [page for Antarctica:](https://www.cia.gov/library/publications/the-world-factbook/geos/ay.html) # # Now that we know this, we should recalculate the summary statistics we calculated earlier, while excluding the row for the whole world. # In[7]: get_ipython().run_cell_magic('sql', '', "--Minimum/Maximum Population and Population_growth\n\nSELECT MIN(population) as min_population,\n MAX(population) as max_population,\n MIN(population_growth) as min_population_growth, \n MAX(population_growth) as max_population_growth\nFROM facts\nWHERE name != 'World'; \n") # In[8]: get_ipython().run_cell_magic('sql', '', "--Most Populated Country\n\nSELECT *\nFROM facts\nWHERE population == (SELECT MAX(population) \n FROM facts \n WHERE name != 'World');\n") # Now, we find countries that are densely populated. To do this, we'll identify countries that have: # - Above average values for population. # - Below average values for area. # In[9]: get_ipython().run_cell_magic('sql', '', '--Average Population and Area\n\nSELECT AVG(population) as avg_population,\n AVG(area) as avg_area\nFROM facts; \n') # In[10]: get_ipython().run_cell_magic('sql', '', 'SELECT * \nFROM facts\nWHERE population > (SELECT AVG(population) as avg_population\n FROM facts) \n AND area < (SELECT AVG(area) as avg_area\n FROM facts) ; \n') # In[11]: get_ipython().run_cell_magic('sql', '', '--Most densly populated countries\n\nSELECT name, population, area, \n CAST(population as float)/area *100 as population_density\nFROM facts\nORDER BY population_density DESC\nLIMIT 13;\n') # $population\_density = \frac{population}{area} * 100$ # # Macau, Monaco, Singapore are the top three most densly populated country in the world. # In[12]: get_ipython().run_cell_magic('sql', '', "--Fastest Growing Country\n\nSELECT *\nFROM facts\nWHERE population_growth == (SELECT MAX(population_growth) \n FROM facts \n WHERE name != 'World');\n") # In[13]: get_ipython().run_cell_magic('sql', '', '--Fastest Growing Countries\n\nSELECT name, population, \n population_growth \nFROM facts\nORDER BY population_growth DESC\nLIMIT 10;\n') # In[14]: get_ipython().run_cell_magic('sql', '', "--Countries adding the most people next year, 2016\n\nSELECT name, population, population_growth, \n CAST(population_growth * population as Float)/100 as population_added \nFROM facts\nWHERE name != 'World'\nORDER BY population_added DESC\nLIMIT 14;\n") # $population\_growth = \frac{population\_added}{population} * 100$ # # India is set to add about 15 million people next year - the most for any nation. With about 6 million and 4.5 million people set to join China and Nigeria respectively, next year. # In[15]: get_ipython().run_cell_magic('sql', '', '--Countries that have more water than land\n\nSELECT name, area, area_land, area_water, \n CAST(area_water as float)/area *100 as water_land_ratio\nFROM facts\nORDER BY water_land_ratio DESC\nLIMIT 12;\n') # Only two countries have more body of water than land - British Indian Ocean Territory(99.9% water) and Virgin Isalnds(82%) # In[16]: get_ipython().run_cell_magic('sql', '', '--Countries that have higher death rate than birth rate\n\nSELECT name, birth_rate, death_rate\nFROM facts\nWHERE death_rate > birth_rate\nORDER BY death_rate DESC\n') # By intuition, most of the countries that have higher death rate than birth rate seem to be (Eastern) European countries. With Ukraine having the highest death rate on the list.