#!/usr/bin/env python # coding: utf-8 # # Analyzing CIA Factbook Data Using SQL # ## Introduction # In this guided project, we will be working with data from the [CIA World Factbook](https://www.cia.gov/the-world-factbook/), a compendium of statistics about every countries on Earth. The site provides demographic information for each country. # # This dataset is a SQLite Database named factbook.db that can be dowloaded [here]( https://dsserver-prod-resources-1.s3.amazonaws.com/257/factbook.db). # # *Note: the data used is not current.* # # The goal of this project is to learn basic statistics using SQL (more precisely, [SQLite](https://www.sqlite.org/index.html)). # ## Analyzing the Data # **Connecting to database** # # First of all we need to connect to the database: # In[1]: get_ipython().run_line_magic('reload_ext', 'sql') get_ipython().run_line_magic('sql', 'sqlite:///factbook.db') # **Overview of the Data** # # Now let's get the information about the database structure: # In[2]: get_ipython().run_cell_magic('sql', '', "select *\n from sqlite_master\nwhere type='table';\n") # The database actually contain only one table named 'facts'. # Let's take a look at its structure and content (5 lines are enough): # In[3]: get_ipython().run_cell_magic('sql', '', 'select *\nfrom facts\nlimit 5;\n') # *Here are the descriptions for the columns:* # # - name — the name of the country. # - area— the country's total area (both land and water). # - area_land — the country's land area in square kilometers. # - area_water — the country's waterarea in square kilometers. # - population — the country's population. # - population_growth— the country's population growth as a percentage. # - birth_rate — the country's birth rate, or the number of births per year per 1,000 people. # - death_rate — the country's death rate, or the number of death per year per 1,000 people # - migration_rate - the country's migration rate. # In[4]: get_ipython().run_cell_magic('sql', '', 'select count(*)\nfrom facts\n;\n') # The table contains 261 rows including NULL and duplicates. # **Section 1. Summary Statistics** # # The goal of the current calculation is to get following: # - Minimum population. # - Maximum population. # - Minimum population growth. # - Maximum population growth. # In[5]: get_ipython().run_cell_magic('sql', '', "select \nmin(population) as 'Minimum population',\nmax(population) as 'Maximum population',\nmin(population_growth) as 'Minimum population growth, %',\nmax(population_growth) as 'Maximum population growth, %'\nfrom facts;\n") # **Section 2. Exploring Outliers** # The result above looks incorrect, the analysis showed that in some countries the population is zero, and in some it is close to [the world population](https://www.worldometers.info/world-population/). # # Let's find which country has zero population: # In[6]: get_ipython().run_cell_magic('sql', '', 'select name as Country,\n population as Population\nfrom facts\nwhere population == 0;\n') # That's right, there is no [permanent population](https://worldpopulationreview.com/continents/antarctica-population) in Antarctica. # # Now let's find which country has the maximun population: # In[7]: get_ipython().run_cell_magic('sql', '', 'select name as Country, \n population as Population\nfrom facts\nwhere population == (select max(population) from facts);\n') # Thus, the database contains the value of the total population of the world . This means that we need to exclude this row and recalculate our statistics: # In[8]: get_ipython().run_cell_magic('sql', '', 'select min(population), max(population),\nmin(population_growth), max(population_growth)\nfrom facts\nwhere name <> \'World\' --exclude the "World" row\nand name <> \'Antarctica\' --exclude the \'Antarctica\' row\n; \n') # As we can see, the population range is from 48 people to over 1.3 billion, and the population growth is from zero to 4 people per 1,000 people. Are we curious enough to know what these countries are? Well, I am: # In[9]: get_ipython().run_cell_magic('sql', '', "select name as Country,\n population as Total_population,\n population*100/(select population from facts\n where name = 'World') as '%_from_World_population'\nfrom facts\nwhere \n population = (select min(population)\n from facts\n where name != 'Antarctica')\n or population = (select max(population)\n from facts\n where name != 'World')\n ;\n") # In[10]: get_ipython().run_cell_magic('sql', '', "select name as Country,\n population_growth as 'Growth_population_%'\nfrom facts\nwhere \n population_growth = (select min(population_growth)\n from facts)\n or population_growth = (select max(population_growth)\n from facts)\n ;\n") # Above we see the following: # - China is the most populous country, but it is home to only 18% of the world's population; # - Pitcairn Island is the least populated country and population growth is shown as zero; # - three more countries show zero population growth; # - the population of South Sudan is growing faster than in any other country. # # Zero as the growth value does not mean that population is constant. It may be too small to be shown in the data. # Let's test this hipothesis: # In[11]: get_ipython().run_cell_magic('sql', '', "select name as Country,\n population as Population_Total,\n population_growth as 'Growth_population_%',\n birth_rate as Births_per_1000,\n death_rate as Deaths_per_1000\nfrom facts\nwhere \n population_growth = (select min(population_growth)\n from facts)\n;\n") # We see that in database three countries with the minimum population growth do not have records about births and deaths. What about Greenland? # In[12]: get_ipython().run_cell_magic('sql', '', "select name as Country,\n (birth_rate - death_rate)*100/population as Growth_Rate_Full\n from facts\n where name = 'Greenland'\n ;\n") # Thus, the Greenland's population is growthing, but too low to shown the value of the growth in the current data. # By the way, we can see that some countries do not have data in the "population_growth" column. # Let's look at these rows: # In[13]: get_ipython().run_cell_magic('sql', '', "select name as Country,\n population_growth as 'Growth_%',\n birth_rate as Births_per_1000,\n death_rate as Deaths_per_1000,\n area as Total_Area_km2,\n area_land as Land_km2\nfrom facts\nwhere \n population_growth is null\n order by name\n ;\n") # In[14]: get_ipython().run_cell_magic('sql', '', 'select count(name)\n from facts\n where population_growth is null\n ;\n') # We can see that in 25 lines there is no data about growth, births or deaths. Thus, we cannot fill missing data by calculation. Between these lines there are rows with data about Antarctica and all five oceans. # In[15]: get_ipython().run_cell_magic('sql', '', 'select count(name)\n from facts\n where population_growth is null\n and area < 300\n ;\n') # The 11 countries for which data are missing are smaller than 300 km2 and are small islands or a group of small islands. There are also a couple of group islands with a fairly large total area. Separately lies Kosovo, about which we do not have the necessary data. # **Section 3. Exploring Average population and Area** # Now let's calculate the average value for population and area: # In[16]: get_ipython().run_cell_magic('sql', '', "select \n round(avg(population), 2) as 'Average_Population',\n round(avg(area), 2) as 'Average_Area'\nfrom facts\nwhere name <> 'World';\n") # And one more calculation: which countries meet both of the following criteria: # - The population is above average. # - The area is below average. # In[17]: get_ipython().run_cell_magic('sql', '', "select name as Country,\n population as Total_population,\n area as Total_Area\nfrom facts\nwhere population > (select avg(population) \n from facts\n where name <> 'World') \n and area < (select avg(area) \n from facts\n where name <> 'World');\n") # In[18]: get_ipython().run_cell_magic('sql', '', 'select count(name) as "Amount_of_countries"\nfrom facts\nwhere name in\n(select name \nfrom facts\nwhere population > (select avg(population) \n from facts\n where name <> \'World\') \n and area < (select avg(area) \n from facts\n where name <> \'World\')\n);\n') # As we see, 14 countries meet both criteria. # **Section 4. Finding Densely Populated Countries** # In this section, we will analyze which countries have an above average population at a time when the land (without water area) is below the average: # In[19]: get_ipython().run_cell_magic('sql', '', "select name as Country,\npopulation as Total_population,\npopulation/area_land as People_per_km2_of_land\nfrom facts\nwhere population > (select avg(population) from facts where name<>'World')\nand area < (select avg(area_land) from facts where name<>'World')\norder by People_per_km2_of_land desc\nlimit 10;\n") # Above we can see that the most populous country is Bangladesh- over 1200 people per square kilometer of the land. Which country is the least populuos? # In[20]: get_ipython().run_cell_magic('sql', '', "select name as Country,\npopulation as Total_population,\npopulation/area_land as People_per_km2_of_land\nfrom facts\nwhere population > (select avg(population) from facts where name<>'World')\nand area < (select avg(area_land) from facts where name<>'World')\norder by People_per_km2_of_land\nlimit 5;\n") # It's Morocco - only 74 people per square kilometer of the land. # **Section 5. Some More Analysis** # *Q1. Which country has the most people? Which country has the highest growth rate?* # In[21]: get_ipython().run_cell_magic('sql', '', "select name as 'Country',\n population as 'Total_Population',\n population_growth as 'Growth_Rate_%'\n from facts\n where population== (select max(population)\n from facts\n where name !='World')\n or population_growth == (select max(population_growth)\n from facts\n where name !='World')\n ;\n") # So, China is the most populous, but its growth rate is low. South Sudan's growth rate is the highest, but the total population is below average. # *Q2. Which country has the highest ratio of water to land? Which country has less water than land? # In[22]: get_ipython().run_cell_magic('sql', '', "select name as 'Country',\n area_water/area_land as 'Water/Land_Ratio',\n area_water-area_land as 'Water/Land_Difference'\n from facts\n where area_water/area_land == (select max(area_water/area_land)\n from facts\n where name !='World' )\n or area_water-area_land == (select min(area_water-area_land)\n from facts\n where name !='World')\n ;\n") # We have get that British Indian Ocean Territory has 905 times more water than land, and Russian lands are much larger than water area. # *Q3. Which countries will add the most people to their populations next year?* # In[23]: get_ipython().run_cell_magic('sql', '', "select name as Country,\n cast((birth_rate - death_rate)*population/1000 as int) as Population_Growth_Next_Year\n from facts\n where name <> 'World'\n order by Population_Growth_Next_Year desc\n limit 10\n ;\n") # Above we see that the world can expect that next year India will add over 15 millions people to its population. # *Q4. Which countries have a higher death rate than birth rate?* # In[24]: get_ipython().run_cell_magic('sql', '', "select name as 'country',\n death_rate, birth_rate,\n round(death_rate - birth_rate, 2) as diff\n from facts\nwhere name <> 'World'\norder by diff desc\nlimit 10\n ;\n") # The above calculation shows which populations are declining. If you notice, all these countries are in Europe. What about the opposite situation? # In[25]: get_ipython().run_cell_magic('sql', '', "select name as 'country',\n birth_rate, death_rate,\n round(birth_rate - death_rate, 2) as diff\n from facts\nwhere name <> 'World'\norder by diff desc\nlimit 10\n ;\n") # Here we get the countries whose population is growing most intensively. And they are all in Africa. # *Q5. Which countries have the highest population/area ratio?* # In[26]: get_ipython().run_cell_magic('sql', '', "select name as Country,\n cast(population/area as int) as People_per_km2_Area,\n population as Population, area as Total_Area_km2\n from facts\nwhere name <> 'World'\norder by People_per_km2_Area desc\nlimit 10\n ;\n") # The calculation shows that there are countries with extremely high population densities that did not get in result of our analysis above because their total population is below average. # ## Conclusion # In this analysis, we reviewed some of the key statistics about the countries of the world based on the CIA World Fact Book. As we have seen, a small database with a small amount of data and fairly simple queries can provide detailed analysis and deliver interesting information that allows us to better and deeper understand our world.