Import Basic Libraries
Retrieve Population Data from Quandl
Plot the Data
Subplots
Dual Y-Axes
Periodic Change
Periodic Percent Change
Indexing Data
Time-series data visualizations are everywhere. While these charts are understood amongst individuals of all professions, effectively communicating change over time can present unexpected challenges. When creating any type of visualization, it is important to first determine the message you would like to communicate. The increased popularity of exploratory data visualization tools such as Tableau and Microsoft Power BI makes it easy to forget this step. These tools let users connect to databases and click around until they find the prettiest visualization, which often leads to ineffective visualizations with no explicit purpose.
When creating time-series line charts, it’s important to consider which of the following you would like to communicate:
Ultimately, no chart can communicate all of these effectively. It is important to recognize this, determine which message is most important, and then design your visual accordingly.
%matplotlib inline
import pandas as pd
import Quandl as qd
import warnings
warnings.filterwarnings('ignore')
#Nick's Quandl Auth token
auth = '9zjPBpsaLGqS-KPGzvyn'
Quandl is an online data warehouse which has millions of public datasets. Quandl's API is set up to pull data directly into a Pandas dataframe, and it automatically sets the date as the index. For more info on using Quandl with Python, visit: https://www.quandl.com/help/python
Quandl houses the World Bank's public data. The north_america_codes.json file contains the Quandl code for the total population dataset of each country in North America, including Central America and the Caribbean.
df_codes = pd.read_json('north_america_codes.json')
df_codes.head()
| | code | country |
|---|---|---|
| 0 | WORLDBANK/USA_SP_POP_TOTL | USA |
| 1 | WORLDBANK/CAN_SP_POP_TOTL | Canada |
| 10 | WORLDBANK/HTI_SP_POP_TOTL | Haiti |
| 11 | WORLDBANK/JAM_SP_POP_TOTL | Jamaica |
| 12 | WORLDBANK/KNA_SP_POP_TOTL | Saint Kitts and Nevis |
Using the Quandl API, I loop through each country to pull population data. As each country's data is pulled, it is concatenated into a single Pandas DataFrame (df).
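# Start with an empty DataFrame and add one column per country
df = pd.DataFrame()
for x in df_codes.code:
    df_temp = qd.get(x, authtoken=auth)
    # Characters 10-12 of the Quandl code hold the three-letter country code
    df_temp.rename(columns={'Value': x[10:13]}, inplace=True)
    if df.empty:
        df = df_temp
    else:
        # Join side by side, aligning rows on the date index
        df = pd.concat([df, df_temp], axis=1)
df.columns = [x.lower() for x in df.columns]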
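The loop above depends on a live Quandl connection. As a rough offline sketch of the same concatenation pattern, the snippet below swaps in a hypothetical stand-in, `fake_get`, that returns made-up placeholder values (not real population data), and collects the frames in a list so that `pd.concat` is called only once:

```python
import pandas as pd

def fake_get(code):
    """Hypothetical stand-in for qd.get; the values are placeholders, not real data."""
    dates = pd.to_datetime(['2000-12-31', '2001-12-31'])
    return pd.DataFrame({'Value': [1.0, 2.0]}, index=dates)

codes = ['WORLDBANK/USA_SP_POP_TOTL', 'WORLDBANK/CAN_SP_POP_TOTL']

frames = []
for code in codes:
    temp = fake_get(code)
    # Characters 10-12 of the code hold the three-letter country code
    temp = temp.rename(columns={'Value': code[10:13].lower()})
    frames.append(temp)

# Concatenate once, side by side, aligning rows on the date index
combined = pd.concat(frames, axis=1)
print(combined)
```

Building a list and concatenating once avoids re-copying the growing DataFrame on every iteration, which matters more as the number of countries grows.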
I then calculate the total for North America. For the purpose of this analysis, we are going to compare USA, Mexico, and Canada in addition to the North American total. The DataFrame is then limited to just these four columns.
df.insert(0,'north america',df.sum(axis=1))
df = df[['north america', 'usa', 'mex', 'can']]
df.tail(5)
| Date | north america | usa | mex | can |
|---|---|---|---|---|
| 2010-12-31 | 540355247 | 309347057 | 117886404 | 34005274 |
| 2011-12-31 | 545608318 | 311721632 | 119361233 | 34342780 |
| 2012-12-31 | 550985148 | 314112078 | 120847477 | 34754312 |
| 2013-12-31 | 556361769 | 316497531 | 122332399 | 35158304 |
| 2014-12-31 | 561674093 | 318857056 | 123799215 | 35540419 |
Plotly is a third-party library that allows users to develop interactive visualizations and share them online. The Plotly library cufflinks was created specifically to interact with Pandas DataFrames. Cufflinks allows users to make great visualizations in a single line of code.
import cufflinks as cf
# Use these imports for offline development
#import plotly.offline as py
#py.init_notebook_mode()
#cf.go_offline()
# Use these imports for online publishing
import plotly.plotly as py
cf.go_online()
colors = ['orange', 'blue', 'green', 'red']
dims = (800,500)
width = 2.5
title = """North America Population"""
fig1 = df.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True )
py.iplot(fig1)
The most basic method for visualizing change is to directly plot the data. The chart above shows the population of the United States, Mexico, Canada, and North America (including Central America and the Caribbean). While this affords readers the ability to see the absolute units, each series has a vastly different scale. These differences in scale make it difficult for your audience to quickly compare change. Looking at this chart, which country do you think grew at the fastest rate?
title = """North America Population"""
fig2 = df.iplot(subplots = True,theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True )
py.iplot(fig2)
The subplots method allows us to look at each series individually while also comparing the general trends. The subplots method can be helpful for comparing datasets with vastly different scales; however, it is not particularly useful for this analysis. Subplots are informative when there is large variation in your data. They are not effective for datasets that constantly increase over time. These four charts essentially just show ~45 degree angles.
title = 'North America Population'
fig3 = df.iplot(theme='white', dimensions=dims, colors=colors, title=title,
                secondary_y=['mex', 'can'], legend=False, width=width, asFigure=True)
py.iplot(fig3)
It can be tempting to use a secondary y-axis to help solve the problem of scale. I strongly caution against this approach. In this chart, the populations of Canada and Mexico are plotted on the right axis. A dual-axis chart can potentially cause a few different issues:
Stephen Few, one of the experts in the data visualization field, wrote about how he could not identify a scenario in which a dual y-axis is ever the best way of visualizing data. While I mostly agree, I believe there are circumstances where a dual y-axis can help provide context (such as how many observations took place in a specific location on a chart). For this analysis, a dual y-axis is not an effective way of communicating change amongst our datasets.
df_diff = df.diff()
title = """Annual Change in North American Population
"""
fig4 = df_diff.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)
py.iplot(fig4)
While plotting change in absolute units allows us to make comparisons within specific datasets, it is not particularly effective for comparing change across datasets with vastly different scales. If we examine 1990-1994, we can see the population of the United States had much higher than normal growth. What this chart does not effectively communicate is the rapid growth in Mexico from 1960-1980.
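To make the scale problem concrete, here is a small sketch that applies `diff()` to the 2010-2012 USA and Canada figures from the table earlier in this post. Using plain integer years as the index is a simplification for illustration only:

```python
import pandas as pd

# 2010-2012 populations copied from the df.tail() output earlier in the post
pop = pd.DataFrame(
    {'usa': [309347057, 311721632, 314112078],
     'can': [34005274, 34342780, 34754312]},
    index=[2010, 2011, 2012],  # plain years instead of dates, for brevity
)

# Year-over-year change in absolute units; the first row has no prior year, so it is NaN
df_diff = pop.diff()
print(df_diff)

# How many times larger the USA's absolute change is than Canada's
print(df_diff['usa'] / df_diff['can'])
```

The USA adds roughly 2.4 million people per year here against a few hundred thousand for Canada, so on a shared axis Canada's line sits near the bottom regardless of how quickly the country is growing relative to its own size.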
df_pct_change = df.pct_change() * 100
title = """Annual Percent Change in North American Population"""
fig5 = df_pct_change.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)
py.iplot(fig5)
Visualizing percent change is a great way to establish growth relationships between data sets of different units and scales. Of all the charts I made when creating this post, this yielded the most surprising results. Two items particularly jumped out at me:
While this type of chart demonstrates change, readers completely lose context of scale. This chart does not communicate how much larger the population of the United States is compared with Canada (the US has roughly 10x the population of Canada). Another drawback to the percent change method is the outlier effect. If the population of a country decreased one year, an increase in population the following year would be overstated.
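The outlier effect is easy to see with a toy series. The numbers below are invented purely for illustration: a population that drops from 100 to 90 and then returns to 100.

```python
import pandas as pd

# Invented values for illustration only: a one-year dip followed by a full recovery
pop = pd.Series([100.0, 90.0, 100.0], index=[2000, 2001, 2002])

# Percent change relative to the previous year
pct = pop.pct_change() * 100
print(pct)
# 2001 shows a drop of about 10%, while 2002 shows a gain of about 11.1%,
# even though the population only returned to where it started
```

Because each percentage is measured against the previous year's (smaller) base, the rebound looks larger than the drop that preceded it.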
x = df[df.index == df.index.min()].squeeze()
df_1960 = 100 + ((df - x) / x) * 100
x
north america    268076376
usa              180671000
mex               38676974
can               17909009
Name: 1960-12-31 00:00:00, dtype: float64
df_1960.head()
| Date | north america | usa | mex | can |
|---|---|---|---|---|
| 1960-12-31 | 100.000000 | 100.000000 | 100.000000 | 100.000000 |
| 1961-12-31 | 102.029384 | 101.671547 | 103.263691 | 102.021279 |
| 1962-12-31 | 104.012274 | 103.247339 | 106.612141 | 103.936516 |
| 1963-12-31 | 105.966402 | 104.743982 | 110.050073 | 105.890840 |
| 1964-12-31 | 107.920963 | 106.209076 | 113.585406 | 107.906585 |
title = """North American Population (Index 100 = December 31, 1960)"""
fig6 = df_1960.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)
py.iplot(fig6)
Indexing data is my absolute favorite way to compare change across datasets. This chart allows the reader to understand the rate at which change has occurred across datasets from a certain point in time (December 31, 1960). By using this fixed point in time as a reference, we reduce the impact of single outliers. This method allows us to compare not only datasets that have different scales, but also those that are measured in different units. What jumped out at me most was the fact that Mexico’s population has more than tripled since 1960!
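The index formula used above, `100 + ((df - x) / x) * 100`, simplifies to `df / x * 100`. The sketch below checks that equivalence using only the 1960 and 2014 figures quoted in this post; the integer-year index and the two-row frame are simplifications for illustration:

```python
import pandas as pd

# 1960 and 2014 populations copied from the outputs shown earlier in the post
pop = pd.DataFrame(
    {'usa': [180671000, 318857056],
     'mex': [38676974, 123799215],
     'can': [17909009, 35540419]},
    index=[1960, 2014],  # plain years instead of dates, for brevity
)

base = pop.iloc[0]  # the reference row (1960)

# The post's formula and its simplified form
indexed_original = 100 + ((pop - base) / base) * 100
indexed_simple = pop / base * 100

print(indexed_simple.round(1))
# Every series starts at 100 in 1960; mex ends above 300,
# which is the "more than tripled" observation
```

Either form works; the simplified one makes it clearer that an index value of 200 means "twice the base-year level".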
While I love index charts, there is no perfect time-series chart. Two specific areas of caution when using an index are:
All of the previously discussed charts can be useful for communicating change across time. That being said, no time-series chart is perfect. As data visualizers, we must accept this and:
It is also important to remember that charts are free! There is no need to try to squeeze every bit of information into a single chart. I feel the entire story of North American population growth can be explained using the following three charts:
py.iplot(fig1)
py.iplot(fig5)
py.iplot(fig6)