Visualizing Change Using Time-Series Line Charts

by Nick Heitzman

Contents¶

Introduction¶

Time-series data visualizations are everywhere. While people in every profession can read these charts, effectively communicating change over time can present unexpected challenges. When creating any type of visualization, it is important to first determine the message you would like to communicate. The increased popularity of exploratory data visualization tools such as Tableau and Microsoft Power BI makes it easy to forget this step. These tools let users connect to databases and click around until they find the prettiest visualization, which often leads to ineffective visualizations with no explicit purpose.

When creating time-series line charts, it’s important to consider which of the following you would like to communicate:

Actual value of units?
Change in absolute units?
Percent change?
Change from a specific point in time?

Ultimately, no chart can communicate all of these effectively. It is important to recognize this, determine which message is most important, and then design your visual accordingly.
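As a rough illustration of the four options above, the sketch below derives each quantity from the same series using pandas. The series is a hypothetical toy example (made-up values, not the World Bank data used later in this post), and the column names are my own labels.

```python
import pandas as pd

# Hypothetical toy series (made-up values, not the World Bank data)
s = pd.Series(
    [100.0, 110.0, 121.0],
    index=pd.to_datetime(["2000-12-31", "2001-12-31", "2002-12-31"]),
    name="toy",
)

summary = pd.DataFrame({
    "actual": s,                         # actual value of units
    "abs_change": s.diff(),              # change in absolute units
    "pct_change": s.pct_change() * 100,  # percent change from the prior period
    "index_2000": s / s.iloc[0] * 100,   # change from a specific point in time
})
print(summary)
```

Each column answers a different question about the same data, which is why a single chart struggles to show all four at once.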

Import Basic Libraries¶

In [1]:

%matplotlib inline
import pandas as pd
import Quandl as qd
import warnings
warnings.filterwarnings('ignore')

#Nick's Quandl Auth token
auth = '9zjPBpsaLGqS-KPGzvyn'

Retrieve Population Data from Quandl¶

What is Quandl?¶

Quandl is an online data warehouse which has millions of public datasets. Quandl's API is set up to pull data directly into a Pandas dataframe, and it automatically sets the date as the index. For more info on using Quandl with Python, visit: https://www.quandl.com/help/python

Quandl houses the World Bank's public data. The north_america_codes.json file lists the Quandl dataset code for the total population of each country in North America, including Central America and the Caribbean.

In [2]:

df_codes = pd.read_json('north_america_codes.json')

In [3]:

df_codes.head()

Out[3]:

	code	country
0	WORLDBANK/USA_SP_POP_TOTL	USA
1	WORLDBANK/CAN_SP_POP_TOTL	Canada
10	WORLDBANK/HTI_SP_POP_TOTL	Haiti
11	WORLDBANK/JAM_SP_POP_TOTL	Jamaica
12	WORLDBANK/KNA_SP_POP_TOTL	Saint Kitts and Nevis

Retrieve Data¶

Using the Quandl API, I loop through each country to pull population data. As each country's data is pulled, it is concatenated into a single Pandas DataFrame (df).

In [4]:

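# Start from an empty DataFrame (note the parentheses: pd.DataFrame alone is the class)
df = pd.DataFrame()

for x in df_codes.code:
    df_temp = qd.get(x, authtoken=auth)
    # Characters 10-12 of the code are the country abbreviation, e.g. 'USA'
    df_temp.rename(columns={'Value': x[10:13]}, inplace=True)

    if df.empty:
        df = df_temp
    else:
        df = pd.concat([df, df_temp], axis=1)

df.columns = [x.lower() for x in df.columns]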
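The rename-and-concatenate step can be tried without network access. The sketch below is one possible variant, not the code used for this post: `combine_series` is a hypothetical helper, and the two small frames are made-up stand-ins for what the API returns (a single `Value` column indexed by date). It collects the renamed frames in a list and calls `pd.concat` once.

```python
import pandas as pd

def combine_series(frames_by_code):
    """Rename each single-column frame after its country code, then join on the date index."""
    renamed = [
        frame.rename(columns={"Value": code[10:13].lower()})
        for code, frame in frames_by_code.items()
    ]
    return pd.concat(renamed, axis=1)

# Hypothetical stand-ins for the API responses (made-up values)
dates = pd.to_datetime(["2000-12-31", "2001-12-31"])
fake = {
    "WORLDBANK/USA_SP_POP_TOTL": pd.DataFrame({"Value": [1.0, 2.0]}, index=dates),
    "WORLDBANK/CAN_SP_POP_TOTL": pd.DataFrame({"Value": [3.0, 4.0]}, index=dates),
}
combined = combine_series(fake)
print(combined)
```

Concatenating once at the end avoids the special case for the first country and keeps the column naming in one place.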

Data Munging¶

I then calculate the total for North America. For the purpose of this analysis, we are going to compare USA, Mexico, and Canada in addition to the North American total. The DataFrame is then limited to just these four columns.

In [5]:

df.insert(0,'north america',df.sum(axis=1))
df = df[['north america', 'usa', 'mex', 'can']]
df.tail(5)

Out[5]:

	north america	usa	mex	can
Date
2010-12-31	540355247	309347057	117886404	34005274
2011-12-31	545608318	311721632	119361233	34342780
2012-12-31	550985148	314112078	120847477	34754312
2013-12-31	556361769	316497531	122332399	35158304
2014-12-31	561674093	318857056	123799215	35540419

Methods for Visualizing Change¶

Plotly is a third-party library that allows users to develop interactive visualizations and share them online. The Plotly library cufflinks was created specifically to work with Pandas dataframes, and it allows users to make great visualizations in a single line of code.

In [6]:

import cufflinks as cf

# Use these imports for offline development
#import plotly.offline as py
#py.init_notebook_mode() 
#cf.go_offline()

# Use these imports for online publishing
import plotly.plotly as py
cf.go_online()

In [7]:

colors = ['orange', 'blue', 'green', 'red']
dims = (800,500)
width = 2.5

Plot the Data¶

In [8]:

title = """North America Population"""
fig1 = df.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True )
py.iplot(fig1)

Out[8]:

The most basic method for visualizing change is to directly plot the data. The chart above shows the population of the United States, Mexico, Canada, and North America (including Central America and the Caribbean). While this lets readers see the absolute units, each series has a vastly different scale. These differences in scale make it difficult for your audience to quickly compare change. Looking at this chart, which country do you think grew at the fastest rate?

Using Subplots¶

In [9]:

title = """North America Population"""
fig2 = df.iplot(subplots = True,theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True )
py.iplot(fig2)

Out[9]:

The subplots method allows us to look at each series individually while also comparing the general trends. The subplots method can be helpful for comparing datasets with vastly different scales; however, it is not particularly useful for this analysis. Subplots are informative when there is large variation in your data. They are not effective for datasets that constantly increase over time. These four charts essentially just show ~45 degree angles.

Dual Y-Axes¶

In [10]:

title = 'North America Population'
fig3 = df.iplot(theme='white',dimensions=dims,colors=colors,title=title,
        secondary_y=['mex','can'],legend=False,width=width, asFigure=True )
py.iplot(fig3)

Out[10]:

It can be tempting to use a secondary y-axis, as in the chart above, to solve the problem of scale. I strongly caution against this approach. In this chart, the populations of Canada and Mexico are plotted on the right axis. A dual-axis chart can cause a few different issues:

Readers have to fight the tendency to compare magnitude between lines
Our brains are trained to look for points in time at which lines intersect. We instinctively believe these are significant. In a dual-axis chart, these intersections are meaningless.

Stephen Few, one of the experts in the data visualization field, wrote about how he could not identify a scenario in which a dual y-axis is ever the best way of visualizing data. While I mostly agree, I believe there are circumstances where a dual y-axis can help provide context (such as how many observations took place in a specific location on a chart). For this analysis, a dual y-axis is not an effective way of communicating change amongst our datasets.

Periodic Change¶

In [11]:

df_diff = df.diff()
title = """Annual Change in North American Population"""
fig4 = df_diff.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)
py.iplot(fig4)

Out[11]:

While plotting change in absolute units allows us to make comparisons within a specific dataset, it is not particularly effective for comparing change across datasets with vastly different scales. If we examine 1990-1994, we can see the population of the United States had much higher than normal growth. What this chart does not effectively communicate is the rapid growth in Mexico from 1960-1980.
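To make the scale problem concrete, the sketch below applies `diff()` to the last three rows of the table shown earlier in this post, for the United States and Canada only. The variable names are my own; the values are copied from that table.

```python
import pandas as pd

# Values copied from the tail of the population table earlier in this post
pop = pd.DataFrame(
    {
        "usa": [314112078, 316497531, 318857056],
        "can": [34754312, 35158304, 35540419],
    },
    index=pd.to_datetime(["2012-12-31", "2013-12-31", "2014-12-31"]),
)

annual_change = pop.diff()  # each row minus the previous row; the first row is NaN
print(annual_change)

# How many times larger the US change is than Canada's in the final year
ratio = annual_change["usa"].iloc[-1] / annual_change["can"].iloc[-1]
print(round(ratio, 1))
```

The US line sits far above Canada's simply because the US population is larger, so the chart says little about which country is growing faster relative to its size.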

Periodic Percent Change¶

In [12]:

df_pct_change = df.pct_change() * 100
title = """Annual Percent Change in North American Population"""
fig5 = df_pct_change.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)
py.iplot(fig5)

Out[12]:

Visualizing percent change is a great way to establish growth relationships between data sets of different units and scales. Of all the charts I made when creating this post, this yielded the most surprising results. Two items particularly jumped out at me:

None of the previous charts illustrated that Mexico has experienced more rapid population growth than the United States and Canada.
Population growth is slowing amongst the three major countries in North America. While this is a bit surprising, a closer look at the previous chart helps explain this. Absolute annual population growth (the numerator) has been relatively flat since 1960; however, the current population of each country (the denominator) continues to increase.

While this type of chart demonstrates change, readers completely lose context of scale. This chart does not communicate how much larger the population of the United States is compared with Canada (the US has roughly 10x the population of Canada). Another drawback to the percent change method is the outlier effect. If the population of a country decreased one year, an increase in population the following year would be overstated.
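Two of the points above can be checked with small hypothetical series (made-up values, not the population data). The first shows the outlier effect: after a one-year dip, the recovery registers as a larger percentage than the drop even though the level only returns to where it started. The second shows why flat absolute growth produces a falling growth rate.

```python
import pandas as pd

# Hypothetical series: a one-year dip followed by a full recovery
dip = pd.Series([100.0, 90.0, 100.0])
dip_pct = dip.pct_change() * 100
print(dip_pct)  # a 10% drop, then a rise of roughly 11.1% back to the same level

# Hypothetical series: constant absolute growth of 10 units per year
steady = pd.Series([100.0, 110.0, 120.0, 130.0])
steady_abs = steady.diff()              # flat numerator
steady_pct = steady.pct_change() * 100  # rate shrinks as the denominator grows
print(steady_abs)
print(steady_pct)
```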

Indexing Data¶

In [13]:

x = df[df.index == df.index.min()].squeeze()
df_1960 = 100 + ((df - x) / x) * 100
x

Out[13]:

north america    268076376
usa              180671000
mex               38676974
can               17909009
Name: 1960-12-31 00:00:00, dtype: float64

In [14]:

df_1960.head()

Out[14]:

	north america	usa	mex	can
Date
1960-12-31	100.000000	100.000000	100.000000	100.000000
1961-12-31	102.029384	101.671547	103.263691	102.021279
1962-12-31	104.012274	103.247339	106.612141	103.936516
1963-12-31	105.966402	104.743982	110.050073	105.890840
1964-12-31	107.920963	106.209076	113.585406	107.906585

In [15]:

title = """North American Population (Index 100 = December 31, 1960)"""
fig6 = df_1960.iplot(theme='white',dimensions=dims,colors=colors,title=title,width=width, asFigure=True)
py.iplot(fig6)

Out[15]:

Indexing data is my absolute favorite way to compare change across datasets. This chart allows the reader to understand the rate at which change has occurred across datasets from a certain point in time (December 31, 1960). By using this fixed point in time as a reference, we reduce the impact of single outliers. This method allows us to compare not only datasets which have different scales, but also those which are measured in different units. What jumped out to me most was the fact that Mexico’s population has more than tripled since 1960!

While I love index charts, there is no perfect time-series chart. Two specific areas of caution when using an index are:

It is irresponsible to pick an outlier as the starting point. This misleads your audience, as the change since an outlier is rarely relevant.
Similar to the percent change chart, an audience would be unable to understand the differences in magnitude across datasets.
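Because the choice of starting point matters so much, it can help to make the base date an explicit parameter. The sketch below is a minimal example under that assumption: `rebase` is a hypothetical helper, not part of pandas, and the two rows are copied from the population table earlier in this post. It also checks that dividing by the base row gives the same result as the `100 + ((df - x) / x) * 100` form used above.

```python
import pandas as pd

def rebase(frame, base_date):
    """Index every column to 100 at base_date (hypothetical helper, not part of pandas)."""
    base_row = frame.loc[pd.Timestamp(base_date)]
    return frame / base_row * 100

# Values copied from the tail of the population table earlier in this post
pop = pd.DataFrame(
    {"usa": [309347057, 318857056], "can": [34005274, 35540419]},
    index=pd.to_datetime(["2010-12-31", "2014-12-31"]),
)

indexed = rebase(pop, "2010-12-31")
print(indexed.round(2))

# Same calculation written the way it appears earlier in this post
x = pop.loc[pd.Timestamp("2010-12-31")]
long_form = 100 + ((pop - x) / x) * 100
print(((indexed - long_form).abs() < 1e-9).all().all())
```

Trying a few different base dates is a quick way to see whether the story in the chart depends on where the index starts.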

Conclusion¶

All of the previously discussed charts can be useful for communicating change across time. That being said, no time-series chart is perfect. As data visualizers, we must accept this and:

Determine the message we would like to communicate and
Choose the method which most effectively delivers this message

It is also important to remember that charts are free! There is no need to try to squeeze every bit of information into a single chart. I feel the entire story of North American population growth can be explained using the following three charts:

In [16]:

py.iplot(fig1)

Out[16]:

In [17]:

py.iplot(fig5)

Out[17]:

In [18]:

py.iplot(fig6)

Out[18]:
