Scaffolding the visualisation design space

It's often useful to think about the visualisation design space before thinking about point designs. However, the visualisation design space is large. By scaffolding it, we can reassure ourselves that we haven't missed an obvious point design.

Here, I've attempted to scaffold a small part of the visualisation design space. I've made two assumptions. First, you have tidy data. For the data to be tidy, each variable should form a column, each observation should form a row, and each type of observation should form a table. Second, the data have location, time, and category variables, as well as a value variable.
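
As a rough illustration of what tidy means here, the sketch below reshapes a "wide" table (one column per year) into tidy records (one row per observation). It uses plain Python so that it runs on its own. The function name to_tidy and the column name gdp_per_capita are made up for this sketch, and the figures are the Argentina values that appear in the df.head() output in the example below.

```python
# Wide layout: one row per country, one column per year.
wide = [
    {'country': 'Argentina', '1990': 4318.774700, '1991': 5715.504397, '1992': 6798.026763},
]


def to_tidy(rows, id_key='country', var_name='year', value_name='gdp_per_capita'):
    """Return one record per observation: an (id, variable, value) triple."""
    tidy = []
    for row in rows:
        for key, value in row.items():
            if key == id_key:
                continue  # The identifier is repeated on every tidy record instead.
            tidy.append({id_key: row[id_key], var_name: key, value_name: value})
    return tidy


tidy = to_tidy(wide)
for record in tidy:
    print(record)
```

Each tidy record now holds one location, one time, and one value. In pandas, DataFrame.melt performs the same kind of reshaping.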

There are three steps:

  1. Think about the location, time, and category variables. Think about the domains of these variables; that is, think about the unique values of these variables. Think about whether you're interested in one, some, or all unique values of these variables.

  2. Choose two variables and sketch a 3x3 grid. The rows are for the first variable. The columns are for the second variable. The first row/column has a cardinality of one, the second row/column has a cardinality of some, and the third row/column has a cardinality of all. The variable you didn't choose has a cardinality of one.

  3. For each cell in the grid, sketch a single-view visualisation that represents the value variable for the given cardinalities of the given row/column variables.

That's it! Well, almost. It's worth sketching several single-view visualisations in Step 3. If you're struggling, then try transposing the rows and columns in Step 2.
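
Steps 2 and 3 can also be sketched in code. The snippet below is only a scaffold: scaffold_grid and format_grid are hypothetical helpers, not part of any library. It builds the empty 3x3 grid so that each cell can be filled in as you sketch a design for it.

```python
from itertools import product

CARDINALITIES = ('One', 'Some', 'All')


def scaffold_grid():
    """Return an empty grid: one '?' per (row cardinality, column cardinality) pair."""
    return {cell: '?' for cell in product(CARDINALITIES, repeat=2)}


def format_grid(grid, row_variable, column_variable):
    """Render the grid as plain text, with the first variable on the rows."""
    lines = ['{}/{}: {}'.format(row_variable, column_variable, ' | '.join(CARDINALITIES))]
    for row in CARDINALITIES:
        cells = [grid[(row, column)] for column in CARDINALITIES]
        lines.append('{}: {}'.format(row, ' | '.join(cells)))
    return '\n'.join(lines)


grid = scaffold_grid()
grid[('One', 'One')] = 'Single number'  # Step 3: fill a cell once you have a sketch for it.
print(format_grid(grid, 'Location', 'Time'))
```

Transposing the rows and columns, as suggested above, would amount to swapping the two elements of each key.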

Example

Here's an example of scaffolding a small part of the visualisation design space. The data, which are from the World Bank, are GDP per capita for the G20 countries from 1990 to 2015. I used Altair to produce the single-view visualisations. This is because Altair is more than a library: through Vega and Vega-Lite, it's a domain-specific language.

In [1]:
import altair as alt
import pandas as pd
from pandas_datareader import wb
In [2]:
alt.renderers.enable('notebook')
Out[2]:
RendererRegistry.enable('notebook')
In [3]:
g20_countries = ['AR', 'AU', 'BR', 'CA', 'CN', 'DE', 'EU', 'FR', 'GB', 'ID', 'IN', 'IT', 'JP', 'KR', 'MX', 'RU', 'SA', 'TR', 'US', 'ZA']
In [4]:
df = wb.download(indicator='NY.GDP.PCAP.CD', country=g20_countries, start=1990, end=2015, errors='ignore').sort_index()
In [5]:
df.columns = [x.replace('.', '-') for x in df.columns]  # Altair doesn't like column names that contain periods.

Notice that we have tidy data.

In [6]:
df.head()
Out[6]:
                NY-GDP-PCAP-CD
country   year
Argentina 1990     4318.774700
          1991     5715.504397
          1992     6798.026763
          1993     6940.350358
          1994     7449.480390

Step 1

The location variable is country. The unique values of this variable are the 20 names of the G20 countries. The time variable is year. The unique values of this variable are the 26 years from 1990 to 2015. There isn't a category variable. The value variable is NY-GDP-PCAP-CD.
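
The cardinalities in Step 1 can be checked with a few lines of plain Python. This sketch assumes the country codes from In [3] and years stored as strings, as in the notebook. The helper classify_selection is hypothetical: it simply labels a selection as one, some, or all of a variable's unique values.

```python
g20_countries = ['AR', 'AU', 'BR', 'CA', 'CN', 'DE', 'EU', 'FR', 'GB', 'ID',
                 'IN', 'IT', 'JP', 'KR', 'MX', 'RU', 'SA', 'TR', 'US', 'ZA']
years = [str(year) for year in range(1990, 2016)]  # 1990 to 2015 inclusive.


def classify_selection(selection, domain):
    """Label a selection as 'one', 'some', or 'all' of the unique values in domain."""
    selected, unique_values = set(selection), set(domain)
    if not selected or not selected <= unique_values:
        raise ValueError('selection must be a non-empty subset of the domain')
    if len(selected) == 1:
        return 'one'
    if selected == unique_values:
        return 'all'
    return 'some'


print(len(set(g20_countries)), len(set(years)))
print(classify_selection(['CA'], g20_countries))
print(classify_selection(years[-10:], years))
print(classify_selection(years, years))
```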

Step 2

Here's the 3x3 grid, with rows as location and columns as time.

Location/Time  One  Some  All
One            ?    ?     ?
Some           ?    ?     ?
All            ?    ?     ?

Step 3

The easiest cell to fill is one location, one time. We will use a single number. Is a single number a single-view visualisation? Well, it's faster and more accurate to extract information from a single number than it is to extract information from a chart that represents a single number.

In [7]:
def print_one_location_one_time(location='Canada', time='2015'):
    print('{}, {}: {}'.format(location, time, df.loc[(location, time), 'NY-GDP-PCAP-CD']))
print_one_location_one_time()
Canada, 2015: 43525.3701865304

If we move from the top left to the bottom right of the 3x3 grid, then the next easiest cells to fill are "some locations, one time" and "one location, some times".

Let's consider some locations, one time: the top ten countries for 2015. We will use a bar chart.

In [8]:
df_2015 = df.loc[(slice(None), '2015'), :]
In [9]:
df_2015_top_10 = df_2015.sort_values('NY-GDP-PCAP-CD', ascending=False).iloc[:10]
In [10]:
alt.Chart(df_2015_top_10.reset_index()).mark_bar().encode(x=alt.X('country', sort=None), y='NY-GDP-PCAP-CD')
Out[10]:

Let's consider one location, some times: Australia, from 2006 to 2015. We will use a line chart.

In [11]:
df_australia = df.loc[('Australia', slice('2006', '2015')), :]
In [12]:
alt.Chart(df_australia.reset_index()).mark_line().encode(x='year', y='NY-GDP-PCAP-CD')
Out[12]:

Let's update our 3x3 grid.

Location/Time  One            Some        All
One            Single number  Line chart  ?
Some           Bar chart      ?           ?
All            ?              ?           ?

Let's consider "all locations, one time" and "one location, all times". To decide how to fill these cells, we should ask "How many is all?"

We know there are 20 unique values of the location variable and 26 unique values of the time variable. These cardinalities are small enough to use a bar chart and a line chart again: we would add an extra bar for each extra country and extend the line by a point for each extra year. For larger location cardinalities, we might consider several bar charts, or small multiples, with one chart for each group of locations. For larger time cardinalities, we might consider a focus+context chart.
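
For larger location cardinalities, one simple way to form the groups for small multiples is to split the locations into fixed-size chunks, with one chunk per bar chart. The sketch below does this in plain Python. The helper name chunk, the placeholder location names, and the group size of five are all arbitrary assumptions for illustration.

```python
def chunk(items, size):
    """Split items into consecutive groups of at most `size` elements."""
    if size < 1:
        raise ValueError('size must be at least 1')
    return [items[start:start + size] for start in range(0, len(items), size)]


# Placeholder location names; in the notebook these would be the country names.
locations = ['Location {}'.format(number) for number in range(1, 21)]
groups = chunk(locations, 5)
print(len(groups), [len(group) for group in groups])
```

Each group would then get its own bar chart. Sorting the locations by value before chunking is one option if the groups should be comparable in magnitude.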

Let's update our 3x3 grid.

Location/Time  One              Some        All
One            Single number    Line chart  Focus+context chart
Some           Bar chart        ?           ?
All            Small multiples  ?           ?

Let's consider some locations, some times: the five countries with the largest mean GDP per capita from 2006 to 2015. We will use a multi-series line chart, with each series encoded using a different named colour, or colour hue. We can distinguish only six to 12 colour hues (Ware, 2008), so we will be able to distinguish between the five lines.

In [13]:
index_top_5 = df.loc[(slice(None), slice('2006', '2015')), :].groupby('country').mean().sort_values('NY-GDP-PCAP-CD', ascending=False).iloc[:5].index
In [14]:
df_top_5 = df.loc[(index_top_5, slice('2006', '2015')), :]
In [15]:
alt.Chart(df_top_5.reset_index()).mark_line().encode(x='year', y='NY-GDP-PCAP-CD', color='country')
Out[15]:

Let's update our 3x3 grid.

Location/Time  One              Some                     All
One            Single number    Line chart               Focus+context chart
Some           Bar chart        Multi-series line chart  ?
All            Small multiples  ?                        ?

We've been moving from the top left to the bottom right of the 3x3 grid. For the final three cells, let's move from the bottom right to the top left of the 3x3 grid and consider all locations, all times. We will use a matrix.

In [16]:
alt.Chart(df.reset_index()).mark_rect().encode(x='year', y='country', color='NY-GDP-PCAP-CD')
Out[16]:

Let's update our 3x3 grid.

Location/Time  One              Some                     All
One            Single number    Line chart               Focus+context chart
Some           Bar chart        Multi-series line chart  ?
All            Small multiples  ?                        Matrix

We're left with "all locations, some times" and "some locations, all times". I think these are the hardest cells to fill, because to fill them we really have to think about the trade-offs.

If we move from the top left to the bottom right of the 3x3 grid, then we could create another multi-series line chart. However, remember that we can distinguish only six to 12 colour hues (Ware, 2008). If we created another multi-series line chart, then we wouldn't be able to distinguish between all 20 locations by colour hue alone. Would we accept this trade-off?
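
The hue limit amounts to a simple check. In this sketch, the ceiling of 12 is the upper end of the range cited from Ware (2008), and fits_hue_budget is a hypothetical helper name.

```python
MAX_DISTINGUISHABLE_HUES = 12  # Upper end of the six-to-12 range cited from Ware (2008).


def fits_hue_budget(n_series, max_hues=MAX_DISTINGUISHABLE_HUES):
    """Return True if every series could get its own distinguishable colour hue."""
    return n_series <= max_hues


print(fits_hue_budget(5))   # The five lines in the earlier multi-series line chart.
print(fits_hue_budget(20))  # All 20 locations.
```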

If we move from the bottom right to the top left of the 3x3 grid, then we could create another matrix. However, the combination of colour hue, colour luminance, and colour saturation is relatively ineffective compared to the other visual channels (Munzner, 2014). For "all locations, all times", we accept relatively ineffective visual channels in exchange for a compact single-view visualisation. Would we accept this trade-off for "all locations, some times" and "some locations, all times"?

We know that the cardinalities of the location and time variables are small, so we will use a multi-series line chart for both "all locations, some times" and "some locations, all times". However, we will use interaction, rather than colour hue, to distinguish between the locations. If you mouse over a line, then the line will be highlighted. If you hover over a point on a line, then its country, year, and value will appear in a tooltip. We will also reduce the opacity of each line, so that dense bunches of lines appear darker than sparse ones.

In [17]:
highlight = alt.selection_single(on='mouseover', nearest=True)
In [18]:
alt.Chart(df.reset_index()).mark_line().encode(
    x='year',
    y='NY-GDP-PCAP-CD',
    opacity=alt.condition(~highlight, alt.value(.25), alt.value(1)),
    detail='country',
    tooltip=['country', 'year', 'NY-GDP-PCAP-CD'],
).add_selection(highlight)
Out[18]:

Let's update our 3x3 grid.

Location/Time  One              Some                                 All
One            Single number    Line chart                           Focus+context chart
Some           Bar chart        Multi-series line chart              Interactive multi-series line chart
All            Small multiples  Interactive multi-series line chart  Matrix

Conclusion

Here, I've attempted to scaffold a small part of the visualisation design space by working through the nine pairwise combinations of cardinalities (one, some, and all) for two variables. By being systematic, hopefully we've reassured ourselves that we haven't missed an obvious point design.

References

Munzner, Tamara (2014). Visualization Analysis and Design. A.K. Peters Visualization Series, CRC Press.

Ware, Colin (2008). Visual Thinking for Design. Morgan Kaufmann.