In a 1973 paper, "Graphs in Statistical Analysis", published in The American Statistician, Vol. 27, No. 1 (Feb., 1973), pp. 17-21, statistician Francis Anscombe provided the briefest of abstracts: "Graphs are essential to good statistical analysis".
His paper opened with a brief meditation on the usefulness of graphs:
Most textbooks on statistical methods, and most statistical computer programs, pay too little attention to graphs. Few of us escape being indoctrinated with these notions:
A computer should make both calculations and graphs. Both sorts of output should be studied; each will contribute to understanding.
Graphs can have various purposes, such as: (i) to help us perceive and appreciate some broad features of the data, (ii) to let us look behind those broad features and see what else is there. Most kinds of statistical calculation rest on assumptions about the behavior of the data. Those assumptions may be false, and then the calculations may be misleading. We ought always to try to check whether the assumptions are reasonably correct; and if they are wrong we ought to be able to perceive in what ways they are wrong. Graphs are very valuable for these purposes.
Good statistical analysis is not a purely routine matter, and generally calls for more than one pass through the computer. The analysis should be sensitive both to peculiar features in the given numbers and also to whatever background information is available about the variables. The latter is particularly helpful in suggesting alternative ways of setting up the analysis. Thought and ingenuity devoted to devising good graphs are likely to pay off. Many ideas can be gleaned from the literature...
To illustrate his call to arms, Anscombe generated a set of four simple pairwise datasets (sets I, II, III and IV below, each with an x and a y value) intended to demonstrate the usefulness of looking at graphs.
import pandas as pd
We can read the data for Anscombe's quartet in from a data file to a hierarchically indexed dataframe.
aq=pd.read_csv('data/anscombesQuartet_hier.csv',header=[0,1],index_col=[0])
aq
The summary statistical properties of the datasets I to IV hardly varied. The means were identical for x and y across the groups, and the variances were all but indistinguishable.
aq.mean()
aq.var()
Other statistical properties, such as regression lines, were also the same.
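As a quick sketch of that claim, we can fit a least-squares line to each dataset ourselves. The values below are typed in from Anscombe's published quartet rather than read from the data file, so check them against your own copy of the data if you follow along:

```python
import numpy as np

# Anscombe's quartet values, as published in the 1973 paper
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # x is shared by sets I-III
quartet = {
    'I':   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    'II':  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    'III': (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    'IV':  ([8]*7 + [19] + [8]*3,
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for group, (x, y) in quartet.items():
    # fit a degree-1 polynomial, i.e. a least-squares straight line
    slope, intercept = np.polyfit(x, y, 1)
    print(f"{group}: y = {intercept:.2f} + {slope:.3f}x")
```

Each group gives essentially the same fitted line, y = 3.00 + 0.500x, despite the datasets looking (as we shall see) nothing like each other.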
So from these summary statistics, we might conclude that the datasets are, to all intents and purposes, the same.
But what if we look at them?
The most natural way to look at this data is to use a scatterplot, with the x-values placed along a continuous horizontal axis and the y-values against the vertical axis. Each point is plotted as a mark using its x and y values within each group as its Cartesian co-ordinates.
Using ggplot, we can construct the plot directly from the dataframe, if it's correctly shaped...
The easiest way to plot this quartet is to generate a dataframe that has a column containing the group number and then columns for the corresponding x and y pairs. We can then generate what is known as a faceted plot over the groups, generating one chart per group.
In order to do this, we need to reshape the dataframe. One way of doing this would be to use OpenRefine and a combination of transpose operations. (If you would like to try this, I have provided the data for Anscombe's quartet in another form that is perhaps easier to read into OpenRefine: Anscombe's quartet - simple CSV; see a rough draft walkthrough of how to reshape Anscombe's Quartet using OpenRefine here.) Another approach would be to use pandas, as we shall see here.
It's easy enough to melt the original dataframe into a long form that we can then reshape back to a slightly wider form, but we also need to create a new index column that will allow us to align the data within each group without causing a duplicate index clash.
One solution is to generate a sequence of index values from 0..10 for each set of x and y values. This will be used along with the group value to create index values in the unmelted dataframe.
tmp=pd.melt(aq)
tmp['index']=list(range(int(len(tmp)/8)))*8
tmp[8:15]
If we set a hierarchical index on the group, index and var columns, we can then unstack() the final var column. We then need to tidy up the column names to remove the upper hierarchical level.
df=tmp.set_index(['group','index', 'var']).unstack()
df.columns = [col[1].strip() for col in df.columns.values]
df[7:15]
Finally, we can reset the index to give us the simple dataframe representation we require.
df.reset_index(inplace=True)
df[7:15]
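The whole reshaping pipeline can be seen in miniature in the following self-contained sketch, which uses a small made-up hierarchical frame in place of the CSV file (the values and group labels are stand-ins, but the melt / set_index / unstack steps are the same as above):

```python
import pandas as pd

# A toy hierarchically-columned frame with the same shape as aq:
# top column level = group, second level = variable name
cols = pd.MultiIndex.from_product([['I', 'II'], ['x', 'y']],
                                  names=['group', 'var'])
toy = pd.DataFrame([[10, 8.04, 10, 9.14],
                    [8, 6.95, 8, 8.14]], columns=cols)

# melt to long form: one row per (group, var, value) triple
tmp = pd.melt(toy)

# add a row-aligning index within each (group, var) block
n_blocks = len(tmp) // len(toy)          # number of (group, var) columns
tmp['index'] = list(range(len(toy))) * n_blocks

# pivot var back out into separate x and y columns
df = tmp.set_index(['group', 'index', 'var']).unstack()
df.columns = [col[1] for col in df.columns.values]
df = df.reset_index()
print(df)
```

The result has one row per point, with group, index, x and y columns, which is exactly the "right sort of shape" the faceted plots below require.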
We are now in a position to plot the data.
from ggplot import *
Before we show the data, let's see how the linear regression lines compare across the different data sets.
ggplot(aes(x='x', y='y'), data=df)+facet_wrap('group',scales='fixed')+stat_smooth(method='lm',se=False)
We can also look to see how the regression lines compare with 95% confidence limits.
ggplot(aes(x='x', y='y'), data=df)+facet_wrap('group',scales='fixed')+stat_smooth(method='lm')
Now we're starting to see some differences...
Finally, let's look at the actual data points themselves:
ggplot(aes(x='x', y='y'), data=df)+geom_point()+facet_wrap('group')
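If the ggplot library isn't available, much the same faceted scatterplot can be sketched directly with matplotlib; here a couple of made-up points stand in for the reshaped dataframe built above:

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, for scripted use
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in data in the same long shape as df (group, x, y columns)
df = pd.DataFrame({'group': ['I', 'I', 'II', 'II'],
                   'x': [10, 8, 10, 8],
                   'y': [8.04, 6.95, 9.14, 8.14]})

# One subplot per group, with shared axes so the panels are comparable
groups = sorted(df['group'].unique())
fig, axes = plt.subplots(1, len(groups), sharex=True, sharey=True,
                         figsize=(4 * len(groups), 3))
for ax, g in zip(axes, groups):
    sub = df[df['group'] == g]
    ax.scatter(sub['x'], sub['y'])
    ax.set_title(g)
fig.savefig('anscombe_facets.png')
```

This is more verbose than the single-line ggplot expression, which rather makes the point about the power of a grammar-of-graphics approach.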
Now we can finally see how distinct these data sets actually are, each with its own story to tell, but not stories we would have been alerted to from the simple summary statistics.
Anscombe's quartet, though only a small dataset, offers a salutary lesson. The summary statistics for the x and y values across each group may be the same, and a quick look at the data tables is hard to interpret with any degree of certainty, but when visualised as a whole, each group of data clearly tells a different story.
Working back from the ggplot commands, we see how straightforward it can be to generate quite a complex plot, or set of plots, from a relatively simple command. However, in order to be able to "write" this chart, we need to get the data into the right sort of shape. And that may be quite an involved process.
In many situations, preparing the data (which may include cleaning it) may take much more time than the actual analysis or visualisation. But that is the price we pay for being able to use such powerful analysis and visualisation tools.
Anscombe concluded his paper as follows:
Graphical output such as described above is readily available to anyone who does his own programming. I myself habitually generate such plots at an APL terminal, and have come to appreciate their importance. A skilled Fortran or PL/1 programmer, with an organized library of subroutines, can do the same (on a larger scale). Unfortunately, most persons who have recourse to a computer for statistical analysis of data are not much interested either in computer programming or in statistical method, being primarily concerned with their own proper business. Hence the common use of library programs and various statistical packages. Most of these originated in the pre-visual era. The user is not showered with graphical displays. He can get them only with trouble, cunning and a fighting spirit. It's time that was changed.
Computational techniques have moved on somewhat since 1973, of course, and times have indeed changed. Graphical displays are everywhere, and libraries such as ggplot that are rooted in a sound grammatical basis mean that we are now in a position to "write" very powerful expressions that can generate statistical graphics for us, directly from a cleaned and prepared dataset, using just a few well chosen phrases. But getting the data into the right shape may still require significant amounts of trouble, cunning and a fighting spirit.
May the visualisations begin...
DO NOT REMOVE THIS NOTICE/CELL FROM THIS IPYTHON NOTEBOOK
This notebook was prepared for use in a course on data analysis and data management due to be released by The Open University in October 2015 and is made available AS IS, and IN DRAFT FORM ONLY. It may be used for educational purposes only.
Comments to: tony.hirst@open.ac.uk