Content under Creative Commons Attribution license CC-BY 4.0, code under BSD 3-Clause License © 2017 L.A. Barba, N.C. Clementi

Seeing stats in a new light

Welcome to the second lesson in "Take off with stats," Module 2 of our course in Engineering Computations. In the previous lesson, Cheers! Stats with Beers, we did some exploratory data analysis with a data set of canned craft beers in the US [1]. We'll continue using that same data set here, but with a new focus on visualizing statistics.

In her lecture "Looking at Data", Prof. Kristin Sainani says that you should always plot your data. Immediately, several things can come to light: are there outliers in your data? (Outliers are data points that look abnormally far from other values in the sample.) Are there data points that don't make sense? (Errors in data entry can be spotted this way.) And especially, you want to get a visual representation of how data are distributed in your sample.

In this lesson, we'll play around with different ways of visualizing data. There are so many ways to play! Have a look at the gallery of The Data Viz Project by ferdio (a data viz agency in Copenhagen). Aren't those gorgeous? Wouldn't you like to be able to make some pretty pics like that? Python can help!

Let's begin. We'll import our favorite Python libraries, and set some font parameters for our plots to look nicer. Then we'll load our data set for craft beers and begin!

In [1]:
import numpy
import pandas
from matplotlib import pyplot
%matplotlib inline

#Import rcParams to set font styles
from matplotlib import rcParams

#Set font style and size 
rcParams[''] = 'serif'
rcParams['font.size'] = 16
In [2]:
# Load the beers data set using pandas, and assign it to a dataframe
beers = pandas.read_csv("../../data/beers.csv")

If you downloaded this notebook alone, and don't have the data file on the location we assume above, get it by adding a code cell, and execute the following code in it:

from urllib.request import urlretrieve
URL = ''
urlretrieve(URL, 'beers.csv')

The data file will be downloaded to your working directory, and you will then need to remove the path information, i.e., the string '../../data/', in the code to read it.

OK. Let's have a look at the first few rows of the pandas dataframe we just created from the file, and confirm that it's a dataframe using the type() function. We only display the first 10 rows to save some space.

In [3]:
In [4]:
Unnamed: 0 abv ibu id name style brewery_id ounces
0 0 0.050 NaN 1436 Pub Beer American Pale Lager 408 12.0
1 1 0.066 NaN 2265 Devil's Cup American Pale Ale (APA) 177 12.0
2 2 0.071 NaN 2264 Rise of the Phoenix American IPA 177 12.0
3 3 0.090 NaN 2263 Sinister American Double / Imperial IPA 177 12.0
4 4 0.075 NaN 2262 Sex and Candy American IPA 177 12.0
5 5 0.077 NaN 2261 Black Exodus Oatmeal Stout 177 12.0
6 6 0.045 NaN 2260 Lake Street Express American Pale Ale (APA) 177 12.0
7 7 0.065 NaN 2259 Foreman American Porter 177 12.0
8 8 0.055 NaN 2258 Jade American Pale Ale (APA) 177 12.0
9 9 0.086 NaN 2131 Cone Crusher American Double / Imperial IPA 177 12.0

Quantitative vs. categorical data

As you can see in the nice table that pandas printed for the dataframe, we have several features for each beer: the label abv corresponds to the acohol-by-volume fraction, label ibu refers to the international bitterness unit (IBU), then we have the name of the beer and the style, the brewery ID number, and the liquid volume of the beer can, in ounces.

Alcohol-by-volume is a numeric feature: a volume fraction, with possible values from 0 to 1 (sometimes also given as a percentage). In the first 10 rows of our dataframe, the ibu value is missing (all those NaNs), but we saw in the previous lesson that ibu is also a numeric feature, with values that go from a minimum of 4 to a maximum of 138 (in our data set). IBU is pretty mysterious: how do you measure the bitterness of beer? It turns out that bitterness is measured as parts per million of isohumulone, the acid found in hops [2]. Who knew?

For these numeric features, we learned that we can get an idea of the central tendency in the data using the mean value, and we get ideas of spread of the data with the standard deviation (and also with the range, but standard deviation is the most common).

Notice that the beer data also has a feature named style: it can be "American IPA" or "American Porter" or a bunch of other styles of beer. If we want to study the beers according to style, we'll have to come up with some new ideas, because you can't take the mean or standard deviation of this feature!

Quantitative data have meaning through a numeric feature, either on a continuous scale (like a fraction from 0 to 1), or a discrete count. Categorical data, in contrast, have meaning through a qualitative feature (like the style of beer). Data in this form can be collected in groups (categories), and then we can count the number of data items in that group. For example, we could ask how many beers (in our set) are of the style "American IPA," or ask how many beers we have in each style.

Visualizing quantitative data

In the previous lesson, we played around a bit with the abv and ibu columns of the dataframe beers. For each of these columns, we extracted it from the dataframe and saved it into a pandas series, then we used the dropna() method to get rid of null values. This "clean" data was our starting point for some exploratory data analysis, and for plotting the data distributions using histograms. Here, we will add a few more ingredients to our recipes for data exploration, and we'll learn about a new type of visualization: the box plot.

Let's repeat here the process for extracting and cleaning the two series, and getting the values into NumPy arrays:

In [5]:
#Repeat cleaning values abv
abv_series = beers['abv']
abv_clean = abv_series.dropna()
abv = abv_clean.values
In [6]:
#Repeat cleaning values ibu
ibu_series = beers['ibu']
ibu_clean = ibu_series.dropna()
ibu = ibu_clean.values

Let's also repeat a histogram plot for the abv variable, but this time choose to plot just 10 bins (you'll see why in a moment).

In [7]:
pyplot.hist(abv, bins=10, color='#3498db', histtype='bar', edgecolor='white') 
pyplot.title('Alcohol by Volume (abv) \n')

You can tell that the most frequent values of abv fall in the bin just above 0.05 (5% alcohol), and the bin below. The mean value of our data is 0.06, which happens to be within the top-frequency bin, but data is not always so neat (sometimes, extreme values weigh heavily on the mean). Note also that we have a right skewed distribution, with higher-frequency bins occuring in the lower end of the range than in the higher end.

If you played around with the bin sizes in the previous lesson, you might have noticed that with a lot of bins, it becomes harder to visually pick out the patterns in the data. But if you use too few bins, the plot is also unhelpful. What number of bins is just right? Well, it depends on your data, so you'll just have to experiment and use your best judgement.

Let's learn a new trick. It turns out that pandas has built-in methods to make histograms directly from columns of a dataframe! (It uses Matplotlib internally for that.) The syntax is short and sweet:


And pandas plots a pretty nice histogram without help. You can add optional parameters to set these to your liking; see the documentation. Check it out, and compare with our previous plot.

In [8]:
beers.hist(column='abv', edgecolor='white')
pyplot.title('Alcohol by Volume (abv) \n');

Which one do you like better? Well, the pandas histogram took fewer lines of code to create. And it doesn't look bad at all. But we do have more fine-grained control with Matplotlib. Which method you choose in a real situation will just depend on the situation and your preference.

Exploring quantitative data (continued)

In the previous lesson, you learned how to compute the mean of the data using numpy.mean(). How easy is that? But then we wrote our own custom functions to compute variance or standard deviation. It shouldn't surprise you by now that there are also NumPy functions for that!

  • Go to the documentation of numpy.var() and analyze if this function is computing the sample variance. Hint: Check what it says about the "data degrees of freedom."

If you did the reading, you might have noticed that, by default, the argument ddof in numpy.var() is set to zero. If we use the default option, then we are not really calculating the sample variance. Recall from the previous lesson that the sample variance is:

$$ \begin{equation*} \text{var}_{sample} = \frac{1}{N-1}\sum_{i} (x_i - \bar{x})^2 \end{equation*} $$

Therefore, we need to be explicit about the division by $N-1$ when calling numpy.var(). How do we do that? We explicitly set ddof to 1.

For example, to compute the sample variance for our abv variable, we do:

In [9]:
var_abv = numpy.var(abv, ddof=1)

Now we can compute the standard deviation by taking the square root of var_abv:

In [10]:
std_abv = numpy.sqrt(var_abv)

You might be wondering if there is a built-in function for the standard deviation in NumPy. Go on and search online and try to find something.

Spoiler alert! You will.

  1. Read the documentation about the NumPy standard deviation function, compute the standard deviation for abv using this function, and check that you obtained the same value than if you take the square root of the variance computed with NumPy.

  2. Compute the variance and standard deviation for the variable ibu.

Median value

So far, we've learned to characterize quantitative data using the mean, variance and standard deviation.

If you watched Prof. Sainani's lecture Describing Quantitative Data: Where is the center? (recommended in our previous lesson), you'll recall that she also introduced the median: the middle value in the data, the value that separates your data set in half. (If there's an even number of data values, you take the average between the two middle values.)

As you may anticipate, NumPy has a built-in function that computes the median, helpfully named numpy.median().


Using NumPy, compute the median for our variables abv and ibu. Compare the median with the mean, and look at the histogram to locate where the values fall on the x-axis.

Box plots

Another handy way to visualize the distribution of quantitative data is using box plots. By "distribution" of the data, we mean some idea of the dataset's "shape": where is the center, what is the range, what is the variation in the data. Histograms are the most popular type of plots in exploratory data analysis. But check out box plots: they are easy to make with pyplot:

In [11]:
pyplot.boxplot(abv, labels=['Alcohol by volume']);
In [12]:
pyplot.boxplot(ibu, labels=['International bitterness unit']);

What is going on here? Obviously, there is a box: it represents 50% of the data in the middle of the data range, with the line across it (here, in orange) indicating the median.

The bottom of the box is at the 25th percentile, while the top of the box is at the 75th percentile. In other words, the bottom 25% of the data falls below the box, and the top 25% of the data falls above the box.

Confused by percentiles? The Nth percentile is the value below which N% of the observations fall.

Recall the bell curve from our previous lesson: we said that 95% of the data falls at a distance $\pm 2 \sigma$ from the mean. This implies that 5% of the data (the rest) falls in the (symmetrical) tails, which in turn implies that the 2.5 percentile is at $-2\sigma$ from the mean, and the 97.5 percentile is at $+2\sigma$ from the mean.

The percentiles 25, 50, and 75 are also named quartiles, since they divide the data into quarters. They are named first (Q1), second (Q2 or median) and third quartile (Q3), respectively.

Fortunately, NumPy has a function to compute percentiles and we can do it in just one line. Let's use numpy.percentile() to compute the abv and ibu quartiles.

abv quartiles

In [13]:
Q1_abv = numpy.percentile(abv, q=25)
Q2_abv = numpy.percentile(abv, q=50)
Q3_abv = numpy.percentile(abv, q=75)

print('The first quartile for abv is {}'.format(Q1_abv))
print('The second quartile for abv is {}'.format(Q2_abv))
print('The third quartile for abv is {}'.format(Q3_abv))
The first quartile for abv is 0.05
The second quartile for abv is 0.056
The third quartile for abv is 0.067

ibu quartiles

You can also pass a list of percentiles to numpy.percentile() and calculate several of them in one go. For example, to compute the quartiles of ibu, we do:

In [14]:
quartiles_ibu = numpy.percentile(ibu, q=[25, 50, 75])

print('The first quartile for ibu is {}'.format(quartiles_ibu[0]))
print('The second quartile for ibu is {}'.format(quartiles_ibu[1]))
print('The third quartile for ibu is {}'.format(quartiles_ibu[2]))
The first quartile for ibu is 21.0
The second quartile for ibu is 35.0
The third quartile for ibu is 64.0

OK, back to box plots. The height of the box—between the 25th and 75th percentile—is called the interquartile range (IQR). Outside the box, you have two vertical lines—the so-called "whiskers" of the box plot—which used to be called "box and whiskers plot" [3].

The whiskers extend to the upper and lower extremes (short horizontal lines). The extremes follow the following rules:

  • Top whisker: lower value between the maximum and Q3 + 1.5 x IQR.
  • Bottom whisker: higher value between the minimum and Q1 - 1.5 x IQR

Any data values beyond the upper and lower extremes are shown with a marker (here, small circles) and are an indication of outliers in the data.


Calculate the end-points of the top and bottom whiskers for both the abv and ibu variables, and compare the results with the whisker end-points you see in the plot.

A bit of history:

"Box-and-whiskers" plots were invented by John Tukey over 45 years ago. Tukey was a famous mathematician/statistician who is credited with coining the words software and bit [4]. He was active in the efforts to break the Enigma code durig WWII, and worked at Bell Labs in the first surface-to-air guided missile ("Nike"). A classic 1947 work on early design of the electonic computer acknowledged Tukey: he designed the electronic circuit for computing addition. Tukey was also a long-time advisor for the US Census Bureau, and a consultant for the Educational Testing Service (ETS), among many other contributions [5].


Box plots are also drawn horizontally. Often, several box plots are drawn side-by-side with the purpose of comparing distributions.

Visualizing categorical data

The typical method of visualizing categorical data is using bar plots. These show visually the frequency of appearance of items in each category, or the proportion of data in each category. Suppose we wanted to know how many beers of the same style are in our data set. Remember: the style of the beer is an example of categorical data. Let's extract the column with the style information from the beers dataframe, assign it to a variable named style_series, check the type of this variable, and view the first 10 elements.

In [15]:
style_series = beers['style']
In [16]:
In [17]:
0               American Pale Lager
1           American Pale Ale (APA)
2                      American IPA
3    American Double / Imperial IPA
4                      American IPA
5                     Oatmeal Stout
6           American Pale Ale (APA)
7                   American Porter
8           American Pale Ale (APA)
9    American Double / Imperial IPA
Name: style, dtype: object

Already in the first 10 elements we see that we have two beers of the style "American IPA," two beers of the style "American Pale Ale (APA)," but only one beer of the style "Oatmeal Stout." The question is: how many beers of each style are contained in the whole series?

Luckily, pandas has a built-in function to answer that question: series.value_counts() (where series is the variable name of the pandas series you want the counts for). Let's try it on our style_series, and save the result in a new variable named style_counts.

In [18]:
style_counts = style_series.value_counts()
American IPA                      424
American Pale Ale (APA)           245
American Amber / Red Ale          133
American Blonde Ale               108
American Double / Imperial IPA    105
Name: style, dtype: int64
In [19]:
In [20]:

The len() function tells us that style_counts has 99 elements. That is, there are a total of 99 styles of beer in our data set. Wow, that's a lot!

Notice that value_counts() returned the counts sorted in decreasing order: the most popular beer in our data set is "American IPA" with 424 entries in our data. The next-most popular beer is "American Pale Ale (APA)" with a lot fewer entries (245), and the counts decrease sharply after that. Naturally, we'd like to know how much more popular are the top-2 beers from the rest. Bar plot to the rescue!

Below, we'll draw a horizontal bar plot directly with pandas (which uses Matplotlib internally) using the plot.barh() method for series. We'll only show the first 20 beers, because otherwise we'll get a huge plot. This plot gives us a clear visualization of the popularity ranking of beer styles in the US!

In [21]:
style_counts[0:20].plot.barh(figsize=(10,8), color='#008367', edgecolor='gray');

Visualizing multiple data

These visualizations are really addictive! We're now getting ambitious: what if we wanted to show more than one feature, together on the same plot? What if we wanted to get insights about the relationship between two features through a multi-variable plot?

For example, don't you want to know if the bitterness of beers is associated with the alcohol-by-volume fraction? We do!

Scatter plots

Maybe we can do this: imagine a plot that has the alcohol-by-volume on the absissa, and the IBU value on the ordinate. For each beer, we can place a dot on this plot with its abv and ibu values as $(x, y)$ coordinates. This is called a scatter plot.

We run into a bit of a problem, though. The way we handled the beer data above, we extracted the column for abv into a series, dropped the null entries, and saved the values into a NumPy array. We then repeated this process for the ibu column. Because a lot more ibu values are missing, we ended up with two arrays of different length: 2348 entries for the abv series, and 1405 entries for the ibu series. If we want to make a scatter plot with these two features, we'll need series (or arrays) of the same length.

Let's instead clean the whole beers dataframe (which will completely remove any row that has a null entry), and then extract the values of the two series into NumPy arrays.

In [22]:
beers_clean = beers.dropna()
In [23]:
ibu = beers_clean['ibu'].values
In [24]:
abv = beers_clean['abv'].values

Notice that both arrays now have 1403 entries—not 1405 (the length of the clean ibu data), because two rows that had a non-null ibu value did have a null abv value and were dropped.

With the two arrays of the same length, we can now call the pyplot.scatter() function.

In [25]:
pyplot.scatter(abv, ibu, color='#3498db') 
pyplot.title('Scatter plot of alcohol-by-volume vs. IBU \n')

Hmm. That's a bit of a mess. Too many dots! But we do make out that the beers with low alcohol-by-volume tend to have low bitterness. For higher alcohol fraction, the beers can be anywhere on the bitterness scale: there's a lot of vertical spread on those dots to the right of the plot.

An idea! What if the bitterness has something to do with style? Neither of us knows much about beer, so we have no clue. Could we explore this question with visualization? We found a way!

Bubble chart

What we imagined is that we could group together the beers by style, and then make a new scatter plot where each marker corresponds to a style. The beers within a style, though, have many values of alcohol fraction and bitterness: we have to come up with a "summary value" for each style. Well, why not the mean… we can calculate the average abv and the average ibu for all the beers in each style, use that pair as $(x,y)$ coordinate, and put a dot there representing the style.

Better yet! We'll make the size of the "dot" proportional to the popularity of the style in our data set! This is called a bubble chart.

How to achieve this idea? We searched online for "mean of a column with pandas" and we landed in dataframe.mean(). This could be helpful… But we don't want the mean of a whole column—we want the mean of the column values grouped by style. Searching online again, we landed in dataframe.groupby(). This is amazing: pandas can group a series for you!

Here's what we want to do: group beers by style, then compute the mean of abv and ibu in the groups. We experimented with beers_clean.groupby('style').mean() and were amazed… However, one thing was bothersome: pandas computed the mean (by style) of every column, including the id and brewery_id, which have no business being averaged. So we decided to first drop the columns we don't need, leaving only abv, ibu and style. We can use the dataframe.drop() method for that. Check it out!

In [26]:
beers_styles = beers_clean.drop(['Unnamed: 0','name','brewery_id','ounces','id'], axis=1)
In [27]:
abv ibu style
14 0.061 60.0 American Pale Ale (APA)
21 0.099 92.0 American Barleywine
22 0.079 45.0 Winter Warmer
24 0.044 42.0 American Pale Ale (APA)
25 0.049 17.0 Fruit / Vegetable Beer
26 0.049 17.0 Fruit / Vegetable Beer
27 0.049 17.0 Fruit / Vegetable Beer
28 0.070 70.0 American IPA
29 0.070 70.0 American IPA
30 0.070 70.0 American IPA

We now have a dataframe with only the numeric features abv and ibu, and the categorical feature style. Let's find out how many beers we have of each style—we'd like to use this information to set the size of the style bubbles.

In [28]:
style_counts = beers_styles['style'].value_counts()
In [29]:
American IPA                      301
American Pale Ale (APA)           153
American Amber / Red Ale           77
American Double / Imperial IPA     75
American Blonde Ale                61
American Pale Wheat Ale            61
American Porter                    39
American Brown Ale                 38
Fruit / Vegetable Beer             30
Hefeweizen                         27
Name: style, dtype: int64
In [30]:
In [31]:

The number of beers in each style appears on each row of style_counts, sorted in decreasing order of count. We have 90 different styles, and the most popular style is the "American IPA," with 301 beers…

Discuss with your neighbor:
  • What happened? We used to have 99 styles and 424 counts in the "American IPA" style. Why is it different now?

OK. We want to characterize each style of beer with the mean values of the numeric features, abv and ibu, within that style. Let's get those means.

In [32]:
style_means = beers_styles.groupby('style').mean()
In [33]:
abv ibu
Abbey Single Ale 0.049000 22.000000
Altbier 0.054625 34.125000
American Adjunct Lager 0.046545 11.000000
American Amber / Red Ale 0.057195 36.298701
American Amber / Red Lager 0.048063 23.250000
American Barleywine 0.099000 96.000000
American Black Ale 0.073150 68.900000
American Blonde Ale 0.050148 20.983607
American Brown Ale 0.057842 29.894737
American Dark Wheat Ale 0.052200 27.600000

Looking good! We have the information we need: the average abv and ibu by style, and the counts by style. The only problem is that style_counts is sorted by decreasing count value, while style_means is sorted alphabetically by style. Ugh.

Notice that style_means is a dataframe that is now using the style string as a label for each row. Meanwhile, style_counts is a pandas series, and it also uses the style as label or index to each element.

More online searching and we find the series.sort_index() method. It will sort our style counts in alphabetical order of style, which is what we want.

In [34]:
style_counts = style_counts.sort_index()
In [35]:
Abbey Single Ale               2
Altbier                        8
American Adjunct Lager        11
American Amber / Red Ale      77
American Amber / Red Lager    16
American Barleywine            2
American Black Ale            20
American Blonde Ale           61
American Brown Ale            38
American Dark Wheat Ale        5
Name: style, dtype: int64

Above, we used Matplotlib to create a scatter plot using two NumPy arrays as the x and y parameters. Like we saw previously with histograms, pandas also has available some plotting methods (calling Matplotlib internally). Scatter plots made easy!

In [36]:
                         x='abv', y='ibu', s=style_counts, 
                         title='Beer ABV vs. IBU mean values by style');

That's rad! Perhaps the bubbles are too small. We could multiply the style_counts by a factor of 5, or maybe 10? You should experiment.

But we are feeling gung-ho about this now, and decided to find a way to make the color of the bubbles also vary with the style counts. Below, we import the colormap module of Matplotlib, and we set our colors using the viridis colormap on the values of style_counts, then we repeat the plot with these colors on the bubbles and some transparency. What do you think?

In [37]:
from matplotlib import cm
colors = cm.viridis(style_counts.values)
In [38]:
                         x='abv', y='ibu', s=style_counts*20, color=colors,
                         title='Beer ABV vs. IBU mean values by style\n',
                         alpha=0.3); #alpha sets the transparency

It looks like the most popular beers do follow a linear relationship between alcohol fraction and IBU. We learned a lot about beer without having a sip!

Wait… one more thing! What if we add a text label next to the bigger bubbles, to identify the style?

OK, here we go a bit overboard, but we couldn't help it. We played around a lot to get this version of the plot. It uses enumerate to get pairs of indices and values from a list of style names; an if statement to select only the large-count styles; and the iloc[] slicing method of pandas to get a slice based on index position, and extract abv and ibu values to an $(x,y)$ coordinate for placing the annotation text. Are we overkeen or what!

In [39]:
ax = style_means.plot.scatter(figsize=(10,10), 
                               x='abv', y='ibu', s=style_counts*20, color=colors,
                               title='Beer ABV vs. IBU mean values by style\n',

for i, txt in enumerate(list(style_counts.index.values)):
    if style_counts.values[i] > 65:
        ax.annotate(txt, (style_means.abv.iloc[i],style_means.ibu.iloc[i]), fontsize=12)

What we've learned

  • You should always plot your data.
  • The concepts of quantitative and categorical data.
  • Plotting histograms directly on columns of dataframes, using pandas.
  • Computing variance and standard deviation using NumPy built-in functions.
  • The concept of median, and how to compute it with NumPy.
  • Making box plots using pyplot.
  • Five statistics of a box plot: the quartiles Q1, Q2 (median) and Q3 (and interquartile range Q3$-$Q1), upper and lower extremes.
  • Visualizing categorical data with bar plots.
  • Visualizing multiple data with scatter plots and bubble charts.
  • pandas is awesome!


  1. Craft beer datatset by Jean-Nicholas Hould.
  2. What's The Meaning Of IBU? by Jim Dykstra for The Beer Connoisseur (2015).
  3. 40 years of boxplots (2011). Hadley Wickham and Lisa Stryjewski, Am. Statistician. PDF available
  4. John Wilder Tukey, Encyclopædia Britannica.
  5. John W. Tukey: His life and professional contributions (2002). David R. Brillinger, Ann. Statistics. PDF available

From "Statistics in Medicine,", a free course in Stanford Online by Prof. Kristin Sainani, we highly recommend that you watch this lecture:

In [40]:
# Execute this cell to load the notebook's style sheet, then ignore it
from IPython.core.display import HTML
css_file = '../../style/custom.css'
HTML(open(css_file, "r").read())