Welcome to the second lesson in "Take off with stats," Module 2 of our course in *Engineering Computations*. In the previous lesson, Cheers! Stats with Beers, we did some exploratory data analysis with a data set of canned craft beers in the US [1]. We'll continue using that same data set here, but with a new focus on *visualizing statistics*.

In her lecture "Looking at Data", Prof. Kristin Sainani says that you should always plot your data. Immediately, several things can come to light: are there outliers in your data? (Outliers are data points that look abnormally far from other values in the sample.) Are there data points that don't make sense? (Errors in data entry can be spotted this way.) And especially, you want to get a *visual* representation of how data are distributed in your sample.

In this lesson, we'll play around with different ways of visualizing data. There are so many ways to play! Have a look at the gallery of The Data Viz Project by *ferdio* (a data viz agency in Copenhagen). Aren't those gorgeous? Wouldn't you like to be able to make some pretty pics like that? Python can help!

Let's begin. We'll import our favorite Python libraries, and set some font parameters for our plots to look nicer. Then we'll load our data set for craft beers and begin!

In [1]:

```
import numpy
import pandas
from matplotlib import pyplot
%matplotlib inline
#Import rcParams to set font styles
from matplotlib import rcParams
#Set font style and size
rcParams['font.family'] = 'serif'
rcParams['font.size'] = 16
```

In [2]:

```
# Load the beers data set using pandas, and assign it to a dataframe
beers = pandas.read_csv("../../data/beers.csv")
```

If you downloaded this notebook alone, and don't have the data file on the location we assume above, get it by adding a code cell, and execute the following code in it:

```
from urllib.request import urlretrieve
URL = 'http://go.gwu.edu/engcomp2data1?accessType=DOWNLOAD'
urlretrieve(URL, 'beers.csv')
```

The data file will be downloaded to your working directory, and you will then need to remove the path information, i.e., the string `'../../data/'`

, in the code to read it.

OK. Let's have a look at the first few rows of the `pandas`

dataframe we just created from the file, and confirm that it's a dataframe using the `type()`

function. We only display the first 10 rows to save some space.

In [3]:

```
type(beers)
```

Out[3]:

In [4]:

```
beers[0:10]
```

Out[4]:

As you can see in the nice table that `pandas`

printed for the dataframe, we have several features for each beer: the label `abv`

corresponds to the acohol-by-volume fraction, label `ibu`

refers to the international bitterness unit (IBU), then we have the `name`

of the beer and the `style`

, the brewery ID number, and the liquid volume of the beer can, in ounces.

Alcohol-by-volume is a numeric feature: a volume fraction, with possible values from 0 to 1 (sometimes also given as a percentage). In the first 10 rows of our dataframe, the `ibu`

value is missing (all those `NaN`

s), but we saw in the previous lesson that `ibu`

is also a numeric feature, with values that go from a minimum of 4 to a maximum of 138 (in our data set). IBU is pretty mysterious: how do you measure the bitterness of beer? It turns out that bitterness is measured as parts per million of *isohumulone*, the acid found in hops [2]. Who knew?

For these numeric features, we learned that we can get an idea of the *central tendency* in the data using the **mean value**, and we get ideas of *spread* of the data with the **standard deviation** (and also with the range, but standard deviation is the most common).

Notice that the beer data also has a feature named `style`

: it can be "American IPA" or "American Porter" or a bunch of other styles of beer. If we want to study the beers according to style, we'll have to come up with some new ideas, because you can't take the mean or standard deviation of this feature!

**Quantitative data** have meaning through a numeric feature, either on a continuous scale (like a fraction from 0 to 1), or a discrete count.
**Categorical data**, in contrast, have meaning through a qualitative feature (like the style of beer). Data in this form can be collected in groups (categories), and then we can count the number of data items in that group. For example, we could ask how many beers (in our set) are of the style "American IPA," or ask how many beers we have in each style.

In the previous lesson, we played around a bit with the `abv`

and `ibu`

columns of the dataframe `beers`

. For each of these columns, we extracted it from the dataframe and saved it into a `pandas`

series, then we used the `dropna()`

method to get rid of null values. This "clean" data was our starting point for some exploratory data analysis, and for plotting the data distributions using **histograms**. Here, we will add a few more ingredients to our recipes for data exploration, and we'll learn about a new type of visualization: the **box plot**.

Let's repeat here the process for extracting and cleaning the two series, and getting the values into NumPy arrays:

In [5]:

```
#Repeat cleaning values abv
abv_series = beers['abv']
abv_clean = abv_series.dropna()
abv = abv_clean.values
```

In [6]:

```
#Repeat cleaning values ibu
ibu_series = beers['ibu']
ibu_clean = ibu_series.dropna()
ibu = ibu_clean.values
```

Let's also repeat a histogram plot for the `abv`

variable, but this time choose to plot just 10 bins (you'll see why in a moment).

In [7]:

```
pyplot.figure(figsize=(6,4))
pyplot.hist(abv, bins=10, color='#3498db', histtype='bar', edgecolor='white')
pyplot.title('Alcohol by Volume (abv) \n')
pyplot.xlabel('abv')
pyplot.ylabel('Frequency');
```

You can tell that the most frequent values of `abv`

fall in the bin just above 0.05 (5% alcohol), and the bin below. The mean value of our data is 0.06, which happens to be within the top-frequency bin, but data is not always so neat (sometimes, extreme values weigh heavily on the mean). Note also that we have a *right skewed* distribution, with higher-frequency bins occuring in the lower end of the range than in the higher end.

If you played around with the bin sizes in the previous lesson, you might have noticed that with a lot of bins, it becomes harder to visually pick out the patterns in the data. But if you use too few bins, the plot is also unhelpful. What number of bins is just right? Well, it depends on your data, so you'll just have to experiment and use your best judgement.

Let's learn a new trick. It turns out that `pandas`

has built-in methods to make histograms directly from columns of a dataframe! (It uses Matplotlib internally for that.) The syntax is short and sweet:

`dataframe.hist(column='label')`

And `pandas`

plots a pretty nice histogram without help. You can add optional parameters to set these to your liking; see the documentation. Check it out, and compare with our previous plot.

In [8]:

```
beers.hist(column='abv', edgecolor='white')
pyplot.title('Alcohol by Volume (abv) \n');
```

Which one do you like better? Well, the `pandas`

histogram took fewer lines of code to create. And it doesn't look bad at all. But we do have more fine-grained control with Matplotlib. Which method you choose in a real situation will just depend on the situation and your preference.

In the previous lesson, you learned how to compute the mean of the data using `numpy.mean()`

. How easy is that? But then we wrote our own custom functions to compute variance or standard deviation. It shouldn't surprise you by now that there are also NumPy functions for that!

- Go to the documentation of
`numpy.var()`

and analyze if this function is computing the*sample variance*.**Hint**: Check what it says about the "data degrees of freedom."

If you did the reading, you might have noticed that, by default, the argument `ddof`

in `numpy.var()`

is set to zero. If we use the default option, then we are not really calculating the sample variance. Recall from the previous lesson that the **sample variance** is:

$$ \begin{equation*} \text{var}_{sample} = \frac{1}{N-1}\sum_{i} (x_i - \bar{x})^2 \end{equation*} $$

Therefore, we need to be explicit about the division by $N-1$ when calling `numpy.var()`

. How do we do that? We explicitly set `ddof`

to `1`

.

For example, to compute the sample variance for our `abv`

variable, we do:

In [9]:

```
var_abv = numpy.var(abv, ddof=1)
print(var_abv)
```

Now we can compute the standard deviation by taking the square root of `var_abv`

:

In [10]:

```
std_abv = numpy.sqrt(var_abv)
print(std_abv)
```

You might be wondering if there is a built-in function for the standard deviation in NumPy. Go on and search online and try to find something.

**Spoiler alert!**
You will.

Read the documentation about the NumPy standard deviation function, compute the standard deviation for

`abv`

using this function, and check that you obtained the same value than if you take the square root of the variance computed with NumPy.Compute the variance and standard deviation for the variable

`ibu`

.

So far, we've learned to characterize quantitative data using the mean, variance and standard deviation.

If you watched Prof. Sainani's lecture Describing Quantitative Data: Where is the center? (recommended in our previous lesson), you'll recall that she also introduced the **median**: the middle value in the data, the value that separates your data set in half. (If there's an even number of data values, you take the average between the two middle values.)

As you may anticipate, NumPy has a built-in function that computes the median, helpfully named `numpy.median()`

.

Using NumPy, compute the median for our variables `abv`

and `ibu`

. Compare the median with the mean, and look at the histogram to locate where the values fall on the x-axis.

Another handy way to visualize the distribution of quantitative data is using **box plots**. By "distribution" of the data, we mean some idea of the dataset's "shape": where is the center, what is the range, what is the variation in the data.
Histograms are the most popular type of plots in exploratory data analysis. But check out box plots: they are easy to make with `pyplot`

:

In [11]:

```
pyplot.boxplot(abv, labels=['Alcohol by volume']);
```

In [12]:

```
pyplot.boxplot(ibu, labels=['International bitterness unit']);
```

What is going on here? Obviously, there is a *box*: it represents 50% of the data in the middle of the data range, with the line across it (here, in orange) indicating the *median*.

The bottom of the box is at the 25th *percentile*, while the top of the box is at the 75th percentile. In other words, the bottom 25% of the data falls below the box, and the top 25% of the data falls above the box.

*Confused by percentiles?*
The Nth percentile is the value below which N% of the observations fall.

Recall the bell curve from our previous lesson: we said that 95% of the data falls at a distance $\pm 2 \sigma$ from the mean. This implies that 5% of the data (the rest) falls in the (symmetrical) tails, which in turn implies that the 2.5 percentile is at $-2\sigma$ from the mean, and the 97.5 percentile is at $+2\sigma$ from the mean.

The percentiles 25, 50, and 75 are also named *quartiles*, since they divide the data into quarters. They are named first (Q1), second (Q2 or median) and third quartile (Q3), respectively.

Fortunately, NumPy has a function to compute percentiles and we can do it in just one line. Let's use `numpy.percentile()`

to compute the `abv`

and `ibu`

quartiles.

** abv quartiles **

In [13]:

```
Q1_abv = numpy.percentile(abv, q=25)
Q2_abv = numpy.percentile(abv, q=50)
Q3_abv = numpy.percentile(abv, q=75)
print('The first quartile for abv is {}'.format(Q1_abv))
print('The second quartile for abv is {}'.format(Q2_abv))
print('The third quartile for abv is {}'.format(Q3_abv))
```

** ibu quartiles **

You can also pass a list of percentiles to `numpy.percentile()`

and calculate several of them in one go. For example, to compute the quartiles of `ibu`

, we do:

In [14]:

```
quartiles_ibu = numpy.percentile(ibu, q=[25, 50, 75])
print('The first quartile for ibu is {}'.format(quartiles_ibu[0]))
print('The second quartile for ibu is {}'.format(quartiles_ibu[1]))
print('The third quartile for ibu is {}'.format(quartiles_ibu[2]))
```

OK, back to box plots. The height of the box—between the 25th and 75th percentile—is called the *interquartile range* (IQR). Outside the box, you have two vertical lines—the so-called "whiskers" of the box plot—which used to be called "box and whiskers plot" [3].

The whiskers extend to the upper and lower extremes (short horizontal lines). The extremes follow the following rules:

- Top whisker: lower value between the
**maximum**and`Q3 + 1.5 x IQR`

. - Bottom whisker: higher value between the
**minimum**and`Q1 - 1.5 x IQR`

Any data values beyond the upper and lower extremes are shown with a marker (here, small circles) and are an indication of outliers in the data.

Calculate the end-points of the top and bottom whiskers for both the `abv`

and `ibu`

variables, and compare the results with the whisker end-points you see in the plot.

"Box-and-whiskers" plots were invented by John Tukey over 45 years ago. Tukey was a famous mathematician/statistician who is credited with coining the words *software* and *bit* [4]. He was active in the efforts to break the *Enigma* code durig WWII, and worked at Bell Labs in the first surface-to-air guided missile ("Nike"). A classic 1947 work on early design of the electonic computer acknowledged Tukey: he designed the electronic circuit for computing addition. Tukey was also a long-time advisor for the US Census Bureau, and a consultant for the Educational Testing Service (ETS), among many other contributions [5].

Box plots are also drawn horizontally. Often, several box plots are drawn side-by-side with the purpose of comparing distributions.

The typical method of visualizing categorical data is using **bar plots**. These show visually the frequency of appearance of items in each category, or the proportion of data in each category. Suppose we wanted to know how many beers of the same style are in our data set. Remember: the *style* of the beer is an example of *categorical data*. Let's extract the column with the style information from the `beers`

dataframe, assign it to a variable named `style_series`

, check the type of this variable, and view the first 10 elements.

In [15]:

```
style_series = beers['style']
```

In [16]:

```
type(style_series)
```

Out[16]:

In [17]:

```
style_series[0:10]
```

Out[17]:

Already in the first 10 elements we see that we have two beers of the style "American IPA," two beers of the style "American Pale Ale (APA)," but only one beer of the style "Oatmeal Stout." The question is: how many beers of each style are contained in the whole series?

Luckily, `pandas`

has a built-in function to answer that question: `series.value_counts()`

(where `series`

is the variable name of the `pandas`

series you want the counts for). Let's try it on our `style_series`

, and save the result in a new variable named `style_counts`

.

In [18]:

```
style_counts = style_series.value_counts()
style_counts[0:5]
```

Out[18]:

In [19]:

```
type(style_counts)
```

Out[19]:

In [20]:

```
len(style_counts)
```

Out[20]:

The `len()`

function tells us that `style_counts`

has 99 elements. That is, there are a total of 99 styles of beer in our data set. Wow, that's a lot!

Notice that `value_counts()`

returned the counts sorted in decreasing order: the most popular beer in our data set is "American IPA" with 424 entries in our data. The next-most popular beer is "American Pale Ale (APA)" with a lot fewer entries (245), and the counts decrease sharply after that. Naturally, we'd like to know how much more popular are the top-2 beers from the rest. Bar plot to the rescue!

Below, we'll draw a horizontal bar plot directly with `pandas`

(which uses Matplotlib internally) using the `plot.barh()`

method for series. We'll only show the first 20 beers, because otherwise we'll get a huge plot. This plot gives us a clear visualization of the popularity ranking of beer styles in the US!

In [21]:

```
style_counts[0:20].plot.barh(figsize=(10,8), color='#008367', edgecolor='gray');
```

These visualizations are really addictive! We're now getting ambitious: what if we wanted to show more than one feature, together on the same plot? What if we wanted to get insights about the relationship between two features through a multi-variable plot?

For example, don't you want to know if the bitterness of beers is associated with the alcohol-by-volume fraction? We do!

Maybe we can do this: imagine a plot that has the alcohol-by-volume on the absissa, and the IBU value on the ordinate. For each beer, we can place a dot on this plot with its `abv`

and `ibu`

values as $(x, y)$ coordinates. This is called a **scatter plot**.

We run into a bit of a problem, though. The way we handled the beer data above, we extracted the column for `abv`

into a series, dropped the null entries, and saved the values into a NumPy array. We then repeated this process for the `ibu`

column. Because a lot more `ibu`

values are missing, we ended up with two arrays of different length: 2348 entries for the `abv`

series, and 1405 entries for the `ibu`

series. If we want to make a scatter plot with these two features, we'll need series (or arrays) of the same length.

Let's instead clean the whole `beers`

dataframe (which will completely remove any row that has a null entry), and *then* extract the values of the two series into NumPy arrays.

In [22]:

```
beers_clean = beers.dropna()
```

In [23]:

```
ibu = beers_clean['ibu'].values
len(ibu)
```

Out[23]:

In [24]:

```
abv = beers_clean['abv'].values
len(abv)
```

Out[24]:

Notice that both arrays now have 1403 entries—not 1405 (the length of the clean `ibu`

data), because two rows that had a non-null `ibu`

value *did* have a null `abv`

value and were dropped.

With the two arrays of the same length, we can now call the `pyplot.scatter()`

function.

In [25]:

```
pyplot.figure(figsize=(8,8))
pyplot.scatter(abv, ibu, color='#3498db')
pyplot.title('Scatter plot of alcohol-by-volume vs. IBU \n')
pyplot.xlabel('abv')
pyplot.ylabel('IBU');
```

Hmm. That's a bit of a mess. Too many dots! But we do make out that the beers with low alcohol-by-volume tend to have low bitterness. For higher alcohol fraction, the beers can be anywhere on the bitterness scale: there's a lot of vertical spread on those dots to the right of the plot.

An idea! What if the bitterness has something to do with *style*? Neither of us knows much about beer, so we have no clue. Could we explore this question with visualization? We found a way!

What we imagined is that we could group together the beers by style, and then make a new scatter plot where each marker corresponds to a style. The beers within a style, though, have many values of alcohol fraction and bitterness: we have to come up with a "summary value" for each style. Well, why not the *mean*… we can calculate the average `abv`

and the average `ibu`

for all the beers in each style, use that pair as $(x,y)$ coordinate, and put a dot there representing the style.

Better yet! We'll make the size of the "dot" proportional to the popularity of the style in our data set! This is called a **bubble chart**.

How to achieve this idea? We searched online for "mean of a column with pandas" and we landed in `dataframe.mean()`

. This could be helpful… But we don't want the mean of a *whole* column—we want the mean of the column values grouped by *style*. Searching online again, we landed in `dataframe.groupby()`

. This is amazing: `pandas`

can group a series for you!

Here's what we want to do: group beers by style, then compute the mean of `abv`

and `ibu`

in the groups. We experimented with `beers_clean.groupby('style').mean()`

and were amazed… However, one thing was bothersome: `pandas`

computed the mean (by style) of every column, including the `id`

and `brewery_id`

, which have no business being averaged. So we decided to first drop the columns we don't need, leaving only `abv`

, `ibu`

and `style`

. We can use the `dataframe.drop()`

method for that. Check it out!

In [26]:

```
beers_styles = beers_clean.drop(['Unnamed: 0','name','brewery_id','ounces','id'], axis=1)
```

In [27]:

```
beers_styles[0:10]
```

Out[27]:

We now have a dataframe with only the numeric features `abv`

and `ibu`

, and the categorical feature `style`

. Let's find out how many beers we have of each style—we'd like to use this information to set the size of the style bubbles.

In [28]:

```
style_counts = beers_styles['style'].value_counts()
```

In [29]:

```
style_counts[0:10]
```

Out[29]:

In [30]:

```
type(style_counts)
```

Out[30]:

In [31]:

```
len(style_counts)
```

Out[31]:

The number of beers in each style appears on each row of `style_counts`

, sorted in decreasing order of count. We have 90 different styles, and the most popular style is the "American IPA," with 301 beers…

- What happened? We used to have 99 styles and 424 counts in the "American IPA" style. Why is it different now?

OK. We want to characterize each style of beer with the *mean values* of the numeric features, `abv`

and `ibu`

, within that style. Let's get those means.

In [32]:

```
style_means = beers_styles.groupby('style').mean()
```

In [33]:

```
style_means[0:10]
```

Out[33]:

Looking good! We have the information we need: the average `abv`

and `ibu`

by style, and the counts by style. The only problem is that `style_counts`

is sorted by decreasing count value, while `style_means`

is sorted alphabetically by style. Ugh.

Notice that `style_means`

is a dataframe that is now using the style string as a *label* for each row. Meanwhile, `style_counts`

is a `pandas`

series, and it also uses the style as label or index to each element.

More online searching and we find the `series.sort_index()`

method. It will sort our style counts in alphabetical order of style, which is what we want.

In [34]:

```
style_counts = style_counts.sort_index()
```

In [35]:

```
style_counts[0:10]
```

Out[35]:

Above, we used Matplotlib to create a scatter plot using two NumPy arrays as the `x`

and `y`

parameters. Like we saw previously with histograms, `pandas`

also has available some plotting methods (calling Matplotlib internally). Scatter plots made easy!

In [36]:

```
style_means.plot.scatter(figsize=(8,8),
x='abv', y='ibu', s=style_counts,
title='Beer ABV vs. IBU mean values by style');
```

That's rad! Perhaps the bubbles are too small. We could multiply the `style_counts`

by a factor of 5, or maybe 10? You should experiment.

But we are feeling gung-ho about this now, and decided to find a way to make the *color* of the bubbles also vary with the style counts. Below, we import the `colormap`

module of Matplotlib, and we set our colors using the *viridis* colormap on the values of `style_counts`

, then we repeat the plot with these colors on the bubbles and some transparency. *What do you think?*

In [37]:

```
from matplotlib import cm
colors = cm.viridis(style_counts.values)
```

In [38]:

```
style_means.plot.scatter(figsize=(10,10),
x='abv', y='ibu', s=style_counts*20, color=colors,
title='Beer ABV vs. IBU mean values by style\n',
alpha=0.3); #alpha sets the transparency
```

It looks like the most popular beers do follow a linear relationship between alcohol fraction and IBU. We learned a lot about beer without having a sip!

*Wait… one more thing!* What if we add a text label next to the bigger bubbles, to identify the style?

OK, here we go a bit overboard, but we couldn't help it. We played around a lot to get this version of the plot. It uses `enumerate`

to get pairs of indices and values from a list of style names; an `if`

statement to select only the large-count styles; and the `iloc[]`

slicing method of `pandas`

to get a slice based on index position, and extract `abv`

and `ibu`

values to an $(x,y)$ coordinate for placing the annotation text. *Are we overkeen or what!*

In [39]:

```
ax = style_means.plot.scatter(figsize=(10,10),
x='abv', y='ibu', s=style_counts*20, color=colors,
title='Beer ABV vs. IBU mean values by style\n',
alpha=0.3);
for i, txt in enumerate(list(style_counts.index.values)):
if style_counts.values[i] > 65:
ax.annotate(txt, (style_means.abv.iloc[i],style_means.ibu.iloc[i]), fontsize=12)
```

- You should always plot your data.
- The concepts of quantitative and categorical data.
- Plotting histograms directly on columns of dataframes, using
`pandas`

. - Computing variance and standard deviation using NumPy built-in functions.
- The concept of median, and how to compute it with NumPy.
- Making box plots using
`pyplot`

. - Five statistics of a box plot: the quartiles Q1, Q2 (median) and Q3 (and interquartile range Q3$-$Q1), upper and lower extremes.
- Visualizing categorical data with bar plots.
- Visualizing multiple data with scatter plots and bubble charts.
`pandas`

is awesome!

- Craft beer datatset by Jean-Nicholas Hould.
- What's The Meaning Of IBU? by Jim Dykstra for The Beer Connoisseur (2015).
- 40 years of boxplots (2011). Hadley Wickham and Lisa Stryjewski,
*Am. Statistician*. PDF available - John Wilder Tukey, Encyclopædia Britannica.
- John W. Tukey: His life and professional contributions (2002). David R. Brillinger,
*Ann. Statistics*. PDF available

From "Statistics in Medicine,", a free course in Stanford Online by Prof. Kristin Sainani, we highly recommend that you watch this lecture:

In [40]:

```
# Execute this cell to load the notebook's style sheet, then ignore it
from IPython.core.display import HTML
css_file = '../../style/custom.css'
HTML(open(css_file, "r").read())
```

Out[40]: