Bokeh Tutorial

07. Bar and Categorical Data Plots

In [1]:
from bokeh.io import show, output_notebook
from bokeh.plotting import figure

output_notebook()

Basic Bar Charts

Bar charts are a common and important type of plot. Bokeh makes it simple to create all sorts of stacked or nested bar charts, and to deal with categorical data in general.

The example below shows a simple bar chart created using the vbar method for drawing vertical bars. (There is a corresponding hbar method for horizontal bars.) We also set a few plot properties to make the chart look nicer. See the chapter Styling and Theming for information about visual properties.

In [2]:
# Here is a list of categorical values (or factors)
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']

# Set the x_range to the list of categories above
p = figure(x_range=fruits, plot_height=250, title="Fruit Counts")

# Categorical values can also be used as coordinates
p.vbar(x=fruits, top=[5, 3, 4, 2, 4, 6], width=0.9)

# Set some properties to make the plot look better
p.xgrid.grid_line_color = None
p.y_range.start = 0

show(p)

When we want to create a plot with a categorical range, we pass the ordered list of categorical values to figure, e.g. x_range=['a', 'b', 'c']. In the plot above, we passed the list of fruits as x_range, and we can see those reflected on the x-axis.
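
Because the range is an ordered list, reordering that list reorders the axis. The following is a minimal plain-Python sketch (no Bokeh calls) of preparing such a list. The counts are the bar heights used in the chart above, and the names count_by_fruit, sorted_fruits and sorted_counts are hypothetical. The result could then be passed as x_range=sorted_fruits to draw the bars in ascending order.

```python
# Categories and bar heights from the chart above
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
counts = [5, 3, 4, 2, 4, 6]

# Look up each fruit's count by name
count_by_fruit = dict(zip(fruits, counts))

# Sort the categories by count (ascending); sorted() is stable, so
# categories with equal counts keep their original relative order
sorted_fruits = sorted(fruits, key=lambda fruit: count_by_fruit[fruit])
sorted_counts = [count_by_fruit[fruit] for fruit in sorted_fruits]

print(sorted_fruits)
print(sorted_counts)
```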

The vbar glyph method takes an x location for the center of the bar, a top and a bottom (which defaults to 0), and a width. With a categorical range, as here, each category implicitly has a width of 1, so setting width=0.9 leaves a small gap between neighboring bars. (Another option would be to add some padding to the range.)
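
The arithmetic behind that gap can be sketched in plain Python. This is only an illustration, under the assumption that category number i occupies a unit-wide slot centered at i + 0.5. The helper bar_edges is hypothetical and not part of Bokeh.

```python
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
width = 0.9

def bar_edges(index, width):
    """Left and right edge of a bar centered in the slot of category `index`.

    Assumption: each category occupies a slot of width 1 centered at index + 0.5.
    """
    center = index + 0.5
    return center - width / 2, center + width / 2

edges = [bar_edges(i, width) for i in range(len(fruits))]

# Distance between the right edge of one bar and the left edge of the next
# (rounded to hide floating-point noise)
gaps = [round(edges[i + 1][0] - edges[i][1], 10) for i in range(len(edges) - 1)]

print(gaps)
```

With width=1 the bars would touch. With width=0.9 every gap comes out as 1 - 0.9 = 0.1 of a category slot.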

In [3]:
# Exercise: Create your own simple bar chart

Since vbar is a glyph method, we can use it with a ColumnDataSource just as we would with any other glyph. In the example below, we put the data (including color data) in a ColumnDataSource and use that to drive our plot. We also add a legend. See the chapter Adding Annotations for more information about legends and other annotations.
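
It can help to think of the data source as a table stored column by column: each key names a column, and the i-th entries of all columns together describe the i-th bar. The plain-Python sketch below illustrates that layout without using Bokeh. The color names are placeholders chosen for readability, not the actual Spectral6 values, and the names columns, lengths and rows are hypothetical.

```python
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
counts = [5, 3, 4, 2, 4, 6]
# Placeholder colors (assumption): stand-ins for a six-color palette
colors = ['red', 'green', 'orange', 'purple', 'blue', 'pink']

# Column-oriented layout: one list per column
columns = dict(fruits=fruits, counts=counts, color=colors)

# Every column should have the same length, one entry per bar
lengths = {name: len(values) for name, values in columns.items()}
assert len(set(lengths.values())) == 1, f"column lengths differ: {lengths}"

# Row-oriented view: one tuple per bar
rows = list(zip(columns['fruits'], columns['counts'], columns['color']))
for row in rows:
    print(row)
```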

In [4]:
from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral6

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
counts = [5, 3, 4, 2, 4, 6]

source = ColumnDataSource(data=dict(fruits=fruits, counts=counts, color=Spectral6))

p = figure(x_range=fruits, plot_height=250, y_range=(0, 9), title="Fruit Counts")
p.vbar(x='fruits', top='counts', width=0.9, color='color', legend="fruits", source=source)

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

show(p)
In [5]:
# Exercise: Create your own simple bar chart driven by a ColumnDataSource

Stacked Bars

It's often desirable to stack bars together. Bokeh makes this straightforward using the vbar_stack and hbar_stack methods. When passing data to one of these methods, the data source should have a series for each "row" in the stack. You will provide an ordered list of column names to stack together from the data source.

In the example below, we see simulated data for fruit exports (positive values) and imports (negative values) stacked using two calls to hbar_stack. The values in the columns for each year are ordered according to the fruits, i.e. this is not a "tidy" data format.
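
Stacking amounts to a running sum across the listed columns: each bar segment starts where the previous one ended. The sketch below reproduces that arithmetic in plain Python for the export and import data used in this section. It illustrates the idea and is not Bokeh's implementation. The helper stack_edges is hypothetical.

```python
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ['2015', '2016', '2017']

exports = {'fruits': fruits,
           '2015': [2, 1, 4, 3, 2, 4],
           '2016': [5, 3, 4, 2, 4, 6],
           '2017': [3, 2, 4, 4, 5, 3]}
imports = {'fruits': fruits,
           '2015': [-1, 0, -1, -3, -2, -1],
           '2016': [-2, -1, -3, -1, -2, -2],
           '2017': [-1, -2, -1, 0, -2, -2]}

def stack_edges(data, stackers):
    """Return {stacker: [(start, end), ...]} with one pair per row of data."""
    running = [0] * len(data[stackers[0]])
    edges = {}
    for name in stackers:
        starts = running
        ends = [start + value for start, value in zip(starts, data[name])]
        edges[name] = list(zip(starts, ends))
        running = ends  # the next segment begins where this one ended
    return edges

export_edges = stack_edges(exports, years)
import_edges = stack_edges(imports, years)

# Segments of the 'Apples' bar (row 0) on the export side
print([export_edges[year][0] for year in years])
```

Because the import values are negative, their running sums grow to the left of zero, which is why two separate stacks can share one axis.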

In [6]:
from bokeh.palettes import GnBu3, OrRd3

years = ['2015', '2016', '2017']

exports = {'fruits' : fruits,
           '2015'   : [2, 1, 4, 3, 2, 4],
           '2016'   : [5, 3, 4, 2, 4, 6],
           '2017'   : [3, 2, 4, 4, 5, 3]}
imports = {'fruits' : fruits,
           '2015'   : [-1, 0, -1, -3, -2, -1],
           '2016'   : [-2, -1, -3, -1, -2, -2],
           '2017'   : [-1, -2, -1, 0, -2, -2]}

p = figure(y_range=fruits, plot_height=250, x_range=(-16, 16), title="Fruit import/export, by year")

p.hbar_stack(years, y='fruits', height=0.9, color=GnBu3, source=ColumnDataSource(exports),
             legend=["%s exports" % x for x in years])

p.hbar_stack(years, y='fruits', height=0.9, color=OrRd3, source=ColumnDataSource(imports),
             legend=["%s imports" % x for x in years])

p.y_range.range_padding = 0.1
p.ygrid.grid_line_color = None
p.legend.location = "center_left"

show(p)

Notice we also added some padding around the categorical range (i.e. at both ends of the axis) by specifying:

p.y_range.range_padding = 0.1
In [7]:
# Exercise: Create a stacked bar chart with a single call to vbar_stack

Grouped Bar Charts

Sometimes we want to group bars together, instead of stacking them. Bokeh can handle up to three levels of nested (hierarchical) categories, and will automatically group output according to the outermost level. To specify nested categorical coordinates, the columns of the data source should contain tuples, for example:

x = [ ("Apples", "2015"), ("Apples", "2016"), ("Apples", "2017"), ("Pears", "2015"), ... ]

Values in other columns correspond to each item in x, exactly as in other cases. When plotting with these kinds of nested coordinates, we must tell Bokeh the contents and order of the axis range by explicitly passing a FactorRange to figure. In the example below, this is seen as:

p = figure(x_range=FactorRange(*x), ....)
In [8]:
from bokeh.models import FactorRange

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ['2015', '2016', '2017']

data = {'fruits' : fruits,
        '2015'   : [2, 1, 4, 3, 2, 4],
        '2016'   : [5, 3, 3, 2, 4, 6],
        '2017'   : [3, 2, 4, 4, 5, 3]}

# this creates [ ("Apples", "2015"), ("Apples", "2016"), ("Apples", "2017"), ("Pears", "2015"), ... ]
x = [ (fruit, year) for fruit in fruits for year in years ]
counts = sum(zip(data['2015'], data['2016'], data['2017']), ()) # like an hstack

source = ColumnDataSource(data=dict(x=x, counts=counts))

p = figure(x_range=FactorRange(*x), plot_height=250, title="Fruit Counts by Year")

p.vbar(x='x', top='counts', width=0.9, source=source)

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xgrid.grid_line_color = None

show(p)
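
The sum(zip(...), ()) idiom in the cell above interleaves the yearly columns so that counts lines up with x. An equivalent comprehension that mirrors the loop order used to build x may be easier to read. The following is a sketch using the same data, in plain Python only.

```python
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ['2015', '2016', '2017']

data = {'fruits': fruits,
        '2015': [2, 1, 4, 3, 2, 4],
        '2016': [5, 3, 3, 2, 4, 6],
        '2017': [3, 2, 4, 4, 5, 3]}

# Outer loop over fruits, inner loop over years
x = [(fruit, year) for fruit in fruits for year in years]

# Same loop order as x, so counts[i] is the value for x[i]
counts = [data[year][i] for i in range(len(fruits)) for year in years]

# Matches the tuple-concatenation idiom used in the cell above
assert tuple(counts) == sum(zip(data['2015'], data['2016'], data['2017']), ())

for factor, count in list(zip(x, counts))[:4]:
    print(factor, count)
```
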
In [9]:
# Exercise: Make the chart above have a different color for each year by adding colors to the ColumnDataSource

Another way to set the color of the bars is to use a transform. We first saw some transforms in the previous chapter, Data Sources and Transformations. Here we use a new one, factor_cmap, which accepts the name of a column to use for colormapping, as well as the palette and factors that define the color mapping.

Additionally, we can configure it to map just the sub-factors if desired. For instance, in this case we don't want to shade each (fruit, year) pair differently. Instead, we want to shade based only on the year. So we pass start=1 and end=2 to specify the slice range of each factor to use when colormapping. Then we pass the result as the fill_color value:

    fill_color=factor_cmap('x', palette=['firebrick', 'olive', 'navy'], factors=years, start=1, end=2)

to have the colors be applied automatically based on the underlying data.
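
The effect of start and end can be imitated in a few lines of plain Python: slice each nested factor, then look up the remaining part in the list of factors. This is only a sketch of the idea, not Bokeh's implementation. The helper color_for is hypothetical, and its fallback color for unknown values is an assumption.

```python
years = ['2015', '2016', '2017']
palette = ['firebrick', 'olive', 'navy']

def color_for(factor, palette, factors, start=0, end=None, nan_color='gray'):
    """Pick a color for a nested factor using only the slice factor[start:end]."""
    key = factor[start:end]
    if len(key) == 1:
        key = key[0]  # a one-element slice is compared as a plain string
    if key not in factors:
        return nan_color  # assumption: unknown values get a fallback color
    return palette[factors.index(key)]

# Only the year part (index 1) of each (fruit, year) pair decides the color
for factor in [('Apples', '2015'), ('Apples', '2016'), ('Pears', '2017')]:
    print(factor, color_for(factor, palette, years, start=1, end=2))
```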

In [10]:
from bokeh.transform import factor_cmap

p = figure(x_range=FactorRange(*x), plot_height=250, title="Fruit Counts by Year")

p.vbar(x='x', top='counts', width=0.9, source=source, line_color="white",

       # use the palette to colormap based on the x[1:2] values
       fill_color=factor_cmap('x', palette=['firebrick', 'olive', 'navy'], factors=years, start=1, end=2))

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xgrid.grid_line_color = None

show(p)

It is also possible to achieve grouped bar plots using another technique called "visual dodge". That would be useful e.g. if you only wanted to have the axis labeled by fruit type, and not include the years on the axis. This tutorial does not cover that technique but you can find information in the User's Guide.

Mixing Categorical Levels

If you have created a range with nested categories as above, it is possible to plot glyphs using only the "outer" categories, if desired. The plot below shows monthly values grouped by quarter as bars. The data for these are in the familiar format:

factors = [("Q1", "jan"), ("Q1", "feb"), ("Q1", "mar"), ....]

The plot also overlays a line representing average quarterly values, and this is accomplished by using only the "quarter" part of each nested category:

p.line(x=["Q1", "Q2", "Q3", "Q4"], y=....)
In [11]:
factors = [("Q1", "jan"), ("Q1", "feb"), ("Q1", "mar"),
           ("Q2", "apr"), ("Q2", "may"), ("Q2", "jun"),
           ("Q3", "jul"), ("Q3", "aug"), ("Q3", "sep"),
           ("Q4", "oct"), ("Q4", "nov"), ("Q4", "dec")]

p = figure(x_range=FactorRange(*factors), plot_height=250)

x = [ 10, 12, 16, 9, 10, 8, 12, 13, 14, 14, 12, 16 ]
p.vbar(x=factors, top=x, width=0.9, alpha=0.5)

qs, aves = ["Q1", "Q2", "Q3", "Q4"], [12, 9, 13, 14]
p.line(x=qs, y=aves, color="red", line_width=3)
p.circle(x=qs, y=aves, line_color="red", fill_color="white", size=10)

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None

show(p)
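
The cell above hard-codes the quarterly values for the line. As a sketch of how such values might instead be derived from the monthly data, the plain-Python snippet below groups the monthly values by the outer ("quarter") part of each factor and averages them. The names values and by_quarter are hypothetical, and the computed means need not match the illustrative hard-coded numbers exactly.

```python
factors = [("Q1", "jan"), ("Q1", "feb"), ("Q1", "mar"),
           ("Q2", "apr"), ("Q2", "may"), ("Q2", "jun"),
           ("Q3", "jul"), ("Q3", "aug"), ("Q3", "sep"),
           ("Q4", "oct"), ("Q4", "nov"), ("Q4", "dec")]
values = [10, 12, 16, 9, 10, 8, 12, 13, 14, 14, 12, 16]

# Collect the monthly values under their quarter (dicts keep insertion order)
by_quarter = {}
for (quarter, _month), value in zip(factors, values):
    by_quarter.setdefault(quarter, []).append(value)

qs = list(by_quarter)
aves = [sum(vals) / len(vals) for vals in by_quarter.values()]

for quarter, average in zip(qs, aves):
    print(quarter, round(average, 2))
```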

Using Pandas GroupBy

We may want to make charts based on the results of "group by" operations. Bokeh can utilize Pandas GroupBy objects directly to make this simpler. Let's take a look at how Bokeh deals with GroupBy objects by examining the "cars" data set.

In [12]:
from bokeh.sampledata.autompg import autompg_clean as df

df.cyl = df.cyl.astype(str)
df.head()
Out[12]:
mpg cyl displ hp weight accel yr origin name mfr
0 18.0 8 307.0 130 3504 12.0 70 North America chevrolet chevelle malibu chevrolet
1 15.0 8 350.0 165 3693 11.5 70 North America buick skylark 320 buick
2 18.0 8 318.0 150 3436 11.0 70 North America plymouth satellite plymouth
3 16.0 8 304.0 150 3433 12.0 70 North America amc rebel sst amc
4 17.0 8 302.0 140 3449 10.5 70 North America ford torino ford

Suppose we would like to display some values grouped according to "cyl". If we create group = df.groupby('cyl') and then call group.describe(), we can see that Pandas automatically computes various statistics for each group.

In [13]:
group = df.groupby('cyl')

group.describe()
Out[13]:
accel displ ... weight yr
count mean std min 25% 50% 75% max count mean ... 75% max count mean std min 25% 50% 75% max
cyl
3 4.0 13.250000 0.500000 12.5 13.25 13.5 13.5 13.5 4.0 72.500000 ... 2495.0 2720.0 4.0 75.500000 3.696846 72.0 72.75 75.0 77.75 80.0
4 199.0 16.581910 2.383185 11.6 14.80 16.2 18.0 24.8 199.0 109.670854 ... 2562.5 3270.0 199.0 77.030151 3.737484 70.0 74.00 77.0 80.00 82.0
5 3.0 18.633333 2.369247 15.9 17.90 19.9 20.0 20.1 3.0 145.000000 ... 3240.0 3530.0 3.0 79.000000 1.000000 78.0 78.50 79.0 79.50 80.0
6 83.0 16.254217 2.031778 11.3 15.05 16.0 17.6 21.0 83.0 218.361446 ... 3431.0 3907.0 83.0 75.951807 3.264381 70.0 74.00 76.0 78.00 82.0
8 103.0 12.955340 2.224759 8.0 11.50 13.0 14.0 22.2 103.0 345.009709 ... 4403.5 5140.0 103.0 73.902913 3.021214 70.0 72.00 73.0 76.00 81.0

5 rows × 48 columns

Bokeh allows us to create a ColumnDataSource directly from Pandas GroupBy objects, and when this happens, the data source is automatically filled with the summary values from group.describe(). Observe the column names below, which correspond to the output above.

In [14]:
source = ColumnDataSource(group)

",".join(source.column_names)
Out[14]:
'accel_count,accel_mean,accel_std,accel_min,accel_25%,accel_50%,accel_75%,accel_max,displ_count,displ_mean,displ_std,displ_min,displ_25%,displ_50%,displ_75%,displ_max,hp_count,hp_mean,hp_std,hp_min,hp_25%,hp_50%,hp_75%,hp_max,mpg_count,mpg_mean,mpg_std,mpg_min,mpg_25%,mpg_50%,mpg_75%,mpg_max,weight_count,weight_mean,weight_std,weight_min,weight_25%,weight_50%,weight_75%,weight_max,yr_count,yr_mean,yr_std,yr_min,yr_25%,yr_50%,yr_75%,yr_max,cyl'
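
The naming pattern in the output above is regular: each numeric column name is joined to each summary statistic with an underscore, and the grouping column comes last. The plain-Python sketch below rebuilds the same list of names from that pattern. It only illustrates the convention visible in the output and makes no claim about how Bokeh generates the names internally.

```python
# Numeric columns and statistics, in the order they appear in the output above
columns = ['accel', 'displ', 'hp', 'mpg', 'weight', 'yr']
stats = ['count', 'mean', 'std', 'min', '25%', '50%', '75%', 'max']

# "<column>_<statistic>" for every combination, then the grouping column
names = [f"{column}_{stat}" for column in columns for stat in stats] + ['cyl']

print(len(names))
print(names[:3])
```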

Knowing these column names, we can immediately create bar charts based on Pandas GroupBy objects. The example below plots the average MPG per cylinder count, i.e. columns "mpg_mean" vs "cyl".

In [15]:
from bokeh.palettes import Spectral5

cyl_cmap = factor_cmap('cyl', palette=Spectral5, factors=sorted(df.cyl.unique()))

p = figure(plot_height=350, x_range=group)
p.vbar(x='cyl', top='mpg_mean', width=1, line_color="white", 
       fill_color=cyl_cmap, source=source)

p.xgrid.grid_line_color = None
p.xaxis.axis_label = "number of cylinders"
p.yaxis.axis_label = "Mean MPG"
p.y_range.start = 0

show(p)
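
One detail worth keeping in mind: the cyl column was converted to strings, so sorted() orders the factors lexicographically. That happens to match numeric order here because every value has a single digit. The sketch below shows what would change with a two-digit value. The '10' entry is hypothetical and does not occur in this data set.

```python
cyl_values = ['8', '4', '6', '3', '5']

# Single-digit strings sort the same way as the numbers they represent
print(sorted(cyl_values))

# Hypothetical two-digit value: '10' sorts before '3' as a string
with_ten = cyl_values + ['10']
print(sorted(with_ten))

# Sorting with key=int restores numeric order while keeping strings
print(sorted(with_ten, key=int))
```
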
In [16]:
# Exercise: Use the same dataset to make a similar plot of mean horsepower (hp) by origin

Categorical Scatterplots

So far we have seen categorical data used together with various bar glyphs, but Bokeh can use categorical coordinates for almost any glyph. Let's create a scatter plot with categorical coordinates on one axis. The commits data set is simply a series of datetimes of GitHub commits. Additional columns expressing the day and time of day of each commit have already been added.

In [17]:
from bokeh.sampledata.commits import data

data.head()
Out[17]:
day time
datetime
2017-04-22 15:11:58-05:00 Sat 15:11:58
2017-04-21 14:20:57-05:00 Fri 14:20:57
2017-04-20 14:35:08-05:00 Thu 14:35:08
2017-04-20 10:34:29-05:00 Thu 10:34:29
2017-04-20 09:17:23-05:00 Thu 09:17:23
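
As an illustration of how columns like "day" and "time" might be derived from a timestamp, here is a standard-library sketch. How the sample data was actually prepared is not stated here, so this is an assumption, and the helper day_and_time is hypothetical.

```python
from datetime import datetime

# datetime.weekday() returns 0 for Monday through 6 for Sunday
DAY_NAMES = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

def day_and_time(stamp):
    """Split an ISO-format timestamp into (day name, local time of day)."""
    dt = datetime.fromisoformat(stamp)
    return DAY_NAMES[dt.weekday()], dt.time()

day, time_of_day = day_and_time("2017-04-22 15:11:58-05:00")
print(day, time_of_day)
```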

To create our scatter plot, we pass the list of categories as the range, just as before:

p = figure(y_range=DAYS, ...)

Then we can plot circles for each commit, with "time" driving the x-coordinate, and "day" driving the y-coordinate.

p.circle(x='time', y='day', ...)

To make the values more distinguishable, we can also add a jitter transform to the y-coordinate, which is shown in the complete example below.
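
Conceptually, jitter adds a small random offset to each point so that points sharing a category do not all land on one line. The plain-Python sketch below shows the idea under the assumption that offsets are drawn uniformly from an interval of the given width centered on zero. The helper jitter_offsets is hypothetical and is not the Bokeh transform itself.

```python
import random

def jitter_offsets(n, width, seed=None):
    """Draw n offsets uniformly from [-width / 2, +width / 2]."""
    rng = random.Random(seed)  # a seed makes the sketch reproducible
    return [rng.uniform(-width / 2, width / 2) for _ in range(n)]

# With width=0.6, every offset stays within 0.3 of the category's center,
# so points remain inside their own category slot (which has width 1)
offsets = jitter_offsets(1000, width=0.6, seed=42)

print(min(offsets), max(offsets))
```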

In [18]:
from bokeh.transform import jitter

DAYS = ['Sun', 'Sat', 'Fri', 'Thu', 'Wed', 'Tue', 'Mon']

source = ColumnDataSource(data)

p = figure(plot_width=800, plot_height=300, y_range=DAYS, x_axis_type='datetime', 
           title="Commits by Time of Day (US/Central) 2012—2016")

p.circle(x='time', y=jitter('day', width=0.6, range=p.y_range),  source=source, alpha=0.3)

p.xaxis[0].formatter.days = ['%Hh']
p.x_range.range_padding = 0
p.ygrid.grid_line_color = None

show(p)
In [19]:
# Exercise: Create a plot using categorical coordinates and any non-"bar" glyphs