There have been many examples of useful and exciting data visualizations for a variety of topics and applications.
from IPython.display import IFrame, HTML
from IPython.core.display import display
display(IFrame("http://demographics.coopercenter.org/DotMap/index.html", '800px', '600px'))
display(IFrame("http://www.nytimes.com/interactive/2014/07/31/world/africa/ebola-virus-outbreak-qa.html", '800px', '600px'))
Most of these involve directly coding JavaScript.
Not everyone enjoys writing JavaScript.
Bokeh is a Python package for creating interactive, browser-based visualizations, and is well-suited for "big data" applications.
Bokeh allows users to create interactive HTML visualizations without writing JavaScript.
Bokeh is a language-based visualization system. This allows for:
Bokeh philosophy:
Make a smart choice when it is possible to do so automatically, and expose low-level capabilities when it is not.
Bokeh writes to a custom-built HTML5 Canvas library, which affords it high performance. This allows it to integrate with other web tools, such as Google Maps.
Bokeh plots are based on visual elements called glyphs that are bound to data objects.
First we'll import the bokeh.plotting module, which defines the graphical functions and primitives.
import bokeh.plotting as bk
Next, we'll tell Bokeh to display its plots directly in the notebook. This will cause all of the JavaScript and data to be embedded directly into the HTML of the notebook itself. (Bokeh can also output straight to HTML files, or use a server, which we'll look at later.)
Next, we'll import NumPy and create some simple data.
import numpy as np
x = np.linspace(-6, 6, 100)
y = np.random.normal(0.3*x, 1)
Now we'll call Bokeh's circle() function to render a red circle at each of the points in x and y.
We can immediately interact with the plot:
bk.circle(x, y, color="red", plot_width=500, plot_height=500)
bk.show()
Let's try plotting multiple series on the same axes.
First, we generate some data from an exponential distribution with mean θ=1.
from scipy.stats import expon
theta = 1
measured = np.random.exponential(theta, 1000)
hist, edges = np.histogram(measured, density=True, bins=50)
Calling hold will allow us to plot multiple elements on the same set of axes.
bk.hold(True)
Next, create our figure, which is not displayed until we ask Bokeh to do so explicitly. We will customize the interactive toolbar and the background color.
bk.figure(title="Exponential Distribution (θ=1)",tools="previewsave",
background_fill="#E8DDCB")
The quad glyph displays axis-aligned rectangles with the given attributes.
bk.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], fill_color="#036564", line_color="#033649")
<bokeh.objects.Plot at 0x112130390>
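To make the relationship between the histogram arrays and the quad arguments concrete, here is a minimal pure-Python sketch. The helper density_histogram and the sample values are hypothetical, written only for illustration; it assumes equal-width bins and mimics what a density-normalized histogram returns, without claiming to match NumPy's implementation in every detail.

```python
def density_histogram(values, bins):
    """Return (heights, edges) for equal-width bins, normalized to unit area."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / float(bins)
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # The maximum value belongs to the last bin, so clamp the index.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = float(len(values))
    heights = [c / (n * width) for c in counts]
    return heights, edges


# Hypothetical sample data
values = [0.1, 0.2, 0.4, 0.9, 1.5, 1.6, 2.8, 3.0]
hist, edges = density_histogram(values, bins=4)

# N bins have N+1 edges: all but the last are left sides,
# all but the first are right sides.
left, right = edges[:-1], edges[1:]
area = sum(h * (r - l) for h, l, r in zip(hist, left, right))
print(area)
```

Each rectangle spans one bin from left to right, with its top at the bin height, and the total area of the rectangles is 1 up to floating-point rounding.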
Next, add lines showing the form of the probability density function (PDF) and cumulative distribution function (CDF).
x = np.linspace(0, 10, 1000)
bk.line(x, expon.pdf(x, scale=1), line_color="#D95B43", line_width=8, alpha=0.7, legend="PDF")
bk.line(x, expon.cdf(x, scale=1), line_color="white", line_width=2, alpha=0.7, legend="CDF")
<bokeh.objects.Plot at 0x112130390>
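For reference, the two curves have simple closed forms for an exponential distribution with mean θ: the PDF is exp(-x/θ)/θ and the CDF is 1 - exp(-x/θ), for x ≥ 0. The sketch below writes them out with the standard library only; the function names expon_pdf and expon_cdf are hypothetical and stand in for the SciPy calls used above.

```python
import math


def expon_pdf(x, theta=1.0):
    """Density of an exponential distribution with mean theta."""
    if x < 0:
        return 0.0
    return math.exp(-x / theta) / theta


def expon_cdf(x, theta=1.0):
    """Probability that an exponential variable with mean theta is <= x."""
    if x < 0:
        return 0.0
    return 1.0 - math.exp(-x / theta)


# The median of the distribution is theta * ln(2), where the CDF reaches 0.5.
median = 1.0 * math.log(2)
print(expon_pdf(0.0), expon_cdf(median))
```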
Finally, add a legend before releasing the hold and displaying the complete plot.
bk.legend().orientation = "top_right"
bk.hold(False)
bk.show()
Bokeh's core display model relies on composing graphical primitives which are bound to data series. A more sophisticated example demonstrates this idea.
Bokeh ships with a small set of interesting "sample data" in the bokeh.sampledata package. We'll load up some historical automobile fuel efficiency data, which is returned as a Pandas DataFrame.
from bokeh.sampledata.autompg import autompg
We first need to reshape the data: we group it by the model year of the car, and then select subsets by country of origin (here, USA or Japan).
grouped = autompg.groupby("yr")
mpg = grouped["mpg"]
mpg_avg = mpg.mean()
mpg_std = mpg.std()
years = np.asarray(sorted(grouped.groups.keys()))  # sorted so the years line up with the groupby results
american = autompg[autompg["origin"]==1]
japanese = autompg[autompg["origin"]==3]
american.head(10)
mpg | cyl | displ | hp | weight | accel | yr | origin | name | |
---|---|---|---|---|---|---|---|---|---|
0 | 18 | 8 | 307 | 130 | 3504 | 12.0 | 70 | 1 | chevrolet chevelle malibu |
1 | 15 | 8 | 350 | 165 | 3693 | 11.5 | 70 | 1 | buick skylark 320 |
2 | 18 | 8 | 318 | 150 | 3436 | 11.0 | 70 | 1 | plymouth satellite |
3 | 16 | 8 | 304 | 150 | 3433 | 12.0 | 70 | 1 | amc rebel sst |
4 | 17 | 8 | 302 | 140 | 3449 | 10.5 | 70 | 1 | ford torino |
5 | 15 | 8 | 429 | 198 | 4341 | 10.0 | 70 | 1 | ford galaxie 500 |
6 | 14 | 8 | 454 | 220 | 4354 | 9.0 | 70 | 1 | chevrolet impala |
7 | 14 | 8 | 440 | 215 | 4312 | 8.5 | 70 | 1 | plymouth fury iii |
8 | 14 | 8 | 455 | 225 | 4425 | 10.0 | 70 | 1 | pontiac catalina |
9 | 15 | 8 | 390 | 190 | 3850 | 8.5 | 70 | 1 | amc ambassador dpl |
For each year, we want to plot the distribution of MPG within that year. As a guide, we will include a box that represents the mean efficiency, plus and minus one standard deviation. We will make these boxes partly transparent.
bk.hold(True)
bk.figure(title='Automobile mileage by year and country')
bk.quad(left=years-0.4, right=years+0.4, bottom=mpg_avg-mpg_std, top=mpg_avg+mpg_std, fill_alpha=0.4)
<bokeh.objects.Plot at 0x11467aed0>
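The box geometry can be checked by hand. The following sketch uses a few hypothetical (year, mpg) rows rather than the real dataset, and computes the sample standard deviation (dividing by n - 1), which is the default that Pandas uses for std(). The dictionary keys mirror the quad arguments above.

```python
import math
from collections import defaultdict

# Hypothetical rows: (model year, miles per gallon)
rows = [(70, 18.0), (70, 15.0), (70, 16.0), (71, 25.0), (71, 27.0), (71, 23.0)]

by_year = defaultdict(list)
for yr, mpg in rows:
    by_year[yr].append(mpg)

boxes = {}
for yr in sorted(by_year):
    vals = by_year[yr]
    mean = sum(vals) / len(vals)
    # Sample standard deviation (n - 1 in the denominator)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / (len(vals) - 1))
    boxes[yr] = dict(left=yr - 0.4, right=yr + 0.4,
                     bottom=mean - std, top=mean + std)

print(boxes[71])
```

For the hypothetical year 71 the mean is 25 and the standard deviation is 2, so the box runs from 23 to 27 and is 0.8 years wide.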
Next, we overplot the actual data points, using contrasting symbols for American and Japanese cars.
# Add Japanese cars as circles
bk.circle(x=np.asarray(japanese["yr"]),
y=np.asarray(japanese["mpg"]),
size=8, alpha=0.4, line_color="red", fill_color=None, line_width=2)
# Add American cars as triangles
bk.triangle(x=np.asarray(american["yr"]),
y=np.asarray(american["mpg"]),
size=8, alpha=0.4, line_color="blue", fill_color=None, line_width=2)
<bokeh.objects.Plot at 0x11467aed0>
We can add axis labels by binding them to the axis_label attribute of each axis.
xax, yax = bk.axis()
xax.axis_label = 'Year'
yax.axis_label = 'MPG'
bk.hold(False)
bk.show()
To link plots together at a data level, we can explicitly wrap the data in a ColumnDataSource. This allows us to reference columns by name.
source = bk.ColumnDataSource(autompg.to_dict("list"))
source.add(autompg["yr"], name="yr")
'yr'
The gridplot function takes a 2-dimensional list containing elements to be arranged in a grid on the same canvas.
plot_config = dict(plot_width=300, plot_height=300, tools="pan, wheel_zoom, box_zoom, select")
bk.gridplot([[
bk.circle("yr", "mpg", color="blue", title="MPG by Year",
source=source, **plot_config),
bk.circle("hp", "displ", color="green", title="HP vs. Displacement",
source=source, **plot_config),
bk.circle("mpg", "displ", size="cyl", line_color="red", title="MPG vs. Displacement",
fill_color=None, source=source, **plot_config)
]])
<bokeh.objects.GridPlot at 0x11469ee50>
We can use the select tool to select points on one plot, and the linked points on the other plots will automagically highlight.
bk.show()
First, we import the data with Pandas and manipulate it as needed.
import pandas as pd
from bokeh.objects import HoverTool
from bokeh.sampledata.unemployment1948 import data
from collections import OrderedDict
data['Year'] = [str(x) for x in data['Year']]
years = list(data['Year'])
months = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
data = data.set_index('Year')
Specify a color map. (Where do we get color maps, you ask? Try Color Brewer.)
colors = [
"#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce",
"#ddb7b1", "#cc7878", "#933b41", "#550b1d"
]
Set up the data for plotting. We will need to have values for every pair of year/month names. Map the rate to a color.
month = []
year = []
color = []
rate = []
for y in years:
    for m in months:
        month.append(m)
        year.append(y)
        monthly_rate = data[m][y]
        rate.append(monthly_rate)
        color.append(colors[min(int(monthly_rate)-2, 8)])
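The color lookup in the loop truncates the rate to an integer, subtracts 2, and caps the result at the last palette index. A small sketch of that mapping follows. The helper rate_to_color is hypothetical, and it adds a lower clamp that the loop above does not have: without it, a rate below 2 would produce a negative index, which Python would silently resolve from the end of the list.

```python
colors = [
    "#75968f", "#a5bab7", "#c9d9d3", "#e2e2e2", "#dfccce",
    "#ddb7b1", "#cc7878", "#933b41", "#550b1d"
]


def rate_to_color(rate, palette=colors, offset=2):
    """Map a rate to a palette entry, one color per whole percentage point."""
    idx = int(rate) - offset
    # Clamp to the valid index range (the lower clamp is an addition here).
    idx = max(0, min(idx, len(palette) - 1))
    return palette[idx]


for r in (1.0, 2.5, 5.9, 10.8):
    print(r, rate_to_color(r))
```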
Create a ColumnDataSource with columns: month, year, color, rate
source = bk.ColumnDataSource(
    data=dict(
        month=month,
        year=year,
        color=color,
        rate=rate,
    )
)
Create a new figure.
bk.figure()
Use the rect renderer with the following attributes:
x_range is years, y_range is months (reversed)
line_color for the rectangles is None
plot_width and plot_height
bk.rect('year', 'month', 0.95, 0.95, source=source,
x_axis_location="above",
x_range=years, y_range=list(reversed(months)),
color='color', line_color=None,
tools="resize,hover", title="US Unemployment (1948 - 2013)",
plot_width=900, plot_height=400)
<bokeh.objects.Plot at 0x11468af10>
Style the plot, including:
bk.grid().grid_line_color = None
bk.axis().axis_line_color = None
bk.axis().major_tick_line_color = None
bk.axis().major_label_text_font_size = "5pt"
bk.axis().major_label_standoff = 0
bk.xaxis().major_label_orientation = np.pi/3
Configure the hover tool to display the month, year and rate
hover = bk.curplot().tools[1]
hover.tooltips = OrderedDict([
('date', '@month @year'),
('rate', '@rate'),
])
Now we can display the plot. Try moving your pointer over different cells in the plot.
bk.show()
Similarly, we can provide a geographic heatmap, here using data just from Texas.
from bokeh.sampledata import us_counties, unemployment
from collections import OrderedDict
county_xs=[
us_counties.data[code]['lons'] for code in us_counties.data
if us_counties.data[code]['state'] == 'tx'
]
county_ys=[
us_counties.data[code]['lats'] for code in us_counties.data
if us_counties.data[code]['state'] == 'tx'
]
colors = ["#F1EEF6", "#D4B9DA", "#C994C7", "#DF65B0", "#DD1C77", "#980043"]
county_colors = []
for county_id in us_counties.data:
    if us_counties.data[county_id]['state'] != 'tx':
        continue
    try:
        rate = unemployment.data[county_id]
        idx = min(int(rate/2), 5)
        county_colors.append(colors[idx])
    except KeyError:
        county_colors.append("black")
bk.patches(county_xs, county_ys, fill_color=county_colors, fill_alpha=0.7,
tools="pan,wheel_zoom,box_zoom,reset,hover,previewsave",
line_color="white", line_width=0.5,
title="Texas Unemployment 2009")
hover = bk.curplot().select(dict(type=HoverTool))
hover.tooltips = OrderedDict([
("index", "$index"),
("(x,y)", "($x, $y)"),
("fill color", "$color[hex, swatch]:fill_color"),
])
bk.show()
The examples so far have been relatively low-level, in that individual elements of plots need to be specified by hand. The bokeh.charts interface makes it easy to get up-and-running with a high-level API that tries to make smart layout and design decisions by default.
To use them, you simply import the chart type you need from bokeh.charts:
Bar
BoxPlot
HeatMap
Histogram
Scatter
Timeseries
To illustrate, let's create some random data and display it as histograms.
normal = np.random.standard_normal(1000)
student_t = np.random.standard_t(6, 1000)
distributions = pd.DataFrame({'Normal': normal, 'Student-T': student_t})
from bokeh.charts import Histogram
hist = Histogram(distributions, bins=np.sqrt(len(normal)), notebook=True)
hist.title("Histograms").ylabel("frequency").legend(True).width(600).height(300)
hist.show()
Notice how we strung together methods for formatting the chart.
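Chaining works when each formatting method returns the object it was called on. The class below is a hypothetical stand-in written for illustration only; it is not part of Bokeh and only records the options it is given.

```python
class ChartSketch(object):
    """Hypothetical chart-like object that records formatting options."""

    def __init__(self):
        self.options = {}

    def _set(self, key, value):
        self.options[key] = value
        return self  # returning self is what makes chaining possible

    def title(self, text):
        return self._set("title", text)

    def ylabel(self, text):
        return self._set("ylabel", text)

    def width(self, pixels):
        return self._set("width", pixels)

    def height(self, pixels):
        return self._set("height", pixels)


chart = ChartSketch().title("Histograms").ylabel("frequency").width(600).height(300)
print(chart.options)
```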
Here is a scatter plot example, using the famous iris dataset.
from collections import OrderedDict
from bokeh.charts import Scatter
from bokeh.sampledata.iris import flowers
setosa = flowers[(flowers.species == "setosa")][["petal_length", "petal_width"]]
versicolor = flowers[(flowers.species == "versicolor")][["petal_length", "petal_width"]]
virginica = flowers[(flowers.species == "virginica")][["petal_length", "petal_width"]]
xyvalues = OrderedDict([("setosa", setosa.values),
("versicolor", versicolor.values),
("virginica", virginica.values)])
scatter = Scatter(xyvalues)
scatter.title("iris dataset, dict_input").xlabel("petal_length").ylabel("petal_width").legend("top_left")
scatter.width(600).height(400).notebook().show()
Bokeh plots are compatible with other Python plotting packages, such as Matplotlib, Seaborn, and ggplot, via an onboard compatibility layer, bokeh.mpl. This allows existing plots in these packages to be converted seamlessly into Bokeh.
titanic = pd.read_csv('http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3.csv')
%matplotlib inline
from ggplot import ggplot, aes, geom_density
from bokeh import mpl
import matplotlib.pyplot as plt
g = ggplot(titanic, aes(x='age', color='pclass')) + geom_density()
g.draw()
plt.title("Plot of titanic passenger age distribution by class")
mpl.to_bokeh(name="density")
Session output file 'density.html' already exists, will be overwritten.
import seaborn as sns
# Drop passengers whose age is missing
titanic = titanic.dropna(subset=['age'])
# And then just call the violinplot from Seaborn
sns.violinplot(titanic.age, groupby=titanic.pclass, color="Set3")
plt.title("Distribution of age by passenger class")
mpl.to_bokeh(name="violin")
Mandelbrot set images are made by sampling complex numbers and determining for each whether the result tends towards infinity when a particular mathematical operation is iterated on it.
First, functions for generating the Mandelbrot set image. They create a 2D array of numbers, over which a color map can be displayed.
from __future__ import division
def mandel(x, y, max_iters):
    """
    Given the real and imaginary parts of a complex number,
    determine if it is a candidate for membership in the Mandelbrot
    set given a fixed number of iterations.
    """
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if (z.real*z.real + z.imag*z.imag) >= 4:
            return i
    return max_iters

def create_fractal(min_x, max_x, min_y, max_y, image, iters):
    height = image.shape[0]
    width = image.shape[1]
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height
    for x in range(width):
        real = min_x + x * pixel_size_x
        for y in range(height):
            imag = min_y + y * pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x] = color
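To see the escape-time test and the pixel-to-coordinate mapping at work without any plotting, here is a small sketch that renders a coarse text version of the set. The names escape_time and ascii_fractal are hypothetical helpers for this illustration; the iteration and the escape condition are the same as in the functions above.

```python
def escape_time(c, max_iters):
    """Iterate z -> z*z + c and return the step at which |z|^2 reaches 4."""
    z = 0j
    for i in range(max_iters):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return i
    return max_iters  # never escaped: treated as inside the set


def ascii_fractal(min_x, max_x, min_y, max_y, width, height, iters):
    """Return one string per row; '#' marks points that never escaped."""
    rows = []
    for y in range(height):
        imag = min_y + y * (max_y - min_y) / float(height)
        row = ""
        for x in range(width):
            real = min_x + x * (max_x - min_x) / float(width)
            inside = escape_time(complex(real, imag), iters) == iters
            row += "#" if inside else "."
        rows.append(row)
    return rows


rows = ascii_fractal(-2.0, 1.0, -1.0, 1.0, 30, 10, 20)
print("\n".join(rows))
```

The origin never escapes, so it is marked as inside, while c = 1 escapes on the second iteration and c = -2 - 1j escapes immediately.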
Define the bounding coordinates to generate the Mandelbrot image, then create a scalar image (2D array of numbers)
min_x = -2.0
max_x = 1.0
min_y = -1.0
max_y = 1.0
img = np.zeros((1024, 1536), dtype = np.uint8)
create_fractal(min_x, max_x, min_y, max_y, img, 20)
Use the image renderer to display the Mandelbrot image, colormapped with the "Spectral-11" color palette. The renderer can display many images at once, so it takes lists of images, coordinates, and palettes.
bk.image(image=[img],
x=[min_x],
y=[min_y],
dw=[max_x-min_x],
dh=[max_y-min_y],
palette=["Spectral-11"],
x_range = [min_x, max_x],
y_range = [min_y, max_y],
title="Mandelbrot",
tools="pan,wheel_zoom,box_zoom,reset,previewsave",
plot_width=900,
plot_height=600
)
<bokeh.objects.Plot at 0x1161cf310>
bk.show()
In 1951, Will Burtin published a visualization to compare the effectiveness of three popular antibiotics on a suite of bacteria, measured in terms of minimum inhibitory concentration.
The longer the bar, the smaller the effective dose. The orange group is Gram-negative bacteria, while the purple is Gram-positive.
We will attempt to re-create this plot in Bokeh.
We start by defining the data and computing some derived quantities using NumPy and Pandas.
from bokeh.objects import Range1d
from StringIO import StringIO
from math import log, sqrt
from collections import OrderedDict
antibiotics = """
bacteria, penicillin, streptomycin, neomycin, gram
Mycobacterium tuberculosis, 800, 5, 2, negative
Salmonella schottmuelleri, 10, 0.8, 0.09, negative
Proteus vulgaris, 3, 0.1, 0.1, negative
Klebsiella pneumoniae, 850, 1.2, 1, negative
Brucella abortus, 1, 2, 0.02, negative
Pseudomonas aeruginosa, 850, 2, 0.4, negative
Escherichia coli, 100, 0.4, 0.1, negative
Salmonella (Eberthella) typhosa, 1, 0.4, 0.008, negative
Aerobacter aerogenes, 870, 1, 1.6, negative
Brucella antracis, 0.001, 0.01, 0.007, positive
Streptococcus fecalis, 1, 1, 0.1, positive
Staphylococcus aureus, 0.03, 0.03, 0.001, positive
Staphylococcus albus, 0.007, 0.1, 0.001, positive
Streptococcus hemolyticus, 0.001, 14, 10, positive
Streptococcus viridans, 0.005, 10, 40, positive
Diplococcus pneumoniae, 0.005, 11, 10, positive
"""
drug_color = OrderedDict([
("Penicillin", "#0d3362"),
("Streptomycin", "#c64737"),
("Neomycin", "black" ),
])
gram_color = {
"positive" : "#aeaeb8",
"negative" : "#e69584",
}
df = pd.read_csv(StringIO(antibiotics), skiprows=1, skipinitialspace=True)
width = 800
height = 800
inner_radius = 90
outer_radius = 300 - 10
minr = np.sqrt(log(.001 * 1E4))
maxr = np.sqrt(log(1000 * 1E4))
a = (outer_radius - inner_radius) / (minr - maxr)
b = inner_radius - a * maxr
def rad(mic):
    return a * np.sqrt(np.log(mic * 1E4)) + b
big_angle = 2.0 * np.pi / (len(df) + 1)
small_angle = big_angle / 7
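The constants a and b define a linear map on the transformed scale sqrt(log(mic * 1E4)), chosen so that the smallest concentration (0.001) lands on the outer radius and the largest (1000) on the inner radius. The sketch below repeats the same arithmetic with the standard math module, as a check that smaller doses give longer bars; it is an illustration of the formula, not additional plotting code.

```python
import math

inner_radius = 90.0
outer_radius = 300.0 - 10.0

# Transformed values at the two ends of the concentration range
minr = math.sqrt(math.log(0.001 * 1E4))
maxr = math.sqrt(math.log(1000 * 1E4))

# a is negative: the radius shrinks as the concentration grows
a = (outer_radius - inner_radius) / (minr - maxr)
b = inner_radius - a * maxr


def rad(mic):
    """Radius for a minimum inhibitory concentration (mic)."""
    return a * math.sqrt(math.log(mic * 1E4)) + b


for mic in (0.001, 0.1, 10, 1000):
    print(mic, round(rad(mic), 1))
```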
We are going to combine several glyph renderers on one plot. First, we need to tell Bokeh to reuse the same plot using hold:
bk.figure()
bk.hold(True)
Next, we add the first glyph, the background regions colored by Gram stain, using annular_wedge. We also take this opportunity to set some of the overall properties of the plot:
x = np.zeros(len(df))
y = np.zeros(len(df))
angles = np.pi/2 - big_angle/2 - df.index.values*big_angle
colors = [gram_color[gram] for gram in df.gram]
bk.annular_wedge(
x, y, inner_radius, outer_radius, -big_angle+angles, angles, color=colors,
plot_width=width, plot_height=height, title="", tools="", x_axis_type=None, y_axis_type=None
)
<bokeh.objects.Plot at 0x1178c5950>
Next, we grab the current plot using curplot and customize the look of the plot further:
plot = bk.curplot()
plot.x_range = Range1d(start=-420, end=420)
plot.y_range = Range1d(start=-420, end=420)
plot.min_border = 0
plot.background_fill = "#f0e1d2"
plot.border_fill = "#f0e1d2"
plot.outline_line_color = None
bk.xgrid().grid_line_color = None
bk.ygrid().grid_line_color = None
Add the small wedges representing the antibiotic effectiveness, also using annular_wedge:
bk.annular_wedge(
x, y, inner_radius, rad(df.penicillin), -big_angle+angles + 5*small_angle, -big_angle+angles+6*small_angle, color=drug_color['Penicillin'],
)
bk.annular_wedge(
x, y, inner_radius, rad(df.streptomycin), -big_angle+angles + 3*small_angle, -big_angle+angles+4*small_angle, color=drug_color['Streptomycin'],
)
bk.annular_wedge(
x, y, inner_radius, rad(df.neomycin), -big_angle+angles + 1*small_angle, -big_angle+angles+2*small_angle, color=drug_color['Neomycin'],
)
<bokeh.objects.Plot at 0x1178c5950>
Add circular and radial axis lines using circle, text, and annular_wedge:
labels = np.power(10.0, np.arange(-3, 4))
radii = a * np.sqrt(np.log(labels * 1E4)) + b
bk.circle(x, y, radius=radii, fill_color=None, line_color="white")
bk.text(x[:-1], radii[:-1], [str(r) for r in labels[:-1]], angle=0, text_font_size="8pt", text_align="center", text_baseline="middle")
bk.annular_wedge(
x, y, inner_radius-10, outer_radius+10, -big_angle+angles, -big_angle+angles, color="black",
)
<bokeh.objects.Plot at 0x1178c5950>
Add text labels for the bacteria using text:
xr = radii[0]*np.cos(np.array(-big_angle/2 + angles))
yr = radii[0]*np.sin(np.array(-big_angle/2 + angles))
label_angle = np.array(-big_angle/2+angles)
label_angle[label_angle < -np.pi/2] += np.pi # easier to read labels on the left side
bk.text(xr, yr, df.bacteria, angle=label_angle, text_font_size="9pt", text_align="center", text_baseline="middle")
<bokeh.objects.Plot at 0x1178c5950>
Add a custom legend:
bk.circle([-40, -40], [-370, -390], color=gram_color.values(), radius=5)
bk.text([-30, -30], [-370, -390], text=["Gram-" + x for x in gram_color.keys()],
angle=0, text_font_size="7pt", text_align="left", text_baseline="middle")
bk.rect([-40, -40, -40], [18, 0, -18], width=30, height=13, color=drug_color.values())
bk.text([-15, -15, -15], [18, 0, -18], text=drug_color.keys(), angle=0,
text_font_size="9pt", text_align="left", text_baseline="middle")
<bokeh.objects.Plot at 0x1178c5950>
bk.hold(False)
bk.show()
Bokeh deals with large datasets via abstract rendering, which dynamically downsamples areas with high data density so that they can be rendered more efficiently.
Abstract rendering is a bin-based rendering technique that provides greater control over a visual representation and access to larger data sets through server-side processing. Bokeh provides both a high-level and low-level interface with the abstract rendering engine.
At a high level, all abstract rendering applications start with a plot. Abstract rendering takes the plot and renders it to a canvas that uses data values instead of colors and bins instead of pixels. With the data values collected into bins, the plot can be analyzed and transformed to ensure the true nature of the underlying data source is preserved. Bokeh then shades the bins to represent the underlying quantity of data.
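The binning step can be pictured with a few lines of plain Python. This is a conceptual sketch only: the helper bin_points and the sample points are hypothetical, and the code makes no claim about how Bokeh's abstract rendering engine is implemented. It shows a canvas that stores counts per bin instead of colors per pixel.

```python
def bin_points(points, x_range, y_range, nx, ny):
    """Count how many points fall into each cell of an nx-by-ny grid."""
    x0, x1 = x_range
    y0, y1 = y_range
    counts = [[0] * nx for _ in range(ny)]
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # outside the canvas
        # Points on the upper boundary go into the last bin.
        col = min(int((x - x0) / float(x1 - x0) * nx), nx - 1)
        row = min(int((y - y0) / float(y1 - y0) * ny), ny - 1)
        counts[row][col] += 1
    return counts


# Hypothetical points in the unit square
points = [(0.1, 0.1), (0.2, 0.15), (0.9, 0.9), (0.95, 0.8), (0.5, 0.5), (2.0, 2.0)]
counts = bin_points(points, (0.0, 1.0), (0.0, 1.0), nx=2, ny=2)
print(counts)
```

However many points there are, the result has a fixed size of nx by ny values, which is what makes it practical to send to the browser.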
Abstract rendering is tied to the bokeh server infrastructure, and can thus only be used with an active bokeh server and with plots employing a ServerDataSource.
The server can be started via:
bokeh-server -D remotedata
Bokeh's abstract rendering recipes provide simplified access to common abstract rendering operations. The recipes interface is declarative, in that a high-level operation is requested and the abstract rendering system constructs the proper low-level function combinations.
This example shows how abstract rendering is used to generate two types of plots:
The heatmap recipe converts a plot of points into a plot of densities. The most common scenario is a plot with one item type and overplotting issues. Adjacency-matrix-based graph visualizations also benefit from a heatmap if there is more than one node per pixel row or column. The basic process is that each bin collects the count of the items that fall into it. After that, a color scale is constructed so that it covers the full range of the data.
The contours recipe converts a plot of points into ISO contours. It works on the same principle as the heatmap recipe (binning counts), but instead of building color ramps, the contours recipe produces a number of regions representing ranges of counts.
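The difference between the two recipes can be sketched on a small grid of bin counts. The functions shade and bands below are hypothetical illustrations, not Bokeh functions: shade stretches a linear color ramp over the full range of counts, and bands replaces each count with the index of the range it falls into. The high color and the thresholds are assumptions chosen for the example.

```python
from bisect import bisect_right


def shade(counts, low=(255, 200, 200), high=(255, 0, 0)):
    """Map each count to an RGB color on a linear ramp from low to high."""
    flat = [c for row in counts for c in row]
    lo, hi = min(flat), max(flat)
    span = float(hi - lo) or 1.0  # avoid dividing by zero on a flat grid

    def lerp(c):
        t = (c - lo) / span
        return tuple(int(round(l + t * (h - l))) for l, h in zip(low, high))

    return [[lerp(c) for c in row] for row in counts]


def bands(counts, thresholds):
    """Replace each count with the index of the count range it belongs to."""
    return [[bisect_right(thresholds, c) for c in row] for row in counts]


# Hypothetical bin counts
counts = [[0, 5], [10, 10]]
colors = shade(counts)
regions = bands(counts, thresholds=[1, 6])
print(colors)
print(regions)
```

Neighboring bins with the same band index form one region, and the boundaries between regions play the role of contour lines.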
from bokeh.plotting import square, output_server, show
from bokeh.objects import ServerDataSource
import bokeh.transforms.ar_downsample as ar
output_server("Census")
# 2010 US Census tracts
source = ServerDataSource(data_url="/defaultuser/CensusTracts.hdf5", owner_username="defaultuser")
plot = square(
'LON', 'LAT',
source=source,
plot_width=600,
plot_height=400,
title="Census Tracts")
ar.heatmap(plot, low=(255, 200, 200), points=True, title="Census Tracts (Server Colors)")
show()
Using saved session configuration for http://localhost:5006/ To override, pass 'load_from_config=False' to Session
ar.contours(plot, title="ISO Contours")
show()