Charts provides several conveniences, but its core purpose is to help you understand your data by separating it into distinct visual groups before you know what those groups are.
What Charts ultimately does is build Glyph Renderers to represent the groups of data and add them to a Chart.
It does this by adapting data into a consistent format, understanding metadata about the specific chart, deriving the unique groups of data based on the chart configuration, then assigning attributes to each group.
The Charts interface is built around a tabular data structure, the pandas DataFrame. There are a few primary reasons for this choice.
import pandas as pd
from bokeh.charts import Scatter, show, output_notebook
output_notebook()
Charts expects data in which like values are grouped and labeled by a column and each new record is a row, or data that can be coerced into this format. This format is quite flexible, and analysts commonly encounter it in databases, so we will go further into particular styles of structuring the data and the reasoning behind them.
When working with databases, you will likely access one or more tables, join them, and then perform some exploratory analysis. The joined dataset will likely contain columns of dates and strings that describe the values in each record. These descriptive columns, which uniquely identify the records containing numerical measurements, are often called dimensions, and the numerical measurements themselves are called values.
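To make the distinction concrete, here is a small, hypothetical pandas example (not part of the Charts API): the string and date columns are dimensions that identify each record, while the numeric measurement is a value. A common heuristic is that dimensions are the non-numeric columns.

```python
import pandas as pd

# Hypothetical joined dataset: 'city' and 'sample_date' are dimensions
# that identify each record; 'temperature' is the numerical value.
df = pd.DataFrame({
    'city': ['austin', 'austin', 'seattle', 'seattle'],
    'sample_date': ['2015-12-01', '2015-12-02', '2015-12-01', '2015-12-02'],
    'temperature': [68, 67, 45, 50],
})

# Dimensions are typically the non-numeric (object dtype) columns.
dimensions = [col for col in df.columns if df[col].dtype == object]
values = [col for col in df.columns if col not in dimensions]
```

This heuristic is only a sketch; in practice a numeric column (e.g. a year) can also act as a dimension.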
from bokeh.charts.data_source import ChartDataSource
Arrays are assumed to be like values that would be assigned to a column. Passing Chart([1, 2], [3, 4])
creates the following data source internally for the chart to use:
ds = ChartDataSource.from_data([1, 2], [3, 4])
ds.df.head()
In cases where there isn't enough metadata, column names are automatically assigned to the array-like data in the order received.
Record-oriented data like this is most often encountered when dealing with JSON data or serialized objects.
records = [
    {'name': 'bob', 'age': 25},
    {'name': 'susan', 'age': 22},
]
ds = ChartDataSource.from_data(records)
ds.df.head()
Example:
Imagine we have some sample data: temperature, sampled over time at two different weather stations, each with a raining status. The dimensions are the sample time, the station, and the raining flag; the value is the temperature. We will look at two different approaches to storing this data.
For the example, we will assume the two weather stations each record a temperature on three different days, where it is raining on the first day for station a and on the second day for station b.
Tall Data (preferred format for Charts)
Table-like data can be thought of as observations about the world, or about some process or system. As new observations arrive, you want to add new rows and avoid adding new columns, because adding a column forces you to reconsider every existing row.
For our example, a tall data source minimizes the number of columns that contain like information. In tall form, the data looks like the following:
data = dict(
    sample_time=['2015-12-01', '2015-12-02', '2015-12-03',
                 '2015-12-01', '2015-12-02', '2015-12-03'],
    temperature=[68, 67, 77, 45, 50, 43],
    location=['station a', 'station a', 'station a',
              'station b', 'station b', 'station b'],
    raining=[True, False, False, False, True, False],
)
tall = pd.DataFrame(data)
tall.head()
Wide Data (supported for some Charts, or with transformations)
Wide data is often found in scientific use cases or in pivoted data, where multiple columns contain the same class of measurement. For instance, when sampling temperature at two weather stations, wide data encodes the weather station dimension into the column names. This is simple as long as we only have temperature data.
data = dict(
    sample_time=['2015-12-01', '2015-12-02', '2015-12-03'],
    station_a_temp=[68, 67, 77],
    station_b_temp=[45, 50, 43],
)
wide = pd.DataFrame(data)
wide.head()
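As a side note (plain pandas, not the Charts API), this simple wide table can be unpivoted back to tall form with pandas' melt, which is one way to prepare wide data for Charts:

```python
import pandas as pd

wide = pd.DataFrame({
    'sample_time': ['2015-12-01', '2015-12-02', '2015-12-03'],
    'station_a_temp': [68, 67, 77],
    'station_b_temp': [45, 50, 43],
})

# Unpivot: the column names become a 'location' dimension,
# and the measurements become a single 'temperature' value column.
tall = pd.melt(wide, id_vars='sample_time',
               var_name='location', value_name='temperature')

# Clean up the location labels ('station_a_temp' -> 'station_a').
tall['location'] = tall['location'].str.replace('_temp', '', regex=False)
```

The reverse direction (tall to wide) is pandas' pivot, which is why this layout is often called pivoted data.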
However, if we need to add the raining flag, we must add two new columns, because the flag can differ between the two stations, and the columns already encode the station dimension.
data['station_a_raining'] = [True, False, False]
data['station_b_raining'] = [False, True, False]
wide = pd.DataFrame(data)
wide.head()
Tall data is better suited to interactive analysis. Wide data is fine for simple data viewed against only one or two dimensions, but with highly dimensional data, tall form makes it much easier to reconfigure charts and add new values.
tall_scatter1 = Scatter(tall, x='sample_time', y='temperature', color='location', legend=True)
tall_scatter2 = Scatter(tall, x='sample_time', y='temperature', color='raining', legend=True)
show(tall_scatter1)
show(tall_scatter2)
Imagine a new station is added. The wide data has two problems:
More modifications are required to both the data structure and the function call.
It is difficult to build interactive applications when you must reference multiple series; handling multiple selections for some of the fields adds complexity.
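A quick pandas sketch (not Charts-specific) of the contrast: adding a hypothetical station c to the tall data is just a row append, with no schema change and no edits to the chart call, whereas the wide data would need two new columns and a change to every call that references them.

```python
import pandas as pd

tall = pd.DataFrame({
    'sample_time': ['2015-12-01', '2015-12-02', '2015-12-03',
                    '2015-12-01', '2015-12-02', '2015-12-03'],
    'temperature': [68, 67, 77, 45, 50, 43],
    'location': ['station a'] * 3 + ['station b'] * 3,
})

# Tall: a new station is just more rows. A chart configured with
# color='location' picks up the new group without any code changes.
# (The station c temperatures below are made up for illustration.)
new_rows = pd.DataFrame({
    'sample_time': ['2015-12-01', '2015-12-02', '2015-12-03'],
    'temperature': [55, 60, 58],
    'location': ['station c'] * 3,
})
tall = pd.concat([tall, new_rows], ignore_index=True)
```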
The ChartDataSource can also apply any of the data transformations that Charts provides when the chart is created.
from bokeh.charts import bins, blend
data = {
    'temperature_a': [32, 23, 95, 90, 23, 58, 90],
    'temperature_b': [45, 34, 23, 88, 67, 34, 23],
}
ds = ChartDataSource.from_data(data, x=blend('temperature_a', 'temperature_b'))
ds.df.head()
ds = ChartDataSource.from_data(data, x=bins('temperature_a'))
ds.df.head()
ds = ChartDataSource.from_data(data, x=bins('temperature_a'), y=bins('temperature_b'))
ds.df.head()
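Conceptually, blend stacks the named columns into a single column, and bins discretizes values into intervals. A rough pandas analogy (this is not the Charts implementation, just a sketch of the idea using pandas' concat and cut):

```python
import pandas as pd

data = {
    'temperature_a': [32, 23, 95, 90, 23, 58, 90],
    'temperature_b': [45, 34, 23, 88, 67, 34, 23],
}
df = pd.DataFrame(data)

# blend-like: stack both temperature columns into one series
blended = pd.concat([df['temperature_a'], df['temperature_b']],
                    ignore_index=True)

# bins-like: assign each value to one of a fixed number of intervals
binned = pd.cut(df['temperature_a'], bins=3)
```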