Altair basics with COVID-19 data

This notebook demonstrates some intermediate Altair functionality (filtering, tidying, and interactivity) using an interesting public dataset. It is explicitly not intended to convey any epidemiological insight (raw case counts are not necessarily useful).

Data acquisition and basic manipulation in Pandas

We're going to start with data from covidtracking.com. This is an excellent resource that provides an API for curated, historical, state-by-state data.

In [ ]:
import pandas as pd
import altair as alt
alt.data_transformers.enable('json')

df = pd.read_csv("https://covidtracking.com/api/v1/states/daily.csv")

We're first going to reformat the dates to add hyphens between the year, month, and day, so 20200228 becomes 2020-02-28.

In [ ]:
def datemunge(di):
    d = str(di)
    return "%s-%s-%s" % (d[0:4], d[4:6], d[6:8])

cleaned = df.copy()

cleaned["date"] = cleaned["date"].apply(datemunge)
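
As an alternative to the per-value apply above, the same reformatting can be done in one vectorized step. The following is a minimal sketch on a made-up three-row frame (the name toy and its values are illustrative, not taken from the API data); it assumes the dates arrive as integers of the form YYYYMMDD.

```python
import pandas as pd

# Hypothetical stand-in for the `date` column of the source data.
toy = pd.DataFrame({"date": [20200228, 20200301, 20200409]})

# Parse the eight-digit integers as dates, then render them as ISO strings.
parsed = pd.to_datetime(toy["date"].astype(str), format="%Y%m%d")
toy["date"] = parsed.dt.strftime("%Y-%m-%d")

print(toy["date"].tolist())
```

Keeping the intermediate parsed series around would also give real datetime values, which can be handy if the dates are later plotted on a temporal axis.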

We'll then melt the data frame so that each observation is in its own row. For example, a table with the columns state, date, positive, negative, hospitalized, icu becomes one with the columns state, date, case type, cases, where case type is one of positive, negative, hospitalized, or icu and cases holds the corresponding value.

In [ ]:
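# Columns that identify an observation or carry metadata rather than counts.
non_measures = ['date', 'state', 'fips', 'hash', 'dateChecked', 'lastUpdateEt', 'dataQualityGrade']

cleaned = pd.melt(cleaned,
                  id_vars=['date', 'state', 'fips'],
                  # Melt every remaining column into (case type, cases) pairs.
                  value_vars=[c for c in df.columns if c not in non_measures],
                  value_name="cases",
                  var_name="case type")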
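
To see what melt does on a small scale, here is a sketch with a made-up wide-form frame. The names wide and tidy and all of the numbers are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical wide-form frame: one row per state and date, one column per measure.
wide = pd.DataFrame({
    "state": ["WI", "MN"],
    "date": ["2020-04-09", "2020-04-09"],
    "positive": [10, 20],
    "negative": [100, 200],
})

# Long form: one row per (state, date, measure) combination.
tidy = pd.melt(wide,
               id_vars=["state", "date"],
               value_vars=["positive", "negative"],
               var_name="case type",
               value_name="cases")

print(tidy)
```

Two rows with two measures each become four rows, each carrying a single measurement.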

We can see the difference between these representations by looking at the source data (df) for Wisconsin on April 9th and the melted data (cleaned) for Wisconsin on April 9th.

In [ ]:
df[(df["state"] == "WI") & (df["date"] == 20200409)]
In [ ]:
cleaned[(cleaned["state"] == "WI") & (cleaned["date"] == "2020-04-09")].dropna()

Plotting per-state results

The next cell defines a function that operates on our cleaned (long-form) data frame to produce a chart of results for a specific state. We're using Altair's transform_filter method to postprocess the cleaned data, selecting

  1. only observations about a given state,
  2. only observations with a positive case count (which excludes zeros, which a log scale cannot display, as well as NaNs), and
  3. only observations of one of a set of case types.
In [ ]:
def cases_for_state(state, show_points=False):
    case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']
    chart = alt.Chart(cleaned).\
                encode(alt.X("date:N"), 
                       alt.Y("cases", scale=alt.Scale(type="log")), 
                       alt.Color("case type", 
                                 sort=alt.EncodingSortField(field="cases", 
                                                            order="descending", 
                                                            op="max")),
                       tooltip=['date', 'state', 'case type', 'cases']).\
                transform_filter(alt.datum.state == state).\
                transform_filter(alt.datum.cases > 0).\
                transform_filter(alt.FieldOneOfPredicate("case type", case_types))
    
    # Layer points over the line only when requested.
    return (chart.mark_line() + chart.mark_point()) if show_points else chart.mark_line()
In [ ]:
cases_for_state("WI")

Filtering in Pandas

Of course, we could generate a data frame that solely has the rows we care about for a given state, like this:

In [ ]:
case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']

wi_cases = cleaned[(cleaned["state"] == "WI") &
                   (cleaned["case type"].isin(case_types)) &
                   (pd.to_numeric(cleaned["cases"], errors="coerce") > 0)]

alt.Chart(
    wi_cases
  ).mark_line(
  ).encode(
    alt.X("date:N"), 
    alt.Y("cases", scale=alt.Scale(type="log")), 
    alt.Color("case type", 
              sort=alt.EncodingSortField(field="cases", 
                                         order="descending", 
                                         op="max"))
  )

Filtering our data in Altair can be more convenient, though, and enable interactive charting, as we'll see shortly.

Fold transformations in Altair

We used the melt function in Pandas to go from a wide-form table (df), in which each observation is a column, to a long-form table (cleaned), in which each observation is a row.

We can also do this transformation in Altair, with the transform_fold function, as in the next cell. The fold parameter takes a list of columns to break out into new observation types, and the as_ parameter takes a two-element list consisting of what to call the observation type column (whose values are the names of the columns from fold) and what to call the observation value column (whose values are the values of the columns from fold).

As a bonus, we'll also convert from "date integers" of the form 20200410 to actual date-time objects in Altair (instead of using DataFrame.apply). We'll construct these with integer arithmetic: the year is the date value divided by 10,000 (rounded down), the month is the remainder of that division, divided by 100 (rounded down), and the day is the remainder of the date value divided by 100. (Since Vega dates use zero-indexed months, we'll also have to subtract one from the month. Phew!)

This will turn into a Vega expression that we can pass into Altair's transform_calculate method, and that looks like this:

alt.expr.datetime(
   alt.expr.floor(alt.datum.date / 10000), # year
   alt.expr.floor(alt.datum.date % 10000 / 100) - 1, # (zero-based) month
   alt.datum.date % 100 # day
)
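
The arithmetic behind that expression can be checked in plain Python. This is a minimal sketch; the helper name split_date_integer is hypothetical and not part of the notebook.

```python
from datetime import date

def split_date_integer(di):
    """Split an integer like 20200410 into (year, month, day)."""
    year = di // 10000            # 20200410 // 10000 -> 2020
    month = (di % 10000) // 100   # 410 // 100 -> 4
    day = di % 100                # 20200410 % 100 -> 10
    return year, month, day

year, month, day = split_date_integer(20200410)

# Python's date uses one-based months, so no "- 1" adjustment is needed here,
# unlike in the Vega expression.
print(date(year, month, day).isoformat())
```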
In [ ]:
def cases_for_state_folded(state, show_points=False):
    case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']
    cleaned_date = alt.expr.datetime(alt.expr.floor(alt.datum.date / 10000), # year
                                     alt.expr.floor(alt.datum.date % 10000 / 100) - 1, # (zero-based) month
                                     alt.datum.date % 100) # day
    chart = alt.Chart(df).\
                encode(alt.X("monthdate(cleandate):O", title="date"), 
                       alt.Y("cases:Q", scale=alt.Scale(type="log")), 
                       alt.Color("case type:N", 
                                 sort=alt.EncodingSortField(field="cases", 
                                                            order="descending", 
                                                            op="max")),
                       tooltip=['yearmonthdate(cleandate)', 'state', 'case type:N', 'cases:Q']).\
                transform_filter(alt.datum.state == state).\
                transform_calculate(
                    cleandate=cleaned_date
                ).\
                transform_fold(
                    as_=["case type", "cases"],
                    fold=case_types
                ).\
                transform_filter(alt.datum.cases > 0)
                
    # Layer points over the line only when requested.
    return (chart.mark_line() + chart.mark_point()) if show_points else chart.mark_line()


cases_for_state_folded("WI")

An interactive per-state plot

We can also use Altair's selection support to make an interactive chart that lets us choose which state to plot cases for.

In [ ]:
def interactive_cases_for_state():
    case_types = ['death', 'positive', 'hospitalizedCumulative', 'inIcuCumulative']
    # Offer only states that have at least one positive case recorded.
    has_positive = ((pd.to_numeric(cleaned["cases"], errors="coerce") > 0) &
                    (cleaned["case type"] == "positive"))
    input_dropdown = alt.binding_select(options=cleaned[has_positive]["state"].sort_values().unique())
    selection = alt.selection_single(fields=['state'], bind=input_dropdown, name='Choose', init={"state":"AK"})

    chart = alt.Chart(cleaned).\
                encode(alt.X("date:N"), 
                       alt.Y("cases", scale=alt.Scale(type="log")), 
                       alt.Color("case type", 
                                 sort=alt.EncodingSortField(field="cases", 
                                                            order="descending", 
                                                            op="max")),
                       tooltip=['date', 'state', 'case type', 'cases']).\
                transform_filter(selection).\
                transform_filter(alt.datum.cases > 0).\
                transform_filter(alt.FieldOneOfPredicate("case type", case_types)).\
                add_selection(selection)
    
    return chart.mark_line()
In [ ]:
interactive_cases_for_state()

Plotting cases by state on a map

To plot case counts on a map, we'll need to integrate geographic data (the shapes of states, stored as TopoJSON) with our observations.

We'll pull down state shapes from a public data source that has both state and county data, using Altair's topo_feature function:

In [ ]:
states = alt.topo_feature("https://vega.github.io/vega-datasets/data/us-10m.json", "states")

To plot total case counts per state, we'll make a choropleth in Altair, which means joining the case counts with the state shapes. The state shapes are keyed by numeric FIPS state codes, not by alphabetical state codes. We have the FIPS codes in the source data as fips, so we'll use Altair's transform_lookup method to indicate that we want to take the case count, case type, state, and date from a postprocessed data frame where the fips field matches the id field in our state collection.

In [ ]:
ctrim = cleaned[(cleaned["case type"] == "positive") & (cleaned["date"] == cleaned["date"].max())].copy()

alt.Chart(
    states
    ).mark_geoshape(
    ).encode(
        color='cases:Q',
        tooltip=['state:N', 'cases:Q', 'date:N']
    ).transform_lookup(
        lookup='id',
        from_=alt.LookupData(ctrim, 'fips', ['cases', 'case type', 'state', 'date'])
    ).project(
        type='albersUsa'
    ).properties(
        width=500, height=400
    )
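
One way to reason about what transform_lookup does above is as a left join from the state shapes to the case data. The sketch below mimics that join in Pandas. The shapes frame is a hypothetical stand-in for the map features (which carry an id but no case data), and the case numbers are invented. It is an analogy for the join semantics, not a description of how Altair implements the lookup.

```python
import pandas as pd

# Hypothetical stand-in for the state shapes: only the numeric FIPS id matters here.
shapes = pd.DataFrame({"id": [55, 27, 2]})  # Wisconsin, Minnesota, Alaska

# Hypothetical latest positive counts, keyed by FIPS code (numbers are made up).
latest = pd.DataFrame({
    "fips": [55, 27],
    "state": ["WI", "MN"],
    "cases": [10, 20],
})

# Keep every shape and attach case data wherever id matches fips.
joined = shapes.merge(latest, how="left", left_on="id", right_on="fips")

print(joined[["id", "state", "cases"]])
```

Shapes with no matching fips value (Alaska in this toy data) keep their row but have no case count, which is why such states would appear without a meaningful color on the map.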