The purpose of this notebook is to introduce a key Python library, pandas, that can be used to manipulate and analyse LibCrowds results data.
The pandas library provides access to high-performance data analysis tools via an accessible Python interface. We will use the library to load all of our In the Spotlight results into a structure called a dataframe. A dataframe is a two-dimensional data structure, similar to a spreadsheet, that can be built from many different kinds of input. As everything is stored in memory, rather than on disk, the main limitation of this type of data structure is the amount of RAM installed on the computer. However, for any modern computer this is unlikely to be an issue until we reach tens of millions of results.
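To make the idea of "many different kinds of input" a little more concrete, here is a minimal sketch that builds the same small dataframe in two ways: from a dictionary of columns and from a list of row dictionaries. The titles and genres are invented for illustration and are not taken from our dataset. The last lines give a rough estimate of how much memory the dataframe occupies.

```python
import pandas

# Hypothetical sample data, invented purely for illustration.
from_columns = pandas.DataFrame({
    "title": ["Play A", "Play B"],
    "genre": ["Farce", "Comedy"],
})

from_records = pandas.DataFrame([
    {"title": "Play A", "genre": "Farce"},
    {"title": "Play B", "genre": "Comedy"},
])

# Both inputs describe the same two-row, two-column table.
print(from_columns.equals(from_records))

# Approximate memory footprint in bytes, including the string contents.
memory_bytes = int(from_columns.memory_usage(deep=True).sum())
print(memory_bytes)
```

The exact number of bytes reported will vary between pandas versions and platforms, so it is best treated as an estimate.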
We begin by importing pandas.
import pandas
For this notebook, our input will be all of the performance data collected so far via the crowdsourcing projects presented on In the Spotlight. In a previous notebook we saw how these results are modelled in their raw form. However, for the purposes of this notebook we have converted this raw data into a table of performances, where each row contains the known data for a specific performance (e.g. title, date, genre and theatre). The way in which this was achieved is slightly too complex to introduce here but for those interested the scripts can be found in this repository.
All we currently need to know about the code block below is that it loads our dataframe of performance data.
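import os
import sys

# Make the helper scripts in ../data/scripts importable from this notebook.
module_path = os.path.abspath(os.path.join('..', 'data', 'scripts'))
if module_path not in sys.path:
    sys.path.append(module_path)

from get_its_performances import get_performances_df

# Load all In the Spotlight performances into a dataframe.
df = get_performances_df()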
The head function returns the first n rows of a dataset (defaults to 5); we can use this function to take a first glance at our dataframe.
df.head()
| | title | date | genre | link | theatre | city | source |
|---|---|---|---|---|---|---|---|
| 0 | Pageantry | NaN | NaN | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Theatre Royal, Margate | Margate | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
| 1 | The Hypocrite | NaN | Comedy | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Theatre Royal, Margate | Margate | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
| 2 | The Padlock | NaN | Musical Farce | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Theatre Royal, Margate | Margate | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
| 3 | The Village Lawyer | NaN | Farce | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Theatre Royal, Margate | Margate | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
| 4 | Death of Gen. Wolfe | NaN | Ballet | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Theatre Royal, Margate | Margate | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
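The head function also accepts the number of rows to return, and tail does the same from the end of the dataframe. The sketch below shows both on a small made-up table that stands in for our performance data; the titles and genres are hypothetical.

```python
import pandas

# Hypothetical stand-in for the performances dataframe.
sample = pandas.DataFrame({
    "title": ["Play A", "Play B", "Play C", "Play D", "Play E", "Play F", "Play G"],
    "genre": ["Farce", None, "Comedy", "Farce", None, "Ballet", "Farce"],
})

first_five = sample.head()   # no argument: the first 5 rows
first_two = sample.head(2)   # the first 2 rows
last_two = sample.tail(2)    # the last 2 rows

print(first_two)
print(last_two)

# shape gives (number of rows, number of columns) for the whole dataframe.
print(sample.shape)
```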
The remainder of this notebook will introduce a few basic functions that we can use to begin analysing and manipulating our dataset.
df.describe()
| | title | date | genre | link | theatre | city | source |
|---|---|---|---|---|---|---|---|
| count | 2317 | 989 | 1298 | 2317 | 2317 | 2317 | 2317 |
| unique | 1305 | 438 | 147 | 1076 | 6 | 6 | 1076 |
| top | Rosina | 1830-11-23 | Farce | http://access.bl.uk/item/viewer/ark:/81055/vdc... | Miscellaneous Plymouth theatres | Plymouth | https://api.bl.uk/metadata/iiif/ark:/81055/vdc... |
| freq | 13 | 7 | 221 | 12 | 1230 | 1230 | 12 |
This simple function already presents some interesting results. At the time of writing, the unique row shows that we have 147 unique genres.
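As a rough guide to reading this kind of summary, the sketch below calls describe on a small made-up table (the titles and genres are invented, not taken from our dataset). For text columns, count is the number of non-missing values, unique is the number of distinct values, top is the most frequent value and freq is how often that value occurs. Because count skips missing values, it can be lower for some columns than for others.

```python
import pandas

# Hypothetical sample data with one missing genre.
sample = pandas.DataFrame({
    "title": ["Play A", "Play B", "Play A", "Play C"],
    "genre": ["Farce", None, "Farce", "Comedy"],
})

summary = sample.describe()
print(summary)

# Count the missing values in each column directly.
missing = sample.isna().sum()
print(missing)
```

Here the genre column has a count of 3 rather than 4, because one of the four rows has no genre recorded.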
We might be curious about what some of the more unusual genres are. To find them, we can use the value_counts function, which returns an object containing counts of unique values. Below, we call this function with the argument ascending=True, to sort the output in ascending order.
counts = df.genre.value_counts(ascending=True)
We can then run the following command to display the first ten rows.
counts[:10]
Fairy Spectacle                      1
Historical Melodrama                 1
National Play                        1
Drawing Room Entertainment           1
Grand National Military Spectacle    1
Masquerade                           1
Sketch                               1
Comic Drama                          1
Grand Comic Pantomime                1
Petite Farce                         1
Name: genre, dtype: int64
To display the top ten genres we could just change the ascending argument above to False (or remove it, as False is the default).
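A minimal sketch of both orderings, using a short made-up list of genres rather than our real data, might look like this. It also uses the normalize=True argument of value_counts, which returns proportions instead of raw counts.

```python
import pandas

# Hypothetical genre column, invented for illustration.
genres = pandas.Series(
    ["Farce", "Comedy", "Farce", "Ballet", "Farce", "Comedy"],
    name="genre",
)

most_common = genres.value_counts()               # descending by default
rarest = genres.value_counts(ascending=True)      # rarest values first
shares = genres.value_counts(normalize=True)      # proportions, not counts

print(most_common.head(2))
print(rarest.head(2))
print(shares)
```

With the real dataframe, the equivalent of the first call would be df.genre.value_counts().head(10).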
Note that if we call the describe() function with purely numeric data, the result’s index will include count, mean, std, min and max, as well as the 25th, 50th and 75th percentiles. We can demonstrate this by counting and describing the unique source values. The source column contains the canvas ID that identifies an individual playbill, so the following expression will produce a simple numeric summary of the number of performances recorded on each playbill.
df.source.value_counts().describe()
count    1076.000000
mean        2.153346
std         0.991920
min         1.000000
25%         2.000000
50%         2.000000
75%         3.000000
max        12.000000
Name: source, dtype: float64
At the time of writing we have a minimum of 1 and a maximum of 12 performances recorded on a playbill, with a mean of 2.15.
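To see where figures like these come from, the sketch below repeats the two steps on a tiny made-up column. The identifiers bill-1, bill-2 and bill-3 are hypothetical placeholders for canvas IDs. The first step counts how many performances share each identifier and the second describes those counts.

```python
import pandas

# Hypothetical source column: one entry per performance.
sources = pandas.Series(
    ["bill-1", "bill-1", "bill-2", "bill-3", "bill-3", "bill-3"],
    name="source",
)

# Step 1: number of performances recorded for each playbill.
per_playbill = sources.value_counts()
print(per_playbill)

# Step 2: numeric summary of those per-playbill counts.
stats = per_playbill.describe()
print(stats)

# The playbill with the most performances recorded.
busiest = per_playbill.idxmax()
print(busiest)
```

In this toy example there are three playbills with 3, 2 and 1 performances, so the summary reports a count of 3, a mean of 2, a minimum of 1 and a maximum of 3.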
In this notebook, we found out how to load all of our performance data from the In the Spotlight crowdsourcing projects into a pandas dataframe. We then ran some functions to perform a basic analysis of this dataframe.
For an introduction to producing visualisations of this data using Python see An Introduction to Visualising In the Spotlight Data Using Python.