An Introduction to Analysing LibCrowds Results Data Using Python

The purpose of this notebook is to introduce a key Python library, pandas, that can be used to manipulate and analyse LibCrowds results data.

The pandas library provides high-performance data analysis tools through an accessible Python interface. We will use it to load all of our In the Spotlight results into a structure called a dataframe. A dataframe is a two-dimensional data structure, similar to a spreadsheet, that can be built from many different kinds of input. Because a dataframe is held in memory rather than on disk, its size is limited mainly by the amount of RAM installed on the computer. For any modern computer this is unlikely to be an issue until we reach tens of millions of results.
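As a minimal sketch of the idea, the cell below builds a small dataframe from a list of dictionaries and reports how much memory it occupies. The rows are invented for illustration and are not real In the Spotlight results; `toy_df` and `memory_bytes` are hypothetical names used only here.

```python
import pandas

# Hypothetical rows, invented for illustration only.
rows = [
    {'title': 'Example Play A', 'genre': 'Farce', 'city': 'Margate'},
    {'title': 'Example Play B', 'genre': 'Comedy', 'city': 'Plymouth'},
    {'title': 'Example Play C', 'genre': None, 'city': 'Plymouth'},
]

# A list of dictionaries is one of the many inputs that DataFrame accepts.
toy_df = pandas.DataFrame(rows)

# shape is a (number of rows, number of columns) tuple.
print(toy_df.shape)

# Total bytes held in RAM; deep=True also counts the string contents.
memory_bytes = int(toy_df.memory_usage(deep=True).sum())
print(memory_bytes)
```

Running memory_usage on a real dataframe gives a rough idea of how close we are to the RAM limit described above.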

We begin by importing pandas.

In [23]:

import pandas

The dataset

For this notebook, our input will be all of the performance data collected so far via the crowdsourcing projects presented on In the Spotlight. In a previous notebook we saw how these results are modelled in their raw form. However, for the purposes of this notebook we have converted this raw data into a table of performances, where each row contains the known data for a specific performance (e.g. title, date, genre and theatre). The way in which this was achieved is slightly too complex to introduce here but for those interested the scripts can be found in this repository.

All we currently need to know about the code block below is that it loads our dataframe of performance data.

In [24]:

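import os
import sys

# Add the repository's data/scripts directory to the module search path.
module_path = os.path.abspath(os.path.join('..', 'data', 'scripts'))
if module_path not in sys.path:
    sys.path.append(module_path)

# Load the table of performances into a dataframe.
from get_its_performances import get_performances_df
df = get_performances_df()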

The head function returns the first n rows of a dataset (n defaults to 5); we can use this function to take a first glance at our dataframe.

In [25]:

df.head()

Out[25]:

	title	date	genre	link	theatre	city	source
0	Pageantry	NaN	NaN	http://access.bl.uk/item/viewer/ark:/81055/vdc...	Theatre Royal, Margate	Margate	https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
1	The Hypocrite	NaN	Comedy	http://access.bl.uk/item/viewer/ark:/81055/vdc...	Theatre Royal, Margate	Margate	https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
2	The Padlock	NaN	Musical Farce	http://access.bl.uk/item/viewer/ark:/81055/vdc...	Theatre Royal, Margate	Margate	https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
3	The Village Lawyer	NaN	Farce	http://access.bl.uk/item/viewer/ark:/81055/vdc...	Theatre Royal, Margate	Margate	https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
4	Death of Gen. Wolfe	NaN	Ballet	http://access.bl.uk/item/viewer/ark:/81055/vdc...	Theatre Royal, Margate	Margate	https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
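To see more or fewer rows we can pass n explicitly. The sketch below uses a small stand-in dataframe, `toy_df` (a hypothetical substitute for df, so that the cell runs on its own), built from the titles and genres shown above.

```python
import pandas

# Stand-in for df; values copied from the first five rows shown above.
toy_df = pandas.DataFrame({
    'title': ['Pageantry', 'The Hypocrite', 'The Padlock',
              'The Village Lawyer', 'Death of Gen. Wolfe'],
    'genre': [None, 'Comedy', 'Musical Farce', 'Farce', 'Ballet'],
})

# Pass n to change how many rows are returned.
first_two = toy_df.head(2)

# tail is the counterpart of head and returns the last n rows.
last_two = toy_df.tail(2)

print(first_two)
print(last_two)
```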

The remainder of this notebook will introduce a few basic functions that we can use to begin analysing and manipulating our dataset.

Summarising dataframes

The describe function generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

In [26]:

df.describe()

Out[26]:

	title	date	genre	link	theatre	city	source
count	2317	989	1298	2317	2317	2317	2317
unique	1305	438	147	1076	6	6	1076
top	Rosina	1830-11-23	Farce	http://access.bl.uk/item/viewer/ark:/81055/vdc...	Miscellaneous Plymouth theatres	Plymouth	https://api.bl.uk/metadata/iiif/ark:/81055/vdc...
freq	13	7	221	12	1230	1230	12

This simple function already presents some interesting results. At the time of writing, we can see that we have over 140 unique genres.
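For text columns such as ours, the rows of the describe output are count, unique, top and freq. The following sketch shows how to read them, using a small stand-in dataframe (`toy_df`, hypothetical, with values chosen for illustration) rather than the full dataset.

```python
import pandas

# Stand-in for df, with one missing genre.
toy_df = pandas.DataFrame({
    'title': ['Pageantry', 'The Hypocrite', 'The Padlock',
              'The Village Lawyer'],
    'genre': [None, 'Comedy', 'Farce', 'Farce'],
})

summary = toy_df.describe()
print(summary)

# count ignores missing values, so it can be lower than the number of rows.
print(len(toy_df), summary.loc['count', 'genre'])

# top is the most frequent value and freq is how often it occurs.
print(summary.loc['top', 'genre'], summary.loc['freq', 'genre'])

# nunique returns the number of distinct non-missing values directly.
print(toy_df.genre.nunique())
```

This is why, in the output above, the count for date and genre is lower than the count for title: performances without a known date or genre are excluded.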

We might be curious about what some of the more unusual genres are. To find them, we can use the value_counts function, which returns a Series containing the count of each unique value. Below, we call this function with the argument ascending=True, to sort the output in ascending order.

In [27]:

counts = df.genre.value_counts(ascending=True)

We can then run the following command to display the first ten rows.

In [28]:

counts[:10]

Out[28]:

Fairy Spectacle                      1
Historical Melodrama                 1
National Play                        1
Drawing Room Entertainment           1
Grand National Military Spectacle    1
Masquerade                           1
Sketch                               1
Comic Drama                          1
Grand Comic Pantomime                1
Petite Farce                         1
Name: genre, dtype: int64

To display the top ten genres we could just change the ascending argument above to False (or remove it, as False is the default).
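A minimal sketch of the descending case, using a short made-up genre column (`genres` is a hypothetical stand-in for df.genre). It also shows the dropna argument, since value_counts leaves out missing values unless we ask otherwise.

```python
import pandas

# Made-up genre values, including one missing entry.
genres = pandas.Series(
    ['Farce', 'Comedy', 'Farce', None, 'Ballet', 'Farce', 'Comedy'],
    name='genre',
)

# The default order is descending, so the most common genre comes first.
top = genres.value_counts()
print(top.head(2))

# Missing values are left out unless dropna=False is passed.
with_missing = genres.value_counts(dropna=False)
print(with_missing)
```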

Note that if we call the describe function with purely numeric data, the result’s index will instead include count, mean, std, min and max, as well as the lower (25%), median (50%) and upper (75%) percentiles. We can demonstrate this by counting and describing the unique source values. The source column contains the canvas ID that identifies an individual playbill, so the following command will produce a simple numeric summary of the number of performances recorded on each playbill.

In [29]:

df.source.value_counts().describe()

Out[29]:

count    1076.000000
mean        2.153346
std         0.991920
min         1.000000
25%         2.000000
50%         2.000000
75%         3.000000
max        12.000000
Name: source, dtype: float64

At the time of writing we have a minimum of 1 and a maximum of 12 performances recorded on a playbill, with a mean of 2.15.
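Individual statistics can be read from the describe output by label. The sketch below assumes a handful of hypothetical canvas IDs (`canvas-a` and so on are placeholders, not real identifiers) in place of df.source.

```python
import pandas

# Hypothetical canvas IDs, one entry per recorded performance.
sources = pandas.Series(
    ['canvas-a', 'canvas-a', 'canvas-b',
     'canvas-c', 'canvas-c', 'canvas-c'],
    name='source',
)

# Number of performances recorded on each playbill.
per_playbill = sources.value_counts()

# describe on numeric data returns a Series indexed by statistic name.
stats = per_playbill.describe()
print(stats)

# Pick out single values by label.
print(stats['min'], stats['max'], stats['mean'])

# The same figures are also available as methods on the counts.
print(per_playbill.median())
```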

Summary

In this notebook, we saw how to load all of our performance data from the In the Spotlight crowdsourcing projects into a pandas dataframe. We then ran some functions to perform a basic analysis of this dataframe.

For an introduction to producing visualisations of this data using Python, see An Introduction to Visualising In the Spotlight Data Using Python.