Summarizing data in OCRE

A previous notebook showed how to get an overview of the values of data in OCRE. This notebook shows you how to summarize and graph distributions of values for OCRE properties. It uses version 1.6.0 of the nomisma library.

Configure Jupyter notebook

First, configure the Jupyter notebook. In addition to the nomisma library, we will use plotly to plot graphs and the histoutils package to simplify working with histograms.

In [ ]:
// 1. Add maven repository where we can find our libraries
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)
In [ ]:
// 2. Make libraries available with `$ivy` imports:
import $ivy.`edu.holycross.shot::nomisma:1.6.0`
import $ivy.`edu.holycross.shot::histoutils:2.2.0`
import $ivy.`org.plotly-scala::plotly-almond:0.7.1`

Load the full OCRE data set

In [ ]:
import edu.holycross.shot.nomisma._
val ocreCex = "https://raw.githubusercontent.com/neelsmith/nomisma/master/cex/ocre-cite-ids.cex"
val ocre = OcreSource.fromUrl(ocreCex)

// Sanity check:
require(ocre.size > 50000)

How are denominations distributed?

A previous notebook (available on mybinder) showed how to check the values for properties of an Ocre object. Let's review how many valid values OCRE includes for denomination. We'll use the hasDenomination function to keep only issues with a valid value for the denomination property, then apply denominationList to that result.

In [ ]:
println("Number of valid values for denomination: " + ocre.hasDenomination.denominationList.size)

Seventy-one seems like a lot. How often does each denomination appear?

Ocre includes a function to create a Histogram object for a named property. The Histogram has a Vector of Frequency objects, so if we sort the frequencies by count, we can look at the first and last entries to see the most and least common values in OCRE for denomination.

In [ ]:
import edu.holycross.shot.histoutils._
val denominationHisto: edu.holycross.shot.histoutils.Histogram[String] = ocre.histogram("denomination").sorted
println("Entries in histogram of denominations: " + denominationHisto.size)
println("Most frequent denomination:  " + denominationHisto.frequencies.head)
println("Least frequent denomination: " + denominationHisto.frequencies.last)
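
To make clear what such a histogram computes, here is a minimal sketch using only Scala's standard collections. The Freq case class and the sample values are hypothetical stand-ins for illustration; they are not the histoutils API and not actual OCRE data. The sketch counts occurrences of each value and sorts from most to least frequent, matching the ordering described above.

```scala
// Hypothetical stand-in for a frequency entry; NOT the histoutils class.
final case class Freq(item: String, count: Int)

// Hypothetical sample: one denomination label per issue (not OCRE data).
val sampleDenominations = Vector("denarius", "as", "denarius", "aureus", "denarius", "as")

// Count occurrences of each value, then sort from most to least frequent.
// Ties are broken alphabetically by item so the ordering is deterministic.
val sampleFreqs: Vector[Freq] = sampleDenominations
  .groupBy(identity)
  .map { case (item, occurrences) => Freq(item, occurrences.size) }
  .toVector
  .sortBy(f => (-f.count, f.item))

println("Most frequent:  " + sampleFreqs.head)
println("Least frequent: " + sampleFreqs.last)
```

The counts always sum to the number of input values, which is a handy sanity check on any histogram.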

It's straightforward to visualize histograms as bar graphs using the plotly library.

In [ ]:
// Import plotly libraries, and set display defaults suggested for use in Jupyter notebooks:
import plotly._, plotly.element._, plotly.layout._, plotly.Almond._
repl.pprinter() = repl.pprinter().copy(defaultHeight = 3)

Plotly can construct a bar graph from two parallel lists of values for the x and y axes. Each Frequency object in our histogram has item and count properties that we can use for x and y respectively.

In [ ]:
val denominationValues = denominationHisto.frequencies.map(_.item)
val denominationCounts = denominationHisto.frequencies.map(_.count)

val denominationPlot = Seq(
  Bar(x = denominationValues, y = denominationCounts)
)
plot(denominationPlot)
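
The two lists passed to Bar must stay aligned: position i in the x list has to describe the same entry as position i in the y list. As a sketch of that idea with standard Scala collections only, the hypothetical pairs below (illustrative labels and counts, not OCRE data) are split into parallel lists with unzip, which preserves order in both.

```scala
// Hypothetical (label, count) pairs, already sorted by count (not OCRE data).
val samplePairs = Vector(("denarius", 3), ("as", 2), ("aureus", 1))

// unzip splits the pairs into two parallel lists in a single pass,
// so labels and counts keep the same order.
val (sampleLabels, sampleCounts) = samplePairs.unzip

// Position i in one list corresponds to position i in the other.
sampleLabels.zip(sampleCounts).foreach { case (label, n) => println(s"$label: $n") }
```

Mapping the same sequence twice, as in the cell above, gives the same alignment guarantee because both maps traverse the frequencies in the same order.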

Geographic regions

Let's take a second example: how frequently are issues struck in different geographic regions over the five centuries of data in OCRE?

In [ ]:
val regionHisto: edu.holycross.shot.histoutils.Histogram[String] = ocre.histogram("region").sorted

val regionValues = regionHisto.frequencies.map(_.item)
val regionCounts = regionHisto.frequencies.map(_.count)

val regionPlot = Seq(
  Bar(x = regionValues, y = regionCounts)
)
plot(regionPlot)
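
Raw counts can be hard to compare at a glance, so it can help to express each count as a share of the total. The following is a minimal sketch with standard Scala collections; the region names and counts are hypothetical placeholders, not figures from OCRE.

```scala
// Hypothetical (region, count) pairs; placeholder values, not OCRE figures.
val sampleRegions = Vector(("Region A", 60), ("Region B", 30), ("Region C", 10))

// Total number of issues across all regions in the sample.
val sampleTotal = sampleRegions.map(_._2).sum

// Convert each count to a percentage of the total.
val sampleShares = sampleRegions.map { case (region, n) => (region, 100.0 * n / sampleTotal) }

sampleShares.foreach { case (region, pct) => println(f"$region%-10s $pct%5.1f%%") }
```

The same transformation could be applied to the item and count values of a histogram's frequencies before plotting.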

Geography of mints?

It would be nice to look further into the uneven distribution of issues outside of Italy. In a subsequent notebook, we'll take OCRE's information about specific mints and generate geographic maps.