
Building a citable text corpus from OCRE¶

This notebook shows how to load OCRE (Online Coins of the Roman Empire) data from a CEX file over the internet and build a corpus of texts citable by CTS URN. It uses version 1.7.0 of the nomisma library.

Configure Jupyter notebook¶

First, configure the Jupyter notebook. In addition to the nomisma library, we'll need the ohco2 and xcite libraries from the CITE architecture.

In [ ]:

// 1. Add the Maven repository that hosts our libraries
val myBT = coursierapi.MavenRepository.of("https://dl.bintray.com/neelsmith/maven")
interp.repositories() ++= Seq(myBT)

In [ ]:

// 2. Make libraries available with `$ivy` imports:
import $ivy.`edu.holycross.shot::nomisma:1.7.0`
import $ivy.`edu.holycross.shot::ohco2:10.16.0`
import $ivy.`edu.holycross.shot.cite::xcite:4.1.1`

Load the full OCRE data set¶

In [ ]:

import edu.holycross.shot.nomisma._
val ocreCex = "https://raw.githubusercontent.com/neelsmith/nomisma/master/cex/ocre-cite-ids.cex"
val ocre = OcreSource.fromUrl(ocreCex)

// Sanity check:
require(ocre.size > 50000)

TL;DR¶

You can build an OHCO2 corpus from the loaded OCRE data with the corpus function.

In [ ]:

import edu.holycross.shot.ohco2._
import edu.holycross.shot.cite._

val corpus: Corpus = ocre.corpus
println("Citable nodes of text in corpus: " + corpus.size)

How it works for individual issues¶

The OcreIssue class includes a textNodes function that creates a Vector of zero to two CitableNodes, one for each legend the issue records. An issue with both an obverse and a reverse legend yields two text nodes. Let's examine the CTS URNs of one such issue.

In [ ]:

val issueId = "3.com.43"
val randomIssue = ocre.issue(issueId).get

println("In issue " + issueId + ", made " + randomIssue.textNodes.size + " text nodes")

for (n <- randomIssue.textNodes) {
    println("\nReference: " + n.urn)
    println("Text content: " + n.text)
}

Let's parse the components of these URNs.

Each belongs to the CTS namespace hcnum and to the text group issues.

Within that group, its document identifier is ric, and the specific version identifier is raw. When we process the corpus (e.g., to generate a fully expanded version of abbreviated terms), we will use a different version identifier, but the rest of the URN will be the same.

The passage component is directly adapted from the nomisma.org identifier: 3.com.43 identifies RIC volume 3, Commodus, issue 43. The final piece of the passage component distinguishes obverse text from reverse text.
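
The breakdown above can be sketched in plain Scala, without the xcite library, by splitting a URN string on its delimiters. This is only an illustration: SimpleCtsUrn and parseUrn are hypothetical names, not part of any CITE library, and the trailing ".obv" label in the sample URN is an assumption about how the obverse legend is marked.

```scala
// Hypothetical, simplified model of a CTS URN (not the xcite CtsUrn class).
final case class SimpleCtsUrn(
  namespace: String,
  textGroup: String,
  work: String,
  version: String,
  passage: Vector[String]
)

// Split a version-level CTS URN with a passage into its components.
// Returns None for any string that does not have the expected shape.
def parseUrn(s: String): Option[SimpleCtsUrn] = {
  // Five colon-delimited parts:
  // urn : cts : namespace : work hierarchy : passage hierarchy
  s.split(":") match {
    case Array("urn", "cts", ns, workPart, passagePart) =>
      // The work hierarchy is dot-delimited: group.work.version
      workPart.split("\\.") match {
        case Array(group, work, version) =>
          Some(SimpleCtsUrn(ns, group, work, version, passagePart.split("\\.").toVector))
        case _ => None
      }
    case _ => None
  }
}

// ASSUMPTION: ".obv" is used here as an illustrative obverse label.
val parsed = parseUrn("urn:cts:hcnum:issues.ric.raw:3.com.43.obv")

parsed.foreach { u =>
  println("Namespace:  " + u.namespace)
  println("Text group: " + u.textGroup)
  println("Document:   " + u.work)
  println("Version:    " + u.version)
  println("Passage:    " + u.passage.mkString(" > "))
}
```

In the notebook itself, the CtsUrn values on each node already carry this structure, so hand-parsing like this is only useful for seeing how the pieces fit together.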

How it works: building a corpus¶

The corpus function in Ocre creates 0-2 CitableNodes for each issue and compiles them into a text Corpus.
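
As a rough mental model of that process, the following sketch uses simplified stand-in types rather than the real OcreIssue and CitableNode classes. SimpleIssue, SimpleNode, the placeholder issue IDs, the placeholder legends, and the ".obv"/".rev" suffixes are all assumptions made for illustration; the library's actual implementation may differ.

```scala
// Hypothetical stand-ins for CitableNode and OcreIssue.
final case class SimpleNode(urn: String, text: String)

final case class SimpleIssue(
  id: String,
  obvLegend: Option[String],
  revLegend: Option[String]
) {
  // Zero, one, or two nodes, depending on which legends are present.
  def textNodes: Vector[SimpleNode] = {
    val base = "urn:cts:hcnum:issues.ric.raw:" + id
    // ASSUMPTION: ".obv" and ".rev" are illustrative suffixes.
    Vector(
      obvLegend.map(t => SimpleNode(base + ".obv", t)),
      revLegend.map(t => SimpleNode(base + ".rev", t))
    ).flatten
  }
}

// Placeholder data, not real OCRE records.
val issues = Vector(
  SimpleIssue("x.demo.1", Some("OBVERSE TEXT"), Some("REVERSE TEXT")),
  SimpleIssue("x.demo.2", Some("OBVERSE ONLY"), None),
  SimpleIssue("x.demo.3", None, None)
)

// Compiling the corpus amounts to flattening every issue's nodes
// into a single sequence.
val nodes: Vector[SimpleNode] = issues.flatMap(_.textNodes)
println("Citable nodes: " + nodes.size)
```

Modeling each legend as an Option keeps the "0-2 nodes" rule implicit: issues with no legends simply contribute nothing to the flattened result.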

As in any CTS environment, we can then select texts identified at any level of the passage and work hierarchies.

In [ ]:

val commodus43 = corpus.nodes.filter(_.urn <= CtsUrn("urn:cts:hcnum:issues.ric.raw:3.com.43"))
println("**OBV** " + commodus43.map(_.text).mkString(" **REV** "))

In [ ]:

val allCommodus = corpus.nodes.filter(_.urn <= CtsUrn("urn:cts:hcnum:issues.ric.raw:3.com"))
println("All legends in coins of Commodus: " + allCommodus.size)

In [ ]:

val allRIC3 = corpus.nodes.filter(_.urn <= CtsUrn("urn:cts:hcnum:issues.ric.raw:3"))
println("All legends in RIC 3: " + allRIC3.size)
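
The three filters above all rely on hierarchical containment: a shorter passage reference selects every node whose reference sits beneath it. A minimal sketch of that idea for the passage component alone, in plain Scala, might look like the following. The function passageContainedIn is a hypothetical helper, not the xcite implementation of `<=`, which presumably also takes the work hierarchy into account; the ".obv" suffix is again an assumed label.

```scala
// True if `ref` is equal to, or nested beneath, `container`.
// Both arguments are dot-delimited passage references.
def passageContainedIn(ref: String, container: String): Boolean = {
  val refParts = ref.split("\\.").toVector
  val containerParts = container.split("\\.").toVector
  // Compare whole parts, not raw characters, so that "3.com.4"
  // does not accidentally match "3.com.43".
  refParts.startsWith(containerParts)
}

// ASSUMPTION: ".obv" is an illustrative label for the obverse legend.
val sample = "3.com.43.obv"

println(passageContainedIn(sample, "3.com.43")) // issue level
println(passageContainedIn(sample, "3.com"))    // authority level
println(passageContainedIn(sample, "3"))        // volume level
println(passageContainedIn(sample, "3.com.4"))  // a different issue
```

Comparing dot-separated parts rather than string prefixes is the important design choice here: a plain string prefix test would wrongly treat issue 43 as falling under issue 4.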