Getting Texts

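This notebook is focused on an essential component of digital text analysis: preparing a corpus of texts. It's part of the Art of Literary Text Analysis and assumes that you've already worked through Getting Setup and Getting Started. In this notebook we'll look at accessing plain texts online, using some simple string functions, working with parts of a string, counting occurrences of a string, extracting text, accessing local plain texts, and listing files in a local directory.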

Note that we're especially interested here in working with plain texts; in later notebooks we'll deal with other formats.

Accessing Plain Texts Online

Let's create a new notebook and set its name (by clicking on the Untitled label above the toolbar) to GettingTexts.

Now we'll change the editing mode of the first cell to Markdown and copy and paste in the following Markdown-encoded text:

# Getting Texts

We are first going to experiment with loading a plain text into memory from Project Gutenberg (http://gutenberg.org), an online library with tens of thousands of free texts in different languages and formats. We can Google something like python3 read from url to discover pages like https://docs.python.org/3/howto/urllib2.html that explain the basics of reading content.

Hit Shift-Enter to evaluate/format the Markdown cell and create a new code cell.

The Markdown cell explains the essentials of what we want to do: fetch the contents of a plain text document from a URL, assign it to a string variable, and demonstrate various basic operations we can perform on a string. For this program we'll choose the Works of Edgar Allan Poe, Volume 1, available at http://www.gutenberg.org/files/2147/2147-0.txt (you're encouraged to visit this link before continuing).

First, let's fetch our document using urllib.request (not all the code below will be explained in detail now, but we'll come back to it).

In [1]:
import urllib.request
poeUrl = "http://www.gutenberg.org/files/2147/2147-0.txt"
poeString = urllib.request.urlopen(poeUrl).read().decode().strip()
print("This string has", len(poeString), "characters")
This string has 550332 characters

Most of the principles involved have already been covered in the Getting Started notebook:

  1. Importing a module (in this case urllib.request instead of the time module)
  2. Assigning a string (the URL) to a variable name of our choice, poeUrl
  3. Making function calls and assigning the result to the variable name poeString
  4. Printing a result with the last line of code, in this case to show the number of characters in our string

In this case urlopen() is the function name; it takes an argument that contains our Poe URL and returns an HTTP response object on which we can invoke read() to get the bytes data at our URL. Next, we call decode() to convert the bytes data to a proper Unicode string (decode() assumes UTF-8 by default). Finally, we call strip() to remove any leading and trailing whitespace.
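To make the bytes-to-string step concrete, here is a minimal sketch that needs no network access. The sample text is made up for illustration and simply stands in for what read() would return; the utf-8-sig codec is a standard Python codec that decodes UTF-8 and also drops a leading byte-order mark (\ufeff), if there is one.

```python
# A made-up sample standing in for the bytes returned by read()
raw_bytes = "\ufeffPoe’s café".encode("utf-8")

# decode() assumes UTF-8 by default and keeps the byte-order mark
plain = raw_bytes.decode()
# the utf-8-sig codec decodes the same bytes but drops a leading mark
without_bom = raw_bytes.decode("utf-8-sig")

print(len(raw_bytes), "bytes")
print(len(plain), "characters, including the byte-order mark")
print(len(without_bom), "characters without it")
```

The byte count is larger than either character count because characters such as ’ and é take more than one byte in UTF-8.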

Many things can go wrong during networking calls, but if all goes well, we should now have a variable (poeString) containing a string with the same contents as at our URL.
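One way to cope with a failed request is to catch the error rather than let the cell crash. The sketch below is only an illustration: fetch_text is a hypothetical helper name, the 30-second timeout is an arbitrary assumption, and it only handles the URLError family of errors (other problems, such as a text that isn't valid UTF-8, would still raise an exception).

```python
import urllib.error
import urllib.request

def fetch_text(url, timeout=30):
    """Return the decoded text at url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode().strip()
    except urllib.error.URLError as error:
        # URLError also covers HTTPError (for example a 404 response)
        print("Could not fetch", url, "-", error.reason)
        return None

# Possible use in the notebook (not run here, since it needs a connection):
# poeString = fetch_text(poeUrl)
```

Returning None lets the following cells check whether the download worked before trying to use the string.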

Fetching the contents of a URL is a relatively "expensive" operation (in code-speak this means that it's more computationally or time intensive), so we want to isolate that in its own Jupyter cell so that we don't have to run it more times than necessary. If we want to explore various aspects of the poeString string that we fetched, we should do that in a separate cell so that we're not re-fetching the string each time.

There's an additional motivation for not repeating the fetching operation: Project Gutenberg (and some other sites) monitor how many requests are made from your IP address, and it can temporarily cut you off if it detects what it considers to be too many requests (waiting a while will usually lift this restriction). Multiple requests in a short period of time can also be a problem for shared IP addresses, like in a classroom setting.
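Another way to avoid repeated requests is to keep a copy of the download on disk and reuse it. This is a minimal sketch under stated assumptions: fetch_once and the cache file name are hypothetical, and the file-handling calls it relies on are explained later in this notebook.

```python
import os
import urllib.request

def fetch_once(url, cache_path):
    """Return the text at url, downloading it only if cache_path is missing."""
    if os.path.exists(cache_path):
        # reuse the saved copy instead of asking the server again
        with open(cache_path, "r", encoding="utf-8", newline="") as f:
            return f.read()
    with urllib.request.urlopen(url) as response:
        text = response.read().decode()
    # newline="" keeps the line endings exactly as they were downloaded
    with open(cache_path, "w", encoding="utf-8", newline="") as f:
        f.write(text)
    return text

# Possible use in the notebook (the file name is an assumption):
# poeString = fetch_once(poeUrl, "poe-volume1.txt").strip()
```

With this approach, re-running the cell only contacts the server the first time; deleting the cached file forces a fresh download.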

Some Simple String Functions

One of the essential concepts of Jupyter (and the underlying IPython) is that once code is executed, any variables remain accessible in memory for subsequent cells that are executed. This is the job of the kernel, which interprets and executes code and keeps things in memory ("Kernel" is one of the items in the Jupyter menu bar). Typically we execute cells in order as we proceed through a notebook (see the options under the Cell menu).

We've already seen above how to show the length of a string using the len() function. The length is a number but it can be a bit difficult to read because there is no thousands separator. Let's improve the output by searching how to format a number with a thousands separator, which leads in particular to this suggestion for using the format specification mini-language.

In [2]:
poeStringLen = len(poeString)
poeStringLenFormatted = "{:,}".format(poeStringLen) # format mini-language
print("This string has", poeStringLenFormatted, "characters")
This string has 550,332 characters

This suggests that there are 550,332 characters (because we're in Python 3.x we should be dealing with Unicode, and so this should be a true count of the characters, not just the bytes since some characters require multiple bytes).

We've shown a longer form of the code above, but we can also nest functions, or have function arguments that contain other functions – this works as well:

print("This string has", "{:,}".format(len(poeString)), "characters")

This version is more succinct, but it can be more difficult to read (and to debug or resolve if there's a problem). Programming is about choices!
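As a further illustration of those choices, the same format specification (a comma for the thousands separator) can be written in at least three ways. This sketch hard-codes the character count reported above so that it runs on its own; the f-string form assumes Python 3.6 or later.

```python
characterCount = 550332  # the length reported for poeString above

# three equivalent ways of adding a thousands separator
print("{:,}".format(characterCount))
print(format(characterCount, ","))
print(f"{characterCount:,}")
```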

Working with Parts of a String

A string is a sequence of characters and Python has a powerful way of working with sequences. For instance, I can do this to get the first 25 characters of our poeString:

In [3]:
poeString[:25]
Out[3]:
'\ufeffProject Gutenberg’s The '

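Our string is a sequence where each character has an index position. Python, like many languages, starts its indexing at 0. Setting aside the invisible byte-order mark (\ufeff) that the output above shows at the very start of the downloaded string, the first 25 visible characters are indexed like this, with the "P" at index 0 and ␣ marking a space (in poeString itself the mark occupies index 0, so each visible character sits one position later):

P  r  o  j  e  c  t  ␣  G  u  t  e  n  b  e  r  g  ’  s  ␣  T  h  e  ␣  W
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22 23 24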

Let's see a few more examples of working with string sequences:

In [4]:
print("First character:", poeString[0])
print("Last character:", poeString[-1])
print("First 25 characters:", poeString[:25])
print("Last 25 characters:", poeString[-25:])
print("Characters 8 to 29:", poeString[8:30])
First character: 
Last character: .
First 25 characters: Project Gutenberg’s The 
Last 25 characters: to hear about new eBooks.
Characters 8 to 29:  Gutenberg’s The Works

Working with character sequences like this is an essential aspect of text analysis and it's well worth becoming familiar with this syntax.
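The slicing syntax can be practised on a short string where every index is easy to check by hand. The sample string below is just a stand-in for poeString:

```python
sample = "Project Gutenberg"  # a short stand-in for poeString

print(sample[0])      # first character: P
print(sample[-1])     # last character: g
print(sample[:7])     # from the start up to, but not including, index 7: Project
print(sample[8:])     # from index 8 to the end: Gutenberg
print(sample[8:11])   # indexes 8, 9 and 10: Gut
print(len(sample[8:11]))  # a slice [a:b] has b - a characters: 3
```

Note that the end index of a slice is exclusive, which is why sample[8:11] stops at index 10.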

What else can we do with a string? A good place to look is the string methods documentation.

Counting Occurrences of a String

For instance, let's count the occurrences of one string ("corpse") within another string (poeString):

In [5]:
print("Occurrences of 'corpse':", poeString.count("corpse"))
Occurrences of 'corpse': 65

Clearly Poe likes talking about corpses. Is the count() function case-sensitive? Does it match full words or just strings within strings? Let's see:

In [6]:
print("Occurrences of 'corpse':", poeString.count("corpse"))
print("Occurrences of 'corps':", poeString.count("corps"))
print("Occurrences of 'Corpse':", poeString.count("Corpse"))
Occurrences of 'corpse': 65
Occurrences of 'corps': 65
Occurrences of 'Corpse': 0

Apparently count() is case-sensitive and matches substrings rather than whole words: a search for corps also finds every corpse.

So what if we wanted to be sure to count all occurrences of corpse regardless of case? One solution would be to convert our string to the case we want to use. We'd probably do this with lower(), but for the sake of demonstration, let's do it with upper():

In [7]:
print("Occurrences of 'CORPSE':", poeString.upper().count("CORPSE"))
Occurrences of 'CORPSE': 65

This again demonstrates function chaining: poeString.upper() returns a new string and the new string has an available count() function. It's important to realize that poeString.upper() doesn't modify the variable poeString; it returns a new copy of the string.

We converted our poeString to uppercase characters since corpse (lowercase) isn't the same as Corpse (capitalized), though in this case it doesn't make any difference.
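If we wanted whole words only, one option (not used elsewhere in this notebook) is the standard re module, where \b marks a word boundary and re.IGNORECASE makes the match case-insensitive. The sample sentence below is invented so that the counts are easy to check by hand:

```python
import re

sample = "The corpse lay still. Corpses do not move, and neither did the army corps."

# case-sensitive substring matching, as count() does it
print(sample.count("corpse"))
print(sample.count("corps"))  # also matches the start of "corpse"
# case-insensitive substring matching by changing the case first
print(sample.lower().count("corpse"))
# case-insensitive whole-word matching with a regular expression
wholeWords = re.findall(r"\bcorpse\b", sample, flags=re.IGNORECASE)
print(len(wholeWords))
```

Here the regular expression skips "Corpses" because the trailing "s" means there is no word boundary right after "corpse".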

What if we wanted to find the index of the first occurrence of corpse and show the surrounding text?

In [8]:
firstCorpse = poeString.find("corpse") # the index position of the first occurrence of "corpse"
context = 30 # number of characters to show on either side of the index position
print(poeString[firstCorpse-context : firstCorpse+context])
and (horrible to relate!) the corpse of the daughter, head
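The same idea can be wrapped in a small function that also covers two edge cases: find() returns -1 when the term is absent, and a negative start index would silently slice from the end of the string. This is a sketch only; show_context is a hypothetical name, and unlike the cell above it counts the right-hand context from the end of the term rather than from its start.

```python
def show_context(text, term, width=30):
    """Return the first occurrence of term with width characters on each side."""
    position = text.find(term)
    if position == -1:
        # find() returns -1 when the term is absent
        return None
    start = max(0, position - width)  # avoid a negative start index
    end = position + len(term) + width
    return text[start:end]

sample = "and (horrible to relate!) the corpse of the daughter, head downward"
print(show_context(sample, "corpse", 10))
print(show_context(sample, "raven", 10))
```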

Extracting Text

Our Poe text is actually a volume of multiple texts. What if we wanted to isolate only one of the texts, such as "The Gold-Bug"?

To isolate "The Gold-Bug" in our Poe text, we might do something like the following (sometimes planning a program in natural language, rather than in computer code, can be useful):

  1. Find the index position of the start of the story, i.e. "THE GOLD-BUG"
  2. Find the index position of the end of the story, or the start of the next story, i.e. "FOUR BEASTS IN ONE"
  3. Create a new string from the index position of the start of the story (from step 1) to the index position of the end of the story (from step 2)

We know how to do the first two steps with find(), and we've already seen a variant of the third step when we asked for the first few characters of the full Poe text. Let's put the three steps together:

In [9]:
start = poeString.find("THE GOLD-BUG")
end = poeString.find("FOUR BEASTS IN ONE")
goldBugString = poeString[start:end].strip()
# show start and end of goldBugString
print(goldBugString[:50], "[…] ", goldBugString[-50:])
THE GOLD-BUG

          What ho! what ho! this f […]  it; perhaps it
required a dozen--who shall tell?”
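The same three steps can be tried in a simplified form on a short string, for instance isolating "Gutenberg's" from the opening words of the file. The helper below is a sketch with a hypothetical name (extract_between); it adds checks for a missing marker, because a find() result of -1 would otherwise produce a misleading slice.

```python
def extract_between(text, startMarker, endMarker):
    """Return the text from startMarker up to (not including) endMarker."""
    start = text.find(startMarker)
    if start == -1:
        return None  # start marker not found
    # look for the end marker only after the start marker
    end = text.find(endMarker, start + len(startMarker))
    if end == -1:
        return None  # end marker not found
    return text[start:end].strip()

sample = "Project Gutenberg's The Works of Edgar Allan Poe"
print(extract_between(sample, "Gutenberg's", "The"))
print(extract_between(sample, "Gutenberg's", "Raven"))
```

Searching for the end marker only after the start marker also guards against an end marker that happens to appear earlier in the text.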

Accessing Local Plain Texts

Code that relies on URL content is convenient, though not nearly as robust as code that reads content already downloaded and stored locally: content can change or disappear from the web, and maybe you want to work on your notebook in a remote location or on an airplane without internet connectivity. Moreover, accessing content from your local machine is typically much faster than interacting with web-based content.

What we'll do in the next section is the following:

  1. create a local directory for data (if necessary)
  2. open a new file and write our goldBugString to the file
  3. (re)open the file and read from it

Let's begin by creating a new subdirectory (relative to the current notebook directory), using the os module.

In [10]:
import os
directory = "data"
if not os.path.exists(directory):
    os.makedirs(directory)

This demonstrates a conditional structure in Python where we test for a boolean value (true or false) of whether or not the directory exists.

Python uses a colon and indentation to indicate the parts of the conditional block. If we want to execute a block when a condition evaluates to true (like 1 < 5, one is smaller than five):

if _condition_:
    _block_

Or if a condition is not true (like 1 > 5, one is not greater than five):

if not _condition_:
    _block_

If the data directory doesn't exist, we create it using os.makedirs().
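As an aside, os.makedirs() accepts an exist_ok argument that folds the existence test into the call itself. The sketch below works inside a temporary directory (an assumption made only so that it leaves nothing behind); in the notebook the directory would simply be "data".

```python
import os
import tempfile

# work inside a throwaway directory so the sketch leaves nothing behind
with tempfile.TemporaryDirectory() as scratch:
    directory = os.path.join(scratch, "data")
    os.makedirs(directory, exist_ok=True)  # creates the directory
    os.makedirs(directory, exist_ok=True)  # already exists: no error raised
    created = os.path.isdir(directory)

print("directory was created:", created)
```

The explicit if not os.path.exists(...) version is still useful here because it introduces the conditional structure.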

Now that we have a data directory, we need to open a new file in write ("w") mode and write out the string contents of goldBugString. The with block syntax we present here takes care of closing the file we've opened once we're done with it (once we're out of the indented block).

In [11]:
with open("data/goldBug.txt", "w") as f:
    f.write(goldBugString)

The open() function returns a file object (that we've named f) to which we can write contents. An alternative, by the way, to reading from a URL to a string and then writing the string to a file is to use the urlretrieve() function from urllib.request, though our method should work just fine as well.

Assuming things did work out, we can now turn around and open the file in read mode ("r" instead of "w") and read the contents into a new variable that we'll call goldBugString2; the with block again closes the file for us.

In [12]:
with open("data/goldBug.txt", "r") as f:
    goldBugString2 = f.read()

Let's have a peek at the contents in our goldBugString2 variable (read directly from a file), the same way we did before.

In [13]:
print(goldBugString2[:50], "[…] ", goldBugString2[-50:])
THE GOLD-BUG

          What ho! what ho! this fel […]  pit; perhaps it
required a dozen--who shall tell?”

Looks good!

In fact, as a digression, it's not quite the same string: the original uses Windows-style line endings (a carriage return followed by a line feed, \r\n), and these were converted to plain line feeds (\n) during the file writing and reading process.

In [14]:
goldBugString == goldBugString2 # are these two strings the same?
Out[14]:
False

Notice here that we're using the equality operator with two equals signs (==); with a single equals sign (=) we'd be making an assignment, the same way we do when assigning a value to a variable.
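The line-ending difference can be reproduced on a tiny string. This is a sketch rather than what the cells above do: it uses a temporary file and passes newline="" and encoding="utf-8" to open() explicitly (assumptions made so the result is the same on every operating system). In default text mode, Python converts \r\n to \n while reading; newline="" switches that conversion off.

```python
import os
import tempfile

original = "THE GOLD-BUG\r\n\r\nWhat ho!"  # Windows-style line endings

with tempfile.TemporaryDirectory() as scratch:
    path = os.path.join(scratch, "sample.txt")
    # newline="" writes the string exactly as it is
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(original)
    # default text mode converts \r\n to \n while reading
    with open(path, "r", encoding="utf-8") as f:
        translated = f.read()
    # newline="" turns that conversion off
    with open(path, "r", encoding="utf-8", newline="") as f:
        untouched = f.read()

print(original == translated)
print(original == untouched)
print(len(original), len(translated))
```

The translated string is shorter by one character for every line break, which is also why a fixed-length slice of the two Gold-Bug strings shows slightly different text.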

Listing Files in a Local Directory

As with many things in programming languages like Python, there's more than one way of listing files in a directory. We're going to introduce a way here that also introduces a loop: a process that is repeated once for each element in a list, or for as long as a condition is true. We'll go a bit quickly here, but we'll come back to these concepts again soon.

But first let's start with the glob() function that allows us to list the files in a directory.

In [15]:
import glob
textFiles = glob.glob("data/*.txt")
textFiles
Out[15]:
['data/goldBug.txt']

The results are shown as a list (delimited by the square brackets), with each element inside separated by a comma (here we only have one element because we only have one file so far).

We can ask what kind of object our textFiles variable contains.

In [16]:
type(textFiles)
Out[16]:
list

Lists are a type of variable that lend themselves to loops or to iterating over each element. For instance, to show each filename with the number of characters, we could do something like this:

In [17]:
totalCharacters = 0
for textFile in textFiles:
    with open(textFile, "r") as f:
        textString = f.read()
    chars = len(textString)
    print(textFile, "has", chars, "characters")
    totalCharacters += chars
print("total characters: ", totalCharacters)
data/goldBug.txt has 76459 characters
total characters:  76459

The code above is of the general form

for _item_ in _list_:
    _block_

In other words, for each item in our textFiles list, we execute the block where textFile is the local variable holding the item in the list. Just as with the conditionals, the colon and indentation indicate what the loop condition is (as long as more elements exist in the list) and what block to execute for each iteration.

In the code above we're also calculating the total number of characters (tracking them in a variable that we've called totalCharacters). Each time we iterate over the list of files, we add the number of characters in the current file.

totalCharacters += chars

The += operator is a compact way to add a value to an existing variable. It's the equivalent of this:

totalCharacters = totalCharacters + chars

Finally, we're using the print() function here because it's a simple way of combining a string ("total characters: ") and a number (totalCharacters) – in Python you can't simply concatenate a string and a number.
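To see what happens when a string and a number are joined with +, and two common ways around it, here is a small sketch that hard-codes the total from above so that it runs on its own:

```python
totalCharacters = 76459

try:
    message = "total characters: " + totalCharacters
except TypeError as error:
    # the + operator refuses to join a str and an int
    message = "TypeError: " + str(error)
print(message)

# two ways around it: convert explicitly, or let an f-string do the conversion
print("total characters: " + str(totalCharacters))
print(f"total characters: {totalCharacters}")
```

Passing several arguments to print(), as the loop above does, avoids the conversion altogether, at the cost of an automatic space between the arguments.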

Next Steps

Here are some tasks to try:

  • How would you create a subdirectory called Austen under the data directory we've already created?
  • For each of the plain text novels in English of Jane Austen in Project Gutenberg:
    • How would you isolate the text content (without the Project Gutenberg header and footer)?
    • How would you save the text-only content into the data/Austen directory?
  • How would you loop over the files in the data/Austen directory and for each one print the file name and a count of "his" and "her"?
  • What is the total number of characters in the Austen corpus?

In the next notebook (Getting NLTK) we're going to introduce the Natural Language Toolkit that provides a huge number of useful functions for text analysis.


CC BY-SA From The Art of Literary Text Analysis by Stéfan Sinclair & Geoffrey Rockwell. Edited and revised by Melissa Mony.
Created January 12, 2015 and last modified February 7, 2019 (Jupyter 5.0.0)