This notebook is focused on an essential component of digital text analysis: preparing a corpus of texts. It's part of the Art of Literary Text Analysis and assumes that you've already worked through Getting Setup and Getting Started. In this notebook we'll look at:
Note that we're especially interested here in working with plain texts; in later notebooks we'll deal with other formats.
Let's create a new notebook and set its name (by clicking on the Untitled label above the toolbar) to GettingTexts.
Now we'll change the editing mode of the first cell to Markdown and copy and paste in the following Markdown-encoded text:
# Getting Texts
We are first going to experiment with loading a plain text into memory from Project Gutenberg (http://gutenberg.org), an online library with tens of thousands of free texts in different languages and formats. We can Google something like
python3 read from url
to discover pages like https://docs.python.org/3/howto/urllib2.html that explain the basics of reading content.
Hit Shift-Enter to evaluate/format the Markdown cell and create a new code cell.
The Markdown cell explains the essentials of what we want to do: fetch the contents of a plain text document from a URL, assign it to a string variable, and demonstrate various basic operations we can perform on a string. For this program we'll choose the Works of Edgar Allan Poe, Volume 1, available at http://www.gutenberg.org/files/2147/2147-0.txt (you're encouraged to visit this link before continuing).
First, let's fetch our document using urllib.request (not all the code below will be explained in detail now, but we'll come back to it).
import urllib.request
poeUrl = "http://www.gutenberg.org/files/2147/2147-0.txt"
poeString = urllib.request.urlopen(poeUrl).read().decode().strip()
print("This string has", len(poeString), "characters")
This string has 550332 characters
Most of the principles involved have already been covered in the Getting Started notebook, such as importing a module (urllib.request instead of the time module) and assigning values to variables (poeUrl and poeString).
In this case urlopen() is the function name, called with an argument that contains our Poe URL; it returns an HTTP response object on which we can invoke read() to get the bytes data at our URL. Next, we call decode() to convert the bytes data to a proper string (decode() assumes UTF-8 by default). Finally, we call strip() to remove any leading and trailing whitespace.
Many things can go wrong during networking calls, but if all goes well, we should now have a variable (poeString) containing a string with the same contents as at our URL.
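Since networking calls can fail, here is a minimal sketch of one way to wrap the fetch so that a failure is reported instead of stopping the notebook. The function name fetch_text and its parameters are our own invention for illustration, not part of urllib.

```python
import urllib.error
import urllib.request


def fetch_text(url, encoding="utf-8", timeout=30):
    """Return the decoded, stripped text at url, or None if the request fails."""
    try:
        # the with block closes the response once we've read it
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode(encoding).strip()
    except urllib.error.URLError as error:
        # URLError also covers HTTPError (e.g. a 404 or a temporary block)
        print("Could not fetch", url, "because:", error.reason)
        return None


# usage in the notebook (not run here, to avoid an extra request):
# poeString = fetch_text(poeUrl)
```

Returning None is only one possible choice; it lets a later cell test whether the fetch worked before going any further.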
Fetching the contents of a URL is a relatively "expensive" operation (in code-speak this means that it's more computationally or time intensive), so we want to isolate that in its own Jupyter cell so that we don't have to run it more times than necessary. If we want to explore various aspects of the poeString string that we fetched, we should do that in a separate cell so that we're not re-fetching the string each time.
There's an additional motivation for not repeating the fetching operation: Project Gutenberg (and some other sites) monitor how many requests are made from your IP address, and they can temporarily cut you off if they detect what they consider to be too many requests (waiting a while will usually lift this restriction). Multiple requests in a short period of time can also be a problem for shared IP addresses, like in a classroom setting.
One of the essential concepts of Jupyter (and the underlying IPython) is that once code is executed, any variables remain accessible in memory for subsequent cells that are executed. This is the job of the kernel, which interprets and executes code and keeps things in memory ("Kernel" is one of the menus in the Jupyter menu bar). Typically we execute cells as we proceed through a notebook (see the options under the Cell menu).
We've already seen above how to show the length of a string using the len() function. The length is a number but it can be a bit difficult to read because there is no thousands separator. Let's improve the output by searching how to format a number with a thousands separator, which leads in particular to this suggestion for using the format specification mini-language.
poeStringLen = len(poeString)
poeStringLenFormatted = "{:,}".format(poeStringLen) # format mini-language
print("This string has", poeStringLenFormatted, "characters")
This string has 550,332 characters
This suggests that there are 550,332 characters (because we're in Python 3.x we should be dealing with Unicode, and so this should be a true count of the characters, not just the bytes since some characters require multiple bytes).
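To make the difference between characters and bytes concrete, here is a small sketch using a short sample string of our own (not the full Poe text). It contains the curly apostrophe found in the Gutenberg header, which takes three bytes in UTF-8.

```python
sample = "Gutenberg’s"  # contains a curly apostrophe (U+2019)
characterCount = len(sample)             # counts Unicode characters
byteCount = len(sample.encode("utf-8"))  # counts bytes once encoded as UTF-8
print(characterCount, "characters,", byteCount, "bytes")
```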
We've shown a longer form of the formatting code (with the intermediate variables poeStringLen and poeStringLenFormatted), but we can also nest function calls, so that the arguments of one function contain calls to other functions; this works as well:
print("This string has", "{:,}".format(len(poeString)), "characters")
This version is more succinct, but it can be more difficult to read (and to <a href="Glossary.ipynb#Debug" title="The process of identifying and removing errors from a program.">debug</a> or resolve if there's a problem). Programming is about choices!
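In the spirit of choices, here is a sketch of three ways of producing the same thousands separator. The length is hard-coded from the output above so that the cell runs without fetching anything; the variable names are ours.

```python
poeStringLen = 550332  # the length reported above, hard-coded for this sketch
withFormatMethod = "{:,}".format(poeStringLen)  # the str.format() method used above
withFormatFunction = format(poeStringLen, ",")  # the built-in format() function
withFString = f"{poeStringLen:,}"               # an f-string (Python 3.6 and later)
print(withFormatMethod, withFormatFunction, withFString)
```

All three rely on the same format specification mini-language, so which one to use is mostly a matter of readability.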
### Working with Parts of a String
A <a href="Glossary.ipynb#String" title="A container for data of letters, numbers or symbols.">string</a> is a sequence of characters and Python has a powerful way of working with <a href="Glossary.ipynb#Sequence" title="An ordered set of Lists, Tuples or Strings.">sequences</a>. For instance, we can do this to get the first 25 characters of our poeString:
poeString[:25]
'\ufeffProject Gutenberg’s The '
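Our string is a sequence where each character has an index position. Python, like many languages, starts its indexing at 0, so we get something like this, where there are 25 characters (including the "P" in index 0). The table leaves aside the invisible byte-order mark (\ufeff) that the output above shows at the very start of the downloaded string; with it included, every index shifts up by one: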
P | r | o | j | e | c | t |   | G | u | t | e | n | b | e | r | g | ’ | s |   | T | h | e |   | W | … |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | … |
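The '\ufeff' at the start of the output above is a byte-order mark that some UTF-8 files begin with. As a hedged aside, here is a sketch of how Python's "utf-8-sig" codec drops that mark while decoding; the bytes are a short sample of our own rather than the downloaded file.

```python
rawBytes = b"\xef\xbb\xbfProject Gutenberg"  # UTF-8 bytes that start with a byte-order mark
withMark = rawBytes.decode("utf-8")          # keeps the mark as the character "\ufeff"
withoutMark = rawBytes.decode("utf-8-sig")   # drops the mark while decoding
print(repr(withMark[0]), repr(withoutMark[0]))
print(len(withMark), len(withoutMark))
```

The rest of this notebook keeps the plain decode() call, so its outputs still include the mark.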
Let's see a few more examples of working with string sequences:
print("First character:", poeString[0])
print("Last character:", poeString[-1])
print("First 25 characters:", poeString[:25])
print("Last 25 characters:", poeString[-25:])
print("Characters 8 to 29:", poeString[8:30])
First character: 
Last character: .
First 25 characters: Project Gutenberg’s The 
Last 25 characters: to hear about new eBooks.
Characters 8 to 29:  Gutenberg’s The Works
Working with character sequences like this is an essential aspect of text analysis and it's well worth becoming familiar with this syntax.
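As a quick recap of the slicing syntax, here is a sketch on a short sample string of our own, which is easier to check by eye than the full text. The main thing to notice is that the end index of a slice is excluded.

```python
sample = "Project Gutenberg"
firstWord = sample[:7]     # indexes 0 to 6; the end index 7 is excluded
secondWord = sample[8:17]  # indexes 8 to 16
sameWord = sample[-9:]     # negative indexes count back from the end
print(firstWord, "|", secondWord, "|", sameWord)
print(len(sample[8:17]) == 17 - 8)  # a slice has (end - start) characters
```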
What else can we do with a string? A good place to look is the string methods documentation.
For instance, let's count the occurrences of one sequence (corpse) within another string (poeString):
print("Occurrences of 'corpse':", poeString.count("corpse"))
Occurrences of 'corpse': 65
print("Occurrences of 'corpse':", poeString.count("corpse"))
print("Occurrences of 'corps':", poeString.count("corps"))
print("Occurrences of 'Corpse':", poeString.count("Corpse"))
Occurrences of 'corpse': 65
Occurrences of 'corps': 65
Occurrences of 'Corpse': 0
Apparently count() is case-sensitive and is matching strings (any run of characters, even inside a longer word), not words.
So what if we wanted to be sure to count all occurrences of corpse regardless of case? One solution would be to convert our string to the case we want to use. We'd probably do this with lower(), but for the sake of demonstration, let's do it with upper():
print("Occurrences of 'CORPSE':", poeString.upper().count("CORPSE"))
Occurrences of 'CORPSE': 65
This again demonstrates function chaining: poeString.upper() returns a new string and the new string has an available count() function. It's important to realize that poeString.upper() doesn't modify the variable poeString; it returns a new copy of the string.
We converted our poeString to uppercase characters since corpse (lowercase) isn't the same as Corpse (capitalized), though in this case it doesn't make any difference.
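Since count() matches character sequences rather than words, one possible way of counting whole words is a regular expression with word boundaries. This is a sketch using Python's re module on a sample sentence of our own, not a technique this notebook otherwise relies on.

```python
import re

sample = "The corpse lay near the army corps; another Corpse, then more corpses."
substringCount = sample.count("corps")  # counts "corps" inside longer words too
wholeWordCount = len(re.findall(r"\bcorps\b", sample))  # \b marks a word boundary
anyCaseCount = len(re.findall(r"\bcorpse\b", sample, flags=re.IGNORECASE))
print(substringCount, wholeWordCount, anyCaseCount)
```

The re.IGNORECASE flag plays the same role as converting the string with lower() or upper() before counting.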
What if we wanted to find the index of the first occurrence of corpse and show the surrounding text?
firstCorpse = poeString.find("corpse") # the index position of the first occurrence of "corpse"
context = 30 # number of characters to show on either side of the index position
print(poeString[firstCorpse-context : firstCorpse+context])
and (horrible to relate!) the corpse of the daughter, head
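Two details are worth keeping in mind with this approach: find() returns -1 when the term doesn't occur, and a negative start index would count back from the end of the string. Here is a minimal sketch of a helper that guards against both; the name showContext is hypothetical, and unlike the cell above it measures the right-hand context from the end of the term.

```python
def showContext(text, term, context=30):
    """Return the text around the first occurrence of term, or None if absent."""
    position = text.find(term)
    if position == -1:  # find() returns -1 when the term doesn't occur
        return None
    start = max(0, position - context)  # a negative start would count from the end
    return text[start : position + len(term) + context]


print(showContext("the corpse of the daughter", "corpse", 4))
print(showContext("the corpse of the daughter", "raven", 4))
```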
Our Poe text is actually a volume of multiple texts. What if we wanted to isolate only one of the texts, such as "The Gold-Bug"?
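To isolate "The Gold-Bug" in our Poe text, we might do something like the following (sometimes planning a program in natural language, rather than in computer code, can be useful):
1. find the index position of the start of "THE GOLD-BUG"
2. find the index position of the start of the next text ("FOUR BEASTS IN ONE")
3. extract the part of the string between those two index positions
We know how to do the first two steps with find(), and we've already seen a variant of the third step when we asked for the first few characters of the full Poe text. The same approach would work in a simplified form to isolate "Gutenberg’s" from the string "Project Gutenberg’s The". For the full text we use the two titles as markers.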
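The simplified form might look like the sketch below, which works on a short sample string of our own; the variable names sample and isolated are chosen just for illustration.

```python
sample = "Project Gutenberg’s The"
start = sample.find("Gutenberg")  # where the part we want begins
end = sample.find(" The")         # where the following part begins
isolated = sample[start:end]      # the characters in between
print(start, end, isolated)
```

The cell that follows applies the same steps to the full Poe text.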
start = poeString.find("THE GOLD-BUG")
end = poeString.find("FOUR BEASTS IN ONE")
goldBugString = poeString[start:end].strip()
# show start and end of goldBugString
print(goldBugString[:50], "[…] ", goldBugString[-50:])
THE GOLD-BUG What ho! what ho! this f […] it; perhaps it required a dozen--who shall tell?”
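The cell above assumes that both titles are found. As a hedged sketch, the same steps can be wrapped in a small function that fails loudly when a marker is missing; the name extractBetween is hypothetical. It uses index(), which behaves like find() but raises a ValueError instead of returning -1, and it searches for the end marker only after the start position.

```python
def extractBetween(text, startMarker, endMarker):
    """Return text from startMarker up to (not including) the next endMarker."""
    start = text.index(startMarker)     # index() raises ValueError if missing
    end = text.index(endMarker, start)  # only search after the start position
    return text[start:end].strip()


sample = "INTRO THE GOLD-BUG What ho! FOUR BEASTS IN ONE etc."
print(extractBetween(sample, "THE GOLD-BUG", "FOUR BEASTS IN ONE"))
```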
Code that relies on URL content is convenient, though not nearly as robust as content that's already been downloaded and stored locally: content can change or disappear from the web, and maybe you want to work on your notebook in a remote location or in an airplane without internet connectivity. Moreover, accessing content from your local machine is typically much faster than interacting with web-based content.
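What we'll do in the next section is create a data directory if it doesn't already exist, write our Gold-Bug string to a file in that directory, read the file back in, and list the files in the directory.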
Let's begin by creating a new subdirectory (relative to the current notebook directory), using the os module.
import os
directory = "data"
if not os.path.exists(directory):
    os.makedirs(directory)
This demonstrates a conditional structure in Python where we test for a boolean value (true or false) of whether or not the directory exists.
Python uses a colon and indentation to indicate the parts of the conditional block. If we want to execute a block when a condition evaluates to true (like 1 < 5, one is smaller than five):
if _condition_: _block_
Or if a condition is not true (like 1 > 5, one is not greater than five):
if *not* _condition_: _block_
If the data directory doesn't exist, we create it using makedirs().
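As an aside, makedirs() can also do the existence check itself through its exist_ok argument. The sketch below works in a temporary location (an assumption made so that it doesn't touch your notebook folder).

```python
import os
import tempfile

baseDirectory = tempfile.mkdtemp()  # a throwaway location for this sketch
directory = os.path.join(baseDirectory, "data")
os.makedirs(directory, exist_ok=True)  # creates the directory
os.makedirs(directory, exist_ok=True)  # no error when it already exists
print(os.path.isdir(directory))
```

The explicit if not test above does the same job and has the advantage of showing how a conditional works.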
Now that we have a data directory, we need to open a new file in write ("w") mode and write out the string contents of goldBugString. The with block syntax we present here takes care of closing the file we've opened once we're done with it (once we're out of the indented block).
with open("data/goldBug.txt", "w") as f:
    f.write(goldBugString)
The open() function returns a file object (that we've named f) to which we can write contents. An alternative, by the way, to reading from a URL to a string and then writing the string to a file is to use the urlretrieve function, though our method should work just fine as well.
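For the curious, here is a sketch of what the urlretrieve alternative might look like. To keep it runnable offline, a file:// URL pointing at a temporary file stands in for a web address; the file names are made up for the example. Note that the Python documentation describes urlretrieve as a legacy interface.

```python
import os
import pathlib
import tempfile
import urllib.request

workDirectory = pathlib.Path(tempfile.mkdtemp())
source = workDirectory / "source.txt"
source.write_text("THE GOLD-BUG", encoding="utf-8")

# a file:// URL stands in for a web address so the sketch runs offline
target = workDirectory / "copy.txt"
savedPath, headers = urllib.request.urlretrieve(source.as_uri(), str(target))
print(savedPath, os.path.getsize(savedPath), "bytes")
```

Unlike our method, this saves the whole document as it is, so isolating one text from it would still need the string operations shown earlier.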
Assuming things did work out, we can now turn around and open the file in read mode ("r" instead of "w") and read the contents into a new variable that we'll call goldBugString2; the with block again takes care of closing the file.
with open("data/goldBug.txt", "r") as f:
    goldBugString2 = f.read()
Let's have a peek at the contents in our goldBugString2 variable (read directly from a file), the same way we did before.
print(goldBugString2[:50], "[…] ", goldBugString2[-50:])
THE GOLD-BUG What ho! what ho! this fel […] pit; perhaps it required a dozen--who shall tell?”
Looks good!
In fact, as a digression, it's not quite the same string, since the original uses Windows-style line endings (a carriage return followed by a linefeed) that were converted during the file writing and reading process.
goldBugString == goldBugString2 # are these two strings the same?
False
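To see this line-ending conversion in isolation, here is a small sketch using a two-line sample string and a temporary file (both are assumptions for the example). In text mode Python translates line endings by default; passing newline="" to open() switches that translation off.

```python
import os
import tempfile

original = "What ho!\r\nwhat ho!"  # Windows-style line ending, as in the download
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write(original)
with open(path, "r", encoding="utf-8") as f:
    translated = f.read()  # default text mode rewrites line endings as "\n"
print(original == translated)

with open(path, "w", encoding="utf-8", newline="") as f:
    f.write(original)
with open(path, "r", encoding="utf-8", newline="") as f:
    untouched = f.read()  # newline="" switches the translation off
print(original == untouched)
```

For most text analysis the difference doesn't matter, which is why the notebook simply carries on with the string read from the file.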
As with many things in programming languages like Python, there's more than one way of listing files in a directory. We're going to introduce a way here that also introduces a loop: a process that is repeated multiple times for each element in a list or for as long as a condition is true. We'll go a bit quickly here, but we'll come back to these concepts again soon.
But first let's start with the glob() function that allows us to list the files in a directory.
import glob
textFiles = glob.glob("data/*.txt")
textFiles
['data/goldBug.txt']
type(textFiles)
list
totalCharacters = 0
for textFile in textFiles:
    f = open(textFile, "r")
    textString = f.read()
    f.close()
    chars = len(textString)
    print(textFile, "has", chars, "characters")
    totalCharacters += chars
print("total characters: ", totalCharacters)
data/goldBug.txt has 76459 characters
total characters:  76459
The code above is of the general form
for _item_ in _list_: _block_
In other words, for each item in our textFiles list, we execute the block where textFile is the local variable holding the item in the list. Just as with the conditionals, the colon and indentation indicate what the loop condition is (as long as more elements exist in the list) and what block to execute for each iteration.
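The same loop can also be written with the with block we used earlier, which closes each file for us. This is a sketch under a few assumptions: it creates two small sample files in a temporary directory instead of using the data directory, and it sorts the file names because glob() doesn't promise any particular order.

```python
import glob
import os
import tempfile

directory = tempfile.mkdtemp()  # stand-in for the data directory
for name, contents in [("a.txt", "What ho!"), ("b.txt", "dozen")]:
    with open(os.path.join(directory, name), "w", encoding="utf-8") as f:
        f.write(contents)

totalCharacters = 0
for textFile in sorted(glob.glob(os.path.join(directory, "*.txt"))):
    with open(textFile, "r", encoding="utf-8") as f:  # closed automatically
        chars = len(f.read())
    print(os.path.basename(textFile), "has", chars, "characters")
    totalCharacters += chars
print("total characters:", totalCharacters)
```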
In the code above we're also calculating the total number of characters (tracking them in a variable that we've called totalCharacters). Each time we iterate over the list of files, we add the number of characters in the current file.
totalCharacters += chars
The += operator is a compact way to add a value to an existing variable. It's the equivalent of this:
totalCharacters = totalCharacters + chars
Finally, we're using the print() function here because it's a simple way of combining a string ("total characters: ") and a number (totalCharacters); in Python you can't simply concatenate a string and a number.
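Here is a short sketch of what happens if we do try to concatenate a string and a number with +, along with two common ways around it. The total is hard-coded from the output above, and the variable names are ours.

```python
totalCharacters = 76459  # the total reported above, hard-coded for this sketch
try:
    message = "total characters: " + totalCharacters  # str + int is not allowed
except TypeError as error:
    message = None
    print("TypeError:", error)

converted = "total characters: " + str(totalCharacters)  # convert explicitly
formatted = f"total characters: {totalCharacters:,}"     # or format in place
print(converted)
print(formatted)
```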
Here are some tasks to try:
* Can you create a subdirectory named Austen under the data directory we've already created?
* Can you fetch some texts by Jane Austen and save them in the data/Austen directory?
* Can you loop over the files in the data/Austen directory and for each one print the file name and a count of "his" and "her"?

In the next notebook (Getting NLTK) we're going to introduce the Natural Language Toolkit that provides a huge number of useful functions for text analysis.
CC BY-SA From The Art of Literary Text Analysis by Stéfan Sinclair & Geoffrey Rockwell. Edited and revised by Melissa Mony.
Created January 12, 2015 and last modified February 7, 2019 (Jupyter 5.0.0)