Chapter 3: Text Analysis

-- A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel


In this chapter we will introduce you to the task of text analysis in Python. You will learn how to read an entire corpus into Python, how to clean it, and how to perform certain data analyses on those texts. We will also briefly introduce Python's plotting library matplotlib, with which you can visualize your data.

Before we delve into the main subject of this chapter, text analysis, we will first write a couple of utility functions that build upon the things you learnt in the previous chapter. Often we don't work with a single text file stored on our computer, but with multiple text files or entire corpora. We would like to have a way to load a corpus into Python.

Remember how to read files? Each time we had to open a file, read the contents and then close the file. Since this is a series of steps we will often need to do, we can write a single function that does all that for us. We write a small utility function read_file(filename) that reads the specified file and simply returns all contents as a single string.

In [ ]:
def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    infile = open(filename) # windows users should use codecs.open after importing codecs
    contents = infile.read()
    infile.close()
    return contents

Now, instead of having to open a file, read the contents and close the file, we can just call the function read_file to do all that:

In [ ]:
text = read_file("data/austen-emma-excerpt.txt")
print(text)

In the directory data/gutenberg/training we have a corpus consisting of multiple files with the extension .txt. This corpus is a collection of English novels which we downloaded for you from the Gutenberg project. We want to iterate over all these files. You can do this using the listdir function from the os module. We import this function as follows:

In [ ]:
from os import listdir

After that, the listdir function is available to use. This function takes as argument the path to a directory and returns a list of all the files and subdirectories present in that directory:

In [ ]:
listdir("data")

Notice that listdir returns a list and we can iterate over that list. Now, consider the following function:

In [ ]:
def list_textfiles(directory):
    "Return a list of filenames ending in '.txt' in DIRECTORY."
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(directory + "/" + filename)
    return textfiles

The function listdir takes as argument the name of a directory and lists all filenames in that directory. We iterate over this list and append each filename that ends with the extension .txt to a new list of text files. Using the list_textfiles function, the following code reads all text files in the directory data/gutenberg/training and outputs the length (in characters) of each:

In [ ]:
for filepath in list_textfiles("data/gutenberg/training"):
    text = read_file(filepath)
    print(filepath + " has " + str(len(text)) + " characters.")
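As an aside: joining paths with "/" works on most systems, but the os.path.join function builds paths with the separator appropriate for your operating system. A variant of list_textfiles using it could look like this (just a sketch; the version above works fine for this course):

```python
import os

def list_textfiles(directory):
    "Return a list of filenames ending in '.txt' in DIRECTORY."
    textfiles = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            # os.path.join inserts the separator appropriate for your OS
            textfiles.append(os.path.join(directory, filename))
    return textfiles
```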

Sentence tokenization

In the previous chapter we wrote a function to tokenize or split a text string into a list of words. However, using this function we lose information about where sentences begin and end in the text. We will develop a function split_sentences that performs some very simple sentence splitting when passed a text string. Each sentence will be represented as a new string, so the function as a whole returns a list of sentence strings. We assume that any occurrence of ., ! or ? marks the end of a sentence. In reality this is more ambiguous, of course: consider for example the use of the period as an end-of-sentence marker as well as in abbreviations and initials!

How should we tackle this problem? Have a look at the following picture:

[figure: a two-sentence text string annotated with the character index of each position]

The first sentence Hello there! spans from index 0 to index 11. The second sentence spans from 13 to 26. If we come up with a way to extract those indexes, we can slice the text into separate sentences. First we define a utility function end_of_sentence_marker that takes as argument a character and returns True if it is an end-of-sentence marker, and False otherwise.
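To see how such index pairs translate into slices, here is a small sketch with a string of our own (note that slice end points are non-inclusive, so the slice runs up to one past the last character):

```python
text = "Hello there! How are you?"

# "Hello there!" occupies indexes 0 through 11,
# so the slice runs up to (but not including) 12
print(text[0:12])   # Hello there!

# the second sentence starts after the space, at index 13
print(text[13:25])  # How are you?
```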


Quiz!

Write the function end_of_sentence_marker below:

In [ ]:
def end_of_sentence_marker(character):
    # insert your code here

# these tests should return True if your code is correct
print(end_of_sentence_marker("?") == True)
print(end_of_sentence_marker("a") == False)

An important function we will use is the built-in enumerate. enumerate takes as argument any iterable (a string, a list, etc.). Let's see it in action:

In [ ]:
for element in enumerate("Python"):
    print(element)

As you can see, enumerate allows you to iterate over an iterable and for each element in that iterable, it gives you its corresponding index. A slightly more convenient way of iterating over enumerate is the following:

In [ ]:
for index, character in enumerate("Python"):
    print(index, character)

This way, we have easy access to both the index and the original item in the iterable. Now we know enough to write our split_sentences function. We will walk you through it, step by step, but first try to read the function and think about what it possibly does at each step:

In [ ]:
def split_sentences(text):
    "Split a text string into a list of sentences."
    sentences = []
    start = 0
    for end, character in enumerate(text):
        if end_of_sentence_marker(character):
            sentence = text[start: end + 1]
            sentences.append(sentence)
            start = end + 1
    return sentences

The function split_sentences takes as argument a text represented by a simple string. Within the function we define a variable sentences in which we will store the individual sentences. We need to extract both the start position and the end position of each sentence. We know that the first sentence will always start at position 0. Therefore we define a variable start and set it to zero.

Next we will use enumerate to loop over all individual characters in the text. Remember that enumerate returns pairs of indexes and their corresponding elements (here characters). For each character we check whether it is an end-of-sentence marker. If it is, the variable end marks the position in text where a sentence ends. We can now slice the text from the starting position to the end position and obtain our sentence. Notice that we add 1 to the end position. Why would that be? This is because, as you might remember from the first chapter, slices are non-inclusive, so text[start:end] would return the text starting at start and ending one position before end. Since we have reached the end of a sentence, we know that the next sentence will start at least one position later than our last end point. Therefore, we update the start variable to end + 1. Let's check whether our function works as promised:

In [ ]:
print(split_sentences("This is a sentence. Should we separate it from this one?"))

It does!


Quiz!

To conclude this section, you will write a wrapper function tokenize that takes as input a text represented by a string and tokenizes this string into sentences. After that, we clean each sentence by lowercasing all words and removing punctuation. The final step is to tokenize each sentence into a list of words. The file preprocessing.py in the pyhum package contains a function called clean_text which removes all punctuation from a text and turns all characters to lowercase. We import that function using the following line:

In [ ]:
from pyhum.preprocessing import clean_text
In [ ]:
def tokenize(text):
    """Transform TEXT into a list of sentences. Lowercase 
    each sentence and remove all punctuation. Finally split each
    sentence into a list of words."""
    # insert your code here

# these tests should return True if your code is correct
print(tokenize("This is a sentence. So, what!") == 
      [["this", "is", "a", "sentence"], ["so", "what"]])

General Text Statistics

When the next night came, Dinarazad said to her sister Shahrazad: ‘In God’s name, sister, if you are not asleep, then tell us one of your stories!’ Shahrazad answered: ‘With great pleasure! I have heard tell, honoured King, that…’

Alf Laylah Wa Laylah, the Stories of One Thousand and One Nights, is a collection of folk tales, gathered over many centuries by various authors, translators, and scholars across West, Central and South Asia and North Africa. It forms a huge narrative wheel with an overarching plot, created by the frame story of Shahrazad.

The stories begin with the tale of king Shahryar and his brother who, having both been deceived by their respective Sultanas, leave their kingdom, only to return when they have found someone who — in their view — was wronged even more. On their journey the two brothers encounter a huge jinn who carries a glass box containing a beautiful young woman. The two brothers hide as quickly as they can in a tree. The jinn lays his head on the girl’s lap and as soon as he is asleep, the girl demands that the two kings make love to her, or else she will wake her ‘husband’. They reluctantly give in, and the brothers soon discover that the girl has already betrayed the jinn ninety-eight times before. This exemplar of lust and treachery strengthens the Sultan’s opinion that all women are wicked and not to be trusted.

When king Shahryar returns home, his wrath against women has grown to an unprecedented level. To temper his anger, each night the king sleeps with a virgin, only to execute her the next morning. In order to put an end to this cruelty and save womanhood from a "virgin scarcity", Shahrazad offers herself as the king’s next bride. On the first night, Shahrazad begins to tell the king a story, but she does not end it. The king’s curiosity to know how the story ends prevents him from executing Shahrazad. The next night Shahrazad finishes her story, and begins a new one. The king, eager to know the ending of this tale as well, postpones her execution once more. Using this strategy for one thousand and one nights in a labyrinth of stories-within-stories-within-stories, Shahrazad attempts to gradually move the king’s cynical stance against women towards a politics of love and justice (see Marina Warner’s Stranger Magic (2013) in case you're interested).

The first European version of the Nights was translated into French by Antoine Galland. Many translations (in different languages) followed, such as the (heavily criticized) English translation by Sir Richard Francis Burton entitled The Book of the Thousand Nights and a Night (1885). This version is freely available from the Gutenberg project and is the one we will explore here.


Quiz!

In the directory data/arabian_nights you will find 999 files. This is because in Burton's translation some nights are missing. The name of each file represents the corresponding night of storytelling in Alf Laylah Wa Laylah. Go have a look. Use the tokenize function and the corpus-reading function we defined above to tokenize and clean each night. Store the result in the variable named corpus.

In [ ]:
# insert your code here

Great job! You should now have a corpus containing 999 texts. It is always important to check whether our code actually produces the desired results. Let's check whether we indeed have 999 texts:

In [ ]:
print(len(corpus))

OK, that seems to be correct. It would be convenient for further processing to have the corpus in chronological order. Let's have a look at the first 20 files returned by list_textfiles:

In [ ]:
list_textfiles("data/arabian_nights")[:20]

As you can see, the files are sorted by their string name and not by their number. To be able to sort the files by their numbers, we must first remove the extension .txt as well as the directory data/arabian_nights/.


Quiz!

1) Write a function remove_ext that takes as argument a filename and returns it with its extension removed. Tip: use the function splitext from the os.path module; look up its documentation.

In [ ]:
from os.path import splitext

def remove_ext(filename):
    # insert your code here
    
# these tests should return True if your code is correct
print(remove_ext("data/arabian_nights/1.txt") == "data/arabian_nights/1")
print(remove_ext("ridiculous_selfie.jpg") == "ridiculous_selfie")

2) Write a function remove_dir that takes as argument a filepath and removes the directory part from it. Tip: use the function basename from the os.path module; look up its documentation.

In [ ]:
from os.path import basename

def remove_dir(filepath):
    # insert your code here
    
# these tests should return True if your code is correct
print(remove_dir("data/arabian_nights/1.txt") == "1.txt")
print(remove_dir("/a/kind/of/funny/filepath/to/file.txt") == "file.txt")

3) Combine the two functions remove_ext and remove_dir into one function get_filename. This function takes as argument a filepath and returns the name of the file without its extension.

In [ ]:
def get_filename(filepath):
    # insert your code here
    
# these tests should return True if your code is correct
print(get_filename("data/arabian_nights/1.txt") == '1')

The final step is to convert numbers represented as strings (e.g. "1" and "10") into actual numbers. This can be achieved using the function int:

In [ ]:
x_as_string = "1"
x_as_int = int(x_as_string)
print(x_as_int)

The process of converting a string into an integer is called type casting. Strings and integers are different types. To see this, have a look at the following:

In [ ]:
x = "1"
y = "2"
print(x + y)

12? Yes, 12. This is because, as you might remember from the first chapter, we can use the + operator to concatenate two strings. If we apply the same operation to integers, as in:

In [ ]:
x = 1
y = 2
print(x + y)

we get the expected result of 3.
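Combining the two: if we cast the strings to integers first, the + operator performs numeric addition again:

```python
x = "1"
y = "2"
# cast both strings to integers before adding them
print(int(x) + int(y))  # 3
```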


Quiz!

Combine the functions int and get_filename into the function get_night to obtain the integer corresponding to a night.

In [ ]:
def get_night(filepath):
    # insert your code here

# these tests should return True if your code is correct
print(get_night("data/arabian_nights/1.txt") == 1)

OK, so now we can convert the filepaths to integers corresponding to the nights of storytelling. But how will we use that to sort the corpus in chronological order? In chapter 1 we briefly discussed how to sort your collection of good reads. In combination with our get_night function, we can use sort to obtain a nicely chronologically ordered list of stories. Prepare yourself for some real Python magic, because the following lines of code might be a little dazzling...

First we list all files using list_textfiles and store it in the variable filenames:

In [ ]:
filenames = list_textfiles('data/arabian_nights')

Next we call the method .sort() on this list and supply our function get_night as its key argument:

In [ ]:
filenames.sort(key=get_night)
filenames[:20]

As you can see, we now have a perfectly chronologically ordered list of filenames. But how, HOW! did that work? As you might have guessed, the argument to sort, key=get_night, has something to do with all this magic. Without this argument, Python would just sort the filenames alphabetically:

In [ ]:
filenames = list_textfiles('data/arabian_nights')
filenames.sort()
print(filenames[:20])

However, if we supply a function to key, Python will internally first apply that function to all items we want to sort. In our case this means Python converts all filepaths to integers. After that Python sorts the list. Then for each converted item it returns the corresponding item in the original list. (Technically this is not an accurate description, but it basically comes down to this.)
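You can see the same mechanism at work in a simpler, self-contained example: sorting a list of number strings with key=int:

```python
numbers = ["10", "2", "1", "21"]

# without a key, the strings are compared character by character
print(sorted(numbers))           # ['1', '10', '2', '21']

# with key=int, Python compares the integer versions instead
print(sorted(numbers, key=int))  # ['1', '2', '10', '21']
```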

If you still feel a little dizzy after all this, don't be afraid. Sometimes it is good enough to use a particular piece of code even if you don't completely understand it. We can now use these functions to reload the corpus, this time in chronological order:

In [ ]:
corpus = []
filenames = list_textfiles("data/arabian_nights")
filenames.sort(key=get_night)
for filename in filenames:
    text = read_file(filename)
    corpus.append(tokenize(text))

Exploratory data analysis

As a first exploratory data analysis, we are going to compute for each night how many sentences it contains and how many words. It is quite easy to count the number of sentences per night, since each night is represented by a list of sentences.

In [ ]:
sentences_per_night = []
for night in corpus:
    sentences_per_night.append(len(night))
print(sentences_per_night[:10])

Using the function max we can find out what the highest number of sentences is:

In [ ]:
max(sentences_per_night)

Similarly, if we would like to know what the lowest number of sentences is, we use the function min:

In [ ]:
min(sentences_per_night)

Quiz!

The function sum takes a list of numbers as input and returns the sum:

In [ ]:
print(sum([1, 3, 3, 4]))

Use this function to compute the average number of sentences per night. Note: if you use Python 2.7, you will need to convert the result of sum, which will be an integer, to a float, using float(some_number).

In [ ]:
# float converts an integer to a float; in Python 3.x, division of
# two integers already returns a float, so there you don't need to worry.
number = 1
print(number)
number = float(number)
print(number)
In [ ]:
# insert your code here

Given our data structure (a list of nights, each a list of sentences which are themselves lists of words), it is a little trickier to count for each night how many words it contains. One possible way is the following:

In [ ]:
words_per_night = []
for night in corpus:
    n_words = 0
    for sentence in night:
        n_words += len(sentence)
    words_per_night.append(n_words)

Make sure you really understand these lines of code as you will need them in the next quiz.
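As an aside, the inner loop can also be written with the sum function we saw earlier; the sketch below uses a tiny made-up corpus to show the more compact form:

```python
# a miniature corpus: two "nights", each a list of tokenized sentences
corpus = [[["hello", "there"], ["how", "are", "you"]],
          [["good", "night"]]]

words_per_night = []
for night in corpus:
    # sum the length of every sentence in this night
    words_per_night.append(sum(len(sentence) for sentence in night))

print(words_per_night)  # [5, 2]
```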

The suspense created by Shahrazad’s storytelling skills is intriguing, especially the “cliff-hanger” ending of each night, which she uses to avert her own execution (and possibly that of womanhood). Every night she tells the Sultan a story, only to stop at dawn and pick up the thread the next night. But does it really take the whole night to tell a particular story?

I am not aware of any exact numbers about how many words people speak per minute. Averages seem to fluctuate between 100 and 200 words per minute. Narrators are advised to use approximately 150 words per minute in audiobooks. I suspect that this number is a little lower for live storytelling and assume it lies around 130 words per minute (including pauses). Using this information, we can compute the time it takes to tell a particular story as follows:

$$\textrm{story time}(\textrm{text}) = \frac{\textrm{number of words in text}}{\textrm{number of words per minute}}$$
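For example, at our assumed speed a night of 3,900 words would take:

$$\textrm{story time} = \frac{3900~\textrm{words}}{130~\textrm{words per minute}} = 30~\textrm{minutes}$$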


Quiz!

1) Write a function called story_time that takes as input a text. Given a speed of 130 words per minute, compute how long it takes to tell that text.

In [ ]:
def story_time(text):
    # insert your code here

# these tests should return True if your code is correct
print(story_time([["story", "story"]]) * 130 == 2.0)

2) Compute the story_time for each night in our corpus. Assign the result to the variable story_time_per_night.

In [ ]:
story_time_per_night = []
# insert your code here
print(story_time_per_night[:10])

3) Compute the average, minimum and maximum story telling time.

In [ ]:
# insert your code here

Visualizing general statistics

Now that we have computed a range of general statistics for our corpus, it would be nice to visualize them. Python's plotting library matplotlib allows us to produce all kinds of graphs. We could, for example, plot for each story how many sentences it contains:

In [ ]:
import matplotlib.pyplot as plt

plt.plot(sentences_per_night)

Quiz!

1) Can you do the same for words_per_night?

In [ ]:
# insert your code here

2) And can you do the same for story_time_per_night?

In [ ]:
# insert your code here

3) In this final exercise we will put together everything we have learnt so far. We want you to write a function positions_of that returns, for a given word, all sentence positions in the Arabian Nights where that word occurs. We are not interested in the positions relative to a particular night, but only in the positions in the corpus as a whole. Use that function to find all occurrences of the name Shahrazad and store the corresponding indexes in the variable positions_of_shahrazad. Do the same thing for the name Ali and store the result in positions_of_ali. Finally, find all occurrences of Egypt and store the indexes in positions_of_egypt. Tips: (1) remember that we lowercased the entire corpus! (2) remember that indexes start at 0.

In [ ]:
def positions_of(word):
    #insert your code here

positions_of_shahrazad = positions_of("shahrazad")
positions_of_ali = positions_of("ali")
positions_of_egypt = positions_of("egypt")

If everything went well, the following lines of code should produce a nice dispersion plot of all sentence occurrences of Shahrazad, Ali and Egypt in the corpus.

In [ ]:
plt.figure(figsize=(20, 8))
names = ["Shahrazad", "Ali", "Egypt"]
plt.plot(positions_of_shahrazad, [1]*len(positions_of_shahrazad), "|", markersize=100)
plt.plot(positions_of_ali, [2]*len(positions_of_ali), "|", markersize=100)
plt.plot(positions_of_egypt, [0]*len(positions_of_egypt), "|", markersize=100)
plt.yticks(range(len(names)), names)
_ = plt.ylim(-1, 3)

Then Shahrazad reached the morning, and fell silent in the telling of her tale…

