Chapter 2: First steps into text processing

-- A Python Course for the Humanities by Folgert Karsdorp and Maarten van Gompel


The previous chapter has hopefully whetted your appetite. In this chapter we will focus on one of the most important tasks in Humanities research: text processing. One of the goals of text processing is to clean up your data as a preliminary step to some kind of data analysis. Another common goal is to convert a given text collection to a different format. In this chapter we will provide you with the necessary tools to work with collections of texts, clean them and perform some rudimentary data analyses on them.

Reading files

Say you have a text stored on your computer. How can we read that text using Python? Python provides a really simple function called open with which we can read texts. In the folder data you find a couple of small text excerpts that we will use in this chapter. Go ahead and have a look at them. We can open these files with the following command:

In [ ]:
infile = open('data/austen-emma-excerpt.txt')

We now print infile. What do you think will happen?

In [ ]:
print(infile)

"Hey! That's not what I expected to happen!", you might think. Python is not printing the contents of the file but only some mysterious mention of some TextIOWrapper. This TextIOWrapper thing is Python's way of saying it has opened a connection to the file data/austen-emma-excerpt.txt. In order to read the contents of the file we must add the function read as follows:

In [ ]:
print(infile.read())

read is a function that operates on TextIOWrapper objects and allows us to read the contents of a file into Python. Let's assign the contents of the file to the variable text:

In [ ]:
infile = open('data/austen-emma-excerpt.txt')
text = infile.read()

The variable text now holds the contents of the file data/austen-emma-excerpt.txt and we can access and manipulate it just like any other string. After we read the contents of a file, the TextIOWrapper no longer needs to be open. In fact, it is good practice to close it as soon as you do not need it anymore. Now, lo and behold, we can achieve that with the following:

In [ ]:
infile.close()
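
Because forgetting to close a file is an easy mistake to make, Python also offers the with statement, which closes the file for us as soon as the indented block ends. The snippet below is only a sketch: so that it runs anywhere, it first creates a small throwaway file (writing files is explained at the end of this chapter). The file name and the variable names example_path, example_file and example_text are made up for this illustration.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on the data folder.
# The file name and its contents are invented for this illustration.
example_path = os.path.join(tempfile.mkdtemp(), "example.txt")
outfile = open(example_path, mode="w")
outfile.write("Emma Woodhouse, handsome, clever, and rich")
outfile.close()

# The with statement closes the file automatically at the end of the block.
with open(example_path) as example_file:
    example_text = example_file.read()

print(example_text)
print(example_file.closed)  # True: we never had to call close() ourselves
```

Both styles read the same text; with simply takes care of the call to close.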

Quiz!

Just to recap some of the stuff we learnt in the previous chapter. Can you write code that defines the variable number_of_es and counts how many times the letter e occurs in text? (Tip: use a for loop and an if statement)

In [ ]:
number_of_es = 0
# insert your code here

# The following test should print True if your code is correct 
print(number_of_es == 78)

Writing our first function

In the previous quiz, you probably wrote a loop that iterates over all characters in text and adds 1 to number_of_es each time the program finds the letter e. Counting objects in a text is a very common thing to do. Therefore, Python provides the convenient function count. This function operates on strings (somestring.count(argument)) and takes as argument the object you want to count. Using this function, the solution to the quiz above can now be rewritten as follows:

In [ ]:
number_of_es = text.count("e")
print(number_of_es)

In fact, count takes as argument any string you would like to find. We could just as well count how often the determiner an occurs:

In [ ]:
print(text.count("an"))

The string an is found 12 times in our text. Does that mean that the word an occurs 12 times in our text? Go ahead. Count it yourself. In fact, an occurs only twice... Think about this. Why does Python print 12?

If we want to count how often the word an occurs in the text and not the string an, we could surround an with spaces, like the following:

In [ ]:
print(text.count(" an "))

Although it gets the job done in this particular case, it is generally not a very solid way of counting words in a text. What if there are instances of an followed by a semicolon or some end-of-sentence marker? Then we would need to query the text multiple times for each possible context of an. For that reason, we're going to approach the problem using a different, more sophisticated strategy.
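
To see both problems at once, here is a small sketch on a made-up sentence (the variable sample and its contents are not taken from the Emma excerpt). The word an occurs three times in it, yet neither query returns 3.

```python
sample = "He ate an apple and an orange. What an!"

# Counting the bare string also matches the an inside "and" and "orange".
print(sample.count("an"))    # 5

# Surrounding it with spaces misses the last an, which is followed by "!".
print(sample.count(" an "))  # 2
```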

Recall from the previous chapter the function split. What does this function do? The function split operates on a string: it splits the string on whitespace (spaces, tabs and newlines) and returns a list of smaller strings (or words):

In [ ]:
print(text.split())
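
As a small illustration on a made-up string (sample_line is a name invented for this example), note that split called without arguments treats any run of whitespace as one separator, and that punctuation stays attached to the words:

```python
# A newline and a double space both count as a single separator.
sample_line = "Emma Woodhouse,\nhandsome,  clever"
sample_words = sample_line.split()
print(sample_words)  # ['Emma', 'Woodhouse,', 'handsome,', 'clever']
```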

Quiz!

All the things you have learnt so far should enable you to write code that counts how often a certain item occurs in a list. Write some code that defines the variable number_of_hits and counts how often the word in (assigned to item_to_count) occurs in the list of words called words.

In [ ]:
words = text.split()
number_of_hits = 0
item_to_count = "in"
# insert your code here

# The following test should print True if your code is correct 
print(number_of_hits == 3)

We will go through the previous quiz step by step. We would like to know how often the preposition in occurs in our text. As a first step we will split the string text into a list of words:

In [ ]:
words = text.split()

Next we define a variable number_of_hits and set it to zero.

In [ ]:
number_of_hits = 0

The final step is to loop over all words in words and add 1 to number_of_hits if we find a word that is equal to in:

In [ ]:
item_to_count = "in"
for word in words:
    if word == item_to_count:
        number_of_hits += 1
print(number_of_hits)

Now, say we would like to know how often the word of occurs in our text. We could adapt the previous lines of code to search for the word of, but what if we also would like to count the number of times the occurs, and house and had and... It would be really cumbersome to repeat all these lines of code for each particular search term we have. Programming is supposed to reduce our workload, not increase it. Just like the function count for strings, we would like to have a function that operates on lists, takes as argument the object we would like to count and returns the number of times this object occurs in our list.

In this and the previous chapter you have already seen lots of functions. A function does something, often based on some argument you pass to it, and generally returns a result. You are not just limited to using functions in the standard library but you can write your own functions.

In fact, you must write your own functions. Separating your problem into sub-problems and writing a function for each of those is an immensely important part of well-structured programming. Functions are defined using the def keyword; they take a name and, optionally, a number of parameters.

def some_name(optional_parameters):

The return statement returns a value back to the caller and always ends the execution of the function.
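
A minimal sketch of these two ideas, using a made-up function add_one that is not part of the course material:

```python
def add_one(number):
    # return hands the result back to the caller and ends the function.
    return number + 1
    print("This line is never reached")  # skipped, because return came first

result = add_one(41)
print(result)  # 42
```

The value that follows return becomes the result of the call, which is why we can assign it to the variable result.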

Going back to our problem, we want to write a function called count_in_list. It takes two arguments: (1) the object we would like to count and (2) the list in which we want to count that object. Let's write down the function definition in Python:

def count_in_list(item_to_count, list_to_search):

Do you understand all the syntax and keywords in the definition above? Now all we need to do is to add the lines of code we wrote before to the body of this function:

In [ ]:
def count_in_list(item_to_count, list_to_search): 
    number_of_hits = 0                            
    for item in list_to_search:                   
        if item == item_to_count:                 
            number_of_hits += 1                   
    return number_of_hits                         

All code should be familiar to you, except the return keyword. The return keyword tells Python to hand back the value of number_of_hits as the result of calling the function. OK, let's go through our function one more time, just to make sure you really understand all of it.

  1. First we define a function using def and give it the name count_in_list (line 1);
  2. This function takes two arguments: item_to_count and list_to_search (line 1);
  3. Within the function, we define a variable number_of_hits and assign to it the value zero (since at that stage we haven't found anything yet (line 2));
  4. We loop over all words in list_to_search (line 3);
  5. If we find a word that is equal to item_to_count (line 4), we add 1 to number_of_hits (line 5);
  6. Return the result of number_of_hits (line 6).

Let's test our little function! We will first count how often the word an occurs in our list of words words.

In [ ]:
print(count_in_list("an", words))
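
As an aside, Python lists come with a built-in count function of their own, which behaves like our count_in_list. The sketch below repeats our function so that it runs on its own and compares the two on a made-up list (example_words is an invented name):

```python
def count_in_list(item_to_count, list_to_search):
    number_of_hits = 0
    for item in list_to_search:
        if item == item_to_count:
            number_of_hits += 1
    return number_of_hits

example_words = ["an", "old", "house", "and", "an", "orchard"]

print(count_in_list("an", example_words))  # 2
print(example_words.count("an"))           # 2, using the built-in list function
```

Writing the function ourselves is still a useful exercise, since it shows what happens behind the scenes.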

Quiz!

Using the function we defined, print how often the word the occurs in our text.

In [ ]:
# insert your code here

A more general count function

Our function count_in_list is a concise and convenient piece of code allowing us to rapidly and without too much repetition count how often certain items occur in a given list. Now what if we would like to find out for all words in our text how often they occur? Then it would still be quite cumbersome to call our function for each unique word. We would like to have a function that takes as argument a particular list and counts for each unique item in that list how often it occurs. There are multiple ways of writing such a function. We will show you two ways of doing it.

A count function (take 1)

In the previous chapter you acquainted yourself with the dictionary structure. Recall that a dictionary consists of keys and values and allows you to quickly look up a value. We will use a dictionary to write the function counter that takes as argument a list and returns a dictionary with keys for each unique item and values showing the number of times it occurs in the list. We will first write some code without the function declaration. If that works, we will add it, just as before, to the body of a function.

We start with defining a variable counts which is an empty dictionary:

In [ ]:
counts = {}

Next we will loop over all words in our list words. For each word, we check whether the dictionary already contains it. If so, we add 1 to its value. If not, we add the word to the dictionary and assign to it the value 1.

In [ ]:
for word in words:
    if word in counts:
        counts[word] = counts[word] + 1
    else:
        counts[word] = 1
print(counts)

If you don't remember anymore how dictionaries work, go back to the previous chapter and read the part about dictionaries once more.

Now that our code is working, we can add it to a function. We define the function counter using the def keyword. It takes one argument (list_to_search).

In [ ]:
def counter(list_to_search):                 
    counts = {}                              
    for word in list_to_search:              
        if word in counts:                   
            counts[word] = counts[word] + 1  
        else:                                
            counts[word] = 1                 
    return counts                            

Hopefully we are not boring you, but let's go through this function step by step.

  1. We define a function using def and give it the name counter (line 1);
  2. This function takes a single argument list_to_search which is the list we want to search through (line 1);
  3. Next we define a variable counts which is an empty dictionary (line 2);
  4. We loop over all words in list_to_search (line 3);
  5. If the word is already in counts, we look up its current value and add 1 to it (line 4-5);
  6. If the word is not in counts (else clause), we add the word to the dictionary and assign it the value 1 (line 6-7);
  7. Return the result of counts (line 8).

Let's try out our new function!

In [ ]:
print(counter(words))
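
For comparison, the standard library ships a ready-made tool for this task: Counter from the collections module. The sketch below uses a made-up word list (example_words and example_counts are invented names) and is meant as an alternative to, not a replacement for, the function we just wrote.

```python
from collections import Counter

example_words = ["her", "father", "and", "her", "sister", "and", "her"]

# A Counter behaves like a dictionary that maps each item to its frequency.
example_counts = Counter(example_words)

print(example_counts["her"])          # 3
print(example_counts.most_common(2))  # [('her', 3), ('and', 2)]
```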

Quiz!

Let's put some of the stuff we learnt so far together. What we want you to do is to read into Python the file data/austen-emma.txt, convert it to a list of words and assign to the variable emma_count how often the word Emma occurs in the text.

In [ ]:
emma_count = 0
# insert your code here

# The following test should print True if your code is correct 
print(emma_count == 481)

A count function (take 2)

Let's train our function writing skills a little more. We are going to write another counting function, this time using a slightly different strategy. Recall our function count_in_list. It takes as argument a list and the item we want to count in that list. It returns the number of times this item occurs in the list. If we call this function for each unique word in words, we obtain a list of frequencies, quite similar to the one we get from the function counter. What would happen if we just call the function count_in_list on each word in words?

In [ ]:
infile = open('data/austen-emma-excerpt.txt')
text = infile.read()
infile.close()
words = text.split()

for word in words:
    print(word, count_in_list(word, words))

As you can see, we obtain the frequency of each word token in words, where we would like to have it only for unique word forms. The challenge is thus to come up with a way to convert our list of words into a structure with solely unique words. For this Python provides a convenient data structure called set. It takes as argument some iterable (e.g. a list) and returns a new object containing only unique items:

In [ ]:
x = ['a', 'a', 'b', 'b', 'c', 'c', 'c']
unique_x = set(x)
print(unique_x)
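
One thing to keep in mind is that a set has no fixed order, so the items may be printed in a different order than the one in which they appeared in the list. If we need a predictable order, we can pass the set to sorted, which returns a sorted list. A small sketch (letters and unique_letters are names made up for this example):

```python
letters = ['a', 'a', 'b', 'b', 'c', 'c', 'c']
unique_letters = set(letters)

print(len(unique_letters))     # 3 unique items
print(sorted(unique_letters))  # ['a', 'b', 'c']
```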

Using set we can iterate over all unique words in our word list and print the corresponding frequency:

In [ ]:
unique_words = set(words)
for word in unique_words:
    print(word, count_in_list(word, words))

We wrap the lines of code above into the function counter2:

In [ ]:
def counter2(list_to_search):
    unique_words = set(list_to_search)
    for word in unique_words:
        print(word, count_in_list(word, list_to_search))

A final check to see whether our function behaves correctly:

In [ ]:
counter2(words)

Quiz!

We have written two functions counter and counter2, both used to count for each unique item in a particular list how often it occurs in that list. Can you come up with some pros and cons for each function? Why is counter2 better than counter or why is counter better than counter2?

Double click this cell and write down your answer.


Text clean up

In the previous section we wrote code to compute a frequency distribution of the words in a text stored on our computer. The function split is a quick and dirty way of splitting a string into a list of words. However, if we look through the frequency distributions, we notice quite an amount of noise. For instance, the pronoun her occurs 4 times, but we also find her. occurring 1 time and the capitalized Her, also 1 time. Of course we would like to add those counts to that of her. As it appears, the tokenization of our text using split is fast and simple, but it leaves us with noisy and incorrect frequency distributions.
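
The effect is easy to reproduce on a made-up sentence (noisy_words is an invented name, and the sentence is not from the Emma excerpt). The pronoun occurs three times, but split produces three different strings for it:

```python
noisy_words = "Her sister loved her. She missed her".split()
print(noisy_words)

# Only the exact string "her" is counted; "Her" and "her." are different strings.
print(noisy_words.count("her"))  # 1
```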

There are essentially two strategies to follow to correct our frequency distributions. The first is to come up with a better procedure of splitting our text into words. The second is to clean up our text and pass this clean result to the convenient split function. For now we will follow the second path.

Some words in our text are capitalized. To lowercase these words, Python provides the function lower. It operates on strings:

In [ ]:
x = 'Emma'
x_lower = x.lower()
print(x_lower)

We can apply this function to our complete text to obtain a completely lowercased text, using:

In [ ]:
text_lower = text.lower()
print(text_lower)

This solves our problem with miscounting capitalized words, leaving us with the problem of punctuation. The function replace is just the function we're looking for. It takes two arguments: (1) the string we would like to replace and (2) the string we want to replace the first argument with:

In [ ]:
x = 'Please. remove. all. dots. from. this. sentence.'
x = x.replace(".", "")
print(x)

Notice that we replace all dots with an empty string written as "".
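
Also notice that we assigned the result back to x. The function replace does not change the original string; it returns a new one. A short sketch of the difference (sentence and without_dots are names made up for this example):

```python
sentence = "Please. remove. all. dots."

sentence.replace(".", "")  # the result is thrown away
print(sentence)            # the dots are still there

without_dots = sentence.replace(".", "")  # keep the result in a variable
print(without_dots)        # Please remove all dots
```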


Quiz!

Write code that lowercases the following short text and removes all commas from it:

In [ ]:
short_text = "Commas, as it turns out, are so much overestimated."
# insert your code here

# The following test should print True if your code is correct 
print(short_text == "commas as it turns out are so much overestimated.")

We would like to remove all punctuation from a text, not just dots and commas. We will write a function called remove_punc that removes all (simple) punctuation from a text. Again, there are many ways in which we can write this function. We will show you two of them. The first strategy is to repeatedly call replace on the same string each time replacing a different punctuation character with an empty string.

In [ ]:
def remove_punc(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    for marker in punctuation:
        text = text.replace(marker, "")
    return text

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punc(short_text))

The second strategy we will follow is to show you that we can achieve the same result without using the built-in function replace. Remember that a string consists of characters. We can loop over a string accessing each character in turn. Each time we find a punctuation marker we skip to the next character.

In [ ]:
def remove_punc2(text):
    punctuation = '!@#$%^&*()_-+={}[]:;"\'|<>,.?/~`'
    clean_text = ""
    for character in text:
        if character not in punctuation:
            clean_text += character
    return clean_text

short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punc2(short_text))
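
For completeness, here is a sketch of a third possibility that relies on the standard library. The string module provides string.punctuation, a ready-made string of ASCII punctuation characters, and strings have a translate function that can delete characters in one go. The name remove_punc3 is made up for this illustration, and note that string.punctuation is not identical to the punctuation string we typed by hand above (it also contains the backslash, for instance).

```python
import string

def remove_punc3(text):
    # With three arguments, str.maketrans builds a table in which every
    # character of the third argument is marked for deletion.
    table = str.maketrans("", "", string.punctuation)
    return text.translate(table)

example = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(remove_punc3(example))
```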

Quiz!

1) Can you come up with any pros or cons for each of the two functions above?

Write your answer here (double click me)

2) Now it is time to put everything together. We want to write a function clean_text that takes as argument a text represented by a string. The function should return this string with all punctuation removed and all characters lowercased.

In [ ]:
def clean_text(text):
    # insert your code here
    pass  # placeholder so that the cell runs; replace it with your own code

# The following test should print True if your code is correct 
short_text = "Commas, as it turns out, are overestimated. Dots, however, even more so!"
print(clean_text(short_text) == 
      "commas as it turns out are overestimated dots however even more so")

3) This last exercise puts everything together. We want you to open and read the file data/austen-emma.txt once more, clean up the text and recompute the frequency distribution. Assign to woodhouse_counts the number of times the name Woodhouse occurs in the text.

In [ ]:
woodhouse_counts = 0
# insert your code here

# The following test should print True if your code is correct 
print(woodhouse_counts == 263)

Writing results to a file

We have accomplished a lot! You have learnt how to read files from your computer using Python, how to manipulate them, clean them up and compute a frequency distribution of the words in a text file. We will finish this chapter by explaining how to write your results to a file. We have already seen how to read a text from our disk. Writing to our disk is only slightly different. The following lines of code write a single sentence to the file first-output.txt.

In [ ]:
outfile = open("first-output.txt", mode="w")
outfile.write("My first output.")
outfile.close()

Go ahead and open the file first-output.txt located in the folder where this course resides. As you can see it contains the line My first output.. To write something to a file we open, just as in the case of reading a file, a TextIOWrapper which can be seen as a connection to the file first-output.txt. The difference with opening a file for reading is the mode with which we open the connection. Here the mode says w, meaning "open the file for writing". To open a file for reading, we set the mode to r. However, since this is Python's default setting, we may omit it.
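
Be aware that opening an existing file with mode w empties it before anything is written. If we want to add to the end of a file instead, we can use the mode a (for "append"). The sketch below shows the difference. It writes to a throwaway file in a temporary folder (example_path is a made-up name) and uses the with statement, which closes each file automatically when the indented block ends.

```python
import os
import tempfile

# A throwaway location, so that we do not overwrite any real file.
example_path = os.path.join(tempfile.mkdtemp(), "example-output.txt")

with open(example_path, mode="w") as example_out:
    example_out.write("First line.\n")

# Opening with mode "w" again empties the file before writing.
with open(example_path, mode="w") as example_out:
    example_out.write("Replaced line.\n")

# Mode "a" adds to the end of the existing contents.
with open(example_path, mode="a") as example_out:
    example_out.write("Appended line.\n")

with open(example_path) as example_in:
    written = example_in.read()
print(written)
```

The file ends up with two lines: the replaced one and the appended one. The very first line is gone.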


Quiz!

In the final quiz of this chapter we will ask you to write the frequency distribution over the words in data/austen-emma.txt to the file data/austen-frequency-distribution.txt. We will give you some code to get you started:

In [ ]:
# first open and read data/austen-emma.txt. Don't forget to close the infile
infile = open("data/austen-emma.txt")
text = # read the contents of the infile
# close the file handler
# clean the text

# next compute the frequency distribution using the function counter
frequency_distribution = 

# now open the file data/austen-frequency-distribution.txt for writing
outfile = 

for word, frequency in frequency_distribution.items():
    outfile.write(word + ";" + str(frequency) + '\n')
    
# close the outfile
