Some corpora have already been marked up for use with NLTK, but you'll often want to work with your own texts. So how do we load them in and prepare them for use with NLTK? We're going to start by looking at some plain text (.txt) files of speeches and press releases from the Malcolm Fraser archive, held by the University of Melbourne. Along the way we'll look at some of the advantages and disadvantages of using NLTK, and at some common data-wrangling problems. You can check out the Fraser Archive here: http://www.unimelb.edu.au/malcolmfraser/
First of all, let's load in our text.
Via file management, open and inspect one file. What do you see? Are there any potential problems?
from __future__ import division
import nltk, re, pprint
import os
#import tokenizers
from nltk import word_tokenize
from nltk.text import Text
nltk.data.path.append('/home/researcher/nltk_data/')
nltk.download("book", download_dir='/home/researcher/nltk_data/')
Run the import statements above; you'll need them to load and process raw text. Now that we've got our texts, let's have a look at what is in the file directory.
#access items in the directory 'UMA_Fraser_Radio_Talks' and view the first 3
os.listdir('UMA_Fraser_Radio_Talks')[:3]
First we'll read in one speech and tokenize it, which means breaking it up into words for analysis.
#open a file and call the content 'speech'
speech = open('UMA_Fraser_Radio_Talks/UDS2013680-100-full.txt').read()
#tokenize the speech and call the result 'vocab'
vocab = word_tokenize(speech)
len(vocab)
len(set(vocab))
vocab.count('South')
len(vocab)/len(set(vocab))
V = set(vocab)
long_words = [word for word in V if len(word) > 12]
sorted(long_words)
To perform more complex operations, such as concordancing and finding collocations, we'll need to wrap our tokens in an NLTK Text object.
sent_vocab = Text(word_tokenize(speech))
sent_vocab.concordance('wool')
sent_vocab.collocations()
#build a table of the 15 most common words in the text
from nltk.probability import FreqDist
fdist1 = FreqDist(sent_vocab)
fdist1.tabulate(15)
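FreqDist behaves much like the standard library's collections.Counter, which can be a handy way to understand what it's doing. A toy parallel, using made-up words rather than the Fraser data:

```python
from collections import Counter

# a tiny made-up token list
words = ['wool', 'the', 'wool', 'prices', 'the', 'the']
fdist = Counter(words)
# the two most common words, with their counts
print(fdist.most_common(2))
```

Like FreqDist, a Counter maps each item to how many times it was seen, and `most_common(n)` returns the top n pairs.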
#graph the 20 most common words in the text
%matplotlib inline
fdist1.plot(20, cumulative=True)
fdist1.max()
100.0*fdist1.freq('Portland')
vocab[:20]
len(set(word.lower() for word in vocab))
len(set(word.lower() for word in vocab if word.isalpha()))
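To see why lowercasing and filtering with isalpha() shrink the vocabulary count, here's a toy illustration (made-up tokens, not the Fraser data):

```python
# six distinct strings, but some are case variants, a number, and punctuation
tokens = ['Wool', 'wool', 'WOOL', '1975', ',', 'prices']
print(len(set(tokens)))
# lowercasing merges the case variants
print(len(set(t.lower() for t in tokens)))
# isalpha() drops the number and the comma as well
print(len(set(t.lower() for t in tokens if t.isalpha())))
```

The counts fall from 6 to 4 to 2 as each normalisation step collapses more of the "vocabulary".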
We've had a look at one file, but the real strength of NLTK is to be able to explore large bodies of text. When we manually inspected the first file, we saw that it contained a metadata section, before the body of the text. We can ask Python to show us the start of the file. For analysing the text, it is useful to split the metadata section off, so that we can interrogate it separately but also so that it won't distort our results when we analyse the text.
#view the first 100 characters of the first file
open('UMA_Fraser_Radio_Talks/' + os.listdir('UMA_Fraser_Radio_Talks')[0]).read()[:100]
#open the first file, read it and then split it into two parts, metadata and body
data = open('UMA_Fraser_Radio_Talks/' + os.listdir('UMA_Fraser_Radio_Talks')[0]).read().split("<!--end metadata-->")
#view the first part
data[0]
#split into lines, add '*' to the start of each line
for line in data[0].split('\r\n'):
    print '*', line
#get rid of any line that starts with '<'
for line in data[0].split('\r\n'):
    if line[0] == '<':
        continue
    print '*', line
#skip empty lines and any line that starts with '<'
for line in data[0].split('\r\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    print '*', line
#split the metadata items on ':' so that we can interrogate each one
for line in data[0].split('\r\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':')
    print '*', element
#actually, only split on the first colon
for line in data[0].split('\r\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':', 1)
    print '*', element
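Splitting only on the first colon matters because a metadata value can itself contain a colon. A hypothetical line shows the difference:

```python
# a made-up metadata line whose value contains a colon
line = 'Title: Talk to the Nation: Wool Prices'
# plain split breaks the value into pieces
print(line.split(':'))
# maxsplit=1 keeps the key and the full value intact
print(line.split(':', 1))
```

The first call gives three fragments; the second gives exactly two, a key and a value.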
We've now split up the elements of the metadata, but we want to be able to interrogate it so that we can start to find out something about the collection of files. To do that, we need to build a dictionary.
metadata = {}
for line in data[0].split('\r\n'):
    if not line:
        continue
    if line[0] == '<':
        continue
    element = line.split(':', 1)
    metadata[element[0]] = element[-1]
print metadata
#look up the date
print metadata['Date']
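Looking a key up with square brackets raises a KeyError if the key is missing, and not every file is guaranteed to carry every field. The dictionary's get() method with a default is a safer lookup (toy dictionary shown, not real file metadata):

```python
# a made-up metadata dictionary with only one field
metadata = {'Date': '1/12/1975'}
# present key: returns the value
print(metadata.get('Date', 'unknown'))
# absent key: returns the default instead of raising KeyError
print(metadata.get('Speaker', 'unknown'))
```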
Creating a function means that we can perform an operation multiple times without having to type out all the code every time. There are over 700 files in our directory, so by defining a function and running it over all the files, we can interrogate the collection and learn something about it. Defining a function also means we can be sure that exactly the same thing happens each time.
#open the first file, read it and then split it into two parts, metadata and body
data = open('UMA_Fraser_Radio_Talks/UDS2013680-100-full.txt')
data = data.read().split("<!--end metadata-->")
#define a function that breaks up the metadata for each file and gets rid of the whitespace at the start of each element
def parse_metadata(text):
    metadata = {}
    for line in text.split('\r\n'):
        if not line:
            continue
        if line[0] == '<':
            continue
        element = line.split(':', 1)
        metadata[element[0]] = element[-1].strip(' ')
    return metadata
parse_metadata(data[0])
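You can also sanity-check the function without touching the corpus by feeding it a made-up metadata block (the field names and values here are purely illustrative):

```python
def parse_metadata(text):
    metadata = {}
    for line in text.split('\r\n'):
        if not line:
            continue
        if line[0] == '<':
            continue
        element = line.split(':', 1)
        metadata[element[0]] = element[-1].strip(' ')
    return metadata

# an invented block: one tag line and one blank line should both be skipped
sample = 'Date: 1/12/1975\r\n<!--tape 1-->\r\n\r\nTitle: Wool Prices'
print(parse_metadata(sample))
```

The tag line and the empty line are dropped, and the two remaining lines become key–value pairs with leading spaces stripped from the values.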
Now that we're confident that the function works, let's find out a bit about the corpus. As a start, it would be useful to know which years the texts are from. Are they evenly distributed over time? A graph will tell us!
#import conditional frequency distribution
from nltk.probability import ConditionalFreqDist
cfdist = ConditionalFreqDist()
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open('UMA_Fraser_Radio_Talks/' + filename).read()
    #split text of file on 'end metadata'
    text = text.split("<!--end metadata-->")
    #parse metadata using previously defined function "parse_metadata"
    metadata = parse_metadata(text[0])
    #skip all speeches for which there is no exact date
    if metadata['Date'][0] == 'c':
        continue
    #count speeches by year, that is, the final part of the 'Date' string after '/'
    cfdist['count'][metadata['Date'].split('/')[-1]] += 1
cfdist.plot()
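Under the hood, ConditionalFreqDist is doing simple bookkeeping here: one tally per year. The same tally can be sketched with the standard library's Counter, using toy date strings of the kind parse_metadata returns rather than reading the corpus:

```python
from collections import Counter

# invented 'Date' values: two exact dates in 1975, one circa date, one in 1976
dates = ['1/12/1975', '3/4/1975', 'c1974', '10/10/1976']
# skip circa dates and count by the year after the last '/'
counts = Counter(d.split('/')[-1] for d in dates if not d.startswith('c'))
print(counts)
```

This gives two speeches for 1975 and one for 1976, with the circa-dated speech excluded, mirroring what the loop above feeds into cfdist.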
This time, let's include the speeches that only have approximate ('circa') dates, by stripping the leading 'c' from them.
cfdistA = ConditionalFreqDist()
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open('UMA_Fraser_Radio_Talks/' + filename).read()
    #split text of file on 'end metadata'
    text = text.split("<!--end metadata-->")
    #parse metadata using previously defined function "parse_metadata"
    metadata = parse_metadata(text[0])
    date = metadata['Date']
    #for circa dates, strip the leading 'c'; otherwise take the part after the last '/'
    if date[0] == 'c':
        year = date[1:]
    else:
        year = date.split('/')[-1]
    if year:
        cfdistA['count'][year] += 1
cfdistA.plot()
We can build the same kind of distribution for the 'Description' field, to see what kinds of texts the collection contains.
cfdist2 = ConditionalFreqDist()
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open('UMA_Fraser_Radio_Talks/' + filename).read()
    #split text of file on 'end metadata'
    text = text.split("<!--end metadata-->")
    #parse metadata using previously defined function "parse_metadata"
    metadata = parse_metadata(text[0])
    #skip all speeches for which there is no exact date
    if metadata['Date'][0] == 'c':
        continue
    #count speeches by 'Description'
    cfdist2['count'][metadata['Description']] += 1
cfdist2.plot()
Previously, we tokenized the whole text of a file so that we could conduct some analysis. Let's now tokenize just the body of the file, not the metadata. As an exercise, let's see how often the modal verbs 'must', 'should' and 'will' occur in the text.
#tokenize the body of the text so that we can start to analyse it
tokens = word_tokenize(data[1])
tokens.count('should')
For each file, tokenize the body, then count how often 'must', 'will' and 'should' occur in it.
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open('UMA_Fraser_Radio_Talks/' + filename).read()
    #split text of file on 'end metadata'
    text = text.split("<!--end metadata-->")
    #parse metadata using previously defined function "parse_metadata"
    metadata = parse_metadata(text[0])
    #skip all speeches for which there is no exact date
    if metadata['Date'][0] == 'c':
        continue
    #tokenise the text of the speech
    tokens = word_tokenize(text[1].decode('ISO-8859-1'))
    #show the date of each speech and count how often 'should', 'must' and 'will' are used in it
    print metadata['Date'], ',', tokens.count('should'), ',', tokens.count('must'), ',', tokens.count('will')
And graph that
cfdist3 = ConditionalFreqDist()
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open('UMA_Fraser_Radio_Talks/' + filename).read()
    text = text.split('<!--end metadata-->')
    metadata = parse_metadata(text[0])
    date = metadata['Date']
    if date[0] == 'c':
        year = date[1:]
    else:
        year = date.split('/')[-1]
    if year == '1966':
        continue
    tokens = word_tokenize(text[1].decode('ISO-8859-1'))
    cfdist3['should'][year] += tokens.count('should')
    cfdist3['will'][year] += tokens.count('will')
    cfdist3['must'][year] += tokens.count('must')
cfdist3.plot()
Raw counts are biased towards years with more (or longer) speeches. Dividing each count by the number of tokens in the speech gives relative frequencies instead.
cfdist3 = ConditionalFreqDist()
for filename in os.listdir('UMA_Fraser_Radio_Talks'):
    text = open('UMA_Fraser_Radio_Talks/' + filename).read()
    text = text.split('<!--end metadata-->')
    metadata = parse_metadata(text[0])
    date = metadata['Date']
    if date[0] == 'c':
        year = date[1:]
    else:
        year = date.split('/')[-1]
    if year == '1966':
        continue
    tokens = word_tokenize(text[1].decode('ISO-8859-1'))
    #avoid dividing by zero for empty speeches
    if len(tokens) == 0:
        continue
    cfdist3['should'][year] += tokens.count('should') / len(tokens)
    cfdist3['will'][year] += tokens.count('will') / len(tokens)
    cfdist3['must'][year] += tokens.count('must') / len(tokens)
cfdist3.plot()
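The division turns a raw count into a rate: a speech's score for 'must' is the share of its tokens that are 'must', so long and short speeches become comparable. A toy check with a made-up token list (in the notebook above, this relies on the __future__ division import at the top):

```python
# seven invented tokens, one of which is 'must'
tokens = ['we', 'must', 'act', 'and', 'we', 'will', 'act']
rate = tokens.count('must') / len(tokens)
# one occurrence out of seven tokens, roughly 0.143
print(round(rate, 3))
```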