This notebook shows how to collect the collocates of a word, that is, the words that appear near it in a text.
Note: It assumes the text file is in the same directory as this notebook.
First we check what texts we have in the directory.
%ls *.txt
Hume Enquiry.txt Hume Treatise.txt untitled.txt
Now we open the text.
Copy in the filename of the text you want to process and run the next cell. Our example uses the plain text version of David Hume's A Treatise of Human Nature that we downloaded from Project Gutenberg.
targetText = "Hume Treatise.txt"
with open(targetText, "r") as f:
    theText = f.read()
print("This string has", "{:,}".format(len(theText)), "characters")
This string has 1,344,061 characters
Note that we are tokenizing the full Gutenberg text file which includes metadata and license information. If you want only the tokens of the book you should delete the Gutenberg information from the text file.
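If you would rather not edit the file by hand, the Gutenberg material can also be trimmed in code. The following is a minimal sketch, not part of the original notebook: the function name strip_gutenberg_boilerplate is hypothetical, and it assumes the file contains marker lines beginning with "*** START OF" and "*** END OF". Marker wording varies between Gutenberg files, so check your own copy before relying on it.

```python
import re

def strip_gutenberg_boilerplate(text):
    """Return the text between the Gutenberg START and END marker lines.

    Assumption: marker lines begin with '*** START OF' and '*** END OF'.
    If a marker is missing, that side of the text is left untouched.
    """
    start = re.search(r"^\*\*\* ?START OF.*$", text, flags=re.MULTILINE)
    end = re.search(r"^\*\*\* ?END OF.*$", text, flags=re.MULTILINE)
    begin_at = start.end() if start else 0
    stop_at = end.start() if end else len(text)
    return text[begin_at:stop_at].strip()

# A tiny made-up sample standing in for a real Gutenberg file
sample = (
    "The Project Gutenberg eBook of A Treatise of Human Nature\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK A TREATISE OF HUMAN NATURE ***\n"
    "Body of the book goes here.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK A TREATISE OF HUMAN NATURE ***\n"
    "License text follows here.\n"
)
print(strip_gutenberg_boilerplate(sample))  # Body of the book goes here.
```

In the notebook you could apply it as theText = strip_gutenberg_boilerplate(theText) before tokenizing.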
import re
theTokens = re.findall(r'\b\w[\w-]*\b', theText.lower())
print(theTokens[:10])
['the', 'project', 'gutenberg', 'ebook', 'of', 'a', 'treatise', 'of', 'human', 'nature']
The next cell asks which word you want collocates for and builds a list of the words that appear around each occurrence. You can change how many words of context are collected on either side by editing the value of context.
# Ask for the word to search for; the tokens were lowercased, so lowercase the input too
wrd2find = input("What word do you want collocates for?").strip().lower()
context = 5 # This sets the context of words on either side to grab
end = len(theTokens)
counter = 0 # Position of the current word in theTokens
theCollocates = []
for word in theTokens:
    if word == wrd2find: # This checks to see if the word is what we want
        for i in range(context):
            if (counter - (i + 1)) >= 0: # This checks that we aren't at the beginning
                theCollocates.append(theTokens[(counter - (i + 1))]) # This adds words before
            if (counter + (i + 1)) < end: # This checks that we aren't at the end
                theCollocates.append(theTokens[(counter + (i + 1))]) # This adds words after
    counter = counter + 1
print(theCollocates[:10])
What word do you want collocates for?gutenberg
['project', 'ebook', 'the', 'of', 'a', 'treatise', 'of', 'project', 'license', 'the']
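The loop above can also be wrapped in a reusable function that takes list slices instead of testing each index. This is only a sketch: the name collect_collocates and its parameters are our own, not part of the notebook. It gathers the same words as the loop, but in a different order (all the words before the target, then all the words after), which does not affect the counts.

```python
def collect_collocates(tokens, target, context=5):
    """Collect up to `context` tokens on each side of every occurrence of `target`."""
    collocates = []
    for position, token in enumerate(tokens):
        if token == target:
            # max(0, ...) stops the left slice from wrapping around at the start
            left = tokens[max(0, position - context):position]
            # Slicing past the end of a list is safe, so no end check is needed
            right = tokens[position + 1:position + 1 + context]
            collocates.extend(left + right)
    return collocates

demo_tokens = "the project gutenberg ebook of a treatise of human nature".split()
print(collect_collocates(demo_tokens, "gutenberg", context=2))
# ['the', 'project', 'ebook', 'of']
```

With the notebook's variables, the call would be theCollocates = collect_collocates(theTokens, wrd2find, context).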
Now we can do various things with the list of collocates.
print(len(theCollocates))
357
print(len(set(theCollocates))) # Number of distinct collocates
123
import nltk
tokenDist = nltk.FreqDist(theCollocates)
tokenDist.tabulate(10)
   project        the foundation   literary    archive         of         to          1          a       with
        33         31         15         14         14         12         10          9          8          8
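If NLTK is not installed, the standard library's collections.Counter gives comparable counts. The sketch below uses a short made-up list in place of theCollocates; it is an alternative to the FreqDist call above, not a replacement for it.

```python
from collections import Counter

# A short made-up list standing in for theCollocates
demo_collocates = ["project", "the", "project", "ebook", "the", "project"]

counts = Counter(demo_collocates)
top = counts.most_common(2)  # The two most frequent collocates with their counts
print(top)  # [('project', 3), ('the', 2)]
```

Counter lacks the tabulate and plot helpers that FreqDist provides, so it suits cases where only the counts are needed.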
import matplotlib
%matplotlib inline
tokenDist.plot(25, title="Top Frequency Collocates for " + wrd2find.capitalize())
This will create a CSV file, named after the target word, containing each collocate and its count.
import csv
nameOfResults = wrd2find.capitalize() + ".Collocates.csv"
table = tokenDist.most_common()
with open(nameOfResults, "w", newline="") as f: # newline="" stops the csv module writing blank rows on Windows
    writer = csv.writer(f)
    writer.writerows(table)
print("Done")
Note: If this cell stops with "NameError: name 'wrd2find' is not defined", the earlier cells have not been run in the current session. Run the notebook from the top so that wrd2find and tokenDist exist, then run this cell again.
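As a variation on the CSV export, you may want a header row so the file is easier to read in a spreadsheet. The sketch below is an assumption-laden example rather than the notebook's own method: the column names "collocate" and "count" are our choice, the three rows are taken from the top of the table shown earlier, and the file is written to a temporary folder so that nothing in your working directory is overwritten.

```python
import csv
import os
import tempfile

# The first three rows of the frequency table shown earlier
demo_counts = [("project", 33), ("the", 31), ("foundation", 15)]

# Write to a temporary folder (an assumption made to keep this example tidy)
demo_path = os.path.join(tempfile.mkdtemp(), "Gutenberg.Collocates.csv")
with open(demo_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["collocate", "count"])  # Header row; the names are our choice
    writer.writerows(demo_counts)

# Read the file back to check what was written; csv returns every value as a string
with open(demo_path, newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))
print(rows[:2])  # [['collocate', 'count'], ['project', '33']]
```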
In the next utility example we will generate Concordances for a target word.
CC BY-SA From The Art of Literary Text Analysis by Stéfan Sinclair & Geoffrey Rockwell. Edited and revised by Melissa Mony.
Created August 8, 2014 (Jupyter 4.2.1)