This notebook shows how you can generate a concordance using lists.
First we see what text files we have.
ls *.txt
Hume Enquiry.txt negative.txt positive.txt Hume Treatise.txt obama_tweets.txt
We are going to use "Hume Treatise.txt", a text from Project Gutenberg. You can use whatever text you want. We print the character count and the first 50 characters to check that the file was read.
theText2Use = "Hume Treatise.txt"
with open(theText2Use, "r") as fileToRead:
    fileRead = fileToRead.read()
print("This string has", len(fileRead), "characters.")
print(fileRead[:50])
This string has 1344061 characters.
The Project Gutenberg EBook of A Treatise of Human
Now we tokenize the text, producing a list called "listOfTokens", and check the first 10 words. The regular expression drops punctuation (keeping hyphens inside words), and lower() lowercases the text before it is tokenized.
import re
listOfTokens = re.findall(r'\b\w[\w-]*\b', fileRead.lower())
print(listOfTokens[:10])
['the', 'project', 'gutenberg', 'ebook', 'of', 'a', 'treatise', 'of', 'human', 'nature']
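To see what this tokenizing pattern keeps and drops, here is a minimal sketch that applies the same regular expression to a short made-up sentence. The function name tokenize and the sample sentence are assumptions for illustration; they are not part of the original notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and return word tokens, keeping hyphens inside words."""
    return re.findall(r'\b\w[\w-]*\b', text.lower())

# Made-up sample sentence (not from the Hume text)
sample = "Well-known truths, it seems, aren't always self-evident!"
tokens = tokenize(sample)
print(tokens)
# ['well-known', 'truths', 'it', 'seems', 'aren', 't', 'always', 'self-evident']
```

Note that the apostrophe splits "aren't" into two tokens, while hyphenated words stay whole. Whether that is acceptable depends on the text you are studying.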
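Now we get the word you want a concordance for and the amount of context wanted. Note that input() always returns a string, so the context has to be converted to an integer before we can use it.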
word2find = input("What word do you want collocates for? ").lower() # Ask for the word to search for
context = input("How much context do you want? ")# This asks for the context of words on either side to grab
What word do you want collocates for? truth
How much context do you want? 10
type(context)
str
contextInt = int(context)
type(contextInt)
int
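Calling int() on the raw input will raise a ValueError if the user types something that is not a whole number. One way to guard against that is sketched below. The helper name parse_context and its default of 5 are assumptions for illustration, not part of the original notebook.

```python
def parse_context(raw, default=5):
    """Return raw as a non-negative int, or default if it cannot be used."""
    try:
        value = int(raw.strip())
    except ValueError:
        # Input such as "ten" or "3.5" cannot be converted by int()
        return default
    # A negative context makes no sense for a concordance window
    return value if value >= 0 else default

print(parse_context("10"))   # 10
print(parse_context("ten"))  # 5 (falls back to the default)
```

In the notebook you could then write something like contextInt = parse_context(context) in place of the bare int() call.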
len(listOfTokens)
228958
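Here is the main function that does the work, populating a new list with the lines of the concordance. Each line starts with the position of the keyword in the token list, followed by the keyword with the requested number of words on either side. We check the last 5 concordance lines.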
def makeConc(word2conc,list2FindIn,context2Use,concList):
    end = len(list2FindIn)
    for location in range(end):
        if list2FindIn[location] == word2conc:
            # Here we check whether we are at the very beginning or end
            if (location - context2Use) < 0:
                beginCon = 0
            else:
                beginCon = location - context2Use
            if (location + context2Use) > end:
                endCon = end
            else:
                endCon = location + context2Use + 1
            theContext = (list2FindIn[beginCon:endCon])
            concordanceLine = ' '.join(theContext)
            # print(str(location) + ": " + concordanceLine)
            concList.append(str(location) + ": " + concordanceLine)
theConc = []
makeConc(word2find,listOfTokens,int(context),theConc)
theConc[-5:]
['220330: a reason why the faculty of recalling past ideas with truth and clearness should not have as much merit in it', '223214: confessing my errors and should esteem such a return to truth and reason to be more honourable than the most unerring', '223680: from the other this therefore being regarded as an undoubted truth that belief is nothing but a peculiar feeling different from', '224382: mind and he will evidently find this to be the truth secondly whatever may be the case with regard to this', '225925: by their different feeling i should have been nearer the truth end of project gutenberg s a treatise of human nature']
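To make the windowing logic easier to follow, here is a compact sketch of the same idea run on a tiny made-up token list. It is a variation, not the notebook's function: the name make_conc, the use of enumerate and max, and returning a new list instead of appending to one passed in are all choices made for this illustration.

```python
def make_conc(word, tokens, context):
    """Return 'position: window' lines for every occurrence of word in tokens."""
    lines = []
    for location, token in enumerate(tokens):
        if token == word:
            # Clamp the start at 0; a negative start would wrap around to the end
            begin = max(0, location - context)
            # Slicing past the end of a list is safe, so no upper check is needed
            end = location + context + 1
            lines.append(str(location) + ": " + " ".join(tokens[begin:end]))
    return lines

# Made-up sentence for illustration (not from the Hume text)
toy_tokens = "the truth is that truth is rarely pure and never simple".split()
for line in make_conc("truth", toy_tokens, 2):
    print(line)
# 1: the truth is that
# 4: is that truth is rarely
```

The first hit sits one word from the start, so its left context is shorter than requested. The same thing happens on the right for hits near the end of the text.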
Finally, we output to a text file.
nameOfResults = word2find.capitalize() + ".Concordance.txt"
with open(nameOfResults, "w") as fileToWrite:
    for line in theConc:
        fileToWrite.write(line + "\n")
print("Done")
Done
Here we check that the file was created.
ls *.Concordance.txt
Truth.Concordance.txt
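Listing the file shows that it exists, but not what is in it. A minimal sketch of a round-trip check follows: it writes a few lines the same way and reads them back for comparison. The sample lines, the temporary directory, and the variable names are assumptions so that the example runs on its own without touching your working folder.

```python
import os
import tempfile

# Illustrative concordance lines (made up, not from the Hume text)
lines = ["1: the truth is that", "4: is that truth is rarely"]

# Write to a temporary directory so nothing in the working folder is overwritten
out_path = os.path.join(tempfile.mkdtemp(), "Truth.Concordance.txt")
with open(out_path, "w", encoding="utf-8") as file_to_write:
    for line in lines:
        file_to_write.write(line + "\n")

# Read the file back and strip the newline from each line
with open(out_path, "r", encoding="utf-8") as file_to_read:
    read_back = file_to_read.read().splitlines()

print(len(read_back), "lines written to", os.path.basename(out_path))
print(read_back == lines)
```

Passing encoding="utf-8" explicitly is a design choice here. It avoids surprises when the platform default encoding differs from the encoding of the text.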
CC BY-SA From The Art of Literary Text Analysis by Stéfan Sinclair & Geoffrey Rockwell
Created September 30th, 2016 (Jupyter 4.2.1)