First we see what text files we have.
Hume Enquiry.txt negative.txt positive.txt Hume Treatise.txt obama_tweets.txt
We are going to use "Hume Treatise.txt", a text from Project Gutenberg. You can use whatever text you want. We print the number of characters and the first 50 characters to check that the file loaded.
theText2Use = "Hume Treatise.txt"
with open(theText2Use, "r") as fileToRead:
    fileRead = fileToRead.read()

print("This string has", len(fileRead), "characters.")
print(fileRead[:50])
This string has 1344061 characters.
The Project Gutenberg EBook of A Treatise of Human
Now we tokenize the text, producing a list called "listOfTokens", and check the first ten words. Tokenizing this way lowercases the words and drops punctuation, except for hyphens inside a word.
import re

listOfTokens = re.findall(r'\b\w[\w-]*\b', fileRead.lower())
print(listOfTokens[:10])
['the', 'project', 'gutenberg', 'ebook', 'of', 'a', 'treatise', 'of', 'human', 'nature']
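To see what the pattern keeps and what it drops, here is a small sketch that runs the same regular expression on a made-up sentence. The names sampleText and sampleTokens are only for illustration and are not part of the notebook. Hyphenated words stay together, but an apostrophe splits a word in two, which is why "Gutenberg's" shows up as "gutenberg" and "s" in the concordance output.

```python
import re

# A made-up sentence, used only to illustrate the pattern
sampleText = "Truth, well-founded truth; isn't it?"

# Same pattern as above: a word character followed by any run of
# word characters or hyphens, bounded by word boundaries
sampleTokens = re.findall(r'\b\w[\w-]*\b', sampleText.lower())
print(sampleTokens)
```

This should print ['truth', 'well-founded', 'truth', 'isn', 't', 'it'].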
word2find = input("What word do you want collocates for? ").lower()  # Ask for the word to search for
context = input("How much context do you want? ")  # Ask how many words of context to grab on either side
What word do you want collocates for? truth
How much context do you want? 10
contextInt = int(context)
type(contextInt)
def makeConc(word2conc, list2FindIn, context2Use, concList):
    end = len(list2FindIn)
    for location in range(end):
        if list2FindIn[location] == word2conc:
            # Here we check whether we are at the very beginning or end
            if (location - context2Use) < 0:
                beginCon = 0
            else:
                beginCon = location - context2Use
            if (location + context2Use) > end:
                endCon = end
            else:
                endCon = location + context2Use + 1
            theContext = list2FindIn[beginCon:endCon]
            concordanceLine = ' '.join(theContext)
            # print(str(location) + ": " + concordanceLine)
            concList.append(str(location) + ": " + concordanceLine)

theConc = []  # makeConc appends its lines to this list
makeConc(word2find, listOfTokens, int(context), theConc)
theConc[-5:]
['220330: a reason why the faculty of recalling past ideas with truth and clearness should not have as much merit in it', '223214: confessing my errors and should esteem such a return to truth and reason to be more honourable than the most unerring', '223680: from the other this therefore being regarded as an undoubted truth that belief is nothing but a peculiar feeling different from', '224382: mind and he will evidently find this to be the truth secondly whatever may be the case with regard to this', '225925: by their different feeling i should have been nearer the truth end of project gutenberg s a treatise of human nature']
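The function above fills a list that you create beforehand and pass in. As an alternative sketch, not the version used in this notebook, the same idea can be written so the function builds and returns its own list. The name buildConcordance and the sample tokens are made up for illustration. It relies on Python slices stopping at the end of a list, so only the start of the window needs a bounds check.

```python
def buildConcordance(tokens, target, width):
    """Return one 'position: context' line for each occurrence of target."""
    lines = []
    for location, token in enumerate(tokens):
        if token == target:
            # Never start before the first token
            start = max(0, location - width)
            # A slice that runs past the end simply stops at the last token
            stop = location + width + 1
            lines.append(str(location) + ": " + " ".join(tokens[start:stop]))
    return lines

# A tiny made-up token list, only to show the shape of the result
sampleTokens = "the truth is that truth matters".split()
sampleConc = buildConcordance(sampleTokens, "truth", 1)
print(sampleConc)
```

With one word of context on each side, this should print ['1: the truth is', '4: that truth matters'].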
nameOfResults = word2find.capitalize() + ".Concordance.txt"
with open(nameOfResults, "w") as fileToWrite:
    for line in theConc:
        fileToWrite.write(line + "\n")

print("Done")
Here we check that the file was created.
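One way to make this check from Python itself is sketched below. The helper describeFile is a made-up name, and the file name assumes you searched for "truth" as in the example run, which gives "Truth.Concordance.txt"; substitute nameOfResults if you are running the cells in order.

```python
import os

def describeFile(path):
    """Report whether a file exists and, if it does, how large it is."""
    if os.path.isfile(path):
        return path + " exists (" + str(os.path.getsize(path)) + " bytes)"
    return path + " was not found"

# Assumed file name, based on searching for "truth" in the run above
print(describeFile("Truth.Concordance.txt"))
```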