Generating Concordances¶

This notebook shows how you can generate a concordance using lists.

First we see what text files we have.

In [1]:

ls *.txt

Hume Enquiry.txt   negative.txt       positive.txt
Hume Treatise.txt  obama_tweets.txt

We are going to use the "Hume Enquiry.txt" from the Gutenberg Project. You can use whatever text you want. We print the first 50 characters to check.

In [2]:

theText2Use = "Hume Treatise.txt"
with open(theText2Use, "r") as fileToRead:
    fileRead = fileToRead.read()
    
print("This string has", len(fileRead), "characters.")
print(fileRead[:50])

This string has 1344061 characters.
The Project Gutenberg EBook of A Treatise of Human

Tokenization¶

Now we tokenize the text producing a list called "listOfTokens" and check the first words. This eliminate punctuation and lowercases the words.

In [3]:

import re
listOfTokens = re.findall(r'\b\w[\w-]*\b', fileRead.lower())
print(listOfTokens[:10])

['the', 'project', 'gutenberg', 'ebook', 'of', 'a', 'treatise', 'of', 'human', 'nature']

Input¶

Now we get the word you want a concordance for an the context wanted.

In [4]:

word2find = input("What word do you want collocates for? ").lower() # Ask for the word to search for
context = input("How much context do you want? ")# This asks for the context of words on either side to grab

What word do you want collocates for? truth
How much context do you want? 10

In [5]:

type(context)

Out[5]:

str

In [7]:

contextInt = int(context)
type(contextInt)

Out[7]:

int

In [9]:

len(listOfTokens)

Out[9]:

Main function¶

Here is the main function that does the work populating a new list with the lines of concordance. We check the first 5 concordance lines.

In [10]:

def makeConc(word2conc,list2FindIn,context2Use,concList):

    end = len(list2FindIn)
    for location in range(end):
        if list2FindIn[location] == word2conc:
            # Here we check whether we are at the very beginning or end
            if (location - context2Use) < 0:
                beginCon = 0
            else:
                beginCon = location - context2Use
                
            if (location + context2Use) > end:
                endCon = end
            else:
                endCon = location + context2Use + 1
                
            theContext = (list2FindIn[beginCon:endCon])
            concordanceLine = ' '.join(theContext)
            # print(str(location) + ": " + concordanceLine)
            concList.append(str(location) + ": " + concordanceLine)

theConc = []
makeConc(word2find,listOfTokens,int(context),theConc)
theConc[-5:]

Out[10]:

['220330: a reason why the faculty of recalling past ideas with truth and clearness should not have as much merit in it',
 '223214: confessing my errors and should esteem such a return to truth and reason to be more honourable than the most unerring',
 '223680: from the other this therefore being regarded as an undoubted truth that belief is nothing but a peculiar feeling different from',
 '224382: mind and he will evidently find this to be the truth secondly whatever may be the case with regard to this',
 '225925: by their different feeling i should have been nearer the truth end of project gutenberg s a treatise of human nature']

Output¶

Finally, we output to a text file.

In [11]:

nameOfResults = word2find.capitalize() + ".Concordance.txt"

with open(nameOfResults, "w") as fileToWrite:
    for line in theConc:
        fileToWrite.write(line + "\n")
    
print("Done")

Done

Here we check that the file was created.

In [12]:

ls *.Concordance.txt

Truth.Concordance.txt

CC BY-SA From The Art of Literary Text Analysis by Stéfan Sinclair & Geoffrey Rockwell
Created September 30th, 2016 (Jupyter 4.2.1)

In [ ]: