This notebook shows how you can handle texts you already have.
ls
# Importing a text file
with open("Hume Treatise.txt", "r") as f:
    Text1 = f.read()
print("This string has", len(Text1), "characters.")
This string has 1344061 characters.
print("This string has", "{:,}".format(len(Text1)), "characters") # This uses a mini format language
This string has 1,344,061 characters
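The same format mini-language also works inside f-strings (Python 3.6 and later). Here is a small sketch; `n_chars` is a hard-coded stand-in for `len(Text1)` so that the cell runs on its own.

```python
# Stand-in for len(Text1); the number is the count reported above.
n_chars = 1344061

# "," in the format spec inserts thousands separators.
with_commas = f"{n_chars:,}"
print("This string has", with_commas, "characters")

# The spec can also pad and align: right-align in a 12-character field.
padded = f"{n_chars:>12,}"
print(padded)
```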
Now we want to extract just the text of the work and strip out the Project Gutenberg header and footer.
startText = "*** START OF THIS PROJECT GUTENBERG EBOOK"
endText = "End of Project Gutenberg's A Treatise of Human Nature, by David Hume"
start = Text1.find(startText)
end = Text1.find(endText)
HumeString1 = Text1[start:end].strip()
print(HumeString1[:100], "\n+++++++++++++++++++++\n", HumeString1[-100:])
*** START OF THIS PROJECT GUTENBERG EBOOK A TREATISE OF HUMAN NATURE *** Produced by Col Choat +++++++++++++++++++++ e same object can only be different by their different feeling, I should have been nearer the truth.
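As the output shows, the slice keeps the start marker line itself. Also, `find` returns -1 when a marker is missing, and the slice then silently gives the wrong text. Below is one possible sketch that skips past the marker line and copes with missing markers. The helper name `strip_gutenberg` and the sample string are made up for illustration.

```python
def strip_gutenberg(raw, start_marker, end_marker):
    """Return the text between the marker lines, excluding the markers.

    Hypothetical helper: if a marker is missing, that end of the
    text is left untouched instead of being sliced at -1.
    """
    start = raw.find(start_marker)
    if start == -1:
        start = 0
    else:
        # Skip the rest of the marker line.
        newline = raw.find("\n", start)
        start = len(raw) if newline == -1 else newline + 1
    end = raw.find(end_marker, start)
    if end == -1:
        end = len(raw)
    return raw[start:end].strip()


# Tiny made-up sample standing in for the Gutenberg file.
sample = (
    "Header text\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK A TREATISE ***\n"
    "Body of the book.\n"
    "End of Project Gutenberg's A Treatise\n"
    "License text\n"
)
body = strip_gutenberg(
    sample,
    "*** START OF THIS PROJECT GUTENBERG EBOOK",
    "End of Project Gutenberg's",
)
print(body)
```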
Now we will use Python's built-in string methods to find words and show their context.
## This will count the instances of the word we want.
## Note how I handle the possibility of a capitalized version of the word
print(HumeString1.count("Truth"))
print(HumeString1.count("truth"))
print((HumeString1.count("truth") + HumeString1.count("Truth")))
2
70
72
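Counting each capitalization separately misses forms such as "TRUTH", and `count` also matches the letters inside longer words. Here is a sketch of two alternatives on a made-up sample sentence: lowercasing before counting, and a regular expression with word boundaries.

```python
import re

# Made-up sample sentence, not taken from the Treatise.
sample = "Truth is rare. The truth, the whole TRUTH, and some untruths."

# Lowercasing first catches every capitalization in one pass,
# but still counts the substring inside longer words ("untruths").
substring_count = sample.lower().count("truth")

# \b word boundaries restrict the count to the whole word.
whole_word_count = len(re.findall(r"\btruth\b", sample, flags=re.IGNORECASE))

print(substring_count, whole_word_count)
```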
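word2find = input("What word to find?")
context = 30  # characters of context to show on each side

# Search from index 0 so a match at the very start is not skipped,
# and stop before printing once find() returns -1.
start = HumeString1.find(word2find)
while start != -1:
    # max() keeps the left edge from going negative near the start of the text
    print(HumeString1[max(0, start - context):(start + context)].strip("\n"), "\n")
    start = HumeString1.find(word2find, start + 1)

print("------------------------------------------------------")
word2find2 = word2find.capitalize()
print(word2find2)

start = HumeString1.find(word2find2)
while start != -1:
    print(HumeString1[max(0, start - context):(start + context)].strip("\n"), "\n")
    start = HumeString1.find(word2find2, start + 1)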
What word to find?thened
e same difficulties, it is burthened with some additional on

le contempt is likewise strengthened by the two relations of

------------------------------------------------------
Thened
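The two loops can be folded into one case-insensitive search. This is only a sketch: the function name `kwic` (keyword in context) and the sample text are made up, and `re.finditer` is swapped in for the `find` loop used above.

```python
import re


def kwic(text, word, context=30):
    """Return a context snippet for every case-insensitive match of word."""
    snippets = []
    # re.escape treats the search word literally, even if it contains
    # characters that are special in regular expressions.
    for match in re.finditer(re.escape(word), text, flags=re.IGNORECASE):
        left = max(0, match.start() - context)
        right = match.end() + context
        # Replace newlines so each snippet prints on one line.
        snippets.append(text[left:right].replace("\n", " "))
    return snippets


# Made-up sample text standing in for HumeString1.
sample = "Truth is sought by many.\nFew find the truth they seek."
snippets = kwic(sample, "truth", context=10)
for snippet in snippets:
    print(snippet)
```

With the real text, the call would be something like `kwic(HumeString1, word2find)`.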
import re
word2check = input("What root to look for?")
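pattern2check = r"\w*" + word2check + r"\w*"  # raw strings keep Python from treating \w as a string escape
# \w matches a word character; \w* matches zero or more word characters
# Deleting the \w* on either side stops the match from extending past the root on that side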
theMatches = re.compile(pattern2check, re.IGNORECASE).findall(HumeString1)
print("Number of matches ", len(theMatches))
print("Variant forms: ", set(theMatches)) # set removes duplicates
What root to look for?the
Number of matches 21682
Variant forms: {'mathematician', 'burthened', 'strengthen', 'THERE', 'theory', 'Father', 'Mathematical', 'hypothesis', 'Otherwise', 'FARTHER', 'Theologians', 'otherwise', 'Either', 'others', 'strengthening', 'further', 'either', 'sympathetic', 'theirs', 'mathematics', 'farthest', 'breathe', 'anther', 'Others', 'then', 'mathematical', 'Neither', 'ether', 'hypotheses', 'there', 'strengthens', 'ANOTHER', 'farther', 'neither', 'authentic', 'father', 'atheist', 'Athenians', 'smoother', 'They', 'forefathers', 'other', 'Then', 'brothers', 'nevertheless', 'Other', 'Mathematics', 'Another', 'There', 'another', 'theft', 'their', 'hitherto', 'together', 'thereby', 'THEM', 'OTHERS', 'Whether', 'The', 'brother', 'thence', 'Themistocles', 'the', 'whether', 'them', 'THE', 'therefore', 'THESE', 'theologians', 'these', 'atheists', 'theatre', 'tothe', 'weather', 'mathematicians', 'atheism', 'These', 'THEY', 'themselves', 'hypothetical', 'THEREFORE', 'THEIR', 'WHETHER', 'Their', 'grandfather', 'Rather', 'northern', 'Mathematician', 'rather', 'altogether', 'they', 'OTHER', 'gathers', 'mothers', 'mother', 'strengthened'}
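The set tells us which forms occur but not how often, and it lists "The", "THE" and "the" separately. A possible refinement, sketched on a made-up sample sentence, is to lowercase each match and tally the forms with `collections.Counter`. The name `root` stands in for the typed `word2check`.

```python
import re
from collections import Counter

# Made-up sample sentence, not taken from the Treatise.
sample = "The other theory: then they gather. THE theory holds, the others say."
root = "the"  # stand-in for the word2check input

# re.escape guards against regex metacharacters in the typed root.
pattern = re.compile(r"\w*" + re.escape(root) + r"\w*", re.IGNORECASE)

# Lowercase each match so "The", "THE" and "the" are tallied together.
variant_counts = Counter(m.lower() for m in pattern.findall(sample))

for form, count in variant_counts.most_common():
    print(form, count)
```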
Now we will show how to use NLTK to do more.
First we have to import the NLTK library.
import nltk
Now we have to tokenize the text. We also lowercase it and drop tokens that do not begin with a letter, such as numbers and punctuation.
Hume1TokensLower = nltk.word_tokenize(HumeString1.lower()) # This lowercases all words and then tokenizes
Hume1TokensLower2 = [word for word in Hume1TokensLower if word[0].isalpha()] # This eliminates non words
print(Hume1TokensLower2[:10])
['start', 'of', 'this', 'project', 'gutenberg', 'ebook', 'a', 'treatise', 'of', 'human']
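The filter tests only the first character of each token, so tokens with punctuation later on still get through. Here is a short sketch comparing it with the stricter `str.isalpha()` test on the whole token. The token list is hand-written for illustration and is not actual `nltk.word_tokenize` output.

```python
# Hand-written tokens for illustration only.
mixed_tokens = ["reason", "1739", "e.g.", "n't", "2nd", "self-love"]

# The test used in this notebook: keep tokens that start with a letter.
first_char_filter = [word for word in mixed_tokens if word[0].isalpha()]

# A stricter alternative: keep tokens made up of letters only.
all_alpha_filter = [word for word in mixed_tokens if word.isalpha()]

print(first_char_filter)
print(all_alpha_filter)
```

Which test is better depends on the text: the stricter one also throws away hyphenated words such as "self-love".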
Now we get a distribution of words.
Hume1TokensLower2Dist = nltk.FreqDist(Hume1TokensLower2)
Hume1TokensLower2Dist.tabulate(10)
  the    of   and    to    is     a  that    in    it    we
13675 10241  8637  6804  4786  4623  4524  4335  3552  2856
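`nltk.FreqDist` is built on Python's `collections.Counter`, so the same kind of counting can be tried without NLTK. This is a minimal sketch on a hand-written token list. It also works out a relative frequency, which is handy when comparing texts of different lengths.

```python
from collections import Counter

# Hand-written tokens standing in for Hume1TokensLower2.
tokens = ["the", "idea", "of", "the", "mind", "and", "the", "idea"]

dist = Counter(tokens)

# most_common(n) returns the n most frequent tokens with their counts.
top_two = dist.most_common(2)
print(top_two)

# Relative frequency: the share of all tokens taken up by one word.
share_the = dist["the"] / sum(dist.values())
print(share_the)
```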
We can plot the distribution of the top 25 words.
import matplotlib
%matplotlib inline
Hume1TokensLower2Dist.plot(25, title="Top Frequency Word Tokens in Hume's Treatise")
Now we get a concordance the NLTK way.
word2concord = input("What word to concord?")
Hume1Text = nltk.Text(Hume1TokensLower2)
Hume1Text.concordance(word2concord, lines=10)
What word to concord?give
Displaying 10 of 267 matches:
he most momentous we are not able to give any certain decision disputes are mu
so the only solid foundation we can give to this science itself must be laid
r ignorance and perceive that we can give no reason for our most general and m
ever appear in the contrary order to give a child an idea of scarlet or orange
by any accident the faculties which give rise to any impressions are obstruct
me ideas which perhaps in their turn give rise to other impressions and ideas
ure and principles of the human mind give a particular account of ideas before
mpossible it is sufficient if we can give any satisfactory account of them fro
f this mutual complaisance i can not give a more evident instance than in the
ly defect of our senses is that they give us disproportioned images of things
Hume1Text.similar(word2concord)
it relation pride pleasure matter objects idea passion belief love nature causes property reasoning power resemblance reason society ideas mind
Hume1Text.collocations(num=20, window_size=2)
human nature; continued existence; every thing; give rise; may observe; properly speaking; constant conjunction; difference betwixt; general rules; present impression; take place; external objects; every one; right line; shall find; first sight; double relation; like manner; great measure; infinite divisibility
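To get a feel for what a collocation is, we can count adjacent word pairs (bigrams) by hand. This sketch uses plain frequency counts on a made-up sentence. It is a simpler stand-in technique: NLTK's `collocations` does its own filtering and scoring, so it will not return the same list.

```python
from collections import Counter

# Made-up sentence standing in for the token list.
tokens = "human nature is the same human nature in every age".split()

# zip pairs each token with the one that follows it.
bigram_counts = Counter(zip(tokens, tokens[1:]))

most_frequent = bigram_counts.most_common(1)
print(most_frequent)
```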
Hume1Text.dispersion_plot(["true","false"])
stopwords = nltk.corpus.stopwords.words("english")
wordsToAppend = ["may","one","us","must","upon","every","without","though","therefore","first","two","would"]
stopwords2 = stopwords + wordsToAppend
print(stopwords2)
# Concatenating the two lists with + appends all the extra words at once
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn', 'haven', 'isn', 'ma', 'mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn', 'may', 'one', 'us', 'must', 'upon', 'every', 'without', 'though', 'therefore', 'first', 'two', 'would']
Hume1ContentWords = [word for word in Hume1TokensLower2 if word not in stopwords2]
Hume1ContentWordsDist = nltk.FreqDist(Hume1ContentWords)
Hume1ContentWordsDist.tabulate(10)
    idea   object  objects    ideas     mind relation  present passions   reason   nature
     822      692      664      633      599      464      449      432      431      415
Hume1ContentWordsDist.plot(25, title="Top Frequency Content Words in Hume's Treatise")
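Checking `word not in stopwords2` scans a list for every token. Turning the stopwords into a set makes each check faster, which can matter on a text of this size. A minimal sketch follows; the short stopword list and tokens are hand-written stand-ins for NLTK's list and for `Hume1TokensLower2`.

```python
from collections import Counter

# Hand-written stand-ins for NLTK's stopword list and our extra words.
base_stopwords = ["the", "of", "and", "is", "a"]
extra_stopwords = ["may", "one", "us"]

# A set gives fast membership tests and ignores duplicate entries.
stopword_set = set(base_stopwords) | set(extra_stopwords)

# Hand-written tokens standing in for Hume1TokensLower2.
tokens = ["the", "idea", "of", "one", "object", "is", "a", "idea"]
content_words = [word for word in tokens if word not in stopword_set]

content_dist = Counter(content_words)
print(content_words)
print(content_dist.most_common(1))
```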