#!/usr/bin/env python
# coding: utf-8

# # Exploring a Text with NLTK
# 
# This notebook shows how you can explore aspects of a text using the Natural Language Toolkit (NLTK).
# 
# Some of the things you can do include:
# 
# * [Tokenize a text](#Tokenization)
# * [Generate a concordance for a word](#Concording)
# * [Explore collocations (words that are located together)](#Collocations)
# * [Count words and frequencies](#Counting-Words-and-Frequencies)
# * [Find similar words and contexts](#Similar-Words)
# 
# For more on NLTK see the online version of the book [Natural Language Processing with Python](http://www.nltk.org/book/).

# ## Preparing for Exploration
# 
# Before we can analyze a text we need to load it in and tokenize it.

# ### Installing NLTK
# 
# Before you can use NLTK you need to make sure it is installed. The [Anaconda Navigator](https://docs.continuum.io/anaconda/navigator) installs NLTK by default, but you can always test whether it is installed by importing it with ```import nltk```. Try it. It will give you an error if you don't have it.

# In[1]:

import nltk

# ### (more on) Installing NLTK
# 
# If you don't have it there are different ways to install it.
# 
# * The NLTK 3.0 documentation has a page on [Installing NLTK](http://www.nltk.org/install.html).
# * You can have Anaconda install or update it for you. See [Using Anaconda Navigator](https://docs.continuum.io/anaconda/navigator-using#) and scroll down to the part about updating packages. Basically you click on the check to the left of the package, pull down to "Mark for upgrade", then click the Apply button below.
# 
# You may also need to download the tokenizer and stop-word data that NLTK uses, for example with ```nltk.download("punkt")``` and ```nltk.download("stopwords")```.

# ### Getting a Text
# 
# Now we will get a text to process with NLTK.
# First we see what text files we have.

# In[2]:

ls *.txt

# We are going to use the "Hume Enquiry.txt" from the Gutenberg Project. You can use whatever text you want. We print the first 50 characters to check that it loaded.

# In[3]:

theText2Use = "Hume Enquiry.txt"
with open(theText2Use, "r") as fileToRead:
    theString = fileToRead.read()

print("This string has", len(theString), "characters.")
print(theString[:50])

# ### Tokenization
# 
# Now we tokenize the text using NLTK's tokenizer, producing a list called ```listOfTokens```, and check the first 50 tokens. Note that the NLTK tokenizer doesn't eliminate punctuation and doesn't lower-case the words. You can tokenize using another method if you want. Then we create an NLTK text object from the tokens. Note how the text object behaves like a list of tokens.

# In[4]:

listOfTokens = nltk.word_tokenize(theString)
theText = nltk.Text(listOfTokens)
print(listOfTokens[:50])

# ## Concording
# 
# Now we get a concordance for a word with one line of code. Note that we can control the width of the concordance lines. Edit the word to explore.

# In[5]:

theText.concordance("the", width=100)

# Note that ```concordance``` is not case sensitive: it will give you a concordance of both capitalized and lower-case forms of the word.
# 
# If you want more lines then you need to add the ```lines``` parameter.

# In[6]:

theText.concordance("the", lines=30)

# One annoyance is that you can't easily save a concordance to a file, because the concordance of an NLTK text object is printed to the screen for exploration. You will need to cut and paste it into a word processor to save it.
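# Newer versions of NLTK (roughly 3.4 and later) also provide a ```concordance_list``` method that returns the concordance lines instead of printing them. If your version has it, a minimal sketch like the one below can write a concordance to a file; the word "truth" and the output filename are just examples.

# In[ ]:

# A sketch assuming Text.concordance_list is available (NLTK 3.4+):
# collect the concordance lines as objects and write their printable form to a file.
concordanceLines = theText.concordance_list("truth", width=100, lines=30)
with open("truth-concordance.txt", "w") as fileToWrite:
    for concLine in concordanceLines:
        fileToWrite.write(concLine.line + "\n")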
# ### Plot the Dispersion of Words
# 
# We can easily plot the dispersion of words through the text. Note how it is case sensitive.
# 
# The line ```%matplotlib inline``` makes sure that the plot is placed inline.

# In[7]:

get_ipython().run_line_magic('matplotlib', 'inline')
theText.dispersion_plot(["Truth","truth"])

# ### Counting Words and Frequencies
# 
# You can also count words. This is case sensitive if you use the text object.

# In[8]:

print(theText.count("Truth"), " ", theText.count("truth"))

# To make it case insensitive we are going to use a [list comprehension](http://python-3-patterns-idioms-test.readthedocs.io/en/latest/Comprehensions.html) to lower-case every token and get a new list of tokens. We also get rid of punctuation by keeping only tokens that start with an alphabetical character. Then we can count things in the new list.

# In[9]:

theLowerTokens = [token.lower() for token in listOfTokens if token[0].isalpha()]
print(theLowerTokens[:20])

# In[10]:

theLowerTokens.count("truth")

# With NLTK we can get word frequencies. These can be displayed as a table. We can then do other things with the frequency distribution object.

# In[11]:

theLowerFreqs = nltk.FreqDist(theLowerTokens)
theLowerFreqs.tabulate(15)

# In[12]:

theLowerFreqs["truth"]

# Rather than the raw count we can get the relative frequency, which is the count divided by the total number of tokens.

# In[13]:

theLowerFreqs.freq("the")

# ### Plot the Frequency of Words
# 
# We can also plot the high frequency words.

# In[14]:

get_ipython().run_line_magic('matplotlib', 'inline')
theLowerFreqs.plot(30)

# ### Plotting Content Words
# 
# What if we want to see just the high frequency content words? Here we get the NLTK English stop-word list.

# In[15]:

stopwords = nltk.corpus.stopwords.words("english")
print(stopwords[:20])

# We need to create a new list of tokens without the stopwords.

# In[16]:

theLowerContentWords = [token for token in theLowerTokens if token not in stopwords]
theLowerContentWords[:10]

# Now we can create a table of high frequency content words.

# In[17]:

theLowerContFreqs = nltk.FreqDist(theLowerContentWords)
theLowerContFreqs.tabulate(10)

# If you still see words you want to remove then you can filter them out too. Note that this next cell updates what is in the variables ```theLowerContentWords``` and ```theLowerContFreqs```. If you want to recover the removed words you need to rerun the cells in **Plotting Content Words** from the beginning.

# In[18]:

moreStopwords = ["may","one","must","us","never","every"]
theLowerContentWords = [token for token in theLowerContentWords if token not in moreStopwords]
theLowerContFreqs = nltk.FreqDist(theLowerContentWords)
theLowerContFreqs.tabulate(10)

# And now we plot the updated frequency distribution.

# In[19]:

theLowerContFreqs.plot(30)

# We might also want to check how these words are used by looking at their concordances.

# In[20]:

theText.concordance("power", width=80, lines=5)

# ## Collocations, Similar Words, and Contexts
# 
# ### Collocations
# 
# NLTK will also let you explore collocations, that is, sets of two or more words that appear together more often than you would expect by chance.

# In[21]:

theText.collocations(10)

# Note how we are getting a lot of bigrams with "Gutenberg". That's because the Project Gutenberg boilerplate is still part of the file, and NLTK looks for bigrams whose words appear together more often than apart. If you ask for more collocations you can see some that have to do with the text itself. (A sketch for stripping out the Gutenberg boilerplate follows the next cell.)

# In[22]:

theText.collocations(100)
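# If the Gutenberg boilerplate gets in the way, one option is to cut it out of the string before tokenizing. The sketch below assumes the file uses the usual "*** START OF ..." and "*** END OF ..." marker lines; these vary between Gutenberg editions, so check your file and adjust the markers. The variable names are just examples.

# In[ ]:

# A sketch: keep only the text between the Project Gutenberg start and end markers.
# The exact marker strings are an assumption; fall back to the full text if they are not found.
startMarker = "*** START OF THIS PROJECT GUTENBERG EBOOK"
endMarker = "*** END OF THIS PROJECT GUTENBERG EBOOK"
if startMarker in theString and endMarker in theString:
    bodyString = theString[theString.index(startMarker):theString.index(endMarker)]
    bodyString = bodyString.split("\n", 1)[1]  # drop the start marker line itself
else:
    bodyString = theString
bodyText = nltk.Text(nltk.word_tokenize(bodyString))
bodyText.collocations(10)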
# ### Similar Words
# 
# We can get words that are **similar** to target words. These are not synonyms but words being used in similar contexts. You can use this to expand on a word you are interested in.

# In[23]:

theText.similar("truth")

# You can use this to get concordances of sets of similar words.

# In[24]:

listOfWords2Conc = ["reason","fact","knowledge","ideas"]
for i in listOfWords2Conc:
    print(i.upper() + ": ")
    theText.concordance(i, width=80, lines=5)
    print("--------------------------------------------------\n")

# ### Common Contexts
# 
# NLTK can also show us the contexts that two or more words have in common.

# In[25]:

theText.common_contexts(["nature", "experience"],10)

# ## Finding Patterns
# 
# We can use regular expressions on tokens with the ```findall``` method of the Text object. Some guidelines:
# 
# * You are matching tokens, not the raw text. The ```<``` and ```>``` mark the boundaries of a token.
# * ```<.*>``` matches any token, as ```.``` means any character and ```*``` means 0 or more of the preceding. ```?``` would mean 0 or 1 of the preceding.
# * The parentheses tell ```findall``` which part of each match to show. You can use this, for example, to show the words that appear right before a word you are interested in (see the sketch after the examples below).
# 
# Here are some examples.

# In[26]:

theText.findall("(<.*>)")

# In[27]:

theText.findall("<.*><.*>")

# In[28]:

theText.findall("(<.*><.*>)")

# In[29]:

theText.findall("<.*>?")
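# As a final example, here is the kind of pattern mentioned above: it shows the word that appears right before each occurrence of a word you are interested in. The word "truth" is just an example; substitute your own.

# In[ ]:

# Show the token immediately before each occurrence of "truth";
# the parentheses mean only that preceding token is displayed.
theText.findall(r"(<.*>) <truth>")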
# ---
# [CC BY-SA](https://creativecommons.org/licenses/by-sa/4.0/) From [The Art of Literary Text Analysis](../ArtOfLiteraryTextAnalysis.ipynb) by [Stéfan Sinclair](http://stefansinclair.name) & [Geoffrey Rockwell](http://geoffreyrockwell.com). Edited and revised by [Melissa Mony](http://melissamony.com).
# 
# Created October 10th, 2016 (Jupyter 4.2.1)

# In[ ]: