#!/usr/bin/env python
# coding: utf-8

# # Python NLTK: Texts and Frequencies

# **(C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/) <>**

# **Version:** 1.1, January 2024

# **Download:** This and various other Jupyter notebooks are available from my [GitHub repo](https://github.com/dcavar/python-tutorial-notebooks).

# **License:** [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/) ([CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/))

# **Prerequisites:**

# In[ ]:

get_ipython().system('pip install -U nltk')

# Depending on your NLTK installation, the tokenizer models and the English stopword list used below may also have to be downloaded once, for example with `nltk.download('punkt')` and `nltk.download('stopwords')`.

# This is a brief introduction to NLTK for simple frequency analysis of texts. I created this notebook for intro to corpus linguistics and natural language processing classes at Indiana University between 2017 and 2020.

# For this to work, we expect a subfolder *data* next to the notebook that contains the file *HOPG.txt*. This file contains the novel "A House of Pomegranates" by Oscar Wilde, taken as raw text from [Project Gutenberg](https://www.gutenberg.org/).

# ## Simple File Processing

# Reading a text into memory in Python is fairly simple. We open a file, read from it, and close the file again. The following code prints out the first 300 characters of the text in memory:

# In[1]:

ifile = open("data/HOPG.txt", mode='r', encoding='utf-8')
text = ifile.read()
ifile.close()
print(text[:300], "...")

# The optional parameters in the *open* function above define the **mode** of operations on the file and the **encoding** of its content. Setting the **mode** to **r** declares that *reading* is the only operation we will perform on the file in the following code. Setting the **encoding** to **utf-8** declares that the characters of the file content are encoded using the [Unicode](https://en.wikipedia.org/wiki/Unicode) encoding scheme [UTF-8](https://en.wikipedia.org/wiki/UTF-8).

# We can now import the [NLTK](https://www.nltk.org/) module in Python to work with frequency profiles and [n-grams](https://en.wikipedia.org/wiki/N-gram) over the tokens or words in the text.

# In[2]:

import nltk

# We can now lowercase the text, that is, normalize it so that all characters are lower case:

# In[3]:

text = text.lower()
print(text[:300], "...")

# To generate a frequency profile from the text, we can use the [NLTK](https://www.nltk.org/) class *FreqDist*:

# In[4]:

myFD = nltk.FreqDist(text)

# In[5]:

print(myFD)

# We can remove certain characters from the distribution, or alternatively replace these characters in the text variable. The following loop removes them from the frequency profile in myFD, which behaves like a dictionary data structure in Python:

# In[6]:

for x in ":,.-[];!'\"\t\n/ ?":
    del myFD[x]

# We can print out the frequency profile by looping through the resulting data structure:

# In[7]:

for x in myFD:
    print(x, myFD[x])

# To relativize the frequencies, we need the total number of characters, assuming that we removed all punctuation symbols above. The frequency distribution instance myFD provides a method to access the values associated with the individual characters. It returns a view of the values, that is, the frequencies associated with the characters:

# In[8]:

myFD.values()

# The *sum* function adds up the values in this collection:

# In[9]:

sum(myFD.values())
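# Before we move on, here is a small optional sketch that is not part of the original notebook: instead of deleting unwanted characters from the frequency distribution after counting, we can filter them out before counting. *FreqDist* accepts any iterable of samples, so a generator expression over the text works; the names *unwanted* and *filteredFD* are introduced here only for illustration.

# In[ ]:

unwanted = set(":,.-[];!'\"\t\n/ ?")
filteredFD = nltk.FreqDist(c for c in text if c not in unwanted)
print(filteredFD.most_common(10))   # the ten most frequent characters
print(sum(filteredFD.values()))     # total number of counted characters

# The *most_common* method used here returns the samples ordered by decreasing frequency; in the following cells we stick to explicit counting and sorting to make each step visible.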
# To avoid type problems when we compute the relative frequencies of the characters, we can convert the total number of characters into a *float*. This guarantees that the division in the following relativization step yields a *float* as well (in Python 3 the / operator returns a *float* for integer operands anyway, but the explicit conversion makes the intention clear):

# In[10]:

float(sum(myFD.values()))

# We store the resulting number of characters in the *total* variable:

# In[11]:

total = float(sum(myFD.values()))
print(total)

# We can now generate a probability distribution over characters. To convert the frequencies into relative frequencies, we use a list comprehension and divide every single frequency by *total*. The resulting relative frequencies are stored in the variable *relfrq*:

# In[12]:

relfrq = [ x/total for x in myFD.values() ]
print(relfrq)

# Let us compute the [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) of the character distribution using the relative frequencies. We will need the [logarithm](https://en.wikipedia.org/wiki/Logarithm) function from the Python *math* module for that:

# In[13]:

from math import log

# We can define the [entropy](https://en.wikipedia.org/wiki/Entropy_(information_theory)) function according to the equation $H = - \sum_{x} P(x) \log_2 P(x)$ as:

# In[14]:

def entropy(p):
    return -sum( [ x * log(x, 2) for x in p ] )

# We can test the function on two small example distributions over eight outcomes; the uniform distribution in the second cell yields the maximum possible entropy of 3 bits:

# In[15]:

entropy([1/8, 1/16, 1/4, 1/8, 1/16, 1/16, 1/4, 1/16])

# In[16]:

entropy([1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8, 1/8])

# We can now compute the entropy of the character distribution. The first line below gives the entropy of a uniform distribution over the same number of characters, that is the maximum possible entropy, for comparison:

# In[17]:

print(entropy([ 1/len(relfrq) ] * len(relfrq)))
print(entropy(relfrq))

# We might also be interested in the point-wise entropy of the characters, that is the contribution $-P(x) \log_2 P(x)$ of each individual character to the total entropy. We can compute that in the following way:

# In[18]:

entdist = [ -x * log(x, 2) for x in relfrq ]
print(entdist)

# We could now compute the variance over this point-wise entropy distribution, or other properties of the frequency distribution such as the median, mode, or standard deviation.

# ## From Characters to Words/Tokens

# The frequency profile above is for the characters in the text, not the words or tokens. In order to generate a frequency profile over words/tokens, we need a **tokenizer**. [NLTK](https://www.nltk.org/) provides basic tokenization functions. We will use the *word_tokenize* function to generate a list of tokens:

# In[19]:

tokens = nltk.word_tokenize(text)

# We can print out the first 20 tokens to verify that our data structure is a list of lower-case strings:

# In[20]:

tokens[:20]

# We can now generate a frequency profile from the token list, as we did with the characters above:

# In[21]:

myTokenFD = nltk.FreqDist(tokens)
print(myTokenFD)

# The frequency profile can be printed out in the same way as above by looping over the tokens and their frequencies. Note that we restrict the output to the first 10 and 20 items respectively, just to keep the notebook small. You can remove the selectors in your own experiments.

# In[22]:

print(tuple(myTokenFD.items())[:10])

# In[23]:

for token in list(myTokenFD.items())[:20]:
    print(token[0], token[1])

# Function words such as articles, prepositions, and pronouns are usually not very informative for content analysis. NLTK provides lists of such **stopwords** for various languages. We load the English list and remove its entries from the token frequency profile:

# In[24]:

stopwords = nltk.corpus.stopwords.words('english')
print(stopwords)

# In[25]:

for x in stopwords:
    del myTokenFD[x]
print(tuple(myTokenFD)[:20])
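# As an additional sketch that is not part of the original notebook, note that the *entropy* function defined above works for any probability distribution, not just for characters. We can, for instance, apply it to the token frequencies that remain after removing the stopwords; the names *tokentotal* and *tokenrelfrq* are introduced here only for illustration:

# In[ ]:

tokentotal = float(sum(myTokenFD.values()))
tokenrelfrq = [ f/tokentotal for f in myTokenFD.values() ]
print(entropy(tokenrelfrq))                                # entropy of the token distribution
print(entropy([ 1/len(tokenrelfrq) ] * len(tokenrelfrq)))  # maximum possible entropy for comparison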
# ## Counting N-grams

# [NLTK](https://www.nltk.org/) provides simple methods to generate [n-gram](https://en.wikipedia.org/wiki/N-gram) models or frequency profiles over [n-grams](https://en.wikipedia.org/wiki/N-gram) from any kind of list or sequence. We can, for example, generate a bigram model, that is an [n-gram](https://en.wikipedia.org/wiki/N-gram) model for n = 2, from the text tokens:

# In[26]:

myTokenBigrams = nltk.ngrams(tokens, 2)

# To store the bigrams in a list that we can process and analyze further, we convert the **Python generator object** myTokenBigrams to a list:

# In[27]:

bigrams = list(myTokenBigrams)

# Let us verify that the resulting data structure is indeed a list of string tuples. We print out the first 20 tuples from the bigram list:

# In[28]:

print(bigrams[:20])

# We can now check the number of bigrams and verify that the resulting list contains exactly *number of tokens - 1* bigrams:

# In[29]:

print(len(bigrams))
print(len(tokens))

# The frequency profile over these bigrams is generated in exactly the same way as over the token list in the examples above:

# In[30]:

myBigramFD = nltk.FreqDist(bigrams)
print(myBigramFD)

# If we want to know some general properties of the frequency distribution, we can print out information about it. The print output for this bigram frequency distribution tells us that we have 17,766 types and 38,126 tokens:

# In[31]:

print(myBigramFD)

# The bigrams and their corresponding frequencies can be printed using a *for* loop. We restrict the number of printed items to 20, just to keep this list reasonably short. If you would like to see the full frequency profile, remove the [:20] restrictor.

# In[32]:

for bigram in list(myBigramFD.items())[:20]:
    print(bigram[0], bigram[1])
print("...")

# Pretty-printing the bigrams as plain strings is possible as well:

# In[33]:

for ngram in list(myBigramFD.items())[:20]:
    print(" ".join(ngram[0]), ngram[1])
print("...")

# You can remove the [:20] restrictor above and print out the entire frequency profile. If you select and copy the profile to your clipboard, you can paste it into your favorite spreadsheet software and sort, analyze, and study the distribution in many interesting ways.

# Instead of running the frequency profile through a loop, we can also use a list comprehension in Python to generate a list of tuples, each containing an n-gram and its frequency:

# In[34]:

ngrams = [ (" ".join(ngram), myBigramFD[ngram]) for ngram in myBigramFD ]
print(ngrams[:100])

# We can generate an increasing frequency profile using the *sorted* function with a key that selects the second element of each tuple, that is the frequency:

# In[35]:

sortedngrams = sorted(ngrams, key=lambda x: x[1])
print(sortedngrams[:20])
print("...")
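# As a side note that is not part of the original notebook: *FreqDist* also provides the *most_common* method (inherited from Python's *Counter*), which returns the samples ordered by decreasing frequency, so a decreasing profile can be obtained without sorting by hand:

# In[ ]:

for ngram, freq in myBigramFD.most_common(20):
    print(" ".join(ngram), freq)
print("...")

# The explicit *sorted* calls used in this notebook remain useful when we want an increasing order or want to sort by something other than the raw frequency, for example by a relative frequency, as we do below.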
# We can speed up the *sorted* call above by using the *itemgetter()* function from the *operator* module. Let us import this function:

# In[36]:

from operator import itemgetter

# We can now define the sort key for *sorted* using *itemgetter(1)*, which selects the second element of each tuple. Remember that the enumeration of elements in lists or tuples in Python starts at 0.

# In[37]:

sortedngrams = sorted(ngrams, key=itemgetter(1))
print(sortedngrams[:20])
print("...")

# A decreasing frequency profile can be generated by passing the additional parameter *reverse=True* to *sorted*:

# In[38]:

sortedngrams = sorted(ngrams, key=itemgetter(1), reverse=True)
print(sortedngrams[:20])
print("...")

# We can pretty-print the decreasing frequency profile:

# In[39]:

sortedngrams = sorted(ngrams, key=itemgetter(1), reverse=True)
for t in sortedngrams[:20]:
    print(t[0], t[1])
print("...")

# Finally, we can compute relative frequencies for the bigrams and restrict the profile to content-word bigrams. The following cell skips bigrams that contain punctuation tokens or stopwords and prints the 20 most frequent remaining bigrams with their relative frequencies:

# In[40]:

total = float(sum(myBigramFD.values()))
exceptions = ["]", "[", "--", ",", ".", "'s", "?", "!", "'", "'ye"]
results = []
for x in myBigramFD:
    if x[0] in exceptions or x[1] in exceptions:
        continue
    if x[0] in stopwords or x[1] in stopwords:
        continue
    results.append( (x[0], x[1], myBigramFD[x]/total) )
sortedresults = sorted(results, key=itemgetter(2), reverse=True)
for x in sortedresults[:20]:
    print(x[0], x[1], x[2])

# To be continued...

# (C) 2017-2024 by [Damir Cavar](http://damir.cavar.me/) <>