#!/usr/bin/env python
# coding: utf-8

# # Adding Context to Word Frequency Counts

# While the raw data from word frequency counts is compelling, it does little but describe quantitative features of the corpus. To determine whether the statistics point to a real trend in word usage, we must add context to the word frequencies. In this exercise we will produce a ratio of the occurrences of `privacy` to the number of words in the entire corpus. Then we will compare the occurrences of `privacy` to the number of individual transcripts within the corpus. These figures will allow us to identify trends that are worthy of further investigation.
#
# Finally, we will determine the number of words in the corpus as a whole and investigate the 50 most common words by creating a frequency plot. The last statistic we will generate is the type/token ratio, which is a measure of the variability of the words used in the corpus.

# ### Part 1: Determining a ratio

# To add context to our word frequency counts, we can work with the corpus in a number of different ways. One of the easiest is to compare the number of words in the entire corpus to the frequency of the word we are investigating.

# Let's begin by calling on all the functions we will need. Remember that the first few lines import pre-installed Python modules, and anything beginning with `def` is a custom function built specifically for these exercises. The docstring (the text shown in red in the notebook) describes the purpose of each function.

# In[1]:

# This is where the modules are imported
import nltk
from os import listdir
from os.path import splitext
from os.path import basename
from tabulate import tabulate

# These functions iterate through the directory and create a list of filenames

def list_textfiles(directory):
    "Return a list of filenames ending in '.txt'"
    textfiles = []
    for filename in listdir(directory):
        if filename.endswith(".txt"):
            textfiles.append(directory + "/" + filename)
    return textfiles

def remove_ext(filename):
    "Removes the file extension, such as .txt"
    name, extension = splitext(filename)
    return name

def remove_dir(filepath):
    "Removes the path from the file name"
    name = basename(filepath)
    return name

def get_filename(filepath):
    "Removes the path and file extension from the file name"
    filename = remove_ext(filepath)
    name = remove_dir(filename)
    return name

# These functions work on the content of the files

def read_file(filename):
    "Read the contents of FILENAME and return as a string."
    infile = open(filename)
    contents = infile.read()
    infile.close()
    return contents

def count_in_list(item_to_count, list_to_search):
    "Counts the number of a specified word within a list of words"
    number_of_hits = 0
    for item in list_to_search:
        if item == item_to_count:
            number_of_hits += 1
    return number_of_hits

# In the next piece of code we will cycle through our directory again: first assigning readable names to our files and storing them as a list in the variable `filenames`; then we will lowercase the text, drop tokens that are not purely alphabetic, split the words into a list of tokens, and assign the words in each file to a list in the variable `corpus`.
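# Before running it on the full corpus, here is a quick illustration (not part of the original exercise; the sample sentence is made up) of what that cleaning step does. Note that `isalpha()` drops any token with attached punctuation or digits rather than stripping the punctuation from it.

sample = "Privacy matters. In 2009, privacy was debated 12 times!"
sample_words = sample.split()
sample_clean = [w.lower() for w in sample_words if w.isalpha()]
print(sample_clean)
# ['privacy', 'in', 'privacy', 'was', 'debated'] -- 'matters.', '2009,', '12', and 'times!' are dropped
print(count_in_list("privacy", sample_clean))
# 2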
# In[2]:

filenames = []
for files in list_textfiles('../Counting Word Frequencies/data'):
    files = get_filename(files)
    filenames.append(files)

# In[3]:

corpus = []
for filename in list_textfiles('../Counting Word Frequencies/data'):
    text = read_file(filename)
    words = text.split()
    clean = [w.lower() for w in words if w.isalpha()]
    corpus.append(clean)

# Here we recreate our list from the last exercise, counting the instances of the word `privacy` in each file.

# In[4]:

for words, names in zip(corpus, filenames):
    print("Instances of the word \'privacy\' in", names, ":", count_in_list("privacy", words))

# Next we use the `len` function to count the total number of words in each file.

# In[5]:

for files, names in zip(corpus, filenames):
    print("There are", len(files), "words in", names)

# Now we can calculate the ratio of the word `privacy` to the total number of words in each file. To accomplish this we simply divide the two numbers.

# In[6]:

print("Ratio of instances of privacy to total number of words in the corpus:")
for words, names in zip(corpus, filenames):
    print('{:.6f}'.format(float(count_in_list("privacy", words)) / float(len(words))), ":", names)

# Now our descriptive statistics concerning word frequencies have added value. We can see that there has indeed been a steady increase in the frequency of the word `privacy` in our corpus. When we investigate the yearly usage, we can see that the frequency almost doubled between 2008 and 2009, and that there was another dramatic increase between 2012 and 2014. The same pattern is apparent in the difference between the 39th and the 40th sittings of Parliament.

# ------

# Let's package all of the data together so it can be displayed as a table or exported to a `CSV` file. First we will write our values to two lists: `raw` contains the raw frequencies, and `ratio` contains the ratios expressed as percentages. Then we will zip `filenames`, `raw`, and `ratio` together into a list of tuples called `table`.

# In[7]:

raw = []
for i in range(len(corpus)):
    raw.append(count_in_list("privacy", corpus[i]))

ratio = []
for i in range(len(corpus)):
    ratio.append('{:.3f}'.format(float(count_in_list("privacy", corpus[i])) / float(len(corpus[i])) * 100))

# We wrap the zip object in list() so it can be reused below: a bare zip iterator
# would be exhausted after tabulate() reads it, leaving nothing for the CSV writer.
table = list(zip(filenames, raw, ratio))

# Using the `tabulate` module, we will display our table.

# In[8]:

print(tabulate(table, headers = ["Filename", "Raw", "Ratio %"], floatfmt=".3f", numalign="left"))

# And finally, we will write the values to a `CSV` file called `privacyFreqTable.csv`.

# In[9]:

import csv
# In Python 3 the csv module expects a text-mode file opened with newline=''
# (opening it in 'wb' mode would raise an error).
with open('privacyFreqTable.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerows(table)

# -----------

# ### Part 2: Counting the number of transcripts

# Another way we can provide context is to process the corpus differently. Instead of splitting the data by word, we will split it into larger chunks, one for each individual transcript. Each transcript corresponds to a unique debate but starts with exactly the same header, making the files easy to split. The first words of every transcript are `OFFICIAL REPORT (HANSARD)`.
#
# Here we will pass the files to another variable, called `corpus_1`. Instead of removing capitalization and punctuation, all we will do is split the files at every occurrence of `OFFICIAL REPORT (HANSARD)`.
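# To see how that split behaves, here is a small sketch on a fabricated snippet (the snippet is invented for illustration; the real files are far larger). Note that `str.split` also returns whatever text precedes the first header as an extra leading chunk, so depending on how the real files begin, that leading chunk may need to be accounted for when counting transcripts.

snippet = ("Some front matter"
           " OFFICIAL REPORT (HANSARD) Text of the first debate..."
           " OFFICIAL REPORT (HANSARD) Text of the second debate...")
chunks = snippet.split(" OFFICIAL REPORT (HANSARD)")
print(len(chunks))
# 3: the front matter plus one chunk per debate
print(chunks[1])
# ' Text of the first debate...'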
# In[10]:

corpus_1 = []
for filename in list_textfiles('../Counting Word Frequencies/data'):
    text = read_file(filename)
    words = text.split(" OFFICIAL REPORT (HANSARD)")
    corpus_1.append(words)

# Now we can count the number of transcripts in each dataset. This is also an important error-checking step. While it is easy to trust the numerical output of the code when it runs successfully, we must always check that the code is actually doing exactly what we want it to. In this case, the numbers can be cross-referenced with the original XML data, where each transcript exists as its own file. A quick check of the directory shows that the numbers are correct.

# In[11]:

for files, names in zip(corpus_1, filenames):
    print("There are", len(files), "files in", names)

# Cross-referencing against the raw data confirms the counts: there are 97 files in 2006, 117 in 2007, and 93 in 2008, and the remaining datasets match as well.
#
# Now we can compare the number of occurrences of `privacy` with the number of debates occurring in each dataset.

# In[12]:

for names, files, words in zip(filenames, corpus_1, corpus):
    print("In", names, "there were", len(files), "debates. The word privacy was said", \
          count_in_list('privacy', words), "times.")

# These numbers confirm our earlier results. There is a clear indication that the usage of the term `privacy` is increasing, with major changes occurring between 2008 and 2009, as well as between 2012 and 2014. The same trend is clearly observable between the 39th and 40th sittings of Parliament.

# ------

# ### Part 3: Looking at the corpus as a whole

# While chunking the corpus into pieces can help us understand the distribution or dispersion of words throughout the corpus, it is also valuable to look at the corpus as a whole. Here we will create a third corpus variable, `corpus_3`, that only contains the files named `39`, `40`, and `41`. Note the new directory named `data2`. We only need these files; if we also included the yearly files we would count the same debates twice.

# In[13]:

corpus_3 = []
for filename in list_textfiles('../Counting Word Frequencies/data2'):
    text = read_file(filename)
    words = text.split()
    clean = [w.lower() for w in words if w.isalpha()]
    corpus_3.append(clean)

# Now we will combine the three lists into one large list and assign it to the variable `large`.

# In[14]:

large = list(sum(corpus_3, []))

# We can use the same calculations to determine the total number of occurrences of `privacy`, the total number of words in the corpus, and the ratio of `privacy` to the total number of words.

# In[15]:

print("There are", count_in_list('privacy', large), "occurrences of the word 'privacy' and a total of", \
      len(large), "words.")
print("The ratio of instances of privacy to total number of words in the corpus is:", \
      '{:.6f}'.format(float(count_in_list("privacy", large)) / float(len(large))), "or", \
      '{:.3f}'.format(float(count_in_list("privacy", large)) / float(len(large)) * 100), "%")

# Another word frequency statistic we can generate is the type/token ratio. The types are the unique words in the corpus, while the tokens are all of the words. The type/token ratio is a measure of the variability of the language used in the text: the higher the ratio, the more varied the vocabulary. First we'll determine the total number of types, using Python's `set` function.
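# As a toy illustration (using a made-up word list rather than the Hansard data), `set` collapses repeated tokens into types, and dividing the two counts gives the ratio.

toy_tokens = ["the", "house", "debated", "the", "privacy", "bill"]
toy_types = set(toy_tokens)
print(len(toy_tokens), len(toy_types))
# 6 tokens, 5 types
print(len(toy_types) / float(len(toy_tokens)))
# type/token ratio of about 0.833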
# In[16]:

print("There are", len(set(large)), "unique words in the Hansard corpus.")

# Now we can divide the number of types by the number of tokens to determine the ratio.

# In[17]:

print("The type/token ratio is:", '{:.6f}'.format(len(set(large)) / float(len(large))), "or", \
      '{:.3f}'.format(len(set(large)) / float(len(large)) * 100), "%")

# Finally, we will use the `NLTK` module to create a graph that shows the 50 most frequent words in the Hansard corpus. Although `privacy` will not appear in the graph, it's always interesting to see which words are most common and how they are distributed. `NLTK` will be introduced in more detail in the next section on concordance outputs; here all we need to know is that we pass our variable `large` to the `NLTK` function `Text` in order to work with the corpus data. From there we can determine the frequency distribution for the whole text.

# In[18]:

text = nltk.Text(large)
fd = nltk.FreqDist(text)

# Here we will call the frequency distribution's `plot` function to produce a graph. While it's a little hard to read, the most commonly used word in the Hansard corpus is `the`, with a frequency of just over 400,000 occurrences. The next most frequent word is `to`, with a frequency of only about 225,000 occurrences, almost half that of the most common word. The 10 most frequent words appear far more often than any of the other words in the corpus.

# In[19]:

get_ipython().run_line_magic('matplotlib', 'inline')
fd.plot(50, cumulative=False)

# Another feature of the `NLTK` frequency distribution function is the generation of a list of hapaxes. These are words that appear only once in the entire corpus. While not meaningful for this study, it's an interesting way to explore the data.

# In[20]:

fd.hapaxes()

# The next section will use `NLTK` to generate concordance outputs featuring the word `privacy`.
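# As a small optional postscript (not part of the original exercise), the same `fd` object built above can also be inspected numerically rather than plotted. `FreqDist` behaves like a `Counter`, so `most_common` returns (word, count) pairs, and the hapax list can be summarized by its length instead of printed in full.

for word, count in fd.most_common(10):
    print(word, count)

print("There are", len(fd.hapaxes()), "hapaxes in the corpus.")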