This lesson will introduce the Jupyter Notebook interface. We will use the interface to run and write, yes, write, some Python code for text data analysis.
By the end of this lesson, learners should be able to:
And just an aside, all the code for this fun stuff is available on GitHub at https://github.com/jcoliver/dig-coll-borderlands.
If you are not familiar with text data mining, take a look at this nice StoryMap that introduces the idea of text data mining and what we can do with it.
Jupyter Notebooks are effectively made up of "cells". We can start by thinking of each cell as equivalent to a paragraph on a page. There is an order in which paragraphs and cells appear, and that order matters. In Jupyter Notebooks, the cells come in two flavors, and a single notebook (like the one we are working in now) will have both types of cells.
So let's try this out. Click your cursor in the box below on the word "Data" and run the cell. You can run the cell by holding down the Control (Ctrl) key and pressing Enter, or by clicking the button labeled "Run" at the top of the screen.
print('Collections as Data')
What we are going to do today is work through some text data mining questions using a set of text files.
We will start with a single text file.
The code block below sets up the name of the file we want to use. There are a couple of important pieces we need to provide: the title of the newspaper and the date (year, month, and day) of the issue.
title = 'border-vidette'
year = '1919'
month = '01'
day = '04'
Now that we know which date and which title we are interested in, we need to tell Python where the text files are located.
# We also need to indicate where the data are stored (i.e. which folder
# they are in)
datapath = 'data/sample/'
We can define another variable, `filename`, that contains all the information we need to read the file. That is, `filename` includes the folder location and the name of the file we are interested in.
# We stitch all those pieces of information together, along with the folder
# information about where data for an entire day's paper is located
filename = datapath + title + '/volumes/' + year + month + day + '.txt'
When we run that code block, nothing will visibly happen. We haven't asked Python to print anything, and there were no errors (yay!). But we might want to check our work to make sure the file name was specified correctly. So we can use our `print` command again:
print(filename)
Note that this time we did not enter a phrase enclosed in quotation marks, but instead provided the word `filename`. But it didn't print "filename". Rather, it printed the value stored in the variable called `filename`. If you think back to high school algebra, this is a similar sort of concept - we use a variable, in this case `filename`, to store information, much like we would use the variable "x" in a mathematical equation.
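To see the difference in miniature, here is a quick sketch (the variable name `x` is just an example):

```python
# A variable holds a value we can reuse later. Printing the bare name
# shows the stored value; printing the quoted name shows the literal text.
x = 'hello'
print(x)    # prints the value stored in x
print('x')  # prints the letter x itself
```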
At this point, we are ready to read the file and do some work with it. Before we do so, we will need to tell Python about some additional programs to use. By default, Python does not come with text data mining tools, so those are installed separately and we make them available for use with the `import` command. Run the code block below to load those packages.
# Load additional packages
# for data tables
import pandas
# for file navigation
import os
# for pattern matching in filenames
import re
# for text data mining
import nltk
# for stopword corpora for a variety of languages
from nltk.corpus import stopwords
# for splitting data into individual words
from nltk.tokenize import RegexpTokenizer
# for automated text cleaning
import digcol as dc
We also need to download the stopwords. There are a lot of recognized stopwords (e.g. "y", "a", "el", "la", "del", "que"), so we don't want to enter them by hand.
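As a sketch of what stopword removal does, here is a tiny example using a hand-made stop list (the real lists come from `nltk.corpus.stopwords` once downloaded):

```python
# A small hand-made stop list, standing in for the much longer lists
# that nltk provides for each language
stop_list = ['y', 'a', 'el', 'la', 'del', 'que']
tokens = ['el', 'pueblo', 'del', 'norte', 'que', 'vive']
# Keep only words that are NOT in the stop list
kept = [word for word in tokens if word not in stop_list]
print(kept)  # ['pueblo', 'norte', 'vive']
```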
# download the stopwords for several languages
nltk.download('stopwords')
Now we are ready to read in the data and start looking around. The code block below will read in all the text from the day's paper and clean it up. By "clean it up", the `CleanText` command removes things like punctuation and stopwords from the text:
newdata = dc.CleanText(filename)
Again, nothing visibly happened, so we can check our work by looking at the first 20 words. Run the code block below (remember: click the box and press Ctrl+Enter or Cmd+Enter).
print(newdata.clean_list[0:20])
We can use this list of words to calculate the relative frequency of each word. Relative frequencies here are relative to the length of the issue: we count the number of times a word occurs and divide that by the total number of words in the issue.
# Create a table with all the words
word_table = pandas.Series(newdata.clean_list)
# Calculate relative frequency of each word
word_freqs = word_table.value_counts(normalize = True)
To check our work, we look at the first 10 rows of the `word_freqs` table:
print(word_freqs.head(n = 10))
It should come as no big surprise that "Arizona" and "Nogales" are the most frequent words, given that the paper was printed in Nogales, Arizona.
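If you want to see `value_counts` in miniature, here is a sketch with a four-word example list (the words are made up for illustration):

```python
import pandas

# value_counts(normalize=True) divides each word's count by the total
# number of words, giving relative frequencies that sum to 1
words = pandas.Series(['nogales', 'arizona', 'nogales', 'mine'])
freqs = words.value_counts(normalize = True)
print(freqs['nogales'])  # 2 occurrences / 4 words = 0.5
```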
Now we can broaden our focus to look at trends over time. We are going to look at multiple years of papers to track how the frequency of influenza coverage changes over time. We will stick with the Border Vidette, but instead of looking at a single issue, we will look at all the issues from 1917 through 1919.
Here's where it gets fun. We could try to do this file-by-file, but that would be extremely tedious. So we are going to give Python a little bit of information and let the computer look at every single file. But first we need to tell Python which files to use.
# Create a pattern that will match the dates of interest. In this case,
# papers from 1917, 1918, and 1919
date_pattern = re.compile(r'(1917|1918|1919)([0-9]{4})*')
Wait, what the hell does that even mean? What we have in the code block above is something called a "regular expression". Regular expressions are a very powerful pattern-matching tool with a very terrible name. What we are saying above is that we want any files that:

- `[0-9]` matches any single digit between 0 and 9
- `{4}` means that there are four consecutive digits; `[0-9][0-9][0-9][0-9]` is equivalent to `[0-9]{4}`, we just don't have to write as much

So be sure you run the code block above (Ctrl+Enter or Cmd+Enter) before moving on. You will know that the code block has been run when you see a number show up in between the square brackets to the left of the code block (`In [ ]:`).
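As a quick illustration of the pattern at work, here is a sketch with made-up filenames:

```python
import re

# The same pattern used above: filenames starting with 1917, 1918, or
# 1919, followed by more digits
date_pattern = re.compile(r'(1917|1918|1919)([0-9]{4})*')
# Hypothetical filenames; only two fall in the 1917-1919 range
names = ['19170113.txt', '19200101.txt', '19181116.txt']
# match() only succeeds when the name starts with one of the three years
print(list(filter(date_pattern.match, names)))  # ['19170113.txt', '19181116.txt']
```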
We are now ready to start reading in the files. We need to start by listing all the Border Vidette issues, then filtering only those that are in the date range of interest.
# List all the Border Vidette files and store in bv_volumes variable
volume_path = datapath + title + '/volumes/'
bv_volumes = os.listdir(volume_path)
# Use date pattern from above to restrict to dates of interest
bv_volumes = list(filter(date_pattern.match, bv_volumes))
# Sort them for easier bookkeeping
bv_volumes.sort()
Here is another opportunity for a reality check: we ask Python to print out the first five files that it will read.
print(bv_volumes[0:5])
Now we know which files to look at, so we can instruct Python to do so. We are ultimately going to want to create a table that has two columns of data: the date of each issue and the relative frequency of flu-related words in that issue.
We will start by creating a table that will hold that information. We need to extract dates for each paper. While the filenames have that information, we need to convert it to an actual date, in the form of YYYY-MM-DD, so the date for 19170113.txt is 1917-01-13.
# Create a table that will hold the relative frequency for each date
dates = []
for one_file in bv_volumes:
one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])
dates.append(one_date)
Wait. No. What?

So there's a bit going on there up above.

- `dates = []` creates an empty list of dates. There is nothing in there to start with.
- `for one_file in bv_volumes:` says we are going to cycle through all of the files that are listed in the `bv_volumes` variable; each cycle, the value of `one_file` changes to the next value. If you look at the output we created above when running `print(bv_volumes[0:5])`, the first time through the cycle, `one_file` will have the value '19170113.txt'. The second time, `one_file` will have the value '19170120.txt'.
- `one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])` is creating a - no, we just need to run some code to explain this one.

Let us do a little test, seeing what this code does on an example of `one_file`. We start by pulling out the very first value in `bv_volumes`, as if it was the first cycle through the code above.
one_file = bv_volumes[0]
print(one_file)
Cool. Looking good. What we are doing with the `one_date` line is pulling out parts of that filename using indexing. An index is basically an address for each character in the filename. That is, we pull out the 0th through 3rd characters of the file name via `one_file[0:4]`:
# Look at first four characters
print(one_file[0:4])
# look at the characters at indexes 4 and 5
print(one_file[4:6])
Looking at the entirety of the filename, these are the indexes of each character:

| Index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Characters: | 1 | 9 | 1 | 7 | 0 | 1 | 1 | 3 | . | t | x | t |
If we run the piece of code that stitches all the pieces together,
str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])
we see the date formatted like we want (i.e. YYYY-MM-DD).
Looking back to the cycle:
for one_file in bv_volumes:
one_date = str(one_file[0:4]) + "-" + str(one_file[4:6]) + "-" + str(one_file[6:8])
dates.append(one_date)
the last line (`dates.append(one_date)`) will add that one date to our list of dates. Now that we have all those dates, we can set up our table (finally!).
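As a standalone sketch, `append` works like this:

```python
# append adds one item to the end of a list, growing it each time
dates = []
dates.append('1917-01-13')
dates.append('1917-01-20')
print(dates)  # ['1917-01-13', '1917-01-20']
```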
# Add those dates to a data frame
flu_table = pandas.DataFrame(dates, columns = ['Date'])
# Set all frequencies to zero
flu_table['Frequency'] = 0.0
And as a reality check, let's look at the first six rows.
flu_table.head(n = 6)
Now we can use Python to cycle over every file, calculate the relative frequency of flu and influenza, and store the result in the table.
flu_words = ['flu', 'influenza']
for issue in bv_volumes:
issue_text = dc.CleanText(volume_path + issue)
issue_text = issue_text.clean_list
# Create a table with words
word_table = pandas.Series(issue_text)
# Calculate relative frequencies of all words in the issue
word_freqs = word_table.value_counts(normalize = True)
# Pull out only values for flu or influenza
flu_freqs = word_freqs.filter(flu_words)
# Get the total frequency for flu and influenza
total_flu_freq = flu_freqs.sum()
# Format the date from the name of the file so we know where to put
# the data in our table
issue_date = str(issue[0:4]) + '-' + str(issue[4:6]) + '-' + str(issue[6:8])
# Add the date & relative frequency to our data table
flu_table.loc[flu_table['Date'] == issue_date, 'Frequency'] = total_flu_freq
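One piece of the loop above worth unpacking is the `filter` step. Here is a sketch with made-up frequencies showing that labels missing from the table (like 'influenza' below) are simply skipped:

```python
import pandas

# Series.filter keeps only entries whose label appears in the list;
# labels not present in the Series are ignored, not treated as errors
freqs = pandas.Series({'arizona': 0.01, 'flu': 0.002, 'mine': 0.005})
flu_freqs = freqs.filter(['flu', 'influenza'])
# Only 'flu' was found, so the sum is just its frequency
print(flu_freqs.sum())  # 0.002
```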
Look again at the first six rows of our table.
flu_table.head(n = 6)
Hmm...they are all still zeros. But maybe that isn't surprising, since the influenza pandemic did not really get going until late 1918, and we are just looking at the early 1917 issues here. When doing this sort of quality assurance, we can pick a paper that we know will have at least some occurrences of influenza. The November 16 issue from 1918 had at least some mention of influenza, so we can look at the corresponding row for that date via:
# Look at a 1918 date where we know there should be non-zero values
flu_table.loc[flu_table['Date'] == '1918-11-16', ]
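The `.loc` idiom used here (and in the loop above) can be tried on a tiny made-up table:

```python
import pandas

# A hypothetical two-row table, mirroring the structure of flu_table
demo = pandas.DataFrame({'Date': ['1918-11-09', '1918-11-16'],
                         'Frequency': [0.0, 0.0]})
# .loc with a boolean condition plus a column name assigns into just
# the rows where the condition is True
demo.loc[demo['Date'] == '1918-11-16', 'Frequency'] = 0.005
# The same condition without an assignment selects matching rows
print(demo.loc[demo['Date'] == '1918-11-16', ])
```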
Alright! We have our data ready to go. All we need to do now is graph it.
# One more package is needed for plotting
import plotly.express as px
flu_figure = px.line(flu_table, x = 'Date', y = 'Frequency')
flu_figure.show()
The graph should show a peak in relative frequency of flu/influenza over the winter of 1918-1919.
The code block below includes all the code necessary to make a graph like the one above, but you get to determine what it shows. You'll need to provide: the title of interest, the years of interest, the words to search for, and the language of the title.
Run the code block below to display a table with relevant information regarding available titles and dates.
# Run to display table with newspaper information
titles = pandas.read_csv('data/sample/sample-titles.csv')
display(titles)
Edit the first four variables in the code block below, then run the code block. Your graph should appear below the block once it has finished running. You'll know it finished when the asterisk in the square brackets is replaced by a number (e.g. `In [*]` -> `In [31]`).
# Change this to directory of the title of interest, be sure to use lowercase
# and no spaces (it may be easiest to copy & paste from the "directory"
# column in the table above)
title = 'el-tucsonense'
# List the years of interest, each enclosed in quotation marks (') and separated
# by commas
year_list = ['1917', '1918', '1919']
# What words are you interested in? You can add as many as you like,
# just be sure to enclose each in quotation marks (') and separate with a comma
# Also, keep them lower case, even if they are proper nouns
my_words = ['alemania', 'alemana', 'alemán'] # germany, german (f.), german (m.)
# Specify the language of the title you are looking at (all lowercase)
# Possible values: english, spanish, arabic, turkish, etc.
language = 'spanish'
################################################################################
# No need to edit anything below here
################################################################################
# Creating the pattern of filenames based on years to match
years = '|'
years = years.join(year_list)
pattern = '(' + years + ')([0-9]{4})*'
date_pattern = re.compile(pattern)
# Location of files with text for a day's paper
volume_path = datapath + title + '/volumes/'
my_volumes = os.listdir(volume_path)
# Use date pattern from above to restrict to dates of interest
my_volumes = list(filter(date_pattern.match, my_volumes))
# Sort them for easier bookkeeping
my_volumes.sort()
# Create a table that will hold the relative frequency for each date
dates = []
for one_file in my_volumes:
one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])
dates.append(one_date)
# Add those dates to a data frame
results_table = pandas.DataFrame(dates, columns = ['Date'])
# Set all frequencies to zero
results_table['Frequency'] = 0.0
# Cycle over all issues and do relative frequency calculations
for issue in my_volumes:
issue_text = dc.CleanText(filename = volume_path + issue, language = language)
issue_text = issue_text.clean_list
# Create a table with words
word_table = pandas.Series(issue_text)
# Calculate relative frequencies of all words in the issue
word_freqs = word_table.value_counts(normalize = True)
# Pull out only values that match words of interest
my_freqs = word_freqs.filter(my_words)
# Get the total frequency for words of interest
total_my_freq = my_freqs.sum()
# Format the date from the name of the file so we know where to put
# the data in our table
issue_date = str(issue[0:4]) + '-' + str(issue[4:6]) + '-' + str(issue[6:8])
# Add the date & relative frequency to our data table
results_table.loc[results_table['Date'] == issue_date, 'Frequency'] = total_my_freq
# Analyses are all done, plot the figure
my_figure = px.line(results_table, x = 'Date', y = 'Frequency')
my_figure.show()
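One piece worth unpacking from the block above is how the filename pattern gets assembled from `year_list`:

```python
# '|'.join glues the years together with the regex "or" symbol between
# them, producing the same kind of pattern we wrote by hand earlier
year_list = ['1917', '1918', '1919']
years = '|'.join(year_list)
pattern = '(' + years + ')([0-9]{4})*'
print(pattern)  # (1917|1918|1919)([0-9]{4})*
```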
If you want to plot more than one line on a plot, you can use the code below. The code below is set up for plotting two lines for one title of one language, but it could be extended to multiple lines, titles, and languages (but probably not today).
# Change this to directory of the title of interest, be sure to use lowercase
# and no spaces (it may be easiest to copy & paste from the "directory"
# column in the table above)
title = 'el-tucsonense'
# List the years of interest, each enclosed in quotation marks (') and separated
# by commas
year_list = ['1917', '1918', '1919']
# What words are you interested in? You can add as many as you like,
# just be sure to enclose each in quotation marks (') and separate with a comma
# Also, keep them lower case, even if they are proper nouns
words_1 = ['alemania', 'alemana', 'alemán'] # germany, german (f.), german (m.)
words_1_name = 'German'
words_2 = ['japona', 'japón']
words_2_name = 'Japan'
# Specify the language of the title you are looking at (all lowercase)
# Possible values: english, spanish, arabic, turkish, etc.
language = 'spanish'
################################################################################
# No need to edit anything below here
################################################################################
# Creating the pattern of filenames based on years to match
years = '|'
years = years.join(year_list)
pattern = '(' + years + ')([0-9]{4})*'
date_pattern = re.compile(pattern)
# Location of files with text for a day's paper
volume_path = datapath + title + '/volumes/'
my_volumes = os.listdir(volume_path)
# Use date pattern from above to restrict to dates of interest
my_volumes = list(filter(date_pattern.match, my_volumes))
# Sort them for easier bookkeeping
my_volumes.sort()
# Create a table that will hold the relative frequency for each date
dates = []
for one_file in my_volumes:
one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])
dates.append(one_date)
# Add those dates to a data frame
results_table = pandas.DataFrame(dates, columns = ['Date'])
# Set all frequencies to zero
results_table[words_1_name] = 0.0
results_table[words_2_name] = 0.0
# Cycle over all issues and do relative frequency calculations
for issue in my_volumes:
issue_text = dc.CleanText(filename = volume_path + issue, language = language)
issue_text = issue_text.clean_list
# Create a table with words
word_table = pandas.Series(issue_text)
# Calculate relative frequencies of all words in the issue
word_freqs = word_table.value_counts(normalize = True)
# Pull out only values that match words of interest
words_1_freqs = word_freqs.filter(words_1)
words_2_freqs = word_freqs.filter(words_2)
# Get the total frequency for words of interest
total_words_1 = words_1_freqs.sum()
total_words_2 = words_2_freqs.sum()
# Format the date from the name of the file so we know where to put
# the data in our table
issue_date = str(issue[0:4]) + '-' + str(issue[4:6]) + '-' + str(issue[6:8])
# Add the date & relative frequency to our data table
results_table.loc[results_table['Date'] == issue_date, words_1_name] = total_words_1
results_table.loc[results_table['Date'] == issue_date, words_2_name] = total_words_2
# Analyses are all done, but we need to transform data to "long" format
results_melt = results_table.melt(id_vars = 'Date', value_vars = [words_1_name, words_2_name])
# By default, two columns created are called "value" and "variable", we want
# to rename them
results_melt.rename(columns = {'value':'Frequency', 'variable':'Words'}, inplace = True)
# plot the figure
my_figure = px.line(results_melt, x = 'Date' , y = 'Frequency' , color = 'Words')
my_figure.show()
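The `melt` step can be tried in miniature on a one-row table with made-up values:

```python
import pandas

# "Wide" format: one column per word group
wide = pandas.DataFrame({'Date': ['1917-01-13'],
                         'German': [0.001], 'Japan': [0.002]})
# melt() stacks the two frequency columns into a single column, with a
# second column recording which word group each value belongs to
long = wide.melt(id_vars = 'Date', value_vars = ['German', 'Japan'])
# Rename the default "variable" and "value" columns to friendlier names
long.rename(columns = {'variable': 'Words', 'value': 'Frequency'}, inplace = True)
print(long)
```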
We can also make comparisons between different titles. Here we are going to compare the Bisbee Daily Review and the Border Vidette to see if there is a difference in the coverage of the mine strike of 1917.
We start as we did before to filter papers by dates of interest. We are going to focus only on those issues published June through October of 1917.
# Create a pattern that will match papers June - October, 1917
date_pattern = re.compile(r'1917(06|07|08|09|10)([0-9]{2})*')
More regular expressions! We are only looking at papers published in 1917, in the months June (06) through October (10). Similar to the regular expression we saw earlier, this allows us to match those files that:

- `[0-9]` matches any single digit between 0 and 9
- `{2}` means that there are two consecutive digits, so in this case we are using `[0-9]{2}` as a shortcut for `[0-9][0-9]`
We match only those files, for each title, that were published during the time period of interest.
# To keep the data from each title separate, we will use a convention
# where the variables are named such that the prefix of the variable
# name indicates the title with which the information is associated:
# bv = Border Vidette
# bdr = Bisbee Daily Review
# List all the Border Vidette files
bv_volumes = os.listdir('data/sample/border-vidette/volumes')
# Use date pattern from above to restrict to dates of interest
bv_volumes = list(filter(date_pattern.match, bv_volumes))
# Do a little reality check to make sure we only see files in
# desired date range.
print('Border Vidette: ' + str(len(bv_volumes)) + ' issues')
print(bv_volumes[0:5])
# List and filter files for Bisbee Daily Review (like above)
bdr_volumes = os.listdir('data/sample/bisbee-daily-review/volumes')
bdr_volumes = list(filter(date_pattern.match, bdr_volumes))
# Another reality check, reporting total number of issues and the
# first five filenames
print('Bisbee Daily Review: ' + str(len(bdr_volumes)) + ' issues')
print(bdr_volumes[0:5])
As we did above with the search for influenza terms, we will look for words related to the strike in Bisbee, and calculate the relative frequency for each issue of each title. We start by defining the terms to look for.
# A list of the words of interest
strike_words = ['strike', 'strikes', 'striker', 'strikers']
We start by looking at issues of the Border Vidette.
# For all Border Vidette volumes that matched our date criteria, calculate
# the relative frequency of 'strike' and related words
# This variable will hold relative frequency for each day's paper
bv_strike_freq = []
# The location where text for each issue is stored
bv_file_locations = datapath + 'border-vidette/volumes/'
# Loop over each issue, calculating relative frequency
for one_issue in bv_volumes:
# Read in the cleaned text (stopwords and punctuation removed)
issue_text = dc.CleanText(filename = bv_file_locations + one_issue)
issue_text = issue_text.clean_list
# Create a table with all the words in the issue
word_table = pandas.Series(issue_text)
# Calculate relative frequency of each word in the issue
word_freqs = word_table.value_counts(normalize = True)
# Pull out only values for strike related words
strike_freqs = word_freqs.filter(strike_words)
# Add those frequencies to our list of values for Border Vidette
bv_strike_freq.append(strike_freqs.sum())
# Do a reality check to look at first five values
print(bv_strike_freq[0:5])
Now repeat the process of calculating relative frequencies for Bisbee Daily Review.
# For all Bisbee Daily Review volumes that matched our date criteria,
# calculate the relative frequency of 'strike' and related words
# This variable will hold relative frequency for each day's paper
bdr_strike_freq = []
# The location where text for each issue is stored
bdr_file_locations = datapath + 'bisbee-daily-review/volumes/'
# Loop over each issue, calculating relative frequency
for one_issue in bdr_volumes:
# Read in the cleaned text (stopwords and punctuation removed)
issue_text = dc.CleanText(filename = bdr_file_locations + one_issue)
issue_text = issue_text.clean_list
# Create a table with all the words in the issue
word_table = pandas.Series(issue_text)
# Calculate relative frequency of each word in the issue
word_freqs = word_table.value_counts(normalize = True)
# Pull out only values for strike related words
strike_freqs = word_freqs.filter(strike_words)
# Add those frequencies to our list of values for Bisbee Daily Review
bdr_strike_freq.append(strike_freqs.sum())
# Do a reality check to look at first five values
print(bdr_strike_freq[0:5])
Now that we have the relative frequencies for each of the titles, we can calculate some summary statistics, including the average relative frequency in each issue for each title.
# Import the mean function from the statistics package
from statistics import mean
# Calculate average relative frequency of Border Vidette issues
bv_mean = mean(bv_strike_freq)
# Print the value, using format instead of str to avoid scientific notation
print(format(bv_mean, 'f') + ' Border Vidette')
# Calculate average relative frequency of Bisbee Daily Review issues and print
# to screen
bdr_mean = mean(bdr_strike_freq)
print(format(bdr_mean, 'f') + ' Bisbee Daily Review')
Comparing these means, we see that the relative frequency of strike-related words was higher in issues of the Bisbee Daily Review than in issues of the Border Vidette. In fact, it looks like the relative frequency of strike words in the Bisbee Daily Review was ten-fold higher than in the Border Vidette.
Finally, we need to run a statistical test to see if those means are significantly different. For our purposes, we can use a two-sample t-test.
# The scipy package has a stats function that allows us to run a t-test
from scipy import stats
# Run the test, assuming unequal variances
compare_strike = stats.ttest_ind(bv_strike_freq, bdr_strike_freq, equal_var = False)
# Extract values of interest, Student's t and the p-value
t_value = compare_strike[0]
p_value = compare_strike[1]
# Print test statistics
print('t = ' + format(t_value, '.3f')) # normal formatting
print('p = ' + format(p_value, '.3e')) # scientific notation
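If you want to see the t-test machinery on its own, here is a sketch with two small made-up samples whose means clearly differ:

```python
from scipy import stats

# Welch's two-sample t-test (unequal variances) on made-up frequencies;
# the means are far apart relative to the spread, so p will be small
a = [0.001, 0.002, 0.001, 0.003]
b = [0.010, 0.012, 0.011, 0.013]
result = stats.ttest_ind(a, b, equal_var = False)
# A p-value below 0.05 is the usual cutoff for "significantly different"
print(result[1] < 0.05)  # True
```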
In this case we can conclude the relative word frequency of strike-related words was significantly higher in the Bisbee Daily Review than in the Border Vidette.
Whew. We're done.
If you want more practice or want to do some analyses with a larger data set, head over to the lesson at https://mybinder.org/v2/gh/jcoliver/dig-coll-borderlands/main?filepath=Text-Mining-Template.ipynb.
If you want even more resources, check these out:
If you have any questions or comments on this lesson, look at the project's GitHub page and open a new issue if you don't find an answer there.
This lesson is licensed under a CC BY 4.0 license, © 2020 Jeffrey C. Oliver.