This notebook includes Python code for downloading scanned text of borderlands newspapers and performing word frequency text analyses on the newspapers.
To start, follow directions in the Setup section, below. Once you know which data you would like to use, there are several options listed below for different text analyses.
The work is part of two projects:
_Interrogation of the Borderlands in Undergraduate Classrooms_, funded in part by the Mellon Foundation through the Collections as Data program. More information about the project is available at https://libguides.library.arizona.edu/newspapers-as-data.
_Data-Driven Digital Storytelling Hub_, funded by the Mellon Foundation through the Digital Borderlands program.
If you are not familiar with text data mining, take a look at this nice StoryMap that introduces the idea of text data mining and what we can do with it.
This notebook and additional text mining lessons are available at https://github.com/jcoliver/dig-coll-borderlands.
The first decision to make is whether you want to use a small, sample data set or the larger set of data. The latter option requires the files to be downloaded and can take a few minutes.
If you do not want to use the larger set of scanned text, you can use the sample data distributed with this notebook. Running the code block below will display the data that are available in the sample set (no extra steps are needed; these files come with this Jupyter notebook).
# Run to display table with newspaper information
import pandas
titles = pandas.read_csv('data/sample/sample-titles.csv')
datapath = 'data/sample/'
display(titles)
If you would like to use the entire suite of scanned borderlands newspapers, you will need to first download the files from the University of Arizona Data Repository. Executing the code block below will do this for you (if you just want to try things out with the smaller data set, do not run this block and jump ahead). Note the data are contained in an archive of around 1.5 GB that includes hundreds of thousands of files. Both the downloading and file extraction steps may take a little while (5 minutes? 10?), so now might be a good time to refill your beverage. When the download and extraction process is complete, a table showing the available data will be printed below the code block.
# import the libraries necessary for download & extraction
from urllib.request import urlretrieve
import zipfile
import os
import pandas
# Location of the file on the UA Data Repository
url = 'https://arizona.figshare.com/ndownloader/files/31104157'
# Download the file & write it to disk
zip_filename = 'fulldata.zip'
download = urlretrieve(url, zip_filename)
# Set the destination for the data files
destination = 'data/complete/'
# Make sure the destination directory exists
if not os.path.isdir(destination):
    os.makedirs(destination)
# Extract files to destination directory
with zipfile.ZipFile(zip_filename, 'r') as zipdata:
    zipdata.extractall(destination)
# No need for that zipfile, so we can remove it
os.remove(zip_filename)
# Finally, display the available titles for this full data set
full_titles = pandas.read_csv('data/complete/complete-titles.csv')
datapath = 'data/complete/'
display(full_titles.sort_values(by=['name']))
For the newspaper(s) of choice, there are a variety of analyses that can be performed with code below.
In all analyses, the code below has example values for newspapers, words, and dates. You can change these as necessary for your specific question.
Before you get started though, be sure to run the code block immediately below, which loads in all the libraries necessary for subsequent text data analyses.
# No need to change anything, just run this block of code to load necessary libraries
# for data tables
import pandas
# for file navigation
import os
# for pattern matching in filenames
import re
# for text data mining
import nltk
# for stopword corpora for a variety of languages
from nltk.corpus import stopwords
# for splitting data into individual words
from nltk.tokenize import RegexpTokenizer
# for automated text cleaning
import digcol as dc
# download the stopwords for several languages
nltk.download('stopwords')
# for drawing the plot
import plotly.express as px
The code below is designed to analyze one set of words for an individual newspaper title. As written, the code block will look at the frequency of influenza-related words ("flu", "influenza") in The Bisbee Daily Review during the years 1917 and 1918.
You can edit the values for `title`, `year_list`, `my_words`, and `language` to fit your analysis of interest. For the value of `title`, be sure to use the value in the "directory" column in the table above that corresponds to the newspaper of interest. For example, if you wanted to look at El Tucsonense, change this:
title = 'bisbee-daily-review'
to this:
title = 'el-tucsonense'
For `year_list`, list all the years of interest, each enclosed in single quotes (') and separated by commas. If you are only interested in one year, no comma is necessary.
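To see what the year filtering in the code block below does, here is a small self-contained sketch using the same regular-expression approach with made-up filenames (the filenames are hypothetical; issue files are named by date in YYYYMMDD form):

```python
import re

year_list = ['1917', '1918']

# Join the years into an alternation, e.g. '1917|1918', and match
# filenames that start with one of those years, optionally followed
# by the four MMDD digits
pattern = '(' + '|'.join(year_list) + ')([0-9]{4})*'
date_pattern = re.compile(pattern)

# Hypothetical filenames for three daily issues
filenames = ['19170312.txt', '19181105.txt', '19200101.txt']

# Keep only filenames whose dates fall in the years of interest
matched = list(filter(date_pattern.match, filenames))
print(matched)  # the 1920 issue is filtered out
```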
The words listed in `my_words` will effectively be "lumped together": that is, for this example, the plot will show the combined frequency of 'flu' and 'influenza'. If you are interested in plotting separate word sets, see the section Investigate frequencies of two words over time for a single newspaper, below.
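The "lumping" works by summing the relative frequencies of the individual words. A minimal sketch with a made-up word list (the word list here is invented for illustration, assuming pandas is installed):

```python
import pandas

# A toy list of cleaned words from a single (hypothetical) issue
issue_words = ['the', 'flu', 'city', 'influenza', 'flu', 'the', 'news', 'the']

# Relative frequency of every word in the issue
word_freqs = pandas.Series(issue_words).value_counts(normalize=True)

# Keep only the words of interest and sum their frequencies
my_words = ['flu', 'influenza']
lumped = word_freqs.filter(my_words).sum()
print(lumped)  # 2/8 + 1/8 = 0.375
```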
Finally, be sure the value of `language` corresponds to the language of the newspaper you are looking at (see the table at the top of the page for language information). Note the value should be all lowercase; i.e., use 'spanish', not 'Spanish'.
# Code for one set of words, one newspaper
# Include 'flu' in words to look for
title = 'bisbee-daily-review' # Make sure this matches "directory" in table above
year_list = ['1917', '1918'] # Each item is separated by a comma
my_words = ['flu', 'influenza'] # Each item is separated by a comma
language = 'english' # Can take values 'english', 'spanish' (all lowercase)
################################################################################
# No need to edit anything below here
################################################################################
# Creating the pattern of filenames based on years to match
years = '|'.join(year_list)
pattern = '(' + years + ')([0-9]{4})*'
date_pattern = re.compile(pattern)
# Location of files with text for a day's paper
volume_path = datapath + title + '/volumes/'
my_volumes = os.listdir(volume_path)
# Use date pattern from above to restrict to dates of interest
my_volumes = list(filter(date_pattern.match, my_volumes))
# Sort them for easier bookkeeping
my_volumes.sort()
# Create a table that will hold the relative frequency for each date
dates = []
for one_file in my_volumes:
    one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])
    dates.append(one_date)
# Add those dates to a data frame
results_table = pandas.DataFrame(dates, columns = ['Date'])
# Set all frequencies to zero
results_table['Frequency'] = 0.0
# Cycle over all issues and do relative frequency calculations
for issue in my_volumes:
    # Read and clean the text (remove stop words, punctuation, etc.)
    issue_text = dc.CleanText(filename = volume_path + issue, language = language)
    issue_text = issue_text.clean_list
    # Create a table with words
    word_table = pandas.Series(issue_text)
    # Calculate relative frequencies of all words in the issue
    word_freqs = word_table.value_counts(normalize = True)
    # Pull out only values that match words of interest
    my_freqs = word_freqs.filter(my_words)
    # Get the total frequency for words of interest
    total_my_freq = my_freqs.sum()
    # Format the date from the name of the file so we know where to put
    # the data in our table
    issue_date = str(issue[0:4]) + '-' + str(issue[4:6]) + '-' + str(issue[6:8])
    # Add the date & relative frequency to our data table
    results_table.loc[results_table['Date'] == issue_date, 'Frequency'] = total_my_freq
# Analyses are all done, plot the figure
my_figure = px.line(results_table, x = 'Date', y = 'Frequency')
my_figure.show()
The code below is designed to analyze one set of words for a pair of individual newspaper titles. As written, the code block will look at the frequency of words related to Germany in El Tucsonense and The Bisbee Daily Review during the years 1917-1919. Note that because these papers are in different languages, we need to provide an appropriate word set for each paper.
Update the corresponding values for titles, words, and languages to fit the words and titles of interest. Note that the longer the time span you are looking at, the longer the analysis may take. When the analysis has finished, the asterisk in the square brackets to the lower left will be replaced with a number (i.e., `In [*]` becomes something like `In [6]`) and the plot will be printed below the code block.
# Code for one set of words, two newspapers
# Change these to directories of the titles of interest, be sure to use lowercase
# and no spaces (it may be easiest to copy & paste from the "directory"
# column in the table above)
title_1 = 'el-tucsonense'
title_2 = 'bisbee-daily-review'
# The "human readable" names of the newspaper titles that will show up on the
# plot
title_1_name = 'El Tucsonense'
title_2_name = 'Bisbee Daily Review'
# List the years of interest, each enclosed in quotation marks (') and separated
# by commas
year_list = ['1917', '1918', '1919']
# What words are you interested in? You can add as many as you like,
# just be sure to enclose each in quotation marks (') and separate with a comma
# Also, keep them lower case, even if they are proper nouns
words_1 = ['alemania', 'alemana', 'alemán'] # germany, german (f.), german (m.)
words_2 = ['germany', 'german']
# Specify the language of the title you are looking at (all lowercase)
# Possible values: english, spanish, arabic, turkish, etc.
language_1 = 'spanish'
language_2 = 'english'
################################################################################
# No need to edit anything below here
################################################################################
# Creating the pattern of filenames based on years to match
years = '|'.join(year_list)
pattern = '(' + years + ')([0-9]{4})*'
date_pattern = re.compile(pattern)
# Create dictionary with information about each title, for easier
# iteration
title_data = {}
title_data[title_1] = {
    'directory': title_1,
    'name': title_1_name,
    'words': words_1,
    'language': language_1,
    'volume_path': datapath + title_1 + '/volumes/'
}
title_data[title_1]['volumes'] = os.listdir(title_data[title_1]['volume_path'])
title_data[title_2] = {
    'directory': title_2,
    'name': title_2_name,
    'words': words_2,
    'language': language_2,
    'volume_path': datapath + title_2 + '/volumes/'
}
title_data[title_2]['volumes'] = os.listdir(title_data[title_2]['volume_path'])
# Find out all the dates of papers we are looking at
dates = []
for one_file in (title_data[title_1]['volumes'] + title_data[title_2]['volumes']):
    one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])
    # Only add unique values to avoid duplication
    if one_date not in dates:
        dates.append(one_date)
dates.sort()
# Add those dates to a data frame
results_table = pandas.DataFrame(dates, columns = ['Date'])
# Set all frequencies to None
results_table[title_1_name] = None
results_table[title_2_name] = None
# Cycle over each title
for title in [title_1, title_2]:
    title_name = title_data[title]['name']          # string
    words = title_data[title]['words']              # list
    language = title_data[title]['language']        # string
    volume_path = title_data[title]['volume_path']  # string
    volumes = title_data[title]['volumes']          # list
    # Use date pattern from above to restrict to dates of interest
    volumes = list(filter(date_pattern.match, volumes))
    # Sort them for easier bookkeeping
    volumes.sort()
    # Cycle over all the issues of the current title
    for issue in volumes:
        # Clean the text (remove stop words, punctuation, etc.)
        issue_text = dc.CleanText(filename = volume_path + issue, language = language)
        issue_text = issue_text.clean_list
        # Create a table with all words from the issue
        word_table = pandas.Series(issue_text)
        # Calculate relative frequencies of all words in the issue
        word_freqs = word_table.value_counts(normalize = True)
        # Pull out only values that match words of interest
        words_freqs = word_freqs.filter(words)
        # Get the total frequency for words of interest
        total_word_freq = words_freqs.sum()
        # Format the date from the name of the file so we know where to put
        # the data in our table
        issue_date = str(issue[0:4]) + '-' + str(issue[4:6]) + '-' + str(issue[6:8])
        # Add the date & relative frequency to our data table
        results_table.loc[results_table['Date'] == issue_date, title_name] = total_word_freq
# Analyses are all done, but we need to transform data to "long" format
results_melt = results_table.melt(id_vars = 'Date', value_vars = [title_1_name, title_2_name])
# By default, two columns created are called "value" and "variable", we want
# to rename them
results_melt.rename(columns = {'value':'Frequency', 'variable':'Title'}, inplace = True)
# Before plotting, remove rows with missing values
results_clean = results_melt.dropna()
# plot the figure
my_figure = px.line(results_clean, x = 'Date' , y = 'Frequency' , color = 'Title')
my_figure.show()
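The wide-to-long reshaping that `melt` performs near the end of the block above can be illustrated on a tiny table (the column names and values here are made up for the example):

```python
import pandas

# A small "wide" table: one column of frequencies per newspaper title
wide = pandas.DataFrame({
    'Date': ['1917-01-05', '1917-01-12'],
    'Paper A': [0.001, 0.002],
    'Paper B': [0.003, None],
})

# Reshape to "long" format: one row per (date, title) combination
long = wide.melt(id_vars='Date', value_vars=['Paper A', 'Paper B'])

# melt names the new columns "variable" and "value"; rename them
long.rename(columns={'value': 'Frequency', 'variable': 'Title'}, inplace=True)

# Rows with missing frequencies can then be dropped before plotting
long_clean = long.dropna()
print(long_clean)
```

The long format is what lets the plotting call color each title's line separately via a single `color` column.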
The code below is designed to analyze two sets of words for an individual newspaper title. As written, the code block will look at the frequency of words related to Germany and those related to Japan in El Tucsonense during the years 1917-1919.
Update the corresponding values for title, words, and language to fit the words and title of interest. Note that the longer the time span you are looking at, the longer the analysis may take. When the analysis has finished, the asterisk in the square brackets to the lower left will be replaced with a number (i.e., `In [*]` becomes something like `In [6]`) and the plot will be printed below the code block.
# Change this to directory of the title of interest, be sure to use lowercase
# and no spaces (it may be easiest to copy & paste from the "directory"
# column in the table above)
title = 'el-tucsonense'
# List the years of interest, each enclosed in quotation marks (') and separated
# by commas
year_list = ['1917', '1918', '1919']
# What words are you interested in? You can add as many as you like,
# just be sure to enclose each in quotation marks (') and separate with a comma
# Also, keep them lower case, even if they are proper nouns
words_1 = ['alemania', 'alemana', 'alemán'] # germany, german (f.), german (m.)
words_1_name = 'Germany'
words_2 = ['japona', 'japón']
words_2_name = 'Japan'
# Specify the language of the title you are looking at (all lowercase)
# Possible values: english, spanish, arabic, turkish, etc.
language = 'spanish'
################################################################################
# No need to edit anything below here
################################################################################
# Creating the pattern of filenames based on years to match
years = '|'.join(year_list)
pattern = '(' + years + ')([0-9]{4})*'
date_pattern = re.compile(pattern)
# Location of files with text for a day's paper
volume_path = datapath + title + '/volumes/'
my_volumes = os.listdir(volume_path)
# Use date pattern from above to restrict to dates of interest
my_volumes = list(filter(date_pattern.match, my_volumes))
# Sort them for easier bookkeeping
my_volumes.sort()
# Create a table that will hold the relative frequency for each date
dates = []
for one_file in my_volumes:
    one_date = str(one_file[0:4]) + '-' + str(one_file[4:6]) + '-' + str(one_file[6:8])
    dates.append(one_date)
# Add those dates to a data frame
results_table = pandas.DataFrame(dates, columns = ['Date'])
# Set all frequencies to zero
results_table[words_1_name] = 0.0
results_table[words_2_name] = 0.0
# Cycle over all issues and do relative frequency calculations
for issue in my_volumes:
    # Read and clean the text (remove stop words, punctuation, etc.)
    issue_text = dc.CleanText(filename = volume_path + issue, language = language)
    issue_text = issue_text.clean_list
    # Create a table with words
    word_table = pandas.Series(issue_text)
    # Calculate relative frequencies of all words in the issue
    word_freqs = word_table.value_counts(normalize = True)
    # Pull out only values that match words of interest
    words_1_freqs = word_freqs.filter(words_1)
    words_2_freqs = word_freqs.filter(words_2)
    # Get the total frequency for each set of words of interest
    total_words_1 = words_1_freqs.sum()
    total_words_2 = words_2_freqs.sum()
    # Format the date from the name of the file so we know where to put
    # the data in our table
    issue_date = str(issue[0:4]) + '-' + str(issue[4:6]) + '-' + str(issue[6:8])
    # Add the date & relative frequencies to our data table
    results_table.loc[results_table['Date'] == issue_date, words_1_name] = total_words_1
    results_table.loc[results_table['Date'] == issue_date, words_2_name] = total_words_2
# Analyses are all done, but we need to transform data to "long" format
results_melt = results_table.melt(id_vars = 'Date', value_vars = [words_1_name, words_2_name])
# By default, two columns created are called "value" and "variable", we want
# to rename them
results_melt.rename(columns = {'value':'Frequency', 'variable':'Words'}, inplace = True)
# plot the figure
my_figure = px.line(results_melt, x = 'Date' , y = 'Frequency' , color = 'Words')
my_figure.show()
This lesson is licensed under a CC BY 4.0 license. Copyright 2020 Jeffrey C. Oliver.