Session 3: Ensembling and Clustering¶

Getting started¶

First, we will load the packages we need for these exercises.

In [1]:

# This line is needed to force matplotlib to display inline in the notebook
%matplotlib inline

from collections import Counter            # Collections has a bunch of neat data structures
import os                                  # File manipulation
import re                                  # Built in regular expression functionality

import nltk                                # Commonly used library for NLP processing
import numpy as np                         # Simple mathematics function and linear algebra
import matplotlib.pyplot as plt            # Charting functions
plt.rcParams['figure.figsize'] = [12, 8]   # Make plots larger by default
import pandas as pd                        # Data frames
import spacy                               # A newer approach to NLP processing than NLTK -- more ML-driven
import textacy                             # Additional functionality for SpaCy

Basic text functions in Python¶

Simple, manual example¶

You can use single or double quotes to make strings in python.

In [2]:

print('This is a string')

This is a string

In [3]:

print("This is also a string")

This is also a string

Triple quoting allows for multi-line strings.

In [4]:

print("""This is a multi-line
string since it has triple quotes""")

This is a multi-line
string since it has triple quotes

You can also manually code the newlines with \n to keep the string on one line in the script.

In [5]:

'This is also two lines\nsince it has a newline'

Out[5]:

'This is also two lines\nsince it has a newline'

In fact, if you don't put print around the strings, you will see clearly that they are the same structure.

In [6]:

"""This is a multi-line
string since it has triple quotes"""

Out[6]:

'This is a multi-line\nstring since it has triple quotes'

As we can see below, using triple quotes or \n give identical strings, even on Windows where a newline is usually \r\n in other applications. Thus, the main benefit of triple quoting is readability of the script.

In [7]:

"""This is a multi-line
string""" == 'This is a multi-line\nstring'

Out[7]:

True

If you want to include a quote in a double quoted string or a double quote in a single quoted string, you can do so.

In [8]:

'"This is a quote"'

Out[8]:

'"This is a quote"'

In [9]:

"This string's quote is not escaped"

Out[9]:

"This string's quote is not escaped"

If, however, you want to include a single quote in a single quoted string or a double quote in a double quoted string, you need to escape the characters with a \, e.g., as \' or \".

In [10]:

'This string\'s quote is escaped'

Out[10]:

"This string's quote is escaped"

In [11]:

"\"This is a quote\""

Out[11]:

'"This is a quote"'

Useful functions¶

First, it isn't a function per se, but using square brackets, [ ], to extract text is extremely useful. Do remember that python is zero indexed, so the first character is character 0!

In [12]:

# WSJ "About Us" description from: https://www.wsj.com/about-us
text = "The Wall Street Journal was founded in July 1889. Ever since, the Journal has led the way in chronicling the rise of industries in America and around the world. In no other period of human history has the planet witnessed changes so dramatic or swift. The Journal has covered the births and deaths of tens of thousands of companies; the creation of new industries such as autos, aerospace, oil and entertainment; two world wars and numerous other conflicts; profound advances in science and technology; revolutionary social movements; the rise of consumer economies in the U.S. and abroad; and the fitful march of globalization."

In [13]:

print(text[50:60])

Ever since

Convert anything to a string with str( ).

In [14]:

x = 72
x_string = str(x)
x_string

Out[14]:

'72'

Combining text with +.

In [15]:

'Hello' + ' ' + 'world'

Out[15]:

'Hello world'

Casing text with .lower(), .upper(), and .title().

In [16]:

print('soon TO be UPPERCASE'.upper())
print('SOON tO be lowercase'.lower())
print('soon to be titlecase'.title())

SOON TO BE UPPERCASE
soon to be lowercase
Soon To Be Titlecase

Checking if text contains something particular.

In [17]:

x = 'What is in this string?'

[x.startswith('What'), x.startswith('this')]
[x.endswith('string?'), x.endswith('string')]
['this' in x, 'ing' in x, 'zzz' in x]

Out[17]:

[True, True, False]

Finding where the content is.

In [18]:

x = 'What is in this string?'
[x.find('this'), x.find('ing'), x.find('zzz')]

Out[18]:

[11, 19, -1]

In [19]:

for y in ['this', 'ing', 'zzz']:
    try:
        print(x.index(y))
    except:
        print('Error!')

11
19
Error!

Counting the number of occurrences of a word or phrase.

In [20]:

text.count('Journal')

Out[20]:

Splitting strings

In [21]:

x = '1,2,3,4,5'.split(',')
print(x)

['1', '2', '3', '4', '5']

Joining strings together

In [22]:

print(' & '.join(x))

1 & 2 & 3 & 4 & 5

Replacing string content

In [23]:

x = 'I like mee goreng with mutton and mee goreng with chicken'
print(x.replace('mee', 'nasi'))
print(x.replace('mee', 'nasi', 1))

I like nasi goreng with mutton and nasi goreng with chicken
I like nasi goreng with mutton and mee goreng with chicken

Removing blank content

In [24]:

x = '   this is awkwardly padded        '
print([x.strip(), x.lstrip(), x.rstrip()])

['this is awkwardly padded', 'this is awkwardly padded        ', '   this is awkwardly padded']

Padding strings

In [25]:

gvkey = 1024
gvkey = str(gvkey).zfill(6)
print(gvkey)

Checking if strings are a certain type. (Note that some types are counterintuitive.)

In [26]:

output = '\t'.join(['input', 'alnum', 'alpha', 'decimal', 'digit', 'numeric', 'ascii'])
for x in ['ABC123', 'AAABBB', '12345', '12345²', '12345½', '123.1', '£12.0']:
  output += '\n' + '\t'.join(map(str,[x, x.isalnum(), x.isalpha(), x.isdecimal(), x.isdigit(), x.isnumeric(), x.isascii()]))
print(output)

input	alnum	alpha	decimal	digit	numeric	ascii
ABC123	True	False	False	False	False	True
AAABBB	True	True	False	False	False	True
12345	True	False	True	True	True	True
12345²	True	False	False	True	True	False
12345½	True	False	False	False	True	False
123.1	False	False	False	False	False	True
£12.0	False	False	False	False	False	False

Importing text data¶

There are many ways to open a text file in python:

Store a context manager to the file using open(), then read from the file, and then call .close() on the manager when you are done.
The same as point 1, but reading the file inside a try...finally block, where the .close() is in the finally part.
[Preferred method] Use a with...as block to open the file and ensure proper closing.

In [27]:

with open('../../Data/S4_WSJ_2013.09.09.txt', 'rt') as f:
    text = f.read()

In [28]:

text[0:100]

Out[28]:

'Document 1 of 119\n\nBusiness and Finance\n\nAuthor: Anonymous\n\nAbstract: A U.S. appeals court will hear'

If we want to work through the text line-by-line, we have two options. We can either split out the text after reading, using .split(), or we can read in the file using .readlines() instead of .read().

The only difference between these methods is that .readlines() will keep the newline characters. However, these can easily be removed by calling .strip() over each element, e.g., as [i.strip() for i in text].

In [29]:

text.split('\n')[0:5]

Out[29]:

['Document 1 of 119', '', 'Business and Finance', '', 'Author: Anonymous']

In [30]:

with open('../../Data/S4_WSJ_2013.09.09.txt', 'rt') as f:
    text2 = f.readlines()

In [31]:

text2[0:5]

Out[31]:

['Document 1 of 119\n',
 '\n',
 'Business and Finance\n',
 '\n',
 'Author: Anonymous\n']

A lot of work with text analytics typically goes into cleaning up the document. In the case of the above, we probably want 119 articles separated out such that each article is an element of a list. To do this, we can use some basic looping and conditional statements. Note 2 key insights for this:

Each article starts with the phrase Full text:. This is unlikely to be used in the article text itself, so it can serve as a delimiter for the start of an article.
Articles end with one of the following tags: Company / organization:, Credit:, or Copyright:
One article is missing, just saying Full text: Not available

In [32]:

articles = []
article = ''
reading = False
for line in text2:
    if reading:  # check for the end of an article
        if 'Company / organization: ' in line or 'Credit: ' in line or 'Copyright: ' in line:
            # Done reading the article: output it
            reading = False
            articles.append(article)
            article = ''
        else:
            article += line
    else:  # check for the start of an article
        if 'Full text: ' in line and 'Full text: Not available' not in line:
            # Start reading the article in
            article = line[11:]
            reading = True
        else:
            pass  # not part of an article, nothing to do.
        

In [33]:

len(articles)

Out[33]:

In [34]:

print(articles[0])

Hedge funds are cutting their standard fees of 2% of assets under management and 20% of profits amid pressure from investors.
---
A team of Ares and CPP Investment Board are in the final stages of talks to buy luxury retailer Neiman Marcus for around $6 billion.
---
Federal regulators plan to reduce the maximum size of mortgages eligible for backing by Fannie and Freddie.
---
Time Warner plans to move its U.S. retirees from company-administered health plans to private exchanges.
---
A U.S. appeals court will hear oral arguments today in a suit by Verizon challenging FCC "net-neutrality" rules.
---
Japan's GDP grew at an annualized pace of 3.8% in the second quarter, much faster than initially estimated.
---
China said its exports rose 7.2% in August from a year earlier, the latest in a series of positive economic reports.
---
White House officials are considering Treasury Undersecretary Lael Brainard for a seat on the Fed's board.
---
SolarCity scrapped a "Happy Meals" deal, a share-lending technique that has been the target of criticism.
---
The CFTC could vote by the end of the month on a rule aimed at curbing speculation in the commodity markets.
---
Reserve Primary Fund's former managers have reached a preliminary settlement of a lawsuit brought by investors.

Given the above list, we are now in good shape to do whichever analysis it is that we need.

Pattern matching: Regular expressions¶

A common task in NLP is extracting text that matches a particular pattern. For instance, matching certain titles, extracting addresses, identifying numeric content, or extracting contact information can all be done by matching to a pattern. Regular expressions are built in in python as part of the re package. There are many different options that can be implemented in regular expressions, however, and they are not terribly easy to read.

Overall, they are useful and good to know. However, expect a bit of trial and error as you work them out.

As an example, we can use a regular expression to extract all quotes from the WSJ articles.

The below is a simple approach. Breaking it down:

(?m): Allow output to span multiple lines (i.e., go across any \n if needed)
\": A literal double quote
.: Matches any character (this is a special symbol for regular expressions)
+: Find as many as you can, but at least 1
?: Don't be greedy -- find only as much as needed to satisfy the pattern
\": Another literal quote

Note: Taken together, the .+? essentially is just matching the smallest block of text it can between two double quotes.

In [35]:

re.findall('(?m)\".+?\"', articles[1])

Out[35]:

['"Anna Karenina"',
 '"All happy families are alike."',
 '"Why two more now, and in the same year? I have no idea,"',
 '"The Decameron,"',
 '"The Iliad"',
 '"The Odyssey"',
 '"I thought I could do better,"',
 '"The Decameron"',
 '"naked from the waist down."',
 '"naked from the waist up."',
 '"I would be happy to address this question if you allow me to go over Wayne\'s edition and find some mistakes that he can address,"',
 '"Each new translation profits from those that went before,"',
 '"I am sure that Wayne took a look at our version, especially since we tried to take a nonarchaic, non-British approach to Boccaccio\'s great and very clear vernacular Italian."',
 '"Anna Karenina,"',
 '"perfectionist,"',
 '"spats,"',
 '"an example of where she was off the mark."',
 '"brogues,"',
 '"smart shoes with perforations."',
 '"light peasant moccasins."',
 '"moccasin"',
 '"It\'s a loaded word, particularly in the U.S.,"',
 '"Most disagreements over words ignore the context, which is all important,"',
 '"porshni,"',
 '"is obsolete in Russian,"',
 '"primitive peasant shoes made from raw leather."',
 '"rather close to the first meaning of brogues in the Oxford English Dictionary: \'rough shoes of untanned hide.\' "',
 '"will be a long time before I get those."',
 '"we want to have a 21st-century translation with a critically up-to-date introduction and notes."',
 '"The Death of Ivan Ilyich & Confession."',
 '"He consciously chose to spend the last year of his life translating this book,"',
 '"The Tale of Genji,"',
 '"There\'s always room for another excellent translation,"']

Looking at the output, we note one issue: for some reason, the text has book names in double quotes, as well as certain single words. As such, it may be necessary to add some additional filtering.

In [36]:

[i for i in re.findall('(?m)\".+?\"', articles[1]) if len(i.split()) >= 5]

Out[36]:

['"All happy families are alike."',
 '"Why two more now, and in the same year? I have no idea,"',
 '"I thought I could do better,"',
 '"naked from the waist down."',
 '"naked from the waist up."',
 '"I would be happy to address this question if you allow me to go over Wayne\'s edition and find some mistakes that he can address,"',
 '"Each new translation profits from those that went before,"',
 '"I am sure that Wayne took a look at our version, especially since we tried to take a nonarchaic, non-British approach to Boccaccio\'s great and very clear vernacular Italian."',
 '"an example of where she was off the mark."',
 '"It\'s a loaded word, particularly in the U.S.,"',
 '"Most disagreements over words ignore the context, which is all important,"',
 '"primitive peasant shoes made from raw leather."',
 '"rather close to the first meaning of brogues in the Oxford English Dictionary: \'rough shoes of untanned hide.\' "',
 '"will be a long time before I get those."',
 '"we want to have a 21st-century translation with a critically up-to-date introduction and notes."',
 '"The Death of Ivan Ilyich & Confession."',
 '"He consciously chose to spend the last year of his life translating this book,"',
 '"There\'s always room for another excellent translation,"']

If we want to scale this across all of our text, we can simply apply the procedure to each string. Also, to save a bit of computation time we can pre-compile our regular expression.

In [37]:

re_quotes = re.compile('(?m)\".+?\"')
quotes = []
for article in articles:
    quotes = quotes + [i for i in re.findall('(?m)\".+?\"', article) if len(i.split()) >= 5]

In [38]:

len(quotes)

Out[38]:

Basic sentiment measures: Dictionaries¶

First, let's import a couple dictionaries. Here, we will import two of the dictionaries from the Loughran and McDondald 2011 JF paper.

In [39]:

with open('../../Data/S4_LM_Neg.csv', 'rt') as f:
    LM_neg = [x.strip().lower() for x in f.readlines()]
print(LM_neg[0:5])

['abandon', 'abandoned', 'abandoning', 'abandonment', 'abandonments']

In [40]:

with open('../../Data/S4_LM_Pos.csv', 'rt') as f:
    LM_pos = [x.strip().lower() for x in f.readlines()]
print(LM_pos[0:5])

['able', 'abundance', 'abundant', 'acclaimed', 'accomplish']

Next, we need to convert our data from being full documents, to being in a "Bag of Words" (BoW) structure. This essentially means we need to convert from documents to word counts. There are a couple approaches to this: 1) use regular expressions to extract words, or 2) use an NLP parser like NLTK or SpaCy to parse the documents.

For this example, we will use NLTK. We will see SpaCy in Session 5.

In [41]:

# If you get an error that you are missing 'punkt', run: nltk.download('punkt')
article_tokens = [nltk.tokenize.word_tokenize(article) for article in articles]

In [42]:

print(article_tokens[0])

['Hedge', 'funds', 'are', 'cutting', 'their', 'standard', 'fees', 'of', '2', '%', 'of', 'assets', 'under', 'management', 'and', '20', '%', 'of', 'profits', 'amid', 'pressure', 'from', 'investors', '.', '--', '-', 'A', 'team', 'of', 'Ares', 'and', 'CPP', 'Investment', 'Board', 'are', 'in', 'the', 'final', 'stages', 'of', 'talks', 'to', 'buy', 'luxury', 'retailer', 'Neiman', 'Marcus', 'for', 'around', '$', '6', 'billion', '.', '--', '-', 'Federal', 'regulators', 'plan', 'to', 'reduce', 'the', 'maximum', 'size', 'of', 'mortgages', 'eligible', 'for', 'backing', 'by', 'Fannie', 'and', 'Freddie', '.', '--', '-', 'Time', 'Warner', 'plans', 'to', 'move', 'its', 'U.S.', 'retirees', 'from', 'company-administered', 'health', 'plans', 'to', 'private', 'exchanges', '.', '--', '-', 'A', 'U.S.', 'appeals', 'court', 'will', 'hear', 'oral', 'arguments', 'today', 'in', 'a', 'suit', 'by', 'Verizon', 'challenging', 'FCC', '``', 'net-neutrality', "''", 'rules', '.', '--', '-', 'Japan', "'s", 'GDP', 'grew', 'at', 'an', 'annualized', 'pace', 'of', '3.8', '%', 'in', 'the', 'second', 'quarter', ',', 'much', 'faster', 'than', 'initially', 'estimated', '.', '--', '-', 'China', 'said', 'its', 'exports', 'rose', '7.2', '%', 'in', 'August', 'from', 'a', 'year', 'earlier', ',', 'the', 'latest', 'in', 'a', 'series', 'of', 'positive', 'economic', 'reports', '.', '--', '-', 'White', 'House', 'officials', 'are', 'considering', 'Treasury', 'Undersecretary', 'Lael', 'Brainard', 'for', 'a', 'seat', 'on', 'the', 'Fed', "'s", 'board', '.', '--', '-', 'SolarCity', 'scrapped', 'a', '``', 'Happy', 'Meals', "''", 'deal', ',', 'a', 'share-lending', 'technique', 'that', 'has', 'been', 'the', 'target', 'of', 'criticism', '.', '--', '-', 'The', 'CFTC', 'could', 'vote', 'by', 'the', 'end', 'of', 'the', 'month', 'on', 'a', 'rule', 'aimed', 'at', 'curbing', 'speculation', 'in', 'the', 'commodity', 'markets', '.', '--', '-', 'Reserve', 'Primary', 'Fund', "'s", 'former', 'managers', 'have', 'reached', 'a', 'preliminary', 'settlement', 'of', 'a', 'lawsuit', 'brought', 'by', 'investors', '.']

As certain words tend to lack any inherent meaning (e.g., they are more for grammatical purposes rather than explication), we oftern remove such words. We call those words "stopwords." NLTK has builtin lists of these words for many languages.

We will slightly modify the default list though, as 'no' and 'not' can have useful meaning.

In [43]:

# If you get an error that you are missing 'stopwords', run: nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words("english"))
stop_words.remove('no')
stop_words.remove('not')
punct = {'.', ',', ';', '"', '\'', '-', '--', '---', '``', '\'\'', '%', '\'s'}
stop_words = stop_words | punct
print(stop_words)

{'at', 'down', 'about', 'myself', "'s", '.', 've', 'from', "shouldn't", 'each', 'do', 'both', 'just', "weren't", 'doing', 're', "hadn't", 'below', 'then', 'needn', 'his', 'if', 'hadn', 'by', "you've", 'this', 'these', 'haven', 'me', 'in', '%', "you'll", 'your', 'there', 'mustn', 'most', 'he', "should've", 'yourself', 'can', 'aren', 'once', 'd', "needn't", 'him', 'where', 'shan', 'but', 'because', 'above', "it's", 'to', 'some', 'ourselves', 'into', 'too', 'than', ';', 'few', ',', 'had', 'only', 'o', 'are', 'have', 'didn', 'any', 'more', 'all', 'after', 'now', 'while', 'on', 'am', 'doesn', 'itself', '"', 'a', 'those', "don't", 'hasn', 'is', 'themselves', 'does', 'theirs', 'wasn', 'it', 'was', "hasn't", 'why', 'i', 's', 'their', 'will', 'so', 'm', 'were', "mustn't", 'again', 'ours', 'yours', 'they', "didn't", '``', 'when', 'what', 'same', 'between', 'y', "isn't", 'isn', "'", 'her', 'up', 'before', 'ma', 'out', 'which', "you're", "couldn't", 'yourselves', 'has', 'won', 'through', 'be', 'them', "doesn't", 'its', "wasn't", '---', 'herself', 'ain', 'off', "haven't", "mightn't", 'weren', 'of', 'other', 'hers', 'did', 'll', 'you', 'being', 'until', 'should', 'own', 'over', 'our', 'couldn', '-', 'we', 'she', 'during', 'having', "wouldn't", "you'd", 'against', 'such', 'an', "aren't", 'whom', 'been', "she's", "won't", "shan't", 'for', 'shouldn', 't', 'very', 'and', 'mightn', 'here', 'the', 'how', 'my', 'wouldn', 'or', 'as', 'himself', 'under', 'with', "''", 'don', 'that', "that'll", 'who', 'nor', '--', 'further'}

We also will convert everything to lowercase, as casing is unlikely to matter in most sentiment contexts. However, note that this a choice that depends on the context of your problem, as does the choice of stopword lists.

In [44]:

filtered_tokens = []
for tokens in article_tokens:
    filtered_tokens.append([t.lower() for t in tokens if t.lower() not in stop_words])

In [45]:

print(filtered_tokens[0])

['hedge', 'funds', 'cutting', 'standard', 'fees', '2', 'assets', 'management', '20', 'profits', 'amid', 'pressure', 'investors', 'team', 'ares', 'cpp', 'investment', 'board', 'final', 'stages', 'talks', 'buy', 'luxury', 'retailer', 'neiman', 'marcus', 'around', '$', '6', 'billion', 'federal', 'regulators', 'plan', 'reduce', 'maximum', 'size', 'mortgages', 'eligible', 'backing', 'fannie', 'freddie', 'time', 'warner', 'plans', 'move', 'u.s.', 'retirees', 'company-administered', 'health', 'plans', 'private', 'exchanges', 'u.s.', 'appeals', 'court', 'hear', 'oral', 'arguments', 'today', 'suit', 'verizon', 'challenging', 'fcc', 'net-neutrality', 'rules', 'japan', 'gdp', 'grew', 'annualized', 'pace', '3.8', 'second', 'quarter', 'much', 'faster', 'initially', 'estimated', 'china', 'said', 'exports', 'rose', '7.2', 'august', 'year', 'earlier', 'latest', 'series', 'positive', 'economic', 'reports', 'white', 'house', 'officials', 'considering', 'treasury', 'undersecretary', 'lael', 'brainard', 'seat', 'fed', 'board', 'solarcity', 'scrapped', 'happy', 'meals', 'deal', 'share-lending', 'technique', 'target', 'criticism', 'cftc', 'could', 'vote', 'end', 'month', 'rule', 'aimed', 'curbing', 'speculation', 'commodity', 'markets', 'reserve', 'primary', 'fund', 'former', 'managers', 'reached', 'preliminary', 'settlement', 'lawsuit', 'brought', 'investors']

For the Negative Loughran McDonald dictionary, the measure is based purely on word counts. As such, to quickly calculate negative sentiment, we can first calculate word counts, and then cross reference those counts with the dictionary. For long documents, you will find that this method is significantly faster.

In [46]:

filtered_counts = [Counter(tokens) for tokens in filtered_tokens]

In [47]:

print(filtered_counts[0])

Counter({'investors': 2, 'board': 2, 'plans': 2, 'u.s.': 2, 'hedge': 1, 'funds': 1, 'cutting': 1, 'standard': 1, 'fees': 1, '2': 1, 'assets': 1, 'management': 1, '20': 1, 'profits': 1, 'amid': 1, 'pressure': 1, 'team': 1, 'ares': 1, 'cpp': 1, 'investment': 1, 'final': 1, 'stages': 1, 'talks': 1, 'buy': 1, 'luxury': 1, 'retailer': 1, 'neiman': 1, 'marcus': 1, 'around': 1, '$': 1, '6': 1, 'billion': 1, 'federal': 1, 'regulators': 1, 'plan': 1, 'reduce': 1, 'maximum': 1, 'size': 1, 'mortgages': 1, 'eligible': 1, 'backing': 1, 'fannie': 1, 'freddie': 1, 'time': 1, 'warner': 1, 'move': 1, 'retirees': 1, 'company-administered': 1, 'health': 1, 'private': 1, 'exchanges': 1, 'appeals': 1, 'court': 1, 'hear': 1, 'oral': 1, 'arguments': 1, 'today': 1, 'suit': 1, 'verizon': 1, 'challenging': 1, 'fcc': 1, 'net-neutrality': 1, 'rules': 1, 'japan': 1, 'gdp': 1, 'grew': 1, 'annualized': 1, 'pace': 1, '3.8': 1, 'second': 1, 'quarter': 1, 'much': 1, 'faster': 1, 'initially': 1, 'estimated': 1, 'china': 1, 'said': 1, 'exports': 1, 'rose': 1, '7.2': 1, 'august': 1, 'year': 1, 'earlier': 1, 'latest': 1, 'series': 1, 'positive': 1, 'economic': 1, 'reports': 1, 'white': 1, 'house': 1, 'officials': 1, 'considering': 1, 'treasury': 1, 'undersecretary': 1, 'lael': 1, 'brainard': 1, 'seat': 1, 'fed': 1, 'solarcity': 1, 'scrapped': 1, 'happy': 1, 'meals': 1, 'deal': 1, 'share-lending': 1, 'technique': 1, 'target': 1, 'criticism': 1, 'cftc': 1, 'could': 1, 'vote': 1, 'end': 1, 'month': 1, 'rule': 1, 'aimed': 1, 'curbing': 1, 'speculation': 1, 'commodity': 1, 'markets': 1, 'reserve': 1, 'primary': 1, 'fund': 1, 'former': 1, 'managers': 1, 'reached': 1, 'preliminary': 1, 'settlement': 1, 'lawsuit': 1, 'brought': 1})

Then we can cross reference the Counters with the dictionary to calculate the amount of negative words.

In [48]:

neg = []
for counts in filtered_counts:
    temp = 0
    for w in LM_neg: 
        temp += counts[w]
    neg.append(temp)

In [49]:

words = [sum(counts.values()) for counts in filtered_counts]

In [50]:

print(neg)

[3, 14, 137, 15, 5, 21, 35, 3, 7, 25, 53, 4, 17, 9, 9, 22, 29, 4, 16, 5, 2, 14, 41, 13, 9, 26, 12, 18, 9, 24, 21, 16, 16, 3, 1, 21, 3, 7, 52, 0, 33, 3, 55, 7, 12, 14, 20, 3, 12, 11, 4, 9, 16, 4, 5, 0, 11, 6, 2, 6, 2, 18, 12, 17, 7, 7, 20, 11, 2, 5, 4, 1, 32, 1, 1, 11, 20, 1, 10, 6, 5, 5, 4, 12, 6, 24, 7, 14, 31, 21, 7, 21, 11, 2, 9, 2, 11, 1, 20, 3, 10, 22, 3, 6, 4, 6, 3, 16, 4, 25, 1, 5, 2, 3, 6, 16, 3, 11]

For the Positive Loughran McDonald dictionary, the measures is based on word counts, but requires that there is no negating word before it. As such, we should keep the word order and check word-by-word.

In [51]:

pos = []
for tokens in filtered_tokens:
    prior_token = ''
    temp = 0
    for token in tokens:
        if token in LM_pos and prior_token != ['no', 'not']:
            temp += 1
        prior_token = token
    pos.append(temp)

In [52]:

print(pos)

[2, 8, 21, 10, 2, 9, 8, 0, 3, 4, 5, 3, 6, 2, 1, 8, 1, 9, 9, 9, 3, 9, 12, 0, 1, 10, 3, 8, 7, 14, 7, 4, 4, 1, 2, 1, 2, 5, 6, 1, 18, 0, 8, 5, 3, 21, 7, 1, 1, 12, 0, 5, 0, 3, 4, 4, 3, 1, 4, 4, 6, 13, 10, 7, 5, 8, 9, 4, 2, 6, 6, 1, 2, 3, 5, 2, 1, 2, 8, 6, 9, 7, 5, 8, 9, 10, 8, 11, 8, 10, 0, 2, 3, 6, 2, 1, 15, 5, 24, 2, 7, 42, 8, 2, 4, 4, 7, 19, 6, 15, 3, 4, 3, 6, 9, 5, 3, 9]

A common measure of sentiment is calculated as follows:

$$ Sentiment = \frac{\#Positive - \#Negative}{\# Words} $$

In [53]:

df = pd.DataFrame(zip(words, pos, neg), columns=['words', 'pos', 'neg'])

In [54]:

df['sentiment'] = (df.pos - df.neg) / df.words 

In [55]:

df

Out[55]:

	words	pos	neg	sentiment
0	132	2	3	-0.007576
1	608	8	14	-0.009868
2	1841	21	137	-0.063009
3	736	10	15	-0.006793
4	141	2	5	-0.021277
...	...	...	...	...
113	239	6	3	0.012552
114	493	9	6	0.006085
115	414	5	16	-0.026570
116	366	3	3	0.000000
117	571	9	11	-0.003503

118 rows × 4 columns

Lastly, we can take a look at the most negative and most positive articles from the issue.

In [56]:

print('Sentiment: ' + str(np.min(df.sentiment)) +
      ', Article #' + str(np.argmin(df.sentiment)) + 
      ', Text: ' + articles[np.argmin(df.sentiment)])

Sentiment: -0.11173184357541899, Article #35, Text: Jason Riley's "Jobless Blacks Should Cheer Background Checks" (op-ed, Aug. 23) suggests that "Ban the Box" initiatives to limit usage of criminal background checks in hiring practices are misguided in that they are a disservice to jobless blacks whom they seek to serve.
What this piece is missing is the concurrent issue of ensuring that people aren't denied a job based on a criminal history which bears no consequence on their potential as workers. In the same report cited by the editorial, "Perceived Criminality, Criminal Background Checks, and the Racial Hiring Practices of Employers" from the Journal of Law and Economics, about half (42.1%) of surveyed firms stated that they would "probably not" hire an applicant with any criminal record. Even worse, 19.5% of firms stated that they would "definitely not" hire an applicant with any criminal record.
The "National Longitudinal Survey of Youth" states that 30.2% of youth are arrested for a crime by age 23. In a criminal justice system that incarcerates more people than the Russian gulag did, forced criminal background checks present a serious threat to the productivity of millions of Americans, who, but for minor and mostly drug-related arrests, are perfectly capable workers. Given the current mindset of employers, which irrationally puts these applications in the "probably not" and "definitely not" bin, the legitimacy and need for criminal background checks needs to be seriously examined.
What we need to do is take a closer look at how we use the word "criminal" and what it really means. Otherwise millions of qualified Americans will be disqualified before they even apply.
Ajay Nadig
Cherry Hill, N.J.
(See related letter: "Letters to the Editor: Of Course There Is a Stigma for Criminals" -- WSJ September 23, 2013)

In [57]:

print('Sentiment: ' + str(np.max(df.sentiment)) +
      ', Article #' + str(np.argmax(df.sentiment)) + 
      ', Text: ' + articles[np.argmax(df.sentiment)])

Sentiment: 0.03164556962025317, Article #17, Text: BEIJING -- China's economy showed fresh signs of resilience in August, with trade data pointing to a sustained strengthening in global demand for goods from the country.
Exports continued to gather steam, rising 7.2% in August from a year earlier, according to data released on Sunday by the General Administration of Customs. This was up from a 5.1% rise in July and a contraction of 3.1% in June. Imports rose 7% from a year earlier in August, down from 10.9% in July.
The overall picture was of a Chinese economy benefiting from progressive strengthening of demand in the U.S. and other important export markets. China is also continuing to stock up on raw materials for its industrial sector.
"China's back," said Stephen Green of Standard Chartered Bank. "It won't be a strong recovery but it's increasingly clear we've bottomed."
Meanwhile, data early Monday showed inflation in August remained subdued, with the consumer-price index edging down to 2.6% year-on-year, from 2.7% in July.
August's trade numbers are the latest in a series of positive data releases, after overseas sales and factory output in July showed signs of improvement.
There are questions surrounding the upswing's sustainability. Rising wages and a stronger currency dent the competitiveness of China's exports. Beijing's recent moves to slow lending growth -- after years of credit-fueled economic expansion -- could curtail investment and imports.
Still, two months of stronger data has increased optimism that the government will be able to hit its full-year target for gross domestic product growth of 7.5%.

Supervised classification: Books per Hassan et al. (2019 QJE)¶

First, I have provided the main loop I use to clean and bi-gram documents. It is fairly flexible and efficient, and quite accurate in terms of its parsing.

In [58]:

def grammer(doc, n, processed_patterns, word_blacklist, gram_blacklist, lower=True, stopword=True):
    if not stopword:
        grams = textacy.extract.ngrams(doc, n=n, filter_stops=False, filter_nums=True)
    else:
        grams = textacy.extract.ngrams(doc, n=n, filter_stops=True, filter_nums=True)
    ngrams = Counter()
    for gram in grams:
        pos = '|'.join([word.tag_ for word in gram])
        if not lower:
            text = '|'.join([word.text for word in gram])
        else:
            text = '|'.join([word.text for word in gram]).lower()
        if pos not in processed_patterns:
            if not np.any([word.text in word_blacklist for word in gram]):
                if text not in gram_blacklist:
                    ngrams[text] += 1
    return ngrams

Next, we need to define the blacklists. I have provided the default blacklists from Hassan et al. (2019 QJE) below.

In [59]:

word_blacklist = "i i've you've we've i'm you're we're i'd you'd we'd that's".split(' ')
pattern_blacklist = ["PRP|PRP", "IN|IN", "RB|RB", "WRB|RB", "IN|RB", "RB|IN",
                     "IN|WRB", "WRB|IN", "DT|IN", "IN|DT", "RB|WRB", "RB|DT",
                     "DT|RB", "WRB|DT", "DT|WRB", "SYM|SYM"]
gram_blacklist = 'princeton|university'

In [67]:

# install the spacy language model with `python -m spacy download en_core_web_sm`
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
nlp.max_length = 10000000

documents = list(nlp.pipe(articles))

grams = [grammer(document, n=2, processed_patterns=pattern_blacklist,
                word_blacklist=word_blacklist,
                gram_blacklist=gram_blacklist) for document in documents]

# Intermediary measures
gram_counts = [sum(gram.values()) for gram in grams]
gram_sets = [set(gram) for gram in grams]

As the text data is somewhat difficult to share for the paper (as SMU lacks a license) we will use a hypothetical weighted dictionary. This is essentially a hypothetical plug-in for the $\mathbb{P}\backslash\mathbb{N}$ in their paper. Suppose we have a weight set of the following:

In [61]:

weights = {'earnings|foreign':0.5, 'currency|foreign':0.4, 'foreign|currencies':0.35, 'foreign|subsidiary':0.3,
           'foreign|currency':0.25, 'foreign|investment':0.2, 'foreign|holdings':0.2, 'foreign|borrowing':0.1,
           'overseas|sales':0.1, 'foreign|investors':0.05}
weight_set = set(weights)

Apply the hypothetical weighted dictionary to our data.

Subset to the overlap
Iterate over overlap and sum
- If doing a word radius around risk or the like, add it in here
Dot product weights by counts of overlapping data
Divide by gram counts

In [62]:

foreign_weight = []
for i in range(0, len(grams)):
    shared_keys = list(gram_sets[i] & weight_set)
    ns = len(shared_keys)
    v_weights = np.empty(ns)
    v_counts = np.empty(ns)
    c = 0
    for key in shared_keys:
        v_weights[c] = weights[key]
        v_counts[c] = grams[i][key]
        c += 1
    spec_weight = np.dot(v_weights, v_counts)
    measure = spec_weight / gram_counts[i] if gram_counts[i] > 0 else 0
    foreign_weight.append(measure)

Lastly, let's add the weight from our

In [63]:

df['foreign'] = foreign_weight

In [64]:

df

Out[64]:

	words	pos	neg	sentiment	foreign
0	132	2	3	-0.007576	0.0
1	608	8	14	-0.009868	0.0
2	1841	21	137	-0.063009	0.0
3	736	10	15	-0.006793	0.0
4	141	2	5	-0.021277	0.0
...	...	...	...	...	...
113	239	6	3	0.012552	0.0
114	493	9	6	0.006085	0.0
115	414	5	16	-0.026570	0.0
116	366	3	3	0.000000	0.0
117	571	9	11	-0.003503	0.0

118 rows × 5 columns

In [65]:

np.sum(df['foreign']>0)

Out[65]:

In [66]:

print(articles[np.argmax(df['foreign'])])

The divergent fortunes of global emerging markets can be told through Latin America's two biggest economies: Mexico and Brazil.
Think of it as a tortoise-and-hare story. For the past decade, Brazil has boomed by selling raw materials to China. Its expanding middle class gorged on a tide of cheap credit unleashed by central banks in advanced economies as they tried to energize their recoveries.
Brazil's economy averaged 3.6% annual growth over the past decade, peaking at a 7.5% pace in 2010. Its currency surged in value.
All the usual signs of excess were in evidence: Brazilian shoppers cramming stores in New York and Miami; news stories reporting $30 cheese pizzas and $35 martinis in Sao Paulo.
By comparison, Mexico has seen lackluster growth, partly because it has been tied to a struggling U.S. economy. It has also suffered from deep problems of its own: laws that banned foreign investment in energy, a dysfunctional tax code, a tattered education system and hidebound economy dominated by a handful of near-monopolies. And it suffered a surge in drug violence, deterring tourists and investors.
Mexico's economic growth averaged 2.6% per year over the past decade, while its currency has slipped slightly in value.
Now the shoe's on the other foot. Brazil is being punished by investors as the U.S. Federal Reserve signals a coming wind-down of its bond-buying program and as China's hunger fades for its raw materials. Brazil's currency and stocks have both sunk by more than 10% this year.
"Brazil has done very well over the past 10 years on the back of a commodities boom that's transferred massive wealth from China," said David Rees, emerging-markets economist at Capital Economics. "That's now coming to an end."
Brazil largely squandered the bonanza, investing little in roads and other areas that could foster its development. Its government has pursued a state-led economic model, rendering many of its businesses uncompetitive abroad. And businesses and households loaded up on debt, further constraining future growth. It has developed a significant gap that must be financed by foreign borrowing.
Meanwhile, Mexico used its lean years to overhaul its economy, revamping the country's labor laws, education system and its telecommunications system, financial and energy sectors -- including a plan to open up its oil and gas sector to private investment. If completed, economists expect the changes to lift the country's growth potential at a time when Mexico's biggest trading partner, the U.S., kicks into higher gear.
At the same time, Mexico has maintained a relatively small trade deficit that is easily financed by long-term foreign investment in companies and factories there. It isn't as dependent on fickle flows of short-term foreign cash and, as a result, has been less affected by the turmoil roiling Brazil and other emerging markets in recent weeks.
Mexico could still disappoint. Its economy shrank slightly in the second quarter, while Brazil has had a stronger few months than many analysts expected. Mexico's central bank on Friday cut interest rates by a quarter point to support the economy. But many economists expect Mexico to pick up speed in the months and years to come.
The story of Latin America's two largest economies helps illustrate why the fortunes of emerging markets are now diverging. For the past five years, developing economies such as Brazil, Russia, India, China and South Africa -- the so-called BRICS -- have been the engines of global growth as developed economies coped with the after-effects of the financial crisis.
To support their sluggish economies, central banks in the U.S., U.K., and Japan bought bonds to push their interest rates down to historic lows, sending a wave of cash into emerging markets in search of higher yields. With the Fed signaling that it will start to wind down its $85 billion-a-month bond-buying program this year, that tide is reversing and money is draining from developing nations.
The list of losers is already apparent. They're countries with large financing needs -- because they have large trade gaps or budget deficits or because they've borrowed heavily abroad. India, Turkey, Indonesia, South Africa and Brazil all have suffered big market selloffs in recent weeks -- a chief topic of discussion at last week's meeting of the Group of 20 nations in St. Petersburg, Russia.
Others, including Mexico, the Philippines, Poland and South Korea, have suffered smaller outflows of cash. In general, they tend to be countries with small trade gaps to finance and relatively little debt -- both public and private. They're also countries that export manufactured goods to a slowly recovering U.S. and Europe instead of raw materials to China.
Unlike the BRICS, they tended to grow more slowly over recent years and didn't build up large trade imbalances or big debts. They undertook difficult economic overhauls during the slow years. They didn't become dependent on China, and aren't as exposed to its slowdown. And they stand to benefit from trade links to the West.
Call it the revenge of the tortoises.