by Wian Stipp
%%capture
# prevents printing the install messages
!pip install sec-edgar-downloader
!pip install html2text
from sec_edgar_downloader import Downloader
import textwrap
import html2text
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
from tqdm.notebook import tqdm
We use a tool called SEC Edgar Downloader to scrape the HTML from the 10-K reports. For more details: https://pypi.org/project/sec-edgar-downloader/.
We will download the data directly into the default working directory that Google Colab uses. We also need to specify which companies we would like data for.
PATH = "/content"
dl = Downloader(PATH)
SYMBOLS = ["GOOGL", "MSFT", "AMZN", "IBM", "NVDA"]
# The ARGS variable holds some hardcoded information that we might need to reuse
ARGS = {"Type of Report": "10-K",
        "Companies": SYMBOLS}
Then we can simply download the data by looping through each of the companies and downloading using the SEC Edgar tool. The data will download into the "content" directory as we specified above.
Throughout this notebook we also use tqdm. This is essentially an awesome progress bar that helps you see how far you are through a loop and the expected remaining time. https://github.com/tqdm/tqdm
for symbol in tqdm(SYMBOLS):
    dl.get(ARGS["Type of Report"], symbol)
Now that we have downloaded the data, we need to extract the relevant text from the files. We also need to extract the year of each 10-K filing. This is easy since it is included in the document name.
To extract the text we just use a rough approximation to take a portion of text near the start of the report.
html2text is another tool used here to convert the files, which are all HTML, into plain text.
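The year extraction is worth a quick sketch. sec-edgar-downloader names each filing after its SEC accession number, whose middle segment is a two-digit year, so a split on "-" recovers it. The filenames below are illustrative and the `filing_year` helper is ours, not part of any library:

```python
# Hypothetical accession-number filename, e.g. "0001018724-19-000004.txt":
# the middle segment "19" is the two-digit filing year.
def filing_year(report_name):
    yr = int(report_name.split("-")[1])
    # Two-digit years up to 20 are read as 2000s, the rest as 1900s
    return 2000 + yr if yr <= 20 else 1900 + yr

print(filing_year("0001018724-19-000004.txt"))  # 2019
print(filing_year("0000912057-94-000123.txt"))  # 1994
```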
def createDataframe(company_list):
    df = pd.DataFrame(columns=["Company", "Year", "Report"])
    # Hand-picked slice boundaries for filings that need a custom window
    start_index = {"AMZN": 49257}
    end_index = {"AMZN": 185190}
    for company in tqdm(company_list):
        try:
            reports = os.listdir(PATH + "/sec_edgar_filings/" + company + "/10-K")
        except FileNotFoundError:
            continue
        for index, report in enumerate(reports):
            with open(PATH + "/sec_edgar_filings/" + company + "/10-K/" + report, "r") as opened_file:
                full_text = opened_file.read()
            try:
                if company in start_index:
                    start = start_index[company]
                    end = end_index[company]
                else:
                    start = 44800
                    end = 200000
                text = html2text.html2text(full_text[start:end])
                t_len = len(text)
                relevant_text = text[round(t_len * 0.003):round(t_len * 0.08)]
                # The filing year is the middle segment of the accession-number filename
                yr = int(report.split("-")[1])
                if yr > 20:
                    yr = 1900 + yr
                else:
                    yr = 2000 + yr
                df = df.append({"Company": company, "Year": yr, "Report": relevant_text}, ignore_index=True)
            except Exception:
                print(company, report, "Failed")
    return df
dataframe = createDataframe(ARGS["Companies"])
import re
def clean_dataset(text):
    # Make lowercase
    text = text.lower()
    # Remove some remaining html styling tokens
    for token in ("font", "size", "pt", "px", "padding", "family", "style"):
        text = re.sub(token, "", text)
    # Replace escaped &, < and > before the generic entity removal below
    text = re.sub(r"&amp;", "and", text)
    text = re.sub(r"&lt;", "<", text)
    text = re.sub(r"&gt;", ">", text)
    # Remove remaining HTML special entities (e.g. &nbsp;)
    text = re.sub(r"\&\w*;", "", text)
    # Remove tickers
    text = re.sub(r"\$\w*", "", text)
    # Remove hyperlinks & URLs
    text = re.sub(r"https?:\/\/.*\/\w*", "", text)
    text = re.sub(r"http(\S)+", "", text)
    text = re.sub(r"http ...", "", text)
    # Collapse whitespace (including new line characters)
    text = re.sub(r"\s\s+", "", text)
    text = re.sub(r"[ ]{2,}", " ", text)
    # Strip wiki-style [[target|label]] links, keeping only the label
    text = re.sub(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]', r'\1', text)
    # Remove characters beyond the Basic Multilingual Plane (BMP) of Unicode
    text = "".join(c for c in text if c <= "\uFFFF")
    text = text.strip()
    text = " ".join(text.split())
    return text
dataframe["Report"] = dataframe["Report"].apply(clean_dataset)
Word clouds, as basic as they are, can be useful to visually represent the main themes in the report.
from wordcloud import WordCloud
def make_wordcloud(series):
    all_text = ",".join(list(series.values))
    wordcloud = WordCloud(background_color="white", max_words=5000, contour_width=3)
    wordcloud.generate(all_text)
    return wordcloud.to_image()
make_wordcloud(dataframe.Report)
Let's see if we can spot any immediate differences between companies just from the word clouds. Since we wrote the word cloud generation into a function called make_wordcloud, we can easily reuse it! Let's look at Alphabet vs Microsoft.
import numpy as np
GOOGL_condition = (dataframe.Company == "GOOGL")
make_wordcloud(dataframe[GOOGL_condition].Report)
MSFT_condition = (dataframe.Company == "MSFT")
make_wordcloud(dataframe[MSFT_condition].Report)
Just by visualising the data from Alphabet vs Microsoft, we can see that Microsoft seems to talk more about their services and products, while Alphabet seems to be more concerned about macroeconomic factors.
We will be using LDA for the Topic Model. LDA stands for Latent Dirichlet Allocation. If we break that term down a little: "latent" means unobserved; "Dirichlet" refers to the Dirichlet distribution, named after the German mathematician; and "allocation" describes the nature of the problem, allocating latent topics to chunks of text.
LDA is an unsupervised technique, meaning we do not need labelled data, which is a big benefit when you have as many 10-K filings as we do. The mathematics behind it is quite deep, relying on Bayesian methods, so we won't cover it here. If you would like to get a better idea of the mathematics, then I recommend you start here: https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation or view the original paper (co-authored by Andrew Ng) here: https://web.archive.org/web/20120501152722/http://jmlr.csail.mit.edu/papers/v3/blei03a.html
An intuitive way to understand how topic modelling works is that the model imagines each document contains a fixed number of topics. For each topic, there are certain words that are associated with that topic. Then a document can be modeled as some topics that are generating some words associated with the topics. For example, a document discussing Covid-19 and unemployment impact can be modelled as containing the topics: “Covid-19”, “economics”, “health” and “unemployment”. Each one of these topics has a specific vocabulary associated with it, which appears in the document. The model knows the document isn’t discussing the topic “trade” because words associated with “trade” do not appear in the document.
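That generative story can be made concrete with a toy sketch. The topics, vocabularies and weights below are entirely made up for illustration; LDA's job is to run this process in reverse and infer them from the documents:

```python
import random

random.seed(0)

# Two hypothetical topics, each a probability distribution over its own vocabulary
topics = {
    "health":    (["virus", "vaccine", "hospital", "symptoms"], [0.4, 0.3, 0.2, 0.1]),
    "economics": (["unemployment", "gdp", "markets", "stimulus"], [0.4, 0.3, 0.2, 0.1]),
}
# This imaginary document is a 70/30 mixture of the two topics
doc_topic_weights = {"health": 0.7, "economics": 0.3}

def generate_document(n_words):
    words = []
    for _ in range(n_words):
        # 1. draw a topic from the document's topic mixture
        topic = random.choices(list(doc_topic_weights),
                               weights=list(doc_topic_weights.values()))[0]
        # 2. draw a word from that topic's word distribution
        vocab, probs = topics[topic]
        words.append(random.choices(vocab, weights=probs)[0])
    return words

print(generate_document(8))
```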
NLTK, Gensim and spaCy are the primary packages we will be using to clean the text, preprocess it and then build the model. These libraries are very common in the NLP space nowadays and should become familiar to you over time.
We also use a special plotting tool called pyLDAvis. As the name suggests, this enables you to visualise the topic modelling output using a number of techniques, such as dimensionality reduction.
To prepare the text for the model we need to do a few things. The first is to remove stopwords, which are words that are not going to add much meaning to the text and hence just add noise to the model. The NLTK package has a lot of these words listed, which we can make use of right away. Have a look at some of them below, stored in the list variable stop_words.
Note: If you find any other words that could be considered meaningless which remain after filtering out the stopwords, then you can just append them to the list.
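As a sketch of that filtering step (the stopword list below is a tiny hand-picked subset standing in for NLTK's full list, and "company"/"fiscal" are hypothetical domain-specific additions):

```python
# A tiny illustrative subset of an English stopword list
stop_words = ["the", "and", "of", "to", "in", "a", "is", "for"]
# Append domain words that carry little meaning in 10-K filings
stop_words.extend(["company", "fiscal"])

tokens = ["the", "company", "reported", "growth", "in", "fiscal", "revenue"]
filtered = [w for w in tokens if w not in stop_words]
print(filtered)  # ['reported', 'growth', 'revenue']
```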
%%capture
import nltk; nltk.download('stopwords')
!python3 -m spacy download en
%%capture
from pprint import pprint
import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
# Gensim is a great package that supports topic modelling and other NLP tools
import gensim
import gensim.corpora as corpora
from gensim.models import CoherenceModel
from gensim.utils import simple_preprocess
# spacy for lemmatization
import spacy
# Plotting tools
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)
def remove_stopwords(texts):
    """
    This function simply removes all of the stopwords we have specified in the list stop_words.
    """
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]
def bigrams(words, bi_min=3):
    """
    https://radimrehurek.com/gensim/models/phrases.html
    """
    bigram = gensim.models.Phrases(words, min_count=bi_min)
    bigram_mod = gensim.models.phrases.Phraser(bigram)
    return bigram_mod
def get_corpus(df):
    words = list(sent_to_words(df.Report))
    words = remove_stopwords(words)
    bigram_mod = bigrams(words)
    bigram = [bigram_mod[report] for report in words]
    id2word = gensim.corpora.Dictionary(bigram)
    id2word.filter_extremes(no_below=3, no_above=0.35)
    id2word.compactify()
    corpus = [id2word.doc2bow(text) for text in bigram]
    return corpus, id2word, bigram
train_corpus, train_id2word, bigram_train = get_corpus(dataframe)
We need to choose the number of topics we are looking for. This is a hyperparameter and cannot be directly optimised.
NUM_TOPICS = 10
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    lda_train = gensim.models.ldamulticore.LdaMulticore(
        corpus=train_corpus,
        num_topics=NUM_TOPICS,
        id2word=train_id2word,
        chunksize=50,
        workers=4,
        passes=100,
        eval_every=1,
        per_word_topics=True)
coherencemodel = CoherenceModel(lda_train, texts=bigram_train, dictionary=train_id2word)
print(coherencemodel.get_coherence())
0.527596795083962
# Visualize the topics using pyLDAvis
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_train, train_corpus, train_id2word)
vis
As the results show, the model is decent at finding topics, but we can do better. We will look at two ways to improve the model: finding the optimal number of topics, and using Mallet.
Mallet (Machine Learning for Language Toolkit) is a topic modelling package written in Java. Gensim has a wrapper to interact with the package, which we will take advantage of.
The difference between the LDA model we have been using and Mallet is that the original LDA uses variational Bayes inference, while Mallet uses collapsed Gibbs sampling. The latter is more precise, but slower. In most cases Mallet performs much better than the original LDA, so we will test it on our data. As we will see, Mallet dramatically increases our coherence score, demonstrating that it is better suited to this task than the original LDA model.
We need to go through some additional steps to properly install Mallet and the wrapper from Gensim. Here is an excellent guide to using Mallet with Google Colab: https://github.com/polsci/colab-gensim-mallet
def install_java():
    !apt-get install -y openjdk-8-jdk-headless -qq > /dev/null  # install openjdk
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"  # set environment variable
    !java -version  # check java version
install_java()
openjdk version "11.0.7" 2020-04-14 OpenJDK Runtime Environment (build 11.0.7+10-post-Ubuntu-2ubuntu218.04) OpenJDK 64-Bit Server VM (build 11.0.7+10-post-Ubuntu-2ubuntu218.04, mixed mode, sharing)
%%capture
!wget http://mallet.cs.umass.edu/dist/mallet-2.0.8.zip
!unzip mallet-2.0.8.zip
os.environ['MALLET_HOME'] = '/content/mallet-2.0.8'
mallet_path = '/content/mallet-2.0.8/bin/mallet'
from gensim.models.wrappers import LdaMallet
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    lda_mallet = LdaMallet(
        mallet_path,
        corpus=train_corpus,
        num_topics=NUM_TOPICS,
        id2word=train_id2word,
    )
gensimmodelMallet = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(lda_mallet)
coherencemodel = CoherenceModel(gensimmodelMallet, texts=bigram_train, dictionary=train_id2word)
print(coherencemodel.get_coherence())
0.7143588181690383
As you can see we get a huge boost in coherence!
# Visualize the topics using pyLDAvis
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(gensimmodelMallet, train_corpus, train_id2word)
vis
There is no easy way to obtain the optimal number of topics you should use. This is a hyperparameter and must be tuned manually. One way, albeit still not great, is to create a number of models, each with different topic numbers, and calculate the coherence scores for each model. Then we plot the coherence vs. number of topics, and find the elbow - the point at which the curve tapers off.
def compute_coherence(dictionary, corpus, texts, maximum=30, minimum=3, step=4):
    coherence_values = []
    model_list = []
    for num_topics in tqdm(range(minimum, maximum, step)):
        model = LdaMallet(
            mallet_path,
            corpus=corpus,
            num_topics=num_topics,
            id2word=dictionary,
        )
        gensimMallet = gensim.models.wrappers.ldamallet.malletmodel2ldamodel(model)
        model_list.append(gensimMallet)
        coherencemodel = CoherenceModel(model=gensimMallet, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())
    return model_list, coherence_values
models, coherences = compute_coherence(train_id2word, train_corpus, bigram_train)
x = range(3, 30, 4)
plt.figure(figsize=(15, 10))
plt.plot(x, coherences)
plt.xlabel("Num Topics")
plt.ylabel("Coherence score")
# plt.legend(("coherence_values"), loc='best')
plt.show()
Unfortunately this looks quite messy, but the scale on the y-axis is quite narrow, so it doesn't make too much difference which value we choose. Also, if you run this again, the line can look quite different. Therefore, we should choose a number of topics that makes sense in this context. The fewer topics we choose, the more general and broad the topics will be, and vice versa.
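If you do want a mechanical starting point, you can take the topic count with the highest coherence and then sanity-check it by eye. The scores below are made-up placeholders standing in for the list of coherence values computed above:

```python
# Hypothetical coherence scores, one per candidate topic count
coherences = [0.48, 0.55, 0.53, 0.57, 0.52, 0.51, 0.50]
topic_counts = list(range(3, 30, 4))  # matches minimum=3, maximum=30, step=4

# Index of the highest-coherence model
best_idx = max(range(len(coherences)), key=lambda i: coherences[i])
print(topic_counts[best_idx], coherences[best_idx])  # 15 0.57
```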
Using the knowledge from the optimal topics we can select a "best model" to analyse the data with.
bestModelMallet = models[1]
# Visualize the topics using pyLDAvis
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(bestModelMallet, train_corpus, train_id2word)
vis
One great thing we can now do is look at how mentions of certain topics change over time within each company. Inspecting topic 3 reveals the terms: “improved_advertising”, “cloud_platforms”, “phones_intelligent”, “information_investors”, “global_compute” and so on. This topic can be interpreted as integrating technology with current business practices. Let’s simply call this topic “Tech” for short.
def get_df_topics(row, topic_id):
    words = list(sent_to_words([row]))
    words = remove_stopwords(words)
    bigram_mod = bigrams(words)
    bigram = [bigram_mod[report] for report in words]
    tokens = train_id2word.doc2bow(bigram[0])
    topics = bestModelMallet.get_document_topics(tokens)
    if topics is not None:
        for index, score in topics:
            if index == topic_id:
                return score
MSFT_df = dataframe[dataframe.Company == "MSFT"]
MSFT_df = MSFT_df.sort_values("Year")
MSFT_df.head()
| | Company | Year | Report |
|---|---|---|---|
| 30 | MSFT | 1994 | anufacturing and distribution operation consis... |
| 8 | MSFT | 1995 | mpete with unix-based operating systems from a... |
| 13 | MSFT | 1996 | operating systems from a wide range of compani... |
| 18 | MSFT | 1997 | distribute microsoft's copyrighted software pr... |
| 24 | MSFT | 1998 | providing these technical services. notes to f... |
score_list = []
year_list = []
for index, row in MSFT_df.iterrows():
    score_list.append(get_df_topics(row.Report, 3))
    year_list.append(row.Year)
plt.plot(year_list, score_list)
plt.title("MSFT Tech Mentions")
plt.show()
As we can see, mentions of topic 3, which seems to represent technology, have been fluctuating since 1995, and are more recently at a low point for Microsoft. Let's compare with IBM.
IBM_df = dataframe[dataframe.Company == "IBM"]
IBM_df = IBM_df.sort_values("Year")
IBM_df.head()
| | Company | Year | Report |
|---|---|---|---|
| 59 | IBM | 1994 | tatement no. 33-6889 on form s-3 filed on july... |
| 61 | IBM | 1995 | erland)................. switzerland 100 ibm u... |
| 68 | IBM | 1996 | world trade corporation.......................... |
| 78 | IBM | 1997 | with any and all amendments and subsequent res... |
| 71 | IBM | 1998 | the ibm supplemental executive retirement plan... |
score_list = []
year_list = []
for index, row in IBM_df.iterrows():
    score_list.append(get_df_topics(row.Report, 3))
    year_list.append(row.Year)
plt.plot(year_list, score_list)
plt.title("IBM Tech Mentions")
plt.show()
Interesting! IBM seems to have been on an uptrend since 1995, but in the last two years has dramatically fallen to 2004 levels. Now what about Amazon?
AMZN_df = dataframe[dataframe.Company == "AMZN"]
AMZN_df = AMZN_df.sort_values("Year")
AMZN_df.head()
| | Company | Year | Report |
|---|---|---|---|
| 51 | AMZN | 1998 | prospects, financial condition and results of ... |
| 43 | AMZN | 1999 | ecutive officers and directors of the company ... |
| 50 | AMZN | 2000 | losses from a major interruion. computer virus... |
| 44 | AMZN | 2001 | iminish the value of our trademarks and other ... |
| 47 | AMZN | 2002 | ional retail seasonality. traditional retail s... |
score_list = []
year_list = []
for index, row in AMZN_df.iterrows():
    score_list.append(get_df_topics(row.Report, 3))
    year_list.append(row.Year)
plt.plot(year_list, score_list)
plt.title("Amazon Tech Mentions")
plt.show()
We have learned about topic modelling, and more specifically LDA and Mallet. Topic modelling is a great unsupervised tool for extracting topics from documents, and Mallet is a particular implementation that improves on a weakness of the original LDA model: precision.
We learned how to preprocess and clean text for building an LDA/Mallet model. First, cleaning the data with regular expressions, then removing stopwords, and creating a Bag of Words model. We discussed the assumptions of the BoW model, but recognised that it can be a useful simplification in many cases.
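The Bag of Words step in that pipeline can be sketched in a few lines. Each document is reduced to (word_id, count) pairs, discarding word order entirely; gensim's `doc2bow` does essentially this, plus dictionary bookkeeping:

```python
# Two toy documents, already tokenised
docs = [["revenue", "growth", "revenue"], ["cloud", "growth"]]

# Assign each distinct word an integer id
id2word = {}
for doc in docs:
    for w in doc:
        id2word.setdefault(w, len(id2word))

def doc2bow(doc):
    # Count occurrences per word id; word order is thrown away
    counts = {}
    for w in doc:
        counts[id2word[w]] = counts.get(id2word[w], 0) + 1
    return sorted(counts.items())

print([doc2bow(d) for d in docs])  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
```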
Then we built the Mallet and LDA models, chose the optimal number of topics, and visualised the topics in pyLDAvis. The visualisation methods of pyLDAvis were covered in detail.
After finding the topic representation for each document, we saw how further work can be done, such as looking at how certain topics trend over time.
If you would like to go further, I would recommend looking into Deep LDA or Neural Topic Modelling with Reinforcement Learning.