
Naive Bayes Model for Newsgroups Data

For an explanation of the Naive Bayes model, see our course notes.

This notebook uses code from http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html.

Instructions

If you haven't already, follow the setup instructions here to get all necessary software installed.
Read through the code in the following sections:

Newsgroups Data
Model Training
Prediction

Complete at least one of the following exercises:

Exercise Option #1 - Standard Difficulty
Exercise Option #2 - Advanced Difficulty

In [1]:

from sklearn.datasets import fetch_20newsgroups # scikit-learn downloads the 20 newsgroups dataset on first use
from sklearn.naive_bayes import MultinomialNB # we need this for our Naive Bayes model

# These next two are about processing the data. We'll look into this more later in the semester.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Newsgroups Data

Back in the day, Usenet was a popular discussion system where people could discuss topics in relevant newsgroups (think Slack channel or subreddit). At some point, someone pulled together messages sent to 20 different newsgroups, to use as a dataset for doing text processing.

We are going to pull out messages from just a few different groups to try out a Naive Bayes model.

Examine the newsgroups dictionary, to make sure you understand the dataset.

Note: If you get an error about SSL certificates on macOS, you can fix it as follows:

In Finder, click on Applications in the list on the left panel
Double click to go into the Python folder (it will be called something like Python 3.7)
Double click on the Install Certificates.command file in that folder

In [2]:

# which newsgroups we want to download
newsgroup_names = ['comp.graphics', 'rec.sport.hockey', 'sci.electronics', 'sci.space']

# get the newsgroup data (organized much like the iris data)
newsgroups = fetch_20newsgroups(categories=newsgroup_names, shuffle=True, random_state=265)
newsgroups.keys()

Out[2]:

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR', 'description'])
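
To see how the main keys relate to each other, here is a minimal sketch using a small stand-in dictionary. The three documents and their labels are made up for illustration; the real object returned by fetch_20newsgroups has the same layout for these keys but holds far more documents, stores target as a NumPy array, and also allows attribute access such as newsgroups.data.

```python
from collections import Counter

# Hypothetical stand-in with the same layout as three of the keys above;
# the real object comes from fetch_20newsgroups and is much larger.
toy_newsgroups = {
    'data': ['The goalie made a great save',
             'The probe entered orbit around Jupiter',
             'Overtime goal wins the playoff game'],
    'target': [0, 1, 0],
    'target_names': ['rec.sport.hockey', 'sci.space'],
}

# target[i] is an integer index into target_names for the document data[i]
for text, label in zip(toy_newsgroups['data'], toy_newsgroups['target']):
    print('{0} => {1}'.format(toy_newsgroups['target_names'][label], text))

# Count how many documents fall into each category
docs_per_group = Counter(toy_newsgroups['target_names'][label]
                         for label in toy_newsgroups['target'])
print(docs_per_group)
```

Running the same loop over the first few entries of the real newsgroups.data and newsgroups.target is a quick way to check that the messages look like they belong to their labeled groups.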

This next part converts the text into numbers, because the scikit-learn Naive Bayes module expects numerical data rather than text data. We will talk more about what this code is doing later in the semester. For now, you can ignore it.

In [3]:

# Convert the text into numbers that represent each word (bag of words method)
word_vector = CountVectorizer()
word_vector_counts = word_vector.fit_transform(newsgroups.data)

# Account for the length of the documents and for very common words:
#   convert the raw counts to tf-idf weights, which scale each document's
#   counts to a common length and down-weight words that appear in many documents
term_freq_transformer = TfidfTransformer()
term_freq = term_freq_transformer.fit_transform(word_vector_counts)
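
For readers who want a rough idea of what the two steps above compute, the following is a pure-Python sketch on three made-up documents. It assumes the scikit-learn defaults (lowercased tokens of two or more characters, smoothed inverse document frequency, rows scaled to unit length). The names toy_docs, tokenize, and tfidf_row are illustrative only, and the sketch skips the sparse matrices that scikit-learn actually uses.

```python
import math
import re
from collections import Counter

# Hypothetical toy documents, not taken from the newsgroups data
toy_docs = ['the rocket reached orbit',
            'the player scored',
            'the rocket engine failed']

def tokenize(doc):
    # Lowercase and keep tokens of two or more word characters,
    # which mirrors CountVectorizer's default token pattern
    return re.findall(r'\b\w\w+\b', doc.lower())

# Bag of words: one Counter of raw word counts per document
counts = [Counter(tokenize(doc)) for doc in toy_docs]
vocabulary = sorted(set(word for c in counts for word in c))

# Document frequency: number of documents containing each word
n_docs = len(toy_docs)
doc_freq = {w: sum(1 for c in counts if w in c) for w in vocabulary}

# Smoothed inverse document frequency (assumed TfidfTransformer defaults)
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocabulary}

def tfidf_row(count):
    # Weight each raw count by idf, then scale the row to unit Euclidean length
    weights = [count[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(x * x for x in weights))
    return [x / norm for x in weights]

tfidf = [tfidf_row(c) for c in counts]
for word, weight in zip(vocabulary, tfidf[0]):
    print('{:8} {:.3f}'.format(word, weight))
```

In the first document, 'the' appears in every document and so receives a lower weight than 'orbit', which appears in only one.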

Model Training

Now we fit the Naive Bayes model to the subset of the 20 newsgroups data that we've pulled out.

In [4]:

# Train the Naive Bayes model
model = MultinomialNB().fit(term_freq, newsgroups.target)
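
As a rough illustration of what fitting a multinomial Naive Bayes model involves, here is a from-scratch sketch on a tiny made-up dataset. It is not scikit-learn's implementation: the word counts, labels, and class names are hypothetical, and it uses raw counts where the notebook uses tf-idf weights (the same formulas apply to fractional values). The smoothing value alpha = 1.0 matches the MultinomialNB default.

```python
import math

# Hypothetical toy training set: rows are documents, columns are counts of
# the words ['puck', 'goal', 'orbit', 'rocket']; labels index class_names
class_names = ['hockey', 'space']
X = [[2, 1, 0, 0],
     [1, 2, 0, 0],
     [0, 0, 2, 1],
     [0, 1, 1, 2]]
y = [0, 0, 1, 1]
alpha = 1.0  # additive (Laplace) smoothing

n_classes, n_words = len(class_names), len(X[0])

# Log prior: fraction of training documents in each class
log_prior = [math.log(y.count(c) / len(y)) for c in range(n_classes)]

# Smoothed log likelihood of each word given each class
log_likelihood = []
for c in range(n_classes):
    word_totals = [sum(row[j] for row, label in zip(X, y) if label == c)
                   for j in range(n_words)]
    denom = sum(word_totals) + alpha * n_words
    log_likelihood.append([math.log((t + alpha) / denom) for t in word_totals])

def predict_proba(doc):
    # Unnormalized log posterior: log prior + sum of count * log likelihood
    scores = [log_prior[c] + sum(x * ll for x, ll in zip(doc, log_likelihood[c]))
              for c in range(n_classes)]
    # Subtract the largest score before exponentiating for numerical stability,
    # then normalize so the probabilities sum to 1
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    return [e / sum(exps) for e in exps]

probs = predict_proba([1, 1, 0, 0])  # a new document: one 'puck', one 'goal'
print(class_names[probs.index(max(probs))], probs)
```

Because of the smoothing, a word never seen in a class during training still gets a small nonzero likelihood, so a single unfamiliar word cannot force a class probability to zero.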

Prediction

Let's see how the model does on some (very short) documents that we made up to fit into the specific categories our model is trained on.

In [5]:

# Predict some new fake documents
fake_docs = [
    'That GPU has amazing performance with a lot of shaders',
    'The player had a wicked slap shot',
    'I spent all day yesterday soldering banks of capacitors',
    'Today I have to solder a bank of capacitors',
    'NASA has rovers on Mars']
fake_counts = word_vector.transform(fake_docs)
fake_term_freq = term_freq_transformer.transform(fake_counts)

predicted = model.predict(fake_term_freq)
print('Predictions:')
for doc, group in zip(fake_docs, predicted):
    print('\t{0} => {1}'.format(doc, newsgroups.target_names[group]))

probabilities = model.predict_proba(fake_term_freq)
print('Probabilities:')
print(''.join(['{:17}'.format(name) for name in newsgroups.target_names]))
for probs in probabilities:
    print(''.join(['{:<17.8}'.format(prob) for prob in probs]))

Predictions:
	That GPU has amazing performance with a lot of shaders => comp.graphics
	The player had a wicked slap shot => rec.sport.hockey
	I spent all day yesterday soldering banks of capacitors => sci.space
	Today I have to solder a bank of capacitors => sci.electronics
	NASA has rovers on Mars => sci.space
Probabilities:
comp.graphics    rec.sport.hockey sci.electronics  sci.space        
0.29466149       0.22895149       0.24926344       0.22712357       
0.12948055       0.51155698       0.18248712       0.17647535       
0.18604814       0.24117771       0.27540452       0.29736963       
0.21285086       0.21081302       0.3486507        0.22768541       
0.079185633      0.066225915      0.10236622       0.75222223
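
The predicted group for each document is the one with the highest probability in its row. The sketch below copies the probabilities printed above into plain lists and, for each document, reports the winning group and its margin over the runner-up. The variable names predicted_names and margins are illustrative and not part of the notebook.

```python
# Values copied from the Probabilities output above, one row per fake document
target_names = ['comp.graphics', 'rec.sport.hockey', 'sci.electronics', 'sci.space']
probabilities = [
    [0.29466149, 0.22895149, 0.24926344, 0.22712357],
    [0.12948055, 0.51155698, 0.18248712, 0.17647535],
    [0.18604814, 0.24117771, 0.27540452, 0.29736963],
    [0.21285086, 0.21081302, 0.3486507, 0.22768541],
    [0.079185633, 0.066225915, 0.10236622, 0.75222223],
]

predicted_names = []
margins = []
for probs in probabilities:
    # Sort (probability, name) pairs from most to least likely
    ranked = sorted(zip(probs, target_names), reverse=True)
    (best_p, best_name), (second_p, second_name) = ranked[0], ranked[1]
    predicted_names.append(best_name)
    # Margin: how far the winning group is ahead of the runner-up
    margins.append(best_p - second_p)
    print('{:17} margin {:.3f} over {}'.format(best_name, best_p - second_p, second_name))
```

The third document ('I spent all day yesterday soldering banks of capacitors') has the narrowest margin, with sci.space only slightly ahead of sci.electronics. It is also the one prediction that does not match the category the document was written for.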

Exercise Option #1 - Standard Difficulty

Modify the fake documents and add some new documents of your own.

What words in your documents have particularly large effects on the model probabilities? Note that we're not looking for documents that consist of a single word, but for words that, when included or excluded from a document, tend to change the model's output.

Exercise Option #2 - Advanced Difficulty

Write some code to count up how often the words you found in the exercise above appear in each category in the training dataset. Does this match up with your intuition?

In [ ]: