The Naive Bayes classifier is a statistical programming tool with diverse uses in text analysis. It is famously used to determine whether an email is "spam" or "ham" (the opposite of spam). It is also used in sentiment analysis, for example to determine whether a movie review is positive or negative, as well as to determine the authorship of an article or to categorize documents into topics. Today, however, we will use a Naive Bayes classifier not to classify, but to analyze the competing sides of political debates on the floor of the US Congress.
from __future__ import division
%pylab inline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
# cross_val_score and train_test_split live in sklearn.model_selection (scikit-learn 0.18+);
# the older sklearn.cross_validation module has been removed.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from speech import Speech
Twiddle with the parameters and see how many speeches you can retrieve for the subset of the congressional record you are interested in analyzing. Good machine learning generally requires large sample sizes, so try to go for a few thousand speeches. Pick a topic that will classify well (i.e. one on which Democrats and Republicans contrast starkly).
phrase = "abortion"
# First ask only for the number of matching speeches (0 rows returned).
num_speeches = Speech.get(0, 0, phrase=phrase, congress="", start_date="1995-05-04", speaker_party="*")['count']
print("Downloading %i speeches" % num_speeches)
# Then download that many speeches.
speeches = Speech.get(start=0, rows=num_speeches, phrase=phrase, speaker_party="*")['speeches']
print("%i speeches downloaded" % len(speeches))
naive_bayes = MultinomialNB(alpha=1.0, fit_prior=True)
vectorizer = TfidfVectorizer(min_df=.1, max_df=.6, stop_words='english')
# Make an array of text objects. Each chunk of text is just the text of a congressional speech.
data = [" ".join(speech['speaking']) for speech in speeches]
# Transform speeches into vectors
data = vectorizer.fit_transform(data)
# Make an array of 0s and 1s that determine if each speech is democrat or republican.
# This is called the target vector.
target = [speech['speaker_party'] for speech in speeches]
target = [0 if x == "D" else 1 for x in target]
What the data vector looks like
What the target vector looks like
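As a rough illustration of the shape of both vectors, here is a minimal sketch on a made-up three-speech corpus. The texts, the `toy_` names, and the use of plain word counts are all assumptions for illustration; the notebook's real data vector holds TF-IDF weights in a sparse matrix, not raw counts.

```python
# Hypothetical three-speech corpus; the texts and party labels are made up.
toy_speeches = [
    {"speaker_party": "D", "text": "protect the right to choose"},
    {"speaker_party": "R", "text": "protect the unborn child"},
    {"speaker_party": "D", "text": "the right to privacy"},
]

# Vocabulary: every distinct word, sorted, one column per word.
toy_vocabulary = sorted({w for s in toy_speeches for w in s["text"].split()})

# Data: one row per speech, one count per vocabulary word.
toy_data = [
    [s["text"].split().count(w) for w in toy_vocabulary]
    for s in toy_speeches
]

# Target: 0 for Democrat, 1 otherwise, in the same order as the rows.
toy_target = [0 if s["speaker_party"] == "D" else 1 for s in toy_speeches]

print(toy_vocabulary)
print(toy_data)
print(toy_target)
```

Each row of the data lines up with one entry of the target, which is what lets the classifier pair a speech's words with its party.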
The data for a classifier is generally split into two parts: the "training set", which the algorithm uses to learn to recognize a pattern, and the "testing set", which we give the algorithm as an unknown. The Naive Bayes algorithm uses what it learns from the training set to attempt to classify items in the testing set. In this case, even though we know the correct 'class' or categorization of each speech, we split our data 80/20 and pretend, for the 20% test set, that we don't know which class each speech belongs to.
X_train, X_test, Y_train, Y_test = train_test_split(data, target, test_size=0.2)
print(X_train.shape, X_test.shape, len(Y_train), len(Y_test))
A Naive Bayes classifier works by combining the conditional probabilities of each word given a particular class, using Bayes' rule.
P(party | word) = P(word | party) * P(party) / P(word)
P(Democrat | 'choice') = P('choice' | Democrat) * P(Democrat) / P('choice') # Probably pretty high
P(Republican | 'choice') = P('choice' | Republican) * P(Republican) / P('choice') # Probably fairly low
When you multiply P(word | party) for every word in a document and then multiply by P(party), you get a quantity proportional to the probability that the document comes from a member of that party. In practice the logarithms are added up instead, which gives the same ranking and avoids numerical underflow. The "naive" part is the assumption that words occur independently of one another given the party. In short
P( party | word1 & word2 & word3 ... & word-n ) is the same as P( party | document )
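To make the arithmetic concrete, here is a minimal pure-Python sketch of that calculation. The word counts, the equal priors, and the three-word vocabulary are invented for illustration; only the smoothing value `alpha = 1.0` is taken from the `MultinomialNB(alpha=1.0)` setting used earlier.

```python
import math

# Made-up word counts per party; illustrative numbers, not real data.
word_counts = {
    "D": {"choice": 8, "privacy": 5, "unborn": 1},
    "R": {"choice": 2, "privacy": 1, "unborn": 9},
}
priors = {"D": 0.5, "R": 0.5}  # assumed equal share of speeches
vocab = ["choice", "privacy", "unborn"]
alpha = 1.0  # add-one (Laplace) smoothing


def log_likelihood(word, party):
    """log P(word | party) with add-alpha smoothing."""
    total = sum(word_counts[party].values())
    smoothed = (word_counts[party][word] + alpha) / (total + alpha * len(vocab))
    return math.log(smoothed)


def posterior(document):
    """P(party | document) for each party, assuming independent words."""
    # Summing logs is the same as multiplying the probabilities.
    scores = {
        party: math.log(priors[party])
        + sum(log_likelihood(w, party) for w in document)
        for party in priors
    }
    # Normalise so the posteriors add up to 1; this plays the role of P(document).
    norm = math.log(sum(math.exp(s) for s in scores.values()))
    return {party: math.exp(s - norm) for party, s in scores.items()}


result = posterior(["choice", "privacy", "choice"])
print(result)
```

With these made-up counts, a document that says "choice" twice and "privacy" once comes out far more likely to be Democratic than Republican.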
In our case, we know the correct 'class' of 100% of our data, but we pretended that 20% was unknown and asked the classifier to figure it out using the patterns it learned. Now we can see how accurately the classifier sorted the test data into our two classes (Democrat and Republican). Since there are only two classes, random guessing would land us at about 50% when the classes are roughly balanced. Before continuing, we must make sure the classifier does a fair bit better than chance.
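A minimal sketch of this train-then-score step, on a made-up miniature corpus, might look like the following. The texts, the `toy_` names, the hand-picked split, and the use of `CountVectorizer` are assumptions for illustration. The `fit` and `score` calls mirror what one would run on `X_train`, `Y_train`, `X_test` and `Y_test`, which is itself an assumption about how the step is carried out.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus; 0 = Democrat, 1 = Republican, as in the target vector.
train_texts = [
    "protect the right to choose",
    "women deserve choice and privacy",
    "protect the unborn child",
    "defend life of the unborn",
]
train_labels = [0, 0, 1, 1]
test_texts = [
    "the right to privacy and choice",
    "life of the unborn child",
]
test_labels = [0, 1]

toy_vectorizer = CountVectorizer()
# Learn the vocabulary from the training text only.
toy_train = toy_vectorizer.fit_transform(train_texts)
# Reuse that vocabulary for the held-out text.
toy_test = toy_vectorizer.transform(test_texts)

toy_model = MultinomialNB(alpha=1.0, fit_prior=True)
toy_model.fit(toy_train, train_labels)

predictions = toy_model.predict(toy_test)
# Fraction of held-out speeches that were labelled correctly.
accuracy = toy_model.score(toy_test, test_labels)
print(predictions, accuracy)
```

On real speeches the accuracy will be well below the perfect score this toy corpus produces; the number to compare it against is the roughly 50% that guessing would achieve.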
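Cross-validation is a more robust way of checking how well a classifier performs. It repeats the train-and-test procedure above several times, each time holding out a different subset of the data for testing. With cv=5, the data is cut into five parts and each part serves as the test set once.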
cross_val_score(naive_bayes, data, target, scoring='accuracy', verbose=1, cv=5)
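To show what the five folds amount to, here is a minimal pure-Python sketch of a k-fold split over item indices. The function name `k_fold_indices` is hypothetical, and the contiguous, unshuffled folds are a simplification; scikit-learn's own splitters differ in detail, for example by keeping the class balance similar in each fold.

```python
def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    # Spread any remainder over the first few folds.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_items) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train on", train, "test on", test)
```

Every item lands in the test set exactly once, so the five accuracy scores together cover the whole data set rather than a single 20% slice.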
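terms = vectorizer.get_feature_names()
# feature_log_prob_ has one row per class (0 = Democrat, 1 = Republican) and one column per term.
# Adding the log prior gives log P(word | party) + log P(party) for every term.
joint = naive_bayes.feature_log_prob_ + naive_bayes.class_log_prior_[:, np.newaxis]
# Bayes' rule: P(party | word) = P(word | party) * P(party) / P(word), with P(word) summed over both parties.
posterior = np.exp(joint - np.logaddexp(joint[0], joint[1]))
t1 = posterior[1]  # P(Republican | term)
t2 = posterior[0]  # P(Democrat | term)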
[(terms[i], t1[i]) for i in np.array(t1).argsort()[::-1]] # Top Terms for Republicans, strongest first
[(terms[i], t2[i]) for i in np.array(t2).argsort()[::-1]] # Top Terms for Democrats, strongest first
What other contexts might you be able to apply this methodology to? How can a classifier be useful in your work? How about the particular way in which we were able to find the words most indicative of a certain class?