This notebook is my follow-up to the Kaggle tutorial on the Bag of Words model for sentiment analysis. You can find the detailed instructions at the link below. I will not explain every step in detail here, because everything I do is covered in the tutorial.
Tutorial link: https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
import sys
sys.path.append('/home/hoanvu/anaconda2/envs/ds/lib/python2.7/site-packages/')
import re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
reviews = pd.read_csv('../data/labeledTrainData.tsv', header=0, delimiter="\t", quoting=3)
reviews.head()
|   | id | sentiment | review |
|---|---|---|---|
| 0 | "5814_8" | 1 | "With all this stuff going down at the moment ... |
| 1 | "2381_9" | 1 | "\"The Classic War of the Worlds\" by Timothy ... |
| 2 | "7759_3" | 0 | "The film starts with a manager (Nicholas Bell... |
| 3 | "3630_4" | 0 | "It must be assumed that those who praised thi... |
| 4 | "9495_8" | 1 | "Superbly trashy and wondrously unpretentious ... |
Before cleaning every review, let's work out the cleaning steps on a single sample review:
# Take a look at a sample review
sample_review = reviews.review[2]
sample_review
'"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like \xc2\xa8Jurassik Park\xc2\xa8, and some scientists resurrect one of nature\'s most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against one nature\'s most fearsome predators. Furthermore a third Sabretooth more dangerous and slow stalks its victims.<br /><br />The movie delivers the goods with lots of blood and gore as beheading, hair-raising chills,full of scares when the Sabretooths appear with mediocre special effects.The story provides exciting and stirring entertainment but it results to be quite boring .The giant animals are majority made by computer generator and seem totally lousy .Middling performances though the players reacting appropriately to becoming food.Actors give vigorously physical performances dodging the beasts ,running,bound and leaps or dangling over walls . And it packs a ridiculous final deadly scene. No for small kids by realistic,gory and violent attack scenes . 
Other films about Sabretooths or Smilodon are the following : \xc2\xa8Sabretooth(2002)\xc2\xa8by James R Hickox with Vanessa Angel, David Keith and John Rhys Davies and the much better \xc2\xa810.000 BC(2006)\xc2\xa8 by Roland Emmerich with with Steven Strait, Cliff Curtis and Camilla Belle. This motion picture filled with bloody moments is badly directed by George Miller and with no originality because takes too many elements from previous films. Miller is an Australian director usually working for television (Tidal wave, Journey to the center of the earth, and many others) and occasionally for cinema ( The man from Snowy river, Zeus and Roxanne,Robinson Crusoe ). Rating : Below average, bottom of barrel."'
# Clean all HTML tags
sample_review = BeautifulSoup(sample_review, 'html.parser').get_text()
sample_review
u'"The film starts with a manager (Nicholas Bell) giving welcome investors (Robert Carradine) to Primal Park . A secret project mutating a primal animal using fossilized DNA, like \xa8Jurassik Park\xa8, and some scientists resurrect one of nature\'s most fearsome predators, the Sabretooth tiger or Smilodon . Scientific ambition turns deadly, however, and when the high voltage fence is opened the creature escape and begins savagely stalking its prey - the human visitors , tourists and scientific.Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre-historical animals which are deadlier and bigger . In addition , a security agent (Stacy Haiduk) and her mate (Brian Wimmer) fight hardly against the carnivorous Smilodons. The Sabretooths, themselves , of course, are the real star stars and they are astounding terrifyingly though not convincing. The giant animals savagely are stalking its prey and the group run afoul and fight against one nature\'s most fearsome predators. Furthermore a third Sabretooth more dangerous and slow stalks its victims.The movie delivers the goods with lots of blood and gore as beheading, hair-raising chills,full of scares when the Sabretooths appear with mediocre special effects.The story provides exciting and stirring entertainment but it results to be quite boring .The giant animals are majority made by computer generator and seem totally lousy .Middling performances though the players reacting appropriately to becoming food.Actors give vigorously physical performances dodging the beasts ,running,bound and leaps or dangling over walls . And it packs a ridiculous final deadly scene. No for small kids by realistic,gory and violent attack scenes . 
Other films about Sabretooths or Smilodon are the following : \xa8Sabretooth(2002)\xa8by James R Hickox with Vanessa Angel, David Keith and John Rhys Davies and the much better \xa810.000 BC(2006)\xa8 by Roland Emmerich with with Steven Strait, Cliff Curtis and Camilla Belle. This motion picture filled with bloody moments is badly directed by George Miller and with no originality because takes too many elements from previous films. Miller is an Australian director usually working for television (Tidal wave, Journey to the center of the earth, and many others) and occasionally for cinema ( The man from Snowy river, Zeus and Roxanne,Robinson Crusoe ). Rating : Below average, bottom of barrel."'
# Remove all characters which are not letters
sample_review = re.sub(r'[^a-zA-Z]', ' ', sample_review)
sample_review
u' The film starts with a manager Nicholas Bell giving welcome investors Robert Carradine to Primal Park A secret project mutating a primal animal using fossilized DNA like Jurassik Park and some scientists resurrect one of nature s most fearsome predators the Sabretooth tiger or Smilodon Scientific ambition turns deadly however and when the high voltage fence is opened the creature escape and begins savagely stalking its prey the human visitors tourists and scientific Meanwhile some youngsters enter in the restricted area of the security center and are attacked by a pack of large pre historical animals which are deadlier and bigger In addition a security agent Stacy Haiduk and her mate Brian Wimmer fight hardly against the carnivorous Smilodons The Sabretooths themselves of course are the real star stars and they are astounding terrifyingly though not convincing The giant animals savagely are stalking its prey and the group run afoul and fight against one nature s most fearsome predators Furthermore a third Sabretooth more dangerous and slow stalks its victims The movie delivers the goods with lots of blood and gore as beheading hair raising chills full of scares when the Sabretooths appear with mediocre special effects The story provides exciting and stirring entertainment but it results to be quite boring The giant animals are majority made by computer generator and seem totally lousy Middling performances though the players reacting appropriately to becoming food Actors give vigorously physical performances dodging the beasts running bound and leaps or dangling over walls And it packs a ridiculous final deadly scene No for small kids by realistic gory and violent attack scenes Other films about Sabretooths or Smilodon are the following Sabretooth by James R Hickox with Vanessa Angel David Keith and John Rhys Davies and the much better BC by Roland Emmerich with with Steven Strait Cliff Curtis and Camilla Belle This motion picture filled with bloody moments is 
badly directed by George Miller and with no originality because takes too many elements from previous films Miller is an Australian director usually working for television Tidal wave Journey to the center of the earth and many others and occasionally for cinema The man from Snowy river Zeus and Roxanne Robinson Crusoe Rating Below average bottom of barrel '
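To make the effect of this substitution concrete, here is a small sketch on a made-up string (the string is an illustrative assumption, not taken from the data set). Every character that is not a letter, including apostrophes, digits, and brackets, becomes a space, which is why fragments such as "nature s" and a lone "BC" appear in the output above.

```python
import re

# Hypothetical short string, chosen to mimic fragments of the sample review
text = "nature's BC(2006)"

# Replace every character that is not an ASCII letter with a space
letters_only = re.sub(r'[^a-zA-Z]', ' ', text)

# Splitting on whitespace shows which tokens survive
tokens = letters_only.split()
print(tokens)
```

The replacement keeps the string length unchanged, since each removed character is swapped for exactly one space.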
# Remove all stopwords inside a review
sample_review_words = [word for word in sample_review.lower().split(' ')
                       if word and word not in stopwords.words('english')]
# Let's see our sample review after cleaning
' '.join(sample_review_words)
u'film starts manager nicholas bell giving welcome investors robert carradine primal park secret project mutating primal animal using fossilized dna like jurassik park scientists resurrect one nature fearsome predators sabretooth tiger smilodon scientific ambition turns deadly however high voltage fence opened creature escape begins savagely stalking prey human visitors tourists scientific meanwhile youngsters enter restricted area security center attacked pack large pre historical animals deadlier bigger addition security agent stacy haiduk mate brian wimmer fight hardly carnivorous smilodons sabretooths course real star stars astounding terrifyingly though convincing giant animals savagely stalking prey group run afoul fight one nature fearsome predators furthermore third sabretooth dangerous slow stalks victims movie delivers goods lots blood gore beheading hair raising chills full scares sabretooths appear mediocre special effects story provides exciting stirring entertainment results quite boring giant animals majority made computer generator seem totally lousy middling performances though players reacting appropriately becoming food actors give vigorously physical performances dodging beasts running bound leaps dangling walls packs ridiculous final deadly scene small kids realistic gory violent attack scenes films sabretooths smilodon following sabretooth james r hickox vanessa angel david keith john rhys davies much better bc roland emmerich steven strait cliff curtis camilla belle motion picture filled bloody moments badly directed george miller originality takes many elements previous films miller australian director usually working television tidal wave journey center earth many others occasionally cinema man snowy river zeus roxanne robinson crusoe rating average bottom barrel'
It is much more useful to combine the steps above into a single function, so that every review can be cleaned with one call:
def clean_review(raw_review):
    # Remove all HTML tags present in the review
    review_without_html = BeautifulSoup(raw_review, 'html.parser').get_text()
    # Replace every character that is not a letter with a space
    review_with_only_letter = re.sub(r'[^a-zA-Z]', ' ', review_without_html)
    # Convert the stopword list into a set, which is faster for membership checks
    all_stopwords = set(stopwords.words('english'))
    # Lowercase, split into words, and drop empty strings and stopwords
    review = [word for word in review_with_only_letter.lower().split(' ')
              if word and word not in all_stopwords]
    # The previous line returns a list; join the words back into one string
    return ' '.join(review)
Now let's test our function. It should return exactly what the step-by-step cleaning above produced:
clean_review(reviews.review[2])
u'film starts manager nicholas bell giving welcome investors robert carradine primal park secret project mutating primal animal using fossilized dna like jurassik park scientists resurrect one nature fearsome predators sabretooth tiger smilodon scientific ambition turns deadly however high voltage fence opened creature escape begins savagely stalking prey human visitors tourists scientific meanwhile youngsters enter restricted area security center attacked pack large pre historical animals deadlier bigger addition security agent stacy haiduk mate brian wimmer fight hardly carnivorous smilodons sabretooths course real star stars astounding terrifyingly though convincing giant animals savagely stalking prey group run afoul fight one nature fearsome predators furthermore third sabretooth dangerous slow stalks victims movie delivers goods lots blood gore beheading hair raising chills full scares sabretooths appear mediocre special effects story provides exciting stirring entertainment results quite boring giant animals majority made computer generator seem totally lousy middling performances though players reacting appropriately becoming food actors give vigorously physical performances dodging beasts running bound leaps dangling walls packs ridiculous final deadly scene small kids realistic gory violent attack scenes films sabretooths smilodon following sabretooth james r hickox vanessa angel david keith john rhys davies much better bc roland emmerich steven strait cliff curtis camilla belle motion picture filled bloody moments badly directed george miller originality takes many elements previous films miller australian director usually working television tidal wave journey center earth many others occasionally cinema man snowy river zeus roxanne robinson crusoe rating average bottom barrel'
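The comment about converting the stopword list into a set can be illustrated with a rough timing sketch. The stopword list below is a synthetic stand-in (an assumption for illustration, not NLTK's actual list), padded with filler entries so that it has a size of the same order as a real stopword list. Timings vary by machine, so no specific numbers are claimed.

```python
import timeit

# Hypothetical stand-in for stopwords.words('english'): a few real stopwords
# plus filler entries that only pad the list to a more realistic length
stopword_list = ['the', 'a', 'with', 'to'] + ['filler%d' % i for i in range(150)]
stopword_set = set(stopword_list)

words = 'the film starts with a manager giving welcome to investors'.split()

def filter_with_list():
    # Each membership test scans the list element by element
    return [w for w in words if w not in stopword_list]

def filter_with_set():
    # Each membership test is a hash lookup
    return [w for w in words if w not in stopword_set]

# Both approaches keep exactly the same words
kept = filter_with_set()
assert filter_with_list() == kept

print(kept)
print('list: %.4f s' % timeit.timeit(filter_with_list, number=2000))
print('set:  %.4f s' % timeit.timeit(filter_with_set, number=2000))
```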
Now, let's start cleaning every review in the training data set:
all_clean_reviews = []
for review in reviews.review:
    all_clean_reviews.append(clean_review(review))
len(all_clean_reviews)
25000
We use CountVectorizer from scikit-learn to extract features from our training data. CountVectorizer implements the Bag-of-Words model.
In the code below, max_features=5000 means that we build a vocabulary of the 5000 most frequent words in the training data. It also means that each review is converted to a vector of 5000 word counts, one per vocabulary word. Here is an example of such a vector when max_features=10:
[3 1 0 0 2 3 1 1 1 0]
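As a rough illustration of what such a count vector represents, here is a pure-Python sketch on a made-up three-review corpus. It is a simplified stand-in for the idea, not scikit-learn's implementation. The helper names build_vocabulary and to_count_vector are hypothetical, and tie-breaking among equally frequent words may differ from CountVectorizer.

```python
from collections import Counter

def build_vocabulary(documents, max_features):
    """Keep the max_features most frequent words, sorted alphabetically."""
    totals = Counter(word for doc in documents for word in doc.split())
    # Sort by descending count, then alphabetically, to make ties deterministic
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return sorted(word for word, _ in ranked[:max_features])

def to_count_vector(document, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

# Hypothetical, already-cleaned reviews
docs = ['good film good cast', 'bad film', 'good story bad ending']

toy_vocab = build_vocabulary(docs, max_features=3)
toy_vectors = [to_count_vector(doc, toy_vocab) for doc in docs]
print(toy_vocab)
print(toy_vectors)
```

Each vector has one entry per vocabulary word, so every review ends up with the same length regardless of how long the original text was.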
vectorizer = CountVectorizer(analyzer='word', max_features=5000)
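fit_transform() does two things: it fits the model by learning the vocabulary from the training data, and it transforms the training data into feature vectors. The input to fit_transform() should be a list of strings.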
train_data_features = vectorizer.fit_transform(all_clean_reviews)
# Convert train_data_features into numpy array so that it's easier to work with in prediction
train_data_features = train_data_features.toarray()
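Let's take a look at our output. After toarray() it is a dense NumPy array, but most entries are zero because each review uses only a small part of the vocabulary: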
train_data_features[:10]
array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])
train_data_features.shape
(25000, 5000)
After calling fit_transform() on our training data, what vocabulary do we get?
vocab = vectorizer.get_feature_names()
# Print first 20 words in vocabulary
vocab[:20]
[u'abandoned', u'abc', u'abilities', u'ability', u'able', u'abraham', u'absence', u'absent', u'absolute', u'absolutely', u'absurd', u'abuse', u'abusive', u'abysmal', u'academy', u'accent', u'accents', u'accept', u'acceptable', u'accepted']
Print the counts for the first 20 words in our vocabulary:
# Sum up the counts of each vocabulary word
dist = np.sum(train_data_features, axis=0)
# For each, print the vocabulary word and the number of times it
# appears in the training set
for tag, count in zip(vocab, dist)[:20]:
    print count, tag
187 abandoned
125 abc
108 abilities
454 ability
1259 able
85 abraham
116 absence
83 absent
352 absolute
1485 absolutely
306 absurd
192 abuse
91 abusive
98 abysmal
297 academy
485 accent
203 accents
300 accept
130 acceptable
144 accepted
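The list above follows the alphabetical order of the vocabulary. To see the most frequent words instead, the word and count pairs can be sorted by count. The sketch below uses a few of the counts printed above as plain Python lists rather than the full vocab and dist objects.

```python
# A handful of words and counts copied from the output above
words = ['abandoned', 'abc', 'ability', 'able', 'absolutely']
counts = [187, 125, 454, 1259, 1485]

# Pair each count with its word and sort by count, largest first
ranked = sorted(zip(counts, words), reverse=True)

for count, word in ranked[:3]:
    print('%d %s' % (count, word))
```

With the real data, vocab and dist would take the place of the two lists.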
forest = RandomForestClassifier(n_estimators=100)
forest.fit(train_data_features, reviews.sentiment)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini', max_depth=None, max_features='auto', max_leaf_nodes=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1, oob_score=False, random_state=None, verbose=0, warm_start=False)
test_data = pd.read_csv('../data/testData.tsv', header=0, quoting=3, delimiter='\t')
clean_test_review = []
for review in test_data.review:
    clean_test_review.append(clean_review(review))
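With the test data, we only call transform() on the already-fitted vectorizer; we do not fit again. Refitting on the test data would learn a different vocabulary, so the feature columns would no longer match the ones the model was trained on, and the tutorial also warns that fitting on the test data risks overfitting.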
test_data_features = vectorizer.transform(clean_test_review)
test_data_features = test_data_features.toarray()
test_data_features.shape
(25000, 5000)
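Here is a small sketch of why the test reviews are transformed with the vocabulary learned from the training data. The columns of the test matrix then line up with those the classifier was trained on, and words that are not in the vocabulary are simply ignored. The vocabulary and review below are made up, and transform_with_vocabulary is a hypothetical pure-Python helper, not the scikit-learn method.

```python
from collections import Counter

def transform_with_vocabulary(document, vocabulary):
    """Count only the words that are already in the fixed vocabulary."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary learned from the training reviews
train_vocabulary = ['bad', 'film', 'good']

# Hypothetical test review containing words outside the vocabulary
test_review = 'good plot wonderful film good'

test_vector = transform_with_vocabulary(test_review, train_vocabulary)
print(test_vector)
```

The words "plot" and "wonderful" leave no trace in the vector, and its length equals the vocabulary size, which mirrors the (25000, 5000) shape above.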
result = forest.predict(test_data_features)
output = pd.DataFrame({'id': test_data['id'], 'sentiment': result})
output.to_csv('bag_of_words.csv', index=False, quoting=3)
This submission achieves an accuracy of about 88%.