The goal of this project was to predict the sentiment of an IMDB movie review using a binary classification system. The dataset was part of the Bag of Words Meets Bag of Popcorn Competition.
Model Accuracy: 0.89532
A Bag of Words (BoW) model is a simple representation used in Natural Language Processing: it describes a document by counting how many times each word appears in it, ignoring grammar and word order.
TF-IDF (or Term Frequency-Inverse Document Frequency), on the other hand, reflects how important a word is to a document within a corpus. With TF-IDF, words are weighted by relevance rather than raw frequency.
It is the product of two statistics: the term frequency (how often a word appears in a given document) and the inverse document frequency (how rare that word is across the whole corpus).
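In its standard textbook form (scikit-learn's TfidfVectorizer applies a smoothed variant of the idf term), the score for a term t in document d is:

```latex
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t),
\qquad
\mathrm{idf}(t) = \log \frac{N}{\mathrm{df}(t)}
```

where tf(t, d) is the number of times t occurs in d, df(t) is the number of documents containing t, and N is the total number of documents. A word that appears in every document gets an idf near zero, so its weight vanishes no matter how frequent it is.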
In this project, I used two separate TF-IDF vectorizers and merged their outputs into a single feature matrix.
Lastly, we used Logistic Regression to predict the sentiment attached to each review. The model's hyperparameters were tuned on a validation dataset prior to training the final model.
First, we need to load the required libraries and read the data into Python.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy.sparse import hstack
from time import time
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t")
test = pd.read_csv("testData.tsv", header=0, delimiter="\t")
train_text = train['review']
test_text = test['review']
y = train['sentiment']
all_text = pd.concat([train_text, test_text])
Next, we convert the reviews into TF-IDF features using two vectorizers: one over words and one over character n-grams (unigrams through trigrams).
word_vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', sublinear_tf=True, strip_accents='unicode',
stop_words='english', ngram_range=(1, 1), max_features=10000)
word_vectorizer.fit(train_text)
train_word_features = word_vectorizer.transform(train_text)
# note: stop_words only applies when analyzer='word', so it is omitted here
char_vectorizer = TfidfVectorizer(analyzer='char', sublinear_tf=True, strip_accents='unicode',
                                  ngram_range=(1, 3), max_features=50000)
char_vectorizer.fit(train_text)
train_char_features = char_vectorizer.transform(train_text)
train_features = hstack([train_word_features, train_char_features])
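The hstack call simply concatenates the two sparse blocks column-wise, so the 10,000 word features and 50,000 character features become one 60,000-column matrix. A tiny sketch with stand-in matrices (the 3x4 and 3x2 shapes here are arbitrary):

```python
# hstack joins sparse matrices side by side: row counts must match,
# and the column counts add up.
import numpy as np
from scipy.sparse import csr_matrix, hstack

a = csr_matrix(np.ones((3, 4)))   # stand-in for the word features
b = csr_matrix(np.zeros((3, 2)))  # stand-in for the char features
combined = hstack([a, b])
print(combined.shape)  # (3, 6)
```

Keeping the matrices sparse matters here: densifying 25,000 reviews by 60,000 features would be prohibitively large in memory.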
Since there are multiple hyperparameters to tune in the Logistic Regression model, we will use Sklearn's GridSearchCV function to determine the optimal hyperparameter values. Next, I used the train_test_split function to hold out a validation set and find the best parameters.
X_train, X_test, y_train, y_test = train_test_split(train_features, y, test_size=0.3, random_state=1234)
lr_model = LogisticRegression(random_state=1234)
param_dict = {'C': [0.001, 0.01, 0.1, 1, 10],
'solver': ['sag', 'lbfgs', 'saga']}
start = time()
grid_search = GridSearchCV(lr_model, param_dict)
grid_search.fit(X_train, y_train)
print("GridSearch took %.2f seconds to complete." % (time()-start))
display(grid_search.best_params_)
print("Cross-Validated Score of the Best Estimator: %.3f" % grid_search.best_score_)
GridSearch took 350.08 seconds to complete.
{'C': 1, 'solver': 'saga'}
Cross-Validated Score of the Best Estimator: 0.888
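For reference, the same GridSearchCV pattern can be reproduced end-to-end on a small synthetic dataset; make_classification and the reduced parameter grid below are illustrative stand-ins, not the project's data or final grid:

```python
# Sketch of the grid-search pattern used above, on a tiny synthetic problem:
# GridSearchCV fits the estimator for every parameter combination with
# cross-validation and keeps the best-scoring one.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, n_features=10, random_state=1234)
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    {'C': [0.01, 0.1, 1]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```

After fitting, `grid.best_estimator_` is already refit on the full training data with the winning parameters, which is why the score it reports is cross-validated while the estimator itself is ready to use.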
Let's see how well our model does on the validation dataset and where any misclassifications occur.
We have several metrics available for classification accuracy, including a confusion matrix and a classification report.
lr = LogisticRegression(C=1, solver='saga')
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)
print(confusion_matrix(y_test, lr_preds))
print(classification_report(y_test, lr_preds))
print("Accuracy Score: %.3f" % accuracy_score(y_test, lr_preds))
[[3366  399]
 [ 366 3369]]
             precision    recall  f1-score   support

          0       0.90      0.89      0.90      3765
          1       0.89      0.90      0.90      3735

avg / total       0.90      0.90      0.90      7500

Accuracy Score: 0.898
The number of false positives (FP = 399) is similar to the number of false negatives (FN = 366), suggesting that our model is not biased towards either sensitivity or specificity.
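To keep the mapping straight: scikit-learn's confusion_matrix places true labels on the rows and predictions on the columns, so for a binary problem ravel() yields the counts in the order (tn, fp, fn, tp). A toy check with made-up labels:

```python
# confusion_matrix layout: rows = true labels, columns = predictions.
# Flattening a 2x2 binary matrix therefore gives (tn, fp, fn, tp).
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 1 1 1 2
```

Applied to the matrix above, this puts the 399 in the top-right (false positive) cell and the 366 in the bottom-left (false negative) cell.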
We now repeat the steps above, this time fitting the vectorizers on both the train and test datasets.
word_vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', sublinear_tf=True, strip_accents='unicode',
stop_words='english', ngram_range=(1, 1), max_features=10000)
word_vectorizer.fit(all_text)
train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)
# note: stop_words only applies when analyzer='word', so it is omitted here
char_vectorizer = TfidfVectorizer(analyzer='char', sublinear_tf=True, strip_accents='unicode',
                                  ngram_range=(1, 3), max_features=50000)
char_vectorizer.fit(all_text)
train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
lr = LogisticRegression(C=1, solver='saga')
lr.fit(train_features, y)
final_preds = lr.predict(test_features)
The predictions are then formatted in an appropriate layout for submission to Kaggle.
test['sentiment'] = final_preds
test = test[['id', 'sentiment']]
test.to_csv('Submission.csv', index=False)