TF-IDF and N-Grams

The goal of this project was to predict the sentiment of an IMDB movie review using a binary classification system. The dataset was part of the Bag of Words Meets Bag of Popcorn Competition.

Model Accuracy: 0.89532

Bag of Words & TF-IDF

A Bag of Words (BoW) model is a simple representation used in Natural Language Processing: it counts the number of times each word appears in a document, discarding word order.

TF-IDF (Term Frequency-Inverse Document Frequency), on the other hand, reflects how important a word is to a document within a corpus. With TF-IDF, words are weighted by relevance rather than raw frequency.

It is the product of two statistics:

  1. Term Frequency (TF): The number of times a word appears in a given document.
  2. Inverse Document Frequency (IDF): The more documents a word appears in, the less valuable that word is as a signal. Very common words, such as “a” or “the”, thereby receive heavily discounted TF-IDF scores, in contrast to words that are specific to the document in question.

In the project, I used two separate TF-IDF vectorizers and merged their outputs into a single feature matrix.

  • The first vectorizer (word_vectorizer) analyzed complete words.
  • The second vectorizer (char_vectorizer) analyzed the frequency of character n-grams. An n-gram is a contiguous sequence of n items from a document. Using character n-grams up to length 3 (trigrams) yielded a high predictive score.

Lastly, we used logistic regression to predict the sentiment attached to each review. The hyperparameters of the model were tuned on a validation dataset prior to training the final model.

Interestingly, our model performed worse when we cleaned the text data using the usual methods: removing HTML, removing unwanted punctuation, removing stopwords, stemming, or tokenizing.

Loading Required Libraries and Reading the Data into Python

First, we need to load the required libraries and read the data into Python.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from scipy.sparse import hstack
from time import time
In [2]:
train = pd.read_csv("labeledTrainData.tsv", header=0, delimiter="\t")
test = pd.read_csv("testData.tsv", header=0, delimiter="\t")

train_text = train['review']
test_text = test['review']
y = train['sentiment']

all_text = pd.concat([train_text, test_text])

TF-IDF Vectorizers

First, we convert the reviews into a Bag of Words using the TF-IDF vectorizer for words and for character trigrams.

In [3]:
word_vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', sublinear_tf=True, strip_accents='unicode', 
                                  stop_words='english', ngram_range=(1, 1), max_features=10000)
word_vectorizer.fit(train_text)

train_word_features = word_vectorizer.transform(train_text)
In [4]:
# note: scikit-learn ignores stop_words when analyzer != 'word'
char_vectorizer = TfidfVectorizer(analyzer='char', sublinear_tf=True, strip_accents='unicode', 
                                  stop_words='english', ngram_range=(1, 3), max_features=50000)
char_vectorizer.fit(train_text)

train_char_features = char_vectorizer.transform(train_text)
In [5]:
train_features = hstack([train_word_features, train_char_features])

Hyperparameter Tuning of Logistic Regression

Since there are multiple hyperparameters to tune in the Logistic Regression model, we will use the GridSearchCV function of Sklearn to determine the optimal hyperparameter values. First, I used the train_test_split function to hold out a validation set before searching for the best parameters.

In [6]:
X_train, X_test, y_train, y_test = train_test_split(train_features, y, test_size=0.3, random_state=1234)

lr_model = LogisticRegression(random_state=1234)
param_dict = {'C': [0.001, 0.01, 0.1, 1, 10],
              'solver': ['sag', 'lbfgs', 'saga']}

start = time()
grid_search = GridSearchCV(lr_model, param_dict)
grid_search.fit(X_train, y_train)
print("GridSearch took %.2f seconds to complete." % (time()-start))
display(grid_search.best_params_)
print("Cross-Validated Score of the Best Estimator: %.3f" % grid_search.best_score_)
GridSearch took 350.08 seconds to complete.
{'C': 1, 'solver': 'saga'}
Cross-Validated Score of the Best Estimator: 0.888

Let's see how well our model does on the validation dataset and where any misclassifications occur.

We have several metrics available for classification accuracy, including a confusion matrix and a classification report.

In [7]:
lr = LogisticRegression(C=1, solver='saga')
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)

print(confusion_matrix(y_test, lr_preds))
print(classification_report(y_test, lr_preds))
print("Accuracy Score: %.3f" % accuracy_score(y_test, lr_preds))
[[3366  399]
 [ 366 3369]]
             precision    recall  f1-score   support

          0       0.90      0.89      0.90      3765
          1       0.89      0.90      0.90      3735

avg / total       0.90      0.90      0.90      7500

Accuracy Score: 0.898

The number of false positives (FP = 399) is similar to the number of false negatives (FN = 366), suggesting that our model is not biased toward either specificity or sensitivity.
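
As a quick sanity check on how those counts are read off, scikit-learn orders the cells of a binary confusion matrix as [[TN, FP], [FN, TP]], so the matrix printed above unpacks as follows:

```python
# Unpack the 2x2 confusion matrix printed above into named counts.
import numpy as np

cm = np.array([[3366, 399],
               [366, 3369]])
tn, fp, fn, tp = cm.ravel()  # row-major order: TN, FP, FN, TP
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```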

Modelling Sentiment from Reviews

We will redo the steps taken above, this time using both the train and test datasets.

  1. Create a TF-IDF BoW for both words and trigrams.
  2. Train the Logistic Regression model using the tuned hyperparameters.
  3. Format predictions for submission to Kaggle Competition.
In [8]:
word_vectorizer = TfidfVectorizer(analyzer='word', token_pattern=r'\w{1,}', sublinear_tf=True, strip_accents='unicode', 
                                  stop_words='english', ngram_range=(1, 1), max_features=10000)
word_vectorizer.fit(all_text)

train_word_features = word_vectorizer.transform(train_text)
test_word_features = word_vectorizer.transform(test_text)
In [9]:
# note: scikit-learn ignores stop_words when analyzer != 'word'
char_vectorizer = TfidfVectorizer(analyzer='char', sublinear_tf=True, strip_accents='unicode', 
                                  stop_words='english', ngram_range=(1, 3), max_features=50000)
char_vectorizer.fit(all_text)

train_char_features = char_vectorizer.transform(train_text)
test_char_features = char_vectorizer.transform(test_text)
In [10]:
train_features = hstack([train_char_features, train_word_features])
test_features = hstack([test_char_features, test_word_features])
In [11]:
lr = LogisticRegression(C=1, solver='saga')
lr.fit(train_features, y)
final_preds = lr.predict(test_features)

The predictions are then formatted in an appropriate layout for submission to Kaggle.

In [12]:
test['sentiment'] = final_preds
test = test[['id', 'sentiment']]
test.to_csv('Submission.csv', index=False)

Logistic Regression Sentiment Accuracy = 0.89532