This exercise uses a small subset of the data from Kaggle's Yelp Business Rating Prediction competition.
Description of the data: yelp.csv contains the dataset. It is stored in the repository (in the data directory), so there is no need to download anything from the Kaggle website.
Goal: Predict the star rating of a review using only the review text.
Tip: After each task, I recommend that you check the shape and the contents of your objects, to confirm that they match your expectations.
Read yelp.csv into a pandas DataFrame and examine it.
Create a new DataFrame that only contains the 5-star and 1-star reviews.
Define X and y from the new DataFrame, and then split X and y into training and testing sets, using the review text as the only feature and the star rating as the response.
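As a rough sketch of the filtering and splitting tasks, the snippet below uses a small made-up DataFrame in place of yelp.csv. The column names "stars" and "text", the variable name yelp_best_worst, and the random_state value are assumptions made for illustration, not something the exercise specifies.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for yelp.csv; the column names "stars" and "text" are assumed.
yelp = pd.DataFrame({
    "stars": [5, 1, 3, 5, 1, 4, 5, 1, 2, 5],
    "text": [
        "great food", "terrible service", "it was okay", "loved it",
        "never again", "pretty good", "amazing staff", "awful experience",
        "not great", "best pizza in town",
    ],
})

# Keep only the 5-star and 1-star reviews.
yelp_best_worst = yelp[yelp["stars"].isin([1, 5])]

# The review text is the only feature; the star rating is the response.
X = yelp_best_worst["text"]
y = yelp_best_worst["stars"]

# random_state is fixed only so that the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Check shapes, as the tip above suggests.
print(yelp_best_worst.shape)
print(X_train.shape, X_test.shape)
```

Note that X is left as a one-dimensional Series of strings, since that is the input CountVectorizer expects.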
Use CountVectorizer to create document-term matrices from X_train and X_test.
Use multinomial Naive Bayes to predict the star rating for the reviews in the testing set, and then calculate the accuracy and print the confusion matrix.
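The vectorizing and modeling tasks might look roughly like the following. The training and testing texts here are invented toy data, and the variable names (vect, X_train_dtm, nb, y_pred_class) are only illustrative choices.

```python
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data standing in for the real training and testing sets.
X_train = [
    "great food and great service",
    "loved the friendly staff",
    "terrible food and rude staff",
    "awful, never coming back",
]
y_train = [5, 5, 1, 1]
X_test = ["great friendly service", "rude and awful"]
y_test = [5, 1]

# Learn the vocabulary from the training set only, then reuse it for the
# testing set so both matrices have the same columns.
vect = CountVectorizer()
X_train_dtm = vect.fit_transform(X_train)
X_test_dtm = vect.transform(X_test)

# Fit multinomial Naive Bayes and predict the class of each test review.
nb = MultinomialNB()
nb.fit(X_train_dtm, y_train)
y_pred_class = nb.predict(X_test_dtm)

print(X_train_dtm.shape, X_test_dtm.shape)
print(metrics.accuracy_score(y_test, y_pred_class))
print(metrics.confusion_matrix(y_test, y_pred_class))
```

Calling transform (rather than fit_transform) on the testing set is the important design choice: tokens that appear only in the testing set are ignored, which mirrors how the model would treat unseen reviews.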
Calculate the null accuracy, which is the classification accuracy that could be achieved by always predicting the most frequent class.
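One simple way to compute null accuracy, sketched here with made-up labels rather than the real testing set, is to take the largest class proportion in y_test:

```python
import pandas as pd

# Hypothetical testing-set labels: seven 5-star and three 1-star reviews.
y_test = pd.Series([5, 5, 1, 5, 1, 5, 5, 1, 5, 5])

# Proportion of each class; the largest one is the null accuracy.
class_proportions = y_test.value_counts(normalize=True)
most_frequent_class = class_proportions.idxmax()
null_accuracy = class_proportions.max()

print(most_frequent_class, null_accuracy)
```

Comparing the model's accuracy against this baseline shows how much the review text is actually contributing.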
Browse through the review text of some of the false positives and false negatives. Based on your knowledge of how Naive Bayes works, do you have any ideas about why the model is incorrectly classifying these reviews?
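To pull out the misclassified reviews, boolean masks on the true and predicted labels are usually enough. The sketch below assumes 5 stars is treated as the "positive" class and uses invented reviews and hard-coded predictions in place of a real model's output.

```python
import numpy as np
import pandas as pd

# Hypothetical testing-set reviews, true labels, and predictions.
X_test = pd.Series([
    "fantastic brunch spot",
    "the food was not good at all",
    "waited forever but the tacos were worth it",
    "rude staff and cold food",
])
y_test = pd.Series([5, 1, 5, 1])
y_pred_class = np.array([5, 5, 1, 1])  # predict() returns a NumPy array

# Assumption: 5 stars is the positive class, 1 star the negative class.
# False positives: 1-star reviews predicted as 5-star.
false_positives = X_test[(y_test == 1) & (y_pred_class == 5)]
# False negatives: 5-star reviews predicted as 1-star.
false_negatives = X_test[(y_test == 5) & (y_pred_class == 1)]

print(false_positives.tolist())
print(false_negatives.tolist())
```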
Calculate which 10 tokens are the most predictive of 5-star reviews, and which 10 tokens are the most predictive of 1-star reviews.
Hint: The per-class token counts and the number of observations in each class are available via the feature_count_ and class_count_ attributes of the Naive Bayes model object.
Up to this point, we have framed this as a binary classification problem by only considering the 5-star and 1-star reviews. Now, let's repeat the model building process using all reviews, which makes this a 5-class classification problem.
Here are the steps: