Siraj's 03 Challenge

This is a response to the Coding Challenge in: https://youtu.be/si8zZHkufRY

The challenge for this video is to train a model on this dataset of video game reviews from IGN.com. Then, given some new video game title, it should be able to classify it. You can use pandas to parse this dataset. Right now each review has a label that's either Amazing, Great, Good, Mediocre, Painful, or Awful; these are the sentiment labels. Using the existing labels is extra credit. The baseline is to just convert the labels so that there are only 2 sentiments (positive or negative). Ideally you can use an RNN via TFLearn like the one in this example, but I'll accept other types of ML models as well. You'll learn how to parse data, select appropriate features, and use a neural net on an IRL problem.

Sentiment Labels to be Predicted

  • Great
  • Good
  • Okay
  • Mediocre
  • Amazing
  • Bad
  • Awful
  • Painful
  • Unbearable
  • Masterpiece

Accuracy Results

  • Dummy Classifier (i.e. always predict the most frequent class): 0.25631 (25.6%) (see the sanity check after this list)
  • Multinomial Naive Bayes: 0.32355 (32.4%)
  • RNN (using tflearn): 0.41546 (41.5%)
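
The Dummy Classifier baseline is essentially the relative frequency of the most frequent label ('Great'). A quick sanity check (not part of the original notebook) that reproduces this figure straight from the score_phrase counts shown further down, assuming ign.csv is in the working directory:

import pandas as pd

df = pd.read_csv('ign.csv')                  # standalone re-load, just for this check
df = df[df['score_phrase'] != 'Disaster']    # the 3 'Disaster' rows are also dropped below
counts = df['score_phrase'].value_counts()
print(counts.max() / counts.sum())           # ~0.2563, i.e. always predicting 'Great'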
In [1]:
import sys
import tensorflow as tf
from termcolor import colored
print(colored('Python Version: %s' % sys.version.split()[0], 'blue'))
print(colored('TensorFlow Ver: %s' % tf.__version__, 'magenta'))
Python Version: 3.5.2
TensorFlow Ver: 0.11.0
In [2]:
n_epoch = int(input('Enter no. of epochs for RNN training: '))
Enter no. of epochs for RNN training: 100
In [3]:
print(colored('No. of epochs: %d' % n_epoch, 'red'))
No. of epochs: 100
In [4]:
import pandas as pd
pd.set_option('display.max_colwidth', 1000)

Load IGN Dataset as original_ign

In [5]:
original_ign = pd.read_csv('ign.csv')
original_ign.head(10)
Out[5]:
Unnamed: 0 score_phrase title url platform score genre editors_choice release_year release_month release_day
0 0 Amazing LittleBigPlanet PS Vita /games/littlebigplanet-vita/vita-98907 PlayStation Vita 9.0 Platformer Y 2012 9 12
1 1 Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition /games/littlebigplanet-ps-vita-marvel-super-hero-edition/vita-20027059 PlayStation Vita 9.0 Platformer Y 2012 9 12
2 2 Great Splice: Tree of Life /games/splice/ipad-141070 iPad 8.5 Puzzle N 2012 9 12
3 3 Great NHL 13 /games/nhl-13/xbox-360-128182 Xbox 360 8.5 Sports N 2012 9 11
4 4 Great NHL 13 /games/nhl-13/ps3-128181 PlayStation 3 8.5 Sports N 2012 9 11
5 5 Good Total War Battles: Shogun /games/total-war-battles-shogun/mac-142565 Macintosh 7.0 Strategy N 2012 9 11
6 6 Awful Double Dragon: Neon /games/double-dragon-neon/xbox-360-131320 Xbox 360 3.0 Fighting N 2012 9 11
7 7 Amazing Guild Wars 2 /games/guild-wars-2/pc-896298 PC 9.0 RPG Y 2012 9 11
8 8 Awful Double Dragon: Neon /games/double-dragon-neon/ps3-131321 PlayStation 3 3.0 Fighting N 2012 9 11
9 9 Good Total War Battles: Shogun /games/total-war-battles-shogun/pc-142564 PC 7.0 Strategy N 2012 9 11

Check out the shape of the IGN Dataset

In [6]:
print('original_ign.shape:', original_ign.shape)
original_ign.shape: (18625, 11)

Check out all the unique score_phrase values as well as their counts

In [7]:
original_ign.score_phrase.value_counts()
Out[7]:
Great          4773
Good           4741
Okay           2945
Mediocre       1959
Amazing        1804
Bad            1269
Awful           664
Painful         340
Unbearable       72
Masterpiece      55
Disaster          3
Name: score_phrase, dtype: int64

Data Preprocessing

As always, we need to perform some preprocessing on our dataset before training our model(s).

Convert score_phrase to binary sentiments and add a new column called sentiment

In [8]:
bad_phrases = ['Bad', 'Awful', 'Painful', 'Unbearable', 'Disaster']
original_ign['sentiment'] = original_ign.score_phrase.isin(bad_phrases).map({True: 'Negative', False: 'Positive'})
In [9]:
# Remove "Disaster"
original_ign = original_ign[original_ign['score_phrase'] != 'Disaster']
In [10]:
original_ign.head()
Out[10]:
Unnamed: 0 score_phrase title url platform score genre editors_choice release_year release_month release_day sentiment
0 0 Amazing LittleBigPlanet PS Vita /games/littlebigplanet-vita/vita-98907 PlayStation Vita 9.0 Platformer Y 2012 9 12 Positive
1 1 Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition /games/littlebigplanet-ps-vita-marvel-super-hero-edition/vita-20027059 PlayStation Vita 9.0 Platformer Y 2012 9 12 Positive
2 2 Great Splice: Tree of Life /games/splice/ipad-141070 iPad 8.5 Puzzle N 2012 9 12 Positive
3 3 Great NHL 13 /games/nhl-13/xbox-360-128182 Xbox 360 8.5 Sports N 2012 9 11 Positive
4 4 Great NHL 13 /games/nhl-13/ps3-128181 PlayStation 3 8.5 Sports N 2012 9 11 Positive

No. of Positive Sentiments vs. No. of Negative Sentiments

In [11]:
original_ign.sentiment.value_counts(normalize=True)
Out[11]:
Positive    0.874074
Negative    0.125926
Name: sentiment, dtype: float64

Check for null elements

In [12]:
original_ign.isnull().sum()
Out[12]:
Unnamed: 0         0
score_phrase       0
title              0
url                0
platform           0
score              0
genre             36
editors_choice     0
release_year       0
release_month      0
release_day        0
sentiment          0
dtype: int64

Fill all null elements with an empty string

In [13]:
original_ign.fillna(value='', inplace=True)
In [14]:
# original_ign[ original_ign['genre'] == '' ].shape

Create a new DataFrame called ign

In [15]:
ign = original_ign[ ['sentiment', 'score_phrase', 'title', 'platform', 'genre', 'editors_choice'] ].copy()
ign.head(10)
Out[15]:
sentiment score_phrase title platform genre editors_choice
0 Positive Amazing LittleBigPlanet PS Vita PlayStation Vita Platformer Y
1 Positive Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition PlayStation Vita Platformer Y
2 Positive Great Splice: Tree of Life iPad Puzzle N
3 Positive Great NHL 13 Xbox 360 Sports N
4 Positive Great NHL 13 PlayStation 3 Sports N
5 Positive Good Total War Battles: Shogun Macintosh Strategy N
6 Negative Awful Double Dragon: Neon Xbox 360 Fighting N
7 Positive Amazing Guild Wars 2 PC RPG Y
8 Negative Awful Double Dragon: Neon PlayStation 3 Fighting N
9 Positive Good Total War Battles: Shogun PC Strategy N

Create a new column called is_editors_choice

In [16]:
ign['is_editors_choice'] = ign['editors_choice'].map({'Y': 'editors_choice', 'N': ''})
ign.head()
Out[16]:
sentiment score_phrase title platform genre editors_choice is_editors_choice
0 Positive Amazing LittleBigPlanet PS Vita PlayStation Vita Platformer Y editors_choice
1 Positive Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition PlayStation Vita Platformer Y editors_choice
2 Positive Great Splice: Tree of Life iPad Puzzle N
3 Positive Great NHL 13 Xbox 360 Sports N
4 Positive Great NHL 13 PlayStation 3 Sports N

Create a new column called text which combines the contents of several columns

In [17]:
ign['text'] = ign['title'].str.cat(ign['platform'], sep=' ').str.cat(ign['genre'], sep=' ').str.cat(ign['is_editors_choice'], sep=' ')
In [18]:
print('Shape of \"ign\" DataFrame:', ign.shape)
Shape of "ign" DataFrame: (18622, 8)
In [19]:
ign.head(10)
Out[19]:
sentiment score_phrase title platform genre editors_choice is_editors_choice text
0 Positive Amazing LittleBigPlanet PS Vita PlayStation Vita Platformer Y editors_choice LittleBigPlanet PS Vita PlayStation Vita Platformer editors_choice
1 Positive Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition PlayStation Vita Platformer Y editors_choice LittleBigPlanet PS Vita -- Marvel Super Hero Edition PlayStation Vita Platformer editors_choice
2 Positive Great Splice: Tree of Life iPad Puzzle N Splice: Tree of Life iPad Puzzle
3 Positive Great NHL 13 Xbox 360 Sports N NHL 13 Xbox 360 Sports
4 Positive Great NHL 13 PlayStation 3 Sports N NHL 13 PlayStation 3 Sports
5 Positive Good Total War Battles: Shogun Macintosh Strategy N Total War Battles: Shogun Macintosh Strategy
6 Negative Awful Double Dragon: Neon Xbox 360 Fighting N Double Dragon: Neon Xbox 360 Fighting
7 Positive Amazing Guild Wars 2 PC RPG Y editors_choice Guild Wars 2 PC RPG editors_choice
8 Negative Awful Double Dragon: Neon PlayStation 3 Fighting N Double Dragon: Neon PlayStation 3 Fighting
9 Positive Good Total War Battles: Shogun PC Strategy N Total War Battles: Shogun PC Strategy


Here, I'll treat this as a multiclass problem where I attempt to predict the full labels (i.e. the score_phrases) rather than just the binary sentiment, where always predicting 'Positive' would already score about 87.4%.

Examples of score_phrases:

  • Great
  • Good
  • Okay
  • Mediocre
  • Amazing
  • Bad
  • Awful
  • Painful
  • Unbearable
  • Masterpiece
In [20]:
X = ign.text
y = ign.score_phrase

Top 10 rows for X

In [21]:
X.head(10)
Out[21]:
0                                 LittleBigPlanet PS Vita PlayStation Vita Platformer editors_choice
1    LittleBigPlanet PS Vita -- Marvel Super Hero Edition PlayStation Vita Platformer editors_choice
2                                                                  Splice: Tree of Life iPad Puzzle 
3                                                                            NHL 13 Xbox 360 Sports 
4                                                                       NHL 13 PlayStation 3 Sports 
5                                                      Total War Battles: Shogun Macintosh Strategy 
6                                                             Double Dragon: Neon Xbox 360 Fighting 
7                                                                 Guild Wars 2 PC RPG editors_choice
8                                                        Double Dragon: Neon PlayStation 3 Fighting 
9                                                             Total War Battles: Shogun PC Strategy 
Name: text, dtype: object

Top 10 rows for y

In [22]:
y.head(10)
Out[22]:
0    Amazing
1    Amazing
2      Great
3      Great
4      Great
5       Good
6      Awful
7    Amazing
8      Awful
9       Good
Name: score_phrase, dtype: object

Model #0: The DUMMY Classifier (Always Choose the Most Frequent Class)

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
In [24]:
vect = TfidfVectorizer(stop_words='english', token_pattern=r'\b\w{2,}\b')
dummy = DummyClassifier(strategy='most_frequent', random_state=0)

dummy_pipeline = make_pipeline(vect, dummy)
In [25]:
dummy_pipeline.named_steps
Out[25]:
{'dummyclassifier': DummyClassifier(constant=None, random_state=0, strategy='most_frequent'),
 'tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
         stop_words='english', strip_accents=None, sublinear_tf=False,
         token_pattern='\\b\\w{2,}\\b', tokenizer=None, use_idf=True,
         vocabulary=None)}
In [26]:
# Cross Validation
cv = cross_val_score(dummy_pipeline, X, y, scoring='accuracy', cv=10, n_jobs=-1)
print(colored('\nDummy Classifier\'s Accuracy: %0.5f\n' % cv.mean(), 'yellow'))

Dummy Classifier's Accuracy: 0.25631

Model #1: MultinomialNB Classifier

In [27]:
from sklearn.naive_bayes import MultinomialNB
In [28]:
vect = TfidfVectorizer(stop_words='english', 
                       token_pattern=r'\b\w{2,}\b',
                       min_df=1, max_df=0.1,
                       ngram_range=(1,2))
mnb = MultinomialNB(alpha=2)

mnb_pipeline = make_pipeline(vect, mnb)
In [29]:
mnb_pipeline.named_steps
Out[29]:
{'multinomialnb': MultinomialNB(alpha=2, class_prior=None, fit_prior=True),
 'tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=0.1, max_features=None, min_df=1,
         ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
         stop_words='english', strip_accents=None, sublinear_tf=False,
         token_pattern='\\b\\w{2,}\\b', tokenizer=None, use_idf=True,
         vocabulary=None)}
In [30]:
# Cross Validation
cv = cross_val_score(mnb_pipeline, X, y, scoring='accuracy', cv=10, n_jobs=-1)
print(colored('\nMultinomialNB Classifier\'s Accuracy: %0.5f\n' % cv.mean(), 'green'))

MultinomialNB Classifier's Accuracy: 0.32355
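
The MultinomialNB settings above (alpha=2, max_df=0.1, unigrams + bigrams) are hand-picked. A sketch (not part of the original notebook) of how one might search over them instead, reusing the same mnb_pipeline; the grid values are illustrative:

from sklearn.model_selection import GridSearchCV

param_grid = {
    'tfidfvectorizer__max_df': [0.1, 0.5, 1.0],
    'tfidfvectorizer__ngram_range': [(1, 1), (1, 2)],
    'multinomialnb__alpha': [0.5, 1.0, 2.0],
}

grid = GridSearchCV(mnb_pipeline, param_grid, scoring='accuracy', cv=10, n_jobs=-1)
grid.fit(X, y)
print(grid.best_score_, grid.best_params_)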

Model #2: RNN Classifier using TFLearn

In [31]:
import tflearn
from tflearn.data_utils import to_categorical, pad_sequences
from tflearn.datasets import imdb

Train-Test-Split

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

Create the vocab (so that we can create X_word_ids from X)

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1,1), token_pattern=r'\b\w{1,}\b')
In [34]:
vect.fit(X_train)
vocab = vect.vocabulary_
In [35]:
def convert_X_to_X_word_ids(X):
    # Split each text entry on whitespace, lowercase and strip each token, then keep
    # only the tokens present in the CountVectorizer vocabulary and map them to their
    # integer ids (out-of-vocabulary tokens are simply dropped).
    return X.apply( lambda x: [vocab[w] for w in [w.lower().strip() for w in x.split()] if w in vocab] )
In [36]:
X_train_word_ids = convert_X_to_X_word_ids(X_train)
X_test_word_ids  = convert_X_to_X_word_ids(X_test)

Difference between X_train/X_test and X_train_word_ids/X_test_word_ids

In [37]:
X_train.head()
Out[37]:
16138               Castlevania: Harmony of Despair PlayStation 3 Action, Adventure 
5945     Kim Possible 2: Drakken's Demise Game Boy Advance Platformer editors_choice
11360                                            Madden NFL 09 PlayStation 2 Sports 
18270                                                   WWE 2K16 Xbox One Wrestling 
12533                                        The Last Ninja Commodore 64/128 Action 
Name: text, dtype: object
In [38]:
X_train_word_ids.head()
Out[38]:
16138                 [3134, 4717, 1911, 5074, 149, 269]
5945     [3730, 5126, 1888, 2799, 1025, 266, 5062, 2227]
11360                   [4037, 4585, 13, 5074, 77, 6257]
18270                      [7447, 136, 7458, 4751, 7439]
12533                      [6674, 3843, 4617, 1533, 251]
Name: text, dtype: object
In [39]:
print('X_train_word_ids.shape:', X_train_word_ids.shape)
print('X_test_word_ids.shape:', X_test_word_ids.shape)
X_train_word_ids.shape: (16759,)
X_test_word_ids.shape: (1863,)

Sequence Padding

In [40]:
X_train_padded_seqs = pad_sequences(X_train_word_ids, maxlen=20, value=0)
X_test_padded_seqs  = pad_sequences(X_test_word_ids , maxlen=20, value=0)
In [41]:
print('X_train_padded_seqs.shape:', X_train_padded_seqs.shape)
print('X_test_padded_seqs.shape:', X_test_padded_seqs.shape)
X_train_padded_seqs.shape: (16759, 20)
X_test_padded_seqs.shape: (1863, 20)
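
maxlen=20 presumably leaves headroom above the longest tokenized title. A quick check (not in the original notebook) to see whether any sequence would actually be truncated by the padding above:

print('Longest training sequence:', X_train_word_ids.apply(len).max())
print('Longest test sequence:    ', X_test_word_ids.apply(len).max())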
In [42]:
pd.DataFrame(X_train_padded_seqs).head()
Out[42]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3134 4717 1911 5074 149 269
1 0 0 0 0 0 0 0 0 0 0 0 0 3730 5126 1888 2799 1025 266 5062 2227
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4037 4585 13 5074 77 6257
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7447 136 7458 4751 7439
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6674 3843 4617 1533 251
In [43]:
pd.DataFrame(X_test_padded_seqs).head()
Out[43]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7380 5261
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2154 5451 5074 149 5062
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4535 3944 12 5074 149 6257
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6674 7234 162 1826 4931 2329
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 251 935 3500 251

Convert the (y) labels to one-hot vectors

In [44]:
unique_y_labels = list(y_train.value_counts().index)
unique_y_labels
Out[44]:
['Great',
 'Good',
 'Okay',
 'Mediocre',
 'Amazing',
 'Bad',
 'Awful',
 'Painful',
 'Unbearable',
 'Masterpiece']
In [45]:
len(unique_y_labels)
Out[45]:
10
In [46]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(unique_y_labels)
Out[46]:
LabelEncoder()
In [47]:
# print('')
# print(unique_y_labels)
# print(le.transform(unique_y_labels))
# print('')
In [48]:
print('')
for label_id, label_name in zip(le.transform(unique_y_labels), unique_y_labels):
    print('%d: %s' % (label_id, label_name))
print('')
4: Great
3: Good
7: Okay
6: Mediocre
0: Amazing
2: Bad
1: Awful
8: Painful
9: Unbearable
5: Masterpiece

In [49]:
y_train = to_categorical(y_train.map(lambda x: le.transform([x])[0]), nb_classes=len(unique_y_labels))
y_test  = to_categorical(y_test.map(lambda x:  le.transform([x])[0]), nb_classes=len(unique_y_labels))
In [50]:
y_train[0:3]
Out[50]:
array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])
In [51]:
print('y_train.shape:', y_train.shape)
print('y_test.shape:', y_test.shape)
y_train.shape: (16759, 10)
y_test.shape: (1863, 10)

Network Building

In [52]:
size_of_each_vector = X_train_padded_seqs.shape[1]
vocab_size = len(vocab)
no_of_unique_y_labels = len(unique_y_labels)
In [53]:
print('size_of_each_vector:', size_of_each_vector)
print('vocab_size:', vocab_size)
print('no_of_unique_y_labels:', no_of_unique_y_labels)
size_of_each_vector: 20
vocab_size: 7596
no_of_unique_y_labels: 10
In [54]:
#sgd = tflearn.SGD(learning_rate=1e-4, lr_decay=0.96, decay_step=1000)

net = tflearn.input_data([None, size_of_each_vector]) # The first element is the "batch size" which we set to "None"
net = tflearn.embedding(net, input_dim=vocab_size, output_dim=128) # input_dim: vocabulary size
net = tflearn.lstm(net, 128, dropout=0.6) # Set the dropout to 0.6
net = tflearn.fully_connected(net, no_of_unique_y_labels, activation='softmax') # relu or softmax
net = tflearn.regression(net, 
                         optimizer='adam',  # adam or ada or adagrad # sgd
                         learning_rate=1e-4,
                         loss='categorical_crossentropy')

Initialize the Model

In [55]:
#model = tflearn.DNN(net, tensorboard_verbose=0, checkpoint_path='SavedModels/model.tfl.ckpt')
model = tflearn.DNN(net, tensorboard_verbose=0)

Train the Model

In [56]:
# Training is skipped in this run: the commented-out call below shows how the model
# was trained, and a previously saved model is loaded further down instead.
# model.fit(X_train_padded_seqs, y_train, 
#           validation_set=(X_test_padded_seqs, y_test), 
#           n_epoch=n_epoch,
#           show_metric=True, 
#           batch_size=100)

Manually Save the Model

In [57]:
# model.save('SavedModels/model.tfl')
# print(colored('Model Saved!', 'red'))

Manually Load the Model

In [58]:
model.load('SavedModels/model.tfl')
print(colored('Model Loaded!', 'red'))
Model Loaded!

RNN's Accuracy

In [59]:
import numpy as np
from sklearn import metrics
In [60]:
pred_classes = [np.argmax(i) for i in model.predict(X_test_padded_seqs)]
true_classes = [np.argmax(i) for i in y_test]

print(colored('\nRNN Classifier\'s Accuracy: %0.5f\n' % metrics.accuracy_score(true_classes, pred_classes), 'cyan'))

RNN Classifier's Accuracy: 0.41546
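
Accuracy alone hides how the RNN does on the rare classes (Unbearable, Masterpiece, ...). A possible follow-up (not part of the original notebook) that reuses the true_classes and pred_classes computed above:

print(metrics.classification_report(true_classes, pred_classes,
                                    labels=list(range(len(le.classes_))),
                                    target_names=le.classes_))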


Show some predicted samples

In [61]:
ids_of_titles = range(0,21) # range(X_test.shape[0]) 

for i in ids_of_titles:
    pred_class = np.argmax(model.predict([X_test_padded_seqs[i]]))
    true_class = np.argmax(y_test[i])
    
    print(X_test.values[i])
    print('pred_class:', le.inverse_transform(pred_class))
    print('true_class:', le.inverse_transform(true_class))
    print('')
Amy's Jigsaw Scrapbook Wireless Puzzle 
pred_class: Good
true_class: Good

DuckTales Remastered PlayStation 3 Platformer 
pred_class: Okay
true_class: Good

NBA Live 08 PlayStation 3 Sports 
pred_class: Good
true_class: Okay

The Walking Dead: 400 Days PC Adventure, Episodic 
pred_class: Great
true_class: Great

Action Blox iPhone Action 
pred_class: Great
true_class: Okay

Kane & Lynch: Dead Men Xbox 360 Action 
pred_class: Good
true_class: Good

Life is Strange -- Episode 4: Dark Room Xbox One Adventure 
pred_class: Great
true_class: Okay

Genma Onimusha Xbox Action, Adventure 
pred_class: Good
true_class: Great

Klonoa 2: Dream Champ Tournament Game Boy Advance Platformer 
pred_class: Great
true_class: Great

The Walking Dead: A Telltale Game Series -- Season Two PlayStation Vita Adventure 
pred_class: Great
true_class: Great

Dead Star PlayStation 4 Strategy 
pred_class: Good
true_class: Good

Gods vs. Humans! Wii Strategy 
pred_class: Mediocre
true_class: Bad

Worms Revolution PlayStation 3 Strategy 
pred_class: Great
true_class: Great

I of the Dragon PC Action, RPG 
pred_class: Good
true_class: Bad

Family Guy: Air Griffin Wireless Action 
pred_class: Okay
true_class: Bad

Tetris iPod Puzzle 
pred_class: Okay
true_class: Good

The Simpsons Wrestling PlayStation Action 
pred_class: Good
true_class: Unbearable

Godzilla PlayStation 4 Action 
pred_class: Bad
true_class: Bad

New World Order PC Shooter 
pred_class: Good
true_class: Bad

Blacklight: Retribution PC Shooter editors_choice
pred_class: Great
true_class: Great

Sakura Wars: Hanagumi Taisen Columns 2 Dreamcast Puzzle 
pred_class: Great
true_class: Amazing
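
Finally, the challenge asks for a prediction on a brand-new title. A sketch of how the pieces above could be combined for that; the title below is made up, and the input must follow the same "title platform genre [editors_choice]" format used to build the text column:

new_title = pd.Series(['Super Kart Racers 2 PlayStation 4 Racing'])   # hypothetical game

new_word_ids   = convert_X_to_X_word_ids(new_title)
new_padded_seq = pad_sequences(new_word_ids, maxlen=size_of_each_vector, value=0)

pred_probs = model.predict(new_padded_seq)
print('Predicted score_phrase:', le.inverse_transform(np.argmax(pred_probs[0])))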


By Jovian Lin (http://jovianlin.com)