# Siraj's 03 Challenge

### This is a response to the Coding Challenge in: https://youtu.be/si8zZHkufRY

The challenge for this video is to train a model on this dataset of video game reviews from IGN.com. Then, given some new video game title, it should be able to classify it. You can use pandas to parse this dataset. Right now each review has a label that's either Amazing, Great, Good, Mediocre, Painful, or Awful. These are the emotions. Using the existing labels is extra credit. The baseline is to convert the labels so that there are only 2 emotions (positive or negative). Ideally you can use an RNN via TFLearn like the one in this example, but I'll accept other types of ML models as well. You'll learn how to parse data, select appropriate features, and use a neural net on an IRL problem.

The score_phrase labels kept after preprocessing:

• Great
• Good
• Okay
• Mediocre
• Amazing
• Bad
• Awful
• Painful
• Unbearable
• Masterpiece

# Accuracy Results

• Dummy Classifier (i.e. select most frequent class): 0.25631 (25.6%)
• Multinomial Naive Bayes: 0.32355 (32.4%)
• RNN (using tflearn): 0.41546 (41.5%)
In [1]:
import sys
import tensorflow as tf
from termcolor import colored
print(colored('Python Version: %s' % sys.version.split()[0], 'blue'))
print(colored('TensorFlow Ver: %s' % tf.__version__, 'magenta'))

Python Version: 3.5.2
TensorFlow Ver: 0.11.0

In [2]:
n_epoch = int(input('Enter no. of epochs for RNN training: '))

Enter no. of epochs for RNN training: 100

In [3]:
print(colored('No. of epochs: %d' % n_epoch, 'red'))

No. of epochs: 100

In [4]:
import pandas as pd
pd.set_option('display.max_colwidth', 1000)


# Load IGN Dataset as original_ign

In [5]:
original_ign = pd.read_csv('ign.csv')
original_ign.head(10)

Out[5]:
Unnamed: 0 score_phrase title url platform score genre editors_choice release_year release_month release_day
0 0 Amazing LittleBigPlanet PS Vita /games/littlebigplanet-vita/vita-98907 PlayStation Vita 9.0 Platformer Y 2012 9 12
1 1 Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition /games/littlebigplanet-ps-vita-marvel-super-hero-edition/vita-20027059 PlayStation Vita 9.0 Platformer Y 2012 9 12
2 2 Great Splice: Tree of Life /games/splice/ipad-141070 iPad 8.5 Puzzle N 2012 9 12
3 3 Great NHL 13 /games/nhl-13/xbox-360-128182 Xbox 360 8.5 Sports N 2012 9 11
4 4 Great NHL 13 /games/nhl-13/ps3-128181 PlayStation 3 8.5 Sports N 2012 9 11
5 5 Good Total War Battles: Shogun /games/total-war-battles-shogun/mac-142565 Macintosh 7.0 Strategy N 2012 9 11
6 6 Awful Double Dragon: Neon /games/double-dragon-neon/xbox-360-131320 Xbox 360 3.0 Fighting N 2012 9 11
7 7 Amazing Guild Wars 2 /games/guild-wars-2/pc-896298 PC 9.0 RPG Y 2012 9 11
8 8 Awful Double Dragon: Neon /games/double-dragon-neon/ps3-131321 PlayStation 3 3.0 Fighting N 2012 9 11
9 9 Good Total War Battles: Shogun /games/total-war-battles-shogun/pc-142564 PC 7.0 Strategy N 2012 9 11

### Check out the shape of the IGN Dataset

In [6]:
print('original_ign.shape:', original_ign.shape)

original_ign.shape: (18625, 11)


### Check out all the unique score_phrase values and their counts

In [7]:
original_ign.score_phrase.value_counts()

Out[7]:
Great          4773
Good           4741
Okay           2945
Mediocre       1959
Amazing        1804
Bad            1269
Awful           664
Painful         340
Unbearable       72
Masterpiece      55
Disaster          3
Name: score_phrase, dtype: int64

# Data Preprocessing

As always, we need to preprocess the dataset before training our model(s).

### Convert score_phrase to binary sentiments and add a new column called sentiment

In [8]:
bad_phrases = ['Bad', 'Awful', 'Painful', 'Unbearable', 'Disaster']
original_ign['sentiment'] = original_ign.score_phrase.isin(bad_phrases).map({True: 'Negative', False: 'Positive'})

In [9]:
# Remove the rare "Disaster" class (only 3 rows)
original_ign = original_ign[original_ign['score_phrase'] != 'Disaster']

In [10]:
original_ign.head()

Out[10]:
Unnamed: 0 score_phrase title url platform score genre editors_choice release_year release_month release_day sentiment
0 0 Amazing LittleBigPlanet PS Vita /games/littlebigplanet-vita/vita-98907 PlayStation Vita 9.0 Platformer Y 2012 9 12 Positive
1 1 Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition /games/littlebigplanet-ps-vita-marvel-super-hero-edition/vita-20027059 PlayStation Vita 9.0 Platformer Y 2012 9 12 Positive
2 2 Great Splice: Tree of Life /games/splice/ipad-141070 iPad 8.5 Puzzle N 2012 9 12 Positive
3 3 Great NHL 13 /games/nhl-13/xbox-360-128182 Xbox 360 8.5 Sports N 2012 9 11 Positive
4 4 Great NHL 13 /games/nhl-13/ps3-128181 PlayStation 3 8.5 Sports N 2012 9 11 Positive

### No. of Positive Sentiments vs. No. of Negative Sentiments

In [11]:
original_ign.sentiment.value_counts(normalize=True)

Out[11]:
Positive    0.874074
Negative    0.125926
Name: sentiment, dtype: float64
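The ~87/13 split can be reproduced by hand from the class counts (a quick arithmetic check; the negative classes are the ones in `bad_phrases`, minus the removed 'Disaster' rows):

```python
# Sanity-check the Negative share reported above.
# Dataset counts for the negative classes: Bad 1269, Awful 664,
# Painful 340, Unbearable 72 (the 3 'Disaster' rows were removed).
negative = 1269 + 664 + 340 + 72
total = 18625 - 3
print(round(negative / total, 6))  # 0.125926, matching value_counts(normalize=True)
```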

### Check for null elements

In [12]:
original_ign.isnull().sum()

Out[12]:
Unnamed: 0         0
score_phrase       0
title              0
url                0
platform           0
score              0
genre             36
editors_choice     0
release_year       0
release_month      0
release_day        0
sentiment          0
dtype: int64

### Fill all null elements with an empty string

In [13]:
original_ign.fillna(value='', inplace=True)

In [14]:
# original_ign[ original_ign['genre'] == '' ].shape


# Create a new DataFrame called ign

In [15]:
ign = original_ign[ ['sentiment', 'score_phrase', 'title', 'platform', 'genre', 'editors_choice'] ].copy()
ign.head(10)

Out[15]:
sentiment score_phrase title platform genre editors_choice
0 Positive Amazing LittleBigPlanet PS Vita PlayStation Vita Platformer Y
1 Positive Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition PlayStation Vita Platformer Y
2 Positive Great Splice: Tree of Life iPad Puzzle N
3 Positive Great NHL 13 Xbox 360 Sports N
4 Positive Great NHL 13 PlayStation 3 Sports N
5 Positive Good Total War Battles: Shogun Macintosh Strategy N
6 Negative Awful Double Dragon: Neon Xbox 360 Fighting N
7 Positive Amazing Guild Wars 2 PC RPG Y
8 Negative Awful Double Dragon: Neon PlayStation 3 Fighting N
9 Positive Good Total War Battles: Shogun PC Strategy N

### Create a new column called is_editors_choice

In [16]:
ign['is_editors_choice'] = ign['editors_choice'].map({'Y': 'editors_choice', 'N': ''})
ign.head()

Out[16]:
sentiment score_phrase title platform genre editors_choice is_editors_choice
0 Positive Amazing LittleBigPlanet PS Vita PlayStation Vita Platformer Y editors_choice
1 Positive Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition PlayStation Vita Platformer Y editors_choice
2 Positive Great Splice: Tree of Life iPad Puzzle N
3 Positive Great NHL 13 Xbox 360 Sports N
4 Positive Great NHL 13 PlayStation 3 Sports N

### Create a new column called text that combines the contents of several columns

In [17]:
ign['text'] = ign['title'].str.cat(ign['platform'], sep=' ').str.cat(ign['genre'], sep=' ').str.cat(ign['is_editors_choice'], sep=' ')

In [18]:
print('Shape of \"ign\" DataFrame:', ign.shape)

Shape of "ign" DataFrame: (18622, 8)

In [19]:
ign.head(10)

Out[19]:
sentiment score_phrase title platform genre editors_choice is_editors_choice text
0 Positive Amazing LittleBigPlanet PS Vita PlayStation Vita Platformer Y editors_choice LittleBigPlanet PS Vita PlayStation Vita Platformer editors_choice
1 Positive Amazing LittleBigPlanet PS Vita -- Marvel Super Hero Edition PlayStation Vita Platformer Y editors_choice LittleBigPlanet PS Vita -- Marvel Super Hero Edition PlayStation Vita Platformer editors_choice
2 Positive Great Splice: Tree of Life iPad Puzzle N Splice: Tree of Life iPad Puzzle
3 Positive Great NHL 13 Xbox 360 Sports N NHL 13 Xbox 360 Sports
4 Positive Great NHL 13 PlayStation 3 Sports N NHL 13 PlayStation 3 Sports
5 Positive Good Total War Battles: Shogun Macintosh Strategy N Total War Battles: Shogun Macintosh Strategy
6 Negative Awful Double Dragon: Neon Xbox 360 Fighting N Double Dragon: Neon Xbox 360 Fighting
7 Positive Amazing Guild Wars 2 PC RPG Y editors_choice Guild Wars 2 PC RPG editors_choice
8 Negative Awful Double Dragon: Neon PlayStation 3 Fighting N Double Dragon: Neon PlayStation 3 Fighting
9 Positive Good Total War Battles: Shogun PC Strategy N Total War Battles: Shogun PC Strategy

# Here, I'll treat this as a multiclass problem where I attempt to predict the labels (i.e. the score_phrases)

Examples of score_phrases:

• Great
• Good
• Okay
• Mediocre
• Amazing
• Bad
• Awful
• Painful
• Unbearable
• Masterpiece
In [20]:
X = ign.text
y = ign.score_phrase


### Top 10 rows for X

In [21]:
X.head(10)

Out[21]:
0                                 LittleBigPlanet PS Vita PlayStation Vita Platformer editors_choice
1    LittleBigPlanet PS Vita -- Marvel Super Hero Edition PlayStation Vita Platformer editors_choice
2                                                                  Splice: Tree of Life iPad Puzzle
3                                                                            NHL 13 Xbox 360 Sports
4                                                                       NHL 13 PlayStation 3 Sports
5                                                      Total War Battles: Shogun Macintosh Strategy
6                                                             Double Dragon: Neon Xbox 360 Fighting
7                                                                 Guild Wars 2 PC RPG editors_choice
8                                                        Double Dragon: Neon PlayStation 3 Fighting
9                                                             Total War Battles: Shogun PC Strategy
Name: text, dtype: object

### Top 10 rows for y

In [22]:
y.head(10)

Out[22]:
0    Amazing
1    Amazing
2      Great
3      Great
4      Great
5       Good
6      Awful
7    Amazing
8      Awful
9       Good
Name: score_phrase, dtype: object

# Model #0: The DUMMY Classifier (Always Choose the Most Frequent Class)

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.dummy import DummyClassifier
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

In [24]:
vect = TfidfVectorizer(stop_words='english', token_pattern=r'\b\w{2,}\b')
dummy = DummyClassifier(strategy='most_frequent', random_state=0)

dummy_pipeline = make_pipeline(vect, dummy)

In [25]:
dummy_pipeline.named_steps

Out[25]:
{'dummyclassifier': DummyClassifier(constant=None, random_state=0, strategy='most_frequent'),
'tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
stop_words='english', strip_accents=None, sublinear_tf=False,
token_pattern='\\b\\w{2,}\\b', tokenizer=None, use_idf=True,
vocabulary=None)}
In [26]:
# Cross Validation
cv = cross_val_score(dummy_pipeline, X, y, scoring='accuracy', cv=10, n_jobs=-1)
print(colored('\nDummy Classifier\'s Accuracy: %0.5f\n' % cv.mean(), 'yellow'))


Dummy Classifier's Accuracy: 0.25631
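This baseline is simply the share of the most frequent class: always predicting 'Great' (4773 reviews out of the 18622 left after dropping 'Disaster') gives the same number, as a quick arithmetic check shows:

```python
# Most-frequent-class baseline: accuracy of always predicting 'Great'.
baseline = 4773 / (18625 - 3)  # 'Great' count / reviews remaining after cleanup
print(round(baseline, 5))      # 0.25631, matching the cross-validated score
```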



# Model #1: MultinomialNB Classifier

In [27]:
from sklearn.naive_bayes import MultinomialNB

In [28]:
vect = TfidfVectorizer(stop_words='english',
                       token_pattern=r'\b\w{2,}\b',
                       min_df=1, max_df=0.1,
                       ngram_range=(1, 2))
mnb = MultinomialNB(alpha=2)

mnb_pipeline = make_pipeline(vect, mnb)

In [29]:
mnb_pipeline.named_steps

Out[29]:
{'multinomialnb': MultinomialNB(alpha=2, class_prior=None, fit_prior=True),
'tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=0.1, max_features=None, min_df=1,
ngram_range=(1, 2), norm='l2', preprocessor=None, smooth_idf=True,
stop_words='english', strip_accents=None, sublinear_tf=False,
token_pattern='\\b\\w{2,}\\b', tokenizer=None, use_idf=True,
vocabulary=None)}
In [30]:
# Cross Validation
cv = cross_val_score(mnb_pipeline, X, y, scoring='accuracy', cv=10, n_jobs=-1)
print(colored('\nMultinomialNB Classifier\'s Accuracy: %0.5f\n' % cv.mean(), 'green'))


MultinomialNB Classifier's Accuracy: 0.32355



# Model #2: RNN Classifier using TFLearn

In [31]:
import tflearn
from tflearn.data_utils import to_categorical, pad_sequences


### Train-Test-Split

In [32]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)


### Create the vocab (so that we can create X_word_ids from X)

In [33]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(ngram_range=(1,1), token_pattern=r'\b\w{1,}\b')

In [34]:
vect.fit(X_train)
vocab = vect.vocabulary_

In [35]:
def convert_X_to_X_word_ids(X):
    return X.apply(lambda x: [vocab[w] for w in [w.lower().strip() for w in x.split()] if w in vocab])

In [36]:
X_train_word_ids = convert_X_to_X_word_ids(X_train)
X_test_word_ids  = convert_X_to_X_word_ids(X_test)
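A toy illustration of what `convert_X_to_X_word_ids` does, using a made-up five-word vocab (hypothetical ids, not the real `vect.vocabulary_`): each title becomes the list of ids of its in-vocabulary words, and unseen words are silently dropped.

```python
# Hypothetical mini-vocab standing in for vect.vocabulary_
toy_vocab = {'nhl': 0, '13': 1, 'xbox': 2, '360': 3, 'sports': 4}

def to_word_ids(text):
    # lowercase each token, keep only tokens that appear in the vocab
    return [toy_vocab[w] for w in (t.lower().strip() for t in text.split())
            if w in toy_vocab]

print(to_word_ids('NHL 13 Xbox 360 Sports'))     # [0, 1, 2, 3, 4]
print(to_word_ids('NHL 13 PlayStation Sports'))  # [0, 1, 4] -- 'playstation' unseen
```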


### Difference between X(_train/_test) and X(_train_word_ids/_test_word_ids)

In [37]:
X_train.head()

Out[37]:
16138               Castlevania: Harmony of Despair PlayStation 3 Action, Adventure
5945     Kim Possible 2: Drakken's Demise Game Boy Advance Platformer editors_choice
11360                                            Madden NFL 09 PlayStation 2 Sports
18270                                                   WWE 2K16 Xbox One Wrestling
12533                                        The Last Ninja Commodore 64/128 Action
Name: text, dtype: object
In [38]:
X_train_word_ids.head()

Out[38]:
16138                 [3134, 4717, 1911, 5074, 149, 269]
5945     [3730, 5126, 1888, 2799, 1025, 266, 5062, 2227]
11360                   [4037, 4585, 13, 5074, 77, 6257]
18270                      [7447, 136, 7458, 4751, 7439]
12533                      [6674, 3843, 4617, 1533, 251]
Name: text, dtype: object
In [39]:
print('X_train_word_ids.shape:', X_train_word_ids.shape)
print('X_test_word_ids.shape:', X_test_word_ids.shape)

X_train_word_ids.shape: (16759,)
X_test_word_ids.shape: (1863,)


In [40]:
from tflearn.data_utils import to_categorical, pad_sequences

X_train_padded_seqs = pad_sequences(X_train_word_ids, maxlen=20, value=0)
X_test_padded_seqs  = pad_sequences(X_test_word_ids,  maxlen=20, value=0)

In [41]:
print('X_train_padded_seqs.shape:', X_train_padded_seqs.shape)

X_train_padded_seqs.shape: (16759, 20)

In [42]:
pd.DataFrame(X_train_padded_seqs).head()

Out[42]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3134 4717 1911 5074 149 269
1 0 0 0 0 0 0 0 0 0 0 0 0 3730 5126 1888 2799 1025 266 5062 2227
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4037 4585 13 5074 77 6257
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7447 136 7458 4751 7439
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6674 3843 4617 1533 251
In [43]:
pd.DataFrame(X_test_padded_seqs).head()

Out[43]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7380 5261
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2154 5451 5074 149 5062
2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4535 3944 12 5074 149 6257
3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6674 7234 162 1826 4931 2329
4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 251 935 3500 251
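As the two tables show, `pad_sequences` left-pads the shorter id lists with `value=0` up to `maxlen=20`. A minimal pure-Python sketch of that behavior (assuming truncation, when needed, keeps the last `maxlen` ids):

```python
def pad_left(seq, maxlen=20, value=0):
    seq = list(seq)[-maxlen:]                   # keep at most the last maxlen ids
    return [value] * (maxlen - len(seq)) + seq  # left-pad with the fill value

print(pad_left([3134, 4717, 1911, 5074, 149, 269]))
# -> 14 zeros followed by the 6 ids, i.e. the first row of the train table above
```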

### Convert (y) labels to vectors

In [44]:
unique_y_labels = list(y_train.value_counts().index)
unique_y_labels

Out[44]:
['Great',
'Good',
'Okay',
'Mediocre',
'Amazing',
'Bad',
'Awful',
'Painful',
'Unbearable',
'Masterpiece']
In [45]:
len(unique_y_labels)

Out[45]:
10
In [46]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(unique_y_labels)

Out[46]:
LabelEncoder()
In [47]:
# print('')
# print(unique_y_labels)
# print(le.transform(unique_y_labels))
# print('')

In [48]:
print('')
for label_id, label_name in zip(le.transform(unique_y_labels), unique_y_labels):
    print('%d: %s' % (label_id, label_name))
print('')

4: Great
3: Good
7: Okay
6: Mediocre
0: Amazing
2: Bad
1: Awful
8: Painful
9: Unbearable
5: Masterpiece
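The ids look scrambled relative to the frequency-ordered list because `LabelEncoder` assigns ids in alphabetical order of the class names. A pure-Python check reproduces the mapping for the ten labels:

```python
labels = ['Great', 'Good', 'Okay', 'Mediocre', 'Amazing',
          'Bad', 'Awful', 'Painful', 'Unbearable', 'Masterpiece']
id_of = {name: i for i, name in enumerate(sorted(labels))}  # alphabetical ids
print(id_of['Amazing'], id_of['Great'], id_of['Masterpiece'])  # 0 4 5
```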


In [49]:
y_train = to_categorical(y_train.map(lambda x: le.transform([x])[0]), nb_classes=len(unique_y_labels))
y_test  = to_categorical(y_test.map(lambda x:  le.transform([x])[0]), nb_classes=len(unique_y_labels))

In [50]:
y_train[0:3]

Out[50]:
array([[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.],
[ 0.,  0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.],
[ 0.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  0.,  0.]])
In [51]:
print('y_train.shape:', y_train.shape)
print('y_test.shape:', y_test.shape)

y_train.shape: (16759, 10)
y_test.shape: (1863, 10)


### Network Building

In [52]:
size_of_each_vector = X_train_padded_seqs.shape[1]
vocab_size = len(vocab)
no_of_unique_y_labels = len(unique_y_labels)

In [53]:
print('size_of_each_vector:', size_of_each_vector)
print('vocab_size:', vocab_size)
print('no_of_unique_y_labels:', no_of_unique_y_labels)

size_of_each_vector: 20
vocab_size: 7596
no_of_unique_y_labels: 10

In [54]:
#sgd = tflearn.SGD(learning_rate=1e-4, lr_decay=0.96, decay_step=1000)

net = tflearn.input_data([None, size_of_each_vector]) # The first element is the "batch size" which we set to "None"
net = tflearn.embedding(net, input_dim=vocab_size, output_dim=128) # input_dim: vocabulary size
net = tflearn.lstm(net, 128, dropout=0.6) # Set the dropout to 0.6
net = tflearn.fully_connected(net, no_of_unique_y_labels, activation='softmax') # relu or softmax
net = tflearn.regression(net,
                         learning_rate=1e-4,
                         loss='categorical_crossentropy')
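A back-of-the-envelope size for the network defined above: the embedding table alone holds one 128-dimensional vector per vocabulary entry (simple arithmetic with the `vocab_size` printed earlier):

```python
vocab_size, embedding_dim = 7596, 128
print(vocab_size * embedding_dim)  # 972288 trainable embedding weights
```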


### Initialize the Model

In [55]:
#model = tflearn.DNN(net, tensorboard_verbose=0, checkpoint_path='SavedModels/model.tfl.ckpt')
model = tflearn.DNN(net, tensorboard_verbose=0)


### Train the Model

In [56]:
# model.fit(X_train_padded_seqs, y_train,
#           n_epoch=n_epoch,
#           show_metric=True,
#           batch_size=100)


### Manually Save the Model

In [57]:
# model.save('SavedModels/model.tfl')
# print(colored('Model Saved!', 'red'))


In [58]:
model.load('SavedModels/model.tfl')
print(colored('Model Loaded!', 'red'))

Model Loaded!


### RNN's Accuracy

In [59]:
import numpy as np
from sklearn import metrics

In [60]:
pred_classes = [np.argmax(i) for i in model.predict(X_test_padded_seqs)]
true_classes = [np.argmax(i) for i in y_test]

print(colored('\nRNN Classifier\'s Accuracy: %0.5f\n' % metrics.accuracy_score(true_classes, pred_classes), 'cyan'))


RNN Classifier's Accuracy: 0.41546



### Show some predicted samples

In [61]:
ids_of_titles = range(0, 21) # range(X_test.shape[0])

for i in ids_of_titles:
    pred_class = pred_classes[i]
    true_class = np.argmax(y_test[i])

    print(X_test.values[i])
    print('pred_class:', le.inverse_transform(pred_class))
    print('true_class:', le.inverse_transform(true_class))
    print('')

Amy's Jigsaw Scrapbook Wireless Puzzle
pred_class: Good
true_class: Good

DuckTales Remastered PlayStation 3 Platformer
pred_class: Okay
true_class: Good

NBA Live 08 PlayStation 3 Sports
pred_class: Good
true_class: Okay

pred_class: Great
true_class: Great

Action Blox iPhone Action
pred_class: Great
true_class: Okay

Kane & Lynch: Dead Men Xbox 360 Action
pred_class: Good
true_class: Good

Life is Strange -- Episode 4: Dark Room Xbox One Adventure
pred_class: Great
true_class: Okay

pred_class: Good
true_class: Great

Klonoa 2: Dream Champ Tournament Game Boy Advance Platformer
pred_class: Great
true_class: Great

The Walking Dead: A Telltale Game Series -- Season Two PlayStation Vita Adventure
pred_class: Great
true_class: Great

pred_class: Good
true_class: Good

Gods vs. Humans! Wii Strategy
pred_class: Mediocre
true_class: Bad

Worms Revolution PlayStation 3 Strategy
pred_class: Great
true_class: Great

I of the Dragon PC Action, RPG
pred_class: Good
true_class: Bad

Family Guy: Air Griffin Wireless Action
pred_class: Okay
true_class: Bad

Tetris iPod Puzzle
pred_class: Okay
true_class: Good

The Simpsons Wrestling PlayStation Action
pred_class: Good
true_class: Unbearable

Godzilla PlayStation 4 Action
pred_class: Bad
true_class: Bad

New World Order PC Shooter
pred_class: Good
true_class: Bad