DEEP BEERS

PLAYING WITH DEEP RECOMMENDATION ENGINES WITH KERAS

A few years ago, my friend @alexvanacker and I scraped a beer rating website. At the time, I wanted to test different recommendation algorithms. Full disclaimer: I am a bit of a data science beer geek.

More recently, I was advised to follow this excellent class by Charles Ollion and Olivier Grisel to learn more about some specific aspects of deep learning. When I came across the second lab, on factorization machines and deep recommendations, I thought again about the old beer dataset and decided to give it a try.

In the following blog post, I discuss the different experiments I was able to run using keras. My code is more than heavily inspired by the class, so don't be alarmed if you detect obvious copy-paste.

This blog post is organised as follows: we start by introducing the data, then explain the distinction between explicit and implicit recommendations. We begin with the explicit case and describe the basic architectures before having some fun grid searching them. Then we introduce more complex architectures that incorporate item metadata. Finally, we describe an implicit recommender engine using the triplet loss.

The Data

Ratings

Let's start by importing the scraped data.

In [1]:
%pylab inline
import warnings
warnings.filterwarnings("ignore")
Populating the interactive namespace from numpy and matplotlib
In [2]:
import pandas as pd
beer_reviews = pd.read_csv('/data/pgutierrez/beer/beers_reviews_csv.csv.gz', sep=',')
In [3]:
# extract user and beer ids from the scraped urls
beer_reviews['user_id'] = beer_reviews['user_url'].map(lambda x : x.split(".")[1].replace('/',''))
beer_reviews['beer_id'] = beer_reviews['beer_url'].map(lambda x : x.split("/")[-2])
In [4]:
print beer_reviews.shape
(4563152, 9)
In [5]:
beer_reviews.head()
Out[5]:
beer_url score user_url date review rdev scrap_time user_id beer_id
0 http://www.beeradvocate.com//beer/profile/68/2... 4.00 /community/members/capnamerica.719306/ 2014-09-07 01:16:00 NaN -0.7 2014-09-13 04:37:56.471514 719306 24071
1 http://www.beeradvocate.com//beer/profile/68/2... 3.75 /community/members/supermanny.817238/ 2014-09-05 00:00:00 NaN -6.9 2014-09-13 04:37:56.472023 817238 24071
2 http://www.beeradvocate.com//beer/profile/68/2... 3.75 /community/members/mcnazz.719186/ 2014-09-04 00:00:00 NaN -6.9 2014-09-13 04:37:56.472453 719186 24071
3 http://www.beeradvocate.com//beer/profile/68/2... 3.75 /community/members/zenbiscuits.699682/ 2014-09-03 00:00:00 NaN -6.9 2014-09-13 04:37:56.472876 699682 24071
4 http://www.beeradvocate.com//beer/profile/68/2... 4.00 /community/members/jwal.799985/ 2014-09-02 00:00:00 NaN -0.7 2014-09-13 04:37:56.473296 799985 24071

The dataset is composed of 4.5 million reviews. Each review contains a beer id, a user id, and a score between 1 and 5. Some entries also include a complete written review, which I won't use in this blog post, though it would be interesting to integrate it.

Let's get an idea of what the rating distribution looks like.

In [6]:
beer_reviews["score"].describe()
Out[6]:
count    4.563152e+06
mean     3.852421e+00
std      6.533483e-01
min      1.000000e+00
25%      3.500000e+00
50%      4.000000e+00
75%      4.250000e+00
max      5.000000e+00
Name: score, dtype: float64
In [18]:
beer_reviews["score"].hist(bins=10)
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x610d6950>

The median is 4. This is very important: our data is skewed towards high ratings. This is a common bias in internet ratings; people tend to rate items or movies that they liked, and rarely spend time commenting on something they dislike or are indifferent to. This will have a huge impact on the way we model the recommendation problem.

For the algorithm in keras to work, we need to remap all beer and user ids to integers between 0 and the total number of beers / users.

In [19]:
users = beer_reviews.user_id.unique()
user_map = {i:val for i,val in enumerate(users)}
inverse_user_map = {val:i for i,val in enumerate(users)}
beers = beer_reviews.beer_id.unique()
beer_map = {i:val for i,val in enumerate(beers)}
inverse_beer_map = {val:i for i,val in enumerate(beers)}

beer_reviews["user_id"] = beer_reviews["user_id"].map(inverse_user_map)
beer_reviews["old_id"] = beer_reviews["beer_id"] # copying for join with metadata
beer_reviews["beer_id"] = beer_reviews["beer_id"].map(inverse_beer_map)

print "We have %d users"%users.shape[0]
print "We have %d beers"%beers.shape[0] 
We have 91645 users
We have 78518 beers

Note the large number of users and reviews. This makes us ask the following questions: how many ratings do we have per beer? Per user? What do the corresponding distributions look like?

For the users we have:

In [20]:
users_nb = beer_reviews['user_id'].value_counts().reset_index()
users_nb.columns= ['user_id','nb_lines']
users_nb['nb_lines'].describe()
Out[20]:
count    91645.000000
mean        49.791609
std        173.584107
min          1.000000
25%          1.000000
50%          4.000000
75%         23.000000
max       6311.000000
Name: nb_lines, dtype: float64
In [21]:
import seaborn
users_nb['nb_lines'].hist()
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x65fb7250>

Of course the distribution is very skewed, with 50% of people having written no more than 4 reviews... whereas one user went crazy with more than 6000 ratings! This has some implications: for most people we have few beers to characterize them, whereas for at least 25% of users we have more than 23, which is probably enough information to start generating good recommendations.

Now let's have a look at the items:

In [22]:
beers_nb = beer_reviews['old_id'].value_counts().reset_index()
beers_nb.columns= ['old_id','nb_lines']
beers_nb['nb_lines'].describe()
Out[22]:
count    78518.000000
mean        58.115999
std        313.841229
min          1.000000
25%          2.000000
50%          5.000000
75%         18.000000
max      12450.000000
Name: nb_lines, dtype: float64
In [23]:
beers_nb['nb_lines'].hist()
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x65fc6390>

Again, the distribution is very skewed, with 50% of the beers having 5 ratings or less. Even worse, 75% of beers have less than 18 ratings. Though this is expected on the user side, we would have assumed the ratings to be more evenly spread across beers. Let's have a look at the most rated beers. To do so, we need to load some other files containing the metadata.

In [ ]:
a = beer_reviews.dropna()[['beer_id','user_id','score','date','review']]
In [29]:
pd.set_option('display.max_colwidth', 100)
a.head(10)
Out[29]:
beer_id user_id score date review
15 0 15 4.10 2014-08-24 00:00:00 I like this brewer and their Oyster Stout is a great session stout. Perfect last beer of the ev...
24 0 24 4.00 2014-08-17 00:00:00 Pours a rich and dark. If it's not completely black it may as well be because there are no hues ...
67 0 67 3.89 2014-07-11 00:00:00 Appearance - Extremely dark, practically black . Okay head that lasted for a while and would slo...
69 0 69 4.21 2014-07-10 00:00:00 Vintage 2013 Appearance: It has a nice jet black color to it. It has a nice thick creamy tan hea...
72 0 72 3.63 2014-07-09 00:00:00 Appearance- poured from a glass into a guiness style glass. this shit is DARK. not much head, la...
76 0 76 4.13 2014-07-04 00:00:00 A - poured a one finger thick coffee-colored head into snifter that left a thin ring throughout....
96 0 96 3.51 2014-06-22 00:00:00 A: Dark brown to black. Forms a small head that dissappears to nothing. Swirling the beer in my ...
155 0 155 3.76 2014-04-30 00:00:00 2013 Vintage poured into DFH snifter. A - inky black with mocha head S - lots of dark roasted m...
167 0 167 4.43 2014-04-05 00:00:00 A - 4 - Nice dark brown/black beer with a big mocha head on it, leaves nice lacing. S - 4.25 - ...
179 0 179 4.15 2014-03-25 00:00:00 Vintage from 2010. 12oz poured into an oversized tulip. A - Black but not opaque. Small cap of t...
In [30]:
a.shape[0]/float(beer_reviews.shape[0])
Out[30]:
0.2690559069695684

So only about 27% of the reviews come with actual text.

Metadata

In [32]:
# beer metadata
all_info = pd.read_csv('/data/pgutierrez/beer/all_beer_info_complete.csv', sep=',') 
all_info['style'] = all_info['style'].fillna('no_data')
all_info['brewery_country'] = all_info['brewery_country'].fillna('no_data')
all_info['brewery'] = all_info['brewery'].fillna('no_data')
all_info['abv'] = pd.to_numeric(all_info['abv'],errors="coerce")
# abv often has outliers: check Tactical Nuclear Penguin or Sink The Bismarck for examples.
all_info['abv'] = all_info['abv'].fillna(all_info['abv'].median()) 
all_info['beer_id'] = all_info['id'].astype(str).map(inverse_beer_map) # remap

# adding the count 
beers_nb["old_id"] = beers_nb["old_id"].astype(int).values
all_info =  pd.merge(all_info,beers_nb,how='left',left_on='id',right_on='old_id')

# user metadata
users_infos = pd.read_csv('/data/pgutierrez/beer/users.csv.gz', sep=',') 
users_infos = users_infos.fillna('no_data')
In [43]:
all_info.dropna().sample(10)[['name','style','brewery','brewery_country','nb_lines']]
Out[43]:
name style brewery brewery_country nb_lines
49120 Old Guardian - Temecula Red Wine Barrel-Aged American Barleywine Stone Brewing Co. United States 8.0
17330 Hop Common California Common / Steam Beer Peekskill Brewery United States 83.0
65111 A Token Of Our Extreme American Wild Ale Jackie O's Pub & Brewery United States 37.0
24899 Great Lakes Deliverance - Pinot Noir Barrel Aged Saison / Farmhouse Ale Great Lakes Brewery Canada 6.0
44402 Mack-Daddy Chocolate Stout American Stout Outer Banks Brewing Station United States 8.0
9204 Gahan House Harvest Gold Pale Ale American Pale Ale (APA) The Gahan House Canada 7.0
76094 Blond Belgian Pale Ale Brouwerij West United States 25.0
15277 Chili Lager Chile Beer Tommyknocker Brewery United States 8.0
29238 120 Shilling Ale Scotch Ale / Wee Heavy Rock Art Brewery United States 2.0
71928 Wing-Nut Brown Ale American Brown Ale Kodiak Island Brewing Co. United States 2.0
In [50]:
users_infos[['user_id','join_date','occupation','location','birth_year']].head()
Out[50]:
user_id join_date occupation location birth_year
0 724374 Mar 19, 2013 no_data Maryland no_data
1 72431 Mar 31, 2006 no_data Maryland no_data
2 697544 Oct 5, 2012 Engineer Illinois no_data
3 55882 Dec 20, 2005 no_data Massachusetts no_data
4 120748 Feb 4, 2007 no_data Maryland no_data
In [49]:
users_infos.columns
Out[49]:
Index([u'user_id', u'user_name', u'join_date', u'occupation', u'location',
       u'gender', u'birth_year', u'content', u'home_page'],
      dtype='object')

Here are the most rated beers:

In [52]:
all_info[['name','brewery_country','style','nb_lines']].sort('nb_lines',ascending=False).head(10)
Out[52]:
name brewery_country style nb_lines
67170 90 Minute IPA United States American Double / Imperial IPA 12450.0
59285 Founders Breakfast Stout United States American Double / Imperial Stout 11694.0
52528 Two Hearted Ale United States American IPA 10698.0
67171 Pliny The Elder United States American Double / Imperial IPA 10467.0
67172 Hopslam Ale United States American Double / Imperial IPA 9825.0
42703 Old Rasputin Russian Imperial Stout United States Russian Imperial Stout 9620.0
67173 Stone Ruination IPA United States American Double / Imperial IPA 9400.0
52530 Sculpin IPA United States American IPA 9200.0
52529 60 Minute IPA United States American IPA 9017.0
7265 Sierra Nevada Pale Ale United States American Pale Ale (APA) 8650.0

Though I know most of these beers, they do not seem very common to me. This is because of two biases:

  • Most of the people in the dataset come from the USA, which explains why all these beers come from there.
  • Most of the people rating beers on this website have a "beer geek" profile. Hence they will mostly rate beers they liked, so you can expect the most rated beers to be quality beers that are easy to find, like 90 Minute IPA or Sierra Nevada.

Let's have a look at the other metadata.

In [20]:
pd.set_option('display.max_colwidth', 30)
all_info.head()
Out[20]:
style name notes brewery_location availability image_url added_by abv beer_url_x id ... brewery_postal_code brewery_image_url brewery_address Macro Style Macro Style 2 beer_url_y avg_rating beer_id old_id nb_lines
0 Baltic Porter Gonzo Imperial Porter ABV varies. Maryland , United States Rotating http://cdn.beeradvocate.co... BeerAdvocate 9.20 http://www.beeradvocate.co... 24071 ... 21703 http://cdn.beeradvocate.co... 4607 Wedgewood Blvd. English Ales Ale http://www.beeradvocate.co... 4.026906 0.0 24071.0 2479.0
1 Baltic Porter 3Beans A Sixpoint & Friends jam s... New York , United States Limited (brewed once) http://cdn.beeradvocate.co... BeerAdvocate 10.00 http://www.beeradvocate.co... 88889 ... 11231 http://cdn.beeradvocate.co... 40 Van Dyke St English Ales Ale http://www.beeradvocate.co... 4.029175 1.0 88889.0 1151.0
2 Baltic Porter Smuttynose Baltic Porter (... 35 IBU 2007 - 8.1% 2009 -... New Hampshire , United States Winter http://cdn.beeradvocate.co... GratefulBeerGuy 9.24 http://www.beeradvocate.co... 40674 ... 03842 http://cdn.beeradvocate.co... 105 Towle Farm Rd English Ales Ale http://www.beeradvocate.co... 4.267594 2.0 40674.0 1014.0
3 Baltic Porter Baltic Thunder No notes at this time. Pennsylvania , United States Year-round http://cdn.beeradvocate.co... akorsak 8.50 http://www.beeradvocate.co... 40443 ... 19335 http://cdn.beeradvocate.co... 420 Acorn Lane English Ales Ale http://www.beeradvocate.co... 3.866393 3.0 40443.0 890.0
4 Baltic Porter Dark Depths Combination between a Balt... Massachusetts , United States Limited (brewed once) http://cdn.beeradvocate.co... Bierman9 7.60 http://www.beeradvocate.co... 77234 ... 02130-2315 http://cdn.beeradvocate.co... 30 Germania St English Ales Ale http://www.beeradvocate.co... 3.698539 4.0 77234.0 876.0

5 rows × 29 columns

In the beer metadata we have the beer's name, the brewery, where the beer comes from, its style and its abv. I also computed a mean rating to get the top and worst beers. To give stable results, let's keep only beers rated more than 500 times.
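As a side note, this kind of per-beer average rating and review count can be recomputed directly from the reviews with a groupby. A minimal sketch on a toy dataframe (made-up values, same column names as beer_reviews):

```python
import pandas as pd

# toy stand-in for beer_reviews (made-up scores)
reviews = pd.DataFrame({'beer_id': [0, 0, 0, 1, 1],
                        'score':   [4.0, 4.5, 3.5, 2.0, 3.0]})

# average rating and number of reviews per beer
stats = reviews.groupby('beer_id')['score'].agg(['mean', 'count'])
stats.columns = ['avg_rating', 'nb_lines']
```

On the real data, this reproduces the avg_rating / nb_lines information used in the tables below.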

In [54]:
all_info[all_info['nb_lines']>=500][["name",'style','avg_rating']].sort('avg_rating').head(5)
Out[54]:
name style avg_rating
491 Natural Light Light Lager 1.673766
74533 Keystone Ice American Adjunct Lager 1.699634
74522 Natural Ice American Adjunct Lager 1.736900
504 Milwaukee's Best Light Light Lager 1.785730
502 Miller 64 Light Lager 1.788587

I have to say, I never tried any of these beers. You can't really find light beers in France. Let's have a look at the top ones:

In [56]:
all_info[all_info['nb_lines']>=500][["name",'style','avg_rating']].sort('avg_rating',ascending=False).head(5)
Out[56]:
name style avg_rating
59364 Hunahpu's Imperial Stout - Double Barrel Aged American Double / Imperial Stout 4.780888
67174 Heady Topper American Double / Imperial IPA 4.731133
42741 Bourbon Barrel Aged Vanilla Bean Dark Lord Russian Imperial Stout 4.726511
59375 Barrel-Aged Abraxas American Double / Imperial Stout 4.716880
67345 King Sue American Double / Imperial IPA 4.704760

I have to say, I have no idea what these beers are. This may be because the best beers are probably craft beers, and hence less widely distributed. Let's go for beers rated more than 5000 times to see what we get.

In [61]:
all_info[all_info['nb_lines']>=5000][["name",'style','avg_rating']].sort('avg_rating',ascending=False).head(5)
Out[61]:
name style avg_rating
67174 Heady Topper American Double / Imperial IPA 4.731133
67171 Pliny The Elder American Double / Imperial IPA 4.650610
59286 Founders KBS (Kentucky Breakfast Stout) American Double / Imperial Stout 4.625228
7266 Zombie Dust American Pale Ale (APA) 4.616031
59287 Bourbon County Brand Stout American Double / Imperial Stout 4.583207

Now, I do remember filling a suitcase with 90 Minute IPAs. This is good stuff!

We now have a good overview of the data. Let's start recommending some beers!

Explicit feedback Recommender System

You can learn more about the different types of neural recommender systems in Ollion and Grisel's slides.

Basically, explicit feedback is when your users voluntarily give you the information: here, for example, we have explicit beer ratings. However, in many cases we don't have this information, and we can instead rely on implicit feedback found in user behaviour. For example, when you type a google query, you do not notify google of the results' relevance. However, you do click on some links and spend time on those pages. That's implicit feedback. In our case, it would be the list of beers people drank (1 for drank, else 0).
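To make the distinction concrete, here is a minimal sketch of how implicit positives could be derived from our ratings table: we only keep which (user, beer) pairs exist and drop the score (toy dataframe, same column names as beer_reviews):

```python
import pandas as pd

# toy stand-in for beer_reviews: in the implicit setting the score is ignored
ratings = pd.DataFrame({'user_id': [0, 0, 1],
                        'beer_id': [10, 11, 10],
                        'score':   [4.0, 2.5, 5.0]})

# implicit positives: 1 if the user drank (i.e. rated) the beer
positives = ratings[['user_id', 'beer_id']].drop_duplicates()
positives['drank'] = 1
```

All unobserved (user, beer) pairs implicitly count as 0.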

In the following, we will start with the explicit case. This boils down to a regression problem where we try to predict the ratings. This means that we will recommend to a user the beers that he is likely to rate well if he drinks them.

To evaluate the model, we will randomly separate the data into a training and a test set. Note that we could do things more properly by splitting each user's ratings by increasing timestamp. This would allow us to answer the question: what will the next beers drunk be? We leave this for further analysis.
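For reference, such a temporal split could look like the following sketch: sort each user's reviews by date and hold out the most recent one(s) as test (toy dataframe with made-up dates; below we stick to the random split):

```python
import pandas as pd

# toy stand-in for beer_reviews, with made-up dates
ratings = pd.DataFrame({'user_id': [0, 0, 0, 1, 1],
                        'beer_id': [1, 2, 3, 1, 4],
                        'date': pd.to_datetime(['2014-01-01', '2014-02-01',
                                                '2014-03-01', '2014-01-15',
                                                '2014-04-01'])})

# rank each user's reviews from most recent (0) to oldest
ratings = ratings.sort_values('date')
ratings['rank_from_end'] = ratings.groupby('user_id').cumcount(ascending=False)

# hold out each user's most recent review as the test set
train_part = ratings[ratings['rank_from_end'] > 0]
test_part = ratings[ratings['rank_from_end'] == 0]
```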

In [24]:
from sklearn.cross_validation import train_test_split

ratings_train, ratings_test = train_test_split(
    beer_reviews, test_size=0.2, random_state=0)
/data/pgutierrez/dataiku-dss-4.0.6/python.packages/sklearn/cross_validation.py:44: DeprecationWarning: This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.
  "This module will be removed in 0.20.", DeprecationWarning)
In [62]:
""" NO NEED TO SHOW """
import tensorflow as tf
from keras.backend.tensorflow_backend import set_session
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction 
config.gpu_options.visible_device_list = "0" # "0,1"
set_session(tf.Session(config=config))
Using TensorFlow backend.

Matrix Factorization approach

Let's do some imports from keras

In [63]:
from keras.models import Model
from keras.layers import Input
from keras.layers.core import Reshape
from keras.layers.merge import Multiply
from keras.layers.merge import Dot
from keras.layers.embeddings import Embedding
from keras import optimizers

The simplest model is based on a matrix factorization approach. We create an embedding for the users and one for the items. The dot product between a user vector and an item vector gives us the predicted rating.
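In plain numpy terms: if U is the (n_users × k) user embedding matrix and V the (n_items × k) item embedding matrix, the predicted rating matrix is simply U·Vᵀ. A toy illustration with random embeddings (sizes are made up):

```python
import numpy as np

rng = np.random.RandomState(0)
n_users, n_items, k = 4, 3, 10   # toy sizes, not the real dataset
U = rng.randn(n_users, k)        # user embeddings
V = rng.randn(n_items, k)        # item embeddings

# predicted rating for one (user, item) pair: dot product of their rows
pred = U[1].dot(V[2])

# full predicted rating matrix
all_preds = U.dot(V.T)           # shape (n_users, n_items)
```

The keras model below learns such U and V so that these dot products match the observed scores in mse.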

In keras, we can define our model this way:

In [64]:
user_id_input = Input(shape=[1], name='user')
item_id_input = Input(shape=[1], name='item')

embedding_size = 30
user_embedding = Embedding(output_dim=embedding_size, input_dim=users.shape[0],
                           input_length=1, name='user_embedding')(user_id_input)
item_embedding = Embedding(output_dim=embedding_size, input_dim=beers.shape[0],
                           input_length=1, name='item_embedding')(item_id_input)

user_vecs = Reshape([embedding_size])(user_embedding)
item_vecs = Reshape([embedding_size])(item_embedding)

y = Dot(1, normalize=False)([user_vecs, item_vecs])

model = Model(inputs=[user_id_input, item_id_input], outputs=y)

model.compile(loss='mse',
              optimizer="adam"
             )

Keras provides the following nice graph rendering:

In [65]:
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
import pydot
SVG(model_to_dot(model).create(prog='dot', format='svg'))
Couldn't import dot_parser, loading of dot files will not be possible.
Out[65]:
[Model graph: user and item InputLayers → user_embedding / item_embedding → Reshapes → Dot]

To save the different models, I used the keras ModelCheckpoint callback.

In [114]:
import time
from keras.callbacks import ModelCheckpoint
mainpath = '/data.nfs/pgutierrez/beer_reco'
save_path = mainpath + "/models"
mytime = time.strftime("%Y_%m_%d_%H_%M")
modname = 'matrix_facto_50_' + mytime 
thename = save_path + '/' + modname + '.h5'
mcheck = ModelCheckpoint(thename  , monitor='val_loss', save_best_only=True)

Now we can train the model.

In [ ]:
history = model.fit([ratings_train["user_id"], ratings_train["beer_id"]]
                    , ratings_train["score"]
                    , batch_size=64, epochs=10
                    , validation_split=0.1
                    , callbacks=[mcheck]
                    , shuffle=True)

import pickle
with open(mainpath + '/histories/' + modname + '.pkl' , 'wb') as file_pi:
    pickle.dump(history.history, file_pi)

And look at the corresponding history

In [26]:
histories = ['matrix_facto_302017_10_09_20_05.pkl']

import pickle
mainpath = '/data.nfs/pgutierrez/beer_reco'
for val in histories:
    with open(mainpath + '/histories/' + val , 'rb') as file_pi:
        thepickle = pickle.load(file_pi)
        plot(thepickle["loss"],label ='loss_' + val,linestyle='--')
        plot(thepickle["val_loss"],label='val_loss' + val)
plt.legend()
plt.ylim(0, 1)

pd.DataFrame(thepickle,columns =['loss','val_loss']).head(20).transpose()
Out[26]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
loss 3.282131 0.697876 0.467870 0.368187 0.309391 0.270531 0.242532 0.220680 0.203654 0.190286 0.180526 0.173244 0.168040 0.163784 0.160461 0.157619 0.155266 0.152941 0.151294 0.149745
val_loss 1.010002 0.689172 0.583937 0.532542 0.505037 0.490072 0.479538 0.472309 0.467850 0.466473 0.465188 0.466323 0.466727 0.469903 0.471110 0.472099 0.473838 0.477046 0.477645 0.478870

The training loss stabilizes around 0.15. After 10 epochs, the model starts overfitting, giving us a best mse validation loss of around 0.465. A quick grid search on the embedding sizes gives us:

In [24]:
histories = ['matrix_facto_10_2017_10_10_12_12.pkl','matrix_facto_302017_10_09_20_05.pkl','matrix_facto_50_2017_10_10_14_55.pkl']

import pickle
mainpath = '/data.nfs/pgutierrez/beer_reco'
for i,val in enumerate(histories):
    with open(mainpath + '/histories/' + val , 'rb') as file_pi:
        thepickle = pickle.load(file_pi)
        plot(thepickle["loss"][:20],label ='loss_' + val,linestyle="--")
        plot(thepickle["val_loss"][:20],label='val_loss' + val)
plt.legend()
a= plt.ylim(0, 1)

This shows that choosing large embedding sizes actually leads to overfitting. Hence, for most of the experiments we will keep an embedding size of 10 (giving us around 0.42 validation mse).

You may have noticed that we are using the internal keras validation split instead of our test set to evaluate our models. This is because we are going to grid search many parameters and architectures. Since we are exploring and will follow the most promising leads, we are prone to overfitting manually. The test set will be kept to verify the quality of the recommendations at the end of this part.

Now, let's go deeper.

Going deeper

The architecture above tries to predict a rating by performing a dot product. We can relax this assumption and instead use a concatenate layer followed by dense layers. This means that instead of relying on a simple dot product, the network can learn by itself how it wants to combine the information.

With a two-layer deep neural network, this gives us in keras:

In [68]:
from keras.models import Model
from keras.layers import Input
from keras.layers.core import Reshape, Dropout, Dense
from keras.layers.merge import Multiply, Dot
from keras.layers.embeddings import Embedding
from keras.layers.merge import Concatenate
from keras import optimizers
In [69]:
user_id_input = Input(shape=[1], name='user')
item_id_input = Input(shape=[1], name='item')

embedding_size = 10 # 5
user_embedding = Embedding(output_dim=embedding_size, input_dim=users.shape[0],
                           input_length=1, name='user_embedding')(user_id_input)
item_embedding = Embedding(output_dim=embedding_size, input_dim=beers.shape[0],
                           input_length=1, name='item_embedding')(item_id_input)

user_vecs = Reshape([embedding_size])(user_embedding)
item_vecs = Reshape([embedding_size])(item_embedding)

input_vecs = Concatenate()([user_vecs, item_vecs])

x = Dense(128, activation='relu')(input_vecs)
# x = Dense(128, activation='relu')(x)

y = Dense(1)(x)


model = Model(inputs=[user_id_input, item_id_input], outputs=y)
model.compile(optimizer='adam', loss='mse')  
In [70]:
SVG(model_to_dot(model).create(prog='dot', format='svg'))
Out[70]:
[Model graph: user and item InputLayers → Embeddings → Reshapes → Concatenate → Dense(128) → Dense(1)]

Running the model we obtain the following curve:

In [27]:
histories = ['dense_1_128_10_2017_10_12_11_29.pkl']

import pickle
mainpath = '/data.nfs/pgutierrez/beer_reco'
for val in histories:
    with open(mainpath + '/histories/' + val , 'rb') as file_pi:
        thepickle = pickle.load(file_pi)
        plot(thepickle["loss"][:20],label ='loss_' + val,linestyle='--')
        plot(thepickle["val_loss"][:20],label='val_loss' + val)
plt.legend()
#plt.ylim(0, 1)

pd.DataFrame(thepickle,columns =['loss','val_loss']).transpose()
Out[27]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
loss 0.269783 0.210583 0.204253 0.200291 0.197833 0.196008 0.194539 0.19332 0.192284 0.191359 0.190491 0.189853 0.189228 0.188460 0.188130 0.187473 0.186994 0.186573 0.186222 0.185809
val_loss 0.211035 0.207652 0.205352 0.205727 0.209358 0.206124 0.205076 0.20627 0.205641 0.211541 0.220987 0.210864 0.211884 0.213006 0.206713 0.210470 0.214230 0.214933 0.213901 0.210559

And when comparing with the previous model, this gives us:

In [29]:
histories = ['dense_1_128_10_2017_10_12_11_29.pkl','matrix_facto_10_2017_10_10_12_12.pkl']

import pickle
mainpath = '/data.nfs/pgutierrez/beer_reco'
for val in histories:
    with open(mainpath + '/histories/' + val , 'rb') as file_pi:
        thepickle = pickle.load(file_pi)
        plot(thepickle["loss"][:20],label ='loss_' + val,linestyle='--')
        plot(thepickle["val_loss"][:20],label='val_loss' + val)
plt.legend()
lim = plt.ylim(0, 1)

Obviously, the performance got way better! From 0.42 to 0.205 validation loss. We can also notice the following points:

  • We converge really fast to the best model. After one or two epochs, the model starts overfitting, or at least the validation loss does not seem to steadily go down anymore.
  • When comparing to the previous model, we almost manage to match the training error with our validation error! This may mean that we are close to reaching the best possible validation error.

We can also grid search around this architecture. What happens if we add another layer on top of the first one? What happens if we decrease the embedding size? (These modifications are commented in the code above.)

In [169]:
histories = ['dense_1_128_10_2017_10_12_11_29.pkl'
             ,'dense_2_128_10_2017_10_12_13_46.pkl','dense_1_128_05_2017_10_12_14_40.pkl']

plt.figure(figsize=(30,8))
import pickle
mainpath = '/data.nfs/pgutierrez/beer_reco'
for val in histories:
    with open(mainpath + '/histories/' + val , 'rb') as file_pi:
        thepickle = pickle.load(file_pi)
        plt.subplot(121)
        plot(thepickle["loss"][:10],label ='loss_' + val)
        plt.legend(fontsize=20)
        plt.subplot(122)
        plot(thepickle["val_loss"][:10],label='val_loss' + val)
        plt.legend(fontsize=20)
plt.subplot(121)
a = plt.ylim(0.15, 0.3)
plt.subplot(122)
a = plt.ylim(0.18, 0.25)

We can see that, unexpectedly, adding a layer does not help much (val_loss of dense_2). Conversely, simplifying the model by reducing the embedding size does not help either.

Now before going further in the architecture grid search, let's get a grasp of what the model does by looking at our generated embeddings.

Visualizing embeddings

Having a look at similar beers

The first thing we can do is visualize the closest beers to a given list, to see if they match our expectations. Let's use the following list:

In [90]:
data=np.array([
    ['Coors light',837,"Oh my god why."],
    ['Heineken',246,"Euro basic blond beer. To be drunk in the sun, or if in the north/east, add Picon."],
    ['Leffe Blonde', 2137,"My personally most hated beer. Entry-level Belgian beer, way too widespread in France."],
    ["Lindemans Kriek",600,"Beer with fruits. Sugary."],
    ["Chimay Bleue",2512,"Probably the most well known Trappist beer. Entry point for most beer lovers in France."],
    ["Lagunitas IPA",916,"Well known USA IPA. Difficult to find in Europe."],
    ["Firestone double Jack",50697,"Double IPA from Firestone. Very bitter. Awesome beer. Probably for connoisseurs."],
    ["Tsarina Esra",40959,"Imperial stout. You can't get more beer geek than this."]])
mybeers = pd.DataFrame(data=data,columns = ['name','id','Description'])
mybeers['id'] = mybeers['id'].map(inverse_beer_map).astype(int)

pd.set_option('display.max_colwidth', -1)
mybeers
Out[90]:
name id Description
0 Coors light 475 Oh my god why.
1 Heineken 16887 Euro basic blond beer. To be drunk in the sun, or if in the north/east, add Picon.
2 Leffe Blonde 73044 My personally most hated beer. Entry-level Belgian beer, way too widespread in France.
3 Lindemans Kriek 39850 Beer with fruits. Sugary.
4 Chimay Bleue 15782 Probably the most well known Trappist beer. Entry point for most beer lovers in France.
5 Lagunitas IPA 50447 Well known USA IPA. Difficult to find in Europe.
6 Firestone double Jack 64614 Double IPA from Firestone. Very bitter. Awesome beer. Probably for connoisseurs.
7 Tsarina Esra 50 Imperial stout. You can't get more beer geek than this.

And let's define a function to get their closest beers in the embedding space.

In [91]:
"""NO SHOW THAT
I SHOULD HAVE THIS MORE PROPER SOMEWHERE"""
# getting the mapping
beer_infos = pd.read_csv('/data/pgutierrez/beer/all_beer_info.csv.gz', sep=',')
beer_infos['beer_id'] = beer_infos['beer_url'].map(lambda x : x.split("/")[-2])
beer_infos["map_id"]=beer_infos["beer_id"].map(inverse_beer_map)
namesdic = {row[1]['map_id']:row[1]['name']  for row in beer_infos.iterrows()}
In [92]:
EPSILON = 1e-07

def cosine(x, y):
    dot_pdt = np.dot(x, y.T)
    norms = np.linalg.norm(x) * np.linalg.norm(y)
    return dot_pdt / (norms + EPSILON)


# Computes cosine similarities between x and all item embeddings
def cosine_similarities(x,embeddings):
    dot_pdt = np.dot(embeddings, x)
    norms = np.linalg.norm(x) * np.linalg.norm(embeddings,axis = 1)
    return dot_pdt / (norms + EPSILON)

# Computes euclidean distances between x and all item embeddings
def euclidean_distances(x,embeddings):  
    return np.linalg.norm(embeddings - x,axis=1)

# Computes top_n most similar items to an idx
def most_similar(idx, embeddings, top_n=10, euclidian=False):
    if euclidian:
        # euclidean distance between idx and the rest (smaller = closer)
        distance = euclidean_distances(embeddings[idx], embeddings)
        order = distance.argsort()
    else:
        # cosine similarity between idx and the rest (larger = closer)
        distance = cosine_similarities(embeddings[idx], embeddings)
        order = (-distance).argsort()
    order = [x for x in order if x != idx][:top_n]
    return list(zip([namesdic[x] for x in order], distance[order]))

Let's get the closest beers for the matrix factorization model obtained with the dot layer:

In [93]:
# the embeddings are the first layer weights
from keras.models import load_model
load_path = "/data.nfs/pgutierrez/beer_reco/models/"
model = load_model(load_path+'matrix_facto_10_2017_10_10_12_12.h5')
weights = model.get_weights()
user_embeddings = weights[0]
item_embeddings = weights[1]
print "weights shapes",[w.shape for w in weights]
weights shapes [(91645, 10), (78518, 10)]
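Note that indexing get_weights() by position is fragile: the ordering depends on how the model graph was built. A more robust pattern is to name the embedding layers at construction time and fetch them by name. Below is a minimal sketch of that pattern, assuming hypothetical layer names 'user_embedding' and 'item_embedding' (the models in this post were not necessarily built with these names) and using the tf.keras API:

```python
import numpy as np
from tensorflow import keras

n_users, n_items, emb_size = 100, 50, 10

# a minimal matrix factorization model with *named* embedding layers
user_in = keras.layers.Input(shape=(1,), name='user')
item_in = keras.layers.Input(shape=(1,), name='item')
user_vec = keras.layers.Flatten()(
    keras.layers.Embedding(n_users, emb_size, name='user_embedding')(user_in))
item_vec = keras.layers.Flatten()(
    keras.layers.Embedding(n_items, emb_size, name='item_embedding')(item_in))
score = keras.layers.Dot(axes=1)([user_vec, item_vec])
model = keras.Model([user_in, item_in], score)

# fetch the embedding matrices by layer name instead of by position
user_embeddings = model.get_layer('user_embedding').get_weights()[0]
item_embeddings = model.get_layer('item_embedding').get_weights()[0]
print(user_embeddings.shape, item_embeddings.shape)  # (100, 10) (50, 10)
```

This way the extraction code keeps working even if extra layers are inserted before the embeddings, which is exactly what happens when moving from the dot model to the deeper architectures.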
In [94]:
from IPython.core import display as ICD
dataframes = []
for i,row in enumerate(mybeers.iterrows()):
    row = row[1]
    similars = pd.DataFrame(most_similar(row["id"],item_embeddings,top_n=10,euclidian= False))
    similars.columns = [row["name"]+' Closest',row["name"]+' Score' ]
    dataframes.append(similars)
    if i % 2 == 1:
        final = pd.concat(dataframes,axis=1)
        ICD.display(final)
        dataframes=[]
Coors light Closest Coors light Score Heineken Closest Heineken Score
0 Bud Light 0.978795 Superior 0.959585
1 Miller Lite 0.975721 Beck's 0.958840
2 Michelob Ultra 0.970546 Chang Beer (Export) 0.953909
3 Budweiser Select 0.969926 Brown Fox Session Ale 0.952309
4 Bud Light Platinum 0.963584 Amstel Lager 0.949430
5 Budweiser Select 55 0.957238 Maibock 0.944654
6 Michelob Light 0.953016 Birra Peroni 0.942317
7 Tecate Light 0.951688 Rolling Rock Extra Pale 0.942249
8 Kirin Light Beer 0.951604 Sol 0.941787
9 Busch Light 0.946506 Asahi Super Dry 0.939921
Leffe Blonde Closest Leffe Blonde Score Lindemans Kriek Closest Lindemans Kriek Score
0 Korova Milk Porter 0.988748 River Horse Belgian Freeze Belgian Style Winter Ale 0.988397
1 Hoppopotamus I.P.A. 0.985438 Stoudt's Triple (Belgian Abbey-Style Ale) 0.987153
2 Big Eddy Russian Imperial Stout 0.983687 Fullsuit Belgian-Style Brown Ale 0.986378
3 Nine Voices 0.981028 Big Eddy Wee Heavy Scotch Ale 0.984225
4 Immort Ale 0.980868 Winter Warmer 0.983860
5 Raison D'Être 0.977779 Crush Beer 0.980558
6 Samuel Adams Chocolate Bock 0.975969 Petrus Blond 0.978730
7 Trikini 0.972123 Old Double Bagger (ODB) 0.978404
8 Innis And Gunn Oak Aged Beer 0.971600 Irish Setter Red 0.977576
9 CBC Classic Lager 0.967656 Leffe De Noël 0.977054
Chimay Bleue Closest Chimay Bleue Score Lagunitas IPA Closest Lagunitas IPA Score
0 Chimay Tripel (White) 0.993961 Devil Dog Imperial IPA 0.983049
1 Ayinger Celebrator Doppelbock 0.992905 Ranch Double IPA 0.982516
2 Trappistes Rochefort 6 0.991883 A Little Sumpin' Sumpin' Ale 0.981847
3 Schneider Weisse Tap 6 Unser Aventinus 0.991332 DirtWolf 0.979956
4 Chimay Première (Red) 0.991170 Bell's Oberon Ale 0.978370
5 La Chouffe 0.990845 Great Lakes Christmas Ale 0.978363
6 Rare Vos (Amber Ale) 0.990526 Maierfest Lager 0.977886
7 Pranqster 0.990228 New Holland Imperial Hatter 0.977791
8 St. Bernardus Pater 6 0.989876 Sculpin IPA 0.977255
9 Westmalle Trappist Dubbel 0.989362 Hopslam Ale 0.976874
Firestone double Jack Closest Firestone double Jack Score Tsarina Esra Closest Tsarina Esra Score
0 Myrcenary Double IPA 0.995100 Santa's Little Helper 2012 0.992227
1 Narwhal Imperial Stout 0.994915 Klokke Roeland 0.991870
2 Apex 0.994892 Flower Child IPA 0.991121
3 Espresso Imperial Russian Stout 0.994810 Panil Barriquée (Sour Version) 0.990704
4 Modus Hoperandi 0.994788 De Viento 0.990359
5 D.O.R.I.S. The Destroyer Double Imperial Stout 0.994764 Fusion 18 0.989791
6 Green Flash Imperial India Pale Ale 0.994161 Le Capitaine 0.989658
7 Café Racer 15 0.993976 Great Lakes RoboHop 0.989278
8 The Devil Made Me Do It! Coffee Imperial Oatmeal Porter 0.993843 I See A Darkness 0.989091
9 NightTime 0.993701 Brown Barleywine TBC Collaboration 0.989078

Let's have a look at the results:

  • For Coors Light, we do pick up similar light beers, as well as the Budweiser assortment.
  • For Heineken, the results match our expectations. We find mostly blond lagers from all over the world: Birra Peroni from Italy, Asahi from Japan, Beck's from Germany, Sol and Superior from Mexico, or Amstel from the Netherlands (and Amstel actually belongs to Heineken). It is interesting to see that most of these beers are not from the USA, whereas equivalents exist. This is because our users are mostly from the USA, so Heineken is thought of as a foreign beer, leading to a high similarity to other foreign beers.

Indeed if we have a look at the country distribution:

In [95]:
users_infos = pd.read_csv('/data/pgutierrez/beer/users.csv.gz', sep=',')
users_infos = users_infos.fillna('no_data')
country_count = users_infos.location.value_counts().reset_index()
country_count.columns = ["location","nb_users"]
country_count['perc_users'] = country_count['nb_users'].astype(float)/country_count['nb_users'].sum()
country_count['cum_perc_users'] = country_count['perc_users'].cumsum()
country_count.head(15)
Out[95]:
location nb_users perc_users cum_perc_users
0 no_data 16269 0.186059 0.186059
1 California 7016 0.080238 0.266297
2 Pennsylvania 5488 0.062763 0.329060
3 New York 4247 0.048570 0.377630
4 Illinois 4220 0.048262 0.425892
5 Massachusetts 3854 0.044076 0.469968
6 Texas 3342 0.038220 0.508188
7 Ohio 2894 0.033097 0.541285
8 Florida 2652 0.030329 0.571615
9 Michigan 2394 0.027379 0.598994
10 New Jersey 2262 0.025869 0.624863
11 Virginia 2125 0.024302 0.649165
12 North Carolina 2031 0.023227 0.672392
13 Minnesota 1792 0.020494 0.692887
14 Indiana 1575 0.018012 0.710899

So we have around 20% of unknown locations, and more than half of the users are American.

In fact, if we look at the non-US entries in the list, we get:

  • the state of Ontario (Canada) as the first entry, with 718 users
  • the United Kingdom and Australia as the first two countries, with 348 and 306 users respectively
  • France in 69th position, with 62 users.

Back to our results on closest beers:

  • For Leffe, the results are less interpretable. It seems to be drunk alongside other styles of beer like IPAs or stouts, while it is a classic Belgian beer. If all users were French, we would probably find Grimbergen or Affligem instead.
  • For Lindemans, the results are not that good either. We find that it is associated with some winter or Belgian beers (Petrus for example), but we totally miss the fruit aspect.
  • Chimay is different. The matches make a lot of sense. First because we find the two other Chimays (White and Red). Then because we find other Trappist beers (Rochefort, Westmalle) or Belgian beers (La Chouffe, St. Bernardus). From my perspective, recommending these beers to a Chimay drinker would indeed be smart... unless it's way too obvious.
  • I'll let American beer fans comment on the Lagunitas. I know Sculpin, which seems a good match, but my American IPA culture is not strong enough. We do see mostly American IPAs and ales though, which seems right.
  • Finally, for the Firestone Double Jack and the Tsarina, we find mostly beer-geek beers: double IPAs, imperial stouts and unusual beers.

Conclusion: overall we do have good matches, and the closest beers seem to make sense. It is possible to use the euclidean distance instead of the cosine similarity, but the results are very similar. The unexpected behaviour (like that of Leffe) seems mostly due to the bias of the dataset, since most users are American. Another bias we should be aware of is that most users are likely to be beer geeks, and thus the Coors Light type of beers will always be rated low.
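The closeness of the euclidean and cosine rankings is easy to justify: for L2-normalized vectors, the squared euclidean distance is 2 − 2·cos, a monotone function of the cosine similarity, so the two orderings coincide exactly. Our embeddings are not normalized, which is why the rankings are only similar rather than identical. A quick numerical check of the normalized case on random vectors:

```python
import numpy as np

rng = np.random.RandomState(0)
emb = rng.randn(50, 10)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)  # L2-normalize each row

x = emb[0]
cos = emb.dot(x)                        # cosine similarity (all norms are 1)
euc = np.linalg.norm(emb - x, axis=1)   # euclidean distance

# the identity d^2 = 2 - 2*cos holds for unit vectors ...
assert np.allclose(euc ** 2, 2 - 2 * cos)
# ... so ranking by -cos or by distance gives the same order
print(np.array_equal(np.argsort(-cos), np.argsort(euc)))
```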

What happens if we now check the embeddings retrieved by a deeper model?

In [96]:
# the embeddings are the first layer weights
from keras.models import load_model
load_path = "/data.nfs/pgutierrez/beer_reco/models/"
model = load_model(load_path+'dense_1_128_10_2017_10_12_11_29.h5')
weights = model.get_weights()
user_embeddings = weights[0]
item_embeddings = weights[1]
print "weights shapes",[w.shape for w in weights]
weights shapes [(91645, 10), (78518, 10), (20, 128), (128,), (128, 1), (1,)]
In [97]:
mybeers
Out[97]:
name id Description
0 Coors light 475 Oh my god why.
1 Heineken 16887 Euro basic blond beer. To be drank in the sun, or if in the north/east, add Picon.
2 Leffe Blonde 73044 My personally most hated beer. Entry belgium beer way too much wildspread in France.
3 Lindemans Kriek 39850 Beer with fruits. Sugary.
4 Chimay Bleue 15782 Probably the most well known Trappist beer. Entry point for most beer lovers in France.
5 Lagunitas IPA 50447 Well known USA IPA. Difficult to find in Europe
6 Firestone double Jack 64614 Double IPA from Firestone. Very bitter. Awesome beer. Probably for connoisseurs.
7 Tsarina Esra 50 Imperial stout. You can't go much more beer geek than this.
In [98]:
mybeers2 = mybeers[mybeers['id'].isin([475,16887,15782,50447])]
dataframes = []
for i,row in enumerate(mybeers2.iterrows()):
    row = row[1]
    similars = pd.DataFrame(most_similar(row["id"],item_embeddings,top_n=10,euclidian= False))
    similars.columns = [row["name"]+' Closest',row["name"]+' Score' ]
    dataframes.append(similars)
    if i % 2 == 1:
        final = pd.concat(dataframes,axis=1)
        ICD.display(final)
        dataframes=[]
Coors light Closest Coors light Score Heineken Closest Heineken Score
0 Milwaukee's Best Ice 0.996696 Bavaria Beer / Pilsener 0.991667
1 Budweiser Select 0.995665 Mr. Punch 0.991157
2 Natural Light 0.995015 Budweiser 0.991082
3 Miller 64 0.994731 Foster's Export 0.990888
4 Michelob Ultra 0.994584 Boddingtons Pub Ale 0.990460
5 Miller Chill 0.994191 Žaibo 0.990229
6 Bud Light 0.994100 Key West Sunset Ale 0.989983
7 Icehouse 0.994055 San Miguel Especial 0.989482
8 Milwaukee's Best Premium 0.991707 Windy Gap Wheat 0.988779
9 Busch Beer 0.991293 Dabob Bay India Pale Ale 0.986460
Chimay Bleue Closest Chimay Bleue Score Lagunitas IPA Closest Lagunitas IPA Score
0 Trois Pistoles 0.998028 Tang And Biscuits 0.995972
1 La Fin Du Monde 0.990438 Sucre - Tawny Port Barrel Aged 0.995160
2 Burton Baton 0.989614 Flying Monkeys Ryezin’ Up The Rye Of The Tiger 0.994247
3 Allagash Odyssey 0.988587 Iris Pale Ale 0.994117
4 Green Flash West Coast IPA 0.987935 Zywiec Porter 0.993770
5 Green Flash Palate Wrecker 0.987310 Darwin 0.992140
6 Scottish Ale 0.987281 Draft Bear 0.991880
7 Dark Island 0.986381 Irish Oatmeal Stout 0.991712
8 90 Minute IPA 0.985973 Double Red IPA 0.991538
9 Westmalle Trappist Tripel 0.985170 The Scarlet Letter 0.991528

When restricting to the beers that worked best previously, we see that the neighbours make less sense. While still good for Coors Light and correct for Heineken, the results worsen at the interpretation level. For Chimay, we lose the other Chimay beers and keep only one Trappist (Westmalle), getting IPAs instead. For Lagunitas, we recommend fewer IPAs.

It is unclear why these embeddings give less intuitive results. Let's see what we get using the t-SNE dimension reduction technique.

t-SNE of the embeddings

Let's randomly select 10,000 beers and apply a t-SNE transformation to their embeddings.

In [99]:
# the embeddings are the first layer weights
from keras.models import load_model
load_path = "/data.nfs/pgutierrez/beer_reco/models/"
model = load_model(load_path+'matrix_facto_10_2017_10_10_12_12.h5')
weights = model.get_weights()
user_embeddings = weights[0]
item_embeddings = weights[1]
#print "weights shapes",[w.shape for w in weights]
In [100]:
import random
random.seed(0)
smallbeers = [inverse_beer_map[x] for x in random.sample(beers,10000)]
smallembedding = item_embeddings[smallbeers]

smallusers = [inverse_user_map[x] for x in random.sample(users,10000)]
smallembedding2 = user_embeddings[smallusers]

mostratedbeers = all_info[['beer_id','nb_lines']].sort_values('nb_lines', ascending=False).head(10000)
mostratedbeers = [int(x) for x in mostratedbeers["beer_id"].values]
mostratedembeddings = item_embeddings[mostratedbeers]

leastratedbeers = all_info[['beer_id','nb_lines']].sort_values('nb_lines').head(10000)
leastratedbeers = [int(x) for x in leastratedbeers["beer_id"].values]
leastratedembeddings = item_embeddings[leastratedbeers]
In [84]:
from sklearn.manifold import TSNE
%time item_tsne = TSNE(perplexity=30).fit_transform(smallembedding)
a = pd.DataFrame(item_tsne)
a.columns = ["x",'y']
a["beer_id"] = smallbeers
a['old_id'] = [beer_map[x] for x in smallbeers]
a.to_csv("/data.nfs/pgutierrez/beer_reco/new_data/tsne_dot.csv")


from sklearn.manifold import TSNE
%time item_tsne = TSNE(perplexity=30).fit_transform(smallembedding2)
a = pd.DataFrame(item_tsne)
a.columns = ["x",'y']
a["user_id"] = smallusers
a['old_id'] = [user_map[x] for x in smallusers]
a.to_csv("/data.nfs/pgutierrez/beer_reco/new_data/tsne_dot_user.csv")

We get:

In [129]:
a = pd.read_csv("/data.nfs/pgutierrez/beer_reco/new_data/tsne_dot.csv")
b = pd.merge(a,all_info, how='left',left_on ='old_id', right_on = 'old_id') 

plt.figure(figsize=(10,10))
sc = plt.scatter(b['x'], b['y'],s=10)
plt.xlim(-10, 10)
plt.ylim(-10, 10)
Out[129]:
(-10, 10)

At first, you may think that we have very structured information here, with well-separated clusters. However, it turns out that it is not possible to correlate any of these shapes with a style or origin of beer (trust me, I tried). I manually checked some areas and was not able to make sense of them.

In fact, the structure we see is mostly driven by two axes: the average rating of the beer and the number of times it has been rated. Below, the first graph shows the same data colored by average rating. We see on the left side a cluster of poorly rated beers (mostly American lagers) and some red or yellow clusters (top, center, bottom). The second chart shows the t-SNE representation colored by the log of the number of times the beer was rated. We can see that the most popular beers lie at the center of the map, while the round cluster at the bottom consists of beers that were rated only once (and thus cannot be linked to any other beer).

In [86]:
cm = plt.cm.get_cmap('RdYlBu_r')
plt.figure(figsize=(20,7))

plt.subplot(121)
sc = plt.scatter(b['x'], b['y'],s=10,c= b['avg_rating'],cmap=cm);
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.colorbar(sc)

plt.subplot(122)
sc = plt.scatter(b['x'], b['y'],s=10,c= np.log(b['nb_lines']),cmap=cm);
plt.xlim(-10, 10)
plt.ylim(-10, 10)
plt.colorbar(sc)

plt.show()