This is a nice dataset that combines textual descriptions and pricing. It is reminiscent of a hedonic pricing model in that we convert text into quantifiable features that help explain price, although the modelling approach here is quite different.
I don't use Mercari, but the data is useful for prototyping the sort of model that could be used on Amazon, Best Buy, and other marketplaces and retailers.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import (Input, Dropout, Dense, concatenate, GRU, Embedding, Flatten,
Activation, SpatialDropout1D, GlobalMaxPooling1D)
from keras.optimizers import Adam
from keras.models import Model
from keras import backend as K
from nltk.corpus import stopwords
%matplotlib inline
# root mean squared error, assuming all y values are already log transformed
def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))
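A quick sanity check of the metric, as a standalone sketch (the function is redefined here so the snippet is self-contained):

```python
import numpy as np

def rmse(y_true, y_pred):
    # same metric as above: RMSE on already log-transformed prices
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

y_true = np.log(np.array([10.0, 20.0, 40.0]))
y_pred = np.log(np.array([10.0, 20.0, 40.0]))
print(rmse(y_true, y_pred))  # 0.0 for perfect predictions

# a constant multiplicative error of 2x gives RMSE == ln(2)
print(rmse(y_true, y_true + np.log(2)))
```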
train = pd.read_table('d:/data/price/train.tsv')
test = pd.read_table('d:/data/price/test.tsv')
train.columns
Index(['train_id', 'name', 'item_condition_id', 'category_name', 'brand_name', 'price', 'shipping', 'item_description'], dtype='object')
print (len(train),len(test))
1482535 693359
cols = ['name', 'item_condition_id', 'category_name', 'brand_name','shipping', 'item_description']
Missing values in train and test sets.
for c in cols:
    print(c, ': ', train[c].isnull().sum(), test[c].isnull().sum())
name :  0 0
item_condition_id :  0 0
category_name :  6327 3058
brand_name :  632682 295525
shipping :  0 0
item_description :  4 0
A lot of items without brand names, which in itself is very informative. The lack of category names for some items could be a hassle, but they represent less than 1% of all observations.
There are 874 zero prices and no prices $\in (0, 3)$, so we should remove the zero-price rows: they are incorrect, since Mercari enforces a $3 lower limit on listing prices.
len(train[train['price']==0])
874
train[(train['price']<3) & (train['price']>0)]
(an empty DataFrame — no items are priced strictly between 0 and 3)
train.drop(train[train['price']<3.0].index, inplace=True)
train.shape
(1481661, 8)
train['log_price'] = np.log(train['price'])
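Taking logs means our RMSE metric penalizes relative rather than absolute errors, which is what we want for prices spanning several orders of magnitude. A small standalone illustration (names here are local to the sketch):

```python
import numpy as np

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# predicting $20 for a $10 item and $200 for a $100 item are
# equally wrong in log space: both are off by a factor of 2
cheap = rmse(np.log(np.array([10.0])), np.log(np.array([20.0])))
pricey = rmse(np.log(np.array([100.0])), np.log(np.array([200.0])))
print(cheap, pricey)  # both equal ln(2)
```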
Some features will need vectorization, and some new features will be created, starting with the length of each name and description in words and characters.
def char_count(text):
    # 'No description yet' is Mercari's placeholder, not a real description
    if text == 'No description yet':
        return 0
    try:
        return len(text.lower().replace(' ', ''))
    except AttributeError:  # NaN / non-string values
        return 0

def word_count(text):
    if text == 'No description yet':
        return 0
    try:
        return len(text.lower().split())
    except AttributeError:  # NaN / non-string values
        return 0
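A quick check of the two helpers on sample inputs (redefined here so the sketch stands alone; the sample strings are made up):

```python
def char_count(text):
    # 'No description yet' is Mercari's placeholder for an empty description
    if text == 'No description yet':
        return 0
    try:
        return len(text.lower().replace(' ', ''))
    except AttributeError:  # NaN / non-string values
        return 0

def word_count(text):
    if text == 'No description yet':
        return 0
    try:
        return len(text.lower().split())
    except AttributeError:
        return 0

print(word_count('Red dress size M'))    # 4
print(char_count('Red dress size M'))    # 13 ('reddresssizem')
print(word_count('No description yet'))  # 0
print(char_count(float('nan')))          # 0
```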
train['desc_words'] = train['item_description'].apply(word_count)
train['desc_chars'] = train['item_description'].apply(char_count)
test['desc_words'] = test['item_description'].apply(word_count)
test['desc_chars'] = test['item_description'].apply(char_count)
train['name_words'] = train['name'].apply(word_count)
train['name_chars'] = train['name'].apply(char_count)
test['name_words'] = test['name'].apply(word_count)
test['name_chars'] = test['name'].apply(char_count)
train.loc[train['item_description']=='No description yet', 'item_description'] = 'missing'
test.loc[test['item_description']=='No description yet', 'item_description'] = 'missing'
train.head()
train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | log_price | desc_words | desc_chars | name_words | name_chars | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | NaN | 10.0 | 1 | missing | 2.302585 | 0 | 0 | 7 | 29 |
1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... | 3.951244 | 36 | 153 | 4 | 29 |
2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... | 2.302585 | 29 | 96 | 2 | 13 |
3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | NaN | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... | 3.555348 | 32 | 142 | 3 | 19 |
4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | NaN | 44.0 | 0 | Complete with certificate of authenticity | 3.784190 | 5 | 37 | 4 | 17 |
Checking for missing brand names. As mentioned above, the lack of a brand is itself important information, but before treating "missing" as its own category, let's try to recover brands that actually appear in the item name.
First, get all the unique brand names, ignoring missing values.
len(train.brand_name.unique()) + len(test.brand_name.unique())
8709
train['brand_name'].fillna('missing', inplace=True)
test['brand_name'].fillna('missing', inplace=True)
train['category_name'].fillna('missing', inplace=True)
test['category_name'].fillna('missing', inplace=True)
train['item_description'].fillna('missing', inplace=True)
test['item_description'].fillna('missing', inplace=True)
all_brands = set(list(train.brand_name.unique()) + list(test['brand_name'].unique()))
all_brands = [b for b in all_brands if b != 'missing']  # note: 'is not' would compare identity, not equality
# I could use pop... but list comprehensions are more fun, no?
len(all_brands)
5287
So we're going to check the item names for brand information. 632,336 items in train have a "missing" brand; let's see what we end up with.
train['brand_name'].isnull().sum()
0
def assign_brand(line):
    name_words = line[0].split()
    brand = line[1]
    # items that already have a brand are fine as-is
    if brand != 'missing':
        return brand
    # let's see if we can find the brand name for currently unlabelled items.
    # If a word is in all_brands, return just the word rather than the full
    # name, or we're just creating new brands...
    for word in name_words:
        if word in all_brands:
            return word
    return 'missing'
train['new_brand_name'] = train[['name', 'brand_name']].apply(assign_brand, axis=1)
test['new_brand_name'] = test[['name', 'brand_name']].apply(assign_brand, axis=1)
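A small self-contained illustration of the recovery step. The toy `all_brands` set and the rows below are made up, and the function accesses columns by name for robustness across pandas versions:

```python
import pandas as pd

all_brands = {'Nike', 'Razer', 'Coach'}  # stand-in for the real brand set

def assign_brand(line):
    if line['brand_name'] != 'missing':
        return line['brand_name']  # already labelled
    for word in line['name'].split():
        if word in all_brands:
            return word  # first name word that is a known brand
    return 'missing'

df = pd.DataFrame({
    'name': ['Razer BlackWidow Keyboard', 'Leather Horse Statues', 'AVA-VIV Blouse'],
    'brand_name': ['missing', 'missing', 'Target'],
})
print(df[['name', 'brand_name']].apply(assign_brand, axis=1).tolist())
# ['Razer', 'missing', 'Target']
```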
This isn't perfect, but we've assigned over 70,000 new brands. We can also see that brand_name was used kind of loosely in the first place, and sometimes it was more of an extra description rather than a trademarked brand.
train[(train['brand_name'] == 'missing') & (train['new_brand_name'] != 'missing')].head()
train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | log_price | desc_words | desc_chars | name_words | name_chars | new_brand_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | missing | 10.0 | 1 | missing | 2.302585 | 0 | 0 | 7 | 29 | MLB |
49 | 49 | Younique 3d fiber lash mascara | 1 | Beauty/Makeup/Eyes | missing | 9.0 | 1 | Younique 3d fiber lash mascara will quickly be... | 2.197225 | 32 | 166 | 5 | 26 | Younique |
55 | 55 | Vintage wood jewelry lot | 3 | Vintage & Collectibles/Jewelry/Brooch | missing | 5.0 | 1 | All are made out of wood. Necklace, earrings b... | 1.609438 | 11 | 60 | 4 | 21 | Vintage |
66 | 66 | Silver choker Italy 925 | 3 | Women/Jewelry/Necklaces | missing | 15.0 | 1 | Signed Italy and 925 Necklace Vintage, lobster... | 2.708050 | 12 | 63 | 4 | 20 | Silver |
71 | 71 | Partners In Crime Necklace ShipfromChina | 1 | Women/Jewelry/Necklaces | missing | 4.0 | 1 | "Fine or Fashion: Fashion Item Type: Necklace ... | 1.386294 | 22 | 118 | 5 | 36 | Partners |
train[train['brand_name'] == 'Silver']
train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | log_price | desc_words | desc_chars | name_words | name_chars | new_brand_name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
174314 | 174314 | Silver jeans size 18 reg | 3 | Women/Jeans/Straight Leg | Silver | 20.0 | 1 | Absolutely love these jeans! Smoke free, pet f... | 2.995732 | 9 | 47 | 5 | 20 | Silver |
227909 | 227909 | 6 Total NICE Silver Kennedy Half Dollars | 2 | Vintage & Collectibles/Collectibles/Other | Silver | 20.0 | 1 | The first pic you see are some good (not scrap... | 2.995732 | 55 | 225 | 7 | 34 | Silver |
1397420 | 1397420 | Sterling Silver bracelet accessories | 1 | Men/Other/Other | Silver | 56.0 | 0 | missing | 4.025352 | 0 | 0 | 4 | 33 | Silver |
We can also see that category_name can be made more granular: each one contains up to five slash-separated levels (but mostly three).
But do we create 5 category columns or just 3? The latter avoids increasing sparsity, while the former only adds information for the 7 unique category names that go deeper than three levels. This is probably not worth the extra computational cost. And if we look at the two category names with 5 levels, we can be fairly confident that the item name and the first three levels give us enough information, unless there exists some secret iPad that's not a tablet and can't read eBooks.
cat_len = []
for cat in train['category_name'].unique():
    cat_len.append(len(cat.split('/')))
    if len(cat.split('/')) > 3:
        print(cat.split('/'))
print('Maximum categories: ', np.max(cat_len))
print('Minimum categories: ', np.min(cat_len))
['Electronics', 'Computers & Tablets', 'iPad', 'Tablet', 'eBook Readers']
['Sports & Outdoors', 'Exercise', 'Dance', 'Ballet']
['Electronics', 'Computers & Tablets', 'iPad', 'Tablet', 'eBook Access']
['Sports & Outdoors', 'Outdoors', 'Indoor', 'Outdoor Games']
['Men', 'Coats & Jackets', 'Varsity', 'Baseball']
['Men', 'Coats & Jackets', 'Flight', 'Bomber']
['Handmade', 'Housewares', 'Entertaining', 'Serving']
Maximum categories:  5
Minimum categories:  1
def granular_cat(line):
    splits = line.split('/')
    cats = len(splits)
    if cats == 1:
        return (splits[0], 'missing', 'missing')
    elif cats == 2:
        return (splits[0], splits[1], 'missing')
    else:  # three or more levels: keep only the first three
        return (splits[0], splits[1], splits[2])
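A quick check of the splitter on category names of different depths (redefined here so the sketch stands alone):

```python
def granular_cat(line):
    splits = line.split('/')
    if len(splits) == 1:
        return (splits[0], 'missing', 'missing')
    elif len(splits) == 2:
        return (splits[0], splits[1], 'missing')
    return (splits[0], splits[1], splits[2])  # keep only the first three levels

print(granular_cat('Men/Tops/T-shirts'))
# ('Men', 'Tops', 'T-shirts')
print(granular_cat('Handmade'))
# ('Handmade', 'missing', 'missing')
print(granular_cat('Electronics/Computers & Tablets/iPad/Tablet/eBook Readers'))
# ('Electronics', 'Computers & Tablets', 'iPad')
```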
train['cat_1'], train['cat_2'], train['cat_3'] = zip(*train['category_name'].apply(granular_cat))
test['cat_1'], test['cat_2'], test['cat_3'] = zip(*test['category_name'].apply(granular_cat))
train.head()
train_id | name | item_condition_id | category_name | brand_name | price | shipping | item_description | log_price | desc_words | desc_chars | name_words | name_chars | new_brand_name | cat_1 | cat_2 | cat_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | MLB Cincinnati Reds T Shirt Size XL | 3 | Men/Tops/T-shirts | missing | 10.0 | 1 | missing | 2.302585 | 0 | 0 | 7 | 29 | MLB | Men | Tops | T-shirts |
1 | 1 | Razer BlackWidow Chroma Keyboard | 3 | Electronics/Computers & Tablets/Components & P... | Razer | 52.0 | 0 | This keyboard is in great condition and works ... | 3.951244 | 36 | 153 | 4 | 29 | Razer | Electronics | Computers & Tablets | Components & Parts |
2 | 2 | AVA-VIV Blouse | 1 | Women/Tops & Blouses/Blouse | Target | 10.0 | 1 | Adorable top with a hint of lace and a key hol... | 2.302585 | 29 | 96 | 2 | 13 | Target | Women | Tops & Blouses | Blouse |
3 | 3 | Leather Horse Statues | 1 | Home/Home Décor/Home Décor Accents | missing | 35.0 | 1 | New with tags. Leather horses. Retail for [rm]... | 3.555348 | 32 | 142 | 3 | 19 | missing | Home | Home Décor | Home Décor Accents |
4 | 4 | 24K GOLD plated rose | 1 | Women/Jewelry/Necklaces | missing | 44.0 | 0 | Complete with certificate of authenticity | 3.784190 | 5 | 37 | 4 | 17 | missing | Women | Jewelry | Necklaces |
test.head()
test_id | name | item_condition_id | category_name | brand_name | shipping | item_description | desc_words | desc_chars | name_words | name_chars | new_brand_name | cat_1 | cat_2 | cat_3 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Breast cancer "I fight like a girl" ring | 1 | Women/Jewelry/Rings | missing | 1 | Size 7 | 2 | 5 | 8 | 33 | missing | Women | Jewelry | Rings |
1 | 1 | 25 pcs NEW 7.5"x12" Kraft Bubble Mailers | 1 | Other/Office supplies/Shipping Supplies | missing | 1 | 25 pcs NEW 7.5"x12" Kraft Bubble Mailers Lined... | 38 | 214 | 7 | 34 | missing | Other | Office supplies | Shipping Supplies |
2 | 2 | Coach bag | 1 | Vintage & Collectibles/Bags and Purses/Handbag | Coach | 1 | Brand new coach bag. Bought for [rm] at a Coac... | 11 | 45 | 2 | 8 | Coach | Vintage & Collectibles | Bags and Purses | Handbag |
3 | 3 | Floral Kimono | 2 | Women/Sweaters/Cardigan | missing | 0 | -floral kimono -never worn -lightweight and pe... | 10 | 58 | 2 | 12 | missing | Women | Sweaters | Cardigan |
4 | 4 | Life after Death | 3 | Other/Books/Religion & Spirituality | missing | 1 | Rediscovering life after the loss of a loved o... | 29 | 139 | 3 | 14 | missing | Other | Books | Religion & Spirituality |
A deeper analysis of the descriptions is out of scope for now, but we can still vectorize them, and we already have their lengths in words and characters, which might carry some signal.
combined = pd.concat((train, test), axis=0)
lab = LabelEncoder()
# fit on the combined data so train and test share the same integer codes
combined['category_name_final'] = lab.fit_transform(combined['category_name'])
combined['new_brand_name_final'] = lab.fit_transform(combined['new_brand_name'])
combined['cat_1_final'] = lab.fit_transform(combined['cat_1'])
combined['cat_2_final'] = lab.fit_transform(combined['cat_2'])
combined['cat_3_final'] = lab.fit_transform(combined['cat_3'])
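Why fit the encoder on the concatenated frame: encoding train and test separately would assign different integers to the same label, and test-only labels would be unseen. A minimal illustration with made-up brand lists:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train_brands = pd.Series(['Nike', 'missing', 'Razer'])
test_brands = pd.Series(['Razer', 'missing', 'Coach'])

lab = LabelEncoder()
codes = lab.fit_transform(pd.concat((train_brands, test_brands)))
# the same brand always maps to the same code, and the test-only
# brand ('Coach') is covered because the encoder saw the combined data
mapping = dict(zip(lab.classes_, range(len(lab.classes_))))
print(mapping)  # {'Coach': 0, 'Nike': 1, 'Razer': 2, 'missing': 3}
```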
Tokenize the descriptions.
tok = Tokenizer()
all_text = np.hstack((combined['item_description'].str.lower(), combined['name'].str.lower()))
tok.fit_on_texts(all_text)
combined['item_description_seq'] = tok.texts_to_sequences(combined['item_description'].str.lower())
combined['name_seq'] = tok.texts_to_sequences(combined['name'].str.lower())
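Conceptually, the Keras Tokenizer builds a word-to-index map (most frequent word gets index 1, index 0 is reserved for padding) and turns each text into a sequence of those integers. A hand-rolled sketch of the idea, not the actual Keras implementation:

```python
from collections import Counter

texts = ['great condition works great', 'brand new in box']

# most frequent words get the smallest indices; index 0 is reserved for padding
counts = Counter(w for t in texts for w in t.lower().split())
word_index = {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_sequences(texts):
    # unknown words are simply dropped, mirroring the default Tokenizer behaviour
    return [[word_index[w] for w in t.lower().split() if w in word_index]
            for t in texts]

print(word_index['great'])  # 1 -- the most frequent word
print(texts_to_sequences(['great box']))
```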
Features:
def data_prep(df):
    X = {
        'name_final': pad_sequences(df['name_seq'], maxlen=10),
        'name_words': np.array(df[['name_words']]),
        #'name_chars': np.array(df[['name_chars']]),
        'new_brand_name_final': np.array(df['new_brand_name_final']),
        'category_name_final': np.array(df['category_name_final']),
        'cat_1_final': np.array(df['cat_1_final']),
        'cat_2_final': np.array(df['cat_2_final']),
        'cat_3_final': np.array(df['cat_3_final']),
        'item_description_final': pad_sequences(df['item_description_seq'], maxlen=75),
        'desc_words': np.array(df[['desc_words']]),
        #'desc_chars': np.array(df[['desc_chars']]),
        'item_condition': np.array(df['item_condition_id']),
        'shipping': np.array(df[['shipping']]),
    }
    return X
def rnn_model(lr=0.005, decay=0.0):
    # Inputs
    name_final = Input(shape=[X_train["name_final"].shape[1]], name="name_final")
    name_words = Input(shape=[1], name="name_words")
    #name_chars = Input(shape=[1], name='name_chars')
    new_brand_name_final = Input(shape=[1], name='new_brand_name_final')
    category_name_final = Input(shape=[1], name='category_name_final')
    cat_1_final = Input(shape=[1], name="cat_1_final")
    cat_2_final = Input(shape=[1], name="cat_2_final")
    cat_3_final = Input(shape=[1], name="cat_3_final")
    item_description_final = Input(shape=[X_train['item_description_final'].shape[1]], name="item_description_final")
    desc_words = Input(shape=[1], name='desc_words')
    #desc_chars = Input(shape=[1], name='desc_chars')
    item_condition = Input(shape=[1], name="item_condition")
    shipping = Input(shape=[X_train['shipping'].shape[1]], name="shipping")
    # Embeddings: input dimensions are always slightly larger than the maximum
    # values in vectorized features or the maximum number of words/characters
    emb_name = Embedding(350000, 20)(name_final)
    emb_name_words = Embedding(18, 5)(name_words)
    #emb_name_chars = Embedding(41, 5)(name_chars)
    emb_brand_name = Embedding(5288, 10)(new_brand_name_final)
    emb_category = Embedding(1311, 10)(category_name_final)
    emb_cat_1 = Embedding(11, 10)(cat_1_final)
    emb_cat_2 = Embedding(114, 10)(cat_2_final)
    emb_cat_3 = Embedding(883, 10)(cat_3_final)
    emb_item_desc = Embedding(350000, 60)(item_description_final)
    emb_desc_words = Embedding(250, 5)(desc_words)
    #emb_desc_chars = Embedding(900, 5)(desc_chars)
    emb_item_condition = Embedding(6, 5)(item_condition)
    # RNNs over the two text sequences
    rnn_layer1 = GRU(16)(emb_item_desc)
    rnn_layer2 = GRU(8)(emb_name)
    layer = concatenate([
        Flatten()(emb_name_words),
        #Flatten()(emb_name_chars),
        Flatten()(emb_brand_name),
        Flatten()(emb_category),
        Flatten()(emb_cat_1),
        Flatten()(emb_cat_2),
        Flatten()(emb_cat_3),
        Flatten()(emb_desc_words),
        #Flatten()(emb_desc_chars),
        Flatten()(emb_item_condition),
        rnn_layer1,
        rnn_layer2,
        shipping  # only 2 possible values, so it's ok without an embedding
    ])
    layer = Dropout(0.25)(Dense(512, kernel_initializer='normal', activation='relu')(layer))
    layer = Dropout(0.20)(Dense(256, kernel_initializer='normal', activation='relu')(layer))
    layer = Dropout(0.15)(Dense(128, kernel_initializer='normal', activation='relu')(layer))
    layer = Dropout(0.10)(Dense(64, kernel_initializer='normal', activation='relu')(layer))
    # scalar output, linear activation since we're predicting a log price
    output = Dense(1, activation="linear")(layer)
    model = Model([name_final, name_words, new_brand_name_final,
                   category_name_final,
                   cat_1_final, cat_2_final, cat_3_final,
                   item_description_final, desc_words,
                   item_condition, shipping], output)
    optimizer = Adam(lr=lr, decay=decay)
    model.compile(loss='mse', optimizer=optimizer)
    return model
It turns out that using characters instead of words, or combining characters and words for description and name length, decreases predictive power. This makes sense, as words actually give information about a product, while characters are informative only insofar as they form words. For example, "cool" and "awesome" in item descriptions probably carry the same signal, so the three-character difference doesn't really mean much.
train = combined[:len(train)]
test = combined[len(train):]
# clear up some RAM...
del lab, combined, tok
X = train
y = train['log_price'].values.reshape(-1, 1)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.1, random_state=10101)
X_train = data_prep(X_train)
X_val = data_prep(X_val)
X_test = data_prep(test)
rnn = rnn_model(lr=0.005, decay=1e-6)
rnn.fit(X_train, y_train, epochs=2, batch_size=512,validation_data=(X_val, y_val), verbose=1)
Train on 1333494 samples, validate on 148167 samples
Epoch 1/2
1333494/1333494 [==============================] - 457s 342us/step - loss: 0.3008 - val_loss: 0.2219
Epoch 2/2
1214464/1333494 [==========================>...] - ETA: 38s - loss: 0.2085
pred = rnn.predict(X_val, batch_size=512)
print("RMSLE:", rmse(y_val, pred))
final_pred = rnn.predict(X_test, batch_size=512, verbose=1)
final_pred = np.exp(final_pred)  # invert the natural-log transform back to dollar prices
submission = pd.DataFrame({"test_id": test['test_id'], "price": final_pred.reshape(-1)})
submission.to_csv("sub.csv", index=False)