Load data

This dataset contains transactions from a UK-based online retail store. The file takes a while to load. I downloaded it as an xls file from the UCI Machine Learning Repository and used Excel to save it as a csv.
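If you would rather script the conversion than use Excel, here is a minimal sketch with pandas (the download file name is an assumption, and read_excel needs an Excel engine such as xlrd or openpyxl installed):

import pandas as pd

# Hypothetical one-off conversion of the UCI spreadsheet to csv.
raw = pd.read_excel('./data/raw/Online Retail.xlsx')  # assumed file name
raw.to_csv('./data/raw/OnlineRetail.csv', index=False)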

I will largely follow the process described in a Featuretools tutorial. Note that the tutorial changed the column names from the original file (which I have not done).

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

create_data = False

Field info (from UCI machine learning dataset website):

  • InvoiceNo: Invoice number. Nominal, a 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'C', it indicates a cancellation.
  • StockCode: Product (item) code. Nominal, a 5-digit integral number uniquely assigned to each distinct product.
  • Description: Product (item) name. Nominal.
  • Quantity: The quantities of each product (item) per transaction. Numeric.
  • InvoiceDate: Invoice date and time. Numeric, the day and time when each transaction was generated.
  • UnitPrice: Unit price. Numeric, product price per unit in sterling.
  • CustomerID: Customer number. Nominal, a 5-digit integral number uniquely assigned to each customer.
  • Country: Country name. Nominal, the name of the country where each customer resides.
In [2]:
df = pd.read_csv('./data/raw/OnlineRetail.csv', parse_dates=["InvoiceDate"])
In [3]:
# Restrict data to 2011, following the tutorial. Not sure why they did this.
df = df[df['InvoiceDate'].dt.year == 2011]
In [4]:
df.columns
Out[4]:
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')
In [5]:
len(df)
Out[5]:
499428
In [6]:
# flag cancelled orders (InvoiceNo starts with 'C')
df['Cancelled'] = df['InvoiceNo'].str.startswith('C')
In [7]:
# drop the duplicates
df = df.drop_duplicates()
In [8]:
# drop rows with nulls (in this data, chiefly rows missing CustomerID)
df = df.dropna(axis=0)
In [9]:
# Convert prices from pounds to dollars (rate of 1.65) and create a field for the total spent
df['UnitPrice'] = df['UnitPrice'] * 1.65
df['Total'] = df['UnitPrice'] * df['Quantity']
In [10]:
df.describe()
Out[10]:
            Quantity      UnitPrice     CustomerID          Total
count  375250.000000  375250.000000  375250.000000  375250.000000
mean       12.252296       5.762031   15265.136168      33.972397
std       258.180877     119.054102    1710.927381     733.803756
min    -80995.000000       0.000000   12346.000000 -277974.840000
25%         2.000000       2.062500   13901.000000       7.012500
50%         5.000000       3.217500   15108.000000      19.305000
75%        12.000000       6.187500   16767.000000      32.670000
max     80995.000000   64300.500000   18287.000000  277974.840000
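The negative Quantity and Total values come from cancelled orders (note that the minimum quantity of -80995 mirrors the maximum). A quick exploratory check of that assumption, using the Cancelled flag created above:

# fraction of negative-quantity rows that are flagged as cancellations
df.loc[df['Quantity'] < 0, 'Cancelled'].mean()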
In [11]:
df.head()
Out[11]:
      InvoiceNo StockCode                Description  Quantity         InvoiceDate  UnitPrice  CustomerID         Country  Cancelled   Total
42481    539993     22386    JUMBO BAG PINK POLKADOT        10 2011-01-04 10:00:00     3.2175     13313.0  United Kingdom      False  32.175
42482    539993     21499         BLUE POLKADOT WRAP        25 2011-01-04 10:00:00     0.6930     13313.0  United Kingdom      False  17.325
42483    539993     21498         RED RETROSPOT WRAP        25 2011-01-04 10:00:00     0.6930     13313.0  United Kingdom      False  17.325
42484    539993     22379    RECYCLING BAG RETROSPOT         5 2011-01-04 10:00:00     3.4650     13313.0  United Kingdom      False  17.325
42485    539993     20718  RED RETROSPOT SHOPPER BAG        10 2011-01-04 10:00:00     2.0625     13313.0  United Kingdom      False  20.625
In [12]:
df.dtypes
Out[12]:
InvoiceNo              object
StockCode              object
Description            object
Quantity                int64
InvoiceDate    datetime64[ns]
UnitPrice             float64
CustomerID            float64
Country                object
Cancelled                bool
Total                 float64
dtype: object

Create response variable

The following code is a little complex: I compute how much each customer spent each month (months with no transactions count as 0), starting from each customer's first purchase.

In [13]:
def monthly_spend(df):
    """Identify how much each customer spent each month."""
    df = df.set_index('InvoiceDate')  # index by date without mutating the caller's frame
    monthly_sums = df.groupby('CustomerID').resample('MS')['Total'].sum()
    return monthly_sums.reset_index()


def customer_by_month(monthly_data, df):
    """Create an index for each customer for each month of the data set."""
    labels = monthly_data.set_index(['InvoiceDate', 'CustomerID'])
    midx = pd.MultiIndex.from_product(
        [pd.date_range('2011-01-01', '2012-01-01', freq='MS'), df['CustomerID'].unique()],
        names=labels.index.names)
    return labels.reindex(midx, fill_value=0).reset_index()


def monthly_min_date(monthly_data):
    """Create a table which has all months since a customer's first transaction"""
    min_dates = (monthly_data
                 .groupby('CustomerID')['InvoiceDate']
                 .min()
                 .apply(lambda x: pd.date_range(x, '2012-01-01', freq='MS').tolist())
                 .reset_index()
                )
    return pd.DataFrame([(x, i) for x, y in zip(min_dates['CustomerID'], min_dates['InvoiceDate']) for i in y],
                        columns=['CustomerID', 'InvoiceDate'])


# Join the table where each customer has a record for every month with the table
# of months since each customer's first transaction, so that customers have no
# records before their first purchase.
monthly_df = monthly_spend(df)
labels = customer_by_month(monthly_df, df).merge(monthly_min_date(monthly_df),
                                                   on=['CustomerID', 'InvoiceDate'],
                                                   how='inner')
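To make the resample step concrete, here is a toy illustration with made-up data; resample('MS') buckets each customer's transactions into month-start periods:

# toy data: two customers, three transactions
toy = pd.DataFrame({'CustomerID': [1, 1, 2], 'Total': [10.0, 5.0, 7.0]},
                   index=pd.to_datetime(['2011-01-05', '2011-02-20', '2011-01-09']))
toy.groupby('CustomerID').resample('MS')['Total'].sum()
# -> (1, 2011-01-01): 10.0, (1, 2011-02-01): 5.0, (2, 2011-01-01): 7.0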
In [14]:
# The tutorial starts with March data.
labels = labels[labels['InvoiceDate'] >= '2011-03-01']
In [15]:
labels.head()
Out[15]:
     InvoiceDate  CustomerID    Total
2065  2011-03-01     13313.0    0.000
2066  2011-03-01     18097.0    0.000
2067  2011-03-01     16656.0  589.248
2068  2011-03-01     16875.0    0.000
2069  2011-03-01     13094.0  115.434

Notice that I have more samples than the Featuretools tutorial because they accidentally removed all transactions that occur on the first of the month.

In [16]:
labels.describe()
Out[16]:
         CustomerID          Total
count  35339.000000   35339.000000
mean   15276.044399     318.284416
std     1720.194897    1698.736413
min    12346.000000   -7074.589500
25%    13791.000000       0.000000
50%    15247.000000       0.000000
75%    16757.000000     260.460750
max    18287.000000  115906.725000
In [17]:
labels.loc[labels['CustomerID'] == 12347]
Out[17]:
      InvoiceDate  CustomerID      Total
2731   2011-03-01     12347.0     0.0000
4543   2011-04-01     12347.0  1049.8125
6715   2011-05-01     12347.0     0.0000
9204   2011-06-01     12347.0   631.1580
11956  2011-07-01     12347.0     0.0000
14926  2011-08-01     12347.0   965.1015
18080  2011-09-01     12347.0     0.0000
21547  2011-10-01     12347.0  2135.6280
25385  2011-11-01     12347.0     0.0000
29582  2011-12-01     12347.0   370.9530
33826  2012-01-01     12347.0     0.0000
In [18]:
df[(df['CustomerID'] == 12347) & (df['InvoiceDate'] >= '2011-12-01')]['Total'].sum()
Out[18]:
370.953

Feature Automation

This next series of commands loads the data into Featuretools and describes the structure of the dataset to it.

In [19]:
import featuretools as ft
In [20]:
es = ft.EntitySet(id="Online Retail Logs")

# Add the entire data table as an entity
es.entity_from_dataframe("purchases", # name of the new entity
                         dataframe=df, # data
                         index="purchases_index", # name of new index
                         time_index='InvoiceDate', # time associated with each row
                         variable_types={'Description': ft.variable_types.Text})  # specify variable type

es['purchases']
2018-09-22 09:06:26,324 featuretools.entityset - WARNING    index purchases_index not found in dataframe, creating new integer column
Out[20]:
Entity: purchases
  Variables:
    purchases_index (dtype: index)
    InvoiceNo (dtype: categorical)
    StockCode (dtype: categorical)
    Quantity (dtype: numeric)
    InvoiceDate (dtype: datetime_time_index)
    UnitPrice (dtype: numeric)
    CustomerID (dtype: numeric)
    Country (dtype: categorical)
    Cancelled (dtype: boolean)
    Total (dtype: numeric)
    Description (dtype: text)
  Shape:
    (Rows: 375250, Columns: 11)
In [21]:
es.normalize_entity(new_entity_id="products", # new entity for products. 
                    base_entity_id="purchases", # built from purchases entity
                    index="StockCode",  # Index with StockCode column from purchases entity. (unique key)
                    additional_variables=["Description"]) # bring in this variable

es['products'].df.head()
Out[21]:
      StockCode                       Description first_purchases_time
20682     20682  RED RETROSPOT CHILDRENS UMBRELLA   2011-01-04 10:00:00
20718     20718         RED RETROSPOT SHOPPER BAG   2011-01-04 10:00:00
21498     21498                RED RETROSPOT WRAP   2011-01-04 10:00:00
21499     21499                BLUE POLKADOT WRAP   2011-01-04 10:00:00
22302     22302           COFFEE MUG PEARS DESIGN   2011-01-04 10:00:00
In [22]:
es.normalize_entity(new_entity_id="customers", # customer entity
                    base_entity_id="purchases", # from purchases
                    index="CustomerID")  # unique key is CustomerID

es.normalize_entity(new_entity_id="orders", # order entity
                    base_entity_id="purchases", # from purchases
                    index="InvoiceNo", # unique key is InvoiceNo
                    additional_variables=["Country", 'Cancelled']) # Include these variables.

es
Out[22]:
Entityset: Online Retail Logs
  Entities:
    purchases [Rows: 375250, Columns: 8]
    products [Rows: 3612, Columns: 3]
    customers [Rows: 4244, Columns: 2]
    orders [Rows: 20482, Columns: 4]
  Relationships:
    purchases.StockCode -> products.StockCode
    purchases.CustomerID -> customers.CustomerID
    purchases.InvoiceNo -> orders.InvoiceNo
In [23]:
es['customers'].df.head()
Out[23]:
         CustomerID first_purchases_time
13313.0     13313.0  2011-01-04 10:00:00
18097.0     18097.0  2011-01-04 10:22:00
16656.0     16656.0  2011-01-04 10:23:00
13094.0     13094.0  2011-01-04 10:37:00
16875.0     16875.0  2011-01-04 10:37:00
In [24]:
es['orders'].df.head()
Out[24]:
       InvoiceNo         Country  Cancelled first_purchases_time
539993    539993  United Kingdom      False  2011-01-04 10:00:00
540001    540001  United Kingdom      False  2011-01-04 10:22:00
540002    540002  United Kingdom      False  2011-01-04 10:23:00
540003    540003  United Kingdom      False  2011-01-04 10:37:00
540004    540004  United Kingdom      False  2011-01-04 10:37:00

Deep Feature Synthesis

Here, we use Featuretools' deep feature synthesis to create new features automatically.

In [25]:
if create_data:
    feature_matrix,_ = ft.dfs(entityset=es, # entityset to use
                              target_entity='customers', # entity we're making predictions about
                              agg_primitives=['max', 'min', 'mode', 'mean', 'avg_time_between'], # requested aggregations
                              trans_primitives=['day', 'month', 'hour', 'weekend'], # requested transformations
                              cutoff_time=labels, # cutoff time for each prediction row
                              verbose=1, # how much stdout
                              cutoff_time_in_index=True, # keep the cutoff times in the output index
                              chunk_size=50, # how much data to give each worker
                              n_jobs=-1, # how many workers to create
                              max_depth=1) # how many primitives to stack

    feature_matrix = feature_matrix.reset_index()
    feature_matrix.to_csv('./data/processed/dfs_depth1.csv')
    feature_matrix.head()
else:
    feature_matrix = pd.read_csv('./data/processed/dfs_depth1.csv')
    feature_matrix['time'] = pd.to_datetime(feature_matrix['time'])
    feature_matrix = feature_matrix.drop('Unnamed: 0', axis=1)
In [26]:
feature_matrix.columns
Out[26]:
Index(['CustomerID', 'time', 'MAX(purchases.Quantity)',
       'MAX(purchases.UnitPrice)', 'MAX(purchases.Total)',
       'MIN(purchases.Quantity)', 'MIN(purchases.UnitPrice)',
       'MIN(purchases.Total)', 'MODE(purchases.InvoiceNo)',
       'MODE(purchases.StockCode)', 'MEAN(purchases.Quantity)',
       'MEAN(purchases.UnitPrice)', 'MEAN(purchases.Total)',
       'AVG_TIME_BETWEEN(purchases.InvoiceDate)', 'DAY(first_purchases_time)',
       'MONTH(first_purchases_time)', 'HOUR(first_purchases_time)',
       'WEEKEND(first_purchases_time)', 'Total'],
      dtype='object')
In [27]:
# look at a single output
feature_matrix.iloc[0, :3]
Out[27]:
CustomerID                               12346
time                       2011-03-01 00:00:00
MAX(purchases.Quantity)                  74215
Name: 0, dtype: object
In [28]:
# demonstrate we understand the output
df[(df['CustomerID'] == 12346) & (df['InvoiceDate'] < '2011-03-01')]['Quantity'].max()
Out[28]:
74215
In [29]:
feature_matrix.shape
Out[29]:
(35339, 19)
In [30]:
# create binary response variable: did the customer spend more than $500 this month?
feature_matrix['Total'] = feature_matrix['Total'].apply(lambda x: 1 if x > 500 else 0)
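An equivalent vectorized form, for reference:

feature_matrix['Total'] = (feature_matrix['Total'] > 500).astype(int)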
In [31]:
import numpy as np

def create_train_test(month, feature_matrix, drop_cols=['CustomerID', 'time', 'month', 'Total']):
    """Basic cleaning and return train/test data.

    Note: the split compares calendar month only, so the 2012-01 rows
    (month == 1) land in the training set.
    """

    # remove ID-like mode columns we know will not contribute
    feature_matrix = feature_matrix.drop(columns=['MODE(purchases.InvoiceNo)', 'MODE(purchases.StockCode)'])
    # dummy code strings
    feature_matrix = pd.get_dummies(feature_matrix)
    # fill nans
    feature_matrix = feature_matrix.fillna(0)

    # Labels
    feature_matrix['month'] = feature_matrix['time'].dt.month
    train_labels = feature_matrix.loc[feature_matrix['month'] < month, 'Total']
    test_labels = feature_matrix.loc[feature_matrix['month'] >= month, 'Total']
    y_train = np.array(train_labels).reshape((-1, ))
    y_test = np.array(test_labels).reshape((-1, ))

    # Features
    X_train = feature_matrix[feature_matrix['time'].dt.month < month].drop(columns=drop_cols)
    X_test = feature_matrix[feature_matrix['time'].dt.month >= month].drop(columns=drop_cols)

    return (X_train, X_test, y_train, y_test)
In [32]:
X_train, X_test, y_train, y_test = create_train_test(11, feature_matrix)
In [33]:
print(np.mean(y_train))
print(np.mean(y_test))
0.183619599970258
0.16076294277929154
In [34]:
from sklearn.linear_model import LogisticRegression

# class_weight='balanced' reweights classes, since only ~18% of training months are positive
model = LogisticRegression(random_state=0, class_weight='balanced')
model.fit(X_train, y_train)
Out[34]:
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
In [35]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

def print_performance(X, y, model):
    """Print model performance metrics."""
    predictions = model.predict(X)
    probs = model.predict_proba(X)[:, 1]

    # Calculate metrics
    print('Precision: {}'.format(round(precision_score(y, predictions), 5)))
    print('Recall: {}'.format(round(recall_score(y, predictions), 5)))
    print('F1 Score: {}'.format(round(f1_score(y, predictions), 5)))
    print('ROC AUC: {}'.format(round(roc_auc_score(y, probs), 5)))
In [36]:
print_performance(X_train, y_train, model)
Precision: 0.44233
Recall: 0.43794
F1 Score: 0.44013
ROC AUC: 0.73158
In [37]:
print_performance(X_test, y_test, model)
Precision: 0.40087
Recall: 0.33972
F1 Score: 0.36777
ROC AUC: 0.69795
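Since matplotlib is already imported, it's easy to visualize this; a quick sketch of the test-set ROC curve using sklearn's roc_curve (a sanity check on the AUC above, not an output of the original run):

from sklearn.metrics import roc_curve

# ROC curve for the fitted depth-1 model on the held-out months
fpr, tpr, _ = roc_curve(y_test, model.predict_proba(X_test)[:, 1])
plt.plot(fpr, tpr, label='depth-1 features')
plt.plot([0, 1], [0, 1], 'k--', label='chance')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()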

Deeper Features

Beware that this takes forever!!!

In [38]:
if create_data:
    feature_matrix,_ = ft.dfs(entityset=es,
                              target_entity='customers', # entity we're making predictions about
                              agg_primitives=['max', 'min', 'mode', 'mean', 'avg_time_between'], # requested aggregations
                              trans_primitives=['day', 'month', 'hour', 'weekend'], # requested transformations
                              n_jobs=-1,
                              chunk_size=50,
                              max_depth=2, # stack primitives two deep
                              cutoff_time=labels,
                              cutoff_time_in_index=True,
                              verbose=1)
    
    feature_matrix = feature_matrix.reset_index()
    feature_matrix.to_csv('./data/processed/dfs_depth2.csv')
    feature_matrix.head()
else:
    feature_matrix = pd.read_csv('./data/processed/dfs_depth2.csv')
    feature_matrix['time'] = pd.to_datetime(feature_matrix['time']) 
    feature_matrix = feature_matrix.drop('Unnamed: 0', axis=1)
In [39]:
feature_matrix.shape
Out[39]:
(35339, 23)
In [40]:
feature_matrix.columns
Out[40]:
Index(['CustomerID', 'time', 'MAX(purchases.Quantity)',
       'MAX(purchases.UnitPrice)', 'MAX(purchases.Total)',
       'MIN(purchases.Quantity)', 'MIN(purchases.UnitPrice)',
       'MIN(purchases.Total)', 'MODE(purchases.InvoiceNo)',
       'MODE(purchases.StockCode)', 'MEAN(purchases.Quantity)',
       'MEAN(purchases.UnitPrice)', 'MEAN(purchases.Total)',
       'AVG_TIME_BETWEEN(purchases.InvoiceDate)', 'DAY(first_purchases_time)',
       'MONTH(first_purchases_time)', 'HOUR(first_purchases_time)',
       'WEEKEND(first_purchases_time)', 'MODE(purchases.DAY(InvoiceDate))',
       'MODE(purchases.MONTH(InvoiceDate))',
       'MODE(purchases.HOUR(InvoiceDate))', 'MODE(purchases.orders.Country)',
       'Total'],
      dtype='object')
In [41]:
# look at a single output
feature_matrix.iloc[7000, [0, 1, 19]]
Out[41]:
CustomerID                                          13481
time                                  2011-06-01 00:00:00
MODE(purchases.MONTH(InvoiceDate))                      3
Name: 7000, dtype: object
In [42]:
# demonstrate we understand the output
df['month'] = df['InvoiceDate'].dt.month
df[(df['CustomerID'] == 13481) & (df['InvoiceDate'] < '2011-06-01')].groupby('month')['Total'].count()
Out[42]:
month
3    43
4     1
5    17
Name: Total, dtype: int64
In [43]:
# create binary response variable (spend > $500), as before
feature_matrix['Total'] = feature_matrix['Total'].apply(lambda x: 1 if x > 500 else 0)
In [44]:
X_train, X_test, y_train, y_test = create_train_test(11, feature_matrix)
In [45]:
model = LogisticRegression(random_state=0, class_weight='balanced')
model.fit(X_train, y_train)
Out[45]:
LogisticRegression(C=1.0, class_weight='balanced', dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=0,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)
In [46]:
print_performance(X_train, y_train, model)
Precision: 0.4444
Recall: 0.44503
F1 Score: 0.44471
ROC AUC: 0.72874
In [47]:
print_performance(X_test, y_test, model)
Precision: 0.41033
Recall: 0.32203
F1 Score: 0.36086
ROC AUC: 0.68288
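Note that the depth-2 features do not help this simple logistic regression: test ROC AUC drops slightly (0.683 versus 0.698 with the depth-1 features), so the extra synthesis time buys nothing here.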