Kaggle's Predicting Red Hat Business Value

This is a first quick & dirty attempt at Kaggle's Predicting Red Hat Business Value competition.

Loading in the data

In [1]:
import pandas as pd

people = pd.read_csv('people.csv.zip')
people.head(3)
Out[1]:
people_id char_1 group_1 char_2 date char_3 char_4 char_5 char_6 char_7 ... char_29 char_30 char_31 char_32 char_33 char_34 char_35 char_36 char_37 char_38
0 ppl_100 type 2 group 17304 type 2 2021-06-29 type 5 type 5 type 5 type 3 type 11 ... False True True False False True True True False 36
1 ppl_100002 type 2 group 8688 type 3 2021-01-06 type 28 type 9 type 5 type 3 type 11 ... False True True True True True True True False 76
2 ppl_100003 type 2 group 33592 type 3 2022-06-10 type 4 type 8 type 5 type 2 type 5 ... False False True True True True False True True 99

3 rows × 41 columns

In [2]:
actions = pd.read_csv('act_train.csv.zip')
actions.head(3)
Out[2]:
people_id activity_id date activity_category char_1 char_2 char_3 char_4 char_5 char_6 char_7 char_8 char_9 char_10 outcome
0 ppl_100 act2_1734928 2023-08-26 type 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN type 76 0
1 ppl_100 act2_2434093 2022-09-27 type 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN type 1 0
2 ppl_100 act2_3404049 2022-09-27 type 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN type 1 0

Joining the two tables together to get the training dataset

In [3]:
training_data_full = pd.merge(actions, people, how='inner', on='people_id', suffixes=['_action', '_person'], sort=False)
training_data_full.head(5)
Out[3]:
people_id activity_id date_action activity_category char_1_action char_2_action char_3_action char_4_action char_5_action char_6_action ... char_29 char_30 char_31 char_32 char_33 char_34 char_35 char_36 char_37 char_38
0 ppl_100 act2_1734928 2023-08-26 type 4 NaN NaN NaN NaN NaN NaN ... False True True False False True True True False 36
1 ppl_100 act2_2434093 2022-09-27 type 2 NaN NaN NaN NaN NaN NaN ... False True True False False True True True False 36
2 ppl_100 act2_3404049 2022-09-27 type 2 NaN NaN NaN NaN NaN NaN ... False True True False False True True True False 36
3 ppl_100 act2_3651215 2023-08-04 type 2 NaN NaN NaN NaN NaN NaN ... False True True False False True True True False 36
4 ppl_100 act2_4109017 2023-08-26 type 2 NaN NaN NaN NaN NaN NaN ... False True True False False True True True False 36

5 rows × 55 columns

In [4]:
(actions.shape, people.shape, training_data_full.shape)
Out[4]:
((2197291, 15), (189118, 41), (2197291, 55))

Building a preprocessing pipeline

In [5]:
# %load "preprocessing_transforms.py"
from sklearn.base import TransformerMixin, BaseEstimator
import pandas as pd


class BaseTransformer(BaseEstimator, TransformerMixin):
    """No-op base class: subclasses override `fit` and/or `transform` as needed."""

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, **transform_params):
        return X


class ColumnSelector(BaseTransformer):
    """Selects columns from Pandas Dataframe"""

    def __init__(self, columns, c_type=None):
        self.columns = columns
        self.c_type = c_type

    def transform(self, X, **transform_params):
        cs = X[self.columns]
        if self.c_type is None:
            return cs
        else:
            return cs.astype(self.c_type)


class SpreadBinary(BaseTransformer):
    """Maps 0/1 indicator columns to -1/1."""

    def transform(self, X, **transform_params):
        return X.applymap(lambda x: 1 if x == 1 else -1)


class DfTransformerAdapter(BaseTransformer):
    """Adapts a scikit-learn Transformer to return a pandas DataFrame"""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None, **fit_params):
        self.transformer.fit(X, y=y, **fit_params)
        return self

    def transform(self, X, **transform_params):
        raw_result = self.transformer.transform(X, **transform_params)
        return pd.DataFrame(raw_result, columns=X.columns, index=X.index)


class DfOneHot(BaseTransformer):
    """
    Wraps helper method `get_dummies` making sure all columns get one-hot encoded.
    """
    def __init__(self):
        self.dummy_columns = []

    def fit(self, X, y=None, **fit_params):
        self.dummy_columns = pd.get_dummies(
            X,
            prefix=[c for c in X.columns],
            columns=X.columns).columns
        return self

    def transform(self, X, **transform_params):
        return pd.get_dummies(
            X,
            prefix=[c for c in X.columns],
            columns=X.columns).reindex(columns=self.dummy_columns, fill_value=0)


class DfFeatureUnion(BaseTransformer):
    """A dataframe friendly implementation of `FeatureUnion`"""

    def __init__(self, transformers):
        self.transformers = transformers

    def fit(self, X, y=None, **fit_params):
        for l, t in self.transformers:
            t.fit(X, y=y, **fit_params)
        return self

    def transform(self, X, **transform_params):
        transform_results = [t.transform(X, **transform_params) for l, t in self.transformers]
        return pd.concat(transform_results, axis=1)
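
The reason DfOneHot remembers the fitted dummy columns is so that transform can reindex against them later: rows containing categories that never appeared during fit still come back with exactly the same columns, which is what we'll rely on when transforming the Kaggle test set with a preprocessor fit on training data. A quick sanity check (not part of the original notebook, using a made-up toy frame):

# Toy example: 'green' is unseen at fit time, so its row encodes to all zeros
# (all False in newer pandas); the column set always matches the fitted one.
toy_train = pd.DataFrame({'color': ['red', 'blue']})
toy_test = pd.DataFrame({'color': ['blue', 'green']})

encoder = DfOneHot().fit(toy_train)
encoder.transform(toy_test)   # columns: color_blue, color_red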
In [6]:
training_data_full.columns
Out[6]:
Index(['people_id', 'activity_id', 'date_action', 'activity_category',
       'char_1_action', 'char_2_action', 'char_3_action', 'char_4_action',
       'char_5_action', 'char_6_action', 'char_7_action', 'char_8_action',
       'char_9_action', 'char_10_action', 'outcome', 'char_1_person',
       'group_1', 'char_2_person', 'date_person', 'char_3_person',
       'char_4_person', 'char_5_person', 'char_6_person', 'char_7_person',
       'char_8_person', 'char_9_person', 'char_10_person', 'char_11',
       'char_12', 'char_13', 'char_14', 'char_15', 'char_16', 'char_17',
       'char_18', 'char_19', 'char_20', 'char_21', 'char_22', 'char_23',
       'char_24', 'char_25', 'char_26', 'char_27', 'char_28', 'char_29',
       'char_30', 'char_31', 'char_32', 'char_33', 'char_34', 'char_35',
       'char_36', 'char_37', 'char_38'],
      dtype='object')
In [7]:
for col in training_data_full.columns:
    print("in {} there are {} unique values".format(col, len(training_data_full[col].unique())))
None
in people_id there are 151295 unique values
in activity_id there are 2197291 unique values
in date_action there are 411 unique values
in activity_category there are 7 unique values
in char_1_action there are 52 unique values
in char_2_action there are 33 unique values
in char_3_action there are 12 unique values
in char_4_action there are 8 unique values
in char_5_action there are 8 unique values
in char_6_action there are 6 unique values
in char_7_action there are 9 unique values
in char_8_action there are 19 unique values
in char_9_action there are 20 unique values
in char_10_action there are 6516 unique values
in outcome there are 2 unique values
in char_1_person there are 2 unique values
in group_1 there are 29899 unique values
in char_2_person there are 3 unique values
in date_person there are 1196 unique values
in char_3_person there are 43 unique values
in char_4_person there are 25 unique values
in char_5_person there are 9 unique values
in char_6_person there are 7 unique values
in char_7_person there are 25 unique values
in char_8_person there are 8 unique values
in char_9_person there are 9 unique values
in char_10_person there are 2 unique values
in char_11 there are 2 unique values
in char_12 there are 2 unique values
in char_13 there are 2 unique values
in char_14 there are 2 unique values
in char_15 there are 2 unique values
in char_16 there are 2 unique values
in char_17 there are 2 unique values
in char_18 there are 2 unique values
in char_19 there are 2 unique values
in char_20 there are 2 unique values
in char_21 there are 2 unique values
in char_22 there are 2 unique values
in char_23 there are 2 unique values
in char_24 there are 2 unique values
in char_25 there are 2 unique values
in char_26 there are 2 unique values
in char_27 there are 2 unique values
in char_28 there are 2 unique values
in char_29 there are 2 unique values
in char_30 there are 2 unique values
in char_31 there are 2 unique values
in char_32 there are 2 unique values
in char_33 there are 2 unique values
in char_34 there are 2 unique values
in char_35 there are 2 unique values
in char_36 there are 2 unique values
in char_37 there are 2 unique values
in char_38 there are 101 unique values

Potential trouble with high dimensionality

Notice that char_10_action, group_1 and others have a ton of unique values; one-hot encoding will result in a dataframe with thousands of columns.

In the spirit of getting to a first attempt as quickly as possible, let's skip those and only consider categorical variables with roughly 20 or fewer unique values. We'll get smarter about dealing with these variables and reinclude them in the model on a subsequent attempt.
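
As a sketch (not in the original notebook), the same selection could be done programmatically by filtering on cardinality rather than listing columns by hand. The cutoff is arbitrary; somewhere around 60 it should end up matching the hand-picked list in the next cell, since the columns that get dropped are the ids, dates, outcome, the quantitative char_38, plus group_1 (29,899 levels) and char_10_action (6,516 levels).

# Select categorical columns by cardinality instead of listing them by hand
max_levels = 60
non_categorical = {'people_id', 'activity_id', 'date_action', 'date_person',
                   'outcome', 'char_38'}

low_cardinality_cats = [
    col for col in training_data_full.columns
    if col not in non_categorical
    and training_data_full[col].nunique() < max_levels
]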

In [8]:
from sklearn.pipeline import Pipeline

from sklearn.preprocessing import Imputer, StandardScaler

cat_columns = ['activity_category',
       'char_1_action', 'char_2_action', 'char_3_action', 'char_4_action',
       'char_5_action', 'char_6_action', 'char_7_action', 'char_8_action',
       'char_9_action', 'char_1_person',
       'char_2_person', 'char_3_person',
       'char_4_person', 'char_5_person', 'char_6_person', 'char_7_person',
       'char_8_person', 'char_9_person', 'char_10_person', 'char_11',
       'char_12', 'char_13', 'char_14', 'char_15', 'char_16', 'char_17',
       'char_18', 'char_19', 'char_20', 'char_21', 'char_22', 'char_23',
       'char_24', 'char_25', 'char_26', 'char_27', 'char_28', 'char_29',
       'char_30', 'char_31', 'char_32', 'char_33', 'char_34', 'char_35',
       'char_36', 'char_37']

q_columns = ['char_38']

preprocessor = Pipeline([
    ('features', DfFeatureUnion([
        ('quantitative', Pipeline([
            ('select-quantitative', ColumnSelector(q_columns, c_type='float')),
            ('impute-missing', DfTransformerAdapter(Imputer(strategy='median'))),
            ('scale', DfTransformerAdapter(StandardScaler()))
        ])),
        ('categorical', Pipeline([
            ('select-categorical', ColumnSelector(cat_columns)),
            ('apply-onehot', DfOneHot()),
            ('spread-binary', SpreadBinary())
        ])),
    ]))
])

Sampling to reduce training runtime on the large dataset

If we train models on the entire provided training dataset, it exhausts the memory on my laptop. Again, in the spirit of getting something quick and dirty working, we'll sample the dataset and train on that. We'll then evaluate the model by testing its accuracy on a larger sample.

In [19]:
from sklearn.cross_validation import train_test_split

training_frac = 0.05
test_frac = 0.8

training_data, the_rest = train_test_split(training_data_full, train_size=training_frac, random_state=0)
test_data = the_rest.sample(frac=test_frac)
In [20]:
training_data.shape
Out[20]:
(109864, 55)
In [21]:
test_data.shape
Out[21]:
(1669942, 55)
In [22]:
wrangled = preprocessor.fit_transform(training_data)
In [23]:
wrangled.head()
Out[23]:
char_38 activity_category_type 1 activity_category_type 2 activity_category_type 3 activity_category_type 4 activity_category_type 5 activity_category_type 6 activity_category_type 7 char_1_action_type 1 char_1_action_type 10 ... char_33_False char_33_True char_34_False char_34_True char_35_False char_35_True char_36_False char_36_True char_37_False char_37_True
963496 -1.380347 -1 -1 -1 -1 1 -1 -1 -1 -1 ... 1 -1 1 -1 1 -1 1 -1 1 -1
874945 -0.910167 -1 1 -1 -1 -1 -1 -1 -1 -1 ... 1 -1 1 -1 1 -1 1 -1 1 -1
424945 -1.380347 -1 -1 -1 -1 1 -1 -1 -1 -1 ... 1 -1 1 -1 1 -1 1 -1 1 -1
1478640 1.357758 -1 1 -1 -1 -1 -1 -1 -1 -1 ... -1 1 -1 1 -1 1 -1 1 -1 1
723674 0.859921 -1 -1 -1 -1 1 -1 -1 -1 -1 ... -1 1 -1 1 -1 1 -1 1 -1 1

5 rows × 336 columns

Putting together classifiers

In [24]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

pipe_lr = Pipeline([
        ('wrangle', preprocessor),
        ('lr', LogisticRegression(C=100.0, random_state=0))
    ])

pipe_rf = Pipeline([
        ('wrangle', preprocessor),
        ('rf', RandomForestClassifier(criterion='entropy', n_estimators=10, random_state=0))
    ])
In [25]:
feature_columns = cat_columns + q_columns 
In [26]:
def extract_X_y(df):
    return df[feature_columns], df['outcome']

X_train, y_train = extract_X_y(training_data)
X_test, y_test = extract_X_y(test_data)

Reporting utilities

Some utilities to make reporting progress easier

In [48]:
import time
import subprocess

class time_and_log():
    
    def __init__(self, label, *, prefix='', say=False):
        self.label = label
        self.prefix = prefix
        self.say = say
    
    def __enter__(self):
        msg = 'Starting {}'.format(self.label)
        print('{}{}'.format(self.prefix, msg))
        if self.say:
            cmd_say(msg)
        self.start = time.process_time()
        return self

    def __exit__(self, *exc):
        self.interval = time.process_time() - self.start
        msg = 'Finished {} in {:.2f} seconds'.format(self.label, self.interval)
        print('{}{}'.format(self.prefix, msg))
        if self.say:
            cmd_say(msg)
        return False
    
def cmd_say(msg):
    # Speak the message aloud via the macOS `say` command
    subprocess.call("say '{}'".format(msg), shell=True)

Cross validation and full test set accuracy

We can cross-validate within the training set (the cross_val_score call is left commented out below), and then train on the full training sample and see how well each model performs on the full test sample.

In [50]:
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import cross_val_score
import numpy as np

models = [
    ('logistic regression', pipe_lr), 
    ('random forest', pipe_rf), 
]

for label, model in models:
    print('Evaluating {}'.format(label))
    cmd_say('Evaluating {}'.format(label))
#     with time_and_log('cross validating', say=True, prefix=" _"):
#         scores = cross_val_score(estimator=model,
#                              X=X_train,
#                              y=y_train,
#                              cv=5,
#                              n_jobs=1)
#         print('  CV accuracy: {:.3f} +/- {:.3f}'.format(np.mean(scores), np.std(scores)))
    with time_and_log('fitting full training set', say=True, prefix=" _"):
        model.fit(X_train, y_train)  
    with time_and_log('evaluating on full test set', say=True, prefix=" _"):
        print("  Full test accuracy ({:.2f} of dataset): {:.3f}".format(
                test_frac, 
                accuracy_score(y_test, model.predict(X_test)))) 
Evaluating logistic regression
 _Starting fitting full training set
 _Finished fitting full training set in 121.04 seconds
 _Starting evaluating on full test set
  Full test accuracy (0.80 of dataset): 0.861
 _Finished evaluating on full test set in 288.49 seconds
Evaluating random forest
 _Starting fitting full training set
 _Finished fitting full training set in 21.32 seconds
 _Starting evaluating on full test set
  Full test accuracy (0.80 of dataset): 0.923
 _Finished evaluating on full test set in 292.85 seconds

Preparing the submission

Random forest beat logistic regression, so let's start with a submission using that.

But first, let's see what the submission is supposed to look like:

In [56]:
pd.read_csv('sample_submission.csv.zip').head(5)
Out[56]:
activity_id outcome
0 act1_1 0
1 act1_100006 0
2 act1_100050 0
3 act1_100065 0
4 act1_100068 0

And now let's prepare the submission by fitting on the full provided training set and using it to predict on the provided test set.

In [57]:
kaggle_test_df = pd.merge(
    pd.read_csv('act_test.csv.zip'), 
    people, 
    how='inner', on='people_id', suffixes=['_action', '_person'], sort=False)
kaggle_test_df.head(2)
Out[57]:
people_id activity_id date_action activity_category char_1_action char_2_action char_3_action char_4_action char_5_action char_6_action ... char_29 char_30 char_31 char_32 char_33 char_34 char_35 char_36 char_37 char_38
0 ppl_100004 act1_249281 2022-07-20 type 1 type 5 type 10 type 5 type 1 type 6 type 1 ... True True True True True True True True True 76
1 ppl_100004 act2_230855 2022-07-20 type 5 NaN NaN NaN NaN NaN NaN ... True True True True True True True True True 76

2 rows × 54 columns

In [55]:
kaggle_test_df.shape
Out[55]:
(498687, 54)
In [58]:
X_kaggle_train, y_kaggle_train = extract_X_y(training_data_full)
In [59]:
with time_and_log('fitting rf on full kaggle training set', say=True): 
    pipe_rf.fit(X_kaggle_train, y_kaggle_train)
Starting fitting rf on full kaggle training set
Finished fitting rf on full kaggle training set in 548.33 seconds
In [60]:
with time_and_log('preparing kaggle submission', say=True):
    submission_df = kaggle_test_df[['activity_id']].copy()
    submission_df['outcome'] = pipe_rf.predict(kaggle_test_df)
    submission_df.to_csv("predicting-red-hat-business-value_1_rf.csv", index=False)
Starting preparing kaggle submission
Finished preparing kaggle submission in 84.99 seconds

This got me to 85% accuracy on the submission, placing 1099th out of 1250 teams. There are 190 people with 99% or greater accuracy and 837 with 95% or greater, so this definitely qualifies as merely a quick and dirty submission :)

It's also worth noting that people apparently figured out how to get 98% by looking only at the date and group columns, two of the columns I ditched to make it easier to get started.
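
For a follow-up attempt, a minimal way to bring those columns back in (purely a hypothetical sketch, not something used in this submission) might be to decompose the dates into parts and turn group_1 into an integer code rather than one-hot encoding it:

# Hypothetical follow-up features, not used in this attempt: split the dates
# into year/month/day and strip the "group " prefix so group_1 becomes a
# single integer column instead of a 29,899-column one-hot encoding.
def add_date_and_group_features(df):
    out = df.copy()
    for col in ['date_action', 'date_person']:
        d = pd.to_datetime(out[col])
        out[col + '_year'] = d.dt.year
        out[col + '_month'] = d.dt.month
        out[col + '_day'] = d.dt.day
    out['group_1_code'] = out['group_1'].str.replace('group ', '').astype(int)
    return out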