import numpy as np
np.random.seed(0)
import kts
from kts import *
Feature constructors and helpers defined earlier are automatically loaded:
simple_feature
@feature
def simple_feature(df):
    res = stl.empty_like(df)
    res['is_male'] = (df.Sex == 'male') + 0
    return res
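For reference, the + 0 in the feature body just casts the boolean mask to integers. A minimal plain-pandas sketch of the same computation (no kts machinery involved; demo is a made-up frame):

import pandas as pd
demo = pd.DataFrame({'Sex': ['male', 'female', 'male']})
print((demo.Sex == 'male') + 0)  # -> 1, 0, 1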
Use kts.ls to list objects saved in your user cache and kts.rm to remove them:
print(kts.ls())
kts.rm('external')
print(kts.ls())
['train', 'test', 'external']
['train', 'test']
train = kts.load('train')
test = kts.load('test')
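These cached entries would have been created earlier with the saving counterpart of kts.load; a sketch, assuming kts.save(obj, name) is that counterpart:

kts.save(train, 'train')  # assumption: this is how 'train' was cached in the first place
kts.save(test, 'test')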
kts.models.{binary, multiclass, regression} contain the most popular models for each task type. In particular, all of the corresponding sklearn models are present, as well as CatBoost, LGBM and XGB if they are already installed. We'll also add neural nets there soon.
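Since these are ordinary Python modules, standard introspection shows what each one exports (plain Python, nothing kts-specific):

from kts.models import binary
print([name for name in dir(binary) if not name.startswith('_')])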
from kts.models import binary, multiclass, regression
Init signatures are preserved:
cb = binary.CatBoostClassifier(iterations=100, rsm=.15, custom_metric='AUC')
cb
custom_metric = 'AUC'
loss_function = 'Logloss'
rsm = 0.15
iterations = 100
CatBoostClassifier(custom_metric='AUC', loss_function='Logloss', rsm=0.15, iterations=100)
lr = binary.LogisticRegression(C=.5, solver='lbfgs', max_iter=1000)
lr
C = 0.5
class_weight = None
dual = False
fit_intercept = True
intercept_scaling = 1
max_iter = 1000
multi_class = 'warn'
penalty = 'l2'
random_state = None
solver = 'lbfgs'
tol = 0.0001
warm_start = False
LogisticRegression(C=0.5, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, max_iter=1000, multi_class='warn', penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, warm_start=False)
from category_encoders import TargetEncoder, WOEEncoder
fs = FeatureSet(
    # Plain features, computed from the frame alone
    [simple_feature,
     interactions('Pclass', 'Age'),
     num_aggs('Fare'),
     tfidf('Name'),
     stl.one_hot_encode('Embarked')],
    # Target-dependent encoders: these use the 'Survived' target
    [stl.category_encode(TargetEncoder(), 'Embarked', 'Survived'),
     stl.category_encode(WOEEncoder(), 'Embarked', 'Survived')],
    train_frame=train,
    targets='Survived')
To define a validation scheme, use kts.Validator(splitter, metric). The splitter is used to split the training set, and the metric is used to evaluate trained models.
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
val = Validator(skf, roc_auc_score)
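Any sklearn-compatible splitter/metric pair plugs in the same way. For instance, a non-stratified variant (an illustrative alternative, not used below):

from sklearn.model_selection import KFold
val_kf = Validator(KFold(n_splits=5, shuffle=True, random_state=42), roc_auc_score)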
Running validation is as easy as val.score(model, feature_set):
val.score(cb, fs)
{'score': 0.8450484134619144, 'id': 'KPBVAI'}
val.score(lr, fs)
{'score': 0.8145216602070352, 'id': 'FYCMDA'}
Right after validation, your experiments are placed in the leaderboard:
lb
You can also keep multiple leaderboards by passing the leaderboard parameter to val.score(). The default leaderboard is main.
some_model = binary.KNeighborsClassifier()
val.score(some_model, fs, leaderboard='other')
{'score': 0.7948359406496035, 'id': 'GWZCPQ'}
Use kts.leaderboard_list or kts.lbs to access leaderboards other than main:
lbs.main is lb
lbs.other is lbs['other']
leaderboard_list is lbs
True
True
True
Note that the new experiment appeared only in the new leaderboard:
lb
lbs.other
Experiments are accessible by their identifiers:
lb['KPBVAI'] is lb.KPBVAI
True
lb.KPBVAI
loss_function = 'Logloss'
custom_metric = 'AUC'
rsm = 0.15
iterations = 100
Inference is as easy as experiment.predict(frame). Features are computed automatically.
lb.KPBVAI.predict(test.head(5))
array([0.17121523, 0.57615125, 0.0999157 , 0.23952768, 0.78401192])
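For a Kaggle-style submission you could binarize these probabilities yourself; a sketch, assuming the test frame carries the Titanic PassengerId column:

import pandas as pd
preds = lb.KPBVAI.predict(test)
submission = pd.DataFrame({
    'PassengerId': test.PassengerId,        # assumption: present in this frame
    'Survived': (preds > 0.5).astype(int),  # threshold predicted probabilities at 0.5
})
submission.to_csv('submission.csv', index=False)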
lb.KPBVAI.feature_importances
Experiment.feature_importances(self, plot, estimator, sort_by, n_best, verbose)
>>> from kts.feature_selection import Permutation
>>> lb.ABCDEF.feature_importances(plot=False)  # -> pd.DataFrame
>>> lb.ABCDEF.feature_importances(estimator=Permutation(train_frame, n_iters=3), sort_by='max')
lb.KPBVAI.feature_importances(sort_by='mean', n_best=7)
Use plot=False to get feature importances by fold:
lb.KPBVAI.feature_importances(plot=False)
fold | tfidf__Name_0 | Embarked_ce_OneHotEncoder_2 | Pclass_add_Age | Fare_sub_div_mean | Embarked_ce_Survived_WOEEncoder | tfidf__Name_3 | Embarked_ce_OneHotEncoder_1 | Embarked_ce_OneHotEncoder_3 | Embarked_ce_Survived_TargetEncoder | Embarked_ce_OneHotEncoder_0 | tfidf__Name_1 | tfidf__Name_2 | tfidf__Name_4 | Fare_div_std | Fare_div_mean | Pclass_sub_Age | Pclass_mul_Age | is_male
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 2.94677 | 1.83414 | 1.96158 | 7.29455 | 1.27512 | 5.17552 | 2.0481 | 0 | 1.1201 | 2.87565 | 3.16809 | 5.48229 | 5.95125 | 3.48855 | 4.00294 | 6.07287 | 5.79585 | 39.5066
1 | 5.99983 | 1.11964 | 2.7143 | 12.0095 | 1.27202 | 3.33875 | 2.55421 | 0 | 1.37644 | 3.57352 | 3.79159 | 3.8706 | 4.67407 | 2.16574 | 1.38701 | 5.58991 | 10.2945 | 34.2684
2 | 2.48002 | 0.846825 | 5.26318 | 5.46335 | 2.98447 | 3.8642 | 1.56882 | 0 | 1.76954 | 2.11486 | 4.45559 | 5.13815 | 4.27618 | 5.2687 | 6.55907 | 6.86313 | 8.55423 | 32.5297
3 | 4.31537 | 1.30387 | 2.74786 | 3.78222 | 1.87028 | 4.34941 | 1.48347 | 0 | 1.03283 | 0.910795 | 6.2502 | 5.76845 | 5.14516 | 6.06301 | 6.2413 | 11.0083 | 7.67064 | 30.0568
4 | 1.74347 | 0.30458 | 0.848658 | 0.741891 | 0.262854 | 0.492911 | 1.88876 | 0 | 2.91524 | 0 | 4.45662 | 2.74165 | 4.42125 | 0.428414 | 13.1209 | 2.5172 | 11.8818 | 51.2338
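Since plot=False returns a plain pd.DataFrame with one row per fold, ordinary pandas aggregations apply; for example, a sketch of ranking features by mean importance:

imp = lb.KPBVAI.feature_importances(plot=False)
print(imp.mean().sort_values(ascending=False).head())  # top features averaged over folds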
Specify an importance estimator to compute permutation importance:
from kts.feature_selection import Permutation
lb.KPBVAI.feature_importances(sort_by='mean', estimator=Permutation(train, n_iters=10))
Suppose you want to use a model that is not in kts.models, like Regularized Greedy Forest.
!pip3 install rgf_python > /dev/null
from rgf.sklearn import RGFClassifier
To use it, simply create a class derived from both your classifier and kts.CustomModel. It may optionally define a preprocess method or inherit one from a mixin, such as kts.NormalizeFillNAMixin:
class KTSWrapper(kts.CustomModel, somelib.SomeClassifier):
    ignored_params = [...]
    def preprocess(self, X, y=None):
        if y is None:
            print('if y is None then .predict is called')
        else:
            print('otherwise .fit')
        return X, y
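As a more concrete sketch of the same idea, here is a hypothetical wrapper whose preprocess zero-fills NaNs before the underlying estimator sees the data (FillNAWrapper and the behavior ascribed to preprocess are illustrative assumptions, not part of kts):

import numpy as np

class FillNAWrapper(kts.CustomModel, RGFClassifier):  # hypothetical name
    ignored_params = ['memory_policy', 'n_jobs', 'verbose']
    def preprocess(self, X, y=None):
        # Replace NaNs with zeros; the target passes through unchanged
        return np.nan_to_num(X), y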
An alternative approach is to use the kts.custom_model(ModelClass, ignored_params, normalize_fillna=True/False) function:
RGF = custom_model(RGFClassifier, ignored_params=['memory_policy', 'n_jobs', 'verbose'], normalize_fillna=True)
However, subclassing gives more freedom in defining custom preprocessing. In this example the classifier can't deal with NaN values, so we use kts.NormalizeFillNAMixin to add a preprocessing method:
class RGF(NormalizeFillNAMixin, CustomModel, RGFClassifier):
ignored_params = ['memory_policy', 'n_jobs', 'verbose']
RGF()
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rgf/utils.py:225: UserWarning: Cannot find FastRGF executable files. FastRGF estimators will be unavailable for usage.
algorithm = 'RGF'
calc_prob = 'sigmoid'
init_model = None
l2 = 0.1
learning_rate = 0.5
loss = 'Log'
max_leaf = 1000
min_samples_leaf = 10
n_iter = None
n_tree_search = 1
normalize = False
opt_interval = 100
reg_depth = 1.0
sl2 = None
test_interval = 100
RGF(algorithm='RGF', calc_prob='sigmoid', init_model=None, l2=0.1, learning_rate=0.5, loss='Log', max_leaf=1000, min_samples_leaf=10, n_iter=None, n_tree_search=1, normalize=False, opt_interval=100, reg_depth=1.0, sl2=None, test_interval=100)
val.score(RGF(), fs)
{'score': 0.7704334423329472, 'id': 'BFELLE'}