
Feature Engineering Guide¶

In [1]:

import pandas as pd
import numpy as np
np.random.seed(0)

import kts
from kts import *

DASHBOARD

features

simple_feature

FEATURE CONSTRUCTOR

name

simple_feature

source

@feature
def simple_feature(df):
    res = stl.empty_like(df)
    res['is_male'] = (df.Sex == 'male') + 0
    return res

interactions

GENERIC FEATURE

name

interactions

source

@feature
@generic(left="Pclass", right="SibSp")
def interactions(df):
    res = stl.empty_like(df)
    res[f"{left}_add_{right}"] = df[left] + df[right]
    res[f"{left}_sub_{right}"] = df[left] - df[right]
    res[f"{left}_mul_{right}"] = df[left] * df[right]
    return res

num_aggs

GENERIC FEATURE

name

num_aggs

description

Descriptions are also supported.

source

@feature
@generic(col="Parch")
def num_aggs(df):
    """Descriptions are also supported."""
    res = pd.DataFrame(index=df.index)
    mean = df[col].mean()
    std = df[col].std()
    res[f"{col}_div_mean"] = df[col] / mean
    res[f"{col}_sub_div_mean"] = (df[col] - mean) / mean
    res[f"{col}_div_std"] = df[col] / std
    return res

tfidf

GENERIC FEATURE

name

tfidf

source

@feature
@generic(col='Name')
def tfidf(df):
    if df.train:
        enc = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), max_features=5)
        res = enc.fit_transform(df[col])
        df.state['enc'] = enc
    else:
        enc = df.state['enc']
        res = enc.transform(df[col])
    return res.todense()

requirements

sklearn==0.20.2

helpers

You've got no helpers so far.

In [2]:

train = pd.read_csv('../input/train.csv', index_col='PassengerId')
test = pd.read_csv('../input/test.csv', index_col='PassengerId')

In [3]:

train.head()

Out[3]:

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

Use kts.save to store objects or dataframes in the user cache:

In [4]:

kts.save(train, 'train')
kts.save(test, 'test')

Modular Feature Engineering in 30 seconds¶

Instead of sequentially adding new columns to one dataframe, you define functions called feature blocks, which take a raw dataframe as input and produce a new dataframe containing only new columns. Then these blocks are collected into feature sets. Such encapsulation enables your features to be computed in parallel, cached, and automatically applied during inference stage, making your experiments executable end-to-end out of the box.

A feature block is defined as a function that takes one dataframe as an argument and returns a dataframe. The input and output indices should be identical:

In [5]:

def dummy_feature_a(df):
    res = pd.DataFrame(index=df.index)
    res['a'] = 'a'
    return res

dummy_feature_a(train[:2])
dummy_feature_a(train[2:5])

Out[5]:

	a
PassengerId
1	a
2	a

Out[5]:

	a
PassengerId
3	a
4	a
5	a

@preview(frame, size_1, size_2, ...) does almost the same thing as above: it runs your feature constructor on frame.head(size_1), frame.head(size_2), ....

In addition, you can test out parallel execution. By default all of your features will be parallel, but if you want to change this behavior, use parallel=False.

In [6]:

@preview(train, 2, 5, parallel=True)
def dummy_feature_a(df):
    res = stl.empty_like(df)  # kts.stl is a standard library of feature constructors. Now you need to know
    res['a'] = 'a'            # only that stl.empty_like(df) is identical to pd.DataFrame(index=df.index)
    return res

COMPUTING FEATURES

feature

progress

dummy_feature_a

0s

	a
PassengerId
1	a
2	a

	a
PassengerId
1	a
2	a
3	a
4	a
5	a

Feature blocks usually consist of more than one feature:

In [7]:

@preview(train, 3, 6)
def dummy_feature_age_mean(df):
    res = stl.empty_like(df)
    res['Age'] = df['Age']
    res['mean'] = df['Age'].mean()
    return res

COMPUTING FEATURES

feature

progress

dummy_feature_age_mean

0s

	Age	mean
PassengerId
1	22.0	28.666667
2	38.0	28.666667
3	26.0	28.666667

	Age	mean
PassengerId
1	22.0	31.2
2	38.0	31.2
3	26.0	31.2
4	35.0	31.2
5	35.0	31.2
6	NaN	31.2

Functions are registered and converted into feature constructors using the @feature decorator:

In [8]:

@feature
def dummy_feature_a(df):
    res = stl.empty_like(df)
    res['a'] = 'a'
    return res

@feature
def dummy_feature_bcd(df):
    res = stl.empty_like(df)
    res['b'] = 'b'
    res['c'] = 'c'
    res['d'] = 'd'
    return res

@feature
def dummy_feature_age_mean(df):
    res = stl.empty_like(df)
    res['mean'] = df['Age'].mean()
    return res

Then a feature set is defined by a list of feature constructors. Use slicing syntax to preview it:

In [9]:

dummy_fs = FeatureSet([dummy_feature_a, dummy_feature_bcd, dummy_feature_age_mean], train_frame=train)
dummy_fs[30:35]

COMPUTING FEATURES

feature

progress

dummy_feature_a

0s

dummy_feature_bcd

0s

dummy_feature_age_mean

0s

Out[9]:

	a	b	c	d	mean
PassengerId
31	a	b	c	d	44.666667
32	a	b	c	d	44.666667
33	a	b	c	d	44.666667
34	a	b	c	d	44.666667
35	a	b	c	d	44.666667
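
Conceptually, previewing a feature set on a slice amounts to running each block on that slice and joining the outputs column-wise. The sketch below imitates this in plain pandas on a small made-up frame. The names `toy`, `block_a`, `block_age_mean` and `combine_blocks` are hypothetical, and this is not how KTS implements FeatureSet internally: caching and parallel execution are ignored.

```python
import pandas as pd

# A tiny made-up frame standing in for the Titanic data.
toy = pd.DataFrame(
    {"Age": [22.0, 38.0, 26.0, 35.0]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

def block_a(df):
    res = pd.DataFrame(index=df.index)
    res["a"] = "a"
    return res

def block_age_mean(df):
    res = pd.DataFrame(index=df.index)
    res["mean"] = df["Age"].mean()
    return res

def combine_blocks(blocks, df):
    """Run every block on df and join the results column-wise (hypothetical helper)."""
    parts = [block(df) for block in blocks]
    for part in parts:
        # Each block must preserve the input index.
        assert part.index.equals(df.index)
    return pd.concat(parts, axis=1)

combined = combine_blocks([block_a, block_age_mean], toy.iloc[1:3])
print(combined)
```

Note that the mean is computed over the slice only, which matches the behaviour visible in the output above, where the `mean` column reflects just the previewed rows.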

Let's clean up our namespace a bit:

In [10]:

delete(dummy_feature_a, force=True)
delete(dummy_feature_bcd, force=True)
delete(dummy_feature_age_mean, force=True)

Now let's get to the real things.

Decorators¶

Almost all of the functions that you'll use have rich docstrings with examples. Although it is not necessary, I'll demonstrate them throughout this tutorial.

Let's first take a closer look at the decorators that you have already seen. Don't be confused if you can't understand something, as it will be better explained in the Feature Types section.

@preview¶

In [11]:

preview

Out[11]:

PREVIEW DOCS

signature

preview(frame, sizes, parallel, train)

description

Runs a feature constructor several times to let you make sure it works correctly

Sequentially passes frame.head(size) to your feature constructor
for each provided size.
Generic features can also be previewed, in this case they'll be
initialized using their default arguments.

params

frame

a dataframe to be used for testing your feature

*sizes

one or more ints, sizes of input dataframes

parallel

whether to preview as a parallel feature constructor

train

df.train flag value to be passed to the feature constructor

examples

>>> @preview(train, 2, 3, parallel=False)
... def some_feature(df):
...     res = stl.empty_like(df)
...     res['col'] = ...
...     return res

>>> @preview(train, 200)
... def some_feature(df):
...     return stl.mean_encode(['Parch', 'Embarked'], 'Survived')(df)

>>> @preview(train, 100)
... @generic(left="Age", right="SibSp")
... def numeric_interactions(df):
...     res = stl.empty_like(df)
...     res[f"{left}_add_{right}"] = df[left] + df[right]
...     res[f"{left}_sub_{right}"] = df[left] - df[right]
...     res[f"{left}_mul_{right}"] = df[left] * df[right]
...     return res
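
To make the description above concrete, here is a minimal sketch of a decorator that behaves roughly like @preview in plain Python: it calls the decorated function on frame.head(size) for each size. `preview_sketch` and its `previews` attribute are hypothetical names for illustration only; the real decorator also handles parallel execution, generic features and reporting, none of which is reproduced here.

```python
import pandas as pd

def preview_sketch(frame, *sizes):
    """Hypothetical stand-in for @preview: call func on frame.head(size) for each size."""
    def decorator(func):
        # Results are stored on the function so they can be inspected afterwards.
        func.previews = [func(frame.head(size)) for size in sizes]
        for result in func.previews:
            print(result)
        return func
    return decorator

# A tiny made-up frame standing in for the Titanic data.
toy = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0]})

@preview_sketch(toy, 2, 3)
def age_minus_mean(df):
    res = pd.DataFrame(index=df.index)
    res["age_centered"] = df["Age"] - df["Age"].mean()
    return res
```

Running the function on several sizes is a cheap way to spot features whose values depend on how many rows they see, such as the mean used here.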

@feature¶

In [12]:

feature

Out[12]:

FEATURE DOCS

signature

feature(args, cache, parallel, verbose)

description

Registers a function as a feature constructor and saves it

Can be used both with and without flags.
Note that generic feature constructors should be
additionally registered using this decorator.

params

cache

whether to cache calls and avoid recomputing

parallel

whether to run in parallel with other parallel FCs

verbose

whether to print logs and show progress

returns

A feature constructor.

examples

>>> @feature(parallel=False, verbose=False)
... def some_feature(df):
...     ...

>>> @feature
... def some_feature(df):
...     ...

>>> @feature
... @generic(param='default')
... def generic_feature(df):
...     ...

@generic¶

In [13]:

generic

Out[13]:

GENERIC DOCS

signature

generic(kwargs)

description

Creates a generic feature constructor

Generic features are parametrized feature constructors.

Note that this decorator does not register your function
and you should add @feature to save it.

params

**kwargs

arguments and their default values

returns

A generic feature constructor.

examples

>>> @feature
... @generic(left="Age", right="SibSp")
... def numeric_interactions(df):
...     res = stl.empty_like(df)
...     res[f"{left}_add_{right}"] = df[left] + df[right]
...     res[f"{left}_sub_{right}"] = df[left] - df[right]
...     res[f"{left}_mul_{right}"] = df[left] * df[right]
...     return res

>>> from itertools import combinations
>>> fs = FeatureSet([
...     numeric_interactions(left, right)
...     for left, right in combinations(['Parch', 'SibSp', 'Age'], r=2)
... ], ...)

delete¶

In [14]:

delete

Out[14]:

DELETE DOCS

signature

delete(feature_or_helper, force)

description

Deletes given feature or helper from lists and clears cache

Feature constructors are deleted along with their cache.
Generic feature constructors are also fully deleted.
As some STL features produce cache, you can also remove it
by passing an STL feature as an argument. The STL feature itself won't be removed.

params

feature_or_helper

an instance to be removed

force

force deletion without any warnings and confirmations

examples

>>> delete(incorrect_feature)
>>> delete(old_helper)
>>> delete(stl.mean_encode('Embarked', 'Survived'))
>>> delete(generic_feature)

Feature Types¶

Regular Features¶

This type of feature constructor (FC) should already look quite familiar:

In [15]:

@preview(train, 5)
def simple_feature(df):
    res = stl.empty_like(df)
    res['is_male'] = (df.Sex == 'male') + 0
    return res

COMPUTING FEATURES

feature

progress

simple_feature

0s

	is_male
PassengerId
1	1
2	0
3	0
4	0
5	1

In [16]:

@feature
def simple_feature(df):
    res = stl.empty_like(df)
    res['is_male'] = (df.Sex == 'male') + 0
    return res

Feature constructors can print anything to stdout and it will be shown in your report in real time, even if your features are computed in separate processes:

In [17]:

@preview(train, 2)
def feature_with_stdout(df):
    res = stl.empty_like(df)
    res['a'] = 'a'
    print('some logs')
    return res

COMPUTING FEATURES

feature

progress

feature_with_stdout

[17:34:13.126] some logs

0s

	a
PassengerId
1	a
2	a

Use kts.pbar to track progress of long-running features:

In [18]:

import time

@preview(train, 2)
def feature_with_pbar(df):
    res = stl.empty_like(df)
    res['a'] = 'a'
    for i in pbar(['a', 'b', 'c']):
        time.sleep(1)
    return res

COMPUTING FEATURES

feature

progress

feature_with_pbar

3s

0s

	a
PassengerId
1	a
2	a

They can also be nested and titled:

In [19]:

@preview(train, 2)
def feature_with_nested_pbar(df):
    res = stl.empty_like(df)
    res['a'] = 'a'
    for i in pbar(['a', 'b', 'c']):
        for j in pbar(range(6), title=i):
            time.sleep(0.5)
    return res

COMPUTING FEATURES

feature

progress

feature_with_nested_pbar

9s

0s

feature_with_nested_pbar - a

3s

0s

feature_with_nested_pbar - b

3s

0s

feature_with_nested_pbar - c

3s

0s

	a
PassengerId
1	a
2	a

Features Using External Frames¶

Sometimes datasets consist of more than one dataframe. To get an external dataframe into your feature constructor's scope, save it with kts.save() and then use the following syntax:

In [20]:

external = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
kts.save(external, 'external')

@preview(train, 7)
def feature_using_external(df, somename='external'):
    """
    To get an external dataframe, you should set its name in user cache as a default value.
    Inside it will look like a usual dataframe.
    """
    print(somename.__class__.__name__)
    time.sleep(1)  # a short delay to receive stdout
    res = stl.empty_like(df)
    res['Pclass'] = df['Pclass']
    res['somefeat'] = somename.set_index('a').loc[df['Pclass']]['b'].values
    return res

COMPUTING FEATURES

feature

progress

feature_using_external

[17:34:25.902] DataFrame

1s

	Pclass	somefeat
PassengerId
1	3	6
2	1	4
3	3	6
4	1	4
5	3	6
6	3	6
7	1	4
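
The lookup in the cell above can be tried outside KTS as well. The sketch below uses plain pandas and a made-up `toy` frame (an assumption, not part of the tutorial data) to show the same `.loc`-based lookup next to an equivalent `Series.map` call, which keeps the index of the input frame.

```python
import pandas as pd

external = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# A tiny made-up frame standing in for the Titanic data.
toy = pd.DataFrame(
    {"Pclass": [3, 1, 3, 1]},
    index=pd.Index([1, 2, 3, 4], name="PassengerId"),
)

# Lookup used in the cell above: index the external frame by 'a',
# then select rows by the Pclass values.
via_loc = external.set_index("a").loc[toy["Pclass"]]["b"].values

# Equivalent lookup with Series.map, which keeps the index of toy.
via_map = toy["Pclass"].map(external.set_index("a")["b"])

print(via_loc)
print(via_map)
```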

Stateful Features¶

Some features need their state to be saved between the training and inference stages. In this case, use df.train (or df._train) to identify the current stage and df.state (or df._state) as a dictionary for writing and reading the state.

Unfortunately, so far @preview can only emulate the training stage. Later we'll add @preview_train_test to emulate both stages.

In [21]:

@preview(train, 5)
def stateful_feature(df):
    """A simple standardizer"""
    res = stl.empty_like(df)
    if df.train:
        print('this is a training stage')
        df.state['mean'] = df['Age'].mean()
        df.state['std'] = df['Age'].std()
    mean = df.state['mean']
    std = df.state['std']
    res['Age'] = df['Age']
    res['age_std'] = (df['Age'] - mean) / std
    return res

COMPUTING FEATURES

feature

progress

stateful_feature

[17:34:27.039] this is a training stage

0s

	Age	age_std
PassengerId
1	22.0	-1.346261
2	38.0	0.995063
3	26.0	-0.760930
4	35.0	0.556064
5	35.0	0.556064
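
Since @preview covers only the training stage, the two-stage pattern can be tried out in plain pandas. In the sketch below, an explicit `train` flag and a `state` dictionary play the roles of df.train and df.state; `standardize_age`, `train_part` and `test_part` are hypothetical names, and this only emulates the idea rather than KTS itself.

```python
import pandas as pd

def standardize_age(df, train, state):
    """Standardize Age; fit the statistics when train=True, reuse them otherwise."""
    res = pd.DataFrame(index=df.index)
    if train:
        state["mean"] = df["Age"].mean()
        state["std"] = df["Age"].std()
    res["age_std"] = (df["Age"] - state["mean"]) / state["std"]
    return res

# Made-up data for the two stages.
train_part = pd.DataFrame({"Age": [20.0, 30.0, 40.0]})
test_part = pd.DataFrame({"Age": [30.0, 50.0]})

state = {}  # plays the role of df.state
fitted = standardize_age(train_part, train=True, state=state)
applied = standardize_age(test_part, train=False, state=state)

print(state)
print(applied)
```

The inference call never recomputes the mean or standard deviation, so the test rows are scaled with the training statistics.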

Generic Features¶

You can also create reusable functions with @generic(arg1=default, arg2=default, ...). For preview, default arguments are used.

In [22]:

@preview(train, 5)
@generic(left="Pclass", right="SibSp")
def interactions(df):
    res = stl.empty_like(df)
    res[f"{left}_add_{right}"] = df[left] + df[right]
    res[f"{left}_sub_{right}"] = df[left] - df[right]
    res[f"{left}_mul_{right}"] = df[left] * df[right]
    return res

COMPUTING FEATURES

feature

progress

interactions__Pclass_SibSp

0s

	Pclass_add_SibSp	Pclass_sub_SibSp	Pclass_mul_SibSp
PassengerId
1	4	2	3
2	2	0	1
3	3	3	0
4	2	0	1
5	3	3	0
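
One way to think about generic features is as a factory that binds parameters and returns an ordinary feature function. The plain-Python sketch below illustrates that idea with a closure; `make_interactions` is a hypothetical name, and this is an analogy rather than a description of how @generic is implemented.

```python
import pandas as pd

def make_interactions(left="Pclass", right="SibSp"):
    """Hypothetical factory: returns a feature function bound to the given columns."""
    def interactions(df):
        res = pd.DataFrame(index=df.index)
        res[f"{left}_add_{right}"] = df[left] + df[right]
        res[f"{left}_sub_{right}"] = df[left] - df[right]
        res[f"{left}_mul_{right}"] = df[left] * df[right]
        return res
    return interactions

# A tiny made-up frame standing in for the Titanic data.
toy = pd.DataFrame({"Pclass": [3, 1], "SibSp": [1, 1], "Parch": [0, 2]})

default_block = make_interactions()                  # default arguments, as in a preview
custom_block = make_interactions("Pclass", "Parch")  # explicit arguments

print(default_block(toy))
print(custom_block(toy))
```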

Let's register a couple of generic features:

In [23]:

@feature
@generic(left="Pclass", right="SibSp")
def interactions(df):
    res = stl.empty_like(df)
    res[f"{left}_add_{right}"] = df[left] + df[right]
    res[f"{left}_sub_{right}"] = df[left] - df[right]
    res[f"{left}_mul_{right}"] = df[left] * df[right]
    return res

In [24]:

@feature
@generic(col="Parch")
def num_aggs(df):
    """Descriptions are also supported."""
    res = pd.DataFrame(index=df.index)
    mean = df[col].mean()
    std = df[col].std()
    res[f"{col}_div_mean"] = df[col] / mean
    res[f"{col}_sub_div_mean"] = (df[col] - mean) / mean
    res[f"{col}_div_std"] = df[col] / std
    return res

The next feature is both generic and stateful. It also returns a dense NumPy matrix instead of a dataframe. In this case, KTS automatically attaches the input index to the resulting dataframe.

In [25]:

from sklearn.feature_extraction.text import TfidfVectorizer

@preview(train, 10)
@generic(col='Name')
def tfidf(df):
    if df.train:
        enc = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), max_features=5)
        res = enc.fit_transform(df[col])
        df.state['enc'] = enc
    else:
        enc = df.state['enc']
        res = enc.transform(df[col])
    return res.todense()

COMPUTING FEATURES

feature

progress

tfidf__Name

0s

	tfidf__Name_0	tfidf__Name_1	tfidf__Name_2	tfidf__Name_3	tfidf__Name_4
PassengerId
1	0.508281	0.338854	0.185575	0.742300	0.203426
2	0.593616	0.197872	0.433463	0.541828	0.356369
3	0.464173	0.464173	0.508413	0.000000	0.557318
4	0.603771	0.301886	0.661317	0.220439	0.241644
5	0.631088	0.420725	0.460825	0.460825	0.000000
6	0.508984	0.508984	0.278748	0.557496	0.305561
7	0.779844	0.259948	0.000000	0.569447	0.000000
8	0.395067	0.526756	0.288481	0.288481	0.632461
9	0.605911	0.302956	0.442440	0.331830	0.485000
10	0.449865	0.449865	0.492741	0.246371	0.540139
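
The automatic wrapping of array outputs can be pictured with a few lines of pandas. The helper below is a hypothetical sketch, not KTS code: `wrap_array` and its `prefix` argument are made up, and the column naming merely mimics the `tfidf__Name_0` pattern visible in the output above.

```python
import numpy as np
import pandas as pd

def wrap_array(values, index, prefix):
    """Turn a 2-D array into a dataframe with the given index and numbered columns."""
    values = np.asarray(values)  # also converts np.matrix, e.g. the result of .todense()
    columns = [f"{prefix}_{i}" for i in range(values.shape[1])]
    return pd.DataFrame(values, index=index, columns=columns)

# Made-up index and values standing in for the TF-IDF output.
toy_index = pd.Index([1, 2, 3], name="PassengerId")
dense = np.array([[0.5, 0.0], [0.1, 0.9], [0.0, 1.0]])

wrapped = wrap_array(dense, toy_index, prefix="tfidf__Name")
print(wrapped)
```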

Don't forget to change @preview to @feature to register generics:

In [26]:

@feature
@generic(col='Name')
def tfidf(df):
    if df.train:
        enc = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), max_features=5)
        res = enc.fit_transform(df[col])
        df.state['enc'] = enc
    else:
        enc = df.state['enc']
        res = enc.transform(df[col])
    return res.todense()

In [27]:

tfidf

Out[27]:

GENERIC FEATURE

name

tfidf

source

@feature
@generic(col='Name')
def tfidf(df):
    if df.train:
        enc = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), max_features=5)
        res = enc.fit_transform(df[col])
        df.state['enc'] = enc
    else:
        enc = df.state['enc']
        res = enc.transform(df[col])
    return res.todense()

requirements

sklearn==0.20.2

Note that KTS added sklearn to dependencies. Right now it is not very useful, but later it may be used to dockerize experiments automatically.

Standard Library¶

KTS provides the most essential feature constructors as a standard library, i.e. kts.stl submodule. All of the STL features have rich docstrings.

stl.empty_like¶

In [28]:

stl.empty_like

Out[28]:

EMPTY_LIKE DOCS

description

Returns an empty dataframe, preserving only index

examples

>>> @feature
... def some_feature(df):
...     res = stl.empty_like(df)
...     res['col'] = ...
...     return res

In [29]:

@preview(train, 5)
def preview_stl(df):
    return stl.empty_like(df)

COMPUTING FEATURES

feature

progress

preview_stl

0s


PassengerId
1
2
3
4
5

stl.identity¶

In [30]:

stl.identity

Out[30]:

IDENTITY DOCS

description

Returns its input

examples

>>> fs = FeatureSet([stl.identity, one_feature, another_feature], ...)
>>> assert (stl.identity & ['a', 'b'])(df).equals(stl.select(['a', 'b'])(df))
>>> assert (stl.identity - ['a', 'b'])(df).equals(stl.drop(['a', 'b'])(df))

In [31]:

@preview(train, 5)
def preview_stl(df):
    return stl.identity(df)

COMPUTING FEATURES

feature

progress

preview_stl

0s

	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

stl.select¶

In [32]:

stl.select

Out[32]:

SELECT DOCS

signature

select(columns)

description

Selects columns from a dataframe. Identical to df[columns]

params

columns

columns to select

returns

A feature constructor selecting given columns from input dataframe.

examples

>>> assert stl.select(['a', 'b'])(df).equals(df[['a', 'b']])

In [33]:

@preview(train, 5)
def preview_stl(df):
    return stl.select(['Name', 'Sex'])(df)

COMPUTING FEATURES

feature

progress

preview_stl

0s

	Name	Sex
PassengerId
1	Braund, Mr. Owen Harris	male
2	Cumings, Mrs. John Bradley (Florence Briggs Th...	female
3	Heikkinen, Miss. Laina	female
4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female
5	Allen, Mr. William Henry	male

stl.drop¶

In [34]:

stl.drop

Out[34]:

DROP DOCS

signature

drop(columns)

description

Drops columns from a dataframe. Identical to df.drop(columns, axis=1)

params

columns

columns to drop

returns

A feature constructor dropping given columns from input dataframe.

examples

>>> assert stl.drop(['a', 'b'])(df).equals(df.drop(['a', 'b'], axis=1))

In [35]:

@preview(train, 5)
def preview_stl(df):
    return stl.drop(['Survived'])(df)

COMPUTING FEATURES

feature

progress

preview_stl

0s

	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
PassengerId
1	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
2	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
3	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
5	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

stl.concat¶

In [36]:

stl.concat

Out[36]:

CONCAT DOCS

signature

concat(feature_constructors)

description

Concatenates feature constructors

params

feature_constructors

list of feature constructors

returns

A single feature constructor whose output contains columns from each of the given features.

examples

>>> from category_encoders import WOEEncoder, CatBoostEncoder
>>> stl.concat([
...     stl.select(['Age']),
...     stl.category_encode(WOEEncoder(), ['Sex', 'Embarked'], 'Survived'),
...     stl.category_encode(CatBoostEncoder(), ['Sex', 'Embarked'], 'Survived'),
... ])

In [37]:

@preview(train, 5)
def preview_stl(df):
    res = stl.concat([
        stl.select(['Sex', 'Name']),
        simple_feature,
        tfidf('Name')
    ])(df)
    return res

COMPUTING FEATURES

feature

progress

preview_stl

2s

simple_feature

0s

tfidf__Name

0s

	Name	Sex	is_male	tfidf__Name_0	tfidf__Name_1	tfidf__Name_2	tfidf__Name_3	tfidf__Name_4
PassengerId
1	Braund, Mr. Owen Harris	male	1	0.497477	0.331651	0.165826	0.000000	0.784236
2	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	0	0.610662	0.203554	0.407108	0.240666	0.601665
3	Heikkinen, Miss. Laina	female	0	0.546402	0.546402	0.546402	0.323011	0.000000
4	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	0	0.544245	0.272122	0.544245	0.536227	0.214491
5	Allen, Mr. William Henry	male	1	0.447424	0.298283	0.298283	0.705332	0.352666

stl.apply¶

In [38]:

stl.apply

Out[38]:

APPLY DOCS

signature

apply(df, func, parts, optimize, verbose)

description

Applies a function row-wise in parallel. Identical to df.apply(func, axis=1)

params

df

input dataframe

func

function taking a pd.Series as input and returning a single value

parts

number of parts to split the dataframe into. May be greater than the number of cores

optimize

if set to True, then the dataframe won't be partitioned if its size is less than 100

verbose

whether to show a progress bar for each process

returns

A dataframe whose only column contains the result of calling func for each row.

examples

>>> def func(row):
...     if row.Embarked == 'S':
...         return row.SibSp
...     return row.Age
>>> stl.apply(df, func, parts=7, verbose=True)

In [39]:

@preview(train, 700, parallel=True)
def preview_stl(df):
    def func(row):
        """A regular row-wise function with any logic."""
        time.sleep(0.1)
        if row.Embarked == 'S':
            return row.SibSp
        return row.Age
    res = stl.empty_like(df)
    res['col'] = stl.apply(df, func, parts=7, verbose=True)
    return res

COMPUTING FEATURES

feature

progress

preview_stl

19s

stl_apply_0_100

10s

0s

stl_apply_100_200

10s

0s

stl_apply_200_300

10s

0s

stl_apply_300_400

10s

0s

stl_apply_400_500

10s

0s

stl_apply_500_600

10s

0s

stl_apply_600_700

10s

0s

	col
PassengerId
1	1.0
2	38.0
3	0.0
4	1.0
5	0.0
...	...
696	0.0
697	0.0
698	NaN
699	49.0
700	0.0

700 rows × 1 columns
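
The partition-and-apply idea behind stl.apply can be sketched with the standard library. The example below is an assumption-laden illustration, not the KTS implementation: `apply_in_parts` is a hypothetical helper, and it uses threads rather than processes, so it shows the splitting and reassembly logic but will not speed up CPU-bound functions.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import pandas as pd

def apply_in_parts(df, func, parts=4):
    """Split df into row chunks, apply func row-wise to each chunk, and reassemble."""
    # Split row positions rather than the frame itself, then slice with iloc.
    position_groups = np.array_split(np.arange(len(df)), parts)
    chunks = [df.iloc[pos] for pos in position_groups if len(pos) > 0]
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(lambda chunk: chunk.apply(func, axis=1), chunks))
    # Chunks keep their original index, so concatenation restores the row order.
    return pd.concat(results)

def func(row):
    if row.Embarked == "S":
        return row.SibSp
    return row.Age

# A tiny made-up frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "Q", "S"],
    "SibSp": [1, 1, 0, 0, 2],
    "Age": [22.0, 38.0, 26.0, 35.0, 54.0],
})

parallel = apply_in_parts(toy, func, parts=2)
sequential = toy.apply(func, axis=1)
print((parallel == sequential).all())
```

Splitting into more parts than workers, as the `parts` parameter description suggests, can help balance the load when some rows take longer than others.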

stl.category_encode¶

In [40]:

stl.category_encode

Out[40]:

CATEGORY_ENCODE DOCS

signature

category_encode(encoder, columns, targets)

description

Encodes categorical features in parallel

Performs both simple category encoding, such as one-hot, and various target encoding techniques.
If target columns are provided, each pair (encoded column, target column) from the cartesian product of
both lists is encoded using encoder.

Runs encoders returning one column (e.g. TargetEncoder, WOEEncoder)
or fixed number of columns (HashingEncoder, BaseNEncoder) in parallel,
whereas encoders whose number of output columns depends on count of unique values (HelmertEncoder, OneHotEncoder)
are run in the main process to avoid result serialization overhead.

params

encoder

an instance of encoder from category_encoders package with predefined parameters

columns

list of encoded columns. Treats string as a list of length 1

targets

list of target columns. Should be provided if encoder uses target. Treats string as a list of length 1

returns

A feature constructor returning a concatenation of resulting columns.

examples

>>> from category_encoders import WOEEncoder, TargetEncoder
>>> stl.category_encode(WOEEncoder(), ['Sex', 'Embarked'], 'Survived')
>>> stl.category_encode(TargetEncoder(smoothing=3), ['Sex', 'Embarked'], ['Survived', 'Age'])
>>> stl.category_encode(WOEEncoder(sigma=0.1, regularization=0.5), 'Sex', 'Survived')

In [41]:

from category_encoders import CatBoostEncoder, WOEEncoder, TargetEncoder

@preview(train, 100)
def preview_stl(df):
    encoder = CatBoostEncoder(sigma=3, random_state=0)
    return stl.category_encode(encoder, columns=['Cabin', 'Embarked'], targets='Survived')(df)

COMPUTING FEATURES

feature

progress

preview_stl

0s

	Cabin_ce_Survived_CatBoostEncoder_random_state_0_sigma_3	Embarked_ce_Survived_CatBoostEncoder_random_state_0_sigma_3
PassengerId
1	2.579784	2.579784
2	0.902193	0.902193
3	0.806924	0.806924
4	3.166299	3.629659
5	3.103257	3.978111
...	...	...
96	1.096301	1.090039
97	0.422915	0.485321
98	2.606621	2.848815
99	0.479063	0.475339
100	0.783394	0.780401

100 rows × 2 columns

In [42]:

@preview(train, 100)
def preview_stl(df):
    return stl.concat([
        stl.select(['Cabin', 'Survived']),
        stl.category_encode(CatBoostEncoder(random_state=0), columns='Cabin', targets='Survived'),
        stl.category_encode(WOEEncoder(), columns='Cabin', targets='Survived'),
        stl.category_encode(TargetEncoder(), columns='Cabin', targets='Survived'),
    ])(df)

COMPUTING FEATURES

feature

progress

preview_stl

0s

	Survived	Cabin	Cabin_ce_Survived_CatBoostEncoder_random_state_0	Cabin_ce_Survived_WOEEncoder	Cabin_ce_Survived_TargetEncoder
PassengerId
1	0	NaN	0.410000	-0.253322	0.35
2	1	C85	0.410000	0.000000	0.41
3	1	NaN	0.205000	-0.253322	0.35
4	1	C123	0.410000	0.000000	0.41
5	0	NaN	0.470000	-0.253322	0.35
...	...	...	...	...	...
96	0	NaN	0.351410	-0.253322	0.35
97	0	A5	0.410000	0.000000	0.41
98	1	D10 D12	0.410000	0.000000	0.41
99	1	NaN	0.346962	-0.253322	0.35
100	0	NaN	0.355125	-0.253322	0.35

100 rows × 5 columns
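
The "cartesian product of both lists" mentioned in the docs can be illustrated with a deliberately naive encoder. The sketch below replaces each category with the plain per-category target mean for every (column, target) pair. `naive_mean_encode` and the `_naive` column suffix are hypothetical, and the unsmoothed mean is a stand-in technique, not what WOEEncoder, CatBoostEncoder or TargetEncoder compute.

```python
from itertools import product

import pandas as pd

def naive_mean_encode(df, columns, targets):
    """Encode each (column, target) pair with the plain per-category target mean."""
    res = pd.DataFrame(index=df.index)
    for col, target in product(columns, targets):
        # transform('mean') broadcasts each category mean back to its rows.
        res[f"{col}_ce_{target}_naive"] = df.groupby(col)[target].transform("mean")
    return res

# A tiny made-up frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S"],
    "Survived": [0, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0],
})

encoded = naive_mean_encode(toy, ["Sex", "Embarked"], ["Survived", "Age"])
print(encoded.columns.tolist())
```

Two columns and two targets give four output columns. Because each row's own target contributes to its encoding, this naive version leaks the target, which is one reason such encoders are usually fitted on training folds only.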

stl.mean_encode¶

In [43]:

stl.mean_encode

Out[43]:

MEAN_ENCODE DOCS

signature

mean_encode(columns, targets, smoothing, min_samples_leaf)

description

Performs mean target encoding in parallel

An alias to stl.category_encode(TargetEncoder(smoothing, min_samples_leaf), columns, targets).

params

columns

list of encoded columns. Treats string as a list of length 1

targets

list of target columns. Should be provided if encoder uses target. Treats string as a list of length 1

smoothing

smoothing effect to balance categorical average vs prior.
Higher value means stronger regularization.
The value must be strictly bigger than 0.

min_samples_leaf

minimum samples to take category average into account.

returns

A feature constructor performing mean encoding for each pair (column, target) and returning the concatenation.

examples

>>> stl.mean_encode(['Sex', 'Embarked'], ['Survived', 'Age'])
>>> stl.mean_encode(['Sex', 'Embarked'], 'Survived', smoothing=1.5, min_samples_leaf=5)

In [44]:

@preview(train, 100)
def preview_stl(df):
    """An alias for stl.category_encode(TargetEncoder())"""
    return stl.mean_encode('Cabin', 'Survived', smoothing=3)(df)

COMPUTING FEATURES

feature

progress

preview_stl

0s

	Cabin_ce_Survived_TargetEncoder_smoothing_3.0
PassengerId
1	0.35
2	0.41
3	0.35
4	0.41
5	0.35
...	...
96	0.35
97	0.41
98	0.41
99	0.35
100	0.35

100 rows × 1 columns
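
To give a feel for what the smoothing and min_samples_leaf parameters control, here is a sketch of a sigmoid-weighted blend between the category mean and the global mean. The formula is an assumption modelled on the scheme used by some releases of category_encoders' TargetEncoder. The installed version may differ in details, for example in how single-occurrence categories are treated, and `smoothed_mean_encode` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def smoothed_mean_encode(categories, target, smoothing=1.0, min_samples_leaf=1):
    """Blend each category mean with the global mean using a sigmoid weight."""
    prior = target.mean()
    stats = target.groupby(categories).agg(["count", "mean"])
    # The weight approaches 1 for frequent categories and 0 for rare ones.
    weight = 1 / (1 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
    encoding = prior * (1 - weight) + stats["mean"] * weight
    # Missing categories fall back to the prior.
    return categories.map(encoding).fillna(prior)

# A tiny made-up frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Cabin": ["A", "A", "A", "A", "B", "C"],
    "Survived": [1, 1, 1, 1, 0, 0],
})

toy["Cabin_smoothed"] = smoothed_mean_encode(toy["Cabin"], toy["Survived"], smoothing=1.0)
print(toy)
```

With a larger smoothing value the weights flatten, so the encoding of a frequent category such as "A" moves closer to the global mean.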

stl.one_hot_encode¶

In [45]:

stl.one_hot_encode

Out[45]:

ONE_HOT_ENCODE DOCS

signature

one_hot_encode(columns)

description

Performs simple one-hot encoding

An alias to stl.category_encode(OneHotEncoder(), columns).

params

columns

list of columns to be encoded. Treats string as a list of length 1

returns

A feature constructor returning a concatenation of one-hot encoding of each column.

examples

>>> stl.one_hot_encode(['Sex', 'Embarked'])
>>> stl.one_hot_encode('Embarked')

In [46]:

# One-hot encoding produces many columns but is computationally cheap,
# so we don't compute it in parallel.
@preview(train, 100, parallel=False)
def preview_stl(df):
    """An alias for stl.category_encode(OneHotEncoder())"""
    return stl.one_hot_encode('Embarked')(df)

COMPUTING FEATURES

feature

progress

preview_stl

0s

	Embarked_ce_OneHotEncoder_0	Embarked_ce_OneHotEncoder_1	Embarked_ce_OneHotEncoder_2	Embarked_ce_OneHotEncoder_3
PassengerId
1	1	0	0	0
2	0	1	0	0
3	1	0	0	0
4	1	0	0	0
5	1	0	0	0
...	...	...	...	...
96	1	0	0	0
97	0	1	0	0
98	0	1	0	0
99	1	0	0	0
100	1	0	0	0

100 rows × 4 columns

Feature Set¶

In [47]:

FeatureSet

Out[47]:

FEATURESET DOCS

signature

FeatureSet(before_split, after_split, train_frame, test_frame, targets, auxiliary, description)

description

Collects and computes feature constructors

params

before_split

list of regular features

after_split

list of stateful features which may leak target if computed before split.
They are run in Single Validation mode, i.e. for each fold they are fit using training objects
and then applied to validation objects in inference mode.

train_frame

a dataframe to perform training on. Should contain unique indices for each object.

targets

list of target columns in case of a multilabel task, or a single string otherwise.
Target columns may be computed. In this case the corresponding feature constructors
should be passed to before_split list.

auxiliary

list of auxiliary columns, such as datetime, groups or whatever else can be used
for setting up your validation. These columns can be utilized by overriding Validator.
As well as targets, auxiliary columns may be computed.

description

any notes about this feature set.

examples

>>> fs = FeatureSet([feature_1, feature_2], [single_validation_feature],
...                  train_frame=train, targets='Survived')

>>> fs = FeatureSet([feature_1, feature_2], [single_validation_feature],
...                  train_frame=train,
...                  targets=['Target1', 'Target2'], auxiliary=['date', 'metric_group'])

>>> fs = FeatureSet([stl.select(['Age', 'Fare'])], [stl.mean_encode(['Embarked', 'Parch'], 'Survived')],
...                  train_frame=train, targets='Survived')
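
The after_split behaviour described above (fit on the training objects of a fold, then apply to its validation objects) can be sketched in plain pandas and NumPy. Everything here is illustrative: `out_of_fold_mean_encode` is a hypothetical helper, the random fold assignment is an assumption, and in KTS the actual splits are not shown in this section.

```python
import numpy as np
import pandas as pd

def out_of_fold_mean_encode(df, column, target, n_folds=2, seed=0):
    """For each fold: fit category means on training rows, apply them to validation rows."""
    rng = np.random.RandomState(seed)
    positions = rng.permutation(len(df))
    encoded = pd.Series(np.nan, index=df.index, name=f"{column}_oof_{target}")
    for valid_pos in np.array_split(positions, n_folds):
        train_pos = np.setdiff1d(positions, valid_pos)
        fold_train = df.iloc[train_pos]
        fold_valid = df.iloc[valid_pos]
        # "Training stage": statistics come from the training rows only.
        means = fold_train.groupby(column)[target].mean()
        prior = fold_train[target].mean()
        # "Inference stage": validation rows never see their own targets.
        encoded.iloc[valid_pos] = fold_valid[column].map(means).fillna(prior).values
    return encoded

# A tiny made-up frame standing in for the Titanic data.
toy = pd.DataFrame({
    "Embarked": ["S", "C", "S", "S", "C", "Q"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

oof = out_of_fold_mean_encode(toy, "Embarked", "Survived")
print(oof)
```

Because every row is encoded with statistics from rows outside its own fold, the encoded column cannot leak that row's target into the features.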

In [48]:

fs = FeatureSet([simple_feature, interactions('Pclass', 'Age'), num_aggs('Fare'), tfidf('Name')], 
                [stl.category_encode(TargetEncoder(), 'Embarked', 'Survived'), 
                 stl.category_encode(WOEEncoder(), 'Embarked', 'Survived')],
                train_frame=train,
                targets='Survived')

Each feature set is given a unique identifier. It also contains source code of all the features right in its repr:

In [49]:

fs

Out[49]:

FEATURE SET

name

FSBWBXEK

features

simple_feature

FEATURE CONSTRUCTOR

name

simple_feature

source

@feature
def simple_feature(df):
    res = stl.empty_like(df)
    res['is_male'] = (df.Sex == 'male') + 0
    return res

interactions

FEATURE CONSTRUCTOR

name

interactions('Pclass', 'Age')

description

An instance of generic feature constructor interactions

source

interactions('Pclass', 'Age')

additional source

@feature
@generic(left="Pclass", right="SibSp")
def interactions(df):
    res = stl.empty_like(df)
    res[f"{left}_add_{right}"] = df[left] + df[right]
    res[f"{left}_sub_{right}"] = df[left] - df[right]
    res[f"{left}_mul_{right}"] = df[left] * df[right]
    return res

num_aggs

FEATURE CONSTRUCTOR

name

num_aggs('Fare')

description

An instance of generic feature constructor num_aggs

source

num_aggs('Fare')

additional source

@feature
@generic(col="Parch")
def num_aggs(df):
    """Descriptions are also supported."""
    res = pd.DataFrame(index=df.index)
    mean = df[col].mean()
    std = df[col].std()
    res[f"{col}_div_mean"] = df[col] / mean
    res[f"{col}_sub_div_mean"] = (df[col] - mean) / mean
    res[f"{col}_div_std"] = df[col] / std
    return res

tfidf

FEATURE CONSTRUCTOR

name

tfidf('Name')

description

An instance of generic feature constructor tfidf

source

tfidf('Name')

additional source

@feature
@generic(col='Name')
def tfidf(df):
    if df.train:
        enc = TfidfVectorizer(analyzer='char', ngram_range=(1, 3), max_features=5)
        res = enc.fit_transform(df[col])
        df.state['enc'] = enc
    else:
        enc = df.state['enc']
        res = enc.transform(df[col])
    return res.todense()

requirements

sklearn==0.20.2

stl.category_encode

FEATURE CONSTRUCTOR

name

stl.category_encode(TargetEncoder(), ['Embarked'], ['Survived'])

source

stl.category_encode(TargetEncoder(), ['Embarked'], ['Survived'])

stl.category_encode

FEATURE CONSTRUCTOR

name

stl.category_encode(WOEEncoder(), ['Embarked'], ['Survived'])

source

stl.category_encode(WOEEncoder(), ['Embarked'], ['Survived'])

source

FeatureSet([simple_feature,
            interactions('Pclass', 'Age'),
            num_aggs('Fare'),
            tfidf('Name')],
           [stl.category_encode(TargetEncoder(), ['Embarked'], ['Survived']),
            stl.category_encode(WOEEncoder(), ['Embarked'], ['Survived'])],
           targets=['Survived'],
           auxiliary=[])

requirements

sklearn==0.20.2

Use slicing to preview your feature sets. Slicing calls are not cached and do not leak dataframes to IPython namespace, so you can run them as many times as you need. For stateful features, slicing calls always trigger a training stage.

In [50]:

fs[:10]

COMPUTING FEATURES

feature

progress

simple_feature

0s

interactions__Pclass_Age

0s

num_aggs__Fare

0s

tfidf__Name

0s

Out[50]:

	is_male	Pclass_add_Age	Pclass_sub_Age	Pclass_mul_Age	Fare_div_mean	Fare_sub_div_mean	Fare_div_std	tfidf__Name_0	tfidf__Name_1	tfidf__Name_2	tfidf__Name_3	tfidf__Name_4	Embarked_ce_Survived_TargetEncoder	Embarked_ce_Survived_WOEEncoder
PassengerId
1	1	25.0	-19.0	66.0	0.268312	-0.731688	0.307178	0.508281	0.338854	0.185575	0.742300	0.203426	0.428748	-0.223144
2	0	39.0	-37.0	38.0	2.638088	1.638088	3.020231	0.593616	0.197872	0.433463	0.541828	0.356369	0.865529	1.098612
3	0	29.0	-23.0	78.0	0.293292	-0.706708	0.335778	0.464173	0.464173	0.508413	0.000000	0.557318	0.428748	-0.223144
4	0	36.0	-34.0	35.0	1.965151	0.965151	2.249815	0.603771	0.301886	0.661317	0.220439	0.241644	0.428748	-0.223144
5	1	38.0	-32.0	105.0	0.297918	-0.702082	0.341074	0.631088	0.420725	0.460825	0.460825	0.000000	0.428748	-0.223144
6	1	NaN	NaN	NaN	0.313029	-0.686971	0.358373	0.508984	0.508984	0.278748	0.557496	0.305561	0.500000	0.000000
7	1	55.0	-53.0	54.0	1.919353	0.919353	2.197383	0.779844	0.259948	0.000000	0.569447	0.000000	0.428748	-0.223144
8	1	5.0	1.0	6.0	0.779954	-0.220046	0.892935	0.395067	0.526756	0.288481	0.288481	0.632461	0.428748	-0.223144
9	0	30.0	-24.0	81.0	0.412027	-0.587973	0.471711	0.605911	0.302956	0.442440	0.331830	0.485000	0.428748	-0.223144
10	0	16.0	-12.0	28.0	1.112875	0.112875	1.274082	0.449865	0.449865	0.492741	0.246371	0.540139	0.865529	1.098612

In [51]: