
## Open Machine Learning Course. Author: Elena Konstantinovna Plaksina, Levka.

Overview of tsfresh, a library for generating time-series features

Time Series FeatuRe Extraction based on Scalable Hypothesis tests

The library extracts features from time series. Almost any feature you can think of is already implemented in it, so there is little point in writing them yourself when the library computes them in a couple of lines of code.

The extracted features can be used to describe or cluster time series, and as inputs for classification or regression tasks on time series.

The feature calculation process consists of two steps:

Computing all available features

from tsfresh import extract_features
extracted_features = extract_features(timeseries, column_id="id", column_sort="time")

Imputing non-finite values and selecting relevant features

from tsfresh import select_features
from tsfresh.utilities.dataframe_functions import impute

impute(extracted_features)  # replace NaN and infinite values in place (no columns are dropped)
features_filtered = select_features(extracted_features, y)  # keep only features relevant to the target y
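As a rough illustration of what `impute` does to a single column, the sketch below follows the behaviour described in the tsfresh documentation: NaN becomes the column median, +inf the column maximum, -inf the column minimum, and a column with no finite values is filled with zeros. The helper `impute_column` is hypothetical and works on a plain list. It is not part of tsfresh.

```python
import math
import statistics


def impute_column(values):
    """Hypothetical single-column imitation of tsfresh's impute rules."""
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        # No finite values at all: fill the whole column with zeros
        return [0.0] * len(values)
    lowest, highest = min(finite), max(finite)
    median = statistics.median(finite)
    result = []
    for v in values:
        if math.isnan(v):
            result.append(median)    # NaN -> median of finite values
        elif v == math.inf:
            result.append(highest)   # +inf -> largest finite value
        elif v == -math.inf:
            result.append(lowest)    # -inf -> smallest finite value
        else:
            result.append(v)
    return result


print(impute_column([1.0, float("nan"), 3.0, float("inf"), float("-inf")]))
# [1.0, 2.0, 3.0, 3.0, 1.0]
```

The column keeps its length and position, which is why imputation alone never reduces the number of features.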

Feature selection procedure

Stage 1

Feature calculation

Stage 2

Testing each feature for significance and computing its p-value

Stage 3

Benjamini-Yekutieli correction for multiple hypothesis testing
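Stage 3 can be sketched in a few lines. The Benjamini-Yekutieli procedure sorts the m p-values, compares the k-th smallest one against k / (m * c(m)) * alpha, where c(m) = 1 + 1/2 + ... + 1/m, and rejects the null hypothesis for every feature up to the largest k that passes. The function name `benjamini_yekutieli`, the `alpha` argument, and the sample p-values are illustrative assumptions. This is a minimal sketch, not tsfresh's own implementation.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected."""
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic-sum factor that keeps the procedure valid under dependence
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= k / (m * c_m) * alpha
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / (m * c_m) * alpha:
            max_rank = rank
    # Step-up rule: reject every hypothesis ranked at or below max_rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for four features
print(benjamini_yekutieli([0.001, 0.8, 0.02, 0.04]))
# [True, False, False, False]
```

Without the correction, three of these four features would pass a plain p < 0.05 threshold. With it, only the first one is kept.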

Let us walk through feature generation on the Human Activity Recognition dataset

In [13]:

import matplotlib.pyplot as plt
%matplotlib inline
from tsfresh.examples.har_dataset import download_har_dataset, load_har_dataset, load_har_classes
import seaborn as sns
from tsfresh import extract_features, extract_relevant_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from tsfresh.feature_extraction import ComprehensiveFCParameters
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

Loading and plotting the data

In [4]:

download_har_dataset()

In [5]:

df = load_har_dataset()

In [12]:

plt.title('accelerometer reading')
plt.plot(df.iloc[0,:])
plt.show()

Feature extraction

In [14]:

# compute the feature set defined by ComprehensiveFCParameters (all available feature calculators)
extraction_settings = ComprehensiveFCParameters()

In [15]:

# reshape the first 500 sensor readings into the long (column-wise) format that the library expects
N = 500
master_df = pd.DataFrame({0: df[:N].values.flatten(),
                          1: np.arange(N).repeat(df.shape[1])})
master_df.head()

Out[15]:

	0	1
0	0.000181	0
1	0.010139	0
2	0.009276	0
3	0.005066	0
4	0.010810	0
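To see why `flatten()` and `repeat()` yield this long format, here is a toy version with 2 series of 3 readings each (the real data uses N = 500 series). The array `readings` is made up for illustration.

```python
import numpy as np

# Two hypothetical series (rows) with three readings each
readings = np.array([[0.1, 0.2, 0.3],
                     [1.1, 1.2, 1.3]])

# Row-major flattening lays each series' readings out one after another
values = readings.flatten()

# Each series id is repeated once per reading, so it lines up with `values`
ids = np.arange(readings.shape[0]).repeat(readings.shape[1])

for series_id, value in zip(ids, values):
    print(series_id, value)
```

Every row of the result is one reading tagged with the id of the series it belongs to. The notebook passes no sort column, so the sketch likewise relies on the row order within each id as the time order.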

In [16]:

X = extract_features(master_df, column_id=1, impute_function=impute, default_fc_parameters=extraction_settings)

Feature Extraction: 100%|██████████████████████████████████████████| 20/20 [00:34<00:00,  1.74s/it]
WARNING:tsfresh.utilities.dataframe_functions:The columns ['0__fft_coefficient__coeff_65__attr_"abs"'
 '0__fft_coefficient__coeff_65__attr_"angle"'
 '0__fft_coefficient__coeff_65__attr_"imag"'
 '0__fft_coefficient__coeff_65__attr_"real"'
 '0__fft_coefficient__coeff_66__attr_"abs"'
 '0__fft_coefficient__coeff_66__attr_"angle"'
 '0__fft_coefficient__coeff_66__attr_"imag"'
 '0__fft_coefficient__coeff_66__attr_"real"'
 '0__fft_coefficient__coeff_67__attr_"abs"'
 '0__fft_coefficient__coeff_67__attr_"angle"'
 '0__fft_coefficient__coeff_67__attr_"imag"'
 '0__fft_coefficient__coeff_67__attr_"real"'
 '0__fft_coefficient__coeff_68__attr_"abs"'
 '0__fft_coefficient__coeff_68__attr_"angle"'
 '0__fft_coefficient__coeff_68__attr_"imag"'
 '0__fft_coefficient__coeff_68__attr_"real"'
 '0__fft_coefficient__coeff_69__attr_"abs"'
 '0__fft_coefficient__coeff_69__attr_"angle"'
 '0__fft_coefficient__coeff_69__attr_"imag"'
 '0__fft_coefficient__coeff_69__attr_"real"'
 '0__fft_coefficient__coeff_70__attr_"abs"'
 '0__fft_coefficient__coeff_70__attr_"angle"'
 '0__fft_coefficient__coeff_70__attr_"imag"'
 '0__fft_coefficient__coeff_70__attr_"real"'
 '0__fft_coefficient__coeff_71__attr_"abs"'
 '0__fft_coefficient__coeff_71__attr_"angle"'
 '0__fft_coefficient__coeff_71__attr_"imag"'
 '0__fft_coefficient__coeff_71__attr_"real"'
 '0__fft_coefficient__coeff_72__attr_"abs"'
 '0__fft_coefficient__coeff_72__attr_"angle"'
 '0__fft_coefficient__coeff_72__attr_"imag"'
 '0__fft_coefficient__coeff_72__attr_"real"'
 '0__fft_coefficient__coeff_73__attr_"abs"'
 '0__fft_coefficient__coeff_73__attr_"angle"'
 '0__fft_coefficient__coeff_73__attr_"imag"'
 '0__fft_coefficient__coeff_73__attr_"real"'
 '0__fft_coefficient__coeff_74__attr_"abs"'
 '0__fft_coefficient__coeff_74__attr_"angle"'
 '0__fft_coefficient__coeff_74__attr_"imag"'
 '0__fft_coefficient__coeff_74__attr_"real"'
 '0__fft_coefficient__coeff_75__attr_"abs"'
 '0__fft_coefficient__coeff_75__attr_"angle"'
 '0__fft_coefficient__coeff_75__attr_"imag"'
 '0__fft_coefficient__coeff_75__attr_"real"'
 '0__fft_coefficient__coeff_76__attr_"abs"'
 '0__fft_coefficient__coeff_76__attr_"angle"'
 '0__fft_coefficient__coeff_76__attr_"imag"'
 '0__fft_coefficient__coeff_76__attr_"real"'
 '0__fft_coefficient__coeff_77__attr_"abs"'
 '0__fft_coefficient__coeff_77__attr_"angle"'
 '0__fft_coefficient__coeff_77__attr_"imag"'
 '0__fft_coefficient__coeff_77__attr_"real"'
 '0__fft_coefficient__coeff_78__attr_"abs"'
 '0__fft_coefficient__coeff_78__attr_"angle"'
 '0__fft_coefficient__coeff_78__attr_"imag"'
 '0__fft_coefficient__coeff_78__attr_"real"'
 '0__fft_coefficient__coeff_79__attr_"abs"'
 '0__fft_coefficient__coeff_79__attr_"angle"'
 '0__fft_coefficient__coeff_79__attr_"imag"'
 '0__fft_coefficient__coeff_79__attr_"real"'
 '0__fft_coefficient__coeff_80__attr_"abs"'
 '0__fft_coefficient__coeff_80__attr_"angle"'
 '0__fft_coefficient__coeff_80__attr_"imag"'
 '0__fft_coefficient__coeff_80__attr_"real"'
 '0__fft_coefficient__coeff_81__attr_"abs"'
 '0__fft_coefficient__coeff_81__attr_"angle"'
 '0__fft_coefficient__coeff_81__attr_"imag"'
 '0__fft_coefficient__coeff_81__attr_"real"'
 '0__fft_coefficient__coeff_82__attr_"abs"'
 '0__fft_coefficient__coeff_82__attr_"angle"'
 '0__fft_coefficient__coeff_82__attr_"imag"'
 '0__fft_coefficient__coeff_82__attr_"real"'
 '0__fft_coefficient__coeff_83__attr_"abs"'
 '0__fft_coefficient__coeff_83__attr_"angle"'
 '0__fft_coefficient__coeff_83__attr_"imag"'
 '0__fft_coefficient__coeff_83__attr_"real"'
 '0__fft_coefficient__coeff_84__attr_"abs"'
 '0__fft_coefficient__coeff_84__attr_"angle"'
 '0__fft_coefficient__coeff_84__attr_"imag"'
 '0__fft_coefficient__coeff_84__attr_"real"'
 '0__fft_coefficient__coeff_85__attr_"abs"'
 '0__fft_coefficient__coeff_85__attr_"angle"'
 '0__fft_coefficient__coeff_85__attr_"imag"'
 '0__fft_coefficient__coeff_85__attr_"real"'
 '0__fft_coefficient__coeff_86__attr_"abs"'
 '0__fft_coefficient__coeff_86__attr_"angle"'
 '0__fft_coefficient__coeff_86__attr_"imag"'
 '0__fft_coefficient__coeff_86__attr_"real"'
 '0__fft_coefficient__coeff_87__attr_"abs"'
 '0__fft_coefficient__coeff_87__attr_"angle"'
 '0__fft_coefficient__coeff_87__attr_"imag"'
 '0__fft_coefficient__coeff_87__attr_"real"'
 '0__fft_coefficient__coeff_88__attr_"abs"'
 '0__fft_coefficient__coeff_88__attr_"angle"'
 '0__fft_coefficient__coeff_88__attr_"imag"'
 '0__fft_coefficient__coeff_88__attr_"real"'
 '0__fft_coefficient__coeff_89__attr_"abs"'
 '0__fft_coefficient__coeff_89__attr_"angle"'
 '0__fft_coefficient__coeff_89__attr_"imag"'
 '0__fft_coefficient__coeff_89__attr_"real"'
 '0__fft_coefficient__coeff_90__attr_"abs"'
 '0__fft_coefficient__coeff_90__attr_"angle"'
 '0__fft_coefficient__coeff_90__attr_"imag"'
 '0__fft_coefficient__coeff_90__attr_"real"'
 '0__fft_coefficient__coeff_91__attr_"abs"'
 '0__fft_coefficient__coeff_91__attr_"angle"'
 '0__fft_coefficient__coeff_91__attr_"imag"'
 '0__fft_coefficient__coeff_91__attr_"real"'
 '0__fft_coefficient__coeff_92__attr_"abs"'
 '0__fft_coefficient__coeff_92__attr_"angle"'
 '0__fft_coefficient__coeff_92__attr_"imag"'
 '0__fft_coefficient__coeff_92__attr_"real"'
 '0__fft_coefficient__coeff_93__attr_"abs"'
 '0__fft_coefficient__coeff_93__attr_"angle"'
 '0__fft_coefficient__coeff_93__attr_"imag"'
 '0__fft_coefficient__coeff_93__attr_"real"'
 '0__fft_coefficient__coeff_94__attr_"abs"'
 '0__fft_coefficient__coeff_94__attr_"angle"'
 '0__fft_coefficient__coeff_94__attr_"imag"'
 '0__fft_coefficient__coeff_94__attr_"real"'
 '0__fft_coefficient__coeff_95__attr_"abs"'
 '0__fft_coefficient__coeff_95__attr_"angle"'
 '0__fft_coefficient__coeff_95__attr_"imag"'
 '0__fft_coefficient__coeff_95__attr_"real"'
 '0__fft_coefficient__coeff_96__attr_"abs"'
 '0__fft_coefficient__coeff_96__attr_"angle"'
 '0__fft_coefficient__coeff_96__attr_"imag"'
 '0__fft_coefficient__coeff_96__attr_"real"'
 '0__fft_coefficient__coeff_97__attr_"abs"'
 '0__fft_coefficient__coeff_97__attr_"angle"'
 '0__fft_coefficient__coeff_97__attr_"imag"'
 '0__fft_coefficient__coeff_97__attr_"real"'
 '0__fft_coefficient__coeff_98__attr_"abs"'
 '0__fft_coefficient__coeff_98__attr_"angle"'
 '0__fft_coefficient__coeff_98__attr_"imag"'
 '0__fft_coefficient__coeff_98__attr_"real"'
 '0__fft_coefficient__coeff_99__attr_"abs"'
 '0__fft_coefficient__coeff_99__attr_"angle"'
 '0__fft_coefficient__coeff_99__attr_"imag"'
 '0__fft_coefficient__coeff_99__attr_"real"'] did not have any finite values. Filling with zeros.
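The warning above has a likely simple cause. Each series here has 128 readings, and a real-input FFT of 128 points yields only 65 coefficients, indexed 0 to 64. Assuming tsfresh computes `fft_coefficient` with a real-input FFT such as `np.fft.rfft`, coefficients 65 to 99 do not exist for this data, so they come back non-finite and `impute` fills them with zeros. A quick check with NumPy:

```python
import numpy as np

series_length = 128  # readings per series in this dataset
coefficients = np.fft.rfft(np.zeros(series_length))

# A real-input FFT of n points returns n // 2 + 1 coefficients
print(len(coefficients))  # 65, i.e. indices 0..64
```

These all-zero columns carry no information, so the selection step later discards them.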

In [18]:

"Number of extracted features: {}.".format(X.shape[1])

Out[18]:

'Number of extracted features: 794.'

Training a classifier

In [19]:

y = load_har_classes()[:N]
y.shape

Out[19]:

(500,)

In [20]:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2)

In [21]:

cl = DecisionTreeClassifier()
cl.fit(X_train, y_train)
print(classification_report(y_test, cl.predict(X_test)))

             precision    recall  f1-score   support

          1       1.00      1.00      1.00        29
          2       1.00      1.00      1.00         9
          3       1.00      1.00      1.00        14
          4       0.36      0.36      0.36        14
          5       0.26      0.36      0.30        14
          6       0.60      0.45      0.51        20

avg / total       0.73      0.71      0.72       100

Now select features for each class separately, treating each class as a binary (one-vs-rest) classification problem

In [22]:

relevant_features = set()

for label in y.unique():
    y_train_binary = y_train == label
    X_train_filtered = select_features(X_train, y_train_binary)
    print("Number of relevant features for class {}: {}/{}".format(label, X_train_filtered.shape[1], X_train.shape[1]))
    relevant_features = relevant_features.union(set(X_train_filtered.columns))

WARNING:tsfresh.feature_selection.relevance:Infered classification as machine learning task

Number of relevant features for class 5: 216/794

WARNING:tsfresh.feature_selection.relevance:Infered classification as machine learning task

Number of relevant features for class 4: 202/794

WARNING:tsfresh.feature_selection.relevance:Infered classification as machine learning task

Number of relevant features for class 6: 188/794

WARNING:tsfresh.feature_selection.relevance:Infered classification as machine learning task

Number of relevant features for class 1: 216/794

WARNING:tsfresh.feature_selection.relevance:Infered classification as machine learning task

Number of relevant features for class 3: 220/794

WARNING:tsfresh.feature_selection.relevance:Infered classification as machine learning task

Number of relevant features for class 2: 166/794

In [23]:

len(relevant_features)

Out[23]:

264

We reduced the number of features from 794 to 264.
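The union is much smaller than the sum of the per-class counts (216 + 202 + 188 + 216 + 220 + 166 = 1208) because many features are relevant for several classes at once. The toy sketch below shows the same mechanics. The labels and the per-class feature sets are invented for illustration and stand in for the output of `select_features`.

```python
# Hypothetical class labels for four samples
labels = [1, 2, 1, 3]

# Hypothetical result of per-class feature selection
selected_per_class = {
    1: {"mean", "variance"},
    2: {"mean", "skewness"},
    3: {"kurtosis"},
}

binary_targets = {}
relevant = set()
for label in sorted(set(labels)):
    # One-vs-rest target: True for the current class, False for the others
    binary_targets[label] = [y == label for y in labels]
    # A feature is kept if it is relevant for at least one class
    relevant |= selected_per_class[label]

print(binary_targets[1])  # [True, False, True, False]
print(len(relevant))      # 4, because "mean" is shared by classes 1 and 2
```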

In [24]:

X_train_filtered = X_train[list(relevant_features)]
X_test_filtered = X_test[list(relevant_features)]

In [25]:

X_train_filtered.shape, X_test_filtered.shape

Out[25]:

((400, 264), (100, 264))

In [26]:

cl = DecisionTreeClassifier()
cl.fit(X_train_filtered, y_train)
print(classification_report(y_test, cl.predict(X_test_filtered)))

             precision    recall  f1-score   support

          1       1.00      1.00      1.00        29
          2       1.00      1.00      1.00         9
          3       1.00      1.00      1.00        14
          4       0.27      0.29      0.28        14
          5       0.29      0.36      0.32        14
          6       0.62      0.50      0.56        20

avg / total       0.72      0.71      0.71       100

Model quality is practically unchanged (average F1 of 0.71 versus 0.72), but the model is now much simpler: it uses 264 features instead of 794.

Comparison with a classifier trained on the standard (raw) features

In [29]:

X_1 = df.iloc[:N,:]
X_1.shape

Out[29]:

(500, 128)

In [30]:

X_train, X_test, y_train, y_test = train_test_split(X_1, y, test_size=.2)

In [31]:

cl = DecisionTreeClassifier()
cl.fit(X_train, y_train)
print(classification_report(y_test, cl.predict(X_test)))

             precision    recall  f1-score   support

          1       0.55      0.58      0.56        19
          2       0.69      0.52      0.59        21
          3       0.75      0.46      0.57        13
          4       0.42      0.57      0.48        14
          5       0.65      0.50      0.56        22
          6       0.20      0.36      0.26        11

avg / total       0.57      0.51      0.53       100

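As we can see, the classifier trained on tsfresh features (average F1 of about 0.72) is considerably better than the naive classifier trained on the raw readings (average F1 of 0.53). The two were evaluated on different random train/test splits, so the comparison is indicative rather than exact.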