## mlcourse.ai – Open Machine Learning Course

Authors: Olga Daykhovskaya, Yury Kashnitskiy. This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

# Assignment #7 (demo). Solution

## Unsupervised learning

Same assignment as a Kaggle Kernel + solution.

In this task, we will look at how dimensionality reduction and clustering methods work. Along the way, we will practice solving a classification task again.

We will work with the Samsung Human Activity Recognition dataset. Download the data here. The data comes from the accelerometers and gyroscopes of Samsung Galaxy S3 mobile phones (you can find more information about the features using the link above). The type of activity of the person carrying the phone in a pocket is also known: walking, standing, lying down, sitting, or walking up or down the stairs.

First, we pretend that the type of activity is unknown to us and try to cluster the observations purely on the basis of the available features. Then we treat determining the type of physical activity as a classification problem.

Fill in the code where needed ("Your code is here") and answer the questions in the web form.

In [2]:
import os
import numpy as np
import pandas as pd
import seaborn as sns
from tqdm import tqdm_notebook

%matplotlib inline
from matplotlib import pyplot as plt
plt.style.use(['seaborn-darkgrid'])  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8-darkgrid'
plt.rcParams['figure.figsize'] = (12, 9)
plt.rcParams['font.family'] = 'DejaVu Sans'

from sklearn import metrics
from sklearn.cluster import KMeans, AgglomerativeClustering, SpectralClustering
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

RANDOM_STATE = 17
In [1]:
PATH_TO_SAMSUNG_DATA = "../../data/samsung_HAR"
In [4]:
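# The start of each call was truncated in the source; np.loadtxt and the feature file names are assumed.
X_train = np.loadtxt(os.path.join(PATH_TO_SAMSUNG_DATA, "samsung_train.txt"))
y_train = np.loadtxt(os.path.join(PATH_TO_SAMSUNG_DATA, "samsung_train_labels.txt")).astype(int)

X_test = np.loadtxt(os.path.join(PATH_TO_SAMSUNG_DATA, "samsung_test.txt"))
y_test = np.loadtxt(os.path.join(PATH_TO_SAMSUNG_DATA, "samsung_test_labels.txt")).astype(int)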
In [5]:
# Checking dimensions
assert(X_train.shape == (7352, 561) and y_train.shape == (7352,))
assert(X_test.shape == (2947, 561) and y_test.shape == (2947,))

For clustering, we do not need a target vector, so we'll work with the combination of training and test samples. Merge X_train with X_test, and y_train with y_test.

In [4]:
X = np.vstack([X_train, X_test])
y = np.hstack([y_train, y_test])

Determine the number of unique values of the target class labels.

In [5]:
np.unique(y)
Out[5]:
array([1, 2, 3, 4, 5, 6])
In [6]:
n_classes = np.unique(y).size

These labels correspond to:

• 1 – walking
• 2 – walking upstairs
• 3 – walking downstairs
• 4 – sitting
• 5 – standing
• 6 – laying down

Scale the sample using StandardScaler with default parameters.

In [7]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Reduce the number of dimensions using PCA, leaving as many components as necessary to explain at least 90% of the variance of the original (scaled) data. Use the scaled dataset and fix random_state (RANDOM_STATE constant).

In [8]:
pca = PCA(n_components=0.9, random_state=RANDOM_STATE).fit(X_scaled)
X_pca = pca.transform(X_scaled)
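
When `n_components` is a float between 0 and 1, scikit-learn's PCA is documented to keep the smallest number of components whose cumulative explained-variance ratio exceeds that fraction. The sketch below mimics that selection rule with plain NumPy. The helper names (`explained_variance_ratio`, `min_components`), the toy ratios and the synthetic data are illustrative assumptions, not part of the assignment.

```python
import numpy as np


def explained_variance_ratio(X):
    """Share of the total variance carried by each principal component."""
    X_centered = X - X.mean(axis=0)
    # Squared singular values of the centered data are proportional
    # to the variances along the principal components.
    singular_values = np.linalg.svd(X_centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def min_components(ratios, threshold=0.9):
    """Smallest number of leading components whose cumulative share exceeds `threshold`."""
    cumulative = np.cumsum(ratios)
    return int(np.searchsorted(cumulative, threshold, side="right") + 1)


# Toy ratios: the cumulative shares are 0.5, 0.8, 0.95, 1.0,
# so three components are needed to pass 0.9.
toy_ratios = np.array([0.5, 0.3, 0.15, 0.05])
n_needed = min_components(toy_ratios, threshold=0.9)

# Synthetic data (hypothetical): 5 features with very different scales.
rng = np.random.default_rng(17)
X_demo = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 1.0, 0.5, 0.1])
ratios_demo = explained_variance_ratio(X_demo)
n_demo = min_components(ratios_demo, threshold=0.9)
```

On the real data, the equivalent quantity is `np.cumsum(pca.explained_variance_ratio_)` for a PCA fitted with all components.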

Question 1:
What is the minimum number of principal components required to cover 90% of the variance of the original (scaled) data?

In [ ]:
X_pca.shape

• 56
• 65 [+]
• 66
• 193

Question 2:
What percentage of the variance is covered by the first principal component? Round to the nearest percent.

• 45
• 51 [+]
• 56
• 61
In [10]:
round(float(pca.explained_variance_ratio_[0] * 100))
Out[10]:
51

Visualize data in projection on the first two principal components.

In [11]:
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=20, cmap='viridis');

Question 3:
If everything worked out correctly, you will see a number of clusters, almost perfectly separated from each other. What types of activity are included in these clusters?

• 1 cluster: all 6 activities
• 2 clusters: (walking, walking upstairs, walking downstairs) and (sitting, standing, laying) [+]
• 3 clusters: (walking), (walking upstairs, walking downstairs) and (sitting, standing, laying)
• 6 clusters

Perform clustering with the KMeans method, training the model on the data with reduced dimensionality (by PCA). Here we give the algorithm a hint to look for exactly 6 clusters, but in the general case we will not know how many clusters to look for.

Options:

• n_clusters = n_classes (number of unique labels of the target class)
• n_init = 100
• random_state = RANDOM_STATE (for reproducibility of the result)

Other parameters should have default values.

In [12]:
kmeans = KMeans(n_clusters=n_classes, n_init=100,
                random_state=RANDOM_STATE)
kmeans.fit(X_pca)
cluster_labels = kmeans.labels_

Visualize data in projection on the first two principal components. Color the dots according to the clusters obtained.

In [13]:
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, s=20, cmap='viridis');

Look at the correspondence between the cluster labels and the original class labels, and at which kinds of activities the KMeans algorithm confuses.

In [ ]:
tab = pd.crosstab(y, cluster_labels, margins=True)
tab.index = ['walking', 'going up the stairs',
             'going down the stairs', 'sitting', 'standing', 'laying', 'all']
tab.columns = ['cluster' + str(i + 1) for i in range(6)] + ['all']
tab

We see that for each class (i.e., each activity) there are several clusters. Let's look at the maximum percentage of objects in a class that are assigned to a single cluster. This will be a simple metric that characterizes how easily the class is separated from others when clustering.

Example: if for class "walking downstairs" (with 1406 instances belonging to it), the distribution of clusters is:

• cluster 1 - 900
• cluster 3 - 500
• cluster 6 - 6,

then such a share will be 900/1406 $\approx$ 0.64.
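
The metric can be sketched in a few lines of NumPy. The function name `max_cluster_share` and the synthetic label arrays are assumptions made for illustration; the counts reproduce the example above, with class label 3 standing for "walking downstairs" as in the label list earlier.

```python
import numpy as np


def max_cluster_share(y_true, cluster_ids):
    """For each class, the largest fraction of its objects that falls into one cluster."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    shares = {}
    for label in np.unique(y_true):
        # Count how many objects of this class land in each cluster.
        counts = np.bincount(cluster_ids[y_true == label])
        shares[int(label)] = float(counts.max() / counts.sum())
    return shares


# 1406 objects of one class split 900 / 500 / 6 across clusters 1, 3 and 6.
y_example = np.full(1406, 3)
clusters_example = np.repeat([1, 3, 6], [900, 500, 6])
share = max_cluster_share(y_example, clusters_example)[3]
print(round(share, 2))  # 0.64
```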

Question 4:
Which activity is separated from the rest better than others based on the simple metric described above?

• walking
• standing
• walking downstairs
• all three options are incorrect [+]
In [ ]:
pd.Series(tab.iloc[:-1, :-1].max(axis=1).values /
          tab.iloc[:-1, -1].values, index=tab.index[:-1])

It can be seen that KMeans does not distinguish the activities very well. Use the elbow method to select the optimal number of clusters. The algorithm parameters and the data are the same as before; only n_clusters changes.

In [16]:
inertia = []
for k in tqdm_notebook(range(1, n_classes + 1)):
    kmeans = KMeans(n_clusters=k, n_init=100,
                    random_state=RANDOM_STATE).fit(X_pca)
    inertia.append(np.sqrt(kmeans.inertia_))

In [17]:
plt.plot(range(1, 7), inertia, marker='s');
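
For reference, the `inertia_` attribute collected in the loop above is the sum of squared distances from each sample to its nearest cluster center. A minimal NumPy sketch on a hypothetical four-point dataset follows; the function name `inertia_of` and the hand-picked centers are assumptions for illustration.

```python
import numpy as np


def inertia_of(X, centers):
    """Sum of squared distances from each point to its closest center."""
    # Pairwise squared distances, shape (n_samples, n_centers).
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(sq_dist.min(axis=1).sum())


X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
one_center = X_toy.mean(axis=0, keepdims=True)      # k = 1: center at (5, 1)
two_centers = np.array([[0.0, 1.0], [10.0, 1.0]])   # k = 2: one center per pair

inertia_k1 = inertia_of(X_toy, one_center)   # 4 * (5**2 + 1**2) = 104.0
inertia_k2 = inertia_of(X_toy, two_centers)  # 4 * 1**2 = 4.0
```

The sharp drop between k = 1 and k = 2 is the kind of bend the elbow method looks for. Note that the notebook stores the square root of this quantity.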

We calculate $D(k)$, as described in this article in the section "Choosing the number of clusters for K-means".
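
Judging by the code in the next cell, the criterion has the following form, where $J(C_k)$ denotes the value stored in `inertia` for $k$ clusters (here, the square root of the KMeans inertia). The absolute values are an assumption added for generality; they are redundant when $J$ decreases monotonically in $k$.

$$D(k) = \frac{\left| J(C_k) - J(C_{k+1}) \right|}{\left| J(C_{k-1}) - J(C_k) \right|} \rightarrow \min_k$$

A small $D(k)$ means that adding one more cluster after $k$ reduces $J$ far less than the step from $k-1$ to $k$ did.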

In [19]:
d = {}
for k in range(2, 6):
    i = k - 1
    d[k] = (inertia[i] - inertia[i + 1]) / (inertia[i - 1] - inertia[i])
In [20]:
d
Out[20]:
{2: 0.17344753560094164,
3: 0.416886495398681,
4: 0.93321540944782688,
5: 0.62970401137157561}
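
Selecting the number of clusters then amounts to taking the key with the smallest $D(k)$. A short sketch using the values printed above:

```python
# D(k) values copied from the output above.
d = {2: 0.17344753560094164,
     3: 0.416886495398681,
     4: 0.93321540944782688,
     5: 0.62970401137157561}

# The elbow criterion picks the k with the smallest D(k).
best_k = min(d, key=d.get)
print(best_k)  # 2
```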

Question 5:
How many clusters can we choose according to the elbow method?

• 1
• 2 [+]
• 3
• 4

Let's try another clustering algorithm, described in the article – agglomerative clustering.

In [21]:
# Rest of this call was truncated in the source; default linkage and X_pca are assumed.
ag = AgglomerativeClustering(n_clusters=n_classes).fit(X_pca)

Calculate the Adjusted Rand Index (sklearn.metrics) for the resulting clustering and for KMeans with the parameters from the 4th question.

In [22]:
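# Calls reconstructed from the printed output; the arguments are assumed.
print('KMeans: ARI =', metrics.adjusted_rand_score(y, cluster_labels))
print('Agglomerative Clustering: ARI =',
      metrics.adjusted_rand_score(y, ag.labels_))
KMeans: ARI = 0.41980700126
Agglomerative Clustering: ARI = 0.49362763373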

Question 6:
Select all the correct statements.

• According to ARI, KMeans handled clustering worse than Agglomerative Clustering [+]
• For ARI, it does not matter which labels are assigned to the clusters, only the partitioning of instances into clusters matters [+]
• In case of random partitioning into clusters, ARI will be close to zero [+]

Comment:

1. Yes, the higher ARI, the better
2. Yes, if you renumber clusters differently, ARI will not change
3. True
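
Statements 2 and 3 can be checked numerically. The sketch below is a from-scratch NumPy version of the standard pair-counting ARI formula, written only for illustration; in the notebook itself the library function `sklearn.metrics.adjusted_rand_score` is the natural choice. The function names and the random labels are assumptions, and the degenerate case where both partitions are a single cluster is not handled.

```python
import numpy as np


def pairs(x):
    """Number of unordered pairs that can be formed from x items."""
    return x * (x - 1) / 2.0


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index computed from the contingency table of two labelings."""
    _, true_idx = np.unique(np.asarray(labels_true), return_inverse=True)
    _, pred_idx = np.unique(np.asarray(labels_pred), return_inverse=True)
    contingency = np.zeros((true_idx.max() + 1, pred_idx.max() + 1))
    np.add.at(contingency, (true_idx, pred_idx), 1)

    sum_cells = pairs(contingency).sum()
    sum_rows = pairs(contingency.sum(axis=1)).sum()
    sum_cols = pairs(contingency.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(len(true_idx))
    max_index = 0.5 * (sum_rows + sum_cols)
    return float((sum_cells - expected) / (max_index - expected))


rng = np.random.default_rng(17)
y_true = rng.integers(0, 6, size=6000)

# Statement 2: renumbering the clusters leaves ARI unchanged.
relabeled = (y_true + 1) % 6  # same partition, different cluster ids
ari_relabeled = adjusted_rand_index(y_true, relabeled)

# Statement 3: a random partition scores close to zero.
random_partition = rng.integers(0, 6, size=6000)
ari_random = adjusted_rand_index(y_true, random_partition)
```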

Notice that the task is not solved very well when we try to detect more than 2 clusters. Now let's solve the classification problem, given that the data is labeled.

For classification, use the support vector machine – class sklearn.svm.LinearSVC. In this course, we did not study this algorithm separately, but it is well known and you can read about it, for example, here.

Choose the C hyperparameter for LinearSVC using GridSearchCV.

• Fit a new StandardScaler on the training set (with all original features), then apply the same scaling to the test set
• In GridSearchCV, specify cv = 3.
In [23]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
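
Fitting the scaler on the training set only means that the test set is standardized with the training means and standard deviations, so no test-set statistics leak into the model. Below is a NumPy sketch of the same arithmetic on hypothetical random data; the variable names are illustrative, and the standard deviation uses `ddof=0`, which is the convention StandardScaler follows.

```python
import numpy as np

rng = np.random.default_rng(17)
X_tr = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # hypothetical training set
X_te = rng.normal(loc=5.0, scale=2.0, size=(40, 3))   # hypothetical test set

# "fit": the statistics come from the training set only.
mean_ = X_tr.mean(axis=0)
scale_ = X_tr.std(axis=0)  # ddof=0 (population standard deviation)

# "transform": the same statistics are applied to both sets.
X_tr_scaled = (X_tr - mean_) / scale_
X_te_scaled = (X_te - mean_) / scale_
```

The scaled training set has exactly zero mean and unit variance per feature; the scaled test set is only approximately standardized, which is expected.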
In [24]:
svc = LinearSVC(random_state=RANDOM_STATE)
svc_params = {'C': [0.001, 0.01, 0.1, 1, 10]}
In [25]:
%%time
best_svc = GridSearchCV(svc, svc_params, n_jobs=1, cv=3, verbose=1)
best_svc.fit(X_train_scaled, y_train);
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   40.5s finished
CPU times: user 44.5 s, sys: 372 ms, total: 44.9 s
Wall time: 44.7 s
In [26]:
best_svc.best_params_, best_svc.best_score_
Out[26]:
({'C': 0.1}, 0.93824809575625678)

Question 7:
Which value of the hyperparameter C was selected as the best by cross-validation?

• 0.001
• 0.01
• 0.1 [+]
• 1
• 10
In [27]:
y_predicted = best_svc.predict(X_test_scaled)
In [ ]:
tab = pd.crosstab(y_test, y_predicted, margins=True)
tab.index = ['walking', 'going up the stairs',
             'going down the stairs', 'sitting', 'standing', 'laying', 'all']
tab.columns = ['walking', 'going up the stairs',
               'going down the stairs', 'sitting', 'standing', 'laying', 'all']
tab

As you can see, the classification problem is solved quite well.

Question 8:
Which activity type does the SVM detect worst in terms of precision? In terms of recall?

• precision – going up the stairs, recall – laying
• precision – laying, recall – sitting
• precision – walking, recall – walking
• precision – standing, recall – sitting [+]

Comment: The classifier solved the problem well, but not ideally.
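
Per-class precision and recall can be read directly off a confusion table laid out like the one above (rows are true labels, columns are predictions): precision divides the diagonal by the column totals, recall by the row totals. The sketch uses a hypothetical 3-class matrix, not the assignment's actual counts.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
confusion = np.array([[50,  0,  0],
                      [ 2, 40,  8],
                      [ 0,  5, 45]])

true_positives = np.diag(confusion)
precision = true_positives / confusion.sum(axis=0)  # per predicted class
recall = true_positives / confusion.sum(axis=1)     # per true class

worst_precision = int(np.argmin(precision))  # class with the lowest precision
worst_recall = int(np.argmin(recall))        # class with the lowest recall
```

With the `tab` built above, the margins row and column ('all') would need to be dropped first, for example with `tab.iloc[:-1, :-1].values`.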

Finally, do the same thing as in Question 7, but add PCA.

• Use X_train_scaled and X_test_scaled
• Fit the same PCA as before (90% of the variance) on the scaled training set, then apply the transformation to the test set
• Choose the hyperparameter C via cross-validation on the training set with PCA-transformation. You will notice how much faster it works now.

Question 9:
What is the difference between the best cross-validation quality (accuracy) obtained with all 561 original features and the one obtained after applying principal component analysis? Round to the nearest percent.

Options:

• quality is the same
• 2%
• 4% [+]
• 10%
• 20%
In [29]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

pca = PCA(n_components=0.9, random_state=RANDOM_STATE)
X_train_pca = pca.fit_transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)
In [30]:
svc = LinearSVC(random_state=RANDOM_STATE)
svc_params = {'C': [0.001, 0.01, 0.1, 1, 10]}
In [31]:
%%time
best_svc_pca = GridSearchCV(svc, svc_params, n_jobs=1, cv=3, verbose=1)
best_svc_pca.fit(X_train_pca, y_train);
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[Parallel(n_jobs=1)]: Done  15 out of  15 | elapsed:   12.0s finished
CPU times: user 13.9 s, sys: 83.2 ms, total: 14 s
Wall time: 13.6 s
In [32]:
best_svc_pca.best_params_, best_svc_pca.best_score_
Out[32]:
({'C': 1}, 0.89880304678998912)

With PCA, the cross-validation accuracy is about 4 percentage points lower.

In [33]:
round(100 * (best_svc_pca.best_score_ - best_svc.best_score_))
Out[33]:
-4.0

Question 10:
Select all the correct statements: