#!/usr/bin/env python
# coding: utf-8
# ## [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course
#
# Authors: [Olga Daykhovskaya](https://www.linkedin.com/in/odaykhovskaya/), [Yury Kashnitskiy](https://yorko.github.io). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose.
# ## Assignment #7 (demo)
# **Question 1:**
#
# What is the minimum number of principal components required to cover 90% of the variance of the original (scaled) data?

# In[11]:

# Your code here

# **Answer options:**
#
# - 56
# - 65
# - 66
# - 193

# **Question 2:**
#
# What percentage of the variance is covered by the first principal component? Round to the nearest percent.
#
# **Answer options:**
#
# - 45
# - 51
# - 56
# - 61

# In[12]:

# Your code here

# Visualize the data projected onto the first two principal components.

# In[13]:

# Your code here
# plt.scatter(, , c=y, s=20, cmap='viridis');

# **Question 3:**
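# One way to compute the PCA quantities asked about above is sketched below. This is a hedged illustration only: synthetic `make_blobs` data stands in for the scaled Samsung dataset, and all variable names (`X_scaled`, `n_components_90`, `first_pc_pct`) are assumptions, not the notebook's actual variables.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled Samsung features (illustrative only).
X, y = make_blobs(n_samples=300, n_features=20, centers=6, random_state=17)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components, then count how many cover 90% of the variance.
pca = PCA(random_state=17).fit(X_scaled)
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_90 = int(np.argmax(cum_var >= 0.90)) + 1

# Share of variance explained by the first principal component, in percent.
first_pc_pct = round(100 * pca.explained_variance_ratio_[0])

# Projection onto the first two components for the scatter-plot template above:
X_pca = pca.transform(X_scaled)[:, :2]
# plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=20, cmap='viridis');
```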
#
# If everything worked correctly, you will see several clusters that are almost perfectly separated from each other. Which types of activity make up these clusters?
#
# **Answer options:**
#
# - 1 cluster: all 6 activities
# - 2 clusters: (walking, walking upstairs, walking downstairs) and (sitting, standing, laying)
# - 3 clusters: (walking), (walking upstairs, walking downstairs) and (sitting, standing, laying)
# - 6 clusters

# ------------------------------

# Perform clustering with the `KMeans` method, training the model on the data with reduced dimensionality (via PCA). Here we give a hint to look for exactly 6 clusters; in the general case we will not know how many clusters to look for.
#
# Options:
#
# - **n_clusters** = n_classes (number of unique labels of the target class)
# - **n_init** = 100
# - **random_state** = RANDOM_STATE (for reproducibility of the result)
#
# Other parameters should have default values.

# In[14]:

# Your code here

# Visualize the data projected onto the first two principal components. Color the dots according to the clusters obtained.

# In[15]:

# Your code here
# plt.scatter(, , c=cluster_labels, s=20, cmap='viridis');

# Look at the correspondence between the cluster labels and the original class labels, and see which activities the `KMeans` algorithm confuses.

# In[16]:

# tab = pd.crosstab(y, cluster_labels, margins=True)
# tab.index = ['walking', 'going up the stairs',
#              'going down the stairs', 'sitting', 'standing', 'laying', 'all']
# tab.columns = ['cluster' + str(i + 1) for i in range(6)] + ['all']
# tab

# We see that for each class (i.e., each activity) there are several clusters. Let's look at the maximum percentage of objects in a class that are assigned to a single cluster. This is a simple metric that characterizes how easily a class is separated from the others when clustering.
#
# Example: if for the class "walking downstairs" (with 1406 instances belonging to it) the distribution of clusters is
#
# - cluster 1: 900
# - cluster 3: 500
# - cluster 6: 6,
#
# then this share is $900/1406 \approx 0.64$.

# **Question 4:**
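# A hedged sketch of the `KMeans` step and the max-share metric described above, on synthetic stand-in data; `RANDOM_STATE = 17` mirrors the notebook's constant but is an assumption here, as are all variable names.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

RANDOM_STATE = 17  # assumption: the notebook defines this constant earlier

# Synthetic stand-in for the PCA-reduced Samsung data.
X, y = make_blobs(n_samples=600, n_features=10, centers=6,
                  random_state=RANDOM_STATE)
X_pca = PCA(n_components=2, random_state=RANDOM_STATE).fit_transform(X)

n_classes = np.unique(y).size
cluster_labels = KMeans(n_clusters=n_classes, n_init=100,
                        random_state=RANDOM_STATE).fit_predict(X_pca)

# Cross-tabulate true labels vs. cluster labels (no margins for the metric).
tab = pd.crosstab(y, cluster_labels)

# For each class: the largest share of its objects falling into one cluster.
max_share = tab.max(axis=1) / tab.sum(axis=1)
```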
#
# Which activity is best separated from the rest, according to the simple metric described above?
#
# **Answer options:**
#
# - walking
# - standing
# - walking downstairs
# - all three options are incorrect

# We can see that `KMeans` does not distinguish the activities very well. Use the elbow method to select the optimal number of clusters. The parameters of the algorithm and the data are the same as before; we change only `n_clusters`.

# In[18]:

# # Your code here
# inertia = []
# for k in tqdm_notebook(range(1, n_classes + 1)):
#     pass

# **Question 5:**
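# One possible sketch of the elbow computation, on synthetic data; the notebook's loop uses `tqdm_notebook` and its own `n_classes`, so the names and ranges below are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

RANDOM_STATE = 17  # assumption, mirroring the notebook's constant
X, _ = make_blobs(n_samples=500, centers=3, random_state=RANDOM_STATE)

# Inertia (within-cluster sum of squared distances) for k = 1..6 clusters.
inertia = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=RANDOM_STATE).fit(X)
    inertia.append(km.inertia_)

# The "elbow" is the k after which inertia stops dropping sharply;
# plotting inertia against k makes the bend visible.
```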
# How many clusters can we choose according to the elbow method?
#
# **Answer options:**
#
# - 1
# - 2
# - 3
# - 4

# ------------------------

# Let's try another clustering algorithm described in the course article: agglomerative clustering.

# In[19]:

# ag = AgglomerativeClustering(n_clusters=n_classes,
#                              linkage='ward').fit(X_pca)

# Calculate the Adjusted Rand Index (`sklearn.metrics`) for the resulting clustering and for `KMeans` with the parameters from Question 4.

# In[20]:

# Your code here

# **Question 6:**
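# A hedged sketch of agglomerative clustering and the ARI comparison, on synthetic data; all names are illustrative, not the notebook's variables.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

RANDOM_STATE = 17  # assumption, mirroring the notebook's constant
X, y = make_blobs(n_samples=400, centers=4, random_state=RANDOM_STATE)
n_classes = 4

ag = AgglomerativeClustering(n_clusters=n_classes, linkage='ward').fit(X)
km = KMeans(n_clusters=n_classes, n_init=100,
            random_state=RANDOM_STATE).fit(X)

# ARI compares each partition against the true labels; the cluster ids
# themselves do not matter, only how instances are grouped together.
ari_ag = adjusted_rand_score(y, ag.labels_)
ari_km = adjusted_rand_score(y, km.labels_)
```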
# Select all the correct statements.
#
# **Answer options:**
#
# - According to ARI, KMeans handled the clustering worse than Agglomerative Clustering
# - For ARI, it does not matter which labels are assigned to the clusters; only the partitioning of instances into clusters matters
# - In case of random partitioning into clusters, ARI will be close to zero

# -------------------------------

# You may have noticed that the task is not solved very well when we try to detect several clusters (> 2). Now let's solve the classification problem, given that the data is labeled.
#
# For classification, use the support vector machine – class `sklearn.svm.LinearSVC`. In this course, we did not study this algorithm separately, but it is well known and you can read about it, for example, [here](http://cs231n.github.io/linear-classify/#svmvssoftmax).
#
# Choose the `C` hyperparameter for `LinearSVC` using `GridSearchCV`.
#
# - Train a new `StandardScaler` on the training set (with all original features), then apply the scaling to the test set
# - In `GridSearchCV`, specify `cv=3`

# In[21]:

# # Your code here
# scaler = StandardScaler()
# X_train_scaled =
# X_test_scaled =

# In[22]:

svc = LinearSVC(random_state=RANDOM_STATE)
svc_params = {'C': [0.001, 0.01, 0.1, 1, 10]}

# In[23]:

# %%time
# # Your code here
# best_svc = None

# In[24]:

# best_svc.best_params_, best_svc.best_score_

# **Question 7:**
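# A hedged sketch of the scaling and grid-search steps, on synthetic `make_classification` data; in the notebook this is applied to the Samsung train/test split, and `RANDOM_STATE = 17` is an assumption here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

RANDOM_STATE = 17  # assumption, mirroring the notebook's constant
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=3, random_state=RANDOM_STATE)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE)

# Fit the scaler on the training set only, then apply it to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svc = LinearSVC(random_state=RANDOM_STATE)
svc_params = {'C': [0.001, 0.01, 0.1, 1, 10]}
best_svc = GridSearchCV(svc, svc_params, cv=3)
best_svc.fit(X_train_scaled, y_train)
```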
# Which value of the hyperparameter `C` was chosen as the best by cross-validation?
#
# **Answer options:**
#
# - 0.001
# - 0.01
# - 0.1
# - 1
# - 10

# In[26]:

# y_predicted = best_svc.predict(X_test_scaled)

# In[27]:

# tab = pd.crosstab(y_test, y_predicted, margins=True)
# tab.index = ['walking', 'climbing up the stairs',
#              'going down the stairs', 'sitting', 'standing', 'laying', 'all']
# tab.columns = ['walking', 'climbing up the stairs',
#                'going down the stairs', 'sitting', 'standing', 'laying', 'all']
# tab

# **Question 8:**
# Which activity type does SVM detect worst in terms of precision? In terms of recall?
#
# **Answer options:**
#
# - precision – going up the stairs, recall – laying
# - precision – laying, recall – sitting
# - precision – walking, recall – walking
# - precision – standing, recall – sitting

# Finally, do the same thing as in Question 7, but add PCA.
#
# - Use `X_train_scaled` and `X_test_scaled`
# - Train the same PCA as before on the scaled training set, then apply the transformation to the test set
# - Choose the hyperparameter `C` via cross-validation on the training set with the PCA transformation. You will notice how much faster it works now.

# **Question 9:**
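# A hedged sketch of the PCA + `LinearSVC` pipeline on synthetic data; the notebook reuses the PCA fitted earlier, while `n_components=0.9` below is just one way to keep 90% of the variance, and all names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

RANDOM_STATE = 17  # assumption, mirroring the notebook's constant
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           random_state=RANDOM_STATE)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=RANDOM_STATE)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# PCA is fitted on the scaled training set only, then applied to the test set;
# a float n_components keeps enough components for that variance share.
pca = PCA(n_components=0.9, random_state=RANDOM_STATE).fit(X_train_scaled)
X_train_pca = pca.transform(X_train_scaled)
X_test_pca = pca.transform(X_test_scaled)

grid_pca = GridSearchCV(LinearSVC(random_state=RANDOM_STATE),
                        {'C': [0.001, 0.01, 0.1, 1, 10]}, cv=3)
grid_pca.fit(X_train_pca, y_train)
# Compare grid_pca.best_score_ with the best score on all original features.
```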
# What is the difference between the best quality (accuracy) on cross-validation with all 561 initial features and in the second case, when PCA was applied? Round to the nearest percent.
#
# **Answer options:**
#
# - quality is the same
# - 2%
# - 4%
# - 10%
# - 20%

# In[28]:

# Your code here

# **Question 10:**
# Select all the correct statements:
#
# **Answer options:**
#
# - Principal component analysis in this case allowed us to reduce the model training time, while the quality (mean cross-validation accuracy) suffered greatly, by more than 10%
# - PCA can be used to visualize the data, but there are better methods for this task, for example, t-SNE. However, PCA has lower computational complexity
# - PCA builds linear combinations of the initial features, and in some applications these may be poorly interpretable by humans