mlcourse.ai – Open Machine Learning Course

Authors: Vadim Shestopalov (@vchulski), Valentina Biryukova (@myltykritik), and Yury Kashnitsky (@yorko). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Fall 2019. Quiz 3. Unsupervised learning & time series

Prior to working on this quiz, you should check out the corresponding course material.

Also, check out the corresponding mlcourse.ai video lectures.

Your task is to:

  1. study the materials
  2. write code where needed
  3. choose answers in the webform

Solutions will be discussed during a live YouTube session on November 16. You can get up to 10 credits (the points you score in the web form, 15 max, will be scaled to a maximum of 10 credits).

Deadline for Quiz 3: November 15, 2019, 20:59 GMT (London time)

Part 1. Unsupervised learning

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz3_part1_fall2019. TA for this part is Yury @yorko.

Question 1. Using the face recognition dataset downloaded with the code below, choose the number of PCA components (n_components) that achieves the best accuracy on a holdout set (the holdout part should be 30% of the training set), using an SVM model for classification (SVC) with the following parameters: gamma=0.01 and class_weight='balanced'.
Note: use a random seed equal to 17 everywhere (train_test_split, PCA and SVC). Also, specify whiten=True for PCA: this normalizes the variances of the PCA components and positively affects classification, just like using StandardScaler.


What number of PCA components maximizes holdout accuracy of the SVM model?

  1. 50
  2. 100
  3. 150
  4. 200
In [1]:
import numpy as np
import pandas as pd
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.datasets import fetch_lfw_people
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from matplotlib import pyplot as plt
In [2]:
# Download the data and load it as numpy arrays
lfw_people = fetch_lfw_people(data_home='../../data/faces/',
                              min_faces_per_person=70, resize=0.4)

X = lfw_people.data
n_features = X.shape[1]

# the label to predict is the id of the person
y = lfw_people.target
target_names = lfw_people.target_names
n_classes = target_names.shape[0]

print("Total dataset size:")
print("n_features: %d" % n_features)
print("n_classes: %d" % n_classes)
Total dataset size:
n_features: 1850
n_classes: 7
In [3]:
for i, count in enumerate(np.bincount(y)):
    print(f'{count} photos of {target_names[i]}' )
77 photos of Ariel Sharon
236 photos of Colin Powell
121 photos of Donald Rumsfeld
530 photos of George W Bush
109 photos of Gerhard Schroeder
71 photos of Hugo Chavez
144 photos of Tony Blair
In [4]:
fig = plt.figure(figsize=(8, 6))

for i in range(15):
    ax = fig.add_subplot(3, 5, i + 1, xticks=[], yticks=[])
    ax.imshow(lfw_people.images[i], cmap='gray')
In [5]:
# Your code here
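# A minimal sketch (not the official solution), assuming X and y from the cells
# above. For each candidate number of components, fit PCA with whiten=True on
# the training part, train SVC with the parameters from the question, and
# compare accuracy on the 30% holdout.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.3, random_state=17)

for n_components in [50, 100, 150, 200]:
    pca = PCA(n_components=n_components, whiten=True, random_state=17)
    X_train_pca = pca.fit_transform(X_train)
    X_holdout_pca = pca.transform(X_holdout)

    svc = SVC(gamma=0.01, class_weight='balanced', random_state=17)
    svc.fit(X_train_pca, y_train)
    acc = accuracy_score(y_holdout, svc.predict(X_holdout_pca))
    print(f'n_components={n_components}: holdout accuracy = {acc:.3f}')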

Question 2. Choose the correct option. In K-means algorithm at each iteration:

  1. The centroid of each cluster is moved in a random direction to increase the robustness of the solution
  2. Each instance is assigned to the closest centroid
  3. K is increased by 1
  4. All of the above

Question 3. Select all correct statements about agglomerative clustering

  1. At each step, two random instances are merged to form a cluster
  2. The algorithm terminates when all instances are merged into one cluster
  3. The total number of iterations of the algorithm is $n$, where $n$ is the number of instances in the data set
  4. The output of the algorithm depends on how the distance between clusters is defined, i.e. the linkage

Question 4. For which of the following clustering algorithms shall one specify the number of clusters beforehand?

  1. Agglomerative clustering
  2. K-means
  3. Affinity Propagation
  4. All of the above

Question 5. Which of the following metrics, assessing clustering quality, can be calculated without knowing true cluster labels?

  1. Adjusted Mutual Information (AMI)
  2. Silhouette
  3. Completeness
  4. None of the above

Part 2. Time series

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz3_part2_fall2019. TA for this part is Valentina @myltykritik.

Question 6. Which of the following are examples of time series? Select all correct options.

  1. Daily temperature in Moscow for 20 years
  2. Texts of news articles from the Times website
  3. Pigeon population in different Russian cities in 2019
  4. GPS-coordinates of someone's trajectory

Question 7. Which of these are possible components of a time series?

  1. Trend
  2. Seasonality
  3. Noise
  4. Cyclical
  5. All of the above

Question 8. Sales of some product were 200 in July, 600 in August, 500 in September, and 100 in October. What is the 3-month moving average forecast for November? (A small pandas illustration of a moving-average forecast is given after the options.)

  1. 200
  2. 300
  3. 400
  4. 500
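For reference, a k-month moving average forecast is simply the mean of the last k observations. Below is a minimal pandas sketch with made-up numbers (the values and dates are purely illustrative and are not the ones from the question).

import pandas as pd

# Toy monthly sales series (illustrative values only)
sales = pd.Series([10, 20, 30, 40],
                  index=pd.period_range('2019-07', periods=4, freq='M'))

# 3-month moving average forecast for the next month = mean of the last 3 months
forecast_next = sales.rolling(window=3).mean().iloc[-1]
print(forecast_next)  # (20 + 30 + 40) / 3 = 30.0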

Question 9. You have some stock (S&P) data from here for five years up to February 2018 and want to build a prediction model for Facebook stock. You need the FB_data.csv file; it's committed to the course repo as well.

  1. Split the data into training and test sets. Everything before September 1, 2017 (pd.datetime(2017, 9, 1)) forms the training set; the rest is the test set.
  2. Train the Prophet() model with default parameters
  3. Measure MAPE (mean absolute percentage error) on the test set

What test set MAPE do you get (approx.)?

  1. 2.5%
  2. 3.5%
  3. 4.5%
  4. 5.5%
In [6]:
from fbprophet import Prophet
In [7]:
!pip list | grep prophet
fbprophet                          0.4.post2     
In [8]:
df = pd.read_csv('../../data/FB_data.csv')
df = df[['date', 'close']].reset_index(drop=True)
df = df.rename({'close':'y', 'date':'ds'}, axis='columns')
df['ds'] = pd.to_datetime(df['ds'])
In [9]:
# Your code here
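# A minimal sketch (not the official solution), assuming the df prepared in the
# cell above. Split by the date from the question, fit Prophet with default
# parameters, and compute MAPE on the test set.
split_date = pd.Timestamp(2017, 9, 1)
train_df = df[df['ds'] < split_date].sort_values('ds').reset_index(drop=True)
test_df = df[df['ds'] >= split_date].sort_values('ds').reset_index(drop=True)

m = Prophet()
m.fit(train_df)

# Predict directly on the test dates so forecasts and actuals stay aligned
forecast = m.predict(test_df[['ds']])

mape = np.mean(np.abs((test_df['y'].values - forecast['yhat'].values)
                      / test_df['y'].values)) * 100
print(f'Test MAPE: {mape:.1f}%')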

Question 10. What steps should we perform when doing cross-validation for time series? Select all correct answers.

  1. Sort your data by time to emphasize the time pattern
  2. No way! Shuffle all the data well, so the model will not pick up on random patterns!
  3. Make several folds, so that all data from the initial series ends up in validation at some point
  4. No way! Use the method known as "cross-validation on a rolling basis" (illustrated in the sketch below).
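For reference, "cross-validation on a rolling basis" is commonly implemented with scikit-learn's TimeSeriesSplit: each fold trains on an initial segment of the series and validates on the chunk that immediately follows it. A minimal sketch on a toy series (the series and the number of splits are arbitrary):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

series = np.arange(30)  # a toy time-ordered series
tscv = TimeSeriesSplit(n_splits=5)

for fold, (train_idx, val_idx) in enumerate(tscv.split(series)):
    print(f'Fold {fold}: train indices 0..{train_idx.max()}, '
          f'validation indices {val_idx.min()}..{val_idx.max()}')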