mlcourse.ai – Open Machine Learning Course

Authors: Yury Kashnitsky (@yorko), and Aidar Siraev (@cyber). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Fall 2019. Quiz 2. Linear Models

Whether math is needed in Data Science is a controversial question. Honestly, there's a ~95% chance that you'll never need to derive a math formula while predicting churn, credit default, or customer lifetime value, detecting financial fraud, recommending TV shows or, in general, improving your company's business via insights from data. Personally, I've needed to take a derivative only once in a real project.

Still, we keep this theoretical part: it will help you understand the basic machinery of linear models and the math that lives under the hood. At the very least, it is crucial for understanding neural nets and, in particular, backpropagation, arguably the most prominent ML algorithm.

Prior to working on the assignment, you'd better check out the corresponding course material:

  1. Classification, Decision Trees and k Nearest Neighbors, also available as an interactive web-based Kaggle Kernel (the basics of machine learning are covered here)
  2. Linear classification and regression in 5 parts

  3. If that's still not enough, watch the two videos on logistic regression: mlcourse.ai/lectures

Your task is to:

  1. study the materials
  2. write code where needed
  3. choose answers in the webform

Deadline for Quiz 2: 2019 October 25, 20:59 GMT+1 (London time)

Solutions will be discussed during a live YouTube session on October 26. You can get up to 12 credits (the points you get in the web form, 18 max, will be scaled to a max of 12 credits).

1. Select the correct statements about ordinary least squares (OLS):

  1. OLS is a method for estimating the unknown parameters in a linear regression model.
  2. OLS works as follows: minimizing the sum of the differences between the observed dependent variable in the given dataset and the square of the product of the predicted variable and its weight.
  3. OLS provides an estimator with the lowest variance among all linear and unbiased estimators when errors are homoscedastic, uncorrelated and have zero expectation.
  4. OLS works as follows: minimizing the sum of the absolute value of the differences between the observed dependent variable in the given dataset and the product of the predicted variable and its weight.
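
If you'd like to poke at the definition yourself, here is a minimal NumPy sketch (not part of the grading; the toy data and variable names are made up) that fits a tiny 1D regression and checks that the least-squares solution indeed minimizes the sum of squared residuals:

```python
# Toy illustration of the OLS objective; data and names are illustrative only.
import numpy as np

rng = np.random.RandomState(17)
x = np.linspace(0, 1, 50)
y = 2 * x + 1 + 0.1 * rng.randn(50)

X = np.column_stack([np.ones_like(x), x])  # design matrix with an intercept column

# Least-squares fit
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

def sum_of_squared_residuals(w):
    return np.sum((y - X @ w) ** 2)

# Perturbing the fitted weights can only increase the sum of squared residuals
print(sum_of_squared_residuals(w_ols))
print(sum_of_squared_residuals(w_ols + np.array([0.1, -0.1])))
```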

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q1-5

2. Why wouldn't you always use the closed-form solution for OLS, $\textbf{w}=(\textbf{X}^\textbf{T}\textbf{X})^{-1}\textbf{X}^\textbf{T}\textbf{y}$ (see Topic 4, part 1 for the problem statement and notation)?

  1. it's computationally inefficient as compared to numeric optimization methods
  2. it's computationally unstable in case of multicollinearity, i.e. when determinant of $\textbf{X}^\textbf{T}\textbf{X}$ is close to zero
  3. numeric optimization methods typically yield a better solution in terms of mean squared error
  4. multiplying two square matrices is an operation with $O(n^4)$ complexity, where $n$ is the number of rows/columns
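
To get a feel for the numerical side of this question, here is a hedged sketch (synthetic data, illustrative names only) that computes the closed-form solution next to a least-squares solver on nearly collinear features:

```python
# Closed-form OLS vs. a least-squares solver on nearly collinear features;
# the data is synthetic and only meant for illustration.
import numpy as np

rng = np.random.RandomState(17)
n = 200
x1 = rng.randn(n)
x2 = x1 + 1e-7 * rng.randn(n)            # almost an exact copy of x1
X = np.column_stack([np.ones(n), x1, x2])
y = 3 + x1 + rng.randn(n)

# Closed-form solution w = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Solver-based solution
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print("det(X^T X):", np.linalg.det(X.T @ X))   # close to zero here
print("closed form:", w_closed)
print("lstsq:      ", w_lstsq)
```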

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q1-5

3. Select a correct statement about maximum likelihood estimation:

  1. OLS is the maximum likelihood estimator for any linear machine learning model.
  2. OLS is the maximum likelihood estimator, only if the errors are normally distributed.
  3. Maximum likelihood method applies only to samples with normal distribution.
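
If you want to experiment rather than just recall the article, here is a small sketch (synthetic data; the optimization setup is mine, not from the course materials) that fits a linear model by maximizing a Gaussian log-likelihood with scipy and compares the result with a plain least-squares fit; try changing the noise to see what happens:

```python
# Gaussian maximum likelihood fit vs. OLS on the same synthetic data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.RandomState(17)
x = rng.randn(100)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + 0.5 * rng.randn(100)

def neg_log_likelihood(params):
    w, log_sigma = params[:2], params[2]
    sigma = np.exp(log_sigma)              # keep sigma positive
    resid = y - X @ w
    return 0.5 * len(y) * np.log(2 * np.pi * sigma ** 2) + np.sum(resid ** 2) / (2 * sigma ** 2)

w_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x[:2]
w_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("MLE weights:", w_mle)
print("OLS weights:", w_ols)
```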

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q1-5

Suppose Lerbon Jasem is relaxing, wandering around a park. He decides to play a lottery where one needs to score 9 times out of 10 basketball penalty shots to win a bunny. Lerbon manages to do that on the 3rd attempt: first he scored 6, then 8, and finally 9. Suppose the number of successes is governed by the Binomial distribution $Bin(n, \theta) = {n \choose k} \theta^k (1-\theta)^{n-k}$, i.e. there are $n=10$ trials (shots) and $\theta$ is an estimator for Lerbon's unknown true probability of success (scoring in a single shot) $p$. Moreover, we assume that scoring in each and every shot is independent of the other shots (which probably didn't hold for Lerbon when he got irritated in the first shoot-out series, but still).

What is the partial derivative of the log-likelihood function with respect to $\theta$? What is the value of the Maximum Likelihood Estimate of $p$, i.e. $\theta_{MLE}$?

  1. $\large \frac{23}{\theta} - \frac{7}{1-\theta}$, 0.7(6)
  2. $\large \frac{23}{\theta} + \frac{7}{1-\theta}$, 1.4375
  3. $\large \frac{23}{\theta} - \frac{7}{1-\theta}$, 0.8
  4. $\large \frac{23}{\theta} + \frac{7}{1-\theta}$, 0.7(6)
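
Once you've derived the answer on paper, you can sanity-check it numerically; here is a minimal sketch (variable names are mine) that evaluates the binomial log-likelihood on a grid of $\theta$ for the counts from the story:

```python
# Numerical sanity check for the binomial MLE; names are illustrative.
import numpy as np

successes = 6 + 8 + 9      # shots scored over the three series
trials = 3 * 10            # total attempts

theta = np.linspace(0.01, 0.99, 981)
# log C(n, k) does not depend on theta, so it is dropped here
log_likelihood = successes * np.log(theta) + (trials - successes) * np.log(1 - theta)

print("theta maximizing the log-likelihood:", theta[np.argmax(log_likelihood)])
```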

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q1-5

5. Why should one regularize a linear regression model?

  1. To avoid overfitting
  2. To reduce variance of model predictions
  3. To generalize better to future data
  4. To tackle multicollinearity
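
As a quick experiment (synthetic data, illustrative only), here is a sketch contrasting plain linear regression with a ridge-regularized one on two nearly identical features; look at how the coefficients behave:

```python
# Plain OLS vs. Ridge on nearly collinear features; synthetic illustration.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(17)
x1 = rng.randn(300)
x2 = x1 + 0.001 * rng.randn(300)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = x1 + rng.randn(300)

print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
```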

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q1-5

6. Which of the following evaluation metrics should NOT be applied when comparing logistic regression output with the target?

  1. Accuracy score
  2. Logloss
  3. Mean Squared Error
  4. ROC AUC
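
If it helps, here is a tiny sketch with made-up labels and predicted probabilities showing how each of the listed metrics is typically computed with sklearn.metrics; pay attention to which ones expect probabilities and which expect hard labels:

```python
# Computing the listed metrics on made-up data with sklearn.metrics.
import numpy as np
from sklearn.metrics import accuracy_score, log_loss, mean_squared_error, roc_auc_score

y_true = np.array([0, 0, 1, 1, 1])
proba = np.array([0.1, 0.4, 0.35, 0.8, 0.9])   # predicted P(y=1|X)
labels = (proba >= 0.5).astype(int)            # hard 0/1 predictions

print("accuracy:", accuracy_score(y_true, labels))
print("logloss: ", log_loss(y_true, proba))
print("MSE:     ", mean_squared_error(y_true, proba))
print("ROC AUC: ", roc_auc_score(y_true, proba))
```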

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q6-10

7. Which of the following approaches do we use to fit logistic regression parameters to the data at hand?

  1. Least Square Error
  2. Maximum Likelihood Estimation
  3. The margin of classification
  4. The Jaccard index

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q6-10

8. Let's analyze the logarithmic loss function as derived in the article in terms of margin:

$$\large L(M) = \ln(1 + e^{-M})$$

_(here we have the natural logarithm, but the base is not that important due to $\log_a b = \frac{\ln b}{\ln a}$, i.e. some base $a$ different from $e$ will only result in a constant multiplier $\frac{1}{\ln a}$ which doesn't change the analysis)_

Select all correct statements:

  1. the model is penalized (meaning that the loss is positive) even in case when a training instance is correctly classified
  2. logarithmic loss is a strictly decreasing function of margin $M$
  3. derivative $\frac{dL}{dM}$ can be interpreted in terms of margin $M$: $\frac{dL}{dM} = -\sigma(-M)$
  4. second derivative $\frac{d^2L}{dM^2}$ is strictly positive, meaning that $L$ is convex
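
A short sketch (helper names are mine) for exploring $L(M)$ and checking your reasoning about its derivative numerically:

```python
# Explore L(M) = ln(1 + exp(-M)) and its numeric derivative.
import numpy as np

def loss(m):
    return np.log(1 + np.exp(-m))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m = np.linspace(-5, 5, 11)
numeric_grad = (loss(m + 1e-6) - loss(m - 1e-6)) / 2e-6   # central differences

print("loss at a confidently correct prediction:", loss(3.0))
print("numeric dL/dM:", numeric_grad)
print("-sigmoid(-M): ", -sigmoid(-m))
```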

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q6-10

9. Suppose you train a logistic regression classifier:

$$\large P\left(y=1 \mid X\right)=\sigma\left(w_0 + w_1 x_1 + w_2 x_2\right)$$

$$\large a(X) = \text{sign}\left(P\left(y=1 \mid X\right) - 0.5\right),$$

meaning that the predicted probability $P\left(y=1 \mid X\right)$ is compared with 0.5, and depending on that either +1 or -1 is returned.

Also we know that $w_0 = 3, w_1 = 0, w_2 = -1.$

Which of the following figures represents the decision boundary of the given classifier?

  1. A
  2. B
  3. C
  4. D
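
If you want to draw the boundary yourself, here is a small matplotlib sketch (the plotting setup is illustrative) for a decision boundary of the form $w_0 + w_1 x_1 + w_2 x_2 = 0$:

```python
# Visualizing the decision boundary w0 + w1*x1 + w2*x2 = 0.
import numpy as np
import matplotlib.pyplot as plt

w0, w1, w2 = 3, 0, -1          # weights given in the question

xx1, xx2 = np.meshgrid(np.linspace(-5, 5, 200), np.linspace(-5, 5, 200))
z = w0 + w1 * xx1 + w2 * xx2

plt.contourf(xx1, xx2, (z > 0).astype(int), alpha=0.3)     # predicted class regions
plt.contour(xx1, xx2, z, levels=[0], colors="k")           # the boundary itself
plt.xlabel("$x_1$")
plt.ylabel("$x_2$")
plt.show()
```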

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q6-10

10. In this question, we'll be working with Sklearn's bias-variance decomposition example. There they compare a decision tree regressor and bagging over the same trees. In our case, we'll compare 4 decision trees with maximal depths of 1, 2, 5, and 10 (in all cases random_state should be set to 17). So you need to take the same code from the example and change the estimators variable, as sketched below.

Your task is to:

  • read about bias-variance decomposition in the mlcourse.ai article
  • understand the code from the mentioned sklearn example
  • understand what you changed in the code and how it affected the figures built at the end of the example
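
Here is a sketch of the only change needed, assuming the example keeps its `estimators` list of (name, model) pairs; everything else in the example's code stays as is:

```python
# Replace the example's `estimators` list with four trees of different depths.
from sklearn.tree import DecisionTreeRegressor

estimators = [
    ("Tree (max_depth=1)",  DecisionTreeRegressor(max_depth=1,  random_state=17)),
    ("Tree (max_depth=2)",  DecisionTreeRegressor(max_depth=2,  random_state=17)),
    ("Tree (max_depth=5)",  DecisionTreeRegressor(max_depth=5,  random_state=17)),
    ("Tree (max_depth=10)", DecisionTreeRegressor(max_depth=10, random_state=17)),
]
```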

When you're done with that, choose all correct statements:

  1. Variance always increases with increased max_depth
  2. Minimal MSE is achieved when max_depth is set to 5
  3. Minimal MSE is achieved when bias is also minimal
  4. Minimal MSE is achieved when variance is also minimal
  5. The tree with max_depth=10 is overfitted

For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q6-10