Authors: Yury Kashnitsky (@yorko), and Aidar Siraev (@cyber). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Whether math is needed in Data Science is a controversial question. Honestly, there's a ~95% chance that you'll never need to derive a math formula while predicting churn, credit default, or customer lifetime value, looking for financial fraud, recommending TV shows, or, in general, improving your company's business via insights from data. I, personally, needed to take a derivative only once in a real project.

But still, we keep this theoretical part: it will help you understand the basic machinery of linear models and the math that lives under the hood. At the very least, this will be crucial for understanding neural nets and, in particular, backpropagation, arguably the most prominent ML algorithm.

Prior to working on the assignment, you'd better check out the corresponding course material:

- Classification, Decision Trees and k Nearest Neighbors, also available as an interactive web-based Kaggle Kernel (basics of machine learning are covered here)
- Linear classification and regression, in 5 parts

If that's still not enough, watch two videos on logistic regression: mlcourse.ai/lectures

Your task is to:

- study the materials
- write code where needed
- choose answers in the webform

Solutions are discussed during a live YouTube session on October 26. You can get up to 12 credits (the points in the web form, 18 max, will be scaled to a max of 12 credits).

**1. Select the correct statements about ordinary least squares (OLS):**

- OLS is a method for estimating the unknown parameters in a linear regression model.
- OLS works as follows: minimizing the sum of the differences between the observed dependent variable in the given dataset and the square of the product of the predicted variable and its weight.
- OLS provides an estimator with the lowest variance among all linear and unbiased estimators when errors are homoscedastic, uncorrelated and have zero expectation.
- OLS works as follows: minimizing the sum of the absolute value of the differences between the observed dependent variable in the given dataset and the product of the predicted variable and its weight.

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q1-5*

**2. Why wouldn't you always use the closed-form solution for OLS: $\textbf{w}=(\textbf{X}^\textbf{T}\textbf{X})^{-1}\textbf{X}^\textbf{T}\textbf{y}$ (see Topic 4, part 1 for problem statement and designations)?**

- it's computationally inefficient as compared to numeric optimization methods
- it's computationally unstable in case of multicollinearity, i.e. when determinant of $\textbf{X}^\textbf{T}\textbf{X}$ is close to zero
- numeric optimization methods typically yield a better solution in terms of mean squared error
- multiplying two square matrices is an operation with $O(n^4)$ complexity, where $n$ is the number of rows/columns
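As a quick numerical illustration (a minimal sketch with synthetic data; not part of the assignment), the closed-form solution agrees with numpy's least-squares solver on well-conditioned data, while `np.linalg.lstsq` avoids computing the explicit inverse of $\textbf{X}^\textbf{T}\textbf{X}$:

```python
import numpy as np

rng = np.random.default_rng(17)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])  # design matrix with an intercept column
true_w = np.array([1.0, 2.0, -0.5])
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Closed-form OLS: w = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer alternative that avoids explicitly inverting X^T X
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

With near-collinear columns, `np.linalg.inv` becomes unstable (or fails), which is exactly the multicollinearity issue mentioned above.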

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q1-5*

**3. Select a correct statement about maximum likelihood estimation:**

- OLS is the maximum likelihood estimator for any linear machine learning model.
- OLS is the maximum likelihood estimator, only if the errors are normally distributed.
- Maximum likelihood method applies only to samples with normal distribution.

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q1-5*

**4. Suppose Lerbon Jasem is idly wandering around a park. He decides to play a lottery where one needs to score 9 times out of 10 basketball penalty shots to win a bunny. Lerbon manages to do that on the 3rd attempt: first he scored 6, then 8, and finally 9. Suppose the number of successes is governed by the Binomial distribution $Bin(n, \theta) = {n \choose k} \theta^k (1-\theta)^{n-k}$, i.e. there are $n=10$ trials (shots) and $\theta$ is an estimator for Lerbon's unknown true probability of success (scoring in a single shot) $p$. Moreover, we assume that scoring in each shot is independent of the other shots (which probably didn't hold for Lerbon when he got irritated after the first shoot-out series, but still).**

**What is the partial derivative of log likelihood function with respect to $\theta$? What's the value of Maximum Likelihood Estimator of $p$ - $\theta_{MLE}$?**

- $\large \frac{23}{\theta} - \frac{7}{1-\theta}$, 0.7(6)
- $\large \frac{23}{\theta} + \frac{7}{1-\theta}$, 1.4375
- $\large \frac{23}{\theta} - \frac{7}{1-\theta}$, 0.8
- $\large \frac{23}{\theta} + \frac{7}{1-\theta}$, 0.7(6)
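The algebra can be sanity-checked by maximizing the log-likelihood numerically over a grid (a sketch; the binomial coefficients are constants in $\theta$ and can be dropped):

```python
import numpy as np

successes, trials = 6 + 8 + 9, 3 * 10  # 23 made shots out of 30 attempts

# Log-likelihood of the binomial model, dropping the constant binomial coefficients:
# l(theta) = 23 * ln(theta) + 7 * ln(1 - theta)
theta = np.linspace(1e-6, 1 - 1e-6, 1_000_000)
log_lik = successes * np.log(theta) + (trials - successes) * np.log(1 - theta)

theta_mle = theta[np.argmax(log_lik)]  # numeric maximizer of the log-likelihood
```

The grid maximizer should land where the derivative of the log-likelihood vanishes.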

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q1-5*

**5. Why should one regularize a linear regression model?**

- To avoid overfitting
- To reduce variance of model predictions
- To generalize better to future data
- To tackle multicollinearity

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q1-5*

**6. Which of the following evaluation metrics should NOT be applied in case of logistic regression output compared with the target?**

- Accuracy score
- Logloss
- Mean Squared Error
- ROC AUC

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q6-10*

**7. Which of the following approaches do we use to fit logistic regression parameters to the data in hand?**

- Least Square Error
- Maximum Likelihood Estimation
- The margin of classification
- The Jaccard index

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q6-10*

**8. Let's analyze the logarithmic loss function, as derived in the article, in terms of margin $M$: $L(M) = \ln\left(1 + e^{-M}\right)$**

_(here we have natural logarithm, but the base is not that important due to $\log_ab = \frac{\ln b}{\ln a}$, i.e. some base $a$ different from $e$ will only result in a constant multiplier $\frac{1}{\ln a}$ which doesn't change the analysis)_

**Select all correct statements:**

- the model is *penalized* (meaning that the loss is positive) even when a training instance is correctly classified
- logarithmic loss is a strictly decreasing function of margin $M$
- derivative $\frac{dL}{dM}$ can be interpreted in terms of margin $M$: $\frac{dL}{dM} = -\sigma(-M)$
- second derivative $\frac{d^2L}{dM^2}$ is strictly positive, meaning that $L$ is convex
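These properties can be verified numerically, assuming $L(M) = \ln(1 + e^{-M})$ as in the article (a sketch using central finite differences):

```python
import numpy as np

def logloss(M):
    return np.log1p(np.exp(-M))  # L(M) = ln(1 + e^{-M})

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

M = np.linspace(-5.0, 5.0, 1001)
h = 1e-4

# Central finite differences for the first and second derivatives of L
num_grad = (logloss(M + h) - logloss(M - h)) / (2 * h)
num_hess = (logloss(M + h) - 2 * logloss(M) + logloss(M - h)) / h**2
```

On the grid, `num_grad` should match $-\sigma(-M)$, and `num_hess` should be strictly positive.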

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q6-10*

**9. Suppose you train a logistic regression classifier $P\left(y=1 \mid \textbf{x}\right) = \sigma\left(w_0 + w_1x_1 + w_2x_2\right)$,**

**meaning that the predicted probability $P\left(y=1 \mid \textbf{x}\right)$ is compared with 0.5, and depending on that either +1 or -1 is returned.**

**Also we know that $w_0 = 3, w_1 = 0, w_2 = -1.$**

**Which of the following figures represents the decision boundary of the given classifier?**

- A
- B
- C
- D
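To reason about the boundary without the figures: assuming the model $P(y=1 \mid \textbf{x}) = \sigma(w_0 + w_1x_1 + w_2x_2)$, the boundary is the set where the linear part equals zero (a minimal sketch):

```python
import numpy as np

w0, w1, w2 = 3.0, 0.0, -1.0

def predict(x1, x2):
    """Return +1 or -1 by comparing P(y=1|x) with 0.5."""
    p = 1.0 / (1.0 + np.exp(-(w0 + w1 * x1 + w2 * x2)))
    return np.where(p >= 0.5, 1, -1)

# The boundary is where w0 + w1*x1 + w2*x2 = 0; with w1 = 0 this reduces to
# x2 = 3, a horizontal line, independent of x1
```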

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q6-10*

**10. In this question, we'll be working with Sklearn's bias-variance decomposition example. There they compare a decision tree regressor and bagging over the same trees. In our case, we'll compare 4 decision trees - with maximal depths of 1, 2, 5, and 10 (in all cases random_state shall be set to 17). So you need to take the same code from the example and change the estimators variable.**
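The change amounts to redefining the `estimators` variable from the sklearn example, e.g. as follows (a sketch; the rest of the example's code stays as-is):

```python
from sklearn.tree import DecisionTreeRegressor

# Four decision trees of increasing depth, each with random_state=17,
# replacing the `estimators` variable from the sklearn example
estimators = [
    (f"Tree (max_depth={d})", DecisionTreeRegressor(max_depth=d, random_state=17))
    for d in (1, 2, 5, 10)
]
```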

**Your task is to:**

- read about bias-variance decomposition in the mlcourse.ai article
- understand the code from the mentioned sklearn's example
- understand what you changed in the code and how it affected the figures built at the end of the example

**When you're done with that, choose all correct statements:**

- Variance always increases with increased `max_depth`
- Minimal MSE is achieved when `max_depth` is set to 5
- Minimal MSE is achieved when bias is also minimal
- Minimal MSE is achieved when variance is also minimal
- The tree with `max_depth`=10 is overfitted

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz2_q6-10*