Author: Yury Kashnitsky (@yorko). Edited by Roman Volykhin (@GerrBert). This material is subject to the terms and conditions of the Creative Commons CC BY-NC-SA 4.0 license. Free use is permitted for any non-commercial purpose.

Prior to working on the assignment, you'd better check out the corresponding course material:

- Classification, Decision Trees and k Nearest Neighbors, the same as an interactive web-based Kaggle Kernel
- Ensembles:
  - Bagging, the same as a Kaggle Kernel
  - Random Forest, the same as a Kaggle Kernel
  - Feature Importance, the same as a Kaggle Kernel

- There are 5 video lectures on trees, forests and their applications: mlcourse.ai/lectures

We suggest that you first read the articles (quiz questions are based on them); if something is not clear, watch the corresponding lecture.

Your task is to:

- study the materials,
- write code where needed,
- choose answers in the webform.

Solutions will be discussed during a live YouTube session on September 28. You can get up to 10 credits (your web-form score, 15 points max, will be scaled to a maximum of 10 credits).

*For discussions, please stick to ODS Slack, channel #mlcourse_ai_news, pinned thread #quiz1_fall2019*

**Question 1**. Which of these problems does not fall into the 3 main types of ML tasks: classification, regression, and clustering?

- Identifying a topic of a live-chat with a customer
- Grouping news into topics
- Predicting LTV (Life-Time Value) - the amount of money a customer will spend over a given (long) period of time
- Listing top products that a user is prone to buy (based on his/her click history)

**Question 2**. Maximal possible entropy is achieved when all states are equally probable (prove it yourself for a system with 2 states with probabilities $p$ and $1-p$). What's the maximal possible entropy of a system with N states? (here all logs are with base 2)

- $N \log N$
- $-\log N$
- $\log N$
- $-N \log N$
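If you want a numeric sanity check before answering, base-2 entropy is easy to compute directly (a minimal sketch; the helper function below is ours, not from the article):

```python
from math import log2

def entropy(probs):
    """Shannon entropy (base 2); terms with p = 0 contribute nothing."""
    return -sum(p * log2(p) for p in probs if p > 0)

# entropy of a system with N equiprobable states, for a few values of N
for n in (2, 4, 8):
    print(n, entropy([1 / n] * n))
```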

**Question 3**. In the Topic 3 article's toy example with 20 balls, what's the information gain of splitting the 20 balls into 2 groups based on the condition $X \leq 8$?

- ~ 0.1
- ~ 0.01
- ~ 0.001
- ~ 0.0001
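A small helper for computing information gain may be handy here (a sketch; the ball counts below are purely hypothetical, the actual groups come from the Topic 3 article):

```python
from math import log2

def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def information_gain(parent, left, right):
    """IG = entropy(parent) minus the weighted entropy of the two children."""
    n = len(parent)
    return (entropy(parent)
            - len(left) / n * entropy(left)
            - len(right) / n * entropy(right))

# hypothetical split of 20 balls of two colours ('b'/'y'); substitute the
# actual group contents from the article to answer the question
parent = ['b'] * 9 + ['y'] * 11
left, right = parent[:8], parent[8:]
print(round(information_gain(parent, left, right), 4))
```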

**Question 4.** In a toy binary classification task, there are $d$ features $x_1 \ldots x_d$, but the target $y$ depends only on $x_1$ and $x_2$: $y = [\frac{x_1^2}{4} + \frac{x_2^2}{9} \leq 16]$, where $[\cdot]$ is an indicator function. All of the features $x_3 \ldots x_d$ are noisy, i.e. they do not influence the target at all. Obviously, machine learning algorithms should perform almost perfectly on this task, where the target is a simple function of the input features. If we train sklearn's `DecisionTreeClassifier` for this task, which parameters have a crucial effect on accuracy (crucial meaning that if these parameters are set incorrectly, accuracy can drop significantly)? Select all that apply (to get credit, you need to select all correct options; there are no partially correct answers).

- `max_features`
- `criterion`
- `min_samples_leaf`
- `max_depth`
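You can experiment by generating such a dataset yourself (a sketch; the sample size, feature ranges, and train/test split are arbitrary choices of ours, not given in the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(17)
n, d = 1000, 10                       # arbitrary sample size and dimensionality
X = rng.uniform(-20, 20, size=(n, d))
# the target depends only on x1 and x2: an ellipse-membership indicator
y = (X[:, 0] ** 2 / 4 + X[:, 1] ** 2 / 9 <= 16).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=17)
tree = DecisionTreeClassifier(random_state=17).fit(X_train, y_train)
print(tree.score(X_test, y_test))
```

From here you can vary the parameters listed above one at a time and watch how test accuracy reacts.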

**Question 5.** Load the iris data with `sklearn.datasets.load_iris`. Train a decision tree on this data, specifying the params `max_depth=4` and `random_state=17` (leave all other arguments unchanged). Use all 150 available instances to train the tree (do not perform a train/validation split). Visualize the fitted decision tree; see Topic 3 for examples. Let's call a leaf in a tree *pure* if it contains instances of only one class. How many pure leaves are there in this tree?

- 6
- 7
- 8
- 9
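Besides visualizing the tree, the pure leaves can be counted programmatically (a sketch that inspects the fitted tree's internal `tree_` arrays, following sklearn's tree-structure conventions):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=4, random_state=17)
tree.fit(iris.data, iris.target)

t = tree.tree_
is_leaf = t.children_left == -1          # leaves have no children
# a leaf is pure if only one class has a nonzero count/fraction in it
pure = sum(int(np.count_nonzero(t.value[i]) == 1)
           for i in np.where(is_leaf)[0])
print(pure)
```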

**Question 6.** There are 7 jurors in the courtroom. Each of them can individually determine correctly whether the defendant is guilty with 80% probability. How likely is it that the jury will jointly reach a correct verdict if the decision is made by majority voting?

- 20.97%
- 80.00%
- 83.70%
- 96.66%
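Majority voting over independent jurors is a binomial tail sum, which you can evaluate directly (a sketch; `math.comb` requires Python 3.8+):

```python
from math import comb

p, n = 0.8, 7
need = n // 2 + 1                     # at least 4 of 7 jurors must be correct
# sum of Binomial(n, p) probabilities for k = need, ..., n
p_majority = sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
                 for k in range(need, n + 1))
print(f"{p_majority:.4%}")
```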

**Question 7.** In Topic 5, part 2, section 2 "Comparison with Decision Trees and Bagging", we show how bagging and Random Forest improve classification accuracy compared to a single decision tree. Which of the following best explains the visual difference between the decision boundaries built by a single decision tree and those built by ensemble models?

- Ensembles ignore some of the features; by picking only the important ones, they build a smoother decision boundary
- Some of the classification rules built by a decision tree can be applied only to a small number of training instances
- When fitting a decision tree, if two potential splits are equally good in terms of the information criterion, a random one is chosen. This introduces some randomness into building a decision tree, which is why its decision boundary is so jagged

**Question 8.** Random Forest learns a coefficient for each input feature, which shows how much this feature influences the target feature. True/False?

- True
- False
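For reference, a fitted forest in sklearn exposes `feature_importances_` rather than per-feature coefficients (a sketch on the iris data; the dataset choice here is arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rf = RandomForestClassifier(n_estimators=100, random_state=17)
rf.fit(iris.data, iris.target)

# impurity-based importances sum to 1; they are not linear coefficients
for name, imp in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```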

**Question 9.** Suppose we fit a `RandomForestRegressor` to predict the age of a customer (an actual industry task, useful for targeting ads), and the maximal age seen in the dataset is 98 years. Is it possible that for some future customer the model predicts an age of 105 years?

- Yes
- No
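A quick experiment with the prediction range of a forest regressor (a sketch on synthetic data; the features and "ages" below are randomly generated, not a real dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(17)
X_train = rng.uniform(0, 100, size=(500, 3))
y_train = rng.uniform(18, 98, size=500)      # max "age" seen in training: ~98

rf = RandomForestRegressor(n_estimators=100, random_state=17)
rf.fit(X_train, y_train)

# compare the range of predictions on unseen data with the training range
X_new = rng.uniform(0, 100, size=(200, 3))
preds = rf.predict(X_new)
print(preds.min(), preds.max(), y_train.min(), y_train.max())
```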

**Question 10.** Select all statements supporting the advantages of Random Forest over decision trees (some statements might be true but not about Random Forest's advantages; don't select those).

- Random Forest is easier to train in terms of computational resources
- Random Forest typically requires more RAM than a single decision tree
- Random Forest typically achieves better metrics in classification/regression tasks
- Single decision tree's prediction can be much easier interpreted