
Class discussion --

Q - How does random forest deal with outliers?
A - In general, random forests are resilient to random outliers, since no split will be made for just a few outlying observations. For consistent outliers, the forest gets a consistent signal of how the dependent variable depends on that outlying value.

Q - Which score is used for classification problems, the way R^2 is used for regression?
A - For classification, use cross entropy (log loss).
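To make the answer above concrete, here is a minimal sketch of binary log loss in plain Python. The function name `log_loss` and the sample numbers are made up for illustration; this is not the code used in class.

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross entropy; y_true holds 0/1 labels, p_pred holds P(y=1)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never happens
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Same labels, once with confident correct predictions, once with confident wrong ones.
confident_right = log_loss([1, 0, 1], [0.9, 0.1, 0.8])
confident_wrong = log_loss([1, 0, 1], [0.1, 0.9, 0.2])
print(confident_right, confident_wrong)
```

Lower is better: being confidently wrong is punished much harder than being confidently right is rewarded.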

Q - Creating an appropriate test set is the most important step. Why?
A - Because a good test set is an indicator of how well your model will work in production. This is something to keep in mind even after getting promoted to a high position, even if you no longer do hands-on coding. A wrong test set can lead to heavy business losses in the real world.

For time-related data (e.g. grocery sales), use a sequential block of the most recent data (e.g. the last 3 months) as the test set. Think about what, how and when you are trying to predict.
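A minimal sketch of such a split with pandas, assuming a table with a date column. The column names `date` and `sales` and the data itself are invented for illustration.

```python
import pandas as pd

# Hypothetical daily sales data; column names and values are made up.
df = pd.DataFrame({
    "date": pd.date_range("2017-01-01", periods=365, freq="D"),
    "sales": range(365),
})

df = df.sort_values("date")
# Everything after this cutoff (the last 3 months) becomes the test set.
cutoff = df["date"].max() - pd.DateOffset(months=3)

train = df[df["date"] <= cutoff]
test = df[df["date"] > cutoff]

print(len(train), len(test))
print(train["date"].max(), test["date"].min())
```

The point of splitting by date rather than at random is that no test row is older than any training row, which mimics predicting the future in production.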

Q - Regarding the point above: our observations may not contain an explicit time column, but the data might still depend on time. How do we do the train-validation split then? What should our test set look like?
A - Think about the type of problem at hand.

Q - How do we deal with seasonality if we take just the last few months for validation and test?
A - Think about which factors lead to the seasonality. Why are sales higher in summer? Which factors drive that?

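Q - Is the OOB score a better or worse estimate of model quality than the validation set score?
A - Generally a little worse (lower), because each row's OOB prediction uses only the subset of trees that did not see that row during training, whereas validation set predictions use the full forest.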
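A small sketch for comparing the two numbers with sklearn, using `oob_score=True`. The data is synthetic and made up for illustration, so on any single run either score may come out higher.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (made up for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X[:, 0] * 2 + np.sin(X[:, 1]) + rng.normal(scale=0.3, size=1000)

X_train, y_train = X[:800], y[:800]
X_valid, y_valid = X[800:], y[800:]

# oob_score=True scores each training row using only the trees
# whose bootstrap sample did not contain that row.
rf = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
rf.fit(X_train, y_train)

print("OOB R^2:       ", rf.oob_score_)
print("validation R^2:", rf.score(X_valid, y_valid))
```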

Waterfall chart

Widely used in business. (Not commonly done in Python; it would be cool if someone made one. Not that difficult, though.)
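A rough sketch of the arithmetic behind a waterfall chart: each change is drawn as a bar floating on the running total. The helper name `waterfall_bars` and the numbers are made up for illustration.

```python
def waterfall_bars(start, deltas):
    """Return ((bottom, height) for each floating bar, final total)."""
    bars = []
    running = start
    for delta in deltas:
        # A positive step floats upward from the running total; a negative
        # step hangs down from it, so its bottom is the new, lower total.
        bottom = running if delta >= 0 else running + delta
        bars.append((bottom, abs(delta)))
        running += delta
    return bars, running

# Made-up numbers: a starting value followed by four changes.
bars, end = waterfall_bars(100, [30, -20, 15, -5])
print(bars, end)
```

To draw it, the bottoms and heights could be passed to a bar plot that supports a `bottom` argument, such as matplotlib's `plt.bar(x, heights, bottom=bottoms)`.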

Tree interpreter

Can be used to debug why a particular prediction came out badly. (e.g. Jeremy like Citizen Kane)

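Comparing the test/validation set with the training set (is_valid flag): label every row with an is_valid flag and check whether a model can tell the two sets apart.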
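A sketch of the is_valid idea under made-up data: stack the training and validation rows, label them, and see whether a classifier can separate them. If it can, the feature importances hint at which columns differ between the sets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Made-up data: the validation rows have a shifted first column,
# as if they came from a later time period.
X_train = rng.normal(loc=0.0, size=(500, 3))
X_valid = rng.normal(loc=0.0, size=(500, 3))
X_valid[:, 0] += 3.0

X_all = np.vstack([X_train, X_valid])
is_valid = np.r_[np.zeros(500, dtype=int), np.ones(500, dtype=int)]

clf = RandomForestClassifier(n_estimators=40, oob_score=True, random_state=0)
clf.fit(X_all, is_valid)

# Accuracy near 0.5 would mean the two sets look alike;
# accuracy well above 0.5 means they are distinguishable.
print("OOB accuracy:", clf.oob_score_)
print("importances: ", clf.feature_importances_)
```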

My questions --

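Q - Again, what is bias?
A - ? -- Probably the average of all y's in the training set (the easiest guess, before looking at any feature). The prediction is then this bias plus the contributions of the individual features.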

Q - How can a particular feature come out as most important for one row, and a different feature for another row?
A - ?
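A sketch that may help with both questions above, assuming the usual tree-interpreter decomposition (prediction = bias + sum of feature contributions). It walks a single sklearn decision tree by hand rather than calling any interpreter library; `explain_row` and the data are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def explain_row(tree, row):
    """Split one tree prediction into bias + per-feature contributions."""
    t = tree.tree_
    x = np.asarray(row, dtype=np.float32)  # sklearn compares inputs as float32
    node = 0
    bias = t.value[0, 0, 0]                # mean of y at the root
    contributions = np.zeros(len(x))
    while t.children_left[node] != -1:     # -1 marks a leaf
        feat = t.feature[node]
        if x[feat] <= t.threshold[node]:
            child = t.children_left[node]
        else:
            child = t.children_right[node]
        # The change in the node mean is credited to the feature split on.
        contributions[feat] += t.value[child, 0, 0] - t.value[node, 0, 0]
        node = child
    return bias, contributions

# Made-up data: feature 1 only matters when feature 0 is positive.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(400, 2))
y = 3 * X[:, 0] + np.where(X[:, 0] > 0, 4 * X[:, 1], 0.0)
tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)

for row in X[:2]:
    bias, contrib = explain_row(tree, row)
    print(bias, contrib, bias + contrib.sum())
```

Each row follows its own path through the tree, so different splits (and therefore different features) get the credit for different rows, while the bias is the same for every row.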

Q - About PDP: why is there only 1 point at 1960 (not 500 lines)? Learn more about what is on the Y axis.
A - ?
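A sketch of how partial dependence curves can be computed by hand, which may help with the Y axis part of the question: for each grid value, the feature is overwritten in every sampled row and the model's prediction is recorded. That gives one line per row (500 here) plus their average. The helper `ice_and_pdp` and the data are made up. Some plotting libraries centre every line on the first grid value, which would make all lines meet in a single point there; whether that explains the 1960 point seen in class is only a guess.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def ice_and_pdp(model, X, feature, grid):
    """One prediction curve per row, and the average of those curves (PDP)."""
    curves = np.empty((len(X), len(grid)))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value      # pretend every row has this value
        curves[:, j] = model.predict(X_mod)
    return curves, curves.mean(axis=0)

# Made-up data where the target rises with feature 0.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(600, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=600)
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

sample = X[:500]                       # 500 rows -> 500 lines
grid = np.linspace(0, 10, 6)
curves, pdp = ice_and_pdp(model, sample, feature=0, grid=grid)

# Optional centring: subtract each line's value at the first grid point.
centred = curves - curves[:, [0]]
print(curves.shape, pdp)
```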

Random Forest from scratch --

We will compare it with sklearn's random forest to see how well we did.
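A very small sketch of what such a comparison could look like. This is not the class implementation: it is a simplified, assumed version that only does bagging of greedy regression trees (no random feature subsets), on made-up data. All function names are invented.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def build_tree(X, y, min_leaf=5):
    """Grow one regression tree (nested dicts) by minimising squared error."""
    node = {"value": y.mean()}
    best = None                                  # (score, feature, threshold)
    n = len(y)
    for feat in range(X.shape[1]):
        order = np.argsort(X[:, feat])
        xs, ys = X[order, feat], y[order]
        csum, csum2 = np.cumsum(ys), np.cumsum(ys ** 2)
        for i in range(min_leaf, n - min_leaf + 1):   # i rows go left
            if xs[i - 1] == xs[i]:
                continue                         # cannot split between ties
            left = csum2[i - 1] - csum[i - 1] ** 2 / i
            right = (csum2[-1] - csum2[i - 1]) - (csum[-1] - csum[i - 1]) ** 2 / (n - i)
            if best is None or left + right < best[0]:
                best = (left + right, feat, xs[i - 1])
    if best is None:                             # too few rows: stay a leaf
        return node
    _, feat, thr = best
    mask = X[:, feat] <= thr
    node.update(feature=feat, threshold=thr,
                left=build_tree(X[mask], y[mask], min_leaf),
                right=build_tree(X[~mask], y[~mask], min_leaf))
    return node

def predict_tree(node, row):
    while "feature" in node:
        go_left = row[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["value"]

def fit_forest(X, y, n_trees=10, min_leaf=5, seed=0):
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample
        trees.append(build_tree(X[idx], y[idx], min_leaf))
    return trees

def predict_forest(trees, X):
    return np.array([np.mean([predict_tree(t, row) for t in trees]) for row in X])

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up data; the third feature is pure noise.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 3))
y = X[:, 0] ** 2 + 2 * X[:, 1] + rng.normal(scale=0.5, size=400)
X_train, y_train, X_valid, y_valid = X[:300], y[:300], X[300:], y[300:]

ours_r2 = r2(y_valid, predict_forest(fit_forest(X_train, y_train), X_valid))
sk = RandomForestRegressor(n_estimators=10, min_samples_leaf=5, random_state=0)
sk_r2 = sk.fit(X_train, y_train).score(X_valid, y_valid)
print("ours:", ours_r2, "sklearn:", sk_r2)
```

The two scores will not match exactly, since the bootstrap samples and split details differ, but they should land in the same range.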

HW

Try to replicate whatever we learnt in class today on some Kaggle problem or any other dataset.
