#!/usr/bin/env python
# coding: utf-8

# # Background
#
# Perhaps you've played [Twenty Questions](https://en.wikipedia.org/wiki/Twenty_Questions) before: it's a game where one player (the *answerer*) thinks of a person, place, or thing, and other players ask yes-or-no questions to guess the object of the answerer's thoughts. Since the answerer probably knows about a lot of different people and objects, a good strategy for the other players involves devising questions that reduce the space of possible answers as much as possible no matter how they are answered.
#
# Given a labeled collection of examples, you might imagine a technique to [learn a *decision tree*](https://en.wikipedia.org/wiki/Decision_tree_learning) of questions that classifies these examples while asking as few questions as possible. However, such a technique would necessarily be quite dependent on the exact examples on offer. (In other words, decision tree learning is prone to *overfitting*.) As a simple illustration, consider the case where your set of example objects were `{'ant', 'elephant'}`. In this case, the question "is it smaller than a typical adult human?" would enable you to differentiate between the examples optimally. However, that question would be useless if your set of example objects were the set of all domesticated dog breeds.
#
# [Random decision forest models](https://en.wikipedia.org/wiki/Random_forest) work by training an *ensemble* of imprecise decision trees, each of which only considers a subset of the features or examples, and then aggregating the results from the ensemble. By learning and aggregating many such trees, random decision forests can be more accurate than individual decision trees *and* are less likely to overfit. In this notebook, we'll use a random decision forest to classify documents as either "spam" (based on food reviews) or "legitimate" (based on Jane Austen).
#
# We'll begin by loading the feature vectors we generated in either [the simple summaries feature extraction notebook](03-feature-engineering-summaries.ipynb) or [the TF-IDF feature extraction notebook](03-feature-engineering-tfidf.ipynb).

# In[ ]:

import pandas as pd
import os.path

features = pd.read_parquet(os.path.join("data", "features.parquet"))


# When doing exploratory analysis, it's often a good idea to inspect your data as a sanity check. In this case, we'll make sure that the feature vectors we generated in the last notebook have the shape we expect!

# In[ ]:

features.sample(5)


# In[ ]:

from sklearn.ensemble import RandomForestClassifier
from sklearn import model_selection
from mlworkflows import util

# smaller numbers will use less memory; larger numbers may lead
# to better performance but may also use too much memory
DEFAULT_NUM_ESTIMATORS = 10

# choose either 'gini' or 'entropy'
DEFAULT_CRITERION = 'gini'

estimators = util.get_param("RF_NUM_ESTIMATORS", DEFAULT_NUM_ESTIMATORS, int)
criterion = util.get_param("RF_CRITERION", DEFAULT_CRITERION, str)

train, test = model_selection.train_test_split(features)
rfc = RandomForestClassifier(n_estimators=estimators, criterion=criterion, random_state=404)


# ✅ Experiment with changing `DEFAULT_NUM_ESTIMATORS` and `DEFAULT_CRITERION` to different values! When you operationalize your pipeline, you'll be able to control these with the `RF_NUM_ESTIMATORS` and `RF_CRITERION` environment variables.
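# If you'd like to see the ensemble effect described above for yourself, the optional cell below is a quick sketch that cross-validates a single decision tree against the forest on the same data. It assumes the same column layout used in the training cells that follow: labels in the `label` column and feature values from the third column onward.

# In[ ]:

from sklearn.tree import DecisionTreeClassifier

# use every labeled example for a quick cross-validated comparison
X_all, y_all = features.iloc[:, 2:], features["label"]

tree_scores = model_selection.cross_val_score(
    DecisionTreeClassifier(random_state=404), X_all, y_all, cv=5)
forest_scores = model_selection.cross_val_score(
    RandomForestClassifier(n_estimators=estimators, criterion=criterion, random_state=404),
    X_all, y_all, cv=5)

print("single tree accuracy:   %.3f" % tree_scores.mean())
print("random forest accuracy: %.3f" % forest_scores.mean())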
# In[ ]:

rfc.fit(X=train.iloc[:, 2:train.shape[1]], y=train["label"])


# In[ ]:

from mlworkflows import plot

predictions = rfc.predict(test.iloc[:, 2:test.shape[1]])
df, chart = plot.confusion_matrix(test["label"], predictions)


# In[ ]:

chart


# We can look at the raw numbers, and proportions, of correctly and incorrectly classified items:

# In[ ]:

df


# We can also look at the precision, recall, and F1 score for the model.

# In[ ]:

from sklearn.metrics import classification_report
print(classification_report(test.label.values, predictions))


# One interesting aspect of random decision forests is that they provide a metric for how important each feature was to the ultimate conclusion. This is a useful property both for having *explainable models* (i.e., so you can explain to a human why the model made a particular prediction) and for guiding further experiments (i.e., so you can learn more about the real world based on what the model has identified as likely to be correlated with what you're trying to predict).

# In[ ]:

l = list(enumerate(rfc.feature_importances_))


# In[ ]:

l.sort(key=lambda x: -x[1])
l[:20]


# What these features actually *mean* depends on which feature engineering approach you chose in the last notebook. In the case of the simple summaries approach, it's fairly straightforward -- we can just take the column names:

# In[ ]:

if len(l) < 20:
    # the simple summaries have fewer than 20 features
    d = dict(enumerate(train.columns[2:]))
    for k, v in l[:20]:
        print(d[k], v)


# If you used the simple summaries approach, you'll probably see that `stop_words` is the most important feature.
#
# If you used the tf-idf approach, the preceding cell shouldn't have produced any output, and we'll need a more involved approach. Recall that we used [feature hashing](https://en.wikipedia.org/wiki/Feature_hashing) for the tf-idf feature vectors. Feature hashing has many benefits, but one of its downsides is that hashing is one-way: we can't go from an entry in the feature vector back to the word that produced it. However, we can do a little extra work to reconstruct the words that _might_ have produced a given entry.
#
# Our feature vectors are smaller than the number of distinct words we might have seen in our corpus of documents, [which means that more than one word will be represented by counts in a given bucket](https://en.wikipedia.org/wiki/Pigeonhole_principle). But we can construct a list of the words that may have mapped to each bucket by building a vocabulary for our document corpus and then hashing each word in it, which we can use to identify which words _may_ have corresponded to the most important features.

# In[ ]:

if len(l) > 20:
    from sklearn.utils.murmurhash import murmurhash3_bytes_s32
    from sklearn.feature_extraction.text import CountVectorizer
    from collections import defaultdict

    def fhash(v, size=1024):
        return murmurhash3_bytes_s32(v.encode("utf-8"), 0) % size

    data = pd.read_parquet("data/training.parquet")
    vectorizer = CountVectorizer(token_pattern='(?u)\\b[A-Za-z]\\w+\\b')
    vectorizer.fit(data["text"])

    d = defaultdict(lambda: [])
    for k in vectorizer.vocabulary_.keys():
        d[fhash(k)].append(k)

    for k, v in l[:20]:
        print(d[k], v)


# This sort of investigation is useful for seeing how well our choice of feature vector size is working in the context of feature hashing: if an important feature maps to many apparently unrelated words (especially if several relatively common words share the same bucket), we may want to increase the number of buckets to improve the performance of our model.
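# If you used the hashed tf-idf features, one quick way to gauge whether the bucket count is adequate is to look at how many vocabulary words share each bucket. The optional cell below is a sketch that reuses the bucket-to-words mapping `d` built above, so it only does anything in the feature-hashing case.

# In[ ]:

if len(l) > 20:
    # how many distinct vocabulary words hash into each bucket?
    bucket_sizes = pd.Series({bucket: len(words) for bucket, words in d.items()})
    print("average words per occupied bucket: %.1f" % bucket_sizes.mean())
    print("most crowded buckets:")
    print(bucket_sizes.sort_values(ascending=False).head(5))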
# Finally, we'll want to save the model so that we can use it outside of this notebook.

# In[ ]:

util.serialize_to(rfc, "model.sav")
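# The `util.serialize_to` helper comes from the `mlworkflows` package used throughout these notebooks. If you want to persist the model without that helper, a roughly equivalent pattern -- shown here as a sketch using joblib, which is an assumption rather than a description of what `serialize_to` does internally -- looks like this:

# In[ ]:

from joblib import dump, load

dump(rfc, "model-joblib.sav")        # write the fitted forest to disk
reloaded = load("model-joblib.sav")  # read it back into a new object

# sanity check: the reloaded model should reproduce the predictions from above
print((reloaded.predict(test.iloc[:, 2:test.shape[1]]) == predictions).all())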