#!/usr/bin/env python
# coding: utf-8

# # Background
#
# In this notebook we'll train a [Logistic Regression model](https://en.wikipedia.org/wiki/Logistic_regression) to distinguish between spam data (food reviews) and legitimate data (Austen).
#
# Logistic regression is a standard statistical technique for modelling a binary variable. In our case the binary variable we are predicting is 'spam' or 'not spam' (i.e. legitimate). Combined with a reasonable feature engineering approach, logistic regression is often a sensible first choice for a classification problem!

# We begin by loading the feature vectors which we generated in either [the simple summaries feature extraction notebook](03-feature-engineering-summaries.ipynb) or [the TF-IDF feature extraction notebook](03-feature-engineering-tfidf.ipynb).

# In[ ]:

import pandas as pd

feats = pd.read_parquet("data/features.parquet")

# When doing exploratory analysis, it's often a good idea to inspect your data as a sanity check. In this case, we'll make sure that the feature vectors we generated in the last notebook have the shape we expect!

# In[ ]:

feats.sample(10)

# The first two columns of `feats` hold the index and the label; the remaining columns are the feature vectors.
#
# We split the data into two sets:
#
# * `train` - a set of feature vectors which will be used to train the model
# * `test` - a set of feature vectors which will be used to evaluate the trained model

# In[ ]:

from sklearn import model_selection

train, test = model_selection.train_test_split(feats, random_state=43)

# In[ ]:

from sklearn.linear_model import LogisticRegression

# In[ ]:

model = LogisticRegression(solver='lbfgs', max_iter=4000)

# In[ ]:

# Train the model, timing how long the fit takes
import time

start = time.time()
model.fit(X=train.iloc[:, 2:], y=train["label"])
end = time.time()
print(end - start)

# With the model trained, we can use it to make predictions.
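# As a sanity check after training, it can be useful to inspect the magnitude of the fitted coefficients: features with a large absolute weight are the ones driving the classification. The cell below is a self-contained sketch on synthetic data (the array shapes and the feature construction are illustrative, not taken from `features.parquet`).

# In[ ]:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a feature matrix: 100 documents, 4 features
rng = np.random.default_rng(43)
X = rng.normal(size=(100, 4))
# Make the label depend on the first feature only
y = (X[:, 0] > 0).astype(int)

clf = LogisticRegression(solver='lbfgs', max_iter=4000)
clf.fit(X, y)

# coef_ has shape (1, n_features) for a binary problem;
# the first feature should dominate the learned weights
weights = clf.coef_[0]
print(weights)
print("most influential feature:", int(np.argmax(np.abs(weights))))
```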
# We apply the model to the `test` set, then compare the predicted classification of spam or legitimate to the ground truth.

# In[ ]:

predictions = model.predict(test.iloc[:, 2:])

# In[ ]:

predictions

# We use a binary confusion matrix to visualise the accuracy of the model.

# In[ ]:

from mlworkflows import plot

# In[ ]:

df, chart = plot.binary_confusion_matrix(test["label"], predictions)

# In[ ]:

chart

# We can look at the raw numbers, and the proportions of correctly and incorrectly classified items:

# In[ ]:

df

# We can also look at the precision, recall, and F1-score for the model.

# In[ ]:

from sklearn.metrics import classification_report

print(classification_report(test.label.values, predictions))

# We want to save the model so that we can use it outside of this notebook.

# In[ ]:

model

# In[ ]:

from mlworkflows import util

util.serialize_to(model, "model.sav")
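# `mlworkflows.util.serialize_to` is a helper from this workshop's codebase; assuming it writes a standard pickle, the saved model can be loaded back and used outside the notebook. The cell below sketches that round-trip with the standard `pickle` module on a tiny synthetic model, so it runs on its own (the in-memory `dumps`/`loads` pair stands in for writing and reading `model.sav`).

# In[ ]:

```python
import pickle

import numpy as np
from sklearn.linear_model import LogisticRegression

# Fit a small model on synthetic data so the example is self-contained
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
clf = LogisticRegression(solver='lbfgs').fit(X, y)

# Round-trip through pickle, standing in for model.sav on disk
blob = pickle.dumps(clf)
restored = pickle.loads(blob)

# The restored model makes the same predictions as the original
print(restored.predict(X))
```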