#!/usr/bin/env python
# coding: utf-8

# # Logistic Regression With Non-Linear Boundary Demo
# 
# _Source: 🤖[Homemade Machine Learning](https://github.com/trekhleb/homemade-machine-learning) repository_
# 
# > ☝Before moving on with this demo you might want to take a look at:
# > - 📗[Math behind the Logistic Regression](https://github.com/trekhleb/homemade-machine-learning/tree/master/homemade/logistic_regression)
# > - ⚙️[Logistic Regression Source Code](https://github.com/trekhleb/homemade-machine-learning/blob/master/homemade/logistic_regression/logistic_regression.py)
# 
# **Logistic regression** is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, logistic regression is a predictive analysis. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
# 
# In other words, logistic regression is used when the dependent variable (target) is categorical.
# 
# For example:
# 
# - To predict whether an email is spam (`1`) or not (`0`).
# - Whether an online transaction is fraudulent (`1`) or not (`0`).
# - Whether a tumor is malignant (`1`) or not (`0`).
# 
# > **Demo Project:** In this example we will try to classify microchips into two categories (`valid` and `invalid`) based on two artificial parameters `param_1` and `param_2`.

# In[1]:


# To make debugging of the logistic_regression module easier we enable the autoreload feature for imported modules.
# Thanks to this you may change the code of the logistic_regression library and all changes will be available here.
get_ipython().run_line_magic('load_ext', 'autoreload')
get_ipython().run_line_magic('autoreload', '2')

# Add the project root folder to the module loading paths.
import sys
sys.path.append('../..')


# ### Import Dependencies
# 
# - [pandas](https://pandas.pydata.org/) - library that we will use for loading and displaying the data in a table
# - [numpy](http://www.numpy.org/) - library that we will use for linear algebra operations
# - [matplotlib](https://matplotlib.org/) - library that we will use for plotting the data
# - [logistic_regression](https://github.com/trekhleb/homemade-machine-learning/blob/master/homemade/logistic_regression/logistic_regression.py) - custom implementation of logistic regression

# In[2]:


# Import 3rd party dependencies.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the custom logistic regression implementation.
from homemade.logistic_regression import LogisticRegression


# ### Load the Data
# 
# In this demo we will use an artificial dataset in which `param_1` and `param_2` produce a non-linear decision boundary (see the plot below).

# In[3]:


# Load the data.
data = pd.read_csv('../../data/microchips-tests.csv')

# Print the data table.
data.head(10)


# ### Plot the Data
# 
# Let's plot the data on a 2D plane and split it by the two classes (`valid` and `invalid`) to see the distribution.

# In[4]:


# Supported set of validities.
validities = [0, 1]

# Dataset parameters that we will take into account.
x_axis = 'param_1'
y_axis = 'param_2'

# Scatter the data on the plot for each validity class separately.
for validity in validities:
    plt.scatter(
        data[x_axis][data['validity'] == validity],
        data[y_axis][data['validity'] == validity],
        label=validity
    )

# Plot the data.
plt.xlabel(x_axis)
plt.ylabel(y_axis)
plt.title('Microchips Tests')
plt.legend()
plt.show()
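

# ### Hypothesis Recap
# 
# Before preparing the training data it is worth recalling what the model actually computes: a linear (here, polynomial) combination of the features squashed through the sigmoid function into a probability in `(0, 1)`. Below is a minimal illustrative sketch of the sigmoid itself (the `sigmoid` helper is defined here just for this plot, it is not the homemade library code):

# In[ ]:


# Plot sigmoid(z) = 1 / (1 + e^(-z)) to show how any real-valued score
# is mapped into the (0, 1) probability range.
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.linspace(-10, 10, 100)
plt.plot(z, sigmoid(z))
plt.xlabel('z')
plt.ylabel('sigmoid(z)')
plt.title('Sigmoid Function')
plt.show()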
# ### Prepare the Data for Training
# 
# Before we move on with training our logistic model let's extract the data from the Pandas data frame and shape the training features set and labels set correctly.

# In[5]:


# Get the number of training examples.
num_examples = data.shape[0]

# Extract and shape features.
x_train = data[[x_axis, y_axis]].values.reshape((num_examples, 2))

# Extract and shape labels.
y_train = data['validity'].values.reshape((num_examples, 1))


# ### Init and Train Logistic Regression Model
# 
# > ☝🏻This is the place where you might want to play with model configuration.
# 
# - `max_iterations` - the maximum number of iterations that the gradient descent algorithm will use to find the minimum of the cost function. Low numbers may prevent gradient descent from reaching the minimum. High numbers will make the algorithm work longer without improving its accuracy.
# - `regularization_param` - parameter that helps to fight overfitting. The higher the parameter, the simpler the model will be.
# - `polynomial_degree` - the degree of additional polynomial features (`x1^2 * x2, x1^2 * x2^2, ...`). More features make the decision boundary more curved.
# - `sinusoid_degree` - the degree of sinusoid multipliers for additional features (`sin(x), sin(2*x), ...`). This allows you to curve the predictions by adding a sinusoidal component to the prediction curve.

# In[6]:


# Set up logistic regression parameters.
max_iterations = 100000  # Max number of gradient descent iterations.
regularization_param = 0  # Helps to fight model overfitting.
polynomial_degree = 5  # The degree of additional polynomial features.
sinusoid_degree = 0  # The degree of sinusoid parameter multipliers of additional features.

# Init the logistic regression instance.
logistic_regression = LogisticRegression(x_train, y_train, polynomial_degree, sinusoid_degree)

# Train the logistic regression model.
(thetas, costs) = logistic_regression.train(regularization_param, max_iterations)

# Name the columns for each theta.
columns = []
for theta_index in range(0, thetas.shape[1]):
    columns.append('Theta ' + str(theta_index))

# Row order follows the sorted unique labels (0 = invalid, 1 = valid).
pd.DataFrame(thetas, index=['INVALID', 'VALID'], columns=columns)


# ### Analyze Gradient Descent Progress
# 
# The plot below illustrates how the cost function value changes over each iteration. You should see it decreasing.
# 
# In case the cost function value increases it may mean that gradient descent overshot the cost function minimum and with each step moves further away from it.
# 
# From this plot you may also get an understanding of how many iterations you need to reach an optimal value of the cost function.

# In[7]:


# Draw gradient descent progress for each label.
labels = logistic_regression.unique_labels
plt.plot(range(len(costs[0])), costs[0], label=labels[0])
plt.plot(range(len(costs[1])), costs[1], label=labels[1])

plt.xlabel('Gradient Steps')
plt.ylabel('Cost')
plt.legend()
plt.show()
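

# For reference, the cost being minimized here is the regularized cross-entropy (log-loss). A minimal sketch of it is below; the `cross_entropy_cost` helper and its exact regularization details are illustrative assumptions, not the homemade library code:

# In[ ]:


# Illustrative regularized cross-entropy cost for logistic regression.
# Assumptions: `x` already includes the bias column, `thetas` is a column
# vector, and `y` holds 0/1 labels.
def cross_entropy_cost(thetas, x, y, regularization_param):
    num_examples = y.shape[0]
    # Sigmoid hypothesis: probabilities in (0, 1).
    predictions = 1 / (1 + np.exp(-x @ thetas))
    # Average log-loss over all training examples.
    cost = -(1 / num_examples) * np.sum(
        y * np.log(predictions) + (1 - y) * np.log(1 - predictions)
    )
    # Regularization term (the bias parameter is conventionally excluded).
    cost += (regularization_param / (2 * num_examples)) * np.sum(thetas[1:] ** 2)
    return cost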
# ### Calculate Model Training Accuracy
# 
# Calculate what percentage of microchips from the training set have been classified correctly.

# In[8]:


# Make training set predictions.
y_train_predictions = logistic_regression.predict(x_train)

# Check what percentage of them are actually correct.
accuracy = np.sum(y_train_predictions == y_train) / y_train.shape[0] * 100

print('Training Accuracy: {:5.4f}%'.format(accuracy))


# ### Draw Decision Boundaries
# 
# Let's build our decision boundaries. These are the lines that distinguish the classes from each other. This will give us a pretty clear overview of how successful our training process was. You should see a clear separation of the two classes on the data plane.

# In[9]:


# Get the number of training examples.
num_examples = x_train.shape[0]

# Set up how many calculations we want to do along every axis.
samples = 150

# Generate test ranges for the x and y axes.
x_min = np.min(x_train[:, 0])
x_max = np.max(x_train[:, 0])

y_min = np.min(x_train[:, 1])
y_max = np.max(x_train[:, 1])

X = np.linspace(x_min, x_max, samples)
Y = np.linspace(y_min, y_max, samples)

# The z axis will contain our predictions, so let's get a prediction for every pair of x and y.
# Note: a fresh variable name is used here so that the `data` data frame loaded above is not shadowed.
Z = np.zeros((samples, samples))
for x_index, x in enumerate(X):
    for y_index, y in enumerate(Y):
        prediction_data = np.array([[x, y]])
        Z[x_index][y_index] = logistic_regression.predict(prediction_data)[0][0]

# Now that the x, y and z axes are set up and calculated we may plot the decision boundaries.
positives = (y_train == 1).flatten()
negatives = (y_train == 0).flatten()

plt.scatter(x_train[negatives, 0], x_train[negatives, 1], label='0')
plt.scatter(x_train[positives, 0], x_train[positives, 1], label='1')

# contour expects Z[y_index][x_index], while Z was filled as Z[x_index][y_index],
# so we pass the transpose.
plt.contour(X, Y, Z.T)

plt.xlabel('param_1')
plt.ylabel('param_2')
plt.title('Microchips Tests')
plt.legend()
plt.show()
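

# The nested loop above calls `predict` once per grid point, which is slow. Since `predict` already accepts a batch of rows (it is called on the whole `x_train` above), the same grid can be evaluated in a single vectorized call. A sketch of that alternative (the `XX`, `YY`, `grid` and `ZZ` names are introduced here just for illustration):

# In[ ]:


# Vectorized alternative to the nested prediction loop above.
# np.meshgrid produces coordinate matrices where ZZ[i][j] corresponds
# to the point (X[j], Y[i]), which is exactly the layout plt.contour expects.
XX, YY = np.meshgrid(X, Y)
grid = np.c_[XX.ravel(), YY.ravel()]  # Shape: (samples * samples, 2).
ZZ = logistic_regression.predict(grid).reshape(XX.shape)

plt.scatter(x_train[negatives, 0], x_train[negatives, 1], label='0')
plt.scatter(x_train[positives, 0], x_train[positives, 1], label='1')
plt.contour(XX, YY, ZZ)

plt.xlabel('param_1')
plt.ylabel('param_2')
plt.title('Microchips Tests')
plt.legend()
plt.show()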