#!/usr/bin/env python
# coding: utf-8

# # Neural network methods
# 
# Author: Gaurav Vaidya

# ## Learning objectives
# * Understand what an artificial neural network (ANN) is and how it can be used.
# * Implement ANNs for use in prediction and classification based on multiple input features.
# 
# ## Learning deeply
# [Artificial Neural Networks (ANNs)](https://en.wikipedia.org/wiki/Artificial_neural_network) and [deep learning](https://en.wikipedia.org/wiki/Deep_learning) are currently getting a lot of interest, both as a subject of research and as a tool for analyzing datasets. A big difference from other machine learning techniques we've looked at so far is that ANNs can identify characteristics of interest by themselves, rather than having to be chosen by data scientists. Some of the other advantages of ANNs are related specifically to interpreting video and audio data, such as by using [convolutional neural networks](https://en.wikipedia.org/wiki/Convolutional_neural_network), but today we will focus on simple ANNs so you understand their struction and function.
# 
# ### Units
# 
# ANNs are designed as *layers* of *units* (or nodes, or [artificial neurons](https://en.wikipedia.org/wiki/Artificial_neuron)). Each unit accepts multiple inputs, each of which has a different weight, including one bias input, which it combines into a single value. That single value is passed to an [activation function](https://en.wikipedia.org/wiki/Activation_function), which provides an output only if the combined value is greater than a particular threshold (usually, zero).
# 
# In effect, each unit focuses on a particular aspect of the layer underneath it, and then summarizing that information for the layer above it. Every input to every unit has a *weight* and the unit has a *bias* input, and so is effectively doing a linear regression on the incoming data to obtain an output. The use of a non-linear activation function allows the ANN to predict and classify data that are not [linearly separable](https://en.wikipedia.org/wiki/Linear_separability). ANNs used to use the same sigmoid function we saw before, but these days the [rectified linear unit (ReLU) function](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) has become much more popular. It simply returns zero when the real output value is less than zero.
# 
# ![ReLU and softmax rectifiers](../nb-images/800px-Rectifier_and_softplus_functions.svg.png)
# 
# ### Layers
# 
# ![A diagram showing input, output and hidden layers in an ANN](../nb-images/colored_neural_network.svg.png)
# 
# Every neural network has three layers:
# 
# * An input layer, where each unit corresponds to a particular input feature. This could be categorical data, continuous data, or even colour values from images.
# * An output layer. We will be running two examples today: in the first, we will use a single output unit (the predicted price for a particular house in California). In the second, we will use seven output units, each corresponding to a particular type of forest cover.
# * A hidden layer. Without a hidden layer, an ANN can only pick up linear relationships: how changes in the input layer correspond to values in the output layer. Thanks to the hidden layer, an ANN can also pick up non-linear relationships, where different groups of input values interact in complicated ways to get the correct response on the output layer. The "deep" in [deep learning](https://en.wikipedia.org/wiki/Deep_learning) refers to the hidden layers that allow the model to identify intermediate patterns between the input and output layers.
# 
# Putting it all together, we end up with a type of ANN called a *multilayer perceptron (MLP) network*, which looks like the following:
# 
# ![A visualization of a multi-layer perceptron (MLP) from the Scikit-learn manual](../nb-images/multilayerperceptron_network.png)
# 
# [Google's Machine Learning course](https://developers.google.com/machine-learning/crash-course/introduction-to-neural-networks/anatomy) puts this very nicely: "Stacking nonlinearities on nonlinearities lets us model very complicated relationships between the inputs and the predicted outputs. In brief, each layer is effectively learning a more complex, higher-level function over the raw inputs."

# ## So what do we need?
# To create an ANN, we need to choose:
# 1. The number of input units (= the number of input features)
# 2. The number of output units:
#     - When using the ANN to predict, we generally only need a single output unit.
#     - When using the ANN to classify, we generally set the number of output units to the number of possible labels.
# 3. The number of hidden layers
#     - More hidden layers allows for more complex models -- which, as you've learned, also increases your risk of overfitting! So you want to go for the simplest model that meets your needs.
# 4. A loss function. The scikit-learn classes we use always use [logistic loss](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html), also known as cross-entropy loss.
# 4. The solver to use. The solver controls learning by searching for local minima in the parameter space. ANNs generally use [stochastic gradient descent (SGD)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent) such as [RMSProp](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#RMSProp), but today we will use [Adaptive Moment Estimation (Adam)](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam).
# 5. The regularization protocol. We will use L2 regularization.
# 
# Note that these are the [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) of our model: we will adjust these hyperparameters to improve how quickly and accurately we can determine the actual parameters of our model, which is the set of weights and biases on all units across all layers.

# ## Reminders of the ground rules
# * Always have *training* data and *testing* data, and make sure the ANN *never* sees the testing data.
# * Always shuffle your data.
# * ANNs don't work well when the features are in different ranges: it's usually a good idea to *normalize* it before use.

# ## ANN for prediction: how much might this house cost?

# There aren't very good datasets for showcasing prediction on biological data, so we will use one of the classic machine learning datasets: a dataset of [California house prices](https://www.kaggle.com/camnugent/california-housing-prices), based on the 1990 census and published in [Pace and Barry, 1997](https://doi.org/10.1016/S0167-7152(96)00140-X). Scikit-Learn can download this dataset for us, so let's start with that.

# In[1]:


from sklearn import datasets
help(datasets.fetch_california_housing)


# In[5]:


import numpy as np
import pandas as pd

# Fetch California housing dataset. This will be downloaded to your computer.
calif = datasets.fetch_california_housing()
print("Shape of California housing data: ", calif.data.shape)
califdf = pd.DataFrame.from_records(calif.data, columns=calif.feature_names)
califdf.head()


# In[7]:


# What do the house prices look like?
print(calif.target[0:5]) # Units: $100,000x


# Let's start by shuffling our data and splitting into testing data and training data.

# In[14]:


from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    califdf,        # Input features (X)
    calif.target,   # Output features (y)
    test_size=0.25, # Put aside 25% of data for testing.
    shuffle=True    # Shuffle inputs.
)

# Did we err?

print("Train data shape: ", X_train.shape)
print("Train label shape: ", y_train.shape)

print("Test data shape: ", X_test.shape)
print("Test label shape: ", y_test.shape)


# In[17]:


# Let's have a look at the data. Is it in similar ranges?
import pandas as pd
pd.DataFrame(X_train).describe()


# The input data comes in many different ranges: compare the ranges of latitude (32.54 to 41.95), longitude (-124.35 to -114.31), median income (0.50 to 15.0) and population (3 to 35682). As we described earlier, it's a good idea to normalize these values. Scikit-Learn has several built-in scalers that do just this. We will use the [StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html), which changes the data so it is normally distributed, with a mean of 0 and a variance of 1, but other [standardization methods are available](https://scikit-learn.org/stable/modules/preprocessing.html#standardization-or-mean-removal-and-variance-scaling).

# In[18]:


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# Figure out how to scale all the input features in the training dataset.
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Also tranform our validation and testing data in the same way.
X_test_scaled = scaler.transform(X_test)

# Did that work?
pd.DataFrame(X_train_scaled).describe()


# We won't normalize the output labels for simplicity and because it's usually not necessary if you have a single output. However, if you have multiple output labels, you will want to normalize them to each other.

# Alright, we're ready to run our model! We will use the [MLPRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html) -- MLP stands for [multilevel perceptron](https://en.wikipedia.org/wiki/Multilayer_perceptron), a description of this kind of neural network.

# In[27]:


from sklearn.neural_network import MLPRegressor

ann = MLPRegressor(
    activation='relu',  # The activation function to use: ReLU
    solver='adam',      # The solver to use
    alpha=0.001,        # The L2 regularization rate: higher values increase cost for larger weights
    hidden_layer_sizes=(50, 20),
                        # The number of units in each hidden layer.
                        # Note that we don't need to specify input and output neuron numbers:
                        # MLPClassifier determines this based on the shape of the features and labels
                        # being fitted.
    verbose=True,       # Report on progress.
    batch_size='auto',  # Process dataset in batches of 200 rows at a time.
    early_stopping=True # This activates two features:
                        #  - We will hold 10% of data aside as validation data. At the end of each
                        #    iteration, we will test the validation data to see how well we're doing.
                        #  - If learning slows below a pre-determined level, we stop early rather than
                        #    overtraining on our data.
)
ann.fit(X_train_scaled, y_train)


# Three things to note:
#  - Some models will never converge, in which case you will see a warning. This suggests that learning is *not* completely, and is likely because something is wrong with learning using this dataset with these hyperparameters.
#  - We learn iteratively. In Scikit-Learn, each iteration is further broken up into "batches" of data.
#  - We expect to see loss going down over time and validation score going up over time. We can visualize these in a graph if we want.

# In[28]:


import matplotlib.pyplot as plt

# Visualize the loss curve and validation scores over iterations.
plt.plot(ann.loss_curve_, label='Loss at iteration')
plt.plot(ann.validation_scores_, label='Validation scores at iteration')
plt.legend(loc='best')
plt.show()


# Finally, we can use our test data to check how our model performs on data that it has not been previously exposed to. Let's see how we did!

# In[29]:


ann.score(X_test_scaled, y_test)


# Not bad, but it could definitely be better.
# 
# ### Let's make a prediction
# 
# Note that we can use this ANN to make a prediction. To come up with one, let's look at those values again:

# In[222]:


pd.DataFrame(calif.data, columns=calif.feature_names).describe()


# So what if we knew that a house was in a block that:
#  - had a median income of 30,000 USD
#  - had a median house age of 12 years
#  - had an average of 2 rooms
#  - had an average of 1 bedroom
#  - had a population of 2,000 in the block
#  - had an average occupancy of 2
#  - was located at (33.93 N, -118.49 E)
#  
# How much would our model predict it would cost?

# In[30]:


house_blocks = [[
    3.,
    12,
    2,
    1,
    2000,
    2.,
    33.93,
    -118.49
]]
house_blocks_scaled = scaler.transform(house_blocks)
print("Scaled values: ", house_blocks_scaled)

predicted_costs = ann.predict(house_blocks_scaled)
print("Predicted cost: ", predicted_costs)

plt.hist(calif.target)
plt.axvline(x=predicted_costs, c='red')
plt.show()


# # Backpropagation
# 
# The heart of neural networks is [backpropagation algorithms](https://en.wikipedia.org/wiki/Backpropagation), which are an efficient way to change the weights and biases in the ANN based on the size of the loss.
# 
# In effect, an ANN is trained by:
# 1. Setting all weights and biases randomly.
# 2. For each row in the test data:
#     1. Set the input units to the input features.
#     2. Use unit weights and biases, passing through the activation function, to calculate the output value of each unit -- right through to the output units.
#     3. Use a loss function to compare the output units with the expected output.
#     4. Use a backpropagation algorithm to update all the weights and biases to reduce the loss.
# 
# Google has a [nice visual explanation](https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/) of backpropagation. [More detailed explanations](http://neuralnetworksanddeeplearning.com/chap2.html) are also available.
#     
# ## When backpropagation goes wrong
# 
# * Vanishing gradients: when weights for lower levels (closer to the input) become very small, gradients become very small too, making it hard or impossible to train these layers. The ReLU activation function can help prevent vanishing gradients.
# 
# * Exploding gradients: when weights become very large, the gradients for lower layers can become very large, making it hard for these gradients to converge. Batch normalization can help prevent exploding gradients, as can lowering the learning rate.
# 
# * Dead ReLU units: once the weighted sum for a ReLU activation function falls below 0, the ReLU unit can get stuck -- without an output, it doesn't contribute to the network output, and gradients can't flow through it in backpropagation. Lowering the learning rate can help keep ReLU units from dying.
# 
# * Dropout regularization: in this form of regularization, a proportion of unit activations are randomly dropped out. This prevents overfitting and so helps create a better model.

# ## ANN for classification: what sort of forest is this?
# Let's jump in with a dataset called [Covertype](https://archive.ics.uci.edu/ml/datasets/Covertype), where we try to predict forest cover type based on a number of features of a 30x30m area of forest as follows:
# 
# | Column | Feature | Units | Description | How measured |
# |---|--------|-------|-------------|--------------|
# | 1 | Aspect | degrees azimuth | Aspect in degrees azimuth | Quantitative |
# | 2 | Slope | degrees | Slope in degrees | Quantitative |
# | 3 | Horizontal_Distance_To_Hydrology | meters | Horz Dist to nearest surface water features | Quantitative |
# | 4 | Vertical_Distance_To_Hydrology | meters | Vert Dist to nearest surface water features | Quantitative |
# | 5 | Horizontal_Distance_To_Roadways | meters | Horz Dist to nearest roadway | Quantitative |
# | 6 | Hillshade_9am | 0 to 255 index | Hillshade index at 9am, summer solstice | Quantitative |
# | 7 | Hillshade_Noon | 0 to 255 index | Hillshade index at noon, summer soltice | Quantitative |
# | 8 | Hillshade_3pm | 0 to 255 index | Hillshade index at 3pm, summer solstice | Quantitative |
# | 9 | Horizontal_Distance_To_Fire_Points | meters | Horz Dist to nearest wildfire ignition points | Quantitative |
# | 10-14 | Wilderness_Area | 4 binary columns with 0 (absence) or 1 (presence) | Which wilderness area this plot is in | Qualitative |
# | 14-54 | Soil_Type | 40 binary columns with 0 (absence) or 1 (presence) | Soil Type designation | Qualitative |

# Using this information, we are trying to classify each 30x30m plot as one of seven forest types.
# 
# This dataset is built into Scikit, so we can use it to download and load the dataset for use.

# In[243]:


from sklearn import datasets
help(datasets.fetch_covtype)


# So we don't need to provide any arguments, but it warns us that it will need to download this dataset. It also describes the the returned dataset object will have the following properties:
# - .data: a numpy array with the features.
# - .target: a numpy array with the target labels. Note that each plot is classified into only one of these values.
# - .DESCR: describe this forest covertype.

# In[33]:


# Let's get the data pre-shuffled.
covtype = datasets.fetch_covtype(shuffle=True)
covtypedf = pd.DataFrame(covtype.data)
covtypedf


# In[34]:


print("Target: ", covtype.target)
print("Target shape: ", covtype.target.shape)


# As before, we start by splitting our data into test and training data.

# In[37]:


X_train, X_test, y_train, y_test = model_selection.train_test_split(
    covtypedf,      # Input features (X)
    covtype.target, # Output features (y)
    test_size=0.25, # Put aside 25% of data for testing.
    shuffle=True    # Shuffle inputs.
)

# Did we err?

print("Train data shape: ", X_train.shape)
print("Train label shape: ", y_train.shape)

print("Test data shape: ", X_test.shape)
print("Test label shape: ", y_test.shape)


# Our data is ready for processing. But remember that we have a variety of different input types: binary (0, 1), continuous in small ranges (0-255) and in large ranges (elevations). Before we process this data, we should normalize them into a standard range. We'll use a MinMaxScaler: it reduces the range of all data so they fit into the range 0 to 1.

# In[38]:


from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Figure out how to scale all the input features in the training dataset.
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

# Also tranform our validation and testing data in the same way.
X_test_scaled = scaler.transform(X_test)

# Did that work?
pd.DataFrame(X_train_scaled).describe()


# ## Classifying among multiple categories
# 
# Having multiple output units usually would result in each unit being considered independently, allowing you to assign multiple labels for a particular input (for instance, a single image might be classified as containing both a cloud as well as a bird). However, we use scikit-learn's [MLPClassifier](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification), which automatically uses *softmax* to treat labels as exclusive to each other.
# 
# ### The power of softmax
# 
# [Softmax](https://developers.google.com/machine-learning/crash-course/multi-class-neural-networks/softmax) is an additional layer added just before the output units that ensures that the sum of the outputs of all units in the output layer is 100%, in proportion to their inputs. The output of each individual unit is therefore the probability that it is the category into which the input should be categorized.
# 
# The result of this is that MLPClassifier can provide both a predicted label for a set of input features, as well as the probability that represents how certain the model is about this prediction.

# In[39]:


from sklearn.neural_network import MLPClassifier

classifier = MLPClassifier(
    activation='relu',
    solver='adam',
    alpha=0.001,
    hidden_layer_sizes=(40, 20),
    batch_size='auto',
    verbose=True,
    early_stopping=True
)
classifier.fit(X_train_scaled, y_train)


# In[40]:


# Visualize the loss curve and validation scores over iterations.
plt.plot(classifier.loss_curve_, label='Loss at iteration')
plt.plot(classifier.validation_scores_, label='Validation scores at iteration')
plt.legend(loc='best')
plt.show()


# In[44]:


classifier.score(X_test_scaled, y_test)


# # What's next?
# 
# In the next part of this course, we will discuss the landscape of machine learning methods. Artificial neural networks are a valuable part of this landscape, and -- as you can see -- very easy to set up and try, but will not always be the best solution to the problem.