#!/usr/bin/env python
# coding: utf-8

# This is a [Jupyter](http://jupyter.org) notebook.
# Lectures about Python, useful both for beginners and experts, can be found at http://scipy-lectures.github.io.
# Open the notebook by (1) copying this file into a directory, (2) in that directory typing
#     jupyter-notebook
# and (3) selecting the notebook.
#
# ***
# Written By: **Shashwat Shukla**
# ***
#
# In this exercise, we will learn about and code the Adaboost algorithm, and look at one of its applications.

# # Adaboost
# ## Motivation
# Before we dive into the details of the Adaboost algorithm, a brief overview of what Adaboost is and what it has to do with Machine Learning will prove useful to the uninitiated.
#
# First of all, what are we trying to achieve here?
# Basically, we are given a dataset. For each datapoint, we have measured and stored the value of some parameters of interest.
# We have also assigned each datapoint a **label**. That is, the dataset that we have is already classified. It could have been classified by a human, or the labels could be assigned on the basis of experimental/empirical observation.
#
# It's best explained by an example. Let's say we conducted a census in India. We choose our dataset to be a subset of all the census data that we have collected, and we manually label all the data in this subset. Let's say that our subset contained a million people. Given that there are a billion people in India, we don't want to do the rest of the labelling by hand! This is where Adaboost comes in. We will use the data we have already labelled as **training data** and feed it into the Adaboost algorithm. Adaboost 'learns' from this data. Now if you feed it the rest of the census data, which was not labelled, it will generate those labels for you.
#
# In the census example, each datapoint represents a person. The parameters of interest include height, weight, income, assets owned, marital status etc.
# Here classification/labelling can mean anything: people can be classified into being above or below the poverty line. A more complex classification could be whether or not a person is eligible for a loan of (say) Rs. 10 lakhs from a bank; more generally, whether a person is credit worthy.
#
# Note that both examples are **binary classifications**, meaning that the labels can only take one of two values, like a simple yes or no. This is important because Adaboost assumes, and only works on, binary classified data.
#
# Alright, so Adaboost stands for **Adaptive Boosting**. So what is it boosting, and why is it called adaptive?
# Let's go back to the census example and consider again the question of classifying the entire population of India as being eligible for a bank loan of Rs. 10 lakhs, i.e. credit worthiness. I shall refer to this problem as credit worthiness henceforth.
#
# Recall some of the parameters that we record for each person: height, weight, income, assets owned, marital status.
# Let's pick one parameter, say income. How would you generate a classifier for our dataset using only this parameter?
# That's easy: just pick a **threshold**, say an income of Rs 5 lakhs a year. This classifier will label any person with an income greater than 5 lakhs as credit worthy and everyone else as not credit worthy.
# We could have generated a different classifier by simply choosing a different threshold, like an income of 10 lakhs a year.
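# For concreteness, here is a minimal sketch of such a single-parameter threshold rule (often called a "decision stump"). The incomes and the threshold below are made-up numbers, used purely for illustration.

# In[ ]:


import numpy as np

incomes = np.array([3.2, 7.5, 12.0, 4.9, 9.1])  # hypothetical incomes, in lakhs per year
income_threshold = 5.0                          # the chosen threshold

# Label +1 (credit worthy) if the income exceeds the threshold, else -1
stump_labels = np.where(incomes > income_threshold, 1, -1)
print(stump_labels)  # [-1  1  1 -1  1]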
# So here's the **important point:** a classifier can be uniquely generated by choosing a parameter and a threshold for that parameter.
#
# Back in the census example, you must have realised that some parameters are more important than others: height and weight don't really have much to do with credit worthiness, while income and assets owned are definitely very important. In the real world, banks sometimes also factor in marital status (as a sign of being responsible). Now the obvious questions are: how do we quantitatively combine income, assets, marital status etc.? How much weightage should income, assets and marital status get in determining the final labels? What thresholds do we choose for these parameters?
# All of these questions can be answered in one go: ask the data! We have already classified some of the data. The magic of Adaboost is that it extracts all of this information about weights and thresholds for us.
# This is the **"boosting"** in Adaboost: it boosts the weightage of classifiers that are more relevant.
#
# We need to define another term: **weak classifiers**. A weak classifier is just a classifier that is better than pure guessing. That is, it classifies more than half of the labelled data correctly. Even if its success rate is just 51%, or even 50.01%, it still counts as a weak classifier.
# A **strong classifier** is one that has a very good success rate, like 99%.
# (The astute reader would point out that every strong classifier is also a weak classifier. That's not the point; these are just heuristic definitions used to convey obvious differences in success rates.)
#
# Now we are in a position to understand the Adaboost algorithm:
#
# 1) Adaboost is iterative: in the first iteration, it picks the most accurate weak classifier. In the census example, it would most probably correspond to the parameter of income, with some threshold value. (Note: classifiers that use income as the parameter but have different threshold values are distinct classifiers.)
#
# 2) So we have picked our first weak classifier. Now what? Well, this weak classifier will, in general, have misclassified quite a few datapoints (up to a little less than half). The classifier that we pick in the next iteration should then be better at correctly classifying this misclassified data. How do we choose such a classifier? The way it works in Adaboost is that **each datapoint is assigned a weight**. In the first iteration, all the datapoints have equal weight. Then the weight of the misclassified datapoints is incremented, and that of the correctly labelled data is decremented. The error rate of the next classifier is calculated as a weighted sum: misclassifying an already-misclassified (and hence more heavily weighted) datapoint costs it more.
#
# 3) In the second iteration, the classifier with the smallest error rate on the weighted data is chosen. Now we have chosen two weak classifiers. We again increment the weights of the datapoints that are collectively misclassified by the two weak classifiers, and decrement the weights of the ones that they correctly classify.
#
# 4) Repeat this process of reassigning weights and picking weak classifiers on the weighted data until you have made the error rate as small as you like. Here we are referring to the error rate of the final strong classifier (what we form out of all the weak classifiers that we have picked so far).
#
# So in a nutshell: Adaboost takes a lot of weak classifiers and assigns them appropriate weights to make a strong classifier.
#
# Okay.
# So far, we haven't encountered a single mathematical equation! Now that we understand *what* Adaboost does, we need to look at *how* it works.
#
# ## Simulated Example
# The best way to understand the math is to see it in action. Here we will code the Adaboost algorithm and use it to classify some data.
#
# First, we import some libraries.

# In[ ]:


get_ipython().run_line_magic('matplotlib', 'inline')
import numpy as np
import matplotlib.pyplot as plt
import time

start_time = time.time()


# Next, we initialise some variables that we will need:
# -> N: The number of samples or data-points.
# -> T: The number of iterations in our boosting algorithm.
# -> dim: The number of parameters recorded for each data-point.
#
# Unlike in the census example, we haven't actually collected any data or labelled it. So we will generate our own sample data and label it in a simple manner:
# -> x: The data. It is an N x dim matrix.
# -> label: N x 1 array that stores the known labels for each data-point. Here each label belongs to {-1, 1}.
#
# Then we plot the original data. This is the learning dataset.

# In[ ]:


T = 20
dim = 2
N = 1000
x = np.random.randn(N, 2)  # dim=2
label = np.zeros(N, dtype=np.int64)
# label = x[:, 0] < x[:, 1]  # linear separation example
label = (x[:, 0]**2 + x[:, 1]**2) < 1  # nonlinear separation example
label = label * 1.0
pos1 = np.nonzero(label == 1)
pos2 = np.where(label == 0)
label[pos2] = -1

# Plot the data
plt.figure()
plt.plot(x[pos1, 0], x[pos1, 1], 'b*')
plt.plot(x[pos2, 0], x[pos2, 1], 'r*')
plt.axis([-3, 3, -3, 3])
plt.title("Original data")


# We discussed how a classifier can be defined by selecting a parameter and a threshold for this parameter.
# The output of this classifier is +1 or -1 depending on whether the input's value for this parameter is greater than or smaller than the threshold.
#
# But we had also talked about generating an error rate for weak classifiers in every iteration, on the weighted dataset.
# The following lines of code define a function weakClassifier_error() that takes a weak classifier and a labelled, weighted dataset as input.
# *Its output is the error rate for this weak classifier on the dataset for the given weights.*
# The error rate, as mentioned before, is just a weighted sum of the errors: each datapoint contributes its weight times an error indicator, which is 0 if the datapoint is correctly classified and 1 if it is incorrectly classified.
# This simplifies to:
# **Error rate = sum of the weights of the datapoints that were classified incorrectly**
#
# But wait, you say: the arguments for this function are threshold, dimension, sign, weight and label. Where is the weak classifier? Remember how we said that a classifier is *uniquely* defined by its parameter and threshold values? So we just input these.
#
# Also, the use of 'sign' is simple: it basically flips the classifier on its head. If sign is +1, the classifier labels data as +1 if it is *greater* than the threshold. If sign is -1, it labels data as +1 if it is *smaller* than the threshold.

# In[ ]:


temp = np.zeros(N, dtype=np.int64)


# Returns the error and the calculated labels corresponding to
# threshold i, dimension j and sign k on dataset x.
# Original labels are stored in label.
def weakClassifier_error(i, j, k, x, weight, label):
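    # In other words, this is a weighted decision-stump error: the stump
    # predicts +1 where x[:, j] clears the threshold i (the comparison is
    # flipped when k == -1), y marks the datapoints it gets wrong, and the
    # error is the total weight of those misclassified datapoints.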
    temp_err = np.float64(0)
    # Initialise actual and expected labels to a perfect match
    # (0 = match, 1 = not a match)
    y = np.zeros(N, dtype=np.int64)
    if(k == 1):
        temp = (x[:, j] >= i)
    else:
        temp = (x[:, j] < i)
    temp = np.int64(temp)
    temp[np.where(temp == 0)] = -1
    y = np.int64(temp != label)
    # Calculate error of this weak classifier on the weighted dataset
    temp_err = np.sum(y * weight)
    return [temp_err, y]


# Now we can finally take a look at the actual code for the Adaboost algorithm.
#
# We define some variables that we need:
# -> h: T x 3 array that stores the weak classifiers selected after each iteration:
#       h[index][0] = threshold
#       h[index][1] = dim (data dimension)
#       h[index][2] = pos (the sign of the classifier, +1/-1)
# -> alpha: T x 1 array that stores the weight of each weak classifier chosen to make up the final classifier.
#
# The point of 'h' is obvious. However, 'alpha' needs an explanation:
# We said that the final classifier will be a combination of all the weak classifiers that we have selected over the iterations.
# But what does *combination* mean exactly?
# **In Adaboost, the final classifier is a weighted sum of the weak classifiers that we have selected.**
#
# *Now don't get confused*: we have talked about weights before. We assigned weights to each datapoint, over each iteration, so that we could select a new classifier. Those weights are stored in weight[].
# And the error rate of a weak classifier was a sum over these datapoint weights.
#
# What's beautiful about Adaboost is that it generates the *final* output in an analogous manner.
# The weights we are talking about here are assigned to the weak classifiers themselves.
# These weights are stored in alpha[].
# The final output is the weighted sum of the outputs of each weak classifier.
#
# Alright: so now we have two sets of weights, one for the datapoints and one for the weak classifiers themselves.
# How are these weights determined?
# We said that the weights corresponding to the datapoints are incremented or decremented based on whether or not they were correctly classified by the best weak classifier of that iteration.
# How is this done mathematically?
#
# First we calculate alpha[t] (refer to the code). Then we use it to reassign the weights.
# The expression for alpha[t] isn't very revealing at first glance. But it is what it is, and it works. We will take a look at it in more detail later.
#
# The observant reader may be puzzled by the fact that just a few sentences ago I mentioned that the weights for the weak classifiers are stored in alpha[].
# So is this a different alpha? *No, it's not!*
# This is an especially fascinating aspect of Adaboost: the weights for the weak classifiers and the weights for the datapoints are linked through alpha[].
#
# The last line in the following code block renormalises the weights. This is necessary as we are reassigning weights after each iteration.
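# For the mathematically inclined, here is the update written out explicitly (it is exactly what the code below computes). If $\epsilon_t$ is the weighted error of the weak classifier chosen in iteration $t$, then
#
# $$\alpha_t = \frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$$
#
# and every datapoint weight is updated and renormalised as
#
# $$w_i \leftarrow \frac{w_i\,e^{\alpha_t}}{Z_t} \text{ if datapoint } i \text{ was misclassified}, \qquad w_i \leftarrow \frac{w_i\,e^{-\alpha_t}}{Z_t} \text{ if it was classified correctly},$$
#
# where $Z_t$ is whatever constant makes the new weights sum to 1. A classifier with $\epsilon_t < 0.5$ gets a positive $\alpha_t$, and the smaller its error, the larger its say in the final output $\mathrm{sign}\left(\sum_t \alpha_t h_t(x)\right)$.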
# In[ ]:


# Actual program begins
# h and alpha together completely specify the final strong classifier
h = np.zeros([T, 3], dtype=np.float64)
alpha = np.zeros(T, dtype=np.float64)
threshold = np.arange(-3.0, 3.0, 0.1)
weight = np.ones(N, dtype=np.float64) / N  # Initialise weights
# Initially set error to infinity, to allow comparing with error of classifiers
err = np.ones(T, dtype=np.float64) * np.inf

for t in range(T):
    for i in threshold:
        for j in range(dim):
            for k in [-1, 1]:
                [tmpe, y] = weakClassifier_error(i, j, k, x, weight, label)
                if(tmpe < err[t]):  # storing the better classifier in h
                    err[t] = tmpe
                    y0 = y
                    h[t][0] = i
                    h[t][1] = j
                    h[t][2] = k
    if(err[t] > 0.5):
        # We have run out of weak classifiers! So truncate the number of
        # iterations used
        T = t
        print(t, "Error!")
        break
    alpha[t] = 0.5 * np.log((1.0 - err[t]) / err[t])
    # y0=0 corresponded to correctly labelled datapoints. To reassign weights,
    # we need -1 and not 0 at these positions
    y0[np.where(y0 == 0)] = -1
    weight = np.float64(weight * np.exp(alpha[t] * y0))  # Reassign weights
    weight = weight / np.sum(weight)  # Normalise reassigned weights


# We have finished selecting all the weak classifiers and their corresponding weights. This is all that we need to generate the final output.
# In our code, we are not feeding it a new dataset. Instead, we are inputting the training dataset itself. This is an excellent way to see if our algorithm actually works, because we can compare the original labels directly with the generated labels. (We will also sketch, a little further below, how to apply the trained classifier to genuinely new points.)
#
# Alright, so we now proceed to generate the final output. We said that the final output is the weighted sum of the outputs of all the weak classifiers. The weights, as we know, are stored in alpha[].
#
# We then plot the generated labels.
# We also plot 'misshits', the fraction of misclassified points, which shows how the accuracy of our final classifier improves as the weak classifiers are added one by one.

# In[ ]:


temp_sum = np.zeros(N, dtype=np.float64)
temp = np.zeros(N, dtype=np.float64)
final_label = np.zeros(N, dtype=np.float64)
misshits = np.zeros(T)

for t in range(T):  # Calculate final labels
    temp = h[t][2] * np.sign(x[:, int(h[t][1])] - h[t][0])
    temp_sum = np.float64(temp_sum + alpha[t] * temp)
    final_label = np.sign(temp_sum)
    misshits[t] = np.sum(np.float64(final_label != label)) / N

# Now plot the generated labels
pos1 = np.where(final_label == 1)
pos2 = np.where(final_label == -1)
plt.figure()
plt.plot(x[pos1, 0], x[pos1, 1], 'b*')
plt.plot(x[pos2, 0], x[pos2, 1], 'r*')
plt.axis([-3, 3, -3, 3])
plt.title("Generated data")
plt.show()

# Plot miss hits as more and more weak learners are used
plt.figure()
plt.plot(misshits)
plt.ylabel('Miss hits')

print("--- %s seconds ---" % (time.time() - start_time))


# That's all of it.
# We have described all the mathematical details of Adaboost through our code. But with math comes proofs, and we haven't actually proven the convergence of the Adaboost algorithm. Neither have we derived the expression for alpha[t].
#
# Proving convergence is pretty difficult. But you should know that the Adaboost algorithm is guaranteed to converge to a strong classifier with an arbitrarily good success rate on the training data, given enough iterations (and some other caveats).
#
# For some intuition into where that weird-looking expression for alpha comes from, refer to [this](http://research.microsoft.com/en-us/um/cambridge/events/aistats2003/proceedings/156.pdf) link.

# ## Application of Adaboost: Digit recognition
# We have learnt how Adaboost works and have implemented the code for it as well.
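# Before we move on, it is worth seeing that h and alpha really are all we need to label points the model has never seen. Below is a minimal sketch that applies the strong classifier from the toy example above to a handful of fresh points; the new points are drawn at random purely for illustration.

# In[ ]:


x_new = np.random.randn(5, 2)  # five fresh, unlabelled points (illustrative)
score = np.zeros(len(x_new))
for t in range(T):
    # Weighted vote of the selected weak classifiers, exactly as above
    score += alpha[t] * h[t][2] * np.sign(x_new[:, int(h[t][1])] - h[t][0])
new_label = np.sign(score)  # +1 should correspond to points inside the unit circle
print(new_label)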
#
# With a few modifications, our code can be tailored into a full-blown application of Adaboost to a real and relevant problem: digit recognition. The problem is simple to state: create a program that takes an image of a digit from 0-9 as input and outputs the digit in the picture.
# So how can we use Adaboost to solve this problem?
#
# First, some specifics: the 'data' that we will use comprises images that are 13 pixels wide and 13 pixels tall, i.e. **13x13**. The training set will have **1000 images**, as will the testing/validation set. Each image is stored as a 13x13 matrix, and each element of this matrix is a number from 0 to 1. Can you guess what the numbers correspond to? They represent the intensity of light in that particular pixel: a value of 0 corresponds to a black pixel, a value of 1 to a white pixel, and any value in between to a shade of grey. This is the **greyscale representation of an image**.
#
# **The data files can be found [here](https://github.com/riddhishb/ipython-notebooks/tree/master/Adaboost).**
#
# The next step is to define our weak classifiers. How do we do that?
# Well, in our implementation, we will make an obvious choice: each image comprises 13x13 = 169 pixels, and we have stored a greyscale value for each pixel.
# So the parameter (dimension) for a weak classifier will simply be the index of the pixel that we are looking at. The weak classifier's threshold will be a greyscale value. As in our previous example, there will also be a sign associated with the classifier.
#
# So let's say that we are looking at the pixel with index (2,3), have chosen a threshold greyscale value of 0.625, and the sign is +1. Now give an image as input to this weak classifier, and let this image have a value of 0.433 at index (2,3). Since 0.433 is below the threshold and the sign is +1, the output of our classifier will be -1.
#
# Comparing with our previous code, we conclude that here
# **N = 1000** (as we have 1000 datapoints, i.e. images)
# **dim = 169** (each image has 169 pixels, and each one of them is a dimension of the data)
#
# We now come to a practical consideration: how are the images stored? This might seem weird at first, but the 1000 images are stored in a **13x13000 array**. Why 13x13000? Well, think of the images being **placed side-by-side**. The length of all the images put together is simply the sum of their lengths, hence 13x1000 = 13000, while the breadth is simply 13.
# It is imperative that you understand how the images are stored before you proceed further.
#
# We previously said that dim = 169. So how will you pick the pixel in an image corresponding to, say, dimension 155?
# As you might have guessed, you will look at the **row-major** form of the image.
# Hence, the row corresponding to 155 will be 155 // 13 (which is 11) and the column will be 155 % 13 (which is 12).
#
# Now that we have taken a look at all the specific details, we can see Adaboost in action with the corresponding code!

# In[ ]:


"""
All matrices used are implemented via numpy. The following variables are used:
-> N: The number of samples or data-points.
-> T: The number of iterations in our boosting algorithm.
-> dim: The number of parameters recorded for each data-point.
        (for an image we can choose RGB intensities as features and then dim=3)
-> x: The data. It is an N x dim matrix.
-> label: N x 1 array that stores the known labels for each data-point.
-> final_label: Nx1 array that stores the labels generated for each
                data-point by the final strong classifier.
-> weight: Nx1 array that stores the weight for each data-point.
-> h: Tx3 array that stores the weak classifiers selected after each iteration:
      h[index][0] = threshold
      h[index][1] = dim (data dimension)
      h[index][2] = pos (the sign of the classifier, +1/-1)
-> alpha: T x 1 array that stores the weight of each weak classifier chosen to
          make up the final classifier.
-> final_alpha: Stores the weights for all the digits.
-> final_h: Stores the classifiers for all the digits.
"""
import numpy as np
import matplotlib.pyplot as plt
import time
import csv

start_time = time.time()

with open('images_training.txt', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    x = list(reader)
x = np.array(x, dtype=np.float64)
# This array is of size 13x13000: all 1000 13x13 images, 100 for each
# digit from 0 to 9
T = 20
dim = 169
N = 1000


# As discussed, this time around dim = 169.
# We have imported the csv module to read the image file.

# In[ ]:


temp = np.zeros(N, dtype=np.int64)


# Returns the error and the calculated labels corresponding to
# threshold i, dimension j and sign k on dataset x.
# Original labels are stored in label.
def weakClassifier_error(i, j, k, x, weight, label):
    j_row = j // 13
    j_col = j % 13
    temp_err = np.float64(0)
    # Initialise actual and expected labels to a perfect match
    # (0 = match, 1 = not a match)
    y = np.zeros(N, dtype=np.int64)
    if(k == 1):
        temp = (x[j_row, j_col:13000:13] >= i)
    else:
        temp = (x[j_row, j_col:13000:13] < i)
    temp = np.int64(temp)
    temp[np.where(temp == 0)] = -1
    y = np.int64(temp != label)
    # Calculate error of this weak classifier on the weighted dataset
    temp_err = np.sum(y * weight)
    return [temp_err, y]


# Notice how the definition of weakClassifier_error hasn't changed much; only the handling of the data dimension has become slightly more involved.
#
# To remind you, we previously said that dim = 169. So how will you pick the pixel in an image corresponding to, say, dimension 155? We look at the row-major form of the image: the row corresponding to 155 is 155 // 13 and the column is 155 % 13.
# Here, j_row and j_col do just that.
# Also, **x[j_row, j_col:13000:13]** returns an array containing the pixel value at position (j_row, j_col) for every image. This array contains 1000 elements, corresponding to our 1000 training images.

# In[ ]:


# Actual program begins
threshold = np.arange(0, 1.0, 0.05)
# h and alpha together completely specify the final strong classifier
final_alpha = np.zeros((10, T), dtype=np.float64)
final_h = np.zeros((10, T, 3), dtype=np.float64)

for p in range(10):
    h = np.zeros([T, 3], dtype=np.float64)
    alpha = np.zeros(T, dtype=np.float64)
    temp = np.zeros(N, dtype=np.int64)
    label = np.zeros(N, dtype=np.int64)
    label = label * 1.0
    label[p * 100: p * 100 + 100] = 1
    label[np.where(label == 0)] = -1
    weight = np.ones(N, dtype=np.float64) / N  # Initialise weights
    # Initially set error to infinity, to allow comparing with error of
    # classifiers
    err = np.ones(T, dtype=np.float64) * np.inf
    for t in range(T):
        for i in threshold:
            for j in range(dim):
                for k in [-1, 1]:
                    [tmpe, y] = weakClassifier_error(i, j, k, x, weight, label)
                    if(tmpe < err[t]):  # storing the better classifier in h
                        err[t] = tmpe
                        y0 = y
                        h[t][0] = i
                        h[t][1] = j
                        h[t][2] = k
        if(err[t] > 0.5):
            # We have run out of weak classifiers! So truncate the number of
            # iterations used
            T = t
            print(t, "Error!")
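            # Note: truncating T here also shortens training for any remaining
            # digits and the evaluation loop further below, since both run
            # over range(T).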
            break
        alpha[t] = 0.5 * np.log((1.0 - err[t]) / err[t])
        # y0=0 corresponded to correctly labelled datapoints. To reassign
        # weights, we need -1 and not 0 at these positions
        y0[np.where(y0 == 0)] = -1
        weight = np.float64(weight * np.exp(alpha[t] * y0))  # Reassign weights
        weight = weight / np.sum(weight)  # Normalise reassigned weights
    final_alpha[p] = alpha
    final_h[p] = h


# Compare this code block with the code for our toy example. You will notice that the only difference is the additional for p in range(10) loop.
#
# Basically, because we now have to classify digits, we build a **separate strong classifier for each digit from 0 to 9**. That is why we run our Adaboost algorithm 10 times.
# In each of these iterations, we store the strong classifier in the arrays final_alpha and final_h.

# In[ ]:


with open('images_training.txt', 'r') as csvfile:
    reader = csv.reader(csvfile, delimiter=',', quotechar='|')
    x = list(reader)
x = np.array(x, dtype=np.float64)

temp_sum = np.zeros((10, N), dtype=np.float64)
temp = np.zeros(N, dtype=np.float64)
final_label = np.zeros((10, N), dtype=np.float64)
misshits = np.zeros(T)

for p in range(10):
    label[100 * p: 100 * p + 100] = p
label = np.int64(label)
all_label = np.full(N, -1, dtype=np.int64)

for t in range(T):  # Calculate final labels
    for p in range(10):
        row = int(final_h[p][t][1]) // 13
        col = int(final_h[p][t][1]) % 13
        temp = final_h[p][t][2] * \
            np.sign(x[row, col: 13000: 13] - final_h[p][t][0])
        temp_sum[p] = np.float64(temp_sum[p] + final_alpha[p][t] * temp)
        final_label[p] = np.sign(temp_sum[p])
    for p in range(10):
        all_label[np.where(final_label[p] == 1)] = p
    misshits[t] = np.sum(np.float64(all_label != label)) / N

plt.figure()
plt.plot(misshits)
plt.ylabel('Miss hits')
plt.show()

print("--- %s seconds ---" % (time.time() - start_time))


# So we finally determine the labels for the full image set and measure the miss rate. (Note that the code above re-reads images_training.txt, so it is evaluating on the training images; to evaluate on held-out data, read in the corresponding test file from the data link above instead.)
# Note that now the label generated for each image is not a simple number in {-1, 1}. Rather, it is a vector with 10 components corresponding to the ten digits from 0 to 9, and each component can take the value -1 or +1.
#
# We import the images again with the help of the csv module. Then we calculate the labels and also the misshits.
#
# That's the end of that!
# Hopefully, the Adaboost algorithm is now clear to you and is something that you can apply to other problems out there.
# If you would like to look at the mathematical details of **why** the Adaboost algorithm works, then I would definitely recommend [this](http://research.microsoft.com/en-us/um/cambridge/events/aistats2003/proceedings/156.pdf) link.
#
# **Thank you!**