$\renewcommand{\vec}[1]{\boldsymbol{#1}}$

- Given an input vector $\vec{x}$ determine $p\left(\left. C_k\right|\vec{x}\right)$ where $k\in \{1\ldots K\}$ and $C_k$ is a discrete class label.
- We will discuss **linear models** for classification.

**Note:** the algorithms described here can also be applied to transformed input, $\phi(\vec{x})$, to determine $p\left(C_k\right|\left.\phi(\vec{x})\right)$, where $\phi(\cdot)$ may be a nonlinear transformation of the input space.

Our **first model assumption** is that our target output can be modeled as a generalized linear model,

$$\vec{y}(\phi(\vec{x})) = f\left(\vec{w}^\intercal\phi(\vec{x})\right)\,,$$

where $\vec{y}$ will be a vector of probabilities with $K$ elements.

- Elements of $\vec{y}$ are $y_k = p\left(C_k\right|\left.\phi\left(\vec{x}\right)\right)$ i.e. the probability that the correct class label is $C_k$ given the input vector $\vec{x}$.
- Often we will have simply that $\vec{w}^\intercal\phi(\vec{x}) = \vec{w}^\intercal\vec{x}+w_0$ in which case we will simply omit $\phi()$ from the notation.
- The function $f()$ is known as the **activation function** and its inverse is known as the **link function**.
- Note that $f()$ will often be nonlinear.

The target variable, $y$, does **not provide a decision** for a class assignment for a given input $\vec{x}$.

- In real-world cases where it is necessary to make a decision as to which class $\vec{x}$ should be assigned,
- one must apply an additional modeling step based on *decision theory*.

There are a variety of decision models, all of which leverage the class posterior probability models, $p\left(C_k\right|\left.\vec{x}\right)$, such as

- Minimizing the misclassification rate - this effectively corresponds to choosing the class with the highest probability for a given $\vec{x}$.
- Minimizing the expected loss - minimizes the expected value of a given loss function, $\ell(\cdot)$, under the class posterior probability distribution.
- The reject option, i.e. declining to make a class assignment when the largest posterior probability falls below a chosen threshold (a small numerical sketch of the first two criteria follows below).
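The sketch below uses a hypothetical posterior vector $p\left(C_k\right|\left.\vec{x}\right)$ and a hypothetical loss matrix (neither comes from these notes); it illustrates that the highest-posterior rule and the minimum-expected-loss rule can pick different classes when the losses are asymmetric.

```
import numpy as np

# Hypothetical posterior probabilities p(C_k|x) for one input x (K = 3 classes).
posterior = np.array([0.55, 0.35, 0.10])

# 1. Minimize the misclassification rate: choose the class with the largest posterior.
k_map = np.argmax(posterior)

# 2. Minimize the expected loss: L[j, k] is a (hypothetical) cost of deciding class k
#    when the true class is j; choose the decision with the smallest expected loss.
L = np.array([[ 0.0,  1.0, 0.1],
              [ 1.0,  0.0, 0.1],
              [10.0, 10.0, 0.0]])
expected_loss = posterior @ L   # expected_loss[k] = sum_j p(C_j|x) * L[j, k]

print('Highest-posterior decision:', k_map)                         # -> 0
print('Minimum-expected-loss decision:', np.argmin(expected_loss))  # -> 2
```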

The models discussed today are called *linear* because,

- when the decision criterion is that of minimizing the misclassification rate,
- they divide the input space into $K$ regions,
- where the boundaries between regions are linear functions of the input vector $\vec{x}$.
- The decision boundaries will correspond to where $\vec{w}^\intercal\vec{x}=\text{constant}$, and thus represent a linear function of $\vec{x}$.
- In the case of transformed input, $\phi(\vec{x})$, the decision boundaries will correspond to where $\vec{w}^\intercal\phi(\vec{x})=\text{constant}$, and thus represent a linear function of $\phi(\vec{x})$.
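To see this explicitly, suppose the activations are linear in the input, $a_k(\vec{x}) = \vec{w}_k^\intercal\vec{x} + w_{k0}$ (the Gaussian model derived later in this section has exactly this form). The boundary between classes $C_j$ and $C_k$ lies where the two posteriors, and hence the two activations, are equal,

$$\left(\vec{w}_j - \vec{w}_k\right)^\intercal\vec{x} + \left(w_{j0} - w_{k0}\right) = 0\,,$$

which is a hyperplane, i.e. a linear function of $\vec{x}$.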

Our first step is to form a model of the class posterior probabilities for the *inference stage:*

This can be done by modeling:

- the *class-conditional densities*, $p\left(\vec{x}\right|\left.C_k\right)$,
- the class priors, $p(C_k)$, and
- applying Bayes' theorem to obtain the posterior model, or

*Note:* It can also be done directly, as in *probabilistic discriminative modeling* (forward pointer).

Bayes' theorem describes the probability of an event, based on conditions that might be related to the event.

$$ P\left(A\right|\left.B\right) = \frac{P\left(B\right|\left.A\right) \, P(A)}{P(B)}\,,$$

where $A$ and $B$ are events.

- $P(A)$ and $P(B)$ are the probabilities of $A$ and $B$ without regard to each other.
- $P\left( A \right|\left.B\right)$, a *conditional probability*, is the probability of observing event $A$ given that $B$ is true.
- $P\left(B\right|\left.A\right)$ is the probability of observing event $B$ given that $A$ is true.

- We will see that our class posterior probability model will depend only on the input $\vec{x}$...
- ...or a fixed transformation using basis functions $\phi()$, and
- class labels, $C_k$.
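Applying Bayes' theorem with these quantities gives the posterior model we are after,

$$p\left(C_k\right|\left.\vec{x}\right) = \frac{p\left(\vec{x}\right|\left.C_k\right)p(C_k)}{\sum_j p\left(\vec{x}\right|\left.C_j\right)p(C_j)}\,,$$

which is rewritten below in softmax (general $K$) and logistic sigmoid ($K=2$) form.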

In this case, for $K$ classes, our activation function will take the form of the *softmax function*,

$$p\left(C_k\right|\left.\vec{x}\right) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}\,,$$

where

$$a_k = \ln\left(p\left(\vec{x}\right|\left.C_k\right)p(C_k)\right).$$

In the **case of $K=2$**, the activation function will reduce to the *logistic sigmoid* function

$$p\left(C_1\right|\left.\vec{x}\right) = \sigma(a) = \frac{1}{1+\exp(-a)}\,,$$

where

$$a = \ln \frac {p\left(\vec{x}\right|\left. C_1\right)p(C_1)} {p\left(\vec{x}\right|\left. C_2\right)p\left(C_2\right)}\,.$$

These models are **known as generative models** because they can be used to generate synthetic input data by applying inverse transform sampling to the marginal distribution for $\vec{x}$:

$$p(\vec{x}) = \sum_{k=1}^{K} p\left(\vec{x}\right|\left.C_k\right)p(C_k)\,.$$

To move forward, it is necessary to start making **model assumptions**.
Here we will *assume* that:

- we have continuous inputs, i.e. $x\in \mathbb{R}^n$ (see Bishop p.202 or Andrew Ng's Lecture 5 pdf for discrete input using Naïve Bayes & Laplace), and
- that the *class-conditional densities*, $p(\vec{x}|C_k)$, are modeled by a Gaussian distribution.

Under the Gaussian assumption, the class-conditional density for class $C_k$ is

$$p(\vec{x}|C_k) = \frac{1}{\left(2 \pi \right)^{n/2}} \frac{1}{\left|\vec{\Sigma}\right|^{1/2}} \exp \left( -\frac{1}{2} \left(\vec{x} - \vec{\mu}_k\right)^\intercal \vec{\Sigma}^{-1} \left(\vec{x} - \vec{\mu}_k\right) \right)\,,$$

where:

- $n$ is the dimension of the input vector $\vec{x}$,
- $\vec{\Sigma}$ is the *covariance matrix*, and
- $\vec{\mu}_k$ is the mean vector for class $C_k$.

*Note:* here we have *assumed* that all classes share the same covariance matrix.

A logistic function or logistic curve is a common "S" shape (sigmoid curve), defined as

$$\sigma(x)={\frac {L}{1+\exp(-k(x-x_{0}))}}\,,$$

where:

- $x_0$: the x-value of the sigmoid's midpoint,
- $L$: the curve's maximum value, and
- $k$: the steepness of the curve.

With $L=1$, $k=1$, and $x_0=0$ this is the standard logistic sigmoid, $\sigma(a) = 1/(1+\exp(-a))$, used throughout this section.

In the case of **two classes**, this result is substituted into the logistic sigmoid function and reduces to

$$p\left(C_1\right|\left.\vec{x}\right) = \sigma\left(\vec{w}^\intercal\vec{x} + w_0\right)\,,$$

where we have defined:

$$\vec{w} = \vec{\Sigma}^{-1} \left( \vec{\mu}_1 - \vec{\mu}_2 \right),$$

$$w_0 = -\frac{1}{2} \vec{\mu}_1^\intercal \vec{\Sigma}^{-1} \vec{\mu}_1 + \frac{1}{2} \vec{\mu}_2^\intercal \vec{\Sigma}^{-1} \vec{\mu}_2 + \ln \frac{p(C_1)} {p(C_2)}$$

- The class prior probabilities, $p(C_k)$, effectively act as a bias term.
- Note that we have yet to specify a model for these distributions.

If we are to use the result above, we will need to make **another model assumption**.

- We will *assume* that the *class priors* are modeled by a Bernoulli distribution with $p(C_1)=\gamma$ and $p(C_2)=1-\gamma$.
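As a quick numerical check of these formulas, here is a minimal 1-D sketch comparing $\sigma(w x + w_0)$ against the posterior computed directly from Bayes' theorem; the two should agree. The parameter values are hypothetical (they happen to match the "truth" values used later in this notebook), and `sigma2` denotes the shared variance.

```
import numpy as np

# Hypothetical 1-D model parameters (sigma2 is the shared variance).
mu1, mu2, sigma2, gamma = -2.0, 2.0, 2.0, 0.5

def gaussian(x, mu, sigma2):
    'Gaussian density N(x | mu, sigma2)'
    return np.exp(-0.5 * (x - mu)**2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

# 1-D specialization of the w and w_0 formulas above.
w  = (mu1 - mu2) / sigma2
w0 = -0.5 * mu1**2 / sigma2 + 0.5 * mu2**2 / sigma2 + np.log(gamma / (1.0 - gamma))

x = 0.7
post_sigmoid = 1.0 / (1.0 + np.exp(-(w * x + w0)))      # sigma(w x + w0)
post_bayes   = gaussian(x, mu1, sigma2) * gamma / (
               gaussian(x, mu1, sigma2) * gamma + gaussian(x, mu2, sigma2) * (1.0 - gamma))

print(post_sigmoid, post_bayes)   # the two values should match
```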

These results can be extended to the case of $K>2$ classes for which we obtain

$$a_k(\vec{x}) = \left[\vec{\Sigma}^{-1} \vec{\mu}_k \right]^\intercal \vec{x} - \frac{1}{2} \vec{\mu}_k^\intercal \vec{\Sigma}^{-1} \vec{\mu}_k + \ln p(C_k)\,,$$

which is used in the softmax activation function provided at the beginning of this section.
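For illustration, here is a minimal sketch of evaluating these activations and passing them through the softmax for $K=3$ classes in two dimensions; the means, shared covariance, and priors are hypothetical values chosen only for this example.

```
import numpy as np

# Hypothetical parameters for K = 3 classes in 2-D, sharing one covariance matrix.
mus    = [np.array([-2.0, 0.0]), np.array([2.0, 0.0]), np.array([0.0, 2.0])]
Sigma  = np.array([[2.0, 0.0], [0.0, 2.0]])
priors = np.array([0.4, 0.4, 0.2])

Sigma_inv = np.linalg.inv(Sigma)

def activations(x):
    'a_k(x) = [Sigma^{-1} mu_k]^T x - 0.5 mu_k^T Sigma^{-1} mu_k + ln p(C_k)'
    return np.array([(Sigma_inv @ mu) @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(p)
                     for mu, p in zip(mus, priors)])

def softmax(a):
    'Numerically stable softmax'
    e = np.exp(a - np.max(a))
    return e / e.sum()

x = np.array([0.5, 1.0])
print(softmax(activations(x)))   # posterior probabilities p(C_k|x); they sum to 1
```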

We have not formulated a complete model for the posterior class densities, in that we have not yet solved for the model parameters, $\gamma$, $\vec{\mu}_1$, $\vec{\mu}_2$, and $\vec{\Sigma}$.

We do that now using a **maximum likelihood** approach.

The *likelihood* is:

- A **function of the parameters** of a statistical model **given** some data.

The likelihood of a set of parameter values, $\theta$, given outcomes $x$, is the probability of those observed outcomes given those parameter values,

$$\mathcal{L}(\theta |x)=P(x|\theta)\,.$$

- Frequently, the natural logarithm of the likelihood function, called the **log-likelihood**, is more convenient to work with.

Considering the case of two classes, $C_1$ and $C_2$, with

- Bernoulli prior distributions, $p(C_1)=\gamma$ and $p(C_2)=1-\gamma$, and
- Gaussian *class-conditional density* distributions, $p(\vec{x}|C_k)$,
- assume we have a training data set, $\Psi$, with $m$ elements of the form

$$\Psi = \left\{\vec{x}_i, t_i\right\}, \quad i = 1, \ldots, m\,,$$

where $t_i=1$ indicates that $\vec{x}_i$ is in class $C_1$ and $t_i=0$ indicates that $\vec{x}_i$ is in class $C_2$.

The likelihood function is then given by

$$p\left(\vec{t}, \vec{X} \mid \gamma, \vec{\mu}_1, \vec{\mu}_2, \vec{\Sigma}\right) = \prod_{i=1}^m \left[\gamma \, \mathcal{N} \left(\vec{x}_i \mid \vec{\mu}_1, \vec{\Sigma}\right)\right]^{t_i} \left[\left(1-\gamma\right) \mathcal{N}\left(\vec{x}_i \mid \vec{\mu}_2, \vec{\Sigma}\right)\right]^{1-t_i}\,.$$

Taking the derivative of this expression with respect to the various model parameters, $\gamma$, $\vec{\mu}_1$, $\vec{\mu}_2$, and $\vec{\Sigma}$, and setting it equal to zero, we obtain

$$\gamma = \frac{N_1}{N_1+N_2}\,,$$

where $N_1$ is the number of training inputs in class $C_1$ and $N_2$ is the number in class $C_2$.
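For completeness, the maximum-likelihood estimates of the remaining parameters (standard results for this model, stated here because they are exactly what the code below computes as `e_mu1`, `e_mu2`, and `e_sigma`) are

$$\vec{\mu}_1 = \frac{1}{N_1}\sum_{i=1}^m t_i\,\vec{x}_i\,,\qquad \vec{\mu}_2 = \frac{1}{N_2}\sum_{i=1}^m (1-t_i)\,\vec{x}_i\,,$$

$$\vec{\Sigma} = \frac{1}{m}\sum_{i=1}^m \left(\vec{x}_i - \vec{\mu}_{c_i}\right)\left(\vec{x}_i - \vec{\mu}_{c_i}\right)^\intercal\,,$$

where $\vec{\mu}_{c_i}$ denotes the estimated mean of the class to which $\vec{x}_i$ belongs.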

- We will choose some *truth* values for our parameters and use our generative model to generate synthetic data.
- We can then use that data as input to the maximum likelihood solution to see the estimates of the truth parameters.
- Let's use 1-D input for simplicity.
- We will be using a basic form of inverse transform sampling.
- Specifically, we wish for our training input data, $\vec{x}$, to be derived from a distribution modeled by the marginal distribution $p(\vec{x})$.

- To obtain this, we first formulate the cumulative distribution function, $CDF(\vec{X}) = \int_{-\infty}^{\vec{X}} p(\vec{x})d\vec{x}$ (note that the range of the CDF is $[0,1]$).
- To obtain appropriately distributed $\vec{x}$ values, we choose some value from a uniform distribution on $[0,1]$, say $y$, and find the value $\vec{X}$ such that $CDF(\vec{X})=y$.
- This value of $\vec{X}$ is our input $\vec{x}$.
- This approach requires us to find the inverse of the CDF, which we will do numerically.
- Once we have obtained a value for $\vec{x}$ we need to assign it to a particular class.

We will *assume* that the correct class is that for which the posterior probability $p(\left.C_k\right|\vec{x})$ is greatest, **unless** the difference between the two posterior probabilities is less than some minimum value, in which case we will choose randomly - this will add some "noise" to our input training data.

Select truth data values; these will *NOT* be known to the training algorithm and are only used in generating the sample data.

In [4]:

```
true_mu1 = -2.0
true_mu2 = 2.0
true_sigma = 2.0   # note: used as the (shared) variance throughout
true_gamma = 0.5
```

In [5]:

```
import math                       # assumed imported in an earlier cell; repeated here for completeness
import numpy as np
import matplotlib.pyplot as plt
import scipy.integrate as sci_intgr
import scipy.optimize as sci_opt
```

Defining probability functions

In [6]:

```
def p_xCk(x, mu, sigma):
    'Class conditional probability p(x|Ck); sigma is the variance'
    denom = math.sqrt(2.0 * math.pi * sigma)
    arg = -0.5 * (x - mu) * (x - mu) / sigma
    return math.exp(arg) / denom
```

In [7]:

```
def p_x(x, mu1, mu2, sigma, gamma):
    'Marginal probability p(x)'
    return gamma * p_xCk(x, mu1, sigma) + (1.0 - gamma) * p_xCk(x, mu2, sigma)
```

In [8]:

```
def p_Ckx(x, mu1, mu2, sigma, gamma):
    'Posterior class probability vector (p(C_1|x), p(C_2|x))'
    a = math.log(p_xCk(x, mu1, sigma) * gamma / (p_xCk(x, mu2, sigma) * (1 - gamma)))
    pc1 = 1.0 / (1.0 + math.exp(-a))
    return (pc1, 1.0 - pc1)
```

In [9]:

```
def cdf(x, mu1, mu2, sigma, gamma):
    'Cumulative distribution function P(x<X); quad returns the (value, abserr) tuple'
    return sci_intgr.quad(func=p_x, a=-np.inf, b=x, args=(mu1, mu2, sigma, gamma))
```

In [10]:

```
def inv_cdf(y, mu1, mu2, sigma, gamma):
    'Inverse of the CDF'
    f = lambda x: cdf(x, mu1, mu2, sigma, gamma)[0] - y
    return sci_opt.newton(f, 0)
```

Class conditional probabilities

In [11]:

```
domain = np.linspace(-5, 5, 100)
```

In [12]:

```
px_class1 = [p_xCk(x, true_mu1, true_sigma) for x in domain]
px_class2 = [p_xCk(x, true_mu2, true_sigma) for x in domain]
```

In [13]:

```
fig = plt.figure(figsize=(5,4))
plt.plot(domain, px_class1, label='Class 1')
plt.plot(domain, px_class2, label='Class 2')
plt.xlabel('$x$'); plt.ylabel('$p(x|C_k)$');plt.legend(bbox_to_anchor=(1.37,1), fancybox=True);
plt.title('Class conditional probability - $p(x|C_k)$');
```

Marginal distribution of $x$

In [14]:

```
px = [p_x(x, true_mu1, true_mu2, true_sigma, true_gamma) for x in domain]
```

In [15]:

```
fig = plt.figure(figsize=(5,4))
plt.plot(domain,px)
plt.xlabel('$x$');plt.ylabel('$p(x)$');plt.title('Marginal distribution - $p(x)$');
```

Posterior distribution

In [16]:

```
pc1x, pc2x = [], []
for x in domain:
    pck = p_Ckx(x, true_mu1, true_mu2, true_sigma, true_gamma)
    pc1x.append(pck[0]); pc2x.append(pck[1])
```

In [17]:

```
fig = plt.figure(figsize=(5,4))
plt.plot(domain, pc1x, label='Class 1')
plt.plot(domain, pc2x, label='Class 2')
plt.xlabel('$x$');plt.ylabel('$p(C_k|x)$');
plt.title('Posterior distributions - $p(C_1|x)$ and $p(C_2|x)$');
```

Our model is ready; let's generate some data and test it.

In [18]:

```
np.random.seed(123456789) # do not do this in real life
num_samples = 1000
x = np.zeros(num_samples)
t = np.zeros(num_samples)
pcx = np.zeros(num_samples)
n1 = 0
nae = 0
assignment_epsilon = 0.5
```

Assigning the target `t` to `1` for **class 1** and `0` for **class 2**.

In [19]:

```
for i in range(num_samples):
    rv = np.random.uniform()
    x[i] = inv_cdf(rv, true_mu1, true_mu2, true_sigma, true_gamma)
    pcx1 = p_Ckx(x[i], true_mu1, true_mu2, true_sigma, true_gamma)
    pcx2 = pcx1[1]
    pcx1 = pcx1[0]
    # we don't want a perfect dividing line for our domain,
    # otherwise why would we need a learning algorithm?
    if math.fabs(pcx2 - pcx1) <= assignment_epsilon:
        nae = nae + 1
        if np.random.uniform() <= 0.5:
            t[i] = 1
            n1 = n1 + 1
        else:
            t[i] = 0
    elif pcx1 > pcx2:
        t[i] = 1
        n1 = n1 + 1
    else:
        t[i] = 0
```

In [20]:

```
fig = plt.figure(figsize=(5,4))
plt.scatter(x, t, marker='.', alpha=0.1)
plt.axvline(0, linestyle='--')
plt.xlabel('$x$');plt.ylabel('Class ($t$)'); plt.title('Simulated training data');
```

In [21]:

```
n2 = num_samples - n1
e_gamma = n1/num_samples
```

In [22]:

```
e_mu1 = 0
for x_i, t_i in zip(x, t):
    if t_i == 1:
        e_mu1 += x_i
    # e_mu1 += x_i*t_i
e_mu1 /= n1
```

In [23]:

```
e_mu2 = 0
for x_i, t_i in zip(x, t):
    if t_i == 0:
        e_mu2 += x_i
    # e_mu2 += x_i*(1-t_i)
e_mu2 /= n2
```

In [24]:

```
def Sn(Z, ck, mu=(e_mu1, e_mu2)):
    'Per-sample contribution (x - mu_k)^2 to the shared variance estimate'
    idx = 1 - int(ck)    # ck == 1 -> class 1 mean, ck == 0 -> class 2 mean
    return (Z[0] - mu[idx]) * (Z[0] - mu[idx])
```

In [25]:

```
e_sigma = 0
for x_i, t_i in zip(x, t):
    e_sigma += Sn((x_i, t_i), t_i)
e_sigma /= float(num_samples)
```

In [26]:

```
print('The number of inputs in C1 is {0}'.format(n1))
print('The number of inputs that were assigned based on the assignment factor is {0}'.format(nae))
print('True model parameters: gamma={0}, µ1={1}, µ2={2}, ∑={3}'.format(true_gamma,
true_mu1, true_mu2,
true_sigma))
print('Estimated model parameters: gamma={0}, µ1={1}, µ2={2}, ∑={3}'.format(e_gamma, e_mu1,
e_mu2, e_sigma))
```

Probabilistic discriminative models assume the same generalized linear model,

$$y(\phi(\vec{x})) = f(\vec{w}^\intercal\phi(\vec{x}))\,,$$

as in the probabilistic generative models.

- Instead of formulating models for the *class-conditional* densities, $p\left(\phi(\vec{x})\right|\left.C_k\right)$, and the *class priors*, $p(C_k)$,
- the discriminative approach explicitly models the *class posterior* probabilities, $p\left(C_k\right|\left.\phi(\vec{x})\right)$, with model parameters $\vec{w}$.
- As in the probabilistic generative approach, maximum likelihood is used to estimate the model parameters given some training data set, $\Psi$.

The **key difference** is the form of the likelihood function.

- In the **probabilistic generative case**, the likelihood function is a function of the joint probability, $p\left(\phi(\vec{x}),C_k\right)=p\left(\phi(\vec{x})\right|\left.C_k\right)p(C_k)$.
- In the **probabilistic discriminative approach**, the likelihood function is a function of the conditional class posteriors, $p\left(C_k\right|\left.\phi(\vec{x})\right)$, only.

- Logistic regression is one specific example of discriminative modeling, for the case of **two classes**.
- It assumes a model for the class posterior probabilities, $p(C_k|\phi(\vec{x}))$, in the form of the logistic sigmoid

$$p\left(C_1\right|\left.\phi(\vec{x})\right) = y(\phi(\vec{x})) = \sigma\left(\vec{w}^\intercal \phi(\vec{x})\right)\,,$$

with

$$p\left(C_2\right|\left.\phi(\vec{x})\right) = 1 - p\left(C_1\right|\left.\phi(\vec{x})\right).$$

- We apply maximum likelihood to obtain the model parameters.
- Assume that our training set is of the form $\Psi=\left\{\phi(\vec{x}_i),t_i\right\}$ where $t_i \in \{0,1\}$, and $i=1,\ldots, m$.
- The likelihood function of the training data is then

$$p(\Psi|\vec{w}) = \prod_{i=1}^m \sigma(\vec{w}^\intercal \phi(\vec{x}_i))^{t_i} \left(1 - \sigma(\vec{w}^\intercal \phi(\vec{x}_i))\right)^{(1-t_i)}\,.$$

Defining the error function, $E(\vec{w})$, as the negative of the log-likelihood function,

$$E(\vec{w}) = -\ln p(\Psi|\vec{w}) = -\sum_{i=1}^m \left[ t_i \ln y_i + (1-t_i)\ln(1-y_i)\right]\,, \qquad y_i = \sigma(\vec{w}^\intercal \phi(\vec{x}_i))\,,$$

and taking the gradient with respect to $\vec{w}$, we obtain

$$\nabla E(\vec{w}) = \sum_{i=1}^m \left[\sigma(\vec{w}^\intercal \phi(\vec{x}_i)) - t_i\right] \phi(\vec{x}_i).$$

- This error function looks the same as that obtained for **linear** regression under the assumption of a Gaussian noise model, which had a closed-form solution.
- The nonlinearity of the *sigmoid* function, $\sigma\left(\vec{w}^\intercal \phi(\vec{x}_i)\right)$, prevents a closed-form solution in the **logistic** regression problem.
- We must apply an iterative method to obtain a numerical solution for the parameters, $\vec{w}$; a simple gradient-descent sketch is given below, before the Newton-Raphson approach used in the rest of this section.
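The following is a minimal sketch of the simplest such iterative method, batch gradient descent on $E(\vec{w})$ using the gradient above; the toy data, learning rate, and iteration count are arbitrary illustrative choices and are not part of the Newton-Raphson solution developed below.

```
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy design matrix (first column of ones for w_0) and targets; values are arbitrary.
Phi_toy = np.array([[1.0, 0.2], [1.0, 1.4], [1.0, 0.9], [1.0, 2.3]])
t_toy   = np.array([0.0, 0.0, 1.0, 1.0])

w = np.zeros(Phi_toy.shape[1])
eta = 0.1                                 # learning rate (arbitrary)
for _ in range(2000):
    y = sigmoid(Phi_toy @ w)              # y_i = sigma(w^T phi(x_i))
    grad = Phi_toy.T @ (y - t_toy)        # gradient of E(w) from the expression above
    w -= eta * grad                       # gradient-descent step

print(w)   # parameters after gradient descent
```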

Here we will consider the *Newton-Raphson* method, for which minimizing the error function takes the form

$$\vec{w}^{(\tau+1)} = \vec{w}^{(\tau)} - \vec{H}^{-1} \nabla E(\vec{w})\,,$$

where $\vec{H}$ is the *Hessian* matrix composed of the second derivatives of the error function,

$$\vec{H} = \nabla \nabla E(\vec{w}) = \Phi^\intercal \vec{R} \, \Phi\,,$$

where $\Phi$ is the $m \times M$ design matrix ($M$ being the number of basis functions) whose $i$-th row is given by $\phi(\vec{x}_i)^\intercal$ and $\vec{R}$ is an $m \times m$ diagonal matrix with elements

$$R_{i,i} = \sigma(\vec{w}^\intercal \phi(\vec{x}_i)) \left[1-\sigma(\vec{w}^\intercal \phi(\vec{x}_i))\right].$$

This can be reduced to a form equivalent to that of locally weighted linear *regression* as follows:

$$\vec{w}^{(\tau+1)} = \left(\Phi^\intercal \vec{R} \, \Phi\right)^{-1} \Phi^\intercal \vec{R} \, \vec{z}\,,$$

where $\vec{z}$ is an $m$-dimensional vector defined by

$$\vec{z} = \Phi \vec{w}^{(\tau)} - \vec{R}^{-1}(\vec{y} - \vec{t})\,.$$

- As an experiment, you can try increasing the number of training points, `N`.
- Eventually, the training points will overlap so that it will not be possible to completely separate them with the transformation provided.

In [27]:

```
# preparing training dataset
np.random.seed(123456789) # fixing for reproducibility
N = 100 #number of data points
D = 2 #dimension of input vector
t = np.zeros(N) #training set classifications
X = np.zeros((N,D)) #training data in input space
sigma = .25
mu0 = 0.0
mu1 = 1.0
```

- Pick a value from a uniform distribution on $[0,1]$.
- If it is less than 0.5, assign class 1 and pick $x_1, x_2$ from a $\mathcal{N}(\mu_0,\sigma)$;
- otherwise assign class 2 and pick $x_1,x_2$ from $\mathcal{N}(\mu_1,\sigma)$.
- A random sign (`fac`) is then applied to both coordinates, which splits the class-2 points into two clusters (around $(\mu_1,\mu_1)$ and $(-\mu_1,-\mu_1)$), so the two classes are not linearly separable in the original input space.

In [29]:

```
for i in range(N):
    # choose class to sample for
    fac = 1
    if np.random.rand() <= 0.5:
        thismu = mu0
        t[i] = 1
    else:
        thismu = mu1
        t[i] = 0
    # randomly flip the sign of both coordinates half of the time
    if np.random.rand() < 0.5: fac = -1
    X[i,0] = fac * np.random.normal(thismu, sigma)
    X[i,1] = fac * np.random.normal(thismu, sigma)
```

In [30]:

```
fig = plt.figure(figsize=(5, 5))
ax = fig.gca()
ax.set_xlabel('$x_1$')
ax.set_ylabel('$x_2$')
ax.set_title('Training data set')
create_scatter(X, t, ax)
```

The training data does not have a linear boundary in the original input space.

- So let's apply a transformation, $\phi()$ (function `phi(...)`), to try to make it linearly separable.
- *Note*: This transformation is not the only one that works.
- For example, try switching the values of $\mu_1$ and $\mu_2$.
- The result will be a different mapping that is still linearly separable.

In [31]:

```
def phi(x, mu, sigma):
    'Gaussian basis function centered at mu with covariance sigma'
    detSigma = np.linalg.det(sigma)
    fac = math.pow(2 * math.pi, len(mu) / 2.0) * math.sqrt(detSigma)
    arg = -0.5 * np.dot((x - mu).T, np.dot(np.linalg.inv(sigma), x - mu))
    return math.exp(arg) / fac
```

In [32]:

```
phiX = np.zeros((N,D))
MU1 = np.ones(D)*mu0
MU2 = np.ones(D)*mu1
SIGMA = np.diag(np.ones(D))*sigma
for i in range(N):
    phiX[i,0] = phi(x=X[i,:], mu=MU2, sigma=SIGMA)
    phiX[i,1] = phi(x=X[i,:], mu=MU1, sigma=SIGMA)
```

In [33]:

```
fig = plt.figure(figsize=(5,5)); ax =fig.gca()
ax.set_xlabel(r'$\phi_1$'); ax.set_ylabel(r'$\phi_2$')
create_scatter(phiX,t,ax)
```

Now, let's apply machine learning to determine the boundary!

- We will assume $M = 3$, i.e. that there are 3 free parameters,
- that is, $\vec{w} = [w_0, w_1, w_2]^\intercal$ and `phi_n = [1, phiX[0], phiX[1]]`.

In [34]:

```
M = 3
Phi = np.ones((N,M))
Phi[:,1] = phiX[:,0]
Phi[:,2] = phiX[:,1]
w = np.zeros(M)
R = np.zeros((N,N))
y = np.zeros(N)
```

In [35]:

```
def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def totalErr(y, t):
    'Negative log-likelihood (cross-entropy) error E(w)'
    e = 0.0
    for i in range(len(y)):
        if t[i] > 0:
            e += math.log(y[i])
        else:
            e += math.log(1.0 - y[i])
    return -e
```

Starting Newton-Raphson.

- As a stopping criterion, we will use a tolerance on the change in the error function and a maximum number of iterations.

In [36]:

```
max_its = 100
tol = 1e-2
w0 = [w[0]]
w1 = [w[1]]
w2 = [w[2]]
err = []
error_delta = 1 + tol
current_error = 0
idx = 0
```

In [37]:

```
from functools import reduce
```

In [38]:

```
while math.fabs(error_delta) > tol and idx < max_its:
    # update y and R: y_i = sigma(w^T phi_i), R_ii = y_i (1 - y_i)
    for i in range(N):
        y[i] = sigmoid(reduce(lambda accum, Z: accum + Z[0]*Z[1], zip(w, Phi[i,:]), 0))
        R[i,i] = y[i] - y[i]*y[i]
    # IRLS update: w = (Phi^T R Phi)^{-1} Phi^T R z
    z = np.dot(Phi, w) - np.dot(np.linalg.pinv(R), y - t)
    term_1 = np.linalg.pinv(np.dot(np.dot(Phi.T, R), Phi))
    term_2 = np.dot(np.dot(term_1, Phi.T), R)
    w = np.dot(term_2, z)
    w0.append(w[0])
    w1.append(w[1])
    w2.append(w[2])
    idx += 1
    temp = totalErr(y, t)
    error_delta = current_error - temp
    current_error = temp
    err.append(error_delta)
```

In [39]:

```
print('The total number of iterations was {0}'.format(idx))
print('The total error was {0}'.format(current_error))
print('The final change in error was {0}'.format(error_delta))
print('The final parameters were {0}'.format(w))
```

Our decision boundary is now formed by the line where $\sigma(a) = 0.5$, i.e. where $a = \vec{w}^\intercal\vec{\phi} = 0$, which for this example is where $\phi_2 = -\frac{w_0 + w_1\phi_1}{w_2}$.

In [40]:

```
fig = plt.figure(figsize=(5,5)); ax =fig.gca()
ax.set_xlabel(r'$\phi_1$'); ax.set_ylabel(r'$\phi_2$')
create_scatter(phiX,t,ax)
bdryx = (-0.2,1.1)
bdryy = (-(w[0]+w[1]*bdryx[0])/w[2], -(w[0]+w[1]*bdryx[1])/w[2])
ax.plot(bdryx, bdryy, linestyle='--');
```

In [41]:

```
fig = plt.figure(figsize=(5,5))
plt.plot(w0, 'bo-', label='$w_0$')
plt.plot(w1, 'rx-', label='$w_1$')
plt.plot(w2, 'gs-', label='$w_2$')
plt.plot(err, 'm*-', label='error delta')
plt.legend(loc='upper left', frameon=True)
plt.xlabel('Newton-Raphson iterations');
```