Sebastian Raschka

last updated: 05/07/2014

I am really looking forward to your comments and suggestions to improve and extend this tutorial! Just send me a quick note

via Twitter: @rasbt

or Email: [email protected]

- 1 Introduction
- 2 Defining the Region $R_n$
- 2.1 Two different approaches - fixed volume vs. fixed number of samples in a variable volume
- 2.2 Example 3D-hypercubes
- 2.3 The
*window function* - 2.4 Parzen-window estimation
- 2.5 Critical parameters of the Parzen-window technique: window width and kernel
- 2.6 Critical assumption: Convergence
- 2.7 Implementing the Parzen-window estimation

- 3 Applying the Parzen-window approach to a random mutlivariate Gaussian dataset
- 3.1 Generating 10000 random 2D-patterns from a Gaussian distribution
- 3.2 Implementing and plotting the multivariate Gaussian density function

- 4 Conclusion and Drawbacks of the Parzen-window technique
- 5 Replacing the hypercube with a Gaussian Kernel
- 6 Plotting the estimates from the different Kernels
- 7 Using the kernel density estimation for a pattern classification task

The Parzen-window technique is a widely used non-parametric approach to estimate a probability density function $p(\pmb x)$ for a specific point $\pmb x$ from a sample $\pmb x_n$ that doesn't require any knowledge or assumption about the underying distribution.

**Putting it in context - Where would this method be useful?**

A popular application of the Parzen-window technique is to estimate the class-conditional densities (or also often called 'likelihoods') $p(\pmb x \; | \omega_i)$ in a supervised pattern classification problem from the training dataset (where $\pmb x$ is a multi-dimensional sample that belongs to particular class $\omega_i$).

Imagine that we are about to design a Bayes classifier for solving a statistical pattern classification task using Bayes's rule:

$P(\omega_i\; | \;\pmb x) = \frac{p(\pmb x \; | \; w_i) \cdot P(\omega_i)}{p(\pmb x)} \\ \Rightarrow posterior \; probability = \frac{\; likelihood \; \cdot \; prior \; probability}{evidence}$

If the parameters of the the *class-conditional densities* (also called *likelihoods*) are known, it is pretty easy to design the classifier. I have solved some simple examples in IPython notebooks under the section Parametric Approaches.

However, it becomes much more challenging, if we don't don't have prior knowledge about the underlying parameters that define the model of our data.

Imagine we are about to design a classifier for a pattern classification task where the parameters of the underlying sample distribution are not known. Therefore, we wouldn't need the knowledge about the whole range of the distribution; it would be sufficient to know the probability of the particular point, which we want to classify, in order to make the decision. And here we are going to see how we can estimate this probability from the training sample.

However, the only problem of this approach would be that we would seldom have exact values - if we consider the histogram of the frequencies for a arbitrary training dataset. Therefore, we define a certain *region* (i.e., the **Parzen-window**) around the particular value to make the estimate.

**And where does this name Parzen-window come from?**

As it was quite common in the earlier days, this technique was named after its inventor, Emanuel Parzen, who published his detailed mathematical analysis in 1962 in the

The basis of this approach is to count how many samples fall within a specified region $R_n$ (or "window" if you will). Our intuition tells us, that (based on the observation), the probability that one sample falls into this region is

$p(x) = \frac{\#\;of\;samples\;in\;R}{total \; samples}$

To tackle this problem from a more mathematical standpoint to estimate "the probability of observing $k$ points out of $n$ in a Region $R$" we consider a binomial distribution

$p_k = \bigg[ \begin{array}{c} n \\ k \end{array}\bigg] \cdot p^k \cdot (1-p)^{n-k}$

and make the assumption that in a binomial distribution, the probability peaks sharply at the mean, so that:

$E[k] = n \cdot p \; \sim \; k = n \cdot p$

And if we think of the probability of a continous variable, we know that it is defined as:

$p(\pmb x) = \int_R dx = p(\pmb x) \cdot V$,

where $V$ is the volume of the region $R$,

and if we rearrange those terms, so that we arrive at the following equation, which we will use later:

$\frac{k}{n} = p(\pmb x) \cdot V \\ \Rightarrow p(\pmb x) = \frac{k / n}{V}$

This simple equation above (i.e, the "probability estimate") lets us caclulate the probability density of a point $\pmb x$ by counting how many points $k$ fall in a defined region (or volume).

Now, there are two possible approaches to estimate the densities at different points $\pmb x^*$.

For a particular number $n$ (= number of total points), we use volume $V$ of a fixed size and observe how many points $k$ fall into the region. In other words, we use the same volume to make an estimate at different regions.

**The Parzen-window technique falls into this category!**

For a particular number $n$ (= number of total points), we use a fixed number $k$ (number of points that fall inside the region or volume) and adjust the volume accordingly.

(**The k-nearest neighbor technique uses this approach, which will be discussed in a separate article.**)

To illustrate this with an example and a set of equations, let us assume this region $R_n$ is a hypercube.

And the volume of this hypercube is defined by $V_n = {h_n}^d$, where $h_n$ is the length of the hypercube, and $d$ is the number of dimensions.

For an 2D-hypercube with length 1, for example, this would be $V_1 = {1}^2$ and for a 3D hypercube $V_1 = {1}^3$, respectively.

So let us visualize such an simple example: a typical 3-dimensional unit hypercube ($h_1 = 1$) representing the region $R_1$, and 10 sample points, where 3 of them lie within the hypercube (red triangles), and the other 7 outside (blue dots).

**Note that using a hypercube would not be an ideal choice for a real application, but it certainly makes the implementations of the following steps a lot shorter and easier to follow**.

In [1]:

```
%pylab inline
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
from itertools import product, combinations
fig = plt.figure(figsize=(7,7))
ax = fig.gca(projection='3d')
ax.set_aspect("equal")
# Plot Points
# samples within the cube
X_inside = np.array([[0,0,0],[0.2,0.2,0.2],[0.1, -0.1, -0.3]])
X_outside = np.array([[-1.2,0.3,-0.3],[0.8,-0.82,-0.9],[1, 0.6, -0.7],
[0.8,0.7,0.2],[0.7,-0.8,-0.45],[-0.3, 0.6, 0.9],
[0.7,-0.6,-0.8]])
for row in X_inside:
ax.scatter(row[0], row[1], row[2], color="r", s=50, marker='^')
for row in X_outside:
ax.scatter(row[0], row[1], row[2], color="k", s=50)
# Plot Cube
h = [-0.5, 0.5]
for s, e in combinations(np.array(list(product(h,h,h))), 2):
if np.sum(np.abs(s-e)) == h[1]-h[0]:
ax.plot3D(*zip(s,e), color="g")
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_zlim(-1.5, 1.5)
plt.show()
```

Once we visualized the region $R_1$ like above, it is easy and intuitive to count how many samples fall within this region, and how many lie outside. To approach this problem more mathematically, we would use the following equation to count the samples $k_n$ within this hypercube, where $\phi$ is our so-called *window function*:

$\phi(\pmb u) = \Bigg[ \begin{array}{ll} 1 & \quad |u_j| \leq 1/2 \; ;\quad \quad j = 1, ..., d \\
0 & \quad otherwise \end{array} $

for a hypercube of unit length 1 centered at the coordinate system's origin. What this function basically does is assigning a value 1 to a sample point if it lies within 1/2 of the edges of the hypercube, and 0 if lies outside (note that the evaluation is done for all dimensions of the sample point).

If we extend on this concept, we can define a more general equation that applies to hypercubes of any length $h_n$ that are centered at $\pmb x$:

$k_n = \sum\limits_{i=1}^{n} \phi \bigg( \frac{\pmb x - \pmb x_i}{h_n} \bigg)$

where $\pmb u = \bigg( \frac{\pmb x - \pmb x_i}{h_n} \bigg)$

In [2]:

```
def window_function(x_vec, unit_len=1):
"""
Implementation of the window function. Returns 1 if 3x1-sample vector
lies within a origin-centered hypercube, 0 otherwise.
"""
for row in x_vec:
if np.abs(row) > (unit_len/2):
return 0
return 1
```

Using the *window function* that we just implemented above, let us now quantify how many points actually lie inside and outside the hypercube.

In [3]:

```
X_all = np.vstack((X_inside,X_outside))
assert(X_all.shape == (10,3))
k_n = 0
for row in X_all:
k_n += window_function(row.reshape(3,1))
print('Points inside the hypercube:', k_n)
print('Points outside the hybercube:', len(X_all) - k_n)
```

Based on the window function that we defined in the section above, we can now formulate the Parzen-window estimation with a hypercube kernel as follows:

$p_n(\pmb x) = \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{h^d} \phi \bigg[ \frac{\pmb x - \pmb x_i}{h_n} \bigg]$

where

$h^d = V_n\quad$ and $\quad\phi \bigg[ \frac{\pmb x - \pmb x_i}{h_n} \bigg] = k$

And applying this to out unit-hypercube example above, for which 3 out of 10 samples fall inside the hypercube (region $R$), we can calculate the probability $p(\pmb x)$ that $\pmb x$ samples fall within region $R$ as follows:

$p(\pmb x) = \frac{k/n}{h^d} = \frac{3/10}{1^3} = \frac{3}{10} = 0.3$

One of the most critical assumptions why this technique works (i.e., the Parzen-window estimation, but the k-nearest neighbor technique as well), is that $p_n$ (where $n$ stands for the "number of samples", not "non-parametric" ;) ) converges to the true density $p(\pmb x)$ when we assume an infinite number of training samples, which was nicely shown by Emanuel Parzen in his paper.

The two critical parameters in the Parzen-window techniques are

- 1) the window width
- 2) the kernel

Let us skip this part for now and discuss and explore the question of choosing an appropriate window width laterby using an hands-on example.

Most commonly, either a hypercube or a Gaussian kernel is used for the window function. But how do we know which is better? It really depends on the training sample. In practice, the choice is often made by testing the derived pattern classifier to see which method leads to a better performance on the test data set.

Intuitively, it would make sense to use a Gaussian kernel for a data set that follows a Gaussian distribution. **But remmeber, the whole purpose of the Parzen-window estimation is to estimate densities of a unknown distribution!** So, in practice we wouldn't know whether our data stems from a Gaussian distribution or not (otherwise we wouldn't need to estimate, but can use a parametric techniqe like MLE or Bayesian Estimation).

If we would decide to use a Gaussian kernel instead of the hypercube, we can just simply swap the terms of the window function, which we defined above for the hypercube, by:

$\frac{1}{(\sqrt{2 \pi})^d h_{n}^{d}} exp \; \bigg[ -\frac{1}{2} \bigg(\frac{\pmb x - \pmb x_i}{h_n} \bigg)^2 \bigg]$

The Parzen-window estimation would then look like this:

$p_n(\pmb x) = \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{h^d} \phi \Bigg[ \frac{1}{(\sqrt {2 \pi})^d h_{n}^{d}} exp \; \bigg[ -\frac{1}{2} \bigg(\frac{\pmb x - \pmb x_i}{h_n} \bigg)^2 \bigg] \Bigg]$

**mixing different kernels:**

In some papers you will see, that the authors mixed hypercube and Gaussian kernels to estimate the densities at different regions. In practice, this might work very well, however, but note that in theory this would violate on of the underlying principles: that the density integrates to 1.

$\lim\limits_{n \rightarrow \infty}{p_n(\pmb x)} = p(\pmb x)$

A few other requirements are that at the limit the volume we choose for the Parzen-window becomes infinite small:

$\lim\limits_{n \rightarrow \infty}{V_n} = 0$

and the number of points $k_n$ in this region converges to:

$\lim\limits_{n \rightarrow \infty}{k_n} = \infty$

from which we can conclude:

$\lim\limits_{n \rightarrow \infty}{\frac{k_n}{n}} = 0$

And again, let us go to the more interesting part and implement the code for the hypercube kernel:

In [4]:

```
def parzen_window_est(x_samples, h=1, center=[0,0,0]):
'''
Implementation of the Parzen-window estimation for hypercubes.
Keyword arguments:
x_samples: A 'n x d'-dimensional numpy array, where each sample
is stored in a separate row.
h: The length of the hypercube.
center: The coordinate center of the hypercube
Returns the probability density for observing k samples inside the hypercube.
'''
dimensions = x_samples.shape[1]
assert (len(center) == dimensions), 'Number of center coordinates have to match sample dimensions'
k = 0
for x in x_samples:
is_inside = 1
for axis,center_point in zip(x, center):
if np.abs(axis-center_point) > (h/2):
is_inside = 0
k += is_inside
return (k / len(x_samples)) / (h**dimensions)
print('p(x) =', parzen_window_est(X_all, h=1))
```

**At this point I can fully understand if you lost the overview a little bit. I summarized the three crucial parts (hypercube kernel, window function, and the resulting parzen-window estimation) in a later section, and I think it is worthwhile to take a brief look at it, before we apply it to a data set below.**

Let us use an 2-dimensional dataset drawn from a multivariate Gaussian distribution to apply the Parzen-window technique for the density estimation.

The general multivariate Gaussian probability density function (pdf) is defined as:

$p(\pmb x) = \frac{1}{(2\pi)^{d/2} \; |\Sigma|^{1/2}} exp \bigg[ -\frac{1}{2}(\pmb x - \pmb \mu)^t \Sigma^{-1}(\pmb x - \pmb \mu) \bigg]$

And we will use the following parameters for drawing the random samples:

$\pmb{\mu} = \bigg[
\begin{array}{c}
0\\
0\\
\end{array} \bigg]\;, \quad$

$\Sigma = \bigg[
\begin{array}{cc}
\sigma_{11}^2 & \sigma_{12}^2\\
\sigma_{21}^2 & \sigma_{22}^2\\
\end{array} \bigg] \; = \bigg[
\begin{array}{cc}
1 & 0\\
0 & 1\\
\end{array} \bigg]$

In [6]:

```
import numpy as np
# Generate 10000 random 2D-patterns
mu_vec = np.array([0,0])
cov_mat = np.array([[1,0],[0,1]])
x_2Dgauss = np.random.multivariate_normal(mu_vec, cov_mat, 10000)
print(x_2Dgauss.shape)
```

In [7]:

```
#%pylab inline
#from matplotlib import pyplot as plt
f, ax = plt.subplots(figsize=(7, 7))
ax.scatter(x_2Dgauss[:,0], x_2Dgauss[:,1], marker='o', color='green', s=4, alpha=0.3)
plt.title('10000 samples randomly drawn from a 2D Gaussian distribution')
plt.ylabel('x2')
plt.xlabel('x1')
ftext = 'p(x) ~ N(mu=(0,0)^t, cov=I)'
plt.figtext(.15,.85, ftext, fontsize=11, ha='left')
plt.ylim([-4,4])
plt.xlim([-4,4])
plt.show()
```

First, let us plot the multivariate Gaussian distribution (here: bivariate) in a 3D plot to get an better idea of the actual density distribution.

In [8]:

```
#%pylab inline
#import numpy as np
#from matplotlib import pyplot as plt
from matplotlib.mlab import bivariate_normal
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10, 7))
ax = fig.gca(projection='3d')
x = np.linspace(-5, 5, 200)
y = x
X,Y = np.meshgrid(x, y)
Z = bivariate_normal(X, Y)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=plt.cm.coolwarm,
linewidth=0, antialiased=False)
ax.set_zlim(0, 0.2)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')
plt.title('Bivariate Gaussian distribution')
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
plt.show()
```

To calculate the probabilities from the multivariate Gaussian density function and compare the results our estimate, let us implement it using the following equation:

$p(\pmb x) = \frac{1}{(2\pi)^{d/2} \; |\Sigma|^{1/2}} exp \bigg[ -\frac{1}{2}(\pmb x - \pmb \mu)^t \Sigma^{-1}(\pmb x - \pmb \mu) \bigg]$

Unfortunately, there is currently no Python library that provides this functionality. Good news is, that `scipy.stats.multivariate_normal.pdf()`

will be implemented in the new release candidate of `scipy`

(v. 0.14).

In [9]:

```
#import numpy as np
def pdf_multivariate_gauss(x, mu, cov):
'''
Caculate the multivariate normal density (pdf)
Keyword arguments:
x = numpy array of a "d x 1" sample vector
mu = numpy array of a "d x 1" mean vector
cov = "numpy array of a d x d" covariance matrix
'''
assert(mu.shape[0] > mu.shape[1]), 'mu must be a row vector'
assert(x.shape[0] > x.shape[1]), 'x must be a row vector'
assert(cov.shape[0] == cov.shape[1]), 'covariance matrix must be square'
assert(mu.shape[0] == cov.shape[0]), 'cov_mat and mu_vec must have the same dimensions'
assert(mu.shape[0] == x.shape[0]), 'mu and x must have the same dimensions'
part1 = 1 / ( ((2* np.pi)**(len(mu)/2)) * (np.linalg.det(cov)**(1/2)) )
part2 = (-1/2) * ((x-mu).T.dot(np.linalg.inv(cov))).dot((x-mu))
return float(part1 * np.exp(part2))
```

Let us quickly confirm that the multivariate Gaussian, which we just implemented, works as expected by comparing
it with the bivariate Gaussian from the `matplotlib.mlab`

package.

In [73]:

```
from matplotlib.mlab import bivariate_normal
x = np.array([[0],[0]])
mu = np.array([[0],[0]])
cov = np.eye(2)
mlab_gauss = bivariate_normal(x,x)
mlab_gauss = float(mlab_gauss[0]) # because mlab returns an np.array
impl_gauss = pdf_multivariate_gauss(x, mu, cov)
print('mlab_gauss:', mlab_gauss)
print('impl_gauss:', impl_gauss)
assert(mlab_gauss == impl_gauss), 'Implementations of the mult. Gaussian return different pdfs'
```

And finally, let us compare the densities that we estimate using our implementation of the Parzen-window estimation with the actual multivariate Gaussian densities.

But before we do the comparison, we have to ask ourselves one more question: Which window size should we choose (i.e., what should be the side length $h$ of our hypercube)?

The window width is a function of the number of training samples,

$h_n \propto \frac{1}{\sqrt{n}}$

But why $\sqrt{n}$, not just $n$?

This is because the number of $k_n$ points within a window grows much smaller than the number of training samples, although

$\lim\limits_{n \rightarrow \infty}{k_n} = \infty$

we have: $ k < n$

This is also one of the biggest drawbacks of the Parzen-window technique, since in practice, the number of training data is usually (too) small, which makes the choice of an 'optimal' window size difficult.

**In practice, one would choose different window widths and analyze which one results in the best performance of the derived classifier.** The only guideline we have is that we assume that 'optimal' the window width shrinks with the number of training samples.

If we would choose a window width that is "too small", this would result in local spikes, and a window width that is "too big" would average over the whole distribution. Below, I try to illustrate it using a 1D-sample.

In a later section we will have a look how it looks like real data:

For our example, we have the luxury of knowing the distribution of our data, so let us look at some possible window widths to see how it affects the density estimation for the center of the distribution

(based on our test in the section above, we would expect a value close to $p(\pmb x) = 0.1592\;$ ).

In [10]:

```
print('Predict p(x) at the center [0,0]: ')
print('h = 0.1 ---> p(x) =', parzen_window_est(x_2Dgauss, h=0.1, center=[0, 0]))
print('h = 0.3 ---> p(x) =',parzen_window_est(x_2Dgauss, h=0.3, center=[0, 0]))
print('h = 0.6 ---> p(x) =',parzen_window_est(x_2Dgauss, h=0.6, center=[0, 0]))
print('h = 1 ---> p(x) =',parzen_window_est(x_2Dgauss, h=1, center=[0, 0]))
```

This rough estimate above gives us some idea about a window size that would give us a quite reasonable estimate: A good value for $h$ should be somewhere around 0.6.

But we can do better (maybe I should use a minimization algorithm here, but I think this example should illustrate the procedure sufficiently). So let us create 400 evenly spaced values between (0,1) for $h$ and see which gives us the estimate for the probability at the center of the distribution.

In [11]:

```
import operator
# generate a range of 400 window widths between 0 < h < 1
h_range = np.linspace(0.001, 1, 400)
# calculate the actual density at the center [0, 0]
mu = np.array([[0],[0]])
cov = np.eye(2)
actual_pdf_val = pdf_multivariate_gauss(np.array([[0],[0]]), mu, cov)
# get a list of the differnces (|estimate-actual|) for different window widths
parzen_estimates = [np.abs(parzen_window_est(x_2Dgauss, h=i, center=[0, 0])
- actual_pdf_val) for i in h_range]
# get the window width for which |estimate-actual| is closest to 0
min_index, min_value = min(enumerate(parzen_estimates), key=operator.itemgetter(1))
print('Optimal window width for this data set: ', h_range[min_index])
```

Now, that we have a "appropriate" window width, let us compare the estimates to the actual densities for some example points.

In [199]:

```
import prettytable
p1 = parzen_window_est(x_2Dgauss, h=h_range[min_index], center=[0, 0])
p2 = parzen_window_est(x_2Dgauss, h=h_range[min_index], center=[0.5, 0.5])
p3 = parzen_window_est(x_2Dgauss, h=h_range[min_index], center=[0.3, 0.2])
mu = np.array([[0],[0]])
cov = np.eye(2)
a1 = pdf_multivariate_gauss(np.array([[0],[0]]), mu, cov)
a2 = pdf_multivariate_gauss(np.array([[0.5],[0.5]]), mu, cov)
a3 = pdf_multivariate_gauss(np.array([[0.3],[0.2]]), mu, cov)
results = prettytable.PrettyTable(["", "predicted", "actual"])
results.add_row(["p([0,0]^t",p1, a1])
results.add_row(["p([0.5,0.5]^t",p2, a2])
results.add_row(["p([0.3,0.2]^t",p3, a3])
print(results)
```

As we can see in the table above, our prediction works quite well!

Last but not least, let us plot the densities that we'd predict using the Parzen-window technique with a hypercube!

In [13]:

```
%pylab inline
import numpy as np
from matplotlib import pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
##############################################
### Predicted bivariate Gaussian densities ###
##############################################
fig = plt.figure(figsize=(10, 7))
ax = fig.gca(projection='3d')
X = np.linspace(-5, 5, 100)
Y = np.linspace(-5, 5, 100)
X,Y = np.meshgrid(X,Y)
Z = []
for i,j in zip(X.ravel(),Y.ravel()):
Z.append(parzen_window_est(x_2Dgauss, h=h_range[min_index], center=[i, j]))
Z = np.asarray(Z).reshape(100,100)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=plt.cm.coolwarm,
linewidth=0, antialiased=False)
ax.set_zlim(0, 0.2)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')
plt.title('Predicted bivariate Gaussian densities')
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
###########################################
### Actual bivariate Gaussian densities ###
###########################################
fig = plt.figure(figsize=(10, 7))
ax = fig.gca(projection='3d')
x = np.linspace(-5, 5, 100)
y = x
X,Y = np.meshgrid(x, y)
Z = bivariate_normal(X, Y)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=plt.cm.coolwarm,
linewidth=0, antialiased=False)
ax.set_zlim(0, 0.2)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')
plt.title('Actual bivariate Gaussian densities')
plt.show()
```

As we see in the two figures above (estimated vs. actual bivariate Gaussian probability distributions), we were able to estimate the Gaussian densities reasonably well.

One of the biggest drawbacks of the Parzen-window technique is that we have to **keep our training dataset** around for estimating (computing) the probability densities. For example, if we design a Bayes' classifier and use the Parzen-window technique to estimate the class-conditional probability densities $p(\pmb x_i \; | \; \omega_j)$, the computational task requires the whole training dataset to make the estimate for every point - this is a disadvantage of non-parametric approaches in general.

In contrast, parametric methods, e.g., the Maximum Likelihood Estimate (MLE) or Bayesian Estimation (BE) only require the training data to calculate the value of the parameters. Once, the parameters were obtained from the training dataset, it can be discarded and recomputation is not required for classifying data in the test dataset. However, in Bayesian Learning (BL), new samples can be incoporated to improve the estimated parameter values.

Since the Parzen-window technique estimates the probability densities based on the training dataset, it also relies on a reasonable size of training samples to make a "good" estimate. The more training samples we have in the the training dataset, roughly speaking, the more accurate the estimation becomes (Central limit theorem) since we reduce the likelihood of encountering a sparsity of points for local regions - assuming that our training samples are *i.i.d* (**i**ndependently drawn and **i**dentically **d**istributed). However, on the other hand, a larger number or training samples will also lead to a decrease of the computational performance as discussed in the section above.

Choosing an appropriate window width is a very challenging task, since we have no knowledge about the distribution of the training data (otherwise the non-parameteric approach would be obsolete), and therefore we have to evaluate different window width by analysing the performance of the derived classifier in a pattern classification task, using the assumption $h_n \propto \frac{1}{\sqrt{n}}$ as a guideline.

Most commonly, either a hypercube or Gaussian kernel is chosen for the Parzen-window function $\phi$. However, it is impossible to tell which kernel would yield a better estimation of the probability densities beforehand, since we assume that we don't have any knowledge about the underlying distribution when we are using he Parzen-window technique. Similar to choosing an appropriate window width, we also have to evalulate the performance of the derived classifier to make a practical choice.

Working through the examples above, we estimated the probability densities for a bivariate Gaussian distribution using a hypercube kernel, and our Parzen-window estimation was implemented by the following equation:

$p_n(\pmb x) = \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{h^d} \phi \bigg[ \frac{\pmb x - \pmb x_i}{h_n} \bigg]$

Now, let us switch to a Gaussian kernel for the Parzen-window estimation, so that the equation becomes:

$p_n(\pmb x) = \frac{1}{n} \sum\limits_{i=1}^{n} \frac{1}{h^d} \phi \Bigg[ \frac{1}{(\sqrt {2 \pi})^d h_{n}^{d}} exp \; \bigg[ -\frac{1}{2} \bigg(\frac{\pmb x - \pmb x_i}{h_n} \bigg)^2 \bigg] \Bigg]$

where

$h^d = V_n\quad$ and $\quad\phi \Bigg[ \frac{1}{(\sqrt {2 \pi})^d h_{n}^{d}} exp \; \bigg[ -\frac{1}{2} \bigg(\frac{\pmb x - \pmb x_i}{h_n} \bigg)^2 \bigg] \Bigg] = k$

And applying this to out unit-hypercube example above, for which 3 out of 10 samples fall inside the hypercube (region $R$), we can calculate the probability $p(\pmb x)$ that $\pmb x$ samples fall within region $R$ as follows:

$p(\pmb x) = \frac{k/n}{h^d} = \frac{3/10}{1^3} = \frac{3}{10} = 0.3$

Let us quickly summarize how we implemented the Parzen-window estimation for the hypercube.

In [4]:

```
def hypercube_kernel(h, x, x_i):
"""
Implementation of a hypercube kernel for Parzen-window estimation.
Keyword arguments:
h: window width
x: point x for density estimation, 'd x 1'-dimensional numpy array
x_i: point from training sample, 'd x 1'-dimensional numpy array
Returns a 'd x 1'-dimensional numpy array as input for a window function.
"""
assert (x.shape == x_i.shape), 'vectors x and x_i must have the same dimensions'
return (x - x_i) / (h)
def parzen_window_func(x_vec, h=1):
"""
Implementation of the window function. Returns 1 if 'd x 1'-sample vector
lies within inside the window, 0 otherwise.
"""
for row in x_vec:
if np.abs(row) > (1/2):
return 0
return 1
def parzen_estimation(x_samples, point_x, h, d, window_func, kernel_func):
"""
Implementation of a parzen-window estimation.
Keyword arguments:
x_samples: A 'n x d'-dimensional numpy array, where each sample
is stored in a separate row. (= training sample)
point_x: point x for density estimation, 'd x 1'-dimensional numpy array
h: window width
d: dimensions
window_func: a Parzen window function (phi)
kernel_function: A hypercube or Gaussian kernel functions
Returns the density estimate p(x).
"""
k_n = 0
for row in x_samples:
x_i = kernel_func(h=h, x=point_x, x_i=row[:,np.newaxis])
k_n += window_func(x_i, h=h)
return (k_n / len(x_samples)) / (h**d)
```

Let's go back to the hypercube example where 3 points lie inside, 7 points lie outside the hypercube, and our density estimate for a unit-1 hypercube at the center is $p(\pmb x) = 0.3$.

In [5]:

```
%pylab inline
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import numpy as np
from itertools import product, combinations
fig = plt.figure(figsize=(7,7))
ax = fig.gca(projection='3d')
ax.set_aspect("equal")
# Plot Points
# samples within the cube
X_inside = np.array([[0,0,0],[0.2,0.2,0.2],[0.1, -0.1, -0.3]])
X_outside = np.array([[-1.2,0.3,-0.3],[0.8,-0.82,-0.9],[1, 0.6, -0.7],
[0.8,0.7,0.2],[0.7,-0.8,-0.45],[-0.3, 0.6, 0.9],
[0.7,-0.6,-0.8]])
for row in X_inside:
ax.scatter(row[0], row[1], row[2], color="r", s=50, marker='^')
for row in X_outside:
ax.scatter(row[0], row[1], row[2], color="k", s=50)
# Plot Cube
h = [-0.5, 0.5]
for s, e in combinations(np.array(list(product(h,h,h))), 2):
if np.sum(np.abs(s-e)) == h[1]-h[0]:
ax.plot3D(*zip(s,e), color="g")
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_zlim(-1.5, 1.5)
plt.show()
```

We can now use the `parzen_estimation()`

function with the hypercube kernel to calculate $p(\pmb x)$.

In [11]:

```
point_x = np.array([[0],[0],[0]])
print('p(x) =', parzen_estimation(X_all, point_x, h=1, d=3,
window_func=parzen_window_func,
kernel_func=hypercube_kernel))
```

**And let us quickly confirm that it works on a bigger dataset for non-unit hypercubes that are not centered at the origin:**

(Note that I just chose an arbitrary length for *h*, this is likely not the best choice for *h*.)

In [3]:

```
import numpy as np
# Generate 10000 random 2D-patterns
mu_vec = np.array([0,0])
cov_mat = np.array([[1,0],[0,1]])
x_2Dgauss = np.random.multivariate_normal(mu_vec, cov_mat, 10000)
```

In [35]:

```
import prettytable
p1 = parzen_estimation(x_2Dgauss, np.array([[0],[0]]), h=0.3, d=2,
window_func=parzen_window_func,
kernel_func=hypercube_kernel)
p2 = parzen_estimation(x_2Dgauss, np.array([[0.5],[0.5]]), h=0.3, d=2,
window_func=parzen_window_func,
kernel_func=hypercube_kernel)
p3 = parzen_estimation(x_2Dgauss, np.array([[0.3],[0.2]]), h=0.3, d=2,
window_func=parzen_window_func,
kernel_func=hypercube_kernel)
mu = np.array([[0],[0]])
cov = np.eye(2)
a1 = pdf_multivariate_gauss(np.array([[0],[0]]), mu, cov)
a2 = pdf_multivariate_gauss(np.array([[0.5],[0.5]]), mu, cov)
a3 = pdf_multivariate_gauss(np.array([[0.3],[0.2]]), mu, cov)
results = prettytable.PrettyTable(["", "p(x) predicted", "p(x) actual"])
results.add_row(["p([0,0]^t",p1, a1])
results.add_row(["p([0.5,0.5]^t",p2, a2])
results.add_row(["p([0.3,0.2]^t",p3, a3])
print(results)
```

`scipy.stats`

¶Since we already went through the Parzen-window technique step by step for the hypercube kernel, let us import the `gaussian_kde`

class from the `scipy`

package for a more convenient approach.

The complete documentation can be found on docs.scipy.org.

**The gaussian_kde class takes 2 parameters as input**

**dataset**:`array_like`

Datapoints to estimate from. In case of univariate data this is a 1-D array, otherwise a 2-D array with shape (# of dims, # of data).**bw_method**:`str, scalar or callable, optional`

The method used to calculate the estimator bandwidth. This can be`'scott'`

,`'silverman'`

, a scalar constant or a callable. If a scalar, this will be used directly as kde.factor. If a callable, it should take a gaussian_kde instance as only parameter and return a scalar. If`'None'`

(default),`'scott'`

is used. See Notes for more details.

Note initializiation of a `gaussian_kde`

instance requires a numpy array in which the different samples are ordered by columns and where the rows reflect the dimensions of the dataset - this is why we have to pass our previously generated training data in its transposed form.

First have a quick look how the `gaussian_kde()`

method works using a simple scalar as window width, like we did with the hypercube kernel before and estimate the density at the center.

In [37]:

```
# Example evaluating the density at the center
from scipy.stats import kde
density = kde.gaussian_kde(x_2Dgauss.T, bw_method=0.3)
print(density.evaluate(np.array([[0],[0]])))
```

Let us now compare the Gaussian kernel against the hypercube for the same arbitrarily chosen window width:

In [76]:

```
import prettytable
gde = kde.gaussian_kde(x_2Dgauss.T, bw_method=0.3)
results = prettytable.PrettyTable(["", "p(x) hypercube kernel", "p(x) Gaussian kernel", "p(x) actual"])
results.add_row(["p([0,0]^t",p1, gde.evaluate(np.array([[0],[0]]))[0], a1])
results.add_row(["p([0.5,0.5]^t",p2, gde.evaluate(np.array([[0.5],[0.5]]))[0], a2])
results.add_row(["p([0.3,0.2]^t",p3, gde.evaluate(np.array([[0.3],[0.2]]))[0], a3])
print(results)
```

The `gaussian_kde()`

class comes with 2 different "bandwith estimation calculations", let us explore if it makes a difference to choose one over the other on our dataset:

In [78]:

```
from scipy.stats import kde
import prettytable
mu = np.array([[0],[0]])
cov = np.eye(2)
scott = kde.gaussian_kde(x_2Dgauss.T, bw_method='scott')
silverman = kde.gaussian_kde(x_2Dgauss.T, bw_method='silverman')
scalar = kde.gaussian_kde(x_2Dgauss.T, bw_method=0.3)
actual = pdf_multivariate_gauss(np.array([[0],[0]]), mu, cov)
results = prettytable.PrettyTable(["", "p([0,0]^t gaussian kernel"])
results.add_row(["bw_method scalar 0.3:", scalar.evaluate(np.array([[0],[0]]))[0]])
results.add_row(["bw_method scott:", scott.evaluate(np.array([[0],[0]]))[0]])
results.add_row(["bw_method silverman:", silverman.evaluate(np.array([[0],[0]]))[0]])
results.add_row(["actual density:", actual])
print(results)
```

And eventually, let us plot the results for the different Gaussian kernel bandwidth estimation methods for the estimated distribution.

In [150]:

```
%pylab inline
import numpy as np
from matplotlib import pyplot as plt
from matplotlib.mlab import bivariate_normal
from mpl_toolkits.mplot3d import Axes3D
X = np.linspace(-5, 5, 100)
Y = np.linspace(-5, 5, 100)
X,Y = np.meshgrid(X,Y)
##########################################
### Hypercube kernel density estimates ###
##########################################
fig = plt.figure(figsize=(10, 7))
ax = fig.gca(projection='3d')
Z = []
for i,j in zip(X.ravel(),Y.ravel()):
Z.append(parzen_estimation(x_2Dgauss, np.array([[i],[j]]), h=0.3, d=2,
window_func=parzen_window_func,
kernel_func=hypercube_kernel))
Z = np.asarray(Z).reshape(100,100)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=plt.cm.coolwarm,
linewidth=0, antialiased=False)
ax.set_zlim(0, 0.2)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')
plt.title('Hypercube kernel with window width h=0.3')
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
plt.show()
#########################################
### Gaussian kernel density estimates ###
#########################################
for bwmethod,t in zip([scalar, scott, silverman], ['scalar h=0.3', 'scott', 'silverman']):
fig = plt.figure(figsize=(10, 7))
ax = fig.gca(projection='3d')
Z = bwmethod(np.array([X.ravel(),Y.ravel()]))
Z = Z.reshape(100,100)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=plt.cm.coolwarm,
linewidth=0, antialiased=False)
ax.set_zlim(0, 0.2)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')
plt.title('Gaussian kernel, bw_method %s' %t)
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
plt.show()
###########################################
### Actual bivariate Gaussian densities ###
###########################################
fig = plt.figure(figsize=(10, 7))
ax = fig.gca(projection='3d')
Z = bivariate_normal(X, Y)
surf = ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=plt.cm.coolwarm,
linewidth=0, antialiased=False)
ax.set_zlim(0, 0.2)
ax.zaxis.set_major_locator(plt.LinearLocator(10))
ax.zaxis.set_major_formatter(plt.FormatStrFormatter('%.02f'))
fig.colorbar(surf, shrink=0.5, aspect=7, cmap=plt.cm.coolwarm)
ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.set_zlabel('p(x)')
plt.title('Actual bivariate Gaussian densities')
plt.show()
```