In this notebook, you will explore how distances between data points behave in high-dimensional metric spaces. To this end, you will experiment with both artificial data and real image data.
You will find supplementary material for this notebook in the file notebook10.zip on the lecture's homepage. The zip file contains the file wissrech2.py (an update of the package you have already used before) and a folder images with natural image scenes.
This notebook in fact requires very little programming. It is essentially one simple subroutine consisting of a few lines of code, which you will implement below. This subroutine does the following:
A plot produced by your subroutine should look similar in style to the example below. Everything you need to compute pairwise distances can be found in the package scipy; a subroutine for computing histograms is available in the package numpy. Finally, a look at this matplotlib demo
http://matplotlib.org/examples/pylab_examples/annotation_demo2.html
should be helpful for getting the annotations into your plot.
def add_pd_histogram_plot(ax, data, pd_func, label=''):
    """
    Plot the normalized histogram of pairwise distances on the given Axes object ax.

    Parameters:
        ax      - the matplotlib Axes object to which the plot is added
        data    - array-like of shape (N, d), where N is the number of data points
                  and d the dimension
        pd_func - function object such that pd_func(data) computes the pairwise distances
        label   - text placed at the mode of the histogram
    """
    # your code goes here
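One possible implementation sketch (not the only valid one) combines scipy's pairwise distances with numpy's histogram routine; it assumes that pd_func returns a flat vector of distances, as scipy.spatial.distance.pdist does:

```python
import numpy as np

def add_pd_histogram_plot(ax, data, pd_func, label=''):
    # all pairwise distances (condensed vector, as e.g. scipy's pdist returns)
    dists = pd_func(data)
    # normalized histogram: density=True makes the bars integrate to 1
    counts, edges = np.histogram(dists, bins=50, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    ax.plot(centers, counts)
    # annotate the mode of the histogram with the given label
    i = int(np.argmax(counts))
    ax.annotate(label, xy=(centers[i], counts[i]))
```

The choice of 50 bins is arbitrary; feel free to adjust it, or to draw the histogram with ax.plot of the bin centers (as here) versus a bar plot, whichever matches the example figure better.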
Do the following numerical experiments:
You will find routines for random number generation in the package numpy.random.
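As a minimal sketch of such an experiment (assuming, for illustration, uniform data on $[0,1]^d$ and the Euclidean distance; use the distributions and distances specified in the task), one can already see the concentration effect in the relative spread of the pairwise distances:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    data = rng.uniform(size=(200, d))   # 200 points, uniform on [0,1]^d
    dists = pdist(data)                 # Euclidean pairwise distances
    # relative spread std/mean shrinks as d grows (distance concentration)
    print(d, dists.std() / dists.mean())
```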
Discuss the concentration behavior of the histograms for increasing dimension $d$. What does this imply for the meaningfulness of pairwise distances when $d$ becomes large?
Your answer:
# your code goes here
Do a numerical experiment that is identical to the one from Task 2 except for the data generation (Step 1). This time, generate the following data instead:
If you compare the diagrams which you obtain this time with the diagrams from Task 2, what do you observe? Do you notice something special about the Euclidean distance?
(To see the effects more clearly, it might be helpful to generate further diagrams that combine some plots from Task 2 with some plots from this task.)
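The precise data generation for this task is specified above; as a placeholder, the sketch below reuses uniform data and merely shows how scipy lets you compare several distance functions on the same data set via the metric argument of pdist:

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# placeholder data; substitute the distribution specified in this task
data = rng.uniform(size=(200, 100))
for metric in ('euclidean', 'cityblock', 'chebyshev'):
    dists = pdist(data, metric=metric)
    print(metric, dists.mean(), dists.std() / dists.mean())
```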
Your answer:
# your code goes here
So far, you have worked with completely artificial data drawn from probability distributions. In this task, you will work with patches from natural image scenes given as gray-level images. The images are contained in the folder images. To generate the image patches, you can use the subroutine wissrech2.random_image_data ('random' here only means that patches of the given size are taken from random positions within the full image).
Now do the same experiment as in Task 2, except that you use image patches as your data. Further, you should choose dimensions whose square root is an integer (because random_image_data flattens patches of shape $\lfloor \sqrt{d} \rfloor \times \lfloor \sqrt{d} \rfloor$ into flat $d$-dimensional vectors if you ask for $d$-dimensional data).
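The kind of extraction random_image_data performs can be sketched with plain numpy (using a synthetic image here, since the actual patches come from the images folder; the helper name random_patches is our own, not part of wissrech2):

```python
import numpy as np

def random_patches(image, d, n, rng=None):
    # take n patches of shape s x s, with s = floor(sqrt(d)),
    # from random positions in a 2D gray-level image, flattened to rows
    rng = np.random.default_rng() if rng is None else rng
    s = int(np.floor(np.sqrt(d)))
    h, w = image.shape
    patches = np.empty((n, s * s))
    for i in range(n):
        r = rng.integers(0, h - s + 1)
        c = rng.integers(0, w - s + 1)
        patches[i] = image[r:r + s, c:c + s].ravel()
    return patches
```

Note that, exactly as described above, the returned vectors have dimension $\lfloor \sqrt{d} \rfloor^2$, which equals $d$ only when $d$ is a perfect square.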
Describe what you observe. Do you have an explanation for why it even seems to help to increase the size of the image patches, although this increases the formal dimension of your data?
Your answer:
# your code goes here