lpproj
is a Python implementation of Locality Preserving Projections, built to be compatible with scikit-learn. It can be installed with pip; e.g.
pip install lpproj
For more information, see http://github.com/jakevdp/lpproj
This notebook contains a very short example showing the use of the code.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
We'll use scikit-learn and create some data consisting of blobs in 300 dimensions:
from sklearn.datasets import make_blobs
X, y = make_blobs(1000, n_features=300, centers=4,
cluster_std=8, random_state=42)
If we select a few random two-dimensional projections, we can see that the clusters overlap significantly along any particular "line-of-sight" into the high-dimensional data:
fig, ax = plt.subplots(2, 2, figsize=(10, 10))
rand = np.random.RandomState(42)
for axi in ax.flat:
i, j = rand.randint(X.shape[1], size=2)
axi.scatter(X[:, i], X[:, j], c=y)
We can find a projection that preserves the locality of the points using the LocalityPreservingProjection
estimator; here we'll project the data into two dimensions:
from lpproj import LocalityPreservingProjection
lpp = LocalityPreservingProjection(n_components=2)
X_2D = lpp.fit_transform(X)
Plotting this projection, we confirm that it has kept nearby points together, as represented by the distinct clusters visible in the projection:
plt.scatter(X_2D[:, 0], X_2D[:, 1], c=y)
plt.title("Projected from 500->2 dimensions");
For more information, see the Locality Preserving Projection website
Of course, there are well-known tools that can do very similar things: for example, a standard Principal Component Analysis projection produces much the same results in this case:
from sklearn.decomposition import PCA
Xpca = PCA(n_components=2).fit_transform(X)
plt.scatter(Xpca[:, 0], Xpca[:, 1], c=y);
It is important to keep in mind, though, that these are two fundamentally different models: PCA finds a linear projection which maximizes the preserved variance in the data. LPP finds a linear projection which maximizes the preserved locality in the data.
The difference between the two can be made more clear by looking at data with different properties. One example is data with outliers. Because PCA is a variance-based method, it is strongly affected by the presence of outliers. LPP, on the other hand, focuses on preserving local neighborhoods and thus outliers do not have as strong an effect.
First we'll add ten outliers to our original data:
rand = np.random.RandomState(42)
Xnoisy = X.copy()
Xnoisy[:10] += 1000 * rand.randn(10, X.shape[1])
Now we can compute the PCA and LPP projections and view the results:
Xpca = PCA(n_components=2).fit_transform(Xnoisy)
Xlpp = LocalityPreservingProjection(n_components=2).fit_transform(Xnoisy)
fig, ax = plt.subplots(1, 2, figsize=(16, 5))
ax[0].scatter(Xlpp[:, 0], Xlpp[:, 1], c=y)
ax[0].set_title('LPP with outliers')
ax[1].scatter(Xpca[:, 0], Xpca[:, 1], c=y)
ax[1].set_title('PCA with outliers');
In the presence of outliers, the projection found by LPP is much more useful than the projection found by PCA.