As the name suggests, dimensionality reduction is the idea of reducing your feature set to a much smaller number of features. Dimensionality reduction is often used in the visualization of datasets, to try to detect samples that are similar. We will cover three dimensionality reduction techniques here: PCA, t-SNE, and UMAP.
# Packages we will use throughout this notebook
using UMAP
using Makie
using XLSX
using VegaDatasets
using DataFrames
using MultivariateStats
using RDatasets
using StatsBase
using Statistics
using LinearAlgebra
using Plots
using ScikitLearn
using MLBase
using Distances
We will use a dataset from the VegaDatasets package containing the specifications of over 400 car models.
C = DataFrame(VegaDatasets.dataset("cars"))
# | Name | Miles_per_Gallon | Cylinders | Displacement | Horsepower |
---|---|---|---|---|---|
 | String | Float64? | Int64 | Float64 | Int64? |
1 | chevrolet chevelle malibu | 18.0 | 8 | 307.0 | 130 |
2 | buick skylark 320 | 15.0 | 8 | 350.0 | 165 |
3 | plymouth satellite | 18.0 | 8 | 318.0 | 150 |
4 | amc rebel sst | 16.0 | 8 | 304.0 | 150 |
5 | ford torino | 17.0 | 8 | 302.0 | 140 |
6 | ford galaxie 500 | 15.0 | 8 | 429.0 | 198 |
7 | chevrolet impala | 14.0 | 8 | 454.0 | 220 |
8 | plymouth fury iii | 14.0 | 8 | 440.0 | 215 |
9 | pontiac catalina | 14.0 | 8 | 455.0 | 225 |
10 | amc ambassador dpl | 15.0 | 8 | 390.0 | 190 |
11 | citroen ds-21 pallas | missing | 4 | 133.0 | 115 |
12 | chevrolet chevelle concours (sw) | missing | 8 | 350.0 | 165 |
13 | ford torino (sw) | missing | 8 | 351.0 | 153 |
14 | plymouth satellite (sw) | missing | 8 | 383.0 | 175 |
15 | amc rebel sst (sw) | missing | 8 | 360.0 | 175 |
16 | dodge challenger se | 15.0 | 8 | 383.0 | 170 |
17 | plymouth 'cuda 340 | 14.0 | 8 | 340.0 | 160 |
18 | ford mustang boss 302 | missing | 8 | 302.0 | 140 |
19 | chevrolet monte carlo | 15.0 | 8 | 400.0 | 150 |
20 | buick estate wagon (sw) | 14.0 | 8 | 455.0 | 225 |
21 | toyota corona mark ii | 24.0 | 4 | 113.0 | 95 |
22 | plymouth duster | 22.0 | 6 | 198.0 | 95 |
23 | amc hornet | 18.0 | 6 | 199.0 | 97 |
24 | ford maverick | 21.0 | 6 | 200.0 | 85 |
25 | datsun pl510 | 27.0 | 4 | 97.0 | 88 |
26 | volkswagen 1131 deluxe sedan | 26.0 | 4 | 97.0 | 46 |
27 | peugeot 504 | 25.0 | 4 | 110.0 | 87 |
28 | audi 100 ls | 24.0 | 4 | 107.0 | 90 |
29 | saab 99e | 25.0 | 4 | 104.0 | 95 |
30 | bmw 2002 | 26.0 | 4 | 121.0 | 113 |
⋮ | ⋮ | ⋮ | ⋮ | ⋮ | ⋮ |
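Several rows contain missing values (note the missing entries in Miles_per_Gallon above). Before dropping them, we can count how many rows are affected; a quick sketch using DataFrames:

```julia
# count the rows that contain at least one missing value (sketch)
count(row -> any(ismissing, row), eachrow(C))
```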
dropmissing!(C) # drop the rows that have missing values
M = Matrix(C[:,2:7]) # columns 2:7 are the six numeric features (Miles_per_Gallon through Acceleration)
names(C)
9-element Vector{String}:
 "Name"
 "Miles_per_Gallon"
 "Cylinders"
 "Displacement"
 "Horsepower"
 "Weight_in_lbs"
 "Acceleration"
 "Year"
 "Origin"
car_origin = C[:,:Origin]
carmap = labelmap(car_origin) #from MLBase
uniqueids = labelencode(carmap,car_origin)
392-element Vector{Int64}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1
 ⋮
 1
 1
 1
 1
 2
 1
 1
 1
 3
 1
 1
 1
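To go back from the numeric ids to the origin names, MLBase also provides labeldecode; a quick sketch:

```julia
# recover the original labels from the encoded ids (from MLBase)
labeldecode(carmap, uniqueids[1:3]) # the first three cars are all from the USA
```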
We will first center and standardize the data.
# center and normalize the data
data = M
data = (data .- mean(data,dims = 1))./ std(data,dims=1)
392×6 Matrix{Float64}:
 -0.697747   1.48205    1.07591    0.663285   0.619748   -1.28362
 -1.08212    1.48205    1.48683    1.57258    0.842258   -1.46485
 -0.697747   1.48205    1.18103    1.18288    0.539692   -1.64609
 -0.953992   1.48205    1.04725    1.18288    0.53616    -1.28362
 -0.82587    1.48205    1.02813    0.923085   0.554997   -1.82732
 -1.08212    1.48205    2.24177    2.42992    1.60515    -2.00855
 -1.21024    1.48205    2.48068    3.00148    1.62045    -2.37102
 -1.21024    1.48205    2.34689    2.87158    1.57101    -2.55226
 -1.21024    1.48205    2.49023    3.13138    1.70404    -2.00855
 -1.08212    1.48205    1.86908    2.22208    1.02709    -2.55226
 -1.08212    1.48205    1.80219    1.70248    0.689209   -2.00855
 -1.21024    1.48205    1.39127    1.44268    0.743365   -2.73349
 -1.08212    1.48205    1.96464    1.18288    0.922314   -2.18979
  ⋮
  0.199113   0.309571  -0.128168   0.143685  -0.0383613   0.311242
  1.86471    0.309571   0.645885  -0.505815   0.0440496   0.528722
  0.327236  -0.862911  -0.367073  -0.323955  -0.462189   -0.377448
 -0.185255   0.309571   0.359199   0.195645  -0.167864   -0.304954
  1.09597   -0.862911  -0.481748  -0.220035  -0.368005   -0.594928
  1.60847   -0.862911  -0.567753  -0.531795  -0.715308   -0.92115
  0.455359  -0.862911  -0.414854  -0.375915  -0.0324748   0.637463
  0.455359  -0.862911  -0.519972  -0.479835  -0.220842    0.0212673
  2.63345   -0.862911  -0.930889  -1.36315   -0.997859    3.28348
  1.09597   -0.862911  -0.567753  -0.531795  -0.803605   -1.4286
  0.583482  -0.862911  -0.711097  -0.661694  -0.415097    1.10867
  0.967851  -0.862911  -0.720653  -0.583754  -0.303253    1.39865
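As a sanity check, every column of data should now have mean approximately 0 and standard deviation approximately 1; a minimal sketch:

```julia
# each feature should now be centered with unit variance
mean(data, dims=1) # ≈ zeros(1, 6)
std(data, dims=1)  # ≈ ones(1, 6)
```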
PCA expects each column to be an observation, so we will use the transpose of the matrix.
# each car is now a column, PCA takes features - by - samples matrix
data'
6×392 adjoint(::Matrix{Float64}) with eltype Float64:
 -0.697747  -1.08212   -0.697747  …   1.09597    0.583482   0.967851
  1.48205    1.48205    1.48205      -0.862911  -0.862911  -0.862911
  1.07591    1.48683    1.18103      -0.567753  -0.711097  -0.720653
  0.663285   1.57258    1.18288      -0.531795  -0.661694  -0.583754
  0.619748   0.842258   0.539692     -0.803605  -0.415097  -0.303253
 -1.28362   -1.46485   -1.64609   …  -1.4286     1.10867    1.39865
First, we will fit the model via PCA. The keyword argument maxoutdim sets the number of output dimensions; we want it to be 2 in this case.
p = fit(PCA,data',maxoutdim=2)
PCA(indim = 6, outdim = 2, principalratio = 0.9194828785333569)
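The principalratio of ≈0.92 above means the two retained components capture about 92% of the total variance in the data. We can verify this from the fitted model; a sketch using the MultivariateStats accessors principalvars and tvar:

```julia
# variance captured along each principal component, and its share of the total variance
principalvars(p)                 # variances of the 2 retained components
sum(principalvars(p)) / tvar(p)  # ≈ 0.92, the principalratio reported above
```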
We can obtain the projection matrix by calling the projection function:
P = projection(p)
6×2 Matrix{Float64}:
  0.398973  -0.244835
 -0.430615   0.148314
 -0.443531   0.108497
 -0.434122  -0.166158
 -0.430103   0.286095
  0.291926   0.892652
Now that we have the projection matrix P, we can apply it to one car as follows:
P'*(data[1,:]-mean(p))
2-element Vector{Float64}:
 -2.3230016965226925
 -0.5713519642644685
Or we can transform all the data via the transform function.
Yte = MultivariateStats.transform(p, data') #notice that Yte[:,1] is the same as P'*(data[1,:]-mean(p))
2×392 Matrix{Float64}:
 -2.323     -3.20196  -2.66658   -2.60214   …   1.22011   1.70921   1.86951
 -0.571352  -0.68187  -0.992744  -0.621975     -1.87471   0.632857  0.815607
We can also go back from two dimensions to six dimensions via the reconstruct function... but this time, the result will only be approximate.
# reconstruct testing observations (approximately)
Xr = reconstruct(p, Yte)
6×392 Matrix{Float64}:
 -0.786928  -1.11055  -0.820834  …   0.945785   0.526984   0.546196
  0.91558    1.27768   1.00103      -0.803445  -0.64215   -0.684075
  0.968334   1.34619   1.075        -0.744559  -0.689425  -0.740696
  1.1034     1.50334   1.32257      -0.218179  -0.847159  -0.947116
  0.835669   1.18209   0.862883     -1.06112   -0.554079  -0.570742
 -1.18816   -1.54341  -1.66462   …  -1.31728    1.06388    1.27381
norm(Xr-data') # this won't be zero
13.743841055569009
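To put this number in context, we can look at the relative reconstruction error; a quick sketch:

```julia
# fraction of the data's norm lost by projecting down to 2 dimensions
norm(Xr - data') / norm(data') # ≈ 0.28, roughly sqrt(1 - principalratio)
```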
Finally, we will generate a scatter plot of the cars:
Plots.scatter(Yte[1,:],Yte[2,:]) # a quick look at all the cars together
Plots.scatter(Yte[1,car_origin.=="USA"],Yte[2,car_origin.=="USA"],color=1,label="USA") # start a fresh plot with the American cars
Plots.xlabel!("pca component1")
Plots.ylabel!("pca component2")
Plots.scatter!(Yte[1,car_origin.=="Japan"],Yte[2,car_origin.=="Japan"],color=2,label="Japan")
Plots.scatter!(Yte[1,car_origin.=="Europe"],Yte[2,car_origin.=="Europe"],color=3,label="Europe")
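The three per-origin scatter calls above can equivalently be written as a loop over the origins; a compact sketch (the names plt and keep are just for illustration):

```julia
# build the same grouped scatter with a loop over the unique origins
plt = Plots.plot(xlabel="pca component1", ylabel="pca component2")
for (i, origin) in enumerate(unique(car_origin))
    keep = car_origin .== origin
    Plots.scatter!(plt, Yte[1, keep], Yte[2, keep], color=i, label=origin)
end
plt
```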
This is interesting! There seem to be three main clusters, with cars from the US dominating two of them.
p = fit(PCA,data',maxoutdim=3)
Yte = MultivariateStats.transform(p, data')
scatter3d(Yte[1,:],Yte[2,:],Yte[3,:],color=uniqueids,legend=false)
This is a 3D plot; even though you can set the camera view, you won't be able to move the plot around interactively. Let's use another package for this purpose: Makie.
using GLMakie
scene = Makie.scatter(Yte[1,:],Yte[2,:],Yte[3,:],color=uniqueids)
And now, you can call display(scene) to open an interactive GUI.
display(scene)
GLMakie.Screen(...)
The next method we will use for dimensionality reduction is t-SNE. There are multiple ways to call t-SNE from Julia; check out this notebook: https://github.com/nassarhuda/JuliaTutorials/blob/master/TSNE/TSNE.ipynb. But we will take this opportunity to try out something new: calling a function from the scikit-learn Python package. This makes use of the ScikitLearn package.
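For reference, a pure-Julia route also exists via the TSne package; a hedged sketch, assuming TSne is installed (it is not in the import list above):

```julia
# pure-Julia alternative (sketch; requires the TSne package, not loaded above)
using TSne
Y2_jl = tsne(data, 2) # rows are observations, as in the scikit-learn call below
```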
@sk_import manifold : TSNE
tfn = TSNE(n_components=2) #,perplexity=20.0,early_exaggeration=50)
Y2 = tfn.fit_transform(data);
Plots.scatter(Y2[:,1],Y2[:,2],color=uniqueids,legend=false,size=(400,300),markersize=3)
This is interesting! The same pattern appears to hold here too.
UMAP will be our final dimensionality reduction method; we will use the UMAP package for it.
L = cor(data,data,dims=2) # 392×392 matrix: the correlation between every pair of cars
embedding = umap(L, 2) # each column of L describes one car; embed into 2 dimensions
2×392 Matrix{Float64}:
 -6.21828  -6.74166  -6.06368  -6.37369  …   2.87221  6.41146  6.54725
 -5.30016  -5.57061  -5.86658  -5.3112      -3.53104  1.47467  1.42146
Plots.scatter(embedding[1,:],embedding[2,:],color=uniqueids)
For UMAP, we can define the distances between every pair of observations differently if we choose to. With either choice, we will see that UMAP generates a pattern very similar to what we observed with t-SNE and PCA.
L = pairwise(Euclidean(), data, data, dims=1) # Euclidean distance between every pair of cars (from Distances)
embedding = umap(-L, 2) # each column of -L serves as the feature vector describing one car
2×392 Matrix{Float64}:
  7.97172   9.97302    8.69386   8.65178   …  -6.35568  -3.924    -3.74242
 -3.46232  -0.840573  -3.34202  -3.31067       5.70094   3.44536   3.69962
Plots.scatter(embedding[1,:],embedding[2,:],color=uniqueids)
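UMAP.jl also exposes the usual hyperparameters, such as the number of neighbors, the minimum distance, and the metric, as keyword arguments. As a sketch (using the default values explicitly), we could also run it directly on the standardized features, with each column holding one car:

```julia
# run UMAP on the feature matrix itself, with explicit hyperparameters (sketch)
embedding2 = umap(data', 2; n_neighbors=15, min_dist=0.1, metric=Euclidean())
Plots.scatter(embedding2[1,:], embedding2[2,:], color=uniqueids, legend=false)
```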
After finishing this notebook, you should be able to apply PCA, t-SNE, and UMAP to visualize a high-dimensional dataset in two or three dimensions.
All the dimensionality reduction techniques we used seem to agree that European and Japanese cars are similar in their specifications, whereas American cars form two clusters of their own based on their specifications.
In the plots above, blue points are American cars; green and orange are Japanese and European cars.