The goal of this milestone is to get familiar with the graph Laplacian and its spectral decomposition.
%matplotlib inline
If you get a No module named 'sklearn' error when running the cell below, install scikit-learn with conda install scikit-learn (after activating the ntds_2018 environment).
import numpy as np
from scipy import sparse
from scipy import linalg
import scipy.sparse.linalg
import matplotlib as mpl
mpl.style.use('seaborn')
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# for ground truth checking
import pandas as pd
Let's denote your graph as $\mathcal{G} = (\mathcal{V}, \mathcal{E}, A)$, where $\mathcal{V}$ is the set of nodes, $\mathcal{E}$ is the set of edges, $A \in \mathbb{R}^{N \times N}$ is the (weighted) adjacency matrix, and $N = |\mathcal{V}|$ is the number of nodes.
Import the adjacency matrix $A$ that you constructed in the first milestone. (You're allowed to update it between milestones if you want to.)
adjacency = np.load("adjacency.npy")
adjacency_unweighted = np.copy(adjacency)
adjacency_unweighted[adjacency_unweighted!=0]=1
degrees = np.sum(adjacency_unweighted, axis = 0)
n_nodes = adjacency_unweighted.shape[0]
## We remove the nodes whose degree is smaller than the threshold
threshold = 20
node_map = np.where(degrees >= threshold)[0]
adjacency_th = np.delete(adjacency_unweighted,np.where(degrees < threshold),0)
adjacency_th = np.delete(adjacency_th,np.where(degrees < threshold),1)
degrees_th = np.sum(adjacency_th, axis = 0)
n_nodes_th = adjacency_th.shape[0]
adjacency_csr = sparse.csr_matrix(adjacency_unweighted);
degree_matrix_csc = sparse.diags(degrees,format = "csc")
From the (weighted) adjacency matrix $A$, compute both the combinatorial (also called unnormalized) and the normalized graph Laplacian matrices.
Note: if your graph is weighted, use the weighted adjacency matrix. If not, use the binary adjacency matrix.
For efficient storage and computation, store these sparse matrices in a compressed sparse row (CSR) format.
laplacian_combinatorial_csr = sparse.csr_matrix(degree_matrix_csc - adjacency_csr);
inv_degree_matrix_csr = sparse.linalg.inv(degree_matrix_csc).tocsr()
sqrt_inv_degree_matrix_csr = sparse.csr_matrix.sqrt(inv_degree_matrix_csr)
laplacian_normalized_csr = sqrt_inv_degree_matrix_csr * laplacian_combinatorial_csr * sqrt_inv_degree_matrix_csr
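As a sanity check, the identity $L_{norm} = I - D^{-1/2} A D^{-1/2}$ can be verified on a small toy graph (the 4-node graph below is an assumption for illustration, not the milestone data):

```python
import numpy as np
from scipy import sparse

# Toy 4-node graph (illustration only, not the milestone data)
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=0)
A_csr = sparse.csr_matrix(A)
D_inv_sqrt = sparse.diags(1.0 / np.sqrt(deg))
L_comb = sparse.diags(deg) - A_csr                # combinatorial Laplacian D - A
L_norm = D_inv_sqrt @ L_comb @ D_inv_sqrt         # normalized Laplacian
identity_form = sparse.eye(4) - D_inv_sqrt @ A_csr @ D_inv_sqrt
assert np.allclose(L_norm.toarray(), identity_form.toarray())
```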
Use one of them as the graph Laplacian $L$ for the rest of the milestone. We however encourage you to run the code with both to get a sense of the difference!
laplacian = laplacian_normalized_csr
Compute the eigendecomposition of the Laplacian $L = U^\top \Lambda U$, where the columns $u_k \in \mathbb{R}^N$ of $U = [u_1, \dots, u_N] \in \mathbb{R}^{N \times N}$ are the eigenvectors and the diagonal elements $\lambda_k = \Lambda_{kk}$ are the corresponding eigenvalues.
Make sure that the eigenvalues are ordered, i.e., $0 = \lambda_1 \leq \lambda_2 \leq \dots \leq \lambda_N$.
# calculate eigenvalues and eigenvectors
[eigenvalues, eigenvectors] = sparse.linalg.eigsh(laplacian_normalized_csr, k = n_nodes-1, which = 'LM')
# eigsh can return at most n_nodes-1 eigenpairs; with which='LM' the smallest
# eigenvalue (which we know is always 0) is the one omitted, so we must not
# forget this extra 0 when counting eigenvalues below.
# sort the resulting values and vectors
sortID = np.argsort(eigenvalues)
eigenvalues = eigenvalues[sortID]
eigenvalues[eigenvalues < 10**(-10)] = 0  # clip numerical noise around 0
eigenvectors = eigenvectors[:,sortID]
eigenvectors[np.abs(eigenvectors) < 10**(-10)] = 0  # use abs() so negative entries are kept
Answer
The Laplacian is always a real-valued symmetric matrix, and therefore Hermitian as well, so we can use the solver for sparse Hermitian matrices (eigsh).
Answer
The matrix $S$ is the incidence matrix, with nodes as rows and edges as columns ($S \in \mathbb{R}^{N \times M}$, where $N$ is the number of nodes and $M$ the number of edges). For an unweighted graph, $S_{i,j} = 1$ if edge $e_j$ connects vertex $v_i$ to some vertex $v_k$, $S_{k,j} = -1$, and the entry is 0 otherwise (the choice of sign is arbitrary, but each edge must have exactly one positive and one negative entry). For a weighted graph, the entries are $\sqrt{w_{i,k}}$ and $-\sqrt{w_{i,k}}$ instead of $+1$ and $-1$, respectively.
$S^\top$ acts as a graph gradient, and for a weighted graph $S^\top x$ computes:
\begin{equation} (S^\top x)[j] = \sqrt{w_{i,k}} \, (x[i] - x[k]) \end{equation}
which corresponds to the derivative of $x$ along edge $e_j = (v_i, v_k)$.
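To make this concrete, here is a minimal sketch on a toy weighted graph (the edge list is an assumption for illustration): build $S$ as described above and check that $S S^\top$ recovers the combinatorial Laplacian $L = D - A$, and that the gradient of a constant signal vanishes.

```python
import numpy as np

# Toy weighted graph on 3 nodes (edge list assumed for illustration)
edges = [(0, 1, 2.0), (1, 2, 3.0), (0, 2, 1.0)]
N, M = 3, len(edges)
S = np.zeros((N, M))
for j, (i, k, w) in enumerate(edges):
    S[i, j] = np.sqrt(w)    # arbitrary orientation: +sqrt(w) at one endpoint
    S[k, j] = -np.sqrt(w)   # -sqrt(w) at the other endpoint
A = np.zeros((N, N))
for i, k, w in edges:
    A[i, k] = A[k, i] = w
L = np.diag(A.sum(axis=0)) - A  # combinatorial Laplacian D - A
assert np.allclose(S @ S.T, L)
# the gradient of a constant signal is zero on every edge
assert np.allclose(S.T @ np.ones(N), 0)
```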
Answer
\begin{equation} \begin{split} \lambda_k & = u_k^\top L u_k \text{, from the eigendecomposition of $L$, with $u_k$ a unit vector}\\ & = u_k^\top S S^\top u_k \text{, with $L = S S^\top$}\\ & = (S^\top u_k)^\top S^\top u_k \text{, since the order of factors reverses when taking the transpose}\\ & = \| S^\top u_k \|_2^2 \text{, with $x^\top x$ being the squared Euclidean norm}\\ \end{split} \end{equation}
Answer
This quantity tells us how "smooth" $x$ is: a larger value means higher variation of $x$ across edges, while a smaller value means a smoother signal.
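A small illustration of this smoothness interpretation, on an assumed toy path graph: the quadratic form $x^\top L x$ vanishes for a constant signal and grows with the variation across edges.

```python
import numpy as np

# Path graph on 4 nodes (assumed for illustration), unit weights
A = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
L = np.diag(A.sum(axis=0)) - A
x_smooth = np.ones(4)                       # constant signal: no variation
x_rough = np.array([1.0, -1.0, 1.0, -1.0])  # alternating signal
print(x_smooth @ L @ x_smooth)  # 0.0: perfectly smooth
print(x_rough @ L @ x_rough)    # 12.0: difference of 2 squared, over 3 edges
```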
Answer
$u_0$ corresponds to the eigenvalue 0 and cannot be the zero vector. From the definition of the eigendecomposition we have:
\begin{equation} L u_0 = \lambda_0 u_0 = 0 \end{equation}
Multiplying by $u_0^\top$ gives:
\begin{equation} u_0^\top L u_0 = u_0^\top \cdot 0 = 0 \end{equation}
Since $u_0^\top L u_0$ can be written as:
\begin{equation} u_0^\top L u_0 = \frac{1}{2} \sum_{(u,v) \in E} w_{u,v}(u_0[u]-u_0[v])^2 \end{equation}
we get:
\begin{equation} \sum_{(u,v) \in E} w_{u,v}(u_0[u]-u_0[v])^2 = 0 \end{equation}
For this equation to hold, we need $u_0[u] = u_0[v]$ for every edge $(u,v) \in E$. If the graph is connected, this gives $u_0[i] = u_0[k]$ for every $i, k \in V$. From this we get that $u_0$ equals: \begin{align} u_0 &= \alpha \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} \end{align}
with $\alpha \in \mathbb{R}^*$; to obtain a unit vector, $\alpha$ must equal $\frac{1}{\sqrt{N}}$, where $N$ is the number of nodes. Thus $u_0$ is the normalized constant vector, which we denote $e$.
Let's call $u_0'$ the eigenvector of the normalized Laplacian $L_{norm}$.
From the theory, we know that $u_0' = D^{\frac{1}{2}} u_0$, and hence $u_0' = D^{\frac{1}{2}} e$ is the eigenvector of $L_{norm}$ of eigenvalue 0.
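This claim can be checked numerically on a toy connected graph (a triangle, assumed for illustration): $D^{1/2} e$ is indeed in the null space of the normalized Laplacian.

```python
import numpy as np

# Triangle graph (toy example assumed for illustration), unit weights
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)
deg = A.sum(axis=0)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_norm = np.eye(3) - D_inv_sqrt @ A @ D_inv_sqrt
u0 = np.sqrt(deg)  # D^{1/2} e, up to normalization
assert np.allclose(L_norm @ u0, 0)  # eigenvector with eigenvalue 0
```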
Comment on what you observe.
plt.plot(eigenvalues.real, "b." , markersize = 1)
plt.xlabel('Eigenvalue Index')
plt.ylabel('Eigenvalue')
Answer
As expected, the eigenvalues lie in the interval $[0, 2]$. The multiplicity of the eigenvalue 0 equals the number of connected components.
It is interesting to observe that many eigenvalues equal 1. We will not investigate this property further in this notebook, but we will keep it in mind for the final report.
# add 1 to the 0-multiplicity to account for the first eigenvalue 0, which eigsh omitted
n_conn_comp = eigenvalues[eigenvalues == 0].shape[0] + 1
print("Number of connected components:",n_conn_comp)
Number of connected components: 8
There are 8 connected components.
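The spectral count can be cross-checked combinatorially with scipy.sparse.csgraph.connected_components; here is a sketch on a toy graph of two disjoint triangles (an assumption for illustration):

```python
import numpy as np
from scipy import sparse
from scipy.sparse import csgraph

# Two disjoint triangles (toy example assumed for illustration)
triangle = np.ones((3, 3)) - np.eye(3)
adj = sparse.block_diag([triangle, triangle], format='csr')
n_comp, labels = csgraph.connected_components(adj, directed=False)
print(n_comp)  # 2
# On the milestone graph, the same call would be:
# csgraph.connected_components(adjacency_csr, directed=False)
```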
Answer
For normalized Laplacians, the eigenvalues are upper-bounded by 2, with equality if and only if the graph has a bipartite connected component.
For our graph (both the full normalized graph and the thresholded one), we observed that the largest eigenvalue of the combinatorial Laplacian was the same as the maximum degree. Finding tight upper bounds is an active field of research, and various bounds can be found in the literature.
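The bipartite case can be illustrated on a toy complete bipartite graph $K_{2,3}$ (an assumption for illustration): its normalized Laplacian attains the upper bound 2 exactly.

```python
import numpy as np

# Complete bipartite graph K_{2,3} (toy example assumed for illustration)
A = np.zeros((5, 5))
A[:2, 2:] = 1.0   # edges only between the two parts
A[2:, :2] = 1.0
deg = A.sum(axis=0)
D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
L_norm = np.eye(5) - D_inv_sqrt @ A @ D_inv_sqrt
eigs = np.linalg.eigvalsh(L_norm)
assert np.isclose(eigs.max(), 2.0)  # the bound 2 is attained (bipartite graph)
```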
Laplacian eigenmaps is a method to embed a graph $\mathcal{G}$ in a $d$-dimensional Euclidean space. That is, it associates a vector $z_i \in \mathbb{R}^d$ to every node $v_i \in \mathcal{V}$. The graph $\mathcal{G}$ is thus embedded as $Z \in \mathbb{R}^{N \times d}$.
Answer
We use Laplacian eigenmaps for dimensionality reduction. Often, graph data is intrinsically low dimensional, but lies in a very high-dimensional space. Thus, mapping a network to a vector space and reducing the dimension, while preserving relevant graph properties, can be useful for faster computation, machine learning algorithms, statistics, or visualization.
Embed your graph in $d=2$ dimensions with Laplacian eigenmaps. Try with and without re-normalizing the eigenvectors by the degrees, then keep the one you prefer.
Recompute the eigenvectors you need with a partial eigendecomposition method for sparse matrices. When $k \ll N$ eigenvectors are needed, partial eigendecompositions are much more efficient than complete eigendecompositions. A partial eigendecomposition scales as $\Omega(k |\mathcal{E}|$), while a complete eigendecomposition costs $\mathcal{O}(N^3)$ operations.
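As a sketch of how such a partial eigendecomposition can be done efficiently, the snippet below uses eigsh in shift-invert mode on a path-graph Laplacian (a stand-in for the milestone matrix, an assumption for illustration); shift-invert with a small negative sigma is usually much faster and more robust for the smallest eigenpairs than which='SM'.

```python
import numpy as np
from scipy import sparse
import scipy.sparse.linalg

# Path-graph Laplacian on N nodes (a stand-in for the milestone matrix)
N = 500
main = 2.0 * np.ones(N)
main[0] = main[-1] = 1.0        # endpoint degrees are 1
off = -np.ones(N - 1)
L = sparse.diags([off, main, off], [-1, 0, 1], format='csc')
# Shift-invert mode: eigsh returns the eigenvalues nearest sigma.
# A small negative sigma keeps L - sigma*I invertible (L itself is singular).
vals, vecs = sparse.linalg.eigsh(L, k=3, sigma=-1e-2, which='LM')
vals = np.sort(vals)
assert abs(vals[0]) < 1e-8                             # lambda_0 = 0
assert np.isclose(vals[1], 2 - 2 * np.cos(np.pi / N))  # known path-graph spectrum
```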
#Creating the matrices in the right format
adjacency_lc_csr = sparse.csr_matrix(adjacency_th);
degrees_lc = np.sum(adjacency_th, axis = 0)
degree_matrix_lc_csc = sparse.diags(degrees_lc,format = "csc")
n_nodes_lc = degrees_lc.shape[0]
#Computation of the laplacian for our graph
laplacian_combinatorial_lc_csr = sparse.csr_matrix(degree_matrix_lc_csc - adjacency_lc_csr);
inv_degree_matrix_lc_csr = sparse.linalg.inv(degree_matrix_lc_csc).tocsr()
sqrt_inv_degree_matrix_lc_csr = sparse.csr_matrix.sqrt(inv_degree_matrix_lc_csr)
laplacian_normalized_lc_csr = sqrt_inv_degree_matrix_lc_csr * laplacian_combinatorial_lc_csr * sqrt_inv_degree_matrix_lc_csr
def graph_embedding(laplacian_normalized_csr, d):
    [eigenvalues, eigenvectors] = sparse.linalg.eigsh(laplacian_normalized_csr, k = d+1, which = 'SM')
    sortID = np.argsort(eigenvalues)
    eigenvalues = eigenvalues[sortID]
    eigenvectors = eigenvectors[:,sortID]
    proj = eigenvectors[:,1:d+1]
    return proj
proj = graph_embedding(laplacian_normalized_lc_csr, 2)
plt.plot(proj[:,0],proj[:,1], '.')
Answer
With the embedding in two dimensions, airports that are connected by a flight now appear close to each other in the plot. We start to see clusters that naturally emerge from the geometry. Since the clusters represent airports that are well connected to each other, they may also reflect financially, economically, and culturally similar countries.
What does the embedding $Z \in \mathbb{R}^{N \times d}$ preserve?
Answer
The embedding $Z \in \mathbb{R}^{N \times d}$ preserves the local neighborhood structure of the graph while reducing the dimension to $d$. The goal of the Laplacian eigenmaps algorithm is to preserve local information optimally: nodes that are close to each other in the graph, i.e. connected by an edge with a large weight, remain close to each other in the embedding.
Spectral clustering is a method to partition a graph into distinct clusters. The method associates a feature vector $z_i \in \mathbb{R}^d$ to every node $v_i \in \mathcal{V}$, then runs $k$-means in the embedding space $\mathbb{R}^d$ to assign each node $v_i \in \mathcal{V}$ to a cluster $c_j \in \mathcal{C}$, where $k = |\mathcal{C}|$ is the number of desired clusters.
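The whole pipeline can be sketched on a toy graph of two dense blocks joined by a single bridge edge (an assumption for illustration): embed with the first nontrivial Laplacian eigenvectors, then run $k$-means on the embedding.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans

# Two 5-cliques joined by a single bridge edge (toy graph assumed for illustration)
block = np.ones((5, 5)) - np.eye(5)
A = sparse.block_diag([block, block]).toarray()
A[4, 5] = A[5, 4] = 1.0                        # the bridge edge
L = np.diag(A.sum(axis=0)) - A                 # combinatorial Laplacian
eigenvalues, eigenvectors = np.linalg.eigh(L)  # ascending eigenvalues
Z = eigenvectors[:, 1:3]                       # 2-D embedding, skipping u_0
labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(Z)
# nodes 0-4 and nodes 5-9 end up in two different clusters
assert labels[0] != labels[5]
```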
For this part, we have chosen to use the unweighted adjacency matrix, because we expect our graph to cluster according to geographical features. Moreover, we kept only the nodes with a degree of at least 20, because we want to see only the significant airports (an airport with fewer than 20 connections is not very significant). By applying this threshold, we also ensure that our graph is connected.
#Computation of the eigenvalues and eigenvectors for our graph
[eigenvalues_lc, eigenvectors_lc] = sparse.linalg.eigsh(laplacian_normalized_lc_csr, k = n_nodes_lc-1, which = 'LM')
#eigsh returns at most n-1 eigenpairs, so the smallest eigenvalue 0 is omitted;
#we must not forget this extra 0 when counting eigenvalues.
sortID = np.argsort(eigenvalues_lc)
eigenvalues_lc = eigenvalues_lc[sortID]
eigenvectors_lc = eigenvectors_lc[:,sortID]
n_conn_comp = eigenvalues_lc[eigenvalues_lc == 0].shape[0]+1
eigenvalues_lc[eigenvalues_lc < 10**(-10)] = 0
print("Number of connected components:",n_conn_comp)
Number of connected components: 1
# Graph Spectrum
plt.plot(eigenvalues_lc, "b." , markersize = 1)
plt.title("Graph Spectrum")
plt.xlabel("Eigenvalue index")
plt.ylabel("Eigenvalue")
Answer
#Plot eigenvalues repartitions
x = np.ones(len(eigenvalues_lc))
plt.plot(eigenvalues_lc,x,"b.",markersize = 10)
plt.xlim([0,0.5])
plt.title("Distribution of eigenvalues (zoom on the part close to 0)")
plt.xlabel("Value of the eigenvalue")
## count how many eigenvalues there are before each gap
nb_before_1rst=len(eigenvalues_lc[np.where(eigenvalues_lc< 0.2)])+1 # +1 accounts for the first 0 omitted by eigsh
nb_before_2nd=len(eigenvalues_lc[np.where(eigenvalues_lc< 0.4)])+1 # +1 accounts for the first 0 omitted by eigsh
print("number of eigenvalues before 1st gap :",nb_before_1rst)
print("number of eigenvalues before 2nd gap :",nb_before_2nd)
number of eigenvalues before 1st gap : 3 number of eigenvalues before 2nd gap : 6
We cluster the nodes with the KMeans class imported from scikit-learn. We choose to do the embedding with the eigenvectors obtained from the eigendecomposition of the normalized Laplacian.
# For k=3 and d=3
k = 3; d = 3
H = eigenvectors_lc[:,:d];
clusters3 = KMeans(n_clusters=k, random_state=0).fit_predict(H)
print("----- For k=",k," and d=",d," -----")
print("Number of elements in clusters :")
for i in range(k):
    cnt = np.sum(clusters3 == i)
    print("Cluster ", i+1, ":", cnt)
#For k=6 and d=6
k = 6; d = 6
H = eigenvectors_lc[:,:d];
clusters6 = KMeans(n_clusters=k, random_state=0).fit_predict(H)
print()
print("----- For k=",k," and d=",d," -----")
print("Number of elements in clusters :")
for i in range(k):
    cnt = np.sum(clusters6 == i)
    print("Cluster ", i+1, ":", cnt)
----- For k= 3 and d= 3 ----- Number of elements in clusters : Cluster 1 : 289 Cluster 2 : 112 Cluster 3 : 60 ----- For k= 6 and d= 6 ----- Number of elements in clusters : Cluster 1 : 107 Cluster 2 : 11 Cluster 3 : 167 Cluster 4 : 93 Cluster 5 : 58 Cluster 6 : 25
#For k=2, we have :
fiedler_vect = np.sign(eigenvectors_lc[:,0]) # column 0 is the Fiedler vector (eigsh omitted the eigenvector of eigenvalue 0)
nb_neg=len(fiedler_vect[np.where(fiedler_vect==-1)])
nb_pos=len(fiedler_vect[np.where(fiedler_vect==1)])
print("----- For k=2 (this is just an example; two clusters are not very meaningful for our graph) -----")
print("Number of elements in clusters :")
print("Cluster labeled +1 :",nb_pos)
print("Cluster labeled -1 :",nb_neg)
----- For k=2 (this is just an example; two clusters are not very meaningful for our graph) ----- Number of elements in clusters : Cluster labeled +1 : 148 Cluster labeled -1 : 313
What do you expect? What do you observe?
# For k=3
new_order3 = np.array([], dtype=int)
for i in range(3):
    new_order3 = np.append(new_order3, np.where(clusters3 == i))
plt.spy(adjacency_th[:,new_order3][new_order3], markersize=1)
plt.title("Reordered adjacency matrix for k=3")
# For k=6
plt.figure()
new_order6 = np.array([], dtype=int)
for i in range(6):
    new_order6 = np.append(new_order6, np.where(clusters6 == i))
plt.spy(adjacency_th[:,new_order6][new_order6], markersize=1)
plt.title("Reordered adjacency matrix for k=6")
We were expecting a block diagonal matrix or something close to it, because the nodes of one cluster are supposed to be mostly connected with the nodes of this same cluster. Of course the clusters are not fully independent (in this case, our graph would have several distinct components).
If you have ground truth clusters for your dataset, compare the cluster assignment from spectral clustering to the ground truth. A simple quantitative measure is to compute the percentage of nodes that have been correctly categorized. If you don't have a ground truth, qualitatively assess the quality of the clustering.
Ground truth clusters are the "real clusters".
For example, the genre of musical tracks in FMA, the category of Wikipedia articles, the spammer status of individuals, etc.
Look for the labels in the dataset descriptions.
Since there are no labels in our dataset, we have chosen to check our hypothesis of geographical clusters. To do that, we import our dataframe of airports and flights again.
# import of routes
routes = pd.read_csv('routes.dat', sep=',', encoding='utf-8', engine='python')
routes.columns = ['Airline','Airline ID','Source Airport','Source Airport ID','Destination Airport','Destination Airport ID','Codeshare','Stops','Equipment']
routes = routes.drop(columns=['Source Airport ID','Destination Airport ID'])
# import of source and destination airport
source_airports = routes[['Source Airport']]
source_airports = source_airports.rename(columns={'Source Airport':'Airport'})
dest_airports = routes[['Destination Airport']]
dest_airports = dest_airports.rename(columns={'Destination Airport':'Airport'})
# creation of a dataframe with all airports and their airport_idx
# (we use airport_idx instead of airportID because some airports have no airportID)
airports = pd.concat([source_airports,dest_airports])
airports = airports.drop_duplicates()
airports.reset_index(inplace=True)
airports = airports.drop(columns=['index'])
airports.reset_index(inplace=True)
airports = airports.set_index('Airport')
airports = airports.rename(columns={'index':'airport_idx'})
#For clustering with k=3
print("--------------------------------- For k=3 ---------------------------------\n")
for i in range(3):
    print("Cluster", i+1, " :\n", airports.index[node_map[np.where(clusters3 == i)]].values)
--------------------------------- For k=3 --------------------------------- Cluster 1 : ['DME' 'GYD' 'KZN' 'LED' 'OVB' 'SVX' 'IKT' 'YKS' 'ABJ' 'ACC' 'COO' 'DKR' 'BOD' 'BRS' 'GVA' 'LCA' 'LPA' 'TFS' 'ZRH' 'ARN' 'OSL' 'TRF' 'DWC' 'MNL' 'SSA' 'BCN' 'DLA' 'LGW' 'LOS' 'NBO' 'BKK' 'DRW' 'HKT' 'KUL' 'PEN' 'PER' 'REP' 'RGN' 'SGN' 'DUS' 'FMM' 'HAM' 'MUC' 'NAP' 'OLB' 'SZG' 'TXL' 'ZAD' 'AMS' 'BGY' 'BLQ' 'BRU' 'CGN' 'CMN' 'LYS' 'MPL' 'PMI' 'SAW' 'TLS' 'TSF' 'MEL' 'SYD' 'CAI' 'TLV' 'VKO' 'CCU' 'DAC' 'DXB' 'JED' 'KTM' 'MCT' 'HAV' 'DEL' 'ISB' 'AYT' 'BSL' 'HRG' 'SKP' 'AGP' 'ALG' 'ATH' 'BEG' 'BGO' 'BHX' 'BIA' 'BIO' 'BRE' 'BRI' 'BUD' 'CAG' 'CDG' 'CFU' 'CTA' 'CWL' 'DBV' 'DTM' 'DUB' 'EDI' 'FAO' 'FCO' 'FRA' 'HAJ' 'HEL' 'HER' 'IBZ' 'KGS' 'KRK' 'KTW' 'LEJ' 'LHR' 'LIN' 'LIS' 'MAD' 'MAN' 'MXP' 'NCE' 'NUE' 'PMO' 'PRG' 'PSA' 'RAK' 'RHO' 'SKG' 'SPU' 'STN' 'STR' 'SUF' 'TRN' 'TUN' 'VCE' 'VIE' 'VRN' 'WAW' 'WRO' 'ZAG' 'FAI' 'BET' 'BKI' 'CEB' 'CGK' 'DPS' 'NRT' 'SVO' 'TOS' 'AEP' 'CPT' 'JNB' 'BLR' 'BOM' 'COK' 'HYD' 'MAA' 'SXF' 'TBS' 'CPH' 'KJA' 'KRR' 'EVN' 'RIX' 'HND' 'ALA' 'DYU' 'FRU' 'IKA' 'IST' 'SHJ' 'KUF' 'ROV' 'ADB' 'BJV' 'DLM' 'AMM' 'HBE' 'TIP' 'BVA' 'KBP' 'KIV' 'OTP' 'AUH' 'BAH' 'CMB' 'DMM' 'DOH' 'KWI' 'RUH' 'CHQ' 'IEV' 'MRS' 'NTE' 'SOF' 'AJA' 'FLR' 'GOT' 'LIL' 'ORY' 'SXB' 'EBB' 'ABZ' 'ADL' 'AKL' 'BNE' 'CHC' 'CNS' 'CTS' 'GLA' 'ITM' 'LBA' 'LHE' 'LUX' 'NAN' 'NCL' 'PPT' 'RTM' 'SVG' 'ACE' 'ALC' 'DJE' 'FNC' 'FUE' 'GDN' 'JER' 'LCY' 'MAH' 'MLA' 'NBE' 'SVQ' 'VLC' 'ADD' 'FIH' 'LAD' 'TRD' 'VNO' 'CGH' 'VCP' 'ABV' 'BEY' 'BLL' 'DAR' 'LJU' 'MRU' 'SOU' 'TNR' 'ORN' 'MLE' 'MYY' 'SUB' 'UPG' 'AGA' 'KEF' 'MSQ' 'OPO' 'TLL' 'AHO' 'KHI' 'KRT' 'TAS' 'TSE' 'MHD' 'SYZ' 'THR' 'PFO' 'EMA' 'LPL' 'POM' 'KHV' 'GSE' 'RYG' 'ORK' 'SCQ' 'SNN' 'EBL' 'MED' 'SAH' 'WLG' 'TPS' 'ESB' 'HIR' 'VLI' 'CIA' 'CRL' 'EIN' 'GRO' 'HHN' 'LTN' 'NRN' 'NYO' 'PIK' 'WMI' 'BFS'] Cluster 2 : ['LIM' 'BOG' 'UIO' 'BSB' 'GIG' 'GRU' 'ORD' 'STL' 'YQB' 'YUL' 'ATL' 'FLL' 'JAX' 'MCO' 'PBI' 'RSW' 'TPA' 'AUA' 'ANU' 'SDQ' 'SXM' 'YVR' 'LAS' 'LAX' 'DFW' 'EZE' 'JFK' 
'MIA' 'PUJ' 'YEG' 'YOW' 'YYC' 'CUN' 'GDL' 'GUA' 'MEX' 'MTY' 'PVR' 'SAT' 'SJD' 'SJO' 'SNA' 'TIJ' 'SCL' 'CWB' 'POA' 'YWG' 'BZE' 'ANC' 'CCS' 'CLT' 'POS' 'CVG' 'DTW' 'GSP' 'LGA' 'MEM' 'MSP' 'MSY' 'BOS' 'SJU' 'PLS' 'PTY' 'EWR' 'YYZ' 'PHL' 'ABQ' 'AUS' 'BDL' 'BHM' 'BNA' 'BUF' 'BWI' 'CHS' 'CLE' 'CMH' 'CNF' 'DCA' 'DEN' 'GRR' 'HNL' 'HOU' 'IAD' 'IAH' 'IND' 'MBJ' 'MCI' 'MKE' 'MYR' 'NAS' 'OAK' 'OGG' 'OKC' 'OMA' 'PDX' 'PHX' 'PIT' 'RDU' 'SAL' 'SAN' 'SDF' 'SEA' 'SFO' 'SJC' 'SLC' 'SMF' 'YHZ' 'DAL' 'MDW' 'SFB' 'AZA' 'PIE'] Cluster 3 : ['HAK' 'HGH' 'HKG' 'KIX' 'SIN' 'SWA' 'TPE' 'CAN' 'CGO' 'CGQ' 'CKG' 'CSX' 'CTU' 'DLC' 'FOC' 'HET' 'HFE' 'HRB' 'ICN' 'INC' 'KHN' 'KMG' 'KWE' 'KWL' 'LHW' 'LJG' 'NGB' 'NKG' 'NNG' 'PEK' 'PVG' 'SHE' 'SJW' 'SYX' 'SZX' 'TAO' 'TNA' 'TSN' 'TYN' 'URC' 'WNZ' 'WUH' 'XIY' 'XMN' 'ZUH' 'HAN' 'MFM' 'NGO' 'PUS' 'CJU' 'FUK' 'CNX' 'SHA' 'KHH' 'OKA' 'RMQ' 'DMK' 'YNT' 'DAD' 'NAY']
For k=3, we observe that our clusters roughly correspond to continents.
We notice that some airports are badly clustered if we consider geographical clusters: for example, ITM (Osaka Airport, Japan) is in Cluster 1 ("Europe"). Moreover, the mix of geographic areas in Cluster 1 shows that sub-clusters exist. Choosing a higher k will probably give a result closer to the geographical "reality".
#For clustering with k=6
print("--------------------------------- For k=6 ---------------------------------\n")
for i in range(6):
    print("Cluster", i+1, " :\n", airports.index[node_map[np.where(clusters6 == i)]].values)
--------------------------------- For k=6 --------------------------------- Cluster 1 : ['LIM' 'BOG' 'UIO' 'ORD' 'STL' 'YQB' 'YUL' 'ATL' 'FLL' 'JAX' 'MCO' 'PBI' 'RSW' 'TPA' 'AUA' 'ANU' 'SDQ' 'SXM' 'YVR' 'LAS' 'LAX' 'DFW' 'JFK' 'MIA' 'PUJ' 'YEG' 'YOW' 'YYC' 'CUN' 'GDL' 'GUA' 'HAV' 'MEX' 'MTY' 'PVR' 'SAT' 'SJD' 'SJO' 'SNA' 'TIJ' 'FAI' 'BET' 'SCL' 'YWG' 'BZE' 'ANC' 'CCS' 'CLT' 'POS' 'CVG' 'DTW' 'GSP' 'LGA' 'MEM' 'MSP' 'MSY' 'BOS' 'SJU' 'PLS' 'PTY' 'EWR' 'YYZ' 'PHL' 'ABQ' 'AUS' 'BDL' 'BHM' 'BNA' 'BUF' 'BWI' 'CHS' 'CLE' 'CMH' 'DCA' 'DEN' 'GRR' 'HOU' 'IAD' 'IAH' 'IND' 'MBJ' 'MCI' 'MKE' 'MYR' 'NAS' 'OAK' 'OGG' 'OKC' 'OMA' 'PDX' 'PHX' 'PIT' 'RDU' 'SAL' 'SAN' 'SDF' 'SEA' 'SFO' 'SJC' 'SLC' 'SMF' 'YHZ' 'DAL' 'MDW' 'SFB' 'AZA' 'PIE'] Cluster 2 : ['BSB' 'GIG' 'GRU' 'SSA' 'EZE' 'AEP' 'CWB' 'POA' 'CNF' 'CGH' 'VCP'] Cluster 3 : ['BOD' 'BRS' 'GVA' 'LCA' 'LPA' 'TFS' 'ZRH' 'ARN' 'OSL' 'TRF' 'BCN' 'LGW' 'DUS' 'FMM' 'HAM' 'MUC' 'NAP' 'OLB' 'SZG' 'TXL' 'ZAD' 'AMS' 'BGY' 'BLQ' 'BRU' 'CGN' 'LYS' 'MPL' 'PMI' 'SAW' 'TLS' 'TSF' 'TLV' 'AYT' 'BSL' 'HRG' 'SKP' 'AGP' 'ALG' 'ATH' 'BEG' 'BGO' 'BHX' 'BIA' 'BIO' 'BRE' 'BRI' 'BUD' 'CAG' 'CDG' 'CFU' 'CTA' 'CWL' 'DBV' 'DTM' 'DUB' 'EDI' 'FAO' 'FCO' 'FRA' 'HAJ' 'HEL' 'HER' 'IBZ' 'KGS' 'KRK' 'KTW' 'LEJ' 'LIN' 'LIS' 'MAD' 'MAN' 'MXP' 'NCE' 'NUE' 'PMO' 'PRG' 'PSA' 'RAK' 'RHO' 'SKG' 'SPU' 'STN' 'STR' 'SUF' 'TRN' 'TUN' 'VCE' 'VIE' 'VRN' 'WAW' 'WRO' 'ZAG' 'TOS' 'SXF' 'CPH' 'RIX' 'ADB' 'BJV' 'DLM' 'BVA' 'KIV' 'OTP' 'CHQ' 'IEV' 'MRS' 'NTE' 'SOF' 'AJA' 'FLR' 'GOT' 'LIL' 'ORY' 'SXB' 'ABZ' 'GLA' 'LBA' 'LUX' 'NCL' 'RTM' 'SVG' 'ACE' 'ALC' 'DJE' 'FNC' 'FUE' 'GDN' 'JER' 'LCY' 'MAH' 'MLA' 'NBE' 'SVQ' 'VLC' 'TRD' 'VNO' 'BLL' 'LJU' 'SOU' 'TNR' 'ORN' 'AGA' 'KEF' 'OPO' 'TLL' 'AHO' 'PFO' 'EMA' 'LPL' 'GSE' 'RYG' 'ORK' 'SCQ' 'SNN' 'TPS' 'ESB' 'CIA' 'CRL' 'EIN' 'GRO' 'HHN' 'LTN' 'NRN' 'NYO' 'PIK' 'WMI' 'BFS'] Cluster 4 : ['MNL' 'BKK' 'DRW' 'HAK' 'HGH' 'HKG' 'HKT' 'KIX' 'PEN' 'PER' 'REP' 'RGN' 'SGN' 'SIN' 'SWA' 'TPE' 'CAN' 'CGO' 'CGQ' 'CKG' 'CSX' 'CTU' 'DLC' 'FOC' 'HET' 'HFE' 
'HRB' 'ICN' 'INC' 'KHN' 'KMG' 'KWE' 'KWL' 'LHW' 'LJG' 'MEL' 'NGB' 'NKG' 'NNG' 'PEK' 'PVG' 'SHE' 'SJW' 'SYD' 'SYX' 'SZX' 'TAO' 'TNA' 'TSN' 'TYN' 'WNZ' 'WUH' 'XIY' 'XMN' 'ZUH' 'BKI' 'CEB' 'CGK' 'DPS' 'HAN' 'MFM' 'NGO' 'NRT' 'PUS' 'CJU' 'FUK' 'HND' 'CNX' 'SHA' 'KHH' 'ADL' 'AKL' 'BNE' 'CHC' 'CNS' 'CTS' 'HNL' 'ITM' 'NAN' 'PPT' 'OKA' 'RMQ' 'DMK' 'MYY' 'SUB' 'UPG' 'YNT' 'DAD' 'POM' 'WLG' 'HIR' 'VLI' 'NAY'] Cluster 5 : ['ABJ' 'ACC' 'COO' 'DKR' 'DWC' 'DLA' 'LOS' 'NBO' 'KUL' 'CMN' 'CAI' 'CCU' 'DAC' 'DXB' 'JED' 'KTM' 'MCT' 'DEL' 'ISB' 'LHR' 'CPT' 'JNB' 'BLR' 'BOM' 'COK' 'HYD' 'MAA' 'IKA' 'IST' 'SHJ' 'AMM' 'HBE' 'TIP' 'AUH' 'BAH' 'CMB' 'DMM' 'DOH' 'KWI' 'RUH' 'EBB' 'LHE' 'ADD' 'FIH' 'LAD' 'ABV' 'BEY' 'DAR' 'MRU' 'MLE' 'KHI' 'KRT' 'MHD' 'SYZ' 'THR' 'EBL' 'MED' 'SAH'] Cluster 6 : ['DME' 'GYD' 'KZN' 'LED' 'OVB' 'SVX' 'IKT' 'YKS' 'URC' 'VKO' 'SVO' 'TBS' 'KJA' 'KRR' 'EVN' 'ALA' 'DYU' 'FRU' 'KUF' 'ROV' 'KBP' 'MSQ' 'TAS' 'TSE' 'KHV']
For k=6, we observe that our clusters correspond to smaller areas (countries or regions of countries).
This clustering is of good quality and shows some meaningful relations between countries. The case of Russia is quite revealing: it forms a separate cluster that also includes airports from many countries of the former Soviet Union.
Plot the cluster assignment (one color per cluster) on the 2D embedding you computed above with Laplacian eigenmaps.
colors = ['Blue', 'Red', 'Green']
for i in range(3):
    cluster = proj[np.where(clusters3 == i)[0], :].real
    plt.plot(cluster[:,0], cluster[:,1], '.', color=colors[i])
Answer
Why did we use the eigenvectors of the graph Laplacian as features? Could we use other features for clustering?
Answer
The eigenvectors give an embedding of the nodes in a lower-dimensional space that keeps similar nodes (in the sense of their edge weights) close together. Similar nodes are thus close to each other with respect to the Euclidean distance used by the $k$-means clustering. Therefore, the eigenvectors are a good choice of features for clustering nodes.
We could also have used other features, such as the eigenvectors of the adjacency matrix.