import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import spatial
from sklearn import decomposition, cluster
sns.set(rc={'figure.figsize' : (10,10)})
Let $r_{ij}$ denote the correlation between the $i$th and $j$th observations; then the quantity $1 - r_{ij}$ is proportional to the squared Euclidean distance between the $i$th and $j$th observations. On the USArrests data, show that this proportionality holds.
arrests_df = pd.read_csv('arrests.csv')
arrests_states = arrests_df[arrests_df.columns[0]].rename('state')
arrests_df = arrests_df.drop(arrests_df.columns[0], axis=1)
arrests_df.describe()
arrests_df = (arrests_df - arrests_df.mean()) / arrests_df.std()
arrests_df.head()
| | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| 0 | 1.242564 | 0.782839 | -0.520907 | -0.003416 |
| 1 | 0.507862 | 1.106823 | -1.211764 | 2.484203 |
| 2 | 0.071633 | 1.478803 | 0.998980 | 1.042878 |
| 3 | 0.232349 | 0.230868 | -1.073593 | -0.184917 |
| 4 | 0.278268 | 1.262814 | 1.758923 | 2.067820 |
# squared Euclidean distances between all pairs of observations
dist = spatial.distance.pdist(arrests_df, 'euclidean') ** 2
# pdist's 'correlation' metric returns 1 - r_ij, so corr holds the correlations r_ij
corr = 1 - spatial.distance.pdist(arrests_df, 'correlation')
dist_mtrx = spatial.distance.squareform(dist)
corr_mtrx = spatial.distance.squareform(corr)
# plot each observation's squared distances against its correlations: a decreasing,
# roughly linear trend means 1 - r_ij is (approximately) proportional to squared distance
for i in range(arrests_df.shape[0]):
    sns.scatterplot(x=dist_mtrx[i], y=corr_mtrx[i])
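Why a (roughly) linear trend is expected: if each observation $x_i \in \mathbb{R}^p$ is centered and scaled across its $p$ features, so that $\sum_k x_{ik}^2 = p$ and $r_{ij} = \frac{1}{p}\sum_k x_{ik} x_{jk}$, then

$$\|x_i - x_j\|^2 = \sum_k x_{ik}^2 + \sum_k x_{jk}^2 - 2\sum_k x_{ik} x_{jk} = 2p\,(1 - r_{ij}),$$

an exact proportionality. Here the columns (variables) rather than the rows were standardized, so the plots above show the relationship only approximately.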
In Section 10.2.3, a formula for calculating PVE was given in Equation 10.8. We also saw that the PVE can be obtained using the sdev output of the prcomp() function. On the USArrests data, calculate PVE in two ways:
(a) Using the sdev output of the prcomp() function, as was done in Section 10.2.3.
(b) By applying Equation 10.8 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 10.8 to obtain the PVE.
.. obviously, as I'm using Python instead of R, I'll take a different tack
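For reference, Equation 10.8 defines the PVE of the $m$th principal component as

$$\mathrm{PVE}_m = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm} x_{ij} \right)^2}{\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^2},$$

i.e. the sum of squared scores on component $m$ over the total sum of squares of the (centered) data. The "by hand" computation below builds exactly these two quantities from sklearn's loadings, `pca.components_`.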
pca = decomposition.PCA().fit(arrests_df)
pve_1 = pca.explained_variance_ratio_
_, (ax1, ax2) = plt.subplots(ncols=2, figsize=(15,10))
ax = sns.barplot(x=list(range(1, len(pve_1) + 1)), y=pve_1, ax=ax1)
ax.set_title('SKLearn `explained_variance_ratio`')
ax.set_ylabel('PVE')
ax.set_xlabel('Principal Component')
# denominator of Equation 10.8: total variance of the (already centered) data
total_var = ((arrests_df ** 2).sum() / arrests_df.shape[0]).sum()
# numerator: variance of the scores z_im = sum_j phi_jm x_ij for each component m
pc_var = ((arrests_df @ pca.components_.T) ** 2).sum() / arrests_df.shape[0]
pve_2 = pc_var / total_var
ax = sns.barplot(x=list(range(1, len(pve_1) + 1)), y=pve_2, ax=ax2)
ax.set_title('PVE calculated by hand')
ax.set_ylabel('PVE')
ax.set_xlabel('Principal Component')
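As a sanity check, the two PVE vectors should agree up to floating-point error (the data were already column-centered, so sklearn's internal centering changes nothing):
np.allclose(pve_1, pve_2)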
(a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
(b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
(c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
(d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
.. TODO: find a better python dendrogram library (for now, a scipy-based three-cluster cut is sketched below the clustermaps)
sns.clustermap(arrests_df, metric='euclidean', method='complete')
arrests_df_not_std = pd.read_csv('arrests.csv')
arrests_states_not_std = arrests_df_not_std[arrests_df_not_std.columns[0]].rename('state')
arrests_df_not_std = arrests_df_not_std.drop(arrests_df_not_std.columns[0], axis=1)
sns.clustermap(arrests_df_not_std, metric='euclidean', method='complete')
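seaborn's clustermap draws the dendrograms but doesn't expose the three-cluster cut that (b) asks for. A minimal sketch using scipy directly (same complete linkage and Euclidean distance, on the standardized data) to cut the tree into three clusters and list the states in each:
from scipy.cluster import hierarchy

# complete-linkage tree over the standardized observations
linkage_matrix = hierarchy.linkage(arrests_df, metric='euclidean', method='complete')
# cut the tree so that exactly three clusters remain
labels = hierarchy.fcluster(linkage_matrix, t=3, criterion='maxclust')
# list which states fall into each of the three clusters
pd.DataFrame({'state': arrests_states.values, 'cluster': labels}).groupby('cluster')['state'].apply(list)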
(a) Generate a simulated data set with 20 observations in each of three classes (i.e. 60 observations total), and 50 variables.
# 60 x 50 matrix of standard-normal noise
toy_df = pd.DataFrame(np.random.normal(0, 1, (60, 50)))
# true class labels: 20 observations in each of three classes
true_classes = pd.Series(np.concatenate([np.repeat(0, 20),
                                         np.repeat(1, 20),
                                         np.repeat(2, 20)])).rename('class')
# shift each class's mean (by 0, 1, or 2 on every variable) so the classes separate
toy_df = pd.DataFrame((true_classes.values.reshape(60, 1) + toy_df.values))
toy_df = (toy_df - toy_df.mean()) / toy_df.std()
(b) Perform PCA on the 60 observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the three classes. If the three classes appear separated in this plot, then continue on to part (c). If not, then return to part (a) and modify the simulation so that there is greater separation between the three classes. Do not continue to part (c) until the three classes show at least some separation in the first two principal component score vectors.
pca = decomposition.PCA().fit(toy_df)
projection = pd.DataFrame(pca.transform(toy_df))
scatter_df = pd.concat([projection[[0,1]], true_classes], axis=1)
ax = sns.scatterplot(x=0, y=1, hue='class', data=scatter_df)
_ = ax.set_ylabel('Second Principal Component')
_ = ax.set_xlabel('First Principal Component')
(c) Perform K-means clustering of the observations with K = 3. How well do the clusters that you obtained in K-means clustering compare to the true class labels?
cluster.KMeans(n_clusters=3).fit_predict(toy_df)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2], dtype=int32)
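To quantify the agreement, a cross-tabulation of the true classes against the K-means assignments (the Python analogue of R's table()) works; note the cluster numbering itself is arbitrary, and k3_preds is just a local name for a re-run of the K = 3 fit above.
# rows: true classes, columns: K-means cluster labels
k3_preds = pd.Series(cluster.KMeans(n_clusters=3).fit_predict(toy_df)).rename('cluster')
pd.crosstab(true_classes, k3_preds)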
(d) Perform K-means clustering with K = 2. Describe your results.
cluster.KMeans(n_clusters=2).fit_predict(toy_df)
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
(e) Now perform K-means clustering with K = 4, and describe your results.
cluster.KMeans(n_clusters=4).fit_predict(toy_df)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 3, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 3, 2, 3, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
(f) Now perform K-means clustering with K = 3 on the first two principal component score vectors, rather than on the raw data. That is, perform K-means clustering on the 60 × 2 matrix of which the first column is the first principal component score vector, and the second column is the second principal component score vector. Comment on the results.
# cluster on the first two principal component score vectors rather than the raw data
preds = cluster.KMeans(n_clusters=3).fit_predict(projection[[0, 1]])
scatter_df = pd.concat([projection[[0,1]], pd.Series(preds).rename('class')], axis=1)
ax = sns.scatterplot(x=0, y=1, hue='class', data=scatter_df)
ax.set_ylabel('Second Principal Component')
ax.set_xlabel('First Principal Component')
(g) Using the scale() function, perform K-means clustering with K = 3 on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in (b)? Explain.
I've been using the scaled variables throughout, as the authors recommend in the book.
# regenerate an unscaled version of the data: fresh noise plus the same class-mean shifts
toy_df_not_std = pd.DataFrame(np.random.normal(0, 1, (60, 50)))
toy_df_not_std = pd.DataFrame((true_classes.values.reshape(60, 1) + toy_df_not_std.values))
cluster.KMeans(n_clusters=3).fit_predict(toy_df_not_std)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
The gene expression data set Ch10Ex11.csv consists of 40 tissue samples with measurements on 1,000 genes. The first 20 samples are from healthy patients, while the second 20 are from a diseased group.
(a) Load in the data.
gene_df = pd.read_csv('Ch10Ex11.csv', header=None)
gene_df.head()
| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | -0.961933 | 0.441803 | -0.975005 | 1.417504 | 0.818815 | 0.316294 | -0.024967 | -0.063966 | 0.031497 | -0.350311 | ... | -0.509591 | -0.216726 | -0.055506 | -0.484449 | -0.521581 | 1.949135 | 1.324335 | 0.468147 | 1.061100 | 1.655970 |
1 | -0.292526 | -1.139267 | 0.195837 | -1.281121 | -0.251439 | 2.511997 | -0.922206 | 0.059543 | -1.409645 | -0.656712 | ... | 1.700708 | 0.007290 | 0.099062 | 0.563853 | -0.257275 | -0.581781 | -0.169887 | -0.542304 | 0.312939 | -1.284377 |
2 | 0.258788 | -0.972845 | 0.588486 | -0.800258 | -1.820398 | -2.058924 | -0.064764 | 1.592124 | -0.173117 | -0.121087 | ... | -0.615472 | 0.009999 | 0.945810 | -0.318521 | -0.117889 | 0.621366 | -0.070764 | 0.401682 | -0.016227 | -0.526553 |
3 | -1.152132 | -2.213168 | -0.861525 | 0.630925 | 0.951772 | -1.165724 | -0.391559 | 1.063619 | -0.350009 | -1.489058 | ... | -0.284277 | 0.198946 | -0.091833 | 0.349628 | -0.298910 | 1.513696 | 0.671185 | 0.010855 | -1.043689 | 1.625275 |
4 | 0.195783 | 0.593306 | 0.282992 | 0.247147 | 1.978668 | -0.871018 | -0.989715 | -1.032253 | -1.109654 | -0.385142 | ... | -0.692998 | -0.845707 | -0.177497 | -0.166491 | 1.483155 | -1.687946 | -0.141430 | 0.200778 | -0.675942 | 2.220611 |
5 rows × 40 columns
(b) Apply hierarchical clustering to the samples using correlation-based distance, and plot the dendrogram. Do the genes separate the samples into the two groups? Do your results depend on the type of linkage used?
(c) Your collaborator wants to know which genes differ the most across the two groups. Suggest a way to answer this question, and apply it here.
sns.clustermap(gene_df, metric='correlation', method='single')
sns.clustermap(gene_df, metric='correlation', method='average')
sns.clustermap(gene_df, metric='correlation', method='complete')
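For (c), one simple approach (a sketch, not necessarily what the authors have in mind; healthy_mean and diseased_mean are just illustrative names) is to rank genes by the absolute difference in mean expression between the two groups. The same scipy machinery also gives a direct two-group cut of the samples for (b):
from scipy.cluster import hierarchy

# (b) cut the sample (column) dendrogram into two groups; columns 0-19 are the
# healthy patients and columns 20-39 the diseased group
sample_linkage = hierarchy.linkage(gene_df.T, metric='correlation', method='complete')
print(hierarchy.fcluster(sample_linkage, t=2, criterion='maxclust'))

# (c) rank genes by the absolute difference in mean expression across the two groups
healthy_mean = gene_df.iloc[:, :20].mean(axis=1)
diseased_mean = gene_df.iloc[:, 20:].mean(axis=1)
(diseased_mean - healthy_mean).abs().sort_values(ascending=False).head(10)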