Homework 7 in DSCI445: Statistical Machine Learning @ CSU
Be sure to call set.seed(445).
Consider the USArrests data. We will now perform hierarchical clustering on the states.
Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
What effect does scaling the variables have on the hierarchical clusters obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
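The mechanics of the steps above can be sketched as follows. This is only a minimal outline, assuming the USArrests data frame that ships with base R; the object names (hc_raw, clusters_scaled, and so on) are illustrative, and the interpretation is left to you.

```r
# Sketch only: USArrests is part of the datasets package loaded with base R.
set.seed(445)

# Complete linkage on Euclidean distances, variables left unscaled
hc_raw <- hclust(dist(USArrests, method = "euclidean"), method = "complete")
clusters_raw <- cutree(hc_raw, k = 3)  # cut the tree so three clusters remain

# Same procedure after scaling; scale() also centers each variable,
# which does not change Euclidean distances between observations
hc_scaled <- hclust(dist(scale(USArrests)), method = "complete")
clusters_scaled <- cutree(hc_scaled, k = 3)

# List the states in each cluster
split(names(clusters_raw), clusters_raw)

# Cross-tabulate the two assignments to see how scaling moves states around
table(raw = clusters_raw, scaled = clusters_scaled)
```

Calling plot(hc_raw) and plot(hc_scaled) draws the two dendrograms, which helps when choosing a cut height by eye rather than by the number of clusters.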
In this problem you will generate simulated data and then perform PCA and \(K\)-means clustering on the data. First run the following to obtain the data.
library(mvtnorm)
n <- 20
p <- 10
x <- rmvnorm(n*3, rep(0, p))
# shift means
x[seq_len(n), ] <- x[seq_len(n), ] + matrix(rep(runif(p, min = 1, max = 3), n), nrow = n, byrow = TRUE)
x[seq_len(n) + 2*n, ] <- x[seq_len(n) + 2*n, ] + matrix(rep(runif(p, min = -3, max = -1), n), nrow = n, byrow = TRUE)
# add class labels
y <- c(rep("-1", n), rep("0", n), rep("1", n))
Perform PCA on the \(60\) observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the true classes (y).
Perform \(K\)-means clustering of the observations with \(K = 3\). How well do the clusters you obtained in \(K\)-means clustering compare to the true class labels? (Hint: table() may be useful here.)
Perform \(K\)-means clustering of the observations with \(K = 2\). Describe your results.
Perform \(K\)-means clustering of the observations with \(K = 4\). Describe your results.
Now perform \(K\)-means clustering with \(K = 3\) on the first two principal components rather than the raw data. Comment on the results.
Using the scale() function, perform \(K\)-means clustering with \(K = 3\) on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in b)-e)?
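The PCA and \(K\)-means calls used in this problem can be sketched on stand-in data. The block below is a rough outline under stated assumptions: x_demo and y_demo are hypothetical substitutes built with base R only (fixed mean shifts of +2 and -2 instead of the random shifts above), so the numbers will not match those from the assignment's x and y. The choice nstart = 20 is also an assumption, not a requirement of the assignment.

```r
set.seed(445)

# Stand-in data (an assumption, not the assignment's x): three groups of
# n observations in p dimensions, built with base R only
n <- 20
p <- 10
x_demo <- matrix(rnorm(3 * n * p), ncol = p)
x_demo[seq_len(n), ] <- x_demo[seq_len(n), ] + 2
x_demo[seq_len(n) + 2 * n, ] <- x_demo[seq_len(n) + 2 * n, ] - 2
y_demo <- rep(c("-1", "0", "1"), each = n)

# PCA: prcomp() centers by default, and $x holds the score vectors
pca <- prcomp(x_demo)
plot(pca$x[, 1:2], col = as.integer(factor(y_demo)), pch = 19,
     xlab = "PC1", ylab = "PC2")

# K-means with several random starts, compared against the true labels
km3 <- kmeans(x_demo, centers = 3, nstart = 20)
table(cluster = km3$cluster, truth = y_demo)

# K-means on the first two score vectors, and on the scaled variables
km_pc <- kmeans(pca$x[, 1:2], centers = 3, nstart = 20)
km_scaled <- kmeans(scale(x_demo), centers = 3, nstart = 20)
table(cluster = km_pc$cluster, truth = y_demo)
table(cluster = km_scaled$cluster, truth = y_demo)
```

Cluster numbers returned by kmeans() are arbitrary, so a good match shows up in table() as one dominant count per row and column rather than as counts on the diagonal. Changing centers to 2 or 4 covers the other values of \(K\).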
In this folder, there is a data set called gene_exp.csv that consists of \(40\) tissue samples with measurements on \(1,000\) genes. The first \(20\) are from healthy patients while the second \(20\) are from a diseased group.
Load the data into R. Note that there are no headers in the file.
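Reading a file that has no header row can be sketched as follows. To keep the example self-contained it writes a tiny, made-up CSV to a temporary file. The commented-out last line shows the same call for gene_exp.csv, assuming that file sits in the current working directory.

```r
# Tiny header-less CSV in a temporary file (made-up values, for illustration)
tmp <- tempfile(fileext = ".csv")
writeLines(c("0.1,0.2,0.3", "1.1,1.2,1.3"), tmp)

# header = FALSE keeps the first row as data; columns are named V1, V2, ...
demo <- read.csv(tmp, header = FALSE)
dim(demo)
names(demo)

# Same call for the assignment file (path is an assumption):
# gene_exp <- read.csv("gene_exp.csv", header = FALSE)
```

After loading the real file, dim() shows whether the samples are stored in rows or in columns. If they are in columns, t() transposes the data so that each row is one observation, as functions such as dist() and kmeans() expect.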