hw-7

Homework 7 in DSCI445: Statistical Machine Learning @ CSU

Assignment

Be sure to call set.seed(445) before running any code that uses random numbers.
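As a quick illustration of why the seed matters, the sketch below resets the seed before two identical random draws and checks that they match. The variable names `a` and `b` are placeholders for this example only.

```r
# Resetting the seed before each draw reproduces the same random numbers.
set.seed(445)
a <- runif(3)

set.seed(445)
b <- runif(3)

identical(a, b)  # TRUE: both draws started from the same seed
```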

Consider the USArrests data. We will now perform hierarchical clustering on the states.
1. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
2. Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
3. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
4. What effect does scaling the variables have on the hierarchical clusters obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
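The base R functions involved in the steps above can be sketched on a small made-up data set. This is only an illustration of the calls (`dist()`, `hclust()`, `cutree()`, `scale()`), not a solution: the `toy` data frame and every object name below are hypothetical, and the values were chosen only so that the two variables sit on very different scales.

```r
# Hypothetical toy data (not the assignment data): 6 observations,
# 2 variables measured on very different scales.
toy <- data.frame(
  small = c(1.0, 1.2, 0.9, 5.0, 5.1, 9.0),
  large = c(100, 300, 200, 150, 250, 120)
)
rownames(toy) <- paste0("obs", 1:6)

# Complete linkage on Euclidean distances of the raw variables.
hc_raw <- hclust(dist(toy, method = "euclidean"), method = "complete")

# Cut the tree so that exactly three clusters result.
clusters_raw <- cutree(hc_raw, k = 3)

# Same procedure after scaling each variable to standard deviation one.
toy_scaled <- scale(toy)
hc_scaled <- hclust(dist(toy_scaled, method = "euclidean"), method = "complete")
clusters_scaled <- cutree(hc_scaled, k = 3)

# Cross-tabulate the two assignments to see how scaling changed membership.
table(raw = clusters_raw, scaled = clusters_scaled)
```

`cutree()` also accepts a height through its `h` argument instead of `k`, and `plot(hc_raw)` draws the dendrogram if you want to pick a cut height by eye.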
In this problem you will generate simulated data and then perform PCA and \(K\)-means clustering on the data. First run the following to obtain the data.
```r
library(mvtnorm)

n <- 20
p <- 10
x <- rmvnorm(n*3, rep(0, p))

# shift means
x[seq_len(n), ] <- x[seq_len(n), ] + matrix(rep(runif(p, min = 1, max = 3), n), nrow = n, byrow = TRUE)
x[seq_len(n) + 2*n, ] <- x[seq_len(n) + 2*n, ] + matrix(rep(runif(p, min = -3, max = -1), n), nrow = n, byrow = TRUE)

# add class labels
y <- c(rep("-1", n), rep("0", n), rep("1", n))
```
1. Perform PCA on the \(60\) observations and plot the first two principal component score vectors. Use a different color to indicate the observations in each of the true classes (y).
2. Perform \(K\)-means clustering of the observations with \(K = 3\). How well do the clusters you obtained in \(K\)-means clustering compare to the true class labels? (Hint: table() may be useful here.)
3. Perform \(K\)-means clustering of the observations with \(K = 2\). Describe your results.
4. Perform \(K\)-means clustering of the observations with \(K = 4\). Describe your results.
5. Now perform \(K\)-means clustering with \(K = 3\) on the first two principal components rather than the raw data. Comment on the results.
6. Using the scale() function, perform \(K\)-means clustering with \(K = 3\) on the data after scaling each variable to have standard deviation one. How do these results compare to those obtained in parts 2-5?
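The functions these parts rely on (`prcomp()`, `kmeans()`, `table()`, `scale()`) can be sketched on a separate made-up data set. Everything below is an assumption for illustration: the two-group `toy` matrix, the `truth` labels, the choice of two clusters, and `nstart = 20` are not part of the assignment.

```r
set.seed(445)

# Hypothetical toy data: two groups of 15 points in 4 dimensions,
# the second group shifted by 4 in every coordinate.
toy <- rbind(
  matrix(rnorm(15 * 4), nrow = 15),
  matrix(rnorm(15 * 4, mean = 4), nrow = 15)
)
truth <- rep(c("a", "b"), each = 15)

# PCA: the principal component score vectors are the columns of pca$x.
pca <- prcomp(toy)
plot(pca$x[, 1:2], col = as.integer(factor(truth)), pch = 19,
     xlab = "PC1", ylab = "PC2")

# K-means on the raw data; several random starts reduce the
# sensitivity to the initial cluster centers.
km_raw <- kmeans(toy, centers = 2, nstart = 20)

# Cluster numbers are arbitrary, so compare with a contingency table
# rather than checking labels for equality.
table(cluster = km_raw$cluster, truth = truth)

# K-means on the first two principal component scores only.
km_pc <- kmeans(pca$x[, 1:2], centers = 2, nstart = 20)

# K-means after scaling each variable to standard deviation one.
km_scaled <- kmeans(scale(toy), centers = 2, nstart = 20)
```

Because `kmeans()` numbers its clusters arbitrarily, a table with one dominant count per row and column indicates good agreement even when the cluster numbers do not line up with the class labels.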

Be sure to share your server project with the instructor and grader:

Open your hw-7 project on liberator.stat.colostate.edu.
Click the drop-down menu on the project (top right side) > Share Project…
Click the drop-down menu and add “dsci445instructors” to your project.

This is how you receive points for reproducibility on your homework!