library(ISLR) library(tidyverse) library(tidymodels) # data str(Hitters) ## Reproducibility set.seed(445) ## Data Processing # 1. Remove records with missing data. Create a new (complete) version of your data set. (Hint: `drop_na` in `tidyr` could be helpful.) # 2. You may need to create dummy variables for categorical variables in your recipes. `step_dummy(all_nominal_predictors())` is a good way to do this. # 3. You may need to standardize all variables in your recipes. `step_normalize(all_predictors())` is a good way to do this. ## Ridge Regression # 1. Create a vector of $\lambda$ values from $\lambda = .01$ to $\lambda = 10^10$ of length $100$. # 2. Fit a ridge regression model for each $\lambda$ in your grid. Be sure to normalize your predictors. # 3. Make a line plot of coefficient corresponding to each $\lambda$. You should have an individual line for each variable with coefficient value on the $y$-axis and $\lambda$ on the $x$ axis. What happens to your coefficients as $\lambda$ increases? # 4. Perform $10$-fold cross validation and get an estimate of the test MSE for each $\lambda$ in your grid. Which $\lambda$ would you choose and why? (Hint: look at the `tune` package for a fast way to do this.) ## Lasso # 1. Fit the lasso model for each $\lambda$ in your grid. # 2. Make a line plot of coefficient corresponding to each $\lambda$. You should have an individual line for each variable with coefficient value on the $y$-axis and $\lambda$ on the $x$ axis. (Hint: `coef` may be a useful function). What happens to your coefficients as $\lambda$ increases? # 3. Perform $10$-fold cross validation and get an estimate of the test MSE for each $\lambda$ in your grid. Which $\lambda$ would you choose and why? ## Principal Components Regression # 1. Fit the PCR model using 10-fold cross validation for values of $M$. Be sure to normalize your predictors. # 2. Create a plot of the CV MSE vs. $M$. # 3. When does the smallest cross-validation error occur? Which $M$ would you choose for your final model? # 4. How much variability in $Y$ is explained for your chosen value of $M$? ## Partial Least Squares # 1. Fit the PLS model using 10-fold cross validation for values of $M$. Be sure to normalize your predictors. # 2. Create a plot of the CV MSE vs. $M$. # 3. When does the smallest cross-validation error occur? Which $M$ would you choose for your final model? # 4. How much variability in $Y$ is explained for your chosen value of $M$? # 5. Discuss the difference between PCR and PLS results. Which would you prefer?