# R Programming Task on Model Comparison and Lasso

```{r setup, include = FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(cvTools)
#library(glmnet)
library(sandwich)
```

### Warning

**Warning:** If you get an error when running the code, make sure that you have the necessary packages installed. You can install these packages by running the following code in base R (not RStudio or RMarkdown).

> install.packages("alr4")

> install.packages("cvTools")

> install.packages("glmnet")

> install.packages("sandwich")

## Model Comparison (5 points)

Use the Puromycin data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/Puromycin). For this problem, you will run through various methods of model comparison: Adjusted R-squared, AIC, LRT, and LOOCV.

1. Create two models that predict rate. The first model will use conc as the predictor. The second model will use conc and state as predictors. Do not include any interactions or other transformations. Run each of your models through the summary() function.

```{r}
data(Puromycin)
# Model 1: rate explained by concentration only
lm.rate <- lm(rate ~ conc, data = Puromycin)
# Model 2: rate explained by concentration and treatment state
lm.rate2 <- lm(rate ~ conc + state, data = Puromycin)
summary(lm.rate)
summary(lm.rate2)
```

Which model is preferred according to adjusted R-squared?

The model with conc and state is preferred because it has the higher adjusted R-squared value.

2. Use the AIC() function to calculate the AIC for each model.

```{r}
AIC(lm.rate, lm.rate2)
```

Which model is preferred according to AIC?

The model with conc and state is preferred because it has the lower AIC.

3. Run a likelihood ratio test between the two models.

```{r}
anova(lm.rate, lm.rate2, test = "LRT")
```

Which model is preferred according to the LRT?

There is statistically significant evidence that adding state improves the model (p < 0.05), so the model with state is preferred.

4. Use the cvFit function within the cvTools package to run LOOCV for both models.

```{r}
# K = nrow(Puromycin) gives one fold per observation, i.e. leave-one-out CV
cvFit(lm.rate, data = Puromycin, y = Puromycin$rate, cost = rmspe, K = nrow(Puromycin))
cvFit(lm.rate2, data = Puromycin, y = Puromycin$rate, cost = rmspe, K = nrow(Puromycin))
```

Which model is preferred according to LOOCV?

The model with conc and state is preferred because it has the lower LOOCV prediction error (RMSPE).

***

## Lasso (5 points)

Use the swiss data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/swiss). For this problem, you will use Lasso regression to predict Fertility based on all other predictors. You will need to utilize the glmnet() and cv.glmnet() functions within the glmnet package. See the Lecture Notes for details.

1. Create the design matrix using the model.matrix() function. This is used as one of the arguments for glmnet() when running Lasso.

```{r}
data(swiss)
matrix <- model.matrix(Fertility ~ Agriculture + Examination + Education + Catholic, swiss)
matrix
```

2. Run Lasso using different values of $\lambda$ to determine which variable is the first to have its estimated slope coefficient go to 0. This may take multiple iterations. Remember to set the value of $\alpha$ to 1, as we did in the Lecture Notes.

```{r}
model1 <- glmnet::glmnet(matrix, swiss$Fertility, alpha = 1)
# Coefficients at a grid of lambda values, one column per lambda
coef(model1, c(0.1, 1, 2, 3, 4, 5, 10, 15, 20, 30, 50, 100))
```

Which variable was the first to have its estimated slope coefficient go to 0?

Agriculture
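As a cross-check on the trial-and-error search, the fitted coefficient path can be inspected directly. The sketch below is only an illustration: the names `x`, `fit`, `beta`, and `entry_lambda` are not part of the assignment, the intercept column is dropped because glmnet fits its own intercept, and the approach assumes each coefficient stays at zero once it has been shrunk to zero (a lasso path does not guarantee this in general).

```r
library(glmnet)

data(swiss)
# Same four predictors as above; drop the intercept column added by model.matrix()
x <- model.matrix(Fertility ~ Agriculture + Examination + Education + Catholic, swiss)[, -1]
fit <- glmnet(x, swiss$Fertility, alpha = 1)

# Rows are predictors, columns are the lambda values in fit$lambda
beta <- as.matrix(fit$beta)

# For each predictor, the largest lambda on the path at which its coefficient is non-zero
entry_lambda <- apply(beta, 1, function(b) {
  if (any(b != 0)) max(fit$lambda[b != 0]) else NA_real_
})

# Smallest value first: that predictor is zeroed out first as lambda increases
sort(entry_lambda)
names(which.min(entry_lambda))
```

Calling `plot(fit, xvar = "lambda", label = TRUE)` draws the same path and gives a visual version of this ordering.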

3. At what (approximate) value of $\lambda$ do **all** of the estimated slope coefficients go to 0? An integer value is fine here!
```{r}
# Coefficients at lambda = 2, 3, ..., 9 (one column per lambda)
coef(model1, s = 2:9)
```

All of the estimated slope coefficients are 0 at approximately $\lambda = 9$.
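The integer answer can be compared with the top of the default $\lambda$ sequence. According to the glmnet documentation, that sequence starts at a data-derived value that is the smallest $\lambda$ for which all coefficients are zero. A minimal sketch, with `x`, `fit`, and `lambda_max` as illustrative names and the intercept column dropped as an assumption:

```r
library(glmnet)

data(swiss)
x <- model.matrix(Fertility ~ Agriculture + Examination + Education + Catholic, swiss)[, -1]
fit <- glmnet(x, swiss$Fertility, alpha = 1)

# fit$lambda is stored in decreasing order, so the first entry is the largest
lambda_max <- fit$lambda[1]
lambda_max

# Number of non-zero slope coefficients at that lambda
fit$df[1]

# Direct check: every slope (all rows except the intercept) is zero at lambda_max
all(as.matrix(coef(fit, s = lambda_max))[-1, ] == 0)
```

Rounding `lambda_max` up to the next integer gives a value that can be compared with the answer found by stepping through integer values of $\lambda$.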

4. Use the cv.glmnet() function to determine the optimal value for $\lambda$.
```{r}
cv.lasso <- glmnet::cv.glmnet(matrix, swiss$Fertility, alpha = 1)
cv.lasso$lambda.min
```

What value of $\lambda$ is returned? Note: This will vary from trial-to-trial (that’s okay!).

0.02128741.
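The trial-to-trial variation comes from the random assignment of observations to cross-validation folds. If a reproducible value is wanted, one option is to fix the random seed before calling cv.glmnet(). The sketch below is an illustration only: the seed value 1 and the names `x`, `cv1`, and `cv2` are arbitrary choices, not part of the assignment.

```r
library(glmnet)

data(swiss)
x <- model.matrix(Fertility ~ Agriculture + Examination + Education + Catholic, swiss)[, -1]

# Same seed before each call, so both runs use the same fold assignment
set.seed(1)
cv1 <- cv.glmnet(x, swiss$Fertility, alpha = 1)
set.seed(1)
cv2 <- cv.glmnet(x, swiss$Fertility, alpha = 1)

# The two runs agree exactly
identical(cv1$lambda.min, cv2$lambda.min)

# lambda.min minimizes the CV error; lambda.1se is the largest lambda
# whose CV error is within one standard error of that minimum
c(lambda.min = cv1$lambda.min, lambda.1se = cv1$lambda.1se)
```

Without the `set.seed()` calls, the two values of `lambda.min` would generally differ, which is why the reported value is expected to change between runs.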

5. Run lasso regression again using your optimal $\lambda$ from Part 4.
```{r}
optimal_model <- glmnet::glmnet(matrix, swiss$Fertility, alpha = 1, lambda = cv.lasso$lambda.min)

coef(optimal_model, cv.lasso$lambda.min)
```

What variables are in the model?

Agriculture, Examination, Education, Catholic.

***

## Variances (12 points)

Use the stopping data set from the alr4 package. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/alr4/versions/1.0.5/topics/stopping). For this problem, you will run through various methods of dealing with nonconstant variance: using OLS, WLS (assuming known weights), WLS (with a sandwich estimator), and bootstrap.

1. Create a scatterplot of Distance versus Speed (Y vs X).

```{r}
data(stopping, package = "alr4")
plot(stopping$Speed, stopping$Distance, main = "Scatterplot of Distance versus Speed",
     xlab = "Speed", ylab = "Distance")
```

2. Briefly explain why this graph supports fitting a quadratic regression model.

The scatterplot shows a curved, quadratic-looking pattern between Speed and Distance rather than a straight line, so a quadratic regression model is a reasonable choice.

3. Fit a quadratic model with constant variance (OLS). Remember to use the I() function to inhibit the squared term.
```{r}
lm.Distance <- lm(Distance ~ Speed + I(Speed^2), data = stopping)

summary(lm.Distance)
```
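To see why I() is needed, note that inside a model formula `^` is a formula operator (it expands crossings of terms), not arithmetic, so `x^2` on its own collapses back to `x`. The sketch below illustrates this with the built-in `cars` data rather than `stopping`, purely so that it runs without alr4; the names `fit_plain` and `fit_I` are illustrative.

```r
data(cars)

# Without I(): speed^2 is interpreted by the formula parser and reduces to speed
fit_plain <- lm(dist ~ speed + speed^2, data = cars)

# With I(): speed^2 is evaluated arithmetically and enters as its own column
fit_I <- lm(dist ~ speed + I(speed^2), data = cars)

names(coef(fit_plain))  # intercept and speed only
names(coef(fit_I))      # intercept, speed, and the squared term
```

The first fit silently drops the squared term, so it is a straight-line model even though the formula looks quadratic.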