```{r setup, include = FALSE} knitr::opts_chunk$set(echo = TRUE) library(cvTools) #library(glmnet) library(sandwich)

“`

### Warning

**Warning:** If you get an error when running the code, make sure that you have the necessary packages installed. You can install these packages by running the following code in base R (not RStudio or RMarkdown).

> install.packages("alr4") > install.packages("cvTools") > install.packages("glmnet") > install.packages("sandwich")

## Model Comparison (5 points)

Use the `Puromycin` data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/Puromycin).

For this problem, you will run through various methods of model comparison – Adjusted R-squared, AIC, LRT, and LOOCV.

1. Create two models that predict `rate`. The first model will use `conc` as the predictor. The second model will use `conc` and `state` as predictors. Do not include any interactions of other transformations. Run each of your models through the `summary()` function. ```{r} data(Puromycin) lm.rate <- lm(rate ~conc, data=Puromycin) lm.rate2 <- lm(rate ~conc+state, data=Puromycin) summary(lm.rate) summary(lm.rate2)

“`

Which model is preferred according to adjusted R-squared?

Model with `conc` and `state` is better because it has highest adjusted r-squared value.

2. Use the `AIC()` function to calculate the AIC for each model.

“`{r}

AIC(lm.rate,lm.rate2)

“`

Which model is preferred according to AIC?

Model with `conc` and `state` is better because it has lower AIC

3. Run a likelihood ratio test between the two models.

“`{r}

anova(lm.rate,lm.rate2,test=’LRT’)

“`

Which model is preferred according to the LRT?

There is signicant statistical evidence that the model is better if we add the `state` to our model (p < 0.05) so the model with state` is prefered..

4. Use the `cvFit` function within the `cvTools` package to run LOOCV for both models. ```{r} cvFit(lm.rate, data=Puromycin, y = Puromycin$rate , cost = rmspe, K = nrow(Puromycin)) cvFit(lm.rate2, data=Puromycin, y = Puromycin$rate , cost = rmspe, K = nrow(Puromycin))

“`

Which model is preferred according to LOOCV?

Model with `conc` and `state` is better because it has lowest LOOCV.

***

## Lasso (5 points)

Use the `swiss` data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/swiss).

For this problem, you will use Lasso regression to predict `Fertility` based on all other predictors. You will need to utilize the `glmnet()` and `cv.glmnet()` functions within the `glmnet` package. See the Lecture Notes for details.

1. Create the design matrix using the `model.matrix()` function. This is used as one of the arguments for `glmnet()` when running Lasso.

“`{r}

data(swiss)

matrix <- model.matrix(Fertility~Agriculture +Examination +Education+ Catholic, swiss)

matrix

“`

2. Run Lasso using different values of $\lambda$ to determine which variable is the first to have its estimated slope coefficient go to 0. This may take multiple iterations. Remember to set the value of $\alpha$ to 1, as we did in the Lecture Notes. ```{r} model1 <- glmnet::glmnet(matrix, swiss$Fertility, alpha = 1) coef(model1, c(0.1, 1,2 ,3 ,4 ,5, 10 ,15, 20, 30, 50, 100))

“`

Which variable was the first to have its estimated slope coefficient go to 0?

Agriculture

3. At what (approximate) value of $\lambda$ do **all** of the estimated slope coefficients go to 0? An integer value is fine here! ```{r} coef(model1, 2) coef(model1, 3) coef(model1, 4) coef(model1, 5) coef(model1, 6) coef(model1, 7) coef(model1, 8) coef(model1, 9) ```

Approximate value of $\lambda$ do **all** of the estimated slope coefficients go to 0 is 9.

4. Use the `cv.glmnet()` function to determine the optimal value for $\lambda$.

“`{r}

cv.lasso <- glmnet::cv.glmnet(matrix, swiss$Fertility, alpha = 1)

cv.lasso$lambda.min

“`

What value of $\lambda$ is returned? Note: This will vary from trial-to-trial (that’s okay!).

0.02128741.

5. Run lasso regression again using your optimal $\lambda$ from Part 4.

“`{r}

optimal_model <- glmnet::glmnet(matrix, swiss$Fertility, alpha = 1, lambda = cv.lasso$lambda.min)

coef(optimal_model, cv.lasso$lambda.min)

“`

What variables are in the model?

Agriculture, Examination, Education, Catholic.

***

## Variances (12 points)

Use the `stopping` data set from the `alr4` package. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/alr4/versions/1.0.5/topics/stopping).

For this problem, you will run through various methods of dealing with nonconstant variance – using OLS, WLS (assuming known weights), WLS (with a sandwich estimator), and bootstrap.

1. Create a scatterplot of `Distance` versus `Speed` (Y vs X).

“`{r}

data(stopping, package = “alr4″)

plot(stopping$Speed, stopping$Distance, main=”scatterplot of `Distance` versus `Speed`”,

xlab = ‘Speed’, ylab = ‘Distance’)

“`

2. Breifly explain why this graph supports fitting a quadratic regression model.

Because scatterplot shows that there is quadretic pattern between speed and distance, so fitting a quadratic regression model is good.

3. Fit a quadratic model with constant variance (OLS). Remember to use the `I()` function to inhibit the squared term.

“`{r}

lm.Distance <- lm(Distance ~ Speed + I(Speed^2), data=stopping)

summary(lm.Distance)

“`