```{r setup, include = FALSE} knitr::opts_chunk$set(echo = TRUE) library(cvTools) #library(glmnet) library(sandwich)
“`
### Warning
**Warning:** If you get an error when running the code, make sure that you have the necessary packages installed. You can install these packages by running the following code in base R (not RStudio or RMarkdown).
> install.packages("alr4") > install.packages("cvTools") > install.packages("glmnet") > install.packages("sandwich")
## Model Comparison (5 points)
Use the `Puromycin` data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/Puromycin).
For this problem, you will run through various methods of model comparison – Adjusted R-squared, AIC, LRT, and LOOCV.
1. Create two models that predict `rate`. The first model will use `conc` as the predictor. The second model will use `conc` and `state` as predictors. Do not include any interactions of other transformations. Run each of your models through the `summary()` function. ```{r} data(Puromycin) lm.rate <- lm(rate ~conc, data=Puromycin) lm.rate2 <- lm(rate ~conc+state, data=Puromycin) summary(lm.rate) summary(lm.rate2)
“`
Which model is preferred according to adjusted R-squared?
Model with `conc` and `state` is better because it has highest adjusted r-squared value.
2. Use the `AIC()` function to calculate the AIC for each model.
“`{r}
AIC(lm.rate,lm.rate2)
“`
Which model is preferred according to AIC?
Model with `conc` and `state` is better because it has lower AIC
3. Run a likelihood ratio test between the two models.
“`{r}
anova(lm.rate,lm.rate2,test=’LRT’)
“`
Which model is preferred according to the LRT?
There is signicant statistical evidence that the model is better if we add the `state` to our model (p < 0.05) so the model with state` is prefered..
4. Use the `cvFit` function within the `cvTools` package to run LOOCV for both models. ```{r} cvFit(lm.rate, data=Puromycin, y = Puromycin$rate , cost = rmspe, K = nrow(Puromycin)) cvFit(lm.rate2, data=Puromycin, y = Puromycin$rate , cost = rmspe, K = nrow(Puromycin))
“`
Which model is preferred according to LOOCV?
Model with `conc` and `state` is better because it has lowest LOOCV.
***
## Lasso (5 points)
Use the `swiss` data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/swiss).
For this problem, you will use Lasso regression to predict `Fertility` based on all other predictors. You will need to utilize the `glmnet()` and `cv.glmnet()` functions within the `glmnet` package. See the Lecture Notes for details.
1. Create the design matrix using the `model.matrix()` function. This is used as one of the arguments for `glmnet()` when running Lasso.
“`{r}
data(swiss)
matrix <- model.matrix(Fertility~Agriculture +Examination +Education+ Catholic, swiss)
matrix
“`
2. Run Lasso using different values of $\lambda$ to determine which variable is the first to have its estimated slope coefficient go to 0. This may take multiple iterations. Remember to set the value of $\alpha$ to 1, as we did in the Lecture Notes. ```{r} model1 <- glmnet::glmnet(matrix, swiss$Fertility, alpha = 1) coef(model1, c(0.1, 1,2 ,3 ,4 ,5, 10 ,15, 20, 30, 50, 100))
“`
Which variable was the first to have its estimated slope coefficient go to 0?
Agriculture
3. At what (approximate) value of $\lambda$ do **all** of the estimated slope coefficients go to 0? An integer value is fine here! ```{r} coef(model1, 2) coef(model1, 3) coef(model1, 4) coef(model1, 5) coef(model1, 6) coef(model1, 7) coef(model1, 8) coef(model1, 9) ```
Approximate value of $\lambda$ do **all** of the estimated slope coefficients go to 0 is 9.
4. Use the `cv.glmnet()` function to determine the optimal value for $\lambda$.
“`{r}
cv.lasso <- glmnet::cv.glmnet(matrix, swiss$Fertility, alpha = 1)
cv.lasso$lambda.min
“`
What value of $\lambda$ is returned? Note: This will vary from trial-to-trial (that’s okay!).
0.02128741.
5. Run lasso regression again using your optimal $\lambda$ from Part 4.
“`{r}
optimal_model <- glmnet::glmnet(matrix, swiss$Fertility, alpha = 1, lambda = cv.lasso$lambda.min)
coef(optimal_model, cv.lasso$lambda.min)
“`
What variables are in the model?
Agriculture, Examination, Education, Catholic.
***
## Variances (12 points)
Use the `stopping` data set from the `alr4` package. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/alr4/versions/1.0.5/topics/stopping).
For this problem, you will run through various methods of dealing with nonconstant variance – using OLS, WLS (assuming known weights), WLS (with a sandwich estimator), and bootstrap.
1. Create a scatterplot of `Distance` versus `Speed` (Y vs X).
“`{r}
data(stopping, package = “alr4″)
plot(stopping$Speed, stopping$Distance, main=”scatterplot of `Distance` versus `Speed`”,
xlab = ‘Speed’, ylab = ‘Distance’)
“`
2. Breifly explain why this graph supports fitting a quadratic regression model.
Because scatterplot shows that there is quadretic pattern between speed and distance, so fitting a quadratic regression model is good.
3. Fit a quadratic model with constant variance (OLS). Remember to use the `I()` function to inhibit the squared term.
“`{r}
lm.Distance <- lm(Distance ~ Speed + I(Speed^2), data=stopping)
summary(lm.Distance)
“`