R Programming Task on Model Comparison and Lasso – All Assignment Experts

```{r setup, include = FALSE}
 knitr::opts_chunk$set(echo = TRUE)

library(cvTools)
 #library(glmnet)
 library(sandwich)

“`

### Warning

**Warning:** If you get an error when running the code, make sure that you have the necessary packages installed. You can install these packages by running the following code in base R (not RStudio or RMarkdown).

> install.packages("alr4")
 > install.packages("cvTools")
 > install.packages("glmnet")
 > install.packages("sandwich")

## Model Comparison (5 points)

Use the `Puromycin` data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/Puromycin).

For this problem, you will run through various methods of model comparison – Adjusted R-squared, AIC, LRT, and LOOCV.

1. Create two models that predict `rate`. The first model will use `conc` as the predictor. The second model will use `conc` and `state` as predictors. Do not include any interactions of other transformations. Run each of your models through the `summary()` function.
 ```{r}
 data(Puromycin)

lm.rate <- lm(rate ~conc, data=Puromycin)
 lm.rate2 <- lm(rate ~conc+state, data=Puromycin)

summary(lm.rate)
 summary(lm.rate2)

“`

Which model is preferred according to adjusted R-squared?

Model with `conc` and `state` is better because it has highest adjusted r-squared value.

2. Use the `AIC()` function to calculate the AIC for each model.
“`{r}
AIC(lm.rate,lm.rate2)

“`

Which model is preferred according to AIC?

Model with `conc` and `state` is better because it has lower AIC

3. Run a likelihood ratio test between the two models.
“`{r}
anova(lm.rate,lm.rate2,test=’LRT’)

“`

Which model is preferred according to the LRT?

There is signicant statistical evidence that the model is better if we add the `state` to our model (p < 0.05) so the model with state` is prefered..

4. Use the `cvFit` function within the `cvTools` package to run LOOCV for both models.
 ```{r}
 cvFit(lm.rate, data=Puromycin, y = Puromycin$rate , cost = rmspe, K = nrow(Puromycin))
 cvFit(lm.rate2, data=Puromycin, y = Puromycin$rate , cost = rmspe, K = nrow(Puromycin))

“`

Which model is preferred according to LOOCV?

Model with `conc` and `state` is better because it has lowest LOOCV.

***

## Lasso (5 points)

Use the `swiss` data set built into R. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/datasets/versions/3.6.2/topics/swiss).

For this problem, you will use Lasso regression to predict `Fertility` based on all other predictors. You will need to utilize the `glmnet()` and `cv.glmnet()` functions within the `glmnet` package. See the Lecture Notes for details.

1. Create the design matrix using the `model.matrix()` function. This is used as one of the arguments for `glmnet()` when running Lasso.
“`{r}
data(swiss)
matrix <- model.matrix(Fertility~Agriculture +Examination +Education+ Catholic, swiss)
matrix

“`

2. Run Lasso using different values of $\lambda$ to determine which variable is the first to have its estimated slope coefficient go to 0. This may take multiple iterations. Remember to set the value of $\alpha$ to 1, as we did in the Lecture Notes.
 ```{r}
 model1 <- glmnet::glmnet(matrix, swiss$Fertility, alpha = 1)
 coef(model1, c(0.1, 1,2 ,3 ,4 ,5, 10 ,15, 20, 30, 50, 100))

“`

Which variable was the first to have its estimated slope coefficient go to 0?

Agriculture

3. At what (approximate) value of $\lambda$ do **all** of the estimated slope coefficients go to 0? An integer value is fine here!
 ```{r}
 coef(model1, 2)
 coef(model1, 3)
 coef(model1, 4)
 coef(model1, 5)
 coef(model1, 6)
 coef(model1, 7)
 coef(model1, 8)
 coef(model1, 9)
 ```

Approximate value of $\lambda$ do **all** of the estimated slope coefficients go to 0 is 9.

4. Use the `cv.glmnet()` function to determine the optimal value for $\lambda$.
“`{r}
cv.lasso <- glmnet::cv.glmnet(matrix, swiss$Fertility, alpha = 1)
cv.lasso$lambda.min
“`

What value of $\lambda$ is returned? Note: This will vary from trial-to-trial (that’s okay!).

0.02128741.

5. Run lasso regression again using your optimal $\lambda$ from Part 4.
“`{r}
optimal_model <- glmnet::glmnet(matrix, swiss$Fertility, alpha = 1, lambda = cv.lasso$lambda.min)

coef(optimal_model, cv.lasso$lambda.min)
“`

What variables are in the model?

Agriculture, Examination, Education, Catholic.

***

## Variances (12 points)

Use the `stopping` data set from the `alr4` package. Further information on the data set can be found [here](https://www.rdocumentation.org/packages/alr4/versions/1.0.5/topics/stopping).

For this problem, you will run through various methods of dealing with nonconstant variance – using OLS, WLS (assuming known weights), WLS (with a sandwich estimator), and bootstrap.

1. Create a scatterplot of `Distance` versus `Speed` (Y vs X).
“`{r}
data(stopping, package = “alr4″)

plot(stopping$Speed, stopping$Distance, main=”scatterplot of `Distance` versus `Speed`”,
xlab = ‘Speed’, ylab = ‘Distance’)

“`

2. Breifly explain why this graph supports fitting a quadratic regression model.

Because scatterplot shows that there is quadretic pattern between speed and distance, so fitting a quadratic regression model is good.

3. Fit a quadratic model with constant variance (OLS). Remember to use the `I()` function to inhibit the squared term.
“`{r}
lm.Distance <- lm(Distance ~ Speed + I(Speed^2), data=stopping)

summary(lm.Distance)

“`

Stumble

M	T	W	T	F	S	S
« Feb
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31