R Exercises and Solutions: ANOVA, Linear and Logistic Regression, Chi-Square Test

In order to solve the tasks you need the DS705data package, which contains the data sets used in the problems below.

Problem 1

People who are concerned about their health may prefer hot dogs that are low in sodium and calories. The data file contains sample data on the sodium and calories contained in each of 54 major hot dog brands. The hot dogs are classified by type: beef (A), poultry (B), and meat (C).

The data file called hotdogs.rda contains the sodium and calorie content for random samples of each type of hot dog. This data set is included in the DS705data package.

Part 1a

Make boxplots for the variable calories for the 3 types of hotdogs. Describe the 3 boxplots and the suitability of these samples for conducting analysis of variance.

Answer 1a

load("hotdogs.rda")
boxplot(calories ~ type, data = hotdogs, xlab = "Hotdogs", ylab = "Calories")

The boxplot for type C hot dogs sits much lower than the boxplots for types A and B, suggesting that type C hot dogs differ substantially in calories from the other two types. There is also an outlier for type C, which may inflate the within-group variability and reduce the chance of rejecting the null hypothesis when conducting analysis of variance.

The distribution for type A is slightly skewed to the right, type B is also skewed to the right, and type C is skewed to the left. Skewness is a sign of nonnormality, so these samples may not be well suited for analysis of variance.

The boxes for types A and C are very similar in length, while the box for type B is longer. This suggests a lack of homogeneity of variances, which also works against the assumptions of analysis of variance.
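
These visual impressions can be checked numerically as well; a minimal sketch, assuming hotdogs is loaded as above:

# five-number summaries by type, to compare centers and skewness
tapply(hotdogs$calories, hotdogs$type, summary)
# sample standard deviations by type, for an informal look at variance homogeneity
tapply(hotdogs$calories, hotdogs$type, sd)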

Part 1b

Conduct an analysis of variance test (the standard one that assumes normality and equal variance) to compare population mean calorie counts for these three types of hot dogs. (i) State the null and alternative hypotheses, (ii) use R to compute the test statistic and p-value, and (iii) write a conclusion in context at \(\alpha=0.10\).

Answer 1b

\(H_0: \mu_A = \mu_B = \mu_C\)

\(H_a:\) Means are not all equal.

fit <- aov(calories ~ type, data = hotdogs)
summary(fit)
##             Df Sum Sq Mean Sq F value  Pr(>F)   
## type         2  15425    7712   6.278 0.00365 **
## Residuals   51  62649    1228                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since p-value = 0.00365 < 0.1, we reject the null hypothesis. Some means are statistically significantly different. Population mean calorie counts for these three types of hot dogs are not the same.
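
The standard ANOVA also assumes normally distributed residuals; a quick after-the-fact check (a sketch, output not shown):

# Shapiro-Wilk test on the pooled ANOVA residuals
shapiro.test(residuals(fit))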

Part 1c

Follow up with the Tukey-Kramer multiple comparison procedure using a 90% experiment-wise confidence level.

  1. Provide the R code used and

  2. Write an interpretation for your multiple comparison output in the context of the problem; include the intervals for which population means are significantly different.

Answer 1c

TukeyHSD(fit, conf.level = 0.90)
##   Tukey multiple comparisons of means
##     90% family-wise confidence level
## 
## Fit: aov(formula = calories ~ type, data = hotdogs)
## 
## $type
##           diff       lwr        upr     p adj
## B-A  -9.144118 -33.41772  15.129490 0.7102899
## C-A -39.673529 -63.94714 -15.399922 0.0033799
## C-B -30.529412 -55.76791  -5.290917 0.0371484

This output indicates that the differences C-A and C-B are significant (p adj < 0.1), while B-A is not significant (p adj = 0.71 > 0.1). This means that type C hotdogs are significantly different in terms of population mean calorie counts from the other two types of hotdogs.

The 90% confidence interval for the difference in population mean calorie counts between types C and A extends from -63.947 to -15.400, and between types C and B from -55.768 to -5.291.
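
The Tukey-Kramer intervals can also be displayed graphically; intervals that do not cross zero correspond to significant differences:

plot(TukeyHSD(fit, conf.level = 0.90))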

Part 1d

As part of a vigorous road test for independent random samples of size 20 for 4 different brands of tires, the tread wear was measured for comparison. The data frame treadwear.rda contains the resulting data.

Begin by exploring the sample means and standard deviations for each brand and looking at a boxplot comparison. That is, find the sample means and standard deviations and produce boxplots of the tread wear measures for the four brands of tires.

Conduct hypothesis tests for the normality of the tread wear for each brand using a 5% level of significance in each case.

Also test for the homogeneity of the variances using \(\alpha=0.05\).

Comment on the results of each.

Answer 1d

load("treadwear.rda")
# sample mean and standard deviation of wear for each brand
tbl <- cbind(aggregate(wear ~ brand, data = treadwear, mean),
             aggregate(wear ~ brand, data = treadwear, sd)[2])
colnames(tbl) <- c("brand", "wear.mean", "wear.sd")
tbl
##   brand wear.mean   wear.sd
## 1     A  576.3017 148.10367
## 2     B  671.1044 111.29454
## 3     C  825.9870  57.85366
## 4     D  853.2877  81.23036
boxplot(wear ~ brand, data = treadwear, xlab = "Brand", ylab = "Wear")

# hypothesis tests for the normality
shapiro.test(treadwear$wear[treadwear$brand == "A"])
## 
##  Shapiro-Wilk normality test
## 
## data:  treadwear$wear[treadwear$brand == "A"]
## W = 0.94655, p-value = 0.3177
shapiro.test(treadwear$wear[treadwear$brand == "B"])
## 
##  Shapiro-Wilk normality test
## 
## data:  treadwear$wear[treadwear$brand == "B"]
## W = 0.92027, p-value = 0.1003
shapiro.test(treadwear$wear[treadwear$brand == "C"])
## 
##  Shapiro-Wilk normality test
## 
## data:  treadwear$wear[treadwear$brand == "C"]
## W = 0.95389, p-value = 0.4301
shapiro.test(treadwear$wear[treadwear$brand == "D"])
## 
##  Shapiro-Wilk normality test
## 
## data:  treadwear$wear[treadwear$brand == "D"]
## W = 0.95129, p-value = 0.3871
# test for the homogeneity of the variances
bartlett.test(wear ~ brand, data = treadwear)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  wear by brand
## Bartlett's K-squared = 17.034, df = 3, p-value = 0.0006956

The results of the Shapiro-Wilk normality tests show that the p-value for each brand is larger than \(\alpha\) (0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected for any brand. The tread wear for each brand can be treated as normally distributed.

The results of the Bartlett test of homogeneity of variances show that the p-value is smaller than \(\alpha\) (0.0006956 < 0.05), so we reject the null hypothesis that the variances of tread wear are the same for all brands. At least one brand's variance is significantly different from the others.
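
Bartlett's test is sensitive to departures from normality, which is acceptable here since normality was not rejected; a more robust alternative would be Levene's test from the car package (a sketch, output not shown; leveneTest expects the grouping variable to be a factor):

library(car)
leveneTest(wear ~ brand, data = treadwear)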

Part 1e

What is the most appropriate inference procedure to compare population mean tread wear for these four brands of tires? Perform this procedure.

(i) State the null and alternative hypotheses, (ii) use R to compute the test statistic and p-value, and (iii) write a conclusion in context at \(\alpha=0.05\).

Answer 1e

Welch's ANOVA (the one-way test that does not assume equal variances) is most appropriate here, since the homogeneity-of-variances assumption was rejected in part 1d while normality was not.

\(H_0: \mu_A = \mu_B = \mu_C = \mu_D\)

\(H_a:\) Means are not all equal.

oneway.test(wear ~ brand, data = treadwear)
## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  wear and brand
## F = 27.201, num df = 3.000, denom df = 40.197, p-value = 8.988e-10

Since p-value = 8.988e-10 < 0.05, we reject the null hypothesis. Some means are significantly different: population mean tread wear is not the same for all four brands of tires.

Part 1f

Conduct the most appropriate multiple comparisons procedure to determine which brands have significantly different tread wear. Use a family-wise error rate of \(\alpha=0.05\). Use complete sentences to interpret the results in the context of the problem.

Answer 1f

library(userfriendlyscience)
oneway(treadwear$wear, treadwear$brand, posthoc = 'games-howell')
## ### Oneway Anova for y=wear and x=brand (groups: A, B, C, D)
## 
## Omega squared: 95% CI = [.38; .64], point estimate = .53
## Eta Squared: 95% CI = [.41; .63], point estimate = .55
## 
##                                         SS Df        MS     F     p
## Between groups (error + effect) 1029881.43  3 343293.81 31.02 <.001
## Within groups (error only)        841065.2 76  11066.65            
## 
## 
## ### Post hoc test: games-howell
## 
##       diff  ci.lo  ci.hi    t    df     p
## B-A  94.80 -16.88 206.48 2.29 35.27  .120
## C-A 249.69 151.80 347.57 7.02 24.67 <.001
## D-A 276.99 174.18 379.79 7.33 29.48 <.001
## C-B 154.88  78.40 231.37 5.52 28.57 <.001
## D-B 182.18  99.07 265.30 5.91 34.77 <.001
## D-C  27.30 -32.90  87.50 1.22 34.33  .616

This output indicates that the differences C-A, D-A, C-B and D-B are significant (p < .001 < 0.05), while B-A (p = .120) and D-C (p = .616) are not. This means that brands C and D each differ significantly in population mean tread wear from brands A and B, while brand A does not differ significantly from brand B, nor brand C from brand D.

Problem 2

This dataset contains the prices of ladies’ diamond rings and the carat size of their diamond stones. The rings are made with gold of 20 carats purity and are each mounted with a single diamond stone. The data was presented in a newspaper advertisement suggesting the use of simple linear regression to relate the prices of diamond rings to the carats of their diamond stones.

The data is in the file diamond.rda and is included in the DS705data package.

Part 2a

Does it appear that a linear model is at least possibly a plausible model for predicting the price from carats of the diamonds for these rings?

Begin by creating a scatterplot and comment on the suitability of a linear regression model.

Answer 2a

load("diamond.rda")
plot(price ~ carat, data = diamond)

The points follow a fairly tight, increasing, roughly straight-line pattern, so a linear model looks like a plausible model for predicting the price from the carats of the diamonds for these rings.
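
Overlaying the least-squares line on the scatterplot makes the strength of the linear trend easier to judge; a minimal sketch:

plot(price ~ carat, data = diamond)
abline(lm(price ~ carat, data = diamond), col = "red")  # fitted regression line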

Part 2b

Obtain the estimated y-intercept and slope for the estimated regression equation and write the equation in the form price\(=\hat{\beta_0} + \hat{\beta_1}\)carats (only with \(\hat{\beta_0}\) and \(\hat{\beta_1}\) replaced with the numerical estimates from your R output).

Answer 2b

mod <- lm(price ~ carat, data = diamond)
mod
## 
## Call:
## lm(formula = price ~ carat, data = diamond)
## 
## Coefficients:
## (Intercept)        carat  
##      -250.6       3671.4

\(\widehat{price}\) = -250.6 + 3671.4 carat

Part 2c

Compute the sample Pearson correlation coefficient and test whether or not the population Pearson correlation coefficient between price and carat is zero using a 1% level of significance. (i) State the null and alternative hypotheses, (ii) test statistic, (iii) the p-value, and (iv) conclusion.

Answer 2c

cor(diamond$carat, diamond$price)
## [1] 0.9875512
cor.test(diamond$carat, diamond$price, conf.level = 0.99)
## 
##  Pearson's product-moment correlation
## 
## data:  diamond$carat and diamond$price
## t = 42.116, df = 45, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.9731306 0.9942549
## sample estimates:
##       cor 
## 0.9875512

Sample correlation: 0.988

\(H_0:\) True correlation is 0

\(H_a:\) True correlation is not equal to 0

t = 42.116

p-value < 2.2e-16

Since p-value < 0.01, we reject the null hypothesis. True correlation between price and carat is statistically significantly different from zero.

Part 2d

Provide a 95% confidence interval to estimate the slope of the regression equation and interpret the interval in the context of the application (do not use the word "slope" in your interpretation).

Answer 2d

confint(mod, "carat", level = 0.95)
##          2.5 %   97.5 %
## carat 3495.818 3846.975

The 95% confidence interval for the slope of the regression equation is (3495.818, 3846.975). With 95% confidence, each additional carat of diamond weight is associated with an increase in the expected ring price of between 3495.82 and 3846.98; equivalently, each additional 0.01 carat is associated with an expected price increase of between about 34.96 and 38.47.

Part 2e

Check to see if the linear regression model assumptions are reasonable for this data.

(Step 1) Are the residuals normal? Construct a histogram, normal probability plot, and boxplot of the residuals and perform a Shapiro-Wilk test for normality.

Answer 2e.1

res <- residuals(mod)
par(mfrow = c(2, 2))
hist(res)
qqnorm(res)
qqline(res)
boxplot(res)
par(mfrow = c(1, 1))

shapiro.test(res)
## 
##  Shapiro-Wilk normality test
## 
## data:  res
## W = 0.98604, p-value = 0.8406

All of the above plots indicate that the residuals are approximately normally distributed. Also, since the p-value of the Shapiro-Wilk normality test is higher than 0.05, the null hypothesis that the residuals came from a normally distributed population cannot be rejected. The normality assumption is reasonable.

(Step 2) Plot the residuals against the fitted values. Does the equal variances assumption seem reasonable? Does the linear regression line seem to be a good fit?

Answer 2e.2

plot(fitted(mod), res,
     ylab = "Residuals", xlab = "Fitted Values", main = "Residuals vs. Fitted")
abline(h = 0)

The equal-variances assumption seems reasonable and the regression line appears to be a good fit, because the residuals are evenly spread around the horizontal zero line with no distinct pattern.

(Step 3) Perform the Breusch-Pagan test for equal variances of the residuals. What does the test tell you?

Answer 2e.3

library(lmtest)
bptest(mod)
## 
##  studentized Breusch-Pagan test
## 
## data:  mod
## BP = 0.18208, df = 1, p-value = 0.6696

Since p-value = 0.6696 > 0.05, the null hypothesis of homoscedasticity cannot be rejected. There is no evidence against the assumption of equal variances for the residuals.

Part 2f

Calculate and interpret the coefficient of determination \(r^2_{yx}\) (same as \(R^2\)).

Answer 2f

cor(diamond$carat, diamond$price)^2
## [1] 0.9752573

The coefficient of determination shows that the model explains about 97.5% of variability of the price data around its mean.
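
For simple linear regression, the squared sample correlation equals the R-squared reported in the model summary, which provides a cross-check (a sketch):

summary(mod)$r.squared  # should match cor(diamond$carat, diamond$price)^2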

Part 2g

Should the regression equation obtained for price and carats be used for making predictions? Explain your answer.

Answer 2g

The regression equation obtained for price and carats can be used for making predictions (within the observed range of carat sizes), since the coefficient of determination shows that the model fits the data very well and the residual analysis indicates that the model assumptions are satisfied.

Part 2h

What would be the straightforward interpretation of the y-intercept in this regression model? Does it make sense here? Why would this not be appropriate as a stand-alone interpretation for this scenario? (hint: what is extrapolation?)

Answer 2h

The straightforward interpretation of the y-intercept is that the price of a ring with a diamond of 0 carats would be -250.6. This does not make sense here, because a price cannot be negative. It is also not appropriate as a stand-alone interpretation because carat = 0 lies outside the range of the observed data, so using the model there amounts to extrapolation.
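
The observed range of carat sizes confirms that carat = 0 lies outside the data; a sketch:

range(diamond$carat)  # the smallest observed carat size is well above 0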

Part 2i

Create 95% prediction and confidence limits for the population mean price for the carats given in the sample data and plot them along with a scatterplot of the data.

Answer 2i

# 95% prediction limits
carat <- sort(diamond$carat)
pred.int <- predict(mod, newdata = data.frame(carat = carat), interval = "prediction")
# 95% confidence limits
conf.int <- predict(mod, newdata = data.frame(carat = carat), interval = "confidence")
# scatterplot
plot(price ~ carat, data = diamond)
# add prediction interval
lines(carat, pred.int[, 2], col = "blue")
lines(carat, pred.int[, 3], col = "blue")
# add confidence interval
lines(carat, conf.int[, 2], col = "red", lty = 2)
lines(carat, conf.int[, 3], col = "red", lty = 2)
# add legend
legend("topleft", c("95% Confidence Interval", "95% Prediction Interval"),
       col = c("red", "blue"), lty = c(2, 1))

Problem 3

Blood type is classified as "A, B, or AB" or "O". In addition, blood can be classified as Rh+ or Rh-. In a survey of 500 randomly selected individuals, a phlebotomist obtained the results shown in the table below.

Rh Factor    A, B, or AB      O    Total
Rh+                  226    198      424
Rh-                   46     30       76
Total                272    228      500

Part 3a

Conduct the appropriate test of significance to test the following research question “Is Rh factor associated with blood type?” Use a 5% level of significance and include all parts of the test.

  1. Provide R code

and state the following:

  1. null and alternative hypotheses

  2. test statistic

  3. degrees of freedom

  4. p-value (report to 4 decimal places, enter 0 if P < 0.00005)

  5. conclusions.

Answer 3a

bloodtype <- matrix(c(226, 198, 46, 30), nrow = 2, byrow = TRUE)
rownames(bloodtype) <- c("Rh+", "Rh-")
colnames(bloodtype) <- c("A/B/AB", "O")
bloodtype
##     A/B/AB   O
## Rh+    226 198
## Rh-     46  30
chisq.test(bloodtype)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  bloodtype
## X-squared = 1.0804, df = 1, p-value = 0.2986

\(H_0:\) Rh factor and blood type are independent variables;

\(H_a:\) Rh factor and blood type are not independent.

X-squared = 1.0804

df = 1

p-value = 0.2986

Since p-value = 0.2986 > 0.05, we cannot reject the null hypothesis that Rh factor and blood type are independent. There is no significant evidence that Rh factor is associated with blood type.
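
Note that chisq.test applies Yates' continuity correction by default for 2x2 tables; the uncorrected statistic can be obtained with correct = FALSE, which leads to the same conclusion here:

chisq.test(bloodtype, correct = FALSE)  # uncorrected Pearson chi-square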

Part 3b

Compute and interpret the odds ratio of having Type O blood for Rh+ compared to Rh-.

Answer 3b

library(mosaic)
oddsRatio(bloodtype)
## [1] 1.343363

The odds that an Rh+ individual has Type O blood are about 1.34 times the odds that an Rh- individual has Type O blood.

Part 3c

Construct and interpret a 90% confidence interval for the population proportion of people who are Rh-, given that they have Type O blood.

Answer 3c

n <- sum(bloodtype[, "O"])
k <- bloodtype["Rh-", "O"]
# proportion point estimate
pbar <- k/n
pbar
## [1] 0.1315789
# standard error
SE <- sqrt(pbar * (1 - pbar) / n)
# margin of error
E <- qnorm(0.95) * SE
# confidence interval
pbar + c(-E, E)
## [1] 0.09475603 0.16840187

At 90% confidence level, between 9.48% and 16.84% of people who have Type O blood are Rh-.
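
This interval can be cross-checked with prop.test, which constructs a (slightly different) score-based interval; a sketch, output not shown:

prop.test(k, n, conf.level = 0.90, correct = FALSE)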

Problem 4

The carinate dove shell is a yellowish to brownish colored smooth shell found along shallow water coastal areas in California and Mexico. A study was conducted to determine if the shell height of the carinate dove shell could be accurately predicted and to identify the independent variables needed to do so. Data was collected for a random sample of 30 of these gastropods and 8 variables that researchers thought might be good predictors were recorded.

The shell heights (in mm) are labeled in the file as Y and the potential predictor variables are simply named as X1, X2, X3, X4, X5, X6, X7, and X8. Independent variables X1 through X7 are quantitative while X8 is categorical.

The data is in the file shells.rda and is included in the DS705data package.

Part 4a

Use stepwise model selection with AIC as the stepping criterion and direction = “both” to identify the best first-order model for predicting shell height (Y).

Identify the predictor variables in the final model as well as the AIC for that model.

Answer 4a

load("shells.rda")
null <- lm(Y ~ 1, data = shells)
full <- lm(Y ~ ., data = shells)
mod_A <- step(null, scope = list(upper = full),
              direction = "both", trace = FALSE)
mod_A
## 
## Call:
## lm(formula = Y ~ X2 + X4 + X1 + X6 + X7, data = shells)
## 
## Coefficients:
## (Intercept)           X2           X4           X1           X6           X7  
##     1.42718      0.65317      0.60923      1.15023     -0.06487      0.02636
extractAIC(mod_A)
## [1]    6.0000 -121.9933

The predictor variables in the final model are X1, X2, X4, X6 and X7. The AIC for this model is -121.99.

Part 4b

Compute the variance inflation factor for the final model from part 4a.

Does this model suffer from multicollinearity? Explain your answer.

Answer 4b

library(car)
vif(mod_A)
##       X2       X4       X1       X6       X7 
## 3.804041 2.697711 4.286121 4.428297 3.172575

This model does not suffer from serious multicollinearity: none of the VIF values exceed 10, a common rule-of-thumb threshold.

Part 4c

Let’s define Model B as follows:

Y = \(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2^2 + \beta_4 X_4 + \beta_5 X_6 +\epsilon\)

Fit Model B and compare the AIC of it to the model that the stepwise selection procedure identified as best in 4a, which you may refer to as Model A.

Which model is the better fitting model according to AIC? Explain your answer.

Answer 4c

mod_B <- lm(Y ~ X1 + X2 + I(X2^2) + X4 + X6, data = shells)
mod_B
## 
## Call:
## lm(formula = Y ~ X1 + X2 + I(X2^2) + X4 + X6, data = shells)
## 
## Coefficients:
## (Intercept)           X1           X2      I(X2^2)           X4           X6  
##     9.16461      1.11286     -5.45450      1.14091      0.79526     -0.03961
extractAIC(mod_B)
## [1]    6.0000 -129.4831
extractAIC(mod_A)
## [1]    6.0000 -121.9933

According to AIC, Model B is the better-fitting model because its AIC is smaller (-129.48 vs. -121.99 for Model A).
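
Note that extractAIC and AIC use different additive constants for lm objects, so their values differ, but they rank models identically; the comparison can also be made directly (a sketch):

AIC(mod_A, mod_B)  # the lower AIC identifies the better-fitting model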

Part 4d

Compute the variance inflation factor for Model B from part 4c.

Does this model suffer from multicollinearity? Explain your answer.

Answer 4d

vif(mod_B)
##         X1         X2    I(X2^2)         X4         X6 
##   4.195143 373.677887 369.491380   2.119951   2.309547

Model B suffers from multicollinearity because the VIF values for X2 and I(X2^2) are very high; both far exceed the rule-of-thumb threshold of 10.

Part 4e

Center the variable X2 and compute the quadratic term associated with it (call them cX2 and cX2sq, respectively). We’ll identify this as Model C:

Y = \(\beta_0 + \beta_1 X_1 + \beta_2 cX_2 + \beta_3 cX_2^2 + \beta_4 X_4 + \beta_5 X_6 +\epsilon\)

Compute the variance inflation factor for Model C. Does this model suffer from multicollinearity? Explain your answer.

Answer 4e

cX2 <- shells$X2 - mean(shells$X2)
cX2sq <- cX2^2
shells <- cbind(shells, cX2, cX2sq)
mod_C <- lm(Y ~ X1 + cX2 + cX2sq + X4 + X6, data = shells)
mod_C
## 
## Call:
## lm(formula = Y ~ X1 + cX2 + cX2sq + X4 + X6, data = shells)
## 
## Coefficients:
## (Intercept)           X1          cX2        cX2sq           X4           X6  
##     2.70477      1.11286      0.52081      1.14091      0.79526     -0.03961
vif(mod_C)
##       X1      cX2    cX2sq       X4       X6 
## 4.195143 3.765151 1.036090 2.119951 2.309547

Model C does not suffer from multicollinearity, because all VIF values are small and none of them exceed 10.
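
Centering works because it removes the near-linear relationship between X2 and its square; this can be seen directly from the correlations (a sketch):

cor(shells$X2, shells$X2^2)    # typically very close to 1 when X2 > 0
cor(shells$cX2, shells$cX2sq)  # much closer to 0 after centering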

Part 4f

Compare the adjusted R-squared for Models A and C.

Explain what adjusted R-squared measures and state which model is “better” according to this criterion.

Answer 4f

summary(mod_A)$adj.r.squared
## [1] 0.9413233
summary(mod_C)$adj.r.squared
## [1] 0.9542871

Adjusted R-squared measures the proportion of variation in the response explained by the model, adjusted for the number of predictors; unlike ordinary R-squared, it increases only when an added variable improves the fit by more than would be expected by chance. Model C has the higher adjusted R-squared (0.954 vs. 0.941 for Model A), so Model C is “better” according to this criterion.
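
Formally, with \(n\) observations and \(p\) predictors,

\[R^2_{adj} = 1 - (1 - R^2)\frac{n-1}{n-p-1},\]

so adjusted R-squared rises only when an added variable reduces the residual variance by more than the penalty for the extra parameter.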

Part 4g

Test the residuals of Model C for serial correlation. Use a 5% level of significance.

Describe the outcome of this test.

Answer 4g

dwtest(mod_C, alternative = "two.sided")
## 
##  Durbin-Watson test
## 
## data:  mod_C
## DW = 2.8267, p-value = 0.02104
## alternative hypothesis: true autocorrelation is not 0

The p-value is smaller than \(\alpha\) (0.02104 < 0.05), so we reject the null hypothesis that the autocorrelation of the disturbances is 0. The residuals of Model C are serially correlated.

Part 4h

Using Model C, construct a 95% prediction interval for the shell height (Y) for a randomly selected shell with \(X1=3.6\), \(X2=2.4\), \(X4=3.0\), and \(X6=48\).

Write an interpretation for the interval in the context of this problem.

Answer 4h

newdata <- data.frame(X1 = 3.6, cX2 = 2.4 - mean(shells$X2),
                      cX2sq = (2.4 - mean(shells$X2))^2,
                      X4 = 3, X6 = 48)
predict(mod_C, newdata, interval = "prediction")
##        fit      lwr     upr
## 1 7.136107 6.907404 7.36481

The 95% prediction interval of (6.907, 7.365) means that we can be 95% confident that the shell height of a randomly selected shell with \(X1=3.6\), \(X2=2.4\), \(X4=3.0\), and \(X6=48\) will be between 6.907 mm and 7.365 mm.
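
For comparison, a confidence interval for the mean shell height at these same predictor values would be narrower, because it excludes the shell-to-shell variability; a sketch:

predict(mod_C, newdata, interval = "confidence")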

Problem 5

A study on the Primary News Source for Americans was conducted using a random sample of 115 Americans in 2015. The results are shown below.

                        TV   Radio   Newspaper   Internet
Sample from 2015        38      20          15         42
Distribution in 1995   45%     18%         16%        21%

Conduct the hypothesis test to determine if the distribution of Primary News Source for Americans is the same in 2015 as it was in 1995. Use \(\alpha = 0.10\). State your hypotheses, test statistic, df, p-value, and conclusions, including a practical conclusion in the context of the problem.

Answer 5

x <- c(38, 20, 15, 42)
p <- c(0.45, 0.18, 0.16, 0.21)
chisq.test(x, p = p)
## 
##  Chi-squared test for given probabilities
## 
## data:  x
## X-squared = 17.499, df = 3, p-value = 0.000558

\(H_0: p_1 = 0.45, p_2 = 0.18, p_3 = 0.16, p_4 = 0.21\)

\(H_a:\) at least one of these proportions is different.

Test statistic is 17.499, df = 3, p-value = 0.000558.

Since p-value = 0.000558 < 0.10, we reject the null hypothesis. At least one proportion is significantly different from its expected (1995) value: the distribution of Primary News Source for Americans changed between 1995 and 2015.
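
A quick validity check for this goodness-of-fit test is that all expected counts are at least 5; they can be extracted from the test object (a sketch):

chisq.test(x, p = p)$expected  # 115 * p = 51.75, 20.70, 18.40, 24.15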

Problem 6

In an effort to make better cheese, a company has a random sample of 30 cheese consumers taste 30 specially prepared pieces of Australian cheddar cheese (1 piece for each person). Each subject rated the taste of their piece of cheese as “acceptable” (coded as 1) or “not acceptable” (coded as 0). One variable measured was called ACETIC and was a quantitative variable ranging from 4.5 to 6.5 units. The other variable recorded was whether the person was a child (person=1) or an adult (person=2).

The data file is called cheese.rda. This data set is included in the DS705data package.

Part 6a

Fit the first order model for predicting whether or not the taste of the cheese is acceptable (i.e. acceptable=1) from the acetic value and also whether the person was a child or an adult.

At a 5% level of significance, should either variable be dropped from the model?

Answer 6a

load("cheese.rda")
mod1 <- glm(taste ~ acetic + person, family = binomial, data = cheese)
summary(mod1)
## 
## Call:
## glm(formula = taste ~ acetic + person, family = binomial, data = cheese)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2245  -0.4998  -0.2002   0.3040   1.6066  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept)  -18.709      8.517  -2.197   0.0280 *
## acetic         2.787      1.412   1.975   0.0483 *
## personAdult    3.096      1.371   2.258   0.0239 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 34.795  on 29  degrees of freedom
## Residual deviance: 20.550  on 27  degrees of freedom
## AIC: 26.55
## 
## Number of Fisher Scoring iterations: 6

At a 5% level of significance, neither variable should be dropped from the model: the p-values for acetic (0.0483) and person (0.0239) are both below 0.05.
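
The Wald z-tests in the summary can be cross-checked with likelihood-ratio tests, which are often preferred for logistic regression; a sketch, output not shown:

drop1(mod1, test = "LRT")  # likelihood-ratio p-value for dropping each term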

Part 6b

Convert the estimated coefficient of acetic to an odds ratio and interpret it in the context of the problem.

Answer 6b

exp(coef(mod1))[2]
##   acetic 
## 16.23497

Holding the person type (child or adult) fixed, the odds of the cheese being rated acceptable are multiplied by about 16.23 for every one-unit increase in acetic.
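
A 95% confidence interval for this odds ratio can be obtained by exponentiating the confidence limits of the coefficient; a minimal Wald-type sketch:

exp(confint.default(mod1, "acetic"))  # 95% CI for the acetic odds ratio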

Part 6c

Compute the predicted probability of a child finding the taste of the cheese acceptable when the value for acetic is 6.

Answer 6c

newdata1 <- data.frame(acetic = 6, person = "Child")
predict(mod1, newdata1, type = "response")
##         1 
## 0.1206732

The predicted probability that a child finds the taste of the cheese acceptable when acetic is 6 is about 0.121.

Part 6d

Compute a 95% confidence interval for the predicted probability of a child finding the taste of the cheese acceptable when the value for acetic is 6.

Answer 6d

# predict on the link (log-odds) scale, with standard errors
pred <- predict(mod1, newdata1, type = "link", se.fit = TRUE)
# inverse link function (logistic) for transforming back to probabilities
invlink <- mod1[["family"]][["linkinv"]]
z <- qnorm(1 - (1 - 0.95)/2)  # critical value for 95% confidence
low.ci <- pred$fit - z * pred$se.fit
up.ci <- pred$fit + z * pred$se.fit

# transform the point estimate and interval endpoints to the response scale
prediction <- invlink(pred$fit)
low.ci.response <- invlink(low.ci)
up.ci.response <- invlink(up.ci)

data.frame(prediction = prediction, low.ci = low.ci.response, up.ci = up.ci.response)
##   prediction     low.ci     up.ci
## 1  0.1206732 0.01591904 0.5379397

With 95% confidence, the probability that a child finds the taste of the cheese acceptable when acetic is 6 is between about 0.016 and 0.538. The interval is quite wide.