Factor Analysis, Clustering, Hypothesis Testing, Linear Regression: Exercises and Solutions in R
To solve the exercises, you will need:
- RStudio
- the data files
Problem 1
Scenario: A survey was conducted to find out how teenagers think about the future and what barriers they believe will hinder their careers. The researcher would like to determine if the 15 survey items can be summarized more efficiently by a smaller set of latent factors.
You will need these files:
- careerbarrier.rda: Data for the 15 survey items for a random sample of 76 teens.
- Career Barrier Survey.docx: Descriptions of the variables in the data file.
Part A
Question 1
Conduct Bartlett’s test for sphericity on the responses for the 15 survey questions.
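The output below can be produced with cortest.bartlett() from the psych package; a minimal sketch, assuming the data frame careerbarrier contains only the 15 survey items:
library(psych)
load("careerbarrier.rda")
# Bartlett's test takes the correlation matrix and the sample size (n = 76 teens)
cortest.bartlett(cor(careerbarrier), n = 76)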
## $chisq
## [1] 287.4985
##
## $p.value
## [1] 6.756016e-19
##
## $df
## [1] 105
Question 2
State the null and alternative hypotheses.
\(H_0:\) The correlation matrix is the identity matrix;
\(H_a:\) The correlation matrix is not the identity matrix.
Question 3
State your conclusion at a 5% level of significance and respond with whether factor analysis is warranted based on this measure.
Since p-value < 0.05, we reject the null hypothesis. The correlation matrix is not the identity matrix. Factor analysis is warranted based on this measure.
Question 4
Round the p-value to four decimal places (enter 0 if P < 0.00005).
p-value = 0
Part B1
Question 5
Compute the Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy (MSA) for the responses for the 15 survey questions.
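A sketch of the code, with mat the correlation matrix of all 15 items (the name matches the Call line in the output):
mat <- cor(careerbarrier)  # correlation matrix of the 15 survey items
KMO(mat)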
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = mat)
## Overall MSA = 0.68
## MSA for each item =
## money lazy law noexp math support health reading
## 0.36 0.67 0.67 0.61 0.67 0.66 0.67 0.73
## english aoda grades disc social relatshp looks
## 0.69 0.82 0.67 0.66 0.75 0.72 0.58
Question 6
Report the overall MSA value.
Overall MSA = 0.68
Question 7
Is the overall MSA value acceptable for factor analysis?
Yes.
Question 8
Should any questionnaire items be dropped from the factor analysis because of MSA values under 0.50?
Yes.
Question 9
If so which one(s)? (if there aren’t any, write “none”)
Answer: money
Part B2
Question 10
Compute the Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy (MSA) for the responses for the remaining survey questions after you have dropped the items from Part B1. Use the following questions to document your findings.
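A sketch, assuming "money" is the first column of the data frame (consistent with the careerbarrier[, -1] call in Part D):
mat2 <- cor(careerbarrier[, -1])  # correlation matrix of the 14 items left after dropping "money"
KMO(mat2)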
## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = mat2)
## Overall MSA = 0.7
## MSA for each item =
## lazy law noexp math support health reading english
## 0.68 0.67 0.61 0.68 0.66 0.68 0.74 0.70
## aoda grades disc social relatshp looks
## 0.82 0.67 0.70 0.77 0.71 0.62
Question 11
Report the overall MSA value.
Overall MSA = 0.7
Question 12
Is the new overall MSA value acceptable for factor analysis?
Yes.
Question 13
Should any questionnaire items be dropped from the factor analysis because of MSA values under 0.50?
No.
Question 14
If so which one(s)? (if there aren’t any say “none”)
Answer: None.
Part C
Question 15
Use R to create a scree plot for the questionnaire items that you deemed to be appropriate for the factor analysis from Part B.
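One way to produce the plot is with scree() from the psych package; a minimal sketch on the 14 retained items:
scree(careerbarrier[, -1])  # plots eigenvalues for principal components (and factors)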
Question 16
Using the knee in the scree plot, how many factors should be extracted?
Answer: 3
Question 17
How many components have eigenvalues (aka variances, latent roots) greater than 1 and how many factors does this suggest extracting?
Answer: 5 components have eigenvalues greater than 1, which suggests extracting 5 factors.
Part D
Question 18
Use a principal components extraction with the varimax rotation to extract 5 factors. Print the output with factor loadings under 0.5 suppressed and sort the loadings.
fa.out <- principal(careerbarrier[, -1], nfactors = 5, rotate = "varimax")  # drop "money" (column 1); varimax rotation
print.psych(fa.out, cut = 0.5, sort = TRUE)  # suppress loadings under 0.5 and sort them
## Principal Components Analysis
## Call: principal(r = careerbarrier[, -1], nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
## item RC2 RC1 RC3 RC4 RC5 h2 u2 com
## looks 14 0.79 0.67 0.33 1.2
## relatshp 13 0.76 0.65 0.35 1.2
## disc 11 0.71 0.53 0.47 1.1
## social 12 0.59 0.50 0.50 1.9
## reading 7 0.82 0.78 0.22 1.3
## english 8 0.74 0.67 0.33 1.4
## health 6 0.63 0.59 0.41 2.0
## aoda 9 0.58 0.59 0.41 2.2
## lazy 1 0.81 0.77 0.23 1.4
## law 2 0.76 0.65 0.35 1.3
## grades 10 0.65 0.53 0.47 1.5
## support 5 0.82 0.71 0.29 1.1
## math 4 0.64 0.60 0.40 2.0
## noexp 3 0.89 0.85 0.15 1.1
##
## RC2 RC1 RC3 RC4 RC5
## SS loadings 2.28 2.28 2.11 1.27 1.15
## Proportion Var 0.16 0.16 0.15 0.09 0.08
## Cumulative Var 0.16 0.33 0.48 0.57 0.65
## Proportion Explained 0.25 0.25 0.23 0.14 0.13
## Cumulative Proportion 0.25 0.50 0.73 0.87 1.00
##
## Mean item complexity = 1.5
## Test of the hypothesis that 5 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.08
## with the empirical chi square 96.17 with prob < 1.4e-08
##
## Fit based upon off diagonal values = 0.89
Question 19
What is the cumulative variance explained? Answer as a percent, not a decimal number.
Answer: 65%
Question 20
Is this considered an acceptable percent of total variation?
Yes.
Problem 2
Scenario: Water quality variables nitrogen, turbidity, phosphorus, dissolved oxygen, temperature, and conductivity are measured in 31 randomly selected farm ponds in Southeastern Minnesota. Researchers would like to determine if there is an underlying structure that will enable clustering of these 31 ponds into homogeneous groups. You will need this file: farmpondquality.rda
Part A
Part B
Question 22
Plot the dendrogram for hierarchical clustering using complete linkage and add the rectangles by cutting the dendrogram at a height of 5.
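A sketch of the code, assuming the six variables are standardized first and the hclust object is named h_clusters (the name the Part C code expects); farmpondquality_new, the matrix of z-scores, is reused for the k-means analysis below:
load("farmpondquality.rda")
farmpondquality_new <- scale(farmpondquality)  # z-scores of the six water quality variables
h_clusters <- hclust(dist(farmpondquality_new), method = "complete")
plot(h_clusters)
rect.hclust(h_clusters, h = 5)  # draw rectangles around the clusters formed at height 5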
Question 23
How many clusters does this form?
4 clusters.
Part C
Question 24
Append the original data frame (the unscaled one) with the cluster number from cutting the dendrogram at a height of 5. Find the number of ponds in each cluster and obtain the means of the original variables for each cluster.
library(dplyr) # for easy grouping and summarising
farmpondquality <- cbind(farmpondquality,
h_clusters = cutree(h_clusters, h = 5))
tbl <- farmpondquality %>%
group_by(h_clusters) %>%
summarise(num_ponds = n(),
mean_nitro = mean(nitro),
mean_turb = mean(turb),
mean_phos = mean(phos),
mean_disoxy = mean(disoxy),
mean_temp = mean(temp),
mean_cond = mean(cond))
print.data.frame(tbl)
## h_clusters num_ponds mean_nitro mean_turb mean_phos mean_disoxy mean_temp
## 1 1 2 0.7865357 13.766667 2.5072143 9.510714 20.62500
## 2 2 15 0.3006019 28.454413 1.8769083 9.224706 23.08413
## 3 3 10 0.1819414 9.091905 1.6315357 11.507738 20.41036
## 4 4 4 0.1832321 9.183929 0.9327321 8.553571 22.56696
## mean_cond
## 1 346.7500
## 2 149.6277
## 3 206.7179
## 4 488.5625
Question 25
For k-means clustering, plot the within sum of squares for the first 15 clusters against the cluster number and use the plot to determine a good number of clusters to partition the cases into. Use the standardized values.
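A sketch of the elbow plot, computing the total within-cluster sum of squares for k = 1, ..., 15 on the standardized data (nstart = 25 is an added safeguard against poor random starts):
wss <- sapply(1:15, function(k) kmeans(farmpondquality_new, centers = k, nstart = 25)$tot.withinss)
plot(1:15, wss, type = "b", xlab = "Number of clusters k",
ylab = "Total within-cluster sum of squares")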
Question 26
How many clusters do you think are best, based on this plot? Why?
6, because this is where the ‘knee’ appears to be in the plot: beyond about 6 clusters, the decline in the within-cluster sum of squares slows down.
Question 27
Perform the k-means clustering on the z-scores of the 6 pond quality variables using the number of clusters you determined from the plot. Find the number of cases in each cluster as well as the cluster means for the raw variables.
# k-means on the standardized variables (farmpondquality_new holds the z-scores);
# kmeans() starts from random centers, so results can vary between runs
k_means_cluster <- kmeans(farmpondquality_new, 6)
farmpondquality <- cbind(farmpondquality, k_means_clusters = k_means_cluster$cluster)
tbl2 <- farmpondquality %>%
group_by(k_means_clusters) %>%
summarise(num_ponds = n(),
mean_nitro = mean(nitro),
mean_turb = mean(turb),
mean_phos = mean(phos),
mean_disoxy = mean(disoxy),
mean_temp = mean(temp),
mean_cond = mean(cond))
print.data.frame(tbl2)
## `summarise()` ungrouping output (override with `.groups` argument)
## k_means_clusters num_ponds mean_nitro mean_turb mean_phos mean_disoxy
## 1 1 4 0.6661321 11.833333 1.854054 9.031429
## 2 2 5 0.1996857 8.747143 1.021886 9.097857
## 3 3 6 0.2169155 15.863810 3.286909 11.557341
## 4 4 3 0.4458810 48.966667 2.056605 5.532421
## 5 5 2 0.2131071 52.107143 1.103179 9.171429
## 6 6 11 0.1732851 13.148918 1.144643 10.980411
## mean_temp mean_cond
## 1 22.40089 260.64375
## 2 21.80357 471.90000
## 3 22.08294 231.77579
## 4 21.88690 172.33571
## 5 24.76786 93.32143
## 6 21.41526 132.99675
Problem 3
A study was conducted on the relationship of seating position and nausea on buses. The data in the following table classifies each person in a random sample of bus riders by the location of his or her seat and whether or not nausea was reported.
|           | Front | Middle | Rear | Total |
|-----------|-------|--------|------|-------|
| Nausea    | 98    | 110    | 161  | 369   |
| No Nausea | 264   | 321    | 280  | 865   |
| Total     | 362   | 431    | 441  | 1234  |
Part A
Question 28
Test to see whether or not the seat position within a bus is associated with motion sickness.
bus_riders <- matrix(c(98, 110, 161, 264, 321, 280), nrow = 2, byrow = TRUE)
rownames(bus_riders) <- c("Nausea", "No Nausea")
colnames(bus_riders) <- c("Front", "Middle", "Rear")
bus_riders
chisq.test(bus_riders)  # Pearson's chi-squared test of independence
## Front Middle Rear
## Nausea 98 110 161
## No Nausea 264 321 280
##
## Pearson's Chi-squared test
##
## data: bus_riders
## X-squared = 14.509, df = 2, p-value = 0.000707
Question 29
State the null and alternative hypotheses.
\(H_0:\) Motion sickness and seat position within a bus are independent variables;
\(H_a:\) Motion sickness and seat position within a bus are not independent variables.
Question 30
State the test statistic. Give the answers to the nearest thousandth decimal.
X-squared = 14.509
Question 31
State the degrees of freedom.
df = 2
Question 32
Round the p-value to four decimal places (enter 0 if P < 0.00005).
p-value = 0.0007
Question 33
State your conclusion.
Since p-value < 0.05, we reject the null hypothesis that motion sickness and seat position within a bus are independent variables. The seat position within a bus is associated with motion sickness.
Part B
Question 34
Construct a 90% confidence interval (without the Yates correction) for the difference in population proportions of all bus riders in the front who report nausea and all bus riders in the rear who report nausea. (Use Diff = Front - Rear)
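The output below can be reproduced with prop.test(); a sketch, with the matrix name bus_riders_tbl taken from the data line of the output:
bus_riders_tbl <- matrix(c(98, 264, 161, 280), nrow = 2, byrow = TRUE,
dimnames = list(c("Front", "Rear"), c("Nausea", "No Nausea")))
bus_riders_tbl
prop.test(bus_riders_tbl, conf.level = 0.90, correct = FALSE)  # no Yates correction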
## Nausea No Nausea
## Front 98 264
## Rear 161 280
##
## 2-sample test for equality of proportions without continuity
## correction
##
## data: bus_riders_tbl
## X-squared = 8.1012, df = 1, p-value = 0.004424
## alternative hypothesis: two.sided
## 90 percent confidence interval:
## -0.14819088 -0.04053138
## sample estimates:
## prop 1 prop 2
## 0.2707182 0.3650794
Question 35
Enter the lower bound of the 90% CI (round to 3 decimal places).
-0.148
Question 36
Enter the upper bound of the 90% CI (round to 3 decimal places).
-0.041
Question 37
Write an interpretation for the interval in the context of the problem.
We are 90% confident that the difference in the population proportions of all bus riders in the front who report nausea and all bus riders in the rear who report nausea (front minus rear) is between -0.148 and -0.041. Since the entire interval is negative, a smaller proportion of front riders than rear riders report nausea.
Part C
Question 38
Compute the odds ratio of having nausea for those in the rear compared to those in the front of the bus.
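The odds ratio can be computed directly from the table counts; a minimal sketch:
odds_rear <- 161 / 280   # odds of nausea for rear riders
odds_front <- 98 / 264   # odds of nausea for front riders
odds_rear / odds_front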
## [1] 1.54898
Question 39
Report the odds ratio to 3 decimal places.
1.549
Question 40
Interpret the odds ratio in the context of the problem.
For people sitting in the rear of the bus, the odds of having nausea are 1.549 times larger than the odds of having nausea for those in the front of the bus.
Problem 4
Scenario: A random sample of apartments was obtained from mid-sized towns in the Midwest. Each apartment is classified as having either “3 bedrooms” or “2 bedrooms”, and the monthly rent was recorded. You will need this file: rent.rda
Part A
Question 41
Load the data set and create boxplots for the monthly rents for each type of apartment.
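A sketch, assuming the data frame is named monthlyrent with columns rent and type (the names that appear in the Part B output):
load("rent.rda")
boxplot(rent ~ type, data = monthlyrent, ylab = "Monthly rent")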
Question 42
Comment on the shapes of the boxplots and whether or not they contain outliers. Does there appear to be a difference in the distributions of monthly rent between 2 and 3 bedroom apartments?
The box for 3-bedroom apartments is longer than the box for 2-bedroom apartments, and the distribution for 3-bedroom apartments appears skewed to the left, while the distribution for 2-bedroom apartments appears skewed to the right. There are no outliers. There does appear to be a difference in the distributions of monthly rent between 2- and 3-bedroom apartments.
Part B
Question 43
Conduct the Shapiro-Wilk test for normality for each sample.
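A sketch of the code, testing each sample separately:
shapiro.test(monthlyrent$rent[monthlyrent$type == "3 Bedrooms"])
shapiro.test(monthlyrent$rent[monthlyrent$type == "2 Bedrooms"])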
##
## Shapiro-Wilk normality test
##
## data: monthlyrent$rent[monthlyrent$type == "3 Bedrooms"]
## W = 0.90679, p-value = 0.2596
##
## Shapiro-Wilk normality test
##
## data: monthlyrent$rent[monthlyrent$type == "2 Bedrooms"]
## W = 0.96512, p-value = 0.8423
Question 44
Using a 5% level of significance for each test individually, choose the option that describes conclusions for each distribution.
- Normality is rejected for the rents of both the 3-bedroom and 2-bedroom apartments.
- Normality is rejected for the rents of the 2-bedroom apartments but not the 3-bedroom apartments.
- Normality is rejected for the rents of the 3-bedroom apartments but not the 2-bedroom apartments.
- Normality is not rejected for the rents of either the 3-bedroom or 2-bedroom apartments.
Answer: (d) Normality is not rejected for the rents of either the 3-bedroom or 2-bedroom apartments.
Part C
Question 45
Test for equality of population variances using a 5% level of significance.
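The F test below comes from var.test(); a sketch consistent with the “rent by type” data line:
var.test(rent ~ type, data = monthlyrent)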
##
## F test to compare two variances
##
## data: rent by type
## F = 5.2305, num df = 9, denom df = 9, p-value = 0.02159
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 1.299192 21.058128
## sample estimates:
## ratio of variances
## 5.230541
Question 46
State the null and alternative hypotheses.
\(H_0:\) The variances of the rents of both the 3-bedroom and 2-bedroom apartments are equal;
\(H_a:\) The variances are not equal.
Question 47
Round the p-value to four decimal places (enter 0 if P < 0.00005).
p-value = 0.0216
Question 48
State the conclusion for the test.
Since p-value < 0.05, we reject the null hypothesis. The variances of the rents of the 3-bedroom and 2-bedroom apartments are not equal.
Part D
Question 49
Conduct the appropriate hypothesis test (for two samples only - not ANOVA) to compare the population mean rents for these two types of apartments in mid-sized Midwestern towns. Use a 10% level of significance.
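Because Part C found unequal variances, the Welch (unpooled) t-test is appropriate; a sketch consistent with the output below:
t.test(rent ~ type, data = monthlyrent, conf.level = 0.90)  # Welch test is the default (var.equal = FALSE)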
##
## Welch Two Sample t-test
##
## data: rent by type
## t = 3.0105, df = 12.32, p-value = 0.01057
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
## 37.65164 146.34836
## sample estimates:
## mean in group 3 Bedrooms mean in group 2 Bedrooms
## 608 516
Question 50
State the hypotheses.
\(H_0:\) mean rents for 3-bedroom and 2-bedroom apartments are equal;
\(H_a:\) mean rents for 3-bedroom and 2-bedroom apartments are not equal.
Question 51
State the test statistic.
t = 3.0105
Question 52
State the df.
df = 12.32
Question 53
Round the p-value to four decimal places (enter 0 if P < 0.00005).
p-value = 0.0106
Question 54
State the conclusion.
Since p-value < 0.1, we reject the null hypothesis. Mean rents for 3-bedroom and 2-bedroom apartments are statistically significantly different.
Question 55
Also obtain the appropriate 90% confidence interval for the difference in population mean rents for these two types of apartments in mid-sized Midwestern towns. (you can add your R code in Question 49). Write an interpretation for the interval in the context of the problem.
90% confidence interval for the difference in population mean rents is (37.652, 146.348). It means that we are 90% confident that the true difference in population mean rents for 3-bedroom and 2-bedroom apartments is between 37.652 and 146.348.
Problem 5
Scenario: A psychologist is interested in the relationship between academic performance and “self-concept”, as well as the student’s IQ and gender, for 39 seventh-grade students in a rural school district. Academic performance is measured as a grade point average (let y = GPA). Self-concept (x1) is measured by the student’s score on the Piers-Harris Children’s Self-Concept Scale; the IQ (x2) and the type of learner (x3, 0 = Visual, 1 = Auditory) of each student are also recorded. You will need this file: gpa7th.rda
Part A
Consider the model
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_3 + \beta_5 x_2 x_3 + \epsilon \]
Question 56
Fit this model and report on which coefficients are significantly different from zero (create the interaction terms separately, do not use x1:x3 or x2:x3 in the code). Use the hierarchical approach to model-building.
load("gpa7th.rda")
ind_Visual <- which(gpa7th$x3 == "Visual")
ind_Auditory <- which(gpa7th$x3 == "Auditory")
x1x3 <- NULL
x2x3 <- NULL
x1x3[ind_Auditory] <- 1 * gpa7th$x1[ind_Auditory]
x1x3[ind_Visual] <- 0 * gpa7th$x1[ind_Visual]
x2x3[ind_Auditory] <- 1 * gpa7th$x2[ind_Auditory]
x2x3[ind_Visual] <- 0 * gpa7th$x2[ind_Visual]
gpa7th_new <- cbind(gpa7th, x1x3, x2x3)
mod1 <- lm(y ~ x1 + x2 + x3 + x1x3 + x2x3, data = gpa7th_new)
summary(mod1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x1x3 + x2x3, data = gpa7th_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.20709 -0.30514 0.03727 0.25057 1.15223
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.963561 1.221225 -2.427 0.0209 *
## x1 0.040486 0.008774 4.614 5.72e-05 ***
## x2 0.031657 0.011863 2.669 0.0117 *
## x3Auditory 2.074237 1.652304 1.255 0.2182
## x1x3 -0.034406 0.015210 -2.262 0.0304 *
## x2x3 0.002440 0.015614 0.156 0.8768
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4805 on 33 degrees of freedom
## Multiple R-squared: 0.6653, Adjusted R-squared: 0.6146
## F-statistic: 13.12 on 5 and 33 DF, p-value: 4.572e-07
Question 57
Should any terms be dropped from this model at a 5% level of significance?
Yes.
Question 58
Select the term(s) that should be dropped.
x2x3. Under the hierarchical approach, the x3 main effect is retained even though it is not individually significant, because the significant x1x3 interaction involves it (this is the model fit in Part B).
Part B
Fit the model (create the interaction term separately, do not use x1:x3 in the code)
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_3 + \epsilon \]
Question 59
Obtain the residuals for this model and evaluate the residual plots using the “plot” function. Also create a histogram of the residuals.
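A sketch of the code for the reduced model, its diagnostic plots, and the Breusch-Pagan test of Question 61 (bptest() is from the lmtest package, matching the “studentized Breusch-Pagan test” output):
mod2 <- lm(y ~ x1 + x2 + x3 + x1x3, data = gpa7th_new)
summary(mod2)
plot(mod2)  # the four standard residual diagnostic plots
hist(resid(mod2), main = "Histogram of residuals")
library(lmtest)
bptest(mod2)  # Breusch-Pagan test (Question 61)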
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x1x3, data = gpa7th_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18700 -0.30971 0.03875 0.24889 1.15463
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.097275 0.858682 -3.607 0.000983 ***
## x1 0.040035 0.008166 4.903 2.3e-05 ***
## x2 0.033065 0.007602 4.350 0.000118 ***
## x3Auditory 2.296831 0.825052 2.784 0.008710 **
## x1x3 -0.033512 0.013890 -2.413 0.021377 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4736 on 34 degrees of freedom
## Multiple R-squared: 0.665, Adjusted R-squared: 0.6256
## F-statistic: 16.88 on 4 and 34 DF, p-value: 1.035e-07
##
## studentized Breusch-Pagan test
##
## data: mod2
## BP = 9.0755, df = 4, p-value = 0.05924
Question 60
Does a visual inspection of the residual plots and histogram indicate that the model assumptions appear to be satisfied? Explain your answer.
Yes, a visual inspection of the residual plots and histogram indicates that the model assumptions appear to be satisfied. The regression fit appears adequate because there is no distinct pattern in the Residuals vs Fitted plot. The residuals also appear to be normally distributed: they follow the straight line quite well in the Normal Q-Q plot, and the histogram of residuals is bell-shaped. The other plots likewise indicate that the model assumptions appear to be satisfied.
Question 61
Also perform a Breusch-Pagan test for homogeneity of variance among the residuals. Use a 5% level of significance (you can include the code in your answer to Question 59). Comment on the results of the Breusch-Pagan test.
Since p-value = 0.059 > 0.05, the null hypothesis of homoscedasticity cannot be rejected at a 5% level of significance. There is homogeneity of variance among the residuals.
Part C
Fit the model (same as in Part B, create the interaction term separately - do not use x1:x3 in code)
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_3 + \epsilon \]
Question 62
Construct a 95% confidence interval for the mean gpa of all 7th-graders with Piers-Harris Children’s Self-Concept Score of 50, IQ of 105, and who are auditory learners.
# For an auditory learner, x1x3 = x1 * 1 = 50
newdata <- data.frame(x1 = 50, x2 = 105, x3 = "Auditory", x1x3 = 50)
predict(mod2, newdata, interval = "confidence")
## fit lwr upr
## 1 2.997526 2.696134 3.298919
Question 63
Enter the lower bound of the 95% CI (round to 3 decimal places).
2.696
Question 64
Enter the upper bound of the 95% CI (round to 3 decimal places).
3.299
Question 65
Write an interpretation for the interval in the context of the problem.
We are 95% confident that for all 7th-graders with Piers-Harris Children’s Self-Concept Score of 50, IQ of 105, and who are auditory learners, the mean GPA is between 2.696 and 3.299.
Part D
Fit the model
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_3 + \epsilon \]
Question 66
Interpret the value of \(\hat{\beta_2}\) in the context of the problem.
The value of \(\hat{\beta_2} = 0.033\) indicates that, holding self-concept score and learner type constant, each one-point increase in IQ (x2) is associated with an expected increase of 0.033 in GPA.
Part E
Fit the first-order logistic regression model to predict that a randomly selected student is an auditory learner from gpa, the Piers-Harris Children’s Self-Concept Score, and the IQ.
Question 67
Use it to predict the probability of being an auditory learner for a 7th-grader with a 3.5 gpa, Piers-Harris Children’s Self-Concept Score of 50, and IQ of 105.
mod_glm <- glm(x3 ~ y + x1 + x2, family = binomial, data = gpa7th)  # first-order logistic model
newdata_glm <- data.frame(y = 3.5, x1 = 50, x2 = 105)
predict(mod_glm, newdata_glm, type = "response")  # type = "response" returns a probability
## 1
## 0.7665911
Question 68
Report the probability to 4 decimal places.
Answer: 0.7666
Problem 6
Scenario: A study was conducted to evaluate the quality of beef after storage times (STORAGE) of 10, 40, 80, and 120 days. The beef quality variables assessed were beefy aroma (BEEFY), bloody aroma (BLOODY), and a grassy aroma (GRASSY), which were all measured on a rating scale ranging from 0 to 15. Thirty samples were evaluated by several beef quality specialists and the average rating was obtained for this data file. You will need this file: beef.rda
Part A
Question 69
Use the Henze-Zirkler Multivariate Normality Test to test for multivariate normality among the three response variables: BEEFY, BLOODY, and GRASSY. Include a chi-square quantile plot in your analysis and use a 1% level of significance for each individual hypothesis test.
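A sketch using mvn() from the MVN package, which produces the output below and, with multivariatePlot = "qq", the chi-square quantile plot:
library(MVN)
load("beef.rda")
mvn(beef[, -1], mvnTest = "hz", multivariatePlot = "qq")  # drop the STORAGE column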
## $multivariateNormality
## Test HZ p value MVN
## 1 Henze-Zirkler 0.7762501 0.1296601 YES
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk BEEFY 0.8791 0.0027 NO
## 2 Shapiro-Wilk BLOODY 0.8937 0.0059 NO
## 3 Shapiro-Wilk GRASSY 0.9054 0.0114 NO
##
## $Descriptives
## n Mean Std.Dev Median Min Max 25th 75th
## BEEFY 30 7.583343 3.703254 7.875000 2.25000 12.66670 3.56250 11.291700
## BLOODY 30 4.938889 1.992796 5.500000 1.66667 7.41667 3.06250 6.729167
## GRASSY 30 1.955555 1.457037 1.916665 0.00000 4.41667 0.58333 3.250000
## Skew Kurtosis
## BEEFY -0.1221207 -1.680482
## BLOODY -0.3083830 -1.469143
## GRASSY 0.1412231 -1.577458
Question 70
According to this test, is there sufficient evidence to conclude that BEEFY, BLOODY, and GRASSY are not multivariate normal? Explain.
According to this test, there is not sufficient evidence to conclude that BEEFY, BLOODY, and GRASSY are not multivariate normal: the p-value = 0.13 > 0.01, and the chi-square quantile plot is consistent with multivariate normality.
Part B
Conduct Box’s M Test to test for equality of covariances. Use a 1% level of significance.
Question 71
Is there sufficient evidence to conclude that the covariance matrices are not equal at the 1% level of significance?
There is not sufficient evidence to conclude that the covariance matrices are not equal at the 1% level of significance, because the p-value = 0.022 > 0.01.
Question 72
Insert your R code here.
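A sketch using boxM() from the biotools package, consistent with the version banner and the beef[, -1] data line below:
library(biotools)
boxM(beef[, -1], beef$STORAGE)  # grouping variable is storage time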
## ---
## biotools version 4.0
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: beef[, -1]
## Chi-Sq (approx.) = 31.973, df = 18, p-value = 0.02215
Question 73
Based on the criteria of multivariate normality and equal covariance matrices, is it appropriate to proceed with MANOVA?
Yes.
Part C
Question 74
Regardless of the outcomes of the previous hypothesis tests, conduct a MANOVA to determine if there are differences between the different storage times for the population mean vectors when beefy, bloody, and grassy aromas are considered together. Use the Wilks’ lambda statistic and let \(\alpha = .05\).
res.man <- manova(cbind(BEEFY, BLOODY, GRASSY) ~ STORAGE, data = beef)
summary(res.man, test = "Wilks")
## Df Wilks approx F num Df den Df Pr(>F)
## STORAGE 3 0.1269 8.6897 9 58.56 3.816e-08 ***
## Residuals 26
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Question 75
State the null and alternative hypotheses.
\(H_0:\) mean vectors are equal across the groups;
\(H_a:\) at least two of the groups’ mean vectors differ.
Question 76
State the conclusion for the test.
Since p-value < 0.05, we reject the null hypothesis. There are differences between the different storage times for the population mean vectors.
Question 77
Round the p-value to four decimal places (enter 0 if P < 0.00005).
p-value = 0
Part D
Question 78
Follow up with univariate ANOVAs and Tukey multiple comparisons on the response variables to see which population means differ. Use a 5% level of significance for each univariate ANOVA and each Tukey procedure. (We are temporarily ignoring the multiple comparisons problem.)
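The univariate ANOVAs can be obtained from the MANOVA fit with summary.aov(), and each Tukey procedure from a separate aov() fit; a sketch matching the Fit: lines below:
summary.aov(res.man)  # one univariate ANOVA per response variable
TukeyHSD(aov(BEEFY ~ STORAGE, data = beef))
TukeyHSD(aov(BLOODY ~ STORAGE, data = beef))
TukeyHSD(aov(GRASSY ~ STORAGE, data = beef))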
## Response BEEFY :
## Df Sum Sq Mean Sq F value Pr(>F)
## STORAGE 3 322.55 107.516 37.192 1.491e-09 ***
## Residuals 26 75.16 2.891
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response BLOODY :
## Df Sum Sq Mean Sq F value Pr(>F)
## STORAGE 3 91.744 30.5814 33.948 3.837e-09 ***
## Residuals 26 23.422 0.9008
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response GRASSY :
## Df Sum Sq Mean Sq F value Pr(>F)
## STORAGE 3 51.727 17.2422 45.563 1.713e-10 ***
## Residuals 26 9.839 0.3784
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = BEEFY ~ STORAGE, data = beef)
##
## $STORAGE
## diff lwr upr p adj
## 40-10 -1.171283 -3.629593 1.2870267 0.5668196
## 80-10 -6.212302 -8.807288 -3.6173169 0.0000033
## 120-10 -7.881942 -10.400960 -5.3629235 0.0000000
## 80-40 -5.041019 -7.391613 -2.6904251 0.0000189
## 120-40 -6.710658 -8.977108 -4.4442085 0.0000001
## 120-80 -1.669639 -4.083652 0.7443732 0.2537189
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = BLOODY ~ STORAGE, data = beef)
##
## $STORAGE
## diff lwr upr p adj
## 40-10 -0.5138894 -1.886182 0.8584034 0.7351889
## 80-10 -3.2559536 -4.704542 -1.8073650 0.0000092
## 120-10 -4.1458363 -5.552018 -2.7396546 0.0000001
## 80-40 -2.7420641 -4.054227 -1.4299012 0.0000279
## 120-40 -3.6319468 -4.897138 -2.3667554 0.0000001
## 120-80 -0.8898827 -2.237447 0.4576821 0.2907963
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = GRASSY ~ STORAGE, data = beef)
##
## $STORAGE
## diff lwr upr p adj
## 40-10 0.6527761 -0.2366643 1.542217 0.2090224
## 80-10 2.3472207 1.4083298 3.286112 0.0000016
## 120-10 3.3472213 2.4358160 4.258626 0.0000000
## 80-40 1.6944446 0.8439769 2.544912 0.0000556
## 120-40 2.6944451 1.8744216 3.514469 0.0000000
## 120-80 1.0000005 0.1265874 1.873414 0.0203952
Question 79
Write your conclusion.
From the univariate ANOVAs, all three response variables (BEEFY, BLOODY, GRASSY) differ significantly across storage times.
According to the Tukey multiple comparisons, the population mean scores for beefy and bloody aroma differ between storage times of 10 and 80, 10 and 120, 40 and 80, and 40 and 120 days.
The population mean scores for grassy aroma differ between storage times of 10 and 80, 10 and 120, 40 and 80, and 40 and 120 days, and also between 80 and 120 days.
Question 80
Match each nonparametric procedure with the parametric procedure with which it most closely corresponds.
Answer:
- Wilcoxon Signed Rank - Paired data t-test
- Wilcoxon Rank Sum Test - Pooled t-test
- Kruskal-Wallis Test - One-way ANOVA
- Fisher’s Exact Test - Chi-square test for independence
Question 81
Which of the following could be used in multiple regression model selection?
- AIC
- PRESS
- VIF
- \(R^2_{Adj}\)
- ALL of the above
Answer: ALL of the above
Question 82
A Type I error can best be described as:
- not rejecting a false null hypothesis.
- a sample that has not been chosen randomly.
- using a two-sided alternative when a one-sided alternative would be sufficient.
- concluding that the null hypothesis is false when in fact it is true.
- mistakenly using a parametric procedure instead of a nonparametric procedure when the data is not normally distributed.
Answer: d) concluding that the null hypothesis is false when in fact it is true.
Question 83
Which of the following statements DOES NOT represent a model assumption of linear regression?
- the standard deviation of the independent variable is normally distributed
- the residuals are normally distributed
- the variance of the error terms is constant at each value of x
- the random error terms are independent
- the mean of the random error terms is zero
Answer: a) the standard deviation of the independent variable is normally distributed
Question 84
Which of the following statements DOES NOT represent a model assumption of ANOVA?
- the samples are independent
- the population means are not equal
- the samples are randomly selected
- the population variances are all equal
- the populations are normally distributed
Answer: b) the population means are not equal
Question 85
In linear regression, if the correlation coefficient between the response variable y and the explanatory variable x1 is zero, then
- The scatterplot of x1 vs. y will be a perfectly straight line.
- There is no linear relationship between x1 and y.
- \(R^2\) will be 1.
- The observed values of y are equal to the predicted values.
- We reject \(H_0: \beta_1 = 0\)
Answer: b) There is no linear relationship between x1 and y.
Question 86
For the following problems, match the scenario to the appropriate statistical procedure. Each procedure may be used more than once or not at all.
Scenario: A business manager has data for 36 variables collected for the employees in various positions within a very large organization. The manager would like to reduce the number of variables to a much smaller number of latent variables that will represent the shared variability present in the original 36 variables.
Procedure: Factor Analysis
Scenario: A random sample of American adults is collected to see how systolic blood pressure is related to weight, height, age, and gender.
Procedure: Linear regression
Scenario: A manufacturing plant measures the weekly man-hours lost due to accidents for a random sample of its employees both before and after the installation of an elaborate safety program and would like to conduct a test to see if the loss of man-hours is lower after the program. The difference in loss of man-hours is not normally distributed.
Procedure: Wilcoxon signed rank test
Scenario: A clinical research study is designed to investigate how systolic blood pressure, cholesterol level, and whether or not a person is a smoker can predict the presence or absence of hypertension.
Procedure: Logistic regression
Scenario: Do elementary, middle, and high school students differ in the amount of time spent watching television per week? Sample data looks to be normally distributed with approximately equal variance.
Procedure: ANOVA
Scenario: Is the probability distribution of college students at a college football game such that there are equal proportions of Freshmen, Sophomores, Juniors, and Seniors?
Procedure: Chi-Square test for Goodness-of-fit
Scenario: Economists collect independent random samples from five different professional disciplines in order to compare the population variances of salaries in each.
Procedure: Levene test