Factor Analysis, Clustering, Hypothesis Testing, Linear Regression: Exercises and Solutions in R

In order to solve the tasks you need:

R Studio
Data Files

Problem 1

Scenario: A survey was conducted to find out how teenagers think about the future and barriers they think will hinder their career. The researcher would like to determine if the 15 survey items can be summarized more efficiently by a smaller set of latent factors.

You will need these files:

careerbarrier.rda: Data for the 15 survey items for a random sample of 76 teens.
Career Barrier Survey.docx: Descriptions of the variables in the data file.

Part A

Question 1

Conduct Bartlett’s test for sphericity on the responses for the 15 survey questions.

library(psych)
load("careerbarrier.rda")
mat <- cor(careerbarrier)
cortest.bartlett(mat, n = 76)

## $chisq
## [1] 287.4985
## 
## $p.value
## [1] 6.756016e-19
## 
## $df
## [1] 105

Question 2

State the null and alternative hypothesis.

\(H_0:\) The correlation matrix is the identity matrix;

\(H_a:\) The correlation matrix is not the identity matrix.

Question 3

State your conclusion at a 5% level of significance and respond with whether factor analysis is warranted based on this measure.

Since p-value < 0.05, we reject the null hypothesis. The correlation matrix is not the identity matrix. Factor analysis is warranted based on this measure.

Question 4

Round the p-value to four decimal places (enter 0 if P < 0.00005).

p-value = 0

Part B1

Question 5

Compute the Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy (MSA) for the responses for the 16 survey questions.

KMO(mat)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = mat)
## Overall MSA =  0.68
## MSA for each item = 
##    money     lazy      law    noexp     math  support   health  reading 
##     0.36     0.67     0.67     0.61     0.67     0.66     0.67     0.73 
##  english     aoda   grades     disc   social relatshp    looks 
##     0.69     0.82     0.67     0.66     0.75     0.72     0.58

Question 6

Report the overall MSA value.

Overall MSA = 0.68

Question 7

Is the overall MSA value acceptable for factor analysis?

Yes.

Question 8

Should any questionnaire items be dropped from the factor analysis because of MSA values under 0.50?

Yes.

Question 9

If so which one(s)? (if there aren’t any, write “none”)

Answer: money

Part B2

Question 10

Compute the Kaiser-Meyer-Olkin (KMO) Measure of Sampling Adequacy (MSA) for the responses for the remaining survey questions after you have dropped the items from Part B1. Use the following questions to document your findings.

mat2 <- cor(careerbarrier[, -1])
KMO(mat2)

## Kaiser-Meyer-Olkin factor adequacy
## Call: KMO(r = mat2)
## Overall MSA =  0.7
## MSA for each item = 
##     lazy      law    noexp     math  support   health  reading  english 
##     0.68     0.67     0.61     0.68     0.66     0.68     0.74     0.70 
##     aoda   grades     disc   social relatshp    looks 
##     0.82     0.67     0.70     0.77     0.71     0.62

Question 11

Report the overall MSA value.

Overall MSA = 0.7

Question 12

Is the new overall MSA value acceptable for factor analysis?

Yes.

Question 13

Should any questionnaire items be dropped from the factor analysis because of MSA values under 0.50?

No.

Question 14

If so which one(s)? (if there aren’t any say “none”)

Answer: None.

Part C

Question 15

Use R to create a scree plot for the questionnaire items that you deemed to be appropriate for the factor analysis from Part B.

output <- princomp(careerbarrier[, -1], cor = TRUE)
plot(output, type = "lines", main = "Scree plot")
abline(h = 1, lty = 2)

Question 16

Using the knee in the scree plot, how many factors should be extracted?

Answer: 3

Question 17

How many components have eigenvalues (aka variances, latent roots) greater than 1 and how many factors does this suggest extracting?

Answer: 5

Part D

Question 18

Use a principal components extraction with the varimax rotation to extract 5 factors. Print the output with factor loadings under 0.5 suppressed and sort the loadings.

fa.out <- principal(careerbarrier[, -1], nfactors = 5, rotate = "varimax")
print.psych(fa.out, cut = 0.5, sort = TRUE)

## Principal Components Analysis
## Call: principal(r = careerbarrier[, -1], nfactors = 5, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##          item   RC2   RC1   RC3   RC4   RC5   h2   u2 com
## looks      14  0.79                         0.67 0.33 1.2
## relatshp   13  0.76                         0.65 0.35 1.2
## disc       11  0.71                         0.53 0.47 1.1
## social     12  0.59                         0.50 0.50 1.9
## reading     7        0.82                   0.78 0.22 1.3
## english     8        0.74                   0.67 0.33 1.4
## health      6        0.63                   0.59 0.41 2.0
## aoda        9        0.58                   0.59 0.41 2.2
## lazy        1              0.81             0.77 0.23 1.4
## law         2              0.76             0.65 0.35 1.3
## grades     10              0.65             0.53 0.47 1.5
## support     5                    0.82       0.71 0.29 1.1
## math        4                    0.64       0.60 0.40 2.0
## noexp       3                          0.89 0.85 0.15 1.1
## 
##                        RC2  RC1  RC3  RC4  RC5
## SS loadings           2.28 2.28 2.11 1.27 1.15
## Proportion Var        0.16 0.16 0.15 0.09 0.08
## Cumulative Var        0.16 0.33 0.48 0.57 0.65
## Proportion Explained  0.25 0.25 0.23 0.14 0.13
## Cumulative Proportion 0.25 0.50 0.73 0.87 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 5 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.08 
##  with the empirical chi square  96.17  with prob <  1.4e-08 
## 
## Fit based upon off diagonal values = 0.89

Question 19

What is the cumulative variance explained? Answer as a percent, not a decimal number.

Answer: 65%

Question 20

Is this considered an acceptable percent of total variation?

Yes.

Problem 2

Scenario: Water quality variables nitrogen, turbidity, phosphorus, dissolved oxygen, temperature, and conductivity are measured in 31 randomly selected farm ponds in Southeastern Minnesota. Researchers would like to determine if there is an underlying structure that will enable clustering of these 31 ponds into homogeneous groups. You will need this file: farmpondquality.rda

Part A

Question 21

Load the data set and standardize the variables in the file (i.e. find the z-scores for each value). Store the z-scores in a new data frame.

load("farmpondquality.rda")
farmpondquality_new <- scale(farmpondquality, center = TRUE, scale = TRUE)

Part B

Question 22

Plot the dendrogram for hierarchical clustering using complete linkage and add the rectangles by cutting the dendrogram at a height of 5.

h_clusters <- hclust(dist(farmpondquality_new), method = "complete")
plot(h_clusters)
rect.hclust(h_clusters, h = 5)

Question 23

How many clusters does this form?

4 clusters.

Part C

Question 24

Append the original data frame (the unscaled one) with the cluster number from cutting the dendrogram at a height of 5. Find the number of ponds in each cluster and obtain the means of the original variables for each cluster.

library(dplyr) # for easy grouping and summarising
farmpondquality <- cbind(farmpondquality,
                         h_clusters = cutree(h_clusters, h = 5))
tbl <- farmpondquality %>%
  group_by(h_clusters) %>%
  summarise(num_ponds = n(),
            mean_nitro = mean(nitro),
            mean_turb = mean(turb),
            mean_phos = mean(phos),
            mean_disoxy = mean(disoxy),
            mean_temp = mean(temp),
            mean_cond = mean(cond))
print.data.frame(tbl)

##   h_clusters num_ponds mean_nitro mean_turb mean_phos mean_disoxy mean_temp
## 1          1         2  0.7865357 13.766667 2.5072143    9.510714  20.62500
## 2          2        15  0.3006019 28.454413 1.8769083    9.224706  23.08413
## 3          3        10  0.1819414  9.091905 1.6315357   11.507738  20.41036
## 4          4         4  0.1832321  9.183929 0.9327321    8.553571  22.56696
##   mean_cond
## 1  346.7500
## 2  149.6277
## 3  206.7179
## 4  488.5625

Question 25

For k-means clustering, plot the within sum of squares for the first 15 clusters against the cluster number and use the plot to determine a good number of clusters to partition the cases into. Use the standardized values.

set.seed(123)
wss <- (nrow(farmpondquality_new)-1) * sum(apply(farmpondquality_new, 2, var))
for (i in 2:15) {
  wss[i] <- sum(kmeans(farmpondquality_new, centers = i)$withinss)
}
plot(1:15, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

Question 26

How many clusters do you think are best, based on this plot? Why?

6, because this is where the ‘knee’ seems to be in the plot (around 6 clusters, the decline slows down).

Question 27

Perform the k-means clustering on the z-scores of the 6 pond quality variables using the number of clusters you determined from the plot. Find the number of cases in each cluster as well as the cluster means for the raw variables.

k_means_cluster <- kmeans(farmpondquality_new, 6)
farmpondquality <- cbind(farmpondquality, k_means_clusters = k_means_cluster$cluster)
tbl2 <- farmpondquality %>%
  group_by(k_means_clusters) %>%
  summarise(num_ponds = n(),
            mean_nitro = mean(nitro),
            mean_turb = mean(turb),
            mean_phos = mean(phos),
            mean_disoxy = mean(disoxy),
            mean_temp = mean(temp),
            mean_cond = mean(cond))

## `summarise()` ungrouping output (override with `.groups` argument)

print.data.frame(tbl2)

##   k_means_clusters num_ponds mean_nitro mean_turb mean_phos mean_disoxy
## 1                1         4  0.6661321 11.833333  1.854054    9.031429
## 2                2         5  0.1996857  8.747143  1.021886    9.097857
## 3                3         6  0.2169155 15.863810  3.286909   11.557341
## 4                4         3  0.4458810 48.966667  2.056605    5.532421
## 5                5         2  0.2131071 52.107143  1.103179    9.171429
## 6                6        11  0.1732851 13.148918  1.144643   10.980411
##   mean_temp mean_cond
## 1  22.40089 260.64375
## 2  21.80357 471.90000
## 3  22.08294 231.77579
## 4  21.88690 172.33571
## 5  24.76786  93.32143
## 6  21.41526 132.99675

Problem 3

A study was conducted on the relationship of seating position and nausea on buses. The data in the following table classifies each person in a random sample of bus riders by the location of his or her seat and whether or not nausea was reported.

Table	Front	Middle	Rear	Total
Nausea	98	110	161	369
No Nausea	264	321	280	865
Total	362	431	441	1234

Part A

Question 28

Test to see whether or not the seat position within a bus is associated with motion sickness.

bus_riders <- matrix(c(98, 110, 161, 264, 321, 280), nrow = 2, byrow = TRUE)
rownames(bus_riders) <- c("Nausea", "No Nausea")
colnames(bus_riders) <- c("Front", "Middle", "Rear")
bus_riders

##           Front Middle Rear
## Nausea       98    110  161
## No Nausea   264    321  280

chisq.test(bus_riders)

## 
##  Pearson's Chi-squared test
## 
## data:  bus_riders
## X-squared = 14.509, df = 2, p-value = 0.000707

Question 29

State the null and alternative hypothesis.

\(H_0:\) Motion sickness and seat position within a bus are independent variables;

\(H_a:\) Motion sickness and seat position within a bus are not independent variables.

Question 30

State the test statistic. Give the answers to the nearest thousandth decimal.

X-squared = 14.509

Question 31

State the degrees of freedom.

df = 2

Question 32

Round the p-value to four decimal places (enter 0 if P < 0.00005).

p-value = 0.0007

Question 33

State your conclusion.

Since p-value < 0.05, we reject the null hypothesis that motion sickness and seat position within a bus are independent variables. The seat position within a bus is associated with motion sickness.

Part B

Question 34

Construct a 90% confidence interval (without the Yates correction) for the difference in population proportions of all bus riders in the front who report nausea and all bus riders in the rear who report nausea. (Use Diff = Front - Rear)

bus_riders_tbl <- t(bus_riders[, -2])
bus_riders_tbl

##       Nausea No Nausea
## Front     98       264
## Rear     161       280

stats::prop.test(bus_riders_tbl, correct = FALSE, conf.level = 0.9)

## 
##  2-sample test for equality of proportions without continuity
##  correction
## 
## data:  bus_riders_tbl
## X-squared = 8.1012, df = 1, p-value = 0.004424
## alternative hypothesis: two.sided
## 90 percent confidence interval:
##  -0.14819088 -0.04053138
## sample estimates:
##    prop 1    prop 2 
## 0.2707182 0.3650794

Question 35

Enter the lower bound of the 90% CI (round to 3 decimal places).

-0.148

Question 36

Enter the upper bound of the 90% CI (round to 3 decimal places).

-0.041

Question 37

Write an interpretation for the interval in the context of the problem.

We are 90% confident that the difference in population proportions of all bus riders in the front who report nausea and all bus riders in the rear who report nausea is between -0.148 and -0.041.

Part C

Question 38

Compute the odds ratio of having nausea for those in the rear compared to those in the front of the bus.

library(mosaic)
oddsRatio(bus_riders_tbl)

## [1] 1.54898

Question 39

Report the odds ratio to 3 decimal places.

1.549

Question 40

Interpret the odds ratio in the context of the problem.

For people sitting in the rear of the bus, the odds of having nausea are 1.549 times larger than the odds of having nausea for those in the front of the bus.

Problem 4

Scenario : A random sample of apartments was obtained from mid-sized towns in the Midwest. They are classified as having either “3 bedrooms” or “2 bedrooms” and the monthly rent was recorded. You will need this file: rent.rda

Part A

Question 41

Load the data set and create boxplots for the monthly rents for each type of apartment.

load("rent.rda")
boxplot(rent ~ type, data = monthlyrent, xlab = "Type", ylab = "Rent")

Question 42

Comment on the shapes of the boxplots and whether or not they contain outliers. Does there appear to be a difference in the distributions of monthly rent between 2 and 3 bedroom apartments?

The box for 3 bedroom apartments is longer than the box for 2 bedroom apartments and the distribution for 3 bedroom apartments seems to be skewed to the left, while the distribution for 2 bedroom apartments seems to be skewed to the right. There are no outliers. There appears to be a difference in the distributions of monthly rent between 2 and 3 bedroom apartments.

Part B

Question 43

Conduct the Shapiro-Wilk test for normality for each sample.

shapiro.test(monthlyrent$rent[monthlyrent$type == "3 Bedrooms"])

## 
##  Shapiro-Wilk normality test
## 
## data:  monthlyrent$rent[monthlyrent$type == "3 Bedrooms"]
## W = 0.90679, p-value = 0.2596

shapiro.test(monthlyrent$rent[monthlyrent$type == "2 Bedrooms"])

## 
##  Shapiro-Wilk normality test
## 
## data:  monthlyrent$rent[monthlyrent$type == "2 Bedrooms"]
## W = 0.96512, p-value = 0.8423

Question 44

Using a 5% level of significance for each test individually, choose the option that describes conclusions for each distribution.

1. Normality is rejected for the rents of both the 3-bedroom and 2-bedroom apartments.
1. Normality is rejected for the rents of the 2-bedroom apartments but not the 3-bedroom apartments.
1. Normality is rejected for the rents of the 3-bedroom apartments but not the 2-bedroom apartments.
1. Normality is not rejected for the rents of either the 3-bedroom or 2-bedroom apartments.

Answer: (d) Normality is not rejected for the rents of either the 3-bedroom or 2-bedroom apartments.

Part C

Question 45

Test for equality of population variances using a 5% level of significance.

var.test(rent ~ type, data = monthlyrent)

## 
##  F test to compare two variances
## 
## data:  rent by type
## F = 5.2305, num df = 9, denom df = 9, p-value = 0.02159
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   1.299192 21.058128
## sample estimates:
## ratio of variances 
##           5.230541

Question 46

State the null and alternative hypotheses.

\(H_0:\) The variances of the rents of both the 3-bedroom and 2-bedroom apartments are equal;

\(H_a:\) The variances are not equal.

Question 47

Round the p-value to four decimal places (enter 0 if P < 0.00005).

p-value = 0.0216

Question 48

State the conclusion for the test.

Since p-value < 0.05, we reject the null hypothesis. The variances of the rents of the 3-bedroom and 2-bedroom apartments are not equal.

Part D

Question 49

Conduct the appropriate hypothesis test (for two samples only - not ANOVA) to compare the population mean rents for these two types of apartments in mid-sized Midwestern towns. Use a 10% level of significance.

t.test(rent ~ type, data = monthlyrent, paired = FALSE, var.equal = FALSE, conf.level = 0.9)

## 
##  Welch Two Sample t-test
## 
## data:  rent by type
## t = 3.0105, df = 12.32, p-value = 0.01057
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
##   37.65164 146.34836
## sample estimates:
## mean in group 3 Bedrooms mean in group 2 Bedrooms 
##                      608                      516

Question 50

State the hypotheses.

\(H_0:\) mean rents for 3-bedroom and 2-bedroom apartments are equal;

\(H_a:\) mean rents for 3-bedroom and 2-bedroom apartments are not equal.

Question 51

State the test statistic.

t = 3.0105

Question 52

State the df.

df = 12.32

Question 53

Round the p-value to four decimal places (enter 0 if P < 0.00005).

p-value = 0.0106

Question 54

State the conclusion.

Since p-value < 0.1, we reject the null hypothesis. Mean rents for 3-bedroom and 2-bedroom apartments are statistically significantly different.

Question 55

Also obtain the appropriate 90% confidence interval for the difference in population mean rents for these two types of apartments in mid-sized Midwestern towns. (you can add your R code in Question 49). Write an interpretation for the interval in the context of the problem.

90% confidence interval for the difference in population mean rents is (37.652, 146.348). It means that we are 90% confident that the true difference in population mean rents for 3-bedroom and 2-bedroom apartments is between 37.652 and 146.348.

Problem 5

Scenario: A psychologist is interested in the relationship between academic performance and “self-concept” as well as the student’s IQ and gender for 39 seventh grade students in a rural school district. Academic performance is measured as a grade point average (let y = GPA). Self-concept (x1) is measured by the student’s score on the Piers-Harris Children’s Self-Concept Scale, the IQ (x2) and the type of learner (x3, 0=Visual, 1=Auditory) of each student is also recorded. You will need this file: gpa7th.rda

Part A

Consider the model \[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_3 + \beta_5 x_2 x_3 + \epsilon \] #### Question 56

Fit this model and report on which coefficients are significantly different from zero (create the interaction terms separately, do not use x1:x3 or x2:x3 in the code). Use the hierarchical approach to model-building.

load("gpa7th.rda")

ind_Visual <- which(gpa7th$x3 == "Visual")
ind_Auditory <- which(gpa7th$x3 == "Auditory")
x1x3 <- NULL
x2x3 <- NULL

x1x3[ind_Auditory] <- 1 * gpa7th$x1[ind_Auditory]
x1x3[ind_Visual] <- 0 * gpa7th$x1[ind_Visual]

x2x3[ind_Auditory] <- 1 * gpa7th$x2[ind_Auditory]
x2x3[ind_Visual] <- 0 * gpa7th$x2[ind_Visual]

gpa7th_new <- cbind(gpa7th, x1x3, x2x3)

mod1 <- lm(y ~ x1 + x2 + x3 + x1x3 + x2x3, data = gpa7th_new)
summary(mod1)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x1x3 + x2x3, data = gpa7th_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.20709 -0.30514  0.03727  0.25057  1.15223 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.963561   1.221225  -2.427   0.0209 *  
## x1           0.040486   0.008774   4.614 5.72e-05 ***
## x2           0.031657   0.011863   2.669   0.0117 *  
## x3Auditory   2.074237   1.652304   1.255   0.2182    
## x1x3        -0.034406   0.015210  -2.262   0.0304 *  
## x2x3         0.002440   0.015614   0.156   0.8768    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4805 on 33 degrees of freedom
## Multiple R-squared:  0.6653, Adjusted R-squared:  0.6146 
## F-statistic: 13.12 on 5 and 33 DF,  p-value: 4.572e-07

Question 57

Should any terms be dropped from this model at a 5% level of significance?

Yes.

Question 58

Select the term(s) that should be dropped.

x3 and x2x3.

Part B

Fit the model (create the interaction term separately, do not use x1:x3 in the code)

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_3 + \epsilon \]

Question 59

Obtain the residuals for this model and evaluate the residual plots using the “plot” function. Also create a histogram of the residuals.

library(lmtest)
mod2 <- lm(y ~ x1 + x2 + x3 + x1x3, data = gpa7th_new)
summary(mod2)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x1x3, data = gpa7th_new)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18700 -0.30971  0.03875  0.24889  1.15463 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.097275   0.858682  -3.607 0.000983 ***
## x1           0.040035   0.008166   4.903  2.3e-05 ***
## x2           0.033065   0.007602   4.350 0.000118 ***
## x3Auditory   2.296831   0.825052   2.784 0.008710 ** 
## x1x3        -0.033512   0.013890  -2.413 0.021377 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4736 on 34 degrees of freedom
## Multiple R-squared:  0.665,  Adjusted R-squared:  0.6256 
## F-statistic: 16.88 on 4 and 34 DF,  p-value: 1.035e-07

res <- residuals(mod2)
par(mfrow = c(2, 2))
plot(mod2)

par(mfrow = c(1, 1))
hist(res, main = "Histogram of residuals", xlab = "Residuals")

bptest(mod2)

## 
##  studentized Breusch-Pagan test
## 
## data:  mod2
## BP = 9.0755, df = 4, p-value = 0.05924

Question 60

Does a visual inspection of the residual plots and histogram indicate that the model assumptions appear to be satisfied? Explain your answer.

Yes, a visual inspection of the residual plots and histogram indicate that the model assumptions appear to be satisfied. The regression line appears to be a good fit, because there does not seem to be any distinct patterns of residuals (in the Residuals vs Fitted graph). Also, residuals appear to be normally distributed as they follow a straight line quite well in the Normal Q-Q plot and the histogram of residuals is bell-shaped. Other plots also indicate that the model assumptions appear to be satisfied.

Question 61

Also perform a Breusch-Pagan test for homogeneity of variance among the residuals. Use a 5% level of significance (you can include the code in your answer to Question 59). Comment on the results of the Breusch-Pagan test.

Since p-value = 0.059 > 0.05, the null hypothesis of homoscedasticity cannot be rejected at a 5% level of significance. There is homogeneity of variance among the residuals.

Part C

Fit the model (same as in Part B, create the interaction term separately - do not use x1:x3 in code)

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_3 + \epsilon \]

Question 62

Construct a 95% confidence interval for the mean gpa of all 7th-graders with Piers-Harris Children’s Self-Concept Score of 50, IQ of 105, and who are auditory learners.

newdata <- data.frame(x1 = 50, x2 = 105, x3 = "Auditory", x1x3 = 50)
predict(mod2, newdata, interval = "confidence")

##        fit      lwr      upr
## 1 2.997526 2.696134 3.298919

Question 63

Enter the lower bound of the 95% CI (round to 3 decimal places).

2.696

Question 64

Enter the upper bound of the 95% CI (round to 3 decimal places).

3.299

Question 65

Write an interpretation for the interval in the context of the problem.

We are 95% confident that for all 7th-graders with Piers-Harris Children’s Self-Concept Score of 50, IQ of 105, and who are auditory learners, the mean GPA is between 2.696 and 3.299.

Part D

Fit the model

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_3 + \epsilon \]

Question 66

Interpret the value of \(\hat{\beta_2}\) in the context of the problem.

The value of \(\hat{\beta_2} = 0.033\) shows that for each unit increase in IQ (x2), the GPA score is expected to increase by 0.033.

Part E

Fit the first-order logistic regression model to predict that a randomly selected student is an auditory learner from gpa, the Piers-Harris Children’s Self-Concept Score, and the IQ.

Question 67

Use it to predict the probability of being an auditory learner for a 7th-grader with a 3.5 gpa, Piers-Harris Children’s Self-Concept Score of 50, and IQ of 105.

mod_glm <- glm(x3 ~ y + x1 + x2, family = binomial, data = gpa7th)
newdata_glm <- data.frame(y = 3.5, x1 = 50, x2 = 105)
predict(mod_glm, newdata_glm, type = "response")

##         1 
## 0.7665911

Question 68

Report the probability to 4 decimal places.

Answer: 0.7666

Problem 6

Scenario: A study was conducted to evaluate the quality of beef after storage times (STORAGE) of 10, 40, 80, and 120 days. The beef quality variables assessed were beefy aroma (BEEFY), bloody aroma (BLOODY), and a grassy aroma (GRASSY), which were all measured on a rating scale ranging from 0 to 15. Thirty samples were evaluated by several beef quality specialists and the average rating was obtained for this data file. You will need this file: beef.rda

Part A

Question 69

Use the Henze-Zirkler Multivariate Normality Test to test for multivariate normality among the three response variables: BEEFY, BLOODY, and GRASSY. Include a chi-square quantile plot in your analysis and use a 1% level of significance for each individual hypothesis test.

library(MVN)
load("beef.rda")
mvn(beef[, -1], mvnTest = "hz", multivariatePlot = "qq")

## $multivariateNormality
##            Test        HZ   p value MVN
## 1 Henze-Zirkler 0.7762501 0.1296601 YES
## 
## $univariateNormality
##           Test  Variable Statistic   p value Normality
## 1 Shapiro-Wilk   BEEFY      0.8791    0.0027    NO    
## 2 Shapiro-Wilk  BLOODY      0.8937    0.0059    NO    
## 3 Shapiro-Wilk  GRASSY      0.9054    0.0114    NO    
## 
## $Descriptives
##         n     Mean  Std.Dev   Median     Min      Max    25th      75th
## BEEFY  30 7.583343 3.703254 7.875000 2.25000 12.66670 3.56250 11.291700
## BLOODY 30 4.938889 1.992796 5.500000 1.66667  7.41667 3.06250  6.729167
## GRASSY 30 1.955555 1.457037 1.916665 0.00000  4.41667 0.58333  3.250000
##              Skew  Kurtosis
## BEEFY  -0.1221207 -1.680482
## BLOODY -0.3083830 -1.469143
## GRASSY  0.1412231 -1.577458

Question 70

According to this test, is there sufficient evidence to conclude that BEEFY, BLOODY, and GRASSY are not multivariate normal? Explain.

According to this test, there is no sufficient evidence to conclude that BEEFY, BLOODY, and GRASSY are not multivariate normal. The p-value = 0.13 > 0.01 and the chi-square quantile plot indicates that data are multivariate normal.

Part B

Conduct Box’s M Test to test for equality of covariances. Use a 1% level of significance.

Question 71

Is there sufficient evidence to conclude that the covariance matrices are not equal at the 1% level of significance?

There is no sufficient evidence to conclude that the covariance matrices are not equal at the 1% level of significance, because p-value = 0.022 > 0.01.

Question 72

Insert your R code here.

library(biotools)

## ---
## biotools version 4.0

boxM(beef[, -1], beef[, 1])

## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  beef[, -1]
## Chi-Sq (approx.) = 31.973, df = 18, p-value = 0.02215

Question 73

Based on the criteria of multivariate normality and equal covariance matrices, is it appropriate to proceed with MANOVA?

Yes.

Part C

Question 74

Regardless of the outcomes of the previous hypothesis tests, conduct a MANOVA to determine if there are differences between the different storage times for the population mean vectors when beefy, bloody, and grassy aromas are considered together. Use the Wilk’s Lambda statistic and let \(\alpha = .05\)

res.man <- manova(cbind(BEEFY, BLOODY, GRASSY) ~ STORAGE, data = beef)
summary(res.man, test = "Wilks")

##           Df  Wilks approx F num Df den Df    Pr(>F)    
## STORAGE    3 0.1269   8.6897      9  58.56 3.816e-08 ***
## Residuals 26                                            
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question 75

State the null and alternative hypothesis.

\(H_0:\) mean vectors are equal across the groups;

\(H_a:\) at least two of the groups’ mean vectors differ.

Question 76

State the conclusion for the test.

Since p-value < 0.05, we reject the null hypothesis. There are differences between the different storage times for the population mean vectors.

Question 77

Round the p-value to four decimal places (enter 0 if P < 0.00005).

p-value = 0

Part D

Question 78

Follow up with univariate ANOVAs and Tukey multiple comparisons on the response variables to see which population means differ. Use a 5% level of significance for each univariate ANOVA and each Tukey procedure. (We are temporarily ignoring the multiple comparisons problem.)

summary.aov(res.man)

##  Response BEEFY :
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## STORAGE      3 322.55 107.516  37.192 1.491e-09 ***
## Residuals   26  75.16   2.891                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response BLOODY :
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## STORAGE      3 91.744 30.5814  33.948 3.837e-09 ***
## Residuals   26 23.422  0.9008                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response GRASSY :
##             Df Sum Sq Mean Sq F value    Pr(>F)    
## STORAGE      3 51.727 17.2422  45.563 1.713e-10 ***
## Residuals   26  9.839  0.3784                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(aov(BEEFY ~ STORAGE, data = beef))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = BEEFY ~ STORAGE, data = beef)
## 
## $STORAGE
##             diff        lwr        upr     p adj
## 40-10  -1.171283  -3.629593  1.2870267 0.5668196
## 80-10  -6.212302  -8.807288 -3.6173169 0.0000033
## 120-10 -7.881942 -10.400960 -5.3629235 0.0000000
## 80-40  -5.041019  -7.391613 -2.6904251 0.0000189
## 120-40 -6.710658  -8.977108 -4.4442085 0.0000001
## 120-80 -1.669639  -4.083652  0.7443732 0.2537189

TukeyHSD(aov(BLOODY ~ STORAGE, data = beef))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = BLOODY ~ STORAGE, data = beef)
## 
## $STORAGE
##              diff       lwr        upr     p adj
## 40-10  -0.5138894 -1.886182  0.8584034 0.7351889
## 80-10  -3.2559536 -4.704542 -1.8073650 0.0000092
## 120-10 -4.1458363 -5.552018 -2.7396546 0.0000001
## 80-40  -2.7420641 -4.054227 -1.4299012 0.0000279
## 120-40 -3.6319468 -4.897138 -2.3667554 0.0000001
## 120-80 -0.8898827 -2.237447  0.4576821 0.2907963

TukeyHSD(aov(GRASSY ~ STORAGE, data = beef))

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = GRASSY ~ STORAGE, data = beef)
## 
## $STORAGE
##             diff        lwr      upr     p adj
## 40-10  0.6527761 -0.2366643 1.542217 0.2090224
## 80-10  2.3472207  1.4083298 3.286112 0.0000016
## 120-10 3.3472213  2.4358160 4.258626 0.0000000
## 80-40  1.6944446  0.8439769 2.544912 0.0000556
## 120-40 2.6944451  1.8744216 3.514469 0.0000000
## 120-80 1.0000005  0.1265874 1.873414 0.0203952

Question 79

Write your conclusion.

From the univariate ANOVA tests, we can see that all the response variables (BEEFY, BLOODY, GRASSY) are highly significantly different among STORAGE times.

According to Tukey multiple comparisons tests, population mean scores for beefy and bloody aroma are different for storage times of 10 and 80, 10 and 120, 40 and 80, 40 and 120 days.

Population mean scores for grassy aroma are different for storage times of 10 and 80, 10 and 120, 40 and 80, 40 and 120 days and also 80 and 120 days.

Question 80

Match each nonparametric procedure with the parametric procedure with which it most closely corresponds.

Answer:

Wilcoxon Signed Rank - Paired data t-test
Wilcoxon Rank Sum Test - Pooled t-test
Kruskal-Wallis Test - One-way ANOVA
Fisher’s Exact Test - Chi–square test for independence

Question 81

Which of the following could be used in multiple regression model selection?

AIC
PRESS
VIF
\(R^2_{Adj}\)
ALL of the above

Answer: ALL of the above

Question 82

A Type I error can best be described as:

1. not rejecting a false null hypothesis.
1. a sample that has not been chosen randomly.
1. using a two-sided alternative when a one-sided alternative would be sufficient.
1. concluding that the null hypothesis is false when in fact it is true.
1. mistakenly using a parametric procedure instead of a nonparametric procedure when the data is not normally distributed.

Answer: d) concluding that the null hypothesis is false when in fact it is true.

Question 83

Which of the following statements DOES NOT represent a model assumption of linear regression?

1. the standard deviation of the independent variable is normally distributed
1. the residuals are normally distributed
1. the variance of the error terms is constant at each value of x
1. the random error terms are independent
1. the mean of the random error terms is zero

Answer: a) the standard deviation of the independent variable is normally distributed

Question 84

Which of the following statements DOES NOT represent a model assumption of ANOVA?

1. the samples are independent
1. the population means are not equal
1. the samples are randomly selected
1. the population variances are all equal
1. the populations are normally distributed

Answer: b) the population means are not equal

Question 85

In linear regression, if the correlation coefficient between the response variable y and the explanatory variable x1 is zero, then

1. The scatterplot of x1 vs. y will be a perfectly straight line.
1. There is no linear relationship between x1 and y.
1. \(R^2\) will be 1.
1. The observed values of y are equal to the predicted values.
1. We reject \(H_0: \beta_1 = 0\)

Answer: b) There is no linear relationship between x1 and y.

Question 86

For the following problems, match the scenario to the appropriate statistical procedure. Each procedure may be used more than once or not at all.

Scenario: A business manager has data for 36 variables collected for the employees in various positions within a very large organization. The manager would like to reduce the number of variables to a much smaller number of latent variables that will represent the shared variability present in the original 36 variables.

Procedure: Factor Analysis

Scenario: A random sample of American adults is collected to see if how systolic blood pressure is related to weight, height, age, and gender.

Procedure: Linear regression

Scenario: A manufacturing plant measures the weekly man-hours lost due to accidents for a random sample of its employees both before and after the installation of an elaborate safety program and would like to conduct a test to see if the loss of man-hours is lower after the program. The difference in loss of man-hours is not normally distributed.

Procedure: Wilcoxon signed rank test

Scenario: A clinical research study is designed to investigate how systolic blood pressure, cholesterol level, and whether or not a person is a smoker can predict the presence or absence of hypertension.

Procedure: Logistic regression

Scenario: Do elementary, middle, and high school students differ in the amount of time spent watching television per week? Sample data looks to be normally distributed with approximately equal variance.

Procedure: ANOVA

Scenario: Is the probability distribution of college students at a college football game such that there are equal proportions of Freshmen, Sophomores, Juniors, and Seniors?

Procedure: Chi-Square test for Goodness-of-fit

Scenario: Economists collect independent random samples from five different professional disciplines in order to compare the population variances of salaries in each.

Procedure: Levene test