We Helped With This R Studio Programming Assignment: Have A Similar One?
Short Assignment Requirements
Assignment Description
ESC_PS 8850:
Quantitative Foundations in Educational Research
Townsend Hall 101E
Wolfgang Wiedermann
University of Missouri
A brief introduction into R
Ten-year projection: 22,452 packages (cf. Fox & Leanage, 2016)
> a <- 2
> a
[1] 2
> A
Error: object 'A' not found
> rm(a)
> a
Error: object 'a' not found
> a <- b <- 2
> a
[1] 2
> b
[1] 2
> 4 -> a
> a
[1] 4
> # insert comments
> a <- 2; b <- 3
> a; b
[1] 2
[1] 3
> a <- 2
> b <- 3
> a + b
[1] 5
> a - b
[1] -1
> a / b
[1] 0.6666667
> a * b
[1] 6
> exp(a)
[1] 7.389056
> log(7.389056)
[1] 2
> x <- 1:5
> x
[1] 1 2 3 4 5
> x + 3
[1] 4 5 6 7 8
> exp(x)
[1] 2.718282 7.389056 20.085537
[4] 54.598150 148.413159
## --- Numeric variables --- ##
> x <- c(21.2, 23.1, 19.5, 20.9)
> x
[1] 21.2 23.1 19.5 20.9
> class(x)
[1] "numeric"
## ---- Character variables ---- ##
> y <- c("m", "w", "m", "m") > y
[1] "m" "w" "m" "m" > class(y)
[1] "character"
## ------------ Factors --------- ##
> y <- c(0,1,0,0)
> class(y)
[1] "numeric"
> yy <- factor(y)
> class(yy)
[1] "factor"
> yy
[1] 0 1 0 0
Levels: 0 1
Variable Types in R
> levels(yy) <- c("m", "w")
> yy
[1] m w m m
Levels: m w
> is.numeric(yy)
[1] FALSE
## ------------------------------ ##
> y <- c(0,1,0,0)
> class(y)
[1] "numeric"
> levels(y) <- c("m", "w")
> y
[1] 0 1 0 0
attr(,"levels")
[1] "m" "w"
> is.numeric(y)
[1] TRUE
> levels(y)
[1] "m" "w"
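A minimal sketch (not from the original slides) of creating the labeled factor in a single call, assuming the same 0/1 coding for y as above:
y <- c(0, 1, 0, 0)
yy <- factor(y, levels = c(0, 1), labels = c("m", "w")) # numeric codes become labeled factor levels
yy
# [1] m w m m
# Levels: m w
is.numeric(yy) # FALSE: the factor is no longer a numeric variable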
Read in an RData file:
### -------- Reading data using load():
load(file = "C:/stgallen.RData")
ls()
[1] "mydata"
### -------- Return first lines of dataset:
head(mydata)
id age sex A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
1 1 24 0 3 3 1 0 2 0 0 0 1 0
2 2 32 1 3 2 3 3 0 0 0 1 0 0
3 3 27 1 2 3 2 2 0 0 0 0 0 0
4 4 30 1 4 5 2 0 1 2 0 0 0 0
5 5 32 1 3 2 3 0 0 0 0 1 1 0
### -------- View entire dataset:
View(mydata)
Read in an SPSS file:
### -------- Install and load package:
install.packages("foreign")
library(foreign)
### -------- Reading data using read.spss():
mydata <- read.spss(file = "C:/stgallen.sav", to.data.frame = TRUE)
Warning message:
In read.spss(file = "C:/stgallen.sav", to.data.frame = TRUE) :
C:/stgallen.sav: Unrecognized record type 7, subtype 18 encountered in system file
### -------- Return first lines of dataset:
head(mydata)
id age sex A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
1 1 24 0 3 3 1 0 2 0 0 0 1 0
2 2 32 1 3 2 3 3 0 0 0 1 0 0
3 3 27 1 2 3 2 2 0 0 0 0 0 0
4 4 30 1 4 5 2 0 1 2 0 0 0 0
5 5 32 1 3 2 3 0 0 0 0 1 1 0
Variables within a dataset:
head(mydata)
id age sex A1 A2 A3 A4 A5 A6 A7 A8 A9 A10
1 1 24 0 3 3 1 0 2 0 0 0 1 0
2 2 32 1 3 2 3 3 0 0 0 1 0 0
3 3 27 1 2 3 2 2 0 0 0 0 0 0
4 4 30 1 4 5 2 0 1 2 0 0 0 0
5 5 32 1 3 2 3 0 0 0 0 1 1 0
## ------------ return first six observations of “age”:
head(mydata$age)
[1] 24 32 27 30 32 23
## ------------ Compute AUDIT sum score:
mydata$audit <- rowSums(mydata[, 4:13])
head(mydata)
id age sex A1 A2 A3 A4 A5 A6 A7 A8 A9 A10 audit
1 1 24 0 3 3 1 0 2 0 0 0 1 0 10
2 2 32 1 3 2 3 3 0 0 0 1 0 0 12
3 3 27 1 2 3 2 2 0 0 0 0 0 0 9
4 4 30 1 4 5 2 0 1 2 0 0 0 0 14
5 5 32 1 3 2 3 0 0 0 0 1 1 0 10
6 6 23 0 2 5 3 0 0 0 0 0 0 0 10
> attach(mydata)
## ------ mean and median ------ ##
> mean(age, na.rm=TRUE)
[1] 25.52995
> median(age, na.rm=TRUE)
[1] 25
## -- variance, stddev, quartile -- ##
> var(age, na.rm=TRUE)
[1] 20.34285
> sd(age, na.rm=TRUE)
[1] 4.510305
> sqrt(var(age, na.rm = TRUE))
[1] 4.510305
> quantile(age, na.rm = TRUE)
0% 25% 50% 75% 100%
16 22 25 28 38
> detach(mydata)
Descriptive Statistics in R
## ---------- Correlation ---------- ##
> cor(age, audit, use = "complete.obs",
+ method = "pearson")
[1] -0.01415857
> cor(age, audit, use = "complete.obs",
+ method = "spearman")
[1] 0.01356717
> cor(age, audit, use = "complete.obs",
+ method = "kendall")
[1] 0.007756665
## ---------- Data Summary ---------- ##
> summary(age)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
16.00 22.00 25.00 25.53 28.00 38.00 5
> summary(mydata[, c(2,3,14)])
      age             sex             audit      
 Min.   :16.00   Min.   :0.0000   Min.   : 4.00  
 1st Qu.:22.00   1st Qu.:0.0000   1st Qu.: 8.25  
 Median :25.00   Median :1.0000   Median :10.00  
 Mean   :25.53   Mean   :0.5405   Mean   :10.12  
 3rd Qu.:28.00   3rd Qu.:1.0000   3rd Qu.:12.00  
 Max.   :38.00   Max.   :1.0000   Max.   :19.00  
 NA's   :5
Assignment Description
ESC_PS 8850:
Quantitative Foundations in Educational Research
Class 2:
Hypothesis Testing
The Pearson Correlation Coefficient
Wolfgang Wiedermann
University of Missouri
Empirical Example:
Patterson, Reid, & Dishion (1992) proposed a model for the development of antisocial behavior and academic failure. Children at risk for antisocial behavior often develop poor relationships with teachers. As a result, teachers may be reluctant to adequately monitor/supervise these children or to reinforce them for prosocial behavior or academic achievement.
In other words, there should be a positive association between prosocial behavior and academic competence.
To test this hypothesis, we use a sample of n = 12 second grade elementary school children. Prosocial behavior and academic competence are measured on a scale from 0 – 10.
To generalize from the sample to the population of second graders, we use principles of statistical inference and hypothesis testing.
Hypothesis Testing
• Hypothesis: A tentative explanation for a phenomenon.
• Statistical Hypothesis: A statement about the value of a population parameter (e.g., mean), about the type of probability distribution of a variable (e.g., normally distributed), or about an effect or relationship.
• Null Hypothesis (H0): The observations (values, relationships, effects) are results of pure chance.
• Alternative Hypothesis (HA) or Research Hypothesis: The observations show a real effect that is not due to pure chance.
• Hypothesis Testing:
To test whether our estimated value for the parameter is different enough from the hypothesized value to reject the null hypothesis in favor of the alternative hypothesis.
• Results:
• Either reject H0 or fail to reject H0.
• No statistical statement is made about the HA.
Potential Outcomes of Hypothesis Testing:
• Type I Errors of 5% and Type II Errors of 20% are commonly accepted in the social sciences.
Probabilities of Hypothesis Testing:
• 1 – β is also known as the statistical power of a test.
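For illustration (not from the original slides), the interplay of α, power, and sample size can be explored with R's power.t.test(); the numbers below are purely hypothetical:
# power of a two-sample t-test with n = 30 per group, a medium effect (delta = 0.5 SD), alpha = .05
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05)
# solving instead for the per-group n needed to reach 80% power
power.t.test(power = 0.80, delta = 0.5, sd = 1, sig.level = 0.05)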
General Steps of Hypothesis Testing:
1. Check statistical assumptions (e.g., distributional assumptions).
2. State the Null (H0) and the Alternative Hypothesis (HA).
3. Specify the nominal significance level α (e.g., 0.05, 0.01, 0.001).
4. Specify the statistical test (e.g., t-test, F-test, etc.).
5. Compute the test statistic from the observed data (e.g., the t-value).
6. Compute the p-value P(data | H0), i.e., the probability of the observations under the Null Hypothesis.
7. Retain or reject the Null Hypothesis and draw conclusions.
The Pearson Correlation
Pearson Correlation Coefficient
The Pearson correlation coefficient measures the linear association of two variables (e.g., $x$ and $y$) and can be described as the standardized cross-product of $x$ and $y$, i.e.,

$$\rho_{xy} = \frac{cov(x, y)}{\sigma_x \sigma_y}$$

with $cov(x, y)$ being the covariance

$$cov(x, y) = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$$

and $\sigma_x$ and $\sigma_y$ being the standard deviations of $x$ and $y$:

$$\sigma_x^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \sigma_y^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2$$
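As a small illustration (reusing the prosocial behavior / academic competence data from the empirical example below), the definition maps directly onto R code:
prosoc <- c(6, 7, 7, 8, 4, 4, 3, 6, 7, 6, 7, 4)
acad <- c(5, 9, 8, 5, 4, 3, 3, 7, 8, 8, 7, 3)
# standardized cross-product; cov() and sd() both use the n - 1 denominator, which cancels in the ratio
cov(prosoc, acad) / (sd(prosoc) * sd(acad))
cor(prosoc, acad) # same value: 0.7691181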
Properties of the Pearson Correlation
• The Pearson correlation coefficient is a standardized measure bounded by the interval $[-1, 1]$, i.e., $-1 \le \rho_{xy} \le 1$.
(Figure omitted: scatterplot panels illustrating $\rho_{xy} < 0$, $\rho_{xy} = 0$, and $\rho_{xy} > 0$.)
• $\rho_{xy} = cov(x, y)/(\sigma_x \sigma_y)$ describes the Pearson correlation coefficient in its symmetric form, i.e., $\rho_{xy} = \rho_{yx}$.
Properties of the Pearson Correlation
> install.packages("datasets")
> library(datasets)
> anscombe
   x1 x2 x3 x4    y1   y2    y3    y4
1  10 10 10  8  8.04 9.14  7.46  6.58
2   8  8  8  8  6.95 8.14  6.77  5.76
3  13 13 13  8  7.58 8.74 12.74  7.71
4   9  9  9  8  8.81 8.77  7.11  8.84
5  11 11 11  8  8.33 9.26  7.81  8.47
6  14 14 14  8  9.96 8.10  8.84  7.04
7   6  6  6  8  7.24 6.13  6.08  5.25
8   4  4  4 19  4.26 3.10  5.39 12.50
9  12 12 12  8 10.84 9.13  8.15  5.56
10  7  7  7  8  4.82 7.26  6.42  7.91
11  5  5  5  8  5.68 4.74  5.73  6.89
(Figure omitted: scatterplots of y1-y4 against x1-x4 for the four Anscombe datasets.)
Significance Test for Pearson Correlation
To test whether the observed correlation coefficient significantly differs from zero, we can use the t-test. The test statistic is defined as

$$t = \rho_{xy}\sqrt{\frac{n - 2}{1 - \rho_{xy}^2}}$$

with $n - 2$ degrees of freedom. The obtained t-value is then compared with the critical value of Student's t-distribution (or converted into a p-value).
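A minimal sketch (using the example data introduced below) of computing the test statistic and the two-sided p-value by hand:
prosoc <- c(6, 7, 7, 8, 4, 4, 3, 6, 7, 6, 7, 4)
acad <- c(5, 9, 8, 5, 4, 3, 3, 7, 8, 8, 7, 3)
r <- cor(prosoc, acad)
n <- length(prosoc)
t.val <- r * sqrt((n - 2) / (1 - r^2))
t.val # 3.8056, matching cor.test() below
2 * pt(-abs(t.val), df = n - 2) # two-sided p-value: 0.003454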
Empirical Example (cont'd):
Step 0: Read-in data
prosoc <- c(6,7,7,8,4,4,3,6,7,6,7,4)
acad <- c(5,9,8,5,4,3,3,7,8,8,7,3)
Step 1: Check assumptions (normality and linearity)
shapiro.test(prosoc)
Shapiro-Wilk normality test
data:  prosoc
W = 0.8876, p-value = 0.1098
shapiro.test(acad)
Shapiro-Wilk normality test
data:  acad
W = 0.8805, p-value = 0.08905
plot(jitter(prosoc), jitter(acad), pch=19, cex=1.2, xlim=c(0,10), ylim=c(0,10))
abline(lm(acad ~ prosoc))
Example
Step 2: Formulate H0 and HA
H0: Prosocial behavior and academic competence are statistically uncorrelated (ρ = 0).
HA: Prosocial behavior and academic competence are statistically correlated (ρ ≠ 0).
Step 3 and 4: Specify significance level and statistical test
Significance level: α = 0.05
Statistical Test: Pearson correlation
Step 5 and 6: Calculate test statistic and p-value
cor.test(prosoc, acad)
Pearson's product-moment correlation
data:  prosoc and acad
t = 3.8056, df = 10, p-value = 0.003454
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.3494733 0.9317479
sample estimates:
cor
0.7691181
Step 7: Conclusion
We reject the Null Hypothesis and conclude that there exists a positive association between prosocial behavior and academic competence.
Assignment Description
ESC_PS 8850:
Quantitative Foundations in Educational Research
Class 3:
The Linear Regression Model
Wolfgang Wiedermann
University of Missouri
Outline
• Basic Notation of the Linear Regression Model
• Parameter Estimation: Ordinary Least Squares
• Properties of the Linear Regression Model
• Statistical Inference
• Model Fit
• Model Selection
The Linear Regression Model
Bivariate Example
• The simplest form of regression deals with one independent variable $x$ (the predictor/explanatory variable) and one dependent variable $y$ (the outcome/response).
• It is assumed that the association between $x$ and $y$ can be described by a straight line.
Any straight line can be described by two parameters: the intercept ($\beta_0$) and the slope ($\beta_1$).
Regression equation: $y = \beta_0 + \beta_1 x + \epsilon$
$\beta_0 = \bar{y} - \beta_1\bar{x}$ reflects the predicted value of $y$ when $x = 0$.
$\beta_1 = cov(x, y)/\sigma_x^2$ reflects the change in $y$ when moving from $x$ to $x + 1$, i.e., $\beta_1 = E(y \mid x + 1) - E(y \mid x)$.
Predicted values: $\hat{y} = \beta_0 + \beta_1 x$
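A quick numerical check (reusing the prosocial/academic example data from Class 2; a sketch, not part of the original slides) that these formulas reproduce lm():
prosoc <- c(6, 7, 7, 8, 4, 4, 3, 6, 7, 6, 7, 4)
acad <- c(5, 9, 8, 5, 4, 3, 3, 7, 8, 8, 7, 3)
b1 <- cov(prosoc, acad) / var(prosoc) # slope: cov(x, y) / var(x)
b0 <- mean(acad) - b1 * mean(prosoc)  # intercept: mean(y) - b1 * mean(x)
c(b0, b1)
coef(lm(acad ~ prosoc)) # same estimates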
Basic Notation
The (general) linear model assumes that a response variable $y_i$ ($i = 1, \dots, n$) emerges from a set of (weighted) explanatory variables $x_{ij}$ ($j = 1, \dots, k$) and an error term:

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + \varepsilon_i$$

(systematic part: $\beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$; random part: $\varepsilon_i$)

or in matrix notation:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

$\mathbf{y}$ ... an $n \times 1$ response vector,
$\mathbf{X}$ ... an $n \times k$ matrix of explanatory variables (including a vector of 1s for the intercept $\beta_0$),
$\boldsymbol{\beta}$ ... a $k \times 1$ vector of regression weights,
$\boldsymbol{\varepsilon}$ ... an $n \times 1$ vector of disturbances.
Basic Notation
The equation $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ can be written as:

$$\mathbf{y} = \begin{pmatrix} 1 & x_{11} & \dots & x_{1j} & \dots & x_{1k} \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & x_{i1} & \dots & x_{ij} & \dots & x_{ik} \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{nj} & \dots & x_{nk} \end{pmatrix}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

The model is called general because:
• More than one explanatory variable can be used to predict $\mathbf{y}$.
• The explanatory variables $x_j$ can be continuous (e.g., age) or categorical (e.g., age groups).
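As an illustration of how a categorical explanatory variable enters the design matrix (a sketch with made-up values; the variable names are hypothetical):
x <- c(21, 25, 30, 42)               # continuous predictor (e.g., age)
grp <- factor(c("A", "A", "B", "C")) # categorical predictor (e.g., age group)
model.matrix(~ x + grp)              # intercept column, x, and k - 1 dummy columns for grp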
Systematic Part of the Model
• The model is called linear because ...
... it can be written in the form $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$
... the model is linear in the parameters, e.g.,
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i1}^2 + \varepsilon_i$$
or
$$y_i = \beta_0 + \beta_1 \log(x_{i1}) + \varepsilon_i$$
but not necessarily in the variables.
• Note that, e.g., $y_i = \beta_0 + \beta_1 x_{i1}^{\beta_2} + \varepsilon_i$ is not linear in the parameters!
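Models that are linear in the parameters, such as the quadratic and logarithmic examples above, can still be fitted with lm(); a sketch with simulated data (all names and values illustrative):
set.seed(1)
x <- runif(100, min = 1, max = 10)
y <- 2 + 0.5 * x + 0.1 * x^2 + rnorm(100)
lm(y ~ x + I(x^2)) # polynomial term wrapped in I()
lm(y ~ log(x))     # transformed predictor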
Systematic Part of the Model
Parameter Interpretation:
• Reconsider a multiple linear regression model where $y$ is predicted from two variables, $x_1$ and $x_2$:
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$
• The effect $\beta_1$ reflects the change in $y$ when moving from $x_1$ to $x_1 + 1$ while holding $x_2$ constant at any value $c$, i.e.,
$$\beta_1 = E(y \mid x_1 + 1, x_2 = c) - E(y \mid x_1, x_2 = c)$$
(this feature defines the effect as being unconditional)
• Analogous conclusions hold for $\beta_2$, i.e.,
$$\beta_2 = E(y \mid x_2 + 1, x_1 = c) - E(y \mid x_2, x_1 = c)$$
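A small numerical check of this interpretation (simulated data; a sketch, not part of the original slides):
set.seed(123)
x1 <- rnorm(200)
x2 <- rnorm(200)
y <- 1 + 2 * x1 + 3 * x2 + rnorm(200)
m <- lm(y ~ x1 + x2)
# predicted change in y when x1 increases by 1 while x2 is held at c = 0:
predict(m, newdata = data.frame(x1 = 1, x2 = 0)) -
  predict(m, newdata = data.frame(x1 = 0, x2 = 0))
coef(m)["x1"] # identical to the estimated slope of x1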
Random Part of the Model
• The random part (disturbances or errors) is commonly assumed to follow a normal distribution with an expected value of zero and a variance $\sigma^2$, i.e., $\varepsilon_i \sim N(0, \sigma^2)$.
• Graphical representation: (figure omitted; cf. Hatzinger, 2011)
• Some properties of the random part:
• The mean of the random part is zero: $\bar{\varepsilon} = \frac{1}{n}\sum_{i=1}^{n}\varepsilon_i = 0$
• The explanatory variables and the error term are assumed to be independent: $Cov(\varepsilon_i, x_j) = 0$
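Both properties can be checked numerically for any fitted model; a short sketch with simulated data (values illustrative):
set.seed(42)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100)
m <- lm(y ~ x)
mean(resid(m)) # essentially zero (up to numerical precision)
cov(resid(m), x) # essentially zero: OLS residuals are uncorrelated with the predictors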
Parameter Estimation
The most well-known model estimation techniques are:
• Ordinary Least Squares Method (OLS):
...aims at minimizing the sum of the squared distances of observed data points and predicted values under a certain model.
• Maximum Likelihood:
...aims at finding those parameter values $\boldsymbol{\beta}$ which are most plausible for the data generating process.
• OLS estimation considers the deviation of $\mathbf{y}$ from its expected value under a certain model, i.e.,
$$\boldsymbol{\varepsilon} = \mathbf{y} - \mathbf{X}\boldsymbol{\beta}$$
with $\mathbf{X}\boldsymbol{\beta}$ being the predicted values $\hat{\mathbf{y}}$ under a certain model.
• For example:
prosoc <- c(6,7,5,8,4,9,3,6,7,6,7,4)
acad <- c(5,9,8,5,4,3,3,7,8,8,7,3)
mod <- lm(acad ~ prosoc)
plot(prosoc, acad, pch=19, cex=1.2, xlim=c(0,10), ylim=c(0,10))
abline(mod) # add regression line
pred <- fitted(mod) # predicted values
segments(prosoc, acad, prosoc, pred, lty="dashed") # add residuals
• Specifically, OLS estimation uses the sum of the squared deviations, denoted by $Q$:
$$Q = \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon} = \sum_{i=1}^{n}\varepsilon_i^2$$
• We need to find estimators for the regression parameters $\boldsymbol{\beta}$ which minimize $Q$.
• Step 1: Insert model equation
$$Q = \boldsymbol{\varepsilon}'\boldsymbol{\varepsilon}$$
$$= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
$$= (\mathbf{y}' - \boldsymbol{\beta}'\mathbf{X}')(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})$$
$$= \mathbf{y}'\mathbf{y} - \boldsymbol{\beta}'\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$$
$$= \mathbf{y}'\mathbf{y} - 2\boldsymbol{\beta}'\mathbf{X}'\mathbf{y} + \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$$
Note: The terms $\boldsymbol{\beta}'\mathbf{X}'\mathbf{y}$ and $\mathbf{y}'\mathbf{X}\boldsymbol{\beta}$ in line 4 can be re-written as $2\boldsymbol{\beta}'\mathbf{X}'\mathbf{y}$ because $\boldsymbol{\beta}'\mathbf{X}'\mathbf{y} = \mathbf{y}'\mathbf{X}\boldsymbol{\beta}$ [... due to $\boldsymbol{\beta}'\mathbf{X}'\mathbf{y} = (\mathbf{y}'\mathbf{X}\boldsymbol{\beta})'$]. Thus, $-\boldsymbol{\beta}'\mathbf{X}'\mathbf{y} - \mathbf{y}'\mathbf{X}\boldsymbol{\beta} = -2\boldsymbol{\beta}'\mathbf{X}'\mathbf{y}$.
• Step 2: Find the derivative of $Q$ and set it to zero:
$$\frac{\partial Q}{\partial \boldsymbol{\beta}} = -2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$$
Rules for matrix derivatives: $\frac{\partial \boldsymbol{\beta}'\mathbf{X}'\mathbf{y}}{\partial \boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}$ and $\frac{\partial \boldsymbol{\beta}'\mathbf{X}'\mathbf{X}\boldsymbol{\beta}}{\partial \boldsymbol{\beta}} = 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta}$
• Step 3: Solve $-2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$ for $\boldsymbol{\beta}$:
$$-2\mathbf{X}'\mathbf{y} + 2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{0}$$
thus
$$2\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = 2\mathbf{X}'\mathbf{y}$$
$$\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{X}'\mathbf{y}$$
Multiplying the last expression by the inverse $(\mathbf{X}'\mathbf{X})^{-1}$ leads to
$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$
Because $(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}$ equals the identity matrix $\mathbf{I}$ (i.e., a diagonal matrix of 1s), we obtain
$$(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}\boldsymbol{\beta} = \mathbf{I}\boldsymbol{\beta} = \boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$$
A closer look at $\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$
• The cross-product matrix $\mathbf{X}'\mathbf{X}$ provides information on the magnitude of co-variation among the explanatory variables.
• The cross-product matrix $\mathbf{X}'\mathbf{y}$ provides information on the magnitude of co-variation of the explanatory variables and the response variable.
• Note that we need to find the inverse of $\mathbf{X}'\mathbf{X}$ (i.e., $(\mathbf{X}'\mathbf{X})^{-1}$). In general, matrix inverses are sensitive to the magnitude of association among matrix columns (i.e., the predictors). In regression analysis → see the collinearity issue.
• For the univariate case (i.e., considering only one predictor), we obtain $\mathbf{X}'\mathbf{X} = \sum_i x_i^2$ and $\mathbf{X}'\mathbf{y} = \sum_i x_i y_i$. Thus, the equation reduces to:
$$\beta = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sum(x - \bar{x})^2} = \frac{cov(x, y)}{\sigma_x^2}$$
Empirical Example: Artificial dataset with known parameter values:
set.seed(1842) # to reproduce results
n <- 500 # sample size
b0 <- 2.5 # true intercept
b1 <- 5.7 # true slope of x1
b2 <- 3.2 # true slope of x2
### ----- Generate artifical data:
x1 <- rnorm(n, mean = 0, sd = 2)
x2 <- rnorm(n, mean = 1, sd = 3.5)
error <- rnorm(n, mean = 0, sd = 1)
y <- b0 + x1*b1 + x2*b2 + error
### ----- Show dataset:
mydata <- data.frame(y, x1, x2)
head(mydata)
           y          x1         x2
1   2.140325 -1.06642262  2.0639484
2  -5.599110 -0.57981954 -1.6829349
3   6.253962 -1.42709712  3.2378684
4   5.722593 -0.08850081  0.7485646
5 -15.115317 -1.21956682 -2.7025510
6  34.428496  3.87573159  3.0039056
Estimate parameters using $\boldsymbol{\beta} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$:
X <- cbind(1, x1, x2) # create the design matrix X (incl. the intercept)
beta <- solve(t(X) %*% X) %*% t(X) %*% y # estimate parameters using OLS
beta
       [,1]
   2.502532
x1 5.667552
x2 3.194397
Note: solve() is used to compute the inverse of a matrix, t() generates a transposed matrix, and %*% is used for matrix multiplication.
Using the built-in function lm(...):
model <- lm(y ~ x1 + x2, data = mydata)
coef(model)
(Intercept) x1 x2
2.502532 5.667552 3.194397
For the simulated data, we get the following regression equation:
$$y_i = 2.503 + 5.668\,x_{i1} + 3.194\,x_{i2} + \varepsilon_i$$
The predicted values ($\hat{y}_i$) are obtained through:
$$\hat{y}_i = 2.503 + 5.668\,x_{i1} + 3.194\,x_{i2}$$
The estimated (raw) residuals are obtained through $\hat{\varepsilon}_i = y_i - \hat{y}_i$.
y.hat <- fitted(model) # predicted values
y.res <- resid(model) # raw residuals
head(cbind(y, y.hat, y.res))
            y      y.hat      y.res
1   2.140325   3.051597 -0.9112712
2  -5.599110  -6.159587  0.5604775
3   6.253962   4.757421  1.4965403
4   5.722593   4.392161  1.3304320
5 -15.115317 -13.042446 -2.0728711
6  34.428496  34.064108  0.3643884
Statistical Inference
Confidence Interval:
• Under the assumption that the error term follows a normal distribution with zero mean and standard deviation $\sigma_\varepsilon$ (i.e., $\varepsilon \sim N(0, \sigma_\varepsilon^2)$), one obtains that the sample estimates for the intercept ($\hat{\beta}_0$) and the slope ($\hat{\beta}_1$) also follow normal distributions:
$$\hat{\beta}_0 \sim N(\beta_0, \sigma_{\hat{\beta}_0}^2), \qquad \hat{\beta}_1 \sim N(\beta_1, \sigma_{\hat{\beta}_1}^2)$$
• Replacing $\sigma_\varepsilon$ with the sample estimate gives the standard errors $SE(\hat{\beta}_0)$ and $SE(\hat{\beta}_1)$, and a level $(1 - \alpha)$ confidence interval is given by
$$[\hat{\beta}_0 - t_{crit} \times SE(\hat{\beta}_0);\ \hat{\beta}_0 + t_{crit} \times SE(\hat{\beta}_0)]$$
$$[\hat{\beta}_1 - t_{crit} \times SE(\hat{\beta}_1);\ \hat{\beta}_1 + t_{crit} \times SE(\hat{\beta}_1)]$$
where $t_{crit}$ is the upper $\alpha/2$ critical value of Student's t-distribution with df = n – 2.
(General) Linear Model
Significance test:
• To test the null hypothesis $H_0: \beta_1 = 0$ we can use
$$t = \frac{\hat{\beta}_1}{SE(\hat{\beta}_1)} \qquad \text{with} \qquad SE(\hat{\beta}_1) = \frac{\sigma_\varepsilon}{\sqrt{\sum(x - \bar{x})^2}}$$
• The p-value for the test statistic can be found from Student's t-distribution with df = n – 2.
• When the regression assumptions are fulfilled, evaluating $H_0: \beta_1 = 0$ corresponds to testing whether there is a linear relation between $x$ and $y$.
Statistical inference using summary() and confint():
model <- lm(y ~ x1 + x2, data = mydata)
summary(model)
Call:
lm(formula = y ~ x1 + x2)
Residuals:
     Min      1Q  Median      3Q     Max 
-3.5341 -0.6187  0.0296  0.6949  3.3571 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.50253    0.04622   54.14   <2e-16 ***
x1           5.66755    0.02167  261.53   <2e-16 ***
x2           3.19440    0.01246  256.41   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
[...output omitted...]
confint(model, level = 0.95)
                2.5 %   97.5 %
(Intercept) 2.411718 2.593346
x1          5.624974 5.710130
x2          3.169919 3.218874
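To connect this output to the formulas above, a minimal sketch (assuming the fitted model from this section) recomputes the t value and the confidence interval for x1 by hand:
est <- coef(summary(model))["x1", "Estimate"]
se <- coef(summary(model))["x1", "Std. Error"]
est / se # t value reported by summary()
t.crit <- qt(0.975, df = df.residual(model)) # upper alpha/2 critical value
c(est - t.crit * se, est + t.crit * se) # matches confint(model)["x1", ]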
Model Fit
Coefficient of Determination ($R^2$):
The $R^2$ value is based on partitioning the total variation of the observations $y_i$ (without considering predictors) into (1) variation of the fitted regression values $\hat{y}_i$ around the mean $\bar{y}$ and (2) variation of the observed values $y_i$ around the fitted regression values $\hat{y}_i$:
$$(y_i - \bar{y}) = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$$
The same relation holds for the sums of squared deviations:
$$\sum(y_i - \bar{y})^2 = \sum(\hat{y}_i - \bar{y})^2 + \sum(y_i - \hat{y}_i)^2$$
total sum of squares (TSS) = sums of squares of the fitted regression around the mean (FSS) + sums of squares of the observed values around the fitted regression (RSS)
(General) Linear Model
• The coefficient of determination is defined as:
$$R^2 = \frac{FSS}{TSS} = 1 - \frac{RSS}{TSS}$$
with $0 \le R^2 \le 1$, where larger values indicate a better model fit.
• Alternatively, the coefficient of determination may also be interpreted as the squared correlation between the observed ($\mathbf{y}$) and predicted ($\hat{\mathbf{y}}$) values of the response variable:
$$R^2 = cor(\mathbf{y}, \hat{\mathbf{y}})^2$$
Coefficient of Determination in R:
summary(model.h)
[... Output omitted ...]
Residual standard error: 1.001 on 497 degrees of freedom
Multiple R-squared: 0.9965, Adjusted R-squared: 0.9965
F-statistic: 7.127e+04 on 2 and 497 DF, p-value: < 2.2e-16
summary(model.h)$r.squared
[1] 0.9965256
### ------ R² using sums of squares:
TSS <- sum((y-mean(y))^2)
FSS <- sum((fitted(model.h) - mean(y))^2)
RSS <- sum((y - fitted(model.h))^2)
FSS/TSS
[1] 0.9965256
1 - RSS/TSS
[1] 0.9965256
### ------- R² as the correlation between observed and fitted values:
cor(y, fitted(model.h))^2
[1] 0.9965256
Model Selection
• In practical applications we need to find an appropriate model.
• Consider the model $\mathcal{M}_h$, which is the hierarchically higher model (i.e., the more complex model) with $p_h$ parameters. For example:
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \varepsilon_i$$
• Let $\mathcal{M}_r$ be a restricted model (i.e., a hierarchically lower model) with $p_r$ parameters (while assuming $p_r < p_h$). For example:
$$y_i = \beta_0 + \beta_1 x_{i1} + \varepsilon_i$$
• We use the sum of the squared deviations of each observation ($y_i$) around its estimated expected value ($\hat{y}_i$). The residual sums of squares ($RSS$) are defined as
Model $\mathcal{M}_h$: $RSS_h = \sum(y_i - \hat{y}_i)^2 = \sum[y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2})]^2$
Model $\mathcal{M}_r$: $RSS_r = \sum(y_i - \hat{y}_i)^2 = \sum[y_i - (\beta_0 + \beta_1 x_{i1})]^2$
• In general, one assumes $RSS_h \le RSS_r$, because the more parameters the model contains, the better the model fit will be (and thus the smaller the deviations around the fitted regression line).
• When $RSS_h$ is close to $RSS_r$, the additional parameters of the model $\mathcal{M}_h$ do not help to better explain the variability of the observed data.
• Statistical Test:
$$F = \frac{RSS_r - RSS_h}{df_r - df_h} \div \frac{RSS_h}{df_h}$$
with $df_r - df_h$ and $df_h$ degrees of freedom ($df = \#obs - \#params$). The null hypothesis of the F-test states that the reduced model ($\mathcal{M}_r$) is adequate, i.e., that the additional parameters are not needed.
Performing the F-test in R:
### -------- Estimate both models:
model.h <- lm(y ~ x1 + x2)
model.r <- lm(y ~ x1)
### -------- Perform F-test:
anova(model.r, model.h)
Analysis of Variance Table
Model 1: y ~ x1
Model 2: y ~ x1 + x2
  Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
1    498 66408                                 
2    497   498  1     65910 65745 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion: The additional parameter is needed to explain the observed data.
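The F value in the table can be reproduced directly from the residual sums of squares, following the formula above (a sketch, assuming model.r and model.h from this section):
RSS.r <- sum(resid(model.r)^2)
RSS.h <- sum(resid(model.h)^2)
df.r <- df.residual(model.r) # 498
df.h <- df.residual(model.h) # 497
F.val <- ((RSS.r - RSS.h) / (df.r - df.h)) / (RSS.h / df.h)
F.val # approx. 65745, as in the ANOVA table
pf(F.val, df.r - df.h, df.h, lower.tail = FALSE) # p-value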
Model Selection Strategies:
• Simultaneous Regression: All IVs are entered simultaneously, i.e., there is no theoretical basis for logically ordering the variables.
• Hierarchical Regression: IVs are entered cumulatively according to a pre-specified theoretical hierarchy, with a focus on the change in $R^2$. For example, when regressing $y$ on $x$ one obtains $R^2_{y|x} = 0.20$; when regressing $y$ on $x$ and $z$ one obtains $R^2_{y|xz} = 0.27$. Thus, after controlling for $x$, $z$ accounts for 7% of the variance in $y$ (i.e., $R^2_{y|xz} - R^2_{y|x} = 0.07$); see the sketch after this list.
• Stepwise Regression: IVs are automatically selected based on their relative contribution (see R function step(...))
• Forward Selection: At each stage the IV with the largest change in $R^2$ is selected.
• Backward Selection: In the first step, all IVs are used and the one IV with the smallest contribution is dropped.
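A sketch of the hierarchical (R²-change) logic described above, using a simulated data frame (all names and values are illustrative):
set.seed(7)
dat <- data.frame(x = rnorm(100), z = rnorm(100))
dat$y <- 0.4 * dat$x + 0.3 * dat$z + rnorm(100)
r2.x <- summary(lm(y ~ x, data = dat))$r.squared       # R^2 for y ~ x
r2.xz <- summary(lm(y ~ x + z, data = dat))$r.squared  # R^2 for y ~ x + z
r2.xz - r2.x # variance in y accounted for by z after controlling for x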
Assignment Code
# -------------------------------------------------
# ------------- Analysis of Variances
# -------------------------------------------------
# read in data
load(file = "C:/myproject/anova_dat.RData")
# get means
aggregate(reading ~ group, data = anova_dat)
# group reading
#1 1 37.2
#2 2 45.2
#3 3 45.8
# ANOVA
m <- aov(reading ~ group, data = anova_dat)
#Terms:
# group Residuals
#Sum of Squares 184.9000 324.0333 >>--->> CAUTION: Wrong degrees of freedom
#Deg. of Freedom 1 13
#Residual standard error: 4.992559
#Estimated effects may be unbalanced
anova_dat$group.f <- factor(anova_dat$group)
m <- aov(reading ~ group.f, data = anova_dat)
summary(m)
# planned comparison
comp1 <- c(-1, 0.5, 0.5) # specifying intended comparison
comp2 <- c(0, -1, 1)
weight <- cbind(comp1, comp2) # weight matrix
contrasts(anova_dat$group.f) <- weight
m2 <- aov(reading ~ group.f, data = anova_dat)
summary(m2, split = list(group.f = list("Ave. Treat versus Control" = 1, "Par versus Non-Par" = 2)) )
# Df Sum Sq Mean Sq F value Pr(>F)
#group.f 2 230.5 115.3 4.968 0.02679 *
# group.f: Ave. Treat versus Control 1 229.6 229.6 9.898 0.00844 **
# group.f: Par versus Non-Par 1 0.9 0.9 0.039 0.84716
#Residuals 12 278.4 23.2
# Bonferroni alpha adjustment
#new alpha: .05 / 3 = 0.01666667
t.test(reading ~ group.f, data = subset(anova_dat, group.f != "3") ) # group 1 vs 2
#t.test(reading ~ group.f, data = subset(anova_dat, group.f == "1" | group.f == "2") )
t.test(reading ~ group.f, data = subset(anova_dat, group.f != "2") ) # group 1 vs 3
t.test(reading ~ group.f, data = subset(anova_dat, group.f != "1") ) # group 2 vs 3
pairwise.t.test(anova_dat$reading, anova_dat$group.f)
#data: anova_dat$reading and anova_dat$group.f
# 1 2
#2 0.046 -
#3 0.046 0.847
#P value adjustment method: holm
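# An alternative not in the original script: Bonferroni-adjusted pairwise comparisons
# (instead of the default Holm adjustment used above)
pairwise.t.test(anova_dat$reading, anova_dat$group.f, p.adjust.method = "bonferroni")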
aggregate(reading ~ group.f, data = anova_dat, mean)
# group.f reading
#1 1 37.2
#2 2 45.2
#3 3 45.8
aggregate(reading ~ group.f, data = anova_dat, sd)
# group.f reading
#1 1 4.658326
#2 2 4.969909
#3 3 4.816638
aggregate(reading ~ group.f, data = anova_dat, summary)
# group.f reading.Min. reading.1st Qu. reading.Median reading.Mean reading.3rd Qu. reading.Max.
#1 1 33.0 34.0 35.0 37.2 40.0 44.0
#2 2 38.0 43.0 46.0 45.2 48.0 51.0
#3 3 42.0 43.0 44.0 45.8 46.0 54.0
# --- GLM Approach
anova_dat$nonpar <- ifelse(anova_dat$group.f == "2", 1, 0)
anova_dat$par <- ifelse(anova_dat$group.f == "3", 1, 0)
m.lm <- lm(reading ~ nonpar + par, data = anova_dat)
summary(m.lm)
#Call:
#lm(formula = reading ~ nonpar + par, data = anova_dat)
#Residuals:
# Min 1Q Median 3Q Max
# -7.2 -3.0 -1.8 2.8 8.2
#Coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 37.200 2.154 17.270 7.68e-10 ***
#nonpar 8.000 3.046 2.626 0.0221 * >>--->> nonpar vs control
#par 8.600 3.046 2.823 0.0154 * >>--->> par vs control
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Residual standard error: 4.817 on 12 degrees of freedom
#Multiple R-squared: 0.453, Adjusted R-squared: 0.3618
#F-statistic: 4.968 on 2 and 12 DF, p-value: 0.02679
m0 <- lm(reading ~ 1, data = anova_dat)
#Call:
#lm(formula = reading ~ 1, data = anova_dat)
#Coefficients:
#(Intercept)
# 42.73 >>--->> grand mean
anova(m0, m.lm)
#Analysis of Variance Table
#Model 1: reading ~ 1
#Model 2: reading ~ nonpar + par
# Res.Df RSS Df Sum of Sq F Pr(>F)
#1 14 508.93
#2 12 278.40 2 230.53 4.9684 0.02679 * >>--->> summary(aov(...))
Assignment Code
# ------------------------------------------
# ------------- Introduction into R
# ------------------------------------------
# --- generate variable in R
# numeric variable "age"
age <- c(3, 5, 2.5)
age
#[1] 3.0 5.0 2.5
age + 2
#[1] 5.0 7.0 4.5
# center variable (i.e. subtract mean of "age")
age.c <- age - mean(age)
mean(age)
#[1] 3.5
mean(age.c)
#[1] 0
age
#[1] 3.0 5.0 2.5
age.c
#[1] -0.5 1.5 -1.0
# categorical variable "gender"
gender <- c("female", "male", "female")
gender.2 <- c(1, 0, 1)
class(age)
#[1] "numeric"
class(gender.2)
#[1] "numeric"
gender.2 <- factor(gender.2)
levels(gender.2) <- c("male", "female")
# ---- datasets in R
dat <- data.frame(age, age.c, gender.2)
dat
# age age.c gender.2
#1 3.0 -0.5 female
#2 5.0 1.5 male
#3 2.5 -1.0 female
dat[2 , ]
# age age.c gender.2
#2 5 1.5 male
dat[2 , 3 ]
#[1] male
#Levels: male female
dat[ , 2 ]
#[1] -0.5 1.5 -1.0
# --- read in dataset
load(file = "D:/MU/teaching/quant-found/fall2018/data/stgallen.RData" )
# access variables in dataset
mean(mydata[ , 2], na.rm = TRUE)
#[1] 25.52995
mean(mydata$age, na.rm = TRUE)
#[1] 25.52995
# add variable "audit"
mydata$audit <- rowSums( mydata[ , 4:13 ] )
# remove variable "id"
mydata$id <- NULL
Assignment Code
# -------------------------------------------------
# ------------- Hypothesis Testing and Correlation
# -------------------------------------------------
# --- read in dataset
prosoc <- c(6, 7, 7, 8, 4, 4, 3, 6, 7, 6, 7, 4)
acad <- c(5, 9, 8, 5, 4, 3, 3, 7, 8, 8, 7, 3)
# --- descriptive statistics
mean(prosoc)
#[1] 5.75
median(prosoc)
#[1] 6
sd(prosoc)
#[1] 1.602555
quantile(prosoc)
# 0% 25% 50% 75% 100%
# 3 4 6 7 8
summary(prosoc)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 3.00 4.00 6.00 5.75 7.00 8.00
# --- Pearson correlation
cor(prosoc, acad)
#[1] 0.7691181
cor(data.frame(prosoc, acad) )
# prosoc acad
#prosoc 1.0000000 0.7691181
#acad 0.7691181 1.0000000
# --- visual check of linearity
plot(prosoc, acad, pch = 21, bg = "grey", cex = 2,
xlab = "Prosocial Behavior",
ylab = "Academic Competence",
main = "My first scatterplot",
xlim = c(0, 10), ylim = c(0, 10))
abline( lm(acad ~ prosoc) )
# --- significance test for correlation
# normality of variables:
par(mfrow = c(1,2))
hist(prosoc, col = "grey", xlim = c(0, 10),
xlab = "Prosocial Behavior")
hist(acad, col = "grey", xlim = c(0, 10),
xlab = "Academic Competence")
# run t-test:
cor.test(prosoc, acad)
# Pearson's product-moment correlation
#data: prosoc and acad
#t = 3.8056, df = 10, p-value = 0.003454
#alternative hypothesis: true correlation is not equal to 0
#95 percent confidence interval:
# 0.3494733 0.9317479
#sample estimates:
# cor
#0.7691181
Assignment Code
# -------------------------------------------------
# ------------- Regression Diagnostics
# -------------------------------------------------
# --- read-in data
mydata <- read.csv(file = "E:\MU\teaching\quant-found\Rdata\swb-dat.csv")
m <- lm(swls ~ fs, data = mydata)
# --- Normality assumption
r.swls <- resid(m)
hist(r.swls, breaks = 20, col = "red", xlab = "Residuals", main = "swls ~ fs")
boxplot(r.swls, col = "blue", xlab = "Residuals")
qqnorm(r.swls) # QQ plot
qqline(r.swls) # add QQ line
# significance tests
install.packages("moments") # only use once
library(moments)
shapiro.test(r.swls)
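# Illustration (not in the original script): the 'moments' package loaded above also
# provides the D'Agostino test of skewness as an additional normality check
agostino.test(r.swls)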
# --- Homoskedasticity
r.swls <- resid(m)
plot(mydata$fs, r.swls, pch = 21, bg = "grey",
xlab = "Psych. Well-Being", ylab = "Residuals",
ylim = c(-3,3))
abline(h = 0, lty = "dashed", lwd = 1.5)
# significance test
install.packages("lmtest") # run only once
library(lmtest)
bptest(m)
# studentized Breusch-Pagan test
#data: m
#BP = 8.1597, df = 1, p-value = 0.004283 -> violated assumption
# log transformation
summary(mydata$swls)
# Min. 1st Qu. Median Mean 3rd Qu. Max.
# 1.800 4.400 5.200 5.007 5.800 7.000 # no values <= 0
m.log <- lm(log(swls) ~ fs, data = mydata)
bptest(m.log)
# studentized Breusch-Pagan test
#data: m.log
#BP = 24.696, df = 1, p-value = 6.713e-07 -> still violated assumption
# Box-Cox transformation
install.packages("MASS")
library(MASS)
bc.out <- boxcox(m, lambda = seq(-3, 6, by = .0001), plotit = TRUE)
bc.out$x[which.max(bc.out$y)] # gives lambda = 1.7857
mydata$swls.bc <- (mydata$swls^1.7857 - 1) / 1.7857
par(mfrow=c(1,2))
hist(mydata$swls, breaks = 10, col = "grey")
hist(mydata$swls.bc, breaks = 10, col = "grey")
m.bc <- lm(swls.bc ~ fs, data = mydata)
bptest(m.bc)
# studentized Breusch-Pagan test
#data: m.bc
#BP = 0.3652, df = 1, p-value = 0.5456
# post-hoc correction of standard errors
install.packages("lmtest")
install.packages("sandwich")
library(lmtest) # coeftest
library(sandwich) # vcovHC
coeftest(m, vcov = vcovHC(m, type = "HC3") )
#t test of coefficients:
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.531472 0.233585 2.2753 0.02327 *
#fs 0.816283 0.040966 19.9258 < 2e-16 ***
#---
#Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# --- Outliers and Influential Data Points
# hat-values
h <- hatvalues(m)
outliers <- which(h > 3*mean(h) ) # identify outliers
mydata_reduced <- mydata[ -outliers, ] # delete outliers from dataset
nrow(mydata_reduced)
m.nooutlier <- lm(swls ~ fs, data = mydata_reduced)
coef(m)
#(Intercept) fs
# 0.5314718 0.8162830
coef(m.nooutlier) # no major difference --> go with original model
#(Intercept) fs
# 0.1732704 0.8786697
# Cook's distances
cd <- cooks.distance(m)
outliers2 <- which(cd > 4/(559 - 1 - 1)) # detect outliers
mydata_reduced2 <- mydata[ -outliers2 , ] # delete outliers
m.outlier2 <- lm(swls ~ fs, data = mydata_reduced2)
coef(m)
#(Intercept) fs
# 0.5314718 0.8162830
coef(m.outlier2) # no major difference --> go with original model
#(Intercept) fs
# 0.4176248 0.8455873
Assignment Code
# -------------------------------------------------
# ------------- Linear Regression Model
# -------------------------------------------------
# --- read-in data
mydata <- read.csv(file = "D:\MU\teaching\quant-found\Rdata\swb-dat.csv")
# --- scatterplot
plot(swls ~ age, data = mydata, pch = 21, bg = "grey", cex = 1.5, xlim = c(0, 80))
#plot(mydata$age, mydata$swls, pch = 21, bg = "grey", cex = 1.5)
# --- simple linear regression model
m1 <- lm(swls ~ age, data = mydata)
#Call:
#lm(formula = swls ~ age, data = mydata)
#Coefficients:
#(Intercept) age
# 4.95865 0.00163
# --- multiple linear regression model
m2 <- lm(swls ~ age + fs, data = mydata)
#Call:
#lm(formula = swls ~ age + fs, data = mydata)
#Coefficients:
#(Intercept) age fs
# 0.577138 -0.001753 0.817466
# --- model fit
summary(m2)$r.squared
#[1] 0.444049
pred <- predict(m2)
cor(pred, mydata$swls)^2
#[1] 0.444049
# --- model selection
anova(m1, m2)
#Model 1: swls ~ age
#Model 2: swls ~ age + fs
# Res.Df RSS Df Sum of Sq F Pr(>F)
#1 557 640.51
#2 556 356.19 1 284.31 443.8 < 2.2e-16 ***
#anova(m2, m3) # (m3 is defined in the backward-selection block below)
# backward selection
m0 <- lm(swls ~ sex + age + fs, data = mydata)
m1 <- lm(swls ~ sex + fs, data = mydata)
m2 <- lm(swls ~ age + fs, data = mydata)
m3 <- lm(swls ~ fs, data = mydata)
m4 <- lm(swls ~ 1, data = mydata)
anova(m1, m0) # is age needed?
anova(m2, m0) # is sex needed?
anova(m3, m1) # is sex needed in new baseline model?
anova(m3, m2) # is age needed in new baseline model?
anova(m4, m3) # is fs needed?
# hierarchical
m0 <- lm(swls ~ age + sex, data = mydata) # baseline adjusting demographics
m1 <- lm(swls ~ age + sex + fs, data = mydata) # adding predictor of interest
anova(m0, m1)
# stepwise
m <- step( lm(swls ~ sex + age + fs, data = mydata), direction = "both" ) # backward and forward selection
m <- step( lm(swls ~ sex + age + fs, data = mydata), direction = "backward" ) # backward selection only
m <- step( lm(swls ~ sex + age + fs, data = mydata), direction = "forward" ) # forward selection only
Assignment Image
![R | R Studio Assignment Description Image [Solution]](/images_solved/1691011/11/im.webp)
Assignment Description
ESCP 8850: Quantitative Foundations in Educational Research:
Midterm Exam
Please send your answers to .... Due at 4pm Thursday, October 25th, 2018.
Part I: Basic Concepts (10 points)
Match the following 10 items with the best options. Each option can be used more than once and multiple options may be correct. Fill the letters in the parentheses.
( ) 1. HC3 correction
( ) 2. R-squared in linear regression
( ) 3. robust Breusch-Pagan test
( ) 4. Type II Error
( ) 5. Box Cox-transformation
( ) 6. Dummy Coding
( ) 7. D’Agostino test
( ) 8. P(data | H0)
( ) 9. Bonferroni correction
( ) 10. Slope Coefficient
Options:
(A) Adjusting Type I Error rate for multiple comparisons
(B) Testing for homoscedasticity
(C) Correcting standard errors for non-normality
(D) Correcting standard errors for heteroscedasticity
(E) The probability of rejecting the H0 given that the effect does not exist in the population
(F) Testing if a variable is asymmetrically distributed
(G) Using k – 1 binary variables to represent k groups
(H) Correcting for highly influential data points in a linear regression model
(I) the p-value, i.e., the probability that the alternative hypothesis is correct given the observed data
(J) The probability of retaining the H0 given that the effect exists in the population
(K) Stabilizing the error variance in linear regression models
(L) The squared correlation between observed and predicted response values
(M) the p-value, i.e., the probability of observed data under the null hypothesis
(N) Used to evaluate the model fit
(O) Expected change in the dependent variable for a unit increase of the independent variable
Part II: Applications in R (20 points)
The dataset yourname.csv contains the variables prosocial behavior (prosoc; standardized score), treatment status (treat; 1 = advanced teacher classroom management training, 2 = basic teacher classroom management training, and 3 = business as usual), and grade (grade; 1 = 1st grade, 2 = 2nd grade, 3 = 3rd grade) obtained from 543 students. When answering the questions, show the R code and the corresponding output.
1. Convert the variables grade and treat into factors and compute means and standard deviations of prosoc for all grade levels and treatment groups. (2 points)
2. Run a one-way ANOVA for prosoc and treat. Is there a significant difference of academic competence across treatment groups? Further, evaluate the ANOVA assumption of normality of the outcome variable. Is the assumption fulfilled? (4 points)
3. Compare the average treatment effect of “advanced teacher classroom management training” and “basic teacher classroom management training” with the observed effect of the control group (“business as usual”) using planned comparisons. What are your conclusions? (2 points)
4. Repeat the analyses of Question 2 using ordinary linear regression models using Dummy and Effect Coding. Interpret the regression coefficients for the treatment effects for both coding schemes. (4 points).
5. Perform all possible pairwise contrasts for the model in Question 2 using Tukey’s HSD method and Bonferroni’s correction approach. What are your conclusions? (4 points)
6. Extend the model in Question 4 by including the variable grade using Dummy Coding for both variables. It is hypothesized that grade has an additional effect on academic competence. Can you confirm this statement based on your analysis? (2 points)
7. Repeat your analysis in Question 6 using Effect Coding. Does a different coding scheme change your conclusions? (2 points).