Assignment Description
Homework #3 Scatterplots and Correlation
A survey of 24 faculty members at WMU was conducted, and each professor was asked their age and annual income. The results of this study are listed below:
Person  Age (X)  Income (Y) (x $1000) 
1  40  32 
2  31  24 
3  50  47 
4  53  50 
5  36  30 
6  55  55 
7  37  33 
8  45  41 
9  60  63 
10  41  34 
11  46  43 
12  38  35 
13  32  28 
14  56  57 
15  51  50 
16  37  30 
17  54  52 
18  42  35 
19  47  41 
20  33  26 
21  39  34 
22  52  49 
23  57  60 
24  55  51 
a. This data is in the FacSalaries.csv file on the eLearning page for GEOG 5670 in the data section — copy this to your USB drive in a folder called “Correlations.” In RStudio navigate to this folder and make it the working directory. Load this into a dataframe called Salaries using the read.csv() function.
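The loading step in part a can be sketched as follows. Because FacSalaries.csv itself is not reproduced here, the sketch writes the table above to a temporary CSV file first; the column names Age and Income and the temporary path are assumptions, not details of the course file.

```r
# Inline copy of the table above (column names Age and Income are assumed).
salaries_src <- data.frame(
  Age    = c(40, 31, 50, 53, 36, 55, 37, 45, 60, 41, 46, 38,
             32, 56, 51, 37, 54, 42, 47, 33, 39, 52, 57, 55),
  Income = c(32, 24, 47, 50, 30, 55, 33, 41, 63, 34, 43, 35,
             28, 57, 50, 30, 52, 35, 41, 26, 34, 49, 60, 51)
)

# Stand-in for the course file: write the table to a temporary CSV.
csv_path <- file.path(tempdir(), "FacSalaries.csv")
write.csv(salaries_src, csv_path, row.names = FALSE)

# In the assignment you would first call setwd() on your Correlations folder
# and then read the file by name: Salaries <- read.csv("FacSalaries.csv")
Salaries <- read.csv(csv_path)
str(Salaries)
```

Running str() right after the import is a quick way to confirm that both columns came in as numbers and that all 24 rows are present.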
b. The first step in any analysis of correlations is to generate a scatterplot. Install and load the ggplot2 package and use qplot() to generate a simple scatterplot. Specify the Age data for the x axis and the Income data for the y axis. Set the geom= parameter equal to "point". Specify appropriate labels for the x axis and y axis using the xlab= and ylab= arguments, and use main= to give the plot an appropriate title. In RStudio click on the Plot tab and Export button to save this to the clipboard. Paste this into a Word document.
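A possible shape for the scatterplot in part b is sketched below. The qplot() call the assignment asks for is shown as a comment, because qplot() is deprecated in recent ggplot2 releases; the code that runs is the equivalent ggplot() form. The axis labels and title are illustrative choices, and the data frame is an inline copy of the table above with assumed column names.

```r
# Inline copy of the faculty table (column names are assumed).
Salaries <- data.frame(
  Age    = c(40, 31, 50, 53, 36, 55, 37, 45, 60, 41, 46, 38,
             32, 56, 51, 37, 54, 42, 47, 33, 39, 52, 57, 55),
  Income = c(32, 24, 47, 50, 30, 55, 33, 41, 63, 34, 43, 35,
             28, 57, 50, 30, 52, 35, 41, 26, 34, 49, 60, 51)
)

# The call described in the assignment would look like:
#   qplot(Age, Income, data = Salaries, geom = "point",
#         xlab = "Age (years)", ylab = "Annual income (x $1000)",
#         main = "Faculty income versus age")

# Equivalent ggplot() form; skipped if ggplot2 is not installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(Salaries, ggplot2::aes(x = Age, y = Income)) +
    ggplot2::geom_point() +
    ggplot2::labs(x = "Age (years)",
                  y = "Annual income (x $1000)",
                  title = "Faculty income versus age")
  print(p)
}
```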
c. Compute the covariance between Age and Income using the cov() function. Copy the output from this and paste it below the scatterplot in the Word document. Answer the following questions in the Word document:
1. What does the sign (+ or -) indicate?
2. If the salaries were expressed as dollars instead of the current $ x 1000, how do you think the value of the covariance would change?
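One way to explore the two covariance questions is to compute the covariance twice, once with income in thousands and once in dollars. This is only a sketch on an inline copy of the table above; the variable names are assumptions.

```r
# Inline copy of the faculty table.
Age    <- c(40, 31, 50, 53, 36, 55, 37, 45, 60, 41, 46, 38,
            32, 56, 51, 37, 54, 42, 47, 33, 39, 52, 57, 55)
Income <- c(32, 24, 47, 50, 30, 55, 33, 41, 63, 34, 43, 35,
            28, 57, 50, 30, 52, 35, 41, 26, 34, 49, 60, 51)

# Covariance with income measured in thousands of dollars.
cov_thousands <- cov(Age, Income)

# Same data with income converted to dollars.
cov_dollars <- cov(Age, Income * 1000)

cov_thousands
cov_dollars
cov_dollars / cov_thousands  # ratio shows the effect of the unit change
```

Comparing the two values shows directly how the covariance responds to a change of units, which is the reason it is hard to interpret on its own.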
d. Now use the cor.test() function to determine the correlation between the two variables. You can specify a formula as: ~ Age + Income, data = Salaries and be sure to set the method to "pearson". Copy the output from this and paste it into the Word document beneath the covariance. Rerun this command with the method set to "spearman" and "kendall". Copy the output from these to your Word document. Answer the following questions:
1. Does correlation have more or less utility in determining the strength of the linear relationship between Age and Income? Briefly explain why.
2. If we had an additional ordinal variable that indicates the rank of each professor (Instructor, Assistant Professor, Associate Prof, Professor), could we use Pearson’s r to measure the strength of the relationship between Age and Income?
3. How do the values of the three methods (Pearson, Spearman, and Kendall) compare?
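The three cor.test() runs in part d can be sketched in one loop. The data frame is an inline copy of the table above with assumed column names, and exact = FALSE is an added choice (not part of the assignment) that avoids warnings about tied values in the rank-based tests.

```r
# Inline copy of the faculty table (column names are assumed).
Salaries <- data.frame(
  Age    = c(40, 31, 50, 53, 36, 55, 37, 45, 60, 41, 46, 38,
             32, 56, 51, 37, 54, 42, 47, 33, 39, 52, 57, 55),
  Income = c(32, 24, 47, 50, 30, 55, 33, 41, 63, 34, 43, 35,
             28, 57, 50, 30, 52, 35, 41, 26, 34, 49, 60, 51)
)

methods <- c("pearson", "spearman", "kendall")

# Run the same test once per method, using the formula interface.
results <- lapply(methods, function(m) {
  cor.test(~ Age + Income, data = Salaries, method = m, exact = FALSE)
})
names(results) <- methods

# Collect just the coefficients so the three methods can be compared.
estimates <- sapply(results, function(res) unname(res$estimate))
estimates

# Full output for one method, as it would be pasted into the Word document.
results$pearson
```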
1. This question uses agricultural data from China provided by Dr. Veeck. Copy the ChinaAgCorr.csv file from the Data section of the GEOG 5670 elearning page and paste it in your Correlation folder. Import this to a data frame called ChinaAg.
The variables in this dataset are: MECHIDX: an index of the amount of mechanization in ag.; AGCHEM: the amount of ag. chemicals used in each district; ECOCROPS: the area of ecologically benign crops (such as nuts) under cultivation; DIVERSIT: an index of the biodiversity of the district; TOTFRMS: the total number of farms; AGAREA: the total area under cultivation; ARABLE: the total area of arable land; IRRIG: the total area of irrigated land.
2. A quick way to generate a matrix of scatterplots for all combinations of these eight variables is the pairs() function. Run this on the ChinaAg dataframe and copy the resulting graph to your Word document. Using the combinations below the diagonal names on this graph, indicate which combinations are positively related, negatively related, or weakly related and list these below the graph in your Word document.
3. This time, we’ll use the cor() function on the whole dataframe to get a correlation matrix. You can make a cleaner output for this matrix if you round the values of the correlation coefficient to 3 decimal places using the round() function. So nest the cor() function inside the round() function to accomplish this. Copy the output matrix and paste this in your Word document.
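The pairs() and round(cor()) steps can be tried out before the ChinaAg file is loaded. The sketch below uses four columns of R's built-in mtcars dataset purely as a stand-in; with the course data you would pass ChinaAg instead.

```r
# Stand-in data: four numeric columns from the built-in mtcars dataset.
demo <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Scatterplot matrix for every pair of variables.
pairs(demo)

# Correlation matrix, rounded to 3 decimal places by nesting cor() in round().
cor_matrix <- round(cor(demo), 3)
cor_matrix
```

The matrix is symmetric with ones on the diagonal, so only the entries below (or above) the diagonal need to be read.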
Print out your Word document and turn it in for credit on the homework assignment.
GEOG5670HW2Answers.doc
GEOG 5670 Spatial Analysis Name: ______________________
Homework #3
Scatterplots and Correlation
Paste scatterplot of Age vs. Income data:
Paste covariance output here:
1. What does the sign (+ or -) indicate?
2. If the salaries were expressed as dollars instead of the current $ x 1000, how do you think the value of the covariance would change?
Paste Pearson’s r output here:
3. What is the value for Pearson’s r? Briefly state what this means.
4. Does this value indicate a statistically significant correlation? What evidence can you cite from the output?
5. As compared to the covariance value computed earlier, does correlation have more or less utility in determining the strength of the linear relationship between Age and Income? Briefly explain why.
6. Paste ChinaAg scatterplot matrix here:
Pairs that are + correlated  Pairs that are - correlated  Pairs with weak correlation 
7. Paste the correlation matrix here:
CORRELATION
Preview
• Bivariate Random Variables
• Correlation Analysis
• Pearson’s r
• Spatial Autocorrelation
• Regression Analysis
• Linear Regression
• Goodness of Fit
• Read Section 5.7 through 5.7.5 and Chapter 15 of your book
• Homework #5 is on the course web page—try to finish by March 12
Multivariate Techniques
• Involve two or more variables
• Simple (bivariate) correlation analysis is an investigation of the strength of association between two variables
• Simple regression analysis is a study of the nature of the relationship
• Estimating the value of one variable given another
Covariance of Two Random Variables
$$\mathrm{Cov}(X, Y) = \frac{1}{N - 1} \sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})$$
• C(X,Y) is the covariance
• Values of 0 indicate no linear relationship
• If increases in X result in increases in Y, + values
• If increases in X result in decreases in Y, - values
• Problem with covariance is that it is in units of X and Y, so values are difficult to interpret
Sample Covariance
• The best point estimate for C(X,Y) is
$$S_{XY} = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$
• $S_{XY}$ is positive if the two variables have a positive relationship and it is negative if they are negatively related
• Sample covariance has the same disadvantage as C(X,Y)
• It is highly influenced by the units in which the two variables are measured
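As a check on the sample covariance formula, it can be computed term by term and compared with R's cov(). The sketch uses the ten hair height and income pairs from the worked example later in these notes; the data frame and column names (TexasHair, HairHgt, Income) follow the cor.test() call shown there.

```r
# Hair height and real estate income pairs from the worked example.
TexasHair <- data.frame(
  HairHgt = c(4.5, 6.0, 5.5, 7.0, 7.5, 8.0, 10.0, 9.0, 10.5, 12.0),
  Income  = c(100, 130, 160, 180, 190, 200, 220, 240, 280, 300)
)

x <- TexasHair$HairHgt
y <- TexasHair$Income
n <- nrow(TexasHair)

# Sum of cross-products of deviations from the means, divided by n - 1.
s_xy <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

s_xy       # by hand
cov(x, y)  # built-in function, should agree
```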
• If two random variables are jointly normally distributed
• The marginal distributions of both X and Y are univariate normal
• Any conditional distribution of X or Y is also univariate normal
• Five parameters to the bivariate normal density function
• $\mu_X$, $\mu_Y$, $\sigma_X$, $\sigma_Y$, $\rho_{XY}$
$$\rho_{XY} = \frac{C(X, Y)}{\sigma_X \sigma_Y}$$
Pearson’s Product Moment Correlation Coefficient
• Population parameters $\mu_X$, $\mu_Y$, $\sigma_X$, $\sigma_Y$, $\rho_{XY}$ are almost never known
• Must estimate $\rho_{XY}$ from sample data
• Pearson’s r is the best point estimate of $\rho_{XY}$
• Where all points plot on a positively sloped line, r = 1
• Where all points plot on a negatively sloped line, r = -1
• If r is near 0, the scatter of points is nearly circular
• A scatter of points can have a strong nonlinear association but still have r near 0
Sample Correlation Coefficient
• Substitutes appropriate point estimators for C(X,Y) into
$$\rho_{XY} = \frac{C(X, Y)}{\sigma_X \sigma_Y}$$
• In simplified form
$$r = \frac{\sum X_i Y_i - \left(\sum X_i\right)\left(\sum Y_i\right)/n}{\sqrt{\left[\sum X_i^2 - \left(\sum X_i\right)^2/n\right]\left[\sum Y_i^2 - \left(\sum Y_i\right)^2/n\right]}}$$
Scatter Diagrams
• Each observation pair (Xi,Yi) represents one dot
Positive Linear Association
• r = 1 if all points lie on a positively sloped line
• r = 0.88 in the second example
Negative Linear Association
• r = -1 if all points lie on a negatively sloped line
Calculation of Pearson’s r
• We want to see if there is a relationship between height of hair and income from real estate
Xi  Yi  Xi^2  Yi^2  XiYi 
4.5  100  20.25  10000  450 
6.0  130  36.00  16900  780 
5.5  160  30.25  25600  880 
7.0  180  49.00  32400  1260 
7.5  190  56.25  36100  1425 
8.0  200  64.00  40000  1600 
10.0  220  100.00  48400  2200 
9.0  240  81.00  57600  2160 
10.5  280  110.25  78400  2940 
12.0  300  144.00  90000  3600 
80  2000  691  435400  17295 
$$r = \frac{17295 - (80)(2000)/10}{\sqrt{\left(691 - 80^2/10\right)\left(435400 - 2000^2/10\right)}} = 0.96$$
There is a strong positive correlation between hair height and real estate sales: higher hair = more real estate sales
Covariance and Correlation in R
• Both are in the stats package, which is loaded by default
• cov() or cor()
• cov(x, y = NULL, use = c("everything", "all.obs", "complete.obs"), method = c("pearson", "kendall", "spearman"))
• cor(x, y = NULL, use = c("everything", "all.obs", "complete.obs"), method = c("pearson", "kendall", "spearman"))
• cor.test()
• cor.test(x, y, alternative = c("two.sided", "less", "greater"), method = c("pearson", "kendall", "spearman"), exact = NULL, conf.level = 0.95, continuity = FALSE, ...)
• cor() can do all correlations in a dataframe while cor.test() only does specified pairs of variables
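The hand calculation of r from the column sums can be reproduced in R and compared with cor(). This is a sketch using the ten data pairs from the calculation table; the data frame and column names follow the cor.test() call used for this example.

```r
# Data pairs from the calculation table.
TexasHair <- data.frame(
  HairHgt = c(4.5, 6.0, 5.5, 7.0, 7.5, 8.0, 10.0, 9.0, 10.5, 12.0),
  Income  = c(100, 130, 160, 180, 190, 200, 220, 240, 280, 300)
)

x <- TexasHair$HairHgt
y <- TexasHair$Income
n <- nrow(TexasHair)

# Column totals, matching the last row of the table.
c(sum_x = sum(x), sum_y = sum(y), sum_x2 = sum(x^2),
  sum_y2 = sum(y^2), sum_xy = sum(x * y))

# Computational form of Pearson's r built from those totals.
numerator   <- sum(x * y) - sum(x) * sum(y) / n
denominator <- sqrt((sum(x^2) - sum(x)^2 / n) * (sum(y^2) - sum(y)^2 / n))
r_by_hand   <- numerator / denominator

r_by_hand  # by hand
cor(x, y)  # built-in function, should agree
```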
Correlation Matrices
• A summary of the correlation coefficients between all pairs of variables in a set
• If for the Texas example we had data on
• Hair size
• Real Estate Income
• Percentage of gold chrome on car
• Tons of makeup applied annually
• Numbers of packs of More cigarettes smoked per day
• Monthly bill for Home Shopping Network
Correlation Matrix for Texas Example
Variable  Hair  Real Estate  Gold Chrome  Makeup  More Cigs  HSN Bill 
Hair  1.0  .96  .83  .91  .53  .92 
Real Estate  .96  1.0  .98  .80  .23  .91 
Gold Chrome  .83  .98  1.0  .62  .71  .94 
Makeup  .91  .80  .62  1.0  .07  .21 
More Cigs  .53  .23  .71  .07  1.0  .77 
HSN Bill  .92  .91  .94  .21  .77  1.0 
• You can also generate a matrix of scatterplots using the pairs() command in the R graphics package
Significance Testing
• Assume two random variables are bivariate normally distributed
• $H_0: \rho = 0$; $H_A: \rho \neq 0$, or $\rho > 0$, or $\rho < 0$
• The sampling distribution of r is t-distributed with n - 2 degrees of freedom and an estimated standard error of:
$$s_r = \sqrt{\frac{1 - r^2}{n - 2}}$$
• The test statistic is:
$$t = \frac{r}{s_r} = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}} = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^2}}$$
Texas Example
[Figure: t distribution with critical values t = -2.306 and t = 2.306, and the observed t = 10.2]
> cor.test(~HairHgt + Income, data=TexasHair, method="pearson")
Pearson's product-moment correlation
data: HairHgt and Income
t = 10.223, df = 8, p-value = 7.198e-06
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.8499218 0.9916539
sample estimates:
cor
0.9637914
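The t statistic reported by cor.test() for the hair height example can be recomputed from r and n alone. The sketch below assumes the same ten data pairs used in the calculation of Pearson's r, and compares the hand calculation with the critical value for a two-sided test at the 0.05 level.

```r
# Data pairs from the calculation of Pearson's r.
TexasHair <- data.frame(
  HairHgt = c(4.5, 6.0, 5.5, 7.0, 7.5, 8.0, 10.0, 9.0, 10.5, 12.0),
  Income  = c(100, 130, 160, 180, 190, 200, 220, 240, 280, 300)
)

r <- cor(TexasHair$HairHgt, TexasHair$Income)
n <- nrow(TexasHair)

# Estimated standard error of r and the resulting t statistic.
s_r    <- sqrt((1 - r^2) / (n - 2))
t_stat <- r / s_r

# Two-sided critical value and p-value with n - 2 degrees of freedom.
t_crit  <- qt(0.975, df = n - 2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

c(t = t_stat, critical = t_crit, p = p_value)

# The same test through cor.test(), for comparison.
test_out <- cor.test(~ HairHgt + Income, data = TexasHair, method = "pearson")
test_out$statistic
```

Because the observed t is far beyond the critical value, the null hypothesis of zero correlation is rejected for these data.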
General Procedure for Correlations Using R
• To compute basic correlation coefficients there are three main functions that can be used: cor(), cor.test() and rcorr() (the last from the Hmisc package).
Correlations using R
• Pearson correlations:
• cor(examData, use = "complete.obs", method = "pearson")
• rcorr(as.matrix(examData), type = "pearson")
• cor.test(examData$Exam, examData$Anxiety, method = "pearson")
• If we predicted a negative correlation:
• cor.test(examData$Exam, examData$Anxiety, alternative = "less", method = "pearson")
Pearson Correlation Output
Variable  Exam  Anxiety  Revise 
Exam  1.0000000  -0.4409934  0.3967207 
Anxiety  -0.4409934  1.0000000  -0.7092493 
Revise  0.3967207  -0.7092493  1.0000000 
Reporting the Results
• Exam performance was significantly correlated with exam anxiety, r = -.44, and time spent revising, r = .40; the time spent revising was also correlated with exam anxiety, r = -.71 (all ps < .001).
Things to Know about the Correlation
• It varies between -1 and +1
• 0 = no relationship
• It is an effect size
• ±.1 = small effect
• ±.3 = medium effect
• ±.5 = large effect
• Coefficient of determination, r^2
• By squaring the value of r you get the proportion of variance in one variable shared by the other.
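As a small worked example of the coefficient of determination, the exam data correlations reported in the Pearson output earlier in these notes can be squared to give the percentage of shared variance. Only the magnitudes matter here, since squaring removes the sign; the vector names are illustrative.

```r
# Magnitudes of the three exam data correlations from the Pearson output.
r_values <- c(exam_anxiety   = 0.4409934,
              exam_revise    = 0.3967207,
              anxiety_revise = 0.7092493)

# Squaring r gives the proportion of variance shared; times 100 for percent.
shared_pct <- round(r_values^2 * 100, 1)
shared_pct
```

The first two results correspond to the 19.4% and 15.7% figures quoted for exam anxiety and revision time in the partial correlation diagram.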
Correlation and Causality
• The third-variable problem:
• In any correlation, causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the results.
• Direction of causality:
• Correlation coefficients say nothing about which variable causes the other to change.
Nonparametric Correlation
• Spearman’s rho
• Pearson’s correlation on the ranked data
• Kendall’s tau
• Better than Spearman’s for small samples
• World’s Biggest Liar competition
• 68 contestants
• Measures
• Where they were placed in the competition (first, second, third, etc.)
• Creativity questionnaire (maximum score 60)
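The statement that Spearman's rho is Pearson's correlation on the ranked data can be checked directly. The liar data are not reproduced in these notes, so this sketch uses the faculty age and income values from the homework table as stand-in data.

```r
# Stand-in data: faculty age and income from the homework table.
Age    <- c(40, 31, 50, 53, 36, 55, 37, 45, 60, 41, 46, 38,
            32, 56, 51, 37, 54, 42, 47, 33, 39, 52, 57, 55)
Income <- c(32, 24, 47, 50, 30, 55, 33, 41, 63, 34, 43, 35,
            28, 57, 50, 30, 52, 35, 41, 26, 34, 49, 60, 51)

# Spearman's rho from the built-in method.
rho_builtin <- cor(Age, Income, method = "spearman")

# Pearson's r computed on the ranks (ties receive average ranks).
rho_ranks <- cor(rank(Age), rank(Income), method = "pearson")

c(builtin = rho_builtin, on_ranks = rho_ranks)
```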
Spearman’s Rho
cor(liarData$Position, liarData$Creativity, method = "spearman")
• The output of this command will be:
[1] -0.3732184
• To get the significance value use rcorr() (NB: first convert the dataframe to a matrix):
liarMatrix <- as.matrix(liarData[, c("Position", "Creativity")])
rcorr(liarMatrix, type = "spearman")
• Or:
cor.test(liarData$Position, liarData$Creativity, alternative = "less", method = "spearman")
Spearman's Rho Output
Spearman's rank correlation rho
data: liarData$Position and liarData$Creativity
S = 71948.4, p-value = 0.0008602
alternative hypothesis: true rho is less than 0
sample estimates:
rho
-0.3732184
Kendall’s Tau (Nonparametric)
• To carry out Kendall’s correlation on the World’s Biggest Liar data simply follow the same steps as for Pearson and Spearman correlations but use method = "kendall":
cor(liarData$Position, liarData$Creativity, method = "kendall")
cor.test(liarData$Position, liarData$Creativity, alternative = "less", method = "kendall")
Kendall’s Tau (Nonparametric)
• The output is much the same as for Spearman’s correlation.
Kendall's rank correlation tau
data: liarData$Position and liarData$Creativity
z = -3.2252, p-value = 0.0006294
alternative hypothesis: true tau is less than 0
sample estimates:
tau
-0.3002413
Bootstrapping Correlations
• If we stick with our World’s Biggest Liar data and want to bootstrap Kendall’s tau, then our function will be:
bootTau <- function(liarData, i) cor(liarData$Position[i], liarData$Creativity[i], use = "complete.obs", method = "kendall")
• To bootstrap a Pearson or Spearman correlation you do it in exactly the same way except that you specify method = "pearson" or method = "spearman" when you define the function.
Bootstrapping Correlations Output
• To create the bootstrap object, we execute:
library(boot)
boot_kendall <- boot(liarData, bootTau, 2000)
boot_kendall
• To get the 95% confidence interval for the boot_kendall object:
boot.ci(boot_kendall)
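The same bootstrap steps can be run end to end on data that are available in these notes. The sketch below substitutes the faculty age and income table for liarData, so the numbers it produces will not match the liar output; the seed and the restricted set of interval types are added choices to keep the run reproducible and free of warnings.

```r
library(boot)

# Stand-in data: faculty age and income from the homework table.
Salaries <- data.frame(
  Age    = c(40, 31, 50, 53, 36, 55, 37, 45, 60, 41, 46, 38,
             32, 56, 51, 37, 54, 42, 47, 33, 39, 52, 57, 55),
  Income = c(32, 24, 47, 50, 30, 55, 33, 41, 63, 34, 43, 35,
             28, 57, 50, 30, 52, 35, 41, 26, 34, 49, 60, 51)
)

# Statistic function: Kendall's tau on the rows selected by index vector i.
bootTau <- function(data, i) {
  cor(data$Age[i], data$Income[i], use = "complete.obs", method = "kendall")
}

set.seed(5670)  # arbitrary seed so the resampling is reproducible
boot_kendall <- boot(Salaries, bootTau, R = 2000)
boot_kendall

# Normal, basic, and percentile 95% intervals.
boot.ci(boot_kendall, type = c("norm", "basic", "perc"))
```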
Bootstrapping Correlations Output
• The output below shows the contents of boot_kendall:
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = liarData, statistic = bootTau, R = 2000)
Bootstrap Statistics :
original  bias  std. error
t1*  -0.3002413  0.001058191  0.097663
Bootstrapping Correlations Output
• The output below shows the contents of the boot.ci() function:
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 2000 bootstrap replicates
CALL :
boot.ci(boot.out = boot_kendall)
Intervals :
Level  Normal  Basic
95%  (-0.4927, -0.1099 )  (-0.4956, -0.1126 )
Level  Percentile  BCa
95%  (-0.4879, -0.1049 )  (-0.4777, -0.0941 )
Partial and Semipartial Correlations
• Partial correlation:
• Measures the relationship between two variables, controlling for the effect that a third variable has on them both.
• Semipartial correlation:
• Measures the relationship between two variables controlling for the effect that a third variable has on only one of the others.
[Figure: overlapping-variance diagrams for Exam Performance, Exam Anxiety, and Revision Time, showing the variance accounted for by Exam Anxiety (19.4%), the variance accounted for by Revision Time (15.7%), and the unique variance accounted for by each]
Doing Partial Correlation using R
• The general form of pcor() is:
pcor(c("var1", "var2", "control1", "control2" etc.), var(dataframe))
• We can then see the partial correlation and the value of R^2 in the console by executing:
pc
pc^2
[Figure: diagrams contrasting partial correlation and semi-partial correlation]
Doing Partial Correlation using R
• The general form of pcor.test() is:
pcor.test(pcor object, number of control variables, sample size)
• Basically, you enter an object that you have created with pcor() (or you can put the pcor() command directly into the function):
pcor.test(pc, 1, 103)
Partial Correlation Output
> pc
[1] -0.2466658
> pc^2
[1] 0.06084403
> pcor.test(pc, 1, 103)
$tval
[1] -2.545307
$df
[1] 100
$pvalue
[1] 0.01244581
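The partial correlation above can also be reproduced in base R from the three pairwise correlations, without the ggm package that provides pcor() and pcor.test(). This sketch uses the first-order partial correlation formula and enters the Exam and Anxiety coefficient and the Anxiety and Revise coefficient as negative, which is an assumption about the signs of the values in the Pearson output; the variable names are illustrative.

```r
# Pairwise Pearson correlations for the exam data (signs are assumed).
r_exam_anxiety   <- -0.4409934
r_exam_revise    <-  0.3967207
r_anxiety_revise <- -0.7092493

# Partial correlation of Exam and Anxiety, controlling for Revise.
pc_by_hand <- (r_exam_anxiety - r_exam_revise * r_anxiety_revise) /
  sqrt((1 - r_exam_revise^2) * (1 - r_anxiety_revise^2))

# Significance test with one control variable and a sample size of 103.
n_obs     <- 103
n_control <- 1
df_pc     <- n_obs - 2 - n_control
t_val     <- pc_by_hand * sqrt(df_pc) / sqrt(1 - pc_by_hand^2)
p_val     <- 2 * pt(-abs(t_val), df = df_pc)

c(pc = pc_by_hand, pc_squared = pc_by_hand^2,
  tval = t_val, df = df_pc, pvalue = p_val)
```

Working from the correlation matrix alone is enough here because a first-order partial correlation depends only on the three pairwise coefficients.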