
Assignment Description
The use of these linear contrasts or Multiple Comparisons Procedures is quite controversial, and the behavior of many of the available procedures is poorly known. For further reading on which contrasts are well behaved and on other issues in the use and interpretation of these procedures, see the papers by Jones, Chew, and Day and Quinn in the supplemental readings. Also, read the notes in the lab manual or on the class web site under "Lecture Notes" on "Multiple Comparisons."
Further Instructions on Lab 6
Parts 1 and 2a of the exercise can be accomplished without R. However, students invariably get the Mean Square within wrong for question 1.3 if they do not use R.
To use R to do a single-factor ANOVA (between subjects), you first need to enter your data the same way you would for an independent-groups t-test, but with the integer-code column containing k different integers (one for each treatment group).
R will produce an ANOVA Table which will contain the relevant sums of squares, mean squares, degrees of freedom, F-ratios and significance values. These values are all you need for your ANOVA and the MS within is needed when performing linear contrasts (multiple comparisons).
You must then apply any experimentwise error-rate correction by hand: calculate the per-contrast α and compare each contrast's computed significance value to the adjusted critical α.
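For example, a minimal sketch of this workflow (the file and column names here are hypothetical; substitute your own):

```r
dat <- read.csv("mydata.csv")           # columns assumed: score (response), group (integer codes)
dat$group <- factor(dat$group)          # the k treatment codes become k factor levels

fit <- aov(score ~ group, data = dat)   # single-factor, between-subjects ANOVA
summary(fit)                            # ANOVA table: df, Sum Sq, Mean Sq, F, p
# The "Residuals" Mean Sq in this table is the MS within used for linear contrasts.
```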
Lab 6 - ASSIGNMENT
PART 1 - Introduction to ANOVA
1.1) Find F(α = 0.05) for an F ratio with:
a) Numerator df = 7, denominator df = 25
b) Numerator df = 10, denominator df = 8
c) Numerator df = 30, denominator df = 60
1.2) Find F(α) for an F ratio with 15 numerator and 12 denominator df for the following values of α (see the R sketch after part c):
a) α = 0.025
b) α = 0.050
c) α = 0.10
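If you would rather look these critical values up in R than in a table, the qf function gives upper-tail critical F values. A minimal sketch, using the df from 1.1a and 1.2a as examples:

```r
qf(0.05,  df1 = 7,  df2 = 25, lower.tail = FALSE)   # F(0.05) with 7 and 25 df
qf(0.025, df1 = 15, df2 = 12, lower.tail = FALSE)   # F(0.025) with 15 and 12 df
```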
1.3) Independent random samples were selected from three populations, shown in the table below:
Sample 1 | Sample 2 | Sample 3
---|---|---
2.1 | 4.4 | 1.1
3.3 | 2.6 | 0.2
0.2 | 3.0 | 2.0
1.9 | |
a) Calculate MSB for the data. What type of variability is measured by this quantity? How many degrees of freedom are associated with this quantity?
b) Calculate MSw for the data. What type of variability is measured by this quantity? How many degrees of freedom are associated with this quantity?
ASSIGNMENT PART 2a: a priori testing
2a.1) Given contrasts a through d below and equal group sample sizes, which pairs of contrasts are orthogonal? (See the sketch after the table.)
Contrast | Grp 1 | Grp 2 | Grp 3 | Grp 4 | Grp 5
---|---|---|---|---|---
Ca | 1 | -1 | 0 | 0 | 0
Cb | 1 | 1 | 1 | 1 | -4
Cc | 1 | -1 | 0 | 1 | -1
Cd | 0 | 1 | -2 | 1 | 0
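With equal sample sizes, two contrasts are orthogonal when the sum of the products of their coefficients is zero. A quick way to check every pair at once in R, using the coefficients from the table above:

```r
Ca <- c(1, -1,  0, 0,  0)
Cb <- c(1,  1,  1, 1, -4)
Cc <- c(1, -1,  0, 1, -1)
Cd <- c(0,  1, -2, 1,  0)

M <- rbind(Ca, Cb, Cc, Cd)
M %*% t(M)    # each off-diagonal entry is a cross-product sum;
              # a zero marks an orthogonal pair of contrasts
```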
2a.2) Data for the pituitary function experiment can be found in the data file 'pit.csv'. Use the 'contrasts' and 'aov' functions in R to test the following sets of contrasts (a syntax sketch follows the table below). Knowing that these contrasts were developed a priori, interpret the results. Include all relevant output. Describe the null hypotheses to be tested by these contrasts. Did you reject or accept these hypotheses, and why? What do these results indicate with respect to the context of the problem at hand (e.g., relate this to pituitary function, chemicals, and control)?
Contrast | Grp 1 | Grp 2 | Grp 3 | Grp 4
---|---|---|---|---
Contrast 1 | 3.0 | -1.0 | -1.0 | -1.0
Contrast 2 | 1.0 | 0.0 | 0.0 | -1.0
Contrast 3 | -1.0 | -1.0 | -1.0 | 3.0
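One possible way to set this up is sketched below. The response column is called 'hormone' and the group column 'grp' here; the actual names in pit.csv may differ, so treat these as placeholders.

```r
pit <- read.csv("pit.csv")
pit$grp <- factor(pit$grp)

# Attach planned contrasts to the factor (shown for Contrasts 1 and 2; repeat
# the same pattern for Contrast 3). Note that these contrasts are not mutually
# orthogonal, so the single-df sums of squares depend on which contrasts are
# fitted together.
contrasts(pit$grp) <- cbind(C1 = c(3, -1, -1, -1),
                            C2 = c(1,  0,  0, -1))

fit <- aov(hormone ~ grp, data = pit)
# Partition the treatment sum of squares by contrast in the summary
summary(fit, split = list(grp = list("Contrast 1" = 1, "Contrast 2" = 2)))
```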
ASSIGNMENT PART 2b: a posteriori testing
2b.1) Examine the contrasts from problem 2a.2, under the assumption that these were conducted a posteriori. Keep the experiment-wise error rate at 5%. Compute the new comparison-wise α's using the Dunn-Sidak correction and interpret the results in relation to these. (Hint: Set αe = 5%, and solve for αc.) Again, do you accept or reject the null hypotheses at hand? Interpret these results. Compare these results to the simple ANOVA (since the tests are a posteriori, it is assumed that you examined the ANOVA first).
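Following the hint, the Dunn-Sidak per-comparison rate for k comparisons is αc = 1 − (1 − αe)^(1/k). A one-line calculation in R (k = 3 for the three contrasts of 2a.2):

```r
alpha_e <- 0.05                        # experimentwise error rate
k <- 3                                 # number of contrasts
alpha_c <- 1 - (1 - alpha_e)^(1/k)     # Dunn-Sidak per-comparison alpha
alpha_c                                # roughly 0.017; compare each contrast's p-value to this
```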
2b.2) If you were responsible for actually analyzing the data from the pituitary function experiment, what procedure would you use: a priori contrasts, or one of the a posteriori procedures? Briefly justify your choice.
Focusing on the Sum Sq column in the ANOVA tables, we see that the Type III sums of squares tend to give lower estimates, since they estimate the main effects and interactions after eliminating variation confounded between main effects and between main effects and interactions. The Type II sums of squares only correct the main effects for variation confounded with other main effects. Type I sums of squares do not adjust for confounded variation and give most of that variation to the first term in the model. At this point, I would recommend that if you have unequal sample sizes you use Type II sums of squares. However, the stats packages SAS and SPSS often use Type III as a default.
Plot the data to get a better sense of the result.
library(ggplot2)   # if not already loaded
# Plot the data
p = ggplot(dat2, aes(x = fsiten, y = speciesrichn, fill = ftrn)) +
  geom_boxplot(position = position_dodge(1))
p
Note that in this case we get a similar result even when using the data set with unequal sample sizes.
LAB - 7 Assignment
A) A data file on the plant growth experiment example above has been created, called PLGR1 (plgr1.csv). Each case has three variables: 1) the fertilizer treatment (1 through 3), 2) the light treatment (1 and 2), and 3) the dry weight of the plant after 10 weeks of growth. Use R to analyze the data. Specify: a) the hypotheses tested, b) the result of the hypothesis tests, and c) interpret the results with respect to the experiment.
B) Repeat A) using the aov function, but reversing the order of the main-effect factors in the aov syntax. Compare the results to those of A). What happens to the Sums-of-Squares?
C) Graph the means for each treatment combination in the experiment. Does the graphical analysis corroborate the statistical analysis?
D) Use the "Anova" function from the "car" package in R for to perform analyses using the Type 2 and Type 3 Sums-of-Squares approaches. Report on any differences from
A).
E) Another researcher attempted to replicate the experiment. Unfortunately, during the final phase of the experiment a raccoon got into the plot and ate half the plants. The data from that experiment are in the file PLGR2 (plgr2.csv). Analyze the results of this experiment using Type I, II, and III SS. Use the aov command for the Type 1 analysis and reverse the order of the main effects in the aov syntax. Use the "Anova" function from the "car" package in R for the Type 2 and Type 3 Sums-of-Squares analyses. Compare the results of the Type 1, 2, and 3 Sums-of-Squares analyses, particularly how the Sums-of-Squares differ (a minimal syntax sketch follows these questions).
Which method for dealing with unbalanced designs would be best? Which requires the largest number of arbitrary decisions?
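A minimal syntax sketch for the Type I / II / III comparisons above. It assumes the plgr files have columns named 'fert', 'light', and 'dryweight'; the actual header names may differ.

```r
library(car)   # provides Anova() for Type II / III sums of squares

plgr <- read.csv("plgr2.csv")            # the same pattern applies to plgr1.csv
plgr$fert  <- factor(plgr$fert)
plgr$light <- factor(plgr$light)

# Type I (sequential) SS via aov: with unequal n, the order of the main effects matters
summary(aov(dryweight ~ fert * light, data = plgr))
summary(aov(dryweight ~ light * fert, data = plgr))

# Type II and Type III SS via car::Anova (sum-to-zero contrasts are needed
# for a sensible Type III analysis)
fit <- lm(dryweight ~ fert * light, data = plgr,
          contrasts = list(fert = contr.sum, light = contr.sum))
Anova(fit, type = 2)
Anova(fit, type = 3)
```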
rocks = factor(c(rep(1,6), rep(2,6), rep(3,6)))
rocks
## [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3
## Levels: 1 2 3
pHH = factor(rep(pH, 3))
pHH
## [1] 1 1 1 2 2 2 1 1 1 2 2 2 1 1 1 2 2 2
## Levels: 1 2
# obtain a boxplot
p2 = ggplot(dat2, aes(x = pHH, y = numinverts, fill = rocks)) +
  geom_boxplot()
p2
The graph supports the conclusion that more invertebrates were detected in streams with neutral pH than in low-pH streams, and shows some suggestion of an interaction between pH and rock type, although it was not statistically significant.
Lab 8 - Assignment PART 1: Nested ANOVA
Data on insect damage on oak trees was collected in order to answer questions concerning the effects of shading on damage levels. Six oak trees were selected at random, 3 in the shade and 3 in the sun. The data are included in the linked data file nest.csv. The file has the following form:
Column 1: Light level: 1 = shade; 2 = sun
Column 2: Tree. Trees numbered 1 – 3 within each light level (six different trees: 3 in the sun and 3 in the shade).
Column 3: Damage in percent.
Column 4: Tree2. Trees numbered 1-6
Reiterating, this file contains data which describe the response of trees in terms of damage; thus damage is the response variable. The data come from observations taken from six randomly selected oak trees. Three are located in the shade; and three are located in the sun. Thus this is a nested design, with trees nested within the level of light (shade or sun) and leaves nested within trees.
A sample of 15 leaves was taken at random from each tree.
To use the R syntax to perform a nested ANOVA, use the tree coding variable (tree2). To perform the analysis for pooling the sums of squares use the tree coding variable (tree).
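A sketch of that syntax is given below. It assumes the columns in nest.csv are named 'light', 'damage', and 'tree2' as described above; the real header names may differ.

```r
nest <- read.csv("nest.csv")
nest$light <- factor(nest$light)
nest$tree2 <- factor(nest$tree2)     # unique tree identifier, 1-6

# Trees (tree2) nested within light level
fit <- aov(damage ~ light + tree2, data = nest)
summary(fit)
# Because trees are a random factor, test the light effect by hand:
# F = MS(light) / MS(tree2), with 1 and 4 df.

# Alternatively, an Error() stratum gives that test directly
summary(aov(damage ~ light + Error(tree2), data = nest))
```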
1.1) What kind of design is this? Draw a graphical representation of the experimental design, where A = light level, B = trees, and G = group of subjects.
When I ask for a graphical representation of the experimental design, I am looking for a diagram like I have shown in class that shows all the factors in the experiment, their levels, and represents each group of subjects with the letter G appropriately subscripted under each treatment combination. I also ask that you accurately describe the pattern of crossing and nesting of factor levels and subjects.
1.2) Conduct an appropriate analysis of variance on these data.
a) specify the null hypotheses being tested and turn in the edited output
b) pool the appropriate sums-of-squares and perform the appropriate F test
c) interpret the results of the tests (at α = 0.05)
d) How could the data have been treated differently so a nested ANOVA could have been avoided?
In our particular example, our nested factor (trees) is a random-effects factor, since we chose the trees at random from a large number of possible trees. Therefore, the final test of the effect of factor A will involve computing an F ratio with MS(B(A)) as the denominator:
F test for factor A: F = MS(A) / MS(B(A)).
Given that our nested factor is a random-effects factor, it also makes little sense to compute a hypothesis test for the B(A) effect, but if one insisted on doing so, then the within-cell (error) mean square would serve as the denominator of the F-ratio.
Finally, for our particular problem, an alternative way to analyze the data would be to average the values among the 15 leaves within each tree, and use the tree means to compute a independent groups t-test for differences in leaf damage between trees in the sun versus the shade. The results of this test would be identical to the test of the light effect in the Nested ANOVA (except the t-value would equal the square root of the F value). Hence, it is sometimes possible to turn a nested ANOVA into a simpler problem, particularly if the nesting of factor levels arises because multiple observations are made on the same subject. Here, by using tree averages, we turn a nested random factor (the tree factor) into our subjects, hence removing the nested factor levels and simplifying the analysis.
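A sketch of that simpler analysis, using the same assumed column names as above:

```r
# Average the 15 leaves within each tree, then compare sun vs. shade tree means
tree_means <- aggregate(damage ~ light + tree2, data = nest, FUN = mean)
t.test(damage ~ light, data = tree_means, var.equal = TRUE)
# The square of this t statistic equals the F for the light effect in the nested ANOVA
```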
PART 2: Repeated Measures ANOVA
In this experiment, ten small-mammal trapping grids were established in order to study the effect of food addition on population densities. Five of the grids received food supplements while the others were not manipulated. The population levels on each grid were monitored 3 times at monthly intervals following the food addition. The data are contained in the linked data file rep.csv. It is in the following form:
Column 1: Food. Where: 1 = no addition; 2 = food added
Column 2: Grid. Grids 1 - 10
Columns 3 – 5: Population density in June, July, and August.
Reiterating, here we have 10 grids. 5 have food added; 5 do not. The response variable, population density, was measured on each grid (subject) 3 times (a repeated measures factor).
2.1) What kind of design is this? Draw a graphical representation of the experimental design, where A = food treatment, B = time or month, and G = group of subjects.
2.2) Conduct an appropriate analysis of variance on these data and report the results for tests of the main effects of food treatment, time, and their interaction (no pooling of sums-of-squares is necessary; a syntax sketch follows part b).
a) specify the null hypotheses being tested and interpret the results of the tests (at α = 0.05). Don't forget to include your syntax and results in your R Markdown file.
b) How could the study have been designed differently so that a repeated measures ANOVA could have been avoided?
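One way this analysis can be set up in R is sketched below. The month columns are assumed to be named 'june', 'july', and 'aug', and the others 'food' and 'grid'; the real names in rep.csv may differ.

```r
rep_wide <- read.csv("rep.csv")

# Reshape from one row per grid to one row per grid-by-month observation
rep_long <- reshape(rep_wide, direction = "long",
                    varying = c("june", "july", "aug"),
                    v.names = "density", timevar = "month",
                    times   = c("june", "july", "aug"),
                    idvar   = "grid")

rep_long$food  <- factor(rep_long$food)
rep_long$month <- factor(rep_long$month)
rep_long$grid  <- factor(rep_long$grid)

# Food is the between-subjects factor, month the repeated (within-subjects)
# factor, and the grids are the subjects
fit <- aov(density ~ food * month + Error(grid/month), data = rep_long)
summary(fit)
```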
This plot also does not have equal scatter about the fitted line (represented by the horizontal dashed line). However, we cannot tell whether the low scatter at the right end of the plot is due to inherently unequal variances, or to there being few farms that plant a large number of crops in our sample of farms.
Further Instructions for Lab 9
Data files for regression and correlation require that each subject be represented by a line in the data file and that each column represent a variable. So, for correlation or bivariate regression, an R data file need only have two columns of values. However, if you have more than two variables for a single set of subjects and you want their correlations, just enter all the variables in separate columns and R can calculate the correlation between the variables in each pair of columns - a correlation matrix. Instead of inserting the X and Y variable names when using the 'cor' function, insert the name of the data.frame and all pairs of correlations will be calculated.
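For example (the file and column names here are hypothetical):

```r
dat <- read.csv("myvars.csv")      # columns assumed: x1, x2, x3
cor(dat$x1, dat$x2)                # a single pairwise correlation
cor(dat)                           # correlation matrix for every pair of columns
cor(dat, use = "complete.obs")     # same, ignoring rows with missing values
```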
LAB - 9 Assignment
PART 1: Introduction to Correlation and Regression
The Bermuda Petrel is an oceanic bird spending most of its year on the open sea, only returning to land during the breeding season. Its nesting sites are on a small, uninhabited island of the Bermuda group, where careful hatching records have been kept over several years. The Bermuda Petrel feeds only upon fish caught in the open ocean waters far from land. Unfortunately, DDT is now so widespread, and is so concentrated by the biological amplification system known as the "food chain," that the Bermuda Petrel can no longer lay hard-shelled eggs. Since DDT breaks down so slowly, it would appear that this beautiful bird is doomed to extinction (along with how many others?).
Your data below represent hatching rates of clutches of eggs over a number of years. Use correlation and linear regression in R to see if there is a significant relationship between the percent of clutches hatching and year (a syntax sketch follows the table). Interpret the output. Also produce a scatter plot of the relationship between hatching rate and year.
Year | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973
---|---|---|---|---|---|---|---|---
% of Clutches Hatching | 80 | 60 | 67 | 39 | 48 | 37 | 35 | 17
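A sketch of one way to run this analysis, entering the data from the table above directly (the variable names are my own):

```r
year  <- 1966:1973
hatch <- c(80, 60, 67, 39, 48, 37, 35, 17)    # % of clutches hatching

cor.test(year, hatch)         # Pearson correlation and its significance test
fit <- lm(hatch ~ year)       # linear regression of hatching rate on year
summary(fit)

plot(year, hatch, xlab = "Year", ylab = "% of clutches hatching")
abline(fit)                   # add the fitted regression line
```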
PART 2: Assumptions of simple linear regression
A) Using the kenyabees.csv data, is it possible to transform the CV data to an alternative scale on which the residuals and the Y variable are normally distributed? For example, what if we log transformed the CV data?
B) Estimate the linear regression model for each of the three sample data sets (reg1, reg3, reg5) using the lm function in R. Use data in Column 1 as the X-variate and column 2 as the Y-variate in each data file.
C) Write the regression equation for at least two of the data sets.
D) Reiterating from the lab, the null hypothesis to be tested in each instance states that Y is not a linear function of X, and thus X will not be a good predictor of Y. More specifically, under the null hypothesis we are testing that the slope, b1, will be equal to zero, since this would be indicative of no relationship between the two variables. At the α = 0.05 level, based on the output of the regression alone (F - test) for which of the three data sets would you reject the null hypothesis?
E) Based on the R2 values, which model reveals the best fit?
F) To see if the models are adequate, you must check whether the assumptions of regression have been met. Use graphical and/or statistical methods to assess the assumptions of normality, homogeneity of variances, and linearity for each data set (a diagnostics sketch follows below). For which data sets is linear regression appropriate, and for which is it clear that a linear regression model should not be imposed on the data? Would some transformation of scale for the Y or X data make these data normal and homoscedastic? Would transformation of X improve linearity?
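A diagnostics sketch for one of the data sets (reg1.csv is used here; repeat the same pattern for reg3.csv and reg5.csv). The column labels are my own, since the files are only described as X in column 1 and Y in column 2.

```r
reg1 <- read.csv("reg1.csv")        # add header = FALSE if the file has no header row
names(reg1) <- c("x", "y")          # label column 1 as X and column 2 as Y

fit1 <- lm(y ~ x, data = reg1)
summary(fit1)                       # slope test (t and F) and R-squared

# Assumption checks
par(mfrow = c(2, 2))
plot(fit1)                          # residuals vs fitted, Q-Q, scale-location, leverage
shapiro.test(residuals(fit1))       # formal normality test of the residuals

# A transformed model, e.g. lm(log(y) ~ x, data = reg1), can be checked the same way
```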
2. Examine the correlations among pairs of predictor variables to check for multicollinearity. If r > 0.9 for any pair, try alternative models that eliminate one member of that pair (see the sketch after this list).
3. Examine the diagnostic plots to make sure that there are no observations with high leverage or high influence. Influential data points will have Cook’s D values greater than 1.
4. Compare alternative models to determine if one or more models fit the data equally well.
5. The model with the best residual pattern, that is not beset with collinearity or influential data points, and that has the highest R2 is the best model. Note that R2 is the last criterion to use in choosing a model, not the first.
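A sketch for steps 2 and 3, assuming 'fit' is a fitted lm() model and 'preds' is a data frame containing only the predictor columns (both names are placeholders):

```r
cor(preds)               # pairwise correlations among predictors; look for r > 0.9

cooks.distance(fit)      # Cook's D for each observation
plot(fit, which = 4)     # index plot of Cook's distance
plot(fit, which = 5)     # residuals vs leverage
```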
Lab 10 Assignment
The exercise to be performed in this lab is to use the StepF and/or regsubsets functions in R to generate a set of candidate models, and to select the individual "best" model or set of best models if 2 or more models seem to be equally good. You must discuss in detail the reasons for choosing the models that you have selected, including showing plots of residuals, information about the distribution of the response variable, examining outliers, and other metrics to demonstrate goodness-of-fit.
DESCRIPTION OF DATA
The data are stored in the file multr2.csv. The variables are as follows (they appear in this order in the data set):
VARIABLE (UNITS)
______________________________________
Mean elevation (feet)
Mean temperature (degrees F)
Mean annual precipitation (inches)
Vegetative density (percent cover)
Drainage area (miles2)
Latitude (degrees)
Longitude (degrees)
Elevation at temperature station (feet)
1-hour, 25-year precipitation intensity (inches/hour)
Annual water yield (inches) (Dependent variable)
The data consist of values of these variables measured on all gauged watersheds in the western region of the USA. The dependent variable is the annual water yield, marked in the list above. Develop and evaluate a model for estimating water yield from un-gauged basins in the western USA.
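One possible starting point is sketched below. It assumes multr2.csv has a header row and that the last column is the water-yield response, which is renamed 'yield' here; adjust to the real column names.

```r
library(leaps)                       # regsubsets() for best-subsets selection

multr <- read.csv("multr2.csv")
names(multr)[ncol(multr)] <- "yield"

subs <- regsubsets(yield ~ ., data = multr, nvmax = 9)   # up to 9 predictors
summary(subs)$adjr2                  # adjusted R-squared of the best model of each size
plot(subs, scale = "adjr2")          # compare candidate models graphically

# AIC-based stepwise search as an alternative or a cross-check
full <- lm(yield ~ ., data = multr)
step(full, direction = "both")
```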