- Details
- Parent Category: Programming Assignments' Solutions
We Helped With This R Studio Programming Homework: Have A Similar One?

Category | Programming |
---|---|
Subject | R | R Studio |
Difficulty | Undergraduate |
Status | Solved |
More Info | Probability Homework Help |
Short Assignment Requirements
Assignment Description
Data Management
R Assignment 2
In this assignment, you will be working with a data file obtained directly from the U.S. Census Department and transforming it into a format where you might begin an analytic effort. The data file is stored at:
https://s3.amazonaws.com/itao-30230/aff_2012.csv
Note that this file is large (approximately 100MB). It may take a few minutes to load into R, so be patient. Please load the data file directly from this URL in your code, rather than saving it on your computer and loading it from there. Otherwise, it will not work when I execute it.
The analytic effort that you are supporting seeks to determine whether employees are paid differently in firms that are owned by men, by women, or have ownership shared equally between men and women. You wish to break out your analysis by county and industry. Your goal is to produce a dataframe in R that contains the following six fields:
● county: the name of the county (e.g. “Nassau County”)
● state: the name of the state (e.g. “New York”)
● sector: the name of the industry (e.g. “Accommodation and food services”)
● equal: the average salary paid by firms that have ownership split equally among men and women
● female: the average salary paid by firms that are owned primarily by women
● male: the average salary paid by firms that are owned primarily by men
You should have all of the R skills necessary to complete this assignment from the DataCamp course that you completed and the material that we covered together in class.
You may wish to start by opening the file in Excel to help orient you to the data. Pay particular attention to the structure of the file. This is not a typical spreadsheet because it contains some summary information in rows. It would be a worthwhile exercise to look at a single industry in a single county to make sure that you understand how the data is structured. PLEASE FOLLOW THIS ADVICE. IF YOU START CODING WITHOUT UNDERSTANDING THE DATA, YOU WILL MAKE MISTAKES!
There is much more data in this file then you will need for your analysis. Removing unnecessary columns will help simplify your work. Removing unnecessary rows is critical to obtaining correct output. For example, if you find rows that summarize information found in other rows, leaving both in will result in double-counting. You should find yourself removing rows based upon the values they have in the gender column and the NAICS_code column.
Here are some additional requirements and facts that will help you with your work:
● There are some other files that contain information from the Census Department that you may need:
○ Notes File: https://s3.amazonaws.com/itao-30230/SBO_2012_00CSA01.txt
○ Metadata File: https://s3.amazonaws.com/itao-30230/SBO_2012_00CSA01_metadata.csv
● Explore the data carefully using the techniques you learned. Watch for strange things such as text appearing in a column that you think should only contain numbers and handle them appropriately.
● Discard information from your analysis that is associated with firms that do not have any employees early in the process. This should involve deleting columns, not rows. Be careful to purge this data completely -- you certainly want to remove columns that are specific to firms without employees but you also need to remove columns that total up information about firms with no employees and firms with employees.
● Use appropriate data types for each variable. Dollar amounts should be rounded to the nearest whole dollar.
● In cases where the data contains ranges instead of actual values, substitute the midpoint of the range as a value, rounding up. For example, if a row says that affected firms have between 5,000 and 9,999 employees, use the value 7,500.
● If values are withheld from the analysis due to quality standards (again, see the notes file!), set that value to NA.
● Some values may be marked with notes indicating that they are “out of scope” for the survey. You may disregard these notes.
● When you have a near-final table that is in the prescribed format, you may notice that missing information resulted in reintroducing some rows that have NAs in all three salary columns. This is not useful information and you should remove it. Only remove those cases where all three values are NA. Do not remove rows where one or two of the columns contain NA values.
● Sort your results first by state, then by county, then by sector. They should be sorted both in your R code and in your CSV file.
● Make sure that your columns have the following names {county, state, sector, equal, male, female} and are in that order in both your R code and your CSV file. Please be sure that the names are all lowercase.
● You will need to write the data to a CSV file. We did not cover this function in class, but it is very similar to the read_csv function (check ?write_csv for details). Your CSV file should not contain the string “NA”. Instead, leave those cells blank in the CSV. Your CSV file should contain a header row and should only contain six columns corresponding to the six variables in your final data frame.
● Before generating your CSV file, it would be a good idea to take one last look at the summary statistics for your final data frame and see if they make sense. Remember, this is real-world data.
This is a complex assignment and you may make some assumptions and/or decisions that cause you to reach a different outcome than my solution. Therefore, it is critical that you comment your code well! Explain not only what you are doing, but why you are doing it.
You should turn in the following items for this assignment:
● An R script named “lastname.r” that carries out this assignment from start to finish.
● The CSV file produced by your R script
Assignment Description
R Lab
Create a scatterplot with tuition on the x-axis and loan default rates on the y-axis.
# Create a scatterplot of loan default rates by tuition
ggplot(data=college) +
geom_point(mapping=aes(x=tuition, y=loan_default_rate))
Enhance your scatterplot by fitting a line to the data.
# Fit a line to that dataset
ggplot(data=college, mapping=aes(x=tuition, y=loan_default_rate)) + geom_point() + geom_smooth()
Change the color of the points to represent the region, without altering the line.
# Color-code by region
ggplot(data=college, mapping=aes(x=tuition, y=loan_default_rate)) + geom_point(mapping=aes(color=region)) + geom_smooth()
Create a box plot showing faculty salary broken out by the highest degree awarded by the institution.
# Boxplot faculty salary by highest degree
ggplot(data=college) +
geom_boxplot(mapping=aes(x=highest_degree, y=faculty_salary_avg))
Produce a statistical summary (using the summary() function) of all of the
schools with a loan default rate over 20%.
college %>%
filter(loan_default_rate>0.20) %>% summary()
Create a new tibble that contains only institutions with a faculty salary over
$10K/month, sorted in descending order of salary and including only the institution name, state, and average salary.
# Create a tibble consisting of only those institutions where faculty earn more than
$10K/month
# Sort it in descending order of faculty salary
# And include only the institution name, average faculty salary, and state
high_salary <- college %>% filter(faculty_salary_avg > 10000) %>% arrange(desc(faculty_salary_avg)) %>% select(name, faculty_salary_avg, state)
Create a new tibble showing the number of high salary schools in each state, with the highest count first.
# Take that same data and produce a table showing the number per state with the highest
# state first. Store that in a data frame called high_salary_by_state
high_salary_by_state <- high_salary %>% group_by(state) %>% summarize(count=n()) %>% arrange(desc(count))
Produce a tibble summarizing the total number of schools in each state.
# Now produce a data frame summarizing the number of schools in each state called schools_by_state
schools_by_state <- college %>% group_by(state) %>% summarize(count=n()) %>% arrange(desc(count))
Join the two previous tibbles together so that you have a record for every state. Name the column with
the total number of schools “schools”
and the count of high salary schools “high_salary_schools”
# And join them together
augmented_state_data <- schools_by_state %>% left_join(high_salary_by_state, by=“state")
# Rename count.x as schools and count.y as high_salary_schools augmented_state_data <- augmented_state_data %>% rename(schools=count.x, high_salary_schools=count.y)
Create column containing the % of schools in each state that are high salary, sort by that percentage in descending
order and only include states where the value is 30% or higher.
# Create a column containing the percentage of schools that are high salary
# then sort by that percentage in descending order and only show me states
# where the percentage is 30% or higher
augmented_state_data <- augmented_state_data %>% mutate(high_salary_percent = high_salary_schools / schools) %>% arrange(desc(high_salary_percent)) %>% filter(high_salary_percent >= .3)
augmented_state_data
Graph the last dataset using a column graph.
# Show this as a column graph
augmented_state_data %>%
ggplot() +
geom_col(mapping=aes(x=state, y=high_salary_percent))
Create this…
college %>%
mutate(rejection_rate=1-admission_rate) %>%
ggplot(mapping=aes(x=rejection_rate, y=median_debt, color=control)) + geom_smooth(se=FALSE) +
ggtitle("Student Debt Varies With School Selectivity", subtitle="Source: U.S. Department of Education") +
theme(panel.grid = element_blank(), panel.background = element_blank()) + theme(legend.key=element_blank()) + ylab("Median Debt (USD)") + xlab("Rejection Rate") + ylim(5000,45000) + xlim(0,1)
Upon Further Investigation…
# Take a look at the scatterplot of this data as well
college %>%
mutate(rejection_rate=1-admission_rate) %>%
ggplot(mapping=aes(x=rejection_rate, y=median_debt, color=control)) + geom_smooth(se=FALSE) + geom_point(alpha=0.1) +
ggtitle("Student Debt Varies With School Selectivity", subtitle="Source: U.S. Department of Education") +
theme(panel.grid = element_blank(), panel.background = element_blank()) + theme(legend.key=element_blank()) + ylab("Median Debt (USD)") + xlab("Rejection Rate") + ylim(5000,45000) + xlim(0,1)