Let us worry about your assignment instead!

We Helped With This R Studio Programming Homework: Have A Similar One?

SOLVED
CategoryProgramming
SubjectR | R Studio
DifficultyUndergraduate
StatusSolved
More InfoProbability Homework Help
183011

Short Assignment Requirements

Attached is the detailed assignment description as well as some of the decks we've gone through in class so you know what we've covered. Thanks!

Assignment Description

Data Management

R Assignment 2

 

In this assignment, you will be working with a data file obtained directly from the U.S. Census Department and transforming it into a format where you might begin an analytic effort.  The data file is stored at:

 

https://s3.amazonaws.com/itao-30230/aff_2012.csv

 

Note that this file is large (approximately 100MB).  It may take a few minutes to load into R, so be patient.  Please load the data file directly from this URL in your code, rather than saving it on your computer and loading it from there.  Otherwise, it will not work when I execute it.

 

The analytic effort that you are supporting seeks to determine whether employees are paid differently in firms that are owned by men, by women, or have ownership shared equally between men and women.  You wish to break out your analysis by county and industry.  Your goal is to produce a dataframe in R that contains the following six fields:

 

       county: the name of the county (e.g. “Nassau County”)

       state: the name of the state (e.g. “New York”)

       sector: the name of the industry (e.g. “Accommodation and food services”)

       equal: the average salary paid by firms that have ownership split equally among men and women

       female: the average salary paid by firms that are owned primarily by women

       male: the average salary paid by firms that are owned primarily by men

 

You should have all of the R skills necessary to complete this assignment from the DataCamp course that you completed and the material that we covered together in class. 

 

You may wish to start by opening the file in Excel to help orient you to the data.  Pay particular attention to the structure of the file.  This is not a typical spreadsheet because it contains some summary information in rows.  It would be a worthwhile exercise to look at a single industry in a single county to make sure that you understand how the data is structured.   PLEASE FOLLOW THIS ADVICE.  IF YOU START CODING WITHOUT UNDERSTANDING THE DATA, YOU WILL MAKE MISTAKES!

 

There is much more data in this file then you will need for your analysis.  Removing unnecessary columns will help simplify your work.  Removing unnecessary rows is critical to obtaining correct output.  For example, if you find rows that summarize information found in other rows, leaving both in will result in double-counting.  You should find yourself removing rows based upon the values they have in the gender column and the NAICS_code column.

 

Here are some additional requirements and facts that will help you with your work:

 

       There are some other files that contain information from the Census Department that you may need:

       Notes File: https://s3.amazonaws.com/itao-30230/SBO_2012_00CSA01.txt

       Metadata File: https://s3.amazonaws.com/itao-30230/SBO_2012_00CSA01_metadata.csv

 

       Explore the data carefully using the techniques you learned.  Watch for strange things such as text appearing in a column that you think should only contain numbers and handle them appropriately.

       Discard information from your analysis that is associated with firms that do not have any employees early in the process.  This should involve deleting columns, not rows.  Be careful to purge this data completely -- you certainly want to remove columns that are specific to firms without employees but you also need to remove columns that total up information about firms with no employees and firms with employees.

       Use appropriate data types for each variable.  Dollar amounts should be rounded to the nearest whole dollar.

       In cases where the data contains ranges instead of actual values, substitute the midpoint of the range as a value, rounding up.  For example, if a row says that affected firms have between 5,000 and 9,999 employees, use the value 7,500.

       If values are withheld from the analysis due to quality standards (again, see the notes file!), set that value to NA.

       Some values may be marked with notes indicating that they are “out of scope” for the survey. You may disregard these notes.

       When you have a near-final table that is in the prescribed format, you may notice that missing information resulted in reintroducing some rows that have NAs in all three salary columns.  This is not useful information and you should remove it.  Only remove those cases where all three values are NA.  Do not remove rows where one or two of the columns contain NA values.

       Sort your results first by state, then by county, then by sector.  They should be sorted both in your R code and in your CSV file.

       Make sure that your columns have the following names {county, state, sector, equal, male, female} and are in that order in both your R code and your CSV file.  Please be sure that the names are all lowercase.

       You will need to write the data to a CSV file.  We did not cover this function in class, but it is very similar to the read_csv function (check ?write_csv for details).  Your CSV file should not contain the string “NA”.  Instead, leave those cells blank in the CSV.  Your CSV file should contain a header row and should only contain six columns corresponding to the six variables in your final data frame.

       Before generating your CSV file, it would be a good idea to take one last look at the summary statistics for your final data frame and see if they make sense.  Remember, this is real-world data.

 

This is a complex assignment and you may make some assumptions and/or decisions that cause you to reach a different outcome than my solution.  Therefore, it is critical that you comment your code well!  Explain not only what you are doing, but why you are doing it. 

 

You should turn in the following items for this assignment:

       An R script named “lastname.r” that carries out this assignment from start to finish.

       The CSV file produced by your R script

 

 

 

Assignment Description

R Lab



Create a scatterplot with tuition on the x-axis and loan default rates on the y-axis.

# Create a scatterplot of loan default rates by tuition

ggplot(data=college) +

geom_point(mapping=aes(x=tuition, y=loan_default_rate))

Enhance your scatterplot by fitting a line to the data.

# Fit a line to that dataset

ggplot(data=college, mapping=aes(x=tuition, y=loan_default_rate)) +   geom_point() +   geom_smooth()

Change the color of the points to represent the region, without altering the line.

# Color-code by region

ggplot(data=college, mapping=aes(x=tuition, y=loan_default_rate)) +   geom_point(mapping=aes(color=region)) +   geom_smooth()

Create a box plot showing faculty salary broken out by the highest degree awarded by the institution.

# Boxplot faculty salary by highest degree

ggplot(data=college) +

  geom_boxplot(mapping=aes(x=highest_degree, y=faculty_salary_avg))

Produce a statistical summary (using the summary() function) of all of the

schools with a loan default rate over 20%.

college %>%

  filter(loan_default_rate>0.20) %>%   summary()

Create a new tibble that contains only institutions with a faculty salary over

$10K/month, sorted in descending order of salary and including only the institution name, state, and average salary.

# Create a tibble consisting of only those institutions where faculty earn more than

$10K/month

# Sort it in descending order of faculty salary

# And include only the institution name, average faculty salary, and state

high_salary <- college %>%   filter(faculty_salary_avg > 10000) %>%   arrange(desc(faculty_salary_avg)) %>%   select(name, faculty_salary_avg, state)

Create a new tibble showing the number of high salary schools in each state, with the highest count first.

# Take that same data and produce a table showing the number per state with the highest

# state first.  Store that in a data frame called high_salary_by_state

high_salary_by_state <- high_salary %>%   group_by(state) %>%   summarize(count=n()) %>%   arrange(desc(count))

Produce a tibble summarizing the total number of schools in each state.

# Now produce a data frame summarizing the number of schools in each state called schools_by_state

schools_by_state <- college %>%   group_by(state) %>%   summarize(count=n()) %>%   arrange(desc(count))

Join the two previous tibbles together so that you have a record for every state.  Name the column with

the total number of schools “schools”

and the count of high salary schools “high_salary_schools”

# And join them together

augmented_state_data <- schools_by_state %>%   left_join(high_salary_by_state, by=“state")

# Rename count.x as schools and count.y as high_salary_schools augmented_state_data <- augmented_state_data %>%   rename(schools=count.x, high_salary_schools=count.y)

Create column containing the % of schools in each state that are high salary, sort by that percentage in descending

order and only include states where the value is 30% or higher.

# Create a column containing the percentage of schools that are high salary

# then sort by that percentage in descending order and only show me states

# where the percentage is 30% or higher

augmented_state_data <- augmented_state_data %>%   mutate(high_salary_percent = high_salary_schools / schools) %>%   arrange(desc(high_salary_percent)) %>%   filter(high_salary_percent >= .3)

augmented_state_data

Graph the last dataset using a column graph.

# Show this as a column graph

augmented_state_data %>%

  ggplot() +

  geom_col(mapping=aes(x=state, y=high_salary_percent))

Create this…

college %>%

  mutate(rejection_rate=1-admission_rate) %>%

  ggplot(mapping=aes(x=rejection_rate, y=median_debt, color=control)) +   geom_smooth(se=FALSE) +

  ggtitle("Student Debt Varies With School Selectivity",           subtitle="Source: U.S. Department of Education") +

  theme(panel.grid = element_blank(), panel.background = element_blank()) +   theme(legend.key=element_blank()) +   ylab("Median Debt (USD)") +   xlab("Rejection Rate") +   ylim(5000,45000) +   xlim(0,1)

Upon Further Investigation…

# Take a look at the scatterplot of this data as well

college %>%

  mutate(rejection_rate=1-admission_rate) %>%

  ggplot(mapping=aes(x=rejection_rate, y=median_debt, color=control)) +   geom_smooth(se=FALSE) +   geom_point(alpha=0.1) +

  ggtitle("Student Debt Varies With School Selectivity",           subtitle="Source: U.S. Department of Education") +

  theme(panel.grid = element_blank(), panel.background = element_blank()) +   theme(legend.key=element_blank()) +   ylab("Median Debt (USD)") +   xlab("Rejection Rate") +   ylim(5000,45000) +   xlim(0,1)

Customer Feedback

"Thanks for explanations after the assignment was already completed... Emily is such a nice tutor! "

Order #13073

Find Us On

soc fb soc insta


Paypal supported