BIOL 419/519 Homework 4, Winter 2019
Due on Thursday, February 21 at 11:59pm
About this homework: This assignment is a bit of a departure from previous homework assignments. You’ll practice some research and hacking skills doing a data pull from a few spreadsheets, where things can get a little hairy. This homework will help you practice doing realistic data pulls and data exploration from datasets like the ones you may encounter in your course projects—the skill of figuring out how to do what you want by doing research on coding is very valuable!
Instructions: Submit the Jupyter notebook of your work. Your notebook solutions should include the code you wrote to solve each problem as well as the output/answer. Each part of each problem should be in a separate cell (or multiple cells) with clear comments labeling them, so that their outputs are easily found by the grader!
Expectations: Please seek help if you need it! You may ask questions at Friday’s lab, come to office hours, and get together with your classmates to troubleshoot together.
Collaboration: As noted in the Syllabus, what you turn in should reflect your own understanding of the material. Collaboration with your classmates is encouraged, and I ask you to clearly indicate these collaborations as comments in your homework.
Data Description
You will find two spreadsheets with data to download on Canvas. The spreadsheets contain data from every county in the US, one with information related to education and the other with employment data. There is also some metadata on every county. You are encouraged to open the files in Excel (or similar software) and examine them to familiarize yourself with the organization of the data.
Assignment
1. (1 pt) Pull in the data
Use Pandas to pull in the spreadsheets as two DataFrames. Please do not modify these spreadsheets by hand and re-save them in any way—your code must work with the original spreadsheets as given.
Hint: Watch for column labels, and for extra rows in the spreadsheet that don’t contain actual data. After pulling in the data, look at each data frame and make sure it’s the right shape and has the right columns.
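To illustrate the hint, here is a minimal sketch of skipping non-data rows that sit above the column labels. It uses a small made-up CSV held in memory as a stand-in for a spreadsheet, so the layout, column names, and values are all assumptions. `pd.read_excel` accepts the same `skiprows` and `header` arguments, but the number of rows to skip in the real files is something you need to find by inspecting them.

```python
import io

import pandas as pd

# Toy stand-in for a spreadsheet whose first rows are notes, not data.
# The layout and column names are invented for illustration.
raw = io.StringIO(
    "Source: example agency\n"
    "Notes: values are illustrative only\n"
    "FIPS,State,Area name,Percent\n"
    "1001,AL,Autauga County,12.3\n"
    "1003,AL,Baldwin County,9.8\n"
)

# Skip the two note rows so the third row is used as the column labels.
df = pd.read_csv(raw, skiprows=2)

# Check the result before doing anything else with it.
print(df.shape)
print(df.columns.tolist())
```

Printing the shape and the column list right after loading is a quick way to catch a header that landed on the wrong row.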
2. (2 pt) Merging and cleaning the data
Merge the two dataframes by their FIPS county codes. You should end up with a single data frame with all the data about education and employment for each county.
Next, clean up this data frame by dropping any rows that have any missing data (how does missing data show up in the data frame?). Some counties are missing some data for some of the years, so we’re just going to ignore them for now. This step also has the side-effect of getting rid of rows that contain summaries of each state as a whole.
What is the shape of the final data frame you end up with after merging and cleaning?
Hint: You should look into how to merge data frames in Pandas. Pay attention to column names.
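As a rough sketch of the merge-then-clean pattern, the example below joins two toy frames whose key columns have different names and then drops incomplete rows. The column names (`FIPS Code`, `FIPStxt`, and the data columns) are hypothetical placeholders, so check the real spreadsheets for the actual labels.

```python
import numpy as np
import pandas as pd

# Toy frames; the column names are hypothetical, not those of the real files.
education = pd.DataFrame({
    "FIPS Code": [1000, 1001, 1003, 1005],
    "Percent bachelor's": [np.nan, 24.6, 29.5, 12.9],
})
employment = pd.DataFrame({
    "FIPStxt": [1000, 1001, 1003, 1007],
    "Unemployment rate": [5.9, 5.1, np.nan, 6.4],
})

# The key column is named differently in each frame, so name both sides.
merged = education.merge(
    employment, left_on="FIPS Code", right_on="FIPStxt", how="inner"
)

# Missing values show up as NaN; dropna() removes every row that has one.
cleaned = merged.dropna()

print(merged.shape, cleaned.shape)
```

An inner join keeps only the codes present in both frames. Whether that is the right choice for the real data is a judgment call, and `how="outer"` followed by `dropna()` ends up with the same rows here.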
3. (7 pts) Visualizing distributions
As always, when making plots of data, be sure to label the axes and use clear legends when necessary.
(a) (1 pt) Make 4 histograms visualizing the distributions of the percent of adults with less than a high school diploma in 2000, one each for counties in the states of California, Georgia, New York, and Washington. Use the same bins for these histograms so they are directly comparable.
(b) (1 pt) Make a scatter plot of the percentage of adults with a bachelor’s degree or higher in the year 1980 versus the years 2012–2016. What is the distribution of the change in this percentage between these years? Plot the distribution and label its mean and median.
(c) (1 pt) Each county is labeled with a rural-urban continuum code, which is a number between 1 and 9. On the same plot, visualize the distributions of the percentage of adults with a bachelor’s degree or higher in 2012–2016 for the most urban counties (1’s) and the least urban counties (9’s).
(d) (2 pts) Make a plot of the mean and standard deviation of the “median household income in 2016” for counties grouped by their rural-urban continuum code. To be specific, the horizontal axis has the numbers 1 through 9, and the vertical axis has the median household income.
Hint: You can use the groupby functionality in Pandas.
(e) (2 pts) Make a scatter plot of the unemployment rate in 2016 versus the percentage of adults with a bachelor’s degree in 2012–2016. Use one color for counties that are more urban (1–3) and another color for counties that are less urban (4–9).
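Three techniques recur in the parts above: sharing bin edges across histograms, summarizing with `groupby`, and coloring points with a boolean mask. The sketch below shows each one on randomly generated data. Every column name is a placeholder, the data are synthetic, and the two histograms are overlaid on one axis only to keep the example short. It illustrates the mechanics and is not a solution to any specific part.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data; every column name here is a placeholder.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "State": rng.choice(["CA", "WA"], size=n),
    "rucc": rng.integers(1, 10, size=n),  # rural-urban code, 1 through 9
    "pct_no_hs": rng.uniform(5, 40, size=n),
    "pct_bachelors": rng.uniform(5, 60, size=n),
    "unemployment": rng.uniform(2, 12, size=n),
    "income": rng.normal(50_000, 10_000, size=n),
})

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Shared bins: compute the edges once and pass them to every histogram.
edges = np.linspace(0, 50, 21)
for state in ["CA", "WA"]:
    values = df.loc[df["State"] == state, "pct_no_hs"]
    axes[0].hist(values, bins=edges, alpha=0.5, label=state)
axes[0].set_xlabel("Percent without a high school diploma")
axes[0].set_ylabel("Number of counties")
axes[0].legend()

# groupby: mean and standard deviation of income for each code.
stats = df.groupby("rucc")["income"].agg(["mean", "std"])
axes[1].errorbar(stats.index, stats["mean"], yerr=stats["std"], fmt="o")
axes[1].set_xlabel("Rural-urban continuum code")
axes[1].set_ylabel("Income (mean, error bars = 1 std)")

# Boolean mask: split the rows into two groups and plot each in its own color.
urban = df["rucc"] <= 3
axes[2].scatter(df.loc[urban, "pct_bachelors"], df.loc[urban, "unemployment"],
                label="codes 1-3")
axes[2].scatter(df.loc[~urban, "pct_bachelors"], df.loc[~urban, "unemployment"],
                label="codes 4-9")
axes[2].set_xlabel("Percent with a bachelor's degree")
axes[2].set_ylabel("Unemployment rate")
axes[2].legend()

fig.tight_layout()
```

Passing explicit `edges` instead of a bin count is what makes the histograms comparable, because a bin count alone lets each call choose its own range.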
4. (Extra Credit) Benford’s Law
There’s a fun phenomenon known as Benford’s Law about the distribution of leading digits in real-life sets of numerical data. Roughly speaking, the observation is that the leading digits of large sets of real-life numbers are distributed logarithmically, such that 1’s are more common than 2’s, which are more common than 3’s, etc.
You can look up Benford’s Law and read more about it. One interesting application of Benford’s Law is in fraud detection in accounting (for instance, read this article https://www.wsj.com/articles/accountants-increasingly-use-data-analysis-to-catch-fraud-1417804886).
Let’s see how well our employment and education data follow Benford’s Law. Take all the numerical data in your data frame (without regard to what they are), extract the leading digits of all the numbers, and plot a histogram of their distribution. Does this distribution follow Benford’s Law? Describe your observations and comments.
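One possible way to extract leading digits is sketched below on a tiny made-up frame. The helper `leading_digit` and the toy columns are illustrative assumptions, and zeros and missing values are skipped because they have no leading digit. For comparison, Benford’s Law predicts that digit d appears with probability log10(1 + 1/d).

```python
import numpy as np
import pandas as pd


def leading_digit(value):
    """Return the first significant digit (1-9) of a nonzero finite number."""
    # Scientific notation puts the first significant digit first,
    # e.g. 0.0321 is formatted as '3.210000000000000e-02'.
    return int(f"{abs(value):.15e}"[0])


# Toy frame; the real data frame has many more columns and rows.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "x": [1234.0, 0.0321, 87.5, 0.0],
    "y": [19, 205, 3, 1100],
})

# Keep only the numeric columns and flatten them into one long array.
numeric = df.select_dtypes(include="number").to_numpy(dtype=float).ravel()
numeric = numeric[np.isfinite(numeric) & (numeric != 0)]

digits = pd.Series([leading_digit(v) for v in numeric])

# Observed fraction of each digit next to the Benford prediction.
observed = digits.value_counts(normalize=True).reindex(range(1, 10), fill_value=0.0)
benford = pd.Series(np.log10(1 + 1 / np.arange(1, 10)), index=range(1, 10))

print(pd.DataFrame({"observed": observed, "benford": benford}))
```

String formatting is used instead of dividing by a power of ten because floating-point division can land just below the true digit (for example, 0.3 / 0.1 evaluates to slightly less than 3).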
5. (Informational) How many hours did you spend on this homework? How many of those hours were spent working alone (as opposed to in a group)?