
# We Helped With This Python Programming Homework: Have A Similar One?

| Category | Programming |
|---|---|
| Subject | Python |
| Difficulty | Undergraduate |
| Status | Solved |
| More Info | Python Help |

## Assignment Description

BIOL 419/519 Homework 4, Winter 2019

Due on Thursday, February 21 at 11:59pm

**About this homework:** This assignment is a bit of a
departure from previous homework assignments. You’ll practice some research and
hacking skills doing a data pull from a few spreadsheets, where things can get
a little hairy. This homework will help you practice doing realistic data pulls
and data exploration from datasets like the ones you may encounter in your
course projects. The skill of figuring out how to do what you want by doing
research on coding is very valuable!

**Instructions:** Submit the Jupyter notebook of your
work. Your notebook solutions will include the code you wrote to solve the
problem as well as the output/answer. Each part of each problem should be in a
separate cell (or multiple cells) with clear comments labeling them, so that
their outputs are easily found by the grader!

**Expectations:** Please seek help if you need it! You
may ask questions at Friday’s lab, come to office hours, and get together with
your classmates to troubleshoot together.

**Collaboration:** As noted
in the Syllabus, what you turn in should reflect your own understanding of the
material. Collaboration with your classmates is encouraged, and I ask you to
clearly indicate these collaborations as comments in your homework.

### Data Description

You will find two spreadsheets with data to download on Canvas. The spreadsheets contain data from every county in the US, one with information related to education and the other with employment data. There’s also some metadata on every county. You are encouraged to open the files in Excel (or similar software) and examine them to familiarize yourself with the organization of the data.

### Assignment

1. (1 pt) **Pull in the data**

Use Pandas to pull in the spreadsheets as two DataFrames. Please do not modify these spreadsheets by hand and re-save them in any way—your code must work with the original spreadsheets as given.

*Hint:* Watch for column labels, and for extra rows in the
spreadsheet that don’t contain actual data. After pulling in the
data, look at the data frame and make sure it’s the right shape and has the
right columns.
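The hint above can be illustrated with a minimal sketch. It uses a small in-memory CSV as a stand-in for the real spreadsheets, so the note rows, column names, and values are all made up for illustration. `pd.read_excel` accepts the same `skiprows` and `dtype` parameters used here with `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a spreadsheet: two note rows sit above the
# real column labels. The actual files will differ.
raw = io.StringIO(
    "Educational attainment for U.S. counties (toy data)\n"
    "Source: made-up values for illustration\n"
    "FIPS Code,State,Area name,Percent bachelor's or higher\n"
    "01001,AL,Autauga County,24.6\n"
    "01003,AL,Baldwin County,29.5\n"
)

# skiprows=2 drops the note rows so the third line becomes the header.
# Reading the FIPS code as a string keeps its leading zero.
df = pd.read_csv(raw, skiprows=2, dtype={"FIPS Code": str})

# Sanity checks suggested by the hint: shape and column labels.
print(df.shape)
print(list(df.columns))
```

Whether to keep FIPS codes as strings or integers is a design choice. What matters for the later merge is that both data frames use the same type.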

2. (2 pt) **Merging and cleaning the data**

Merge the two dataframes by their FIPS county codes. You should end up with a single data frame with all the data about education and employment for each county.

Next, clean up this data frame by dropping any rows that have any missing data (how does missing data show up in the data frame?). Some counties are missing some data for some of the years, so we’re just going to ignore them for now. This step also has the side-effect of getting rid of rows that contain summaries of each state as a whole.

What is the shape of the final data frame you end up with after merging and cleaning?

*Hint:* You should look into how to merge data frames in Pandas.
Pay attention to column names.
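As a rough sketch of the merge-then-clean step, the example below uses two toy data frames. The column names (`FIPS Code`, `FIPS_Code`, `pct_bachelors`, `unemployment_rate`) are assumptions chosen to show what happens when the key column is spelled differently in each table. The real spreadsheets may use other labels.

```python
import numpy as np
import pandas as pd

# Toy education table. FIPS 1000 plays the role of a state summary row
# with a missing value.
education = pd.DataFrame({
    "FIPS Code": [1000, 1001, 1003, 1005],
    "State": ["AL", "AL", "AL", "AL"],
    "pct_bachelors": [np.nan, 24.6, 29.5, 12.9],
})

# Toy employment table whose key column has a slightly different name.
employment = pd.DataFrame({
    "FIPS_Code": [1000, 1001, 1003, 1007],
    "unemployment_rate": [5.9, 5.1, np.nan, 6.4],
})

# An inner merge keeps only counties present in both tables.
merged = education.merge(
    employment, left_on="FIPS Code", right_on="FIPS_Code", how="inner"
)

# Missing values appear as NaN; dropna() removes any row containing one.
cleaned = merged.dropna()

print(merged.shape, cleaned.shape)
```

Here missing data shows up as `NaN`, and `merged.isna().sum()` is a quick way to see how many values are missing in each column before dropping rows.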

3. (7 pts) **Visualizing distributions**

As always, when making plots of data, be sure to label the axes and use clear legends when necessary.

(a) (1 pt) Make 4 histograms visualizing the distributions of the percent of adults with less than a high school diploma in 2000, one each for counties in the states of California, Georgia, New York, and Washington. Use the same bins for these histograms so they are directly comparable.
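One way to keep histograms comparable is to define the bin edges once and pass the same array to every panel. The sketch below does this with randomly generated toy data. The column names `State` and `pct_less_than_hs` are hypothetical, and the bin range of 0 to 50 is an assumption that should be adjusted to the real data.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Toy data: 50 made-up counties per state.
rng = np.random.default_rng(0)
states = ["CA", "GA", "NY", "WA"]
toy = pd.DataFrame({
    "State": np.repeat(states, 50),
    "pct_less_than_hs": rng.uniform(5, 45, size=200),
})

# 20 equal-width bins, defined once and shared by every panel.
bins = np.linspace(0, 50, 21)

fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(8, 6))
for ax, state in zip(axes.flat, states):
    values = toy.loc[toy["State"] == state, "pct_less_than_hs"]
    ax.hist(values, bins=bins)
    ax.set_title(state)
    ax.set_xlabel("Adults with less than a high school diploma, 2000 (%)")
    ax.set_ylabel("Number of counties")
fig.tight_layout()
```

Sharing the axes with `sharex` and `sharey` is optional, but it makes differences in spread and height easier to read across panels.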

(b) (1 pt) Make a scatter plot of the percentage of adults with a bachelor’s degree or higher in the year 1980 versus the years 2012–2016. What is the distribution of the change in this percentage between these years? Plot the distribution and label its mean and median.
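A possible layout for this part is a scatter plot next to a histogram of the per-county change, with vertical lines marking the mean and median. The sketch below uses synthetic values, and the column names `pct_bach_1980` and `pct_bach_2012_2016` are placeholders rather than the spreadsheet's actual labels.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic percentages for 300 made-up counties.
rng = np.random.default_rng(1)
pct_1980 = rng.uniform(5, 30, size=300)
toy = pd.DataFrame({
    "pct_bach_1980": pct_1980,
    "pct_bach_2012_2016": pct_1980 + rng.normal(8, 4, size=300),
})

# Change in percentage points for each county.
change = toy["pct_bach_2012_2016"] - toy["pct_bach_1980"]

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

ax_scatter.scatter(toy["pct_bach_1980"], toy["pct_bach_2012_2016"], s=8)
ax_scatter.set_xlabel("Bachelor's or higher, 1980 (%)")
ax_scatter.set_ylabel("Bachelor's or higher, 2012-2016 (%)")

ax_hist.hist(change, bins=30)
ax_hist.axvline(change.mean(), color="C1",
                label=f"mean = {change.mean():.1f}")
ax_hist.axvline(change.median(), color="C2", linestyle="--",
                label=f"median = {change.median():.1f}")
ax_hist.set_xlabel("Change in percentage points")
ax_hist.set_ylabel("Number of counties")
ax_hist.legend()
fig.tight_layout()
```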

(c) (1 pt) Each county is labeled with a rural-urban continuum code,
which is a number between 1 and 9. On the same plot, visualize the
distributions of the percentage of adults with a bachelor’s degree or higher in
2012–2016 for the *most urban* counties (1’s) and the *least urban* counties
(9’s).

(d) (2 pts) Make a plot of the mean and standard deviation of the “median household income in 2016” for counties grouped by their rural-urban continuum code. To be specific, the horizontal axis has the numbers 1 through 9, and the vertical axis has the median household income.

*Hint:* You can use the **groupby** functionality in Pandas.
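A minimal sketch of the `groupby` approach, using a toy table with only three of the nine codes: the column names `rucc` and `median_income_2016` are hypothetical. Error bars are one reasonable way to show the standard deviation around each group mean.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: three counties for each of three continuum codes.
toy = pd.DataFrame({
    "rucc": [1, 1, 1, 2, 2, 2, 9, 9, 9],
    "median_income_2016": [70000, 80000, 90000,
                           50000, 60000, 70000,
                           30000, 40000, 50000],
})

# One row per code, with the mean and sample standard deviation.
stats = toy.groupby("rucc")["median_income_2016"].agg(["mean", "std"])
print(stats)

fig, ax = plt.subplots()
ax.errorbar(stats.index, stats["mean"], yerr=stats["std"],
            fmt="o", capsize=4)
ax.set_xlabel("Rural-urban continuum code")
ax.set_ylabel("Median household income, 2016 (USD)")
```

Note that pandas computes the sample standard deviation (`ddof=1`) by default, which differs slightly from NumPy's default of `ddof=0`.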

(e) (2 pts) Make a scatter plot of the unemployment rate in 2016 versus percentage of adults with bachelor’s degree in 2012–2016. Use one color for counties that are more urban (1–3) and another color for counties that are less urban (4–9).
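Coloring by group can be done with a boolean mask and two calls to `scatter`, one per group, so that each gets its own legend entry. The sketch below uses made-up values and hypothetical column names (`rucc`, `pct_bach`, `unemp`).

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data: six made-up counties.
toy = pd.DataFrame({
    "rucc": [1, 2, 3, 4, 6, 9],
    "pct_bach": [35.0, 28.0, 22.0, 18.0, 15.0, 12.0],
    "unemp": [3.8, 4.2, 4.9, 5.5, 6.1, 7.0],
})

# True for the more urban codes (1-3), False for the less urban ones (4-9).
more_urban = toy["rucc"] <= 3

fig, ax = plt.subplots()
ax.scatter(toy.loc[more_urban, "pct_bach"], toy.loc[more_urban, "unemp"],
           color="C0", label="More urban (codes 1-3)")
ax.scatter(toy.loc[~more_urban, "pct_bach"], toy.loc[~more_urban, "unemp"],
           color="C1", label="Less urban (codes 4-9)")
ax.set_xlabel("Adults with a bachelor's degree, 2012-2016 (%)")
ax.set_ylabel("Unemployment rate, 2016 (%)")
ax.legend()
```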

4. (Extra Credit) **Benford’s Law**

There’s a fun phenomenon known as Benford’s Law about the distribution of leading digits in real-life sets of numerical data. Roughly speaking, the observation is that the leading digits of large sets of real-life numbers are distributed logarithmically, such that 1’s are more common than 2’s, which are more common than 3’s, etc.

You can look up Benford’s Law and read more about it. One interesting application of Benford’s Law is in fraud detection in accounting (for instance, read this article https://www.wsj.com/articles/accountants-increasingly-use-data-analysis-to-catch-fraud-1417804886).

Let’s see how well our employment and education data follows Benford’s Law. Take all the numerical data in your data frame (without regard to what they are), extract the leading digits of all the numbers, and plot a histogram of their distribution. Does this distribution follow Benford’s Law? Describe your observations and comments.
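One possible way to extract leading digits is sketched below, under the assumption that zeros and missing values are skipped and signs are ignored. The helper `leading_digits` and the toy table are made up for illustration. The expected frequencies use the standard Benford formula $P(d) = \log_{10}(1 + 1/d)$ for $d = 1, \dots, 9$.

```python
import numpy as np
import pandas as pd


def leading_digits(frame):
    """Return the leading digit of every nonzero, finite numeric value."""
    numeric = frame.select_dtypes(include="number")
    values = numeric.to_numpy(dtype=float).ravel()
    values = np.abs(values[np.isfinite(values) & (values != 0)])
    # Scientific notation puts the leading digit first,
    # e.g. 0.042 is formatted as "4.200000e-02".
    return pd.Series([int(f"{v:e}"[0]) for v in values])


# Toy table mixing text, integers, floats, a zero, and a missing value.
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "population": [1234, 98, 20517, 0],
    "rate": [0.042, 3.9, np.nan, 17.5],
})

digits = leading_digits(toy)

# Observed share of each digit next to the Benford prediction.
observed = digits.value_counts(normalize=True).reindex(
    range(1, 10), fill_value=0.0
)
expected = pd.Series(np.log10(1 + 1 / np.arange(1, 10)), index=range(1, 10))
print(pd.DataFrame({"observed": observed, "expected": expected}))
```

With only six values the toy comparison is not meaningful. The same two series can be drawn as side-by-side bars once the full data frame is used.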

5. (Informational) How many hours did you spend on this homework? How many of those hours were spent working alone (as opposed to in a group)?
