We Helped With This R Programming Assignment: Have A Similar One?

| Category | Programming |
|---|---|
| Subject | R, R Studio |
| Difficulty | College |
| Status | Solved |
| More Info | Statistics Help Online |
Short Assignment Requirements
Assignment Code
## <img src="https://datasciencedegree.wisconsin.edu/wp-content/themes/data-gulp/images/logo.svg" width="300">
#
#
# # Assignment 7
# ## Problem 1. Introduction
#
# The first problem has you write some functions in order to be able to do statistics on arbitrary text. First, we'll write a function determining the length of each word in a given sentence. Second, we'll apply that function to some given text. Third, we'll use it to solve a larger problem: determine properties of each sentence in a Jane Austen book.
#
# ## Problem 1(a). Word length
#
# ? Write a function called ```word_length_list()``` which takes a string and returns a list with the length of each word in the string.
#
# For each word, count the number of English, alphanumeric characters. Words are defined as text separated by spaces. Your function should ignore punctuation. For example, ```word_length_list("Haven't you eaten 8 oranges today?")``` should return ```[6,3,5,1,7,5]```.
#
# * Call or create other functions as necessary to organize your work.
# * Write your own code to do this from first principles. This means using Python built-in functions for splitting text, checking whether characters are punctuation, etc.
# * Do *not* use Python packages (like nltk) or code directly copied from online resources (such as regular expressions for splitting text) in order to divide sentences.
def word_length_list(s):
    # leave only valid characters in the string - this version does not handle '\n',
    # so the string must be cleaned of '\n' before using this function
    s = valid_chars(s)
    counts = []
    # split the string into words and count the letters of each, adding to the list
    for word in s.split(' '):
        counts.append(len(word))
    return counts

def valid_chars(s):
    # keep only letters, numbers and spaces in a string
    new_s = ''
    valid_c = ' 0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
    for c in s:
        if c in valid_c:
            new_s = new_s + c
    return new_s
# this line will run clean when you have solved the problem.
assert(word_length_list("Haven't you eaten 8 oranges today?")==[6,3,5,1,7,5])
# be sure to restart the kernel and run all cells before committing.
# this ensures the most recent state is what you think it is.
# ## Problem 1(b). Application -- "A Mourner."
#
# The text below is an anonymous essay published in The Boston Gazette and Country Journal on January 8, 1770.
#
# >The general Sympathy and Concern for the Murder of the Lad by the base and infamous Richardson on the 22d Instant, will be sufficient Reason for your Notifying the Public that he will be buried from his Father’s House in Frogg Lane, opposite Liberty-Tree, on Monday next, when all the Friends of Liberty may have an Opportunity of paying their last Respects to the Remains of this little Hero and first Martyr to the noble Cause--Whose manly Spirit (after this Accident happened) appear’d in his discreet Answers to his Doctor, his Thanks to the Clergymen who prayed with him, and Parents, and while he underwent the greatest Distress of bodily Pain; and with which he met the King of Terrors. These Things, together with the several heroic Pieces found in his Pocket, particularly Wolfe’s Summit of human Glory, gives Reason to think he had a martial Genius, and would have made a clever Man.
#
# > A Mourner.
#
# (Source: Michael Sullivan, _Statistics: Informed Decisions Using Data_, 4th ed. p. 188-189.)
#
# ? Use your function ```word_length_list()``` from 1(a) to find the length of each word in "A Mourner". (Note that your output should end in . . ., 3, 1, 7].)
#
# ###### Notes
#
# * check out `%pprint`. It turns off printing each list element on a separate row
%pprint
mourner = 'The general Sympathy and Concern for the Murder of the Lad by the base and infamous Richardson on the 22d Instant, will be sufficient Reason for your Notifying the Public that he will be buried from his Father’s House in Frogg Lane, opposite Liberty-Tree, on Monday next, when all the Friends of Liberty may have an Opportunity of paying their last Respects to the Remains of this little Hero and first Martyr to the noble Cause--Whose manly Spirit (after this Accident happened) appear’d in his discreet Answers to his Doctor, his Thanks to the Clergymen who prayed with him, and Parents, and while he underwent the greatest Distress of bodily Pain; and with which he met the King of Terrors. These Things, together with the several heroic Pieces found in his Pocket, particularly Wolfe’s Summit of human Glory, gives Reason to think he had a martial Genius, and would have made a clever Man. A Mourner.'
word_length_list(mourner)
# ## Problem 1(c). _Pride and Prejudice_.
#
# This problem is a bit bigger than parts (a) and (b), and requires some bigger thinking. You'll have to write loops and possibly list comprehensions to solve it!
#
# ? Create a function
# ```collect_statistics```
# to count the number of words and mean length of words in each sentence of *Pride and Prejudice*. We have provided `pride.txt` file in the repo, which is available without restrictions from [Project Gutenberg](https://www.gutenberg.org/).
#
# ###### Suggestions
#
# * You should use the `word_length_list` function from 1(a) in your solution.
# * Create additional functions as necessary to organize your work.
#
# Before you start programming, I suggest you compute the answers by hand, for the first few sentences of *Pride and Prejudice*. Be sure you go far enough in the file to encounter a few anomalies! This will give you a sense of
# * How to solve the problem using a computer. Make the computer do what you did!
# * What the answer looks like, in that you will know the first few terms in the data set you are constructing.
#
# ###### Requirements and notes
#
#
# * A sentence ends with a period, exclamation point, or question mark. A hyphen, dash, or apostrophe does not end a sentence. Quotation marks do not end a sentence. But also, some periods do not end sentences. For example, Mrs., Mr., Dr., Fr., Jr., St., are all commonly occurring abbreviations that almost never end sentences, and they occur enough in Pride and Prejudice that you need to deal with them or your averages will be impacted significantly. An ellipsis sometimes ends a sentence and sometimes does not, but for this assignment you may assume an ellipsis ends a sentence (but note it does not end 3 sentences!)
# * Do *not* use Python packages (like nltk) or code directly copied from online resources (such as regular expressions for splitting text) in order to divide sentences. Write your own code to do this from first principles.
# * The mean length of words in the sample sentence from 1(a) ```"Haven't you eaten 8 oranges today?"``` is 4.5.
#
# ###### Output
#
# * Include comments to explain the purpose and arguments of each function you create.
# * Save your result as a ```.csv``` file and include it with your submission.
def collect_statistics(filename):
    # read the text file
    with open(filename, 'r', encoding="utf8") as myfile:
        s = myfile.read()
    # replace blank lines ('\n\n') with '. ' so paragraphs are counted as separate sentences
    s = s.replace('\n\n', '. ')
    # replace remaining '\n' with a space character
    s = s.replace('\n', ' ')
    # replace ? and ! with .
    # for example the sentence:
    #   “Do you not want to know who has taken it?” cried his wife impatiently.
    # is split in two, as directed in the task by saying:
    #   A sentence ends with a period, exclamation point, or question mark
    s = s.replace('?', '.')
    s = s.replace('!', '.')
    # replace Mrs., Mr., Dr., Fr., Jr., St. with versions without periods
    s = s.replace('Mrs. ', 'Mrs ')
    s = s.replace('Mr. ', 'Mr ')
    s = s.replace('Dr. ', 'Dr ')
    s = s.replace('Fr. ', 'Fr ')
    s = s.replace('Jr. ', 'Jr ')
    s = s.replace('St. ', 'St ')
    # write the results as a comma-separated file, as required
    with open('1c_results.csv', 'w', encoding="utf8") as wfile:
        # write the header row
        wfile.write('word_count_in_sentence,mean_word_length_in_sentence\n')
        # split into sentences on .
        for sentence in s.split('.'):
            # get the lengths of the words in the sentence
            word_lengths = word_length_list(sentence)
            # remove zeros (empty tokens produced by repeated spaces)
            word_lengths = [x for x in word_lengths if x != 0]
            # skip empty sentences
            if len(word_lengths) > 0:
                mean_length = sum(word_lengths) / len(word_lengths)
                wfile.write(str(len(word_lengths)))
                wfile.write(',')
                wfile.write(str(mean_length))
                wfile.write('\n')
collect_statistics('pride.txt')
# ---
#
# ## Problem 2. Introduction
#
# <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d5/English_letter_frequency_%28alphabetic%29.svg/600px-English_letter_frequency_%28alphabetic%29.svg.png" width="300">
#
# Here we will be counting e's in text files, capitalized and uncapitalized, and accented and unaccented.
#
# ? Your code in 2(b) must include a function called
# ```
# count_letter_e(filename, ignore_accents, ignore_case)
# ```
#
# That is, `count_letter_e`:
# * takes a filename, such as ```"pg1342.txt"```, as input, and
# * returns the number of _e_'s as output.
# * includes two optional arguments, ```ignore_accents``` and ```ignore_case```.
# * When ```ignore_accents = True```, your function should count accented characters such as _é_, _ê_, and _è_ as the same as _e_.
# * When ```ignore_case=True```, your function should treat uppercase and lowercase _e_ as the same letter.
# * The function ```count_letter_e()``` should return a *single number*, the total number of all characters that are being treated as equivalent to _e_.
# * Create other functions as necessary to organize your work.
# * Include comments which explain the purpose and arguments of each function you create.
# * The files are encoded as [utf-8](https://en.wikipedia.org/wiki/UTF-8). When reading the files, you may have to specify the encoding.
# ## Problem 2(a). Design a Test Suite
#
# ? Design a test suite (which will be in this case a set of input text) of at least four sentences that will allow you to quickly verify that all four optional argument possibilities are implemented correctly. Make sure that your test suite contains at least one of each of the 8 possible e's (e, é, ê, è, E, É, Ê, È).
#
# ? Save your test suite as a set of text files (1 sentence per text file) for use in your ```count_letter_e()``` function, and also include each test sentence in a different markdown cell below. For each sentence, count each type of e by hand and report (in the markdown cell) what the output should be for the four possible combinations of true and false for ```ignore_case``` and ```ignore_accents```.
#
# *Note that you can complete this portion before you have written a single line of code for your function ```count_letter_e()```.
#
# Test sentence 1:
# * Sentence: eee
# * Count with `ignore_case=True, ignore_accents=True`:3
# * Count with `ignore_case=True, ignore_accents=False`:3
# * Count with `ignore_case=False, ignore_accents=True`:3
# * Count with `ignore_case=False, ignore_accents=False`:3
# Test sentence 2:
# * Sentence: éêè
# * Count with `ignore_case=True, ignore_accents=True`:3
# * Count with `ignore_case=True, ignore_accents=False`:0
# * Count with `ignore_case=False, ignore_accents=True`:3
# * Count with `ignore_case=False, ignore_accents=False`:0
#
# Test sentence 3:
# * Sentence: EEE
# * Count with `ignore_case=True, ignore_accents=True`:3
# * Count with `ignore_case=True, ignore_accents=False`:3
# * Count with `ignore_case=False, ignore_accents=True`:0
# * Count with `ignore_case=False, ignore_accents=False`:0
# Test sentence 4:
# * Sentence: ÉÊÈ
# * Count with `ignore_case=True, ignore_accents=True`:3
# * Count with `ignore_case=True, ignore_accents=False`:0
# * Count with `ignore_case=False, ignore_accents=True`:0
# * Count with `ignore_case=False, ignore_accents=False`:0
# ? Describe how you can use your four test sentences to detect problems in your implementation of `count_letter_e()`. Why do you need four sentences, and not just one, or as many as 16?
# Each test sentence isolates one category of e (plain lowercase, accented lowercase,
# plain uppercase, accented uppercase). Running all four argument combinations on each
# sentence (16 calls in total) shows exactly which branch of the code mishandles which
# category. A single mixed sentence could show that a count is wrong but not which
# category caused it, while 16 separate sentences would only repeat the same information.
#
# ## Problem 2(b). Create and Test Your Code
#
# ? Create the functions described in Problem 2 Introduction, and apply them to your test suite from 2(a). Do *not* use Python packages or code directly copied from online resources. Write your own functions from first principles.
#
# ? Print the results of applying the four combinations of optional arguments of your ```count_letter_e()``` function to your test suite. Verify that the output is correct. (If it isn't, modify your code until your function works correctly on your test suite.)
def count_letter_e(filename, ignore_accents=True, ignore_case=True):
    # also accept a plain string instead of a filename, so the test sentences
    # from 2(a) can be passed directly
    if filename[-4:] == '.txt':
        # read the text file
        with open(filename, 'r', encoding="utf8") as myfile:
            s = myfile.read()
    else:
        s = filename
    count_e = 0
    # ignore_case=False, ignore_accents=False
    if not ignore_accents and not ignore_case:
        count_e = s.count('e')
    # ignore_case=True, ignore_accents=False
    elif not ignore_accents and ignore_case:
        count_e = s.count('e') + s.count('E')
    # ignore_case=False, ignore_accents=True
    elif ignore_accents and not ignore_case:
        count_e = s.count('e') + s.count('é') + s.count('ê') + s.count('è')
    # ignore_case=True, ignore_accents=True
    elif ignore_accents and ignore_case:
        count_e = (s.count('e') + s.count('é') + s.count('ê') + s.count('è')
                   + s.count('E') + s.count('É') + s.count('Ê') + s.count('È'))
    return count_e
sentence1 = 'eee'
print(sentence1)
print(count_letter_e(sentence1, True, True))
print(count_letter_e(sentence1, False, True))
print(count_letter_e(sentence1, True, False))
print(count_letter_e(sentence1, False, False))
sentence2 = 'éêè'
print(sentence2)
print(count_letter_e(sentence2, True, True))
print(count_letter_e(sentence2, False, True))
print(count_letter_e(sentence2, True, False))
print(count_letter_e(sentence2, False, False))
sentence3 = 'EEE'
print(sentence3)
print(count_letter_e(sentence3, True, True))
print(count_letter_e(sentence3, False, True))
print(count_letter_e(sentence3, True, False))
print(count_letter_e(sentence3, False, False))
sentence4 = 'ÉÊÈ'
print(sentence4)
print(count_letter_e(sentence4, True, True))
print(count_letter_e(sentence4, False, True))
print(count_letter_e(sentence4, True, False))
print(count_letter_e(sentence4, False, False))
# ## Problem 2(c). Apply Your Code
#
# ? Apply your code from 2(b) to the two provided `.txt` files for _Pride and Prejudice_ and _L'Enlèvement de la redoute_.
# ? For each file, print the output of all four combinations of the boolean arguments to `count_letter_e`.
#
print(count_letter_e('pride.txt', True, True))
print(count_letter_e('pride.txt', True, False))
print(count_letter_e('pride.txt', False, True))
print(count_letter_e('pride.txt', False, False))
print(count_letter_e("l'enlevement.txt", True, True))
print(count_letter_e("l'enlevement.txt", True, False))
print(count_letter_e("l'enlevement.txt", False, True))
print(count_letter_e("l'enlevement.txt", False, False))
Assignment Description
Assignment 8
Insert your name here
Load necessary packages here.
Problem 1: Analyzing A Mourner
Can we use statistical analysis of word lengths to identify the author of an anonymous essay? In Homework 7, you wrote a Python function that counted the lengths of words in the 1770 essay by “A Mourner”. Analysis of other articles published in The Boston Gazette and Country Journal in early 1770 finds that John Hancock wrote a 121-word article with a mean word length of 4.69 and standard deviation of 2.60.
a. We want to use R to assess whether it is plausible that John Hancock was A Mourner, based on his mean word length. Explain why a 2-sided, 2-sample t-test is appropriate for this.
b. Explain why the t.test() function is not appropriate for the data we have available.
c. Write your own function for performing a 2-sided, 2-sample t-test for equality of means when the raw data are not available. Use the information provided in T-test formulas.pdf (reproduced at the end of this page, followed by an illustrative R sketch).
• Use additional functions as needed to organize your work.
• Your function(s) should not use any variables from the global environment.
d. Test your function by comparing it to t.test() on a pair of samples. You may wish to use rnorm() to generate random data from a normal distribution. If the p-value from your function doesn’t match the p-value from t.test(), then revise your code from part c.
e. Apply your function to assess whether it is plausible that Hancock was A Mourner.
Write your conclusion as a sentence.
• Note: The null hypothesis for a 2-sample t-test of this question is H_0: mu_Mourner = mu_Hancock i.e., that A Mourner and Hancock have the same mean word length. In other words, the null hypothesis is that it is plausible that Hancock was A Mourner.
Problem 2: Identifying the language of an encrypted text
Problem overview
In Homework 5, you counted the frequencies of letters in two encrypted texts. In this problem, you will use statistical analysis to identify the language in which the text was written, and decrypt it.
Here’s the basic idea: Suppose that the language FakeEnglish has just 2 letters, E and S, with E occurring 55% of the time and S occurring 45% of the time. Also, suppose that the language FakeWelsh also has just 2 letters, A (occurring 90% of the time) and M (occurring 10% of the time). Suppose your encrypted text uses the letter V 430 times and the letter X 570 times. Which language do you think it came from?
The encrypted text probably came from FakeEnglish, because the frequencies of each letter (43% and 57%) are much closer to the frequencies in FakeEnglish than to FakeWelsh. We can also say that the encrypted letter X probably represents the FakeEnglish letter E, and encrypted letter V probably represents FakeEnglish letter S. It doesn’t matter that V and X don’t occur in FakeEnglish or FakeWelsh, because the encrypted text is encrypted–it uses different letters to represent each letter in the language it came from.
So, our overall strategy to identify the language of each text will be as follows:
1. Put the encrypted letter frequencies in order of increasing frequency. We will guess that the most common letter in the encrypted text represents the most common letter in the real language (English or Welsh), the 2nd-most common letter represents the 2nd-most common letter, and so on. This is just like our guess in the example above, that X probably represents E.
2. Use a chi-squared goodness-of-fit test to test whether the frequencies in the encrypted data are consistent with the proportions in English or Welsh.
• You may need to combine some letter categories to satisfy the assumptions of the chi-squared goodness-of-fit test.
Tasks to complete
a. The file Letter Frequencies.csv contains data on the frequencies of letters in different languages. (Source: http://www.sttmedia.com/characterfrequency-english and http://www.sttmedia.com/characterfrequency-welsh, accessed 21 August 2015. Used by permission of Stefan Trost.) Read these data into R.
b. Make bar graphs of the frequencies in English and Welsh. Use the code
mutate(Letter = reorder(Letter, English))
(and similarly for Welsh) to sort the bars in increasing order of letter frequency.
c. Read the letter frequencies from encryptedA into R. Make a barplot of the letter frequencies, with the letters listed in order of increasing frequency.
d. Based on the shape of the plots in parts b and c, which language do you think encryptedA came from? Explain.
(Note: The order of the letters along the horizontal axis of each plot will be quite different, because the plots from part b show the frequencies in plain English or plain Welsh, and the plot from part c shows the frequencies in the encrypted text. So, you should ignore what letter is written below each bar when answering this question. Instead, look at things like how steeply the bars grow from the least-common letter to the most-common letter.)
e. Now that we have a visual understanding of the data, we will proceed with a hypothesis test. Start by putting the frequencies of letters in English in increasing order, and saving the results in a variable (either the same data frame or a new vector). Display the first few entries of that variable to verify that it is in increasing order.
• If you are using dplyr, the function arrange may be useful.
• If you are using the base R installation, the function sort may be useful.
f. Next, put the letter frequencies of encryptedA in increasing order, and save the results in a variable (either the same data frame or a new vector). Display the first few entries of that variable to verify that it is in increasing order.
• Note that homework 5 asked you to include all 26 letters in the frequency file (even if some letters had a frequency of 0) and no punctuation. Verify that you have exactly 26 frequencies of letters in encryptedA.
g. Write the null and alternative hypotheses for a chi-squared Goodness of Fit test of this question.
h. Use R to conduct the chi-squared Goodness of Fit test, and store the results in the variable test (a minimal sketch of this step appears after this task list).
i. View the contents of test$expected.
Notice that some of the expected frequencies are below the threshold for the chi-squared test to be appropriate. Use the function you wrote in Homework 3, problem 2e to combine the frequencies in LetterFreqs$English so that the values in test$expected are greater than or equal to the threshold. Also combine counts of letters from encryptedA.txt to correspond with making the values in test$expected be greater than or equal to the threshold.
• Note that all three of the vectors LetterFreqs$English, test$expected, and encryptedA$count should be in increasing order.
• After the due date for Homework 3 has passed and you have submitted your own work for Homework 3, you are welcome to view your classmates’ pull requests for Homework 3 to see how they solved problem 2e.
j. Repeat the chi-squared goodness-of-fit test with your combined-category data.
• If you still get the warning message, “Chi-squared approximation may be incorrect,” one of two things has happened:
1. You did not combine enough categories in step i, or
2. You are using the wrong syntax for the chi-squared Goodness of Fit test.
– Check that the degrees of freedom (df) are 1 less than the number of categories you used. If the degrees of freedom are > 100, then double-check the syntax demonstrated in the Goodness of Fit video.
• If either of these things is true, your results will not be reliable.
k. Write your conclusion in the context of the problem.
• Note that the null hypothesis is that the observed counts of the most-frequent letter, 2nd-most frequent letter, etc. are consistent with the theoretical frequencies. Therefore, the null hypothesis is that the text is an encrypted piece of writing in English.
l. Repeat steps h-k for Welsh, and then repeat for both languages for encryptedB. (It may help to use functions or for loops to organize your code.) Fill in the p-values you get in place of the ???? in the following table:
| Text | English | Welsh |
|---|---|---|
| EncryptedA | 0.???? | 0.???? |
| EncryptedB | 0.???? | 0.???? |
m. Based on the hypothesis tests, which text do you think came from which language?
• This should be a reasonably clear decision. If all 4 of your p-values are near 2*10^(-16), or all 4 are near 0.5, double-check your work in steps h-j.
n. Optional: Try to decrypt the English text. Simon Singh’s Black Chamber website (http://www.simonsingh.net/The_Black_Chamber/substitutioncrackingtool.html) will automatically substitute letters for you, so you can test different possibilities for what English plaintext letter is represented by each letter in the ciphertext. Start by substituting the letter E for the most common letter in the ciphertext. Then use frequencies of letters in the ciphertext, common patterns of letters, and experimentation to determine other substitutions.
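Below is a minimal R sketch, not a complete or graded solution, of how steps b and e-j might fit together for encryptedA and English. The file name `encryptedA_counts.csv` and the column names `Letter`, `English`, and `count` are assumptions; substitute whatever your Homework 5 output and Letter Frequencies.csv actually use, and combine categories with your own Homework 3 function.

```r
library(dplyr)
library(ggplot2)

# step a: letter frequencies in English and Welsh
letter_freqs <- read.csv("Letter Frequencies.csv")

# step b (illustrative): bars sorted in increasing order of English frequency
letter_freqs %>%
  mutate(Letter = reorder(Letter, English)) %>%
  ggplot(aes(x = Letter, y = English)) +
  geom_col()

# step c: counts of each letter in the encrypted text
# (hypothetical file name; use whatever you produced in Homework 5)
encryptedA <- read.csv("encryptedA_counts.csv")

# steps e-f: put both sets of frequencies in increasing order
english_sorted <- letter_freqs %>% arrange(English)
counts_sorted  <- encryptedA %>% arrange(count)
head(english_sorted)
head(counts_sorted)

# step h: goodness-of-fit test of the sorted encrypted counts against the sorted
# English proportions; rescale.p = TRUE rescales the proportions so they sum to 1
# in case of rounding in the source file
test <- chisq.test(counts_sorted$count,
                   p = english_sorted$English,
                   rescale.p = TRUE)

# step i: inspect expected counts; if some fall below the usual threshold,
# combine the smallest categories in BOTH vectors (Homework 3, problem 2e)
# and rerun chisq.test() on the combined data (step j)
test$expected
test$p.value
```

Repeating the same calls with the Welsh column and with encryptedB fills in the table in step l.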
Assignment Description
Formulas for a 2-sided, 2-sample t-test
• The test statistic (a preliminary quantity needed to compute the p-value) is

    t = (xbar_1 - xbar_2) / SE

where xbar_1, xbar_2 are the sample means of the two samples, and SE is the standard error (similar to standard deviation, but for [in this case] the difference of sample means instead of the raw data).
• The standard error is

    SE = sqrt(s_1^2/n_1 + s_2^2/n_2)

where n_1, n_2 are the sample sizes and s_1, s_2 are the sample standard deviations. (This approach uses Welch’s t-test, which does not assume that the two populations have equal variances.)
• The p-value is

    p = 2 * pt(-|t|, df)

where -|t| is the negative absolute value of t, the test statistic, df is the number of degrees of freedom, and pt() is the R function pt().
• The degrees of freedom are

    df = (s_1^2/n_1 + s_2^2/n_2)^2 / ( (s_1^2/n_1)^2/(n_1 - 1) + (s_2^2/n_2)^2/(n_2 - 1) )
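As a rough illustration only, the function below translates these formulas into R for Problem 1(c); the name `welch_t_summary` and its arguments are made up for this sketch and are not part of the assignment. The last two lines mirror the part (d) check against t.test() on simulated data.

```r
# Illustrative helper (name and arguments are hypothetical) implementing the
# formulas above from summary statistics only: sample means, sample standard
# deviations, and sample sizes of the two groups.
welch_t_summary <- function(mean1, sd1, n1, mean2, sd2, n2) {
  se <- sqrt(sd1^2 / n1 + sd2^2 / n2)                  # standard error of the difference
  t_stat <- (mean1 - mean2) / se                       # test statistic
  df <- (sd1^2 / n1 + sd2^2 / n2)^2 /
    ((sd1^2 / n1)^2 / (n1 - 1) + (sd2^2 / n2)^2 / (n2 - 1))  # Welch-Satterthwaite df
  p_value <- 2 * pt(-abs(t_stat), df)                  # 2-sided p-value
  list(t = t_stat, df = df, p.value = p_value)
}

# Part (d)-style check: on raw samples the result should agree with t.test(),
# which performs a Welch 2-sample t-test by default
set.seed(1)
x <- rnorm(30, mean = 5, sd = 2)
y <- rnorm(40, mean = 4.5, sd = 2.5)
welch_t_summary(mean(x), sd(x), length(x), mean(y), sd(y), length(y))$p.value
t.test(x, y)$p.value   # should match the line above
```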