We Helped With This Python Programming Homework: Have A Similar One?

Category: Programming
Subject: Python
Difficulty: College
Status: Solved
More Info: Python Homework

Short Assignment Requirements

I want it done for Austin, TX, with all bonus points and comments completed. Python 3 version.

Assignment Description

Data Foundations Homework

Web Scraping & Crawling

 

PART 1:

 

Instructions:

Write a Python script that will scrape Craigslist for items for sale of a type you specify from a city you select, other than San Antonio. Here are the expectations:

·         You may hard-code the item type and city.

·         You may not knowingly pick the same city and item as another student.

·         You must scrape the date the item was posted to Craigslist, its location, the full description of the item, and its price.

·         All items for sale on the page must be included in your output, even if some of the attributes are not provided (e.g., an item for sale that does not include a location).

·         The script must write all scraped data in a nicely formatted CSV file, having a descriptive header row and properly delimited data/columns.

 

Required: Scrape the 1st page returned from the query for the item and the city. (50 points)

Bonus: Scrape all the pages returned from the query for the item and the city. You may NOT hard-code the number of pages returned. Your script must scrape all the pages returned without knowing, in advance, how many pages there will be. (25 points)
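
A minimal sketch of one way to approach the required and bonus parts is shown below. It is not the graded solution: it assumes Austin, TX (per the customer note), a hypothetical hard-coded search term ("bicycle"), and the older Craigslist results markup (li.result-row, span.result-price, a.button.next, etc.), which may have changed since this assignment was written.

import csv
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

BASE = "https://austin.craigslist.org"   # hard-coded city (assumed per customer note)
QUERY = "bicycle"                        # hard-coded item type (hypothetical)

def fetch(url):
    # Download a page and return its parsed HTML tree.
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return BeautifulSoup(resp.read(), "html.parser")

def parse_rows(soup):
    # Yield (date, location, description, price) for every listing on one page.
    # Missing attributes become empty strings so every item still appears.
    for row in soup.find_all("li", class_="result-row"):
        date = row.find("time", class_="result-date")
        hood = row.find("span", class_="result-hood")
        title = row.find("a", class_="result-title")
        price = row.find("span", class_="result-price")
        yield (
            date.get("datetime", "") if date else "",
            hood.get_text(strip=True).strip("()") if hood else "",
            title.get_text(strip=True) if title else "",
            price.get_text(strip=True) if price else "",
        )

url = BASE + "/search/sss?" + urllib.parse.urlencode({"query": QUERY})
with open("yourlastname_HW4_ScrapedData.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date_posted", "location", "description", "price"])  # header row
    while url:                                      # bonus: follow pagination links
        soup = fetch(url)
        writer.writerows(parse_rows(soup))
        nxt = soup.select_one("a.button.next")      # "next page" link, if present
        url = urllib.parse.urljoin(url, nxt["href"]) if (nxt and nxt.get("href")) else None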

 

Turn In:

1.      All Python code in a single file named yourlastname_HW4_WebScrapePart1.py, well commented to identify the intended functionality of each code segment.

2.      Your output CSV file, named yourlastname_HW4_ScrapedData.csv, properly formatted per the instructions above.

 

PART 2:

 

Instructions:

Comment the script written by Eric Bachura named HW4_part2_and_bonus.py. Turn in a commented version of the script. Ideal comments will indicate what the code is for and, if it is a function, what the function does, what it takes in as input (if anything), and what it provides as output (if anything).
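
For reference, comments of the expected depth for a small hypothetical helper (not part of Mr. Bachura's script) might look like this:

# Counts how often each word occurs in a piece of text.
# Input:  text (str) - the raw text to analyse
# Output: a dict mapping each lower-cased word to its frequency
def word_frequencies(text):
    freqs = {}
    for word in text.lower().split():
        freqs[word] = freqs.get(word, 0) + 1
    return freqs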

 

Required: Code segments marked #REQUIRED. (50 points)

Bonus: Code segments marked #BONUS. (50 points)

 

Turn In:

1.      A commented version of Mr. Bachura’s code in a single file named yourlastname_HW4_WebScrapePart2.py

 

Assignment Code


"""
Created on Wed Nov 30 @way too late in the evening O'clock 2017

@author: Eric
"""
# INSTRUCTIONS: In the code below you will find each code
# section has either a "BONUS" or a "REQUIRED" comment tag
# at the front. If it has a "REQUIRED" comment tag, then it is
# part of the homework assignment and you must provide comments
# interpreting that portion of the code. Ideal comments will
# indicate what the code is for and, if it is a function, what
# the function does, what it takes in as input (if anything) and what
# it provides as output (if anything)
# The "BONUS" sections carry the same comment requirements but
# are NOT REQUIRED for a full score...however, they allow for
# extra points. The "BONUS" sections are ones you may not be familiar with.
# HINT: duckduckgo.com is YOUR FRIEND, there is NO SHAME in using
# any resources available to you to UNDERSTAND something.

# REQUIRED
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re

# BONUS
from bs4.element import Comment
from string import ascii_lowercase
import random

# REQUIRED
def ensure_absolute(url):
    if bool(urllib.parse.urlparse(url).netloc):    
        return url
    else:
        return urllib.parse.urljoin(start,url)

# REQUIRED
def ensure_urls_good(urls):
    result = []
    basenetloc = urllib.parse.urlparse(start).netloc
    for url in urls:
        url = ensure_absolute(url)
        path = urllib.parse.urlparse(url).path
        netloc = urllib.parse.urlparse(url).netloc
        query = urllib.parse.urlparse(url).query
        fragment = urllib.parse.urlparse(url).fragment
        param = urllib.parse.urlparse(url).params
        if (netloc == basenetloc and re.match(r'^/wiki/', path) and query == '' and fragment == '' and param == ''):
            result += [url]
    return result

# REQUIRED
def getsource(url):
    req=urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'}) #builds a GET request for the URL with a browser-like User-Agent header
    uClient=urllib.request.urlopen(req)
    page_html=uClient.read() #reads returned data and puts it in a variable
    uClient.close() #close the connection
    page_soup=BeautifulSoup(page_html,"html.parser")
    return [page_soup, page_html]

# REQUIRED
def getanchors(pagesoup):
    result = []
    for anchor in pagesoup.find('div', {"id":'bodyContent'}).findAll('a'):
        result += [anchor.get('href')]
    result = ensure_urls_good(result)
    return result

# BONUS
def tag_visible(element):
    if element.parent.name in ['style', 'script', 'head', 'title', 'meta', '[document]']:
        return False
    if isinstance(element, Comment):
        return False
    return True

# BONUS
def text_from_html(page_html):
    soup = BeautifulSoup(page_html, 'html.parser')
    texts = soup.findAll(text=True)
    visible_texts = filter(tag_visible, texts)  
    return u" ".join(t.strip() for t in visible_texts)

# BONUS
def count_letters(texts):
    alphabet = {}
    for letter in ascii_lowercase:
        alphabet[letter] = texts.count(letter)
    return alphabet

# BONUS
def count_ngrams(texts, n):
    ngrams = {}
    grams = []
    pattern = re.compile(r"[\w+]|([a-zA-Z]+'{0,1}[a-zA-Z]+)")
    for m in re.finditer(pattern, texts):
        if (str(m.group(1)) != 'None'):
            if (len(grams) < n):
                grams += [m.group(1)]
            else:
                ngram = ' '.join(grams);
                if (ngram in ngrams):
                    ngrams[ngram] = ngrams[ngram]+1
                else:
                    ngrams[ngram] = 1
                grams = grams[1:]
                grams += [m.group(1)]
    return ngrams

# BONUS
def combinedicts(dict1,dict2):
    result = { k: dict1.get(k, 0) + dict2.get(k, 0) for k in set(dict1) | set(dict2) }
    return result

# REQUIRED
def write_dict_to_csv(fname,header,data):
    f=open(fname,'w')
    f.write(header)
    f.write('\n')
    for item in sorted(data, key=lambda i: int(data[i]), reverse=True):
        f.write(str(item)+','+str(data[item]))
        f.write('\n')
    f.close()
    return

# REQUIRED
def crawl(url, limit):
    result1 = {}
    result2 = {}
    pagedata = getsource(url)
    anchors = getanchors(pagedata[0])
    for i in range(0,limit):
        secure_random = random.SystemRandom()
        random_url = secure_random.choice(anchors)
        pagedata = getsource(random_url)
        texts = text_from_html(pagedata[1]).lower()
        letterfreqs = count_letters(texts)
        ngramfreqs = count_ngrams(texts, desired_ngram_level)
        anchors = getanchors(pagedata[0])
        if len(result1) > 1:
            result1 = combinedicts(result1,letterfreqs)
        else:
            result1 = letterfreqs
        if len(result2) > 1:
            result2 = combinedicts(result2,ngramfreqs)
        else:
            result2 = ngramfreqs
    return [result1, result2]

# REQUIRED
start="https://en.wikipedia.org/wiki/Special:Random"

# REQUIRED
pagestocrawl = 20

# BONUS
desired_ngram_level = 2

# REQUIRED
freqs = crawl(start,pagestocrawl)
write_dict_to_csv('letter_freqs.csv','letter,frequency',freqs[0])
write_dict_to_csv('ngram_freqs.csv','ngram,frequency',freqs[1])

