Short Assignment Requirements
The objective of this project is to build a Bayesian classifier that predicts the sentiment of tweets.
Assignments should be submitted to Blackboard before 10pm on Tuesday, October 23rd.
The distribution of marks is as follows:
1. Naive Bayes Algorithm 50%
2. A Basic Evaluation 10%
3. Research and Detailed Evaluation 40%
Parts 2 and 3 above will be submitted as a report along with your Python code from part 1.
The report should consist of two parts:
(i) The basic evaluation should describe the basic methods you have employed for cleaning
the dataset (for example, converting everything to lower-case, removal of punctuation, etc). It should also provide an account of the performance of the model and how it was impacted by the basic methods of cleaning the data.
(ii) The research and detailed evaluation of the algorithm should investigate the impact of
more advanced pre-processing techniques on the classification accuracy of your Naïve Bayes classifier. Remember you should test your algorithm using data that was not used to train the algorithm in the first place. The research element allows you to explore and report on various efforts you have made to improve the classification accuracy of the algorithm.
Naive Bayes classifiers are among the most successful known algorithms for learning to classify text documents. The primary technical objective of this assignment is to provide an implementation of a Multinomial Naive Bayes learning algorithm in Python for classifying tweets.
On Blackboard you will find two files (train.csv and test.csv). Both files include the following columns:
2. Positive (1) and negative (0) label for a given tweet
3. Source: Sentiment140
Once you have trained your model you should assess the accuracy of your model using the test dataset.
Naïve Bayes will treat the presence of each word as a single feature/attribute. This would give you as many features as there are words in your vocabulary. You should use a “bag of words” (Multinomial model) approach. The Multinomial model places emphasis on the frequency of occurrence of a word within documents of a class (See Week 4 lecture slides for more details and examples).
Stage 1 – Vocabulary Composition and Word Frequency Calculations
Develop code for reading all tweets from both the positive and negative files.
You should initially create a data structure to store all unique words in a vocabulary. A set data structure in Python is ideal for this purpose. You can keep adding lists of words to the set and it will only retain unique words.
Your next step is to record the frequency with which words occur in both the positive and negative tweets. I recommend that you use dictionaries to store the frequency of each word. (Note the keys of each dictionary should correspond to all words in the vocabulary and the values should specify how often they occur for that class). For example, if the word “brilliant” occurs 55 times in the positive tweets then the key value pair in your positive dictionary should be <”brilliant” : 55>. You need to record the frequency of all the words for each class (positive and negative).
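The two steps above can be sketched as follows. The tweets and labels are made-up stand-ins for the contents of train.csv, and `collections.Counter` is used in place of plain dictionaries purely for brevity; a `dict` initialized with `dict.fromkeys` (as shown below) works equally well:

```python
from collections import Counter

# Example training tweets, labelled 1 (positive) or 0 (negative).
# In the real assignment these would be read from train.csv.
tweets = [
    ("what a brilliant day", 1),
    ("this is brilliant", 1),
    ("terrible awful day", 0),
]

vocab = set()            # all unique words across both classes
posCounts = Counter()    # word -> frequency in positive tweets
negCounts = Counter()    # word -> frequency in negative tweets

for text, label in tweets:
    words = text.split()
    vocab.update(words)          # the set retains only unique words
    if label == 1:
        posCounts.update(words)
    else:
        negCounts.update(words)

print(posCounts["brilliant"])    # 2
print(negCounts["brilliant"])    # 0
```

A `Counter` returns 0 for words it has never seen, which conveniently matches the behaviour of a dictionary pre-initialized to zero for every vocabulary word.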
It can be useful when initially creating the positive or negative dictionary to use the values from the set (which contains all your unique words) to initialize all the keys for the dictionary. See example code below:
# This line creates a dictionary, initialized so that each key
# is a value from the set vocab and each count starts at 0.
negDict = dict.fromkeys(vocab, 0)
Stage 2 – Calculating Word Probabilities
Once you have populated your positive and negative dictionary with the frequency of each word, you must then work out the conditional probabilities for all words (for each class). In other words for each word w you should work out the P(w|positive) and P(w|negative). Refer to Week 3 lecture notes for more information. Remember this is a multinomial model.
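A minimal sketch of this stage is given below. It uses add-one (Laplace) smoothing, a common way of avoiding zero probabilities for words that never occur in one of the classes; check your lecture notes for the exact variant you are expected to use. The counts are illustrative made-up numbers:

```python
# Illustrative word counts for each class (made-up numbers);
# in practice these come from the Stage 1 dictionaries.
posCounts = {"brilliant": 55, "great": 30, "bad": 5}
negCounts = {"brilliant": 2, "great": 4, "bad": 60}
vocab = set(posCounts) | set(negCounts)

def word_probs(counts, vocab):
    """Multinomial P(w|class) with add-one (Laplace) smoothing."""
    total = sum(counts.get(w, 0) for w in vocab)
    denom = total + len(vocab)       # smoothing adds 1 per vocabulary word
    return {w: (counts.get(w, 0) + 1) / denom for w in vocab}

posProbs = word_probs(posCounts, vocab)
negProbs = word_probs(negCounts, vocab)

# The probabilities over the vocabulary sum to 1 for each class.
print(round(sum(posProbs.values()), 6))   # 1.0
```

Note that in the multinomial model the denominator is the total number of word occurrences in the class (plus the vocabulary size for smoothing), not the number of tweets.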
Stage 3 – Classifying Unseen Tweets and Performing Basic Evaluation
The final section of your code will take as input a new tweet (one that has not been used to train the algorithm) and classify it as positive or negative. You will need to read all words from the tweet and determine the probability of that tweet being positive and the probability of it being negative.
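One way to sketch the classification step is shown below. Summing log probabilities rather than multiplying raw probabilities avoids floating-point underflow on long tweets; the probability values and equal priors here are made-up placeholders for the quantities computed in Stages 1 and 2:

```python
import math

# Illustrative smoothed probabilities and priors (made-up values);
# both dictionaries share the same keys, i.e. the full vocabulary.
posProbs = {"brilliant": 0.6, "day": 0.3, "awful": 0.1}
negProbs = {"brilliant": 0.1, "day": 0.3, "awful": 0.6}
priorPos, priorNeg = 0.5, 0.5

def classify(tweet):
    """Return 1 (positive) or 0 (negative) for a whitespace-split tweet."""
    logPos = math.log(priorPos)
    logNeg = math.log(priorNeg)
    for w in tweet.split():
        if w in posProbs:            # skip words never seen in training
            logPos += math.log(posProbs[w])
            logNeg += math.log(negProbs[w])
    return 1 if logPos > logNeg else 0

print(classify("a brilliant day"))   # 1
print(classify("an awful day"))      # 0
```

Skipping unseen words is only one possible policy; with Laplace smoothing you could instead assign every vocabulary word a non-zero probability and handle out-of-vocabulary words explicitly.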
For the basic evaluation of your algorithm you should run all tweets from the test folder through your algorithm and determine the level of accuracy (the percentage of tweets correctly classified for each class).
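The per-class accuracy calculation can be sketched as follows. The test data and the trivial stand-in classifier are hypothetical; substitute tweets read from test.csv and your trained Naive Bayes model:

```python
# Hypothetical test set: (tweet, true label) pairs.
testData = [
    ("brilliant day", 1),
    ("awful day", 0),
    ("brilliant", 1),
    ("awful", 1),        # an example the stand-in classifier gets wrong
]

def classify(tweet):
    # Trivial stand-in; replace with your Naive Bayes classifier.
    return 1 if "brilliant" in tweet else 0

# Per-class accuracy: fraction of each class's tweets correctly classified.
correct = {0: 0, 1: 0}
total = {0: 0, 1: 0}
for tweet, label in testData:
    total[label] += 1
    if classify(tweet) == label:
        correct[label] += 1

for label in (0, 1):
    acc = 100 * correct[label] / total[label]
    print(f"class {label}: {acc:.1f}% accuracy")
```

Reporting accuracy separately per class, as the brief asks, reveals imbalances that a single overall figure can hide.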
You should also try to clean the dataset by lower-casing all words and removing punctuation as much as possible. Your basic evaluation should describe the basic steps you took and if any impact on accuracy was observed.
Research and Detailed Evaluation
The research aspect of this project is worth 40%. You should research common methods used for potentially improving the classification accuracy of your Naïve Bayes algorithm. Please note that basic techniques such as lower-casing all words and punctuation removal will not be considered. Your report should provide a detailed account of the research and the subsequent implementation, as well as the updated results. You should cite all sources you used. Please note that you will not be docked marks for techniques that do not improve accuracy.
The regular expression library in Python may prove useful in performing pre-processing techniques (re module: https://docs.python.org/3.6/library/re.html ). This provides capabilities for extracting whole words and removing punctuation. See example on the next page. You can find a tutorial on regular expressions at https://developers.google.com/edu/python/regular-expressions .
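The original handout's regex example is not reproduced here; as a minimal sketch of the kind of cleaning the re module makes possible, `re.findall` can lower-case a tweet and keep only whole words, discarding punctuation and emoticons in one step:

```python
import re

tweet = "BRILLIANT!!! Loved it :-) #happy"

# Lower-case the tweet, then keep only runs of letters and digits.
words = re.findall(r"[a-z0-9]+", tweet.lower())
print(words)   # ['brilliant', 'loved', 'it', 'happy']
```

Note that a pattern like this also strips the `#` from hashtags and splits contractions such as "don't"; whether that helps or hurts accuracy is exactly the kind of question the detailed evaluation should explore.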
An alternative is the use of NLTK, Python's natural language toolkit (http://nltk.org/ ). Note that to use this from Spyder you will need to run nltk.download('all'). It is a powerful library that provides a range of capabilities including stemming, lemmatization, etc.
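As one small illustration, NLTK's Porter stemmer reduces inflected forms to a common stem, so that "running" and "run" count as the same vocabulary word. This assumes NLTK is installed (pip install nltk); the Porter stemmer itself requires no downloaded corpora, unlike some other NLTK components:

```python
# Requires NLTK (pip install nltk); PorterStemmer needs no
# nltk.download() data, unlike e.g. the WordNet lemmatizer.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["running", "caresses", "ponies"]:
    print(word, "->", stemmer.stem(word))
```

Stems are not always dictionary words ("ponies" becomes "poni"), which is harmless for Naive Bayes since the stem only needs to be consistent, not readable.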