We Helped With This Python Programming Homework: Have A Similar One?

Short Assignment Requirements

Need help with questions 2, 3 and 4. Can use libraries such as pandas, numpy, matplotlib, scikit-learn etc. All files are attached. Thank you!

Assignment Description

BIOL 419/519 Homework 5, Winter 2019

Due on Monday, March 4 at 11:59pm

Instructions: Submit the Jupyter notebook of your work. Your notebook solutions will include the code you wrote to solve the problem as well as the output/answer. Each part of each problem should be in a separate cell (or multiple cells) with clear comments labeling them, so that their outputs are easily found by the grader!

Expectations: Please seek help if you need it! You may ask questions at Friday’s lab, come to office hours, and get together with your classmates to troubleshoot together.

Collaboration: As noted in the Syllabus, what you turn in should reflect your own understanding of the material. Collaboration with your classmates is encouraged, and I ask you to clearly indicate these collaborations as comments in your homework.

Data Description

This assignment uses a dataset where DNA microarrays were used to profile tumor samples by gene expression; you will find the data file on the class website. The dataset comprises 83 tumor samples and expression levels of 2308 genes. The tumor samples are of four types: Burkitt lymphoma (BL), Ewing sarcoma (EWS), neuroblastoma (NB), or rhabdomyosarcoma (RMS).

The data was originally described by Khan et al., 2001[1] and also analyzed by Tibshirani et al.[2], among others. The raw data was downloaded from https://home.ccr.cancer.gov/oncology/oncogenomics/.

Now let’s get to coding!

1.    (1 pt) Visualize the data

I’ve already loaded the microarray data and organized the gene expression levels and cancer types into Numpy arrays. Let’s load the data using the starter code in the repository and take a look!

(a)    What is the shape of the genes array? What is the shape of the cancer_types array?

(b)    Visualize the genes data as an image using Matplotlib’s imshow function. Find a nice colormap for this data and label the plot.
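As a rough illustration of problem 1, here is a minimal sketch. The class data file is not reproduced here, so it uses random stand-in arrays. The names genes and cancer_types come from the assignment, but the orientation (samples along rows), the integer labels, and the "viridis" colormap are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in for the real data (assumption): 83 samples x 2308 genes.
rng = np.random.default_rng(0)
genes = rng.normal(size=(83, 2308))
cancer_types = rng.integers(0, 4, size=83)  # hypothetical integer labels

# Part (a): inspect the array shapes.
print(genes.shape, cancer_types.shape)

# Part (b): show the matrix as an image; aspect="auto" stretches the
# wide matrix to fill the axes instead of drawing square pixels.
fig, ax = plt.subplots(figsize=(10, 4))
im = ax.imshow(genes, aspect="auto", cmap="viridis")
ax.set_xlabel("gene index")
ax.set_ylabel("tumor sample")
ax.set_title("Gene expression (synthetic stand-in)")
fig.colorbar(im, ax=ax, label="expression level")
```

With the real arrays from the starter code, only the two lines that create the synthetic data would need to be dropped.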

2.    (3 pts) Managing dimensionality

The goal here is classification—if we are given the expression profile of these genes for another tumor, how well can we tell what type of tumor it is (assuming it’s one of these four)? The challenge, however, is the high dimensionality of the data. It is very difficult to build a classifier of m = 2308 dimensions, especially since the number of samples n = 83 is much smaller!

(a)    Compute the principal component analysis (PCA) of the gene expression data matrix. How many components come out of this decomposition?

(b)    Not all of these components are equally important. Make a plot of the fraction of variance explained by the first r components. This plot should have r on the horizontal axis and the cumulative variance explained by those first r components on the vertical axis.

How many components would you need to explain 90% of the variance in the data?
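One possible way to approach parts (a) and (b) is sketched below with scikit-learn's PCA class. The data are random stand-ins (an assumption), so the printed numbers say nothing about the real tumor data. The name r90 is a hypothetical helper variable.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in for the gene expression matrix (assumption).
rng = np.random.default_rng(0)
genes = rng.normal(size=(83, 2308))

# With n_components left unset, scikit-learn keeps
# min(n_samples, n_features) components.
pca = PCA()
pca.fit(genes)

# Cumulative fraction of variance explained by the first r components.
cum_var = np.cumsum(pca.explained_variance_ratio_)

# Smallest r whose cumulative explained variance reaches 90%.
r90 = int(np.searchsorted(cum_var, 0.90) + 1)
print(pca.n_components_, r90)

fig, ax = plt.subplots()
ax.plot(np.arange(1, len(cum_var) + 1), cum_var, marker=".")
ax.axhline(0.90, linestyle="--", color="gray")
ax.set_xlabel("number of components r")
ax.set_ylabel("cumulative fraction of variance explained")
```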

(c)    Transform the normalized gene expression data into PC coordinates. Make a plot with PC1 on the horizontal axis and PC2 on the vertical axis, where every tumor sample is a dot on this plot, and the 4 tumor types have different colors.

Can you see an easy separation of the 4 tumor types in this 2-dimensional space? Try it also for some other pairs of PC’s (e.g., PC3 vs PC10).
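The scatter plots in part (c) could be produced along the following lines. This is a sketch on synthetic data with an artificial per-type offset, so any separation it shows is built in by construction. The helper plot_pc_pair is hypothetical, and the string labels are an assumption about how cancer_types is encoded. PCA centers the data itself; any further normalization is left out here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): four tumor types, each shifted
# on its own block of 100 genes so that the groups differ.
rng = np.random.default_rng(1)
type_names = np.array(["BL", "EWS", "NB", "RMS"])
cancer_types = rng.choice(type_names, size=83)
genes = rng.normal(size=(83, 2308))
for k, name in enumerate(type_names):
    genes[cancer_types == name, 100 * k : 100 * (k + 1)] += 3.0

# Rows are samples; columns are PC1, PC2, ...
pcs = PCA().fit_transform(genes)

def plot_pc_pair(pcs, labels, i, j):
    """Scatter PC i against PC j (1-based), one color per label."""
    fig, ax = plt.subplots()
    for name in np.unique(labels):
        mask = labels == name
        ax.scatter(pcs[mask, i - 1], pcs[mask, j - 1], label=name)
    ax.set_xlabel(f"PC{i}")
    ax.set_ylabel(f"PC{j}")
    ax.legend(title="tumor type")
    return ax

ax12 = plot_pc_pair(pcs, cancer_types, 1, 2)
ax310 = plot_pc_pair(pcs, cancer_types, 3, 10)
```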

3.    (3 pts) Classifiers with cross validation

Even though the classification of these 4 tumor types may not be easily accomplished in 2 PC dimensions, fortunately our classification algorithms deal well with more PC’s. You can think of each of these PC’s as a feature of the data that is computed from a specific aggregate of expression levels of all the genes.

Since we only have 83 samples in all, it is important to avoid over-fitting these classifier models. We are going to use the cross-validation approach, separating the samples randomly into test/train groups. We will then use only the training samples for fitting the classifier, and assess this classifier by how well it does on the withheld test samples.

(a)    Write a function to randomly shuffle the N sample id’s into train and test portions, where test_frac is the fraction of samples reserved for testing.

def test_train_id(N, test_frac):
    # add your code here
    return train, test

Note: There exists a train_test_split function in scikit-learn’s model_selection module that’s pretty useful for doing cross validation. It comes in very handy! However, it does not do what we need for the purposes of this homework, so we’re going to write our own.

(b)    Build a k-nearest neighbor classifier using the first 10 PC dimensions and considering only the closest 2 neighbors. Evaluate this simple classifier with 5-fold cross-validation (so test_frac=0.2) 100 times, and print the mean cross-validated accuracy of this classifier.

from sklearn.neighbors import KNeighborsClassifier

Hint: It is important to keep a strict separation between training and testing data. If any computation is applied to the entire dataset before separating training from testing samples, then the testing data has been contaminated! Computations you may be tempted to apply to the whole dataset include: mean subtraction, normalization, and PCA! Therefore, for each randomly shuffled train/test, the PCA model must be re-computed for the training samples alone, followed by the classifier trained on the PC-transformed training data. To compute the accuracy of this classifier on the test samples, these samples must first be PC-transformed (using the same trained PCA model) and then fed to the classifier model.
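The separation described in the hint can be sketched as follows. The data are synthetic stand-ins with an artificial per-type offset (an assumption), so the printed accuracy says nothing about the real dataset. The point of the sketch is the order of operations: PCA is fit on the training rows only, and the test rows are transformed with that same fitted model.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption), four types with shifted gene blocks.
rng = np.random.default_rng(0)
type_names = np.array(["BL", "EWS", "NB", "RMS"])
cancer_types = rng.choice(type_names, size=83)
genes = rng.normal(size=(83, 2308))
for k, name in enumerate(type_names):
    genes[cancer_types == name, 100 * k : 100 * (k + 1)] += 3.0

N, test_frac, reps = 83, 0.2, 100
n_test = int(round(N * test_frac))
accs = []
for _ in range(reps):
    # Fresh random train/test split for every repetition.
    ids = rng.permutation(N)
    test, train = ids[:n_test], ids[n_test:]

    # Fit PCA on the training samples alone (this also learns the mean).
    pca = PCA(n_components=10).fit(genes[train])

    # Train the classifier on PC-transformed training data.
    knn = KNeighborsClassifier(n_neighbors=2)
    knn.fit(pca.transform(genes[train]), cancer_types[train])

    # Transform the withheld samples with the same PCA model, then score.
    accs.append(knn.score(pca.transform(genes[test]), cancer_types[test]))

mean_acc = float(np.mean(accs))
print(mean_acc)
```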

4.    (3 pts) Comparing classifiers

Now it’s time to take a few other classifiers for a spin.

(a)    Write a function that returns the cross-validated accuracy for classifier models, taking as input the classifier model, the data, the sample labels, the test sample fraction, and the number of repetitions.

def cross_val_class_accuracy(model, X, y, r, test_frac, reps):
    # add your code here
    return cv_acc

Here, X is the data, y is the class labels of the samples, and r is the number of PC’s of the data to be used for the classifier.

This function can be called to evaluate any classifier model from sklearn; for instance, this should produce the same results as what you had accomplished in problem 3(b):

mymodel = KNeighborsClassifier(n_neighbors=2)
knn_acc = cross_val_class_accuracy(mymodel, genes, cancer_types, 10, 0.2, 100)
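A possible shape for the function in part (a) is sketched below, again on synthetic stand-in data (an assumption). Returning the mean accuracy as a single float is an interpretation of cv_acc, the seed argument is a hypothetical addition that only controls the shuffling, and clone is used so that the caller's model object is left unfitted.

```python
import numpy as np
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def cross_val_class_accuracy(model, X, y, r, test_frac, reps, seed=None):
    """Mean accuracy over reps random train/test splits using r PCs."""
    rng = np.random.default_rng(seed)      # seed: hypothetical extra argument
    N = X.shape[0]
    n_test = int(round(N * test_frac))
    accs = np.empty(reps)
    for i in range(reps):
        ids = rng.permutation(N)
        test, train = ids[:n_test], ids[n_test:]
        # PCA is re-fit on the training samples of each split.
        pca = PCA(n_components=r).fit(X[train])
        clf = clone(model).fit(pca.transform(X[train]), y[train])
        accs[i] = clf.score(pca.transform(X[test]), y[test])
    return float(accs.mean())

# Synthetic stand-in data (assumption), four types with shifted gene blocks.
rng = np.random.default_rng(0)
type_names = np.array(["BL", "EWS", "NB", "RMS"])
cancer_types = rng.choice(type_names, size=83)
genes = rng.normal(size=(83, 2308))
for k, name in enumerate(type_names):
    genes[cancer_types == name, 100 * k : 100 * (k + 1)] += 3.0

mymodel = KNeighborsClassifier(n_neighbors=2)
knn_acc = cross_val_class_accuracy(mymodel, genes, cancer_types, 10, 0.2, 100)
print(knn_acc)
```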

(b)    Using the function you wrote above, compare the cross-validated accuracy of the 6 classifier models below, using the implementation of each classifier available in scikit-learn. Run 200 random iterations of 5-fold cross validation for each model. Make a bar plot comparing the mean cross-validated accuracies of these different models; make sure to add labels.

The models to compare are as follows (r is the number of the PC’s of the data to be used for the classifier):

- k-nearest neighbor (KNN) with 2 neighbors, r = 20
- k-nearest neighbor (KNN) with 10 neighbors, r = 20
- linear discriminant analysis (LDA), r = 5
- linear discriminant analysis (LDA), r = 20
- support vector machine (SVM) with linear kernels, r = 20
- decision tree, r = 20
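The comparison over the six models listed above might be set up as in this sketch. To keep it quick it uses a smaller synthetic matrix (83 samples by 300 genes, an assumption), so the resulting accuracies are illustrative only. A compact version of cross_val_class_accuracy is repeated here so the block runs on its own. All estimators use their scikit-learn defaults apart from the settings named in the list.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def cross_val_class_accuracy(model, X, y, r, test_frac, reps, seed=None):
    """Mean accuracy over reps random splits, PCA re-fit on each training set."""
    rng = np.random.default_rng(seed)
    n_test = int(round(X.shape[0] * test_frac))
    accs = []
    for _ in range(reps):
        ids = rng.permutation(X.shape[0])
        test, train = ids[:n_test], ids[n_test:]
        pca = PCA(n_components=r).fit(X[train])
        clf = clone(model).fit(pca.transform(X[train]), y[train])
        accs.append(clf.score(pca.transform(X[test]), y[test]))
    return float(np.mean(accs))

# Smaller synthetic stand-in data (assumption) to keep the run time short.
rng = np.random.default_rng(0)
type_names = np.array(["BL", "EWS", "NB", "RMS"])
cancer_types = rng.choice(type_names, size=83)
genes = rng.normal(size=(83, 300))
for k, name in enumerate(type_names):
    genes[cancer_types == name, 20 * k : 20 * (k + 1)] += 3.0

# (label, model, number of PCs r) for the six models in the list.
models = [
    ("KNN, 2 neighbors, r=20", KNeighborsClassifier(n_neighbors=2), 20),
    ("KNN, 10 neighbors, r=20", KNeighborsClassifier(n_neighbors=10), 20),
    ("LDA, r=5", LinearDiscriminantAnalysis(), 5),
    ("LDA, r=20", LinearDiscriminantAnalysis(), 20),
    ("SVM (linear), r=20", SVC(kernel="linear"), 20),
    ("Decision tree, r=20", DecisionTreeClassifier(), 20),
]
labels = [name for name, _, _ in models]
mean_accs = [
    cross_val_class_accuracy(m, genes, cancer_types, r, 0.2, 200)
    for _, m, r in models
]

# Horizontal bars keep the long model labels readable.
fig, ax = plt.subplots()
ax.barh(labels, mean_accs)
ax.set_xlim(0, 1)
ax.set_xlabel("mean cross-validated accuracy")
ax.set_title("Classifier comparison (synthetic stand-in data)")
fig.tight_layout()
```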

5.    (Informational) How many hours did you spend on this homework? How many of those hours were spent working alone (as opposed to in a group)?



[1] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer, Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nat. Med., 7 (2001), pp. 673–679.

[2] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu, Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proc. Natl. Acad. Sci. USA, 99 (2002), pp. 6567–6572.
