Assignment Description
BIOL 419/519 Homework 5, Winter 2019
Due on Monday, March 4 at 11:59pm
Instructions: Submit the Jupyter notebook of your work. Your notebook solutions will include the code you wrote to solve the problem as well as the output/answer. Each part of each problem should be in a separate cell (or multiple cells) with clear comments labeling them, so that their outputs are easily found by the grader!
Expectations: Please seek help if you need it! You may ask questions at Friday’s lab, come to office hours, and get together with your classmates to troubleshoot together.
Collaboration: As noted in the Syllabus, what you turn in should reflect your own understanding of the material. Collaboration with your classmates is encouraged, and I ask you to clearly indicate these collaborations as comments in your homework.
Data Description
This assignment uses a dataset where DNA microarrays were used to profile tumor samples by gene expression; you will find the data file on the class website. The dataset comprises 83 tumor samples and expression levels of 2308 genes. The tumor samples are of four types: Burkitt lymphoma (BL), Ewing sarcoma (EWS), neuroblastoma (NB), or rhabdomyosarcoma (RMS).
The data was originally described by Khan et al., 2001 [1] and also analyzed by Tibshirani et al. [2], among others. The raw data was downloaded from https://home.ccr.cancer.gov/oncology/oncogenomics/.
Now let’s get to coding!
1. (1 pt) Visualize the data
I’ve already loaded the microarray data and organized the gene expression levels and cancer types into Numpy arrays. Let’s load the data using the starter code in the repository and take a look!
(a) What is the shape of the genes array? What is the shape of the cancer_types array?
(b) Visualize the genes data as an image using Matplotlib’s imshow function. Find a nice colormap for this data and label the plot.
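As a rough illustration of parts (a) and (b), here is a minimal sketch. The real arrays come from the starter code, which is not shown here, so the block uses synthetic stand-ins named `genes` and `cancer_types`. The samples-by-genes layout (83 x 2308), the `viridis` colormap, and the axis labels are all assumptions, not part of the assignment.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins for the real arrays (assumed layout: samples x genes).
rng = np.random.default_rng(0)
genes = rng.normal(size=(83, 2308))
cancer_types = rng.choice(["BL", "EWS", "NB", "RMS"], size=83)

# Part (a): inspect the array shapes.
print(genes.shape, cancer_types.shape)

# Part (b): show the matrix as an image.
fig, ax = plt.subplots(figsize=(10, 4))
# aspect="auto" stretches the very wide matrix so it stays readable.
im = ax.imshow(genes, aspect="auto", cmap="viridis", interpolation="nearest")
ax.set_xlabel("Gene index")
ax.set_ylabel("Tumor sample index")
ax.set_title("Gene expression (synthetic placeholder data)")
fig.colorbar(im, ax=ax, label="Expression level")
fig.tight_layout()
```

If the real `genes` array turns out to be stored genes-by-samples, the two axis labels would simply swap.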
2. (3 pts) Managing dimensionality
The goal here is classification—if we are given the expression profile of these genes for another tumor, how well can we tell what type of tumor it is (assuming it’s one of these four)? The challenge, however, is the high dimensionality of the data. It is very difficult to build a classifier of m = 2308 dimensions, especially since the number of samples n = 83 is much smaller!
(a) Compute the principal component analysis (PCA) of the gene expression data matrix. How many components come out of this decomposition?
(b) Not all of these components are equally important. Make a plot of the fraction of variance explained by the first r components. This plot should have r on the horizontal axis and the cumulative variance explained by those first r components on the vertical axis.
How many components would you need to explain 90% of the variance in the data?
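One way to think about part (b) is sketched below. It computes PCA through a singular value decomposition of the mean-centered matrix, which is one standard route and not necessarily the one intended by the assignment. The matrix `X` is synthetic placeholder data with an assumed samples-by-genes layout, so the resulting numbers say nothing about the real dataset.

```python
import numpy as np

# Synthetic stand-in for the expression matrix (assumed: samples x genes).
rng = np.random.default_rng(0)
X = rng.normal(size=(83, 2308))

# Center each gene (column); PCA is then the SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Variance along each principal component, and its share of the total.
var = s**2 / (X.shape[0] - 1)
frac = var / var.sum()
cum_frac = np.cumsum(frac)

# Smallest r whose cumulative fraction of variance reaches 90%.
r90 = int(np.searchsorted(cum_frac, 0.90)) + 1
print(len(s), r90)
```

Plotting `cum_frac` against `np.arange(1, len(cum_frac) + 1)` gives the requested curve, with r on the horizontal axis.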
(c) Transform the normalized gene expression data into PC coordinates. Make a plot with PC1 on the horizontal axis and PC2 on the vertical axis, where every tumor sample is a dot on this plot, and the 4 tumor types have different colors.
Can you see an easy separation of the 4 tumor types in this 2-dimensional space? Try it also for some other pairs of PC’s (e.g., PC3 vs PC10).
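A possible sketch of the scatter plot in part (c) follows. It again uses synthetic stand-ins for `genes` and `cancer_types`, and it only mean-centers the data before projecting; any further normalization the assignment has in mind is not shown. The variable names `pc_x` and `pc_y` are hypothetical conveniences for choosing which pair of PC's to plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-ins (assumed layout: samples x genes).
rng = np.random.default_rng(1)
genes = rng.normal(size=(83, 2308))
cancer_types = rng.choice(["BL", "EWS", "NB", "RMS"], size=83)

# PC coordinates of each sample: project the centered data onto the PCs.
Xc = genes - genes.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s  # equivalent to Xc @ Vt.T; column k holds PC(k+1)

pc_x, pc_y = 1, 2  # 1-based PC numbers; try e.g. (3, 10) as well
fig, ax = plt.subplots()
for tumor in np.unique(cancer_types):
    mask = cancer_types == tumor
    # One scatter call per tumor type gives each type its own color.
    ax.scatter(scores[mask, pc_x - 1], scores[mask, pc_y - 1], label=tumor)
ax.set_xlabel(f"PC{pc_x}")
ax.set_ylabel(f"PC{pc_y}")
ax.legend(title="Tumor type")
```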
3. (3 pts) Classifiers with cross validation
Even though the classification of these 4 tumor types may not be easily accomplished in 2 PC dimensions, fortunately our classification algorithms deal well with more PC’s. You can think of each of these PC’s as a feature of the data that is computed from a specific aggregate of expression levels of all the genes.
Since we only have 83 samples in all, it is important to avoid over-fitting these classifier models. We are going to use the cross-validation approach, separating the samples randomly into test/train groups. We will then use only the training samples for fitting the classifier, and assess this classifier by how well it does on the withheld test samples.
(a) Write a function to randomly shuffle the N sample id’s into train and test portions, where test_frac is the fraction of samples reserved for testing.
def test_train_id(N, test_frac):
    # add your code here
    return train, test
Note: There exists a train_test_split function in scikit-learn’s model_selection module that’s pretty useful for doing cross validation. It comes in very handy! However, it does not do what we need it to do for purposes of this homework, so we’re going to write our own.
(b) Build a k-nearest neighbor classifier using the first 10 PC dimensions and considering only the closest 2 neighbors. Evaluate this simple classifier with 5-fold cross-validation (so test_frac=0.2) 100 times, and print the mean cross-validated accuracy of this classifier.
from sklearn.neighbors import KNeighborsClassifier
Hint: It is important to keep a strict separation between training and testing data. If any computation is applied to the entire dataset before separating training from testing samples, then the testing data has been contaminated! Computations you may be tempted to apply to the whole dataset include: mean subtraction, normalization, and PCA! Therefore, for each randomly shuffled train/test, the PCA model must be re-computed for the training samples alone, followed by the classifier trained on the PC-transformed training data. To compute the accuracy of this classifier on the test samples, these samples must first be PC-transformed (using the same trained PCA model) and then fed to the classifier model.
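The separation described in the hint can be sketched for a single random split as follows. The data here is synthetic with random labels, so the resulting accuracy should sit near chance and is not meaningful; the point is only the order of operations. scikit-learn's `PCA` subtracts the mean of the data it was fit on, so fitting it on the training rows alone keeps the test rows out of both the centering and the decomposition.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-ins (assumed layout: samples x genes); labels are random.
rng = np.random.default_rng(0)
X = rng.normal(size=(83, 2308))
y = rng.choice(["BL", "EWS", "NB", "RMS"], size=83)

# One random train/test split with 20% of samples withheld.
ids = rng.permutation(len(y))
n_test = int(round(0.2 * len(y)))
test, train = ids[:n_test], ids[n_test:]

# Fit PCA on the training samples only, then train the classifier
# on the PC-transformed training data.
pca = PCA(n_components=10).fit(X[train])
knn = KNeighborsClassifier(n_neighbors=2).fit(pca.transform(X[train]), y[train])

# Test samples are transformed with the already-fitted PCA model.
accuracy = knn.score(pca.transform(X[test]), y[test])
print(accuracy)
```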
4. (3 pts) Comparing classifiers
Now it’s time to take a few other classifiers for a spin.
(a) Write a function that returns the cross-validated accuracy for classifier models, taking as input the classifier model, the data, the sample labels, the test sample fraction, and the number of repetitions.
def cross_val_class_accuracy(model, X, y, r, test_frac, reps):
    # add your code here
    return cv_acc
Here, X is the data, y is the class labels of the samples, and r is the number of PC’s of the data to be used for the classifier.
This function can be called to evaluate any classifier model from sklearn; for instance, this should produce the same results as what you had accomplished in problem 3(b):
mymodel = KNeighborsClassifier(n_neighbors=2)
knn_acc = cross_val_class_accuracy(mymodel, genes, cancer_types, 10, 0.2, 100)
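A hedged sketch of what such a function might look like is given below. It treats each repetition as an independent random split, refits PCA on the training samples every time, and returns the mean accuracy over repetitions; returning the mean is an assumption about what `cv_acc` should hold. The `seed` argument is an addition for reproducibility, and `genes` and `cancer_types` are small synthetic stand-ins, not the real data.

```python
import numpy as np
from sklearn.base import clone
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier

def cross_val_class_accuracy(model, X, y, r, test_frac, reps, seed=None):
    """Mean test accuracy over `reps` random train/test splits."""
    rng = np.random.default_rng(seed)
    N = len(y)
    n_test = int(round(N * test_frac))
    acc = np.empty(reps)
    for i in range(reps):
        ids = rng.permutation(N)
        test, train = ids[:n_test], ids[n_test:]
        # PCA is refit on the training samples of every split.
        pca = PCA(n_components=r).fit(X[train])
        # clone() gives a fresh, unfitted copy of the classifier.
        clf = clone(model).fit(pca.transform(X[train]), y[train])
        acc[i] = clf.score(pca.transform(X[test]), y[test])
    return acc.mean()

# Synthetic stand-ins with a class-dependent shift so there is signal to find.
rng = np.random.default_rng(0)
cancer_types = rng.integers(0, 4, size=83)
genes = rng.normal(size=(83, 200)) + 0.5 * cancer_types[:, None]

mymodel = KNeighborsClassifier(n_neighbors=2)
knn_acc = cross_val_class_accuracy(mymodel, genes, cancer_types, 10, 0.2, 100, seed=0)
print(knn_acc)
```

Cloning the model inside the loop means the object passed in is never fitted, so the same instance can be reused across calls without carrying state between them.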
(b) Using the function you wrote above, compare the cross-validated accuracy of the 6 classifier models below, using the implementation of each classifier available in scikit-learn. Run 200 random iterations of 5-fold cross validation for each model. Make a bar plot comparing the mean cross-validated accuracies of these different models; make sure to add labels.
The models to compare are as follows (r is the number of the PC’s of the data to be used for the classifier):
- k-nearest neighbor (KNN) with 2 neighbors, r = 20
- k-nearest neighbor (KNN) with 10 neighbors, r = 20
- linear discriminant analysis (LDA), r = 5
- linear discriminant analysis (LDA), r = 20
- support vector machine (SVM) with linear kernels, r = 20
- decision tree, r = 20
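The six configurations above could be set up in scikit-learn roughly as follows. The dictionary name `models` and the label strings are illustrative choices. Using `SVC(kernel="linear")` for the linear-kernel SVM is one reasonable reading of the list; `LinearSVC` would be another.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Each entry maps a plot label to (unfitted classifier, number of PCs r).
models = {
    "KNN k=2, r=20": (KNeighborsClassifier(n_neighbors=2), 20),
    "KNN k=10, r=20": (KNeighborsClassifier(n_neighbors=10), 20),
    "LDA, r=5": (LinearDiscriminantAnalysis(), 5),
    "LDA, r=20": (LinearDiscriminantAnalysis(), 20),
    "SVM linear, r=20": (SVC(kernel="linear"), 20),
    "Decision tree, r=20": (DecisionTreeClassifier(), 20),
}

for name, (model, r) in models.items():
    print(f"{name}: {type(model).__name__} with r = {r}")
```

Each `(model, r)` pair can then be passed to `cross_val_class_accuracy` with `test_frac=0.2` and `reps=200`, and the resulting means drawn with Matplotlib's `bar` using the dictionary keys as tick labels.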
5. (Informational) How many hours did you spend on this homework? How many of those hours were spent working alone (as opposed to in a group)?
[1] J. Khan, J. S. Wei, M. Ringner, L. H. Saal, M. Ladanyi, F. Westermann, F. Berthold, M. Schwab, C. R. Antonescu, C. Peterson, and P. S. Meltzer, Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks, Nat. Med., 7 (2001), pp. 673–679.
[2] R. Tibshirani, T. Hastie, B. Narasimhan, and G. Chu, Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proc. Natl. Acad. Sci. USA, 99 (2002), pp. 6567–6572.