Subject: R | R Studio
More Info: Economics RStudio Homework
Short Assignment Requirements
Predictive Analytics in Business: Homework I
Problem 3: d
As of July 2017, car2go is the largest car sharing company in the world with 2,500,000 registered members and a fleet of nearly 14,000 vehicles in 26 locations in North America (e.g., https://www.car2go.com/US/en/new-york-city/), Europe and Asia. [ref: https://en.wikipedia.org/wiki/Car2Go] car2go operated in San Diego, California, with a purely electric vehicle (EV) fleet between November 2011 and December 2016. The service region of car2go consisted of 16 zip codes, and vehicles were available within the defined boundary, as shown in the Figure below.
We have collected the vehicle status data between March and April 2014, at 5-minute intervals.
In the following, we will analyze the trip data (car2go_data) and try to understand its
1. Report the number of cars and total trips observed in the data set. [1 pts]
2. Plot the histogram of travel times for trips with distance less than 2000 meters. Discuss what the histogram suggests about whether a significant share of customers use car2go for one-way trips. [2 pts]
If a long travel time is taken to indicate a round trip, because the distance driven is twice as long, then it can be concluded that there is a low percentage of round trips. That reading applies when a user drives to a destination, keeps the rental open in the application instead of ending the trip, and then drives back to the starting point while still logged into the same trip. However, car2go's open fleet system allows users to log out each time they park the car on the street. It can therefore be assumed that most users end their trip on reaching the destination. In that case it cannot be determined whether a person makes the return journey in a car2go car or in another vehicle, and there is no connection between trip time and whether a trip is a round trip or a one-way trip.
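Questions 1 and 2 can be sketched in base R as follows. The columns of the real `car2go_data` are not shown in the assignment, so the data frame below is simulated, and the column names `car_id`, `distance_m` and `travel_min` are hypothetical stand-ins.

```r
# Simulated stand-in for car2go_data; all column names are assumptions.
set.seed(42)
n_trips <- 500
trips <- data.frame(
  car_id     = sample(sprintf("car%03d", 1:40), n_trips, replace = TRUE),
  distance_m = round(runif(n_trips, 100, 8000)),
  travel_min = round(rexp(n_trips, rate = 1 / 25)) + 5
)

# Question 1: number of distinct cars and total trips (one row per trip).
n_cars  <- length(unique(trips$car_id))
n_total <- nrow(trips)
cat("cars:", n_cars, " trips:", n_total, "\n")

# Question 2: histogram of travel times for trips shorter than 2000 m.
short_trips <- subset(trips, distance_m < 2000)
h <- hist(short_trips$travel_min,
          breaks = 20,
          main = "Travel time, trips under 2000 m",
          xlab = "Travel time (minutes)")
```

On the real data, a long right tail in this histogram would be the feature to look for when arguing about round trips over short distances.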
3. Count the trips for each SD hour on each sample day. (Hint: use functions “count” and “group_by” in packages “plyr” and “dplyr”). Report the first 6 readings of trip counts, ordered by sample day and SD hour. [1 pts]
4. Due to disconnections in web-crawling, not every sample day has observations for all 24 hours. For the following study, we only include observations from sample day 10 to 26. Construct the time series using function “ts” and set the frequency to 24 (hours). Plot the time series of trip numbers over sampled hour and day and discuss the pattern. [2 pts]
5. Plot ACF and PACF to identify potential AR and MA models. [2 pts]
It is assumed that question 5 is not tied to question 4, and hence that sample days 1 to 26 should be included.
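Questions 3 to 5 can be sketched in base R as follows. The data are simulated, the column names `sample_day` and `sd_hour` are hypothetical, and `table` is used in place of the `dplyr` functions named in the hint (with `dplyr` loaded, `count(trips, sample_day, sd_hour)` would give the same counts for the combinations that occur). Following the assumption stated above, the ACF and PACF use sample days 1 to 26, while the plotted series uses days 10 to 26.

```r
# Simulated trips for 26 sample days; column names are assumptions.
set.seed(1)
n_trips <- 20000
hour_weights <- 2 + sin(2 * pi * (0:23 - 11) / 24)  # peak at 17:00
trips <- data.frame(
  sample_day = sample(1:26, n_trips, replace = TRUE),
  sd_hour    = sample(0:23, n_trips, replace = TRUE, prob = hour_weights)
)

# Question 3: trips per sample day and SD hour, in time order.
counts <- as.data.frame(table(sample_day = trips$sample_day,
                              sd_hour    = trips$sd_hour),
                        responseName = "n")
counts$sample_day <- as.integer(as.character(counts$sample_day))
counts$sd_hour    <- as.integer(as.character(counts$sd_hour))
counts <- counts[order(counts$sample_day, counts$sd_hour), ]
print(head(counts))

# Question 4: keep sample days 10 to 26; 24 observations per seasonal cycle.
sub <- counts[counts$sample_day >= 10 & counts$sample_day <= 26, ]
y <- ts(sub$n, frequency = 24, start = c(10, 1))
plot(y, xlab = "Sample day", ylab = "Trips per hour")

# Question 5: ACF and PACF on days 1 to 26.
# With frequency = 24, a lag of 1.0 on the x-axis corresponds to one day.
y_all  <- ts(counts$n, frequency = 24)
acf_y  <- acf(y_all,  lag.max = 72, main = "ACF of hourly trips")
pacf_y <- pacf(y_all, lag.max = 72, main = "PACF of hourly trips")
```

Because `table` keeps day and hour combinations with zero trips, the resulting series has no gaps; on the real data, hours lost to crawling disconnections would show up as zeros and need separate handling.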
6. Identification of best fit ARIMA model. Explain the resulting model, e.g., any (seasonal) differencing. [2 pts]
An ARIMA(p,d,q) model is used because the time series is non-stationary. The series appears to have a daily seasonality with a peak in the afternoon (when people get home from work), and the sample autocorrelation fails to die out rapidly.
- p: The order of the autoregressive part: 17
- q: The current deviation from the mean depends on the q previous deviations
- d: The amount of differencing, which is > 0 since the series is non-stationary
7. Forecast hourly trips for the next 48 hours using the best fit ARIMA model. [2 pts]
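Questions 6 and 7 can be sketched with base R's `arima` and `predict`. The assignment does not say how the best fit is to be selected, so the small AIC grid below is an assumption, as are the simulated series and the choice of one seasonal difference with period 24. It is not meant to reproduce the orders reported above.

```r
# Simulated hourly trip counts for 17 days; a stand-in for the real series.
set.seed(7)
n_days <- 17
hour   <- rep(0:23, times = n_days)
noise  <- as.numeric(arima.sim(list(ar = 0.5), n = 24 * n_days, sd = 3))
y <- ts(30 + 12 * sin(2 * pi * (hour - 11) / 24) + noise, frequency = 24)

# Grid over non-seasonal p and q. Every candidate uses the same differencing
# (d = 0, D = 1, period 24), so the AIC values are comparable.
best <- NULL
for (p in 0:2) {
  for (q in 0:1) {
    fit <- tryCatch(
      arima(y, order = c(p, 0, q),
            seasonal = list(order = c(0, 1, 1), period = 24)),
      error = function(e) NULL  # skip candidates that fail to fit
    )
    if (!is.null(fit) && (is.null(best) || fit$aic < best$aic)) best <- fit
  }
}
print(best)

# Question 7: forecast the next 48 hours with approximate 95% limits.
fc    <- predict(best, n.ahead = 48)
upper <- fc$pred + 1.96 * fc$se
lower <- fc$pred - 1.96 * fc$se
ts.plot(y, fc$pred, upper, lower,
        gpars = list(lty = c(1, 1, 2, 2),
                     col = c("black", "blue", "grey40", "grey40"),
                     xlab = "Day", ylab = "Trips per hour"))
```

The seasonal difference removes the repeating daily pattern before the AR and MA terms are fitted, which is one way to handle the daily seasonality described above.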
Problem 1 (car2go)
We will continue working on the application of car2go. Based on the car2go dataset collected, we simulate the travel demand in each of 50 zip codes in San Diego County. [Note that our data set in HW1 only contains 16 zip codes, which is not enough for classification applications here. A zip code in the US covers a fairly large area, e.g., a town.]
We identify high-demand regions by comparing the travel demand with a threshold, e.g., weekly demand > 2,000. The high-demand indicator, along with the demographics of the zip codes, is summarized in dataset “car2go_50zip” (in either “.RData” or “.csv” file) in the “Data” folder on IVLE. The columns in the dataset are summarized below.
- Trip origin zip code
- Population in the zip code
- Median income in the zip code
- Number of business establishments in the zip code
- is.high (target variable): “True” if the demand is high, “False” otherwise
In this assignment, we aim to develop a classification model to predict high-demand regions, e.g., zip codes, from the car2go dataset for San Diego, California. Hopefully, the resulting classification model can help us design better car sharing systems in other cities or countries.
[Hint: you may refer to the examples in “Lec6_Classification.html” on IVLE for classification tree, “Machine Learning with R” Chapter 7 for Linear SVM, and Google discussions on logistic regression using “glm” function, e.g., https://www.r-bloggers.com/how-to-perform-a-logisticregression-in-r/]
a) Visualize the demand classes against population and income. Discuss your observation from the plot. [2 pts]
b) Check the proportion of class variables in the entire dataset. [1 pts]
c) Create a random sample for training (e.g., 70%) and test (e.g., 30%) data. List the indices of samples in the training data, e.g., row numbers. [1 pt]
d) Build a classification tree using the training data (you may use packages other than “party”, e.g., “rpart”). Plot the tree and discuss your interpretation of the tree. [3 pts]
e) Tabulate the predictions on the training data and report the in-sample accuracy of the tree. [1 pts]
f) Tabulate the predictions on the test data and report the out-of-sample accuracy of the tree. Discuss if there is any difference between the in-sample and out-of-sample accuracy. [2 pts]
g) Use the same training data to create an SVM and report the results. (For example, use function “ksvm” in package “kernlab”. Alternatively, you may use package “e1071”.) [2 pts]
h) Report the prediction of SVM in the test data in table and report its out-of-sample accuracy. [1 pts]
i) Using the same training data for logistic regression and report the results. (For example, using the generalized linear model function “glm” by setting the parameter < family = "binomial">.) [2 pts]
j) Interpret and discuss the results from logistic regression. [2 pts]
k) Report the prediction of logistic regression in the test data in table and report its out-of-sample accuracy. [1 pts]
l) From the above procedures, discuss which data mining method, e.g., classification tree, SVM or logistic regression, you would like to deploy to forecast high demand regions in new cities. [2 pts]
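Parts a) to f) and i) to k) can be sketched as follows. The real “car2go_50zip” file is not available here, so the data are simulated, and the column names `population`, `income` and `business` are hypothetical (only `is.high` is named in the assignment). The SVM parts are left out because “kernlab” and “e1071” are add-on packages, while `rpart` and `glm` come with a standard R installation.

```r
library(rpart)  # recommended package bundled with standard R installations

# Simulated stand-in for car2go_50zip; column names and values are assumptions.
set.seed(123)
n <- 50
zips <- data.frame(
  population = round(runif(n, 5000, 80000)),
  income     = round(runif(n, 30000, 120000)),
  business   = round(runif(n, 100, 3000))
)
score <- as.vector(scale(zips$population) + scale(zips$business)) +
  rnorm(n, sd = 0.7)
zips$is.high <- factor(score > 0, levels = c(FALSE, TRUE))

# a) Demand class against population and income.
plot(zips$population, zips$income,
     col = ifelse(zips$is.high == "TRUE", "red", "blue"), pch = 19,
     xlab = "Population", ylab = "Median income")

# b) Class proportions in the full data set.
print(prop.table(table(zips$is.high)))

# c) 70/30 split; train_idx holds the training row numbers.
train_idx <- sort(sample(seq_len(n), size = round(0.7 * n)))
print(train_idx)
train <- zips[train_idx, ]
test  <- zips[-train_idx, ]

accuracy <- function(pred, truth) mean(pred == truth)

# d) to f) Classification tree with in-sample and out-of-sample accuracy.
tree <- rpart(is.high ~ population + income + business,
              data = train, method = "class")
if (nrow(tree$frame) > 1) {  # plot() fails on a tree that is only a root
  plot(tree, margin = 0.1)
  text(tree, use.n = TRUE)
}
tree_train <- predict(tree, train, type = "class")
tree_test  <- predict(tree, test,  type = "class")
print(table(predicted = tree_test, actual = test$is.high))
acc_tree_train <- accuracy(tree_train, train$is.high)
acc_tree_test  <- accuracy(tree_test,  test$is.high)

# i) to k) Logistic regression; glm models the probability of the
# second factor level, here TRUE.
logit <- glm(is.high ~ population + income + business,
             data = train, family = "binomial")
print(summary(logit))
prob_test  <- predict(logit, test, type = "response")
logit_test <- factor(prob_test > 0.5, levels = c(FALSE, TRUE))
print(table(predicted = logit_test, actual = test$is.high))
acc_logit_test <- accuracy(logit_test, test$is.high)

cat("tree train:", acc_tree_train, " tree test:", acc_tree_test,
    " logit test:", acc_logit_test, "\n")
```

With only 15 test rows, each misclassified zip code shifts the out-of-sample accuracy by almost 7 percentage points, so differences between methods on a single split should be read cautiously.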