
# We Helped With This R Programming Homework: Have A Similar One?

| Category | Programming |
|---|---|
| Subject | R \| R Studio |
| Difficulty | Graduate |
| Status | Solved |
| More Info | Statistic Help Online |

## Short Assignment Requirements

## Assignment Description

5M project: stochastic optimization

Big data analytics

### Project outline

The purpose of this project is to learn the basics of contemporary stochastic optimization methods. You will learn how such methods operate by implementing them and by tuning their hyperparameters. The stochastic optimization algorithms considered in the project are stochastic gradient descent (SGD), stochastic gradient descent with momentum (MSGD), stochastic gradient descent with Nesterov accelerated gradient (NAGSGD), AdaGrad, RMSProp and Adam.

Several of these stochastic optimization approaches were developed to train convolutional neural networks. For the purposes of the project, linear regression will be used instead of deep learning models. By using a familiar model such as linear regression, your focus will be placed on understanding the stochastic optimization algorithms.

Inference for linear regression coefficients via stochastic optimization will be conducted for three different datasets. The first dataset contains 1,000 simulated data points and is associated with two regression coefficients, namely the intercept and slope of a simple linear regression model. Starting with simple linear regression and simulated data, you will develop the involved stochastic optimization algorithms. Once you have implemented the stochastic optimization methods, you will use them on two real datasets. The second dataset is available on Kaggle, pertains to weather in Szeged between 2006 and 2016, and contains 96,453 observations and six parameters. The third dataset is the million song dataset (MSD), which you encountered in the first lab and which is available on the UCI machine learning repository. Recall that MSD is a big dataset, containing 515,345 data points and 91 parameters, so you will have the chance to fit multiple linear regression on a big dataset using stochastic optimization.

### R code template

A template folder is available on Moodle, which will help you get started with coding. It contains three R scripts, namely cost.r, optimizers.r and question1.r.

The script cost.r provides the definition of the cost function and its gradient for linear regression. This is the only cost function needed for the project, since linear regression is the only model applied to the three datasets.

The script optimizers.r provides the R function gd() for gradient descent (GD). You can use the gd() function for answering the relevant questions of the project and as a coding basis for developing the stochastic optimization algorithms. Moreover, optimizers.r contains R comments with function signatures for the required stochastic algorithms. You will need to complete the function body of these functions to tackle the project.

The script question1.r sources cost.r and optimizers.r initially, and provides a template of R comments subsequently to help structure your answer to question 1. Towards the end of question1.r, commented lines give you directions towards storing the numerical and visual output of your simulations for question 1.

The template folder also contains the data files simulated_data.csv, weather_data.csv and uci_data.csv, associated with questions 1, 2 and 3, respectively.

### An introduction to stochastic optimization

This section outlines the stochastic optimization methods appearing in the project. Gradient descent (GD), which has been taught in lectures 4-5 and has been coded in lab 1, is the starting point for stochastic optimization.

#### Gradient descent (GD)

Let *J*(θ, *D*) be a cost function, where θ is an *n*_θ-length parameter vector and *D* a dataset of *n*_d samples. Moreover, let ∇_θ *J*(θ, *D*) be the gradient of the cost function *J*(θ) := *J*(θ, *D*) with respect to θ. The approximation θ^(k) of θ at step *k* of GD, given the approximation θ^(k−1) of θ at step *k* − 1, is defined as

θ^(k) := θ^(k−1) − *a* ∇_θ *J*(θ^(k−1), *D*),  (1)

where the learning rate *a* is a positive real number.

In the case of a linear regression model, the data are set to *D* := {*X*, *y*}, where *X* is the *n*_d × *n*_θ design matrix and *y* is the *n*_d-length output vector associated with linear regression.

#### Stochastic gradient descent (SGD)

For a big dataset *D* consisting of a large number *n*_d of data points, the evaluation of the gradient ∇_θ *J*(θ, *D*) can be computationally expensive. To reduce the computational complexity of GD, stochastic gradient descent (SGD) samples an *s*-size subset *D*_s^(k) of the data *D* at step *k* and evaluates the gradient using this subset. Thus, the approximation θ^(k) of θ at step *k* of SGD, given the approximation θ^(k−1) of θ at step *k* − 1, is given by

θ^(k) := θ^(k−1) − *a* ∇_θ *J*(θ^(k−1), *D*_s^(k)).  (2)

SGD is more computationally efficient than GD in large-scale machine learning problems entailing large *n*_d or large *n*_θ.

As an implementation hint, use the sample() function in R to sample *s* out of *n*_d samples at step *k* without replacement. Subsequently, make sure that you index the design matrix *X* and the output variable *y* appropriately.

The remaining stochastic optimization methods of this project also use a random subset of the data *D* at step *k*.
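As a concrete illustration, the SGD update of equation (2) can be sketched in R by extending the provided gd() function. The body below is one possible completion of the sgd() stub from optimizers.r, not the official solution:

```r
## A possible completion of the sgd() stub: each iteration draws an
## s-size subsample without replacement and takes a gradient step on it.
sgd <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples) {
  theta <- matrix(data=NA, nrow=niters, ncol=npars)
  cost <- vector(mode="numeric", length=niters)
  theta[1, ] <- theta0
  cost[1] <- f(y, X, theta0, ndata)
  for (i in 2:niters) {
    idx <- sample(ndata, nsubsamples)  # s-size subset D_s^(k), no replacement
    g <- grad(y[idx], X[idx, , drop=FALSE], theta[i-1, ], nsubsamples)
    theta[i, ] <- theta[i-1, ] - a*g   # equation (2)
    cost[i] <- f(y, X, theta[i, ], ndata)  # track cost on the full dataset
  }
  return(list(theta=theta, cost=cost))
}
```

Subsetting both X and y with the same indices keeps the subsampled gradient an unbiased estimate of the full-data gradient, while the cost is still evaluated on the full dataset so that iterations remain comparable across optimizers.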

#### SGD with momentum (MSGD)

In SGD with momentum (MSGD), a ‘momentum’ term **m **is defined, which is a moving average of the gradient), and the parameters
are then updated using the momentum **m **instead of the gradient

The momentum **m**^{(k) }and parameter
approximation θ^{(k) }at the *k*-th step of MSGD are defined recursively according to

**m***, *(3) θ(*k*) := θ(*k*−1) − *a***m**(*k*)*. *(4)

MSGD has two hyper-parameters, a momentum
memory factor *b *with values in (0*,*1) and a positive learning rate *a*. Furthermore, an initial momentum value **m**^{(0) }is
required by equation (3) at step *k *= 1.

For values of *b *closer to one, the moving average
equation (3) retains stronger memory of momentum history. For values of *b *closer
to zero, the moving average equation (3) remembers less any past momentum
history and places more emphasis on the current gradient of the cost function.

Variant representations of MSGD are available in the literature. Equations (3)-(4) have been preferred due to providing a clearer understanding of MSGD.
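Assuming the exponential moving-average form of equation (3), **m**^(k) = *b* **m**^(k−1) + (1 − *b*) **g**^(k−1), one possible completion of the msgd() stub is sketched below; treat it as an illustration rather than the reference implementation:

```r
## Sketch of MSGD: momentum m is an exponential moving average of
## subsampled gradients (eq. (3)); parameters follow m (eq. (4)).
msgd <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples, b, m0) {
  theta <- matrix(data=NA, nrow=niters, ncol=npars)
  cost <- vector(mode="numeric", length=niters)
  theta[1, ] <- theta0
  cost[1] <- f(y, X, theta0, ndata)
  m <- m0
  for (i in 2:niters) {
    idx <- sample(ndata, nsubsamples)
    g <- grad(y[idx], X[idx, , drop=FALSE], theta[i-1, ], nsubsamples)
    m <- b*m + (1 - b)*g               # eq. (3): moving average of gradients
    theta[i, ] <- theta[i-1, ] - a*m   # eq. (4)
    cost[i] <- f(y, X, theta[i, ], ndata)
  }
  return(list(theta=theta, cost=cost))
}
```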

#### SGD with Nesterov accelerated gradient (NAGSGD)

SGD with Nesterov accelerated gradient (NAGSGD) differs from MSGD only in the way it calculates the gradient of the cost function in the moving average equation. More specifically, NAGSGD calculates the gradient of the cost function with respect to the approximate future position θ^(k−1) − *a* *b* **m**^(k−1) rather than with respect to the current parameter approximation θ^(k−1).

Thus, step *k* of NAGSGD is defined as

**m**^(k) := *b* **m**^(k−1) + (1 − *b*) ∇_θ *J*(θ^(k−1) − *a* *b* **m**^(k−1), *D*_s^(k)),  (5)

θ^(k) := θ^(k−1) − *a* **m**^(k).  (6)

Alternative expressions for NAGSGD can be found in the literature. Equations (5)-(6) have been chosen for the purpose of communicating the moving average model for momentum and the resulting parameter update clearly.
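A sketch of the nagsgd() stub follows; it mirrors MSGD except that the gradient is evaluated at the lookahead position θ^(k−1) − *a* *b* **m**^(k−1), again assuming the moving-average form of the momentum update:

```r
## Sketch of NAGSGD: identical to MSGD except the gradient is evaluated
## at the lookahead position theta - a*b*m (eq. (5)).
nagsgd <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples, b, m0) {
  theta <- matrix(data=NA, nrow=niters, ncol=npars)
  cost <- vector(mode="numeric", length=niters)
  theta[1, ] <- theta0
  cost[1] <- f(y, X, theta0, ndata)
  m <- m0
  for (i in 2:niters) {
    idx <- sample(ndata, nsubsamples)
    lookahead <- theta[i-1, ] - a*b*as.vector(m)   # approximate future position
    g <- grad(y[idx], X[idx, , drop=FALSE], lookahead, nsubsamples)
    m <- b*m + (1 - b)*g               # eq. (5)
    theta[i, ] <- theta[i-1, ] - a*m   # eq. (6)
    cost[i] <- f(y, X, theta[i, ], ndata)
  }
  return(list(theta=theta, cost=cost))
}
```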

#### AdaGrad

The learning rate *a* of SGD can be too small for one parameter and too large for another. This problem tends to manifest itself for a large number *n*_θ of parameters. AdaGrad mitigates this problem by adaptively and automatically scaling a separate learning rate for each parameter.

Using the shorthand **g**^(k) := ∇_θ *J*(θ^(k), *D*_s^(k)), consider the outer product **g**^(k) × **g**^(k) at step *k*, which is an *n*_θ × *n*_θ matrix. Summing over all steps up to *k* yields the *n*_θ × *n*_θ matrix

*G*^(k) := ∑_{ℓ=0}^{k} **g**^(ℓ) × **g**^(ℓ).  (7)

At step *k*, AdaGrad performs the parameter update

θ^(k) := θ^(k−1) − *a* (diag(*G*^(k−1)) + *ϵ*)^⊙(−0.5) ⊙ **g**^(k−1),  (8)

where *a* is a positive learning rate, diag(*G*^(k−1)) is the diagonal of matrix *G*^(k−1), *ϵ* is a positive smoothing term that avoids division by zero (usually on the order of 1e−8) and ⊙ denotes Hadamard (element-wise) operations. Step *k* = 1 requires an initial value *G*^(0).

To further clarify AdaGrad, equation (8) can be written in an alternative form using element-wise notation instead of matrix notation. Let *g*_j^(k) be the *j*-th partial derivative of the cost function at step *k*, i.e. the *j*-th coordinate of the gradient ∇_θ *J*(θ^(k), *D*_s^(k)). The AdaGrad update of the *j*-th parameter *θ*_j^(k) at step *k* is

*θ*_j^(k) := *θ*_j^(k−1) − (*a* / √(*G*_jj^(k−1) + *ϵ*)) *g*_j^(k−1),  (9)

where *G*_jj^(k−1) is the (*j*, *j*)-th diagonal element of *G*^(k−1).

Equation (9) makes it clear that AdaGrad uses a differently scaled learning rate *a* / √(*G*_jj^(k−1) + *ϵ*) for each parameter *θ*_j.

_{j}As a hint to assist implementation, you may focus on
equation (9). You do not need to compute the outer products appearing in
equation (8). It suffices to calculate the diagonal of matrix *G*^{(k)},
which means that it suffices to calculate the sum of squared gradients

*k*

*G _{jj}*(

*k*) = ∑(

*g*(

_{j}*ℓ*))

^{2}

*.*(10)

*ℓ*=0

AdaGrad eliminates the need to manually tune the learning rate, since it adaptively tunes the learning rate for each parameter according to equation (9). However, this adaptive tuning comes with a weakness: squared gradients accumulate in the denominator of the learning rate *a* / √(*G*_jj^(k−1) + *ϵ*), hence the learning rates shrink during training and AdaGrad eventually becomes unable to perform meaningful parameter updates via equation (9).
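Following the implementation hint, a sketch of the adagrad() stub that tracks only the running sums of squared gradients of equation (10), with *ϵ* placed inside the square root as in equation (8):

```r
## Sketch of AdaGrad via eqs. (9)-(10): accumulate squared gradients in
## Gdiag and scale each coordinate's learning rate by 1/sqrt(Gdiag + eps).
adagrad <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples, epsilon, G0) {
  theta <- matrix(data=NA, nrow=niters, ncol=npars)
  cost <- vector(mode="numeric", length=niters)
  theta[1, ] <- theta0
  cost[1] <- f(y, X, theta0, ndata)
  Gdiag <- G0
  for (i in 2:niters) {
    idx <- sample(ndata, nsubsamples)
    g <- as.vector(grad(y[idx], X[idx, , drop=FALSE], theta[i-1, ], nsubsamples))
    Gdiag <- Gdiag + g^2                                      # eq. (10)
    theta[i, ] <- theta[i-1, ] - a/sqrt(Gdiag + epsilon)*g    # eq. (9)
    cost[i] <- f(y, X, theta[i, ], ndata)
  }
  return(list(theta=theta, cost=cost))
}
```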

#### RMSProp

RMSProp attempts to alleviate the aforementioned weakness of AdaGrad by applying a moving average model to squared gradients and by then using the output of the moving average model in the AdaGrad parameter update mechanism.

More concretely, the *j*-th coordinate *v*_j^(k) of the moving average of squared gradients and the *j*-th parameter at step *k* of RMSProp are updated according to

*v*_j^(k) := *c* *v*_j^(k−1) + (1 − *c*) (*g*_j^(k−1))²,  (11)

*θ*_j^(k) := *θ*_j^(k−1) − (*a* / √(*v*_j^(k) + *ϵ*)) *g*_j^(k−1).  (12)

The hyperparameter *c* is known as the memory factor for squared gradients, taking values in (0, 1). Equation (11) requires an initial value *v*_j^(0) at step *k* = 1. Taking into account all coordinates *j* = 1, 2, ..., *n*_θ, the initial value **v**^(0) for the moving average model of RMSProp is an *n*_θ-length vector.
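A sketch of the rmsprop() stub under equations (11)-(12); note that the template's memory-factor argument is named c, which harmlessly shadows base::c inside the function body since c() is never called there:

```r
## Sketch of RMSProp: v is an exponential moving average of squared
## gradients (eq. (11)) used to scale each coordinate's step (eq. (12)).
rmsprop <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples, c, epsilon, v0) {
  theta <- matrix(data=NA, nrow=niters, ncol=npars)
  cost <- vector(mode="numeric", length=niters)
  theta[1, ] <- theta0
  cost[1] <- f(y, X, theta0, ndata)
  v <- v0
  for (i in 2:niters) {
    idx <- sample(ndata, nsubsamples)
    g <- as.vector(grad(y[idx], X[idx, , drop=FALSE], theta[i-1, ], nsubsamples))
    v <- c*v + (1 - c)*g^2                              # eq. (11)
    theta[i, ] <- theta[i-1, ] - a/sqrt(v + epsilon)*g  # eq. (12)
    cost[i] <- f(y, X, theta[i, ], ndata)
  }
  return(list(theta=theta, cost=cost))
}
```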

#### Adam

Adaptive moment estimation (Adam) is a stochastic optimization method that computes a different learning rate for each parameter adaptively. Apart from retaining a moving average of squared gradients similarly to RMSProp, Adam stores a moving average of gradients additionally.

The Adam update of the *j*-th parameter *θ*_j at step *k* is given by

*m*_j^(k) := *b* *m*_j^(k−1) + (1 − *b*) *g*_j^(k−1),  (13)

*v*_j^(k) := *c* *v*_j^(k−1) + (1 − *c*) (*g*_j^(k−1))²,  (14)

*m̂*_j^(k) := *m*_j^(k) / (1 − *b*^k),  (15)

*v̂*_j^(k) := *v*_j^(k) / (1 − *c*^k),  (16)

*θ*_j^(k) := *θ*_j^(k−1) − *a* *m̂*_j^(k) / (√(*v̂*_j^(k)) + *ϵ*).  (17)

Adam has three hyperparameters, namely a positive learning rate *a*, a memory factor *b* of gradients with values in (0, 1) and a memory factor *c* of squared gradients with values in (0, 1). Moreover, Adam requires the initial values **m**^(0) and **v**^(0) at step *k* = 1.

Equations (13) and (14) provide the moving averages of gradients and of squared gradients, respectively. Adam takes its name from these first- and second-moment estimates of the gradients. Because **m**^(0) and **v**^(0) are typically initialized as vectors of zeros, the moving averages **m**^(k) and **v**^(k) are biased towards zero during the early steps, especially when *b* and *c* are set to values close to one. To counteract these biases while estimating the moving averages of gradients and of squared gradients, the bias-corrected equations (15) and (16) are introduced.
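A sketch of the adam() stub; here the loop index i plays the role of step k = i − 1 in the bias corrections of equations (15)-(16):

```r
## Sketch of Adam via eqs. (13)-(17): moving averages of gradients (m)
## and squared gradients (v), bias-corrected before the parameter update.
adam <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples, b, c, epsilon, m0, v0) {
  theta <- matrix(data=NA, nrow=niters, ncol=npars)
  cost <- vector(mode="numeric", length=niters)
  theta[1, ] <- theta0
  cost[1] <- f(y, X, theta0, ndata)
  m <- m0
  v <- v0
  for (i in 2:niters) {
    k <- i - 1
    idx <- sample(ndata, nsubsamples)
    g <- as.vector(grad(y[idx], X[idx, , drop=FALSE], theta[i-1, ], nsubsamples))
    m <- b*m + (1 - b)*g          # eq. (13)
    v <- c*v + (1 - c)*g^2        # eq. (14)
    mhat <- m/(1 - b^k)           # eq. (15): bias correction of m
    vhat <- v/(1 - c^k)           # eq. (16): bias correction of v
    theta[i, ] <- theta[i-1, ] - a*mhat/(sqrt(vhat) + epsilon)  # eq. (17)
    cost[i] <- f(y, X, theta[i, ], ndata)
  }
  return(list(theta=theta, cost=cost))
}
```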

### Question 1

The goal of question 1 is to develop the stochastic optimizers of this project using a simulated dataset and a cost function for simple linear regression.

To help you get started, the provided template folder contains an optimizers.r file with R code for gradient descent, a cost.r file with R code for the cost function for linear regression and its gradient, a question1.r file with some preliminary R code towards answering question 1 and the simulated_data.csv file on which simple linear regression will be fitted.

As part of your answer, you will edit the optimizers.r file to include in it your R code for the optimizers of this project and you will also edit the file question1.r to include in it your R code for running the optimizers and simulations of question 1. You will not provide any output files or a report as part of your project submission.

Make sure that every optimizer in optimizers.r is a standalone R function with a return statement return(list(theta=theta, cost=cost))

as in the provided gd() function, to store the iterations of the parameter estimates and of the cost function.

You need to make sure before submission that the R command source("question1.r") runs successfully and stores the required results in output files as explained further below. question1.r starts with the lines

source("optimizers.r") source("cost.r")

to load the R code for the cost function for linear regression and its gradient as well as the R code for the optimizers. The subsequent line in question1.r loads the data for question 1: simulated_data <- read.csv("simulated_data.csv", header=TRUE)

simulated_data.csv consists of two columns; the first column is labelled y and represents the output variable, while the second column is labelled covariates and plays the role of the covariate. The remaining lines in question1.r are R comments meant to guide you towards answering question 1.

Firstly, standardize the covariate and create the design matrix X including a first column of ones for intercept.
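As a small illustration with an invented stand-in for the loaded data frame (the real one comes from read.csv("simulated_data.csv", header=TRUE)), the standardization and design matrix construction might look like:

```r
## Hypothetical stand-in for simulated_data <- read.csv("simulated_data.csv", header=TRUE)
simulated_data <- data.frame(y=c(1.2, 0.7, 2.1, 1.5, 0.9),
                             covariates=c(0.3, -0.5, 1.1, 0.2, -0.8))
## Standardize the covariate to zero mean and unit standard deviation
z <- (simulated_data$covariates - mean(simulated_data$covariates))/sd(simulated_data$covariates)
## Design matrix with a leading column of ones for the intercept
X <- cbind(1, z)
```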

Fit a simple linear regression to the data using the standardized covariate and the lm() function in R. The model will have two parameters, namely a regression coefficient corresponding to covariate and an intercept.

Use the same initial value θ^(0) = (7, −8)^T for all optimizers.

Run 100 iterations of GD, tuning the learning rate *a* of equation (1) to converge towards the regression coefficient estimates obtained via lm().

Run 100 iterations of SGD by sampling a subset of *s* = 100 data points per iteration. Tune the learning rate *a* of equation (2) to converge towards the regression coefficient estimates obtained via lm(). Notice that even for such a simple example, SGD only approximates the regression coefficient estimates known from lm(): although SGD does not yield an exact solution to the optimization problem, it provides a reasonable approximation.

Run 100 iterations of MSGD by sampling a subset of *s* = 100 data points per iteration. Set the initial value for momentum to a vector of zeros, that is, **m**^(0) = (0, 0)^T. Tune the learning rate *a* of equation (4) and the memory factor *b* of equation (3) to converge towards the regression coefficient estimates obtained via lm().

Run 100 iterations of NAGSGD by sampling a subset of *s* = 100 data points per iteration. Set the initial value for momentum to a vector of zeros, that is, **m**^(0) = (0, 0)^T. Tune the learning rate *a* of equation (6) and the memory factor *b* of equation (5) to converge towards the regression coefficient estimates obtained via lm().

Run 100 iterations of AdaGrad by sampling a subset of *s* = 100 data points per iteration. Set diag(*G*^(0)) = (0, 0)^T and *ϵ* = 1e−8. Tune the learning rate *a* of equation (9) to converge towards the regression coefficient estimates obtained via lm().

Run 100 iterations of RMSProp by sampling a subset of *s* = 100 data points per iteration. Set **v**^(0) = (0, 0)^T and *ϵ* = 1e−8. Tune the learning rate *a* of equation (12) and the memory factor *c* of equation (11) to converge towards the regression coefficient estimates obtained via lm().

Run 100 iterations of Adam by sampling a subset of *s* = 100 data points per iteration. Set **m**^(0) = (0, 0)^T, **v**^(0) = (0, 0)^T and *ϵ* = 1e−8. Tune the learning rate *a* of equation (17), the memory factor *b* of equation (13) and the memory factor *c* of equation (14) to converge towards the regression coefficient estimates obtained via lm().

Your code in question1.r will store numerical summaries in CSV files and visual summaries in PDF files as follows:

(a) Save the design matrix X in the comma-separated value file answer1a.csv.

(b) Save the parameter estimates obtained via lm() and the final parameter estimates at the last iteration of each optimizer in the comma-separated value file answer1b.csv. The parameter estimates saved in answer1b.csv correspond to a 2 × 8 matrix, in which the first and second rows represent the respective regression coefficient estimates *θ*_0 and *θ*_1, while the eight columns represent the parameter estimates obtained via lm(), GD, SGD, MSGD, NAGSGD, AdaGrad, RMSProp and Adam. R code in the form of comments has been provided in question1.r as guidance to facilitate the process of saving all parameter estimates.

(c) Save the value of the cost function at the last iteration of each optimizer in the comma-separated value file answer1c.csv. The cost function values saved in answer1c.csv correspond to a 1 × 7 matrix, in which the seven columns represent the cost function values obtained via GD, SGD, MSGD, NAGSGD, AdaGrad, RMSProp and Adam. R code in the form of comments has been provided in question1.r as guidance to facilitate the process of saving all cost function values.

(d) Plot the cost function of NAGSGD and of AdaGrad against the number of iterations in a single plot. Ensure that it is possible to distinguish between the overlaid cost function values of NAGSGD and of AdaGrad. Add a legend to the plot to make such a distinction clearer. Save the plot in the PDF file answer1d.pdf. R code in the form of comments has been provided in question1.r as guidance to facilitate the process of saving the plot of cost function values.

(e) Plot the successive iterations of parameter *θ*_0 against the successive iterations of parameter *θ*_1 for GD and RMSProp. Moreover, add a point in the plot indicating the parameter estimate obtained via lm(). All optimizers started from the same initial parameter value θ^(0) = (7, −8)^T, so the trajectories of GD and RMSProp must have the same starting point in the plot. Moreover, the end points of the trajectories of GD and RMSProp must be in near proximity to the single point representing the parameter estimate obtained via lm() if the optimizers were tuned appropriately. Ensure that it is possible to distinguish between the overlaid trajectories of GD and RMSProp. Add a legend to the plot to make such a distinction clearer. Save the plot in the PDF file answer1e.pdf. R code in the form of comments has been provided in question1.r as guidance to facilitate the process of saving the plot of parameter trajectories.

Upload only the R scripts cost.r, optimizers.r and question1.r. Make sure that the R command source("question1.r") works free from bugs and that it generates the output files answer1a.csv, answer1b.csv, answer1c.csv, answer1d.pdf and answer1e.pdf. However, do not upload the output files themselves. I will assess your work by running the R command source("question1.r") and by looking into the generated output files.

### Question 2

The goal of question 2 is to run the stochastic optimizers of this project to fit multiple linear regression on a real dataset of relatively small size. Given that you have already developed the R code for stochastic optimization to tackle question 1, the only additional work required for tackling question 2 is to tune the hyperparameters of the optimizers.

The files cost.r and optimizers.r should be available after having answered question 1. Create an R script with name question2.r to include in it your R code for running the optimizers and simulations of question 2. You will not provide any output files or a report as part of your project submission.

You need to make sure before submission that the R command source("question2.r") runs successfully and stores the required results in output files as explained further below.

Similarly to question1.r, the file question2.r will start with the lines

source("optimizers.r") source("cost.r")

to load the R code for the cost function for linear regression and its gradient as well as the R code for the optimizers. All subsequent R code for answering question 2 should be saved in question2.r.

Start by loading the data file weather_data.csv, whose first line is the header containing column labels. For the purposes of this project, it suffices to know that the column labelled apparent temperature is the output variable, while the columns labelled temperature, humidity, wind speed, wind bearing and pressure will be used as covariates in the multiple linear regression model.

The dataset weather_data.csv can be found on Kaggle at

https://www.kaggle.com/budincsevity/szeged-weather

There is no need to use Kaggle in order to address question 2. If you decide to look up the dataset on Kaggle, keep in mind that the original column labels have been altered to make R coding more user-friendly.

Firstly, standardize all five covariates and create the design matrix X including a first column of ones for intercept.

Fit a multiple linear regression to the data using the standardized covariates and the lm() function in R. The model will have six parameters, namely five regression coefficients corresponding to the covariates and an intercept.

Use the same initial value θ^(0) = (−5, −3, 4, 1, 10, −9)^T for all optimizers.

Run 70 iterations of GD, tuning the learning rate *a* of equation (1) to converge towards the regression coefficient estimates obtained via lm().

Run 200 iterations of SGD by sampling a subset of *s* = 1000 data points per iteration. Tune the learning rate *a* of equation (2) to converge towards the regression coefficient estimates obtained via lm(). Notice that even for such a simple example, SGD only approximates the regression coefficient estimates known from lm(): although SGD does not yield an exact solution to the optimization problem, it provides a reasonable approximation.

Run 200 iterations of MSGD by sampling a subset of *s* = 1000 data points per iteration. Set the initial value for momentum to a vector of zeros, that is, **m**^(0) = (0, 0, 0, 0, 0, 0)^T. Tune the learning rate *a* of equation (4) and the memory factor *b* of equation (3) to converge towards the regression coefficient estimates obtained via lm().

Run 200 iterations of NAGSGD by sampling a subset of *s* = 1000 data points per iteration. Set the initial value for momentum to a vector of zeros, that is, **m**^(0) = (0, 0, 0, 0, 0, 0)^T. Tune the learning rate *a* of equation (6) and the memory factor *b* of equation (5) to converge towards the regression coefficient estimates obtained via lm().

Run 200 iterations of AdaGrad by sampling a subset of *s* = 1000 data points per iteration. Set diag(*G*^(0)) = (0, 0, 0, 0, 0, 0)^T and *ϵ* = 1e−8. Tune the learning rate *a* of equation (9) to converge towards the regression coefficient estimates obtained via lm().

Run 200 iterations of RMSProp by sampling a subset of *s* = 1000 data points per iteration. Set **v**^(0) = (0, 0, 0, 0, 0, 0)^T and *ϵ* = 1e−8. Tune the learning rate *a* of equation (12) and the memory factor *c* of equation (11) to converge towards the regression coefficient estimates obtained via lm().

Run 200 iterations of Adam by sampling a subset of *s* = 1000 data points per iteration. Set **m**^(0) = (0, 0, 0, 0, 0, 0)^T, **v**^(0) = (0, 0, 0, 0, 0, 0)^T and *ϵ* = 1e−8. Tune the learning rate *a* of equation (17), the memory factor *b* of equation (13) and the memory factor *c* of equation (14) to converge towards the regression coefficient estimates obtained via lm().

Your code in question2.r will store numerical summaries in CSV files and visual summaries in PDF files as follows:

(a) Save the design matrix X in the comma-separated value file answer2a.csv.

(b) Save the parameter estimates obtained via lm() and the final parameter estimates at the last iteration of each optimizer in the comma-separated value file answer2b.csv. The parameter estimates saved in answer2b.csv correspond to a 6 × 8 matrix, in which each row represents one of the six regression coefficient estimates, while the eight columns represent the parameter estimates obtained via lm(), GD, SGD, MSGD, NAGSGD, AdaGrad, RMSProp and Adam.

(c) Save the value of the cost function at the last iteration of each optimizer in the comma-separated value file answer2c.csv. The cost function values saved in answer2c.csv correspond to a 1 × 7 matrix, in which the seven columns represent the cost function values obtained via GD, SGD, MSGD, NAGSGD, AdaGrad, RMSProp and Adam.

(d) Plot the cost function of NAGSGD and of AdaGrad against the number of iterations in a single plot. Ensure that it is possible to distinguish between the overlaid cost function values of NAGSGD and of AdaGrad. Add a legend to the plot to make such a distinction clearer. Save the plot in the PDF file answer2d.pdf.

(e) Plot the successive iterations of parameter *θ*_2 against the successive iterations of parameter *θ*_3 for GD, MSGD and AdaGrad, recalling that the indexing notation for parameters starts from zero at the intercept *θ*_0. Moreover, add a point in the plot indicating the parameter estimate (*θ*_2, *θ*_3) obtained via lm(). All optimizers started from the same initial parameter value θ^(0) = (−5, −3, 4, 1, 10, −9)^T, so the trajectories of GD, MSGD and AdaGrad must have the same starting point in the plot. Moreover, the end points of the trajectories of GD, MSGD and AdaGrad must be in near proximity to the single point representing the parameter estimate obtained via lm() if the optimizers were tuned appropriately. Ensure that it is possible to distinguish between the overlaid trajectories of GD, MSGD and AdaGrad. Add a legend to the plot to make such a distinction clearer. Save the plot in the PDF file answer2e.pdf.

Upload only the R scripts cost.r, optimizers.r and question2.r. Make sure that the R command source("question2.r") works free from bugs and that it generates the output files answer2a.csv, answer2b.csv, answer2c.csv, answer2d.pdf and answer2e.pdf. However, do not upload the output files themselves. I will assess your work by running the R command source("question2.r") and by looking into the generated output files.

### Question 3 (not assessed)

**Question 3 will not be assessed; it is provided for educational purposes.** If you do not tackle question 3, you will not be penalized. However, you are advised to try question 3 in order to experience the computational advantage of stochastic optimization over deterministic optimization in the case of big datasets.

The goal of question 3 is to run the stochastic optimizers of this project to fit multiple linear regression on a relatively big dataset. Given that you have already developed the R code for stochastic optimization to tackle question 1, the only additional work required for tackling question 3 is to handle data processing efficiently and to tune the hyperparameters of the optimizers.

The dataset uci_data.csv used in question 3 is the same subset of the million song dataset used in lab 1. It contains 515,345 songs (file rows) and 91 columns. The first column is the year of release, ranging from 1922 to 2011, and plays the role of the output variable. The remaining 90 columns are continuous predictors of the year of release. uci_data.csv is included in the template folder. More information about this subset of the million song dataset is available at

http://archive.ics.uci.edu/ml/datasets/YearPredictionMSD

You will observe the computational cost of fine-tuning optimization given a relatively big dataset. At the same time, you will see how stochastic optimization reduces runtime dramatically in comparison to gradient descent. Moreover, you will see that stochastic optimization attains reasonable parameter estimates in a problem somewhat less trivial in terms of data space and parameter space dimensionality (*n*_d = 515,345 and *n*_θ = 91).

The files cost.r and optimizers.r should be available after having answered question 1. Create an R script with name question3.r to include in it your R code for running the optimizers and simulations of question 3.

The R command source("question3.r") will store the results in output files as explained further below. The anticipated runtime of the R command source("question3.r") is not more than fifteen minutes on a computer at one of the labs of the Mathematics and Statistics building.

Similarly to question1.r, the file question3.r will start with the lines

## Assignment Code

```
### Gradient descent (GD) given a cost function f and its gradient grad
gd <- function(f, grad, y, X, theta0, npars, ndata, a, niters) {
  # Pre-allocate storage for the parameter iterates and cost values
  theta <- matrix(data=NA, nrow=niters, ncol=npars)
  cost <- vector(mode="numeric", length=niters)
  theta[1, ] <- theta0
  cost[1] <- f(y, X, theta0, ndata)
  for (i in 2:niters) {
    # Full-data gradient step, equation (1)
    theta[i, ] <- theta[i-1, ] - a*grad(y, X, theta[i-1, ], ndata)
    cost[i] <- f(y, X, theta[i, ], ndata)
  }
  return(list(theta=theta, cost=cost))
}

### Stochastic gradient descent (SGD) given a cost function f and its gradient grad
# sgd <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples) {
# }

### Stochastic gradient descent with momentum (MSGD) given a cost function f and its gradient grad
# msgd <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples, b, m0) {
# }

### Stochastic gradient descent with Nesterov accelerated gradient (NAGSGD) given a cost function f and its gradient grad
# nagsgd <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples, b, m0) {
# }

### AdaGrad given a cost function f and its gradient grad
# adagrad <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples, epsilon, G0) {
# }

### RMSProp given a cost function f and its gradient grad
# rmsprop <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples, c, epsilon, v0) {
# }

### Adam given a cost function f and its gradient grad
# adam <- function(f, grad, y, X, theta0, npars, ndata, a, niters, nsubsamples, b, c, epsilon, m0, v0) {
# }
```

## Assignment Code

```
### Cost functions for linear regression

## Cost function for linear regression: half the mean squared error
lmf <- function(y, X, theta, ndata) {
  sum((X%*%theta - y)^2)/(2*ndata)
}

## Gradient of the cost function for linear regression with respect to theta
lmgrad <- function(y, X, theta, ndata) {
  t(X)%*%(X%*%theta - y)/ndata
}
```