The many shades of Gradient Descent (GD).
```cpp
bool gd(arma::vec& init_out_vals, std::function<double (const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)> opt_objfn, void* opt_data);

bool gd(arma::vec& init_out_vals, std::function<double (const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)> opt_objfn, void* opt_data, algo_settings_t& settings);
```
Function arguments:

- `init_out_vals`: a column vector of initial values, which will be replaced by the solution values.
- `opt_objfn`: the function to be minimized, taking three arguments:
  - `vals_inp`: a vector of inputs;
  - `grad_out`: a vector to store the gradient; and
  - `opt_data`: additional parameters passed to the function.
- `opt_data`: additional parameters passed to the function.
- `settings`: parameters controlling the optimization routine; see below.
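For illustration, here is a minimal sketch of an objective function matching the `opt_objfn` signature, passed to the unconstrained three-argument overload of `gd`. The function name `sphere_fn` and the choice of objective ($f(x) = x^\top x$) are hypothetical and used only for this sketch.

```cpp
#include "optim.hpp"

// hypothetical objective for illustration: f(x) = x'x, with gradient 2x
double sphere_fn(const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)
{
    const double obj_val = arma::dot(vals_inp, vals_inp);

    if (grad_out) {
        *grad_out = 2.0 * vals_inp; // the gradient is only written when requested
    }

    return obj_val;
}

int main()
{
    arma::vec x = arma::ones(5,1); // initial values; overwritten with the solution

    bool success = optim::gd(x, sphere_fn, nullptr); // no extra data, so opt_data is nullptr

    if (success) {
        arma::cout << "gd solution:\n" << x << arma::endl;
    }

    return 0;
}
```

Note that `grad_out` may be null, so the gradient should only be computed and stored when a non-null pointer is supplied.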
Optimization control parameters:

- `double err_tol`: the tolerance value controlling how small $\| \nabla f \|$ should be before 'convergence' is declared.
- `int iter_max`: the maximum number of iterations/updates before the algorithm exits.
- `bool vals_bound`: whether the search space is bounded. If true, then
  - `arma::vec lower_bounds`: this defines the lower bounds; and
  - `arma::vec upper_bounds`: this defines the upper bounds.
- `int gd_method`: which updating formula to use; see the details section below.
- `gd_settings_t gd_settings`: a structure containing options for the available GD algorithms; see the details section below.

Details:

Let $x^{(i)}$ denote the values at stage $i$ of the algorithm, and define $f^{(i)} := f (x^{(i)})$. The Gradient Descent (GD) updating rule is

$$x^{(i+1)} = x^{(i)} - d^{(i)}$$

where the descent direction $d^{(i)}$ is computed using one of several approaches described below, depending upon the choice of `gd_method`.
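To make this recursion concrete, the following is a schematic sketch (not the library's internal code) of the iteration for the basic variant described next (`gd_method = 0`), where $d^{(i)}$ is simply the gradient scaled by a fixed step size:

```cpp
#include <armadillo>
#include <functional>

// Schematic sketch of x^(i+1) = x^(i) - d^(i) with d^(i) = step_size * grad f(x^(i));
// illustrative only, not the library's implementation.
arma::vec gd_sketch(arma::vec x,
                    const std::function<arma::vec (const arma::vec&)>& grad_fn,
                    const double step_size, const double err_tol, const int iter_max)
{
    for (int iter = 0; iter < iter_max; ++iter) {
        const arma::vec grad = grad_fn(x);

        if (arma::norm(grad, 2) < err_tol) {
            break; // 'convergence' declared once the gradient norm is small enough
        }

        x -= step_size * grad; // x^(i+1) = x^(i) - d^(i)
    }

    return x;
}
```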
`gd_method = 0`: Basic GD. The descent direction is the gradient scaled by a fixed step size,

$$d^{(i)} = \alpha \times \nabla_x f(x^{(i)})$$

where the step size (learning rate) $\alpha$ is set via `step_size`.

`gd_method = 1`: GD with momentum. The descent direction accumulates past updates,

$$d^{(i)} = \mu \times d^{(i-1)} + \alpha \times \nabla_x f(x^{(i)})$$

where the momentum parameter $\mu$ is set via `momentum_par`.

`gd_method = 2`: Nesterov accelerated gradient descent (NAG). As with momentum, but the gradient is evaluated at a look-ahead point:

$$d^{(i)} = \mu \times d^{(i-1)} + \alpha \times \nabla_x f(x^{(i)} - \mu \times d^{(i-1)})$$

`gd_method = 3`: AdaGrad. The squared gradients are accumulated,

$$v^{(i)} = v^{(i-1)} + [\nabla_x f(x^{(i)})]^2$$

and the descent direction is

$$d^{(i)} = \alpha \times \dfrac{\nabla_x f(x^{(i)})}{\sqrt{v^{(i)}} + \epsilon}$$

where the square, square root, and division are taken elementwise, and the normalization term $\epsilon$ is set via `norm_term`.

`gd_method = 4`: RMSProp. The accumulator is replaced by an exponentially weighted moving average,

$$v^{(i)} = \rho \times v^{(i-1)} + (1 - \rho) \times [\nabla_x f(x^{(i)})]^2$$

with the same form of descent direction as AdaGrad; the decay rate $\rho$ is set via `ada_rho`.

`gd_method = 5`: AdaDelta, which extends RMSProp by also tracking an exponentially weighted moving average of the squared updates.

`gd_method = 6`: Adam (Adaptive Moment Estimation) and AdaMax. The first and second moments of the gradient are tracked via

$$m^{(i)} = \beta_1 \times m^{(i-1)} + (1 - \beta_1) \times \nabla_x f(x^{(i)})$$

$$v^{(i)} = \beta_2 \times v^{(i-1)} + (1 - \beta_2) \times [\nabla_x f(x^{(i)})]^2$$

where $\beta_1$ and $\beta_2$ are set via `adam_beta_1` and `adam_beta_2`, respectively. Then, with bias-corrected values $\hat{m} = m^{(i)} / (1 - \beta_1^i)$ and $\hat{v} = v^{(i)} / (1 - \beta_2^i)$,

$$d^{(i)} = \alpha \times \dfrac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$

If `ada_max = true`, then the updating rule for $v^{(i)}$ is no longer based on the $L_2$ norm; instead:

$$v^{(i)} = \max \left( \beta_2 \times v^{(i-1)}, \ | \nabla_x f(x^{(i)}) | \right)$$

`gd_method = 7`: Nadam and NadaMax. Nadam combines the Adam updating rules with Nesterov-style momentum. If `ada_max = true`, then the updating rule for $v^{(i)}$ is no longer based on the $L_2$ norm; instead it follows the max-based rule given above.

The algorithm stops when $\| \nabla f \|$ is less than `err_tol`, or the total number of iterations exceeds `iter_max`.
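The control parameters listed above are passed to `gd` through an `algo_settings_t` object supplied as the final argument. The sketch below shows one hypothetical configuration of a box-constrained run; the objective, bounds, and tuning values are illustrative assumptions rather than recommendations. The complete logit regression example that follows then exercises the Adam variant.

```cpp
#include "optim.hpp"

int main()
{
    // illustrative objective: f(x) = x'x, with gradient 2x
    auto sphere_fn = [](const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data) -> double
    {
        if (grad_out) {
            *grad_out = 2.0 * vals_inp;
        }
        return arma::dot(vals_inp, vals_inp);
    };

    arma::vec x = arma::ones(5,1); // initial values (inside the bounds below)

    optim::algo_settings_t settings;

    settings.err_tol  = 1e-8;  // convergence tolerance on || grad f ||
    settings.iter_max = 5000;  // maximum number of iterations

    settings.vals_bound   = true; // restrict the search space
    settings.lower_bounds = arma::zeros(5,1) + 0.5;
    settings.upper_bounds = arma::zeros(5,1) + 4.0;

    settings.gd_method = 4;               // RMSProp (see gd_method = 4 above)
    settings.gd_settings.step_size = 0.01;

    bool success = optim::gd(x, sphere_fn, nullptr, settings);

    if (success) {
        arma::cout << "bounded solution:\n" << x << arma::endl;
    }

    return 0;
}
```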
```cpp
#include "optim.hpp"

// sigmoid function
inline arma::mat sigm(const arma::mat& X)
{
    return 1.0 / (1.0 + arma::exp(-X));
}

// log-likelihood function data
struct ll_data_t {
    arma::vec Y;
    arma::mat X;
};

// log-likelihood function with hessian
double ll_fn_whess(const arma::vec& vals_inp, arma::vec* grad_out, arma::mat* hess_out, void* opt_data)
{
    ll_data_t* objfn_data = reinterpret_cast<ll_data_t*>(opt_data);

    arma::vec Y = objfn_data->Y;
    arma::mat X = objfn_data->X;

    arma::vec mu = sigm(X*vals_inp);

    const double norm_term = static_cast<double>(Y.n_elem);

    const double obj_val = - arma::accu( Y%arma::log(mu) + (1.0-Y)%arma::log(1.0-mu) ) / norm_term;

    //

    if (grad_out) {
        *grad_out = X.t() * (mu - Y) / norm_term;
    }

    //

    if (hess_out) {
        arma::mat S = arma::diagmat( mu%(1.0-mu) );
        *hess_out = X.t() * S * X / norm_term;
    }

    //

    return obj_val;
}

// log-likelihood function for Adam
double ll_fn(const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)
{
    return ll_fn_whess(vals_inp,grad_out,nullptr,opt_data);
}

//

int main()
{
    int n_dim = 5;     // dimension of parameter vector
    int n_samp = 4000; // sample length

    arma::mat X = arma::randn(n_samp,n_dim);
    arma::vec theta_0 = 1.0 + 3.0*arma::randu(n_dim,1);

    arma::vec mu = sigm(X*theta_0);

    arma::vec Y(n_samp);

    for (int i=0; i < n_samp; i++) {
        Y(i) = ( arma::as_scalar(arma::randu(1)) < mu(i) ) ? 1.0 : 0.0;
    }

    // fn data and initial values

    ll_data_t opt_data;
    opt_data.Y = std::move(Y);
    opt_data.X = std::move(X);

    arma::vec x = arma::ones(n_dim,1) + 1.0; // initial values

    // run Adam-based optim

    optim::algo_settings_t settings;

    settings.gd_method = 6;
    settings.gd_settings.step_size = 0.1;

    bool success = optim::gd(x,ll_fn,&opt_data,settings);

    arma::cout << "\nAdam: true values vs estimates:\n" << arma::join_rows(theta_0,x) << arma::endl;

    //
    // run Newton-based optim

    x = arma::ones(n_dim,1) + 1.0; // initial values

    success = optim::newton(x,ll_fn_whess,&opt_data);

    arma::cout << "\nnewton: true values vs estimates:\n" << arma::join_rows(theta_0,x) << arma::endl;

    //

    return 0;
}
```

Solution with timings:
```
Adam: logit_reg test completed successfully.
elapsed time: 0.025128s

Adam: true values vs estimates:
   2.7850   2.6993
   3.6561   3.6798
   2.3379   2.3860
   2.3167   2.4313
   2.2465   2.3064

newton: logit_reg test completed successfully.
elapsed time: 0.255909s

newton: true values vs estimates:
   2.7850   2.6993
   3.6561   3.6798
   2.3379   2.3860
   2.3167   2.4313
   2.2465   2.3064
```