The many shades of Gradient Descent (GD).
```cpp
bool gd(arma::vec& init_out_vals, std::function<double (const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)> opt_objfn, void* opt_data);

bool gd(arma::vec& init_out_vals, std::function<double (const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)> opt_objfn, void* opt_data, algo_settings_t& settings);
```
Function arguments:

- `init_out_vals`: a column vector of initial values, which will be replaced by the solution values.
- `opt_objfn`: the function to be minimized, taking three arguments:
  - `vals_inp`: a vector of inputs;
  - `grad_out`: a vector to store the gradient; and
  - `opt_data`: additional parameters passed to the function.
- `opt_data`: additional parameters passed to the function.
- `settings`: parameters controlling the optimization routine; see below.
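For illustration, here is a minimal sketch of an objective function matching the `opt_objfn` signature, passed to the unconstrained three-argument overload of `gd`. The function name `sphere_fn` and the choice of objective ($f(x) = x^\top x$) are hypothetical and used only for this sketch.

```cpp
#include "optim.hpp"

// hypothetical objective for illustration: f(x) = x'x, with gradient 2x
double sphere_fn(const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)
{
    const double obj_val = arma::dot(vals_inp, vals_inp);

    if (grad_out) {
        *grad_out = 2.0 * vals_inp; // the gradient is only written when requested
    }

    return obj_val;
}

int main()
{
    arma::vec x = arma::ones(5,1); // initial values; overwritten with the solution

    bool success = optim::gd(x, sphere_fn, nullptr); // no extra data, so opt_data is nullptr

    if (success) {
        arma::cout << "gd solution:\n" << x << arma::endl;
    }

    return 0;
}
```

Note that `grad_out` may be null, so the gradient should only be computed and stored when a non-null pointer is supplied.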
Optimization control parameters:

- `double err_tol`: the tolerance value controlling how small $\| \nabla f \|$ should be before 'convergence' is declared.
- `int iter_max`: the maximum number of iterations/updates before the algorithm exits.
- `bool vals_bound`: whether the search space is bounded. If true, then
  - `arma::vec lower_bounds`: this defines the lower bounds; and
  - `arma::vec upper_bounds`: this defines the upper bounds.
- `int gd_method`: which updating formula to use; see the details section below.
- `gd_settings_t gd_settings`: a structure containing options for the available GD algorithms; see the details section below.

Details:

Let $x^{(i)}$ denote the values at stage $i$ of the algorithm, and define $f^{(i)} := f (x^{(i)})$. The Gradient Descent (GD) updating rule is

$$x^{(i+1)} = x^{(i)} - d^{(i)}$$

where the descent direction $d^{(i)}$ is computed using one of several approaches described below, depending upon the choice of `gd_method`.
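To make this recursion concrete, the following is a schematic sketch (not the library's internal code) of the iteration for the basic variant described next (`gd_method = 0`), where $d^{(i)}$ is simply the gradient scaled by a fixed step size:

```cpp
#include <armadillo>
#include <functional>

// Schematic sketch of x^(i+1) = x^(i) - d^(i) with d^(i) = step_size * grad f(x^(i));
// illustrative only, not the library's implementation.
arma::vec gd_sketch(arma::vec x,
                    const std::function<arma::vec (const arma::vec&)>& grad_fn,
                    const double step_size, const double err_tol, const int iter_max)
{
    for (int iter = 0; iter < iter_max; ++iter) {
        const arma::vec grad = grad_fn(x);

        if (arma::norm(grad, 2) < err_tol) {
            break; // 'convergence' declared once the gradient norm is small enough
        }

        x -= step_size * grad; // x^(i+1) = x^(i) - d^(i)
    }

    return x;
}
```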
`gd_method = 0`: Basic GD. The descent direction is the gradient scaled by a fixed step size,

$$d^{(i)} = \alpha \times \nabla_x f(x^{(i)})$$

where the step size (learning rate) $\alpha$ is set via `step_size`.

`gd_method = 1`: GD with momentum. The descent direction accumulates past updates,

$$d^{(i)} = \mu \times d^{(i-1)} + \alpha \times \nabla_x f(x^{(i)})$$

where the momentum parameter $\mu$ is set via `momentum_par`.

`gd_method = 2`: Nesterov accelerated gradient descent (NAG). As with momentum, but the gradient is evaluated at a look-ahead point:

$$d^{(i)} = \mu \times d^{(i-1)} + \alpha \times \nabla_x f(x^{(i)} - \mu \times d^{(i-1)})$$

`gd_method = 3`: AdaGrad. The squared gradients are accumulated,

$$v^{(i)} = v^{(i-1)} + [\nabla_x f(x^{(i)})]^2$$

and the descent direction is

$$d^{(i)} = \alpha \times \dfrac{\nabla_x f(x^{(i)})}{\sqrt{v^{(i)}} + \epsilon}$$

where the square, square root, and division are taken elementwise, and the normalization term $\epsilon$ is set via `norm_term`.

`gd_method = 4`: RMSProp. The accumulator is replaced by an exponentially weighted moving average,

$$v^{(i)} = \rho \times v^{(i-1)} + (1 - \rho) \times [\nabla_x f(x^{(i)})]^2$$

with the same form of descent direction as AdaGrad; the decay rate $\rho$ is set via `ada_rho`.

`gd_method = 5`: AdaDelta, which extends RMSProp by also tracking an exponentially weighted moving average of the squared updates.

`gd_method = 6`: Adam (Adaptive Moment Estimation) and AdaMax. The first and second moments of the gradient are tracked via

$$m^{(i)} = \beta_1 \times m^{(i-1)} + (1 - \beta_1) \times \nabla_x f(x^{(i)})$$

$$v^{(i)} = \beta_2 \times v^{(i-1)} + (1 - \beta_2) \times [\nabla_x f(x^{(i)})]^2$$

where $\beta_1$ and $\beta_2$ are set via `adam_beta_1` and `adam_beta_2`, respectively. Then, with bias-corrected values $\hat{m} = m^{(i)} / (1 - \beta_1^i)$ and $\hat{v} = v^{(i)} / (1 - \beta_2^i)$,

$$d^{(i)} = \alpha \times \dfrac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}$$

If `ada_max = true`, then the updating rule for $v^{(i)}$ is no longer based on the $L_2$ norm; instead:

$$v^{(i)} = \max \left( \beta_2 \times v^{(i-1)}, \ | \nabla_x f(x^{(i)}) | \right)$$

`gd_method = 7`: Nadam and NadaMax. Nadam combines the Adam updating rules with Nesterov-style momentum. If `ada_max = true`, then the updating rule for $v^{(i)}$ is no longer based on the $L_2$ norm; instead it follows the max-based rule given above.

The algorithm stops when $\| \nabla f \|$ is less than `err_tol`, or the total number of iterations exceeds `iter_max`.
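The control parameters listed above are passed to `gd` through an `algo_settings_t` object supplied as the final argument. The sketch below shows one hypothetical configuration of a box-constrained run; the objective, bounds, and tuning values are illustrative assumptions rather than recommendations. The complete logit regression example that follows then exercises the Adam variant.

```cpp
#include "optim.hpp"

int main()
{
    // illustrative objective: f(x) = x'x, with gradient 2x
    auto sphere_fn = [](const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data) -> double
    {
        if (grad_out) {
            *grad_out = 2.0 * vals_inp;
        }
        return arma::dot(vals_inp, vals_inp);
    };

    arma::vec x = arma::ones(5,1); // initial values (inside the bounds below)

    optim::algo_settings_t settings;

    settings.err_tol  = 1e-8;  // convergence tolerance on || grad f ||
    settings.iter_max = 5000;  // maximum number of iterations

    settings.vals_bound   = true; // restrict the search space
    settings.lower_bounds = arma::zeros(5,1) + 0.5;
    settings.upper_bounds = arma::zeros(5,1) + 4.0;

    settings.gd_method = 4;               // RMSProp (see gd_method = 4 above)
    settings.gd_settings.step_size = 0.01;

    bool success = optim::gd(x, sphere_fn, nullptr, settings);

    if (success) {
        arma::cout << "bounded solution:\n" << x << arma::endl;
    }

    return 0;
}
```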
```cpp
#include "optim.hpp"

// sigmoid function
inline arma::mat sigm(const arma::mat& X)
{
    return 1.0 / (1.0 + arma::exp(-X));
}

// log-likelihood function data
struct ll_data_t {
    arma::vec Y;
    arma::mat X;
};

// log-likelihood function with hessian
double ll_fn_whess(const arma::vec& vals_inp, arma::vec* grad_out, arma::mat* hess_out, void* opt_data)
{
    ll_data_t* objfn_data = reinterpret_cast<ll_data_t*>(opt_data);

    arma::vec Y = objfn_data->Y;
    arma::mat X = objfn_data->X;

    arma::vec mu = sigm(X*vals_inp);

    const double norm_term = static_cast<double>(Y.n_elem);

    const double obj_val = - arma::accu( Y%arma::log(mu) + (1.0-Y)%arma::log(1.0-mu) ) / norm_term;

    //

    if (grad_out) {
        *grad_out = X.t() * (mu - Y) / norm_term;
    }

    //

    if (hess_out) {
        arma::mat S = arma::diagmat( mu%(1.0-mu) );
        *hess_out = X.t() * S * X / norm_term;
    }

    //

    return obj_val;
}

// log-likelihood function for Adam
double ll_fn(const arma::vec& vals_inp, arma::vec* grad_out, void* opt_data)
{
    return ll_fn_whess(vals_inp,grad_out,nullptr,opt_data);
}

//

int main()
{
    int n_dim = 5;     // dimension of parameter vector
    int n_samp = 4000; // sample length

    arma::mat X = arma::randn(n_samp,n_dim);
    arma::vec theta_0 = 1.0 + 3.0*arma::randu(n_dim,1);

    arma::vec mu = sigm(X*theta_0);

    arma::vec Y(n_samp);

    for (int i=0; i < n_samp; i++) {
        Y(i) = ( arma::as_scalar(arma::randu(1)) < mu(i) ) ? 1.0 : 0.0;
    }

    // fn data and initial values

    ll_data_t opt_data;
    opt_data.Y = std::move(Y);
    opt_data.X = std::move(X);

    arma::vec x = arma::ones(n_dim,1) + 1.0; // initial values

    // run Adam-based optim

    optim::algo_settings_t settings;

    settings.gd_method = 6;
    settings.gd_settings.step_size = 0.1;

    bool success = optim::gd(x,ll_fn,&opt_data,settings);

    arma::cout << "\nAdam: true values vs estimates:\n" << arma::join_rows(theta_0,x) << arma::endl;

    //
    // run Newton-based optim

    x = arma::ones(n_dim,1) + 1.0; // initial values

    success = optim::newton(x,ll_fn_whess,&opt_data);

    arma::cout << "\nnewton: true values vs estimates:\n" << arma::join_rows(theta_0,x) << arma::endl;

    //

    return 0;
}
```

Solution with timings:
```
Adam: logit_reg test completed successfully.
elapsed time: 0.025128s

Adam: true values vs estimates:
   2.7850   2.6993
   3.6561   3.6798
   2.3379   2.3860
   2.3167   2.4313
   2.2465   2.3064

newton: logit_reg test completed successfully.
elapsed time: 0.255909s

newton: true values vs estimates:
   2.7850   2.6993
   3.6561   3.6798
   2.3379   2.3860
   2.3167   2.4313
   2.2465   2.3064
```