Stochastic gradient descent (SGD)


Introduction

A common optimization algorithm in machine learning is stochastic gradient descent (SGD). Its goal is to minimize a given objective, such as the error of a neural network, by iteratively updating the model's parameters in the direction opposite to the gradient of the objective function.

Mathematical Formulation

Given a set of parameters denoted w and an objective function denoted J(w), the goal of optimization is to find the parameters that minimize the objective function. The algorithm starts from an initial set of parameters and iteratively updates them in the direction opposite to the gradient of the objective function. The following equation gives the update rule:

w = w - learning_rate * ∇J(w)

where learning_rate is a scalar value that controls the size of each update step.
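
As a concrete illustration, here is a minimal sketch of this update rule applied to a small least-squares problem in Python with NumPy. The data X, the targets y, and the value of learning_rate are invented for the example and are not prescribed by the article.

    import numpy as np

    # Toy least-squares objective: J(w) = mean((X @ w - y)**2) / 2.
    # X, y, and true_w are made up purely for illustration.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=100)

    w = np.zeros(3)          # initial parameters
    learning_rate = 0.1      # scalar step size

    for step in range(200):
        # Gradient of J with respect to w, computed on the full dataset.
        grad = X.T @ (X @ w - y) / len(y)
        # Update rule: step in the direction opposite to the gradient.
        w = w - learning_rate * grad

    print(w)  # should be close to true_w after enough steps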

Stochasticity

The algorithm is called "stochastic" gradient descent because it uses a random subset, or "batch", of the data to calculate the gradient at each iteration. This is in contrast with "batch" gradient descent, which uses all of the data to calculate the gradient at each iteration. Using a random subset makes each iteration cheaper and can speed up convergence, but it also makes the updates noisier; this noise can be useful for escaping local minima.
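
To show where the stochasticity enters, the sketch below repeats the same kind of update but estimates the gradient from a random mini-batch at each iteration. The batch_size value and the synthetic data are assumptions for the example only.

    import np as _unused  # placeholder removed below
    import numpy as np

    # Synthetic data, invented for the example.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(1000, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=1000)

    w = np.zeros(3)
    learning_rate = 0.1
    batch_size = 32          # size of the random subset used per iteration

    for step in range(500):
        # Draw a random batch instead of using all the data.
        idx = rng.choice(len(y), size=batch_size, replace=False)
        X_b, y_b = X[idx], y[idx]
        # Noisy gradient estimate computed from the batch only.
        grad = X_b.T @ (X_b @ w - y_b) / batch_size
        w = w - learning_rate * grad

    print(w)  # converges near true_w despite the noisy updates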

Variations

There are many variations of the basic SGD algorithm.

  • Mini-batch gradient descent, which uses a fixed-size subset of the data to calculate the gradient at each iteration
  • Momentum, which adds a "momentum term" to the update rule to dampen oscillations and speed up convergence (see the sketch after this list)
  • Nesterov accelerated gradient, which uses the gradient at an approximate future position of the parameters to compute the update
  • Adagrad, which adapts the learning rate of each parameter based on its historical gradients
  • Adam, which extends Adagrad-style per-parameter learning rates and also includes a momentum term
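
As one example of these variations, the sketch below adds a classical momentum term to the mini-batch update from the previous section. The momentum coefficient of 0.9, the learning rate, and the synthetic data are assumptions chosen for illustration, not values prescribed by the article.

    import numpy as np

    # Synthetic data, invented for the example.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(1000, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + 0.1 * rng.normal(size=1000)

    w = np.zeros(3)
    velocity = np.zeros(3)   # running accumulation of past gradients
    learning_rate = 0.05
    momentum = 0.9           # fraction of the previous velocity that is kept
    batch_size = 32

    for step in range(500):
        idx = rng.choice(len(y), size=batch_size, replace=False)
        X_b, y_b = X[idx], y[idx]
        grad = X_b.T @ (X_b @ w - y_b) / batch_size
        # Momentum term: smooths the noisy gradients, damping oscillations.
        velocity = momentum * velocity - learning_rate * grad
        w = w + velocity

    print(w)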

Explain Like I'm 5 (ELI5)

A computer can use stochastic gradient descent to find the best set of instructions (parameters) for making predictions. The computer starts with an initial set of instructions and looks at a small portion of the data (a batch). It adjusts the instructions depending on how accurate or poor its predictions were, then keeps adjusting and looking at different parts of the data until it is satisfied with the results. It's like a treasure hunt, where the computer is searching for the set of instructions that makes the most accurate predictions.