Intro:
Gradient descent is a first-order optimisation algorithm for finding a local minimum of a real-valued function, \(\min_x f(x)\), with respect to the variable \(x\).
Usually the functions \(f\) are multivariate, i.e. functions that take multiple input variables and produce a single output, and have the form:
\[f : \mathbb{R}^n \to \mathbb{R}\]where \(\mathbb{R}^n\) denotes the \(n\)-dimensional real space, and the function \(f\) maps an \(n\)-dimensional vector of real numbers to a single real number.
Examples of Multivariate Functions
- Two Variables: \(f(x, y) = x^2 + y^2\). This function takes two input variables, \(x\) and \(y\), and produces a single output which is the sum of the squares of the inputs.
- Three Variables: \(g(x, y, z) = x \cdot y + y \cdot z + z \cdot x\). This function takes three input variables, \(x\), \(y\), and \(z\), and produces a single output which is the sum of the pairwise products of the inputs (both examples are sketched in code below).
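As a rough illustration (my own, not from the cited source), both example functions can be written directly in code; each takes several real inputs and returns a single real number, matching \(f : \mathbb{R}^n \to \mathbb{R}\):

```python
# Two-variable example: f(x, y) = x^2 + y^2
def f(x, y):
    return x**2 + y**2

# Three-variable example: g(x, y, z) = x*y + y*z + z*x
def g(x, y, z):
    return x * y + y * z + z * x

print(f(1.0, 2.0))       # 5.0  -> a single real number
print(g(1.0, 2.0, 3.0))  # 11.0 -> a single real number
```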
The gradient at a point points in the direction of steepest ascent.[^1]
Gradient descent relies on the fact that \(f(x_0)\) decreases fastest if we move from \(x_0\) in the direction of the negative gradient \(-( (\nabla f)(x_0) )^\top\). Note that we use the transpose of the gradient; otherwise the vector dimensions would not work out.
Detailed Explanation
- \(\nabla f\): This symbol represents the gradient of the function \(f\). The gradient is a vector of the partial derivatives of \(f\) with respect to its input variables. If \(f\) is a function of \(n\) variables, \(\nabla f\) is a vector with \(n\) components (see the numerical sketch after this list).
- \((\nabla f)(x_0)\): This means that the gradient \(\nabla f\) is evaluated at the specific point \(x_0\). The point \(x_0\) is in the domain of \(f\).
- \(((\nabla f)(x_0))^\top\): The \(\top\) symbol denotes the transpose of the gradient vector. If the gradient \((\nabla f)(x_0)\) is originally a column vector, its transpose is a row vector, and vice versa.
- \(-((\nabla f)(x_0))^\top\): The minus sign indicates that we are considering the negative of the transposed gradient vector evaluated at \(x_0\).
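A minimal sketch (my own, not from the source) of how the gradient can be approximated numerically as a vector of \(n\) partial derivatives, using central finite differences; the function name `numerical_gradient` and the step `h` are illustrative choices:

```python
def numerical_gradient(f, x0, h=1e-6):
    """Approximate the gradient of f at the point x0 (a list of n numbers)
    as the vector of its n partial derivatives, via central differences."""
    grad = []
    for i in range(len(x0)):
        x_plus, x_minus = list(x0), list(x0)
        x_plus[i] += h
        x_minus[i] -= h
        # i-th partial derivative: (f(x0 + h*e_i) - f(x0 - h*e_i)) / (2h)
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

# f(x, y) = x^2 + y^2, written to accept a list of coordinates
f = lambda v: v[0] ** 2 + v[1] ** 2
print(numerical_gradient(f, [1.0, 1.0]))  # approximately [2.0, 2.0]
```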
Example
Consider a function \(f(x, y) = x^2 + y^2\).
- Gradient Calculation: \(\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = (2x, 2y)\)
- Evaluating at \(x_0 = (1, 1)\): \((\nabla f)(1, 1) = (2 \cdot 1, 2 \cdot 1) = (2, 2)\)
- Transposing the Gradient: Transposition does not change the components; if \((2, 2)\) is written as a row vector, its transpose \((2, 2)^\top\) is the corresponding column vector. This is why the transpose appears in the update: it makes the gradient's dimensions match those of the point \(x_0\).
- Negative Transposed Gradient: \(- ((\nabla f)(1, 1))^\top = -(2, 2)^\top\)
This negative gradient is used to update the current point \(x_0\) in the gradient descent method to find the function’s minimum.
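To tie the steps above together, here is a small sketch (my own, not from the source) that evaluates the analytic gradient of \(f(x, y) = x^2 + y^2\) at \(x_0 = (1, 1)\) and forms the negative gradient used for the update:

```python
def grad_f(x, y):
    # Analytic gradient of f(x, y) = x^2 + y^2: (2x, 2y)
    return (2 * x, 2 * y)

x0 = (1.0, 1.0)
g = grad_f(*x0)                  # (2.0, 2.0)
neg_g = tuple(-gi for gi in g)   # (-2.0, -2.0): the direction of steepest descent
print(g, neg_g)
```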
Then, for a small step-size \(\gamma \geq 0\), the update \(x_1 = x_0 - \gamma\, ((\nabla f)(x_0))^\top\) satisfies \(f(x_1) \leq f(x_0)\).
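A minimal gradient-descent loop (a sketch under the assumptions above, with a hand-picked step-size \(\gamma = 0.1\)) showing that each update decreases \(f\) for this example:

```python
def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    return (2 * x, 2 * y)

gamma = 0.1       # small, fixed step-size (hand-picked for this example)
x, y = 1.0, 1.0   # start at x0 = (1, 1)
print("start", (x, y), f(x, y))
for k in range(5):
    gx, gy = grad_f(x, y)
    # Step in the direction of the negative gradient
    x, y = x - gamma * gx, y - gamma * gy
    print(k, (x, y), f(x, y))  # f(x, y) shrinks at every iteration
```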
[^1]: Mathematics for Machine Learning, Section 5.1, https://yung-web.github.io/home/courses/mathml.html