Intro:
Gradient descent is a first-order optimisation algorithm for finding a local minimum of a real-valued function, \(\min_x f(x)\), with respect to the variable \(x\).
Usually the functions \(f\) are multivariate, i.e. functions that take multiple input variables and produce a single output, and have the form:
\[f : \mathbb{R}^n \to \mathbb{R}\]where \(\mathbb{R}^n\) denotes the \(n\)-dimensional real space, and the function \(f\) maps an \(n\)-dimensional vector of real numbers to a single real number.
Examples of Multivariate Functions
- Two Variables: \(f(x, y) = x^2 + y^2\). This function takes two input variables, \(x\) and \(y\), and produces a single output which is the sum of the squares of the inputs.
- Three Variables: \(g(x, y, z) = x \cdot y + y \cdot z + z \cdot x\). This function takes three input variables, \(x\), \(y\), and \(z\), and produces a single output which is the sum of the pairwise products of the inputs (both examples are sketched in code below).
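As a rough illustration (my own, not from the cited source), both example functions can be written directly in code; each takes several real inputs and returns a single real number, matching \(f : \mathbb{R}^n \to \mathbb{R}\):

```python
# Two-variable example: f(x, y) = x^2 + y^2
def f(x, y):
    return x**2 + y**2

# Three-variable example: g(x, y, z) = x*y + y*z + z*x
def g(x, y, z):
    return x * y + y * z + z * x

print(f(1.0, 2.0))       # 5.0  -> a single real number
print(g(1.0, 2.0, 3.0))  # 11.0 -> a single real number
```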
The gradient at a point points in the direction of steepest ascent.[^1]
Gradient descent relies on the fact that \(f(x_0)\) decreases fastest if we move from \(x_0\) in the direction of the negative gradient \(-( (\nabla f)(x_0) )^\top\). Note that we use the transpose of the gradient; otherwise the vector dimensions would not work out.
Detailed Explanation
- \(\nabla f\): This symbol represents the gradient of the function \(f\). The gradient is a vector of the partial derivatives of \(f\) with respect to its input variables. If \(f\) is a function of \(n\) variables, \(\nabla f\) is a vector with \(n\) components (see the numerical sketch after this list).
- \((\nabla f)(x_0)\): This means that the gradient \(\nabla f\) is evaluated at the specific point \(x_0\). The point \(x_0\) is in the domain of \(f\).
- \(((\nabla f)(x_0))^\top\): The \(\top\) symbol denotes the transpose of the gradient vector. If the gradient \((\nabla f)(x_0)\) is originally a column vector, its transpose is a row vector, and vice versa.
- \(-((\nabla f)(x_0))^\top\): The minus sign indicates that we are considering the negative of the transposed gradient vector evaluated at \(x_0\).
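A minimal sketch (my own, not from the source) of how the gradient can be approximated numerically as a vector of \(n\) partial derivatives, using central finite differences; the function name `numerical_gradient` and the step `h` are illustrative choices:

```python
def numerical_gradient(f, x0, h=1e-6):
    """Approximate the gradient of f at the point x0 (a list of n numbers)
    as the vector of its n partial derivatives, via central differences."""
    grad = []
    for i in range(len(x0)):
        x_plus, x_minus = list(x0), list(x0)
        x_plus[i] += h
        x_minus[i] -= h
        # i-th partial derivative: (f(x0 + h*e_i) - f(x0 - h*e_i)) / (2h)
        grad.append((f(x_plus) - f(x_minus)) / (2 * h))
    return grad

# f(x, y) = x^2 + y^2, written to accept a list of coordinates
f = lambda v: v[0] ** 2 + v[1] ** 2
print(numerical_gradient(f, [1.0, 1.0]))  # approximately [2.0, 2.0]
```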
Example
Consider a function \(f(x, y) = x^2 + y^2\).
- Gradient Calculation: \(\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right) = (2x, 2y)\)
- Evaluating at \(x_0 = (1, 1)\): \((\nabla f)(1, 1) = (2 \cdot 1, 2 \cdot 1) = (2, 2)\)
- Transposing the Gradient: Transposition does not change the components; if \((2, 2)\) is written as a row vector, its transpose \((2, 2)^\top\) is the corresponding column vector. This is why the transpose appears in the update: it makes the gradient's dimensions match those of the point \(x_0\).
- Negative Transposed Gradient: \(- ((\nabla f)(1, 1))^\top = -(2, 2)^\top\)
This negative gradient is used to update the current point \(x_0\) in the gradient descent method to find the function’s minimum.
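To tie the steps above together, here is a small sketch (my own, not from the source) that evaluates the analytic gradient of \(f(x, y) = x^2 + y^2\) at \(x_0 = (1, 1)\) and forms the negative gradient used for the update:

```python
def grad_f(x, y):
    # Analytic gradient of f(x, y) = x^2 + y^2: (2x, 2y)
    return (2 * x, 2 * y)

x0 = (1.0, 1.0)
g = grad_f(*x0)                  # (2.0, 2.0)
neg_g = tuple(-gi for gi in g)   # (-2.0, -2.0): the direction of steepest descent
print(g, neg_g)
```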
Then, for a small step-size \(\gamma \geq 0\), the update \(x_1 = x_0 - \gamma\, ((\nabla f)(x_0))^\top\) satisfies \(f(x_1) \leq f(x_0)\).
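A minimal gradient-descent loop (a sketch under the assumptions above, with a hand-picked step-size \(\gamma = 0.1\)) showing that each update decreases \(f\) for this example:

```python
def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    return (2 * x, 2 * y)

gamma = 0.1       # small, fixed step-size (hand-picked for this example)
x, y = 1.0, 1.0   # start at x0 = (1, 1)
print("start", (x, y), f(x, y))
for k in range(5):
    gx, gy = grad_f(x, y)
    # Step in the direction of the negative gradient
    x, y = x - gamma * gx, y - gamma * gy
    print(k, (x, y), f(x, y))  # f(x, y) shrinks at every iteration
```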
[^1]: Mathematics for Machine Learning, Section 5.1, https://yung-web.github.io/home/courses/mathml.html