## Deriving Gibbs distribution from stochastic gradients

Stochastic gradient descent is one of the most important tools in optimization and machine learning (especially for deep learning; see, for example, ConvNets). One of its advantages is that its behavior is well understood in the general case, through the methods of statistical mechanics.

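In general form, stochastic gradient descent can be written as

$$x_{n+1} = x_n - \gamma_n \nabla H(x_n) + \gamma_n F_n$$

where $H$ is the objective function, $\gamma_n$ is the step size and $F_n$ is a random variable with expectation zero.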

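To apply the methods of statistical mechanics we can rewrite it in continuous form, as a stochastic gradient flow

$$\frac{dx}{dt} = -\nabla H(x) + \sqrt{2\beta^{-1}}\, F(t)$$

where $\beta^{-1}$ sets the noise intensity and the random variable *F(t)* is assumed to be white noise for simplicity.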
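As a minimal sketch of what the continuous form means in practice, the flow can be discretized with the Euler-Maruyama scheme, which gives back a noisy gradient step. The function name `stochastic_gradient_flow`, the quadratic objective and the parameter values are illustrative assumptions; the noise scale `sqrt(2 * dt / beta)` assumes the flow is driven by white noise of intensity $2\beta^{-1}$.

```python
import math
import random


def stochastic_gradient_flow(grad, x0, beta, dt, n_steps, rng):
    """Euler-Maruyama discretization of dx/dt = -grad(x) + sqrt(2/beta) * F(t).

    Returns the list of visited points, starting with x0.
    """
    noise_scale = math.sqrt(2.0 * dt / beta)  # std of the noise added per step
    path = [x0]
    x = x0
    for _ in range(n_steps):
        x = x - grad(x) * dt + noise_scale * rng.gauss(0.0, 1.0)
        path.append(x)
    return path


# Illustrative objective H(x) = x^2 / 2, so grad H(x) = x (an assumption).
rng = random.Random(0)
path = stochastic_gradient_flow(lambda x: x, x0=3.0, beta=2.0,
                                dt=0.01, n_steps=200_000, rng=rng)
tail = path[1000:]  # drop the initial transient
mean = sum(tail) / len(tail)
variance = sum((x - mean) ** 2 for x in tail) / len(tail)
print(mean, variance)
```

For this quadratic objective the iterates do not settle at the minimum; they keep fluctuating around it, with a variance close to $1/\beta = 0.5$.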

At this point most textbooks and papers refer to “methods of statistical mechanics” to show that the stochastic gradient flow has an invariant probability distribution, called the Gibbs distribution,

$$p(x) = \frac{1}{Z} \exp\left(-\beta H(x)\right), \qquad Z = \int \exp\left(-\beta H(x)\right) dx$$

and from there derive interesting quantities like the temperature $\beta^{-1}$ and the free energy.

The question is: how is the Gibbs distribution derived from the stochastic gradient flow?

First we have to understand what the stochastic gradient flow really means.

It is not a partial differential equation (PDE), because it includes a random term, which is not a function in the usual sense. In fact it is a stochastic differential equation (SDE). SDEs use some complex mathematical machinery and relate to partial differential equations, probability and statistics, measure theory and ergodic theory. However, they are used a lot in finance, so there are quite a number of textbooks on SDEs for non-mathematicians. For a short, minimalistic and accessible book on stochastic differential equations, I can't recommend highly enough *An Introduction to Stochastic Differential Equations* by L.C. Evans.

The SDE in question is called an Ito diffusion. The solution of that equation is a stochastic process: a collection of random variables parametrized by time. A sample path of this stochastic process, as a function of time, is nowhere differentiable, so it is difficult to talk about it in terms of derivatives. Instead it is defined through its integral form.

First, note that the integral of white noise is Brownian motion, also called the Wiener process; informally, $W(t) = \int_0^t F(s)\, ds$.
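A minimal sketch of this idea: on a time grid the Wiener process can be sampled by summing independent Gaussian increments with variance equal to the time step. The helper name `wiener_path` and the parameter values are illustrative assumptions. The check at the end uses the standard fact that $W(t)$ has mean zero and variance $t$.

```python
import math
import random


def wiener_path(t_end, n_steps, rng):
    """Sample W(0), W(dt), ..., W(t_end) by summing independent N(0, dt) increments."""
    dt = t_end / n_steps
    w = [0.0]
    for _ in range(n_steps):
        w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))
    return w


rng = random.Random(1)
t_end = 2.0
endpoints = [wiener_path(t_end, 20, rng)[-1] for _ in range(10_000)]
mean_end = sum(endpoints) / len(endpoints)
var_end = sum((w - mean_end) ** 2 for w in endpoints) / len(endpoints)
print(mean_end, var_end)  # sample mean near 0, sample variance near t_end
```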

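Assume that we have a stochastic differential equation written informally as

$$\frac{dX}{dt} = b(X, t) + g(X, t)\, F(t)$$

with *X* a stochastic process and *F(t)* white noise.

Its integral form is

$$X(t) = X(0) + \int_0^t b(X(s), s)\, ds + \int_0^t g(X(s), s)\, dW(s) \tag{1}$$

where *W(t)* is a Wiener process.

This equation is usually written in the form

$$dX = b(X, t)\, dt + g(X, t)\, dW \tag{2}$$

This is only notation for the integral equation; *d* here is not a differential.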

Returning to (1), the term

$$\int_0^t b(X(s), s)\, ds$$

is an integral along a sample path. Its meaning is straightforward: it can be defined as the limit of Riemann sums with respect to time.

The most notable thing here is

$$\int_0^t g(X(s), s)\, dW(s) \tag{3}$$

the integral with respect to the Wiener process.

It is a stochastic integral, and in courses on stochastic differential equations it is defined as the limit of Riemann sums of random variables, in a manner similar to the definition of the ordinary integral.

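Curiously, the stochastic integral is not uniquely defined: depending on the form of the Riemann sum it produces different results. Take partitions $0 = t_0 < t_1 < \dots < t_n = t$ whose mesh goes to zero, and write $g(t_k)$ as shorthand for $g(X(t_k), t_k)$. Evaluating the integrand at the left endpoint of each interval gives the Ito integral:

$$\int_0^t g\, dW = \lim_{n \to \infty} \sum_{k=0}^{n-1} g(t_k) \left( W(t_{k+1}) - W(t_k) \right)$$

A different Riemann sum, with the integrand evaluated at the midpoint of each interval, produces a different integral, the Stratonovich integral:

$$\int_0^t g \circ dW = \lim_{n \to \infty} \sum_{k=0}^{n-1} g\left( \frac{t_k + t_{k+1}}{2} \right) \left( W(t_{k+1}) - W(t_k) \right)$$

The Ito integral is used more often in statistics because it uses $g(t_k)$, so it does not “look forward”, while the Stratonovich integral is used more in theoretical physics.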
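The difference between the two sums is easy to see numerically for the integrand $g = W$. The sketch below relies on the standard closed forms $\int_0^t W\, dW = (W(t)^2 - t)/2$ (Ito) and $\int_0^t W \circ dW = W(t)^2/2$ (Stratonovich). As a simplifying assumption, the Stratonovich-type sum uses the average of the two endpoint values instead of the value at the midpoint in time; for this integrand both choices have the same limit.

```python
import math
import random

rng = random.Random(2)
t_end, n_steps = 1.0, 100_000
dt = t_end / n_steps

# One sample path of the Wiener process on a uniform grid.
w = [0.0]
for _ in range(n_steps):
    w.append(w[-1] + rng.gauss(0.0, math.sqrt(dt)))

# Ito sum: integrand taken at the left endpoint of each interval.
ito = sum(w[k] * (w[k + 1] - w[k]) for k in range(n_steps))

# Stratonovich-type sum: integrand taken as the average of the two endpoints
# (used here in place of the midpoint-in-time value).
strat = sum(0.5 * (w[k] + w[k + 1]) * (w[k + 1] - w[k]) for k in range(n_steps))

w_t = w[-1]
print(ito, (w_t ** 2 - t_end) / 2)   # these two agree approximately
print(strat, w_t ** 2 / 2)           # these two agree up to rounding
```

The two sums differ by roughly $t/2$ no matter how fine the grid is, which is why the choice of Riemann sum matters.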

Returning to the Ito integral: it is itself a stochastic process, and it has expectation zero for each *t*.

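From the definition of the Ito integral follows one of the most important tools of stochastic calculus, the Ito lemma (or Ito formula).

The Ito lemma states that for a solution of the SDE (2)

$$dX = b(X, t)\, dt + g(X, t)\, dW$$

with *X*, *b*, *W* vectors and *g* a matrix, where *W* is a Wiener process (the lemma actually holds for a more general class of processes) and *b* and *g* are sufficiently regular, a smooth function $u(x, t)$ satisfies

$$du(X, t) = \left( \frac{\partial u}{\partial t} + \nabla u \cdot b + \frac{1}{2} \operatorname{tr}\left( g^T (\nabla^2 u)\, g \right) \right) dt + (\nabla u)^T g\, dW$$

where $\nabla$ is the gradient with respect to $x$ and $\nabla^2 u$ is the Hessian.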
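As a quick worked example (a standard textbook case, not part of the derivation itself), take the scalar process $X = W$, that is $b = 0$ and $g = 1$, and the function $u(x) = x^2$. The Ito formula then gives:

```latex
\begin{aligned}
d\left(W^2\right) &= \left( 0 + 2W \cdot 0 + \tfrac{1}{2} \cdot 2 \right) dt + 2W\, dW \\
                  &= dt + 2W\, dW, \\
\text{so}\quad \int_0^t W\, dW &= \tfrac{1}{2}\left( W(t)^2 - t \right).
\end{aligned}
```

The extra $dt$ term, absent from the ordinary chain rule, is what distinguishes Ito calculus from ordinary calculus.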

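From the Ito lemma follows the Ito product rule for scalar processes. Take two processes $dX = b_X\, dt + g_X\, dW$ and $dY = b_Y\, dt + g_Y\, dW$ driven by the same scalar Wiener process. Applying the Ito formula to the combined process *V = (X, Y)* and the function *u(V) = XY* gives

$$d(XY) = X\, dY + Y\, dX + g_X g_Y\, dt$$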

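Using the Ito formula and the Ito product rule it is possible to derive the Feynman–Kac formula (a derivation can be found on Wikipedia; it uses only the Ito formula, the Ito product rule and the fact that the expectation of the Ito integral (3) is zero):

for the partial differential equation (PDE)

$$\frac{\partial u}{\partial t} + b \cdot \nabla u + \frac{1}{2} \operatorname{tr}\left( g^T (\nabla^2 u)\, g \right) - c\, u + f = 0$$

with terminal condition

$$u(x, T) = \psi(x) \tag{4}$$

the solution can be written as a conditional expectation:

$$u(x, t) = E\left[ \int_t^T e^{-\int_t^r c(X_\tau, \tau)\, d\tau} f(X_r, r)\, dr + e^{-\int_t^T c(X_\tau, \tau)\, d\tau}\, \psi(X_T) \,\middle|\, X_t = x \right]$$

where *X* is the solution of the SDE (2) and $c(x, t)$, $f(x, t)$ are given functions.

The Feynman–Kac formula establishes a connection between PDEs and stochastic processes.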

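From the Feynman–Kac formula, taking $c = 0$ and $f = 0$ (no potential term and no source term), we get the Kolmogorov backward equation:

for

$$A u = b \cdot \nabla u + \frac{1}{2} \operatorname{tr}\left( g^T (\nabla^2 u)\, g \right)$$

the equation

$$\frac{\partial u}{\partial t} + A u = 0 \tag{5}$$

with terminal condition (4) has its solution given by the conditional expectation

$$u(x, t) = E\left[ \psi(X_T) \mid X_t = x \right] \tag{6}$$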
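A minimal numerical sketch of this statement, under illustrative assumptions: scalar Brownian motion ($b = 0$, $g = 1$) and terminal function $\psi(x) = x^2$. In that case $u(x, t) = x^2 + (T - t)$ solves $\partial_t u + \frac{1}{2} \partial_{xx} u = 0$ with $u(x, T) = x^2$, and a Monte Carlo average of $\psi(X_T)$ should reproduce it. The helper name `backward_solution_mc` is hypothetical.

```python
import math
import random


def backward_solution_mc(psi, x, t, t_end, n_samples, rng):
    """Estimate u(x, t) = E[psi(X_T) | X_t = x] for dX = dW by direct sampling.

    For Brownian motion X_T = x + sqrt(T - t) * xi with xi ~ N(0, 1),
    so no time stepping is needed.
    """
    spread = math.sqrt(t_end - t)
    total = 0.0
    for _ in range(n_samples):
        total += psi(x + spread * rng.gauss(0.0, 1.0))
    return total / n_samples


rng = random.Random(3)
x, t, t_end = 1.0, 0.5, 1.0
estimate = backward_solution_mc(lambda y: y * y, x, t, t_end, 100_000, rng)
exact = x * x + (t_end - t)  # closed-form solution of the PDE for psi(y) = y^2
print(estimate, exact)
```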

From the Kolmogorov backward equation we can obtain the Kolmogorov forward equation, which describes the evolution of the probability density of the random process *X* from (2).

In SDE courses it is established that the solution of (2) is a Markov process and has a transition probability *P* and a transition density *p*:

*p(x, s, y, t)* = probability density of being at *y* at time *t*, given that the process started at *x* at time *s*.

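Take *u* to be the solution of the backward equation (5) with terminal condition (4), so that by (6), for $s \le T$,

$$u(x, s) = E\left[ \psi(X_T) \mid X_s = x \right] = \int p(x, s, y, T)\, \psi(y)\, dy$$

From the Markov property, for $s \le t \le T$,

$$u(x, s) = \int p(x, s, y, t)\, u(y, t)\, dy$$

The left-hand side does not depend on *t*, so from here

$$0 = \frac{\partial}{\partial t} \int p(x, s, y, t)\, u(y, t)\, dy$$

and from here

$$0 = \int \left( \frac{\partial p}{\partial t}(x, s, y, t)\, u(y, t) + p(x, s, y, t)\, \frac{\partial u}{\partial t}(y, t) \right) dy$$

From (5), $\partial u / \partial t = -A u$, where $A$ is the operator of the backward equation, so

$$0 = \int \left( \frac{\partial p}{\partial t}\, u - p\, A u \right) dy \tag{7}$$

Now we introduce the dual operator of $A$, acting on functions of *y*:

$$A^* p = -\nabla \cdot (b\, p) + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial y_i\, \partial y_j} \left( (g g^T)_{ij}\, p \right)$$

By integration by parts (assuming *p* decays fast enough at infinity for the boundary terms to vanish) we get

$$\int p\, A u\, dy = \int u\, A^* p\, dy$$

and from (7)

$$\int \left( \frac{\partial p}{\partial t} - A^* p \right) u(y, t)\, dy = 0$$

For *t = T*, using the terminal condition $u(y, T) = \psi(y)$,

$$\int \left( \frac{\partial p}{\partial T}(x, s, y, T) - A^* p(x, s, y, T) \right) \psi(y)\, dy = 0$$

This is true for any $\psi$, which is independent of *p*, so the expression in parentheses must vanish:

$$\frac{\partial p}{\partial T} = A^* p$$

This is the Kolmogorov forward equation for *p*. Integrating over *x* against the initial density, we get the same equation for the probability density at any moment *T*.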
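To make the forward equation concrete, here is a rough finite-difference sketch in one dimension. It assumes the illustrative choice $b(x) = -x$ and $g = \sqrt{2}$ (an Ornstein-Uhlenbeck process), for which the forward equation reads $\partial_t p = \partial_x (x p) + \partial_{xx} p$ and the stationary density is the standard normal. The grid, the time step and the scheme (explicit, conservative, zero flux at the boundaries) are assumptions chosen for simplicity, not a recommended solver.

```python
import math

# Grid of cell centers on [-5, 5].
n_cells = 100
x_min, x_max = -5.0, 5.0
dx = (x_max - x_min) / n_cells
xs = [x_min + (i + 0.5) * dx for i in range(n_cells)]

diffusion = 1.0          # (1/2) * g^2 with g = sqrt(2)
dt = 0.002               # diffusion * dt / dx^2 = 0.2, below the 0.5 stability limit
n_steps = 4000           # integrate up to time 8


def drift(x):
    return -x            # b(x) = -x


# Initial density: a narrow bump centered at x = 2, normalized to integrate to 1.
p = [math.exp(-((x - 2.0) ** 2) / 0.1) for x in xs]
mass = sum(p) * dx
p = [v / mass for v in p]

for _ in range(n_steps):
    # Probability flux J = b * p - D * dp/dx at the interior cell faces.
    flux = [0.0] * (n_cells + 1)  # flux[0] and flux[-1] stay 0 (no flux out)
    for i in range(1, n_cells):
        x_face = x_min + i * dx
        p_face = 0.5 * (p[i - 1] + p[i])
        flux[i] = drift(x_face) * p_face - diffusion * (p[i] - p[i - 1]) / dx
    # Conservative update: dp/dt = -dJ/dx.
    p = [p[i] - dt * (flux[i + 1] - flux[i]) / dx for i in range(n_cells)]

gauss = [math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi) for x in xs]
max_error = max(abs(a - b) for a, b in zip(p, gauss))
total_mass = sum(p) * dx
print(total_mass, max_error)
```

Starting far from equilibrium, the density relaxes toward the Gaussian while the total probability stays equal to one, which is what the conservative (divergence) form of the forward equation expresses.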

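Now we return to the Gibbs invariant distribution for the gradient flow.

The stochastic gradient flow in SDE notation is

$$dX = -\nabla H(X)\, dt + \sqrt{2\beta^{-1}}\, dW$$

since the integral of white noise is the Wiener process.

We want to find the invariant probability density $p(x)$. Invariant means that it does not change with time,

$$\frac{\partial p}{\partial t} = 0$$

so from the Kolmogorov forward equation $\partial_t p = -\nabla \cdot (b\, p) + \frac{1}{2} \sum_{i,j} \partial_i \partial_j \left( (g g^T)_{ij}\, p \right)$ with $b = -\nabla H$ and $g = \sqrt{2\beta^{-1}}\, I$,

$$\nabla \cdot (p\, \nabla H) + \beta^{-1} \Delta p = 0$$

or

$$\nabla \cdot \left( p\, \nabla H + \beta^{-1} \nabla p \right) = 0$$

Removing the divergence (in one dimension this is an exact integration; in higher dimensions we look for the solution with constant flux),

$$p\, \nabla H + \beta^{-1} \nabla p = C$$

*C = 0* because we want *p* to be integrable, so

$$\nabla \log p = -\beta\, \nabla H$$

and at last we get the Gibbs distribution

$$p(x) = \frac{1}{Z} \exp\left( -\beta H(x) \right)$$

where $Z$ is the normalizing constant.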
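The result can be checked numerically. The sketch below makes several illustrative assumptions: a one-dimensional quartic objective $H(x) = x^4/4$, $\beta = 1$, an Euler-Maruyama discretization of $dX = -\nabla H(X)\, dt + \sqrt{2\beta^{-1}}\, dW$, and a simple midpoint-rule quadrature for the Gibbs average. It compares the long-run time average of $x^2$ along one simulated path with the average of $x^2$ under $p(x) \propto \exp(-\beta H(x))$.

```python
import math
import random

beta = 1.0


def H(x):
    return x ** 4 / 4.0


def grad_H(x):
    return x ** 3


# Time average of x^2 along one long Euler-Maruyama path.
rng = random.Random(4)
dt, n_steps, burn_in = 0.01, 400_000, 2_000
noise_scale = math.sqrt(2.0 * dt / beta)
x, acc = 0.0, 0.0
for step in range(n_steps):
    x += -grad_H(x) * dt + noise_scale * rng.gauss(0.0, 1.0)
    if step >= burn_in:
        acc += x * x
time_average = acc / (n_steps - burn_in)

# Average of x^2 under the Gibbs density exp(-beta * H) / Z, by midpoint rule.
n_grid, lo, hi = 4000, -4.0, 4.0
h = (hi - lo) / n_grid
grid = [lo + (i + 0.5) * h for i in range(n_grid)]
weights = [math.exp(-beta * H(y)) for y in grid]
Z = sum(weights) * h
gibbs_average = sum(y * y * w for y, w in zip(grid, weights)) * h / Z

print(time_average, gibbs_average)
```

The two numbers should agree up to sampling noise and a small time-discretization bias, even though this density is not Gaussian.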

Recalling again the chain of reasoning:

Wiener process →

SDE + Ito Lemma + Ito product rule + zero expectation of Ito integral →

Kolmogorov backward equation + Markov property of SDE →

Kolmogorov forward equation for probability density →