# Mirror Image

## Some simple ways to speed up convnet a little

There are quite a few complex methods for making convolutional networks converge faster. Natural gradient, dual coordinate ascent, second order hessian free methods and more. However those methods are usually require considerable memory spending and extensive modifications of existing code if you already had some working method before.

Here instead I’ll list some simple and lightweight methods, which don’t require extensive changes of already working code. Those methods works (I’ve tested them on CIFAR10), and at very worst don’t make convergence worse. You shouldn’t expect radical improvement from those methods – they are for little speed up.

1. Start training on the part of the data and gradually increase data size to full. For example for CIFAR10 I start with 1/5 of all the data and increase it by 1/5 each and time giving it 3 more epochs to train for each chunk. This trick was inspired by “curriculum learning”. Resulting acceleration is small but still visible (on order of 10%).

2. Random sampling. Running all the data batches in the same order theoretically can produce some small overfitting – coadaptaion of the data. Usually this effect is not visible, but just to be sure you can reshuffle data randomly each epoch, especially if it cost very little. I’m using simple reshuffling method based on the prime numbers table.
For each minibatch k I take samples indexed by i: (i*prime)%data_size where i is from k*minibatch_size to (k+1)*minibatch_size, and prime taken form prime numbers table change each epoch. If prime is more than all prime factors of data_size all samples are indexed once.

3. Gradient step size. The baseline method is to use simple momentum. Most popular method of gradient acceleration on top of momentum are RMSProp and Nesterov accelerated gadient (NAG).
NAG simply change order in which momentum and gradient are applied to weights. RMSProp is normalization of the gradient, so that it should have approximately same norm. One of the variants described here, which is similar to low pass filter.Another possible implementation – gradient is divided by running root mean square of it’s previous values. However in my tests on CIFAR10 neither NAG nor RMSprop or NAG+RMSProp show any visible improvement.

Nevertheless in my tests a simple modification of gradient step show better result then standard momentum. That is just a common optimization trick – discard or reduce those gradient step which increase error function. Straightforward implementation of it could be costly for convnet – to estimate cost function after gradient step require additional forward propagation. There is a workaround – estimate error on the next sample of the dataset and if error increased subtract part of the previous gradient. This is not precisely the same as reducing bad gradient, because we subtract offending gradient after we made next step, but because the step is small it still works. Essentially it could be seen as error-dependent gradient step

4. Dropout. Use of dropout require some care, especially if we want to insert dropout into networks built and tuned without dropout. Common recommendation is to increase number of filters proportionally to dropout value.
But there are some hidden traps there: while dropout should improve test error(results on the samples not used in training) it make training error noticeably worse. In practice it may make worse even test error. Dropout also make convergence considerably more slow. There is non-obvious trick which help: continue iterations without changing learning rate for some times even after training error is not decreasing any more. Intuition behind this trick – dropout reduce test error, not training error. Test error decrease often very slow and noisy comparing to training error, so, just to be sure it may help to increase number of epoch even after both stop decreasing, without decreasing learning rate. Reducing learning rate after both training and test error went plateau for some epochs may produce better results.

To be continued (may be)

Advertisement

28, December, 2014 Posted by | computer vision, Deep Learnig | , , , , | Comments Off on Some simple ways to speed up convnet a little

## Some thoughts about deep learning criticism

There is some deep learning (specifically #convnet) criticism based on the artificially constructed misclassification examples.
There is a new paper
“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images” by Nguen et al
http://arxiv.org/abs/1412.1897
and other direction of critics is based on the older, widely cited paper
“Intriguing properties of neural networks” by Szegedy et al
http://arxiv.org/abs/1312.6199
In the first paper authors construct examples which classified by convnet with confidence, but look nothing like label to human eye.
In the second paper authors show that correctly classified image could be converted to misclassified with small perturbation, which perturbation could be found by iterative procedure
What I think is that those phenomenons have no impact on practical performance of convolutional neural networks.

First paper is really simple to address. The real world images produced by camera are not dense in the image space(space of all pixel vectors of image size dimension).
In fact camera images belong to low-dimensional manifold in the image space, and there are some interesting works on dimensionality and property of that manifold. For example dimensionality of the space images of the fixed 3D scene it is around 7, which is not surprising, and the geodesics of that manifold could be defined through the optical flow.
Of cause if sample is outside of image manifold it will be misclassified, method of training notwithstanding. The images in the paper are clearly not real-world camera images, no wonder convnet assign nonsensical labels to them.

Second paper is more interesting. First I want to point that perturbation which cause misclassification is produced by iterative procedure. That hint that in the neighbourhood of the image perturbed misclassified images are belong to measure near-zero set.
Practically that mean that probability of this type of misclassification is near zero, and orders of magnitude less than “normal” misclassification rate of most deep networks.
But what is causing that misclassification? I’d suggest that just high dimensionality of the image and parameters spaces and try to illustrate it. In fact it’s the same reason why epsilon-sparse vector are ubiquitous in real-world application: If we take n-dimensional vector, probability that all it’s components more than $\epsilon$ is $(1-\epsilon)^n$, which is near zero. This and like effects explored in depth in compressed sensing ( also very good Igor Carron’s page)
Now to convnet – convnet classify images by signs of piecewise-linear functions.
Take any effective pixel which is affecting activations. Convolutional network separate image space into piecewise-linear areas, which are not aligned with coordinate axes. That mean if we change intensity of pixel far enough we are out of correct classification area.
We don’t know how incorrect areas are distributed in the image space, but for common convolutional network dimensionality of subspace of the hyperplanes which make piecewise-linear separation boundary is several times more than dimensionality of the image vector. This suggest that correlation between incorrect areas of different pixels is quite weak.
Now assume that image is stable to perturbation, that mean that exist \epsilon such that for any effective pixel it’s epsilon-neighbourhood is in the correct area. If incorrect areas are weakly correlated that mean probability of image being stable is about $(1-\epsilon)^n$, where n is number of effective pixels. That is probability of stable image is near zero. That illustrate suggestion that this “adversarial” effect is only caused by dimensionality of the problem and parameter space, not by some intrinsic deficiency of the deep network.

14, December, 2014

## Geoffrey Hinton on max pooling (reddit AMA)

Geoffrey Hinton, neural networks, ML, deep learning pioneer answered “Ask Me Anything” on reddit:
https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/
His answers only:
https://www.reddit.com/user/geoffhinton

His most controversial answer is:
The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.
If the pools do not overlap, pooling loses valuable information about where things are. We need this information to detect precise relationships between the parts of an object. Its true that if the pools overlap enough, the positions of features will be accurately preserved by “coarse coding” (see my paper on “distributed representations” in 1986 for an explanation of this effect). But I no longer believe that coarse coding is the best way to represent the poses of objects relative to the viewer (by pose I mean position, orientation, and scale).
I think it makes much more sense to represent a pose as a small matrix that converts a vector of positional coordinates relative to the viewer into positional coordinates relative to the shape itself. This is what they do in computer graphics and it makes it easy to capture the effect of a change in viewpoint. It also explains why you cannot see a shape without imposing a rectangular coordinate frame on it, and if you impose a different frame, you cannot even recognize it as the same shape. Convolutional neural nets have no explanation for that, or at least none that I can think of.

It’s interesting though controversial opinion, and here is my take on it:
It looks like Convolutional networks discriminate input by switching between activation subnetworks (http://arxiv.org/abs/1410.1165, Understanding Locally Competitive Networks)
From that point of view max pooling operation is pretty natural – it provide both robustness to small deviations and switching between activation subsets. Aslo max pooling has a nice property to be semilinear to multiplication f(ax)=af(x) , which make all the layers stack but last cost layer semilinear, and if one use SVM/L2SVM as cost whole stack semilinear. Not sure if it help in discrimination, but it surely help in debugging.
As alternative to pooling pyramid Geoffrey Hinton suggest to get somehow global rotation(or other transformation) matrix from the image.
Getting it from image or from image patch is actually possible with low-rank matrix factorization from single image(Transform Invariant Low-rank Textures by Zhang et al) and by optical flow (or direct matching) from set of images, but it’s quite expensive in performance term. It’s definitely possible to combine those method with deep network, (ironically they are benefit using pooling pyramid)
but:
1. They will be kind of image augmentation
2. Those are problem-specific methods, they may not help in other domains like NLP, speech etc.

There are some work on using low-rank matrix factorization in deep networks, but they are mostly about fully connected, not convolutional layers. But IMHO if to try do factorization in conv layers, the pooling wouldn’t go away.

PS (edit): Here is a video with Hinton alternative to pooling – “capsules”
http://techtv.mit.edu/collections/bcs/videos/30698-what-s-wrong-with-convolutional-nets

11, November, 2014 Posted by | computer vision, Deep Learnig | , , | Comments Off on Geoffrey Hinton on max pooling (reddit AMA)

## Finds in arxiv, January.

Repost from googleplus stream

Finds in arxiv

Computer vision
Convex Relaxations of SE(2) and SE(3) for Visual Pose Estimation
Finding camera position from the series of 2D (or depth) images is one of the most common (and difficult) task of computer vision.
The biggest problem here is incorporating rotation into cost(energy) function,  that should minimize reprojection(fro 2D) error (seehttp://en.wikipedia.org/wiki/Bundle_adjustment). Usually it’s done by minimization on manifold – locally presenting rotation parameter space as linear subspace. The problem with that approach is that initial approximation should be good enough and minimization goes by small steps pojecting/reprojecting on manifold (SO(3) in the case) which can stuck in local minimum. Where to start if the is no initial approximation? Usually it’s just brute-forced with several initialization attempt. Here in the paper approach is different – no initial approximation needed, solution is straightforward and minimum is global. The idea is instead of minimizing rotation on the sphere minimize it inside the ball. In fact it’s a classical convexification approach – increase dimentionality of the problem to make it convex. The pay is a lot higher computational cost of cause
http://arxiv.org/abs/1401.3700

A Sparse Outliers Iterative Removal Algorithm to Model the Background in the Video Sequences
Background removal without nuclear norm. Some “good” frames chosen as dictionary and backround represented in it.
http://arxiv.org/abs/1401.6013

Deep Learning

Learning Mid-Level Features and Modeling Neuron Selectivity for Image Classification
Mid level features is  a relatively new concept but a very old practice. It’s a building “efficient”  subset of features from the set of features obtained by low-level feature descriptor(like SIFT, SURF, FREAK etc) Before it was done by PCA, later by sparse coding.
What authors built is some kind of hibrid of convolutional network with dictionary learning. Mid level features fed into neuron layer for sparsification and result into classifier. There are some benchmarks but no CIFAR or MINST
http://arxiv.org/abs/1401.5535

Optimization and CS

Alternating projections and coupling slope
Finding intersection of two non-convex sets with alternating projection. I didn’t new that transversality condition used a lot in convex geometry too.
http://arxiv.org/abs/1401.7569
Bayesian Pursuit Algorithms
Bayesian version of hard thresholding. Bayesian approach is in how threshold is chosen and update scaled.
http://arxiv.org/abs/1401.7538

1, February, 2014 Posted by | arxiv | , , , , | Comments Off on Finds in arxiv, January.

## “get nan or inf error” in cuda-convnet – possible fix

“get nan or inf” error happens sometimes on lower-end GPU’s in cuda-convnet. I have traced this error to NaN values in the weights of convolutional layers. I still not clear to me why these NaN values appear in the weights. Are they backpropagate from fully-connected layers or popping up in the convolution kernel? It looks to me latter is more likely. Anyway I made a temporary fix – just scan weight’s gradients with simple cuda kernel and replace NaN’a with zeroes. Didn’t observe the error after that.

I have pushed fix into windows version of cuda-convnet at

https://github.com/s271/win_convnet

Fix activated with option –fix-nan=1

There shouldn’t be any problem with making those changes for linux version – there are several small changes in  *.cu and *.py files only

PS

If anyone wondering what cuda-convnet is here is a nice explanation:

http://fastml.com/object-recognition-in-images-with-cuda-convnet/

And here is the main paper about cuda-convnet

Click to access NIPS2012_0534.pdf

17, January, 2014

## December finds in #arxiv

Repost from my googleplus stream

Computer Vision
Non-Local means is a local image denoising algorithm
Paper shows that non-local mean weights are not identify patches globally point in the images, but are susceptible to aperture problem:
http://en.wikipedia.org/wiki/Optical_flow#Estimation That’s why short radius NLM could be better then large radius NLM. Small radius cutoff play the role of regularizer, similar to the Total Variation in Horn-Shunk Optical flow.
http://en.wikipedia.org/wiki/Horn%E2%80%93Schunck_method (my comment – TV-L1 is generally better than TV-L2 in Horn-Schunk)
http://arxiv.org/abs/1311.3768

Deep Learning
Do Deep Nets Really Need to be Deep?
Authors state that shallow neural nets can in fact achieve similar performance to deep convolutional nets. The problem though is, that they had to be initialized or preconditioned – they can not be trained using existing algorithms.
And for that initialization they need deep nets. Authors hypothesize that there should be algorithms that allow training of those shallow nets to reach the same performance as deep nets.
http://arxiv.org/abs/1312.6184

Intriguing properties of neural networks
The linear combination of deep-level nodes produce the same results as the the original nodes. That  suggest that nodes the spaces itself rather it’s representation keep information for deep levels.
The input-output mapping also discontinuous – small perturbations cause misclassification. Those perturbation are not dependent on the training, only on input of classification. (My comment – sparse coding is generally not smooth on input, another argument that sparse coding is part of internal mechanics of deep learning)
http://arxiv.org/abs/1312.6199

From Maxout to Channel-Out: Encoding Information on Sparse Pathways
This paper start with observation that max-out is a form of sparse coding: only one of the input pathway is chosen for father processing. From this inferred development of that principle:
remove “middle” layer which “choose” maximum input, and transfer maximal input at once into next level – make choice function index-aware. Some other choice function beside the max is considered, but max still seems the best
Piecewise-constant choice function make interesting reference to previous paper (discontinuity of input-output mapping)
http://arxiv.org/abs/1312.1909

Unsupervised Feature Learning by Deep Sparse Coding
This, for a difference is not about convolutional network.
Instead SIFT(or similar) descriptors are used to produce bag-of-words, sparse coding is used with max-out, and manifold learning applied to it. (http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction)
http://arxiv.org/abs/1312.5783

Generative NeuroEvolution for Deep Learning
I’m generally wary of evolutionary methods, but this looks kind of interesting –  it’s based on compositional pattern producing network (CPPN)- encoding geometric pattern as composition of simple functions.
This CPPN is used to encode connectivity pattern of ANN (Convolutional newtwork most used). Thus complete process is the combination of ANN training and evolutionary CPPN training
http://arxiv.org/abs/1312.5355

Some Improvements on Deep Convolutional Neural Network Based Image ClassificationUnsupervised feature learning by augmenting single images
Botht papers seems about the same subject – squeeze more out of labeled images by applying a lot of transformation to them(Some of those transformations are implemented in cuda-convnet BTW)
http://arxiv.org/abs/1312.5402http://arxiv.org/abs/1312.5242

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Analytical exploration of toy 3-layer model *without_ actual non-linear neurons. Model completely linear to input (polynomial to weights). Nevertheless it show some interesting properties, like step in learning curve
http://arxiv.org/abs/1312.6120

Optimization
Distributed Interior-point Method for Loosely Coupled Problems
Mixing together all my favorite methods: Interior point, Newton, ADMM(Split-Bregman) into one algorithm  and make a distribute implementation of it.
Mixing Newton and ADMM, ADMM and Interior point looks risky to me, though with a lot of subiterations it may work(that’s why it’s distributed – require a lot of calculating power)
Also I’m not sure about convergene of the combined algorithm – each step’s convergence is proven, but I’m not sure the same could be applyed to the combination.
Newton and ADMM have kind of contradicting optimal conditions – smoothness vs piecewise linearity. Would like to see more research on this…
http://arxiv.org/abs/1312.5440

Total variation regularization for manifold-valued data
Proximal mapping and soft thresholding for manifolds – analog of ADMM for manifolds.
http://arxiv.org/abs/1312.7710

just interesting stuff
Coping with Physical Attacks on Random Network Structures
Include finding vulnerable spots and results of random attacks
(My comment – shouldn’t it be connected to precolation theory?)
http://arxiv.org/abs/1312.6189

5, January, 2014 Posted by | arxiv | , , , , | Comments Off on December finds in #arxiv

## November finds in #arxiv and NIPS 2013

This is “find in arxiv” reposts form my G+ stream for November.

NIPS 2013

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

Stochastic gradient (SGD) is the major tool for Deep Learning. However if you look at the plot of cost function over iteration for   SGD you will see that after quite fast descent it becoming extremely slow, and error decrease could even become non-monotonous.  Author explain by necessity of trade of between the step size and variance of random factor – more precision require smaller variance but that mean smaller descent step and slower convergence. “Predictive variance” author suggest to mitigate problem is the same old “adding back the noise” trick, used for example in Split Bregman. Worth reading IMHO.

http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction

Predicting Parameters in Deep Learning
Output of the first layer of ConvNet is quite smooth, and that could be used for dimensionality reduction, using some dictionary, learned or fixed(just some simple kernel). For ConvNet predicting 75% of parameters with fixed dictionary have negligible effect on accuracy.
http://papers.nips.cc/paper/5025-predicting-parameters-in-deep-learning

Learning a Deep Compact Image Representation for Visual Tracking
Application of ADMM (Alternating Direction Method of Multipliers, of which Split Bregman again one of the prominent examples) to denoising autoencoder with sparsity.
http://papers.nips.cc/paper/5192-learning-a-deep-compact-image-representation-for-visual-tracking

Deep Neural Networks for Object Detection
People from Google are playing with Alex Krizhevsky’s ConvNet
http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection

–arxiv (last)–

Are all training examples equally valuable?
It’s intuitively obvious that some training sample are making training process worse. The question is – how to find wich sample should be removed from training? Kind of removing outliers. Authors define “training value” for each sample of binary classifier.
http://arxiv.org/pdf/1311.7080

Finding sparse solutions of systems of polynomial equations via group-sparsity optimization
Finding sparse solution of polynomial system with lifting method.
I still not completely understand why quite weak structure constraint is enough for found approximation to be solution with high probability. It would be obviously precise  for binary 0-1 solution, but why for general sparse?
http://arxiv.org/abs/1311.5871

Semi-Supervised Sparse Coding
Big dataset with small amount of labeled samples – what to do? Use unlabeled samples for sparse representation. And train labeled samples in sparse representation.
http://arxiv.org/abs/1311.6834
From the same author, similar theme – Cross-Domain Sparse Coding
Two domain training – use cross domain data representation to map all the samples from both source and target domains to a data representation space with a common distribution across domains.
http://arxiv.org/abs/1311.7080

Robust Low-rank Tensor Recovery: Models and Algorithms
More of tensor decomposition with trace norm
http://arxiv.org/abs/1311.6182

Complexity of Inexact Proximal Newton methods
Application of Proximal Newton (BFGS) to subset of coordinates each step – active set coordinate descent.
http://arxiv.org/pdf/1311.6547

Computational Complexity of Smooth Differential Equations
Polynomial-memory complexity of ordinary differential equations.
http://arxiv.org/abs/1311.5414

–arxiv (2)–

Deep Learning

Visualizing and Understanding Convolutional Neural Networks
This is exploration of Alex Krizhevsky’s  ConvNet
https://code.google.com/p/cuda-convnet/ )
using “deconvnet” approach – using deconvolution on output of each layer and visualizing it. Results looks interesting – strting from level 3 it’s something like thersholded edge enchantment, or sketch. Also there are evidences supporting “learn once use everywhere” approach – convnet trained on ImageNet is also effective on other datasets
http://arxiv.org/abs/1311.2901

Unsupervised Learning of Invariant Representations in Hierarchical Architectures
Another paper on why and how deep learning works.
Attempt to build theoretical framework for invariant features in deep learning. Interesting result – Gabor wavelets are optimal filters for simultaneous scale and translation invariance. Relations to sparsity and scattering transform
http://arxiv.org/abs/1311.4158

Computer Vision
An Experimental Comparison of Trust Region and Level Sets
Trust regions method for energy-based segmentation.
Trust region is one of the most important tools in optimization, especially non-convex.
http://en.wikipedia.org/wiki/Trust_region
http://arxiv.org/abs/1311.2102

Blind Deconvolution with Re-weighted Sparsity Promotion
Using reweighted L2 norm for sparsity in blind deconvolution
http://arxiv.org/abs/1311.4029

Optimization

Online universal gradient methods
about  Nesterov’s universal gradient method (
http://www.optimization-online.org/DB_FILE/2013/04/3833.pdf )
It use Bregman distance and related to ADMM.
The paper is application of universal gradient method  to online learning and give bound on regret function.
http://arxiv.org/abs/1311.3832

CS
A Component Lasso
Approximate covariance matrix with block-diagonal matrix and apply Lasso to each block separately
http://arxiv.org/abs/1311.4472

_FuSSO: Functional Shrinkage and Selection Operator
Lasso in functional space with some orthogonal basis_
http://arxiv.org/abs/1311.2234

Non-Convex Compressed Sensing Using Partial Support Information
More of Lp norm for sparse recovery. Reweighted this time.
http://arxiv.org/abs/1311.3773

–arxiv  (1)–

Optimization, CS
Scalable Frames and Convex Geometry
Frame theory is a basis(pun intended) of wavelets theory, compressed sening and overcomplete dictionaries in ML
http://en.wikipedia.org/wiki/Frame_of_a_vector_space
Here is a discussion how to make “tight frame”
http://en.wikipedia.org/wiki/Frame_of_a_vector_space#Tight_frames
from an ordinary frame by scaling m of its components
Interesting geometric insight provided – to do it mcomponents of frame should make “blunt cone”
http://en.wikipedia.org/wiki/Convex_cone#Blunt_and_pointed_cones
http://arxiv.org/abs/1310.8107

Learning Sparsely Used Overcomplete Dictionaries via Alternating Minimization
Some bounds for convergence of dictionary learning. Converge id initial error is O(1/s^2), s- sparcity level
http://arxiv.org/abs/1310.7991

Robust estimators
Robustness of ℓ1 minimization against sparse outliers and its implications in Statistics and Signal Recovery
This is another exploration of L1 estimator. It happens (contrary to common sense) that L1 is not especially robust from “breakdown point” point of view if there is no constraint of noise. However it practical usefulness can be explained that it’s very robust to sparse noise
http://arxiv.org/abs/1310.7637

Numerical
Local Fourier Analysis of Multigrid Methods with Polynomial Smoothers and Aggressive coarsening
Overrelaxaction with Chebyshev weights on the fine grid, with convergence analysis.
http://arxiv.org/abs/1310.8385

1, December, 2013 Posted by | arxiv | , , , , , | Comments Off on November finds in #arxiv and NIPS 2013

## Deriving Gibbs distribution from stochastic gradients

Stochastic gradients is one of the most important tools in optimization and machine learning (especially for Deep Learning – see for example ConvNet). One of it’s advantage is that it behavior is well understood in general case, by application of methods of statistical mechanics.

In general form stochastic gradient descent could be written as
$w_{n+1} = w_n - \alpha \bigtriangledown E(w_n) + F_n$
where $F_n$ is a random variable with expectation zero.
To apply methods of statistical mechanics we can rewrite it in continuous form, as stochastic gradient flow
$\partial J/\partial t = - \bigtriangledown_J E(J) +F(t)$
and random variable F(t) we assume to be white noise for simplicity.
In that moment most of textbooks and papers refer to “methods of statistical mechanics” to show that
stochastic gradient flow has invariant probability distribution, which is called Gibbs distribution
$p(x) = \frac {exp(-\beta E(x))}{Z}$
and from here derive some interesting things like temperature and free energy.
The question is – how Gibbs distribution derived from stochastic gradient flow?

First we have to understand what stochastic gradient flow really means.
It’s not a partial differential equation (PDE), because it include random variable, which is not a function $\mathbb{R} \to \mathbb{R}$. In fact it’s a stochastic differential equation (SDE) . SDE use some complex mathematical machinery and relate to partial differential equations, probability/statistics, measure theory and ergodic theory. However they used a lot in finance, so there is quite a number of textbooks on SDE for non-mathematicians. For the short, minimalistic and accessible book on stochastic differential equation I can’t recommend highly enough introduction to SDE by L.C. Evans

SDE in question is called Ito diffusion. Solution of that equation is a stochastic process – collection of random variables parametrized by time. Sample path of stochastic process in question as a function of time is nowhere differentiable – it’s difficult to talk about it in term of derivatives, so it is defined through it’s integral form.
First I’ll notice that integral of white noise is actually Brownian motion, or Wiener process.
Assume that we have stochastic differential equation written in informal manner
$\partial X / \partial t = b(X, t) + g(X, t) F(t)$ with X -stochastic process and F(t) – white noise
It’s integral form is
$X(T) - X(0) = \int_{0}^{T} b(X, t) dt + \int_{0}^{T} g(X, t) dW(t) \ \ (1)$
where W(t) is a Wiener process
$W(t) = \int F(t) dt$
This equation is usually written in the form
$dX = b(X,t)dt + g(X,t)dW \ \ (2)$
This is only a notation for integral equation, d here is not a differential.
Returning to (1)
$\int_{0}^{T} b(X, t) dt$ is an integral along sample path, it’s meaning is obvious, or it can be defined as limit of Riemann sums with respect to time.

The most notable thing here is
$\int_{0}^{T} g(X, t) dW(t)$ – integral with respect to Wiener process (3)
It’s a stochastic integral, and it’s defined in the courses of stochastic differential equation as the limit of Riemann sums of random variables, in the manner similar to definition of ordinary integral.
Curiously, stochastic integral is not quite well defined. Depending on the form of the sum it produce different results, like Ito integral:

$\int_0^t g \,d W =\lim_{n\rightarrow\infty} \sum_{[t_{i-1},t_i]\in\pi_n}g_{t_{i-1}}(W_{t_i}-W_{t_{i-1}})$
Different Riemann sums produce different integral – Stratonovich integral:
$\int_0^t g \,d W =\lim_{n\rightarrow\infty} \sum_{[t_{i-1},t_i]\in\pi_n}(g_{t_i} + g_{t_{i-1}})/2(W_{t_i}-W_{t_{i-1}})$

Ito integral used more often in statistics because it use $g_{t_{i-1}}$ – it don’t “look forward”, and Stratonovich more used in theoretical physics.

Returning to Ito integral – Ito integral is stochastic process itself, and it has expectation zero for each t.
From definition of Ito integral follow one of the most important tools of stochastic calculus – Ito Lemma (or Ito formula)
Ito lemma states that for solution of SDE (2)
$dX = b(X, t)dt + g(x, t)dW$ X, b, W – vectors, g – matrix
were W is Wiener process (actually some more general process) and b and g are good enough

$du(\mathbf{X}, t) = \frac{\partial u}{\partial t} dt + (\nabla_\mathbf{X}^{\mathsf T} u) d\mathbf{X} + \tfrac{1}{2} \sum_{i,j} (gg^\mathsf{T})_{i,j}\tfrac{ \partial^2 u}{\partial x_i \partial x_j} dt$

where $\nabla_\mathbf{X} u$ is the gradient.
From Ito lemma follow Ito product rule for scalar processes: applying Ito formula to process $V = { X \choose Y}$ combined from two processes X and Y to function u(V) = XY
$d(XY) = XdY + YdX + g_x \cdot g_y dt$
Using Ito formula and Ito product rule it is possible to get Feynman–Kac formula (derivation could be found in the wikipedia, it use only Ito formula, Ito product rule and the fact that expectation of Ito integral (3) is zero):
for partial differential equation (PDE)
$\frac{\partial u}{\partial t} + (\nabla_\mathbf{X}u)^{\mathsf T} b + \tfrac{1}{2} \sum_{i,j} (gg^\mathsf{T})_{i,j}\tfrac{ \partial^2 u}{\partial x_i \partial x_j} - vu = f$
with terminal condition
$u(x, T) = \psi(x) \ \ (4)$
solution can be written as conditional expectation:
$u(x,t) = E\left[ \int_t^T e^{- \int_t^r v(X_\tau,\tau)\, d\tau}f(X_r,r)dr + e^{-\int_t^T v(X_\tau,\tau)\, d\tau}\psi(X_T) \Bigg| X_t=x \right]$
Feynman–Kac formula establish connection between PDE and stochastic process.
From Feynman–Kac formula taking $v\equiv 0$ and $f \equiv 0$ we get Kolmogorov backward equation :
for
$Lu = (\nabla_\mathbf{X}u)^{\mathsf T} b + \tfrac{1}{2} \sum_{i,j} (gg^\mathsf{T})_{i,j}\tfrac{ \partial^2 u}{\partial x_i \partial x_j}$
equation
$\frac{\partial u}{\partial t} + Lu = 0 \ \ (5)$
with terminal condition (4) have solution as conditional expectation
$u(x, t) = E(\psi(X_T) | X_t = x) \ \ (6)$
From Kolmogorov backward equation we can obtain Kolmogorov forward equation, which describe evolution of probability density for random process X (2)
In SDE courses it’s established that (2) is a Markov process and has transitional probability P and transitional density p:
p(x, s, y, t) = probability density at being at y in time t, on condition that it started at x in time s
taking u – solution of (5) with terminal condition (6)
$u(x, s) = E(\psi(X_T) | X_s = x) = \int p(x, s, z, T) \psi(z) dz$
From Markov property
$p(x, s, z, T) = \int p(x, s, y, t)p(y, t, z, T) dz$
from here
$u(x, s) = \int \int p(x, s, y, t) p(y, t, z, T) \psi(z) dz dy$
$u(x, s) = \int p(x, s, y, t) u(y, t) dy$
form here
$0 = \frac{\partial u}{\partial t} = \int \frac {\partial p(x, s, y, t)}{\partial t}u(y, t) + p(x, s, y, t) \frac {\partial u(y, t)}{\partial t} dy$
from (5)
$\int \frac{\partial p(x, s, y, t)}{\partial t}u(y, t) - (Lu) p(x, s, y, t) dy = 0 \ \ (7)$
Now we introduce dual operator $L^*$
$\int p (Lu) dy = \int (L^{*}p) u dy$
By integration by part we can get
$L^*u = -\nabla_\mathbf{X}^{\mathsf T}(ub) + \tfrac{1}{2} \sum_{i,j} \tfrac{ \partial^2 }{\partial x_i \partial x_j}(gg^\mathsf{T}u)_{i,j}$
and from (7)
$\int (\frac{\partial p(x, s, y, t)}{\partial t} - L^{*}p) u(y, t) dy = 0$
for t=T
$\int (\frac{\partial p(x, s, y, T)}{\partial t} - L^{*}p(x, s, y, T)) \psi(y) dy = 0$
This is true for any $\psi$, wich is independent from p
And we get Kolmogorov forward equation for p. Integrating by x we get the same equation for probability density at any moment T
$\frac{\partial p}{\partial t} - L^{*}p = 0$
Now we return to Gibbs invariant distribution for gradient flow
$\tfrac{\partial J}{\partial t} = - \bigtriangledown_J E(J) + \sqrt{\tfrac{2}{\beta}}F(t)$
Stochastic gradient flow in SDE notation
$dJ = - \bigtriangledown_J E(J)dt + \sqrt{\tfrac{2}{\beta}} dW$ – integral of white noise is Wiener process
We want to find invariant probability density $p_G$. Invariant – means it doesn’t change with time, $\tfrac{\partial p_G}{\partial t} = 0$
so from Kolmogorov forward equation
$L^*p_G = 0$
or
$\nabla^{\mathsf T}(p_G\nabla E)+ \tfrac{1}{\beta} \nabla^{\mathsf T} \nabla p_G = 0$
removing gradient
$p_G\nabla E + \tfrac{1}{\beta} \nabla p_G = C$
C = 0 because we want $p_G$ integrable
$\nabla(E + \tfrac{1}{\beta} log(p_G)) = 0$
and at last we get Gibbs distribution
$p_G = \frac{exp(-\beta E)}{Z}$
Recalling again the chain of reasoning:
Wiener process

SDE + Ito Lemma + Ito product rule + zero expecation of Ito integral →

Kolmogorov forward equation for probability density  →

Gibbs invariant distribution

13, November, 2013 Posted by | Uncategorized | Comments Off on Deriving Gibbs distribution from stochastic gradients

## Meaning of conditional probability

Conditional probability was always baffling me. Empirical, frequentists meaning is clear, but the abstract definition, originating from Kolmogorov – what was its mathematical meaning? How it can be derived? It’s a nontrivial definition and is appearing in the textbooks out the air, without measure theory intuition behind it.

Here I mostly follow Chang&Pollard  paper  Conditioning as disintegartion. Beware that the paper use non-standard notation, but this post follow more common notation, same as in wikipedia.

Here is example form Chang&Pollard paper:

Suppose we have distribution on  $\mathbb{R}^2$ concentrated in two straight lines $L_1$ and $L_2$ with respective density $g_i(x)$ and angles with X axis $\alpha_i$. Observation (X,Y) taken, giving $X=x_0$, what is probability that point lies on the line $L_1$ ?

Standard approach would be approximate $x_0$ with $[ x_0, x_0 +\Delta ]$ and take limit with  $\Delta \to 0$

$\frac{\int_{x_0}^{x_0+ \Delta}g_1/cos(\alpha_1)}{\int_{x_0}^{x_0+ \Delta}g_1/cos(\alpha_1) + \int_{x_0}^{x_0+ \Delta}g_2/cos(\alpha_2)}$

Not only taking this limit is kind of cumbersome, it’s also not totally obvious that it’s the same conditional probability that defined in the abstract definition – we are replacing ratio with limit here.

Now what is “correct way” to define conditional probabilities, especially for distributions?

For simplicity we will first talk about single scalar random variable, defined on probability space. We will think of random variable X as function on the sample space. Now condition $X=x_0$ define fiber – inverse image of $x_0$.

Disintegration theorem say that probability measure on the sample space can be decomposed into two measures – parametric family of measures induced by original probability on each fiber and “orthogonal” measure on $\mathbb{R}$  – on the parameter space of the first measure. Here $\mathbb{R}$ is the space of values of X  and serve as parameter space for measures on fibers. Second measure induced by the inverse image of  the function (random variable) for each measurable set on $\mathbb{R}$. This second measure is called Pushforward measure. Pushforward measure is just for measurable set on $\mathbb{R}$ (in our case) taking its X  inverse image on sample space and measuring it with μ.

Fiber is in fact sample space for conditional event, and measure on fiber is our conditional distribution.

Full statement of the theorem require some term form measure theory. Following wikipedia

Let P(X) is collection of Borel probability measures on X,  P(Y) is collection of  Borel probability measures on Y

Let Y and X be two Radon spaces. Let μ ∈ P(Y), let $\pi$: Y → X be a Borel- measurable function, and let ν ∈ P(X) be the pushforward measure  $\mu \circ \pi^{-1}$.

* Then there exists a ν-almost everywhere uniquely determined family of probability measures $\mu_x$  ⊆  P(Y) such that
* the function $x \mapsto \mu_{x}$ is Borel measurable, in the sense that $x \mapsto \mu_{x} (B)$ is a Borel-measurable function for each Borel-measurable set B ⊆ Y;
*  $\mu_x$  “lives on” the  fiber $\pi^{-1}(x)$ : for ν-almost all x  ∈ X,

$\mu_x(Y \setminus \pi^{-1} (x))=0$

and so $\mu_x(E)=\mu_x(E \cap \pi^{-1}(x))$

* for every Borel-measurable function $f$ : Y → [0, ∞],

$\int_{Y} f(y) \, \mathrm{d} \mu (y) = \int_{X} \int_{\pi^{-1} (x)} f(y) \, \mathrm{d} \mu_{x} (y) \mathrm{d} \nu (x)$

From here for any event E form Y

$\mu (E) = \int_{X} \mu_{x} \left( E \right) \, \mathrm{d} \nu (x)$

This was complete statement of the disintegration theorem.

Now returning to Chang&Pollard example. For formal derivation I refer you to the original paper, here we will just “guess” $\mu_x$, and by uniqueness it give us disintegration. Our conditional distribution for $X=x$ will be just point masses on the intersections of lines $L_1$ and $L_2$ with axis  $X=x$
Here $\delta$ is delta function – point mass.

$d\mu_x(y) = \frac{\delta (L_1( x )-y) g_1( x ) / cos(\alpha_1) + \delta (L_2( x )-y) g_2( x ) / cos(\alpha_2)}{Z(x)}dy$

$Z(x) = g_1(x)/cos(\alpha_1) + g_2(x)/cos(\alpha_2)$

and for $\nu$
$d\nu = Z(x)dx$

Our conditional probability that event lies on $L_1$ with condition $X = x_0$, form conditional density $d\mu_{x_0}/dy$ thus

$\frac{g_1(x_0)/cos(\alpha_1)}{g_1(x_0)/cos(\alpha_1) + g_2(x_0)/cos(\alpha_2)}$

Another example from Chen&Pollard. It relate to sufficient statistics. Term sufficient statistic used if we have probability distribution depending on some parameter, like in maximum likelihood estimation. Sufficient statistic is some function of sample, if it’s possible to estimate parameter of distribution from only values of that function in the best possible way – adding more data form the sample will not give more information about parameter of distribution.
Let $\mathbb{P}_{\theta}$ be uniform distribution on the square $[0, \theta]^2$. In that case M = max(x, y) is sufficient statistics for $\theta$. How to show it?
Let take our function $(x,y) \to m$ , $m = max(x,y)$ and make disintegration.
$d\mu_m(x,y) = s_m(x,y)dxdy$ is a uniform distribution on edges where x and y equal m and $d\nu = (2m / \theta^2 )dm$
$s_m(x,y)$ is density of conditional probability $P(\cdot | M = m)$ and it doesn’t depend on $\theta$
For any $f \ \ \mathbb{P}_{\theta}(f| m) = \int f s_m(x,y) \ , \ \mathbb{P}_{\theta, m}(\cdot) = \mathbb{P}_{m}(\cdot)$ – means that M is  sufficient.

It seems that in most cases disintegration is not a tool for finding conditional distribution. Instead it can help to guess it and form uniqueness prove that the guess is correct. That correctness could be nontrivial – there are some paradoxes similar to Borel Paradox in Chang&Pollard paper.

7, November, 2013 Posted by | Math | , , | Comments Off on Meaning of conditional probability

## Finds in arxiv, october

This is duplication of my ongoing G+ series of post on interesting for me papers in arxiv. Older posts are not here but in my G+ thread.

Finds in #arxiv :
*Optimization, numerical & convex, ML*
The Linearized Bregman Method via Split Feasibility Problems: Analysis and Generalizations
Reformulation of  Split Bregman/ ADMM as split feasibility problem and algorithm/convergence for generalized split feasibility by Bregman projection. This general formulation include both Split Bregman and Kaczmarz (my comment – randomized Kaczmarz  seems could be here too)
http://arxiv.org/abs/1309.2094

Stochastic gradient descent and the randomized Kaczmarz algorithm
Hybrid of  randomized Kaczmarz  and stochastic gradient descent  – into my “to read” pile
http://arxiv.org/abs/1310.5715

Trust–Region Problems with Linear Inequality Constraints: Exact SDP Relaxation, Global Optimality and Robust Optimization
“Extended” trust region for linear inequalities constrains
http://arxiv.org/abs/1309.3000

Conic Geometric Programming
Unifing framwork for conic and geometric programming
http://arxiv.org/abs/1310.0899
http://en.wikipedia.org/wiki/Geometric_programming
http://en.wikipedia.org/wiki/Conic_programming

Gauge optimization, duality, and applications
Another big paper about different, not Lagrange duality, introduced by Freund (1987)
http://arxiv.org/abs/1310.2639

Color Bregman TV
mu parameters in split bregman made adaptive, to exploit coherence of edges in different color channels
http://arxiv.org/abs/1310.3146

Iteration Complexity Analysis of Block Coordinate Descent Methods
Some convergence analysis for BCD and projected gradient BCD
http://arxiv.org/abs/1310.6957

Successive Nonnegative Projection Algorithm for Robust Nonnegative Blind Source Separation
Nonnegative matrix factorization
http://arxiv.org/abs/1310.7529

Scaling SVM and Least Absolute Deviations via Exact Data Reduction
SVN for large-scale problems
http://arxiv.org/abs/1310.7048

Image Restoration using Total Variation with Overlapping Group Sparsity
While title is promising I have doubt about that paper.  The method authors suggest is equivalent to adding averaging filter to TV-L1 under L1 norm. There is no comparison  to just applying TV-L1 and smoothing filter interchangeably.The method author suggest is very costly, and using median filter instead of averaging would cost the same while obviously more robust.
http://arxiv.org/abs/1310.3447

*Deep learning*
Deep and Wide Multiscale Recursive Networks for Robust Image Labeling
_Open source_ matlab/c package coming soon(not yet available)
http://arxiv.org/abs/1310.0354

Improvements to deep convolutional neural networks for LVCSR
convolutional  networks, droput for speech recognition,
http://arxiv.org/abs/1309.1501v1

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
Already discussed on G+  – open source framework in “learn one use everywhere” stile. Learning done off-line on GPU using ConvNet, and recognition is online in pure python.
http://arxiv.org/abs/1310.1531

Statistical mechanics of complex neural systems and high dimensional data
Big textbook-like overview paper on statistical mechanics of learning. I’ve put it in my “to read” pile.
http://arxiv.org/abs/1301.7115

Randomized co-training: from cortical neurons to machine learning and back again
“Selectron” instead of perception – neurons are “specializing” with weights.
http://arxiv.org/abs/1310.6536

Provable Bounds for Learning Some Deep Representations
http://arxiv.org/abs/1310.6343
Citation:”The current paper presents both an interesting family of denoising autoencoders as

well as new algorithms to provably learn almost all models in this family.”

*Computer vision*
Online Unsupervised Feature Learning for Visual Tracking
Sparse representation, overcomplete dictionary
http://arxiv.org/abs/1310.1690

From Shading to Local Shape
Shape restoration from local shading – could be very useful in low-feature environment.
http://arxiv.org/abs/1310.2916

Fast 3D Salient Region Detection in Medical Images using GPUs
Finding interest point in 3D images
http://arxiv.org/abs/1310.6736

Object Recognition System Design in Computer Vision: a Universal Approach
Grid-based universal framework for object detection/classification
http://arxiv.org/abs/1310.7170

Gaming :)
Lyapunov-based Low-thrust Optimal Orbit Transfer: An approach in Cartesian coordinates
For space sim enthusiast :)
http://arxiv.org/abs/1310.4201

29, October, 2013 Posted by | arxiv, computer vision | , , , , , | Comments Off on Finds in arxiv, october