There are quite a few complex methods for making convolutional networks converge faster. Natural gradient, dual coordinate ascent, second order hessian free methods and more. However those methods are usually require considerable memory spending and extensive modifications of existing code if you already had some working method before.
Here instead I’ll list some simple and lightweight methods, which don’t require extensive changes of already working code. Those methods works (I’ve tested them on CIFAR10), and at very worst don’t make convergence worse. You shouldn’t expect radical improvement from those methods – they are for little speed up.
1. Start training on the part of the data and gradually increase data size to full. For example for CIFAR10 I start with 1/5 of all the data and increase it by 1/5 each and time giving it 3 more epochs to train for each chunk. This trick was inspired by “curriculum learning”. Resulting acceleration is small but still visible (on order of 10%).
2. Random sampling. Running all the data batches in the same order theoretically can produce some small overfitting – coadaptaion of the data. Usually this effect is not visible, but just to be sure you can reshuffle data randomly each epoch, especially if it cost very little. I’m using simple reshuffling method based on the prime numbers table.
For each minibatch k I take samples indexed by i: (i*prime)%data_size where i is from k*minibatch_size to (k+1)*minibatch_size, and prime taken form prime numbers table change each epoch. If prime is more than all prime factors of data_size all samples are indexed once.
3. Gradient step size. The baseline method is to use simple momentum. Most popular method of gradient acceleration on top of momentum are RMSProp and Nesterov accelerated gadient (NAG).
NAG simply change order in which momentum and gradient are applied to weights. RMSProp is normalization of the gradient, so that it should have approximately same norm. One of the variants described here, which is similar to low pass filter.Another possible implementation – gradient is divided by running root mean square of it’s previous values. However in my tests on CIFAR10 neither NAG nor RMSprop or NAG+RMSProp show any visible improvement.
Nevertheless in my tests a simple modification of gradient step show better result then standard momentum. That is just a common optimization trick – discard or reduce those gradient step which increase error function. Straightforward implementation of it could be costly for convnet – to estimate cost function after gradient step require additional forward propagation. There is a workaround – estimate error on the next sample of the dataset and if error increased subtract part of the previous gradient. This is not precisely the same as reducing bad gradient, because we subtract offending gradient after we made next step, but because the step is small it still works. Essentially it could be seen as error-dependent gradient step
4. Dropout. Use of dropout require some care, especially if we want to insert dropout into networks built and tuned without dropout. Common recommendation is to increase number of filters proportionally to dropout value.
But there are some hidden traps there: while dropout should improve test error(results on the samples not used in training) it make training error noticeably worse. In practice it may make worse even test error. Dropout also make convergence considerably more slow. There is non-obvious trick which help: continue iterations without changing learning rate for some times even after training error is not decreasing any more. Intuition behind this trick – dropout reduce test error, not training error. Test error decrease often very slow and noisy comparing to training error, so, just to be sure it may help to increase number of epoch even after both stop decreasing, without decreasing learning rate. Reducing learning rate after both training and test error went plateau for some epochs may produce better results.
To be continued (may be)
There is some deep learning (specifically #convnet) criticism based on the artificially constructed misclassification examples.
There is a new paper
“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images” by Nguen et al
and other direction of critics is based on the older, widely cited paper
“Intriguing properties of neural networks” by Szegedy et al
In the first paper authors construct examples which classified by convnet with confidence, but look nothing like label to human eye.
In the second paper authors show that correctly classified image could be converted to misclassified with small perturbation, which perturbation could be found by iterative procedure
What I think is that those phenomenons have no impact on practical performance of convolutional neural networks.
First paper is really simple to address. The real world images produced by camera are not dense in the image space(space of all pixel vectors of image size dimension).
In fact camera images belong to low-dimensional manifold in the image space, and there are some interesting works on dimensionality and property of that manifold. For example dimensionality of the space images of the fixed 3D scene it is around 7, which is not surprising, and the geodesics of that manifold could be defined through the optical flow.
Of cause if sample is outside of image manifold it will be misclassified, method of training notwithstanding. The images in the paper are clearly not real-world camera images, no wonder convnet assign nonsensical labels to them.
Second paper is more interesting. First I want to point that perturbation which cause misclassification is produced by iterative procedure. That hint that in the neighbourhood of the image perturbed misclassified images are belong to measure near-zero set.
Practically that mean that probability of this type of misclassification is near zero, and orders of magnitude less than “normal” misclassification rate of most deep networks.
But what is causing that misclassification? I’d suggest that just high dimensionality of the image and parameters spaces and try to illustrate it. In fact it’s the same reason why epsilon-sparse vector are ubiquitous in real-world application: If we take n-dimensional vector, probability that all it’s components more than is , which is near zero. This and like effects explored in depth in compressed sensing ( also very good Igor Carron’s page)
Now to convnet – convnet classify images by signs of piecewise-linear functions.
Take any effective pixel which is affecting activations. Convolutional network separate image space into piecewise-linear areas, which are not aligned with coordinate axes. That mean if we change intensity of pixel far enough we are out of correct classification area.
We don’t know how incorrect areas are distributed in the image space, but for common convolutional network dimensionality of subspace of the hyperplanes which make piecewise-linear separation boundary is several times more than dimensionality of the image vector. This suggest that correlation between incorrect areas of different pixels is quite weak.
Now assume that image is stable to perturbation, that mean that exist \epsilon such that for any effective pixel it’s epsilon-neighbourhood is in the correct area. If incorrect areas are weakly correlated that mean probability of image being stable is about , where n is number of effective pixels. That is probability of stable image is near zero. That illustrate suggestion that this “adversarial” effect is only caused by dimensionality of the problem and parameter space, not by some intrinsic deficiency of the deep network.
Geoffrey Hinton, neural networks, ML, deep learning pioneer answered “Ask Me Anything” on reddit:
His answers only:
His most controversial answer is:
The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.
If the pools do not overlap, pooling loses valuable information about where things are. We need this information to detect precise relationships between the parts of an object. Its true that if the pools overlap enough, the positions of features will be accurately preserved by “coarse coding” (see my paper on “distributed representations” in 1986 for an explanation of this effect). But I no longer believe that coarse coding is the best way to represent the poses of objects relative to the viewer (by pose I mean position, orientation, and scale).
I think it makes much more sense to represent a pose as a small matrix that converts a vector of positional coordinates relative to the viewer into positional coordinates relative to the shape itself. This is what they do in computer graphics and it makes it easy to capture the effect of a change in viewpoint. It also explains why you cannot see a shape without imposing a rectangular coordinate frame on it, and if you impose a different frame, you cannot even recognize it as the same shape. Convolutional neural nets have no explanation for that, or at least none that I can think of.
It’s interesting though controversial opinion, and here is my take on it:
It looks like Convolutional networks discriminate input by switching between activation subnetworks (http://arxiv.org/abs/1410.1165, Understanding Locally Competitive Networks)
From that point of view max pooling operation is pretty natural – it provide both robustness to small deviations and switching between activation subsets. Aslo max pooling has a nice property to be semilinear to multiplication f(ax)=af(x) , which make all the layers stack but last cost layer semilinear, and if one use SVM/L2SVM as cost whole stack semilinear. Not sure if it help in discrimination, but it surely help in debugging.
As alternative to pooling pyramid Geoffrey Hinton suggest to get somehow global rotation(or other transformation) matrix from the image.
Getting it from image or from image patch is actually possible with low-rank matrix factorization from single image(Transform Invariant Low-rank Textures by Zhang et al) and by optical flow (or direct matching) from set of images, but it’s quite expensive in performance term. It’s definitely possible to combine those method with deep network, (ironically they are benefit using pooling pyramid)
1. They will be kind of image augmentation
2. Those are problem-specific methods, they may not help in other domains like NLP, speech etc.
There are some work on using low-rank matrix factorization in deep networks, but they are mostly about fully connected, not convolutional layers. But IMHO if to try do factorization in conv layers, the pooling wouldn’t go away.
PS (edit): Here is a video with Hinton alternative to pooling – “capsules”
Finds in #arxiv :
*Optimization, numerical & convex, ML*
The Linearized Bregman Method via Split Feasibility Problems: Analysis and Generalizations
Reformulation of Split Bregman/ ADMM as split feasibility problem and algorithm/convergence for generalized split feasibility by Bregman projection. This general formulation include both Split Bregman and Kaczmarz (my comment – randomized Kaczmarz seems could be here too)
Stochastic gradient descent and the randomized Kaczmarz algorithm
Hybrid of randomized Kaczmarz and stochastic gradient descent – into my “to read” pile
Trust–Region Problems with Linear Inequality Constraints: Exact SDP Relaxation, Global Optimality and Robust Optimization
“Extended” trust region for linear inequalities constrains
Conic Geometric Programming
Unifing framwork for conic and geometric programming
Gauge optimization, duality, and applications
Another big paper about different, not Lagrange duality, introduced by Freund (1987)
Color Bregman TV
mu parameters in split bregman made adaptive, to exploit coherence of edges in different color channels
Iteration Complexity Analysis of Block Coordinate Descent Methods
Some convergence analysis for BCD and projected gradient BCD
Successive Nonnegative Projection Algorithm for Robust Nonnegative Blind Source Separation
Nonnegative matrix factorization
Scaling SVM and Least Absolute Deviations via Exact Data Reduction
SVN for large-scale problems
Image Restoration using Total Variation with Overlapping Group Sparsity
While title is promising I have doubt about that paper. The method authors suggest is equivalent to adding averaging filter to TV-L1 under L1 norm. There is no comparison to just applying TV-L1 and smoothing filter interchangeably.The method author suggest is very costly, and using median filter instead of averaging would cost the same while obviously more robust.
Deep and Wide Multiscale Recursive Networks for Robust Image Labeling
_Open source_ matlab/c package coming soon(not yet available)
Improvements to deep convolutional neural networks for LVCSR
convolutional networks, droput for speech recognition,
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
Already discussed on G+ – open source framework in “learn one use everywhere” stile. Learning done off-line on GPU using ConvNet, and recognition is online in pure python.
Statistical mechanics of complex neural systems and high dimensional data
Big textbook-like overview paper on statistical mechanics of learning. I’ve put it in my “to read” pile.
Randomized co-training: from cortical neurons to machine learning and back again
“Selectron” instead of perception – neurons are “specializing” with weights.
Provable Bounds for Learning Some Deep Representations
Citation:”The current paper presents both an interesting family of denoising autoencoders as
Online Unsupervised Feature Learning for Visual Tracking
Sparse representation, overcomplete dictionary
From Shading to Local Shape
Shape restoration from local shading – could be very useful in low-feature environment.
Fast 3D Salient Region Detection in Medical Images using GPUs
Finding interest point in 3D images
Object Recognition System Design in Computer Vision: a Universal Approach
Grid-based universal framework for object detection/classification
Lyapunov-based Low-thrust Optimal Orbit Transfer: An approach in Cartesian coordinates
For space sim enthusiast :)
Split Bregamn method is one of the most used method for solving convex optimization problem with non-smooth regularization. Mostly it used for regularization, and here I want to talk about TV-L1 and similar regularizers ( like )
For simplest possible one-dimentional denoising case
Split Bregman iteration looks like
There is a lot of literature about choosing parameter from (1) for denosing. Methods include Morozov discrepancy principle, L-curve and more – that’s just parameter of Tikhonov regularization.
There is a lot less metods for choosing for Split Bregman.
Theoretically shouldn’t be very important – if Split Bregman converge solution is not depending on . In practice of cause convergence speed and stability depend on a lot. Here I’ll point out some simple consideration which may help to choose
Assume for simplicity that u is twice differentiable almoste everythere, the (1) become
We know if Split Bbregman converge
that mean , here exteranl derivative is a weak derivative.
For the solution b is therefore locally constant, and we even know what constant it is.
For we have
For converged solution
everythere where is not zero, with possible exeption of measure zero set.
Now returning to choice of for Split Bregman iterations.
From (4) we see that for small values b , b grow with step until it reach it’s maximum absolute value
Also if the b is “wrong”, if we want for it to switch value in one iteration should be more than
That give us lower boundary for :
For most of x
What happens if on some interval b reach for two consequtive iterations?
Then on the next iteration form (5) and (3) on that interval
See taht with big enuogh sign of stabilizing with high probaility and we could be close to true solution. The “could” in the last sentence is there because we are inside the interval, and if the values of u on the ends of interval are wrong b “flipping” can propagate inside our interval.
It’s more difficult to estimate upper boundary for .
Obviously for more easy solution of (2) or (5) shouldn’t be too big, so that eigenvalues of operator woudn’t be too close to 1. Because solutions of (2) are inexact in Split Breagman we obviously want having bigger eigenvalues, so that single (or small number of) iteration could suppress error good enough for (2) subproblem.
So in conclusion (and my limited experience) if you can estimate for solution could be a good value to start testing.
Something I did for Samsung (kernel of tracker). Biggest improvement in SARI 1.5 is the sensors fusion, which allow for a lot more robust tracking.
Here is example of run-time localization and mapping with SARI 1.5:
This is the AR EdiBear game (free in Samsung apps store)
Interior point method for nonlinear optimization often considered as complex, or highly nontrivial etc. The fact is, that for “simple” nonlinear optimization it’s quite simple, manageable and can even be explained in #3tweets. For those not familiar with it there is a simple introduction to it in wikipedia, which in turn follows an excellent paper by Margaret H. Wright.
Now about “if all you have is a hammer, everything looks like a nail”. Some of applications of interior point method could be quite unexpected.
Everyone who worked with Levenberg-Marquardt minimization algorithm know how much pain is the choice of the small parameter . Levenberg-Marquardt can also be seen as modification of Gauss-newton with a trust region. The of Levenberg-Marquardt do correspond to the trust region radius, but that dependence is highly complex and is difficult to estimate. You want trust region of the radius r, but what should be avlue of ? There is no easy answer to that question; there are some complex methods, or there is a testing with subdivision, which is what the original Levenberg-Marquardt implement.
Interior point can help here.
If we choose shape of trust region for Gauss-Newton as hypercube or simplex or like, we can formulate it as set of L1 norm inequality constrains. And that is the domain of interior point method! For hypercube the resulting equations looks especially nice
W – hessian, I – identity, diag – diagonal
That is a banded arrowhead matrix, and for it Cholesky decomposition cost insignificantly more than decomposition of original W. The matrix is not positive definite – Cholesky without square root should be used.
Now there is a temptation to use single constrain instead of set of constrain , but that will not work. should have to be linearized to be tractable, but it’s a second order condition – it’s linear part is zero, so linearization doesn’t constrain anything.
The same method could be used whenever we have to put constrain on the value of Gauss-Newton update, and shape of constrain in not important (or polygonal)
Now last touch – Interior point method has small parameter of it’s own. It’s called usually . In the “normal” method there is a nice rule for it update – take it as (in the notation from wikipedia article – is a value of constraint, is a value of slack variable at the previous iteration) That rule usually explicitly stated in the articles about Interior Point Method(IPM) for Linear Programming, but omitted (as obvious probably) in the papers about IPM for nonlinear optimization
In our case (IPM for trust region) we don’t need update at all – we move boundary of the region with each iteration, so each is an initial value. Have to remember, is not a size of trust region, but strength of it’s enforcement.
I seldom post in the this blog now, mostly because I’m positing on twitter and G+ a lot lately. I still haven’t figured out which post should go where – blog, G+ or twitter, so it’s kind of chaotic for now.
What of interest is going on: There are two paper on the CVPR11 which claim that compressed sensing(sparse recovery) is not applicable to some of the most important computer vision tasks:
Is face recognition really a Compressive Sensing problem?
Are Sparse Representations Really Relevant for Image Classification?
Both paper claim that space of the natural images(or their important subsets) are not really sparse.
Those claims however dont’t square with claim of high effectiveness of compact signature of Random Ferns.
Could both of those be true? In my opinion – yes. Difference of two approaches is that first two paper assumed explicit sparsity – that is they enforced sparsity on the feature vector. Compressed signature approach used implicit sparsity – feature vector underling the signature is assumed sparse but is not explicitly reconstructed. Why compressed signature is working while explicit approach didn’t? That could be the case if image space is sparse in the different coordinate system – that is here one is dealing with the union of subspaces. Assumption not of the simple sparsity, but of the union of subspaces is called blind compressed sensing.
Now if we look at the space of the natural images it’s easy to see why it is low dimensional. Natural image is the image of some scene, an that scene has limited number of moving object. So dimension of space images of the scene is approximately the sum of degree of freedom of the objects(and camera) of the scene, plus effects of occlusions, illumination and noise. Now if the add strong enough random error to the scene, the image is stop to be the natural image(that is image of any scene). That mean manifold of the images of the scene is isolated – there is no natural images in it’s neighborhood. That hint that up to some error the space of the natural images is at least is the union of isolated low-dimensional manifolds. The union of mainfolds is obviously is more complex structure than the union of subspace, but methods of blind compressed sensing could be applicable to it too. Of cause to think about union of manifolds could be necessary only if the space of images is not union of subspace, which is obviously preferable case
Cauchy estimator have some nice properties (Gonzales et al “Statistically-Efficient Filtering in Impulsive Environments: Weighted Myriad Filter” 2002):
By tuning in
it can approximate either least squares (big ), or mode – maximum of histogram – of sample set (small ). For small estimator behave the same way as power law distribution estimator with small .
Another property is that for several measurements with different scales estimator of their sum will be simple
which is convenient for estimation of random walks
I heard convulsion in the sky,
And flight of angel hosts on high,
And monsters moving in the deep
Those verses from The Prophet by A.Pushkin could be seen as metaphor of profound mathematical insight, encompassing bifurcations, higher dimensional algebra and murky depths of statistics.
I now intend to dive deeper into of statistics – toward “data depth”. Data depth is a generalization of median concept to multidimensional data. Remind you that median can be seen either as order parameter – value dividing the higher half of measurements from lower, or geometrically, as the minimum of norm. Second approach lead to geometric median, about which I already talked about.
First approach to generalizations of median is to try to apply order statistics to multidimensional vectors.The idea is to make some kind of partial order for n-dimensional points – “depth” of points, and to choose as the analog of median the point of maximum depth.
Basically all data depth concepts define “depth” as some characterization of how deep points are reside inside the point cloud.
Historically first and easiest to understand was convex hull approach – make convex hull of data set, assign points in the hull depth 1, remove it, get convex hull of points remained inside, assign new hull depth 2, remove etc.; repeat until there is no point inside last convex hull.
Later Tukey introduce similar “halfspace depth” concept – for each point X find the minimum number of points which could be cut from the dataset by plane through the point X. That number count as depth(see the nice overview of those and other geometrical definition of depth at Greg Aloupis page)
In 2002 Mizera introduced “global depth”, which is less geometric and more statistical. It start with assumption of some loss function (“criterial function” in Mizera definition) of measurement set . This function could be(but not necessary) cumulative probability distribution. Now for two parameters and , is more fit with respect if for all . is weakly optimal with respect to if there is nor better fit parameter with respect to . At last global depth of is the minimum possible size of such that is not weakly optimal with respect to – reminder of measurements. In other words global depth is minimum number of measurements which should be removed for stop being weakly optimal. Global depth is not easy to calculate or visualize, so Mizera introduce more simple concept – tangent depth.
Tangent depth defined as . What does it mean? Tangent depth is minimum number of “bad” points – such points that for specific direction loss function for themis growing.
Those definitions of “data depth” allow for another type of estimator, based not on likelihood, but on order statistics –maximum depth estimators. The advantage of those estimators is robustness(breakdown point ~25%-33%) and disadvantage – low precision (high bias). So I wouldn’t use them for precise estimation, but for sanity check or initial approximation. In some cases they could be computationally more cheap than M-estimators. As useful side effect they also give some insight into structure of dataset(it seems originally maximum depth estimators was seen as data visualization tool). Depth could be good criterion for outliers rejection.
Disclaimer: while I had very positive experience with Cauchy estimator, data depth is a new thing for me.I have yet to see how useful it could be for computer vision related problems.
In this post I was complaining that I don’t know what breakdown point for redescending M-estimators is. Now I found out that upper bound for breakdown point of redescending of M-estimators was given by Mueller in 1995, for linear regression (that is statisticians word for simple estimation of p-dimensional hyperplane):
– number of measurements and is little tricky: it is a maximum number of measurement vectors X lying in the same p-dimensional hyperplane. If number of measurements N >> p that mean breakdown point is near 50% – You can have half measurement results completely out of the blue and estimator will still work.
That only work if the error present only in results of measurements, which is reasonable condition – in most cases we can move random error from x part to y part.
Now which M-estimators attain this upper bound?
The condition is “slow variation”(Mizera and Mueller 1999)
Mentioned in previous post Cauchy estimator is satisfy that condition:
and its derivative
In practice we always work with , not so Cauchy estimator is easy to calculate.
Rule of the thumb: if you don’t know which robust estimator to use, use Cauchy: It’s fast(which is important in real time apps), its easy to understand, it’s differentiable, and it is as robust as possible (that is for redescending M-estimator)