There are quite a few complex methods for making convolutional networks converge faster. Natural gradient, dual coordinate ascent, second order hessian free methods and more. However those methods are usually require considerable memory spending and extensive modifications of existing code if you already had some working method before.
Here instead I’ll list some simple and lightweight methods, which don’t require extensive changes of already working code. Those methods works (I’ve tested them on CIFAR10), and at very worst don’t make convergence worse. You shouldn’t expect radical improvement from those methods – they are for little speed up.
1. Start training on the part of the data and gradually increase data size to full. For example for CIFAR10 I start with 1/5 of all the data and increase it by 1/5 each and time giving it 3 more epochs to train for each chunk. This trick was inspired by “curriculum learning”. Resulting acceleration is small but still visible (on order of 10%).
2. Random sampling. Running all the data batches in the same order theoretically can produce some small overfitting – coadaptaion of the data. Usually this effect is not visible, but just to be sure you can reshuffle data randomly each epoch, especially if it cost very little. I’m using simple reshuffling method based on the prime numbers table.
For each minibatch k I take samples indexed by i: (i*prime)%data_size where i is from k*minibatch_size to (k+1)*minibatch_size, and prime taken form prime numbers table change each epoch. If prime is more than all prime factors of data_size all samples are indexed once.
3. Gradient step size. The baseline method is to use simple momentum. Most popular method of gradient acceleration on top of momentum are RMSProp and Nesterov accelerated gadient (NAG).
NAG simply change order in which momentum and gradient are applied to weights. RMSProp is normalization of the gradient, so that it should have approximately same norm. One of the variants described here, which is similar to low pass filter.Another possible implementation – gradient is divided by running root mean square of it’s previous values. However in my tests on CIFAR10 neither NAG nor RMSprop or NAG+RMSProp show any visible improvement.
Nevertheless in my tests a simple modification of gradient step show better result then standard momentum. That is just a common optimization trick – discard or reduce those gradient step which increase error function. Straightforward implementation of it could be costly for convnet – to estimate cost function after gradient step require additional forward propagation. There is a workaround – estimate error on the next sample of the dataset and if error increased subtract part of the previous gradient. This is not precisely the same as reducing bad gradient, because we subtract offending gradient after we made next step, but because the step is small it still works. Essentially it could be seen as error-dependent gradient step
4. Dropout. Use of dropout require some care, especially if we want to insert dropout into networks built and tuned without dropout. Common recommendation is to increase number of filters proportionally to dropout value.
But there are some hidden traps there: while dropout should improve test error(results on the samples not used in training) it make training error noticeably worse. In practice it may make worse even test error. Dropout also make convergence considerably more slow. There is non-obvious trick which help: continue iterations without changing learning rate for some times even after training error is not decreasing any more. Intuition behind this trick – dropout reduce test error, not training error. Test error decrease often very slow and noisy comparing to training error, so, just to be sure it may help to increase number of epoch even after both stop decreasing, without decreasing learning rate. Reducing learning rate after both training and test error went plateau for some epochs may produce better results.
To be continued (may be)
There is some deep learning (specifically #convnet) criticism based on the artificially constructed misclassification examples.
There is a new paper
“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images” by Nguen et al
and other direction of critics is based on the older, widely cited paper
“Intriguing properties of neural networks” by Szegedy et al
In the first paper authors construct examples which classified by convnet with confidence, but look nothing like label to human eye.
In the second paper authors show that correctly classified image could be converted to misclassified with small perturbation, which perturbation could be found by iterative procedure
What I think is that those phenomenons have no impact on practical performance of convolutional neural networks.
First paper is really simple to address. The real world images produced by camera are not dense in the image space(space of all pixel vectors of image size dimension).
In fact camera images belong to low-dimensional manifold in the image space, and there are some interesting works on dimensionality and property of that manifold. For example dimensionality of the space images of the fixed 3D scene it is around 7, which is not surprising, and the geodesics of that manifold could be defined through the optical flow.
Of cause if sample is outside of image manifold it will be misclassified, method of training notwithstanding. The images in the paper are clearly not real-world camera images, no wonder convnet assign nonsensical labels to them.
Second paper is more interesting. First I want to point that perturbation which cause misclassification is produced by iterative procedure. That hint that in the neighbourhood of the image perturbed misclassified images are belong to measure near-zero set.
Practically that mean that probability of this type of misclassification is near zero, and orders of magnitude less than “normal” misclassification rate of most deep networks.
But what is causing that misclassification? I’d suggest that just high dimensionality of the image and parameters spaces and try to illustrate it. In fact it’s the same reason why epsilon-sparse vector are ubiquitous in real-world application: If we take n-dimensional vector, probability that all it’s components more than is , which is near zero. This and like effects explored in depth in compressed sensing ( also very good Igor Carron’s page)
Now to convnet – convnet classify images by signs of piecewise-linear functions.
Take any effective pixel which is affecting activations. Convolutional network separate image space into piecewise-linear areas, which are not aligned with coordinate axes. That mean if we change intensity of pixel far enough we are out of correct classification area.
We don’t know how incorrect areas are distributed in the image space, but for common convolutional network dimensionality of subspace of the hyperplanes which make piecewise-linear separation boundary is several times more than dimensionality of the image vector. This suggest that correlation between incorrect areas of different pixels is quite weak.
Now assume that image is stable to perturbation, that mean that exist \epsilon such that for any effective pixel it’s epsilon-neighbourhood is in the correct area. If incorrect areas are weakly correlated that mean probability of image being stable is about , where n is number of effective pixels. That is probability of stable image is near zero. That illustrate suggestion that this “adversarial” effect is only caused by dimensionality of the problem and parameter space, not by some intrinsic deficiency of the deep network.
Geoffrey Hinton, neural networks, ML, deep learning pioneer answered “Ask Me Anything” on reddit:
His answers only:
His most controversial answer is:
The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.
If the pools do not overlap, pooling loses valuable information about where things are. We need this information to detect precise relationships between the parts of an object. Its true that if the pools overlap enough, the positions of features will be accurately preserved by “coarse coding” (see my paper on “distributed representations” in 1986 for an explanation of this effect). But I no longer believe that coarse coding is the best way to represent the poses of objects relative to the viewer (by pose I mean position, orientation, and scale).
I think it makes much more sense to represent a pose as a small matrix that converts a vector of positional coordinates relative to the viewer into positional coordinates relative to the shape itself. This is what they do in computer graphics and it makes it easy to capture the effect of a change in viewpoint. It also explains why you cannot see a shape without imposing a rectangular coordinate frame on it, and if you impose a different frame, you cannot even recognize it as the same shape. Convolutional neural nets have no explanation for that, or at least none that I can think of.
It’s interesting though controversial opinion, and here is my take on it:
It looks like Convolutional networks discriminate input by switching between activation subnetworks (http://arxiv.org/abs/1410.1165, Understanding Locally Competitive Networks)
From that point of view max pooling operation is pretty natural – it provide both robustness to small deviations and switching between activation subsets. Aslo max pooling has a nice property to be semilinear to multiplication f(ax)=af(x) , which make all the layers stack but last cost layer semilinear, and if one use SVM/L2SVM as cost whole stack semilinear. Not sure if it help in discrimination, but it surely help in debugging.
As alternative to pooling pyramid Geoffrey Hinton suggest to get somehow global rotation(or other transformation) matrix from the image.
Getting it from image or from image patch is actually possible with low-rank matrix factorization from single image(Transform Invariant Low-rank Textures by Zhang et al) and by optical flow (or direct matching) from set of images, but it’s quite expensive in performance term. It’s definitely possible to combine those method with deep network, (ironically they are benefit using pooling pyramid)
1. They will be kind of image augmentation
2. Those are problem-specific methods, they may not help in other domains like NLP, speech etc.
There are some work on using low-rank matrix factorization in deep networks, but they are mostly about fully connected, not convolutional layers. But IMHO if to try do factorization in conv layers, the pooling wouldn’t go away.
PS (edit): Here is a video with Hinton alternative to pooling – “capsules”