There are quite a few complex methods for making convolutional networks converge faster. Natural gradient, dual coordinate ascent, second order hessian free methods and more. However those methods are usually require considerable memory spending and extensive modifications of existing code if you already had some working method before.
Here instead I’ll list some simple and lightweight methods, which don’t require extensive changes of already working code. Those methods works (I’ve tested them on CIFAR10), and at very worst don’t make convergence worse. You shouldn’t expect radical improvement from those methods – they are for little speed up.
1. Start training on the part of the data and gradually increase data size to full. For example for CIFAR10 I start with 1/5 of all the data and increase it by 1/5 each and time giving it 3 more epochs to train for each chunk. This trick was inspired by “curriculum learning”. Resulting acceleration is small but still visible (on order of 10%).
2. Random sampling. Running all the data batches in the same order theoretically can produce some small overfitting – coadaptaion of the data. Usually this effect is not visible, but just to be sure you can reshuffle data randomly each epoch, especially if it cost very little. I’m using simple reshuffling method based on the prime numbers table.
For each minibatch k I take samples indexed by i: (i*prime)%data_size where i is from k*minibatch_size to (k+1)*minibatch_size, and prime taken form prime numbers table change each epoch. If prime is more than all prime factors of data_size all samples are indexed once.
3. Gradient step size. The baseline method is to use simple momentum. Most popular method of gradient acceleration on top of momentum are RMSProp and Nesterov accelerated gadient (NAG).
NAG simply change order in which momentum and gradient are applied to weights. RMSProp is normalization of the gradient, so that it should have approximately same norm. One of the variants described here, which is similar to low pass filter.Another possible implementation – gradient is divided by running root mean square of it’s previous values. However in my tests on CIFAR10 neither NAG nor RMSprop or NAG+RMSProp show any visible improvement.
Nevertheless in my tests a simple modification of gradient step show better result then standard momentum. That is just a common optimization trick – discard or reduce those gradient step which increase error function. Straightforward implementation of it could be costly for convnet – to estimate cost function after gradient step require additional forward propagation. There is a workaround – estimate error on the next sample of the dataset and if error increased subtract part of the previous gradient. This is not precisely the same as reducing bad gradient, because we subtract offending gradient after we made next step, but because the step is small it still works. Essentially it could be seen as error-dependent gradient step
4. Dropout. Use of dropout require some care, especially if we want to insert dropout into networks built and tuned without dropout. Common recommendation is to increase number of filters proportionally to dropout value.
But there are some hidden traps there: while dropout should improve test error(results on the samples not used in training) it make training error noticeably worse. In practice it may make worse even test error. Dropout also make convergence considerably more slow. There is non-obvious trick which help: continue iterations without changing learning rate for some times even after training error is not decreasing any more. Intuition behind this trick – dropout reduce test error, not training error. Test error decrease often very slow and noisy comparing to training error, so, just to be sure it may help to increase number of epoch even after both stop decreasing, without decreasing learning rate. Reducing learning rate after both training and test error went plateau for some epochs may produce better results.
To be continued (may be)
There is some deep learning (specifically #convnet) criticism based on the artificially constructed misclassification examples.
There is a new paper
“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images” by Nguen et al
and other direction of critics is based on the older, widely cited paper
“Intriguing properties of neural networks” by Szegedy et al
In the first paper authors construct examples which classified by convnet with confidence, but look nothing like label to human eye.
In the second paper authors show that correctly classified image could be converted to misclassified with small perturbation, which perturbation could be found by iterative procedure
What I think is that those phenomenons have no impact on practical performance of convolutional neural networks.
First paper is really simple to address. The real world images produced by camera are not dense in the image space(space of all pixel vectors of image size dimension).
In fact camera images belong to low-dimensional manifold in the image space, and there are some interesting works on dimensionality and property of that manifold. For example dimensionality of the space images of the fixed 3D scene it is around 7, which is not surprising, and the geodesics of that manifold could be defined through the optical flow.
Of cause if sample is outside of image manifold it will be misclassified, method of training notwithstanding. The images in the paper are clearly not real-world camera images, no wonder convnet assign nonsensical labels to them.
Second paper is more interesting. First I want to point that perturbation which cause misclassification is produced by iterative procedure. That hint that in the neighbourhood of the image perturbed misclassified images are belong to measure near-zero set.
Practically that mean that probability of this type of misclassification is near zero, and orders of magnitude less than “normal” misclassification rate of most deep networks.
But what is causing that misclassification? I’d suggest that just high dimensionality of the image and parameters spaces and try to illustrate it. In fact it’s the same reason why epsilon-sparse vector are ubiquitous in real-world application: If we take n-dimensional vector, probability that all it’s components more than is , which is near zero. This and like effects explored in depth in compressed sensing ( also very good Igor Carron’s page)
Now to convnet – convnet classify images by signs of piecewise-linear functions.
Take any effective pixel which is affecting activations. Convolutional network separate image space into piecewise-linear areas, which are not aligned with coordinate axes. That mean if we change intensity of pixel far enough we are out of correct classification area.
We don’t know how incorrect areas are distributed in the image space, but for common convolutional network dimensionality of subspace of the hyperplanes which make piecewise-linear separation boundary is several times more than dimensionality of the image vector. This suggest that correlation between incorrect areas of different pixels is quite weak.
Now assume that image is stable to perturbation, that mean that exist \epsilon such that for any effective pixel it’s epsilon-neighbourhood is in the correct area. If incorrect areas are weakly correlated that mean probability of image being stable is about , where n is number of effective pixels. That is probability of stable image is near zero. That illustrate suggestion that this “adversarial” effect is only caused by dimensionality of the problem and parameter space, not by some intrinsic deficiency of the deep network.
I have seen an excellent wlakthrough on building Alex Krizhevsky’s cuda-convnet for windows, but difference in configuration and installed packages could be tiresome. So here is complete build of convnet for windows 64:
It require having CUDA compute capability 2.0 or better GPU of cause, Windows 64bit, Visual Studio 64bits and Python 64bit with NumPy installed. The rest of libs and dlls are precomplied. In building it I’ve followed instructions by Yalong Bai (Wyvernbai) from http://www.asiteof.me/archives/50.
Read Readme.md before installation – you may (or may not) require PYTHONPATH environmental variable set.
On side note I’ve used WinPython for both libraries and running the package. WinPython seems a very nice package, which include Spyder IDE. I have some problems with Spyder though – sometimes it behave erratically during debugging/running the code. Could be my inexperience with Spyder though. Another nice package – PythonXY – regretfully can not be used – it has only 32 bit version and will not compile with/run 64 bit modules.