There are quite a few complex methods for making convolutional networks converge faster. Natural gradient, dual coordinate ascent, second order hessian free methods and more. However those methods are usually require considerable memory spending and extensive modifications of existing code if you already had some working method before.
Here instead I’ll list some simple and lightweight methods, which don’t require extensive changes of already working code. Those methods works (I’ve tested them on CIFAR10), and at very worst don’t make convergence worse. You shouldn’t expect radical improvement from those methods – they are for little speed up.
1. Start training on the part of the data and gradually increase data size to full. For example for CIFAR10 I start with 1/5 of all the data and increase it by 1/5 each and time giving it 3 more epochs to train for each chunk. This trick was inspired by “curriculum learning”. Resulting acceleration is small but still visible (on order of 10%).
2. Random sampling. Running all the data batches in the same order theoretically can produce some small overfitting – coadaptaion of the data. Usually this effect is not visible, but just to be sure you can reshuffle data randomly each epoch, especially if it cost very little. I’m using simple reshuffling method based on the prime numbers table.
For each minibatch k I take samples indexed by i: (i*prime)%data_size where i is from k*minibatch_size to (k+1)*minibatch_size, and prime taken form prime numbers table change each epoch. If prime is more than all prime factors of data_size all samples are indexed once.
3. Gradient step size. The baseline method is to use simple momentum. Most popular method of gradient acceleration on top of momentum are RMSProp and Nesterov accelerated gadient (NAG).
NAG simply change order in which momentum and gradient are applied to weights. RMSProp is normalization of the gradient, so that it should have approximately same norm. One of the variants described here, which is similar to low pass filter.Another possible implementation – gradient is divided by running root mean square of it’s previous values. However in my tests on CIFAR10 neither NAG nor RMSprop or NAG+RMSProp show any visible improvement.
Nevertheless in my tests a simple modification of gradient step show better result then standard momentum. That is just a common optimization trick – discard or reduce those gradient step which increase error function. Straightforward implementation of it could be costly for convnet – to estimate cost function after gradient step require additional forward propagation. There is a workaround – estimate error on the next sample of the dataset and if error increased subtract part of the previous gradient. This is not precisely the same as reducing bad gradient, because we subtract offending gradient after we made next step, but because the step is small it still works. Essentially it could be seen as error-dependent gradient step
4. Dropout. Use of dropout require some care, especially if we want to insert dropout into networks built and tuned without dropout. Common recommendation is to increase number of filters proportionally to dropout value.
But there are some hidden traps there: while dropout should improve test error(results on the samples not used in training) it make training error noticeably worse. In practice it may make worse even test error. Dropout also make convergence considerably more slow. There is non-obvious trick which help: continue iterations without changing learning rate for some times even after training error is not decreasing any more. Intuition behind this trick – dropout reduce test error, not training error. Test error decrease often very slow and noisy comparing to training error, so, just to be sure it may help to increase number of epoch even after both stop decreasing, without decreasing learning rate. Reducing learning rate after both training and test error went plateau for some epochs may produce better results.
To be continued (may be)
There is some deep learning (specifically #convnet) criticism based on the artificially constructed misclassification examples.
There is a new paper
“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images” by Nguen et al
and other direction of critics is based on the older, widely cited paper
“Intriguing properties of neural networks” by Szegedy et al
In the first paper authors construct examples which classified by convnet with confidence, but look nothing like label to human eye.
In the second paper authors show that correctly classified image could be converted to misclassified with small perturbation, which perturbation could be found by iterative procedure
What I think is that those phenomenons have no impact on practical performance of convolutional neural networks.
First paper is really simple to address. The real world images produced by camera are not dense in the image space(space of all pixel vectors of image size dimension).
In fact camera images belong to low-dimensional manifold in the image space, and there are some interesting works on dimensionality and property of that manifold. For example dimensionality of the space images of the fixed 3D scene it is around 7, which is not surprising, and the geodesics of that manifold could be defined through the optical flow.
Of cause if sample is outside of image manifold it will be misclassified, method of training notwithstanding. The images in the paper are clearly not real-world camera images, no wonder convnet assign nonsensical labels to them.
Second paper is more interesting. First I want to point that perturbation which cause misclassification is produced by iterative procedure. That hint that in the neighbourhood of the image perturbed misclassified images are belong to measure near-zero set.
Practically that mean that probability of this type of misclassification is near zero, and orders of magnitude less than “normal” misclassification rate of most deep networks.
But what is causing that misclassification? I’d suggest that just high dimensionality of the image and parameters spaces and try to illustrate it. In fact it’s the same reason why epsilon-sparse vector are ubiquitous in real-world application: If we take n-dimensional vector, probability that all it’s components more than is , which is near zero. This and like effects explored in depth in compressed sensing ( also very good Igor Carron’s page)
Now to convnet – convnet classify images by signs of piecewise-linear functions.
Take any effective pixel which is affecting activations. Convolutional network separate image space into piecewise-linear areas, which are not aligned with coordinate axes. That mean if we change intensity of pixel far enough we are out of correct classification area.
We don’t know how incorrect areas are distributed in the image space, but for common convolutional network dimensionality of subspace of the hyperplanes which make piecewise-linear separation boundary is several times more than dimensionality of the image vector. This suggest that correlation between incorrect areas of different pixels is quite weak.
Now assume that image is stable to perturbation, that mean that exist \epsilon such that for any effective pixel it’s epsilon-neighbourhood is in the correct area. If incorrect areas are weakly correlated that mean probability of image being stable is about , where n is number of effective pixels. That is probability of stable image is near zero. That illustrate suggestion that this “adversarial” effect is only caused by dimensionality of the problem and parameter space, not by some intrinsic deficiency of the deep network.
Repost from googleplus stream
Finds in arxiv
Convex Relaxations of SE(2) and SE(3) for Visual Pose Estimation
Finding camera position from the series of 2D (or depth) images is one of the most common (and difficult) task of computer vision.
The biggest problem here is incorporating rotation into cost(energy) function, that should minimize reprojection(fro 2D) error (seehttp://en.wikipedia.org/wiki/Bundle_adjustment). Usually it’s done by minimization on manifold – locally presenting rotation parameter space as linear subspace. The problem with that approach is that initial approximation should be good enough and minimization goes by small steps pojecting/reprojecting on manifold (SO(3) in the case) which can stuck in local minimum. Where to start if the is no initial approximation? Usually it’s just brute-forced with several initialization attempt. Here in the paper approach is different – no initial approximation needed, solution is straightforward and minimum is global. The idea is instead of minimizing rotation on the sphere minimize it inside the ball. In fact it’s a classical convexification approach – increase dimentionality of the problem to make it convex. The pay is a lot higher computational cost of cause
A Sparse Outliers Iterative Removal Algorithm to Model the Background in the Video Sequences
Background removal without nuclear norm. Some “good” frames chosen as dictionary and backround represented in it.
Learning Mid-Level Features and Modeling Neuron Selectivity for Image Classification
Mid level features is a relatively new concept but a very old practice. It’s a building “efficient” subset of features from the set of features obtained by low-level feature descriptor(like SIFT, SURF, FREAK etc) Before it was done by PCA, later by sparse coding.
What authors built is some kind of hibrid of convolutional network with dictionary learning. Mid level features fed into neuron layer for sparsification and result into classifier. There are some benchmarks but no CIFAR or MINST
Optimization and CS
Alternating projections and coupling slope
Finding intersection of two non-convex sets with alternating projection. I didn’t new that transversality condition used a lot in convex geometry too.
Bayesian Pursuit Algorithms
Bayesian version of hard thresholding. Bayesian approach is in how threshold is chosen and update scaled.
Repost from my googleplus stream
Non-Local means is a local image denoising algorithm
Paper shows that non-local mean weights are not identify patches globally point in the images, but are susceptible to aperture problem:
http://en.wikipedia.org/wiki/Optical_flow#Estimation That’s why short radius NLM could be better then large radius NLM. Small radius cutoff play the role of regularizer, similar to the Total Variation in Horn-Shunk Optical flow.
http://en.wikipedia.org/wiki/Horn%E2%80%93Schunck_method (my comment – TV-L1 is generally better than TV-L2 in Horn-Schunk)
Do Deep Nets Really Need to be Deep?
Authors state that shallow neural nets can in fact achieve similar performance to deep convolutional nets. The problem though is, that they had to be initialized or preconditioned – they can not be trained using existing algorithms.
And for that initialization they need deep nets. Authors hypothesize that there should be algorithms that allow training of those shallow nets to reach the same performance as deep nets.
Intriguing properties of neural networks
The linear combination of deep-level nodes produce the same results as the the original nodes. That suggest that nodes the spaces itself rather it’s representation keep information for deep levels.
The input-output mapping also discontinuous – small perturbations cause misclassification. Those perturbation are not dependent on the training, only on input of classification. (My comment – sparse coding is generally not smooth on input, another argument that sparse coding is part of internal mechanics of deep learning)
From Maxout to Channel-Out: Encoding Information on Sparse Pathways
This paper start with observation that max-out is a form of sparse coding: only one of the input pathway is chosen for father processing. From this inferred development of that principle:
remove “middle” layer which “choose” maximum input, and transfer maximal input at once into next level – make choice function index-aware. Some other choice function beside the max is considered, but max still seems the best
Piecewise-constant choice function make interesting reference to previous paper (discontinuity of input-output mapping)
Unsupervised Feature Learning by Deep Sparse Coding
This, for a difference is not about convolutional network.
Instead SIFT(or similar) descriptors are used to produce bag-of-words, sparse coding is used with max-out, and manifold learning applied to it. (http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction)
Generative NeuroEvolution for Deep Learning
I’m generally wary of evolutionary methods, but this looks kind of interesting – it’s based on compositional pattern producing network (CPPN)- encoding geometric pattern as composition of simple functions.
This CPPN is used to encode connectivity pattern of ANN (Convolutional newtwork most used). Thus complete process is the combination of ANN training and evolutionary CPPN training
Some Improvements on Deep Convolutional Neural Network Based Image Classification, Unsupervised feature learning by augmenting single images
Botht papers seems about the same subject – squeeze more out of labeled images by applying a lot of transformation to them(Some of those transformations are implemented in cuda-convnet BTW)
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Analytical exploration of toy 3-layer model *without_ actual non-linear neurons. Model completely linear to input (polynomial to weights). Nevertheless it show some interesting properties, like step in learning curve
Distributed Interior-point Method for Loosely Coupled Problems
Mixing together all my favorite methods: Interior point, Newton, ADMM(Split-Bregman) into one algorithm and make a distribute implementation of it.
Mixing Newton and ADMM, ADMM and Interior point looks risky to me, though with a lot of subiterations it may work(that’s why it’s distributed – require a lot of calculating power)
Also I’m not sure about convergene of the combined algorithm – each step’s convergence is proven, but I’m not sure the same could be applyed to the combination.
Newton and ADMM have kind of contradicting optimal conditions – smoothness vs piecewise linearity. Would like to see more research on this…
Total variation regularization for manifold-valued data
Proximal mapping and soft thresholding for manifolds – analog of ADMM for manifolds.
just interesting stuff
Coping with Physical Attacks on Random Network Structures
Include finding vulnerable spots and results of random attacks
(My comment – shouldn’t it be connected to precolation theory?)
Finds in #arxiv :
*Optimization, numerical & convex, ML*
The Linearized Bregman Method via Split Feasibility Problems: Analysis and Generalizations
Reformulation of Split Bregman/ ADMM as split feasibility problem and algorithm/convergence for generalized split feasibility by Bregman projection. This general formulation include both Split Bregman and Kaczmarz (my comment – randomized Kaczmarz seems could be here too)
Stochastic gradient descent and the randomized Kaczmarz algorithm
Hybrid of randomized Kaczmarz and stochastic gradient descent – into my “to read” pile
Trust–Region Problems with Linear Inequality Constraints: Exact SDP Relaxation, Global Optimality and Robust Optimization
“Extended” trust region for linear inequalities constrains
Conic Geometric Programming
Unifing framwork for conic and geometric programming
Gauge optimization, duality, and applications
Another big paper about different, not Lagrange duality, introduced by Freund (1987)
Color Bregman TV
mu parameters in split bregman made adaptive, to exploit coherence of edges in different color channels
Iteration Complexity Analysis of Block Coordinate Descent Methods
Some convergence analysis for BCD and projected gradient BCD
Successive Nonnegative Projection Algorithm for Robust Nonnegative Blind Source Separation
Nonnegative matrix factorization
Scaling SVM and Least Absolute Deviations via Exact Data Reduction
SVN for large-scale problems
Image Restoration using Total Variation with Overlapping Group Sparsity
While title is promising I have doubt about that paper. The method authors suggest is equivalent to adding averaging filter to TV-L1 under L1 norm. There is no comparison to just applying TV-L1 and smoothing filter interchangeably.The method author suggest is very costly, and using median filter instead of averaging would cost the same while obviously more robust.
Deep and Wide Multiscale Recursive Networks for Robust Image Labeling
_Open source_ matlab/c package coming soon(not yet available)
Improvements to deep convolutional neural networks for LVCSR
convolutional networks, droput for speech recognition,
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
Already discussed on G+ – open source framework in “learn one use everywhere” stile. Learning done off-line on GPU using ConvNet, and recognition is online in pure python.
Statistical mechanics of complex neural systems and high dimensional data
Big textbook-like overview paper on statistical mechanics of learning. I’ve put it in my “to read” pile.
Randomized co-training: from cortical neurons to machine learning and back again
“Selectron” instead of perception – neurons are “specializing” with weights.
Provable Bounds for Learning Some Deep Representations
Citation:”The current paper presents both an interesting family of denoising autoencoders as
Online Unsupervised Feature Learning for Visual Tracking
Sparse representation, overcomplete dictionary
From Shading to Local Shape
Shape restoration from local shading – could be very useful in low-feature environment.
Fast 3D Salient Region Detection in Medical Images using GPUs
Finding interest point in 3D images
Object Recognition System Design in Computer Vision: a Universal Approach
Grid-based universal framework for object detection/classification
Lyapunov-based Low-thrust Optimal Orbit Transfer: An approach in Cartesian coordinates
For space sim enthusiast :)
I have seen an excellent wlakthrough on building Alex Krizhevsky’s cuda-convnet for windows, but difference in configuration and installed packages could be tiresome. So here is complete build of convnet for windows 64:
It require having CUDA compute capability 2.0 or better GPU of cause, Windows 64bit, Visual Studio 64bits and Python 64bit with NumPy installed. The rest of libs and dlls are precomplied. In building it I’ve followed instructions by Yalong Bai (Wyvernbai) from http://www.asiteof.me/archives/50.
Read Readme.md before installation – you may (or may not) require PYTHONPATH environmental variable set.
On side note I’ve used WinPython for both libraries and running the package. WinPython seems a very nice package, which include Spyder IDE. I have some problems with Spyder though – sometimes it behave erratically during debugging/running the code. Could be my inexperience with Spyder though. Another nice package – PythonXY – regretfully can not be used – it has only 32 bit version and will not compile with/run 64 bit modules.
Something I did for Samsung (kernel of tracker). Biggest improvement in SARI 1.5 is the sensors fusion, which allow for a lot more robust tracking.
Here is example of run-time localization and mapping with SARI 1.5:
This is the AR EdiBear game (free in Samsung apps store)
I seldom post in the this blog now, mostly because I’m positing on twitter and G+ a lot lately. I still haven’t figured out which post should go where – blog, G+ or twitter, so it’s kind of chaotic for now.
What of interest is going on: There are two paper on the CVPR11 which claim that compressed sensing(sparse recovery) is not applicable to some of the most important computer vision tasks:
Is face recognition really a Compressive Sensing problem?
Are Sparse Representations Really Relevant for Image Classification?
Both paper claim that space of the natural images(or their important subsets) are not really sparse.
Those claims however dont’t square with claim of high effectiveness of compact signature of Random Ferns.
Could both of those be true? In my opinion – yes. Difference of two approaches is that first two paper assumed explicit sparsity – that is they enforced sparsity on the feature vector. Compressed signature approach used implicit sparsity – feature vector underling the signature is assumed sparse but is not explicitly reconstructed. Why compressed signature is working while explicit approach didn’t? That could be the case if image space is sparse in the different coordinate system – that is here one is dealing with the union of subspaces. Assumption not of the simple sparsity, but of the union of subspaces is called blind compressed sensing.
Now if we look at the space of the natural images it’s easy to see why it is low dimensional. Natural image is the image of some scene, an that scene has limited number of moving object. So dimension of space images of the scene is approximately the sum of degree of freedom of the objects(and camera) of the scene, plus effects of occlusions, illumination and noise. Now if the add strong enough random error to the scene, the image is stop to be the natural image(that is image of any scene). That mean manifold of the images of the scene is isolated – there is no natural images in it’s neighborhood. That hint that up to some error the space of the natural images is at least is the union of isolated low-dimensional manifolds. The union of mainfolds is obviously is more complex structure than the union of subspace, but methods of blind compressed sensing could be applicable to it too. Of cause to think about union of manifolds could be necessary only if the space of images is not union of subspace, which is obviously preferable case
Thanks to Igor Carron I’ve watched a great videolecture by Stephane Mallat High dimensional classification by recursive interferometry. Actually I watched it twice, and I think I understand most of it now))). And it was not about compressed sensing, not even about manifold learning much . It was mostly about a new application of wavelets . How to use wavelets to produce low dimensional data (image descriptor if we are talking about computer vision) from high dimensional data(that is image). The idea is to transcend linear representation and use nonlinear operation – absolute value of wavelet. Absolute value – square root of wavelet square carry information of frequencies differences. It’s invert Fourier transform have new harmonics – differences of frequencies of original function. That interference of harmonics of original image. Now it was reminding me something. Yep – phase congruency (pdf). Phase congruency also use absolute value of wavelet(windowed Fourier). It seems to me it has perfect explanation. Interference pattern defined by how in-phase both wave are. That is it’s like a phase congruency taken into each point. Phase congruency edge-detector is in fact finding maximum of somehow normalized interference pattern. In that sense this Mallat’s method producing invariants from high-dimensional data is analogous to producing sketch from photo.
Ok, enough rambling for now.