## Some simple ways to speed up convnet a little

There are quite a few complex methods for making convolutional networks converge faster. Natural gradient, dual coordinate ascent, second order hessian free methods and more. However those methods are usually require considerable memory spending and extensive modifications of existing code if you already had some working method before.

Here instead I’ll list some *simple* and lightweight methods, which don’t require extensive changes of already working code. Those methods works (I’ve tested them on CIFAR10), and at very worst don’t make convergence worse. You shouldn’t expect radical improvement from those methods – they are for *little *speed up.

1. Start training on the part of the data and gradually increase data size to full. For example for CIFAR10 I start with 1/5 of all the data and increase it by 1/5 each and time giving it 3 more epochs to train for each chunk. This trick was inspired by “curriculum learning”. Resulting acceleration is small but still visible (on order of 10%).

2. Random sampling. Running all the data batches in the same order theoretically can produce some small overfitting – coadaptaion of the data. Usually this effect is not visible, but just to be sure you can reshuffle data randomly each epoch, especially if it cost very little. I’m using simple reshuffling method based on the prime numbers table.

For each minibatch k I take samples indexed by i: (i*prime)%data_size where i is from k*minibatch_size to (k+1)*minibatch_size, and prime taken form prime numbers table change each epoch. If prime is more than all prime factors of data_size all samples are indexed once.

3. Gradient step size. The baseline method is to use simple momentum. Most popular method of gradient acceleration on top of momentum are RMSProp and Nesterov accelerated gadient (NAG).

NAG simply change order in which momentum and gradient are applied to weights. RMSProp is normalization of the gradient, so that it should have approximately same norm. One of the variants described here, which is similar to low pass filter.Another possible implementation – gradient is divided by running root mean square of it’s previous values. However in my tests on CIFAR10 neither NAG nor RMSprop or NAG+RMSProp show any visible improvement.

Nevertheless in my tests a simple modification of gradient step show better result then standard momentum. That is just a common optimization trick – discard or reduce those gradient step which increase error function. Straightforward implementation of it could be costly for convnet – to estimate cost function after gradient step require additional forward propagation. There is a workaround – estimate error on the next sample of the dataset and if error increased subtract part of the previous gradient. This is not precisely the same as reducing bad gradient, because we subtract offending gradient *after* we made next step, but because the step is small it still works. Essentially it could be seen as error-dependent gradient step

4. Dropout. Use of dropout require some care, especially if we want to insert dropout into networks built and tuned without dropout. Common recommendation is to increase number of filters proportionally to dropout value.

But there are some hidden traps there: while dropout should improve *test *error(results on the samples not used in training) it make *training *error noticeably worse. In practice it may make worse even *test *error. Dropout also make convergence considerably more slow. There is non-obvious trick which help: continue iterations without changing learning rate for some times even after *training *error is not decreasing any more. Intuition behind this trick – dropout reduce *test *error, not *training *error. *Test *error decrease often very slow and noisy comparing to *training *error, so, just to be sure it may help to increase number of epoch even after both stop decreasing, without decreasing learning rate. Reducing learning rate after both training and test error went plateau for some epochs may produce better results.

To be continued (may be)

## Some thoughts about deep learning criticism

There is some deep learning (specifically #convnet) criticism based on the artificially constructed misclassification examples.

There is a new paper

“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images” by Nguen et al

http://arxiv.org/abs/1412.1897

and other direction of critics is based on the older, widely cited paper

“Intriguing properties of neural networks” by Szegedy et al

http://arxiv.org/abs/1312.6199

In the first paper authors construct examples which classified by convnet with confidence, but look nothing like label to human eye.

In the second paper authors show that correctly classified image could be converted to misclassified with small perturbation, which perturbation could be found by iterative procedure

What I think is that those phenomenons have no impact on practical performance of convolutional neural networks.

First paper is really simple to address. The real world images produced by camera are not dense in the image space(space of all pixel vectors of image size dimension).

In fact camera images belong to low-dimensional manifold in the image space, and there are some interesting works on dimensionality and property of that manifold. For example dimensionality of the space images of the fixed 3D scene it is around 7, which is not surprising, and the geodesics of that manifold could be defined through the optical flow.

Of cause if sample is outside of image manifold it will be misclassified, method of training notwithstanding. The images in the paper are clearly not real-world camera images, no wonder convnet assign nonsensical labels to them.

Second paper is more interesting. First I want to point that perturbation which cause misclassification is produced by iterative procedure. That hint that in the neighbourhood of the image perturbed misclassified images are belong to measure near-zero set.

Practically that mean that probability of this type of misclassification is near zero, and orders of magnitude less than “normal” misclassification rate of most deep networks.

But what is causing that misclassification? I’d suggest that just high dimensionality of the image and parameters spaces and try to illustrate it. In fact it’s the same reason why epsilon-sparse vector are ubiquitous in real-world application: If we take *n*-dimensional vector, probability that all it’s components more than is , which is near zero. This and like effects explored in depth in compressed sensing ( also very good Igor Carron’s page)

Now to convnet – convnet classify images by signs of piecewise-linear functions.

Take any effective pixel which is affecting activations. Convolutional network separate image space into piecewise-linear areas, which are not aligned with coordinate axes. That mean if we change intensity of pixel far enough we are out of correct classification area.

We don’t know how incorrect areas are distributed in the image space, but for common convolutional network dimensionality of subspace of the hyperplanes which make piecewise-linear separation boundary is several times more than dimensionality of the image vector. This suggest that correlation between incorrect areas of different pixels is quite weak.

Now assume that image is stable to perturbation, that mean that exist \epsilon such that for any effective pixel it’s epsilon-neighbourhood is in the correct area. If incorrect areas are weakly correlated that mean probability of image being stable is about , where *n *is number of effective pixels. That is probability of stable image is near zero. That illustrate suggestion that this “adversarial” effect is only caused by dimensionality of the problem and parameter space, not by some intrinsic deficiency of the deep network.

## Finds in arxiv, January.

Repost from googleplus stream

Finds in arxiv

**Computer vision**

*Convex Relaxations of SE(2) and SE(3) for Visual Pose Estimation*

Finding camera position from the series of 2D (or depth) images is one of the most common (and difficult) task of computer vision.

The biggest problem here is incorporating rotation into cost(energy) function, that should minimize reprojection(fro 2D) error (seehttp://en.wikipedia.org/wiki/Bundle_adjustment). Usually it’s done by minimization on manifold – locally presenting rotation parameter space as linear subspace. The problem with that approach is that initial approximation should be good enough and minimization goes by small steps pojecting/reprojecting on manifold (SO(3) in the case) which can stuck in local minimum. Where to start if the is no initial approximation? Usually it’s just brute-forced with several initialization attempt. Here in the paper approach is different – no initial approximation needed, solution is straightforward and minimum is global. The idea is instead of minimizing rotation on the sphere minimize it inside the ball. In fact it’s a classical **convexification** approach – increase dimentionality of the problem to make it convex. The pay is a lot higher computational cost of cause

http://arxiv.org/abs/1401.3700

*A Sparse Outliers Iterative Removal Algorithm to Model the Background in the Video Sequences*

Background removal without nuclear norm. Some “good” frames chosen as dictionary and backround represented in it.

http://arxiv.org/abs/1401.6013

**Deep Learning**

*Learning Mid-Level Features and Modeling Neuron Selectivity for Image Classification*

Mid level features is a relatively new concept but a very old practice. It’s a building “efficient” subset of features from the set of features obtained by low-level feature descriptor(like SIFT, SURF, FREAK etc) Before it was done by PCA, later by sparse coding.

What authors built is some kind of hibrid of convolutional network with dictionary learning. Mid level features fed into neuron layer for sparsification and result into classifier. There are some benchmarks but no CIFAR or MINST

http://arxiv.org/abs/1401.5535

**Optimization and CS**

*Alternating projections and coupling slope*

Finding intersection of two non-convex sets with alternating projection. I didn’t new that transversality condition used a lot in convex geometry too.

http://arxiv.org/abs/1401.7569

*Bayesian Pursuit Algorithms*

Bayesian version of hard thresholding. Bayesian approach is in how threshold is chosen and update scaled.

http://arxiv.org/abs/1401.7538

## December finds in #arxiv

Repost from my googleplus stream

**Computer Vision**

*Non-Local means is a local image denoising algorithm*

Paper shows that non-local mean weights are not identify patches globally point in the images, but are susceptible to aperture problem:

http://en.wikipedia.org/wiki/Optical_flow#Estimation That’s why short radius NLM could be better then large radius NLM. Small radius cutoff play the role of regularizer, similar to the Total Variation in Horn-Shunk Optical flow.

http://en.wikipedia.org/wiki/Horn%E2%80%93Schunck_method (my comment – TV-L1 is generally better than TV-L2 in Horn-Schunk)

http://arxiv.org/abs/1311.3768

**Deep Learning**

*Do Deep Nets Really Need to be Deep?*

Authors state that shallow neural nets can in fact achieve similar performance to deep convolutional nets. The problem though is, that they had to be initialized or preconditioned – they can not be trained using existing algorithms.

And for that initialization they need deep nets. Authors hypothesize that there should be algorithms that allow training of those shallow nets to reach the same performance as deep nets.

http://arxiv.org/abs/1312.6184

*Intriguing properties of neural networks*

The linear combination of deep-level nodes produce the same results as the the original nodes. That suggest that nodes the spaces itself rather it’s representation keep information for deep levels.

The input-output mapping also discontinuous – small perturbations cause misclassification. Those perturbation are not dependent on the training, only on input of classification. (My comment – sparse coding is generally not smooth on input, another argument that sparse coding is part of internal mechanics of deep learning)

http://arxiv.org/abs/1312.6199

*From Maxout to Channel-Out: Encoding Information on Sparse Pathways*

This paper start with observation that max-out is a form of sparse coding: only one of the input pathway is chosen for father processing. From this inferred development of that principle:

remove “middle” layer which “choose” maximum input, and transfer maximal input at once into next level – make choice function index-aware. Some other choice function beside the max is considered, but max still seems the best

Piecewise-constant choice function make interesting reference to previous paper (discontinuity of input-output mapping)

http://arxiv.org/abs/1312.1909

*Unsupervised Feature Learning by Deep Sparse Coding*

This, for a difference is not about convolutional network.

Instead SIFT(or similar) descriptors are used to produce bag-of-words, sparse coding is used with max-out, and manifold learning applied to it. (http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction)

http://arxiv.org/abs/1312.5783

*Generative NeuroEvolution for Deep Learning*

I’m generally wary of evolutionary methods, but this looks kind of interesting – it’s based on *compositional pattern producing network* (CPPN)- encoding geometric pattern as composition of simple functions.

This CPPN is used to encode connectivity pattern of ANN (Convolutional newtwork most used). Thus complete process is the combination of ANN training and evolutionary CPPN training

http://arxiv.org/abs/1312.5355

*Some Improvements on Deep Convolutional Neural Network Based Image Classification*, *Unsupervised feature learning by augmenting single images*

Botht papers seems about the same subject – squeeze more out of labeled images by applying a lot of transformation to them(Some of those transformations are implemented in cuda-convnet BTW)

http://arxiv.org/abs/1312.5402, http://arxiv.org/abs/1312.5242

*Exact solutions to the nonlinear dynamics of learning in deep linear neural networks*

Analytical exploration of toy 3-layer model *without_ actual non-linear neurons. Model completely linear to input (polynomial to weights). Nevertheless it show some interesting properties, like step in learning curve

http://arxiv.org/abs/1312.6120

**Optimization**

*Distributed Interior-point Method for Loosely Coupled Problems*

Mixing together all my favorite methods: Interior point, Newton, ADMM(Split-Bregman) into one algorithm and make a distribute implementation of it.

Mixing Newton and ADMM, ADMM and Interior point looks risky to me, though with a lot of subiterations it may work(that’s why it’s distributed – require a lot of calculating power)

Also I’m not sure about convergene of the combined algorithm – each step’s convergence is proven, but I’m not sure the same could be applyed to the combination.

Newton and ADMM have kind of contradicting optimal conditions – smoothness vs piecewise linearity. Would like to see more research on this…

http://arxiv.org/abs/1312.5440

*Total variation regularization for manifold-valued data*

Proximal mapping and soft thresholding for manifolds – analog of ADMM for manifolds.

http://arxiv.org/abs/1312.7710

**just interesting stuff**

*Coping with Physical Attacks on Random Network Structures*

Include finding vulnerable spots and results of random attacks

(My comment – shouldn’t it be connected to precolation theory?)

http://arxiv.org/abs/1312.6189

## Finds in arxiv, october

This is duplication of my ongoing G+ series of post on interesting for me papers in arxiv. Older posts are not here but in my G+ thread.

Finds in #arxiv :

*Optimization, numerical & convex, ML*

The Linearized Bregman Method via Split Feasibility Problems: Analysis and Generalizations

Reformulation of Split Bregman/ ADMM as split feasibility problem and algorithm/convergence for generalized split feasibility by Bregman projection. This general formulation include both Split Bregman and Kaczmarz (my comment – randomized Kaczmarz seems could be here too)

http://arxiv.org/abs/1309.2094

Stochastic gradient descent and the randomized Kaczmarz algorithm

Hybrid of randomized Kaczmarz and stochastic gradient descent – into my “to read” pile

http://arxiv.org/abs/1310.5715

Trust–Region Problems with Linear Inequality Constraints: Exact SDP Relaxation, Global Optimality and Robust Optimization

“Extended” trust region for linear inequalities constrains

http://arxiv.org/abs/1309.3000

Conic Geometric Programming

Unifing framwork for conic and geometric programming

http://arxiv.org/abs/1310.0899

http://en.wikipedia.org/wiki/Geometric_programming

http://en.wikipedia.org/wiki/Conic_programming

Gauge optimization, duality, and applications

Another big paper about different, not Lagrange duality, introduced by Freund (1987)

http://arxiv.org/abs/1310.2639

Color Bregman TV

mu parameters in split bregman made adaptive, to exploit coherence of edges in different color channels

http://arxiv.org/abs/1310.3146

Iteration Complexity Analysis of Block Coordinate Descent Methods

Some convergence analysis for BCD and projected gradient BCD

http://arxiv.org/abs/1310.6957

Successive Nonnegative Projection Algorithm for Robust Nonnegative Blind Source Separation

Nonnegative matrix factorization

http://arxiv.org/abs/1310.7529

Scaling SVM and Least Absolute Deviations via Exact Data Reduction

SVN for large-scale problems

http://arxiv.org/abs/1310.7048

Image Restoration using Total Variation with Overlapping Group Sparsity

While title is promising I have doubt about that paper. The method authors suggest is equivalent to adding averaging filter to TV-L1 under L1 norm. There is no comparison to just applying TV-L1 and smoothing filter interchangeably.The method author suggest is very costly, and using median filter instead of averaging would cost the same while obviously more robust.

http://arxiv.org/abs/1310.3447

*Deep learning*

Deep and Wide Multiscale Recursive Networks for Robust Image Labeling

_Open source_ matlab/c package coming soon(not yet available)

http://arxiv.org/abs/1310.0354

Improvements to deep convolutional neural networks for LVCSR

convolutional networks, droput for speech recognition,

http://arxiv.org/abs/1309.1501v1

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition

Already discussed on G+ – open source framework in “learn one use everywhere” stile. Learning done off-line on GPU using ConvNet, and recognition is online in pure python.

http://arxiv.org/abs/1310.1531

Statistical mechanics of complex neural systems and high dimensional data

Big textbook-like overview paper on statistical mechanics of learning. I’ve put it in my “to read” pile.

http://arxiv.org/abs/1301.7115

Randomized co-training: from cortical neurons to machine learning and back again

“Selectron” instead of perception – neurons are “specializing” with weights.

http://arxiv.org/abs/1310.6536

Provable Bounds for Learning Some Deep Representations

http://arxiv.org/abs/1310.6343

Citation:”The current paper presents both an interesting family of denoising autoencoders as

*Computer vision*

Online Unsupervised Feature Learning for Visual Tracking

Sparse representation, overcomplete dictionary

http://arxiv.org/abs/1310.1690

From Shading to Local Shape

Shape restoration from local shading – could be very useful in low-feature environment.

http://arxiv.org/abs/1310.2916

Fast 3D Salient Region Detection in Medical Images using GPUs

Finding interest point in 3D images

http://arxiv.org/abs/1310.6736

Object Recognition System Design in Computer Vision: a Universal Approach

Grid-based universal framework for object detection/classification

http://arxiv.org/abs/1310.7170

Gaming :)

Lyapunov-based Low-thrust Optimal Orbit Transfer: An approach in Cartesian coordinates

For space sim enthusiast :)

http://arxiv.org/abs/1310.4201

## ConvNet for windows

I have seen an excellent wlakthrough on building Alex Krizhevsky’s cuda-convnet for windows, but difference in configuration and installed packages could be tiresome. So here is complete build of convnet for windows 64:

https://github.com/s271/win_convnet

It require having CUDA compute capability 2.0 or better GPU of cause, Windows 64bit, Visual Studio 64bits and Python 64bit with NumPy installed. The rest of libs and dlls are precomplied. In building it I’ve followed instructions by Yalong Bai (Wyvernbai) from http://www.asiteof.me/archives/50.

Read Readme.md before installation – you may (or may not) require PYTHONPATH environmental variable set.

On side note I’ve used WinPython for both libraries and running the package. WinPython seems a very nice package, which include Spyder IDE. I have some problems with Spyder though – sometimes it behave erratically during debugging/running the code. Could be my inexperience with Spyder though. Another nice package – PythonXY – regretfully can not be used – it has only 32 bit version and will not compile with/run 64 bit modules.

## Samsung SARI 1.5 Augmented Reality SDK is out in the wild

Something I did for Samsung (kernel of tracker). Biggest improvement in SARI 1.5 is the sensors fusion, which allow for a lot more robust tracking.

Here is example of run-time localization and mapping with SARI 1.5:

This is the AR EdiBear game (free in Samsung apps store)

## Effectivness of compressed sensing in image processing and other stuff

I seldom post in the this blog now, mostly because I’m positing on twitter and G+ a lot lately. I still haven’t figured out which post should go where – blog, G+ or twitter, so it’s kind of chaotic for now.

What of interest is going on: There are two paper on the CVPR11 which claim that compressed sensing(sparse recovery) is not applicable to some of the most important computer vision tasks:

Is face recognition really a Compressive Sensing problem?

Are Sparse Representations Really Relevant for Image Classification?

Both paper claim that space of the natural images(or their important subsets) are not really sparse.

Those claims however dont’t square with claim of high effectiveness of compact signature of Random Ferns.

Could both of those be true? In my opinion – yes. Difference of two approaches is that first two paper assumed *explicit* sparsity – that is they enforced sparsity on the feature vector. Compressed signature approach used *implicit* sparsity – feature vector underling the signature is assumed sparse but is not explicitly reconstructed. Why compressed signature is working while explicit approach didn’t? That could be the case if image space is sparse in the different coordinate system – that is here one is dealing with the *union of subspaces*. Assumption not of the simple sparsity, but of the union of subspaces is called blind compressed sensing.

Now if we look at the space of the natural images it’s easy to see why it is low dimensional. Natural image is the image of some scene, an that scene has limited number of moving object. So dimension of space images of the scene is approximately the sum of degree of freedom of the objects(and camera) of the scene, plus effects of occlusions, illumination and noise. Now if the add strong enough random error to the scene, the image is stop to be the natural image(that is image of any scene). That mean manifold of the images of the scene is isolated – there is no natural images in it’s neighborhood. That hint that up to some error the space of the natural images is at least is *the union of isolated low-dimensional manifolds*. The union of mainfolds is obviously is more complex structure than the union of subspace, but methods of blind compressed sensing could be applicable to it too. Of cause to think about union of manifolds could be necessary only if the space of images is not union of subspace, which is obviously preferable case

## Recursive Interferometry – Phase Congruency?

Thanks to Igor Carron I’ve watched a great videolecture by Stephane Mallat High dimensional classification by recursive interferometry. Actually I watched it twice, and I think I understand most of it now))). And it was not about compressed sensing, not even about manifold learning much . It was mostly about a new application of wavelets . How to use wavelets to produce low dimensional data (image descriptor if we are talking about computer vision) from high dimensional data(that is image). The idea is to transcend linear representation and use nonlinear operation – absolute value of wavelet. Absolute value – square root of wavelet square carry information of frequencies differences. It’s invert Fourier transform have new harmonics – differences of frequencies of original function. That interference of harmonics of original image. Now it was reminding me something. Yep – phase congruency (pdf). Phase congruency also use absolute value of wavelet(windowed Fourier). It seems to me it has perfect explanation. Interference pattern defined by how in-phase both wave are. That is it’s like a phase congruency taken into each point. Phase congruency edge-detector is in fact finding maximum of somehow normalized interference pattern. In that sense this Mallat’s method producing invariants from high-dimensional data is analogous to producing sketch from photo.

Ok, enough rambling for now.

## SMMT is now open sourced

SMMT is now open sourced under BSD-like license. Sourceforge page is here.