## Some thoughts about deep learning criticism

Some recent criticism of deep learning (specifically #convnet) is based on artificially constructed misclassification examples.

There is a new paper

“Deep Neural Networks are Easily Fooled: High Confidence Predictions for Unrecognizable Images” by Nguyen et al

http://arxiv.org/abs/1412.1897

and another line of criticism is based on the older, widely cited paper

“Intriguing properties of neural networks” by Szegedy et al

http://arxiv.org/abs/1312.6199

In the first paper the authors construct examples that a convnet classifies with high confidence, but that look nothing like the label to the human eye.

In the second paper the authors show that a correctly classified image can be turned into a misclassified one by a small perturbation, and that this perturbation can be found by an iterative procedure.

What I think is that these phenomena have no impact on the practical performance of convolutional neural networks.

The first paper is really simple to address. Real-world images produced by a camera are not dense in the image space (the space of all pixel vectors of image-size dimension).

In fact, camera images belong to a low-dimensional manifold in the image space, and there is some interesting work on the dimensionality and properties of that manifold. For example, the dimensionality of the space of images of a fixed 3D scene is around 7, which is not surprising, and the geodesics of that manifold can be defined through the optical flow.

Of course, if a sample is outside the image manifold it will be misclassified, the method of training notwithstanding. The images in the paper are clearly not real-world camera images, so it is no wonder that the convnet assigns nonsensical labels to them.

The second paper is more interesting. First I want to point out that the perturbation which causes misclassification is produced by an iterative procedure. That hints that, in the neighbourhood of the image, the perturbed misclassified images belong to a set of near-zero measure.

Practically, that means the probability of hitting this type of misclassification by chance is near zero, and orders of magnitude less than the “normal” misclassification rate of most deep networks.
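To make the "searched versus random" distinction concrete, here is a minimal sketch under strong simplifying assumptions: the decision boundary is a single hypothetical hyperplane with a random unit normal `w`, and the clean image sits at distance `margin` on the correct side. None of these names or numbers come from either paper, and the "searched" step here is just a move against the normal, not the optimisation procedure used by Szegedy et al.

```python
import math
import random


def unit_vector(n, rng):
    """Draw a direction uniformly at random from the unit sphere in R^n."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def flip_rates(n=500, margin=1.0, radius=3.0, trials=1000, seed=0):
    """Compare random and searched perturbations of the same norm.

    Assumption: the label flips exactly when the perturbation d satisfies
    w . d < -margin, i.e. the boundary is one hyperplane with unit normal w.
    """
    rng = random.Random(seed)
    w = unit_vector(n, rng)

    # Random perturbations of length `radius`: count how many cross the boundary.
    flips = 0
    for _ in range(trials):
        d = unit_vector(n, rng)
        if radius * sum(wi * di for wi, di in zip(w, d)) < -margin:
            flips += 1
    random_rate = flips / trials

    # Searched perturbation of the same length: step straight against w.
    searched_flips = -radius * sum(wi * wi for wi in w) < -margin
    return random_rate, searched_flips


if __name__ == "__main__":
    print(flip_rates())
```

In this toy setting a random direction has a component along `w` of order `radius / sqrt(n)`, so it almost never crosses the boundary, while the searched direction of the same length always does. Real networks have many boundaries, so this only illustrates the scaling with dimension.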

But what is causing that misclassification? I’d suggest that it is just the high dimensionality of the image and parameter spaces, and I’ll try to illustrate it. In fact it’s the same reason why epsilon-sparse vectors are ubiquitous in real-world applications: if we take an *n*-dimensional vector (with components drawn independently and uniformly from [0, 1]), the probability that all its components are more than $\epsilon$ is $(1-\epsilon)^n$, which is near zero for large *n*. This and similar effects are explored in depth in compressed sensing (see also Igor Carron’s very good page).
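A quick numeric check of this sparsity argument, as a sketch only. It assumes the components are independent and uniform on [0, 1], in which case the probability that all *n* of them exceed epsilon is `(1 - eps) ** n`. The function names are made up for illustration.

```python
import random


def prob_all_above(eps, n):
    """Exact P(all n components > eps) for i.i.d. Uniform(0, 1) components."""
    return (1.0 - eps) ** n


def simulate(eps, n, trials=20000, seed=0):
    """Monte Carlo estimate of the same probability, as a sanity check."""
    rng = random.Random(seed)
    hits = sum(
        all(rng.random() > eps for _ in range(n))
        for _ in range(trials)
    )
    return hits / trials


if __name__ == "__main__":
    # The probability collapses quickly as the dimension grows.
    for n in (10, 1000, 100000):
        print(n, prob_all_above(0.01, n))
```

Even with epsilon as small as 0.01, the probability is already below 1e-4 at *n* = 1000, so almost every high-dimensional vector has at least one near-zero component.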

Now to convnets: a convnet classifies images by the signs of piecewise-linear functions.

Take any effective pixel, that is, one which affects the activations. A convolutional network separates the image space into piecewise-linear areas, which are not aligned with the coordinate axes. That means that if we change the intensity of a pixel far enough, we leave the correct classification area.
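A toy illustration of that last point, assuming a two-pixel "image" and a tiny ReLU network with hand-picked, purely hypothetical weights (`W`, `B`, `V`, `C` are not taken from any real model). The score is piecewise linear in each pixel, and raising one pixel far enough changes its sign.

```python
def relu(z):
    return max(0.0, z)


# Hypothetical weights of a 2-input, 2-hidden-unit, 1-output ReLU network.
W = [[1.0, -1.0], [-1.0, 2.0]]
B = [0.0, 0.5]
V = [-1.0, 1.0]
C = -0.2


def score(x):
    """Piecewise-linear score; the predicted class is its sign."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W, B)]
    return sum(v * h for v, h in zip(V, hidden)) + C


def first_flip(x, pixel, step=0.01):
    """Raise one pixel in small steps (up to intensity 1.0) and return the
    first intensity at which the sign of the score changes, or None."""
    start_positive = score(x) > 0
    probe = list(x)
    steps = int(round((1.0 - x[pixel]) / step))
    for k in range(1, steps + 1):
        probe[pixel] = x[pixel] + k * step
        if (score(probe) > 0) != start_positive:
            return probe[pixel]
    return None


if __name__ == "__main__":
    image = [0.2, 0.5]
    print(score(image), first_flip(image, pixel=0))
```

For the starting image `[0.2, 0.5]` the score is positive, and it turns negative once the first pixel passes roughly 0.9, with the second pixel untouched.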

We don’t know how the incorrect areas are distributed in the image space, but for a common convolutional network the dimensionality of the subspace of the hyperplanes which make up the piecewise-linear separation boundary is several times greater than the dimensionality of the image vector. This suggests that the correlation between the incorrect areas of different pixels is quite weak.

Now assume that the image is stable to perturbation, meaning that there exists $\epsilon$ such that for every effective pixel its $\epsilon$-neighbourhood is in the correct area. If the incorrect areas are weakly correlated, the probability of an image being stable is about $(1-p)^n$, where *n* is the number of effective pixels and $p$ is the probability that a single pixel’s $\epsilon$-neighbourhood reaches an incorrect area. That is, the probability of a stable image is near zero. This illustrates the suggestion that the “adversarial” effect is caused only by the dimensionality of the problem and parameter space, not by some intrinsic deficiency of the deep network.
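As a worked version of this estimate, write $p$ for the probability that a single effective pixel’s $\epsilon$-neighbourhood reaches an incorrect area, and treat the pixels as independent. Both $p$ and the numbers below are assumptions chosen for illustration, not measured values.

```latex
P(\text{stable}) \approx \prod_{i=1}^{n} (1 - p) = (1 - p)^n \approx e^{-pn},
\qquad
n = 10^4,\; p = 10^{-3} \;\Rightarrow\; (1 - p)^n \approx e^{-10} \approx 4.5 \times 10^{-5}.
```

So even if each pixel on its own is very unlikely to be unstable, an image with thousands of effective pixels is almost certain to have at least one direction in which a small perturbation leaves the correct area.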

## 4 Comments

Well, your objection to 1st paper is equally easy to object to: how come humans do not mis-identify them, even though they are not in “image manifold”?

To both papers, what they are saying (or rather “demonstrating”) is, basically, that neural networks are just hacks and not a real solution to the CV problem. It’s like we know we should have used other code for the task, but we don’t know what it should be. So we just throw whatever “works” at the problem. No surprise we end up with unexpected “bugs” later.

Comment by makc3d | 14, December, 2014

And of course, our neural nets are never confused by similar random data …. Anyone remember Holusions?

I think you are correct – given a ‘normal’ image, it should be reasonably accurate. Given random data, both human neural nets and simulations are prone to misclassification – the Rorschach test relies on this.

Comment by Renor | 24, December, 2014

@makc3d The difference between a DNN and a human is that humans are trained on a lot more labeled data, including abstract patterns outside the domain of natural images. However, on difficult patterns many humans have the same problem, for example “Day and Night” by Escher.

About the hack: it’s just a question of terminology. In that sense all of mathematics is “just a hack”. But DNNs work for practical applications, and sometimes better than humans.

Comment by mirror2image | 28, December, 2014

Your objection to the first paper is not at all valid. Supposedly, these deep NN models are good at classifying images regardless of the quality of the image. Plus, humans have no problem classifying these images, so these models cannot at all be looked at as a viable model for our classification capabilities.

Your objection to the second paper did not address the “real” issue, which is essentially more theoretical in nature and has to do with the fact that NNs (deep or otherwise) work only on data, and functions that work combinations of data only can, by the nature of number-theoretic functions, produce the same output for different combinations of inputs. This is a theoretical result that cannot be resolved by these models.

Comment by walid saba | 23, January, 2015