Geoffrey Hinton, neural networks, ML, deep learning pioneer answered “Ask Me Anything” on reddit:
His answers only:
His most controversial answer is:
The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.
If the pools do not overlap, pooling loses valuable information about where things are. We need this information to detect precise relationships between the parts of an object. Its true that if the pools overlap enough, the positions of features will be accurately preserved by “coarse coding” (see my paper on “distributed representations” in 1986 for an explanation of this effect). But I no longer believe that coarse coding is the best way to represent the poses of objects relative to the viewer (by pose I mean position, orientation, and scale).
I think it makes much more sense to represent a pose as a small matrix that converts a vector of positional coordinates relative to the viewer into positional coordinates relative to the shape itself. This is what they do in computer graphics and it makes it easy to capture the effect of a change in viewpoint. It also explains why you cannot see a shape without imposing a rectangular coordinate frame on it, and if you impose a different frame, you cannot even recognize it as the same shape. Convolutional neural nets have no explanation for that, or at least none that I can think of.
It’s interesting though controversial opinion, and here is my take on it:
It looks like Convolutional networks discriminate input by switching between activation subnetworks (http://arxiv.org/abs/1410.1165, Understanding Locally Competitive Networks)
From that point of view max pooling operation is pretty natural – it provide both robustness to small deviations and switching between activation subsets. Aslo max pooling has a nice property to be semilinear to multiplication f(ax)=af(x) , which make all the layers stack but last cost layer semilinear, and if one use SVM/L2SVM as cost whole stack semilinear. Not sure if it help in discrimination, but it surely help in debugging.
As alternative to pooling pyramid Geoffrey Hinton suggest to get somehow global rotation(or other transformation) matrix from the image.
Getting it from image or from image patch is actually possible with low-rank matrix factorization from single image(Transform Invariant Low-rank Textures by Zhang et al) and by optical flow (or direct matching) from set of images, but it’s quite expensive in performance term. It’s definitely possible to combine those method with deep network, (ironically they are benefit using pooling pyramid)
1. They will be kind of image augmentation
2. Those are problem-specific methods, they may not help in other domains like NLP, speech etc.
There are some work on using low-rank matrix factorization in deep networks, but they are mostly about fully connected, not convolutional layers. But IMHO if to try do factorization in conv layers, the pooling wouldn’t go away.
PS (edit): Here is a video with Hinton alternative to pooling – “capsules”
Repost from googleplus stream
Finds in arxiv
Convex Relaxations of SE(2) and SE(3) for Visual Pose Estimation
Finding camera position from the series of 2D (or depth) images is one of the most common (and difficult) task of computer vision.
The biggest problem here is incorporating rotation into cost(energy) function, that should minimize reprojection(fro 2D) error (seehttp://en.wikipedia.org/wiki/Bundle_adjustment). Usually it’s done by minimization on manifold – locally presenting rotation parameter space as linear subspace. The problem with that approach is that initial approximation should be good enough and minimization goes by small steps pojecting/reprojecting on manifold (SO(3) in the case) which can stuck in local minimum. Where to start if the is no initial approximation? Usually it’s just brute-forced with several initialization attempt. Here in the paper approach is different – no initial approximation needed, solution is straightforward and minimum is global. The idea is instead of minimizing rotation on the sphere minimize it inside the ball. In fact it’s a classical convexification approach – increase dimentionality of the problem to make it convex. The pay is a lot higher computational cost of cause
A Sparse Outliers Iterative Removal Algorithm to Model the Background in the Video Sequences
Background removal without nuclear norm. Some “good” frames chosen as dictionary and backround represented in it.
Learning Mid-Level Features and Modeling Neuron Selectivity for Image Classification
Mid level features is a relatively new concept but a very old practice. It’s a building “efficient” subset of features from the set of features obtained by low-level feature descriptor(like SIFT, SURF, FREAK etc) Before it was done by PCA, later by sparse coding.
What authors built is some kind of hibrid of convolutional network with dictionary learning. Mid level features fed into neuron layer for sparsification and result into classifier. There are some benchmarks but no CIFAR or MINST
Optimization and CS
Alternating projections and coupling slope
Finding intersection of two non-convex sets with alternating projection. I didn’t new that transversality condition used a lot in convex geometry too.
Bayesian Pursuit Algorithms
Bayesian version of hard thresholding. Bayesian approach is in how threshold is chosen and update scaled.
Repost from my googleplus stream
Non-Local means is a local image denoising algorithm
Paper shows that non-local mean weights are not identify patches globally point in the images, but are susceptible to aperture problem:
http://en.wikipedia.org/wiki/Optical_flow#Estimation That’s why short radius NLM could be better then large radius NLM. Small radius cutoff play the role of regularizer, similar to the Total Variation in Horn-Shunk Optical flow.
http://en.wikipedia.org/wiki/Horn%E2%80%93Schunck_method (my comment – TV-L1 is generally better than TV-L2 in Horn-Schunk)
Do Deep Nets Really Need to be Deep?
Authors state that shallow neural nets can in fact achieve similar performance to deep convolutional nets. The problem though is, that they had to be initialized or preconditioned – they can not be trained using existing algorithms.
And for that initialization they need deep nets. Authors hypothesize that there should be algorithms that allow training of those shallow nets to reach the same performance as deep nets.
Intriguing properties of neural networks
The linear combination of deep-level nodes produce the same results as the the original nodes. That suggest that nodes the spaces itself rather it’s representation keep information for deep levels.
The input-output mapping also discontinuous – small perturbations cause misclassification. Those perturbation are not dependent on the training, only on input of classification. (My comment – sparse coding is generally not smooth on input, another argument that sparse coding is part of internal mechanics of deep learning)
From Maxout to Channel-Out: Encoding Information on Sparse Pathways
This paper start with observation that max-out is a form of sparse coding: only one of the input pathway is chosen for father processing. From this inferred development of that principle:
remove “middle” layer which “choose” maximum input, and transfer maximal input at once into next level – make choice function index-aware. Some other choice function beside the max is considered, but max still seems the best
Piecewise-constant choice function make interesting reference to previous paper (discontinuity of input-output mapping)
Unsupervised Feature Learning by Deep Sparse Coding
This, for a difference is not about convolutional network.
Instead SIFT(or similar) descriptors are used to produce bag-of-words, sparse coding is used with max-out, and manifold learning applied to it. (http://en.wikipedia.org/wiki/Nonlinear_dimensionality_reduction)
Generative NeuroEvolution for Deep Learning
I’m generally wary of evolutionary methods, but this looks kind of interesting – it’s based on compositional pattern producing network (CPPN)- encoding geometric pattern as composition of simple functions.
This CPPN is used to encode connectivity pattern of ANN (Convolutional newtwork most used). Thus complete process is the combination of ANN training and evolutionary CPPN training
Some Improvements on Deep Convolutional Neural Network Based Image Classification, Unsupervised feature learning by augmenting single images
Botht papers seems about the same subject – squeeze more out of labeled images by applying a lot of transformation to them(Some of those transformations are implemented in cuda-convnet BTW)
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
Analytical exploration of toy 3-layer model *without_ actual non-linear neurons. Model completely linear to input (polynomial to weights). Nevertheless it show some interesting properties, like step in learning curve
Distributed Interior-point Method for Loosely Coupled Problems
Mixing together all my favorite methods: Interior point, Newton, ADMM(Split-Bregman) into one algorithm and make a distribute implementation of it.
Mixing Newton and ADMM, ADMM and Interior point looks risky to me, though with a lot of subiterations it may work(that’s why it’s distributed – require a lot of calculating power)
Also I’m not sure about convergene of the combined algorithm – each step’s convergence is proven, but I’m not sure the same could be applyed to the combination.
Newton and ADMM have kind of contradicting optimal conditions – smoothness vs piecewise linearity. Would like to see more research on this…
Total variation regularization for manifold-valued data
Proximal mapping and soft thresholding for manifolds – analog of ADMM for manifolds.
just interesting stuff
Coping with Physical Attacks on Random Network Structures
Include finding vulnerable spots and results of random attacks
(My comment – shouldn’t it be connected to precolation theory?)
This is “find in arxiv” reposts form my G+ stream for November.
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
Stochastic gradient (SGD) is the major tool for Deep Learning. However if you look at the plot of cost function over iteration for SGD you will see that after quite fast descent it becoming extremely slow, and error decrease could even become non-monotonous. Author explain by necessity of trade of between the step size and variance of random factor – more precision require smaller variance but that mean smaller descent step and slower convergence. “Predictive variance” author suggest to mitigate problem is the same old “adding back the noise” trick, used for example in Split Bregman. Worth reading IMHO.
Predicting Parameters in Deep Learning
Output of the first layer of ConvNet is quite smooth, and that could be used for dimensionality reduction, using some dictionary, learned or fixed(just some simple kernel). For ConvNet predicting 75% of parameters with fixed dictionary have negligible effect on accuracy.
Learning a Deep Compact Image Representation for Visual Tracking
Application of ADMM (Alternating Direction Method of Multipliers, of which Split Bregman again one of the prominent examples) to denoising autoencoder with sparsity.
Deep Neural Networks for Object Detection
People from Google are playing with Alex Krizhevsky’s ConvNet
Are all training examples equally valuable?
It’s intuitively obvious that some training sample are making training process worse. The question is – how to find wich sample should be removed from training? Kind of removing outliers. Authors define “training value” for each sample of binary classifier.
Finding sparse solutions of systems of polynomial equations via group-sparsity optimization
Finding sparse solution of polynomial system with lifting method.
I still not completely understand why quite weak structure constraint is enough for found approximation to be solution with high probability. It would be obviously precise for binary 0-1 solution, but why for general sparse?
Semi-Supervised Sparse Coding
Big dataset with small amount of labeled samples – what to do? Use unlabeled samples for sparse representation. And train labeled samples in sparse representation.
From the same author, similar theme – Cross-Domain Sparse Coding
Two domain training – use cross domain data representation to map all the samples from both source and target domains to a data representation space with a common distribution across domains.
Robust Low-rank Tensor Recovery: Models and Algorithms
More of tensor decomposition with trace norm
Complexity of Inexact Proximal Newton methods
Application of Proximal Newton (BFGS) to subset of coordinates each step – active set coordinate descent.
Computational Complexity of Smooth Differential Equations
Polynomial-memory complexity of ordinary differential equations.
Visualizing and Understanding Convolutional Neural Networks
This is exploration of Alex Krizhevsky’s ConvNet
( https://code.google.com/p/cuda-convnet/ )
using “deconvnet” approach – using deconvolution on output of each layer and visualizing it. Results looks interesting – strting from level 3 it’s something like thersholded edge enchantment, or sketch. Also there are evidences supporting “learn once use everywhere” approach – convnet trained on ImageNet is also effective on other datasets
Unsupervised Learning of Invariant Representations in Hierarchical Architectures
Another paper on why and how deep learning works.
Attempt to build theoretical framework for invariant features in deep learning. Interesting result – Gabor wavelets are optimal filters for simultaneous scale and translation invariance. Relations to sparsity and scattering transform
An Experimental Comparison of Trust Region and Level Sets
Trust regions method for energy-based segmentation.
Trust region is one of the most important tools in optimization, especially non-convex.
Blind Deconvolution with Re-weighted Sparsity Promotion
Using reweighted L2 norm for sparsity in blind deconvolution
Online universal gradient methods
about Nesterov’s universal gradient method (
It use Bregman distance and related to ADMM.
The paper is application of universal gradient method to online learning and give bound on regret function.
A Component Lasso
Approximate covariance matrix with block-diagonal matrix and apply Lasso to each block separately
_FuSSO: Functional Shrinkage and Selection Operator
Lasso in functional space with some orthogonal basis_
Non-Convex Compressed Sensing Using Partial Support Information
More of Lp norm for sparse recovery. Reweighted this time.
Scalable Frames and Convex Geometry
Frame theory is a basis(pun intended) of wavelets theory, compressed sening and overcomplete dictionaries in ML
Here is a discussion how to make “tight frame”
from an ordinary frame by scaling m of its components
Interesting geometric insight provided – to do it mcomponents of frame should make “blunt cone”
Learning Sparsely Used Overcomplete Dictionaries via Alternating Minimization
Some bounds for convergence of dictionary learning. Converge id initial error is O(1/s^2), s- sparcity level
Robustness of ℓ1 minimization against sparse outliers and its implications in Statistics and Signal Recovery
This is another exploration of L1 estimator. It happens (contrary to common sense) that L1 is not especially robust from “breakdown point” point of view if there is no constraint of noise. However it practical usefulness can be explained that it’s very robust to sparse noise
Local Fourier Analysis of Multigrid Methods with Polynomial Smoothers and Aggressive coarsening
Overrelaxaction with Chebyshev weights on the fine grid, with convergence analysis.
Finds in #arxiv :
*Optimization, numerical & convex, ML*
The Linearized Bregman Method via Split Feasibility Problems: Analysis and Generalizations
Reformulation of Split Bregman/ ADMM as split feasibility problem and algorithm/convergence for generalized split feasibility by Bregman projection. This general formulation include both Split Bregman and Kaczmarz (my comment – randomized Kaczmarz seems could be here too)
Stochastic gradient descent and the randomized Kaczmarz algorithm
Hybrid of randomized Kaczmarz and stochastic gradient descent – into my “to read” pile
Trust–Region Problems with Linear Inequality Constraints: Exact SDP Relaxation, Global Optimality and Robust Optimization
“Extended” trust region for linear inequalities constrains
Conic Geometric Programming
Unifing framwork for conic and geometric programming
Gauge optimization, duality, and applications
Another big paper about different, not Lagrange duality, introduced by Freund (1987)
Color Bregman TV
mu parameters in split bregman made adaptive, to exploit coherence of edges in different color channels
Iteration Complexity Analysis of Block Coordinate Descent Methods
Some convergence analysis for BCD and projected gradient BCD
Successive Nonnegative Projection Algorithm for Robust Nonnegative Blind Source Separation
Nonnegative matrix factorization
Scaling SVM and Least Absolute Deviations via Exact Data Reduction
SVN for large-scale problems
Image Restoration using Total Variation with Overlapping Group Sparsity
While title is promising I have doubt about that paper. The method authors suggest is equivalent to adding averaging filter to TV-L1 under L1 norm. There is no comparison to just applying TV-L1 and smoothing filter interchangeably.The method author suggest is very costly, and using median filter instead of averaging would cost the same while obviously more robust.
Deep and Wide Multiscale Recursive Networks for Robust Image Labeling
_Open source_ matlab/c package coming soon(not yet available)
Improvements to deep convolutional neural networks for LVCSR
convolutional networks, droput for speech recognition,
DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
Already discussed on G+ – open source framework in “learn one use everywhere” stile. Learning done off-line on GPU using ConvNet, and recognition is online in pure python.
Statistical mechanics of complex neural systems and high dimensional data
Big textbook-like overview paper on statistical mechanics of learning. I’ve put it in my “to read” pile.
Randomized co-training: from cortical neurons to machine learning and back again
“Selectron” instead of perception – neurons are “specializing” with weights.
Provable Bounds for Learning Some Deep Representations
Citation:”The current paper presents both an interesting family of denoising autoencoders as
Online Unsupervised Feature Learning for Visual Tracking
Sparse representation, overcomplete dictionary
From Shading to Local Shape
Shape restoration from local shading – could be very useful in low-feature environment.
Fast 3D Salient Region Detection in Medical Images using GPUs
Finding interest point in 3D images
Object Recognition System Design in Computer Vision: a Universal Approach
Grid-based universal framework for object detection/classification
Lyapunov-based Low-thrust Optimal Orbit Transfer: An approach in Cartesian coordinates
For space sim enthusiast :)