Total Variation in Image Processing and classical Action
This post is inspired by Extremal Principles in Classical, Statistical and Quantum Mechanics in Azimuth blog.
Total Variation used a lot in image processing. Image denoising, optical flow, depth maps processing. The standard form of Total Variation f or norm is minimizing “energy” of the form
(I’m talking about Total Variaton- for now, not
) over all functions
In case of image denoising it would be
where is original image and
is denoised image
Part is called “fidelity term” and
is “regularizer”
Regularizer part is to provide smoothness of solution and fidelity term is to force smooth solution to resemble original image (that is in case of image denoising)
Now if we return to classical Action, movement of the point is defined by the minimum of functional
, over trajectories
where
is kinetic energy and
is potential energy, or
One-dimensional total variation for image denoising is the same as classical mechanics of the particle, with potential energy defined by iamge and smoothness of denoised image as kinetic energy! For optical flow potential energy is differences between tranformed first image and the second
and kinetic energy is the smoothness of the optical flow.
Of cause the strict equality hold only for one-dimentional image and , and potential energy is quite strange – it depend not on coordinate but on velocity, like some kind of friction.
While it hold some practical meaning, most of practical task have two or more dimensional image and or
regulariser. So in term of classical mechanics we have movement in multidimensional time with non-classical kinetic energy
which has uncanny resemblance to Lagrangian of relativistic particle
So total variation in image processing is equivalent to physics of non-classical movement with multidimensional time, in the field with potential energy defined by image. I have no idea what does it signify, but it sounds cool
. Holographic principle? May be crowd from Azimuth or n-category cafe will give some explanation eventually…
And another, related question: regularizer in Total Variation. There is inherent connection between regularizers and Bayesian priors. What TV-L1 regularizer mean from Bayesian statistics point of view?
PS I’m posting mostly on my google plus now, so this blog is a small part of my posts.
Samsung SARI 1.5 Augmented Reality SDK is out in the wild
Something I did for Samsung (kernel of tracker). Biggest improvement in SARI 1.5 is the sensors fusion, which allow for a lot more robust tracking.
Here is example of run-time localization and mapping with SARI 1.5:
This is the AR EdiBear game (free in Samsung apps store)
Contact Eduard Oks if u Android dev and want to test SARI 1.5 . You have to understand matrices OpenGL to use it.
Interior point method – if all you have is a hammer
Interior point method for nonlinear optimization often considered as complex, or highly nontrivial etc. The fact is, that for “simple” nonlinear optimization it’s quite simple, manageable and can even be explained in #3tweets. For those not familiar with it there is a simple introduction to it in wikipedia, which in turn follows an excellent paper by Margaret H. Wright.
Now about “if all you have is a hammer, everything looks like a nail”. Some of applications of interior point method could be quite unexpected.
Everyone who worked with Levenberg-Marquardt minimization algorithm know how much pain is the choice of the small parameter . Levenberg-Marquardt can also be seen as modification of Gauss-newton with a trust region. The
of Levenberg-Marquardt do correspond to the trust region radius, but that dependence is highly complex and is difficult to estimate. You want trust region of the radius r, but what should be avlue of
? There is no easy answer to that question; there are some complex methods, or there is a testing with subdivision, which is what the original Levenberg-Marquardt implement.
Interior point can help here.
If we choose shape of trust region for Gauss-Newton as hypercube or simplex or like, we can formulate it as set of L1 norm inequality constrains. And that is the domain of interior point method! For hypercube the resulting equations looks especially nice
W – hessian, I – identity, diag – diagonal
That is a banded arrowhead matrix, and for it Cholesky decomposition cost insignificantly more than decomposition of original W. The matrix is not positive definite – Cholesky without square root should be used.
Now there is a temptation to use single constrain instead of set of constrain
, but that will not work.
should have to be linearized to be tractable, but it’s a second order condition – it’s linear part is zero, so linearization doesn’t constrain anything.
The same method could be used whenever we have to put constrain on the value of Gauss-Newton update, and shape of constrain in not important (or polygonal)
Now last touch – Interior point method has small parameter of it’s own. It’s called usually . In the “normal” method there is a nice rule for it update – take it as
(in the notation from wikipedia article –
is a value of constraint,
is a value of slack variable at the previous iteration) That rule usually explicitly stated in the articles about Interior Point Method(IPM) for Linear Programming, but omitted (as obvious probably) in the papers about IPM for nonlinear optimization
In our case (IPM for trust region) we don’t need update at all – we move boundary of the region with each iteration, so each
is an initial value. Have to remember,
is not a size of trust region, but strength of it’s enforcement.
Effectivness of compressed sensing in image processing and other stuff
I seldom post in the this blog now, mostly because I’m positing on twitter and G+ a lot lately. I still haven’t figured out which post should go where – blog, G+ or twitter, so it’s kind of chaotic for now.
What of interest is going on: There are two paper on the CVPR11 which claim that compressed sensing(sparse recovery) is not applicable to some of the most important computer vision tasks:
Is face recognition really a Compressive Sensing problem?
Are Sparse Representations Really Relevant for Image Classification?
Both paper claim that space of the natural images(or their important subsets) are not really sparse.
Those claims however dont’t square with claim of high effectiveness of compact signature of Random Ferns.
Could both of those be true? In my opinion – yes. Difference of two approaches is that first two paper assumed explicit sparsity – that is they enforced sparsity on the feature vector. Compressed signature approach used implicit sparsity – feature vector underling the signature is assumed sparse but is not explicitly reconstructed. Why compressed signature is working while explicit approach didn’t? That could be the case if image space is sparse in the different coordinate system – that is here one is dealing with the union of subspaces. Assumption not of the simple sparsity, but of the union of subspaces is called blind compressed sensing.
Now if we look at the space of the natural images it’s easy to see why it is low dimensional. Natural image is the image of some scene, an that scene has limited number of moving object. So dimension of space images of the scene is approximately the sum of degree of freedom of the objects(and camera) of the scene, plus effects of occlusions, illumination and noise. Now if the add strong enough random error to the scene, the image is stop to be the natural image(that is image of any scene). That mean manifold of the images of the scene is isolated – there is no natural images in it’s neighborhood. That hint that up to some error the space of the natural images is at least is the union of isolated low-dimensional manifolds. The union of mainfolds is obviously is more complex structure than the union of subspace, but methods of blind compressed sensing could be applicable to it too. Of cause to think about union of manifolds could be necessary only if the space of images is not union of subspace, which is obviously preferable case
Stuff and AR
I once asked, what’s 3d registration/reconstruction/pose estimation is about – optimization or statistics? The more I think about it, the more I convinced it’s at least 80% statistics. Often specifically optimization tricks like Tikhonov regularization have statistical underpinning. Stability of optimization is robust statistics(Yes I know, I repeat it way too often). Cost function formulation is a formulation for error distribution and define convergence speed.
Now unrelated(almost) AR stuff:
I already mentioned on Twitter that version of markerless tracker for which I did a lot of work is part of Samsung AR SDK (SARI) for Android and Bada. It was was shown at AP2011(Presentaion and also include nice Bada code). AR SDK presentation is here.
Some videos form presentation – Edi Bear game demo with non-reference tracking at the end of the video and less trivial elements of SLAM tracking. Other application of SARI SDK – PBI (This one seems use earlier version).
XKCD turtles

I’m Achilles!
I’m a turtle
I’m Spartacus!
I’m a turtle
I think therefore I am!
I’m a turtle
I’m ClearCase!
I’m a turtle
I am the alpha and the omega!
I’m a turtle
via xkcd
Robust estimators III: Into the deep
Cauchy estimator have some nice properties (Gonzales et al “Statistically-Efficient Filtering in Impulsive Environments: Weighted Myriad Filter” 2002):
By tuning in
it can approximate either least squares (big ), or mode – maximum of histogram – of sample set (small
). For small
estimator behave the same way as power law distribution estimator with small
.
Another property is that for several measurements with different scales estimator of their sum will be simple
which is convenient for estimation of random walks
I heard convulsion in the sky,
And flight of angel hosts on high,
And monsters moving in the deep
Those verses from The Prophet by A.Pushkin could be seen as metaphor of profound mathematical insight, encompassing bifurcations, higher dimensional algebra and murky depths of statistics.
I now intend to dive deeper into of statistics – toward “data depth”. Data depth is a generalization of median concept to multidimensional data. Remind you that median can be seen either as order parameter – value dividing the higher half of measurements from lower, or geometrically, as the minimum of norm. Second approach lead to geometric median, about which I already talked about.
First approach to generalizations of median is to try to apply order statistics to multidimensional vectors.The idea is to make some kind of partial order for n-dimensional points – “depth” of points, and to choose as the analog of median the point of maximum depth.
Basically all data depth concepts define “depth” as some characterization of how deep points are reside inside the point cloud.
Historically first and easiest to understand was convex hull approach – make convex hull of data set, assign points in the hull depth 1, remove it, get convex hull of points remained inside, assign new hull depth 2, remove etc.; repeat until there is no point inside last convex hull.
Later Tukey introduce similar “halfspace depth” concept – for each point X find the minimum number of points which could be cut from the dataset by plane through the point X. That number count as depth(see the nice overview of those and other geometrical definition of depth at Greg Aloupis page)
In 2002 Mizera introduced “global depth”, which is less geometric and more statistical. It start with assumption of some loss function (“criterial function” in Mizera definition) of measurement set
. This function could be(but not necessary) cumulative probability distribution. Now for two parameters
and
,
is more fit with respect
if for all
.
is weakly optimal with respect to
if there is nor better fit parameter with respect to
. At last global depth of
is the minimum possible size of
such that
is not weakly optimal with respect to
– reminder of measurements. In other words global depth is minimum number of measurements which should be removed for
stop being weakly optimal. Global depth is not easy to calculate or visualize, so Mizera introduce more simple concept – tangent depth.
Tangent depth defined as . What does it mean? Tangent depth is minimum number of “bad” points – such points that for specific direction loss function for themis growing.
Those definitions of “data depth” allow for another type of estimator, based not on likelihood, but on order statistics -maximum depth estimators. The advantage of those estimators is robustness(breakdown point ~25%-33%) and disadvantage – low precision (high bias). So I wouldn’t use them for precise estimation, but for sanity check or initial approximation. In some cases they could be computationally more cheap than M-estimators. As useful side effect they also give some insight into structure of dataset(it seems originally maximum depth estimators was seen as data visualization tool). Depth could be good criterion for outliers rejection.
Disclaimer: while I had very positive experience with Cauchy estimator, data depth is a new thing for me.I have yet to see how useful it could be for computer vision related problems.
Robust estimators II
In this post I was complaining that I don’t know what breakdown point for redescending M-estimators is. Now I found out that upper bound for breakdown point of redescending of M-estimators was given by Mueller in 1995, for linear regression (that is statisticians word for simple estimation of p-dimensional hyperplane):
– number of measurements and
is little tricky: it is a maximum number of measurement vectors X lying in the same p-dimensional hyperplane. If number of measurements N >> p that mean breakdown point is near 50% – You can have half measurement results completely out of the blue and estimator will still work.
That only work if the error present only in results of measurements, which is reasonable condition – in most cases we can move random error from x part to y part.
Now which M-estimators attain this upper bound?
The condition is “slow variation”(Mizera and Mueller 1999)
Mentioned in previous post Cauchy estimator is satisfy that condition:
and its derivative
In practice we always work with , not
so Cauchy estimator is easy to calculate.
Rule of the thumb: if you don’t know which robust estimator to use, use Cauchy: It’s fast(which is important in real time apps), its easy to understand, it’s differentiable, and it is as robust as possible (that is for redescending M-estimator)
Robust estimators – understand or die… err… be bored trying
This is continuation of my attempt to understand internal mechanics of robust statistics. First I want to say that robust statistics “just works”. It’s not necessary to have deep understanding of it to use it and even to use it creatively. However without that deeper understanding I feel myself kind of blind. I can modify or invent robust estimators empirically, but I can not see clearly the reasons, why use this and not that modification.
Now about robust estimators. They could be divided into two groups: maximum likelihood estimators(M-estimators), which in case of robust statistics usually, but not always are redescending estimators (notable not redescending estimator is norm), and all the rest of estimators.
This second “all the rest” group include subset of L-estimators(think of median, which is also M-estimator with norm.Yea, it’s kind of messy), S-estimators (use global scale estimation for all the measurements) R-estimators, which like L-estimator use order statistics but use it for weights. There may be some others too, but I don’t know much about this second group.
It’s easy to understand what M-estimators do: just find the value of parameter which give maximum probability of given set of measurements.
or
which give us traditional M-estimator form
or
,
Practically we are usually work not with measurements per se, but with some distribution of cost function of the measurements
, so it become
it’s the same as the previous equation just defined in such a way as to separate statistical part from cost function part.
Now if we make a set of weights it become
We see that it could be considered as “nonlinear least squares”, which could be solved with iteratively reweighted least squares
Now for second group of estimators we have probability of joint distribution
All the global factors – sort order, global scale etc. are incorporated into measurements dependence.
It seems the difference between this formulation of second group of estimators and M-estimator is that conditional independence assumption about measurements is dropped.
Another interesting thing is that if some of measurements are not dependent on others, this formulation can get us bayesian network
Now lets return to M-estimators. M-estimator is defined by assumption about probability distribution of the measurements.
So M-estimator and probabilistic distribution through which it is defined are essentially the same. Least squares, for example, is produced by normal(gausssian) distribution. Just take sum of logarithms of gaussian and you get least squares estimator.
If we are talking about normal (pun intended), non-robust estimator, their defining feature is finite variance of distribution.
We have central limit theorem which saying that for any distribution mean value of samples will have approximately normal(or Gaussian) distribution.
From this follow property of asymptotic normality – for estimator with finite variance its distribution around true value of parameter approximate normal distribution.
We are discussing robust estimators, which are stable to error and have “thick-tailed” distribution, so we can not assume finite variance of distribution.
Nevertheless to have “true” result we want some form of probabilistic convergence of measurements to true value. As it happens such class of distribution with infinite variance exists. It’s called alpha-stable distributions.
Alpha stable distribution are those distributions for which linear combination of random variables have the same distribution, up to scale factor. From this follow analog of central limit theorem for stable distribution.
The most well known alpha-stable distribution is Cauchy distribution, which correspond to widely used redescending estimator
Cauchy distribution can be generalized in several way, including recent GCD – generalized Cauchy distribution(Carrillo et al), with density function
and estimator
Carrillo also introduce Cauchy distribution-based “norm” (it’s not a real norm obviously) which he called “Lorentzian norm”
is correspond classical Cauchy distribution
He successfully applied Lorentzian norm based basis pursuit to compressed sensing problem, which support idea that compressed sensing and robust statistics are dual each other.
Is Robust Statistics have formal mathematical foundation?
As I have already written I have a trouble understanding what robust estimators actually estimate from probabilistic or other formal point of view. I mean estimators which are not maximum likelihood estimators. There is a formal definition which doesn’t explain a lot to me. It looks like estimator estimate some quantity, and we know how good we are at estimating it, but how we know what we are actually estimate? Or does this question even make sense? But that is actually a minor bummer. A problem with understanding outliers is a lot worse for me. A breakdown point is a fundamental concept in robust statistics. And breakdown point is defined as a relative number of outliers in the sample set. The problem is, it seems there is no formal definition of outlier in statistics or probability theory. We can talk about mixture models, and tail distributions but those concepts are not quite consistent with breakdown point. Breakdown point looks like it belong to area of optimization/topology, not statistics. Could it be that outliers could be defined consistently only if we have some additional structural information/constraints beside statistical (distribution)? That inability to reconcile statistics and optimization is a problem which causing cognitive headache for me.