Mirror Image

Minimum sum of distance vs L1 and geometric median

All this post is just a more detailed explanation of the end of the previous post.
Assume we want to estimating a state $x \in R^n$ from $m \gg n$ noisy linear measurements $y \in R^m$, $y = Ax + z$, $z$– noise with outliers, like in the paper by Sharon, Wright and Ma Minimum Sum of Distances Estimator: Robustness and Stability
Sharon at al show that minimum $L_1$ norm estimator, that is
$arg \min_{x} \sum_{i=1}^{m} \| a_i^T x - y_i \|_{1}$
is a robust estimator with stable breakdown point, not depending on the noise level. What Sharon did was to use as cost function the sum of absolute values of all components of errors vector. However there are exists another approach.
In one-dimensional case minimum $L_1$ norm is a median.But there exist generalization of median to $R^n$geometric median. In our case it will be
$arg \min_{x} \sum_{i=1}^{m} \| A x - y_i \|_{2}$
That is not a least squares – minimized the sum of $L_2$ norm, not the sum of squares of $L_2$ norm.
Now why is this a stable and robust estimator? If we look at the Jacobian
$\sum_{i=1}^{m} A^T \frac{A x - y_i}{ \| A x - y_i \|_{2}}$
we see it’s asymptotically constant, and it’s norm doesn’t depend on the norm of the outliers. While it’s not a formal proof it’s quite intuitive, and can probably be formalized along the lines of Sharon paper.
While first approach with $L_1$ norm can be solved with linear programming, for example simplex method and interior point method, the second approach with $L_2$ norm can be solved with second order cone programming and …surprise, interior point method again.
For interior point method, in both cases original cost function is replaced with
$\sum f$
And the value of $f$ is defined by constraints. For $L_1$
$f_{i} \ge a_ix-y_i$, $f_{i} \ge -a_ix+y_i$
Sometimes it’s formulated by splitting absolute value is into the sum of positive and negative parts
$f_{+_{i}} \ge a_ix-y_i$, $f_{-_{i}} \ge -a_ix+y_i$, $f_{+_{i}} \ge 0$, $f_{-_{i}} \ge 0$
And for $L_2$ it’s a simple
$f_i \ge \| A x - y_i \|_{2}$
Formulations are very similar, and stability/performance are similar too (there was a paper about it, just had to dig it out)

10, April, 2011

L1, robust statisrics and compressed sensing

Anyone who did 3D reconstruction and camera pose estimation know, that outliers one of the main, if not the main problem there. There are several ways to deal with outliers, RANSAC and trimming are probably most common. Both of them has major drawback though – they based on the initial error estimation. But for example, in pose estimation, situation where the wrong values have initial error order of magnitude less then correct values is quite common. RANSAC and trimming would make situation worse in that case. What really work there are robust estimators, which is, in many cases just statisticians name for reweighted iterative least square.
Now why and how robust estimators works is really interesting. Basically robust estimator is a maximum likelihood estimator for non-normal distribution, that is distribution with “thick” tail. One of the simplest of robust estimators is L1, which correspond to Laplace distribution. Laplace distribution descend more slow than normal distribution, so it’s obviously more robust. And now the really interesting things start. L1 estimator is the fundamental concept of compressed sensing. And compressed sensing is all about finding “sparse” solution, that is solution which are mostly zeros, but with few components are quite big. And what outliers are? They are exactly “sparse” big components of error vector. If we have linear system with noise in right part, and the noise is dominated by small number of really big outliers then, as Terence Tao pointed out, we can multiply both part of the system on the appropriate matrix and get a sparse system of equations for outliers. That would be a classical compressed sensing problem, for which L1 minimization works perfectly. Recently it was proven, using compressed sensing inspired technique, that L1 estimator for system with outliers really behave similar to L1 minimizer for sparse solutions – it has stable breaking point(Sharon, Wright, Ma, Minimum Sum of Distances Estimator: Robustness and Stability).
That make me think about things I really don’t understand – what is connection between other, redescending estimators and L1 estimator? In practical applications redescending estimators often works better than L1. But redescending estimators practically is not much different form trimming. Does it mean that they are just convenient shortcuts, and in general case L1 is more robust?(One drawback of redescending estimator that it can has multiple local minima) Which assumptions about outliers we should do to to choose most appropriate estimator? I would like to read some theory of redescending estimators, their breaking point and especially their relation to L1, but so far not sure even where to start…
(PPS In this post I talk more about Mueller work on redescending M-estimators which partially answer the question)

PS Another interesting (for me that is, for someone else it could be trivial) problem is dimensionality. For 1-dimensional variable L1 and distance is the same. For vector they are not. So for vector-valued variables estimator “minimum sum of distance” estimator is not the same as L1 estimator. Would be L1 more robust than “minimum sum of distance” for vectors? Compressed sensing logic say that it should, but L1 estimator is anisotropic, it depend on coordinate system. That is for L1 to be effective the outliers should be aligned with coordinate system. Here there is the difference between overall dimensionality of the problem -number of samples and “micro” dimensionality – dimensionality of each sample. I’ll try to sort it out later.

2, April, 2011