# Mirror Image

## Minimum sum of distance vs L1 and geometric median

All this post is just a more detailed explanation of the end of the previous post.
Assume we want to estimating a state $x \in R^n$ from $m \gg n$ noisy linear measurements $y \in R^m$, $y = Ax + z$, $z$– noise with outliers, like in the paper by Sharon, Wright and Ma Minimum Sum of Distances Estimator: Robustness and Stability
Sharon at al show that minimum $L_1$ norm estimator, that is
$arg \min_{x} \sum_{i=1}^{m} \| a_i^T x - y_i \|_{1}$
is a robust estimator with stable breakdown point, not depending on the noise level. What Sharon did was to use as cost function the sum of absolute values of all components of errors vector. However there are exists another approach.
In one-dimensional case minimum $L_1$ norm is a median.But there exist generalization of median to $R^n$geometric median. In our case it will be
$arg \min_{x} \sum_{i=1}^{m} \| A x - y_i \|_{2}$
That is not a least squares – minimized the sum of $L_2$ norm, not the sum of squares of $L_2$ norm.
Now why is this a stable and robust estimator? If we look at the Jacobian
$\sum_{i=1}^{m} A^T \frac{A x - y_i}{ \| A x - y_i \|_{2}}$
we see it’s asymptotically constant, and it’s norm doesn’t depend on the norm of the outliers. While it’s not a formal proof it’s quite intuitive, and can probably be formalized along the lines of Sharon paper.
While first approach with $L_1$ norm can be solved with linear programming, for example simplex method and interior point method, the second approach with $L_2$ norm can be solved with second order cone programming and …surprise, interior point method again.
For interior point method, in both cases original cost function is replaced with
$\sum f$
And the value of $f$ is defined by constraints. For $L_1$
$f_{i} \ge a_ix-y_i$, $f_{i} \ge -a_ix+y_i$
Sometimes it’s formulated by splitting absolute value is into the sum of positive and negative parts
$f_{+_{i}} \ge a_ix-y_i$, $f_{-_{i}} \ge -a_ix+y_i$, $f_{+_{i}} \ge 0$, $f_{-_{i}} \ge 0$
And for $L_2$ it’s a simple
$f_i \ge \| A x - y_i \|_{2}$
Formulations are very similar, and stability/performance are similar too (there was a paper about it, just had to dig it out)

10, April, 2011

## Genetic algorithms – alternative to building block hypothesis

Genetic algorithms and especially their subset, Genetic programming were always fascinating me. My interest was fueled by on and off work with Global optimization, and because GA just plain cool. One of the most interesting thing about GA is that they work quite good on some “practical” problems, while there is no comprehensive theoretical explanation why they should work so well (Of cause they are not always so useful. There was a work on generating feature descriptors with GA, and results were less then impressive).
Historically, first and most well known explanation for GA efficiency was the building block hypothesis. Building block hypothesis is very intuitive. It say that there are exist “building blocks” – small parts of genome with high fitness. GA work is randomly searching for those building blocks and combining them afterward, until global optimum is found. Searching is mostly done with mutation, and combining found building block with crossover (analog of exchange of genetic material in real biological reproduction).
However building blocks have a big problem, and that problem is crossover operator. If building block hypothesis is true, GA work better if integrity of building blocks preserved as much as possible. That is there should be only few “cut and splice” points in the sequence. But practically GA with “uniform” crossover – massive uniform mixing of two genomes, work better then GA with few crossover points.
Recently a new theory of GA efficiency appears, that try to deal with uniform crossover problem – Generative fixation” hypothesis. The idea behind “generative fixation” is that GA works in continuous manner, fixing stable groups of genes with high fitness and continuing search on the rest of genome, reducing search space step by step. From optimization point of view GA in that case works in manner similar to Conjugate gradient method, reducing (or trying) dimensionality of search space in each step. Now about “uniform crossover” – why it works better: subspace, to which search space reduced, should be stable (in stability theory sense). Small permutations wouldn’t case solution to diverge. With uniform crossover of two close solutions resulting solution still will be nearby attractive subspace. The positive effect of uniform crossover is that it randomize solution, but without exiting already found subspace. That randomization clearing out useless “stuck” genes (also called “hitchhikers”), and help to escape local minima.
Interesting question is, what if subspace is not “fixed bits” and even not linear – that is if it’s a manifold. In that case (if hypothesis true) found genes will not be “fixed”, but will “drift” in systematic manners, according to projection of manifold on the semi-fixed bits.
Now to efficiency GA for “practical” task. If the “generative fixation” theory is correct, “practical” task, for which GA work well could be the problems for which dimensionality reduction is natural, for example if solution belong to low-dimensional attractive manifold. (addenum 7/11)That mean GA shouldn’t work well for problem which allow only combinatorial search. Form this follow that if GA work for compressed sensing problem it should comply with Donoho-Tanner Phase Transition diagram.
Overall I like this new hypothesis, because it bring GA back to family of mathematically natural optimization algorithms. That doesn’t mean the hypothesis is true of cause. Hope there will be some interest, more work, testing and analysis. What is clear that is current building block hypothesis is not unquestionable.

PS 7/11:
Simple googling produced paper by Beyer An Alternative Explanation for the Manner in which Genetic Algorithms Operate with quite similar explanation how uniform crossover works.

6, November, 2010

## Recursive Interferometry – Phase Congruency?

Thanks to Igor Carron I’ve watched a great videolecture by Stephane Mallat High dimensional classification by recursive interferometry. Actually I watched it twice, and I think I understand most of it now))). And it was not about compressed sensing, not even about manifold learning much . It was mostly about a new application of wavelets . How to use wavelets to produce low dimensional data (image descriptor if we are talking about computer vision) from high dimensional data(that is image). The idea is to transcend linear representation and use nonlinear operation – absolute value of wavelet. Absolute value – square root of wavelet square carry information of frequencies differences. It’s invert Fourier transform have new harmonics – differences of frequencies of original function. That interference of harmonics of original image. Now it was reminding me something. Yep – phase congruency (pdf). Phase congruency also use absolute value of wavelet(windowed Fourier). It seems to me it has perfect explanation. Interference pattern defined by how in-phase both wave are. That is it’s like a phase congruency taken into each point. Phase congruency edge-detector is in fact finding maximum of somehow normalized interference pattern. In that sense this Mallat’s method producing invariants from high-dimensional data is analogous to producing sketch from photo.
Ok, enough rambling for now.

30, September, 2010

## Compressive Sensing and Computer Vision

Thanks to Igor Carron for pointing out this video lecture
Compressive Sensing for Computer Vision: Hype vs Hope
Basically it’s about imagining the lower-dimensional signal(image) as projection by rectangular matrix from mostly zero high-dimensional vector. It happens that this sparse high-dimensional vector can be restored if the matrix is almoste orthonormal (Restricted Isometry Property). Discrete Fourier Transform and random matrices have that property.
This sparse vector could be considered as classification space for original signal. So application of Compressive Sensing to Computer Vision is mostly about classification or recognition. As methods used by CS are convex and linear programming those are not run-time methods, and would not help much in real-time tracking. There is CS-inspired advise at the end of the lecture, about trying to replace $L^{2}$ norm optimization with $L^{1}$ norm. That could be actually helpful in some cases. If $L^{1}$ approximated as iteratively reweighted $L^{2}$ it’s essentially the same as robustification of least square method.

7, December, 2009

## Learning mathematics is good for the soul

At The n-Category Café David Corfield is talking about mathematical emotions. This is about ‘emotions belonging to mathematical thinking’, specific feeling related to mathematical intuition, meaning and sense of the rightness. In the end he is referencing to spiritual motivation of mathematics. From myself I add that those thoughts are closely related to the last book of Neal StephensonAnathem

19, November, 2009 Posted by | Uncategorized | , , , | Comments Off on Learning mathematics is good for the soul

## New polymath project announced – deterministic way to find primes

New polymath project – massively collaborative mathematic project announced at the polymath blogdeterministic way to find primes – given an integer k, is guaranteed to find a prime of at least k digits in length of time polynomial in k.

29, July, 2009 Posted by | Uncategorized | , , | Comments Off on New polymath project announced – deterministic way to find primes