## Stuff and AR

I once asked what 3D registration/reconstruction/pose estimation is really about: optimization or statistics? The more I think about it, the more I am convinced it’s at least 80% statistics. Optimization tricks such as Tikhonov regularization often have statistical underpinnings. Stability of optimization is robust statistics (yes, I know, I repeat it way too often). Formulating a cost function amounts to choosing an error distribution, and that choice also determines convergence speed.

Now some (almost) unrelated AR stuff:

I already mentioned on Twitter that a version of the markerless tracker on which I did a lot of work is part of the Samsung AR SDK (SARI) for Android and Bada. It was shown at AP2011 (the presentation also includes some nice Bada code). The AR SDK presentation is here.

Some videos from the presentation: the Edi Bear game demo, with non-reference tracking at the end of the video, and less trivial elements of SLAM tracking. Another application of the SARI SDK is PBI (this one seems to use an earlier version).

## XKCD turtles

I’m Achilles!

I’m a turtle

I’m Spartacus!

I’m a turtle

I think therefore I am!

I’m a turtle

I’m ClearCase!

I’m a turtle

I am the alpha and the omega!

I’m a turtle

via xkcd

## Robust estimators III: Into the deep

The Cauchy estimator has some nice properties (Gonzales et al., “Statistically-Efficient Filtering in Impulsive Environments: Weighted Myriad Filter”, 2002):

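By tuning the scale parameter $\gamma$ in

$$\rho(x) = \ln\left(1 + \frac{x^2}{\gamma^2}\right)$$

it can approximate either least squares (big $\gamma$) or the mode, that is the maximum of the histogram, of the sample set (small $\gamma$). For small $\gamma$ the estimator behaves the same way as a power law distribution estimator with a small exponent.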

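Another property is that for several independent measurements with different scales $\gamma_i$, the estimator for their sum is simply

$$\rho(x) = \ln\left(1 + \frac{x^2}{\left(\sum_i \gamma_i\right)^2}\right)$$

which is convenient for estimation of random walks.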
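The tuning behaviour can be tried on a toy data set. This is only a minimal sketch: it assumes the cost is written as the sum of `log(gamma^2 + (x - beta)^2)` (the sample myriad form), and the function name `myriad`, the brute-force grid search and the data are made up for illustration.

```python
import math

def myriad(samples, gamma, grid_steps=2001):
    """Brute-force minimizer of sum(log(gamma^2 + (x - beta)^2)) over beta."""
    lo, hi = min(samples), max(samples)
    best_beta, best_cost = lo, float("inf")
    for k in range(grid_steps):
        beta = lo + (hi - lo) * k / (grid_steps - 1)
        cost = sum(math.log(gamma ** 2 + (x - beta) ** 2) for x in samples)
        if cost < best_cost:
            best_beta, best_cost = beta, cost
    return best_beta

# Hypothetical data: a tight cluster near 1 plus a single outlier.
data = [1.0, 1.1, 0.9, 1.05, 10.0]
mean = sum(data) / len(data)

large_gamma_estimate = myriad(data, gamma=1000.0)  # should sit close to the mean
small_gamma_estimate = myriad(data, gamma=0.05)    # should sit inside the cluster
```

With a big `gamma` the outlier drags the estimate along with the mean; with a small `gamma` the estimate locks onto the densest group of samples and the outlier is ignored.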

I heard convulsion in the sky,

And flight of angel hosts on high,

And monsters moving in the deep

Those verses from The Prophet by A. Pushkin could be seen as a metaphor for profound mathematical insight, encompassing bifurcations, higher dimensional algebra and the murky depths of statistics.

I now intend to dive deeper into statistics, toward “data depth”. Data depth is a generalization of the median concept to multidimensional data. Recall that the median can be seen either as an order parameter, the value dividing the higher half of the measurements from the lower, or geometrically, as the minimum of the $L_1$ norm. The second approach leads to the geometric median, which I have already talked about.

The first approach to generalizing the median is to try to apply order statistics to multidimensional vectors. The idea is to define some kind of partial order for n-dimensional points, the “depth” of points, and to choose the point of maximum depth as the analog of the median.

Basically all *data depth* concepts define “depth” as some characterization of how deep the points reside inside the point cloud.

Historically the first and the easiest to understand was the convex hull approach: build the convex hull of the data set, assign the points on the hull depth 1, remove them, take the convex hull of the points that remain inside, assign the new hull depth 2, remove it, and so on; repeat until there are no points left inside the last convex hull.

Later Tukey introduced the similar “halfspace depth” concept: for each point X find the minimum number of points which could be cut from the dataset by a plane through the point X. That number counts as the depth (see the nice overview of those and other geometrical definitions of depth at Greg Aloupis’ page).
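The peeling procedure is short enough to sketch for 2D points. This is an illustration only: the helper names are hypothetical, the hull is computed with Andrew’s monotone chain, and points lying strictly on a hull edge (not at a vertex) are left for the next layer, which is one possible convention among several.

```python
def cross(o, a, b):
    """Z component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices only."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peeling_depth(points):
    """Assign depth 1 to the outer hull, 2 to the next hull, and so on."""
    remaining = set(points)
    depth = {}
    level = 1
    while remaining:
        hull = convex_hull(list(remaining))
        for p in hull:
            depth[p] = level
        remaining -= set(hull)
        level += 1
    return depth

# Hypothetical data: two nested squares and a centre point.
cloud = [(0, 0), (4, 0), (4, 4), (0, 4),
         (1, 1), (3, 1), (3, 3), (1, 3),
         (2, 2)]
depths = peeling_depth(cloud)
deepest = max(depths, key=depths.get)
```

The deepest point plays the role of the median; here it is the centre of the nested squares.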
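Halfspace depth can be approximated in 2D by trying many directions for the cutting line. The sketch below is an approximation under stated assumptions: it samples a fixed number of directions instead of enumerating the critical ones exactly, it counts points on the line as belonging to the halfplane, and the function name and data are made up.

```python
import math

def halfspace_depth(x, points, n_dirs=360):
    """Approximate Tukey depth of x: the smallest number of points found in
    a closed halfplane whose boundary line passes through x."""
    best = len(points)
    for k in range(n_dirs):
        angle = 2.0 * math.pi * k / n_dirs
        ux, uy = math.cos(angle), math.sin(angle)
        # Count points on the side of the line that the direction u points to.
        count = sum(1 for p in points
                    if (p[0] - x[0]) * ux + (p[1] - x[1]) * uy >= 0)
        best = min(best, count)
    return best

# Hypothetical data: two nested squares and a centre point.
cloud = [(0, 0), (4, 0), (4, 4), (0, 4),
         (1, 1), (3, 1), (3, 3), (1, 3),
         (2, 2)]
corner_depth = halfspace_depth((0, 0), cloud)
inner_depth = halfspace_depth((1, 1), cloud)
centre_depth = halfspace_depth((2, 2), cloud)
```

A corner of the cloud can be cut off on its own, so its depth is minimal, while any line through the centre leaves about half of the points on each side.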

In 2002 Mizera introduced “global depth”, which is less geometric and more statistical. It starts with the assumption of some loss function (“criterial function” in Mizera’s definition) $F_i(\theta)$ for each measurement $z_i$ of the measurement set $N$. This function could be (but does not have to be) a cumulative probability distribution. Now for two parameters $\theta$ and $\tilde{\theta}$, $\tilde{\theta}$ is *more fit* with respect to a subset $A \subseteq N$ if $F_i(\tilde{\theta}) < F_i(\theta)$ for all $z_i \in A$. $\theta$ is *weakly optimal* with respect to $A$ if there is no better fit parameter with respect to $A$. Finally, the *global depth* of $\theta$ is the minimum possible size of $A$ such that $\theta$ is *not* weakly optimal with respect to $N \setminus A$, the remainder of the measurements. In other words, *global depth* is the minimum number of measurements which should be removed for $\theta$ to stop being weakly optimal. Global depth is not easy to calculate or visualize, so Mizera introduced a simpler concept, *tangent depth*.

Tangent depth is defined as $d(\theta) = \min_{u \neq 0} \#\{ i : u^T \nabla_\theta F_i(\theta) \geq 0 \}$. What does it mean? Tangent depth is the minimum number of “bad” points, that is, points for which the loss function is growing in a specific direction.

Those definitions of “data depth” allow for another type of estimator, based not on likelihood but on order statistics: *maximum depth estimators*. The advantage of those estimators is robustness (breakdown point ~25%-33%), and the disadvantage is low precision (high bias). So I wouldn’t use them for precise estimation, only for a sanity check or an initial approximation. In some cases they could be computationally cheaper than M-estimators. As a useful side effect they also give some insight into the structure of the dataset (it seems maximum depth estimators were originally seen as a data visualization tool). Depth could be a good criterion for outlier rejection.

Disclaimer: while I have had a very positive experience with the Cauchy estimator, data depth is a new thing for me. I have yet to see how useful it could be for computer vision related problems.

## Robust estimators II

In this post I was complaining that I don’t know what the breakdown point for redescending M-estimators is. Now I have found out that an upper bound for the breakdown point of redescending M-estimators was given by Mueller in 1995 for linear regression (that is the statisticians’ word for simple estimation of a p-dimensional hyperplane):

$$\epsilon^* \leq \frac{1}{N} \left\lfloor \frac{N - \mathcal{N}(X) + 1}{2} \right\rfloor$$

Here $N$ is the number of measurements, and $\mathcal{N}(X)$ is a little tricky: it is the maximum number of measurement vectors X lying in the same p-dimensional hyperplane. If the number of measurements N >> p, that means the breakdown point is near 50%: you can have half of the measurement results completely out of the blue and the estimator will still work.

That only works if the error is present only in the results of the measurements, which is a reasonable condition: in most cases we can move the random error from the x part to the y part.

Now which M-estimators attain this upper bound?

The condition is “slow variation” (Mizera and Mueller 1999):

$$\lim_{t \to \infty} \frac{\rho(tx)}{\rho(t)} = 1$$

The Cauchy estimator mentioned in the previous post satisfies that condition:

$$\rho(x) = \ln\left(1 + \frac{x^2}{\gamma^2}\right)$$

and its derivative

$$\psi(x) = \frac{2x}{\gamma^2 + x^2}$$

In practice we always work with $\psi$, not $\rho$, so the Cauchy estimator is easy to calculate.

Rule of thumb: if you don’t know which robust estimator to use, use Cauchy. It’s fast (which is important in real time apps), it’s easy to understand, it’s differentiable, and it is as robust as possible (for a redescending M-estimator, that is).
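Slow variation is easy to check numerically. The sketch below assumes the condition means that the ratio rho(t*x)/rho(t) tends to 1 as t grows, and that the Cauchy cost is written as log(1 + (x/gamma)^2); the function names are hypothetical.

```python
import math

def rho_cauchy(x, gamma=1.0):
    """Cauchy cost: grows only logarithmically with the residual."""
    return math.log(1.0 + (x / gamma) ** 2)

def rho_least_squares(x):
    """Quadratic cost, for comparison."""
    return x * x

def variation_ratio(rho, x, t):
    """rho(t*x) / rho(t); tends to 1 as t grows for a slowly varying rho."""
    return rho(t * x) / rho(t)

# The Cauchy ratio creeps toward 1 (slowly, like 1 + ln(x)/ln(t)).
cauchy_ratios = [variation_ratio(rho_cauchy, 10.0, t) for t in (1e3, 1e50, 1e150)]

# The least squares ratio stays at x^2 however large t becomes.
squares_ratio = variation_ratio(rho_least_squares, 10.0, 1e3)
```

The convergence for the Cauchy cost is logarithmic, hence the very large values of `t`; the quadratic cost never gets there at all.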

## Robust estimators – understand or die… err… be bored trying

This is a continuation of my attempt to understand the internal mechanics of robust statistics. First I want to say that robust statistics “just works”. It’s not necessary to have a deep understanding of it to use it, or even to use it creatively. However, without that deeper understanding I feel kind of blind. I can modify or invent robust estimators empirically, but I cannot clearly see the reasons to use this modification and not that one.

Now about robust estimators. They can be divided into two groups: maximum likelihood estimators (M-estimators), which in robust statistics are usually, but not always, redescending estimators (a notable **non**-redescending estimator is the $L_1$ norm), and all the rest of the estimators.

This second “all the rest” group includes a subset of L-estimators (think of the median, which is also an M-estimator with the $L_1$ norm. Yeah, it’s kind of messy), S-estimators (which use a global scale estimation for all the measurements), and R-estimators, which like L-estimators use order statistics, but use them for weights. There may be some others too, but I don’t know much about this second group.

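It’s easy to understand what M-estimators do: just find the value of the parameter which gives the maximum probability of the given set of measurements,

$$\hat{\theta} = \arg\max_\theta \prod_i p(x_i \mid \theta)$$

or

$$\hat{\theta} = \arg\min_\theta \sum_i -\ln p(x_i \mid \theta)$$

which gives us the traditional M-estimator form

$$\hat{\theta} = \arg\min_\theta \sum_i \rho(x_i, \theta)$$

or

$$\sum_i \psi(x_i, \theta) = 0, \quad \psi(x, \theta) = \frac{\partial \rho(x, \theta)}{\partial \theta}$$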

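In practice we usually work not with the measurements per se, but with the distribution of some cost function $f(x_i, \theta)$ of the measurements, so it becomes

$$\hat{\theta} = \arg\min_\theta \sum_i \rho(f(x_i, \theta))$$

It’s the same as the previous equation, just defined in such a way as to separate the statistical part from the cost function part.

Now if we define a set of weights $w_i = \psi(f_i) / f_i$, where $f_i = f(x_i, \theta)$ and $\psi = \rho'$, the stationarity condition becomes

$$\sum_i w_i f_i \frac{\partial f_i}{\partial \theta} = 0$$

We see that it can be considered as “nonlinear least squares”, which can be solved with iteratively reweighted least squares.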
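Here is a minimal sketch of iteratively reweighted least squares for the simplest possible case, estimating a single location parameter, where the cost function is just the residual `x - theta`. It assumes Cauchy weights of the form `1 / (1 + (r/gamma)^2)` (that is psi(r)/r up to a constant factor); the names and data are hypothetical.

```python
def cauchy_weight(residual, gamma):
    """IRLS weight for the Cauchy cost: psi(r) / r up to a constant factor."""
    return 1.0 / (1.0 + (residual / gamma) ** 2)

def irls_location(samples, gamma=1.0, iterations=50):
    """Robust location estimate by iteratively reweighted least squares."""
    theta = sorted(samples)[len(samples) // 2]  # start from a middle sample
    for _ in range(iterations):
        weights = [cauchy_weight(x - theta, gamma) for x in samples]
        # Weighted least squares for a location parameter is a weighted mean.
        theta = sum(w * x for w, x in zip(weights, samples)) / sum(weights)
    return theta

# Hypothetical data: five inliers near 1 and two gross outliers.
data = [1.0, 1.1, 0.9, 1.05, 0.95, 10.0, 12.0]
plain_mean = sum(data) / len(data)
robust_estimate = irls_location(data)
```

For a nonlinear cost function each weighted least squares step would itself be something like a Gauss-Newton step, but the reweighting loop stays the same.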

Now for the second group of estimators we have the probability of the joint distribution

$$\hat{\theta} = \arg\max_\theta p(x_1, \ldots, x_n \mid \theta)$$

All the global factors (sort order, global scale etc.) are incorporated into the dependence between the measurements.

It seems the difference between this formulation of the second group of estimators and M-estimators is that the conditional independence assumption about the measurements is dropped.

Another interesting thing is that if some of the measurements do not depend on the others, this formulation can give us a Bayesian network.

Now let’s return to M-estimators. An M-estimator is defined by an assumption about the probability distribution of the measurements.

So an M-estimator and the *probability distribution* through which it is defined are essentially the same. Least squares, for example, is produced by the normal (Gaussian) distribution. Just take the sum of logarithms of Gaussians and you get the least squares estimator.
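Spelled out for a location parameter, the Gaussian case looks like this. The symbols $\theta$, $\sigma$ and $n$ are chosen here for illustration; the constant term and the factor $1/(2\sigma^2)$ do not affect the position of the minimum.

```latex
\begin{aligned}
p(x_i \mid \theta) &= \frac{1}{\sqrt{2\pi}\,\sigma}
    \exp\left(-\frac{(x_i - \theta)^2}{2\sigma^2}\right) \\
-\sum_{i=1}^{n} \ln p(x_i \mid \theta) &= \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \theta)^2
    + n \ln\left(\sqrt{2\pi}\,\sigma\right) \\
\hat{\theta} &= \arg\min_\theta \sum_{i=1}^{n} (x_i - \theta)^2
\end{aligned}
```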

If we are talking about normal (pun intended), non-robust estimators, their defining feature is finite variance of the distribution.

We have the central limit theorem, which says that for any distribution with finite variance the mean value of the samples will have an approximately normal (Gaussian) distribution.

From this follows the property of asymptotic normality: for an estimator with finite variance, its distribution around the true value of the parameter approximates a normal distribution.

We are discussing robust estimators, which are stable to errors and have “thick-tailed” distributions, so we *cannot* assume finite variance of the distribution.

Nevertheless, to have a “true” result we want some form of probabilistic convergence of the measurements to the true value. As it happens, such a class of distributions with infinite variance exists. It’s called alpha-stable distributions.

Alpha-stable distributions are those distributions for which a linear combination of independent random variables has the same distribution, up to location and scale. From this follows an analog of the central limit theorem for stable distributions.

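The most well known alpha-stable distribution is the Cauchy distribution, which corresponds to the widely used redescending estimator

$$\rho(x) = \ln\left(1 + \frac{x^2}{\gamma^2}\right)$$

The Cauchy distribution can be generalized in several ways, including the recent GCD, the generalized Cauchy distribution (Carrillo et al.), with density function

$$f(x) = a \gamma \left(\gamma^p + |x|^p\right)^{-2/p}, \quad a = \frac{p\,\Gamma(2/p)}{2\,\Gamma(1/p)^2}$$

and estimator

$$\hat{\theta} = \arg\min_\theta \sum_i \ln\left(\gamma^p + |x_i - \theta|^p\right)$$

Carrillo also introduced a Cauchy distribution-based “norm” (obviously it’s not a real norm), which he called the “Lorentzian norm”:

$$\|u\|_{LL_p} = \sum_i \ln\left(1 + \frac{|u_i|^p}{\gamma^p}\right)$$

$p = 2$ corresponds to the classical Cauchy distribution.

He successfully applied Lorentzian norm based basis pursuit to the compressed sensing problem, which supports the idea that compressed sensing and robust statistics are dual to each other.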

## Does Robust Statistics have a formal mathematical foundation?

As I have already written, I have trouble understanding what robust estimators actually estimate from a probabilistic or other formal point of view. I mean estimators which are *not* maximum likelihood estimators. There is a formal definition, which doesn’t explain a lot to me. It looks like an estimator estimates some quantity, and we know how good we are at estimating it, but how do we know what we are actually estimating? Or does this question even make sense? But that is actually a minor bummer. The problem with understanding outliers is a lot worse for me. The breakdown point is a fundamental concept in robust statistics. And the breakdown point is defined as a relative number of outliers in the sample set. The problem is, it seems there is no formal definition of an outlier in statistics or probability theory. We can talk about mixture models and tail distributions, but those concepts are not quite consistent with the breakdown point. The breakdown point looks like it belongs to the area of optimization/topology, not statistics. Could it be that outliers can be defined consistently only if we have some additional structural information/constraints besides the statistical (distribution) ones? That inability to reconcile statistics and optimization is a problem which causes me a cognitive headache.

## Minimum sum of distance vs L1 and geometric median

This whole post is just a more detailed explanation of the end of the previous post.

Assume we want to estimate a state $x \in \mathbb{R}^n$ from $m$ noisy linear measurements $y \in \mathbb{R}^m$, $y = Ax + e$, where $e$ is noise with outliers, as in the paper by Sharon, Wright and Ma, Minimum Sum of Distances Estimator: Robustness and Stability.

Sharon et al. show that the minimum $L_1$ norm estimator, that is

$$\hat{x} = \arg\min_x \|y - Ax\|_1$$

is a robust estimator with a stable breakdown point that does not depend on the noise level. What Sharon did was use the sum of absolute values of all components of the error vector as the cost function. However, there exists another approach.

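In the one-dimensional case the minimum $L_1$ norm is the median. But there exists a generalization of the median to $\mathbb{R}^n$: the geometric median. In our case, with vector-valued measurements $y_i$ and the corresponding blocks $A_i$ of the matrix, it will be

$$\hat{x} = \arg\min_x \sum_i \|y_i - A_i x\|_2$$

That is not least squares: what is minimized is the sum of $L_2$ norms, not the sum of squares of norms.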

Now why is this a stable and robust estimator? If we look at the Jacobian

$$J = -\sum_i \frac{A_i^T (y_i - A_i x)}{\|y_i - A_i x\|_2}$$

we see that it’s asymptotically constant: each term is built from a unit vector, so its norm doesn’t depend on the norm of the outliers. While it’s not a formal proof, it’s quite intuitive, and can probably be formalized along the lines of Sharon’s paper.
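For plain points (no measurement matrix) the geometric median can be computed with Weiszfeld’s iteration, which is itself a reweighted least squares scheme with weights equal to one over distance. This is a swapped-in illustration of the geometric median, not the estimator from the Sharon paper; the function name, the `eps` guard and the data are made up.

```python
import math

def geometric_median(points, iterations=200, eps=1e-12):
    """Weiszfeld iteration: minimizes the sum of Euclidean distances to points."""
    dim = len(points[0])
    # Start from the ordinary mean.
    x = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(iterations):
        numerator = [0.0] * dim
        denominator = 0.0
        for p in points:
            dist = math.sqrt(sum((p[k] - x[k]) ** 2 for k in range(dim)))
            w = 1.0 / max(dist, eps)  # guard against landing exactly on a point
            for k in range(dim):
                numerator[k] += w * p[k]
            denominator += w
        x = [n / denominator for n in numerator]
    return x

# Hypothetical data: the corners of a unit square plus one far outlier.
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
mean = [sum(p[k] for p in pts) / len(pts) for k in range(2)]
median = geometric_median(pts)
```

Moving the outlier from (100, 100) to (1000, 1000) would drag the mean ten times further but would barely move the geometric median, since the outlier contributes only a unit vector to the gradient.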

While the first approach, with the $L_1$ norm, can be solved with linear programming, for example the simplex method or an interior point method, the second approach, with the $L_2$ norm, can be solved with second order cone programming and… surprise, an interior point method again.

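For the interior point method, in both cases the original cost function is replaced with

$$\min \sum_i t_i$$

and the value of $t$ is defined by the constraints. For $L_1$

$$-t_i \leq (y - Ax)_i \leq t_i$$

Sometimes it’s formulated by splitting the absolute value into the sum of positive and negative parts

$t_i = t_i^+ + t_i^-$, $(y - Ax)_i = t_i^+ - t_i^-$, $t_i^+ \geq 0$, $t_i^- \geq 0$

And for $L_2$ it’s simply

$$\|y_i - A_i x\|_2 \leq t_i$$

The formulations are very similar, and stability/performance are similar too (there was a paper about it, I just have to dig it out).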

## L1, robust statistics and compressed sensing

Anyone who has done 3D reconstruction and camera pose estimation knows that outliers are one of the main problems there, if not the main one. There are several ways to deal with outliers; RANSAC and trimming are probably the most common. Both of them have a major drawback though: they are based on the initial error estimation. But in pose estimation, for example, a situation where the wrong values have an initial error an order of magnitude smaller than the correct values is quite common. RANSAC and trimming would make the situation worse in that case. What really works there are robust estimators, which in many cases is just the statisticians’ name for iteratively reweighted least squares.

Now why and how robust estimators work is really interesting. Basically a robust estimator is a maximum likelihood estimator for a non-normal distribution, that is a distribution with a “thick” tail. One of the simplest robust estimators is L1, which corresponds to the Laplace distribution. The Laplace distribution decays more slowly than the normal distribution, so large errors are penalized less and the estimator is more robust. And now the really interesting things start. The L1 estimator is the fundamental concept of compressed sensing. And compressed sensing is all about finding a “sparse” solution, that is a solution whose components are mostly zeros, with a few quite big components. And what are outliers? They are exactly the “sparse” big components of the error vector. If we have a linear system with noise in the right-hand side, and the noise is dominated by a small number of really big outliers, then, as Terence Tao pointed out, we can multiply both sides of the system by an appropriate matrix and get a sparse system of equations for the outliers. That would be a classical compressed sensing problem, for which L1 minimization works perfectly. Recently it was proven, using a compressed sensing inspired technique, that the L1 estimator for a system with outliers really behaves similarly to the L1 minimizer for sparse solutions: it has a stable breakdown point (Sharon, Wright, Ma, Minimum Sum of Distances Estimator: Robustness and Stability).
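The “multiply by an appropriate matrix” step can be written out in a few lines. The symbols here are assumptions for illustration: $y = Ax + e$ is the system with outliers, and $F$ stands for any matrix whose rows annihilate the columns of $A$.

```latex
\begin{aligned}
y &= A x + e, \qquad e \text{ sparse (a few large outliers)} \\
F A &= 0 \;\Rightarrow\; F y = F A x + F e = F e \\
\hat{e} &= \arg\min_{e'} \|e'\|_1 \quad \text{subject to} \quad F e' = F y \\
A \hat{x} &= y - \hat{e}
\end{aligned}
```

The third line is a standard sparse recovery problem for the outlier vector, and it is equivalent to minimizing $\|y - Ax\|_1$ over $x$ directly, which is exactly the L1 estimator.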

That makes me think about things I really don’t understand: what is the connection between the other, redescending estimators and the L1 estimator? In practical applications redescending estimators often work better than L1. But redescending estimators are practically not much different from trimming. Does it mean that they are just convenient shortcuts, and in the general case L1 is more robust? (One drawback of a redescending estimator is that it can have multiple local minima.) Which assumptions about outliers should we make to choose the most appropriate estimator? I would like to read some theory of redescending estimators, their breakdown point and especially their relation to L1, but so far I’m not even sure where to start…

(PPS In this post I talk more about Mueller’s work on redescending M-estimators, which partially answers the question.)

PS Another interesting (for me, that is; for someone else it could be trivial) problem is dimensionality. For a 1-dimensional variable L1 and distance are the same. For a vector they are not. So for vector-valued variables the “minimum sum of distances” estimator is not the same as the L1 estimator. Would L1 be more robust than “minimum sum of distances” for vectors? Compressed sensing logic says that it should, but the L1 estimator is anisotropic: it depends on the coordinate system. That is, for L1 to be effective the outliers should be aligned with the coordinate system. Here there is a difference between the overall dimensionality of the problem (the number of samples) and the “micro” dimensionality (the dimensionality of each sample). I’ll try to sort it out later.
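The anisotropy is easy to see on a single 2D error vector: rotating the coordinate system changes its L1 norm but not its Euclidean length. A tiny sketch with hypothetical helper names:

```python
import math

def l1_norm(v):
    return sum(abs(c) for c in v)

def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))

def rotate(v, angle):
    """Rotate a 2D vector counter-clockwise by the given angle in radians."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

outlier = (1.0, 0.0)                    # error aligned with a coordinate axis
rotated = rotate(outlier, math.pi / 4)  # same error seen in a rotated frame

l1_aligned, l1_rotated = l1_norm(outlier), l1_norm(rotated)
l2_aligned, l2_rotated = l2_norm(outlier), l2_norm(rotated)
```

The axis-aligned error is sparse (one nonzero component) and cheap in L1; after a 45 degree rotation the same error has two equal components and costs a factor of sqrt(2) more, while its distance is unchanged.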

## How Kinect depth sensor works – stereo triangulation?

Kinect uses a depth sensor produced by PrimeSense, but how exactly it works is not obvious at first glance. Someone here, who claimed to be a specialist, assured me that the PrimeSense sensor uses a time-of-flight depth camera. Well, he was wrong. In fact PrimeSense explicitly says they are not using time-of-flight but something they call “light coding”, with a standard off-the-shelf CMOS sensor which is not capable of extracting the time of return from modulated light.

Daniel Reetz did excellent work taking IR photos of the Kinect laser emitter and analyzing its characteristics. He confirms PrimeSense’s statement: the IR laser is not modulated. All the laser does is project a static pseudorandom pattern of specks onto the environment. PrimeSense uses only one IR sensor. How is it possible to extract depth information from a single IR image of the speck pattern? Stereo triangulation requires two images to get the depth of each point (speck). Here is the trick: there is actually not one image but two. One image is what we see in the photo: the image of the specks captured by the IR sensor. The second image is invisible: it’s the hardwired pattern of specks which the laser projects. That second image should be hardcoded into the chip logic. Those images are not equivalent: there is some distance between the laser and the sensor, so the images correspond to different camera positions, and that allows the use of stereo triangulation to calculate the depth of each speck.

The difference here is that the second image is “virtual”: the position of the second point y_2 is already hardcoded into memory. Because the laser and the sensor are aligned, the task becomes even easier: all one has to do is measure the horizontal offset of the speck in the first image relative to its hardcoded position (after correcting lens distortion, of course).

That also explains the pseudorandom pattern of the specks. A pseudorandom pattern makes matching specks between the two images easier, as each speck has a locally different neighborhood. Can it be called a “structured light” sensor? With some stretch of the definition. Structured light usually projects a grid of regular lines instead of pseudorandom points. At least PrimeSense objects to calling their method “structured light”.
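Under the simplest reading of this, the stored pattern behaves like a rectified second camera view, so the horizontal offset is an ordinary stereo disparity and depth follows from similar triangles. The sketch below rests on that assumption; the function name is hypothetical and the focal length and baseline are illustrative numbers, not PrimeSense specifications.

```python
def depth_from_offset(offset_px, focal_length_px, baseline_m):
    """Rectified stereo triangulation: depth = focal length * baseline / disparity."""
    if offset_px <= 0:
        raise ValueError("offset must be positive for a point at finite depth")
    return focal_length_px * baseline_m / offset_px

# Illustrative values only (assumptions, not taken from any datasheet).
FOCAL_LENGTH_PX = 580.0  # focal length of the IR camera, in pixels
BASELINE_M = 0.075       # distance between the laser and the sensor, in metres

near = depth_from_offset(58.0, FOCAL_LENGTH_PX, BASELINE_M)
far = depth_from_offset(29.0, FOCAL_LENGTH_PX, BASELINE_M)
```

Halving the offset doubles the depth, which is also why the depth resolution of such a sensor gets worse quickly with distance.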

## New laptop, new Ubuntu

Got myself a new and shiny Asus u35jc-a1 and started to dual boot Ubuntu on it. I have Ubuntu as wubi (Ubuntu in a Windows file) on my old desktop replacement Dell XPS, so I started with wubi for the u35jc too. Wubi worked from the start, and the wifi card works without problems. However, NVIDIA completely screwed up the hybrid driver for Linux (there are 2 video cards on the u35jc, one integrated and the other NVIDIA); it’s completely unusable, so the NVIDIA driver should be disabled. Happily there are detailed instructions on ubuntuforums. They are for 10.04, but work for 10.10 too. Suspend was not quite stable though, even after the fix from ubuntuforums. It’s tricky: suspension requires turning the NVIDIA driver on and off. Suspend was working on AC power, but suspend on battery power sometimes caused a system hang. I proceeded to the multitouch fix from the list. The multitouch fix requires creation or modification of xorg.conf, which requires stopping X. Stopping X caused a crash which permanently killed the Ubuntu wubi install. That was quite scary, so I decided to forgo wubi and make a complete Ubuntu installation into a partition.

Installation was not completely smooth. Probably because of some quirks of the initial Asus partitioning, the Ubuntu installer refused to create any partition but the first in the free space. After creation of the first boot partition the remaining free disk space became unusable. So I gave up on a swap partition and installed Ubuntu into a single / partition. After that I applied only the acpi_call and suspend fixes. Suspend now works like a charm, both on AC and on battery. The multitouch fix I put aside for now; it’s not critical. It seems all the problems (the necessity of a separate suspend fix, the problems with wubi and with stopping X) were caused by the NVIDIA driver. I hope NVIDIA will fix it eventually: I want to play with GPGPU without booting into Windows.