Mirror Image

Meaning of conditional probability

Conditional probability was always baffling me. Empirical, frequentists meaning is clear, but the abstract definition, originating from Kolmogorov – what was its mathematical meaning? How it can be derived? It’s a nontrivial definition and is appearing in the textbooks out the air, without measure theory intuition behind it.

Here I mostly follow Chang&Pollard  paper  Conditioning as disintegartion. Beware that the paper use non-standard notation, but this post follow more common notation, same as in wikipedia.

Here is example form Chang&Pollard paper:

Suppose we have distribution on  $\mathbb{R}^2$ concentrated in two straight lines $L_1$ and $L_2$ with respective density $g_i(x)$ and angles with X axis $\alpha_i$. Observation (X,Y) taken, giving $X=x_0$, what is probability that point lies on the line $L_1$ ?

Standard approach would be approximate $x_0$ with $[ x_0, x_0 +\Delta ]$ and take limit with  $\Delta \to 0$

$\frac{\int_{x_0}^{x_0+ \Delta}g_1/cos(\alpha_1)}{\int_{x_0}^{x_0+ \Delta}g_1/cos(\alpha_1) + \int_{x_0}^{x_0+ \Delta}g_2/cos(\alpha_2)}$

Not only taking this limit is kind of cumbersome, it’s also not totally obvious that it’s the same conditional probability that defined in the abstract definition – we are replacing ratio with limit here.

Now what is “correct way” to define conditional probabilities, especially for distributions?

For simplicity we will first talk about single scalar random variable, defined on probability space. We will think of random variable X as function on the sample space. Now condition $X=x_0$ define fiber – inverse image of $x_0$.

Disintegration theorem say that probability measure on the sample space can be decomposed into two measures – parametric family of measures induced by original probability on each fiber and “orthogonal” measure on $\mathbb{R}$  – on the parameter space of the first measure. Here $\mathbb{R}$ is the space of values of X  and serve as parameter space for measures on fibers. Second measure induced by the inverse image of  the function (random variable) for each measurable set on $\mathbb{R}$. This second measure is called Pushforward measure. Pushforward measure is just for measurable set on $\mathbb{R}$ (in our case) taking its X  inverse image on sample space and measuring it with μ.

Fiber is in fact sample space for conditional event, and measure on fiber is our conditional distribution.

Full statement of the theorem require some term form measure theory. Following wikipedia

Let P(X) is collection of Borel probability measures on X,  P(Y) is collection of  Borel probability measures on Y

Let Y and X be two Radon spaces. Let μ ∈ P(Y), let $\pi$: Y → X be a Borel- measurable function, and let ν ∈ P(X) be the pushforward measure  $\mu \circ \pi^{-1}$.

* Then there exists a ν-almost everywhere uniquely determined family of probability measures $\mu_x$  ⊆  P(Y) such that
* the function $x \mapsto \mu_{x}$ is Borel measurable, in the sense that $x \mapsto \mu_{x} (B)$ is a Borel-measurable function for each Borel-measurable set B ⊆ Y;
*  $\mu_x$  “lives on” the  fiber $\pi^{-1}(x)$ : for ν-almost all x  ∈ X,

$\mu_x(Y \setminus \pi^{-1} (x))=0$

and so $\mu_x(E)=\mu_x(E \cap \pi^{-1}(x))$

* for every Borel-measurable function $f$ : Y → [0, ∞],

$\int_{Y} f(y) \, \mathrm{d} \mu (y) = \int_{X} \int_{\pi^{-1} (x)} f(y) \, \mathrm{d} \mu_{x} (y) \mathrm{d} \nu (x)$

From here for any event E form Y

$\mu (E) = \int_{X} \mu_{x} \left( E \right) \, \mathrm{d} \nu (x)$

This was complete statement of the disintegration theorem.

Now returning to Chang&Pollard example. For formal derivation I refer you to the original paper, here we will just “guess” $\mu_x$, and by uniqueness it give us disintegration. Our conditional distribution for $X=x$ will be just point masses on the intersections of lines $L_1$ and $L_2$ with axis  $X=x$
Here $\delta$ is delta function – point mass.

$d\mu_x(y) = \frac{\delta (L_1( x )-y) g_1( x ) / cos(\alpha_1) + \delta (L_2( x )-y) g_2( x ) / cos(\alpha_2)}{Z(x)}dy$

$Z(x) = g_1(x)/cos(\alpha_1) + g_2(x)/cos(\alpha_2)$

and for $\nu$
$d\nu = Z(x)dx$

Our conditional probability that event lies on $L_1$ with condition $X = x_0$, form conditional density $d\mu_{x_0}/dy$ thus

$\frac{g_1(x_0)/cos(\alpha_1)}{g_1(x_0)/cos(\alpha_1) + g_2(x_0)/cos(\alpha_2)}$

Another example from Chen&Pollard. It relate to sufficient statistics. Term sufficient statistic used if we have probability distribution depending on some parameter, like in maximum likelihood estimation. Sufficient statistic is some function of sample, if it’s possible to estimate parameter of distribution from only values of that function in the best possible way – adding more data form the sample will not give more information about parameter of distribution.
Let $\mathbb{P}_{\theta}$ be uniform distribution on the square $[0, \theta]^2$. In that case M = max(x, y) is sufficient statistics for $\theta$. How to show it?
Let take our function $(x,y) \to m$ , $m = max(x,y)$ and make disintegration.
$d\mu_m(x,y) = s_m(x,y)dxdy$ is a uniform distribution on edges where x and y equal m and $d\nu = (2m / \theta^2 )dm$
$s_m(x,y)$ is density of conditional probability $P(\cdot | M = m)$ and it doesn’t depend on $\theta$
For any $f \ \ \mathbb{P}_{\theta}(f| m) = \int f s_m(x,y) \ , \ \mathbb{P}_{\theta, m}(\cdot) = \mathbb{P}_{m}(\cdot)$ – means that M is  sufficient.

It seems that in most cases disintegration is not a tool for finding conditional distribution. Instead it can help to guess it and form uniqueness prove that the guess is correct. That correctness could be nontrivial – there are some paradoxes similar to Borel Paradox in Chang&Pollard paper.

7, November, 2013 Posted by | Math | , , | Comments Off on Meaning of conditional probability

Robust estimators III: Into the deep

Cauchy estimator have some nice properties (Gonzales et al “Statistically-Efficient Filtering in Impulsive Environments: Weighted Myriad Filter” 2002):
By tuning $\gamma$ in
$\psi(x) = \frac{x}{\gamma^2+ x^2}$
it can approximate either least squares (big $\gamma$), or mode – maximum of histogram – of sample set (small $\gamma$). For small $\gamma$ estimator behave the same way as power law distribution estimator with small $\alpha$.
Another property is that for several measurements with different scales $\gamma_i$ estimator of their sum will be simple
$\psi(x) = \frac{x}{(\sum \gamma_i)^2+ x^2}$
which is convenient for estimation of random walks

I heard convulsion in the sky,
And flight of angel hosts on high,
And monsters moving in the deep

Those verses from The Prophet by A.Pushkin could be seen as metaphor of profound mathematical insight, encompassing bifurcations, higher dimensional algebra and murky depths of statistics.
I now intend to dive deeper into of statistics – toward “data depth”. Data depth is a generalization of median concept to multidimensional data. Remind you that median can be seen either as order parameter – value dividing the higher half of measurements from lower, or geometrically, as the minimum of $L_1$ norm. Second approach lead to geometric median, about which I already talked about.
First approach to generalizations of median is to try to apply order statistics to multidimensional vectors.The idea is to make some kind of partial order for n-dimensional points – “depth” of points, and to choose as the analog of median the point of maximum depth.
Basically all data depth concepts define “depth” as some characterization of how deep points are reside inside the point cloud.
Historically first and easiest to understand was convex hull approach – make convex hull of data set, assign points in the hull depth 1, remove it, get convex hull of points remained inside, assign new hull depth 2, remove etc.; repeat until there is no point inside last convex hull.
Later Tukey introduce similar “halfspace depth” concept – for each point X find the minimum number of points which could be cut from the dataset by plane through the point X. That number count as depth(see the nice overview of those and other geometrical definition of depth at Greg Aloupis page)
In 2002 Mizera introduced “global depth”, which is less geometric and more statistical. It start with assumption of some loss function (“criterial function” in Mizera definition) $F(x_i, \theta)$ of measurement set $x_i$. This function could be(but not necessary) cumulative probability distribution. Now for two parameters $\theta_1$ and $\theta_2$, $\theta_1$ is more fit with respect $A \subset N$ if for all $i \in A$ $F(x_i, \theta_1) > F(x_i, \theta_2)$. $\hat{\theta}$ is weakly optimal with respect to $A$ if there is nor better fit parameter with respect to $A$. At last global depth of $\theta$ is the minimum possible size of $A$ such that $\theta$ is not weakly optimal with respect to $N \setminus A$ – reminder of measurements. In other words global depth is minimum number of measurements which should be removed for $\theta$ stop being weakly optimal. Global depth is not easy to calculate or visualize, so Mizera introduce more simple concept – tangent depth.
Tangent depth defined as $min_{\parallel u\parallel=1}\mid \{ i: u^T \bigtriangledown_{\theta} F(x_i) \geq 0 \}\mid$. What does it mean? Tangent depth is minimum number of “bad” points – such points that for specific direction loss function for themis growing.
Those definitions of “data depth” allow for another type of estimator, based not on likelihood, but on order statisticsmaximum depth estimators. The advantage of those estimators is robustness(breakdown point ~25%-33%) and disadvantage – low precision (high bias). So I wouldn’t use them for precise estimation, but for sanity check or initial approximation. In some cases they could be computationally more cheap than M-estimators. As useful side effect they also give some insight into structure of dataset(it seems originally maximum depth estimators was seen as data visualization tool). Depth could be good criterion for outliers rejection.
Disclaimer: while I had very positive experience with Cauchy estimator, data depth is a new thing for me.I have yet to see how useful it could be for computer vision related problems.

2, May, 2011 Posted by | computer vision, sci | , , , | Comments Off on Robust estimators III: Into the deep

Robust estimators II

In this post I was complaining that I don’t know what breakdown point for redescending M-estimators is. Now I found out that upper bound for breakdown point of redescending of M-estimators was given by Mueller in 1995, for linear regression (that is statisticians word for simple estimation of p-dimensional hyperplane):
$\frac{1}{N}(\frac{N - \mathcal{N}(x) + 1)}{2})$
$N$ – number of measurements and $\mathcal{N}(x)$ is little tricky: it is a maximum number of measurement vectors X lying in the same p-dimensional hyperplane. If number of measurements N >> p that mean breakdown point is near 50% – You can have half measurement results completely out of the blue and estimator will still work.
That only work if the error present only in results of measurements, which is reasonable condition – in most cases we can move random error from x part to y part.
Now which M-estimators attain this upper bound?
The condition is “slow variation”(Mizera and Mueller 1999)
$\lim_{t\to \infty} \frac{\rho(t x)}{\rho(t)} = 1$
Mentioned in previous post Cauchy estimator is satisfy that condition:
$\rho(x) = -\ln(1 +(\frac{x}{\gamma})^2)$ and its derivative $\psi(x) = \frac{x}{\gamma^2+ x^2}$
In practice we always work with $\psi$, not $\rho$ so Cauchy estimator is easy to calculate.
Rule of the thumb: if you don’t know which robust estimator to use, use Cauchy: It’s fast(which is important in real time apps), its easy to understand, it’s differentiable, and it is as robust as possible (that is for redescending M-estimator)

19, April, 2011 Posted by | computer vision, sci | , , , , | Comments Off on Robust estimators II

Robust estimators – understand or die… err… be bored trying

This is continuation of my attempt to understand internal mechanics of robust statistics. First I want to say that robust statistics “just works”. It’s not necessary to have deep understanding of it to use it and even to use it creatively. However without that deeper understanding I feel myself kind of blind. I can modify or invent robust estimators empirically, but I can not see clearly the reasons, why use this and not that modification.
Now about robust estimators. They could be divided into two groups: maximum likelihood estimators(M-estimators), which in case of robust statistics usually, but not always are redescending estimators (notable not redescending estimator is $L_1$ norm), and all the rest of estimators.
This second “all the rest” group include subset of L-estimators(think of median, which is also M-estimator with $L_1$ norm.Yea, it’s kind of messy), S-estimators (use global scale estimation for all the measurements) R-estimators, which like L-estimator use order statistics but use it for weights. There may be some others too, but I don’t know much about this second group.
It’s easy to understand what M-estimators do: just find the value of parameter which give maximum probability of given set of measurements.
$argmax_{\theta} \prod_{i=1}^{n}p ( x_i\mid\theta)$
or
$argmin_{\theta}\sum_{i=1}^{n} -ln(p(x_i|\theta))$
which give us traditional M-estimator form
$argmin_{\theta}\sum_{i=1}^n \rho(x_i, \theta)$
or
$\sum_{i=1}^n \psi(x_i, \theta) = 0$, $\psi = \frac{\partial \rho}{\partial \theta}$
Practically we are usually work not with measurements per se, but with some distribution of cost function $F(x,\theta)$ of the measurements $\rho(x, \theta) = p(F(x, \theta))$, so it become
$\sum_{i=1}^n \psi(x_i, \theta)\frac{\partial F(x_i,\theta)}{\partial \theta} = 0$
it’s the same as the previous equation just $\psi$ defined in such a way as to separate statistical part from cost function part.
Now if we make a set of weights $w_i = \frac{\psi_i}{F_i}$ it become
$\sum_{i=1}^n w_i(x_i, \theta) F(x_i, \theta) \frac{\partial F(x_i,\theta)}{\partial \theta} = 0$
We see that it could be considered as “nonlinear least squares”, which could be solved with iteratively reweighted least squares
Now for second group of estimators we have probability of joint distribution
$argmax_{\theta} \prod_{i=1}^{n}p ( x_i\mid x_{j, j\neq i}, \theta)$
All the global factors – sort order, global scale etc. are incorporated into measurements dependence.
It seems the difference between this formulation of second group of estimators and M-estimator is that conditional independence assumption about measurements is dropped.
Another interesting thing is that if some of measurements are not dependent on others, this formulation can get us bayesian network

Now lets return to M-estimators. M-estimator is defined by assumption about probability distribution of the measurements.
So M-estimator and probabilistic distribution through which it is defined are essentially the same. Least squares, for example, is produced by normal(gausssian) distribution. Just take sum of logarithms of gaussian and you get least squares estimator.
If we are talking about normal (pun intended), non-robust estimator, their defining feature is finite variance of distribution.
We have central limit theorem which saying that for any distribution mean value of samples will have approximately normal(or Gaussian) distribution.
From this follow property of asymptotic normality – for estimator with finite variance its distribution around true value of parameter $\theta$ approximate normal distribution.
We are discussing robust estimators, which are stable to error and have “thick-tailed” distribution, so we can not assume finite variance of distribution.
Nevertheless to have “true” result we want some form of probabilistic convergence of measurements to true value. As it happens such class of distribution with infinite variance exists. It’s called alpha-stable distributions.
Alpha stable distribution are those distributions for which linear combination of random variables have the same distribution, up to scale factor. From this follow analog of central limit theorem for stable distribution.
The most well known alpha-stable distribution is Cauchy distribution, which correspond to widely used redescending estimator
$\psi(x) = \frac {x} {\varepsilon + x^2}$
Cauchy distribution can be generalized in several way, including recent GCD – generalized Cauchy distribution(Carrillo et al), with density function
$p\Gamma(p/2)/2\Gamma(1/p)^2(\sigma^p + x^p)^{-2/p}$
and estimator
$\psi(x)=\frac{p|x|^{p-1}sgn(x)}{\sigma^p + x^p}$
Carrillo also introduce Cauchy distribution-based “norm” (it’s not a real norm obviously) which he called “Lorentzian norm”
$||u||_{LL_p} = \sum ln(1 + \frac{|u_i|^p}{\sigma^p})$
${LL_2}$ is correspond classical Cauchy distribution
He successfully applied Lorentzian norm ${LL_2}$ based basis pursuit to compressed sensing problem, which support idea that compressed sensing and robust statistics are dual each other.

15, April, 2011 Posted by | sci | , , , , , | Comments Off on Robust estimators – understand or die… err… be bored trying

Is Robust Statistics have formal mathematical foundation?

As I have already written I have a trouble understanding what robust estimators actually estimate from probabilistic or other formal point of view. I mean estimators which are not maximum likelihood estimators. There is a formal definition which doesn’t explain a lot to me. It looks like estimator estimate some quantity, and we know how good we are at estimating it, but how we know what we are actually estimate? Or does this question even make sense? But that is actually a minor bummer. A problem with understanding outliers is a lot worse for me. A breakdown point is a fundamental concept in robust statistics. And breakdown point is defined as a relative number of outliers in the sample set. The problem is, it seems there is no formal definition of outlier in statistics or probability theory. We can talk about mixture models, and tail distributions but those concepts are not quite consistent with breakdown point. Breakdown point looks like it belong to area of optimization/topology, not statistics. Could it be that outliers could be defined consistently only if we have some additional structural information/constraints beside statistical (distribution)? That inability to reconcile statistics and optimization is a problem which causing cognitive headache for me.

11, April, 2011 Posted by | sci | , , , , | Comments Off on Is Robust Statistics have formal mathematical foundation?

Minimum sum of distance vs L1 and geometric median

All this post is just a more detailed explanation of the end of the previous post.
Assume we want to estimating a state $x \in R^n$ from $m \gg n$ noisy linear measurements $y \in R^m$, $y = Ax + z$, $z$– noise with outliers, like in the paper by Sharon, Wright and Ma Minimum Sum of Distances Estimator: Robustness and Stability
Sharon at al show that minimum $L_1$ norm estimator, that is
$arg \min_{x} \sum_{i=1}^{m} \| a_i^T x - y_i \|_{1}$
is a robust estimator with stable breakdown point, not depending on the noise level. What Sharon did was to use as cost function the sum of absolute values of all components of errors vector. However there are exists another approach.
In one-dimensional case minimum $L_1$ norm is a median.But there exist generalization of median to $R^n$geometric median. In our case it will be
$arg \min_{x} \sum_{i=1}^{m} \| A x - y_i \|_{2}$
That is not a least squares – minimized the sum of $L_2$ norm, not the sum of squares of $L_2$ norm.
Now why is this a stable and robust estimator? If we look at the Jacobian
$\sum_{i=1}^{m} A^T \frac{A x - y_i}{ \| A x - y_i \|_{2}}$
we see it’s asymptotically constant, and it’s norm doesn’t depend on the norm of the outliers. While it’s not a formal proof it’s quite intuitive, and can probably be formalized along the lines of Sharon paper.
While first approach with $L_1$ norm can be solved with linear programming, for example simplex method and interior point method, the second approach with $L_2$ norm can be solved with second order cone programming and …surprise, interior point method again.
For interior point method, in both cases original cost function is replaced with
$\sum f$
And the value of $f$ is defined by constraints. For $L_1$
$f_{i} \ge a_ix-y_i$, $f_{i} \ge -a_ix+y_i$
Sometimes it’s formulated by splitting absolute value is into the sum of positive and negative parts
$f_{+_{i}} \ge a_ix-y_i$, $f_{-_{i}} \ge -a_ix+y_i$, $f_{+_{i}} \ge 0$, $f_{-_{i}} \ge 0$
And for $L_2$ it’s a simple
$f_i \ge \| A x - y_i \|_{2}$
Formulations are very similar, and stability/performance are similar too (there was a paper about it, just had to dig it out)

10, April, 2011

Genetic algorithms – alternative to building block hypothesis

Genetic algorithms and especially their subset, Genetic programming were always fascinating me. My interest was fueled by on and off work with Global optimization, and because GA just plain cool. One of the most interesting thing about GA is that they work quite good on some “practical” problems, while there is no comprehensive theoretical explanation why they should work so well (Of cause they are not always so useful. There was a work on generating feature descriptors with GA, and results were less then impressive).
Historically, first and most well known explanation for GA efficiency was the building block hypothesis. Building block hypothesis is very intuitive. It say that there are exist “building blocks” – small parts of genome with high fitness. GA work is randomly searching for those building blocks and combining them afterward, until global optimum is found. Searching is mostly done with mutation, and combining found building block with crossover (analog of exchange of genetic material in real biological reproduction).
However building blocks have a big problem, and that problem is crossover operator. If building block hypothesis is true, GA work better if integrity of building blocks preserved as much as possible. That is there should be only few “cut and splice” points in the sequence. But practically GA with “uniform” crossover – massive uniform mixing of two genomes, work better then GA with few crossover points.
Recently a new theory of GA efficiency appears, that try to deal with uniform crossover problem – Generative fixation” hypothesis. The idea behind “generative fixation” is that GA works in continuous manner, fixing stable groups of genes with high fitness and continuing search on the rest of genome, reducing search space step by step. From optimization point of view GA in that case works in manner similar to Conjugate gradient method, reducing (or trying) dimensionality of search space in each step. Now about “uniform crossover” – why it works better: subspace, to which search space reduced, should be stable (in stability theory sense). Small permutations wouldn’t case solution to diverge. With uniform crossover of two close solutions resulting solution still will be nearby attractive subspace. The positive effect of uniform crossover is that it randomize solution, but without exiting already found subspace. That randomization clearing out useless “stuck” genes (also called “hitchhikers”), and help to escape local minima.
Interesting question is, what if subspace is not “fixed bits” and even not linear – that is if it’s a manifold. In that case (if hypothesis true) found genes will not be “fixed”, but will “drift” in systematic manners, according to projection of manifold on the semi-fixed bits.
Now to efficiency GA for “practical” task. If the “generative fixation” theory is correct, “practical” task, for which GA work well could be the problems for which dimensionality reduction is natural, for example if solution belong to low-dimensional attractive manifold. (addenum 7/11)That mean GA shouldn’t work well for problem which allow only combinatorial search. Form this follow that if GA work for compressed sensing problem it should comply with Donoho-Tanner Phase Transition diagram.
Overall I like this new hypothesis, because it bring GA back to family of mathematically natural optimization algorithms. That doesn’t mean the hypothesis is true of cause. Hope there will be some interest, more work, testing and analysis. What is clear that is current building block hypothesis is not unquestionable.

PS 7/11:
Simple googling produced paper by Beyer An Alternative Explanation for the Manner in which Genetic Algorithms Operate with quite similar explanation how uniform crossover works.

6, November, 2010

Learning mathematics is good for the soul

At The n-Category Café David Corfield is talking about mathematical emotions. This is about ‘emotions belonging to mathematical thinking’, specific feeling related to mathematical intuition, meaning and sense of the rightness. In the end he is referencing to spiritual motivation of mathematics. From myself I add that those thoughts are closely related to the last book of Neal StephensonAnathem

19, November, 2009 Posted by | Uncategorized | , , , | Comments Off on Learning mathematics is good for the soul

Some phase correlation tricks

Then doing phase correlation on low-resolution, or extreme low-resolution (like below 32×32) images, the noise could become a serious problem, up to making result completely useless. Fortunately there are some tricks, which help in this situation. Some of them I stumbled upon myself, and some picked up in relevant papers.
First is obvious – pass image through the smoothing filter. Pretty simple window filter from integral image can help here.
Second – check consistency of result. Histogram of cross-power specter can help here. Here there is the wheel within the wheel, which I have found out the hard way – discard lower and right sectors of cross-power specter for histogram, they are produced from high-frequency parts of the specter and almost always are noise, even if cross-power specter itself quite sane.
Now more academic tricks:
You could extract sub-pixel information from cross-power specter. There are lot of ways to do it, just google/citeseer for it. Some are fast and unreliable, some slow and reliable.
Last one is really nice, I’ve picked it from Carneiro & Japson paper about phase-based features.
For cross power specter calculation instead of
$\frac{F_{1}\cdot F_{2}^{*}} {\left| F_{1}\cdot F_{2} \right|}$
use
$\frac{F_{1}\cdot F_{2}^{*}} {a + \left| F_{1}\cdot F_{2} \right|}$
where $a$ is a small positive parameter
This way harmonics with small amplitude excluded from calculations. This is pretty logical – near zero harmonics have phase undefined, almost pure noise.

PS
Another problem with extra low-resolution phase correlation is that sometimes motion vector appear not as primary, but as secondary peak, due to ambiguity of the images relations. I have yet to find out what to do in this situation…

29, August, 2009 Posted by | Coding AR | , , , | Comments Off on Some phase correlation tricks

Importance of phase

Here are some nice pictures illustrating importance of Fourier phase

27, August, 2009 Posted by | Coding AR | , , | Comments Off on Importance of phase