# Mirror Image

## Robust estimators – understand or die… err… be bored trying

This is continuation of my attempt to understand internal mechanics of robust statistics. First I want to say that robust statistics “just works”. It’s not necessary to have deep understanding of it to use it and even to use it creatively. However without that deeper understanding I feel myself kind of blind. I can modify or invent robust estimators empirically, but I can not see clearly the reasons, why use this and not that modification.
Now about robust estimators. They could be divided into two groups: maximum likelihood estimators(M-estimators), which in case of robust statistics usually, but not always are redescending estimators (notable not redescending estimator is $L_1$ norm), and all the rest of estimators.
This second “all the rest” group include subset of L-estimators(think of median, which is also M-estimator with $L_1$ norm.Yea, it’s kind of messy), S-estimators (use global scale estimation for all the measurements) R-estimators, which like L-estimator use order statistics but use it for weights. There may be some others too, but I don’t know much about this second group.
It’s easy to understand what M-estimators do: just find the value of parameter which give maximum probability of given set of measurements.
$argmax_{\theta} \prod_{i=1}^{n}p ( x_i\mid\theta)$
or
$argmin_{\theta}\sum_{i=1}^{n} -ln(p(x_i|\theta))$
which give us traditional M-estimator form
$argmin_{\theta}\sum_{i=1}^n \rho(x_i, \theta)$
or
$\sum_{i=1}^n \psi(x_i, \theta) = 0$, $\psi = \frac{\partial \rho}{\partial \theta}$
Practically we are usually work not with measurements per se, but with some distribution of cost function $F(x,\theta)$ of the measurements $\rho(x, \theta) = p(F(x, \theta))$, so it become
$\sum_{i=1}^n \psi(x_i, \theta)\frac{\partial F(x_i,\theta)}{\partial \theta} = 0$
it’s the same as the previous equation just $\psi$ defined in such a way as to separate statistical part from cost function part.
Now if we make a set of weights $w_i = \frac{\psi_i}{F_i}$ it become
$\sum_{i=1}^n w_i(x_i, \theta) F(x_i, \theta) \frac{\partial F(x_i,\theta)}{\partial \theta} = 0$
We see that it could be considered as “nonlinear least squares”, which could be solved with iteratively reweighted least squares
Now for second group of estimators we have probability of joint distribution
$argmax_{\theta} \prod_{i=1}^{n}p ( x_i\mid x_{j, j\neq i}, \theta)$
All the global factors – sort order, global scale etc. are incorporated into measurements dependence.
It seems the difference between this formulation of second group of estimators and M-estimator is that conditional independence assumption about measurements is dropped.
Another interesting thing is that if some of measurements are not dependent on others, this formulation can get us bayesian network

So M-estimator and probabilistic distribution through which it is defined are essentially the same. Least squares, for example, is produced by normal(gausssian) distribution. Just take sum of logarithms of gaussian and you get least squares estimator.
If we are talking about normal (pun intended), non-robust estimator, their defining feature is finite variance of distribution.
We have central limit theorem which saying that for any distribution mean value of samples will have approximately normal(or Gaussian) distribution.
From this follow property of asymptotic normality – for estimator with finite variance its distribution around true value of parameter $\theta$ approximate normal distribution.
We are discussing robust estimators, which are stable to error and have “thick-tailed” distribution, so we can not assume finite variance of distribution.
Nevertheless to have “true” result we want some form of probabilistic convergence of measurements to true value. As it happens such class of distribution with infinite variance exists. It’s called alpha-stable distributions.
Alpha stable distribution are those distributions for which linear combination of random variables have the same distribution, up to scale factor. From this follow analog of central limit theorem for stable distribution.
The most well known alpha-stable distribution is Cauchy distribution, which correspond to widely used redescending estimator
$\psi(x) = \frac {x} {\varepsilon + x^2}$
Cauchy distribution can be generalized in several way, including recent GCD – generalized Cauchy distribution(Carrillo et al), with density function
$p\Gamma(p/2)/2\Gamma(1/p)^2(\sigma^p + x^p)^{-2/p}$
and estimator
$\psi(x)=\frac{p|x|^{p-1}sgn(x)}{\sigma^p + x^p}$
Carrillo also introduce Cauchy distribution-based “norm” (it’s not a real norm obviously) which he called “Lorentzian norm”
$||u||_{LL_p} = \sum ln(1 + \frac{|u_i|^p}{\sigma^p})$
${LL_2}$ is correspond classical Cauchy distribution
He successfully applied Lorentzian norm ${LL_2}$ based basis pursuit to compressed sensing problem, which support idea that compressed sensing and robust statistics are dual each other.