# Mirror Image

## November finds in #arxiv and NIPS 2013

This is “find in arxiv” reposts form my G+ stream for November.

NIPS 2013

Accelerating Stochastic Gradient Descent using Predictive Variance Reduction

Stochastic gradient (SGD) is the major tool for Deep Learning. However if you look at the plot of cost function over iteration for   SGD you will see that after quite fast descent it becoming extremely slow, and error decrease could even become non-monotonous.  Author explain by necessity of trade of between the step size and variance of random factor – more precision require smaller variance but that mean smaller descent step and slower convergence. “Predictive variance” author suggest to mitigate problem is the same old “adding back the noise” trick, used for example in Split Bregman. Worth reading IMHO.

Predicting Parameters in Deep Learning
Output of the first layer of ConvNet is quite smooth, and that could be used for dimensionality reduction, using some dictionary, learned or fixed(just some simple kernel). For ConvNet predicting 75% of parameters with fixed dictionary have negligible effect on accuracy.
http://papers.nips.cc/paper/5025-predicting-parameters-in-deep-learning

Learning a Deep Compact Image Representation for Visual Tracking
Application of ADMM (Alternating Direction Method of Multipliers, of which Split Bregman again one of the prominent examples) to denoising autoencoder with sparsity.
http://papers.nips.cc/paper/5192-learning-a-deep-compact-image-representation-for-visual-tracking

Deep Neural Networks for Object Detection
People from Google are playing with Alex Krizhevsky’s ConvNet
http://papers.nips.cc/paper/5207-deep-neural-networks-for-object-detection

–arxiv (last)–

Are all training examples equally valuable?
It’s intuitively obvious that some training sample are making training process worse. The question is – how to find wich sample should be removed from training? Kind of removing outliers. Authors define “training value” for each sample of binary classifier.
http://arxiv.org/pdf/1311.7080

Finding sparse solutions of systems of polynomial equations via group-sparsity optimization
Finding sparse solution of polynomial system with lifting method.
I still not completely understand why quite weak structure constraint is enough for found approximation to be solution with high probability. It would be obviously precise  for binary 0-1 solution, but why for general sparse?
http://arxiv.org/abs/1311.5871

Semi-Supervised Sparse Coding
Big dataset with small amount of labeled samples – what to do? Use unlabeled samples for sparse representation. And train labeled samples in sparse representation.
http://arxiv.org/abs/1311.6834
From the same author, similar theme - Cross-Domain Sparse Coding
Two domain training – use cross domain data representation to map all the samples from both source and target domains to a data representation space with a common distribution across domains.
http://arxiv.org/abs/1311.7080

Robust Low-rank Tensor Recovery: Models and Algorithms
More of tensor decomposition with trace norm
http://arxiv.org/abs/1311.6182

Complexity of Inexact Proximal Newton methods
Application of Proximal Newton (BFGS) to subset of coordinates each step – active set coordinate descent.
http://arxiv.org/pdf/1311.6547

Computational Complexity of Smooth Differential Equations
Polynomial-memory complexity of ordinary differential equations.
http://arxiv.org/abs/1311.5414

–arxiv (2)–

Deep Learning

Visualizing and Understanding Convolutional Neural Networks
This is exploration of Alex Krizhevsky’s  ConvNet
using “deconvnet” approach – using deconvolution on output of each layer and visualizing it. Results looks interesting – strting from level 3 it’s something like thersholded edge enchantment, or sketch. Also there are evidences supporting “learn once use everywhere” approach – convnet trained on ImageNet is also effective on other datasets
http://arxiv.org/abs/1311.2901

Unsupervised Learning of Invariant Representations in Hierarchical Architectures
Another paper on why and how deep learning works.
Attempt to build theoretical framework for invariant features in deep learning. Interesting result - Gabor wavelets are optimal filters for simultaneous scale and translation invariance. Relations to sparsity and scattering transform
http://arxiv.org/abs/1311.4158

Computer Vision
An Experimental Comparison of Trust Region and Level Sets
Trust regions method for energy-based segmentation.
Trust region is one of the most important tools in optimization, especially non-convex.
http://en.wikipedia.org/wiki/Trust_region
http://arxiv.org/abs/1311.2102

Blind Deconvolution with Re-weighted Sparsity Promotion
Using reweighted L2 norm for sparsity in blind deconvolution
http://arxiv.org/abs/1311.4029

Optimization

http://www.optimization-online.org/DB_FILE/2013/04/3833.pdf )
It use Bregman distance and related to ADMM.
The paper is application of universal gradient method  to online learning and give bound on regret function.
http://arxiv.org/abs/1311.3832

CS
A Component Lasso
Approximate covariance matrix with block-diagonal matrix and apply Lasso to each block separately
http://arxiv.org/abs/1311.4472

_FuSSO: Functional Shrinkage and Selection Operator
Lasso in functional space with some orthogonal basis_
http://arxiv.org/abs/1311.2234

Non-Convex Compressed Sensing Using Partial Support Information
More of Lp norm for sparse recovery. Reweighted this time.
http://arxiv.org/abs/1311.3773

–arxiv  (1)–

Optimization, CS
Scalable Frames and Convex Geometry
Frame theory is a basis(pun intended) of wavelets theory, compressed sening and overcomplete dictionaries in ML
http://en.wikipedia.org/wiki/Frame_of_a_vector_space
Here is a discussion how to make “tight frame”
http://en.wikipedia.org/wiki/Frame_of_a_vector_space#Tight_frames
from an ordinary frame by scaling m of its components
Interesting geometric insight provided – to do it mcomponents of frame should make “blunt cone”
http://en.wikipedia.org/wiki/Convex_cone#Blunt_and_pointed_cones
http://arxiv.org/abs/1310.8107

Learning Sparsely Used Overcomplete Dictionaries via Alternating Minimization
Some bounds for convergence of dictionary learning. Converge id initial error is O(1/s^2), s- sparcity level
http://arxiv.org/abs/1310.7991

Robust estimators
Robustness of ℓ1 minimization against sparse outliers and its implications in Statistics and Signal Recovery
This is another exploration of L1 estimator. It happens (contrary to common sense) that L1 is not especially robust from “breakdown point” point of view if there is no constraint of noise. However it practical usefulness can be explained that it’s very robust to sparse noise
http://arxiv.org/abs/1310.7637

Numerical
Local Fourier Analysis of Multigrid Methods with Polynomial Smoothers and Aggressive coarsening
Overrelaxaction with Chebyshev weights on the fine grid, with convergence analysis.
http://arxiv.org/abs/1310.8385

1, December, 2013

## Deriving Gibbs distribution from stochastic gradients

Stochastic gradients is one of the most important tools in optimization and machine learning (especially for Deep Learning – see for example ConvNet). One of it’s advantage is that it behavior is well understood in general case, by application of methods of statistical mechanics.

In general form stochastic gradient descent could be written as
$w_{n+1} = w_n - \alpha \bigtriangledown E(w_n) + F_n$
where $F_n$ is a random variable with expectation zero.
To apply methods of statistical mechanics we can rewrite it in continuous form, as stochastic gradient flow
$\partial J/\partial t = - \bigtriangledown_J E(J) +F(t)$
and random variable F(t) we assume to be white noise for simplicity.
In that moment most of textbooks and papers refer to “methods of statistical mechanics” to show that
stochastic gradient flow has invariant probability distribution, which is called Gibbs distribution
$p(x) = \frac {exp(-\beta E(x))}{Z}$
and from here derive some interesting things like temperature and free energy.
The question is – how Gibbs distribution derived from stochastic gradient flow?

First we have to understand what stochastic gradient flow really means.
It’s not a partial differential equation (PDE), because it include random variable, which is not a function $\mathbb{R} \to \mathbb{R}$. In fact it’s a stochastic differential equation (SDE) . SDE use some complex mathematical machinery and relate to partial differential equations, probability/statistics, measure theory and ergodic theory. However they used a lot in finance, so there is quite a number of textbooks on SDE for non-mathematicians. For the short, minimalistic and accessible book on stochastic differential equation I can’t recommend highly enough introduction to SDE by L.C. Evans

SDE in question is called Ito diffusion. Solution of that equation is a stochastic process – collection of random variables parametrized by time. Sample path of stochastic process in question as a function of time is nowhere differentiable – it’s difficult to talk about it in term of derivatives, so it is defined through it’s integral form.
First I’ll notice that integral of white noise is actually Brownian motion, or Wiener process.
Assume that we have stochastic differential equation written in informal manner
$\partial X / \partial t = b(X, t) + g(X, t) F(t)$ with X -stochastic process and F(t) – white noise
It’s integral form is
$X(T) - X(0) = \int_{0}^{T} b(X, t) dt + \int_{0}^{T} g(X, t) dW(t) \ \ (1)$
where W(t) is a Wiener process
$W(t) = \int F(t) dt$
This equation is usually written in the form
$dX = b(X,t)dt + g(X,t)dW \ \ (2)$
This is only a notation for integral equation, d here is not a differential.
Returning to (1)
$\int_{0}^{T} b(X, t) dt$ is an integral along sample path, it’s meaning is obvious, or it can be defined as limit of Riemann sums with respect to time.

The most notable thing here is
$\int_{0}^{T} g(X, t) dW(t)$ – integral with respect to Wiener process (3)
It’s a stochastic integral, and it’s defined in the courses of stochastic differential equation as the limit of Riemann sums of random variables, in the manner similar to definition of ordinary integral.
Curiously, stochastic integral is not quite well defined. Depending on the form of the sum it produce different results, like Ito integral:

$\int_0^t g \,d W =\lim_{n\rightarrow\infty} \sum_{[t_{i-1},t_i]\in\pi_n}g_{t_{i-1}}(W_{t_i}-W_{t_{i-1}})$
Different Riemann sums produce different integral – Stratonovich integral:
$\int_0^t g \,d W =\lim_{n\rightarrow\infty} \sum_{[t_{i-1},t_i]\in\pi_n}(g_{t_i} + g_{t_{i-1}})/2(W_{t_i}-W_{t_{i-1}})$

Ito integral used more often in statistics because it use $g_{t_{i-1}}$ – it don’t “look forward”, and Stratonovich more used in theoretical physics.

Returning to Ito integral – Ito integral is stochastic process itself, and it has expectation zero for each t.
From definition of Ito integral follow one of the most important tools of stochastic calculus – Ito Lemma (or Ito formula)
Ito lemma states that for solution of SDE (2)
$dX = b(X, t)dt + g(x, t)dW$ X, b, W – vectors, g – matrix
were W is Wiener process (actually some more general process) and b and g are good enough

$du(\mathbf{X}, t) = \frac{\partial u}{\partial t} dt + (\nabla_\mathbf{X}^{\mathsf T} u) d\mathbf{X} + \tfrac{1}{2} \sum_{i,j} (gg^\mathsf{T})_{i,j}\tfrac{ \partial^2 u}{\partial x_i \partial x_j} dt$

where $\nabla_\mathbf{X} u$ is the gradient.
From Ito lemma follow Ito product rule for scalar processes: applying Ito formula to process $V = { X \choose Y}$ combined from two processes X and Y to function u(V) = XY
$d(XY) = XdY + YdX + g_x \cdot g_y dt$
Using Ito formula and Ito product rule it is possible to get Feynman–Kac formula (derivation could be found in the wikipedia, it use only Ito formula, Ito product rule and the fact that expectation of Ito integral (3) is zero):
for partial differential equation (PDE)
$\frac{\partial u}{\partial t} + (\nabla_\mathbf{X}u)^{\mathsf T} b + \tfrac{1}{2} \sum_{i,j} (gg^\mathsf{T})_{i,j}\tfrac{ \partial^2 u}{\partial x_i \partial x_j} - vu = f$
with terminal condition
$u(x, T) = \psi(x) \ \ (4)$
solution can be written as conditional expectation:
$u(x,t) = E\left[ \int_t^T e^{- \int_t^r v(X_\tau,\tau)\, d\tau}f(X_r,r)dr + e^{-\int_t^T v(X_\tau,\tau)\, d\tau}\psi(X_T) \Bigg| X_t=x \right]$
Feynman–Kac formula establish connection between PDE and stochastic process.
From Feynman–Kac formula taking $v\equiv 0$ and $f \equiv 0$ we get Kolmogorov backward equation :
for
$Lu = (\nabla_\mathbf{X}u)^{\mathsf T} b + \tfrac{1}{2} \sum_{i,j} (gg^\mathsf{T})_{i,j}\tfrac{ \partial^2 u}{\partial x_i \partial x_j}$
equation
$\frac{\partial u}{\partial t} + Lu = 0 \ \ (5)$
with terminal condition (4) have solution as conditional expectation
$u(x, t) = E(\psi(X_T) | X_t = x) \ \ (6)$
From Kolmogorov backward equation we can obtain Kolmogorov forward equation, which describe evolution of probability density for random process X (2)
In SDE courses it’s established that (2) is a Markov process and has transitional probability P and transitional density p:
p(x, s, y, t) = probability density at being at y in time t, on condition that it started at x in time s
taking u – solution of (5) with terminal condition (6)
$u(x, s) = E(\psi(X_T) | X_s = x) = \int p(x, s, z, T) \psi(z) dz$
From Markov property
$p(x, s, z, T) = \int p(x, s, y, t)p(y, t, z, T) dz$
from here
$u(x, s) = \int \int p(x, s, y, t) p(y, t, z, T) \psi(z) dz dy$
$u(x, s) = \int p(x, s, y, t) u(y, t) dy$
form here
$0 = \frac{\partial u}{\partial t} = \int \frac {\partial p(x, s, y, t)}{\partial t}u(y, t) + p(x, s, y, t) \frac {\partial u(y, t)}{\partial t} dy$
from (5)
$\int \frac{\partial p(x, s, y, t)}{\partial t}u(y, t) - (Lu) p(x, s, y, t) dy = 0 \ \ (7)$
Now we introduce dual operator $L^*$
$\int p (Lu) dy = \int (L^{*}p) u dy$
By integration by part we can get
$L^*u = -\nabla_\mathbf{X}^{\mathsf T}(ub) + \tfrac{1}{2} \sum_{i,j} \tfrac{ \partial^2 }{\partial x_i \partial x_j}(gg^\mathsf{T}u)_{i,j}$
and from (7)
$\int (\frac{\partial p(x, s, y, t)}{\partial t} - L^{*}p) u(y, t) dy = 0$
for t=T
$\int (\frac{\partial p(x, s, y, T)}{\partial t} - L^{*}p(x, s, y, T)) \psi(y) dy = 0$
This is true for any $\psi$, wich is independent from p
And we get Kolmogorov forward equation for p. Integrating by x we get the same equation for probability density at any moment T
$\frac{\partial p}{\partial t} - L^{*}p = 0$
$\tfrac{\partial J}{\partial t} = - \bigtriangledown_J E(J) + \sqrt{\tfrac{2}{\beta}}F(t)$
Stochastic gradient flow in SDE notation
$dJ = - \bigtriangledown_J E(J)dt + \sqrt{\tfrac{2}{\beta}} dW$ – integral of white noise is Wiener process
We want to find invariant probability density $p_G$. Invariant – means it doesn’t change with time, $\tfrac{\partial p_G}{\partial t} = 0$
so from Kolmogorov forward equation
$L^*p_G = 0$
or
$\nabla^{\mathsf T}(p_G\nabla E)+ \tfrac{1}{\beta} \nabla^{\mathsf T} \nabla p_G = 0$
$p_G\nabla E + \tfrac{1}{\beta} \nabla p_G = C$
C = 0 because we want $p_G$ integrable
$\nabla(E + \tfrac{1}{\beta} log(p_G)) = 0$
and at last we get Gibbs distribution
$p_G = \frac{exp(-\beta E)}{Z}$
Recalling again the chain of reasoning:
Wiener process

SDE + Ito Lemma + Ito product rule + zero expecation of Ito integral →

Kolmogorov forward equation for probability density  →

Gibbs invariant distribution

## Meaning of conditional probability

Conditional probability was always baffling me. Empirical, frequentists meaning is clear, but the abstract definition, originating from Kolmogorov – what was its mathematical meaning? How it can be derived? It’s a nontrivial definition and is appearing in the textbooks out the air, without measure theory intuition behind it.

Here I mostly follow Chang&Pollard  paper  Conditioning as disintegartion. Beware that the paper use non-standard notation, but this post follow more common notation, same as in wikipedia.

Here is example form Chang&Pollard paper:

Suppose we have distribution on  $\mathbb{R}^2$ concentrated in two straight lines $L_1$ and $L_2$ with respective density $g_i(x)$ and angles with X axis $\alpha_i$. Observation (X,Y) taken, giving $X=x_0$, what is probability that point lies on the line $L_1$ ?

Standard approach would be approximate $x_0$ with $[ x_0, x_0 +\Delta ]$ and take limit with  $\Delta \to 0$

$\frac{\int_{x_0}^{x_0+ \Delta}g_1/cos(\alpha_1)}{\int_{x_0}^{x_0+ \Delta}g_1/cos(\alpha_1) + \int_{x_0}^{x_0+ \Delta}g_2/cos(\alpha_2)}$

Not only taking this limit is kind of cumbersome, it’s also not totally obvious that it’s the same conditional probability that defined in the abstract definition – we are replacing ratio with limit here.

Now what is “correct way” to define conditional probabilities, especially for distributions?

For simplicity we will first talk about single scalar random variable, defined on probability space. We will think of random variable X as function on the sample space. Now condition $X=x_0$ define fiber - inverse image of $x_0$.

Disintegration theorem say that probability measure on the sample space can be decomposed into two measures – parametric family of measures induced by original probability on each fiber and “orthogonal” measure on $\mathbb{R}$  - on the parameter space of the first measure. Here $\mathbb{R}$ is the space of values of X  and serve as parameter space for measures on fibers. Second measure induced by the inverse image of  the function (random variable) for each measurable set on $\mathbb{R}$. This second measure is called Pushforward measure. Pushforward measure is just for measurable set on $\mathbb{R}$ (in our case) taking its X  inverse image on sample space and measuring it with μ.

Fiber is in fact sample space for conditional event, and measure on fiber is our conditional distribution.

Full statement of the theorem require some term form measure theory. Following wikipedia

Let P(X) is collection of Borel probability measures on X,  P(Y) is collection of  Borel probability measures on Y

Let Y and X be two Radon spaces. Let μ ∈ P(Y), let $\pi$: Y → X be a Borel- measurable function, and let ν ∈ P(X) be the pushforward measure  $\mu \circ \pi^{-1}$.

* Then there exists a ν-almost everywhere uniquely determined family of probability measures $\mu_x$  ⊆  P(Y) such that
* the function $x \mapsto \mu_{x}$ is Borel measurable, in the sense that $x \mapsto \mu_{x} (B)$ is a Borel-measurable function for each Borel-measurable set B ⊆ Y;
*  $\mu_x$  ”lives on” the  fiber $\pi^{-1}(x)$ : for ν-almost all x  ∈ X,

$\mu_x(Y \setminus \pi^{-1} (x))=0$

and so $\mu_x(E)=\mu_x(E \cap \pi^{-1}(x))$

* for every Borel-measurable function $f$ : Y → [0, ∞],

$\int_{Y} f(y) \, \mathrm{d} \mu (y) = \int_{X} \int_{\pi^{-1} (x)} f(y) \, \mathrm{d} \mu_{x} (y) \mathrm{d} \nu (x)$

From here for any event E form Y

$\mu (E) = \int_{X} \mu_{x} \left( E \right) \, \mathrm{d} \nu (x)$

This was complete statement of the disintegration theorem.

Now returning to Chang&Pollard example. For formal derivation I refer you to the original paper, here we will just “guess” $\mu_x$, and by uniqueness it give us disintegration. Our conditional distribution for $X=x$ will be just point masses on the intersections of lines $L_1$ and $L_2$ with axis  $X=x$
Here $\delta$ is delta function – point mass.

$d\mu_x(y) = \frac{\delta (L_1( x )-y) g_1( x ) / cos(\alpha_1) + \delta (L_2( x )-y) g_2( x ) / cos(\alpha_2)}{Z(x)}dy$

$Z(x) = g_1(x)/cos(\alpha_1) + g_2(x)/cos(\alpha_2)$

and for $\nu$
$d\nu = Z(x)dx$

Our conditional probability that event lies on $L_1$ with condition $X = x_0$, form conditional density $d\mu_{x_0}/dy$ thus

$\frac{g_1(x_0)/cos(\alpha_1)}{g_1(x_0)/cos(\alpha_1) + g_2(x_0)/cos(\alpha_2)}$

Another example from Chen&Pollard. It relate to sufficient statistics. Term sufficient statistic used if we have probability distribution depending on some parameter, like in maximum likelihood estimation. Sufficient statistic is some function of sample, if it’s possible to estimate parameter of distribution from only values of that function in the best possible way – adding more data form the sample will not give more information about parameter of distribution.
Let $\mathbb{P}_{\theta}$ be uniform distribution on the square $[0, \theta]^2$. In that case M = max(x, y) is sufficient statistics for $\theta$. How to show it?
Let take our function $(x,y) \to m$ , $m = max(x,y)$ and make disintegration.
$d\mu_m(x,y) = s_m(x,y)dxdy$ is a uniform distribution on edges where x and y equal m and $d\nu = (2m / \theta^2 )dm$
$s_m(x,y)$ is density of conditional probability $P(\cdot | M = m)$ and it doesn’t depend on $\theta$
For any $f \ \ \mathbb{P}_{\theta}(f| m) = \int f s_m(x,y) \ , \ \mathbb{P}_{\theta, m}(\cdot) = \mathbb{P}_{m}(\cdot)$ – means that M is  sufficient.

It seems that in most cases disintegration is not a tool for finding conditional distribution. Instead it can help to guess it and form uniqueness prove that the guess is correct. That correctness could be nontrivial – there are some paradoxes similar to Borel Paradox in Chang&Pollard paper.

7, November, 2013

## Finds in arxiv, october

This is duplication of my ongoing G+ series of post on interesting for me papers in arxiv. Older posts are not here but in my G+ thread.

Finds in #arxiv :
*Optimization, numerical & convex, ML*
The Linearized Bregman Method via Split Feasibility Problems: Analysis and Generalizations
Reformulation of  Split Bregman/ ADMM as split feasibility problem and algorithm/convergence for generalized split feasibility by Bregman projection. This general formulation include both Split Bregman and Kaczmarz (my comment – randomized Kaczmarz  seems could be here too)

http://arxiv.org/abs/1309.2094

Stochastic gradient descent and the randomized Kaczmarz algorithm
Hybrid of  randomized Kaczmarz  and stochastic gradient descent  - into my “to read” pile

http://arxiv.org/abs/1310.5715

Trust–Region Problems with Linear Inequality Constraints: Exact SDP Relaxation, Global Optimality and Robust Optimization
“Extended” trust region for linear inequalities constrains

http://arxiv.org/abs/1309.3000

Conic Geometric Programming
Unifing framwork for conic and geometric programming

http://arxiv.org/abs/1310.0899

http://en.wikipedia.org/wiki/Geometric_programming

http://en.wikipedia.org/wiki/Conic_programming

Gauge optimization, duality, and applications
Another big paper about different, not Lagrange duality, introduced by Freund (1987)

http://arxiv.org/abs/1310.2639

Color Bregman TV
mu parameters in split bregman made adaptive, to exploit coherence of edges in different color channels

http://arxiv.org/abs/1310.3146

Iteration Complexity Analysis of Block Coordinate Descent Methods
Some convergence analysis for BCD and projected gradient BCD

http://arxiv.org/abs/1310.6957

Successive Nonnegative Projection Algorithm for Robust Nonnegative Blind Source Separation
Nonnegative matrix factorization

http://arxiv.org/abs/1310.7529

Scaling SVM and Least Absolute Deviations via Exact Data Reduction
SVN for large-scale problems

http://arxiv.org/abs/1310.7048

Image Restoration using Total Variation with Overlapping Group Sparsity
While title is promising I have doubt about that paper.  The method authors suggest is equivalent to adding averaging filter to TV-L1 under L1 norm. There is no comparison  to just applying TV-L1 and smoothing filter interchangeably.The method author suggest is very costly, and using median filter instead of averaging would cost the same while obviously more robust.

http://arxiv.org/abs/1310.3447

*Deep learning*
Deep and Wide Multiscale Recursive Networks for Robust Image Labeling
_Open source_ matlab/c package coming soon(not yet available)

http://arxiv.org/abs/1310.0354

Improvements to deep convolutional neural networks for LVCSR
convolutional  networks, droput for speech recognition,

http://arxiv.org/abs/1309.1501v1

DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
Already discussed on G+  - open source framework in “learn one use everywhere” stile. Learning done off-line on GPU using ConvNet, and recognition is online in pure python.

http://arxiv.org/abs/1310.1531

Statistical mechanics of complex neural systems and high dimensional data
Big textbook-like overview paper on statistical mechanics of learning. I’ve put it in my “to read” pile.

http://arxiv.org/abs/1301.7115

Randomized co-training: from cortical neurons to machine learning and back again
“Selectron” instead of perception – neurons are “specializing” with weights.

http://arxiv.org/abs/1310.6536

Provable Bounds for Learning Some Deep Representations

http://arxiv.org/abs/1310.6343

Citation:”The current paper presents both an interesting family of denoising autoencoders as

well as new algorithms to provably learn almost all models in this family.”

*Computer vision*
Online Unsupervised Feature Learning for Visual Tracking
Sparse representation, overcomplete dictionary

http://arxiv.org/abs/1310.1690

Shape restoration from local shading – could be very useful in low-feature environment.

http://arxiv.org/abs/1310.2916

Fast 3D Salient Region Detection in Medical Images using GPUs
Finding interest point in 3D images

http://arxiv.org/abs/1310.6736

Grid-based universal framework for object detection/classification

http://arxiv.org/abs/1310.7170

Gaming :)
Lyapunov-based Low-thrust Optimal Orbit Transfer: An approach in Cartesian coordinates
For space sim enthusiast :)

http://arxiv.org/abs/1310.4201

29, October, 2013

## ConvNet for windows

I have seen an excellent wlakthrough on building Alex Krizhevsky’s cuda-convnet for windows, but difference in configuration and installed packages could be tiresome. So here is complete build of convnet for windows 64:

https://github.com/s271/win_convnet

It require having CUDA compute capability 2.0 or better GPU of cause, Windows 64bit, Visual Studio 64bits and Python 64bit with NumPy installed. The rest of libs and dlls are precomplied. In building it I’ve followed instructions by Yalong Bai (Wyvernbai) from http://www.asiteof.me/archives/50.

Read Readme.md before installation – you may (or may not) require PYTHONPATH environmental variable set.

On side note I’ve used WinPython for both libraries and running the package. WinPython seems a very nice package, which include Spyder IDE. I have some problems with Spyder though – sometimes it behave erratically during debugging/running the code. Could be my inexperience with Spyder though. Another nice package – PythonXY – regretfully can not be used – it has only 32 bit version and will not compile with/run 64 bit modules.

24, October, 2013

## CUDA gamification experiment

This demo was inspired by Escapist article “I hate magic” by Robert Rath ( http://www.escapistmagazine.com/artic… ) and “continuous game of life” (http://arxiv.org/abs/1111.1567 ) by Stephan Rafler
This is attempt to show how GPGPU could be used to model magic as different physics.
The demo run at 70 fps at laptop with GF GTX 670M. With some tradeoffs it can be made run much faster
require CUDA GPGPU with at least 2.0 Cuda compute capability (practically all modern cards, GF GTX 550 or better)
Source code:
https://github.com/s271/cuMagic/

29, September, 2013 Posted by | Coding, Demo, Games, GPGPU | , , , | Leave a Comment

## __volatile__ in #cuda reduce sample

“Reduce” is one of the most useful samples in NVIDIA CUDA SDK. It’s implementation of highly optimized cuda algorithm for some of the elements of the array of the arbitrary length. It’s hardly possible to make anything better and generic enough with existing GPGPU architecture (if anyone know something as generic but considerably more efficient I’d like to know too). One of the big plus of the reduce algorithm is that it can work for any binary commutative associative operation – like min, max, multiply etc. And NVIDIA sample provide this ability – it’s implemented as reduce on template class, so all one have to do is implement class with overload of addition and assignment operations.

However there is one obstacle – it’s a __volatile__ qualifier in the code. Simple overload of “=” “+=” and “+” operations  in class LSum cause compiler error like

error: no operator “+” matches these operands
1> operand types are: LSum + volatile LSum

The answer is add __volatile__ to all class operation, but there is the trick here:

for “=”  just

volatile LSum& operator =(volatile LSum &rhs)

is not enough. You should add volatile to the end too, to specify not only input and output, but function itself as volatile.

At the end class looks like:

class LSum
{
public:

__device__ LSum& operator+=(volatile LSum &rhs)

{

return *this;

};

__device__ LSum operator+(volatile LSum &rhs)
{
LSum res = *this;
res += rhs;
return res;
};

__device__ LSum& operator =(const float &LSum)
{

return *this;

};

__device__ volatile LSum& operator =(volatile LSum &rhs) volatile
{

return *this;
};
};

11, May, 2013 Posted by | Uncategorized | Comments Off

## Split-Bregman for total variation: parameter choice

Split Bregamn method is one of the most used method for solving convex optimization problem with non-smooth regularization. Mostly it used for $L_1$ regularization, and here I want to talk about TV-L1 and similar regularizers ( like $\|\Psi u\|_1$ )
For simplest possible one-dimentional denoising case
$min_{u} \quad (u-f)^2 - \mu\|\frac{\partial u}{\partial x}\|_1$ (1)
Split Bregman iteration looks like
$min_{\substack u} \quad (u-f)^2 + \mu\lambda(\frac{\partial u}{\partial x}+b-d)^2$ (2)
$d = Shrink(\frac{\partial u}{\partial x} + b, \frac{1}{\lambda})$ (3)
$b = b + \frac{\partial u}{\partial x} - d$ (4)
There is a lot of literature about choosing $\mu$ parameter from (1) for denosing. Methods include Morozov discrepancy principle, L-curve and more – that’s just parameter of Tikhonov regularization.
There is a lot less metods for choosing $\lambda$ for Split Bregman.
Theoretically $\lambda$ shouldn’t be very important – if Split Bregman converge solution is not depending on $\lambda$. In practice of cause convergence speed and stability depend on $\lambda$ a lot. Here I’ll point out some simple consideration which may help to choose $\lambda$
Assume for simplicity that u is twice differentiable almoste everythere, the (1) become
$Lu = u-f - \mu\lambda*(\frac{\partial^2 u}{\partial x^2} + \frac{\partial b}{\partial x} - \frac{\partial d}{\partial x}) = 0$ (5)
We know if Split Bbregman converge $d \to \frac {\partial u}{\partial x}$
that mean $\frac {\partial b} {\partial x} \to \frac {\partial} {\partial x} sign(\frac {\partial u}{\partial x})$, here exteranl derivative is a weak derivative.
For the solution b is therefore locally constant, and we even know what constant it is.
Recalling
$Shrink(x, \frac{1}{\lambda}) = \left\{\begin{array} {l l} |x| \leq \frac{1}{\lambda} : 0\\ |x| > \frac{1}{\lambda} : x - sign(x)\frac{1}{\lambda} \end{array} \right.$
For $|x| > \frac{1}{\lambda}$ we have
$d = \frac{\partial u}{\partial x} + b \pm \frac{1}{\lambda}$
and accordingly
$b = \mp \frac{1}{\lambda}$
For converged solution
$b(x) = \frac{1}{\lambda}sign(\frac{\partial u}{\partial x})$
everythere where $\frac {\partial u}{\partial x}$ is not zero, with possible exeption of measure zero set.
Now returning to choice of $\lambda$ for Split Bregman iterations.
From (4) we see that for small values b , b grow with step $\frac {\partial u}{\partial x}$ until it reach it’s maximum absolute value $\pm \frac{1}{\lambda}$
Also if the b is “wrong”, if we want for it to switch value in one iteration $| \frac {\partial u}{\partial x}|$ should be more than $\frac{2}{\lambda}$
That give us lower boundary for $\lambda$:
For most of x $| \frac {\partial u}{\partial x}| \geq \frac{2}{\lambda}$
What happens if on some interval b reach $\mp \frac{1}{\lambda}$ for two consequtive iterations?
Then on the next iteration form (5) and (3) on that interval
$u_k-f - \mu\lambda(\frac{\partial^2 u_k}{\partial x^2}-\frac{\partial^2 u_{k-1}}{\partial x^2} ) = 0$
or
$u_k = min_{u} (u-f)^2 + \mu\lambda(\frac {\partial u}{\partial x} - \frac {\partial u_{k-1}}{\partial x})^2$
See taht with $\mu\lambda$ big enuogh sign of $\frac {\partial u_k}{\partial x}$ stabilizing with high probaility and we could be close to true solution. The “could” in the last sentence is there because we are inside the interval, and if the values of u on the ends of interval are wrong b “flipping” can propagate inside our interval.
It’s more difficult to estimate upper boundary for $\lambda$.
Obviously for more easy solution of (2) or (5) $\lambda$ shouldn’t be too big, so that eigenvalues of operator $L$ woudn’t be too close to 1. Because solutions of (2) are inexact in Split Breagman we obviously want $L$ having bigger eigenvalues, so that single (or small number of) iteration could suppress error good enough for (2) subproblem.
So in conclusion (and my limited experience) if you can estimate $avg |\frac {\partial u}{\partial x}|$ for solution $\lambda = \frac{2}{avg |\frac {\partial u}{\partial x}|}$ could be a good value to start testing.

2, January, 2013 Posted by | computer vision | 2 Comments

## Algebraic topology – now in compressed sensing

Thanks to Igor Carron for pointing me to this workshop – Algebraic Topology and Machine Learning . There is very interesting paper there Persistent Homological Structures in Compressed Sensing and Sparse Likelihood by Moo K. Chung, Hyekyung Lee and Matthew Arnold. The paper is very comprehensive and require only minimal understanding of some algebraic topology concepts (which is exactly where I’m in realation to algebraic topology). Basically it’s application of topological data analysis to compressive sensing. They use such thing as persistent homology and “barcodes”. Before, persistent homology and barcodes were used for such things as extracting solid structure from noisiy point cloud. Barcode is stable to noise dependence of some topological invariants on some parameter. In case of point cloud parameter is the radius of the ball around each point. As radius go from very big to zero topology of union of balls change, and those changes of topology make barcode. Because barcode is stable topological invariant learning barcode is the same as learning topology of solid structure underlying point cloud.
In the paper authors using graphical lasso (glasso) with $L_1$ regularizer to find interdependency between set of sampled variables. However if consider $\lambda$ parameter of regularizer as a kind of radius of ball this problem aquire persistent homology and barcode. The correlation matrix is thresholded by $\lambda$ and become adjacency matrix of some graph. Barcode is now dependence of topology of that graph on parameter $\lambda$. What is especially spectacular is that to calculate barcode no glasso iteration are needed – barcode obtained by simple thresholding of correlation matrix. Thus barcode easily found and with it topology of correlations of variables. Well, at least that is how I understood the paper.
PS Using this approach for total variation denoising barcode would include dependance of size function from smoothing parameter.

11, December, 2012 Posted by | Uncategorized | Comments Off

## Deleting files which couldn’t be deleted by Administrator in Win 7

If you are an unfortunate person who have to use Windows 7 a lot (like me) undeletable files happen to you from time to time. It could be result of disk failure, or interrupted boot, or just for no reason at all.
The problem is, sometimes they can’t be deleted even by Administrator. To fix this Administrator have have to “take ownership” of the file in question.
It goes like this:
1. run from the command line
chkdsk /f
This step is necessary because file permissions could be corrupted