Something I did for Samsung (kernel of tracker). Biggest improvement in SARI 1.5 is the sensors fusion, which allow for a lot more robust tracking.
Here is example of run-time localization and mapping with SARI 1.5:
This is the AR EdiBear game (free in Samsung apps store)
Thanks to Igor Carron I’ve watched a great videolecture by Stephane Mallat High dimensional classification by recursive interferometry. Actually I watched it twice, and I think I understand most of it now))). And it was not about compressed sensing, not even about manifold learning much . It was mostly about a new application of wavelets . How to use wavelets to produce low dimensional data (image descriptor if we are talking about computer vision) from high dimensional data(that is image). The idea is to transcend linear representation and use nonlinear operation – absolute value of wavelet. Absolute value – square root of wavelet square carry information of frequencies differences. It’s invert Fourier transform have new harmonics – differences of frequencies of original function. That interference of harmonics of original image. Now it was reminding me something. Yep – phase congruency (pdf). Phase congruency also use absolute value of wavelet(windowed Fourier). It seems to me it has perfect explanation. Interference pattern defined by how in-phase both wave are. That is it’s like a phase congruency taken into each point. Phase congruency edge-detector is in fact finding maximum of somehow normalized interference pattern. In that sense this Mallat’s method producing invariants from high-dimensional data is analogous to producing sketch from photo.
Ok, enough rambling for now.
Here is some more narrow baseline local bundle adjustment, from only two camera frames.
Outlier is drawn in red. Some points are not detected as outliers, but still are not localized properly.
Multiscale FAST used for detection. No descriptors were used for point correspondence, instead incremental tracking with search in sliding window by average gradient responses was used(there are three tracked frames between those two). I think those bad points could possibly be isolated with some geometric consistency rules, presuming landscape is smooth.
Here I want to talk about matching in image registration. We are doing registration in 3D or 2D, and using feature points for that. Next stage after extraction of feature points from the image is finding corresponding points in two(or more) images. Usually it’s done with descriptors, like SIFT, SURF, DAISY etc. Sometimes randomized trees are used for it. Whatever methods is used it usually has around .5% of false positives. False positives create outliers in registration algorithm. That is not a big problem in planar trackers or model/marker trackers. It could be a problem for Structure From Motion though. If CPU power is not limited the problem is not very serious. Heavy-duty algorithms like full-sequence bundle adjustment and RANSAC cope with outliers pretty well. However even for high-end mobile phones such algorithms are problematic. Some tricks can help – Georg Klein put full-sequence bundle adjustment into separate thread on PTAM tracker to run asynchronously, but I’m trying to do local, 2-4 frames bundle adjustment here. The problem of false positives is especially difficult for images of patterned environment, where some image parts are similar or repeated.
Here mismatched correspondence marked with blue line (points 15-28).
As you can see it’s not easy for any descriptor to tell the difference between points 13(correct) and 15(wrong) on the left image – their neighborhood is practically the same:
Such situations could easily happen not only indoor, but also in cityscape, industrial, and others regular environments.
One solution for such cases is to increase descriptor radius, to process a bigger patch around the point, but that would create problems of its own, for example too much false negatives.
Other approach is to use geometric consistency of the image points positions.
There are at least two ways to do it.
One is to consider displacements of corresponding points between frames. Here is example from paper by Kanazawa et al “Robast Image Matching Preserving Global Geometric Consistency”
This method first gathering local displacement statistic around each points, filter out outliers and and apply smoothing filter. Here are original matches, matches after applying consistency check and matches after applying smoothing filter.
However this method works best for dense, regular sets of feature points. For small, sparse set of points it does not improving situation much.
Here is a second approach. Build graph out of feature points for each frame.
Local topological structure of the two graphs is different because of false positives. It’s easy to find graph vertices/edges which cause inconsistency – edges marked blue.They can be found for example by signs of crossproducts between edges. After offending vertices found they are removed:
There are different ways to build graph out of feature points. Simplest is nearest neighbors, but may be Delaney triangulation or DSP can do better.
Trying a new descriptor, inspired by SURF and SIFT. Want to use gradient instead of Haar transforms of intensity, but with less dimensionality than SURF. Also don’t need rotation/scale invariance, because using incremental tracking.
Just found out – Mars Rovers used bundle adjustment for its localization and rocks modeling:
“Purpose of algorithm:
To perform autonomous long-range rover localization based on bundle adjustment (BA) technology.
Processing steps of the algorithm include interest point extraction and matching, intra- and inter- stereo tie point selection, automatic cross-site tie point selection by rock extraction, modeling and matching, and bundle adjustment”
I often hear sentiments from users that they don’t like markers, and they are wondering, why there are so relatively few markerless AR around. First I want to say that there is no excuse for using markers in the static scene with immobile camera, or if desktop computer is used. Brute force methods for tracking like bundle adjustment and fundamental matrix are well developed and used for years and years in the computer vision and photogrammetry. However those methods in their original form could hardly produce acceptable frame rate on the mobile devices. From the other hand marker trackers on mobile devices could be made fast, stable and robust.
So why markers are easy and markerless are not ?
The problem is the structure , or “shape” of the points cloud generated by feature detector of the markerless tracker. The problem with structure is that depth coordinate of the points is not easily calculated. That is even more difficult because camera frame taken from mobile device have narrow baseline – frames taken form position close one to another, so “stereo” depth perception is quite rough. It is called structure from motion problem.
In the case of the marker tracker all feature points of the markers are on the same plane, and that allow to calculate position of the camera (up to constant scale factor) from the single frame. Essentially, if all the points produced by detector are on the same plane, like for example from the pictures lying on the table, the problem of structure from motion goes away. Planar cloud of point is essentially the same as the set of markers – for example any four points could be considered as marker and the same algorithm could apply. Structure from motion problem is why there is no easy step from “planar only” tracker to real 3d markerless tracker.
However not everything is so bad for mobile markerless tracker. If tracking environment is indoor, or cityscape there is a lot of rectangles, parallel lines and other planar structures around. Those could be used as initial approximation for one the of structure from motion algorithm, or/and as substitutes for markers.
Another approach of cause is to find some variation of structure from motion method which is fast and works for mobile. Some variation of bundle adjustment algorithm looks most promising to me.
PS PTAM tracker, which is ported to iPhone, use yet another approach – instead of using bundle adjustment for each frame, bundle adjustment is running in the separate thread asynchronously, and more simple method used for frame to frame tracking.
PPS And the last thing, from 2011:
I have tested oriented descriptors SURF descriptors vs upright descriptors for approximately horizontally oriented camera images and got feature density less than oriented then for upright. Repeatability of oriented was worse too…
One of the big problem in image registration/structure from motion/3d tracking is using global information of the image. Feature/blob extraction, like SIFT, SURF or FAST etc using only local information around the point. Region detector like MSER using area information, but MSER is not good at tracking textures, and not quite stable at complex scenes. Edge detection provide some non-local information, but require processing edges. That could be computationally heavy, but looks promising anyway. There are a lot of methods which use global information – all kind of texture segmentation, epitome, snakes/appearance models, but those are computationally heavy and not suitable for mobiles. The question is how to incorporate global information from the image into tracker, and make it with minimal amount of operations. One way is to optimise tracker for specific environment – for example use the property of cityscape, a lot of planar structures and straight lines. Such multiplanar tracker wouldn’t work in the forest or park, but could be a working compromise.