Here I want to talk about matching in image registration. We are doing registration in 3D or 2D and using feature points for that. The next stage after extracting feature points from the image is finding corresponding points in two (or more) images. Usually it’s done with descriptors like SIFT, SURF, DAISY etc. Sometimes randomized trees are used for it. Whatever method is used, it usually produces around 0.5% false positives. False positives create outliers in the registration algorithm. That is not a big problem in planar trackers or model/marker trackers. It could be a problem for Structure From Motion though. If CPU power is not limited, the problem is not very serious: heavy-duty algorithms like full-sequence bundle adjustment and RANSAC cope with outliers pretty well. However, even on high-end mobile phones such algorithms are problematic. Some tricks can help: Georg Klein put full-sequence bundle adjustment into a separate thread in the PTAM tracker so that it runs asynchronously, but I’m trying to do local bundle adjustment over 2-4 frames here. The problem of false positives is especially difficult for images of patterned environments, where some image parts are similar or repeated.
Here the mismatched correspondence is marked with a blue line (points 15-28).
As you can see, it’s not easy for any descriptor to tell the difference between points 13 (correct) and 15 (wrong) on the left image; their neighborhoods are practically the same:
Such situations can easily happen not only indoors, but also in cityscapes, industrial settings, and other regular environments.
One solution for such cases is to increase the descriptor radius, to process a bigger patch around the point, but that would create problems of its own, for example too many false negatives.
Another approach is to use the geometric consistency of the image point positions.
There are at least two ways to do it.
One is to consider the displacements of corresponding points between frames. Here is an example from the paper by Kanazawa et al., “Robust Image Matching Preserving Global Geometric Consistency”.
This method first gathers local displacement statistics around each point, filters out outliers, and applies a smoothing filter. Here are the original matches, the matches after applying the consistency check, and the matches after applying the smoothing filter.
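To make the displacement idea concrete, here is a minimal Python sketch. It is not the algorithm from the paper, only the general principle: compare each match’s displacement with the median displacement of its nearest neighbours and drop the matches that deviate too much. The smoothing step is left out, and the function name, `k` and `max_dev` are my own illustrative choices.

```python
import math
from statistics import median

def filter_by_local_displacement(pts_a, pts_b, k=4, max_dev=10.0):
    """Return indices of matches whose displacement agrees with their neighbours.

    pts_a[i] and pts_b[i] are the (x, y) positions of match i in the first
    and second frame. k and max_dev (in pixels) are illustrative assumptions.
    """
    disp = [(bx - ax, by - ay) for (ax, ay), (bx, by) in zip(pts_a, pts_b)]
    keep = []
    for i, (ax, ay) in enumerate(pts_a):
        # k nearest neighbours of point i in the first frame (brute force)
        others = sorted(
            (j for j in range(len(pts_a)) if j != i),
            key=lambda j: math.hypot(pts_a[j][0] - ax, pts_a[j][1] - ay),
        )[:k]
        if not others:
            keep.append(i)
            continue
        # the median tolerates a few outliers among the neighbours
        mx = median(disp[j][0] for j in others)
        my = median(disp[j][1] for j in others)
        if math.hypot(disp[i][0] - mx, disp[i][1] - my) <= max_dev:
            keep.append(i)
    return keep
```

The brute-force neighbour search is quadratic in the number of points, which is fine for a sketch; a grid or k-d tree would be the obvious replacement for larger sets.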
However, this method works best for dense, regular sets of feature points. For a small, sparse set of points it does not improve the situation much.
Here is a second approach. Build a graph out of the feature points for each frame.
The local topological structure of the two graphs is different because of false positives. It’s easy to find the graph vertices/edges which cause the inconsistency (edges marked blue). They can be found, for example, by the signs of cross products between edges. After the offending vertices are found, they are removed:
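A rough Python sketch of the cross-product check, under my own simplifying assumptions: the graph is a k-nearest-neighbour graph built on the first frame only, every pair of edges leaving a vertex is tested for an orientation flip between the frames, each flip counts as a vote against all three vertices involved, and the vertex with the most votes is removed until no flips remain. The voting scheme and the names are illustrative, not a fixed recipe.

```python
import math
from itertools import combinations

def cross(o, p, q):
    """z-component of the cross product of edges o->p and o->q."""
    return (p[0] - o[0]) * (q[1] - o[1]) - (p[1] - o[1]) * (q[0] - o[0])

def knn_graph(pts, active, k):
    """Map each active vertex to its k nearest active neighbours."""
    graph = {}
    for i in active:
        near = sorted((j for j in active if j != i),
                      key=lambda j: math.dist(pts[i], pts[j]))
        graph[i] = near[:k]
    return graph

def remove_inconsistent(pts_a, pts_b, k=3):
    """Return indices of matches that survive the orientation check.

    pts_a[i] and pts_b[i] are the positions of match i in the two frames.
    k is an illustrative assumption.
    """
    active = list(range(len(pts_a)))
    while len(active) > 2:
        graph = knn_graph(pts_a, active, k)
        votes = dict.fromkeys(active, 0)
        for i, nbrs in graph.items():
            for p, q in combinations(nbrs, 2):
                ca = cross(pts_a[i], pts_a[p], pts_a[q])
                cb = cross(pts_b[i], pts_b[p], pts_b[q])
                if ca * cb < 0:  # the edge pair changed orientation
                    for v in (i, p, q):
                        votes[v] += 1
        worst = max(active, key=lambda v: votes[v])
        if votes[worst] == 0:
            break  # the two graphs are locally consistent
        active.remove(worst)
    return active
```

A single flipped triangle cannot tell which of its three vertices is the wrong one, so the vote only becomes decisive when a mismatched point takes part in several flipped edge pairs. Large perspective changes can also flip orientations legitimately, so this assumes small motion between frames.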
There are different ways to build a graph out of feature points. The simplest is nearest neighbors, but maybe Delaunay triangulation or DSP can do better.
Trying a new descriptor, inspired by SURF and SIFT. I want to use gradients instead of Haar transforms of intensity, but with lower dimensionality than SURF. I also don’t need rotation/scale invariance, because I’m using incremental tracking.
Here is a sample of image registration with a fiducial marker (actually the marker I used in my games) vs registration with bundle adjustment. Blue lines are point heights (relative to the marker plane) calculated using marker registration and triangulation. White lines are the same using (modified) bundle adjustment. Points were extracted with multiscale FAST and fitted with log-polar Fourier descriptors for correspondence (actually, SURF descriptors produce the same correspondences).
As you can see, markerless is in no way worse than markers, at least in this example ))).
I have tested oriented SURF descriptors against upright descriptors on images from an approximately horizontally oriented camera, and the feature density was lower for oriented than for upright. Repeatability of oriented was worse too…
One of the big problems in image registration/structure from motion/3D tracking is using global information from the image. Feature/blob extraction like SIFT, SURF or FAST uses only local information around the point. Region detectors like MSER use area information, but MSER is not good at tracking textures and not quite stable in complex scenes. Edge detection provides some non-local information, but requires processing the edges. That could be computationally heavy, but looks promising anyway. There are a lot of methods which use global information (all kinds of texture segmentation, epitomes, snakes/appearance models), but those are computationally heavy and not suitable for mobiles. The question is how to incorporate global information from the image into the tracker with a minimal number of operations. One way is to optimise the tracker for a specific environment, for example by using the properties of a cityscape: a lot of planar structures and straight lines. Such a multiplanar tracker wouldn’t work in a forest or park, but could be a working compromise.
Testing outdoor markerless tracking with a FAST/SURF feature detector.
The plane of the camera is not parallel to the ground, which makes it difficult to estimate precision by eye.
Features detected with multistage FAST and fitted with SURF descriptors.
A less strict threshold gives a lot more correspondences, but also some false positives.
I did some research on SURF optimization. While it is still possible to make it significantly faster with lazy evaluation, the problem of scale remains. Fine-scale features are not detectable at the bigger scale, so it doesn’t look like there is an easy way to reduce the search area using only the upper scale. If scale space does not help to reduce the search area, it becomes a liability for mobile tracking: range can’t change too fast for a mobile phone, so the scale of a feature will be about the same between frames.
Will try plain, non-scale-space corner detectors now, starting with FAST.
I continue to test SURF with respect to scale space. Scale space is essentially a pyramid of progressively more blurred or lower-resolution images. The idea of scale-invariant feature detection is that a “real” feature should be present at several scales, that is, it should be clearly detectable at several image resolution/blur levels. The interesting thing I see is that for SURF, at least on the test images from Mikolajczyk’s dataset, scale space doesn’t seem to affect the detection rate under viewpoint change. I mean that it makes no difference whether a feature is distinct at several scales or only at one. That’s actually reasonable: scale space obviously benefits detection in blurred or noisy images, or repeatability/correspondence in scaled images, while the “viewpoint” images from Mikolajczyk’s dataset are clear, high resolution and at about the same scale. Nevertheless, there is some possibility for optimization here.
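For readers who have not seen one, a pyramid of lower-resolution images can be sketched in a few lines of Python. This is only an illustration of the concept: it halves the resolution by averaging 2x2 blocks, whereas SURF itself keeps the image size and grows box filters over an integral image. The function names are mine.

```python
def downsample2(img):
    """Halve the resolution by averaging each 2x2 block (crude blur + subsample)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [
        [
            (img[2 * y][2 * x] + img[2 * y][2 * x + 1]
             + img[2 * y + 1][2 * x] + img[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w)
        ]
        for y in range(h)
    ]

def build_pyramid(img, levels):
    """Return [img, half-size, quarter-size, ...] with at most `levels` entries."""
    pyramid = [img]
    for _ in range(levels - 1):
        top = pyramid[-1]
        if len(top) < 2 or len(top[0]) < 2:
            break  # cannot halve a single row or column any further
        pyramid.append(downsample2(top))
    return pyramid
```

A detector that wants features “present at several scales” would run on every level and keep the points that respond on more than one of them.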
I have tested several modifications of SURF: the original SURF Hessian, the extremum of a SURF-based Laplacian, Hessian-Laplace (extremum of both Hessian and Laplacian), and the minimal eigenvalue of the Hessian. They all give about the same detection rate, but the original SURF Hessian gives better results. The minimal eigenvalue of the Hessian seems to scale better with the threshold value: the absolute value of the original Hessian (determinant) can be very low, while the eigenvalues are not. So this approach may have some advantage where precision loss is a potential problem, for example in fixed-point calculations. A lot of high-end mobile phones are still launched without hardware floating point, so it could still be useful in AR or Computer Vision applications.
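The scaling remark can be illustrated with a small Python sketch. The determinant is a product of second-derivative responses, so it shrinks with the square of the contrast, while an eigenvalue shrinks only linearly. The 0.9 weight is the one used in SURF’s approximated determinant; taking “minimal” to mean the eigenvalue of smaller magnitude is my assumption, and the function names are illustrative.

```python
import math

def hessian_det(dxx, dyy, dxy, w=0.9):
    """SURF-style approximated determinant; w compensates for the box filters."""
    return dxx * dyy - (w * dxy) ** 2

def hessian_min_eig(dxx, dyy, dxy):
    """Eigenvalue of [[dxx, dxy], [dxy, dyy]] with the smaller magnitude."""
    mean = 0.5 * (dxx + dyy)
    root = math.sqrt((0.5 * (dxx - dyy)) ** 2 + dxy ** 2)
    l1, l2 = mean + root, mean - root
    return l1 if abs(l1) < abs(l2) else l2

# Low-contrast responses: the determinant is orders of magnitude smaller
# than the eigenvalue, which is what hurts in fixed-point arithmetic.
weak = (0.02, 0.03, 0.005)
print(hessian_det(*weak), hessian_min_eig(*weak))
```

Dividing the three responses by 10 divides the determinant by 100 but the eigenvalue only by 10, so a threshold on the eigenvalue stays in a more comfortable numeric range.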