Project 2: Local Feature Matching
The goal of this project is to implement a local feature matching algorithm, which allows us to match points of the same physical scene across images taken from different perspectives.
In this page we will discuss the steps taken during the development of the project as well as the results obtained. The project can be broken down into three main parts:
- Interest point detection
- Local feature description
- Feature matching
Given a pair of images, the first step is to identify the potential interest points, which we will do by implementing the Harris corner detection algorithm. Once the interest points are identified, the next step is to extract the information around each interest point, in other words, to extract local feature descriptions. We will implement a SIFT-like algorithm for this purpose. Lastly, we will match the features obtained from one image with those obtained from the other. This will be done using nearest neighbor search. We will now describe each of these steps in detail.
1. Interest point detection
We detect interest points with the Harris corner detector described above (a sketch of the detector is given below). As the number of interest points increases, the chances of capturing the best interest points increase (window = 3), but so do the chances of detecting irrelevant points (threshold = 0.0001). Conversely, if only a small number of interest points is selected, the chances of missing the best and most representative interest points increase, which translates into an accuracy drop (threshold = 0.01; window = 3).
*The number of keypoints in the table corresponds to the minimum number of keypoints between the two pictures. This metric was chosen because of the way the feature matching algorithm works: it only matches the features of the image with the lowest number of features (see section 3 for more details).
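For reference, here is a minimal sketch of a Harris detector along the lines described above, assuming a grayscale double image im; the function name harris_sketch, the fixed alpha value, and the gradient/smoothing helpers are illustrative choices rather than our exact implementation, while window and threshold correspond to the parameters discussed above.

```matlab
function [x, y] = harris_sketch(im, window, threshold)
% Minimal sketch of a Harris corner detector (illustrative, not our exact code).
% im        - grayscale image (double, values in [0, 1])
% window    - size of the Gaussian window used to sum gradient products (e.g. 3)
% threshold - cut-off applied to the normalized corner response (e.g. 0.0001)

% Image gradients
[Ix, Iy] = imgradientxy(im);

% Entries of the second-moment matrix, smoothed over the local window
g   = fspecial('gaussian', window, 1);
Ixx = imfilter(Ix .^ 2, g);
Iyy = imfilter(Iy .^ 2, g);
Ixy = imfilter(Ix .* Iy, g);

% Harris corner response: det(M) - alpha * trace(M)^2 (alpha = 0.05 assumed)
alpha = 0.05;
R = (Ixx .* Iyy - Ixy .^ 2) - alpha * (Ixx + Iyy) .^ 2;
R = R / (max(R(:)) + eps);              % normalize so the threshold is comparable

% Keep local maxima above the threshold
R(R < threshold) = 0;
[y, x] = find(imregionalmax(R) & R > 0);
end
```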
2. Local feature descriptors
Once interest points have been detected in each pair of images, the next step is to describe the region around each interest point. As a start, we implemented a normalized patches descriptor, which returned an accuracy of 80%, a good score for such a simple descriptor. Next, we decided to implement a basic SIFT-like algorithm, which should presumably score a higher accuracy now that gradients are taken into account. The algorithm runs as follows (a sketch is given after the list):
- Apply a Gaussian filter to the image in order to remove noise.
- Calculate gradient directions and magnitudes for each pixel in the image.
- For each interest point, extract its corresponding window and divide it into a 4×4 grid of cells. In practice, this division is carried out together with the next step.
- For each cell in the grid, take the orientations and magnitudes of its pixels, and add the magnitudes to their corresponding bins (histogram of orientations).
- Normalize the vector of bins for each interest point, clip the values to 0.2, and normalize it again. (The clipping step has been commented out in the code, since it doesn't seem to have a positive impact.)
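To make the steps above concrete, below is a minimal sketch of such a descriptor, assuming a grayscale image im and interest point coordinates x, y; the function name sift_like_sketch, the border handling, the fixed blur sigma, and the bin-assignment formula are illustrative choices rather than our exact implementation.

```matlab
function features = sift_like_sketch(im, x, y, feature_width)
% Sketch of the SIFT-like descriptor described above (illustrative only).
% im            - grayscale image (double)
% x, y          - interest point coordinates (column vectors)
% feature_width - side of the square window around each point (e.g. 16)

num_bins = 8;                                     % orientation histogram bins
cells    = 4;                                     % 4x4 grid of cells
cell_w   = feature_width / cells;
half     = feature_width / 2;

% 1. Blur the image to reduce noise, then compute gradients
im = imgaussfilt(im, 2.5);                        % sigma discussed in the experiments below
[gmag, gdir] = imgradient(im);                    % gdir in degrees, (-180, 180]

features = zeros(numel(x), cells * cells * num_bins);

for k = 1:numel(x)
    % 2. Cut out the window around the interest point
    rows = round(y(k)) - half + 1 : round(y(k)) + half;
    cols = round(x(k)) - half + 1 : round(x(k)) + half;
    if min(rows) < 1 || min(cols) < 1 || max(rows) > size(im,1) || max(cols) > size(im,2)
        continue;                                 % skip points too close to the border
    end
    mag = gmag(rows, cols);
    ori = gdir(rows, cols);

    % 3. Accumulate an orientation histogram for each cell of the 4x4 grid
    desc = zeros(cells, cells, num_bins);
    bin  = floor(mod(ori, 360) / (360 / num_bins)) + 1;   % bin index in 1..num_bins
    for r = 1:feature_width
        for c = 1:feature_width
            cr = ceil(r / cell_w);
            cc = ceil(c / cell_w);
            desc(cr, cc, bin(r, c)) = desc(cr, cc, bin(r, c)) + mag(r, c);
        end
    end

    % 4. Normalize the resulting 128-d vector (clip-and-renormalize step omitted, as noted above)
    v = desc(:)';
    features(k, :) = v / (norm(v) + eps);
end
end
```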
As we can see, there are many parameters we can play with at this stage as well. Here we show a few of the experiments we tried and the conclusions drawn from them:
- Feature_width: (16) First of all, it must be mentioned that the size of the sliding window in the get_interest_function depends on feature_width (neigSize = feature_width/2), such that an increase in feature_width inevitably implies a decrease in the number of interest points generated, which in turn generally results in a drop in accuracy (this happened with feature_width = 32 and 64).
Leaving the window size aside (for this part of the experiment we fixed its value to 8), an increase in feature_width translates into a higher running time, since more pixels are processed for every interest point. Also, broad patches increase the chances of including picture-specific characteristics that won't appear in the other picture, making it difficult to find matches or leading this noise to generate wrong matches. Here are some of the results obtained for Notre Dame:
Table 2. Effects of changing feature width.
As we can see, the results support our statements above and show that the best configuration is feature_width = 16 with window size = feature_width/2 = 8, both in terms of accuracy and running time.
- Number of orientations per histogram: (8) As we can see from the table, as the number of bins increases, the accuracy drops. This is probably because, as the number of orientations to take into account increases, the matching process becomes more sensitive to rotations, since a slight rotation might make pixels fall into a different orientation bin. On the contrary, if the bins are too large (covering wide angles), the information retained in each bin might be too generic, with pixels of significantly different gradient orientations being mixed together. In other words, if the bins are too large, they might stop being representative of the image's cells.
Table 3. Effects of changing the number of bins.
- Sigma for blurring the original image: (2.5) My first intuition was to set its value equal to the sigma used in interest point detection. With this configuration I achieved 93% accuracy on Notre Dame and an impressive 100% on Mount Rushmore. I tried other values as well and realized that, in general, values above 1 did not perform well, except (oddly) for 2.5, with which I managed to achieve 95% on Notre Dame and 98% on Mount Rushmore. Finally, I decided to keep the value at 2.5 since, after trying multiple changes in other parameters, the results obtained with 2.5 seemed more robust to changes than the previous ones.
Illustration 2. Local feature matching. Mount Rushmore – 100% accuracy with the top 100 most confident interest points.
3. Feature matching
Once we have computed the local feature descriptors, the last step is to match them successfully. For this purpose, we have taken two approaches. In the first one we implemented our own code for nearest neighbor search, while for the second we used Matlab's built-in function knnsearch, which by default uses a kd-tree as its nearest neighbor search method. First we will explain the parameters applied to our code, so that we can compare the performance of the two approaches.
For our own function, we implemented a simple nearest neighbor search that calculates the distances between each feature in one image and all the features in the other image. Once we have these vectors of distances, we sort them in ascending order and calculate the distance ratio between the two nearest neighbors. As we can see, it is a simple algorithm with just a few parameters to discuss (a sketch is given after the list):
- Distance measure: euclidean.
- Nearest Neighbor Distance Ratio (NNDR): dist(nn1)/dist(nn2).
- Threshold: 0.8. If the NNDR is greater than the threshold, the match is rejected (its confidence is zeroed out). With this threshold we aim to remove features that have two possible and very similar correspondences in the other image.
- Confidence metric: 1-nndr.
- Selection of source image: the image with the fewest features. We call the source image the image whose features we try to match. Our idea is to look for matches starting from the image with the fewest features, so that the chances of matching two features in one image to the same feature in the other decrease. Computational time also improves slightly, since fewer calculations (NNDR, confidence metric, etc.) are performed.
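Below is a minimal sketch of this matcher under the assumptions above (Euclidean distance, 0.8 ratio threshold, features1 taken from the source image with the fewest features); the function name match_nndr_sketch and the exact bookkeeping are illustrative, not our exact code.

```matlab
function [matches, confidences] = match_nndr_sketch(features1, features2)
% Sketch of the nearest neighbor + ratio test matcher described above
% (illustrative only). features1 should come from the source image,
% i.e. the image with the fewest features.

n = size(features1, 1);
matches     = zeros(n, 2);
confidences = zeros(n, 1);

for i = 1:n
    % Euclidean distances from feature i to every feature of the other image
    diffs = features2 - features1(i, :);        % implicit expansion (R2016b+)
    dists = sqrt(sum(diffs .^ 2, 2));

    % Sort ascending and compute the nearest neighbor distance ratio (NNDR)
    [sorted, idx] = sort(dists, 'ascend');
    nndr = sorted(1) / (sorted(2) + eps);

    matches(i, :) = [i, idx(1)];
    if nndr <= 0.8                              % ratio test: reject ambiguous matches
        confidences(i) = 1 - nndr;              % confidence metric
    end
end

% Keep only matches that passed the ratio test, most confident first
keep        = confidences > 0;
matches     = matches(keep, :);
confidences = confidences(keep);
[confidences, order] = sort(confidences, 'descend');
matches = matches(order, :);
end
```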
In terms of parameter values, the same values were used for knnsearch. The only thing that was not implemented was the source image selection, in order to keep the code cleaner. As we can see below, this did not have a big impact on the comparison, and knnsearch still obtains a better running time.
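For comparison, the knnsearch-based variant can be expressed roughly as follows (a sketch under the same assumptions as above; variable names are illustrative, and Matlab selects the neighbor-search method internally):

```matlab
% Sketch of the knnsearch-based variant (illustrative). features1 is again the
% query set; 'K', 2 returns the two nearest neighbors needed for the ratio test.
[idx, d] = knnsearch(features2, features1, 'K', 2);

nndr        = d(:, 1) ./ (d(:, 2) + eps);       % nearest neighbor distance ratio
confidences = 1 - nndr;
confidences(nndr > 0.8) = 0;                    % same 0.8 ratio-test threshold as above

matches = [(1:size(features1, 1))', idx(:, 1)];
matches = matches(confidences > 0, :);          % keep only unambiguous matches
```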
Table 4. Accuracy and running time with my code and with knnsearch.
Read more: Cool clay cup designs
As expected, knnsearch using a kd-tree runs significantly faster than our code. This is because it is a built-in function and we can therefore expect it to be highly optimized. In terms of accuracy, both return the same percentage, except for Gaudi, where my code finds 1 more feature (150%, wohoo!) than knnsearch. This is probably due to the source image selection difference mentioned above.
One of the weaknesses of knnsearch with a kd-tree is that it becomes inefficient as the dimensionality of the data increases. However, since for our current problem we only request the two nearest neighbors (k = 2), the function ran without any problem.
4. Results
In this section we will show some of the results obtained with the parameters presented above:
Illustration 3. Notre Dame – Top 100 most confident interest points – 95% accuracy.
Illustration 4. Mount Rushmore – Top 100 most confident interest points – 98% accuracy.
Illustration 5. Episcopal Gaudi – Top 100 most confident interest points – 3% accuracy.
As we can observe, we were able to achieve high accuracy scores for both Notre Dame and Mount Rushmore, but only a poor 3% on Episcopal Gaudi. Due to time limitations, we were not able to implement scale and rotation invariance, which would have helped match this last pair of images, where a clear difference in scale can be appreciated.