QATM: Quality-Aware Template Matching for Deep Learning
Introduction
Matching: The matching problem is a basic problem in computer vision. There are region-based matching problems, such as template matching, and semantic-based matching problems, such as semantic alignment. In this blog, we will first go through classic solutions for template matching. Then we will demonstrate our deep-learning-friendly matching algorithm QATM and its performance on template matching in a training-free way (using only pre-trained model weights, with no additional training involved). Finally, we implement QATM as a learnable DNN module and show its applications in semantic matching problems (e.g., semantic alignment).
Classic Template Matching and Limitations: The task of template matching is to find a (smaller) template image (or its nearest patch) in a (larger) image. The former image is usually called the query image and the latter the reference/target image. Classic template matching methods often use sum-of-squared differences (SSD) or normalized cross-correlation (NCC) to measure the similarity between the template and the underlying image. OpenCV provides off-the-shelf functions for these algorithms [1]. However, these methods start to fail when the transformation is complex or non-rigid (e.g., the query shows the palm of a hand and the reference shows its back). To handle such complex transformations, Dekel et al. [2] introduced the Best-Buddies Similarity (BBS) measure, Talmi et al. [3] introduced Deformable Diversity Similarity (DDIS), and Kat et al. [4] proposed co-occurrence based template matching (CoTM). These methods indeed improve template matching performance. However, they cannot be used inside deep neural networks (DNNs) because of two limitations — (1) they rely on non-differentiable operations, such as thresholding and counting, and (2) they use operations that are inefficient in DNNs, such as loops and other non-batch operations.
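For concreteness, here is a minimal sketch of classic correlation-based template matching using OpenCV's off-the-shelf matchTemplate function (the file names are placeholders):

```python
import cv2

# Load the reference image and the (smaller) template; paths are placeholders.
image = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)
template = cv2.imread("template.jpg", cv2.IMREAD_GRAYSCALE)
h, w = template.shape

# Slide the template over the image, computing a normalized correlation score.
response = cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)

# The best match is the location with the highest response.
_, max_val, _, max_loc = cv2.minMaxLoc(response)
top_left = max_loc
bottom_right = (top_left[0] + w, top_left[1] + h)
print(f"best match at {top_left}-{bottom_right}, score {max_val:.3f}")
```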
Approach
Matching Quality: Consider the five template matching scenarios in Table 1: four matching cases and one not-matching case. Among these, we are most interested in 1-to-1 matching (see Fig 1). A 1-to-1 match usually reflects the correspondence between foreground objects, such as a person or an object of interest, while background patterns (e.g., a homogeneous sky) usually produce 1-to-N, M-to-1, or M-to-N matches. Hence, we call only the 1-to-1 case a high-quality match and give it a high matching response, while the other three cases are low-quality matches and receive low matching responses.

| Matching Cases | Likelihood(s\|t) | Likelihood(t\|s) | QATM Score(s,t) |
| --- | --- | --- | --- |
| 1-to-1 | 1 | 1 | 1 |
| 1-to-N | 1 | 1/N | 1/N |
| M-to-1 | 1/M | 1 | 1/M |
| M-to-N | 1/M | 1/N | 1/MN |
| Not matching | 1/\|\|S\|\| | 1/\|\|T\|\| | ~0 |
Table 1. Matching cases and the corresponding QATM scores. s and t are patches in the query and reference images, respectively; ||S|| and ||T|| are the numbers of candidate patches in S and T.
QATM
Let S and T be the query and reference images, and let f_s and f_t be the feature representations of patches s and t (patches of S and T, respectively). Let ρ(·, ·) be a predefined similarity measure between two patch features, e.g., cosine similarity. Given a query patch s, we define the likelihood that a reference patch t is matched as

$$L(t \mid s) = \frac{\exp\{\alpha \cdot \rho(f_t, f_s)\}}{\sum_{t' \in T} \exp\{\alpha \cdot \rho(f_{t'}, f_s)\}}$$

Here, α is a positive number acting as the temperature of the heated-up softmax function; a detailed discussion of its value can be found in the paper. The likelihood is high when there is only one match for s in T. In any other case, e.g., 1-to-N matching or no matching, the likelihood is low.
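As a quick sanity check of the scores in Table 1, here is a toy numpy sketch of the heated-up softmax (the similarity values are made up for illustration):

```python
import numpy as np

def softmax(x, alpha=1.0):
    """Heated-up softmax; a larger alpha sharpens the distribution."""
    e = np.exp(alpha * (x - x.max()))
    return e / e.sum()

alpha = 30.0  # an illustrative temperature; the paper discusses how to choose it

# One clear best match among the reference patches -> likelihood near 1.
one_to_one = np.array([0.95, 0.30, 0.25, 0.20])
print(softmax(one_to_one, alpha)[0])   # ~1.0

# Two equally good matches (1-to-2) -> the mass splits, likelihood ~ 1/2.
one_to_two = np.array([0.95, 0.95, 0.25, 0.20])
print(softmax(one_to_two, alpha)[0])   # ~0.5

# No good match -> near-uniform likelihood 1/||T|| (4 patches here).
no_match = np.array([0.30, 0.30, 0.30, 0.30])
print(softmax(no_match, alpha)[0])     # 0.25
```

With a reasonably large α, the likelihood concentrates on a unique best match, splits evenly among N equally good matches, and degenerates to roughly 1/||T|| when nothing matches, exactly the pattern listed in Table 1.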
The QATM score is then simply the product of the likelihoods that s is matched in T and that t is matched in S:

$$\mathrm{QATM}(s, t) = L(t \mid s) \cdot L(s \mid t)$$
The QATM score is high if and only if the matching is 1-to-1. We will later see that this property helps suppress false-positive responses. It is worth noting that for template matching we only use a pre-trained CNN for feature extraction; no other learning is involved.
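Putting the two likelihoods together, the sketch below computes QATM scores in PyTorch from patch features extracted by a pre-trained CNN. This is only an illustration under assumed tensor shapes, not the authors' reference implementation; qatm_score and the toy dimensions are ours:

```python
import torch
import torch.nn.functional as F

def qatm_score(feat_s, feat_t, alpha=25.0):
    """Compute QATM(s, t) for all patch pairs.

    feat_s: (Ns, D) query-patch features; feat_t: (Nt, D) reference-patch
    features; alpha is the softmax temperature.
    """
    # Cosine similarity rho(f_s, f_t) between every query/reference patch pair.
    sim = F.normalize(feat_s, dim=1) @ F.normalize(feat_t, dim=1).T  # (Ns, Nt)

    # L(t|s): softmax over reference patches; L(s|t): softmax over query patches.
    l_t_given_s = F.softmax(alpha * sim, dim=1)
    l_s_given_t = F.softmax(alpha * sim, dim=0)

    # QATM(s, t) = L(t|s) * L(s|t) is high only for 1-to-1 matches.
    return l_t_given_s * l_s_given_t

# Toy usage with random features standing in for pre-trained CNN features.
feat_s, feat_t = torch.randn(100, 512), torch.randn(900, 512)
qatm = qatm_score(feat_s, feat_t)               # (100, 900)
response = qatm.max(dim=0).values.view(30, 30)  # response map over the reference
```

Because every step is a batched matrix product or softmax, the whole computation is differentiable and GPU-friendly, which is exactly what the classic measures discussed earlier lack.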
Performance on Template Matching
We evaluate template matching performance on the OTB dataset [5]. Each example in the OTB dataset consists of two frames from the same video. The same object in the two frames may undergo non-rigid transformation, be occluded, or appear under different lighting conditions. We show four qualitative results in the top four rows of Fig 2.
Meanwhile, to show QATM’s robustness to false alarms, we create a modified OTB (MOTB) dataset by randomly pairing query and reference images from two different examples. In this case, there is no true match in the reference image, and the response should therefore be low everywhere. The bottom four rows of Fig 2 show the performance of QATM on MOTB. We can see that QATM produces far fewer false alarms than the baselines. For quantitative results, the reader can refer to our paper.

Fig 2: Qualitative template matching performance. Columns from left to right are: the template frame, the target search frame with predicted bounding boxes overlaid (different colors indicate different methods), and the response maps of QATM, BBS, DDIS, and CoTM, respectively. Rows from top to bottom: the top four are positive samples from OTB, while the bottom four are negative samples from MOTB.
QATM As A Learnable Module
Now we introduce applications of QATM as a differentiable layer with a learnable parameter in two matching problems. In both applications, all we do is replace the matching module of the original paper with the QATM matching module (Fig 3) and make the temperature parameter α learnable.


Fig 3. Left: replacing the matching module with QATM in image-to-GPS verification [6]. Right: replacing the matching module with QATM in semantic alignment [7].
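For illustration, here is a rough sketch of what a QATM layer with a learnable temperature might look like in PyTorch (a hypothetical module of our own; the actual layer in the paper also handles batched 4-D feature maps):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QATMLayer(nn.Module):
    """QATM as a differentiable matching layer with a learnable temperature."""

    def __init__(self, alpha_init=25.0):
        super().__init__()
        # The only learnable parameter of the layer: the temperature alpha.
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

    def forward(self, feat_s, feat_t):
        # feat_s: (Ns, D) query features; feat_t: (Nt, D) reference features.
        sim = F.normalize(feat_s, dim=1) @ F.normalize(feat_t, dim=1).T
        l_t_given_s = F.softmax(self.alpha * sim, dim=1)
        l_s_given_t = F.softmax(self.alpha * sim, dim=0)
        # Gradients flow to alpha and to the backbone that produced the features.
        return l_t_given_s * l_s_given_t
```

Since gradients flow through the softmax into both α and the backbone features, such a module can drop into either pipeline in Fig 3 as a direct replacement for the original matching module.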
The first application is image-to-GPS verification (IGV). This task attempts to verify, through visual verification, whether a given image was taken at the claimed GPS location. IGV first uses the claimed location to retrieve a reference panorama image from a third-party database, e.g., Google Street View, and then takes both the given image and the reference as network inputs, verifying the visual content via template matching to produce the verification decision. Compared to the classic template matching problem, the major challenges of the IGV task are that (1) only a small, unknown portion of the visual content in the query image can be verified in the reference image, and (2) the reference image is a panorama, where the potential matching region of interest may be distorted.
Fig 4 compares the performance of QATM and the baseline. We can see that when homogeneous patterns (e.g., sky, road) occupy a larger portion of the image, QATM has a clear advantage over the baseline.
Experimental Results

Fig 4: Qualitative image-to-GPS results. Columns from left to right are: the query image; the reference panorama image with predicted bounding boxes overlaid (GT, the proposed QATM, and the baseline BUPM); and the response maps of the ground-truth mask, the QATM-improved solution, and the baseline, respectively.
The second application is semantic image alignment (SIA). The overall goal of SIA is to warp a given image such that, after warping, it is aligned to a reference image in terms of category-level correspondence. A typical DNN solution for semantic image alignment takes two input images, one to be warped and the other serving as the reference, and commonly outputs a set of image-warping parameters.
Fig 5 compares the performance of QATM and two baselines. The keypoints warped by QATM (circles) land closer to the ground truth (crosses) in the reference image. More detailed quantitative evaluations are available in the paper.

Fig 5: Qualitative results on the PF-PASCAL dataset. Columns from left to right show the source image, the target image, and the transformed results of QATM, GeoCNN, and weakly-supervised SA. Circles and crosses indicate keypoints on the source and target images, respectively.