Detection methods based on deep convolutional networks search for interest points by constructing response maps, using supervised, self-supervised, or unsupervised training. Supervised methods use anchors to guide the training process; however, their performance is likely to be limited by how the anchors are constructed. Self-supervised and unsupervised methods rarely require human annotations. Instead, they use geometric constraints between two images to guide the model.
Feature descriptors use local information (i.e., patches) around detected keypoints to search for correct correspondences. Owing to their strong information-extraction and representation capabilities, deep learning techniques have achieved good performance in feature description. Feature description is often formulated as a supervised learning problem in which a feature space is learned such that matching features lie as close together as possible, while non-matching features lie far apart. Along this line of research, existing methods fall roughly into two categories, metric learning and descriptor learning, which differ in their output. Metric-learning methods learn a discriminative metric for similarity measurement, whereas descriptor-learning methods generate descriptor representations from raw images or patches.
Many methods adopt an end-to-end approach that integrates feature detection, feature description, and feature matching into a single matching pipeline, which is beneficial for improving matching performance. Several recent studies have shown competitive results in local feature matching. However, their robustness and accuracy are often limited under challenging conditions such as illumination, seasonal, and viewpoint changes, in which local feature matching may fail to establish sufficiently reliable correspondences.
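The descriptor-learning objective described above, in which matching features are pulled together and non-matching features pushed apart, is commonly expressed as a triplet margin loss. The following is a minimal sketch of that general idea in plain Python; it is not claimed to be the loss used by any specific method discussed here, and the function names, the margin value, and the toy descriptors are assumptions chosen for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hinge loss that is zero once the negative is at least `margin`
    farther from the anchor than the positive is."""
    gap = l2_distance(anchor, positive) - l2_distance(anchor, negative)
    return max(0.0, gap + margin)


# Toy 2-D descriptors (hypothetical values, for illustration only).
anchor = [1.0, 0.0]
positive = [0.9, 0.1]   # descriptor of the same physical point
negative = [0.0, 1.0]   # descriptor of a different point

# Well-separated triplet: the margin is already satisfied.
loss_good = triplet_margin_loss(anchor, positive, negative)

# Swapping the roles violates the margin, so the loss becomes positive.
loss_bad = triplet_margin_loss(anchor, negative, positive)
```

In this sketch the loss vanishes when the descriptor space is already well separated and grows with the size of the violation, which is what drives matching features together during training.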
The accuracy of correspondences plays a significant role in computer vision pipelines: the better the detection and matching quality, the more accurate and robust the downstream results. We consider shape-aware features to be beneficial for feature matching. Therefore, in this study, we present DSD-MatchingNet for local feature matching. To mitigate the lack of shape awareness in features, we first introduce a deformable feature extraction framework based on deformable convolutional networks, which allows us to learn a dynamic receptive field, estimate local transformations, and adapt to geometric variations. Second, to facilitate pixel-level matching, we develop sparse-to-dense hypercolumn matching to learn correspondence maps. We then adopt a correspondence estimation error and a cycle-consistency error to obtain more accurate and robust correspondences. By combining these components, DSD-MatchingNet achieves improved accuracy on the HPatches and Aachen Day-Night datasets. The main contributions of the present study are summarized as follows:
- We propose a novel network, DSD-MatchingNet, that takes advantage of sparse-to-dense hypercolumn matching for robust and accurate local feature matching.
- We propose a deformable feature extraction framework to obtain multilevel dense feature maps, which are then used for sparse-to-dense matching. Deformable convolutional networks are introduced into the framework to generate a dynamic receptive field, which encourages the network to produce more robust correspondences.
- We propose a pixel-level correspondence estimation error and a symmetry-of-correspondence error that penalize incorrect predictions, helping the network find accurate correspondences.
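The sparse-to-dense matching and symmetry check named in the contributions can be illustrated with a toy example. The sketch below assumes that a correspondence map is a softmax over dot-product similarities between one sparse descriptor and every cell of a dense feature grid, and that symmetry is measured as the round-trip pixel error after matching A to B and back. Both choices, along with all function names and the one-hot descriptors standing in for hypercolumns, are assumptions for illustration rather than the exact formulation of DSD-MatchingNet.

```python
import math


def correspondence_map(query, dense):
    """Softmax over dot-product similarities between one descriptor and
    every cell of a dense H x W grid of descriptors."""
    scores = [[sum(q * f for q, f in zip(query, cell)) for cell in row]
              for row in dense]
    peak = max(max(row) for row in scores)  # subtract for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def best_match(query, dense):
    """(row, col) of the highest-probability cell in the correspondence map."""
    cmap = correspondence_map(query, dense)
    cells = [(r, c) for r in range(len(cmap)) for c in range(len(cmap[0]))]
    return max(cells, key=lambda rc: cmap[rc[0]][rc[1]])


def cycle_error(point_a, dense_a, dense_b):
    """Match A -> B, match the result back B -> A, and return the pixel
    distance between the starting point and the round-trip point."""
    r, c = point_a
    rb, cb = best_match(dense_a[r][c], dense_b)
    ra, ca = best_match(dense_b[rb][cb], dense_a)
    return math.hypot(ra - r, ca - c)


# Toy data: one-hot vectors stand in for hypercolumn descriptors.
e = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
dense_a = [[e[0], e[1]], [e[2], e[3]]]
dense_b = [[e[1], e[0]], [e[3], e[2]]]  # image A mirrored left to right

match = best_match(dense_a[0][0], dense_b)      # sparse point in A, dense search in B
error = cycle_error((0, 0), dense_a, dense_b)   # round-trip consistency
```

In this toy setting the top-left point of A lands on the top-right cell of B and returns to its starting location, so the round-trip error is zero. A non-zero error would flag an asymmetric, and therefore suspect, correspondence.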
Journal: Virtual Reality & Intelligent Hardware
Article Title: DSD-MatchingNet: Deformable Sparse-to-Dense Feature Matching for Learning Accurate Correspondences
Article Publication Date: 2-Nov-2022