Functional Maps
Our method is based on the functional maps framework, first introduced in [13] and extended in various works such as [19,20,21,22,23,24], among others (see [25] for an overview). This general approach encodes maps between shapes using a reduced basis representation. Consequently, the problem of map optimization becomes both linear and more compact. In addition, this framework allows natural constraints such as near-isometry or bijectivity to be expressed as linear-algebraic regularization. It has also been extended to the partial setting [26,27].
One of the bottlenecks of this framework is the estimation of so-called “descriptor functions” that are key to the functional map computation. Early methods relied on axiomatic features, mainly multi-scale diffusion-based descriptors, e.g., HKS and WKS [28,29].
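To make the linearity of this formulation concrete, below is a minimal sketch of the core functional map estimation step: given descriptor functions already projected into reduced (e.g., Laplace-Beltrami) bases on two shapes, the map is a small matrix recovered by least squares. The function name and setup are illustrative, not the paper's actual implementation.

```python
import numpy as np

def estimate_functional_map(desc_src, desc_tgt):
    """Estimate a functional map C that transports source spectral
    coefficients to target ones, by solving C @ desc_src ≈ desc_tgt
    in the least-squares sense.

    desc_src: (k_src, p) descriptor coefficients in the source basis
    desc_tgt: (k_tgt, p) descriptor coefficients in the target basis
    """
    # lstsq solves A x = b; we want C with C A = B, i.e. A^T C^T = B^T.
    C_t, *_ = np.linalg.lstsq(desc_src.T, desc_tgt.T, rcond=None)
    return C_t.T

# Toy check: with enough descriptors, a known ground-truth map is recovered.
rng = np.random.default_rng(0)
C_true = rng.standard_normal((20, 30))
A = rng.standard_normal((30, 50))   # 50 descriptor functions
B = C_true @ A
C_est = estimate_functional_map(A, B)
print(np.allclose(C_est, C_true))  # True (well-determined system)
```

In practice, regularization terms (e.g., Laplacian commutativity) are added to this least-squares problem, but the estimate remains the solution of a linear system, which is what makes the optimization compact.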
Learning-based methods
Several approaches have proposed to learn maps between shapes by formulating matching as a dense segmentation problem, e.g., [30,31,32,33,34,35]. However, these methods (1) usually require many labeled training shapes, which can be hard to obtain, and (2) tend to overfit to the training connectivity, making them unstable under changes in triangulation.
Closer to our approach are deep shape matching methods that also rely on the functional map framework, pioneered by FMNet [7]. In that work, SHOT descriptors [36] are given as input to a network whose goal is to refine them so as to yield a functional map as close to the ground truth as possible. The key advantage of this approach is that it directly estimates and optimizes for the map itself, thus injecting more structure into the learning problem. FMNet introduced the idea of learning on shape pairs, using the same feature extractor (in their case, an MLP-based SHOT refiner) for the source and target shapes in a Siamese fashion to produce improved output descriptors for functional map estimation. However, later experiments conducted in [5] highlighted that SHOT-based pipelines suffer greatly from connectivity overfitting. Thus, in more recent works, the authors of [5,15,14] advocate learning directly from shape geometry, while exploiting strong regularizers for functional map estimation.
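The Siamese design mentioned above amounts to applying one shared feature extractor to both shapes of a pair. A minimal sketch, with a stand-in two-layer MLP and hypothetical descriptor dimensions (none of this is the actual FMNet architecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny shared per-point MLP; the same weights serve both shapes.
W1 = rng.standard_normal((64, 32)) * 0.1
W2 = rng.standard_normal((32, 16)) * 0.1

def refine(desc):
    """Refine raw per-point descriptors with the shared weights
    (identical parameters for source and target: Siamese setup)."""
    return np.maximum(desc @ W1, 0.0) @ W2   # ReLU MLP

# Hypothetical raw input descriptors (e.g., SHOT) on two shapes.
desc_src = rng.standard_normal((1000, 64))  # 1000 source vertices
desc_tgt = rng.standard_normal((1200, 64))  # 1200 target vertices

feat_src = refine(desc_src)   # (1000, 16)
feat_tgt = refine(desc_tgt)   # (1200, 16)
# The refined features would then be projected onto each shape's
# spectral basis and used to estimate the functional map.
print(feat_src.shape, feat_tgt.shape)
```

Weight sharing is what makes the learned refinement consistent across the pair: corresponding points on the two shapes are pushed toward similar feature vectors, which is exactly what the downstream functional map estimation requires.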
The major upside of using the functional map framework for deep shape matching is that it relies
on the intrinsic information of shapes, which results in overall good generalization from training to
testing, especially across pose changes, which involve minimal intrinsic deformation.
Unsupervised spectral learning
The methods described above are supervised deep shape matching pipelines. While they usually give good correspondence predictions, they require ground-truth supervision at training time. Consequently, other methods have focused on training for shape matching within the functional map framework without ground-truth supervision. This was originally performed directly on top of FMNet by enforcing either geodesic distance preservation [8,37] or natural properties on the output functional map [9], as well as by promoting cycle consistency [38].
To disambiguate the symmetries present in many organic shapes, some works rely on so-called “weak supervision”, rigidly aligning all shapes (on the same three axes) as a pre-processing step [15,39], and then using the extrinsic embedding information to resolve the symmetry ambiguity. This, however, limits their utility to correspondences between shapes with the same rigid alignment as the training set. Another solution is to use input signals that are independent of the shape alignment, such as SHOT descriptors [36], as done in the original FMNet. One recent method [11] makes use of optimal transport on top of this SHOT refiner to align the shapes at different spectral scales. This method, like ours, computes the functional map at different scales via progressive upsampling, but it only keeps the last map as the output, whereas we propose to let the network learn the best combination of the different resolutions. Additionally, this method depends on the SHOT input, which makes it unstable under changes in triangulation. In-network refinement is also performed in DG2N [40], but not in the spectral space.
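The “natural properties” used as unsupervised training signals are typically structural penalties on the functional map itself. Below is a hedged sketch of two common ones, orthogonality (near-isometry) and two-cycle bijectivity; the exact losses vary across the cited works, so this is only an illustration of the idea.

```python
import numpy as np

def structural_penalties(C12, C21):
    """Unsupervised penalties encouraging natural map properties:
    orthogonality (near-isometry) of the forward map, and
    bijectivity via the composition of forward and backward maps.
    C12: (k2, k1) map from shape 1 to shape 2; C21: (k1, k2)."""
    I = np.eye(C12.shape[0])
    orth = np.sum((C12 @ C12.T - I) ** 2)   # near-isometry
    bij = np.sum((C12 @ C21 - I) ** 2)      # 2-cycle consistency
    return orth, bij

# A rotation matrix is orthogonal and inverted by its transpose,
# so both penalties vanish for this toy pair of maps.
theta = 0.3
C12 = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
C21 = C12.T
orth, bij = structural_penalties(C12, C21)
print(round(orth, 8), round(bij, 8))  # 0.0 0.0
```

Because these penalties depend only on the predicted maps, they can replace ground-truth correspondences as the training objective, which is what makes the unsupervised pipelines above possible.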
Attention-based spectral learning
The attention mechanism was originally introduced in deep learning for natural language processing, and consists in assigning relative weights to different words of an input sentence [41]. This mechanism can be applied in different contexts, including that of shape analysis. Indeed, attention learned in the feature domain can be used to focus on different parts of a 3D shape, for instance in partial shape matching, as done in [10]. As we show in this paper, attention can also be used in the spectral domain, by letting the network focus on different levels of detail depending on the input shapes and their resulting functional maps at different spectral resolutions. Indeed, the utility of considering different resolutions of a functional map, e.g., via upsampling of its size, has been highlighted in [42,43]. Here, we propose to let the network learn to adaptively combine all the intermediate functional maps into a final coherent correspondence.
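One simple way to realize such an adaptive combination is to convert each intermediate functional map into a soft point-to-point correspondence of a common size, and blend them with softmax attention weights. The sketch below uses random placeholders for the per-scale maps and learned scores; it illustrates the combination mechanism only, not the paper's exact architecture.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def combine_soft_maps(soft_maps, scores):
    """Blend soft correspondence matrices obtained from functional
    maps at different spectral resolutions, using attention weights
    (softmax over per-scale scores). Inputs are placeholders for
    network outputs."""
    w = softmax(scores)                          # one weight per scale
    combined = sum(wi * P for wi, P in zip(w, soft_maps))
    return combined, w

# Toy example: three "scales", each a row-stochastic soft map.
rng = np.random.default_rng(2)
maps = []
for _ in range(3):
    M = rng.random((5, 4))
    maps.append(M / M.sum(axis=1, keepdims=True))
scores = np.array([0.5, 2.0, -1.0])              # stand-in for learned scores
combined, w = combine_soft_maps(maps, scores)
# A convex combination of row-stochastic maps is row-stochastic.
print(np.allclose(combined.sum(axis=1), 1.0))  # True
```

Since the softmax weights are differentiable in the scores, the network can learn end-to-end which spectral resolutions to trust for a given input pair, rather than always keeping only the finest map.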