Neural Shape Deformation Priors
Jiapeng Tang1  Lev Markhasin2  Bi Wang2  Justus Thies3  Matthias Nießner1
1Technical University of Munich 2Sony Europe RDC Stuttgart
3Max Planck Institute for Intelligent Systems, Tübingen, Germany
https://tangjiapeng.github.io/projects/NSDP/
Figure 1: Neural shape deformation priors allow for intuitive shape manipulation of existing source
meshes. A user can create novel shapes by dragging handles (red circles) defined on the region of
interest (red regions) to desired locations (blue circles).
Abstract
We present Neural Shape Deformation Priors, a novel method for shape manipulation that predicts mesh deformations of non-rigid objects from user-provided
handle movements. State-of-the-art methods cast this problem as an optimization
task, where the input source mesh is iteratively deformed to minimize an objective
function according to hand-crafted regularizers such as ARAP [54]. In this work,
we learn the deformation behavior based on the underlying geometric properties of
a shape, while leveraging a large-scale dataset containing a diverse set of non-rigid
deformations. Specifically, given a source mesh and desired target locations of
handles that describe the partial surface deformation, we predict a continuous
deformation field that is defined in 3D space to describe the space deformation.
To this end, we introduce transformer-based deformation networks that represent
a shape deformation as a composition of local surface deformations. These networks learn a set of local latent codes anchored in 3D space, from which a set of continuous deformation functions for local surfaces is learned. Our method can be applied to
challenging deformations and generalizes well to unseen deformations. We validate
our approach in experiments using the DeformingThing4D dataset, and compare to
both classic optimization-based and recent neural network-based methods.
1 Introduction
Editing and deforming 3D shapes is a key component in animation creation and computer-aided
design pipelines. Given as little user input as possible, the goal is to create new deformed instances
of the original 3D shape which look natural and behave like real objects or animals. The user input is
assumed to be very sparse, such as vertex handles that can be dragged around. For example, users
can animate a 3D model of an animal by dragging its feet forward. This problem is severely ill-posed and typically under-constrained, as many possible deformations are consistent with the provided partial surface deformations of the handles, especially for large surface deformations.
Thus, strong priors encoding deformation regularity are necessary to tackle this problem. Physics
and differential geometry provide solutions that use various analytical priors which define natural-
looking mesh deformations, such as elasticity [62, 1], Laplacian smoothness [31, 55, 77], and rigidity [54, 57, 29] priors. They update mesh vertex coordinates by iteratively optimizing energy
functions that satisfy constraints from both the pre-defined deformation priors and given handle
locations. Although these algorithms can preserve geometric details of the original source model,
they still have limited capacity to model realistic deformations, since the deformation priors are
region independent, e.g., the head region deforms in a similar way as the tail of an animal, resulting
in unrealistic deformation states.
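As a concrete example of such an energy, the widely used as-rigid-as-possible (ARAP) prior [54] penalizes the deviation of each deformed one-ring neighborhood from a best-fitting rotation:

E_ARAP(V') = Σ_{i=1}^{|V|} Σ_{j ∈ N(i)} w_ij ‖(v'_i − v'_j) − R_i (v_i − v_j)‖²,

where v_i and v'_i denote the original and deformed positions of vertex i, N(i) is its one-ring neighborhood, w_ij are cotangent weights, and R_i is a per-vertex rotation; the energy is minimized subject to the handle constraints by alternating between fitting the rotations R_i and solving a sparse linear system for the positions V'. Note how the prior is the same at every vertex, independent of what body part the vertex belongs to.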
Hence, motivated by the recent success of deep neural networks for 3D shape modeling [33, 42, 13, 68, 58, 14, 44, 26, 2, 18, 10, 63, 60, 12], we propose to learn shape deformation priors of a
specific object class, e.g., quadruped animals, to complete surface deformations beyond observed
handles. We formulate the following properties of such a learned model: (1) it should be robust to varying mesh quality and numbers of vertices, (2) the source mesh should not be limited to a canonical pose (i.e., the input mesh can have an arbitrary pose), and (3) it should generalize well to new deformations.
Towards these goals, we represent deformations as a continuous deformation field which is defined
in the near-surface region to describe the space deformation caused by the corresponding surface
deformation. The continuity property enables us to manipulate meshes with an arbitrary number of vertices and with disconnected components. To handle source meshes in arbitrary poses, we learn shape
deformations via canonicalization. Specifically, the overall deformation process consists of two
stages: arbitrary-to-canonical transformation and canonical-to-arbitrary transformation. To obtain
more detailed surface deformations and better generalization capabilities to unseen deformations,
we propose to learn local deformation fields conditioned on local latent codes encoding geometry-
dependent deformation priors, instead of global deformation fields conditioned on a single latent
code. To this end, we propose Transformer-based Deformation Networks (TD-Nets), which learn encoder-based local deformation fields on point cloud approximations of the input mesh. Concretely,
TD-Nets encode an input point cloud with surface geometry information and incomplete deformation
flow into a sparse set of local latent codes and a global feature vector by using the vector attention
blocks proposed in [74]. The deformation vectors of spatial points are estimated by an attentive
decoder, which aggregates the information of neighboring local latent codes of a spatial point based
on the feature similarity relationships. The aggregated feature vectors are finally passed to a multi-layer perceptron (MLP) to predict displacement vectors, which can be applied to the source mesh to
compute the final output mesh.
To summarize, we introduce transformer-based local deformation field networks that are capable of learning shape deformation priors for the task of user-driven shape manipulation. The deformation networks learn a set of anchor features based on a vector attention mechanism, which enhances the global deformation context and selects the most informative local deformation descriptors for displacement vector estimation, leading to improved generalization to new deformations.
In comparison to classical hand-crafted deformation priors as well as recent neural network-based
deformation predictors, our method achieves more accurate and natural shape deformations.
2 Related Work
User-guided shape manipulation lies at the intersection of computer graphics and computer vision.
Our proposed method is related to polygonal mesh geometry processing, neural field representations,
as well as vision transformers.
Optimization-based Shape Manipulation.
Classical methods formulate shape manipulation as
a mathematical optimization problem. They perform mesh deformations by either deforming the
vertices [5, 53] or the 3D space [23, 3, 29, 37, 51]. Performing mesh deformation without any other
information about the target shape than limited user-provided correspondences is an under-constrained problem. Hence, optimization methods require deformation priors to constrain the deformation regularity as well as the smoothness of the deformed surface. Various analytic
priors have been proposed which encourage smooth surface deformations, such as elasticity [62, 1], Laplacian smoothness [31, 55, 77], and rigidity [54, 57, 29]. These methods use efficient linear solvers
to iteratively optimize energy functions that satisfy constraints from both the pre-defined deformation
prior and provided handle movements. Recently, NFGP [69] was proposed to optimize neural networks with non-linear deformation regularizations. Specifically, it performs shape deformations by warping the neural implicit fields of the source model through a deformation vector field, which is constrained by modeling implicitly represented surfaces as elastic shells. NeuralMLS [52] learns a geometry-aware weight function of a shape and given control points for moving least squares (MLS) deformation, which smoothly interpolates the control point displacements over space.
Although they can preserve many geometric details of the source shape, they struggle to model
complex deformations, as local surfaces are simply constrained to be transformed in a similar manner.
In contrast, we aim to learn deformation priors based on local geometries to infer hidden surface
deformations.
Learning-based Shape Reconstruction and Manipulation.
Learning-based shape manipulation
has been studied to learn shape priors based on shape auto-encoding or auto-decoding. The methods of [76, 15, 20, 25] map a class of shapes into a latent space. During inference, given handle positions as input, they
find an optimal latent code whose 3D interpretation is the most similar to the observation. In
contrast, we learn explicit deformation priors to directly predict 3D surface deformations. Jakab et
al. [24] proposed to control shapes via unsupervised 3D keypoint discovery. Instead, we use partial
surface deformations represented by handle displacements as input observations, rather than keypoint
displacements. Several methods [25, 41, 7, 30, 61, 50, 66, 8] use deep neural networks to complete non-rigid shapes from partial scans. Our task is related to this line of work, but shape manipulation from user input requires completion of the deformation field rather than of the shape itself. In
contrast to shape completion, our setting is more under-constrained, as the user-provided handle
correspondences are very sparse and more incomplete than partial point clouds from scans. Recent
methods for clothed-human body reconstruction choose to canonicalize the captured scan into a pre-
defined T-pose [65, 35, 11] using the skeletal deformation model of SMPL [32] or STAR [40], which
can also be used to later animate the human. Inspired by this, we also perform a canonicalization to
enable editing of source meshes with arbitrary poses, before applying the actual deformation towards
the target pose handles.
Continuous Neural Fields.
Continuous neural field representations have been widely used in
3D shape modeling [33, 13, 42] and 4D dynamics capture [39, 61, 7, 41, 30]. Recent work that
represents 3D shapes as continuous signed distance fields [2, 68, 18, 10, 63] or occupancy fields [33, 13, 14, 34, 44, 26, 59, 60, 17, 72] can theoretically obtain volumetric reconstructions with infinite
resolutions, as they are not bound to the resolution of a discrete grid structure. Similarly, we learn
continuous deformation fields defined in 3D space for shape deformations [58, 25, 69, 21]. Due to
the continuity of the deformation fields, our method is not limited by the number of mesh vertices,
or disconnected components. Different from ShapeFlow [25], OFlow [39], LPDC-Net [61], and NPMs [41], which learn a deformation field from a single latent code, and inspired by local implicit field learning [14, 44, 60, 17, 71], we model the deformation field as a composition of local deformation
functions, improving the representation capability of describing complex deformations as well as
generalization to new deformations.
Visual Transformers.
Recently, transformer architectures [64] from natural language processing have revolutionized many computer vision tasks, including image classification [16, 67], object recognition [9], semantic segmentation [75], and 3D reconstruction [6, 70, 17, 71, 46]. We refer the reader to [19] for a detailed survey of visual transformers. In this work, we propose to use
a transformer architecture to learn deformation fields. Given the input point cloud sampled from
the source mesh with partial deformation flow (defined by the user handles), we employ the vector
attention blocks from Point Transformer [74] as the main point cloud processing module to extract a
sparse set of local latent codes, enhancing the global understanding of deformation behaviours. Based
on the obtained local deformation descriptors, our attentive deformation decoder learns to attend to
the most informative features from nearby local codes to predict a deformation field.
3 Approach
Given a source mesh S = {V, F}, where V and F denote the set of vertices and the set of faces, respectively, we aim to deform S to obtain a target mesh T by selecting a sparse set of mesh vertices H = {h_i}_{i=1}^ℓ as handles and dragging them to target locations O = {o_i}_{i=1}^ℓ. The key idea in this work is to use deformation priors to complete hidden surface deformations. Specifically, the goal
Figure 2: Overview. Given a source mesh S with sparse handles H (red circles) and their respective target locations O (blue circles) as input, our method deforms the mesh to the target mesh T via canonicalization C. The backward and forward deformation networks store the deformation priors that allow our method to produce consistent and natural-looking outputs.
is to learn a continuous deformation field D defined in 3D space, from which we can obtain the deformed mesh T' = {V + D(V), F} through vertex deformations of the source mesh S. The overall
pipeline of the proposed approach is shown in Figure 2. Our method can be applied to input meshes
in arbitrary poses by leveraging learned shape deformation via canonicalization (see Section 3.1).
To represent the underlying deformation prior, we propose neural deformation fields as described in Section 3.2, which can be learned from large-scale deformation datasets (see Section 3.3).
3.1 Learning Shape Deformations via Canonicalization
To ensure robustness w.r.t. varying input mesh quality (topology and resolution), we operate on point clouds instead of meshes. Specifically, we sample a point cloud P_S = {p_i}_{i=1}^n ∈ R^{n×3} of size n = 5000 from S. We define the target handle point locations P_O = {o_i}_{i=1}^n ∈ R^{n×3}, where we use zeros to represent unknown point flows. Further, to avoid the ambiguity of zero point flow, we define the corresponding binary user handle mask M = {b_i}_{i=1}^n ∈ R^n, where b_i = 1 if p_i is a handle and b_i = 0 otherwise.
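As an illustration of how these inputs could be assembled, the following is a minimal sketch, not the authors' released code; the use of trimesh, the handle arrays, and the choice to store per-point flows (target minus source) in P_O are our assumptions.

import numpy as np
import trimesh  # assumed available for surface sampling

def build_inputs(mesh, handle_idx, handle_targets, n=5000):
    """handle_idx indexes into the sampled points (for simplicity we assume
    the handles coincide with samples); handle_targets are their desired locations."""
    # P_S: n points sampled from the source surface
    p_s, _ = trimesh.sample.sample_surface(mesh, n)
    p_s = np.asarray(p_s, dtype=np.float32)             # (n, 3)
    # P_O: zero flow everywhere except at handles, where the flow is known
    p_o = np.zeros_like(p_s)                            # (n, 3)
    p_o[handle_idx] = handle_targets - p_s[handle_idx]
    # M: binary mask disambiguating "zero flow" from "unknown flow"
    m = np.zeros(n, dtype=np.float32)                   # (n,)
    m[handle_idx] = 1.0
    return p_s, p_o, m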
To learn the shape transformation between two arbitrary non-rigidly deformed poses, one can learn
deformation fields that directly map the source deformed space to the target space. However, it would be difficult to learn the deformation priors well, as there are infinitely many pairs of deformation states to map between. To decrease the learning complexity, we introduce a canonical space as an intermediate state. We divide the shape transformation process into two steps: a backward deformation that aligns the
source deformed space to canonical space, and a forward deformation that maps the canonical space to
the target deformation space. Concretely, P_S is passed into the backward transformation network to learn the backward deformation field D_b, which transforms the input shape P_S into a canonical pose P'_C. Similarly, a set of m = 5000 non-surface query points Q_S = {q_i}_{i=1}^m ∈ R^{m×3}, randomly sampled in the 3D space of S, is also mapped to canonical space through Q'_C = Q_S + D_b(Q_S). Lastly, given P'_C, M, and P_O as input, a forward transformation network is learned to represent the forward deformation field D_f, which predicts the final locations Q'_T = Q'_C + D_f(Q'_C).
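To make the two-stage composition explicit, inference could be sketched as follows; backward_net.field and forward_net.field are hypothetical wrappers returning callable deformation fields, not the paper's actual API.

def deform_via_canonicalization(backward_net, forward_net, p_s, p_o, m, verts):
    """verts: source mesh vertices V, treated as query points."""
    # Backward stage: map the observed pose into the canonical pose.
    d_b = backward_net.field(p_s, p_o, m)   # callable deformation field D_b
    p_c = p_s + d_b(p_s)                    # canonicalized point cloud P'_C
    q_c = verts + d_b(verts)                # canonicalized query points Q'_C
    # Forward stage: map canonical space to the target pose given the handles.
    d_f = forward_net.field(p_c, p_o, m)    # callable deformation field D_f
    return q_c + d_f(q_c)                   # Q'_T: new vertex positions; faces F are reused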
3.2 Transformer-based Deformation Networks (TD-Nets)
The deformation via canonicalization is based on two deformation field predictors (forward and
backward deformations). Both networks share the same architecture, thus, in the following, we
will only describe the forward deformation network, as visualized in Figure 3, while the backward
deformation network is analogous. It consists of a transformer-based deformation encoder and a
vector cross attention-based decoder network.
Point transformer encoder.
Given a point set P_C with handle locations P_O and a binary mask M as inputs, we use point transformer layers from [74] to build our encoder modules. The point transformer layer is based on the vector attention mechanism [73]. Let X = {x_i, f_i}_i and Y = {y_i, g_i}_i be the query and key-value sequences, where x_i and y_i denote the coordinates of query and key-value points with corresponding feature vectors f_i and g_i.
Figure 3: Transformer-based Forward Deformation Networks. Given a canonical mesh C with handle positions H (red circles) and desired handle locations O (blue circles), we perform surface sampling to obtain a point cloud P_C with additional channels for the handle mask M and the point flow P_O. A point-transformer encoder is devised to extract a sparse set of local latent codes Z = {c_i, z_i}_i from this point cloud, where c_i are the anchor positions of the latent features z_i. For a specific point q in 3D space (i.e., a vertex of the source mesh), a vector cross attention (VCA) block conditioned on the global code z_glo is used to effectively fuse the information of the k nearest neighbouring latent codes Z_q of q into z_q. Using a multi-layer perceptron (MLP) conditioned on z_q, we predict the deformed location q' in the target space.
The vector cross attention operator VCA is defined as:

VCA(X, Y): f'_i = Σ_{j ∈ N_i} ρ(γ(φ(g_j) − ψ(f_i) + δ)) ⊙ (α(g_j) + δ),   (1)
where f'_i are the aggregated features; φ, ψ, and α are linear projections, each implemented by a fully-connected layer; γ is a mapping function implemented by a two-layer MLP to predict attention vectors; and ρ is the attention weight normalization function, in our case softmax. δ := θ(x_i − y_j) is the positional embedding module [64, 36], implemented by two linear layers with a single ReLU [38].
It leverages the relative positional information of x_i and y_j to benefit the network training. With the definition of VCA, the vector self-attention operator VSA can be defined as:

VSA(X) := VCA(X, X).   (2)
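For concreteness, a PyTorch sketch of the VCA operator in Eq. (1) under assumed tensor shapes (no batch dimension, neighborhoods N_i computed by brute-force kNN) could look as follows; this is our reading of the operator, not the reference implementation. VSA is then obtained by passing the same set as both query and key-value sequence.

import torch
import torch.nn as nn

class VCA(nn.Module):
    """Vector cross attention, Eq. (1): queries X = (x, f), key-values Y = (y, g)."""
    def __init__(self, dim, k=16):
        super().__init__()
        self.k = k
        self.phi = nn.Linear(dim, dim)    # phi: projects key features g_j
        self.psi = nn.Linear(dim, dim)    # psi: projects query features f_i
        self.alpha = nn.Linear(dim, dim)  # alpha: projects value features g_j
        # gamma: two-layer MLP predicting per-channel attention vectors
        self.gamma = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        # theta: positional embedding of relative coordinates x_i - y_j
        self.theta = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x, f, y, g):
        # x: (N, 3), f: (N, C); y: (M, 3), g: (M, C)
        idx = torch.cdist(x, y).topk(self.k, largest=False).indices   # (N, k) = N_i
        g_j, y_j = g[idx], y[idx]                                     # (N, k, C), (N, k, 3)
        delta = self.theta(x.unsqueeze(1) - y_j)                      # positional term
        attn = self.gamma(self.phi(g_j) - self.psi(f).unsqueeze(1) + delta)
        attn = torch.softmax(attn, dim=1)                             # rho over N_i
        return ((self.alpha(g_j) + delta) * attn).sum(dim=1)          # f'_i, (N, C)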
Based on VCA and VSA, we can define two basic modules to build our encoder network, i.e. the
point transformer block (PTB) and the point abstraction block (PAB). The definition of the point
transformer block PTB is a combination of a BatchNorm (BN) layer [22], VSA, and a residual connection, formulated as:

PTB(X) := BN(X + VSA(X)).   (3)
For each point X_i, it encapsulates the information from its k_enc = 16 nearest neighbors while keeping the point's position x_i unchanged. The point abstraction block PAB consists of farthest point sampling (FPS), BN, VCA, and VSA, and is defined as follows:

PAB(X) := BN(FPS(X) + VSA(VCA(FPS(X), X))).   (4)
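Under the same assumptions as the VCA sketch above, Eqs. (2)-(4) compose into two short modules; farthest_point_sampling is a hypothetical helper returning sample indices (most point-cloud libraries provide one).

class PTB(nn.Module):
    """Point transformer block, Eq. (3): BN(X + VSA(X)), positions unchanged."""
    def __init__(self, dim, k=16):
        super().__init__()
        self.vsa = VCA(dim, k)          # VSA(X) = VCA(X, X), Eq. (2)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x, f):
        return x, self.bn(f + self.vsa(x, f, x, f))

class PAB(nn.Module):
    """Point abstraction block, Eq. (4): BN(FPS(X) + VSA(VCA(FPS(X), X)))."""
    def __init__(self, dim, n_out, k=16):
        super().__init__()
        self.n_out = n_out
        self.vca, self.vsa = VCA(dim, k), VCA(dim, k)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x, f):
        idx = farthest_point_sampling(x, self.n_out)  # hypothetical FPS helper
        xs, fs = x[idx], f[idx]                       # FPS(X)
        h = self.vca(xs, fs, x, f)                    # VCA(FPS(X), X)
        h = self.vsa(xs, h, xs, h)                    # VSA(.)
        return xs, self.bn(fs + h)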
The point cloud P_C, with the handle mask M and flow P_O as additional channels, is passed to a point transformer block (PTB) to obtain a feature point cloud Z_0 = {c^0_i, z^0_i}_{i=1}^n. By using two consecutive point abstraction blocks (PABs) with intermediate set sizes of n_1 = 500 and n_2 = 100, we obtain Z_1 = {c^1_i, z^1_i}_{i=1}^{n_1} and Z_2 = {c^2_i, z^2_i}_{i=1}^{n_2}. To enhance global deformation priors, we stack 4 point transformer blocks with full self-attention, whose k_enc is set to 100, to exchange global information across the whole set Z_2. By doing so, we obtain a sparse set of local deformation descriptors Z = {c_i, z_i}_{i=1}^{100} anchored at {c_i}. Finally, we perform a global max-pooling operation followed by two linear layers to obtain the global latent vector z_glo.
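Chaining these stages with the set sizes stated above (5000 -> 500 -> 100) could look as follows; the feature width and the input embedding layer are our choices, not values from the paper.

class TDNetEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.embed = nn.Linear(3 + 3 + 1, dim)   # xyz, flow P_O, and mask M channels
        self.ptb_in = PTB(dim, k=16)             # Z_0: 5000 local codes
        self.pab1 = PAB(dim, n_out=500, k=16)    # Z_1: 500 local codes
        self.pab2 = PAB(dim, n_out=100, k=16)    # Z_2: 100 local codes
        # four PTBs with full self-attention: k equals the whole set size
        self.global_ptbs = nn.ModuleList(PTB(dim, k=100) for _ in range(4))
        self.to_global = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, p_c, p_o, m):
        f = self.embed(torch.cat([p_c, p_o, m[:, None]], dim=-1))
        x, f = self.ptb_in(p_c, f)
        x, f = self.pab1(x, f)
        x, f = self.pab2(x, f)
        for blk in self.global_ptbs:
            x, f = blk(x, f)                     # final anchors {c_i} and codes {z_i}
        z_glo = self.to_global(f.max(dim=0).values)  # global max-pool + two FCs
        return x, f, z_glo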
Attentive deformation decoder.
Based on the learned local latent codes Z = {c_i, z_i}_{i=1}^{100} and the global latent vector z_glo, the deformation decoder defines the forward deformation function D_f: R^3 → R^3, which maps a point q from the canonical space of C to the 3D space of T. Similar to tri-linear interpolation in grid-based implicit field learning, a straightforward way to find the corresponding feature vector z_q is to use a weighted combination of the k_dec = 16 nearby local codes Z_q = {c_k, z_k}_{k=1}^{k_dec}. Intuitively, the weight is inversely proportional to the Euclidean distance between q and the anchoring location c_k [44]. However, distance-based feature queries ignore the feature similarity between the query point and its neighboring local codes; our attentive decoder therefore fuses Z_q into z_q with a vector cross attention block, attending to the most informative deformation descriptors, before an MLP maps z_q to the displacement of q.
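A sketch of such an attentive decoder, reusing the VCA module from above, is shown below; folding z_glo into the query feature is our simplification of the conditioning depicted in Figure 3.

class AttentiveDecoder(nn.Module):
    def __init__(self, dim=128, k_dec=16):
        super().__init__()
        self.query_init = nn.Linear(3 + dim, dim)  # lift (q, z_glo) to a query feature
        self.vca = VCA(dim, k_dec)                 # attends over nearby local codes
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, q, anchors, codes, z_glo):
        # q: (Q, 3) query points; anchors: (100, 3) = {c_i}; codes: (100, C) = {z_i}
        f_q = self.query_init(torch.cat([q, z_glo.expand(len(q), -1)], dim=-1))
        z_q = self.vca(q, f_q, anchors, codes)     # fused per-query feature z_q
        return q + self.mlp(z_q)                   # deformed location q' = q + D_f(q)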