TANGO: Text-driven Photorealistic and Robust 3D
Stylization via Lighting Decomposition
Yongwei Chen1,3, Rui Chen1, Jiabao Lei1, Yabin Zhang2, Kui Jia1,4
1South China University of Technology 2The Hong Kong Polytechnic University
3DexForce Co. Ltd. 4Peng Cheng Laboratory
eecyw@mail.scut.edu.cn,kuijia@scut.edu.cn
Abstract
Creation of 3D content by stylization is a promising yet challenging problem in computer vision and graphics research. In this work, we focus on stylizing photorealistic appearance renderings of a given surface mesh of arbitrary topology. Motivated by the recent surge of cross-modal supervision of the Contrastive Language-Image Pre-training (CLIP) model, we propose TANGO, which transfers the appearance style of a given 3D shape according to a text prompt in a photorealistic manner. Technically, we propose to disentangle the appearance style as the spatially varying bidirectional reflectance distribution function, the local geometric variation, and the lighting condition, which are jointly optimized, via supervision of the CLIP loss, by a spherical Gaussians based differentiable renderer. As such, TANGO enables photorealistic 3D style transfer by automatically predicting reflectance effects even for bare, low-quality meshes, without training on a task-specific dataset. Extensive experiments show that TANGO outperforms existing methods of text-driven 3D style transfer in terms of photorealistic quality, consistency of 3D geometry, and robustness when stylizing low-quality meshes. Our code and results are available at our project webpage https://cyw-3d.github.io/tango.
1 Introduction
3D content creation by stylization (e.g., stylized according to text prompts [31], images [44], or 3D shapes [50]) has important applications in computer vision and graphics. The problem is challenging, however, and traditionally requires manual effort from professional artists and a large amount of time. Meanwhile, many online 3D repositories [6, 47, 41] can be easily accessed on the Internet, yet the surface meshes they contain are bare contents without any style. It is thus promising if automatic, diverse, and realistic stylization can be achieved for these raw 3D contents. We note that, similar to recent 3D stylization works [31, 50, 18, 33, 21], we consider style as the particular appearance of an object, which is determined by the color (albedo) and the physical reflectance effects of the object surface, while considering content as the global shape structure and topology prescribed by an explicit 3D mesh or other implicit representation of the object surface.
Stylization is usually guided by a styling signal such as a text prompt [31], a reference image [44], or a target 3D shape [50]. In this work, we choose to work with stylization by text prompts, motivated by the impressive effects recently achieved in many applications [44, 19, 10] that use the cross-modal supervision model of Contrastive Language-Image Pre-training (CLIP) [37].
The goal of this paper is to devise an end-to-end neural architecture, named TANGO, that can transfer, guided by a text prompt, the style of a given 3D shape of arbitrary topology. Note that TANGO can be directly applied to arbitrary meshes with arbitrary target styles, without additional learning/optimization procedures on a task-specific dataset; Figure 1 gives an illustration.
Corresponding Author
36th Conference on Neural Information Processing Systems (NeurIPS 2022).
The most relevant work so far is Text2Mesh [31], which is the first to use the CLIP loss for mesh stylization; given an input mesh and a text prompt, it predicts a stylized color and a displacement for each mesh vertex, and the stylized mesh is simply the result of this colored vertex displacement procedure. However, Text2Mesh does not support the stylization of lighting conditions, reflectance properties, and local geometric variations, which are necessary to produce a photorealistic appearance of a 3D surface. To this end, we formulate the problem of mesh stylization as learning three unknowns of the surface, i.e., the spatially varying Bidirectional Reflectance Distribution Function (SVBRDF), the local geometric variation (normal map), and the lighting condition. We show that TANGO is able to learn the above unknowns supervised only by CLIP-bridged text prompts, and to generate photorealistic rendering effects by approximating these unknowns during inference. Technically, by jointly encoding the text prompt and images of the given mesh rendered by a spherical Gaussians based differentiable renderer, TANGO compares the embeddings of the text and the mesh style and backpropagates gradient signals to update the style parameters. Our approach disentangles the style into three components represented by learnable neural networks, namely continuous functions of, respectively, the surface point, its normal, and the viewing direction. Note that, due to the learning of a neural normal map, TANGO can produce fine-grained geometric details even on low-quality meshes with only a few faces; in contrast, vertex displacement-based methods, such as [31], often fail in such cases.
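To make this pipeline concrete, the following is a minimal sketch of the CLIP-guided optimization loop, assuming PyTorch and the OpenAI `clip` package; the style parameters and the `render_views` function are hypothetical placeholders (here a dummy renderer returning a constant color) standing in for TANGO's learnable style components and its spherical Gaussians based differentiable renderer, not the released implementation.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
clip_model = clip_model.float()  # keep everything in fp32 for simplicity
for p in clip_model.parameters():
    p.requires_grad_(False)      # CLIP stays frozen; only style parameters are optimized

# Encode the target style described by a text prompt.
tokens = clip.tokenize(["a shiny golden vase"]).to(device)
with torch.no_grad():
    text_emb = clip_model.encode_text(tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

# Hypothetical placeholder for the learnable style parameters (SVBRDF, normal map, lighting).
style_params = torch.zeros(16, requires_grad=True, device=device)

def render_views(params, num_views=4):
    # Stand-in for the spherical Gaussians based differentiable renderer: it broadcasts a
    # learnable color so the loop runs end to end; the real renderer evaluates the rendering
    # equation at every ray-mesh intersection of randomly sampled views.
    color = torch.sigmoid(params[:3])
    return color.view(1, 3, 1, 1).expand(num_views, 3, 224, 224)

optimizer = torch.optim.Adam([style_params], lr=5e-4)
for step in range(1500):
    images = render_views(style_params)                      # differentiable rendering
    img_emb = clip_model.encode_image(images)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    loss = 1.0 - (img_emb @ text_emb.T).mean()               # CLIP (cosine) loss
    optimizer.zero_grad()
    loss.backward()                                          # gradients reach the style parameters
    optimizer.step()
```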
We finally summarize our contributions as follows.
• We propose a novel, end-to-end architecture (TANGO) for text-driven photorealistic 3D style transfer; our model learns to assign advanced physical lighting appearance and local geometric details to a raw mesh, by leveraging an off-the-shelf, pre-trained CLIP model.
• TANGO can automatically predict the material, normal map, and lighting condition for any bare mesh prescribed by a text description, and can also handle low-quality meshes with only a few faces, owing to our prediction of colors and reflectance properties for every intersection point of a camera ray.
• We conduct extensive ablation studies and experiments that show the advantages of TANGO. Notably, besides producing more photorealistic renderings, our results exhibit no self-intersections in local geometric details; in contrast, such imperfections appear in results from existing methods.
2 Related Work
Text-Driven Generation and Manipulation. Our work is primarily inspired by [31] and other methods [44, 19, 40, 20, 22] that use text prompts to guide 3D generation or manipulation. The optimization procedures of these methods are guided by the CLIP model [37]. Specifically, CLIP-NeRF [44] proposes a unified framework that allows manipulating NeRF, guided by either a short text prompt or an exemplar image. Another work, Dreamfields [19], chooses to generate 3D scenes from scratch in a NeRF representation. Different from those NeRF-based methods, CLIP-Forge [40] presents a method for shape generation represented by voxels. Text2Mesh [31] predicts color and geometric details for a given template mesh according to the text prompt. Concurrently, Khalid et al. [22] generate stylized meshes from a sphere guided by CLIP. Meanwhile, apart from 3D content creation, there are other works focusing on text-guided image generation and manipulation. StyleCLIP [36] and VQGAN-CLIP [10] use the CLIP loss to optimize latent codes in a GAN's style space. GLIDE [35] is introduced for more photorealistic image generation and editing with text-guided diffusion models, while DALL·E [38] explores generating images using transformers. Compared to these methods, we focus on improving the realism of text-driven 3D mesh stylization.
3D Style Transfer. Generating or editing 3D content according to a given style is a long-standing task in computer graphics [12, 14, 48]. NeuralCage [49] implements traditional cage-based deformation as differentiable network layers and generates feature-preserving deformations. Furthermore, 3DStyleNet [50] conducts 3D object stylization informed by a reference textured 3D shape. Other works are specific to styles of furniture [27], 3D collages [12], LEGO [23], and portraits [17]. Compared to these global geometric transfer approaches, some works investigate style transfer at a more local level. Specifically, Hertz et al. [18] propose a generative framework to learn local structures from the style of a given mesh, while Liu et al. [25] subdivide the given mesh for coarse-to-fine geometric modeling. Unlike these methods, we aim to synthesize a wide range of realistic styles specified by a text prompt, making it more convenient to guide the process with high-level semantics.
BRDF and Lighting Estimation. Before the deep learning era, BRDF and lighting could only be estimated under complicated settings in a laboratory environment [2, 28, 46]. With the help of neural networks, some techniques can use simpler settings to estimate BRDF and lighting beyond geometry [13, 3, 1, 11]. They assume that scenes are under specific illumination, such as one [11] or multiple [13] flashlights, or linear light [15]. However, they often assume that the geometry is a plane [13] or known [3]. These requirements are difficult to meet in reality. Furthermore, recent works [7, 8, 26, 53] exploit neural networks to jointly predict geometry, BRDF, and in some cases lighting via a 2D image loss. These methods can be split into two categories according to the geometry representation. Among them, those that use an explicit representation [26, 7, 8] usually deform a sphere to obtain the final shape, which cannot represent arbitrary topology. Another line of methods uses implicit geometry representations such as SDF [51], occupancy functions [30], or NeRF [42, 4, 52]; they can obtain higher-quality shapes, but these representations are not handy for contemporary game engines. Unlike them, our task is to change the style of a given shape, so the explicit triangle mesh is preferred due to its convenience. As for the BRDF and lighting representation, PhySG [51] and NeRD [4] use mixtures of spherical Gaussians to represent illumination; NeRFactor [52] and nvdiffrec [34] choose low-resolution environment maps and the split-sum lighting model, respectively. Different from all these 2D-supervision methods, we jointly estimate the spatially varying BRDF, lighting, and local geometry (normal map) supervised by text prompts. With the help of the estimated BRDF and lighting, we can represent complicated real-world light reflection on stylized meshes.
3 Method
We contribute TANGO, an end-to-end architecture that enables text-driven photorealistic 3D style transfer for any bare mesh, supervised by a semantic loss. The heart of TANGO is to disentangle the style of the input mesh into reflectance properties and scene lighting. Then, given a target style specified by a text prompt, we learn the corresponding style parameters by leveraging the pre-trained CLIP model, and the stylized images are generated with the learned style parameters through our spherical Gaussians (SG) based differentiable renderer. There are three unknowns in our defined style parameters: (i) the spatially varying BRDF (SVBRDF), including diffuse, roughness, and specular terms, represented by parameters $\xi \in \mathbb{R}^m$; (ii) the normal, represented by parameters $\gamma \in \mathbb{R}^n$; and (iii) the lighting, represented by parameters $\tau \in \mathbb{R}^l$.
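As an illustration of how these three parameter groups might be realized, below is a minimal sketch, assuming PyTorch: two small coordinate-based MLPs for the SVBRDF and the normal offset, plus a set of spherical Gaussian lighting parameters. The layer sizes, output ranges, and the number of SG lobes are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class StyleMLP(nn.Module):
    """A small coordinate-based MLP mapping a 3D surface point to per-point style attributes."""
    def __init__(self, in_dim=3, out_dim=5, hidden=256, depth=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.ReLU()]
            d = hidden
        layers += [nn.Linear(d, out_dim), nn.Sigmoid()]  # keep raw outputs in [0, 1]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)

# (i) SVBRDF parameters xi: diffuse RGB (3) + roughness (1) + specular (1) per surface point.
svbrdf_net = StyleMLP(in_dim=3, out_dim=5)

# (ii) Normal parameters gamma: an offset added to the mesh normal at each point
# (a real implementation would remap the output to [-1, 1] and renormalize).
normal_net = StyleMLP(in_dim=3, out_dim=3)

# (iii) Lighting parameters tau: K spherical Gaussian lobes, each with a lobe axis (3),
# a sharpness (1), and an RGB amplitude (3), i.e. tau lives in R^(K*7).
K = 32
lighting_params = nn.Parameter(0.1 * torch.randn(K, 7))
```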
We use an explicit triangle mesh to represent the input 3D shape. The input template mesh $M$ consists of $e$ vertices $V \in \mathbb{R}^{e \times 3}$ and $u$ faces $F \in \{1, \ldots, e\}^{u \times 3}$, and is fixed throughout training. The object's style is described by a text prompt, and the aforementioned SVBRDF, normal, and lighting parameters are optimized to fit the description. Compared to implicit representations [51, 4], the explicit representation adopted here is more convenient to obtain and easier to use in current game engines and other applications.
Forward model. Given an object mesh $M$, we first scale it into a unit sphere and then randomly sample points near $M$ as camera positions $c$ to render images. For each pixel in a rendered image, indexed by $p$, let $R_p = \{c_p + t\nu_p \mid t \geq 0\}$ denote the ray through pixel $p$, where $c_p$ is randomly sampled on a sphere containing the object and $\nu_p$ denotes the direction of the ray (i.e., the vector pointing from $c_p$ to $p$). Let $x_p$ denote the first intersection of the ray $R_p$ and the mesh $M$, $n_p$ denote the ground-truth normal at point $x_p$, and $\hat{n}_p = \Pi(n_p, x_p; \gamma)$ the estimated normal at the surface point $x_p$. For each surface point $x_p$ with estimated normal $\hat{n}_p$, we suppose that $L_i(\omega_i; \tau)$ is the incident light intensity from direction $\omega_i$, and the SVBRDF $f_r(\nu_p, \omega_i, x_p; \xi)$ gives the surface reflectance coefficients of the material at location $x_p$ for viewing direction $\nu_p$ and incident light direction $\omega_i$. Then the observed light intensity $L_p(\nu_p, x_p, n_p)$ is an integral over the hemisphere $\Omega = \{\omega_i : \omega_i \cdot \hat{n}_p \geq 0\}$:
$$L_p(\nu_p, x_p, n_p; \xi, \gamma, \tau) = \int_{\Omega} L_i(\omega_i; \tau)\, f_r(\nu_p, \omega_i, x_p; \xi)\, (\omega_i \cdot \Pi(n_p, x_p; \gamma))\, d\omega_i. \quad (1)$$
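For intuition, Equation (1) can be approximated numerically as sketched below with uniform Monte Carlo sampling over the hemisphere; this is only an illustrative assumption, since TANGO evaluates the integral in closed form using the spherical Gaussians representation, and the functions `L_i` and `f_r` here are stand-ins for the learned lighting and SVBRDF.

```python
import math
import torch

def sample_hemisphere(n_hat, num_samples=256):
    """Uniformly sample directions on the hemisphere {w : w . n_hat >= 0}."""
    d = torch.randn(num_samples, 3)
    d = d / d.norm(dim=-1, keepdim=True)
    return torch.where((d @ n_hat).unsqueeze(-1) < 0, -d, d)  # flip into the upper hemisphere

def shade_point(x_p, n_hat, nu_p, L_i, f_r, num_samples=256):
    """Monte Carlo estimate of Eq. (1) at a single surface point x_p.

    L_i(w)        -> (3,) incident radiance from direction w   (depends on tau)
    f_r(nu, w, x) -> (3,) SVBRDF value at x                    (depends on xi)
    n_hat         -> (3,) estimated normal at x_p              (depends on gamma)
    """
    w = sample_hemisphere(n_hat, num_samples)                   # (S, 3)
    cos_term = (w @ n_hat).clamp(min=0.0).unsqueeze(-1)         # (S, 1), w . n_hat
    radiance = torch.stack([L_i(wi) * f_r(nu_p, wi, x_p) for wi in w])  # (S, 3)
    # Uniform hemisphere sampling has density 1 / (2*pi), hence the 2*pi factor.
    return 2.0 * math.pi * (radiance * cos_term).mean(dim=0)

# Example usage with toy lighting and a constant Lambertian-style BRDF:
L_i = lambda w: torch.ones(3) * w[2].clamp(min=0.0)             # brighter light from above
f_r = lambda nu, w, x: torch.full((3,), 0.18 / math.pi)         # constant albedo / pi
color = shade_point(torch.zeros(3), torch.tensor([0.0, 0.0, 1.0]),
                    torch.tensor([0.0, 0.0, -1.0]), L_i, f_r)
```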
For each image $I \in [0,1]^{H \times W \times 3}$ corresponding to a camera position $c$, we calculate the pixel colors by Equation (1). The rendered image is encoded into a latent code by a pre-trained CLIP model, and this image embedding is compared against the embedding of the text prompt to form the loss that drives the optimization of the style parameters.