The most relevant work so far is Text2Mesh [31], which is the first to use the CLIP loss in mesh stylization; given an input mesh and a text prompt, it predicts stylized color and displacement for each mesh vertex, and the stylized mesh is simply the result of this colored vertex displacement procedure.
However, Text2Mesh does not support the stylization of lighting conditions, reflectance properties, and local geometric variations, which are necessary to produce a photorealistic appearance of a 3D surface. To this end, we formulate the problem of mesh stylization as learning three unknowns
of the surface, i.e., the spatially varying Bidirectional Reflectance Distribution Function (SVBRDF),
the local geometric variation (normal map), and the lighting condition. We show that TANGO is
able to learn the above unknowns supervised only by CLIP-bridged text prompts, and generate
photorealistic rendering effects by approximating these unknowns during inference. Technically, by
jointly encoding the text prompt and images of the given mesh rendered by a spherical-Gaussian-based differentiable renderer, TANGO is able to compare the embeddings of the text and the mesh style and
then backpropagate the gradient signals to update the style parameters. Our approach disentangles the style into three components represented by learnable neural networks, namely continuous functions of, respectively, the surface point, its normal, and the viewing direction. Note that thanks to the learned neural normal map, TANGO can produce fine-grained geometric details even on low-quality meshes with only a few faces; in contrast, vertex displacement-based methods, such as [31], often fail in such cases.
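To make this optimization loop concrete, below is a minimal PyTorch-style sketch of how such CLIP-supervised style learning could be wired up. It is not the paper's implementation: `render_with_style` and `mesh` are hypothetical placeholders standing in for the spherical-Gaussian differentiable renderer and the input geometry, and the three style components (SVBRDF network, normal-offset network, and spherical-Gaussian lighting parameters) use illustrative dimensions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


class StyleMLP(nn.Module):
    """Small coordinate-based MLP used for each disentangled style component."""
    def __init__(self, in_dim, out_dim, hidden=256, depth=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
            d = hidden
        layers += [nn.Linear(d, out_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device, jit=False)
for p in clip_model.parameters():          # CLIP stays frozen; only style params train
    p.requires_grad_(False)

# Three disentangled style components (illustrative input/output sizes):
svbrdf_net = StyleMLP(3, 7).to(device)     # surface point -> diffuse RGB + specular params
normal_net = StyleMLP(6, 3).to(device)     # point + base normal -> normal offset
light_params = torch.randn(16, 7, device=device, requires_grad=True)  # 16 SG lobes (axis, sharpness, amplitude)

params = list(svbrdf_net.parameters()) + list(normal_net.parameters()) + [light_params]
optimizer = torch.optim.Adam(params, lr=5e-4)

text = clip.tokenize(["a shoe made of gold"]).to(device)
with torch.no_grad():
    text_feat = F.normalize(clip_model.encode_text(text).float(), dim=-1)

for step in range(1000):
    # Hypothetical differentiable renderer: shades camera-ray/mesh intersections with the
    # predicted SVBRDF, perturbed normals, and SG lighting; assumed to return a batch of
    # CLIP-normalized images of shape (num_views, 3, 224, 224).
    images = render_with_style(mesh, svbrdf_net, normal_net, light_params,
                               num_views=4, resolution=224)
    img_feat = F.normalize(clip_model.encode_image(images).float(), dim=-1)
    loss = 1.0 - (img_feat * text_feat).sum(dim=-1).mean()   # CLIP cosine-similarity loss
    optimizer.zero_grad()
    loss.backward()   # gradients flow through the renderer into the style parameters
    optimizer.step()
```

Because only the style networks and lighting parameters receive gradients, the raw mesh geometry itself is never modified; all apparent geometric detail comes from the learned normal map and shading.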
We finally summarize our contributions as follows.
• We propose a novel, end-to-end architecture (TANGO) for text-driven photorealistic 3D style transfer; our model learns to assign advanced physical lighting appearance and local geometric details to a raw mesh by leveraging an off-the-shelf, pre-trained CLIP model.
• TANGO can automatically predict the material, normal map, and lighting condition for any bare mesh as prescribed by a text description, and can also handle low-quality meshes with only a few faces, owing to our prediction of colors and reflectance properties for every intersection point of a camera ray.
• We conduct extensive ablation studies and experiments that show the advantages of TANGO. Notably, in addition to producing more photorealistic renderings, our results contain no self-intersections in local geometric details; in contrast, such imperfections appear in results from existing methods.
2 Related Work
Text-Driven Generation and Manipulation. Our work is primarily inspired by [31] and other methods [44, 19, 40, 20, 22] which use text prompts to guide 3D generation or manipulation. The optimization procedures of these methods are guided by the CLIP model [37]. Specifically, CLIP-NeRF [44] proposes a unified framework that allows manipulating NeRF, guided by either a short text prompt or an exemplar image. Another work, Dreamfields [19], chooses to generate 3D scenes from scratch in a NeRF representation. Different from those NeRF-based methods, CLIP-Forge [40] presents a method for shape generation represented by voxels. Text2Mesh [31] predicts color and geometric details for a given template mesh according to the text prompt. Concurrently, Khalid et al. [22] generate stylized meshes from a sphere guided by CLIP. Meanwhile, apart from 3D content creation, there are other works focusing on text-guided image generation and manipulation. StyleCLIP [36] and VQGAN-CLIP [10] use a CLIP loss to optimize latent codes in GAN's style space. GLIDE [35] is introduced for more photorealistic image generation and editing with text-guided diffusion models, while DALL·E [38] explores generating images using a transformer. Compared to these methods, we focus on improving the realism of text-driven 3D mesh stylization.
3D Style Transfer. Generating or editing 3D content according to a given style is a long-standing task in computer graphics [12, 14, 48]. NeuralCage [49] implements traditional cage-based deformation as differentiable network layers and generates feature-preserving deformations. Furthermore, 3DStyleNet [50] conducts 3D object stylization informed by a reference textured 3D shape. Other works are specific to styles of furniture [27], 3D collages [12], LEGO [23], and portraits [17]. Compared to these global geometric transfer approaches, some works investigate style transfer at a more local level. Specifically, Hertz et al. [18] propose a generative framework to learn local structures from the style of a given mesh, while Liu et al. [25] subdivide the given mesh for a coarse-to-fine