The most relevant work so far is Text2Mesh [31], which is the first to use the CLIP loss in mesh stylization; given an input mesh and a text prompt, it predicts stylized color and displacement for each mesh vertex, and the stylized mesh is simply the result of this colored vertex displacement procedure.
However, Text2Mesh does not support the stylization of lighting conditions, reflectance properties, and local geometric variations, which are necessary to produce a photorealistic appearance of a 3D surface. To this end, we formulate the problem of mesh stylization as learning three unknowns
of the surface, i.e., the spatially varying Bidirectional Reflectance Distribution Function (SVBRDF),
the local geometric variation (normal map), and the lighting condition. We show that TANGO is
able to learn the above unknowns supervised only by CLIP-bridged text prompts, and generate
photorealistic rendering effects by approximating these unknowns during inference. Technically, by
jointly encoding the text prompt and images of the given mesh rendered by a spherical-Gaussian-based differentiable renderer, TANGO is able to compare the embeddings of the text and the mesh style and
then backpropagate the gradient signals to update the style parameters. Our approach disentangles the style into three components represented by learnable neural networks, namely continuous functions of, respectively, the surface point, its normal, and the viewing direction. Note that thanks to the learned neural normal map, TANGO can produce fine-grained geometric details even on low-quality meshes with only a few faces; in contrast, vertex displacement-based methods, such as [31], often fail in such cases.
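To make this optimization loop concrete, below is a minimal PyTorch-style sketch of how such CLIP-supervised style learning could be wired up. It is not the paper's implementation: `render_with_style` and `mesh` are hypothetical placeholders standing in for the spherical-Gaussian differentiable renderer and the input geometry, and the three style components (SVBRDF network, normal-offset network, and spherical-Gaussian lighting parameters) use illustrative dimensions only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git


class StyleMLP(nn.Module):
    """Small coordinate-based MLP used for each disentangled style component."""
    def __init__(self, in_dim, out_dim, hidden=256, depth=4):
        super().__init__()
        layers, d = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(d, hidden), nn.ReLU(inplace=True)]
            d = hidden
        layers += [nn.Linear(d, out_dim)]
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)


device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device, jit=False)
for p in clip_model.parameters():          # CLIP stays frozen; only style params train
    p.requires_grad_(False)

# Three disentangled style components (illustrative input/output sizes):
svbrdf_net = StyleMLP(3, 7).to(device)     # surface point -> diffuse RGB + specular params
normal_net = StyleMLP(6, 3).to(device)     # point + base normal -> normal offset
light_params = torch.randn(16, 7, device=device, requires_grad=True)  # 16 SG lobes (axis, sharpness, amplitude)

params = list(svbrdf_net.parameters()) + list(normal_net.parameters()) + [light_params]
optimizer = torch.optim.Adam(params, lr=5e-4)

text = clip.tokenize(["a shoe made of gold"]).to(device)
with torch.no_grad():
    text_feat = F.normalize(clip_model.encode_text(text).float(), dim=-1)

for step in range(1000):
    # Hypothetical differentiable renderer: shades camera-ray/mesh intersections with the
    # predicted SVBRDF, perturbed normals, and SG lighting; assumed to return a batch of
    # CLIP-normalized images of shape (num_views, 3, 224, 224).
    images = render_with_style(mesh, svbrdf_net, normal_net, light_params,
                               num_views=4, resolution=224)
    img_feat = F.normalize(clip_model.encode_image(images).float(), dim=-1)
    loss = 1.0 - (img_feat * text_feat).sum(dim=-1).mean()   # CLIP cosine-similarity loss
    optimizer.zero_grad()
    loss.backward()   # gradients flow through the renderer into the style parameters
    optimizer.step()
```

Because only the style networks and lighting parameters receive gradients, the raw mesh geometry itself is never modified; all apparent geometric detail comes from the learned normal map and shading.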
We finally summarize our contributions as follows.
• We propose a novel, end-to-end architecture (TANGO) for text-driven photorealistic 3D style transfer; our model learns to assign advanced physical lighting appearance and local geometric details to a raw mesh by leveraging an off-the-shelf, pre-trained CLIP model.
• TANGO can automatically predict the material, normal map, and lighting condition for any bare mesh as prescribed by a text description, and can also handle low-quality meshes with only a few faces, owing to our prediction of colors and reflectance properties for every intersection point of a camera ray.
• We conduct extensive ablation studies and experiments that show the advantages of TANGO. Notably, in addition to producing more photorealistic renderings, our results contain no self-intersections in local geometric details; in contrast, such imperfections appear in results from existing methods.
2 Related Work
Text-Driven Generation and Manipulation. Our work is primarily inspired by [31] and other methods [44, 19, 40, 20, 22] which use text prompts to guide 3D generation or manipulation. The optimization procedures of these methods are guided by the CLIP model [37]. Specifically, CLIP-NeRF [44] proposes a unified framework that allows manipulating NeRF, guided by either a short text prompt or an exemplar image. Another work, Dreamfields [19], chooses to generate 3D scenes from scratch in a NeRF representation. Different from those NeRF-based methods, CLIP-Forge [40] presents a method for shape generation represented by voxels. Text2Mesh [31] predicts color and geometric details for a given template mesh according to the text prompt. Concurrently, Khalid et al. [22] generate stylized meshes from a sphere guided by CLIP. Meanwhile, apart from 3D content creation, there are other works focusing on text-guided image generation and manipulation. StyleCLIP [36] and VQGAN-CLIP [10] use a CLIP loss to optimize latent codes in GAN's style space. GLIDE [35] is introduced for more photorealistic image generation and editing with text-guided diffusion models, while DALL·E [38] explores generating images using a transformer. Compared to these methods, we focus on improving the realism of text-driven 3D mesh stylization.
3D Style Transfer. Generating or editing 3D content according to a given style is a long-standing task in computer graphics [12, 14, 48]. NeuralCage [49] implements traditional cage-based deformation as differentiable network layers and generates feature-preserving deformations. Furthermore, 3DStyleNet [50] conducts 3D object stylization informed by a reference textured 3D shape. Other works are specific to styles of furniture [27], 3D collages [12], LEGO [23], and portraits [17]. Compared to these global geometric transfer approaches, some works investigate style transfer at a more local level. Specifically, Hertz et al. [18] propose a generative framework to learn local structures from the style of a given mesh, while Liu et al. [25] subdivide the given mesh for a coarse-to-fine