Vitruvio: 3D Building Meshes via Single Perspective Sketches
Alberto Tono
Stanford University & Computational Design Institute
atono@stanford.edu & alberto.tono@cd.institute
Heyaojing Huang
Stanford University
hhyj4495@stanford.edu
Ashwin Agrawal
Stanford University
ashwin15@stanford.edu
Martin Fischer
Stanford University
fischer@stanford.edu
Figure 1. Vitruvio converts a single perspective sketch into a 3D watertight mesh in Universal Scene Description (USD) format. Vitruvio's final output is a 3D-printable model. In the figure above, the model was printed on a DREMEL 3D45 in PLA material from a .gcode shape file. The user envisions a 3D building mass (the ground-truth building on the left, representing the "D2X" building in the dataset) and sketches it on an iPad in a single-line style, centered on a square canvas. This figure illustrates some of the method's assumptions and limitations: the model has been trained only on a single-line synthetic sketch style, and the output mesh lacks accurate dimensions and proportions, as presented in Section 7.
Abstract
At the beginning of a project, architects convey design
ideas via quick 2D diagrams, front views, floor plans, and
sketches. Consequently, many stakeholders have difficulty
visualizing the 3D representation of the building mass, lead-
ing to varied interpretations and inhibiting a shared under-
standing of the design. To alleviate the challenge, this pa-
per proposes a deep learning-based method, Vitruvio, for
creating a 3D model from a single perspective sketch. This
allows designers to automatically generate 3D representa-
tions in real-time based on their initial sketches and thus
communicate effectively and intuitively to the client. Vitru-
vio adapts Occupancy Network to perform single view re-
construction (SVR), a technique for creating 3D represen-
tations from a single image. Vitruvio achieves: (1) an 18% increase in reconstruction accuracy and (2) a 26% reduction in inference time compared to Occupancy Network on 1k buildings provided by the New York municipality. This research investigates the effect of building orientation on reconstruction quality, discovering that Vitruvio can capture fine-grained details in complex buildings when their native orientation is preserved during training, as opposed to the SVR standard practice of aligning every building to its canonical pose. Finally, we release the code.
arXiv:2210.13634v2 [cs.CV] 11 Apr 2023
1. Introduction
The design process in Architecture, Engineering, and
Construction (AEC) industry involves many stakeholders,
including professionals such as engineers, architects, plan-
ners, and non-specialists such as clients, citizens, and users.
Each stakeholder contributes to all the design aspects, which
Vitruvius called 'Firmitas, Utilitas, Venustas', translating to
solidity, usefulness, and beauty. Early in the design
process, all parties must reach a shared understanding of
these Vitruvian values to avoid any misrepresentation later
on [10,123]. A critical factor in establishing a shared un-
derstanding is the ability to convey the information quickly,
using a medium all stakeholders can understand [72].
However, during the initial meetings, design ideas are
shared with mediums such as 2D diagrams [16], front
views, floorplans [48,67,68,89], and sketches on papers
[101]. These mediums often represent the design informa-
tion in a few lines, leading to partial and incomplete repre-
sentations of the overall mass. As a result, many stakehold-
ers need help to visualize the actual 3D representation of
the building, resulting in varied interpretations, which ulti-
mately inhibits a shared understanding of design. [73] notes
that the inability of the stakeholders to interpret 2D designs
leads to reductions in productivity, reworks, wastage, and
cost overruns. [49] points out that this mode of design prac-
tices leads to difficulties in the communication of designs
since these representations lack of the 3D information ( such
as proportion, volume, overall mass, and others) needed
during later phases.
To alleviate this challenge, this research aims to gen-
erate 3D geometries from sketches, grounding its theory
on Sketch Modeling, an active area of research since the
90’s [45,118]. Sketch modeling has two major approaches:
Learning-based methods and Non-learning-based methods.
The Non-learning based methods require specific and de-
fined inputs to construct 3D geometries. As a result, this
method operates with fixed viewpoints and specific sketch-
ing styles, thus reducing the designer’s flexibility. There-
fore learning-based methods have been employed to resolve
these issues, allowing for more flexibility, as detailed later
in Section 3. Learning-based methods, also called data-
driven, generate a 3D shape from a partial sketch by learn-
ing a joint probabilistic distribution of a shape and sketches
from the dataset.
Currently, these techniques have only focused on decon-
textualized shapes such as furniture and mechanical parts,
where positions and orientations do not directly affect their
representation. However, this is different for buildings: their design is affected by the building's location and orientation.
Therefore, this research provides a step forward in this
direction to consider location and orientation within the re-
construction process for deep learning models. Indeed, Vit-
ruvio is a flexible, and contextual method that reconstructs
a 3D representation of buildings from a single perspective
sketch. It provides the flexibility to generate a building mass
from a partial sketch drawn from any perspective viewpoint.
To accomplish this task, we build our own dataset (Section 4), dubbed Manhattan1k. Manhattan1k preserves the contextual information of buildings, specifically their locations and orientations.
To summarize, the contribution is threefold:
1. We explain the use of a learning-based method for
sketch-to-3D applications where the final 3D building
shapes depend on a single perspective sketch.
2. We develop Vitruvio by adapting Occupancy Network
(OccNet) [61] to our building dataset, improving
its accuracy and efficiency (Section 5.1).
3. We show qualitatively and quantitatively that the build-
ing orientation affects the reconstruction performance
of our network (Section 5.2).
After presenting the related works and their limitations
in Section 2, the remainder of this paper introduces our
methods in Sections 3 and 4 and the experiments that
validate the above-mentioned hypotheses and claims in Section 5.
Finally, Sections 6, 7, and 8 present the discussion
and limitations of our method. The code has been
released open source: https://github.com/
CDInstitute/Vitruvio
2. Related Work
This section introduces previous work in sketch model-
ing and the limitations that led to a learning-based approach.
Sketch modeling has been an active area of research since the late 90s [45,118]. This modeling process
can be represented in two ways: as a series of operations or
as a Single View Reconstruction (SVR) [30,97].
The former is typically adopted by CAD software. It
requires specific and defined inputs, such as strokes, to
construct 3D geometries. Through a series of sketches
from different viewpoints [26] or a series of procedures
[40,55,56,69,70,87], these simplified CAD interfaces provide
complete control of the 3D geometry at the expense of
artistic style. Thus, the models developed have been
view- and style-dependent [126], operating with fixed viewpoints
and specific sketching styles.
The latter, SVR, leverages a more flexible approach. It
uses computer vision techniques to reconstruct 3D shapes
from a single sketch without the mandatory requirement
of a digital surface. SVR has recently gained attention
thanks to the advances in learning-based methods [37,39,47,98,106,126,127], inspired by image-based reconstruction, where the geometric output is represented in two main ways: explicitly or implicitly. An explicit representation is composed of meshes [36,42,111], point clouds [31,78,79,109], voxels [19,83,114], sets of semantically meaningful parts [75], constructive solid geometries (CSG) [55,56], Coons patch-based surfaces [90], or superquadrics [75]. The implicit ones are represented as occupancy functions [17,46,61], neural fields, or signed distance fields [4,34,57,64,71,74,81,85,96,117,119].
Previous SVR approaches focused on decontextualized
shapes [15,88,115]. Indeed, furniture [15,115] and me-
chanical parts [50,53,112] can be positioned anywhere and
do not have to be designed for an exact location. How-
ever, this is different for buildings. Their construction and
design have specific constraints and regulations that vary
based on location. In a specific neighborhood, the buildings
share similar limitations, features, and characteristics. In
the initial design phases, the project location is known and
is used in common data-driven approaches [11] with Geographical
Information Systems (GIS) [3,12] to inform the
design process, which is thereby contextualized. In fact, the location
and orientation highly impact the design. Therefore, the 3D
building mass generation from a sketch should account for
those factors, and they should be present in the dataset. For
example, considering a specific building in New York, the
wind impacts its design: a five-degree rotation affects its
energy and structural performance.
Due to these desiderata, we adopt a deep learning model. It captures the underlying correlation between the 2D sketch and the 3D building shape from a Bayesian perspective, without the need to follow a deterministic process and with the ability to encode additional information such as building location and orientation. Hence, the dataset is the key to these
data-centric learning-based approaches. Unfortunately, re-
cent single view sketch-based methods focused on datasets
like ShapeNet [15], and only two targeted the reconstruction
of buildings [70] and [26] from perspective sketches. These
examples either targeted the content generation communi-
ties (gaming and mapping scenarios) [70], or did not allow
a 3D reconstruction based on a single perspective sketch
[26,48,72], sub-optimal for AEC’s workflows. While pre-
vious generative design approaches required an explicit for-
mulation of constraints and parameters to generate new so-
lutions, our method can synthesize and learn the generative
process from existing buildings [44]. For this reason, we
develop Vitruvio. Vitruvio is a deep generative model: a
learning-based approach approximating a specific dataset’s
probability distribution. Our dataset comprises existing 3D
building shapes, sketches, and contextual information (ori-
entation).
3. Method
In this research, we adopt a learning-based [100] method
that better aligns with our desiderata, as previously de-
scribed. Deep generative models (Variational Auto-Encoders (VAE) [33,52,82,125], Auto-Encoders (AE), Generative Adversarial Networks (GAN) [16,114], Flow-based [77], Energy-based, and Score-based or Diffusion Models [60,128]) aim to approximate an unknown joint probability distribution. Our approach does not learn to directly map 2D images to 3D [19,36,108]. Instead, it estimates the joint distribution [52,61,74,77,114] of three main random variables, p(x, y, φ): the building shapes x^(i), the respective sketches y_ij, and the contextual information φ_il (building orientation and position). This process enables sampling new shapes from the learned distribution, described as p_D (where D is the dataset {x, y, φ}).
Figure 2. Sampling and visualization of the unsigned occupancy
function. Blue points correspond to internal samples and red to
external ones. In this image, points are rendered as spheres with a
0.01-unit radius. The building shape has been previously normalized
to the unit interval.
First, x^(i) represents the i-th 3D building shape in our dataset, composed of n independently sampled buildings x^(1), . . . , x^(n). For this approach, we follow [61]: each 3D point p ∈ R^3 is assigned an occupancy function that outputs 1 if that location is inside the shape and 0 otherwise. This function is represented as a classification problem: a neural network outputs a 1 or 0 value from xyz coordinates (points p_ik ∈ R^3, k = 1, . . . , K, in the i-th building shape) and a sketch.
Second, the sketches are represented as y_ij ∈ R^2 (as grayscale images), where i indexes the building and j the sketch viewpoint.
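The point sampling and labeling of Fig. 2 can be sketched in a few lines of NumPy. The paper labels points against a watertight building mesh; here a hypothetical axis-aligned box stands in for that mesh so the example stays self-contained (the function name and box extents are ours, not from the released code):

```python
import numpy as np

def occupancy(points, box_min=(-0.3, -0.3, -0.3), box_max=(0.3, 0.3, 0.3)):
    """Return 1.0 for points inside the shape, 0.0 otherwise (the labels o_ik).

    A box is a stand-in for the watertight mesh used in the paper."""
    lo, hi = np.asarray(box_min), np.asarray(box_max)
    inside = np.all((points >= lo) & (points <= hi), axis=-1)
    return inside.astype(np.float32)

rng = np.random.default_rng(0)
# Uniform samples in the unit cube centered at the origin, as in Fig. 2.
points = rng.uniform(-0.5, 0.5, size=(100_000, 3))
labels = occupancy(points)  # roughly 0.6^3 = 21.6% of samples fall inside the box
```

These (point, label) pairs are exactly the supervision the occupancy classifier is trained on.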
Figure 3. Diagram of our envisioned design workflow. The project starts from a known project location. Then, the designer sketches multiple design options, and Vitruvio converts them to 3D printable meshes. We execute the training process with two different datasets: one preserves the initial orientation, and in the other the building orientation θ is stored and the buildings are aligned to a canonical pose.
Third, φ_il constitutes the l-th contextual, surrounding factor; in our case, we considered the orientation and location of the i-th building.
The buildings are represented as independent 3D shapes by the occupancy network f_θ(x), where f_θ : R^3 × R^2 → [0, 1]. After properly encoding the sketch and the orientation into a finite latent variable z (an encoder for φ, y), the probability of the building shape x given the network parameters θ can be written as p(x; θ) = Σ_z p(x, z; θ). We can derive the marginal and conditional distributions from this joint probability approximation to better serve downstream inference tasks such as sketch modeling and shape generation [18].
This framework allows the generation of new samples
(Predictive Posterior) that mimic the training distribution
(Likelihood). Thus, new building shapes are generated
without developing specific generative algorithms, with
constraints and parameters, but by simply sampling from
the learned distribution.
This sampling procedure is conditioned on a sketch (similar to conditional inference processes such as Pumarola et al. [77], Chan et al. [14], or Ramesh et al. [80]).
By the chain rule, as in VAE [7,28,33,35,52,82,91], p(x) is conditioned on the latent vector z (representing the sketches [95]):

p(y, φ, x; θ) = p(z, x; θ) = p(z | x; θ) p(x; θ) = p(x | z; θ) p(z)

Since the posterior p(z | x; θ) (represented with an encoder) is often intractable, VAE uses a deep network to represent p(x | z; θ) as the decoder.
In OccNet [61] and IM-NET [17,116], where p(x | z; θ) p(z) = p_θ(x, z), the decoder p(x | z) overlooks additional factors, such as structure [63], physics [62], or contextual variables, that influence the latent variable z (embedding). This could be the cause of their poor generalization. The decoder produces the 3D shape x conditioned on the embedding z (a probabilistic latent variable model), which encodes the sketch and other information.
The inputs of OccNet's encoder are the points p_ij and the occupancy values o_ij, as in Fig. 2. This encoder predicts the mean and the standard deviation of a Gaussian (posterior) distribution q_ψ(z | (p_ij, o_ij)_{j=1:K}) on z ∈ R^L, with L representing the dimension of the embedding and z the conditioning on the sketch. In this way, it is assumed that p(z) has a simple Gaussian N(µ, σ²) prior distribution over the features.
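Sampling from such a Gaussian posterior is typically done with the reparameterization trick, so that the random draw remains differentiable with respect to the predicted mean and standard deviation. A minimal NumPy sketch (the function name is ours, not from the paper's code):

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients can flow
    through mu and sigma during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
# With mu = 0 and log_sigma = 0, the samples follow the standard-normal prior.
mu, log_sigma = np.zeros((50_000, 8)), np.zeros((50_000, 8))
z = reparameterize(mu, log_sigma, rng)
```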
Our goal, as in OccNet [61], is to optimize a variational Evidence Lower Bound (ELBO) on the Negative Log-Likelihood (NLL) of the generative model p((o_ij)_{j=1:K} | (p_ij)_{j=1:K}) (Section 8.1), allowing a joint training of the encoder and decoder networks:

L_gen(θ, ψ) = Σ_{j=1}^{K} L(f_θ(p_ij, z_i), o_ij)   [decoder]
            + D_KL(q_ψ(z | (p_ij, o_ij)_{j=1:K}) || p_0(z))   [encoder]

where D_KL denotes the KL-divergence between the approximate posterior q_ψ and p_0(z). Here p_0(z), the marginal of p(x, z; θ), is the prior distribution on the latent variable z_i (typically Gaussian, with the reparameterization trick to ensure the differentiability of the sampling), and z_i is sampled according to q_ψ(z_i | (p_ij, o_ij)_{j=1:K}).
log P(x | c) − D_KL[Q(z | x, c) || P(z | x, c)] = E[log P(x | z, c)]   [decoder]   − D_KL[Q(z | x, c) || P(z | c)]   [encoder]
Here we maximize the conditional log-likelihood, notic-
ing that the goal is not only to model the complex distribu-
tion of the shapes for buildings but also to make a discrim-
inative prediction based on the input sketch and contextual
information. Specifically, different buildings could be gen-
erated from the same image based on their different con-
textual information, such as location and orientation. For
example, a sketch of a cube could generate different 3D
models based on the weather conditions of the building site.
In a warm location, the cube could have an atrium [103] to
guarantee more sunlight, air circulation, and shading [24].
In a cold location, to better preserve the heat, the atrium is
not recommended. While the rest of the method follows the exact implementation of [61], with the same training procedure, losses, and inference, our experiments are designed to validate our initial hypothesis and to show the potential of OccNet for applications related to building design.
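The two-term training objective above, a per-point binary cross-entropy reconstruction loss plus the closed-form KL divergence between a diagonal Gaussian posterior and a standard-normal prior, can be sketched numerically as follows. This is an illustrative NumPy re-statement under those assumptions, not the released implementation:

```python
import numpy as np

def reconstruction_loss(logits, occ):
    """Sum over the K sampled points of the per-point classification
    loss L(f_theta(p_ij, z_i), o_ij)."""
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid of the decoder outputs
    bce = -(occ * np.log(p + 1e-9) + (1 - occ) * np.log(1 - p + 1e-9))
    return bce.sum(axis=-1)

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * (np.exp(2 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma).sum(axis=-1)

def generative_loss(logits, occ, mu, log_sigma):
    """L_gen averaged over the batch: reconstruction term plus KL term."""
    return (reconstruction_loss(logits, occ) + kl_to_standard_normal(mu, log_sigma)).mean()
```

When the posterior matches the prior (mu = 0, log_sigma = 0) the KL term vanishes, and confident, correct logits drive the reconstruction term toward zero.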
4. Dataset
As mentioned in Section 3, a data-driven learning-based
method is employed in this research. Hence, it requires
a dataset to be trained on to approximate a joint distribu-
tion of a training corpus D composed of {x, y, φ}. Initially, Federova et al.'s synthetic dataset [92], BuildingNet [88], Random3DCity [5], 3DCityDB (Berlin) [120], and RealCity3D [22] were analyzed. BuildingNet's dataset and Federova et al.'s [92] provide buildings with proper segmentation, but unfortunately they lack contextual information, are too detailed for conceptual design phases, and misrepresent existing buildings. RealCity3D¹, instead, does not have sketch representations. To validate our method, we built a custom dataset of building masses with respective sketches and contextual information.
¹ NY website https://www.nyc.gov/site/planning/data-maps/open-data/dwn-nyc-3d-model-download.page and AI4CE https://github.com/ai4ce/RealCity3D.
Therefore, from RealCity3D, we extracted the .obj files of 46k buildings belonging to the municipality of New York. From these 46k (45,847) buildings, we selected, based on their file size, a subset of one thousand shapes from Manhattan (we call this dataset Manhattan1k; in the GitHub repository we release the filenames of the buildings adopted). We divided the buildings into three main file-size classes (correlated with level-of-detail 'LOD' [6,54] categories: the more detail a shape has, the larger the file needed to store that information) to reduce this variation and minimize model variance within each class. For simplicity, we split the dataset by file size: small (333), medium (334), and large (333), as in Fig. 4, with the data separation provided in the repository. Moreover, we randomly composed the training/validation/test sets as 700/100/200. The training set, with 700 buildings, used 16,800 synthetic sketches (24 sketches for each 3D building shape).
Figure 4. Dataset division of the 1k models based on file size. We used 333 small (<12 KB), 333 medium (12-300 KB), and 334 large (>300 KB) models.
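The size-based bucketing and the random 700/100/200 split described above can be sketched as follows (the thresholds are read from Fig. 4, and the function names are ours, not from the released repository):

```python
import random

def split_by_size(sizes_kb):
    """Bucket building files into small/medium/large by file size in KB,
    mirroring the Manhattan1k division (thresholds taken from Fig. 4)."""
    buckets = {"small": [], "medium": [], "large": []}
    for name, kb in sizes_kb.items():
        if kb < 12:
            buckets["small"].append(name)
        elif kb <= 300:
            buckets["medium"].append(name)
        else:
            buckets["large"].append(name)
    return buckets

def train_val_test(names, seed=0):
    """Random 700/100/200 split over the 1k selected shapes."""
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    return shuffled[:700], shuffled[700:800], shuffled[800:1000]
```

Fixing the shuffle seed keeps the split reproducible, which matters when the exact filename partition is also released alongside the code.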
Adapting OccNet's approach [61], we defined these building shapes as implicit representations [116]. The points are sampled uniformly from the bounding box of the mesh, as depicted in Fig. 2.
A function represented by a neural network constructs the iso-surface, determining whether a point is inside or outside the building mesh. This association can be performed only with a watertight mesh (e.g., for measuring IoU). Following OccNet's approach, we implemented the code provided by [93], which performs TSDF fusion on random depth renderings of the object to create watertight versions of the meshes. We centered and re-scaled all meshes for the voxelizations from 3D-R2N2 [19]. The 3D bounding box of each mesh is centered at 0, and its longest edge has a length of 1. To find the iso-surface, we sampled 100k points in the unit cube.
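The centering and re-scaling step can be sketched as follows (a hypothetical helper operating on a vertex array, not the paper's code):

```python
import numpy as np

def normalize_to_unit_cube(vertices):
    """Center the mesh's 3D bounding box at the origin and scale so that
    its longest bounding-box edge has length 1, as done before sampling."""
    v = np.asarray(vertices, dtype=np.float64)
    lo, hi = v.min(axis=0), v.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()  # longest bounding-box edge
    return (v - center) / scale

rng = np.random.default_rng(0)
verts = rng.uniform(-5.0, 9.0, size=(2_000, 3))  # arbitrary un-normalized vertices
norm = normalize_to_unit_cube(verts)
```

After this normalization, the uniform sampling of 100k query points in the unit cube covers every mesh with the same point density regardless of the building's original scale.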