Vitruvio: 3D Building Meshes via Single Perspective Sketches
Alberto Tono
Stanford University & Computational Design Institute
atono@stanford.edu & alberto.tono@cd.institute
Heyaojing Huang
Stanford University
hhyj4495@stanford.edu
Ashwin Agrawal
Stanford University
ashwin15@stanford.edu
Martin Fischer
Stanford University
fischer@stanford.edu
Figure 1. Vitruvio converts a single perspective sketch into a 3D watertight mesh in Universal Scene Description (USD) format. Vitruvio's final output is a 3D-printable model. In the figure above, the model was printed on a DREMEL 3D45 in PLA material from a .gcode shape file. The user envisions a 3D building mass (the ground-truth building on the left, representing the "D2X" building in the dataset) and sketches it on an iPad in a single-line style, centered on a square canvas. This figure illustrates some of the method's assumptions and limitations: the model has been trained only on a single-line synthetic sketch style, and the output mesh lacks accurate dimensions and proportions, as presented in Section 7.
Abstract
At the beginning of a project, architects convey design
ideas via quick 2D diagrams, front views, floor plans, and
sketches. Consequently, many stakeholders have difficulty
visualizing the 3D representation of the building mass, lead-
ing to varied interpretations and inhibiting a shared under-
standing of the design. To alleviate the challenge, this pa-
per proposes a deep learning-based method, Vitruvio, for
creating a 3D model from a single perspective sketch. This
allows designers to automatically generate 3D representa-
tions in real-time based on their initial sketches and thus
communicate effectively and intuitively to the client. Vitru-
vio adapts Occupancy Network to perform single view re-
construction (SVR), a technique for creating 3D represen-
tations from a single image. Vitruvio achieves: (1) an 18% increase in reconstruction accuracy and (2) a 26% reduction in inference time compared to Occupancy Network on 1k buildings provided by the New York municipality. This research investigates the effect of building orientation on reconstruction quality, discovering that Vitruvio can capture fine-grained details in complex buildings when their native orientation is preserved during training, as opposed to the SVR standard practice of aligning every building to its canonical pose. Finally, we release the code.
arXiv:2210.13634v2 [cs.CV] 11 Apr 2023
1. Introduction
The design process in Architecture, Engineering, and
Construction (AEC) industry involves many stakeholders,
including professionals such as engineers, architects, plan-
ners, and non-specialists such as clients, citizens, and users.
Each stakeholder contributes to all the design aspects, which
Vitruvius called 'Firmitas, Utilitas, Venustas', translating to
solidity, usefulness, and beauty. Early in the design
process, all parties must reach a shared understanding of
these Vitruvian values to avoid any misrepresentation later
on [10,123]. A critical factor in establishing a shared un-
derstanding is the ability to convey the information quickly,
using a medium all stakeholders can understand [72].
However, during the initial meetings, design ideas are
shared with mediums such as 2D diagrams [16], front
views, floorplans [48,67,68,89], and sketches on papers
[101]. These mediums often represent the design informa-
tion in a few lines, leading to partial and incomplete repre-
sentations of the overall mass. As a result, many stakehold-
ers need help to visualize the actual 3D representation of
the building, resulting in varied interpretations, which ulti-
mately inhibits a shared understanding of design. [73] notes
that the inability of the stakeholders to interpret 2D designs
leads to reductions in productivity, reworks, wastage, and
cost overruns. [49] points out that this mode of design prac-
tices leads to difficulties in the communication of designs
since these representations lack of the 3D information ( such
as proportion, volume, overall mass, and others) needed
during later phases.
To alleviate this challenge, this research aims to gen-
erate 3D geometries from sketches, grounding its theory
on Sketch Modeling, an active area of research since the
90’s [45,118]. Sketch modeling has two major approaches:
Learning-based methods and Non-learning-based methods.
The Non-learning based methods require specific and de-
fined inputs to construct 3D geometries. As a result, this
method operates with fixed viewpoints and specific sketch-
ing styles, thus reducing the designer’s flexibility. There-
fore learning-based methods have been employed to resolve
these issues, allowing for more flexibility, as detailed later
in Section 3. Learning-based methods, also called data-
driven, generate a 3D shape from a partial sketch by learn-
ing a joint probabilistic distribution of a shape and sketches
from the dataset.
Currently, these techniques have only focused on decon-
textualized shapes such as furniture and mechanical parts,
where positions and orientations do not directly affect their
representation. However, this is different for buildings: their design is affected by the building's location and orientation.
Therefore, this research provides a step forward in this
direction to consider location and orientation within the re-
construction process for deep learning models. Indeed, Vit-
ruvio is a flexible, and contextual method that reconstructs
a 3D representation of buildings from a single perspective
sketch. It provides the flexibility to generate a building mass
from a partial sketch drawn from any perspective viewpoint.
To accomplish this task, we build our own dataset (Section 4), dubbed Manhattan1k. Manhattan1k preserves the contextual information of buildings, specifically their locations and orientations.
To summarize, the contribution is threefold:
1. We explain the use of a learning-based method for
sketch-to-3D applications where the final 3D building
shapes depend on a single perspective sketch.
2. We develop Vitruvio by adapting Occupancy Network
(OccNet) [61] to our building dataset, improving
its accuracy and efficiency (Section 5.1).
3. We show qualitatively and quantitatively that the build-
ing orientation affects the reconstruction performance
of our network (Section 5.2).
After presenting the related works and their limitations
in Section 2, the remainder of this paper introduces our
methods in Sections 3 and 4 and the experiments that
validate the above-mentioned hypotheses and claims in Section 5.
Finally, Sections 6, 7, and 8 present the discussion
and limitations of our method. The code has been
released open source: https://github.com/
CDInstitute/Vitruvio
2. Related Work
This section introduces previous work in sketch model-
ing and the limitations that led to a learning-based approach.
Sketch modeling has been an active area of research since the late 90s [45,118]. This modeling process
can be represented in two ways: as a series of operations or
as a Single View Reconstruction (SVR) [30,97].
The former is typically adopted by CAD software. It
requires specific and defined inputs, such as strokes, to
construct 3D geometries. Through a series of sketches
from different viewpoints [26] or a series of procedures
[40,55,56,69,70,87], these simplified CAD interfaces provide
complete control of the 3D geometry at the expense of
artistic style. Thus, the models developed have been
view- and style-dependent [126], operating with fixed viewpoints
and specific sketching styles.
The latter, SVR, leverages a more flexible approach. It
uses computer vision techniques to reconstruct 3D shapes
from a single sketch without the mandatory requirement
of a digital surface. SVR has recently gained attention
thanks to the advances in learning-based methods [37,39,47,98,106,126,127], inspired by image-based reconstruction, where the geometric output is represented in two main ways: explicitly or implicitly. An explicit representation is composed of meshes [36,42,111], point clouds [31,78,79,109], voxels [19,83,114], sets of semantically meaningful parts [75], constructive solid geometries (CSG) [55,56], Coons patch-based surfaces [90], or superquadrics [75]. The implicit ones are represented as occupancy functions [17,46,61], neural fields, or signed distance fields [4,34,57,64,71,74,81,85,96,117,119].
Previous SVR approaches focused on decontextualized
shapes [15,88,115]. Indeed, furniture [15,115] and me-
chanical parts [50,53,112] can be positioned anywhere and
do not have to be designed for an exact location. How-
ever, this is different for buildings. Their construction and
design have specific constraints and regulations that vary
based on location. In a specific neighborhood, the buildings
share similar limitations, features, and characteristics. In
the initial design phases, the project location is known and
is used in common data-driven approaches [11] with Geographical
Information Systems (GIS) [3,12] to inform the
design process, which is thereby contextualized. In fact, the location
and orientation highly impact the design. Therefore, the 3D
building mass generation from a sketch should account for
those factors, and they should be present in the dataset. For
example, considering a specific building in New York, the
wind impacts its design: a five-degree rotation affects its
energy and structural performance.
Due to these desiderata, we adopt a deep learning model. It captures the underlying correlation between the 2D sketch and the 3D building shape from a Bayesian perspective, without the need to follow a deterministic process and with the ability to encode additional information such as building location and orientation. Hence, the dataset is the key to these
data-centric learning-based approaches. Unfortunately, re-
cent single view sketch-based methods focused on datasets
like ShapeNet [15], and only two targeted the reconstruction
of buildings [70] and [26] from perspective sketches. These
examples either targeted the content generation communi-
ties (gaming and mapping scenarios) [70], or did not allow
a 3D reconstruction based on a single perspective sketch
[26,48,72], sub-optimal for AEC’s workflows. While pre-
vious generative design approaches required an explicit for-
mulation of constraints and parameters to generate new so-
lutions, our method can synthesize and learn the generative
process from existing buildings [44]. For this reason, we
develop Vitruvio. Vitruvio is a deep generative model: a
learning-based approach approximating a specific dataset’s
probability distribution. Our dataset comprises existing 3D
building shapes, sketches, and contextual information (ori-
entation).
3. Method
In this research, we adopt a learning-based [100] method
that better aligns with our desiderata, as previously de-
scribed. Deep generative models (Variational Auto-Encoders (VAE) [33,52,82,125], Auto-Encoders (AE), Generative Adversarial Networks (GAN) [16,114], Flow-based [77], Energy-based, and Score-based or Diffusion Models [60,128]) aim to approximate an unknown joint probability distribution. Our approach does not learn to directly map 2D images to 3D [19,36,108]. Instead, it estimates the joint distribution [52,61,74,77,114] of three main random variables, p(x, y, φ): the building shapes x^(i), the respective sketches y_ij, and the contextual information φ_il (building orientation and position). This process enables sampling new shapes from the learned distribution, described as p_D (where D is the dataset {x, y, φ}).
Figure 2. Sampling and visualization of the unsigned occupancy
function. Blue points correspond to internal samples and red to
external ones. In this image, points are rendered as spheres with a
0.01-unit radius. The building shape has been previously normalized
to the unit interval.
First, x^(i) represents the i-th 3D building shape in our dataset, composed of n independently sampled buildings x^(1), . . . , x^(n). For this approach, we follow [61]: each 3D point p ∈ R^3 is assigned an occupancy function that outputs 1 if that location is inside the shape and 0 otherwise. This function is represented as a classification problem: a neural network outputs a 1 or 0 value from xyz coordinates (points p_ik ∈ R^3, k = 1, . . . , K, in the i-th building shape) and a sketch.
Second, the sketches are represented as y_ij ∈ R^2 (as grayscale images), where i indexes the building and j the sketch viewpoint.
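The point sampling and labeling of Fig. 2 can be sketched in a few lines of NumPy. The paper labels points against a watertight building mesh; here a hypothetical axis-aligned box stands in for that mesh so the example stays self-contained (the function name and box extents are ours, not from the released code):

```python
import numpy as np

def occupancy(points, box_min=(-0.3, -0.3, -0.3), box_max=(0.3, 0.3, 0.3)):
    """Return 1.0 for points inside the shape, 0.0 otherwise (the labels o_ik).

    A box is a stand-in for the watertight mesh used in the paper."""
    lo, hi = np.asarray(box_min), np.asarray(box_max)
    inside = np.all((points >= lo) & (points <= hi), axis=-1)
    return inside.astype(np.float32)

rng = np.random.default_rng(0)
# Uniform samples in the unit cube centered at the origin, as in Fig. 2.
points = rng.uniform(-0.5, 0.5, size=(100_000, 3))
labels = occupancy(points)  # roughly 0.6^3 = 21.6% of samples fall inside the box
```

These (point, label) pairs are exactly the supervision the occupancy classifier is trained on.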
Figure 3. Diagram of our envisioned design workflow. The project starts from a known project location. Then, the designer sketches multiple design options, and Vitruvio converts them to 3D printable meshes. We execute the training process with two different datasets: one preserves the initial orientation, and in the other the building orientation θ is stored and the buildings are aligned to a canonical pose.
Third, φ_il constitutes the l-th contextual, surrounding factor; in our case, we considered the orientation and location of the i-th building.
The buildings are represented as independent 3D shapes by the occupancy network f_θ(x), where f_θ : R^3 × R^2 → [0, 1]. After properly encoding the sketch and the orientation into a finite latent variable z (an encoder for φ, y), the probability of the building shape x given the network parameters θ can be written as p(x; θ) = Σ_z p(x, z; θ). We can derive the marginal and conditional distributions from this joint probability approximation to better serve downstream inference tasks such as sketch modeling and shape generation [18].
This framework allows the generation of new samples
(Predictive Posterior) that mimic the training distribution
(Likelihood). Thus, new building shapes are generated
without developing specific generative algorithms, with
constraints and parameters, but by simply sampling from
the learned distribution.
This sampling procedure is conditioned on a sketch (similar to conditional inference processes such as Pumarola et al. [77], Chan et al. [14], or Ramesh et al. [80]).
By the chain rule, as in VAE [7,28,33,35,52,82,91], p(x) is conditioned on the latent vector z (representing the sketches [95]):

p(y, φ, x; θ) = p(z, x; θ) = p(z | x; θ) p(x; θ) = p(x | z; θ) p(z)

Since the posterior p(z | x; θ) (represented with an encoder) is often intractable, VAE uses a deep network to represent p(x | z; θ) as the decoder.
In OccNet [61] and IM-NET [17,116], where p(x | z; θ) p(z) = p_θ(x, z), the decoder p(x | z) overlooks additional factors, such as structure [63], physics [62], or contextual variables, that influence the latent variable z (embedding). This could be the cause of their poor generalization. The decoder produces the 3D shape x conditioned on the embedding z (a probabilistic latent variable model), which encodes the sketch and other information.
The inputs of OccNet's encoder are the points p_ij and the occupancy values o_ij, as in Fig. 2. This encoder predicts the mean and the standard deviation of a Gaussian (posterior) distribution q_ψ(z | (p_ij, o_ij)_{j=1:K}) on z ∈ R^L, with L representing the dimension of the embedding and z the conditioning on the sketch. In this way, it is assumed that p(z) has a simple Gaussian N(µ, σ²) prior distribution over the features.
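Sampling from such a Gaussian posterior is typically done with the reparameterization trick, so that the random draw remains differentiable with respect to the predicted mean and standard deviation. A minimal NumPy sketch (the function name is ours, not from the paper's code):

```python
import numpy as np

def reparameterize(mu, log_sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives entirely in eps, so gradients can flow
    through mu and sigma during training."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(log_sigma) * eps

rng = np.random.default_rng(0)
# With mu = 0 and log_sigma = 0, the samples follow the standard-normal prior.
mu, log_sigma = np.zeros((50_000, 8)), np.zeros((50_000, 8))
z = reparameterize(mu, log_sigma, rng)
```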
Our goal, as in OccNet [61], is to optimize a variational Evidence Lower Bound (ELBO) on the Negative Log-Likelihood (NLL) of the generative model p((o_ij)_{j=1:K} | (p_ij)_{j=1:K}) (Section 8.1), allowing a joint training of the encoder and decoder networks:

L_gen(θ, ψ) = Σ_{j=1}^{K} L(f_θ(p_ij, z_i), o_ij)   [decoder]
            + D_KL(q_ψ(z | (p_ij, o_ij)_{j=1:K}) || p_0(z))   [encoder]

where D_KL denotes the KL-divergence between the approximate posterior q_ψ and p_0(z). Here p_0(z), the marginal of p(x, z; θ), is the prior distribution on the latent variable z_i (typically Gaussian, with the reparameterization trick to ensure the differentiability of the sampling), and z_i is sampled according to q_ψ(z_i | (p_ij, o_ij)_{j=1:K}).
log P(x | c) − D_KL[Q(z | x, c) || P(z | x, c)] = E[log P(x | z, c)]   [decoder]   − D_KL[Q(z | x, c) || P(z | c)]   [encoder]
Here we maximize the conditional log-likelihood, notic-
ing that the goal is not only to model the complex distribu-
tion of the shapes for buildings but also to make a discrim-
inative prediction based on the input sketch and contextual
information. Specifically, different buildings could be gen-
erated from the same image based on their different con-
textual information, such as location and orientation. For
example, a sketch of a cube could generate different 3D
models based on the weather conditions of the building site.
In a warm location, the cube could have an atrium [103] to
guarantee more sunlight, air circulation, and shading [24].
In a cold location, to better preserve the heat, the atrium is
not recommended. While the rest of the method follows the exact implementation of [61], with the same training procedure, losses, and inference, our experiments are designed to validate our initial hypothesis and to show the potential of OccNet for applications related to building design.
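The two-term training objective above, a per-point binary cross-entropy reconstruction loss plus the closed-form KL divergence between a diagonal Gaussian posterior and a standard-normal prior, can be sketched numerically as follows. This is an illustrative NumPy re-statement under those assumptions, not the released implementation:

```python
import numpy as np

def reconstruction_loss(logits, occ):
    """Sum over the K sampled points of the per-point classification
    loss L(f_theta(p_ij, z_i), o_ij)."""
    p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid of the decoder outputs
    bce = -(occ * np.log(p + 1e-9) + (1 - occ) * np.log(1 - p + 1e-9))
    return bce.sum(axis=-1)

def kl_to_standard_normal(mu, log_sigma):
    """Closed-form D_KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian."""
    return 0.5 * (np.exp(2 * log_sigma) + mu**2 - 1.0 - 2.0 * log_sigma).sum(axis=-1)

def generative_loss(logits, occ, mu, log_sigma):
    """L_gen averaged over the batch: reconstruction term plus KL term."""
    return (reconstruction_loss(logits, occ) + kl_to_standard_normal(mu, log_sigma)).mean()
```

When the posterior matches the prior (mu = 0, log_sigma = 0) the KL term vanishes, and confident, correct logits drive the reconstruction term toward zero.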
4. Dataset
As mentioned in Section 3, a data-driven learning-based
method is employed in this research. Hence, it requires
a dataset to be trained on to approximate a joint distribu-
tion of a training corpus D composed of {x, y, φ}. Initially, Federova et al.'s synthetic dataset [92], BuildingNet [88], Random3DCity [5], 3DCityDB (Berlin) [120], and RealCity3D [22] were analyzed. BuildingNet's dataset and Federova et al.'s [92] provide buildings with proper segmentation, but unfortunately they lack contextual information, are too detailed for conceptual design phases, and misrepresent existing buildings. RealCity3D¹, instead, does not have sketch representations. To validate our method, we built a custom dataset of building masses with respective sketches and contextual information.
¹ NY website https://www.nyc.gov/site/planning/data-maps/open-data/dwn-nyc-3d-model-download.page and AI4CE https://github.com/ai4ce/RealCity3D.
Therefore, from RealCity3D, we extracted the .obj files of 46k buildings belonging to the municipality of New York. From these 46k (45,847) buildings, we selected, based on their file size, a subset of one thousand shapes from Manhattan (we call this dataset Manhattan1k; in the GitHub repository we release the filenames of the buildings adopted). We divided the buildings into three main file-size classes (correlated with level-of-detail 'LOD' [6,54] categories: the more detail a shape has, the larger the file needed to store that information) to reduce this variation and minimize model variance within each class. For simplicity, we split the dataset by file size: small (333), medium (334), and large (333), as in Fig. 4, with the data separation provided in the repository. Moreover, we randomly composed the training/validation/test sets as 700/100/200. The training set, with 700 buildings, used 16,800 synthetic sketches (24 sketches for each 3D building shape).
Figure 4. Dataset division of the 1k models based on file size. We used 333 small (<12 KB), 333 medium (12-300 KB), and 334 large (>300 KB) models.
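The size-based bucketing and the random 700/100/200 split described above can be sketched as follows (the thresholds are read from Fig. 4, and the function names are ours, not from the released repository):

```python
import random

def split_by_size(sizes_kb):
    """Bucket building files into small/medium/large by file size in KB,
    mirroring the Manhattan1k division (thresholds taken from Fig. 4)."""
    buckets = {"small": [], "medium": [], "large": []}
    for name, kb in sizes_kb.items():
        if kb < 12:
            buckets["small"].append(name)
        elif kb <= 300:
            buckets["medium"].append(name)
        else:
            buckets["large"].append(name)
    return buckets

def train_val_test(names, seed=0):
    """Random 700/100/200 split over the 1k selected shapes."""
    rng = random.Random(seed)
    shuffled = list(names)
    rng.shuffle(shuffled)
    return shuffled[:700], shuffled[700:800], shuffled[800:1000]
```

Fixing the shuffle seed keeps the split reproducible, which matters when the exact filename partition is also released alongside the code.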
Adapting OccNet's approach [61], we defined these building shapes as implicit representations [116]. The points are sampled uniformly from the bounding box of the mesh, as depicted in Fig. 2.
A function represented by a neural network constructs the iso-surface, determining whether a point is inside or outside the building mesh. This association can be performed only with a watertight mesh (e.g., for measuring IoU). Following OccNet's approach, we implemented the code provided by [93], which performs TSDF fusion on random depth renderings of the object to create watertight versions of the meshes. We centered and re-scaled all meshes for the voxelizations from 3D-R2N2 [19]. The 3D bounding box of each mesh is centered at 0, and its longest edge has a length of 1. To find the iso-surface, we sampled 100k points in the unit cube.
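The centering and re-scaling step can be sketched as follows (a hypothetical helper operating on a vertex array, not the paper's code):

```python
import numpy as np

def normalize_to_unit_cube(vertices):
    """Center the mesh's 3D bounding box at the origin and scale so that
    its longest bounding-box edge has length 1, as done before sampling."""
    v = np.asarray(vertices, dtype=np.float64)
    lo, hi = v.min(axis=0), v.max(axis=0)
    center = (lo + hi) / 2.0
    scale = (hi - lo).max()  # longest bounding-box edge
    return (v - center) / scale

rng = np.random.default_rng(0)
verts = rng.uniform(-5.0, 9.0, size=(2_000, 3))  # arbitrary un-normalized vertices
norm = normalize_to_unit_cube(verts)
```

After this normalization, the uniform sampling of 100k query points in the unit cube covers every mesh with the same point density regardless of the building's original scale.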