Habitat-Matterport 3D Semantics Dataset
Karmesh Yadav1*, Ram Ramrakhya2*, Santhosh Kumar Ramakrishnan3*,
Theo Gervet6, John Turner1, Aaron Gokaslan4, Noah Maestre1, Angel Xuan Chang5,
Dhruv Batra1,2, Manolis Savva5, Alexander William Clegg1†, Devendra Singh Chaplot1†
1Meta AI 2Georgia Tech 3UT Austin
4Cornell University 5Simon Fraser University 6Carnegie Mellon University
Abstract
We present the Habitat-Matterport 3D Semantics
(HM3DSEM) dataset. HM3DSEM is the largest dataset
of 3D real-world spaces with densely annotated seman-
tics that is currently available to the academic commu-
nity. It consists of 142,646 object instance annotations
across 216 3D spaces and 3,100 rooms within those spaces.
The scale, quality, and diversity of object annotations far
exceed those of prior datasets. A key difference setting
HM3DSEM apart from other datasets is the use of texture
information to annotate pixel-accurate object boundaries.
aries. We demonstrate the effectiveness of HM3DSEM
dataset for the Object Goal Navigation task using differ-
ent methods. Policies trained using HM3DSEM perform
outperform those trained on prior datasets. Introduction
of HM3DSEM in the Habitat ObjectNav Challenge lead
to an increase in participation from 400 submissions in
2021 to 1022 submissions in 2022. Project page:
https:
//aihabitat.org/datasets/hm3d-semantics/
1. Introduction
In recent years, work on acquiring and semantically
annotating datasets of real-world spaces has significantly
accelerated research into embodied AI agents that can per-
ceive, navigate, and interact with realistic indoor scenes [1–5].
However, the acquisition of such datasets at scale is a labori-
ous process. HM3D [5], which is one of the largest available
datasets with 1000 high-quality and complete indoor space
reconstructions, reportedly required 800+ hours of human
effort, mainly for data curation and verification of
3D reconstructions. Moreover, dense semantic annotation of
such acquired spaces remains incredibly challenging.
We present the Habitat-Matterport 3D Semantics Dataset
(HM3DSEM). This dataset provides a dense semantic
annotation ‘layer’ augmenting the spaces from the original
*Equal Contribution, Correspondence: ykarmesh@gmail.com
†Equal Contribution
HM3D dataset. This semantic ‘layer’ is implemented as a
set of textures that encode object instance semantics and
cluster objects into distinct rooms. The semantics include
architectural elements (walls, floors, ceilings), large objects
(furniture, appliances etc.), as well as ‘stuff’ categories (ag-
gregations of smaller items such as books on bookcases).
This semantic instance information is specified in the seman-
tic texture layer, providing pixel-accurate correspondences
to the original acquired RGB surface texture and underlying
geometry of the objects.
The HM3DSEM dataset currently contains annotations
for 142,646 object instances distributed across 216 spaces
and 3,100 rooms within those spaces. Figure 1 shows some
examples of the semantic annotations from the HM3DSEM
dataset. The achieved scale is larger than prior work (2.8x
relative to Matterport3D [6] (MP3D) and 2.1x relative to
ARKitScenes [7] in terms of total number of object instances).
We demonstrate the usefulness of HM3DSEM on the Ob-
jectGoal navigation task. Training on HM3DSEM results
in higher cross-dataset generalization performance. Surpris-
ingly, the policies trained on HM3DSEM perform better on
average across scene datasets compared to training on the
datasets themselves. We also show that increasing the size of
training datasets improves navigation performance. These
results highlight the importance of improving the quality
and scale of 3D datasets with dense semantic annotations for
improving downstream embodied AI task performance.
2. Related Work
3D reconstruction datasets with semantics. There is a
relatively small number of prior works that focus on seman-
tically annotated 3D interior spaces acquired from the real
world. Collecting, reconstructing, and annotating such data
at scale is a significant effort that requires complex pipelines
and annotation tools. Earlier work has therefore focused
on scenes at the scale of single rooms. For example,
ScanNet [8] provided 707 typically room-scale reconstructions
annotated with object semantic instances through labeling
of 3D mesh segments constructed using an unsupervised
segmentation algorithm. Followup work by Wald et al. [9]
adopted a similar approach and also targeted room-sized
scenes. Most recently, ARKitScenes [7] contributed scans
of 1661 room-scale scenes but only provides bounding box
annotations for object instances.

arXiv:2210.05633v3 [cs.CV] 12 Oct 2023

Figure 1. Habitat-Matterport 3D Semantics (HM3DSEM) provides the largest dataset of real-world spaces with densely annotated semantics.
High-fidelity textured 3D mesh reconstructions are labeled with precise instance-level object semantics, indicated by distinct colors.
Prominent prior works on building-scale datasets with
semantic annotation are Matterport3D [6], a subset of
Gibson by Armeni et al. [10], and the Replica [11] dataset. The
first uses the same methodology as ScanNet (labeling of 3D
mesh segments), while the second provides human-verified
object instance annotations created by back-projecting 2D se-
mantic segmentation masks. The third provides high-quality
mesh vertex-level object instance labels but only contains
18 scenes. Building on top of HM3D, which consists of
over 1,000 diverse environments from around the world,
HM3DSEM provides detailed texture-level semantic annota-
tions for building-scale reconstructions.
Synthetic 3D scene datasets. The use of synthetic 3D
datasets for embodied AI simulation is quite common, espe-
cially when interactive environments are desired [4, 12–14].
Due to the difficulty of modeling high-fidelity synthetic envi-
ronments at scale, most existing datasets are limited in size
and typically represent room-scale scenes. Some of the prior
work in this space has adopted a ‘teleportation’ mechanism
that allows an agent to immediately move from room to
room through closed doors [13]. A few datasets contributed
by prior work focus on larger-scale scenes that coherently
represent entire residences with multiple rooms [4, 15, 16].
These datasets have a number of limitations. First, due to
the difficulty in modeling a broad diversity of objects and
scene layouts containing them, there is fairly limited varia-
tion in both object appearance and the spatial arrangements
of the objects in the scenes. Moreover, the objects exhibit
modeling biases that create a simulation-to-reality gap, and
the re-use of the same object models across scenes produces
the unrealistic effect of “perfect copies” of particular objects.
These limitations have inspired work that attempts to tackle
sim-to-real discrepancy by creating synthetic datasets that
conform to scenes from the real world in terms of object
appearance and spatial arrangement [4, 17–19]. However,
this approach is hard to scale, and modeling biases due to
the use of synthetic 3D data content creation software still
remain. In contrast, we focus on scaling high-quality seman-
tic annotations of real scenes acquired from a diverse set of
spaces in the real world.
3. Dataset Details
The Habitat-Matterport 3D Semantics Dataset is the
largest human-annotated dataset of semantically annotated
3D indoor spaces. It contains dense semantic annotations
for 216 high-resolution, 3D, scanned scenes from the
Habitat-Matterport 3D Dataset (HM3D). The HM3D scenes
are annotated with 142,646 raw object names, additionally
mapped to the 40 Matterport3D categories [6]. On average,
each scene consists of 661 objects from 106 categories. This
dataset is the result of over 14,200 hours of human effort for
annotation and verification by 20+ annotators. The follow-
ing subsections provide further details on asset formats, the
annotation pipeline, and scene content statistics.
3.1. Data Format and Contents
The semantic annotations are available as a set of texture
images applied to the original scene geometry from HM3D
and packed into binary glTF (.glb) format. Unique hex colors
differentiate each object instance and map it to a raw text
string classifying the instance. These mappings are included
in a metadata text file accompanying the .glb asset, which
additionally labels each instance with a region ID to define
object grouping by room.
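As an illustration, the color-to-instance mapping described above can be consumed with a few lines of code. This is a minimal sketch assuming a simplified line format of `index,color,"category",region_id`; the released metadata files may contain headers or quoting conventions not handled here.

```python
from dataclasses import dataclass

@dataclass
class Instance:
    index: int       # running instance index within the scene
    hex_color: str   # unique hex color identifying the instance in the textures
    category: str    # raw text label supplied by the annotator
    region_id: int   # groups instances into rooms/regions

def parse_semantic_metadata(lines):
    """Parse metadata lines of the assumed form: index,color,"category",region_id."""
    instances = []
    for line in lines:
        line = line.strip()
        if not line or "," not in line:
            continue  # skip blank lines and any header text
        idx, color, name, region = line.split(",", 3)
        instances.append(Instance(int(idx), color.strip().upper(),
                                  name.strip().strip('"'), int(region)))
    return instances

# Hypothetical example lines.
example = ['1,B7A991,"wall",0', '2,93A9A7,"door",0', '3,3F4A9E,"bed",1']
instances = parse_semantic_metadata(example)
```

Note that `split(",", 3)` would mishandle category names containing commas; a robust parser would use the csv module instead.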
Dataset              Scenes   Rooms   Object instances   Objects/room   Annotation type
Replica [11]             18      25         2,843            114        vertex
Gibson (tiny1) [10]      35     727         2,397              3        vertex
ScanNet [8]             707     707        36,213             24        segment
3RScan [9]              478     478        43,006             29        segment
MP3D [6]                 90   2,056        50,851             25        segment
ARKitScenes [7]       1,661   5,048        67,791             13        bounding box
HM3DSEM (ours)          216   3,100       142,646             60        texture

Table 1. Comparison of HM3DSEM to other semantically annotated indoor scene datasets. Statistics are on the publicly released portions of
the corresponding datasets (does not include ScanNet or ARKitScenes hidden test sets).
1Human-verified subset of Gibson [20] with semantic annotations.
Often, semantic annotations are defined per-vertex and
directly embedded in the mesh geometry (e.g., ScanNet [8],
Gibson [3], and MP3D [6]). However, it is not uncommon
for mesh geometry discretization to insufficiently capture
boundaries between objects, especially on flat surfaces such
as walls, floors, and table-tops. This results in jagged,
inaccurate semantic boundaries or missing annotations, or requires
generating an entirely new mesh with higher resolution than
the original, which has implications on both rendering per-
formance and visual alignment. For example, Figure 4 highlights
the common misalignment errors between annotated
and original assets from the MP3D dataset resulting from
automated mesh geometry generation. In contrast, the HM3DSEM
archival format encodes annotations directly in a set of
textures compatible with the original geometry. As it is not
uncommon for 3D assets, especially those derived from
scanning pipelines, to represent object boundaries in texture
rather than geometry, this choice seemed natural. Figure 2
shows several example scenes and contrasts them against
semantic annotations from Matterport3D [6], which is the
most related prior dataset. The density and quality of seman-
tic instance annotations in HM3DSEM exceed those of prior
work as shown in Table 1. For additional compatibility with
existing simulators, the semantic texture annotations are also
baked into per-vertex colors included with the assets.
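Because each instance owns a unique hex color in the semantic layer, a rendered frame of that layer can be decoded back to instance IDs by direct color lookup. The sketch below illustrates the idea under the assumption that pixels carry the annotation colors exactly (i.e., the semantic texture is sampled without filtering or antialiasing); the helper and variable names are hypothetical.

```python
def pixel_to_instance(pixel, color_to_id):
    """Map one rendered (r, g, b) pixel (0-255 ints) to an instance ID.

    color_to_id maps 6-digit uppercase hex strings, taken from the scene's
    metadata file, to instance IDs. Returns None for pixels that carry no
    annotation color.
    """
    key = "{:02X}{:02X}{:02X}".format(*pixel)
    return color_to_id.get(key)

# Hypothetical two-instance color table.
color_to_id = {"B7A991": 1, "93A9A7": 2}
wall_id = pixel_to_instance((0xB7, 0xA9, 0x91), color_to_id)
```

In practice one would vectorize this lookup over a whole frame (e.g., with NumPy) rather than looping per pixel.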
Artists were instructed to annotate architectural features
such as walls, floors, ceilings, windows, stairs, and doors,
as well as notable embellishments such as door and window
frames, banisters, area rugs, and moulding. Instance annota-
tions for architectural features are broken into regions at tran-
sition points such as room boundaries, doorways, and hall-
ways to more readily classify components into regions (e.g.
to semantically separate floors and ceilings as a room transi-
tions to a hallway) as shown in Figure 4 (right). Additionally,
decorative features such as pictures, posters, switches, vents,
lighting fixtures, and wall art are segmented and labeled.
Furniture, appliances, and clutter objects were annotated
and segmented from their surroundings whenever possible.
For example, pillows and blankets are segmented individu-
ally from beds, couches, and chairs while remote controls,
electronics, lamps, and art pieces are segmented from desks,
tables, and consoles. In many cases, where scan resolution
permits, individual clothing items, linens, and books are
segmented from one another in closets and bookshelves.
3.2. Verification Process
Annotation at the scale of HM3D Semantics is an iterative
process. Roughly 640 annotator hours were allocated
to iteration and error correction (about 4.5% of all annotator
hours). Additional verification was done by the authors,
including both qualitative manual assessment and automated
programmatic checks. Even so, some errors may yet remain.
Fortunately, the archival format of texture + text allows for
efficient iterative improvement of the annotations.
Automated verification is essential for large scale anno-
tation efforts. Our automated verification pipeline included,
among others, the following checks:
- Text file annotations contain only colors from the textures.
- Each annotation color is used only once per scene.
- Text file contents conform to the expected format: index, color, category name, region ID.
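The three checks above are straightforward to automate. Below is a minimal sketch, assuming metadata rows have already been parsed into (index, color, category, region_id) tuples and that the set of hex colors actually present in the scene's semantic textures has been extracted; the function name and error strings are illustrative, not the authors' actual pipeline.

```python
from collections import Counter

def verify_scene(rows, texture_colors):
    """Run the three automated checks on one scene; return a list of error strings."""
    errors = []
    colors = [color for _, color, _, _ in rows]
    # Check 1: text-file annotations contain only colors found in the textures.
    for c in colors:
        if c not in texture_colors:
            errors.append(f"color {c} not present in any texture")
    # Check 2: each annotation color is used only once per scene.
    for c, n in Counter(colors).items():
        if n > 1:
            errors.append(f"color {c} used {n} times")
    # Check 3: rows conform to the expected (index, color, category, region) format.
    for idx, color, category, region in rows:
        if not (isinstance(idx, int) and isinstance(region, int)
                and category and len(color) == 6):
            errors.append(f"malformed row: {(idx, color, category, region)}")
    return errors

rows = [(1, "B7A991", "wall", 0), (2, "93A9A7", "door", 0)]
clean = verify_scene(rows, {"B7A991", "93A9A7"})  # no errors for a consistent scene
```

A real pipeline would additionally cross-check region IDs against the room list and report unannotated texture colors, but the core consistency logic is of this shape.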
Qualitative verification proves challenging to automate,
and as such, manual validation by humans remains an impor-
tant part of the annotation QA pipeline. Following delivery
of the annotated assets, a manual review and iteration phase
was conducted, including the following:
- Validation pass over raw text names, including identification and correction of typos, consolidation of synonyms, and mapping of raw text names to the 40 canonical object classes from the MP3D dataset [6].
- Visual inspection through virtual walk-throughs in Habitat [4]. Verifiers checked for missing annotations, messy boundaries, annotation artifacts, over-aggregation (i.e., multiple unique instances sharing an annotation color), semantic mislabeling (e.g., “dishwasher” annotated as “washing machine”), and other common flaws.
3.3. Dataset Statistics
The 216 scenes chosen for HM3DSEM annotation were
selected at random from the 950 furnished HM3D scan
assets. These are distributed into subsets of [145, 36, 35]
scenes between the [train, val, test] splits.