
Dataset              Scenes   Rooms   Object instances   Objects/room   Annotation type
Replica [11]             18     ≈25              2,843           ≈114   vertex
Gibson (tiny) [10]       35     727              2,397             ≈3   vertex
ScanNet [8]             707    ≈707             36,213            ≈24   segment
3RScan [9]              478    ≈478             43,006            ≈29   segment
MP3D [6]                 90   2,056             50,851            ≈25   segment
ARKitScenes [7]       1,661   5,048             67,791            ≈13   bounding box
HM3DSEM (ours)          216   3,100            142,646            ≈60   texture

Table 1. Comparison of HM3DSEM to other semantically annotated indoor scene datasets. Statistics are computed on the publicly released portions of the corresponding datasets (the hidden test sets of ScanNet and ARKitScenes are not included).
Often, semantic annotations are defined per-vertex and directly embedded in the mesh geometry (e.g., ScanNet [8], Gibson [3], and MP3D [6]). However, it is not uncommon for the mesh geometry discretization to insufficiently capture boundaries between objects, especially on flat surfaces such as walls, floors, and table-tops. This results in jagged, inaccurate semantic boundaries and missing annotations, or requires generating an entirely new mesh at higher resolution than the original, which has implications for both rendering performance and visual alignment. For example, Figure 4 highlights the common misalignment errors between annotated and original assets from the MP3D dataset that result from automated mesh geometry generation. In contrast, the HM3DSEM archival format encodes annotations directly in a set of textures compatible with the original geometry. As it is not uncommon for 3D assets, especially those derived from scanning pipelines, to represent object boundaries in texture rather than geometry, this choice is a natural one. Figure 2 shows several example scenes and contrasts them against semantic annotations from Matterport3D [6], the most closely related prior dataset. The density and quality of semantic instance annotations in HM3DSEM exceed those of prior work, as shown in Table 1. For additional compatibility with existing simulators, the semantic texture annotations are also baked into per-vertex colors included with the assets.
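As a rough illustration of this baking step, the sketch below samples a semantic annotation texture at each vertex's UV coordinate and writes the result out as per-vertex colors. The file names and the use of trimesh and Pillow are assumptions for illustration, not the tooling used to produce the released assets.

```python
# Minimal sketch: bake a semantic annotation texture into per-vertex colors.
# Assumes a mesh with per-vertex UVs and a single semantic texture; paths and
# libraries (trimesh, Pillow) are illustrative, not the HM3DSEM pipeline.
import numpy as np
import trimesh
from PIL import Image

mesh = trimesh.load("scene.glb", force="mesh")              # hypothetical asset path
tex = np.asarray(Image.open("scene_semantic.png").convert("RGB"))
h, w, _ = tex.shape

uv = mesh.visual.uv                                         # (V, 2) UVs in [0, 1]
# Convert UVs to pixel indices (flip V because image rows grow downward).
px = np.clip((uv[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
py = np.clip(((1.0 - uv[:, 1]) * (h - 1)).round().astype(int), 0, h - 1)

vertex_colors = tex[py, px]                                 # nearest-neighbor sample
mesh.visual = trimesh.visual.ColorVisuals(mesh, vertex_colors=vertex_colors)
mesh.export("scene_semantic_vertex.ply")                    # vertex-colored output
```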
Artists were instructed to annotate architectural features such as walls, floors, ceilings, windows, stairs, and doors, as well as notable embellishments such as door and window frames, banisters, area rugs, and moulding. Instance annotations for architectural features are broken into regions at transition points such as room boundaries, doorways, and hallways to more readily classify components into regions (e.g., to semantically separate floors and ceilings as a room transitions into a hallway), as shown in Figure 4 (right). Additionally, decorative features such as pictures, posters, switches, vents, lighting fixtures, and wall art are segmented and labeled. Furniture, appliances, and clutter objects were annotated and segmented from their surroundings whenever possible. For example, pillows and blankets are segmented individually from beds, couches, and chairs, while remote controls, electronics, lamps, and art pieces are segmented from desks, tables, and consoles. In many cases, where scan resolution permits, individual clothing items, linens, and books are segmented from one another in closets and on bookshelves.
3.2. Verification Process
Annotation at the scale of HM3D Semantics is not a one-shot process. Roughly 640 annotator hours were allocated
to iteration and error correction (about 4.5% of all annotator
hours). Additional verification was done by the authors,
including both qualitative manual assessment and automated
programmatic checks. Even so, some errors may yet remain.
Fortunately, the archival format of texture + text allows for
efficient iterative improvement of the annotations.
Automated verification is essential for large-scale annotation efforts. Our automated verification pipeline included, among others, the following checks (a minimal sketch of these checks is given after the list):
• Text file annotations contain only colors that appear in the textures.
• Each annotation color is used only once per scene.
• Text file contents conform to the expected format: index, color, category name, region id.
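The sketch below illustrates checks of this kind, assuming a comma-separated annotation file with one index, hex color, quoted category name, and region id per line, plus a set of semantic textures per scene; the exact file layout and helper names are assumptions, not the released verification code.

```python
# Minimal sketch of the automated checks above. The annotation-file layout and
# all paths are assumptions for illustration, not the HM3DSEM verification tool.
import re
from collections import Counter

import numpy as np
from PIL import Image

LINE_RE = re.compile(r'^(\d+),([0-9A-Fa-f]{6}),"([^"]+)",(\d+)$')

def verify_scene(annotation_txt, texture_paths):
    errors, rows = [], []

    # Format check: every line parses as index, color, category name, region id.
    with open(annotation_txt) as f:
        for lineno, line in enumerate(f, start=1):
            m = LINE_RE.match(line.strip())
            if m is None:
                errors.append(f"line {lineno}: unexpected format {line.strip()!r}")
            else:
                rows.append(m.groups())

    # Uniqueness check: each annotation color is used only once per scene.
    counts = Counter(color.upper() for _, color, _, _ in rows)
    errors += [f"color {c} assigned to {n} instances" for c, n in counts.items() if n > 1]

    # Color-subset check: text-file colors all appear in at least one texture.
    texture_colors = set()
    for path in texture_paths:
        pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
        texture_colors |= {"%02X%02X%02X" % tuple(int(v) for v in p)
                           for p in np.unique(pixels, axis=0)}
    errors += [f"color {color} not found in any texture"
               for _, color, _, _ in rows if color.upper() not in texture_colors]

    return errors
```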
Qualitative verification proves challenging to automate,
and as such, manual validation by humans remains an impor-
tant part of the annotation QA pipeline. Following delivery
of the annotated assets, a manual review and iteration phase
was conducted, including the following:
• A validation pass over the raw text names, including identification and correction of typos, consolidation of synonyms, and mapping of raw names to the 40 canonical object classes from the MP3D dataset [6] (a small sketch of this consolidation step follows the list).
• Visual inspection through virtual walk-throughs in Habitat [4]. Verifiers checked for missing annotations, messy boundaries, annotation artifacts, over-aggregation (i.e., multiple unique instances sharing an annotation color), semantic mislabeling (e.g., “dishwasher” annotated as “washing machine”), and other common flaws.
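A minimal sketch of the name-consolidation step is shown below; the synonym table, class subset, and fallback label are illustrative assumptions, not the actual mapping used for HM3DSEM.

```python
# Minimal sketch of consolidating raw annotation names into canonical classes.
# The synonym table, class subset, and "misc" fallback are illustrative only,
# not the actual HM3DSEM mapping to the 40 MP3D categories.
SYNONYMS = {
    "couch": "sofa",
    "settee": "sofa",
    "tv": "tv_monitor",
    "television": "tv_monitor",
}

# A small subset of canonical classes, for illustration.
CANONICAL_CLASSES = {"wall", "floor", "ceiling", "door", "window",
                     "sofa", "chair", "table", "tv_monitor", "misc"}

def canonicalize(raw_name: str) -> str:
    name = raw_name.strip().lower().replace(" ", "_")  # normalize casing/spacing
    name = SYNONYMS.get(name, name)                    # consolidate synonyms
    return name if name in CANONICAL_CLASSES else "misc"

assert canonicalize("Television ") == "tv_monitor"
assert canonicalize("area rug") == "misc"
```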
3.3. Dataset Statistics
The 216 candidate scenes for HM3DSEM annotation were selected at random from the 950 furnished HM3D scan assets. They are distributed into subsets of [145, 36, 35] scenes across the [train, val, test] splits. The