
Dataset              Scenes   Rooms   Object instances   Objects/room   Annotation type
Replica [11]             18     ≈25              2,843           ≈114   vertex
Gibson (tiny) [10]       35     727              2,397             ≈3   vertex
ScanNet [8]             707    ≈707             36,213            ≈24   segment
3RScan [9]              478    ≈478             43,006            ≈29   segment
MP3D [6]                 90   2,056             50,851            ≈25   segment
ARKitScenes [7]       1,661   5,048             67,791            ≈13   bounding box
HM3DSEM (ours)          216   3,100            142,646            ≈60   texture

Table 1. Comparison of HM3DSEM to other semantically annotated indoor scene datasets. Statistics are computed on the publicly released portions of the corresponding datasets (the hidden test sets of ScanNet and ARKitScenes are not included).
Often, semantic annotations are defined per-vertex and directly embedded in the mesh geometry (e.g., ScanNet [8], Gibson [3], and MP3D [6]). However, it is not uncommon for the mesh geometry discretization to insufficiently capture boundaries between objects, especially on flat surfaces such as walls, floors, and table-tops. This results in jagged, inaccurate semantic boundaries and missing annotations, or requires generating an entirely new mesh at higher resolution than the original, which has implications for both rendering performance and visual alignment. For example, Figure 4 highlights the common misalignment errors between annotated and original assets from the MP3D dataset that result from automated mesh geometry generation. In contrast, the HM3DSEM archival format encodes annotations directly in a set of textures compatible with the original geometry. As it is not uncommon for 3D assets, especially those derived from scanning pipelines, to represent object boundaries in texture rather than geometry, this choice is a natural one. Figure 2 shows several example scenes and contrasts them against semantic annotations from Matterport3D [6], the most closely related prior dataset. The density and quality of semantic instance annotations in HM3DSEM exceed those of prior work, as shown in Table 1. For additional compatibility with existing simulators, the semantic texture annotations are also baked into per-vertex colors included with the assets.
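As a rough illustration of this baking step, the sketch below samples a semantic annotation texture at each vertex's UV coordinate and writes the result out as per-vertex colors. The file names and the use of trimesh and Pillow are assumptions for illustration, not the tooling used to produce the released assets.

```python
# Minimal sketch: bake a semantic annotation texture into per-vertex colors.
# Assumes a mesh with per-vertex UVs and a single semantic texture; paths and
# libraries (trimesh, Pillow) are illustrative, not the HM3DSEM pipeline.
import numpy as np
import trimesh
from PIL import Image

mesh = trimesh.load("scene.glb", force="mesh")              # hypothetical asset path
tex = np.asarray(Image.open("scene_semantic.png").convert("RGB"))
h, w, _ = tex.shape

uv = mesh.visual.uv                                         # (V, 2) UVs in [0, 1]
# Convert UVs to pixel indices (flip V because image rows grow downward).
px = np.clip((uv[:, 0] * (w - 1)).round().astype(int), 0, w - 1)
py = np.clip(((1.0 - uv[:, 1]) * (h - 1)).round().astype(int), 0, h - 1)

vertex_colors = tex[py, px]                                 # nearest-neighbor sample
mesh.visual = trimesh.visual.ColorVisuals(mesh, vertex_colors=vertex_colors)
mesh.export("scene_semantic_vertex.ply")                    # vertex-colored output
```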
Artists were instructed to annotate architectural features such as walls, floors, ceilings, windows, stairs, and doors, as well as notable embellishments such as door and window frames, banisters, area rugs, and moulding. Instance annotations for architectural features are broken into regions at transition points such as room boundaries, doorways, and hallways to more readily classify components into regions (e.g., to semantically separate floors and ceilings as a room transitions into a hallway), as shown in Figure 4 (right). Additionally, decorative features such as pictures, posters, switches, vents, lighting fixtures, and wall art are segmented and labeled. Furniture, appliances, and clutter objects were annotated and segmented from their surroundings whenever possible. For example, pillows and blankets are segmented individually from beds, couches, and chairs, while remote controls, electronics, lamps, and art pieces are segmented from desks, tables, and consoles. In many cases, where scan resolution permits, individual clothing items, linens, and books are segmented from one another in closets and on bookshelves.
3.2. Verification Process
Annotation at the scale of HM3D Semantics is not a one-shot process. Roughly 640 annotator hours were allocated
to iteration and error correction (about 4.5% of all annotator
hours). Additional verification was done by the authors,
including both qualitative manual assessment and automated
programmatic checks. Even so, some errors may yet remain.
Fortunately, the archival format of texture + text allows for
efficient iterative improvement of the annotations.
Automated verification is essential for large-scale annotation efforts. Our automated verification pipeline included, among others, the following checks (a minimal sketch of these checks is given after the list):
• Text file annotations contain only colors that appear in the textures.
• Each annotation color is used only once per scene.
• Text file contents conform to the expected format: index, color, category name, region id.
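The sketch below illustrates checks of this kind, assuming a comma-separated annotation file with one index, hex color, quoted category name, and region id per line, plus a set of semantic textures per scene; the exact file layout and helper names are assumptions, not the released verification code.

```python
# Minimal sketch of the automated checks above. The annotation-file layout and
# all paths are assumptions for illustration, not the HM3DSEM verification tool.
import re
from collections import Counter

import numpy as np
from PIL import Image

LINE_RE = re.compile(r'^(\d+),([0-9A-Fa-f]{6}),"([^"]+)",(\d+)$')

def verify_scene(annotation_txt, texture_paths):
    errors, rows = [], []

    # Format check: every line parses as index, color, category name, region id.
    with open(annotation_txt) as f:
        for lineno, line in enumerate(f, start=1):
            m = LINE_RE.match(line.strip())
            if m is None:
                errors.append(f"line {lineno}: unexpected format {line.strip()!r}")
            else:
                rows.append(m.groups())

    # Uniqueness check: each annotation color is used only once per scene.
    counts = Counter(color.upper() for _, color, _, _ in rows)
    errors += [f"color {c} assigned to {n} instances" for c, n in counts.items() if n > 1]

    # Color-subset check: text-file colors all appear in at least one texture.
    texture_colors = set()
    for path in texture_paths:
        pixels = np.asarray(Image.open(path).convert("RGB")).reshape(-1, 3)
        texture_colors |= {"%02X%02X%02X" % tuple(int(v) for v in p)
                           for p in np.unique(pixels, axis=0)}
    errors += [f"color {color} not found in any texture"
               for _, color, _, _ in rows if color.upper() not in texture_colors]

    return errors
```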
Qualitative verification proves challenging to automate,
and as such, manual validation by humans remains an impor-
tant part of the annotation QA pipeline. Following delivery
of the annotated assets, a manual review and iteration phase
was conducted, including the following:
• A validation pass over the raw text names, including identification and correction of typos, consolidation of synonyms, and mapping of raw names to the 40 canonical object classes from the MP3D dataset [6] (a small sketch of this consolidation step follows the list).
• Visual inspection through virtual walk-throughs in Habitat [4]. Verifiers checked for missing annotations, messy boundaries, annotation artifacts, over-aggregation (i.e., multiple unique instances sharing an annotation color), semantic mislabeling (e.g., “dishwasher” annotated as “washing machine”), and other common flaws.
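A minimal sketch of the name-consolidation step is shown below; the synonym table, class subset, and fallback label are illustrative assumptions, not the actual mapping used for HM3DSEM.

```python
# Minimal sketch of consolidating raw annotation names into canonical classes.
# The synonym table, class subset, and "misc" fallback are illustrative only,
# not the actual HM3DSEM mapping to the 40 MP3D categories.
SYNONYMS = {
    "couch": "sofa",
    "settee": "sofa",
    "tv": "tv_monitor",
    "television": "tv_monitor",
}

# A small subset of canonical classes, for illustration.
CANONICAL_CLASSES = {"wall", "floor", "ceiling", "door", "window",
                     "sofa", "chair", "table", "tv_monitor", "misc"}

def canonicalize(raw_name: str) -> str:
    name = raw_name.strip().lower().replace(" ", "_")  # normalize casing/spacing
    name = SYNONYMS.get(name, name)                    # consolidate synonyms
    return name if name in CANONICAL_CLASSES else "misc"

assert canonicalize("Television ") == "tv_monitor"
assert canonicalize("area rug") == "misc"
```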
3.3. Dataset Statistics
The 216 candidate scenes for HM3DSEM annotation were selected at random from the 950 furnished HM3D scan assets. They are distributed into subsets of [145, 36, 35] scenes across the [train, val, test] splits. The