
layout estimation and registration can be trained jointly. However, a major limitation of PSMNet
(also mentioned in their paper) is that the architecture relies on an initial approximate registration.
The authors argued that such an approximate registration could either be given manually or be
computed by external methods, such as Structure from Motion (SfM) or the method of Shabani et al. (2021).
While a manual registration may work, the method would no longer be automatic. When experi-
menting with existing methods for approximate registration, we observed that they frequently make
registration errors and even fail to provide a registration in a substantial number of cases. The main
reason is that the required registration mainly falls into the category of wide baseline registration
with only two given images. For example, our results show that the state-of-the-art SfM method
OpenMVG (Moulon et al., 2016) fails to register 76% of panorama pairs from our test dataset. It is
thus impractical to assume an independent algorithm that can reliably give an approximate solution
to the challenging wide baseline registration problem. In addition, relying on such an algorithm
moves a critical part of the problem to a pre-process.
Therefore, we set out to develop a complete multi-view panorama registration and layout estimation
framework that no longer relies on an approximate registration given as input, as shown in Figure 1.
To achieve this, we propose a novel Geometry-aware Panorama Registration Network, or GPR-
Net, based on the following design ideas. First, our experiments indicate that a global (pixel-space)
registration that directly regresses pose parameters (i.e., translation and rotation) is too ambitious.
Instead, we propose to compute more fine-grained correspondences in a different space. Specifically,
GPR-Net conceptually samples the layout boundaries of two input layouts and computes features
for the sampled locations. For each boundary sample in each of the two panoramas, it estimates the
distance from the camera (depth). In addition, it estimates the correspondence map from the samples
in the first panorama to the second panorama and a covisibility map describing if a sample in the
first panorama is visible in the second panorama. Each of these maps (depth, correspondence, and
covisibility) is a 1D sequence of values.
This representation has the advantages of having more elements to register (e.g., 256 samples per
panorama) and more supervision signal for fine-grained estimation. This leads to better learning
performance. Second, we build a non-linear registration module to compute the final relative camera
pose. The module combines two horizon-depth maps with the horizon-correspondence and horizon-
covisibility maps to obtain a set of covisible corresponding boundary samples in a 2D coordinate
system aligned with the ceiling plane, followed by a RANSAC-based pose estimation. Note that this
non-linear space is more expressive and can encode a richer range of maps between two panoramas.
The final complete layout is obtained simply by taking the union of two registered layouts.
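As a rough illustration of the registration module described above, the following Python sketch lifts two horizon-depth maps to 2D boundary points, keeps the covisible correspondences, and fits a 2D rigid pose with RANSAC. It is a minimal sketch under stated assumptions, not the GPR-Net implementation: we assume upright panoramas (so the pose reduces to a 2D rotation and translation), a uniform column-to-azimuth mapping, a correspondence map encoded as column indices in the second panorama, and a depth map that stores horizontal distance. All function names and thresholds are hypothetical.

```python
import numpy as np

def boundary_points(depth):
    """Lift a 1D horizon-depth map to 2D points in the camera's top-down frame."""
    n = depth.shape[0]
    # Assumed mapping: column i covers azimuth theta_i, uniformly over [-pi, pi).
    theta = (np.arange(n) + 0.5) / n * 2.0 * np.pi - np.pi
    return np.stack([depth * np.sin(theta), depth * np.cos(theta)], axis=1)

def fit_rigid_2d(src, dst):
    """Least-squares R, t such that dst ~ src @ R.T + t (Kabsch algorithm)."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    h = (src - mu_s).T @ (dst - mu_d)
    u, _, vt = np.linalg.svd(h)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    r = vt.T @ np.diag([1.0, d]) @ u.T
    return r, mu_d - r @ mu_s

def ransac_pose(src, dst, iters=200, thresh=0.1, seed=0):
    """RANSAC over 2-point minimal samples, then a refit on the best inlier set."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), size=2, replace=False)
        r, t = fit_rigid_2d(src[idx], dst[idx])
        inliers = np.linalg.norm(src @ r.T + t - dst, axis=1) < thresh
        if inliers.sum() > best.sum():
            best = inliers
    if best.sum() < 2:
        raise ValueError("RANSAC found no consistent pose")
    r, t = fit_rigid_2d(src[best], dst[best])
    return r, t, best

def register(depth1, depth2, corr, covis, covis_thresh=0.5):
    """Estimate the pose mapping panorama-1 boundary points into panorama 2's frame."""
    p1 = boundary_points(depth1)
    p2 = boundary_points(depth2)
    # Assumed encoding: corr[i] is the (possibly fractional) column in panorama 2.
    j = np.round(corr).astype(int) % len(depth2)
    keep = covis > covis_thresh  # drop samples predicted as not covisible
    return ransac_pose(p1[keep], p2[j[keep]])
```

Two correspondences form the minimal sample for a 2D rigid transform, so each RANSAC iteration is cheap, and the 256 samples per panorama leave many candidate pairs even when a large share of the boundary is not covisible.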
We extensively validate our model by comparing it with the state-of-the-art panorama registration
method and multi-view layout estimation method on the large-scale indoor panorama dataset
ZInD (Cruz et al., 2021). The experimental results show that our model outperforms the
competing methods by a significant margin in both panorama registration accuracy
(mAA@5°: +68.5% (rotation), +63.0% (translation); mAA@10°: +74.1% (rotation),
+72.3% (translation)) and layout reconstruction accuracy (2D IoU: +4.5%).
In summary, our contributions are as follows:
• We propose the first complete multi-view panoramic layout estimation framework. Our
architecture jointly learns the layout and registration from data, is end-to-end trainable, and
most importantly, does not rely on a pose prior.
• We devise a novel panorama registration framework to effectively tackle the wide base-
line registration problem by exploiting the layout geometry and computing a fine-grained
correspondence of samples on the layout boundaries.
• We achieve state-of-the-art performance on the ZInD (Cruz et al., 2021) dataset for both the
stereo panorama registration and layout reconstruction tasks.
2 RELATED WORK
2.1 SINGLE-VIEW ROOM LAYOUT ESTIMATION
There exist many methods to estimate the room layouts from just a single image taken inside an
indoor environment. Methods that take only one perspective image include earlier attempts that