MBW: Multi-view Bootstrapping in the Wild
Mosam Dabhi1Chaoyang Wang1Tim Clifford2László A. Jeni1
Ian Fasel2Simon Lucey3
1Carnegie Mellon University 2Apple 3The University of Adelaide
Abstract
Labeling articulated objects in unconstrained settings has a wide variety of applications, including entertainment, neuroscience, psychology, ethology, and many fields of medicine. Large offline labeled datasets do not exist for all but the most common articulated object categories (e.g., humans). Hand labeling these landmarks within a video sequence is a laborious task. Learned landmark detectors can help, but can be error-prone when trained from only a few examples. Multi-camera systems that train fine-grained detectors have shown significant promise in detecting such errors, allowing for self-supervised solutions that only need a small percentage of the video sequence to be hand-labeled. The approach, however, relies on calibrated cameras and rigid geometry, making it expensive, difficult to manage, and impractical in real-world scenarios. In this paper, we address these bottlenecks by combining a non-rigid 3D neural prior with deep flow to obtain high-fidelity landmark estimates from videos with only two or three uncalibrated, handheld cameras. With just a few annotations (representing 1-2% of the frames), we are able to produce 2D results comparable to state-of-the-art fully supervised methods, along with 3D reconstructions that are impossible with other existing approaches. Our Multi-view Bootstrapping in the Wild (MBW) approach demonstrates impressive results on standard human datasets, as well as tigers, cheetahs, fish, colobus monkeys, chimpanzees, and flamingos from videos captured casually in a zoo. We release the codebase for MBW as well as this challenging zoo dataset, consisting of image frames of tail-end distribution categories with their corresponding 2D and 3D labels generated from minimal human intervention.
1 Introduction
Hand labeling landmarks of articulated objects within video is an arduous and expensive task. Landmark detectors [34, 25, 37] can be employed to automate the process. However, they require the ingestion of large amounts of labeled training data to be reliable – an infeasible requirement for all but the most common of articulated objects (e.g. people, hands). Semi-supervision can help [33], where a small portion of frames within the video are hand labeled. Candidate labels can be generated from the noisy landmark detectors – trained from the seed hand-labeled examples – and inliers are then determined through calibrated rigid multi-view geometry. These inliers are treated as labels and used to train the next round of landmark detectors. This semi-supervised process is iterated to increase the number of inlier estimates, with additional human annotation added judiciously to ensure the full sequence is labeled. Such strategies have been instrumental for obtaining reliable ground truth – most notably the Multi-view Bootstrapping (MB) approach of Simon et al. [33]. Human annotators are only required to hand label a subset of the dataset, with the rest just requiring visual inspection to validate the accuracy of the inferred labels.
indicates the authors advised equally
36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks.
arXiv:2210.01721v1 [cs.CV] 4 Oct 2022
Figure 1: Overview of our MBW approach. Top: Given an unconstrained, uncalibrated multi-view video with very few 2D labels (1-2% of frames, or roughly 15 labels), our method recovers the 3D structure in a canonical frame, along with camera poses and corresponding 2D landmarks for the complete video sequence. Bottom: Diverse reconstructions and data labeling for videos captured in the wild. This dataset is released as part of the paper.
Although significantly cutting down on human labor, Multi-view Bootstrapping [33] is still expensive and cumbersome, requiring a static multi-camera rig which usually consists of tens [1] or sometimes even hundreds of calibrated cameras [14]. The number of cameras can be reduced, but at the cost of less robust outlier rejection and more human intervention (see Fig. 4). This makes it less feasible for capturing objects outside laboratories. In this paper, we advocate for a significant advancement by enabling its application to data captured by a few (2 to 4) handheld cameras with only a handful of annotated frames (about 10-15 frames per several minutes of video).
We refer to our approach herein as Multi-view Bootstrapping in the Wild (MBW). The cameras need
not be calibrated, and fields of view need only overlap the articulated object, not the backgrounds.
Our innovations come from (i) utilizing Multi-View Non-Rigid Structure from Motion (MV-NRSfM) [3] to more reliably estimate camera poses and 3D landmark positions from noisy 2D inputs with few cameras. Compared to performing SfM / triangulation independently for each frame as in prior works [14, 1], MV-NRSfM leverages the redundancy in shape variations among different frames, and is thus less sensitive to variations in the input views and more capable of detecting outliers and denoising inlier 2D landmark estimates. (ii) We leverage recent advances in deep optical flow [35] as an alternative strategy for creating landmark label candidates – something especially useful in the early iterations of the semi-supervision process.
As a result, our approach can be effectively applied to less studied articulated object categories. We show results on tigers, fish, colobus monkeys, gorillas, chimpanzees, and flamingos from a zoo dataset (captured by the authors, who hereby release it under a CC-BY-NC license). We also quantitatively evaluate the proposed pipeline on common motion capture datasets (e.g. Human3.6M [12]). The accuracy of the learned landmark detector is competitive with state-of-the-art fully supervised methods. A graphical depiction of our approach can be found in Figure 1.
2 Related Works
Panoptic Studio [14] paved the way for collecting data for deformable objects such as the human body. Subsequent efforts on humans [12, 29, 41], hands [43, 44, 24], monkeys [1], canines [17], cheetahs [15], rats [20], and insects [8] have followed. Multi-view Bootstrapping [33] has demonstrated how these calibrated multi-camera datasets can be labeled efficiently through a semi-supervised learning paradigm and a small number of hand annotations. A fundamental drawback to multi-view bootstrapping, however, is that it requires a large number of views and accurate camera calibration.
Table 1: Related efforts trying to achieve a similar application as the proposed approach.

Method              | Flow | Calibration | 3D labels | Wild setup | % annotated (lower is better)
Günel et al. [8]    | No   | Required    | Yes       | No         | 30%
Mathis et al. [22]  | No   | N/A         | No        | No         | 5%
Dong et al. [4]     | Yes  | Required    | No        | No         | N/A (unknown)
Zhang and Park [42] | Yes  | Required    | No        | No         | 4%
Pereira et al. [27] | No   | N/A         | No        | No         | 5%
Simon et al. [33]   | No   | Required    | Yes       | No         | 30%
MBW (Ours)          | Yes  | No          | Yes       | Yes        | 2%
Figure 2: (Dotted lines) The MV-NRSfM neural shape prior is initially trained with labels for 1-2% of the frames (shown as green images). A pre-trained optical flow network then propagates the initial labels through the video to generate additional 2D candidates. Candidates that result in high reprojection error from the 3D lifting network are rejected as outliers (red). (Solid lines) From here on, the label set is updated with inliers from the previous iteration, and is then used both to retrain the MV-NRSfM and to train a 2D detector. The dotted path is executed only once, while the solid path is repeated for K iterations.
Recent works have explored alternate paradigms for semi-supervised landmark labeling that do not require such exotic calibrated multi-camera setups. Mathis et al. [22], Pereira et al. [27], and Yu et al. [40] tackle this problem from a single view, but largely ignore the use of multi-view geometry. Günel et al. [8] have explored an approach that utilizes a small number of camera views and only requires an approximate estimate of the camera extrinsics. They use pictorial structures [5] to automatically detect and correct labeling errors, and use active learning to iteratively improve landmark detection performance. Although this approach is useful in lab settings where there are static cameras and the object is anchored to a fixed location (e.g. tethered flies positioned over a spherical treadmill [8]), it is non-trivial to generalize such performance to more complex environments and across significant individual variation due to, e.g., patterned skin in animals or demographics and clothing in humans. In contrast, our approach accepts image frames from moving cameras and requires only a handful of hand-annotated labels. Further, it does not require any camera information, and can easily be applied to a broad set of articulated objects such as humans, hands, and animals. Thus, the strength of our method is its generalizability. Since the provided implementation of DeepFly3D was specific to Drosophila, it was not readily applicable to our in-the-wild datasets. An overview highlighting major differences between our proposed approach and related works pursuing a similar application is shown in Tab. 1.
Figure 3: Sample sequences composited from our Zoo data collection – situations where traditional
SLAM pipelines fail to recover reasonable camera matrices due to lack of reliable matching features.
3 Approach
3.1 Problem Setup
Our goal is to learn 2D landmarks of articulated objects from multi-view synchronized videos captured in the wild. Unlike other works [14, 8, 4, 42] developed for laboratory settings, we focus on the in-the-wild setting, i.e. data is captured using a small number (2 or 3) of cameras with unknown extrinsics, and only a small portion (1 to 2%) of the data is manually labeled.
More specifically, our training set $\mathcal{S}$ consists of $V$ synchronized videos, each with $N$ frames. Each training image is denoted as $I_{(n,v)}$, where $n \in [1, \dots, N]$ and $v \in [1, \dots, V]$ denote frame and view indices. Initially, only a subset of frames $(n,v) \in \mathcal{S}_0$ are given with 2D landmark annotations $\mathbf{W}_{(n,v)} \in \mathbb{R}^{P \times 2}$ of $P$ points. Each row of $\mathbf{W}_{(n,v)}$ corresponds to the 2D location of a landmark (e.g. the left knee of a flamingo, see Fig. 3). To simplify the explanation, we assume that only a single object of interest is visible in each frame. For multiple non-overlapping objects, our algorithm is able to estimate bounding boxes to reduce the problem to the single-object case (see Appendix D). Finally, the goal is to (i) infer the missing 2D landmark annotations in the training set as a self-labeling task; and (ii) train a 2D landmark detector for unseen objects of the same category.
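To make the notation concrete, the following minimal sketch (not from the paper's released codebase; all names and numbers are illustrative) shows one way the partially labeled multi-view data described above could be organized:

```python
import numpy as np

# Minimal sketch of the data layout described above (illustrative only).
# V synchronized videos, N frames each; landmarks W_(n,v) are P x 2 arrays.
N, V, P = 3000, 3, 19  # frames, views, landmarks (hypothetical numbers)

# images[(n, v)] would hold frame I_(n,v); omitted here.
# Only a small labeled subset S0 of (frame, view) indices has annotations.
S0 = {(0, 0), (0, 1), (0, 2), (150, 0), (150, 1), (150, 2)}  # ~1-2% of frames

# Sparse annotation store: (n, v) -> W of shape (P, 2) in pixel coordinates.
annotations = {key: np.zeros((P, 2)) for key in S0}

def is_labeled(n: int, v: int) -> bool:
    """A frame/view pair counts as labeled once it enters the label set."""
    return (n, v) in annotations
```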
3.2 Learnable geometric supervised self-training

We employ a self-training approach which iteratively assigns pseudo labels and retrains a 2D landmark detector. At each iteration, the 2D pseudo labels generated by a landmark detector are verified using geometric constraints. Samples which fail the verification are dropped, and the remaining pseudo labels are denoised before feeding them back as labels to retrain the landmark detector. Such a geometric supervised self-training strategy has been widely used in learning landmark detectors [8, 14, 1, 33]; what differentiates our work is that we model the geometric constraints as a learnable function, which is learned together with the landmark detector. We abstract this function as:

$$g: \tilde{\mathbf{W}}_1, \tilde{\mathbf{W}}_2, \dots, \tilde{\mathbf{W}}_V \mapsto y_1, y_2, \dots, y_V, \qquad (1)$$

where $\tilde{\mathbf{W}}_v \in \mathbb{R}^{P \times 2}$ represents the detected 2D landmarks at the $v$-th view, and $y_v$ is the measured uncertainty used for outlier rejection. We derive $g$ by performing multi-view non-rigid structure from motion (MV-NRSfM), as described in Sec. 3.3. The remaining details of the self-training pipeline are given as follows.
Initialization. In the initial step, we require human labelers to annotate the 2D landmark positions of the same target object for a small portion of the captured video frames. We then train our geometric constraint function $g$ using these initial labels. Since the initial labels only cover a limited range of shape variations, the learned $g$ is aggressive in detecting outliers at this early stage of training; it improves as it sees more shape variation at each iteration.
Label propagation through tracking. We find that directly training a 2D landmark detector such as HRNet [34] using very few labeled samples yields unstable results. To increase the number of training samples, we propagate the annotated 2D landmark labels to the rest of the unlabeled frames through tracking. We use an off-the-shelf optical flow network [35] to track the landmarks from frame to frame; other tracking methods [30, 9] could also be used. We employ a standard forward-backward flow consistency check to detect tracking failures. Since the optical flow network tends to make consistently wrong estimates when the input frames are swapped, this consistency check alone is not enough to exclude all tracking failures. Therefore, we further employ the learned geometric constraint function $g$ to aggressively remove likely outliers whenever the predicted uncertainty $y$ is above a certain threshold. We then add the remaining tracked points (inliers) to the labeled set. This new set is then used both to re-train $g$ and to train the first iteration of the 2D landmark detector used in the subsequent stages.
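A minimal sketch of the forward-backward consistency check described above is given below. The `flow` callable stands in for a pretrained optical-flow network (e.g. [35]), and the pixel threshold is illustrative; this is not the released implementation.

```python
import numpy as np

def propagate_labels(points, img_t, img_t1, flow, fb_thresh=2.0):
    """Track P x 2 landmark `points` from frame t to t+1 and flag failures.

    `flow(a, b)` is assumed to return a dense flow field of shape (H, W, 2)
    mapping pixels of `a` to `b` (e.g. a pretrained flow network); the
    threshold is illustrative, not a value from the paper.
    """
    fwd = flow(img_t, img_t1)          # forward flow  t -> t+1
    bwd = flow(img_t1, img_t)          # backward flow t+1 -> t

    xi = np.clip(points[:, 0].round().astype(int), 0, fwd.shape[1] - 1)
    yi = np.clip(points[:, 1].round().astype(int), 0, fwd.shape[0] - 1)
    tracked = points + fwd[yi, xi]     # landmark positions in frame t+1

    # Warp back and measure the round-trip error (forward-backward check).
    xj = np.clip(tracked[:, 0].round().astype(int), 0, bwd.shape[1] - 1)
    yj = np.clip(tracked[:, 1].round().astype(int), 0, bwd.shape[0] - 1)
    round_trip = tracked + bwd[yj, xj]
    fb_error = np.linalg.norm(round_trip - points, axis=1)

    valid = fb_error < fb_thresh       # survivors are still checked by g later
    return tracked, valid
```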
Self-training iterations. At each iteration $t$, we define a "labeled" set $\mathcal{S}_{t-1}$ which includes all frames that are either manually annotated, or were labeled by the landmark detector $f_{t-1}$ in the previous stage and passed outlier rejection using $g_{t-1}$. We then re-train the landmark detector and the geometric constraint function on the labeled set $\mathcal{S}_{t-1}$, which yields a new detector $f_t$ as well as $g_t$. Once trained, inference is run with this detector network $f_t$ over all the captured frames. This produces new pseudo labels $\tilde{\mathbf{W}}^t_{(n,v)}$ for all $N$ frames and $V$ views. We then apply the geometric constraint function $g_t$ to evaluate the uncertainty score $y^t_{(n,v)}$ for each pseudo label. Finally, we define a new labeled set $\mathcal{S}_t$ which includes all samples $(n,v)$ for which $y^t_{(n,v)}$ is below a certain threshold.

The above process is repeated for a number of iterations. In principle, frames that are still not annotated (rejected by $g_t$) can be actively labeled by humans; in practice, however, we have found this to be rarely necessary, unless the distance between the captured views is extremely small, making it difficult to learn a reasonable 3D shape prior.
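The full self-training loop can be summarized by the following sketch. The helpers `train_mvnrsfm`, `train_detector`, and `uncertainty` stand in for the components described in Sec. 3.2 and 3.3 and are hypothetical names, not functions from the released code; the threshold default is likewise illustrative.

```python
def self_train(frames, labels_0, train_detector, train_mvnrsfm, uncertainty,
               num_iters=5, tau=10.0):
    """Sketch of the MBW self-training loop (hypothetical helper names).

    frames:   dict (n, v) -> image
    labels_0: dict (n, v) -> P x 2 array (hand labels + flow-tracked inliers)
    tau:      uncertainty threshold for accepting pseudo labels (illustrative)
    """
    labels = dict(labels_0)
    frame_ids = sorted({n for n, _ in frames})
    view_ids = sorted({v for _, v in frames})
    f = g = None

    for _ in range(num_iters):
        g = train_mvnrsfm(labels)              # geometric constraint g^t (Sec. 3.3)
        f = train_detector(frames, labels)     # 2D landmark detector f^t

        # Run the detector on every frame/view to get pseudo labels.
        pseudo = {key: f(img) for key, img in frames.items()}

        labels = dict(labels_0)                # human labels are always kept
        for n in frame_ids:
            views = {v: pseudo[(n, v)] for v in view_ids}
            scores = uncertainty(g, views)     # Eq. (2): dict v -> y_(n,v)
            for v in view_ids:
                if scores[v] < tau:            # accept low-uncertainty pseudo labels
                    labels[(n, v)] = views[v]
    return f, g, labels
```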
3.3 Outlier detection using multi-view NRSfM network
Uncertainty score. Our geometric constraint function $g$ is built upon measuring the discrepancy between the detected 2D landmarks and the 3D reconstruction produced by a multi-view NRSfM method. This is in the same spirit as using the reprojection error of triangulation to measure uncertainty, as in prior works. The idea is that if the detected 2D landmarks at different views are all correct, we should be able to recover accurate camera poses and 3D structure, and consequently the reprojection of the recovered 3D landmarks matches the 2D landmarks. On the other hand, if the reprojection error is high, there exist errors in the 2D landmarks which prevent a perfect 3D reconstruction. This leads to the following formulation of our uncertainty score,

$$y_{(n,v)} = \| \tilde{\mathbf{W}}_{(n,v)} - \mathrm{proj}(\tilde{\mathbf{T}}_{(n,v)} \tilde{\mathbf{S}}_n) \|_F \qquad (2)$$

where $\tilde{\mathbf{T}}_{(n,v)}$ and $\tilde{\mathbf{S}}_n$ are the estimated camera extrinsics and 3D landmark positions in world coordinates, $\tilde{\mathbf{W}}_{(n,v)}$ are the 2D landmarks estimated by the landmark detector, and $\mathrm{proj}$ is the projection function. The effectiveness of the uncertainty score defined by Eq. 2 depends on the reliability of estimating $\tilde{\mathbf{T}}_{(n,v)}$ and $\tilde{\mathbf{S}}_n$. However, due to the low number of synchronized views as well as noise in $\tilde{\mathbf{W}}_{(n,v)}$, simply performing SfM and triangulation gives poor results, as shown in Fig. 4a. This motivates the following use of MV-NRSfM.
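For illustration, a minimal numpy version of the score in Eq. 2 is sketched below, assuming an orthographic camera for the projection function; the actual projection model follows MV-NRSfM [3], so this is only a simplified sketch, not the released implementation.

```python
import numpy as np

def uncertainty_score(W_detected, R, t, S_world):
    """Uncertainty y_(n,v) of Eq. (2) under a simple orthographic camera.

    W_detected: (P, 2) detected 2D landmarks for view v
    R, t:       camera rotation (3, 3) and translation (3,) for view v
    S_world:    (P, 3) reconstructed 3D landmarks in world coordinates
    The orthographic projection is a simplifying assumption made here.
    """
    S_cam = S_world @ R.T + t                    # transform into the camera frame
    W_proj = S_cam[:, :2]                        # orthographic projection: drop depth
    return np.linalg.norm(W_detected - W_proj)   # Frobenius norm of the residual
```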
Unsupervised learned MV-NRSfM. Our solution to reliably estimate $\tilde{\mathbf{T}}_{(n,v)}$ and $\tilde{\mathbf{S}}_n$ is to marry the multi-view geometric constraints with the temporal redundancies across frames, which leads to the adoption of the MV-NRSfM method [3]. Limited by space, we refer the interested reader to their paper for a detailed treatment; here we briefly discuss its usage in our problem. In a nutshell, MV-NRSfM [3] assumes that 3D shapes (concatenations of 3D landmark positions) can be compressed into low-dimensional latent codes if they are properly aligned to a canonical view. MV-NRSfM is then trained to learn a decoder $h_d: \phi \in \mathbb{R}^K \mapsto \mathbf{S} \in \mathbb{R}^{P \times 3}$ which maps a low-dimensional code to an aligned 3D shape, as well as an encoder network $h_e: \mathbf{W}_1, \mathbf{W}_2, \dots, \mathbf{W}_V \mapsto \phi$ which estimates a single shape code $\phi$ from 2D landmarks $\mathbf{W}_v \in \mathbb{R}^{P \times 2}$ captured from a number of different views (see Appendix C for the network architecture). Both $h_d$ and $h_e$ are learned by minimizing the reprojection error:

$$\min_{\mathbf{T}_{(n,v)}, h_d, h_e} \sum_{(n,v) \in \mathcal{S}} \big\| \tilde{\mathbf{W}}_{(n,v)} - \mathrm{proj}\big(\mathbf{T}_{(n,v)} \, (h_d \circ h_e)(\tilde{\mathbf{W}}_{(n,1)}, \tilde{\mathbf{W}}_{(n,2)}, \dots, \tilde{\mathbf{W}}_{(n,V)})\big) \big\|_F \qquad (3)$$

where $\mathcal{S}$ refers to the training set, and $\circ$ denotes function composition. Thanks to the constraint imposed by the low-dimensional codes, as well as the convolutional structure of $h_e$ inspired by factorization-based NRSfM methods [18], the learned networks $h_d \circ h_e$ are able to infer reasonable 3D landmark positions from noisy 2D landmark inputs. We provide the network architecture of MV-NRSfM in Fig. 10 of Appendix C.
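As an illustration of the objective in Eq. 3, the following PyTorch sketch evaluates the reprojection error for a single frame. The `encoder` and `decoder` arguments stand in for $h_e$ and $h_d$, an orthographic projection is assumed, and the camera poses are passed in rather than jointly optimized as in Eq. 3, so this is a simplified sketch rather than the MV-NRSfM implementation.

```python
import torch

def reprojection_loss(encoder, decoder, W_views, R_views, t_views):
    """Single-frame sketch of the MV-NRSfM objective in Eq. (3).

    W_views: (V, P, 2) 2D landmarks from V views of the same frame
    R_views: (V, 3, 3) per-view rotations; t_views: (V, 3) translations
    encoder: maps the stacked 2D views to a single latent code (h_e)
    decoder: maps the code to an aligned 3D shape of size P x 3    (h_d)
    An orthographic projection is assumed here for brevity.
    """
    code = encoder(W_views.reshape(1, -1))      # single shape code phi
    S = decoder(code).reshape(-1, 3)            # canonical 3D shape, (P, 3)

    loss = 0.0
    for v in range(W_views.shape[0]):
        S_cam = S @ R_views[v].T + t_views[v]   # world -> camera v
        W_proj = S_cam[:, :2]                   # orthographic projection
        loss = loss + torch.norm(W_views[v] - W_proj)   # Frobenius residual
    return loss
```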
In our task, we rely on the robustness of MV-NRSfM not only to learn the 3D reconstruction of the labeled training set, but also to detect outliers on the unlabeled set using Eq. 2. At the $t$-th iteration of our self-training, we train $h^t_d$, $h^t_e$ given the current labeled set $\mathcal{S}_{t-1}$ from the previous iteration. We