MBW: Multi-view Bootstrapping in the Wild
Mosam Dabhi1Chaoyang Wang1Tim Clifford2László A. Jeni1
Ian Fasel2Simon Lucey3
1Carnegie Mellon University 2Apple 3The University of Adelaide
Abstract
Labeling articulated objects in unconstrained settings has a wide variety of applications, including entertainment, neuroscience, psychology, ethology, and many fields of medicine. Large offline labeled datasets do not exist for all but the most common articulated object categories (e.g., humans). Hand labeling these landmarks within a video sequence is a laborious task. Learned landmark detectors can help, but can be error-prone when trained from only a few examples. Multi-camera systems that train fine-grained detectors have shown significant promise in detecting such errors, allowing for self-supervised solutions that only need a small percentage of the video sequence to be hand-labeled. The approach, however, relies on calibrated cameras and rigid geometry, making it expensive, difficult to manage, and impractical in real-world scenarios. In this paper, we address these bottlenecks by combining a non-rigid 3D neural prior with deep flow to obtain high-fidelity landmark estimates from videos with only two or three uncalibrated, handheld cameras. With just a few annotations (representing 1-2% of the frames), we are able to produce 2D results comparable to state-of-the-art fully supervised methods, along with 3D reconstructions that are impossible with other existing approaches. Our Multi-view Bootstrapping in the Wild (MBW) approach demonstrates impressive results on standard human datasets, as well as tigers, cheetahs, fish, colobus monkeys, chimpanzees, and flamingos from videos captured casually in a zoo. We release the codebase for MBW as well as this challenging zoo dataset, consisting of image frames of tail-end distribution categories with their corresponding 2D and 3D labels generated from minimal human intervention.
1 Introduction
Hand labeling landmarks of articulated objects within video is an arduous and expensive task. Landmark detectors [34, 25, 37] can be employed to automate the process. However, they require the ingestion of large amounts of labeled training data to be reliable – an infeasible requirement for all but the most common of articulated objects (e.g. people, hands). Semi-supervision can help [33], where a small portion of frames within the video are hand labeled. Candidate labels can be generated from the noisy landmark detectors – trained from the seed hand-labeled examples – and inliers are then determined through calibrated rigid multi-view geometry. These inliers are treated as labels and used to train the next round of landmark detectors. This semi-supervised process is iterated to increase the number of inlier estimates, with additional human annotation added judiciously to ensure the full sequence is labeled. Such strategies have been instrumental for obtaining reliable ground truth – most notably the Multi-view Bootstrapping (MB) approach of Simon et al. [33]. Human annotators are only required to hand label a subset of the dataset, with the rest just requiring visual inspection to validate the accuracy of the inferred labels.
indicates the authors advised equally
36th Conference on Neural Information Processing Systems (NeurIPS 2022) Track on Datasets and Benchmarks.
arXiv:2210.01721v1 [cs.CV] 4 Oct 2022
Figure 1: Overview of our MBW approach. Top: Given an unconstrained, uncalibrated multi-view video with very few 2D labels (1-2% of frames, or roughly 15 labels), our method recovers the 3D structure in a canonical frame, along with camera poses and corresponding 2D landmarks for the complete video sequence. Bottom: Diverse reconstructions and data labeling for videos captured in the wild. This dataset is released as part of the paper.
Although significantly cutting down on human labor, Multi-view Bootstrapping [33] is still expensive and cumbersome, requiring a static multi-camera rig which usually consists of tens [1] or sometimes even hundreds of calibrated cameras [14]. The number of cameras can be reduced, but at the cost of less robust outlier rejection and more human intervention (see Fig. 4). This makes it less feasible for capturing objects outside laboratories. In this paper, we advocate for a significant advancement by enabling its application to data captured by a few (2 to 4) handheld cameras with only a handful of annotated frames (about 10-15 frames per several minutes of video).
We refer to our approach herein as Multi-view Bootstrapping in the Wild (MBW). The cameras need
not be calibrated, and fields of view need only overlap the articulated object, not the backgrounds.
Our innovations come from (i) utilizing Multi-View Non-Rigid Structure from Motion (MV-NRSfM) [3] to more reliably estimate camera poses and 3D landmark positions from noisy 2D inputs with few cameras. Compared to performing SfM / triangulation independently for each frame as in prior works [14, 1], MV-NRSfM leverages the redundancy in shape variations among different frames, and is thus less sensitive to variations in the input views and more capable of detecting outliers and denoising inlier 2D landmark estimates. (ii) We leverage recent advances in deep optical flow [35] as an alternative strategy for creating landmark label candidates – something especially useful in the early iterations of the semi-supervision process.
As a result, our approach can be effectively applied to less studied articulated object categories. We show results on tigers, fish, colobus monkeys, gorillas, chimpanzees, and flamingos from a zoo dataset (captured by the authors, who hereby release it under a CC-BY-NC license). We also quantitatively evaluate the proposed pipeline on common motion capture datasets (e.g. Human3.6M [12]). The accuracy of the learned landmark detector is competitive with state-of-the-art fully supervised methods. A graphical depiction of our approach can be found in Figure 1.
2 Related Works
Panoptic Studio [14] paved the way for collecting data for deformable objects such as the human body. Subsequent efforts on humans [12, 29, 41], hands [43, 44, 24], monkeys [1], canines [17], cheetahs [15], rats [20], and insects [8] have followed. Multi-view Bootstrapping [33] has demonstrated how these calibrated multi-camera datasets can be labeled efficiently through a semi-supervised learning paradigm and a small number of hand annotations. A fundamental drawback to multi-view bootstrapping, however, is that it requires a large number of views and accurate camera calibration.
Table 1: Related efforts trying to achieve a similar application as the proposed approach.

Method              | Flow | Calibration | 3D labels | Wild setup | % annotated (lower is better)
Günel et al. [8]    | No   | Required    | Yes       | No         | 30%
Mathis et al. [22]  | No   | N/A         | No        | No         | 5%
Dong et al. [4]     | Yes  | Required    | No        | No         | N/A (unknown)
Zhang and Park [42] | Yes  | Required    | No        | No         | 4%
Pereira et al. [27] | No   | N/A         | No        | No         | 5%
Simon et al. [33]   | No   | Required    | Yes       | No         | 30%
MBW (Ours)          | Yes  | No          | Yes       | Yes        | 2%
Figure 2: (Dotted lines) The MV-NRSfM neural shape prior is initially trained with labels for 1-2% of the frames (shown as green images). A pre-trained optical flow network then propagates the initial labels through the video to generate additional 2D candidates. Candidates that result in high reprojection error from the 3D lifting network are rejected as outliers (red). (Solid lines) From here on, the label set is updated with inliers from the previous iteration, and is then used both to retrain the MV-NRSfM and to train a 2D detector. The dotted path is executed only once, while the solid path is repeated for K iterations.
Recent works have explored alternate paradigms for semi-supervised landmark labeling that do not require such exotic calibrated multi-camera setups. Mathis et al. [22], Pereira et al. [27], and Yu et al. [40] tackle this problem from a single view, but largely ignore the use of multi-view geometry. Günel et al. [8] have explored an approach that utilizes a small number of camera views and only requires an approximate estimate of the camera extrinsics. They use pictorial structures [5] to automatically detect and correct labeling errors, and use active learning to iteratively improve landmark detection performance. Although this approach is useful in lab settings where there are static cameras and the object is anchored to a fixed location (e.g. tethered flies positioned over a spherical treadmill [8]), it is non-trivial to generalize such performance to more complex environments and across significant individual variation due to, e.g., patterned skin in animals or demographics and clothing in humans. In contrast, our approach accepts image frames from moving cameras and requires only a handful of hand-annotated labels. Further, it does not require any camera information, and can easily be applied to a broad set of articulated objects such as humans, hands, and animals. Thus, the strength of our method is its generalizability. Since the provided implementation of DeepFly3D was specific to Drosophila, it was not readily applicable to our in-the-wild datasets. An overview highlighting major differences between our proposed approach and related works pursuing a similar application is shown in Tab. 1.
Figure 3: Sample sequences composited from our Zoo data collection – situations where traditional
SLAM pipelines fail to recover reasonable camera matrices due to lack of reliable matching features.
3 Approach
3.1 Problem Setup
Our goal is to learn 2D landmarks of articulated objects from multi-view synchronized videos captured in the wild. Unlike other works [14, 8, 4, 42] developed for laboratory settings, we focus on the in-the-wild setting, i.e. data is captured using a small number (2 or 3) of cameras with unknown extrinsics, and only a small portion (1 to 2%) of the data is manually labeled.
More specifically, our training set $\mathcal{S}$ consists of $V$ synchronized videos, each with $N$ frames. Each training image is denoted as $I_{(n,v)}$, where $n \in [1, \dots, N]$ and $v \in [1, \dots, V]$ denote frame and view indices. Initially, only a subset of frames $(n,v) \in \mathcal{S}_0$ are given with 2D landmark annotations $\mathbf{W}_{(n,v)} \in \mathbb{R}^{P \times 2}$ of $P$ points. Each row of $\mathbf{W}_{(n,v)}$ corresponds to the 2D location of a landmark (e.g. the left knee of a flamingo, see Fig. 3). To simplify the explanation, we assume that only a single object of interest is visible in each frame. For multiple non-overlapping objects, our algorithm is able to estimate bounding boxes to reduce the problem to the single-object case (see Appendix D). Finally, the goal is to (i) infer the missing 2D landmark annotations in the training set as a self-labeling task; and (ii) train a 2D landmark detector for unseen objects of the same category.
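To make the notation concrete, the following minimal sketch (not from the paper's released codebase; all names and numbers are illustrative) shows one way the partially labeled multi-view data described above could be organized:

```python
import numpy as np

# Minimal sketch of the data layout described above (illustrative only).
# V synchronized videos, N frames each; landmarks W_(n,v) are P x 2 arrays.
N, V, P = 3000, 3, 19  # frames, views, landmarks (hypothetical numbers)

# images[(n, v)] would hold frame I_(n,v); omitted here.
# Only a small labeled subset S0 of (frame, view) indices has annotations.
S0 = {(0, 0), (0, 1), (0, 2), (150, 0), (150, 1), (150, 2)}  # ~1-2% of frames

# Sparse annotation store: (n, v) -> W of shape (P, 2) in pixel coordinates.
annotations = {key: np.zeros((P, 2)) for key in S0}

def is_labeled(n: int, v: int) -> bool:
    """A frame/view pair counts as labeled once it enters the label set."""
    return (n, v) in annotations
```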
3.2 Learnable geometric supervised self-training

We employ a self-training approach which iteratively assigns pseudo labels and retrains a 2D landmark detector. At each iteration, the 2D pseudo labels generated by a landmark detector are verified using geometric constraints. Samples which fail the verification are dropped, and the remaining pseudo labels are denoised before feeding them back as labels to retrain the landmark detector. Such a geometric supervised self-training strategy has been widely used in learning landmark detectors [8, 14, 1, 33]; what differentiates our work is that we model the geometric constraints as a learnable function, which is learned together with the landmark detector. We abstract this function as:

$$g: \tilde{\mathbf{W}}_1, \tilde{\mathbf{W}}_2, \dots, \tilde{\mathbf{W}}_V \mapsto y_1, y_2, \dots, y_V, \qquad (1)$$

where $\tilde{\mathbf{W}}_v \in \mathbb{R}^{P \times 2}$ represents the detected 2D landmarks at the $v$-th view, and $y_v$ is the measured uncertainty used for outlier rejection. We derive $g$ by performing multi-view non-rigid structure from motion (MV-NRSfM), as described in Sec. 3.3. The remaining details of the self-training pipeline are given as follows.
Initialization. In the initial step, we require human labelers to annotate the 2D landmark positions of the same target object for a small portion of the captured video frames. We then train our geometric constraint function $g$ using these initial labels. Since the initial labels only cover a limited range of shape variations, the learned $g$ is aggressive in detecting outliers at this early stage of training; it improves as it sees more shape variation at each iteration.
Label propagation through tracking. We find that directly training a 2D landmark detector such as HRNet [34] using very few labeled samples yields unstable results. To increase the number of training samples, we propagate the annotated 2D landmark labels to the rest of the unlabeled frames through tracking. We use an off-the-shelf optical flow network [35] to track the landmarks from frame to frame; other tracking methods [30, 9] could also be used. We employ a standard forward-backward flow consistency check to detect tracking failures. Since the optical flow network tends to make consistently wrong estimates when the input frames are swapped, this consistency check alone is not enough to exclude all tracking failures. Therefore, we further employ the learned geometric constraint function $g$ to aggressively remove likely outliers whenever the predicted uncertainty $y$ is above a certain threshold. We then add the remaining tracked points (inliers) to the labeled set. This new set is then used both to re-train $g$ and to train the first iteration of the 2D landmark detector used in the subsequent stages.
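A minimal sketch of the forward-backward consistency check described above is given below. The `flow` callable stands in for a pretrained optical-flow network (e.g. [35]), and the pixel threshold is illustrative; this is not the released implementation.

```python
import numpy as np

def propagate_labels(points, img_t, img_t1, flow, fb_thresh=2.0):
    """Track P x 2 landmark `points` from frame t to t+1 and flag failures.

    `flow(a, b)` is assumed to return a dense flow field of shape (H, W, 2)
    mapping pixels of `a` to `b` (e.g. a pretrained flow network); the
    threshold is illustrative, not a value from the paper.
    """
    fwd = flow(img_t, img_t1)          # forward flow  t -> t+1
    bwd = flow(img_t1, img_t)          # backward flow t+1 -> t

    xi = np.clip(points[:, 0].round().astype(int), 0, fwd.shape[1] - 1)
    yi = np.clip(points[:, 1].round().astype(int), 0, fwd.shape[0] - 1)
    tracked = points + fwd[yi, xi]     # landmark positions in frame t+1

    # Warp back and measure the round-trip error (forward-backward check).
    xj = np.clip(tracked[:, 0].round().astype(int), 0, bwd.shape[1] - 1)
    yj = np.clip(tracked[:, 1].round().astype(int), 0, bwd.shape[0] - 1)
    round_trip = tracked + bwd[yj, xj]
    fb_error = np.linalg.norm(round_trip - points, axis=1)

    valid = fb_error < fb_thresh       # survivors are still checked by g later
    return tracked, valid
```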
Self-training iterations. At each iteration $t$, we define a "labeled" set $\mathcal{S}_{t-1}$ which includes all frames that are either manually annotated, or were labeled by the landmark detector $f_{t-1}$ in the previous stage and passed outlier rejection using $g_{t-1}$. We then re-train the landmark detector and the geometric constraint function on the labeled set $\mathcal{S}_{t-1}$, which yields a new detector $f_t$ as well as $g_t$. Once trained, inference is run with this detector network $f_t$ over all the captured frames. This produces new pseudo labels $\tilde{\mathbf{W}}^t_{(n,v)}$ for all $N$ frames and $V$ views. We then apply the geometric constraint function $g_t$ to evaluate the uncertainty score $y^t_{(n,v)}$ for each pseudo label. Finally, we define a new labeled set $\mathcal{S}_t$ which includes all samples $(n,v)$ for which $y^t_{(n,v)}$ is below a certain threshold.

The above process is repeated for a number of iterations. In principle, frames that are still not annotated (rejected by $g_t$) can be actively labeled by humans; in practice, however, we have found this to be rarely necessary, unless the distance between the captured views is extremely small, making it difficult to learn a reasonable 3D shape prior.
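The full self-training loop can be summarized by the following sketch. The helpers `train_mvnrsfm`, `train_detector`, and `uncertainty` stand in for the components described in Sec. 3.2 and 3.3 and are hypothetical names, not functions from the released code; the threshold default is likewise illustrative.

```python
def self_train(frames, labels_0, train_detector, train_mvnrsfm, uncertainty,
               num_iters=5, tau=10.0):
    """Sketch of the MBW self-training loop (hypothetical helper names).

    frames:   dict (n, v) -> image
    labels_0: dict (n, v) -> P x 2 array (hand labels + flow-tracked inliers)
    tau:      uncertainty threshold for accepting pseudo labels (illustrative)
    """
    labels = dict(labels_0)
    frame_ids = sorted({n for n, _ in frames})
    view_ids = sorted({v for _, v in frames})
    f = g = None

    for _ in range(num_iters):
        g = train_mvnrsfm(labels)              # geometric constraint g^t (Sec. 3.3)
        f = train_detector(frames, labels)     # 2D landmark detector f^t

        # Run the detector on every frame/view to get pseudo labels.
        pseudo = {key: f(img) for key, img in frames.items()}

        labels = dict(labels_0)                # human labels are always kept
        for n in frame_ids:
            views = {v: pseudo[(n, v)] for v in view_ids}
            scores = uncertainty(g, views)     # Eq. (2): dict v -> y_(n,v)
            for v in view_ids:
                if scores[v] < tau:            # accept low-uncertainty pseudo labels
                    labels[(n, v)] = views[v]
    return f, g, labels
```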
3.3 Outlier detection using multi-view NRSfM network
Uncertainty score. Our geometric constraint function $g$ is built upon measuring the discrepancy between the detected 2D landmarks and the 3D reconstruction produced by a multi-view NRSfM method. This is in the same spirit as using the reprojection error of triangulation to measure uncertainty, as in prior works. The idea is that if the detected 2D landmarks at different views are all correct, we should be able to recover accurate camera poses and 3D structure, and consequently the reprojection of the recovered 3D landmarks matches the 2D landmarks. On the other hand, if the reprojection error is high, there exist errors in the 2D landmarks which prevent a perfect 3D reconstruction. This leads to the following formulation of our uncertainty score,

$$y_{(n,v)} = \| \tilde{\mathbf{W}}_{(n,v)} - \mathrm{proj}(\tilde{\mathbf{T}}_{(n,v)} \tilde{\mathbf{S}}_n) \|_F \qquad (2)$$

where $\tilde{\mathbf{T}}_{(n,v)}$ and $\tilde{\mathbf{S}}_n$ are the estimated camera extrinsics and 3D landmark positions in world coordinates, $\tilde{\mathbf{W}}_{(n,v)}$ are the 2D landmarks estimated by the landmark detector, and $\mathrm{proj}$ is the projection function. The effectiveness of the uncertainty score defined by Eq. 2 depends on the reliability of estimating $\tilde{\mathbf{T}}_{(n,v)}$ and $\tilde{\mathbf{S}}_n$. However, due to the low number of synchronized views as well as noise in $\tilde{\mathbf{W}}_{(n,v)}$, simply performing SfM and triangulation gives poor results, as shown in Fig. 4a. This motivates the following use of MV-NRSfM.
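For illustration, a minimal numpy version of the score in Eq. 2 is sketched below, assuming an orthographic camera for the projection function; the actual projection model follows MV-NRSfM [3], so this is only a simplified sketch, not the released implementation.

```python
import numpy as np

def uncertainty_score(W_detected, R, t, S_world):
    """Uncertainty y_(n,v) of Eq. (2) under a simple orthographic camera.

    W_detected: (P, 2) detected 2D landmarks for view v
    R, t:       camera rotation (3, 3) and translation (3,) for view v
    S_world:    (P, 3) reconstructed 3D landmarks in world coordinates
    The orthographic projection is a simplifying assumption made here.
    """
    S_cam = S_world @ R.T + t                    # transform into the camera frame
    W_proj = S_cam[:, :2]                        # orthographic projection: drop depth
    return np.linalg.norm(W_detected - W_proj)   # Frobenius norm of the residual
```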
Unsupervised learned MV-NRSfM. Our solution to reliably estimate $\tilde{\mathbf{T}}_{(n,v)}$ and $\tilde{\mathbf{S}}_n$ is to marry the multi-view geometric constraints with the temporal redundancies across frames, which leads to the adoption of the MV-NRSfM method [3]. Limited by space, we refer the interested reader to their paper for a detailed treatment; here we briefly discuss its usage in our problem. In a nutshell, MV-NRSfM [3] assumes that 3D shapes (concatenations of 3D landmark positions) can be compressed into low-dimensional latent codes if they are properly aligned to a canonical view. MV-NRSfM is then trained to learn a decoder $h_d: \phi \in \mathbb{R}^K \mapsto \mathbf{S} \in \mathbb{R}^{P \times 3}$ which maps a low-dimensional code to an aligned 3D shape, as well as an encoder network $h_e: \mathbf{W}_1, \mathbf{W}_2, \dots, \mathbf{W}_V \mapsto \phi$ which estimates a single shape code $\phi$ from 2D landmarks $\mathbf{W}_v \in \mathbb{R}^{P \times 2}$ captured from a number of different views (see Appendix C for the network architecture). Both $h_d$ and $h_e$ are learned by minimizing the reprojection error:

$$\min_{\mathbf{T}_{(n,v)}, h_d, h_e} \sum_{(n,v) \in \mathcal{S}} \big\| \tilde{\mathbf{W}}_{(n,v)} - \mathrm{proj}\big(\mathbf{T}_{(n,v)} \, (h_d \circ h_e)(\tilde{\mathbf{W}}_{(n,1)}, \tilde{\mathbf{W}}_{(n,2)}, \dots, \tilde{\mathbf{W}}_{(n,V)})\big) \big\|_F \qquad (3)$$

where $\mathcal{S}$ refers to the training set, and $\circ$ denotes function composition. Thanks to the constraint imposed by the low-dimensional codes, as well as the convolutional structure of $h_e$ inspired by factorization-based NRSfM methods [18], the learned networks $h_d \circ h_e$ are able to infer reasonable 3D landmark positions from noisy 2D landmark inputs. We provide the network architecture of MV-NRSfM in Fig. 10 of Appendix C.
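As an illustration of the objective in Eq. 3, the following PyTorch sketch evaluates the reprojection error for a single frame. The `encoder` and `decoder` arguments stand in for $h_e$ and $h_d$, an orthographic projection is assumed, and the camera poses are passed in rather than jointly optimized as in Eq. 3, so this is a simplified sketch rather than the MV-NRSfM implementation.

```python
import torch

def reprojection_loss(encoder, decoder, W_views, R_views, t_views):
    """Single-frame sketch of the MV-NRSfM objective in Eq. (3).

    W_views: (V, P, 2) 2D landmarks from V views of the same frame
    R_views: (V, 3, 3) per-view rotations; t_views: (V, 3) translations
    encoder: maps the stacked 2D views to a single latent code (h_e)
    decoder: maps the code to an aligned 3D shape of size P x 3    (h_d)
    An orthographic projection is assumed here for brevity.
    """
    code = encoder(W_views.reshape(1, -1))      # single shape code phi
    S = decoder(code).reshape(-1, 3)            # canonical 3D shape, (P, 3)

    loss = 0.0
    for v in range(W_views.shape[0]):
        S_cam = S @ R_views[v].T + t_views[v]   # world -> camera v
        W_proj = S_cam[:, :2]                   # orthographic projection
        loss = loss + torch.norm(W_views[v] - W_proj)   # Frobenius residual
    return loss
```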
In our task, we rely on the robustness of MV-NRSfM not only to learn the 3D reconstruction of the labeled training set, but also to detect outliers on the unlabeled set using Eq. 2. At the $t$-th iteration of our self-training, we train $h^t_d$, $h^t_e$ given the current labeled set $\mathcal{S}_{t-1}$ from the previous iteration. We