then used both to re-train $g$, and to train the first iteration of the 2D landmark detector used in the subsequent stages.
Self-training iterations. At each iteration $t$, we define a "labeled" set $S^{t-1}$ which includes all frames that are either manually annotated, or are labeled by the landmark detector $f^{t-1}$ from the previous stage and pass outlier rejection using $g^{t-1}$. We then re-train the landmark detector and the geometric constraint function on the labeled set $S^{t-1}$, which yields a new detector $f^t$ as well as $g^t$. Once trained, inference is run with the detector network $f^t$ over all captured frames. This produces new pseudo labels $\tilde{W}^t_{n,v}$ for all $N$ frames and $V$ views. We then apply the geometric constraint function $g^t$ to evaluate the uncertainty score $y^t_{n,v}$ of each pseudo label. Finally, we define a new labeled set $S^t$ which includes all samples $(n,v)$ whose uncertainty score $y^t_{n,v}$ falls below a certain threshold.
The above process is repeated for a number of iterations. In principle, frames that remain unannotated (rejected by $g^t$) can be actively labeled by humans; in practice, however, we have found this situation to be rare, unless the distance between the captured views is extremely small, which makes it difficult to learn a reasonable 3D shape prior.
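The per-iteration loop above can be sketched as follows; `train_detector`, `train_constraint`, and the threshold are hypothetical stand-ins for the actual training routines and hyperparameters, not the authors' implementation:

```python
def self_train_iteration(labeled_set, all_frames, train_detector,
                         train_constraint, threshold):
    """One self-training iteration (a minimal sketch with hypothetical helpers)."""
    f_t = train_detector(labeled_set)    # re-train landmark detector on S^{t-1}
    g_t = train_constraint(labeled_set)  # re-train geometric constraint g^t
    new_labeled = []
    for n, frame_views in enumerate(all_frames):
        pseudo = [f_t(view) for view in frame_views]  # pseudo labels W~^t_{n,v}
        scores = g_t(pseudo)                          # uncertainty y^t_{n,v} per view
        for v, y in enumerate(scores):
            if y < threshold:                         # keep low-uncertainty samples
                new_labeled.append((n, v, pseudo[v]))
    return f_t, g_t, new_labeled                      # S^t = new_labeled
```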
3.3 Outlier detection using multi-view NRSfM network
Uncertainty score. Our geometric constraint function $g$ is built upon measuring the discrepancy between the detected 2D landmarks and the 3D reconstruction produced by a multi-view NRSfM method. This is in the same spirit as using the reprojection error of triangulation to measure uncertainty, as in prior works. The idea is that if the detected 2D landmarks in all views are correct, we should be able to recover accurate camera poses and 3D structures, and consequently the reprojections of the recovered 3D landmarks match the 2D landmarks. Conversely, if the reprojection error is high, there are errors in the 2D landmarks that prevent a perfect 3D reconstruction. This leads to the following formulation of our uncertainty score,
$$y_{(n,v)} = \big\| \tilde{W}_{(n,v)} - \mathrm{proj}\big(\tilde{T}_{(n,v)} \tilde{S}_n\big) \big\|_F \qquad (2)$$
where $\tilde{T}_{(n,v)}$ and $\tilde{S}_n$ are the estimated camera extrinsics and 3D landmark positions in world coordinates, $\tilde{W}_{(n,v)}$ are the 2D landmarks estimated by the landmark detector, and $\mathrm{proj}$ is the projection function. The effectiveness of the uncertainty score defined by Eq. 2 depends on the reliability of estimating $\tilde{T}_{(n,v)}$ and $\tilde{S}_n$. However, due to the low number of synchronized views as well as noise in $\tilde{W}_{(n,v)}$, simply performing SfM and triangulation gives poor results, as shown in Fig. 4a. This motivates the following use of MV-NRSfM.
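As a concrete illustration, the score in Eq. 2 can be computed as follows under a pinhole camera model; the intrinsics matrix `K` is an assumption here, since the paper's `proj` is not spelled out:

```python
import numpy as np

def uncertainty_score(W_nv, T_nv, S_n, K):
    """Reprojection-error uncertainty score in the spirit of Eq. 2 (a sketch,
    assuming a pinhole camera). W_nv: (P, 2) detected 2D landmarks;
    T_nv: (3, 4) camera extrinsics; S_n: (P, 3) world-frame 3D landmarks;
    K: (3, 3) intrinsics matrix (an assumption, not from the paper)."""
    S_h = np.hstack([S_n, np.ones((len(S_n), 1))])  # homogeneous coords, (P, 4)
    cam = T_nv @ S_h.T                              # camera-frame points, (3, P)
    pix = K @ cam                                   # projective image coords
    proj = (pix[:2] / pix[2]).T                     # perspective divide -> (P, 2)
    return np.linalg.norm(W_nv - proj)              # Frobenius norm of residual
```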
Unsupervised learned MV-NRSfM. Our solution to reliably estimating $\tilde{T}_{(n,v)}$ and $\tilde{S}_n$ is to marry the multi-view geometric constraints with the temporal redundancies across frames, which leads to the adoption of the MV-NRSfM method [3]. Limited by space, we refer the interested reader to their paper for a detailed treatment, and briefly discuss its usage in our problem here. In a nutshell, MV-NRSfM [3] assumes that 3D shapes (concatenations of 3D landmark positions) can be compressed into low-dimensional latent codes if they are properly aligned to a canonical view. MV-NRSfM is then trained to learn a decoder $h_d: \phi \in \mathbb{R}^K \to S \in \mathbb{R}^{P \times 3}$, which maps a low-dimensional code to an aligned 3D shape, as well as an encoder network $h_e: W_1, W_2, \dots, W_V \to \phi$, which estimates a single shape code $\phi$ from the 2D landmarks $W_v \in \mathbb{R}^{P \times 2}$ captured from a number of different views (see Appendix C for the network architecture). Both $h_d$ and $h_e$ are learned by minimizing the reprojection error:
$$\min_{T_{(n,v)},\, h_d,\, h_e} \sum_{(n,v) \in S} \big\| \tilde{W}_{(n,v)} - \mathrm{proj}\big(T_{(n,v)}\, (h_d \circ h_e)(\tilde{W}_{(n,1)}, \tilde{W}_{(n,2)}, \dots, \tilde{W}_{(n,V)})\big) \big\|_F \qquad (3)$$
where $S$ refers to the training set and $\circ$ denotes function composition. Thanks to the constraint imposed by the low-dimensional codes, as well as the convolutional structure of $h_e$ inspired by factorization-based NRSfM methods [18], the learned networks $h_d \circ h_e$ are able to infer reasonable 3D landmark positions from noisy 2D landmark inputs. We provide the network architecture of MV-NRSfM in Fig. 10 of Appendix C.
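The per-frame objective of Eq. 3 might be sketched as follows; `encode`, `decode`, and `proj` are hypothetical stand-ins for $h_e$, $h_d$, and the projection function, not the networks of [3]:

```python
import numpy as np

def nrsfm_reprojection_loss(W_views, encode, decode, T_views, proj):
    """Reprojection loss of Eq. 3 for one frame n (a minimal sketch).
    W_views: list of V arrays of 2D landmarks, one per view;
    encode/decode: stand-ins for h_e and h_d; T_views: per-view extrinsics;
    proj: projection function mapping (T_v, S) to 2D landmarks."""
    phi = encode(W_views)  # single shape code from all V views
    S = decode(phi)        # aligned 3D shape, (P, 3)
    loss = 0.0
    for W_v, T_v in zip(W_views, T_views):
        loss += np.linalg.norm(W_v - proj(T_v, S))  # Frobenius norm per view
    return loss
```

Minimizing this loss over the networks and per-view extrinsics, summed over all $(n,v)$ in the labeled set, recovers the training objective.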
In our task, we rely on the robustness of MV-NRSfM not only to learn the 3D reconstruction of the labeled training set, but also to detect outliers in the unlabeled set using Eq. 2. At the $t$-th iteration of our self-training, we train $h_d^t$, $h_e^t$ given the current labeled set $S^{t-1}$ from the previous iteration. We