1. Introduction
The success of deep learning in semantic segmentation
still relies on great amounts of fully annotated masks (Lit-
jens et al., 2017; Shen et al., 2017; Isensee et al., 2021;
Han et al., 2023; Uslu and Bharath, 2023; Qi et al., 2022).
Annotating segmentation masks incurs high costs in medical imaging because of the expertise and laborious workload required. Scribble-supervised medical image segmentation, which trains networks supervised by scribble annotations only, is a feasible way to reduce that burden. Created by dragging a cursor inside target regions, scribbles are flexible for annotating structures (Tajbakhsh et al., 2020), but provide only sparse labeled pixels while leaving vast regions unlabeled, which poses a primary challenge in algorithm design.
Conventional scribble-supervised segmentation approaches (Lin et al., 2016; Can et al., 2018) iterate between two stages, labeling pseudo-masks and optimizing network parameters: with the masks fixed, the parameters are optimized, and vice versa. However, this paradigm has two major drawbacks. Firstly, it can be trapped in poor local optima, because the networks tend to regress to errors in the initial pseudo-masks and are unable to substantially correct those errors in later iterations. Secondly, it is unwieldy, especially when applied to large datasets. To bypass the iterative process, recent studies have pursued non-iterative alternatives. These non-iterative approaches, which use either a regularizer (Tang et al., 2018a,b), knowledge from full masks (Valvano et al., 2021), or mixed pseudo-masks (Zhang and Zhuang, 2022; Luo et al., 2022a), overlook pure pseudo-masks, as opposed to artificially mixed ones, for network training.
We argue that this line of work can be useful and ask: in a non-iterative method, how, and to what extent, can pure pseudo-masks supervised by scribbles teach a network?
We attempt to answer the first part of the question by means of a siamese architecture (Bromley et al., 1993), which applies two weight-sharing neural networks to two inputs, based on the following analysis. (i) Set up a non-iterative paradigm. With the siamese architecture, this paradigm can be achieved by translating the iterative two-stage process into a single one: one network generates pseudo-masks supervised by scribbles (i.e., labeling) to assimilate the predicted-masks of the other network (i.e., optimizing) during training. (ii) Use pseudo-masks to teach a network. The pseudo-mask, supervised by scribbles, regularizes the network parameters via consistency regularization (a regularizer) that maximizes the similarity between
it and the predicted-mask. An advantage is that these pseudo-masks are diversified rather than of fixed quality, because the continually updated network parameters map images differently at each training step. Each image's pseudo-masks thereby vary between epochs ("pacing"). Fig. 1 shows that the predictions of "PacingPseudo" gradually approximate the ground-truths as the network learns from the steadily improving pseudo-masks throughout training.
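Concretely, the training objective of such a siamese scheme can be sketched as follows; the notation here is our illustrative shorthand rather than the exact formulation developed later. Writing $x_1, x_2$ for two augmented views of an image, $s$ for its scribble annotation, and $f_\theta$ for the weight-sharing network:

```latex
\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{pCE}}\big(f_\theta(x_1),\, s\big)
  + \lambda\, \mathcal{L}_{\mathrm{cons}}\big(\hat{y}_1,\, f_\theta(x_2)\big),
```

where $\mathcal{L}_{\mathrm{pCE}}$ denotes a partial cross-entropy computed only on scribble-labeled pixels, $\hat{y}_1$ is the pseudo-mask derived from $f_\theta(x_1)$, $\mathcal{L}_{\mathrm{cons}}$ is the consistency term that maximizes pseudo/predicted-mask similarity, and $\lambda$ balances the two terms.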
To answer the second part of the question, which concerns the level of performance PacingPseudo can reach, we leverage insights on pseudo-labeling and augmentation from consistency training. Firstly, since labeled pixels are scarce in scribble-supervised segmentation, the output pseudo-masks remain uncertain. Xie et al. (2020a); Berthelot et al. (2019b,a); Sohn et al. (2020) use artificial post-processing (e.g., thresholding, sharpening, or argmax) to obtain high-confidence pseudo labels, whereas MeanTeacher (Yu et al., 2019) takes a self-ensembling model's predictions as pseudo-masks. However, we empirically find these approaches to be of limited effectiveness in our task, whereas entropy regularization (Grandvalet and Bengio, 2004), which regularizes pseudo-masks end-to-end, performs satisfactorily. We then provide an analysis of these findings. Secondly, augmentation is critical because it creates a discrepancy between the pseudo-mask and predicted-mask branches that enables consistency regularization. Previous studies have promoted advanced augmentation techniques (Berthelot et al., 2019a; Xie et al., 2020a; Sohn et al., 2020) or spatial augmentations (Bortsova et al., 2019; Patel and Dolz, 2022). In contrast, inspired by recent findings in representation learning (Chen et al., 2020; Grill et al., 2020), where augmentation serves a similar objective of creating different views of an image (a positive pair) for assimilation, our study investigates a composition of distortion augmentations, which can be more suitable and convenient for consistency-training-based scribble-supervised segmentation.
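To make these two ingredients concrete, the following minimal NumPy sketch combines a cross-entropy consistency term with the entropy regularizer of Grandvalet and Bengio (2004). The distortion function, the linear stand-in network, and the coefficient values are hypothetical illustrations, not the actual PacingPseudo implementation.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the class axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distort(image, rng):
    """Hypothetical intensity distortions (gamma shift + noise), standing in
    for the composition of distortion augmentations discussed above."""
    gamma = rng.uniform(0.8, 1.2)
    noise = rng.normal(0.0, 0.02, size=image.shape)
    return np.clip(image, 0.0, 1.0) ** gamma + noise

def consistency_step(image, forward, rng, lam=1.0, mu=0.1):
    """One siamese consistency-training step (sketch, not the authors' code):
    two distorted views pass through the same weight-sharing network; the
    first view's softmax acts as the pseudo-mask that the second view's
    prediction is pulled towards, plus an entropy regularizer that keeps
    the pseudo-mask confident end-to-end."""
    p1 = softmax(forward(distort(image, rng)))  # pseudo-mask branch
    p2 = softmax(forward(distort(image, rng)))  # predicted-mask branch
    eps = 1e-8
    cons = -(p1 * np.log(p2 + eps)).sum(axis=-1).mean()  # consistency term
    ent = -(p1 * np.log(p1 + eps)).sum(axis=-1).mean()   # entropy regularizer
    # the partial cross-entropy on scribble-labeled pixels would be added here
    return lam * cons + mu * ent

# toy usage: a 4x4 one-channel "image", 3 classes, a random linear "network"
rng = np.random.default_rng(0)
W = rng.normal(size=(1, 3))
image = rng.uniform(size=(4, 4, 1))
loss = consistency_step(image, lambda x: x @ W, rng)
assert np.isfinite(loss) and loss >= 0.0
```

In a real pipeline the pseudo-mask branch would be detached from the gradient and the scribble-supervised term included; the sketch only illustrates how the consistency and entropy terms compose.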
We benchmark PacingPseudo on three public med-
ical image datasets: CHAOS T1 and T2 (abdomi-
nal multi-organs) (Kavur et al., 2021), ACDC (car-
diac structures) (Bernard et al., 2018), and LVSC (my-