the domain gap for cross-domain few-shot learning (CD-FSL) has
attracted increasing attention recently [36, 37].
Building on top of FSL methods, the key challenge of CD-FSL is
to improve the model’s generalization ability. There are mainly two
groups of methods. The first group has no access to data in the target
domain and relies on extracting more discriminative features via
adversarial training or disentangled representation learning. For instance,
Wang and Deng [42] introduce adversarial task augmentation to improve
the robustness of the inductive bias across domains. Fu et al. [10]
decompose images into low-frequency and high-frequency components
to span the style distributions of the source domain.
However, the performance of the above methods remains unsatisfactory.
Thus, to achieve superior performance, some methods introduce
target domain data. The basic idea is to mitigate the domain gap
through data augmentation. Beyond the source domain, Das et al. [6]
and Liang et al. [22] further fine-tune their models on unlabeled data
in the target domain via self- or semi-supervised methods, while [8]
demonstrates the effectiveness of using very few labeled target data.
Considering the acceptable cost of collecting limited labeled data in
practice, we advocate this direction.
In this paper, we propose to investigate the mixup technique [48]
to efficiently use a small amount of labeled target domain data
during training for CD-FSL. Mixup is an easy-to-apply data augmentation
method that conducts linear interpolation between source domain and
target domain data. The mixed data thus lies between the source and
target domains, and we refer to it as the intermediate domain throughout
the paper. Training on data from this intermediate domain not only
reconciles the different data distributions of the two domains, but also
improves the model's generalization ability.
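To make the interpolation concrete, a minimal sketch is given below; the helper name, argument names, and the assumption of one-hot labels are ours rather than the paper's notation.

```python
import torch

def mix_domains(x_src, y_src, x_tgt, y_tgt, lam):
    """Interpolate a source batch with a target batch at mix ratio `lam`.

    `lam` lies in (0, 1): values near 1 keep the mixed (intermediate-domain)
    data close to the source domain, values near 0 close to the target domain.
    Labels are assumed to be one-hot so they can be mixed the same way.
    """
    x_mix = lam * x_src + (1.0 - lam) * x_tgt
    y_mix = lam * y_src + (1.0 - lam) * y_tgt
    return x_mix, y_mix
```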
Although many works have demonstrated the effectiveness of mixup
in various tasks, applying it to CD-FSL faces a severe data imbalance
issue that raises a key challenge: how to set the mix ratio of source
domain data to target domain data. Putting too much weight on the
limited labeled data in the target domain leads to over-fitting, while
putting too little weight on the auxiliary target data may not help
domain adaptation.
To show the significant impact of the mix ratio on specific CD-FSL tasks,
we conduct a pilot study that uses the proposed base network Mixup-3T
as the model and Mini-ImageNet [28] and Places [49] as the datasets.
As shown in Fig. 1, accuracy fluctuates greatly as the mix ratio varies;
that is, an optimal mixup strategy helps to achieve good performance.
Since the optimal mix ratio may differ across datasets and tasks,
choosing it manually is tedious. Therefore, further investigation of the
mix ratio is necessary, and a well-designed optimization strategy can be
beneficial here.
To address these challenges, we propose a novel Target Guided Dynamic
Mixup (TGDM) framework that controls the intermediate domain during
training for CD-FSL. By dynamically choosing a suitable mix ratio, TGDM
boosts the performance on novel classes without harming the performance
on base classes. There are two core components: the classification
network Mixup-3T and the Dynamic Ratio Generation Network (DRGN). The
basic idea is to generate the mix ratio under target guidance and to
utilize the intermediate domain effectively. First, based on the current
mixed data, we optimize Mixup-3T via a tri-task learning mechanism
involving source, target, and intermediate domain classification tasks.
The source and target classification tasks aim at better performance on
their respective domains, while the intermediate classification task
improves the generalization ability. Second, DRGN learns to produce a
target guided mix ratio that guides the generation of intermediate
domain data. Specifically, we perform a pseudo backward propagation of
Mixup-3T to validate its performance on auxiliary target data, and the
resulting loss is utilized to update DRGN.
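The sketch below illustrates one possible form of such a training step, assuming PyTorch >= 2.0; the domain-gap statistic fed to DRGN, the single scalar ratio, the shared classification head, and the plain SGD virtual update are all simplifying assumptions of ours and not necessarily the exact procedure of TGDM.

```python
import torch
import torch.nn.functional as F
from torch.func import functional_call  # requires PyTorch >= 2.0

def tgdm_style_step(mixup3t, drgn, opt_model, opt_drgn,
                    x_src, y_src, x_tgt, y_tgt, x_val, y_val,
                    inner_lr=1e-2):
    """One simplified TGDM-style step (a sketch, not the paper's exact algorithm).

    Assumes NCHW image batches and, for brevity, a single shared
    classification head for source, target, and intermediate data.
    """
    # 1) DRGN proposes a mix ratio in (0, 1). The channel-mean domain-gap
    #    statistic used as its input is an illustrative choice only.
    gap = (x_src.mean(dim=(0, 2, 3)) - x_tgt.mean(dim=(0, 2, 3))).detach()
    lam = torch.sigmoid(drgn(gap)).mean()

    # 2) Intermediate-domain data and the tri-task loss
    #    (source, target, and intermediate classification).
    x_mix = lam * x_src + (1 - lam) * x_tgt
    logits_mix = mixup3t(x_mix)
    loss_tri = (F.cross_entropy(mixup3t(x_src), y_src)
                + F.cross_entropy(mixup3t(x_tgt), y_tgt)
                + lam * F.cross_entropy(logits_mix, y_src)
                + (1 - lam) * F.cross_entropy(logits_mix, y_tgt))

    # 3) Pseudo backward propagation: a differentiable virtual SGD step on
    #    Mixup-3T, kept in the graph so that the loss on auxiliary target
    #    data can flow back into DRGN through the mix ratio.
    names, params = zip(*mixup3t.named_parameters())
    grads = torch.autograd.grad(loss_tri, params, create_graph=True)
    virtual = {n: p - inner_lr * g for n, p, g in zip(names, params, grads)}
    loss_val = F.cross_entropy(functional_call(mixup3t, virtual, (x_val,)), y_val)

    # 4) Update DRGN with the auxiliary-target loss, then update the real
    #    model with the tri-task loss.
    opt_drgn.zero_grad()
    loss_val.backward(retain_graph=True)
    opt_drgn.step()

    opt_model.zero_grad()
    loss_tri.backward()
    opt_model.step()
    return loss_tri.item(), loss_val.item()
```

In practice the virtual update would typically mirror the model's actual optimizer and the three classification tasks would use separate heads; the sketch keeps a single first-order SGD step for readability.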
Overall, our contributions are summarized as follows: 1) We
propose a novel target guided dynamic mixup (TGDM) framework
that leverages the target data to dynamically control the mix ratio
for better generalization ability in cross-domain few-shot learning.
2) We propose a Mixup-3T network that utilizes the dynamically mixed
data as the intermediate domain to better transfer knowledge between
the source domain and the target domain. 3) We conduct extensive
experiments on several benchmarks, and the experimental results
demonstrate the effectiveness of our framework.
2 RELATED WORK
2.1 Few-Shot Learning
Few-shot learning aims at learning new concepts with very few
samples. Many efforts have been made in this field. These methods
are mainly divided into three categories: model initialization [7, 30],
metric learning [5, 13, 34, 40], and data augmentation [5, 9, 11, 21].
More recently, Zhou et al. [50] apply a Similarity Ratio to weight the
importance of base classes and thus select the optimal ones. Ji et al.
[18] propose a Modal-Alternating Propagation Network to rectify
visual features with semantic class attributes. Yan et al. [45] adopt a
bi-level meta-learning optimization framework to select samples.
These methods obtain training and testing images from the same
domain. In this paper, we stick to metric learning and bi-level meta-
learning but formulate them under the cross-domain scenario with
few labeled data.
2.2 Cross-Domain Few-Shot Learning
Cross-domain few-shot learning (CD-FSL) aims to perform few-shot
classification under the setting where the training and testing data
come from different domains. This task was formally defined and
proposed by [37]. Later, more benchmarks were proposed by [15].
According to whether the target dataset is used during the training phase
[8, 23, 24, 27, 33, 46] or not [10, 22, 37], CD-FSL methods can be
divided into two groups. For training without target data, Wang and
Deng [42] apply adversarial training to augment data and improve
the robustness of the inductive bias. Fu et al. [10] believe that style
contains domain-specific information, so they transfer styles between
two training episodes and apply self-supervised learning to make the
network ignore style transformations. Generally, due to the lack of
target data, the performance of these methods is lower than that of
methods using target data in model training. As a result, some
researchers fine-tune their models on the target support set. For
example, Liang et al. [22] propose NSAE to enhance features with noise,
and Das et al. [6] apply a contrastive loss; both works eventually
fine-tune their models with support data in the target domain.
[24, 27, 46] further introduce unlabeled data and turn to additional
self-supervised learning tasks on unlabeled target data. Lin et al. [23]
integrate several SOTA modules and