
only a batch of unlabeled target samples. This paradigm
has yielded promising results in the domain generalization
setting [40, 39] because it alleviates the main challenges
of domain generalization: the lack of information about the
target domain, and the requirement to be robust, in advance,
to every possible shift simultaneously.
Test-time adaptation methods, however, suffer from a
drawback that limits their adaptation capability and that
can only be corrected at training time. Indeed, with a
standard training procedure, only a subset of the predictive
patterns is learned, corresponding to the most obvious
and efficient ones, while the less predictive patterns are
disregarded entirely [32, 13, 28, 12, 30, 2, 8]. This flaw,
known as shortcut learning, originates from gradient
descent optimization [28] and prevents a test-time method
from using all the available patterns. Combining a
training-time approach that seeks pattern diversity with a
test-time adaptation method may thus lead to improved
results. In this paper, we show that the combined use of
test-time batch normalization, a simple test-time adaptation
method, with state-of-the-art single-source domain
generalization methods (which are often designed to discover
normally unused patterns) does not systematically improve
results on the PACS benchmark [22] in the
single-source setting. Similar experiments on Office-Home
[35] lead to the same conclusion, with only a few methods
performing better than the standard training procedure. We thus
propose a new method, named L2GP, which encourages
a network to learn new predictive patterns rather than
exploiting and refining already learned ones, and demonstrate
its effectiveness on both the PACS and Office-Home
benchmarks. To find such patterns, we propose to look for
predictive patterns that are less generalizable than the
naturally learned ones, through a secondary classifier endowed
with a shortcut avoidance loss, thereby leading to learning
semantically different patterns. These less generalizable
patterns match the ones normally ignored because of the
simplicity bias of deep networks, which promotes the learning
of a representation with a high generalization capability
[18, 6]. Our method adds two classifiers to a
feature extractor and trains them asymmetrically, using
a data-dependent regularization, i.e., the shortcut avoidance
loss, that slightly encourages memorization rather than
generalization by learning batch-specific patterns, i.e.,
patterns that lower the loss on the running batch but have a
limited effect on the other batches of data.
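Test-time batch normalization, the adaptation method used throughout, replaces the running statistics accumulated on the source domain with statistics computed on the incoming target batch. A minimal pure-Python sketch of the normalization step (learned affine parameters omitted; the values are purely illustrative):

```python
from statistics import mean, pvariance

def bn_normalize(batch, mu, var, eps=1e-5):
    """BatchNorm normalization step; the learned scale/shift is omitted."""
    return [(x - mu) / (var + eps) ** 0.5 for x in batch]

# Running statistics frozen after training on the source domain.
running_mu, running_var = 0.0, 1.0

# A target batch whose activations have drifted (covariate shift).
target_batch = [4.8, 5.1, 5.3, 4.9, 5.0, 5.2]

# Standard inference: source statistics leave the activations off-center.
standard = bn_normalize(target_batch, running_mu, running_var)

# Test-time BN: recompute the statistics on the target batch itself,
# which re-centers and re-scales the activations for the new domain.
adapted = bn_normalize(target_batch, mean(target_batch), pvariance(target_batch))
```

In practice this amounts to keeping the normalization layers in batch-statistics mode at inference, at the cost of requiring a batch of target samples rather than a single image.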
To summarize, our contribution is threefold:
• To the best of our knowledge, we are the first to inves-
tigate the effect of training-time single-source methods
on a test-time adaptation strategy. We show that these
methods usually do not increase performance and can even
have an adverse effect.
Figure 1. Schema of our bi-headed architecture. The naming
convention is the same as the one used in Algorithm 1.
• We apply, for the first time, several state-of-the-art
single-source domain generalization algorithms to the
more challenging and rarely used Office-Home bench-
mark and show that very few yield a robust cross-
domain representation.
• We propose an original algorithm to learn a larger-than-
usual subset of predictive features and show that, com-
bined with test-time batch normalization, it yields re-
sults surpassing the existing state of the art.
2. Related Works
2.1. Single-Source Domain Generalization
Most domain generalization algorithms require several
identified domains to enforce some level of distributional
invariance. Because this hypothesis is unrealistic in some
situations (such as healthcare- or defense-related tasks),
methods have been developed to handle domain shift with
only a single domain available during training.
Some of them rely on an invariance hypothesis about the
domain shift. A commonly used one is the texture shift
hypothesis: many domain shifts are primarily texture
shifts, so style-transfer-based data augmentation improves
generalization, whether applied explicitly by training the
model on stylized images [38, 19] or implicitly within the
internal representation of the network [43, 27]. Such
methods are limited to situations where the encountered shift
is indeed of the hypothesized nature. Other methods aim to
learn a larger set of predictive patterns to make the network
more robust should one or several training-time predictive
patterns be missing at test-time. Volpi et al. [36] and Zhang
et al. [44] propose to incrementally add to the training
dataset adversarial images crafted to maximize the
classification error of the network. These images no longer
contain the original obvious predictive patterns, which forces
the learning of new patterns. These strategies are inspired by ad-