Linguistic knowledge in BERT is known to be spread widely across all of its layers (Tenney et al., 2019). In this regard, leveraging intermediate layers can lead to better OOD detection performance, as they retain information complementary to the last-layer feature, which can help discriminate outliers. Several studies (Shen et al., 2021; Sastry and Oore, 2020; Lee et al., 2018b) have provided empirical evidence that intermediate representations are indeed beneficial for detecting outliers. Specifically, they utilize the middle layers by naïvely aggregating the individual scoring result of every single intermediate feature explicitly, as sketched below.
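For concreteness, the following is a minimal sketch of such an explicit ensemble, assuming a HuggingFace-style encoder that exposes per-layer hidden states; the [CLS] pooling and the simple score averaging are illustrative assumptions rather than the exact recipe of any cited work.

import torch

def explicit_ensemble_score(model, inputs, score_fn):
    """Naive explicit ensemble: score every layer, then aggregate.

    Note that score_fn (e.g., a Mahalanobis- or softmax-based
    detector) must be evaluated once per layer, so detection cost
    grows linearly with the number of layers."""
    outputs = model(**inputs, output_hidden_states=True)
    # One (batch, hidden_dim) feature per layer, via [CLS] pooling.
    feats = [h[:, 0] for h in outputs.hidden_states]
    layer_scores = [score_fn(f) for f in feats]   # K scoring calls
    return torch.stack(layer_scores).mean(dim=0)  # simple averaging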
Although previous studies have shown the potential of intermediate-layer representations in OOD detection, we confirmed that the aforementioned naïve ensemble scheme suffers from several problems (Fig. 1 illustrates the layer-wise OOD performance and that of their explicit ensemble on two different datasets). The first problem we observed is that neither the ensemble result (red bar) nor the last layer is guaranteed to perform best among all layers; which layer performs best depends on the setting. This phenomenon raises the necessity for a more elaborate approach that derives a more meaningful ensemble representation from the various layer representations, rather than the current simple summation or single-layer selection. Secondly, even when the explicit ensemble yields sound performance, it requires multiple computations of the scoring function by design. The explicit ensemble thus inevitably increases detection latency, which is a critical shortcoming in OOD detection, where swift and precise decision-making is the cornerstone.
To remedy the limitations of explicit ensemble schemes, we propose a novel framework dubbed Layer-agnostic Contrastive Learning (LaCL). Our framework is inspired by the foundation of ensembling, which seeks a better-calibrated output by combining heterogeneous decisions from multiple models (Kuncheva and Whitaker, 2003; Gashler et al., 2008). Specifically, LaCL regards intermediate layers as independent decision-makers and assembles them into a single vector to yield a more accurate prediction: LaCL makes middle-layer representations richer and more diverse by injecting the benefits of contrastive learning (CL) into intermediate layers, while discouraging inter-layer representations from becoming similar through an additional regularization loss. LaCL then assembles them implicitly into a single ensemble representation, circumventing multiple computations of the scoring function; a sketch of such a training objective follows.
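To make the idea concrete, below is a minimal PyTorch sketch of how such an objective might be instantiated. The NT-Xent-style per-layer contrastive term, the cosine-similarity penalty between layer pairs, and the concatenation-based ensemble vector are our illustrative assumptions for exposition, not necessarily the exact LaCL formulation.

import torch
import torch.nn.functional as F

def nt_xent(z1, z2, tau=0.1):
    """Standard NT-Xent contrastive loss between two augmented views."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)   # (2N, d)
    sim = z @ z.t() / tau                          # pairwise similarities
    sim.fill_diagonal_(float('-inf'))              # exclude self-pairs
    n = z1.size(0)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def lacl_style_loss(layer_feats_v1, layer_feats_v2, lam=0.1):
    """Per-layer CL plus an inter-layer regularizer (illustrative).

    layer_feats_v*: lists of (batch, dim) features, one per encoder
    layer, from two views of the same inputs."""
    cl = sum(nt_xent(h1, h2) for h1, h2 in zip(layer_feats_v1, layer_feats_v2))
    cl = cl / len(layer_feats_v1)
    # Penalize high cosine similarity between different layers,
    # pushing them to encode complementary information.
    feats = [F.normalize(h, dim=-1) for h in layer_feats_v1]
    reg, pairs = 0.0, 0
    for i in range(len(feats)):
        for j in range(i + 1, len(feats)):
            reg = reg + (feats[i] * feats[j]).sum(-1).mean()
            pairs += 1
    return cl + lam * reg / max(pairs, 1)

def implicit_ensemble_representation(layer_feats):
    """Assemble layer features into one vector, scored only once."""
    return torch.cat([F.normalize(h, dim=-1) for h in layer_feats], dim=-1)

At test time, only the single assembled representation is passed to the scoring function, so detection cost stays constant regardless of the number of layers.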
We demonstrate the effectiveness of our approach in 9 different OOD scenarios, where LaCL consistently surpasses other competitive methods and their explicit ensembles by a significant margin. Moreover, we conduct an in-depth analysis of LaCL to elucidate its behavior in line with our intuition.
2 Related Work
OOD detection.
Methodologies in OOD detection can be divided into supervised (Hendrycks et al., 2019; Lee et al., 2018a; Dhamija et al., 2018) and unsupervised settings, according to the presence of training data from OOD. Since the scope of OOD covers a nigh-infinite space, gathering data that covers the whole OOD space is infeasible. For this realistic reason, most recent OOD detection studies, including this work, discriminate OOD inputs in an unsupervised manner. Numerous branches of machine learning techniques are employed for unsupervised OOD detection: generating pseudo-OOD data (Chen and Yu, 2021; Zheng et al., 2020), Bayesian methods (Malinin and Gales, 2018), self-supervised learning based approaches (Moon et al., 2021; Manolache et al., 2021; Li et al., 2021; Zhou et al., 2021; Zeng et al., 2021; Zhan et al., 2021), and novel scoring functions that measure the uncertainty of a given input (Hendrycks and Gimpel, 2017; Lee et al., 2018b; Liu et al., 2020; Tack et al., 2020).
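As a concrete illustration, two of the scoring functions cited above can be written in a few lines, given access to a classifier's logits (in both cases, a higher score indicates the input is more likely in-distribution):

import torch

def msp_score(logits):
    """Maximum softmax probability (Hendrycks and Gimpel, 2017)."""
    return torch.softmax(logits, dim=-1).max(dim=-1).values

def energy_score(logits, T=1.0):
    """Negative free energy (Liu et al., 2020): T * logsumexp(logits / T)."""
    return T * torch.logsumexp(logits / T, dim=-1)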
Contrastive learning & OOD detection.
Among the numerous approaches mentioned above, contrastive learning (CL) based methods (Chen et al., 2020; Zbontar et al., 2021; Grill et al., 2020) have recently attracted predominant interest in OOD detection research. The strength of CL in OOD detection comes from the fact that it can guide a neural model to learn semantic similarity between data instances. This property is particularly valuable for unsupervised OOD detection, as there is no accessible clue regarding outliers or their distribution.
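For reference, the instance-discrimination objective of Chen et al. (2020), on which many of these methods build, pulls two augmented views $z_i, z_j$ of the same instance together while pushing apart all other instances in the batch:

$\ell_{i,j} = -\log \dfrac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}{\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)}$,

where $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity, $\tau$ is a temperature, and $2N$ is the number of views in a batch.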
Despite its potential, early works utilized CL mainly in the computer vision field (Cho et al., 2021; Sehwag et al., 2021; Tack et al., 2020; Winkens et al., 2020) due to its heavy reliance on data augmentation. However, it is now also widely used in various NLP applications thanks to recent progress (Li et al., 2021; Liu et al., 2021; Kim et al., 2021; Carlsson et al., 2020; Gao et al., 2021; Sennrich et al., 2016). Specifically, Li et al. (2021)