
tation and self-supervised domain adaptation.
In this paper, we propose a novel self-supervised domain adaptation framework for extractive QA called QADA. Our QADA framework is designed to handle domain shifts and should thus be able to answer out-of-domain questions. QADA has three stages, namely pseudo labeling, hidden space augmentation, and self-supervised domain adaptation. First, we use pseudo labeling to generate and filter labeled target QA data. Next, the augmentation component integrates a novel pipeline for data augmentation to enrich training samples in the hidden space. For questions, we build upon multi-hop synonyms and introduce Dirichlet neighborhood sampling in the embedding space to generate augmented tokens. For contexts, we develop an attentive context cutoff method that learns to drop context spans via a sampling strategy based on attention scores. Third, we train the QA model via a novel attention-based contrastive adaptation: specifically, we use the attention weights to sample informative features that help the QA model separate answers and generalize across the source and target domains.
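To make the two augmentation ideas concrete, the sketch below illustrates them in NumPy: a token embedding is replaced by a convex combination of itself and its synonym-neighborhood embeddings with Dirichlet-sampled weights, and a contiguous span of context hidden states is zeroed out, with the span start sampled in proportion to attention scores. The function names, the concentration parameter `alpha`, and the fixed `cutoff_len` are our own illustrative assumptions, not the actual implementation, which operates on the QA model's hidden states during training.

```python
import numpy as np

def dirichlet_neighborhood_sample(token_emb, neighbor_embs, alpha=1.0, rng=None):
    """Replace a token embedding by a convex combination of itself and its
    (multi-hop) synonym embeddings, with mixing weights drawn from a
    symmetric Dirichlet(alpha) distribution."""
    if rng is None:
        rng = np.random.default_rng()
    embs = np.vstack([token_emb, neighbor_embs])         # (k + 1, d)
    weights = rng.dirichlet(alpha * np.ones(len(embs)))  # non-negative, sums to 1
    return weights @ embs                                # (d,)

def attentive_context_cutoff(hidden, attn_scores, cutoff_len=3, rng=None):
    """Zero out a contiguous span of context hidden states, sampling the
    span start with probability proportional to the attention scores."""
    if rng is None:
        rng = np.random.default_rng()
    probs = attn_scores / attn_scores.sum()
    start = rng.choice(len(hidden), p=probs)
    out = hidden.copy()
    out[start:start + cutoff_len] = 0.0
    return out
```

Because the Dirichlet weights sum to one, each augmented token stays inside the convex hull of the original token and its neighbors, so the augmentation perturbs rather than replaces the token's meaning.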
The main contributions of our work are:1
1. We propose a novel, self-supervised framework called QADA for domain adaptation in QA. QADA aims at answering out-of-domain questions and should thus handle the domain shift upon deployment in an unseen domain.
2. To the best of our knowledge, QADA is the first work in QA domain adaptation that (i) leverages hidden space augmentation to enrich training data; and (ii) integrates attention-based contrastive learning for self-supervised adaptation.
3. We demonstrate the effectiveness of QADA in an unsupervised setting where target answers are not accessible. Here, QADA considerably outperforms state-of-the-art baselines on multiple datasets for QA domain adaptation.
2 Related Work
Extractive QA has achieved significant progress recently (Devlin et al., 2019; Kratzwald et al., 2019; Lan et al., 2020; Zhang et al., 2020). Yet, the accuracy of QA models can drop drastically under domain shifts; that is, when deployed in an unseen domain that differs from the training distribution (Fisch et al., 2019; Talmor and Berant, 2019).
1 The code for our QADA framework is publicly available at https://github.com/Yueeeeeeee/Self-Supervised-QA.
To overcome the above challenge, various approaches for QA domain adaptation have been proposed, which can be categorized as follows. (1) (Semi-)supervised adaptation uses partially labeled data from the target distribution for training (Yang et al., 2017; Kratzwald and Feuerriegel, 2019b; Yue et al., 2022a). (2) Unsupervised adaptation with question generation refers to settings where only context paragraphs in the target domain are available; QA samples are generated separately to train the QA model (Shakeri et al., 2020; Yue et al., 2021b). (3) Unsupervised adaptation has access to context and question information from the target domain, whereas answers are unavailable (Chung et al., 2018; Cao et al., 2020; Yue et al., 2022d). In this paper, we focus on the third category and study the problem of unsupervised QA domain adaptation.
Domain adaptation for QA: Several approaches have been developed to generate synthetic QA samples via question generation (QG) in an end-to-end fashion (i.e., seq2seq) (Du et al., 2017; Sun et al., 2018). Leveraging such samples from QG can also improve the QA performance in out-of-domain distributions (Golub et al., 2017; Tang et al., 2017, 2018; Lee et al., 2020; Shakeri et al., 2020; Yue et al., 2022a; Zeng et al., 2022a). Given unlabeled questions, there are two main approaches: domain adversarial training can be applied to reduce feature discrepancy between domains (Lee et al., 2019; Cao et al., 2020), while contrastive adaptation minimizes the domain discrepancy using maximum mean discrepancy (MMD) (Yue et al., 2021b, 2022d). We later use the idea from contrastive learning but tailor it carefully for our adaptation framework.
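As background, the MMD mentioned above measures the distance between two feature distributions from their samples. The minimal NumPy sketch below shows a generic (biased) empirical MMD estimator with a Gaussian kernel; the function names and the fixed bandwidth `sigma` are our assumptions for illustration, not the exact objective used in the cited works.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """Pairwise Gaussian (RBF) kernel matrix between rows of x and y."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

def mmd(source, target, sigma=1.0):
    """Biased empirical estimate of squared MMD between two samples:
    mean within-source kernel + mean within-target kernel
    - 2 * mean cross kernel."""
    return (gaussian_kernel(source, source, sigma).mean()
            + gaussian_kernel(target, target, sigma).mean()
            - 2 * gaussian_kernel(source, target, sigma).mean())
```

Minimizing such a term over source and target features pulls the two domains' representations together, which is the intuition behind the contrastive adaptation approaches discussed above.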
Data augmentation for NLP: Data augmentation for NLP aims at improving language understanding with diverse data samples. One approach is to apply token-level augmentation and enrich the training data with simple techniques (e.g., synonym replacement, token swapping) (Wei and Zou, 2019) or custom heuristics (McCoy et al., 2019). Alternatively, augmentation can be done in the hidden space of the underlying model (Chen et al., 2020). For example, one can drop partial spans in the hidden space, which aids generalization performance under distributional shifts (Chen et al.,