
natural language inference (Bowman et al., 2015), and commonsense reasoning (Lourie et al., 2021). As the number of tasks scales up, the training of PrLMs becomes more vulnerable to negative transfer due to severe inconsistencies in domain and data distribution between tasks (Wu et al., 2020b; Padmakumar et al., 2022). As one of the key concepts underlying MTL, task relationships potentially provide a principled basis for employing PrLMs in a more effective and interpretable way.
To handle the issue of negative transfer during multi-task learning, early studies take task relationships into account by employing a dual-process model architecture composed of a shared encoder and task-specific layers; the two parts are designed to integrate the features common to all learning tasks and to capture task relationships in a predefined manner, respectively (Zheng et al., 2019; Liu et al., 2019a; Bai et al., 2020; Ma et al., 2021). However, these methods require additional modifications to the model architecture and increase model complexity and computation cost, making them suboptimal for PrLMs in terms of both generality and computational overhead.
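As a rough illustration (not the architecture of any particular cited system), such a design can be sketched in PyTorch as a shared encoder routed into per-task classification heads; the module and parameter names below are hypothetical:

```python
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    """Hard parameter sharing: one shared encoder, one lightweight head per task."""

    def __init__(self, encoder, hidden_size, task_num_labels):
        super().__init__()
        self.encoder = encoder  # any encoder-style PrLM returning token-level hidden states
        # task-specific layers: a separate linear classifier for each task
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_size, num_labels)
            for task, num_labels in task_num_labels.items()
        })

    def forward(self, task_name, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        pooled = hidden[:, 0]                 # [CLS] position as the sequence representation
        return self.heads[task_name](pooled)  # route through the current task's head
```

During training, batches from different tasks are interleaved, so all gradients update the shared encoder while each head only receives gradients from its own task.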
All the considerations above motivate our goal of investigating simple yet effective ways to measure task relationships without additional cost while keeping the generality of PrLMs. In this work, we propose a prefix-guided multi-task learning framework (CompassMTL) to explore the mutual effects between tasks (Figure 1) and to improve model performance with complementary tasks. Targeting natural language understanding (NLU) tasks, we employ a discriminative PrLM, also known as an encoder-only PrLM, as the backbone model and train it on 40 tasks; as this work focuses on NLU tasks, our empirical studies show that encoder-only PrLMs are competitive here, though they may lose generalizability on natural language generation tasks. Experimental results show that our model achieves human-parity performance on commonsense reasoning tasks. We further probe the task relationships entailed in the task prefix representations, finding that the measured relationships correlate highly with task-to-task transfer performance and also provide a useful reference for optimizing the PrLM on a target task with its complementary tasks during MTL, i.e., achieving better performance with fewer tasks.
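To make the idea concrete, the sketch below shows one possible realization of the general mechanism (hypothetical task names; not necessarily the exact CompassMTL formulation): a trainable prefix embedding per task is prepended to the input representation, and after multi-task training the pairwise similarity of the learned prefixes serves as a task-relationship estimate.

```python
import torch
import torch.nn.functional as F

tasks = ["hellaswag", "alpha_nli", "boolq"]   # placeholder task names
hidden_size = 768
# one trainable prefix vector per task, learned jointly with the backbone PrLM
prefix = torch.nn.Embedding(len(tasks), hidden_size)

def prepend_task_prefix(token_embeds, task_id):
    """token_embeds: (batch, seq_len, hidden); returns (batch, seq_len + 1, hidden)."""
    batch_size = token_embeds.size(0)
    p = prefix.weight[task_id].expand(batch_size, 1, -1)
    return torch.cat([p, token_embeds], dim=1)

# After multi-task training, probe task relationships from the learned prefixes.
with torch.no_grad():
    e = F.normalize(prefix.weight, dim=-1)
    task_similarity = e @ e.t()   # (num_tasks, num_tasks) cosine similarities
```

Task pairs whose prefixes end up close in this space would be expected to transfer well to each other, mirroring the correlation with task-to-task transfer performance described above.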
In summary, our contributions are threefold: 1) A unified discriminative multi-task PrLM for NLU tasks will be released as a strong counterpart to the dominant T5-based encoder-decoder PrLMs trained with MTL.
2) A probing method that uses task prefixes to explore task relationships in large-scale MTL. We observe that the task relationships reflected by the prefixes correlate with transfer learning performance, and they help our model achieve better results with complementary tasks.
3) State-of-the-art results on a variety of NLU
tasks, especially human-parity benchmark perfor-
mance on commonsense reasoning leaderboards,
i.e., HellaSwag and αNLI.
2 Background and Related Work
2.1 Self-supervised Pre-training
PrLMs are commonly pre-trained on large-scale corpora and then fine-tuned on individual tasks. One of the most widely used pre-training tasks is masked language modeling (MLM), which first masks out some tokens from the input sentences and then trains the model to predict them from the remaining tokens. Derivatives of MLM include permuted language modeling in XLNet (Yang et al., 2019) and sequence-to-sequence MLM in MASS (Song et al., 2019) and T5 (Raffel et al., 2019).
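For concreteness, a minimal sketch of the MLM objective (simplified to uniform random masking, omitting the usual 80/10/10 replacement scheme and special-token handling):

```python
import torch

def mask_for_mlm(input_ids, mask_token_id, mlm_prob=0.15):
    """Return corrupted inputs and labels; the loss is computed only at masked positions."""
    labels = input_ids.clone()
    masked = torch.rand(input_ids.shape) < mlm_prob
    labels[~masked] = -100                  # ignore index for unmasked positions
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id       # replace selected tokens with [MASK]
    return corrupted, labels

# The model then predicts the original tokens from the remaining context, e.g.:
# loss = F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
```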
Beyond general-purpose pre-training, domain-adaptive pre-training and task-adaptive pre-training have attracted attention in recent studies.
1) Domain-adaptive Pre-training. To incorporate specific in-domain knowledge, domain-adaptive pre-training directly post-trains the original PrLMs on a domain-specific corpus. Popular models have been proposed in the dialogue domain (Whang et al., 2020; Wu et al., 2020a), as well as in the medical and scientific domains (Lee et al., 2020; Beltagy et al., 2019; Huang et al., 2019a; Yu et al., 2022).
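Operationally, such post-training usually amounts to continuing the MLM objective on an in-domain corpus; a minimal sketch with the HuggingFace Transformers API (the checkpoint, file name, and hyperparameters below are placeholders, not the setup of any cited work):

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# "domain_corpus.txt" stands in for the raw in-domain text collection
corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-adapted-prlm", num_train_epochs=1),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()   # continued MLM pre-training on the domain corpus
```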
2) Task-adaptive Pre-training. The goal of task-adaptive pre-training is to capture task-specific skills by devising dedicated pre-training tasks. Popular application scenarios include logical reasoning and dialogue-related tasks (Kumar et al., 2020; Gu et al., 2020; Zhang and Zhao, 2021; Li et al., 2021). For example, Whang et al. (2021) proposed various utterance manipulation strategies, including utterance insertion, deletion, and retrieval, to maintain dialogue coherence.
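As a loose illustration of this line of work (not the exact objectives or labels of the cited papers), a self-supervised example for, say, utterance deletion can be constructed directly from a dialogue:

```python
import random

def utterance_deletion_example(dialogue):
    """Delete one utterance and record its position; a model is trained to spot the gap.
    Illustrative only; the cited works define their own manipulation strategies."""
    position = random.randrange(len(dialogue))
    corrupted = dialogue[:position] + dialogue[position + 1:]
    return {"context": " [SEP] ".join(corrupted), "label": position}

dialogue = ["Hi, can I book a table?", "Sure, for how many?", "Four, please.", "Great, done!"]
print(utterance_deletion_example(dialogue))
```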