Task Compass: Scaling Multi-task Pre-training with Task Prefix
Zhuosheng Zhang1, Shuohang Wang2, Yichong Xu2, Yuwei Fang2,
Wenhao Yu3, Yang Liu2, Hai Zhao1, Chenguang Zhu2 and Michael Zeng2
1Shanghai Jiao Tong University, Shanghai, China
2Microsoft Cognitive Services Research, Redmond, WA, USA
3University of Notre Dame, Notre Dame, IN, USA
1zhangzs@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn;
2{shuowa, yicxu, yuwfan, yaliu10, chezhu, nzeng}@microsoft.com; 3wyu1@nd.edu
Abstract
Leveraging task-aware annotated data as supervised signals to assist with self-supervised learning on large-scale unlabeled data has become a new trend in pre-training language models. Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks. To tackle the challenge, we propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks. We conduct extensive experiments on 40 datasets, which show that our model can not only serve as a strong foundation backbone for a wide range of tasks but also act as a probing tool for analyzing task relationships. The task relationships reflected by the prefixes align with transfer learning performance between tasks. They also suggest directions for data augmentation with complementary tasks, which help our model achieve human-parity results on commonsense reasoning leaderboards. Code is available at https://github.com/cooelf/CompassMTL.
1 Introduction
Recent years have witnessed a growing interest in leveraging a unified pre-trained language model (PrLM) to solve a wide range of natural language processing tasks (Tay et al., 2022; Chowdhery et al., 2022; Xie et al., 2022; Zhang et al., 2022). The pre-training recipe of a PrLM is evolving from self-supervised learning (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Lan et al., 2020; Clark et al., 2020) to multi-task learning (MTL) with a mixture of standard self-supervised tasks and various supervised tasks, which takes advantage of learning from both large-scale unlabeled corpora and high-quality human-labeled datasets (Raffel et al., 2019; Aribandi et al., 2021).¹

* Work done when Zhuosheng Zhang and Wenhao Yu interned at the Microsoft Cognitive Services Research group. This work was partially supported by Key Projects of the National Natural Science Foundation of China (U1836222 and 61733011).

¹ Since multi-task pre-training is often implemented as an additional large-scale learning stage between language model pre-training and fine-tuning, it is also known as multi-task pre-fine-tuning in the literature (Aghajanyan et al., 2021).

Figure 1: Input-output view. We append a task prefix to each data sequence to capture common patterns from the dataset and require the model to predict some randomly masked prefixes to capture task differences. (The figure shows example inputs: an MTL branch with prefixed sequences such as a [dream] multiple-choice dialogue and a [sciq] science question with answer options, and an MLM branch in which the task prefix and some content tokens are replaced by [MASK].)
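To make the input-output view above concrete, here is a minimal sketch of how prefixed training sequences could be assembled. The helper name, the masking rates, and the choice to mask the prefix together with a few content tokens are our own illustrative assumptions; the paper's exact masking scheme may differ.

```python
import random

MASK = "[MASK]"  # assumed BERT-style mask token

def build_prefixed_example(task: str, text: str, rng: random.Random,
                           mask_prefix_prob: float = 0.15,
                           mask_token_prob: float = 0.15) -> str:
    """Prepend a task prefix such as "[sciq]:" to a data sequence.

    For the MLM branch, the prefix (and a fraction of content tokens) is
    occasionally replaced by [MASK], so the model must recover which task
    the example came from and reconstruct the corrupted tokens.
    """
    if rng.random() < mask_prefix_prob:
        tokens = [t if rng.random() > mask_token_prob else MASK
                  for t in text.split()]
        return f"{MASK}: " + " ".join(tokens)
    return f"[{task}]: {text}"

rng = random.Random(0)
print(build_prefixed_example("sciq", "A wetland is an area that is wet for all "
                             "or part of the year.", rng))
```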
Benefiting from supervision from related tasks, MTL approaches reduce the cost of curating deep learning models for an individual task and provide a shared representation that is generally applicable to a range of tasks (Wu et al., 2020b). In the research line of multi-task learning for PrLMs, a typical solution is to cast all tasks into a text-to-text format and utilize an encoder-decoder PrLM such as T5 to predict the target sequences (Raffel et al., 2019; Aribandi et al., 2021). Despite the extensive efforts on leveraging supervised tasks to strengthen PrLMs, the latest trend is extreme scaling of the number of tasks, with little attention paid to the relationships between tasks (Sanh et al., 2021; Wei et al., 2021). Aribandi et al. (2021) investigated co-training transfer effects among task families and empirically found that tasks in different families may have side effects on each other; e.g., summarization tasks generally seem to hurt performance on other task families such as dialogue systems (Mehri et al., 2020), natural language inference (Bowman et al., 2015), and commonsense reasoning (Lourie et al., 2021).
When the number of tasks scales up, the training of PrLMs becomes more vulnerable to negative transfer due to the severe inconsistency of domain and data distribution between tasks (Wu et al., 2020b; Padmakumar et al., 2022). As one of the key concepts underlying MTL, task relationships potentially provide a basis for employing PrLMs in a more effective and interpretable way.
To handle the issue of negative transfer during multi-task learning, early studies have taken task relationships into account by employing a dual-process model architecture composed of a shared encoder and task-specific layers. The two parts are supposed to integrate the common features of all the learning tasks and to explore the task relationships in a predefined manner, respectively (Zheng et al., 2019; Liu et al., 2019a; Bai et al., 2020; Ma et al., 2021). However, these methods require additional modifications to the model architecture and increase model complexity and computation cost. Therefore, they are suboptimal for application to PrLMs in terms of generality and computational bottlenecks.
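For reference, the dual-process design described above can be sketched roughly as follows. This is a schematic PyTorch illustration of the general shared-encoder/task-specific-head pattern, not the architecture of any particular cited model; the class name, dimensions, and dummy encoder are our own assumptions.

```python
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    """Schematic dual-process MTL model: a shared encoder plus one output head per task."""

    def __init__(self, encoder: nn.Module, hidden_size: int, task_num_labels: dict):
        super().__init__()
        self.encoder = encoder  # any module mapping token ids -> (batch, seq_len, hidden)
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n_labels)
             for task, n_labels in task_num_labels.items()}
        )

    def forward(self, input_ids: torch.Tensor, task: str) -> torch.Tensor:
        hidden = self.encoder(input_ids)   # shared representation for all tasks
        pooled = hidden[:, 0]              # first-token pooling, BERT-style
        return self.heads[task](pooled)    # task-specific logits

# Toy usage with a dummy embedding layer standing in for a pre-trained Transformer.
dummy_encoder = nn.Sequential(nn.Embedding(30522, 768))
model = SharedEncoderMTL(dummy_encoder, hidden_size=768,
                         task_num_labels={"mnli": 3, "rte": 2})
logits = model(torch.randint(0, 30522, (4, 16)), task="mnli")  # shape (4, 3)
```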
All the considerations above lay down our goal: to investigate simple yet effective ways to measure task relationships without additional cost while keeping the generality of PrLMs. In this work, we propose a prefix-guided multi-task learning framework (CompassMTL) to explore the mutual effects between tasks (Figure 1) and improve model performance with complementary tasks. Targeting natural language understanding (NLU) tasks, we employ a discriminative PrLM² as the backbone model and train it on 40 tasks. Experimental results show that our model achieves human-parity performance on commonsense reasoning tasks. We further probe into the task relationships entailed in the task prefix representations, finding that the measured relationships highly correlate with task-to-task transfer performance and also provide a useful reference for optimizing the PrLM on a target task with its complementary tasks during MTL, i.e., fewer tasks with better performance.

² Also known as encoder-only PrLMs. As this work focuses on NLU tasks, we find that encoder-only PrLMs are competitive based on our empirical studies, though they may lose generalizability on natural language generation tasks.
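As a rough sketch of how such a probe could be implemented, one can compare the learned embedding vectors of the task-prefix tokens. The cosine similarity measure and the function name below are our assumptions for illustration, not necessarily the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def prefix_similarity(prefix_embeddings: dict) -> dict:
    """Pairwise cosine similarity between learned task-prefix embeddings.

    `prefix_embeddings` maps a task name (e.g. "mnli") to the embedding
    vector of its prefix token taken from the trained model.
    """
    tasks = sorted(prefix_embeddings)
    return {
        (a, b): F.cosine_similarity(prefix_embeddings[a],
                                    prefix_embeddings[b], dim=0).item()
        for a in tasks for b in tasks
    }

# Toy usage with random vectors standing in for real prefix embeddings.
emb = {t: torch.randn(768) for t in ["mnli", "rte", "sciq"]}
sims = prefix_similarity(emb)
print(sims[("mnli", "rte")])
```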
In summary, our contributions are threefold:
1) A unified discriminative multi-task PrLM for NLU tasks will be released as a strong counterpart to the dominant T5-based encoder-decoder PrLMs trained with MTL.
2) A probing tool that uses task prefixes to explore task relationships in large-scale MTL. We observe that the task relationships reflected by the prefixes correlate with transfer learning performance, and they help our model achieve better results with complementary tasks.
3) State-of-the-art results on a variety of NLU tasks, especially human-parity benchmark performance on commonsense reasoning leaderboards, i.e., HellaSwag and αNLI.
2 Background and Related Work
2.1 Self-supervised Pre-training
PrLMs are commonly pre-trained on large-scale corpora and then fine-tuned on individual tasks. One of the most widely used pre-training tasks is masked language modeling (MLM), which first masks out some tokens from the input sentences and then trains the model to predict them based on the remaining tokens. Derivatives of MLM include permuted language modeling in XLNet (Yang et al., 2019) and sequence-to-sequence MLM in MASS (Song et al., 2019) and T5 (Raffel et al., 2019). Beyond general-purpose pre-training, domain-adaptive pre-training and task-adaptive pre-training have attracted attention in recent studies.
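Before turning to these adaptive variants, the basic MLM corruption step described above can be sketched as follows. This is a simplified illustration (uniform random masking, omitting BERT's 80/10/10 replacement split); the function name and masking rate are our own.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return (corrupted, labels): masked positions keep the original token as
    the prediction target; unmasked positions get None and are not scored."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)       # the model must predict this token
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

print(mask_tokens("a wetland is an area that is wet".split()))
```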
1) Domain-adaptive Pre-training. To incorporate specific in-domain knowledge, domain-adaptive pre-training directly post-trains the original PrLMs on a domain-specific corpus. Popular models have been proposed in the dialogue domain (Whang et al., 2020; Wu et al., 2020a), as well as in the medical and science domains (Lee et al., 2020; Beltagy et al., 2019; Huang et al., 2019a; Yu et al., 2022).
2) Task-adaptive Pre-training. The goal of task-adaptive pre-training is to capture task-specific skills by devising dedicated pre-training tasks. Popular application scenarios include logical reasoning and dialogue-related tasks (Kumar et al., 2020; Gu et al., 2020; Zhang and Zhao, 2021; Li et al., 2021). For example, Whang et al. (2021) proposed various utterance manipulation strategies, including utterance insertion, deletion, and retrieval, to maintain dialog coherence.
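As a rough illustration of this style of objective (not the exact construction used by Whang et al. (2021)), a dialogue can be corrupted by deleting one utterance and asking the model to detect which position was modified; the helper name and labeling scheme below are our own.

```python
import random

def make_deletion_example(dialogue, seed=0):
    """Drop one utterance from a dialogue; the label is the index of the gap.

    A task-adaptive objective of this kind trains the model to notice broken
    dialog coherence (illustrative only).
    """
    rng = random.Random(seed)
    idx = rng.randrange(len(dialogue))
    corrupted = dialogue[:idx] + dialogue[idx + 1:]
    return corrupted, idx

dialogue = ["M: I am considering dropping my dancing class.",
            "W: If I were you, I would stick to it.",
            "M: But I am not making any progress."]
print(make_deletion_example(dialogue))
```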
Figure 2: Comparison with existing paradigms of multi-task learning. (a) Unified text-to-text methods: an encoder-decoder model trained jointly on a mixture of task datasets; typical examples include T5 (Raffel et al., 2019), ExT5 (Aribandi et al., 2021), FLAN (Wei et al., 2021), and T0 (Sanh et al., 2021). (b) Our CompassMTL framework: an encoder trained with joint MTL and MLM objectives on sequences carrying task prefixes such as [RTE], [MNLI], and [QNLI], where some prefixes are replaced by [MASK]. (c) CompassMTL w/ Tailor: the encoder trained with the MTL objective on a mixture of task datasets that additionally includes sequences such as [RCT] examples.
2.2 Multi-task Learning for PrLMs
The MTL setting we are concerned with in the field of PrLMs is partially related to the studies of task-adaptive pre-training discussed above. The major difference is that PrLMs in MTL are fed with human-annotated datasets instead of the automatically constructed ones used for self-supervised tasks. Figure 2 gives an overview of the paradigms of MTL PrLMs. Existing methods in this research line mostly vary in model architectures and training stages. For example, MT-DNN (Liu et al., 2019a) applied multi-task learning to train a shared model on all the target datasets in the fine-tuning stage, with several task-aware output modules to adapt the shared representations to each task. Recent studies, such as ExT5 (Aribandi et al., 2021), T0 (Sanh et al., 2021), and FLAN (Wei et al., 2021), commonly apply an encoder-decoder architecture, convert a variety of tasks into the same text-to-text format, and train those tasks jointly (Figure 2a).
We argue that they are not the optimal solution considering the model complexity and the gap between original and transformed task formats, especially for natural language understanding tasks that are inherently discriminative, e.g., classification and multiple choice. Indeed, there are studies (McCann et al., 2018; Keskar et al., 2019; Li et al., 2020; Khashabi et al., 2020) that transform traditional tasks into other formats such as reading comprehension or question answering and achieve better results than prior methods. These studies motivate us to explore superior model backbones and data formats, especially for application to NLU tasks.
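To make the format gap concrete, the sketch below casts the same multiple-choice question (the SciQ example from Figure 1) in the two paradigms. The templates are our own illustrations, not the exact ones used by T5-style models or by CompassMTL.

```python
question = ("What is an area of land called that is wet "
            "for all or part of the year?")
choices = ["tundra", "plains", "grassland", "wetland"]

# (a) Text-to-text: the encoder-decoder model must generate the answer string.
t2t_input = f"question: {question} options: {', '.join(choices)}"
t2t_target = "wetland"

# (b) Discriminative: the encoder scores each (question, choice) pair and the
#     highest-scoring candidate is selected; no text generation is involved.
disc_inputs = [f"[sciq]: {question} [SEP] {choice}" for choice in choices]
gold_index = 3  # "wetland"

print(t2t_input)
print(disc_inputs[gold_index])
```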
2.3 Modeling Task Relationships in MTL
Modeling task relationships is a classic topic in deep learning. Bingel and Søgaard (2017) studied which task relations yield gains in traditional natural language processing and investigated when and why MTL works in sequence labeling tasks such as chunking, sentence compression, POS tagging, and keyphrase detection. Wu et al. (2020b) found that task data alignment can significantly affect the performance of MTL and proposed an architecture with a shared module for all tasks and a separate output module for each task. Since these methods require additional modifications to the model architecture, they are suboptimal for employment in PrLMs, considering computational bottlenecks and generality as the number of tasks scales up.
In the era of pre-trained models, Geva et al. (2021) analyzed behavior transfer in PrLMs between related jointly trained tasks such as QA and summarization, providing evidence for the extrapolation of skills as a consequence of multi-task training. ExT5 (Aribandi et al., 2021) evaluated transfer performance among task families in a multi-task co-training setup and observed that negative transfer is common, especially when training across task families. Although recent studies insert prompts that describe the task requirements into the data sequences (Liu et al., 2021; Su et al., 2022; Qin et al., 2021; Vu et al., 2022), it is still not clear whether such prompts help mitigate negative transfer or whether they necessarily capture task relationships. In this work, we find that using task