Task Compass: Scaling Multi-task Pre-training with Task Prefix
Zhuosheng Zhang1, Shuohang Wang2, Yichong Xu2, Yuwei Fang2,
Wenhao Yu3, Yang Liu2, Hai Zhao1, Chenguang Zhu2 and Michael Zeng2
1Shanghai Jiao Tong University, Shanghai, China
2Microsoft Cognitive Services Research, Redmond, WA, USA
3University of Notre Dame, Notre Dame, IN, USA
1zhangzs@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn;
2{shuowa, yicxu, yuwfan, yaliu10, chezhu, nzeng}@microsoft.com; 3wyu1@nd.edu
Abstract
Leveraging task-aware annotated data as supervised signals to assist with self-supervised learning on large-scale unlabeled data has become a new trend in pre-training language models. Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks. To tackle the challenge, we propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks. We conduct extensive experiments on 40 datasets, which show that our model can not only serve as a strong foundation backbone for a wide range of tasks but also act as a probing tool for analyzing task relationships. The task relationships reflected by the prefixes align with transfer learning performance between tasks. They also suggest directions for data augmentation with complementary tasks, which help our model achieve human-parity results on commonsense reasoning leaderboards. Code is available at https://github.com/cooelf/CompassMTL.
1 Introduction
Recent years have witnessed a growing interest in leveraging a unified pre-trained language model (PrLM) to solve a wide range of natural language processing tasks (Tay et al., 2022; Chowdhery et al., 2022; Xie et al., 2022; Zhang et al., 2022). The pre-training recipe of a PrLM is evolving from self-supervised learning (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2019; Lan et al., 2020; Clark et al., 2020) to multi-task learning (MTL) with a mixture of standard self-supervised tasks and various supervised tasks, which takes advantage of learning from both large-scale unlabeled corpora and high-quality human-labeled datasets (Raffel et al., 2019; Aribandi et al., 2021).¹

* Work done when Zhuosheng Zhang and Wenhao Yu interned at the Microsoft Cognitive Services Research group. This work was partially supported by Key Projects of the National Natural Science Foundation of China (U1836222 and 61733011).

¹ Since multi-task pre-training is often implemented as an additional large-scale learning stage between language model pre-training and fine-tuning, it is also known as multi-task pre-fine-tuning in the literature (Aghajanyan et al., 2021).

Figure 1: Input-output view. We append a task prefix to each data sequence to capture common patterns from the dataset and require the model to predict some randomly masked prefixes to capture task differences. (The figure shows example inputs: an MTL branch with prefixed sequences such as a [dream] multiple-choice dialogue and a [sciq] science question with answer options, and an MLM branch in which the task prefix and some content tokens are replaced by [MASK].)
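To make the input-output view above concrete, here is a minimal sketch of how prefixed training sequences could be assembled. The helper name, the masking rates, and the choice to mask the prefix together with a few content tokens are our own illustrative assumptions; the paper's exact masking scheme may differ.

```python
import random

MASK = "[MASK]"  # assumed BERT-style mask token

def build_prefixed_example(task: str, text: str, rng: random.Random,
                           mask_prefix_prob: float = 0.15,
                           mask_token_prob: float = 0.15) -> str:
    """Prepend a task prefix such as "[sciq]:" to a data sequence.

    For the MLM branch, the prefix (and a fraction of content tokens) is
    occasionally replaced by [MASK], so the model must recover which task
    the example came from and reconstruct the corrupted tokens.
    """
    if rng.random() < mask_prefix_prob:
        tokens = [t if rng.random() > mask_token_prob else MASK
                  for t in text.split()]
        return f"{MASK}: " + " ".join(tokens)
    return f"[{task}]: {text}"

rng = random.Random(0)
print(build_prefixed_example("sciq", "A wetland is an area that is wet for all "
                             "or part of the year.", rng))
```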
Benefiting from supervision from related tasks, MTL approaches reduce the cost of curating deep learning models for an individual task and provide a shared representation that is generally applicable to a range of tasks (Wu et al., 2020b). In the research line of multi-task learning for PrLMs, a typical solution is to cast all tasks into a text-to-text format and utilize an encoder-decoder PrLM such as T5 to predict the target sequences (Raffel et al., 2019; Aribandi et al., 2021). Despite the extensive efforts on leveraging supervised tasks to strengthen PrLMs, the latest trend is extreme scaling of the number of tasks, with little attention paid to the relationships between tasks (Sanh et al., 2021; Wei et al., 2021). Aribandi et al. (2021) investigated co-training transfer effects among task families and empirically found that tasks in different families may have side effects on each other; e.g., summarization tasks generally seem to hurt performance on other task families such as dialogue systems (Mehri et al., 2020), natural language inference (Bowman et al., 2015), and commonsense reasoning (Lourie et al., 2021).
When the number of tasks scales up, the training of PrLMs becomes more vulnerable to negative transfer due to the severe inconsistency of domain and data distribution between tasks (Wu et al., 2020b; Padmakumar et al., 2022). As one of the key concepts underlying MTL, task relationships potentially provide a basis for employing PrLMs in a more effective and interpretable way.
To handle the issue of negative transfer during multi-task learning, early studies have taken task relationships into account by employing a dual-process model architecture composed of a shared encoder and task-specific layers. The two parts are supposed to integrate the common features of all the learning tasks and to explore the task relationships in a predefined manner, respectively (Zheng et al., 2019; Liu et al., 2019a; Bai et al., 2020; Ma et al., 2021). However, these methods require additional modifications to the model architecture and increase model complexity and computation cost. Therefore, they are suboptimal for application to PrLMs in terms of generality and computational bottlenecks.
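For reference, the dual-process design described above can be sketched roughly as follows. This is a schematic PyTorch illustration of the general shared-encoder/task-specific-head pattern, not the architecture of any particular cited model; the class name, dimensions, and dummy encoder are our own assumptions.

```python
import torch
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    """Schematic dual-process MTL model: a shared encoder plus one output head per task."""

    def __init__(self, encoder: nn.Module, hidden_size: int, task_num_labels: dict):
        super().__init__()
        self.encoder = encoder  # any module mapping token ids -> (batch, seq_len, hidden)
        self.heads = nn.ModuleDict(
            {task: nn.Linear(hidden_size, n_labels)
             for task, n_labels in task_num_labels.items()}
        )

    def forward(self, input_ids: torch.Tensor, task: str) -> torch.Tensor:
        hidden = self.encoder(input_ids)   # shared representation for all tasks
        pooled = hidden[:, 0]              # first-token pooling, BERT-style
        return self.heads[task](pooled)    # task-specific logits

# Toy usage with a dummy embedding layer standing in for a pre-trained Transformer.
dummy_encoder = nn.Sequential(nn.Embedding(30522, 768))
model = SharedEncoderMTL(dummy_encoder, hidden_size=768,
                         task_num_labels={"mnli": 3, "rte": 2})
logits = model(torch.randint(0, 30522, (4, 16)), task="mnli")  # shape (4, 3)
```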
All the considerations above lay down our goal: to investigate simple yet effective ways to measure task relationships without additional cost while keeping the generality of PrLMs. In this work, we propose a prefix-guided multi-task learning framework (CompassMTL) to explore the mutual effects between tasks (Figure 1) and improve model performance with complementary tasks. Targeting natural language understanding (NLU) tasks, we employ a discriminative PrLM² as the backbone model and train it on 40 tasks. Experimental results show that our model achieves human-parity performance on commonsense reasoning tasks. We further probe into the task relationships entailed in the task prefix representations, finding that the measured relationships highly correlate with task-to-task transfer performance and also provide a useful reference for optimizing the PrLM on a target task with its complementary tasks during MTL, i.e., fewer tasks with better performance.

² Also known as encoder-only PrLMs. As this work focuses on NLU tasks, we find that encoder-only PrLMs are competitive based on our empirical studies, though they may lose generalizability on natural language generation tasks.
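As a rough sketch of how such a probe could be implemented, one can compare the learned embedding vectors of the task-prefix tokens. The cosine similarity measure and the function name below are our assumptions for illustration, not necessarily the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def prefix_similarity(prefix_embeddings: dict) -> dict:
    """Pairwise cosine similarity between learned task-prefix embeddings.

    `prefix_embeddings` maps a task name (e.g. "mnli") to the embedding
    vector of its prefix token taken from the trained model.
    """
    tasks = sorted(prefix_embeddings)
    return {
        (a, b): F.cosine_similarity(prefix_embeddings[a],
                                    prefix_embeddings[b], dim=0).item()
        for a in tasks for b in tasks
    }

# Toy usage with random vectors standing in for real prefix embeddings.
emb = {t: torch.randn(768) for t in ["mnli", "rte", "sciq"]}
sims = prefix_similarity(emb)
print(sims[("mnli", "rte")])
```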
In summary, our contributions are threefold:
1) A unified discriminative multi-task PrLM for NLU tasks will be released as a strong counterpart to the dominant T5-based encoder-decoder PrLMs trained with MTL.
2) A probing tool that uses task prefixes to explore task relationships in large-scale MTL. We observe that the task relationships reflected by the prefixes correlate with transfer learning performance, and they help our model achieve better results with complementary tasks.
3) State-of-the-art results on a variety of NLU tasks, especially human-parity benchmark performance on commonsense reasoning leaderboards, i.e., HellaSwag and αNLI.
2 Background and Related Work
2.1 Self-supervised Pre-training
PrLMs are commonly pre-trained on large-scale corpora and then fine-tuned on individual tasks. One of the most widely used pre-training tasks is masked language modeling (MLM), which first masks out some tokens from the input sentences and then trains the model to predict them based on the remaining tokens. Derivatives of MLM include permuted language modeling in XLNet (Yang et al., 2019) and sequence-to-sequence MLM in MASS (Song et al., 2019) and T5 (Raffel et al., 2019). Beyond general-purpose pre-training, domain-adaptive pre-training and task-adaptive pre-training have attracted attention in recent studies.
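Before turning to these adaptive variants, the basic MLM corruption step described above can be sketched as follows. This is a simplified illustration (uniform random masking, omitting BERT's 80/10/10 replacement split); the function name and masking rate are our own.

```python
import random

def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
    """Return (corrupted, labels): masked positions keep the original token as
    the prediction target; unmasked positions get None and are not scored."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            corrupted.append(mask_token)
            labels.append(tok)       # the model must predict this token
        else:
            corrupted.append(tok)
            labels.append(None)
    return corrupted, labels

print(mask_tokens("a wetland is an area that is wet".split()))
```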
1) Domain-adaptive Pre-training. To incorporate specific in-domain knowledge, domain-adaptive pre-training directly post-trains the original PrLMs on a domain-specific corpus. Popular models have been proposed in the dialogue domain (Whang et al., 2020; Wu et al., 2020a), as well as in the medical and science domains (Lee et al., 2020; Beltagy et al., 2019; Huang et al., 2019a; Yu et al., 2022).
2) Task-adaptive Pre-training. The goal of task-adaptive pre-training is to capture task-specific skills by devising dedicated pre-training tasks. Popular application scenarios include logical reasoning and dialogue-related tasks (Kumar et al., 2020; Gu et al., 2020; Zhang and Zhao, 2021; Li et al., 2021). For example, Whang et al. (2021) proposed various utterance manipulation strategies, including utterance insertion, deletion, and retrieval, to maintain dialog coherence.
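As a rough illustration of this style of objective (not the exact construction used by Whang et al. (2021)), a dialogue can be corrupted by deleting one utterance and asking the model to detect which position was modified; the helper name and labeling scheme below are our own.

```python
import random

def make_deletion_example(dialogue, seed=0):
    """Drop one utterance from a dialogue; the label is the index of the gap.

    A task-adaptive objective of this kind trains the model to notice broken
    dialog coherence (illustrative only).
    """
    rng = random.Random(seed)
    idx = rng.randrange(len(dialogue))
    corrupted = dialogue[:idx] + dialogue[idx + 1:]
    return corrupted, idx

dialogue = ["M: I am considering dropping my dancing class.",
            "W: If I were you, I would stick to it.",
            "M: But I am not making any progress."]
print(make_deletion_example(dialogue))
```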
Figure 2: Comparison with existing paradigms of multi-task learning. (a) Unified text-to-text methods: an encoder-decoder model trained jointly on a mixture of task datasets; typical examples include T5 (Raffel et al., 2019), ExT5 (Aribandi et al., 2021), FLAN (Wei et al., 2021), and T0 (Sanh et al., 2021). (b) Our CompassMTL framework: an encoder trained with joint MTL and MLM objectives on sequences carrying task prefixes such as [RTE], [MNLI], and [QNLI], where some prefixes are replaced by [MASK]. (c) CompassMTL w/ Tailor: the encoder trained with the MTL objective on a mixture of task datasets that additionally includes sequences such as [RCT] examples.
2.2 Multi-task Learning for PrLMs
The MTL setting we are concerned with in the field of PrLMs is partially related to the studies of task-adaptive pre-training discussed above. The major difference is that PrLMs in MTL are fed with human-annotated datasets instead of the automatically constructed ones used for self-supervised tasks. Figure 2 gives an overview of the paradigms of MTL PrLMs. Existing methods in this research line mostly vary in model architectures and training stages. For example, MT-DNN (Liu et al., 2019a) applied multi-task learning to train a shared model on all the target datasets in the fine-tuning stage, with several task-aware output modules to adapt the shared representations to each task. Recent studies, such as ExT5 (Aribandi et al., 2021), T0 (Sanh et al., 2021), and FLAN (Wei et al., 2021), commonly apply an encoder-decoder architecture, convert a variety of tasks into the same text-to-text format, and train those tasks jointly (Figure 2a).
We argue that they are not the optimal solution considering the model complexity and the gap between original and transformed task formats, especially for natural language understanding tasks that are inherently discriminative, e.g., classification and multiple choice. Indeed, there are studies (McCann et al., 2018; Keskar et al., 2019; Li et al., 2020; Khashabi et al., 2020) that transform traditional tasks into other formats such as reading comprehension or question answering and achieve better results than prior methods. These studies motivate us to explore superior model backbones and data formats, especially for application to NLU tasks.
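To make the format gap concrete, the sketch below casts the same multiple-choice question (the SciQ example from Figure 1) in the two paradigms. The templates are our own illustrations, not the exact ones used by T5-style models or by CompassMTL.

```python
question = ("What is an area of land called that is wet "
            "for all or part of the year?")
choices = ["tundra", "plains", "grassland", "wetland"]

# (a) Text-to-text: the encoder-decoder model must generate the answer string.
t2t_input = f"question: {question} options: {', '.join(choices)}"
t2t_target = "wetland"

# (b) Discriminative: the encoder scores each (question, choice) pair and the
#     highest-scoring candidate is selected; no text generation is involved.
disc_inputs = [f"[sciq]: {question} [SEP] {choice}" for choice in choices]
gold_index = 3  # "wetland"

print(t2t_input)
print(disc_inputs[gold_index])
```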
2.3 Modeling Task Relationships in MTL
Modeling task relationships is a classic topic in deep learning. Bingel and Søgaard (2017) studied which task relations yield gains in traditional natural language processing and investigated when and why MTL works in sequence labeling tasks such as chunking, sentence compression, POS tagging, and keyphrase detection. Wu et al. (2020b) found that task data alignment can significantly affect the performance of MTL and proposed an architecture with a shared module for all tasks and a separate output module for each task. Since these methods require additional modifications to the model architecture, they are suboptimal for employment in PrLMs, considering computational bottlenecks and generality as the number of tasks scales up.
In the era of pre-trained models, Geva et al. (2021) analyzed behavior transfer in PrLMs between related jointly trained tasks such as QA and summarization, providing evidence for the extrapolation of skills as a consequence of multi-task training. ExT5 (Aribandi et al., 2021) evaluated transfer performance among task families in a multi-task co-training setup and observed that negative transfer is common, especially when training across task families. Although recent studies insert prompts that describe the task requirements into the data sequences (Liu et al., 2021; Su et al., 2022; Qin et al., 2021; Vu et al., 2022), it is still not clear whether such prompts help mitigate negative transfer or whether they necessarily capture task relationships. In this work, we find that using task