Metric-guided Distillation: Distilling Knowledge from the Metric to
Ranker and Retriever for Generative Commonsense Reasoning
Xingwei He1, Yeyun Gong2†, A-Long Jin1, Weizhen Qi3, Hang Zhang2, Jian Jiao4, Bartuer Zhou2, Biao Cheng2, Siu Ming Yiu1, Nan Duan2
1The University of Hong Kong, 2Microsoft Research Asia, 3University of Science and Technology of China, 4Microsoft
hexingwei15@gmail.com, ajin@eee.hku.hk, smyiu@cs.hku.hk, weizhen@mail.ustc.edu.cn,
{yegong, v-zhhang, jian.jiao, bazhou, bicheng, nanduan}@microsoft.com
Abstract
Commonsense generation aims to generate a realistic sentence describing a daily scene under the given concepts, which is very challenging, since it requires models to have relational reasoning and compositional generalization capabilities. Previous work focuses on retrieving prototype sentences for the provided concepts to assist generation. They first use a sparse retriever to retrieve candidate sentences, then re-rank the candidates with a ranker. However, the candidates returned by their ranker may not be the most relevant sentences, since the ranker treats all candidates equally without considering their relevance to the reference sentences of the given concepts. Another problem is that re-ranking is very expensive, but only using retrievers will seriously degrade the performance of their generation models. To solve these problems, we propose the metric distillation rule to distill knowledge from the metric (e.g., BLEU) to the ranker. We further transfer the critical knowledge summarized by the distilled ranker to the retriever. In this way, the relevance scores of candidate sentences predicted by the ranker and retriever will be more consistent with their quality measured by the metric. Experimental results on the CommonGen benchmark verify the effectiveness of our proposed method: (1) our generation model with the distilled ranker achieves a new state-of-the-art result; (2) our generation model with the distilled retriever even surpasses the previous SOTA.
1 Introduction
Commonsense reasoning is the ability to make reasonable and logical assumptions about daily scenes, which is a long-standing challenge in natural language processing. Recently, many discriminative tasks, such as CommonsenseQA (Talmor et al.,

∗ Work done during internship at Microsoft Research Asia.
† Corresponding author.
Concepts: eye, hang, head, shut, squeeze
Reference: A man squeezes his eyes shut and hangs his head.
BART: He squeezes her head shut, then grasps her eyes shut.
Ours: A baby with a blue shirt hangs his head and squeezes his eyes shut.
Table 1: Sentences generated by BART and our proposed model, DKMR2.
2019) and SWAG (Sap et al., 2019), have been proposed to evaluate commonsense reasoning ability by testing whether models can select the correct answer from the choices according to the given context. To test whether models acquire generative commonsense reasoning ability, Lin et al. (2020) proposed the commonsense generation (CommonGen) task, which requires models to produce a plausible sentence describing a specific daily-life scenario based on the given concepts.

CommonGen poses two main challenges, and it expects models to (1) reason over the commonsense relations among concepts to generate sentences in line with our commonsense; and (2) possess the compositional generalization ability to generate realistic sentences for unseen concept compositions. Experimental results (Lin et al., 2020) show that large-scale pre-trained models (e.g., BART) alone are not competent for this task (see Table 1). The main reason is that the source information is very limited; the models can therefore only rely on the internal implicit knowledge acquired during pre-training to solve this problem, resulting in generated sentences that violate commonsense.
To enrich the source information, EKI-BART (Fan et al., 2020) first retrieves prototype sentences for the input concepts, and then feeds the concepts and retrieved sentences into the generation model. Recent work, such as RE-T5 (Wang et al., 2021), KFCNet (Li et al., 2021), and KGR4 (Liu et al., 2022), extends this retrieve-and-generate framework by introducing a binary classifier to re-rank the retrieved candidate sentences and filter out candidates irrelevant to the input concepts. One problem with these works is the discrepancy between training and re-ranking for their ranker. Concretely, when training the ranker, they treat all retrieved candidate sentences as negatives, regardless of their relevance to the reference sentences of the input concepts. However, during re-ranking, the ranker is asked to point out how these candidates differ in their relevance to the references. Another problem is that the re-ranking process of the cross-encoder ranker is very time-consuming, which is non-negligible, especially for online systems.

arXiv:2210.11708v1 [cs.CL] 21 Oct 2022
In this paper, we also resort to the retrieve-and-generate pipeline to solve CommonGen, and further improve the retrieval module by alleviating the above problems. Our motivation is that the relevance scores of candidates computed by the ranker and retriever should be in line with the gold quality scores between candidates and reference sentences measured with the evaluation metric. To achieve this, we first distill the gold rank knowledge of candidates measured by the metric to the ranker. Next, we improve the retriever by transferring the metric knowledge from the distilled ranker to the retriever, rather than directly distilling it from the metric (please refer to Section 3.3 for more explanation). By doing so, the distilled ranker and retriever can select more relevant sentences than their counterparts without metric distillation.
The contributions of this work are summarized as follows: (1) We propose to Distill Knowledge from the Metric to Ranker and Retriever, termed DKMR2, for generative commonsense reasoning, which uses metric-guided distillation to improve the ranker and a progressive distillation strategy to improve the retriever.¹ (2) We conduct extensive experiments on the CommonGen benchmark. Our proposed model achieves a new state-of-the-art (SOTA) on both the v1.0 test set (43.37 vs. 39.15 on SPICE) and the official test set (v1.1) of the leaderboard (34.589 vs. 33.911 on SPICE). (3) The performance of DKMR2 with the distilled retriever is on par with DKMR2 using the distilled ranker. As a result, the expensive retrieve-then-rank pipeline can be replaced with the distilled retriever at the expense of negligible performance.

¹Our code and models are available at https://github.com/microsoft/advNLG.
2 Problem Statement

CommonGen is a constrained text generation task, with the goal of generating a coherent and plausible sentence s describing an everyday scenario using an unordered concept set c = {c1, c2, ..., cm}. Therefore, this task is typically formulated as maximizing the conditional probability of s:

p(s|c; θ) = ∏_{t=1}^{n} p(s_t | s_{<t}, c; θ),   (1)

where n denotes the length of the generated sequence s and s_{<t} refers to the sub-sequence generated before time step t.
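As a toy illustration of the factorization in Eq. (1), the log-probability of a sentence is simply the sum of the per-step log-probabilities. The function and the per-step probabilities below are hypothetical stand-ins for what an autoregressive decoder such as BART would emit, not the paper's actual model.

```python
import math

def sequence_log_prob(step_log_probs):
    """Eq. (1) in log space: log p(s|c) = sum_t log p(s_t | s_<t, c).

    `step_log_probs` holds log p(s_t | s_<t, c) for each time step,
    as an autoregressive decoder would produce them.
    """
    return sum(step_log_probs)

# Toy example: a 3-token sentence whose per-step probabilities are
# 0.5, 0.8 and 0.9 under some hypothetical decoder.
log_p = sequence_log_prob([math.log(0.5), math.log(0.8), math.log(0.9)])
print(round(math.exp(log_p), 3))  # 0.5 * 0.8 * 0.9 = 0.36
```

Working in log space, as above, avoids the numerical underflow that multiplying many small probabilities would cause for long sequences.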
3 Methodology

Following the previous work (Fan et al., 2020), we resort to the retrieve-then-generate framework to solve CommonGen, which mainly consists of two modules: the retrieval module and the generation module. The retrieval module aims to retrieve relevant sentences to assist the generation module in generating desirable outputs. Recent work (Wang et al., 2021; Li et al., 2021; Liu et al., 2022) extends this idea by introducing a ranker to the retrieval module. In this work, our retrieval module also resorts to the retrieve-then-rank pipeline, as illustrated in Figure 1. To be specific, the retrieval module mainly contains two models, the retriever and the ranker, where the retriever is used to retrieve candidate sentences for the given concept set and the ranker further re-ranks the retrieved sentences. Different from previous work, we improve the ranker by distilling knowledge from the gold quality scores between the candidate sentences and the gold reference sentences computed by the evaluation metric. Then, the distilled ranker passes the distilled knowledge to the retriever, aiming to correct the retriever's inaccurate retrieval operations.

In Section 3.1, we first introduce the warm-up of the retriever. Then, we show how to distill knowledge from the metric, in turn, to the ranker and retriever in Sections 3.2 and 3.3, respectively. Finally, we show how to generate sentences based on the retrieved sentences in Section 3.4.
3.1 Warm-up of the Retriever

[Figure 1: pipeline diagram of the retrieval module. A sparse retriever samples hard negatives from the external corpus to train the dual-encoder retriever; the trained retriever builds the candidate sentence pool, which the cross-encoder ranker re-ranks. Metric distillation rule: supposing M(s, S) > M(s_1, S) > M(s_2, S), we expect sim(x, s) > sim(x, s_1) > sim(x, s_2).]
Figure 1: The pipeline of the retrieval module. Dotted, red, black and blue lines denote the 'sample', 'retrieve', 'train' and 're-rank' processes, respectively. x and s are one paired source and target from CommonGen. S denotes the reference sentences (s ∈ S). M is the automatic evaluation metric, used to measure the quality of the retrieved sentence in terms of S.

We use a typical dense retrieval model (Karpukhin et al., 2020) as the retriever. As shown in Figure 3(a) in Appendix A, the retriever is based on the dual-encoder architecture. In this work, we implement the retriever with two independent encoders, initialized with BERT. We use the hidden state of the first token (i.e., [CLS]) at the last layer as the representation of the input sentence. We first use the sentence encoder E_s(·) to compute the d-dimensional dense representations for all sentences in the external corpus D. Then, we use the concept encoder E_c(·) to compute the dense representation of the concept set. The similarity sim(c, s) between them is measured by the dot product of their dense representation vectors:

sim(c, s) = E_c(c)^T E_s(s).   (2)
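The dot product in Eq. (2) can be sketched in a few lines. The vectors below are toy stand-ins for the [CLS] representations E_c(c) and E_s(s) that the BERT encoders would produce; the function name is ours.

```python
import numpy as np

def dot_similarity(concept_vec, sentence_vecs):
    """Eq. (2): sim(c, s) = E_c(c)^T E_s(s), for a batch of sentences.

    In the paper both representations come from BERT [CLS] states;
    here we use fixed toy vectors of dimension d = 3.
    """
    return sentence_vecs @ concept_vec

# Toy embeddings standing in for E_c(c) and two E_s(s_i).
c = np.array([1.0, 0.0, 1.0])
sents = np.array([[1.0, 1.0, 1.0],   # dot product with c: 2.0
                  [0.0, 1.0, 0.0]])  # dot product with c: 0.0
scores = dot_similarity(c, sents)
print(scores.tolist())  # [2.0, 0.0]
```

Because the score is a plain dot product, all corpus representations can be precomputed once and reused for every query, which is what makes dense retrieval at scale practical.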
Train. During training, we warm up the dual-encoder retriever, Retriever_0, with the contrastive loss (Chen et al., 2020):

L(c; s, s_1^-, ..., s_N^-) = -log [ exp(sim(c, s)) / ( exp(sim(c, s)) + Σ_{i=1}^{N} exp(sim(c, s_i^-)) ) ],

where sim(c, s) denotes the relevance score between the concept set c and the positive sentence s. Similarly, sim(c, s_i^-) is the relevance score between c and the i-th negative sentence. N refers to the number of negative sentences.
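The contrastive loss above can be sketched directly from its formula. This is a minimal scalar implementation for illustration; the real training computes it over batches of encoder outputs, and the function name and toy scores are ours.

```python
import math

def contrastive_loss(sim_pos, sim_negs):
    """Contrastive loss from Section 3.1:
    L = -log( exp(sim(c,s)) / (exp(sim(c,s)) + sum_i exp(sim(c,s_i^-))) ).
    """
    denom = math.exp(sim_pos) + sum(math.exp(s) for s in sim_negs)
    return -math.log(math.exp(sim_pos) / denom)

# The loss shrinks as the positive outscores the negatives.
loose = contrastive_loss(1.0, [1.0, 1.0])  # positive tied with negatives
tight = contrastive_loss(5.0, [1.0, 1.0])  # positive clearly ahead
print(tight < loose)  # True
```

With the positive tied with two negatives the loss is exactly log 3, since the softmax assigns the positive a probability of 1/3; driving the positive's score up pushes that probability toward 1 and the loss toward 0.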
Following DPR (Karpukhin et al., 2020), the negative sentences consist of one hard negative and N−1 in-batch negatives.² We consider two sparse retrievers to build the hard negative pool: (1) TF-IDF: compute the similarity scores between the sparse vectors of c and sentences in D; (2) concept matching: sort sentences in D in descending order of the number of input concepts appearing in each sentence. Each sparse retriever returns the top K sentences as the hard negative pool P for one concept set. When training Retriever_0, we randomly sample one sentence from P as the hard negative.

²Note that in-batch negatives come from the other positive target sentences in a mini-batch. Therefore, N equals the batch size on one GPU card.
Retrieve. During the retrieval stage, we first compute the sentence representations for all sentences in D with the sentence encoder E_s(·). To accelerate the retrieval process, we build IndexFlatIP indexes for the representation vectors with the FAISS (Johnson et al., 2019) library, which supports efficient nearest neighbor similarity search over billions of dense vectors. We return the top K sentences for each concept set with Retriever_0 as the candidate sentence pool P_0. We find that when training the retriever with hard negatives from concept matching, P_0 is more helpful to the generation model (refer to Appendix C for more details).
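The operation IndexFlatIP performs is exact top-K inner-product search; the NumPy sketch below reproduces it in miniature so the paper's retrieval step is concrete (FAISS does the same computation with optimized kernels at corpus scale). The toy vectors and function name are ours.

```python
import numpy as np

def top_k_inner_product(query, index_vecs, k=2):
    """Exact inner-product search, i.e. what FAISS's IndexFlatIP
    computes: return indices and scores of the k best matches."""
    scores = index_vecs @ query
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()

# Three "sentence" vectors and one "concept-set" query, d = 2.
index_vecs = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
query = np.array([1.0, 0.2])
ids, scores = top_k_inner_product(query, index_vecs, k=2)
print(ids)  # [2, 0]: inner products are 1.0, 0.2, 1.2
```

The returned ids index back into the corpus D, yielding the candidate sentence pool P_0 for the concept set.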
3.2 Distilling Knowledge from the Metric to the Ranker

Our ranker is based on the cross-encoder architecture, as shown in Figure 3(b) in Appendix A. We implement the ranker with BERT by putting a feed-forward layer over the hidden state of [CLS] at the last layer. The one-dimensional output is regarded as the similarity score between the concept set and the candidate sentence, sim(c, s).

Previous work uses the binary cross-entropy loss or contrastive loss to train rankers. During training, these works treat all negatives equally, without distinguishing the differences between them, yet during the re-ranking period they expect rankers to tell the differences between candidate sentences.
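The metric distillation rule in Figure 1 says the ranker's scores should preserve the ordering that the metric M induces over candidates. As a sketch, we use a simple unigram F1 as a stand-in for M (the paper uses metrics such as BLEU) and sort candidates by it to obtain the gold order the ranker is trained to reproduce; the metric choice and example sentences here are our own illustration.

```python
def unigram_f1(candidate, reference):
    """Toy stand-in for the metric M: unigram F1 over word types
    between a candidate and one reference sentence."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

reference = "a man squeezes his eyes shut and hangs his head"
candidates = [
    "she reads a book",
    "a man hangs his head and shuts his eyes",
]
# Metric distillation rule: the gold order that the ranker's
# sim(c, s) scores should reproduce.
gold_order = sorted(candidates, key=lambda s: unigram_f1(s, reference),
                    reverse=True)
print(gold_order[0])  # the candidate closer to the reference
```

Training the ranker against this graded ordering, instead of a flat positive/negative split, is what removes the train/re-rank discrepancy described above.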