Metric-guided Distillation: Distilling Knowledge from the Metric to
Ranker and Retriever for Generative Commonsense Reasoning
Xingwei He1, Yeyun Gong2†, A-Long Jin1, Weizhen Qi3, Hang Zhang2, Jian Jiao4, Bartuer Zhou2, Biao Cheng2, Siu Ming Yiu1, Nan Duan2
1The University of Hong Kong, 2Microsoft Research Asia, 3University of Science and Technology of China, 4Microsoft
hexingwei15@gmail.com, ajin@eee.hku.hk, smyiu@cs.hku.hk, weizhen@mail.ustc.edu.cn,
{yegong, v-zhhang, jian.jiao, bazhou, bicheng, nanduan}@microsoft.com
Abstract
Commonsense generation aims to generate a realistic sentence describing a daily scene under the given concepts, which is very challenging, since it requires models to have relational reasoning and compositional generalization capabilities. Previous work focuses on retrieving prototype sentences for the provided concepts to assist generation. They first use a sparse retriever to retrieve candidate sentences, then re-rank the candidates with a ranker. However, the candidates returned by their ranker may not be the most relevant sentences, since the ranker treats all candidates equally without considering their relevance to the reference sentences of the given concepts. Another problem is that re-ranking is very expensive, but only using retrievers will seriously degrade the performance of their generation models. To solve these problems, we propose the metric distillation rule to distill knowledge from the metric (e.g., BLEU) to the ranker. We further transfer the critical knowledge summarized by the distilled ranker to the retriever. In this way, the relevance scores of candidate sentences predicted by the ranker and retriever will be more consistent with their quality measured by the metric. Experimental results on the CommonGen benchmark verify the effectiveness of our proposed method: (1) our generation model with the distilled ranker achieves a new state-of-the-art result; (2) our generation model with the distilled retriever even surpasses the previous SOTA.
1 Introduction
Commonsense reasoning is the ability to make reasonable and logical assumptions about daily scenes, which is a long-standing challenge in natural language processing. Recently, many discriminative tasks, such as CommonsenseQA (Talmor et al.,

∗ Work done during internship at Microsoft Research Asia.
† Corresponding author.
Concepts: eye, hang, head, shut, squeeze
Reference: A man squeezes his eyes shut and hangs his head.
BART: He squeezes her head shut, then grasps her eyes shut.
Ours: A baby with a blue shirt hangs his head and squeezes his eyes shut.
Table 1: Sentences generated by BART and our proposed model, DKMR2.
2019) and SWAG (Sap et al., 2019), have been proposed to evaluate commonsense reasoning ability by testing whether models can select the correct answer from the choices according to the given context. To test whether models acquire generative commonsense reasoning ability, Lin et al. (2020) proposed the commonsense generation (CommonGen) task, which requires models to produce a plausible sentence describing a specific daily-life scenario based on the given concepts.

CommonGen poses two main challenges, and it expects models to (1) reason over the commonsense relations among concepts to generate sentences in line with our commonsense; and (2) possess the compositional generalization ability to generate realistic sentences for unseen concept compositions. Experimental results (Lin et al., 2020) show that large-scale pre-trained models (e.g., BART) alone are not competent for this task (see Table 1). The main reason is that the source information is very limited; the models can therefore only rely on the internal implicit knowledge acquired during pre-training to solve this problem, resulting in generated sentences that violate commonsense.
To enrich the source information, EKI-BART (Fan et al., 2020) first retrieves prototype sentences for the input concepts, and then feeds the concepts and retrieved sentences into the generation model. Recent work, such as RE-T5 (Wang et al., 2021), KFCNet (Li et al., 2021), and KGR4 (Liu et al., 2022), extends this retrieve-and-generate framework by introducing a binary classifier to re-rank the retrieved candidate sentences and filter out candidates irrelevant to the input concepts. One problem with these works is the discrepancy between training and re-ranking for their ranker. Concretely, when training the ranker, they treat all retrieved candidate sentences as negatives, regardless of their relevance to the reference sentences of the input concepts. However, during re-ranking, the ranker is asked to point out how these candidates differ in their relevance to the references. Another problem is that the re-ranking process of the cross-encoder ranker is very time-consuming, which is non-negligible, especially for online systems.

arXiv:2210.11708v1 [cs.CL] 21 Oct 2022
In this paper, we also resort to the retrieve-and-generate pipeline to solve CommonGen, and further improve the retrieval module by alleviating the above problems. Our motivation is that the relevance scores of candidates computed by the ranker and retriever should be in line with the gold quality scores between candidates and reference sentences measured with the evaluation metric. To achieve this, we first distill the gold rank knowledge of candidates measured by the metric to the ranker. Next, we improve the retriever by transferring the metric knowledge from the distilled ranker to the retriever, rather than directly distilling it from the metric (please refer to Section 3.3 for more explanation). By doing so, the distilled ranker and retriever can select more relevant sentences than their counterparts without metric distillation.
The contributions of this work are summarized as follows: (1) We propose to Distill Knowledge from the Metric to Ranker and Retriever, termed DKMR2, for generative commonsense reasoning, which uses metric-guided distillation to improve the ranker and a progressive distillation strategy to improve the retriever.¹ (2) We conduct extensive experiments on the CommonGen benchmark. Our proposed model achieves a new state-of-the-art (SOTA) on both the v1.0 test set (43.37 vs. 39.15 on SPICE) and the official test set (v1.1) of the leaderboard (34.589 vs. 33.911 on SPICE). (3) The performance of DKMR2 with the distilled retriever is on par with DKMR2 using the distilled ranker. As a result, the expensive retrieve-then-rank pipeline can be replaced with the distilled retriever at the expense of negligible performance.

¹Our code and models are available at https://github.com/microsoft/advNLG.
2 Problem Statement

CommonGen is a constrained text generation task, with the goal of generating a coherent and plausible sentence s describing an everyday scenario using an unordered concept set c = {c1, c2, ..., cm}. Therefore, this task is typically formulated as maximizing the conditional probability of s:

p(s|c; θ) = ∏_{t=1}^{n} p(s_t | s_{<t}, c; θ),   (1)

where n denotes the length of the generated sequence s and s_{<t} refers to the sub-sequence generated before time step t.
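As a toy illustration of the factorization in Eq. (1), the log-probability of a sentence is simply the sum of the per-step log-probabilities. The function and the per-step probabilities below are hypothetical stand-ins for what an autoregressive decoder such as BART would emit, not the paper's actual model.

```python
import math

def sequence_log_prob(step_log_probs):
    """Eq. (1) in log space: log p(s|c) = sum_t log p(s_t | s_<t, c).

    `step_log_probs` holds log p(s_t | s_<t, c) for each time step,
    as an autoregressive decoder would produce them.
    """
    return sum(step_log_probs)

# Toy example: a 3-token sentence whose per-step probabilities are
# 0.5, 0.8 and 0.9 under some hypothetical decoder.
log_p = sequence_log_prob([math.log(0.5), math.log(0.8), math.log(0.9)])
print(round(math.exp(log_p), 3))  # 0.5 * 0.8 * 0.9 = 0.36
```

Working in log space, as above, avoids the numerical underflow that multiplying many small probabilities would cause for long sequences.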
3 Methodology

Following the previous work (Fan et al., 2020), we resort to the retrieve-then-generate framework to solve CommonGen, which mainly consists of two modules: the retrieval module and the generation module. The retrieval module aims to retrieve relevant sentences to assist the generation module in generating desirable outputs. Recent work (Wang et al., 2021; Li et al., 2021; Liu et al., 2022) extends this idea by introducing a ranker to the retrieval module. In this work, our retrieval module also resorts to the retrieve-then-rank pipeline, as illustrated in Figure 1. To be specific, the retrieval module mainly contains two models, the retriever and the ranker, where the retriever is used to retrieve candidate sentences for the given concept set and the ranker further re-ranks the retrieved sentences. Different from previous work, we improve the ranker by distilling knowledge from the gold quality scores between the candidate sentences and the gold reference sentences computed by the evaluation metric. Then, the distilled ranker passes the distilled knowledge to the retriever, aiming to correct the retriever's inaccurate retrieval operations.

In Section 3.1, we first introduce the warm-up of the retriever. Then, we show how to distill knowledge from the metric, in turn, to the ranker and retriever in Sections 3.2 and 3.3, respectively. Finally, we show how to generate sentences based on the retrieved sentences in Section 3.4.
3.1 Warm-up of the Retriever

[Figure 1: pipeline diagram of the retrieval module. A sparse retriever samples hard negatives from the external corpus to train the dual-encoder retriever; the trained retriever builds the candidate sentence pool, which the cross-encoder ranker re-ranks. Metric distillation rule: supposing M(s, S) > M(s_1, S) > M(s_2, S), we expect sim(x, s) > sim(x, s_1) > sim(x, s_2).]
Figure 1: The pipeline of the retrieval module. Dotted, red, black and blue lines denote the 'sample', 'retrieve', 'train' and 're-rank' processes, respectively. x and s are one paired source and target from CommonGen. S denotes the reference sentences (s ∈ S). M is the automatic evaluation metric, used to measure the quality of the retrieved sentence in terms of S.

We use a typical dense retrieval model (Karpukhin et al., 2020) as the retriever. As shown in Figure 3(a) in Appendix A, the retriever is based on the dual-encoder architecture. In this work, we implement the retriever with two independent encoders, initialized with BERT. We use the hidden state of the first token (i.e., [CLS]) at the last layer as the representation of the input sentence. We first use the sentence encoder E_s(·) to compute the d-dimensional dense representations for all sentences in the external corpus D. Then, we use the concept encoder E_c(·) to compute the dense representation of the concept set. The similarity sim(c, s) between them is measured by the dot product of their dense representation vectors:

sim(c, s) = E_c(c)^T E_s(s).   (2)
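The dot product in Eq. (2) can be sketched in a few lines. The vectors below are toy stand-ins for the [CLS] representations E_c(c) and E_s(s) that the BERT encoders would produce; the function name is ours.

```python
import numpy as np

def dot_similarity(concept_vec, sentence_vecs):
    """Eq. (2): sim(c, s) = E_c(c)^T E_s(s), for a batch of sentences.

    In the paper both representations come from BERT [CLS] states;
    here we use fixed toy vectors of dimension d = 3.
    """
    return sentence_vecs @ concept_vec

# Toy embeddings standing in for E_c(c) and two E_s(s_i).
c = np.array([1.0, 0.0, 1.0])
sents = np.array([[1.0, 1.0, 1.0],   # dot product with c: 2.0
                  [0.0, 1.0, 0.0]])  # dot product with c: 0.0
scores = dot_similarity(c, sents)
print(scores.tolist())  # [2.0, 0.0]
```

Because the score is a plain dot product, all corpus representations can be precomputed once and reused for every query, which is what makes dense retrieval at scale practical.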
Train. During training, we warm up the dual-encoder retriever, Retriever_0, with the contrastive loss (Chen et al., 2020):

L(c; s, s_1^-, ..., s_N^-) = -log [ exp(sim(c, s)) / ( exp(sim(c, s)) + Σ_{i=1}^{N} exp(sim(c, s_i^-)) ) ],

where sim(c, s) denotes the relevance score between the concept set c and the positive sentence s. Similarly, sim(c, s_i^-) is the relevance score between c and the i-th negative sentence. N refers to the number of negative sentences.
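The contrastive loss above can be sketched directly from its formula. This is a minimal scalar implementation for illustration; the real training computes it over batches of encoder outputs, and the function name and toy scores are ours.

```python
import math

def contrastive_loss(sim_pos, sim_negs):
    """Contrastive loss from Section 3.1:
    L = -log( exp(sim(c,s)) / (exp(sim(c,s)) + sum_i exp(sim(c,s_i^-))) ).
    """
    denom = math.exp(sim_pos) + sum(math.exp(s) for s in sim_negs)
    return -math.log(math.exp(sim_pos) / denom)

# The loss shrinks as the positive outscores the negatives.
loose = contrastive_loss(1.0, [1.0, 1.0])  # positive tied with negatives
tight = contrastive_loss(5.0, [1.0, 1.0])  # positive clearly ahead
print(tight < loose)  # True
```

With the positive tied with two negatives the loss is exactly log 3, since the softmax assigns the positive a probability of 1/3; driving the positive's score up pushes that probability toward 1 and the loss toward 0.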
Following DPR (Karpukhin et al., 2020), the negative sentences consist of one hard negative and N−1 in-batch negatives.² We consider two sparse retrievers to build the hard negative pool: (1) TF-IDF: compute the similarity scores between the sparse vectors of c and sentences in D; (2) concept matching: sort sentences in D in descending order of the number of input concepts appearing in each sentence. Each sparse retriever returns the top K sentences as the hard negative pool P for one concept set. When training Retriever_0, we randomly sample one sentence from P as the hard negative.

²Note that in-batch negatives come from the other positive target sentences in a mini-batch. Therefore, N equals the batch size on one GPU card.
Retrieve. During the retrieval stage, we first compute the sentence representations for all sentences in D with the sentence encoder E_s(·). To accelerate the retrieval process, we build IndexFlatIP indexes for the representation vectors with the FAISS (Johnson et al., 2019) library, which supports efficient nearest neighbor similarity search over billions of dense vectors. We return the top K sentences for each concept set with Retriever_0 as the candidate sentence pool P_0. We find that when training the retriever with hard negatives from concept matching, P_0 is more helpful to the generation model (refer to Appendix C for more details).
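The operation IndexFlatIP performs is exact top-K inner-product search; the NumPy sketch below reproduces it in miniature so the paper's retrieval step is concrete (FAISS does the same computation with optimized kernels at corpus scale). The toy vectors and function name are ours.

```python
import numpy as np

def top_k_inner_product(query, index_vecs, k=2):
    """Exact inner-product search, i.e. what FAISS's IndexFlatIP
    computes: return indices and scores of the k best matches."""
    scores = index_vecs @ query
    order = np.argsort(-scores)[:k]
    return order.tolist(), scores[order].tolist()

# Three "sentence" vectors and one "concept-set" query, d = 2.
index_vecs = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
query = np.array([1.0, 0.2])
ids, scores = top_k_inner_product(query, index_vecs, k=2)
print(ids)  # [2, 0]: inner products are 1.0, 0.2, 1.2
```

The returned ids index back into the corpus D, yielding the candidate sentence pool P_0 for the concept set.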
3.2 Distilling Knowledge from the Metric to the Ranker

Our ranker is based on the cross-encoder architecture, as shown in Figure 3(b) in Appendix A. We implement the ranker with BERT by putting a feed-forward layer over the hidden state of [CLS] at the last layer. The one-dimensional output is regarded as the similarity score between the concept set and the candidate sentence, sim(c, s).

Previous work uses the binary cross-entropy loss or contrastive loss to train rankers. During training, these works treat all negatives equally, without distinguishing the differences between them, yet during the re-ranking period they expect rankers to tell the differences between candidate sentences.
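The metric distillation rule in Figure 1 says the ranker's scores should preserve the ordering that the metric M induces over candidates. As a sketch, we use a simple unigram F1 as a stand-in for M (the paper uses metrics such as BLEU) and sort candidates by it to obtain the gold order the ranker is trained to reproduce; the metric choice and example sentences here are our own illustration.

```python
def unigram_f1(candidate, reference):
    """Toy stand-in for the metric M: unigram F1 over word types
    between a candidate and one reference sentence."""
    cand, ref = set(candidate.lower().split()), set(reference.lower().split())
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    p, r = overlap / len(cand), overlap / len(ref)
    return 2 * p * r / (p + r)

reference = "a man squeezes his eyes shut and hangs his head"
candidates = [
    "she reads a book",
    "a man hangs his head and shuts his eyes",
]
# Metric distillation rule: the gold order that the ranker's
# sim(c, s) scores should reproduce.
gold_order = sorted(candidates, key=lambda s: unigram_f1(s, reference),
                    reverse=True)
print(gold_order[0])  # the candidate closer to the reference
```

Training the ranker against this graded ordering, instead of a flat positive/negative split, is what removes the train/re-rank discrepancy described above.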