Mixed-modality Representation Learning and Pre-training for
Joint Table-and-Text Retrieval in OpenQA

Junjie Huang1†, Wanjun Zhong2†, Qian Liu1†, Ming Gong3, Daxin Jiang3, Nan Duan4
1Beihang University  2Sun Yat-sen University  3Microsoft STC Asia  4Microsoft Research Asia
{huangjunjie, qian.liu}@buaa.edu.cn
zhongwj25@mail2.sysu.edu.cn
{mingo, djiang, nanduan}@microsoft.com
Abstract
Retrieving evidence from tabular and textual resources is essential for open-domain question answering (OpenQA), as it provides more comprehensive information. However, training an effective dense table-text retriever is difficult due to the table-text discrepancy and the data sparsity problem. To address these challenges, we introduce an optimized OpenQA Table-TExt Retriever (OTTER) to jointly retrieve tabular and textual evidence. First, we propose to enhance mixed-modality representation learning via two mechanisms: modality-enhanced representation and a mixed-modality negative sampling strategy. Second, to alleviate the data sparsity problem and enhance general retrieval ability, we conduct retrieval-centric mixed-modality synthetic pre-training. Experimental results demonstrate that OTTER substantially improves table-and-text retrieval performance on the OTT-QA dataset. Comprehensive analyses examine the effectiveness of all the proposed mechanisms. Moreover, equipped with OTTER, our OpenQA system achieves the state-of-the-art result on the downstream QA task, with a 10.1% absolute gain in exact match over the previous best system.¹
1 Introduction
Open-domain question answering (Joshi et al., 2017; Dunn et al., 2017; Lee et al., 2019) aims to answer questions with evidence retrieved from a large-scale corpus. The prevailing solution follows a two-stage framework (Chen et al., 2017), where a retriever first retrieves relevant evidence and a reader then extracts answers from it. Existing OpenQA systems (Lee et al., 2019; Karpukhin et al., 2020; Mao et al., 2021) have demonstrated great success in retrieving and reading passages.

† Indicates equal contribution.
Work was done during an internship at Microsoft Research Asia.
¹ All the code and data are available at https://github.com/Jun-jie-Huang/OTTeR.
[Figure 1: An example of open question answering over tables and text. The question asks on what date the venue that hosted the 1920 Summer Olympics boxing and wrestling events was established. The retrieved table lists Antwerp Zoo as the venue for Boxing and Wrestling, and the retrieved passage about Antwerp Zoo states that it was established on 21 July 1843, which is the answer. Highlighted phrases in the same color indicate evidence pieces related to the question in each single modality. The answer is marked in red.]
However, most approaches are limited to questions whose answers reside in single-modal evidence, such as free-form text (Xiong et al., 2021b) or semi-structured tables (Herzig et al., 2021). Yet solving many real-world questions requires aggregating heterogeneous knowledge (e.g., tables and passages), because massive amounts of human knowledge are stored in different modalities. As the example in Figure 1 shows, the supporting evidence for the given question resides in both the table and the related passages. Therefore, retrieving relevant evidence from heterogeneous knowledge resources involving tables and passages is essential for advanced OpenQA, and is the focus of this work.
There are two major challenges in joint table-and-text retrieval: (1) there is a discrepancy between tables and text, which makes it difficult to jointly retrieve heterogeneous knowledge and to consider cross-modality connections; and (2) the data sparsity problem is extremely severe, because training a joint table-text retriever requires large-scale supervised data covering all targeted areas, which is laborious and impractical to obtain.

[Figure 2: The framework of the overall OpenQA system. It first jointly retrieves the top-k table-text blocks with our OTTER, and then answers the question from the retrieved evidence with a reader model. The figure shows an example table-text block with columns Venue, Sports, and Capacity and the row "Antwerp Zoo | Boxing, Wrestling | Not listed".]
In light of these two challenges, we introduce an optimized OpenQA Table-TExt Retriever, dubbed OTTER, which utilizes mixed-modality dense representations to jointly retrieve tables and text.
First, to model the interaction between tables and text, we propose to enhance mixed-modality representation learning via two novel mechanisms: modality-enhanced representation (MER) and mixed-modality hard negative sampling (MMHN). MER incorporates fine-grained representations of each modality to enrich the semantics. MMHN exploits table structure and creates hard negatives by substituting fine-grained key information in the two modalities, to encourage better discrimination of relevant evidence. Second, to alleviate the data sparsity problem and endow the model with general retrieval ability, we propose a retrieval-centric pre-training task on a large-scale synthesized corpus, which is constructed by automatically synthesizing mixed-modal evidence and reversely generating questions with a BART-based generator.
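To give a flavor of the question-generation side of this pre-training corpus, the sketch below runs a BART sequence-to-sequence model over a linearized table-text block with the Hugging Face transformers API. The checkpoint name, prompt format, and generation settings are placeholders: the paper's generator is a fine-tuned model, so this is only a hedged illustration of the reverse generation step, not the authors' pipeline.

```python
# A hedged sketch of reverse question generation with a BART-based generator
# (placeholder checkpoint and prompt format; the paper fine-tunes its own generator).
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

# A linearized mixed-modal evidence block (marker tokens follow Section 3.2's [TAB]/[PSG] convention).
block = ("[TAB] Venue is Antwerp Zoo ; Sports is Boxing, Wrestling ; Capacity is Not listed "
         "[PSG] Antwerp Zoo is a zoo in the centre of Antwerp, Belgium, established on 21 July 1843.")

inputs = tokenizer(block, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, num_beams=4, max_length=64)
question = tokenizer.decode(output_ids[0], skip_special_tokens=True)
```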
Our primary contributions are three-fold: (1) We propose three novel mechanisms to improve table-and-text retrieval for OpenQA, namely modality-enhanced representation, a mixed-modality hard negative sampling strategy, and mixed-modality synthetic pre-training. (2) Evaluated on OTT-QA, OTTER substantially improves retrieval performance compared with baselines; extensive experiments and analyses further examine the effectiveness of the three mechanisms. (3) Equipped with OTTER, our OpenQA system significantly surpasses previous state-of-the-art models, with a 10.1% absolute improvement in exact match.
2 Background
2.1 Problem Formulation
The task of OpenQA over tables and text is defined as follows. Given a corpus of tables $C_T = \{t_1, ..., t_T\}$ and a corpus of passages $C_P = \{p_1, ..., p_P\}$, the task aims to answer a question $q$ by extracting the answer $a$ from the knowledge resources $C_T$ and $C_P$. The standard system for solving this task involves two components: a retriever that first retrieves relevant evidence $c \subset C_T \cup C_P$, and a reader that extracts $a$ from the retrieved evidence set.
2.2 Table-and-text Retrieval
In this paper, we focus on table-and-text retrieval for OpenQA. To better align the mixed-modality information in table-and-text retrieval, we follow Chen et al. (2021) and take a table-text block as the basic retrieval unit, consisting of a table segment and relevant passages. Compared with retrieving a single table or passage, retrieving table-text blocks gives the retriever more clues to exploit, since single-modal data often contain incomplete context. Figure 2 illustrates table-and-text retrieval and our overall system.
2.2.1 Table-Text Block
Since relevant tables and passages do not necessarily coexist naturally, we need to construct table-text blocks before retrieval. One observation is that tables often hold large quantities of entities and events. Based on this observation and prior work (Chen et al., 2020a), we apply entity linking to group the heterogeneous data. Specifically, we use BLINK (Ledell et al., 2020), an effective entity linker capable of linking against all Wikipedia entities and their corresponding passages, to fuse tables and text. Given a flattened table segment, BLINK returns $l$ relevant passages linked to the entities in the table. However, as table size and passage quantity grow, the input may become too long for BERT-based encoders (Devlin et al., 2019). We therefore split each table into several segments, each containing only a single row, to limit the number of input tokens. This setup can be seen as a trade-off to respect the input limit, but our approach scales to full tables when the input capacity permits. More details about table-text blocks can be found in Appendix A.1.
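To make this construction concrete, the sketch below shows one plausible way to linearize a single-row table segment and append its linked passages into a table-text block string. The "column is value" template, the marker tokens, and the helper names are illustrative assumptions, not the authors' exact implementation; the linked passages are assumed to come from an external entity linker such as BLINK.

```python
# A minimal sketch of table-text block construction (assumed format, not the authors' exact code).
from typing import List

def linearize_row(header: List[str], row: List[str]) -> str:
    """Flatten one table row into a text sequence of 'column is value' pairs."""
    return " ; ".join(f"{col} is {val}" for col, val in zip(header, row))

def build_table_text_block(title: str,
                           header: List[str],
                           row: List[str],
                           linked_passages: List[str],
                           max_passages: int = 3) -> str:
    """Concatenate a single-row table segment with its entity-linked passages."""
    table_part = f"[TAB] {title} . {linearize_row(header, row)}"
    passage_part = " ".join(f"[PSG] {p}" for p in linked_passages[:max_passages])
    return f"{table_part} {passage_part}".strip()

# Example usage with the row from Figure 1 (title assumed for illustration):
block = build_table_text_block(
    title="Venues of the 1920 Summer Olympics",
    header=["Venue", "Sports", "Capacity"],
    row=["Antwerp Zoo", "Boxing, Wrestling", "Not listed"],
    linked_passages=["Antwerp Zoo is a zoo in the centre of Antwerp, Belgium. "
                     "It was established on 21 July 1843."],
)
```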
3 Methodology
We present OTTER, an OpenQA Table-TExt Retriever. We first introduce the basic dual-encoder architecture for dense retrieval (§ 3.1). We then describe three mechanisms to mitigate the table-text discrepancy and data sparsity problems, i.e., modality-enhanced representation (§ 3.2), mixed-modality hard negative sampling (§ 3.3), and mixed-modality synthetic pre-training (§ 3.4).
3.1 The Dual-Encoder Architecture
The prevailing choice for dense retrieval is the dual-encoder method. In this framework, a question $q$ and a table-text block $b$ are separately encoded into two $d$-dimensional vectors by a neural encoder $E(\cdot)$. The relevance between $q$ and $b$ is then measured by the dot product of these two vectors:

$$s(q, b) = \mathbf{q}^\top \cdot \mathbf{b} = E(q)^\top \cdot E(b). \quad (1)$$

The benefit of this method is that all table-text blocks can be pre-encoded into vectors to support indexed search at inference time. In this work, we initialize the encoder with pre-trained RoBERTa (Liu et al., 2019) and take the representation of the first [CLS] token as the encoded vector. When an incoming question is encoded, approximate nearest neighbor search can be leveraged for efficient retrieval (Johnson et al., 2021).
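As an illustration of the indexed search step, the snippet below uses FAISS (the library described by Johnson et al., 2021) to index pre-encoded block vectors and retrieve the top-k blocks by inner product. The index type, vector dimensionality, and array names are assumptions made for this sketch rather than details taken from the paper.

```python
# A minimal sketch of indexed retrieval with FAISS (assumed setup, not the paper's code).
import numpy as np
import faiss

d = 768 * 3                                               # dim of the concatenated block vectors (assumed)
block_vecs = np.random.rand(10000, d).astype("float32")   # stand-in for pre-encoded E(b) of all blocks

index = faiss.IndexFlatIP(d)      # exact inner-product index; IVF/HNSW variants trade accuracy for speed
index.add(block_vecs)             # encode and index all table-text blocks offline

query_vec = np.random.rand(1, d).astype("float32")        # stand-in for E(q)
scores, block_ids = index.search(query_vec, 100)          # top-100 blocks passed on to the reader
```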
Training
The training objective is to learn representations that maximize the relevance between the gold table-text block and the question. We follow Karpukhin et al. (2020) to learn the representations. Formally, given a training set of $N$ instances, where the $i$-th instance $(q_i, b_i^+, b_{i,1}^-, ..., b_{i,m}^-)$ consists of a positive block $b_i^+$ and $m$ negative blocks $\{b_{i,j}^-\}_{j=1}^{m}$, we minimize the cross-entropy loss:

$$L(q_i, b_i^+, \{b_{i,j}^-\}_{j=1}^{m}) = -\log \frac{e^{s(q_i, b_i^+)}}{e^{s(q_i, b_i^+)} + \sum_{j=1}^{m} e^{s(q_i, b_{i,j}^-)}}.$$

The negatives consist of one hard negative and $m-1$ in-batch negatives taken from the other instances in a mini-batch.
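The following PyTorch sketch shows one common way to compute this contrastive loss with one hard negative per question plus in-batch negatives. The tensor shapes and function names are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of the dual-encoder training loss with in-batch negatives
# (assumed shapes/names; not the authors' released code).
import torch
import torch.nn.functional as F

def contrastive_loss(q_vecs: torch.Tensor,        # [B, d] encoded questions
                     pos_vecs: torch.Tensor,      # [B, d] encoded gold table-text blocks
                     hard_neg_vecs: torch.Tensor  # [B, d] one hard negative per question
                     ) -> torch.Tensor:
    # Candidate pool for every question: all B positives + all B hard negatives.
    # For question i, column i holds its positive; every other column acts as an
    # in-batch or hard negative, matching the softmax over s(q, b) in the loss above.
    candidates = torch.cat([pos_vecs, hard_neg_vecs], dim=0)        # [2B, d]
    scores = torch.matmul(q_vecs, candidates.t())                   # [B, 2B] dot-product s(q, b)
    targets = torch.arange(q_vecs.size(0), device=q_vecs.device)    # index of each question's positive
    return F.cross_entropy(scores, targets)
```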
3.2 Modality-enhanced Representation
Most dense retrievers use a coarse-grained single-modal representation, taken either from the [CLS] token or from the averaged token representations (Zhan et al., 2020; Huang et al., 2021), which is insufficient to represent cross-modal information. To remedy this, we propose to learn modality-enhanced representations (MER) of table-text blocks.

As illustrated in Figure 3, instead of using only the coarse representation $h_{[CLS]}$ at the [CLS] token, MER incorporates tabular and textual representations ($h_{table}$ and $h_{text}$) to enhance the semantics of the table and the text. The modality-enhanced representation is thus

$$\mathbf{b} = [h_{[CLS]}; h_{table}; h_{text}],$$

where $;$ denotes concatenation.

Given the tokens of the tabular or textual modality, we compute its representation in one of the following ways: (1) FIRST: the representation of the beginning token (i.e., [TAB] or [PSG]); (2) AVG: averaged token representations; (3) MAX: max pooling over token representations; (4) SelfAtt: a weighted average over token representations, where the weights are computed by a self-attention layer. We discuss the impact of the different types of MER in § 5.3. Our best model adopts FIRST as the final setting. To ensure the same vector dimensionality as the enriched block representation, we represent the question by replicating the encoded question representation.
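The sketch below illustrates how the modality-enhanced block representation could be assembled with the FIRST strategy, alongside the replicated question representation. The assumed token layout ([CLS] ... [TAB] table tokens ... [PSG] passage tokens ...) and the helper names are illustrative assumptions, not the paper's exact code.

```python
# A minimal sketch of modality-enhanced representation (FIRST strategy), under an
# assumed token layout: [CLS] ... [TAB] table tokens ... [PSG] passage tokens ...
import torch

def modality_enhanced_repr(hidden_states: torch.Tensor,  # [B, L, d] encoder outputs for a block
                           tab_pos: torch.Tensor,        # [B] index of the [TAB] token
                           psg_pos: torch.Tensor         # [B] index of the first [PSG] token
                           ) -> torch.Tensor:
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device)
    h_cls = hidden_states[:, 0]                     # coarse block representation at [CLS]
    h_table = hidden_states[batch_idx, tab_pos]     # FIRST: representation of the [TAB] token
    h_text = hidden_states[batch_idx, psg_pos]      # FIRST: representation of the [PSG] token
    return torch.cat([h_cls, h_table, h_text], dim=-1)   # b = [h_cls; h_table; h_text], size 3d

def question_repr(hidden_states: torch.Tensor) -> torch.Tensor:
    # Replicate the [CLS] question vector three times so that q and b share dimensionality.
    h_cls = hidden_states[:, 0]
    return torch.cat([h_cls, h_cls, h_cls], dim=-1)
```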
3.3 Mixed-modality Hard Negative Sampling
Prior studies (Nogueira and Cho, 2019; Gillick et al., 2019) have found that hard negative sampling is essential for training a dense retriever. These methods take each piece of evidence as a whole and retrieve the most similar irrelevant one as the hard negative. Instead of taking an entire irrelevant block as the hard negative, we propose a mixed-modality hard negative sampling mechanism, which constructs more challenging hard negatives by substituting only partial information in the table or the text.
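As one plausible instantiation of this idea (the detailed procedure is given later in the paper), the sketch below builds a table-side hard negative by replacing a key cell value of the gold row with a value drawn from another row of the same column, leaving the linked passages unchanged. The cell-selection heuristic and function name are assumptions for illustration only.

```python
# An illustrative sketch of a mixed-modality hard negative: corrupt only the table side of a
# gold table-text block by swapping one cell value, keeping the linked passages unchanged.
# The cell-selection heuristic is an assumption, not the paper's exact procedure.
import random
from typing import List

def table_side_hard_negative(header: List[str],
                             gold_row: List[str],
                             other_rows: List[List[str]],
                             seed: int = 0) -> List[str]:
    rng = random.Random(seed)
    col = rng.randrange(len(header))                             # pick one column to corrupt
    candidates = [r[col] for r in other_rows if r[col] != gold_row[col]]
    if not candidates:                                           # nothing to substitute with
        return list(gold_row)
    corrupted = list(gold_row)
    corrupted[col] = rng.choice(candidates)                      # substitute fine-grained key information
    return corrupted
```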