
BLINK (Ledell et al., 2020) to fuse tables and text, which is an effective entity linker capable of linking against all Wikipedia entities and their corresponding passages. Given a flat table segment, BLINK returns the $l$ relevant passages linked to the entities in the table. However, as the table size and the number of passages grow, the input may become too long for BERT-based encoders (Devlin et al., 2019). Thus, to limit the number of input tokens, we split a table into several segments such that each segment contains only a single row. This setup can be seen as a trade-off to respect the input length limit, but our approach scales to full tables when the input capacity permits. More details about table-text blocks can be found in Appendix A.1.
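To make the construction concrete, the following sketch splits a flat table into single-row segments and pairs each segment with the passages linked to its entities. Here `link_entities` is a hypothetical placeholder for the BLINK linker (not its actual API), and the header-cell linearization is an illustrative assumption.

```python
from typing import Dict, List


def link_entities(cell_text: str) -> List[str]:
    """Hypothetical stand-in for the BLINK entity linker; in practice this
    would return the Wikipedia passages linked to entities in `cell_text`."""
    return []


def build_table_text_blocks(header: List[str], rows: List[List[str]]) -> List[Dict]:
    """Split a flat table into single-row segments and attach linked passages,
    keeping each table-text block within a BERT-style input budget."""
    blocks = []
    for row in rows:
        # One segment per row: the header paired with a single data row
        # (an assumed linearization, not necessarily the paper's exact format).
        segment = " ; ".join(f"{h} : {c}" for h, c in zip(header, row))
        # Passages linked to the entities appearing in this row's cells.
        passages = [p for cell in row for p in link_entities(cell)]
        blocks.append({"table_segment": segment, "passages": passages})
    return blocks
```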
3 Methodology
We present OTTER, an OpenQA Table-TExt Retriever. We first introduce the basic dual-encoder
architecture for dense retrieval (§ 3.1). We then
describe three mechanisms to mitigate the table-
text discrepancy and data sparsity problems, i.e.,
modality-enhanced representation (§ 3.2), mixed-
modality hard negative sampling (§ 3.3), and
mixed-modality synthetic pre-training (§ 3.4).
3.1 The Dual-Encoder Architecture
The prevailing choice for dense retrieval is the dual-encoder method. In this framework, a question $q$ and a table-text block $b$ are separately encoded into two $d$-dimensional vectors by a neural encoder $E(\cdot)$. Then, the relevance between $q$ and $b$ is measured by the dot product of these two vectors:
$$ s(q, b) = q^{\top} \cdot b = E(q)^{\top} \cdot E(b). \tag{1} $$
The benefit of this method is that all the table-text blocks can be pre-encoded into vectors to support indexed search at inference time. In this work, we initialize the encoder with pre-trained RoBERTa (Liu et al., 2019) and take the representation of the first [CLS] token as the encoded vector. Once an incoming question is encoded, approximate nearest neighbor search can be leveraged for efficient retrieval (Johnson et al., 2021).
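For concreteness, below is a minimal sketch of the dual-encoder scoring in Eq. (1) using Hugging Face transformers; the roberta-base checkpoint, the toy strings, and the use of two separately instantiated encoders are illustrative assumptions rather than the exact training setup.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
question_encoder = RobertaModel.from_pretrained("roberta-base")
block_encoder = RobertaModel.from_pretrained("roberta-base")


def encode(encoder: RobertaModel, text: str) -> torch.Tensor:
    """Return the representation of the first token (<s>, RoBERTa's
    counterpart of [CLS]) as the d-dimensional vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state[:, 0]                  # shape: (1, d)


# Toy question and linearized table-text block.
q = encode(question_encoder, "Which team won the 2018 final?")
b = encode(block_encoder, "2018 final ; winner : France. France beat Croatia 4-2.")
score = (q * b).sum(dim=-1)                                  # s(q, b) = E(q)^T · E(b)
print(score.item())
```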
Training
The training objective is to learn representations that maximize the relevance between the question and its gold table-text block. We follow Karpukhin et al. (2020) to learn the representations. Formally, given a training set of $N$ instances, where the $i$-th instance $(q_i, b_i^+, b_{i,1}^-, \dots, b_{i,m}^-)$ consists of a positive block $b_i^+$ and $m$ negative blocks $\{b_{i,j}^-\}_{j=1}^{m}$, we minimize the cross-entropy loss:
$$ \mathcal{L}(q_i, b_i^+, \{b_{i,j}^-\}_{j=1}^{m}) = -\log \frac{e^{s(q_i, b_i^+)}}{e^{s(q_i, b_i^+)} + \sum_{j=1}^{m} e^{s(q_i, b_{i,j}^-)}}. $$
The negatives consist of one hard negative and $m-1$ in-batch negatives drawn from the other instances in the same mini-batch.
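A minimal PyTorch sketch of this objective, assuming pre-computed question and block vectors; the batch size, dimensionality, and random inputs are illustrative.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(q_vecs: torch.Tensor,
                     pos_vecs: torch.Tensor,
                     hard_neg_vecs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss with in-batch negatives.

    q_vecs:        (B, d) question vectors
    pos_vecs:      (B, d) gold block vectors
    hard_neg_vecs: (B, d) one hard-negative block vector per question
    For question i, the blocks of the other B-1 instances in the batch
    serve as its in-batch negatives.
    """
    blocks = torch.cat([pos_vecs, hard_neg_vecs], dim=0)   # (2B, d)
    scores = q_vecs @ blocks.t()                            # (B, 2B) dot-product relevance
    targets = torch.arange(q_vecs.size(0))                  # gold block of question i is column i
    return F.cross_entropy(scores, targets)


# Toy usage with random vectors.
B, d = 4, 768
print(contrastive_loss(torch.randn(B, d), torch.randn(B, d), torch.randn(B, d)).item())
```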
3.2 Modality-enhanced Representation
Most dense retrievers use a coarse-grained single-modal representation, taken either from the representation of the [CLS] token or from the averaged token representations (Zhan et al., 2020; Huang et al., 2021), which is insufficient to represent cross-modal information. To remedy this, we propose to learn a modality-enhanced representation (MER) for table-text blocks.
As illustrated in Figure 3, instead of using only the coarse representation $h_{[CLS]}$ at the [CLS] token, MER incorporates tabular and textual representations ($h_{table}$ and $h_{text}$) to enhance the semantics of the table and the text. Thus, the modality-enhanced representation is $b = [h_{[CLS]}; h_{table}; h_{text}]$, where $;$ denotes concatenation.
Given the tokens in a tabular/textual modality, we calculate a representation in the following ways: (1) FIRST: the representation of the beginning token (i.e., [TAB] and [PSG]); (2) AVG: averaged token representations; (3) MAX: max pooling over token representations; (4) SelfAtt: a weighted average over token representations where the weights are computed by a self-attention layer. We discuss the impact of different types of MERs in § 5.3. Our best model adopts FIRST as the final setting. To match the dimensionality of the enriched block representation, we represent the question by replicating its encoded representation.
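The sketch below illustrates the FIRST variant under a few assumptions: the block input is laid out as [CLS] [TAB] table tokens [PSG] passage tokens, the [TAB]/[PSG] markers are added as special tokens, a single shared roberta-base encoder stands in for the separate question and block encoders, and the question vector is simply tiled three times to match the enriched dimensionality.

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# [TAB] and [PSG] mark the beginning of the tabular and textual modalities.
tokenizer.add_special_tokens({"additional_special_tokens": ["[TAB]", "[PSG]"]})
encoder = RobertaModel.from_pretrained("roberta-base")
encoder.resize_token_embeddings(len(tokenizer))

TAB_ID = tokenizer.convert_tokens_to_ids("[TAB]")
PSG_ID = tokenizer.convert_tokens_to_ids("[PSG]")


def encode_block(table_segment: str, passages: str) -> torch.Tensor:
    """FIRST-style MER: b = [h_[CLS]; h_table; h_text], where h_table and
    h_text are the representations of the [TAB] and [PSG] tokens."""
    text = f"[TAB] {table_segment} [PSG] {passages}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state[0]      # (seq_len, d)
    ids = inputs["input_ids"][0]
    h_cls = hidden[0]                                         # <s> plays the role of [CLS]
    h_table = hidden[(ids == TAB_ID).nonzero()[0, 0]]
    h_text = hidden[(ids == PSG_ID).nonzero()[0, 0]]
    return torch.cat([h_cls, h_table, h_text], dim=-1)        # (3d,)


def encode_question(question: str) -> torch.Tensor:
    """Replicate the question vector to match the 3d block dimensionality."""
    inputs = tokenizer(question, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        q = encoder(**inputs).last_hidden_state[0, 0]          # (d,)
    return q.repeat(3)                                          # (3d,)
```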
3.3 Mixed-modality Hard Negative Sampling
Prior studies (Nogueira and Cho, 2019; Gillick et al., 2019) have found that hard negative sampling is essential in training a dense retriever. These methods treat each piece of evidence as a whole and retrieve the most similar irrelevant one as the hard negative. Instead of finding an entirely irrelevant block, we propose a mixed-modality hard negative sampling mechanism, which constructs more challenging hard negatives by substituting only partial information in the table or text.
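As a rough illustration of the idea (not necessarily the paper's exact construction), the sketch below forms a hard negative from a positive block and a similar-but-irrelevant block by substituting only one modality; the block fields follow the hypothetical dictionary format used in the earlier sketches.

```python
import random
from typing import Dict


def mixed_modality_hard_negative(positive: Dict[str, str],
                                 similar_block: Dict[str, str]) -> Dict[str, str]:
    """Corrupt only one modality of the positive block: keep its table segment
    and swap in foreign passages, or keep its passages and swap in a foreign
    table segment, yielding a negative that stays close to the positive."""
    negative = dict(positive)
    if random.random() < 0.5:
        negative["table_segment"] = similar_block["table_segment"]
    else:
        negative["passages"] = similar_block["passages"]
    return negative
```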