
Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot
Document-Level Question Answering
Tavish McDonald1, Brian Tsan2, Amar Saini1, Juanita Ordonez1,
Luis Gutierrez1, Phan Nguyen1, Blake Mason1, Brenda Ng1
1Lawrence Livermore National Laboratory; 2University of California, Merced;
{mcdonald53, saini5, ordonez2, gutierrez74, nguyen97, mason35, ng30}@llnl.gov; btsan@ucmerced.edu
Abstract
Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework where human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question, and answer). However, data curation for document QA is uniquely challenging because the context (i.e., the text passage containing the evidence needed to answer the question) must be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from the extracted text to form well-posed contexts; (3) QA over these contexts to return high-quality answers, whether extractive, abstractive, or Boolean. Using the QASPER dataset for evaluation, our Detect-Retrieve-Comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines due to superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.
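To make the three-stage pipeline concrete, the following minimal sketch wires off-the-shelf components together in the same Detect-Retrieve-Comprehend order. The specific libraries (PyMuPDF, Sentence-Transformers, a Flan-T5 reader), model names, and function names are illustrative assumptions, not the components evaluated in this paper.

```python
# Minimal sketch of a Detect-Retrieve-Comprehend style pipeline.
# PyMuPDF, Sentence-Transformers, and Flan-T5 are illustrative stand-ins,
# not the components used by DRC itself.
import fitz                                                   # PyMuPDF
from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

retriever = SentenceTransformer("all-MiniLM-L6-v2")           # example retriever
reader = pipeline("text2text-generation", model="google/flan-t5-base")  # example reader

def extract_paragraphs(pdf_path: str) -> list[str]:
    """Stage 1 (Detect): extract raw text from the PDF and split it into paragraphs."""
    doc = fitz.open(pdf_path)
    text = "\n".join(page.get_text() for page in doc)
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def retrieve_evidence(question: str, paragraphs: list[str], k: int = 3) -> list[str]:
    """Stage 2 (Retrieve): rank paragraphs by embedding similarity to the question."""
    q_emb = retriever.encode(question, convert_to_tensor=True)
    p_emb = retriever.encode(paragraphs, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, p_emb)[0]
    top = scores.argsort(descending=True)[:k]
    return [paragraphs[int(i)] for i in top]

def answer(question: str, pdf_path: str) -> str:
    """Stage 3 (Comprehend): generate an answer conditioned on retrieved evidence."""
    evidence = retrieve_evidence(question, extract_paragraphs(pdf_path))
    prompt = f"question: {question} context: {' '.join(evidence)}"
    return reader(prompt, max_new_tokens=64)[0]["generated_text"]

print(answer("What is the seed lexicon?", "paper.pdf"))
```

Separating retrieval from reading keeps the reader's input short enough for a standard transformer context window, which is the practical constraint the second stage is meant to address.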
1 Introduction
Growth in new machine learning publications has exploded in recent years, with much of this activity occurring outside traditional publication venues. For example, arXiv hosts researchers' manuscripts detailing the latest progress and burgeoning initiatives. In 2021 alone, over 68,000 machine learning papers were submitted to arXiv. Since 2015, submissions to this category have increased yearly at an average rate of 52%. While it is admirable that the accelerated pace of AI research has produced many innovative works and manuscripts, the sheer number of papers makes it prohibitively difficult to keep pace with the latest developments in the field. Increasingly, researchers turn to scientific search engines (e.g., Semantic Scholar and Zeta Alpha), powered by neural information retrieval, to find relevant literature.
Figure 1: QASPER questions require PDF text extraction and evidence retrieval to generate an answer. The illustrated example pairs the question "What is the seed lexicon?" with a retrieved evidence passage ("The seed lexicon consists of positive and negative predicates. If the predicate of an extracted event is in the seed lexicon and does not involve complex phenomena like negation, we assign the corresponding polarity score (+1 for positive events and -1 for negative events) to the event. ...") and the generated answer "a vocabulary of positive and negative predicates that helps determine the polarity score of an event".
To date, scientific search engines (Fadaee et al. 2020; Zhao and Lee 2020; Parisot and Zavrel 2022) have focused on serving recommendations based on semantic similarity and lexical matching between a query phrase and a collection of document-derived contents, particularly titles and abstracts. Other efforts to elicit the details of scholarly papers have extracted quantified experimental results from structured tables (Kardas et al. 2020) and generated detailed summaries from the hierarchical content of scientific documents (Sotudeh, Cohan, and Goharian 2020).
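As a rough illustration of the matching these engines perform over titles and abstracts, the sketch below blends dense embedding similarity with BM25 lexical scores. The corpus, the encoder choice, and the equal-weight blend are assumptions made purely for illustration and do not describe the cited systems.

```python
# Illustrative sketch of query-to-abstract matching that combines semantic
# (embedding cosine) and lexical (BM25) signals. The corpus, model, and
# weighting below are assumptions for illustration only.
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

corpus = [
    "Title A. Abstract describing a neural relation extraction method ...",
    "Title B. Abstract describing result extraction from ML paper tables ...",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")         # example encoder
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])  # lexical index

def hybrid_scores(query: str, weight: float = 0.5) -> list[float]:
    """Blend cosine similarity with BM25 scores; a real system would
    normalize the two score ranges before mixing, which this toy skips."""
    dense = util.cos_sim(encoder.encode(query, convert_to_tensor=True),
                         encoder.encode(corpus, convert_to_tensor=True))[0]
    lexical = bm25.get_scores(query.lower().split())
    return [weight * float(d) + (1 - weight) * float(l)
            for d, l in zip(dense, lexical)]

print(hybrid_scores("neural relation extraction baselines"))
```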
While these scientific search engines suffice for topic exploration, once a set of papers is identified as relevant, researchers want to probe deeper for information to address specific questions conditioned on their prior domain knowledge (e.g., What baselines is the neural relation extractor compared to?). While one can gain a sense of the main findings of a paper by reading the abstract, the answers to these probing questions are frequently found in the details of the methodology, experimental setup, and results sections. Furthermore, questions may require synthesis of document passages to produce an abstractive answer rather than simply extracting a contiguous span. Reading and manually cross-referencing the results of several papers is a labor-intensive way to glean specific knowledge from scientific documents. Therefore, effective tools to help automate knowledge discovery are sorely needed.
A promising approach to extracting knowledge from scientific publications is document-level question answering (QA): using an open set of questions to comprehend fig-