Detect, Retrieve, Comprehend: A Flexible Framework for Zero-Shot
Document-Level Question Answering
Tavish McDonald1, Brian Tsan2, Amar Saini1, Juanita Ordonez1,
Luis Gutierrez1, Phan Nguyen1, Blake Mason1, Brenda Ng1
1Lawrence Livermore National Laboratory; 2University of California, Merced;
{mcdonald53, saini5, ordonez2, gutierrez74, nguyen97, mason35, ng30}@llnl.gov; btsan@ucmerced.edu
Abstract

Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework where human-posed questions can be adapted to extract diverse knowledge. Finetuning QA systems requires access to labeled data (tuples of context, question, and answer). However, data curation for document QA is uniquely challenging because the context (i.e., the text passage containing evidence to answer the question) needs to be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic in real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from extracted texts to form well-posed contexts; (3) QA to extract knowledge from contexts to return high-quality answers – extractive, abstractive, or Boolean. Using the QASPER dataset for evaluation, our Detect-Retrieve-Comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines due to superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.
1 Introduction

Growth in new machine learning publications has exploded in recent years, with much of this activity occurring outside traditional publication venues. For example, arXiv hosts researchers' manuscripts detailing the latest progress and burgeoning initiatives. In 2021 alone, over 68,000 machine learning papers were submitted to arXiv. Since 2015, submissions to this category have increased yearly at an average rate of 52%. While it is admirable that the accelerated pace of AI research has produced many innovative works and manuscripts, the sheer number of papers makes it prohibitively difficult to keep pace with the latest developments in the field. Increasingly, researchers turn to scientific search engines (e.g., Semantic Scholar and Zeta Alpha), powered by neural information retrieval, to find relevant literature. To date, scientific search engines (Fadaee et al. 2020; Zhao and Lee 2020; Parisot and Zavrel 2022) have focused on serving recommendations based on semantic similarity and lexical matching between a query phrase and a collection of document-derived contents, particularly titles and abstracts. Other efforts to elicit the details of scholarly papers have extracted quantified experimental results from structured tables (Kardas et al. 2020) and generated detailed summaries from the hierarchical content of scientific documents (Sotudeh, Cohan, and Goharian 2020).

[Figure 1: QASPER questions require PDF text extraction and evidence retrieval to generate an answer. Question: "What is the seed lexicon?" Evidence: "The seed lexicon consists of positive and negative predicates. If the predicate of an extracted event is in the seed lexicon and does not involve complex phenomena like negation, we assign the corresponding polarity score (+1 for positive events and -1 for negative events) to the event. We expect the model to automatically learn complex phenomena through label propagation. Based on the availability of scores and the types of discourse relations, we classify the extracted event pairs into the following three types." Answer: "a vocabulary of positive and negative predicates that helps determine the polarity score of an event".]
While these scientific search engines suffice for topic exploration, once a set of papers is identified as relevant, researchers want to probe deeper for information to address specific questions conditioned on their prior domain knowledge (e.g., What baselines is the neural relation extractor compared to?). While one can gain a sense of the main findings of a paper by reading the abstract, the answers to these probing questions are frequently found in the details of the methodology, experimental setup, and results sections. Furthermore, questions may require synthesis of document passages to produce an abstractive answer rather than simply extracting a contiguous span. Reading and manually cross-referencing the results of several papers is a labor-intensive way to glean specific knowledge from scientific documents. Therefore, effective tools to help automate knowledge discovery are sorely needed.
A promising approach to extracting knowledge from scientific publications is document-level question answering (QA): using an open set of questions to comprehend figure captions, tables, and accompanying text (Borchmann et al. 2021). Traditionally, the NLP community has focused on using clean texts as context for QA systems. However, this is not representative of the vast majority of scholarly information found in structured documents. As QA garners interest from the computer vision community, DocVQA (Mathew, Karatzas, and Jawahar 2021) and VisualMRC (Tanaka, Nishida, and Yoshida 2021) have extended document QA to extracting evidence from single images, paving the way to extend contexts from text to visual sources.

[Figure 2: An instance of our modular end-to-end DRC system comprised of DiT + ELECTRA CE + UnifiedQA. The diagram traces the pipeline: a PDF is converted to page images (pdf2image); DiT detects paragraph bounding boxes; pdfminer.six extracts paragraph texts; an ELECTRA cross-encoder (CE) ranks paragraphs against the question and keeps the top-K paragraph texts (K=3) with relevance scores (e.g., r=0.93, r=0.10, r=0.07); UnifiedQA predicts K answers, one per retrieved context.]
A foundational challenge in building robust document QA systems is ensuring well-formed contexts, which entails accurate text extraction and requires adaptation to new document layouts. Nonetheless, even when text can be cleanly extracted, there remains the crucial task of identifying question-relevant paragraphs for answer prediction.
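To make this context-selection step concrete, the following is a minimal sketch of ranking a document's paragraphs against a question with a cross-encoder. The public MS MARCO re-ranker named below is an illustrative stand-in, not the ELECTRA cross-encoder checkpoint used by DRC.

```python
from sentence_transformers import CrossEncoder

# Public MS MARCO re-ranker, used here as a stand-in for DRC's ELECTRA
# cross-encoder (an illustrative assumption, not the paper's checkpoint).
ranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(question: str, paragraphs: list[str], k: int = 3) -> list[str]:
    """Score every (question, paragraph) pair and keep the k most relevant."""
    scores = ranker.predict([(question, p) for p in paragraphs])
    ranked = sorted(zip(scores, paragraphs), key=lambda sp: sp[0], reverse=True)
    return [p for _, p in ranked[:k]]
```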
Our contribution is a general-purpose system for QA on full documents in their original PDF form that addresses the key challenges of scientific document QA: (1) accurate text extraction from unseen layouts, (2) evidence retrieval (i.e., context selection), and (3) robust QA. A demo of our system is available through Hugging Face.
2 Dataset

The Question Answering on Scientific Research Papers (QASPER) dataset consists of 1,585 NLP papers sourced from arXiv and is accompanied by 5,049 questions from NLP readers with corresponding answers from NLP practitioners. Papers in QASPER are cited by their arXiv DOIs, which we used to fetch the original PDF documents as input to our system, as our work is focused on knowledge extraction at the PDF level.
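As an illustration of this fetching step, here is a minimal sketch that downloads a paper's PDF from arXiv by its identifier. It assumes the standard arxiv.org PDF endpoint; the exact download tooling the authors used is not specified.

```python
import requests

def fetch_arxiv_pdf(arxiv_id: str, out_path: str) -> None:
    """Download an arXiv paper's PDF, e.g. fetch_arxiv_pdf("2210.01959", "paper.pdf")."""
    url = f"https://arxiv.org/pdf/{arxiv_id}"  # standard arXiv PDF endpoint
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(out_path, "wb") as f:
        f.write(response.content)
```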
QASPER contains 7,993 answers categorized by answer type: Extractive (4,142), Abstractive (1,931), Yes/No (1,110), and Unanswerable (810). Using only the Extractive, Abstractive, and Yes/No answers, we match our model prediction to the most similar answer when a question has more than one answer, and report our performance accordingly.
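This max-over-references convention can be made concrete with a small sketch of token-level Answer-F1. The normalization below (lowercasing and whitespace tokenization) is a simplifying assumption; the official QASPER evaluation script may normalize answers differently.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def answer_f1(prediction: str, references: list[str]) -> float:
    """Score against every reference answer and keep the most similar one."""
    return max(token_f1(prediction, ref) for ref in references)
```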
QASPER is ideal for evaluating our proposed framework because it provides: (1) paragraph text and table information to evaluate our layout-analysis model (in its ability to cleanly extract document regions); (2) evidence paragraphs to validate, and optionally finetune, our evidence retrieval model (in its ability to retrieve good context paragraphs); and (3) ground-truth answers to assess the accuracy of our QA model (in its ability to answer the question given the context).
3 Methodology

Document QA on raw PDFs is a necessary step toward automating knowledge extraction from scientific corpora, yet it has remained an unaddressed problem. To address it, we propose a flexible information extraction tool that alleviates the laborious search for answers grounded in evidence. Our system combines: (1) a robust text detector for visually rich documents, (2) explicit passage retrieval for evidence selection, and (3) multi-format answer prediction. We used pretrained open-source machine learning models that are effective in a zero-shot setting. We also finetuned these models to improve our system's end-to-end performance.
3.1 Problem Description

Our work addresses evidence retrieval at the PDF level. Thus, our document QA task is defined as: given a question and a PDF document, predict the answer to the question. We decompose this problem into three subtasks: text extraction (§ 3.2), evidence retrieval (§ 3.3), and QA (§ 3.4).

First, the PDF document, represented as a series of images, has its semantic regions identified and their corresponding text content extracted as passages. Second, the passages are ranked by their relevance to the question; irrelevant passages are filtered out so that only the most relevant passages are used as contexts for QA. Finally, given a context and a question, the answer is predicted. The overall architecture is shown in Figure 2. These components correspond to the respective tasks of Detect, Retrieve, and Comprehend, or DRC, which is also the name of our proposed system.
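To show how the stages compose, here is a minimal end-to-end sketch reusing the `retrieve` function sketched earlier and a public UnifiedQA checkpoint from Hugging Face. The lowercased "question \n context" input format follows UnifiedQA's documented usage, but the checkpoint choice and glue code are illustrative assumptions rather than the paper's actual implementation.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Public UnifiedQA checkpoint; "small" is chosen only to keep the sketch light.
tokenizer = T5Tokenizer.from_pretrained("allenai/unifiedqa-t5-small")
reader = T5ForConditionalGeneration.from_pretrained("allenai/unifiedqa-t5-small")

def comprehend(question: str, context: str) -> str:
    """Predict an answer for one retrieved context (UnifiedQA expects a
    lowercased 'question \\n context' input string)."""
    ids = tokenizer(f"{question} \n {context}".lower(), return_tensors="pt").input_ids
    output = reader.generate(ids, max_new_tokens=64)
    return tokenizer.decode(output[0], skip_special_tokens=True)

def answer_question(question: str, paragraphs: list[str], k: int = 3) -> list[str]:
    """Retrieve the top-k contexts, then predict one candidate answer per context."""
    contexts = retrieve(question, paragraphs, k)  # cross-encoder ranker sketched earlier
    return [comprehend(question, c) for c in contexts]
```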