Large Language Models are few(1)-shot Table Reasoners
Wenhu Chen
University of Waterloo, Vector Institute
wenhuchen@uwaterloo.ca
Abstract
Recent literature has shown that large language models (LLMs) are generally excellent few-shot reasoners on text reasoning tasks. However, the capability of LLMs on table reasoning tasks is yet to be explored. In this paper, we aim at understanding how well LLMs can perform table-related tasks with few-shot in-context learning. Specifically, we evaluated LLMs on popular table QA and fact verification datasets like WikiTableQuestion, FetaQA, TabFact, and FEVEROUS and found that LLMs are competent at complex reasoning over table structures, though these models are not pre-trained on any table corpus. When combined with ‘chain of thoughts’ prompting, LLMs can achieve very strong performance with only a 1-shot demonstration, even on par with some SoTA models. We show that LLMs are even more competent at generating comprehensive long-form answers on FetaQA than tuned T5-large. We further manually studied the reasoning chains elicited from LLMs and found that these reasoning chains are highly consistent with the underlying semantic form. We believe that LLMs can serve as a simple yet generic baseline for future research. The code and data are released at https://github.com/wenhuchen/TableCoT.
1 Introduction
The problem of structured knowledge grounding has been extensively studied for many years. Tables, as one of the most popular (semi-)structured forms for storing world knowledge, receive significant attention from the natural language processing (NLP) community. Traditional approaches mostly rely on synthesizing executable languages like SQL or SPARQL to access the information inside the table. However, these symbolic languages normally make rigid assumptions about the table and cannot capture the semantics of text chunks inside the table. Such issues are even more pronounced with web tables due to their irregular forms. To fully understand web tables, both structured reasoning and textual reasoning are required. Such challenges have attracted many researchers to work in the field.
Recently, a wide range of table-based tasks have been proposed, like table question answering (Pasupat and Liang, 2015; Chen et al., 2020c; Zhu et al., 2021; Chen et al., 2021b; Talmor et al., 2020; Chen et al., 2020a; Nan et al., 2022), table fact verification (Chen et al., 2019; Aly et al., 2021), table-based generation (Chen et al., 2020b; Parikh et al., 2020; Nan et al., 2021), and table-grounded conversation (Budzianowski et al., 2018; Nakamura et al., 2022). These tasks come with different input-output formats and domains. Due to this heterogeneity, models achieving the best results normally need to be fully fine-tuned on the specific downstream dataset with 10K-100K examples.
Recently, there have been efforts like UnifiedSKG (Xie et al., 2022) aiming to unify these heterogeneous table-based tasks into a generic text-to-text format. UnifiedSKG has shown that T5-3B (Raffel et al., 2020) with the text-to-text format can already achieve state-of-the-art performance on almost all the table-based tasks without task-specific designs. However, the proposed text-to-text models still need to be fully fine-tuned on the downstream tasks. UnifiedSKG also identified that T0-style (Sanh et al., 2022) cross-task transfer achieves only near-random performance.
Wei et al. (2022); Wang et al. (2022); Zhou et al. (2022); Drozdov et al. (2022) have recently discovered that large language models (Brown et al., 2020; Chowdhery et al., 2022; Ouyang et al., 2022) can be used to solve complex mathematical and commonsense reasoning tasks with few-shot in-context learning. Inspired by this discovery, we aim at understanding whether these LLMs can also solve complex table-based reasoning tasks. Though the LLMs are not specifically designed to encode tables, given the enormous number of tables present in the pre-training corpus, we believe they are also competent at reasoning over table information.

Figure 1: In-context learning for table-related tasks with chain-of-thoughts reasoning.
In this paper, we experimented with few-shot in-context learning for LLMs, as depicted in Figure 1. Instead of fine-tuning the model, we only provide a few examples to showcase the desired input-output format as the condition for the model to follow when solving unseen test examples. We experiment with several prompting variants, including (1) direct prediction, (2) chain of thoughts (Wei et al., 2022) (CoT), and (3) chain of thoughts with self-consistency (Wang et al., 2022) (CoT+SC). We evaluate these methods on WikiTableQA (Pasupat and Liang, 2015), FetaQA (Nan et al., 2022), TabFact (Chen et al., 2019), and FEVEROUS (Aly et al., 2021). Our results reveal that LLMs (Ouyang et al., 2022; Chen et al., 2021a; Chowdhery et al., 2022) can achieve striking performance with only 1 or 2 demonstrations, e.g. 48.8% on WikiTableQuestions and 78.8% on TabFact, which is on par with some near-SoTA models (Yu et al., 2021; Eisenschlos et al., 2020). On other datasets like FetaQA with long-form answers, our human evaluation reveals that GPT-3 can significantly outperform the fine-tuned T5-large by more than 30% in terms of correctness and adequacy.
Furthermore, we manually studied the chains of thoughts elicited from LLMs and found that the rationales are highly consistent with the ‘ground truth’ semantic forms when the model predictions are correct. We found that these models are surprisingly competent at performing symbolic operations over the table, like maximum, minimum, counting, comparison, addition, and difference. However, we also identify several issues with LLMs on these table reasoning tasks: (1) due to the token limitation, the model is unable to generalize to ‘huge’ tables with 30+ rows, which is the major error source; (2) LLMs can sometimes make simple mistakes when performing symbolic operations.
Due to their simplicity and generality, we believe LLMs with CoT should be used as an important baseline for any future table-related research.
2 Related Work
2.1 Reasoning over Tables
Table-based reasoning is traditionally accomplished by semantic parsing to execute commands on tables, as in WikiTableQuestions (Pasupat and Liang, 2015), WikiSQL (Zhong et al., 2017), and Spider (Yu et al., 2018). These models aim to synthesize SQL/SPARQL to interact with tables. However, these machine languages impose rigorous requirements on the tables, e.g. the values in the same column should follow the same data type. Such rigorous assumptions are frequently violated by web tables containing unnormalized free-form text in cells. Therefore, language understanding inside the table is essential for better performance.
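To make this failure mode concrete, consider a hedged illustration (the table, question, and query are invented for this example, not drawn from the cited datasets): a semantic parser might map the question "Which player scored more than 20 points?" to a query such as

    SELECT player FROM game_stats WHERE points > 20;

which presumes the points column is uniformly numeric. A web-table cell like "20 (incl. playoffs)" violates that typing assumption and makes the comparison unexecutable.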
Recently, Yin et al. (2020); Herzig et al. (2020); Liu et al. (2021); Deng et al. (2022) have proposed to pre-train on tables and text jointly to learn a joint representation. These pre-trained models can use the joint representation to perform reasoning implicitly without relying on symbolic execution. By pre-training on large-scale crawled or synthesized data, these models can normally achieve the best-known performance on table tasks. However, they still require a significant amount of fine-tuning on the downstream datasets. Unlike these methods, we are interested in in-context learning, where the model learns from only a few examples (demonstrations) without any fine-tuning. One contemporary work similar to ours is BINDER (Cheng et al., 2022), which utilizes Codex to synthesize SQL logical forms and execute them against tables for question answering. One big difference is that BINDER involves logical form execution; only when the execution fails does BINDER fall back to using the language model to answer the question directly, which is the setting most similar to ours.
2.2 In-context Learning with LLMs
GPT-3 (Brown et al., 2020) and other large language models have demonstrated strong abilities to perform few-shot prediction without fine-tuning, where the model is given a natural language description of the task along with a few examples. Scaling model size, data, and compute is crucial to enable this learning ability. Recently, Rae et al. (2021); Smith et al. (2022); Chowdhery et al. (2022); Du et al. (2022) have proposed to train different types of large language models with different training recipes. These LLMs have demonstrated a striking capability to use few-shot prompts to accomplish unseen tasks without any fine-tuning, which is found to be an emergent capability not present in smaller language models.
2.3 Chain of Thoughts Reasoning
Although LLMs (Brown et al., 2020; Chowdhery et al., 2022) have demonstrated remarkable success across a range of NLP tasks, their ability to demonstrate reasoning is often seen as a limitation, and such capability cannot be acquired simply by scaling up the model size. Recently, ‘chain of thoughts’ prompting (Wei et al., 2022) was discovered to empower LLMs to perform complex reasoning over text. By providing the model with several exemplars of reasoning chains, LLMs can learn to follow the template to solve difficult unseen tasks. Later, Wang et al. (2022) proposed to use self-consistency with CoT to further improve performance, and Kojima et al. (2022) discovered that LLMs can even perform reasoning without any demonstration given appropriate prompts. These recent findings reveal the strong capability of LLMs to perform complex reasoning. However, current studies are still heavily focused on text-based tasks like question answering, commonsense reasoning, etc. The models’ capability to reason over tables is yet unknown. In this paper, we are specifically interested in understanding LLMs’ capability to reason over web tables with CoT prompting.

Figure 2: Prompts used for question answering and fact verification tasks.
3 Method
We experiment with different in-context learning methods to solve the table-based reasoning tasks. To formulate the prompt, we linearize the table and concatenate it with a few examples as demonstrations for the language model to predict the output for an unseen test example. The format is described in Figure 2. We mainly investigate three different variants of language model prompting: (1) Direct Prediction, (2) Chain of Thoughts (CoT), and (3) Chain of Thoughts + Self-Consistency decoding (CoT+SC). For the self-consistency method, we use LLMs to generate five diverse reasoning paths and then use majority voting to select the most voted answer.
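As a minimal sketch of this pipeline in Python: the row-wise linearization format, the prompt wording, and the sample_completion callable below are illustrative assumptions rather than the released implementation.

    from collections import Counter
    from typing import Callable, List

    def linearize_table(header: List[str], rows: List[List[str]]) -> str:
        """Flatten a table into text, one '|'-separated line per row
        (the exact format in Figure 2 may differ; this is an assumption)."""
        lines = ["col : " + " | ".join(header)]
        for i, row in enumerate(rows, start=1):
            lines.append(f"row {i} : " + " | ".join(row))
        return "\n".join(lines)

    def build_prompt(demos: List[dict], test: dict) -> str:
        """Concatenate a few demonstrations (table, question, CoT rationale,
        answer) with the unseen test example, which the model completes."""
        parts = []
        for d in demos:
            parts.append(linearize_table(d["header"], d["rows"]))
            parts.append(f"Question: {d['question']}")
            parts.append(f"Answer: {d['rationale']} So the answer is {d['answer']}.")
            parts.append("")  # blank line between in-context examples
        parts.append(linearize_table(test["header"], test["rows"]))
        parts.append(f"Question: {test['question']}")
        parts.append("Answer:")
        return "\n".join(parts)

    def cot_with_self_consistency(prompt: str,
                                  sample_completion: Callable[[str], str],
                                  n_paths: int = 5) -> str:
        """Sample n diverse reasoning paths and majority-vote the final
        answers, as in CoT+SC. Assumes each completion ends its rationale
        with 'the answer is ...'."""
        answers = []
        for _ in range(n_paths):
            completion = sample_completion(prompt)
            answers.append(completion.split("the answer is")[-1].strip(" ."))
        return Counter(answers).most_common(1)[0][0]

Note that the five paths must be drawn with a nonzero sampling temperature so the rationales actually differ; under greedy decoding, majority voting degenerates to a single path.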
To limit the budget and constrain the input token length, we truncate the input tables to contain only
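One simple way to realize such a token-budget truncation is sketched below, reusing linearize_table from the previous sketch; keeping only the leading rows, the budget value, and the whitespace token counter are assumptions for illustration, not necessarily the exact rule used.

    def truncate_table(header, rows, token_budget=1024, count_tokens=len):
        """Keep only the leading rows whose linearized text fits within the
        token budget. If even the first row exceeds the budget, no rows are
        kept; a real tokenizer could be passed in via count_tokens."""
        kept = []
        for row in rows:
            candidate = linearize_table(header, kept + [row])
            if count_tokens(candidate.split()) > token_budget:
                break
            kept.append(row)
        return kept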