Large Language Models are few(1)-shot Table Reasoners
Wenhu Chen
University of Waterloo, Vector Institute
wenhuchen@uwaterloo.ca
Abstract
Recent literature has shown that large language models (LLMs) are generally excellent few-shot reasoners on text reasoning tasks. However, the capability of LLMs on table reasoning tasks is yet to be explored. In this paper, we aim at understanding how well LLMs can perform table-related tasks with few-shot in-context learning. Specifically, we evaluated LLMs on popular table QA and fact verification datasets like WikiTableQuestion, FetaQA, TabFact, and FEVEROUS and found that LLMs are competent at complex reasoning over table structures, though these models are not pre-trained on any table corpus. When combined with ‘chain of thoughts’ prompting, LLMs can achieve very strong performance with only a 1-shot demonstration, even on par with some SoTA models. We show that LLMs are even more competent at generating comprehensive long-form answers on FetaQA than tuned T5-large. We further manually studied the reasoning chains elicited from LLMs and found that these reasoning chains are highly consistent with the underlying semantic form. We believe that LLMs can serve as a simple yet generic baseline for future research. The code and data are released at https://github.com/wenhuchen/TableCoT.
1 Introduction
The problem of structured knowledge grounding has been extensively studied for many years. Tables, as one of the most popular (semi-)structured forms for storing world knowledge, receive significant attention from the natural language processing (NLP) community. Traditional approaches mostly rely on synthesizing executable languages like SQL or SPARQL to access the information inside the table. However, these symbolic languages normally make rigid assumptions about the table and cannot capture the semantics of text chunks inside the table. Such issues are even more pronounced with web tables due to their irregular forms. To fully understand web tables, both structured reasoning and textual reasoning are required. Such challenges have attracted many researchers to work in the field.
Recently, a wide range of table-based tasks have been proposed, like table question answering (Pasupat and Liang, 2015; Chen et al., 2020c; Zhu et al., 2021; Chen et al., 2021b; Talmor et al., 2020; Chen et al., 2020a; Nan et al., 2022), table fact verification (Chen et al., 2019; Aly et al., 2021), table-based generation (Chen et al., 2020b; Parikh et al., 2020; Nan et al., 2021), and table-grounded conversation (Budzianowski et al., 2018; Nakamura et al., 2022). These tasks come with different input-output formats and domains. Due to this heterogeneity, models achieving the best results normally need to be fully fine-tuned on the specific downstream dataset with 10K-100K examples.
Recently, there have been efforts like UnifiedSKG (Xie et al., 2022) aiming to unify these heterogeneous table-based tasks into a generic text-to-text format. UnifiedSKG has shown that T5-3B (Raffel et al., 2020) with the text-to-text format can already achieve state-of-the-art performance on almost all the table-based tasks without task-specific designs. However, the proposed text-to-text models still need to be fully fine-tuned on the downstream tasks. UnifiedSKG also identified that T0-style (Sanh et al., 2022) cross-task transfer achieves only near-random performance.
Wei et al. (2022); Wang et al. (2022); Zhou et al. (2022); Drozdov et al. (2022) have recently discovered that large language models (Brown et al., 2020; Chowdhery et al., 2022; Ouyang et al., 2022) can be used to solve complex mathematical and commonsense reasoning tasks with few-shot in-context learning. Inspired by this discovery, we aim at understanding whether these LLMs can also solve complex table-based reasoning tasks. Though the LLMs are not specifically designed to encode tables, given the enormous number of tables present in the pre-training corpus, we believe they are also competent at reasoning over table information.

Figure 1: In-context learning for table-related tasks with chain-of-thoughts reasoning.
In this paper, we experimented with few-shot in-context learning for LLMs, as depicted in Figure 1. Instead of fine-tuning the model, we only provide a few examples to showcase the desired input-output format as the condition for the model to follow when solving unseen test examples. We experiment with several prompting variants, including (1) direct prediction, (2) chain of thoughts (Wei et al., 2022) (CoT), and (3) chain of thoughts with self-consistency (Wang et al., 2022) (CoT+SC). We evaluate these methods on WikiTableQA (Pasupat and Liang, 2015), FetaQA (Nan et al., 2022), TabFact (Chen et al., 2019), and FEVEROUS (Aly et al., 2021). Our results reveal that LLMs (Ouyang et al., 2022; Chen et al., 2021a; Chowdhery et al., 2022) can achieve striking performance with only 1 or 2 demonstrations, e.g. 48.8% on WikiTableQuestions and 78.8% on TabFact, which is on par with some near-SoTA models (Yu et al., 2021; Eisenschlos et al., 2020). On other datasets like FetaQA with long-form answers, our human evaluation reveals that GPT-3 can significantly outperform the fine-tuned T5-large by more than 30% in terms of correctness and adequacy.
Furthermore, we manually studied the chains of thoughts elicited from LLMs and found that the rationales are highly consistent with the ‘ground truth’ semantic forms when the model predictions are correct. We found that these models are surprisingly competent at performing symbolic operations over the table, like maximum, minimum, counting, comparison, addition, and difference. However, we also identify several issues with LLMs on these table reasoning tasks: (1) due to the token limitation, the model is unable to generalize to ‘huge’ tables with 30+ rows, which is the major error source; (2) LLMs can sometimes make simple mistakes when performing symbolic operations.
Due to their simplicity and generality, we believe LLMs with CoT should be used as an important baseline for any future table-related research.
2 Related Work
2.1 Reasoning over Tables
Table-based reasoning is traditionally accomplished by semantic parsing to execute commands on tables, as in WikiTableQuestions (Pasupat and Liang, 2015), WikiSQL (Zhong et al., 2017), and Spider (Yu et al., 2018). These models aim to synthesize SQL/SPARQL to interact with tables. However, these machine languages impose rigorous requirements on the tables, e.g. the values in the same column should follow the same data type. Such rigorous assumptions are frequently violated by web tables containing unnormalized free-form text in cells. Therefore, language understanding inside the table is essential for better performance.
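To make this failure mode concrete, consider a hedged illustration (the table, question, and query are invented for this example, not drawn from the cited datasets): a semantic parser might map the question "Which player scored more than 20 points?" to a query such as

    SELECT player FROM game_stats WHERE points > 20;

which presumes the points column is uniformly numeric. A web-table cell like "20 (incl. playoffs)" violates that typing assumption and makes the comparison unexecutable.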
Recently, Yin et al. (2020); Herzig et al. (2020); Liu et al. (2021); Deng et al. (2022) have proposed to pre-train on tables and text jointly to learn a joint representation. These pre-trained models can use the joint representation to perform reasoning implicitly without relying on symbolic execution. By pre-training on large-scale crawled or synthesized data, these models can normally achieve the best-known performance on table tasks. However, they still require a significant amount of fine-tuning on the downstream datasets. Unlike these methods, we are interested in in-context learning, where the model learns from only a few examples (demonstrations) without any fine-tuning. One contemporary work similar to ours is BINDER (Cheng et al., 2022), which utilizes Codex to synthesize SQL logical forms and execute them against tables for question answering. One big difference is that BINDER involves logical form execution; only when the execution fails does BINDER fall back to using the language model to answer the question directly, which is the setting most similar to ours.
2.2 In-context Learning with LLMs
GPT-3 (Brown et al., 2020) and other large language models have demonstrated strong abilities to perform few-shot prediction without fine-tuning, where the model is given a natural language description of the task along with a few examples. Scaling model size, data, and compute is crucial to enable this learning ability. Recently, Rae et al. (2021); Smith et al. (2022); Chowdhery et al. (2022); Du et al. (2022) have proposed to train different types of large language models with different training recipes. These LLMs have demonstrated a striking capability to use few-shot prompts to accomplish unseen tasks without any fine-tuning, which is found to be an emergent capability not present in smaller language models.
2.3 Chain of Thoughts Reasoning
Although LLMs (Brown et al., 2020; Chowdhery et al., 2022) have demonstrated remarkable success across a range of NLP tasks, their ability to demonstrate reasoning is often seen as a limitation, and such capability cannot be acquired simply by scaling up the model size. Recently, ‘chain of thoughts’ prompting (Wei et al., 2022) was discovered to empower LLMs to perform complex reasoning over text. By providing the model with several exemplars of reasoning chains, LLMs can learn to follow the template to solve difficult unseen tasks. Later, Wang et al. (2022) proposed to use self-consistency with CoT to further improve performance, and Kojima et al. (2022) discovered that LLMs can even perform reasoning without any demonstration given appropriate prompts. These recent findings reveal the strong capability of LLMs to perform complex reasoning. However, current studies are still heavily focused on text-based tasks like question answering, commonsense reasoning, etc. The models’ capability to reason over tables is yet unknown. In this paper, we are specifically interested in understanding LLMs’ capability to reason over web tables with CoT prompting.

Figure 2: Prompts used for question answering and fact verification tasks.
3 Method
We experiment with different in-context learning methods to solve the table-based reasoning tasks. To formulate the prompt, we linearize the table and concatenate it with a few examples as demonstrations for the language model to predict the output for an unseen test example. The format is described in Figure 2. We mainly investigate three different variants of language model prompting: (1) Direct Prediction, (2) Chain of Thoughts (CoT), and (3) Chain of Thoughts + Self-Consistency decoding (CoT+SC). For the self-consistency method, we use LLMs to generate five diverse reasoning paths and then use majority voting to select the most voted answer.
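As a minimal sketch of this pipeline in Python: the row-wise linearization format, the prompt wording, and the sample_completion callable below are illustrative assumptions rather than the released implementation.

    from collections import Counter
    from typing import Callable, List

    def linearize_table(header: List[str], rows: List[List[str]]) -> str:
        """Flatten a table into text, one '|'-separated line per row
        (the exact format in Figure 2 may differ; this is an assumption)."""
        lines = ["col : " + " | ".join(header)]
        for i, row in enumerate(rows, start=1):
            lines.append(f"row {i} : " + " | ".join(row))
        return "\n".join(lines)

    def build_prompt(demos: List[dict], test: dict) -> str:
        """Concatenate a few demonstrations (table, question, CoT rationale,
        answer) with the unseen test example, which the model completes."""
        parts = []
        for d in demos:
            parts.append(linearize_table(d["header"], d["rows"]))
            parts.append(f"Question: {d['question']}")
            parts.append(f"Answer: {d['rationale']} So the answer is {d['answer']}.")
            parts.append("")  # blank line between in-context examples
        parts.append(linearize_table(test["header"], test["rows"]))
        parts.append(f"Question: {test['question']}")
        parts.append("Answer:")
        return "\n".join(parts)

    def cot_with_self_consistency(prompt: str,
                                  sample_completion: Callable[[str], str],
                                  n_paths: int = 5) -> str:
        """Sample n diverse reasoning paths and majority-vote the final
        answers, as in CoT+SC. Assumes each completion ends its rationale
        with 'the answer is ...'."""
        answers = []
        for _ in range(n_paths):
            completion = sample_completion(prompt)
            answers.append(completion.split("the answer is")[-1].strip(" ."))
        return Counter(answers).most_common(1)[0][0]

Note that the five paths must be drawn with a nonzero sampling temperature so the rationales actually differ; under greedy decoding, majority voting degenerates to a single path.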
To limit the budget and constrain the input token length, we truncate the input tables to contain only
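One simple way to realize such a token-budget truncation is sketched below, reusing linearize_table from the previous sketch; keeping only the leading rows, the budget value, and the whitespace token counter are assumptions for illustration, not necessarily the exact rule used.

    def truncate_table(header, rows, token_budget=1024, count_tokens=len):
        """Keep only the leading rows whose linearized text fits within the
        token budget. If even the first row exceeds the budget, no rows are
        kept; a real tokenizer could be passed in via count_tokens."""
        kept = []
        for row in rows:
            candidate = linearize_table(header, kept + [row])
            if count_tokens(candidate.split()) > token_budget:
                break
            kept.append(row)
        return kept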