Enhancing Tabular Reasoning with Pattern Exploiting Training
Abhilash Reddy Shankarampeta1, Vivek Gupta2*†, Shuo Zhang3
1IIT Guwahati; 2University of Utah; 3Bloomberg
sareddy53@gmail.com; vgupta@cs.utah.edu; szhang611@bloomberg.net
* Equal contribution. † Corresponding author.
Abstract
Recent methods based on pre-trained language models have exhibited superior performance on tabular tasks (e.g., tabular NLI), despite showing inherent problems such as not using the right evidence and making inconsistent predictions across inputs while reasoning over tabular data (Gupta et al., 2021). In this work, we utilize Pattern-Exploiting Training (PET) (i.e., strategic MLM) on pre-trained language models to strengthen these tabular reasoning models' pre-existing knowledge and reasoning abilities. Our upgraded model exhibits a superior understanding of knowledge facts and tabular reasoning compared to current baselines. Additionally, we demonstrate that such models are more effective for the underlying downstream task of tabular inference on INFOTABS. Furthermore, we show our model's robustness against adversarial sets generated through various character- and word-level perturbations.
1 Introduction
Natural Language Inference (NLI) is the problem of categorizing a hypothesis into entailment, contradiction, or neutral based on a given premise (Dagan et al., 2013). Large language models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019c) have been applied to large datasets like SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), where they have shown performance comparable to that of humans.
However, existing methods based on language models are ineffective for reasoning over semi-structured data (Gupta et al., 2021). These models often ignore relevant rows and use spurious correlations in the hypothesis or pre-training information for making inferences (Neeraja et al., 2021; Poliak et al., 2018; Gururangan et al., 2018; Jain et al., 2021; Gupta et al., 2021). Due to existing biases in human-curated datasets (Rajpurkar et al., 2018; Zhou and Bansal, 2020), with hypotheses carrying annotation artifacts (Gururangan et al., 2018), models trained on such data often lack generalizability and robustness (Glockner et al., 2018).
Breakfast in America
Released    29 March 1979
Recorded    May–December 1978
Studio      The Village Recorder in LA
Genre       Pop, art rock, soft rock
Length      46:06
Label       A&M
Producer    Peter Henderson, Supertramp

H1: Breakfast in America is a pop album with a duration less than 50 minutes.
H2: Peter Henderson produces only rock albums.
H3: Breakfast in America was released towards the end of 1979.
H4: Breakfast in America is recorded in California.
H5: Supertramp is an English band.
H6: The album was released on 29 March 1978.

Table 1: An example of a tabular premise from INFOTABS (Gupta et al., 2020). Hypotheses H1 and H4 are entailed, H2 and H5 are neutral, and H3 and H6 are contradictions. Here, the bold entries, which correspond to the first column, are the keys, while the corresponding entries in the second column of the same row are their respective values.
Furthermore, the absence of comprehensive test sets hinders robust model evaluation. Thus, evaluating models based only on accuracy does not reflect their reliability and robustness (Ribeiro et al., 2020; Moradi and Samwald, 2021).
In this paper, we investigate current models' reasoning capability, particularly whether they can extract the right knowledge and correctly make rational inferences from that extracted knowledge. We focus on the task of tabular reasoning through table inference on INFOTABS (Gupta et al., 2020). For instance, in table 1, a model must filter out the relevant rows, i.e., extract knowledge, before applying the proper reasoning to categorize H1. Reasoning steps can be complex when involving numerical reasoning such as counting, sorting, comparison, and arithmetic (H1: 46 < 50), commonsense knowledge (H3: December occurs at the end of the year), and factual knowledge (H4: LA is short for Los Angeles).
It has been shown that LMs pre-trained without explicit supervision on a huge corpus of free web data implicitly incorporate several types of knowledge into their parameters (Peters et al., 2019). For extracting this knowledge from language models (LMs), various methods utilize probing (Hewitt and Liang, 2019; Voita and Titov, 2020, and others), attention (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019), and prompting (Petroni et al., 2019; Shin et al., 2020, and others) strategies. This internalized knowledge often cannot be retrieved when fine-tuning for a subsequent task. One explanation is that the objectives of pre-training and fine-tuning are vastly different. This variation in training objectives also diminishes the expected performance gains on the task, hence necessitating further pre-training on training data (Xiong et al., 2020; Roberts et al., 2020; Eisenschlos et al., 2020). Therefore, reframing the subsequent task as a joint pre-training objective becomes essential. Hence, we reformulate tabular NLI, i.e., our downstream task, as a cloze-style problem, a.k.a. a masked language modeling (MLM) problem. For fine-tuning, we utilize the efficient Pattern-Exploiting Training (PET) technique (Schick and Schütze, 2021a,b; Tam et al., 2021). PET entails establishing pairs of cloze question patterns and verbalizers that enable subsequent tasks to utilize the knowledge of the pre-trained language models. In addition, PET does not need model upgrades, such as adding more layers or parameters during pre-training.
Compared to direct fine-tuning-based techniques, i.e., training a classifier layer on top of the LM, our method improves by +8.1 and +25.8 points on the factual and relational knowledge evaluation tasks, respectively (see table 4). On INFOTABS, a tabular inference dataset, our PET training approach outperforms the existing baselines by +1.72 on α1 (similar to dev), +2.11 on α2 (adversarial set), and +2.55 on α3 (zero-shot set); see table 5. This shows the effectiveness of our approach, especially on adversarial and out-of-domain challenging instances. Furthermore, we evaluate our improved model against instance perturbations to examine its robustness. These perturbations are generated by modifying existing INFOTABS instances, namely by changing names, numbers, places, phrases (paraphrasing), and characters (spelling errors). In addition, we also incorporate counterfactual instances (i.e., negation) to evaluate the model's robustness against pre-trained knowledge overfitting. The improvement in the counterfactual setting demonstrates that our approach helps the model ground better in the premise table evidence.
Our main contributions are the following:
• We propose a method for generating prompts to determine whether current models can infer from knowledge.
• We enhance the model's reasoning via prompt learning, i.e., PET, to extract knowledge from semi-structured tables.
• Our experiments on INFOTABS show that our proposed approach preserves knowledge and improves performance on the downstream NLI task. The results are robust when assessed on multiple curated adversarial test sets.
The dataset and associated scripts are available at https://infoadapet.github.io/.
2 Motivation
Case for Reasoning on Semi-structured Data. Reasoning over semi-structured data requires skills such as arithmetic and commonsense reasoning, understanding the text types in tabular cells, and aggregating information across numerous rows where necessary. For example, to judge H1 in table 1, the model needs to understand that "duration" and "length" are the same in the context of the table, which is about a music album. Also, numerical reasoning is required to verify that "46:06" is less than "50 minutes". At the same time, the model should understand that the premise (table) is about a music album; to classify H1, the model thus needs to combine the information present in two rows ({"Genre", "Length"}) and perform numerical reasoning on top of that factual information.
Implicit Knowledge is Required for Reasoning. For instance, for H3 in table 1, the model first needs to extract the relevant row, i.e., the "Released" row, from the table, and then implicitly compare the phrase "end of 1979" with the "Released" row value "29 March 1979". The model needs to perform temporal reasoning to know that the "year 1979" part is correct; however, the month "March" is not the "end of the year", whereas "November" or "December" is (implicit commonsense temporal knowledge). While previous works tried to incorporate knowledge via pre-training (Eisenschlos et al., 2020; Neeraja et al., 2021), in this work we integrate knowledge and reasoning ability simultaneously using Pattern-Exploiting Training (Tam et al., 2021). This approach improves the existing knowledge and enhances reasoning compared to existing methods.
Robustness is Critical for Model Evaluation. Tabular reasoning models typically fail on modest input modifications, a.k.a. adversarial manipulations of inputs, highlighting the models' limited robustness and generalizability (Gupta et al., 2021). Thus, evaluating reasoning models on adversarial sets generated by minimal input perturbations becomes vital. As a result, we propose additional adversarial test sets, generated through character- and word-level perturbations, to evaluate various aspects of model understanding and reasoning over tables. For example, if H1 (table 1) is changed to "Breakfast in Wales is a pop album with a duration of fewer than 50 minutes.", the label of hypothesis H1 changes from entailment to neutral, since we do not know any information about "Breakfast in Wales" from table 1. These minor input perturbations can alter the hypothesis' semantic interpretation. Ideally, a robust model with superior reasoning ability should perform well on these input-perturbed adversarial sets, as our technique also demonstrates.
3 Our Approach
In this section, we describe our method to (a) evaluate pre-trained LM knowledge for tabular reasoning, (b) enhance the model's tabular reasoning capability using PET training, and (c) assess model robustness to input perturbations.
3.1 Evaluation of Pre-training Knowledge
To examine how pre-training affects knowledge-based reasoning for tabular data, we focus on two types of knowledge: (a) factual knowledge (awareness of specific facts about entities) and (b) relational knowledge (awareness of the possible right relations between two distinct entities). For instance, in the sentence "Breakfast in America was released on March 29, 1979", "Breakfast in America" and "March 29, 1979" are considered factual knowledge, while their relationship term, i.e., "released", corresponds to relational knowledge.

We evaluate factual and relational knowledge in the language model before and after training for the downstream task, i.e., reasoning. Specifically, we query the model using "fill-in-the-blank" cloze statements (a.k.a. prompts). Since gauging knowledge with prompts is limited by how the prompts are constructed, we use part-of-speech tagging to detect nouns and verbs, which are then used to mask names, numbers, and dates. These prompts are generated using hypotheses from the α1 and dev sets, as these sets have a distribution similar to the training data (Gupta et al., 2020). We construct the prompts from both entailed and contradictory hypotheses. For prompts derived from entailed hypotheses, the model must predict the correct masked word, i.e., a term semantically equivalent to the word in the hypothesis. In contrast, for prompts derived from contradictory hypotheses, the model should predict a semantically different term with the same entity type as the one mentioned in the hypothesis. To study the effect of the premise, we also query the model with the premise; to do this, we modify the input to premise + prompt.
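As an illustration, the sketch below masks one proper noun or numeral in a hypothesis to form a prompt, optionally prepending the premise. It is a minimal sketch assuming spaCy's en_core_web_sm tagger; the helper name make_cloze_prompt and the exact masking heuristic are ours, not the paper's released code.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed POS tagger; any tagger would do

def make_cloze_prompt(hypothesis, premise=None):
    """Mask one proper noun or numeral in the hypothesis to form a
    fill-in-the-blank prompt; returns (prompt, gold answer)."""
    doc = nlp(hypothesis)
    candidates = [tok for tok in doc if tok.pos_ in ("PROPN", "NUM")]
    if not candidates:
        return None, None
    target = random.choice(candidates)
    prompt = hypothesis.replace(target.text, "<mask>", 1)
    # To study the effect of the premise, prepend its linearized form.
    if premise is not None:
        prompt = f"{premise} {prompt}"
    return prompt, target.text

prompt, answer = make_cloze_prompt(
    "Duration of Breakfast in America is 46 minutes.")
print(prompt, "->", answer)
# e.g. "Duration of Breakfast in America is <mask> minutes. -> 46"
```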
Prompts for Factual Knowledge Evaluation. As most factual knowledge is contained in proper nouns and numbers, we randomly mask proper nouns or numbers in the hypothesis to generate a prompt and query the language model to fill the masked tokens. For example, in "Duration of Breakfast in America is 46 minutes" (table 1), "Breakfast in America" and "46" are the factual information present in the sentence, and they are connected by "duration". We randomly mask either "Breakfast in America" or "46", e.g., to generate the prompt "Duration of Breakfast in America is <mask> minutes". Occasionally, a masked term is a number in numeric form (e.g., 2), while the model predicts its word form ("two"). We solve this by converting the predicted word into its numeric form or vice versa, e.g., for "Breakfast in America is produced by <mask> producers", where <mask> = two.
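A minimal sketch of this word/numeral normalization, assuming the third-party word2number and num2words packages (the paper does not specify its implementation):

```python
from word2number import w2n      # "two" -> 2
from num2words import num2words  # 2 -> "two"

def number_match(predicted, gold):
    """Accept a predicted token if it matches the gold answer in either
    numeric form ("2") or word form ("two")."""
    if predicted.lower() == gold.lower():
        return True
    try:
        if str(w2n.word_to_num(predicted)) == gold:
            return True
    except ValueError:
        pass
    try:
        if num2words(int(gold)) == predicted.lower():
            return True
    except ValueError:
        pass
    return False

# "Breakfast in America is produced by <mask> producers", gold answer "2"
print(number_match("two", "2"))  # True
```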
Prompts for Relational Knowledge Evaluation. Similar prompts are leveraged for relational knowledge. For example, to predict <mask> = released in "Breakfast in America was <mask> towards the end of 1979", the model needs to understand that "Breakfast in America" is a music album in order to predict "released" instead of "eaten", which is otherwise highly probable due to the neighboring context term "Breakfast". We also use WordNet (Miller, 1995) to discover synonyms for the masked term and check whether the predicted word is among them.

Figure 1: Training uses the two ADAPET components. The blue boxes represent the task inputs (entailed, in this case). (a) Decoupled label loss: using a cross-entropy loss over all labels, the model must score the right and wrong labels at the masked-out position. (b) Label conditioning: the model should predict the original token at a randomly masked-out position if the input text carries the entail label, and should not if the label is contradiction or neutral.
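A minimal sketch of the WordNet synonym check described above, assuming NLTK with the wordnet corpus downloaded:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus download

def prediction_is_synonym(predicted, masked_term):
    """Accept the prediction if it occurs among the lemma names of any
    WordNet synset of the masked term."""
    lemmas = {
        lemma.name().replace("_", " ").lower()
        for synset in wn.synsets(masked_term)
        for lemma in synset.lemmas()
    }
    return predicted.lower() in lemmas

# "movie" and "film" share a synset in WordNet
print(prediction_is_synonym("movie", "film"))  # True
```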
3.2 Knowledge Incorporation for Reasoning
The task of deducing inferences from tabular premises is similar to the typical NLI problem, except that the premises are tables rather than sentences. When evaluating reasoning skills, we use a variety of representations of the tabular premise (see section 4, appendix A.1). We also study the effect of pre-training on an NLI task before fine-tuning on INFOTABS.
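As one simple illustration (the actual representations are detailed in section 4 and appendix A.1), each key-value row of the table can be linearized into a sentence; the helper below is a hypothetical sketch, not the paper's exact template:

```python
def linearize_table(title, rows):
    """Turn each (key, value) row into one sentence of the premise paragraph."""
    return " ".join(f"The {key} of {title} is {value}." for key, value in rows)

premise = linearize_table(
    "Breakfast in America",
    [("Released", "29 March 1979"), ("Length", "46:06"), ("Label", "A&M")])
print(premise)
# The Released of Breakfast in America is 29 March 1979. The Length of ...
```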
Pattern-Exploiting Training. Using Pattern-Exploiting Training (PET) (Schick and Schütze, 2021a), NLU tasks are reformulated as cloze-style questions, and fine-tuning is performed using gradient-based methods. We use ADAPET (A Densely-supervised Approach to Pattern-Exploiting Training) (Tam et al., 2021), which increases supervision by decoupling the label token losses and applying label-conditioned masked language modeling (MLM) to the entire input.

The input to the language model is converted into a cloze-style form with the pattern <premise> ? <mask>, <hypothesis>. The model is tasked with predicting the masked word from the vocabulary. The model computes each token's probability as a softmax normalized over all vocabulary tokens, allowing the logits of all vocabulary tokens to affect each likelihood, as in the regular MLM objective. In PET, by contrast, the masked word must be predicted from the output space {Yes, Maybe, No}, which is mapped to the labels {Entailment, Neutral, Contradiction}; as a result, there is never a gradient signal for non-label tokens. Label-conditioned masked language modeling is introduced by randomly masking out context tokens, inverting the query to the model from "What is the appropriate label based on the input?" to "In light of the answer, what is the appropriate context?". If the label is "entail", during training the model is obligated to predict the original token; however, if the label is "contradiction" or "neutral", the model is trained not to predict the original token.
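The sketch below illustrates only the cloze pattern and verbalizer at inference time (the decoupled label loss and label conditioning are training-time additions and are omitted). It assumes an off-the-shelf roberta-base masked LM from Hugging Face Transformers and single-token verbalizers; it is a sketch of the idea, not the ADAPET implementation.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Verbalizer: answer words mapped to NLI labels. Leading spaces matter for
# RoBERTa's BPE; we take the first subword if a verbalizer word splits.
VERBALIZER = {" Yes": "Entailment", " Maybe": "Neutral", " No": "Contradiction"}

def classify(premise, hypothesis):
    # Pattern: "<premise> ? <mask>, <hypothesis>"
    text = f"{premise} ? {tokenizer.mask_token}, {hypothesis}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index].squeeze(0)
    # Score only the verbalizer tokens, not the full vocabulary.
    scores = {
        label: logits[tokenizer.encode(word, add_special_tokens=False)[0]].item()
        for word, label in VERBALIZER.items()
    }
    return max(scores, key=scores.get)

print(classify("Breakfast in America was released on 29 March 1979.",
               "Breakfast in America was released in 1979."))
```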
Masked Language Modeling. ADAPET randomly masks tokens (RoBERTa style) from the context. Inspired by SpanBERT (Joshi et al., 2020) and ERNIE (Sun et al., 2019), we instead sample and mask entire words based on pre-defined conditions. In Conditional Whole Word Masking (CWWM), we create a set of words Sw from a given sentence, where the POS of the words in that set must be from {"Adjective", "Adverb", "Noun", "Verb", "Proper Noun", "Adposition", "Numeral", "Coordinating Conjunction", "Subordinating Conjunction"}¹. We sample words from the set Sw and mask all tokens matching a sampled word concurrently, while maintaining the same overall masking rate.

¹ https://universaldependencies.org/u/pos/
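A minimal sketch of CWWM, again assuming spaCy POS tags; the masking budget and sampling order are simplified relative to the paper:

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")
# Universal POS tags corresponding to the categories listed above.
CWWM_POS = {"ADJ", "ADV", "NOUN", "VERB", "PROPN", "ADP", "NUM", "CCONJ", "SCONJ"}

def cwwm(sentence, mask_token="<mask>", rate=0.15):
    """Sample whole words with an eligible POS tag and mask every occurrence
    of each sampled word, up to roughly `rate` of all tokens."""
    doc = nlp(sentence)
    eligible = sorted({tok.text for tok in doc if tok.pos_ in CWWM_POS})
    random.shuffle(eligible)
    budget = max(1, round(len(doc) * rate))
    masked, n_masked = set(), 0
    for word in eligible:
        if n_masked >= budget:
            break
        masked.add(word)
        n_masked += sum(tok.text == word for tok in doc)
    return " ".join(mask_token if tok.text in masked else tok.text for tok in doc)

print(cwwm("Breakfast in America was released on 29 March 1979."))
```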
3.3 Robustness with Input Perturbations
We apply a range of character- and word-level perturbations to hypotheses to simulate circumstances where the input is slightly noisy or deviates from the training data distribution. We use TextAttack (Morris et al., 2020), NLP Checklist (Ribeiro et al., 2020), and manual perturbations to generate the adversarial data.
Perturbation   Original text                                       Perturbed text
Character      Peter Henderson produces only rock albums           Peter Henbgderson produces only rock albsums
                                                                   Peter Hendersno produces only rokc albums
                                                                   Pter Henderson produces onl rock abus
                                                                   Petqr Henkerson prgduces only rock alocms
Location       Breakfast in America is recorded in California      Breakfast in America is recorded in Florida.
               Breakfast in America is recorded in USA             Breakfast in America is recorded in Syria.
               Breakfast in America is by an English rock band.    Breakfast in America is by an Mexican rock band.
Name           Peter Henderson produces only rock albums           John Doe produces only rock albums
Numbers        The album was released on 29 March 1978.            The album was released on 29 March 346.
                                                                   The album was released on 1 March 1978.
Negation       The genres of the album are pop and rock.           The genres of the album are not pop and rock.
Paraphrase     The album was recorded in the last half of 1979.    In the second part of 1979, the album was recorded.

Table 2: Examples of various perturbations used to generate the adversarial test sets based on table 1.
These adversarial sets test the model's dependence on word overlap, numerical comprehension, and hypothetical assertions. Refer to tables 2 and 9 for examples.
Character-level perturbation employs operations such as introducing a random character, switching characters, removing a random character, and substituting a random character in a randomly selected word. This alteration does not impact the label of the hypothesis because it does not alter the sentence's meaning.
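A minimal sketch of these four character-level operations (in the paper they are generated with TextAttack; the function below is a hypothetical stand-in):

```python
import random
import string

def perturb_characters(sentence):
    """Apply one random character edit (insert, swap, delete, substitute)
    to one randomly selected word."""
    words = sentence.split()
    i = random.randrange(len(words))
    chars = list(words[i])
    j = random.randrange(len(chars))
    op = random.choice(["insert", "swap", "delete", "substitute"])
    if op == "insert":
        chars.insert(j, random.choice(string.ascii_lowercase))
    elif op == "swap" and len(chars) > 1:
        k = (j + 1) % len(chars)
        chars[j], chars[k] = chars[k], chars[j]
    elif op == "delete" and len(chars) > 1:
        del chars[j]
    elif op == "substitute":
        chars[j] = random.choice(string.ascii_lowercase)
    words[i] = "".join(chars)
    return " ".join(words)

print(perturb_characters("Peter Henderson produces only rock albums"))
```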
Location perturbation modifies identified locations (countries, cities, and nationalities) in a sentence to another place specified in a location map. An NER model (TextAttack) identifies the location in a given sentence and replaces it with a location sampled from a dictionary. Here, cities are replaced with other cities, and likewise for countries. This perturbation transforms entailed hypotheses into contradictions but does not affect the original neutral and contradiction labels.

Name perturbation randomly replaces a person's name with another one from a name list. This perturbation changes the label of every hypothesis to neutral because the perturbed hypothesis and the premise mention different persons.
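Both perturbations follow the same detect-and-replace recipe; the sketch below uses spaCy NER in place of TextAttack's NER model, with a toy replacement map:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Toy replacement maps; the paper samples from larger location/name lists.
REPLACEMENTS = {
    "GPE": {"California": "Florida", "USA": "Syria"},
    "PERSON": {"Peter Henderson": "John Doe"},
}

def perturb_entity(sentence, ent_label):
    """Replace the first detected entity of the given label using the map."""
    doc = nlp(sentence)
    for ent in doc.ents:
        replacement = REPLACEMENTS.get(ent_label, {}).get(ent.text)
        if ent.label_ == ent_label and replacement:
            return sentence.replace(ent.text, replacement)
    return sentence

print(perturb_entity("Breakfast in America is recorded in California.", "GPE"))
# -> "Breakfast in America is recorded in Florida."
```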
Perturb Type     Size   Perturb Type      Size
character        1800   negation+char     1726
location         1229   negation+name     1677
name             1646   number+char        837
negation         1726   number+name        776
number            837   number+negation    817
paraphrase       1800   num+paraphrase     837
num+para+name     776   paraphrase+name   1721

Table 3: Number of examples for each perturbation type in the adversarial set.
Perturbing Numbers changes entailed sentences into contradictions but does not affect the labels of neutrals and contradictions. Contradictory statements remain contradictory because it is implausible that a randomly sampled number would match the actual number in the premise and thereby make the hypothesis entailed.
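A minimal sketch of the number perturbation, replacing one numeral with a different random value (a hypothetical stand-in for the generation used here):

```python
import random
import re

def perturb_number(sentence, low=1, high=2000):
    """Replace one numeral in the sentence with a different random number."""
    numerals = re.findall(r"\d+", sentence)
    if not numerals:
        return sentence
    old = random.choice(numerals)
    new = str(random.randrange(low, high))
    while new == old:
        new = str(random.randrange(low, high))
    return sentence.replace(old, new, 1)

print(perturb_number("The album was released on 29 March 1978."))
# e.g. "The album was released on 29 March 346."
```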
Negation transforms an entailment into a contradiction by negating the given sentence, keeping neutrals intact.

Paraphrasing paraphrases the given sentences without loss of meaning, using manual paraphrasing and the Pegasus model². Paraphrasing does not affect the inference label, as it does not change the semantic meaning of the hypothesis.

² https://biturl.top/MzQnMv
Composition of Perturbations perturbs sentences by applying several distinct perturbations sequentially. E.g., in num+para+name, we perturb the sentence "Supertramp, produced an album that was less than 60 minutes long" (premise: table 1) to "Supertramp, produced an album that was less than 40 minutes long" (number), then to "Supertramp released an album which lasted less than 40 minutes." (paraphrase), and then to "James released an album which lasted less than 40 minutes" (name).
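Composing perturbations is simply sequential function application; a sketch using the hypothetical helpers above:

```python
def compose(sentence, *perturbations):
    """Apply perturbation functions left to right."""
    for perturb in perturbations:
        sentence = perturb(sentence)
    return sentence

# num+para+name, with a paraphrase() step (e.g., a Pegasus wrapper) assumed:
# compose(h, perturb_number, paraphrase,
#         lambda s: perturb_entity(s, "PERSON"))
```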
4 Experiments and Analysis
Dataset. In our experiments, we use INFOTABS, a tabular inference dataset introduced by Gupta et al. (2020). The dataset is diverse in terms of the table domains, categories, and corresponding keys (entity types and forms) it contains, as illustrated in table 1. In addition, Gupta et al. (2020) reveal that inference on the corresponding hypotheses requires extensive knowledge and commonsense reasoning ability. Given the premise table, hypotheses are to be classified as entailment, contradiction, or neutral.