Enhancing Tabular Reasoning with Pattern Exploiting Training
Abhilash Reddy Shankarampeta1, Vivek Gupta2*†, Shuo Zhang3
1IIT Guwahati; 2University of Utah; 3Bloomberg
sareddy53@gmail.com; vgupta@cs.utah.edu; szhang611@bloomberg.net
* Equal contribution. † Corresponding author.
Abstract
Recent methods based on pre-trained language models have exhibited superior performance on tabular tasks (e.g., tabular NLI), despite showing inherent problems such as not using the right evidence and making inconsistent predictions across inputs while reasoning over tabular data (Gupta et al., 2021). In this work, we utilize Pattern-Exploiting Training (PET) (i.e., strategic MLM) on pre-trained language models to strengthen these tabular reasoning models' pre-existing knowledge and reasoning abilities. Our upgraded model exhibits a superior understanding of knowledge facts and tabular reasoning compared to current baselines. Additionally, we demonstrate that such models are more effective for the underlying downstream task of tabular inference on INFOTABS. Furthermore, we show our model's robustness against adversarial sets generated through various character- and word-level perturbations.
1 Introduction
Natural Language Inference (NLI) is the problem of categorizing a hypothesis into entailment, contradiction, or neutral based on a given premise (Dagan et al., 2013). Large language models such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019c) have been applied to large datasets like SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), where they have shown performance comparable to that of humans.
However, existing methods based on language models are ineffective for reasoning over semi-structured data (Gupta et al., 2021). These models often ignore relevant rows and use spurious correlations in the hypothesis or pre-training information for making inferences (Neeraja et al., 2021; Poliak et al., 2018; Gururangan et al., 2018; Jain et al., 2021; Gupta et al., 2021). Due to existing biases in human-curated datasets (Rajpurkar et al., 2018; Zhou and Bansal, 2020), with hypotheses carrying annotation artifacts (Gururangan et al., 2018), models trained on such data often lack generalizability and robustness (Glockner et al., 2018).
Breakfast in America
Released    29 March 1979
Recorded    May–December 1978
Studio      The Village Recorder in LA
Genre       Pop, art rock, soft rock
Length      46:06
Label       A&M
Producer    Peter Henderson, Supertramp

H1: Breakfast in America is a pop album with a duration less than 50 minutes.
H2: Peter Henderson produces only rock albums.
H3: Breakfast in America was released towards the end of 1979.
H4: Breakfast in America is recorded in California.
H5: Supertramp is an English band.
H6: The album was released on 29 March 1978.

Table 1: An example of a tabular premise from INFOTABS (Gupta et al., 2020). Hypotheses H1 and H4 are entailed, H2 and H5 are neutral, and H3 and H6 are contradictions. Here, the bold entries, which correspond to the first column, are the keys, while the corresponding entries in the second column of the same row are their respective values.
Furthermore, the absence of comprehensive test sets hinders robust model evaluation. Thus, evaluating models based only on accuracy does not reflect their reliability and robustness (Ribeiro et al., 2020; Moradi and Samwald, 2021).
In this paper, we investigate current models' reasoning capability, particularly whether they can extract the right knowledge and correctly make rational inferences from that extracted knowledge. We focus on the task of tabular reasoning through table inference on INFOTABS (Gupta et al., 2020). For instance, in table 1, a model must filter out the relevant rows, i.e., extract knowledge, before applying the proper reasoning to categorize H1. Reasoning steps can be complex when involving numerical reasoning such as counting, sorting, comparison, and arithmetic (H1: 46 < 50), commonsense knowledge (H3: December occurs at the end of the year), and factual knowledge (H4: LA is short for Los Angeles).
It has been shown that LMs pre-trained without explicit supervision on a huge corpus of free web data implicitly incorporate several types of knowledge into their parameters (Peters et al., 2019). For extracting this knowledge from language models (LMs), various methods utilize probing (Hewitt and Liang, 2019; Voita and Titov, 2020, and others), attention (Jain and Wallace, 2019; Wiegreffe and Pinter, 2019), and prompting (Petroni et al., 2019; Shin et al., 2020, and others) strategies. This internalized knowledge often cannot be retrieved when fine-tuning for a subsequent task. One explanation is that the objectives of pre-training and fine-tuning are vastly different. This variation in training objectives also diminishes the expected performance gains on the task, hence necessitating further pre-training on training data (Xiong et al., 2020; Roberts et al., 2020; Eisenschlos et al., 2020). Therefore, reframing the subsequent task as a joint pre-training objective becomes essential. Hence, we reformulate tabular NLI, i.e., our downstream task, as a cloze-style problem, a.k.a. a masked language modeling (MLM) problem. For fine-tuning, we utilize the efficient Pattern-Exploiting Training (PET) technique (Schick and Schütze, 2021a,b; Tam et al., 2021). PET entails establishing pairs of cloze question patterns and verbalizers that enable subsequent tasks to utilize the knowledge of the pre-trained language models. In addition, PET does not need model upgrades, such as adding more layers or parameters during pre-training.
Compared to direct fine-tuning-based techniques, i.e., training a classifier layer on top of the LM, our method improves by +8.1 and +25.8 points on the factual and relational knowledge evaluation tasks, respectively (see table 4). On INFOTABS, a tabular inference dataset, our PET training approach outperforms the existing baselines by +1.72 on α1 (similar to dev), +2.11 on α2 (adversarial set), and +2.55 on α3 (zero-shot set); see table 5. This shows the effectiveness of our approach, especially on adversarial and out-of-domain challenging instances. Furthermore, we evaluate our improved model against instance perturbations to examine its robustness. These perturbations are generated by modifying existing INFOTABS instances, namely by changing names, numbers, places, phrases (paraphrasing), and characters (spelling errors). In addition, we also incorporate counterfactual instances (i.e., negation) to evaluate the model's robustness against pre-trained knowledge overfitting. The improvement in the counterfactual setting demonstrates that our approach helps the model ground better in the premise table evidence.
Our main contributions are the following:
• We propose a method for generating prompts to determine whether current models can infer from knowledge.
• We enhance the model's reasoning via prompt learning, i.e., PET, to extract knowledge from semi-structured tables.
• Our experiments on INFOTABS show that our proposed approach preserves knowledge and improves performance on the downstream NLI task. The results are robust when assessed on multiple curated adversarial test sets.
The dataset and associated scripts are available at https://infoadapet.github.io/.
2 Motivation
Case for Reasoning on Semi-structured Data. Reasoning over semi-structured data requires skills such as arithmetic and commonsense reasoning, understanding the text types in tabular cells, and aggregating information across numerous rows where necessary. For example, to judge H1 in table 1, the model needs to understand that "duration" and "length" are the same in the context of the table, which is about a music album. Also, numerical reasoning is required to verify that "46:06" is less than "50 minutes". At the same time, the model should understand that the premise (table) is about a music album; to classify H1, the model thus needs to combine the information present in two rows ({"Genre", "Length"}) and perform numerical reasoning on top of that factual information.
Implicit Knowledge is Required for Reasoning. For instance, for H3 in table 1, the model first needs to extract the relevant row, i.e., the "Released" row, from the table, and then implicitly compare the phrase "end of 1979" with the "Released" row value "29 March 1979". The model needs to perform temporal reasoning to know that the "year 1979" part is correct; however, the month "March" is not the "end of the year", whereas "November" or "December" is (implicit commonsense temporal knowledge). While previous works tried to incorporate knowledge via pre-training (Eisenschlos et al., 2020; Neeraja et al., 2021), in this work we integrate knowledge and reasoning ability simultaneously using Pattern-Exploiting Training (Tam et al., 2021). This approach improves the existing knowledge and enhances reasoning compared to existing methods.
Robustness is Critical for Model Evaluation. Tabular reasoning models typically fail on modest input modifications, a.k.a. adversarial manipulations of inputs, highlighting the models' limited robustness and generalizability (Gupta et al., 2021). Thus, evaluating reasoning models on adversarial sets generated by minimal input perturbations becomes vital. As a result, we propose additional adversarial test sets, generated through character- and word-level perturbations, to evaluate various aspects of model understanding and reasoning over tables. For example, if H1 (table 1) is changed to "Breakfast in Wales is a pop album with a duration of fewer than 50 minutes.", the label of hypothesis H1 changes from entailment to neutral, since we do not know any information about "Breakfast in Wales" from table 1. These minor input perturbations can alter the hypothesis' semantic interpretation. Ideally, a robust model with superior reasoning ability should perform well on these input-perturbed adversarial sets, as our technique also demonstrates.
3 Our Approach
In this section, we describe our method to (a) evaluate pre-trained LM knowledge for tabular reasoning, (b) enhance the model's tabular reasoning capability using PET training, and (c) assess model robustness to input perturbations.
3.1 Evaluation of Pre-training Knowledge
To examine how pre-training affects knowledge-based reasoning for tabular data, we focus on two types of knowledge: (a) factual knowledge (awareness of specific facts about entities) and (b) relational knowledge (awareness of the possible right relations between two distinct entities). For instance, in the sentence "Breakfast in America was released on March 29, 1979", "Breakfast in America" and "March 29, 1979" are considered factual knowledge, while their relationship term, i.e., "released", corresponds to relational knowledge.

We evaluate factual and relational knowledge in the language model before and after training for the downstream task, i.e., reasoning. Specifically, we query the model using "fill-in-the-blank" cloze statements (a.k.a. prompts). Since gauging knowledge with prompts is limited by how the prompts are constructed, we use part-of-speech tagging to detect nouns and verbs, which are then used to mask names, numbers, and dates. These prompts are generated using hypotheses from the α1 and dev sets, as these sets have a distribution similar to the training data (Gupta et al., 2020). We construct the prompts from both entailed and contradictory hypotheses. For prompts derived from entailed hypotheses, the model must predict the correct masked word, i.e., a term semantically equivalent to the word in the hypothesis. In contrast, for prompts derived from contradictory hypotheses, the model should predict a semantically different term with the same entity type as the one mentioned in the hypothesis. To study the effect of the premise, we also query the model with the premise; to do this, we modify the input to premise + prompt.
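As an illustration, the sketch below masks one proper noun or numeral in a hypothesis to form a prompt, optionally prepending the premise. It is a minimal sketch assuming spaCy's en_core_web_sm tagger; the helper name make_cloze_prompt and the exact masking heuristic are ours, not the paper's released code.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed POS tagger; any tagger would do

def make_cloze_prompt(hypothesis, premise=None):
    """Mask one proper noun or numeral in the hypothesis to form a
    fill-in-the-blank prompt; returns (prompt, gold answer)."""
    doc = nlp(hypothesis)
    candidates = [tok for tok in doc if tok.pos_ in ("PROPN", "NUM")]
    if not candidates:
        return None, None
    target = random.choice(candidates)
    prompt = hypothesis.replace(target.text, "<mask>", 1)
    # To study the effect of the premise, prepend its linearized form.
    if premise is not None:
        prompt = f"{premise} {prompt}"
    return prompt, target.text

prompt, answer = make_cloze_prompt(
    "Duration of Breakfast in America is 46 minutes.")
print(prompt, "->", answer)
# e.g. "Duration of Breakfast in America is <mask> minutes. -> 46"
```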
Prompts for Factual Knowledge Evaluation. As most factual knowledge is contained in proper nouns and numbers, we randomly mask proper nouns or numbers in the hypothesis to generate a prompt and query the language model to fill the masked tokens. For example, in "Duration of Breakfast in America is 46 minutes" (table 1), "Breakfast in America" and "46" are the factual information present in the sentence, and they are connected by "duration". We randomly mask either "Breakfast in America" or "46", e.g., to generate the prompt "Duration of Breakfast in America is <mask> minutes". Occasionally, a masked term is a number in numeric form (e.g., 2), while the model predicts its word form ("two"). We solve this by converting the predicted word into its numeric form or vice versa, e.g., for "Breakfast in America is produced by <mask> producers", where <mask> = two.
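A minimal sketch of this word/numeral normalization, assuming the third-party word2number and num2words packages (the paper does not specify its implementation):

```python
from word2number import w2n      # "two" -> 2
from num2words import num2words  # 2 -> "two"

def number_match(predicted, gold):
    """Accept a predicted token if it matches the gold answer in either
    numeric form ("2") or word form ("two")."""
    if predicted.lower() == gold.lower():
        return True
    try:
        if str(w2n.word_to_num(predicted)) == gold:
            return True
    except ValueError:
        pass
    try:
        if num2words(int(gold)) == predicted.lower():
            return True
    except ValueError:
        pass
    return False

# "Breakfast in America is produced by <mask> producers", gold answer "2"
print(number_match("two", "2"))  # True
```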
Prompts for Relational Knowledge Evaluation. Similar prompts are leveraged for relational knowledge. For example, to predict <mask> = released in "Breakfast in America was <mask> towards the end of 1979", the model needs to understand that "Breakfast in America" is a music album in order to predict "released" instead of "eaten", which is otherwise highly probable due to the neighboring context term "Breakfast". We also use WordNet (Miller, 1995) to discover synonyms for the masked term and check whether the predicted word is among them.

Figure 1: Training uses the two ADAPET components. The blue boxes represent the task inputs (entailed, in this case). (a) Decoupled label loss: using a cross-entropy loss over all labels, the model must score the right and wrong labels at the masked-out position. (b) Label conditioning: the model should predict the original token at a randomly masked-out position if the input text carries the entail label, and should not if the label is contradiction or neutral.
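A minimal sketch of the WordNet synonym check described above, assuming NLTK with the wordnet corpus downloaded:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)  # one-time corpus download

def prediction_is_synonym(predicted, masked_term):
    """Accept the prediction if it occurs among the lemma names of any
    WordNet synset of the masked term."""
    lemmas = {
        lemma.name().replace("_", " ").lower()
        for synset in wn.synsets(masked_term)
        for lemma in synset.lemmas()
    }
    return predicted.lower() in lemmas

# "movie" and "film" share a synset in WordNet
print(prediction_is_synonym("movie", "film"))  # True
```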
3.2 Knowledge Incorporation for Reasoning
The task of deducing inferences from tabular premises is similar to the typical NLI problem, except that the premises are tables rather than sentences. When evaluating reasoning skills, we use a variety of representations of the tabular premise (see section 4, appendix A.1). We also study the effect of pre-training on an NLI task before fine-tuning on INFOTABS.
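As one simple illustration (the actual representations are detailed in section 4 and appendix A.1), each key-value row of the table can be linearized into a sentence; the helper below is a hypothetical sketch, not the paper's exact template:

```python
def linearize_table(title, rows):
    """Turn each (key, value) row into one sentence of the premise paragraph."""
    return " ".join(f"The {key} of {title} is {value}." for key, value in rows)

premise = linearize_table(
    "Breakfast in America",
    [("Released", "29 March 1979"), ("Length", "46:06"), ("Label", "A&M")])
print(premise)
# The Released of Breakfast in America is 29 March 1979. The Length of ...
```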
Pattern-Exploiting Training. Using Pattern-Exploiting Training (PET) (Schick and Schütze, 2021a), NLU tasks are reformulated as cloze-style questions, and fine-tuning is performed using gradient-based methods. We use ADAPET (A Densely-supervised Approach to Pattern-Exploiting Training) (Tam et al., 2021), which increases supervision by decoupling the label token losses and applying label-conditioned masked language modeling (MLM) to the entire input.

The input to the language model is converted into a cloze-style form with the pattern <premise> ? <mask>, <hypothesis>. The model is tasked with predicting the masked word from the vocabulary. The model computes each token's probability as a softmax normalized over all vocabulary tokens, allowing the logits of all vocabulary tokens to affect each likelihood, as in the regular MLM objective. In PET, by contrast, the masked word must be predicted from the output space {Yes, Maybe, No}, which is mapped to the labels {Entailment, Neutral, Contradiction}; as a result, there is never a gradient signal for non-label tokens. Label-conditioned masked language modeling is introduced by randomly masking out context tokens, inverting the query to the model from "What is the appropriate label based on the input?" to "In light of the answer, what is the appropriate context?". If the label is "entail", during training the model is obligated to predict the original token; however, if the label is "contradiction" or "neutral", the model is trained not to predict the original token.
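The sketch below illustrates only the cloze pattern and verbalizer at inference time (the decoupled label loss and label conditioning are training-time additions and are omitted). It assumes an off-the-shelf roberta-base masked LM from Hugging Face Transformers and single-token verbalizers; it is a sketch of the idea, not the ADAPET implementation.

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Verbalizer: answer words mapped to NLI labels. Leading spaces matter for
# RoBERTa's BPE; we take the first subword if a verbalizer word splits.
VERBALIZER = {" Yes": "Entailment", " Maybe": "Neutral", " No": "Contradiction"}

def classify(premise, hypothesis):
    # Pattern: "<premise> ? <mask>, <hypothesis>"
    text = f"{premise} ? {tokenizer.mask_token}, {hypothesis}"
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    mask_index = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_index].squeeze(0)
    # Score only the verbalizer tokens, not the full vocabulary.
    scores = {
        label: logits[tokenizer.encode(word, add_special_tokens=False)[0]].item()
        for word, label in VERBALIZER.items()
    }
    return max(scores, key=scores.get)

print(classify("Breakfast in America was released on 29 March 1979.",
               "Breakfast in America was released in 1979."))
```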
Masked Language Modeling. ADAPET randomly masks tokens (RoBERTa style) from the context. Inspired by SpanBERT (Joshi et al., 2020) and ERNIE (Sun et al., 2019), we instead sample and mask entire words based on pre-defined conditions. In Conditional Whole Word Masking (CWWM), we create a set of words Sw from a given sentence, where the POS of the words in that set must be from {"Adjective", "Adverb", "Noun", "Verb", "Proper Noun", "Adposition", "Numeral", "Coordinating Conjunction", "Subordinating Conjunction"}¹. We sample words from the set Sw and mask all tokens matching a sampled word concurrently, while maintaining the same overall masking rate.

¹ https://universaldependencies.org/u/pos/
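A minimal sketch of CWWM, again assuming spaCy POS tags; the masking budget and sampling order are simplified relative to the paper:

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")
# Universal POS tags corresponding to the categories listed above.
CWWM_POS = {"ADJ", "ADV", "NOUN", "VERB", "PROPN", "ADP", "NUM", "CCONJ", "SCONJ"}

def cwwm(sentence, mask_token="<mask>", rate=0.15):
    """Sample whole words with an eligible POS tag and mask every occurrence
    of each sampled word, up to roughly `rate` of all tokens."""
    doc = nlp(sentence)
    eligible = sorted({tok.text for tok in doc if tok.pos_ in CWWM_POS})
    random.shuffle(eligible)
    budget = max(1, round(len(doc) * rate))
    masked, n_masked = set(), 0
    for word in eligible:
        if n_masked >= budget:
            break
        masked.add(word)
        n_masked += sum(tok.text == word for tok in doc)
    return " ".join(mask_token if tok.text in masked else tok.text for tok in doc)

print(cwwm("Breakfast in America was released on 29 March 1979."))
```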
3.3 Robustness with Input Perturbations
We apply a range of character- and word-level perturbations to hypotheses to simulate circumstances where the input is slightly noisy or deviates from the training data distribution. We use TextAttack (Morris et al., 2020), NLP Checklist (Ribeiro et al., 2020), and manual perturbations to generate the adversarial data.
Perturbation   Original text                                       Perturbed text
Character      Peter Henderson produces only rock albums           Peter Henbgderson produces only rock albsums
                                                                   Peter Hendersno produces only rokc albums
                                                                   Pter Henderson produces onl rock abus
                                                                   Petqr Henkerson prgduces only rock alocms
Location       Breakfast in America is recorded in California      Breakfast in America is recorded in Florida.
               Breakfast in America is recorded in USA             Breakfast in America is recorded in Syria.
               Breakfast in America is by an English rock band.    Breakfast in America is by an Mexican rock band.
Name           Peter Henderson produces only rock albums           John Doe produces only rock albums
Numbers        The album was released on 29 March 1978.            The album was released on 29 March 346.
                                                                   The album was released on 1 March 1978.
Negation       The genres of the album are pop and rock.           The genres of the album are not pop and rock.
Paraphrase     The album was recorded in the last half of 1979.    In the second part of 1979, the album was recorded.

Table 2: Examples of various perturbations used to generate the adversarial test sets based on table 1.
These adversarial sets test the model's dependence on word overlap, numerical comprehension, and hypothetical assertions. Refer to tables 2 and 9 for examples.
Character-level perturbation employs operations such as introducing a random character, switching characters, removing a random character, and substituting a random character in a randomly selected word. This alteration does not impact the label of the hypothesis because it does not alter the sentence's meaning.
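A minimal sketch of these four character-level operations (in the paper they are generated with TextAttack; the function below is a hypothetical stand-in):

```python
import random
import string

def perturb_characters(sentence):
    """Apply one random character edit (insert, swap, delete, substitute)
    to one randomly selected word."""
    words = sentence.split()
    i = random.randrange(len(words))
    chars = list(words[i])
    j = random.randrange(len(chars))
    op = random.choice(["insert", "swap", "delete", "substitute"])
    if op == "insert":
        chars.insert(j, random.choice(string.ascii_lowercase))
    elif op == "swap" and len(chars) > 1:
        k = (j + 1) % len(chars)
        chars[j], chars[k] = chars[k], chars[j]
    elif op == "delete" and len(chars) > 1:
        del chars[j]
    elif op == "substitute":
        chars[j] = random.choice(string.ascii_lowercase)
    words[i] = "".join(chars)
    return " ".join(words)

print(perturb_characters("Peter Henderson produces only rock albums"))
```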
Location perturbation modifies identified locations (countries, cities, and nationalities) in a sentence to another place specified in a location map. An NER model (TextAttack) identifies the location in a given sentence and replaces it with a location sampled from a dictionary. Here, cities are replaced with other cities, and likewise for countries. This perturbation transforms entailed hypotheses into contradictions but does not affect the original neutral and contradiction labels.

Name perturbation randomly replaces a person's name with another one from a name list. This perturbation changes the label of every hypothesis to neutral because the perturbed hypothesis and the premise mention different persons.
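Both perturbations follow the same detect-and-replace recipe; the sketch below uses spaCy NER in place of TextAttack's NER model, with a toy replacement map:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
# Toy replacement maps; the paper samples from larger location/name lists.
REPLACEMENTS = {
    "GPE": {"California": "Florida", "USA": "Syria"},
    "PERSON": {"Peter Henderson": "John Doe"},
}

def perturb_entity(sentence, ent_label):
    """Replace the first detected entity of the given label using the map."""
    doc = nlp(sentence)
    for ent in doc.ents:
        replacement = REPLACEMENTS.get(ent_label, {}).get(ent.text)
        if ent.label_ == ent_label and replacement:
            return sentence.replace(ent.text, replacement)
    return sentence

print(perturb_entity("Breakfast in America is recorded in California.", "GPE"))
# -> "Breakfast in America is recorded in Florida."
```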
Perturb Type     Size   Perturb Type      Size
character        1800   negation+char     1726
location         1229   negation+name     1677
name             1646   number+char        837
negation         1726   number+name        776
number            837   number+negation    817
paraphrase       1800   num+paraphrase     837
num+para+name     776   paraphrase+name   1721

Table 3: Number of examples for each perturbation type in the adversarial set.
Perturbing Numbers changes entailed sentences into contradictions but does not affect the labels of neutrals and contradictions. Contradictory statements remain contradictory because it is implausible that a randomly sampled number would match the actual number in the premise and thereby make the hypothesis entailed.
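A minimal sketch of the number perturbation, replacing one numeral with a different random value (a hypothetical stand-in for the generation used here):

```python
import random
import re

def perturb_number(sentence, low=1, high=2000):
    """Replace one numeral in the sentence with a different random number."""
    numerals = re.findall(r"\d+", sentence)
    if not numerals:
        return sentence
    old = random.choice(numerals)
    new = str(random.randrange(low, high))
    while new == old:
        new = str(random.randrange(low, high))
    return sentence.replace(old, new, 1)

print(perturb_number("The album was released on 29 March 1978."))
# e.g. "The album was released on 29 March 346."
```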
Negation transforms an entailment into a contradiction by negating the given sentence, keeping neutrals intact.

Paraphrasing paraphrases the given sentences without loss of meaning, using manual paraphrasing and the Pegasus model². Paraphrasing does not affect the inference label, as it does not change the semantic meaning of the hypothesis.

² https://biturl.top/MzQnMv
Composition of Perturbations perturbs sentences by applying several distinct perturbations sequentially. E.g., in num+para+name, we perturb the sentence "Supertramp, produced an album that was less than 60 minutes long" (premise: table 1) to "Supertramp, produced an album that was less than 40 minutes long" (number), then to "Supertramp released an album which lasted less than 40 minutes." (paraphrase), and then to "James released an album which lasted less than 40 minutes" (name).
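Composing perturbations is simply sequential function application; a sketch using the hypothetical helpers above:

```python
def compose(sentence, *perturbations):
    """Apply perturbation functions left to right."""
    for perturb in perturbations:
        sentence = perturb(sentence)
    return sentence

# num+para+name, with a paraphrase() step (e.g., a Pegasus wrapper) assumed:
# compose(h, perturb_number, paraphrase,
#         lambda s: perturb_entity(s, "PERSON"))
```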
4 Experiments and Analysis
Dataset. In our experiments, we use INFOTABS, a tabular inference dataset introduced by Gupta et al. (2020). The dataset is diverse in terms of the table domains, categories, and corresponding keys (entity types and forms) it contains, as illustrated in table 1. In addition, Gupta et al. (2020) reveal that inference on the corresponding hypotheses requires extensive knowledge and commonsense reasoning ability. Given the premise table, hypotheses are to be classified as entailment, contradiction, or neutral.