Realistic Data Augmentation Framework for Enhancing Tabular Reasoning
Dibyakanti Kumar1, Vivek Gupta2*, Soumya Sharma3, Shuo Zhang4
1IIT Guwahati; 2University of Utah; 3IIT Kharagpur; 4Bloomberg;
dibyakan@iitg.ac.in; vgupta@cs.utah.edu; soumyasharma20@gmail.com; szhang611@bloomberg.net
Abstract
Existing approaches to constructing training
data for Natural Language Inference (NLI)
tasks, such as for semi-structured table reason-
ing, are either via crowdsourcing or fully au-
tomatic methods. However, the former is ex-
pensive and time-consuming and thus limits
scale, and the latter often produces naive exam-
ples that may lack complex reasoning. This pa-
per develops a realistic semi-automated frame-
work for data augmentation for tabular infer-
ence. Instead of manually generating a hypoth-
esis for each table, our methodology gener-
ates hypothesis templates transferable to simi-
lar tables. In addition, our framework entails
the creation of rational counterfactual tables
based on human written logical constraints and
premise paraphrasing. For our case study, we
use INFOTABS (Gupta et al., 2020), which
is an entity-centric tabular inference dataset.
We observed that our framework could gen-
erate human-like tabular inference examples,
which could benefit training data augmenta-
tion, especially in the scenario with limited su-
pervision.
1 Introduction
Natural Language Inference (NLI) is a Natural Lan-
guage Processing task of determining if a hypothe-
sis is entailed or contradicted given a premise or is
unrelated to it (Dagan et al., 2013). The NLI task
has been extended to tabular data, where it takes
tables as the premise instead of sentences; this is
known as the tabular inference task. Two popular
human-curated datasets for tabular reasoning,
TABFACT (Chen et al., 2020b) and INFOTABS
(Gupta et al., 2020), have enhanced recent research in this area.
However, human-generated datasets are limited in scale and thus insufficient
for learning with large language models (e.g., Devlin et al., 2019; Liu
et al., 2019a). Since curating these datasets requires expertise, substantial
annotation time, and expense, they cannot be scaled. Furthermore, it has been
shown that these datasets suffer from annotation bias and spurious correlation
problems (e.g., Poliak et al., 2018; Gururangan et al., 2018; Geva et al., 2019).
In contrast, automatically generated data lacks diversity and involves only
naive reasoning. Recently, the use of large language generation models (e.g.,
Radford et al.; Lewis et al., 2020; Raffel et al., 2020) has also been proposed
for data generation (e.g., Zhao et al., 2022; Ouyang et al., 2022; Mishra et al.,
2022). Despite substantial improvement, these generation approaches still lack
factuality, i.e., they hallucinate, have poor fact coverage, and suffer from
token repetition (refer to the analysis in Appendix §E). Recently, Chen et al.
(2020a) showed that automatic tabular NLG frameworks cannot produce logical
statements and provide only surface reasoning.

(Author footnotes: Equal Contribution; Corresponding Author.)
To address the above shortcomings, we propose
a semi-automatic framework that exploits the pat-
terns in tabular structure for hypothesis generation.
Specifically, this framework generates hypothesis
templates transferable to similar tables since ta-
bles with similar categories, e.g., two athlete tables
in Wikipedia, will share many common attributes.
In Table 1, the premise table's key attributes, such as “Born”, “Died”, and
“Children”, are likely to be shared across other tables from the “Person”
category. One can therefore write a template for tables in the Person category,
such as <Person_Name> died before/after <Died:Year>. Such a template can be
used to generate sentences like hypotheses H1 and H1C in Table 1. Furthermore,
humans can utilize cell types (e.g., Date, Boolean) when writing templates.
Recently, it has been shown that training on counterfactual data enhances model
robustness (Müller et al., 2021; Wang and Culotta, 2021; Rajagopal et al., 2022).
Therefore, we also utilize the overlapping key pattern to create counterfactual
tables. The complexity and diversity of the templates can be enforced via human
annotators.
arXiv:2210.12795v1 [cs.CL] 23 Oct 2022
Janet Leigh (Original)                                     | Janet Leigh (Counter-Factual)
Born        July 6, 1927                                   | Born        July 6, 1927
Died        October 3, 2004                                | Died        January 13, 1994
Children    Kelly Curtis; Jamie Lee Curtis                 | Children    Kelly Curtis
Alma Mater  Stanford University                            | Alma Mater  University of California
Occupation  None                                           | Occupation  Scientist

H1: Janet Leigh was born before 1940.                   E  | H1C: Janet Leigh was born after 1915.                   E
H2: The age of Janet Leigh is more than 70.             E  | H2C: The age of Janet Leigh is more than 70.            C
H3: Janet Leigh has 1 children                          C  | H3C: Janet Leigh has more than 2 children.              C
H4: Janet Leigh graduated from Stanford University      E  | H4C: Janet Leigh graduated from Stanford University     C
Table 1: An example of an original and a counterfactual table from the "Person" category. Here, we illustrate how
multiple operations can be used to alter different keys. In addition, we show how the labels (E - Entail, C -
Contradict) for a specific hypothesis can change. In the “Janet Leigh” example table, the first column represents the
keys (e.g., Born; Died) and the second column has the relevant values (e.g., July 6, 1927; October 3, 2004).
Additionally, one can further enhance the diversity
by automatic/manual paraphrasing (Dagan et al.,
2013) of the template or generated sentences.
To show the effectiveness of our proposed framework, we conduct a case study
with the INFOTABS dataset. INFOTABS is an entity-centric dataset for tabular
inference, as shown in the example in Table 1. We extend the INFOTABS data
(25K table-hypothesis pairs) by creating AUTO-TNLI, which consists of
1,478,662 table-hypothesis pairs derived from 660 human-written templates
based on 134 unique table keys from 10,182 tables. For experiments, we
utilize AUTO-TNLI in three ways: (a.) as a standalone tabular inference
dataset for benchmarking, (b.) as a potential augmentation dataset to enhance
tabular reasoning on INFOTABS, i.e., the human-created data, and (c.) as an
evaluation set to assess model reasoning ability. We show that AUTO-TNLI is
an effective dataset for benchmarking and data augmentation, especially in a
limited supervision setting. Thus, this semi-automatic generation methodology
has the potential to provide the best of both worlds (automatic and human
generation).
To summarize, we make the following contributions in this paper:
• We propose a semi-automatic framework that exploits the patterns in tabular
  structure for hypothesis generation.
• We apply this framework to extend the INFOTABS (Gupta et al., 2020) dataset
  and create a large-scale, human-like synthetic dataset, AUTO-TNLI, that
  contains counterfactual entity-based tables.
• We conduct intensive experiments using AUTO-TNLI and demonstrate that it
  helps with benchmarking and data augmentation, especially in a limited
  supervision setting.
The dataset and associated scripts are available
at https://autotnli.github.io.
2 Proposed Framework
Our framework includes four main components:
(a.) Hypothesis Template Creation, (b.) Rational
Counterfactual Table Creation, (c.) Paraphrasing
of Premise Tables, and (d.) Automatic
Table-Hypothesis Generation.
2.1 Hypothesis Template Creation
For a particular category of tables (e.g., movie), the row attributes (i.e.,
keys) mostly overlap across all tables (e.g., Length, Producer, Director, and
others). This consistency across tables makes it possible to write
table-category-specific key-based rules to create logical hypothesis
sentences. We create such key-based rules for the following reasoning types:
(a.) Temporal Reasoning, (b.) Numerical Reasoning, (c.) Spatial Reasoning,
(d.) Common Sense Reasoning. Table 3 provides examples of logical rules used
to create templates. We denote the category of a table as Category and the
table row keys as <Key>. In addition, each template is paraphrased to enhance
lexical diversity.
Figure 1: Our Proposed Framework. Yellow represents modified values in the counterfactual tables.

Frequently, these key-based reasoning rules generalize effectively across
several categories. For example, the temporal reasoning rule based on the
date-time type could be minimally modified to work for <Release Date> of
category Movies tables, as well as <Established Date> of category University
tables, in addition to <Born> of category Person in Table 3. Additionally,
reasoning rules can be expanded to incorporate multi-row entities from the
same table's data, as illustrated in Table 3 for the numerical reasoning type.
Other examples of the same are "The elevation range of <City> is
<HighestElevation> - <LowestElevation>" for the City category and
"<SportName> was held at <location> on <date>" for the Sports category.
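As a rough illustration of how a key-based template might be instantiated, the sketch below fills the Person template <Person_Name> died before/after <Died:Year> from a table stored as a dictionary. The function names, the `shift` parameter (an offset added to the year so that the statement is not a trivial copy of the cell), and the dictionary layout are assumptions made for this example, not the authors' released implementation.

```python
import re

# Hypothetical table layout: one dictionary per entity, keys as in Table 1.
table = {
    "Person_Name": "Janet Leigh",
    "Born": "July 6, 1927",
    "Died": "October 3, 2004",
}

TEMPLATE = "<Person_Name> died before/after <Died:Year>"


def year_of(date_text):
    """Return the four-digit year found in a date string such as 'July 6, 1927'."""
    return int(re.search(r"\d{4}", date_text).group())


def instantiate(template, table, relation, shift=0):
    """Fill <Key> and <Key:Year> slots, then pick 'before' or 'after'.

    `shift` (an assumption of this sketch) offsets the extracted year.
    """
    def substitute(match):
        key, _, part = match.group(1).partition(":")
        if part == "Year":
            return str(year_of(table[key]) + shift)
        return table[key]

    text = re.sub(r"<([^<>]+)>", substitute, template)
    return text.replace("before/after", relation)


# The person died in 2004, so "before 2009" is entailed and "after 2009" is contradicted.
entail = instantiate(TEMPLATE, table, "before", shift=5)
contradict = instantiate(TEMPLATE, table, "after", shift=5)
```

Because the template only refers to keys, the same call works unchanged for any other Person table that has a <Died> row.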
2.2 Rational Counterfactual Table Creation
We also construct counterfactual tables, as illustrated in Table 1, in which
the values corresponding to the original table's keys are altered. A
counterfactual table contains non-factual, unreal information but is
consistent, i.e., the table facts are not self-contradictory. Language models
trained on such counterfactual instances exhibit greater robustness (Müller
et al., 2021; Wang and Culotta, 2021; Rajagopal et al., 2022; Gupta et al.,
2021), and such training prevents the model from over-fitting to its
pre-learned knowledge. This helps the model ground its predictions in the
premise evidence and examine it, as opposed to relying on spurious
correlations. To create a counterfactual table, we modify an original table
with k keys. For a given category, these k keys constitute a subset of the n
possible unique keys (n ≥ k) for that category.
To construct a counterfactual table, we modify the original table in one or
more of the following ways: (a.) keeping the row as it is without any change,
(b.) adding a new value to an existing key, (c.) substituting the existing
key-value with counterfactual data, (d.) deleting a particular key-value pair
from the table, (e.) adding a missing new key (i.e., one of the n - k keys
not in the table), and (f.) adding a missing key row to the table. When
creating counterfactual tables, for each row of the existing table, a subset
of operations is selected at random, each with a pre-decided probability p
(a hyper-parameter).
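The row-level operations above could be sketched roughly as follows. This is a simplified illustration, not the authors' code: it covers only keeping, substituting, deleting, and adding rows, draws replacement values from a single hypothetical `donor` table of the same category, and uses one shared probability `p` for every operation.

```python
import random


def make_counterfactual(table, donor, category_keys, p=0.3, seed=0):
    """Return a counterfactual copy of `table`.

    `donor` is another table of the same category that supplies replacement
    values; `category_keys` lists the n possible keys for the category.
    All three names are assumptions of this sketch.
    """
    rng = random.Random(seed)
    result = {}
    for key, value in table.items():
        if rng.random() < p and key in donor:
            result[key] = donor[key]      # substitute with counterfactual data
        elif rng.random() < p:
            continue                      # delete the key-value pair
        else:
            result[key] = value           # keep the row unchanged
    for key in category_keys:
        # add a key that the original table lacks, when the donor has it
        if key not in table and key in donor and rng.random() < p:
            result[key] = donor[key]
    return result


original = {"Born": "July 6, 1927", "Died": "October 3, 2004"}
donor = {"Died": "January 13, 1994", "Occupation": "Scientist"}
keys = ["Born", "Died", "Occupation"]

unchanged = make_counterfactual(original, donor, keys, p=0.0)
fully_altered = make_counterfactual(original, donor, keys, p=1.0)
```

With p = 0 the table is returned unchanged, and with p = 1 every possible alteration fires; intermediate values give the mixed tables described in the text. A table produced this way still has to pass the key-specific constraints before it is used.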
Train-Data           City   Album  Person  Movie  Book   F&D    Org    Paint  Fest   S&E    Univ
Orig                 78.32  67.81  92.45   97.12  96.31  92.27  92.44  98.93  87.44  82.53  85.59
Orig + Count         61.89  68.26  94.45   98.67  98.72  97.04  96.46  99.56  93.73  95.68  93.02
MNLI + Orig          78.6   68.12  92.89   97.74  97.21  93.19  93.06  99.36  88.12  84.18  87.03
MNLI + Orig + Count  62.32  68.01  94.54   99.01  98.46  97.47  96.8   99.63  93.66  95.08  93.56
Table 2: Category-wise results for AUTO-TNLI (F&D - Food & Drinks, S&E - Sports & Events)

Reasoning  Category  Template-Rules                                               Table-Constraints
Temporal   Person    <Person> was born in a leap year.                            Born Date < Death Date
                     <Person> died before/after <Died:Year>
Numerical  Movie     <Movie> was a "hit if <Box Office> > <Budget> else flop"     Budget > 0
                     <Movie> had a Box Office collection of <BoxOffice>
Spatial    Movie     <Movie> was released in <Release1:Loc>, "X" months           Release1:Location ≠ Release2:Location
                     before/after <Release2:Location>
KCS        City      The governing of <City> is supervised by <Mayor>             Lowest Elevation < Highest Elevation
                     <Mayor> is an important local leader of <City>
Table 3: Rules and constraints are classified into specific areas of reasoning, as indicated in the table. A few
examples of rules and constraints are provided for each category. <Died:Year> indicates that the year value
is extracted from <Died>, whereas <Release1:Location> indicates that the location is extracted from a single
key-value pair in <Release>. KCS denotes knowledge and common sense reasoning in this context.

While creating these tables, we impose essential key-specific constraints to
ensure that the generated sentences are logically rational. E.g., in the
example Table 1, for the counterfactual table Janet Leigh (Counterfactual),
<Born> is kept the same as in the original Janet Leigh (Original), whereas
<Died> has been substituted with a value from another Person table, while
ensuring the constraint BORN DATE < DEATH DATE, i.e., Jan 13, 1994 (Died date
of the counterfactual table) is after July 6, 1927 (Born date of the
counterfactual table). Without following the constraint that BORN DATE < DEATH
DATE, the table will become rationally incorrect or self-contradictory.
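A constraint such as BORN DATE < DEATH DATE can be checked mechanically before a counterfactual table is accepted. The sketch below assumes dates are written as in Table 1 (for example "July 6, 1927") and that a table is a plain dictionary; the function names are hypothetical.

```python
from datetime import datetime


def parse_date(text):
    """Parse dates in the 'Month day, year' form used in Table 1."""
    return datetime.strptime(text, "%B %d, %Y")


def satisfies_born_before_died(table):
    """Return True when the table does not violate BORN DATE < DEATH DATE.

    A table missing either key cannot violate the constraint, so it passes.
    """
    if "Born" in table and "Died" in table:
        return parse_date(table["Born"]) < parse_date(table["Died"])
    return True


counterfactual = {"Born": "July 6, 1927", "Died": "January 13, 1994"}
inconsistent = {"Born": "July 6, 1927", "Died": "March 1, 1900"}

accepted = satisfies_born_before_died(counterfactual)
rejected = satisfies_born_before_died(inconsistent)
```

One plausible way to use such a check is rejection sampling: draw a candidate counterfactual table, discard it if any constraint fails, and draw again.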
2.3 Paraphrasing of Premise Tables
Lack of linguistic variety is a significant concern with grammar-based data
generation methods. Therefore, we employ both automated and human paraphrasing
of premise tables to address the diversity problem. For each key of a given
category, we create three to five simple paraphrased sentences of the
key-specific template. E.g., for <Alma Mater> from Person, possible
paraphrases can be "<PersonName> earned his degree from <AlmaMater>",
"<PersonName> is a graduate of <AlmaMater>", and "<AlmaMater> is an alma mater
of <PersonName>". We observe that paraphrasing considerably increases the
variability across instances.
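The paraphrasing step for a single key can be pictured as a small lookup of alternative sentence templates from which one is chosen per instance. The sketch below uses the three <Alma Mater> paraphrases quoted above; the helper names and the use of a seeded random generator are assumptions for illustration.

```python
import random

# Paraphrase templates for the <Alma Mater> key of the Person category,
# taken from the examples in the text.
ALMA_MATER_TEMPLATES = [
    "<PersonName> earned his degree from <AlmaMater>",
    "<PersonName> is a graduate of <AlmaMater>",
    "<AlmaMater> is an alma mater of <PersonName>",
]


def render(template, person_name, alma_mater):
    """Fill both slots of one paraphrase template."""
    return (template.replace("<PersonName>", person_name)
                    .replace("<AlmaMater>", alma_mater))


def paraphrase_premise(person_name, alma_mater, rng):
    """Pick one paraphrase at random for the premise sentence."""
    return render(rng.choice(ALMA_MATER_TEMPLATES), person_name, alma_mater)


all_variants = [render(t, "Janet Leigh", "Stanford University")
                for t in ALMA_MATER_TEMPLATES]
sampled = paraphrase_premise("Janet Leigh", "Stanford University",
                             random.Random(0))
```

Sampling a different paraphrase per table row is what lets the same key-value pair surface in several lexical forms across the generated premises.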
2.4 Automatic Table-Hypothesis Generation
Once the templates are constructed as discussed in §2.1, their blanks can be
filled in automatically from the entries of the considered tables to create
logically rational hypothesis sentences. To create contradictory sentences,
we randomly select a value from a collection of key values shared by all
tables to fill in the blanks. This replacement ensures that the key-specific
constraints, such as the key-value type, are adhered to. Furthermore, we
ensure that the same template, with minimal token alteration, is used to
create each entail-contradict pair. Creating entailment and contradiction
statement pairs with lexically overlapping tokens ensures that future models
trained on such data do not pick up spurious correlations from the tabular NLI
data, i.e., it minimises the hypothesis bias problem (Poliak et al., 2018).
For example, for the movie "Ironman" with rows "Budget: $140 million" and
"Box-office: $585.8 million", using the template <Movie> was a "hit if
<Box Office> > <Budget> else flop" from Table 3, one can generate the
entailed hypothesis "The movie Ironman was a hit" and the contradicting
hypothesis "The movie Ironman was a flop".
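The Ironman example can be traced in a few lines. The sketch below assumes the rule compares Box Office against Budget with a strict greater-than, which is consistent with the example's outcome, and that monetary cells look like "$585.8 million"; the parsing helper and function names are hypothetical.

```python
import re


def millions(cell):
    """Extract the numeric part of a cell such as '$585.8 million'."""
    return float(re.search(r"[\d.]+", cell).group())


def hit_or_flop_pair(table):
    """Return (entailed, contradicted) hypotheses from the hit/flop rule.

    The two sentences differ in a single token, so they overlap lexically.
    """
    is_hit = millions(table["Box-office"]) > millions(table["Budget"])
    verdict, opposite = ("hit", "flop") if is_hit else ("flop", "hit")
    name = table["Movie"]
    return (f"The movie {name} was a {verdict}",
            f"The movie {name} was a {opposite}")


ironman = {"Movie": "Ironman",
           "Budget": "$140 million",
           "Box-office": "$585.8 million"}

entailed, contradicted = hit_or_flop_pair(ironman)
```

Since the label depends on comparing two table cells and the two hypotheses share every other token, a model cannot tell them apart from surface wording alone.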
3 The AUTO-TNLI Dataset
We apply our framework, as described in §2, to an entity-specific tabular
inference dataset, INFOTABS, to construct AUTO-TNLI. INFOTABS (Gupta et al.,
2020) consists of NLI instance pairs: a hypothesis statement that is grounded
in and inferred from a premise table, where the premise table is extracted
from a Wikipedia Infobox, across multiple diverse categories. We construct
the AUTO-TNLI dataset from a subset of the INFOTABS dataset (11 out of 13
total categories), which includes each original table plus five
counterfactual tables corresponding to it, for a total of 10,182 tables. We
retrieve 134 keys and 660 templates, which we utilize to generate 1,478,662
sentences. However, unlike INFOTABS, which contains 3 labels (ENTAIL,
CONTRADICT, and NEUTRAL), AUTO-TNLI contains only two labels, ENTAIL and
CONTRADICT.
Statistic Metric                         Numbers
Number of Unique Keys                    134
Average number of keys per table         12.63
Average number of sentences per table    164.51
Table 4: AUTO-TNLI Statistics.
As previously reported in the original INFOTABS paper by Gupta et al. (2020),
annotators are biased towards specific keys over others. For example, for the
category Company, annotators would create more sentences for the key
<Founded by> than for the key <Website>, resulting in an inherent hypothesis
bias in the dataset. While creating the templates for AUTO-TNLI, we ensure
that each key has a minimum of two hypotheses and a minimum of three (≥ 3)
premise paraphrases, which helps mitigate hypothesis bias. To address the
inference class imbalance issue, we construct hypotheses with an approximately
1:1 ratio of ENTAIL to CONTRADICT labels.
We observe that most of the additional human labor required to build such
sentences is spent on the set of key-specific rules and constraints that
ensure the sentences are grammatically accurate and the counterfactual tabular
data is logically consistent, i.e., not self-contradictory. Table 4 details
the number of unique keys, the minimum/maximum/average number of keys, and the
total number of sentences per table in AUTO-TNLI. As can be observed, the
system generates a large amount of AUTO-TNLI data compared to the limited
INFOTABS data while using only a few human-constructed templates with
key-specific rules and constraints.
We have chosen INFOTABS as it has three evaluation sets, α1, α2, and α3, in
addition to the regular training and development sets. The α1 set is lexically
and topic-wise similar to the train set; in α2, the hypotheses are lexically
adversarial to the train set; and in α3, the tables are from topics not in the
train set. Moreover, it has multiple reasoning types such as multi-row
reasoning, entity type,