
auxiliary structure-indicative embeddings (Herzig et al., 2020; Eisenschlos et al., 2020; Wang et al., 2021b), and designing table-specific pre-training objectives (Yin et al., 2020; Yu et al., 2021a; Wang et al., 2021b; Liu et al., 2022b,a). While these methods are effective in understanding table structures, they increase modeling complexity and lack interpretability regarding why models learn table reasoning skills during pre-training.
This paper presents a new table pre-training ap-
proach, named REASTAP, which enables a model
to efficiently learn table structure understanding
and table reasoning skills during pre-training. We
first defined 7 table reasoning skills, such as numer-
ical operation and temporal comparison. As shown
in Figure 1, for each reasoning skill, a correspond-
ing example generator was applied to synthesize
Question Answering (QA) examples over tables.
We modeled the pre-training task as a sequence gen-
eration task and pre-trained a sequence-to-sequence
(seq2seq) LM to generate the answer to the syn-
thetic questions. REASTAP is theoretically appli-
cable to any seq2seq LM without a table-specific
architecture design. Our key insight is that if a language model can be pre-trained to generate the answers to synthetic questions that require various table reasoning skills, it should acquire strong table structure understanding and table reasoning capabilities, thereby benefiting downstream tasks. The main contributions of our work can be summarized as follows:
• We develop a new table reasoning example generation pipeline, which produces a large-scale table QA corpus requiring various reasoning skills over semi-structured tables.
• We propose a new table pre-training method, REASTAP, which helps the model learn table structure understanding and various table reasoning skills during pre-training without any table-specific architecture design.
• REASTAP is evaluated on four downstream benchmarks. Experimental results demonstrate that REASTAP achieves new state-of-the-art results on all of them and delivers substantial improvements in low-resource settings.
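To make the seq2seq formulation concrete, the following is a minimal sketch (not the authors' code) of how a table and a synthetic question could be serialized into a source/target pair for a seq2seq LM; the separator tokens (HEAD, ROW) and the exact formatting are illustrative assumptions, not REASTAP's actual markup.

```python
def linearize_table(header, rows):
    """Flatten a table into a single string. The HEAD/ROW separator
    tokens here are illustrative placeholders."""
    parts = ["HEAD : " + " | ".join(header)]
    for i, row in enumerate(rows, start=1):
        parts.append(f"ROW {i} : " + " | ".join(str(c) for c in row))
    return " ".join(parts)

def make_pretrain_example(question, header, rows, answer):
    """Pair the question + linearized table (source) with the answer (target),
    as in a standard seq2seq pre-training setup."""
    source = question + " " + linearize_table(header, rows)
    return {"source": source, "target": answer}

example = make_pretrain_example(
    "Which company has the 2nd Profit?",
    ["Company Name", "Headquarter", "Profit"],
    [["Acme", "United States", 120], ["Globex", "Canada", 95]],
    "Globex",
)
```

Because the table is fully linearized, any off-the-shelf seq2seq LM can consume it without architectural changes, which is the property the approach relies on.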
2 Pre-training Corpus
2.1 Table Source and Pre-processing
We chose publicly available semi-structured tables
as the table source. Specifically, we extracted tables from English Wikipedia1, which covered a wide range of domains, including popular culture, geography, politics, and science. We kept tables
with 8-30 rows and at least three columns, resulting
in around 600K tables. For each extracted table, a
pre-processing script was applied to automatically
annotate table columns with their data types (i.e., string, number, and date), which allows us to generate questions that involve manipulating numbers
and dates. Furthermore, recent work (Yang et al., 2022; Wang et al., 2022) demonstrates that existing
table pre-training approaches might encode table
row order as an unwanted bias. For example, a pre-trained model that is aware of row order information tends to select the first or last row of a table when answering superlative-type questions, without truly understanding the table content. To
alleviate this problem, we randomly shuffled table
rows during pre-processing.
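The pre-processing steps above (size filtering, column type annotation, and row shuffling) can be sketched as follows. This is a minimal illustration, not the authors' script: the type-inference heuristics, the date regex, and the function names are assumptions.

```python
import random
import re

# Illustrative date pattern, e.g. "12 March 2020" or "12/Mar/2020".
DATE_RE = re.compile(r"^\d{1,2}[-/ ]\w{3,9}[-/ ]\d{4}$")

def infer_column_type(values):
    """Heuristic type annotation (string / number / date); illustrative only."""
    def is_number(v):
        try:
            float(str(v).replace(",", ""))
            return True
        except ValueError:
            return False
    if all(is_number(v) for v in values):
        return "number"
    if all(DATE_RE.match(str(v)) for v in values):
        return "date"
    return "string"

def preprocess_table(header, rows, min_rows=8, max_rows=30, min_cols=3, seed=0):
    """Keep tables with 8-30 rows and at least three columns, annotate
    column types, and shuffle rows to remove row-order bias."""
    if not (min_rows <= len(rows) <= max_rows and len(header) >= min_cols):
        return None
    col_types = [infer_column_type([r[j] for r in rows])
                 for j in range(len(header))]
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    return {"header": header, "types": col_types, "rows": shuffled}
```

The type annotations are what later allow templates to demand a number or date column; the shuffle ensures the model cannot exploit the original row order.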
2.2 Example Generation
We defined 7 types of table reasoning skills, with
examples and explanations shown in Table 1. The
example generation pipeline was adapted from
Yoran et al. (2021). Each reasoning skill is as-
sociated with one example generator and several
question templates. The example generator was implemented as a function that takes a table T and generates several reasoning examples (T, q, a) according to the template, where q denotes the question and a denotes the answer.
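As a minimal sketch of such a generator (not the paper's actual templates; the skill name, template wording, and data are illustrative), a generator for a simple lookup-style skill could fill a template with a sampled column and cell value, then read the answer off the matching row:

```python
import random

def select_generator(table, n_examples=2, seed=0):
    """Toy example generator: instantiates the illustrative template
    'What was the [col:j] when [col:i] was [val:i]?' over a table and
    returns (T, q, a) triples, with the answer taken from the same row."""
    rng = random.Random(seed)
    header, rows = table["header"], table["rows"]
    examples = []
    for _ in range(n_examples):
        i, j = rng.sample(range(len(header)), 2)  # two distinct columns
        row = rng.choice(rows)                    # row supplying val:i and a
        q = f"What was the {header[j]} when {header[i]} was {row[i]}?"
        examples.append((table, q, str(row[j])))
    return examples
```

A real skill-specific generator (e.g., for numerical comparison) would additionally check the annotated column types and instantiate operators or ordinals, but the (T, q, a) interface stays the same.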
Each template contains typed variables that are instantiated with content from the extracted table. Specifically, columns col and cell values val are indexed to specify that val:i must be instantiated by a cell value from the i-th column. Some templates also require that the selected column and cell value be of date or number type. OPERATOR and ORDINAL correspond to operators and ordinal numerals that are instantiated according to the specific reasoning skill. CONDITION:i can be 1) a cell value from the i-th column; or 2) a number/temporal comparison statement if the i-th column is of date or number type.
For example, the question from Figure 1, "Which Company Name, with Headquarter was United States, has the 4th Profit?", is generated from one of the "Numerical Comparison" templates: "Which col:1, with col:2 was CONDITION:2, has
1 We parsed the 02-20-2022 Wikipedia dump using WikiExtractor Tools from https://github.com/attardi/wikiextractor