
approaches that are motivated by issues with current large pretrained LMs: coherence [18] and correctness. To improve coherence, we explore n-best reranking using a query
plan produced by an independent model. Next, to improve
correctness, we propose a heuristic algorithm that performs schema linking on n-best lists, imposing constraints that
are missing in PICARD. The combined reranking approaches
consistently yield improvements across all T5 model sizes,
obtaining a 1.0% absolute improvement in EM and a 2.5% absolute improvement in EX. Lastly, to analyze the persistent gap with respect to Oracle results, we perform a detailed analysis of errors and metrics. Our studies show that annotation and metric issues can be significant obstacles to improving models and evaluation on Spider.
Contributions. The contributions of this paper are: a) Using
Oracle analysis and rerankers on n-best lists, we outperform the SOTA on the competitive Spider task; b) an analysis of errors on the Spider DEV data, showing that much work remains to be done on metrics and annotation.
2. RELATED WORK
We review related work in 4 categories: a) reranking ap-
proaches; b) coherence in LMs; c) schema linking for text-to-
SQL; d) noise in Spider.
Rerankers. N-best list reranking has a long history in speech
recognition, machine translation, and semantic parsing com-
munities [19, 20, 21, 22, 23, 24, 25]. These are typically performed with a much larger 2nd-stage model [26, 27] or with additional knowledge sources not available during the first
pass [28, 29]. On Spider, a larger 2nd-stage reranker was
successfully used in [30]. Some reranking methods combine
outputs from multiple systems [31] or use improved output
confidence [32]. Imposing task-specific constraints [14, 15] to remove invalid options can be a helpful strategy for improving
reranking. In our work, in addition to the output target being
structured queries (instead of natural language), the 1st-stage
model is large (T5 family), and the baseline model imposes
additional knowledge during beam search in the form of SQL
parsing. Our proposed methods are designed for this setting.
Coherence with query plan. Coherence issues with LMs [33]
are associated with the hallucination phenomenon [34, 35]. While these are well-known problems in unstructured text generation with LMs [36, 37, 38, 39], coherence issues in structured text generation are somewhat less explored: [40]
uses semantic coherence for data augmentation, while [41]
uses a reconstruction model for reranking semantic parses,
and [14, 15, 16] use parsers to improve coherence with
SQL grammar. Our reranker is designed to improve semantic coherence and is distinct from [41]: our query plan model predicts the structure of a query from the natural language question, and the n-best list is reordered for consistency with that prediction.
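To make this concrete, the following Python sketch shows one plausible form of query-plan consistency reranking; the clause-set representation, function names, and tie-breaking by the original model order are illustrative assumptions, not the exact method used in this paper.

# Illustrative sketch: rerank an n-best list of SQL hypotheses by agreement
# with a clause-level query plan predicted from the question by a separate model.
from typing import List, Set

CLAUSES = ["SELECT", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT"]

def clause_signature(sql: str) -> Set[str]:
    # Which top-level SQL clauses appear in the hypothesis.
    upper = sql.upper()
    return {c for c in CLAUSES if c in upper}

def rerank_by_plan(nbest: List[str], predicted_plan: Set[str]) -> List[str]:
    # Fewer clause mismatches with the predicted plan is better; the sort is
    # stable, so ties keep the original (model-score) order.
    def mismatch(sql: str) -> int:
        return len(clause_signature(sql) ^ predicted_plan)
    return sorted(nbest, key=mismatch)

# Example: a plan predicted for "Who is the oldest singer?" favors ORDER BY/LIMIT.
nbest = ["SELECT name FROM singer",
         "SELECT name FROM singer ORDER BY age DESC LIMIT 1"]
plan = {"SELECT", "ORDER BY", "LIMIT"}
print(rerank_by_plan(nbest, plan)[0])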
Fig. 1. PICARD explained by an example. The prediction pattern is “〈Database name〉|〈pred SQL〉”.

Correctness with schema linking. Schema linking (SL) has historically been explored for Text-to-SQL. The schema has been encoded in the model input for small encoder/decoder models [15, 42, 43] and large pretrained LMs [8, 16]. More work
has been done on modeling schema linking [9], where the role
of SL in Text-to-SQL is recognized. RAT-SQL [12], a system that has been influential in text-to-SQL, proposed a framework to encode relational structure in the database schema
and simultaneously address schema linking. In contrast, we
use schema linking as a post-processor and a reranker on the
output of large LMs.
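As a rough illustration of schema linking used as a post-processor on an n-best list, the Python sketch below drops hypotheses that mention identifiers absent from the database schema; the identifier extraction and keyword list are simplified assumptions, not the heuristic algorithm proposed in this paper.

# Illustrative sketch: filter n-best SQL hypotheses with a crude schema check.
# The token-level extraction below is a deliberate simplification.
import re
from typing import Dict, List, Set

def referenced_identifiers(sql: str) -> Set[str]:
    # Crude extraction of identifier-like tokens from the query.
    return set(re.findall(r"[a-z_][a-z_0-9]*", sql.lower()))

def passes_schema_check(sql: str, schema: Dict[str, List[str]]) -> bool:
    # Every identifier must be an SQL keyword, a table name, or a column name.
    keywords = {"select", "from", "where", "group", "by", "order", "limit",
                "having", "count", "avg", "min", "max", "sum", "asc", "desc",
                "and", "or", "not", "as", "join", "on", "distinct"}
    valid = set(schema) | {c for cols in schema.values() for c in cols} | keywords
    return referenced_identifiers(sql) <= valid

def schema_filter(nbest: List[str], schema: Dict[str, List[str]]) -> List[str]:
    # Keep hypotheses that pass; fall back to the original list if none do.
    kept = [h for h in nbest if passes_schema_check(h, schema)]
    return kept or nbest

schema = {"singer": ["singer_id", "name", "age"]}
print(schema_filter(["SELECT nme FROM singer", "SELECT name FROM singer"], schema))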
Noise in Spider. The Spider dataset [5] has been well explored in Text-to-SQL, and its corpus noise is documented in related
work [9, 12, 44]. One type of corpus noise comes from anno-
tation errors in groundtruth SQL queries and typos in natural
language queries [9]. Another source of noise is the pairing of only one groundtruth SQL query with each natural
language query, which results in a large portion of predicted
queries being incorrectly marked as “mispredicted” [12, 44].
3. TEXT-TO-SQL USING PRE-TRAINED LMS
We use PICARD as our baseline system and use it to produce the n-best hypotheses in this paper. It uses finetuned T5 models and implements incremental SQL parsing to constrain decoding during beam search. The input to the T5 model includes the natural language query, the database name, and the database schema (table name : col1, ..., coln). PICARD
employs a post-processor during beam search, and to save computing time it checks only the top-k next tokens for validity. Figure 1 shows how predictions are generated when PICARD is enabled, using k = 2 and a beam size of 3. A shadowed box means that the result is not expanded or explored at the next time step; blue (solid) boxes show valid SQL query prefixes, while orange (dashed) boxes show sequences that violate SQL syntax.
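For intuition, the following Python sketch mimics one step of such constrained beam search, where only the top-k token proposals per beam are checked by a validity predicate; all names are placeholders and the parser/schema check is left abstract, so this is a schematic, not PICARD's actual implementation.

# Schematic of one constrained beam-search step: only the top-k proposed
# continuations of each beam are checked for validity; invalid prefixes are
# pruned and never expanded at the next time step.
import heapq
from typing import Callable, List, Tuple

def constrained_beam_step(
    beams: List[Tuple[float, str]],                          # (log-prob, prefix)
    next_tokens: Callable[[str], List[Tuple[str, float]]],   # model's ranked proposals
    is_valid_prefix: Callable[[str], bool],                  # incremental parser / schema check
    beam_size: int = 3,
    top_k: int = 2,
) -> List[Tuple[float, str]]:
    candidates = []
    for score, prefix in beams:
        # Only the top-k token proposals are validated, to save compute.
        for token, logp in next_tokens(prefix)[:top_k]:
            new_prefix = prefix + token
            if is_valid_prefix(new_prefix):  # prune prefixes that cannot be valid SQL
                candidates.append((score + logp, new_prefix))
    # Keep the best `beam_size` surviving candidates for the next step.
    return heapq.nlargest(beam_size, candidates)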
The incremental parser not only filters out hypotheses
that violate the SQL grammar, but also imposes rudimen-
tary schema checks. Fig. 1 gives two examples of filtered hypotheses: the first is due to a syntax error (e.g., count()) and the other is due to a schema error (e.g.,