N-BEST HYPOTHESES RERANKING FOR TEXT-TO-SQL SYSTEMS
Lu Zeng, Sree Hari Krishnan Parthasarathi, Dilek Hakkani-Tur
Alexa, Amazon, USA.
{luzeng, sparta, hakkanit}@amazon.com
ABSTRACT
The Text-to-SQL task maps natural language utterances to structured queries that can be issued to a database. State-of-the-art (SOTA) systems rely on finetuning large, pre-trained language models in conjunction with constrained decoding that applies a SQL parser. On the well-established Spider dataset, we begin with Oracle studies: choosing an Oracle hypothesis from a SOTA model's 10-best list yields a 7.7% absolute improvement in both exact match (EM) and execution (EX) accuracy, showing significant potential gains from reranking. Identifying coherence and correctness as reranking criteria, we design a model that generates a query plan and propose a heuristic schema linking algorithm. Combining both approaches with T5-Large, we obtain a consistent 1% improvement in EM accuracy and a 2.5% improvement in EX, establishing a new SOTA for this task. Our comprehensive error studies on the DEV data show the underlying difficulty in making progress on this task.
Index Terms: Text-to-SQL, semantic parsing
1. INTRODUCTION
Large language models (LMs) are widely used for natural language generation [1, 2, 3]. Recently, the use of large LMs has extended to semantic parsing tasks such as code generation. For general-purpose code generation, large LMs such as Codex [4] are trained on massive paired (code, NL) corpora. For domain-specific code generation tasks, such as Text-to-SQL, which aims to convert natural language instructions into SQL queries, multiple public datasets are available (with associated leaderboards), including Spider [5], CoSQL [6], and SParC [7]. However, the amount of training data in these datasets is much smaller than the natural language and code pairs mined from the internet. In such cases, instead of training a new model from scratch, the pretrain/finetune strategy with publicly available LMs has been shown to be more accurate. For example, in UnifiedSKG [8], the T5 model [1] is finetuned for various semantic parsing tasks (including Text-to-SQL), achieving SOTA performance. Besides the limited amount of training data, another challenge with SQL generation is that the generated code is underspecified without the corresponding schema: in addition to SQL grammar constraints, there are schema constraints. To handle this challenge, implicit or explicit schema linking becomes an important sub-task for SQL code generation [9, 10, 11].
In this paper, we focus on complex, cross-domain SQL generation using Spider [5], a large-scale Text-to-SQL dataset consisting of 200 complex databases covering 138 domains. Spider is a well-established dataset with nearly 70 entries on its leaderboard, demonstrating the difficulty of the task. The data is split into training, development (DEV), and test sets with no overlap in databases across the splits, as the aim of the task is to learn models that can issue queries, from natural language text, against databases unseen during training. Furthermore, SQL queries in this dataset contain nested sub-queries, requiring the model to understand compositional structures. Model performance is measured with two metrics, exact-set-match accuracy (EM) and execution accuracy (EX): the former compares individual query components between the predicted and groundtruth SQL queries, while the latter compares their execution output.
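To make the two metrics concrete, the sketch below shows one way they could be computed. The official Spider evaluation script parses SQL fully, ignores literal values, and handles ordering carefully; the clause splitting here is only a toy approximation for illustration.

```python
import re
import sqlite3

def execution_accuracy(pred_sql: str, gold_sql: str, db_path: str) -> bool:
    """EX: do predicted and gold queries return the same result?"""
    with sqlite3.connect(db_path) as conn:
        try:
            pred = sorted(map(tuple, conn.execute(pred_sql).fetchall()))
            gold = sorted(map(tuple, conn.execute(gold_sql).fetchall()))
        except sqlite3.Error:
            return False  # an unexecutable prediction counts as wrong
    return pred == gold

def exact_set_match(pred_sql: str, gold_sql: str) -> bool:
    """EM (toy version): compare clause-level component sets, so that
    e.g. reordered WHERE conjuncts still match. The official evaluator
    parses SQL into a richer structure and ignores literal values."""
    def components(sql: str) -> dict:
        sql = re.sub(r"\s+", " ", sql.lower().strip().rstrip(";"))
        parts = re.split(
            r"\b(select|from|where|group by|having|order by|limit)\b", sql)
        # parts looks like ['', 'select', ' name ', 'from', ' singer ', ...]
        return {kw: frozenset(body.replace(",", " ").split())
                for kw, body in zip(parts[1::2], parts[2::2])}
    return components(pred_sql) == components(gold_sql)
```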
Similar to general-purpose code generation, the output of Text-to-SQL models is constrained to follow SQL grammar. Previous solutions have employed encoder/decoder models with the decoder constrained to produce well-formed outputs [12, 13]. An approach that is more compatible with large pretrained LMs is to remove the constraints on the decoder: [14, 15] prune the finalized hypotheses from beam search down to those that are syntactically correct. Meanwhile, PICARD [16], a top entry on the Spider leaderboard (https://yale-lily.github.io/spider), in addition to finetuning the pretrained T5 model, also imposes SQL syntax constraints during beam search (via constrained decoding). It achieves an EM of 74.8% and an EX of 79.2% on the Spider DEV set. Natural ways to potentially improve these metrics include increasing the model size or collecting and/or synthesizing additional training data [17].
In this paper, we take a different approach: we first perform an Oracle analysis on n-best lists obtained from PICARD and observe significant headroom (a 7.7% absolute improvement in both EM and EX) even at small beam sizes (such as 10). The gap between 1-best and Oracle performance motivates reranking. We propose two reranking
approaches motivated by known issues with current large pretrained LMs: coherence [18] and correctness. To improve coherence, we explore n-best reranking using a query plan produced by an independent model. Next, to improve correctness, we propose a heuristic algorithm that performs schema linking on n-best lists, imposing constraints that are missing in PICARD. The combined reranking approaches consistently yield improvements across all T5 model sizes, obtaining a 1.0% absolute improvement in EM and a 2.5% absolute improvement in EX. Lastly, to analyze the persistent gap with the Oracle results, we perform a detailed analysis of errors and metrics. Our studies show that annotation and metric issues are significant obstacles to improving models and evaluation on Spider.
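The Oracle analysis amounts to crediting an example whenever any hypothesis in its n-best list matches the reference. A minimal sketch is below, where match_fn stands in for an EM or EX check such as the official evaluator's.

```python
def oracle_accuracy(nbest_lists, references, match_fn):
    """Upper bound for a perfect reranker: an example counts as correct
    if ANY hypothesis in its n-best list matches the gold query."""
    hits = sum(any(match_fn(hyp, gold) for hyp in nbest)
               for nbest, gold in zip(nbest_lists, references))
    return hits / len(references)

def onebest_accuracy(nbest_lists, references, match_fn):
    """Baseline accuracy of the top beam-search hypothesis alone."""
    return sum(match_fn(nbest[0], gold)
               for nbest, gold in zip(nbest_lists, references)) / len(references)

# The reranking headroom reported above is the difference, e.g.
# oracle_accuracy(...) - onebest_accuracy(...), about 7.7% absolute at beam 10.
```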
Contributions. The contributions of this paper are: a) using Oracle analysis and rerankers on n-best lists, we surpass SOTA performance on the competitive Spider task; b) an analysis of errors on the Spider DEV data, showing that much work remains to be done on metrics and annotation.
2. RELATED WORK
We review related work in four categories: a) reranking approaches; b) coherence in LMs; c) schema linking for Text-to-SQL; d) noise in Spider.
Rerankers. N-best list reranking has a long history in the speech recognition, machine translation, and semantic parsing communities [19, 20, 21, 22, 23, 24, 25]. Reranking is typically performed with a much larger second-stage model [26, 27] or with additional knowledge sources not available during the first pass [28, 29]. On Spider, a larger second-stage reranker was successfully used in [30]. Some reranking methods combine outputs from multiple systems [31] or use improved output confidence [32]. Imposing task-specific constraints [14, 15] to remove invalid options can also be a helpful strategy. In our work, in addition to the output target being structured queries (instead of natural language), the first-stage model is large (the T5 family), and the baseline model already imposes additional knowledge during beam search in the form of SQL parsing. Our proposed methods are designed for this setting.
Coherence with query plan. Coherence issues in LMs [33] are associated with the hallucination phenomenon [34, 35]. While these are well-known problems in unstructured text generation with LMs [36, 37, 38, 39], coherence issues in structured text generation are somewhat less explored: [40] uses semantic coherence for data augmentation, [41] uses a reconstruction model for reranking semantic parses, and [14, 15, 16] use parsers to improve coherence with SQL grammar. Our reranker is designed to improve semantic coherence; it is distinct from [41] in that our query plan model predicts the structure of a query (from natural language) and orders the n-best list for consistency with that structure.
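The query plan model itself is described later in the paper; the reranking step it enables can be sketched as follows: extract the coarse structure of each hypothesis, compare it with the plan predicted from the question by an independent model, and reorder the list. The slot inventory and tie-breaking rule below are assumptions for illustration, not the paper's exact algorithm.

```python
import re

# Structural slots a coarse query plan might cover (an assumption for
# illustration; the paper's plan definition is not given in this section).
PLAN_SLOTS = ("where", "group by", "having", "order by",
              "join", "intersect", "union", "except")

def sql_plan(sql: str) -> frozenset:
    """Coarse structure of a SQL string: which plan slots it uses."""
    s = re.sub(r"\s+", " ", sql.lower())
    return frozenset(slot for slot in PLAN_SLOTS if re.search(rf"\b{slot}\b", s))

def rerank_by_plan(nbest, model_scores, predicted_plan):
    """Reorder the n-best list to prefer hypotheses whose structure agrees
    with the plan predicted from the natural-language question, breaking
    ties by the original model score. The reranker only reorders; it
    never introduces new hypotheses."""
    def key(i):
        mismatches = len(sql_plan(nbest[i]) ^ predicted_plan)
        return (-mismatches, model_scores[i])  # fewer mismatches, then score
    return [nbest[i] for i in sorted(range(len(nbest)), key=key, reverse=True)]
```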
Correctness with schema linking. Historically, schema linking (SL) has been explored for Text-to-SQL. The schema has been encoded in the model input both for small encoder/decoder models [15, 42, 43] and for large pretrained LMs [8, 16]. More work has been done on modeling schema linking [9], where the role of SL in Text-to-SQL is recognized. RAT-SQL [12], an influential Text-to-SQL system, proposed a framework to encode relational structure in the database schema while simultaneously addressing schema linking. In contrast, we use schema linking as a post-processor and reranker on the output of large LMs.
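Our heuristic schema linking algorithm is presented later in the paper; as a flavor of what a schema-aware post-check can do on an n-best list, the toy sketch below demotes hypotheses that reference identifiers absent from the database schema. It ignores aliases and quoted string values, which a real implementation must handle.

```python
import re

SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "having", "order", "limit",
    "join", "on", "as", "and", "or", "not", "in", "is", "null", "distinct",
    "count", "sum", "avg", "min", "max", "asc", "desc", "like", "between",
    "intersect", "union", "except", "exists",
}

def schema_violations(sql: str, schema: dict) -> int:
    """Count identifiers in the query that exist in neither the table nor
    the column vocabulary. `schema` maps table name -> set of columns,
    e.g. {"singer": {"singer_id", "name", "age"}}. A toy stand-in for
    the paper's heuristic schema-linking algorithm."""
    vocab = set(schema) | {c for cols in schema.values() for c in cols}
    idents = re.findall(r"[a-z_][a-z0-9_]*", sql.lower())
    return sum(1 for tok in idents
               if tok not in SQL_KEYWORDS and tok not in vocab)

def rerank_by_schema(nbest, schema):
    """Stable sort: hypotheses with fewer schema violations move up, and
    ties keep the original (model-score) order."""
    return sorted(nbest, key=lambda sql: schema_violations(sql, schema))
```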
Noise in Spider. The Spider dataset [5] has been well explored in Text-to-SQL, and its corpus noise is documented in related work [9, 12, 44]. One type of corpus noise comes from annotation errors in groundtruth SQL queries and typos in natural language queries [9]. Another source of noise is the pairing of only one groundtruth SQL query with each natural language query, which results in a large portion of predicted queries being incorrectly marked as "mispredicted" [12, 44].
3. TEXT-TO-SQL USING PRE-TRAINED LMS
We use PICARD as our baseline system and use it to produce the n-best hypotheses in this paper. PICARD uses finetuned T5 models and implements incremental SQL parsing to constrain decoding during beam search. The input to the T5 model includes the natural language query, the database name, and the database schema (table name: col1, ..., coln). PICARD employs a post-processor during beam search; to save computing time, it only checks the next top-k tokens for validity. Figure 1 shows how predictions are generated when PICARD is enabled, using k = 2 and a beam size of 3. A shadowed box means that the result is not expanded or explored at the next time step; blue (solid) boxes show valid SQL query prefixes, while orange (dashed) boxes show sequences that violate SQL syntax.

[Fig. 1: PICARD explained by an example. The prediction pattern is "Database name | pred SQL".]
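Schematically, one decoding step of this constrained beam search looks like the sketch below: each beam proposes only its top-k next tokens, an incremental parser check prunes invalid prefixes, and the surviving extensions are re-ranked. Here `next_token_logprobs` and `is_valid_sql_prefix` are stand-ins for the T5 decoder and PICARD's incremental parser, respectively.

```python
import heapq

def constrained_beam_step(beams, next_token_logprobs, is_valid_sql_prefix,
                          beam_size=3, k=2):
    """One step of PICARD-style constrained beam search.

    beams: list of (logprob, tokens) pairs kept from the previous step.
    next_token_logprobs(tokens) -> {token: logprob}: decoder stand-in.
    is_valid_sql_prefix(tokens) -> bool: incremental-parser stand-in;
        only the top-k candidate tokens per beam are checked, as in PICARD.
    """
    candidates = []
    for logprob, tokens in beams:
        dist = next_token_logprobs(tokens)
        topk = heapq.nlargest(k, dist.items(), key=lambda kv: kv[1])
        for token, lp in topk:
            ext = tokens + [token]
            if is_valid_sql_prefix(ext):  # prune grammar/schema-violating prefixes
                candidates.append((logprob + lp, ext))
    # Keep the beam_size best surviving extensions; everything else is a
    # "shadowed box" in Fig. 1, never expanded at the next time step.
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
```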
The incremental parser not only filters out hypotheses that violate the SQL grammar, but also imposes rudimentary schema checks. Fig. 1 gives two examples of filtered hypotheses: the first is due to a syntax error (e.g., count()) and the other is due to a schema error (e.g.,