
approaches that are motivated by issues with current large pretrained LMs: coherence [18] and correctness. To improve coherence, we explore n-best reranking using a query
plan produced by an independent model. Next, to improve
correctness, we propose a heuristic algorithm that performs schema linking on n-best lists, imposing constraints that
are missing in PICARD. The combined reranking approaches
consistently yield improvements across all T5 model sizes,
obtaining a 1.0% absolute improvement in EM and a 2.5% absolute improvement in EX. Lastly, to analyze the persistent gap with respect to Oracle results, we perform a detailed analysis of errors and metrics. Our studies show that annotation and metric issues can be significant obstacles to improving models and evaluation on Spider.
Contributions. The contributions of this paper are: a) Using
Oracle analysis and rerankers on n-best lists, we outperform the SOTA on the competitive Spider task; b) an analysis of errors on the Spider DEV data, showing that much work remains to be done on metrics and annotation.
2. RELATED WORK
We review related work in 4 categories: a) reranking ap-
proaches; b) coherence in LMs; c) schema linking for text-to-
SQL; d) noise in Spider.
Rerankers. N-best list reranking has a long history in speech
recognition, machine translation, and semantic parsing com-
munities [19, 20, 21, 22, 23, 24, 25]. These are typically performed with a much larger 2nd-stage model [26, 27] or with additional knowledge sources not available during the first
pass [28, 29]. On Spider, a larger 2nd-stage reranker was
successfully used in [30]. Some reranking methods combine
outputs from multiple systems [31] or use improved output
confidence [32]. Imposing task-specific constraints [14, 15] to remove invalid options can be a helpful strategy for improving
reranking. In our work, in addition to the output target being
structured queries (instead of natural language), the 1st-stage
model is large (T5 family), and the baseline model imposes
additional knowledge during beam search in the form of SQL
parsing. Our proposed methods are designed for this setting.
Coherence with query plan. Coherence issues with LMs [33]
are associated with the hallucination phenomenon [34, 35]. While these are well-known problems in unstructured text generation with LMs [36, 37, 38, 39], coherence issues in structured text generation are somewhat less explored: [40]
uses semantic coherence for data augmentation, while [41]
uses a reconstruction model for reranking semantic parses,
and [14, 15, 16] use parsers to improve coherence with
SQL grammar. Our reranker is designed to improve semantic coherence and is distinct from [41]: our query plan model predicts the structure of a query from the natural language question, and the n-best list is reordered for consistency with that prediction.
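To make this concrete, the following Python sketch shows one plausible form of query-plan consistency reranking; the clause-set representation, function names, and tie-breaking by the original model order are illustrative assumptions, not the exact method used in this paper.

# Illustrative sketch: rerank an n-best list of SQL hypotheses by agreement
# with a clause-level query plan predicted from the question by a separate model.
from typing import List, Set

CLAUSES = ["SELECT", "WHERE", "GROUP BY", "HAVING", "ORDER BY", "LIMIT"]

def clause_signature(sql: str) -> Set[str]:
    # Which top-level SQL clauses appear in the hypothesis.
    upper = sql.upper()
    return {c for c in CLAUSES if c in upper}

def rerank_by_plan(nbest: List[str], predicted_plan: Set[str]) -> List[str]:
    # Fewer clause mismatches with the predicted plan is better; the sort is
    # stable, so ties keep the original (model-score) order.
    def mismatch(sql: str) -> int:
        return len(clause_signature(sql) ^ predicted_plan)
    return sorted(nbest, key=mismatch)

# Example: a plan predicted for "Who is the oldest singer?" favors ORDER BY/LIMIT.
nbest = ["SELECT name FROM singer",
         "SELECT name FROM singer ORDER BY age DESC LIMIT 1"]
plan = {"SELECT", "ORDER BY", "LIMIT"}
print(rerank_by_plan(nbest, plan)[0])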
Fig. 1. PICARD explained by an example. The prediction pattern is “〈Database name〉|〈pred SQL〉”.

Correctness with schema linking. Schema linking (SL) has historically been explored for Text-to-SQL. The schema has been encoded in the model input for small encoder/decoder models [15, 42, 43] and large pretrained LMs [8, 16]. More work
has been done on modeling schema linking [9], where the role
of SL in Text-to-SQL is recognized. RAT-SQL [12], a system that has been influential in text-to-SQL, proposed a framework to encode relational structure in the database schema
and simultaneously address schema linking. In contrast, we
use schema linking as a post-processor and a reranker on the
output of large LMs.
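As a rough illustration of schema linking used as a post-processor on an n-best list, the Python sketch below drops hypotheses that mention identifiers absent from the database schema; the identifier extraction and keyword list are simplified assumptions, not the heuristic algorithm proposed in this paper.

# Illustrative sketch: filter n-best SQL hypotheses with a crude schema check.
# The token-level extraction below is a deliberate simplification.
import re
from typing import Dict, List, Set

def referenced_identifiers(sql: str) -> Set[str]:
    # Crude extraction of identifier-like tokens from the query.
    return set(re.findall(r"[a-z_][a-z_0-9]*", sql.lower()))

def passes_schema_check(sql: str, schema: Dict[str, List[str]]) -> bool:
    # Every identifier must be an SQL keyword, a table name, or a column name.
    keywords = {"select", "from", "where", "group", "by", "order", "limit",
                "having", "count", "avg", "min", "max", "sum", "asc", "desc",
                "and", "or", "not", "as", "join", "on", "distinct"}
    valid = set(schema) | {c for cols in schema.values() for c in cols} | keywords
    return referenced_identifiers(sql) <= valid

def schema_filter(nbest: List[str], schema: Dict[str, List[str]]) -> List[str]:
    # Keep hypotheses that pass; fall back to the original list if none do.
    kept = [h for h in nbest if passes_schema_check(h, schema)]
    return kept or nbest

schema = {"singer": ["singer_id", "name", "age"]}
print(schema_filter(["SELECT nme FROM singer", "SELECT name FROM singer"], schema))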
Noise in Spider. The Spider dataset [5] has been well explored in Text-to-SQL, and its corpus noise is documented in related
work [9, 12, 44]. One type of corpus noise comes from anno-
tation errors in groundtruth SQL queries and typos in natural
language queries [9]. Another source of noise is the pairing of only one groundtruth SQL query with each natural
language query, which results in a large portion of predicted
queries being incorrectly marked as “mispredicted” [12, 44].
3. TEXT-TO-SQL USING PRE-TRAINED LMS
We use PICARD as our baseline system and use it to produce the n-best hypotheses in this paper. It uses finetuned T5 models and implements incremental SQL parsing to constrain decoding during beam search. The input to the T5 model includes the natural language query, the database name, and the database schema (table name : col1, ..., coln). PICARD
employs a post-processor during beam search, and to save computing time it checks only the top-k next tokens for validity. Figure 1 shows how predictions are generated when PICARD is enabled, using k = 2 and a beam size of 3. A shadowed box means that the result is not expanded or explored at the next time step; blue (solid) boxes show valid SQL query prefixes, while orange (dashed) boxes show sequences that violate SQL syntax.
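For intuition, the following Python sketch mimics one step of such constrained beam search, where only the top-k token proposals per beam are checked by a validity predicate; all names are placeholders and the parser/schema check is left abstract, so this is a schematic, not PICARD's actual implementation.

# Schematic of one constrained beam-search step: only the top-k proposed
# continuations of each beam are checked for validity; invalid prefixes are
# pruned and never expanded at the next time step.
import heapq
from typing import Callable, List, Tuple

def constrained_beam_step(
    beams: List[Tuple[float, str]],                          # (log-prob, prefix)
    next_tokens: Callable[[str], List[Tuple[str, float]]],   # model's ranked proposals
    is_valid_prefix: Callable[[str], bool],                  # incremental parser / schema check
    beam_size: int = 3,
    top_k: int = 2,
) -> List[Tuple[float, str]]:
    candidates = []
    for score, prefix in beams:
        # Only the top-k token proposals are validated, to save compute.
        for token, logp in next_tokens(prefix)[:top_k]:
            new_prefix = prefix + token
            if is_valid_prefix(new_prefix):  # prune prefixes that cannot be valid SQL
                candidates.append((score + logp, new_prefix))
    # Keep the best `beam_size` surviving candidates for the next step.
    return heapq.nlargest(beam_size, candidates)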
The incremental parser not only filters out hypotheses
that violate the SQL grammar, but also imposes rudimen-
tary schema checks. Fig. 1 gives two examples of filtered hypotheses: the first is due to a syntax error (e.g., count()) and the other is due to a schema error (e.g.,