Accepted as a conference paper for SIGMOD 2023 @Seattle, WA, USA
arXiv:2210.05508v2 [cs.DB] 8 Dec 2022
Detect, Distill and Update:
Learned DB Systems Facing Out of Distribution Data
Meghdad Kurmanji, Peter Triantafillou
University of Warwick, Coventry, UK
{meghdad.kurmanji,p.triantafillou}@warwick.ac.uk
ABSTRACT
Machine Learning (ML) is changing DBs as many DB components
are being replaced by ML models. One open problem in this setting
is how to update such ML models in the presence of data updates.
We start this investigation focusing on data insertions (dominating
updates in analytical DBs). We study how to update neural network
(NN) models when new data follows a different distribution (a.k.a. it
is "out-of-distribution" – OOD), rendering previously-trained NNs
inaccurate. A requirement in our problem setting is that learned
DB components should ensure high accuracy for tasks on old and
new data (e.g., for approximate query processing (AQP), cardinality
estimation (CE), synthetic data generation (DG), etc.).
This paper proposes a novel updatability framework (DDUp).
DDUp can provide updatability for different learned DB system
components, even based on different NNs, without the high costs
of retraining the NNs from scratch. DDUp entails two components:
First, a novel, efficient, and principled statistical-testing approach
to detect OOD data. Second, a novel model updating approach,
grounded on the principles of transfer learning with knowledge
distillation, to update learned models efficiently, while still ensuring
high accuracy. We develop and showcase DDUp's applicability for
three different learned DB components, AQP, CE, and DG, each
employing a different type of NN. A detailed experimental evaluation
using real and benchmark datasets for AQP, CE, and DG details
DDUp's performance advantages.
KEYWORDS
Learned DBs, Out of Distribution Data, Knowledge Distillation,
Transfer Learning
1 INTRODUCTION
Database systems (DBs) are largely embracing ML. With data vol-
umes reaching unprecedented levels, ML can provide highly-accurate
methods to perform central data management tasks more efficiently.
Applications abound: AQP engines are leveraging ML to answer
queries much faster and more accurately than traditional DBs
[21, 42, 43, 65]. Cardinality/selectivity estimation has improved
considerably leveraging ML [17, 70, 77, 78, 84]. Likewise for query
optimization [27, 44, 45], indexes [9, 10, 30, 49], cost estimation
[63, 83], workload forecasting [85], DB tuning [34, 68, 81], syn-
thetic data generation [7, 54, 76], etc.
1.1 Challenges
As research in learned DB systems matures, two key pitfalls are
emerging. First, if the "context" (such as the data, the DB system,
and/or the workload) changes, previously trained models are no
longer accurate. Second, training accurate ML models is costly.
Hence, retraining from scratch when the context changes should
be avoided whenever possible. Emerging ML paradigms, such as
active learning, transfer learning, meta-learning, and zero/few-shot
learning are a good fit for such context changes and have been the
focus of recent related works [20, 41, 74], where the primary focus
is to glean what is learned from existing ML models (trained for
different learning tasks and/or DBs and/or workloads), and adapt
them for new tasks and/or DBs, and/or workloads, while avoiding
the need to retrain models from scratch.
OOD Data insertions.
In analytical DBs data updates primar-
ily take the form of new data insertions. New data may be OOD
(representing new knowledge – distributional shifts), rendering
previously-built ML models obsolete/inaccurate. Or, new data may
not be OOD. In the former case, the model must be updated, and
it must be decided how the new data can be efficiently reflected
in the model to continue ensuring accuracy. In the latter case, it is
desirable to avoid updating the model, as that would waste time and
resources. Therefore, it is also crucial to check (efficiently) whether
the new data render the previously built model inaccurate. However,
related research has not yet tackled this problem setting, whereby
models for the same learning tasks (e.g., AQP, DG, CE), trained
on old data, must continue to provide high accuracy for the new data
state (as queries may now access both old and new data, only old
data, or only new data). Related work on learned DB systems has
limited (or sometimes completely lacks the) capability of handling
such data insertions (as is independently verified in [70] and will be
shown in this paper as well).
Sources of Difficulty and Baselines.
In the presence of OOD data,
a simple solution is adopted by some learned DB compo-
nents like Naru [78], NeuroCard [77], DBest++ [42], and even the
aforementioned transfer/few-shot learning methods [20, 74]: that
is, to "fine-tune" the original model $M$ on the new data. Alas, this
is problematic. For instance, while a DBest++ model on the "Forest"
dataset has a 95th percentile q-error of 2, updating it with an
OOD sample using fine-tuning increases the 95th percentile q-error
to 63. A similar accuracy drop occurs for other key models as well – [70]
showcases this for learned CE works. This drastic drop in accuracy
is due to the fundamental problem of catastrophic forgetting [46],
where retraining a previously learned model on new tasks, i.e. new
data, causes the model to lose the knowledge it had acquired about
old data. To avoid catastrophic forgetting, Naru and DBest++ sug-
gest using a smaller learning rate while fine-tuning with the new
data. This, however, causes another fundamental problem, namely
intransigence [6], whereby the model resists fitting to new data,
rendering queries on new data inaccurate.
Another simple solution to avoid these problems would be to
aggregate the old data and new data and retrain the model from
scratch. However, as mentioned, this is undesirable in our envi-
ronment. As a concrete example, training Naru/NeuroCard on the
"Forest" dataset (with only 600k rows) on a 40-core CPU takes ca. 1.5
hours. Similarly high retraining overheads are typically observed
for neural network models, for various tasks. And retraining time
progressively increases as the DB size increases.
Therefore, more sophisticated approaches are needed, which
can avoid intransigence and catastrophic forgetting, update models
only when needed and do so while ensuring much smaller training
overheads than retraining from scratch and at the same time ensure
high accuracy for queries on old and new data. While for some
tasks, like CE, some researchers question whether achieving very
high accuracy through learned models will actually help the end-
task (query optimization) [44], for tasks like AQP (which is itself
the end-task) and for DG (with classification as the end-task) high
accuracy is clearly needed, as shown here. Even for CE, with OOD
data, accuracy can become horribly poor, as shown here, which is
likely to affect query optimization.
1.2 Contributions
To the best of our knowledge, this work proposes the first updata-
bility framework (DDUp) for learned DBs (in the face of new data
insertions possibly carrying OOD data) that can ensure high accu-
racy for queries on new and/or old data. DDUp is also efficient and
it can enjoy wide applicability, capable of being utilized for different
NNs and/or different learning tasks (such as AQP, DG, CE, etc.).
DDUp consists of a novel OOD detection and a novel model-update
module. More specifically, the contributions of DDUp are:
• A general and principled two-sample test for OOD detection.
Generality stems from it being based on the training loss function
of the NNs. Compared to prior art, it introduces no extra costs and
overheads, and can be used with different NNs, with different
loss functions, in different applications. To further minimize
detection time, it is divided into offline and online phases.
• A novel and general formulation of transfer learning based on
sequential self-distillation for model updating. This formulation
allows a higher degree of freedom in balancing tasks w.r.t. new
and old data, can adapt to different models and tasks, and maxi-
mizes performance via self-distillation.
• Importantly, DDUp can be used by any pre-trained NN without
introducing any assumptions on models or requiring additional
components that might require retraining models or incur extra
costs. Here, we instantiate it for three different tasks (namely,
the CE task, using the Naru/NeuroCard deep autoregressive net-
work (DARN) models [77, 78]; the AQP task, using the DBEst++
mixture density network (MDN) model [42]; and the DG task,
using the Tabular Variational AutoEncoder (TVAE) model [76]),
each of which employs a different NN type. These are represen-
tative learning tasks and networks with evident importance in
DBs and beyond. These instantiations are also novel, showing
how to distill-and-update MDNs, DARNs, and TVAEs.
• Finally, DDUp is evaluated using six different datasets and the
three instantiated learned DB components, for AQP, CE, and DG.
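The distillation-based update in the second bullet can be pictured as a weighted combination of a task loss on the new data and a term that keeps the student close to the frozen teacher. The following is a minimal sketch in plain Python; the function names, the KL choice for the distillation term, and the fixed weight `alpha` are our illustration, not DDUp's exact per-model formulation:

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability vectors (lists of probs)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distill_update_loss(task_loss_new, teacher_probs, student_probs, alpha=0.5):
    """Weighted update objective: fit the new data (task term) while a
    frozen teacher regularizes the student toward the old distribution."""
    return alpha * task_loss_new + (1.0 - alpha) * kl_div(teacher_probs, student_probs)

# If the student still matches the teacher exactly, only the task term remains.
loss = distill_update_loss(0.8, [0.2, 0.8], [0.2, 0.8], alpha=0.5)  # -> 0.4
```

In an actual training loop, the distillation term would be computed on the model outputs for each mini-batch, and `alpha` would control the balance between learning the new data and retaining the old distribution.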
1.3 Limitations
DDUp focuses only on data insertions, which are essential and dom-
inant in analytical DBs, and not on in-place updates and deletes,
which are prevalent in transactional DBs. Nonetheless, the lat-
ter touch upon an open problem in the ML literature, namely
unlearning, which typically concerns privacy (e.g., removing
sensitive data from images in classification tasks) (e.g., [15, 61]).
Studying unlearning for DB problem settings is a formidable task
of its own and of high interest for future research.
Also, DDUp is designed for NN-based learned DB components.
This is so as neural networks are a very rich family of models
which have collectively received very large attention for learned
DBs. Extending DDUp principles beyond NN models is also left for
future research.
2 THE PROBLEM AND SOLUTION
OVERVIEW
2.1 Problem Formulation
Consider a database relation $R$ with attributes $\{A_1, A_2, \dots, A_m\}$. This
can be a raw table or the result of a join query. Also consider
a sequence of $N$ insertion updates denoted by $I = \{I_1, I_2, \dots, I_N\}$.
Each $I_t$ is an insert operation which appends a data batch
$D_t = \{(A_1, A_2, \dots, A_m)_t^{(i)};\ i = 1, \dots, n_t\}$ to $R$, where $n_t$ is the number of
rows. Let $S_t$ be a sufficient sample of $D_t$ and $S_{t-1}$ be a sufficient
sample from $\bigcup_{j=0}^{t-1} D_j$. We naturally assume that $|R|$ is finite. And,
due to the training restrictions of existing models, we also make
the natural assumption:
$$\forall A_i \in R:\ \mathrm{supp}(D_t(A_i)) \subseteq \mathrm{supp}(D_{t-1}(A_i))$$
where $\mathrm{supp}(D(A_i))$ is the support of attribute $A_i$ in dataset $D$. This
assumption ensures that the domain of each attribute is not violated
by the upcoming update batches.
Statistical test for data changes. We define out-of-distribution
detection as a two-sample hypothesis test between a sample of
historical data and a sample of the new data. Let $S_{t-1}$ have a joint
distribution $P(A_1, \dots, A_m) \sim \mathcal{P}$ and $S_t$ have a joint distribution
$Q(A_1, \dots, A_m) \sim \mathcal{Q}$. We define the null hypothesis
$H_0: \mathcal{P} = \mathcal{Q}$, which asserts that $S_t$ and $S_{t-1}$ come from the same distribution,
and the alternative hypothesis $H_A: \mathcal{P} \neq \mathcal{Q}$, which declares that the
two samples are generated by two different distributions.
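As an illustration of such a two-sample test (not the specific procedure DDUp uses, which builds on the model's own loss values), a simple permutation test on a one-dimensional summary statistic can decide between $H_0$ and $H_A$. The sketch below, in plain Python, uses the difference of sample means as that statistic:

```python
import random

def permutation_test(sample_a, sample_b, n_perm=500, seed=0):
    """Two-sample permutation test on the difference of means.
    Returns a p-value for H0: both samples come from the same distribution."""
    rng = random.Random(seed)
    mean = lambda s: sum(s) / len(s)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a, hits = len(sample_a), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # under H0 any relabeling is equally likely
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

rng = random.Random(1)
old = [rng.gauss(0, 1) for _ in range(200)]      # sample of historical data
new_ind = [rng.gauss(0, 1) for _ in range(200)]  # same distribution
new_ood = [rng.gauss(3, 1) for _ in range(200)]  # shifted distribution
```

A shifted batch yields a small p-value (reject $H_0$), while an in-distribution batch does not. Note that such a univariate test on raw attributes does not directly extend to multivariate tables, which is exactly the difficulty Section 3 addresses.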
Incrementally updating the model. Consider that for $I_0$ a model $M_0$
is trained by minimizing a loss function $\mathcal{L}(D_0; \Theta_0)$. This model
may be stale for $I_t$, $t > 0$. Ideally, the goal of incremental learn-
ing is: at time $t$, train a model $M_t$ that minimizes a function over
$\sum_{i=1}^{t} \mathcal{L}(D_i; \Theta_i)$. This new model should not forget
$\{I_i;\ i = 0, 1, \dots, t-1\}$ and should also learn $I_t$.
2.2 A High Level View of DDUp
The overall architecture of DDUp is depicted in Figure 1.
DDUp processes batches of tuples at a time. Such batched handling
of insertions is typical in analytical DBs. Furthermore, this takes
into account that the effect of single tuples is usually negligible
for the overall large space modelled by NNs: for most tasks,
like CE, AQP, and DG, the effect of single tuples on the final result
is very small, considering the large sizes of tables. Batching also
amortizes detect-and-update costs over many insertion operations.
Upon a new batch insertion, DDUp takes the latest model $M_{t-1}$
and performs bootstrap sampling from the previous data to
Figure 1: The overall structure of DDUp. DDUp uses the latest model and previous data to build a sampling distribution for
the two-sample test, and updates the learned component based on the shift in the data distribution.
build the sampling distribution for the average loss values. DDUp
uses this distribution to calculate a significance level corresponding
to a confidence interval (e.g., a 95% confidence interval). The general
idea is that if the new data is similar to the previous data (IND
in Figure 1), the loss values of $M_{t-1}$ for this new data should lie
within the threshold. This means that the new data has the same
distribution, and therefore the model can be left intact (updating
maybe just the hyper-parameters of the system, including possible
frequency tables and other table statistics); alternatively, a simple
fine-tuning can be performed to adapt the model to the new data.
If the loss values exceed the threshold, this implies that the data
distribution has significantly changed. DDUp will then deploy a teacher-
student transfer learning method based on knowledge distillation
to learn this new distribution without forgetting the knowledge of
the old data. In this framework, while the student directly learns
the distribution of the new data, the teacher acts as a regularizer to
make the student also learn about the old distribution.
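The detect step just described can be sketched as follows. This is a hedged illustration in plain Python: the bootstrap threshold follows the description above, but the "model" is stood in for by a simple Gaussian squared-error loss fitted to the old data (`nll` is our toy stand-in, not one of the paper's NNs):

```python
import random
import statistics

def bootstrap_threshold(loss_fn, old_sample, batch_size, n_boot=1000, q=0.95, seed=0):
    """Build the sampling distribution of the average loss over bootstrap
    resamples of the old data; its q-quantile serves as the OOD threshold."""
    rng = random.Random(seed)
    means = sorted(
        sum(map(loss_fn, rng.choices(old_sample, k=batch_size))) / batch_size
        for _ in range(n_boot)
    )
    return means[int(q * (n_boot - 1))]

def is_ood(loss_fn, new_batch, threshold):
    """Flag the inserted batch as OOD if its average loss exceeds the threshold."""
    return sum(map(loss_fn, new_batch)) / len(new_batch) > threshold

# Toy stand-in for a trained model: standardized squared-error "loss" of a
# Gaussian fitted to the old data (an assumption for illustration only).
rng = random.Random(2)
old = [rng.gauss(0, 1) for _ in range(500)]
mu, sigma = statistics.fmean(old), statistics.pstdev(old)
nll = lambda x: 0.5 * ((x - mu) / sigma) ** 2
threshold = bootstrap_threshold(nll, old, batch_size=100)
```

The offline phase computes `threshold` once from the old data; the online phase only evaluates the average loss of each incoming batch against it, which keeps per-insertion detection cheap.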
3 OUT-OF-DISTRIBUTION DETECTION
3.1 Background
In ML, OOD is typically addressed from a classification perspective.
Formally, assume $D$ is a dataset of $(x, y)$ pairs which are drawn
from a joint distribution $p(x, y)$, where $x \in \mathcal{X} := \{x_1, x_2, \dots, x_n\}$
is the input (independent variable) consisting of $n$ features, and
$y \in \mathcal{Y} := \{1, 2, \dots, k\}$ is the label corresponding to one of the $k$
in-distribution classes. A sample $(x, y)$ that was probably generated
by a different distribution than $p(x, y)$ is called OOD if $y \notin \mathcal{Y}$, i.e.,
it does not belong to any previously seen class.
A similar problem has previously been addressed in statistics
as concept drift detection, where different types of shifts are distin-
guished by expanding $p(x, y)$ using the Bayes rule:
$$p(x, y) = p(x)\,p(y|x) \quad (1)$$
Based on Eq. 1, changes in $P(y|x)$ are usually referred to as real
drift, while changes in $P(x)$ are called virtual drift [14]. In $X \to y$
problems the latter is mostly known as covariate shift. Deciding
which drift to detect depends on the underlying models. For
example, deep autoregressive networks (e.g., used by [78]) learn the
full joint distribution of a table. Hence, they are sensitive to covariate
shift upon insertions. On the other hand, mixture density networks
(e.g., used by [42]) model the conditional probability between a set
of independent attributes and a target attribute. Hence, for these
models, one would be interested in detecting real drift.
3.2 Loss based OOD Detection
There are several challenges that make it difficult to simply adopt
one of the OOD detection algorithms from the ML or statistical learn-
ing literature. First, DB tables are multivariate in nature and learned
models are usually trained on multiple attributes. As a result, uni-
variate two-sample tests like the Kolmogorov–Smirnov (KS) test are
not suitable for this purpose. Second, the test should introduce low
overheads to the system, as insertions may be frequent. Therefore,
multivariate tests like kernel methods, which require learning densities
and performing expensive inference computations, are not desirable.
Third, we aim to support different learning tasks for which different
models might be used. Thus, most OOD detection methods in ML
that are based on the probability scores (confidence) of classification
tasks are not useful here. Moreover, the test should be able to adapt
efficiently to the case where insertions occur within old data, that
is, without having to recalculate baseline thresholds, etc.
An ecient OOD detection method is now proposed that resolves
all above issues by leveraging the underlying ML models themselves.
Central to most learned data system components is the ability to
derive from the underlying data tables a model for the joint or
conditional data distribution like
𝑝(𝑥)
or
𝑝(𝑦|𝑥)
. A model usually
achieves this by learning a set of parameters
Θ
that represent a
function
𝑓
by iteratively optimizing over a loss function as follows:
𝑓Θ=arg min
𝑓F
1
𝑛
𝑛
𝑖=1
(𝑓(𝑥);Θ) + Ω(𝑓)(2)
where, Ωis a regularizer term, 𝑛is the number of samples, and
𝑓
could be the outputs of the model in the last layer (called logits),
or the probabilities assigned by a "softmax" function.
We will later discuss different loss functions in more detail
when instantiating different models. In general, loss functions are
usually highly non-convex with many local minima. However, a
good learning strategy will find the global minimum. Because of
the large data sizes, training is usually done by iterating over mini-
batches, and a gradient descent algorithm updates the parameters
based on the average loss of the samples in each mini-batch per
iteration. For the rest of the paper, by "loss value" we mean the
average of the losses of the samples in a batch. Once the
model is trained, i.e. the loss values have converged, the model can
serve as a transformer that maps (high-dimensional) input data to the
one-dimensional loss-function space around the global minimum.