Accepted as a conference paper for SIGMOD 2023 @Seattle, WA, USA
arXiv:2210.05508v2 [cs.DB] 8 Dec 2022
Detect, Distill and Update:
Learned DB Systems Facing Out of Distribution Data
Meghdad Kurmanji, Peter Triantafillou
University of Warwick, Coventry, UK
{meghdad.kurmanji,p.triantafillou}@warwick.ac.uk
ABSTRACT
Machine Learning (ML) is changing DBs as many DB components
are being replaced by ML models. One open problem in this setting
is how to update such ML models in the presence of data updates.
We start this investigation focusing on data insertions (dominating
updates in analytical DBs). We study how to update neural network
(NN) models when new data follows a different distribution (a.k.a. it
is "out-of-distribution" – OOD), rendering previously-trained NNs
inaccurate. A requirement in our problem setting is that learned
DB components should ensure high accuracy for tasks on old and
new data (e.g., for approximate query processing (AQP), cardinality
estimation (CE), synthetic data generation (DG), etc.).
This paper proposes a novel updatability framework (DDUp).
DDUp can provide updatability for different learned DB system
components, even based on different NNs, without the high costs
of retraining the NNs from scratch. DDUp entails two components:
First, a novel, efficient, and principled statistical-testing approach
to detect OOD data. Second, a novel model updating approach,
grounded on the principles of transfer learning with knowledge
distillation, to update learned models efficiently, while still ensuring
high accuracy. We develop and showcase DDUp's applicability for
three different learned DB components, AQP, CE, and DG, each
employing a different type of NN. A detailed experimental evaluation
using real and benchmark datasets for AQP, CE, and DG details
DDUp's performance advantages.
KEYWORDS
Learned DBs, Out of Distribution Data, Knowledge Distillation,
Transfer Learning
1 INTRODUCTION
Database systems (DBs) are largely embracing ML. With data vol-
umes reaching unprecedented levels, ML can provide highly-accurate
methods to perform central data management tasks more efficiently.
Applications abound: AQP engines are leveraging ML to answer
queries much faster and more accurately than traditional DBs
[21, 42, 43, 65]. Cardinality/selectivity estimation has improved
considerably leveraging ML [17, 70, 77, 78, 84]. Likewise for query
optimization [27, 44, 45], indexes [9, 10, 30, 49], cost estimation
[63, 83], workload forecasting [85], DB tuning [34, 68, 81], syn-
thetic data generation [7, 54, 76], etc.
1.1 Challenges
As research in learned DB systems matures, two key pitfalls are
emerging. First, if the "context" (such as the data, the DB system,
and/or the workload) changes, previously trained models are no
longer accurate. Second, training accurate ML models is costly.
Hence, retraining from scratch when the context changes should
be avoided whenever possible. Emerging ML paradigms, such as
active learning, transfer learning, meta-learning, and zero/few-shot
learning are a good fit for such context changes and have been the
focus of recent related works [20, 41, 74], where the primary focus
is to glean what is learned from existing ML models (trained for
different learning tasks and/or DBs and/or workloads), and adapt
them for new tasks and/or DBs, and/or workloads, while avoiding
the need to retrain models from scratch.
OOD Data insertions.
In analytical DBs data updates primar-
ily take the form of new data insertions. New data may be OOD
(representing new knowledge – distributional shifts), rendering
previously-built ML models obsolete/inaccurate. Or, new data may
not be OOD. In the former case, the model must be updated, and
it must be decided how the new data can be efficiently reflected
in the model to continue ensuring accuracy. In the latter case, it is
desirable to avoid updating the model, as that would waste time and
resources. Therefore, it is also crucial to check (efficiently) whether
the new data render the previously built model inaccurate. However,
related research has not yet tackled this problem setting, whereby
models for the same learning tasks (e.g., AQP, DG, CE), trained
on old data, must continue to provide high accuracy for the new data
state (as queries may now access both old and new data, only old
data, or only new data). Related work on learned DB systems has
limited (or sometimes completely lacks the) capability of handling
such data insertions (as is independently verified in [70] and will be
shown in this paper as well).
Sources of Difficulty and Baselines.
In the presence of OOD data,
a simple solution is adopted by some learned DB compo-
nents like Naru [78], NeuroCard [77], DBest++ [42], and even the
aforementioned transfer/few-shot learning methods [20, 74]: that
is, to "fine-tune" the original model $M$ on the new data. Alas, this
is problematic. For instance, while a DBest++ model on the "Forest"
dataset has a 95th percentile q-error of 2, updating it with an
OOD sample using fine-tuning increases the 95th percentile q-error
to 63. A similar accuracy drop occurs for other key models as well – [70]
showcases this for learned CE works. This drastic drop in accuracy
is due to the fundamental problem of catastrophic forgetting [46],
where retraining a previously learned model on new tasks, i.e. new
data, causes the model to lose the knowledge it had acquired about
old data. To avoid catastrophic forgetting, Naru and DBest++ sug-
gest using a smaller learning rate while fine-tuning with the new
data. This, however, causes another fundamental problem, namely
intransigence [6], whereby the model resists fitting to new data,
rendering queries on new data inaccurate.
Another simple solution to avoid these problems would be to
aggregate the old data and new data and retrain the model from
scratch. However, as mentioned, this is undesirable in our envi-
ronment. As a concrete example, training Naru/NeuroCard on the
"Forest" dataset (with only 600k rows) on a 40-core CPU takes ca. 1.5
hours. Similarly high retraining overheads are typically observed
for neural network models, for various tasks. And retraining time
progressively increases as the DB size increases.
Therefore, more sophisticated approaches are needed, which
can avoid intransigence and catastrophic forgetting, update models
only when needed and do so while ensuring much smaller training
overheads than retraining from scratch and at the same time ensure
high accuracy for queries on old and new data. While for some
tasks, like CE, some researchers question whether achieving very
high accuracy through learned models will actually help the end-
task (query optimization) [44], for tasks like AQP (which is itself
the end-task) and for DG (with classification as the end-task) high
accuracy is clearly needed, as shown here. Even for CE, with OOD
data, accuracy can become horribly poor, as shown here, which is
likely to affect query optimization.
1.2 Contributions
To the best of our knowledge, this work proposes the first updata-
bility framework (DDUp) for learned DBs (in the face of new data
insertions possibly carrying OOD data) that can ensure high accu-
racy for queries on new and/or old data. DDUp is also efficient and
it can enjoy wide applicability, capable of being utilized for different
NNs and/or different learning tasks (such as AQP, DG, CE, etc.).
DDUp consists of a novel OOD detection and a novel model-update
module. More specifically, the contributions of DDUp are:
• A general and principled two-sample test for OOD detection.
Generality stems from it being based on the training loss function
of the NNs. Compared to prior art, it introduces no extra costs and
overheads, and can be used with different NNs, with different
loss functions, in different applications. To further minimize
detection time, it is divided into offline and online phases.
• A novel and general formulation of transfer learning based on
sequential self-distillation for model updating. This formulation
allows a higher degree of freedom in balancing tasks w.r.t. new
and old data, can adapt to different models and tasks, and maxi-
mizes performance via self-distillation.
• Importantly, DDUp can be used by any pre-trained NN without
introducing any assumptions on models or requiring additional
components that might require retraining models or incur extra
costs. Here, we instantiate it for three different tasks (namely,
the CE task, using the Naru/NeuroCard deep autoregressive net-
work (DARN) models [77, 78]; the AQP task, using the DBEst++
mixture density network (MDN) model [42]; and the DG task,
using the Tabular Variational AutoEncoder (TVAE) model [76]),
each of which employs a different NN type. These are represen-
tative learning tasks and networks with evident importance in
DBs and beyond. These instantiations are also novel, showing
how to distill-and-update MDNs, DARNs, and TVAEs.
• Finally, DDUp is evaluated using six different datasets and the
three instantiated learned DB components, for AQP, CE, and DG.
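The distillation-based update in the second bullet can be pictured as a weighted combination of a task loss on the new data and a term that keeps the student close to the frozen teacher. The following is a minimal sketch in plain Python; the function names, the KL choice for the distillation term, and the fixed weight `alpha` are our illustration, not DDUp's exact per-model formulation:

```python
import math

def kl_div(p, q, eps=1e-12):
    """KL(p || q) between two discrete probability vectors (lists of probs)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distill_update_loss(task_loss_new, teacher_probs, student_probs, alpha=0.5):
    """Weighted update objective: fit the new data (task term) while a
    frozen teacher regularizes the student toward the old distribution."""
    return alpha * task_loss_new + (1.0 - alpha) * kl_div(teacher_probs, student_probs)

# If the student still matches the teacher exactly, only the task term remains.
loss = distill_update_loss(0.8, [0.2, 0.8], [0.2, 0.8], alpha=0.5)  # -> 0.4
```

In an actual training loop, the distillation term would be computed on the model outputs for each mini-batch, and `alpha` would control the balance between learning the new data and retaining the old distribution.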
1.3 Limitations
DDUp focuses only on data insertions, which are essential and dom-
inant in analytical DBs, and not on in-place updates and deletes,
which are prevalent in transactional DBs. Nonetheless, the lat-
ter touch upon an open problem in the ML literature, namely
unlearning, which typically concerns privacy (e.g., removing
sensitive data from images in classification tasks) (e.g., [15, 61]).
Studying unlearning for DB problem settings is a formidable task
of its own and of high interest for future research.
Also, DDUp is designed for NN-based learned DB components.
This is so as neural networks are a very rich family of models
which have collectively received very large attention for learned
DBs. Extending DDUp principles beyond NN models is also left for
future research.
2 THE PROBLEM AND SOLUTION
OVERVIEW
2.1 Problem Formulation
Consider a database relation $R$ with attributes $\{A_1, A_2, \dots, A_m\}$. This
can be a raw table or the result of a join query. Also consider
a sequence of $N$ insertion updates denoted by $I = \{I_1, I_2, \dots, I_N\}$.
Each $I_t$ is an insert operation which appends a data batch
$D_t = \{(A_1, A_2, \dots, A_m)_t^{(i)};\ i = 1, \dots, n_t\}$ to $R$, where $n_t$ is the number of
rows. Let $S_t$ be a sufficient sample of $D_t$ and $S_{t-1}$ be a sufficient
sample from $\bigcup_{j=0}^{t-1} D_j$. We naturally assume that $|R|$ is finite. And,
due to the training restrictions of existing models, we also make
the natural assumption:
$$\forall A_i \in R:\ \mathrm{supp}(D_t(A_i)) \subseteq \mathrm{supp}(D_{t-1}(A_i))$$
where $\mathrm{supp}(D(A_i))$ is the support of attribute $A_i$ in dataset $D$. This
assumption ensures that the domain of each attribute is not violated
by the upcoming update batches.
Statistical test for data changes. We define out-of-distribution
detection as a two-sample hypothesis test between a sample of
historical data and a sample of the new data. Let $S_{t-1}$ have a joint
distribution $P(A_1, \dots, A_m) \sim \mathcal{P}$ and $S_t$ have a joint distribution
$Q(A_1, \dots, A_m) \sim \mathcal{Q}$. We define the null hypothesis
$H_0: \mathcal{P} = \mathcal{Q}$, which asserts that $S_t$ and $S_{t-1}$ come from the same distribution,
and the alternative hypothesis $H_A: \mathcal{P} \neq \mathcal{Q}$, which declares that the
two samples are generated by two different distributions.
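As an illustration of such a two-sample test (not the specific procedure DDUp uses, which builds on the model's own loss values), a simple permutation test on a one-dimensional summary statistic can decide between $H_0$ and $H_A$. The sketch below, in plain Python, uses the difference of sample means as that statistic:

```python
import random

def permutation_test(sample_a, sample_b, n_perm=500, seed=0):
    """Two-sample permutation test on the difference of means.
    Returns a p-value for H0: both samples come from the same distribution."""
    rng = random.Random(seed)
    mean = lambda s: sum(s) / len(s)
    observed = abs(mean(sample_a) - mean(sample_b))
    pooled = list(sample_a) + list(sample_b)
    n_a, hits = len(sample_a), 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # under H0 any relabeling is equally likely
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

rng = random.Random(1)
old = [rng.gauss(0, 1) for _ in range(200)]      # sample of historical data
new_ind = [rng.gauss(0, 1) for _ in range(200)]  # same distribution
new_ood = [rng.gauss(3, 1) for _ in range(200)]  # shifted distribution
```

A shifted batch yields a small p-value (reject $H_0$), while an in-distribution batch does not. Note that such a univariate test on raw attributes does not directly extend to multivariate tables, which is exactly the difficulty Section 3 addresses.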
Incrementally updating the model. Consider that for $I_0$ a model $M_0$
is trained by minimizing a loss function $\mathcal{L}(D_0; \Theta_0)$. This model
may be stale for $I_t$, $t > 0$. Ideally, the goal of incremental learn-
ing is: at time $t$, train a model $M_t$ that minimizes a function over
$\sum_{i=1}^{t} \mathcal{L}(D_i; \Theta_i)$. This new model should not forget
$\{I_i;\ i = 0, 1, \dots, t-1\}$ and should also learn $I_t$.
2.2 A High Level View of DDUp
The overall architecture of DDUp is depicted in Figure 1.
DDUp processes batches of tuples at a time. Such batched handling
of insertions is typical in analytical DBs. Furthermore, this takes
into account that the effect of single tuples is usually negligible
for the overall large space modelled by NNs: for most tasks,
like CE, AQP, and DG, the effect of single tuples on the final result
is very small, considering the large sizes of tables. Batching also
amortizes detect-and-update costs over many insertion operations.
Upon a new batch insertion, DDUp takes the latest model $M_{t-1}$
and performs bootstrap sampling from the previous data to
Figure 1: The overall structure of DDUp. DDUp uses the latest model and previous data to build a sampling distribution for
the two-sample test, and updates the learned component based on the shift in the data distribution.
build the sampling distribution for the average loss values. DDUp
uses this distribution to calculate a significance level corresponding
to a confidence interval (e.g., a 95% confidence interval). The general
idea is that if the new data is similar to the previous data (IND
in Figure 1), the loss values of $M_{t-1}$ for this new data should lie
within the threshold. This means that the new data has the same
distribution, and therefore the model can be left intact (updating
maybe just the hyper-parameters of the system, including possible
frequency tables and other table statistics); alternatively, a simple
fine-tuning can be performed to adapt the model to the new data.
If the loss values exceed the threshold, this implies that the data
distribution has significantly changed. DDUp will then deploy a teacher-
student transfer learning method based on knowledge distillation
to learn this new distribution without forgetting the knowledge of
the old data. In this framework, while the student directly learns
the distribution of the new data, the teacher acts as a regularizer to
make the student also learn about the old distribution.
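The detect step just described can be sketched as follows. This is a hedged illustration in plain Python: the bootstrap threshold follows the description above, but the "model" is stood in for by a simple Gaussian squared-error loss fitted to the old data (`nll` is our toy stand-in, not one of the paper's NNs):

```python
import random
import statistics

def bootstrap_threshold(loss_fn, old_sample, batch_size, n_boot=1000, q=0.95, seed=0):
    """Build the sampling distribution of the average loss over bootstrap
    resamples of the old data; its q-quantile serves as the OOD threshold."""
    rng = random.Random(seed)
    means = sorted(
        sum(map(loss_fn, rng.choices(old_sample, k=batch_size))) / batch_size
        for _ in range(n_boot)
    )
    return means[int(q * (n_boot - 1))]

def is_ood(loss_fn, new_batch, threshold):
    """Flag the inserted batch as OOD if its average loss exceeds the threshold."""
    return sum(map(loss_fn, new_batch)) / len(new_batch) > threshold

# Toy stand-in for a trained model: standardized squared-error "loss" of a
# Gaussian fitted to the old data (an assumption for illustration only).
rng = random.Random(2)
old = [rng.gauss(0, 1) for _ in range(500)]
mu, sigma = statistics.fmean(old), statistics.pstdev(old)
nll = lambda x: 0.5 * ((x - mu) / sigma) ** 2
threshold = bootstrap_threshold(nll, old, batch_size=100)
```

The offline phase computes `threshold` once from the old data; the online phase only evaluates the average loss of each incoming batch against it, which keeps per-insertion detection cheap.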
3 OUT-OF-DISTRIBUTION DETECTION
3.1 Background
In ML, OOD is typically addressed from a classification perspective.
Formally, assume $D$ is a dataset of $(x, y)$ pairs which are drawn
from a joint distribution $p(x, y)$, where $x \in \mathcal{X} := \{x_1, x_2, \dots, x_n\}$
is the input (independent variable) consisting of $n$ features, and
$y \in \mathcal{Y} := \{1, 2, \dots, k\}$ is the label corresponding to one of the $k$
in-distribution classes. A sample $(x, y)$ that was probably generated
by a different distribution than $p(x, y)$ is called OOD if $y \notin \mathcal{Y}$, i.e.,
it does not belong to any previously seen class.
A similar problem has previously been addressed in statistics
as concept drift detection, where different types of shifts are distin-
guished by expanding $p(x, y)$ using the Bayes rule:
$$p(x, y) = p(x)\,p(y|x) \quad (1)$$
Based on Eq. 1, changes in $P(y|x)$ are usually referred to as real
drift, while changes in $P(x)$ are called virtual drift [14]. In $X \to y$
problems the latter is mostly known as covariate shift. Deciding
which drift to detect depends on the underlying models. For
example, deep autoregressive networks (e.g., used by [78]) learn the
full joint distribution of a table. Hence, they are sensitive to covariate
shift upon insertions. On the other hand, mixture density networks
(e.g., used by [42]) model the conditional probability between a set
of independent attributes and a target attribute. Hence, for these
models, one would be interested in detecting real drift.
3.2 Loss based OOD Detection
There are several challenges that make it difficult to simply adopt
one of the OOD detection algorithms from the ML or statistical learn-
ing literature. First, DB tables are multivariate in nature and learned
models are usually trained on multiple attributes. As a result, uni-
variate two-sample tests like the Kolmogorov–Smirnov (KS) test are
not suitable for this purpose. Second, the test should introduce low
overheads to the system, as insertions may be frequent. Therefore,
multivariate tests like kernel methods, which require learning densities
and performing expensive inference computations, are not desirable.
Third, we aim to support different learning tasks for which different
models might be used. Thus, most OOD detection methods in ML
that are based on the probability scores (confidence) of classification
tasks are not useful here. Moreover, the test should be able to adapt
efficiently to the case where insertions occur within old data, that
is, without having to recalculate baseline thresholds, etc.
An ecient OOD detection method is now proposed that resolves
all above issues by leveraging the underlying ML models themselves.
Central to most learned data system components is the ability to
derive from the underlying data tables a model for the joint or
conditional data distribution like
𝑝(𝑥)
or
𝑝(𝑦|𝑥)
. A model usually
achieves this by learning a set of parameters
Θ
that represent a
function
𝑓
by iteratively optimizing over a loss function as follows:
𝑓Θ=arg min
𝑓F
1
𝑛
𝑛
𝑖=1
(𝑓(𝑥);Θ) + Ω(𝑓)(2)
where, Ωis a regularizer term, 𝑛is the number of samples, and
𝑓
could be the outputs of the model in the last layer (called logits),
or the probabilities assigned by a "softmax" function.
We will later discuss different loss functions in more detail
when instantiating different models. In general, loss functions are
usually highly non-convex with many local minima. However, a
good learning strategy will find the global minimum. Because of
the large data sizes, training is usually done by iterating over mini-
batches, and a gradient descent algorithm updates the parameters
based on the average loss of the samples in each mini-batch per
iteration. For the rest of the paper, by "loss value" we mean the
average of the losses of the samples in a batch. Once the
model is trained, i.e. the loss values have converged, the model can
serve as a transformer that maps (high-dimensional) input data to the
one-dimensional loss-function space around the global minimum.