Accepted as a conference paper at SIGMOD 2023, Seattle, WA, USA
Detect, Distill and Update:
Learned DB Systems Facing Out of Distribution Data
Meghdad Kurmanji, Peter Triantafillou
University of Warwick, Coventry, UK
{meghdad.kurmanji,p.triantafillou}@warwick.ac.uk
ABSTRACT
Machine Learning (ML) is changing DBs as many DB components
are being replaced by ML models. One open problem in this setting
is how to update such ML models in the presence of data updates.
We start this investigation focusing on data insertions (dominating
updates in analytical DBs). We study how to update neural network
(NN) models when new data follows a different distribution (a.k.a. it
is "out-of-distribution" – OOD), rendering previously-trained NNs
inaccurate. A requirement in our problem setting is that learned
DB components should ensure high accuracy for tasks on old and
new data (e.g., for approximate query processing (AQP), cardinality
estimation (CE), synthetic data generation (DG), etc.).
This paper proposes a novel updatability framework (DDUp). DDUp can provide updatability for different learned DB system components, even based on different NNs, without the high costs of retraining the NNs from scratch. DDUp entails two components: First, a novel, efficient, and principled statistical-testing approach to detect OOD data. Second, a novel model updating approach, grounded on the principles of transfer learning with knowledge distillation, to update learned models efficiently, while still ensuring high accuracy. We develop and showcase DDUp's applicability for three different learned DB components, AQP, CE, and DG, each employing a different type of NN. A detailed experimental evaluation using real and benchmark datasets for AQP, CE, and DG details DDUp's performance advantages.
KEYWORDS
Learned DBs, Out of Distribution Data, Knowledge Distillation,
Transfer Learning
1 INTRODUCTION
Database systems (DBs) are largely embracing ML. With data volumes reaching unprecedented levels, ML can provide highly-accurate methods to perform central data management tasks more efficiently. Applications abound: AQP engines are leveraging ML to answer queries much faster and more accurately than traditional DBs [21, 42, 43, 65]. Cardinality/selectivity estimation has improved considerably leveraging ML [17, 70, 77, 78, 84]. Likewise for query optimization [27, 44, 45], indexes [9, 10, 30, 49], cost estimation [63, 83], workload forecasting [85], DB tuning [34, 68, 81], synthetic data generation [7, 54, 76], etc.
1.1 Challenges
As research in learned DB systems matures, two key pitfalls are
emerging. First, if the "context" (such as the data, the DB system,
and/or the workload) changes, previously trained models are no
longer accurate. Second, training accurate ML models is costly.
Hence, retraining from scratch when the context changes should
be avoided whenever possible. Emerging ML paradigms, such as active learning, transfer learning, meta-learning, and zero/few-shot learning, are a good fit for such context changes and have been the focus of recent related works [20, 41, 74], whose primary aim is to glean what is learned from existing ML models (trained for different learning tasks, DBs, and/or workloads) and adapt them for new tasks, DBs, and/or workloads, while avoiding the need to retrain models from scratch.
OOD Data Insertions. In analytical DBs, data updates primarily take the form of new data insertions. New data may be OOD (representing new knowledge, i.e., distributional shifts), rendering previously-built ML models obsolete/inaccurate. Or, new data may not be OOD. In the former case, the model must be updated, and it must be decided how the new data can be efficiently reflected in the model so that it continues to ensure accuracy. In the latter case, it is desirable to avoid updating the model, as that would waste time/resources. Therefore, it is also crucial to check (efficiently) whether the new data render the previously built model inaccurate. However, related research has not yet tackled this problem setting, whereby models for the same learning tasks (e.g., AQP, DG, CE, etc.) trained on old data must continue to provide high accuracy for the new data state (as queries may now access both old and new data, only old data, or only new data). Related work on learned DB systems has limited (or sometimes completely lacks the) capability of handling such data insertions (as is independently verified in [70] and will be shown in this paper as well).
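For intuition only, the sketch below shows what such an efficient OOD check could look like: a one-sided statistical test that flags an insertion batch as OOD when the trained model's average loss on it is significantly higher than on old data. The function name detect_ood, the use of per-sample negative log-likelihood as the loss, and the bootstrap threshold construction are illustrative assumptions here, not necessarily DDUp's exact procedure (which is detailed later in the paper).

```python
import numpy as np

def detect_ood(model_loss, old_data, new_data, n_boot=1000, alpha=0.05):
    """Illustrative OOD test (not DDUp's exact procedure).

    model_loss: callable mapping a batch of rows to per-sample losses
                (e.g., negative log-likelihood under the trained model).
    Flags new_data as OOD if its mean loss significantly exceeds the
    bootstrapped distribution of mean losses on old data.
    """
    new_stat = model_loss(new_data).mean()

    # Bootstrap the mean loss on old data to estimate its sampling
    # distribution under the "no distributional shift" null hypothesis.
    rng = np.random.default_rng(0)
    boot_stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(old_data), size=len(new_data))
        boot_stats.append(model_loss(old_data[idx]).mean())
    threshold = np.quantile(boot_stats, 1 - alpha)

    return new_stat > threshold  # True => treat the insertions as OOD
```

If the test does not reject, the model is left untouched and retraining costs are avoided; only upon rejection does an update become necessary.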
Sources of Difficulty and Baselines. In the presence of OOD data, a simple solution is adopted by some learned DB components, like Naru [78], NeuroCard [77], and DBest++ [42], and even by the aforementioned transfer/few-shot learning methods [20, 74]: to "fine-tune" the original model M on the new data. Alas, this is problematic. For instance, while a DBest++ model on the "Forest" dataset has a 95th percentile q-error of 2, updating it with an OOD sample using fine-tuning increases the 95th percentile q-error to 63. A similar accuracy drop occurs for other key models as well; [70] showcases this for learned CE works. This drastic drop in accuracy is due to the fundamental problem of catastrophic forgetting [46], where retraining a previously learned model on new tasks (i.e., new data) causes the model to lose the knowledge it had acquired about old data. To avoid catastrophic forgetting, Naru and DBest++ suggest using a smaller learning rate while fine-tuning with the new data. This, however, causes another fundamental problem, namely intransigence [6], whereby the model resists fitting to new data, rendering queries on new data inaccurate.
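To make this forgetting/intransigence tension concrete, the sketch below shows a generic distillation-style update in PyTorch: the updated model fits the new data while a second term keeps its outputs close to those of the frozen old model on old-data samples. The function name distill_update, the MSE losses, and the weighting parameter lam are illustrative assumptions; this is a minimal sketch of the general technique, not DDUp's exact formulation.

```python
import torch
import torch.nn.functional as F

def distill_update(new_model, old_model, new_loader, old_loader,
                   epochs=5, lam=0.5, lr=1e-3):
    """Generic distillation-style update (illustrative, not DDUp's exact loss)."""
    old_model.eval()  # frozen teacher holding the old-data knowledge
    opt = torch.optim.Adam(new_model.parameters(), lr=lr)

    for _ in range(epochs):
        for (x_new, y_new), (x_old, _) in zip(new_loader, old_loader):
            # Task loss on new data: combats intransigence.
            loss_new = F.mse_loss(new_model(x_new), y_new)

            # Distillation loss on old data: combats catastrophic
            # forgetting by matching the teacher's outputs.
            with torch.no_grad():
                teacher_out = old_model(x_old)
            loss_old = F.mse_loss(new_model(x_old), teacher_out)

            loss = lam * loss_new + (1 - lam) * loss_old
            opt.zero_grad()
            loss.backward()
            opt.step()
    return new_model
```

Pushing lam toward 1 recovers plain fine-tuning (risking forgetting), while pushing it toward 0 recovers intransigence; the weighting makes the trade-off explicit rather than leaving it to the learning rate.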
Another simple solution to avoid these problems would be to aggregate the old data and new data and retrain the model from scratch. However, as mentioned, this is undesirable in our environment. As a concrete example, training Naru/NeuroCard on the