Structural Kernel Search via Bayesian Optimization
and Symbolical Optimal Transport
Matthias Bitzer Mona Meister Christoph Zimmer
Bosch Center for Artificial Intelligence, Renningen, Germany
{matthias.bitzer3,mona.meister,christoph.zimmer}@de.bosch.com
Abstract
Despite recent advances in automated machine learning, model selection is still
a complex and computationally intensive process. For Gaussian processes (GPs),
selecting the kernel is a crucial task, often done manually by the expert. Addition-
ally, evaluating the model selection criteria for Gaussian processes typically scales
cubically in the sample size, rendering kernel search particularly computationally
expensive. We propose a novel, efficient search method through a general, struc-
tured kernel space. Previous methods solved this task via Bayesian optimization
and relied on measuring the distance between GP’s directly in function space
to construct a kernel-kernel. We present an alternative approach by defining a
kernel-kernel over the symbolic representation of the statistical hypothesis that is
associated with a kernel. We empirically show that this leads to a computationally
more efficient way of searching through a discrete kernel space.
1 Introduction
In many real-world applications of machine learning, tuning the hyperparameters or selecting the machine learning method itself is a crucial part of the workflow. It is often done by experts and data scientists. However, the number of possible methods and models is constantly growing, and it is becoming increasingly important to automatically select the right model for the task at hand.
Bayesian optimization (BO) is a prominent method that can be used for model selection and hyperparameter tuning. It can handle black-box oracles with expensive function evaluations, which are two characteristics often encountered when doing model selection [17]. Important applications in this context are choosing the hyperparameters in the training process of deep neural networks (DNN) [17], or dealing with discrete and structured problems like choosing the architecture of DNN's [5, 11].
Gaussian processes (GP) are another important model-class. They are often utilized as surrogate models in BO [17], for time-series and statistical modeling [7, 3] or in active learning loops [23, 15].
The properties of GP's are mainly governed by their kernel, which specifies the assumptions made on the underlying function. Choosing the right kernel is therefore a crucial part of applying GP's and is often done by the expert. Recent work [9] treated kernel selection as a black-box optimization problem and used Bayesian optimization to solve it. This allowed searching over a highly structured, discrete space of kernels. However, their proposed kernel-kernel measures the distance between two GP's directly in function space, which is a computationally expensive task itself. This makes the method difficult to apply in the frequent scenarios where the evaluation of the model selection criteria requires only a medium amount of time.
We propose measuring the distance between two kernels via the symbolical representation of their associated statistical hypothesis. We utilize the highly general kernel-grammar, presented in [3], as underlying kernel space, where each kernel is built from base kernels and operators, like e.g.

LIN + ((SE × PER) + SE)
forming effectively a description of the statistical hypothesis that is modeled by the GP. Our main idea
is to build a distance over these symbolical descriptions, rather than measuring the distance between
two GP’s directly in function space. We employ optimal transport principles, known from neural-
architecture search (see [11], [5]), to build a pseudo-distance between two hypothesis descriptions and use it to construct a kernel-kernel, which is subsequently utilized in the BO loop. We will show that the induced kernel search method is more efficient, in terms of the number of function evaluations and computational time, compared to alternative kernel search methods over discrete search spaces.
The main challenge we encountered is the quantification of dissimilarity between two symbolical
representations of kernels. We use the tree representation of each symbolical description and apply
optimal transport distances over tree features. Subsequently, we empirically show that the deduced
GP over GP’s provides a well-behaved meta-model and show its advantages for kernel search. In
summary, our contributions are:
1. We construct a pseudo-distance over GP's that acts on the symbolical representations of the underlying statistical hypotheses.
2. We use the pseudo-distance to construct a novel "kernel-kernel" and embed it in a BO loop to do model selection for GP's.
3. We empirically show that our meta-GP model is well-behaved and that we outperform previous methods in kernel search over discrete kernel spaces.
Related Work:
There is considerable existing work on constructing flexible kernels and learning their hyperparameters [1, 20, 21, 18, 22]. In these kinds of works, the structural form of the kernel is predefined and the free parameters of the kernel are optimized, often via marginal likelihood maximization. While the hyperparameters can thus be found efficiently due to the differentiability of the marginal likelihood, one still needs to predefine the structural form of the kernel in the first place. This is a hard task, as one needs to decide whether long-range correlations or nonstationarity should be considered, or whether dimensions should be ignored. Our method is built to automatically select the structure of the kernel. Some of the mentioned methods [20, 21] are able to approximate any stationary kernel via Bochner's Theorem and therefore consider a broad kernel space themselves. However, even elementary statistical hypotheses require nonstationary kernels, such as linear trends modeled by the linear kernel. Our search space is not restricted to stationary kernels.
We consider a highly general, discrete kernel space that is induced by the kernel grammar [3]. Recent work [9] used BO to search through this space via a kernel that measures similarity in function space. We also utilize BO but employ a fundamentally different principle of measuring the distance, which is computationally more efficient. We dedicate Section 3.2 to a more precise comparison to the method of [9]. Additionally, the original kernel grammar paper [3] suggested greedy search for searching through the kernel grammar. Furthermore, [4] employed a genetic algorithm based on cross-over mutations. We empirically compare against all approaches.
In the area of BO over structured spaces, our method is most similar to the neural-architecture search (NAS) procedures presented in [11, 5], which use optimal transport distances over features extracted from the graph-representation of neural networks. Compared to these methods, we use OT principles to do model selection for GP's, which is a fundamentally different task.
2 Background and Set-up
Our main task is efficient model selection for Gaussian processes. In order to provide background information, we give a brief introduction to GP's and to model selection for GP's. Subsequently, we present a review of the kernel-grammar [3] and show how we use it for our approach. We show how a kernel can be represented via a symbolical description. In Section 3, we will present how we use the symbolical description to construct a kernel over kernels and how we use it in the BO loop.
2.1 Gaussian Processes
A Gaussian process is a distribution over functions $f: \mathcal{X} \to \mathbb{R}$ over a given input space $\mathcal{X}$ which is fully characterized via the covariance/kernel function $k(x, x') = \mathrm{Cov}(f(x), f(x'))$ and the mean function $\mu(x) := \mathbb{E}[f(x)]$. We therefore can write as shorthand notation $f \sim \mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$. The
kernel can be interpreted as a similarity measure between two elements of the input space, with
the GP assigning higher correlations to function values whose inputs are more similar according to
the kernel. Furthermore, the kernel governs the main assumptions on the modeled function such as
smoothness, periodicity or long-range correlations and therefore provides the inductive-bias of the
GP. An important property of GP's is that they are not restricted to Euclidean input spaces $\mathcal{X} \subseteq \mathbb{R}^d$,
but can also be defined on highly structured spaces like trees and graphs, a property we will later use
to define a GP over GP’s.
While our method might also be used for model selection in classification, we consider from now on Gaussian process regression. For regression, a dataset $\mathcal{D} = (\mathbf{X}, \mathbf{y})$ with $\mathbf{X} = \{x_1, \dots, x_N\} \subset \mathcal{X}$ and $\mathbf{y} = (y_1, \dots, y_N)^\top \in \mathbb{R}^N$ is given, where we suppose that $f \sim \mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$ and $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. Given the observed data $\mathcal{D}$, the posterior distribution $f \mid \mathcal{D}$ is again a GP with mean and covariance functions
$$\mu_{\mathcal{D}}(x) = \mu(x) + k(x)^\top (K + \sigma^2 I)^{-1} (\mathbf{y} - \mu(\mathbf{X})), \qquad k_{\mathcal{D}}(x, y) = k(x, y) - k(x)^\top (K + \sigma^2 I)^{-1} k(y)$$
with $K = [k(x_m, x_l)]_{m,l=1}^N$ and $k(x) = [k(x, x_1), \dots, k(x, x_N)]^\top$ (see [13]). Probabilistic predictions can be done via the resulting predictive distribution $p(f \mid x, \mathcal{D}) = \mathcal{N}(\mu_{\mathcal{D}}(x), k_{\mathcal{D}}(x, x))$.
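To make the posterior formulas concrete, the following is a minimal NumPy sketch of the posterior mean and covariance for a zero-mean GP; the SE kernel, its hyperparameter values and the toy data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential (SE) kernel matrix between two sets of 1-D inputs."""
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.01):
    """Posterior mean mu_D and covariance k_D of a zero-mean GP at the test inputs."""
    K = se_kernel(X_train, X_train)                    # K = [k(x_m, x_l)]
    K_s = se_kernel(X_train, X_test)                   # columns are k(x) for each test point
    K_ss = se_kernel(X_test, X_test)
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X_train)))   # (K + sigma^2 I)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))      # (K + sigma^2 I)^-1 y
    mean = K_s.T @ alpha                               # mu_D(x) for a zero prior mean
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                               # k_D(x, x')
    return mean, cov

# toy usage
rng = np.random.default_rng(0)
X_train = np.linspace(0, 5, 20)
y_train = np.sin(X_train) + 0.1 * rng.standard_normal(20)
X_test = np.linspace(0, 5, 50)
mean, cov = gp_posterior(X_train, y_train, X_test)
```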
2.2 Model selection for GP’s
Typically, the kernel $k_\theta$ comes with a set of parameters $\theta$ that can be learned via maximization of the marginal likelihood $p(\mathbf{y} \mid \mathbf{X}, \theta, \sigma^2) = \mathcal{N}(\mathbf{y}; \mu(\mathbf{X}), k_\theta(\mathbf{X}, \mathbf{X}) + \sigma^2 I)$ or via maximum a posteriori (MAP) estimation, in case the parameters $\theta$ are equipped with a prior $p(\theta)$. This procedure is sometimes also called model selection, as one selects the hyperparameters of the kernel given the data (see [13]). However, we consider selecting the structural form of the kernel itself. The structural form of the kernel determines the statistical hypothesis that is assumed to be true for the data-generating process. Intuitively, the kernel is similar to the architecture in deep neural networks, which induces an inductive bias.
Our goal is to do model selection over a discrete, possibly infinite space of kernels $\mathcal{K} := \{k_1, k_2, \dots\}$. As each kernel comes with its own parameters, we are actually dealing with a space of kernel families. Thus, when mentioning a kernel $k$, we associate it with its whole family over parameters $\{k_\theta \mid \theta \in \Theta\}$. Once a kernel is selected, predictions are done with learned kernel parameters (which usually are a by-product of calculating the model selection criteria). The parameters $\theta$ are potentially equipped with a prior $p(\theta)$, depending on the selection criteria. As mean function, we always utilize the zero mean function $\mu(x) := 0$ in case $\mathbf{y}$ is centered and a constant mean function otherwise. This is a common choice in GP regression. Given some model selection criteria $g: \mathcal{K} \to \mathbb{R}$, our task is solving $k^* = \arg\max_{k \in \mathcal{K}} g(k \mid \mathcal{D})$. While our method is not restricted to a specific model selection criteria, we focus on the model evidence $p(\mathbf{y} \mid \mathbf{X}, k)$, which is a well-known selection criteria for probabilistic models (see [9, 14, 8]). Given a prior on the kernel parameters $p(\theta)$ and on the likelihood variance $p(\sigma^2)$, the log-model evidence of the marginalized GP is given as
$$g(k \mid \mathcal{D}) = \log p(\mathbf{y} \mid \mathbf{X}, k) = \log \int p(\mathbf{y} \mid \mathbf{X}, \theta, \sigma^2, k)\, p(\sigma^2)\, p(\theta \mid k)\, d\theta\, d\sigma^2 .$$
This quantity can be approximated via a Laplace approximation of $p(\theta, \sigma^2 \mid \mathcal{D})$ (see Appendix A for details). Computing this approximation includes performing a MAP estimation of the GP parameters. Thus, once the log evidence has been computed, learned kernel hyperparameters $\theta_{\mathrm{MAP}}$ are automatically provided. Performing the MAP estimation scales cubically in the dataset size $N$, which renders model selection for GP's computationally intense.
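As an illustration of this selection criterion, the sketch below computes a Laplace approximation of the log evidence for one fixed kernel structure: it finds the MAP of the log hyperparameters under assumed standard-normal priors on the log parameters and adds the Gaussian curvature correction. The SE kernel, the priors and the finite-difference Hessian are simplifying assumptions of this sketch; the paper's exact procedure is given in its Appendix A.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(X, lengthscale, variance):
    sqdist = (X[:, None] - X[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def log_marginal_likelihood(X, y, lengthscale, variance, noise_var):
    """log p(y | X, theta, sigma^2) for a zero-mean GP with an SE kernel."""
    K = se_kernel(X, lengthscale, variance) + (noise_var + 1e-8) * np.eye(len(X))  # jitter for stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)

def neg_log_joint(log_params, X, y):
    """-[log p(y | X, theta, sigma^2) + log p(theta) + log p(sigma^2)],
    with standard-normal priors on the log parameters (an assumption of this sketch)."""
    lengthscale, variance, noise_var = np.exp(log_params)
    log_prior = -0.5 * np.sum(log_params ** 2) - 0.5 * len(log_params) * np.log(2 * np.pi)
    return -(log_marginal_likelihood(X, y, lengthscale, variance, noise_var) + log_prior)

def numerical_hessian(f, x, eps=1e-4):
    """Central-difference Hessian; sufficient for this low-dimensional sketch."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d); e_i[i] = eps
            e_j = np.zeros(d); e_j[j] = eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps ** 2)
    return H

def laplace_log_evidence(X, y, log_params0=np.zeros(3)):
    """Laplace approximation of log p(y | X, k) for the fixed SE-kernel structure."""
    res = minimize(neg_log_joint, log_params0, args=(X, y), method="L-BFGS-B")
    H = numerical_hessian(lambda p: neg_log_joint(p, X, y), res.x)
    _, logdet = np.linalg.slogdet(H)
    # log Z ~= log p(y, theta_MAP) + d/2 * log(2*pi) - 1/2 * log det H
    log_evidence = -res.fun + 0.5 * len(res.x) * np.log(2 * np.pi) - 0.5 * logdet
    return log_evidence, np.exp(res.x)   # MAP hyperparameters come for free

# toy usage: evidence of the SE-kernel hypothesis on a small data set
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
log_ev, theta_map = laplace_log_evidence(X, y)
print(log_ev, theta_map)
```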
2.3 Kernel Grammar
The kernel grammar introduced by [3] specifies a highly general search space over kernels. The grammar is based on the observation that kernels are closed under addition and multiplication, meaning that for kernels $k_1(x, x')$ and $k_2(x, x')$ also $k_1(x, x') + k_2(x, x')$ and $k_1(x, x') \times k_2(x, x')$ are kernels. Given some base kernels, such as the squared exponential kernel SE, the linear kernel LIN, the periodic kernel PER or the rational quadratic kernel RQ, different statistical hypotheses can be generated via addition and multiplication. For example,

LIN + PER × SE
describes a linear trend with a locally periodic component, such that it might be a useful hypothesis for time-series applications. For multidimensional data the base kernels are applied to single dimensions, denoted e.g. with $\mathrm{SE}_i$ for the squared exponential kernel defined on dimension $i$. The grammar contains many popular hypotheses, such as the ARD-RBF kernel that can be expressed via multiplication over the dimensions $\prod_{i=1}^d \mathrm{SE}_i$, as well as additive models $\sum_{i=1}^d \mathrm{SE}_i$ or polynomials of order $m$ with $\prod_{j=1}^m \mathrm{LIN}$.
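As a concrete illustration of such composed hypotheses, the sketch below builds LIN + PER × SE with scikit-learn's kernel classes (DotProduct for LIN, ExpSineSquared for PER, RBF for SE), which support + and × directly; the library choice, the toy data and the default hyperparameters are illustrative assumptions only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, ExpSineSquared, RBF

# LIN + PER * SE: a linear trend plus a locally periodic component
kernel = DotProduct() + ExpSineSquared() * RBF()

X = np.linspace(0, 10, 80).reshape(-1, 1)
y = 0.3 * X.ravel() + np.sin(2 * np.pi * X.ravel()) * np.exp(-0.1 * (X.ravel() - 5) ** 2)

gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, normalize_y=True)
gp.fit(X, y)          # kernel hyperparameters are fitted by marginal-likelihood maximization
print(gp.kernel_)     # the fitted composite kernel
```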
The specification of the complete search space is given as a grammar, where a base kernel is denoted as $\mathcal{B}$ and a subexpression is denoted with $\mathcal{S}$. For example, in the expression LIN + (PER × SE) the expression PER × SE is a subexpression. Starting with all base kernels, the search space is defined as all kernels that can be reached via the following operations:

1. Add a base kernel to a subexpression: $\mathcal{S} \to \mathcal{S} + \mathcal{B}$
2. Multiply a subexpression with a base kernel: $\mathcal{S} \to \mathcal{S} \times \mathcal{B}$
3. Exchange a base kernel with another base kernel: $\mathcal{B} \to \mathcal{B}'$

Starting from the set of base kernels, these operations can lead to all algebraic expressions, showing the expressiveness of the grammar [3]. The authors of [3] suggest using greedy search to search through the grammar, where the kernel with the highest value for the selection criteria is selected and expanded with all possible operations. After the selection criteria is calculated on the neighbors, the search progresses to the next stage by expanding the neighbors of the best kernel found.
We consider a generalized notion of the kernel grammar, where we consider a set of base kernels $\{\mathcal{B}_1, \dots, \mathcal{B}_r\}$, $r \in \mathbb{N}$, and a set of operators $\{T_1, \dots, T_l\}$, $l \in \mathbb{N}$, where $T_j: \mathcal{K} \times \mathcal{K} \to \mathcal{K}$, $j = 1, \dots, l$, are closed operators on the space of all kernel functions $\mathcal{K}$. Examples are addition and multiplication, but also the change-point operator which was considered in [7]. Thus, in general the grammar operations are

1. Apply operator $T_j$ onto a subexpression and a base kernel¹: $\mathcal{S} \to T_j(\mathcal{S}, \mathcal{B})$
2. Exchange a base kernel with another base kernel: $\mathcal{B} \to \mathcal{B}'$

The considered kernel space can be defined precisely in the following way:

Definition 1. For $r, l \in \mathbb{N}$, let $\{k_1, \dots, k_r\}$ be a set of (base-) kernels (symbolically represented as $\{\mathcal{B}_1, \dots, \mathcal{B}_r\}$) and $\{T_1, \dots, T_l\}$ a set of operators with $T_j: \mathcal{K} \times \mathcal{K} \to \mathcal{K}$, $j = 1, \dots, l$. Let
$$\mathcal{L}_0 := \{k_1, \dots, k_r\}, \qquad \mathcal{L}_i := \{T_j(k_1, k_2) \mid k_1, k_2 \in \mathcal{L}_{i-1},\ j = 1, \dots, l\} \cup \mathcal{L}_{i-1}$$
for $i = 1, \dots, M$. We call $\tilde{\mathcal{K}} := \mathcal{L}_M$ the kernel-grammar generated kernel space with depth $M$.
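To illustrate the structure of this search space, the following sketch enumerates the levels of Definition 1 on symbolic strings, assuming the base kernels {SE, LIN, PER, RQ} and the operators addition and multiplication; it is a toy enumeration only, not the expression-tree representation used later.

```python
from itertools import product

BASE_KERNELS = {"SE", "LIN", "PER", "RQ"}   # assumed base kernels B_1, ..., B_r
OPERATORS = {"+", "*"}                      # assumed closed operators T_1, ..., T_l

def next_level(level):
    """L_i = {T_j(k_1, k_2) | k_1, k_2 in L_{i-1}, j = 1, ..., l}  union  L_{i-1}."""
    new = set(level)
    for k1, k2 in product(level, repeat=2):
        for op in OPERATORS:
            new.add(f"({k1} {op} {k2})")
    return new

def kernel_space(depth):
    """The kernel-grammar generated kernel space with depth M, as symbolic expressions."""
    level = set(BASE_KERNELS)   # L_0
    for _ in range(depth):
        level = next_level(level)
    return level

print(len(kernel_space(1)))                        # 36 expressions already at depth M = 1
print("((LIN + PER) * SE)" in kernel_space(2))     # True: reachable at depth M = 2
```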
2.4 Representation of Kernels
The base kernels and their combination via the operators describe the structural assumptions of the
final kernel and imply the assumptions that are made in function space. Our main hypothesis is that
the symbolical representation of the kernel, given by the subexpressions $\mathcal{S}$ and base kernels, already contains sufficient information for model selection. Therefore, we will define a kernel-kernel over the symbolical representations and utilize it for Bayesian optimization. Concretely, each kernel $k \in \tilde{\mathcal{K}}$ in our described space can be written as a tree $T$, for example:

LIN + ((SE + PER) × SE)

ADD
 ├── LIN
 └── MULT
      ├── ADD
      │    ├── SE
      │    └── PER
      └── SE
¹ In case the operator $T_j$ is not symmetric, we also add the operation $\mathcal{S} \to T_j(\mathcal{B}, \mathcal{S})$. However, all operators considered in this work are symmetric.
Each operator $T_j$ and each base kernel $\mathcal{B}_i$ is represented by its respective name, where operators are the inner nodes of the tree and base kernels are the leaves. How the operators and base kernels are connected is represented through the tree structure. For a given expression tree $T$, we denote the multiset of all subexpressions/subtrees as Subtrees($T$). Furthermore, we consider paths to the leaves of the tree; for example, for the tree above one path to a leaf would be:

ADD → MULT → ADD → PER

We denote the multiset of all paths in the tree $T$ as Paths($T$). Lastly, we also consider the multiset of all base kernels that exist in a given expression tree as Base($T$). For each described multiset, we denote the number of occurrences of element $E$ in the multiset as $n(E)$. When building the multisets we also account for two symmetries in the elements that can be applied if an operator is associative and commutative, which we elaborate further in Appendix A.
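To make these multisets concrete, the sketch below represents an expression tree as nested tuples and extracts Base(T), Paths(T) and Subtrees(T) for the example tree of LIN + ((SE + PER) × SE); the tuple encoding and the string keys are illustrative choices, and the symmetry handling of Appendix A is omitted.

```python
from collections import Counter

# Expression tree for LIN + ((SE + PER) * SE): operators are inner nodes, base kernels are leaves.
TREE = ("ADD", "LIN", ("MULT", ("ADD", "SE", "PER"), "SE"))

def encode(tree):
    """String encoding of a (sub)tree, used as multiset element."""
    if isinstance(tree, str):
        return tree
    return "(" + tree[0] + " " + " ".join(encode(child) for child in tree[1:]) + ")"

def base(tree):
    """Multiset Base(T): all base kernels occurring in the tree."""
    if isinstance(tree, str):
        return Counter([tree])
    return sum((base(child) for child in tree[1:]), Counter())

def paths(tree, prefix=()):
    """Multiset Paths(T): all root-to-leaf paths, e.g. ADD -> MULT -> ADD -> PER."""
    if isinstance(tree, str):
        return Counter([" -> ".join(prefix + (tree,))])
    out = Counter()
    for child in tree[1:]:
        out += paths(child, prefix + (tree[0],))
    return out

def subtrees(tree):
    """Multiset Subtrees(T): every subexpression/subtree of the tree."""
    out = Counter([encode(tree)])
    if not isinstance(tree, str):
        for child in tree[1:]:
            out += subtrees(child)
    return out

print(base(TREE))      # Counter({'SE': 2, 'LIN': 1, 'PER': 1})
print(paths(TREE))     # contains e.g. 'ADD -> MULT -> ADD -> PER' with count n(E) = 1
print(subtrees(TREE))  # contains e.g. '(ADD SE PER)' with count 1 and 'SE' with count 2
```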
Depending on the operators that are used, several trees can describe the same kernel $k \in \tilde{\mathcal{K}}$. For technical reasons, we denote with $f: \tilde{\mathcal{K}} \mapsto \mathcal{T}(\tilde{\mathcal{K}})$ a mapping that maps a given kernel $k \in \tilde{\mathcal{K}}$ to one tree $T$ that induces this kernel, where $\mathcal{T}(\tilde{\mathcal{K}})$ denotes the set of all expression trees that can generate a kernel in $\tilde{\mathcal{K}}$ (see details in Appendix A).
3 Kernel-Kernel
Our kernel-kernel will be defined via a pseudo-metric over the expression trees. Optimal transport (OT) principles have proven themselves to be effective in BO methods over structured spaces, as shown in [5] and [11] for neural architecture search or in [6] for BO over molecule structures. We follow this line of work and also rely on optimal transport, although we only use a simple ground metric to allow for closed-form computations. To allow OT metrics to be used, we summarize each expression tree $T$ into discrete probability distributions of its building blocks Base($T$), Paths($T$) and Subtrees($T$) via
$$\omega_{\mathrm{base}} := \sum_{E \in \mathrm{Base}(T)} \omega_E \delta_E, \qquad \omega_{\mathrm{paths}} := \sum_{E \in \mathrm{Paths}(T)} \omega_E \delta_E, \qquad \omega_{\mathrm{subtrees}} := \sum_{E \in \mathrm{Subtrees}(T)} \omega_E \delta_E,$$
where $\delta_E$ is the Dirac delta and $\omega_E$ is the frequency of the element $E$ in the respective multiset. For example, the frequency of the expression SE × PER in the multiset Subtrees($T$) is calculated as
$$\omega_{\mathrm{SE} \times \mathrm{PER}} = \frac{n(\mathrm{SE} \times \mathrm{PER})}{|\mathrm{Subtrees}(T)|}.$$
Each probability distribution represents a different modeling aspect that is induced by a kernel $k \in \tilde{\mathcal{K}}$ and its corresponding expression tree $T$:
1. $\omega_{\mathrm{base}}$ specifies which base kernels are present in $T$, thus, which base assumptions in function space are included, such as periodicity, linearity or local smoothness.
2. $\omega_{\mathrm{paths}}$ specifies how the base kernels in $T$ are used, e.g. whether a periodic component is applied additively or multiplicatively.
3. $\omega_{\mathrm{subtrees}}$ specifies the interaction between the base kernels in $T$, for example whether $T$ contains an addition of a linear and a periodic component or not.
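The discrete distributions above are just normalized multiset frequencies; a minimal sketch, shown here for Base(T) and equally applicable to Paths(T) and Subtrees(T):

```python
from collections import Counter

def frequencies(multiset):
    """omega_E = n(E) / |multiset| for a multiset given as a Counter."""
    total = sum(multiset.values())
    return {element: count / total for element, count in multiset.items()}

# Base(T) for the tree of LIN + ((SE + PER) * SE)
omega_base = frequencies(Counter({"SE": 2, "LIN": 1, "PER": 1}))
print(omega_base)   # {'SE': 0.5, 'LIN': 0.25, 'PER': 0.25}
```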
Our pseudo-metric between kernels $k_1$ and $k_2$ (and their associated trees $T_1$ and $T_2$) uses all three modeling aspects via the optimal transport distances between the respective discrete probability distributions $\omega_{\mathrm{base}}$, $\omega_{\mathrm{paths}}$ and $\omega_{\mathrm{subtrees}}$.
In general, the OT distance with ground metric $\tilde{d}$ between two discrete probability distributions $\omega_1 = \sum_E \omega_{1,E}\, \delta_E$ and $\omega_2 = \sum_E \omega_{2,E}\, \delta_E$ over $\Omega$ is defined as
$$W_{\tilde{d}}(\omega_1, \omega_2) = \inf_{\pi \in \mathcal{R}(\omega_1, \omega_2)} \int_{\Omega \times \Omega} \tilde{d}(E, E')\, \pi(dE, dE'), \qquad (1)$$
where $\pi \in \mathcal{R}(\omega_1, \omega_2)$ is a combined probability distribution over $\Omega \times \Omega$ with marginal distributions $\pi(A \times \Omega) = \omega_1(A)$ and $\pi(\Omega \times B) = \omega_2(B)$ for Borel sets $A, B$
. While for general ground metric
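For a simple illustration of (1), assume the 0-1 indicator ground metric $\tilde{d}(E, E') = 1_{E \neq E'}$ (an assumption for this sketch, not necessarily the ground metric used here); in that case the OT distance between two discrete distributions reduces to the total variation distance $\frac{1}{2}\sum_E |\omega_{1,E} - \omega_{2,E}|$:

```python
def ot_distance_indicator(omega1, omega2):
    """W_d(omega1, omega2) in (1) for the 0-1 ground metric: the total variation distance."""
    support = set(omega1) | set(omega2)
    return 0.5 * sum(abs(omega1.get(E, 0.0) - omega2.get(E, 0.0)) for E in support)

# omega_base of two hypothetical expression trees
omega_base_1 = {"SE": 0.5, "LIN": 0.25, "PER": 0.25}   # e.g. LIN + ((SE + PER) * SE)
omega_base_2 = {"SE": 0.5, "PER": 0.5}                  # e.g. SE * PER
print(ot_distance_indicator(omega_base_1, omega_base_2))   # 0.25
```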