Structural Kernel Search via Bayesian Optimization
and Symbolical Optimal Transport
Matthias Bitzer Mona Meister Christoph Zimmer
Bosch Center for Artificial Intelligence, Renningen, Germany
{matthias.bitzer3,mona.meister,christoph.zimmer}@de.bosch.com
Abstract
Despite recent advances in automated machine learning, model selection is still
a complex and computationally intensive process. For Gaussian processes (GPs),
selecting the kernel is a crucial task, often done manually by the expert. Addition-
ally, evaluating the model selection criteria for Gaussian processes typically scales
cubically in the sample size, rendering kernel search particularly computationally
expensive. We propose a novel, efficient search method through a general, struc-
tured kernel space. Previous methods solved this task via Bayesian optimization
and relied on measuring the distance between GP’s directly in function space
to construct a kernel-kernel. We present an alternative approach by defining a
kernel-kernel over the symbolic representation of the statistical hypothesis that is
associated with a kernel. We empirically show that this leads to a computationally
more efficient way of searching through a discrete kernel space.
1 Introduction
In many real-world applications of machine learning, tuning the hyperparameters or selecting the machine learning method itself is a crucial part of the workflow. It is often done by experts and data scientists. However, the number of possible methods and models is constantly growing, and it is becoming increasingly important to automatically select the right model for the task at hand.
Bayesian optimization (BO) is a prominent method that can be used for model selection and hyperparameter tuning. It can handle black-box oracles with expensive function evaluations, which are two characteristics often encountered when doing model selection [17]. Important applications in this context are choosing the hyperparameters in the training process of deep neural networks (DNN) [17], or dealing with discrete and structured problems like choosing the architecture of DNN's [5, 11].
Gaussian processes (GP) are another important model-class. They are often utilized as surrogate models in BO [17], for time-series and statistical modeling [7, 3] or in active learning loops [23, 15].
The properties of GP's are mainly governed by their kernel, which specifies the assumptions made on the underlying function. Choosing the right kernel is therefore a crucial part of applying GP's and is often done by the expert. Recent work [9] treated kernel selection as a black-box optimization problem and used Bayesian optimization to solve it. This allowed searching over a highly structured, discrete space of kernels. However, their proposed kernel-kernel measures the distance between two GP's directly in function space, which is a computationally expensive task itself. This makes the method difficult to apply in the frequent scenarios where the evaluation of the model selection criteria requires only a medium amount of time.
We propose measuring the distance between two kernels via the symbolical representation of their associated statistical hypothesis. We utilize the highly general kernel-grammar, presented in [3], as underlying kernel space, where each kernel is built from base kernels and operators, like e.g.

LIN + ((SE × PER) + SE)
forming effectively a description of the statistical hypothesis that is modeled by the GP. Our main idea
is to build a distance over these symbolical descriptions, rather than measuring the distance between
two GP’s directly in function space. We employ optimal transport principles, known from neural-
architecture search (see [11], [5]), to build a pseudo-distance between two hypothesis descriptions and use it to construct a kernel-kernel, which is subsequently utilized in the BO loop. We will show that the induced kernel search method is more efficient, in terms of the number of function evaluations and computational time, compared to alternative kernel search methods over discrete search spaces.
The main challenge we encountered is the quantification of dissimilarity between two symbolical
representations of kernels. We use the tree representation of each symbolical description and apply
optimal transport distances over tree features. Subsequently, we empirically show that the deduced
GP over GP’s provides a well-behaved meta-model and show its advantages for kernel search. In
summary, our contributions are:
1. We construct a pseudo-distance over GP's that acts on the symbolical representations of the underlying statistical hypotheses.
2. We use the pseudo-distance to construct a novel "kernel-kernel" and embed it in a BO loop to do model selection for GP's.
3. We empirically show that our meta-GP model is well-behaved and that we outperform previous methods in kernel search over discrete kernel spaces.
Related Work:
There is considerable existing work on constructing flexible kernels and learning their hyperparameters [1, 20, 21, 18, 22]. In these kinds of works, the structural form of the kernel is predefined and the free parameters of the kernel are optimized, often via marginal likelihood maximization. While the hyperparameters can thus be found efficiently due to the differentiability of the marginal likelihood, one still needs to predefine the structural form of the kernel in the first place. This is a hard task, as one needs to decide whether long-range correlations or nonstationarity should be considered, or whether dimensions should be ignored. Our method is built to automatically select the structure of the kernel. Some of the mentioned methods [20, 21] are able to approximate any stationary kernel via Bochner's Theorem and therefore consider a broad kernel space themselves. However, even elementary statistical hypotheses require nonstationary kernels, such as linear trends modeled by the linear kernel. Our search space is not restricted to stationary kernels.
We consider a highly general, discrete kernel space that is induced by the kernel grammar [3]. Recent work [9] used BO to search through this space via a kernel that measures similarity in function space. We also utilize BO but employ a fundamentally different principle of measuring the distance, which is computationally more efficient. We dedicate Section 3.2 to a more precise comparison to the method of [9]. Additionally, the original kernel grammar paper [3] suggested greedy search for searching through the kernel grammar. Furthermore, [4] employed a genetic algorithm based on cross-over mutations. We empirically compare against all approaches.
In the area of BO over structured spaces, our method is most similar to the neural-architecture search (NAS) procedures presented in [11, 5], which use optimal transport distances over features extracted from the graph-representation of neural networks. Compared to these methods, we use OT principles to do model selection for GP's, which is a fundamentally different task.
2 Background and Set-up
Our main task is efficient model selection for Gaussian processes. In order to provide background information, we give a brief introduction to GP's and to model selection for GP's. Subsequently, we present a review of the kernel-grammar [3] and show how we use it for our approach. We show how a kernel can be represented via a symbolical description. In Section 3, we will present how we use the symbolical description to construct a kernel over kernels and how we use it in the BO loop.
2.1 Gaussian Processes
A Gaussian process is a distribution over functions $f: \mathcal{X} \to \mathbb{R}$ over a given input space $\mathcal{X}$ which is fully characterized via the covariance/kernel function $k(x, x') = \mathrm{Cov}(f(x), f(x'))$ and the mean function $\mu(x) := \mathbb{E}[f(x)]$. We therefore can write as shorthand notation $f \sim \mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$. The
kernel can be interpreted as a similarity measure between two elements of the input space, with
the GP assigning higher correlations to function values whose inputs are more similar according to
the kernel. Furthermore, the kernel governs the main assumptions on the modeled function such as
smoothness, periodicity or long-range correlations and therefore provides the inductive-bias of the
GP. An important property of GP's is that they are not restricted to Euclidean input spaces $\mathcal{X} \subseteq \mathbb{R}^d$,
but can also be defined on highly structured spaces like trees and graphs, a property we will later use
to define a GP over GP’s.
While our method might also be used for model selection in classification, we consider from now on Gaussian process regression. For regression, a dataset $\mathcal{D} = (\mathbf{X}, \mathbf{y})$ with $\mathbf{X} = \{x_1, \dots, x_N\} \subset \mathcal{X}$ and $\mathbf{y} = (y_1, \dots, y_N)^\top \in \mathbb{R}^N$ is given, where we suppose that $f \sim \mathcal{GP}(\mu(\cdot), k(\cdot, \cdot))$ and $y_i = f(x_i) + \epsilon_i$ with $\epsilon_i \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^2)$. Given the observed data $\mathcal{D}$, the posterior distribution $f \mid \mathcal{D}$ is again a GP with mean and covariance functions
$$\mu_{\mathcal{D}}(x) = \mu(x) + k(x)^\top (K + \sigma^2 I)^{-1} (\mathbf{y} - \mu(\mathbf{X})), \qquad k_{\mathcal{D}}(x, y) = k(x, y) - k(x)^\top (K + \sigma^2 I)^{-1} k(y)$$
with $K = [k(x_m, x_l)]_{m,l=1}^N$ and $k(x) = [k(x, x_1), \dots, k(x, x_N)]^\top$ (see [13]). Probabilistic predictions can be done via the resulting predictive distribution $p(f \mid x, \mathcal{D}) = \mathcal{N}(\mu_{\mathcal{D}}(x), k_{\mathcal{D}}(x, x))$.
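To make the posterior formulas concrete, the following is a minimal NumPy sketch of the posterior mean and covariance for a zero-mean GP; the SE kernel, its hyperparameter values and the toy data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def se_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared exponential (SE) kernel matrix between two sets of 1-D inputs."""
    sqdist = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def gp_posterior(X_train, y_train, X_test, noise_var=0.01):
    """Posterior mean mu_D and covariance k_D of a zero-mean GP at the test inputs."""
    K = se_kernel(X_train, X_train)                    # K = [k(x_m, x_l)]
    K_s = se_kernel(X_train, X_test)                   # columns are k(x) for each test point
    K_ss = se_kernel(X_test, X_test)
    L = np.linalg.cholesky(K + noise_var * np.eye(len(X_train)))   # (K + sigma^2 I)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))      # (K + sigma^2 I)^-1 y
    mean = K_s.T @ alpha                               # mu_D(x) for a zero prior mean
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                               # k_D(x, x')
    return mean, cov

# toy usage
rng = np.random.default_rng(0)
X_train = np.linspace(0, 5, 20)
y_train = np.sin(X_train) + 0.1 * rng.standard_normal(20)
X_test = np.linspace(0, 5, 50)
mean, cov = gp_posterior(X_train, y_train, X_test)
```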
2.2 Model selection for GP’s
Typically, the kernel $k_\theta$ comes with a set of parameters $\theta$ that can be learned via maximization of the marginal likelihood $p(\mathbf{y} \mid \mathbf{X}, \theta, \sigma^2) = \mathcal{N}(\mathbf{y}; \mu(\mathbf{X}), k_\theta(\mathbf{X}, \mathbf{X}) + \sigma^2 I)$ or via maximum a posteriori (MAP) estimation, in case the parameters $\theta$ are equipped with a prior $p(\theta)$. This procedure is sometimes also called model selection, as one selects the hyperparameters of the kernel given the data (see [13]). However, we consider selecting the structural form of the kernel itself. The structural form of the kernel determines the statistical hypothesis that is assumed to be true for the data-generating process. Intuitively, the kernel is similar to the architecture in deep neural networks, which induces an inductive bias.
Our goal is to do model selection over a discrete, possibly infinite space of kernels $\mathcal{K} := \{k_1, k_2, \dots\}$. As each kernel comes with its own parameters, we are actually dealing with a space of kernel families. Thus, when mentioning a kernel $k$, we associate it with its whole family over parameters $\{k_\theta \mid \theta \in \Theta\}$. Once a kernel is selected, predictions are done with learned kernel parameters (which usually are a by-product of calculating the model selection criteria). The parameters $\theta$ are potentially equipped with a prior $p(\theta)$, depending on the selection criteria. As mean function, we always utilize the zero mean function $\mu(x) := 0$ in case $\mathbf{y}$ is centered and a constant mean function otherwise. This is a common choice in GP regression. Given some model selection criteria $g: \mathcal{K} \to \mathbb{R}$, our task is solving $k^* = \arg\max_{k \in \mathcal{K}} g(k \mid \mathcal{D})$. While our method is not restricted to a specific model selection criteria, we focus on the model evidence $p(\mathbf{y} \mid \mathbf{X}, k)$, which is a well-known selection criteria for probabilistic models (see [9, 14, 8]). Given a prior on the kernel parameters $p(\theta)$ and on the likelihood variance $p(\sigma^2)$, the log-model evidence of the marginalized GP is given as
$$g(k \mid \mathcal{D}) = \log p(\mathbf{y} \mid \mathbf{X}, k) = \log \int p(\mathbf{y} \mid \mathbf{X}, \theta, \sigma^2, k)\, p(\sigma^2)\, p(\theta \mid k)\, d\theta\, d\sigma^2 .$$
This quantity can be approximated via a Laplace approximation of $p(\theta, \sigma^2 \mid \mathcal{D})$ (see Appendix A for details). Computing this approximation includes performing a MAP estimation of the GP parameters. Thus, once the log evidence has been computed, learned kernel hyperparameters $\theta_{\mathrm{MAP}}$ are automatically provided. Performing the MAP estimation scales cubically in the dataset size $N$, which renders model selection for GP's computationally intense.
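As an illustration of this selection criterion, the sketch below computes a Laplace approximation of the log evidence for one fixed kernel structure: it finds the MAP of the log hyperparameters under assumed standard-normal priors on the log parameters and adds the Gaussian curvature correction. The SE kernel, the priors and the finite-difference Hessian are simplifying assumptions of this sketch; the paper's exact procedure is given in its Appendix A.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(X, lengthscale, variance):
    sqdist = (X[:, None] - X[None, :]) ** 2
    return variance * np.exp(-0.5 * sqdist / lengthscale ** 2)

def log_marginal_likelihood(X, y, lengthscale, variance, noise_var):
    """log p(y | X, theta, sigma^2) for a zero-mean GP with an SE kernel."""
    K = se_kernel(X, lengthscale, variance) + (noise_var + 1e-8) * np.eye(len(X))  # jitter for stability
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * len(X) * np.log(2 * np.pi)

def neg_log_joint(log_params, X, y):
    """-[log p(y | X, theta, sigma^2) + log p(theta) + log p(sigma^2)],
    with standard-normal priors on the log parameters (an assumption of this sketch)."""
    lengthscale, variance, noise_var = np.exp(log_params)
    log_prior = -0.5 * np.sum(log_params ** 2) - 0.5 * len(log_params) * np.log(2 * np.pi)
    return -(log_marginal_likelihood(X, y, lengthscale, variance, noise_var) + log_prior)

def numerical_hessian(f, x, eps=1e-4):
    """Central-difference Hessian; sufficient for this low-dimensional sketch."""
    d = len(x)
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            e_i = np.zeros(d); e_i[i] = eps
            e_j = np.zeros(d); e_j[j] = eps
            H[i, j] = (f(x + e_i + e_j) - f(x + e_i - e_j)
                       - f(x - e_i + e_j) + f(x - e_i - e_j)) / (4 * eps ** 2)
    return H

def laplace_log_evidence(X, y, log_params0=np.zeros(3)):
    """Laplace approximation of log p(y | X, k) for the fixed SE-kernel structure."""
    res = minimize(neg_log_joint, log_params0, args=(X, y), method="L-BFGS-B")
    H = numerical_hessian(lambda p: neg_log_joint(p, X, y), res.x)
    _, logdet = np.linalg.slogdet(H)
    # log Z ~= log p(y, theta_MAP) + d/2 * log(2*pi) - 1/2 * log det H
    log_evidence = -res.fun + 0.5 * len(res.x) * np.log(2 * np.pi) - 0.5 * logdet
    return log_evidence, np.exp(res.x)   # MAP hyperparameters come for free

# toy usage: evidence of the SE-kernel hypothesis on a small data set
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)
log_ev, theta_map = laplace_log_evidence(X, y)
print(log_ev, theta_map)
```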
2.3 Kernel Grammar
The kernel grammar introduced by [3] specifies a highly general search space over kernels. The grammar is based on the observation that kernels are closed under addition and multiplication, meaning that for kernels $k_1(x, x')$ and $k_2(x, x')$ also $k_1(x, x') + k_2(x, x')$ and $k_1(x, x') \times k_2(x, x')$ are kernels. Given some base kernels, such as the squared exponential kernel SE, the linear kernel LIN, the periodic kernel PER or the rational quadratic kernel RQ, different statistical hypotheses can be generated via addition and multiplication. For example,

LIN + PER × SE
describes a linear trend with a locally periodic component, such that it might be a useful hypothesis for time-series applications. For multidimensional data the base kernels are applied to single dimensions, denoted e.g. with $\mathrm{SE}_i$ for the squared exponential kernel defined on dimension $i$. The grammar contains many popular hypotheses, such as the ARD-RBF kernel that can be expressed via multiplication over the dimensions $\prod_{i=1}^d \mathrm{SE}_i$, as well as additive models $\sum_{i=1}^d \mathrm{SE}_i$ or polynomials of order $m$ with $\prod_{j=1}^m \mathrm{LIN}$.
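As a concrete illustration of such composed hypotheses, the sketch below builds LIN + PER × SE with scikit-learn's kernel classes (DotProduct for LIN, ExpSineSquared for PER, RBF for SE), which support + and × directly; the library choice, the toy data and the default hyperparameters are illustrative assumptions only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import DotProduct, ExpSineSquared, RBF

# LIN + PER * SE: a linear trend plus a locally periodic component
kernel = DotProduct() + ExpSineSquared() * RBF()

X = np.linspace(0, 10, 80).reshape(-1, 1)
y = 0.3 * X.ravel() + np.sin(2 * np.pi * X.ravel()) * np.exp(-0.1 * (X.ravel() - 5) ** 2)

gp = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, normalize_y=True)
gp.fit(X, y)          # kernel hyperparameters are fitted by marginal-likelihood maximization
print(gp.kernel_)     # the fitted composite kernel
```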
The specification of the complete search space is given as a grammar, where a base kernel is denoted as $\mathcal{B}$ and a subexpression is denoted with $\mathcal{S}$. For example, in the expression LIN + (PER × SE) the expression PER × SE is a subexpression. Starting with all base kernels, the search space is defined as all kernels that can be reached via the following operations:

1. Add a base kernel to a subexpression: $\mathcal{S} \to \mathcal{S} + \mathcal{B}$
2. Multiply a subexpression with a base kernel: $\mathcal{S} \to \mathcal{S} \times \mathcal{B}$
3. Exchange a base kernel with another base kernel: $\mathcal{B} \to \mathcal{B}'$

Starting from the set of base kernels, these operations can lead to all algebraic expressions, showing the expressiveness of the grammar [3]. The authors of [3] suggest using greedy search to search through the grammar, where the kernel with the highest value for the selection criteria is selected and expanded with all possible operations. After the selection criteria is calculated on the neighbors, the search progresses to the next stage by expanding the neighbors of the best kernel found.
We consider a generalized notion of the kernel grammar, where we consider a set of base kernels $\{\mathcal{B}_1, \dots, \mathcal{B}_r\}$, $r \in \mathbb{N}$, and a set of operators $\{T_1, \dots, T_l\}$, $l \in \mathbb{N}$, where $T_j: \mathcal{K} \times \mathcal{K} \to \mathcal{K}$, $j = 1, \dots, l$, are closed operators on the space of all kernel functions $\mathcal{K}$. Examples are addition and multiplication, but also the change-point operator which was considered in [7]. Thus, in general the grammar operations are

1. Apply operator $T_j$ onto a subexpression and a base kernel¹: $\mathcal{S} \to T_j(\mathcal{S}, \mathcal{B})$
2. Exchange a base kernel with another base kernel: $\mathcal{B} \to \mathcal{B}'$

The considered kernel space can be defined precisely in the following way:

Definition 1. For $r, l \in \mathbb{N}$, let $\{k_1, \dots, k_r\}$ be a set of (base-) kernels (symbolically represented as $\{\mathcal{B}_1, \dots, \mathcal{B}_r\}$) and $\{T_1, \dots, T_l\}$ a set of operators with $T_j: \mathcal{K} \times \mathcal{K} \to \mathcal{K}$, $j = 1, \dots, l$. Let
$$\mathcal{L}_0 := \{k_1, \dots, k_r\}, \qquad \mathcal{L}_i := \{T_j(k_1, k_2) \mid k_1, k_2 \in \mathcal{L}_{i-1},\ j = 1, \dots, l\} \cup \mathcal{L}_{i-1}$$
for $i = 1, \dots, M$. We call $\tilde{\mathcal{K}} := \mathcal{L}_M$ the kernel-grammar generated kernel space with depth $M$.
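To illustrate the structure of this search space, the following sketch enumerates the levels of Definition 1 on symbolic strings, assuming the base kernels {SE, LIN, PER, RQ} and the operators addition and multiplication; it is a toy enumeration only, not the expression-tree representation used later.

```python
from itertools import product

BASE_KERNELS = {"SE", "LIN", "PER", "RQ"}   # assumed base kernels B_1, ..., B_r
OPERATORS = {"+", "*"}                      # assumed closed operators T_1, ..., T_l

def next_level(level):
    """L_i = {T_j(k_1, k_2) | k_1, k_2 in L_{i-1}, j = 1, ..., l}  union  L_{i-1}."""
    new = set(level)
    for k1, k2 in product(level, repeat=2):
        for op in OPERATORS:
            new.add(f"({k1} {op} {k2})")
    return new

def kernel_space(depth):
    """The kernel-grammar generated kernel space with depth M, as symbolic expressions."""
    level = set(BASE_KERNELS)   # L_0
    for _ in range(depth):
        level = next_level(level)
    return level

print(len(kernel_space(1)))                        # 36 expressions already at depth M = 1
print("((LIN + PER) * SE)" in kernel_space(2))     # True: reachable at depth M = 2
```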
2.4 Representation of Kernels
The base kernels and their combination via the operators describe the structural assumptions of the
final kernel and imply the assumptions that are made in function space. Our main hypothesis is that
the symbolical representation of the kernel, given by the subexpressions $\mathcal{S}$ and base kernels, already contains sufficient information for model selection. Therefore, we will define a kernel-kernel over the symbolical representations and utilize it for Bayesian optimization. Concretely, each kernel $k \in \tilde{\mathcal{K}}$ in our described space can be written as a tree $T$, for example:

LIN + ((SE + PER) × SE)

ADD
 ├── LIN
 └── MULT
      ├── ADD
      │    ├── SE
      │    └── PER
      └── SE
¹ In case the operator $T_j$ is not symmetric, we also add the operation $\mathcal{S} \to T_j(\mathcal{B}, \mathcal{S})$. However, all operators considered in this work are symmetric.
Each operator $T_j$ and each base kernel $\mathcal{B}_i$ is represented by its respective name, where operators are the inner nodes of the tree and base kernels are the leaves. How the operators and base kernels are connected is represented through the tree structure. For a given expression tree $T$, we denote the multiset of all subexpressions/subtrees as Subtrees($T$). Furthermore, we consider paths to the leaves of the tree; for example, for the tree above one path to a leaf would be:

ADD → MULT → ADD → PER

We denote the multiset of all paths in the tree $T$ as Paths($T$). Lastly, we also consider the multiset of all base kernels that exist in a given expression tree as Base($T$). For each described multiset, we denote the number of occurrences of element $E$ in the multiset as $n(E)$. When building the multisets we also account for two symmetries in the elements that can be applied if an operator is associative and commutative, which we elaborate further in Appendix A.
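To make these multisets concrete, the sketch below represents an expression tree as nested tuples and extracts Base(T), Paths(T) and Subtrees(T) for the example tree of LIN + ((SE + PER) × SE); the tuple encoding and the string keys are illustrative choices, and the symmetry handling of Appendix A is omitted.

```python
from collections import Counter

# Expression tree for LIN + ((SE + PER) * SE): operators are inner nodes, base kernels are leaves.
TREE = ("ADD", "LIN", ("MULT", ("ADD", "SE", "PER"), "SE"))

def encode(tree):
    """String encoding of a (sub)tree, used as multiset element."""
    if isinstance(tree, str):
        return tree
    return "(" + tree[0] + " " + " ".join(encode(child) for child in tree[1:]) + ")"

def base(tree):
    """Multiset Base(T): all base kernels occurring in the tree."""
    if isinstance(tree, str):
        return Counter([tree])
    return sum((base(child) for child in tree[1:]), Counter())

def paths(tree, prefix=()):
    """Multiset Paths(T): all root-to-leaf paths, e.g. ADD -> MULT -> ADD -> PER."""
    if isinstance(tree, str):
        return Counter([" -> ".join(prefix + (tree,))])
    out = Counter()
    for child in tree[1:]:
        out += paths(child, prefix + (tree[0],))
    return out

def subtrees(tree):
    """Multiset Subtrees(T): every subexpression/subtree of the tree."""
    out = Counter([encode(tree)])
    if not isinstance(tree, str):
        for child in tree[1:]:
            out += subtrees(child)
    return out

print(base(TREE))      # Counter({'SE': 2, 'LIN': 1, 'PER': 1})
print(paths(TREE))     # contains e.g. 'ADD -> MULT -> ADD -> PER' with count n(E) = 1
print(subtrees(TREE))  # contains e.g. '(ADD SE PER)' with count 1 and 'SE' with count 2
```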
Depending on the operators that are used, several trees can describe the same kernel $k \in \tilde{\mathcal{K}}$. For technical reasons, we denote with $f: \tilde{\mathcal{K}} \mapsto \mathcal{T}(\tilde{\mathcal{K}})$ a mapping that maps a given kernel $k \in \tilde{\mathcal{K}}$ to one tree $T$ that induces this kernel, where $\mathcal{T}(\tilde{\mathcal{K}})$ denotes the set of all expression trees that can generate a kernel in $\tilde{\mathcal{K}}$ (see details in Appendix A).
3 Kernel-Kernel
Our kernel-kernel will be defined via a pseudo-metric over the expression trees. Optimal transport (OT) principles have proven themselves to be effective in BO methods over structured spaces, as shown in [5] and [11] for neural architecture search or in [6] for BO over molecule structures. We follow this line of work and also rely on optimal transport, although we only use a simple ground metric to allow for closed-form computations. To allow OT metrics to be used, we summarize each expression tree $T$ into discrete probability distributions of its building blocks Base($T$), Paths($T$) and Subtrees($T$) via
$$\omega_{\mathrm{base}} := \sum_{E \in \mathrm{Base}(T)} \omega_E \delta_E, \qquad \omega_{\mathrm{paths}} := \sum_{E \in \mathrm{Paths}(T)} \omega_E \delta_E, \qquad \omega_{\mathrm{subtrees}} := \sum_{E \in \mathrm{Subtrees}(T)} \omega_E \delta_E,$$
where $\delta_E$ is the Dirac delta and $\omega_E$ is the frequency of the element $E$ in the respective multiset. For example, the frequency of the expression SE × PER in the multiset Subtrees($T$) is calculated as
$$\omega_{\mathrm{SE} \times \mathrm{PER}} = \frac{n(\mathrm{SE} \times \mathrm{PER})}{|\mathrm{Subtrees}(T)|}.$$
Each probability distribution represents a different modeling aspect that is induced by a kernel $k \in \tilde{\mathcal{K}}$ and its corresponding expression tree $T$:
1. $\omega_{\mathrm{base}}$ specifies which base kernels are present in $T$, thus, which base assumptions in function space are included, such as periodicity, linearity or local smoothness.
2. $\omega_{\mathrm{paths}}$ specifies how the base kernels in $T$ are used, e.g. whether a periodic component is applied additively or multiplicatively.
3. $\omega_{\mathrm{subtrees}}$ specifies the interaction between the base kernels in $T$, for example whether $T$ contains an addition of a linear and a periodic component or not.
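The discrete distributions above are just normalized multiset frequencies; a minimal sketch, shown here for Base(T) and equally applicable to Paths(T) and Subtrees(T):

```python
from collections import Counter

def frequencies(multiset):
    """omega_E = n(E) / |multiset| for a multiset given as a Counter."""
    total = sum(multiset.values())
    return {element: count / total for element, count in multiset.items()}

# Base(T) for the tree of LIN + ((SE + PER) * SE)
omega_base = frequencies(Counter({"SE": 2, "LIN": 1, "PER": 1}))
print(omega_base)   # {'SE': 0.5, 'LIN': 0.25, 'PER': 0.25}
```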
Our pseudo-metric between kernels $k_1$ and $k_2$ (and their associated trees $T_1$ and $T_2$) uses all three modeling aspects via the optimal transport distances between the respective discrete probability distributions $\omega_{\mathrm{base}}$, $\omega_{\mathrm{paths}}$ and $\omega_{\mathrm{subtrees}}$.
In general, the OT distance with ground metric $\tilde{d}$ between two discrete probability distributions $\omega_1 = \sum_E \omega_{1,E}\, \delta_E$ and $\omega_2 = \sum_E \omega_{2,E}\, \delta_E$ over $\Omega$ is defined as
$$W_{\tilde{d}}(\omega_1, \omega_2) = \inf_{\pi \in \mathcal{R}(\omega_1, \omega_2)} \int_{\Omega \times \Omega} \tilde{d}(E, E')\, \pi(dE, dE'), \qquad (1)$$
where $\pi \in \mathcal{R}(\omega_1, \omega_2)$ is a combined probability distribution over $\Omega \times \Omega$ with marginal distributions $\pi(A \times \Omega) = \omega_1(A)$ and $\pi(\Omega \times B) = \omega_2(B)$ for Borel sets $A, B$
. While for general ground metric
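For a simple illustration of (1), assume the 0-1 indicator ground metric $\tilde{d}(E, E') = 1_{E \neq E'}$ (an assumption for this sketch, not necessarily the ground metric used here); in that case the OT distance between two discrete distributions reduces to the total variation distance $\frac{1}{2}\sum_E |\omega_{1,E} - \omega_{2,E}|$:

```python
def ot_distance_indicator(omega1, omega2):
    """W_d(omega1, omega2) in (1) for the 0-1 ground metric: the total variation distance."""
    support = set(omega1) | set(omega2)
    return 0.5 * sum(abs(omega1.get(E, 0.0) - omega2.get(E, 0.0)) for E in support)

# omega_base of two hypothetical expression trees
omega_base_1 = {"SE": 0.5, "LIN": 0.25, "PER": 0.25}   # e.g. LIN + ((SE + PER) * SE)
omega_base_2 = {"SE": 0.5, "PER": 0.5}                  # e.g. SE * PER
print(ot_distance_indicator(omega_base_1, omega_base_2))   # 0.25
```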