Tikhonov Regularization is Optimal Transport
Robust under Martingale Constraints
Jiajin Li Sirui Lin José Blanchet
Stanford University
{jiajinli, siruilin, jose.blanchet}@stanford.edu
Viet Anh Nguyen
Chinese University of Hong Kong
nguyen@se.cuhk.edu.hk
Abstract
Distributionally robust optimization has been shown to offer a principled way to
regularize learning models. In this paper, we find that Tikhonov regularization is
distributionally robust in an optimal transport sense (i.e., if an adversary chooses
distributions in a suitable optimal transport neighborhood of the empirical measure),
provided that suitable martingale constraints are also imposed. Further, we intro-
duce a relaxation of the martingale constraints which not only provides a unified
viewpoint to a class of existing robust methods but also leads to new regularization
tools. To realize these novel tools, tractable computational algorithms are proposed.
As a byproduct, the strong duality theorem proved in this paper can be potentially
applied to other problems of independent interest.
1 Introduction
Regularization is an important tool in machine learning which is used in, for instance, reducing
overfitting [23]. Recently, ideas from distributionally robust optimization (DRO) have led to a fresh
viewpoint on regularization precisely in connection to overfitting; see, e.g., [7, 2, 20, 10, 29, 24, 25, 5] and the references therein.
In these references it is shown that many standard regularization-based estimators arise as the solution
of a min-max game in which one wishes to minimize a loss over a class of parameters against
an adversary that maximizes the out-of-distribution impact of any given parameter choice, that
is, the adversary perturbs the empirical distribution in a certain way. The choice of adversarial distributions or perturbations in DRO is often non-parametric, thus providing reassurance that the decision is reasonably robust to a wide range of out-of-distribution perturbations. For example,
one such non-parametric choice is given by employing optimal transport costs [31] to construct a so-called distributional uncertainty set (e.g., a Wasserstein ball around the empirical distribution) for the adversary to choose from. [27] shows that optimal transport-based DRO (OT-DRO) is closely related to adversarial robustness in the sense of steepest gradient loss contamination. This can be further explained by OT-DRO's hidden connection with generalized Lipschitz regularization [7]. Thus,
understanding whether a well-known regularization technique is actually distributionally robust, and in what sense, allows us to understand its out-of-distribution benefits and potentially introduce improvements.
In this paper, we introduce a novel set of regularization techniques which incorporate martingale
constraints into the OT-DRO framework. Our starting point is the conventional OT-DRO formulation.
The conventional OT-DRO formulation can generally be interpreted as perturbing each data point
in such a way that the average perturbation size is less than a given budget. In addition to this
conventional formulation, we will impose a martingale constraint in the joint distribution of the
empirical data and the resulting adversarially perturbed data.
Why do we believe that the martingale constraint makes sense as a regularization technique? It turns out that two random variables $X$ and $\bar{X}$ form a martingale, in the sense that $\mathbb{E}[\bar{X} \mid X] = X$, if and only if the distribution of $\bar{X}$ dominates that of $X$ in convex order [30]. In this sense, the adversary $\bar{X}$ will have higher dispersion in a non-parametric sense than the observed data $X$, but in a suitably constrained way so that the average locations are preserved. We believe this novel constrained OT-DRO regularization can help combat overly conservative solutions; see [16]. Moreover, by allowing a small amount of violation of the martingale property, we can control the regularization properties of this constraint, thus obtaining a natural interpolation towards the conventional OT-DRO formulation and potentially improved regularization performance. We point out that related optimal transport problems with martingale constraints have been studied in robust mathematical finance [1, 8].
Consider, for example, the linear regression setting with exact martingale constraints, which means that for any given observed data point, the conditional expectation of the additive perturbation under the worst-case joint distribution equals zero. Surprisingly, we show that the resulting martingale DRO model is exactly equivalent to ridge regression [18], i.e., Tikhonov regularization. To the best of our knowledge, this paper is the first work to interpret Tikhonov regularization from a DRO perspective, showing that it is distributionally robust in a precise non-parametric sense. In stark contrast, it is well known that the conventional OT-based DRO model (without the martingale constraint) is identical to the regularized square-root regression problem [2]. Therefore, introducing an additional power in the norm regularization (i.e., converting square-root regression into Tikhonov regularization) can be translated into adding martingale constraints on the adversarial perturbations, thus reducing the adversary's power. A natural question that arises here is whether we can interpolate between the conventional DRO model and Tikhonov regularization, and further improve upon both.
We will provide a comprehensive and positive answer to the above question in this paper. The key idea is to relax the equality constraint on the conditional expectation of the adversarial perturbation, allowing a small violation of the martingale property so as to enrich the uncertainty set. This idea leads to another novel model, termed the perturbed martingale DRO in the sequel. Intuitively, if the relaxation is sufficiently loose, the perturbed martingale DRO model reduces to the conventional DRO model, which formally corresponds to allowing an infinite violation of the martingale constraint. By contrast, if no violation is allowed, the perturbed martingale DRO model automatically reduces to its exact counterpart, Tikhonov regularization. As a result, we are able to introduce a new class of regularizers that interpolate between the conventional DRO model and Tikhonov regularization.
Furthermore, this interpolation also extends to a broad class of nonlinear learning models. Inspired by our analysis of linear regression, the developed martingale DRO model also yields a new principled adversarial training procedure for deep neural networks. Extensive experiments demonstrate the effectiveness of the proposed perturbed martingale DRO model for both linear regression and deep neural network training in the adversarial setting.
We summarize our main contributions as follows:
• We reveal a new hidden connection: Tikhonov regularization is optimal transport robust when exact martingale constraints (i.e., convex order between the adversary and the empirical data) are imposed.
• Building on this finding, we develop a new perturbed martingale DRO model, which not only provides a unified viewpoint of existing regularization techniques, but also leads to a new class of robust regularizers.
• We introduce an easy-to-implement computational approach to capitalize on the theoretical benefits in practice, in both linear regression and neural network training under the adversarial setting.
• As a byproduct, the strong duality theorem, which is proved in this paper and serves as our main technical tool, can be applied to a wider spectrum of problems of independent interest.
2 Preliminaries
Let us introduce some basic definitions and concepts in preparation for the subsequent analysis.
Definition 2.1 (Optimal transport costs and the Wasserstein distance [21, 31]). Suppose that $c(\cdot,\cdot): \mathbb{R}^d \times \mathbb{R}^d \to [0, \infty]$ is a lower semi-continuous cost function such that $c(X, X) = 0$ for every $X \in \mathbb{R}^d$. The optimal transport cost between two distributions $\mathbb{Q}$ and $\mathbb{P}$ supported on $\mathbb{R}^d$ is defined as
$$\mathcal{D}(\mathbb{Q}, \mathbb{P}) \triangleq \min_{\pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X})} \Big\{ \mathbb{E}_\pi[c(\bar{X}, X)] : \mathrm{P}_1 \pi = \mathbb{Q},\ \mathrm{P}_2 \pi = \mathbb{P} \Big\}.$$
Here, $\mathcal{P}(\mathcal{X} \times \mathcal{X})$ is the set of joint probability distributions $\pi$ of $(\bar{X}, X)$ supported on $\mathcal{X} \times \mathcal{X}$, while $\mathrm{P}_1 \pi$ and $\mathrm{P}_2 \pi$ respectively refer to the marginals of $\bar{X}$ and $X$ under the joint distribution $\pi$.

If $c(X, \bar{X}) = \|\bar{X} - X\|$ for any given norm on $\mathbb{R}^d$, then $\mathcal{D}$ recovers the Wasserstein distance [31]. In this paper, we are interested in a flexible family of cost functions for computational tractability, the so-called Mahalanobis cost functions of the form $c(\bar{X}, X) = (X - \bar{X})^\top M (X - \bar{X})$, where $M$ is a $d$-by-$d$ positive definite matrix.
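To make Definition 2.1 concrete, the following sketch computes the optimal transport cost between two discrete measures under a Mahalanobis ground cost by solving the coupling linear program directly. This is an illustration of ours (the helper names and the use of scipy's LP solver are not from the paper), not the computational approach developed later.

```python
import numpy as np
from scipy.optimize import linprog

def mahalanobis_cost(x_bar, x, M):
    """Ground cost c(x_bar, x) = (x - x_bar)^T M (x - x_bar)."""
    diff = x - x_bar
    return diff @ M @ diff

def ot_cost(Q_atoms, q_weights, P_atoms, p_weights, M):
    """Optimal transport cost D(Q, P) between two discrete measures
    (Definition 2.1), solved as a linear program over couplings pi."""
    m, n = len(Q_atoms), len(P_atoms)
    C = np.array([[mahalanobis_cost(xb, x, M) for x in P_atoms] for xb in Q_atoms])
    # Decision variable: pi flattened row-major; marginals P1 pi = Q, P2 pi = P.
    A_eq, b_eq = [], []
    for i in range(m):                       # row sums equal q_i
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1.0
        A_eq.append(row); b_eq.append(q_weights[i])
    for j in range(n):                       # column sums equal p_j
        col = np.zeros(m * n); col[j::n] = 1.0
        A_eq.append(col); b_eq.append(p_weights[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# Tiny example: two empirical measures on R^2 with M = I (squared Euclidean cost).
rng = np.random.default_rng(0)
P_atoms, Q_atoms = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))
w = np.full(5, 0.2)
print(ot_cost(Q_atoms, w, P_atoms, w, np.eye(2)))
```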
Next, we consider the conventional OT-DRO problem:
$$\min_\beta\ \mathcal{L}_\beta(\hat{\mathbb{P}}, \rho), \quad \text{where} \quad \mathcal{L}_\beta(\hat{\mathbb{P}}, \rho) \triangleq \left\{ \begin{array}{cl} \sup_\pi & \mathbb{E}_\pi[\ell(f_\beta(\bar{X}))] \\ \mathrm{s.t.} & \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \\ & \mathbb{E}_\pi[c(\bar{X}, X)] \le \rho,\ \mathrm{P}_2 \pi = \hat{\mathbb{P}}, \end{array} \right. \tag{2.1}$$
where $\hat{\mathbb{P}} \triangleq \frac{1}{N} \sum_{i=1}^N \delta_{X_i}$ is the empirical distribution. Using Definition 2.1, we have $\mathcal{L}_\beta(\hat{\mathbb{P}}, \rho) = \max_{\mathbb{Q}: \mathcal{D}(\mathbb{Q}, \hat{\mathbb{P}}) \le \rho} \mathbb{E}_{\mathbb{Q}}[\ell(f_\beta(\bar{X}))]$, which is the worst-case expected loss over all distributions whose OT distance to the empirical measure $\hat{\mathbb{P}}$ is at most $\rho$. It is well known that, under appropriate assumptions, the DRO problem (2.1) is equivalent to the regularized square-root regression problem.
Proposition 2.2 ([2, Proposition 2]). Suppose that (i) the loss function $\ell(\cdot)$ is a convex quadratic function, i.e., $\nabla^2 \ell(\cdot) = \gamma > 0$, where $\gamma$ is a constant, (ii) the feature mapping $f_\beta(\bar{X}) = \beta^\top \bar{X}$ is linear, and (iii) the ground cost $c$ is the squared Euclidean norm on $\mathcal{X} = \mathbb{R}^d$. Then
$$\mathcal{L}_\beta(\hat{\mathbb{P}}, \rho) = \Big( \sqrt{\mathbb{E}_{\hat{\mathbb{P}}}[\ell(f_\beta(X))]} + \sqrt{\tfrac{\gamma \rho}{2}}\, \|\beta\|_2 \Big)^2.$$
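As a sanity check of Proposition 2.2, the following sketch (ours, assuming the squared loss $\ell(t) = t^2$, so $\gamma = 2$) evaluates the closed form and verifies that it is attained by an explicit worst-case perturbation that shifts every sample along $\beta$ with magnitudes proportional to the residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 3
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
rho, gamma = 0.5, 2.0                        # squared loss l(t) = t^2 has gamma = 2

t = X @ beta                                 # f_beta(X_i) for each sample
emp_loss = np.mean(t ** 2)

# Closed form of Proposition 2.2 (squared Euclidean ground cost).
closed_form = (np.sqrt(emp_loss)
               + np.sqrt(gamma * rho / 2) * np.linalg.norm(beta)) ** 2

# Explicit worst-case perturbation: move each point along beta, with magnitude
# proportional to t_i, so that the Cauchy-Schwarz steps are tight and the
# average transport budget (1/N) sum_i ||Delta_i||^2 equals rho exactly.
scale = np.sqrt(rho) / (np.linalg.norm(beta) * np.sqrt(emp_loss))
Delta = np.outer(scale * t, beta)
attained = np.mean(((X + Delta) @ beta) ** 2)
budget = np.mean(np.sum(Delta ** 2, axis=1))
print(closed_form, attained, budget)         # attained == closed_form, budget == rho
```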
We also present the strong duality result for a general class of optimal transport based DRO models with martingale constraints. This result serves as our main technical tool for reformulating the DRO models, and it can also be applied to other DRO models with a semi-infinite structure, which could be of independent interest. We consider the primal problem
$$\begin{array}{cl} \sup_\pi & \int_{\mathcal{X} \times \mathcal{X}} f(\bar{X})\, \mathrm{d}\pi \\ \mathrm{s.t.} & \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \\ & \int_{\mathcal{X} \times \mathcal{X}} c(\bar{X}, X)\, \mathrm{d}\pi \le \rho,\ \mathrm{P}_2 \pi = \hat{\mathbb{P}} \\ & \mathbb{E}_\pi[\bar{X} \mid X] = X \quad \hat{\mathbb{P}}\text{-a.s.} \end{array} \tag{Primal}$$
and its associated dual form
$$\inf_{\lambda \in \mathbb{R}_+,\ \alpha_i \in \mathbb{R}^d}\ \lambda \rho + \frac{1}{N} \sum_{i=1}^N \alpha_i^\top X_i + \frac{1}{N} \sum_{i=1}^N \sup_{\bar{X} \in \mathcal{X}} \Big\{ f(\bar{X}) - \alpha_i^\top \bar{X} - \lambda c(\bar{X}, X_i) \Big\}. \tag{Dual}$$
Here, $f: \mathcal{X} \to \mathbb{R}$ is upper semi-continuous and $L^1$-integrable. The next theorem states the strong duality result linking these two problems.
Theorem 2.3 (Strong duality). Let $\hat{\mathbb{P}} \triangleq \frac{1}{N} \sum_{i \in [N]} \delta_{X_i}$ be the reference measure. Suppose that (i) every sample point is in the interior of the cone generated by $\mathcal{X}$, i.e., $X_i \in \mathrm{int}(\mathrm{cone}(\mathcal{X}))$ for all $i \in [N]$, and (ii) the ambiguity radius $\rho > 0$. Then strong duality holds, i.e., $\mathrm{Val(Primal)} = \mathrm{Val(Dual)}$.
Remark 2.4. At the heart of our analysis is the abstract semi-infinite duality theory for conic linear programs [26, Proposition 3.4]. Notably, reformulating our problem into the standard conic form and carefully verifying the generalized Slater condition is subtle. Moreover, we thereby fill a technical gap in [19, Theorem 4.2].
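Although our reformulations rely on the exact strong duality of Theorem 2.3, weak duality alone is easy to check numerically: any dual-feasible pair $(\lambda, \alpha)$ upper-bounds the value of any feasible coupling. The sketch below is ours; it assumes $f(\bar{X}) = (\beta^\top \bar{X})^2$ and the trivial choice $\alpha_i = 0$, and evaluates the dual objective by solving each inner supremum in closed form, which is a concave quadratic once $\lambda M \succ \beta \beta^\top$.

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 50, 3
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
rho = 0.5
M = np.eye(d)

def dual_value(lam, alpha):
    """Dual objective of (Dual) for f(x) = (beta^T x)^2 and Mahalanobis cost.
    Requires lam * M - beta beta^T to be positive definite, so that each inner
    sup is a concave quadratic attained at its stationary point."""
    H = 2 * np.outer(beta, beta) - 2 * lam * M    # Hessian of the inner problem
    vals = []
    for i in range(N):
        x_star = np.linalg.solve(H, alpha[i] - 2 * lam * M @ X[i])
        g = (beta @ x_star) ** 2 - alpha[i] @ x_star \
            - lam * (x_star - X[i]) @ M @ (x_star - X[i])
        vals.append(alpha[i] @ X[i] + g)
    return lam * rho + np.mean(vals)

# Weak duality check: the identity coupling (no perturbation) is primal feasible,
# so its value lower-bounds any dual-feasible objective value.
lam = 2 * (beta @ np.linalg.solve(M, beta)) + 1.0   # makes the inner sups concave
alpha = np.zeros((N, d))
primal_lb = np.mean((X @ beta) ** 2)
print(primal_lb, dual_value(lam, alpha))            # primal_lb <= dual value
```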
Notation. We use $\mathbb{S}^d_{++}$ to denote the set of $d$-by-$d$ positive definite matrices and write $\|X\|_M \triangleq \sqrt{X^\top M X}$ for any $X \in \mathbb{R}^d$, $M \in \mathbb{S}^d_{++}$; if $M$ is the identity matrix, then we omit $M$ and write $\|X\| \triangleq \sqrt{X^\top X}$. We use $\delta_X$ to denote the Dirac measure at $X$ and let $\hat{\mathbb{P}} \triangleq \frac{1}{N} \sum_{i=1}^N \delta_{X_i}$ be the empirical measure constructed from the sample $\{X_1, \ldots, X_N\}$. We use $\mathbb{E}_{\mathbb{P}}$ to denote integration over $\mathbb{P}$: $\mathbb{E}_{\mathbb{P}}[f(X)] = \int_{\mathcal{X}} f(X)\, \mathrm{d}\mathbb{P}$. Specifically, for $(\bar{X}, X)$ following the joint distribution $\pi$, $\mathbb{E}_\pi[c(\bar{X}, X)] = \int_{\mathcal{X} \times \mathcal{X}} c(\bar{X}, X)\, \mathrm{d}\pi$ and $\mathbb{E}_\pi[f(\bar{X})] = \int_{\mathcal{X} \times \mathcal{X}} f(\bar{X})\, \mathrm{d}\pi$. We use $\ell(\cdot)$ to denote the loss function applied to the parametrized feature mapping $f_\beta$, and let $\nabla \ell(\cdot)$ and $\nabla^2 \ell(\cdot)$ be its first- and second-order derivatives, respectively. Notably, we use $\mathcal{L}_\beta(\hat{\mathbb{P}}, \rho)$ to denote the objective function of the conventional DRO model (2.1); we use $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho)$ to refer to the exact martingale DRO model (3.1); and we use $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, \epsilon)$ to refer to the perturbed martingale DRO model (3.2).
3 Tractable Reformulations
In this section, we first introduce an optimal transport-based DRO model with the exact martingale constraint. That is, on top of the vanilla DRO model [3], we add a martingale equality constraint on its coupling. Surprisingly, the resulting DRO approach turns out to be equivalent to empirical risk minimization with Tikhonov regularization. Naturally, we can relax the equality constraint, allowing a small violation of the martingale property to enrich the uncertainty set. Formally, by varying the allowed violation size in the martingale constraint from zero to infinity, our relaxation interpolates between Tikhonov regularization (exact martingale constraints) and the conventional DRO formulation (no martingale constraints). This relaxation thus leads to a new class of regularizers in a principled way, which improves upon Tikhonov regularization as we show in our experiments.
Assumption 3.1. The following assumptions hold throughout:
(i) the ground cost $c(\cdot, \cdot)$ is the Mahalanobis cost with weighting matrix $M \in \mathbb{S}^d_{++}$;
(ii) the domain $\mathcal{X}$ is unconstrained, i.e., $\mathcal{X} = \mathbb{R}^d$.
3.1 Optimal Transport-based DRO with Martingale Constraints
To start with, we investigate the exact martingale DRO problem:
$$\min_\beta\ \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho), \quad \text{where} \quad \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho) \triangleq \left\{ \begin{array}{cl} \sup_\pi & \mathbb{E}_\pi[\ell(f_\beta(\bar{X}))] \\ \mathrm{s.t.} & \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \\ & \mathbb{E}_\pi[c(\bar{X}, X)] \le \rho,\ \mathrm{P}_2 \pi = \hat{\mathbb{P}} \\ & \mathbb{E}_\pi[\bar{X} \mid X] = X \quad \hat{\mathbb{P}}\text{-a.s.,} \end{array} \right. \tag{3.1}$$
where $\rho \ge 0$ is the radius of the uncertainty set centered at $\hat{\mathbb{P}}$. Note that because $\hat{\mathbb{P}}$ is the empirical measure, the martingale constraint requires that, for each observed data point, the conditional expectation of the adversarially perturbed point equals the observed data point itself. The quantity $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho)$ is referred to as the worst-case expected loss of the model parameter $\beta$ under the martingale DRO model. It is easy to see that $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho) \le \mathcal{L}_\beta(\hat{\mathbb{P}}, \rho)$, where $\mathcal{L}_\beta$ is defined as in (2.1). This is because the adversary in (3.1) has a smaller feasible set, and thus is less powerful than the adversary in (2.1). Hence, the martingale DRO solution of problem (3.1) is considered to be less conservative than the conventional DRO solution of problem (2.1).
The next result asserts that the martingale DRO problem coincides with the Tikhonov regularization problem under conditions similar to those of Proposition 2.2.
Proposition 3.2 (Tikhonov equivalence). Suppose that (i) the loss function $\ell(\cdot)$ is a convex quadratic function, i.e., $\nabla^2 \ell(\cdot) = \gamma > 0$, and (ii) the feature mapping $f_\beta(\bar{X}) = \beta^\top \bar{X}$ is linear. Then we have
$$\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho) = \mathbb{E}_{\hat{\mathbb{P}}}[\ell(\beta^\top X)] + \frac{\gamma \rho}{2} \|\beta\|^2_{M^{-1}}.$$
If the Mahalanobis matrix $M$ is the identity matrix, the martingale DRO model (3.1) recovers the Tikhonov regularization problem.
Proof of Proposition 3.2. By a change of variables, let $\Delta = \bar{X} - X$. Then we have
$$\sup_{\substack{\mathbb{E}_\pi[\|\Delta\|^2_M] \le \rho \\ \mathbb{E}_\pi[\Delta \mid X] = 0}} \mathbb{E}_\pi\big[\ell(\beta^\top (X + \Delta))\big] = \sup_{\substack{\mathbb{E}_\pi[\|\Delta\|^2_M] \le \rho \\ \mathbb{E}_\pi[\Delta \mid X] = 0}} \mathbb{E}_\pi\Big[\ell(\beta^\top X) + \nabla \ell(\beta^\top X)\, \beta^\top \Delta + \frac{\gamma}{2} |\beta^\top \Delta|^2\Big]$$
$$= \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\beta^\top X)\big] + \sup_{\substack{\mathbb{E}_\pi[\|\Delta\|^2_M] \le \rho \\ \mathbb{E}_\pi[\Delta \mid X] = 0}} \mathbb{E}_\pi\Big[\frac{\gamma}{2} |\beta^\top \Delta|^2\Big] = \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\beta^\top X)\big] + \frac{\gamma \rho}{2} \|\beta\|^2_{M^{-1}}.$$
The second equality holds because the first-order term vanishes under the martingale constraint: $\mathbb{E}_\pi[\nabla \ell(\beta^\top X)\, \beta^\top \Delta] = \mathbb{E}_\pi\big[\nabla \ell(\beta^\top X)\, \beta^\top \mathbb{E}_\pi[\Delta \mid X]\big] = 0$. The last equality follows from the general Hölder inequality, which gives $|\beta^\top \Delta|^2 \le \|\beta\|^2_{M^{-1}} \|\Delta\|^2_M$. To achieve equality, we can, for example, take a normally distributed random variable $C$ with mean $0$ and variance $\rho$ which is independent of $X$, and then let $\Delta = C M^{-1} \beta$.
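The construction at the end of the proof can be checked by Monte Carlo. In the sketch below (ours, with squared loss, and with the variance of $C$ normalized so that the transport budget $\mathbb{E}_\pi[\|\Delta\|^2_M] = \rho$ is met exactly, a normalization of ours), the value of the feasible coupling $\Delta = C M^{-1} \beta$ matches the closed form of Proposition 3.2.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 3
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
rho, gamma = 0.5, 2.0                       # squared loss: l(t) = t^2, so gamma = 2
M = np.diag([1.0, 2.0, 4.0])                # a Mahalanobis weighting matrix
Minv = np.linalg.inv(M)
beta_Minv_sq = beta @ Minv @ beta           # ||beta||^2_{M^{-1}}

# Closed form from Proposition 3.2.
closed_form = np.mean((X @ beta) ** 2) + gamma * rho / 2 * beta_Minv_sq

# Feasible coupling from the proof: Delta = C * M^{-1} beta with C ~ N(0, s^2),
# independent of X; s^2 chosen so that E[||Delta||_M^2] = rho exactly.
reps = 200_000
C = rng.normal(scale=np.sqrt(rho / beta_Minv_sq), size=(reps, 1))
Xr = X[rng.integers(N, size=reps)]          # X drawn from the empirical measure
Delta = C * (Minv @ beta)                   # martingale: E[Delta | X] = 0
mc_value = np.mean(((Xr + Delta) @ beta) ** 2)
print(closed_form, mc_value)                # mc_value ~= closed_form (MC error)
```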
Example 3.3 (Linear regression). Let $X^\top \triangleq (Y, Z^\top) \in \mathbb{R}^d$ and $\beta^\top \triangleq (1, -b^\top) \in \mathbb{R}^d$, so that $\beta^\top X = Y - b^\top Z$. For any $Q \in \mathbb{S}^{d-1}_{++}$, we take $M = \mathrm{diag}(+\infty, Q)$, which implies that we do not allow transport of the response $Y$; then problem (3.1) with $\gamma = 2$ becomes
$$\min_b\ \Big\{ \mathbb{E}_{\hat{\mathbb{P}}}\big[(Y - b^\top Z)^2\big] + \rho \|b\|^2_{Q^{-1}} \Big\}.$$
In Appendix C.1, we give another instructive proof based on the strong duality result in Section 2.
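Concretely, Example 3.3 says that solving the martingale DRO linear regression amounts to solving a ridge problem, whose minimizer has the usual closed form. A minimal sketch of ours, taking $Q$ to be the identity so that the penalty is plain ridge:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 100, 4
Z = rng.normal(size=(N, d))
b_true = rng.normal(size=d)
Y = Z @ b_true + 0.1 * rng.normal(size=N)
rho = 0.3
Q = np.eye(d)                    # Q = I gives the standard ridge penalty

# Martingale DRO linear regression (Example 3.3) = ridge regression with
# penalty rho * ||b||^2_{Q^{-1}}; its minimizer solves the normal equations.
Qinv = np.linalg.inv(Q)
b_hat = np.linalg.solve(Z.T @ Z / N + rho * Qinv, Z.T @ Y / N)
print(b_hat)
```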
For general convex loss functions, we have the following certificate of robustness, which provides upper and lower bounds for the worst-case DRO loss in (3.1).
Corollary 3.4 (General convex loss functions). Suppose that (i) the loss function $\ell(\cdot)$ is $\mu$-strongly convex and $C$-smooth, that is, $\ell(\theta) \ge \ell(\theta') + \nabla \ell(\theta')(\theta - \theta') + \frac{\mu}{2}(\theta - \theta')^2$ and $\ell(\theta) \le \ell(\theta') + \nabla \ell(\theta')(\theta - \theta') + \frac{C}{2}(\theta - \theta')^2$ hold for all $\theta, \theta' \in \mathbb{R}$, and (ii) the feature mapping $f_\beta(\bar{X}) = \beta^\top \bar{X}$ is linear. Then we have, for any $\rho \ge 0$,
$$\mathbb{E}_{\hat{\mathbb{P}}}[\ell(f_\beta(X))] + \frac{\mu \rho}{2} \|\beta\|^2_{M^{-1}} \le \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho) \le \mathbb{E}_{\hat{\mathbb{P}}}[\ell(f_\beta(X))] + \frac{C \rho}{2} \|\beta\|^2_{M^{-1}}.$$
Example 3.5 (Logistic regression). Let $X = YZ \in \mathbb{R}^d$, where $Z \in \mathbb{R}^d$, $Y \in \{\pm 1\}$, and $\ell(t) = \log(1 + \exp(-t))$, which satisfies the smoothness part of assumption (i) in Corollary 3.4 with $C = \frac{1}{4}$. Then we have
$$\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho) \le \mathbb{E}_{\hat{\mathbb{P}}}\big[\log(1 + \exp(-Y \beta^\top Z))\big] + \frac{\rho}{8} \|\beta\|^2_{M^{-1}}.$$
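The upper bound in Example 3.5 is a computable robustness certificate: for any fixed $\beta$, it bounds the worst-case logistic loss by two empirical quantities. A short sketch of ours evaluating it, assuming $M = I$:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 300, 3
Z = rng.normal(size=(N, d))
Y = np.sign(rng.normal(size=N))
beta = rng.normal(size=d)
rho = 0.2
M = np.eye(d)

# Certificate from Example 3.5: the worst-case logistic loss under the exact
# martingale DRO model is at most the empirical loss plus (rho/8)*||beta||^2_{M^{-1}}.
margins = Y * (Z @ beta)
emp_loss = np.mean(np.log1p(np.exp(-margins)))
certificate = emp_loss + rho / 8 * beta @ np.linalg.inv(M) @ beta
print(emp_loss, certificate)
```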
3.2 Optimal Transport-based DRO with Perturbed Martingale Constraints
Now we turn to the relaxation of the martingale constraint, which improves upon Tikhonov regularization and gains more flexibility. We consider the perturbed martingale coupling based DRO model (perturbed martingale DRO):
$$\min_\beta\ \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, \epsilon), \quad \text{where} \quad \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, \epsilon) \triangleq \left\{ \begin{array}{cl} \sup_\pi & \mathbb{E}_\pi[\ell(f_\beta(\bar{X}))] \\ \mathrm{s.t.} & \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \\ & \mathbb{E}_\pi[c(\bar{X}, X)] \le \rho,\ \mathrm{P}_2 \pi = \hat{\mathbb{P}} \\ & \big\| \mathbb{E}_\pi[\bar{X} \mid X] - X \big\|_M \le \epsilon \quad \hat{\mathbb{P}}\text{-a.s.} \end{array} \right. \tag{3.2}$$
The parameter $\epsilon$ controls the allowed violation of the martingale constraint for the adversary. Clearly, if we set $\epsilon = 0$, then we obtain $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, 0) = \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho)$ and we recover the exact martingale DRO model (3.1). If we set $\epsilon = +\infty$, then the martingale constraint becomes ineffective, so $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, +\infty) = \mathcal{L}_\beta(\hat{\mathbb{P}}, \rho)$ and model (3.2) collapses to the conventional DRO formulation (2.1). We can thus think of $\epsilon$ as an interpolating parameter connecting two extremes: the conventional DRO model (2.1) (at $\epsilon = +\infty$) and the exact martingale DRO model (3.1) (at $\epsilon = 0$).
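Although the value of $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, \epsilon)$ for intermediate $\epsilon$ is derived later, the two endpoints are already available from Propositions 2.2 and 3.2, and since the feasible set grows with $\epsilon$, every intermediate value is sandwiched between them. A small sketch of ours (squared loss, $M = I$) evaluating the two endpoints:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 200, 3
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
rho = 0.5                            # gamma = 2 for squared loss; M = I

emp_loss = np.mean((X @ beta) ** 2)
b_norm = np.linalg.norm(beta)

# The two endpoints of the interpolation in epsilon:
tikhonov_end = emp_loss + rho * b_norm ** 2                           # eps = 0, Prop 3.2
conventional_end = (np.sqrt(emp_loss) + np.sqrt(rho) * b_norm) ** 2   # eps = +inf, Prop 2.2

# For any eps in [0, +inf], the perturbed martingale worst-case loss lies
# between these two values and is non-decreasing in eps.
print(tikhonov_end, conventional_end)
```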
However, the resulting optimization problem (3.2) is an infinite-dimensional optimization problem over probability distributions and thus appears computationally intractable. To overcome this issue, we leverage Theorem 2.3 and prove that problem (3.2) is in fact equivalent to a finite-dimensional problem. To begin with, we present one crucial proposition that can be applied in more general settings.