Tikhonov Regularization is Optimal Transport
Robust under Martingale Constraints
Jiajin Li Sirui Lin José Blanchet
Stanford University
{jiajinli, siruilin, jose.blanchet}@stanford.edu
Viet Anh Nguyen
Chinese University of Hong Kong
nguyen@se.cuhk.edu.hk
Abstract
Distributionally robust optimization has been shown to offer a principled way to
regularize learning models. In this paper, we find that Tikhonov regularization is
distributionally robust in an optimal transport sense (i.e., if an adversary chooses
distributions in a suitable optimal transport neighborhood of the empirical measure),
provided that suitable martingale constraints are also imposed. Further, we intro-
duce a relaxation of the martingale constraints which not only provides a unified
viewpoint to a class of existing robust methods but also leads to new regularization
tools. To realize these novel tools, tractable computational algorithms are proposed.
As a byproduct, the strong duality theorem proved in this paper can be potentially
applied to other problems of independent interest.
1 Introduction
Regularization is an important tool in machine learning which is used in, for instance, reducing
overfitting [23]. Recently, ideas from distributionally robust optimization (DRO) have led to a fresh
viewpoint on regularization precisely in connection to overfitting; see, e.g., [7, 2, 20, 10, 29, 24, 25, 5] and the references therein.
In these references it is shown that many standard regularization-based estimators arise as the solution
of a min-max game in which one wishes to minimize a loss over a class of parameters against
an adversary that maximizes the out-of-distribution impact of any given parameter choice, that
is, the adversary perturbs the empirical distribution in a certain way. The choice of adversarial distributions or perturbations in DRO is often non-parametric, thus providing reassurance that the decision is reasonably robust to a wide range of out-of-distribution perturbations. For example,
one such non-parametric choice is given by employing optimal transport costs [31] to construct a so-called distributional uncertainty set (e.g., a Wasserstein ball around the empirical distribution) for the adversary to choose from. [27] shows that optimal transport-based DRO (OT-DRO) is closely related to adversarial robustness in the sense of steepest gradient loss contamination. This can be further explained by OT-DRO's hidden connection with generalized Lipschitz regularization [7]. Thus,
understanding whether a well-known regularization technique is actually distributionally robust, and in what sense, allows us to understand its out-of-distribution benefits and potentially introduce improvements.
In this paper, we introduce a novel set of regularization techniques which incorporate martingale
constraints into the OT-DRO framework. Our starting point is the conventional OT-DRO formulation.
The conventional OT-DRO formulation can generally be interpreted as perturbing each data point
in such a way that the average perturbation size is less than a given budget. In addition to this
conventional formulation, we will impose a martingale constraint in the joint distribution of the
empirical data and the resulting adversarially perturbed data.
Why do we believe that the martingale constraint makes sense as a regularization technique? It turns out that two random variables $X$ and $\bar{X}$ form a martingale, in the sense that $\mathbb{E}[\bar{X} \mid X] = X$, if and only if the distribution of $\bar{X}$ dominates that of $X$ in convex order [30]. In this sense, the adversary $\bar{X}$ will have higher dispersion in a non-parametric sense than the observed data $X$, but in a suitably constrained way so that the average locations are preserved. We believe this novel constrained OT-DRO regularization can help combat overly conservative solutions; see [16]. Moreover, by allowing a small amount of violation of the martingale property, we can control the regularization properties of this constraint, thus obtaining a natural interpolation towards the conventional OT-DRO formulation and potentially improved regularization performance. We point out that related optimal transport problems with martingale constraints have been studied in robust mathematical finance [1, 8].
Consider, for example, the linear regression setting with exact martingale constraints, which means that for any given observed data point, the conditional expectation of the additive perturbation under the worst-case joint distribution equals zero. Surprisingly, we show that the resulting martingale DRO model is exactly equivalent to ridge regression [18], i.e., Tikhonov regularization. To the best of our knowledge, this paper is the first work to interpret Tikhonov regularization from a DRO perspective, showing that it is distributionally robust in a precise non-parametric sense. In stark contrast, it is well known that the conventional OT-based DRO model (without the martingale constraint) is identical to the regularized square-root regression problem [2]. Therefore, introducing an additional power in the norm regularization (i.e., converting square-root regression into Tikhonov regularization) can be translated into adding martingale constraints on the adversarial perturbations, thus reducing the adversary's power. A natural question that arises here is whether we can interpolate between the conventional DRO model and Tikhonov regularization, and further improve upon both.
We will provide a comprehensive and positive answer to the above question in this paper. The key idea is to relax the equality constraint on the conditional expectation of the adversarial perturbation, allowing a small violation of the martingale property so as to enrich the uncertainty set. This idea leads to another novel model, termed the perturbed martingale DRO in the sequel. Intuitively, if the relaxation is sufficiently loose, the perturbed martingale DRO model reduces to the conventional DRO model, which formally corresponds to allowing an infinite violation of the martingale constraint. By contrast, if no violation is allowed, the perturbed martingale DRO model automatically reduces to its exact counterpart, Tikhonov regularization. As a result, we are able to introduce a new class of regularizers that interpolate between the conventional DRO model and Tikhonov regularization.
Furthermore, this interpolation also extends to a broad class of nonlinear learning models. Inspired by our analysis of linear regression, the developed martingale DRO model also yields a new principled adversarial training procedure for deep neural networks. Extensive experiments demonstrate the effectiveness of the proposed perturbed martingale DRO model for both linear regression and deep neural network training in the adversarial setting.
We summarize our main contributions as follows:
• We reveal a new hidden connection: Tikhonov regularization is optimal transport robust when exact martingale constraints (i.e., convex order between the adversary and the empirical data) are imposed.
• Building on this finding, we develop a new perturbed martingale DRO model, which not only provides a unified viewpoint of existing regularization techniques, but also leads to a new class of robust regularizers.
• We introduce an easy-to-implement computational approach to capitalize on the theoretical benefits in practice, in both linear regression and neural network training under the adversarial setting.
• As a byproduct, the strong duality theorem, which is proved in this paper and serves as our main technical tool, can be applied to a wider spectrum of problems of independent interest.
2 Preliminaries
Let us introduce some basic definitions and concepts in preparation for the subsequent analysis.
Definition 2.1 (Optimal transport costs and the Wasserstein distance [21, 31]). Suppose that $c(\cdot,\cdot): \mathbb{R}^d \times \mathbb{R}^d \to [0, \infty]$ is a lower semi-continuous cost function such that $c(X, X) = 0$ for every $X \in \mathbb{R}^d$. The optimal transport cost between two distributions $\mathbb{Q}$ and $\mathbb{P}$ supported on $\mathbb{R}^d$ is defined as
$$\mathcal{D}(\mathbb{Q}, \mathbb{P}) \triangleq \min_{\pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X})} \Big\{ \mathbb{E}_\pi[c(\bar{X}, X)] : \mathrm{P}_1 \pi = \mathbb{Q},\ \mathrm{P}_2 \pi = \mathbb{P} \Big\}.$$
Here, $\mathcal{P}(\mathcal{X} \times \mathcal{X})$ is the set of joint probability distributions $\pi$ of $(\bar{X}, X)$ supported on $\mathcal{X} \times \mathcal{X}$, while $\mathrm{P}_1 \pi$ and $\mathrm{P}_2 \pi$ respectively refer to the marginals of $\bar{X}$ and $X$ under the joint distribution $\pi$.

If $c(X, \bar{X}) = \|\bar{X} - X\|$ for any given norm on $\mathbb{R}^d$, then $\mathcal{D}$ recovers the Wasserstein distance [31]. In this paper, we are interested in a flexible family of cost functions for computational tractability, the so-called Mahalanobis cost functions of the form $c(\bar{X}, X) = (X - \bar{X})^\top M (X - \bar{X})$, where $M$ is a $d$-by-$d$ positive definite matrix.
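To make Definition 2.1 concrete, the following sketch computes the optimal transport cost between two discrete measures under a Mahalanobis ground cost by solving the coupling linear program directly. This is an illustration of ours (the helper names and the use of scipy's LP solver are not from the paper), not the computational approach developed later.

```python
import numpy as np
from scipy.optimize import linprog

def mahalanobis_cost(x_bar, x, M):
    """Ground cost c(x_bar, x) = (x - x_bar)^T M (x - x_bar)."""
    diff = x - x_bar
    return diff @ M @ diff

def ot_cost(Q_atoms, q_weights, P_atoms, p_weights, M):
    """Optimal transport cost D(Q, P) between two discrete measures
    (Definition 2.1), solved as a linear program over couplings pi."""
    m, n = len(Q_atoms), len(P_atoms)
    C = np.array([[mahalanobis_cost(xb, x, M) for x in P_atoms] for xb in Q_atoms])
    # Decision variable: pi flattened row-major; marginals P1 pi = Q, P2 pi = P.
    A_eq, b_eq = [], []
    for i in range(m):                       # row sums equal q_i
        row = np.zeros(m * n); row[i * n:(i + 1) * n] = 1.0
        A_eq.append(row); b_eq.append(q_weights[i])
    for j in range(n):                       # column sums equal p_j
        col = np.zeros(m * n); col[j::n] = 1.0
        A_eq.append(col); b_eq.append(p_weights[j])
    res = linprog(C.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun

# Tiny example: two empirical measures on R^2 with M = I (squared Euclidean cost).
rng = np.random.default_rng(0)
P_atoms, Q_atoms = rng.normal(size=(5, 2)), rng.normal(size=(5, 2))
w = np.full(5, 0.2)
print(ot_cost(Q_atoms, w, P_atoms, w, np.eye(2)))
```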
Next, we consider the conventional OT-DRO problem:
$$\min_\beta\ \mathcal{L}_\beta(\hat{\mathbb{P}}, \rho), \quad \text{where} \quad \mathcal{L}_\beta(\hat{\mathbb{P}}, \rho) \triangleq \left\{ \begin{array}{cl} \sup_\pi & \mathbb{E}_\pi[\ell(f_\beta(\bar{X}))] \\ \mathrm{s.t.} & \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \\ & \mathbb{E}_\pi[c(\bar{X}, X)] \le \rho,\ \mathrm{P}_2 \pi = \hat{\mathbb{P}}, \end{array} \right. \tag{2.1}$$
where $\hat{\mathbb{P}} \triangleq \frac{1}{N} \sum_{i=1}^N \delta_{X_i}$ is the empirical distribution. Using Definition 2.1, we have $\mathcal{L}_\beta(\hat{\mathbb{P}}, \rho) = \max_{\mathbb{Q}: \mathcal{D}(\mathbb{Q}, \hat{\mathbb{P}}) \le \rho} \mathbb{E}_{\mathbb{Q}}[\ell(f_\beta(\bar{X}))]$, which is the worst-case expected loss over all distributions whose OT distance to the empirical measure $\hat{\mathbb{P}}$ is at most $\rho$. It is well known that, under appropriate assumptions, the DRO problem (2.1) is equivalent to the regularized square-root regression problem.
Proposition 2.2 ([2, Proposition 2]). Suppose that (i) the loss function $\ell(\cdot)$ is a convex quadratic function, i.e., $\nabla^2 \ell(\cdot) = \gamma > 0$, where $\gamma$ is a constant, (ii) the feature mapping $f_\beta(\bar{X}) = \beta^\top \bar{X}$ is linear, and (iii) the ground cost $c$ is the squared Euclidean norm on $\mathcal{X} = \mathbb{R}^d$. Then
$$\mathcal{L}_\beta(\hat{\mathbb{P}}, \rho) = \Big( \sqrt{\mathbb{E}_{\hat{\mathbb{P}}}[\ell(f_\beta(X))]} + \sqrt{\tfrac{\gamma \rho}{2}}\, \|\beta\|_2 \Big)^2.$$
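As a sanity check of Proposition 2.2, the following sketch (ours, assuming the squared loss $\ell(t) = t^2$, so $\gamma = 2$) evaluates the closed form and verifies that it is attained by an explicit worst-case perturbation that shifts every sample along $\beta$ with magnitudes proportional to the residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 3
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
rho, gamma = 0.5, 2.0                        # squared loss l(t) = t^2 has gamma = 2

t = X @ beta                                 # f_beta(X_i) for each sample
emp_loss = np.mean(t ** 2)

# Closed form of Proposition 2.2 (squared Euclidean ground cost).
closed_form = (np.sqrt(emp_loss)
               + np.sqrt(gamma * rho / 2) * np.linalg.norm(beta)) ** 2

# Explicit worst-case perturbation: move each point along beta, with magnitude
# proportional to t_i, so that the Cauchy-Schwarz steps are tight and the
# average transport budget (1/N) sum_i ||Delta_i||^2 equals rho exactly.
scale = np.sqrt(rho) / (np.linalg.norm(beta) * np.sqrt(emp_loss))
Delta = np.outer(scale * t, beta)
attained = np.mean(((X + Delta) @ beta) ** 2)
budget = np.mean(np.sum(Delta ** 2, axis=1))
print(closed_form, attained, budget)         # attained == closed_form, budget == rho
```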
We also present the strong duality result for a general class of optimal transport based DRO models with martingale constraints. This result serves as our main technical tool for reformulating the DRO models, and it can also be applied to other DRO models with a semi-infinite structure, which could be of independent interest. We consider the primal problem
$$\begin{array}{cl} \sup_\pi & \int_{\mathcal{X} \times \mathcal{X}} f(\bar{X})\, \mathrm{d}\pi \\ \mathrm{s.t.} & \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \\ & \int_{\mathcal{X} \times \mathcal{X}} c(\bar{X}, X)\, \mathrm{d}\pi \le \rho,\ \mathrm{P}_2 \pi = \hat{\mathbb{P}} \\ & \mathbb{E}_\pi[\bar{X} \mid X] = X \quad \hat{\mathbb{P}}\text{-a.s.} \end{array} \tag{Primal}$$
and its associated dual form
$$\inf_{\lambda \in \mathbb{R}_+,\ \alpha_i \in \mathbb{R}^d}\ \lambda \rho + \frac{1}{N} \sum_{i=1}^N \alpha_i^\top X_i + \frac{1}{N} \sum_{i=1}^N \sup_{\bar{X} \in \mathcal{X}} \Big\{ f(\bar{X}) - \alpha_i^\top \bar{X} - \lambda c(\bar{X}, X_i) \Big\}. \tag{Dual}$$
Here, $f: \mathcal{X} \to \mathbb{R}$ is upper semi-continuous and $L^1$-integrable. The next theorem states the strong duality result linking these two problems.
Theorem 2.3 (Strong duality). Let $\hat{\mathbb{P}} \triangleq \frac{1}{N} \sum_{i \in [N]} \delta_{X_i}$ be the reference measure. Suppose that (i) every sample point is in the interior of the cone generated by $\mathcal{X}$, i.e., $X_i \in \mathrm{int}(\mathrm{cone}(\mathcal{X}))$ for all $i \in [N]$, and (ii) the ambiguity radius $\rho > 0$. Then strong duality holds, i.e., $\mathrm{Val(Primal)} = \mathrm{Val(Dual)}$.
Remark 2.4. At the heart of our analysis is the abstract semi-infinite duality theory for conic linear programs [26, Proposition 3.4]. Notably, reformulating our problem into the standard conic form and carefully verifying the generalized Slater condition is subtle. Moreover, we thereby fill a technical gap in [19, Theorem 4.2].
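Although our reformulations rely on the exact strong duality of Theorem 2.3, weak duality alone is easy to check numerically: any dual-feasible pair $(\lambda, \alpha)$ upper-bounds the value of any feasible coupling. The sketch below is ours; it assumes $f(\bar{X}) = (\beta^\top \bar{X})^2$ and the trivial choice $\alpha_i = 0$, and evaluates the dual objective by solving each inner supremum in closed form, which is a concave quadratic once $\lambda M \succ \beta \beta^\top$.

```python
import numpy as np

rng = np.random.default_rng(6)
N, d = 50, 3
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
rho = 0.5
M = np.eye(d)

def dual_value(lam, alpha):
    """Dual objective of (Dual) for f(x) = (beta^T x)^2 and Mahalanobis cost.
    Requires lam * M - beta beta^T to be positive definite, so that each inner
    sup is a concave quadratic attained at its stationary point."""
    H = 2 * np.outer(beta, beta) - 2 * lam * M    # Hessian of the inner problem
    vals = []
    for i in range(N):
        x_star = np.linalg.solve(H, alpha[i] - 2 * lam * M @ X[i])
        g = (beta @ x_star) ** 2 - alpha[i] @ x_star \
            - lam * (x_star - X[i]) @ M @ (x_star - X[i])
        vals.append(alpha[i] @ X[i] + g)
    return lam * rho + np.mean(vals)

# Weak duality check: the identity coupling (no perturbation) is primal feasible,
# so its value lower-bounds any dual-feasible objective value.
lam = 2 * (beta @ np.linalg.solve(M, beta)) + 1.0   # makes the inner sups concave
alpha = np.zeros((N, d))
primal_lb = np.mean((X @ beta) ** 2)
print(primal_lb, dual_value(lam, alpha))            # primal_lb <= dual value
```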
Notation. We use $\mathbb{S}^d_{++}$ to denote the set of $d$-by-$d$ positive definite matrices and write $\|X\|_M \triangleq \sqrt{X^\top M X}$ for any $X \in \mathbb{R}^d$, $M \in \mathbb{S}^d_{++}$; if $M$ is the identity matrix, then we omit $M$ and write $\|X\| \triangleq \sqrt{X^\top X}$. We use $\delta_X$ to denote the Dirac measure at $X$ and let $\hat{\mathbb{P}} \triangleq \frac{1}{N} \sum_{i=1}^N \delta_{X_i}$ be the empirical measure constructed from the sample $\{X_1, \ldots, X_N\}$. We use $\mathbb{E}_{\mathbb{P}}$ to denote integration over $\mathbb{P}$: $\mathbb{E}_{\mathbb{P}}[f(X)] = \int_{\mathcal{X}} f(X)\, \mathrm{d}\mathbb{P}$. Specifically, for $(\bar{X}, X)$ following the joint distribution $\pi$, $\mathbb{E}_\pi[c(\bar{X}, X)] = \int_{\mathcal{X} \times \mathcal{X}} c(\bar{X}, X)\, \mathrm{d}\pi$ and $\mathbb{E}_\pi[f(\bar{X})] = \int_{\mathcal{X} \times \mathcal{X}} f(\bar{X})\, \mathrm{d}\pi$. We use $\ell(\cdot)$ to denote the loss function applied to the parametrized feature mapping $f_\beta$, and let $\nabla \ell(\cdot)$ and $\nabla^2 \ell(\cdot)$ be its first- and second-order derivatives, respectively. Notably, we use $\mathcal{L}_\beta(\hat{\mathbb{P}}, \rho)$ to denote the objective function of the conventional DRO model (2.1); we use $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho)$ to refer to the exact martingale DRO model (3.1); and we use $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, \epsilon)$ to refer to the perturbed martingale DRO model (3.2).
3 Tractable Reformulations
In this section, we first introduce an optimal transport-based DRO model with the exact martingale constraint. That is, on top of the vanilla DRO model [3], we add a martingale equality constraint on its coupling. Surprisingly, the resulting DRO approach turns out to be equivalent to empirical risk minimization with Tikhonov regularization. Naturally, we can relax the equality constraint, allowing a small violation of the martingale property to enrich the uncertainty set. Formally, by varying the allowed violation size in the martingale constraint from zero to infinity, our relaxation interpolates between Tikhonov regularization (exact martingale constraints) and the conventional DRO formulation (no martingale constraints). This relaxation thus leads to a new class of regularizers in a principled way, which improves upon Tikhonov regularization as we show in our experiments.
Assumption 3.1. The following assumptions hold throughout:
(i) the ground cost $c(\cdot, \cdot)$ is the Mahalanobis cost with weighting matrix $M \in \mathbb{S}^d_{++}$;
(ii) the domain $\mathcal{X}$ is unconstrained, i.e., $\mathcal{X} = \mathbb{R}^d$.
3.1 Optimal Transport-based DRO with Martingale Constraints
To start with, we investigate the exact martingale DRO problem:
$$\min_\beta\ \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho), \quad \text{where} \quad \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho) \triangleq \left\{ \begin{array}{cl} \sup_\pi & \mathbb{E}_\pi[\ell(f_\beta(\bar{X}))] \\ \mathrm{s.t.} & \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \\ & \mathbb{E}_\pi[c(\bar{X}, X)] \le \rho,\ \mathrm{P}_2 \pi = \hat{\mathbb{P}} \\ & \mathbb{E}_\pi[\bar{X} \mid X] = X \quad \hat{\mathbb{P}}\text{-a.s.,} \end{array} \right. \tag{3.1}$$
where $\rho \ge 0$ is the radius of the uncertainty set centered at $\hat{\mathbb{P}}$. Note that because $\hat{\mathbb{P}}$ is the empirical measure, the martingale constraint requires that, for each observed data point, the conditional expectation of the adversarially perturbed point equals the observed data point itself. The quantity $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho)$ is referred to as the worst-case expected loss of the model parameter $\beta$ under the martingale DRO model. It is easy to see that $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho) \le \mathcal{L}_\beta(\hat{\mathbb{P}}, \rho)$, where $\mathcal{L}_\beta$ is defined as in (2.1). This is because the adversary in (3.1) has a smaller feasible set, and thus is less powerful than the adversary in (2.1). Hence, the martingale DRO solution of problem (3.1) is considered to be less conservative than the conventional DRO solution of problem (2.1).
The next result asserts that the martingale DRO problem coincides with the Tikhonov regularization problem under conditions similar to those of Proposition 2.2.
Proposition 3.2 (Tikhonov equivalence). Suppose that (i) the loss function $\ell(\cdot)$ is a convex quadratic function, i.e., $\nabla^2 \ell(\cdot) = \gamma > 0$, and (ii) the feature mapping $f_\beta(\bar{X}) = \beta^\top \bar{X}$ is linear. Then we have
$$\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho) = \mathbb{E}_{\hat{\mathbb{P}}}[\ell(\beta^\top X)] + \frac{\gamma \rho}{2} \|\beta\|^2_{M^{-1}}.$$
If the Mahalanobis matrix $M$ is the identity matrix, the martingale DRO model (3.1) recovers the Tikhonov regularization problem.
Proof of Proposition 3.2. By a change of variables, let $\Delta = \bar{X} - X$. Then we have
$$\sup_{\substack{\mathbb{E}_\pi[\|\Delta\|^2_M] \le \rho \\ \mathbb{E}_\pi[\Delta \mid X] = 0}} \mathbb{E}_\pi\big[\ell(\beta^\top (X + \Delta))\big] = \sup_{\substack{\mathbb{E}_\pi[\|\Delta\|^2_M] \le \rho \\ \mathbb{E}_\pi[\Delta \mid X] = 0}} \mathbb{E}_\pi\Big[\ell(\beta^\top X) + \nabla \ell(\beta^\top X)\, \beta^\top \Delta + \frac{\gamma}{2} |\beta^\top \Delta|^2\Big]$$
$$= \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\beta^\top X)\big] + \sup_{\substack{\mathbb{E}_\pi[\|\Delta\|^2_M] \le \rho \\ \mathbb{E}_\pi[\Delta \mid X] = 0}} \mathbb{E}_\pi\Big[\frac{\gamma}{2} |\beta^\top \Delta|^2\Big] = \mathbb{E}_{\hat{\mathbb{P}}}\big[\ell(\beta^\top X)\big] + \frac{\gamma \rho}{2} \|\beta\|^2_{M^{-1}}.$$
The second equality holds because the first-order term vanishes under the martingale constraint: $\mathbb{E}_\pi[\nabla \ell(\beta^\top X)\, \beta^\top \Delta] = \mathbb{E}_\pi\big[\nabla \ell(\beta^\top X)\, \beta^\top \mathbb{E}_\pi[\Delta \mid X]\big] = 0$. The last equality follows from the general Hölder inequality, which gives $|\beta^\top \Delta|^2 \le \|\beta\|^2_{M^{-1}} \|\Delta\|^2_M$. To achieve equality, we can, for example, take a normally distributed random variable $C$ with mean $0$ and variance $\rho$ which is independent of $X$, and then let $\Delta = C M^{-1} \beta$.
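The construction at the end of the proof can be checked by Monte Carlo. In the sketch below (ours, with squared loss, and with the variance of $C$ normalized so that the transport budget $\mathbb{E}_\pi[\|\Delta\|^2_M] = \rho$ is met exactly, a normalization of ours), the value of the feasible coupling $\Delta = C M^{-1} \beta$ matches the closed form of Proposition 3.2.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 500, 3
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
rho, gamma = 0.5, 2.0                       # squared loss: l(t) = t^2, so gamma = 2
M = np.diag([1.0, 2.0, 4.0])                # a Mahalanobis weighting matrix
Minv = np.linalg.inv(M)
beta_Minv_sq = beta @ Minv @ beta           # ||beta||^2_{M^{-1}}

# Closed form from Proposition 3.2.
closed_form = np.mean((X @ beta) ** 2) + gamma * rho / 2 * beta_Minv_sq

# Feasible coupling from the proof: Delta = C * M^{-1} beta with C ~ N(0, s^2),
# independent of X; s^2 chosen so that E[||Delta||_M^2] = rho exactly.
reps = 200_000
C = rng.normal(scale=np.sqrt(rho / beta_Minv_sq), size=(reps, 1))
Xr = X[rng.integers(N, size=reps)]          # X drawn from the empirical measure
Delta = C * (Minv @ beta)                   # martingale: E[Delta | X] = 0
mc_value = np.mean(((Xr + Delta) @ beta) ** 2)
print(closed_form, mc_value)                # mc_value ~= closed_form (MC error)
```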
Example 3.3 (Linear regression). Let $X^\top \triangleq (Y, Z^\top) \in \mathbb{R}^d$ and $\beta^\top \triangleq (1, -b^\top) \in \mathbb{R}^d$, so that $\beta^\top X = Y - b^\top Z$. For any $Q \in \mathbb{S}^{d-1}_{++}$, we take $M = \mathrm{diag}(+\infty, Q)$, which implies that we do not allow transport of the response $Y$; then problem (3.1) with $\gamma = 2$ becomes
$$\min_b\ \Big\{ \mathbb{E}_{\hat{\mathbb{P}}}\big[(Y - b^\top Z)^2\big] + \rho \|b\|^2_{Q^{-1}} \Big\}.$$
In Appendix C.1, we give another instructive proof based on the strong duality result in Section 2.
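Concretely, Example 3.3 says that solving the martingale DRO linear regression amounts to solving a ridge problem, whose minimizer has the usual closed form. A minimal sketch of ours, taking $Q$ to be the identity so that the penalty is plain ridge:

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 100, 4
Z = rng.normal(size=(N, d))
b_true = rng.normal(size=d)
Y = Z @ b_true + 0.1 * rng.normal(size=N)
rho = 0.3
Q = np.eye(d)                    # Q = I gives the standard ridge penalty

# Martingale DRO linear regression (Example 3.3) = ridge regression with
# penalty rho * ||b||^2_{Q^{-1}}; its minimizer solves the normal equations.
Qinv = np.linalg.inv(Q)
b_hat = np.linalg.solve(Z.T @ Z / N + rho * Qinv, Z.T @ Y / N)
print(b_hat)
```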
For general convex loss functions, we have the following certificate of robustness, which provides upper and lower bounds for the worst-case DRO loss in (3.1).
Corollary 3.4 (General convex loss functions). Suppose that (i) the loss function $\ell(\cdot)$ is $\mu$-strongly convex and $C$-smooth, that is, $\ell(\theta) \ge \ell(\theta') + \nabla \ell(\theta')(\theta - \theta') + \frac{\mu}{2}(\theta - \theta')^2$ and $\ell(\theta) \le \ell(\theta') + \nabla \ell(\theta')(\theta - \theta') + \frac{C}{2}(\theta - \theta')^2$ hold for all $\theta, \theta' \in \mathbb{R}$, and (ii) the feature mapping $f_\beta(\bar{X}) = \beta^\top \bar{X}$ is linear. Then we have, for any $\rho \ge 0$,
$$\mathbb{E}_{\hat{\mathbb{P}}}[\ell(f_\beta(X))] + \frac{\mu \rho}{2} \|\beta\|^2_{M^{-1}} \le \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho) \le \mathbb{E}_{\hat{\mathbb{P}}}[\ell(f_\beta(X))] + \frac{C \rho}{2} \|\beta\|^2_{M^{-1}}.$$
Example 3.5 (Logistic regression). Let $X = YZ \in \mathbb{R}^d$, where $Z \in \mathbb{R}^d$, $Y \in \{\pm 1\}$, and $\ell(t) = \log(1 + \exp(-t))$, which satisfies the smoothness part of assumption (i) in Corollary 3.4 with $C = \frac{1}{4}$. Then we have
$$\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho) \le \mathbb{E}_{\hat{\mathbb{P}}}\big[\log(1 + \exp(-Y \beta^\top Z))\big] + \frac{\rho}{8} \|\beta\|^2_{M^{-1}}.$$
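The upper bound in Example 3.5 is a computable robustness certificate: for any fixed $\beta$, it bounds the worst-case logistic loss by two empirical quantities. A short sketch of ours evaluating it, assuming $M = I$:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 300, 3
Z = rng.normal(size=(N, d))
Y = np.sign(rng.normal(size=N))
beta = rng.normal(size=d)
rho = 0.2
M = np.eye(d)

# Certificate from Example 3.5: the worst-case logistic loss under the exact
# martingale DRO model is at most the empirical loss plus (rho/8)*||beta||^2_{M^{-1}}.
margins = Y * (Z @ beta)
emp_loss = np.mean(np.log1p(np.exp(-margins)))
certificate = emp_loss + rho / 8 * beta @ np.linalg.inv(M) @ beta
print(emp_loss, certificate)
```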
3.2 Optimal Transport-based DRO with Perturbed Martingale Constraints
Now we turn to the relaxation of the martingale constraint, which improves upon Tikhonov regularization and gains more flexibility. We consider the perturbed martingale coupling based DRO model (perturbed martingale DRO):
$$\min_\beta\ \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, \epsilon), \quad \text{where} \quad \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, \epsilon) \triangleq \left\{ \begin{array}{cl} \sup_\pi & \mathbb{E}_\pi[\ell(f_\beta(\bar{X}))] \\ \mathrm{s.t.} & \pi \in \mathcal{P}(\mathcal{X} \times \mathcal{X}) \\ & \mathbb{E}_\pi[c(\bar{X}, X)] \le \rho,\ \mathrm{P}_2 \pi = \hat{\mathbb{P}} \\ & \big\| \mathbb{E}_\pi[\bar{X} \mid X] - X \big\|_M \le \epsilon \quad \hat{\mathbb{P}}\text{-a.s.} \end{array} \right. \tag{3.2}$$
The parameter $\epsilon$ controls the allowed violation of the martingale constraint for the adversary. Clearly, if we set $\epsilon = 0$, then we obtain $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, 0) = \mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho)$ and we recover the exact martingale DRO model (3.1). If we set $\epsilon = +\infty$, then the martingale constraint becomes ineffective, so $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, +\infty) = \mathcal{L}_\beta(\hat{\mathbb{P}}, \rho)$ and model (3.2) collapses to the conventional DRO formulation (2.1). We can thus think of $\epsilon$ as an interpolating parameter connecting two extremes: the conventional DRO model (2.1) (at $\epsilon = +\infty$) and the exact martingale DRO model (3.1) (at $\epsilon = 0$).
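Although the value of $\mathcal{L}^{\mathrm{M}}_\beta(\hat{\mathbb{P}}, \rho, \epsilon)$ for intermediate $\epsilon$ is derived later, the two endpoints are already available from Propositions 2.2 and 3.2, and since the feasible set grows with $\epsilon$, every intermediate value is sandwiched between them. A small sketch of ours (squared loss, $M = I$) evaluating the two endpoints:

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 200, 3
X = rng.normal(size=(N, d))
beta = rng.normal(size=d)
rho = 0.5                            # gamma = 2 for squared loss; M = I

emp_loss = np.mean((X @ beta) ** 2)
b_norm = np.linalg.norm(beta)

# The two endpoints of the interpolation in epsilon:
tikhonov_end = emp_loss + rho * b_norm ** 2                           # eps = 0, Prop 3.2
conventional_end = (np.sqrt(emp_loss) + np.sqrt(rho) * b_norm) ** 2   # eps = +inf, Prop 2.2

# For any eps in [0, +inf], the perturbed martingale worst-case loss lies
# between these two values and is non-decreasing in eps.
print(tikhonov_end, conventional_end)
```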
However, the resulting optimization problem (3.2) is an infinite-dimensional optimization problem over probability distributions and thus appears computationally intractable. To overcome this issue, we leverage Theorem 2.3 and prove that problem (3.2) is in fact equivalent to a finite-dimensional problem. To begin with, we present one crucial proposition that can be applied in more general settings.