The Stochastic Proximal Distance Algorithm
Haoyu Jiang
Department of Statistics, University of Illinois Urbana-Champaign
Jason Xu
Department of Statistical Science, Duke University
Abstract
Stochastic versions of proximal methods have gained much attention in statistics
and machine learning. These algorithms tend to admit simple, scalable forms, and
enjoy numerical stability via implicit updates. In this work, we propose and analyze
a stochastic version of the recently proposed proximal distance algorithm, a class of
iterative optimization methods that recover a desired constrained estimation problem
as a penalty parameter $\rho \to \infty$. By uncovering connections to related stochastic
proximal methods and interpreting the penalty parameter as the learning rate, we
justify heuristics used in practical manifestations of the proximal distance method,
establishing their convergence guarantees for the first time. Moreover, we extend recent
theoretical devices to establish finite error bounds and a complete characterization of
convergence rate regimes. We validate our analysis via a thorough empirical study,
also showing that, unsurprisingly, the proposed method outpaces batch versions on
popular learning tasks.
1 Introduction
Optimization of an objective function subject to constraints or regularization is ubiquitous
in statistics and machine learning. Consider the canonical problem setting
$$\min_{\theta} \; F(\theta) = \frac{1}{n} \sum_{i=1}^{n} f(\theta; z_i) \quad \text{s.t.} \quad \theta \in C \tag{1}$$
arXiv:2210.12277v4 [stat.ML] 6 Sep 2024
where $\{z_i\}_{i=1}^{n}$ is the observed data, $\theta$ is the model parameter, $f$ is typically a measure of fit,
and $C$ is a set imposing desired structure on the solution. Toward solving constrained and
regularized objectives, there are many variants of gradient methods popular in the literature.
We survey a number of these in Section 2; they often employ projection or proximal
steps to handle constraints and non-smooth penalties. Here we focus attention on a recent
method by Lange and Keys [2015] that is appealing in providing a broad framework to
handle many constrained problems gracefully. Called the proximal distance algorithm, their
approach transforms (1) to the unconstrained problem
$$\min_{\theta} \; F(\theta) + \frac{\rho}{2} \, \mathrm{dist}(\theta, C)^2, \tag{2}$$
where $\mathrm{dist}(\theta, C) = \inf_{\theta' \in C} \|\theta - \theta'\|_2$ denotes the Euclidean distance between $\theta$ and the
constraint set C. Using the idea of distance majorization [Chi et al., 2014], this reformulation
can be solved via iterative algorithms from the perspective of Majorization-Minimization, or
MM [Hunter and Lange, 2004, Mairal, 2015] as long as the projection operators are practical
to evaluate. This is the case for many common constraints including sparsity, rank, and
shape restrictions; a broad variety of examples are considered in Xu et al. [2017], Keys et al.
[2019], Landeros et al. [2022].
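For constraint sets like these, the projection onto $C$ is cheap to evaluate in closed form. As a brief illustration (not taken from the paper; function names are my own), sketches of two such projections:

```python
import numpy as np

def project_sparsity(theta, s):
    """Project onto {theta : ||theta||_0 <= s} by hard-thresholding:
    keep the s largest-magnitude entries and zero out the rest."""
    out = np.zeros_like(theta)
    keep = np.argsort(np.abs(theta))[-s:]  # indices of the s largest entries
    out[keep] = theta[keep]
    return out

def project_l2_ball(theta, r):
    """Project onto the Euclidean ball {theta : ||theta||_2 <= r}."""
    nrm = np.linalg.norm(theta)
    return theta if nrm <= r else (r / nrm) * theta
```

For nonconvex sets such as the sparsity constraint, the projection need not be unique; ties are broken arbitrarily.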
This proximal distance method can be viewed as a smooth version of the classical penalty
method due to Courant [1943], but under a squared penalty as in (2), the unconstrained
solutions recover the constrained solution only in the limit as $\rho \to \infty$ [Beltrami, 1970].
Early attempts to fix $\rho$ at a large value encountered slow progress, as the reciprocal $1/\rho$
plays the role of a step-size in the algorithm, discussed further below. This is supported
by recent analyses by Keys et al. [2019], Landeros et al. [2022] revealing a convergence rate
in convex cases of $\mathcal{O}(\rho k^{-1})$, where $k$ is the iteration counter. In light of this observation,
researchers have suggested gradually increasing $\rho$ within the iterative algorithm. Keys et al.
[2019] suggest setting an initial penalty value $\rho = \rho_0$, a rate $\gamma > 1$, and a frequency $m$, and
then updating $\rho_k = \rho_0 \gamma^{\lfloor k/m \rfloor}$, which increases roughly exponentially. While this is consistent
with the intuition of approaching the constrained solution, the traditional MM geometry
used to derive the algorithm, along with its corresponding guarantees, does not apply to
heuristics that entail a sequence of changing objective functions (2) indexed by the increasing
sequence $\rho_k$. These heuristics for increasing $\rho$ play an important role in practice so that
iterates eventually respect the constraint when the penalty becomes large, but allow for
large enough data-driven steps in the earlier stages of the algorithm.
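To make the heuristic concrete, the geometric schedule of Keys et al. [2019] described above can be sketched as follows (a minimal illustration; the function name and default values are hypothetical):

```python
def rho_schedule(k, rho0=1.0, gamma=1.2, m=10):
    """Penalty at iteration k: rho_k = rho0 * gamma**floor(k / m),
    i.e. rho is multiplied by gamma once every m iterations."""
    return rho0 * gamma ** (k // m)
```

Early iterations thus run with a small penalty (a large effective step size), while $\rho_k$ grows without bound so the iterates are eventually forced onto the constraint set.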
Fortunately, there is a large literature on related stochastic proximal methods, which
are increasingly prominent in statistical and machine learning problems involving large data
sets [Toulis et al., 2020, Lee et al., 2022]. In this paper, we draw new connections between
the proximal distance algorithm and these studies, leveraging techniques from their
convergence analyses to rigorously resolve open questions regarding the convergence of
the proximal distance algorithm under various $\rho_k$ schedules. Our point of departure is to
propose a stochastic version of the proximal distance algorithm. By evaluating the loss at
only a subsample of the data, we break from the original geometry of majorization from
which the method was derived. However, we take a different view by drawing analogies to
implicit gradient methods. Noting the relation between the penalty parameter and the step
size or learning rate in proximal/implicit algorithms, we establish connections to implicit
SGD [Toulis et al., 2014, Toulis and Airoldi, 2017, Lee et al., 2022], the proximal Robbins-Monro
algorithm [Toulis et al., 2020], and incremental proximal algorithms [Bertsekas, 2011].
We provide new theoretical analyses that reveal convergence rates in several regimes under
various polynomial learning rate schedules, requiring weaker assumptions than previous studies.
Our contributions bridge a theoretical gap in justifying heuristics commonly used in
practical implementations of the proximal distance algorithm. While the findings are largely
theoretical in nature, the resulting stochastic algorithm naturally enables large-scale data to
be analyzed that would be intractable under the full batch version of the method. The
remainder of this manuscript is organized as follows: we begin with an overview and necessary
background in Section 2. Next, Section 3 proposes our stochastic version of the proximal
distance algorithm, and discusses connections and differences to related stochastic proximal
methods. In Section 4, we present our theoretical analysis of the algorithm, establishing
finite error bounds, revealing convergence rates and providing a rigorous justification for
mechanisms to increase $\rho$. In Section 5, we conduct an empirical study to validate our analysis
and investigate behavior in practice for both convex and non-convex settings, while
demonstrating computational gains over the full version. We conclude and discuss future directions
in Section 6.
2 Background and Motivation
Proximal Algorithms Many powerful iterative algorithms for optimization involve the
proximal operator [Bauschke et al., 2011, Parikh et al., 2014], defined as
$$\mathrm{prox}_{f}(y) = \arg\min_{x} \; f(x) + \frac{1}{2} \|x - y\|^2.$$
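Many simple choices of $f$ admit a closed-form proximal operator. For example (an illustration of the definition, not material from the paper), for $f(x) = \lambda \|x\|_1$ the prox is coordinate-wise soft-thresholding:

```python
import numpy as np

def prox_l1(y, lam):
    """Proximal operator of f(x) = lam * ||x||_1: the soft-thresholding
    map, i.e. the unique minimizer of lam*||x||_1 + 0.5*||x - y||^2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
```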
A fundamental example is the proximal point algorithm [Rockafellar, 1976, Parikh et al.,
2014], which minimizes an objective $F$ with successive proximal operations:
$$\theta_{k+1} = \mathrm{prox}_{\alpha F}(\theta_k) = \arg\min_{\theta} \; F(\theta) + \frac{1}{2\alpha} \|\theta - \theta_k\|^2.$$
Intuitively, this seeks to shrink toward the current iterate $\theta_k$ when minimizing $F$, where a
parameter $\alpha$ determines the strength of the shrinkage. Methods such as proximal gradient
descent [Parikh et al., 2014, Li and Lin, 2015] consider composite objectives, alternating
between proximal operations and gradient descent steps.
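As a toy illustration of the proximal point iteration (the quadratic objective and helper name are my own, chosen so the prox has a closed form): for $F(\theta) = \frac{1}{2}\|\theta - b\|^2$, each prox step reduces to an explicit averaging that contracts toward the minimizer $b$.

```python
import numpy as np

def proximal_point_quadratic(theta0, b, alpha=1.0, iters=50):
    """Proximal point iterations for F(theta) = 0.5 * ||theta - b||^2.
    Solving argmin_theta F(theta) + (1/(2*alpha))*||theta - theta_k||^2
    gives the closed form (alpha * b + theta_k) / (alpha + 1), so each
    step contracts the error toward b by a factor of 1 / (alpha + 1)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        theta = (alpha * b + theta) / (alpha + 1.0)
    return theta
```

Larger $\alpha$ (weaker shrinkage toward the current iterate) yields faster contraction, mirroring the role $1/\rho$ plays as a step size in the proximal distance method.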
Of particular interest here is the proximal distance algorithm [Lange and Keys, 2015]. It
extends the penalty method of Courant [1943] which relaxes a constrained problem (1) to
an unconstrained problem of the form
$$\min_{\theta} \; F(\theta) + \rho \, q(\theta),$$
where $q(\theta) \geq 0$ is a penalty that vanishes on the constraint set: $q(\theta) = 0$ for all $\theta \in C$.
Solutions $\theta_\rho$ to this unconstrained problem are guaranteed to converge to the constrained
solution as $\rho \to \infty$ [Beltrami, 1970]. The proximal distance algorithm combines this idea
with distance majorization [Chi et al., 2014], taking the penalty $q$ to be the squared distance
between $\theta$ and the constraint set $C$. A key algorithmic component is the set projection: note
the squared distance between $\theta$ and $C$ can also be expressed as
$$q(\theta) = \mathrm{dist}(\theta, C)^2 = \|\theta - P_C(\theta)\|^2,$$
where $P_C(\theta) = \arg\min_{\theta' \in C} \|\theta - \theta'\|_2$ denotes the projection of $\theta$ onto $C$. Lange and Keys
[2015] utilize this to convert the constrained problem (1) to an unconstrained form:
$$\min_{\theta} \; F(\theta) + \frac{\rho}{2} \|\theta - P_C(\theta)\|^2. \tag{3}$$
Though (3) is often not directly solvable, the proximal distance algorithm makes use of
majorization-minimization (MM) to efficiently solve a sequence of simpler subproblems. In
more detail, distance majorization [Chi et al., 2014] implies that the following surrogate
function
$$F(\theta) + \frac{\rho}{2} \|\theta - P_C(\theta_{k-1})\|^2 \tag{4}$$
majorizes the expression in (3). The principle of MM then suggests minimizing the surrogate (4):
$$\theta_k = \arg\min_{\theta} \; F(\theta) + \frac{\rho}{2} \|\theta - P_C(\theta_{k-1})\|^2 = \mathrm{prox}_{\rho^{-1} F}\big[P_C(\theta_{k-1})\big]. \tag{5}$$
The resulting MM iteration consists of alternately defining surrogates and minimizing them
according to these updates. By ensuring a descent property, it is easy to show that the iterate
sequence $\theta_k$ converges to a local minimum $\theta_\rho$ of (3). In convex settings, $\theta_\rho$ is the global
minimizer.
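To illustrate update (5), here is a minimal sketch (all names illustrative; it assumes a least-squares loss and a sparsity constraint, for which both the projection and the prox step have closed forms):

```python
import numpy as np

def proximal_distance_sparse_ls(X, y, s, rho=1.0, iters=200):
    """Proximal distance iterations for min (1/2n)*||X theta - y||^2
    subject to ||theta||_0 <= s.  Each pass projects the current iterate
    onto the sparsity set (hard-thresholding), then solves the ridge-like
    surrogate system (X'X/n + rho*I) theta = X'y/n + rho*proj exactly."""
    n, p = X.shape
    theta = np.zeros(p)
    A = X.T @ X / n + rho * np.eye(p)
    b = X.T @ y / n
    for _ in range(iters):
        proj = np.zeros(p)                           # P_C(theta_{k-1})
        keep = np.argsort(np.abs(theta))[-s:]
        proj[keep] = theta[keep]
        theta = np.linalg.solve(A, b + rho * proj)   # update (5)
    return theta
```

In practice one would also refresh `A` (or its factorization) while increasing $\rho$ along a schedule $\rho_k$; the sketch holds $\rho$ fixed for simplicity.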
In practice, convergence can be slow when fixing a large value for $\rho$, a pitfall past
approaches seek to remedy by employing an increasing sequence of penalty parameters $\{\rho_k\}$.
When $\rho$ is small, the objective is dominated by the loss $F$, which drives the MM updates;
as $\rho$ grows larger, the penalty plays a stronger role, guiding $\{\theta_k\}$ toward the constraint set.
Heuristically increasing $\rho$ at an effective rate plays a key role in the practical performance
of the method, which we will investigate more closely below.
Stochastic Proximal Methods Robbins and Monro [1951] proposed a stochastic approximation
method for finding the roots of functions that lays the foundation for many
stochastic optimization methods used in large-scale statistical and machine learning problems.
A well-known instance is stochastic gradient descent (SGD) [Nemirovski et al., 2009,
Moulines and Bach, 2011, Bietti and Mairal, 2017], a method for smooth unconstrained
optimization that combines the Robbins-Monro scheme with gradient descent, using
subsamples to take approximate steps. It updates
$$\theta_{k+1} = \theta_k - \alpha_k \nabla f(\theta_k; z_{\xi_k}),$$
where $z_{\xi_k}$ denotes a randomly drawn observation from the dataset and $\alpha_k$ is the sequence of
learning rates. For convex cases, $\theta_k$ converges in probability to the solution whenever
$$\sum_{k=1}^{\infty} \alpha_k = \infty, \qquad \sum_{k=1}^{\infty} \alpha_k^2 < \infty.$$
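For instance, the schedule $\alpha_k = a/k$ satisfies both conditions. A self-contained sketch on a least-squares problem (the function and data are illustrative, not from the paper):

```python
import numpy as np

def sgd_least_squares(X, y, a=1.0, epochs=100, seed=0):
    """SGD for min (1/n) * sum_i 0.5 * (x_i' theta - y_i)^2 using
    single-sample gradients and the Robbins-Monro-compliant step
    sizes alpha_k = a / k (sum diverges, sum of squares converges)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    k = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            k += 1
            alpha = a / k                        # decaying learning rate
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient at one sample
            theta = theta - alpha * grad
    return theta
```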
While SGD enables many learning tasks in the large-scale and online settings, it can be
sensitive to the initial learning rate, with small learning rates yielding slow convergence and
large ones risking numerical instability.