The Stochastic Proximal Distance Algorithm
Haoyu Jiang
Department of Statistics, University of Illinois Urbana-Champaign
Jason Xu
Department of Statistical Science, Duke University
Abstract
Stochastic versions of proximal methods have gained much attention in statistics
and machine learning. These algorithms tend to admit simple, scalable forms, and
enjoy numerical stability via implicit updates. In this work, we propose and analyze
a stochastic version of the recently proposed proximal distance algorithm, a class of
iterative optimization methods that recover a desired constrained estimation problem
as a penalty parameter $\rho \to \infty$. By uncovering connections to related stochastic
proximal methods and interpreting the penalty parameter as the learning rate, we
justify heuristics used in practical manifestations of the proximal distance method,
establishing their convergence guarantees for the first time. Moreover, we extend recent
theoretical devices to establish finite error bounds and a complete characterization of
convergence rate regimes. We validate our analysis via a thorough empirical study,
also showing that, unsurprisingly, the proposed method outpaces batch versions on
popular learning tasks.
1 Introduction
Optimization of an objective function subject to constraints or regularization is ubiquitous
in statistics and machine learning. Consider the canonical problem setting
$$\min_{\theta} \; F(\theta) = \frac{1}{n} \sum_{i=1}^{n} f(\theta; z_i) \quad \text{s.t.} \quad \theta \in C \tag{1}$$
arXiv:2210.12277v4 [stat.ML] 6 Sep 2024
where $\{z_i\}_{i=1}^{n}$ is the observed data, $\theta$ is the model parameter, $f$ is typically a measure of fit,
and $C$ is a set imposing desired structure on the solution. Toward solving constrained and
regularized objectives, there are many variants of gradient methods popular in the literature.
We survey a number of these in Section 2; they often employ projection or proximal
steps to handle constraints and non-smooth penalties. Here we focus attention on a recent
method by Lange and Keys [2015] that is appealing in providing a broad framework to
handle many constrained problems gracefully. Called the proximal distance algorithm, their
approach transforms (1) to the unconstrained problem
$$\min_{\theta} \; F(\theta) + \frac{\rho}{2} \, \mathrm{dist}(\theta, C)^2, \tag{2}$$
where $\mathrm{dist}(\theta, C) = \inf_{\theta' \in C} \|\theta - \theta'\|_2$ denotes the Euclidean distance between $\theta$ and the
constraint set C. Using the idea of distance majorization [Chi et al., 2014], this reformulation
can be solved via iterative algorithms from the perspective of Majorization-Minimization, or
MM [Hunter and Lange, 2004, Mairal, 2015] as long as the projection operators are practical
to evaluate. This is the case for many common constraints including sparsity, rank, and
shape restrictions; a broad variety of examples are considered in Xu et al. [2017], Keys et al.
[2019], Landeros et al. [2022].
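For constraint sets like these, the projection onto $C$ is cheap to evaluate in closed form. As a brief illustration (not taken from the paper; function names are my own), sketches of two such projections:

```python
import numpy as np

def project_sparsity(theta, s):
    """Project onto {theta : ||theta||_0 <= s} by hard-thresholding:
    keep the s largest-magnitude entries and zero out the rest."""
    out = np.zeros_like(theta)
    keep = np.argsort(np.abs(theta))[-s:]  # indices of the s largest entries
    out[keep] = theta[keep]
    return out

def project_l2_ball(theta, r):
    """Project onto the Euclidean ball {theta : ||theta||_2 <= r}."""
    nrm = np.linalg.norm(theta)
    return theta if nrm <= r else (r / nrm) * theta
```

For nonconvex sets such as the sparsity constraint, the projection need not be unique; ties are broken arbitrarily.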
This proximal distance method can be viewed as a smooth version of the classical penalty
method due to Courant [1943], but under a squared penalty as in (2), the unconstrained
solutions recover the constrained solution only in the limit as $\rho \to \infty$ [Beltrami, 1970].
Early attempts to fix $\rho$ at a large value encountered slow progress, as the reciprocal $1/\rho$
plays the role of a step-size in the algorithm, discussed further below. This is supported
by recent analyses by Keys et al. [2019], Landeros et al. [2022] revealing a convergence rate
in convex cases of $\mathcal{O}(\rho k^{-1})$, where $k$ is the iteration counter. In light of this observation,
researchers have suggested gradually increasing $\rho$ within the iterative algorithm. Keys et al.
[2019] suggest setting an initial penalty value $\rho = \rho_0$, a rate $\gamma > 1$, and a frequency $m$, and
then updating $\rho_k = \rho_0 \gamma^{\lfloor k/m \rfloor}$, which increases roughly exponentially. While this is consistent
with the intuition of approaching the constrained solution, the traditional MM geometry
used to derive the algorithm, along with its corresponding guarantees, does not apply to
heuristics that entail a sequence of changing objective functions (2) indexed by the increasing
sequence $\rho_k$. These heuristics for increasing $\rho$ play an important role in practice so that
iterates eventually respect the constraint when the penalty becomes large, but allow for
large enough data-driven steps in the earlier stages of the algorithm.
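To make the heuristic concrete, the geometric schedule of Keys et al. [2019] described above can be sketched as follows (a minimal illustration; the function name and default values are hypothetical):

```python
def rho_schedule(k, rho0=1.0, gamma=1.2, m=10):
    """Penalty at iteration k: rho_k = rho0 * gamma**floor(k / m),
    i.e. rho is multiplied by gamma once every m iterations."""
    return rho0 * gamma ** (k // m)
```

Early iterations thus run with a small penalty (a large effective step size), while $\rho_k$ grows without bound so the iterates are eventually forced onto the constraint set.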
Fortunately, there is a large literature on related stochastic proximal methods, which
are increasingly prominent in statistical and machine learning problems involving large data
sets [Toulis et al., 2020, Lee et al., 2022]. In this paper, we draw new connections between
the proximal distance algorithm and these studies, leveraging techniques from their
convergence analyses to rigorously resolve open questions regarding the convergence of
the proximal distance algorithm under various $\rho_k$ schedules. Our point of departure is to
propose a stochastic version of the proximal distance algorithm. By evaluating the loss at
only a subsample of the data, we break from the original geometry of majorization from
which the method was derived. However, we take a different view by drawing analogies to
implicit gradient methods. Noting the relation between the penalty parameter and the step
size or learning rate in proximal/implicit algorithms, we establish connections to implicit
SGD [Toulis et al., 2014, Toulis and Airoldi, 2017, Lee et al., 2022], the proximal Robbins-Monro
algorithm [Toulis et al., 2020], and incremental proximal algorithms [Bertsekas, 2011].
We provide new theoretical analyses that reveal convergence rates in several regimes under
various polynomial learning rate schedules, requiring weaker assumptions than previous studies.
Our contributions bridge a theoretical gap in justifying heuristics commonly used in
practical implementations of the proximal distance algorithm. While the findings are largely
theoretical in nature, the resulting stochastic algorithm naturally enables large-scale data to
be analyzed that would be intractable under the full batch version of the method. The
remainder of this manuscript is organized as follows: we begin with an overview and necessary
background in Section 2. Next, Section 3 proposes our stochastic version of the proximal
distance algorithm, and discusses connections and differences to related stochastic proximal
methods. In Section 4, we present our theoretical analysis of the algorithm, establishing
finite error bounds, revealing convergence rates and providing a rigorous justification for
mechanisms to increase $\rho$. In Section 5, we conduct an empirical study to validate our analysis
and investigate behavior in practice for both convex and non-convex settings, while
demonstrating computational gains over the full version. We conclude and discuss future directions
in Section 6.
2 Background and Motivation
Proximal Algorithms Many powerful iterative algorithms for optimization involve the
proximal operator [Bauschke et al., 2011, Parikh et al., 2014], defined as
$$\mathrm{prox}_{f}(y) = \arg\min_{x} \; f(x) + \frac{1}{2} \|x - y\|^2.$$
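Many simple choices of $f$ admit a closed-form proximal operator. For example (an illustration of the definition, not material from the paper), for $f(x) = \lambda \|x\|_1$ the prox is coordinate-wise soft-thresholding:

```python
import numpy as np

def prox_l1(y, lam):
    """Proximal operator of f(x) = lam * ||x||_1: the soft-thresholding
    map, i.e. the unique minimizer of lam*||x||_1 + 0.5*||x - y||^2."""
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)
```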
A fundamental example is the proximal point algorithm [Rockafellar, 1976, Parikh et al.,
2014], which minimizes an objective $F$ with successive proximal operations:
$$\theta_{k+1} = \mathrm{prox}_{\alpha F}(\theta_k) = \arg\min_{\theta} \; F(\theta) + \frac{1}{2\alpha} \|\theta - \theta_k\|^2.$$
Intuitively, this seeks to shrink toward the current iterate $\theta_k$ when minimizing $F$, where a
parameter $\alpha$ determines the strength of the shrinkage. Methods such as proximal gradient
descent [Parikh et al., 2014, Li and Lin, 2015] consider composite objectives, alternating
between proximal operations and gradient descent steps.
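As a toy illustration of the proximal point iteration (the quadratic objective and helper name are my own, chosen so the prox has a closed form): for $F(\theta) = \frac{1}{2}\|\theta - b\|^2$, each prox step reduces to an explicit averaging that contracts toward the minimizer $b$.

```python
import numpy as np

def proximal_point_quadratic(theta0, b, alpha=1.0, iters=50):
    """Proximal point iterations for F(theta) = 0.5 * ||theta - b||^2.
    Solving argmin_theta F(theta) + (1/(2*alpha))*||theta - theta_k||^2
    gives the closed form (alpha * b + theta_k) / (alpha + 1), so each
    step contracts the error toward b by a factor of 1 / (alpha + 1)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        theta = (alpha * b + theta) / (alpha + 1.0)
    return theta
```

Larger $\alpha$ (weaker shrinkage toward the current iterate) yields faster contraction, mirroring the role $1/\rho$ plays as a step size in the proximal distance method.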
Of particular interest here is the proximal distance algorithm [Lange and Keys, 2015]. It
extends the penalty method of Courant [1943] which relaxes a constrained problem (1) to
an unconstrained problem of the form
$$\min_{\theta} \; F(\theta) + \rho \, q(\theta),$$
where $q(\theta) \geq 0$ is a penalty that vanishes on the constraint set: $q(\theta) = 0$ for all $\theta \in C$.
Solutions $\theta_\rho$ to this unconstrained problem are guaranteed to converge to the constrained
solution as $\rho \to \infty$ [Beltrami, 1970]. The proximal distance algorithm combines this idea
with distance majorization [Chi et al., 2014], taking the penalty $q$ to be the squared distance
between $\theta$ and the constraint set $C$. A key algorithmic component is the set projection: note
the squared distance between $\theta$ and $C$ can also be expressed as
$$q(\theta) = \mathrm{dist}(\theta, C)^2 = \|\theta - P_C(\theta)\|^2,$$
where $P_C(\theta) = \arg\min_{\theta' \in C} \|\theta - \theta'\|_2$ denotes the projection of $\theta$ onto $C$. Lange and Keys
[2015] utilize this to convert the constrained problem (1) to an unconstrained form:
$$\min_{\theta} \; F(\theta) + \frac{\rho}{2} \|\theta - P_C(\theta)\|^2. \tag{3}$$
Though (3) is often not directly solvable, the proximal distance algorithm makes use of
majorization-minimization (MM) to efficiently solve a sequence of simpler subproblems. In
more detail, distance majorization [Chi et al., 2014] implies that the following surrogate
function
$$F(\theta) + \frac{\rho}{2} \|\theta - P_C(\theta_{k-1})\|^2 \tag{4}$$
majorizes the expression in (3). The principle of MM then suggests minimizing the surrogate (4):
$$\theta_k = \arg\min_{\theta} \; F(\theta) + \frac{\rho}{2} \|\theta - P_C(\theta_{k-1})\|^2 = \mathrm{prox}_{\rho^{-1} F}\big[P_C(\theta_{k-1})\big]. \tag{5}$$
The resulting MM iteration consists of alternately defining surrogates and minimizing them
according to these updates. By ensuring a descent property, it is easy to show that the iterate
sequence $\theta_k$ converges to a local minimum $\theta_\rho$ of (3). In convex settings, $\theta_\rho$ is the global
minimizer.
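To illustrate update (5), here is a minimal sketch (all names illustrative; it assumes a least-squares loss and a sparsity constraint, for which both the projection and the prox step have closed forms):

```python
import numpy as np

def proximal_distance_sparse_ls(X, y, s, rho=1.0, iters=200):
    """Proximal distance iterations for min (1/2n)*||X theta - y||^2
    subject to ||theta||_0 <= s.  Each pass projects the current iterate
    onto the sparsity set (hard-thresholding), then solves the ridge-like
    surrogate system (X'X/n + rho*I) theta = X'y/n + rho*proj exactly."""
    n, p = X.shape
    theta = np.zeros(p)
    A = X.T @ X / n + rho * np.eye(p)
    b = X.T @ y / n
    for _ in range(iters):
        proj = np.zeros(p)                           # P_C(theta_{k-1})
        keep = np.argsort(np.abs(theta))[-s:]
        proj[keep] = theta[keep]
        theta = np.linalg.solve(A, b + rho * proj)   # update (5)
    return theta
```

In practice one would also refresh `A` (or its factorization) while increasing $\rho$ along a schedule $\rho_k$; the sketch holds $\rho$ fixed for simplicity.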
In practice, convergence can be slow when fixing a large value for $\rho$, a pitfall past
approaches seek to remedy by employing an increasing sequence of penalty parameters $\{\rho_k\}$.
When $\rho$ is small, the objective is dominated by the loss $F$, which drives the MM updates;
as $\rho$ grows larger, the penalty plays a stronger role, guiding $\{\theta_k\}$ toward the constraint set.
Heuristically increasing $\rho$ at an effective rate plays a key role in the practical performance
of the method, which we will investigate more closely below.
Stochastic Proximal Methods Robbins and Monro [1951] proposed a stochastic approximation
method for finding the roots of functions that lays the foundation for many
stochastic optimization methods used in large-scale statistical and machine learning problems.
A well-known instance is stochastic gradient descent (SGD) [Nemirovski et al., 2009,
Moulines and Bach, 2011, Bietti and Mairal, 2017], a method for smooth unconstrained
optimization that combines the Robbins-Monro scheme with gradient descent, using
subsamples to take approximate steps. It updates
$$\theta_{k+1} = \theta_k - \alpha_k \nabla f(\theta_k; z_{\xi_k}),$$
where $z_{\xi_k}$ denotes a randomly drawn observation from the dataset and $\alpha_k$ is the sequence of
learning rates. For convex cases, $\theta_k$ converges in probability to the solution whenever
$$\sum_{k=1}^{\infty} \alpha_k = \infty, \qquad \sum_{k=1}^{\infty} \alpha_k^2 < \infty.$$
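For instance, the schedule $\alpha_k = a/k$ satisfies both conditions. A self-contained sketch on a least-squares problem (the function and data are illustrative, not from the paper):

```python
import numpy as np

def sgd_least_squares(X, y, a=1.0, epochs=100, seed=0):
    """SGD for min (1/n) * sum_i 0.5 * (x_i' theta - y_i)^2 using
    single-sample gradients and the Robbins-Monro-compliant step
    sizes alpha_k = a / k (sum diverges, sum of squares converges)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    theta = np.zeros(p)
    k = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            k += 1
            alpha = a / k                        # decaying learning rate
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient at one sample
            theta = theta - alpha * grad
    return theta
```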
While SGD enables many learning tasks in the large-scale and online settings, it can be
sensitive to the initial learning rate, with small learning rates yielding slow convergence and
large ones risking numerical instability.