Adaptive Data Fusion for Multi-task Non-smooth Optimization
Henry Lam*   Kaizheng Wang†   Yuhang Wu‡   Yichen Zhang§
October 2022
Author names are sorted alphabetically.
* Henry Lam: Department of IEOR, Columbia University. Email: henry.lam@columbia.edu.
† Kaizheng Wang: Department of IEOR and Data Science Institute, Columbia University. Email: kaizheng.wang@columbia.edu.
‡ Yuhang Wu: Department of Mathematics, Columbia University. Email: yuhang.wu@columbia.edu.
§ Yichen Zhang: Krannert School of Management, Purdue University. Email: zhang@purdue.edu.
Abstract
We study the problem of multi-task non-smooth optimization that arises ubiquitously in
statistical learning, decision-making and risk management. We develop a data fusion approach
that adaptively leverages commonalities among a large number of objectives to improve sample
efficiency while tackling their unknown heterogeneities. We provide sharp statistical guarantees
for our approach. Numerical experiments on both synthetic and real data demonstrate significant
advantages of our approach over benchmarks.
1 INTRODUCTION
In most machine-learning contexts, algorithm developers and theorists are concerned with solving a single task or optimizing a single metric at a time. Nonetheless, even in the big data era, datasets are expensive and oftentimes collected for a large number of tasks, and a model built on a single task likely hits a performance ceiling due to the limited sample size, without fully exploiting a dataset that features multiple tasks. For instance, in inventory management, the technology hype cycle is getting shorter. It is increasingly critical for retailers to recognize the consumption patterns of customers as early as possible, so as to minimize the costs caused by backordering and holding. Since sales data are limited at the early stage of operations, decision making can be challenging. Nevertheless, a retailer usually sells multiple products in a store or manages multiple stores that sell the same product. These naturally define a group of related tasks. The retailer may pool the datasets together to obtain better estimates and decisions. By sharing representations between related tasks, multi-task learning conceptually helps the model generalize better on individual tasks (Caruana, 1997).
That being said, the relatedness of the tasks is implicit and hard to quantify. An overlooked or misspecified relationship among tasks can hurt the performance of data pooling across multiple tasks. When the tasks are highly distinct, naive multi-task learning procedures can underperform single-task learning (STL) procedures that tackle each task separately.
To utilize the commonalities among datasets related to multiple tasks, statistical models can help us develop a family of reliable approaches with theoretical quantification of multi-task performance while adapting to the unknown task relatedness. To this end, we propose a data fusion approach that tackles unknown heterogeneities among the tasks and improves data utilization for each individual task. Particularly, consider $m$ tasks of empirical risk minimization, where the $j$-th task is to minimize the empirical risk $f_j(\theta) = \frac{1}{n_j} \sum_{i=1}^{n_j} \ell(\theta, \xi_{ji})$ over the parameter of interest $\theta$, given data $\{\xi_{ji}\}_{i=1}^{n_j} \sim P_j$.
We propose to minimize an augmented objective
$$F_{w,\lambda}(\Theta, \beta) = \sum_{j=1}^{m} w_j \left[ f_j(\theta_j) + \lambda_j \|\theta_j - \beta\|_2 \right]$$
jointly over task-specific estimators $\Theta = (\theta_1, \cdots, \theta_m) \in \mathbb{R}^{d \times m}$ and a multi-task center $\beta \in \mathbb{R}^d$ to be learned. Weighting hyperparameters $w_j$ are specified to reflect the importance of the information regarding each individual task (e.g. $w_j = n_j$). The regularization term $\lambda_j \|\theta_j - \beta\|_2$ drives the estimator $\theta_j$ of each individual task towards a common center $\beta$, with strength parameterized by $(\lambda_1, \cdots, \lambda_m)$. It is straightforward to see that our method interpolates between minimizing the risk of each individual task as $\lambda_j$ approaches zero, and a robust pooling of the individual minimizers as $\lambda_j$ increases.
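To make the objective concrete, here is a minimal numpy sketch of $F_{w,\lambda}$; the helper name and its interface are our own illustration rather than code from the paper. Setting every `lam[j]` to zero decouples the tasks, while large values tie every column of `Theta` to `beta`.

```python
import numpy as np

def augmented_objective(Theta, beta, losses, w, lam):
    """F_{w,lam}(Theta, beta) = sum_j w[j] * (f_j(theta_j) + lam[j] * ||theta_j - beta||_2).

    Theta  : (d, m) array, one column of task-specific parameters per task.
    beta   : (d,) array, the shared multi-task center.
    losses : list of m callables, losses[j](theta) evaluates the empirical risk f_j.
    """
    return sum(
        w[j] * (losses[j](Theta[:, j]) + lam[j] * np.linalg.norm(Theta[:, j] - beta))
        for j in range(Theta.shape[1])
    )
```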
Our key contribution is the analysis of the aforementioned procedure in a wide range of problems where the losses $\{f_j\}_{j=1}^m$ are convex but allowed to be non-smooth. We prove that the proposed estimator automatically adapts to the unknown similarity among the tasks. In our motivating example, the cost function is naturally a piecewise linear, non-smooth convex function closely related to quantile regression. Other examples include linear max-margin classifiers as well as threshold regression models. Since the objective functions of these models are not differentiable at many points, technical challenges arise in the uniform concentration results and convergence rates, as the subgradient is now a set-valued mapping and not continuous. Nonetheless, with statistical modeling, a theoretical analysis under such scenarios becomes possible.
In addition to the theoretical guarantees, we experiment with the numerical procedures on both synthetic data and a real-world dataset for the newsvendor problem in Section 5. The experiments reveal a steady and reliable performance benefit of the proposed method over benchmarks, with significant improvements over STL when data are scarce, and over blindly pooling the data together. The proposed method offers a reliable procedure for practitioners to leverage the possible relatedness between tasks in inventory decision-making, financial risk management, and many other applications.
1.1 Related Work
Multi-task learning based on parameter augmentation, such as the introduction of the common center $\beta$ in our method, has achieved great empirical success (Evgeniou and Pontil, 2004; Jalali et al., 2013; Chen et al., 2011). Our estimator originates from the framework of Adaptive and Robust Multi-task Learning (ARMUL) proposed by Duan and Wang (2022), but we relax the smoothness and strong convexity conditions on the empirical risks $f_j$, so that the analysis extends to many real-world applications ranging from statistical learning to inventory decision-making to financial risk management. The motivating inventory management example, often known as the data-driven newsvendor problem, can be expressed as a quantile regression problem with the quantile level determined by the ratio of the per-unit backorder cost to the sum of the backorder and holding costs (Levi et al., 2007, 2015; Ban and Rudin, 2019). The objective function, also known as the "check function", is convex but not differentiable. These applications coincide with the classical quantile regression in the statistics and econometrics literature, dating back to Koenker and Bassett Jr (1978), which estimates the conditional quantile of the response variable across values of predictive covariates. Besides the aforementioned newsvendor problems in inventory management, quantile regression finds a wide range of applications in survival data analysis (Koenker and Geling, 2001; Wang and Wang, 2014), financial risk management (Engle and Manganelli, 1999; Rockafellar et al., 2000) and many other fields. We refer the reader to Koenker (2005); Koenker et al. (2017) for an extensive overview of quantile regression.
Numerous applications arising in various fields call for a generalization of multi-task learning (MTL) to non-smooth objectives. Despite its practical importance, MTL for non-smooth objectives remains largely unexplored in statistical learning, with several exceptions including Fan et al. (2016); Chao et al. (2021); Kan et al. (2022). In contrast to our framework, they studied quantile regression with multivariate response variables in linear models and neural networks, respectively, treating each response variable as a different task sharing the same observed covariates. Since their models are based on shared covariates across tasks, they typically imposed a factor structure and augmented the objective with a rank-based regularization on the matrix $\Theta$. On the contrary, our framework allows different covariates in each task. As such, we regularize the objective function with a penalty driving towards a robust center of all tasks and jointly optimize over the individual estimators and the intrinsic center. Our analysis also complements and generalizes the limited existing literature on non-smooth quantile regression in large-scale or distributed datasets (Volgushev et al., 2019; Chen et al., 2021), which considered only quantile regression with homogeneous tasks. Several other works considered a framework similar to ours from an empirical Bayes perspective. Gupta and Kallus (2022) developed a data pooling procedure for data-driven newsvendor problems that shrinks the empirical distribution of each individual task towards a weighted global empirical distribution according to an anchor distribution; the data distributions there have finite supports. Mukherjee et al. (2015) focused on the Gaussian setting, studied the predictive risk instead of the estimation error, and proposed a shrinkage estimator towards a simultaneously optimized data-driven location.
1.2 Notation
The constants $c_1, c_2, C, C_1, C_2, \cdots$ may differ from line to line. We use $[n]$ as a shorthand for $\{1, 2, \cdots, n\}$ and $|\cdot|$ to denote the absolute value of a real number or the cardinality of a set. $\|A\| = \sup_{\|x\|_2 = 1} \|Ax\|_2$ denotes the spectral norm. Let $\mathbb{Z}_+$ be the set of positive integers and $\mathbb{R}_+ = \{x \in \mathbb{R} : x \geq 0\}$. Define $x_+ = \max\{x, 0\}$ and $x_- = \max\{-x, 0\}$ for $x \in \mathbb{R}$. Define $B(x, r) = \{y \in \mathbb{R}^d : \|y - x\|_2 \leq r\}$ for $x \in \mathbb{R}^d$ and $r \geq 0$.
2 PROBLEM FORMULATION
Let $m \in \mathbb{Z}_+$ be the number of tasks and $\mathcal{X}$ be the sample space, and let $\ell: \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$ be a (non-smooth) loss function. For every $j \in [m]$, let $P_j$ be a probability distribution over $\mathcal{X}$ and $\mathcal{D}_j = \{\xi_{ji}\}_{i=1}^{n_j}$ be $n_j$ independent samples drawn from $P_j$. The $j$-th task is to estimate the population loss minimizer
$$\theta_j^* \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \mathbb{E}_{\xi \sim P_j} \ell(\theta, \xi)$$
from the data. Denote by $\Theta^* = (\theta_1^*, \cdots, \theta_m^*) \in \mathbb{R}^{d \times m}$ the parameter matrix.
Define the empirical loss function of the $j$-th task as
$$f_j(\theta) = \frac{1}{n_j} \sum_{i=1}^{n_j} \ell(\theta, \xi_{ji}).$$
Two straightforward strategies are single-task learning (STL) and data pooling (DP). The former corresponds to solving the individual tasks separately, i.e.,
$$\widehat{\theta}_j = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} f_j(\theta), \qquad \forall j \in [m].$$
The latter corresponds to merging all datasets to train a single model, i.e.,
$$\widehat{\theta}_1 = \cdots = \widehat{\theta}_m = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \sum_{j=1}^{m} n_j f_j(\theta).$$
These two strategies have intrinsic shortcomings: STL does not take full advantage of the data available, while DP has a high risk of model misspecification. To resolve this issue, define
$$F_{w,\lambda}(\Theta, \beta) = \sum_{j=1}^{m} w_j \left[ f_j(\theta_j) + \lambda_j \|\theta_j - \beta\|_2 \right], \tag{2.1}$$
where $\Theta = (\theta_1, \cdots, \theta_m) \in \mathbb{R}^{d \times m}$, $\beta \in \mathbb{R}^d$, $w = (w_1, \cdots, w_m) \in \mathbb{R}_+^m$ are weight parameters (e.g. $w_j = n_j$), and $\lambda = (\lambda_1, \cdots, \lambda_m) \in \mathbb{R}_+^m$ are penalty parameters. We propose to solve the augmented program
$$(\widehat{\Theta}, \widehat{\beta}) \in \operatorname*{argmin}_{\Theta \in \mathbb{R}^{d \times m},\, \beta \in \mathbb{R}^d} F_{w,\lambda}(\Theta, \beta), \tag{2.2}$$
where each task receives its own estimate $\widehat{\theta}_j$ while the penalty terms shrink the $\widehat{\theta}_j$'s toward $\widehat{\beta}$ to promote similarity among the estimated models. This is a convex program so long as $\{f_j\}_{j=1}^m$ are all convex. If we choose $\lambda = \mathbf{0}$, we return to the STL setting; if we choose a sufficiently large $\lambda$, the cusp of the $\ell_2$ penalty at zero enforces the strict equality $\widehat{\theta}_j = \widehat{\beta}$ for all $j \in [m]$, effectively pooling all the data. Therefore, it is desirable to choose a suitable $\lambda$ that automatically adapts to whichever situation proves more suitable.
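For a concrete sense of how (2.2) can be solved, the following is a minimal sketch using the off-the-shelf convex solver cvxpy, assuming each $f_j$ can be expressed as a convex cvxpy expression. The function `solve_fused` and its loss-builder interface are our own illustration, not the authors' implementation.

```python
import cvxpy as cp

def solve_fused(loss_expr, d, m, w, lam):
    """Minimal sketch of (2.2): jointly minimize
    sum_j w[j] * ( f_j(theta_j) + lam[j] * ||theta_j - beta||_2 ).

    loss_expr(j, theta) must return a convex cvxpy expression for f_j(theta).
    """
    Theta = cp.Variable((d, m))   # one column of task-specific parameters per task
    beta = cp.Variable(d)         # the multi-task center
    objective = sum(
        w[j] * (loss_expr(j, Theta[:, j]) + lam[j] * cp.norm(Theta[:, j] - beta, 2))
        for j in range(m)
    )
    cp.Problem(cp.Minimize(objective)).solve()
    return Theta.value, beta.value

# Example loss builder: per-task quantile regression on data (X_list[j], y_list[j]),
# using the identity rho_tau(z) = max(tau * z, (tau - 1) * z):
#   loss_expr = lambda j, theta: cp.sum(cp.maximum(
#       tau * (y_list[j] - X_list[j] @ theta),
#       (tau - 1) * (y_list[j] - X_list[j] @ theta))) / len(y_list[j])
```

Passing `lam = [0] * m` recovers STL, while making every `lam[j]` sufficiently large forces all columns of `Theta` to coincide with `beta`, recovering DP.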
Note that (2.2) belongs to the framework of Adaptive and Robust Multi-task Learning (ARMUL) proposed by Duan and Wang (2022). However, the theoretical guarantees of ARMUL require $\{f_j\}_{j=1}^m$ to be smooth and locally strongly convex near the minimizers. As such, the scenarios where ARMUL is powerful include, for example, multi-task linear regression and multi-task logistic regression. In contrast, our theoretical results relax the smoothness condition and extend to scenarios where $\{f_j\}_{j=1}^m$ are non-smooth, which are ubiquitous in statistical learning and operations research.
3 EXAMPLES
Here we introduce three motivating examples in statistical learning and operations management.
3.1 Newsvendor Problem
Suppose a retailer sells a perishable good that needs to be prepared/stocked/ordered in advance. Let $D$ be a random variable representing the market demand. The retailer needs to decide a quantity $q$ of goods to prepare (e.g. raw materials to buy, food to defrost) ahead of time in order to minimize the expected cost, a combination of the backorder/underage and holding/overage costs:
$$\mathbb{E}_D \left[ b (D - q)_+ + h (D - q)_- \right],$$
where $b$ and $h$ are the backorder and holding costs per unit, respectively.
In practice, the distribution of the demand $D$ is not known beforehand. Instead, the information available is a set of independent random samples $\{D_i\}_{i=1}^n$ drawn from it. We can estimate the optimal order quantity by $\widehat{q} \in \operatorname*{argmin}_{q \in \mathbb{R}} f(q)$, where the objective function
$$f(q) = \frac{1}{n} \sum_{i=1}^{n} \left[ b (D_i - q)_+ + h (D_i - q)_- \right]$$
is non-smooth. If we define the check loss $\rho_\tau(z) = (1 - \tau) z_- + \tau z_+$, then $f(q)$ is proportional to $\frac{1}{n} \sum_{i=1}^{n} \rho_\tau(D_i - q)$ with $\tau = b / (b + h)$. The solution $\widehat{q}$ is the $\tau$-th sample quantile of the data $\{D_i\}_{i=1}^n$.
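As a quick numerical sanity check of this equivalence (with made-up exponential demand; all numbers are illustrative), one can verify that the minimizer of the empirical cost matches the $\tau$-th sample quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.exponential(scale=10.0, size=500)   # hypothetical demand samples
b, h = 3.0, 1.0                             # per-unit backorder and holding costs
tau = b / (b + h)                           # critical ratio: tau = 0.75

def empirical_cost(q):
    # f(q) = (1/n) * sum_i [ b*(D_i - q)_+ + h*(D_i - q)_- ]
    return np.mean(b * np.maximum(D - q, 0.0) + h * np.maximum(q - D, 0.0))

grid = np.linspace(D.min(), D.max(), 20_000)
q_hat = grid[np.argmin([empirical_cost(q) for q in grid])]
print(q_hat, np.quantile(D, tau))           # the two agree up to grid resolution
```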
The classical newsvendor problem above assumes that the holding and backorder costs grow linearly with the quantity surplus and deficit, respectively. We can relax this assumption to cases where the two costs are replaced with $B((D - q)_+)$ and $H((D - q)_-)$ for general functions $B, H: \mathbb{R}_+ \to \mathbb{R}_+$ that are convex, non-decreasing, and satisfy $B(0) = H(0)$. Given the data $\{D_i\}_{i=1}^n$, it is natural to estimate the best order quantity by minimizing the loss function
$$f(q) = \frac{1}{n} \sum_{i=1}^{n} \left[ B((D_i - q)_+) + H((D_i - q)_-) \right].$$
In modern newsvendor problems, the data for a specific product at one store can be quite limited.
Fortunately, multiple products in the same store or multiple stores in a nearby region have similar
sales patterns. A joint analysis of the datasets by multi-task learning facilitates decision making.
3.2 Quantile Regression
Denote by $F_{Y|X}$ the conditional CDF of a response $Y \in \mathbb{R}$ given covariates $X \in \mathbb{R}^d$. Define the $\tau$-th conditional quantile of $Y$ given $X \in \mathbb{R}^d$ as $Q_{Y|X}(\tau) = \inf\{y : F_{Y|X}(y) \geq \tau\}$. Assume that $Q_{Y|X}(\tau) = X^\top \theta^*$ holds for some $\theta^* \in \mathbb{R}^d$. Given $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ from some joint distribution $P$, we can estimate $\theta^*$ by $\widehat{\theta} \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} f(\theta)$, where the objective
$$f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(y_i - x_i^\top \theta)$$
is non-smooth. This is the quantile regression of statistics (Koenker and Bassett Jr, 1978), which targets the conditional quantile of the response; in contrast, least squares regression aims to estimate the conditional mean. When we collect data from multiple populations (e.g. different geographical locations), multi-task learning helps utilize their commonality while tackling the heterogeneity.
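A minimal sketch of this estimator with cvxpy on synthetic data (the data-generating choices are ours, for illustration only):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n, d, tau = 500, 3, 0.9
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])  # intercept + covariates
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + rng.normal(size=n)     # synthetic data for illustration

theta = cp.Variable(d)
r = y - X @ theta
# check loss identity: rho_tau(z) = max(tau*z, (tau-1)*z) = (1-tau)*z_- + tau*z_+
f = cp.sum(cp.maximum(tau * r, (tau - 1) * r)) / n
cp.Problem(cp.Minimize(f)).solve()
# theta.value is close to theta_star, except the intercept absorbs the
# tau-quantile of the Gaussian noise
print(theta.value)
```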
3.3 Support Vector Machine
Consider a binary classification problem where one wants to predict the label $Y \in \{\pm 1\}$ from covariates $X \in \mathbb{R}^d$. A popular method for training linear classifiers of the form $X \mapsto \operatorname{sgn}(X^\top \theta)$ is the support vector machine (SVM) (Cortes and Vapnik, 1995). Given the data $\{(x_i, y_i)\}_{i=1}^n$, the (soft-margin) SVM amounts to minimizing the empirical loss function
$$f(\theta) = \frac{1}{n} \sum_{i=1}^{n} (1 - y_i x_i^\top \theta)_+ + \mu \|\theta\|_2^2,$$
where $\mu \geq 0$ is a penalty parameter. Here $f$ is non-smooth. SVM has demonstrated superior performance in binary classification problems. In multi-class and multi-label settings, multi-task SVM is a popular approach where each task is to distinguish a pair of classes.
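A minimal cvxpy sketch of the soft-margin objective on synthetic data (all names and numbers are illustrative):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
n, d, mu = 300, 5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))         # synthetic labels for illustration

theta = cp.Variable(d)
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ theta))) / n   # non-smooth hinge loss
ridge = mu * cp.sum_squares(theta)                          # mu * ||theta||_2^2
cp.Problem(cp.Minimize(hinge + ridge)).solve()
```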
4 THEORETICAL ANALYSIS
In this section, we provide a non-asymptotic analysis of (2.2). Of particular interest to us is to
generalize the results of Duan and Wang (2022) to non-smooth empirical loss functions. While the