Adaptive Data Fusion for Multi-task Non-smooth Optimization
Henry Lam*   Kaizheng Wang†   Yuhang Wu‡   Yichen Zhang§
October 2022
Author names are sorted alphabetically.
* Henry Lam: Department of IEOR, Columbia University. Email: henry.lam@columbia.edu.
† Kaizheng Wang: Department of IEOR and Data Science Institute, Columbia University. Email: kaizheng.wang@columbia.edu.
‡ Yuhang Wu: Department of Mathematics, Columbia University. Email: yuhang.wu@columbia.edu.
§ Yichen Zhang: Krannert School of Management, Purdue University. Email: zhang@purdue.edu.
Abstract
We study the problem of multi-task non-smooth optimization that arises ubiquitously in
statistical learning, decision-making and risk management. We develop a data fusion approach
that adaptively leverages commonalities among a large number of objectives to improve sample
efficiency while tackling their unknown heterogeneities. We provide sharp statistical guarantees
for our approach. Numerical experiments on both synthetic and real data demonstrate significant
advantages of our approach over benchmarks.
1 INTRODUCTION
In most machine-learning contexts, algorithm developers and theorists are concerned with solving a single task or optimizing a single metric at a time. Nonetheless, even in the big data era, datasets are expensive and oftentimes collected for a large number of tasks, and a model built on a single task likely hits a performance ceiling due to the limited sample size, without fully exploiting a dataset that features multiple tasks. For instance, in inventory management, the technology hype cycle is getting shorter. It is increasingly critical for retailers to recognize the consumption patterns of customers as early as possible, so as to minimize the costs caused by backordering and holding. Since sales data are limited at the early stage of operations, decision making can be challenging. Nevertheless, a retailer usually sells multiple products in a store or manages multiple stores that sell the same product. These naturally define a group of related tasks. The retailer may pool the datasets together to obtain better estimates and decisions. By sharing representations between related tasks, multi-task learning conceptually helps the model generalize better on individual tasks (Caruana, 1997).
That being said, the relatedness of the tasks is implicit and hard to quantify. An overlooked or misspecified relationship among tasks can hurt the performance of data pooling across multiple tasks. When the tasks are highly distinct, naive multi-task learning procedures can underperform single-task learning (STL) procedures that tackle each task separately.
To utilize the commonalities among datasets related to multiple tasks, statistical models can help us develop a family of reliable approaches with theoretical quantification of multi-task performance while adapting to the unknown task relatedness. To this end, we propose a data fusion approach that tackles unknown heterogeneities among the tasks and improves data utilization for each individual task. Particularly, consider $m$ tasks of empirical risk minimization, where the $j$-th task is to minimize the empirical risk $f_j(\theta) = \frac{1}{n_j} \sum_{i=1}^{n_j} \ell(\theta, \xi_{ji})$ over the parameter of interest $\theta$, given data $\{\xi_{ji}\}_{i=1}^{n_j} \sim P_j$.
We propose to minimize an augmented objective
$$F_{w,\lambda}(\Theta, \beta) = \sum_{j=1}^{m} w_j \left[ f_j(\theta_j) + \lambda_j \|\theta_j - \beta\|_2 \right]$$
jointly over task-specific estimators $\Theta = (\theta_1, \cdots, \theta_m) \in \mathbb{R}^{d \times m}$ and a multi-task center $\beta \in \mathbb{R}^d$ to be learned. Weighting hyperparameters $w_j$ are specified to reflect the importance of the information regarding each individual task (e.g. $w_j = n_j$). The regularization term $\lambda_j \|\theta_j - \beta\|_2$ drives the estimator $\theta_j$ of each individual task towards a common center $\beta$, with strength parameterized by $(\lambda_1, \cdots, \lambda_m)$. It is straightforward to see that our method interpolates between minimizing the risk of each individual task as $\lambda_j$ approaches zero, and a robust pooling of the individual minimizers as $\lambda_j$ increases.
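To make the objective concrete, here is a minimal numpy sketch of $F_{w,\lambda}$; the helper name and its interface are our own illustration rather than code from the paper. Setting every `lam[j]` to zero decouples the tasks, while large values tie every column of `Theta` to `beta`.

```python
import numpy as np

def augmented_objective(Theta, beta, losses, w, lam):
    """F_{w,lam}(Theta, beta) = sum_j w[j] * (f_j(theta_j) + lam[j] * ||theta_j - beta||_2).

    Theta  : (d, m) array, one column of task-specific parameters per task.
    beta   : (d,) array, the shared multi-task center.
    losses : list of m callables, losses[j](theta) evaluates the empirical risk f_j.
    """
    return sum(
        w[j] * (losses[j](Theta[:, j]) + lam[j] * np.linalg.norm(Theta[:, j] - beta))
        for j in range(Theta.shape[1])
    )
```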
Our key contribution is the analysis of the aforementioned procedure in a wide range of problems where the losses $\{f_j\}_{j=1}^m$ are convex but allowed to be non-smooth. We prove that the proposed estimator automatically adapts to the unknown similarity among the tasks. In our motivating example, the cost function is naturally a piecewise linear, non-smooth convex function closely related to quantile regression. Other examples include linear max-margin classifiers as well as threshold regression models. Since the objective functions of these models are not differentiable at many points, technical challenges arise in the uniform concentration results and convergence rates, as the subgradient is now a set-valued mapping and not continuous. Nonetheless, with statistical modeling, a theoretical analysis under such scenarios becomes possible.
In addition to the theoretical guarantees, we experiment with the numerical procedures on both synthetic data and a real-world dataset for the newsvendor problem in Section 5. The experiments reveal a steady and reliable performance benefit of the proposed method over benchmarks, with significant improvements over STL when data are scarce, and over blindly pooling the data together. The proposed method offers a reliable procedure for practitioners to leverage the possible relatedness between tasks in inventory decision-making, financial risk management, and many other applications.
1.1 Related Work
Multi-task learning based on parameter augmentation, such as the introduction of the common center $\beta$ in our method, has achieved great empirical success (Evgeniou and Pontil, 2004; Jalali et al., 2013; Chen et al., 2011). Our estimator originates from the framework of Adaptive and Robust Multi-task Learning (ARMUL) proposed by Duan and Wang (2022), but we relax the smoothness and strong convexity conditions on the empirical risks $f_j$, so that the analysis extends to many real-world applications ranging from statistical learning to inventory decision-making to financial risk management. The motivating inventory management example, often known as the data-driven newsvendor problem, can be expressed as a quantile regression problem with the quantile level determined by the ratio of the per-unit backorder cost to the sum of the backorder and holding costs (Levi et al., 2007, 2015; Ban and Rudin, 2019). The objective function, also known as the "check function", is convex but not differentiable. These applications coincide with the classical quantile regression in the statistics and econometrics literature, dating back to Koenker and Bassett Jr (1978), which estimates the conditional quantile of the response variable across values of predictive covariates. Besides the aforementioned newsvendor problems in inventory management, quantile regression finds a wide range of applications in survival data analysis (Koenker and Geling, 2001; Wang and Wang, 2014), financial risk management (Engle and Manganelli, 1999; Rockafellar et al., 2000) and many other fields. We refer the reader to Koenker (2005); Koenker et al. (2017) for an extensive overview of quantile regression.
Numerous applications arising in various fields call for a generalization of multi-task learning (MTL) to non-smooth objectives. Despite its practical importance, MTL for non-smooth objectives remains largely unexplored in statistical learning, with several exceptions including Fan et al. (2016); Chao et al. (2021); Kan et al. (2022). In contrast to our framework, they studied quantile regression with multivariate response variables in linear models and neural networks, respectively, treating each response variable as a different task sharing the same observed covariates. Since their models are based on shared covariates across tasks, they typically imposed a factor structure and augmented the objective with a rank-based regularization on the matrix $\Theta$. On the contrary, our framework allows different covariates in each task. As such, we regularize the objective function with a penalty driving towards a robust center of all tasks and jointly optimize over the individual estimators and the intrinsic center. Our analysis also complements and generalizes the limited existing literature on non-smooth quantile regression in large-scale or distributed datasets (Volgushev et al., 2019; Chen et al., 2021), which considered only quantile regression with homogeneous tasks. Several other works considered a framework similar to ours from an empirical Bayes perspective. Gupta and Kallus (2022) developed a data pooling procedure for data-driven newsvendor problems that shrinks the empirical distribution of each individual task towards a weighted global empirical distribution according to an anchor distribution; the data distributions there have finite supports. Mukherjee et al. (2015) focused on the Gaussian setting, studied the predictive risk instead of the estimation error, and proposed a shrinkage estimator towards a simultaneously optimized data-driven location.
1.2 Notation
The constants $c_1, c_2, C, C_1, C_2, \cdots$ may differ from line to line. We use $[n]$ as a shorthand for $\{1, 2, \cdots, n\}$ and $|\cdot|$ to denote the absolute value of a real number or the cardinality of a set. $\|A\| = \sup_{\|x\|_2 = 1} \|Ax\|_2$ denotes the spectral norm. Let $\mathbb{Z}_+$ be the set of positive integers and $\mathbb{R}_+ = \{x \in \mathbb{R} : x \geq 0\}$. Define $x_+ = \max\{x, 0\}$ and $x_- = \max\{-x, 0\}$ for $x \in \mathbb{R}$. Define $B(x, r) = \{y \in \mathbb{R}^d : \|y - x\|_2 \leq r\}$ for $x \in \mathbb{R}^d$ and $r \geq 0$.
2 PROBLEM FORMULATION
Let $m \in \mathbb{Z}_+$ be the number of tasks and $\mathcal{X}$ be the sample space, and let $\ell: \mathbb{R}^d \times \mathcal{X} \to \mathbb{R}$ be a (non-smooth) loss function. For every $j \in [m]$, let $P_j$ be a probability distribution over $\mathcal{X}$ and $\mathcal{D}_j = \{\xi_{ji}\}_{i=1}^{n_j}$ be $n_j$ independent samples drawn from $P_j$. The $j$-th task is to estimate the population loss minimizer
$$\theta_j^* \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \mathbb{E}_{\xi \sim P_j} \ell(\theta, \xi)$$
from the data. Denote by $\Theta^* = (\theta_1^*, \cdots, \theta_m^*) \in \mathbb{R}^{d \times m}$ the parameter matrix.
Define the empirical loss function of the $j$-th task as
$$f_j(\theta) = \frac{1}{n_j} \sum_{i=1}^{n_j} \ell(\theta, \xi_{ji}).$$
Two straightforward strategies are single-task learning (STL) and data pooling (DP). The former corresponds to solving the individual tasks separately, i.e.,
$$\widehat{\theta}_j = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} f_j(\theta), \qquad \forall j \in [m].$$
The latter corresponds to merging all datasets to train a single model, i.e.,
$$\widehat{\theta}_1 = \cdots = \widehat{\theta}_m = \operatorname*{argmin}_{\theta \in \mathbb{R}^d} \sum_{j=1}^{m} n_j f_j(\theta).$$
These two strategies have intrinsic shortcomings: STL does not take full advantage of the data available, while DP has a high risk of model misspecification. To resolve this issue, define
$$F_{w,\lambda}(\Theta, \beta) = \sum_{j=1}^{m} w_j \left[ f_j(\theta_j) + \lambda_j \|\theta_j - \beta\|_2 \right], \tag{2.1}$$
where $\Theta = (\theta_1, \cdots, \theta_m) \in \mathbb{R}^{d \times m}$, $\beta \in \mathbb{R}^d$, $w = (w_1, \cdots, w_m) \in \mathbb{R}_+^m$ are weight parameters (e.g. $w_j = n_j$), and $\lambda = (\lambda_1, \cdots, \lambda_m) \in \mathbb{R}_+^m$ are penalty parameters. We propose to solve the augmented program
$$(\widehat{\Theta}, \widehat{\beta}) \in \operatorname*{argmin}_{\Theta \in \mathbb{R}^{d \times m},\, \beta \in \mathbb{R}^d} F_{w,\lambda}(\Theta, \beta), \tag{2.2}$$
where each task receives its own estimate $\widehat{\theta}_j$ while the penalty terms shrink the $\widehat{\theta}_j$'s toward $\widehat{\beta}$ to promote similarity among the estimated models. This is a convex program so long as $\{f_j\}_{j=1}^m$ are all convex. If we choose $\lambda = \mathbf{0}$, we return to the STL setting; if we choose a sufficiently large $\lambda$, the cusp of the $\ell_2$ penalty at zero enforces the strict equality $\widehat{\theta}_j = \widehat{\beta}$ for all $j \in [m]$, effectively pooling all the data. Therefore, it is desirable to choose a suitable $\lambda$ that automatically adapts to whichever situation proves more suitable.
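For a concrete sense of how (2.2) can be solved, the following is a minimal sketch using the off-the-shelf convex solver cvxpy, assuming each $f_j$ can be expressed as a convex cvxpy expression. The function `solve_fused` and its loss-builder interface are our own illustration, not the authors' implementation.

```python
import cvxpy as cp

def solve_fused(loss_expr, d, m, w, lam):
    """Minimal sketch of (2.2): jointly minimize
    sum_j w[j] * ( f_j(theta_j) + lam[j] * ||theta_j - beta||_2 ).

    loss_expr(j, theta) must return a convex cvxpy expression for f_j(theta).
    """
    Theta = cp.Variable((d, m))   # one column of task-specific parameters per task
    beta = cp.Variable(d)         # the multi-task center
    objective = sum(
        w[j] * (loss_expr(j, Theta[:, j]) + lam[j] * cp.norm(Theta[:, j] - beta, 2))
        for j in range(m)
    )
    cp.Problem(cp.Minimize(objective)).solve()
    return Theta.value, beta.value

# Example loss builder: per-task quantile regression on data (X_list[j], y_list[j]),
# using the identity rho_tau(z) = max(tau * z, (tau - 1) * z):
#   loss_expr = lambda j, theta: cp.sum(cp.maximum(
#       tau * (y_list[j] - X_list[j] @ theta),
#       (tau - 1) * (y_list[j] - X_list[j] @ theta))) / len(y_list[j])
```

Passing `lam = [0] * m` recovers STL, while making every `lam[j]` sufficiently large forces all columns of `Theta` to coincide with `beta`, recovering DP.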
Note that (2.2) belongs to the framework of Adaptive and Robust Multi-task Learning (ARMUL) proposed by Duan and Wang (2022). However, the theoretical guarantees of ARMUL require $\{f_j\}_{j=1}^m$ to be smooth and locally strongly convex near the minimizers. As such, the scenarios where ARMUL is powerful include, for example, multi-task linear regression and multi-task logistic regression. In contrast, our theoretical results relax the smoothness condition and extend to scenarios where $\{f_j\}_{j=1}^m$ are non-smooth, which are ubiquitous in statistical learning and operations research.
3 EXAMPLES
Here we introduce three motivating examples in statistical learning and operations management.
3.1 Newsvendor Problem
Suppose a retailer sells a perishable good that needs to be prepared/stocked/ordered in advance. Let $D$ be a random variable representing the market demand. The retailer needs to decide a quantity $q$ of goods to prepare (e.g. raw materials to buy, food to defrost) ahead of time in order to minimize the expected cost, a combination of the backorder/underage and holding/overage costs:
$$\mathbb{E}_D \left[ b (D - q)_+ + h (D - q)_- \right],$$
where $b$ and $h$ are the backorder and holding costs per unit, respectively.
In practice, the distribution of the demand $D$ is not known beforehand. Instead, the information available is a set of independent random samples $\{D_i\}_{i=1}^n$ drawn from it. We can estimate the optimal order quantity by $\widehat{q} \in \operatorname*{argmin}_{q \in \mathbb{R}} f(q)$, where the objective function
$$f(q) = \frac{1}{n} \sum_{i=1}^{n} \left[ b (D_i - q)_+ + h (D_i - q)_- \right]$$
is non-smooth. If we define the check loss $\rho_\tau(z) = (1 - \tau) z_- + \tau z_+$, then $f(q)$ is proportional to $\frac{1}{n} \sum_{i=1}^{n} \rho_\tau(D_i - q)$ with $\tau = b / (b + h)$. The solution $\widehat{q}$ is the $\tau$-th sample quantile of the data $\{D_i\}_{i=1}^n$.
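As a quick numerical sanity check of this equivalence (with made-up exponential demand; all numbers are illustrative), one can verify that the minimizer of the empirical cost matches the $\tau$-th sample quantile:

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.exponential(scale=10.0, size=500)   # hypothetical demand samples
b, h = 3.0, 1.0                             # per-unit backorder and holding costs
tau = b / (b + h)                           # critical ratio: tau = 0.75

def empirical_cost(q):
    # f(q) = (1/n) * sum_i [ b*(D_i - q)_+ + h*(D_i - q)_- ]
    return np.mean(b * np.maximum(D - q, 0.0) + h * np.maximum(q - D, 0.0))

grid = np.linspace(D.min(), D.max(), 20_000)
q_hat = grid[np.argmin([empirical_cost(q) for q in grid])]
print(q_hat, np.quantile(D, tau))           # the two agree up to grid resolution
```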
The classical newsvendor problem above assumes that the holding and backorder costs grow linearly with the quantity surplus and deficit, respectively. We can relax this assumption to cases where the two costs are replaced with $B((D - q)_+)$ and $H((D - q)_-)$ for general functions $B, H: \mathbb{R}_+ \to \mathbb{R}_+$ that are convex, non-decreasing, and satisfy $B(0) = H(0)$. Given the data $\{D_i\}_{i=1}^n$, it is natural to estimate the best order quantity by minimizing the loss function
$$f(q) = \frac{1}{n} \sum_{i=1}^{n} \left[ B((D_i - q)_+) + H((D_i - q)_-) \right].$$
In modern newsvendor problems, the data for a specific product at one store can be quite limited.
Fortunately, multiple products in the same store or multiple stores in a nearby region have similar
sales patterns. A joint analysis of the datasets by multi-task learning facilitates decision making.
3.2 Quantile Regression
Denote by $F_{Y|X}$ the conditional CDF of a response $Y \in \mathbb{R}$ given covariates $X \in \mathbb{R}^d$. Define the $\tau$-th conditional quantile of $Y$ given $X \in \mathbb{R}^d$ as $Q_{Y|X}(\tau) = \inf\{y : F_{Y|X}(y) \geq \tau\}$. Assume that $Q_{Y|X}(\tau) = X^\top \theta^*$ holds for some $\theta^* \in \mathbb{R}^d$. Given $n$ i.i.d. samples $\{(x_i, y_i)\}_{i=1}^n$ from some joint distribution $P$, we can estimate $\theta^*$ by $\widehat{\theta} \in \operatorname*{argmin}_{\theta \in \mathbb{R}^d} f(\theta)$, where the objective
$$f(\theta) = \frac{1}{n} \sum_{i=1}^{n} \rho_\tau(y_i - x_i^\top \theta)$$
is non-smooth. This is the quantile regression of statistics (Koenker and Bassett Jr, 1978), which targets the conditional quantile of the response; in contrast, least squares regression aims to estimate the conditional mean. When we collect data from multiple populations (e.g. different geographical locations), multi-task learning helps utilize their commonality while tackling the heterogeneity.
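A minimal sketch of this estimator with cvxpy on synthetic data (the data-generating choices are ours, for illustration only):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
n, d, tau = 500, 3, 0.9
X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d - 1))])  # intercept + covariates
theta_star = np.array([1.0, -2.0, 0.5])
y = X @ theta_star + rng.normal(size=n)     # synthetic data for illustration

theta = cp.Variable(d)
r = y - X @ theta
# check loss identity: rho_tau(z) = max(tau*z, (tau-1)*z) = (1-tau)*z_- + tau*z_+
f = cp.sum(cp.maximum(tau * r, (tau - 1) * r)) / n
cp.Problem(cp.Minimize(f)).solve()
# theta.value is close to theta_star, except the intercept absorbs the
# tau-quantile of the Gaussian noise
print(theta.value)
```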
3.3 Support Vector Machine
Consider a binary classification problem where one wants to predict the label $Y \in \{\pm 1\}$ from covariates $X \in \mathbb{R}^d$. A popular method for training linear classifiers of the form $X \mapsto \operatorname{sgn}(X^\top \theta)$ is the support vector machine (SVM) (Cortes and Vapnik, 1995). Given the data $\{(x_i, y_i)\}_{i=1}^n$, the (soft-margin) SVM amounts to minimizing the empirical loss function
$$f(\theta) = \frac{1}{n} \sum_{i=1}^{n} (1 - y_i x_i^\top \theta)_+ + \mu \|\theta\|_2^2,$$
where $\mu \geq 0$ is a penalty parameter. Here $f$ is non-smooth. SVM has demonstrated superior performance in binary classification problems. In multi-class and multi-label settings, multi-task SVM is a popular approach where each task is to distinguish a pair of classes.
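A minimal cvxpy sketch of the soft-margin objective on synthetic data (all names and numbers are illustrative):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(2)
n, d, mu = 300, 5, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d))         # synthetic labels for illustration

theta = cp.Variable(d)
hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ theta))) / n   # non-smooth hinge loss
ridge = mu * cp.sum_squares(theta)                          # mu * ||theta||_2^2
cp.Problem(cp.Minimize(hinge + ridge)).solve()
```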
4 THEORETICAL ANALYSIS
In this section, we provide a non-asymptotic analysis of (2.2). Of particular interest to us is to
generalize the results of Duan and Wang (2022) to non-smooth empirical loss functions. While the