Robust angle-based transfer learning in high dimensions
Tian Gu1, Yi Han2, and Rui Duan3,†
1Department of Biostatistics, Columbia Mailman School of Public Health, New
York, NY 10032, USA
2Department of Statistics, Columbia University, New York, NY 10027, USA
3Department of Biostatistics, Harvard T.H. Chan School of Public Health,
Boston, MA 02115, USA
†Corresponding author: rduan@hsph.harvard.edu
Abstract
Transfer learning enhances the performance of a target model by leveraging data from
related source populations, a technique particularly beneficial when target data is scarce.
This study addresses the challenge of training high-dimensional regression models with
limited target data, in the context of heterogeneous source populations. We consider a
practical setting where only the parameter estimates of the pre-trained source models
are accessible, instead of the individual-level source data. Under the setting with only
one source model, we propose a novel flexible angle-based transfer learning (angleTL)
method, which leverages the concordance between the source and the target model parameters.
We show that the proposed angleTL is adaptive to the signal strength of
the target model, unifies several benchmark methods by construction, and can prevent
negative transfer when between-population heterogeneity is large. We also provide al-
gorithms to effectively incorporate multiple source models accounting for the fact that
some source models may be more helpful than others. Our high-dimensional asymptotic
analysis provides interpretations and insights on when a source model can be useful for
the target, and demonstrates the superiority of angleTL over other benchmark methods.
We perform extensive simulation studies to validate our theoretical conclusions and show
the feasibility of applying angleTL to transferring existing genetic risk prediction models
across multiple biobanks.
arXiv:2210.12759v4 [stat.ME] 10 Nov 2023
1 Introduction
Insufficient training data presents a critical challenge across various domains. In finance, the scarcity
of comprehensive user credit histories hampers efforts in evaluating individual financial risk and de-
tecting fraud (Teja, 2017). Similarly, in precision medicine, the limited availability of medical
records, especially for studying rare diseases or minority and disadvantaged sub-populations, com-
promises both the fairness and clinical efficacy of treatments (Jia and Shi, 2017; Kim and Milliken,
2019). These constraints highlight an urgent need for innovative approaches that can enhance model
performance in the face of data limitations (Viele et al., 2014; Li and Song, 2020; Yang and Kim,
2020).
When borrowing information from external sources, a major challenge is to account for the
potential data heterogeneity. The external data are likely collected from a different population
with diverse characteristics (Chen et al., 2020) or be historical data where the variable definitions
and measures may change over time (Mansukhani et al., 2019; Mitchell et al., 2021). Many data
integration methods have been proposed to leverage potentially shared information across populations
while addressing data heterogeneity. For example, some methods are based on re-weighting or
re-sampling the source data such that they are more similar to the target (Huang et al., 2006;
Pan et al., 2010; Long et al., 2013); some methods assume there is a unique lower dimensional
representation of features across populations, which can be transferred from the source to the
target (Tzeng et al., 2014; Ganin and Lempitsky, 2015; Sun and Saenko, 2016); some methods
propose to use the target data to calibrate the source models (Girshick et al., 2014; Li et al., 2020;
Gu et al., 2023a). In situations where no data from the target population is available, methods
have been proposed to combine source models aiming for distributional robustness by optimizing
for the worst-case performance, assuming the target population can be represented as either a
single source or a mixture of source populations (Meinshausen and Bühlmann, 2015; Wang et al.,
2023). The performance of many of the aforementioned methods largely depends on whether the
underlying assumptions regarding the similarity between the source and the target populations
hold, which is usually unknown in practice. Therefore, it is desirable to develop methods that are
adaptive to the underlying data heterogeneity, or that at least prevent the case where incorporating the
source information leads to worse model performance than not including it, known as the “negative
transfer” phenomenon (Weiss et al., 2016).
In addition, there might be data-sharing constraints such that external sources cannot share
individual-level data with the target study. Federated or distributed algorithms are proposed to
overcome such data-sharing barriers by sharing only summary-level statistics across studies, many
of which rely on sharing the gradients or higher-order derivatives of objective functions (Duan
et al., 2018, 2020, 2022; Cai et al., 2021; Li et al., 2021) and may require iteratively sharing updated
summary statistics across datasets (Li et al., 2021). However, distributed and federated learning
are less feasible without a collaborative environment or certain infrastructures that enable efficient
computing and timely information sharing. In contrast, pre-trained models from existing studies are
often more accessible. With increasing attention to reproducibility and open science, more journals
require studies to publish their results as supplementary materials or to make them shareable upon
request (Thompson et al., 2006; Roobol et al., 2012). Many platforms allow direct implementation
(Belbasis and Panagiotou, 2022) or validation of fitted models on secure collaborative platforms
such as PheKB (Kirby et al., 2016) and FATE (Liu et al., 2021). There is an increasing need for
data integration methods that can directly leverage fitted models to improve the model performance
of a target study.
Regression models are broadly applied in many fields, due to advantages such as simplicity,
computational efficiency, and interpretability (Hafemeister and Satija, 2019; Wynants et al., 2020;
Wu et al., 2020), and they are also the building blocks of many data analysis pipelines and tools
(Van Buuren et al., 2006; Tan, 2006). In this paper, we consider the problem of incorporating
pre-trained regression models from external sources to help train a target model with limited target
data. For the $i$-th subject in the target data, let $Y_i \in \mathbb{R}$ denote the outcome variable of interest and
$X_i \in \mathbb{R}^p$ denote a set of $p$-dimensional covariates. We consider
$$Y_i = X_i^\top \beta + \epsilon_i, \quad \text{for } i \in \{1, \dots, n\},$$
where $\epsilon_i \in \mathbb{R}$ is the random noise with mean zero and variance $\sigma^2$, $\beta \in \mathbb{R}^p$ is the regression coefficient
of interest, and $n$ is the target sample size. In addition to the target data, we observe a source model
fitted on an external source dataset, where we observe the source estimate $\hat{w} \in \mathbb{R}^p$ of the underlying
regression coefficient $w$ in the source population. In the case where the underlying coefficients $\beta$ and
$w$ show certain similarities, we hope to leverage $\hat{w}$ to assist the estimation of the target parameter
$\beta$.
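To fix ideas, the following minimal NumPy sketch (dimensions, noise levels, and the form of the source coefficient are arbitrary choices for the example, not taken from the paper) simulates target data from this linear model together with a source estimate $\hat{w}$ that is concordant with $\beta$ but differs in scale:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 50, 200, 1.0  # limited target sample, high-dimensional covariates

# Target coefficient beta and a concordant (but rescaled, perturbed) source coefficient w.
beta = rng.normal(size=p) / np.sqrt(p)
w = 2.0 * beta + 0.3 * rng.normal(size=p) / np.sqrt(p)

# Target data: Y_i = X_i' beta + eps_i, i = 1, ..., n.
X = rng.normal(size=(n, p))
Y = X @ beta + sigma * rng.normal(size=n)

# In practice only a (possibly noisy) source estimate w_hat is observed, not w itself.
w_hat = w + 0.05 * rng.normal(size=p)
```

Here $w$ is highly concordant with $\beta$ in direction even though its $L_2$ distance from $\beta$ is large, which is exactly the regime the paper targets.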
Similar problems have been considered in the literature. In a series of transfer learning work,
the source model estimate $\hat{w}$ was incorporated through a regularization of the form $\|\beta - \hat{w}\|_q$,
for some positive constant $q$. For example, Li et al. (2020) proposed the transLASSO algorithm
that leverages the source data by adding a penalty term $\|\beta - \hat{w}\|_1$ when learning $\beta$ using the
target data. Later, transLASSO was extended to generalized linear models (Tian and Yang,
2022), functional linear regression (Lin and Reimherr, 2022), and Q-learning (Chen et al., 2022).
Considering data-sharing constraints, Li et al. (2021) proposed a federated learning approach that
shares gradients and Hessian matrices, and Gu et al. (2023b) proposed a method to incorporate
information from fitted models through a synthetic data approach. Similar ideas are also used in
the multi-task learning literature, where $L_2$-norm-based penalties, i.e., $\|\beta - w\|_2$, are introduced to
leverage the similarities between model parameters (Tian et al., 2022; Duan and Wang, 2022). The
underlying similarity assumption of these methods is that the $L_q$ distance between $\beta$ and $w$ is small,
and we refer to this class of methods as distance-based transfer learning methods.
In real-world applications, distance-based transfer learning methods may be less effective
when $\beta$ and $w$ are highly concordant but their distance is not small. For example, the
source outcome may be defined differently from the target outcome (categorical versus continuous
characterization of the same outcome), or there might be standardization procedures applied to the
source data that are unknown to us. The source model might also be fitted using a different but related
outcome variable correlated with the target outcome (Miglioretti, 2003; Stearns, 2010). To leverage
the concordance between the model parameters, several alternative similarity characterizations have
been proposed. For example, Li et al. (2014) proposed a bivariate ridge regression assuming the $j$-th
entries $(\beta_j, w_j)$ follow a bivariate normal distribution with shared correlation $\rho$ for all $j \in [p]$.
Similarly, Maier et al. (2014) proposed a multi-trait prediction method, assuming multivariate
normally distributed random effects to leverage shared genetic architectures across traits. Qiao
et al. (2019) proposed a method for multi-objective optimization problems. Liang et al. (2020)
proposed a calibrated version of transLASSO, which allows the scales of the model parameters to
differ. However, these methods require individual-level data from the source, and their robustness is
unclear when the level of data heterogeneity is high.
In this paper, we propose an angle-based transfer learning approach, named angleTL, which
leverages the similarity of two regression models through a novel penalization obtained by decoupling
the angle distance between $\beta$ and $w$. As a consequence, angleTL adapts to the signal strength
of the target model and inherently guards against negative transfer, while offering a simpler form both
conceptually and computationally. Our approach unifies the distance-based transfer learning, target-only,
and source-only estimators as specific instances, and is thus guaranteed to have superior performance.
Our high-dimensional asymptotic analysis provides a precise characterization of the prediction risk,
illustrating the bias-variance trade-off and the influence of the signal strengths, the similarity between
source and target, the noise level, and the estimation error of the source on the accuracy of the final
estimator. Given multiple source models, we propose methods to effectively incorporate them toward
better predictive performance in the target population. The proposed methods require only the
parameter estimates of fitted source models, which are more accessible in practice. We perform
extensive simulation studies to validate our theoretical conclusions and evaluate the performance of
angleTL by training genetic risk models for low-density lipoprotein (LDL) cholesterol using data from
multiple large-scale biobanks.
2 Angle-based transfer learning
Let $Y \in \mathbb{R}^n$ denote the outcome variable of interest and $X \in \mathbb{R}^{n \times p}$ denote a set of $p$-dimensional
covariates in the target data of size $n$. Without borrowing information from the source, we can
obtain a target-only estimator of $\beta$ through a ridge regression,
$$\tilde{\beta}_{\lambda} = \arg\min_{\beta} \frac{1}{n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2, \qquad (1)$$
where $\lambda$ is a tuning parameter. The penalty on $\|\beta\|_2^2$ helps reduce the overall mean squared error
of the estimator in the high-dimensional setting (Hastie et al., 2009).
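As a concrete sketch of Eq. (1) (simulated data; the function name and dimensions are our own choices), the ridge estimator can be computed directly from its first-order condition:

```python
import numpy as np

def ridge(X, Y, lam):
    """Target-only ridge estimator minimizing (1/n)||Y - X b||^2 + lam * ||b||^2."""
    n, p = X.shape
    # First-order condition: (X'X + n*lam*I_p) b = X'Y.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ Y)

rng = np.random.default_rng(1)
n, p = 40, 100  # p > n: the ridge penalty makes the problem well-posed
X = rng.normal(size=(n, p))
Y = X @ rng.normal(size=p) + rng.normal(size=n)
beta_ridge = ridge(X, Y, lam=0.5)
```

The $n$ multiplying $\lambda$ in the linear system simply accounts for the $1/n$ factor on the loss in Eq. (1).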
Suppose that we also observe $\hat{w} \in \mathbb{R}^p$, the parameter estimates of a source model fitted on an
external source dataset with a sample size potentially much larger than $n$. Due to data heterogeneity,
the two underlying regression coefficients, $\beta$ and $w$, may not be the same but may share certain
similarities, and hence $\hat{w}$ can be used to guide the estimation of $\beta$. Following a series of recent
transfer learning methods (Li et al., 2022, 2021; Gu et al., 2023b), we define the $L_2$-distance-based
transfer learning estimator (distTL) to be
$$\check{\beta}_{\lambda_d} = \arg\min_{\beta} \frac{1}{n}\|Y - X\beta\|_2^2 + \lambda_d\|\beta - \hat{w}\|_2^2, \qquad (2)$$
where $\lambda_d$ is a tuning parameter. Imposing a distance-based penalty, $\|\beta - \hat{w}\|_2$, is equivalent to
adding a constraint $\|\beta - \hat{w}\|_2 \leq h$, which reduces the parameter space to an $L_2$ ball centered at $\hat{w}$,
as shown in panel A of Fig. 1. Anchoring on the source estimator $\hat{w}$, distTL encourages estimators
closer to $\hat{w}$ while allowing for the calibration of potential differences.
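The distTL estimator in Eq. (2) likewise admits a ridge-type closed form, shrinking toward $\hat{w}$ rather than toward zero. A minimal sketch (simulated data; function name and dimensions are ours):

```python
import numpy as np

def dist_tl(X, Y, w_hat, lam_d):
    """distTL estimator of Eq. (2):
    minimize (1/n)||Y - X b||^2 + lam_d * ||b - w_hat||^2."""
    n, p = X.shape
    # First-order condition: (X'X + n*lam_d*I_p) b = X'Y + n*lam_d*w_hat,
    # i.e. ridge regression anchored at w_hat instead of at zero.
    return np.linalg.solve(X.T @ X + n * lam_d * np.eye(p),
                           X.T @ Y + n * lam_d * w_hat)

rng = np.random.default_rng(2)
n, p = 30, 60
w_hat = rng.normal(size=p)
X = rng.normal(size=(n, p))
Y = X @ w_hat + rng.normal(size=n)

# As lam_d grows, the estimator is pulled toward the source estimate w_hat.
b_weak = dist_tl(X, Y, w_hat, lam_d=0.1)
b_strong = dist_tl(X, Y, w_hat, lam_d=1e6)
```

With $\lambda_d \to \infty$ the estimator collapses onto $\hat{w}$ (pure source), while $\lambda_d \to 0$ removes the anchoring entirely, matching the "calibration" interpretation above.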
As introduced earlier, in real-world applications, it is possible that $\beta$ and $w$ are concordant to
some degree, but $\|\beta - w\|_2$ is not small. In such cases, distTL may be less effective and we may
Figure 1: Geometric illustration of the distance-based similarity characterization (A); the angle-based
characterization (B); and the situation where distance-based transfer learning has the same
predictive risk as the proposed method (C).
consider a more general characterization of the similarity using the angle distance, characterized
by $\sin\Theta(w, \beta) = \sqrt{1 - \frac{(w^\top\beta)^2}{\|w\|_2^2\|\beta\|_2^2}}$. As shown in panel B of Fig. 1, if the target model parameter
$\beta$ is restricted by a constraint that $\sin\Theta(\beta, w) \leq d$, the source model parameter $w$ can provide
directional information about $\beta$ and reduce the parameter space of $\beta$ to a cone. When $w$ and $\beta$ are
close in $L_2$ distance, the angle between the two vectors is also small, while the converse need not hold.
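The angle distance above is scale-free, unlike the $L_2$ distance. A small numerical illustration of this asymmetry (NumPy; the vectors are illustrative choices of ours):

```python
import numpy as np

def sin_theta(u, v):
    """Angle distance sin(Theta(u, v)) = sqrt(1 - (u'v)^2 / (||u||_2^2 ||v||_2^2))."""
    cos_sq = (u @ v) ** 2 / ((u @ u) * (v @ v))
    return np.sqrt(max(0.0, 1.0 - cos_sq))  # clip tiny negatives from rounding

beta = np.array([1.0, 2.0, 3.0])
w = 10.0 * beta + np.array([0.0, 0.0, 0.1])  # nearly proportional to beta

l2_dist = np.linalg.norm(w - beta)  # large: roughly 9 * ||beta||_2
angle = sin_theta(w, beta)          # tiny: the direction is almost identical
```

This is precisely the regime motivating angleTL: $\|\beta - w\|_2$ is large (so distance-based penalties borrow little), yet $w$ carries nearly perfect directional information about $\beta$.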
Instead of directly penalizing $\sin\Theta(\hat{w}, \beta) = \sqrt{1 - \frac{(\hat{w}^\top\beta)^2}{\|\hat{w}\|_2^2\|\beta\|_2^2}}$, which may lead to computational
complexity, we consider an alternative angle-based transfer learning estimator defined as
$$\hat{\beta}_{\lambda,\eta} = \arg\min_{\beta} \frac{1}{n}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 - 2\eta\,\hat{w}^\top\beta, \qquad (3)$$
where $\lambda$ and $\eta$ are tuning parameters. To see the connection between the proposed penalty terms
and the $\sin\Theta$ distance, without loss of generality, we can take $\|\hat{w}\|_2^2 = 1$. Due to the duality
of constrained optimization and penalization, penalizing $\sin\Theta(\hat{w}, \beta)$ is equivalent to imposing a
constraint $\sqrt{1 - \frac{(\hat{w}^\top\beta)^2}{\|\beta\|_2^2}} < a$ for some $a$, which can be written as $\frac{\hat{w}^\top\beta}{\|\beta\|_2} > \sqrt{1 - a^2}$. While this
constraint controls the ratio $\frac{\hat{w}^\top\beta}{\|\beta\|_2}$, we propose to decouple its numerator and denominator,
introducing two separate constraints, $\hat{w}^\top\beta > b$ (equivalent to $-\hat{w}^\top\beta < -b$) and $\|\beta\|_2 < c$
(equivalent to $\|\beta\|_2^2 < c^2$). While this decoupling provides computational simplicity and efficiency,
the set $\{\beta : \hat{w}^\top\beta > b, \|\beta\|_2 < c\}$ may not necessarily include, or be included in, the set
$\{\beta : \sqrt{1 - (\hat{w}^\top\beta)^2/\|\beta\|_2^2} < a\}$ when $b$ and $c$ are two freely adjustable tuning parameters. So the
proposed penalization is neither equivalent to, nor a relaxation of, the $\sin\Theta$ penalty.
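To make the decoupled penalization concrete, here is a minimal sketch (simulated data; function name and dimensions are ours) of the estimator in Eq. (3), obtained by setting the gradient of the objective to zero:

```python
import numpy as np

def angle_tl(X, Y, w_hat, lam, eta):
    """angleTL estimator of Eq. (3):
    minimize (1/n)||Y - X b||^2 + lam*||b||^2 - 2*eta*(w_hat' b)."""
    n, p = X.shape
    # First-order condition gives a ridge-type closed form:
    # (X'X + n*lam*I_p) b = X'Y + n*eta*w_hat.
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p),
                           X.T @ Y + n * eta * w_hat)

rng = np.random.default_rng(3)
n, p = 30, 80
beta = rng.normal(size=p) / np.sqrt(p)
w_hat = 3.0 * beta  # concordant source estimate on a different scale
X = rng.normal(size=(n, p))
Y = X @ beta + 0.5 * rng.normal(size=n)

b_hat = angle_tl(X, Y, w_hat, lam=0.5, eta=0.1)
```

Setting $\eta = 0$ recovers the target-only ridge estimator of Eq. (1), while larger $\eta$ tilts the solution toward the direction of $\hat{w}$.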
The proposed penalization has several advantages. First, given $\lambda$ and $\eta$, Equation (3) has a
closed-form solution $\hat{\beta}_{\lambda,\eta} = (X^\top X + n\lambda I_p)^{-1}(X^\top Y + n\eta\hat{w})$, which ensures computational
efficiency. Second, it is adaptive to the signal strength of the target parameter. Specifically, when
$\|\beta\|_2$ is large, the constraint $\hat{w}^\top\beta > b$ can be satisfied by $\beta$'s with a large $\sin\Theta$ distance from $\hat{w}$.
In this scenario, the angle constraint is relatively minimal, and less information is borrowed from the
source. Conversely, when $\|\beta\|_2$ is small, there is a tighter constraint on the angle distance, which
results in borrowing more information from the source. Intuitively, with a sufficiently strong signal,