such as PheKB (Kirby et al., 2016) and FATE (Liu et al., 2021). There is an increasing need for
data integration methods that can directly leverage fitted models to improve model performance
in a target study.
Regression models are broadly applied in many fields, due to advantages such as simplicity,
computational efficiency, and interpretability (Hafemeister and Satija, 2019; Wynants et al., 2020;
Wu et al., 2020), and they are also the building blocks of many data analysis pipelines and tools
(Van Buuren et al., 2006; Tan, 2006). In this paper, we consider the problem of incorporating
pre-trained regression models from external sources to help train a target model with limited target
data. For the $i$-th subject in the target data, let $Y_i \in \mathbb{R}$ denote the outcome variable of interest and
$X_i \in \mathbb{R}^p$ denote a set of $p$-dimensional covariates. We consider
$$Y_i = X_i^\top \beta + \epsilon_i, \quad \text{for } i \in \{1, \dots, n\},$$
where $\epsilon_i \in \mathbb{R}$ is the random noise with mean zero and variance $\sigma^2$, $\beta \in \mathbb{R}^p$ is the regression coefficient
of interest, and $n$ is the target sample size. In addition to the target data, we observe a source model
fitted on an external source dataset, in the form of a source estimate $\hat{w} \in \mathbb{R}^p$ of the underlying
regression coefficient $w$ in the source population. When the underlying coefficients $\beta$ and
$w$ show certain similarities, we hope to leverage $\hat{w}$ to assist the estimation of the target parameter
$\beta$.
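This setup can be illustrated with a small simulation. All sample sizes, dimensions, and perturbation scales below are arbitrary choices for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10                          # target sample size and dimension
beta = np.ones(p)                      # target coefficient of interest
w = beta + 0.1 * rng.normal(size=p)    # source coefficient, similar to beta

X = rng.normal(size=(n, p))            # target covariates X_i in R^p
eps = rng.normal(scale=1.0, size=n)    # noise with mean zero, variance sigma^2
Y = X @ beta + eps                     # target outcomes Y_i = X_i^T beta + eps_i

# In practice we never see w itself; we only observe a source estimate
# w_hat, e.g. fitted on a (typically much larger) external dataset.
w_hat = w + 0.05 * rng.normal(size=p)
```

Because the target sample is small, the hope is that borrowing strength from $\hat{w}$ improves on fitting $\beta$ from the $n$ target observations alone.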
Similar problems have been considered in the literature. In a line of transfer learning work,
the source model estimate $\hat{w}$ was incorporated through a regularization term of the form $\|\beta - \hat{w}\|_q$,
for some positive constant $q$. For example, Li et al. (2020) proposed the transLASSO algorithm,
which leverages the source data by adding a penalty term $\|\beta - \hat{w}\|_1$ when learning $\beta$ from the
target data. transLASSO has since been extended to generalized linear models (Tian and Yang,
2022), functional linear regression (Lin and Reimherr, 2022), and Q-learning (Chen et al., 2022).
To accommodate data sharing constraints, Li et al. (2021) proposed a federated learning approach that
shares gradients and Hessian matrices, and Gu et al. (2023b) proposed a method to incorporate
information from fitted models through a synthetic data approach. Similar ideas are also used in the
multi-task learning literature, where $L_2$-norm-based penalties, i.e., $\|\beta - w\|_2$, are introduced to
leverage the similarities between model parameters (Tian et al., 2022; Duan and Wang, 2022). The
underlying similarity assumption of these methods is that the $L_q$ distance between $\beta$ and $w$ is small,
and we refer to this class of methods as the distance-based transfer learning methods.
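As a concrete sketch of the $\ell_1$ case, reparameterizing $\delta = \beta - \hat{w}$ turns the penalized problem $\min_\beta \|Y - X\beta\|_2^2/(2n) + \lambda\|\beta - \hat{w}\|_1$ into an ordinary lasso on the residual $Y - X\hat{w}$. The routine below is a generic proximal-gradient (ISTA) solver for this reparameterized problem, not the authors' implementation:

```python
import numpy as np

def soft_threshold(z, t):
    """Elementwise soft-thresholding, the proximal operator of t*||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def dist_penalized_fit(X, Y, w_hat, lam, n_iter=500):
    """Distance-based (transLASSO-style) estimator:
        minimize ||Y - X beta||_2^2 / (2n) + lam * ||beta - w_hat||_1.
    With delta = beta - w_hat this is a plain lasso on the residual
    r = Y - X w_hat; any lasso solver would do."""
    n, p = X.shape
    r = Y - X @ w_hat                              # lasso response for delta
    step = n / (np.linalg.norm(X, 2) ** 2)         # 1 / Lipschitz constant
    delta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ delta - r) / n           # gradient of smooth part
        delta = soft_threshold(delta - step * grad, step * lam)
    return w_hat + delta                           # beta_hat = w_hat + delta_hat
```

A large $\lambda$ keeps $\hat{\beta}$ at $\hat{w}$, while $\lambda = 0$ recovers ordinary least squares on the target data; the penalty thus interpolates between the two, which is effective precisely when $\|\beta - w\|_1$ is small.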
In real-world applications, the distance-based transfer learning methods may be less effective
when $\beta$ and $w$ are highly concordant but their distance is not small. For example, the
source outcome may be defined differently from the target outcome (a categorical versus a continuous
characterization of the same outcome), or there might be standardization procedures applied to the
source data that are unknown to us. The source model might also be fitted using a different but related
outcome variable correlated with the target outcome (Miglioretti, 2003; Stearns, 2010). To leverage
the concordance between the model parameters, several alternative similarity characterizations have been
proposed. For example, Li et al. (2014) proposed a bivariate ridge regression assuming the $j$-th
entries $(\beta_j, w_j)$ follow a bivariate normal distribution with a shared correlation $\rho$ for all $j \in [p]$.
Similarly, Maier et al. (2014) proposed a multi-trait prediction method, assuming multivariate
normally distributed random effects to leverage shared genetic architectures across traits. Qiao
et al. (2019) proposed a method for multi-objective optimization problems. Liang et al. (2020)