A Latent Logistic Regression Model
with Graph Data
Haixiang Zhang, Yingjun Deng, Alan J.X. Guo, Qing-Hu Hou and Ou Wu
Center for Applied Mathematics, Tianjin University, Tianjin 300072, China
Abstract
Graph (network) data have recently become an active research area in artificial intelligence,
machine learning and statistics. In this work, we investigate whether nodes' labels
(people's responses) are affected by their neighbors' features (friends' characteristics).
We propose a novel latent logistic regression model to describe network dependence
with binary responses. The key advantage of the proposed model is that it introduces a
latent binary indicator of whether a node is susceptible to the influence of its neighbors.
A score-type test is proposed to detect the existence of network dependence. In addition,
an EM-type algorithm is used to estimate the model parameters under network dependence.
Extensive simulations are conducted to evaluate the performance of our method. Two public
datasets are used to illustrate the effectiveness of the proposed latent logistic regression model.
Keywords: Artificial intelligence; EM algorithm; Graph data; Logistic regression;
ROC curves; Social network
1 Introduction
Graphs or networks are now widely used in many research fields, such as social interactions,
protein-protein interactions, chemical molecule bonds and transport networks. Considerable
effort has been devoted to network data in the literature. For example, Zhu et al. (2017)
proposed a network vector autoregressive model. Yan et al. (2019) studied the maximum
likelihood estimation of a directed network model with covariates. Zhang et al. (2022) studied
some topics on large scale social networks. Chandna et al. (2021) considered the local
linear estimation of the graphon function. Zhu et al. (2021) introduced a network functional
varying coefficient model. Pan et al. (2022) proposed a latent space logistic regression model
for link prediction with social networks. Zhao et al. (2022) proposed a dimension reduction
method for covariates in network data.
Corresponding author: haixiang.zhang@tju.edu.cn (Haixiang Zhang)
arXiv:2210.05218v1 [stat.ME] 11 Oct 2022
Logistic regression is a well-known statistical tool that plays an important role in
practical applications, e.g., finance research (Li et al., 2019), public health (Lemon et al.,
2003), medicine (Lu and Yang, 2012), education (Pyke and Sheridan, 1993) and bioinformatics
(Wu et al., 2009). There have been many methodological developments on logistic
regression. For example, Landwehr et al. (1984) proposed graphical methods for assessing
logistic regression models. Stefanski and Carroll (1985) studied the logistic regression model
when covariates are subject to measurement error. Jennings (1986) investigated outliers and
residual distributions in logistic regression. Efron (1988) used logistic regression techniques
to estimate hazard rates and survival curves from censored data. Carroll and Pederson
(1993) investigated robustness in the logistic regression model. Meier et al. (2008) extended
the group lasso to logistic regression models. Liu et al. (2009) proposed an algorithm for
solving large-scale sparse logistic regression. Singh et al. (2009) examined the problem of
efficient feature selection for logistic regression on very large data sets. Shi et al. (2010) and
Yuan et al. (2012) studied some algorithms for L1-regularized logistic regression. Conroy
and Sajda (2012) proposed a fast and exact model selection procedure for L2-regularized
logistic regression. Das et al. (2013) introduced new supervised and semi-supervised learning
algorithms based on locally-weighted logistic regression. Tripathi et al. (2017) considered
estimation of the probability density function and the cumulative distribution function of
the generalized logistic distribution. Schein and Ungar (2007) and Yang and Loog (2018)
studied active learning methods for logistic regression. Wang (2020) studied binary logistic
regression for rare events data. Han et al. (2019) presented an efficient algorithm for
logistic regression on homomorphic encrypted data. Wang et al. (2018) and Zuo et al. (2021)
considered optimal subsampling for logistic regression in big data, among others.
However, the above-mentioned logistic regression methods do not consider the effects of
networks. For example, does a person like playing a game given that his/her friends like
it? Will customers buy a commodity if their friends have bought it? In this work, we
are interested in exploring whether a certain type of network dependence exists with binary
outcome data, and in quantifying this dependence structure if it exists. To deal with this issue,
we propose a logistic model with a latent binary indicator, which can describe
whether a node is susceptible to the influence of its neighbors. Our method has the following
two advantages. First, the proposed logistic model with a latent binary indicator is very
flexible in practical applications. It provides a way to estimate the probability that a
node is affected by its neighbors' characteristics in the network. Hence, we can detect a
subgroup of nodes that are more likely to be influenced by their neighbors. Second, we give
a score-type test for detecting the existence of network dependence in the logistic model.
An EM algorithm is employed to estimate the model parameters, which leads to a consistent
estimator with desirable performance in simulations.
The remainder of this article is organized as follows: In Section 2, we introduce a latent
logistic regression model. In Section 3, we propose a supremum score test statistic to
detect the existence of network dependence. In Section 4, an EM-type estimation algorithm
is proposed. Simulations and a real data application are presented in Sections 5 and 6,
respectively. In Section 7, we give some concluding remarks.
2 Model and Notation
Let $G = (V, E)$ be an undirected graph with nodes $V$ and edges $E$. Assume that there are $n$
nodes belonging to two classes. Let $Y_i \in \{0, 1\}$ and $X_i = (X_{i1}, \cdots, X_{ip})' \in \mathbb{R}^p$ be the binary
label and feature vector of the $i$-th node $v_i$, $i = 1, \cdots, n$. Given the graph structure of $G$,
we propose a novel latent logistic regression model:
$$
P(Y_i = 1 \mid X_i, \zeta_i) = \frac{\exp\{\beta_0 + X_i'\beta + \delta\zeta_i \sum_{j=1}^{n} a_{ij} X_j'\beta\}}{1 + \exp\{\beta_0 + X_i'\beta + \delta\zeta_i \sum_{j=1}^{n} a_{ij} X_j'\beta\}}, \quad i = 1, \cdots, n, \tag{2.1}
$$
where $\beta_0 \in \mathbb{R}$ is an intercept, $\beta = (\beta_1, \cdots, \beta_p)'$ is the vector of regression parameters,
$A = (a_{ij})$ is the adjacency matrix ($a_{ii} = 0$, and $a_{ij} = 1$ if there is an edge between the $i$th
node and the $j$th node, $a_{ij} = 0$ otherwise); $\zeta_i \in \{0, 1\}$ is a latent indicator denoting whether the
label (response) of the $i$th node depends on its neighbors' features. Note that $\zeta_i$ is unobservable,
and we assume that
$$
P(\zeta_i = 1 \mid X_i) = \frac{\exp\{\gamma_0 + X_i'\gamma\}}{1 + \exp\{\gamma_0 + X_i'\gamma\}}. \tag{2.2}
$$
The parameter $\delta$ describes the magnitude of the dependence of a node
on its neighbors. When $\delta = 0$, there is no network dependence between labels of connected
nodes, and the parameters $\gamma_0$ and $\gamma$ are not estimable in this case. In what follows, we
propose a method to test the null hypothesis $H_0: \delta = 0$, and then give an EM-type
algorithm to estimate the model parameters under $\delta \neq 0$.
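To make the data-generating mechanism in (2.1) and (2.2) concrete, the following is a minimal Python sketch that draws the latent indicators and the labels for a fixed graph. The function names (`expit`, `neighbor_index`, `simulate_labels`), the toy path graph and all parameter values are illustrative assumptions and do not come from the paper.

```python
import math
import random


def expit(t):
    """Numerically stable logistic function exp(t) / (1 + exp(t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def neighbor_index(A, X, beta):
    """Return s_i = sum_j a_ij * X_j' beta for every node i."""
    xb = [sum(x_k * b_k for x_k, b_k in zip(x, beta)) for x in X]
    return [sum(a_ij * xb_j for a_ij, xb_j in zip(row, xb)) for row in A]


def simulate_labels(A, X, beta0, beta, delta, gamma0, gamma, rng):
    """Draw (zeta, Y) from models (2.2) and (2.1) for a fixed graph and features."""
    s = neighbor_index(A, X, beta)
    zeta, Y = [], []
    for x_i, s_i in zip(X, s):
        # (2.2): latent indicator of susceptibility to the neighbors
        p_zeta = expit(gamma0 + sum(x * g for x, g in zip(x_i, gamma)))
        z_i = 1 if rng.random() < p_zeta else 0
        # (2.1): label probability; the neighbor term is active only if zeta_i = 1
        own = beta0 + sum(x * b for x, b in zip(x_i, beta))
        p_y = expit(own + delta * z_i * s_i)
        zeta.append(z_i)
        Y.append(1 if rng.random() < p_y else 0)
    return zeta, Y


# Toy example (assumed values): path graph 0 - 1 - 2 with one feature per node.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
X = [[0.5], [-1.0], [2.0]]
zeta, Y = simulate_labels(A, X, beta0=0.0, beta=[1.0], delta=0.5,
                          gamma0=0.0, gamma=[1.0], rng=random.Random(0))
```

Setting `delta=0.0` in this sketch removes the neighbor term for every node, which corresponds to the null hypothesis of no network dependence.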
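3 Test for $H_0: \delta = 0$
First we assume that $\zeta = (\zeta_1, \cdots, \zeta_n)'$ is known. The log-likelihood function is
$$
L(\theta; \zeta) = \sum_{i=1}^{n} \left[ Y_i \Big( \beta_0 + X_i'\beta + \delta\zeta_i \sum_{j=1}^{n} a_{ij} X_j'\beta \Big) - \log\Big( 1 + \exp\Big\{ \beta_0 + X_i'\beta + \delta\zeta_i \sum_{j=1}^{n} a_{ij} X_j'\beta \Big\} \Big) \right], \tag{3.1}
$$
where $\theta = (\delta, \beta_0, \beta')'$. Under $H_0$, model (2.1) reduces to the standard logistic model. Let
$\tilde\beta_0$ and $\tilde\beta$ denote the maximum likelihood estimators under the null. Denote $\tilde\theta = (0, \tilde\beta_0, \tilde\beta')'$.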
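As an illustration of (3.1), the log-likelihood for a known $\zeta$ can be evaluated directly. This is a minimal sketch, assuming the graph is stored as a dense 0/1 adjacency list of lists; the names `log1pexp` and `log_likelihood` are hypothetical helpers, not notation from the paper.

```python
import math


def log1pexp(t):
    """Stable evaluation of log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def log_likelihood(Y, X, A, zeta, delta, beta0, beta):
    """Log-likelihood (3.1) for known latent indicators zeta."""
    xb = [sum(x * b for x, b in zip(x_i, beta)) for x_i in X]   # X_j' beta
    total = 0.0
    for i, y_i in enumerate(Y):
        s_i = sum(a * v for a, v in zip(A[i], xb))              # sum_j a_ij X_j' beta
        eta_i = beta0 + xb[i] + delta * zeta[i] * s_i           # linear predictor of (2.1)
        total += y_i * eta_i - log1pexp(eta_i)
    return total
```

When `delta` is zero the value no longer depends on `zeta`, which matches the remark that model (2.1) reduces to the standard logistic model under $H_0$.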
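Some calculations lead to the following score function:
$$
S(\tilde\theta; \zeta) = \frac{\partial L(\theta; \zeta)}{\partial \delta}\Big|_{\theta = \tilde\theta} = \sum_{i=1}^{n} \tilde Z_i \left[ Y_i - \frac{\exp(\tilde\beta_0 + X_i'\tilde\beta)}{1 + \exp(\tilde\beta_0 + X_i'\tilde\beta)} \right], \tag{3.2}
$$
where $\tilde Z_i = \zeta_i \sum_{j=1}^{n} a_{ij} X_j'\tilde\beta$. Because $\zeta$ is not available in the $\tilde Z_i$'s, we propose to replace $\zeta_i$
with its expectation $E(\zeta_i) = P(\zeta_i = 1 \mid X_i)$ given in (2.2). For convenience, denote
$\tilde Z_i^* = E(\zeta_i) \sum_{j=1}^{n} a_{ij} X_j'\tilde\beta$. After replacing $\tilde Z_i$ with $\tilde Z_i^*$ in (3.2), we obtain a score-type statistic:
$$
S^*(\tilde\theta; \phi) = \sum_{i=1}^{n} \tilde Z_i^* \left[ Y_i - \frac{\exp(\tilde\beta_0 + X_i'\tilde\beta)}{1 + \exp(\tilde\beta_0 + X_i'\tilde\beta)} \right], \tag{3.3}
$$
where $\phi = (\gamma_0, \gamma')'$.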
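For a single candidate value of $\phi$, the score-type statistic (3.3) is a weighted sum of null-model residuals. The sketch below assumes the null estimates are already available and are passed in as `beta0_hat` and `beta_hat`; the function name `score_statistic` and the dense adjacency representation are illustrative choices.

```python
import math


def expit(t):
    """Numerically stable logistic function exp(t) / (1 + exp(t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def score_statistic(Y, X, A, beta0_hat, beta_hat, gamma0, gamma):
    """Score-type statistic (3.3) for one candidate phi = (gamma0, gamma')'."""
    xb = [sum(x * b for x, b in zip(x_i, beta_hat)) for x_i in X]
    stat = 0.0
    for i, (y_i, x_i) in enumerate(zip(Y, X)):
        s_i = sum(a * v for a, v in zip(A[i], xb))      # sum_j a_ij X_j' beta_tilde
        e_i = expit(gamma0 + sum(x * g for x, g in zip(x_i, gamma)))  # E(zeta_i), (2.2)
        z_star = e_i * s_i                              # zeta_i replaced by E(zeta_i)
        resid = y_i - expit(beta0_hat + xb[i])          # Y_i minus null fitted probability
        stat += z_star * resid
    return stat
```

Nodes without neighbors contribute nothing to the sum, since their neighbor term is zero.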
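Motivated by Fan et al. (2017), we propose a supremum score test statistic:
$$
T_n = \sup_{\phi \in \Gamma} \frac{\{S^*(\tilde\theta; \phi)\}^2}{\sum_{i=1}^{n} \{U_i^*(\tilde\theta; \phi)\}^2}, \tag{3.4}
$$
where $\Gamma \subset \mathbb{R}^{p+1}$, and
$$
U_i^*(\tilde\theta; \phi) = \left[ Y_i - \frac{\exp(\tilde\beta_0 + X_i'\tilde\beta)}{1 + \exp(\tilde\beta_0 + X_i'\tilde\beta)} \right] \left\{ \tilde Z_i^* - B_n^*(\tilde\theta) I_n^{-1}(\tilde\theta) (1, X_i')' \right\},
$$
where $I_n(\tilde\theta) = -\frac{\partial^2 L(\theta; \zeta)}{\partial\eta \, \partial\eta'}\Big|_{\theta = \tilde\theta}$, $B_n(\tilde\theta) = -\frac{\partial^2 L(\theta; \zeta)}{\partial\delta \, \partial\eta'}\Big|_{\theta = \tilde\theta}$, $\eta = (\beta_0, \beta')'$ and $B_n^*(\tilde\theta)$ is obtained from
$B_n(\tilde\theta)$ by replacing $\zeta_i$ with its expectation $E(\zeta_i) = P(\zeta_i = 1 \mid X_i)$. To be more specific, we
have the following two explicit expressions: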
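$$
I_n(\tilde\theta) = \sum_{i=1}^{n} \left\{ \frac{\exp(\tilde X_i'\tilde\eta)}{1 + \exp(\tilde X_i'\tilde\eta)} - \left( \frac{\exp(\tilde X_i'\tilde\eta)}{1 + \exp(\tilde X_i'\tilde\eta)} \right)^2 \right\} \tilde X_i \tilde X_i',
$$
and
$$
B_n(\tilde\theta) = \sum_{i=1}^{n} \left[ -\zeta_i \left\{ Y_i - \frac{\exp(\tilde X_i'\tilde\eta)}{1 + \exp(\tilde X_i'\tilde\eta)} \right\} \Big( 0, \sum_{j=1}^{n} a_{ij} X_j' \Big) + \zeta_i \sum_{j=1}^{n} a_{ij} X_j'\tilde\beta \left\{ \frac{\exp(\tilde X_i'\tilde\eta)}{1 + \exp(\tilde X_i'\tilde\eta)} - \left( \frac{\exp(\tilde X_i'\tilde\eta)}{1 + \exp(\tilde X_i'\tilde\eta)} \right)^2 \right\} \tilde X_i' \right],
$$
where $\tilde\eta = (\tilde\beta_0, \tilde\beta')'$, and $\tilde X_i = (1, X_i')'$.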
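The quantities entering $U_i^*(\tilde\theta; \phi)$ can be assembled in a few lines. The sketch below takes $I_n$ and $B_n$ to be the negative second derivatives of $L$ evaluated at $\tilde\theta$, with $\zeta_i$ replaced by $E(\zeta_i)$ in $B_n$; this sign convention, the function names `solve` and `u_star`, and the small Gaussian-elimination helper are assumptions made for illustration. It also assumes $I_n(\tilde\theta)$ is nonsingular.

```python
import math


def expit(t):
    """Numerically stable logistic function exp(t) / (1 + exp(t))."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def solve(M, b):
    """Solve M w = b by Gaussian elimination with partial pivoting (M nonsingular)."""
    n = len(b)
    aug = [list(row) + [b_k] for row, b_k in zip(M, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * w[k] for k in range(r + 1, n))
        w[r] = (aug[r][n] - tail) / aug[r][r]
    return w


def u_star(Y, X, A, beta0_hat, beta_hat, gamma0, gamma):
    """Return [U_1^*, ..., U_n^*] for one candidate phi = (gamma0, gamma')'."""
    n, p = len(Y), len(beta_hat)
    Xt = [[1.0] + list(x) for x in X]                         # X_tilde_i = (1, X_i')'
    prob = [expit(beta0_hat + sum(x * b for x, b in zip(x_i, beta_hat))) for x_i in X]
    info = [[0.0] * (p + 1) for _ in range(p + 1)]            # I_n
    B = [0.0] * (p + 1)                                       # B_n^*
    z_star = []
    for i in range(n):
        nb = [sum(A[i][j] * X[j][k] for j in range(n)) for k in range(p)]  # sum_j a_ij X_j
        e_i = expit(gamma0 + sum(x * g for x, g in zip(X[i], gamma)))      # E(zeta_i)
        z_i = e_i * sum(v * b for v, b in zip(nb, beta_hat))               # Z_tilde_i^*
        var_i = prob[i] * (1.0 - prob[i])
        resid = Y[i] - prob[i]
        z_star.append(z_i)
        grad = [0.0] + nb                                     # (0, sum_j a_ij X_j')
        for k in range(p + 1):
            B[k] += -e_i * resid * grad[k] + z_i * var_i * Xt[i][k]
            for l in range(p + 1):
                info[k][l] += var_i * Xt[i][k] * Xt[i][l]
    w = solve(info, B)                  # I_n is symmetric, so B I_n^{-1} x = w' x
    return [(Y[i] - prob[i]) * (z_star[i] - sum(a * b for a, b in zip(w, Xt[i])))
            for i in range(n)]
```

Solving one linear system for `w` avoids forming the inverse of $I_n$ explicitly, and `w` does not depend on $i$, so it is computed once per candidate $\phi$.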
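Theorem 1 As $n \to \infty$, $T_n$ converges in distribution to $\sup_{\phi \in \Gamma} G^2(\phi)$ under $H_0$, where
$\{G(\phi) : \phi \in \Gamma\}$ is a mean-zero Gaussian process with the covariance function
$$
\Sigma(\phi_1, \phi_2) = \lim_{n \to \infty} \frac{\sum_{i=1}^{n} U_i^*(\theta_t; \phi_1) U_i^*(\theta_t; \phi_2)}{\left[ \sum_{i=1}^{n} \{U_i^*(\theta_t; \phi_1)\}^2 \sum_{i=1}^{n} \{U_i^*(\theta_t; \phi_2)\}^2 \right]^{1/2}},
$$
for any $\phi_1, \phi_2 \in \Gamma$.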
We adopt a resampling method to obtain the critical value of the asymptotic
distribution of $T_n$ under $H_0$. To be more specific, we define a perturbed test statistic:
$$
T_n^* = \sup_{\phi \in \Gamma} \frac{\{\sum_{i=1}^{n} \xi_i U_i^*(\tilde\theta; \phi)\}^2}{\sum_{i=1}^{n} \{U_i^*(\tilde\theta; \phi)\}^2}, \tag{3.5}
$$
where $\xi_1, \cdots, \xi_n$ are independently generated from $N(0, 1)$. Note that $T_n$ and $T_n^*$ share the
same asymptotic distribution under $H_0$. By repeatedly generating a large number of perturbed
statistics, we can obtain the empirical upper $\alpha$-quantile, $C_\alpha$, of the perturbed statistics $T_n^*$.
The null $H_0$ is rejected if $T_n > C_\alpha$.
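The resampling procedure can be sketched as follows. The supremum over $\Gamma$ is approximated here by a maximum over a finite grid of candidate $\phi$ values, which is an assumption of this sketch and not a prescription from the paper. The inputs `S_grid` (one score-type statistic per grid point) and `U_grid` (one list of $U_i^*$ values per grid point), as well as all function names, are hypothetical.

```python
import random


def sup_statistic(S_grid, U_grid):
    """T_n of (3.4), with the supremum replaced by a maximum over the grid."""
    return max(s * s / sum(u * u for u in U) for s, U in zip(S_grid, U_grid))


def perturbed_statistics(U_grid, n_rep, rng):
    """Draw n_rep copies of the perturbed statistic T_n^* of (3.5)."""
    n = len(U_grid[0])
    denom = [sum(u * u for u in U) for U in U_grid]
    draws = []
    for _ in range(n_rep):
        # One set of N(0, 1) multipliers is shared by all grid points in a draw.
        xi = [rng.gauss(0.0, 1.0) for _ in range(n)]
        draws.append(max(sum(x * u for x, u in zip(xi, U)) ** 2 / d
                         for U, d in zip(U_grid, denom)))
    return draws


def critical_value(draws, alpha):
    """Empirical upper alpha-quantile C_alpha of the perturbed statistics."""
    ordered = sorted(draws)
    k = min(len(ordered) - 1, int((1.0 - alpha) * len(ordered)))
    return ordered[k]
```

With these pieces, the test rejects $H_0$ at level $\alpha$ when `sup_statistic(S_grid, U_grid)` exceeds `critical_value(perturbed_statistics(U_grid, n_rep, rng), alpha)`; a finer grid and a larger `n_rep` trade computing time for a closer approximation.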