
A Latent Logistic Regression Model with Graph Data

Haixiang Zhang∗, Yingjun Deng, Alan J.X. Guo, Qing-Hu Hou and Ou Wu

Center for Applied Mathematics, Tianjin University, Tianjin 300072, China

Abstract

Graph (network) data has recently become an active research area in artificial intelligence, machine learning and statistics. In this work, we are interested in whether nodes' labels (people's responses) are affected by their neighbors' features (friends' characteristics). We propose a novel latent logistic regression model to describe network dependence with binary responses. The key advantage of the proposed model is that a latent binary indicator is introduced to indicate whether a node is susceptible to the influence of its neighbors. A score-type test is proposed to diagnose the existence of network dependence. In addition, an EM-type algorithm is used to estimate the model parameters under network dependence. Extensive simulations are conducted to evaluate the performance of our method. Two public datasets are used to illustrate the effectiveness of the proposed latent logistic regression model.

Keywords: Artificial intelligence; EM algorithm; Graph data; Logistic regression; ROC curves; Social network

1 Introduction

Graphs or networks are now widely used in many research fields, such as social interactions, protein-protein interactions, chemical molecule bonds and transport networks. Considerable effort has been devoted to network data in the literature. For example, Zhu et al. (2017) proposed a network vector autoregressive model. Yan et al. (2019) studied the maximum likelihood estimation of a directed network model with covariates. Zhang et al. (2022) studied some topics on large scale social networks. Chandna et al. (2021) considered the local linear estimation of the graphon function. Zhu et al. (2021) introduced a network functional varying coefficient model. Pan et al. (2022) proposed a latent space logistic regression model for link prediction with social networks. Zhao et al. (2022) proposed a dimension reduction method for covariates in network data.

∗Corresponding author: haixiang.zhang@tju.edu.cn (Haixiang Zhang)

arXiv:2210.05218v1 [stat.ME] 11 Oct 2022

Logistic regression is a widely used statistical tool that plays an important role in practical applications, e.g., finance research (Li et al., 2019), public health (Lemon et al., 2003), medicine (Lu and Yang, 2012), education (Pyke and Sheridan, 1993) and bioinformatics (Wu et al., 2009). There have been many methodological developments on logistic regression. For example, Landwehr et al. (1984) proposed graphical methods for assessing logistic regression models. Stefanski and Carroll (1985) studied the logistic regression model when covariates are subject to measurement error. Jennings (1986) investigated outliers and residual distributions in logistic regression. Efron (1988) used logistic regression techniques to estimate hazard rates and survival curves from censored data. Carroll and Pederson (1993) investigated robustness in the logistic regression model. Meier et al. (2008) extended the group lasso to logistic regression models. Liu et al. (2009) proposed an algorithm for solving large-scale sparse logistic regression. Singh et al. (2009) examined the problem of efficient feature selection for logistic regression on very large data sets. Shi et al. (2010) and Yuan et al. (2012) studied some algorithms for L1-regularized logistic regression. Conroy and Sajda (2012) proposed a fast and exact model selection procedure for L2-regularized logistic regression. Das et al. (2013) introduced new supervised and semi-supervised learning algorithms based on locally-weighted logistic regression. Tripathi et al. (2017) considered estimation of the probability density function and the cumulative distribution function of the generalized logistic distribution. Schein and Ungar (2007) and Yang and Loog (2018) studied active learning methods for logistic regression. Wang (2020) studied binary logistic regression for rare events data. Han et al. (2019) presented an efficient algorithm for logistic regression on homomorphic encrypted data. Wang et al. (2018) and Zuo et al. (2021) considered optimal subsampling for logistic regression in big data, among others.

However, the above-mentioned logistic regression methods do not consider the effects of networks. For example, does a person like playing a game given that his/her friends like it? Will customers buy a commodity if their friends have bought it? In this work, we are interested in exploring whether a certain type of network dependence exists in binary outcome data, and in quantifying this dependence structure if it exists. To deal with this issue, we propose a logistic model with a latent binary indicator, which describes whether a node is susceptible to the influence of its neighbors. Our method has the following two advantages. First, the proposed logistic model with a latent binary indicator is very flexible in practical applications. It provides a way to estimate the probability that a node is affected by its neighbors' characteristics in the network. Hence, we can detect a subgroup of nodes that are more likely to be influenced by their neighbors. Second, we give a score-type test for detecting the existence of network dependence in the logistic model. An EM algorithm is employed to estimate the model parameters, which leads to a consistent estimator with desirable performance in simulations.

The remainder of this article is organized as follows: In Section 2, we introduce a latent logistic regression model. In Section 3, we propose a supremum score test statistic to detect the existence of network dependence. In Section 4, an EM-type estimation algorithm is proposed. Simulations and a real data application are presented in Sections 5 and 6, respectively. In Section 7, we give some concluding remarks.

2 Model and Notation

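Let $G = (V, E)$ be an undirected graph with nodes $V$ and edges $E$. Assume that there are $n$ nodes belonging to two classes. Let $Y_i \in \{0, 1\}$ and $\mathbf{X}_i = (X_{i1}, \cdots, X_{ip})' \in \mathbb{R}^p$ be the binary label and feature vector of the $i$-th node $v_i$, $i = 1, \cdots, n$. Given the graph structure of $G$, we propose a novel latent logistic regression model:

$$
P(Y_i = 1 \mid \mathbf{X}_i, \zeta_i) = \frac{\exp\{\beta_0 + \mathbf{X}_i'\boldsymbol{\beta} + \delta \zeta_i \sum_{j=1}^n a_{ij} \mathbf{X}_j'\boldsymbol{\beta}\}}{1 + \exp\{\beta_0 + \mathbf{X}_i'\boldsymbol{\beta} + \delta \zeta_i \sum_{j=1}^n a_{ij} \mathbf{X}_j'\boldsymbol{\beta}\}}, \quad i = 1, \cdots, n, \tag{2.1}
$$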

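where $\beta_0 \in \mathbb{R}$ is an intercept, $\boldsymbol{\beta} = (\beta_1, \cdots, \beta_p)'$ is the vector of regression parameters, $A = (a_{ij})$ is the adjacency matrix ($a_{ii} = 0$, and $a_{ij} = 1$ if there is an edge between the $i$th node and the $j$th node, $a_{ij} = 0$ otherwise); $\zeta_i \in \{0, 1\}$ is a latent indicator denoting whether the label (response) of the $i$th node depends on its neighbors' features. Note that $\zeta_i$ is unobservable, and we assume that

$$
P(\zeta_i = 1 \mid \mathbf{X}_i) = \frac{\exp\{\gamma_0 + \mathbf{X}_i'\boldsymbol{\gamma}\}}{1 + \exp\{\gamma_0 + \mathbf{X}_i'\boldsymbol{\gamma}\}}. \tag{2.2}
$$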

The parameter $\delta$ describes the magnitude of the dependence of a node on its neighbors. When $\delta = 0$, there is no network dependence between labels of connected nodes, and the parameters $\gamma_0$ and $\boldsymbol{\gamma}$ are not estimable in this case. In what follows, we will propose a method to test the null hypothesis $H_0: \delta = 0$, and then give an EM-type algorithm to estimate the model parameters under $\delta \neq 0$.
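To make the two-layer structure of models (2.1) and (2.2) concrete, the following is a minimal Python sketch that draws the latent indicators and then the labels for fixed features and a fixed adjacency matrix. The function name `simulate_latent_logistic`, the random graph and all parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def expit(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def simulate_latent_logistic(X, A, beta0, beta, delta, gamma0, gamma, rng):
    """Draw (Y, zeta) from models (2.1)-(2.2) for fixed features X and adjacency A."""
    # Latent susceptibility indicators zeta_i, model (2.2)
    zeta = rng.binomial(1, expit(gamma0 + X @ gamma))
    # Neighbor term sum_j a_ij X_j' beta for every node i
    neighbor = A @ (X @ beta)
    # Binary labels Y_i, model (2.1); the neighbor term only enters when zeta_i = 1
    Y = rng.binomial(1, expit(beta0 + X @ beta + delta * zeta * neighbor))
    return Y, zeta

# Hypothetical inputs: Gaussian features and a sparse symmetric random graph
rng = np.random.default_rng(0)
n, p = 200, 2
X = rng.normal(size=(n, p))
upper = np.triu(rng.binomial(1, 0.05, size=(n, n)), k=1)
A = upper + upper.T  # undirected graph with a_ii = 0

Y, zeta = simulate_latent_logistic(
    X, A, beta0=-0.5, beta=np.array([1.0, -1.0]),
    delta=0.5, gamma0=0.0, gamma=np.array([0.5, 0.5]), rng=rng,
)
```

Setting `delta=0` in this sketch makes the labels independent of `zeta`, which mirrors the remark that $\gamma_0$ and $\boldsymbol{\gamma}$ are not estimable under $H_0$.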

3 Test for $H_0: \delta = 0$

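First we assume that $\boldsymbol{\zeta} = (\zeta_1, \cdots, \zeta_n)'$ is known. The log likelihood function is

$$
L(\boldsymbol{\theta}; \boldsymbol{\zeta}) = \sum_{i=1}^n \left[ Y_i \Big( \beta_0 + \mathbf{X}_i'\boldsymbol{\beta} + \delta \zeta_i \sum_{j=1}^n a_{ij} \mathbf{X}_j'\boldsymbol{\beta} \Big) - \log\Big( 1 + \exp\Big\{ \beta_0 + \mathbf{X}_i'\boldsymbol{\beta} + \delta \zeta_i \sum_{j=1}^n a_{ij} \mathbf{X}_j'\boldsymbol{\beta} \Big\} \Big) \right], \tag{3.1}
$$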

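where $\boldsymbol{\theta} = (\delta, \beta_0, \boldsymbol{\beta}')'$. Under $H_0$, model (2.1) reduces to the standard logistic model. Let $\tilde{\beta}_0$ and $\tilde{\boldsymbol{\beta}}$ denote the maximum likelihood estimators under the null. Denote $\tilde{\boldsymbol{\theta}} = (0, \tilde{\beta}_0, \tilde{\boldsymbol{\beta}}')'$.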
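Since the null model is an ordinary logistic regression, the null estimates can be obtained with any standard fitting routine. As one possibility, the sketch below uses plain Newton-Raphson iterations in Python; the function name `fit_null_logistic`, the stopping rule and the toy data are assumptions for illustration only.

```python
import numpy as np

def fit_null_logistic(X, Y, n_iter=50, tol=1e-10):
    """Newton-Raphson MLE of (beta0, beta) in the logistic model with delta = 0."""
    X_tilde = np.column_stack([np.ones(len(Y)), X])  # rows are (1, X_i')
    eta = np.zeros(X_tilde.shape[1])
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-X_tilde @ eta))
        score = X_tilde.T @ (Y - prob)                            # gradient of the log likelihood
        info = (X_tilde * (prob * (1.0 - prob))[:, None]).T @ X_tilde  # negative Hessian
        step = np.linalg.solve(info, score)
        eta = eta + step
        if np.max(np.abs(step)) < tol:
            break
    return eta[0], eta[1:]

# Toy data generated from a logistic model (hypothetical values)
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
true_prob = 1.0 / (1.0 + np.exp(-(-0.5 + X @ np.array([1.0, -1.0]))))
Y = rng.binomial(1, true_prob)

beta0_tilde, beta_tilde = fit_null_logistic(X, Y)
```

At convergence the score with respect to $(\beta_0, \boldsymbol{\beta}')'$ is numerically zero, which is the property the score test relies on.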

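Some calculations lead to the following score function:

$$
S(\tilde{\boldsymbol{\theta}}; \boldsymbol{\zeta}) = \frac{\partial L(\boldsymbol{\theta}; \boldsymbol{\zeta})}{\partial \delta} \Big|_{\boldsymbol{\theta} = \tilde{\boldsymbol{\theta}}} = \sum_{i=1}^n \tilde{Z}_i \left[ Y_i - \frac{\exp(\tilde{\beta}_0 + \mathbf{X}_i'\tilde{\boldsymbol{\beta}})}{1 + \exp(\tilde{\beta}_0 + \mathbf{X}_i'\tilde{\boldsymbol{\beta}})} \right], \tag{3.2}
$$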

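where $\tilde{Z}_i = \zeta_i \sum_{j=1}^n a_{ij} \mathbf{X}_j'\tilde{\boldsymbol{\beta}}$. Because $\boldsymbol{\zeta}$ is not available in the $\tilde{Z}_i$'s, we propose to replace $\zeta_i$ with its expectation $E(\zeta_i) = P(\zeta_i = 1 \mid \mathbf{X}_i)$ given in (2.2). For convenience, denote $\tilde{Z}_i^* = E(\zeta_i) \sum_{j=1}^n a_{ij} \mathbf{X}_j'\tilde{\boldsymbol{\beta}}$. After replacing $\tilde{Z}_i$ with $\tilde{Z}_i^*$ in (3.2), we obtain a score-type statistic: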

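$$
S^*(\tilde{\boldsymbol{\theta}}; \boldsymbol{\phi}) = \sum_{i=1}^n \tilde{Z}_i^* \left[ Y_i - \frac{\exp(\tilde{\beta}_0 + \mathbf{X}_i'\tilde{\boldsymbol{\beta}})}{1 + \exp(\tilde{\beta}_0 + \mathbf{X}_i'\tilde{\boldsymbol{\beta}})} \right], \tag{3.3}
$$

where $\boldsymbol{\phi} = (\gamma_0, \boldsymbol{\gamma}')'$.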
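The statistic in (3.3) is a weighted sum of null residuals and is cheap to evaluate for any fixed $\boldsymbol{\phi}$. The following Python sketch shows one way to compute it, treating the null estimates as given inputs. The function name `score_star` and the two-node toy example are assumptions for illustration.

```python
import numpy as np

def expit(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def score_star(Y, X, A, beta0_tilde, beta_tilde, gamma0, gamma):
    """Score-type statistic S*(theta_tilde; phi) of (3.3) for one phi = (gamma0, gamma')'."""
    e_zeta = expit(gamma0 + X @ gamma)               # E(zeta_i) from (2.2)
    z_star = e_zeta * (A @ (X @ beta_tilde))         # Z*_i = E(zeta_i) * sum_j a_ij X_j' beta_tilde
    resid = Y - expit(beta0_tilde + X @ beta_tilde)  # Y_i minus the null fitted probability
    return float(np.sum(z_star * resid))

# Two connected nodes with one feature each (hypothetical toy input)
Y = np.array([1.0, 0.0])
X = np.array([[0.0], [1.0]])
A = np.array([[0, 1], [1, 0]])
s = score_star(Y, X, A, beta0_tilde=0.0, beta_tilde=np.array([1.0]),
               gamma0=0.0, gamma=np.array([0.0]))
# Here E(zeta_i) = 0.5 for both nodes, Z* = (0.5, 0) and the first residual is 0.5, so s = 0.25
```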

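Motivated by Fan et al. (2017), we propose a supremum score test statistic:

$$
T_n = \sup_{\boldsymbol{\phi} \in \Gamma} \frac{\{S^*(\tilde{\boldsymbol{\theta}}; \boldsymbol{\phi})\}^2}{\sum_{i=1}^n \{U_i^*(\tilde{\boldsymbol{\theta}}; \boldsymbol{\phi})\}^2}, \tag{3.4}
$$

where $\Gamma \subset \mathbb{R}^{p+1}$, and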

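$$
U_i^*(\tilde{\boldsymbol{\theta}}; \boldsymbol{\phi}) = \left[ Y_i - \frac{\exp(\tilde{\beta}_0 + \mathbf{X}_i'\tilde{\boldsymbol{\beta}})}{1 + \exp(\tilde{\beta}_0 + \mathbf{X}_i'\tilde{\boldsymbol{\beta}})} \right] \left\{ \tilde{Z}_i^* - B_n^*(\tilde{\boldsymbol{\theta}}) I_n^{-1}(\tilde{\boldsymbol{\theta}}) (1, \mathbf{X}_i')' \right\},
$$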

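where $I_n(\tilde{\boldsymbol{\theta}}) = -\frac{\partial^2 L(\boldsymbol{\theta}; \boldsymbol{\zeta})}{\partial \boldsymbol{\eta} \partial \boldsymbol{\eta}'} \big|_{\boldsymbol{\theta} = \tilde{\boldsymbol{\theta}}}$, $B_n(\tilde{\boldsymbol{\theta}}) = -\frac{\partial^2 L(\boldsymbol{\theta}; \boldsymbol{\zeta})}{\partial \delta \partial \boldsymbol{\eta}'} \big|_{\boldsymbol{\theta} = \tilde{\boldsymbol{\theta}}}$, $\boldsymbol{\eta} = (\beta_0, \boldsymbol{\beta}')'$ and $B_n^*(\tilde{\boldsymbol{\theta}})$ is obtained from $B_n(\tilde{\boldsymbol{\theta}})$ by replacing $\zeta_i$ with its expectation $E(\zeta_i) = P(\zeta_i = 1 \mid \mathbf{X}_i)$. To be more specific, we have the following two explicit expressions: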

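$$
I_n(\tilde{\boldsymbol{\theta}}) = \sum_{i=1}^n \left[ \frac{\exp(\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\eta}})}{1 + \exp(\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\eta}})} - \left\{ \frac{\exp(\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\eta}})}{1 + \exp(\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\eta}})} \right\}^2 \right] \tilde{\mathbf{X}}_i \tilde{\mathbf{X}}_i'
$$

and

$$
B_n(\tilde{\boldsymbol{\theta}}) = -\sum_{i=1}^n \left[ \zeta_i \left\{ Y_i - \frac{\exp(\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\eta}})}{1 + \exp(\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\eta}})} \right\} \Big( 0, \sum_{j=1}^n a_{ij} \mathbf{X}_j' \Big) + \zeta_i \sum_{j=1}^n a_{ij} \mathbf{X}_j'\tilde{\boldsymbol{\beta}} \left\{ \Big( \frac{\exp(\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\eta}})}{1 + \exp(\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\eta}})} \Big)^2 - \frac{\exp(\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\eta}})}{1 + \exp(\tilde{\mathbf{X}}_i'\tilde{\boldsymbol{\eta}})} \right\} \tilde{\mathbf{X}}_i' \right],
$$

where $\tilde{\boldsymbol{\eta}} = (\tilde{\beta}_0, \tilde{\boldsymbol{\beta}}')'$, and $\tilde{\mathbf{X}}_i = (1, \mathbf{X}_i')'$.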
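The quantities above translate directly into matrix operations. The Python sketch below computes $I_n$, $B_n^*$ and the vector of $U_i^*$ for one fixed $\boldsymbol{\phi}$, together with the ratio inside the supremum in (3.4). The function name `u_star_and_score` and the toy inputs are assumptions for illustration, and the null estimates are passed in as arguments rather than fitted here.

```python
import numpy as np

def expit(t):
    """Logistic function 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def u_star_and_score(Y, X, A, beta0_tilde, beta_tilde, gamma0, gamma):
    """Return (U*_1, ..., U*_n) and S* at the null estimates for one phi."""
    n = len(Y)
    X_tilde = np.column_stack([np.ones(n), X])       # rows (1, X_i')
    prob = expit(beta0_tilde + X @ beta_tilde)
    resid = Y - prob
    w = prob * (1.0 - prob)                          # p_i - p_i^2
    e_zeta = expit(gamma0 + X @ gamma)               # E(zeta_i), replaces zeta_i
    AX = A @ X                                       # rows sum_j a_ij X_j'
    z_star = e_zeta * (AX @ beta_tilde)
    # I_n = sum_i (p_i - p_i^2) X~_i X~_i'
    I_n = (X_tilde * w[:, None]).T @ X_tilde
    # B*_n: B_n with zeta_i replaced by E(zeta_i), a vector of length p + 1
    zero_AX = np.column_stack([np.zeros(n), AX])     # rows (0, sum_j a_ij X_j')
    B_star = -((e_zeta * resid) @ zero_AX - (z_star * w) @ X_tilde)
    U = resid * (z_star - X_tilde @ np.linalg.solve(I_n, B_star))
    S_star = float(np.sum(z_star * resid))
    return U, S_star

# Toy inputs (all values hypothetical)
rng = np.random.default_rng(0)
n, p = 100, 2
X = rng.normal(size=(n, p))
upper = np.triu(rng.binomial(1, 0.05, size=(n, n)), k=1)
A = upper + upper.T
Y = rng.binomial(1, expit(-0.5 + X @ np.array([1.0, -1.0])))

U, S_star = u_star_and_score(Y, X, A, -0.5, np.array([1.0, -1.0]), 0.0, np.zeros(p))
T_at_phi = S_star ** 2 / np.sum(U ** 2)              # ratio in (3.4) at this phi
```

When the null estimates are the exact maximum likelihood estimates, the null score for $\boldsymbol{\eta}$ vanishes, so the $U_i^*$ sum to $S^*$ in this sketch.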

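Theorem 1 As $n \to \infty$, $T_n$ converges in distribution to $\sup_{\boldsymbol{\phi} \in \Gamma} G^2(\boldsymbol{\phi})$ under $H_0$, where $\{G(\boldsymbol{\phi}) : \boldsymbol{\phi} \in \Gamma\}$ is a mean zero Gaussian process with the covariance function

$$
\Sigma(\boldsymbol{\phi}_1, \boldsymbol{\phi}_2) = \lim_{n \to \infty} \frac{\sum_{i=1}^n U_i^*(\boldsymbol{\theta}_t; \boldsymbol{\phi}_1) U_i^*(\boldsymbol{\theta}_t; \boldsymbol{\phi}_2)}{\left[ \sum_{i=1}^n \{U_i^*(\boldsymbol{\theta}_t; \boldsymbol{\phi}_1)\}^2 \sum_{i=1}^n \{U_i^*(\boldsymbol{\theta}_t; \boldsymbol{\phi}_2)\}^2 \right]^{1/2}},
$$

for any $\boldsymbol{\phi}_1, \boldsymbol{\phi}_2 \in \Gamma$.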

We adopt a resampling method to obtain the critical value of the asymptotic distribution of $T_n$ under $H_0$. To be more specific, we define a perturbed test statistic:

$$
T_n^* = \sup_{\boldsymbol{\phi} \in \Gamma} \frac{\{\sum_{i=1}^n \xi_i U_i^*(\tilde{\boldsymbol{\theta}}; \boldsymbol{\phi})\}^2}{\sum_{i=1}^n \{U_i^*(\tilde{\boldsymbol{\theta}}; \boldsymbol{\phi})\}^2}, \tag{3.5}
$$

where $\xi_1, \cdots, \xi_n$ are independently generated from $N(0, 1)$. Note that $T_n$ and $T_n^*$ share the same asymptotic distribution under $H_0$. By repeatedly generating a large number of perturbed statistics, we can obtain the empirical upper $\alpha$-quantile, $C_\alpha$, of the perturbed statistics $T_n^*$. The null $H_0$ is rejected if $T_n > C_\alpha$.
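The resampling step can be sketched in Python as follows. The sketch assumes that the supremum over $\Gamma$ is approximated by a maximum over a finite grid of $\boldsymbol{\phi}$ values, which is a simplification introduced here and not stated in the text. The function name `perturbation_critical_value` is hypothetical, and the matrix `U` in the usage lines is a random stand-in for the actual $U_i^*$ values.

```python
import numpy as np

def perturbation_critical_value(U, alpha=0.05, n_draws=2000, rng=None):
    """Empirical upper alpha-quantile C_alpha of T*_n in (3.5) over a finite grid.

    U has shape (K, n): row k holds U*_i(theta_tilde; phi_k) for grid point phi_k.
    """
    rng = np.random.default_rng() if rng is None else rng
    denom = np.sum(U ** 2, axis=1)                        # sum_i {U*_i}^2 for each phi_k
    xi = rng.standard_normal(size=(n_draws, U.shape[1]))  # multipliers xi_i ~ N(0, 1)
    numer = (xi @ U.T) ** 2                               # {sum_i xi_i U*_i}^2, shape (n_draws, K)
    t_star = np.max(numer / denom, axis=1)                # maximum over the grid, one value per draw
    return float(np.quantile(t_star, 1.0 - alpha))

# Usage with a stand-in U matrix (5 grid points, 100 nodes; values are hypothetical)
rng = np.random.default_rng(0)
U = rng.normal(size=(5, 100))
C_alpha = perturbation_critical_value(U, alpha=0.05, n_draws=2000, rng=rng)
# H0 would be rejected when the observed T_n, computed on the same grid, exceeds C_alpha
```

With a single grid point the perturbed statistic is exactly chi-square distributed with one degree of freedom, which gives a quick sanity check on the implementation.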
