Linear Regression with Centrality Measures Job Market Paper Yong Cai

Linear Regression with Centrality Measures
Job Market Paper
Yong Cai
Department of Economics
Northwestern University
yongcai2023@u.northwestern.edu
This version: October 19, 2022.
Newest version here.
Abstract
This paper studies the properties of linear regression on centrality measures when network
data is sparse – that is, when there are many more agents than links per agent
– and when it is measured with error. We make three contributions in this setting:
(1) We show that OLS estimators can become inconsistent under sparsity and
characterize the threshold at which this occurs, with and without measurement error.
This threshold depends on the centrality measure used. Specifically, regression
on eigenvector centrality is less robust to sparsity than regression on degree or diffusion
centrality. (2) We develop
distributional theory for OLS estimators under measurement error and sparsity, find-
ing that OLS estimators are subject to asymptotic bias even when they are consistent.
Moreover, bias can be large relative to their variances, so that bias correction is nec-
essary for inference. (3) We propose novel bias correction and inference methods for
OLS with sparse noisy networks. Simulation evidence suggests that our theory and
methods perform well, particularly in settings where the usual OLS estimators and
heteroskedasticity-consistent/robust t-tests are deficient. Finally, we demonstrate the
utility of our results in an application inspired by De Weerdt and Dercon (2006), in
which we consider consumption smoothing and social insurance in Nyakatoke, Tanza-
nia.
Keywords: networks, diffusion centrality, eigenvector centrality.
JEL Classification Codes: C18, C21, C81.
I am grateful to Ivan Canay, Eric Auerbach and Joel Horowitz for guidance on this project. I have
also benefited from the comments and suggestions of Eduardo Campillo-Betancourt, Piotr Dworczak, Grant
Goehring and Deborah Kim.
arXiv:2210.10024v1 [econ.EM] 18 Oct 2022
1 Introduction
A large and rapidly growing body of work documents the influence of networks in a wide
range of economic outcomes: peer effects drive academic achievement, production networks
shape shock propagation in the macroeconomy, and social networks affect information- and
risk-sharing with important implications for development (see Sacerdote 2011, Carvalho and
Tahbaz-Salehi 2019 and Breza et al. 2019 for recent reviews). Many other examples abound.
One particular strand of research has fruitfully explored the relationship between an
agent’s network position and their economic outcomes. For example, Hochberg et al. (2007)
considers the network of venture capital firms and finds that better-networked firms success-
fully exit a greater proportion of their investments. Meanwhile, Cruz et al. (2017) examines
the social networks in the Philippines and shows that more central families are dispropor-
tionately represented in political offices. Similarly, Banerjee et al. (2013) studies the problem
of diffusing microfinance in India and establishes that seeding information to more central
agents led to greater participation in the program.
In these papers, researchers often estimate linear models by ordinary least squares (OLS),
using centrality measures as explanatory variables. Centrality measures are node-level statis-
tics that capture notions of importance in a network. Since nodes can be important for many
reasons, a variety of centrality measures exist, each capturing a particular aspect of network
position. For example, the degree centrality of an agent reflects the number or intensity
of their direct links, while eigenvector centrality is designed so that the influence of an agent is
proportional to that of their connections. The correlation between an outcome variable and
a particular centrality measure may be revealing about the types of interactions that drive
a given economic phenomenon: an outcome that is well-predicted solely by degree is likely
to be determined in an extremely local manner, whereas one that is more strongly asso-
ciated with eigenvector centrality may involve non-linear interactions between agents that
are further apart. As such, when researchers estimate these correlations and test their sta-
tistical significance, they frequently do so with the goal of drawing conclusions about the
economic significance of various centrality measures and the implied mechanisms for outcome
determination. Such an exercise is credible only if the OLS estimator is close to the
estimand, and if the chosen test statistic (typically the heteroskedasticity-consistent/robust
t-statistic) is well described by its asymptotic distribution (standard normal for t-statistics)
in finite samples.
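As a concrete illustration of the regressors discussed above, the following is a minimal sketch, not the paper's implementation. It assumes an unweighted, undirected network stored as a 0/1 adjacency matrix, the common definition of diffusion centrality as the sum of (delta A)^t applied to a vector of ones for t = 1, ..., T, and a unit-length normalization of eigenvector centrality. The function names, the parameters delta and T, and the star network are hypothetical choices made for the example.

```python
def degree_centrality(adj):
    """Degree centrality: the number of direct links of each node."""
    return [sum(row) for row in adj]


def diffusion_centrality(adj, delta, T):
    """Diffusion centrality: sum over t = 1..T of (delta * A)^t times the ones vector."""
    n = len(adj)
    walk = [1.0] * n   # running value of (delta * A)^t 1
    total = [0.0] * n
    for _ in range(T):
        walk = [delta * sum(adj[i][j] * walk[j] for j in range(n)) for i in range(n)]
        total = [s + w for s, w in zip(total, walk)]
    return total


def eigenvector_centrality(adj, iters=200):
    """Leading eigenvector of A (unit length), by power iteration on A + I.

    Adding the identity leaves the eigenvectors unchanged but avoids the
    oscillation that plain power iteration exhibits on bipartite graphs.
    """
    n = len(adj)
    v = [1.0] * n
    for _ in range(iters):
        w = [v[i] + sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v


def ols_slope(x, y):
    """Slope coefficient from an OLS regression of y on a constant and x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


# Star network (hypothetical example): node 0 is linked to nodes 1, 2 and 3.
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
degree = degree_centrality(star)
outcome = [2.0 + 3.0 * d for d in degree]   # noiseless outcome, for illustration only
print(degree, diffusion_centrality(star, 0.5, 2), ols_slope(degree, outcome))
```

In this toy example the outcome is an exact linear function of degree, so the OLS slope recovers the coefficient without error. The concerns raised in the text arise once the network is sparse and observed with noise.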
However, network data have two features that may threaten the statistical validity of
OLS. Firstly, networks may be sparse, with many more agents than links per agent. This
could happen because interactions are observed with low frequency, or because the interac-
tions in question are rare. Chandrasekhar (2016) argues that many economic networks are
sparse, providing evidence from commonly used social network data (e.g. AddHealth; Kar-
nataka Villages (Banerjee et al. 2013); Harvard social network (Leider et al. 2009)). Sparsity
poses a challenge to estimation and inference: if networks are largely empty, there might not
be enough variation in centrality measures to identify the parameters of interest. Despite
its importance, sparsity has received relatively little attention in the network econometrics
literature.
Secondly, the observed network may differ from the true network of interest. Centrality
measures are often calculated on data which are obtained by survey or constructed using
some proxy for interaction between agents, though subsequent analysis would frequently treat
the true network as known. Ignoring measurement error may thus lead to estimates that
perform poorly. A growing literature works with networks that are assumed to be measured
with error. However, these papers generally do not consider sparse settings. This omission matters
since sparsity and measurement error are mutually reinforcing: sparser networks contain
weaker signals, which are in turn more difficult to pick out from noisy measurements. The
upshot is that OLS estimators computed on sparse, noisy networks may have particularly
poor properties. Asymptotic theory that ignores these features will provide similarly poor
approximations to their finite sample behavior. Consequently, estimation and inference
procedures based on these theories may lead to invalid conclusions about the economic
significance of centrality measures.
This paper studies the statistical properties of OLS on centrality measures in an asymp-
totic framework which features both measurement error and sparsity. Our analysis is centered
on degree, diffusion and eigenvector centralities, which are among the most popular mea-
sures. Our contribution is threefold: (1) We characterize the amount of sparsity at which
OLS estimators become inconsistent with and without measurement error, finding that this
threshold varies depending on the centrality measure used. Specifically, regression on eigen-
vector centrality is less robust to sparsity than that on degree and diffusion. This suggests
that researchers should be cautious about comparing regressions on different centrality mea-
sures, since they may differ in statistical properties in addition to economic significance. (2)
We develop distributional theory for OLS estimators under measurement error and sparsity.
We restrict ourselves to sparsity ranges under which OLS is consistent, but we find that
asymptotic bias can be large even in this case. Furthermore, the bias may be of larger order
than variance, in which case bias correction would be necessary for obtaining non-degenerate
asymptotic distributions. Additionally, we find that under sparsity, the estimator converges
at a slower rate than is reflected by the usual heteroskedasticity-consistent (hc)/robust
standard errors, requiring a different estimator. (3) In view of the distributional theory, we
propose novel bias-corrected estimators and inference methods for OLS with sparse, noisy
networks. We also clarify the settings under which hc/robust t-statistics are appropriate for
testing.
Our theoretical results are derived in an asymptotic framework where networks are modeled
as realizations of sparse random graphs. As n → ∞, the expected number of links per
agent grows much more slowly than n. Because our statistical model captures important
features of real world data, we expect our methods to be reliable for estimation and infer-
ence with sparse, noisy networks. We provide simulation evidence supporting this view. The
utility of our results is also evident from an application inspired by De Weerdt and Dercon
(2006), where we conduct a stylized study of consumption smoothing and social insurance
in Nyakatoke, Tanzania.
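The sparse asymptotic framework described above can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the paper's design: links are drawn independently with probability rho_n times a link function of latent node positions, the link function `graphon` is a hypothetical choice, and the rate rho_n = n^(-1/2) is assumed purely for illustration.

```python
import random


def graphon(u, v):
    """Hypothetical link function on [0, 1]^2 with values in [0, 1]."""
    return 0.5 * (u + v)


def simulate_sparse_graph(n, rho, rng):
    """Draw a symmetric 0/1 adjacency matrix with P(A_ij = 1) = rho * graphon(U_i, U_j)."""
    u = [rng.random() for _ in range(n)]   # latent node positions
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < rho * graphon(u[i], u[j]):
                adj[i][j] = adj[j][i] = 1
    return adj


def mean_degree(adj):
    """Average number of links per agent."""
    return sum(sum(row) for row in adj) / len(adj)


rng = random.Random(0)
for n in (100, 400):
    rho = n ** -0.5   # assumed sparsity rate: tends to zero as n grows
    adj = simulate_sparse_graph(n, rho, rng)
    # Print n, links per agent, and links per agent as a share of n.
    print(n, round(mean_degree(adj), 2), round(mean_degree(adj) / n, 4))
```

Under this assumed rate, the expected number of links per agent is roughly half the square root of n, so it grows with n while its ratio to n shrinks toward zero.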
Our choice of asymptotic framework poses technical challenges. Firstly, the eigenvectors
and eigenvalues of sparse random graphs are difficult to characterize. We draw on recent
advances in random matrix theory (Alt et al. 2021a;b; Benaych-Georges et al. 2019; 2020)
to overcome this challenge. Secondly, spectral norms of random matrices concentrate slowly
in sparse regimes. Instead, we develop bounds for moments of noisy adjacency matrices by
relating them to counts of particular graphs, in the spirit of Wigner (1957) (see Chapter 2
of Tao 2012 more generally). Finally, in order for bias correction to improve mean-squared
error, the bias needs to be estimated at a sufficiently fast rate. Because variance is of lower
order than bias, a naive plug-in approach does not work for estimating higher order bias
terms, although it is sufficient for the first order term. We leverage this fact to recursively
construct good estimators for higher order terms.
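The link between matrix moments and graph counts mentioned above rests on a standard identity: the trace of the k-th power of an adjacency matrix equals the number of closed walks of length k. The sketch below checks two special cases on a small hypothetical graph, namely that tr(A^2) is twice the number of edges and tr(A^3) is six times the number of triangles. It illustrates only this deterministic identity; the paper's bounds concern moments of noisy adjacency matrices and are not reproduced here.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def trace_of_power(adj, k):
    """tr(A^k), which equals the number of closed walks of length k in the graph."""
    n = len(adj)
    power = [[int(i == j) for j in range(n)] for i in range(n)]   # identity matrix
    for _ in range(k):
        power = mat_mul(power, adj)
    return sum(power[i][i] for i in range(n))


def count_edges(adj):
    """Number of undirected links."""
    n = len(adj)
    return sum(adj[i][j] for i in range(n) for j in range(i + 1, n))


def count_triangles(adj):
    """Number of unordered node triples that are mutually linked."""
    n = len(adj)
    return sum(adj[i][j] * adj[j][k] * adj[i][k]
               for i in range(n) for j in range(i + 1, n) for k in range(j + 1, n))


# Hypothetical graph: a triangle on nodes 0, 1, 2 plus node 3 attached to node 0.
g = [[0, 1, 1, 1],
     [1, 0, 1, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0]]
print(trace_of_power(g, 2), 2 * count_edges(g))       # closed walks of length 2
print(trace_of_power(g, 3), 6 * count_triangles(g))   # closed walks of length 3
```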
Related Literature
Our work is most closely related to papers that study linear regression with centrality statistics.
To the best of our knowledge, we are the first to study linear regression with diffusion centrality,
though there exists prior work on eigenvector centrality. Le and Li (2020) studies
linear regression on multiple eigenvectors of a network assuming the same type of measure-
ment error as this paper. They focus on denser settings than we do and provide inference
methods only for the null hypothesis that the slope coefficient is 0. We are concerned only
with eigenvector centrality, which is the leading eigenvector, but our results cover the sparse
case as well as tests of non-zero null hypotheses (more details in Remark 5). Our paper is
also related to Cai et al. (2021), which proposes penalized regressions on the leading left and
right singular vectors of a network. They consider networks that are as sparse as the ones we
study, but their networks are observed with an additive, normally distributed error (more
details in Remark 4). Outside of the linear regression setting, Cheng et al. (2021) considers
inference on deterministic linear functionals of eigenvectors. They study symmetric matrices
with asymmetric noise, proposing novel estimators that leverage asymmetry to improve per-
formance when eigengaps are small. We focus on symmetric matrices with symmetric noise
and study the plug-in estimator in which the eigenvector is estimated using the noisy adjacency
matrix in place of the true matrix.
Our paper also relates to a growing literature that considers sampling and measurement
error in networks. Chandrasekhar and Lewis (2016) examines settings in which researchers
have access to a panel of networks, but which are constructed using only a partial sample
of nodes or edges. Thirkettle (2019) studies a similar missing data problem, but in a cross-
sectional setting with only one network. It is concerned with forming bounds on centrality
statistics and does not consider subsequent linear regression. Griffith (2022) considers the
censoring in network data, which arises when agents are allowed to list only a fixed number
of relationships during the sampling process. The above papers study missing data problems
under the assumption that the observed network is without error. We assume that the
entirety of one network is observed but with error. Lewbel et al. (2021) studies measurement
error in peer effects regression, finding that 2SLS with friends-of-friends instruments is valid
as long as measurement error is small. They do not discuss centrality regressions.
This paper is also connected to the nascent literature on the statistical properties of
sparse networks. A strand of this literature is concerned with network formation models
that can give rise to sparsity in the observed data. Dong et al. (2020) and Motalebi et al.
(2021) consider modifications to the stochastic block model. A more general model takes
the form of inhomogeneous Erdos-Renyi graphs, which are generated by a graphon with a
sparsity parameter that tends to zero in the limit (see for instance Bollobás et al. 2007
and Bickel and Chen 2009). Our paper takes this approach. Yet another model for sparse
graphs is based on graphex processes, which generalize graphons by generating vertices
through Poisson point processes (see Borgs et al. 2018, Veitch and Roy 2019 and references
therein). Our choice of inhomogeneous Erdos-Renyi graphs is motivated by their prevalence
in econometrics (Section 3 of De Paula 2017 and Section 6 of Graham 2020a provide many
examples), as well as tractability considerations. To the best of our knowledge, few papers have
tackled the challenges that sparse networks pose for regression. Two notable exceptions study
network formation models, which take the form of edge-level logistic regressions (Jochmans
2018; Graham 2020b). A separate literature considers estimation of peer effects regressions
involving sparse networks using panel data (Manresa 2016; Rose 2016; De Paula et al. 2020).
Here, sparsity is an assumption used to justify regularization methods. We consider a node-
level regression in a cross-sectional setting with one large network.