Iteratively Reweighted Least Squares Method for Estimating
Polyserial and Polychoric Correlation Coefficients
Peng Zhang1, Ben Liu1, and Jingjing Pan2
1School of Mathematical Sciences, Zhejiang University, Hangzhou, China, 310027.
2Zhejiang Super Soul Artificial Intelligence Research Institute
Abstract
An iteratively reweighted least squares (IRLS) method is proposed in this paper for estimating polyserial and polychoric correlation coefficients. It iteratively calculates the slopes in a series of weighted linear regression models fitted to conditional expected values. For the polyserial correlation coefficient, conditional expectations of the latent predictor are derived from the observed ordinal categorical variable, and the regression coefficient is obtained using the weighted least squares method. In estimating the polychoric correlation coefficient, conditional expectations of the response variable and the predictor are updated in turn. Standard errors of the estimators are obtained using the delta method based on data summaries instead of the whole data. The conditional univariate normal distribution is exploited and a single integral is numerically evaluated in the proposed algorithm, compared with the double integral computed numerically based on the bivariate normal distribution in the traditional maximum likelihood (ML) approaches. This renders the new algorithm very fast in estimating both polyserial and polychoric correlation coefficients. Thorough simulation studies are conducted to compare the performance of the proposed method with that of the classical ML methods. Real data analyses illustrate the advantage of the new method in computation speed.
Keywords: Iteratively reweighted least squares, Polyserial correlation, Polychoric correlation, Tetrachoric correlation, Maximum likelihood, Linear regression.
1 Introduction
In behavioural, educational and psychological studies, it is common for the observed variables to be measured on ordinal scales. For example, the Likert scale is widely used to measure responses in surveys, allowing respondents to express how much they agree or disagree with a particular statement on a five (or seven) point scale. These categorical variables can be treated as being discretized from an underlying continuous variable for the degree of agreement with the statement. There are also many examples of quantitative variables that are discretized explicitly in social science studies. For instance, when asking questions about sensitive or personal quantitative attributes (income, alcohol consumption), the non-response rate may often be reduced by simply asking the respondent to select one of two very broad categories (under $30K / over $30K, etc.). When analyzing this kind of data, a common approach is to assign integer values to each category and proceed with the analysis as if the data had been measured on an interval scale with desired distributional properties.
Corresponding author. E-mail: pengz@zju.edu.cn.
arXiv:2210.11115v1 [stat.ME] 20 Oct 2022
The most common choice for the distribution of the latent variables is the normal distribution, because all covariances between the latent variables are fully captured by the covariance matrix and each of its elements can be estimated separately using a bivariate normal distribution. The correlation in the standard bivariate normal distribution underlying a 2×2 contingency table is called the tetrachoric correlation, which was suggested by Pearson [1900]. The tetrachoric correlation was generalized to the case where the observed variables $X$ and $Y$ have $r$ and $s$ ordinal categories by Ritchie-Scott [1918] and Pearson and Pearson [1922] in the early 20th century, but it took over half a century before a computationally feasible maximum likelihood procedure was proposed by Olsson [1979]. There have been two basic approaches to implementation. The first is the so-called two-step method, which first estimates the unknown thresholds from the marginal frequencies of the table and then finds the maximum likelihood estimate (MLE) of $\rho$ conditional on the estimated thresholds. The second approach is to find the joint MLE of $(\rho, a, b)$ from the likelihood function. Olsson [1979] gives the equation system to be solved and, in addition, derives expressions for the information matrix, which can be used to obtain asymptotic standard errors for the estimates.
Let $X$ be an observed ordinal variable which depends on an underlying latent continuous random variable $Z_1$, and let $Y$ represent another observed continuous variable. It is assumed that the joint distribution of $Z_1$ and $Y$ is bivariate normal. The product moment correlation between $X$ and $Y$ is called the point polyserial correlation, while the correlation between $Z_1$ and $Y$ is called the polyserial correlation. The MLE of the polyserial correlation has been derived by Cox [1974]. Olsson et al. [1982] derived the relationship between the polyserial and point polyserial correlation and compared the MLE of the polyserial correlation with a two-step estimator and with a computationally convenient ad hoc estimator.
Another method to estimate tetrachoric and polychoric correlation coefficients is the Bayesian approach proposed by Albert [1992]. The author used a latent bivariate normal distribution to estimate a polychoric correlation coefficient from the Bayesian point of view by using the Gibbs sampler. One attractive feature of this method is that it can be generalized in a straightforward manner to handle a number of non-normal latent distributions. In the simulations, the author generalized the method to handle bivariate lognormal and bivariate $t$ latent distributions.
Chen and Choi [2009] and Choi et al. [2011] have shown that a different form of Bayesian estimation outperforms traditional maximum likelihood (ML) in a variety of settings, but their method is restricted to the case of the bivariate Gaussian distribution. They correctly pointed out that, in practice, the sample sizes needed to obtain stable estimates of the polychoric correlation coefficient may not be available to the researcher. They claimed that, owing to the properties of the numerical procedure of ML (i.e., an iterative hill-climbing method using gradients of the target function), the ML estimation method for polychoric correlation coefficients has several disadvantages, such as local maxima, non-converged solutions and inaccurate estimation of the confidence interval. Two new Bayesian estimates, maximum a posteriori (MAP) and expected a posteriori (EAP), were introduced and compared with the ML method. In their simulation study, they found evidence that the MAP would be the estimator of choice for polychoric correlation coefficients.
Pearson correlations can be considered less suitable for studying the degree of association between categorical variables for several reasons. First, from a methodological point of view these variables imply ordinal scales, whereas Pearson correlations assume interval measurement scales. Furthermore, the only information provided by this kind of scale is the number of subjects in each of the categories (cells) of a contingency table; if Pearson correlations are used in this case, the relationship between measures is artificially restricted by the restrictions imposed by categorization (Gilley and Uhlig [1993]), since all subjects situated in the interval that delimits a category are treated as belonging to the same category and are therefore assigned the same score, with a resulting reduction in data variability.
Holgado-Tello et al. [2010] illustrated the advantages of using polychoric rather than Pearson correlations in exploratory factor analysis (EFA) and confirmatory factor analysis (CFA), taking into account that the latter require quantitative variables measured in intervals, and that the relationship between these variables has to be monotonic. Their results showed that the solutions obtained by using polychoric correlations provide a more accurate reproduction of the measurement model used to generate the data.
More recently, network research, referred to by researchers as psychological networks, has gained substantial attention in the psychological sciences. Psychological networks have been used in various fields of psychology (Epskamp et al. [2018]). An example is the Gaussian graphical model (GGM) (Lauritzen [1996]), in which edges can be interpreted directly as partial correlation coefficients. The GGM requires an estimate of the covariance matrix as input, for which polyserial and polychoric correlations can also be used when the data are ordinal. However, for large network problems, the ML method usually requires considerably longer computation time.
In this paper we propose a simple and fast method to estimate the polyserial and polychoric correlation coefficients. It is motivated by the fact that Pearson's correlation coefficient coincides with the slope of the regression line for paired standard normal data. When one of the paired continuous variables is discretized, an unbiased estimator of the slope is derived from the resulting categorical data. When both are discretized, the slope of the regression line, i.e. the correlation coefficient of the two normal random variables, is obtained iteratively from a series of similar estimation procedures. The details of the algorithm can be found in Section 2. In Sections 3 and 4, we conduct simulation studies and data analyses to compare the proposed method with the ML method. Finally, we conclude with a discussion and some directions for future work to improve the proposed method.
2 Iteratively Reweighted Least Squares Algorithm
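Assume $(Z_1, Z_2)^T \sim N_2(\mathbf{0}, \mathbf{R})$, where $\mathbf{0} = (0, 0)^T$ and $\mathbf{R} = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$, $-1 \le \rho \le 1$. Conditioning on $Z_1$, $Z_2 \mid Z_1 \sim N(\rho Z_1, 1 - \rho^2)$. Hence
$$E(Z_2 \mid Z_1) = \rho Z_1. \tag{2.1}$$
This represents a simple linear regression model fitting $Z_2$ on $Z_1$, and $\rho$ is the slope of the regression line. Therefore $\rho$, the Pearson correlation coefficient of $Z_1$ and $Z_2$, can be estimated from such a linear regression model.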
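As an illustration of (2.1), the following minimal sketch simulates standard bivariate normal pairs and recovers $\rho$ as the least-squares slope of a regression through the origin. The function names, the seed, the sample size and the value $\rho = 0.6$ are assumptions made for this example only; they do not come from the paper.

```python
import random

def simulate_pairs(rho, n, seed=0):
    """Draw n pairs (Z1, Z2) from a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    z1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Z2 | Z1 ~ N(rho * Z1, 1 - rho^2), the conditional distribution behind (2.1).
    sd = (1.0 - rho * rho) ** 0.5
    z2 = [rho * a + sd * rng.gauss(0.0, 1.0) for a in z1]
    return z1, z2

def slope_through_origin(x, y):
    """Least-squares slope of y on x for a regression without an intercept."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

z1, z2 = simulate_pairs(rho=0.6, n=50_000)   # hypothetical settings
rho_hat = slope_through_origin(z1, z2)
print(round(rho_hat, 3))                     # should be close to 0.6
```

With fully observed continuous data this is just ordinary regression; the sections that follow deal with the case where one or both variables are only observed in discretized form.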
2.1 Polyserial correlation coefficients
Consider the case where one of the paired random variables, namely $Z_1$, is discretized into an ordinal polychotomous variable, $X$, and the other is observed as a continuous variable, $Y$. Let $X$ be an observed ordinal variable with $s$ categories, generated from the latent variable $Z_1$ with $X = i$ if $a_{i-1} < Z_1 \le a_i$, $i = 1, \dots, s$, where the $a_i$ are thresholds with $a_0 = -\infty$ and $a_s = \infty$.
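The discretization rule above can be sketched in a few lines. The helper name `discretize` and the cut points are hypothetical; only the rule $X = i$ if $a_{i-1} < Z_1 \le a_i$ is taken from the text.

```python
from bisect import bisect_left

def discretize(z, thresholds):
    """Map a latent value z to the category i in 1..s with a_{i-1} < z <= a_i.

    `thresholds` holds the finite cut points a_1 < ... < a_{s-1};
    a_0 = -inf and a_s = +inf are implicit.
    """
    # bisect_left counts the thresholds strictly below z, so a value equal
    # to a_i falls into category i, matching the "<=" in the rule.
    return bisect_left(thresholds, z) + 1

cuts = [-1.0, 0.0, 1.5]   # hypothetical thresholds, giving s = 4 categories
codes = [discretize(z, cuts) for z in (-2.0, -1.0, -0.5, 0.0, 0.7, 3.0)]
print(codes)  # prints [1, 1, 2, 2, 3, 4]
```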
If $Z_1$ were observable, the regression line would give $E(Y \mid Z_1) = \rho Z_1$. Taking expectation with respect to $Z_1$,
$$E\{E(Y \mid Z_1)\} = \rho E(Z_1).$$
This holds for every $Z_1$ such that $a_{i-1} < Z_1 \le a_i$, or correspondingly, $X = i$, for $i = 1, 2, \dots, s$. That is,
$$E\{E(Y \mid Z_1, a_{i-1} < Z_1 \le a_i)\} = \rho E(Z_1 \mid a_{i-1} < Z_1 \le a_i),$$
or,
$$E\{E(Y \mid X = i)\} = \rho E(Z_1 \mid a_{i-1} < Z_1 \le a_i), \tag{2.2}$$
for $i = 1, \cdots, s$.
Denote $E(Y \mid X = i)$ by $E_{Y_i}$ and $E(Z_1 \mid a_{i-1} < Z_1 \le a_i)$ by $e_{x_i}$. Equation (2.2) is then a regression model without an intercept, in which $E_{Y_i}$ is the response variable and $e_{x_i}$ is the explanatory variable, with $\rho$ being the regression coefficient. Because the sample estimates of the $E_{Y_i}$ have unequal variances, $\rho$ cannot be estimated with the ordinary least squares method. However, since these estimates are clearly independent of each other, $\rho$ can be estimated with a weighted least squares method with a diagonal weight matrix.
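It is easy to show that the density function of $Y$ given $X = i$, whose expectation is $E_{Y_i}$, is
$$f(y) = \frac{\Phi\left(\dfrac{a_i - \rho y}{\sqrt{1 - \rho^2}}\right) - \Phi\left(\dfrac{a_{i-1} - \rho y}{\sqrt{1 - \rho^2}}\right)}{P_i}\,\phi(y),$$
where $P_i = \Pr(X = i)$. The corresponding mean and variance, $\mu_i$ and $\sigma_i^2$, are given by
$$\mu_i = \rho\,\frac{\phi(a_{i-1}) - \phi(a_i)}{P_i}, \qquad \sigma_i^2 = 1 + \rho^2\,\frac{a_{i-1}\phi(a_{i-1}) - a_i\phi(a_i)}{P_i} - \rho^2\,\frac{\{\phi(a_{i-1}) - \phi(a_i)\}^2}{P_i^2}. \tag{2.3}$$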
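The closed forms in (2.3) can be checked numerically against the density $f(y)$. The sketch below does this with a simple midpoint rule, using only the Python standard library; the helper names, the thresholds $-0.5$ and $1.0$, the value $\rho = 0.6$ and the integration grid are assumptions chosen for illustration.

```python
from math import inf, sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def cdf(a):
    """Phi(a), with the infinite thresholds a_0 and a_s handled explicitly."""
    return 0.0 if a == -inf else 1.0 if a == inf else STD.cdf(a)

def pdf(a):
    """phi(a), taken as 0 at the infinite thresholds."""
    return 0.0 if abs(a) == inf else STD.pdf(a)

def conditional_moments(lo, hi, rho):
    """Mean and variance of Y given lo < Z1 <= hi, following (2.3)."""
    p = cdf(hi) - cdf(lo)
    d = (pdf(lo) - pdf(hi)) / p
    # a * phi(a) is taken as 0 at the infinite thresholds.
    a_phi_lo = 0.0 if lo == -inf else lo * pdf(lo)
    a_phi_hi = 0.0 if hi == inf else hi * pdf(hi)
    t = (a_phi_lo - a_phi_hi) / p
    return rho * d, 1.0 + rho**2 * t - rho**2 * d**2

def density(y, lo, hi, rho):
    """Conditional density f(y) of Y given lo < Z1 <= hi."""
    s = sqrt(1.0 - rho**2)
    p = cdf(hi) - cdf(lo)
    return (cdf((hi - rho * y) / s) - cdf((lo - rho * y) / s)) / p * STD.pdf(y)

def moments_by_quadrature(lo, hi, rho, half_width=10.0, step=0.01):
    """Midpoint-rule mean and variance of f(y), as an independent check."""
    n = int(2 * half_width / step)
    ys = [-half_width + (k + 0.5) * step for k in range(n)]
    fs = [density(y, lo, hi, rho) for y in ys]
    m1 = sum(y * f for y, f in zip(ys, fs)) * step
    m2 = sum(y * y * f for y, f in zip(ys, fs)) * step
    return m1, m2 - m1**2

mu, var = conditional_moments(-0.5, 1.0, rho=0.6)
mu_num, var_num = moments_by_quadrature(-0.5, 1.0, rho=0.6)
print(mu, mu_num)
print(var, var_num)
```

The two printed pairs should agree to many decimal places, which is a quick way to confirm the signs of the terms in (2.3).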
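Let $y_{i1}, y_{i2}, \dots, y_{in_i}$ be the observed responses associated with $X = i$, that is, with $a_{i-1} < Z_{1j} \le a_i$ for $j = 1, \dots, n_i$, where $n_i$ is the number of observations with $X = i$. Then $E_{Y_i}$ is estimated by
$$\hat{E}_{y_i} = \bar{y}_{X=i} = \frac{1}{n_i}\sum_{j=1}^{n_i} y_{ij}. \tag{2.4}$$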
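Since $Z_1$, given $X = i$, has a truncated normal distribution with lower and upper limits $a_{i-1}$ and $a_i$ respectively, $e_{x_i}$ is the expected value of the truncated normal distribution, given by
$$e_{x_i} = \frac{\phi(a_{i-1}) - \phi(a_i)}{P_i}, \tag{2.5}$$
where $\phi(\cdot)$ is the density function of the standard normal distribution. Let $CP_i = \Pr(X \le i) = \Phi(a_i)$, $i = 1, \cdots, s$. Then
$$CP_i = \sum_{j=1}^{i} P_j = \sum_{j=1}^{i} \{\Phi(a_j) - \Phi(a_{j-1})\},$$
so that $\hat{a}_i = \Phi^{-1}(\widehat{CP}_i)$, and $e_{x_i}$ in (2.5) is estimated by
$$\hat{e}_{x_i} = \frac{\phi(\hat{a}_{i-1}) - \phi(\hat{a}_i)}{\hat{P}_i} = \frac{\phi\{\Phi^{-1}(\widehat{CP}_{i-1})\} - \phi\{\Phi^{-1}(\widehat{CP}_i)\}}{\hat{P}_i}, \tag{2.6}$$
where $\hat{P}_i$ and $\widehat{CP}_i$ denote the sample estimates of $P_i$ and $CP_i$.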
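A minimal sketch of (2.5) and (2.6), assuming that $\hat{P}_i$ is the sample proportion $n_i / n$ and $\widehat{CP}_i$ the cumulative proportion (the natural plug-in choice, stated here as an assumption). The function name and the example counts are hypothetical, and every category is assumed to be non-empty.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution

def thresholds_and_scores(counts):
    """From category counts n_1..n_s return (a_hat, e_hat).

    a_hat: the s-1 finite thresholds, a_hat_i = Phi^{-1}(CP_hat_i).
    e_hat: the s estimated truncated-normal means from (2.6).
    Assumes every count is positive.
    """
    n = sum(counts)
    p_hat = [c / n for c in counts]
    a_hat, cum = [], 0
    for c in counts[:-1]:
        cum += c                       # cumulative count up to category i
        a_hat.append(STD.inv_cdf(cum / n))
    # phi vanishes at a_0 = -inf and a_s = +inf.
    phi = [0.0] + [STD.pdf(a) for a in a_hat] + [0.0]
    e_hat = [(phi[i] - phi[i + 1]) / p_hat[i] for i in range(len(counts))]
    return a_hat, e_hat

counts = [20, 30, 30, 20]              # hypothetical category frequencies
a_hat, e_hat = thresholds_and_scores(counts)
print([round(a, 4) for a in a_hat])
print([round(e, 4) for e in e_hat])
```

A useful sanity check is that $\sum_i \hat{P}_i \hat{e}_{x_i} = 0$, because the numerators in (2.6) telescope; this mirrors $E(Z_1) = 0$.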
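Let $\hat{E}_x = (\hat{e}_{x_1}, \hat{e}_{x_2}, \dots, \hat{e}_{x_s})^T$, $\hat{E}_y = (\hat{E}_{y_1}, \hat{E}_{y_2}, \dots, \hat{E}_{y_s})^T$, and
$$\hat{\Sigma} = \begin{pmatrix} \hat{\sigma}_1^2/n_1 & 0 & \cdots & 0 \\ 0 & \hat{\sigma}_2^2/n_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \hat{\sigma}_s^2/n_s \end{pmatrix}.$$
The regression coefficient is given by the weighted least squares method,
$$\hat{\rho} = (\hat{E}_x^T \hat{\Sigma}^{-1} \hat{E}_x)^{-1} \hat{E}_x^T \hat{\Sigma}^{-1} \hat{E}_y, \tag{2.7}$$
which reduces to
$$\hat{\rho} = \frac{\sum_{i=1}^{s} n_i \hat{\sigma}_i^{-2} \hat{e}_{x_i} \hat{E}_{y_i}}{\sum_{i=1}^{s} n_i \hat{\sigma}_i^{-2} \hat{e}_{x_i}^2}. \tag{2.8}$$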
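Since $\sigma_i^2$ in (2.3) depends on $\rho$, $\hat{\rho}$ is obtained iteratively using the formula in (2.8), with the Pearson correlation coefficient as the initial value. The variance of $\hat{\rho}$ is given by
$$\mathrm{Var}(\hat{\rho}) = (\hat{E}_x^T \hat{\Sigma}^{-1} \hat{E}_x)^{-1} = \Big(\sum_{i=1}^{s} n_i \hat{\sigma}_i^{-2} \hat{e}_{x_i}^2\Big)^{-1}, \tag{2.9}$$
and the standard error of $\hat{\rho}$ is $\sqrt{\mathrm{Var}(\hat{\rho})}$.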
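Equations (2.3) to (2.9) can be combined into a short iteration. The following is a sketch assembled from those equations only, not a transcription of the authors' algorithm. It assumes that $y$ is standardized first (the derivation takes $Y$ to be standard normal), that the initial value is the Pearson correlation between the category codes and $y$, and that every category is observed. The function name, tolerance and simulation settings are hypothetical.

```python
import random
from bisect import bisect_left
from math import sqrt
from statistics import NormalDist, fmean

STD = NormalDist()  # standard normal distribution

def irls_polyserial(x, y, s, tol=1e-10, max_iter=200):
    """Return (rho_hat, standard_error) by iterating the update (2.8).

    x holds categories coded 1..s (each must occur); y holds continuous values.
    """
    n = len(y)
    # Assumption: standardize y so that the N(0, 1) margin of the model applies.
    m = fmean(y)
    sd = sqrt(sum((v - m) ** 2 for v in y) / n)
    y = [(v - m) / sd for v in y]

    groups = [[] for _ in range(s)]
    for xi, yi in zip(x, y):
        groups[xi - 1].append(yi)
    n_i = [len(g) for g in groups]
    e_y = [fmean(g) for g in groups]                           # (2.4)

    # Thresholds a_1..a_{s-1} from cumulative proportions.
    a, cum = [], 0
    for c in n_i[:-1]:
        cum += c
        a.append(STD.inv_cdf(cum / n))
    # phi(a) and a * phi(a) vanish at a_0 = -inf and a_s = +inf.
    phi = [0.0] + [STD.pdf(v) for v in a] + [0.0]
    a_phi = [0.0] + [v * STD.pdf(v) for v in a] + [0.0]
    p = [c / n for c in n_i]
    e_x = [(phi[i] - phi[i + 1]) / p[i] for i in range(s)]     # (2.6)
    t = [(a_phi[i] - a_phi[i + 1]) / p[i] for i in range(s)]

    def wls(rho):
        """One weighted least squares step: (updated rho, variance of rho_hat)."""
        var_i = [1.0 + rho**2 * (t[i] - e_x[i] ** 2) for i in range(s)]  # (2.3)
        w = [n_i[i] / var_i[i] for i in range(s)]
        denom = sum(w[i] * e_x[i] ** 2 for i in range(s))
        num = sum(w[i] * e_x[i] * e_y[i] for i in range(s))
        return num / denom, 1.0 / denom                        # (2.8), (2.9)

    # Assumption: start from the Pearson correlation of the codes x with y.
    mx = fmean(x)
    sxx = sum((xi - mx) ** 2 for xi in x)
    rho = sum((xi - mx) * yi for xi, yi in zip(x, y)) / sqrt(sxx * n)

    for _ in range(max_iter):
        new_rho = wls(rho)[0]
        converged = abs(new_rho - rho) < tol
        rho = new_rho
        if converged:
            break
    return rho, sqrt(wls(rho)[1])

# Simulated check with hypothetical settings.
rng = random.Random(1)
true_rho, cuts = 0.5, [-0.8, 0.3, 1.2]
z1 = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
y = [true_rho * z + sqrt(1 - true_rho**2) * rng.gauss(0.0, 1.0) for z in z1]
x = [bisect_left(cuts, z) + 1 for z in z1]
rho_hat, se = irls_polyserial(x, y, s=4)
print(round(rho_hat, 3), round(se, 4))
```

Only the group sizes, group means and thresholds enter the loop, so each iteration costs $O(s)$ once the data have been summarized; the standard error here is the plug-in value from (2.9) evaluated at the final estimate.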
The details of the IRLS algorithm for estimating the polyserial correlation coefficient are given in Algorithm 1.