Bayesian Analysis of Mixtures of Lognormal Distribution with an Unknown Number of Components from Grouped Data

Kazuhiko Kakamu
Abstract
This study proposes a reversible jump Markov chain Monte Carlo method for estimating the parameters of mixtures of lognormal distributions for income from grouped data. Using simulated data examples, we examine the performance of the proposed algorithm and the accuracy of the posterior distributions of the Gini coefficients. The results suggest that the parameters are estimated accurately and that the posterior distributions are close to the true distributions even when a different data generating process is considered. These promising results for the Gini coefficients encouraged us to apply our method to real data from Japan. The empirical examples indicate two subgroups in Japan (2020) and support the reliability of the estimated Gini coefficients.
JEL classification: C11; C13; D31.
Key words: Gini coefficient; grouped data; mixtures of lognormal distribution; reversible jump
MCMC.
*Previous versions of this paper were presented at CFE 2019 in London as well as the seminars at Vienna University of
Economics and Business, Keio University, Kobe University, and Kwansei Gakuin University. We would like to thank the
seminar/conference participants, especially Duangkamon Chotikapanich for her valuable comments and suggestions. Part
of this research was conducted while the author was visiting Institute for Economic Geography and GIScience, Vienna
University of Economics and Business, whose hospitality is gratefully acknowledged. This work is partially supported by
KAKENHI #20H00080, #20K01590 and #16KK0081.
School of Data Science, Nagoya City University, Yamanohata 1, Mizuho-cho, Mizuho-ku, Nagoya 467-8501, Japan.
Email: kakamu@ds.nagoya-cu.ac.jp
arXiv:2210.05115v3 [econ.EM] 21 Sep 2023
1 Introduction
Finding a distribution that fits the data well is one of the main challenges in the estimation of income distributions. However, there is a trade-off between the interpretability of the parameters and the fit of the hypothetical distribution. To improve the fit, several flexible hypothetical distributions have been proposed, including the generalized beta distributions of the first and second kind (McDonald, 1984), the generalized beta distribution (McDonald and Xu, 1995), the double Pareto-lognormal distribution (Reed and Jorgensen, 2004), and the κ-generalized distribution (Clementi et al., 2007). Several of these support economically meaningful interpretations. Mixture distribution models have also been used to fit income data, because the assumed underlying distributions are easy to interpret and the mixture fits better than single-component models in many cases. Mixture models also offer a greater level of detail, such as information on subgroups, as is evident from their adoption in Paap and van Dijk (1998) and Griffiths and Hajargasht (2012), among other studies.
Mixture distribution models have also been considered for household income distributions from a Bayesian point of view using Markov chain Monte Carlo (MCMC) methods. For example, in the lognormal case, Lubrano and Ndoye (2016) considered a finite mixture of lognormal (MLN) distributions estimated from individual data and determined the number of components by the marginal likelihood (Chib, 1995) and the DIC (Spiegelhalter et al., 2002). The income inequality was then decomposed into between-subgroup and within-subgroup components. Mixtures of gamma distributions have also been considered. Wiper et al. (2001) considered the mixtures of gamma distribution model with both a known and an unknown number of components. Chotikapanich and Griffiths (2008) examined Canadian income data using a two-component mixture of gamma densities, which corresponds to the known number of components case in Wiper et al. (2001). However, with the exception of Wiper et al. (2001), the number of components was assumed in advance or determined after estimation in these studies, and all of them used individual or household data.
When the number of components is unknown, as in Wiper et al. (2001), there are two main approaches for dealing with it in a mixture model: one uses a Dirichlet process prior (Escobar and West, 1995), and the other uses a reversible jump MCMC algorithm (Richardson and Green, 1997), which is the approach taken in Wiper et al. (2001). The reversible jump MCMC algorithm, which was first proposed by Green (1995), is one of the most powerful tools in model determination. Richardson and Green (1997) proposed the algorithm in the framework of the mixtures of normal distribution model. Subsequently, some scholars have proposed extensions to the multivariate normal distribution (Komárek, 2009) and to mixtures of normal distributions with the same component means (Papastamoulis and Iliopoulos, 2009). In addition, Miller and Harrison (2013) pointed out that the posterior for the number of components under a Dirichlet process prior is not consistent: unlike with the reversible jump MCMC, it does not converge to the true number. Therefore, we consider the reversible jump MCMC algorithm in this study, because we are also interested in the number of components in the analysis of income distribution.
Although the availability of individual and household data has improved, such data remain difficult to access, especially in developing countries. In contrast, grouped data, which partition the sample space of observations into several non-overlapping groups, are widely available. Using this type of data, Gau et al. (2014) considered an MCMC sampling scheme for finite mixtures of normal distributions. This study extends that approach to the mixtures of lognormal distribution model and simultaneously determines the number of components from grouped data. Exploring this model is worthwhile because the number of components provides information about population subgroups, as discussed in Lubrano and Ndoye (2016). Therefore, if the number of components can be determined from grouped data, this approach can be used, for example, for detailed comparisons of income inequality in developing countries.
This study aims to develop a reversible jump MCMC method for the mixtures of lognormal (MLN) distribution model from grouped data in order to examine income distributions and income inequality in Japan. The proposed algorithm is first examined using simulated data examples, which confirm that it works well in terms of the accuracy of the parameter estimates and the fit of the distribution. The simulation results also suggest that the posterior distributions of the Gini coefficients are accurate. Hence, we applied the method to real data for Japan in 2020 to examine income distributions and inequality. From the results, we identified two subgroups in both two-or-more person households and workers' households. We also observed that the Gini coefficient of two-or-more person households was larger than that of workers' households.
The rest of this paper is organized as follows. In Section 2, we summarize the MLN distribution
model using grouped data with its Gini coefficient and obtain a joint posterior distribution. Section 3
discusses the computational strategy of the MCMC method. In Section 4, our approach is illustrated
using simulated datasets. Section 5 examines the empirical examples using real datasets from Japan.
Finally, brief conclusions are offered in Section 6.
2 Mixtures of Lognormal Distribution Model using Grouped Data
Let $x > 0$, which represents, for example, the annual income of households or individuals, follow any hypothetical distribution. Let $x_i$, $i = 1, 2, \ldots, n$, be observations sampled from the distribution. Then, the grouped data partition the sample space of observations into $K > 1$ non-overlapping intervals of the form $(t_0, t_1], (t_1, t_2], \ldots, (t_{K-1}, t_K)$, where $t_0 = 0$ and $t_K = \infty$. Moreover, only the number $n_k$ of observations falling in each interval $(t_{k-1}, t_k]$, $k = 1, 2, \ldots, K$, can be observed, with
\[
\sum_{k=1}^{K} n_k = n.
\]
It should be mentioned that the class income mean $\bar{x}_k$, which is the average of the $x_i$ in the interval $(t_{k-1}, t_k]$, is also available in many cases.
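The grouping described above can be illustrated with a minimal Python sketch. The function name `group_data` and the toy incomes and boundaries are assumptions made for illustration only; they do not come from the paper. The sketch takes the interior endpoints $t_1 < \cdots < t_{K-1}$ and returns the frequencies $n_k$ and class means $\bar{x}_k$.

```python
import bisect

def group_data(x, boundaries):
    """Return (counts, means) for the intervals (t_0, t_1], ..., (t_{K-1}, inf).

    `boundaries` holds the interior endpoints t_1 < ... < t_{K-1}; t_0 = 0 is implied.
    """
    K = len(boundaries) + 1
    counts = [0] * K
    sums = [0.0] * K
    for xi in x:
        # bisect_left gives the number of endpoints strictly below xi, so an
        # observation equal to t_k is assigned to the right-closed interval (t_{k-1}, t_k].
        k = bisect.bisect_left(boundaries, xi)
        counts[k] += 1
        sums[k] += xi
    # Class income means; an empty class has no mean, so NaN is returned for it.
    means = [s / c if c > 0 else float("nan") for s, c in zip(sums, counts)]
    return counts, means

# Toy data (hypothetical): six incomes grouped into K = 3 intervals.
incomes = [1.0, 2.0, 2.5, 4.0, 7.0, 9.0]
counts, means = group_data(incomes, boundaries=[2.0, 5.0])
```

In this toy example the frequencies sum to the sample size, as required by the constraint on the $n_k$ above.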
Let $\theta$ be the vector of parameters of any underlying hypothetical distribution, which we assume in advance. Let $f(x|\theta)$ and $F(x|\theta)$ be the probability density function (PDF) and cumulative distribution function (CDF), respectively. Given the PDF and CDF, we define the likelihood function, which is based on the concept of selected order statistics, to estimate the parameters of the distribution.¹ To explain the likelihood function, let $\mathbf{t} = (t_1, t_2, \ldots, t_{K-1})$ be the vector of the endpoints of the intervals and let $\mathbf{n} = (n_1, n_2, \ldots, n_K)$ be the vector of frequencies, which fall in the intervals. Then, the likelihood function is defined as follows:
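\[
L(\mathbf{t}|\theta, \mathbf{n}) = n!\, \frac{F(t_1|\theta)^{n_1 - 1}}{(n_1 - 1)!}\, f(t_1|\theta) \left( \prod_{k=2}^{K-1} \frac{\left(F(t_k|\theta) - F(t_{k-1}|\theta)\right)^{n_k - 1}}{(n_k - 1)!}\, f(t_k|\theta) \right) \frac{\left(1 - F(t_{K-1}|\theta)\right)^{n_K}}{n_K!}. \tag{1}
\]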
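To make the structure of (1) concrete, the following Python sketch evaluates its logarithm for arbitrary PDF and CDF callables. It is an illustration under stated assumptions, not the author's implementation: the function names, the use of `sigma` (a standard deviation rather than $\sigma^2$), and the decile example with 100 observations per group are all hypothetical. Factorials are handled with `math.lgamma`, using $\log (m-1)! = \log\Gamma(m)$.

```python
import math
from statistics import NormalDist

def ln_pdf(x, mu, sigma):
    """Lognormal PDF with log-mean mu and log-standard-deviation sigma."""
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma * x)

def ln_cdf(x, mu, sigma):
    """Lognormal CDF, Phi((ln x - mu) / sigma)."""
    return NormalDist().cdf((math.log(x) - mu) / sigma)

def log_likelihood(t, n, pdf, cdf):
    """Log of the selected-order-statistics likelihood (1).

    t: interior endpoints t_1, ..., t_{K-1}; n: frequencies n_1, ..., n_K.
    """
    K = len(n)
    assert len(t) == K - 1
    ll = math.lgamma(sum(n) + 1)            # log n!
    prev_cdf = 0.0                          # F(t_0) = 0 because t_0 = 0
    for k in range(K - 1):                  # intervals 1, ..., K-1
        cur_cdf = cdf(t[k])
        ll += (n[k] - 1) * math.log(cur_cdf - prev_cdf)
        ll -= math.lgamma(n[k])             # log (n_k - 1)!
        ll += math.log(pdf(t[k]))           # density at the observed endpoint
        prev_cdf = cur_cdf
    # Last (open-ended) interval: (1 - F(t_{K-1}))^{n_K} / n_K!
    ll += n[K - 1] * math.log(1.0 - prev_cdf) - math.lgamma(n[K - 1] + 1)
    return ll

# Hypothetical decile data: boundaries at the true deciles of LN(0, 1).
t = [math.exp(NormalDist().inv_cdf(k / 10)) for k in range(1, 10)]
n = [100] * 10
ll_true = log_likelihood(t, n, lambda x: ln_pdf(x, 0.0, 1.0), lambda x: ln_cdf(x, 0.0, 1.0))
ll_off = log_likelihood(t, n, lambda x: ln_pdf(x, 1.0, 1.0), lambda x: ln_cdf(x, 1.0, 1.0))
```

Passing the PDF and CDF as callables keeps the function agnostic about the hypothetical distribution, so the same routine could in principle be reused for a mixture by supplying the mixture PDF and CDF.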
Once an estimate of $\theta$ is obtained from (1), for example by maximum likelihood, the Gini coefficient can be estimated by using
\[
G = -1 + \frac{2}{\mu} \int_{0}^{\infty} x F(x|\theta) f(x|\theta)\, dx, \tag{2}
\]
where $\mu$ is the mean of the distribution.²
¹ McDonald and Ransom (1979) considered the likelihood based on the multinomial distribution, whereas Nishino and Kakamu (2011) considered the likelihood based on selected order statistics. As pointed out by Eckernkemper and Gribisch (2021), the likelihood based on the multinomial distribution is applicable to data with known fixed boundaries and random frequencies, while the likelihood based on selected order statistics is applicable to data with known random boundaries and fixed frequencies. In this study, we follow the likelihood of Nishino and Kakamu (2011), because we are interested in decile data, which have known random boundaries and fixed frequencies. It should be mentioned that our approach treats only the special case of DGP1 in Eckernkemper and Gribisch (2021).
² In the numerical integration, we use the expression
\[
G = 1 - \frac{\int_{0}^{\infty} \left(1 - F(x|\theta)\right)^{2} dx}{\int_{0}^{\infty} \left(1 - F(x|\theta)\right) dx}
\]
In the empirical analysis, we need to specify the hypothetical income distribution. We start with the lognormal (LN) distribution, following Nishino and Kakamu (2011), because this distribution fits the Japanese data that are also used in our empirical example. Although we could consider other distributions, such as the gamma distribution, we restrict our discussion to the LN distribution to focus on our empirical example. Let $x \sim \mathcal{LN}(\mu, \sigma^2)$ denote that $x$ follows the LN distribution, whose PDF is expressed by
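\[
f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}\, x} \exp\left\{ -\frac{(\ln x - \mu)^2}{2\sigma^2} \right\}, \tag{3}
\]
and the CDF is expressed by
\[
F(x|\mu, \sigma^2) = \Phi\left( \frac{\ln x - \mu}{\sigma} \right), \tag{4}
\]
where $\Phi(\cdot)$ is the CDF of the standard normal distribution. If we substitute (3) and (4) into (1), we obtain the likelihood function for the LN distribution model. Its Gini coefficient has a closed form, expressed by
\[
G_{LN} = 2\Phi\left( \frac{\sigma}{\sqrt{2}} \right) - 1. \tag{5}
\]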
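As a small illustration of (5), the closed form can be evaluated directly. The function name `gini_ln` and the chosen values of $\sigma$ are assumptions made for this sketch. Note that the result depends only on $\sigma$ and not on $\mu$, and that $2\Phi(\sigma/\sqrt{2}) - 1$ coincides with $\operatorname{erf}(\sigma/2)$.

```python
import math
from statistics import NormalDist

def gini_ln(sigma):
    """Closed-form Gini coefficient of LN(mu, sigma^2); it does not depend on mu."""
    return 2.0 * NormalDist().cdf(sigma / math.sqrt(2.0)) - 1.0

# Illustrative values of sigma (hypothetical, not estimates from the paper).
ginis = [gini_ln(s) for s in (0.25, 0.5, 1.0, 2.0)]
```

The values increase with $\sigma$ and stay strictly between 0 and 1, which matches the interpretation of $\sigma$ as the dispersion of log income.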
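To extend the above results, we consider the MLN distribution model with $R$ components. Let us begin with the fixed number of components model. Let $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_R)$, $\theta_r = (\mu_r, \sigma_r^2)$, and $\Theta = \{\theta_r\}_{r=1}^{R}$, where $\sum_{r=1}^{R} \pi_r = 1$. Then, the PDF of the MLN distribution with $R$ components is expressed by
\[
f(x|\boldsymbol{\pi}, \Theta) = \sum_{r=1}^{R} \pi_r f(x|\theta_r) = \sum_{r=1}^{R} \frac{\pi_r}{\sqrt{2\pi\sigma_r^2}\, x} \exp\left\{ -\frac{(\ln x - \mu_r)^2}{2\sigma_r^2} \right\}, \tag{6}
\]
and the CDF is expressed by
\[
F(x|\boldsymbol{\pi}, \Theta) = \sum_{r=1}^{R} \pi_r F(x|\theta_r) = \sum_{r=1}^{R} \pi_r \Phi\left( \frac{\ln x - \mu_r}{\sigma_r} \right). \tag{7}
\]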
If we substitute (6) and (7) into (1), we obtain the likelihood function for the MLN distribution model. However, its Gini coefficient does not have a closed form and is therefore calculated numerically from (2). In the next section, we consider the MLN distribution model with an unknown number of components, where $R$ is also treated as one of the parameters.
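The mixture PDF (6), the mixture CDF (7), and a numerical Gini coefficient can be sketched in Python as follows. This is only one possible way to carry out the integration, not necessarily the scheme used in the paper: it assumes the ratio form $G = 1 - \int_0^\infty (1-F)^2 dx / \int_0^\infty (1-F)\, dx$, a trapezoid rule on $y = \ln x$, and ad hoc truncation limits. The function names, the parameterization by standard deviations `sigmas`, and the two-component parameter values are hypothetical.

```python
import math

def mln_pdf(x, weights, mus, sigmas):
    """Mixture-of-lognormals PDF as in (6); sigmas are standard deviations."""
    total = 0.0
    for w, m, s in zip(weights, mus, sigmas):
        z = (math.log(x) - m) / s
        total += w * math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * s * x)
    return total

def mln_cdf(x, weights, mus, sigmas):
    """Mixture-of-lognormals CDF as in (7), with Phi(z) = erfc(-z / sqrt(2)) / 2."""
    return sum(w * 0.5 * math.erfc(-(math.log(x) - m) / (s * math.sqrt(2.0)))
               for w, m, s in zip(weights, mus, sigmas))

def mln_gini(weights, mus, sigmas, steps=20000):
    """Numerical Gini coefficient via 1 - int (1-F)^2 dx / int (1-F) dx."""
    # Truncation limits on y = ln x (an assumption of this sketch).
    y_lo = min(mus) - 40.0
    y_hi = max(m + 12.0 * s for m, s in zip(mus, sigmas))
    h = (y_hi - y_lo) / steps
    num = den = 0.0
    for i in range(steps + 1):
        y = y_lo + i * h
        x = math.exp(y)
        # Survival function 1 - F(x), computed with erfc to keep the upper tail accurate.
        surv = sum(w * 0.5 * math.erfc((y - m) / (s * math.sqrt(2.0)))
                   for w, m, s in zip(weights, mus, sigmas))
        wt = 0.5 if i in (0, steps) else 1.0   # trapezoid end-point weights
        num += wt * surv * surv * x * h        # dx = x dy
        den += wt * surv * x * h
    return 1.0 - num / den

# Hypothetical two-component mixture, plus a one-component sanity check.
g_mix = mln_gini([0.6, 0.4], [0.0, 1.0], [0.5, 0.8])
g_single = mln_gini([1.0], [0.0], [1.0])
```

With a single component, the numerical value can be compared against the closed form for the LN distribution, $2\Phi(\sigma/\sqrt{2}) - 1 = \operatorname{erf}(\sigma/2)$, which gives a quick check on the integration settings before applying them to a mixture.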
The expression in footnote 2 is used because it is equivalent to (2) (see Dorfman, 1979) and is easier to compute than (2).