Bayesian Analysis of Mixtures of Lognormal Distribution with an Unknown Number of Components from Grouped Data *

Kazuhiko Kakamu †

Abstract

This study proposes a reversible jump Markov chain Monte Carlo method for estimating the parameters of lognormal distribution mixtures for income. Using simulated data examples, we examined the performance of the proposed algorithm and the accuracy of the posterior distributions of the Gini coefficients. The results suggest that the parameters were estimated accurately. Accordingly, the posterior distributions are close to the true distributions, even when a different data-generating process is considered. Moreover, the promising results for the Gini coefficients encouraged us to apply our method to real data from Japan. The empirical examples indicate two subgroups in Japan (2020) and confirm the integrity of the Gini coefficients.

JEL classification: C11; C13; D31.

Key words: Gini coefficient; grouped data; mixtures of lognormal distribution; reversible jump MCMC.

* Previous versions of this paper were presented at CFE 2019 in London as well as at seminars at Vienna University of Economics and Business, Keio University, Kobe University, and Kwansei Gakuin University. We would like to thank the seminar and conference participants, especially Duangkamon Chotikapanich, for her valuable comments and suggestions. Part of this research was conducted while the author was visiting the Institute for Economic Geography and GIScience, Vienna University of Economics and Business, whose hospitality is gratefully acknowledged. This work is partially supported by KAKENHI #20H00080, #20K01590 and #16KK0081.

† School of Data Science, Nagoya City University, Yamanohata 1, Mizuho-cho, Mizuho-ku, Nagoya 467-8501, Japan. Email: kakamu@ds.nagoya-cu.ac.jp

arXiv:2210.05115v3 [econ.EM] 21 Sep 2023

1 Introduction

Finding a distribution that fits the data well is one of the main challenges in the estimation of income distributions. However, we face a trade-off between the interpretability of the parameters and the fit of the hypothetical distribution. To improve the fit, several flexible hypothetical distributions have been proposed, including the generalized beta distributions of the first and second kind (McDonald, 1984), the generalized beta distribution (McDonald and Xu, 1995), the double Pareto-lognormal distribution (Reed and Jorgensen, 2004), and the κ-generalized distribution (Clementi et al., 2007). Several of these support economically meaningful interpretations. Alternatively, mixture distribution models are also used to fit income data, because the assumed underlying distributions are easy to interpret and the mixture fits better than single-component models in many cases. The greater level of detail offered by mixture distribution models, such as information on subgroups, is evident from their adoption in Paap and van Dijk (1998) and Griffiths and Hajargasht (2012), among other studies.

Mixture distribution models have also been considered for household income distributions from a Bayesian point of view using Markov chain Monte Carlo (MCMC) methods. For example, in the lognormal case, Lubrano and Ndoye (2016) considered the finite mixtures of lognormal (MLN) distribution model for individual data and determined the number of components by the marginal likelihood (Chib, 1995) and the DIC (Spiegelhalter et al., 2002). Income inequality was then decomposed into between-subgroup and within-subgroup components. Mixtures have also been considered in the gamma case. Wiper et al. (2001) considered the mixtures of gamma distribution model with both a known and an unknown number of components. Chotikapanich and Griffiths (2008) examined Canadian income data using two-component mixtures of gamma densities, which corresponds to the case of a known number of components in Wiper et al. (2001). However, with the exception of Wiper et al. (2001), the number of components was assumed in advance or determined after estimation in these studies, and they used individual or household data, as mentioned above.

As in Wiper et al. (2001), there are two main approaches for dealing with an unknown number of components in a mixture model: one uses a Dirichlet process prior (Escobar and West, 1995), and the other uses a reversible jump MCMC algorithm (Richardson and Green, 1997), which is the approach used in Wiper et al. (2001). The reversible jump MCMC algorithm, first proposed by Green (1995), is one of the most powerful tools in model determination. Richardson and Green (1997) proposed the algorithm in the framework of the mixtures of normal distribution model. Subsequently, some scholars have proposed extensions to the multivariate normal distribution (Komárek, 2009) and to the mixtures of normal distribution with the same component means (Papastamoulis and Iliopoulos, 2009). In addition, Miller and Harrison (2013) pointed out that the posterior for the number of components under a Dirichlet process prior is not consistent: unlike with the reversible jump MCMC, it does not converge to the true number. Therefore, we consider the reversible jump MCMC algorithm in this study, because we are also interested in the number of components in the analysis of income distribution.

Although the availability of individual and household data has improved, such data remain difficult to access, especially in developing countries. By contrast, grouped data, which partition the sample space of observations into several non-overlapping groups, are widely available. Using this type of data, Gau et al. (2014) considered an MCMC sampling scheme for finite mixtures of normal distribution. This study extends that approach to the mixtures of lognormal distribution model and simultaneously determines the number of components from grouped data. Exploring this model is worthwhile because the number of components provides information about population subgroups, as discussed in Lubrano and Ndoye (2016). Therefore, if the number of components can be determined from grouped data, this approach can be used, for example, for detailed comparisons of the income inequalities in developing countries.

This study aims to develop a reversible jump MCMC method for the mixtures of lognormal (MLN) distribution model from grouped data to examine the income distributions and income inequalities in Japan. We first examine the proposed algorithm using simulated data examples. These confirm that the algorithm works well in terms of both the accuracy of the parameter estimates and the fit of the distribution. The simulation results also suggest that the posterior distributions of the Gini coefficients are accurate. Hence, we applied the method to real data from Japan in 2020 to examine the income distributions and inequalities. From the results, we identified two subgroups in both two-or-more person households and workers' households. We also observed that the Gini coefficient of two-or-more person households was larger than that of workers' households.

The rest of this paper is organized as follows. In Section 2, we summarize the MLN distribution model using grouped data with its Gini coefficient and obtain a joint posterior distribution. Section 3 discusses the computational strategy of the MCMC method. In Section 4, our approach is illustrated using simulated datasets. Section 5 examines the empirical examples using real datasets from Japan. Finally, brief conclusions are offered in Section 6.

2 Mixtures of Lognormal Distribution Model using Grouped Data

Let $x > 0$ denote, for example, the annual income of a household or an individual, and suppose that it follows some hypothetical distribution. Let $x_i$, $i = 1, 2, \ldots, n$, be observations sampled from this distribution. Grouped data partition the sample space of observations into $K > 1$ non-overlapping intervals of the form $(t_0, t_1], (t_1, t_2], \ldots, (t_{K-1}, t_K)$, where $t_0 = 0$ and $t_K = \infty$. Moreover, only the number $n_k$ of observations falling in each interval $(t_{k-1}, t_k]$, $k = 1, 2, \ldots, K$, can be observed, with $\sum_{k=1}^{K} n_k = n$. It should be mentioned that the class income mean $\bar{x}_k$, that is, the average of the $x_i$ in the interval $(t_{k-1}, t_k]$, is also available in many cases.
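As a small illustration of this data structure, the following sketch groups a raw sample into the right-closed intervals $(t_{k-1}, t_k]$ and returns the frequencies $n_k$ together with the class means $\bar{x}_k$. The function name `group_counts` and the toy sample are hypothetical and serve only to make the notation concrete.

```python
from bisect import bisect_left

def group_counts(x, t):
    """Return (n_k, class means) for the intervals (t_{k-1}, t_k].

    x : iterable of positive observations
    t : sorted list of interior endpoints t_1 < ... < t_{K-1}
        (t_0 = 0 and t_K = infinity are implied)
    """
    K = len(t) + 1
    counts = [0] * K
    sums = [0.0] * K
    for xi in x:
        # bisect_left gives the 0-based index of the right-closed interval
        # containing xi, so xi == t_k is assigned to (t_{k-1}, t_k].
        k = bisect_left(t, xi)
        counts[k] += 1
        sums[k] += xi
    means = [s / c if c else float("nan") for s, c in zip(sums, counts)]
    return counts, means

# Hypothetical toy sample with K = 3 intervals: (0, 2], (2, 4], (4, inf)
incomes = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
n_k, xbar_k = group_counts(incomes, [2.0, 4.0])
```

Only `n_k` (and, where published, `xbar_k`) would be visible to the analyst; the individual values in `incomes` are not.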

Let $\boldsymbol{\theta}$ be the vector of parameters of the underlying hypothetical distribution, which we assume in advance. Let $f(x|\boldsymbol{\theta})$ and $F(x|\boldsymbol{\theta})$ be the probability density function (PDF) and cumulative distribution function (CDF), respectively. Given the PDF and CDF, we define the likelihood function, which is based on the concept of selected order statistics, to estimate the parameters of the distribution.¹ To explain the likelihood function, let $\boldsymbol{t} = (t_1, t_2, \ldots, t_{K-1})'$ be the vector of the endpoints of the intervals and let $\boldsymbol{n} = (n_1, n_2, \ldots, n_K)'$ be the vector of frequencies that fall in the intervals. Then, the likelihood function is defined as follows:

$$L(\boldsymbol{t}|\boldsymbol{\theta}, \boldsymbol{n}) = n!\, \frac{F(t_1|\boldsymbol{\theta})^{n_1-1}}{(n_1-1)!}\, f(t_1|\boldsymbol{\theta}) \left( \prod_{k=2}^{K-1} \frac{\left(F(t_k|\boldsymbol{\theta}) - F(t_{k-1}|\boldsymbol{\theta})\right)^{n_k-1}}{(n_k-1)!}\, f(t_k|\boldsymbol{\theta}) \right) \frac{\left(1 - F(t_{K-1}|\boldsymbol{\theta})\right)^{n_K}}{n_K!}. \tag{1}$$
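To make the structure of (1) concrete, the following is a minimal sketch of its logarithm for an arbitrary PDF and CDF passed in as callables. The function name `loglik_order_stats` is hypothetical, and the standard exponential distribution in the usage lines is only a convenient stand-in whose likelihood can be checked by hand; it is not a distribution used in this paper.

```python
import math

def loglik_order_stats(t, n, pdf, cdf):
    """Log of the selected-order-statistics likelihood in equation (1).

    t   : interior endpoints t_1 < ... < t_{K-1}
    n   : frequencies n_1, ..., n_K (n_k >= 1 for k < K)
    pdf : callable x -> f(x | theta)
    cdf : callable x -> F(x | theta)
    """
    K = len(n)
    assert len(t) == K - 1
    ll = math.lgamma(sum(n) + 1)          # log n!
    prev = 0.0                            # F(t_0) = 0
    for k in range(K - 1):
        Fk = cdf(t[k])
        ll += (n[k] - 1) * math.log(Fk - prev)  # probability mass strictly inside the interval
        ll -= math.lgamma(n[k])                 # log (n_k - 1)!
        ll += math.log(pdf(t[k]))               # density at the observed endpoint
        prev = Fk
    ll += n[K - 1] * math.log(1.0 - prev)       # upper open interval
    ll -= math.lgamma(n[K - 1] + 1)             # log n_K!
    return ll

# Stand-in distribution (an assumption for illustration): standard exponential
def exp_pdf(x):
    return math.exp(-x)

def exp_cdf(x):
    return 1.0 - math.exp(-x)

# Two observations, t_1 is the smaller one: L = 2 f(t_1) (1 - F(t_1))
value = loglik_order_stats([0.5], [1, 1], exp_pdf, exp_cdf)
```

Working on the log scale with `math.lgamma` avoids overflow in the factorials when the frequencies are large, which is the usual situation with survey-based decile data.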

Once the parameter estimate for $\boldsymbol{\theta}$ is obtained from (1), for example by maximum likelihood, the Gini coefficient can be estimated by using

$$G = -1 + \frac{2}{\mu} \int_0^\infty x F(x|\boldsymbol{\theta}) f(x|\boldsymbol{\theta})\,dx, \tag{2}$$

where $\mu$ is the mean of the distribution.²
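The sketch below evaluates (2) numerically and compares it with the survival-function expression mentioned in footnote 2. It assumes a simple trapezoid rule on the log scale $x = e^u$; the function names, the integration range, and the grid size are illustrative choices, not the paper's implementation. The standard exponential distribution is used as a check because its Gini coefficient is known to be 0.5.

```python
import math

def _log_grid(u_lo, u_hi, m):
    """Yield (x, weight) pairs for a trapezoid rule in u = ln x (weights exclude dx = x du)."""
    h = (u_hi - u_lo) / m
    for i in range(m + 1):
        w = 0.5 * h if i in (0, m) else h
        yield math.exp(u_lo + i * h), w

def gini_eq2(pdf, cdf, u_lo=-30.0, u_hi=30.0, m=6000):
    """G = -1 + (2 / mu) * int_0^inf x F(x) f(x) dx, as in equation (2)."""
    mean = 0.0
    inner = 0.0
    for x, w in _log_grid(u_lo, u_hi, m):
        fx = pdf(x) * x * w           # f(x) dx with dx = x du
        mean += x * fx
        inner += x * cdf(x) * fx
    return -1.0 + 2.0 * inner / mean

def gini_survival(cdf, u_lo=-30.0, u_hi=30.0, m=6000):
    """G = 1 - int (1 - F)^2 dx / int (1 - F) dx, the form in footnote 2."""
    num = 0.0
    den = 0.0
    for x, w in _log_grid(u_lo, u_hi, m):
        s = 1.0 - cdf(x)
        num += s * s * x * w
        den += s * x * w
    return 1.0 - num / den

# Check distribution (an assumption for illustration): standard exponential
def exp_pdf(x):
    return math.exp(-x)

def exp_cdf(x):
    return 1.0 - math.exp(-x)

g_a = gini_eq2(exp_pdf, exp_cdf)
g_b = gini_survival(exp_cdf)
```

The second form needs only the CDF, which is one reason it can be more convenient when the density is a mixture.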

¹ McDonald and Ransom (1979) considered the likelihood based on the multinomial distribution, whereas Nishino and Kakamu (2011) considered the likelihood based on selected order statistics. As pointed out by Eckernkemper and Gribisch (2021), the likelihood based on the multinomial distribution is applicable to data with known fixed boundaries and random frequencies, while the likelihood based on selected order statistics is applicable to data with known random boundaries and fixed frequencies. In this study, we follow the likelihood of Nishino and Kakamu (2011), because we are interested in decile data, which feature known random boundaries and fixed frequencies. It should be mentioned that our approach treats only the special case of DGP1 in Eckernkemper and Gribisch (2021).

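² In the numerical integration, we use the expression

$$G = 1 - \frac{\int_0^\infty \left(1 - F(x|\boldsymbol{\theta})\right)^2 dx}{\int_0^\infty \left(1 - F(x|\boldsymbol{\theta})\right) dx}$$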

In the empirical analysis, we need to specify the hypothetical income distribution. We start with the lognormal (LN) distribution, following Nishino and Kakamu (2011), because it fits the Japanese data, which are also used in our empirical example. Although we could consider other distributions, such as the gamma distribution, we restrict our discussion to the LN distribution to focus on our empirical example. Let $x \sim \mathcal{LN}(\mu, \sigma^2)$ denote that $x$ follows the LN distribution, whose PDF is

$$f(x|\mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}\,x} \exp\left\{-\frac{(\ln x - \mu)^2}{2\sigma^2}\right\}, \tag{3}$$

and whose CDF is

$$F(x|\mu, \sigma^2) = \Phi\left(\frac{\ln x - \mu}{\sigma}\right), \tag{4}$$

where $\Phi(\cdot)$ is the CDF of the standard normal distribution. If we substitute (3) and (4) into (1), we obtain the likelihood function for the LN distribution model, and its Gini coefficient has the closed form

$$G_{LN} = 2\Phi\left(\frac{\sigma}{\sqrt{2}}\right) - 1. \tag{5}$$

To extend the above results, we consider the MLN distribution model with $R$ components. Let us begin with the model with a fixed number of components. Let $\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots, \pi_R)'$, $\boldsymbol{\theta}_r = (\mu_r, \sigma_r^2)'$, and $\boldsymbol{\Theta} = \{\boldsymbol{\theta}_r\}_{r=1}^{R}$, where $\sum_{r=1}^{R} \pi_r = 1$. Then, the PDF of the MLN distribution with $R$ components is

$$f(x|\boldsymbol{\pi}, \boldsymbol{\Theta}) = \sum_{r=1}^{R} \pi_r f(x|\boldsymbol{\theta}_r) = \sum_{r=1}^{R} \frac{\pi_r}{\sqrt{2\pi\sigma_r^2}\,x} \exp\left\{-\frac{(\ln x - \mu_r)^2}{2\sigma_r^2}\right\}, \tag{6}$$

and the CDF is

$$F(x|\boldsymbol{\pi}, \boldsymbol{\Theta}) = \sum_{r=1}^{R} \pi_r F(x|\boldsymbol{\theta}_r) = \sum_{r=1}^{R} \pi_r \Phi\left(\frac{\ln x - \mu_r}{\sigma_r}\right). \tag{7}$$

If we substitute (6) and (7) into (1), we obtain the likelihood function for the MLN distribution model. However, its Gini coefficient does not have a closed form and is therefore calculated from (2). In the next section, we consider the MLN distribution model with an unknown number of components, where $R$ is also treated as one of the parameters.
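As an illustration of (6) and (7) and of the numerical Gini calculation for a mixture, the following sketch evaluates the MLN PDF and CDF and integrates the survival-function form from footnote 2 with a trapezoid rule on the log scale. The function names, the integration range (`pad` standard deviations around the extreme component means), the grid size, and the example parameter values are all assumptions made for this sketch.

```python
import math

def norm_cdf(z):
    """Standard normal CDF, Phi(z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def mln_pdf(x, pi, mu, sigma2):
    """Mixture-of-lognormals PDF, equation (6)."""
    lx = math.log(x)
    return sum(
        p * math.exp(-(lx - m) ** 2 / (2.0 * s2)) / (math.sqrt(2.0 * math.pi * s2) * x)
        for p, m, s2 in zip(pi, mu, sigma2)
    )

def mln_cdf(x, pi, mu, sigma2):
    """Mixture-of-lognormals CDF, equation (7)."""
    lx = math.log(x)
    return sum(p * norm_cdf((lx - m) / math.sqrt(s2)) for p, m, s2 in zip(pi, mu, sigma2))

def mln_gini(pi, mu, sigma2, pad=10.0, steps=4000):
    """Gini coefficient via G = 1 - int (1 - F)^2 dx / int (1 - F) dx."""
    sd = math.sqrt(max(sigma2))
    u_lo = min(mu) - pad * sd
    u_hi = max(mu) + pad * sd
    h = (u_hi - u_lo) / steps
    # Below exp(u_lo) the survival function is numerically 1,
    # so both integrals receive the contribution exp(u_lo).
    num = den = math.exp(u_lo)
    for i in range(steps + 1):
        x = math.exp(u_lo + i * h)
        s = 1.0 - mln_cdf(x, pi, mu, sigma2)
        w = (0.5 if i in (0, steps) else 1.0) * h * x   # trapezoid weight times dx/du
        num += w * s * s
        den += w * s
    return 1.0 - num / den

# Hypothetical two-component example (R = 2)
g_mix = mln_gini([0.6, 0.4], [1.0, 2.0], [0.25, 0.5])
```

With $R = 1$ the routine should reproduce the closed form (5), which gives a simple way to check the numerical integration before applying it to genuine mixtures.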

² (continued) because it is equivalent to (2) (see Dorfman, 1979) and easier than calculating (2).
