An extended generalized Pareto regression
model for count data
Touqeer Ahmad 1, Carlo Gaetan 2 and Philippe Naveau 3
1 ENSAI, CREST-UMR9194, Bruz, France.
2 Dipartimento di Scienze Ambientali, Informatica e Statistica, Università Ca’ Foscari-Venezia, Italy.
3 Laboratoire des Sciences du Climat et de l’Environnement, UMR 8212, CEA-CNRS-UVSQ,
IPSL & Université Paris-Saclay, Gif-sur-Yvette, France.
Address for correspondence: ENSAI, 51 Rue Blaise Pascal, 35170, Bruz, France.
E-mail: touqeer.ahmad@ensai.fr.
Abstract: The statistical modeling of discrete extremes has received less attention
than their continuous counterparts in the Extreme Value Theory (EVT) literature.
One approach to the transition from continuous to discrete extremes is the modeling
of threshold exceedances of integer random variables by the discrete version of the
generalized Pareto distribution. However, the optimal choice of thresholds defining
exceedances remains a problematic issue. Moreover, in a regression framework, the
treatment of the majority of non-extreme data below the selected threshold is either
ignored or separated from the extremes. To tackle these issues, we expand on the
concept of employing a smooth transition between the bulk and the upper tail of
the distribution. In the case of zero inflation, we also develop models with an
additional parameter. To incorporate possible predictors, we relate the parameters to
additive smoothed predictors via an appropriate link, as in the generalized additive
model (GAM) framework. A penalized maximum likelihood estimation procedure is
implemented. We illustrate our modeling proposal with a real dataset of avalanche
activity in the French Alps. With the advantage of bypassing the threshold selection
step, our results indicate that the proposed models are more flexible and robust than
competing models, such as the negative binomial distribution.
Key words: Extreme value theory; discrete extended generalized Pareto distribution;
zero-inflated models; generalized additive models
arXiv:2210.15253v2 [stat.ME] 17 Jun 2024
1 Introduction
Modeling count data, i.e., non-negative integers, in the presence of covariates is a very
common task in many research areas. The Poisson regression model is often of limited
use because count data typically exhibit overdispersion (i.e., the variance of the counts
appears larger than the mean) and/or an excessive number of zeros. Additional
parameters can be inserted to deal with overdispersion, e.g., the quasi-Poisson model
(Wedderburn, 1974), or different distributions, such as the negative binomial distribution,
can be fitted. To model zero inflation, Lambert (1992) studied a two-component
mixture model in which one component is a point mass at zero and the other
is an assumed parametric count distribution. Lambert’s specification is an example
of a distributional regression model (Stasinopoulos et al., 2018; Kneib et al., 2023).
The term “distributional” emphasizes that several characteristics of the conditional
distribution of the data are modeled in terms of covariates, rather than only the mean.
So far, the literature has focused mainly on the relationship between the covariates
and the location, scale, and shape (Rigby and Stasinopoulos, 2005); less attention
has been paid to the tail of the count distribution.
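The two-component mixture described above can be sketched in a few lines. This is a minimal illustration, assuming a Poisson count component; the function names (`poisson_pmf`, `zero_inflated_pmf`) and the values of `pi` and `lam` are hypothetical and are not taken from the paper.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability mass at k for a rate lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def zero_inflated_pmf(k: int, pi: float, count_pmf) -> float:
    """Mixture of a point mass at zero (weight pi) and a count pmf (weight 1 - pi)."""
    base = (1.0 - pi) * count_pmf(k)
    return pi + base if k == 0 else base


# Illustrative values only (assumptions, not estimates from the avalanche data).
pi, lam = 0.3, 2.0


def zip_pmf(k: int) -> float:
    return zero_inflated_pmf(k, pi, lambda j: poisson_pmf(j, lam))


support = range(200)  # truncation point chosen so the neglected mass is negligible
total = sum(zip_pmf(k) for k in support)
mean = sum(k * zip_pmf(k) for k in support)
variance = sum(k * k * zip_pmf(k) for k in support) - mean**2

print(f"total mass = {total:.6f}, mean = {mean:.4f}, variance = {variance:.4f}")
```

With these values the variance exceeds the mean, which shows how extra zeros alone already induce overdispersion relative to a plain Poisson model.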
[Figure 1, panel (a): axes labeled "Avalanches" and "log10(count + 1)"; panel (b): "Q-Q Plot", with axes "Theoretical Quantiles" and "Sample Quantiles".]
Figure 1: (a) Frequency table plot (log10 scale) of avalanche events in the Haute-
Maurienne massif of the French Alps, (b) Q-Q plot of randomized residuals from the
zero-inflated negative binomial model with additive covariates. The dotted lines show
the 95% point-wise confidence intervals.
As a motivating example, we consider a dataset on the avalanche activity in the
Haute-Maurienne region of the French Alps. The avalanche activity is measured by a
three-day moving sum of the avalanche events recorded from February 1982 to April
2021; see Evin et al. (2021) for details. Figure 1(a) shows a large relative frequency
of zeros (meaning no avalanche has been reported) as well as heavy-tailed behavior.
We fit a zero-inflated negative binomial regression model under the framework of
generalized additive models for location, scale, and shape (GAMLSS), where the
parameters are related to additive environmental covariates (see Section 4 for a detailed
description) via suitable link functions (Stasinopoulos et al., 2018). The randomized
quantile residuals (Dunn and Smyth, 1996) are used to check the adequacy of the
fitted model. Figure 1(b) clearly shows that the fitted model does not correctly estimate
the upper tail behavior of avalanche extremes. In addition, the number of zeros is
not correctly predicted in this example.
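For readers unfamiliar with the diagnostic used in Figure 1(b), the sketch below illustrates randomized quantile residuals for a discrete response. It draws a uniform value between the fitted CDF at y - 1 and at y, then maps it to the standard normal scale. The Poisson model, the helper names, and the simulated data are assumptions made for illustration; the paper applies the idea to a zero-inflated negative binomial fit.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def poisson_cdf(y: int, lam: float) -> float:
    """Poisson CDF at y (zero for y < 0)."""
    if y < 0:
        return 0.0
    return sum(
        math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1)) for k in range(y + 1)
    )


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Inverse-transform Poisson sampler (capped to guard against rounding)."""
    u, k = rng.random(), 0
    p = math.exp(-lam)
    c = p
    while u > c and k < 1000:
        k += 1
        p *= lam / k
        c += p
    return k


def randomized_quantile_residual(y: int, cdf, rng: random.Random) -> float:
    """Draw u uniformly between F(y - 1) and F(y), then return the normal quantile of u."""
    lower, upper = cdf(y - 1), cdf(y)
    u = lower + (upper - lower) * rng.random()
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return NormalDist().inv_cdf(u)


rng = random.Random(42)
lam = 3.0  # hypothetical "fitted" rate, equal to the simulation truth here
data = [sample_poisson(lam, rng) for _ in range(2000)]
residuals = [randomized_quantile_residual(y, lambda v: poisson_cdf(v, lam), rng) for y in data]
mean_r, sd_r = fmean(residuals), stdev(residuals)
print(f"residual mean = {mean_r:.3f}, residual sd = {sd_r:.3f}")
```

When the fitted model is adequate, these residuals behave like a standard normal sample, so systematic departures in a normal Q-Q plot point to misfit in specific parts of the distribution.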
Extreme value theory, originally developed by Fisher and Tippett (1928), provides
a mathematical blueprint to model very high and very low-frequency events (e.g.,
extreme temperatures, heavy rainfall intensities, heavy floods, and extreme winds).
Monographs such as Coles (2001) or Beirlant et al. (2004) discuss the main
extreme value models. In particular, under the peak-over-threshold (POT) approach
(Pickands, 1975), the distribution of exceedances of a high threshold is often
approximated by the Generalized Pareto Distribution (GPD). Modifications of the GPD
to discrete data exist in the literature (Krishna and Pundir, 2009; Buddana and
Kozubowski, 2014; Kozubowski et al., 2015), and recently Hitz et al. (2024) discussed
discrete versions of the GPD to approximate the tail behavior of integer-valued random
variables. This approach still requires the definition of a threshold at a high quantile,
which is not easy due to the discrete nature of the data (Daouia et al., 2023).
It should also be noted that environmental time series, in particular, are rarely stationary
and depend on environmental factors. A standard approach to modeling continuous
extremes of a non-stationary process keeps a predetermined threshold
but treats the parameters of the GPD as functions of covariates (Davison and Smith,
1990). An alternative approach (Eastoe and Tawn, 2009) uses preprocessing methods
to model the non-stationarity in the body of the process to produce transformed data
and then uses standard methods to model the extremes of the transformed data. The
first approach has been adapted to the discrete case by Ranjbar et al. (2022). The
second approach seems difficult to adapt, because the distribution of the preprocessed
data cannot be connected to a distribution of count data.
The proposed model addresses the issue of the POT approach ignoring or separating
non-extreme data below the selected threshold from the extremes. The model utilizes
a smooth transition between the bulk and upper tail of the distribution, for the full
range of the data, while bypassing a threshold selection. The discrete extended version
of GPD (DEGPD) is derived by discretizing the cumulative distribution function
(CDF) of an extended GPD (Naveau et al.,2016). The model takes into account the
possible effects of covariates in a non-parametric way. Since it is possible to have a
dataset with an excess of zeros, such as in the motivating example, we also consider
a mixture of the previous distribution with a degenerate distribution at zero. This
results in a distribution named Zero-Inflated DEGPD (ZIDEGPD).
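The construction described above can be sketched as follows. This is only one possible reading: it assumes the simplest extended GPD family of Naveau et al. (2016), in which the GPD CDF is raised to a power kappa > 0, and it restricts the shape to xi >= 0. The function names and parameter values are hypothetical, and the parameterization used in the later sections of the paper may differ.

```python
import math


def gpd_cdf(z: float, sigma: float, xi: float) -> float:
    """GPD CDF with scale sigma > 0 and shape xi >= 0."""
    if z <= 0:
        return 0.0
    if xi == 0.0:
        return 1.0 - math.exp(-z / sigma)
    return 1.0 - (1.0 + xi * z / sigma) ** (-1.0 / xi)


def degpd_pmf(k: int, kappa: float, sigma: float, xi: float) -> float:
    """Discretized extended GPD, assuming the transform G(v) = v ** kappa."""
    if k < 0:
        return 0.0
    return gpd_cdf(k + 1, sigma, xi) ** kappa - gpd_cdf(k, sigma, xi) ** kappa


def zidegpd_pmf(k: int, pi: float, kappa: float, sigma: float, xi: float) -> float:
    """Mixture of a point mass at zero (weight pi) and the discretized extended GPD."""
    base = (1.0 - pi) * degpd_pmf(k, kappa, sigma, xi)
    return pi + base if k == 0 else base


# Illustrative parameter values (assumptions, not estimates from the paper).
pi, kappa, sigma, xi = 0.25, 2.0, 1.5, 0.2
n_terms = 100
partial_degpd = sum(degpd_pmf(k, kappa, sigma, xi) for k in range(n_terms))
partial_zi = sum(zidegpd_pmf(k, pi, kappa, sigma, xi) for k in range(n_terms))
print(f"partial sums: DEGPD = {partial_degpd:.6f}, ZIDEGPD = {partial_zi:.6f}")
```

Under this assumption, kappa shapes the lower part of the distribution while xi still governs the upper tail, and no threshold appears anywhere in the definition.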
The paper is organized as follows. Section 2 introduces the DEGPD. The extension
to deal with many zeros and covariate effects is given in Section 3. Section 4 discusses
applications of DEGPD and ZIDEGPD to avalanche data with environmental covariates.
Finally, Section 5 concludes with a summary of our results and a discussion of
future research directions.
2 Discrete extreme modelling
The distribution of exceedances (i.e., the amounts by which observations exceed a given
high threshold) is often approximated by the generalized Pareto distribution (GPD),
defined by its CDF as
\[
F(z; \sigma, \xi) =
\begin{cases}
1 - \left(1 + \xi z/\sigma\right)_{+}^{-1/\xi}, & \xi \neq 0, \\
1 - \exp\left(-z/\sigma\right), & \xi = 0,
\end{cases}
\tag{2.1}
\]
where $(a)_{+} = \max(a, 0)$. The parameters $\sigma > 0$ and $-\infty < \xi < +\infty$ are the scale and
shape parameters of the distribution, respectively.
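As a concrete companion to (2.1), the following minimal sketch evaluates the GPD CDF for any real shape parameter. The function name `gpd_cdf` and the test values are illustrative assumptions; the handling of z <= 0 reflects the fact that the GPD is supported on the non-negative half-line.

```python
import math


def gpd_cdf(z: float, sigma: float, xi: float) -> float:
    """GPD CDF with scale sigma > 0 and shape xi (any real number)."""
    if sigma <= 0:
        raise ValueError("sigma must be strictly positive")
    if z <= 0:
        return 0.0  # no mass below zero
    if xi == 0.0:
        return 1.0 - math.exp(-z / sigma)  # exponential case
    base = max(1.0 + xi * z / sigma, 0.0)  # the (a)_+ = max(a, 0) operation
    if base == 0.0:
        return 1.0  # z is at or beyond the finite upper endpoint (only when xi < 0)
    return 1.0 - base ** (-1.0 / xi)


for shape in (-0.5, 0.0, 0.5):
    values = [round(gpd_cdf(z, 1.0, shape), 4) for z in (0.5, 1.0, 3.0)]
    print(f"xi = {shape:+.1f}: F(0.5), F(1), F(3) = {values}")
```

The xi = 0 branch is the limit of the general branch as xi tends to zero, so evaluating the function at a very small shape value gives nearly the same result as the exponential case.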
More precisely, let $X$ be a random variable taking values in $[0, x_F)$, where
$x_F \in (0, \infty) \cup \{\infty\}$. Suppose that there exists a strictly positive sequence $a_u$ such that
the distribution of $a_u^{-1}(X - u) \mid X \geq u$ converges weakly to a non-degenerate
probability distribution on $[0, \infty)$ as $u \to x_F$; then this distribution is the GPD (Balkema
and De Haan, 1974). Thus, for large $u$,
\[
\Pr(X - u > x \mid X \geq u) = \Pr\left(a_u^{-1}(X - u) > a_u^{-1} x \mid X \geq u\right) \approx 1 - F(x; a_u \sigma, \xi).
\]
The shape parameter, $\xi$, defines the tail behavior of the GPD. If $\xi < 0$, the upper
tail is bounded. If $\xi = 0$, we have the exponential distribution, where all moments
are finite. If $\xi > 0$, the upper tail is unbounded and the higher moments ultimately
become infinite. The three cases are labeled “short-tailed”, “light-tailed”,
and “heavy-tailed”, respectively. Covering all three regimes with a single shape
parameter is what makes the GPD adaptable to various modeling scenarios.
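The three regimes can be made concrete directly from (2.1). The short derivation below restates standard GPD properties and is not a result specific to this paper; $Z$ denotes a hypothetical random variable with CDF $F(\cdot\,; \sigma, \xi)$, introduced here only for illustration.

```latex
\begin{align*}
\xi < 0:&\quad 1 + \xi z/\sigma > 0 \iff z < -\sigma/\xi,
  \ \text{so the support is } [0, -\sigma/\xi]; \\
\xi = 0:&\quad \Pr(Z > z) = \exp(-z/\sigma),
  \ \text{so } \mathrm{E}[Z^k] = k!\,\sigma^k < \infty \ \text{for all } k; \\
\xi > 0:&\quad \Pr(Z > z) = (1 + \xi z/\sigma)^{-1/\xi} \sim (\xi z/\sigma)^{-1/\xi}
  \ \text{as } z \to \infty, \ \text{so } \mathrm{E}[Z^k] < \infty \iff k < 1/\xi.
\end{align*}
```

For instance, with $\xi = 0.5$ the mean is finite but the variance is not, since $k = 2$ does not satisfy $k < 1/\xi = 2$.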
Using the GPD to approximate the distribution tail for discrete data can be inappropriate,
as pointed out in Hitz et al. (2024). These authors proposed to approximate
the distribution tail of a count random variable $Y$ by discretizing the CDF defined
by (2.1), so that, for large $u$,
\[
\Pr(Y - u = k \mid Y \geq u) = F(k + 1; \sigma, \xi) - F(k; \sigma, \xi), \quad k \in \mathbb{N}_0, \tag{2.2}
\]
with $\sigma > 0$ and $\xi \geq 0$. The distribution is called the discrete GPD (DGPD), and several
properties of discrete Pareto-type distributions have been studied previously in the
literature (Krishna and Pundir, 2009; Buddana and Kozubowski, 2014; Kozubowski
et al., 2015).
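The discretization in (2.2) amounts to differencing the GPD CDF at consecutive integers. The sketch below, with hypothetical function names and illustrative parameter values, computes the DGPD probability mass function for $\xi \geq 0$ and checks that the partial sums telescope to the CDF.

```python
import math


def gpd_cdf(z: float, sigma: float, xi: float) -> float:
    """GPD CDF from (2.1), restricted to shape xi >= 0 as in (2.2)."""
    if xi < 0:
        raise ValueError("this sketch only covers xi >= 0")
    if z <= 0:
        return 0.0
    if xi == 0.0:
        return 1.0 - math.exp(-z / sigma)
    return 1.0 - (1.0 + xi * z / sigma) ** (-1.0 / xi)


def dgpd_pmf(k: int, sigma: float, xi: float) -> float:
    """Discrete GPD mass at k in {0, 1, 2, ...}: F(k + 1) - F(k)."""
    if k < 0:
        return 0.0
    return gpd_cdf(k + 1, sigma, xi) - gpd_cdf(k, sigma, xi)


sigma, xi = 2.0, 0.2  # illustrative values only
probs = [dgpd_pmf(k, sigma, xi) for k in range(50)]
partial_sum = sum(probs)  # telescopes to F(50; sigma, xi)
print(f"P(0..4) = {[round(p, 4) for p in probs[:5]]}")
print(f"mass on 0..49 = {partial_sum:.6f}")
```

When $\xi = 0$ the resulting distribution is geometric, with mass $e^{-k/\sigma}(1 - e^{-1/\sigma})$ at $k$, which gives a quick sanity check on any implementation.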