
An extended generalized Pareto regression model for count data

Touqeer Ahmad 1, Carlo Gaetan 2 and Philippe Naveau 3

1 ENSAI, CREST-UMR9194, Bruz, France.

2 Dipartimento di Scienze Ambientali, Informatica e Statistica, Università Ca’ Foscari-Venezia, Italy.

3 Laboratoire des Sciences du Climat et de l’Environnement, UMR 8212, CEA-CNRS-UVSQ, IPSL & Université Paris-Saclay, Gif-sur-Yvette, France.

Address for correspondence: ENSAI, 51 Rue Blaise Pascal, 35170, Bruz, France.

E-mail: touqeer.ahmad@ensai.fr.

Abstract: The statistical modeling of discrete extremes has received less attention

than that of their continuous counterparts in the Extreme Value Theory (EVT) literature.

One approach to the transition from continuous to discrete extremes is the modeling

of threshold exceedances of integer random variables by the discrete version of the

generalized Pareto distribution. However, the optimal choice of thresholds defining

exceedances remains a problematic issue. Moreover, in a regression framework, the

treatment of the majority of non-extreme data below the selected threshold is either

ignored or separated from the extremes. To tackle these issues, we expand on the

concept of employing a smooth transition between the bulk and the upper tail of

the distribution. In the case of zero inflation, we also develop models with an

additional parameter. To incorporate possible predictors, we relate the parameters to

additive smoothed predictors via an appropriate link, as in the generalized additive

model (GAM) framework. A penalized maximum likelihood estimation procedure is

implemented. We illustrate our modeling proposal with a real dataset of avalanche

activity in the French Alps. With the advantage of bypassing the threshold selection

step, our results indicate that the proposed models are more flexible and robust than

competing models, such as the negative binomial distribution.

Key words: Extreme value theory; discrete extended generalized Pareto

distribution; zero-inflated models; generalized additive models

arXiv:2210.15253v2 [stat.ME] 17 Jun 2024


1 Introduction

Modeling count data, i.e., non-negative integers, in the presence of covariates is a very

common task in many research areas. The Poisson regression model is often of limited

use because count data typically exhibit overdispersion (i.e., the variance of the counts

appears larger than the mean) and/or an excessive number of zeros. Additional

parameters can be inserted to deal with overdispersion, e.g., the quasi-Poisson model

(Wedderburn, 1974), or different distributions, such as the negative binomial distribution,

can be fitted. To model zero inflation, Lambert (1992) studied a two-component

mixture model in which one component is a point mass at zero and the other component

is an assumed parametric count distribution. Lambert’s specification is an example

of a distributional regression model (Stasinopoulos et al., 2018; Kneib et al., 2023).

The term “distributional” emphasizes that several characteristics of the conditional

distribution of the data are modeled in terms of covariates, rather than only the mean.

So far, the main focus of the literature has been on the relationship between the

location, scale, and shape parameters and the covariates (Rigby and Stasinopoulos,

2005); less attention has been paid to the tail of the count distribution.
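As a small illustration of the two-component mixture described above, the sketch below builds a zero-inflated Poisson probability mass function and computes its mean and variance numerically. The function names (`poisson_pmf`, `zip_pmf`, `mean_and_variance`) and the parameter values are illustrative assumptions, not taken from the paper; the Poisson count component is only one possible choice.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability of observing k events when the rate is lam > 0."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def zip_pmf(k: int, pi: float, lam: float) -> float:
    """Zero-inflated Poisson mass: weight pi on a point mass at zero,
    weight (1 - pi) on a Poisson(lam) count component."""
    base = (1.0 - pi) * poisson_pmf(k, lam)
    return pi + base if k == 0 else base


def mean_and_variance(pmf, kmax: int = 200):
    """Mean and variance of a count distribution, truncating the sums at kmax."""
    probs = [pmf(k) for k in range(kmax + 1)]
    mean = sum(k * p for k, p in enumerate(probs))
    var = sum((k - mean) ** 2 * p for k, p in enumerate(probs))
    return mean, var


# Illustrative values only: 30% structural zeros, Poisson rate 4.
mean, var = mean_and_variance(lambda k: zip_pmf(k, pi=0.3, lam=4.0))
print(f"P(Y = 0) = {zip_pmf(0, 0.3, 4.0):.4f}")
print(f"mean = {mean:.3f}, variance = {var:.3f}")
```

Even though the count component is Poisson, the mixture has a variance larger than its mean, so zero inflation alone already induces overdispersion.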

[Figure 1, two panels. Panel (a): axes labeled "Avalanches" and "log10(count + 1)". Panel (b): "Q-Q Plot", with axes "Theoretical Quantiles" and "Sample Quantiles".]

Figure 1: (a) Frequency table plot (log10 scale) of avalanche events in the

Haute-Maurienne massif of the French Alps; (b) Q-Q plot of randomized residuals from the

zero-inflated negative binomial model with additive covariates. The dotted lines show

the 95% point-wise confidence intervals.

As a motivating example, we consider a dataset on the avalanche activity in the

Haute-Maurienne region of the French Alps. The avalanche activity is measured by a

three-day moving sum of the avalanche events recorded from February 1982 to April

2021; see Evin et al. (2021) for details. Figure 1(a) shows a large relative frequency


of zeros (meaning no avalanche has been reported) as well as heavy-tailed behavior.

We fit a zero-inflated negative binomial regression model under the framework of

generalized additive models for location, scale, and shape (GAMLSS), where the

parameters are related to additive environmental covariates (see Section 4 for a detailed

description) via suitable link functions (Stasinopoulos et al., 2018). The randomized

quantile residuals (Dunn and Smyth, 1996) are used to check the adequacy of the

fitted model. Figure 1(b) clearly shows that the fitted model does not correctly estimate

the upper tail behavior of avalanche extremes. In addition, the number of zeros is

not correctly predicted in this example.
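The randomized quantile residuals used in this diagnostic can be sketched as follows. For a count y with fitted CDF F, one draws u uniformly on [F(y - 1), F(y)] and maps it through the standard normal quantile function; under a correct model the residuals are approximately standard normal. The sketch uses a plain Poisson model as a stand-in for the fitted zero-inflated negative binomial model, and all function names and parameter values are illustrative assumptions.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def poisson_cdf(k: int, lam: float) -> float:
    """Poisson CDF at k; returns 0 for k < 0 so that F(y - 1) is defined at y = 0."""
    if k < 0:
        return 0.0
    total = sum(
        math.exp(-lam + j * math.log(lam) - math.lgamma(j + 1)) for j in range(k + 1)
    )
    return min(total, 1.0)


def sample_poisson(lam: float, rng: random.Random, kmax: int = 1000) -> int:
    """Inverse-CDF sampling of a Poisson count (stdlib only, capped at kmax)."""
    u = rng.random()
    k = 0
    while k < kmax and poisson_cdf(k, lam) < u:
        k += 1
    return k


def randomized_quantile_residual(y: int, cdf, rng: random.Random) -> float:
    """Draw u uniformly on [F(y - 1), F(y)] and return the normal quantile of u."""
    lower, upper = cdf(y - 1), cdf(y)
    u = rng.uniform(lower, upper)
    u = min(max(u, 1e-12), 1.0 - 1e-12)  # keep u strictly inside (0, 1)
    return NormalDist().inv_cdf(u)


rng = random.Random(0)
lam = 3.0  # illustrative rate; the "fitted" model here equals the true model
counts = [sample_poisson(lam, rng) for _ in range(2000)]
residuals = [
    randomized_quantile_residual(y, lambda k: poisson_cdf(k, lam), rng) for y in counts
]
print(f"residual mean = {mean(residuals):.3f}, sd = {stdev(residuals):.3f}")
```

Plotting the sorted residuals against standard normal quantiles gives a Q-Q plot of the kind shown in Figure 1(b); systematic departures in the upper right corner indicate a poorly fitted upper tail.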

Extreme value theory, originally developed by Fisher and Tippett (1928), provides

a mathematical blueprint to model very high and very low-frequency events (e.g.,

extreme temperatures, heavy rainfall intensities, heavy floods, and extreme winds),

and monographs such as Coles (2001) or Beirlant et al. (2004) discuss the main

extreme value models. In particular, under the peak-over-threshold (POT) approach

(Pickands, 1975), the distribution of exceedances of a high threshold is often

approximated by the Generalized Pareto Distribution (GPD). Modifications of the GPD

to discrete data exist in the literature (Krishna and Pundir, 2009; Buddana and

Kozubowski, 2014; Kozubowski et al., 2015), and recently Hitz et al. (2024) discussed

discrete versions of the GPD to approximate the tail behavior of integer-valued random

variables. This approach still requires the definition of a threshold at a high quantile,

which is not easy due to the discrete nature of the data (Daouia et al., 2023).

It should also be noted that especially environmental time series are rarely stationary

and depend on environmental factors. A standard approach to modeling continuous

extremes of a non-stationary process focuses on maintaining a predetermined

threshold but treating parameters of the GPD as functions of covariates (Davison and Smith,

1990). An alternative approach (Eastoe and Tawn, 2009) uses preprocessing methods

to model the non-stationarity in the body of the process to produce transformed data

and then uses standard methods to model the extremes of the transformed data. The

first approach has been adapted to the discrete case by Ranjbar et al. (2022). The

second approach seems to be difficult to adapt, because the distribution of the preprocessed

data cannot be connected to a distribution of count data.

The proposed model addresses the issue of the POT approach ignoring or separating

non-extreme data below the selected threshold from the extremes. The model utilizes

a smooth transition between the bulk and upper tail of the distribution, for the full

range of the data, while bypassing a threshold selection. The discrete extended version

of GPD (DEGPD) is derived by discretizing the cumulative distribution function

(CDF) of an extended GPD (Naveau et al.,2016). The model takes into account the

possible effects of covariates in a non-parametric way. Since it is possible to have a

dataset with an excess of zeros, such as in the motivating example, we also consider

a mixture of the previous distribution with a degenerate distribution at zero. This

results in a distribution named Zero-Inflated DEGPD (ZIDEGPD).
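The construction described in this paragraph can be sketched numerically. The text above does not say which extended GPD family is used, so the sketch assumes the power family G(v) = v^κ, one of the families in Naveau et al. (2016); that choice, the function names, and the parameter values are all assumptions made for illustration. The count distribution is obtained by differencing G(F(·)) at consecutive integers, where F is the GPD CDF, and the zero-inflated version mixes it with a point mass at zero.

```python
import math


def gpd_cdf(z: float, sigma: float, xi: float) -> float:
    """GPD cumulative distribution function with scale sigma > 0 and shape xi."""
    if z <= 0:
        return 0.0
    if xi == 0.0:
        return 1.0 - math.exp(-z / sigma)
    arg = xi * z / sigma
    if arg <= -1.0:  # beyond the upper endpoint when xi < 0
        return 1.0
    return -math.expm1(-math.log1p(arg) / xi)


def degpd_pmf(k: int, sigma: float, xi: float, kappa: float) -> float:
    """Discretized extended GPD, ASSUMING the power family G(v) = v ** kappa."""
    if k < 0:
        return 0.0
    upper = gpd_cdf(k + 1, sigma, xi) ** kappa
    lower = gpd_cdf(k, sigma, xi) ** kappa
    return upper - lower


def zidegpd_pmf(k: int, pi: float, sigma: float, xi: float, kappa: float) -> float:
    """Mixture of a point mass at zero (weight pi) and the discretized model."""
    base = (1.0 - pi) * degpd_pmf(k, sigma, xi, kappa)
    return pi + base if k == 0 else base


# Illustrative parameter values only.
sigma, xi, kappa, pi = 2.0, 0.2, 1.5, 0.4
for k in range(4):
    print(k, round(degpd_pmf(k, sigma, xi, kappa), 4),
          round(zidegpd_pmf(k, pi, sigma, xi, kappa), 4))
```

Under this assumed family, κ controls the shape of the lower part of the distribution while ξ still governs the upper tail, and setting κ = 1 reduces the model to a discretized GPD on the full range of the data.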


The paper is organized as follows. Section 2 introduces the DEGPD. The extension

to deal with many zeros and covariate effects is given in Section 3. Section 4 discusses

applications of DEGPD and ZIDEGPD to avalanche data with environmental

covariates. Finally, Section 5 concludes with a summary of our results and a discussion of

future research directions.

2 Discrete extreme modelling

The distribution of exceedances (i.e., the amounts by which observations exceed a given

high threshold) is often approximated by the Generalized Pareto distribution (GPD),

defined by its CDF as

$$F(z;\sigma,\xi) = \begin{cases} 1 - (1 + \xi z/\sigma)_+^{-1/\xi}, & \xi \neq 0, \\ 1 - \exp(-z/\sigma), & \xi = 0, \end{cases} \tag{2.1}$$

where $(a)_+ = \max(a, 0)$. The parameters σ > 0 and −∞ < ξ < +∞ are the scale and

shape parameters of the distribution, respectively.
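A direct transcription of the CDF in (2.1) may help in checking the two branches and the role of the positive part. This is a minimal sketch; the function name `gpd_cdf` and the example values are illustrative assumptions, and `log1p`/`expm1` are used only for numerical stability when ξ is close to zero.

```python
import math


def gpd_cdf(z: float, sigma: float, xi: float) -> float:
    """CDF of the generalized Pareto distribution, following (2.1)."""
    if sigma <= 0:
        raise ValueError("sigma must be strictly positive")
    if z <= 0:
        return 0.0
    if xi == 0.0:
        # Exponential branch of (2.1).
        return 1.0 - math.exp(-z / sigma)
    arg = xi * z / sigma
    if arg <= -1.0:
        # (1 + xi z / sigma)_+ = 0: z lies beyond the upper endpoint (xi < 0).
        return 1.0
    # 1 - (1 + arg) ** (-1 / xi), written with log1p/expm1 for stability.
    return -math.expm1(-math.log1p(arg) / xi)


# Illustrative evaluation at z = 1 with sigma = 1 for three shapes.
for xi in (-0.5, 0.0, 0.5):
    print(f"xi = {xi:+.1f}: F(1) = {gpd_cdf(1.0, 1.0, xi):.4f}")
```

The ξ ≠ 0 branch converges to the exponential branch as ξ tends to zero, which is why the two cases in (2.1) describe one continuous family.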

More precisely, let $X$ be a random variable taking values in $[0, x_F)$, where

$x_F \in (0, \infty) \cup \{\infty\}$. Suppose that there exists a strictly positive sequence $a_u$ such that

the distribution of $a_u^{-1}(X - u) \mid X \geq u$ converges weakly to a non-degenerate

probability distribution on $[0, \infty)$ as $u \to x_F$; then this distribution is the GPD (Balkema

and De Haan, 1974). Thus, for large $u$,

$$\Pr(X - u > x \mid X \geq u) = \Pr\left(a_u^{-1}(X - u) > a_u^{-1} x \mid X \geq u\right) \approx 1 - F(x; a_u \sigma, \xi).$$

The shape parameter, ξ, defines the tail behavior of the GPD. If ξ < 0, the upper

tail is bounded. If ξ = 0, we have the exponential distribution, where all moments

are finite. If ξ > 0, the upper tail is unbounded and the higher moments ultimately

become infinite. The three defined cases are labeled “short-tailed”, “light-tailed”,

and “heavy-tailed”, respectively. These categorizations enhance the flexibility of the

GPD and underscore its adaptability to various modeling scenarios.
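To make the three regimes concrete, the following standard properties of the GPD in (2.1) can be written out for a random variable $Z$ with CDF $F(\cdot;\sigma,\xi)$. They are classical facts rather than results stated in this section.

```latex
\begin{align*}
\xi < 0 &: \quad \text{support } 0 \le z \le -\sigma/\xi, \\
\xi = 0 &: \quad \Pr(Z > z) = \exp(-z/\sigma), \\
\xi > 0 &: \quad \Pr(Z > z) = (1 + \xi z/\sigma)^{-1/\xi},
           \qquad E[Z^k] < \infty \iff \xi < 1/k .
\end{align*}
```

For example, with ξ = 0.5 the mean σ/(1 − ξ) = 2σ is finite, but the variance is not, since a finite second moment requires ξ < 1/2.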

Using the GPD to approximate the distribution tail for discrete data can be

inappropriate, as pointed out in Hitz et al. (2024). These authors proposed to approximate

the distribution tail of a count random variable $Y$ by discretizing the CDF defined

by (2.1), so that, for large $u$,

$$\Pr(Y - u = k \mid Y \geq u) = F(k + 1; \sigma, \xi) - F(k; \sigma, \xi), \quad k \in \mathbb{N}_0, \tag{2.2}$$

with σ > 0 and ξ ≥ 0. The distribution is called discrete GPD (DGPD), and several

properties of discrete Pareto type distributions have been studied previously in the

literature (Krishna and Pundir, 2009; Buddana and Kozubowski, 2014; Kozubowski

et al., 2015).
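The discretization in (2.2) amounts to differencing the GPD CDF at consecutive integers. The sketch below, with illustrative function names and parameter values, evaluates the DGPD probabilities and checks that the partial sums telescope to F(N + 1; σ, ξ), so the masses sum to one in the limit.

```python
import math


def gpd_cdf(z: float, sigma: float, xi: float) -> float:
    """GPD cumulative distribution function, as in (2.1), for xi >= 0."""
    if z <= 0:
        return 0.0
    if xi == 0.0:
        return 1.0 - math.exp(-z / sigma)
    return -math.expm1(-math.log1p(xi * z / sigma) / xi)


def dgpd_pmf(k: int, sigma: float, xi: float) -> float:
    """Discrete GPD mass at k, following (2.2): F(k + 1) - F(k)."""
    if k < 0:
        return 0.0
    return gpd_cdf(k + 1, sigma, xi) - gpd_cdf(k, sigma, xi)


# Illustrative parameter values only.
sigma, xi = 2.0, 0.25
n_terms = 500
partial_sum = sum(dgpd_pmf(k, sigma, xi) for k in range(n_terms))
print(f"P(Y - u = 0 | Y >= u) = {dgpd_pmf(0, sigma, xi):.4f}")
print(f"sum of first {n_terms} masses = {partial_sum:.6f}")
print(f"F({n_terms}) = {gpd_cdf(n_terms, sigma, xi):.6f}")
```

Because the tail is heavy when ξ > 0, the partial sums approach one slowly; the remaining mass beyond N is exactly 1 − F(N + 1; σ, ξ).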
