Should data ever be thrown away?
Pooling interval-censored data sets with different precision

Krasymyr Tretiak*, Scott Ferson
University of Liverpool, Liverpool L69 7ZX, United Kingdom
Abstract
Data quality is an important consideration in many engineering applications and projects.
Data collection procedures do not always involve careful utilization of the most precise
instruments and strictest protocols. As a consequence, data are invariably affected by
imprecision, and sometimes by sharply varying levels of quality. Different mathematical
representations of imprecision have been suggested, including a classical approach
to censored data which is considered optimal when the proposed error model is correct, and
a weaker approach called interval statistics based on partial identification that makes fewer
assumptions. Maximizing the quality of statistical results is often crucial to the success of
many engineering projects, and a natural question that arises is whether data of differing
qualities should be pooled together, or whether only precise measurements should be
included and imprecise data disregarded. Some worry that combining precise and imprecise measurements
can depreciate the overall quality of the pooled data. Some fear that excluding data of
lesser precision can increase their overall uncertainty about results because lower sample
size implies more sampling uncertainty. This paper explores these concerns and describes
simulation results that show when it is advisable to combine fairly precise data with rather
imprecise data by comparing analyses using different mathematical representations of im-
precision. Pooling data sets is preferred when the low-quality data set does not exceed
a certain level of uncertainty. However, so long as the data are random, it may be le-
gitimate to reject the low-quality data if its reduction of sampling uncertainty does not
counterbalance the effect of its imprecision on the overall uncertainty.
Keywords: imprecise data, censoring, maximum likelihood, epistemic uncertainty, Kolmogorov–Smirnov, descriptive statistics

This work was partially funded by the Engineering and Physical Science Research Council (EPSRC) through programme grant “Digital twins for improved dynamic design”, EP/R006768/1.
* Corresponding author.
Email addresses: k.tretiak@liverpool.ac.uk (Krasymyr Tretiak), ferson@liverpool.ac.uk (Scott Ferson)

Preprint submitted to International Journal of Approximate Reasoning, February 21, 2023
arXiv:2210.13863v2 [stat.ME] 20 Feb 2023
1. Introduction
Measurements are often performed under varying conditions or protocols, or by inde-
pendent agents, or with different measuring devices. This can lead to data measurements
with different precision. Interval-censored data, or simply interval data, are measured val-
ues known only within certain bounds instead of being observed exactly. Interval data
can arise in various cases including non-detects and data censoring [1], periodic observa-
tions, plus-or-minus measurement uncertainties, interval-valued expert opinions [2], privacy
requirements, theoretical constraints, bounding studies, etc. [3, 4, 5]. Some common ex-
amples of interval-censored data come from clinical and health studies [6], survival data
analysis [7], chemical risk assessment [8, 9], mechanics [10], etc. The breadth of the un-
certainty captured by the widths of these intervals is epistemic uncertainty [11, 12, 13].
This uncertainty arises due to lack of knowledge, limited or inaccurate measurements,
approximation or poor understanding of physical phenomena.
Over the past couple of decades various statistical methods have been developed for
handling interval-censored data. The substitution method [14, 15] is still the most common
procedure for handling nondetect data [9], which are special cases of left-censored data where
a value is known to reside between zero and some quantitative detection limit. Substitution
replaces nondetects with a zero, the detection limit, half the detection limit, or some other
fraction of the detection limit. This method ignores the imprecision and can lead to
incorrect estimates [16, 17]. One might presume that using the detection limit as the
substituted value would yield the most conservative upper confidence limit for the mean,
but this is not true. The reason is that the upper confidence limit depends positively
on both the mean and variance of the data, and clustering values at a single detection
limit reduces the variance. An alternative strategy, what might be considered the zeroth
approach, neglects the imprecision altogether and omits from the analysis any data with
imprecision beyond some acceptable threshold. It is hard to assess how common this
approach is, but it seems to be widespread despite its obvious statistical shortcomings.
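For concreteness, the following minimal sketch (with a hypothetical detection limit and hypothetical values) computes a t-based 95% upper confidence limit on the mean under the three common substitutions; it illustrates how the substituted value affects both the mean and the variance of the sample, and hence the upper confidence limit.

    # Sketch (hypothetical data, not from the paper): effect of substituting
    # different values for nondetects on a t-based 95% upper confidence limit.
    import numpy as np
    from scipy import stats

    DL = 1.0                                        # hypothetical detection limit
    detects = np.array([1.3, 2.1, 1.7, 3.4, 2.8])   # values measured above DL
    n_nondetects = 4                                # observations reported only as "< DL"

    def ucl95(sample):
        """One-sided 95% upper confidence limit on the mean (t distribution)."""
        n = len(sample)
        return sample.mean() + stats.t.ppf(0.95, n - 1) * sample.std(ddof=1) / np.sqrt(n)

    for label, sub in [("zero", 0.0), ("DL/2", DL / 2), ("DL", DL)]:
        pooled = np.concatenate([detects, np.full(n_nondetects, sub)])
        print(f"substitute {label:5s}: mean={pooled.mean():.3f}, UCL95={ucl95(pooled):.3f}")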
Another basic approach is maximum likelihood estimation (MLE) which gauges the
parameters of a probability distribution by maximizing a likelihood function [18, 19, 20].
MLE requires an assumption that the data follow a specific distribution. An alternative ap-
proach is Turnbull's method [21, 22, 23], a generalisation of the Kaplan–Meier method
which is commonly used in survival analysis for estimating percentiles, means, or other
statistics without substitutions [24, 25]. These methods are non-parametric, and do not
require specification of an assumed distribution. The Kaplan–Meier method is used for
right censoring whereas Turnbull’s method takes into account interval-censored data.
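As a sketch of the parametric route (our own illustration, with hypothetical intervals and an assumed normal model), MLE for interval-censored data maximises the product over observations of the probability F(hi; θ) − F(lo; θ) that the fitted distribution assigns to each censoring interval:

    # Sketch (assumed setup): maximum likelihood fit of a normal distribution
    # to interval-censored data, where each observation is a pair [lo, hi].
    import numpy as np
    from scipy import stats, optimize

    intervals = np.array([[0.9, 1.1], [1.4, 1.6], [2.0, 2.4],   # hypothetical data
                          [0.0, 1.0], [1.8, 2.2], [2.9, 3.5]])

    def neg_log_likelihood(params, data):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)                   # keep sigma positive
        lo, hi = data[:, 0], data[:, 1]
        prob = stats.norm.cdf(hi, mu, sigma) - stats.norm.cdf(lo, mu, sigma)
        return -np.sum(np.log(np.clip(prob, 1e-300, None)))

    res = optimize.minimize(neg_log_likelihood, x0=[1.5, 0.0], args=(intervals,))
    mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
    print(f"MLE: mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}")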
Interval data can also be considered as a special case of “symbolic data” modelled
with uniform distributions over the interval ranges [26, 27, 28]. This approach corresponds
to Laplace’s principle of indifference with respect to the censored interval. All values
within each interval are assumed to be equally likely. This assumption makes it relatively
straightforward to calculate sample statistics for data sets containing such intervals. This
uniforms approach characterises a sample as a precise distribution, but estimators based
on it are not consistent because there is no guarantee the distribution approaches the
true distribution from which the data were generated as sample size increases. Bertrand
and Goupil [26] acknowledged that, even as the sample size grows to infinity, the true
distribution of the underlying population is only approximated by this approach.
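A minimal sketch of the uniforms calculation (hypothetical intervals): the sample is treated as an equal-weight mixture of uniform distributions, so descriptive statistics follow from the moments E[X] = (a + b)/2 and E[X²] = (a² + ab + b²)/3 of a uniform distribution on [a, b].

    # Sketch of the "uniforms" (symbolic data) approach with hypothetical data:
    # mean and variance of the equal-weight mixture of uniforms over the intervals.
    import numpy as np

    intervals = np.array([[0.9, 1.1], [1.4, 1.6], [2.0, 2.4], [0.0, 1.0]])
    a, b = intervals[:, 0], intervals[:, 1]

    mean_unif = np.mean((a + b) / 2)                     # mixture mean
    second_moment = np.mean((a**2 + a * b + b**2) / 3)   # mixture E[X^2]
    var_unif = second_moment - mean_unif**2              # mixture variance

    print(f"uniforms approach: mean = {mean_unif:.3f}, variance = {var_unif:.3f}")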
Another approach represents interval uncertainty purely in the form of bounds [29, 30,
31, 3, 32, 33, 5, 34]. This approach originates from the theory of imprecise probabilities
[35, 36, 37, 38]. It models each interval as a set of possible values, and calculations result
in a class of distribution functions, corresponding to the different possible values within
the respective intervals. In contrast to the uniforms approach, some operations within this
intervals approach can be computationally expensive. However, the obtained results are
arguably more reliable, and the resulting bounds on the distribution will asymptotically
enclose the true distribution as sample size grows to infinity. The intervals approach is
computationally slower, and weaker than the other approaches if their assumptions are
satisfied.
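A corresponding sketch of the intervals approach (same hypothetical intervals as above): statistics are bounded over all possible configurations of true values inside the intervals. For the mean the bounds are simply the means of the endpoints, and the empirical distribution function becomes a pair of bounding step functions; other statistics, such as the variance, generally require optimisation over the intervals.

    # Sketch of the intervals approach with hypothetical data: bounds on the
    # sample mean and on the empirical CDF induced by the interval observations.
    import numpy as np

    intervals = np.array([[0.9, 1.1], [1.4, 1.6], [2.0, 2.4], [0.0, 1.0]])
    lo, hi = intervals[:, 0], intervals[:, 1]

    mean_bounds = (lo.mean(), hi.mean())    # interval bounds on the sample mean
    print(f"mean is in [{mean_bounds[0]:.3f}, {mean_bounds[1]:.3f}]")

    def ecdf_bounds(x):
        """Bounds on the empirical CDF at x induced by the interval data."""
        upper = np.mean(lo <= x)   # intervals that could lie at or below x
        lower = np.mean(hi <= x)   # intervals that must lie at or below x
        return lower, upper

    print("CDF bounds at x = 1.5:", ecdf_bounds(1.5))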
As strategies for descriptive statistics, these approaches represent different positions
along a continuum between assumption-dependent methods that are powerful but possibly
unreliable and relatively assumption-free methods that are reliable but may not be as
powerful. Fig. 1 argues that this continuum connects the poles of deterministic calculations
and purely qualitative approaches (such as narrative analysis, word clouds, etc.).

Figure 1: Schematic representation of assumption-dependent methods that are powerful but possibly unreliable and relatively assumption-free methods that are reliable but may not be as powerful. (Labels in the figure: Deterministic calculation; Normal theory; Nonparametric methods; Imprecise statistics; Qualitative analysis.)

Normal theory, which developed in the first half of the 20th century, is extremely powerful, but
it requires assumptions that are often hard to justify (such as normality, independence,
homoscedasticity, stationarity, etc.). Non-parametric methods that allowed analysts to
relax these assumptions flowered in the second half of that century. Imprecise statistics,
motivated by the recognition of non-negligible epistemic uncertainties such as interval
censoring, has developed over the last two decades to provide approaches that further relax
assumptions when they are untenable or in doubt. Modern statistics recognises there is not
just one tool for any given statistical problem. Various methods, representing sometimes
disparate sets of assumptions, can be deployed, and it is often useful to compare the results
from such different methods.
A natural question asks about guidelines for making practical decisions whether data
of differing qualities should be pooled together. By pooling we mean combining the data
sets and analysing them as a single sample. Certainly any data sets considered for pooling
are assumed to be measuring the same quantity or distribution of quantities. A common
belief within the statistics community suggests that all available data should be included
in any analysis. Others feel that imprecise data could potentially contaminate precise data
and that pooling might need to be avoided lest any bias that hides in the imprecision be
introduced.
Sample size is a crucial consideration in any statistical study and, in two-sample tests
for instance, it should depend on the minimum expected difference, desired statistical
power, and the significance criterion [39]. Increasing the sample size improves the char-
acterization of variability, lowering the degree of sampling uncertainty. A lower level of
sampling error is advantageous because it permits statistical inferences to be made with
more certainty. But is it possible that some data are so imprecise as to warrant their
exclusion from an analysis?
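For reference, the dependence on the minimum expected difference, the significance criterion and the desired power can be made concrete with the standard normal-approximation sample size formula for a two-sample comparison of means (a textbook sketch, not specific to the analyses below): n per group ≈ 2(z_{1−α/2} + z_{1−β})² σ²/δ².

    # Sketch (standard normal-approximation formula): sample size per group
    # for detecting a minimum difference delta with common standard deviation
    # sigma, significance level alpha and desired power.
    from scipy import stats

    def n_per_group(delta, sigma, alpha=0.05, power=0.80):
        z_a = stats.norm.ppf(1 - alpha / 2)   # two-sided significance criterion
        z_b = stats.norm.ppf(power)           # desired power
        return 2 * ((z_a + z_b) * sigma / delta) ** 2

    print(f"n per group ~ {n_per_group(delta=0.5, sigma=1.0):.1f}")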
Clearly, it would be inappropriate to exclude data because of its imprecision, even
if it is very low quality, if the method of analysis is sensitive to whether and how its
imprecision depends on the magnitude of the value being measured. For instance, when
asking people about their salaries, high-earners often tend to give vague answers [40].
Excluding these answers could bias the results dramatically. However, so long as the
imprecision is independent of the underlying magnitudes of the random data, it might be
legitimate to reject the low-quality data when the effect of the measurement imprecision
is greater on the overall uncertainty than the reduction in sampling uncertainty from
including the low-quality data. This can be true even when the imprecision depends on
other features of the data including group membership. Data for which measurement
imprecision has nothing to do with the data values are said to be irrelevantly imprecise.
Mathematically, this means that the breadth of the imprecision of a quantity is independent
of its underlying true magnitude.
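The distinction can be sketched with a hypothetical simulation: in the first data set below the interval half-widths are drawn independently of the true values (irrelevant imprecision), while in the second the half-widths grow with the magnitude being measured, as in the salary example.

    # Sketch (hypothetical setup): irrelevantly imprecise intervals versus
    # magnitude-dependent imprecision.
    import numpy as np

    rng = np.random.default_rng(1)
    true_values = rng.normal(10.0, 2.0, size=100)

    # irrelevantly imprecise: half-widths independent of the true magnitudes
    w_irrelevant = rng.uniform(0.1, 0.5, size=true_values.size)
    irrelevant = np.column_stack([true_values - w_irrelevant, true_values + w_irrelevant])

    # magnitude-dependent: half-widths grow with the value being measured
    w_dependent = 0.05 * true_values
    dependent = np.column_stack([true_values - w_dependent, true_values + w_dependent])

    print("corr(width, value), irrelevant:", np.corrcoef(w_irrelevant, true_values)[0, 1])
    print("corr(width, value), dependent: ", np.corrcoef(w_dependent, true_values)[0, 1])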
This paper is organized as follows. Section 2 introduces an approach to analysing
statistical data whose measurement uncertainties are intervals. Section 3 describes the
process of generating synthetic interval data sets for our simulations. Section 4 describes
using confidence intervals on the mean to assess uncertainty for precise and imprecise
data sets. Section 5 presents numerical simulations and comparisons with interval data
sets containing different levels of uncertainty using distribution-free Kolmogorov–Smirnov
confidence limits to assess uncertainty. Section 6 uses maximum likelihood for fitting a
named distribution to precise and imprecise interval data sets and contrasts the traditional
approach with the imprecise probabilities approach to maximum likelihood for which we
compute confidence intervals. Section 7 discusses the assumptions used in the preceding
analyses, when they might and might not be appropriate and alternative assumptions
that could be used to characterise imprecision. Summaries of the application of several
statistical methods for estimating the overall level of uncertainty of interval data, and
conclusions on whether it is advisable to pool data with varying quality are presented in
Section 8.
2. Interval statistics
Analysis of data that combine measurements with varying quality can be done with
interval descriptive statistics [26, 27, 29, 3, 33, 34]. Statistical analysis with measurements
modeled as intervals provides a reliable method to account for both the sampling and
measurement uncertainties. By applying interval statistics we can determine the overall
level of uncertainty and decide when pooling of the data sets is preferable or when we
should disregard some data. Siegrist [41] used quantitative examples involving combining
precise and sloppy (imprecise) data sets to illustrate cases where the resulting overall
uncertainty of the pooled data can be either lower or higher than that of the precise data
by itself. In his studies he used two data sets: precise data were sampled from the normal
distribution with known mean and variance and blurred with ±0.1 uncertainty; sloppy data
were sampled from the same distribution but blurred with ±0.1f uncertainty, where f can
be any number greater than one.
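A minimal sketch of this data-generation scheme (our reading of [41]; the mean, standard deviation, sample sizes and factor f below are arbitrary choices for illustration):

    # Sketch of the precise/sloppy data-generation scheme described above.
    import numpy as np

    rng = np.random.default_rng(42)
    mu, sigma, f = 5.0, 1.0, 10.0          # hypothetical parameters

    def blurred_sample(n, half_width):
        """Draw n values and censor each into an interval of the given half-width."""
        x = rng.normal(mu, sigma, size=n)
        return np.column_stack([x - half_width, x + half_width])

    precise = blurred_sample(30, 0.1)      # precise data set (±0.1)
    sloppy = blurred_sample(30, 0.1 * f)   # imprecise ("sloppy") data set (±0.1f)
    pooled = np.vstack([precise, sloppy])  # pooled sample analysed as one

    # interval bounds on the pooled sample mean under the intervals approach
    print("pooled mean bounds:", pooled[:, 0].mean(), pooled[:, 1].mean())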
This paper extends the idea proposed in [41] and provides a broader view on questions
of pooling data sets with different levels of uncertainty. We assess combinations consisting
of only two data sets: precise and imprecise. We consider data sets that are measurements,
not guesses, opinions, or expert elicitations. These measurements come from the empiricist
with clear statements about their precision and measurement protocols. Because different
approaches could result in different decisions, it is essential to compare several statisti-
cal methods for experimental data processing, for instance, estimation of the parameters
of a named distribution, construction of their confidence intervals, and determination of