The Shannon Entropy of a Histogram
S. J. Watts*, L. Crow
Department of Physics and Astronomy, The University of Manchester, Oxford Road, M13 9PL, Manchester, UK
*Correspondence to: Stephen.Watts@manchester.ac.uk
Abstract: The histogram is a key method for visualizing data and estimating the underlying probability
distribution. Incorrect conclusions about the data result from over- or under-binning. A new method based on
the Shannon entropy of the histogram uses a simple formula involving the differential entropy estimated
from nearest-neighbour distances. Links are made between the new method and other algorithms such as
Scott's formula, and cost and risk function methods. A parameter is found that predicts over- and under-
binning and that can be estimated for any histogram. The new algorithm is shown to be robust by application to
real data. A key conclusion is that the Shannon entropy of continuous data is $\frac{1}{2}\log_2 N$ bits, which provides a
limit to what one can learn from data.
1. Histogram Algorithm
The histogram is a key method for visualizing data and estimating the underlying probability distribution
function (pdf). Incorrect conclusions about the data result from over-binning or under-binning. Computer
software that automatically decides upon a suitable bin width can fool the unwary user. Scientists often
adjust the bin size to get a visually appealing plot. The human visual system is very good at gauging when a
histogram is under or over-binned. However, one must always be wary of a process that cannot be quantified.
Two key parameters affect the choice of binning: the number of entries (N) and some variance measure of
the pdf. Table 1.1 shows the different algorithms and the scientific software which uses them, refs. [1-13]. The
final column indicates whether the algorithm uses a formula or a cost function. Many of the algorithms
minimise an Integrated Mean Square Error (IMSE) function, but arrive at different formulations using various
approaches and approximations. The relationship between many of these algorithms and the one described
in this paper will be discussed in Sections 4.2 and 5.
The variables describing a histogram are as follows. A data point (or "entry") is assigned to a bin $i$, where $i = 1$
to $N_{\mathrm{bin}}$, the total number of bins. Defining $n_i$ as the number of entries in bin $i$, then,
$$N = \sum_{i=1}^{N_{\mathrm{bin}}} n_i \,, \qquad p_i = \frac{n_i}{N} \qquad (1)$$
$N$ is the total number of points or entries and $p_i$ is the probability of counting events in bin $i$.
The Shannon entropy of the binned histogram, $H_B$, is
$$H_B = -\sum_{i=1}^{N_{\mathrm{bin}}} p_i \log(p_i) \qquad (2)$$
The entropy is given in bits or nats when the log is base 2 or base $e$ respectively. This paper will use bits unless
otherwise indicated. The entropy means that a minimum of $2^H$ bins is required to histogram the data. For
example, a fair coin gives two discrete values, head or tail, each with a probability of 1/2. $H_B = 1$ bit for this
system, which thus requires $2^1 = 2$ bins. A useful quantity is the efficiency, $\varepsilon_H$, which is defined as the ratio of
the number of bins expected from the entropy to the actual number of bins required to histogram all the
data.
$$\varepsilon_H \equiv \frac{2^{H(\mathrm{bits})}}{N_{\mathrm{bin}}} = \frac{e^{H(\mathrm{nats})}}{N_{\mathrm{bin}}} \qquad (3)$$
For an unfair coin with $p_1 = 0.25$ and $p_2 = 0.75$, $H_B = 0.811$ bits and $\varepsilon_H = 0.877$. As is well known, $H$ is maximised
when all the $p_i$ are the same, in which case $H = \log(N_{\mathrm{bin}})$ and $\varepsilon_H = 1$. This is the uniform distribution.
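As a concrete illustration of Eqs. (2) and (3), the short Python sketch below evaluates $H_B$ and $\varepsilon_H$ for the fair and unfair coins discussed above (a minimal example of ours; the function names are not from the paper).

```python
import numpy as np

def shannon_entropy_bits(p):
    """Shannon entropy H_B = -sum_i p_i log2(p_i) of a set of bin probabilities (Eq. 2)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # empty bins contribute nothing to the sum
    return -np.sum(p * np.log2(p))

def efficiency(p):
    """Efficiency eps_H = 2^H / N_bin (Eq. 3); N_bin is the total number of bins."""
    return 2.0 ** shannon_entropy_bits(p) / len(p)

print(shannon_entropy_bits([0.5, 0.5]), efficiency([0.5, 0.5]))      # 1.0 bit, eps_H = 1.0
print(shannon_entropy_bits([0.25, 0.75]), efficiency([0.25, 0.75]))  # ~0.811 bits, eps_H ~ 0.877
```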
Shannon entropy applies to discrete or categorical data. For continuous data, one must first assign the
histogram variable, $x$, to the bin $i$. To do this, one needs to define the starting point of the histogram, $x_s$,
which is set to be a sensible value close to the minimum value of $x$ in the data. The bin index $i$ for the data
points $x_i$ with $i = 1$ to $N$ is then,
$$i = \left\lfloor \frac{x_i - x_s}{\Delta} \right\rfloor + 1 \qquad (4)$$
where the floor function, $\lfloor \cdot \rfloor$, gives the greatest integer less than or equal to the real number between the
brackets. $\Delta$ is the fixed bin width for all $x$. The choice of this bin width is crucial and is the main discussion of
this paper. The differential entropy, $h$, is defined as,
$$h = -\int_S p(x) \log p(x)\, dx \qquad (5)$$
where $p(x)$ is the probability density function and $x$ is the continuous random variable. $S$ is the support of
$p(x)$. This is well-defined for a particular distribution. For example, the normal distribution with $\sigma = 1$ has $h =
2.047$ bits, Table 1.2. The discrete entropy, $H$, and differential entropy, $h$, are related by [14],
$$H = h - \log(\Delta) \qquad (6)$$
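Eq. (6) is easy to check numerically. The sketch below (our own illustration, not from the paper) bins the standard normal pdf with a fixed width $\Delta$ and compares the resulting discrete entropy with $h - \log_2\Delta$, using the analytic $h = \frac{1}{2}\log_2(2\pi e) \approx 2.047$ bits quoted in Table 1.2.

```python
import numpy as np
from scipy.stats import norm

delta = 0.1                                    # an illustrative bin width
edges = np.arange(-8.0, 8.0 + delta, delta)    # covers essentially all of the support
p_i = np.diff(norm.cdf(edges))                 # probability mass in each bin
p_i = p_i[p_i > 0]

H = -np.sum(p_i * np.log2(p_i))                # discrete entropy of the binned pdf, Eq. (2)
h = 0.5 * np.log2(2.0 * np.pi * np.e)          # differential entropy of N(0,1), ~2.047 bits

print(H, h - np.log2(delta))                   # both ~5.37 bits for delta = 0.1, as Eq. (6) predicts
```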
At first sight, for continuous data, $H$ appears poorly defined because it depends on $\Delta$, which is not known.
Incorrectly, one might think $H$ to be infinite because $\Delta$ should tend to zero for a continuous distribution. $\Delta$
must be chosen so that the histogram properly represents the underlying distribution without bias. There is
an extensive literature on the choice of $\Delta$, which is concisely summarised in reference [15]. A widely used
formula for $\Delta$ is due to Scott [2], and obtains the optimal bin width by minimizing the IMSE between the
histogram estimate of the underlying distribution and the actual distribution. Scott's Formula is,
$$\Delta = \left( \frac{6}{\int_S p'(x)^2\, dx} \right)^{\!\frac{1}{3}} N^{-\frac{1}{3}} \qquad (7)$$
where $p'(x) = \frac{dp(x)}{dx}$. Note that this is infinite for a uniform distribution, as $p'(x) = 0$. Eq. (7) cannot be
used directly because one needs to know the pdf to apply it. This formula is often used on the assumption
that the underlying distribution is normal, estimating the variance, $\sigma^2$, from the data. The resulting formula is
called Scott's Rule.
$$\Delta = 2 \cdot 3^{\frac{1}{3}}\, \pi^{\frac{1}{6}}\, \sigma\, N^{-\frac{1}{3}} \qquad (8)$$
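For reference, Scott's Rule is straightforward to apply to a data sample; the sketch below (our illustration) uses the sample standard deviation as the estimate of $\sigma$ in Eq. (8), which reduces to the familiar $\approx 3.49\,\sigma N^{-1/3}$.

```python
import numpy as np

def scott_rule_width(x):
    """Scott's Rule bin width, Delta = 2 * 3^(1/3) * pi^(1/6) * sigma * N^(-1/3) (Eq. 8)."""
    x = np.asarray(x, dtype=float)
    sigma = np.std(x, ddof=1)                              # sample estimate of sigma
    return 2.0 * 3.0 ** (1 / 3) * np.pi ** (1 / 6) * sigma * len(x) ** (-1 / 3)

rng = np.random.default_rng(0)
print(scott_rule_width(rng.standard_normal(10_000)))       # ~0.16 for N = 10,000 and sigma ~ 1
```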
Using Eqs. (5), (6), and (7) one can construct Table 1.2, which gives $h$, $\Delta$ and $H$ for various probability density
functions. Table 1.2 shows that the entropy depends only on the logarithm of the number of events in the
histogram. This is not a complete surprise. If all events were in separate bins, then the entropy would be
maximal at $\log(N)$. However, for such a case, there would be no knowledge of the underlying distribution.
The values for entropy in the table also imply a small but non-zero value of $H$ at $N = 1$, which may be traceable to
the fact that Scott's Rule is an asymptotic result. Table 1.2 suggests that a histogram should have a well-
defined entropy determined only by the number of entries, which will be of the form,
$$H_M = \frac{1}{M}\log(N) \qquad (9)$$
with $1 \le M \le 3$. This assignment allows one to fix the bin width using Eq. (6) as,
$$\Delta = e^{h(\mathrm{nats})}\, N^{-\frac{1}{M}} = 2^{h(\mathrm{bits})}\, N^{-\frac{1}{M}} \qquad (10)$$
This can be applied to real data because one can estimate $h$ using the binless entropy estimator of
Kozachenko and Leonenko [16] for continuous distributions in Euclidean space. This is well described by
Victor in [17]. For a one-dimensional distribution,
$$h(\mathrm{bits}) = \frac{\gamma_e}{\log_e 2} + \log_2\!\big(2(N-1)\big) + \frac{1}{N}\sum_{i=1}^{N} \log_2(l_i) \qquad (11)$$
where $l_i$ is the nearest-neighbour distance for the $i^{\mathrm{th}}$ point in the histogram with value $x_i$, and $\gamma_e$ is the
Euler-Mascheroni constant. Eqs. (10) and (11) define the new algorithm for the bin width of the histogram.
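A minimal Python sketch of Eqs. (10) and (11), as reconstructed here, is shown below. It is our own implementation, not code from the paper, and the function names are ours; in one dimension the nearest-neighbour distance of a sorted sample is simply the smaller of the two adjacent gaps.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329          # Euler-Mascheroni constant, gamma_e

def kl_entropy_bits(x):
    """Kozachenko-Leonenko nearest-neighbour estimate of the 1-D differential entropy, Eq. (11)."""
    x = np.sort(np.asarray(x, dtype=float))
    gaps = np.diff(x)
    nn = np.empty_like(x)                  # nearest-neighbour distance l_i for each point
    nn[0], nn[-1] = gaps[0], gaps[-1]
    nn[1:-1] = np.minimum(gaps[:-1], gaps[1:])
    nn = nn[nn > 0]                        # guard against exact duplicates
    n = len(x)
    return EULER_GAMMA / np.log(2) + np.log2(2.0 * (n - 1)) + np.mean(np.log2(nn))

def bin_width(x, M=2.0):
    """Bin width Delta = 2^{h(bits)} * N^(-1/M), Eq. (10)."""
    return 2.0 ** kl_entropy_bits(x) * len(x) ** (-1.0 / M)

rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)
print(kl_entropy_bits(x))                  # ~2.05 bits, close to the analytic 2.047 for N(0,1)
print(bin_width(x, M=2.0))                 # ~0.04 for N = 10,000
```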
To illustrate this algorithm, the Moyal distribution (see Appendix A1) was simulated for 10,000 entries with M
= 1, 2, 2.6, 4 and 6. This is shown in Fig. 1 together with a fit for all values of M except 1. The Moyal
distribution was chosen to illustrate the algorithm because it is a mix of a normal distribution with an
exponential tail, and has no free parameters. The $\chi^2$/DF is 0.94, 1.31, 1.83 and 13.54 for M = 2, 2.6, 4 and 6
respectively. This example shows that M = 1 is over-binned as expected, that 2 < M < 3 produces a well-binned
histogram, and that once M > 3 the histogram is under-binned. These conclusions are for the data in Fig. 1
but apply in general, as Section 4 will show.
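The Fig. 1 exercise can be sketched as below, reusing the bin_width helper from the previous listing. The sample construction assumes the standard unit Moyal density $f(x) = (2\pi)^{-1/2}\exp\!\left(-\tfrac{1}{2}(x + e^{-x})\right)$; the exact parameterisation used in Appendix A1 may differ, and the fits shown in Fig. 1 are not reproduced here.

```python
import numpy as np
# assumes kl_entropy_bits() and bin_width() from the previous sketch are in scope

rng = np.random.default_rng(2)
z = rng.standard_normal(10_000)
x = -np.log(z ** 2)            # if Z ~ N(0,1), then -ln(Z^2) follows the unit Moyal density

for M in (1.0, 2.0, 2.6, 4.0, 6.0):
    delta = bin_width(x, M=M)
    edges = np.arange(x.min(), x.max() + delta, delta)   # fixed-width bins starting at x_s = min(x)
    counts, _ = np.histogram(x, bins=edges)
    print(f"M = {M}: delta = {delta:.3f}, N_bin = {len(counts)}")
```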
2. Entropy calculation using entries per bin
A range of distributions with a wide spread of skewness and kurtosis were simulated to demonstrate
the work in this paper: a uniform distribution between zero and one, the standard normal distribution, an exponential
with mean one, the standard log-normal, and an approximation to the Landau distribution due to Moyal. Appendix
A1 provides more details. For brevity, these distributions will be referred to as uniform, normal, exponential,
log-normal and Moyal in the rest of the paper.
2.1 Entries per bin analysis
An alternative way to calculate the entropy of the histogram is to count how many bins have 0, 1, 2, ..., j entries
per bin. One can then plot this distribution, which will be called the "Entries per bin" or EntBin plot. This could
itself be binned, but in the EntBin plots shown in the figures each point corresponds to one value of j,
which is why the term "plot" rather than "histogram" is used. The entropy can then be calculated from this distribution as,
$$H = \sum_{j=0}^{N_{Max}} N_j\, H_j = \sum_{j=0}^{N_{Max}} N_j \times \frac{j}{N}\log\!\left(\frac{N}{j}\right) \qquad (12)$$
where $N_j$ is the number of bins with $j$ entries, and $H_j$ is the entropy associated with such a bin, which is
given above and uses the fact that the probability associated with $j$ entries per bin is $j/N$. The maximum
number of entries per bin is $N_{Max}$. The $j = 0$ bin has zero entropy, but it is included so that empty bins are not
forgotten.
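The EntBin construction, and the fact that Eq. (12) reproduces Eq. (2), can be checked directly; the sketch below is our own illustration.

```python
import numpy as np

def entbin_entropy_bits(counts):
    """Histogram entropy computed via the entries-per-bin route, Eq. (12)."""
    counts = np.asarray(counts)
    N = counts.sum()
    N_j = np.bincount(counts)              # N_j = number of bins containing exactly j entries
    j = np.arange(1, len(N_j))             # the j = 0 bins contribute zero entropy
    H_j = (j / N) * np.log2(N / j)         # entropy carried by one bin with j entries
    return np.sum(N_j[1:] * H_j)

rng = np.random.default_rng(3)
counts, _ = np.histogram(rng.standard_normal(10_000), bins=200)
p = counts[counts > 0] / counts.sum()
print(entbin_entropy_bits(counts), -np.sum(p * np.log2(p)))   # the two routes agree
```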
The final unknown is the distribution of $N_j$. This can be written as,
$$N_j = N_S\, p(j;\mu) \qquad (13)$$
where $\mu$ is the average number of entries per bin and $p(j;\mu)$ is the probability of $j$ entries per bin for the
given $\mu$. The overall scale is determined by $N_S$, which will be discussed below. There are two possible values
of $\mu$: either the expected mean given the entropy or the actual mean of the histogram, stated in Eq. (14) and
Eq. (15) respectively.
$$\mu_H = \frac{N}{e^{H}} = \frac{N}{N^{\frac{1}{M}}} = N^{\,1-\frac{1}{M}} \qquad (14)$$
$$\mu_B = \frac{N}{N_{\mathrm{bin}}} = \varepsilon_H\, \mu_H \qquad (15)$$
If the underlying pdf is a uniform distribution, then $N_S = N^{\frac{1}{M}}$, $\mu = \mu_H$ and $p(j;\mu) = \frac{\mu^j}{j!}\exp(-\mu)$. This is
a Poisson distribution. For large $\mu$ the distribution becomes normal with mean $\mu_H$ and standard deviation
$\sqrt{\mu_H}$. The EntBin plots for simulated uniform, normal, and exponential distributions with N = 10,000 are shown
in Fig. 2a and Fig. 2b for M = 1.5 and M = 2.0 respectively. Some key observations from this figure follow:
• In Fig. 2a with M = 1.5, the uniform distribution gives an EntBin plot that is normal with the expected mean
and standard deviation, $10{,}000^{1/3} = 21.5$ and $\sqrt{21.5} = 4.64$ respectively. For M = 2, the uniform
distribution gives an EntBin plot which is normal with the expected mean and standard deviation, $10{,}000^{1/2} =
100$ and $\sqrt{100} = 10$, respectively (the M = 2 prediction is checked numerically in the sketch after this list).
• In Fig. 2a, for entries/bin > 9, the normal and exponential distributions give an EntBin plot that is normal
with mean $\mu_H$ but with a much larger standard deviation than expected. The underlying pdfs are
smearing out the EntBin normal distribution, but the mean stays at $\mu_H$.
• In Fig. 2b with M = 2, for entries/bin > 9, the normal and exponential distributions give an EntBin plot
which is smeared out, and is essentially uniform.
• Referring to the entries/bin variable as capital X: at low entries/bin (X + 1 < 10) there is an excess of
counts, and this region of the plot can be described by a power law, $(X+1)^{-\alpha}$, with $\alpha$ in the range 1.4 to
2.0. A power law suggests scaling behaviour [18]. Since the number of entries/bin will scale inversely with
bin size, one would expect $\alpha$ to depend on the underlying pdf but not be too sensitive to M. Since the
entropy associated with low values of X is small compared to the total, the contribution of this part of the
plot to the total entropy will be discounted at this stage.
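As a numerical check of the first observation (our own sketch, not from the paper), the code below bins a uniform sample with the Eq. (10) width, using $h = 0$ bits for U(0,1), and reports the mean and standard deviation of the entries per bin, which should be close to $\mu_H = N^{1-1/M}$ and $\sqrt{\mu_H}$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 10_000, 2.0
x = rng.uniform(0.0, 1.0, N)

delta = N ** (-1.0 / M)                   # Eq. (10) with h = 0 bits for U(0,1)
n_bins = int(round(1.0 / delta))
counts, _ = np.histogram(x, bins=n_bins, range=(0.0, 1.0))

mu_H = N ** (1.0 - 1.0 / M)
print(counts.mean(), counts.std())        # ~100 and ~10 for N = 10,000, M = 2
print(mu_H, np.sqrt(mu_H))                # 100.0 and 10.0
```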
The above observations lead to a simplified mathematical model for the EntBin distribution. For a uniform pdf
and large N, the EntBin plot will be described by a normal distribution. For all other pdfs, the EntBin plot will
be uniform. Table 2.1 gives details of the fit to the data in Fig. 2, based on the observations above. One can
approximate the summation in Eq. (12) by the integral,
$$H \approx N^{\frac{1}{M}} \int_0^{N_{Max}} \frac{X}{N}\,\log\!\left(\frac{N}{X}\right) p(X;\mu)\, dX \quad \mathrm{nats} \qquad (16)$$
Note that p(X;µ) is now a pdf not a probability. There are two cases to consider.
Case 1 - For an underlying uniform pdf
p(X;µ) will be a normal distribution and the log(X) term will be almost constant around the mean and can be
taken outside the integral. One can approximate Eq. (16) with Eq. (17).
$$H \approx \frac{\log(N) - \log(\mu_H)}{\mu_H} \int_0^{\infty} X\, p(X;\mu_H)\, dX \qquad (17)$$
From which one immediately obtains,
$$H \approx \log(N) - \log(\mu_H) = \frac{1}{M}\log(N) \qquad (18)$$
This supports the assumption made in Eq. (9).
Case 2 - All non-uniform pdfs or pdfs with a “shape”
The previous model points to the importance of the mean entries/bin being $\mu_H$ if one is to ensure that
the entropy of the histogram will be as in Eq. (9). An EntBin plot with a uniform distribution with mean $\mu_H$,
stretched around this mean by a factor $s \times \mu_H$, is required. This also has a range, $R_X$, of $s \times \mu_H$. Thus Eq.
(16) becomes,
$$H \approx \frac{1}{\mu_H}\,\frac{1}{R_X} \int_0^{R_X} X\,\big(\log(N) - \log(X)\big)\, dX \qquad (19)$$
After some simple algebra, and discounting small terms in $\frac{1}{R_X^{2}}$, one obtains
$$H \approx \frac{s}{2}\left[\frac{1}{M}\log(N) - \log(s) + \frac{1}{2}\right] \qquad (20)$$
For consistency, Eq. (20) must lead to Eq. (9), so s = 2. The extra term is just -0.19 nats. The entropy in the low
X entries due to the power-law contribution will balance out this term. This gives a key result that the
maximum number of entries per bin will be approximately,
$$N_{Max} \approx 2\,\mu_H = 2\,N^{\,1-\frac{1}{M}} \qquad (21)$$
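The algebra behind Eqs. (19)-(21) can be verified numerically; the short check below (ours, working in nats) evaluates the integral in Eq. (19) for s = 2, compares it with the closed form of Eq. (20), i.e. $(1/M)\log N - 0.19$ nats, and prints the implied $N_{Max} = 2\mu_H$.

```python
import numpy as np
from scipy.integrate import quad

N, M, s = 10_000, 2.0, 2.0
mu_H = N ** (1.0 - 1.0 / M)
R_X = s * mu_H

# Eq. (19): H ~ (1/mu_H) (1/R_X) * integral_0^{R_X} X (ln N - ln X) dX, in nats
integrand = lambda X: X * (np.log(N) - np.log(X)) if X > 0 else 0.0
integral, _ = quad(integrand, 0.0, R_X)
H_case2 = integral / (mu_H * R_X)

print(H_case2)                                   # ~4.41 nats
print(np.log(N) / M - (np.log(s) - 0.5))         # Eq. (20) with s = 2: (1/M) ln N - 0.19
print(2.0 * mu_H)                                # Eq. (21): N_Max ~ 200 for N = 10,000, M = 2
```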
Fig. 3a shows that this relation works reasonably well for many different distributions. This is the only value
possible to ensure that the average entries/bin will be $\mu_H$ for those entries in the EntBin plot that contribute
significantly to the overall entropy. These correspond to X > 10 in the EntBin plot. A key result can be obtained
by re-arranging Eq. (21). Note the use of base 2 in the logarithm.
$$H(\mathrm{bits}) \approx \log_2\!\left(\frac{N}{N_{Max}}\right) + 1 \equiv H_X \qquad (22)$$
For a uniform distribution the extra one bit is removed. Thus, the fact that a pdf has "shape" means one gains
an extra bit. Eq. (22) defines $H_X$. One can estimate the M of any histogram in two ways: by calculating $H_B$ for
the histogram using Eq. (2) and using this to estimate M, or by calculating $H_X$ and then estimating M. These two
estimates will be called $M_B$ and $M_X$ respectively. Specifically,
$$H_B = -\sum_{i=1}^{N_{\mathrm{bin}}} p_i \log(p_i) \quad \mathrm{and} \quad M_B = \frac{\log(N)}{H_B} \qquad (23)$$
$$M_X = \frac{\log_2(N)}{H_X} \qquad (24)$$
Fig. 3b shows $M_X$ versus the input M for normal, exponential, log-normal and Moyal distributions. These are
very different distributions. For N = 10,000, there is good agreement between $M_X$ and the input M, thus
confirming the analysis in this section. For lower N, the $M_X$ value for the Moyal distribution at N = 100 and
1000 saturates at $M_X \sim 3$ and 4 respectively. This is because there are not enough entries to achieve high
values of $N_{Max}$. However, $M_X$ is reliable in the range of M that matters. For the data in Fig. 1, $M_X$ is 1.2, 2.0, 2.6,
2.9 and 5.5 for histograms in which M was set to 1, 2, 2.6, 3 and 6 respectively. The $M_X$ estimate is a quick way
to estimate M from published histograms and will find application in Section 6.
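Both estimates are easy to compute from an array of bin contents; the sketch below (our illustration, with our own function name) applies Eqs. (2), (22), (23) and (24).

```python
import numpy as np

def m_estimates(counts):
    """Return (M_B, M_X) for a histogram given its bin contents, Eqs. (22)-(24)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p = counts[counts > 0] / N
    H_B = -np.sum(p * np.log2(p))            # Eq. (2), in bits
    M_B = np.log2(N) / H_B                   # Eq. (23)
    H_X = np.log2(N / counts.max()) + 1.0    # Eq. (22): the extra bit for a pdf with "shape"
    M_X = np.log2(N) / H_X                   # Eq. (24)
    return M_B, M_X

rng = np.random.default_rng(5)
counts, _ = np.histogram(rng.standard_normal(10_000), bins=100)   # an arbitrary binning choice
print(m_estimates(counts))
```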
Another way to check the assumptions underlying Case 2 is to estimate the mean number of entries per bin
weighted by the entropy. This is defined as,