The Shannon Entropy of a Histogram
S. J. Watts*, L. Crow
Department of Physics and Astronomy, The University of Manchester, Oxford Road, M13 9PL, Manchester, UK
*Correspondence to: Stephen.Watts@manchester.ac.uk
Abstract: The histogram is a key method for visualizing data and estimating the underlying probability
distribution. Incorrect conclusions about the data result from over- or under-binning. A new method based on
the Shannon entropy of the histogram uses a simple formula involving the differential entropy estimated
from nearest-neighbour distances. Links are made between the new method and other algorithms such as
Scott's formula, and cost and risk function methods. A parameter is found that predicts over- and under-
binning and that can be estimated for any histogram. The new algorithm is shown to be robust by application to
real data. A key conclusion is that the Shannon entropy of continuous data is $\frac{1}{2}\log_2 N$ bits, which provides a
limit to what one can learn from data.
1. Histogram Algorithm
The histogram is a key method for visualizing data and estimating the underlying probability distribution
function (pdf). Incorrect conclusions about the data result from over-binning or under-binning. Computer
software that automatically decides upon a suitable bin width can fool the unwary user. Scientists often
adjust the bin size to get a visually appealing plot. The human visual system is very good at gauging when a
histogram is under or over-binned. However, one must always be wary of a process that cannot be quantified.
Two key parameters affect the choice of binning: the number of entries (N) and some variance measure of
the pdf. Table 1.1 shows the different algorithms and the scientific software which uses them, refs. [1-13]. The
final column indicates whether the algorithm uses a formula or a cost function. Many of the algorithms
minimise an Integrated Mean Square Error (IMSE) function, but arrive at different formulations using various
approaches and approximations. The relationship between many of these algorithms and the one described
in this paper will be discussed in Sections 4.2 and 5.
The variables describing a histogram are as follows. A data point (or "entry") is assigned to a bin $i$, where $i = 1$
to $N_{\mathrm{bin}}$, the total number of bins. Defining $n_i$ as the number of entries in bin $i$, then,
$$N = \sum_{i=1}^{N_{\mathrm{bin}}} n_i \,, \qquad p_i = \frac{n_i}{N} \qquad (1)$$
$N$ is the total number of points or entries and $p_i$ is the probability of counting events in bin $i$.
The Shannon entropy of the binned histogram, $H_B$, is
$$H_B = -\sum_{i=1}^{N_{\mathrm{bin}}} p_i \log(p_i) \qquad (2)$$
The entropy is given in bits or nats when the log is base 2 or base $e$ respectively. This paper will use bits unless
otherwise indicated. The entropy means that a minimum of $2^H$ bins is required to histogram the data. For
example, a fair coin gives two discrete values, head or tail, each with a probability of 1/2. $H_B = 1$ bit for this
system, which thus requires $2^1 = 2$ bins. A useful quantity is the efficiency, $\varepsilon_H$, which is defined as the ratio of
the number of bins expected from the entropy to the actual number of bins required to histogram all the
data.
$$\varepsilon_H \equiv \frac{2^{H(\mathrm{bits})}}{N_{\mathrm{bin}}} = \frac{e^{H(\mathrm{nats})}}{N_{\mathrm{bin}}} \qquad (3)$$
For an unfair coin with $p_1 = 0.25$ and $p_2 = 0.75$, $H_B = 0.811$ bits and $\varepsilon_H = 0.877$. As is well known, $H$ is maximised
when all the $p_i$ are the same, in which case $H = \log(N_{\mathrm{bin}})$ and $\varepsilon_H = 1$. This is the uniform distribution.
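As a concrete illustration of Eqs. (2) and (3), the short Python sketch below evaluates $H_B$ and $\varepsilon_H$ for the fair and unfair coins discussed above (a minimal example of ours; the function names are not from the paper).

```python
import numpy as np

def shannon_entropy_bits(p):
    """Shannon entropy H_B = -sum_i p_i log2(p_i) of a set of bin probabilities (Eq. 2)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # empty bins contribute nothing to the sum
    return -np.sum(p * np.log2(p))

def efficiency(p):
    """Efficiency eps_H = 2^H / N_bin (Eq. 3); N_bin is the total number of bins."""
    return 2.0 ** shannon_entropy_bits(p) / len(p)

print(shannon_entropy_bits([0.5, 0.5]), efficiency([0.5, 0.5]))      # 1.0 bit, eps_H = 1.0
print(shannon_entropy_bits([0.25, 0.75]), efficiency([0.25, 0.75]))  # ~0.811 bits, eps_H ~ 0.877
```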
Shannon entropy applies to discrete or categorical data. For continuous data, one must first assign the
histogram variable, $x$, to the bin $i$. To do this, one needs to define the starting point of the histogram, $x_s$,
which is set to be a sensible value close to the minimum value of $x$ in the data. The bin index $i$ for the data
points $x_i$ with $i = 1$ to $N$ is then,
$$i = \left\lfloor \frac{x_i - x_s}{\Delta} \right\rfloor + 1 \qquad (4)$$
where the floor function, $\lfloor \cdot \rfloor$, gives the greatest integer less than or equal to the real number between the
brackets. $\Delta$ is the fixed bin width for all $x$. The choice of this bin width is crucial and is the main discussion of
this paper. The differential entropy, $h$, is defined as,
$$h = -\int_S p(x) \log p(x)\, dx \qquad (5)$$
where $p(x)$ is the probability density function and $x$ is the continuous random variable. $S$ is the support of
$p(x)$. This is well-defined for a particular distribution. For example, the normal distribution with $\sigma = 1$ has $h =
2.047$ bits, Table 1.2. The discrete entropy, $H$, and differential entropy, $h$, are related by [14],
$$H = h - \log(\Delta) \qquad (6)$$
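Eq. (6) is easy to check numerically. The sketch below (our own illustration, not from the paper) bins the standard normal pdf with a fixed width $\Delta$ and compares the resulting discrete entropy with $h - \log_2\Delta$, using the analytic $h = \frac{1}{2}\log_2(2\pi e) \approx 2.047$ bits quoted in Table 1.2.

```python
import numpy as np
from scipy.stats import norm

delta = 0.1                                    # an illustrative bin width
edges = np.arange(-8.0, 8.0 + delta, delta)    # covers essentially all of the support
p_i = np.diff(norm.cdf(edges))                 # probability mass in each bin
p_i = p_i[p_i > 0]

H = -np.sum(p_i * np.log2(p_i))                # discrete entropy of the binned pdf, Eq. (2)
h = 0.5 * np.log2(2.0 * np.pi * np.e)          # differential entropy of N(0,1), ~2.047 bits

print(H, h - np.log2(delta))                   # both ~5.37 bits for delta = 0.1, as Eq. (6) predicts
```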
At first sight, for continuous data, $H$ appears poorly defined because it depends on $\Delta$, which is not known.
Incorrectly, one might think $H$ to be infinite because $\Delta$ should tend to zero for a continuous distribution. $\Delta$
must be chosen so that the histogram properly represents the underlying distribution without bias. There is
an extensive literature on the choice of $\Delta$, which is concisely summarised in reference [15]. A widely used
formula for $\Delta$ is due to Scott [2], and obtains the optimal bin width by minimizing the IMSE between the
histogram estimate of the underlying distribution and the actual distribution. Scott's Formula is,
$$\Delta = \left( \frac{6}{\int_S p'(x)^2\, dx} \right)^{\!\frac{1}{3}} N^{-\frac{1}{3}} \qquad (7)$$
where $p'(x) = \frac{dp(x)}{dx}$. Note that this is infinite for a uniform distribution, as $p'(x) = 0$. Eq. (7) cannot be
used directly because one needs to know the pdf to apply it. This formula is often used on the assumption
that the underlying distribution is normal, estimating the variance, $\sigma^2$, from the data. The resulting formula is
called Scott's Rule.
$$\Delta = 2 \cdot 3^{\frac{1}{3}}\, \pi^{\frac{1}{6}}\, \sigma\, N^{-\frac{1}{3}} \qquad (8)$$
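For reference, Scott's Rule is straightforward to apply to a data sample; the sketch below (our illustration) uses the sample standard deviation as the estimate of $\sigma$ in Eq. (8), which reduces to the familiar $\approx 3.49\,\sigma N^{-1/3}$.

```python
import numpy as np

def scott_rule_width(x):
    """Scott's Rule bin width, Delta = 2 * 3^(1/3) * pi^(1/6) * sigma * N^(-1/3) (Eq. 8)."""
    x = np.asarray(x, dtype=float)
    sigma = np.std(x, ddof=1)                              # sample estimate of sigma
    return 2.0 * 3.0 ** (1 / 3) * np.pi ** (1 / 6) * sigma * len(x) ** (-1 / 3)

rng = np.random.default_rng(0)
print(scott_rule_width(rng.standard_normal(10_000)))       # ~0.16 for N = 10,000 and sigma ~ 1
```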
Using Eqs. (5), (6), and (7) one can construct Table 1.2, which gives $h$, $\Delta$ and $H$ for various probability density
functions. Table 1.2 shows that the entropy depends only on the logarithm of the number of events in the
histogram. This is not a complete surprise. If all events were in separate bins, then the entropy would be
maximal at $\log(N)$. However, for such a case, there would be no knowledge of the underlying distribution.
The values for entropy in the table also imply a small but non-zero value of $H$ at $N = 1$, which may be traceable to
the fact that Scott's Rule is an asymptotic result. Table 1.2 suggests that a histogram should have a well-
defined entropy determined only by the number of entries, which will be of the form,
$$H_M = \frac{1}{M}\log(N) \qquad (9)$$
with $1 \le M \le 3$. This assignment allows one to fix the bin width using Eq. (6) as,
$$\Delta = e^{h(\mathrm{nats})}\, N^{-\frac{1}{M}} = 2^{h(\mathrm{bits})}\, N^{-\frac{1}{M}} \qquad (10)$$
This can be applied to real data because one can estimate $h$ using the binless entropy estimator of
Kozachenko and Leonenko [16] for continuous distributions in Euclidean space. This is well described by
Victor in [17]. For a one-dimensional distribution,
$$h(\mathrm{bits}) = \frac{\gamma_e}{\log_e 2} + \log_2\!\big(2(N-1)\big) + \frac{1}{N}\sum_{i=1}^{N} \log_2(l_i) \qquad (11)$$
where $l_i$ is the nearest-neighbour distance for the $i^{\mathrm{th}}$ point in the histogram with value $x_i$, and $\gamma_e$ is the
Euler-Mascheroni constant. Eqs. (10) and (11) define the new algorithm for the bin width of the histogram.
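A minimal Python sketch of Eqs. (10) and (11), as reconstructed here, is shown below. It is our own implementation, not code from the paper, and the function names are ours; in one dimension the nearest-neighbour distance of a sorted sample is simply the smaller of the two adjacent gaps.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329          # Euler-Mascheroni constant, gamma_e

def kl_entropy_bits(x):
    """Kozachenko-Leonenko nearest-neighbour estimate of the 1-D differential entropy, Eq. (11)."""
    x = np.sort(np.asarray(x, dtype=float))
    gaps = np.diff(x)
    nn = np.empty_like(x)                  # nearest-neighbour distance l_i for each point
    nn[0], nn[-1] = gaps[0], gaps[-1]
    nn[1:-1] = np.minimum(gaps[:-1], gaps[1:])
    nn = nn[nn > 0]                        # guard against exact duplicates
    n = len(x)
    return EULER_GAMMA / np.log(2) + np.log2(2.0 * (n - 1)) + np.mean(np.log2(nn))

def bin_width(x, M=2.0):
    """Bin width Delta = 2^{h(bits)} * N^(-1/M), Eq. (10)."""
    return 2.0 ** kl_entropy_bits(x) * len(x) ** (-1.0 / M)

rng = np.random.default_rng(1)
x = rng.standard_normal(10_000)
print(kl_entropy_bits(x))                  # ~2.05 bits, close to the analytic 2.047 for N(0,1)
print(bin_width(x, M=2.0))                 # ~0.04 for N = 10,000
```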
To illustrate this algorithm, the Moyal distribution (see Appendix A1) was simulated for 10,000 entries with M
= 1, 2, 2.6, 4 and 6. This is shown in Fig. 1 together with a fit for all values of M except 1. The Moyal
distribution was chosen to illustrate the algorithm because it is a mix of a normal distribution with an
exponential tail, and has no free parameters. The $\chi^2$/DF is 0.94, 1.31, 1.83 and 13.54 for M = 2, 2.6, 4 and 6
respectively. This example shows that M = 1 is over-binned as expected, that 2 < M < 3 produces a well-binned
histogram, and that once M > 3 the histogram is under-binned. These conclusions are for the data in Fig. 1
but apply in general, as Section 4 will show.
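The Fig. 1 exercise can be sketched as below, reusing the bin_width helper from the previous listing. The sample construction assumes the standard unit Moyal density $f(x) = (2\pi)^{-1/2}\exp\!\left(-\tfrac{1}{2}(x + e^{-x})\right)$; the exact parameterisation used in Appendix A1 may differ, and the fits shown in Fig. 1 are not reproduced here.

```python
import numpy as np
# assumes kl_entropy_bits() and bin_width() from the previous sketch are in scope

rng = np.random.default_rng(2)
z = rng.standard_normal(10_000)
x = -np.log(z ** 2)            # if Z ~ N(0,1), then -ln(Z^2) follows the unit Moyal density

for M in (1.0, 2.0, 2.6, 4.0, 6.0):
    delta = bin_width(x, M=M)
    edges = np.arange(x.min(), x.max() + delta, delta)   # fixed-width bins starting at x_s = min(x)
    counts, _ = np.histogram(x, bins=edges)
    print(f"M = {M}: delta = {delta:.3f}, N_bin = {len(counts)}")
```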
2. Entropy calculation using entries per bin
A range of distributions with a wide spread of skewness and kurtosis were simulated to demonstrate
the work in this paper: a uniform distribution between zero and one, the standard normal distribution, an exponential
with mean one, the standard log-normal, and an approximation to the Landau distribution due to Moyal. Appendix
A1 provides more details. For brevity, these distributions will be referred to as uniform, normal, exponential,
log-normal and Moyal in the rest of the paper.
2.1 Entries per bin analysis
An alternative way to calculate the entropy of the histogram is to count how many bins have 0, 1, 2, ..., j entries
per bin. One can then plot this distribution, which will be called the "Entries per bin" or EntBin plot. This could
itself be binned, but in the EntBin plots shown in the figures each point corresponds to one value of j,
which is why the term "plot" rather than "histogram" is used. The entropy can then be calculated from this distribution as,
$$H = \sum_{j=0}^{N_{Max}} N_j\, H_j = \sum_{j=0}^{N_{Max}} N_j \times \frac{j}{N}\log\!\left(\frac{N}{j}\right) \qquad (12)$$
where $N_j$ is the number of bins with $j$ entries, and $H_j$ is the entropy associated with such a bin, which is
given above and uses the fact that the probability associated with $j$ entries per bin is $j/N$. The maximum
number of entries per bin is $N_{Max}$. The $j = 0$ bin has zero entropy, but it is included so that empty bins are not
forgotten.
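The EntBin construction, and the fact that Eq. (12) reproduces Eq. (2), can be checked directly; the sketch below is our own illustration.

```python
import numpy as np

def entbin_entropy_bits(counts):
    """Histogram entropy computed via the entries-per-bin route, Eq. (12)."""
    counts = np.asarray(counts)
    N = counts.sum()
    N_j = np.bincount(counts)              # N_j = number of bins containing exactly j entries
    j = np.arange(1, len(N_j))             # the j = 0 bins contribute zero entropy
    H_j = (j / N) * np.log2(N / j)         # entropy carried by one bin with j entries
    return np.sum(N_j[1:] * H_j)

rng = np.random.default_rng(3)
counts, _ = np.histogram(rng.standard_normal(10_000), bins=200)
p = counts[counts > 0] / counts.sum()
print(entbin_entropy_bits(counts), -np.sum(p * np.log2(p)))   # the two routes agree
```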
The final unknown is the distribution of $N_j$. This can be written as,
$$N_j = N_S\, p(j;\mu) \qquad (13)$$
where $\mu$ is the average number of entries per bin and $p(j;\mu)$ is the probability of $j$ entries per bin for the
given $\mu$. The overall scale is determined by $N_S$, which will be discussed below. There are two possible values
of $\mu$: either the expected mean given the entropy or the actual mean of the histogram, stated in Eq. (14) and
Eq. (15) respectively.
$$\mu_H = \frac{N}{e^{H}} = \frac{N}{N^{\frac{1}{M}}} = N^{\,1-\frac{1}{M}} \qquad (14)$$
$$\mu_B = \frac{N}{N_{\mathrm{bin}}} = \varepsilon_H\, \mu_H \qquad (15)$$
If the underlying pdf is a uniform distribution, then $N_S = N^{\frac{1}{M}}$, $\mu = \mu_H$ and $p(j;\mu) = \frac{\mu^j}{j!}\exp(-\mu)$. This is
a Poisson distribution. For large $\mu$ the distribution becomes normal with mean $\mu_H$ and standard deviation
$\sqrt{\mu_H}$. The EntBin plots for simulated uniform, normal, and exponential distributions with N = 10,000 are shown
in Fig. 2a and Fig. 2b for M = 1.5 and M = 2.0 respectively. Some key observations from this figure follow:
• In Fig. 2a with M = 1.5, the uniform distribution gives an EntBin plot that is normal with the expected mean
and standard deviation, $10{,}000^{1/3} = 21.5$ and $\sqrt{21.5} = 4.64$ respectively. For M = 2, the uniform
distribution gives an EntBin plot which is normal with the expected mean and standard deviation, $10{,}000^{1/2} =
100$ and $\sqrt{100} = 10$, respectively (the M = 2 prediction is checked numerically in the sketch after this list).
• In Fig. 2a, for entries/bin > 9, the normal and exponential distributions give an EntBin plot that is normal
with mean $\mu_H$ but with a much larger standard deviation than expected. The underlying pdfs are
smearing out the EntBin normal distribution, but the mean stays at $\mu_H$.
• In Fig. 2b with M = 2, for entries/bin > 9, the normal and exponential distributions give an EntBin plot
which is smeared out, and is essentially uniform.
• Referring to the entries/bin variable as capital X: at low entries/bin (X + 1 < 10) there is an excess of
counts, and this region of the plot can be described by a power law, $(X+1)^{-\alpha}$, with $\alpha$ in the range 1.4 to
2.0. A power law suggests scaling behaviour [18]. Since the number of entries/bin will scale inversely with
bin size, one would expect $\alpha$ to depend on the underlying pdf but not be too sensitive to M. Since the
entropy associated with low values of X is small compared to the total, the contribution of this part of the
plot to the total entropy will be discounted at this stage.
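As a numerical check of the first observation (our own sketch, not from the paper), the code below bins a uniform sample with the Eq. (10) width, using $h = 0$ bits for U(0,1), and reports the mean and standard deviation of the entries per bin, which should be close to $\mu_H = N^{1-1/M}$ and $\sqrt{\mu_H}$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 10_000, 2.0
x = rng.uniform(0.0, 1.0, N)

delta = N ** (-1.0 / M)                   # Eq. (10) with h = 0 bits for U(0,1)
n_bins = int(round(1.0 / delta))
counts, _ = np.histogram(x, bins=n_bins, range=(0.0, 1.0))

mu_H = N ** (1.0 - 1.0 / M)
print(counts.mean(), counts.std())        # ~100 and ~10 for N = 10,000, M = 2
print(mu_H, np.sqrt(mu_H))                # 100.0 and 10.0
```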
The above observations lead to a simplified mathematical model for the EntBin distribution. For a uniform pdf
and large N, the EntBin plot will be described by a normal distribution. For all other pdfs, the EntBin plot will
be uniform. Table 2.1 gives details of the fit to the data in Fig. 2, based on the observations above. One can
approximate the summation in Eq. (12) by the integral,
$$H \approx N^{\frac{1}{M}} \int_0^{N_{Max}} \frac{X}{N}\,\log\!\left(\frac{N}{X}\right) p(X;\mu)\, dX \quad \mathrm{nats} \qquad (16)$$
Note that p(X;µ) is now a pdf not a probability. There are two cases to consider.
Case 1 - For an underlying uniform pdf
p(X;µ) will be a normal distribution and the log(X) term will be almost constant around the mean and can be
taken outside the integral. One can approximate Eq. (16) with Eq. (17).
$$H \approx \frac{\log(N) - \log(\mu_H)}{\mu_H} \int_0^{\infty} X\, p(X;\mu_H)\, dX \qquad (17)$$
From which one immediately obtains,
$$H \approx \log(N) - \log(\mu_H) = \frac{1}{M}\log(N) \qquad (18)$$
This supports the assumption made in Eq. (9).
Case 2 - All non-uniform pdfs or pdfs with a “shape”
The previous model points to the importance of the mean entries/bin being $\mu_H$ if one is to ensure that
the entropy of the histogram will be as in Eq. (9). An EntBin plot with a uniform distribution with mean $\mu_H$,
stretched around this mean by a factor $s \times \mu_H$, is required. This also has a range, $R_X$, of $s \times \mu_H$. Thus Eq.
(16) becomes,
$$H \approx \frac{1}{\mu_H}\,\frac{1}{R_X} \int_0^{R_X} X\,\big(\log(N) - \log(X)\big)\, dX \qquad (19)$$
After some simple algebra, and discounting small terms in $\frac{1}{R_X^{2}}$, one obtains
$$H \approx \frac{s}{2}\left[\frac{1}{M}\log(N) - \log(s) + \frac{1}{2}\right] \qquad (20)$$
For consistency, Eq. (20) must lead to Eq. (9), so s = 2. The extra term is just -0.19 nats. The entropy in the low
X entries due to the power-law contribution will balance out this term. This gives a key result that the
maximum number of entries per bin will be approximately,
$$N_{Max} \approx 2\,\mu_H = 2\,N^{\,1-\frac{1}{M}} \qquad (21)$$
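The algebra behind Eqs. (19)-(21) can be verified numerically; the short check below (ours, working in nats) evaluates the integral in Eq. (19) for s = 2, compares it with the closed form of Eq. (20), i.e. $(1/M)\log N - 0.19$ nats, and prints the implied $N_{Max} = 2\mu_H$.

```python
import numpy as np
from scipy.integrate import quad

N, M, s = 10_000, 2.0, 2.0
mu_H = N ** (1.0 - 1.0 / M)
R_X = s * mu_H

# Eq. (19): H ~ (1/mu_H) (1/R_X) * integral_0^{R_X} X (ln N - ln X) dX, in nats
integrand = lambda X: X * (np.log(N) - np.log(X)) if X > 0 else 0.0
integral, _ = quad(integrand, 0.0, R_X)
H_case2 = integral / (mu_H * R_X)

print(H_case2)                                   # ~4.41 nats
print(np.log(N) / M - (np.log(s) - 0.5))         # Eq. (20) with s = 2: (1/M) ln N - 0.19
print(2.0 * mu_H)                                # Eq. (21): N_Max ~ 200 for N = 10,000, M = 2
```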
Fig. 3a shows that this relation works reasonably well for many different distributions. This is the only value
possible to ensure that the average entries/bin will be $\mu_H$ for those entries in the EntBin plot that contribute
significantly to the overall entropy. These correspond to X > 10 in the EntBin plot. A key result can be obtained
by re-arranging Eq. (21). Note the use of base 2 in the logarithm.
$$H(\mathrm{bits}) \approx \log_2\!\left(\frac{N}{N_{Max}}\right) + 1 \equiv H_X \qquad (22)$$
For a uniform distribution the extra one bit is removed. Thus, the fact that a pdf has "shape" means one gains
an extra bit. Eq. (22) defines $H_X$. One can estimate the M of any histogram in two ways: by calculating $H_B$ for
the histogram using Eq. (2) and using this to estimate M, or by calculating $H_X$ and then estimating M. These two
estimates will be called $M_B$ and $M_X$ respectively. Specifically,
$$H_B = -\sum_{i=1}^{N_{\mathrm{bin}}} p_i \log(p_i) \quad \mathrm{and} \quad M_B = \frac{\log(N)}{H_B} \qquad (23)$$
$$M_X = \frac{\log_2(N)}{H_X} \qquad (24)$$
Fig. 3b shows $M_X$ versus the input M for normal, exponential, log-normal and Moyal distributions. These are
very different distributions. For N = 10,000, there is good agreement between $M_X$ and the input M, thus
confirming the analysis in this section. For lower N, the $M_X$ value for the Moyal distribution at N = 100 and
1000 saturates at $M_X \sim 3$ and 4 respectively. This is because there are not enough entries to achieve high
values of $N_{Max}$. However, $M_X$ is reliable in the range of M that matters. For the data in Fig. 1, $M_X$ is 1.2, 2.0, 2.6,
2.9 and 5.5 for histograms in which M was set to 1, 2, 2.6, 3 and 6 respectively. The $M_X$ estimate is a quick way
to estimate M from published histograms and will find application in Section 6.
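Both estimates are easy to compute from an array of bin contents; the sketch below (our illustration, with our own function name) applies Eqs. (2), (22), (23) and (24).

```python
import numpy as np

def m_estimates(counts):
    """Return (M_B, M_X) for a histogram given its bin contents, Eqs. (22)-(24)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.sum()
    p = counts[counts > 0] / N
    H_B = -np.sum(p * np.log2(p))            # Eq. (2), in bits
    M_B = np.log2(N) / H_B                   # Eq. (23)
    H_X = np.log2(N / counts.max()) + 1.0    # Eq. (22): the extra bit for a pdf with "shape"
    M_X = np.log2(N) / H_X                   # Eq. (24)
    return M_B, M_X

rng = np.random.default_rng(5)
counts, _ = np.histogram(rng.standard_normal(10_000), bins=100)   # an arbitrary binning choice
print(m_estimates(counts))
```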
Another way to check the assumptions underlying Case 2 is to estimate the mean number of entries per bin
weighted by the entropy. This is defined as,