arXiv:2210.13121v2 [math.PR] 3 Nov 2022
The Entropy Method in Large Deviation Theory
Lei Yu*
November 4, 2022

*L. Yu is with the School of Statistics and Data Science, LPMC, KLMDASR, and LEBPS, Nankai University, Tianjin 300071, China (e-mail: leiyu@nankai.edu.cn). This work was supported by the NSFC grant 62101286 and the Fundamental Research Funds for the Central Universities of China (Nankai University).
Abstract
This paper illustrates the power of the entropy method in addressing problems from large deviation theory. We provide and review entropy proofs for the most fundamental results in large deviation theory, including Cramér's theorem, the Gärtner–Ellis theorem, and Sanov's theorem. Moreover, by the entropy method, we also strengthen Sanov's theorem to its strong version.
1 Introduction
Information theory was fundamentally established by the works of Harry Nyquist and Ralph Hartley in the 1920s and of Claude Shannon in the 1940s, which successfully addressed the theoretical limits of information communication. Since then, information theory has gradually influenced other branches of mathematics. For example, the notion of (information) entropy, introduced by Shannon after he consulted von Neumann, the father of the computer, has been widely employed in probability theory, functional analysis, ergodic theory, graph theory, and so on.
Information theory concerns the reliable and efficient communication of a source over a noisy channel. Entropy and mutual information were introduced by Shannon to determine the theoretical limit of such communication. Entropy and mutual information are special cases of a more general concept, known as the relative entropy. For a nonnegative $\sigma$-finite measure $\mu$ and a probability measure $Q$ defined on the same space such that $Q \ll \mu$, the relative entropy of $Q$ with respect to (w.r.t.) $\mu$ is
$$D(Q\|\mu) = \int \log \frac{dQ}{d\mu}\, dQ,$$
which is the expectation of the information density $\imath_{Q\|\mu} = \log \frac{dQ}{d\mu}$. When $\mu$ is the counting measure on a countable set, $H(X) = -D(Q\|\mu)$ is the Shannon entropy of $X \sim Q$. When $\mu$ is the Lebesgue measure on a Euclidean space, $h(X) = -D(Q\|\mu)$ is the differential entropy of $X \sim Q$. The relative entropy is also widely used in the case where $\mu$ is a probability measure. In this case, for two probability measures $Q \ll P$,
$$D(Q\|P) = \int \log \frac{dQ}{dP}\, dQ. \tag{1}$$
If $Q$ is not absolutely continuous w.r.t. $P$, then $D(Q\|P) := +\infty$. The mutual information between $X$ and $Y$ with $(X, Y) \sim P_{XY}$ is $I(X;Y) = D(P_{XY} \| P_X P_Y)$. So, the relative entropy is a general concept incorporating the entropy and mutual information. The relative varentropy of $Q$ with respect to (w.r.t.) $\mu$ is
$$V(Q\|\mu) = \mathrm{Var}_Q\big[\imath_{Q\|\mu}(X)\big] = \mathrm{Var}_Q\Big[\log \frac{dQ}{d\mu}(X)\Big].$$
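For concreteness, the following minimal Python sketch (with hypothetical toy distributions on a finite alphabet, taking the reference measure $\mu = P$ to be a probability measure) evaluates the relative entropy and the relative varentropy directly from the definitions above.

```python
import numpy as np

def relative_entropy(q, p):
    """D(Q||P) in nats for finite distributions; returns +inf if Q is not
    absolutely continuous w.r.t. P (i.e., q puts mass where p does not)."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    s = q > 0
    if np.any(p[s] == 0):
        return np.inf
    return float(np.sum(q[s] * np.log(q[s] / p[s])))

def relative_varentropy(q, p):
    """V(Q||P) = Var_Q[log dQ/dP] for finite distributions with Q << P."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    s = q > 0
    i = np.log(q[s] / p[s])        # information density on the support of Q
    d = np.sum(q[s] * i)           # D(Q||P)
    return float(np.sum(q[s] * (i - d) ** 2))

# Toy example: Q = Bernoulli(0.3) versus P = Bernoulli(0.5)
q, p = [0.7, 0.3], [0.5, 0.5]
print(relative_entropy(q, p), relative_varentropy(q, p))
```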
1.1 Organization
In Section 2, we introduce the concept of information projection (or, shortly, I-projection), which will then be used to characterize the asymptotic exponents in large deviation theory. In Section 3, we prove several versions of Cramér's theorem by the entropy method. In particular, the entropy proof for the simple version of Cramér's theorem in Section 3.1 serves as a paradigm illustrating the power of the entropy method. Still using the entropy method, we next generalize Cramér's theorem to the non-i.i.d. setting and obtain the Gärtner–Ellis theorem in Section 4. We then turn to another fundamental result in large deviation theory, Sanov's theorem, in Section 5. The entropy proof of this theorem is given in Section 5.1, and the entropy proof of its strong version is given in Section 5.2. As an application, an entropy proof of the Gibbs conditioning principle is presented in Section 6.
1.2 Notations
Let $\mathcal{X}$ be a Hausdorff topological space, and let $\mathcal{B}_{\mathcal{X}}$ be the Borel $\sigma$-algebra on $\mathcal{X}$. Let $P_X$ (or, shortly, $P$) be a probability measure on $\mathcal{X}$. We also use $Q_X, R_X$ (or, shortly, $Q, R$) to denote another two probability measures on $\mathcal{X}$. The probability measures $P_X, Q_X, R_X$ can be thought of as the push-forward measures (or the distributions) induced jointly by the same measurable function $X$ (random variable) from an underlying measurable space to $\mathcal{X}$ and by different probability measures P, Q, R defined on the underlying measurable space. Without loss of generality, we assume that $X$ is the identity map and that P, Q, R are the same as $P_X, Q_X, R_X$. So, $P_X, Q_X, R_X$ can be independently specified to be arbitrary probability measures. We say that all probability measures induced by the underlying measure P, together with the corresponding measurable spaces, constitute the P-system. So, $P_X$ is in fact the distribution of the random variable $X$ in the P-system, where the letter "P" in the notation $P_X$ refers to the P-system and the subscript "X" refers to the random variable. When emphasizing the random variables, we write $X \sim P_X$ to indicate that $X$ follows the distribution $P_X$ in the P-system.

We use $P_X^n$ to denote the $n$-fold product of $P_X$. For a probability measure $P_X$ and a regular conditional distribution (transition probability or Markov kernel) $P_{Y|X}$ from $\mathcal{X}$ to $\mathcal{Y}$, we denote by $P_X P_{Y|X}$ the joint probability measure induced by $P_X$ and $P_{Y|X}$. For a distribution $P_X$ on $\mathcal{X}$ and a measurable subset $A \subseteq \mathcal{X}$, $P_X(\cdot|A)$ denotes the conditional probability measure given $A$. For brevity, we write $P_X(x) := P_X(\{x\})$, $x \in \mathcal{X}$. In particular, if $X \sim P_X$ is discrete, the restriction of $P_X$ to the set of singletons corresponds to the probability mass function of $X$ in the P-system. We use $Q_X \ll P_X$ to denote that the distribution $Q_X$ is absolutely continuous w.r.t. $P_X$. We use $X^n$ to denote a random vector $(X_1, X_2, \ldots, X_n)$ taking values in $(\mathcal{X}^n, \mathcal{B}_{\mathcal{X}}^n)$, and use $x^n := (x_1, x_2, \ldots, x_n)$ to denote its realization. For an $n$-length vector $x^n$, we use $x^i$ to denote the subvector consisting of the first $i$ components of $x^n$, and $x_{i+1}^n$ to denote the subvector consisting of the last $n - i$ components. For a probability measure $P_{X^n}$ on $\mathcal{X}^n$, we use $P_{X_k|X^{k-1}}$ to denote the regular conditional distribution of $X_k$ given $X^{k-1}$ induced by $P_{X^n}$. For a measurable function $f : \mathcal{X} \to \mathbb{R}$, we sometimes adopt the notation $P_X(f) := \mathbb{E}_{P_X}[f(X)] := \int_{\mathcal{X}} f \, dP_X$. For a conditional probability measure $P_{X|Y}$, define the conditional expectation operator induced by $P_{X|Y}$ as $P_{X|Y}(f)(y) := \int f \, dP_{X|Y=y}$ for any measurable function $f : (\mathcal{X}, \mathcal{B}_{\mathcal{X}}) \to (\mathbb{R}, \mathcal{B}_{\mathbb{R}})$, provided the integral is well-defined for every $y$. When the random variable $X$ is clear from the context, we briefly denote $P_X$ as $P$. For example, we denote the expectation $\mathbb{E}_{P_X}[X]$ as $\mathbb{E}_P[X]$.
The relative entropy is defined in (1). The conditional relative entropy is defined as
$$D(Q_{X|W} \| P_{X|W} \,|\, Q_W) = D(Q_{X|W} Q_W \,\|\, P_{X|W} Q_W).$$
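As a small computational check (a sketch with hypothetical finite-alphabet kernels), the conditional relative entropy can be computed either from the two joint measures appearing in the definition or, equivalently, as the $Q_W$-average of the pointwise divergences $D(Q_{X|W=w} \| P_{X|W=w})$.

```python
import numpy as np

def kl(q, p):
    """D(Q||P) in nats for finite distributions with Q << P."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    s = q > 0
    return float(np.sum(q[s] * np.log(q[s] / p[s])))

# Hypothetical toy kernels: row w is the conditional distribution of X given W = w.
q_w = np.array([0.6, 0.4])                         # Q_W
q_x_given_w = np.array([[0.9, 0.1], [0.3, 0.7]])   # Q_{X|W}
p_x_given_w = np.array([[0.5, 0.5], [0.2, 0.8]])   # P_{X|W}

# Definition: divergence between the joint measures Q_{X|W} Q_W and P_{X|W} Q_W.
joint_q = q_x_given_w * q_w[:, None]
joint_p = p_x_given_w * q_w[:, None]
d_joint = kl(joint_q.ravel(), joint_p.ravel())

# Equivalent form: Q_W-average of the per-w divergences.
d_avg = sum(q_w[w] * kl(q_x_given_w[w], p_x_given_w[w]) for w in range(len(q_w)))

assert np.isclose(d_joint, d_avg)
```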
Given $n \ge 1$, the empirical measure of a sequence $x^n \in \mathcal{X}^n$ is
$$L_{x^n} := \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i},$$
where $\delta_x$ is the Dirac mass at the point $x \in \mathcal{X}$.
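For instance, on a finite alphabet the empirical measure is simply the vector of normalized symbol counts; a minimal Python sketch with a hypothetical toy sequence:

```python
from collections import Counter

def empirical_measure(xs):
    """Return L_{x^n} as a dict mapping each symbol to its relative frequency."""
    n = len(xs)
    return {x: c / n for x, c in Counter(xs).items()}

# Toy sequence x^5 = (a, b, a, c, a)
print(empirical_measure(["a", "b", "a", "c", "a"]))  # {'a': 0.6, 'b': 0.2, 'c': 0.2}
```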
Denote $d_P(P, Q) := \inf\{\delta > 0 : P(A) \le Q(A^\delta) + \delta \text{ for all closed } A \subseteq \mathcal{Z}\}$ with $A^\delta := \bigcup_{z \in A} \{z' \in \mathcal{Z} : d(z, z') < \delta\}$, where $\mathcal{Z}$ is $\mathcal{X}$ or $\mathcal{Y}$. Here, $d_P$ is known as the Lévy–Prokhorov metric on $\mathcal{P}(\mathcal{Z})$, which is compatible with the weak topology on $\mathcal{P}(\mathcal{Z})$. We use $B_\delta(R) := \{Q \in \mathcal{P}(\mathcal{Z}) : d_P(R, Q) < \delta\}$ and $B_{\delta]}(R) := \{Q \in \mathcal{P}(\mathcal{Z}) : d_P(R, Q) \le \delta\}$ to respectively denote an open ball and a closed ball under the Lévy–Prokhorov metric. We use $\overline{A}$, $A^o$, and $A^c := \mathcal{Z} \setminus A$ to respectively denote the closure, interior, and complement of a set $A \subseteq \mathcal{Z}$. Denote the sublevel set of the relative entropy (or the divergence "ball") as $D_{\epsilon]}(P_X) := \{Q_X : D(Q_X \| P_X) \le \epsilon\}$ for $\epsilon \ge 0$. The Lévy–Prokhorov metric, the total variation (TV) distance, and the relative entropy admit the following relation: for any $Q_X, P_X$,
$$\sqrt{2 D(Q_X \| P_X)} \ge \|Q_X - P_X\|_{\mathrm{TV}} \ge d_P(Q_X, P_X), \tag{2}$$
which implies, for $\epsilon \ge 0$,
$$D_{\epsilon^2/2]}(P_X) \subseteq B_{\epsilon]}(P_X).$$
The first inequality in (2) is known as Pinsker's inequality, and the second inequality follows by definition [7].
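As a numerical sanity check of the first inequality in (2), the following Python sketch compares $\sqrt{2 D(Q\|P)}$ with the total variation norm, taken here in the unnormalized convention $\|Q - P\|_{\mathrm{TV}} = \sum_x |Q(x) - P(x)|$; the toy distributions and this convention are illustrative assumptions.

```python
import numpy as np

def kl(q, p):
    """D(Q||P) in nats for finite distributions with Q << P."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    s = q > 0
    return float(np.sum(q[s] * np.log(q[s] / p[s])))

def tv_norm(q, p):
    """Total variation norm ||Q - P||_TV = sum_x |Q(x) - P(x)|."""
    return float(np.sum(np.abs(np.asarray(q, float) - np.asarray(p, float))))

rng = np.random.default_rng(0)
for _ in range(5):
    q = rng.dirichlet(np.ones(4))
    p = rng.dirichlet(np.ones(4))
    # Pinsker's inequality, first inequality in (2)
    assert np.sqrt(2 * kl(q, p)) >= tv_norm(q, p)
```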
We denote $\inf \emptyset := +\infty$ and $\sup \emptyset := -\infty$. Given two positive sequences $\{a_n\}_{n=1}^{\infty}$ and $\{b_n\}_{n=1}^{\infty}$, we say that $a_n \lesssim b_n$ if $\limsup_{n\to\infty} a_n/b_n \le 1$, and $a_n \sim b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$.
2 Preliminaries: Information Projection
For a set $\Gamma \subseteq \mathcal{P}(\mathcal{X})$, define
$$D(\Gamma \| P) := \inf_{R \in \Gamma} D(R \| P).$$
Any probability measure $R$ attaining the infimum above is called the information projection, or I-projection, of $P$ on $\Gamma$.
Lemma 1. If $\Gamma$ is closed in the weak topology, then the I-projection exists.

Proof. This lemma follows from the lower semicontinuity of $R \mapsto D(R\|P)$ and the compactness of the sublevel sets of $R \mapsto D(R\|P)$ under the weak topology.

If $\Gamma$ is convex and such an $R$ exists, the convexity of $\Gamma$ guarantees its uniqueness, since $D(R\|P)$ is strictly convex in $R$. Another important property of the I-projection is the following equivalence.
Theorem 1 ([3, Theorem 2.2]). A probability measure $Q \in \Gamma \cap \mathcal{D}(P)$ is the I-projection of $P$ on the convex set $\Gamma$ of probability measures iff every $R \in \Gamma$ satisfies
$$D(R\|P) \ge D(R\|Q) + D(Q\|P). \tag{3}$$
The proof of this theorem is based on differentiating $D(P_\lambda \| P)$ w.r.t. $\lambda$ at $\lambda = 0$, where $P_\lambda = \lambda R + (1 - \lambda) Q$ with $\lambda \in [0, 1]$; see the details in [3, Theorem 2.2]. The inequality (3) can be written as
$$\int \log \frac{dQ}{dP}\, dR \ge D(Q\|P),$$
or
$$\int \log \frac{dQ}{dP}\, d(R - Q) \ge 0.$$
So, for the I-projection $Q$, the set
$$\mathcal{H} := \left\{ R : \int \log \frac{dQ}{dP}\, d(R - Q) = 0 \right\}$$
constitutes a "supporting hyperplane" of $\Gamma$. Moreover, the function $1 + \log \frac{dQ}{dP}$ can be seen as the "gradient" of $Q \mapsto D(Q\|P)$, and $\mathcal{H}$ is the "tangent hyperplane" of $Q \mapsto D(Q\|P)$. Analogously to Euclidean spaces, considering $D(Q\|P)$ as a "distance", it holds that $Q$ is the projection of $P$ onto $\mathcal{H}$, and
$$D(R\|P) = D(R\|Q) + D(Q\|P), \quad \forall R \in \mathcal{H}.$$
As a consequence of Theorem 1, when $\Gamma$ is specified by linear constraints, the I-projection can be written out explicitly. To illustrate this point, we let $f : \mathcal{X} \to \mathbb{R}$ be a measurable function and consider the following I-projection problem:
$$\gamma(\alpha) := \inf_{Q \,:\, Q(f) = \alpha} D(Q\|P).$$
To give the I-projection for $\gamma(\alpha)$, we first introduce the following lemma, which can be easily verified by definition.
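Ahead of that lemma, here is a minimal numerical sketch (Python, on a hypothetical finite alphabet) that assumes the standard exponential-tilting form $\frac{dQ_\lambda}{dP} \propto e^{\lambda f}$ for the minimizer, solves the constraint $Q_\lambda(f) = \alpha$ for $\lambda$, and then checks the Pythagorean relation of Theorem 1 (which holds with equality here, since the constraint set is affine) for a few feasible $R$.

```python
import numpy as np
from scipy.optimize import brentq

# Hypothetical finite-alphabet example: base distribution P and score function f.
p = np.array([0.4, 0.3, 0.2, 0.1])
f = np.array([0.0, 1.0, 2.0, 3.0])
alpha = 2.0  # target mean; must lie strictly between min(f) and max(f)

def tilt(lam):
    """Exponentially tilted distribution with dQ_lam/dP proportional to exp(lam * f)."""
    w = p * np.exp(lam * f)
    return w / w.sum()

def kl(q, r):
    """D(Q||R) in nats; assumes Q << R on this finite alphabet."""
    s = q > 0
    return float(np.sum(q[s] * np.log(q[s] / r[s])))

# Solve Q_lam(f) = alpha for lam, giving the candidate I-projection q_star.
lam = brentq(lambda l: tilt(l) @ f - alpha, -50.0, 50.0)
q_star = tilt(lam)

# For feasible R (R(f) = alpha, R << P), check D(R||P) = D(R||Q*) + D(Q*||P).
for r in [np.array([0.1, 0.2, 0.3, 0.4]), np.array([0.2, 0.1, 0.2, 0.5])]:
    assert np.isclose(r @ f, alpha)
    assert np.isclose(kl(r, p), kl(r, q_star) + kl(q_star, p))
```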