Chapter 8: Differential Entropy
Peng-Hua Wang
Graduate Inst. of Comm. Engineering, National Taipei University
Chapter Outline (Chap. 8 Differential Entropy)
8.1 Definitions
8.2 AEP for Continuous Random Variables
8.3 Relation of Differential Entropy to Discrete Entropy
8.4 Joint and Conditional Differential Entropy
8.5 Relative Entropy and Mutual Information
8.6 Properties of Differential Entropy and Related Quantities
8.1 Definitions
Definitions

Definition 1 (Differential entropy) The differential entropy h(X) of a continuous random variable X with pdf f(x) is defined as

h(X) = -\int_S f(x) \log f(x)\, dx,

where S is the support set of the random variable.

Example. If X \sim U(0, a), then

h(X) = -\int_0^a \frac{1}{a} \log \frac{1}{a}\, dx = \log a.
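A short numerical sketch of this example (assuming numpy and scipy are available; the value of a is illustrative): approximate h(X) for U(0, a) by numerical integration and compare it with log a, working in nats.

```python
# Sketch: approximate h(X) for X ~ U(0, a) numerically and compare with log(a).
# Natural log is used, so the result is in nats. a is an illustrative value.
import numpy as np
from scipy.integrate import trapezoid

a = 3.0
x = np.linspace(1e-9, a, 200_001)        # grid over the support (0, a)
f = np.full_like(x, 1.0 / a)             # uniform density on (0, a)
h_numeric = -trapezoid(f * np.log(f), x)
print(h_numeric, np.log(a))              # both ≈ 1.0986 nats
```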
Differential Entropy of Gaussian

Example. If X \sim N(0, \sigma^2) with pdf \phi(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-x^2/(2\sigma^2)}, then

h_a(X) = -\int \phi(x) \log_a \phi(x)\, dx
       = -\int \phi(x) \left[ \log_a \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2} \log_a e \right] dx
       = \frac{1}{2} \log_a(2\pi\sigma^2) + \frac{\log_a e}{2\sigma^2} E_\phi[X^2]
       = \frac{1}{2} \log_a(2\pi e \sigma^2).
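As a sketch (assuming scipy is available; sigma is an illustrative value), the closed form \frac{1}{2}\log(2\pi e\sigma^2) can be checked against scipy's built-in differential entropy of a normal distribution, working in nats.

```python
# Sketch: differential entropy of N(0, sigma^2) vs the closed form 0.5*log(2*pi*e*sigma^2).
# scipy.stats distributions report differential entropy in nats.
import numpy as np
from scipy.stats import norm

sigma = 2.0
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
h_scipy = norm(scale=sigma).entropy()
print(h_closed, h_scipy)   # both ≈ 2.112 nats
```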
Differential Entropy of Gaussian

Remark. If a random variable with pdf f(x) has zero mean and variance \sigma^2, then

-\int f(x) \log_a \phi(x)\, dx
  = -\int f(x) \left[ \log_a \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{x^2}{2\sigma^2} \log_a e \right] dx
  = \frac{1}{2} \log_a(2\pi\sigma^2) + \frac{\log_a e}{2\sigma^2} E_f[X^2]
  = \frac{1}{2} \log_a(2\pi e \sigma^2).
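A minimal numerical sketch of this remark (assuming numpy/scipy; the Laplace choice and sigma are illustrative): take f to be a zero-mean Laplace density with variance \sigma^2 and check that -\int f \log\phi still equals \frac{1}{2}\log(2\pi e\sigma^2), even though f is not Gaussian.

```python
# Sketch of the remark: for any zero-mean density f with variance sigma^2,
# -∫ f(x) log φ(x) dx = 0.5*log(2*pi*e*sigma^2), where φ is the N(0, sigma^2) pdf.
# Here f is a Laplace density matched to the same variance; result in nats.
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import laplace, norm

sigma = 1.5
b = sigma / np.sqrt(2)                       # Laplace scale so that Var = 2*b^2 = sigma^2
x = np.linspace(-40, 40, 400_001)
f = laplace(scale=b).pdf(x)
cross = -trapezoid(f * norm(scale=sigma).logpdf(x), x)
print(cross, 0.5 * np.log(2 * np.pi * np.e * sigma**2))   # both ≈ 1.824 nats
```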
Gaussian has Maximal Differential Entropy

Suppose a random variable X with pdf f(x) has zero mean and variance \sigma^2. What is its maximal differential entropy? Let \phi(x) be the pdf of N(0, \sigma^2). Then

h(X) + \int f(x) \log \phi(x)\, dx = \int f(x) \log \frac{\phi(x)}{f(x)}\, dx
  \le \log \int f(x) \frac{\phi(x)}{f(x)}\, dx  (Jensen's inequality; the logarithm is concave)
  = \log \int \phi(x)\, dx = 0.

That is,

h(X) \le -\int f(x) \log \phi(x)\, dx = \frac{1}{2} \log(2\pi e \sigma^2),

where the last equality is the remark on the previous slide, and equality holds iff f(x) = \phi(x).
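As a quick illustration (assuming scipy; the Laplace and uniform choices are illustrative), compare the entropies of several zero-mean densities matched to the same variance; the Gaussian comes out largest.

```python
# Sketch: among zero-mean densities with variance sigma^2, the Gaussian maximizes
# differential entropy. Compare h(Gaussian), h(Laplace), h(Uniform) at matched variance; nats.
import numpy as np
from scipy.stats import laplace, norm, uniform

sigma = 1.0
h_gauss = norm(scale=sigma).entropy()
h_laplace = laplace(scale=sigma / np.sqrt(2)).entropy()            # Var = 2*b^2 = sigma^2
half_width = np.sqrt(3) * sigma                                     # Var = (2w)^2 / 12 = sigma^2
h_unif = uniform(loc=-half_width, scale=2 * half_width).entropy()
print(h_gauss, h_laplace, h_unif)   # ≈ 1.419 > 1.347 > 1.243
```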
8.2 AEP for Continuous Random Variables
AEP

Theorem 1 (AEP) Let X_1, X_2, \ldots, X_n be a sequence of i.i.d. random variables with common pdf f(x). Then

-\frac{1}{n} \log f(X_1, X_2, \ldots, X_n) \to E[-\log f(X)] = h(X)  in probability.

Definition 2 (Typical set) For \epsilon > 0, the typical set A_\epsilon^{(n)} with respect to f(x) is defined as

A_\epsilon^{(n)} = \left\{ (x_1, x_2, \ldots, x_n) \in S^n : \left| -\frac{1}{n} \log f(x_1, x_2, \ldots, x_n) - h(X) \right| \le \epsilon \right\}.
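A small simulation sketch of the theorem (assuming numpy/scipy; the Gaussian choice is illustrative): for i.i.d. N(0, 1) samples, the normalized log-likelihood -\frac{1}{n}\log f(X_1, \ldots, X_n) concentrates around h(X) as n grows, in nats.

```python
# Sketch of the AEP: for i.i.d. X_i ~ N(0, 1), -(1/n) * log f(X_1, ..., X_n)
# concentrates around h(X) = 0.5*log(2*pi*e) ≈ 1.4189 nats as n grows.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
h = 0.5 * np.log(2 * np.pi * np.e)
for n in (10, 1_000, 100_000):
    x = rng.normal(size=n)
    print(n, -norm().logpdf(x).mean(), "vs h(X) =", h)
```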
AEP

Definition 3 (Volume) The volume Vol(A) of a set A \subset \mathbb{R}^n is defined as

\mathrm{Vol}(A) = \int_A dx_1\, dx_2 \cdots dx_n.

Theorem 2 (Properties of the typical set)
1. \Pr(A_\epsilon^{(n)}) > 1 - \epsilon for n sufficiently large.
2. \mathrm{Vol}(A_\epsilon^{(n)}) \le 2^{n(h(X)+\epsilon)} for all n.
3. \mathrm{Vol}(A_\epsilon^{(n)}) \ge (1-\epsilon)\, 2^{n(h(X)-\epsilon)} for n sufficiently large.
8.4 Joint and Conditional Differential Entropy
Definitions

Definition 4 (Differential entropy) The differential entropy of jointly distributed random variables X_1, X_2, \ldots, X_n is defined as

h(X_1, X_2, \ldots, X_n) = -\int f(x^n) \log f(x^n)\, dx^n,

where f(x^n) = f(x_1, x_2, \ldots, x_n) is the joint pdf.

Definition 5 (Conditional differential entropy) The conditional differential entropy of jointly distributed random variables X, Y with joint pdf f(x, y) is defined, if it exists, as

h(X|Y) = -\int f(x, y) \log f(x|y)\, dx\, dy = h(X, Y) - h(Y).
Multivariate Normal Distribution

Theorem 3 (Entropy of a multivariate normal) Let X_1, X_2, \ldots, X_n have a multivariate normal distribution with mean vector \mu and covariance matrix K. Then

h(X_1, X_2, \ldots, X_n) = \frac{1}{2} \log\left((2\pi e)^n |K|\right).

Proof. The joint pdf of a multivariate normal distribution is

\phi(x) = \frac{1}{(2\pi)^{n/2} |K|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^t K^{-1} (x-\mu)}.
Multivariate Normal Distribution

Therefore,

h(X_1, X_2, \ldots, X_n) = -\int \phi(x) \log_a \phi(x)\, dx
  = \int \phi(x) \left[ \frac{1}{2} \log_a\left((2\pi)^n |K|\right) + \frac{1}{2} (x-\mu)^t K^{-1} (x-\mu) \log_a e \right] dx
  = \frac{1}{2} \log_a\left((2\pi)^n |K|\right) + \frac{1}{2} (\log_a e)\, \underbrace{E\left[(X-\mu)^t K^{-1} (X-\mu)\right]}_{=\,n}
  = \frac{1}{2} \log_a\left((2\pi)^n |K|\right) + \frac{n}{2} \log_a e
  = \frac{1}{2} \log_a\left((2\pi e)^n |K|\right).
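A numerical sketch of Theorem 3 (assuming scipy; the matrix K below is an arbitrary illustrative positive-definite covariance), comparing the closed form with scipy's multivariate normal entropy, in nats.

```python
# Sketch: entropy of a multivariate normal vs 0.5*log((2*pi*e)^n * |K|); nats.
import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.6, 0.0],
              [0.6, 1.0, 0.3],
              [0.0, 0.3, 1.5]])          # illustrative positive-definite covariance
n = K.shape[0]
h_closed = 0.5 * np.log((2 * np.pi * np.e) ** n * np.linalg.det(K))
h_scipy = multivariate_normal(mean=np.zeros(n), cov=K).entropy()
print(h_closed, h_scipy)                  # both ≈ 4.67 nats
```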
Multivariate Normal Distribution

Let Y = (Y_1, Y_2, \ldots, Y_n)^t be a random vector. If K = E[YY^t], then E[Y^t K^{-1} Y] = n.

Proof. Denote

K = E[YY^t] = \begin{bmatrix} k_1 & k_2 & \cdots & k_n \end{bmatrix}
\qquad\text{and}\qquad
K^{-1} = \begin{bmatrix} a_1^t \\ a_2^t \\ \vdots \\ a_n^t \end{bmatrix}.

We have k_i = E[Y_i Y] and a_j^t k_i = \delta_{ij}.
Multivariate Normal Distribution

Now,

Y^t K^{-1} Y = Y^t \begin{bmatrix} a_1^t \\ a_2^t \\ \vdots \\ a_n^t \end{bmatrix} Y
= (Y_1, Y_2, \ldots, Y_n) \begin{bmatrix} a_1^t Y \\ a_2^t Y \\ \vdots \\ a_n^t Y \end{bmatrix}
= Y_1 a_1^t Y + Y_2 a_2^t Y + \cdots + Y_n a_n^t Y,

and

E[Y^t K^{-1} Y] = a_1^t E[Y_1 Y] + a_2^t E[Y_2 Y] + \cdots + a_n^t E[Y_n Y]
= a_1^t k_1 + a_2^t k_2 + \cdots + a_n^t k_n = n.
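A quick Monte Carlo sketch of this identity (assuming numpy; the 2x2 covariance and sample size are illustrative): the sample mean of the quadratic form Y^t K^{-1} Y approaches n.

```python
# Sketch of the lemma: if K = E[Y Y^t], then E[Y^t K^{-1} Y] = n.
# Monte Carlo check with correlated zero-mean Gaussian samples.
import numpy as np

rng = np.random.default_rng(1)
K = np.array([[1.0, 0.4], [0.4, 2.0]])
Y = rng.multivariate_normal(mean=np.zeros(2), cov=K, size=200_000)   # rows are samples
quad = np.einsum("ij,jk,ik->i", Y, np.linalg.inv(K), Y)              # Y^t K^{-1} Y per sample
print(quad.mean())    # ≈ 2 = n
```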
8.5 Relative Entropy and Mutual Information
Definitions

Definition 6 (Relative entropy) The relative entropy (or Kullback-Leibler distance) D(f||g) between two densities f(x) and g(x) is defined as

D(f||g) = \int f(x) \log \frac{f(x)}{g(x)}\, dx.

Definition 7 (Mutual information) The mutual information I(X; Y) between two random variables with joint density f(x, y) is defined as

I(X; Y) = \int f(x, y) \log \frac{f(x, y)}{f(x) f(y)}\, dx\, dy.
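As an illustrative sketch (assuming numpy/scipy; the two Gaussians and their parameters are arbitrary choices), D(f||g) can be computed by numerical integration. The comparison value is the standard closed form for the KL distance between two univariate Gaussians, which is not derived in these slides.

```python
# Sketch: D(f||g) for two 1-D Gaussians by numerical integration, compared with the
# standard closed form  log(s2/s1) + (s1^2 + (m1-m2)^2)/(2*s2^2) - 1/2   (nats).
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import norm

m1, s1, m2, s2 = 0.0, 1.0, 1.0, 2.0
x = np.linspace(-30, 30, 400_001)
f = norm(m1, s1).pdf(x)
d_numeric = trapezoid(f * (norm(m1, s1).logpdf(x) - norm(m2, s2).logpdf(x)), x)
d_closed = np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5
print(d_numeric, d_closed)   # both ≈ 0.443 nats
```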
Example

Let (X, Y) \sim N(0, K), where

K = \begin{bmatrix} \sigma^2 & \rho\sigma^2 \\ \rho\sigma^2 & \sigma^2 \end{bmatrix}.

Then h(X) = h(Y) = \frac{1}{2}\log(2\pi e \sigma^2) and

h(X, Y) = \frac{1}{2}\log\left((2\pi e)^2 |K|\right) = \frac{1}{2}\log\left((2\pi e)^2 \sigma^4 (1-\rho^2)\right).

Therefore,

I(X; Y) = h(X) + h(Y) - h(X, Y) = -\frac{1}{2}\log(1 - \rho^2).
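A numerical sketch of this example (assuming scipy; sigma and rho are illustrative values), checking I(X; Y) = h(X) + h(Y) - h(X, Y) against -\frac{1}{2}\log(1-\rho^2) in nats.

```python
# Sketch: mutual information of a bivariate normal, via entropies vs the closed form.
import numpy as np
from scipy.stats import multivariate_normal, norm

sigma, rho = 1.0, 0.8
K = sigma**2 * np.array([[1.0, rho], [rho, 1.0]])
h_x = norm(scale=sigma).entropy()            # h(X) = h(Y)
h_xy = multivariate_normal(cov=K).entropy()  # h(X, Y), zero mean by default
print(2 * h_x - h_xy, -0.5 * np.log(1 - rho**2))   # both ≈ 0.511 nats
```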
8.6 Properties of Differential Entropy and Related Quantities
Properties

Theorem 4 (Relative entropy) D(f||g) ≥ 0, with equality iff f = g almost everywhere.

Corollary 1
1. I(X; Y) ≥ 0, with equality iff X and Y are independent.
2. h(X|Y) ≤ h(X), with equality iff X and Y are independent.
Properties

Theorem 5 (Chain rule for differential entropy)

h(X_1, X_2, \ldots, X_n) = \sum_{i=1}^{n} h(X_i | X_1, X_2, \ldots, X_{i-1}).

Corollary 2

h(X_1, X_2, \ldots, X_n) \le \sum_{i=1}^{n} h(X_i).

Corollary 3 (Hadamard's inequality) If K is the covariance matrix of a multivariate normal distribution, then

|K| \le \prod_{i=1}^{n} K_{ii}.
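A minimal numerical sketch of Corollary 3 (assuming numpy; the matrix is generated at random purely for illustration): build a random positive-definite covariance matrix and compare its determinant with the product of its diagonal entries.

```python
# Sketch of Hadamard's inequality: det(K) <= product of the diagonal entries of K.
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(4, 4))
K = A @ A.T + 1e-6 * np.eye(4)                 # random positive-definite covariance matrix
print(np.linalg.det(K), np.prod(np.diag(K)))   # det(K) is no larger than prod of K_ii
```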
Properties

Theorem 6
1. h(X + c) = h(X).
2. h(aX) = h(X) + \log|a|.
3. h(AX) = h(X) + \log|\det(A)|.
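A quick check of property 2 (assuming scipy; a is an illustrative constant), using a Gaussian so that both sides are available in closed form, in nats: if X \sim N(0, 1), then aX \sim N(0, a^2).

```python
# Sketch of Theorem 6.2: for X ~ N(0, 1), h(aX) = h(X) + log|a|, since aX ~ N(0, a^2).
import numpy as np
from scipy.stats import norm

a = 3.0
h_x = norm(scale=1.0).entropy()
h_ax = norm(scale=abs(a)).entropy()
print(h_ax, h_x + np.log(abs(a)))   # both ≈ 2.518 nats
```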
Gaussian has Maximal Entropy

Theorem 7 Let the random vector X \in \mathbb{R}^n have zero mean and covariance K = E[XX^t]. Then h(X) \le \frac{1}{2}\log\left((2\pi e)^n |K|\right), with equality iff X \sim N(0, K).

Proof. Let g(x) be any density satisfying \int x_i x_j\, g(x)\, dx = K_{ij}, and let \phi(x) be the density of N(0, K). Then

0 \le D(g||\phi) = \int g \log(g/\phi) = -h(g) - \int g \log \phi
  = -h(g) - \int \phi \log \phi = -h(g) + h(\phi),

where the last line follows because \log\phi(x) is a quadratic form in x, so \int g \log \phi depends on g only through its second moments, which match those of \phi.

That is, h(g) \le h(\phi), with equality iff g = \phi.