SLIDE 1
Introduction to Information Theory and Empirical Statistical Theory
Nan Chen and Andrew J. Majda
Center for Atmosphere Ocean Science, Courant Institute of Mathematical Sciences, New York University
October 13, 2016
SLIDE 2
6.1 Introduction
SLIDE 3
A motivating example.
[Figure: ensembles and the associated PDFs for the truth (blue) and the prediction (red). Case I: |µ1 − µ2| = 2; Case II: |µ1 − µ2| = 1.]
Question: If the blue ensembles are the truth and the red ensembles are the prediction, which case has the better prediction? Regarding the mean state, the answer is Case II. But intuitively ... ? Information theory is needed to quantify the uncertainty and assess the least biased prediction given some statistical constraints.
SLIDE 4
Claude Elwood Shannon
◮ Claude Elwood Shannon (1916 – 2001) was an American mathematician, electrical engineer, and cryptographer known as "the father of information theory".
◮ Information theory was pioneered by Shannon in his 1948 article, "A Mathematical Theory of Communication", Bell System Technical Journal.
◮ The article was later published in 1949 as a book by Shannon and Warren Weaver, "The Mathematical Theory of Communication", University of Illinois Press.
◮ Shannon's 1951 article "Prediction and Entropy of Printed English" applied information theory to natural language.
SLIDE 5
Shannon's other work, hobbies, and inventions:
◮ Shannon's mouse
◮ Computer chess program
◮ Juggling
◮ Using information theory to calculate odds while gambling
◮ The Ultimate Machine
◮ ...
SLIDE 6
6.2 Information theory and Shannon’s entropy
Definition 6.1 (The Shannon entropy)
Let $p$ be a finite, discrete probability measure on the sample space $A = \{a_1, \ldots, a_n\}$,
$$p = \sum_{i=1}^{n} p_i \delta_{a_i}, \qquad p_i \geq 0, \qquad \sum_{i=1}^{n} p_i = 1. \tag{6.2}$$
The Shannon entropy $S(p)$ of the probability $p$ is defined as
$$S(p) = S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i. \tag{6.3}$$
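As a quick numerical companion to Definition 6.1 (not part of the original slides), here is a minimal Python sketch of the Shannon entropy of a discrete distribution; the function name and the sample distributions are illustrative assumptions.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy S(p) = -sum_i p_i ln p_i of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    assert np.all(p >= 0) and np.isclose(p.sum(), 1.0), "p must be a probability vector"
    p = p[p > 0]  # terms with p_i = 0 contribute nothing (convention 0 ln 0 = 0)
    return -np.sum(p * np.log(p))

print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # ln(4) ~ 1.386, maximal for n = 4
print(shannon_entropy([0.70, 0.10, 0.10, 0.10]))  # smaller: less uncertainty
```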
SLIDE 7
Shannon's intuition from the theory of communication
Represent a "word" in a message as a sequence of binary digits of length $n$.
◮ The set $A_{2^n}$ of all words of length $n$ has $2^n = N$ elements.
◮ The amount of information needed to characterize one element is $n = \log_2 N$. Example: $n = 7$, $N = 128$.
Following this type of reasoning, the amount of information needed to characterize an element of any set $A_N$ is $n = \log_2 N$ for general $N$.
SLIDE 8
◮ Now consider a set $A = A_{N_1} \cup \cdots \cup A_{N_k}$, where the sets $A_{N_i}$ are pairwise disjoint and $A_{N_i}$ has $N_i$ elements.
◮ Set $p_i = N_i/N$, where $N = \sum_i N_i$.
Example: $A = A_{N_1} \cup A_{N_2}$ with $N = 24$, $N_1 = 2^3 = 8$ ($p_1 = 8/24$) and $N_2 = 2^4 = 16$ ($p_2 = 16/24$).
◮ If we know that an element of $A$ belongs to some $A_{N_i}$, then we need $\log_2 N_i$ additional information to determine it completely.
◮ The average amount of information we need to determine an element, provided that we already know the $A_{N_i}$ to which it belongs, is given by
$$\sum_i \frac{N_i}{N} \log_2 N_i = \sum_i \frac{N_i}{N} \log_2\Big(\frac{N_i}{N} \cdot N\Big) = \sum_i p_i \log_2 p_i + \log_2 N.$$
◮ Recall that $\log_2 N$ is the information that we need to determine an element of the set $A$ if we do not know to which $A_{N_i}$ a given element belongs.
◮ Thus, the corresponding average lack of information is $-\sum_i p_i \log_2 p_i$.
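The identity above is easy to check numerically. The following sketch (an illustration added here, not from the slides; the subset sizes are arbitrary) verifies that the average conditional information equals $\sum_i p_i \log_2 p_i + \log_2 N$.

```python
import numpy as np

N_i = np.array([8, 16, 40])          # sizes of the disjoint subsets (illustrative)
N = N_i.sum()
p = N_i / N

lhs = np.sum(p * np.log2(N_i))                # average info given the subset
rhs = np.sum(p * np.log2(p)) + np.log2(N)     # sum_i p_i log2 p_i + log2 N
print(np.isclose(lhs, rhs))  # True
```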
SLIDE 9
Proposition 6.1 (Uniqueness of the Shannon entropy)
Let $H_n$ be a function defined on the space of discrete probability measures $PM_n(A)$ and satisfying the following properties:
1. $H_n(p_1, \ldots, p_n)$ is a continuous function.
2. $A(n) = H_n(1/n, \ldots, 1/n)$ is monotonically increasing in $n$, i.e., $H_n$ increases with increasing uncertainty: $H_n(1/n, \ldots, 1/n) > H_{n'}(1/n', \ldots, 1/n')$ for $n > n'$.
3. Composite rule:
$$H_n(p_1, \ldots, p_k, p_{k+1}, \ldots, p_n) = H_2(w_1, w_2) + w_1 H_k(p_1/w_1, \ldots, p_k/w_1) + w_2 H_{n-k}(p_{k+1}/w_2, \ldots, p_n/w_2),$$
where $w_1 = p_1 + \cdots + p_k$, $w_2 = p_{k+1} + \cdots + p_n$, and $p_i/w_j$ are the conditional probabilities.
Then $H_n$ is a positive multiple of the Shannon entropy,
$$H_n(p_1, \ldots, p_n) = K\, S(p_1, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \ln p_i, \qquad K > 0.$$
Key of the proof: the composite rule implies $A(n\nu) = A(n) + A(\nu)$, and the only function satisfying this condition is $A(n) = K \ln n$. See [Majda & Wang, 2006], pages 186-187, for the details of the proof.
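The composite rule in Proposition 6.1 can be checked numerically for the Shannon entropy. A small sketch (added for illustration; the split point k and the distribution are arbitrary choices):

```python
import numpy as np

def S(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.10, 0.20, 0.30, 0.25, 0.15])
k = 2
w1, w2 = p[:k].sum(), p[k:].sum()
lhs = S(p)
rhs = S([w1, w2]) + w1 * S(p[:k] / w1) + w2 * S(p[k:] / w2)
print(np.isclose(lhs, rhs))  # True: H_n = H_2(w1, w2) + w1 H_k(...) + w2 H_{n-k}(...)
```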
SLIDE 10
Definition 6.2 (Empirical maximum entropy principle)
With a given probability measure $p \in PM(A)$, the expected value, or statistical measurement, of $f$ with respect to $p$ is given by
$$\langle f \rangle_p = \sum_{i=1}^{n} f(a_i)\, p_i.$$
A practical issue is to look for the least biased probability distribution consistent with certain statistical constraints,
$$C_L = \big\{ p \in PM(A) \,\big|\, \langle f_j \rangle_p = F_j, \; 1 \leq j \leq L \big\}.$$
The least biased probability distribution $p_L$ given the constraints $C_L$ is the maximum entropy distribution,
$$\max_{p \in C_L} S(p) = S(p_L), \qquad p_L \in C_L.$$
Remark: Uncertainty decreases as more information is included¹:
$$S(p_L) \leq S(p_{L'}), \qquad \text{for } L' < L.$$
¹In thermodynamics, entropy is commonly associated with the amount of order, disorder, or chaos in a thermodynamic system.
SLIDE 11
Example 1. Find the least biased probability distribution $p$ on $A = \{a_1, \ldots, a_n\}$ with no additional constraints.
Maximize
$$S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i,$$
subject to
$$\sum_{i=1}^{n} p_i = 1 \quad \text{and all } p_i \geq 0.$$
SLIDE 12
Example 1. Find the least biased probability distribution $p$ on $A = \{a_1, \ldots, a_n\}$ with no additional constraints.
Maximize
$$S(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \ln p_i,$$
subject to
$$\sum_{i=1}^{n} p_i = 1 \quad \text{and all } p_i \geq 0.$$
Construct the Lagrange function
$$L = -\sum_{i=1}^{n} p_i \ln p_i - \lambda \sum_{i=1}^{n} p_i.$$
The maximizer $p^*$ satisfies $\partial L / \partial p_i = 0$, $i = 1, \ldots, n$, which results in
$$\ln p_i^* + 1 + \lambda = 0, \qquad i = 1, \ldots, n.$$
This implies all the probabilities $p_i^*$ are equal and therefore $p_i^* = 1/n$.
Uniform distribution!
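The analytical result can also be reproduced with a generic constrained optimizer. Below is a hedged sketch (not from the slides) that maximizes the entropy subject only to normalization using scipy; the value of n and the starting point are arbitrary.

```python
import numpy as np
from scipy.optimize import minimize

n = 5

def neg_entropy(p):
    p = np.clip(p, 1e-12, None)      # guard against log(0)
    return np.sum(p * np.log(p))     # minimizing -S(p) maximizes the entropy

res = minimize(
    neg_entropy,
    x0=np.random.dirichlet(np.ones(n)),                 # a feasible starting point
    bounds=[(0.0, 1.0)] * n,
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
)
print(res.x)  # ~ [0.2, 0.2, 0.2, 0.2, 0.2]: the uniform distribution
```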
SLIDE 13
Example 2. Find the least biased probability distribution $p$ on $A = \{a_1, \ldots, a_n\}$ subject to the $r + 1$ constraints ($r + 1 \leq n$),
$$F_j = \langle f_j \rangle_p = \sum_{i=1}^{n} f_j(a_i)\, p_i, \quad j = 1, \ldots, r, \qquad \sum_{i=1}^{n} p_i = 1.$$
SLIDE 14
Example 2. Find the least biased probability distribution $p$ on $A = \{a_1, \ldots, a_n\}$ subject to the $r + 1$ constraints ($r + 1 \leq n$),
$$F_j = \langle f_j \rangle_p = \sum_{i=1}^{n} f_j(a_i)\, p_i, \quad j = 1, \ldots, r, \qquad \sum_{i=1}^{n} p_i = 1.$$
Construct the Lagrange function
$$L = -\sum_{i=1}^{n} p_i \ln p_i - \lambda_0 \sum_{i=1}^{n} p_i - \sum_{j=1}^{r} \lambda_j \langle f_j \rangle_p.$$
Componentwise, $\partial L / \partial p_i = 0$ yields a system of $n$ equations for the unknowns $p_i^*$,
$$\ln p_i^* = -\sum_{j=1}^{r} \lambda_j f_j(a_i) - (\lambda_0 + 1), \qquad i = 1, \ldots, n,$$
and solving for $p_i^*$ we obtain
$$p_i^* = \exp\Big(-\sum_{j=1}^{r} \lambda_j f_j(a_i) - (\lambda_0 + 1)\Big).$$
Exponential family! To eliminate the multiplier $\lambda_0$, we utilize the constraint that the probabilities $p_i^*$ sum to one. This simplifies the formula for $p_i^*$:
$$p_i^* = \frac{\exp\big(-\sum_{j=1}^{r} \lambda_j f_j(a_i)\big)}{\sum_{i=1}^{n} \exp\big(-\sum_{j=1}^{r} \lambda_j f_j(a_i)\big)}.$$
The remaining Lagrange multipliers $\lambda_j$ must be determined through the constraint equations (generally a non-trivial task).
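For a single constraint the multiplier can be found with a one-dimensional root solve. The sketch below (an added illustration) computes the maximum entropy distribution on {1, ..., 6} with a prescribed mean; the target mean F = 4.5 is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import brentq

a = np.arange(1, 7)          # sample points a_i = 1, ..., 6 with f(a_i) = a_i
F = 4.5                      # prescribed mean <f>_p = F (illustrative)

def mean_error(lam):
    w = np.exp(-lam * a)     # exponential-family weights exp(-lambda * f(a_i))
    p = w / w.sum()
    return p @ a - F

lam = brentq(mean_error, -5.0, 5.0)   # solve the constraint equation for lambda
p_star = np.exp(-lam * a)
p_star /= p_star.sum()
print(p_star, p_star @ a)    # tilted distribution with mean ~ 4.5
```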
SLIDE 15
6.3: Most probable states with prior distribution
Example 3: Let $A = A_1 \cup A_2$, where $A_1, A_2$ represent disjoint sample spaces which are completely unrelated, $A_1 = \{a_1, \ldots, a_l\}$, $A_2 = \{a_{l+1}, \ldots, a_n\}$. By applying the maximum entropy principle to each set, we get the least biased measure for each set, i.e.,
$$A_1:\; p^{(1)} = \sum_{j=1}^{l} \frac{1}{l}\, \delta_{a_j}, \qquad A_2:\; p^{(2)} = \sum_{j=1}^{n-l} \frac{1}{n-l}\, \delta_{a_{l+j}}.$$
If we know that statistical measurements on each set are equally important, then the least biased probability measure for the whole set should be
$$p_0 = \frac{1}{2}\sum_{j=1}^{l} \frac{1}{l}\, \delta_{a_j} + \frac{1}{2}\sum_{j=1}^{n-l} \frac{1}{n-l}\, \delta_{a_{l+j}}.$$
For example, if $A_1 = \{a_1, a_2, a_3\}$ and $A_2 = \{a_4, a_5\}$, then $p(a_1) = p(a_2) = p(a_3) = 1/6$ and $p(a_4) = p(a_5) = 1/4$.
SLIDE 16
The maximum entropy principle needs to be extended to accommodate the existence of additional external bias. Suppose we already have the external bias, with the weighted importance (or probability) of each sample point given by
$$p_0 = \sum_{i=1}^{n} p_i^0\, \delta_{a_i}.$$
Also assume we know $r$ additional constraints involving other measurements, i.e., $F_j = \langle f_j \rangle_p$, $j = 1, \ldots, r$. The goal is to select the least biased probability distribution $p^*$ consistent with the given measurements while retaining the external bias expressed by $p_0$. For this purpose we define the relative Shannon entropy,
$$S(p, p_0) = -\sum_{i=1}^{n} p_i \ln\Big(\frac{p_i}{p_i^0}\Big).$$
SLIDE 17
Definition 6.3 (Maximum relative entropy principle)
The least biased probability measure $p^*$, given all the constraints $C$ and the external bias, is the one that satisfies
$$\max_{p \in C} S(p, p_0) = S(p^*, p_0), \qquad \text{where } S(p, p_0) = -\sum_{i=1}^{n} p_i \ln\Big(\frac{p_i}{p_i^0}\Big).$$
SLIDE 18
Definition 6.3 (Maximum relative entropy principle)
The least biased probability measure $p^*$, given all the constraints $C$ and the external bias, is the one that satisfies
$$\max_{p \in C} S(p, p_0) = S(p^*, p_0), \qquad \text{where } S(p, p_0) = -\sum_{i=1}^{n} p_i \ln\Big(\frac{p_i}{p_i^0}\Big).$$
Proposition 1
If there is no real external bias, then the maximum relative entropy principle yields the same probability distribution $p^*$ as the maximum entropy principle. In fact, without external bias,
$$p_0 = \sum_{i=1}^{n} \frac{1}{n}\, \delta_{a_i}.$$
Then the relative entropy $S(p, p_0)$ differs from the entropy $S(p)$ by a constant,
$$S(p, p_0) = S(p) - \ln n.$$
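Proposition 1 is easy to confirm numerically. A short sketch (added for illustration; the test distribution is arbitrary) checks that S(p, p0) = S(p) − ln n when p0 is uniform:

```python
import numpy as np

def S(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def S_rel(p, p0):
    """Relative Shannon entropy S(p, p0) = -sum_i p_i ln(p_i / p0_i)."""
    p, p0 = np.asarray(p, dtype=float), np.asarray(p0, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(p[mask] / p0[mask]))

p = np.array([0.4, 0.3, 0.2, 0.1])
p0 = np.full(4, 0.25)                              # uniform prior: no real external bias
print(np.isclose(S_rel(p, p0), S(p) - np.log(4)))  # True
```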
SLIDE 19
Definition 6.3 (Maximum relative entropy principle)
The least biased probability measure $p^*$, given all the constraints $C$ and the external bias, is the one that satisfies
$$\max_{p \in C} S(p, p_0) = S(p^*, p_0), \qquad \text{where } S(p, p_0) = -\sum_{i=1}^{n} p_i \ln\Big(\frac{p_i}{p_i^0}\Big).$$
Proposition 2
If the only constraint on the system is given by the external bias $p_0$, then the probability distribution predicted by the maximum relative entropy principle must be the external bias itself, $p^* = p_0$. In this case the only constraint is $\sum_{i=1}^{n} p_i = 1$. Utilizing the Lagrange multiplier,
$$-\nabla_p S(p, p_0)\big|_{p=p^*} + \lambda\, \nabla_p \Big(\sum_{i=1}^{n} p_i\Big)\Big|_{p=p^*} = 0,$$
and componentwise it yields
$$\ln\Big(\frac{p_i^*}{p_i^0}\Big) + (1 + \lambda) = 0, \qquad i = 1, \ldots, n.$$
This in turn implies that the ratio $p_i^*/p_i^0$ is a constant independent of $i$ and thus $p^* = p_0$.
SLIDE 20
6.4 Entropy for continuous measures on the line
A continuous probability density $\rho(\lambda)$ on the line satisfies the two requirements
$$\rho(\lambda) \geq 0, \qquad \int_{\mathbb{R}} \rho(\lambda)\, d\lambda = 1.$$
Let $\langle \cdot \rangle$ denote the expected value of a random variable over the probability space where the random variable $q$ is defined. Then
$$\langle F(q) \rangle = \int_{\mathbb{R}} F(\lambda)\, \rho(\lambda)\, d\lambda.$$
The mean $\bar q$ and variance $\sigma^2$ of a random variable are defined respectively by
$$\bar q \equiv \langle q \rangle = \int_{\mathbb{R}} \lambda\, \rho(\lambda)\, d\lambda, \qquad \sigma^2 \equiv \langle (q - \bar q)^2 \rangle = \int_{\mathbb{R}} (\lambda - \bar q)^2 \rho(\lambda)\, d\lambda.$$
Definition 6.4 (Gaussian distribution)
A Gaussian distribution with given mean $\bar\lambda$ and variance $\sigma^2$ has the probability density function
$$\rho_{\bar\lambda,\sigma}(\lambda) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\lambda - \bar\lambda)^2}{2\sigma^2}},$$
so that
$$\bar\lambda = \int \lambda\, \rho_{\bar\lambda,\sigma}(\lambda)\, d\lambda, \qquad \sigma^2 = \int (\lambda - \bar\lambda)^2 \rho_{\bar\lambda,\sigma}(\lambda)\, d\lambda.$$
SLIDE 21
Given some set of constraints $C$ on the space of probability densities and given the probability density $\rho_0$ measuring the external bias, and analogous to the discrete case, the following principles hold for continuous measures:
Maximum entropy principle: $S(\rho^*) = \max_{\rho \in C} S(\rho)$.
Maximum relative entropy principle: $S(\rho^*, \rho_0) = \max_{\rho \in C} S(\rho, \rho_0)$.
SLIDE 22
Example: Find the least biased distribution given the constraints $C$ defined by the first and second moments,
$$\rho(\lambda) \geq 0, \qquad \int_{\mathbb{R}} \rho(\lambda)\, d\lambda = 1, \qquad \bar\lambda = \int_{\mathbb{R}} \lambda\, \rho(\lambda)\, d\lambda, \qquad \sigma^2 = \int_{\mathbb{R}} (\lambda - \bar\lambda)^2 \rho(\lambda)\, d\lambda.$$
SLIDE 23
Example: Find the least biased distribution given the constraints $C$ defined by the first and second moments,
$$\rho(\lambda) \geq 0, \qquad \int_{\mathbb{R}} \rho(\lambda)\, d\lambda = 1, \qquad \bar\lambda = \int_{\mathbb{R}} \lambda\, \rho(\lambda)\, d\lambda, \qquad \sigma^2 = \int_{\mathbb{R}} (\lambda - \bar\lambda)^2 \rho(\lambda)\, d\lambda.$$
The variational derivative of the entropy is given by
$$\frac{\delta S}{\delta \rho} = -(1 + \ln \rho).$$
The first and second moment constraints are linear functionals of the density $\rho$, so that
$$\frac{\delta \bar\lambda}{\delta \rho} = \lambda, \qquad \frac{\delta \sigma^2}{\delta \rho} = (\lambda - \bar\lambda)^2.$$
From the Lagrange multiplier principle we have, at the entropy maximum,
$$-\frac{\delta S}{\delta \rho}\Big|_{\rho = \rho^*} = -\mu_0 - \mu_1 \frac{\delta \bar\lambda}{\delta \rho}\Big|_{\rho = \rho^*} - \mu_2 \frac{\delta \sigma^2}{\delta \rho}\Big|_{\rho = \rho^*},$$
where $\mu_0, \mu_1, \mu_2$ are the Lagrange multipliers for the constraints. Therefore
$$\ln \rho^* = -(\mu_0 + 1) - \mu_1 \lambda - \mu_2 (\lambda - \bar\lambda)^2,$$
which defines $\rho^*$ as a Gaussian probability density with mean $\bar\lambda$ and variance $\sigma^2$, so that $\rho^* = \rho_{\bar\lambda,\sigma}(\lambda)$.
SLIDE 24
Example: Find the least biased distribution given the constraints $C$ defined by the first and second moments,
$$\rho(\lambda) \geq 0, \qquad \int_{\mathbb{R}} \rho(\lambda)\, d\lambda = 1, \qquad \bar\lambda = \int_{\mathbb{R}} \lambda\, \rho(\lambda)\, d\lambda, \qquad \sigma^2 = \int_{\mathbb{R}} (\lambda - \bar\lambda)^2 \rho(\lambda)\, d\lambda.$$
The variational derivative of the entropy is given by
$$\frac{\delta S}{\delta \rho} = -(1 + \ln \rho).$$
The first and second moment constraints are linear functionals of the density $\rho$, so that
$$\frac{\delta \bar\lambda}{\delta \rho} = \lambda, \qquad \frac{\delta \sigma^2}{\delta \rho} = (\lambda - \bar\lambda)^2.$$
From the Lagrange multiplier principle we have, at the entropy maximum,
$$-\frac{\delta S}{\delta \rho}\Big|_{\rho = \rho^*} = -\mu_0 - \mu_1 \frac{\delta \bar\lambda}{\delta \rho}\Big|_{\rho = \rho^*} - \mu_2 \frac{\delta \sigma^2}{\delta \rho}\Big|_{\rho = \rho^*},$$
where $\mu_0, \mu_1, \mu_2$ are the Lagrange multipliers for the constraints. Therefore
$$\ln \rho^* = -(\mu_0 + 1) - \mu_1 \lambda - \mu_2 (\lambda - \bar\lambda)^2,$$
which defines $\rho^*$ as a Gaussian probability density with mean $\bar\lambda$ and variance $\sigma^2$, so that $\rho^* = \rho_{\bar\lambda,\sigma}(\lambda)$.
This fact supports the idea well known from the central limit theorem that Gaussian densities are the most universal distributions with given first and second moments.
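A quick numerical check of this maximum entropy property (added here, not from the slides; the comparison densities are arbitrary choices with unit variance): among the three densities below, the Gaussian has the largest differential entropy.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def diff_entropy(pdf, lo, hi):
    """S(rho) = -integral of rho ln rho, by numerical quadrature."""
    integrand = lambda x: -pdf(x) * np.log(pdf(x)) if pdf(x) > 0 else 0.0
    return quad(integrand, lo, hi, limit=200)[0]

# Three densities with mean 0 and variance 1.
candidates = {
    "Gaussian": stats.norm(0.0, 1.0),
    "Uniform":  stats.uniform(-np.sqrt(3), 2 * np.sqrt(3)),   # width sqrt(12)
    "Laplace":  stats.laplace(0.0, 1.0 / np.sqrt(2)),         # scale 1/sqrt(2)
}
for name, d in candidates.items():
    print(name, diff_entropy(d.pdf, -10, 10))
# The Gaussian value, 0.5*ln(2*pi*e) ~ 1.419, is the largest of the three.
```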
SLIDE 25
Definition (Relative entropy "distance" function)
$$\mathcal{P}(p, \pi_0) = -S(p, \pi_0).$$
The name "distance" is partially justified by the following claim: $\mathcal{P}(p, \pi_0) \geq 0$, with equality only for $p = \pi_0$.
Now use the relative entropy distance function to supply a simple direct proof that the Gaussian maximizes the entropy among all densities with the given first two moments. Recall
$$\rho_{\bar\lambda,\sigma}(\lambda) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(\lambda - \bar\lambda)^2}{2\sigma^2}}.$$
For any probability density $\rho$ satisfying the constraints of the given first two moments with $\rho \neq \rho_{\bar\lambda,\sigma}$, the non-negativity guarantees
$$0 \leq \mathcal{P}(\rho, \rho_{\bar\lambda,\sigma}) = \int \rho \ln \rho - \int \rho \ln \rho_{\bar\lambda,\sigma}
= \int \rho \ln \rho + \int \rho\, \ln\big(\sqrt{2\pi}\,\sigma\big) + \int \rho\, \frac{(\lambda - \bar\lambda)^2}{2\sigma^2}$$
$$= \int \rho \ln \rho + \int \rho_{\bar\lambda,\sigma}\, \ln\big(\sqrt{2\pi}\,\sigma\big) + \int \rho_{\bar\lambda,\sigma}\, \frac{(\lambda - \bar\lambda)^2}{2\sigma^2}
= S(\rho_{\bar\lambda,\sigma}) - S(\rho),$$
so that $S(\rho_{\bar\lambda,\sigma}) > S(\rho)$, as required. Here $\rho$ and $\rho_{\bar\lambda,\sigma}$ have the same first two moments, and therefore
$$\int \rho\, (\lambda - \bar\lambda)^2 = \int \rho_{\bar\lambda,\sigma}\, (\lambda - \bar\lambda)^2.$$
SLIDE 26
Finite-moment problem for probability measures
We quantify the information contained in the moments of a probability density $p(\lambda)$,
$$\bar\lambda = M_1 = \int \lambda\, p(\lambda)\, d\lambda; \qquad M_j = \int (\lambda - \bar\lambda)^j p(\lambda)\, d\lambda, \quad 2 \leq j \leq 2L,$$
in addition to $\int p(\lambda)\, d\lambda = 1$. The probability distribution $p^*_{2L}(\lambda)$ with the least bias given these moment constraints, i.e., the one that maximizes the entropy, is
$$S(p^*_{2L}) = \max_{p \in PM_{2L}} S(p).$$
This gives
$$p^*_{2L} = \exp\Big(\sum_{m=0}^{2L} \alpha_m (\lambda - \bar\lambda)^m\Big),$$
where the $\alpha_m$ are the appropriate Lagrange multipliers. Note that the highest power must be even and $\alpha_{2L} < 0$ to guarantee a finite probability measure. Since adding more moments increases information, it follows that
$$S(p^*_{2L_2}) \leq S(p^*_{2L_1}), \qquad \text{for } L_1 \leq L_2.$$
As in the two-moment example,
$$-\frac{\delta S}{\delta \rho}\Big|_{\rho=\rho^*} = -\mu_0 - \mu_1 \frac{\delta M_1}{\delta \rho}\Big|_{\rho=\rho^*} - \mu_2 \frac{\delta M_2}{\delta \rho}\Big|_{\rho=\rho^*} - \cdots - \mu_{2L} \frac{\delta M_{2L}}{\delta \rho}\Big|_{\rho=\rho^*},$$
$$\ln \rho^* = -(\mu_0 + 1) - \mu_1 \lambda - \mu_2 (\lambda - \bar\lambda)^2 - \cdots - \mu_{2L} (\lambda - \bar\lambda)^{2L}.$$
SLIDE 27
We quantify the information loss in utilizing only the $2L$ moments of $p$:
$$0 \leq \mathcal{P}(p, p^*_{2L}) = \int p \ln p - \int p \ln p^*_{2L} = \int p \ln p - \int p^*_{2L} \ln p^*_{2L} = S(p^*_{2L}) - S(p).$$
The middle equality follows because $\ln p^*_{2L}$ is a sum of the $2L$ moment constraint functions, so that automatically
$$\int p \ln p^*_{2L} = \int p^*_{2L} \ln p^*_{2L}.$$
We immediately have, for $L_1 < L_2$,
$$\mathcal{P}(p, p^*_{2L_1}) = \mathcal{P}(p, p^*_{2L_2}) + \mathcal{P}(p^*_{2L_2}, p^*_{2L_1}). \tag{6.73}$$
We can view measuring fewer moments of $p$ as a coarse-grained measurement of $p$, and (6.73) precisely measures the information loss in the coarse-graining procedure through the term
$$\mathcal{P}(p^*_{2L_2}, p^*_{2L_1}) = S(p^*_{2L_1}) - S(p^*_{2L_2}).$$
SLIDE 28
Gaussian vs. non-Gaussian
When is a probability distribution significantly non-Gaussian? How much additional information is contained in the third and fourth moments of a probability distribution beyond the Gaussian estimate $p^*_G$?
Here, we compare $\mathcal{P}(p^*_4, p^*_G)$ for varying values of the third and fourth moments with fixed mean and variance. The third and fourth moments of $p^*_4$ are characterized by the skewness and flatness (kurtosis),
$$\mathrm{Skew} = \frac{\int (\lambda - \bar\lambda)^3\, p^*_4}{\big(\int (\lambda - \bar\lambda)^2\, p^*_4\big)^{3/2}}, \qquad \mathrm{Flat} = \frac{\int (\lambda - \bar\lambda)^4\, p^*_4}{\big(\int (\lambda - \bar\lambda)^2\, p^*_4\big)^{2}}.$$
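For concreteness (an added illustration, not from the slides), the skewness and flatness integrals can be evaluated by quadrature for a markedly non-Gaussian density; here a Gamma density is used as a stand-in, whose exact values are 2/sqrt(a) and 3 + 6/a.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

pdf = stats.gamma(a=4.0).pdf                  # a non-Gaussian test density

def central_moment(j, lo=0.0, hi=60.0):
    mean = quad(lambda x: x * pdf(x), lo, hi)[0]
    return quad(lambda x: (x - mean) ** j * pdf(x), lo, hi)[0]

m2 = central_moment(2)
skew = central_moment(3) / m2 ** 1.5    # exact: 2/sqrt(4) = 1.0
flat = central_moment(4) / m2 ** 2      # exact: 3 + 6/4 = 4.5 (Gaussian value is 3)
print(skew, flat)
```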
SLIDE 29
Multivariate Gaussian distribution
$$p_{\mathbf{x}}(x_1, \ldots, x_k) = \frac{1}{\sqrt{(2\pi)^k |\Sigma|}} \exp\Big(-\frac{1}{2}(\mathbf{x} - \boldsymbol\mu)^T \Sigma^{-1} (\mathbf{x} - \boldsymbol\mu)\Big),$$
where $\boldsymbol\mu$ is the $k$-dimensional mean and the $k \times k$ matrix $\Sigma$ is the covariance, which is always positive definite. The marginal distribution of a Gaussian over a subset of components is obtained by simply dropping the remaining components. For example, given a Gaussian vector $(x_1, x_2, x_3)^T$ with
$$\boldsymbol\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} & \Sigma_{13} \\ \Sigma_{21} & \Sigma_{22} & \Sigma_{23} \\ \Sigma_{31} & \Sigma_{32} & \Sigma_{33} \end{pmatrix},$$
the marginal density for $(x_1, x_3)$ is given by a Gaussian with
$$\tilde{\boldsymbol\mu} = \begin{pmatrix} \mu_1 \\ \mu_3 \end{pmatrix}, \qquad \tilde\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{13} \\ \Sigma_{31} & \Sigma_{33} \end{pmatrix}.$$
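Marginalizing a Gaussian amounts to selecting sub-blocks of the mean and covariance. A minimal sketch (the numerical values are illustrative):

```python
import numpy as np

mu = np.array([1.0, -0.5, 2.0])
Sigma = np.array([[2.0, 0.3, 0.5],
                  [0.3, 1.0, 0.2],
                  [0.5, 0.2, 1.5]])

idx = [0, 2]                           # keep components (x1, x3)
mu_marg = mu[idx]
Sigma_marg = Sigma[np.ix_(idx, idx)]   # pick out the corresponding 2x2 block
print(mu_marg)                         # [1.  2.]
print(Sigma_marg)                      # [[2.  0.5], [0.5 1.5]]
```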
SLIDE 30
Summary
◮ Shannon entropy
◮ Maximum entropy principle: finding the least biased distribution given statistical constraints
◮ Maximum relative entropy principle: finding the least biased distribution given both statistical constraints and a prior distribution
◮ Entropy and relative entropy for continuous measures on the line
◮ Finite-moment problem and non-Gaussian distributions
SLIDE 31
Appendix I: More material on the relative entropy
The relative entropy of $q$ compared with $p$ is given by
$$\mathcal{P}(p, q) = \int p \ln \frac{p}{q}.$$
◮ $\mathcal{P}(p, q)$ is positive unless $p = q$.
◮ $\mathcal{P}(p, q)$ is invariant under a general nonlinear change of variables. However, $\mathcal{P}(p, q)$ is not symmetric under $p \leftrightarrow q$ and does not obey the triangle inequality.
Relative entropy quantifies
◮ the model error in the imperfect model $q$ with respect to the perfect one $p$;
◮ the gain of information in the posterior distribution $p$ compared with the prior $q$.
SLIDE 32
Proof: Non-negativity of relative entropy
◮ Jensen's inequality: if $f$ is a convex function and $a$ is a real-valued random variable, then $\langle f(a) \rangle \geq f(\langle a \rangle)$.
[Figure: a convex function $f$ with the chord from $(a_1, f(a_1))$ to $(a_2, f(a_2))$ lying above the graph.]
Jensen's inequality generalizes the statement that the secant line of a convex function lies above the graph of the function,
$$\text{``}\langle f(a) \rangle\text{''} = t f(a_1) + (1 - t) f(a_2) \;\geq\; f\big(t a_1 + (1 - t) a_2\big) = \text{``}f(\langle a \rangle)\text{''}.$$
◮ Jensen's inequality leads to the fundamental result that $\mathcal{P}(p, q) \geq 0$:
$$\mathcal{P}(p, q) = \int p(a) \ln \frac{p(a)}{q(a)}\, da = \Big\langle -\ln \frac{q}{p} \Big\rangle \geq -\ln \Big\langle \frac{q}{p} \Big\rangle \quad (\text{Jensen: } -\ln \text{ is convex})$$
$$= -\ln \int \frac{q(a)}{p(a)}\, p(a)\, da = -\ln \int q(a)\, da = 0.$$
SLIDE 33
Proof: Invariance under a general nonlinear change of variables
Assume a change of variables from $x$ to $y$. Then, according to $p(x)\, dx = p(y)\, dy$,
$$\mathcal{P}\big(p(x), q(x)\big) = \int p(x) \ln \frac{p(x)}{q(x)}\, dx = \int p(y) \ln \frac{p(y)\, dy/dx}{q(y)\, dy/dx}\, dy = \int p(y) \ln \frac{p(y)}{q(y)}\, dy = \mathcal{P}\big(p(y), q(y)\big).$$
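This invariance can be verified numerically. The following sketch (added for illustration; the two Gaussians and the map y = exp(x) are arbitrary choices) computes P(p, q) by quadrature in the original and in the transformed variable.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

p = stats.norm(0.0, 1.0).pdf
q = stats.norm(0.5, 1.5).pdf

def rel_entropy(p_pdf, q_pdf, lo, hi, **kw):
    integrand = lambda t: p_pdf(t) * np.log(p_pdf(t) / q_pdf(t))
    return quad(integrand, lo, hi, **kw)[0]

# Change of variables y = exp(x): densities transform as p(y) = p(ln y) / y.
p_y = lambda y: p(np.log(y)) / y
q_y = lambda y: q(np.log(y)) / y

print(rel_entropy(p, q, -15, 15))                       # ~ 0.183 in x-coordinates
print(rel_entropy(p_y, q_y, 1e-8, 1e3,
                  points=[0.1, 1.0, 10.0], limit=200))  # same value in y-coordinates
```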
SLIDE 34
Measuring the lack of information in the imperfect model q compared with nature p
◮ Assume the truth is $p(a)$. The ignorance (lack of information) about $a$ is $I(p(a)) = -\ln(p(a))$. The average ignorance, i.e., the Shannon entropy, is
$$\langle I(p(a)) \rangle = -\int p(a) \ln(p(a))\, da.$$
◮ Assume the imperfect model is $p_M(a)$. The ignorance in the model about $a$ is $I(p_M(a)) = -\ln(p_M(a))$. The model's expected ignorance is
$$\langle I(p_M(a)) \rangle = -\int p(a) \ln(p_M(a))\, da.$$
Note that here the outcomes are actually generated by $p$.
◮ Consider now
$$\langle I(p_M(a)) \rangle - \langle I(p(a)) \rangle = \int p(a) \ln(p(a))\, da - \int p(a) \ln(p_M(a))\, da = \int p(a) \ln \frac{p(a)}{p_M(a)}\, da = \mathcal{P}(p, p_M).$$
Interpretation: $\mathcal{P}(p, p_M)$ is an objective metric for model error that measures the expected increase in ignorance, or lack of information, about the system incurred by using $p_M$ when the outcomes are actually generated by $p$.
SLIDE 35
Appendix II: Shannon entropy and the relative entropy in the Gaussian framework
◮ Recall the Shannon entropy
$$S(\rho) = -\int \rho \ln(\rho).$$
If $\rho \sim \mathcal{N}(\mu, R)$ is an $n$-dimensional Gaussian distribution, then the Shannon entropy has the explicit form
$$S(\rho) = \frac{n}{2}(1 + \ln 2\pi) + \frac{1}{2} \ln \det(R).$$
Uncertainty is reflected in the covariance.
Derivation of the 1-D case. Let
$$\rho(x) = \frac{1}{\sqrt{2\pi R}} \exp\Big(-\frac{(x - \mu)^2}{2R}\Big).$$
Then the Shannon entropy becomes
$$-\int \rho(x) \ln \rho(x)\, dx = -\int \rho(x) \Big(-\frac{1}{2}\ln(2\pi R) - \frac{(x - \mu)^2}{2R}\Big)\, dx = \frac{1}{2}\ln(2\pi R) + \frac{1}{2R}\int (x - \mu)^2 \rho(x)\, dx = \frac{1}{2}\ln(2\pi) + \frac{1}{2}\ln R + \frac{1}{2}.$$
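The 1-D formula can be checked against direct quadrature. A small sketch (the values of µ and R are arbitrary):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

mu, R = 1.0, 2.5
rho = stats.norm(mu, np.sqrt(R)).pdf

numerical = quad(lambda x: -rho(x) * np.log(rho(x)), -30, 30)[0]
closed_form = 0.5 * np.log(2 * np.pi) + 0.5 * np.log(R) + 0.5
print(numerical, closed_form)   # both ~ 1.877
```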
SLIDE 36
Recall that the relative entropy of $q$ compared with $p$ is given by
$$\mathcal{P}(p, q) = \int p \ln \frac{p}{q}.$$
When both $p \sim \mathcal{N}(\bar m_p, R_p)$ and $q \sim \mathcal{N}(\bar m_q, R_q)$ are Gaussian, the relative entropy has an explicit formula,
$$\mathcal{P}(p, q) = \underbrace{\frac{1}{2}(\bar m_p - \bar m_q)^T R_q^{-1} (\bar m_p - \bar m_q)}_{\text{Signal}} + \underbrace{\frac{1}{2}\Big(\mathrm{tr}\big(R_p R_q^{-1}\big) - |K| - \ln \det\big(R_p R_q^{-1}\big)\Big)}_{\text{Dispersion}},$$
where $|K|$ is the dimension of both distributions.
◮ Signal measures the lack of information in the mean, weighted by the model covariance.
◮ Dispersion involves the covariance ratio.
This also implies that even in the Gaussian framework, the uncertainty reduction is not merely reflected in the variance! [Compare with the Shannon entropy.]
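The signal/dispersion decomposition is straightforward to implement. A hedged sketch (the means, covariances, and the function name below are illustrative assumptions):

```python
import numpy as np

def gaussian_relative_entropy(m_p, R_p, m_q, R_q):
    """P(p, q) for Gaussians p ~ N(m_p, R_p) and q ~ N(m_q, R_q), split into signal and dispersion."""
    k = len(m_p)
    Rq_inv = np.linalg.inv(R_q)
    dm = m_p - m_q
    signal = 0.5 * dm @ Rq_inv @ dm
    M = R_p @ Rq_inv
    dispersion = 0.5 * (np.trace(M) - k - np.log(np.linalg.det(M)))
    return signal, dispersion

m_p, R_p = np.array([0.0, 1.0]), np.array([[1.0, 0.2], [0.2, 0.5]])
m_q, R_q = np.array([0.5, 0.5]), np.array([[2.0, 0.0], [0.0, 1.0]])
sig, disp = gaussian_relative_entropy(m_p, R_p, m_q, R_q)
print(sig, disp, sig + disp)   # both terms are non-negative; the sum is P(p, q)
```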