SLIDE 1

KL divergence or relative entropy

Two pmfs p(x) and q(x):

D(p‖q) = Σ_{x∈X} p(x) log (p(x)/q(x))   (5)

Say 0 log (0/q) = 0, and otherwise p log (p/0) = ∞.

D(p‖q) = E_p[ log (p(X)/q(X)) ]   (6)

I(X; Y) = D(p(x, y) ‖ p(x) p(y))   (7)

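A minimal sketch (Python, not from the slides) of equation (5): the function name and example distributions are illustrative, and the 0 log 0 = 0 and p log (p/0) = ∞ conventions from the slide are handled explicitly.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits for two pmfs given as aligned probability lists.

    Conventions from the slide: 0 * log(0/q) = 0, and the result is
    infinite if p(x) > 0 where q(x) = 0.
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue            # 0 log 0 = 0
        if qx == 0:
            return math.inf     # p log (p/0) = infinity
        total += px * math.log2(px / qx)
    return total

# Example: a fair coin vs. a biased one
print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # > 0
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0, since p = q
```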

SLIDE 2
  • Measure of how different two probability distributions are
  • The average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q.
  • D(p‖q) ≥ 0; D(p‖q) = 0 iff p = q
  • Not a metric: not symmetric, doesn't satisfy the triangle inequality

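To make the "wasted bits" reading concrete, here is a small numeric check (an illustrative sketch, not from the slides; the distributions are made up): a code with lengths −log2 q(x) built from the wrong distribution q costs, on average, H(p) + D(p‖q) bits per symbol instead of H(p).

```python
import math

p = [0.5, 0.25, 0.25]   # true distribution of the symbols
q = [0.25, 0.25, 0.5]   # not-quite-right model used to build the code

entropy_p  = -sum(px * math.log2(px) for px in p)                 # H(p)    = 1.5 bits
avg_length = -sum(px * math.log2(qx) for px, qx in zip(p, q))     # H(p, q) = 1.75 bits
kl_pq      =  sum(px * math.log2(px / qx) for px, qx in zip(p, q))

print(avg_length - entropy_p, kl_pq)   # both 0.25: the wasted bits equal D(p || q)
```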

SLIDE 3

[Slide on D(p‖q) vs D(q‖p)]

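To make the D(p‖q) vs D(q‖p) comparison concrete, a quick numerical check (a self-contained sketch with made-up coins, not slide code) shows the two directions generally give different values.

```python
import math

def kl(p, q):
    # D(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.5]   # fair coin
q = [0.9, 0.1]   # heavily biased coin

print(kl(p, q))  # ≈ 0.737 bits
print(kl(q, p))  # ≈ 0.531 bits — the two directions differ, so D is not symmetric
```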

SLIDE 4

Cross entropy

  • Entropy = uncertainty
  • Lower entropy = determining efficient codes = knowing the structure of the language = good measure of model quality
  • Entropy = measure of surprise
  • How surprised we are when w follows h is the pointwise entropy: H(w|h) = −log2 p(w|h)
    What if p(w|h) = 1? What if p(w|h) = 0?

  • Total surprise:
    H_total = − Σ_{j=1}^{n} log2 m(wj | w1, w2, . . . , wj−1) = − log2 m(w1, w2, . . . , wn)

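As a small check of the total-surprise identity above (a sketch with made-up conditional probabilities, not slide code): summing the per-word surprises −log2 m(wj | history) gives the same number as taking −log2 of the whole-sequence probability, by the chain rule.

```python
import math

# Hypothetical values of m(w_j | w_1 .. w_{j-1}) for a four-word sentence
cond_probs = [0.2, 0.5, 0.1, 0.4]

total_surprise = -sum(math.log2(pr) for pr in cond_probs)   # sum of pointwise surprises
joint_surprise = -math.log2(math.prod(cond_probs))          # -log2 m(w_1, ..., w_n)

print(total_surprise, joint_surprise)   # identical
```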

SLIDE 5

Formalizing through cross-entropy

  • Our model of language is q(x). How good a model is it?
  • Idea: use D(p‖q), where p is the correct model.
  • Problem: we don't know p.
  • But we know roughly what it is like from a corpus.
  • Cross entropy:
    H(X, q) = H(X) + D(p‖q)   (8)
            = − Σ_x p(x) log q(x) = E_p[ log (1/q(x)) ]   (9)

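A quick numeric check of equations (8)–(9) (an illustrative sketch; p and q here are made up): computing −Σ p(x) log q(x) directly gives the same value as H(X) + D(p‖q).

```python
import math

p = [0.5, 0.25, 0.25]   # stand-in for the (unknown) correct model
q = [0.4, 0.4, 0.2]     # our language model

cross_entropy = -sum(px * math.log2(qx) for px, qx in zip(p, q))    # eq. (9)
entropy       = -sum(px * math.log2(px) for px in p)                # H(X)
kl            =  sum(px * math.log2(px / qx) for px, qx in zip(p, q))

print(cross_entropy, entropy + kl)   # equal, as in eq. (8)
```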

SLIDE 6
  • Cross entropy of a language L = (Xi) ∼ p(x) according to a model m:
    H(L, m) = − lim_{n→∞} (1/n) Σ_{x1n} p(x1n) log m(x1n)
  • If the language is 'nice':
    H(L, m) = − lim_{n→∞} (1/n) log m(x1n)   (10)
    I.e., it's just our average surprise for large n:
    H(L, m) ≈ − (1/n) log m(x1n)   (11)
  • Since H(L) is fixed (though unknown), minimizing cross-entropy is equivalent to minimizing D(p‖m)
  • Providing: independent test data; assume L = (Xi) is stationary [doesn't change over time], ergodic [doesn't get stuck]

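In practice, equation (11) is estimated on held-out test text: the cross entropy is just the average per-word surprise under the model. A minimal sketch, assuming a hypothetical `model(word, history)` callable that returns m(word | history); any real language-model interface would do.

```python
import math

def cross_entropy_estimate(words, model):
    """Average per-word surprise -(1/n) * log2 m(w_1 .. w_n) in bits."""
    log_prob = 0.0
    for j, w in enumerate(words):
        log_prob += math.log2(model(w, words[:j]))
    return -log_prob / len(words)

# Toy model that ignores the history (a unigram stand-in, made up for illustration)
unigram = {"the": 0.3, "cat": 0.1, "sat": 0.05}
h = cross_entropy_estimate(["the", "cat", "sat"],
                           lambda w, hist: unigram[w])
print(h)   # bits per word on this tiny "test set"
```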

SLIDE 7

Entropy of English text

27-letter alphabet

Model                   Cross entropy (bits)
zeroth order            4.76 (log 27)
first order             4.03
second order            2.8
Shannon's experiment    1.3 (1.34)


SLIDE 8

Perplexity

perplexity(x1n, m) = 2^H(x1n, m) = m(x1n)^(−1/n)

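Perplexity is 2 raised to the per-word cross entropy, or equivalently m(x1n)^(−1/n). A short sketch with made-up per-word probabilities (not slide code) checks that the two forms agree.

```python
import math

probs = [0.3, 0.1, 0.05]   # hypothetical m(w_j | history) for a toy test sentence

h = -sum(math.log2(pr) for pr in probs) / len(probs)   # per-word cross entropy, bits
perplexity = 2 ** h

# Same thing written as m(x1n) ** (-1/n):
print(perplexity, math.prod(probs) ** (-1 / len(probs)))
```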