

SLIDE 1

Mathematical Foundations

Foundations of Statistical Natural Language Processing, chapter2

Presented by Jen-Wei Kuo(郭人瑋) CSIE, NTNU rogerkuo@csie.ntnu.edu.tw

SLIDE 2

Reference

  • A First Course in Probability, Sheldon Ross
  • Probability and Random Processes for Electrical Engineering, Alberto Leon-Garcia

SLIDE 3

Outline

  • Elementary Probability Theory

  – Probability spaces
  – Conditional probability and independence
  – Bayes’ theorem
  – Random variables
  – Expectation and variance
  – Joint and conditional distributions
  – Gaussian distributions

  • Essential Information Theory

  – Entropy
  – Joint entropy and conditional entropy
  – Mutual information
  – Relative entropy or Kullback-Leibler divergence

SLIDE 4

Essential Information Theory

Entropy

  • Entropy measures the amount of information in a random variable. It is normally measured in bits.

$$H(X) = -\sum_{x \in X} p(x)\,\log_2 p(x)$$

  • We define $\log = \log_2$, so that entropy is measured in bits.

SLIDE 5

Essential Information Theory

Entropy

  • Example:

Suppose you are reporting the result of rolling an 8-sided die. Then the entropy is:

$$H(X) = -\sum_{i=1}^{8} p(i)\log p(i) = -\sum_{i=1}^{8} \frac{1}{8}\log\frac{1}{8} = -\log\frac{1}{8} = \log 8 = 3 \text{ bits}$$
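
As a quick sanity check, here is a minimal Python sketch of the entropy formula (the helper name `entropy` is ours, not from the slides); it reproduces the 3-bit answer for the eight-sided die:

```python
import math

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x), in bits; terms with p(x) = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The eight-sided die from the example: eight equally likely outcomes.
die = [1 / 8] * 8
print(entropy(die))  # 3.0 bits
```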

SLIDE 6

Essential Information Theory

Entropy

  • Entropy represents the average amount of information needed to transmit an outcome; when we build a system, we want the entropy to be as low as possible.
  • When transmitting a probability, since a probability never exceeds 1, it suffices to transmit only the value of its denominator.

SLIDE 7

Essential Information Theory

Entropy

  • Properties of Entropy:

        = = − =

∑ ∑

∈ ∈

) ( 1 log ) ( 1 log ) ( ) ( log ) ( ) (

2 2

x p E x p x p x p x p X H

X x X x

SLIDE 8

Essential Information Theory

Joint Entropy and Conditional Entropy

  • Joint Entropy:
  • Conditional Entropy:

$$H(X,Y) = -\sum_{x \in X}\sum_{y \in Y} p(x,y)\log p(x,y)$$

$$H(Y|X) = -\sum_{x \in X}\sum_{y \in Y} p(x,y)\log p(y|x)$$

SLIDE 9

Essential Information Theory

Joint Entropy and Conditional Entropy

  • Proof of Conditional Entropy:

$$H(Y|X) = \sum_{x \in X} p(x)\,H(Y|X=x) = \sum_{x \in X} p(x)\left(-\sum_{y \in Y} p(y|x)\log p(y|x)\right) = -\sum_{x \in X}\sum_{y \in Y} p(x,y)\log p(y|x)$$

SLIDE 10

Essential Information Theory

Joint Entropy and Conditional Entropy

  • Chain rule for Entropy:
  • Proof:

$$H(X,Y) = H(X) + H(Y|X)$$

$$\begin{aligned}
H(X,Y) &= -\sum_{x \in X}\sum_{y \in Y} p(x,y)\log p(x,y) \\
&= -\sum_{x \in X}\sum_{y \in Y} p(x,y)\log\big(p(x)\,p(y|x)\big) \\
&= -\sum_{x \in X}\sum_{y \in Y} p(x,y)\big(\log p(x) + \log p(y|x)\big) \\
&= -\sum_{x \in X}\sum_{y \in Y} p(x,y)\log p(x) - \sum_{x \in X}\sum_{y \in Y} p(x,y)\log p(y|x) \\
&= H(X) + H(Y|X)
\end{aligned}$$
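
A small numerical sketch of these definitions in Python, using a made-up joint pmf (all values and names here are illustrative, not from the slides); it also confirms the chain rule above:

```python
import math

def H(probs):
    """Entropy in bits of a collection of probabilities (0 log 0 taken as 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up joint pmf p(x, y) over X = {0, 1} and Y = {0, 1} (illustrative only).
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}

H_XY = H(p_xy.values())  # joint entropy H(X, Y)
H_X = H(p_x.values())    # marginal entropy H(X)

# Conditional entropy H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y)/p(x).
H_Y_given_X = -sum(v * math.log2(v / p_x[x]) for (x, _), v in p_xy.items())

# Chain rule: H(X, Y) = H(X) + H(Y|X).
assert abs(H_XY - (H_X + H_Y_given_X)) < 1e-9
print(H_XY, H_X, H_Y_given_X)
```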

SLIDE 11

Essential Information Theory

Mutual Information

[Figure: Venn diagram showing H(X,Y) partitioned into H(X|Y), I(X;Y), and H(Y|X), with H(X) and H(Y) overlapping in I(X;Y)]

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)$$

SLIDE 12

Essential Information Theory

Mutual Information

  • This difference is called the mutual information between X and Y.
  • It is the amount of information one random variable contains about another.
  • It is 0 only when the two variables are independent; in other words, the mutual information of two independent variables is 0.

SLIDE 13

Essential Information Theory

Mutual Information

  • How can mutual information be calculated simply?

$$\begin{aligned}
I(X;Y) &= H(X) - H(X|Y) = H(X) + H(Y) - H(X,Y) \\
&= \sum_{x} p(x)\log\frac{1}{p(x)} + \sum_{y} p(y)\log\frac{1}{p(y)} + \sum_{x,y} p(x,y)\log p(x,y) \\
&= \sum_{x,y} p(x,y)\left(\log\frac{1}{p(x)} + \log\frac{1}{p(y)} + \log p(x,y)\right) \\
&= \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}
\end{aligned}$$

SLIDE 14

Essential Information Theory

Mutual Information

  • Define the pointwise mutual information between two particular points. This has sometimes been used as a measure of association between elements.

$$I(x, y) = \log\frac{p(x,y)}{p(x)\,p(y)}$$
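
A minimal Python sketch computing the pointwise values and their expectation, the mutual information, for a made-up joint pmf (values are illustrative only):

```python
import math

# A made-up joint pmf (illustrative values only) and its marginals.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}
p_x = {x: sum(v for (a, _), v in p_xy.items() if a == x) for x in (0, 1)}
p_y = {y: sum(v for (_, b), v in p_xy.items() if b == y) for y in (0, 1)}

def pmi(x, y):
    """Pointwise mutual information log2( p(x,y) / (p(x) p(y)) )."""
    return math.log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

# Mutual information I(X;Y) is the expectation of the PMI under p(x, y).
mi = sum(v * pmi(x, y) for (x, y), v in p_xy.items())
print(mi)  # positive here, since this joint pmf is not the product of its marginals
```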

SLIDE 15

Essential Information Theory

Relative Entropy or Kullback-Leibler divergence

  • For two probability mass functions p(x) and q(x), their relative entropy is given by:

$$D(p \,\|\, q) = \sum_{x \in X} p(x)\log\frac{p(x)}{q(x)}$$

  • We define $0\log\frac{0}{q} = 0$ and $p\log\frac{p}{0} = \infty$.
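
A minimal sketch of relative entropy in Python, with the two conventions above handled explicitly (the function name and the distributions are ours, not from the slides):

```python
import math

def relative_entropy(p, q):
    """D(p || q) = sum_x p(x) log2( p(x)/q(x) ), in bits."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue              # convention: 0 log (0/q) = 0
        if qi == 0:
            return math.inf       # convention: p log (p/0) = infinity
        total += pi * math.log2(pi / qi)
    return total

# Illustrative pmfs (not from the slides).
p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(relative_entropy(p, q))  # > 0
print(relative_entropy(q, p))  # differs from D(p || q): relative entropy is asymmetric
print(relative_entropy(p, p))  # 0.0
```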

SLIDE 16

Essential Information Theory

Relative Entropy or Kullback-Leibler divergence

  • Meaning: it is the average number of bits that are wasted by encoding events from a distribution p with a code based on a not-quite-right distribution q.
  • Some authors use the name “KL distance”, but note that relative entropy isn’t a metric (it doesn’t satisfy the triangle inequality).

SLIDE 17

Essential Information Theory

Relative Entropy or Kullback-Leibler divergence

Properties of KL-divergence:

$$I(X;Y) = \sum_{x,y} p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)} = D\big(p(x,y) \,\|\, p(x)\,p(y)\big)$$

Define the conditional relative entropy:

$$D\big(p(y|x) \,\|\, q(y|x)\big) = \sum_{x} p(x)\sum_{y} p(y|x)\log\frac{p(y|x)}{q(y|x)}$$

SLIDE 19

Essential Information Theory

The noisy channel model

The noisy channel model:

W → Encoder → X → Channel p(y|x) → Y → Decoder → Ŵ

Here W is the message from a finite alphabet, X is the input to the channel, Y is the output from the channel, and Ŵ is the attempt to reconstruct the message based on the output.

SLIDE 20

Essential Information Theory

The noisy channel model

[Figure: a binary symmetric channel, in which each transmitted bit is flipped with crossover probability p]

SLIDE 21

Essential Information Theory

The noisy channel model

Capacity:

$$C = \max_{p(X)} I(X;Y) = \max_{p(X)}\big(H(Y) - H(Y|X)\big) = \max_{p(X)} H(Y) - H(p) = 1 - H(p) \le 1$$

The channel capacity describes the rate at which one can transmit information through the channel with an arbitrarily low probability of being unable to recover the input from the output.

If $p = 0$ then $C = 1$; if $p = \frac{1}{2}$ then $C = 0$.
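
A minimal sketch of the capacity formula for the binary symmetric channel (the function names are ours); it reproduces the two special cases above:

```python
import math

def binary_entropy(p):
    """H(p) in bits for a binary outcome with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity C = 1 - H(p) of a binary symmetric channel with crossover probability p."""
    return 1 - binary_entropy(p)

print(bsc_capacity(0.0))  # 1.0: the channel is noiseless
print(bsc_capacity(0.5))  # 0.0: the output carries no information about the input
```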

SLIDE 22

Essential Information Theory

The noisy channel model

Application (in speech recognition):

Input: word sequences
Output: observed speech signal

p(input): probability of word sequences
p(output|input): acoustic model (channel probability)

By Bayes’ theorem:

$$\hat{I} = \arg\max_i p(i \mid o) = \arg\max_i \frac{p(i)\,p(o \mid i)}{p(o)} = \arg\max_i p(i)\,p(o \mid i)$$
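
A toy illustration of this decoding rule; the candidate transcriptions and all probabilities below are invented, and a real recognizer would get p(i) from a language model and p(o|i) from an acoustic model:

```python
# Hypothetical candidates and probabilities, invented for illustration.
candidates = {
    "recognize speech": {"prior": 0.70, "likelihood": 0.20},
    "wreck a nice beach": {"prior": 0.30, "likelihood": 0.25},
}

# argmax_i p(i) p(o | i); the denominator p(o) is constant over i and can be dropped.
best = max(candidates, key=lambda i: candidates[i]["prior"] * candidates[i]["likelihood"])
print(best)  # "recognize speech": 0.70 * 0.20 = 0.14 beats 0.30 * 0.25 = 0.075
```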

SLIDE 23

Essential Information Theory

Cross entropy

Cross entropy:

The cross entropy between a random variable X with true probability distribution p(X) and another pmf q (normally a model of p) is given by:

$$\begin{aligned}
H(X, q) &= H(X) + D(p \,\|\, q) \\
&= \sum_{x \in X} p(x)\log\frac{1}{p(x)} + \sum_{x \in X} p(x)\log\frac{p(x)}{q(x)} \\
&= \sum_{x \in X} p(x)\left(\log\frac{1}{p(x)} + \log\frac{p(x)}{q(x)}\right) \\
&= \sum_{x \in X} p(x)\log\frac{1}{q(x)} \\
&= -\sum_{x \in X} p(x)\log q(x)
\end{aligned}$$
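
A short Python check that the cross entropy equals H(X) + D(p||q), using illustrative pmfs (not from the slides):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def relative_entropy(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q):
    """H(X, q) = -sum_x p(x) log2 q(x), in bits."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

# Illustrative pmfs: p is the "true" distribution, q a not-quite-right model of it.
p = [0.5, 0.25, 0.25]
q = [0.25, 0.25, 0.5]
assert abs(cross_entropy(p, q) - (entropy(p) + relative_entropy(p, q))) < 1e-9
```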

SLIDE 24

Essential Information Theory

Cross entropy

Cross entropy of a language:

Suppose a language L = (X_i) is distributed according to p(x). Its cross entropy with respect to a model m is given by:

$$H(L, m) = \lim_{n \to \infty} -\frac{1}{n}\sum_{x_{1n}} p(x_{1n})\log m(x_{1n})$$

We cannot calculate this quantity without knowing p. But if we make certain assumptions that the language is ‘nice’, then the cross entropy for the language can be calculated as:

$$H(L, m) = \lim_{n \to \infty} -\frac{1}{n}\log m(x_{1n})$$

SLIDE 25

Essential Information Theory

Cross entropy

Cross entropy of a language:

We do not actually attempt to calculate the limit, but approximate it by calculating, for a sufficiently large n:

$$H(L, m) \approx -\frac{1}{n}\log m(x_{1n})$$

This measure is just the figure for our average surprise. Our goal will be to try to minimize this number. Because H(X) is fixed, this is equivalent to minimizing the relative entropy, which is a measure of how much our probability distribution departs from actual language use.
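
A sketch of this approximation, assuming a hypothetical unigram model m so that log m(x_1n) decomposes into per-symbol terms; the model and the sample string are invented for illustration:

```python
import math

# A hypothetical unigram model m, under which log m(x_1n) is the sum of
# per-symbol log probabilities. Model and sample are invented.
model = {"a": 0.5, "b": 0.3, "c": 0.2}
sample = "aabacababcabaaacbaab"  # stands in for a long sample x_1n from L

n = len(sample)
# H(L, m) ≈ -(1/n) log2 m(x_1n)
h = -sum(math.log2(model[ch]) for ch in sample) / n
print(h, "bits per symbol")
```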

SLIDE 26

Essential Information Theory

Perplexity

In the speech recognition community, people tend to refer to perplexity rather than cross entropy. The relationship between the two is simple:

$$\text{perplexity}(x_{1n}, m) = 2^{H(x_{1n},\,m)} = m(x_{1n})^{-\frac{1}{n}}$$

Why do we use perplexity rather than cross entropy?

Because it is much easier to impress funding bodies by saying that “we’ve managed to reduce perplexity from 950 to only 540” than by saying that “we’ve reduced cross entropy from 9.9 to 9.1 bits.”
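
A tiny sketch of the conversion, with inputs chosen to echo the joke above:

```python
# Perplexity is 2 raised to the cross entropy; 9.9 and 9.1 bits are the
# joke's numbers from this slide.
def perplexity(cross_entropy_bits):
    return 2 ** cross_entropy_bits

print(perplexity(9.9))  # about 955
print(perplexity(9.1))  # about 549
```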