Mathematical Foundations
Foundations of Statistical Natural Language Processing, Chapter 2
Presented by Jen-Wei Kuo (郭人瑋), CSIE, NTNU, rogerkuo@csie.ntnu.edu.tw
Probability theory:
– Probability spaces
– Conditional probability and independence
– Bayes’ theorem
– Random variables
– Expectation and variance
– Joint and conditional distributions
– Gaussian distributions

Information theory:
– Entropy
– Joint entropy and conditional entropy
– Mutual information
– Relative entropy or Kullback-Leibler divergence
Entropy: the average uncertainty of a single random variable X, measured in bits:
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
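As a quick sketch of this definition in Python (the function name and the example distributions are purely illustrative):

```python
import math

def entropy(probs):
    """Entropy in bits: H(X) = -sum_x p(x) * log2 p(x); zero-probability outcomes are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin has 1 bit of entropy; a heavily biased coin has much less.
print(entropy([0.5, 0.5]))    # 1.0
print(entropy([0.99, 0.01]))  # ~0.081
```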
Example: reporting the result of rolling a fair 8-sided die, where each outcome i has probability 1/8:
H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3 \text{ bits}
Equivalently, entropy is the expected information content (the average surprise) of an outcome:
H(X) = -\sum_{x \in X} p(x) \log_2 p(x) = \sum_{x \in X} p(x) \log_2 \frac{1}{p(x)}
Joint entropy: the average amount of information needed to specify both X and Y:
H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)
Conditional entropy: the average amount of additional information needed to specify Y once X is known:
H(Y \mid X) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y \mid x)
Expanding the conditional entropy as the expected entropy of the conditional distributions:
H(Y \mid X) = \sum_{x \in X} p(x)\, H(Y \mid X = x)
            = \sum_{x \in X} p(x) \left[ -\sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \right]
            = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y \mid x)
Chain rule for entropy: H(X, Y) = H(X) + H(Y \mid X), since
H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)
        = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \bigl[ p(x)\, p(y \mid x) \bigr]
        = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \bigl[ \log_2 p(x) + \log_2 p(y \mid x) \bigr]
        = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x) - \sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y \mid x)
        = H(X) + H(Y \mid X)
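A small numerical check of the chain rule, sketched in Python; the joint distribution below is invented purely for illustration:

```python
import math

# A made-up joint distribution p(x, y) over X = {0, 1}, Y = {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H_joint(p):
    """Joint entropy H(X,Y) = -sum p(x,y) log2 p(x,y)."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def H_X(p):
    """Marginal entropy H(X), from the marginal p(x) = sum_y p(x,y)."""
    px = {}
    for (x, _), v in p.items():
        px[x] = px.get(x, 0.0) + v
    return -sum(v * math.log2(v) for v in px.values() if v > 0)

def H_Y_given_X(p):
    """Conditional entropy H(Y|X) = -sum p(x,y) log2 p(y|x)."""
    px = {}
    for (x, _), v in p.items():
        px[x] = px.get(x, 0.0) + v
    return -sum(v * math.log2(v / px[x]) for (x, _), v in p.items() if v > 0)

# Chain rule: H(X,Y) = H(X) + H(Y|X); the two printed numbers agree.
print(H_joint(p_xy), H_X(p_xy) + H_Y_given_X(p_xy))
```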
By the chain rule for entropy, H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y), and therefore H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). This difference, written I(X; Y), is called the mutual information between X and Y: the reduction in uncertainty of one random variable due to knowing about the other, or equivalently, the amount of information one random variable contains about another. In particular, the mutual information of two independent random variables is 0.
I(X; Y) = H(X) - H(X \mid Y)
        = H(X) + H(Y) - H(X, Y)
        = \sum_{x} p(x) \log_2 \frac{1}{p(x)} + \sum_{y} p(y) \log_2 \frac{1}{p(y)} + \sum_{x, y} p(x, y) \log_2 p(x, y)
        = \sum_{x, y} p(x, y) \log_2 \frac{1}{p(x)} + \sum_{x, y} p(x, y) \log_2 \frac{1}{p(y)} + \sum_{x, y} p(x, y) \log_2 p(x, y)
        = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}
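The last line of the derivation can be computed directly; a sketch in Python, again with an invented joint distribution:

```python
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def mutual_information(p):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]."""
    px, py = {}, {}
    for (x, y), v in p.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in p.items() if v > 0)

print(mutual_information(p_xy))  # > 0, since X and Y are not independent here
# For a product distribution p(x,y) = p(x) p(y) the result would be 0.
```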
Relative entropy (Kullback-Leibler divergence): for two pmfs p(x) and q(x),
D(p \| q) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}
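A direct translation of this definition into Python (the two distributions below are illustrative):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2 (p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q), kl_divergence(q, p))  # note the asymmetry: the two values differ
```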
Mutual information is a relative entropy: it measures how far the joint distribution is from the product of the marginals,
I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)} = D\bigl( p(x, y) \,\|\, p(x)\, p(y) \bigr)
Properties of KL divergence: D(p \| q) \ge 0, with equality if and only if p = q; it is not symmetric and does not obey the triangle inequality, so it is not a true distance.
Define the conditional relative entropy:
D\bigl( p(y \mid x) \,\|\, q(y \mid x) \bigr) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log_2 \frac{p(y \mid x)}{q(y \mid x)}
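A sketch verifying, on an invented joint distribution, that I(X;Y) equals the KL divergence between the joint distribution and the product of its marginals:

```python
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

px, py = {}, {}
for (x, y), v in p_xy.items():
    px[x] = px.get(x, 0.0) + v
    py[y] = py.get(y, 0.0) + v

# D(p(x,y) || p(x)p(y)): KL divergence from the product of marginals to the joint.
d = sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in p_xy.items() if v > 0)

# I(X;Y) computed independently as H(X) + H(Y) - H(X,Y).
def H(values):
    return -sum(v * math.log2(v) for v in values if v > 0)

i_xy = H(px.values()) + H(py.values()) - H(p_xy.values())
print(d, i_xy)  # the two values agree
```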
The noisy channel model:
[Figure: W → Encoder → X → Channel p(y|x) → Y → Decoder → Ŵ. The message W comes from a finite alphabet; the encoder produces the input X to the channel; the channel output is Y; the decoder attempts to reconstruct the message from the output, giving Ŵ.]
[Figure: a binary symmetric channel, which transmits each input bit (0 or 1) correctly with probability 1 - p and flips it with crossover probability p.]
Capacity: the channel capacity describes the rate at which one can transmit information through the channel with an arbitrarily low probability of being unable to recover the input from the output:
C = \max_{p(X)} I(X; Y)
For the binary symmetric channel with crossover probability p:
C = \max_{p(X)} \bigl[ H(Y) - H(Y \mid X) \bigr] = \max_{p(X)} H(Y) - H(p) = 1 - H(p), \qquad 0 \le C \le 1
If p = 0 then C = 1 (a noiseless channel); if p = 1/2 then C = 0 (the output carries no information about the input).
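A small sketch of the binary symmetric channel capacity C = 1 - H(p) as a function of the crossover probability (the grid of p values is illustrative):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), the entropy of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

for p in (0.0, 0.1, 0.25, 0.5):
    print(p, bsc_capacity(p))
# p = 0.0 -> C = 1 (noiseless); p = 0.5 -> C = 0 (output independent of input)
```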
Application (in speech recognition):
Input: word sequence I. Output: acoustic signal O (the observed speech).
p(i): probability of the word sequence; p(o | i): acoustic model (the channel probability). By Bayes’ theorem, decoding picks the word sequence
\hat{I} = \arg\max_{i} p(i \mid o) = \arg\max_{i} \frac{p(i)\, p(o \mid i)}{p(o)} = \arg\max_{i} p(i)\, p(o \mid i)
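A toy sketch of noisy-channel decoding with this argmax; the candidate word sequences and all probability values below are invented for illustration, not a real language or acoustic model:

```python
# Toy noisy-channel decoder: pick the candidate i maximizing p(i) * p(o | i).
language_model = {            # p(i): prior over candidate word sequences (made up)
    "recognize speech": 0.6,
    "wreck a nice beach": 0.4,
}
acoustic_model = {            # p(o | i): likelihood of the observed audio o given i (made up)
    "recognize speech": 0.3,
    "wreck a nice beach": 0.5,
}

def decode(candidates):
    # argmax_i p(i) p(o|i); p(o) is the same for every candidate, so it can be dropped.
    return max(candidates, key=lambda i: language_model[i] * acoustic_model[i])

print(decode(language_model.keys()))  # "wreck a nice beach", since 0.4*0.5 > 0.6*0.3
```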
Cross entropy:
The cross entropy between a random variable X with true probability distribution p(X) and another pmf q (normally a model of p) is given by:
H(X, q) = H(X) + D(p \| q)
        = \sum_{x \in X} p(x) \log_2 \frac{1}{p(x)} + \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}
        = \sum_{x \in X} p(x) \left[ \log_2 \frac{1}{p(x)} + \log_2 \frac{p(x)}{q(x)} \right]
        = \sum_{x \in X} p(x) \log_2 \frac{1}{q(x)}
        = -\sum_{x \in X} p(x) \log_2 q(x)
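A sketch checking the identity H(X, q) = H(X) + D(p ‖ q) = -∑ p(x) log2 q(x) on made-up distributions:

```python
import math

p = [0.5, 0.25, 0.25]   # assumed "true" distribution
q = [0.4, 0.4, 0.2]     # a model of p

cross = -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)
h_p   = -sum(pi * math.log2(pi) for pi in p if pi > 0)
d_pq  =  sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(cross, h_p + d_pq)  # identical: cross entropy = entropy + KL divergence
```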
Cross entropy of a language:
Suppose the language L = (X_i) is generated according to p(x). Its cross entropy according to a model m is
H(L, m) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 m(x_{1n})
We cannot calculate this quantity without knowing p. But if we make certain assumptions that the language is ‘nice’ (a stationary ergodic process), then the cross entropy of the language can be calculated as
H(L, m) = -\lim_{n \to \infty} \frac{1}{n} \log_2 m(x_{1n})
We do not actually attempt to calculate the limit, but approximate it by calculating the quantity for a sufficiently large n:
H(L, m) \approx -\frac{1}{n} \log_2 m(x_{1n})
This measure is just the figure for our average surprise. Our goal will be to try to minimize this number. Because H(X) is fixed, this is equivalent to minimizing the relative entropy, which is a measure of how much our probability distribution departs from actual language use.
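A sketch of this approximation for a tiny “corpus” under a unigram model; both the sample text and the model probabilities are invented for illustration:

```python
import math

# A tiny sample x_1..x_n from the "language" and a simple unigram model m.
sample = "the cat sat on the mat".split()
m = {"the": 0.3, "cat": 0.1, "sat": 0.1, "on": 0.2, "mat": 0.1, "dog": 0.2}

# H(L, m) is approximated by -(1/n) log2 m(x_1n); for a unigram model
# m(x_1n) factors into a product, so the log becomes a sum over words.
n = len(sample)
cross_entropy = -sum(math.log2(m[w]) for w in sample) / n
print(cross_entropy)  # average surprise, in bits per word
```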
In the speech recognition community, people tend to refer to perplexity rather than cross entropy. The relationship between the two is simple:
H(x_{1n}, m) = -\frac{1}{n} \log_2 m(x_{1n})
\text{perplexity}(x_{1n}, m) = 2^{H(x_{1n},\, m)} = m(x_{1n})^{-\frac{1}{n}}
Why do we use perplexity rather than cross entropy?
Because it is much easier to impress funding bodies by saying that “we’ve managed to reduce perplexity from 950 to only 540” than by saying that “we’ve reduced cross entropy from 9.9 to 9.1 bits.”
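Continuing the sketch above, perplexity is just 2 raised to the per-word cross entropy (the sample text and unigram model are again invented for illustration):

```python
import math

sample = "the cat sat on the mat".split()
m = {"the": 0.3, "cat": 0.1, "sat": 0.1, "on": 0.2, "mat": 0.1, "dog": 0.2}

n = len(sample)
cross_entropy = -sum(math.log2(m[w]) for w in sample) / n

perplexity = 2 ** cross_entropy          # 2^{H(x_1n, m)}
# Equivalently, perplexity = m(x_1n) ** (-1/n):
prob = math.prod(m[w] for w in sample)
print(perplexity, prob ** (-1 / n))      # the two values agree
```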