Mathematical Foundations
Foundations of Statistical Natural Language Processing, Chapter 2
Presented by Jen-Wei Kuo (郭人瑋), CSIE, NTNU, rogerkuo@csie.ntnu.edu.tw
Probability theory:
– Probability spaces
– Conditional probability and independence
– Bayes’ theorem
– Random variables
– Expectation and variance
– Joint and conditional distributions
– Gaussian distributions

Information theory:
– Entropy
– Joint entropy and conditional entropy
– Mutual information
– Relative entropy or Kullback-Leibler divergence
Entropy: the average uncertainty of a single random variable X, measured in bits:
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
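As a quick sketch of this definition in Python (the function name and the example distributions are purely illustrative):

```python
import math

def entropy(probs):
    """Entropy in bits: H(X) = -sum_x p(x) * log2 p(x); zero-probability outcomes are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A fair coin has 1 bit of entropy; a heavily biased coin has much less.
print(entropy([0.5, 0.5]))    # 1.0
print(entropy([0.99, 0.01]))  # ~0.081
```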
Example: reporting the result of rolling a fair 8-sided die, where each outcome i has probability 1/8:
H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3 \text{ bits}
Equivalently, entropy is the expected information content (the average surprise) of an outcome:
H(X) = -\sum_{x \in X} p(x) \log_2 p(x) = \sum_{x \in X} p(x) \log_2 \frac{1}{p(x)}
Joint entropy: the average amount of information needed to specify both X and Y:
H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)
Conditional entropy: the average amount of additional information needed to specify Y once X is known:
H(Y \mid X) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y \mid x)
Expanding the conditional entropy as the expected entropy of the conditional distributions:
H(Y \mid X) = \sum_{x \in X} p(x)\, H(Y \mid X = x)
            = \sum_{x \in X} p(x) \left[ -\sum_{y \in Y} p(y \mid x) \log_2 p(y \mid x) \right]
            = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y \mid x)
Chain rule for entropy: H(X, Y) = H(X) + H(Y \mid X), since
H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)
        = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \bigl[ p(x)\, p(y \mid x) \bigr]
        = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \bigl[ \log_2 p(x) + \log_2 p(y \mid x) \bigr]
        = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x) - \sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y \mid x)
        = H(X) + H(Y \mid X)
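A small numerical check of the chain rule, sketched in Python; the joint distribution below is invented purely for illustration:

```python
import math

# A made-up joint distribution p(x, y) over X = {0, 1}, Y = {0, 1}.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H_joint(p):
    """Joint entropy H(X,Y) = -sum p(x,y) log2 p(x,y)."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)

def H_X(p):
    """Marginal entropy H(X), from the marginal p(x) = sum_y p(x,y)."""
    px = {}
    for (x, _), v in p.items():
        px[x] = px.get(x, 0.0) + v
    return -sum(v * math.log2(v) for v in px.values() if v > 0)

def H_Y_given_X(p):
    """Conditional entropy H(Y|X) = -sum p(x,y) log2 p(y|x)."""
    px = {}
    for (x, _), v in p.items():
        px[x] = px.get(x, 0.0) + v
    return -sum(v * math.log2(v / px[x]) for (x, _), v in p.items() if v > 0)

# Chain rule: H(X,Y) = H(X) + H(Y|X); the two printed numbers agree.
print(H_joint(p_xy), H_X(p_xy) + H_Y_given_X(p_xy))
```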
By the chain rule for entropy, H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y), and therefore H(X) - H(X \mid Y) = H(Y) - H(Y \mid X). This difference, written I(X; Y), is called the mutual information between X and Y: the reduction in uncertainty of one random variable due to knowing about the other, or equivalently, the amount of information one random variable contains about another. In particular, the mutual information of two independent random variables is 0.
I(X; Y) = H(X) - H(X \mid Y)
        = H(X) + H(Y) - H(X, Y)
        = \sum_{x} p(x) \log_2 \frac{1}{p(x)} + \sum_{y} p(y) \log_2 \frac{1}{p(y)} + \sum_{x, y} p(x, y) \log_2 p(x, y)
        = \sum_{x, y} p(x, y) \log_2 \frac{1}{p(x)} + \sum_{x, y} p(x, y) \log_2 \frac{1}{p(y)} + \sum_{x, y} p(x, y) \log_2 p(x, y)
        = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}
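The last line of the derivation can be computed directly; a sketch in Python, again with an invented joint distribution:

```python
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def mutual_information(p):
    """I(X;Y) = sum_{x,y} p(x,y) log2 [ p(x,y) / (p(x) p(y)) ]."""
    px, py = {}, {}
    for (x, y), v in p.items():
        px[x] = px.get(x, 0.0) + v
        py[y] = py.get(y, 0.0) + v
    return sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in p.items() if v > 0)

print(mutual_information(p_xy))  # > 0, since X and Y are not independent here
# For a product distribution p(x,y) = p(x) p(y) the result would be 0.
```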
Relative entropy (Kullback-Leibler divergence): for two pmfs p(x) and q(x),
D(p \| q) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}
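A direct translation of this definition into Python (the two distributions below are illustrative):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2 (p(x)/q(x)); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, q), kl_divergence(q, p))  # note the asymmetry: the two values differ
```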
Mutual information is a relative entropy: it measures how far the joint distribution is from the product of the marginals,
I(X; Y) = \sum_{x, y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)} = D\bigl( p(x, y) \,\|\, p(x)\, p(y) \bigr)
Properties of KL divergence: D(p \| q) \ge 0, with equality if and only if p = q; it is not symmetric and does not obey the triangle inequality, so it is not a true distance.
Define the conditional relative entropy:
D\bigl( p(y \mid x) \,\|\, q(y \mid x) \bigr) = \sum_{x} p(x) \sum_{y} p(y \mid x) \log_2 \frac{p(y \mid x)}{q(y \mid x)}
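A sketch verifying, on an invented joint distribution, that I(X;Y) equals the KL divergence between the joint distribution and the product of its marginals:

```python
import math

p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

px, py = {}, {}
for (x, y), v in p_xy.items():
    px[x] = px.get(x, 0.0) + v
    py[y] = py.get(y, 0.0) + v

# D(p(x,y) || p(x)p(y)): KL divergence from the product of marginals to the joint.
d = sum(v * math.log2(v / (px[x] * py[y])) for (x, y), v in p_xy.items() if v > 0)

# I(X;Y) computed independently as H(X) + H(Y) - H(X,Y).
def H(values):
    return -sum(v * math.log2(v) for v in values if v > 0)

i_xy = H(px.values()) + H(py.values()) - H(p_xy.values())
print(d, i_xy)  # the two values agree
```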
The noisy channel model:
[Figure: W → Encoder → X → Channel p(y|x) → Y → Decoder → Ŵ. The message W comes from a finite alphabet; the encoder produces the input X to the channel; the channel output is Y; the decoder attempts to reconstruct the message from the output, giving Ŵ.]
[Figure: a binary symmetric channel, which transmits each input bit (0 or 1) correctly with probability 1 - p and flips it with crossover probability p.]
Capacity: the channel capacity describes the rate at which one can transmit information through the channel with an arbitrarily low probability of being unable to recover the input from the output:
C = \max_{p(X)} I(X; Y)
For the binary symmetric channel with crossover probability p:
C = \max_{p(X)} \bigl[ H(Y) - H(Y \mid X) \bigr] = \max_{p(X)} H(Y) - H(p) = 1 - H(p), \qquad 0 \le C \le 1
If p = 0 then C = 1 (a noiseless channel); if p = 1/2 then C = 0 (the output carries no information about the input).
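A small sketch of the binary symmetric channel capacity C = 1 - H(p) as a function of the crossover probability (the grid of p values is illustrative):

```python
import math

def binary_entropy(p):
    """H(p) = -p log2 p - (1-p) log2 (1-p), the entropy of a Bernoulli(p) variable."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p."""
    return 1.0 - binary_entropy(p)

for p in (0.0, 0.1, 0.25, 0.5):
    print(p, bsc_capacity(p))
# p = 0.0 -> C = 1 (noiseless); p = 0.5 -> C = 0 (output independent of input)
```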
Application (in speech recognition):
Input: word sequence I. Output: acoustic signal O (the observed speech).
p(i): probability of the word sequence; p(o | i): acoustic model (the channel probability). By Bayes’ theorem, decoding picks the word sequence
\hat{I} = \arg\max_{i} p(i \mid o) = \arg\max_{i} \frac{p(i)\, p(o \mid i)}{p(o)} = \arg\max_{i} p(i)\, p(o \mid i)
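A toy sketch of noisy-channel decoding with this argmax; the candidate word sequences and all probability values below are invented for illustration, not a real language or acoustic model:

```python
# Toy noisy-channel decoder: pick the candidate i maximizing p(i) * p(o | i).
language_model = {            # p(i): prior over candidate word sequences (made up)
    "recognize speech": 0.6,
    "wreck a nice beach": 0.4,
}
acoustic_model = {            # p(o | i): likelihood of the observed audio o given i (made up)
    "recognize speech": 0.3,
    "wreck a nice beach": 0.5,
}

def decode(candidates):
    # argmax_i p(i) p(o|i); p(o) is the same for every candidate, so it can be dropped.
    return max(candidates, key=lambda i: language_model[i] * acoustic_model[i])

print(decode(language_model.keys()))  # "wreck a nice beach", since 0.4*0.5 > 0.6*0.3
```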
Cross entropy:
The cross entropy between a random variable X with true probability distribution p(X) and another pmf q (normally a model of p) is given by:
H(X, q) = H(X) + D(p \| q)
        = \sum_{x \in X} p(x) \log_2 \frac{1}{p(x)} + \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)}
        = \sum_{x \in X} p(x) \left[ \log_2 \frac{1}{p(x)} + \log_2 \frac{p(x)}{q(x)} \right]
        = \sum_{x \in X} p(x) \log_2 \frac{1}{q(x)}
        = -\sum_{x \in X} p(x) \log_2 q(x)
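A sketch checking the identity H(X, q) = H(X) + D(p ‖ q) = -∑ p(x) log2 q(x) on made-up distributions:

```python
import math

p = [0.5, 0.25, 0.25]   # assumed "true" distribution
q = [0.4, 0.4, 0.2]     # a model of p

cross = -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)
h_p   = -sum(pi * math.log2(pi) for pi in p if pi > 0)
d_pq  =  sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(cross, h_p + d_pq)  # identical: cross entropy = entropy + KL divergence
```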
Cross entropy of a language:
Suppose the language L = (X_i) is generated according to p(x). Its cross entropy according to a model m is
H(L, m) = -\lim_{n \to \infty} \frac{1}{n} \sum_{x_{1n}} p(x_{1n}) \log_2 m(x_{1n})
We cannot calculate this quantity without knowing p. But if we make certain assumptions that the language is ‘nice’ (a stationary ergodic process), then the cross entropy of the language can be calculated as
H(L, m) = -\lim_{n \to \infty} \frac{1}{n} \log_2 m(x_{1n})
We do not actually attempt to calculate the limit, but approximate it by calculating the quantity for a sufficiently large n:
H(L, m) \approx -\frac{1}{n} \log_2 m(x_{1n})
This measure is just the figure for our average surprise. Our goal will be to try to minimize this number. Because H(X) is fixed, this is equivalent to minimizing the relative entropy, which is a measure of how much our probability distribution departs from actual language use.
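A sketch of this approximation for a tiny “corpus” under a unigram model; both the sample text and the model probabilities are invented for illustration:

```python
import math

# A tiny sample x_1..x_n from the "language" and a simple unigram model m.
sample = "the cat sat on the mat".split()
m = {"the": 0.3, "cat": 0.1, "sat": 0.1, "on": 0.2, "mat": 0.1, "dog": 0.2}

# H(L, m) is approximated by -(1/n) log2 m(x_1n); for a unigram model
# m(x_1n) factors into a product, so the log becomes a sum over words.
n = len(sample)
cross_entropy = -sum(math.log2(m[w]) for w in sample) / n
print(cross_entropy)  # average surprise, in bits per word
```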
In the speech recognition community, people tend to refer to perplexity rather than cross entropy. The relationship between the two is simple:
H(x_{1n}, m) = -\frac{1}{n} \log_2 m(x_{1n})
\text{perplexity}(x_{1n}, m) = 2^{H(x_{1n},\, m)} = m(x_{1n})^{-\frac{1}{n}}
Why do we use perplexity rather than cross entropy?
Because it is much easier to impress funding bodies by saying that “we’ve managed to reduce perplexity from 950 to only 540” than by saying that “we’ve reduced cross entropy from 9.9 to 9.1 bits.”
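Continuing the sketch above, perplexity is just 2 raised to the per-word cross entropy (the sample text and unigram model are again invented for illustration):

```python
import math

sample = "the cat sat on the mat".split()
m = {"the": 0.3, "cat": 0.1, "sat": 0.1, "on": 0.2, "mat": 0.1, "dog": 0.2}

n = len(sample)
cross_entropy = -sum(math.log2(m[w]) for w in sample) / n

perplexity = 2 ** cross_entropy          # 2^{H(x_1n, m)}
# Equivalently, perplexity = m(x_1n) ** (-1/n):
prob = math.prod(m[w] for w in sample)
print(perplexity, prob ** (-1 / n))      # the two values agree
```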