
Introduction to Natural Language Processing I
[Statistické metody zpracování přirozených jazyků I]
(NPFL067) http://ufal.mff.cuni.cz/courses/npfl067
prof. RNDr. Jan Hajič, Dr. / doc. RNDr. Pavel Pecina, Ph.D.
ÚFAL MFF UK


1. Joint and Conditional Probability
• p(A,B) = p(A ∩ B)
• p(A|B) = p(A,B) / p(B)
– Estimating from counts:
• p(A|B) = p(A,B) / p(B) = (c(A ∩ B) / T) / (c(B) / T) = c(A ∩ B) / c(B)
[Venn diagram: A, B, and A ∩ B]
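
A minimal sketch in Python of the count-based estimate above; the event pairs are invented for illustration:

```python
from collections import Counter

# Hypothetical observed (A, B) outcome pairs from T trials
trials = [("rain", "clouds"), ("rain", "clouds"), ("sun", "clouds"), ("sun", "clear")]

c_joint = Counter(trials)               # c(A ∩ B)
c_b = Counter(b for _, b in trials)     # c(B)

def p_cond(a, b):
    # p(A|B) = (c(A ∩ B) / T) / (c(B) / T) = c(A ∩ B) / c(B); T cancels out
    return c_joint[(a, b)] / c_b[b]

print(p_cond("rain", "clouds"))  # 2/3
```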

2. Bayes Rule
• p(A,B) = p(B,A), since p(A ∩ B) = p(B ∩ A)
– therefore: p(A|B) p(B) = p(B|A) p(A), and therefore
p(A|B) = p(B|A) p(A) / p(B) !

3. Independence
• Can we compute p(A,B) from p(A) and p(B)?
• Recall from the previous foil:
p(A|B) = p(B|A) p(A) / p(B)
p(A|B) p(B) = p(B|A) p(A)
p(A,B) = p(B|A) p(A)
... we are almost there: how does p(B|A) relate to p(B)?
– p(B|A) = p(B) iff A and B are independent
• Example: two coin tosses; the weather today and the weather on March 4th, 1789
• Any two events for which p(B|A) = p(B)!

4. Chain Rule
p(A1, A2, A3, A4, ..., An) = !
p(A1|A2,A3,A4,...,An) · p(A2|A3,A4,...,An) · p(A3|A4,...,An) · ... · p(An-1|An) · p(An)
• this is a direct consequence of the Bayes rule.

5. The Golden Rule (of Classic Statistical NLP)
• Interested in an event A given B (when it is not easy or practical or desirable to estimate p(A|B)):
• take the Bayes rule, maximize over all A:
argmax_A p(A|B) = argmax_A p(B|A) · p(A) / p(B) = argmax_A p(B|A) p(A) !
• ... as p(B) is constant when A changes

6. Random Variable
• is a function X: Ω → Q
– in general: Q = R^n, typically R
– easier to handle real numbers than real-world events
• a random variable is discrete if Q is countable (i.e. also if finite)
• Example: die: natural "numbering" [1,6]; coin: {0,1}
• Probability distribution:
– p_X(x) = p(X = x) =df p(A_x), where A_x = {a ∈ Ω: X(a) = x}
– often just p(x) if it is clear from context what X is

7. Expectation; Joint and Conditional Distributions
• Expectation is the mean of a random variable (a weighted average):
– E(X) = Σ_{x ∈ X(Ω)} x · p_X(x)
• Example: one six-sided die: 3.5; two dice (sum): 7
• Joint and conditional distribution rules:
– analogous to probabilities of events
• Bayes: p_{X|Y}(x|y), in notation p_{XY}(x|y) or even simpler p(x|y):
p(x|y) = p(y|x) · p(x) / p(y)
• Chain rule: p(w,x,y,z) = p(z) · p(y|z) · p(x|y,z) · p(w|x,y,z)

8. Standard Distributions
• Binomial (discrete)
– outcome: 0 or 1 (thus: binomial)
– make n trials
– interested in the (probability of the) number of successes r
• Must be careful: it is not uniform!
• p_b(r|n) = (n choose r) / 2^n (for equally likely outcomes)
• (n choose r) counts how many possibilities there are for choosing r objects out of n:
(n choose r) = n! / ((n−r)! r!)

9. Continuous Distributions
• The normal distribution ("Gaussian"):
p_norm(x|μ,σ) = e^(−(x−μ)² / (2σ²)) / (σ √(2π))
• where:
– μ is the mean (the x-coordinate of the peak; 0 for the standard normal)
– σ is the standard deviation (1 for the standard normal)
• others: hyperbolic, t, ...

10. Essential Information Theory

11. The Notion of Entropy
• Entropy ~ "chaos", fuzziness, opposite of order, ...
– you know it:
• it is much easier to create a "mess" than to tidy things up...
• Comes from physics:
– entropy does not go down unless energy is applied
• Measure of uncertainty:
– if low... low uncertainty; the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

12. The Formula
• Let p_X(x) be a distribution of the random variable X
• Basic outcomes (alphabet): Ω
H(X) = −Σ_{x ∈ Ω} p(x) log₂ p(x) !
• Unit: bits (with the natural log: nats)
• Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

13. Using the Formula: Example
• Toss a fair coin: Ω = {head, tail}
– p(head) = .5, p(tail) = .5
– H(p) = −0.5 · log₂(0.5) + (−0.5 · log₂(0.5)) = 2 · ((−0.5) · (−1)) = 2 · 0.5 = 1
• Take a fair, 32-sided die: p(x) = 1/32 for every side x
– H(p) = −Σ_{i=1..32} p(x_i) log₂ p(x_i) = −32 · (p(x₁) log₂ p(x₁)) (since p(x_i) = p(x₁) = 1/32 for all i) = −32 · ((1/32) · (−5)) = 5
(now you see why it's called bits?)
• Unfair coin:
– p(head) = .2 → H(p) = .722; p(head) = .01 → H(p) = .081
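
A small sketch of the entropy formula in Python, reproducing the numbers above (terms with p(x) = 0 are skipped, anticipating the 0 log 0 = 0 convention from the relative-entropy slide):

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) log2 p(x); zero-probability outcomes contribute 0."""
    return -sum(px * math.log2(px) for px in p if px > 0)

print(entropy([0.5, 0.5]))              # 1.0  (fair coin)
print(entropy([1 / 32] * 32))           # 5.0  (fair 32-sided die)
print(round(entropy([0.2, 0.8]), 3))    # 0.722 (unfair coin)
print(round(entropy([0.01, 0.99]), 3))  # 0.081
```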

14. Example: Book Availability
[figure: entropy H(p) as a function of p(Book Available): 0 at p = 0 and p = 1, peaking at 1 bit for p = 0.5; annotated with "bad bookstore" and "good bookstore"]

15. The Limits
• When is H(p) = 0?
– if the result of an experiment is known ahead of time:
– necessarily: ∃x ∈ Ω: p(x) = 1, and ∀y ∈ Ω, y ≠ x: p(y) = 0
• Upper bound?
– none in general
– for |Ω| = n: H(p) ≤ log₂ n
• nothing can be more uncertain than the uniform distribution

16. Entropy and Expectation
• Recall:
– E(X) = Σ_{x ∈ X(Ω)} p_X(x) · x
• Then:
E(log₂(1/p_X(x))) = Σ_{x ∈ X(Ω)} p_X(x) log₂(1/p_X(x)) =
= −Σ_{x ∈ X(Ω)} p_X(x) log₂ p_X(x) =
= H(p_X) =notation H(p)

17. Perplexity: Motivation
• Recall:
– 2 equiprobable outcomes: H(p) = 1 bit
– 32 equiprobable outcomes: H(p) = 5 bits
– 4.3 billion equiprobable outcomes: H(p) ≈ 32 bits
• What if the outcomes are not equiprobable?
– 32 outcomes, 2 equiprobable at .5, the rest impossible:
• H(p) = 1 bit
– Is there a measure of entropy (i.e. uncertainty / difficulty of prediction) that also allows comparing random variables with different numbers of outcomes?

18. Perplexity
• Perplexity:
– G(p) = 2^H(p)
• ... so we are back at 32 (for 32 equiprobable outcomes), 2 for a fair coin, etc.
• it is easier to imagine:
– NLP example: the size of a vocabulary with a uniform distribution which is equally hard to predict
• the "wilder" (more biased) the distribution, the better:
– lower entropy, lower perplexity
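
A short illustration building on the entropy sketch above, reproducing the motivating numbers (32 equiprobable outcomes, a fair coin, and the 32-outcome distribution with only 2 possible outcomes):

```python
import math

def perplexity(p):
    """G(p) = 2^H(p): size of a uniform distribution equally hard to predict."""
    return 2 ** -sum(px * math.log2(px) for px in p if px > 0)

print(perplexity([1 / 32] * 32))            # 32.0
print(perplexity([0.5, 0.5]))               # 2.0
print(perplexity([0.5, 0.5] + [0.0] * 30))  # 2.0: impossible outcomes don't count
```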

19. Joint Entropy and Conditional Entropy
• Two random variables: X (sample space Ω), Y (sample space Ψ)
• Joint entropy:
– no big deal ((X,Y) considered a single event):
H(X,Y) = −Σ_{x ∈ Ω} Σ_{y ∈ Ψ} p(x,y) log₂ p(x,y)
• Conditional entropy:
H(Y|X) = −Σ_{x ∈ Ω} Σ_{y ∈ Ψ} p(x,y) log₂ p(y|x)
recall that H(X) = E(log₂(1/p_X(x))) (a weighted "average", and the weights are not conditional)

20. Conditional Entropy (Using the Calculus)
• other definition:
H(Y|X) = Σ_{x ∈ Ω} p(x) H(Y|X = x) =
(for H(Y|X = x), we can use the single-variable definition, with x ~ constant)
= Σ_{x ∈ Ω} p(x) (−Σ_{y ∈ Ψ} p(y|x) log₂ p(y|x)) =
= −Σ_{x ∈ Ω} Σ_{y ∈ Ψ} p(y|x) p(x) log₂ p(y|x) =
= −Σ_{x ∈ Ω} Σ_{y ∈ Ψ} p(x,y) log₂ p(y|x)
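
A quick numeric check of the last identity on a made-up joint distribution (the x/y values are arbitrary labels):

```python
import math

# Toy joint distribution p(x,y); Y is uncertain only when X = x1
joint = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25, ("x2", "y1"): 0.5}

def p_x(x):
    return sum(p for (xx, _), p in joint.items() if xx == x)

# H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y) / p(x)
h_y_given_x = -sum(p * math.log2(p / p_x(x)) for (x, _), p in joint.items() if p > 0)
print(h_y_given_x)  # 0.5 = p(x1) * H(0.5, 0.5)
```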

21. Properties of Entropy I
• Entropy is non-negative:
– H(X) ≥ 0
– proof (recall: H(X) = −Σ_{x ∈ Ω} p(x) log₂ p(x)):
• log(p(x)) is negative or zero for p(x) ≤ 1,
• p(x) is non-negative; their product p(x) log(p(x)) is thus non-positive;
• a sum of non-positive numbers is non-positive;
• and −f is non-negative for non-positive f
• Chain rule:
– H(X,Y) = H(Y|X) + H(X), as well as
– H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y))

22. Properties of Entropy II
• Conditional entropy is no worse than unconditional:
– H(Y|X) ≤ H(Y) (proof on Monday)
• H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
• equality iff X, Y independent
• [recall: X, Y independent iff p(X,Y) = p(X)p(Y)]
• H(p) is concave (remember the book availability graph?)
– a concave function f over an interval (a,b): ∀x,y ∈ (a,b), ∀λ ∈ [0,1]:
f(λx + (1−λ)y) ≥ λf(x) + (1−λ)f(y)
• a function f is convex if −f is concave
• [for proofs and generalizations, see Cover/Thomas]

23. "Coding" Interpretation of Entropy
• The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p: = H(p)
• Remember various compression algorithms?
– they do well on data with repeating (= easily predictable = low-entropy) patterns
– their output, though, has high entropy ⇒ compressing compressed data does nothing

24. Coding: Example
• How many bits do we need for ISO Latin 1?
– the trivial answer: 8
• Experience: some chars are more common, some (very) rare:
• ... so what if we use more bits for the rare ones, and fewer bits for the frequent ones? [be careful: we want to decode (easily)!]
• suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ .0004
• code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, rest: 11b₁b₂b₃b₄b₅b₆b₇b₈
• encode acbbécbaac:
00 10 01 01 1100001111 10 01 00 00 10
a  c  b  b  é          c  b  a  a  c
• number of bits used: 28 (vs. 80 using the "naive" 8-bit coding)
• code length ~ 1 / probability; conditional probabilities OK as well!

25. Entropy of a Language
• Imagine that we produce the next letter using p(l_{n+1}|l_1,...,l_n), where l_1,...,l_n is the sequence of all the letters which have been uttered so far (i.e. n is really big!); let's call l_1,...,l_n the history h (h_{n+1}), and the set of all histories H:
• Then compute its entropy:
– −Σ_{h ∈ H} Σ_{l ∈ A} p(l,h) log₂ p(l|h)  (A being the alphabet)
• Not very practical, is it?

26. Kullback-Leibler Distance (Relative Entropy)
• Remember:
– a long series of experiments... c_i/T_i oscillates around some number... we can only estimate it... to get a distribution q.
• So we get a distribution q (sample space Ω, r.v. X); the true distribution is, however, p (same Ω, X) → how big an error are we making?
• D(p||q) (the Kullback-Leibler distance):
D(p||q) = Σ_{x ∈ Ω} p(x) log₂ (p(x)/q(x)) = E_p(log₂(p(x)/q(x)))

27. Comments on Relative Entropy
• Conventions:
– 0 log 0 = 0
– p log (p/0) = ∞ (for p > 0)
• Distance? (less "misleading" name: divergence)
– not quite:
• not symmetric: D(p||q) ≠ D(q||p)
• does not satisfy the triangle inequality
– but it is useful to look at it that way
• H(p) + D(p||q): the number of bits needed for encoding p if q is used
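
A sketch of D(p||q) implementing both conventions; the two example distributions are arbitrary and also demonstrate the asymmetry:

```python
import math

def kl(p, q):
    """D(p||q) = sum_x p(x) log2(p(x)/q(x))."""
    d = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue           # convention: 0 log 0 = 0
        if qx == 0:
            return math.inf    # convention: p log (p/0) = infinity (p > 0)
        d += px * math.log2(px / qx)
    return d

p, q = [0.5, 0.5], [0.75, 0.25]
print(kl(p, q), kl(q, p))  # 0.2075... vs. 0.1887...: not symmetric
```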

28. Mutual Information (MI) in Terms of Relative Entropy
• Random variables X, Y; p_{XY}(x,y), p_X(x), p_Y(y)
• Mutual information (between two random variables X, Y):
I(X,Y) = D(p(x,y) || p(x)p(y))
• I(X,Y) measures how much (our knowledge of) Y contributes (on average) to easing the prediction of X
• or: how much p(x,y) deviates from the (independent) p(x)p(y)

29. Mutual Information: The Formula
• Rewrite the definition:
[recall: D(r||s) = Σ_{v ∈ Ω} r(v) log₂ (r(v)/s(v)); substitute r(v) = p(x,y), s(v) = p(x)p(y); v ~ (x,y)]
I(X,Y) = D(p(x,y) || p(x)p(y)) = !
= Σ_{x ∈ Ω} Σ_{y ∈ Ψ} p(x,y) log₂ (p(x,y) / (p(x)p(y)))
• Measured in bits (what else? :-)

30. From Mutual Information to Entropy
• by how many bits does the knowledge of Y lower the entropy H(X)?
I(X,Y) = Σ_{x ∈ Ω} Σ_{y ∈ Ψ} p(x,y) log₂ (p(x,y)/(p(y)p(x))) =
... use p(x,y)/p(y) = p(x|y)
= Σ_x Σ_y p(x,y) log₂ (p(x|y)/p(x)) =
... use log(a/b) = log a − log b (a ~ p(x|y), b ~ p(x)), distribute the sums
= Σ_x Σ_y p(x,y) log₂ p(x|y) − Σ_x Σ_y p(x,y) log₂ p(x) =
... use the def. of H(X|Y) (left term), and Σ_y p(x,y) = p(x) (right term)
= −H(X|Y) + (−Σ_x p(x) log₂ p(x)) =
... use the def. of H(X) (right term), swap the terms
= H(X) − H(X|Y)
... and by symmetry: = H(Y) − H(Y|X)
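
A numeric check of the derivation on a toy joint distribution: the definition of I(X,Y) and H(X) − H(X|Y) agree:

```python
import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}  # made-up p(x,y)
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in (0, 1)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in (0, 1)}

# I(X,Y) from the definition
mi = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0)

# H(X) - H(X|Y)
h_x = -sum(p * math.log2(p) for p in px.values())
h_x_y = -sum(p * math.log2(p / py[y]) for (x, y), p in joint.items() if p > 0)
print(round(mi, 4), round(h_x - h_x_y, 4))  # both 0.2781
```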

31. Properties of MI vs. Entropy
• I(X,Y) = H(X) − H(X|Y)
= the number of bits the knowledge of Y lowers the entropy of X
= H(Y) − H(Y|X) (previous foil, symmetry)
• Recall: H(X,Y) = H(X|Y) + H(Y) ⇒ −H(X|Y) = H(Y) − H(X,Y) ⇒
• I(X,Y) = H(X) + H(Y) − H(X,Y)
• I(X,X) = H(X) (since H(X|X) = 0)
• I(X,Y) = I(Y,X) (just for completeness)
• I(X,Y) ≥ 0 ... let's prove that now (as promised).

32. Jensen's Inequality
• Recall: f is convex on an interval (a,b) iff ∀x,y ∈ (a,b), ∀λ ∈ [0,1]:
f(λx + (1−λ)y) ≤ λf(x) + (1−λ)f(y)
• Jensen's inequality: for a distribution p(x), r.v. X on Ω, and convex f:
f(Σ_{x ∈ Ω} p(x) · x) ≤ Σ_{x ∈ Ω} p(x) f(x)
• Proof (idea): by induction on the number of basic outcomes; start with |Ω| = 2:
• p(x₁)f(x₁) + p(x₂)f(x₂) ≥ f(p(x₁)x₁ + p(x₂)x₂) (that is the def. of convexity)
• for the induction step (|Ω| = k → k+1), just use the induction hypothesis and the def. of convexity (again).

33. Information Inequality
D(p||q) ≥ 0 !
• Proof:
0 = −log 1 = −log Σ_{x ∈ Ω} q(x) = −log Σ_{x ∈ Ω} p(x) (q(x)/p(x)) ≤
... apply Jensen's inequality here (−log is convex) ...
≤ Σ_{x ∈ Ω} p(x) (−log(q(x)/p(x))) = Σ_{x ∈ Ω} p(x) log (p(x)/q(x)) =
= D(p||q)

34. Other (In)Equalities and Facts
• Log sum inequality: for r_i, s_i ≥ 0:
Σ_{i=1..n} (r_i log(r_i/s_i)) ≥ (Σ_{i=1..n} r_i) log((Σ_{i=1..n} r_i) / (Σ_{i=1..n} s_i))
• D(p||q) is convex [in (p,q)] (follows from the log sum inequality)
• H(p_X) ≤ log₂ |Ω|, where Ω is the sample space of p_X
Proof: take the uniform u(x) on the same sample space Ω: Σ p(x) log₂ u(x) = −log₂ |Ω|; then
log₂ |Ω| − H(X) = −Σ p(x) log₂ u(x) + Σ p(x) log₂ p(x) = D(p||u) ≥ 0
• H(p) is concave [in p]:
Proof: from H(X) = log₂ |Ω| − D(p||u); D(p||u) is convex ⇒ H(p) is concave

35. Cross-Entropy
• Typical case: we've got a series of observations T = {t₁, t₂, t₃, t₄, ..., t_n} (numbers, words, ...; t_i ∈ Ω); the (simple) estimate:
∀y ∈ Ω: p̂(y) = c(y) / |T|, where c(y) = |{t ∈ T: t = y}|
• ... but the true p is unknown; every sample is too small!
• Natural question: how well do we do using p̂ [instead of p]?
• Idea: simulate the actual p by using different data T' (or rather: by using different observations we simulate the insufficiency of T vs. some other data (a "random" difference))

36. Cross-Entropy: The Formula
• H_{p'}(p̂) = H(p') + D(p' || p̂)
H_{p'}(p̂) = −Σ_{x ∈ Ω} p'(x) log₂ p̂(x) !
• p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p̂
• note on notation (confusing...): p̂ vs. p'; also written H_{T'}(p̂)
• (Cross-)Perplexity: G_{p'}(p̂) = G_{T'}(p̂) = 2^{H_{p'}(p̂)}

37. Conditional Cross-Entropy
• So far: "unconditional" distribution(s) p(x), p'(x)...
• In practice: virtually always conditioning on context
• Interested in: sample space Ψ, r.v. Y, y ∈ Ψ; context: sample space Ω, r.v. X, x ∈ Ω:
"our" distribution p(y|x); test it against p'(y,x), which is taken from some independent data:
H_{p'}(p) = −Σ_{y ∈ Ψ, x ∈ Ω} p'(y,x) log₂ p(y|x)

38. Sample Space vs. Data
• In practice, it is often inconvenient to sum over the sample space(s) Ψ, Ω (especially for cross-entropy!)
• Use the following formula instead:
H_{p'}(p) = −Σ_{y ∈ Ψ, x ∈ Ω} p'(y,x) log₂ p(y|x) = −(1/|T'|) Σ_{i=1..|T'|} log₂ p(y_i|x_i) !
• This is in fact the normalized log probability of the "test" data:
H_{p'}(p) = −(1/|T'|) log₂ Π_{i=1..|T'|} p(y_i|x_i)

39. Computation Example
• Ω = {a, b, ..., z}; probability distribution (assumed/estimated from data):
p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c,...,r}, = 0 for the rest: s, t, u, v, w, x, y, z
• Test data T': barb → p'(a) = p'(r) = .25, p'(b) = .5
• Sum over Ω: the only nonzero terms −p'(α) log₂ p(α) are
a: .25 · 2 = .5; b: .5 · 1 = .5; r: .25 · 6 = 1.5 → total 2.5
• Sum over the data (s_i = b, a, r, b): −log₂ p(s_i) = 1 + 2 + 6 + 1 = 10;
(1/|T'|) · 10 = (1/4) · 10 = 2.5
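
The same example in a few lines of Python, using the normalized log probability form:

```python
import math

# The slide's distribution: p(a) = .25, p(b) = .5, p = 1/64 for c..r, 0 otherwise
p = {"a": 0.25, "b": 0.5, **{ch: 1 / 64 for ch in "cdefghijklmnopqr"}}

test = "barb"
h = -sum(math.log2(p[ch]) for ch in test) / len(test)
print(h)  # 2.5 bits, matching the sum over the sample space
```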

40. Cross-Entropy: Some Observations
• How do H(p) and H_{p'}(p) compare? <, =, > are ALL possible!
• Previous example: [p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c,...,r}, = 0 for the rest: s, t, u, v, w, x, y, z]
H(p) = 2.5 bits = H_{p'}(p) (test data: barb)
• Other data: probable: (1/8)(6+6+6+1+2+1+6+6) = 4.25
H(p) < 4.25 bits = H_{p'}(p) (probable)
• And finally: abba: (1/4)(2+1+1+2) = 1.5
H(p) > 1.5 bits = H_{p'}(p) (abba)
• But what about baby? −p'('y') log₂ p('y') = −.25 · log₂ 0 = ∞ (!!)

41. Cross-Entropy: Usage
• Comparing data?
– NO! (we believe that we test on real data!)
• Rather: comparing distributions (vs. real data)
• Have (got) two distributions: p and q (on some Ω, X)
– which is better?
– the better one has lower cross-entropy (perplexity) on real data S
• "Real" data: S
H_S(p) = −(1/|S|) Σ_{i=1..|S|} log₂ p(y_i|x_i)  ??  H_S(q) = −(1/|S|) Σ_{i=1..|S|} log₂ q(y_i|x_i)

42. Comparing Distributions
• Test data S: probable
• p(.) from the previous example: H_S(p) = 4.25
p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c,...,r}, = 0 for the rest: s, t, u, v, w, x, y, z
• q(.|.) (conditional; defined by a table):

q(y|x)    x: a    b    e    l    o    p     r    other
y: a         0    .5   0    0    0    .125  0    0
y: b         1    0    0    0    1    .125  0    0
y: e         0    0    0    1    0    .125  0    0
y: l         0    .5   0    0    0    .125  0    0
y: o         0    0    0    0    0    .125  1    0
y: p         0    0    0    0    0    .125  0    1
y: r         0    0    0    0    0    .125  0    0
y: other     0    0    1    0    0    .125  0    0

ex.: q(o|r) = 1, q(r|p) = .125
H_S(q) = (1/8)(−log₂ q(p|other) − log₂ q(r|p) − log₂ q(o|r) − log₂ q(b|o) − log₂ q(a|b) − log₂ q(b|a) − log₂ q(l|b) − log₂ q(e|l))
= (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0) = .625

43. Language Modeling (and the Noisy Channel)

44. The Noisy Channel
• Prototypical case:
Input → The channel (adds noise) → Output
0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...
• Model: the probability of error (noise):
• Example: p(0|1) = .3, p(1|1) = .7, p(1|0) = .4, p(0|0) = .6
• The task: known: the noisy output; want to know: the input (decoding)

45. Noisy Channel Applications
• OCR
– straightforward: text → print (adds noise), scan → image
• Handwriting recognition
– text → neurons, muscles ("noise"), scan/digitize → image
• Speech recognition (dictation, commands, etc.)
– text → conversion to acoustic signal ("noise") → acoustic waves
• Machine Translation
– text in target language → translation ("noise") → source language
• Also: Part-of-Speech Tagging
– sequence of tags → selection of word forms → text

46. Noisy Channel: The Golden Rule of ... OCR, ASR, HR, MT, ...
• Recall:
p(A|B) = p(B|A) p(A) / p(B) (Bayes formula)
A_best = argmax_A p(B|A) p(A) (the Golden Rule)
• p(B|A): the acoustic/image/translation/lexical model
– application-specific name
– will explore later
• p(A): the language model

47. The Perfect Language Model
• A sequence of word forms [forget about tagging for the moment]
• Notation: A ~ W = (w₁, w₂, w₃, ..., w_d)
• The big (modeling) question: p(W) = ?
• Well, we know (Bayes/chain rule →):
p(W) = p(w₁,w₂,w₃,...,w_d) = p(w₁) · p(w₂|w₁) · p(w₃|w₁,w₂) · ... · p(w_d|w₁,w₂,...,w_{d−1})
• Not practical (even a short W → too many parameters)

48. Markov Chain
• Unlimited memory (cf. previous foil):
– for w_i, we know all its predecessors w₁, w₂, w₃, ..., w_{i−1}
• Limited memory:
– we disregard predecessors that are "too old"
– remember only the k previous words: w_{i−k}, w_{i−k+1}, ..., w_{i−1}
– called a "k-th order Markov approximation"
• + stationary character (no change over time):
p(W) ≈ Π_{i=1..d} p(w_i|w_{i−k},w_{i−k+1},...,w_{i−1}), d = |W|

49. n-gram Language Models
• (n−1)-th order Markov approximation → n-gram LM:
p(W) =df Π_{i=1..d} p(w_i|w_{i−n+1},w_{i−n+2},...,w_{i−1}) !
(w_i is the prediction; w_{i−n+1},...,w_{i−1} the history)
• In particular (assume vocabulary size |V| = 60k):
• 0-gram LM: uniform model, p(w) = 1/|V|: 1 parameter
• 1-gram LM: unigram model, p(w): 6 · 10⁴ parameters
• 2-gram LM: bigram model, p(w_i|w_{i−1}): 3.6 · 10⁹ parameters
• 3-gram LM: trigram model, p(w_i|w_{i−2},w_{i−1}): 2.16 · 10¹⁴ parameters

50. LM: Observations
• How large an n?
– nothing is enough (theoretically)
– but anyway: as large as possible (→ close to the "perfect" model)
– empirically: 3
• parameter estimation? (reliability, data availability, storage space, ...)
• 4 is too much: |V| = 60k → 1.296 · 10¹⁹ parameters
• but: 6-7 would be (almost) ideal (given enough data): in fact, one can recover the original text sequence from 7-grams!
• Reliability ~ (1 / Detail) (→ a compromise is needed)
• For now, keep word forms (no "linguistic" processing)

51. The Length Issue
• ∀n: Σ_{w ∈ Ωⁿ} p(w) = 1 ⇒ Σ_{n=1..∞} Σ_{w ∈ Ωⁿ} p(w) >> 1 (→ ∞)
• We want to model all sequences of words
– for "fixed"-length tasks: no problem; n is fixed, the sum is 1
• tagging, OCR/handwriting (if words are identified ahead of time)
– for "variable"-length tasks: we have to account for this
• discount shorter sentences
• General model: for each sequence of words of length n, define p'(w) = λ_n p(w) such that
Σ_{n=1..∞} λ_n = 1 ⇒ Σ_{n=1..∞} Σ_{w ∈ Ωⁿ} p'(w) = 1
e.g., estimate λ_n from data, or use a normal or other distribution

52. Parameter Estimation
• Parameter: a numerical value needed to compute p(w|h)
• From data (how else?)
• Data preparation:
• get rid of formatting etc. ("text cleaning")
• define words (separate but include punctuation, call it a "word")
• define sentence boundaries (insert the "words" <s> and </s>)
• letter case: keep, discard, or be smart:
– name recognition
– number-type identification
[these are huge problems per se!]
• numbers: keep, replace by <num>, or be smart (form ~ pronunciation)

53. Maximum Likelihood Estimate
• MLE: relative frequency...
– ... best predicts the data at hand (the "training data")
• Trigrams from training data T:
– count sequences of three words in T: c₃(w_{i−2},w_{i−1},w_i)
[NB: the notation just says that the three words follow each other]
– count sequences of two words in T: c₂(w_{i−1},w_i):
• either use c₂(y,z) = Σ_w c₃(y,z,w)
• or count differently at the beginning (& end) of the data!
p(w_i|w_{i−2},w_{i−1}) =est. c₃(w_{i−2},w_{i−1},w_i) / c₂(w_{i−2},w_{i−1}) !
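
A minimal sketch of the trigram MLE in Python; it assumes the data already carries <s> padding and counts bigrams directly, so counts at the very end of the data may differ from Σ_w c₃(y,z,w):

```python
from collections import Counter

def trigram_mle(tokens):
    """p(w3|w1,w2) = c3(w1,w2,w3) / c2(w1,w2), estimated by relative frequency."""
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))
    c2 = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2, w3): c / c2[(w1, w2)] for (w1, w2, w3), c in c3.items()}

tokens = "<s> <s> He can buy the can of soda .".split()
p3 = trigram_mle(tokens)
print(p3[("He", "can", "buy")])  # 1.0
print(p3[("the", "can", "of")])  # 1.0
```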

54. Character Language Model
• Use individual characters instead of words:
p(W) =df Π_{i=1..d} p(c_i|c_{i−n+1},c_{i−n+2},...,c_{i−1})
• Same formulas etc.
• Might consider 4-grams, 5-grams or even more
• Good only for language comparison
• Transforming cross-entropy between letter- and word-based models:
H_S(p_c) = H_S(p_w) / (avg. number of characters per word in S)

55. LM: an Example
• Training data: <s> <s> He can buy the can of soda.
– Unigram: p₁(He) = p₁(buy) = p₁(the) = p₁(of) = p₁(soda) = p₁(.) = .125; p₁(can) = .25
– Bigram: p₂(He|<s>) = 1, p₂(can|He) = 1, p₂(buy|can) = .5, p₂(of|can) = .5, p₂(the|buy) = 1, ...
– Trigram: p₃(He|<s>,<s>) = 1, p₃(can|<s>,He) = 1, p₃(buy|He,can) = 1, p₃(of|the,can) = 1, ..., p₃(.|of,soda) = 1.
– Entropy: H(p₁) = 2.75, H(p₂) = .25, H(p₃) = 0 → Great?!

56. LM: an Example (The Problem)
• Cross-entropy:
• S = <s> <s> It was the greatest buy of all.
• Even H_S(p₁) fails (= H_S(p₂) = H_S(p₃) = ∞), because:
– all unigrams but p₁(the), p₁(buy), p₁(of) and p₁(.) are 0;
– all bigram probabilities are 0;
– all trigram probabilities are 0.
• We want: to make all (theoretically possible*) probabilities non-zero.
* in fact, all of them: remember our graph from day 1?

57. LM Smoothing (and the EM Algorithm)

58. The Zero Problem
• "Raw" n-gram language model estimates:
– necessarily, some zeros
• many!: a trigram model → 2.16 · 10¹⁴ parameters, data ~ 10⁹ words
– which of them are true 0?
• optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from the other trigrams
• the optimal situation cannot happen, unfortunately (open question: how much data would we need?)
– → we don't know, so we must eliminate the zeros
• Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!

59. Why Do We Need Nonzero Probabilities?
• To avoid infinite cross-entropy:
– happens when an event is found in the test data which has not been seen in the training data: H(p) = ∞ then prevents comparing data with > 0 such "errors"
• To make the system more robust:
– low-count estimates:
• they typically happen for "detailed" but relatively rare appearances
– high-count estimates: reliable but less "detailed"

60. Eliminating the Zero Probabilities: Smoothing
• Get a new p'(w) (same Ω): almost p(w), but with no zeros
• Discount w for (some) p(w) > 0: new p'(w) < p(w)
Σ_{w ∈ discounted} (p(w) − p'(w)) = D
• Distribute D to all w with p(w) = 0: new p'(w) > p(w)
– possibly also to other w with low p(w)
• For some w (possibly): p'(w) = p(w)
• Make sure Σ_{w ∈ Ω} p'(w) = 1
• There are many ways of smoothing

61. Smoothing by Adding 1
• The simplest, but not really usable:
– Predicting words w from a vocabulary V, training data T:
p'(w|h) = (c(h,w) + 1) / (c(h) + |V|)
• for non-conditional distributions: p'(w) = (c(w) + 1) / (|T| + |V|)
– Problem if |V| > c(h) (as is often the case; even >> c(h)!)
• Example: training data: <s> what is it what is small ?  (|T| = 8)
• V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
• p(it) = .125, p(what) = .25, p(.) = 0;
p(what is it?) = .25² · .125² ≈ .001; p(it is flying.) = .125 · .25 · 0² = 0
• p'(it) = .1, p'(what) = .15, p'(.) = .05;
p'(what is it?) = .15² · .1² ≈ .0002; p'(it is flying.) = .1 · .15 · .05² ≈ .00004

62. Adding Less than 1
• Equally simple:
– Predicting words w from a vocabulary V, training data T:
p'(w|h) = (c(h,w) + λ) / (c(h) + λ|V|), λ > 0
• for non-conditional distributions: p'(w) = (c(w) + λ) / (|T| + λ|V|)
• Example: training data: <s> what is it what is small ?  (|T| = 8)
• V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
• p(it) = .125, p(what) = .25, p(.) = 0;
p(what is it?) = .25² · .125² ≈ .001; p(it is flying.) = .125 · .25 · 0² = 0
• Use λ = .1:
• p'(it) ≈ .12, p'(what) ≈ .23, p'(.) ≈ .01;
p'(what is it?) = .23² · .12² ≈ .0007; p'(it is flying.) = .12 · .23 · .01² ≈ .000003
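
Both additive schemes in one sketch (λ = 1 gives add-one); this is the unconditional variant from the slides, with the slides' toy vocabulary:

```python
from collections import Counter

def add_lambda(tokens, lam, vocab):
    """p'(w) = (c(w) + lam) / (|T| + lam * |V|)."""
    c = Counter(tokens)
    denom = len(tokens) + lam * len(vocab)
    return {w: (c[w] + lam) / denom for w in vocab}

tokens = "<s> what is it what is small ?".split()
vocab = set(tokens) | {"flying", "birds", "are", "a", "bird", "."}  # |V| = 12

p_add1 = add_lambda(tokens, 1.0, vocab)
print(p_add1["it"], p_add1["what"], p_add1["."])        # 0.1 0.15 0.05

p_add01 = add_lambda(tokens, 0.1, vocab)
print(round(p_add01["it"], 2), round(p_add01["."], 3))  # 0.12 0.011
```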

63. Good-Turing
• Suitable for estimation from large data
– similar idea: discount/boost the relative frequency estimate:
p_r(w) = (c(w) + 1) · N(c(w) + 1) / (|T| · N(c(w)))
where N(c) is the count of words with count c (count-of-counts)
– specifically, for c(w) = 0 (unseen words): p_r(w) = N(1) / (|T| · N(0))
– good for small counts (< 5-10, where N(c) is high)
– variants (see MS)
– normalization! (so that we have Σ_w p'(w) = 1)

64. Good-Turing: An Example
• Remember: p_r(w) = (c(w) + 1) · N(c(w) + 1) / (|T| · N(c(w)))
• Training data: <s> what is it what is small ?  (|T| = 8)
• V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
• p(it) = .125, p(what) = .25, p(.) = 0;
p(what is it?) = .25² · .125² ≈ .001; p(it is flying.) = .125 · .25 · 0² = 0
• Raw reestimation (N(0) = 6, N(1) = 4, N(2) = 2, N(i) = 0 for i > 2):
p_r(it) = (1+1) · N(2) / (8 · N(1)) = 2 · 2 / (8 · 4) = .125
p_r(what) = (2+1) · N(3) / (8 · N(2)) = 3 · 0 / (8 · 2) = 0 → keep the original p(what)
p_r(.) = (0+1) · N(1) / (8 · N(0)) = 1 · 4 / (8 · 6) ≈ .083
• Normalize (divide by 1.5 = Σ_{w ∈ V} p_r(w)) and compute:
p'(it) ≈ .08, p'(what) ≈ .17, p'(.) ≈ .06;
p'(what is it?) = .17² · .08² ≈ .0002; p'(it is flying.) = .08 · .17 · .06² ≈ .00004
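
A sketch of the reestimation above, keeping the raw relative frequency where N(c+1) = 0 (the "keep original" case) and renormalizing over V:

```python
from collections import Counter

def good_turing(tokens, vocab):
    """p_r(w) = (c+1) * N(c+1) / (|T| * N(c)); fall back to c/|T| if N(c+1) = 0."""
    c = Counter(tokens)
    n = Counter(c[w] for w in vocab)   # count-of-counts, N(0) included
    t = len(tokens)

    def reest(w):
        if n[c[w] + 1] == 0:
            return c[w] / t            # keep the original estimate (e.g. "what")
        return (c[w] + 1) * n[c[w] + 1] / (t * n[c[w]])

    raw = {w: reest(w) for w in vocab}
    z = sum(raw.values())              # 1.5 in this example
    return {w: pw / z for w, pw in raw.items()}

tokens = "<s> what is it what is small ?".split()
vocab = set(tokens) | {"flying", "birds", "are", "a", "bird", "."}
p = good_turing(tokens, vocab)
print(round(p["it"], 2), round(p["what"], 2), round(p["."], 2))  # 0.08 0.17 0.06
```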

65. Smoothing by Combination: Linear Interpolation
• Combine what?
• distributions of various levels of detail vs. reliability
• n-gram models:
• use the (n−1)-gram, (n−2)-gram, ..., uniform distributions
(detail decreases, reliability increases)
• The simplest possible combination:
– sum the probabilities, normalize:
• p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:
• p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3

66. Typical n-gram LM Smoothing
• Weigh in less detailed distributions using λ = (λ₀, λ₁, λ₂, λ₃):
p'_λ(w_i|w_{i−2},w_{i−1}) = λ₃ p₃(w_i|w_{i−2},w_{i−1}) + λ₂ p₂(w_i|w_{i−1}) + λ₁ p₁(w_i) + λ₀/|V|
• Normalize: λ_i ≥ 0, Σ_{i=0..n} λ_i = 1 is sufficient (λ₀ = 1 − Σ_{i=1..n} λ_i) (n = 3)
• Estimation using MLE:
– fix the p₃, p₂, p₁ and |V| parameters as estimated from the training data
– then find the {λ_i} which minimizes the cross-entropy (i.e. maximizes the probability of the data): −(1/|D|) Σ_{i=1..|D|} log₂(p'_λ(w_i|h_i))

67. Held-out Data
• What data to use?
– try the training data T: but we will always get λ₃ = 1
• why? (let p_{iT} be an i-gram distribution estimated by relative frequency from T)
• minimizing H_T(p'_λ) over the vector λ, where p'_λ = λ₃ p₃T + λ₂ p₂T + λ₁ p₁T + λ₀/|V|
– remember: H_T(p'_λ) = H(p₃T) + D(p₃T || p'_λ);
• p₃T is fixed → H(p₃T) is fixed (and the best achievable)
– which p'_λ minimizes H_T(p'_λ)? ... a p'_λ for which D(p₃T || p'_λ) = 0
– ... and that is p₃T itself (because D(p||p) = 0, as we know)
– ... and certainly p'_λ = p₃T if λ₃ = 1 (maybe in some other cases, too):
p'_λ = 1 · p₃T + 0 · p₂T + 0 · p₁T + 0/|V|
– thus: do not use the training data for the estimation of λ!
• we must hold out part of the training data (the heldout data, H):
• ... call the remaining data the (true/raw) training data, T
• the test data S (e.g., for comparison purposes): still different data!

68. The Formulas
• Repeat: minimizing −(1/|H|) Σ_{i=1..|H|} log₂(p'_λ(w_i|h_i)) over λ:
p'_λ(w_i|h_i) = p'_λ(w_i|w_{i−2},w_{i−1}) = λ₃ p₃(w_i|w_{i−2},w_{i−1}) + λ₂ p₂(w_i|w_{i−1}) + λ₁ p₁(w_i) + λ₀/|V| !
• "Expected counts (of the lambdas)": for j = 0..3:
c(λ_j) = Σ_{i=1..|H|} (λ_j p_j(w_i|h_i) / p'_λ(w_i|h_i)) !
• "Next λ": for j = 0..3:
λ_{j,next} = c(λ_j) / Σ_{k=0..3} c(λ_k) !

69. The (Smoothing) EM Algorithm
1. Start with some λ such that λ_j > 0 for all j ∈ 0..3.
2. Compute the "expected counts" for each λ_j.
3. Compute a new set of λ_j, using the "next λ" formula.
4. Start over at step 2, unless a termination condition is met.
• Termination condition: convergence of λ.
– Simply set an ε, and finish if |λ_j − λ_{j,next}| < ε for each j (step 3).
• Guaranteed to converge: follows from Jensen's inequality, plus a technical proof.

70. Remark on Linear Interpolation Smoothing
• "Bucketed" smoothing:
– use several vectors of λ instead of one, based on (the frequency of) the history: λ(h)
• e.g. for h = (micrograms, per) we will have λ(h) = (.999, .0009, .00009, .00001)
(because "cubic" is the only word to follow...)
– actually: not a separate set for each history, but rather one set for "similar" histories (a "bucket"): λ(b(h)), where b: V² → N (in the case of trigrams)
b classifies histories according to their reliability (~ frequency)

71. Bucketed Smoothing: The Algorithm
• First, determine the bucketing function b (use the heldout data!):
– decide in advance that you want e.g. 1000 buckets
– compute the total frequency of histories in one bucket (f_max(b))
– gradually fill your buckets, from the most frequent bigrams, so that the sum of frequencies does not exceed f_max(b) (you might end up with slightly more than 1000 buckets)
• Divide your heldout data according to the buckets
• Apply the previous algorithm to each bucket and its data

72. Simple Example
• Raw distribution (unigram only; smooth with uniform):
p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c,...,r}, = 0 for the rest: s, t, u, v, w, x, y, z
• Heldout data: baby; use one set of λ (λ₁: unigram, λ₀: uniform)
• Start with λ₁ = λ₀ = .5:
p'_λ(b) = .5 × .5 + .5/26 = .27
p'_λ(a) = .5 × .25 + .5/26 = .14
p'_λ(y) = .5 × 0 + .5/26 = .02
c(λ₁) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 = 2.72
c(λ₀) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 = 1.28  (.04 ≈ 1/26)
• Normalize: λ₁,next = .68, λ₀,next = .32.
• Repeat from step 2 (recompute p'_λ first for efficient computation, then c(λ_i), ...).
Finish when the new lambdas are almost equal to the old ones (say, difference < 0.01).
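
A compact sketch of the EM loop on exactly this example (unigram plus uniform over the 26 letters, heldout data baby); the first iteration reproduces the expected counts and lambdas above:

```python
p1 = {"a": 0.25, "b": 0.5, **{ch: 1 / 64 for ch in "cdefghijklmnopqr"}}
heldout = "baby"
dists = [lambda w: 1 / 26, lambda w: p1.get(w, 0.0)]   # [uniform, unigram]
lams = [0.5, 0.5]                                      # [lambda_0, lambda_1]

def em_step(lams):
    p_int = {w: sum(l * d(w) for l, d in zip(lams, dists)) for w in set(heldout)}
    c = [sum(l * d(w) / p_int[w] for w in heldout) for l, d in zip(lams, dists)]
    return [cj / sum(c) for cj in c], c               # next lambdas, expected counts

lams, c = em_step(lams)
print([round(x, 2) for x in c])      # expected counts: [1.28, 2.72]
print([round(x, 2) for x in lams])   # next lambdas:    [0.32, 0.68]
for _ in range(20):                  # iterate until (near) convergence
    lams, _ = em_step(lams)
print([round(x, 3) for x in lams])
```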

73. Some More Technical Hints
• Set V = {all words from the training data}.
• You may also consider V = T ∪ H, but it does not make the coding in any way simpler (in fact, it is harder).
• But: you must never use the test data for your vocabulary!
• Prepend two "words" in front of all data:
• avoids beginning-of-data problems
• call these indices −1 and 0: then the formulas hold exactly
• When c_n(h,w) = 0:
• assign 0 probability to p_n(w|h) where c_{n−1}(h) > 0, but a uniform probability (1/|V|) to those p_n(w|h) where c_{n−1}(h) = 0
[this must be done both when working on the heldout data during EM and when computing the cross-entropy on the test data!]

74. Words and the Company They Keep

75. Motivation
• Environment:
– mostly "not a full analysis (sentence/text parsing)"
• Tasks where "words & company" are important:
– word sense disambiguation (MT, IR, TD, IE)
– lexical entries: subdivision & definitions (lexicography)
– language modeling (generalization, [a kind of] smoothing)
– word/phrase/term translation (MT, multilingual IR)
– NL generation ("natural" phrases) (generation, MT)
– parsing (lexically-based selectional preferences)

76. Collocations
• Collocation
– Firth: "a word is characterized by the company it keeps"; collocations of a given word are statements of the habitual or customary places of that word.
– non-compositionality of meaning
• cannot be derived directly from its parts (heavy rain)
– non-substitutability in context
• for parts (red light)
– non-modifiability (& non-transformability)
• kick the yellow bucket; take exceptions to

77. Association and Co-occurrence; Terms
• These do not fall under "collocation", but:
• are interesting just because they often [or rarely] appear together or in the same (or similar) context:
• (doctors, nurses)
• (hardware, software)
• (gas, fuel)
• (hammer, nail)
• (communism, free speech)
• Terms:
– need not be > 1 word (notebook, washer)

78. Collocations of Special Interest
• Idioms: really fixed phrases
• kick the bucket, birds-of-a-feather, run for office
• Proper names: difficult to recognize even with lists
• Tuesday (person's name), May, Winston Churchill, IBM, Inc.
• Numerical expressions
– containing "ordinary" words
• Monday Oct 04 1999, two thousand seven hundred fifty
• Phrasal verbs
– separable parts:
• look up, take off

79. Further Notions
• Synonymy: different form/word, same meaning:
• notebook / laptop
• Antonymy: opposite meaning:
• new/old, black/white, start/stop
• Homonymy: same form/word, different meaning:
• "true" (random, unrelated): can (aux. verb / can of Coke)
• related: polysemy; notebook, shift, grade, ...
• Others:
• Hyperonymy/Hyponymy: general vs. special: vehicle/car
• Meronymy/Holonymy: whole vs. part: body/leg

80. How to Find Collocations?
• Frequency
– plain
– filtered
• Hypothesis testing
– t test
– χ² test
• Pointwise ("poor man's") Mutual Information
• (Average) Mutual Information
