
Introduction to Natural Language Processing I [Statistické metody zpracování přirozených jazyků I] (NPFL067) http://ufal.mff.cuni.cz/courses/npfl067 prof. RNDr. Jan Hajič, Dr. / doc. RNDr. Pavel Pecina, Ph.D. ÚFAL MFF UK


1. Joint and Conditional Probability
• p(A,B) = p(A ∩ B)
• p(A|B) = p(A,B) / p(B)
– Estimating from counts:
• p(A|B) = p(A,B) / p(B) = (c(A ∩ B) / T) / (c(B) / T) = c(A ∩ B) / c(B)
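Not from the slides: a minimal Python sketch of the count-based estimate above, on a hypothetical toy sample of (A, B) outcome pairs. The two T's cancel, leaving c(A ∩ B) / c(B).

from collections import Counter

# hypothetical toy data: T observed pairs of (A-outcome, B-outcome)
observations = [("rain", "clouds"), ("rain", "clouds"), ("sun", "clouds"),
                ("sun", "clear"), ("sun", "clear"), ("rain", "clouds")]

T = len(observations)
c_joint = Counter(observations)            # c(A ∩ B)
c_B = Counter(b for _, b in observations)  # c(B)

def p_joint(a, b):
    return c_joint[(a, b)] / T             # p(A,B) = c(A ∩ B) / T

def p_cond(a, b):
    return c_joint[(a, b)] / c_B[b]        # p(A|B) = c(A ∩ B) / c(B)

print(p_joint("rain", "clouds"))   # 0.5
print(p_cond("rain", "clouds"))    # 0.75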

2. Bayes Rule
• p(A,B) = p(B,A), since p(A ∩ B) = p(B ∩ A)
– therefore: p(A|B) p(B) = p(B|A) p(A), and therefore
  p(A|B) = p(B|A) p(A) / p(B) !

3. Independence
• Can we compute p(A,B) from p(A) and p(B)?
• Recall from the previous foil:
  p(A|B) = p(B|A) p(A) / p(B)
  p(A|B) p(B) = p(B|A) p(A)
  p(A,B) = p(B|A) p(A) ... we're almost there: how does p(B|A) relate to p(B)?
– p(B|A) = p(B) iff A and B are independent
• Examples: two coin tosses; the weather today and the weather on March 4th, 1789
• Any two events for which p(B|A) = p(B)!

4. Chain Rule
  p(A_1, A_2, A_3, A_4, ..., A_n) =
  p(A_1|A_2,A_3,A_4,...,A_n) × p(A_2|A_3,A_4,...,A_n) × p(A_3|A_4,...,A_n) × ... × p(A_{n-1}|A_n) × p(A_n) !
• this is a direct consequence of the Bayes rule.

5. The Golden Rule (of Classic Statistical NLP)
• Interested in an event A given B (when it is not easy, practical, or desirable to estimate p(A|B) directly):
• take the Bayes rule and maximize over all A:
  argmax_A p(A|B) = argmax_A p(B|A) p(A) / p(B) = argmax_A p(B|A) p(A) !
• ... since p(B) is constant when A changes

6. Random Variable
• is a function X: Ω → Q
– in general Q = R^n, typically Q = R
– it is easier to handle real numbers than real-world events
• a random variable is discrete if Q is countable (hence also if it is finite)
• Examples: die: natural "numbering" [1,6]; coin: {0,1}
• Probability distribution:
– p_X(x) = p(X=x) =df p(A_x), where A_x = {a ∈ Ω: X(a) = x}
– often just p(x), if it is clear from context what X is

7. Expectation; Joint and Conditional Distributions
• Expectation is the mean of a random variable (a weighted average):
– E(X) = Σ_{x∈X(Ω)} x · p_X(x)
• Example: one six-sided die: 3.5; sum of two dice: 7
• Joint and conditional distribution rules:
– analogous to probabilities of events
• Bayes: p_{X|Y}(x,y) =notation p_{XY}(x|y) =even simpler notation p(x|y) = p(y|x) · p(x) / p(y)
• Chain rule: p(w,x,y,z) = p(z) · p(y|z) · p(x|y,z) · p(w|x,y,z)

8. Standard Distributions
• Binomial (discrete)
– outcome: 0 or 1 (thus: binomial)
– make n trials
– interested in the (probability of the) number of successes r
• Must be careful: it's not uniform!
• p_b(r|n) = (n choose r) / 2^n (for equally likely outcomes)
• (n choose r) counts how many possibilities there are for choosing r objects out of n: (n choose r) = n! / ((n-r)! r!)

9. Continuous Distributions
• The normal distribution ("Gaussian")
• p_norm(x|μ,σ) = e^(-(x-μ)² / (2σ²)) / (σ √(2π))
• where:
– μ is the mean (the x-coordinate of the peak; 0 in the pictured example)
– σ is the standard deviation (1 in the pictured example)
• others: hyperbolic, t

  10. Essential Information Theory

11. The Notion of Entropy
• Entropy ~ "chaos", fuzziness, opposite of order, ...
– you know it: it is much easier to create a "mess" than to tidy things up...
• Comes from physics:
– entropy does not go down unless energy is applied
• Measure of uncertainty:
– if low ... low uncertainty; the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

12. The Formula
• Let p_X(x) be the distribution of a random variable X, with Ω the set of basic outcomes (the alphabet)
  H(X) = -Σ_{x∈Ω} p(x) log2 p(x) !
• Unit: bits (with the natural logarithm: nats)
• Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)

13. Using the Formula: Example
• Toss a fair coin: Ω = {head, tail}
– p(head) = .5, p(tail) = .5
– H(p) = -0.5 log2(0.5) + (-0.5 log2(0.5)) = 2 × ((-0.5) × (-1)) = 2 × 0.5 = 1
• Take a fair, 32-sided die: p(x) = 1/32 for every side x
– H(p) = -Σ_{i=1..32} p(x_i) log2 p(x_i) = -32 p(x_1) log2 p(x_1) (since p(x_i) = p(x_1) = 1/32 for all i) = -32 × ((1/32) × (-5)) = 5 (now you see why it's called bits?)
• Unfair coin:
– p(head) = .2 ... H(p) = .722; p(head) = .01 ... H(p) = .081
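Not from the slides: a small Python sketch of the entropy formula, reproducing the numbers above (fair coin, fair 32-sided die, unfair coins).

import math

def entropy(probs):
    # H(p) = -sum_x p(x) log2 p(x), with the convention 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))              # fair coin: 1.0
print(entropy([1/32] * 32))             # fair 32-sided die: 5.0
print(round(entropy([0.2, 0.8]), 3))    # unfair coin: 0.722
print(round(entropy([0.01, 0.99]), 3))  # very unfair coin: 0.081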

14. Example: Book Availability
[figure: binary entropy H(p) plotted against p(Book Available): 0 at p = 0 and p = 1, maximal (1 bit) at p = 0.5; a "good bookstore" and a "bad bookstore" sit near the two low-entropy extremes]

15. The Limits
• When is H(p) = 0?
– if the result of an experiment is known ahead of time:
– necessarily: ∃x ∈ Ω: p(x) = 1, and ∀y ∈ Ω, y ≠ x: p(y) = 0
• Upper bound?
– none in general
– for |Ω| = n: H(p) ≤ log2 n
• nothing can be more uncertain than the uniform distribution

16. Entropy and Expectation
• Recall:
– E(X) = Σ_{x∈X(Ω)} x · p_X(x)
• Then:
  E(log2(1/p_X(x))) = Σ_{x∈X(Ω)} p_X(x) log2(1/p_X(x)) =
  = -Σ_{x∈X(Ω)} p_X(x) log2 p_X(x) =
  = H(p_X) =notation H(p)

17. Perplexity: Motivation
• Recall:
– 2 equiprobable outcomes: H(p) = 1 bit
– 32 equiprobable outcomes: H(p) = 5 bits
– 4.3 billion equiprobable outcomes: H(p) ≈ 32 bits
• What if the outcomes are not equiprobable?
– 32 outcomes, 2 equiprobable at .5, the rest impossible:
• H(p) = 1 bit
– Is there any measure for comparing the entropy (i.e. the uncertainty/difficulty of prediction) (also) across random variables with different numbers of outcomes?

18. Perplexity
• Perplexity:
– G(p) = 2^H(p)
• ... so we are back at 32 (for 32 equiprobable outcomes), 2 for a fair coin, etc.
• it is easier to imagine:
– NLP example: the size of a vocabulary with uniform distribution that is equally hard to predict
• the "wilder" (more biased) the distribution, the better:
– lower entropy, lower perplexity

19. Joint Entropy and Conditional Entropy
• Two random variables: X (sample space Ω), Y (sample space Ψ)
• Joint entropy:
– no big deal ((X,Y) considered a single event):
  H(X,Y) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(x,y)
• Conditional entropy:
  H(Y|X) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(y|x)
  recall that H(X) = E(log2(1/p_X(x))) (a weighted "average", and the weights are not conditional)

20. Conditional Entropy (Using the Calculus)
• other definition:
  H(Y|X) = Σ_{x∈Ω} p(x) H(Y|X=x) =
  (for H(Y|X=x) we can use the single-variable definition, with x ~ constant)
  = Σ_{x∈Ω} p(x) ( -Σ_{y∈Ψ} p(y|x) log2 p(y|x) ) =
  = -Σ_{x∈Ω} Σ_{y∈Ψ} p(y|x) p(x) log2 p(y|x) =
  = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(y|x)
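Not from the slides: a short Python sketch that computes joint and conditional entropy directly from a hypothetical joint table p(x,y), and checks the chain rule H(X,Y) = H(Y|X) + H(X) stated on the next foil.

import math

# hypothetical joint distribution p(x,y); the values sum to 1
p_xy = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25,
        ("x2", "y1"): 0.40, ("x2", "y2"): 0.10}

def entropy(values):
    return -sum(v * math.log2(v) for v in values if v > 0)

def conditional_entropy(p):
    # H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x), with p(y|x) = p(x,y) / p(x)
    p_x = {}
    for (x, _), v in p.items():
        p_x[x] = p_x.get(x, 0.0) + v
    return -sum(v * math.log2(v / p_x[x]) for (x, _), v in p.items() if v > 0)

h_xy = entropy(p_xy.values())            # H(X,Y)
h_y_given_x = conditional_entropy(p_xy)  # H(Y|X)
h_x = entropy([0.5, 0.5])                # H(X), the marginal of this table
print(h_xy, h_y_given_x + h_x)           # equal: the chain rule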

21. Properties of Entropy I
• Entropy is non-negative:
– H(X) ≥ 0
– proof: (recall: H(X) = -Σ_{x∈Ω} p(x) log2 p(x))
• log(p(x)) is negative or zero for p(x) ≤ 1,
• p(x) is non-negative; their product p(x) log(p(x)) is thus non-positive;
• a sum of non-positive numbers is non-positive;
• and -f is non-negative for non-positive f
• Chain rule:
– H(X,Y) = H(Y|X) + H(X), as well as
– H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y))

22. Properties of Entropy II
• Conditional entropy is better (than unconditional):
– H(Y|X) ≤ H(Y) (proof on Monday)
• H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
• equality iff X, Y are independent
• [recall: X, Y independent iff p(X,Y) = p(X)p(Y)]
• H(p) is concave (remember the book availability graph?)
– a concave function f over an interval (a,b): ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1-λ)y) ≥ λf(x) + (1-λ)f(y)
• a function f is convex if -f is concave
• [for proofs and generalizations, see Cover & Thomas]

23. "Coding" Interpretation of Entropy
• The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p: = H(p)
• Remember the various compression algorithms?
– they do well on data with repeating (= easily predictable = low-entropy) patterns
– their results, though, have high entropy ⇒ compressing compressed data does nothing

24. Coding: Example
• How many bits do we need for ISO Latin 1?
– the trivial answer: 8
• Experience: some characters are more common, some (very) rare:
• ... so what if we use more bits for the rare ones and fewer bits for the frequent ones? [be careful: we want to decode (easily)!]
• suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ .0004
• code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, rest: 11 b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8
• code acbbécbaac: 00 10 01 01 11 00001111 10 01 00 00 10
  (a = 00, c = 10, b = 01, b = 01, é = 11 00001111, c = 10, b = 01, a = 00, a = 00, c = 10)
• number of bits used: 28 (vs. 80 using the "naive" 8-bit coding)
• code length ~ 1 / probability; conditional probabilities OK too!

25. Entropy of a Language
• Imagine that we produce the next letter using p(l_{n+1}|l_1,...,l_n), where l_1,...,l_n is the sequence of all the letters which have been uttered so far (i.e. n is really big!); let's call l_1,...,l_n the history h (h_{n+1}), and the set of all histories H:
• Then compute its entropy:
– -Σ_{h∈H} Σ_{l∈A} p(l,h) log2 p(l|h)   (A ... the alphabet)
• Not very practical, is it?

26. Kullback-Leibler Distance (Relative Entropy)
• Remember:
– in a long series of experiments ... c_i/T_i oscillates around some number ... we can only estimate it ... to get a distribution q.
• So we get a distribution q (sample space Ω, r.v. X); the true distribution is, however, p (same Ω, X) ⇒ how big an error are we making?
• D(p||q) (the Kullback-Leibler distance):
  D(p||q) = Σ_{x∈Ω} p(x) log2(p(x)/q(x)) = E_p log2(p(x)/q(x))

27. Comments on Relative Entropy
• Conventions:
– 0 log 0 = 0
– p log(p/0) = ∞ (for p > 0)
• Distance? (the less "misleading" name: divergence)
– not quite:
• not symmetric: D(p||q) ≠ D(q||p)
• does not satisfy the triangle inequality
– but it is useful to look at it that way
• H(p) + D(p||q): the number of bits needed for encoding p if q is used
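Not from the slides: a Python sketch of D(p||q) using the conventions above (0 log 0 = 0, p log(p/0) = ∞), on made-up distributions; the last line illustrates the asymmetry.

import math

def kl_divergence(p, q):
    # D(p||q) = sum_x p(x) log2(p(x)/q(x)), over a shared outcome order
    d = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue              # 0 log 0 = 0
        if qx == 0:
            return math.inf       # p log(p/0) = infinity for p > 0
        d += px * math.log2(px / qx)
    return d

p = [0.5, 0.25, 0.25]
q = [1/3, 1/3, 1/3]
print(kl_divergence(p, p))                          # 0.0
print(kl_divergence(p, q), kl_divergence(q, p))     # both > 0, not equal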

28. Mutual Information (MI) in Terms of Relative Entropy
• Random variables X, Y; p_{X,Y}(x,y), p_X(x), p_Y(y)
• Mutual information (between two random variables X, Y):
  I(X,Y) = D(p(x,y) || p(x)p(y))
• I(X,Y) measures how much (our knowledge of) Y contributes (on average) to easing the prediction of X
• or: how much p(x,y) deviates from the (independent) p(x)p(y)

29. Mutual Information: the Formula
• Rewrite the definition:
  [recall: D(r||s) = Σ_{v∈Ω} r(v) log2(r(v)/s(v)); substitute r(v) = p(x,y), s(v) = p(x)p(y); <v> ~ <x,y>]
  I(X,Y) = D(p(x,y) || p(x)p(y)) = Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2(p(x,y)/(p(x)p(y))) !
• Measured in bits (what else? :-)
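Not from the slides: a Python sketch computing I(X,Y) directly from the formula, on a hypothetical joint table; the marginals p(x) and p(y) are obtained by summing the joint.

import math

p_xy = {("x1", "y1"): 0.25, ("x1", "y2"): 0.25,
        ("x2", "y1"): 0.40, ("x2", "y2"): 0.10}

def mutual_information(p):
    p_x, p_y = {}, {}
    for (x, y), v in p.items():
        p_x[x] = p_x.get(x, 0.0) + v
        p_y[y] = p_y.get(y, 0.0) + v
    # I(X,Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )
    return sum(v * math.log2(v / (p_x[x] * p_y[y]))
               for (x, y), v in p.items() if v > 0)

print(mutual_information(p_xy))   # > 0 here; it is 0 iff X and Y are independent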

30. From Mutual Information to Entropy
• by how many bits does the knowledge of Y lower the entropy H(X):
  I(X,Y) = Σ_x Σ_y p(x,y) log2(p(x,y)/(p(y)p(x))) =
    ... use p(x,y)/p(y) = p(x|y)
  = Σ_x Σ_y p(x,y) log2(p(x|y)/p(x)) =
    ... use log(a/b) = log a - log b (a ~ p(x|y), b ~ p(x)), distribute the sums
  = Σ_x Σ_y p(x,y) log2 p(x|y) - Σ_x Σ_y p(x,y) log2 p(x) =
    ... use the definition of H(X|Y) (left term), and Σ_y p(x,y) = p(x) (right term)
  = -H(X|Y) + (-Σ_x p(x) log2 p(x)) =
    ... use the definition of H(X) (right term), swap the terms
  = H(X) - H(X|Y)   ... and by symmetry, = H(Y) - H(Y|X)

31. Properties of MI vs. Entropy
• I(X,Y) = H(X) - H(X|Y) = the number of bits by which the knowledge of Y lowers the entropy of X
         = H(Y) - H(Y|X) (previous foil, symmetry)
• Recall: H(X,Y) = H(X|Y) + H(Y) ⇒ -H(X|Y) = H(Y) - H(X,Y) ⇒
• I(X,Y) = H(X) + H(Y) - H(X,Y)
• I(X,X) = H(X) (since H(X|X) = 0)
• I(X,Y) = I(Y,X) (just for completeness)
• I(X,Y) ≥ 0 ... let's prove that now (as promised).

32. Jensen's Inequality
• Recall: f is convex on an interval (a,b) iff ∀x,y ∈ (a,b), ∀λ ∈ [0,1]: f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y)
• Jensen's inequality: for a distribution p(x), an r.v. X on Ω, and a convex f:
  f(Σ_{x∈Ω} p(x) x) ≤ Σ_{x∈Ω} p(x) f(x)
• Proof (idea): by induction on the number of basic outcomes; start with |Ω| = 2 by:
• p(x_1)f(x_1) + p(x_2)f(x_2) ≥ f(p(x_1)x_1 + p(x_2)x_2) (the definition of convexity)
• for the induction step (|Ω| = k → k+1), just use the induction hypothesis and the definition of convexity (again).

33. Information Inequality
  D(p||q) ≥ 0 !
• Proof:
  0 = -log 1 = -log Σ_{x∈Ω} q(x) = -log Σ_{x∈Ω} (q(x)/p(x)) p(x)
    ... apply Jensen's inequality here (-log is convex) ...
  ≤ Σ_{x∈Ω} p(x) (-log(q(x)/p(x))) = Σ_{x∈Ω} p(x) log(p(x)/q(x)) = D(p||q)

34. Other (In)Equalities and Facts
• Log sum inequality: for r_i, s_i ≥ 0:
  Σ_{i=1..n} r_i log(r_i/s_i) ≥ (Σ_{i=1..n} r_i) log(Σ_{i=1..n} r_i / Σ_{i=1..n} s_i)
• D(p||q) is convex [in p,q] (← the log sum inequality)
• H(p_X) ≤ log2 |Ω|, where Ω is the sample space of p_X
  Proof: take the uniform u(x) on the same sample space Ω: Σ p(x) log u(x) = -log2 |Ω|; then
  log2 |Ω| - H(X) = -Σ p(x) log u(x) + Σ p(x) log p(x) = D(p||u) ≥ 0
• H(p) is concave [in p]:
  Proof: from H(X) = log2 |Ω| - D(p||u); D(p||u) is convex ⇒ H(X) is concave

35. Cross-Entropy
• Typical case: we've got a series of observations T = {t_1, t_2, t_3, t_4, ..., t_n} (numbers, words, ...; t_i ∈ Ω); estimate (simple):
  ∀y ∈ Ω: p̂(y) = c(y) / |T|, where c(y) =def |{t ∈ T: t = y}|
• ... but the true p is unknown; every sample is too small!
• Natural question: how well do we do using p̂ [instead of p]?
• Idea: simulate the actual p by using a different T' (or rather: by using a different observation we simulate the insufficiency of T vs. some other data (the "random" difference))

36. Cross Entropy: The Formula
• H_p'(p̂) = H(p') + D(p'||p̂)
  H_p'(p̂) = -Σ_{x∈Ω} p'(x) log2 p̂(x) !
• p' is certainly not the true p, but we can consider it the "real world" distribution against which we test p̂
• note on notation (confusing...): p is also used in place of p̂; also written H_T'(p̂)
• (Cross)Perplexity: G_p'(p̂) = G_T'(p̂) = 2^H_p'(p̂)

37. Conditional Cross Entropy
• So far: "unconditional" distribution(s) p(x), p'(x), ...
• In practice: we virtually always condition on context
• Interested in: sample space Ψ, r.v. Y, y ∈ Ψ; context: sample space Ω, r.v. X, x ∈ Ω:
  "our" distribution p(y|x), tested against p'(y,x), which is taken from some independent data:
  H_p'(p) = -Σ_{y∈Ψ,x∈Ω} p'(y,x) log2 p(y|x)

38. Sample Space vs. Data
• In practice, it is often inconvenient to sum over the sample space(s) Ψ, Ω (especially for cross entropy!)
• Use the following formula instead:
  H_p'(p) = -Σ_{y∈Ψ,x∈Ω} p'(y,x) log2 p(y|x) = -1/|T'| Σ_{i=1..|T'|} log2 p(y_i|x_i) !
• This is in fact the normalized log probability of the "test" data:
  H_p'(p) = -1/|T'| log2 Π_{i=1..|T'|} p(y_i|x_i)
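Not from the slides: a Python sketch of cross entropy as the normalized negative log2 probability of test data, in the unconditional case; it reproduces the barb example from the next foil and shows what happens on baby.

import math

def cross_entropy(p, test):
    # H_p'(p) = -1/|T'| * sum_i log2 p(t_i); p maps outcomes to probabilities
    total = 0.0
    for t in test:
        if p.get(t, 0.0) == 0.0:
            return math.inf       # a single unseen event makes H infinite
        total += math.log2(p[t])
    return -total / len(test)

# the distribution from the computation example: p(a)=.25, p(b)=.5, p(c..r)=1/64
p = {"a": 0.25, "b": 0.5}
p.update({ch: 1/64 for ch in "cdefghijklmnopqr"})

print(cross_entropy(p, "barb"))   # 2.5 bits
print(cross_entropy(p, "baby"))   # inf, since p('y') = 0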

39. Computation Example
• Ω = {a, b, ..., z}; probability distribution (assumed/estimated from data):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
• Data (test): barb; p'(a) = p'(r) = .25, p'(b) = .5
• Sum over Ω:
  -Σ_α p'(α) log2 p(α) = .5 (α=a) + .5 (α=b) + 1.5 (α=r) + 0 (all other α) = 2.5
• Sum over the data:
  s_1..s_4 = b, a, r, b; -log2 p(s_i) = 1, 2, 6, 1; sum = 10; (1/|T'|) × 10 = (1/4) × 10 = 2.5

40. Cross Entropy: Some Observations
• H(p) <, =, > H_p'(p): ALL of these are possible!
• Previous example: [p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s,t,u,v,w,x,y,z]
  H(p) = 2.5 bits = H_p'(p) (test data: barb)
• Other data: probable: (1/8)(6+6+6+1+2+1+6+6) = 4.25
  H(p) < 4.25 bits = H_p'(p) (probable)
• And finally: abba: (1/4)(2+1+1+2) = 1.5
  H(p) > 1.5 bits = H_p'(p) (abba)
• But what about baby?
  -p'('y') log2 p('y') = -.25 log2 0 = ∞ (??)

41. Cross Entropy: Usage
• Comparing data?
– NO! (we believe that we test on real data!)
• Rather: comparing distributions (vs. real data)
• Have (got) two distributions: p and q (on some Ω, X)
– which one is better?
– the better one has lower cross-entropy (perplexity) on real data S
• "Real" data: S
  H_S(p) = -1/|S| Σ_{i=1..|S|} log2 p(y_i|x_i)  ??  H_S(q) = -1/|S| Σ_{i=1..|S|} log2 q(y_i|x_i)

42. Comparing Distributions
• Test data S: probable
• p(.) from the previous example: H_S(p) = 4.25
  [p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s,t,u,v,w,x,y,z]
• q(.|.) (conditional; defined by a table, rows = predicted symbol y, columns = context x); e.g. q(o|r) = 1, q(r|p) = .125:

  q(y|x)    a     b     e     l     o     p     r     other
  a         0    .5     0     0     0    .125   0     0
  b         1     0     0     0     1    .125   0     0
  e         0     0     0     1     0    .125   0     0
  l         0    .5     0     0     0    .125   0     0
  o         0     0     0     0     0    .125   1     0
  p         0     0     0     0     0    .125   0     1
  r         0     0     0     0     0    .125   0     0
  other     0     0     1     0     0    .125   0     0

  H_S(q) = (1/8)(-log2 q(p|other) - log2 q(r|p) - log2 q(o|r) - log2 q(b|o) - log2 q(a|b) - log2 q(b|a) - log2 q(l|b) - log2 q(e|l))
         = (1/8)(0 + 3 + 0 + 0 + 1 + 0 + 1 + 0) = .625

  43. Language Modeling (and the Noisy Channel)

44. The Noisy Channel
• Prototypical case:
  Input → The channel (adds noise) → Output (noisy)
  0,1,1,1,0,1,0,1,... → 0,1,1,0,0,1,1,0,...
• Model: probability of error (noise):
• Example: p(0|1) = .3, p(1|1) = .7, p(1|0) = .4, p(0|0) = .6
• The task: known: the noisy output; want to know: the input (decoding)

45. Noisy Channel Applications
• OCR
– straightforward: text → print (adds noise), scan → image
• Handwriting recognition
– text → neurons, muscles ("noise"), scan/digitize → image
• Speech recognition (dictation, commands, etc.)
– text → conversion to acoustic signal ("noise") → acoustic waves
• Machine translation
– text in target language → translation ("noise") → source language
• Also: part-of-speech tagging
– sequence of tags → selection of word forms → text

46. Noisy Channel: The Golden Rule of ... OCR, ASR, HR, MT, ...
• Recall: p(A|B) = p(B|A) p(A) / p(B) (Bayes formula)
  A_best = argmax_A p(B|A) p(A) (The Golden Rule)
• p(B|A): the acoustic/image/translation/lexical model
– application-specific name
– will be explored later
• p(A): the language model

47. The Perfect Language Model
• Sequence of word forms [forget about tagging for the moment]
• Notation: A ~ W = (w_1, w_2, w_3, ..., w_d)
• The big (modeling) question: p(W) = ?
• Well, we know (Bayes/chain rule →):
  p(W) = p(w_1,w_2,w_3,...,w_d) = p(w_1) × p(w_2|w_1) × p(w_3|w_1,w_2) × ... × p(w_d|w_1,w_2,...,w_{d-1})
• Not practical (even a short W → too many parameters)

48. Markov Chain
• Unlimited memory (cf. previous foil):
– for w_i, we know all its predecessors w_1,w_2,w_3,...,w_{i-1}
• Limited memory:
– we disregard predecessors that are "too old"
– remember only the k previous words: w_{i-k},w_{i-k+1},...,w_{i-1}
– called a "k-th order Markov approximation"
• + stationary character (no change over time):
  p(W) ≈ Π_{i=1..d} p(w_i|w_{i-k},w_{i-k+1},...,w_{i-1}), d = |W|

49. n-gram Language Models
• (n-1)-th order Markov approximation → n-gram LM:
  p(W) =df Π_{i=1..d} p(w_i|w_{i-n+1},w_{i-n+2},...,w_{i-1}) !   (prediction w_i, history w_{i-n+1},...,w_{i-1})
• In particular (assume vocabulary size |V| = 60k):
• 0-gram LM: uniform model, p(w) = 1/|V|: 1 parameter
• 1-gram LM: unigram model, p(w): 6 × 10^4 parameters
• 2-gram LM: bigram model, p(w_i|w_{i-1}): 3.6 × 10^9 parameters
• 3-gram LM: trigram model, p(w_i|w_{i-2},w_{i-1}): 2.16 × 10^14 parameters

50. LM: Observations
• How large should n be?
– nothing is enough (theoretically)
– but anyway: as much as possible (→ close to the "perfect" model)
– empirically: 3
• parameter estimation? (reliability, data availability, storage space, ...)
• 4 is too much: |V| = 60k → 1.296 × 10^19 parameters
• but: 6-7 would be (almost) ideal (having enough data): in fact, one can recover the original text sequence from its 7-grams!
• Reliability ~ (1 / Detail) (→ need a compromise)
• For now, keep word forms (no "linguistic" processing)

51. The Length Issue
• ∀n: Σ_{w∈Ω^n} p(w) = 1 ⇒ Σ_{n=1..∞} Σ_{w∈Ω^n} p(w) >> 1 (→ ∞)
• We want to model all sequences of words
– for "fixed"-length tasks: no problem: n fixed, the sum is 1
• tagging, OCR/handwriting (if words are identified ahead of time)
– for "variable"-length tasks: we have to account for it
• discount shorter sentences
• General model: for each sequence of words of length n, define p'(w) = λ_n p(w) such that Σ_{n=1..∞} λ_n = 1 ⇒ Σ_{n=1..∞} Σ_{w∈Ω^n} p'(w) = 1
  e.g., estimate λ_n from data, or use a normal or other distribution

52. Parameter Estimation
• Parameter: a numerical value needed to compute p(w|h)
• From data (how else?)
• Data preparation:
• get rid of formatting etc. ("text cleaning")
• define words (separate but include punctuation, call it a "word")
• define sentence boundaries (insert the "words" <s> and </s>)
• letter case: keep, discard, or be smart:
– name recognition
– number type identification
  [these are huge problems per se!]
• numbers: keep, replace by <num>, or be smart (form ~ pronunciation)

53. Maximum Likelihood Estimate
• MLE: relative frequency...
– ... best predicts the data at hand (the "training data")
• Trigrams from training data T:
– count sequences of three words in T: c_3(w_{i-2},w_{i-1},w_i)
  [NB: the notation just says that the three words follow each other]
– count sequences of two words in T: c_2(w_{i-1},w_i):
• either use c_2(y,z) = Σ_w c_3(y,z,w)
• or count differently at the beginning (& end) of the data!
  p(w_i|w_{i-2},w_{i-1}) =est. c_3(w_{i-2},w_{i-1},w_i) / c_2(w_{i-2},w_{i-1}) !
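Not from the slides: a Python sketch of the MLE trigram estimate from counts, with two <s> tokens prepended (as suggested in the technical hints later in the deck). Here c_2 is taken from plain bigram counts, which differs from Σ_w c_3(y,z,w) only at the very end of the data.

from collections import Counter

def train_trigram_mle(tokens):
    tokens = ["<s>", "<s>"] + tokens
    c3 = Counter(zip(tokens, tokens[1:], tokens[2:]))   # c3(w_{i-2}, w_{i-1}, w_i)
    c2 = Counter(zip(tokens, tokens[1:]))               # c2(w_{i-2}, w_{i-1})
    def p(w, h2, h1):
        # p(w | h2, h1) = c3(h2, h1, w) / c2(h2, h1); 0.0 if the history is unseen
        return c3[(h2, h1, w)] / c2[(h2, h1)] if c2[(h2, h1)] else 0.0
    return p

p3 = train_trigram_mle("He can buy the can of soda .".split())
print(p3("He", "<s>", "<s>"))   # 1.0
print(p3("buy", "He", "can"))   # 1.0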

54. Character Language Model
• Use individual characters instead of words:
  p(W) =df Π_{i=1..d} p(c_i|c_{i-n+1},c_{i-n+2},...,c_{i-1})
• Same formulas etc.
• Might consider 4-grams, 5-grams or even more
• Good only for language comparison
• Transform the cross-entropy between letter- and word-based models:
  H_S(p_c) = H_S(p_w) / (avg. number of characters per word in S)

55. LM: an Example
• Training data: <s> <s> He can buy the can of soda.
– Unigram: p_1(He) = p_1(buy) = p_1(the) = p_1(of) = p_1(soda) = p_1(.) = .125; p_1(can) = .25
– Bigram: p_2(He|<s>) = 1, p_2(can|He) = 1, p_2(buy|can) = .5, p_2(of|can) = .5, p_2(the|buy) = 1, ...
– Trigram: p_3(He|<s>,<s>) = 1, p_3(can|<s>,He) = 1, p_3(buy|He,can) = 1, p_3(of|the,can) = 1, ..., p_3(.|of,soda) = 1
– Entropy: H(p_1) = 2.75, H(p_2) = .25, H(p_3) = 0 → Great?!

56. LM: an Example (The Problem)
• Cross-entropy:
• S = <s> <s> It was the greatest buy of all.
• Even H_S(p_1) fails (= H_S(p_2) = H_S(p_3) = ∞), because:
– all unigrams but p_1(the), p_1(buy), p_1(of) and p_1(.) are 0.
– all bigram probabilities are 0.
– all trigram probabilities are 0.
• We want: to make all (theoretically possible*) probabilities non-zero.
  * in fact, all of them: remember our graph from day 1?

  57. LM Smoothing (And the EM Algorithm)

58. The Zero Problem
• "Raw" n-gram language model estimate:
– necessarily, some zeros
• many of them: a trigram model → 2.16 × 10^14 parameters, data ~ 10^9 words
– which of them are true 0?
• optimal situation: even the least frequent trigram would be seen several times, in order to distinguish its probability from that of other trigrams
• this optimal situation cannot happen, unfortunately (open question: how much data would we need?)
– → we don't know, so we must eliminate the zeros
• Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!

59. Why Do We Need Nonzero Probabilities?
• To avoid infinite cross entropy:
– happens when an event is found in the test data which has not been seen in the training data:
  H(p) = ∞ prevents comparing data with ≥ 0 "errors"
• To make the system more robust
– low count estimates:
• they typically happen for "detailed" but relatively rare appearances
– high count estimates: reliable but less "detailed"

60. Eliminating the Zero Probabilities: Smoothing
• Get a new p'(w) (same Ω): almost p(w), but with no zeros
• Discount w for (some) p(w) > 0: new p'(w) < p(w)
  Σ_{w∈discounted} (p(w) - p'(w)) = D
• Distribute D to all w with p(w) = 0: new p'(w) > p(w)
– possibly also to other w with low p(w)
• For some w (possibly): p'(w) = p(w)
• Make sure Σ_{w∈Ω} p'(w) = 1
• There are many ways of smoothing

61. Smoothing by Adding 1
• Simplest but not really usable:
– Predicting words w from a vocabulary V, training data T:
  p'(w|h) = (c(h,w) + 1) / (c(h) + |V|)
• for non-conditional distributions: p'(w) = (c(w) + 1) / (|T| + |V|)
– Problem if |V| > c(h) (as is often the case; often even |V| >> c(h)!)
• Example: training data: <s> what is it what is small ?   |T| = 8
• V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
• p(it) = .125, p(what) = .25, p(.) = 0
  p(what is it?) = .25^2 × .125^2 ≈ .001;  p(it is flying.) = .125 × .25 × 0^2 = 0
• p'(it) = .1, p'(what) = .15, p'(.) = .05
  p'(what is it?) = .15^2 × .1^2 ≈ .0002;  p'(it is flying.) = .1 × .15 × .05^2 ≈ .00004

62. Adding Less than 1
• Equally simple:
– Predicting words w from a vocabulary V, training data T:
  p'(w|h) = (c(h,w) + λ) / (c(h) + λ|V|),  λ < 1
• for non-conditional distributions: p'(w) = (c(w) + λ) / (|T| + λ|V|)
• Example: training data: <s> what is it what is small ?   |T| = 8
• V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
• p(it) = .125, p(what) = .25, p(.) = 0
  p(what is it?) = .25^2 × .125^2 ≈ .001;  p(it is flying.) = .125 × .25 × 0^2 = 0
• Use λ = .1:
• p'(it) ≈ .12, p'(what) ≈ .23, p'(.) ≈ .01
  p'(what is it?) = .23^2 × .12^2 ≈ .0007;  p'(it is flying.) = .12 × .23 × .01^2 ≈ .000003
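Not from the slides: a Python sketch of add-λ smoothing for the non-conditional case, reproducing the add-1 and add-.1 numbers from this foil and the previous one.

from collections import Counter

def add_lambda_unigram(tokens, vocab, lam):
    # p'(w) = (c(w) + lambda) / (|T| + lambda * |V|)
    c = Counter(tokens)
    T, V = len(tokens), len(vocab)
    return {w: (c[w] + lam) / (T + lam * V) for w in vocab}

train = "<s> what is it what is small ?".split()
vocab = ["what", "is", "it", "small", "?", "<s>",
         "flying", "birds", "are", "a", "bird", "."]

p_add1 = add_lambda_unigram(train, vocab, 1.0)     # adding 1
p_add01 = add_lambda_unigram(train, vocab, 0.1)    # adding 0.1

print(round(p_add1["it"], 2), round(p_add1["."], 2))     # 0.1  0.05
print(round(p_add01["it"], 2), round(p_add01["."], 3))   # 0.12 0.011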

63. Good-Turing
• Suitable for estimation from large data
– similar idea: discount/boost the relative frequency estimate:
  p_r(w) = (c(w) + 1) × N(c(w) + 1) / (|T| × N(c(w))),
  where N(c) is the count of words with count c (count-of-counts)
– specifically, for c(w) = 0 (unseen words): p_r(w) = N(1) / (|T| × N(0))
– good for small counts (< 5-10, where N(c) is high)
– variants exist (see MS)
– normalization! (so that we have Σ_w p'(w) = 1)

64. Good-Turing: An Example
• Remember: p_r(w) = (c(w) + 1) × N(c(w) + 1) / (|T| × N(c(w)))
  Training data: <s> what is it what is small ?   |T| = 8
• V = {what, is, it, small, ?, <s>, flying, birds, are, a, bird, .}, |V| = 12
  p(it) = .125, p(what) = .25, p(.) = 0
  p(what is it?) = .25^2 × .125^2 ≈ .001;  p(it is flying.) = .125 × .25 × 0^2 = 0
• Raw re-estimation (N(0) = 6, N(1) = 4, N(2) = 2, N(i) = 0 for i > 2):
  p_r(it) = (1+1) × N(2) / (8 × N(1)) = 2 × 2 / (8 × 4) = .125
  p_r(what) = (2+1) × N(3) / (8 × N(2)) = 3 × 0 / (8 × 2) = 0: keep the original p(what)
  p_r(.) = (0+1) × N(1) / (8 × N(0)) = 1 × 4 / (8 × 6) ≈ .083
• Normalize (divide by 1.5 = Σ_{w∈V} p_r(w)) and compute:
  p'(it) ≈ .08, p'(what) ≈ .17, p'(.) ≈ .06
  p'(what is it?) = .17^2 × .08^2 ≈ .0002;  p'(it is flying.) = .08 × .17 × .06^2 ≈ .00004
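Not from the slides: a Python sketch of the raw Good-Turing re-estimation above, falling back to the relative frequency where N(c+1) = 0 (as done for "what") and renormalizing at the end.

from collections import Counter

def good_turing(tokens, vocab):
    c = Counter(tokens)
    T = len(tokens)
    N = Counter(c[w] for w in vocab)          # count-of-counts, including N(0)
    p_r = {}
    for w in vocab:
        cw = c[w]
        if N[cw + 1] > 0:
            p_r[w] = (cw + 1) * N[cw + 1] / (T * N[cw])
        else:
            p_r[w] = cw / T                   # keep the original relative frequency
    z = sum(p_r.values())                     # 1.5 in this example
    return {w: p / z for w, p in p_r.items()}

train = "<s> what is it what is small ?".split()
vocab = ["what", "is", "it", "small", "?", "<s>",
         "flying", "birds", "are", "a", "bird", "."]
p = good_turing(train, vocab)
print(round(p["it"], 2), round(p["what"], 2), round(p["."], 2))   # 0.08 0.17 0.06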

65. Smoothing by Combination: Linear Interpolation
• Combine what?
• distributions of various levels of detail vs. reliability
• n-gram models:
• use the (n-1)-gram, (n-2)-gram, ..., uniform distributions (detail decreases, reliability increases)
• Simplest possible combination:
– sum of probabilities, normalize:
• p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:
• p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3

66. Typical n-gram LM Smoothing
• Weight in less detailed distributions using λ = (λ_0, λ_1, λ_2, λ_3):
  p'_λ(w_i | w_{i-2},w_{i-1}) = λ_3 p_3(w_i | w_{i-2},w_{i-1}) + λ_2 p_2(w_i | w_{i-1}) + λ_1 p_1(w_i) + λ_0 / |V|
• Normalize: λ_i ≥ 0, Σ_{i=0..n} λ_i = 1 is sufficient (λ_0 = 1 - Σ_{i=1..n} λ_i) (n = 3)
• Estimation using MLE:
– fix the p_3, p_2, p_1 and |V| parameters as estimated from the training data
– then find the {λ_i} which minimize the cross entropy (maximize the probability of the data): -(1/|D|) Σ_{i=1..|D|} log2(p'_λ(w_i|h_i))

67. Held-out Data
• What data to use?
– try the training data T: but then we will always get λ_3 = 1
• why? (let p_iT be an i-gram distribution estimated using relative frequencies from T)
• minimizing H_T(p'_λ) over the vector λ, where p'_λ = λ_3 p_3T + λ_2 p_2T + λ_1 p_1T + λ_0 / |V|
– remember: H_T(p'_λ) = H(p_3T) + D(p_3T || p'_λ)
• (p_3T is fixed → H(p_3T) is fixed, and it is the best achievable)
– which p'_λ minimizes H_T(p'_λ)? ... the p'_λ for which D(p_3T || p'_λ) = 0
– ... and that's p_3T (because D(p||p) = 0, as we know)
– ... and certainly p'_λ = p_3T if λ_3 = 1 (maybe in some other cases, too)
  (p'_λ = 1 × p_3T + 0 × p_2T + 0 × p_1T + 0 / |V|)
– thus: do not use the training data for the estimation of λ!
• we must hold out part of the training data (the heldout data, H):
• ... and call the remaining data the (true/raw) training data, T
• the test data S (e.g., for comparison purposes): still different data!

68. The Formulas
• Repeat: minimizing -(1/|H|) Σ_{i=1..|H|} log2(p'_λ(w_i|h_i)) over λ
  p'_λ(w_i | h_i) = p'_λ(w_i | w_{i-2},w_{i-1}) = λ_3 p_3(w_i | w_{i-2},w_{i-1}) + λ_2 p_2(w_i | w_{i-1}) + λ_1 p_1(w_i) + λ_0 / |V| !
• "Expected counts (of the lambdas)": for j = 0..3
  c(λ_j) = Σ_{i=1..|H|} (λ_j p_j(w_i|h_i) / p'_λ(w_i|h_i)) !
• "Next λ": for j = 0..3
  λ_{j,next} = c(λ_j) / Σ_{k=0..3} c(λ_k) !

69. The (Smoothing) EM Algorithm
1. Start with some λ such that λ_j > 0 for all j ∈ 0..3.
2. Compute the "expected counts" for each λ_j.
3. Compute the new set of λ_j, using the "next λ" formula.
4. Start over at step 2, unless a termination condition is met.
• Termination condition: convergence of λ.
– Simply set an ε, and finish if |λ_j - λ_{j,next}| < ε for each j (in step 3).
• Guaranteed to converge: follows from Jensen's inequality, plus a technical proof.
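Not from the slides (no code is given there): a Python sketch of the EM loop above. Here models is a hypothetical list [p0, p1, p2, p3] of functions mapping (w, h) to a probability, with p0 the uniform 1/|V|, and heldout a list of (w, h) pairs.

def em_lambdas(models, heldout, eps=1e-4):
    n = len(models)
    lambdas = [1.0 / n] * n                   # step 1: all lambdas > 0
    while True:
        # step 2: expected counts c(lambda_j) = sum_i lambda_j p_j(w|h) / p'_lambda(w|h)
        counts = [0.0] * n
        for w, h in heldout:
            p_interp = sum(l * m(w, h) for l, m in zip(lambdas, models))
            for j, (l, m) in enumerate(zip(lambdas, models)):
                counts[j] += l * m(w, h) / p_interp
        # step 3: next lambda = normalized expected counts
        total = sum(counts)
        new_lambdas = [cnt / total for cnt in counts]
        # step 4: terminate on convergence of lambda
        if all(abs(a - b) < eps for a, b in zip(lambdas, new_lambdas)):
            return new_lambdas
        lambdas = new_lambdas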

70. Remark on Linear Interpolation Smoothing
• "Bucketed" smoothing:
– use several vectors of λ instead of one, based on (the frequency of) the history: λ(h)
• e.g. for h = (micrograms, per) we will have λ(h) = (.999, .0009, .00009, .00001) (because "cubic" is the only word to follow...)
– actually: not a separate set for each history, but rather a set for "similar" histories (a "bucket"): λ(b(h)), where b: V^2 → N (in the case of trigrams);
  b classifies histories according to their reliability (~ frequency)

71. Bucketed Smoothing: The Algorithm
• First, determine the bucketing function b (use the heldout data!):
– decide in advance that you want e.g. 1000 buckets
– compute the total frequency of histories in one bucket (f_max(b))
– gradually fill your buckets from the most frequent bigrams so that the sum of frequencies does not exceed f_max(b) (you might end up with slightly more than 1000 buckets)
• Divide your heldout data according to the buckets
• Apply the previous algorithm to each bucket and its data

72. Simple Example
• Raw distribution (unigram only; smooth with uniform):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s,t,u,v,w,x,y,z
• Heldout data: baby; use one set of λ (λ_1: unigram, λ_0: uniform)
• Start with λ_1 = .5:
  p'_λ(b) = .5 × .5 + .5/26 = .27
  p'_λ(a) = .5 × .25 + .5/26 = .14
  p'_λ(y) = .5 × 0 + .5/26 = .02
  c(λ_1) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 = 2.72
  c(λ_0) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 = 1.28
  Normalize: λ_{1,next} = .68, λ_{0,next} = .32.
  Repeat from step 2 (recompute p'_λ first for efficient computation, then the c(λ_i), ...).
  Finish when the new lambdas are almost equal to the old ones (say, the difference is < 0.01).

73. Some More Technical Hints
• Set V = {all words from the training data}.
• You may also consider V = T ∪ H, but it does not make the coding in any way simpler (in fact, it is harder).
• But: you must never use the test data for your vocabulary!
• Prepend two "words" in front of all data:
• this avoids beginning-of-data problems
• call these indices -1 and 0: then the formulas hold exactly
• When c_n(w,h) = 0:
• Assign 0 probability to p_n(w|h) where c_{n-1}(h) > 0, but a uniform probability (1/|V|) to those p_n(w|h) where c_{n-1}(h) = 0
  [this must be done both when working on the heldout data during EM, and when computing the cross-entropy on the test data!]

  74. Words and the Company They Keep

75. Motivation
• Environment:
– mostly "not a full analysis (sentence/text parsing)"
• Tasks where "words & company" are important:
– word sense disambiguation (MT, IR, TD, IE)
– lexical entries: subdivision & definitions (lexicography)
– language modeling (generalization, [a kind of] smoothing)
– word/phrase/term translation (MT, multilingual IR)
– NL generation ("natural" phrases) (generation, MT)
– parsing (lexically-based selectional preferences)

76. Collocations
• Collocation
– Firth: "a word is characterized by the company it keeps"; collocations of a given word are statements of the habitual or customary places of that word.
– non-compositionality of meaning
• cannot be derived directly from its parts (heavy rain)
– non-substitutability in context
• for its parts (red light)
– non-modifiability (& non-transformability)
• kick the yellow bucket; take exceptions to

77. Association and Co-occurrence; Terms
• Does not fall under "collocation", but:
• Interesting just because the words do often [or rarely] appear together or in the same (or a similar) context:
• (doctors, nurses)
• (hardware, software)
• (gas, fuel)
• (hammer, nail)
• (communism, free speech)
• Terms:
– need not be > 1 word (notebook, washer)

78. Collocations of Special Interest
• Idioms: really fixed phrases
• kick the bucket, birds-of-a-feather, run for office
• Proper names: difficult to recognize even with lists
• Tuesday (a person's name), May, Winston Churchill, IBM, Inc.
• Numerical expressions
– containing "ordinary" words
• Monday Oct 04 1999, two thousand seven hundred fifty
• Phrasal verbs
– separable parts:
• look up, take off

79. Further Notions
• Synonymy: different form/word, same meaning:
• notebook / laptop
• Antonymy: opposite meaning:
• new/old, black/white, start/stop
• Homonymy: same form/word, different meaning:
• "true" homonymy (random, unrelated): can (aux. verb / can of Coke)
• related: polysemy; notebook, shift, grade, ...
• Other:
• Hyperonymy/Hyponymy: general vs. special: vehicle/car
• Meronymy/Holonymy: whole vs. part: body/leg

80. How to Find Collocations?
• Frequency
– plain
– filtered
• Hypothesis testing
– t test
– χ² test
• Pointwise ("poor man's") Mutual Information
• (Average) Mutual Information
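Not from the slides: a Python sketch of pointwise mutual information for a candidate collocation, estimated from raw corpus counts; all of the counts below are made up.

import math

def pmi(c_xy, c_x, c_y, n):
    # PMI(x,y) = log2( p(x,y) / (p(x) p(y)) ), with all probabilities
    # estimated as relative frequencies over n bigram positions
    return math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))

# hypothetical counts: "heavy rain" 30 times, "heavy" 1000 times,
# "rain" 500 times, in a corpus of 1,000,000 bigram positions
print(round(pmi(30, 1000, 500, 1_000_000), 2))   # about 5.91 bits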
