  1. Natural Language Processing, Spring 2017. Unit 1: Sequence Models. Lecture 4a: Probabilities and Estimations. Lecture 4b: Weighted Finite-State Machines. Liang Huang

  2. Probabilities • experiment (e.g., “toss a coin 3 times”) • basic outcomes Ω (e.g., Ω = { HHH, HHT, HTH, ..., TTT }) • event: some subset A of Ω (e.g., A = “heads twice”) • probability distribution • a function p from Ω to [0, 1] • ∑_{e ∈ Ω} p(e) = 1 • probability of events (marginals) • p(A) = ∑_{e ∈ A} p(e)
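
A minimal sketch of these definitions in Python; the uniform three-toss distribution and the event “exactly two heads” are illustrative assumptions, not taken from the slide figures:

```python
from itertools import product

# sample space for the experiment "toss a coin 3 times"
omega = ["".join(seq) for seq in product("HT", repeat=3)]   # HHH, HHT, ..., TTT
p = {e: 1 / len(omega) for e in omega}                      # uniform distribution on basic outcomes

assert abs(sum(p.values()) - 1.0) < 1e-12                   # probabilities sum to 1

# event A = "exactly two heads"; its probability is the sum over outcomes in A
A = {e for e in omega if e.count("H") == 2}
print(sum(p[e] for e in A))                                  # 0.375 (= 3/8)
```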

  3. Joint and Conditional Probs

  4. Multiplication Rule

  5. Independence • P(A, B) = P(A) P(B), or equivalently P(A) = P(A|B) • disjoint events are always dependent! P(A, B) = 0 • unless one of them is “impossible”: P(A) = 0 • conditional independence: P(A, B|C) = P(A|C) P(B|C), or equivalently P(A|C) = P(A|B, C)
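
A small numerical check of these identities on the same toy three-toss distribution; the specific events are assumptions chosen for illustration:

```python
from itertools import product

# uniform distribution over three coin tosses (same toy example as above)
omega = ["".join(seq) for seq in product("HT", repeat=3)]
p = {e: 1 / len(omega) for e in omega}

def prob(event):
    return sum(p[e] for e in event)

# A = "first toss is H", B = "second toss is H": independent
# A and C = "first toss is T": disjoint, hence dependent
A = {e for e in omega if e[0] == "H"}
B = {e for e in omega if e[1] == "H"}
C = {e for e in omega if e[0] == "T"}

print(prob(A & B), prob(A) * prob(B))   # 0.25 0.25 -> P(A,B) = P(A)P(B)
print(prob(A & C), prob(A) * prob(C))   # 0.0  0.25 -> P(A,C) = 0 != P(A)P(C)
```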

  6. Marginalization • compute marginal probs from joint/conditional probs

  7. Bayes' Rule • alternative Bayes' rule by partition
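
The equations on this slide are not preserved in this export; the standard statements it refers to (a reconstruction, not copied from the slide) are:

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},
\qquad
P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j} P(B \mid A_j)\,P(A_j)}
\quad \text{for a partition } A_1,\dots,A_n \text{ of } \Omega .
```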

  8. Most Likely Event

  9. Most Likely Given ...

  10. Estimating Probabilities • how to get probabilities for basic outcomes? • do experiments • count stuff • e.g., how often do people start a sentence with “the”? • P(A) = (# of sentences like “the ...” in the sample) / (# of all sentences in the sample) • P(A|B) = (count of A, B) / (count of B) • we will show that this is Maximum Likelihood Estimation
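
A count-and-divide sketch of this kind of estimate; the corpus, tokenization, and variable names are all illustrative assumptions:

```python
from collections import Counter

# relative-frequency ("count and divide") estimate of
# P(sentence starts with "the"), from a toy corpus
corpus = [
    "the cat sat on the mat",
    "a dog barked",
    "the dog slept",
]
starts_with_the = sum(1 for s in corpus if s.split()[0].lower() == "the")
print(starts_with_the / len(corpus))   # 2/3

# conditional estimate: P(next word = "dog" | previous word = "the")
bigrams, unigrams = Counter(), Counter()
for s in corpus:
    words = s.split()
    for prev, curr in zip(words, words[1:]):
        bigrams[(prev, curr)] += 1
        unigrams[prev] += 1
print(bigrams[("the", "dog")] / unigrams["the"])   # count(A, B) / count(B) = 1/3
```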

  11. Model • what is a MODEL? • a general theory of how the data is generated, • along with a set of parameter estimates • e.g., given these statistics • we can “guess” they were generated by a 12-sided die • along with 11 free parameters p(1), p(2), ..., p(11) • alternatively, by two tosses of a single 6-sided die • along with 5 free parameters p(1), p(2), ..., p(5) • which is better given the data? which better explains the data? argmax_m p(m|d) = argmax_m p(m) p(d|m)

  12. Maximum Likelihood Estimation • always maximize the posterior: what's the best m given d? • when do we use maximum likelihood estimation? • with a uniform prior, same as maximizing the likelihood (explaining the data) • argmax_m p(m|d) = argmax_m p(m) p(d|m) (Bayes' rule; p(d) does not depend on m) • = argmax_m p(d|m) when p(m) is uniform

  13. How do we rigorously derive this? • assuming any p_m(H) = θ is possible, what's the best θ? • e.g.: the data is still H, H, T, H. • argmax_θ p(d|m; θ) = argmax_θ θ^3 (1−θ) • take the derivative and set it to zero: θ = 3/4 • works in the general case: θ = n/(n+m) (n heads, m tails) • this is why MLE is just count & divide in the discrete case
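
A worked version of the derivative step, for n heads and m tails (a reconstruction consistent with the slide's statement):

```latex
L(\theta) = \theta^{\,n}(1-\theta)^{\,m},
\qquad
\frac{d}{d\theta}\log L(\theta) = \frac{n}{\theta} - \frac{m}{1-\theta} = 0
\;\Rightarrow\;
\hat\theta_{\text{MLE}} = \frac{n}{n+m}
```

With the data H, H, T, H this gives n = 3, m = 1, hence θ = 3/4, matching the slide.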

  14. What if we have some prior? • what if we have an arbitrary prior • like p(θ) = θ(1−θ) • maximum a posteriori estimation (MAP) • MAP approaches MLE with infinite data • MAP = MLE + smoothing • this prior is just “extra two tosses, unbiased” • you can inject other priors, like “extra 4 tosses, 3 Hs”
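
With the prior p(θ) ∝ θ(1−θ) from the slide, the MAP estimate works out as follows; the general “a extra heads, b extra tails” form is an added generalization, not copied from the slide:

```latex
\hat\theta_{\text{MAP}}
= \arg\max_{\theta}\; p(\theta)\,p(d \mid \theta)
= \arg\max_{\theta}\; \theta^{\,n+1}(1-\theta)^{\,m+1}
= \frac{n+1}{n+m+2},
\qquad
\text{with prior } \theta^{a}(1-\theta)^{b}:\;
\hat\theta_{\text{MAP}} = \frac{n+a}{n+m+a+b}.
```

This is exactly the “extra two unbiased tosses” reading (add-one smoothing of the counts); the “extra 4 tosses, 3 Hs” prior corresponds to a = 3, b = 1.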

  15. Probabilistic Finite-State Machines • adding probabilities into finite-state acceptors (FSAs) • FSA: a set of strings; WFSA: a distribution over strings

  16. WFSA • normalization: the weights of transitions leaving each state sum up to 1 • does it define a distribution over strings? • or a distribution over paths? • => a distribution over paths also induces a distribution over strings (sum over paths that spell out the same string)
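
A minimal sketch of how the path distribution induces a string distribution: enumerate the accepting paths of a tiny acyclic WFSA and sum their weights per string. The machine itself is an invented toy, not one from the slides:

```python
from collections import defaultdict

# toy acyclic WFSA: arcs[state] = list of (symbol, next_state, prob);
# weights leaving each state sum to 1, state 2 is final
arcs = {
    0: [("a", 1, 0.6), ("a", 2, 0.4)],
    1: [("b", 2, 0.7), ("c", 2, 0.3)],
    2: [],
}
final = {2}

def string_probs(state=0, prefix="", weight=1.0, acc=None):
    """Sum path probabilities, grouped by the string each path spells out."""
    if acc is None:
        acc = defaultdict(float)
    if state in final:
        acc[prefix] += weight
    for sym, nxt, p in arcs[state]:
        string_probs(nxt, prefix + sym, weight * p, acc)
    return acc

dist = string_probs()
print(dict(dist))            # {'a': 0.4, 'ab': 0.42, 'ac': 0.18}
print(sum(dist.values()))    # 1.0 -> a proper distribution over strings
```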

  17. WFSTs • FST: a relation over strings (a set of string pairs) • WFST: a probabilistic relation over strings (a set of triples <s, t, p>: string pair <s, t> with probability p) • what is p representing?

  18. Edit Distance as WFST • this is simplified edit distance • real edit distance as an example of a WFST, but not a PFST • [figure: one-state WFST for real edit distance; arcs a:a/0, b:b/0 (identity, cost 0); a:b/1, b:a/1 (replacement, cost 1); a:ε/2 (deletion, cost 2); ε:a/2 (insertion, cost 2); ...]

  19. Normalization • if, for each state and each input symbol, the weights of the leaving transitions sum up to 1, then... • the WFST defines a conditional prob p(y|x) for x => y • what if we want to define a joint prob p(x, y) for x => y? • what if we want p(x|y)?

  20. Questions of WFSTs • given x, y, what is p(y|x)? • for a given x, what's the y that maximizes p(y|x)? • for a given y, what's the x that maximizes p(y|x)? • for a given x, supply all output y w/ respective p(y|x) • for a given y, supply all input x w/ respective p(y|x)

  21. Answer: Composition • p(z|x) = p(y|x) p(z|y) ??? • no: p(z|x) = ∑_y p(y|x) p(z|y), we have to sum over y • given y, z and x are independent in this cascade; why? • how to build a composed WFST C out of WFSTs A, B? • again, like intersection • sum up the products • the (+, ×) semiring
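
A small sketch of the sum-of-products in the (+, ×) semiring. It works at the level of weighted relations (dicts of input/output pairs), which simplifies away the state-level construction used by real WFST toolkits; the toy transducers A and B are invented for illustration:

```python
from collections import defaultdict

# toy WFSTs as weighted relations {(input, output): prob};
# A gives p(y|x), B gives p(z|y)
A = {("x1", "y1"): 0.7, ("x1", "y2"): 0.3}
B = {("y1", "z1"): 0.6, ("y1", "z2"): 0.4, ("y2", "z1"): 1.0}

def compose(A, B):
    """Relation-level composition: p(z|x) = sum_y p(y|x) * p(z|y)."""
    C = defaultdict(float)
    for (x, y), p_yx in A.items():
        for (y2, z), p_zy in B.items():
            if y == y2:                      # middle tapes must match
                C[(x, z)] += p_yx * p_zy     # (+, ×) semiring: sum of products
    return dict(C)

print(compose(A, B))
# {('x1', 'z1'): 0.72, ('x1', 'z2'): 0.28}   (0.7*0.6 + 0.3*1.0 = 0.72)
```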

  22. Example

  23. Example from M. Mohri and J. Eisner • they use (min, +), we use (+, *)

  24. Example • they use (min, +), we use (+, *)

  25. Example • they use (min, +), we use (+, *)

  26. Given x, supply all output y • no longer normalized!

  27. Given x, y, what's p(y|x)

  28. Given x, what's max p(y|x)

  29. Part-of-Speech Tagging Again

  30. Part-of-Speech Tagging Again

  31. Adding a Tag Bigram Model (again) • FST C: POS bigram LM • [figure: cascade of the word acceptor p(w...w), the tagging WFST p(t...t|w...w), and the bigram LM p(t...t), with the product labeled p(???)] • wait, is that right (mathematically)?

  32. Noisy-Channel Model

  33. Noisy-Channel Model • p(t...t)
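
The standard noisy-channel decomposition for tagging that these slides build toward (a reconstruction; the slide's own equations are not in this export): the tag sequence is the source, the words are the observed channel output.

```latex
\hat t_1 \dots \hat t_n
= \arg\max_{t_1 \dots t_n} p(t_1 \dots t_n \mid w_1 \dots w_n)
= \arg\max_{t_1 \dots t_n} p(t_1 \dots t_n)\, p(w_1 \dots w_n \mid t_1 \dots t_n)
```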

  34. Applications of Noisy-Channel

  35. Example: Edit Distance • from J. Eisner • [figure: one-state edit-distance transducer with O(k) identity arcs (a:a, b:b), O(k) deletion arcs (a:ε, b:ε), O(k) insertion arcs (ε:a, ε:b), and substitution arcs (a:b, b:a)]

  36. Example: Edit Distance • best path (by Dijkstra's algorithm) • [figure: the acceptor for “clara”, composed (.o.) with the edit-distance transducer, composed (.o.) with the acceptor for “caca”; the best path through the resulting lattice gives the cheapest edit sequence]

  37. Max / Sum Probs • in a WFSA, which string x has the greatest p(x)? • graph search (shortest-path) problem: Dijkstra's algorithm, or Viterbi if the FSA is acyclic • does it work for an NFA? • best path is much easier than best string • you can determinize it (with exponential cost!) • popular work-around: n-best list crunching • [photos: Edsger Dijkstra (1930-2002), “GOTO considered harmful”; Andrew Viterbi, Viterbi algorithm (1967), CDMA, Qualcomm]
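
A sketch of the best-path computation for an acyclic WFSA in the (max, ×) semiring, i.e. Viterbi over a topological order; the toy machine is invented, and for cyclic machines one would instead run Dijkstra-style search on negative log probabilities:

```python
# Viterbi best path in a toy acyclic WFSA, (max, *) semiring.
# arcs[state] = [(symbol, next_state, prob)]; state ids follow a topological order.
arcs = {
    0: [("a", 1, 0.6), ("b", 2, 0.4)],
    1: [("c", 2, 0.9), ("d", 2, 0.1)],
    2: [],
}
final = {2}

best = {0: (1.0, "")}            # state -> (best path prob, that path's string)
for state in sorted(arcs):       # topological order for this toy machine
    if state not in best:
        continue
    prob, string = best[state]
    for sym, nxt, p in arcs[state]:
        cand = (prob * p, string + sym)
        if nxt not in best or cand[0] > best[nxt][0]:
            best[nxt] = cand

print(max(best[q] for q in final))   # (0.54, 'ac')
```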

  38. Dijkstra 1959 vs. Viterbi 1967 • that's minimum spanning tree! Jarník (1930) - Prim (1957) - Dijkstra (1959)

  39. Dijkstra 1959 vs. Viterbi 1967 • that's shortest path: Moore (1957) - Dijkstra (1959)

  40. Dijkstra 1959 vs. Viterbi 1967 • a special case of dynamic programming (Bellman, 1957)

  41. Sum Probs • what is p(x) for some particular x? • for a DFA, just follow x • for an NFA: • get a subgraph (by composition), then sum • acyclic => Viterbi • cyclic => compute strongly connected components • SCC-DAG cluster graph (cyclic locally, acyclic globally) • do the infinite sum (matrix inversion) locally, Viterbi globally • refer to extra readings on the course website
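
A sketch of the “infinite sum by matrix inversion” idea for a cyclic subgraph: if M[i][j] is the one-step probability of going from state i to state j within the subgraph for x, the sum over paths of all lengths is the geometric series I + M + M² + ... = (I − M)⁻¹, provided the series converges. The tiny two-state machine with a self-loop is an invented example:

```python
import numpy as np

# one-step transition weights within the subgraph for a fixed string x
# (toy two-state machine with a self-loop of weight 0.5 on state 0)
M = np.array([[0.5, 0.3],
              [0.0, 0.0]])
start = np.array([1.0, 0.0])    # start in state 0
stop  = np.array([0.0, 1.0])    # state 1 is the accepting state

# sum over paths of all lengths: start^T (I + M + M^2 + ...) stop
total = start @ np.linalg.inv(np.eye(2) - M) @ stop
print(total)    # 0.3 * (1 + 0.5 + 0.25 + ...) = 0.6
```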
