Natural Language Processing Spring 2017 Unit 1: Sequence Models



SLIDE 1

Natural Language Processing

Spring 2017

Liang Huang

Unit 1: Sequence Models

Lecture 4a: Probabilities and Estimations (required)

Lecture 4b: Weighted Finite-State Machines (optional)
SLIDE 2

Probabilities

  • experiment (e.g., “toss a coin 3 times”)
  • basic outcomes Ω (e.g., Ω = {HHH, HHT, HTH, ..., TTT})
  • event: some subset A of Ω (e.g., A = “heads twice”)
  • probability distribution
  • a function p from Ω to [0, 1]
  • ∑_{e ∈ Ω} p(e) = 1
  • probability of events (marginals)
  • p(A) = ∑_{e ∈ A} p(e)
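A minimal Python sketch of these definitions; the three-toss experiment matches the slide, but the fair-coin (uniform) distribution is an assumption for illustration:

```python
from itertools import product

# Sample space for "toss a coin 3 times": HHH, HHT, ..., TTT.
omega = [''.join(t) for t in product('HT', repeat=3)]

# A probability distribution: a function p from omega to [0, 1] summing to 1
# (uniform here, i.e., a fair coin).
p = {e: 1 / len(omega) for e in omega}
assert abs(sum(p.values()) - 1.0) < 1e-12

# Event A = "heads exactly twice"; p(A) is the sum of p(e) over e in A.
A = [e for e in omega if e.count('H') == 2]
print(sum(p[e] for e in A))  # 3/8 = 0.375
```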

SLIDE 3

Joint and Conditional Probs
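In symbols (the standard definitions):

  • joint probability: p(A, B) = p(A ∩ B)
  • conditional probability: p(A | B) = p(A, B) / p(B), defined when p(B) > 0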

SLIDE 4

Multiplication Rule
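In symbols (the standard multiplication and chain rules):

  • p(A, B) = p(A | B) p(B) = p(B | A) p(A)
  • chain rule: p(A1, ..., An) = p(A1) p(A2 | A1) ··· p(An | A1, ..., An−1)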

SLIDE 5

Independence

  • P(A, B) = P(A) P(B), or equivalently P(A) = P(A | B) (when P(B) > 0)
  • disjoint events are always dependent! P(A,B) = 0
  • unless one of them is “impossible”: P(A)=0
  • conditional independence: P(A, B | C) = P(A | C) P(B | C)
  • equivalently, P(A | C) = P(A | B, C)
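A quick numeric check of the definition on two fair dice; the particular events below are assumptions for illustration:

```python
from itertools import product
from fractions import Fraction

# Two fair six-sided dice: 36 equally likely basic outcomes.
omega = list(product(range(1, 7), repeat=2))
w = Fraction(1, 36)

def prob(event):
    return sum(w for e in omega if event(e))

A = lambda e: e[0] == 6          # first die shows 6
B = lambda e: e[0] + e[1] == 7   # the dice sum to 7

# P(A, B) == P(A) P(B): 1/36 == 1/6 * 1/6, so A and B are independent.
print(prob(lambda e: A(e) and B(e)) == prob(A) * prob(B))  # True
```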

SLIDE 6

Marginalization

  • compute marginal probs from joint/conditional probs
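In symbols, for a partition B1, ..., Bn of Ω:

  • p(A) = ∑_i p(A, Bi) = ∑_i p(A | Bi) p(Bi)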

SLIDE 7

Bayes’ Rules


alternative: Bayes’ rule by partition
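In symbols (standard forms):

  • Bayes’ rule: p(A | B) = p(B | A) p(A) / p(B)
  • by partition A1, ..., An of Ω: p(Ai | B) = p(B | Ai) p(Ai) / ∑_j p(B | Aj) p(Aj)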

SLIDE 8

Most Likely Event
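In symbols: the most likely outcome is e* = argmax_{e ∈ Ω} p(e).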

SLIDE 9

Most Likely Given ...
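In symbols: argmax_a p(a | B) = argmax_a p(a, B) / p(B) = argmax_a p(a, B), since p(B) does not depend on a.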

SLIDE 10

Estimating Probabilities

  • how to get probabilities for basic outcomes?
  • do experiments
  • count stuff
  • e.g. how often do people start a sentence with “the”?
  • P(A) = (# of sentences like “the ...” in the sample) / (# of all sentences in the sample)
  • P(A | B) = (count of A, B) / (count of B)
  • we will show that this is Maximum Likelihood Estimation
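A count-and-divide sketch in Python; the toy corpus is made up for illustration:

```python
# Relative-frequency ("count & divide") estimation on a made-up toy corpus.
corpus = [
    "the cat sat",
    "the dog barked",
    "a bird sang",
    "the end",
]

# P(A): fraction of sentences starting with "the".
print(sum(s.split()[0] == "the" for s in corpus) / len(corpus))  # 3/4 = 0.75

# P(A | B) = count(A, B) / count(B), e.g. P(next word = "cat" | word = "the").
words = [w for s in corpus for w in s.split()]
bigrams = [b for s in corpus for b in zip(s.split(), s.split()[1:])]
print(bigrams.count(("the", "cat")) / words.count("the"))  # 1/3
```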

SLIDE 11

Model

  • what is a MODEL?
  • a general theory of how the data is generated,
  • along with a set of parameter estimates
  • e.g., given these statistics
  • we can “guess” it’s generated by a 12-sided die
  • along with 11 free parameters p(1), p(2), ..., p(11)
  • alternatively, by two tosses of a single 6-sided die
  • along with 5 free parameters p(1), p(2), ..., p(5)
  • which is better given the data? which better explains the data?

argmax_m p(m | d) = argmax_m p(m) p(d | m)
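A sketch of comparing the two proposed models by likelihood; the observed counts below are hypothetical, not the slide’s actual statistics:

```python
import math
from itertools import product

# Hypothetical counts of observed values 2..12.
counts = {2: 3, 3: 6, 4: 9, 5: 11, 6: 14, 7: 17, 8: 13, 9: 11, 10: 8, 11: 5, 12: 3}

# Model A: one uniform 12-sided die (each observed value has p = 1/12).
ll_A = sum(n * math.log(1 / 12) for n in counts.values())

# Model B: the sum of two fair 6-sided dice.
p_sum = {}
for i, j in product(range(1, 7), repeat=2):
    p_sum[i + j] = p_sum.get(i + j, 0) + 1 / 36

ll_B = sum(n * math.log(p_sum[v]) for v, n in counts.items())
print(ll_A < ll_B)  # True: model B better explains this triangle-shaped data
```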

SLIDE 12

Maximum Likelihood Estimation

  • always maximize posterior: what’s the best m given d?
  • when do we use maximum likelihood estimation?
  • with uniform prior, same as likelihood (explains data)
  • argmax_m p(m | d) = argmax_m p(m) p(d | m) (Bayes; p(d) is constant in m)
  • = argmax_m p(d | m) when p(m) is uniform

SLIDE 13

How do we rigorously derive this?

  • assuming any p_m(H) = θ is possible, what’s the best θ?
  • e.g.: data is still H, H, T, H.
  • argmax_θ p(d | m; θ) = argmax_θ θ³ (1 − θ)
  • take the derivative, set it to zero: θ = 3/4.
  • works in the general case: θ = n / (n+m) (n heads, m tails)
  • this is why MLE is just count & divide in the discrete case
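Filling in the calculus step: d/dθ [θ³ (1 − θ)] = 3θ² − 4θ³ = θ² (3 − 4θ), which vanishes on (0, 1) only at θ = 3/4. A quick symbolic check (assuming sympy is available):

```python
import sympy

theta = sympy.symbols('theta', positive=True)
likelihood = theta**3 * (1 - theta)   # p(d | m; theta) for the data H, H, T, H
print(sympy.solve(sympy.diff(likelihood, theta), theta))  # [3/4]
```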

SLIDE 14

What if we have some prior?

  • what if we have an arbitrary prior
  • like p(θ) ∝ θ (1 − θ)
  • maximum a posteriori estimation (MAP)
  • MAP approaches MLE with infinite data
  • MAP = MLE + smoothing
  • this prior is just “extra two tosses, unbiased”
  • you can inject other priors, like “extra 4 tosses, 3 Hs”
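Concretely: with prior p(θ) ∝ θ (1 − θ) and data with n heads and m tails, the posterior is ∝ θ^(n+1) (1 − θ)^(m+1), so the MAP estimate is θ = (n + 1) / (n + m + 2): count-and-divide with one extra head and one extra tail, i.e., add-one smoothing.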

SLIDE 15

Probabilistic Finite-State Machines

  • adding probabilities into finite-state acceptors (FSAs)
  • FSA: a set of strings; WFSA: a distribution over strings

SLIDE 16

WFSA

  • normalization: transitions leaving each state sum up to 1
  • does it define a distribution over strings?
  • or a distribution over paths?
  • => a distribution over paths, which also induces a distribution over strings
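A sketch of the path/string distinction; the tiny machine below is made up, with two distinct paths accepting the same string:

```python
# A toy WFSA: arcs[state] = [(symbol, next_state, prob)]; the probabilities
# leaving each state sum to 1, so the machine defines a distribution over paths.
arcs = {
    0: [('a', 1, 0.6), ('a', 2, 0.4)],
    1: [('b', 3, 1.0)],
    2: [('b', 3, 1.0)],
}
final = {3}

def string_prob(x, state=0, i=0):
    """p(x): sum of path probabilities over all accepting paths labeled x."""
    if i == len(x):
        return 1.0 if state in final else 0.0
    return sum(w * string_prob(x, nxt, i + 1)
               for sym, nxt, w in arcs.get(state, []) if sym == x[i])

print(string_prob('ab'))  # 1.0: two paths (0.6 and 0.4) induce one string
```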

SLIDE 17

WFSTs

  • FST: a relation over strings (a set of string pairs)
  • WFST: a probabilistic relation over strings (a set of triples <s, t, p>: string pair <s, t> with probability p)
  • what does p represent?

SLIDE 18

Edit Distance as WFST

  • this is simplified edit distance
  • real edit distance as an example of WFST, but not PFST


costs: replacement 1, insertion 2, deletion 2

[figure: WFST for real edit distance; example arcs a:a/0, a:b/1, a:ε/2, b:b/0, ε:a/2, b:a/1, ...]
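A dynamic-programming sketch of the same weighted edit distance (replacement 1, insertion 2, deletion 2); this computes the cost directly rather than via the transducer:

```python
def edit_distance(s, t, sub=1, ins=2, dele=2):
    """Weighted edit distance with the costs shown above."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + dele,                              # delete s[i-1]
                          d[i][j - 1] + ins,                               # insert t[j-1]
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]) * sub)  # match / replace
    return d[m][n]

print(edit_distance("clara", "caca"))  # 3: the pair composed on a later slide
```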

SLIDE 19

Normalization

  • if the transitions leaving each state for each input symbol sum up to 1, then...

  • WFST defines conditional prob p(y|x) for x => y
  • what if we want to define a joint prob p(x, y) for x=>y?
  • what if we want p(x | y)?

SLIDE 20

Questions of WFSTs

  • given x, y, what is p(y|x) ?
  • for a given x, what’s the y that maximizes p(y|x) ?
  • for a given y, what’s the x that maximizes p(y|x) ?
  • for a given x, supply all output y w/ respective p(y|x)
  • for a given y, supply all input x w/ respective p(y|x)

SLIDE 21

Answer: Composition

  • p(z | x) = p(y | x) p(z | y) ???
  • = ∑_y p(y | x) p(z | y): have to sum over y
  • given y, z and x are independent in this cascade. Why?
  • how to build a composed WFST C out of WFSTs A, B?
  • again, like intersection
  • sum up the products
  • the (+, *) semiring
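A sketch of (epsilon-free) composition in this semiring; the arc-list representation and the toy machines are assumptions for illustration:

```python
from collections import defaultdict

def compose(A, B):
    """Compose WFSTs given as dicts: state -> [(in_sym, out_sym, next, w)].
    Arcs pair up on the shared middle symbol; weights multiply; parallel
    paths for the same string pair are later summed: the (+, *) semiring."""
    C = defaultdict(list)
    for qa, arcs_a in A.items():
        for qb, arcs_b in B.items():
            for x, y1, ra, w1 in arcs_a:
                for y2, z, rb, w2 in arcs_b:
                    if y1 == y2:           # A's output must match B's input
                        C[(qa, qb)].append((x, z, (ra, rb), w1 * w2))
    return dict(C)

# Toy cascade: A maps a->b or a->c; B maps both b and c to d.
A = {0: [('a', 'b', 1, 0.7), ('a', 'c', 1, 0.3)]}
B = {0: [('b', 'd', 1, 1.0), ('c', 'd', 1, 1.0)]}
# Two parallel a:d arcs whose weights sum to p(d | a) = 0.7 + 0.3 = 1.0.
print(compose(A, B)[(0, 0)])
```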

SLIDE 22

Example

SLIDE 23

Example


they use (min, +), we use (+, *)

from M. Mohri and J. Eisner

SLIDE 24

Example


they use (min, +), we use (+, *)

SLIDE 25

Example


they use (min, +), we use (+, *)

SLIDE 26

Given x, supply all output y


no longer normalized!

SLIDE 27

Given x, y, what’s p(y|x)

SLIDE 28

Given x, what’s max p(y|x)

SLIDE 29

Part-of-Speech Tagging Again

SLIDE 30

Part-of-Speech Tagging Again

SLIDE 31

Adding a Tag Bigram Model (again)


[figure: FST C, a POS bigram LM, added to the tagging cascade; labels: p(t...t | w...w), p(t...t), p(???), p(w...w); annotation: “wait, is that right (mathematically)?”]

SLIDE 32

Noisy-Channel Model

SLIDE 33

Noisy-Channel Model


[figure: noisy-channel cascade; source model p(t...t)]
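The decomposition being used (the standard noisy-channel form): argmax_{t...t} p(t...t | w...w) = argmax_{t...t} p(t...t) · p(w...w | t...t), i.e., a source model over tag sequences composed with a channel model of words given tags.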

SLIDE 34

Applications of Noisy-Channel

SLIDE 35

Example: Edit Distance


a:ε" ε" ε:a b:ε" ε" ε:b a:b b:a a:a b:b O(k) deletion arcs O(k) insertion arcs O(k) identity arcs

from J. Eisner

SLIDE 36

Example: Edit Distance


clara .o. [edit-distance transducer] .o. caca =

[figure: the composed edit-distance lattice for “clara” and “caca”]

Best path (by Dijkstra’s algorithm)

SLIDE 37

Max / Sum Probs

  • in a WFSA, which string x has the greatest p(x)?
  • graph search (shortest path) problem
  • Dijkstra; or Viterbi if the FSA is acyclic
  • does it work for an NFA?
  • best path is much easier than best string
  • you can determinize it (with exponential cost!)
  • popular work-around: n-best list crunching


[photos: Andrew Viterbi (b. 1935): Viterbi algorithm (1967), CDMA, Qualcomm; Edsger Dijkstra (1930-2002): “GOTO considered harmful”]
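A Viterbi best-path sketch on an acyclic WFSA; the machine is made up, and its states are assumed to be listed in topological order:

```python
# arcs[state] = [(symbol, next_state, prob)]
arcs = {
    0: [('a', 1, 0.6), ('b', 1, 0.4)],
    1: [('c', 2, 0.9), ('d', 2, 0.1)],
}
topo, final = [0, 1, 2], 2

best = {0: (1.0, [])}   # state -> (best path probability, its label sequence)
for q in topo:
    if q not in best:
        continue
    p, labels = best[q]
    for sym, r, w in arcs.get(q, []):
        if r not in best or p * w > best[r][0]:
            best[r] = (p * w, labels + [sym])

print(best[final])  # (0.54, ['a', 'c'])
```

Note this maximizes over paths; the best string would require summing the paths that share a label sequence, which is why it is the harder problem.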

SLIDE 38

Dijkstra 1959 vs. Viterbi 1967


that’s min. spanning tree!

Jarnik (1930) - Prim (1957) - Dijkstra (1959)


SLIDE 39

Dijkstra 1959 vs. Viterbi 1967


that’s shortest-path

Moore (1957) - Dijkstra (1959)

SLIDE 40

Dijkstra 1959 vs. Viterbi 1967


special case of dynamic programming (Bellman, 1957)

SLIDE 41

Sum Probs

  • what is p(x) for some particular x?
  • for a DFA, just follow x
  • for an NFA,
  • get a subgraph (by composition), then sum ??
  • acyclic => Viterbi
  • cyclic => compute strongly connected components
  • SCC-DAG cluster graph (cyclic locally, acyclic globally)
  • do the infinite sum (matrix inversion) locally, Viterbi globally

  • refer to extra readings on course website
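A sketch of the “infinite sum by matrix inversion” step on a made-up two-state cyclic component (numpy is an assumption about available tooling):

```python
import numpy as np

# A[i][j]: total one-step weight from state i to state j; state 0 has a
# self-loop, so there are infinitely many paths from 0 to 1.
A = np.array([[0.5, 0.3],
              [0.0, 0.0]])

# Sum over paths of every length: I + A + A^2 + ... = (I - A)^{-1},
# provided the geometric series converges.
closure = np.linalg.inv(np.eye(2) - A)
print(closure[0, 1])  # 0.3 / (1 - 0.5) = 0.6
```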
