Introduction to Natural Language Processing


  1. Introduction to Natural Language Processing, a course taught as B4M36NLP at Open Informatics by members of the Institute of Formal and Applied Linguistics. Today: Week 3 lecture. Today's topic: Markov Models. Today's teacher: Jan Hajič. E-mail: hajic@ufal.mff.cuni.cz. WWW: http://ufal.mff.cuni.cz/jan-hajic

  2. Review: Markov Process
• Bayes formula (chain rule): P(W) = P(w_1, w_2, ..., w_T) = ∏_{i=1..T} p(w_i | w_1, w_2, ..., w_{i-1})
• n-gram language models approximate this with a Markov process (chain) of order n-1, using just one distribution:
  P(W) = P(w_1, w_2, ..., w_T) ≈ ∏_{i=1..T} p(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1})
• Example (trigram model: p(w_i | w_{i-2}, w_{i-1})):
  Positions: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
  Words:     My car broke down , and within hours Bob 's car broke down , too .
  p(, | broke down) = p(w_5 | w_3, w_4) = p(w_14 | w_12, w_13)
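As a concrete illustration of the trigram factorization above, here is a minimal Python sketch (not from the slides; the probability table and function name are made-up toy values for illustration):

```python
# Minimal sketch of a trigram language model: P(W) = prod_i p(w_i | w_{i-2}, w_{i-1}).
# The probability table below is a toy example, not estimated from real data.

trigram_prob = {
    ("<s>", "<s>", "my"): 0.1,
    ("<s>", "my", "car"): 0.4,
    ("my", "car", "broke"): 0.3,
    ("car", "broke", "down"): 0.8,
    ("broke", "down", ","): 0.5,
}

def sentence_probability(words, probs):
    """Multiply p(w_i | w_{i-2}, w_{i-1}) over the sentence, padding with <s>."""
    padded = ["<s>", "<s>"] + words
    p = 1.0
    for i in range(2, len(padded)):
        p *= probs.get((padded[i - 2], padded[i - 1], padded[i]), 0.0)
    return p

print(sentence_probability(["my", "car", "broke", "down", ","], trigram_prob))
# 0.1 * 0.4 * 0.3 * 0.8 * 0.5 = 0.0048
```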

  3. Markov Properties
• Generalize to any process (not just words/LM):
  – sequence of random variables: X = (X_1, X_2, ..., X_T)
  – sample space S (states), size N: S = {s_0, s_1, s_2, ..., s_N}
• 1. Limited History (Context, Horizon): ∀ i ∈ 1..T: P(X_i | X_1, ..., X_{i-1}) = P(X_i | X_{i-1})
• 2. Time invariance (the Markov chain is stationary, homogeneous): ∀ i ∈ 1..T, ∀ y, x ∈ S: P(X_i = y | X_{i-1} = x) = p(y|x)
• [Slide illustration: a digit sequence 1 7 3 7 9 0 6 7 3 4 5 ..., showing that only the preceding symbol matters and that the same distribution p(y|x) applies at every position.]

  4. Long History Possible
• What if we want trigrams (e.g., over the digit sequence 1 7 3 7 9 0 6 7 3 4 5 ...)?
• Formally, use a transformation: define new variables Q_i such that X_i = (Q_{i-1}, Q_i); then
  P(X_i | X_{i-1}) = P(Q_{i-1}, Q_i | Q_{i-2}, Q_{i-1}) = P(Q_i | Q_{i-2}, Q_{i-1})
• [Slide illustration: predicting X_i (the next pair of adjacent digits) from the history X_{i-1} = (Q_{i-2}, Q_{i-1}).]
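The pair-state transformation can be sketched in a few lines of Python (my own illustration; the trigram probabilities are toy values): every trigram probability p(c | a, b) becomes a first-order transition from state (a, b) to state (b, c).

```python
# Sketch: encode a trigram dependency as a first-order Markov chain over pair states.
# A transition (a, b) -> (b, c) carries probability p(c | a, b),
# so P(X_i | X_{i-1}) = P(Q_i | Q_{i-2}, Q_{i-1}).

def pair_state_transitions(trigram_prob):
    """Turn p(c | a, b) into first-order transitions between pair states."""
    transitions = {}
    for (a, b, c), p in trigram_prob.items():
        transitions[((a, b), (b, c))] = p
    return transitions

# Toy trigram probabilities over digits (illustrative values only).
trigram_prob = {("1", "7", "3"): 0.5, ("7", "3", "7"): 0.2, ("3", "7", "9"): 0.7}
print(pair_state_transitions(trigram_prob))
# {(('1','7'), ('7','3')): 0.5, (('7','3'), ('3','7')): 0.2, (('3','7'), ('7','9')): 0.7}
```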

  5. Graph Representation: State Diagram
• S = {s_0, s_1, s_2, ..., s_N}: states
• Distribution P(X_i | X_{i-1}): transitions (as arcs) with probabilities attached to them; the sum of the outgoing probabilities of each state = 1
• [Slide diagram, bigram case: states ε ("enter here"), t, o, e, a; e.g. p(o|a) = 0.1]
• p(toe) = .6 × .88 × 1 = .528
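A small sketch of how such a state diagram can be stored and scored (my own illustration; only the arcs on the t → o → e path of the slide's example are filled in, and the start-state name is an assumption):

```python
# Sketch: a visible Markov model as a dictionary of transition probabilities.
# "<start>" plays the role of the "enter here" state on the slide.
transitions = {
    ("<start>", "t"): 0.6,
    ("t", "o"): 0.88,
    ("o", "e"): 1.0,
}

def sequence_probability(symbols, transitions):
    """P(s_1 ... s_k) = prod_i P(s_i | s_{i-1}), starting from the initial state."""
    prob = 1.0
    prev = "<start>"
    for s in symbols:
        prob *= transitions.get((prev, s), 0.0)
        prev = s
    return prob

print(sequence_probability("toe", transitions))   # 0.6 * 0.88 * 1.0 = 0.528
```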

  6. The Trigram Case
• S = {s_0, s_1, s_2, ..., s_N}: states are pairs s_i = (x, y)
• Distribution P(X_i | X_{i-1}) (the r.v. X generates pairs s_i)
• [Slide diagram: pair states such as (t,e), (t,o), (o,e), (o,n), (n,e) with transition probabilities; not all states and arcs are shown.]
• p(toe) = .6 × .88 × .07 ≈ .037; p(one) = ?

  7. Finite State Automaton
• States ~ symbols of the [input/output] alphabet
  – pairs (or more): the last element of the n-tuple
• Arcs ~ transitions (sequence of states)
• [Classical FSA: alphabet symbols on arcs; transformation: arcs → nodes]
• Possible thanks to the "limited history" Markov Property
• So far: Visible Markov Models (VMM)

  8. Hidden Markov Models
• The simplest HMM: states generate [observable] output (using the "data" alphabet) but remain "invisible":
• [Slide diagram: the same chain as on slide 5, with the states now labelled 1–4 and the output symbols t, e, a, o attached to them; p(4|3) = 0.1.]
• p(toe) = .6 × .88 × 1 = .528

  9. Added Flexibility
• So far, no change; but different states may generate the same output (why not?):
• [Slide diagram: as before, except that state 3 now also emits t, so two different states generate the same symbol; p(4|3) = 0.1.]
• p(toe) = .6 × .88 × 1 + .4 × .1 × 1 = .568

  10. Output from Arcs...
• Added flexibility: generate output from arcs, not states:
• [Slide diagram: the same four states, with the output symbols t, o, e now written on the arcs rather than on the states.]
• p(toe) = .6 × .88 × 1 + .4 × .1 × 1 + .4 × .2 × .3 + .4 × .2 × .4 = .624

  11. ... and Finally, Add Output Probabilities
• Maximum flexibility: a [unigram] distribution (sample space: the output alphabet) at each output arc:
• [Slide diagram, marked "!simplified!": the four-state model with an output distribution over {t, o, e} attached to every arc, e.g. p(t)=.8, p(o)=.1, p(e)=.1 on the arc into state 1 and p(t)=.5, p(o)=.2, p(e)=.3 on the arc into state 3.]
• p(toe) = (.6 × .8)(.88 × .7)(1 × .6) + (.4 × .5)(1 × 1)(.88 × .2) + (.4 × .5)(1 × 1)(.12 × 1) ≈ .237
  (each factor pairs a transition probability with the output probability on that arc)
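The same kind of computation can be written as a sum over all state paths. Below is a minimal brute-force sketch (my own illustration; the transition and per-arc output values are the ones used in the p(toe) example as reconstructed here, and arcs not needed for the example are omitted):

```python
from itertools import product

# Transition probabilities P_S(s'|s) and per-arc output distributions P_Y(y|s,s'),
# restricted to the arcs appearing in the p(toe) example.
P_S = {("x", "1"): 0.6, ("x", "3"): 0.4, ("1", "2"): 0.12, ("1", "4"): 0.88,
       ("3", "1"): 1.0, ("4", "2"): 1.0}
P_Y = {("x", "1"): {"t": 0.8, "o": 0.1, "e": 0.1},
       ("x", "3"): {"t": 0.5, "o": 0.2, "e": 0.3},
       ("1", "2"): {"e": 1.0},
       ("1", "4"): {"t": 0.1, "o": 0.7, "e": 0.2},
       ("3", "1"): {"o": 1.0},
       ("4", "2"): {"o": 0.4, "e": 0.6}}

def output_probability(symbols, states=("1", "2", "3", "4"), start="x"):
    """Sum P(path, output) over all state paths of the right length."""
    total = 0.0
    for path in product(states, repeat=len(symbols)):
        prev, p = start, 1.0
        for state, y in zip(path, symbols):
            p *= P_S.get((prev, state), 0.0) * P_Y.get((prev, state), {}).get(y, 0.0)
            prev = state
        total += p
    return total

print(round(output_probability("toe"), 3))   # ~= 0.237
```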

  12. Slightly Different View
• Allow for multiple arcs from s_i to s_j, mark them by output symbols, and get rid of the output distributions:
• [Slide diagram: arcs labelled with an output symbol and a joint probability, e.g. t,.48 and o,.616 and e,.6.]
• p(toe) = .48 × .616 × .6 + .2 × 1 × .176 + .2 × 1 × .12 ≈ .237
• In the future, we will use whichever view is more convenient for the problem at hand.
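The relation between the two views can be sketched directly: each symbol-labelled arc carries the product P_S(s_j|s_i) × P_Y(y_k|s_i,s_j). A minimal illustration (my own; only two arcs of the example are shown):

```python
# Sketch: collapse separate transition and emission distributions into
# symbol-labelled arcs carrying joint probabilities, as on the slide.
def to_labelled_arcs(P_S, P_Y):
    arcs = {}
    for (s, s_next), p_trans in P_S.items():
        for y, p_emit in P_Y.get((s, s_next), {}).items():
            arcs[(s, y, s_next)] = p_trans * p_emit
    return arcs

# Two arcs of the slides' example: x -> 1 and 1 -> 4.
P_S = {("x", "1"): 0.6, ("1", "4"): 0.88}
P_Y = {("x", "1"): {"t": 0.8, "o": 0.1, "e": 0.1},
       ("1", "4"): {"t": 0.1, "o": 0.7, "e": 0.2}}
print(to_labelled_arcs(P_S, P_Y)[("x", "t", "1")])   # 0.6 * 0.8 = 0.48
print(to_labelled_arcs(P_S, P_Y)[("1", "o", "4")])   # 0.88 * 0.7 = 0.616
```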

  13. Formalization
• HMM (the most general case): a five-tuple (S, s_0, Y, P_S, P_Y), where:
  – S = {s_0, s_1, s_2, ..., s_T} is the set of states, s_0 is the initial state,
  – Y = {y_1, y_2, ..., y_V} is the output alphabet,
  – P_S(s_j|s_i) is the set of probability distributions of transitions; size of P_S: |S|^2,
  – P_Y(y_k|s_i,s_j) is the set of output (emission) probability distributions; size of P_Y: |S|^2 × |Y|.
• Example:
  – S = {x, 1, 2, 3, 4}, s_0 = x
  – Y = {t, o, e}
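A minimal Python rendering of this five-tuple (a sketch of my own; the field names and the partially filled-in example are illustrative, not from the slides):

```python
from dataclasses import dataclass

@dataclass
class HMM:
    """HMM as a five-tuple (S, s0, Y, P_S, P_Y), following the slide's formalization."""
    states: set     # S = {s0, s1, ..., sT}
    start: str      # s0, the initial state
    alphabet: set   # Y = {y1, ..., yV}
    trans: dict     # P_S[(s_i, s_j)] = P(s_j | s_i); up to |S|^2 entries
    emit: dict      # P_Y[(s_i, s_j)][y_k] = P(y_k | s_i, s_j); up to |S|^2 * |Y| entries

# The slides' example state set and alphabet, with two illustrative arcs filled in.
hmm = HMM(states={"x", "1", "2", "3", "4"}, start="x", alphabet={"t", "o", "e"},
          trans={("x", "1"): 0.6, ("1", "4"): 0.88},
          emit={("x", "1"): {"t": 0.8, "o": 0.1, "e": 0.1},
                ("1", "4"): {"t": 0.1, "o": 0.7, "e": 0.2}})
```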

  14. Formalization - Example
• Example:
  – S = {x, 1, 2, 3, 4}, s_0 = x
  – Y = {e, o, t}
• P_S (rows: from-state, columns: to-state; each row sums to 1):
         x    1    2    3    4
    x    0   .6    0   .4    0
    1    0    0  .12    0  .88
    2    0    0    0    0    1
    3    0    1    0    0    0
    4    0    0    1    0    0
• P_Y: one such table per output symbol (e, o, t), indexed by the state pair (s_i, s_j); e.g. P_Y(t|x,1) = .8, P_Y(o|1,4) = .7, P_Y(e|4,2) = .6 (cf. the arc distributions on slide 11).

  15. Using the HMM
• The generation algorithm (of limited value :-)):
  1. Start in s = s_0.
  2. Move from s to s' with probability P_S(s'|s).
  3. Output (emit) symbol y_k with probability P_Y(y_k|s,s').
  4. Repeat from step 2 (until somebody says enough).
• More interesting usage:
  – Given an output sequence Y = {y_1, y_2, ..., y_k}, compute its probability.
  – Given an output sequence Y = {y_1, y_2, ..., y_k}, compute the most likely sequence of states which has generated it.
  – ...plus variations: e.g., the n best state sequences.
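A minimal sketch of the generation algorithm (my own illustration; the dictionary representation of P_S and P_Y and the toy two-arc model are assumptions):

```python
import random

def generate(start, trans, emit, length=5):
    """Random walk through an HMM: move with P_S(s'|s), emit with P_Y(y|s,s')."""
    s, output = start, []
    for _ in range(length):                      # "until somebody says enough"
        next_states, probs = zip(*[(t, p) for (a, t), p in trans.items() if a == s])
        s_next = random.choices(next_states, weights=probs)[0]
        symbols, e_probs = zip(*emit[(s, s_next)].items())
        output.append(random.choices(symbols, weights=e_probs)[0])
        s = s_next
    return output

# Toy two-arc model (illustrative values only).
trans = {("x", "1"): 1.0, ("1", "x"): 1.0}
emit = {("x", "1"): {"t": 0.8, "o": 0.2}, ("1", "x"): {"e": 1.0}}
print(generate("x", trans, emit))
```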

  16. HMM Algorithms: Trellis and Viterbi

  17. HMM: The Two Tasks
• HMM (the general case): a five-tuple (S, s_0, Y, P_S, P_Y), where:
  – S = {s_0, s_1, ..., s_T} is the set of states, s_0 is the initial state,
  – Y = {y_1, y_2, ..., y_V} is the output alphabet,
  – P_S(s_j|s_i) is the set of probability distributions of transitions,
  – P_Y(y_k|s_i,s_j) is the set of output (emission) probability distributions.
• Given an HMM and an output sequence Y = {y_1, y_2, ..., y_k}:
  – (Task 1) compute the probability of Y;
  – (Task 2) compute the most likely sequence of states which has generated Y.

  18. Trellis - Deterministic Output
• The trellis is the HMM "rolled out" over time/positions t = 0, 1, 2, 3, 4, ...
  – trellis state: (HMM state, position)
  – each trellis state holds one number (a probability) α
  – the probability of Y is the α in the last state
• [Slide figure: an HMM with start state ε and states A, B, C, D (p(4|3) = 0.1), and its trellis roll-out, one column per position.]
• For Y = t o e: α(ε,0) = 1, α(A,1) = .6, α(C,1) = .4, α(D,2) = .568, α(B,3) = .568
• p(toe) = .6 × .88 × 1 + .4 × .1 × 1 = .568

  19. Creating the Trellis: The Start
• Start in the start state (ε), set its α(ε, 0) to 1.
• Create the first stage:
  – get the first "output" symbol y_1
  – create the first stage (column), but only those trellis states which generate y_1
  – set their α(state, 1) to P_S(state | ε) × α(ε, 0)
• ...and forget about the 0-th stage.
• [Slide figure: positions 0 and 1; y_1 = t, α(ε,0) = 1, α(A,1) = .6, α(C,1) = .4.]
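A compact sketch of the trellis computation described on the last two slides (my own illustration, for the deterministic-output case of slide 18, where each state emits exactly one symbol; the state names and probabilities follow the slide's example as reconstructed above):

```python
# Sketch of the trellis computation for an HMM with deterministic output:
# each state emits exactly one symbol, so a trellis column contains only the
# states that generate the current output symbol, and alpha sums over incoming arcs.
def trellis_probability(symbols, trans, emits, start="<eps>"):
    alpha = {start: 1.0}                       # stage 0: alpha(eps, 0) = 1
    for y in symbols:                          # build one stage per output symbol
        stage = {}
        for (s, s_next), p in trans.items():
            if emits.get(s_next) == y and s in alpha:
                stage[s_next] = stage.get(s_next, 0.0) + alpha[s] * p
        alpha = stage                          # ...and forget about the previous stage
    return sum(alpha.values())                 # probability of Y = alpha in the last stage

# The slide-18 example: A and C emit t, D emits o, B emits e.
trans = {("<eps>", "A"): 0.6, ("<eps>", "C"): 0.4,
         ("A", "D"): 0.88, ("C", "D"): 0.1, ("D", "B"): 1.0}
emits = {"A": "t", "C": "t", "D": "o", "B": "e"}
print(trellis_probability("toe", trans, emits))   # 0.6*0.88*1 + 0.4*0.1*1 = 0.568
```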
