Unit 2: Natural Language Learning. Unsupervised Learning (EM, forward-backward, inside-outside)


  1. Natural Language Processing, Spring 2017. Unit 2: Natural Language Learning. Unsupervised Learning (EM, forward-backward, inside-outside). Liang Huang, liang.huang.sh@gmail.com

  2. Review of Noisy-Channel Model

  3. Example 1: Part-of-Speech Tagging. Use a tag bigram model as the language model; the channel model is context-independent.
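  The factorization behind this slide, written out for reference (standard noisy-channel/HMM notation; the formula itself is not on the slide):

      p(t_1 \ldots t_n, w_1 \ldots w_n) \;=\; \prod_{i=1}^{n} p(t_i \mid t_{i-1}) \; p(w_i \mid t_i)

  where p(t_i | t_{i-1}) is the tag-bigram language model and p(w_i | t_i) is the context-independent channel model; tagging returns the argmax over t_1 ... t_n given w_1 ... w_n.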

  4. Ideal vs. Available Data. (Figure contrasting the ideal data with the data actually available.)

  5. Ideal vs. Available Data. HW2 gives ideal data (English phonemes, Japanese phonemes, and the alignment); HW4 gives realistic data (the phoneme pairs only, no alignments):

     English phonemes    Japanese phonemes    alignment (HW2 only)
     EY B AH L           A B E R U            1 2 3 4 4
     AH B AW T           A B A U T O          1 2 3 3 4 4
     AH L ER T           A R A A T O          1 2 3 3 4 4
     EY S                E E S U              1 1 2 2

  6. Incomplete Data / Model

  7. EM: Expectation-Maximization

  8. How to Change m? 1) Hard

  9. How to Change m? 1) Hard

  10. How to Change m? 2) Soft

  11. Fractional Counts. Keep a distribution over all possible hallucinated hidden variables. For W AY N -> W A I N there are three alignments:

     z1: W -> W,    AY -> A,    N -> I N
     z2: W -> W,    AY -> A I,  N -> N
     z3: W -> W A,  AY -> I,    N -> N

     hard-EM counts: 1, 0, 0. Fractional counts: 0.333, 0.333, 0.333, giving
       AY -> A: 0.333, A I: 0.333, I: 0.333
       W  -> W: 0.667, W A: 0.333
       N  -> N: 0.667, I N: 0.333
     Regenerate: p(x, z1) = 2/3 * 1/3 * 1/3, p(x, z2) = 2/3 * 1/3 * 2/3, p(x, z3) = 1/3 * 1/3 * 2/3.
     New fractional counts: 0.25, 0.5, 0.25, giving
       AY -> A I: 0.500, A: 0.250, I: 0.250
       W  -> W: 0.750, W A: 0.250
       N  -> N: 0.750, I N: 0.250
     Eventually the fractional counts converge to 0, 1, 0.

  12. Is EM magic? Well, sort of... How about W EH T -> W E T O, or B IY -> B I I (alignable as B -> B, IY -> I I or as B -> B I, IY -> I)? EM can possibly: (1) learn something correct; (2) learn something wrong; (3) learn nothing at all. But with lots of data it is likely to learn something good.

  13. EM: slow version (non-DP)
     - initialize the conditional probability table to uniform
     - repeat until converged:
       - E-step: for each training example x (here: an (e...e, j...j) pair):
         - for each hidden z (e.g. the alignments z, z', z'' above): compute p(x, z) from the current model
         - p(x) = sum_z p(x, z)   [debug: corpus probability p(data) *= p(x)]
         - for each hidden z = (z_1 z_2 ... z_n), for each i:
           #(z_i) += p(x, z) / p(x);  #(LHS(z_i)) += p(x, z) / p(x)
       - M-step: count-n-divide on the fractional counts => new model:
         p(RHS(z_i) | LHS(z_i)) = #(z_i) / #(LHS(z_i)),  e.g. p(A I | AY) = #(AY -> A I) / #(AY)

  14. EM: slow version (non-DP). The distribution over all possible hallucinated hidden variables for W AY N -> W A I N (the three alignments z1, z2, z3 from slide 11); a code sketch of this loop follows below:
     - fractional counts 1/3, 1/3, 1/3:
       AY -> A: 0.333, A I: 0.333, I: 0.333;  W -> W: 0.667, W A: 0.333;  N -> N: 0.667, I N: 0.333
     - regenerate p(x, z): 2/3 * 1/3 * 1/3, 2/3 * 1/3 * 2/3, 1/3 * 1/3 * 2/3; renormalize by p(x) = 2/27 + 4/27 + 2/27 = 8/27
     - fractional counts 1/4, 1/2, 1/4:
       AY -> A I: 0.500, A: 0.250, I: 0.250;  W -> W: 0.750, W A: 0.250;  N -> N: 0.750, I N: 0.250
     - regenerate p(x, z): 3/4 * 1/4 * 1/4, 3/4 * 1/2 * 3/4, 1/4 * 1/4 * 3/4; renormalize by p(x) = 3/64 + 18/64 + 3/64 = 3/8
     - fractional counts 1/8, 3/4, 1/8
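  A minimal sketch of this slow E-step/M-step loop on the toy example, enumerating the three alignments by hand (Python 3; the variable names and scaffolding are mine, not from the course code):

      # Slow (non-DP) EM on the W AY N -> W A I N example of slides 11-14.
      from collections import defaultdict

      # each alignment z is a list of rules (epron, jseg)
      alignments = [
          [("W", ("W",)),     ("AY", ("A",)),     ("N", ("I", "N"))],  # z1
          [("W", ("W",)),     ("AY", ("A", "I")), ("N", ("N",))],      # z2
          [("W", ("W", "A")), ("AY", ("I",)),     ("N", ("N",))],      # z3
      ]

      # initialize the conditional probability table to uniform (per LHS)
      table = defaultdict(dict)
      for z in alignments:
          for epron, jseg in z:
              table[epron][jseg] = 1.0
      for epron in table:
          for jseg in table[epron]:
              table[epron][jseg] = 1.0 / len(table[epron])

      for it in range(6):
          # E-step: joint p(x, z) under the current model, then p(z | x)
          joints = []
          for z in alignments:
              p = 1.0
              for epron, jseg in z:
                  p *= table[epron][jseg]
              joints.append(p)
          px = sum(joints)                       # p(x) = sum_z p(x, z)
          posteriors = [p / px for p in joints]
          print(it, [round(q, 3) for q in posteriors])
          # fractional counts: each rule in z gets weight p(z | x)
          counts = defaultdict(lambda: defaultdict(float))
          for q, z in zip(posteriors, alignments):
              for epron, jseg in z:
                  counts[epron][jseg] += q
          # M-step: count-n-divide
          for epron in counts:
              total = sum(counts[epron].values())
              for jseg in counts[epron]:
                  table[epron][jseg] = counts[epron][jseg] / total

      # prints 0.333/0.333/0.333, then 0.25/0.5/0.25, then 0.125/0.75/0.125,
      # converging toward 0, 1, 0 as on slide 11.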

  15. EM: fast version (DP)
     - initialize the conditional probability table to uniform
     - repeat until converged:
       - E-step: for each training example x (here: an (e...e, j...j) pair):
         - forward pass from source s to sink t; note forw[t] = p(x) = sum_z p(x, z)
         - backward pass from t to s; note back[t] = 1 and back[s] = forw[t]
         - for each edge (u, v) in the DP graph with label(u, v) = z_i:
           fraccount(z_i) += forw[u] * back[v] * prob(u, v) / p(x)
           (the numerator equals sum over z with (u, v) in z of p(x, z))
       - M-step: count-n-divide on the fractional counts => new model
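  Why that numerator is right (one step the slide leaves implicit): forw[u] sums the probabilities of all partial paths from s to u, and back[v] sums those from v to t, so every full path through the edge (u, v) is counted exactly once:

      \mathrm{forw}[u] \cdot p(u \to v) \cdot \mathrm{back}[v] \;=\; \sum_{z \,:\, (u,v) \in z} p(x, z)

  Dividing by p(x) = forw[t] turns this into the expected (fractional) count of the rule labeling edge (u, v).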

  16. How to avoid enumeration? Dynamic programming: the forward-backward algorithm. Forward is just like Viterbi, replacing max by sum; backward is like reverse Viterbi (also with sum). This covers POS tagging, alignment, crypto, edit-distance, ...; inside-outside extends the same idea to PCFGs, SCFGs, ...

  17. Example Forward Code (for HW5; this example shows forward only):

      n, m = len(eprons), len(jprons)
      forward[0][0] = 1
      for i in xrange(0, n):
          epron = eprons[i]
          for j in forward[i]:
              for k in range(1, min(m-j, 3)+1):
                  jseg = tuple(jprons[j:j+k])
                  score = forward[i][j] * table[epron][jseg]
                  forward[i+1][j+k] += score
      totalprob *= forward[n][m]

     (Figure: the trellis for W AY N -> W A I N, rows i = 0..3 over the English phonemes W, AY, N and columns j = 0..4 over the Japanese phonemes W, A, I, N.)
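  The snippet above assumes forward, table, eprons, and jprons already exist; a self-contained runnable version (Python 3's range instead of xrange; the toy data is mine, chosen to match the W AY N example, not the actual HW5 input) could be:

      # Runnable version of the slide's forward pass. The table values are
      # toy numbers: uniform over the candidate segments per English phoneme.
      from collections import defaultdict

      eprons = ["W", "AY", "N"]
      jprons = ["W", "A", "I", "N"]
      table = {
          "W":  {("W",): 1/2, ("W", "A"): 1/2},
          "AY": {("A",): 1/3, ("A", "I"): 1/3, ("I",): 1/3},
          "N":  {("N",): 1/2, ("I", "N"): 1/2},
      }

      n, m = len(eprons), len(jprons)
      forward = defaultdict(lambda: defaultdict(float))
      forward[0][0] = 1.0
      for i in range(n):
          epron = eprons[i]
          for j in list(forward[i]):
              for k in range(1, min(m - j, 3) + 1):
                  jseg = tuple(jprons[j:j+k])
                  if jseg not in table[epron]:
                      continue                   # unseen segment: prob 0
                  forward[i+1][j+k] += forward[i][j] * table[epron][jseg]

      print(forward[n][m])  # p(x) = 1/12 + 1/12 + 1/12 = 0.25 (3 alignments)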

  18. Example Forward Code, shown against the DP graph. Same code as slide 17; the figure draws the trellis with node (i, j) holding forw[i][j], and an edge from (i, j) to (i+1, j+k) labeled with the rule for jprons[j:j+k] (e.g. AY -> A I) entering a node holding back[i+1][j+k]. Boundary conditions: forw[s] = back[t] = 1.0, and forw[t] = back[s] = p(x).
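  The slides show forward only; continuing from the runnable setup above, a matching backward pass plus the edge-posterior fractional counts of slide 15 might look like this (my sketch, not the homework solution):

      # Backward pass over the same trellis: back[i][j] sums path
      # probabilities from node (i, j) to the sink (n, m); back[0][0]
      # must equal forward[n][m] = p(x).
      back = defaultdict(lambda: defaultdict(float))
      back[n][m] = 1.0
      for i in reversed(range(n)):
          epron = eprons[i]
          for j in range(m + 1):
              for k in range(1, min(m - j, 3) + 1):
                  jseg = tuple(jprons[j:j+k])
                  if jseg in table[epron]:
                      back[i][j] += table[epron][jseg] * back[i+1][j+k]
      assert abs(back[0][0] - forward[n][m]) < 1e-12

      # E-step fractional counts: forw[u] * prob(u, v) * back[v] / p(x)
      px = forward[n][m]
      counts = defaultdict(lambda: defaultdict(float))
      for i in range(n):
          epron = eprons[i]
          for j in list(forward[i]):
              for k in range(1, min(m - j, 3) + 1):
                  jseg = tuple(jprons[j:j+k])
                  if jseg in table[epron]:
                      counts[epron][jseg] += (forward[i][j] * table[epron][jseg]
                                              * back[i+1][j+k] / px)
      # reproduces slide 11's first E-step, e.g. counts["W"][("W",)] == 2/3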

  19. EM: fast version (DP). (Slide 15 repeated.)

  20. EM

  21. Why does EM increase p(data) iteratively?

  22. Why does EM increase p(data) iteratively? EM converges to a local maximum; the argument uses a convex auxiliary function and KL-divergence.
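  Filling in the standard argument those keywords point to (not spelled out on the slide): for any distribution q over the hidden variable z, Jensen's inequality gives a lower bound on the log-likelihood,

      \log p(x;\theta) \;=\; \log \sum_z q(z)\,\frac{p(x,z;\theta)}{q(z)} \;\ge\; \sum_z q(z) \log \frac{p(x,z;\theta)}{q(z)} \;=\; F(q,\theta),

  and the gap is exactly \mathrm{KL}\big(q(z) \,\|\, p(z \mid x;\theta)\big), which is where KL-divergence enters. The E-step sets q(z) = p(z \mid x;\theta_t), making the bound tight at \theta_t; the M-step maximizes the auxiliary F(q,\theta) over \theta. Hence

      \log p(x;\theta_{t+1}) \;\ge\; F(q,\theta_{t+1}) \;\ge\; F(q,\theta_t) \;=\; \log p(x;\theta_t),

  so p(data) never decreases, and EM converges to a local maximum.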

  23. How to maximize the auxiliary? For the three alignments of W AY N -> W A I N with posteriors p(z | x) = 0.5, p(z' | x) = 0.3, p(z'' | x) = 0.2: just count-n-divide on the fractional data (as if running MLE on complete data). Equivalently, pretend the complete-data corpus contains 5 copies of z, 3 copies of z', and 2 copies of z''. A sketch of count-n-divide follows below.
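  Count-n-divide is just per-LHS normalization of the fractional counts; a minimal sketch (the function name is mine):

      # count-n-divide: relative-frequency (MLE) estimation from fractional
      # counts, where counts[lhs][rhs] is the fractional count of lhs -> rhs
      # (e.g. as accumulated by the E-step sketches above).
      def count_n_divide(counts):
          table = {}
          for lhs, rhss in counts.items():
              total = sum(rhss.values())              # #(LHS)
              table[lhs] = {rhs: c / total for rhs, c in rhss.items()}
          return table

      # Weighting the three alignments by 0.5 / 0.3 / 0.2 yields the same
      # table as MLE on a corpus holding 5, 3, and 2 copies of them.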
