Sparse Feature Learning
Philipp Koehn, 3 March 2015


  1. Sparse Feature Learning, Philipp Koehn, 3 March 2015

  2. Multiple Component Models (diagram): Translation Model, Language Model, Reordering Model

  3. Component Weights (diagram: each component model is assigned weights, e.g. Translation Model .05 .26, Language Model .19 .06, Reordering Model .21 .1 .04 .1)

  4. Even More Numbers Inside (diagram: beyond the component weights, the models themselves contain many internal scores, e.g. p(a | to) = 0.18, p(casa | house) = 0.35, p(azur | blue) = 0.77, p(la | the) = 0.32)

  5. Grand Vision • There are millions of parameters – each phrase translation score – each language model n-gram – etc. • Can we train them all discriminatively? • This implies optimization over the entire training corpus

  6. (diagram) Overview of discriminative training methods, arranged along two axes: the search space used (n-best lists, iterative n-best, the aligned corpus) and the number of tuned rule scores (a handful, thousands, millions); "our work" is marked on the diagram

  7. (diagram) MERT [Och et al. 2003] added: iterative optimization over n-best lists, tuning a handful of rule scores

  8. (diagram) Repeat of the previous slide

  9. (diagram) Added: MIRA [Chiang 2007], SampleRank [Haddow et al. 2011], and PRO [Hopkins/May 2011]

  10. (diagram) Added: MaxViolation [Yu et al. 2014] and Leave One Out [Wuebker et al. 2012]

  11. Strategy and Core Problems • Process each sentence pair in the training corpus • Optimize parameters towards producing the reference translation • Reference translation may not be producible by the model – optimize towards the most similar translation – or, only process the sentence pair partially • Avoid overfitting • Large corpora require efficient learning methods

  12. Sentence Level vs. Corpus Level Error Metric • Optimizing BLEU requires optimizing over the entire training corpus: BLEU({e_i^best = argmax_{e_i} Σ_j λ_j h_j(e_i, f_i)}, {e_i^ref}) • Life would be easier if we could sum over sentence-level scores: Σ_i BLEU'(argmax_{e_i} Σ_j λ_j h_j(e_i, f_i), e_i^ref) • For instance, BLEU+1
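
As a concrete illustration of a summable sentence-level score, here is a minimal sketch of one common BLEU+1 variant (add-one smoothing of the n-gram precisions); the exact smoothing and brevity penalty details are assumptions for this example, not taken from the slides.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_plus_one(hyp, ref, max_n=4):
    """Sentence-level BLEU with add-one smoothed n-gram precisions (a common
    BLEU+1 variant); an illustrative sketch, not the exact metric of any toolkit."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        ref_counts = ngrams(ref, n)
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        # add-one smoothing keeps higher-order precisions non-zero
        log_prec += math.log((matches + 1) / (total + 1)) / max_n
    brevity = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return brevity * math.exp(log_prec)
```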

  13. features

  14. Core Rule Properties • Frequency of phrase (binned) • Length of phrase – number of source words – number of target words – number of source and target words • Unaligned / added (content) words in phrase pair • Reordering within phrase pair
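
A rough sketch of how such rule-level properties might be turned into sparse features; the feature-name scheme and the frequency bins are invented for illustration.

```python
def core_rule_features(src_phrase, tgt_phrase, alignment, count):
    """Sparse indicator features for one phrase pair (illustrative sketch).

    src_phrase, tgt_phrase: lists of tokens
    alignment: set of (src_index, tgt_index) word alignment points
    count: how often the phrase pair was extracted from the training data
    """
    features = {}
    # frequency of the phrase pair, binned
    for threshold in (1, 2, 5, 10, 100):
        if count <= threshold:
            features[f"count_bin<={threshold}"] = 1.0
            break
    # phrase length features: source, target, and joint
    features[f"src_len={len(src_phrase)}"] = 1.0
    features[f"tgt_len={len(tgt_phrase)}"] = 1.0
    features[f"len={len(src_phrase)},{len(tgt_phrase)}"] = 1.0
    # unaligned (added) words inside the phrase pair
    aligned_src = {i for i, _ in alignment}
    aligned_tgt = {j for _, j in alignment}
    features["unaligned_src"] = float(len(src_phrase) - len(aligned_src))
    features["unaligned_tgt"] = float(len(tgt_phrase) - len(aligned_tgt))
    return features
```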

  15. Lexical Translation Features • lex(e) fires when an output word e is generated • lex(f, e) fires when an output word e is generated aligned to an input word f • lex(NULL, e) fires when an output word e is generated unaligned • lex(f, NULL) fires when an input word f is dropped • Could also be defined on part-of-speech tags or word classes
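
A minimal sketch of firing these lexical features for one aligned sentence pair; the feature-name format and data structures are assumptions made for illustration.

```python
from collections import defaultdict

def lexical_translation_features(src, tgt, alignment):
    """Fire lex(.) indicator features for an aligned sentence pair (sketch).

    src, tgt: lists of tokens; alignment: set of (src_index, tgt_index) points.
    Returns a dict mapping feature name to count.
    """
    features = defaultdict(float)
    aligned_src = {i for i, _ in alignment}
    aligned_tgt = {j for _, j in alignment}
    for j, e in enumerate(tgt):
        features[f"lex({e})"] += 1.0              # output word e generated
        if j not in aligned_tgt:
            features[f"lex(NULL,{e})"] += 1.0     # generated unaligned
    for i, j in alignment:
        features[f"lex({src[i]},{tgt[j]})"] += 1.0  # aligned input/output pair
    for i, f_word in enumerate(src):
        if i not in aligned_src:
            features[f"lex({f_word},NULL)"] += 1.0  # input word dropped
    return dict(features)
```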

  16. Lexicalized Reordering Features • Replacement of the lexicalized reordering model • Features differ by – lexicalized by first or last word of phrase (source or target) – word representation replaced by word class – orientation type

  17. Domain Features • Indicator feature that the rule occurs in one specific domain • Probability that the rule belongs to one specific domain • Domain-specific lexical translation probabilities

  18. Syntax Features • If we have syntactic parse trees, many more features – number of nodes of a particular kind – matching of source and target constituents – reordering within syntactic constituents • Parse trees are a by-product of syntax-based models • More on that in future lectures

  19. Every Number in Model • Phrase pair indicator feature • Target n-gram feature • Phrase pair orientation feature

  20. perceptron algorithm

  21. Optimizing Linear Model • We consider each sentence pair (e_i, f_i) and its alignment a_i • To simplify notation, we define the derivation d_i = (e_i, f_i, a_i) • The model score is a weighted linear combination of feature values h_j and weights λ_j: score(λ, d_i) = Σ_j λ_j h_j(d_i) • Such models are also known as single-layer perceptrons
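
With sparse features stored as dictionaries, the weighted linear combination is a one-liner; this is a minimal sketch, assuming feature values and weights are kept in plain Python dicts keyed by feature name.

```python
def score(weights, features):
    """score(λ, d) = Σ_j λ_j h_j(d) for sparse feature vectors stored as dicts."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

# usage with made-up feature names:
# weights = {"lex(casa,house)": 0.4, "src_len=2": -0.1}
# s = score(weights, {"lex(casa,house)": 1.0, "src_len=2": 1.0})
```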

  22. Reference and Model Best • Besides the reference derivation d_i^ref for sentence pair i and its score score(λ, d_i^ref) = Σ_j λ_j h_j(d_i^ref) • We also have the model-best translation d_i^best = argmax_{d_i} score(λ, d_i) = argmax_{d_i} Σ_j λ_j h_j(d_i) • ... and its model score score(λ, d_i^best) = Σ_j λ_j h_j(d_i^best) • We can view the error of our model as a function of its parameters λ: error(λ, d_i^best, d_i^ref) = score(λ, d_i^best) − score(λ, d_i^ref)

  23. Stochastic Gradient Descent (figure: error(λ) curve, with the gradient at the current λ pointing towards the optimal λ) • We cannot analytically find the optimum of the curve error(λ) • We can compute the gradient d/dλ error(λ) at any point • We want to follow the gradient towards the optimal λ value

  24. Stochastic Gradient Descent • We want to minimize the error error(λ, d_i^best, d_i^ref) = score(λ, d_i^best) − score(λ, d_i^ref) • In stochastic gradient descent, we follow the direction of the gradient d/dλ error(λ, d_i^best, d_i^ref) • For each λ_j, we compute the gradient pointwise: d/dλ_j error(λ, d_i^best, d_i^ref) = d/dλ_j [score(λ, d_i^best) − score(λ, d_i^ref)]

  25. Stochastic Gradient Descent • Gradient with respect to λ_j: d/dλ_j error(λ, d_i^best, d_i^ref) = d/dλ_j [Σ_{j′} λ_{j′} h_{j′}(d_i^best) − Σ_{j′} λ_{j′} h_{j′}(d_i^ref)] • For j′ ≠ j, the terms λ_{j′} h_{j′}(d_i) are constant with respect to λ_j, so they disappear: d/dλ_j error(λ, d_i^best, d_i^ref) = d/dλ_j [λ_j h_j(d_i^best) − λ_j h_j(d_i^ref)] • The derivative of a linear function is its factor: d/dλ_j error(λ, d_i^best, d_i^ref) = h_j(d_i^best) − h_j(d_i^ref) ⇒ Our model update is λ_j^new = λ_j − (h_j(d_i^best) − h_j(d_i^ref))
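
The resulting per-sentence update, written out as code; a minimal sketch assuming sparse feature dicts as above and the step size of 1 implied by the slide.

```python
def perceptron_update(weights, ref_features, best_features, learning_rate=1.0):
    """λ_j ← λ_j − (h_j(d_best) − h_j(d_ref)): promote features of the reference,
    demote features of the model-best derivation. Updates the dict in place."""
    for name in set(ref_features) | set(best_features):
        gradient = best_features.get(name, 0.0) - ref_features.get(name, 0.0)
        weights[name] = weights.get(name, 0.0) - learning_rate * gradient
```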

  26. Intuition • Feature values in model best translation • Feature values in reference translation • Intuition: – promote features whose value is bigger in reference – demote features whose value is bigger in model best

  27. Algorithm
      Input: set of sentence pairs (e, f), set of features
      Output: one weight λ_i for each feature h_i
        λ_i = 0 for all i
        while not converged do
          for all foreign sentences f do
            d_best = best derivation according to the model
            d_ref = reference derivation
            if d_best ≠ d_ref then
              for all features h_i do
                λ_i += h_i(d_ref) − h_i(d_best)
              end for
            end if
          end for
        end while
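
A runnable sketch of this loop in Python. The decode(weights, f) and reference_features(f) callables are hypothetical stand-ins for the decoder's model-best derivation and for a forced decoding of the reference; convergence is approximated by a fixed number of epochs.

```python
def train_perceptron(corpus, decode, reference_features, epochs=10):
    """Structured perceptron training over a corpus of foreign sentences (sketch).

    corpus: iterable of foreign sentences f
    decode(weights, f): hypothetical decoder returning the model-best
        derivation's feature dict under the current weights
    reference_features(f): hypothetical forced decoding of the reference,
        returning its feature dict (assumed producible by the model)
    """
    weights = {}                    # λ_i = 0 for all i (missing keys count as 0)
    for _ in range(epochs):         # "while not converged", approximated by fixed epochs
        for f in corpus:
            best = decode(weights, f)
            ref = reference_features(f)
            if best != ref:
                # promote reference features, demote model-best features
                for name in set(ref) | set(best):
                    weights[name] = (weights.get(name, 0.0)
                                     + ref.get(name, 0.0) - best.get(name, 0.0))
    return weights
```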

  28. generating the reference

  29. Failure to Generate Reference • The reference translation may be anywhere in this box (figure: nested boxes of all English sentences, those producible by the model, and those covered by the search) • If producible by the model → we can compute feature scores • If not → we cannot
