

  1. Fast and Adaptive Online Training of Feature-Rich Translation Models. Spence Green, Sida Wang, Daniel Cer, Christopher D. Manning. Stanford University. ACL 2013.

  2. Feature-Rich Research vs. Industry/Evaluations. Feature-rich research: Liang et al. 2006; Tillmann and Zhang 2006; Arun and Koehn 2007; Ittycheriah and Roukos 2007; Watanabe et al. 2007; Chiang et al. 2008; Chiang et al. 2009; Haddow et al. 2011; Hopkins and May 2011; Xiang and Ittycheriah 2011; Cherry and Foster 2012; Chiang 2012; Gimpel 2012; Simianer et al. 2012; Watanabe 2012. Industry/evaluations: n-best/lattice MERT; MIRA (ISI).

  4. Feature-Rich Shared Task Submissions (number of feature-rich systems): 2012 WMT: 0; 2012 IWSLT: 1; 2013 WMT: 2?; 2013 IWSLT: TBD.

  5. Speculation: Entrenchment of MERT. Feature-rich on small tuning sets? Implementation complexity. Open-source availability.

  6. Speculation: Entrenchment of MERT. Feature-rich on small tuning sets? Implementation complexity. Open-source availability. (Pictured: the top-selling phone of 2003.)

  7. Motivation: Why Feature-Rich MT? Make MT more like other machine learning settings. Features for specific errors. Domain adaptation.

  8. Motivation: Why Online MT Tuning? Search: decode more often, better solutions (see Liang and Klein 2009). Computer-aided translation: incremental updating.

  9. Benefits of Our Method. Fast and scalable. Adapts to a dense/sparse feature mix. Not complicated.

  10. Online Algorithm Overview. Updating with an adaptive learning rate. Automatic feature selection via L1 regularization. Loss function: pairwise ranking.

  11. Notation. t: time/update step.

  12. Notation. t: time/update step; w_t: weight vector in R^n.

  13. Notation. t: time/update step; w_t: weight vector in R^n; η: learning rate.

  14. Notation. t: time/update step; w_t: weight vector in R^n; η: learning rate; ℓ_t(w): loss on the t'th example.

  15. Notation. t: time/update step; w_t: weight vector in R^n; η: learning rate; ℓ_t(w): loss on the t'th example; z_{t−1} ∈ ∂ℓ_t(w_{t−1}): a subgradient, drawn from the subdifferential (subgradient set).

  16. Notation. t: time/update step; w_t: weight vector in R^n; η: learning rate; ℓ_t(w): loss on the t'th example; z_{t−1} ∈ ∂ℓ_t(w_{t−1}): a subgradient, drawn from the subdifferential (subgradient set); z_{t−1} = ∇ℓ_t(w_{t−1}) for differentiable loss functions.

  17. Notation. t: time/update step; w_t: weight vector in R^n; η: learning rate; ℓ_t(w): loss on the t'th example; z_{t−1} ∈ ∂ℓ_t(w_{t−1}): a subgradient, drawn from the subdifferential (subgradient set); z_{t−1} = ∇ℓ_t(w_{t−1}) for differentiable loss functions; r(w): regularization function.

  18. Warm-up: Stochastic Gradient Descent. Per-instance update: w_t = w_{t−1} − η z_{t−1}.

  19. Warm-up: Stochastic Gradient Descent. Per-instance update: w_t = w_{t−1} − η z_{t−1}. Issue #1: the learning rate schedule. η/t?

  20. Warm-up: Stochastic Gradient Descent. Per-instance update: w_t = w_{t−1} − η z_{t−1}. Issue #1: the learning rate schedule. η/t? η/√t?

  21. Warm-up: Stochastic Gradient Descent. Per-instance update: w_t = w_{t−1} − η z_{t−1}. Issue #1: the learning rate schedule. η/t? η/√t? η/(1 + γt)? Yuck.

  22. Warm-up: Stochastic Gradient Descent. SGD update: w_t = w_{t−1} − η z_{t−1}. Issue #2: the same step size for every coordinate.

  23. Warm-up: Stochastic Gradient Descent. SGD update: w_t = w_{t−1} − η z_{t−1}. Issue #2: the same step size for every coordinate. Intuitively, we might want: frequent features, small steps (e.g. η/t); rare features, large steps (e.g. η/√t).
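A minimal sketch, in Python/numpy, of the plain SGD update and the decaying step-size schedules mentioned in the slides above; grad_loss (a per-example gradient oracle) and examples are hypothetical stand-ins, not part of the talk.

    import numpy as np

    def sgd(grad_loss, examples, dim, eta=0.1, schedule="inv_sqrt", gamma=0.01):
        """Plain SGD: one update per example, the same step size for every coordinate."""
        w = np.zeros(dim)
        for t, ex in enumerate(examples, start=1):
            z = grad_loss(w, ex)              # (sub)gradient z_{t-1} at the current weights
            if schedule == "inv":             # eta / t
                rate = eta / t
            elif schedule == "inv_sqrt":      # eta / sqrt(t)
                rate = eta / np.sqrt(t)
            else:                             # eta / (1 + gamma * t)
                rate = eta / (1.0 + gamma * t)
            w = w - rate * z                  # w_t = w_{t-1} - eta_t * z_{t-1}
        return w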

  24. SGD: Learning Rate Adaptation. SGD update: w_t = w_{t−1} − η z_{t−1}. Scale the learning rate with A^{−1} ∈ R^{n×n}: w_t = w_{t−1} − η A^{−1} z_{t−1}.

  25. SGD: Learning Rate Adaptation. SGD update: w_t = w_{t−1} − η z_{t−1}. Scale the learning rate with A^{−1} ∈ R^{n×n}: w_t = w_{t−1} − η A^{−1} z_{t−1}. Choices: A^{−1} = I (SGD).

  26. SGD: Learning Rate Adaptation. SGD update: w_t = w_{t−1} − η z_{t−1}. Scale the learning rate with A^{−1} ∈ R^{n×n}: w_t = w_{t−1} − η A^{−1} z_{t−1}. Choices: A^{−1} = I (SGD); A^{−1} = H^{−1} (batch: Newton step).

  27. AdaGrad [Duchi et al. 2011]. Update: w_t = w_{t−1} − η A^{−1} z_{t−1}. Set A^{−1} = G_t^{−1/2}, where G_t = G_{t−1} + z_{t−1} z_{t−1}^⊤.

  28. AdaGrad: Approximations and Intuition. For high-dimensional w_t, use a diagonal G_t: w_t = w_{t−1} − η G_t^{−1/2} z_{t−1}.

  29. AdaGrad: Approximations and Intuition. For high-dimensional w_t, use a diagonal G_t: w_t = w_{t−1} − η G_t^{−1/2} z_{t−1}. Intuition: a √(1/t) schedule on a constant gradient; small steps for frequent features; big steps for rare features. [Duchi et al. 2011]
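A minimal sketch of the diagonal AdaGrad update above, under the same assumptions (hypothetical grad_loss oracle); the accumulated squared gradients give small steps to frequently firing features and large steps to rare ones.

    import numpy as np

    def adagrad(grad_loss, examples, dim, eta=0.1, eps=1e-8):
        """Diagonal AdaGrad: w_t = w_{t-1} - eta * G_t^{-1/2} z_{t-1}, elementwise."""
        w = np.zeros(dim)
        G = np.zeros(dim)                     # diagonal of G_t: running sum of squared gradients
        for ex in examples:
            z = grad_loss(w, ex)
            G += z * z                        # G_t = G_{t-1} + z_{t-1} z_{t-1}^T (diagonal only)
            w -= eta * z / (np.sqrt(G) + eps) # eps guards against division by zero
        return w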

  30. AdaGrad vs. SGD: 2D Illustration. [Plot: optimization trajectories of SGD and AdaGrad on a 2D objective.]

  31. Feature Selection. Traditional approach: frequency cutoffs. Unattractive for large tuning sets (e.g. the bitext).

  32. Feature Selection. Traditional approach: frequency cutoffs. Unattractive for large tuning sets (e.g. the bitext). More principled: L1 regularization, r(w) = ‖w‖_1.

  33. Feature Selection: FOBOS. Two-step update: (1) w_{t−1/2} = w_{t−1} − η z_{t−1}; (2) w_t = argmin_w { (1/2) ‖w − w_{t−1/2}‖_2^2 + λ·r(w) }, where the first term is the proximal term and the second is the regularization. [Duchi and Singer 2009]

  34. Feature Selection: FOBOS. Two-step update: (1) w_{t−1/2} = w_{t−1} − η z_{t−1}; (2) w_t = argmin_w { (1/2) ‖w − w_{t−1/2}‖_2^2 + λ·r(w) }. [Duchi and Singer 2009] Extension: use the AdaGrad update in step (1).

  35. Feature Selection: FOBOS. For L1, FOBOS becomes soft thresholding: w_t = sign(w_{t−1/2}) · [ |w_{t−1/2}| − λ ]_+.

  36. Feature Selection: FOBOS. For L1, FOBOS becomes soft thresholding: w_t = sign(w_{t−1/2}) · [ |w_{t−1/2}| − λ ]_+. Squared-L2 also has a simple closed form.
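A minimal sketch of the L1 soft-thresholding step; the single lam parameter stands in for whatever combination of regularization strength and step size the tuner actually uses, which is an assumption here.

    import numpy as np

    def soft_threshold(w_half, lam):
        """FOBOS closed form for L1: sign(w_{t-1/2}) * max(|w_{t-1/2}| - lam, 0)."""
        return np.sign(w_half) * np.maximum(np.abs(w_half) - lam, 0.0)

    # Small weights are zeroed out (feature selection); large ones are shrunk toward zero.
    print(soft_threshold(np.array([0.3, -0.05, 1.2]), lam=0.1))   # ~ [0.2, -0.0, 1.1]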

  37. Feature Selection: Lazy Regularization. Lazy updating: only update the active coordinates. Big speedup in the MT setting.

  38. Feature Selection: Lazy Regularization. Lazy updating: only update the active coordinates. Big speedup in the MT setting. Easy with FOBOS: let t′_j be the step at which dimension j was last updated, and use λ(t − t′_j).
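A minimal sketch of lazy L1 updating, assuming the penalty skipped for an untouched coordinate can be applied in one shot as λ(t − t′_j); variable names are illustrative.

    import numpy as np

    def lazy_l1_apply(w, last_update, t, active, lam):
        """Apply the L1 penalty accumulated since each active coordinate was last touched."""
        for j in active:                      # only coordinates with a nonzero gradient this step
            missed = t - last_update[j]       # regularization steps skipped for dimension j
            shrink = lam * missed             # accumulated threshold: lambda * (t - t'_j)
            w[j] = np.sign(w[j]) * max(abs(w[j]) - shrink, 0.0)
            last_update[j] = t
        return w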

  39. AdaGrad + FOBOS: Full Algorithm. 1. Additive update: G_t.

  40. AdaGrad + FOBOS: Full Algorithm. 1. Additive update: G_t. 2. Additive update: w_{t−1/2}.

  41. AdaGrad + FOBOS: Full Algorithm. 1. Additive update: G_t. 2. Additive update: w_{t−1/2}. 3. Closed-form regularization: w_t.

  43. AdaGrad + FOBOS: Full Algorithm. 1. Additive update: G_t. 2. Additive update: w_{t−1/2}. 3. Closed-form regularization: w_t. Not complicated. Very fast.
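A sketch of one full online step combining the three updates above (diagonal AdaGrad plus the L1 proximal step). It is a plain numpy illustration, not the authors' Phrasal implementation; grad_loss is again a hypothetical oracle, and scaling the L1 threshold by the per-coordinate step size is an assumption about how the constants combine.

    import numpy as np

    def adagrad_fobos_step(w, G, example, grad_loss, eta=0.1, lam=1e-4, eps=1e-8):
        """One online step: (1) accumulate G_t, (2) AdaGrad gradient step, (3) L1 soft threshold."""
        z = grad_loss(w, example)             # subgradient of the loss at w_{t-1}
        G += z * z                            # step 1: additive update of (diagonal) G_t
        rate = eta / (np.sqrt(G) + eps)       # per-coordinate learning rates
        w_half = w - rate * z                 # step 2: additive update -> w_{t-1/2}
        thresh = lam * rate                   # per-coordinate L1 threshold (see note above)
        w_new = np.sign(w_half) * np.maximum(np.abs(w_half) - thresh, 0.0)  # step 3: closed form
        return w_new, G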

  44. Recap: Pairwise Ranking. For a derivation d, feature map ϕ(d), and references e_{1:k}. Metric: B(d, e_{1:k}) (e.g. BLEU+1). Model score: M(d) = w·ϕ(d).

  45. Recap: Pairwise Ranking. For a derivation d, feature map ϕ(d), and references e_{1:k}. Metric: B(d, e_{1:k}) (e.g. BLEU+1). Model score: M(d) = w·ϕ(d). Pairwise consistency: M(d+) > M(d−) ⟺ B(d+, e_{1:k}) > B(d−, e_{1:k}). [Hopkins and May 2011]

  46. Loss Function: Pairwise Ranking. M(d+) > M(d−) ⟺ w·(ϕ(d+) − ϕ(d−)) > 0.

  47. Loss Function: Pairwise Ranking. M(d+) > M(d−) ⟺ w·(ϕ(d+) − ϕ(d−)) > 0. Loss formulation: difference vector x = ϕ(d+) − ϕ(d−); find w such that w·x > 0; a binary classification problem between x and −x.

  48. Loss Function: Pairwise Ranking. M(d+) > M(d−) ⟺ w·(ϕ(d+) − ϕ(d−)) > 0. Loss formulation: difference vector x = ϕ(d+) − ϕ(d−); find w such that w·x > 0; a binary classification problem between x and −x. Logistic loss: convex, differentiable. [Hopkins and May 2011]
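A minimal sketch of the pairwise logistic loss on a single difference vector x = ϕ(d+) − ϕ(d−), assuming the derivation pair has already been sampled and scored with the metric; the gradient is what the online updates above would consume.

    import numpy as np

    def pairwise_logistic(w, x):
        """Logistic ranking loss for one pair: L(w) = log(1 + exp(-w.x)), with its gradient."""
        margin = np.dot(w, x)                 # want w.x > 0, i.e. M(d+) > M(d-)
        loss = np.log1p(np.exp(-margin))      # convex, differentiable surrogate
        grad = -x / (1.0 + np.exp(margin))    # dL/dw = -x * sigmoid(-margin)
        return loss, grad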

  49. Parallelization. Online algorithms are inherently sequential.

  50. Parallelization. Online algorithms are inherently sequential. Out-of-order updating: w_7 = w_6 − η z_4; w_8 = w_7 − η z_6; w_9 = w_8 − η z_5.

  51. Parallelization. Online algorithms are inherently sequential. Out-of-order updating: w_7 = w_6 − η z_4; w_8 = w_7 − η z_6; w_9 = w_8 − η z_5. Low-latency regret bound: O(√T). [Langford et al. 2009]
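A rough sketch of out-of-order updating with a thread pool: workers compute gradients against possibly stale weights, and updates are applied in whatever order they complete. This only illustrates the idea, not the authors' implementation, and it ignores locking details.

    import numpy as np
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def stale_sgd(grad_loss, examples, dim, eta=0.1, workers=4):
        """Apply gradients in completion order; each may have been computed on stale weights."""
        w = np.zeros(dim)

        def compute(ex):
            return grad_loss(w.copy(), ex)    # snapshot may already be stale when its update lands

        with ThreadPoolExecutor(max_workers=workers) as pool:
            futures = [pool.submit(compute, ex) for ex in examples]
            for fut in as_completed(futures): # completion order need not match submission order
                w -= eta * fut.result()       # e.g. w_8 = w_7 - eta * z_6
        return w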
