SLIDE 1
Fast and Adaptive Online Training of Feature-Rich Translation Models
Spence Green, Sida Wang, Daniel Cer, Christopher D. Manning
Stanford University, ACL 2013
SLIDE 2
Feature-Rich Research vs. Industry/Evaluations
Research (feature-rich): Liang et al. 2006; Tillmann and Zhang 2006; Arun and Koehn 2007; Ittycheriah and Roukos 2007; Watanabe et al. 2007; Chiang et al. 2008; Chiang et al. 2009; Haddow et al. 2011; Hopkins and May 2011; Xiang and Ittycheriah 2011; Cherry and Foster 2012; Chiang 2012; Gimpel 2012; Simianer et al. 2012; Watanabe 2012
Industry/Evaluations: n-best/lattice MERT; MIRA (ISI)
SLIDE 3
Feature-Rich Research vs. Industry/Evaluations
Research (feature-rich): Liang et al. 2006; Tillmann and Zhang 2006; Arun and Koehn 2007; Ittycheriah and Roukos 2007; Watanabe et al. 2007; Chiang et al. 2008; Chiang et al. 2009; Haddow et al. 2011; Hopkins and May 2011; Xiang and Ittycheriah 2011; Cherry and Foster 2012; Chiang 2012; Gimpel 2012; Simianer et al. 2012; Watanabe 2012
Industry/Evaluations: n-best/lattice MERT; MIRA (ISI)
SLIDE 4
Feature-Rich Shared Task Submissions
# feature-rich submissions:
2012 WMT, IWSLT: 1
2013 WMT: 2?
2013 IWSLT: TBD
SLIDE 5
Speculation: Entrenchment of MERT
- Feature-rich on small tuning sets?
- Implementation complexity
- Open-source availability
SLIDE 6
Speculation: Entrenchment of MERT
- Feature-rich on small tuning sets?
- Implementation complexity
- Open-source availability
(Top-selling phone of 2003)
SLIDE 7
Motivation: Why Feature-Rich MT?
- Make MT more like other machine learning settings
- Features for specific errors
- Domain adaptation
SLIDE 8
Motivation: Why Online MT Tuning?
- Search: decode more often, better solutions (see [Liang and Klein 2009])
- Computer-aided translation: incremental updating
SLIDE 9
Benefits of Our Method
- Fast and scalable
- Adapts to a dense/sparse feature mix
- Not complicated
SLIDE 10
Online Algorithm Overview
- Updating with an adaptive learning rate
- Automatic feature selection via L1 regularization
- Loss function: pairwise ranking
SLIDE 11
Notation
t : time/update step
SLIDE 12
Notation
t : time/update step
w_t : weight vector in R^n
SLIDE 13
Notation
t : time/update step
w_t : weight vector in R^n
η : learning rate
SLIDE 14
Notation
t : time/update step
w_t : weight vector in R^n
η : learning rate
ℓ_t(w) : loss on the t'th example
SLIDE 15
Notation
t : time/update step
w_t : weight vector in R^n
η : learning rate
ℓ_t(w) : loss on the t'th example
z_{t−1} ∈ ∂ℓ_t(w_{t−1}) : a subgradient (∂ℓ_t is the subdifferential)
SLIDE 16
Notation
t : time/update step
w_t : weight vector in R^n
η : learning rate
ℓ_t(w) : loss on the t'th example
z_{t−1} ∈ ∂ℓ_t(w_{t−1}) : a subgradient (∂ℓ_t is the subdifferential)
z_{t−1} = ∇ℓ_t(w_{t−1}) : for differentiable loss functions
SLIDE 17
Notation
t : time/update step
w_t : weight vector in R^n
η : learning rate
ℓ_t(w) : loss on the t'th example
z_{t−1} ∈ ∂ℓ_t(w_{t−1}) : a subgradient (∂ℓ_t is the subdifferential)
z_{t−1} = ∇ℓ_t(w_{t−1}) : for differentiable loss functions
r(w) : regularization function
SLIDE 18
Warm-up: Stochastic Gradient Descent
Per-instance update:
w_t = w_{t−1} − η z_{t−1}
SLIDE 19
Warm-up: Stochastic Gradient Descent
Per-instance update:
w_t = w_{t−1} − η z_{t−1}
Issue #1: learning rate schedule
η / t ?
SLIDE 20
Warm-up: Stochastic Gradient Descent
Per-instance update:
w_t = w_{t−1} − η z_{t−1}
Issue #1: learning rate schedule
η / t ?  η / √t ?
SLIDE 21
Warm-up: Stochastic Gradient Descent
Per-instance update:
w_t = w_{t−1} − η z_{t−1}
Issue #1: learning rate schedule
η / t ?  η / √t ?  η / (1 + γt) ?
Yuck.
SLIDE 22
Warm-up: Stochastic Gradient Descent
SGD update:
w_t = w_{t−1} − η z_{t−1}
Issue #2: same step size for every coordinate
SLIDE 23
Warm-up: Stochastic Gradient Descent
SGD update:
w_t = w_{t−1} − η z_{t−1}
Issue #2: same step size for every coordinate
Intuitively, we might want:
- Frequent feature: small steps, e.g. η / t
- Rare feature: large steps, e.g. η / √t
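To make the two issues concrete, here is a minimal NumPy sketch of the plain SGD update (not the paper's implementation); `eta0` and `gamma` are hypothetical knobs for one of the ad hoc schedules above, and note that the same scalar step applies to every coordinate:

```python
import numpy as np

def sgd_step(w, z, t, eta0=0.1, gamma=0.1):
    """One SGD update w_t = w_{t-1} - eta_t * z_{t-1}.

    eta_t follows the ad hoc eta / (1 + gamma * t) schedule; the single
    scalar rate is shared by frequent and rare features alike."""
    eta_t = eta0 / (1.0 + gamma * t)
    return w - eta_t * z
```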
SLIDE 24
SGD: Learning Rate Adaptation
SGD update:
w_t = w_{t−1} − η z_{t−1}
Scale the learning rate with A^{−1} ∈ R^{n×n}:
w_t = w_{t−1} − η A^{−1} z_{t−1}
SLIDE 25
SGD: Learning Rate Adaptation
SGD update:
w_t = w_{t−1} − η z_{t−1}
Scale the learning rate with A^{−1} ∈ R^{n×n}:
w_t = w_{t−1} − η A^{−1} z_{t−1}
Choices:
A^{−1} = I (SGD)
SLIDE 26
SGD: Learning Rate Adaptation
SGD update:
w_t = w_{t−1} − η z_{t−1}
Scale the learning rate with A^{−1} ∈ R^{n×n}:
w_t = w_{t−1} − η A^{−1} z_{t−1}
Choices:
A^{−1} = I (SGD)
A^{−1} = H^{−1} (batch: Newton step)
SLIDE 27
AdaGrad
[Duchi et al. 2011]
Update:
w_t = w_{t−1} − η A^{−1} z_{t−1}
Set A^{−1} = G_t^{−1/2}:
G_t = G_{t−1} + z_{t−1} z_{t−1}^⊤
SLIDE 28
AdaGrad: Approximations and Intuition
For high-dimensional w_t, use diagonal G_t:
w_t = w_{t−1} − η G_t^{−1/2} z_{t−1}
SLIDE 29
AdaGrad: Approximations and Intuition
For high-dimensional w_t, use diagonal G_t:
w_t = w_{t−1} − η G_t^{−1/2} z_{t−1}
Intuition:
- 1/√t schedule on a constant gradient
- Small steps for frequent features
- Big steps for rare features
[Duchi et al. 2011]
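A minimal sketch of the diagonal AdaGrad update under these definitions (NumPy assumed; `eps` is a small constant added here to avoid division by zero):

```python
import numpy as np

def adagrad_step(w, G, z, eta=0.1, eps=1e-8):
    """Diagonal AdaGrad: G accumulates squared (sub)gradients per
    coordinate, so frequently updated features get small steps and
    rarely updated features get large ones."""
    G = G + z * z                         # diagonal of G_t = G_{t-1} + z z^T
    w = w - eta * z / (np.sqrt(G) + eps)  # per-coordinate rate eta * G_t^{-1/2}
    return w, G
```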
SLIDE 30
AdaGrad vs. SGD: 2D Illustration
[Plot: optimization paths of SGD vs. AdaGrad on a 2D objective]
SLIDE 31
Feature Selection
- Traditional approach: frequency cutoffs
- Unattractive for large tuning sets (e.g. bitext)
SLIDE 32
Feature Selection
- Traditional approach: frequency cutoffs
- Unattractive for large tuning sets (e.g. bitext)
- More principled: L1 regularization
r(w) = ‖w‖₁
SLIDE 33
Feature Selection: FOBOS
Two-step update:
w_{t−1/2} = w_{t−1} − η z_{t−1}   (1)
w_t = argmin_w ½ ‖w − w_{t−1/2}‖₂² + λ · r(w)   (2)
[Duchi and Singer 2009]
SLIDE 34
Feature Selection: FOBOS
Two-step update:
w_{t−1/2} = w_{t−1} − η z_{t−1}   (1)
w_t = argmin_w ½ ‖w − w_{t−1/2}‖₂² + λ · r(w)   (2)
[Duchi and Singer 2009]
Extension: AdaGrad update in step (1)
SLIDE 35
Feature Selection: FOBOS
For L1, FOBOS becomes soft thresholding:
w_t = sign(w_{t−1/2}) [ |w_{t−1/2}| − λ ]₊
SLIDE 36
Feature Selection: FOBOS
For L1, FOBOS becomes soft thresholding:
w_t = sign(w_{t−1/2}) [ |w_{t−1/2}| − λ ]₊
Squared L2 also has a simple closed form
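A sketch of the closed-form L1 step, assuming NumPy; `lam` here is the effective threshold (the regularization strength, with any learning-rate scaling folded in):

```python
import numpy as np

def soft_threshold(w_half, lam):
    """FOBOS step (2) for r(w) = ||w||_1:
    w_t = sign(w_half) * max(|w_half| - lam, 0).
    Coordinates with |w_half| <= lam are driven exactly to zero,
    which is what performs the feature selection."""
    return np.sign(w_half) * np.maximum(np.abs(w_half) - lam, 0.0)
```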
SLIDE 37
Feature Selection: Lazy Regularization
- Lazy updating: only update active coordinates
- Big speedup in the MT setting
SLIDE 38
Feature Selection: Lazy Regularization
- Lazy updating: only update active coordinates
- Big speedup in the MT setting
- Easy with FOBOS:
  t′_j : last update of dimension j
  Use λ(t − t′_j)
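A sketch of the lazy bookkeeping under the definitions above (hypothetical names, sparse weights as a dict). For simplicity it assumes a constant effective rate, whereas an AdaGrad rate would vary per step:

```python
def lazy_l1(w, last_update, active, t, lam):
    """Apply the L1 shrinkage a coordinate has skipped since its last
    touch, all at once: threshold lam * (t - t'_j) for dimension j."""
    for j in active:                        # only coordinates in this update
        skipped = t - last_update.get(j, 0)
        thresh = lam * skipped              # accumulated penalty lam * (t - t'_j)
        wj = w.get(j, 0.0)
        w[j] = (1.0 if wj > 0 else -1.0) * max(abs(wj) - thresh, 0.0)
        last_update[j] = t
    return w
```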
SLIDE 39
AdaGrad+FOBOS: Full Algorithm
SLIDE 40
AdaGrad+FOBOS: Full Algorithm
1. Additive update: G_t
2. Additive update: w_{t−1/2}
SLIDE 41
AdaGrad+FOBOS: Full Algorithm
1. Additive update: G_t
2. Additive update: w_{t−1/2}
3. Closed-form regularization: w_t
SLIDE 42
AdaGrad+FOBOS: Full Algorithm
1. Additive update: G_t
2. Additive update: w_{t−1/2}
3. Closed-form regularization: w_t
SLIDE 43
AdaGrad+FOBOS: Full Algorithm
1. Additive update: G_t
2. Additive update: w_{t−1/2}
3. Closed-form regularization: w_t
Not complicated. Very fast.
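Putting the three steps together, a minimal end-to-end sketch (dense NumPy version, without the lazy bookkeeping above). Here `grad_fn`, `eta`, and `lam` are assumed placeholders, and for clarity the L1 threshold is kept constant rather than scaled by the per-coordinate AdaGrad rate as a more faithful implementation would:

```python
import numpy as np

def adagrad_fobos(examples, grad_fn, dim, eta=0.1, lam=1e-3, eps=1e-8):
    """One pass of the online AdaGrad + FOBOS (L1) optimizer."""
    w = np.zeros(dim)
    G = np.zeros(dim)
    for x in examples:
        z = grad_fn(w, x)                           # subgradient of the loss
        G += z * z                                  # 1. additive update: G_t
        w_half = w - eta * z / (np.sqrt(G) + eps)   # 2. additive update: w_{t-1/2}
        w = np.sign(w_half) * np.maximum(           # 3. closed-form L1 shrinkage
            np.abs(w_half) - lam, 0.0)
    return w
```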
SLIDE 44
Recap: Pairwise Ranking
For derivation d, feature map ϕ(d), references e_{1:k}
Metric: B(d, e_{1:k}) (e.g. BLEU+1)
Model score: M(d) = w · ϕ(d)
SLIDE 45
Recap: Pairwise Ranking
For derivation d, feature map ϕ(d), references e_{1:k}
Metric: B(d, e_{1:k}) (e.g. BLEU+1)
Model score: M(d) = w · ϕ(d)
Pairwise consistency:
M(d⁺) > M(d⁻) ⟺ B(d⁺, e_{1:k}) > B(d⁻, e_{1:k})
[Hopkins and May 2011]
SLIDE 46
Loss Function: Pairwise Ranking
M(d⁺) > M(d⁻) ⟺ w · (ϕ(d⁺) − ϕ(d⁻)) > 0
SLIDE 47
Loss Function: Pairwise Ranking
M(d⁺) > M(d⁻) ⟺ w · (ϕ(d⁺) − ϕ(d⁻)) > 0
Loss formulation:
- Difference vector: x = ϕ(d⁺) − ϕ(d⁻)
- Find w so that w · x > 0
- Binary classification problem between x and −x
SLIDE 48
Loss Function: Pairwise Ranking
M(d⁺) > M(d⁻) ⟺ w · (ϕ(d⁺) − ϕ(d⁻)) > 0
Loss formulation:
- Difference vector: x = ϕ(d⁺) − ϕ(d⁻)
- Find w so that w · x > 0
- Binary classification problem between x and −x
- Logistic loss: convex, differentiable
[Hopkins and May 2011]
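A sketch of that loss and its gradient on one difference vector (NumPy assumed):

```python
import numpy as np

def pairwise_logistic(w, x):
    """Logistic loss on the difference vector x = phi(d+) - phi(d-):
    loss = log(1 + exp(-w.x)). Driving w.x positive ranks d+ above d-."""
    s = float(np.dot(w, x))
    loss = np.logaddexp(0.0, -s)     # log(1 + exp(-s)), numerically stable
    grad = -x / (1.0 + np.exp(s))    # gradient in w; the z fed to AdaGrad+FOBOS
    return loss, grad
```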
SLIDE 49
Parallelization
Online algorithms are inherently sequential
SLIDE 50
Parallelization
Online algorithms are inherently sequential
Out-of-order updating:
w_7 = w_6 − η z_4
w_8 = w_7 − η z_6
w_9 = w_8 − η z_5
SLIDE 51
Parallelization
Online algorithms are inherently sequential
Out-of-order updating:
w_7 = w_6 − η z_4
w_8 = w_7 − η z_6
w_9 = w_8 − η z_5
Regret bound still holds under low-latency (bounded-delay) updates
[Langford et al. 2009]
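Illustrative only (not the paper's threading code): updates are applied in whatever order gradients finish, so each gradient may have been computed against a slightly stale weight vector:

```python
def out_of_order_sgd(w, finished_grads, eta=0.1):
    """finished_grads: (example_id, gradient) pairs in completion order,
    e.g. ids [4, 6, 5] while the update clock advances 7, 8, 9.
    Each z was computed on an older w (a stale gradient), which is
    tolerable when the delay is bounded [Langford et al. 2009]."""
    for _, z in finished_grads:
        w = w - eta * z
    return w
```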
SLIDE 52
Translation Quality Experiments
- Arabic–English (Ar–En) and Chinese–English (Zh–En)
- Newswire and mixed-genre experiments
- BOLT bitexts: data up to 2012

        Sentences   Bilingual tokens   Monolingual tokens
Ar–En   6.6M        375M               990M
Zh–En   9.3M        538M
SLIDE 53
MT System
- Phrase-based MT: Phrasal [Cer et al. 2010]
- Dense baseline: MERT
  - Cer et al. 2008 line search
  - Accumulates n-best lists
  - Random starting points, etc.
SLIDE 54
Feature-Rich Baseline: PRO
- Pairwise Ranking Optimization (PRO) [Hopkins and May 2011]
- Batch log-loss minimization
- Phrasal implementation: L-BFGS with L2 regularization
SLIDE 55
Feature-Rich Baseline: PRO
- Pairwise Ranking Optimization (PRO) [Hopkins and May 2011]
- Batch log-loss minimization
- Phrasal implementation: L-BFGS with L2 regularization
- Sanity check: Moses PRO and kb-MIRA (batch) implementations
SLIDE 56
Dense Features
8  Hierarchical lex. reordering
SLIDE 57
Dense Features
8  Hierarchical lex. reordering
5  Moses phrase table features
1  Rule bitext count
1  Unique rule indicator
SLIDE 58
Dense Features
8  Hierarchical lex. reordering
5  Moses phrase table features
1  Rule bitext count
1  Unique rule indicator
1  Word penalty
1  Linear distortion
1  LM
1  Unknown word
19 total
SLIDE 59
Sparse Feature Templates
Discriminative Phrase Table (PT)
- Rule indicator: 𝟙[source phrase ⇒ target phrase]
SLIDE 60
Sparse Feature Templates
Discriminative Phrase Table (PT)
- Rule indicator: 𝟙[… ⇒ space program]
Discriminative Alignments (AL)
- Source word deletion: 𝟙[unaligned source word]
SLIDE 61
Sparse Feature Templates
Discriminative Phrase Table (PT)
- Rule indicator: 𝟙[… ⇒ space program]
Discriminative Alignments (AL)
- Source word deletion: 𝟙[… ⇒ space]
Discriminative Lex. Reordering (LO)
- Phrase orientation: 𝟙[phrase pair, orientation class]
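To make the templates concrete, a hypothetical sketch of how such indicator features might be emitted as name/value pairs (string keys, value 1.0); this is not Phrasal's actual feature API:

```python
def pt_rule_indicator(src_phrase, tgt_phrase):
    """Discriminative phrase table: one indicator per translation rule."""
    return {"PT:%s=>%s" % (" ".join(src_phrase), " ".join(tgt_phrase)): 1.0}

def al_deletion_indicator(src_word):
    """Discriminative alignments: fires when a source word is unaligned."""
    return {"AL:del:%s" % src_word: 1.0}

def lo_orientation_indicator(src_phrase, orientation):
    """Discriminative lex. reordering: phrase paired with its orientation."""
    return {"LO:%s:%s" % (" ".join(src_phrase), orientation): 1.0}
```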
SLIDE 62
Evaluation: NIST OpenMT
- Small tuning set: MT06
- "Large" tuning set: MT0568 (≈4,200 segments)
- BLEU-4 uncased, four references
SLIDE 63
Evaluation: NIST OpenMT
- Small tuning set: MT06
- "Large" tuning set: MT0568 (≈4,200 segments)
- BLEU-4 uncased, four references
- Paper: mixed-genre (bitext) experiments
SLIDE 64
Results: Small Tuning Set (Dense)
              Ar–En               Zh–En
              Tune    Test avg.   Tune    Test avg.
MERT          45.08   50.51       33.73   34.49
This paper    43.16   50.11       32.20   35.25
SLIDE 65
Results: Add More Features
                  Ar–En               Zh–En
                  Tune    Test avg.   Tune    Test avg.
MERT—Dense        45.08   50.51       33.73   34.49
This paper +PT    50.61   50.52       34.92   35.12
SLIDE 66
Results: Add More Features
                  Ar–En               Zh–En
                  Tune    Test avg.   Tune    Test avg.
MERT—Dense        45.08   50.51       33.73   34.49
This paper +PT    50.61   50.52       34.92   35.12
This paper +All   60.85   50.97       39.43   35.31
(MT06 tuning set)
SLIDE 67
Results: Add More Data
                          Ar–En        Zh–En
                          Test avg.    Test avg.
MERT—mt06                 50.51        34.49
MERT—mt0568               50.74        34.55
SLIDE 68
Results: Add More Data
                          Ar–En        Zh–En
                          Test avg.    Test avg.
MERT—mt06                 50.51        34.49
MERT—mt0568               50.74        34.55
This paper +All—mt06      50.97        35.31
SLIDE 69
Results: Add More Data
                          Ar–En          Zh–En
                          Test avg.      Test avg.
MERT—mt06                 50.51          34.49
MERT—mt0568               50.74          34.55
This paper +All—mt06      50.97          35.31
This paper +All—mt0568    52.34 (+1.60)  36.61 (+2.06)
SLIDE 70
Results: Add More Data
                          Ar–En          Zh–En
                          Test avg.      Test avg.
MERT—mt06                 50.51          34.49
MERT—mt0568               50.74          34.55
This paper +All—mt06      50.97          35.31
This paper +All—mt0568    52.34 (+1.60)  36.61 (+2.06)
PRO +All worse than MERT—mt0568
SLIDE 71
Analysis: Zh–En MT06 Tuning (16 threads)
                   Epochs   Min/epoch
MERT Dense         22       180
SLIDE 72
Analysis: Zh–En MT06 Tuning (16 threads)
                   Epochs   Min/epoch
MERT Dense         22       180
PRO +PT            25       35
kb-MIRA* +PT       26       25
This paper +PT     10       10
SLIDE 73
Analysis: Zh–En MT06 Tuning (16 threads)
                   Epochs   Min/epoch
MERT Dense         22       180
PRO +PT            25       35
kb-MIRA* +PT       26       25
This paper +PT     10       10
PRO +All           13       100
This paper +All    5        15
SLIDE 74
Analysis: Zh–En MT06 Tuning (16 threads)
                   Epochs   Min/epoch
MERT Dense         22       180
PRO +PT            25       35
kb-MIRA* +PT       26       25
This paper +PT     10       10
PRO +All           13       100
This paper +All    5        15
MERT—mt0568 tuning takes about 5 days
SLIDE 75
Analysis: Runtime
- Online regret bounds depend on the number of updates
- Large datasets: more updates per epoch
- Fewer epochs to converge
SLIDE 76
Analysis: Runtime
- Online regret bounds depend on the number of updates
- Large datasets: more updates per epoch
- Fewer epochs to converge
- Lazy updating helps:
  w_t : ≈100k features
  z_{t−1} : ≈500 features
SLIDE 77
Analysis: Reordering
Arabic matrix clauses are often verb-initial
SLIDE 78
Analysis: Reordering
Arabic matrix clauses are often verb-initial
Manually selected 208 verb-initial segments (MT09)
SLIDE 79
Analysis: Reordering
Arabic matrix clauses are often verb-initial
Manually selected 208 verb-initial segments (MT09)
32 differed between MERT–Dense and +All
SLIDE 80
Analysis: Reordering
+All correct          18   56.3%
MERT–Dense correct    4    12.5%
Both wrong            10   31.3%
Total                 32
SLIDE 81
Analysis: Reordering
+All correct          18   56.3%
MERT–Dense correct    4    12.5%
Both wrong            10   31.3%
Total                 32
ref:  the newspaper and television reported
MERT: she said the newspaper and television
+All: television and newspaper said
SLIDE 82
Analysis: Domain Adaptation
SLIDE 83
Analysis: Domain Adaptation
             # bitext–5k   # MT0568
programme    185
program      19            449
SLIDE 84
Analysis: Domain Adaptation
             # bitext–5k   # MT0568
programme    185
program      19            449
+PT rules: programme   353   79
+PT rules: program     9     31
SLIDE 85
Caveats and Next Steps
- Single-reference setting: BLEU+1 is unreliable
- Lexicalized features cause overfitting
SLIDE 86
Caveats and Next Steps
- Single-reference setting: BLEU+1 is unreliable
- Lexicalized features cause overfitting
Current work:
- Bitext tuning
- Different loss function
SLIDE 87
Conclusion
- Fast, adaptive, online tuning for MT
SLIDE 88
Conclusion
- Fast, adaptive, online tuning for MT
- Easy to implement
SLIDE 89
Conclusion
- Fast, adaptive, online tuning for MT
- Easy to implement
- Works as well as MERT for Dense
SLIDE 90
Conclusion
- Fast, adaptive, online tuning for MT
- Easy to implement
- Works as well as MERT for Dense
- Sane feature engineering
SLIDE 91
Fast and Adaptive Online Training of Feature-Rich Translation Models
Spence Green, Sida Wang, Daniel Cer, Christopher D. Manning
Stanford University
Try the code in Phrasal: nlp.stanford.edu/software/phrasal/
SLIDE 92
En–De Learning Curve
[Plot: BLEU on newstest2008–2011 vs. tuning epoch (1–10) for the feature-rich model]
SLIDE 93
Sparse Features: Negative Results
- Discriminative LM: "Jane called Sally"
- Phrase boundary features: "Jane || called Sally"
- Alignment constellation: 1-0 0-1
- Target word insertion: "Jane called the Sally"