Sparse Feature Learning Philipp Koehn 3 March 2015 Philipp Koehn - PowerPoint PPT Presentation

Sparse Feature Learning Philipp Koehn 3 March 2015 Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Multiple Component Models 1 Translation Model Language Model Reordering Model Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Component Weights 2 Translation Model .05 .26 Language Model .19 .06 Reordering Model .21 .1 .04 .1 Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Even More Numbers Inside 3 Translation Model p(a | to) = 0.18 .05 p(casa | house) = 0.35 Language p(azur | blue) = 0.77 Model p(la | the) = 0.32 .19 .06 Reordering Model .21 .1 .04 .1 Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Grand Vision 4 • There are millions of parameters – each phrase translation score – each language model n-gram – etc. • Can we train them all discriminatively? • This implies optimization over the entire training corpus Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

5 search space iterative n-best our work n-best rule scores aligned corpus a handful thousands millions Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

6 search space MERT iterative [Och&al. 2003] n-best our work n-best rule scores aligned corpus a handful thousands millions Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

7 search space MERT iterative [Och&al. 2003] n-best our work n-best rule scores aligned corpus a handful thousands millions Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

8 MIRA [Chiang 2007] search SampleRank [Haddow&al. 2011] space MERT PRO iterative [Och&al. 2003] [Hopkins/May 2011] n-best our work n-best rule scores aligned corpus a handful thousands millions Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

9 MIRA [Chiang 2007] search SampleRank [Haddow&al. 2011] space MaxViolation [Yu et al., 2014] MERT PRO iterative [Och&al. 2003] [Hopkins/May 2011] n-best Leave One Out [Wuebker et al., 2012] n-best rule scores aligned corpus a handful thousands millions Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Strategy and Core Problems 10 • Process each sentence pair in the training corpus • Optimize parameters towards producing the reference translation • Reference translation may not be producible by model – optimize towards most similar translation – or, only process sentence pair partially • Avoid overfitting • Large corpora require efficient learning methods Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Sentence Level vs. Corpus Level Error Metric 11 • Optimizing BLEU requires optimizing over the entire training corpus � BLEU ( { e best h j ( e i , f i ) λ i } , { e ref = argmax e i i } ) i j • Life would be easier, if we could sum over sentence level scores � � h j ( e i , f i ) λ i , e ref BLEU’( argmax e i ) i i j • For instance, BLEU+1 Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

12 features Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Core Rule Properties 13 • Frequency of phrase (binned) • Length of phrase – number of source words – number of target words – number of source and target words • Unaligned / added (content) words in phrase pair • Reordering within phrase pair Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Lexical Translation Features 14 • lex ( e ) fires when an output word e is generated • lex ( f, e ) fires when an output word e is generated aligned to a input word f • lex ( NULL , e ) fires when an output word e is generated unaligned • lex ( f, NULL ) fires when an input word e is dropped • Could also be defined on part of speech tags or word classes Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Lexicalized Reordering Features 15 • Replacement of lexicalized reordering model • Features differ by – lexicalized by first or last word of phrase (source or target) – word representation replaced by word class – orientation type Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Domain Features 16 • Indicator feature that the rule occurs in one specific domain • Probability that the rule belongs to one specific domain • Domain-specific lexical translation probabilities Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Syntax Features 17 • If we have syntactic parse trees, many more features – number of nodes of a particular kind – matching of source and target constituents – reordering within syntactic constituents • Parse trees are a by-product of syntax-based models • More on that in future lectures Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Every Number in Model 18 • Phrase pair indicator feature • Target n-gram feature • Phrase pair orientation feature Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

19 perceptron algorithm Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Optimizing Linear Model 20 • We consider each sentence pair ( e i , f i ) and its alignment a i • To simplify notation, we define derivation d i = ( e i , f i , a i ) • Model score is weighted linear combination of feature values h j and weights λ j � score ( λ, d i ) = λ j h j ( d i ) j • Such models are also known as single layer perceptrons Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Reference and Model Best 21 • Besides the reference derivation d ref for sentence pair i and its score i score ( λ, d ref � λ j h j ( d ref i ) = i ) j • We also have the model best translation � d best = argmax d score ( λ i , d i ) = argmax d λ j h j ( d i ) i j • ... and its model score score ( λ, d best � λ j h j ( d best ) = ) i i j • We can view the error in our model as a function of its parameters λ error ( λ, d best , d ref i ) = score ( λ, d best ) − score ( λ, d ref i ) i i Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Stochastic Gradient Descent 22 error( λ ) t n e i d a r g λ optimal λ current λ • We cannot analytically find the optimum of the curve error ( λ ) • We can compute the gradient d dλ error ( λ ) at any point • We want to follow the gradient towards the optimal λ value Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Stochastic Gradient Descent 23 • We want to minimize the error error ( λ, d best , d ref i ) = score ( λ, d best ) − score ( λ, d ref i ) i i • In stochastic gradient descent, we follow direction of gradient d d λ error ( λ, d best , d ref i ) i • For each λ j , we compute the gradient pointwise d d error ( λ j , d best , d ref score ( λ, d best ) − score ( λ, d ref i ) = i ) i i d λ j d λ j Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Stochastic Gradient Descent 24 • Gradient with respect to λ j d d error ( λ j , d best , d ref � λ j ′ h j ′ ( d best � λ j ′ h j ′ ( d ref i ) = ) − i ) i i d λ j d λ j j ′ j ′ • For λ ′ j � = λ j , the terms λ j ′ h j ′ ( d i ) are constant, so they disappear d d error ( λ j , d best , d ref λ j h j ( d best ) − λ j h j ( d ref i ) = i ) i i d λ j d λ j • The derivative of a linear function is its factor d error ( λ j , d best , d ref i ) = h j ( d best ) − h j ( d ref i ) i i d λ j = λ j − ( h j ( d best ) − h j ( d ref ⇒ Our model update is λ new i )) i j Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Intuition 25 • Feature values in model best translation • Feature values in reference translation • Intuition: – promote features whose value is bigger in reference – demote features whose value is bigger in model best Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Algorithm 26 Input: set of sentence pairs ( e , f ), set of features Output: set of weights λ for each feature 1: λ i = 0 for all i 2: while not converged do for all foreign sentences f do 3: d best = best derivation according to model 4: d ref = reference derivation 5: if d best � = d ref then 6: for all features h i do 7: λ i += h i ( d ref ) − h i ( d best ) 8: end for 9: end if 10: end for 11: 12: end while Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

27 generating the reference Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Failure to Generate Reference 28 • Reference translation may be anywhere in this box all English sentences produceable by model covered by search • If produceable by model → we can compute feature scores • If not → we can not Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015

Sparse Feature Learning Philipp Koehn 3 March 2015 Philipp Koehn - PowerPoint PPT Presentation

Sparse Feature Learning Philipp Koehn 3 March 2015 Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015 Multiple Component Models 1 Translation Model Language Model Reordering Model Philipp Koehn Machine

Decision Tree Prof. Seungchul Lee Industrial AI Lab. Feature Test Feature 1 Feature 2 Feature

Sparse Feature Learning Philipp Koehn 1 March 2016 Philipp Koehn Machine Translation: Sparse

Sparse Matrices Example Of Sparse Matrices diagonal tridiagonal sparse many elements are

A Distinctive Feature of A Distinctive Feature of A Distinctive Feature of A Distinctive Feature

Outline Reducing Dimensionality Feature Selection 1 Steven J Zeil Feature Extraction 2

Sparse Matrices sparse many elements are zero dense few elements are zero Example Of

Earth: The Feature Presentation - feature, landscape, topography Earth: The Feature Presentation

Reducing Dimensionality Steven J Zeil Old Dominion Univ. Fall 2010 1 Feature Selection

Understanding Sparse JL for Feature Hashing Meena Jagadeesan Harvard University (Class of 2020)

Sparse tensors are a natural way of representing real-world data 1 Sparse tensors are a natural

MLSS 06 - Canberra Elements Hierarchical Basis Sparse Grids Sparse Grids Combination

CNBC Matlab Mini-Course Sparse Matrices Sparse matrices provide an efficient means to store

Tutorial: TF-Ranking for sparse features Tutorial: TF-Ranking for sparse features This tutorial

Parallel Numerical Algorithms Chapter 4 Sparse Linear Systems Section 4.1 Direct Methods

Extremal results for sparse pseudorandom graphs Yufei Zhao Massachusetts Institute of Technology

Feature Extraction 7-1 Ronald Peikert SciVis 2007 - Feature Extraction What are features?

Pretzelocity Pretzelocity & & Quark Angular Momentum Quark Angular Momentum Bo-Qiang Ma

Using Python and XMPP to build a decentralized social network J er ome Poisson (Goffi)

Joseph John Thomson 1856-1940 Discovery of electron and e/m measurement First mass spectrometer

Physics 2D Lecture Slides Lecture 14: Feb 1 st 2005 Vivek Sharma UCSD Physics Compton Effect:

AUTOMATED SOFTWARE PROTECTION FOR THE MASSES AGAINST SIDE-CHANNEL ATTACKS SIDE CHANNEL ATTACKS

WELCOME! BOARD MEETING | F EBRUARY 23, 2018 ADVISORY & VOTING MEMBERS | CALL TO ORDER BOARD

GASP: a Generic Approach to Secure network Protocols Olivier Levillain May 13th 2020 O.

Security protocols: formal models and verification Sergiu Bursuc School of Computer Science,

Sambuz

Useful Links

Newsletter

Mail Us

Sparse Feature Learning Philipp Koehn 3 March 2015 Philipp Koehn - PowerPoint PPT Presentation

Sparse Feature Learning Philipp Koehn 3 March 2015 Philipp Koehn Machine Translation: Sparse Feature Learning 3 March 2015 Multiple Component Models 1 Translation Model Language Model Reordering Model Philipp Koehn Machine

Decision Tree Prof. Seungchul Lee Industrial AI Lab. Feature Test Feature 1 Feature 2 Feature

Sparse Feature Learning Philipp Koehn 1 March 2016 Philipp Koehn Machine Translation: Sparse

Sparse Matrices Example Of Sparse Matrices diagonal tridiagonal sparse many elements are

A Distinctive Feature of A Distinctive Feature of A Distinctive Feature of A Distinctive Feature

Outline Reducing Dimensionality Feature Selection 1 Steven J Zeil Feature Extraction 2

Sparse Matrices sparse many elements are zero dense few elements are zero Example Of

Earth: The Feature Presentation - feature, landscape, topography Earth: The Feature Presentation

Reducing Dimensionality Steven J Zeil Old Dominion Univ. Fall 2010 1 Feature Selection

Understanding Sparse JL for Feature Hashing Meena Jagadeesan Harvard University (Class of 2020)

Sparse tensors are a natural way of representing real-world data 1 Sparse tensors are a natural

MLSS 06 - Canberra Elements Hierarchical Basis Sparse Grids Sparse Grids Combination

CNBC Matlab Mini-Course Sparse Matrices Sparse matrices provide an efficient means to store

Tutorial: TF-Ranking for sparse features Tutorial: TF-Ranking for sparse features This tutorial

Parallel Numerical Algorithms Chapter 4 Sparse Linear Systems Section 4.1 Direct Methods

Extremal results for sparse pseudorandom graphs Yufei Zhao Massachusetts Institute of Technology

Feature Extraction 7-1 Ronald Peikert SciVis 2007 - Feature Extraction What are features?

Pretzelocity Pretzelocity &amp; &amp; Quark Angular Momentum Quark Angular Momentum Bo-Qiang Ma

Using Python and XMPP to build a decentralized social network J er ome Poisson (Goffi)

Joseph John Thomson 1856-1940 Discovery of electron and e/m measurement First mass spectrometer

Physics 2D Lecture Slides Lecture 14: Feb 1 st 2005 Vivek Sharma UCSD Physics Compton Effect:

AUTOMATED SOFTWARE PROTECTION FOR THE MASSES AGAINST SIDE-CHANNEL ATTACKS SIDE CHANNEL ATTACKS

WELCOME! BOARD MEETING | F EBRUARY 23, 2018 ADVISORY &amp; VOTING MEMBERS | CALL TO ORDER BOARD

GASP: a Generic Approach to Secure network Protocols Olivier Levillain May 13th 2020 O.

Security protocols: formal models and verification Sergiu Bursuc School of Computer Science,

Sambuz

Useful Links

Newsletter

Mail Us

Pretzelocity Pretzelocity & & Quark Angular Momentum Quark Angular Momentum Bo-Qiang Ma

WELCOME! BOARD MEETING | F EBRUARY 23, 2018 ADVISORY & VOTING MEMBERS | CALL TO ORDER BOARD