  1. TECHNICAL UNIVERSITY OF VALENCIA (UPV), DEPARTMENT OF COMPUTER SYSTEMS AND COMPUTATION (DSIC). COMBINING TRANSLATION MODELS IN STATISTICAL MACHINE TRANSLATION. Jesús Andrés-Ferrer (jandres@dsic.upv.es), Ismael García-Varea (ivarea@info-ab.uclm.es), Francisco Casacuberta (fcn@dsic.upv.es)

  2. Contents
     1 Introduction
     2 Decision Theory
       2.1 Loss Function
     3 Statistical Machine Translation
       3.1 Quadratic Loss Function
       3.2 Linear Loss Functions
       3.3 Log-Linear Models
     4 Experimental Results
     5 Conclusions

  3. 1 Introduction
     - Translate a source sentence f ∈ F* into a target sentence e ∈ E*.
     - Brown et al. (1993) approached the problem of MT from a purely statistical point of view: a pattern recognition problem with the set of classes E*.
     - Optimal Bayes' classification rule:  ê = argmax_{e ∈ E*} { p(e | f) }   (1)
     - Applying Bayes' theorem ⇒ the inverse translation rule (ITR):  ê = argmax_{e ∈ E*} { p(e) · p(f | e) }   (2)
     - Two open problems: the model problem, and the search problem, which is NP-hard (Knight, 1999; Udupa and Maji, 2006).
     - Several search algorithms have been proposed to solve the search problem efficiently (Brown et al.; Wang and Waibel, 1997; Yaser et al., 1999; Germann et al., 2001; Jelinek, 1969; García-Varea and Casacuberta, 2001; Tillmann and Ney, 2003).

  4. Introduction
     - Many SMT systems (Och et al., 1999; Och and Ney, 2004; Koehn et al., 2003; Zens et al., 2002) have proposed the direct translation rule (DTR):  ê = argmax_{e ∈ E*} { p(e) · p(e | f) }   (3)
       - A heuristic version of the ITR.
       - Easier search algorithm for some of the translation models.
       - Its statistical theoretical foundation was unclear for a long time.
       - (Andrés-Ferrer et al., 2007) explain its use within decision theory.

  5. 2 Decision Theory
     - A classification problem is a decision problem:
       - A set of objects: X
       - A set of classes or actions: Ω = {ω_1, ..., ω_C} for each object x
       - A loss function: ℓ(ω_k | x, ω_j)
     - A classification system is a classification function c : X → Ω.
     - The conditional risk given x:  R(ω_k | x) = Σ_{ω_j ∈ Ω} ℓ(ω_k | x, ω_j) p(ω_j | x)   (4)
     - The global risk of a classification function:  R(c) = E_x[R(c(x) | x)] = ∫_X R(c(x) | x) p(x) dx   (5)
     - Best system?
       - Minimise the global risk.
       - Minimising the conditional risk for each x ⇒ minimises the global risk.
       - Bayes' classification rule:  ĉ(x) = argmin_{ω ∈ Ω} R(ω | x)   (6)
       - For each loss function there is one optimal classification rule.
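To make Bayes' rule (6) concrete, here is a minimal sketch (in Python, with hypothetical toy classes, posteriors, and losses, not taken from the talk) of computing the conditional risk (4) and picking the risk-minimising class:

```python
# Minimal sketch of Bayes' classification rule (6): choose the class that
# minimises the conditional risk (4) for the observed object x.
# All classes, posteriors, and loss values are hypothetical toy numbers.

def conditional_risk(candidate, posterior, loss):
    """R(w_k | x) = sum over w_j of loss(w_k | x, w_j) * p(w_j | x)."""
    return sum(loss(candidate, true) * p for true, p in posterior.items())

def bayes_classify(posterior, loss):
    """Rule (6): c(x) = argmin over w of R(w | x)."""
    return min(posterior, key=lambda w: conditional_risk(w, posterior, loss))

# Toy posterior p(w | x) over three classes for some fixed object x.
posterior = {"w1": 0.5, "w2": 0.3, "w3": 0.2}

# Under a 0-1 loss the rule picks the most probable class ...
zero_one = lambda k, j: 0 if k == j else 1
print(bayes_classify(posterior, zero_one))  # -> w1

# ... but an asymmetric loss can make a different decision optimal:
# proposing w1 when the truth is w3 costs 5 instead of 1, so w2 wins.
asymmetric = lambda k, j: 0 if k == j else (5 if (k, j) == ("w1", "w3") else 1)
print(bayes_classify(posterior, asymmetric))  # -> w2
```

This also illustrates the last point above: changing the loss function changes the optimal classification rule, even though the posterior is unchanged.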

  6. 2.1 Loss Function
     - Quadratic loss functions:  ℓ(ω_k | x, ω_j) = 0 if ω_k = ω_j, and ℓ(x, ω_k, ω_j) otherwise   (7)
       - Optimal classification rule:  ĉ(x) = argmin_{ω_k ∈ Ω} Σ_{ω_j ≠ ω_k} ℓ(x, ω_k, ω_j) p(ω_j | x)   (8)
       - Search space: O(|Ω|²); can be prohibitive for some problems.
         - Rough approximations of the sum over ω_j ≠ ω_k
         - N-best lists

  7. Loss Function
     - Linear loss functions:  ℓ(ω_k | x, ω_j) = 0 if ω_k = ω_j, and ℓ(x, ω_j) otherwise   (9)
       - ℓ(·) depends on the object x and on the correct class ω_j, but does NOT depend on the class ω_k proposed by the system.
       - Optimal classification rule (Andrés-Ferrer et al., 2007), derived in the sketch after this slide:  ĉ(x) = argmax_{ω ∈ Ω} { p(ω | x) ℓ(x, ω) }   (10)
       - Search space: O(|Ω|)
     - The 0-1 loss function is usually assumed:  ℓ(ω_k | x, ω_j) = 0 if ω_k = ω_j, and 1 otherwise   (11)
       - Optimal classification rule:  ĉ(x) = argmax_{ω ∈ Ω} { p(ω | x) }   (12)
       - Different kinds of errors are not distinguished.
       - Not especially appropriate in some cases, e.g. infinite class problems.
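The step from the linear loss (9) to rule (10) hinges on ℓ(·) not depending on the proposed class ω_k; a short derivation sketch:

```latex
\begin{align*}
R(\omega_k \mid x)
  &= \sum_{\omega_j \neq \omega_k} \ell(x,\omega_j)\, p(\omega_j \mid x) \\
  &= \underbrace{\sum_{\omega_j \in \Omega} \ell(x,\omega_j)\, p(\omega_j \mid x)}_{\text{constant w.r.t.\ } \omega_k}
     \; - \; \ell(x,\omega_k)\, p(\omega_k \mid x)
\end{align*}
% Minimising R(w_k | x) over w_k is therefore the same as maximising
% l(x, w_k) p(w_k | x), which is rule (10); one pass over the classes
% suffices, hence the O(|Omega|) search space.
```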

  8. 3 Statistical Machine Translation
     - SMT is a decision problem where:
       - Objects: X = F*
       - Classes: Ω = E*
       - Loss function: ℓ(e_k | f, e_j)
     - A 0-1 loss function is often assumed.
     - Classification rule for the 0-1 loss function:  ê = ĉ(f) = argmax_{e_k ∈ Ω} { p(e_k | f) }
     - Classification rule for the 0-1 loss function + Bayes' theorem:  ê = ĉ(f) = argmax_{e_k ∈ Ω} { p(f | e_k) p(e_k) }
     - This loss function is not especially appropriate for SMT: the set of classes is countably infinite.
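As a toy illustration of the two rules above (all sentences and probabilities are made up, and a real decoder searches E* rather than a fixed candidate list):

```python
# Toy 0-1 loss decoding in SMT via the ITR: e_hat = argmax p(f|e) p(e).
# The candidates and the model scores below are hypothetical.

f = "una habitación doble"
candidates = ["a double room", "one double room", "a room double"]

p_e = {"a double room": 0.040,                 # language model p(e)
       "one double room": 0.010,
       "a room double": 0.001}
p_f_given_e = {"a double room": 0.30,          # inverse translation model p(f|e)
               "one double room": 0.35,
               "a room double": 0.60}

e_hat = max(candidates, key=lambda e: p_f_given_e[e] * p_e[e])
print(e_hat)  # -> "a double room" (0.012 beats 0.0035 and 0.0006)
```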

  9. 3.1 Quadratic Loss Function
     - Quadratic loss function in SMT:  ℓ(e_k | f, e_j) = 0 if e_k = e_j, and ℓ(f, e_k, e_j) otherwise
     - Classification rule:  ê = argmin_{e_k ∈ E*} Σ_{e_j ≠ e_k} ℓ(f, e_k, e_j) p(e_j | f)   (13)
     - Allows introducing the evaluation error metric as the loss:
       - ℓ(e_k | f, e_j) = BLEU(e_k, e_j)
       - ℓ(e_k | f, e_j) = WER(e_k, e_j)
     - Metric loss functions (Schlüter and Ney, 2005)
     - Quadratic search space; approximation: N-best lists (Kumar and Byrne, 2004)
     - Introduce a kernel (Cortes et al., 2005) as the loss function
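A sketch of the N-best approximation with a WER loss, in the spirit of (Kumar and Byrne, 2004); the N-best list and its posteriors are hypothetical:

```python
# Rule (13) restricted to an N-best list, using word error rate as the
# quadratic loss l(e_k | f, e_j) = WER(e_k, e_j).  Toy values throughout.

def edit_distance(a, b):
    """Word-level Levenshtein distance, rolling-array dynamic programming."""
    d = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
    return d[-1]

def wer(hyp, ref):
    return edit_distance(hyp.split(), ref.split()) / max(len(ref.split()), 1)

# Hypothetical N-best list with normalised posteriors p(e_j | f).
nbest = {"a double room": 0.5, "one double room": 0.3, "a room double": 0.2}

def mbr_decode(nbest):
    # e_hat = argmin over e_k of sum_{e_j != e_k} WER(e_k, e_j) p(e_j | f)
    return min(nbest, key=lambda ek:
               sum(wer(ek, ej) * p for ej, p in nbest.items() if ej != ek))

print(mbr_decode(nbest))  # -> "a double room"
```

The double loop over the list is the quadratic cost referred to above, which is why N is kept small in practice.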

  10. 3.2 Linear Loss Functions
     - Linear loss function:  ℓ(e_k | f, e_j) = 0 if e_k = e_j, and ℓ(f, e_j) otherwise
     - Classification rule:  ê = argmax_{e ∈ E*} { p(e | f) ℓ(f, e) }
     - Inverse translation rule (ITR): using ℓ(f, e_j) = 1 and Bayes' theorem ⇒ ê = argmax_{e_j ∈ E*} { p(f | e_j) p(e_j) }
     - Direct translation rule (DTR): using ℓ(f, e_j) = p(e_j) ⇒ ê = argmax_{e_j ∈ E*} { p(e_j | f) p(e_j) }
     - Inverse form of the DTR (IFDTR): applying Bayes' theorem to the DTR ⇒ ê = argmax_{e_j ∈ E*} { p(e_j)² p(f | e_j) }
       - Disagreement between DTR and IFDTR gives a measure of model asymmetries.
     - Direct and inverse translation rule (I&DTR): using ℓ(f, e_j) = p(f, e_j) ⇒ ê = argmax_{e_j ∈ E*} { p(e_j | f) p(f | e_j) p(e_j) }
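To see how these rules can disagree in practice when the direct and inverse models are trained separately, a toy comparison (all model scores are hypothetical):

```python
# Toy comparison of the four linear-loss decision rules on one source
# sentence.  p_e, p_fe (inverse model), and p_ef (direct model) are
# hypothetical scores: with perfectly consistent models, DTR and IFDTR
# would agree, so any disagreement between them signals model asymmetry.

candidates = ["e1", "e2", "e3"]
p_e  = {"e1": 0.05, "e2": 0.04, "e3": 0.01}   # language model p(e)
p_fe = {"e1": 0.20, "e2": 0.60, "e3": 0.30}   # inverse model p(f|e)
p_ef = {"e1": 0.30, "e2": 0.26, "e3": 0.44}   # direct model p(e|f)

rules = {
    "ITR":   lambda e: p_fe[e] * p_e[e],            # argmax p(f|e) p(e)
    "DTR":   lambda e: p_ef[e] * p_e[e],            # argmax p(e|f) p(e)
    "IFDTR": lambda e: p_e[e] ** 2 * p_fe[e],       # argmax p(e)^2 p(f|e)
    "I&DTR": lambda e: p_ef[e] * p_fe[e] * p_e[e],  # argmax p(e|f) p(f|e) p(e)
}

for name, score in rules.items():
    print(name, max(candidates, key=score))
# Here DTR picks e1 while its inverse form IFDTR picks e2: the two rules
# would coincide only if p_ef were consistent with p_fe and p_e via Bayes.
```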

  11. 3.3 Log-Linear Models
     - Most current SMT systems use log-linear models (Och and Ney, 2004; Mariño et al., 2006):
       p(e | f) ≈ exp( Σ_{m=1}^{M} λ_m h_m(f, e) ) / Σ_{e'} exp( Σ_{m=1}^{M} λ_m h_m(f, e') )
     - Using the ITR with the previous model gives the classification rule:  ê = argmax_{e ∈ E*} Σ_{m=1}^{M} λ_m h_m(f, e)
     - h_m is usually the logarithm of a statistical model that approximates a probability distribution (h_m(f, e) = log p_m(f | e), h_m(f, e) = log p_m(e | f), h_m(f, e) = log p_m(e), ...).
     - Decision theory also explains these models:
       - They can be understood as a linear loss function with  ℓ(f, e) = p(e | f)^{-1} Π_{m=1}^{M} f_m(f, e)^{λ_m},  where f_m(f, e) = exp[h_m(f, e)].
       - This defines a family of loss functions depending on the hyperparameters λ_1^M:  { p(e | f)^{-1} Π_{m=1}^{M} f_m(f, e)^{λ_m} | ∀λ_m : m ∈ [1, M] }
       - Solve the optimisation problem experimentally (with a validation set).
       - Use these hyperparameters to reduce the evaluation error metric (Och, 2003).
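Since the normalisation term of the log-linear model does not depend on e, it cancels under the argmax; a sketch of the resulting decoding rule, with hypothetical feature values and weights:

```python
import math

# Log-linear classification rule: e_hat = argmax_e sum_m lambda_m h_m(f, e).
# The three toy features are the usual logarithms of component models
# (log p(f|e), log p(e|f), log p(e)); all values and weights are made up.

candidates = ["e1", "e2"]
weights = [1.0, 0.6, 0.4]  # lambda_1..lambda_M, tuned on a validation set

h = {
    "e1": [math.log(0.20), math.log(0.30), math.log(0.05)],
    "e2": [math.log(0.35), math.log(0.25), math.log(0.02)],
}

def score(e):
    return sum(lm * hm for lm, hm in zip(weights, h[e]))

print(max(candidates, key=score))  # -> e2
```

Tuning the weights λ_1^M on a validation set, as in (Och, 2003), amounts to picking one member of the family of loss functions above.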

  12. 4 Experimental Results
     - Aim: test the theory on a small dataset with simple translation models; state-of-the-art models are used in (Andrés-Ferrer et al., 2007).
     - Results with IBM Model 2 (Brown et al., 1993) trained with GIZA++ (Och, 2000).
     - A decoding algorithm for each of the following rules (García-Varea and Casacuberta, 2001):
       - ITR: ê = argmax_{e_j ∈ E*} { p(f | e_j) p(e_j) }
       - DTR: ê = argmax_{e_j ∈ E*} { p(e_j | f) p(e_j) }
       - IFDTR: ê = argmax_{e ∈ E*} { p(e)² p(f | e) }
       - Two versions of the I&DTR (I&DTR-D and I&DTR-I): ê = argmax_{e_j ∈ E*} { p(e_j | f) p(f | e_j) p(e_j) }
     - The Spanish-English TOURIST task (Amengual et al., 1996):
       - Human-to-human communication situations at the front desk of a hotel.
       - Semi-automatically produced from a small seed corpus of travel guide booklets.
       - Test: 1 K sentences randomly selected.
       - Training sets of exponentially increasing sizes, from 1 K to 128 K, plus 170 K.

                      Test set          Train set
                      Spa      Eng      Spa      Eng
       sentences           1 K               170 K
       avg. length    12.7     12.6     12.9     13.0
       vocabulary     518      393      688      514
       singletons     107      90       12       7
       perplexity     3.62     2.95     3.50     2.89

  13. Asymmetry of Model 2  [Figure: WER vs. training-set size (1 K to 128 K sentences) for DTR, DTR-N, and IFDTR]

  14. WER  [Figure: WER vs. training-set size (1 K to 128 K sentences) for IFDTR, ITR, I&DTR-D, and I&DTR-I]

  15. SER  [Figure: SER vs. training-set size (1 K to 128 K sentences) for DTR-N, ITR, IFDTR, I&DTR-D, and I&DTR-I]
