 
Technical University of Valencia (UPV)
Department of Computer Systems and Computation (DSIC)

Combining Translation Models in Statistical Machine Translation

Jesús Andrés-Ferrer (jandres@dsic.upv.es)
Ismael García-Varea (ivarea@info-ab.uclm.es)
Francisco Casacuberta (fcn@dsic.upv.es)
Contents
1 Introduction
2 Decision Theory
  2.1 Loss Function
3 Statistical Machine Translation
  3.1 Quadratic Loss Function
  3.2 Linear Loss Functions
  3.3 Log-Linear Models
4 Experimental Results
5 Conclusions

Jesús Andrés Ferrer, DSIC ITI UPV, TMI 2007
1 Introduction
- Goal: translate a source sentence $f \in F^*$ into a target sentence $e \in E^*$
- Brown et al. (1993) approached the problem of MT from a purely statistical point of view
- MT becomes a pattern recognition problem whose set of classes is $E^*$
- Optimal Bayes' classification rule:
  $$\hat{e} = \arg\max_{e \in E^*} p(e \mid f) \tag{1}$$
- Applying Bayes' theorem yields the inverse translation rule (ITR):
  $$\hat{e} = \arg\max_{e \in E^*} p(e) \cdot p(f \mid e) \tag{2}$$
- The model problem
- The search problem: NP-hard (Knight, 1999; Udupa and Maji, 2006)
- Several search algorithms have been proposed to solve this problem efficiently (Brown and others; Wang and Waibel, 1997; Yaser and others, 1999; German and others, 2001; Jelinek, 1969; García-Varea and Casacuberta, 2001; Tillmann and Ney, 2003)
Introduction (continued)
- Many SMT systems (Och et al., 1999; Och and Ney, 2004; Koehn et al., 2003; Zens et al., 2002) use the direct translation rule (DTR):
  $$\hat{e} = \arg\max_{e \in E^*} p(e) \cdot p(e \mid f) \tag{3}$$
  - It is a heuristic version of the ITR
  - It allows an easier search algorithm for some translation models
  - Its statistical foundation remained unclear for a long time
  - Andrés-Ferrer et al. (2007) explain its use within decision theory
2 Decision Theory
- A classification problem is a decision problem defined by:
  - A set of objects: $X$
  - A set of classes or actions for each object $x$: $\Omega = \{\omega_1, \ldots, \omega_C\}$
  - A loss function $l(\omega_k \mid x, \omega_j)$: the loss of proposing class $\omega_k$ for object $x$ when the correct class is $\omega_j$
- A classification system is a classification function $c : X \to \Omega$
- The conditional risk given $x$:
  $$R(\omega_k \mid x) = \sum_{\omega_j \in \Omega} l(\omega_k \mid x, \omega_j) \, p(\omega_j \mid x) \tag{4}$$
- Global risk for a classification function:
  $$R(c) = E_x[R(c(x) \mid x)] = \int_X R(c(x) \mid x) \, p(x) \, dx \tag{5}$$
- Best system?
  - The one that minimises the global risk
  - Minimising the conditional risk for each $x$ minimises the global risk
  - Bayes' classification rule:
    $$\hat{c}(x) = \arg\min_{\omega \in \Omega} R(\omega \mid x) \tag{6}$$
  - For each loss function there is one optimal classification rule
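The conditional risk (4) and the Bayes rule (6) can be sketched for a finite class set as follows. This is only an illustration under stated assumptions: the object x is held fixed, the posterior p(ω | x) is given as a plain dictionary, and the names `conditional_risk`, `bayes_decision`, `zero_one_loss` and `costly_loss` are hypothetical, not taken from the slides.

```python
from typing import Callable, Dict

Posterior = Dict[str, float]          # maps class -> p(class | x) for a fixed x
Loss = Callable[[str, str], float]    # loss(proposed, correct) for that same x


def conditional_risk(proposed: str, posterior: Posterior, loss: Loss) -> float:
    """Eq. (4): expected loss of proposing `proposed` under the posterior."""
    return sum(loss(proposed, correct) * p for correct, p in posterior.items())


def bayes_decision(posterior: Posterior, loss: Loss) -> str:
    """Eq. (6): the class with minimum conditional risk."""
    return min(posterior, key=lambda w: conditional_risk(w, posterior, loss))


def zero_one_loss(proposed: str, correct: str) -> float:
    """All errors cost the same."""
    return 0.0 if proposed == correct else 1.0


def costly_loss(proposed: str, correct: str) -> float:
    """Toy loss: proposing 'a' when 'c' is correct is ten times worse."""
    if proposed == correct:
        return 0.0
    return 10.0 if (proposed, correct) == ("a", "c") else 1.0


# Toy posterior (assumed values, for illustration only).
posterior = {"a": 0.5, "b": 0.3, "c": 0.2}
decision_01 = bayes_decision(posterior, zero_one_loss)
decision_costly = bayes_decision(posterior, costly_loss)
```

With the 0-1 loss the most probable class is chosen; with the toy asymmetric loss the decision moves away from it, which illustrates that each loss function induces its own optimal rule.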
2.1 Loss Function
- Quadratic loss functions:
  $$l(\omega_k \mid x, \omega_j) = \begin{cases} 0 & \omega_k = \omega_j \\ \epsilon(x, \omega_k, \omega_j) & \text{otherwise} \end{cases} \tag{7}$$
  - Optimal classification rule:
    $$\hat{c}(x) = \arg\min_{\omega_k \in \Omega} \sum_{\omega_j \neq \omega_k} \epsilon(x, \omega_k, \omega_j) \, p(\omega_j \mid x) \tag{8}$$
  - Search space: $O(|\Omega|^2)$
  - Can be prohibitive for some problems
    - Rough approximations of the sum $\sum_{\omega_j \neq \omega_k}$
    - $N$-best lists
Loss Function (continued)
- Linear loss functions:
  $$l(\omega_k \mid x, \omega_j) = \begin{cases} 0 & \omega_k = \omega_j \\ \epsilon(x, \omega_j) & \text{otherwise} \end{cases} \tag{9}$$
  - The function $\epsilon(\cdot)$:
    - Depends on the object $x$
    - Depends on the correct class $\omega_j$
    - Does NOT depend on the class $\omega_k$ proposed by the system
  - Optimal classification rule (Andrés-Ferrer et al., 2007):
    $$\hat{c}(x) = \arg\max_{\omega \in \Omega} p(\omega \mid x) \, \epsilon(x, \omega) \tag{10}$$
  - Search space: $O(|\Omega|)$
- The 0-1 loss function is usually assumed:
  $$l(\omega_k \mid x, \omega_j) = \begin{cases} 0 & \omega_k = \omega_j \\ 1 & \text{otherwise} \end{cases} \tag{11}$$
  - Optimal classification rule:
    $$\hat{c}(x) = \arg\max_{\omega \in \Omega} p(\omega \mid x) \tag{12}$$
  - Different kinds of errors are not distinguished
  - Not especially appropriate in some cases:
    - Infinite class problems
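Rule (10) follows from (9) because the conditional risk of proposing ω_k equals S(x) − ε(x, ω_k) p(ω_k | x), where S(x) = Σ_j ε(x, ω_j) p(ω_j | x) does not depend on ω_k. Minimising the risk is therefore the same as maximising p(ω | x) ε(x, ω). The sketch below checks this on a toy example; the function names and the numbers are assumptions made for illustration.

```python
from typing import Dict


def risk_linear(proposed: str, posterior: Dict[str, float],
                eps: Dict[str, float]) -> float:
    """Conditional risk under the linear loss (9): sum over all wrong classes."""
    return sum(eps[w] * p for w, p in posterior.items() if w != proposed)


def decide_by_risk(posterior: Dict[str, float], eps: Dict[str, float]) -> str:
    """Brute-force Bayes rule (6): one risk sum per candidate."""
    return min(posterior, key=lambda w: risk_linear(w, posterior, eps))


def decide_linear(posterior: Dict[str, float], eps: Dict[str, float]) -> str:
    """Rule (10): a single pass maximising p(w | x) * eps(x, w)."""
    return max(posterior, key=lambda w: posterior[w] * eps[w])


# Toy values for a fixed object x (assumed, for illustration only).
posterior = {"a": 0.5, "b": 0.3, "c": 0.2}
eps_weighted = {"a": 1.0, "b": 2.0, "c": 1.0}   # missing 'b' costs double
eps_uniform = {"a": 1.0, "b": 1.0, "c": 1.0}    # reduces to the 0-1 loss (11)

weighted_choice = decide_linear(posterior, eps_weighted)
uniform_choice = decide_linear(posterior, eps_uniform)
```

Setting ε to 1 for every class recovers the 0-1 rule (12), while a non-uniform ε can change the decision without leaving the linear search space.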
3 Statistical Machine Translation
- SMT is a decision problem where:
  - Objects: $X = F^*$
  - Classes: $\Omega = E^*$
  - Loss function: $l(e_k \mid f, e_j)$
- A 0-1 loss function is often assumed
- Classification rule for the 0-1 loss function:
  $$\hat{e} = \hat{c}(f) = \arg\max_{e_k \in \Omega} p(e_k \mid f)$$
- Classification rule for the 0-1 loss function plus Bayes' theorem:
  $$\hat{e} = \hat{c}(f) = \arg\max_{e_k \in \Omega} p(f \mid e_k) \, p(e_k)$$
- This loss function is not especially appropriate for SMT: the set of classes is countably infinite
3.1 Quadratic Loss Function
- Quadratic loss function in SMT:
  $$l(e_k \mid f, e_j) = \begin{cases} 0 & e_k = e_j \\ \epsilon(f, e_k, e_j) & \text{otherwise} \end{cases}$$
- Classification rule:
  $$\hat{e} = \arg\min_{e_k \in E^*} \sum_{e_j \neq e_k} \epsilon(f, e_k, e_j) \, p(e_j \mid f) \tag{13}$$
- It allows the evaluation metric to be used as the loss:
  - $l(e_k \mid f, e_j) = \mathrm{BLEU}(e_k, e_j)$ (BLEU is a similarity score, so in practice a loss such as $1 - \mathrm{BLEU}$ is needed)
  - $l(e_k \mid f, e_j) = \mathrm{WER}(e_k, e_j)$
- Metric loss functions (R. Schlüter and Ney, 2005)
- Quadratic search space
- Approximation: $N$-best lists (Kumar and Byrne, 2004)
- A kernel can also be introduced as the loss function (Cortes et al., 2005)
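The N-best approximation of rule (13) can be sketched as below, with a word-level WER as the loss. It is an illustrative sketch only: both the candidate set and the sum are restricted to the N-best list, the scores are renormalised over that list, and the sentences, scores and function names are made up for the example rather than taken from the cited work.

```python
from typing import Dict, List


def edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance computed row by row."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(prev[j] + 1,              # extra word in hyp
                           cur[j - 1] + 1,           # word of ref not in hyp
                           prev[j - 1] + (h != r)))  # substitution or match
        prev = cur
    return prev[-1]


def wer_loss(hyp: str, ref: str) -> float:
    """Edit distance normalised by the reference length."""
    ref_words = ref.split()
    return edit_distance(hyp.split(), ref_words) / max(len(ref_words), 1)


def nbest_risk(candidate: str, nbest: Dict[str, float]) -> float:
    """Sum in Eq. (13), restricted to the N-best list and renormalised."""
    total = sum(nbest.values())
    return sum(wer_loss(candidate, other) * score / total
               for other, score in nbest.items() if other != candidate)


def mbr_decode(nbest: Dict[str, float]) -> str:
    """Pick the list entry with minimum approximate risk (quadratic in N)."""
    return min(nbest, key=lambda cand: nbest_risk(cand, nbest))


# Toy 3-best list with assumed posterior scores.
nbest = {
    "where is the lift": 0.40,
    "i want a room": 0.35,
    "i want the room": 0.25,
}
map_choice = max(nbest, key=nbest.get)   # 0-1 loss decision
mbr_choice = mbr_decode(nbest)           # WER loss decision
```

In this toy list the most probable hypothesis is an outlier, so the WER-based rule prefers a hypothesis close to the other two, whereas the 0-1 rule keeps the outlier.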
3.2 Linear Loss Functions
- Linear loss function:
  $$l(e_k \mid f, e_j) = \begin{cases} 0 & e_k = e_j \\ \epsilon(f, e_j) & \text{otherwise} \end{cases}$$
- Classification rule:
  $$\hat{e} = \arg\max_{e \in E^*} p(e \mid f) \, \epsilon(f, e)$$
- Inverse translation rule (ITR):
  - Using $\epsilon(f, e_j) = 1$ and Bayes' theorem:
    $$\hat{e} = \arg\max_{e_j \in E^*} p(f \mid e_j) \, p(e_j)$$
- Direct translation rule (DTR):
  - Using $\epsilon(f, e_j) = p(e_j)$:
    $$\hat{e} = \arg\max_{e_j \in E^*} p(e_j \mid f) \, p(e_j)$$
- Inverse form of DTR (IFDTR):
  - Applying Bayes' theorem to the DTR:
    $$\hat{e} = \arg\max_{e_j \in E^*} p(e_j)^2 \, p(f \mid e_j)$$
  - With exact probabilities DTR and IFDTR are equivalent, so the difference between their results is a measure of the asymmetry between the estimated models
- Direct and inverse translation rule (I&DTR):
  - Using $\epsilon(f, e_j) = p(f, e_j)$:
    $$\hat{e} = \arg\max_{e_j \in E^*} p(e_j \mid f) \, p(f \mid e_j) \, p(e_j)$$
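The four rules above differ only in how they combine three model scores, which the following sketch makes explicit for a toy candidate set. Everything here is assumed for illustration: the `Scores` container, the candidate names `e1` and `e2`, and the probability values. In the `consistent` set the direct model is derived from the other two by Bayes' theorem (normalising over the two candidates), and in the `mismatched` set it is an arbitrary, independently chosen estimate.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Scores:
    lm: float        # p(e), language model
    inverse: float   # p(f | e), inverse translation model
    direct: float    # p(e | f), direct translation model


RULES: Dict[str, Callable[[Scores], float]] = {
    "ITR": lambda s: s.inverse * s.lm,
    "DTR": lambda s: s.direct * s.lm,
    "IFDTR": lambda s: s.lm ** 2 * s.inverse,
    "I&DTR": lambda s: s.direct * s.inverse * s.lm,
}


def decode(candidates: Dict[str, Scores], rule: str) -> str:
    """Return the candidate that maximises the chosen rule's score."""
    return max(candidates, key=lambda e: RULES[rule](candidates[e]))


# p(f) over the toy candidate set: 0.6 * 0.2 + 0.3 * 0.5 = 0.27
P_F = 0.27
consistent = {
    "e1": Scores(lm=0.6, inverse=0.2, direct=0.6 * 0.2 / P_F),
    "e2": Scores(lm=0.3, inverse=0.5, direct=0.3 * 0.5 / P_F),
}
# Same language and inverse models, but a direct model that disagrees with them.
mismatched = {
    "e1": Scores(lm=0.6, inverse=0.2, direct=0.2),
    "e2": Scores(lm=0.3, inverse=0.5, direct=0.8),
}

consistent_choices = {rule: decode(consistent, rule) for rule in RULES}
mismatched_choices = {rule: decode(mismatched, rule) for rule in RULES}
```

When the direct model is consistent with Bayes' theorem, DTR and IFDTR select the same candidate; once the direct model is estimated independently they may disagree, which is the asymmetry the slide refers to.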
3.3 Log-Linear Models
- Most current SMT systems use log-linear models (Och and Ney, 2004; Marino et al., 2006):
  $$p(e \mid f) \approx \frac{\exp\left[\sum_{m=1}^{M} \lambda_m h_m(f, e)\right]}{\sum_{e'} \exp\left[\sum_{m=1}^{M} \lambda_m h_m(f, e')\right]}$$
- Use the ITR with the previous model to obtain the classification rule:
  $$\hat{e} = \arg\max_{e \in E^*} \sum_{m=1}^{M} \lambda_m h_m(f, e)$$
- Here $h_m$ is usually the logarithm of a statistical model that approximates a probability distribution ($h_m(f, e) = \log p_m(f \mid e)$, $h_m(f, e) = \log p_m(e \mid f)$, $h_m(f, e) = \log p_m(e)$, ...)
- Decision theory also explains these models:
  - A log-linear model can be understood as a linear loss function with:
    $$\epsilon(f, e) = p(e \mid f)^{-1} \prod_{m=1}^{M} f_m(f, e)^{\lambda_m}$$
  - where $f_m(f, e) = \exp[h_m(f, e)]$
  - This defines a family of loss functions that depends on the hyperparameters $\lambda_1^M$:
    $$\left\{ p(e \mid f)^{-1} \prod_{m=1}^{M} f_m(f, e)^{\lambda_m} \;\middle|\; \forall \lambda_m : m \in [1, M] \right\}$$
  - The optimisation problem is solved experimentally, with a validation set
  - These hyperparameters are used to reduce the evaluation error metric (Och, 2003)
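A small sketch of the log-linear decision rule follows. The feature names `lm` and `inverse`, the candidate names and the probabilities are assumptions for illustration. With weights (1, 1) on log p(e) and log p(f | e) the rule coincides with the ITR, and with weights (2, 1) it coincides with the IFDTR, so choosing the weights amounts to choosing a member of the loss-function family.

```python
import math
from typing import Dict

Features = Dict[str, float]   # feature name -> h_m(f, e) for one candidate e


def loglinear_score(features: Features, weights: Dict[str, float]) -> float:
    """Weighted sum of feature functions: sum_m lambda_m * h_m(f, e)."""
    return sum(weights[name] * value for name, value in features.items())


def loglinear_posterior(candidates: Dict[str, Features],
                        weights: Dict[str, float]) -> Dict[str, float]:
    """Normalised model p(e | f), restricted to the given candidate set."""
    scores = {e: loglinear_score(h, weights) for e, h in candidates.items()}
    top = max(scores.values())                       # for numerical stability
    exp_scores = {e: math.exp(s - top) for e, s in scores.items()}
    z = sum(exp_scores.values())
    return {e: v / z for e, v in exp_scores.items()}


def decode(candidates: Dict[str, Features], weights: Dict[str, float]) -> str:
    """The normaliser does not depend on e, so the raw score is enough."""
    return max(candidates, key=lambda e: loglinear_score(candidates[e], weights))


# Toy candidates with assumed model probabilities p(e) and p(f | e).
candidates = {
    "e1": {"lm": math.log(0.6), "inverse": math.log(0.2)},
    "e2": {"lm": math.log(0.3), "inverse": math.log(0.5)},
}
itr_weights = {"lm": 1.0, "inverse": 1.0}     # p(e) * p(f | e)
ifdtr_weights = {"lm": 2.0, "inverse": 1.0}   # p(e)^2 * p(f | e)

itr_choice = decode(candidates, itr_weights)
ifdtr_choice = decode(candidates, ifdtr_weights)
itr_posterior = loglinear_posterior(candidates, itr_weights)
```

In a real system the weights would be tuned on a validation set against the evaluation metric; here they are fixed by hand only to show how different settings reproduce different rules.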
4 Experimental Results
- Aim: test the theory on a small dataset with simple translation models
- Results with state-of-the-art models are given in (Andrés-Ferrer et al., 2007)
- Results with IBM Model 2 (Brown et al., 1993) trained with GIZA++ (Och, 2000)
- A decoding algorithm for each of the following rules (García-Varea and Casacuberta, 2001):
  - ITR: $\hat{e} = \arg\max_{e_j \in E^*} p(f \mid e_j) \, p(e_j)$
  - DTR: $\hat{e} = \arg\max_{e_j \in E^*} p(e_j \mid f) \, p(e_j)$
  - IFDTR: $\hat{e} = \arg\max_{e \in E^*} p(e)^2 \, p(f \mid e)$
  - Two versions of I&DTR (I&DTR-D and I&DTR-I): $\hat{e} = \arg\max_{e_j \in E^*} p(e_j \mid f) \, p(f \mid e_j) \, p(e_j)$
- The Spanish-English TOURIST task (Amengual et al., 1996)
  - Human-to-human communication situations at the front desk of a hotel
  - Semi-automatically produced from a small seed corpus taken from travel guide booklets
  - Test: 1K randomly selected sentences
  - Training sets of exponentially increasing sizes from 1K to 128K, plus the full 170K set

                  Test set          Training set
                  Spa     Eng       Spa     Eng
  sentences          1K               170K
  avg. length     12.7    12.6      12.9    13.0
  vocabulary      518     393       688     514
  singletons      107     90        12      7
  perplexity      3.62    2.95      3.50    2.89
[Figure: Asymmetry of Model 2. WER against training size (1000 to 128000 sentences) for DTR, DTR-N and IFDTR.]
[Figure: WER against training size (1000 to 128000 sentences) for IFDTR, ITR, I&DTR-D and I&DTR-I.]
[Figure: SER against training size (1000 to 128000 sentences) for DTR-N, ITR, IFDTR, I&DTR-D and I&DTR-I.]