

SLIDE 1

TECHNICAL UNIVERSITY OF VALENCIA (UPV) DEPARTMENT OF COMPUTER SYSTEMS AND COMPUTATION (DSIC)

COMBINING TRANSLATION MODELS IN

STATISTICAL MACHINE TRANSLATION

JESÚS ANDRÉS-FERRER ISMAEL GARCÍA-VAREA FRANCISCO CASACUBERTA jandres@dsic.upv.es ivarea@info-ab.uclm.es fcn@dsic.upv.es

SLIDE 2

Contents

1 Introduction
2 Decision Theory
  2.1 Loss Function
3 Statistical Machine Translation
  3.1 Quadratic Loss Function
  3.2 Linear Loss Functions
  3.3 Log-Linear Models
4 Experimental Results
5 Conclusions

Jesús Andrés Ferrer, DSIC ITI UPV, TMI 2007

SLIDE 3

1 Introduction

- Translate a source sentence $f \in F^*$ into a target sentence $e \in E^*$
- Brown et al. (1993) approached the MT problem from a purely statistical point of view
- A pattern recognition problem with the set of classes $E^*$
- Optimal Bayes' classification rule:

  $\hat{e} = \arg\max_{e \in E^*} \{ p(e \mid f) \}$   (1)

- Applying Bayes' theorem yields the inverse translation rule (ITR):

  $\hat{e} = \arg\max_{e \in E^*} \{ p(e) \cdot p(f \mid e) \}$   (2)

- Two remaining problems:
  - The model problem
  - The search problem, which is NP-hard (Knight, 1999; Udupa and Maji, 2006)
- Several search algorithms have been proposed to solve the search problem efficiently (Brown et al., 1990; Wang and Waibel, 1997; Yaser et al., 1999; Germann et al., 2001; Jelinek, 1969; García-Varea and Casacuberta, 2001; Tillmann and Ney, 2003)
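As a toy sketch (in Python, with made-up probability tables rather than a trained model), the ITR of equation (2) is just an arg max over candidate translations:

```python
import math

# Minimal sketch of the inverse translation rule (2): pick the target
# sentence maximising p(e) * p(f|e), working in log space.
# The two tables below are invented toy values, not real model scores.
def itr_decode(candidates, log_p_e, log_p_f_given_e):
    """candidates: target sentences; the dicts hold log-probabilities."""
    return max(candidates, key=lambda e: log_p_e[e] + log_p_f_given_e[e])

log_p_e = {"good": math.log(0.7), "bad": math.log(0.3)}
log_p_f_given_e = {"good": math.log(0.2), "bad": math.log(0.9)}
best = itr_decode(["good", "bad"], log_p_e, log_p_f_given_e)
# 0.7 * 0.2 = 0.14 < 0.3 * 0.9 = 0.27, so the rule prefers "bad"
```

The search problem is hidden inside the `max`: over $E^*$ the candidate set is infinite, which is why the decoding algorithms cited above are needed.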


SLIDE 4

Introduction

- Many SMT systems (Och et al., 1999; Och and Ney, 2004; Koehn et al., 2003; Zens et al., 2002) have proposed the use of the direct translation rule (DTR):

  $\hat{e} = \arg\max_{e \in E^*} \{ p(e) \cdot p(e \mid f) \}$   (3)

  - A heuristic version of the ITR
  - An easier search algorithm for some of the translation models
  - Its statistical theoretical foundation was unclear for a long time
  - (Andrés-Ferrer et al., 2007) provide an explanation of its use within decision theory


SLIDE 5

2 Decision Theory

- A classification problem is a decision problem:
  - A set of objects: $X$
  - A set of classes or actions: $\Omega = \{\omega_1, \ldots, \omega_C\}$ for each object $x$
  - A loss function: $l(\omega_k \mid x, \omega_j)$, the cost of proposing $\omega_k$ when the correct class of $x$ is $\omega_j$
- A classification system is a classification function $c : X \to \Omega$
- The conditional risk given $x$:

  $R(\omega_k \mid x) = \sum_{\omega_j \in \Omega} l(\omega_k \mid x, \omega_j)\, p(\omega_j \mid x)$   (4)

- The global risk of a classification function:

  $R(c) = E_x[R(c(x) \mid x)] = \int_X R(c(x) \mid x)\, p(x)\, dx$   (5)

- Which system is best?
  - Minimise the global risk
  - Minimising the conditional risk for each $x$ minimises the global risk
  - Bayes' classification rule:

    $\hat{c}(x) = \arg\min_{\omega \in \Omega} R(\omega \mid x)$   (6)

  - For each loss function there is one optimal classification rule
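The risk minimisation of equations (4) and (6) can be sketched for a two-class toy problem; the posteriors and loss values below are invented numbers, not from any real task:

```python
# Sketch of Bayes' rule (6): choose the class minimising the
# conditional risk R(w|x) of equation (4).
def bayes_decision(classes, posterior, loss):
    """posterior[w] = p(w|x); loss(wk, wj) = cost of deciding wk when wj is true."""
    def risk(wk):
        return sum(loss(wk, wj) * posterior[wj] for wj in classes)
    return min(classes, key=risk)

posterior = {"A": 0.6, "B": 0.4}
zero_one = lambda wk, wj: 0.0 if wk == wj else 1.0
# With the 0-1 loss the rule reduces to arg max posterior, i.e. "A"
map_choice = bayes_decision(["A", "B"], posterior, zero_one)
# A loss that penalises missing "B" three times as much flips the decision
asym = lambda wk, wj: 0.0 if wk == wj else (3.0 if wj == "B" else 1.0)
asym_choice = bayes_decision(["A", "B"], posterior, asym)
```

This illustrates the last point above: the same posteriors yield different optimal decisions under different loss functions.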


SLIDE 6

2.1 Loss Function

- Quadratic loss functions (writing $\epsilon$ for the error weight):

  $l(\omega_k \mid x, \omega_j) = \begin{cases} \epsilon(x, \omega_k, \omega_j) & \omega_k \neq \omega_j \\ 0 & \text{otherwise} \end{cases}$   (7)

  - Optimal classification rule:

    $\hat{c}(x) = \arg\min_{\omega_k \in \Omega} \sum_{\omega_j \neq \omega_k} \epsilon(x, \omega_k, \omega_j)\, p(\omega_j \mid x)$   (8)

  - Search space: $O(|\Omega|^2)$, which can be prohibitive for some problems
    - Rough approximations of the sum over $\omega_j \neq \omega_k$
    - N-best lists
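A minimal sketch of the N-best approximation (often called minimum Bayes-risk decoding): the arg min and the sum of rule (8) are restricted to an N-best list. The per-word mismatch loss below is a toy stand-in for a real metric loss, and the posteriors are invented:

```python
# N-best approximation of the quadratic-loss rule (8): score each
# hypothesis by its expected loss against the rest of the list.
def mbr_decode(nbest):
    """nbest: list of (hypothesis_tokens, posterior) pairs."""
    def loss(e_k, e_j):
        # toy loss: per-position word mismatches plus length difference
        return sum(a != b for a, b in zip(e_k, e_j)) + abs(len(e_k) - len(e_j))
    def expected_loss(e_k):
        return sum(loss(e_k, e_j) * p for e_j, p in nbest)
    return min((e for e, _ in nbest), key=expected_loss)

nbest = [(("x",), 0.4), (("y", "z"), 0.3), (("y", "w"), 0.3)]
# The MAP hypothesis ("x",) is isolated, while the two "y ..." hypotheses
# support each other, so the minimum-risk choice is ("y", "z")
choice = mbr_decode(nbest)
```

The example shows why a quadratic loss can disagree with the 0-1 rule: mass spread over similar hypotheses counts in their favour.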


SLIDE 7

Loss Function

- Linear loss functions:

  $l(\omega_k \mid x, \omega_j) = \begin{cases} \epsilon(x, \omega_j) & \omega_k \neq \omega_j \\ 0 & \text{otherwise} \end{cases}$   (9)

  - $\epsilon(\cdot)$ depends on the object $x$ and on the correct class $\omega_j$, but does NOT depend on the class $\omega_k$ proposed by the system
  - Optimal classification rule (Andrés-Ferrer et al., 2007):

    $\hat{c}(x) = \arg\max_{\omega \in \Omega} \{ p(\omega \mid x)\, \epsilon(x, \omega) \}$   (10)

  - Search space: $O(|\Omega|)$
- The 0-1 loss function is usually assumed:

  $l(\omega_k \mid x, \omega_j) = \begin{cases} 1 & \omega_k \neq \omega_j \\ 0 & \text{otherwise} \end{cases}$   (11)

  - Optimal classification rule:

    $\hat{c}(x) = \arg\max_{\omega \in \Omega} \{ p(\omega \mid x) \}$   (12)

  - Different kinds of errors are not distinguished
  - Not especially appropriate in some cases, e.g. infinite-class problems
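A linear loss only reweights the posterior, so the optimal rule (10) stays a single arg max. A toy sketch with invented numbers:

```python
# Sketch of rule (10): for a linear loss the optimal decision is
# arg max p(w|x) * eps(x, w), i.e. a weighted MAP decision.
def linear_loss_decode(classes, posterior, eps):
    return max(classes, key=lambda w: posterior[w] * eps[w])

posterior = {"A": 0.6, "B": 0.4}
# eps == 1 everywhere recovers the plain 0-1 (MAP) decision of rule (12)
map_choice = linear_loss_decode(["A", "B"], posterior, {"A": 1.0, "B": 1.0})
# an eps that up-weights "B" changes the decision without enlarging the search
weighted = linear_loss_decode(["A", "B"], posterior, {"A": 1.0, "B": 2.0})
```

Unlike the quadratic case, the search stays $O(|\Omega|)$: no pairwise sum is needed.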


SLIDE 8

3 Statistical Machine Translation

- SMT is a decision problem where:
  - Objects: $X = F^*$
  - Classes: $\Omega = E^*$
  - Loss function: $l(e_k \mid f, e_j)$
- A 0-1 loss function is often assumed
- Classification rule for the 0-1 loss function:

  $\hat{e} = \hat{c}(f) = \arg\max_{e_k \in \Omega} \{ p(e_k \mid f) \}$

- Classification rule for the 0-1 loss function plus Bayes' theorem:

  $\hat{e} = \hat{c}(f) = \arg\max_{e_k \in \Omega} \{ p(f \mid e_k)\, p(e_k) \}$

- This loss function is not especially appropriate for SMT: the set of classes is countably infinite


SLIDE 9

3.1 Quadratic Loss Function

- Quadratic loss function in SMT:

  $l(e_k \mid f, e_j) = \begin{cases} \epsilon(f, e_k, e_j) & e_k \neq e_j \\ 0 & \text{otherwise} \end{cases}$

- Classification rule:

  $\hat{e} = \arg\min_{e_k \in E^*} \sum_{e_j \neq e_k} \epsilon(f, e_k, e_j)\, p(e_j \mid f)$   (13)

- Allows the evaluation error metric to be introduced as the loss:
  - $l(e_k \mid f, e_j) = \mathrm{BLEU}(e_k, e_j)$
  - $l(e_k \mid f, e_j) = \mathrm{WER}(e_k, e_j)$
- Metric loss functions (Schlüter et al., 2005)
- Quadratic search space
- Approximation: N-best lists (Kumar and Byrne, 2004)
- A kernel (Cortes et al., 2005) can be introduced as the loss function


SLIDE 10

3.2 Linear Loss Functions

- Linear loss function:

  $l(e_k \mid f, e_j) = \begin{cases} \epsilon(f, e_j) & e_k \neq e_j \\ 0 & \text{otherwise} \end{cases}$

- Classification rule:

  $\hat{e} = \arg\max_{e \in E^*} \{ p(e \mid f)\, \epsilon(f, e) \}$

- Inverse translation rule (ITR): using $\epsilon(f, e_j) = 1$ and Bayes' theorem:

  $\hat{e} = \arg\max_{e_j \in E^*} \{ p(f \mid e_j)\, p(e_j) \}$

- Direct translation rule (DTR): using $\epsilon(f, e_j) = p(e_j)$:

  $\hat{e} = \arg\max_{e_j \in E^*} \{ p(e_j \mid f)\, p(e_j) \}$

- Inverse form of the DTR (IFDTR): applying Bayes' theorem to the DTR:

  $\hat{e} = \arg\max_{e_j \in E^*} \{ p(e_j)^2\, p(f \mid e_j) \}$

  - Comparing the DTR and the IFDTR gives a measure of model asymmetries
- Direct and inverse translation rule (I&DTR): using $\epsilon(f, e_j) = p(f, e_j)$:

  $\hat{e} = \arg\max_{e_j \in E^*} \{ p(e_j \mid f)\, p(f \mid e_j)\, p(e_j) \}$
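The four rules differ only in how the same three model quantities are combined. A toy sketch with invented tables, in which p(e|f) is derived from p(e) and p(f|e) by Bayes' theorem so that the models are mutually consistent:

```python
# Toy probability tables for a fixed source sentence f (invented values).
p_e = {"a": 0.7, "b": 0.3}                         # p(e)
p_f_e = {"a": 0.2, "b": 0.9}                       # p(f|e)
z = sum(p_e[e] * p_f_e[e] for e in p_e)
p_e_f = {e: p_e[e] * p_f_e[e] / z for e in p_e}    # p(e|f) via Bayes' theorem

# Each rule from the slide is an arg max of a different product.
rules = {
    "ITR":    lambda e: p_e[e] * p_f_e[e],
    "DTR":    lambda e: p_e_f[e] * p_e[e],
    "IFDTR":  lambda e: p_e[e] ** 2 * p_f_e[e],
    "I&DTR":  lambda e: p_e_f[e] * p_f_e[e] * p_e[e],
}
best = {name: max(p_e, key=score) for name, score in rules.items()}
# With consistent models, DTR and IFDTR agree (both reduce to
# arg max p(e)^2 p(f|e)); here they pick "a" while ITR and I&DTR pick "b".
```

When the direct and inverse models are estimated separately, DTR and IFDTR need not agree, which is exactly the model asymmetry measured in the experiments below.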


SLIDE 11

3.3 Log-Linear Models

- Most current SMT systems use log-linear models (Och and Ney, 2004; Mariño et al., 2006):

  $p(e \mid f) \approx \dfrac{\exp \sum_{m=1}^{M} \lambda_m h_m(f, e)}{\sum_{e'} \exp \sum_{m=1}^{M} \lambda_m h_m(f, e')}$

- Using this model, the classification rule becomes:

  $\hat{e} = \arg\max_{e \in E^*} \sum_{m=1}^{M} \lambda_m h_m(f, e)$

- Here $h_m$ is usually the logarithm of a statistical model that approximates a probability distribution ($h_m(f, e) = \log p_m(f \mid e)$, $h_m(f, e) = \log p_m(e \mid f)$, $h_m(f, e) = \log p_m(e)$, ...)
- Decision theory also explains these models:
  - They can be understood as a linear loss function with

    $\epsilon(f, e) = p(e \mid f)^{-1} \prod_{m=1}^{M} f_m(f, e)^{\lambda_m}$

    with $f_m(f, e) = \exp[h_m(f, e)]$
  - This defines a family of loss functions depending on the hyperparameters $\lambda_1^M$:

    $\Big\{\, p(e \mid f)^{-1} \prod_{m=1}^{M} f_m(f, e)^{\lambda_m} \;:\; \lambda_m,\; m \in [1, M] \,\Big\}$

  - The optimisation problem is solved experimentally (with a validation set)
  - These hyperparameters are tuned to reduce the evaluation error metric (Och, 2003)
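A sketch of log-linear decoding with two toy feature functions; the tables are invented, and with $\lambda = (1, 1)$ the score reduces to $\log p(e) + \log p(f \mid e)$, i.e. the ITR:

```python
import math

# Log-linear decoding sketch: arg max_e sum_m lambda_m * h_m(f, e).
# The feature functions are logs of invented probability tables for a
# fixed source sentence f (hypothetical values, not a trained system).
p_e = {"a": 0.7, "b": 0.3}
p_f_e = {"a": 0.2, "b": 0.9}
features = [lambda e: math.log(p_e[e]), lambda e: math.log(p_f_e[e])]

def loglinear_decode(candidates, features, lambdas):
    return max(candidates,
               key=lambda e: sum(lam * h(e) for lam, h in zip(lambdas, features)))

best_itr = loglinear_decode(["a", "b"], features, (1.0, 1.0))  # ITR-like score
best_lm = loglinear_decode(["a", "b"], features, (2.0, 1.0))   # up-weight p(e)
```

Changing the hyperparameters changes the decision, which is why they are tuned on a validation set against the evaluation metric.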


SLIDE 12

4 Experimental Results

- Aim: test the theory on a small dataset with simple translation models
- State-of-the-art models in (Andrés-Ferrer et al., 2007)
- Results with IBM Model 2 (Brown et al., 1993) trained with GIZA++ (Och, 2000)
- One decoding algorithm for each of the following rules (García-Varea and Casacuberta, 2001):
  - ITR: $\hat{e} = \arg\max_{e_j \in E^*} \{ p(f \mid e_j)\, p(e_j) \}$
  - DTR: $\hat{e} = \arg\max_{e_j \in E^*} \{ p(e_j \mid f)\, p(e_j) \}$
  - IFDTR: $\hat{e} = \arg\max_{e \in E^*} \{ p(e)^2\, p(f \mid e) \}$
  - Two versions of the I&DTR (I&DTR-D and I&DTR-I): $\hat{e} = \arg\max_{e_j \in E^*} \{ p(e_j \mid f)\, p(f \mid e_j)\, p(e_j) \}$
- The Spanish-English TOURIST task (Amengual et al., 1996):
  - Human-to-human communication situations at the front desk of a hotel
  - Semi-automatically produced using a small seed corpus from travel guide booklets
  - Test: 1K randomly selected sentences
  - Training sets of exponentially increasing size, from 1K to 128K, plus 170K

|             | Test (Spa) | Test (Eng) | Train (Spa) | Train (Eng) |
|-------------|------------|------------|-------------|-------------|
| sentences   | 1K         |            | 170K        |             |
| avg. length | 12.7       | 12.6       | 12.9        | 13.0        |
| vocabulary  | 518        | 393        | 688         | 514         |
| singletons  | 107        | 90         | 12          | 7           |
| perplexity  | 3.62       | 2.95       | 3.50        | 2.89        |


SLIDE 13

Asymmetry of Model 2

[Figure: WER vs. training-set size (1K to 128K) for the IFDTR, DTR and DTR-N rules]


SLIDE 14

WER

[Figure: WER vs. training-set size (1K to 128K) for the IFDTR, I&DTR-D, ITR and I&DTR-I rules]


SLIDE 15

SER

[Figure: SER vs. training-set size (1K to 128K) for the IFDTR, DTR-N, I&DTR-D, ITR and I&DTR-I rules]


SLIDE 16

Global Results

- Search error (SE) (Germann et al., 2001): a translation error in which the proposed translation has a probability lower than that of the reference translation
- Model error (ME): a translation error in which the proposed translation has a probability greater than that of the reference translation

| Model   | WER  | SER  | BLEU  | SE  | T  |
|---------|------|------|-------|-----|----|
| I&DTR-I | 10.0 | 49.2 | 0.847 | 1.3 | 34 |
| I&DTR-D | 10.6 | 51.6 | 0.844 | 9.7 | 2  |
| IFDTR   | 10.5 | 60.0 | 0.837 | 2.7 | 35 |
| ITR     | 10.7 | 58.1 | 0.843 | 1.9 | 43 |
| DTR-N   | 17.9 | 74.1 | 0.750 | 0.0 | 2  |
| DTR     | 30.3 | 92.4 | 0.535 | 0.0 | 2  |


SLIDE 17

5 Conclusions

- For each loss function there is a different optimal Bayes' rule
- The most interesting loss functions incur a quadratic search space
- The classical 0-1 loss can be improved using a linear loss function
- The framework explains the properties of some outstanding rules: the ITR and the DTR
- Some new rules have been proposed: the I&DTR and the IFDTR
- To increase performance, the best quadratic loss function should be found:

  $l(e_k \mid f, e_j) = \begin{cases} \epsilon(f, e_k, e_j) & e_k \neq e_j \\ 0 & \text{otherwise} \end{cases}$   (14)

- To increase performance while keeping the search space small, the best linear loss function should be found:

  $l(e_k \mid f, e_j) = \begin{cases} \epsilon(f, e_j) & e_k \neq e_j \\ 0 & \text{otherwise} \end{cases}$   (15)


SLIDE 18

Thank you!


SLIDE 19

Questions?


SLIDE 20

References

[Amengual et al. 1996] J.C. Amengual, J.M. Benedí, M.A. Castaño, A. Marzal, F. Prat, E. Vidal, J.M. Vilar, C. Delogu, A. di Carlo, H. Ney, and S. Vogel. 1996. Definition of a machine translation task and generation of corpora. Technical report D4, Instituto Tecnológico de Informática, September. ESPRIT, EuTrans IT-LTR-OS-20268.

[Andrés-Ferrer et al. 2007] J. Andrés-Ferrer, D. Ortiz-Martínez, I. García-Varea, and F. Casacuberta. 2007. On the use of different loss functions in statistical pattern recognition applied to machine translation. To appear in Pattern Recognition Letters.

[Brown et al. 1993] P. F. Brown et al. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263-311.

[Brown et al. 1990] P. F. Brown et al. 1990. A statistical approach to machine translation. Computational Linguistics, 16(2):79-85.

[Cortes et al. 2005] Corinna Cortes, Mehryar Mohri, and Jason Weston. 2005. A general regression technique for learning transductions. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 153-160, New York, NY, USA. ACM Press.

[García-Varea and Casacuberta 2001] I. García-Varea and F. Casacuberta. 2001. Search algorithms for statistical machine translation based on dynamic programming and pruning techniques. In Proc. of MT Summit VIII, pages 115-120, Santiago de Compostela, Spain.

[Germann et al. 2001] U. Germann et al. 2001. Fast decoding and optimal decoding for machine translation. In Proc. of ACL 2001, pages 228-235.

[Jelinek 1969] F. Jelinek. 1969. A fast sequential decoding algorithm using a stack. IBM Journal of Research and Development, 13:675-685.

[Knight 1999] Kevin Knight. 1999. Decoding complexity in word-replacement translation models. Computational Linguistics, 25(4):607-615.

[Koehn et al. 2003] P. Koehn, F. J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proceedings of the Human Language Technology and North American Association for Computational Linguistics Conference (HLT/NAACL), Edmonton, Canada, May.

[Kumar and Byrne 2004] S. Kumar and W. Byrne. 2004. Minimum Bayes-risk decoding for statistical machine translation.

[Mariño et al. 2006] J.B. Mariño, R. E. Banchs, J.M. Crego, A. de Gispert, P. Lambert, J. A. R. Fonollosa, and M. R. Costa-jussà. 2006. N-gram-based machine translation. Computational Linguistics, pages 527-549.

[Och and Ney 2004] F. J. Och and H. Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417-449, December.

[Och et al. 1999] F. J. Och, Christoph Tillmann, and Hermann Ney. 1999. Improved alignment models for statistical machine translation. In Proc. of the Joint SIGDAT Conf. on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 20-28, University of Maryland, College Park, MD, June.

[Och 2000] F. J. Och. 2000. GIZA++: Training of statistical translation models. http://www-i6.informatik.rwth-aachen.de/~och/software/GIZA++.

[Och 2003] F. J. Och. 2003. Minimum error rate training in statistical machine translation. In ACL '03: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pages 160-167, Morristown, NJ, USA.

[Schlüter et al. 2005] R. Schlüter, T. Scharrenbach, V. Steinbiss, and H. Ney. 2005. Bayes risk minimization using metric loss functions. In Proceedings of the European Conference on Speech Communication and Technology (Interspeech), pages 1449-1452, Lisbon, Portugal, September.


SLIDE 21

[Tillmann and Ney 2003] Christoph Tillmann and Hermann Ney. 2003. Word reordering and a dynamic programming beam search algorithm for statistical machine translation. Computational Linguistics, 29(1):97-133, March.

[Udupa and Maji 2006] Raghavendra Udupa and Hemanta K. Maji. 2006. Computational complexity of statistical machine translation. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), pages 25-32, Trento, Italy.

[Wang and Waibel 1997] Ye-Yi Wang and Alex Waibel. 1997. Decoding algorithm in statistical translation. In Proc. of ACL '97, pages 366-372, Madrid, Spain.

[Yaser et al. 1999] A. Yaser et al. 1999. Statistical machine translation: Final report. Technical report, Johns Hopkins University 1999 Summer Workshop on Language Engineering, Center for Language and Speech Processing, Baltimore, MD, USA.

[Zens et al. 2002] R. Zens, F. J. Och, and H. Ney. 2002. Phrase-based statistical machine translation. In Advances in Artificial Intelligence: 25th Annual German Conference on AI, volume 2479 of Lecture Notes in Computer Science, pages 18-32. Springer Verlag, September.
