
Evaluation of Machine Translation Quality

Marco Turchi

FBK Trento, Italy (turchi@fbk.eu)
Slides from the presentation by Matteo Negri… and myself

Disclaimer

“More has been written about MT evaluation over the past 50 years than about MT itself”

Hovy et al.: Principles of Context-Based Machine Translation Evaluation. Machine Translation, 16, pp. 1–33, 2002 (attributed to Yorick Wilks)

“It is impossible to write a comprehensive overview of the MT evaluation literature”

Adam Lopez: Statistical Machine Translation. ACM Computing Surveys 40(3), pp. 1–49, August 2008.


Outline

  • Importance of MT evaluation
  • Difficulty of MT evaluation
  • Human evaluation: fluency/adequacy
  • Automatic evaluation:

– Reference-based: BLEU, TER, HTER (chosen among MANY others)
– Reference-free: quality estimation (estimating post-editing effort)

The importance of MT evaluation

  • Answering “How good is an MT system?” as a way to:

– Decide which system to use for a given task
– Assess and compare systems’ performance
– Define the state of the art
– Drive system development and measure improvements
– Decide whether to apply MT at all

  • …Necessary (yes, not sufficient) conditions for progress in any research field
  • Difficult task!


Difficulty of MT evaluation

  • No formal definition of “translation” → no definition of “good translation”
  • The notion of quality is inherently subjective
  • Exact quantification is difficult (especially for long sentences)
  • MT errors are very varied in nature
  • Perfect or very poor translations are easy to score, but what happens in between?


Difficulty of MT evaluation

  • Many different acceptable translations for the same sentence:

– I am [experiencing|suffering from|feeling] a throbbing pain .
– I [feel|can feel|have] a [throbbing pain|painful throbbing] .
– [It is a|It’s in|I’ve got a] throbbing pain .
– It’s throbbing [and it really hurts|with pain] .
– [It’s painful and|It hurts so much] it’s throbbing .


Difficulty of MT evaluation

  • How would you translate:

– It’s raining cats and dogs
– Ace in the hole
– Beat around the bush
– Chew the fat
– Wild goose chase
– Tie one on
– Sunny smile

  • Literally, by its meaning, or with the corresponding idiom (if any)?

Difficulty of MT evaluation

  • Classification of errors: a rather rich taxonomy

Note: error types are not mutually exclusive and often co-occur (Vilar et al. 2006)

Human vs. automatic evaluation

  • Human MT evaluation:

– criteria: adequacy (fidelity) and fluency (intelligibility)
– pros: very accurate, high quality
– cons: expensive, slow, subjective

  • Automatic MT evaluation:

– criteria: “similarity” to professional human translation
– pros: inexpensive, quick, objective
– cons: quality is “slightly” lower than a human check


Human evaluation

Human evaluation

  • Given: the MT output, and the source and/or a reference translation
  • Task: assess the quality of the MT output
  • Metrics:

– Adequacy: does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted? …requires bilingual judges or a reference translation
– Fluency: is the output fluent English? This involves both grammatical correctness and idiomatic word choices. …monolingual judges are sufficient, no reference needed


Human evaluation: adequacy and fluency

  • Source sentence: Le chat entre dans la chambre.

(a) Adequate, fluent translation: The cat enters the room.
(b) Adequate, disfluent translation: The cat enter in the room.
(c) Fluent, inadequate translation: The cats enter the bedroom.
(d) Disfluent, inadequate translation: Bedroom the dogs enters the

Human evaluation: Likert scales

Adequacy:
5 = all meaning
4 = most meaning
3 = much meaning
2 = little meaning
1 = none

Fluency:
5 = flawless English
4 = good English
3 = non-native English
2 = disfluent English
1 = incomprehensible

Human evaluation: subjectivity

[Figure: fluency/adequacy judgments for translations (a)–(d) by JUDGE1, JUDGE2 and JUDGE3]

  • Perfect or very poor translations are easy to score… but what happens in between?

(a) Adequate, fluent translation: The cat enters the room.
(b) Adequate, disfluent translation: The cat enter in the room.
(c) Fluent, inadequate translation: The cats enter the bedroom.
(d) Disfluent, inadequate translation: Bedroom the dogs enters the

Human evaluation: subjectivity

Evaluators disagree!

  • …look at this histogram of adequacy judgments by different human evaluators

[Figure: histogram of adequacy judgments per evaluator]

Human evaluation: measuring agreement

  • Kappa coefficient:

K = (p(A) − p(E)) / (1 − p(E))

– p(A): proportion of times that the evaluators agree
– p(E): proportion of times that they would agree by chance (5-point scale → p(E) = 1/5)
– Complete agreement: K = 1
– No agreement beyond chance: K = 0

  • Example: inter-evaluator agreement in WMT 2007

            p(A)    p(E)    K
Fluency     .400    .2      .250
Adequacy    .380    .2      .226
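As a concrete illustration, here is a minimal sketch of this kappa computation for two judges on a 5-point scale (plain Python; the function name and the toy judgments are ours, not from the slides):

def kappa(judge_a, judge_b, num_categories=5):
    # K = (p(A) - p(E)) / (1 - p(E)); chance agreement on an
    # n-point scale is taken as p(E) = 1/n, as on the slide.
    p_a = sum(a == b for a, b in zip(judge_a, judge_b)) / len(judge_a)
    p_e = 1.0 / num_categories
    return (p_a - p_e) / (1.0 - p_e)

# Toy data with 4/10 agreements: p(A) = .400, p(E) = .2 -> K = .250,
# matching the WMT 2007 fluency row above.
print(round(kappa([5, 4, 3, 2, 1, 5, 4, 3, 2, 1],
                  [5, 3, 2, 2, 5, 1, 4, 3, 5, 2]), 3))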

Human evaluation: alternatives

  • Ranking translations: is translation X better than translation Y?

– Evaluators are more consistent

  • Informativeness: answer comprehension questions using the translation (who? where? when? names, numbers, dates, etc.)

– Very hard to devise questions

                    p(A)    p(E)    K
Fluency             .400    .2      .250
Adequacy            .380    .2      .226
Sentence ranking    .582    .333    .373


Human evaluation: alternatives

  • Reading time

– People read a well-formed text more quickly

  • Post-editing effort (time / HTER)

– Time required to turn MT output into a good translation
– HTER (Human-targeted Translation Error Rate): the number of editing operations required to turn MT output into an acceptable translation


Automatic metrics for MT evaluation

Requirements for automatic metrics

  • Low cost (compared to human evaluation)
  • Objective (unbiased)
  • Meaningful: the score should give an intuitive interpretation of translation quality
  • Efficient: can be computed quickly and often
  • Consistent: repeated use of the metric should give the same results
  • Correct: the metric must rank better systems higher

Reference-based metrics

  • Idea: compute a similarity score between a candidate translation and one or more high-quality reference translations

– References are created by human experts (e.g. professional translators)
– Several references allow us to account for the variability of good translations

  • Criterion for validating automatic metrics: automatic scores must correlate with human ones on test data

Reference-based metrics

  • Typically:

Score(cand) = (1/k) · Σ_{i=1..k} Sim(ref_i, cand)   (typically 1 ≤ k ≤ 4 references)

– Sim is a similarity metric between sentences
– Sim can use a variety of properties: string distance, word precision/recall, syntactic similarity, semantic distance, etc.

Examples: WER (ratio of smallest edit distance to output length), BLEU (weighted sum of precision of n-grams), TER (normalized number of edits to match the closest reference), METEOR (harmonic mean of unigram precision/recall); also NIST, PER, GTM, HTER, TERp, CDER, BLANC, ULC, MT-NCD, ATEC, TESLA, SEPIA, IQTM, BEWT-E, MEANT, etc.

“Candidate”, “reference”, “n-grams”

Candidate (or “target” or “hypothesis”):

the gunman was shot dead by police .

Reference translation:

the gunman was shot to death by the police .

N-grams:

1-grams: the, gunman, was, shot, by, police, .
2-grams: the gunman, gunman was, was shot, police .
3-grams: the gunman was, gunman was shot
4-grams: the gunman was shot
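A quick sketch of this n-gram extraction in Python (the helper name ngrams is ours, for illustration):

def ngrams(tokens, n):
    # all contiguous n-word sequences in the sentence
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

cand = "the gunman was shot dead by police .".split()
for n in range(1, 5):
    print(f"{n}-grams:", ngrams(cand, n))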

The BLEU metric (BiLingual Evaluation Understudy)

  • Proposed by IBM [Papineni et al., 2001] (the name alludes to IBM’s color, “Big Blue”)
  • A numerical measure of closeness between texts
  • Rationale: the closer MT output is to human translation, the better
  • Idea: check matches of words (unigrams) and phrases (n-grams) between:

– one hypothesis (the translation produced by MT)
– a set of references (professional human translations)

  • Criterion: the more matches, the better the hypothesis
  • Needs good-quality references to cover linguistic variety

Important: only the target language is taken into account!


The BLEU metric (BiLingual Evaluation Understudy)

[Figure: a reference (REF) and three hypotheses (HYP1, HYP2, HYP3), rated VERY GOOD, BAD and VERY BAD]

The BLEU metric: modified n-gram precision

  • n-gram precision: percentage of n-grams in the hypothesis that also occur in (any of) the references (0 ≤ p ≤ 1)

– matches of shorter n-grams (n = 1, 2) capture adequacy
– matches of longer n-grams (n = 3, 4, ...) capture fluency

  • Modified: a reference word is considered exhausted after a matching word is identified in the hypothesis.

– Example:
Hyp: the the the the the the the
Ref: the cat is on the mat


p1 (standard) = 7/7 = 1.0        p1 (modified) = 2/7 ≈ 0.29
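A minimal sketch of modified (clipped) unigram precision, reproducing the example above (pure Python; function names are ours):

from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, ref, n=1):
    hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
    # clip each hypothesis n-gram count at its count in the reference
    clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

hyp = "the the the the the the the".split()
ref = "the cat is on the mat".split()
print(modified_precision(hyp, ref))  # 2/7 ≈ 0.29 rather than 7/7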

The BLEU metric: brevity penalty

  • Brevity penalty (BP): penalizes hypotheses that are too short

– Example:
Hyp: the
Ref: the cat is on the mat
…You can’t just type out the single word “the” (precision 1.0!)
– c = length of the MT hypothesis, r = length of the closest reference
– BP = 1 if c > r, else exp(1 − r/c)
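A one-function sketch of the brevity penalty (c = 8, r = 9 reproduces the 0.8825 used in the worked example below):

import math

def brevity_penalty(c, r):
    # no penalty if the hypothesis is longer than the closest reference
    return 1.0 if c > r else math.exp(1.0 - r / c)

print(round(brevity_penalty(8, 9), 4))  # 0.8825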

The BLEU metric: computation

BLEU = Brevity Penalty × geometric mean of p1, p2, ..., pn
(where pn is the modified n-gram precision, for 1 ≤ n ≤ 4)

Hypothesis: The gunman was shot dead by police .
– Ref 1: The gunman was shot to death by the police .
– Ref 2: The hit man was killed by the police forces .
– Ref 3: Police killed the gunman .
– Ref 4: The gunman was shot dead by the police .

  • Precision: p1 = 1.0 (8/8), p2 = 0.86 (6/7), p3 = 0.67 (4/6), p4 = 0.6 (3/5)
  • Brevity penalty: c = 8, r = 9, BP = exp(1 − 9/8) = 0.8825
  • Final score: (1 × 0.86 × 0.67 × 0.6)^(1/4) × 0.8825 = 0.68

NOTE: this is a product! If one of the factors is 0 (e.g. no 4-gram matches), the final score will be 0! For this reason the final score is usually calculated on the entire evaluation corpus, not on single sentences.

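Putting the pieces together, a self-contained sketch of this sentence-level computation (illustration only; real toolkits such as sacrebleu add smoothing and corpus-level aggregation):

import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, refs, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        hyp_c = ngram_counts(hyp, n)
        # clip against the maximum n-gram count over all references
        max_ref = Counter()
        for ref in refs:
            for g, c in ngram_counts(ref, n).items():
                max_ref[g] = max(max_ref[g], c)
        clipped = sum(min(c, max_ref[g]) for g, c in hyp_c.items())
        precisions.append(clipped / max(sum(hyp_c.values()), 1))
    if min(precisions) == 0:  # the NOTE above: one zero factor -> score 0
        return 0.0
    c = len(hyp)
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "The gunman was shot dead by police .".split()
refs = [r.split() for r in [
    "The gunman was shot to death by the police .",
    "The hit man was killed by the police forces .",
    "Police killed the gunman .",
    "The gunman was shot dead by the police .",
]]
print(round(bleu(hyp, refs), 2))  # ≈ 0.68, as in the slide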

The BLEU metric: correlation with training set size

[Figure: BLEU score vs. number of sentence pairs used in training; experiments by Philipp Koehn, from George Doddington, NIST, 2002]

The BLEU metric: correlation with human judgments

[Figure: correlation between BLEU scores and human judgments]


The BLEU metric limitations: examples

  • Reference:  a b c d e f g h i j k l m n o p q r s
  • Hyp 1:      a b c d f e g i h j l k m o n p r q s
  • Hyp 2:      a b c d e f g x x x x x x x x x x x x

             Hyp 1    Hyp 2
1-gram       1.0000   0.3684
2-gram       0.1666   0.3333
3-gram       0.1176   0.2941
4-gram       0.0625   0.2500
BLEU score   0.1871   0.3083

Longer n-grams dominate shorter n-grams!

The BLEU metric limitations: examples

  • Reference: George Bush will often take a holiday in Crawford Texas

HYPOTHESES                                                     BLEU
George Bush will often take a holiday in Crawford Texas        1.0000
Bush will often holiday in Texas                               0.4611
Bush will often holiday in Crawford Texas                      0.6363
George Bush will often holiday in Crawford Texas               0.7490
George Bush will not often vacation in Texas                   0.4491
George Bush will not often take a holiday in Crawford Texas    0.9129 (!)

Small changes in the text may cause big meaning changes!

The BLEU metric limitations: examples

  • Reference: The President frequently makes his vacation in Crawford Texas

HYPOTHESES                                            BLEU (4-gram)
George Bush often takes a holiday in Crawford Texas   0.2627
holiday often Bush a takes George in Crawford Texas   0.2627

WHY? …The “invisible region” [Hovy & Ravichandran 2003]

The BLEU metric limitations: improvements

  • Reference: The President frequently makes his vacation in Crawford Texas
    POS tags:  DT NNP RB VBZ PRP$ NN IN NNP NNP

Solution #1: matches at the POS level [Hovy & Ravichandran 2003]

HYPOTHESES (as POS)                BLEU (4-gram)
NNP NNP RB VBZ DT NN IN NNP NNP    0.5411
NN RB NNP DT VBZ NNP IN NNP NNP    0.3117

(word-level scores for the same hypotheses: 0.2627 and 0.2627)

The BLEU metric limitations: improvements

  • Reference: The President frequently makes his vacation in Crawford Texas
    POS tags:  DT NNP RB VBZ PRP$ NN IN NNP NNP

Solution #2: (words + POS) / 2 [Hovy & Ravichandran 2003]

HYPOTHESES (as POS)                BLEU (4-gram)
NNP NNP RB VBZ DT NN IN NNP NNP    0.4020
NN RB NNP DT VBZ NNP IN NNP NNP    0.2966

(word-level scores for the same hypotheses: 0.2627 and 0.2627)

The BLEU metric: pros and cons

  • BLEU ranges from 0 to 1 (translation quality as a “percentage”)
  • The more references, the higher the score
  • High correlation with human-assigned scores, especially on fluency
  • Ranking of “similar” MT systems is equivalent to human ranking

  • Collecting references has a high cost
  • Longer n-grams dominate shorter n-grams
  • Small changes in the text (e.g. “not”) may cause big meaning changes
  • Scores are not straightforward to interpret (BLEU = 30… so what?)
  • Syntax is poorly modeled
  • Ignores word relevance and semantic equivalence (string-level comparisons)
  • Can fail in ranking systems based on different approaches

The TER metric (Translation Edit Rate)

  • Idea: simulate post-editing [Snover et al. 2006]

– Given a translation hypothesis (H) and a reference translation (R)
– Calculate the minimal number of edits to transform H into R (normalized by the average length of the references)
– Possible edits: insertions/deletions/substitutions of single words, shifts of word sequences

  • Criterion: the fewer the edits, the better the hypothesis


The TER metric: example

REF: Saudi Arabia denied this week information published in the American NYT
HYP: this week the Saudis denied information published in the NYT

  • HYP: fluent, same meaning as the reference (except “American”)
  • but not an exact match:

– this week is shifted
– Saudi Arabia in the REF appears as the Saudis in the HYP
– American appears only in the REF

  • Number of edits = 4 (1 shift, 2 substitutions, and 1 deletion):

TER% = 4/11 × 100 = 36.36%
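For intuition, a sketch of the word-level edit distance underlying TER (insertions, deletions and substitutions only; real TER also allows block shifts, which is what brings the example above down to 4 edits, so this version only gives an upper bound):

def edit_distance(hyp, ref):
    # classic dynamic-programming Levenshtein distance over words
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1]

hyp = "this week the Saudis denied information published in the NYT".split()
ref = "Saudi Arabia denied this week information published in the American NYT".split()
print(edit_distance(hyp, ref) / len(ref))  # TER upper bound, without shifts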

The TER metric: discussion

  • Evaluation close to a real task (post-editing)
  • Results are more interpretable than for other metrics
  • Can be computed even for a single sentence
  • Insensitive to semantic closeness (e.g. synonyms, paraphrases)
  • Complexity of computation (optimal calculation of edit distance with move operations: NP-complete)

– approximate search via dynamic programming (decomposition into sub-problems)

The HTER metric (Human-targeted TER)

  • TER ignores semantic equivalence and heavily depends on the reference translation
  • Idea: use references that are human post-editions

– Perform human post-editing to transform the hypothesis into the closest acceptable translation
– HTER measures TER between the hypothesis and the resulting reference translation

  • Criterion: the fewer the edits, the better the hypothesis (same as TER)

TER/HTER: pros and cons

  • TER

– intuitive measure of MT quality
– adequate for fast development
– reasonably correlates with human judgments (better than BLEU, worse than others, e.g. METEOR)
– ignores semantic equivalence

  • HTER

– intuitive measure of MT quality
– highest correlation with human judgments
– a possible substitute for human evaluations, because it is less subjective
– expensive: 3 to 7 minutes per sentence for a human to annotate
– not suitable for use in the development cycle of an MT system

Application-oriented MT evaluation: Quality Estimation (QE)

  • From controlled lab tests and evaluation campaigns…
  • …to MT evaluation in real-life conditions (e.g. the CAT framework)

– As a support to human translators
– At run time
– Without reference translations

(One) scenario: the CAT framework

[Figure: CAT tool workflow]

The CAT tool:

1. Segments the input document
2. Provides, for each segment:
   • suggestions from a translation memory (TM)
   • suggestions from an MT engine

The translator, for each segment:

1. Selects the best suggestion
2. Post-edits it (if necessary) to reach publication quality

(One) scenario: the CAT framework

  • Questions:

– Is this suggestion good enough to be published?
– Can I trust it?
– Can a reader get the gist?
– Is it publishable “as is”?
– If not, what is better: post-editing or rewriting?

  • Huge market interest

– Increased translators’ productivity
– No manual intervention on reliable MT suggestions


Predicting MT output quality

  • Task: automatically estimate MT output quality at run-time and without reference translations
  • Approach: supervised learning. First (training step), a model is learned from human-labelled data. Then (prediction step), the model is used to label new, unseen data.

[Illustration: positive/negative examples; possible features: hasWings, hasFeathers, sound, moves, hasPalmateFeet, etc.]

Predicting MT output quality

  • What is a good indicator of translation quality?
  • It should take into account:

– Correctness and usefulness of the translation
– Cognitive effort needed by a human for the correction

  • All these aspects can be summarized in the:

– Post-editing effort


Predicting MT output quality

  • What is post-editing?

– A process of modification rather than revision (Loffler-Laurian 1985)
– The “term used for the correction of machine translation output by human linguists/editors” (Veale and Way 1997)
– Repairing texts (Krings, 2001)
– “…the process of improving a machine-generated translation with a minimum of manual labor” (TAUS report, 2010)

Predicting MT output quality

  • What is post-editing effort?

– The effort made by a post-editor to manually improve a machine-generated translation

  • Measures of post-editing effort:

– Quality score (as estimated by humans on a 1–5 Likert scale)
– Number of edit operations (HTER)
– Post-editing time (total seconds, or seconds per word)
– Number of keystrokes
– …

Quality scores

  • Arbitrary choice of the levels of quality:

1 = requires complete retranslation
2 = requires some retranslation
3 = very little post-editing needed
4 = fit for purpose

  • Labeling requires human intervention
  • A precise measure
  • Subjective/expensive/time-consuming task

Quality scores

  • Workshop on SMT scoring schema:

1. The MT output is incomprehensible, with little or no information transferred accurately. It cannot be edited, and needs to be translated from scratch.
2. About 50–70% of the MT output needs to be edited. It requires a significant editing effort in order to reach publishable level.
3. About 25–50% of the MT output needs to be edited. It contains different errors and mistranslations that need to be corrected.
4. About 10–25% of the MT output needs to be edited. It is generally clear and intelligible.
5. The MT output is perfectly clear and intelligible. It is not necessarily a perfect translation, but requires little to no editing.

Post-editing time

  • Seconds needed to post-edit a sentence
  • Normalized version: seconds per word

– little time = good translation
– much time = bad translation

  • Usually includes:

– reading time
– searching for information in external resources
– typing time
– extra time for secondary activity (e.g. correction)

  • High variability across sentences and translators

HTER (again!)

  • Human-targeted TER is the standard edit distance between the original machine translation and its minimally post-edited version

– edits: insertion, deletion, substitution, shift

  • Lower variability (compared to time) across sentences/translators

HTER = #edits / #words in the post-edited version

Post-editing time vs. HTER

  • Time: pros/cons

– Accounts for the different effort of translating different words
– Variability among post-editors

  • HTER: pros/cons

– Objective, easy-to-compute measure
– Less variance across post-editors (bad = bad for all)
– Ignores the different effort of translating different words


Predicting MT output quality

  • Tasks:

– Automatic labeling
   • real values → regression
   • integers → classification
– Automatic ranking

  • Granularity:

– Word level (e.g. “The cat enter in the room”)
– Sentence level (e.g. “The cat enter in the room”: 2.27)
– Document level

Evaluation Metrics - Regression

  • Regression (predictions as real values):

– Mean Absolute Error (MAE)
– Root Mean Squared Error (RMSE)

  • Given a set of predicted scores H and a set of human scores V over N sentences:

MAE = (1/N) · Σ_{i=1..N} |H(s_i) − V(s_i)|
RMSE = sqrt( (1/N) · Σ_{i=1..N} (H(s_i) − V(s_i))² )
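A direct sketch of both metrics (function names and toy scores are ours):

import math

def mae(H, V):
    return sum(abs(h - v) for h, v in zip(H, V)) / len(H)

def rmse(H, V):
    return math.sqrt(sum((h - v) ** 2 for h, v in zip(H, V)) / len(H))

H = [3.1, 4.0, 2.2, 4.8]   # predicted quality scores (toy data)
V = [3.0, 4.5, 2.0, 5.0]   # human scores (toy data)
print(round(mae(H, V), 3), round(rmse(H, V), 3))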

Evaluation Metrics - Classification

  • Classification (predictions as integers):

– Precision (Pr)
– Recall (Re)
– F-score (F1)

  • Given a set of predicted scores H and a set of human scores V
  • An example for binary classification (labels 1 / −1):

           V = 1             V = −1
H = 1      True Positive     False Positive
H = −1     False Negative    True Negative

Pr = tp / (tp + fp)    Re = tp / (tp + fn)    F1 = 2 · Pr · Re / (Pr + Re)
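The same, as a runnable sketch for labels in {1, −1} (toy predictions, names are ours):

def prf1(H, V):
    tp = sum(h == 1 and v == 1 for h, v in zip(H, V))
    fp = sum(h == 1 and v == -1 for h, v in zip(H, V))
    fn = sum(h == -1 and v == 1 for h, v in zip(H, V))
    pr = tp / (tp + fp)
    re = tp / (tp + fn)
    return pr, re, 2 * pr * re / (pr + re)

print(prf1([1, 1, -1, 1, -1], [1, -1, -1, 1, 1]))  # (0.667, 0.667, 0.667)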

Evaluation Metrics - Ranking

  • Rank similarity metrics:

– Spearman’s rank coefficient
– Delta Average (introduced at WMT 2012)

      System              Human
      Score   Ranking     Judgment   Ranking
s1    3.2     3           5          1
s2    1       5           1          5
s3    5       1           4          2
s4    2.7     4           2          4
s5    4       2           3          3
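A sketch of Spearman's rank coefficient applied to the table above (assumes distinct scores; tied ranks would need averaging):

def ranks(scores):
    # rank 1 = highest score
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0] * len(scores)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(system, human):
    rs, rh = ranks(system), ranks(human)
    n = len(rs)
    d2 = sum((a - b) ** 2 for a, b in zip(rs, rh))
    return 1 - 6 * d2 / (n * (n * n - 1))

system = [3.2, 1, 5, 2.7, 4]   # s1..s5 system scores from the table
human = [5, 1, 4, 2, 3]        # s1..s5 human judgments
print(spearman(system, human))  # 0.7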


Quality indicators

  • Features can be extracted from:

– The source sentence (“complexity” indicators)
– The translated sentence (“fluency” indicators)
– Source and target sentences (“adequacy” and other indicators)
– The MT system during the translation process (“confidence” indicators)

[Diagram: source sentence → MT system → translated sentence]

Quality indicators - Complexity

  • Capture the difficulty of translating the source sentence
  • Complex sentences are harder to translate:

– source sentence length
– n-gram language model probability
– number of punctuation marks
– source sentence type/token ratio (e.g. #nouns/#tokens)
– avg. # of translations per word (as given by probabilistic dictionaries)
– % of content/non-content words
– …
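A few of these complexity indicators computed naively, as a sketch (real QE systems extract many more, e.g. from language models and probabilistic dictionaries):

def complexity_features(source_sentence):
    toks = source_sentence.split()
    return {
        "length": len(toks),
        "punctuation_marks": sum(t in {",", ".", ";", ":", "!", "?"} for t in toks),
        "type_token_ratio": len(set(toks)) / len(toks),
    }

print(complexity_features("the cat is on the mat ."))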

Quality indicators - Fluency

  • Capture the naturalness of the translation in the target language
  • The translation should conform to the target language in terms of grammar, with lexical choices appropriate to the genre of the source text:

– n-gram language model probability
– POS-tag target language model
– …

Quality indicators - Adequacy

  • Capture the level of semantic equivalence between source and translation
  • Source and target sentences should convey the same meaning; meaning drifts/losses from the source to the target sentence indicate a bad translation:

– % of aligned words in source and target
– % of alignments between words with the same part of speech
– % of aligned nouns/verbs/adjectives
– aligned IDF mass (IDF as an indicator of term relevance)
– …


Quality indicators - Confidence

  • Capture the level of confidence of the SMT system
  • Sentences for which the translation process is complex are more likely to be bad translations:

– length N of the N-best list
– number of pruned hypotheses
– log-likelihood score
– avg. edit distance of the 1-best from the first k-bests
– …

Open Issues

  • Lack of an objective quality score able to capture cognitive effort

– A new score that contains the main features of HTER and correlates well with PE time

  • Lack of a technique able to threshold the quality score (bad vs. good translations)

– Is HTER = 0.3/0.5/0.7 a bad or a good translation?
– Useful in the CAT tool scenario, where it is necessary to discard bad translations

Open Issues

  • More than 1,000 quality indicators have been developed in recent years.

– Do we need all of them in a real application?
– Which are the most reliable in each group?
– Which is the best combination?

  • Subjectivity in the post-editor’s work and in the task

– A single quality estimator for very different post-editor behaviors and tasks
– Adaptability/personalization

MT Evaluation Dilemma

Summary

  • MT evaluation: a hot topic…

– Shared evaluation methods/routines are a key asset in any field

  • …but a difficult task

– We talked about error variability, costs, speed, replicability, subjectivity, correlation with human judgments, etc.


Summary

  • Human evaluation

– Accurate, high quality, meaningful; expensive, slow, subjective

  • Automatic evaluation

– Cheap, quick, repeatable, objective; approximate, less accurate
– Fluency, adequacy
– Reference-based: BLEU, TER, HTER (pros and cons)
– Reference-free: quality estimation (goal, methods, open issues)

Summary

  • Key concepts:

Adequacy, Reference translation, Agreement, Correlation, Post-editing effort, CAT tool, Feature, Cognitive effort, HTER, Mean Absolute Error

Evaluation of Machine Translation Quality

Marco Turchi

FBK Trento, Italy (turchi@fbk.eu)