
turchi@fbk.eu Slides from the presentation by Matteo Negri and - PowerPoint PPT Presentation



  1. Evaluation of Machine Translation Quality
Marco Turchi, FBK Trento, Italy, turchi@fbk.eu
Slides from the presentation by Matteo Negri… and myself
MT Evaluation, Trento, Doctoral School - April 2016

Outline
• Importance of MT evaluation
• Difficulty of MT evaluation
• Human evaluation: fluency/adequacy
• Automatic evaluation:
 – Reference-based: BLEU, TER, HTER (chosen among MANY others)
 – Reference-free: quality estimation (estimating post-editing effort)

Disclaimer
“More has been written about MT evaluation over the past 50 years than about MT itself”
Hovy et al.: Principles of Context-Based Machine Translation Evaluation. Machine Translation, 16, pp. 1–33, 2002 (attributed to Yorick Wilks)
“It is impossible to write a comprehensive overview of the MT evaluation literature”
Adam Lopez: Statistical Machine Translation. ACM Computing Surveys 40(3), pp. 1–49, August 2008.

The importance of MT evaluation
• Answering “How good is an MT system?” as a way to:
 – Decide which system to use for a given task
 – Assess and compare systems’ performance
 – Define the state of the art
 – Drive system development and measure improvements
 – Decide whether to apply MT at all
• …Necessary (yes, not sufficient) conditions for progress in any research field
• Difficult task!

  2. Difficulty of MT evaluation
• No formal definition of “translation”, hence no definition of “good translation”
• The notion of quality is inherently subjective
• Exact quantification is difficult (especially for long sentences)
• MT errors are very varied in nature

  3. Difficulty of MT evaluation
• Perfect or very poor translations are easy to score, but what happens in between?
• Many different acceptable translations for the same sentence, for example:
 – I am [experiencing|suffering from|feeling] a throbbing pain .
 – I [feel|can feel|have] a [throbbing pain|painful throbbing] .
 – [It is a|It’s in|I’ve got a] throbbing pain .
 – It’s throbbing [and it really hurts|with pain] .
 – [It’s painful and|It hurts so much] it’s throbbing .
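The bracket notation on this slide packs several acceptable translations into each line. As a minimal sketch (the `expand` helper and the `PATTERNS` list are illustrative names, not part of the original slides, and the apostrophes are written in plain ASCII), the patterns can be expanded to show how many distinct reference translations they cover:

```python
import itertools
import re

# Patterns transcribed from the slide; [a|b|c] marks interchangeable options.
PATTERNS = [
    "I am [experiencing|suffering from|feeling] a throbbing pain .",
    "I [feel|can feel|have] a [throbbing pain|painful throbbing] .",
    "[It is a|It's in|I've got a] throbbing pain .",
    "It's throbbing [and it really hurts|with pain] .",
    "[It's painful and|It hurts so much] it's throbbing .",
]


def expand(pattern):
    """Return every sentence obtained by picking one option per [a|b|...] group."""
    # Splitting on a capturing group keeps the bracket contents at odd indices.
    parts = re.split(r"\[([^\]]*)\]", pattern)
    choices = [
        part.split("|") if index % 2 else [part]
        for index, part in enumerate(parts)
    ]
    return ["".join(combo) for combo in itertools.product(*choices)]


variants = [sentence for pattern in PATTERNS for sentence in expand(pattern)]
print(len(variants))  # number of distinct acceptable translations
for sentence in variants[:3]:
    print(sentence)
```

Even these five compact patterns describe well over a dozen surface forms, which is why comparing an MT output against a single reference can penalize a perfectly acceptable translation.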

  4. Difficulty of MT evaluation
• How would you translate:
 – It’s raining cats and dogs
 – Ace in the hole
 – Beat around the bush
 – Chew the fat
 – Wild goose chase
 – Tie one on
 – Sunny smile
• Literally, its meaning or the corresponding idiom (if any)?

Difficulty of MT evaluation
• Classification of errors: a quite rich taxonomy
[Figure: taxonomy of MT error types]
Note: error types are not mutually exclusive and often co-occur (Vilar et al. 2006)
MT Evaluation, Trento, ISIT School - November 2013

Human vs. automatic evaluation
• Human MT evaluation:
 – criteria: adequacy (fidelity) and fluency (intelligibility)
 – pros: very accurate, high quality
 – cons: expensive, slow, subjective
• Automatic MT evaluation:
 – criteria: “similarity” to professional human translation
 – pros: inexpensive, quick, objective
 – cons: quality is “slightly” lower than human check
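To make the “similarity to professional human translation” criterion concrete, here is a deliberately crude sketch: the fraction of MT output words that also occur in a reference translation, with repeated words clipped to their count in the reference. This is an assumption-laden illustration, not BLEU, TER or HTER; the function name and the example sentences are chosen here purely for illustration.

```python
from collections import Counter


def clipped_unigram_precision(hypothesis, reference):
    """Share of hypothesis words found in the reference (counts clipped)."""
    hyp_words = hypothesis.lower().split()
    if not hyp_words:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    # A word repeated in the hypothesis is credited at most as many times
    # as it appears in the reference.
    matched = sum(
        min(count, ref_counts[word])
        for word, count in Counter(hyp_words).items()
    )
    return matched / len(hyp_words)


# Illustrative sentences (hypothetical reference and MT outputs).
reference = "The cat enters the room"
print(clipped_unigram_precision("The cat enter in the room", reference))
print(clipped_unigram_precision("The cats enter the bedroom", reference))
```

A score like this is quick and reproducible, which matches the “pros” listed for automatic evaluation, but it knows nothing about meaning or grammar, which is one way to read the “slightly lower quality” caveat.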

  5. Human evaluation
• Given:
 – MT output, source and/or reference translation
• Task: assess the quality of the MT output
• Metrics
 – Adequacy: does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted? Requires bilingual judges or a reference translation.
 – Fluency: is the output good, fluent English? This involves both grammatical correctness and idiomatic word choices. Monolingual judges are sufficient; no reference is needed.

  6. Human evaluation: adequacy and fluency
• Source sentence: Le chat entre dans la chambre.
 (a) Adequate fluent translation: The cat enters the room.
 (b) Adequate disfluent translation: The cat enter in the room.
 (c) Fluent inadequate translation: The cats enter the bedroom.
 (d) Disfluent inadequate translation: Bedroom the dogs enters the

Human evaluation: Likert scales
Score  Adequacy        Fluency
5      all meaning     flawless English
4      most meaning    good English
3      much meaning    non-native English
2      little meaning  disfluent English
1      none            incomprehensible

Human evaluation: subjectivity
• Perfect or very poor translations are easy to score… but what happens in between?
[Figure: translations (a)–(d) placed on adequacy/fluency axes by JUDGE1, JUDGE2 and JUDGE3]

Human evaluation: subjectivity
• Evaluators disagree!
• …look at this histogram of adequacy judgments by different human evaluators
[Figure: histogram of adequacy judgments by different human evaluators]
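The Likert scales and the point about evaluator disagreement can be illustrated with a small sketch. The ratings below are hypothetical numbers invented for this example (they are not taken from the slides or from any real evaluation), and the helper names are likewise illustrative. The code averages the adequacy scores per translation and reports how far the judges spread, plus a simple exact-agreement rate between two judges.

```python
from statistics import mean

# Adequacy scale as given on the slide.
ADEQUACY = {
    5: "all meaning",
    4: "most meaning",
    3: "much meaning",
    2: "little meaning",
    1: "none",
}

# Hypothetical adequacy ratings: one score per translation, per judge.
ratings = {
    "judge1": [5, 4, 2, 1],
    "judge2": [5, 3, 3, 1],
    "judge3": [4, 4, 1, 2],
}


def per_segment_summary(ratings):
    """Return (mean score, max-min spread) for each rated translation."""
    judges = list(ratings)
    n_segments = len(ratings[judges[0]])
    summary = []
    for i in range(n_segments):
        scores = [ratings[judge][i] for judge in judges]
        summary.append((mean(scores), max(scores) - min(scores)))
    return summary


def exact_agreement(scores_a, scores_b):
    """Fraction of translations on which two judges gave the same score."""
    same = sum(a == b for a, b in zip(scores_a, scores_b))
    return same / len(scores_a)


for index, (avg, spread) in enumerate(per_segment_summary(ratings)):
    print(f"translation {index}: mean={avg:.2f} spread={spread}")
print(exact_agreement(ratings["judge1"], ratings["judge2"]))
```

A non-zero spread on a translation, or an agreement rate well below 1, is the numeric counterpart of the disagreement shown in the histogram slide. Chance-corrected measures exist for this purpose, but raw agreement is enough to show the effect.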
