

SLIDE 1

Chapter 8 Evaluation

Statistical Machine Translation

SLIDE 2

Evaluation

  • How good is a given machine translation system?
  • Hard problem, since many different translations acceptable

→ semantic equivalence / similarity

  • Evaluation metrics

– subjective judgments by human evaluators
– automatic evaluation metrics
– task-based evaluation, e.g.:
  – how much post-editing effort?
  – does information come across?

SLIDE 3

Ten Translations of a Chinese Sentence

 1. Israeli officials are responsible for airport security.
 2. Israel is in charge of the security at this airport.
 3. The security work for this airport is the responsibility of the Israel government.
 4. Israeli side was in charge of the security of this airport.
 5. Israel is responsible for the airport’s security.
 6. Israel is responsible for safety work at this airport.
 7. Israel presides over the security of the airport.
 8. Israel took charge of the airport security.
 9. The safety of this airport is taken charge of by Israel.
10. This airport’s security is the responsibility of the Israeli security officials.

(a typical example from the 2001 NIST evaluation set)

SLIDE 4

Adequacy and Fluency

  • Human judgement

– given: machine translation output
– given: source and/or reference translation
– task: assess the quality of the machine translation output

  • Metrics

Adequacy: Does the output convey the same meaning as the input sentence? Is part of the message lost, added, or distorted?

Fluency: Is the output good fluent English? This involves both grammatical correctness and idiomatic word choices.

SLIDE 5

Fluency and Adequacy: Scales

Adequacy            Fluency
5  all meaning      5  flawless English
4  most meaning     4  good English
3  much meaning     3  non-native English
2  little meaning   2  disfluent English
1  none             1  incomprehensible

SLIDE 6

Annotation Tool

SLIDE 7

Evaluators Disagree

  • Histogram of adequacy judgments by different human evaluators

(five histograms omitted, one per evaluator: adequacy judgment 1–5 on the x-axis, relative frequency between 10% and 30% on the y-axis)

(from WMT 2006 evaluation)

SLIDE 8

Measuring Agreement between Evaluators

  • Kappa coefficient

K = (p(A) − p(E)) / (1 − p(E))

– p(A): proportion of times that the evaluators agree
– p(E): proportion of times that they would agree by chance
  (5-point scale → p(E) = 1/5)

  • Example: Inter-evaluator agreement in WMT 2007 evaluation campaign

Evaluation type   P(A)   P(E)   K
Fluency           .400   .2     .250
Adequacy          .380   .2     .226
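
A minimal sketch of this computation in Python (the function name and the example ratings are ours; the 1/5 chance agreement matches the 5-point scale above):

```python
def kappa(ratings_a, ratings_b, num_categories=5):
    # p(A): observed proportion of identical judgments by the two evaluators
    n = len(ratings_a)
    p_agree = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # p(E): agreement expected by chance, assuming a uniform choice of category
    p_chance = 1 / num_categories
    return (p_agree - p_chance) / (1 - p_chance)

# Hypothetical adequacy judgments for five sentences; the evaluators agree on 3 of 5.
print(kappa([5, 4, 3, 3, 2], [5, 3, 3, 2, 2]))  # (0.6 - 0.2) / (1 - 0.2) = 0.5
```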

SLIDE 9

Ranking Translations

  • Task for evaluator: Is translation X better than translation Y?

(choices: better, worse, equal)

  • Evaluators are more consistent:

Evaluation type    P(A)   P(E)   K
Fluency            .400   .2     .250
Adequacy           .380   .2     .226
Sentence ranking   .582   .333   .373

SLIDE 10

Goals for Evaluation Metrics

Low cost: reduce time and money spent on carrying out evaluation
Tunable: automatically optimize system performance towards metric
Meaningful: score should give intuitive interpretation of translation quality
Consistent: repeated use of metric should give same results
Correct: metric must rank better systems higher

SLIDE 11

Other Evaluation Criteria

When deploying systems, considerations go beyond the quality of translations:

Speed: we prefer faster machine translation systems
Size: fits into memory of available machines (e.g., handheld devices)
Integration: can be integrated into the existing workflow
Customization: can be adapted to the user’s needs

SLIDE 12

Automatic Evaluation Metrics

  • Goal: computer program that computes the quality of translations
  • Advantages: low cost, tunable, consistent
  • Basic strategy

– given: machine translation output
– given: human reference translation
– task: compute similarity between them

SLIDE 13

Precision and Recall of Words

SYSTEM A:  Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security

  • Precision

    correct / output-length = 3/6 = 50%

  • Recall

    correct / reference-length = 3/7 = 43%

  • F-measure

    precision × recall / ((precision + recall)/2) = (.5 × .43) / ((.5 + .43)/2) = 46%
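
A minimal sketch of these three measures (bag-of-words matching with clipped counts; the function name is ours):

```python
from collections import Counter

def precision_recall_f(system, reference):
    sys_words, ref_words = system.split(), reference.split()
    ref_counts = Counter(ref_words)
    # a system word counts as correct if it also occurs in the reference (clipped)
    correct = sum(min(c, ref_counts[w]) for w, c in Counter(sys_words).items())
    precision = correct / len(sys_words)
    recall = correct / len(ref_words)
    f = precision * recall / ((precision + recall) / 2)  # harmonic mean, as above
    return precision, recall, f

print(precision_recall_f("Israeli officials responsibility of airport safety",
                         "Israeli officials are responsible for airport security"))
# -> (0.5, 0.4285..., 0.4615...), i.e. 50%, 43%, 46%
```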

SLIDE 14

Precision and Recall

SYSTEM A:  Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
SYSTEM B:  airport security Israeli officials are responsible

Metric      System A   System B
precision   50%        100%
recall      43%        100%
f-measure   46%        100%

flaw: no penalty for reordering

SLIDE 15

Word Error Rate

  • Minimum number of editing steps to transform output to reference

match: words match, no cost
substitution: replace one word with another
insertion: add word
deletion: drop word

  • Levenshtein distance

wer = (substitutions + insertions + deletions) / reference-length
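
A minimal sketch of wer using the standard Levenshtein dynamic program over words (the function name is ours); it reproduces the System A and System B scores on the next slide:

```python
def wer(system, reference):
    hyp, ref = system.split(), reference.split()
    # d[i][j]: minimum edits turning the first i output words into the first j reference words
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i  # delete every output word
    for j in range(len(ref) + 1):
        d[0][j] = j  # insert every reference word
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1  # match is free
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[-1][-1] / len(ref)

reference = "Israeli officials are responsible for airport security"
print(wer("Israeli officials responsibility of airport safety", reference))  # 4/7 ≈ 57%
print(wer("airport security Israeli officials are responsible", reference))  # 5/7 ≈ 71%
```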

SLIDE 16

Example

(Levenshtein matrices omitted: one dynamic-programming table per system, aligning the system output against the reference "Israeli officials are responsible for airport security" and tracing the minimum-cost path of matches, substitutions, insertions, and deletions)

Metric                  System A   System B
word error rate (wer)   57%        71%

SLIDE 17

BLEU

  • N-gram overlap between machine translation output and reference translation
  • Compute precision for n-grams of size 1 to 4
  • Add brevity penalty (for too short translations)

bleu = min(1, output-length / reference-length) × (precision_1 × precision_2 × precision_3 × precision_4)^(1/4)

  • Typically computed over the entire corpus, not single sentences

SLIDE 18

Example

SYSTEM A:  Israeli officials responsibility of airport safety
           (2-GRAM MATCH: "Israeli officials"; 1-GRAM MATCH: "airport")
SYSTEM B:  airport security Israeli officials are responsible
           (2-GRAM MATCH: "airport security"; 4-GRAM MATCH: "Israeli officials are responsible")
REFERENCE: Israeli officials are responsible for airport security

Metric              System A   System B
precision (1gram)   3/6        6/6
precision (2gram)   1/5        4/5
precision (3gram)   0/4        2/4
precision (4gram)   0/3        1/3
brevity penalty     6/7        6/7
bleu                0%         52%
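
A minimal single-reference sketch of the formula above, using the slide's brevity penalty min(1, output-length/reference-length) rather than the exponential penalty of the original bleu paper:

```python
import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def bleu(system, reference, max_n=4):
    hyp, ref = system.split(), reference.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        ref_counts = Counter(ngrams(ref, n))
        matches = sum(min(c, ref_counts[g]) for g, c in Counter(ngrams(hyp, n)).items())
        if matches == 0:
            return 0.0  # one zero n-gram precision zeroes the geometric mean (System A)
        log_prec_sum += math.log(matches / len(ngrams(hyp, n)))
    brevity = min(1.0, len(hyp) / len(ref))
    return brevity * math.exp(log_prec_sum / max_n)  # geometric mean of the precisions

reference = "Israeli officials are responsible for airport security"
print(bleu("Israeli officials responsibility of airport safety", reference))  # 0.0
print(bleu("airport security Israeli officials are responsible", reference))  # ≈ 0.52
```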

SLIDE 19

Multiple Reference Translations

  • To account for variability, use multiple reference translations

– n-grams may match in any of the references
– closest reference length used

  • Example

SYSTEM:      Israeli officials responsibility of airport safety
             (2-GRAM MATCH: "Israeli officials"; 2-GRAM MATCH: "responsibility of"; 1-GRAM MATCH: "airport")

REFERENCES:  Israeli officials are responsible for airport security
             Israel is in charge of the security at this airport
             The security work for this airport is the responsibility of the Israel government
             Israeli side was in charge of the security of this airport

SLIDE 20

METEOR: Flexible Matching

  • Partial credit for matching stems

system:    Jim went home
reference: Joe goes home

  • Partial credit for matching synonyms

system:    Jim walks home
reference: Joe goes home

  • Use of paraphrases
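
The real meteor algorithm aligns words in stages with tuned weights; purely as a toy illustration of partial credit, a word matcher might look like this (the stemmer, the 0.5 weight, and the synonym table are all made up):

```python
# Toy sketch only, not the actual METEOR matcher.
SYNONYMS = {("walks", "goes"), ("goes", "walks")}  # made-up synonym table

def stem(word):
    # crude suffix stripping, for illustration only
    for suffix in ("ing", "es", "ed", "s"):
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def match_score(sys_word, ref_word):
    if sys_word == ref_word:
        return 1.0  # exact match
    if stem(sys_word) == stem(ref_word):
        return 0.5  # partial credit for matching stems
    if (sys_word, ref_word) in SYNONYMS:
        return 0.5  # partial credit for synonyms
    return 0.0

print(match_score("home", "home"))   # 1.0: exact match
print(match_score("walks", "goes"))  # 0.5: via the synonym table
print(match_score("went", "goes"))   # 0.0: neither stems nor listed synonyms
```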

SLIDE 21

Critique of Automatic Metrics

  • Ignore relevance of words

(names and core concepts more important than determiners and punctuation)

  • Operate on local level

(do not consider overall grammaticality of the sentence or sentence meaning)

  • Scores are meaningless

(scores very test-set specific, absolute value not informative)

  • Human translators score low on BLEU

(possibly because of higher variability, different word choices)

SLIDE 22

Evaluation of Evaluation Metrics

  • Automatic metrics are low cost, tunable, consistent
  • But are they correct?

→ Yes, if they correlate with human judgement

SLIDE 23

Correlation with Human Judgement

SLIDE 24

Pearson’s Correlation Coefficient

  • Two variables: automatic score x, human judgment y
  • Multiple systems (x1, y1), (x2, y2), ...
  • Pearson’s correlation coefficient r_xy:

    r_xy = Σi (xi − x̄)(yi − ȳ) / ((n − 1) s_x s_y)

  • Note:

    mean x̄ = (1/n) Σi=1..n xi

    variance s_x² = (1/(n−1)) Σi=1..n (xi − x̄)²
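
A minimal sketch of r_xy as defined above (the score lists are hypothetical):

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)

# Hypothetical (bleu, human adequacy) pairs for four systems.
print(pearson([0.20, 0.25, 0.28, 0.31], [2.7, 3.1, 3.0, 3.6]))  # ≈ 0.89
```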

SLIDE 25

Metric Research

  • Active development of new metrics

– syntactic similarity
– semantic equivalence or entailment
– metrics targeted at reordering
– trainable metrics
– etc.

  • Evaluation campaigns that rank metrics

(using Pearson’s correlation coefficient)

SLIDE 26

Evidence of Shortcomings of Automatic Metrics

Post-edited output vs. statistical systems (NIST 2005)

(scatter plot omitted: human adequacy score against bleu score)

SLIDE 27

Evidence of Shortcomings of Automatic Metrics

Rule-based vs. statistical systems

(scatter plot omitted: human adequacy and fluency scores against bleu score; systems: SMT System 1, SMT System 2, Rule-based System (Systran))

SLIDE 28

Automatic Metrics: Conclusions

  • Automatic metrics essential tool for system development
  • Not fully suited to rank systems of different types
  • Evaluation metrics still open challenge

SLIDE 29

Hypothesis Testing

  • Situation

– system A has score x on a test set
– system B has score y on the same test set
– x > y

  • Is system A really better than system B?
  • In other words:

Is the difference in score statistically significant?

SLIDE 30

Core Concepts

  • Null hypothesis

– assumption that there is no real difference

  • P-Levels

– related to the probability that there is a true difference
– p-level p < 0.01 → more than 99% chance that the difference is real
– typically used: p-level 0.05 or 0.01

  • Confidence Intervals

– given that the measured score is x
– what is the true score (on an infinite-size test set)?
– interval [x − d, x + d] contains the true score with, e.g., 95% probability

SLIDE 31

Computing Confidence Intervals

  • Example

– 100 sentence translations evaluated
– 30 found to be correct

  • True translation score?

(i.e. probability that any randomly chosen sentence is correctly translated)

SLIDE 32

Normal Distribution

true score lies in the interval [x̄ − d, x̄ + d] around the sample score x̄ with probability 0.95

SLIDE 33

Confidence Interval for Normal Distribution

  • Compute mean x̄ and variance s² from the data

    x̄ = (1/n) Σi=1..n xi

    s² = (1/(n−1)) Σi=1..n (xi − x̄)²

  • True mean µ?

SLIDE 34

Student’s t-distribution

  • Confidence interval p(µ ∈ [x̄ − d, x̄ + d]) ≥ 0.95, computed by d = t s / √n

  • Values for t depend on test sample size and significance level:

    Significance        Test Sample Size
    Level          100       300       600         ∞
    99%          2.6259    2.5923    2.5841    2.5759
    95%          1.9849    1.9679    1.9639    1.9600
    90%          1.6602    1.6499    1.6474    1.6449

SLIDE 35

Example

  • Given

    – 100 sentence translations evaluated
    – 30 found to be correct

  • Sample statistics

    – sample mean x̄ = 30/100 = 0.3
    – sample variance s² = (1/99)(70 × (0 − 0.3)² + 30 × (1 − 0.3)²) = 0.2121

  • Consulting the table for t at the 95% level → 1.9849

  • Computing the interval (see the sketch below): d = 1.9849 × √0.2121 / √100 ≈ 0.091 → [0.209; 0.391]
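
A minimal sketch of this computation (scoring each sentence 1 if correct, 0 otherwise; the t value is taken from the table on the previous slide):

```python
import math

data = [1] * 30 + [0] * 70  # 30 of 100 sentence translations judged correct

n = len(data)
mean = sum(data) / n
var = sum((x - mean) ** 2 for x in data) / (n - 1)  # sample variance s²
t = 1.9849  # 95% level, sample size 100
d = t * math.sqrt(var) / math.sqrt(n)  # d = t s / √n
print(f"{mean:.3f} ± {d:.3f}")  # 0.300 ± 0.091
```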

SLIDE 36

Pairwise Comparison

  • Typically, absolute score less interesting
  • More important

– Is system A better than system B?
– Is a change to my system an improvement?

  • Example

– given a test set of 100 sentences
– system A better on 60 sentences
– system B better on 40 sentences

  • Is system A really better?

SLIDE 37

Sign Test

  • Using binomial distribution

– system A better with probability p_A
– system B better with probability p_B (= 1 − p_A)
– probability of system A being better on k sentences out of a sample of n sentences:

  (n choose k) p_A^k p_B^(n−k) = (n! / (k! (n−k)!)) p_A^k p_B^(n−k)

  • Null hypothesis: p_A = p_B = 0.5

  (n choose k) p^k (1 − p)^(n−k) = (n choose k) 0.5^n = (n! / (k! (n−k)!)) 0.5^n

SLIDE 38

Examples

n     p ≤ 0.01              p ≤ 0.05              p ≤ 0.10
5     -                     -                     k = 5  (k/n = 1.00)
10    k = 10 (k/n = 1.00)   k ≥ 9  (k/n ≥ 0.90)   k ≥ 9  (k/n ≥ 0.90)
20    k ≥ 17 (k/n ≥ 0.85)   k ≥ 15 (k/n ≥ 0.75)   k ≥ 15 (k/n ≥ 0.75)
50    k ≥ 35 (k/n ≥ 0.70)   k ≥ 33 (k/n ≥ 0.66)   k ≥ 32 (k/n ≥ 0.64)
100   k ≥ 64 (k/n ≥ 0.64)   k ≥ 61 (k/n ≥ 0.61)   k ≥ 59 (k/n ≥ 0.59)

Given n sentences, the system has to be better on at least k sentences to achieve statistical significance at the specified p-level.
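
A minimal sketch that reproduces these thresholds (the table values are consistent with a two-sided test):

```python
from math import comb

def sign_test_p(k, n):
    # two-sided p-value under the null hypothesis p_A = p_B = 0.5
    k = max(k, n - k)
    tail = sum(comb(n, i) for i in range(k, n + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)

print(sign_test_p(61, 100))  # ≈ 0.035 -> significant at p ≤ 0.05
print(sign_test_p(60, 100))  # ≈ 0.057 -> not significant at p ≤ 0.05
```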

SLIDE 39

Bootstrap Resampling

  • The methods described require scores at the sentence level
  • But: common metrics such as bleu are computed over the whole corpus
  • Sampling

    1. test set of 2000 sentences, sampled from a large collection
    2. compute the bleu score for this set
    3. repeat steps 1–2 1000 times
    4. ignore the 25 highest and 25 lowest obtained bleu scores

→ 95% confidence interval

  • Bootstrap resampling: sample from the same 2000 sentences, with replacement (see the sketch below)
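
A minimal sketch, assuming some corpus-level scorer corpus_score(pairs), such as a corpus-level version of the bleu sketch above:

```python
import random

def bootstrap_interval(sentence_pairs, corpus_score, samples=1000):
    scores = []
    for _ in range(samples):
        # draw a test set of the same size from the same sentences, with replacement
        resample = random.choices(sentence_pairs, k=len(sentence_pairs))
        scores.append(corpus_score(resample))
    scores.sort()
    cut = int(samples * 0.025)  # e.g. drop the 25 highest and 25 lowest of 1000
    return scores[cut], scores[-cut - 1]  # 95% confidence interval
```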

SLIDE 40

Task-Oriented Evaluation

  • Machine translation is a means to an end
  • Does machine translation output help accomplish a task?
  • Example tasks

– producing high-quality translations by post-editing machine translation
– information gathering from foreign language sources

SLIDE 41

Post-Editing Machine Translation

  • Measuring time spent on producing translations

– baseline: translation from scratch
– post-editing machine translation

But: time consuming, and depends on the skills of the translator and post-editor

  • Metrics inspired by this task

– ter: based on the number of editing steps, i.e. Levenshtein operations (insertion, deletion, substitution) plus movement
– hter: manually construct a reference translation for the output, then apply ter
  (very time consuming; used in the DARPA GALE program, 2005–2011)

SLIDE 42

Content Understanding Tests

  • Given machine translation output, can a monolingual target-side speaker answer questions about it?

    1. basic facts: who? where? when? names, numbers, and dates
    2. actors and events: relationships, temporal and causal order
    3. nuance and author intent: emphasis and subtext
  • Very hard to devise questions
  • Sentence editing task (WMT 2009–2010)

– person A edits the translation to make it fluent (with no access to source or reference)
– person B checks whether the edit is correct
→ did person A understand the translation correctly?
