
STS for Machine Translation Evaluation (STS Workshop, NYC, March 12-13 2012)



  1. STS for Machine Translation Evaluation
     STS Workshop, NYC, March 12-13 2012
     Lucia Specia, University of Sheffield
     l.specia@sheffield.ac.uk

  2. Outline
     1 Monolingual STS: MT evaluation against references; TINE
     2 Multilingual STS: MT evaluation without references; adequacy estimation for assimilation purposes
     3 STS for evaluation: does one metric fit all applications?
     4 My 2 cents: STS from an application perspective

  3-7. Monolingual STS
     Meteor: inexact lexical/phrase matching
     Pado et al.: textual entailment features
     Gimenez & Marquez: matching of semantic labels
     Meant: matching of semantic roles (predicates and their arguments)
     TINE: matching of semantic roles (predicates and their arguments), but automatically

  8-9. Tine Is Not Entailment
     R: The lack of snow is putting [people]_A0 off booking [ski holidays]_A1 in [hotels and guest houses]_AM-LOC.
     H: The lack of snow discourages [people]_A0 from ordering [ski stays]_A1 in [hotels and boarding houses]_AM-LOC.
     Lexical matching component L and semantic component A:
     $T(H, \mathcal{R}) = \max_{R \in \mathcal{R}} \frac{\alpha L(H, R) + \beta A(H, R)}{\alpha + \beta}$
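
As a minimal sketch, the combination above could be written as follows. This is not the authors' implementation: `L` and `A` are assumed callables (a lexical metric such as BLEU and the semantic component), and the weights are passed in as parameters.

```python
# Minimal sketch of the TINE combination formula (assumed interface,
# not the authors' code). L and A are callables scoring a hypothesis
# against one reference; alpha and beta weight the two components.

def tine_score(hypothesis, references, L, A, alpha=0.5, beta=0.5):
    """T(H, R): best weighted combination over all references."""
    return max(
        (alpha * L(hypothesis, ref) + beta * A(hypothesis, ref)) / (alpha + beta)
        for ref in references
    )
```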

  10. Tine Is Not Entailment
     L: BLEU; A: matching of verbs and their arguments:
     $A(H, R) = \frac{\sum_{v \in V} \mathrm{verb\_score}(H_v, R_v)}{|V_r|}$
     1. Align verbs using ontologies (VerbNet and VerbOcean): v_h and v_r are aligned if they share a class in VerbNet or hold a relation in VerbOcean
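
A sketch of step 1 and the A(H, R) aggregation, under stated assumptions: `share_verbnet_class` and `verbocean_related` are hypothetical helpers standing in for the two ontology lookups, and frames are dicts keyed by verb.

```python
# Sketch of verb alignment and the A(H, R) aggregation (hypothetical
# helper names; the ontology lookups are assumed, not implemented here).

def align_verbs(hyp_verbs, ref_verbs, share_verbnet_class, verbocean_related):
    """Pair v_h with v_r if they match exactly, share a VerbNet class,
    or hold a VerbOcean relation."""
    pairs = []
    for v_h in hyp_verbs:
        for v_r in ref_verbs:
            if (v_h == v_r or share_verbnet_class(v_h, v_r)
                    or verbocean_related(v_h, v_r)):
                pairs.append((v_h, v_r))
                break  # one alignment per hypothesis verb
    return pairs

def adequacy_score(hyp_frames, ref_frames, pairs, verb_score):
    """A(H, R): sum of verb_score over aligned verb pairs, normalized
    by the number of verbs in the reference, |V_r|."""
    total = sum(verb_score(hyp_frames[v_h], ref_frames[v_r])
                for v_h, v_r in pairs)
    return total / len(ref_frames)
```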

  11. Tine Is Not Entailment
     2. Match arguments with the same semantic roles:
     $\mathrm{verb\_score}(H_v, R_v) = \frac{\sum_{a \in A_h \cap A_r} \mathrm{arg\_score}(H_a, R_a)}{|A_r|}$
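
Step 2 in the same sketch, assuming each frame is a dict mapping semantic role labels (A0, A1, AM-LOC, ...) to the tokens of that argument:

```python
# Sketch of verb_score (step 2). Frames are assumed to be dicts from
# role labels to argument token lists; only roles present in both the
# hypothesis and the reference frame (A_h ∩ A_r) are compared.

def verb_score(hyp_frame, ref_frame, arg_score):
    """Sum arg_score over shared roles, normalized by the number of
    arguments in the reference frame, |A_r|."""
    shared_roles = hyp_frame.keys() & ref_frame.keys()
    total = sum(arg_score(hyp_frame[role], ref_frame[role])
                for role in shared_roles)
    return total / len(ref_frame)
```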

  12-14. Tine Is Not Entailment
     3. Expand arguments using distributional semantics and match them using cosine similarity: $\mathrm{arg\_score}(H_a, R_a)$
     TINE did slightly better than BLEU at segment level.
     The lexical component is extremely important.
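
Step 3, sketched with a plain bag-of-words cosine. The `expand` helper, which would add distributionally similar words to each argument, is an assumption standing in for whatever distributional model is used.

```python
# Sketch of arg_score (step 3): cosine similarity between the
# distributionally expanded argument bags. `expand` is an assumed
# helper mapping a token list to tokens plus similar words.
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def arg_score(hyp_arg_tokens, ref_arg_tokens, expand):
    """arg_score(H_a, R_a): compare the expanded bags of words."""
    return cosine(Counter(expand(hyp_arg_tokens)),
                  Counter(expand(ref_arg_tokens)))
```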

  15-17. Quality Estimation
     No access to a reference translation; the MT system is in use: post-editing, dissemination, assimilation, etc.
     Semantics is particularly important for estimating adequacy.

  18. Example 1 (by Google Translate)
     Target: Chang-e III is expected to launch after 2013
     Source: 嫦娥三号预计 2013 年前后发射
     Reference: Chang-e III is expected to launch around 2013

  19. Example 2 (by Google Translate)
     Target: Continued high floods subside. Guang'an old city has been soaked 2 days 2 nights
     Source: 四川广安洪水持续高位不退 老城区已被泡 2 天 2 夜
     Reference: The continuing floods in Guang'an, Sichuan have not subsided. The old city has been flooded for 2 days and 2 nights.

  20. Example 3 (by Google Translate)
     Target: site security should be included in sex education curriculum for students
     Source: 场地安全性教育应纳入学生的课程
     Reference: site security requirements should be included in the education curriculum for students

  21. Most common problems
     - words translated incorrectly
     - incorrect relationships between words/constituents/clauses
     - missing/untranslated/repeated/added words
     - incorrect word order
     - inflectional/voice errors

  22. MT quality evaluation
     How does the metric vary depending on how the references are produced?
     Standard references, semantic component only: segment-level correlation 0.21
     Post-edited translations, semantic component only: segment-level correlation 0.55

  23. MT quality evaluation vs intrinsic evaluation
     TINE on WMT data: correlation 0.30
     TINE on Microsoft video data: correlation 0.43
     TINE on Microsoft paraphrase data: correlation 0.30
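
For context, these segment-level figures correlate metric scores with human judgments. The slides do not say which correlation coefficient is used; a minimal Pearson version, with placeholder numbers, might look like this:

```python
# Minimal example of a segment-level correlation between metric scores
# and human judgments. The arrays below are placeholders, not data
# from the talk.
from scipy.stats import pearsonr

metric_scores = [0.42, 0.31, 0.77, 0.55]  # e.g. TINE per segment
human_scores = [3.0, 2.5, 4.5, 4.0]       # e.g. adequacy judgments

r, p_value = pearsonr(metric_scores, human_scores)
print(f"segment-level correlation: {r:.2f}")
```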

  24. MT quality estimation and evaluation
     Can we use the same approach as reference-based evaluation, but bilingual?
