MetricsMaTr10
Evaluation Overview & Summary of Results
Kay Peterson & Mark Przybocki; Brian Antonishek, Mehmet Yilmaz, Martial Michel
WMT10 & NIST MetricsMaTr10 @ ACL10, Uppsala, Sweden, July 15-16, 2010 (public version of slides, v1-1, October 22, 2010)
Schedule (begin date, end date, task):
- January 11: Announcement of evaluation plans
- March 26 - May 14: Metric submission
- May 15 - June/July: Metric installation and data set scoring
- July 2: Preliminary release of results
- July 15-16: Workshop
- September: Official results posted on NIST web space
Participants (affiliation, URL, metric names):
- Aalto University of S&T *: MT-NCD, MT-mNCD
- BabbleQuest (http://www.babblequest.com/badger2): badger-2.0-lite, badger-2.0-full
- City University of Hong Kong * (http://mega.ctl.cityu.edu.hk/ctbwong/ATEC): ATEC-2.1
- Carnegie Mellon * (http://www.cs.cmu.edu/~alavie/METEOR): meteor-next-rank, meteor-next-hter, meteor-next-adq
- Columbia University (http://www1.ccls.columbia.edu/~SEPIA): SEPIA
- Charles University Prague *: SemPOS, SemPOS-BLEU
- Dublin City University *: DCU-LFG
- University of Edinburgh *: LRKB4, LRHB4
- Harbin Institute of Technology: i-letter-BLEU, i-letter-recall, SVM-rank
- National University of Singapore * (http://nlp.comp.nus.edu.sg/software): TESLA, TESLA-M
- Stanford University NLP: Stanford
- University of Maryland (http://www.umiacs.umd.edu/~snover/terp): TERp
- Universitat Politecnica de Catalunya & University of Barcelona * (http://www.lsi.upc.edu/~nlp/Asiya): IQmt-Drdoc, IQmt-DR, IQmt-ULCh
- University of Southern California, ISI (http://www.isi.edu/publications/licensed-sw/BE/index.html): BEwT-E, Bkars
Note: marked entries participated in MetricsMaTr08.
* Represented with a paper in ACL 2010 main or WMT/MetricsMaTr workshop proceedings
Metric: MT-NCD
Features:
Metric: MT-mNCD
Features:
- WordNet synsets (English)
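The metric names suggest normalized compression distance (NCD), a standard compression-based similarity measure. As a hedged sketch only (zlib chosen here as the compressor; the submitted MT-NCD implementation may differ in compressor and preprocessing):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length under a real-world compressor."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Lower values indicate that hypothesis and reference share more compressible structure; MT-mNCD reportedly layers WordNet synset matching on top of this idea.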
July 15-16 2010 (public version of slides, v1-1, October 22 2010) 6 WMT10 & NIST MetricsMaTr10 @ ACL10 Uppsala Sweden
Metric: badger-2.0-full
Features:
- languages
- (Levenshtein)
Metric: badger-2.0-lite
Features:
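The "(Levenshtein)" fragment above points to an edit-distance component. As an illustrative sketch of the classic dynamic-programming Levenshtein distance (not the Badger implementation itself):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    turning string a into string b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,          # insertion
                           prev[j - 1] + (ca != cb)))  # substitution/match
        prev = cur
    return prev[-1]
```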
[Chart: Badger lite correlation with Adequacy7, 1Ref — Spearman's rho at segment, document, and system levels; 2008 (badger-lite) vs. 2010 (badger-2.0-lite)]
Metric: ATEC-2.1
Features:
[Chart: ATEC correlation with Adequacy7, 1Ref — Spearman's rho at segment, document, and system levels; 2008 (ATEC1) vs. 2010 (ATEC2.1)]
Metric: meteor-next-rank
Features:
- … synonym, and paraphrase matches
- … WMT09
Metric: meteor-next-hter
Features:
- … correlation with GALE P2 HTER data
Metric: meteor-next-adq
Features:
- … with NIST OpenMT 2009 human adequacy judgments
Metric: SEPIA
Features:
- … surface spans
- … reference(s)
- … n-grams, POS tags, or dependency relations and lemmatization
[Chart: SEPIA correlation with Adequacy7, 1Ref — Spearman's rho at segment, document, and system levels; 2008 (SEPIA1) vs. 2010 (SEPIA)]
Metric: SemPOS
Features:
- … hyp and ref translation given a fine-grained semantic part-of-speech (sempos)
Metric: SemPOS-BLEU
Features:
- BLEU is calculated on surface forms
- … only autosemantic words
Metric: DCU-LFG
Features:
- … labels differ
- … weighted to maximize correlation with human judgment
Metric: LRscore (LRKB4, LRHB4)
Features:
Metric: i-letter-BLEU
Features:
Metric: i-letter-recall
Features:
Metric: SVM-rank
Features:
- … system translations
- … ROUGE-L recall, letter-based TER, letter-based BLEU-cum-5, letter-based ROUGE-L recall, and letter-based ROUGE-S recall
Metric: TESLA-M
Features:
Metric: TESLA
Features:
- … synonyms
Metric: Stanford
Features:
- … techniques
Metric: TERp
Features:
[Chart: TERp correlation with Adequacy7, 1Ref — Spearman's rho at segment, document, and system levels; 2008 vs. 2010]
Metric: ULCH
Features:
Metric: DR
Features:
- … representations operating at the segment level
- respectively computing lexical overlap, morphosyntactic overlap, and semantic tree matching
Metric: DRdoc
Features:
- "DR" at the whole document level
Metric: BEwT-E
Features:
- … related words
Metric: Bkars
Features:
- Uses the Snowball package of stemmers
Baseline metrics:
Metric: BLEU-v11b
Version: MTEVAL version 11b
Description: Modified BLEU-4 with an improved brevity penalty; case-sensitive n-gram co-occurrence statistics; official metric of recent NIST Open MT evaluations
Metric: BLEU-v12
Authoring Affiliation: NIST (IBM) (2008)
Description: Updated BLEU-v11b (above) with UTF-8 tokenization rules for non-English target languages
Metric: BLEU-v13a
Authoring Affiliation: NIST (IBM) (2009)
Description: XML version; command-line options for some non-English translations
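For reference, the BLEU-4 score these baselines build on combines modified n-gram precisions with a brevity penalty. A minimal uniform-weight, single-sentence sketch (illustrative only; NIST's mteval scripts add tokenization, multi-reference handling, and corpus-level counting):

```python
from collections import Counter
import math

def bleu(hyp: list, ref: list, max_n: int = 4) -> float:
    """Uniform-weight sentence BLEU with the standard brevity penalty.
    hyp and ref are token lists; a single reference is assumed."""
    precisions = []
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((h & r).values())        # clipped n-gram matches
        total = max(sum(h.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0                             # unsmoothed: any zero precision -> 0
    # Brevity penalty: penalize hypotheses shorter than the reference.
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```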
Origin | Source | Target | Genre(s) | Docs | Segments | Words (est.) | Systems (mt+ht) | Refs
MT08 | Arabic | English | NW, WB | 42 | 405 | 15,100 | 10+2 | 4
MT08 | Chinese | English | NW, WB | 51 | 607 | 15,000 | 10+2 | 4
GALE P2 | Arabic | English | NW, WB | 45 | 469 | 11,450 | 3 | 1
GALE P2 | Chinese | English | NW, WB | 47 | 392 | 10,150 | 3 | 1
GALE P2.5 | Arabic | English | BN | 20 | 210 | 5,300 | 2 | 1
GALE P2.5 | Chinese | English | BC, BN | 42 | 289 | 10,000 | 3 | 1
TRANSTAC Jan07 | Arabic | English | Dialog | 15 | 433 | 5,150 | 5+2 | 4
TRANSTAC Jul07 | Arabic | English | Dialog | 47 | 419 | 6,450 | 5+2 | 4
TRANSTAC Jul07 | Farsi | English | Dialog | 25 | 414 | 4,550 | 5+2 | 4
Origin | Source | Target | Genre(s) | Docs | Segments | Words (est.) | Systems (mt+ht) | Refs
CESTA run1 | Arabic | French | General | 16 | 298 | 27,950 | (2+1) | 4
CESTA run1 | English | French | General | 15 | 790 | 21,350 | (5+1) | 4
CESTA run2 | Arabic | French | Health | 30 | 824 | 20,100 | (1+1) | 4
CESTA run2 | English | French | Health | 16 | 917 | 22,550 | (5+1) | 4
TRANSTAC Jan07 | English | Arabic | Dialogs | | | | 5 | 4
(E0020, http://catalog.elra.info/product_info.php?products_id=994)
Source | Target | Genre | Documents | Segments | Words (est.) | Systems (single+combo) | References
Czech | English | NW | 94 | 2034 | 42,000 | 7+5 | 1
French | English | | | | 54,000 | 16+8 |
German | English | | | | 49,000 | 18+7 |
Spanish | English | | | | 52,000 | 10+4 |
English | Czech | | | | 50,000 each | 12+5 |
English | French | | | | | 15+4 |
English | German | | | | | 14+4 |
English | Spanish | | | | | 12+4 |
Data Attributes | NIST Open MT-06 | TRANSTAC
Genre | Newswire | Training dialogs
Number of documents | 25 | 1 (included as sample)
Total number of segments | 249 | 17
Source language | Arabic | Iraqi Arabic
Number of system translations | 8 | 5
Human judgment types, by data subset (Adequacy 7pt, Yes/No decision, Adequacy 5pt, Preference, Fluency 5pt, HTER, Low-level concept, Adequacy 4pt, DLPT*, Relative Rank):
- MT08: √ √ √ √
- GALE: √ √ √ √
- TRANSTAC: √ √ √ √ √
- CESTA: √ √
- WMT: √
- … set
- … translation
- … translation
- … highlighted as a visual aid
- … (7-point scale)
- … (Yes/No)
- … independent judgments for each segment in MetricsMaTr08 test set
Allowing for 2-off category judgments, we achieve over 90% inter-annotator agreement.
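One way to read "2-off agreement" is the fraction of segments where two annotators' 7-point scores differ by at most two categories. A hedged sketch of that computation (the scores below are hypothetical, and this may not be NIST's exact procedure):

```python
def agreement_within(scores_a: list, scores_b: list, tolerance: int = 2) -> float:
    """Fraction of paired judgments whose absolute difference is
    at most `tolerance` categories (pairwise 2-off agreement)."""
    hits = sum(abs(a - b) <= tolerance for a, b in zip(scores_a, scores_b))
    return hits / len(scores_a)
```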
Adequacy score vs. Yes/No decision (coverage):
Score | Yes | No | Total
7 (All) | 20.8% | 0.6% | 21.4%
6 | 14.9% | 6.7% | 21.6%
5 | 8.7% | 10.3% | 19.0%
4 (Half) | — | 18.8% | 18.8%
3 | — | 9.2% | 9.2%
2 | — | 5.6% | 5.6%
1 (None) | — | 4.4% | 4.4%
Score range | Yes | Mixed | No | Total
6+ to 7 | 21.5% | 2.2% | 0.2% | 23.9%
5+ to 6 | 10.2% | 9.0% | 3.4% | 22.6%
4+ to 5 | 1.2% | 9.2% | 10.9% | 21.3%
3+ to 4 | — | 2.0% | 15.0% | 17.0%
2+ to 3 | — | 0.1% | 9.3% | 9.4%
1+ to 2 | — | — | 5.8% | 5.8%
http://www.itl.nist.gov/iad/mig/tests/metricsmatr/2010/results
Rank | Seg rho (25,473 data points) | Doc rho (2,179 data points) | Sys rho (89 data points)
1 | meteor-next-rank | meteor-next-rank | meteor-next-rank
2 | TERp | meteor-next-adq | meteor-next-adq
3 | meteor-next-adq | meteor-next-hter | meteor-next-hter
4 | meteor-next-hter | i-letter-recall | i-letter-recall
5 | ATEC-2.1 | i-letter-BLEU | i-letter-BLEU
6 | i-letter-recall | TERp | SEPIA
7 | i-letter-BLEU | NIST-c | TERp
8 | Bkars | SEPIA | NIST-c
9 | SEPIA | Bkars | Bkars
10 | NIST-c | BLEU-4-v13a-c | DCU-LFG
11 | BLEU-4-v13a-c | ATEC-2.1 | ATEC-2.1
12 | badger-2.0-full | DCU-LFG | BLEU-4-v13a-c
13 | BEwT-E | BEwT-E | BEwT-E
14 | badger-2.0-lite | badger-2.0-full | badger-2.0-full
15 | DCU-LFG | badger-2.0-lite | badger-2.0-lite
16 | TESLA | TESLA | TESLA
17 | MT-mNCD | TESLA-M | IQMT-DR
18 | MT-NCD | SemPOS-BLEU | TESLA-M
19 | SemPOS-BLEU | MT-mNCD | SemPOS-BLEU
20 | TESLA-M | IQMT-DR | SemPOS
21 | IQMT-DR | IQMT-DRdoc | IQMT-DRdoc
22 | SemPOS | SemPOS | MT-mNCD
23 | IQMT-DRdoc | MT-NCD | MT-NCD
Spearman’s rho correlation
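The rankings are based on Spearman's rho, i.e. the Pearson correlation of the ranks; for rank vectors without ties it reduces to the closed form rho = 1 - 6*Σd²/(n(n²-1)). A minimal sketch under that no-ties assumption (real segment-level data would need tie handling):

```python
def spearman_rho(x: list, y: list) -> float:
    """Spearman's rank correlation via the no-ties closed form
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A rho of 1 means a metric orders the translations exactly as the human judgments do; -1 means the reverse order.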
Bold italics = baseline metrics
1Ref, Adequacy7, Target Eng, Doc
- … topic) document level
- … technology evaluations such as NIST OpenMT
[Chart: Segment-level Spearman's rho correlations (absolute values), with lower and upper confidence intervals]
[Chart: Document-level Spearman's rho correlations (absolute values), with lower and upper confidence intervals]
[Chart: System-level Spearman's rho correlations (absolute values), with lower and upper confidence intervals]
[Chart: Segment-, document-, and system-level Spearman's rho correlations (absolute values) for the 11 metrics with the highest 1-reference segment correlation; 1ref vs. 4ref conditions]
[Chart: Highest Spearman's rho correlations with Adequacy7 judgments, 2008 vs. 2010; top metrics labeled in order: TERp, meteor-v0.6, meteor-v0.7, CDer, CDer, ATEC3, meteor-next-rank, meteor-next-rank, meteor-next-rank, i-letter-BLEU, meteor-next-rank, NIST-c]
[Chart: Highest Spearman's rho correlations with AdequacyYesNo judgments, 2008 vs. 2010; top metrics labeled in order: TERp, TERp, meteor-v0.6, SVM-rank, TERp, SEPIA1, TERp, meteor-next-adq, meteor-next-adq, meteor-next-adq, TERp, SEPIA]
[Chart: Highest Spearman's rho correlations with Preference judgments, 2008 vs. 2010; top metrics labeled in order: TERp, LET, meteor-rank, SVM-rank, TERp, SEPIA1, TERp, i-letter-BLEU, i-letter-recall, Bkars, TERp, SEPIA]
[Chart: Highest Spearman's rho correlations with Adequacy4 judgments, 2008 vs. 2010; top metrics labeled in order: TERp, TERp, 4-GRR, 4-GRR, 4-GRR, 4-GRR, TERp, TERp, TERp, TERp, TESLA-M, BLEU-4-v13a-c]
[Chart: Highest Spearman's rho correlations with OddsConceptCorrect judgments, 2008 vs. 2010; top metrics labeled in order: TERp, meteor-v0.6, TERp, CDer, TERp, 4-GRR, meteor-next-rank, meteor-next-rank, TERp, TERp, TESLA-M, TERp]
[Chart: Highest Spearman's rho correlations with HTER, 2008 vs. 2010; top metrics labeled in order: RTE-MT, EDPM, DP-Orp, TERp, meteor-next-hter, IQMT-DRdoc]
[Chart: System-level Spearman's rho correlations (absolute values), WMT into-English: Czech-English, French-English, German-English, Spanish-English, Average]
[Chart: System-level Spearman's rho correlations (absolute values), WMT out-of-English: English-Czech, English-French, English-German, English-Spanish, Average]