Translation Quality Estimation: Past, Present, and Future
André Martins, MT Marathon, Lisbon, August 31st, 2017
André Martins (Unbabel) Quality Estimation MTM, 31/8/17 1 / 69
This Talk
First part: largely based on Lucia Specia's ...
1 MT Evaluation & Quality Estimation
2 Pushing the Limits of Quality Estimation
3 The Future
Direct assessment: Scoring
Is this translation correct?
Direct assessment: Scoring, Ranking, Error annotation
Task-based: Post-editing, Reading comprehension, Eye-tracking
Reference-based: BLEU, Meteor, NIST, TER, WER, PER, CDER, BEER, CiDER, Cobalt, RATATOUILLE, RED, AMBER, PARMESAN, ...
Quality estimation
Source: 不过这一切都由不得你
        (However these all totally beyond the control of you.)
MT:     But all this is beyond the control of you.

Human translation                                      Human score  BLEU score
HT1  But all this is beyond your control.              3.4          0.427
HT2  However, you cannot choose yourself.              2            0.049
HT3  However, not everything is up to you to decide.   2            0.050
HT4  But you can’t choose that.                        2.8          0.055
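The spread in the BLEU column comes from n-gram overlap with whichever reference is used. The sketch below computes clipped n-gram precision, the core ingredient of BLEU, for the MT output against each human translation. It assumes the BLEU column scores the MT output against each HT in turn, and it uses a simplified tokenizer (`tokenize`, lowercased alphanumeric tokens) that is an assumption, not the setup behind the slide. It is not the full metric (no brevity penalty, no geometric mean, no smoothing) and is not expected to reproduce the scores in the table.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase and keep only alphanumeric word tokens (a simplifying assumption)."""
    return re.findall(r"\w+", sentence.lower())


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams that also occur in the reference,
    counting each n-gram at most as often as it appears in the reference."""
    hyp = ngrams(tokenize(hypothesis), n)
    ref = ngrams(tokenize(reference), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


mt = "But all this is beyond the control of you."
references = {
    "HT1": "But all this is beyond your control.",
    "HT2": "However, you cannot choose yourself.",
    "HT3": "However, not everything is up to you to decide.",
    "HT4": "But you can't choose that.",
}

for name, ref in references.items():
    p1 = clipped_precision(mt, ref, 1)
    p2 = clipped_precision(mt, ref, 2)
    print(f"{name}: unigram precision {p1:.3f}, bigram precision {p2:.3f}")
```

HT1 shares most of its word sequence with the MT output, so its precisions are far higher than those of the other three references, even though all four are valid translations.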
X: examples of source & translations → Feature extraction → Features
Features, Y: quality scores for examples in X → Machine Learning → QE model
(Slide credit: Lucia Specia)
Source text x_s' → MT system → Translation x_t'
x_s', x_t' → Feature extraction → Features → QE model → Quality score y'
(Slide credit: Lucia Specia)
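The two diagrams (fit a QE model on featurized examples, then score a new translation without a reference) can be sketched as follows. Everything concrete here is an assumption made for illustration: the three length-based features in `extract_features`, the linear model fitted by batch gradient descent, the toy sentence pairs, and the invented quality scores. Real QE systems use far richer features and learners.

```python
def extract_features(source, translation):
    """Map a (source, translation) pair to a small vector of illustrative features."""
    src, tgt = source.split(), translation.split()
    return [
        1.0,                                          # bias term
        len(tgt) / max(len(src), 1),                  # target/source length ratio
        abs(len(tgt) - len(src)) / max(len(src), 1),  # relative length mismatch
    ]


def predict(weights, features):
    """Linear QE model: quality score is a weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features))


def mean_squared_error(weights, X, y):
    return sum((predict(weights, x) - t) ** 2 for x, t in zip(X, y)) / len(y)


def train(X, y, learning_rate=0.05, epochs=200):
    """Fit the linear model with batch gradient descent on the squared error."""
    weights = [0.0] * len(X[0])
    for _ in range(epochs):
        grads = [0.0] * len(weights)
        for x, t in zip(X, y):
            err = predict(weights, x) - t
            for j, f in enumerate(x):
                grads[j] += 2 * err * f / len(y)
        weights = [w - learning_rate * g for w, g in zip(weights, grads)]
    return weights


# Hypothetical training data: (source, translation, invented quality score).
examples = [
    ("the house is small", "la maison est petite", 0.9),
    ("where is the station", "la gare", 0.3),
    ("thank you", "merci merci beaucoup beaucoup", 0.2),
    ("good morning", "bon matin", 0.8),
]
X = [extract_features(s, t) for s, t, _ in examples]
y = [score for _, _, score in examples]
weights = train(X, y)

# Prediction time: only the source and the MT output are needed, no reference.
new_source = "see you tomorrow"
new_translation = "à demain"
score = predict(weights, extract_features(new_source, new_translation))
print(f"predicted quality score: {score:.3f}")
```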
[Figure labels: s-1, s, s+1; t-1, t, t+1]
English-German

System ID                     Pearson ↑  Spearman ↑
[…]                           0.525      –
POSTECH/SENT-RNN-QV2          0.460      0.483
SHEF-LIUM/SVM-NN-emb-QuEst    0.451      0.474
POSTECH/SENT-RNN-QV3          0.447      0.466
SHEF-LIUM/SVM-NN-both-emb     0.430      0.452
UGENT-LT3/SCATE-SVM2          0.412      0.418
UFAL/MULTIVEC                 0.377      0.410
RTM/RTM-FS-SVR                0.376      0.400
UU/UU-SVM                     0.370      0.405
UGENT-LT3/SCATE-SVM1          0.363      0.375
RTM/RTM-SVR                   0.358      0.384
Baseline SVM                  0.351      0.390
SHEF/SimpleNets-SRC           0.182      –
SHEF/SimpleNets-TGT           0.182      –
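Pearson and Spearman correlations, as reported in the table, measure how well a system's predicted scores track the gold scores. A minimal pure-Python sketch of both follows; the `predicted` and `gold` lists are invented numbers for illustration, not data from the shared task.

```python
import math


def pearson(x, y):
    """Pearson correlation: covariance normalised by both standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """Average ranks (1-based): tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical predicted and gold sentence-level scores.
predicted = [0.10, 0.40, 0.35, 0.80]
gold = [0.20, 0.50, 0.30, 0.90]
print(f"Pearson:  {pearson(predicted, gold):.3f}")
print(f"Spearman: {spearman(predicted, gold):.3f}")
```

Pearson rewards predictions that are linearly related to the gold scores, while Spearman only checks that the two orderings agree, so a system can rank well and still be poorly calibrated.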
2 Pushing the Limits of Quality Estimation
Le travail de La traduction automatique fonctionne-t-elle?
BAD OK OK OK BAD BAD
1 Informs an end user about the reliability of translated content
2 Decides if a translation is good to go or requires a human to fix it
3 Highlights to a human post-editor the words that need to be revised
(Credit: João Graça’s EAMT presentation)
Features        Label                  Input (referenced by the ith target word)
unigram         y_i ∧ ...              Bias
                                       Word, LeftWord, RightWord
                                       SourceWord, SourceLeftWord, SourceRightWord
                                       LargestNGramLeft/Right, SourceLargestNGramLeft/Right
                                       PosTag, SourcePosTag
                                       Word+LeftWord, Word+RightWord
                                       Word+SourceWord, PosTag+SourcePosTag
simple bigram   y_i ∧ y_{i-1} ∧ ...    Bias
rich bigrams    y_i ∧ y_{i-1} ∧ ...    all above
                y_{i+1} ∧ y_i ∧ ...    Word+SourceWord, PosTag+SourcePosTag
syntactic       y_i ∧ ...              DepRel, Word+DepRel
                                       HeadWord/PosTag+Word/PosTag
                                       LeftSibWord/PosTag+Word/PosTag
                                       RightSibWord/PosTag+Word/PosTag
                                       GrandWord/PosTag+HeadWord/PosTag+Word/PosTag
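As an illustration of how the unigram templates in the table turn into concrete features, the sketch below conjoins a label y_i with a handful of inputs for the ith target word. The function name `unigram_features`, the one-to-one `alignment` dictionary, the padding symbols, and the example sentence pair are all assumptions made for this sketch. POS tags, n-gram counts, and the bigram and syntactic templates are left out.

```python
def unigram_features(target, source, alignment, i, label):
    """Instantiate a few of the unigram templates for the i-th target word.

    `alignment` maps a target position to one aligned source position
    (a simplifying assumption; real alignments can be many-to-many or empty).
    """
    def at(tokens, k, pad):
        # Return the token at position k, or a padding symbol outside the sentence.
        return tokens[k] if 0 <= k < len(tokens) else pad

    word = target[i]
    left = at(target, i - 1, "<s>")
    right = at(target, i + 1, "</s>")
    j = alignment.get(i)
    src_word = at(source, j, "<unaligned>") if j is not None else "<unaligned>"

    inputs = {
        "Bias": "",
        "Word": word,
        "LeftWord": left,
        "RightWord": right,
        "SourceWord": src_word,
        "Word+LeftWord": f"{word}|{left}",
        "Word+RightWord": f"{word}|{right}",
        "Word+SourceWord": f"{word}|{src_word}",
    }
    # Each feature conjoins the label y_i with one input template.
    return [f"y={label}^{name}={value}" for name, value in inputs.items()]


# Hypothetical sentence pair with a monotone one-to-one word alignment.
source = ["the", "house", "is", "small"]
target = ["la", "maison", "est", "petit"]
alignment = {0: 0, 1: 1, 2: 2, 3: 3}

feats = unigram_features(target, source, alignment, 1, "OK")
for f in feats:
    print(f)
```

In a linear sequence model, each such string indexes one weight, and the score of a labelling is the sum of the weights of the features it fires.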
3 The Future