
Automatic Machine Translation Evaluation using Source Language Inputs and Cross-lingual Language Model - PowerPoint PPT Presentation



  1. Automatic Machine Translation Evaluation using Source Language Inputs and Cross-lingual Language Model. Presenter: Kosuke Takahashi¹, Katsuhito Sudoh¹ ², Satoshi Nakamura¹. 1: Nara Institute of Science and Technology (NAIST), 2: PRESTO, Japan Science and Technology Agency

  2. Existing metrics based on surface-level features
  • BLEU [Papineni+, 2002], NIST [Doddington+, 2002], METEOR [Banerjee+, 2005]
  • Calculate evaluation scores from the word matching rate
  Problem: relying on lexical features, they cannot appropriately evaluate semantic and syntactic differences
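
  As an illustration of the surface-level approach, a minimal sentence-level BLEU computation with the sacrebleu package might look like the sketch below; the example sentences are made up and not from the slides.

```python
import sacrebleu

hypothesis = "the cat sat on the mat"          # system translation
reference = "there is a cat on the mat"        # human reference

# Sentence-level BLEU only measures n-gram overlap with the reference,
# so a valid paraphrase of the reference can still receive a low score.
print(sacrebleu.sentence_bleu(hypothesis, [reference]).score)
```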

  3. Existing metrics based on embedded representations
  • RUSE [Shimanaka+, 2018], BERT regressor [Shimanaka+, 2019]
  • Fully parameterized metrics
  • Use sentence vectors, fine-tuned to predict human evaluation scores
  • The BERT regressor achieved the SOTA result on the WMT17 metrics task in 2019
  These metrics provide better evaluation performance than surface-level ones.
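
  Such a regressor can be pictured as a pretrained sentence-pair encoder with a small regression head fine-tuned on human scores. A minimal sketch follows; the checkpoint name, pooling choice, and MLP sizes are assumptions for illustration, not the authors' exact configuration.

```python
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SentencePairRegressor(nn.Module):
    """Encode a (hypothesis, reference) pair and regress a quality score."""
    def __init__(self, model_name: str = "bert-base-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, **inputs):
        # Use the first ([CLS]) token vector as the sentence-pair representation.
        pooled = self.encoder(**inputs).last_hidden_state[:, 0]
        return self.mlp(pooled).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = SentencePairRegressor()
batch = tokenizer("the cat sat on the mat",        # hypothesis
                  "there is a cat on the mat",     # reference
                  return_tensors="pt")
score = model(**batch)  # in practice, fine-tuned with an MSE loss against DA scores
```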

  4. Proposed idea
  Conventional multi-reference evaluation: the system translation (hypothesis) is compared against multiple reference sentences (reference 1 ... reference n).
  ○ better evaluation  × costly to prepare multiple reference sentences for each hypothesis
  Proposed idea: the system translation (hypothesis) is compared against the source sentence (treated as reference 1) and the reference sentence (reference 2).
  ○ better evaluation  ○ only a little cost to prepare the 2 references

  5. Architectures of the baseline and proposed models
  Baseline (BERT regressor, hyp+ref): a sentence-pair encoder encodes hypothesis + reference into a sentence-pair vector v_hyp+ref, and an MLP maps it to the evaluation score.
  Proposed hyp+src/hyp+ref: two sentence-pair encoders encode hypothesis + source and hypothesis + reference; the concatenation of v_hyp+src and v_hyp+ref is fed to an MLP that outputs the evaluation score.
  Proposed hyp+src+ref: a single encoder encodes hypothesis + source + reference into v_hyp+src+ref, followed by an MLP that outputs the evaluation score.
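
  A sketch of the hyp+src/hyp+ref variant under the same caveats: two sentence-pair vectors are concatenated before the MLP. The checkpoint name (a stand-in for mBERT; XLM-15 would be swapped in) and layer sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed stand-in for mBERT

class HypSrcHypRefRegressor(nn.Module):
    """hyp+src/hyp+ref: encode hyp+src and hyp+ref separately, concatenate, then regress."""
    def __init__(self, model_name: str = MODEL_NAME):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.mlp = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def encode_pair(self, batch):
        # First-token vector as the sentence-pair representation.
        return self.encoder(**batch).last_hidden_state[:, 0]

    def forward(self, hyp_src, hyp_ref):
        v = torch.cat([self.encode_pair(hyp_src), self.encode_pair(hyp_ref)], dim=-1)
        return self.mlp(v).squeeze(-1)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = HypSrcHypRefRegressor()
hyp = "the cat sits on the mat"                # system translation (hypothesis, En)
src = "die Katze sitzt auf der Matte"          # source sentence (De)
ref = "the cat is sitting on the mat"          # human reference (En)
score = model(tokenizer(hyp, src, return_tensors="pt"),
              tokenizer(hyp, ref, return_tensors="pt"))
```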

  6. Experimental settings
  • Language models: mBERT, XLM15
  • Input styles: hyp+src/hyp+ref, hyp+src+ref, hyp+ref, hyp+src
  • Baselines: SentBLEU, BERT regressor (BERT with hyp+ref)
  • Data: WMT17 metrics shared task
  • Language pairs: {De, Ru, Tr, Zh}-En

  7. Results: comparison with baselines
  metric / language model | input style | average score (r)
  SentBLEU | hyp, ref | 48.4
  BERT regressor (monolingual BERT) | hyp+ref | 74.0
  mBERT | hyp+src/hyp+ref | 72.6
  mBERT | hyp+src+ref | 68.9
  XLM15 | hyp+src/hyp+ref | 77.1 (+3.1 over the BERT regressor baseline)
  XLM15 | hyp+src+ref | 74.7
  • The proposed XLM15 with hyp+src/hyp+ref surpassed the baseline scores

  8. Results: evaluation performance for each input style
  language model | input style | average score (r)
  mBERT | hyp+ref | 67.9
  mBERT | hyp+src | 55.9
  mBERT | hyp+src/hyp+ref | 72.6 (+4.7 over hyp+ref)
  mBERT | hyp+src+ref | 68.9
  XLM15 | hyp+ref | 74.1
  XLM15 | hyp+src | 72.8
  XLM15 | hyp+src/hyp+ref | 77.1 (+3.0 over hyp+ref)
  XLM15 | hyp+src+ref | 74.7
  • Using both src and ref improves evaluation performance
  • hyp+src/hyp+ref was the best input style

  9. Analysis: scatter plots of evaluation and DA scores (XLM15, hyp+src/hyp+ref)
  Pearson's correlation score: All: 0.768, DA ≥ 0.0: 0.580, DA < 0.0: 0.529
  Low-quality translations are hard to evaluate.
  Note: DA (Direct Assessment) is a human evaluation score
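
  This analysis amounts to computing Pearson's r between metric outputs and DA scores over all segments and over the two DA ranges separately. A sketch with placeholder data follows; the real analysis would use the WMT17 segment-level scores, and the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
da_scores = rng.normal(size=500)                              # placeholder DA scores (standardized)
metric_scores = da_scores + rng.normal(scale=0.8, size=500)   # placeholder metric predictions

r_all, _ = pearsonr(metric_scores, da_scores)
high = da_scores >= 0.0
r_high, _ = pearsonr(metric_scores[high], da_scores[high])    # high-quality range
r_low, _ = pearsonr(metric_scores[~high], da_scores[~high])   # low-quality range
# The slide reports 0.768 / 0.580 / 0.529 for XLM15 with hyp+src/hyp+ref.
print(f"all={r_all:.3f}  DA>=0={r_high:.3f}  DA<0={r_low:.3f}")
```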

  10. Analysis: the drop rate of Pearson's correlation score from the high-DA to the low-DA range
  language model | input style | reduction rate (%)
  BERT regressor (monolingual BERT) | hyp+ref | 16.10
  mBERT | hyp+ref | 22.05
  mBERT | hyp+src | 6.88
  mBERT | hyp+src/hyp+ref | 7.77 (14.28 points lower than hyp+ref)
  mBERT | hyp+src+ref | 17.51
  XLM15 | hyp+ref | 14.20
  XLM15 | hyp+src | 8.46
  XLM15 | hyp+src/hyp+ref | 8.68 (5.52 points lower than hyp+ref)
  XLM15 | hyp+src+ref | 11.12
  Note: the reduction rate indicates how much evaluation performance degrades from high- to low-quality translations
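
  The reduction rate is presumably the relative drop of Pearson's r from the high-DA range to the low-DA range; the definition in the sketch below is an assumption inferred from the slides, and the small mismatch with the reported 8.68 for XLM15 hyp+src/hyp+ref likely comes from rounding of the displayed correlations.

```python
def reduction_rate(r_high: float, r_low: float) -> float:
    # Assumed definition: relative drop (%) of Pearson's r from high- to low-quality segments.
    return (r_high - r_low) / r_high * 100.0

print(reduction_rate(0.580, 0.529))  # ~8.8 with the rounded values from the previous slide
```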

  11. Summary
  • Proposed an MT evaluation metric that utilizes source sentences as pseudo references
  • hyp+src/hyp+ref makes good use of source sentences and was confirmed to improve evaluation performance
  • XLM15 with hyp+src/hyp+ref showed higher correlation with human judgments than the baselines
  • Source information contributes to stabilizing the evaluation of low-quality translations
  Future work
  • Experiment with more language models and datasets
  • Focus on better evaluation of low-quality translations
