SLIDE 1

Automatic Machine Translation Evaluation using Source Language Inputs and Cross-lingual Language Model

Kosuke Takahashi¹ (presenter), Katsuhito Sudoh¹,², Satoshi Nakamura¹

1: Nara Institute of Science and Technology (NAIST)
2: PRESTO, Japan Science and Technology Agency

SLIDE 2

Existing metrics based on surface-level features

• BLEU [Papineni+, 2002], NIST [Doddington+, 2002], METEOR [Banerjee+, 2005]
• Calculate evaluation scores from word matching rates

Problem: relying on lexical features → cannot appropriately evaluate semantic and syntactic differences
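To make the word-matching idea concrete, the snippet below scores one hypothesis with sentence-level BLEU via the sacrebleu library (our choice for illustration; the slides do not name an implementation):

```python
# Minimal sketch of surface-level scoring (sacrebleu stands in here for any
# n-gram matching metric; it is not mentioned on the slides). BLEU rewards
# n-gram overlap only, so a valid paraphrase can still score low.
from sacrebleu import sentence_bleu

hyp = "the cat is sitting on the mat"
ref = "a cat sits on the mat"  # same meaning, different surface form
print(sentence_bleu(hyp, [ref]).score)  # low score despite similar semantics
```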

SLIDE 3

Existing metrics based on embedded representations

• RUSE [Shimanaka+, 2018], BERT regressor [Shimanaka+, 2019]
• Fully parameterized metrics: use sentence vectors, fine-tuned to predict human evaluation scores
• The BERT regressor achieved the SOTA result on the WMT17 metrics task in 2019

These metrics provide better evaluation performance than surface-level ones.

SLIDE 4

Proposed multi-reference evaluation

Conventional multi-reference: the system translation (hypothesis) is evaluated against multiple references (reference 1, reference 2, …, reference n).
○ better evaluation
× costly to prepare multiple reference sentences

Proposed idea: the hypothesis is evaluated against the source sentence (treated as reference 1) and the reference sentence (reference 2).
○ better evaluation
○ little cost: only 2 references per hypothesis

SLIDE 5

Architectures of the baseline and proposed models

[Figure: three model architectures]
• Baseline (BERT regressor, hyp+ref): a sentence-pair encoder encodes the hypothesis + reference pair into a sentence-pair vector v_hyp+ref; an MLP maps it to an evaluation score.
• hyp+src/hyp+ref: two sentence-pair encoders encode hypothesis + source and hypothesis + reference; the concatenation of v_hyp+src and v_hyp+ref is fed to an MLP that outputs the evaluation score.
• hyp+src+ref: a single sentence-pair encoder encodes hypothesis + source + reference into v_hyp+src+ref; an MLP maps it to an evaluation score.
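A minimal sketch of the hyp+src/hyp+ref architecture follows (our reconstruction, assuming a Hugging Face-style encoder such as mBERT; class and variable names are hypothetical, not the authors' code):

```python
# Sketch of the hyp+src/hyp+ref regressor: two sentence-pair vectors are
# concatenated and an MLP predicts the evaluation score. Not the authors'
# code; names and the choice of mBERT checkpoint are our assumptions.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class PairwiseRegressor(nn.Module):
    def __init__(self, name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)  # sentence-pair encoder
        h = self.encoder.config.hidden_size
        # MLP over the concatenated pair vectors [v_hyp+src ; v_hyp+ref].
        self.mlp = nn.Sequential(nn.Linear(2 * h, h), nn.Tanh(), nn.Linear(h, 1))

    def pair_vector(self, tok, a, b):
        # Encode one sentence pair; the [CLS] state serves as the pair vector.
        enc = tok(a, b, return_tensors="pt", padding=True, truncation=True)
        return self.encoder(**enc).last_hidden_state[:, 0]

    def forward(self, tok, hyp, src, ref):
        v_hyp_src = self.pair_vector(tok, hyp, src)
        v_hyp_ref = self.pair_vector(tok, hyp, ref)
        # The whole model is fine-tuned to regress human (DA) scores.
        return self.mlp(torch.cat([v_hyp_src, v_hyp_ref], dim=-1)).squeeze(-1)

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = PairwiseRegressor()
score = model(tok, hyp=["The cat sat."], src=["Die Katze saß."],
              ref=["The cat sat down."])
```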

SLIDE 6

Experimental settings

• Language model: mBERT, XLM15
• Input: hyp+src/hyp+ref, hyp+src+ref, hyp+ref, hyp+src
• Baselines: SentBLEU, BERT regressor (BERT with hyp+ref)
• Data: WMT17 metrics shared task
• Language pairs: {De, Ru, Tr, Zh}-En
SLIDE 7

Results: comparison with baselines

metric or language model           input style       average score (r)
SentBLEU                           hyp, ref          48.4
BERT regressor (monolingual BERT)  hyp+ref           74.0
mBERT                              hyp+src/hyp+ref   72.6
mBERT                              hyp+src+ref       68.9
XLM15                              hyp+src/hyp+ref   77.1 (+3.1 over the BERT regressor)
XLM15                              hyp+src+ref       74.7

• The proposed XLM15 with hyp+src/hyp+ref surpassed the baseline scores
SLIDE 8

Results: evaluation performance for each input style

language model   input style       average score (r)
mBERT            hyp+ref           67.9
mBERT            hyp+src           55.9
mBERT            hyp+src/hyp+ref   72.6 (+4.7 over hyp+ref)
mBERT            hyp+src+ref       68.9
XLM15            hyp+ref           74.1
XLM15            hyp+src           72.8
XLM15            hyp+src/hyp+ref   77.1 (+3.0 over hyp+ref)
XLM15            hyp+src+ref       74.7

• Using both src and ref improves evaluation performance
• hyp+src/hyp+ref was the best input style
SLIDE 9

Analysis: scatter plots of evaluation and DA scores

[Figure: scatter plot of XLM15 hyp+src/hyp+ref evaluation scores against DA scores]

Pearson's correlation:
• All: 0.768
• DA ≥ 0.0: 0.580
• DA < 0.0: 0.529

Low-quality translations are hard to evaluate.

Note: DA (Direct Assessment) is a human evaluation score.

SLIDE 10

Analysis: the drop rate of Pearson's correlation from the high-DA to the low-DA range

Note: the reduction rate indicates how much evaluation performance degrades from high-quality to low-quality translations.

language model                     input style       reduction rate (%)
BERT regressor (monolingual BERT)  hyp+ref           16.10
mBERT                              hyp+ref           22.05
mBERT                              hyp+src           6.88
mBERT                              hyp+src/hyp+ref   7.77 (−14.28 vs. hyp+ref)
mBERT                              hyp+src+ref       17.51
XLM15                              hyp+ref           14.20
XLM15                              hyp+src           8.46
XLM15                              hyp+src/hyp+ref   8.68 (−5.52 vs. hyp+ref)
XLM15                              hyp+src+ref       11.12
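For reference, here is a small sketch of how such a reduction rate can be computed; the formula is our assumption, but it matches the slides' numbers up to rounding:

```python
# Sketch (assumed formula): Pearson correlation is computed separately on
# the high-DA (DA >= 0) and low-DA (DA < 0) subsets, and the reduction
# rate is the relative drop between the two.
import numpy as np
from scipy.stats import pearsonr

def reduction_rate(metric_scores, da_scores):
    m, da = np.asarray(metric_scores), np.asarray(da_scores)
    r_high = pearsonr(m[da >= 0.0], da[da >= 0.0])[0]
    r_low = pearsonr(m[da < 0.0], da[da < 0.0])[0]
    return 100.0 * (r_high - r_low) / r_high

# With slide 9's correlations: (0.580 - 0.529) / 0.580 * 100 ≈ 8.8 %,
# close to the 8.68 % reported for XLM15 hyp+src/hyp+ref.
```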

SLIDE 11

Summary

• Proposed an MT evaluation metric that uses source sentences as pseudo-references
• hyp+src/hyp+ref makes good use of source sentences and is confirmed to improve evaluation performance
• XLM15 with hyp+src/hyp+ref showed higher correlation with human judgments than the baselines
• Source information contributes to stabilizing the evaluation of low-quality translations

Future Work

• Experiment with more language models and datasets
• Focus on better evaluation of low-quality translations