SLIDE 1

Direct Assessment First Conference on Machine Translation (WMT), August 2016

Direct Assessment

Yvette Graham August 11, 2016

SLIDE 2

Direct Assessment (DA) I

  • Consideration is being given to using DA alone for next year.

Reasons:

  • High correlation between RR and DA.
  • It seems we could get good clusters with (conservatively) half the annotation time, computed as follows:
    • RR requires 100 HITs per system submission at an average of 5 minutes per HIT, so 500 minutes ≈ 8 hours.
    • DA at 500 translations per system is what we might need (maybe more in some cases), and that takes about 2.5 hours (half an hour per HIT).
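The time estimate above can be reproduced with a short calculation. The HIT counts and timings are the ones on the slide; the assumption that a DA HIT covers 100 translations is inferred from "half an hour per HIT":

```python
# Relative Ranking (RR): 100 HITs per system submission, ~5 minutes per HIT.
rr_minutes = 100 * 5            # 500 minutes
rr_hours = rr_minutes / 60      # roughly 8 hours

# Direct Assessment (DA): 500 translations per system, assessed in HITs of
# ~100 translations taking ~30 minutes each (assumption, see lead-in).
da_hits = 500 / 100
da_hours = da_hits * 30 / 60    # 2.5 hours

print(f"RR: {rr_hours:.1f} h per system, DA: {da_hours:.1f} h per system")
```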

SLIDE 3

Direct Assessment (DA) II

  • The English side can be completely crowdsourced.
  • Leaves researchers responsible only for tasks where we can't find crowdsourced workers.

SLIDE 4

Correlation of RR and DA

  cs-en  0.997
  fi-en  0.996
  tr-en  0.988
  de-en  0.964
  ru-en  0.961
  ro-en  0.920
  en-ru  0.975
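The figures above are correlations between the system scores produced by RR and by DA for each language pair. A minimal Pearson-correlation sketch in pure Python (the per-system scores below are invented placeholders, not WMT data):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores for five systems under each evaluation method.
rr_scores = [0.55, 0.48, 0.61, 0.40, 0.52]
da_scores = [72.1, 68.3, 75.0, 63.9, 70.4]
print(pearson(rr_scores, da_scores))
```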

SLIDE 5

# Human Assessment vs Significant Differences: CS-EN

[Plot: % significant differences vs. assessed translations per system (500–2500), comparing DA with RR (with de-duplication)]
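These plots show, for each language pair, the percentage of system pairs found significantly different as the number of assessed translations grows. A minimal sketch of how such a percentage could be computed, assuming per-translation DA scores for each system and a Wilcoxon rank-sum test with normal approximation (the exact test used at WMT may differ, and the demo scores are invented):

```python
import math

def rank_sum_test(a, b):
    """Two-sided p-value for the Wilcoxon rank-sum (Mann-Whitney) test,
    using the normal approximation without tie correction."""
    combined = sorted((v, i) for i, v in enumerate(a + b))
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):          # assign ranks, averaging ties
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        avg = (i + j) / 2 + 1         # ranks are 1-based
        for k in range(i, j + 1):
            ranks[combined[k][1]] = avg
        i = j + 1
    n1, n2 = len(a), len(b)
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def significant_pairs(system_scores, alpha=0.05):
    """Percentage of system pairs whose score samples differ significantly."""
    names = list(system_scores)
    pairs = [(x, y) for i, x in enumerate(names) for y in names[i + 1:]]
    sig = sum(1 for x, y in pairs
              if rank_sum_test(system_scores[x], system_scores[y]) < alpha)
    return 100.0 * sig / len(pairs)

# Two clearly different systems and one tie: 2 of 3 pairs differ.
demo = {"sysA": [80, 81, 82, 83, 84] * 10,
        "sysB": [60, 61, 62, 63, 64] * 10,
        "sysC": [80, 81, 82, 83, 84] * 10}
print(significant_pairs(demo))
```

Recomputing this percentage over successively larger subsamples of the collected judgments would trace out curves like the ones plotted here.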

SLIDE 6

DE-EN

[Plot: % significant differences vs. assessed translations per system (500–3000), comparing DA with RR (with de-duplication)]

SLIDE 7

FI-EN

[Plot: % significant differences vs. assessed translations per system (500–3000), comparing DA with RR (with de-duplication)]

SLIDE 8

RO-EN

[Plot: % significant differences vs. assessed translations per system (500–2000), comparing DA with RR (with de-duplication)]

SLIDE 9

RU-EN

[Plot: % significant differences vs. assessed translations per system (500–3500), comparing DA with RR (with de-duplication)]

SLIDE 10

TR-EN

[Plot: % significant differences vs. assessed translations per system (500–2000), comparing DA with RR (with de-duplication)]

SLIDE 11

EN-RU

[Plot: % significant differences vs. assessed translations per system (200–1200), comparing DA with RR (with de-duplication)]

SLIDE 12

Conclusions

  • The trial of DA was successful overall.
  • No problems crowd-sourcing any of the to-English language pairs.
  • Not enough workers for the out-of-English news language pairs, except English to Russian; those LPs must unfortunately remain the task of participants.
  • Correlation with RR is high across the board.
  • DA achieves almost as many significant differences as RR, but without deduplication.

More to come:

  • WMT'16 included RR with deduplication and DA without it, which makes comparing the numbers of judgments difficult.
  • Future: compare the undeduplicated versions to see what effect deduplication had, since the current comparison is really unfair.