SLIDE 1

STS for Machine Translation Evaluation

STS Workshop, NYC, March 12-13, 2012

Lucia Specia
University of Sheffield
l.specia@sheffield.ac.uk

SLIDE 2

Outline

1 Monolingual STS
  MT evaluation against references: TINE
2 Multilingual STS
  MT evaluation without references: adequacy estimation for assimilation purposes
3 STS for Evaluation
  One metric for all evaluation applications? One metric for all applications?
4 My 2 cents
  STS from an application perspective

SLIDES 3-7

Monolingual STS

Meteor: inexact lexical/phrase matching
Pado et al.: textual entailment features
Gimenez & Marquez: matching of semantic labels
Meant: matching of semantic roles (predicates and their arguments)
TINE: matching of semantic roles (predicates and their arguments), but automatically

SLIDES 8-9

TINE Is Not Entailment

R: The lack of snow is putting [people]A0 off booking [ski holidays]A1 in [hotels and guest houses]AM-LOC.
H: The lack of snow discourages [people]A0 from ordering [ski stays]A1 in [hotels and boarding houses]AM-LOC.

Lexical matching component L and semantic component A:

T(H, R) = max_{R ∈ R} [α·L(H, R) + β·A(H, R)] / (α + β)
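The combined score can be sketched in a few lines of Python. This is my reading of the slide's formula, not the released TINE implementation; the toy L and A components below stand in for BLEU and SRL matching:

```python
def tine_score(L, A, hyp, refs, alpha=0.5, beta=0.5):
    """T(H, R): best weighted mix of a lexical score L and a semantic
    score A over a set of references, as in the formula above."""
    return max(
        (alpha * L(hyp, r) + beta * A(hyp, r)) / (alpha + beta)
        for r in refs
    )

# Toy stand-ins for the real components (BLEU and SRL matching):
def toy_lexical(hyp, ref):
    hw, rw = set(hyp.split()), set(ref.split())
    return len(hw & rw) / max(len(rw), 1)

def toy_semantic(hyp, ref):
    # pretend the predicate is simply the first token
    return 1.0 if hyp.split()[0] == ref.split()[0] else 0.0

score = tine_score(toy_lexical, toy_semantic,
                   "the cat sat", ["the cat sat down", "a cat sat"])
# best reference gives (0.5*0.75 + 0.5*1.0) / 1.0 = 0.875
```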

SLIDE 10

This Is Not Entailment

L: BLEU; A: matching of verbs and their arguments:

A(H, R) = Σ_{v ∈ V} verb_score(Hv, Rv) / |Vr|

1. Align verbs using ontologies (VerbNet and VerbOcean): vh and vr are aligned if they share a class in VerbNet or hold a relation in VerbOcean.
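Step 1 amounts to a lookup against the two resources. The dictionaries below are invented stand-ins (real TINE queries actual VerbNet classes and VerbOcean relations):

```python
# Invented stand-ins for the two ontologies; entries are illustrative only.
VERBNET_CLASS = {"discourage": "class-1", "put off": "class-1"}
VERBOCEAN_RELATED = {("book", "order"), ("order", "book")}

def verbs_aligned(vh, vr):
    """vh and vr align if they share a VerbNet class or hold a
    VerbOcean relation."""
    same_class = (vh in VERBNET_CLASS and vr in VERBNET_CLASS
                  and VERBNET_CLASS[vh] == VERBNET_CLASS[vr])
    return same_class or (vh, vr) in VERBOCEAN_RELATED
```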

SLIDE 11

This Is Not Entailment

2. Match arguments with the same semantic roles:

verb_score(Hv, Rv) = Σ_{a ∈ Ah ∩ Ar} arg_score(Ha, Ra) / |Ar|
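Step 2 can be rendered directly from the formula: score the roles shared by hypothesis and reference, normalised by the reference's argument count |Ar|. The role labels and the exact-match arg_score below are illustrative:

```python
def verb_score(h_args, r_args, arg_score):
    """h_args / r_args map role labels (A0, A1, ...) to filler strings;
    shared roles are scored and normalised by |Ar|."""
    if not r_args:
        return 0.0
    shared = set(h_args) & set(r_args)
    return sum(arg_score(h_args[a], r_args[a]) for a in shared) / len(r_args)

exact = lambda h, r: 1.0 if h == r else 0.0
s = verb_score({"A0": "people", "A1": "ski stays"},
               {"A0": "people", "A1": "ski holidays", "AM-LOC": "hotels"},
               exact)
# shared roles A0 (match) and A1 (mismatch), |Ar| = 3 -> 1/3
```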

SLIDES 12-14

This Is Not Entailment

3. Expand arguments using distributional semantics and match them using cosine similarity: arg_score(Ha, Ra).

TINE did slightly better than BLEU at segment level. The lexical component is extremely important.
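Step 3's cosine matching of argument fillers, minus the distributional expansion (skipped here for brevity), looks like:

```python
import math
from collections import Counter

def cosine(a_tokens, b_tokens):
    """Cosine similarity between two bags of tokens."""
    va, vb = Counter(a_tokens), Counter(b_tokens)
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def arg_score(h_arg, r_arg):
    return cosine(h_arg.lower().split(), r_arg.lower().split())

s = arg_score("hotels and guest houses", "hotels and boarding houses")
# three of four tokens shared on each side -> 3 / (2 * 2) = 0.75
```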

SLIDES 15-17

Quality Estimation

No access to a reference translation: the MT system is in use for post-editing, dissemination, assimilation, etc.
Semantics is particularly important for estimating adequacy.

SLIDE 18

Example 1

Target: Chang-e III is expected to launch after 2013
Source: 嫦娥三号预计 2013 年前后发射
Reference: Chang-e III is expected to launch around 2013

(Translation by Google Translate)

SLIDE 19

Example 2

Target: Continued high floods subside. Guang'an old city has been soaked 2 days 2 nights
Source: 四川广安洪水持续高位不退 老城区已被泡 2 天 2 夜
Reference: The continuing floods in Guang'an, Sichuan have not subsided. The old city has been flooded for 2 days and 2 nights.

(Translation by Google Translate)

SLIDE 20

Example 3

Target: site security should be included in sex education curriculum for students
Source: 场地安全性教育应纳入学生的课程
Reference: site security requirements should be included in the education curriculum for students

(Translation by Google Translate)

SLIDE 21

Most common problems

words translated incorrectly
incorrect relationships between words/constituents/clauses
missing/untranslated/repeated/added words
incorrect word order
inflectional/voice errors

SLIDE 22

MT quality evaluation

How does the metric vary depending on how the references are produced?

Standard references, semantic component only, segment-level correlation: 0.21
Post-edited translations, semantic component only, segment-level correlation: 0.55

SLIDE 23

MT quality evaluation vs intrinsic evaluation

TINE on WMT data: correlation 0.30
TINE on Microsoft video data: correlation 0.43
TINE on Microsoft paraphrase data: correlation 0.30

SLIDES 24-28

MT quality estimation and evaluation

Can we use the same approach as reference-based evaluation, but bilingual?

Possibly, assuming resources and alignments are available.
We cannot expect exact correspondences, e.g. thematic divergences (Dorr et al.): "I miss you" vs. "Tu me manques".
We can learn these correspondences.

SLIDES 29-34

MT evaluation and summarization (evaluation)

Can the same STS metric address both?

MT systems make mistakes that summarization (especially extractive) systems are unlikely to make.
A translation is generally related/similar to the reference (and source), a 1-2 Likert score; this is not the case in summarization.
A translation is generally very similar in length to the reference; not the case in summarization.
Translation evaluation makes 1-1 comparisons; not the case in summarization.

Does translation need a more fine-grained metric than summarization?

SLIDES 35-40

Applications require different STS metrics

"How do we illustrate the utility of STS to end applications?"

STS depends on what is important for the application, and also on the sort of data the application can produce.
Avoid falling into the same trap as WSD: how many applications use an off-the-shelf WSD module?
Common excuses: not good enough, WordNet senses not appropriate for my application...

SLIDES 41-47

STS from an application perspective

Select a few applications that could benefit from STS.
Collect examples with different levels of similarity for these applications.
Produce gold-standard annotations for these examples (as in Meant).
Compute as many semantic components as possible (word-level, SRL, etc.).
I'm not sure components need to talk to each other: error propagation.
Regress on these components to understand which are important for each application.
Repeat the process with automatic annotation.
The result: a parameterizable STS metric.
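The "regress on these components" step can be approximated very simply: correlate each component's scores with the human judgements and rank the components. The data below is invented for illustration; a real study would use the gold-standard annotations collected above.

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

human = [1, 2, 3, 4, 5]                     # invented human judgements
components = {                              # invented component scores
    "lexical": [0.1, 0.3, 0.5, 0.6, 0.9],
    "srl":     [0.5, 0.2, 0.6, 0.4, 0.7],
}
# Rank components by how well they track the human scores:
ranking = sorted(components,
                 key=lambda c: pearson(components[c], human), reverse=True)
```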

SLIDE 48

STS for Machine Translation Evaluation

STS Workshop, NYC, March 12-13, 2012

Lucia Specia
University of Sheffield
l.specia@sheffield.ac.uk