
SLIDE 1

SLIDE 2

On the annotation of TMX translation memories for advanced leveraging in computer-aided translation

Mikel L. Forcada

Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)

May 30, 2014: Language Resources and Evaluation Conference LREC 2014, Reykjavík, Ísland

Mikel L. Forcada (Universitat d’Alacant)Annotating TMX for advanced leveraging in CAT 30/05/2014 2 / 29

SLIDE 3

Outline

1. Computer-aided translation using translation memories
2. The TMX standard
3. The need for sub-segment annotation: advanced leveraging
4. A proposal for sub-segment correspondence annotation in TMX
5. Sources of sub-segment equivalence
6. Concluding remarks
7. [Spare slides: other alternatives considered]


SLIDE 5

Computer-aided translation using translation memories /1

A quick review of concepts:

- Translation memory (TM): a set of translation units.
- Translation unit (TU): a pair of text segments that are
  - each in a different language, and
  - mutual translations.

TMs store previous translation jobs in a reusable way.


SLIDE 6

Computer-aided translation using translation memories /2

     English                                      Catalan
s1:  The political situation is difficult     t1: La situació política és difícil
s2:  The humanitarian situation worsens       t2: La situació humanitària empitjora
s3:  Humanitarian efforts have failed         t3: Els esforços humanitaris han fracassat
...                                           ...

Fuzzy matches of a new sentence s′ help translate it:

New sentence s′:           The humanitarian situation is difficult
Best match s1:             The political situation is difficult
Proposal t1:               La situació política és difícil
Edited proposal t1 → t′:   La situació humanitària és difícil
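The lookup step in this example can be sketched in a few lines. This is only an illustration: it assumes a word-level similarity computed with Python's `difflib`, whereas real CAT tools use their own scoring, and the names `fuzzy_score` and `best_match` are hypothetical.

```python
from difflib import SequenceMatcher

def fuzzy_score(a, b):
    """Word-level similarity in [0, 1] (difflib's ratio, not a CAT tool's score)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()

def best_match(new_sentence, tm):
    """Return the (source, target) unit whose source is most similar to new_sentence."""
    return max(tm, key=lambda tu: fuzzy_score(new_sentence, tu[0]))

# The three translation units from the slide.
TM = [
    ("The political situation is difficult", "La situació política és difícil"),
    ("The humanitarian situation worsens", "La situació humanitària empitjora"),
    ("Humanitarian efforts have failed", "Els esforços humanitaris han fracassat"),
]

source, proposal = best_match("The humanitarian situation is difficult", TM)
```

With this particular score, the unit whose source is "The political situation is difficult" is selected (four of five words match), and its Catalan side becomes the proposal to be edited.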



SLIDE 8

TMX

Translation Memory eXchange (TMX):

- A well-established, industry-agreed standard.
- Based on XML.
- For the interchange of TMs among computer-aided translation (CAT) applications.

Example of a translation unit in TMX

<tu segtype="sentence" tuid="2">
  <tuv xml:lang="en">
    <seg>The humanitarian situation worsens.</seg>
  </tuv>
  <tuv xml:lang="ca">
    <seg>La situació humanitària empitjora.</seg>
  </tuv>
</tu>
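Because TMX is plain XML, a translation unit like this one can be read with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the helper name `read_tu` is hypothetical, and a real TMX file would wrap many `<tu>` elements in `<tmx>`, `<header>` and `<body>`.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TU = """<tu segtype="sentence" tuid="2">
  <tuv xml:lang="en"><seg>The humanitarian situation worsens.</seg></tuv>
  <tuv xml:lang="ca"><seg>La situació humanitària empitjora.</seg></tuv>
</tu>"""

def read_tu(xml_text):
    """Return {language code: plain segment text} for one <tu> element."""
    tu = ET.fromstring(xml_text)
    return {
        tuv.get(XML_LANG): "".join(tuv.find("seg").itertext())
        for tuv in tu.iter("tuv")
    }

segments = read_tu(TU)
```

Using `itertext()` rather than `seg.text` means the plain text is still recovered when the segment contains inline elements such as `<bpt>` or `<ept>`.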


SLIDE 10

The need for sub-segment annotation

To automate the needed change,¹ namely,

New sentence s′:           The humanitarian situation is difficult
Best match s1:             The political situation is difficult
Proposal t1:               La situació política és difícil
Edited proposal t1 → t′:   La situació humanitària és difícil

it would be helpful to know, for instance, that

    political situation → situació política
    humanitarian situation → situació humanitària

These sub-segment correspondences are in the TM, but they are not annotated. But they might as well have been!

¹ This is sometimes called fuzzy-match repair.
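To make the idea concrete, here is a deliberately naive sketch of fuzzy-match repair. It assumes the sub-segment correspondences are already available as a plain dictionary and that a single substitution is enough; the function name `repair` and this strategy are illustrative assumptions, not the method of any particular CAT tool.

```python
def repair(new_src, match_src, proposal, subsegments):
    """Patch a fuzzy-match proposal using known sub-segment pairs.

    If replacing one known source sub-segment by another turns match_src
    into new_src, apply the corresponding replacement on the target side.
    Returns None when no single substitution explains the difference.
    """
    for old_src, old_tgt in subsegments.items():
        for new_sub, new_tgt in subsegments.items():
            if (old_src in match_src
                    and match_src.replace(old_src, new_sub, 1) == new_src
                    and old_tgt in proposal):
                return proposal.replace(old_tgt, new_tgt, 1)
    return None

# Sub-segment correspondences from the slide.
SUBSEGMENTS = {
    "political situation": "situació política",
    "humanitarian situation": "situació humanitària",
}

repaired = repair(
    "The humanitarian situation is difficult",
    "The political situation is difficult",
    "La situació política és difícil",
    SUBSEGMENTS,
)
```

Here `repaired` is the edited proposal "La situació humanitària és difícil". The sketch only works because the needed correspondences are known, which is exactly the information the annotation is meant to provide.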

SLIDE 11

Advanced leveraging

The term advanced leveraging...

- ...refers to extensions beyond current TM usage...
- ...coming from identifying sub-segment repetitions.

Commercial examples:

- Deep Miner in Atril’s Déjà Vu
- Auto-Suggest in SDL Trados
- Advanced Leveraging in MultiCorpora

TMX does not directly support sub-segment equivalence annotation. Or does it?



SLIDE 13

Annotating TMX with sub-segment information

After considering some alternatives (see paper):

Proposal: repurposing the existing TMX support for overlapping paired format tags (yuck!)

Overlapping paired format tags in English

Bold, Bold + Italic, Italic.
(bold covers "Bold, Bold + Italic"; italic covers "Bold + Italic, Italic")

Corresponding (also overlapping) paired format tags in Spanish

Negrita, Negrita + Cursiva, Cursiva.
(bold covers "Negrita, Negrita + Cursiva"; italic covers "Negrita + Cursiva, Cursiva")

In TMX, one can

- use an index i to pair each begin paired tag (<bpt>) with the corresponding end paired tag (<ept>) in the same segment;
- use an index x to align each tag in one language with the corresponding tag in the other language.


SLIDE 14

Annotating TMX with sub-segment information

TMX translation unit with paired format tags

<tu segtype="sentence" tuid="877">
  <tuv xml:lang="en">
    <seg>
      <bpt i="1" x="1">&lt;B&gt;</bpt>Bold,
      <bpt i="2" x="2">&lt;I&gt;</bpt>Bold +
      Italic<ept i="1">&lt;/B&gt;</ept>,
      Italic<ept i="2">&lt;/I&gt;</ept>.
    </seg>
  </tuv>
  <tuv xml:lang="es">
    <seg>
      <bpt i="1" x="1">&lt;B&gt;</bpt>Negrita,
      <bpt i="2" x="2">&lt;I&gt;</bpt>Negrita +
      Cursiva<ept i="1">&lt;/B&gt;</ept>,
      Cursiva<ept i="2">&lt;/I&gt;</ept>.
    </seg>
  </tuv>
</tu>

SLIDE 15

Annotating TMX with sub-segment information

The solution:² null (empty) format tags.

In TMX:

- Each <bpt>–<ept> pair may clearly span any arbitrary subsegment in <seg>.
- Elements <bpt> and <ept> can be empty!
- An attribute type may be used to specify “the kind of data [the] element represents”.

Therefore:

- We can use aligned <bpt>–<ept> pairs containing no format to represent subsegment correspondences.
- We can twist the accepted use of the type attribute to encode the source of information used to annotate that correspondence.

² Thanks, Felipe Sánchez-Martínez!

SLIDE 16

Annotating TMX with sub-segment information

TMX translation unit with one subsegment annotated

<tu segtype="sentence" tuid="13123123">
  <tuv xml:lang="de">
    <seg>Ich habe
      <bpt i="1" x="1"
           type="google-translate-de-en"/>einen
      Artikel<ept i="1"/>
      geschrieben.</seg>
  </tuv>
  <tuv xml:lang="en">
    <seg>I have written
      <bpt i="1" x="1"
           type="google-translate-de-en"/>an
      article<ept i="1"/></seg>
  </tuv>
</tu>
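An annotation of this kind could be produced programmatically. The sketch below builds a `<seg>` with empty `<bpt>`/`<ept>` pairs from character spans using Python's standard `xml.etree.ElementTree`. The helper `annotate`, the 0-based half-open offsets, and the choice of giving the n-th span `i = x = n` are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

def annotate(text, spans, source):
    """Return a <seg> string with one empty <bpt>/<ept> pair per span.

    spans: half-open (start, end) character offsets into text (0-based).
    The n-th span gets i = x = n, which assumes spans are listed in the
    same order for both languages so that equal x values correspond.
    """
    marks = []
    for n, (start, end) in enumerate(spans, 1):
        marks.append((start, 1, "bpt", n))
        marks.append((end, 0, "ept", n))
    marks.sort()  # at equal offsets, <ept> (0) sorts before <bpt> (1)

    seg = ET.Element("seg")
    last, pos = None, 0
    for offset, _, tag, n in marks:
        chunk = text[pos:offset]
        if last is None:
            seg.text = chunk      # text before the first inline element
        else:
            last.tail = chunk     # text following the previous inline element
        attrs = {"i": str(n)}
        if tag == "bpt":
            attrs.update(x=str(n), type=source)
        last = ET.SubElement(seg, tag, attrs)
        pos = offset
    if last is None:
        seg.text = text
    else:
        last.tail = text[pos:]
    return ET.tostring(seg, encoding="unicode")

de = annotate("Ich habe einen Artikel geschrieben.", [(9, 22)], "google-translate-de-en")
en = annotate("I have written an article", [(15, 25)], "google-translate-de-en")
```

Building the element tree instead of concatenating strings keeps the output well-formed even if the segment text contains characters such as `&` or `<`.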

SLIDE 17

Annotating TMX with sub-segment information

TMX translation unit with two overlapping subsegments annotated

<tu segtype="sentence" tuid="13123123">
  <tuv xml:lang="de">
    <seg>Ich
      <bpt i="1" x="1" type="google-translate-de-en"/>gehe
      <bpt i="2" x="2" type="google-translate-de-en"/>ins
      <ept i="1"/> Haus<ept i="2"/>.</seg>
  </tuv>
  <tuv xml:lang="en">
    <seg>I
      <bpt i="1" x="1" type="google-translate-de-en"/>go
      <bpt i="2" x="2" type="google-translate-de-en"/>into the
      <ept i="1"/> house<ept i="2"/>.</seg>
  </tuv>
</tu>
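Reading the annotation back is equally mechanical: pair each `<bpt>` with the `<ept>` that has the same `i` within a segment, then match spans across languages by `x`. The following is a minimal sketch under the assumption that the segment is written on one line (the slide's line breaks would otherwise add whitespace); the helper names `subsegments` and `aligned_pairs` are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TU = """<tu segtype="sentence" tuid="13123123">
<tuv xml:lang="de"><seg>Ich <bpt i="1" x="1" type="google-translate-de-en"/>gehe <bpt i="2" x="2" type="google-translate-de-en"/>ins<ept i="1"/> Haus<ept i="2"/>.</seg></tuv>
<tuv xml:lang="en"><seg>I <bpt i="1" x="1" type="google-translate-de-en"/>go <bpt i="2" x="2" type="google-translate-de-en"/>into the<ept i="1"/> house<ept i="2"/>.</seg></tuv>
</tu>"""

def subsegments(seg):
    """Map alignment index x to the text spanned by each <bpt>/<ept> pair."""
    text = seg.text or ""
    open_spans = {}  # i -> (x, start offset in the plain text)
    closed = {}      # x -> (start, end)
    for child in seg:
        if child.tag == "bpt":
            open_spans[child.get("i")] = (child.get("x"), len(text))
        elif child.tag == "ept":
            x, start = open_spans.pop(child.get("i"))
            closed[x] = (start, len(text))
        text += child.tail or ""
    return {x: text[a:b].strip() for x, (a, b) in closed.items()}

def aligned_pairs(tu_xml, src_lang, tgt_lang):
    """Return (source sub-segment, target sub-segment) pairs sharing the same x."""
    tu = ET.fromstring(tu_xml)
    by_lang = {tuv.get(XML_LANG): subsegments(tuv.find("seg")) for tuv in tu.iter("tuv")}
    src, tgt = by_lang[src_lang], by_lang[tgt_lang]
    return [(src[x], tgt[x]) for x in sorted(src.keys() & tgt.keys())]

pairs = aligned_pairs(TU, "de", "en")
```

For this unit the two overlapping correspondences come out as ("gehe ins", "go into the") and ("ins Haus", "into the house").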

SLIDE 18

Pros and cons of <ept> and <bpt> repurposing.

Pros:

- This method allows for a very general annotation of all kinds of subsegment correspondences.
- A related localization standard, XLIFF, also uses <bpt> and <ept> with similar syntax and semantics.
  - It remains to be seen if it would be possible to twist XLIFF too!

Cons:

- Extending the semantics of <bpt> and <ept> could cause trouble with CAT systems that explicitly process them (instead of just stripping them).
- It does not explicitly encode sub-segment correspondences as separate translation units <tu>: each correspondence is always bound to a subsegment and may be repeated somewhere else.
  - In statistical machine translation parlance, one would say that “the phrase table is embedded in the bilingual training corpus”.



SLIDE 20

Sources of subsegment equivalence

Subsegment equivalences may come from...

- ...smaller translation units in the same TM or another TM.
- ...an external source of bilingual equivalence such as a machine translation system...
  - note that in this case, MT output is “validated” by the existing translation in the translation memory
- ...or a term base.
- ...a statistical word alignment of the current translation memory.
  - subsegment pairs can be those compatible with those word alignments.
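For the word-alignment case, "compatible" can be read as the consistency criterion commonly used for phrase-pair extraction in statistical machine translation; that reading is an assumption, not something the slide spells out. The sketch below enumerates such pairs for a hand-written alignment (not the output of a real aligner); the function name `consistent_pairs` is hypothetical.

```python
def consistent_pairs(src, tgt, links, max_len=3):
    """List sub-segment pairs compatible with word-alignment links.

    links: set of (source index, target index) pairs, 0-based.
    A pair of spans is kept if it contains at least one link and no link
    joins a word inside one span to a word outside the other.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            for j1 in range(len(tgt)):
                for j2 in range(j1, min(j1 + max_len, len(tgt))):
                    inside = any(i1 <= i <= i2 and j1 <= j <= j2 for i, j in links)
                    crossing = any((i1 <= i <= i2) != (j1 <= j <= j2) for i, j in links)
                    if inside and not crossing:
                        pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs

src = "The humanitarian situation worsens".split()
tgt = "La situació humanitària empitjora".split()
# Hand-written alignment: The-La, humanitarian-humanitària, situation-situació, worsens-empitjora.
links = {(0, 0), (1, 2), (2, 1), (3, 3)}

pairs = consistent_pairs(src, tgt, links)
```

The result includes ("humanitarian situation", "situació humanitària") but not ("humanitarian", "situació"), because the latter pair of spans contains no alignment link.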



SLIDE 22

Concluding remarks

- I have presented a proposal³ to enrich TMX-encoded translation memories with information about subsegment equivalence.
  - Ready for advanced leveraging.
- It repurposes existing resources for formatting in the TMX standard.
- Subsegment annotation may be generated in advance using
  - machine translation,
  - [statistical] word alignment followed by phrase-pair extraction,
  - smaller TUs from the same or other TMs,
  - term bases, glossaries, etc.
  and stored together with the TMX file.

³ The paper discusses other alternatives.

SLIDE 23

Thank you!

Mikel L. Forcada

Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)

Support from the Spanish Ministry of Economy and Competitiveness through grant TIN2012-32615 is gratefully acknowledged. I also thank Felipe Sánchez-Martínez and Juan Antonio Pérez-Ortiz for interesting suggestions. Finally, I thank Google Summer of Code student Pankaj Kumar Sharma for experimental implementations using Apertium to annotate subsegments in a TMX memory.



SLIDE 26

Discarded alternative: using <prop>/1

A possibility uses <prop> (“used to define properties of the parent element”), storing sub-segments as separate <tu> (“stand-off”?):

The annotating subsegment TU specifies how it annotates a TU

<tu segtype="phrase" tuid="984120312">
  <prop type="annotated-tuid">13123123</prop>
  <prop type="source">google-translate-de-en</prop>
  <tuv xml:lang="de">
    <prop type="start-pos">10</prop>
    <prop type="end-pos">22</prop>
    <seg>einen Artikel</seg>
  </tuv>
  <tuv xml:lang="en">
    <prop type="start-pos">16</prop>
    <prop type="end-pos">25</prop>
    <seg>an article</seg>
  </tuv>
</tu>
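The start-pos and end-pos values in this example are consistent with 1-based, inclusive character offsets into the annotated segments "Ich habe einen Artikel geschrieben." and "I have written an article". That convention is an assumption, since the slide does not state it. A short check, with a hypothetical helper `span`:

```python
def span(text, start_pos, end_pos):
    """Extract a sub-segment, reading offsets as 1-based and inclusive (assumed)."""
    return text[start_pos - 1:end_pos]

de = "Ich habe einen Artikel geschrieben."
en = "I have written an article"

de_sub = span(de, 10, 22)   # "einen Artikel"
en_sub = span(en, 16, 25)   # "an article"

# A single extra leading space in the stored segment shifts every offset.
shifted = span(" " + de, 10, 22)
```

The last line illustrates how brittle stand-off offsets are: any whitespace normalization of the annotated segment silently invalidates them.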

SLIDE 27

Discarded alternative: using <prop>/2

- Treats sub-segment correspondences as TUs (natural).
- Cumbersome <prop> overloading for common sub-segment pairs.
- Use of character offsets may be fragile.
- Matching <prop> lists would be needed in annotated TUs:

The annotated TU names the annotating sub-segment TUs

<tu segtype="sentence" tuid="13123123">
  <prop type="annotated-by-tuid">984120312</prop>
  <tuv xml:lang="de">
    <seg>Ich habe einen Artikel
      geschrieben.</seg>
  </tuv>
  <tuv xml:lang="en">
    <seg>I have written an article</seg>
  </tuv>
</tu>

SLIDE 28

Discarded alternative: using <hi>/1

A possibility would use <hi> (“used to delimit a portion of the segment for any user-defined purpose”):

TMX translation unit with one sub-segment annotated

<tu segtype="sentence" tuid="13123123">
  <tuv xml:lang="de">
    <seg>Ich habe
      <hi x="1" type="google-translate-de-en">einen
      Artikel</hi> geschrieben.</seg>
  </tuv>
  <tuv xml:lang="en">
    <seg>I have written
      <hi x="1" type="google-translate-de-en">an
      article</hi></seg>
  </tuv>
</tu>

SLIDE 29

Discarded alternative: using <hi>/2

- Allows for a rather rich annotation of sub-segment correspondence without having to stretch the intended semantics of the <hi> element too far.
- Element <hi> may be indefinitely nested, but no overlap is possible.
- It may however be OK if a clear phrase structure is defined (for instance using a synchronous context-free grammar):

    [1 Ich ] [2 habe [3 [4 einen Artikel ] geschrieben ] ]
    [1 I ] [2 have [3 written [4 an article ] ] ]
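Whether a set of sub-segments fits the nesting-only constraint of `<hi>` can be checked mechanically. The sketch below flags spans that overlap without one containing the other; it assumes half-open token spans, and the function name `crossing_spans` is hypothetical.

```python
def crossing_spans(spans):
    """Return pairs of (start, end) spans that overlap without nesting."""
    bad = []
    for n, (a1, b1) in enumerate(spans):
        for a2, b2 in spans[n + 1:]:
            if a1 < a2 < b1 < b2 or a2 < a1 < b2 < b1:
                bad.append(((a1, b1), (a2, b2)))
    return bad

# "Ich habe einen Artikel geschrieben": the bracketing shown above, as token spans.
nested = [(0, 1), (1, 5), (2, 5), (2, 4)]

# "Ich gehe ins Haus": "gehe ins" and "ins Haus" from the overlapping example.
overlapping = [(1, 3), (2, 4)]

nested_conflicts = crossing_spans(nested)
overlap_conflicts = crossing_spans(overlapping)
```

The first list has no conflicts and could be encoded with nested `<hi>` elements; the second has one crossing pair, which is the case that calls for `<bpt>`/`<ept>`.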


