On the annotation of TMX translation memories for advanced leveraging in computer-aided translation



  1. On the annotation of TMX translation memories for advanced leveraging in computer-aided translation

     Mikel L. Forcada
     Departament de Llenguatges i Sistemes Informàtics, Universitat d’Alacant, E-03071 Alacant (Spain)

     May 30, 2014: Language Resources and Evaluation Conference LREC 2014, Reykjavík, Ísland

     Mikel L. Forcada (Universitat d’Alacant), Annotating TMX for advanced leveraging in CAT, 30/05/2014

  2. Contents

     1. Computer-aided translation using translation memories
     2. The TMX standard
     3. The need for sub-segment annotation: advanced leveraging
     4. A proposal for sub-segment correspondence annotation in TMX
     5. Sources of sub-segment equivalence
     6. Concluding remarks
     7. [Spare slides: other alternatives considered]

  4. Computer-aided translation using translation memories /1

     A quick review of concepts:
     Translation memory (TM): a set of translation units.
     Translation unit (TU): a pair of text segments, each in a different language, that are mutual translations.
     TMs store previous translation jobs in a reusable way.

  5. Computer-aided translation using translation memories /2

     English                                    Catalan
     s1: The political situation is difficult   t1: La situació política és difícil
     s2: The humanitarian situation worsens     t2: La situació humanitària empitjora
     s3: Humanitarian efforts have failed       t3: Els esforços humanitaris han fracassat
     ...                                        ...

     Fuzzy matches of a new sentence s′ help translate it:

     New sentence      s′: The humanitarian situation is difficult
     Best match        s1: The political situation is difficult
     Proposal          t1: La situació política és difícil
     Edited proposal   t1 → t′: La situació humanitària és difícil
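
The lookup on this slide can be sketched in a few lines of Python. This is only an illustration: the word-level `difflib` ratio stands in for whatever fuzzy-match score a real CAT tool computes, and the names `TM`, `fuzzy_match_score` and `best_match` are made up for the example.

```python
from difflib import SequenceMatcher

# Toy translation memory from the slide: (source, target) translation units.
TM = [
    ("The political situation is difficult", "La situació política és difícil"),
    ("The humanitarian situation worsens", "La situació humanitària empitjora"),
    ("Humanitarian efforts have failed", "Els esforços humanitaris han fracassat"),
]


def fuzzy_match_score(a: str, b: str) -> float:
    """Word-level similarity in [0, 1]; an assumed stand-in for a CAT tool's score."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def best_match(new_source: str, tm):
    """Return (score, source, target) for the TU whose source is most similar."""
    return max((fuzzy_match_score(new_source, s), s, t) for s, t in tm)


score, source, target = best_match("The humanitarian situation is difficult", TM)
print(f"{score:.2f}  {source}  ->  {target}")
```

With this toy score the best match is the "political situation" unit, whose target the translator then edits by hand.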

  7. TMX

     Translation memory exchange (TMX):
     A well-established, industry-agreed standard.
     Based on XML.
     For the interchange of TMs among computer-aided translation (CAT) applications.

     Example of a translation unit in TMX:

         <tu segtype="sentence" tuid="2">
           <tuv xml:lang="en">
             <seg>The humanitarian situation worsens.</seg>
           </tuv>
           <tuv xml:lang="ca">
             <seg>La situació humanitària empitjora.</seg>
           </tuv>
         </tu>
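
As a rough illustration of how an application might read such a translation unit, the sketch below uses Python's standard `xml.etree.ElementTree`. The helper name `read_tu` is hypothetical, and a full TMX file would wrap many `<tu>` elements in `<tmx>`, `<header>` and `<body>`, which this example leaves out.

```python
import xml.etree.ElementTree as ET

# The xml:lang attribute lives in the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TU = """
<tu segtype="sentence" tuid="2">
  <tuv xml:lang="en"><seg>The humanitarian situation worsens.</seg></tuv>
  <tuv xml:lang="ca"><seg>La situació humanitària empitjora.</seg></tuv>
</tu>
"""


def read_tu(xml_text: str) -> dict:
    """Map each language code to the plain text of its segment."""
    tu = ET.fromstring(xml_text)
    return {
        tuv.get(XML_LANG): "".join(tuv.find("seg").itertext())
        for tuv in tu.iter("tuv")
    }


segments = read_tu(TU)
print(segments["en"], "|", segments["ca"])
```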

  9. The need for sub-segment annotation

     To automate the needed change [1], namely,

     New sentence      s′: The humanitarian situation is difficult
     Best match        s1: The political situation is difficult
     Proposal          t1: La situació política és difícil
     Edited proposal   t1 → t′: La situació humanitària és difícil

     it would be helpful to know, for instance, that

     political situation → situació política
     humanitarian situation → situació humanitària

     These sub-segment correspondences are in the TM but they are not annotated.
     But they might as well have been!

     [1] This is sometimes called fuzzy-match repair.
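
A minimal sketch of how such correspondences could drive an automatic repair is given below. It assumes the two sub-segment pairs from the slide are already available as a dictionary and uses naive substring matching; the function name `repair` is invented for the example and does not describe any existing tool.

```python
# Sub-segment correspondences from the slide, assumed to be already annotated.
SUBSEGMENTS = {
    "political situation": "situació política",
    "humanitarian situation": "situació humanitària",
}


def repair(new_source, match_source, match_target, subsegments):
    """Swap the translation of a sub-segment that occurs only in the matched
    source for the translation of one that occurs only in the new source."""
    old = [s for s in subsegments if s in match_source and s not in new_source]
    new = [s for s in subsegments if s in new_source and s not in match_source]
    if len(old) != 1 or len(new) != 1:
        # Ambiguous or nothing to repair: leave the proposal to the translator.
        return match_target
    return match_target.replace(subsegments[old[0]], subsegments[new[0]])


repaired = repair(
    "The humanitarian situation is difficult",
    "The political situation is difficult",
    "La situació política és difícil",
    SUBSEGMENTS,
)
print(repaired)
```

Real fuzzy-match repair has to cope with agreement, reordering and competing candidates; the point here is only that annotated sub-segment pairs make the substitution mechanical.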

  10. Advanced leveraging

      The term advanced leveraging refers to extensions beyond current TM usage coming from identifying sub-segment repetitions.

      Commercial examples:
      Deep Miner in Atril’s Déjà Vu
      Auto-Suggest in SDL Trados
      Advanced Leveraging in Multicorpora

      TMX does not directly support sub-segment equivalence annotation. Or does it?

  12. Annotating TMX with sub-segment information

      After considering some alternatives (see paper):
      Proposal: repurposing existing support in TMX for overlapping paired format tags (yuck!)

      Overlapping paired format tags in English:

          <B>Bold,<I>Bold + Italic</B>, Italic</I>.

      Corresponding (also overlapping) paired format tags in Spanish:

          <B>Negrita,<I>Negrita + Cursiva</B>, Cursiva</I>.

      In TMX, one can:
      Use an index i to pair each begin paired tag (<bpt>) with the corresponding end paired tag (<ept>) in the same segment.
      Use an index x to align each tag in one language with the corresponding tag in the other language.

  13. Annotating TMX with sub-segment information

      TMX translation unit with paired format tags:

          <tu segtype="sentence" tuid="877">
            <tuv xml:lang="en">
              <seg>
                <bpt i="1" x="1">&lt;B></bpt>Bold,
                <bpt i="2" x="2">&lt;I></bpt>Bold +
                Italic<ept i="1">&lt;/B></ept>,
                Italic<ept i="2">&lt;/I></ept>.
              </seg>
            </tuv>
            <tuv xml:lang="es">
              <seg>
                <bpt i="1" x="1">&lt;B></bpt>Negrita,
                <bpt i="2" x="2">&lt;I></bpt>Negrita +
                Cursiva<ept i="1">&lt;/B></ept>,
                Cursiva<ept i="2">&lt;/I></ept>.
              </seg>
            </tuv>
          </tu>
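
To make the roles of `i` and `x` concrete, here is a small Python sketch that recovers the text spanned by each `<bpt>`/`<ept>` pair and then aligns spans across languages. The function names `spans_by_index` and `align` are hypothetical, the segments are written on one line for brevity, and the code assumes every `<ept>` has a matching earlier `<bpt>` in the same segment.

```python
import xml.etree.ElementTree as ET


def spans_by_index(seg):
    """Return {i: (x, text)} for every <bpt>/<ept> pair in a <seg> element.

    Text inside <bpt>/<ept> is native formatting code, so only the text
    between tags (each child's .tail) counts as segment content.
    """
    open_spans, x_of, done = {}, {}, {}
    for child in seg:
        i = child.get("i")
        if child.tag == "bpt":
            open_spans[i] = []
            x_of[i] = child.get("x")
        elif child.tag == "ept":
            done[i] = (x_of.pop(i), "".join(open_spans.pop(i)).strip())
        # The text after this tag belongs to every span that is still open.
        for parts in open_spans.values():
            parts.append(child.tail or "")
    return done


def align(src_seg, tgt_seg):
    """Pair up spans of two segments that share the same x value."""
    src = {x: text for x, text in spans_by_index(src_seg).values()}
    tgt = {x: text for x, text in spans_by_index(tgt_seg).values()}
    return [(src[x], tgt[x]) for x in sorted(src) if x in tgt]


EN = ET.fromstring(
    '<seg><bpt i="1" x="1">&lt;B&gt;</bpt>Bold, '
    '<bpt i="2" x="2">&lt;I&gt;</bpt>Bold + Italic<ept i="1">&lt;/B&gt;</ept>, '
    'Italic<ept i="2">&lt;/I&gt;</ept>.</seg>'
)
ES = ET.fromstring(
    '<seg><bpt i="1" x="1">&lt;B&gt;</bpt>Negrita, '
    '<bpt i="2" x="2">&lt;I&gt;</bpt>Negrita + Cursiva<ept i="1">&lt;/B&gt;</ept>, '
    'Cursiva<ept i="2">&lt;/I&gt;</ept>.</seg>'
)

pairs = align(EN, ES)
for en_text, es_text in pairs:
    print(f"{en_text!r} <-> {es_text!r}")
```

Because spans are tracked per index rather than on a stack, the overlapping bold and italic ranges are both recovered.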

  14. Annotating TMX with sub-segment information

      The solution [2]: null (empty) format tags.

      In TMX:
      Each <ept> – <bpt> pair may clearly span any arbitrary subsegment in seg.
      Elements <ept> and <bpt> can be empty!
      An attribute type may be used to specify “the kind of data [the] element represents”.

      Therefore:
      We can use aligned <ept> – <bpt> pairs containing no format to represent subsegment correspondences.
      We can twist the accepted use of the type attribute to encode the source of information used to annotate that correspondence.

      [2] Thanks, Felipe Sánchez-Martínez!

  15. Annotating TMX with sub-segment information

      TMX translation unit with one subsegment annotated:

          <tu segtype="sentence" tuid="13123123">
            <tuv xml:lang="de">
              <seg>Ich habe
                <bpt i="1" x="1" type="google-translate-de-en"/>einen
                Artikel<ept i="1"/>
                geschrieben.</seg>
            </tuv>
            <tuv xml:lang="en">
              <seg>I have written
                <bpt i="1" x="1" type="google-translate-de-en"/>an
                article<ept i="1"/></seg>
            </tuv>
          </tu>
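
A consumer of such annotations might extract the sub-segment correspondences roughly as follows. This is a sketch under assumptions, not part of the proposal: the names `annotated_spans` and `correspondences` are invented, the translation unit is assumed to hold exactly two `<tuv>` elements, and only empty `<bpt>` elements are treated as sub-segment markers.

```python
import xml.etree.ElementTree as ET

TU = """
<tu segtype="sentence" tuid="13123123">
  <tuv xml:lang="de">
    <seg>Ich habe <bpt i="1" x="1" type="google-translate-de-en"/>einen Artikel<ept i="1"/> geschrieben.</seg>
  </tuv>
  <tuv xml:lang="en">
    <seg>I have written <bpt i="1" x="1" type="google-translate-de-en"/>an article<ept i="1"/></seg>
  </tuv>
</tu>
"""


def annotated_spans(seg):
    """Return {x: (type, text)} for each empty <bpt>/<ept> pair in a <seg>."""
    open_spans, result = {}, {}
    for child in seg:
        i = child.get("i")
        if child.tag == "bpt" and not (child.text or "").strip():
            # Empty <bpt>: a sub-segment marker rather than real formatting.
            open_spans[i] = (child.get("x"), child.get("type"), [])
        elif child.tag == "ept" and i in open_spans:
            x, kind, parts = open_spans.pop(i)
            result[x] = (kind, "".join(parts).strip())
        # The text after this tag belongs to every span that is still open.
        for _, _, parts in open_spans.values():
            parts.append(child.tail or "")
    return result


def correspondences(tu_xml):
    """List (source_text, target_text, type) triples for a two-language TU."""
    tu = ET.fromstring(tu_xml)
    src, tgt = (annotated_spans(tuv.find("seg")) for tuv in tu.findall("tuv"))
    return [(src[x][1], tgt[x][1], src[x][0]) for x in src if x in tgt]


found = correspondences(TU)
print(found)
```

The `type` value travels with each pair, so a tool can decide how much to trust a correspondence depending on where it came from.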
