Word Segmentation and their Integration in Machine Translation



slide-1
SLIDE 1

Word Segmentation and their Integration in Machine Translation

Advanced MT Seminar

ThuyLinh Nguyen

thuylinh@cs.cmu.edu

Advanced MT seminar – p. 1/1

slide-2
SLIDE 2

Word Segmentation Problems


slide-3
SLIDE 3

Word Segmentation for MT

  • Use a word segmentation toolkit to segment character sequences into words before training and translation.
  • Interpret each Chinese character as a single word and learn the segmentation from the alignment between Chinese characters and English words (Xu et al. [2004]).
  • Confusion networks: take different segmentations into account and represent them as a lattice. The input to the translation system is a set of lattices (Xu [2005]).

slide-6
SLIDE 6

Word Segmentation Problems

Ambiguity

  • A character can be a word component in one context or a word by itself in another context.
  • A character can occur in different positions within a word.

Unknown words

  • New words are combinations of existing words.
  • Names are created by combining characters in unpredictable ways.
  • Foreign names are transliterated.

No widely accepted definition of a Chinese word

  • Sproat et al. [1994] had 6 people segment the same text. The segmentation consistency was only 76%.

slide-9
SLIDE 9

Word Segmentation methods

Purely dictionary-based approach (Cheng et al. [1999])

  • Addresses the ambiguity problem with the maximum matching heuristic.
  • Pros: simple, and a good heuristic.
  • Cons: depends on the coverage of the dictionary.

Purely statistical approach

  • Uses pointwise mutual information or EM.
  • Pros: does not depend on a dictionary.
  • Cons: low accuracy.

Statistical approach trained on manually segmented data.
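As a rough illustration of the maximum matching heuristic, the sketch below scans left to right and always takes the longest dictionary entry, falling back to a single character. The function name `max_match`, the `max_len` cutoff and the toy dictionary are assumptions made for this example; they are not taken from Cheng et al. [1999].

```python
def max_match(sentence, dictionary, max_len=4):
    """Greedy forward maximum matching over a string of characters."""
    words = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; a single character always matches.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}

# Greedy matching yields ['研究生', '命', '起源'];
# the intended reading is 研究 / 生命 / 起源.
print(max_match("研究生命起源", toy_dictionary))
```

On this input the greedy choice commits to 研究生 and strands 命. This is the ambiguity problem in miniature, and it also shows the coverage issue: a word that is missing from the dictionary can never be produced as a unit.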

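For the purely statistical approach, one simple way to use pointwise mutual information is to estimate PMI for adjacent character pairs from unsegmented text and to place a word boundary wherever the score falls below a threshold. This is only a sketch under that assumption; the function names and the threshold value are hypothetical and the slides do not specify the exact procedure.

```python
import math
from collections import Counter


def pmi_table(corpus):
    """Estimate PMI(a, b) = log p(ab) / (p(a) p(b)) for adjacent characters."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(zip(sentence, sentence[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    return {
        (a, b): math.log((count / n_bi)
                         / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        for (a, b), count in bigrams.items()
    }


def segment_by_pmi(sentence, pmi, threshold=0.0):
    """Join adjacent characters whose PMI exceeds the threshold."""
    if not sentence:
        return []
    words = [sentence[0]]
    for a, b in zip(sentence, sentence[1:]):
        # Unseen pairs get -inf, so they always trigger a boundary.
        if pmi.get((a, b), float("-inf")) > threshold:
            words[-1] += b
        else:
            words.append(b)
    return words
```

No dictionary is needed, but the output depends entirely on corpus counts and on the threshold, which is consistent with the low accuracy noted for this family of methods.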

slide-11
SLIDE 11

CRF for Word Segmentation

Peng et al. [2004] and Tseng et al. [2005]: word segmentation as a character tagging problem.

Conditional Random Field model: let c = (c_1, c_2, ..., c_K) be a Chinese sentence and t = (t_1, t_2, ..., t_K) be the character tags of c. Then

\[
\Pr(t \mid c) = \frac{1}{Z(c)} \exp\left( \sum_{k=1}^{K} \sum_{i} \lambda_i f_i(t_{k-1}, t_k, c, k) \right)
\]

where the f_i are feature functions with weights \lambda_i and Z(c) normalizes over all tag sequences.

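To make the character tagging view concrete, the sketch below converts a segmented sentence into one tag per character and back again. The two-tag scheme (`B` for the first character of a word, `I` for any later character) is an assumption chosen for illustration; the slides do not state which tag set each paper uses.

```python
def words_to_tags(words):
    """Map a list of words to one tag per character (B = begin, I = inside)."""
    tags = []
    for word in words:
        tags.append("B")
        tags.extend("I" * (len(word) - 1))
    return tags


def tags_to_words(chars, tags):
    """Rebuild the words from the characters and their tags."""
    words = []
    for ch, tag in zip(chars, tags):
        if tag == "B" or not words:
            words.append(ch)      # start a new word
        else:
            words[-1] += ch       # extend the current word
    return words


segmented = ["研究", "生命", "起源"]
tags = words_to_tags(segmented)
print(tags)
print(tags_to_words("".join(segmented), tags))
```

With this encoding, predicting a segmentation is the same as predicting the tag sequence t for the character sequence c, which is what the CRF above models.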
slide-12
SLIDE 12

CRF for Word Segmentation

Unknown word detection

  • Peng et al. [2004]: use the forward-backward algorithm to compute a confidence for each word segment.
  • Tseng et al. [2005]: add features to the model, e.g. the first and last characters of rare words.
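One way to obtain a segment confidence from a linear-chain model is to compare two forward passes: one over all tag sequences and one restricted to sequences that tag the candidate span as a single word. The sketch below does this for a two-tag model; it is an assumption about the kind of computation involved, not the implementation of Peng et al. [2004], and the scores, tag set and function names are hypothetical.

```python
import math

TAGS = ("B", "I")  # assumed tag set: B = word start, I = word continuation


def partition(emit, trans, fixed=None):
    """Forward pass: sum of exp(score) over all allowed tag sequences.

    emit[k][t] and trans[(s, t)] are log-potentials. `fixed` maps a position
    to the only tag allowed there (used for the constrained pass).
    """
    fixed = fixed or {}

    def allowed(k, t):
        return fixed.get(k, t) == t

    alpha = {t: math.exp(emit[0][t]) if allowed(0, t) else 0.0 for t in TAGS}
    for k in range(1, len(emit)):
        alpha = {
            t: (sum(alpha[s] * math.exp(trans[(s, t)]) for s in TAGS)
                * math.exp(emit[k][t])) if allowed(k, t) else 0.0
            for t in TAGS
        }
    return sum(alpha.values())


def segment_confidence(emit, trans, start, end):
    """Probability that characters start..end-1 form exactly one word."""
    fixed = {start: "B"}
    for k in range(start + 1, end):
        fixed[k] = "I"
    if end < len(emit):
        fixed[end] = "B"          # the next word must start right after
    return partition(emit, trans, fixed) / partition(emit, trans)


# Hypothetical log-potentials for a three-character sentence.
emit = [{"B": 1.0, "I": -1.0}, {"B": 0.2, "I": 0.5}, {"B": 0.3, "I": -0.2}]
trans = {("B", "B"): 0.1, ("B", "I"): 0.4, ("I", "B"): 0.3, ("I", "I"): -0.5}
print(segment_confidence(emit, trans, 0, 2))
```

A low confidence for a span flags it as a candidate for rejection, while a high confidence for a span that is not in the lexicon suggests a new word.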


slide-13
SLIDE 13

CRF for Word Segmentation

Results


slide-15
SLIDE 15

Do We Need Word Segmentation for SMT?

Xu et al. [2004]

  • Each Chinese character is interpreted as one “word”.
  • The Chinese characters are aligned with the English text.
  • A Chinese word dictionary is generated from the alignments.
  • The self-learned dictionary is then used for Chinese word segmentation.
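The steps above can be sketched as follows, assuming that consecutive characters aligned to the same English word are merged into one Chinese word. That grouping rule, the alignment format (one English position or `None` per character) and the function names are assumptions for illustration; in practice the alignments would come from a statistical aligner rather than being written by hand.

```python
from collections import Counter
from itertools import groupby


def words_from_alignment(chars, aligned_to):
    """Merge consecutive characters that share the same English position."""
    words = []
    pos = 0
    for target, group in groupby(aligned_to):
        n = len(list(group))
        chunk = chars[pos:pos + n]
        if target is None:
            words.extend(chunk)   # unaligned characters stay separate
        else:
            words.append(chunk)
        pos += n
    return words


def learn_dictionary(aligned_corpus):
    """Count the multi-character words induced by the alignments."""
    counts = Counter()
    for chars, aligned_to in aligned_corpus:
        counts.update(w for w in words_from_alignment(chars, aligned_to)
                      if len(w) > 1)
    return counts


# Hypothetical alignment of 我喜欢机器翻译 to "I like machine translation".
corpus = [("我喜欢机器翻译", [0, 1, 1, 2, 2, 3, 3])]
print(words_from_alignment(*corpus[0]))
print(learn_dictionary(corpus))
```

The resulting dictionary can then drive a dictionary-based segmenter, so the segmentation is tuned to the translation task rather than to an external word definition.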


slide-16
SLIDE 16

Do We Need Word Segmentation for SMT?

Word length statistics


slide-17
SLIDE 17

Do We Need Word Segmentation for SMT?


slide-18
SLIDE 18

Integrated Word Segmentation in SMT

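Xu [2005]: single-best segmentation translation. The character sequence c_1^K is first segmented into the most probable word sequence, and that word sequence is then translated:

\[
\hat{f}_1^{\hat{J}} = \arg\max_{f_1^J,\, J} \Pr\left( f_1^J \mid c_1^K \right)
\]

\[
\hat{e}_1^{\hat{I}} = \arg\max_{e_1^I,\, I} \Pr\left( e_1^I \mid \hat{f}_1^{\hat{J}} \right)
\]
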
slide-19
SLIDE 19

Integrated Word Segmentation in SMT

Xu [2005]: segmentation lattice translation

slide-20
SLIDE 20

Integrated Word Segmentation in SMT

Xu [2005]

  • Input sentence at the character level
  • Segmentation lattice
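A segmentation lattice can be pictured as a graph whose nodes are character positions and whose edges are candidate words. The sketch below builds such a lattice from a dictionary and enumerates the segmentations it encodes. The edge representation, the `max_len` limit and the toy dictionary are assumptions for illustration, and edge weights are left out.

```python
def build_lattice(chars, dictionary, max_len=4):
    """Return edges (start, end, word) over character positions 0..len(chars)."""
    edges = []
    for i in range(len(chars)):
        for j in range(i + 1, min(i + max_len, len(chars)) + 1):
            word = chars[i:j]
            # Single characters are always allowed, so the lattice is connected.
            if j - i == 1 or word in dictionary:
                edges.append((i, j, word))
    return edges


def segmentations(edges, start, end):
    """Enumerate every path from start to end as a list of words."""
    if start == end:
        yield []
        return
    for i, j, word in edges:
        if i == start:
            for rest in segmentations(edges, j, end):
                yield [word] + rest


# Hypothetical toy dictionary.
toy_dictionary = {"研究", "研究生", "生命"}
lattice = build_lattice("研究生命", toy_dictionary)
for path in segmentations(lattice, 0, 4):
    print(path)
```

Passing the whole lattice to the decoder lets the translation model choose among the paths, instead of committing to one segmentation before translation.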


slide-21
SLIDE 21

Integrated Word Segmentation in SMT

Xu [2005]

  • Input sentence at the character level
  • Segmentation lattice with weights

slide-22
SLIDE 22

Integrated Word Segmentation in SMT

Xu [2005]: corpus statistics

slide-23
SLIDE 23

Integrated Word Segmentation in SMT

Translation results

  • Monotone finite state transducer
  • Phrase-based system

slide-27
SLIDE 27

Conclusion & Discussion

  • Very little research has been done on word segmentation for machine translation.
  • GIZA++ can produce erroneous alignments.
  • Some English words and Chinese characters are left unaligned.
  • Word reordering problems.

slide-28
SLIDE 28

References

K. S. Cheng, G. H. Young, and Wong. A study on word-based and integral-bit Chinese text compression algorithms. Journal of the American Society for Information Science, 50(3):218–228, 1999.

Fuchun Peng, Fangfang Feng, and Andrew McCallum. Chinese segmentation and new word detection using conditional random fields. In Proceedings of Coling 2004, pages 562–568, Geneva, Switzerland, August 2004. COLING.

Richard Sproat, Chilin Shih, William Gale, and Nancy Chang. A stochastic finite-state word-segmentation algorithm for Chinese. In Meeting of the Association for Computational Linguistics, pages 66–73, 1994.

Huihsin Tseng, Pichuan Chang, Galen Andrew, Daniel Jurafsky, and Christopher Manning. A conditional random field word segmenter. 2005. URL http://www.aclweb.org/anthology-new/W/W06/

Xu. Integrated Chinese word segmentation in statistical machine translation. In Proceedings of the International Workshop on Spoken Language Translation (IWSLT), pages 141–147, Pittsburgh, PA, October 2005.

slide-29
SLIDE 29
J. Xu, R. Zens, and H. Ney. Do we need Chinese word segmentation for statistical machine translation? In Proceedings of the Third SIGHAN Workshop on Chinese Language Learning, pages 122–128, Barcelona, Spain, July 2004.