Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora



slide-1
SLIDE 1

July 4-5, 2003, German-Japan WS on NLP

Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora

Takehito Utsuro Graduate School of Informatics, Kyoto University, Japan

utsuro@pine.kuee.kyoto-u.ac.jp

slide-2
SLIDE 2

Background

Translation Knowledge Acquisition from Parallel/Comparable Corpora

From Parallel Corpora

  • translation knowledge acquisition: relatively easier
  • resource: less available

From Comparable Corpora

  • translation knowledge acquisition: relatively harder
  • resource: more available

slide-3
SLIDE 3

Translation Knowledge Acquisition: Our Approach

Source of Translation Knowledge Acquisition: Cross-Lingually Relevant News Articles (on WWW news sites)

Updated every day → enabling efficient acquisition of up-to-date translation knowledge

Techniques: Integration of CLIR and Translation Knowledge Acquisition from Parallel/Comparable Corpora

slide-7
SLIDE 7

Translation Knowledge Acquisition from WWW News Sites: Overview

[Diagram: system overview]

  • WWW (News Sites) → Japanese News Articles DB, English News Articles DB
  • Retrieval of Bilingual Article Pair → Relevant Article Pair (Japanese Article, English Article)
  • Translation Knowledge Acquisition → Translation Knowledge DB
  • Other components shown: MT system, Bilingual Lexicon

slide-8
SLIDE 8

Cross-Language Retrieval of Relevant News Articles

[Diagram: retrieval pipeline]

  • WWW (News Sites) → English News Articles DB, Japanese News Articles DB
  • English Article → MT System → Japanese Translation
  • Japanese Article → Filtering by Dates
  • Similarity Calculation: cosine of frequency vectors
  • Output: Bilingual Article Pair (Relevant Articles)
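
The retrieval step above (date filtering followed by a cosine over frequency vectors) could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `cosine` and `retrieve_pairs`, the `(id, date, terms)` article layout, and the toy data are all hypothetical, and the English articles are assumed to have already been translated into Japanese by an MT system.

```python
import math
from collections import Counter
from datetime import date


def cosine(u: Counter, v: Counter) -> float:
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def retrieve_pairs(translated_en, ja_articles, max_days, sim_lower_bound):
    """Return (en_id, ja_id, sim) for article pairs that pass both filters.

    Each article is a hypothetical (id, date, list_of_terms) tuple; the
    English articles are assumed to be machine-translated into Japanese.
    """
    pairs = []
    for en_id, en_date, en_terms in translated_en:
        en_vec = Counter(en_terms)
        for ja_id, ja_date, ja_terms in ja_articles:
            # Filtering by dates: skip articles published too far apart.
            if abs((ja_date - en_date).days) > max_days:
                continue
            sim = cosine(en_vec, Counter(ja_terms))
            # Keep only pairs whose similarity reaches the lower bound.
            if sim >= sim_lower_bound:
                pairs.append((en_id, ja_id, sim))
    return pairs


# Toy data, invented for illustration only.
translated_en = [("e1", date(2002, 1, 10), ["狂牛病", "感染", "感染"])]
ja_articles = [
    ("j1", date(2002, 1, 11), ["狂牛病", "感染"]),  # close date, similar terms
    ("j2", date(2002, 1, 20), ["狂牛病", "感染"]),  # similar terms, 10 days apart
    ("j3", date(2002, 1, 10), ["復興"]),            # close date, no shared terms
]
pairs = retrieve_pairs(translated_en, ja_articles, max_days=4, sim_lower_bound=0.4)
print(pairs)
```

Only the pair (e1, j1) survives here: j2 is dropped by the date filter and j3 by the similarity lower bound.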

slide-9
SLIDE 9

Related Research Issues: Translation Knowledge Acquisition

Acquisition from Parallel Corpora

  • statistical MT models: e.g., [Brown 90, 93]
  • term correspondence estimation based on contingency tables of cross-language co-occurrence frequencies: e.g., [Gale 91, Kumano 94, Haruno 96, Smadja 96, Kitamura 96, Melamed 00]

Acquisition from Comparable Corpora: contextual similarities of words across languages

  • without the help of existing bilingual lexicons: earlier works [Fung 95]
  • exploiting existing bilingual lexicons as initial seed: later works [Rapp 95, 99, Kaji 96, K. Tanaka 96, Fung 98, T. Tanaka 02]

Collecting Partially Bilingual Texts from WWW with Internet Search Engines: [Nagata 01]

slide-11
SLIDE 11

Estimating Bilingual Term Correspondences from Parallel Sentences

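For an English term x and a Japanese term y, each pair of parallel sentences falls into one of four cases:

  • x∧y: term x and term y both occur
  • x∧¬y: only term x occurs
  • ¬x∧y: only term y occurs
  • ¬x∧¬y: neither term occurs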

slide-12
SLIDE 12

Measures for Estimating Bilingual Term Correspondences from Contingency Table

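Contingency table of co-occurrence frequencies:

          y                  ¬y
  x       freq(x, y) = a     freq(x, ¬y) = b
  ¬x      freq(¬x, y) = c    freq(¬x, ¬y) = d

With N = a + b + c + d:

mutual information (MI)

$$I(x; y) = \log_2 \frac{aN}{(a+b)(a+c)}$$

φ2 statistic

$$\phi^2(x, y) = \frac{(ad - bc)^2}{(a+b)(a+c)(b+d)(c+d)}$$

Dice coefficient

$$\mathrm{Dice}(x, y) = \frac{2a}{2a + b + c}$$

log-likelihood

$$\text{Log-like} = f(a) + f(b) + f(c) + f(d) - f(a+b) - f(a+c) - f(b+d) - f(c+d) + f(a+b+c+d), \quad \text{where } f(x) = x \log x$$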
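
The four measures can be computed directly from the cell counts a, b, c, d of the contingency table. The sketch below is illustrative only: the function names are hypothetical, N is taken to be a + b + c + d, and the log-likelihood uses f(x) = x log x with the total term f(a+b+c+d) added, which makes the score zero when x and y are independent. It assumes a > 0 for MI and non-empty margins for φ2.

```python
import math


def xlogx(x: float) -> float:
    """f(x) = x log x, with the usual convention f(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def mutual_information(a, b, c, d):
    """I(x; y) = log2( aN / ((a+b)(a+c)) ); requires a > 0."""
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c)))


def phi2(a, b, c, d):
    """phi^2 = (ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)); margins must be non-zero."""
    return (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))


def dice(a, b, c, d):
    """Dice = 2a / (2a + b + c); d is unused but kept for a uniform signature."""
    return 2 * a / (2 * a + b + c)


def log_likelihood(a, b, c, d):
    """Sum of cell terms, minus row and column margin terms, plus the total."""
    return (xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
            - xlogx(a + b) - xlogx(a + c) - xlogx(b + d) - xlogx(c + d)
            + xlogx(a + b + c + d))


# Perfect association: x and y always occur together.
print(mutual_information(5, 0, 0, 5), phi2(5, 0, 0, 5), dice(5, 0, 0, 5))
# Independence: every cell equal, so phi^2 and the log-likelihood vanish.
print(phi2(1, 1, 1, 1), log_likelihood(1, 1, 1, 1))
```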

slide-14
SLIDE 14

Term Correspondence Acquisition from Comparable Corpora

[Diagram: a term in the whole English corpus and a term in the whole Japanese corpus are each represented by a context frequency vector; term correspondence estimation compares the two vectors.]

slide-16
SLIDE 16

Translation Knowledge Acquisition: Our Approach

Translation knowledge acquisition from cross-lingually relevant article pairs collected by CLIR techniques

slide-17
SLIDE 17

Term Correspondence Acquisition from Cross-Lingually Relevant Article Pairs Collected by CLIR Techniques

[Diagram: articles in the whole Japanese corpus and the whole English corpus, divided into cross-lingually relevant and cross-lingually non-relevant pairs; labels: term, context.]

slide-18
SLIDE 18

Term Correspondence Acquisition from Cross-Lingually Relevant Article Pairs Collected by CLIR Techniques

[Diagram: within cross-lingually relevant article pairs from the whole Japanese corpus and the whole English corpus, each term is represented by a context frequency vector; term correspondence estimation compares the vectors.]

slide-21
SLIDE 21

Translation Knowledge Acquisition: Our Approach

Translation knowledge acquisition from cross-lingually relevant article pairs collected by CLIR techniques

  • Techniques for parallel corpora become applicable to translation knowledge acquisition from comparable corpora

Related Work: Translation Knowledge Acquisition from Comparable Corpora

  • Estimating term correspondences based on contextual similarities across languages
  • Contextual vectors: averaged over the whole corpus
  • No use of CLIR techniques for restricting relevant documents across languages

slide-22
SLIDE 22

Cross-Language Retrieval of Relevant News Articles: Evaluation Issues

Availability of Cross-Lingually Relevant Articles

  • Query articles should be English rather than Japanese
  • Cross-lingually relevant articles are available for more than 60% of English query articles

Recall/Precision of Cross-Language Retrieval of Relevant News Articles

  • precision: 50% or more when article similarity ≥ 0.4

slide-23
SLIDE 23

Term Correspondence Acquisition: Evaluation Issues

Source of Translation Knowledge Acquisition: Cross-Lingually Relevant News Articles on WWW News Sites

  • Effect of CLIR in Reducing Bilingual Term Pair Candidates
  • Accuracy of Bilingual Term Correspondence Estimation

slide-25
SLIDE 25

Statistics of Article Pairs with Similarity Values above Lower Bound

Site                              A        B        C
Total # of Days      English      562      162      162
                     Japanese     578      168      166
Total # of Articles  English      607      2910     3435
                     Japanese     21349    14854    16166

Site   Difference of     Lower Bound Ld       # of English   # of Japanese
       dates (days)      of Article's Sim     Articles       Articles
A      ±4                0.25                 473            1990
                         0.3                  362            1128
                         0.4                  190            377
                         0.5                  74             101
B      ±3                0.4                  415            631
                         0.5                  92             127
C      ±2                0.4                  453            725
                         0.5                  144            185

slide-27
SLIDE 27

Bilingual Term Pair Candidates: Full pairs vs. Reduced pairs

English Articles Collection: mad cow disease, infection, Afghan, reconstruction
Japanese Articles Collection: 狂牛病, 感染, アフガン, 復興

"full": 16 pairs

  • mad cow disease-狂牛病
  • mad cow disease-感染
  • mad cow disease-アフガン
  • mad cow disease-復興
  :
  • reconstruction-狂牛病
  • reconstruction-感染
  • reconstruction-アフガン
  • reconstruction-復興

slide-28
SLIDE 28

Bilingual Term Pair Candidates: Full pairs vs. Reduced pairs

English Articles Collection: mad cow disease, infection, Afghan, reconstruction
Japanese Articles Collection: 狂牛病, 感染, アフガン, 復興

"full": 16 pairs

  • mad cow disease-狂牛病
  • mad cow disease-感染
  • mad cow disease-アフガン
  • mad cow disease-復興
  :
  • reconstruction-狂牛病
  • reconstruction-感染
  • reconstruction-アフガン
  • reconstruction-復興

"reduced": 8 pairs (collected from relevant articles)

  • mad cow disease-狂牛病
  • mad cow disease-感染
  :
  • reconstruction-アフガン
  • reconstruction-復興
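
The difference between the two candidate sets can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names, the dictionary layout, and in particular the grouping of the slide's eight terms into two English and two Japanese articles, which is one arrangement that reproduces the 16 and 8 pair counts.

```python
from itertools import product


def full_pairs(en_articles, ja_articles):
    """Pair every English term in the collection with every Japanese term."""
    en_terms = set().union(*en_articles.values())
    ja_terms = set().union(*ja_articles.values())
    return set(product(en_terms, ja_terms))


def reduced_pairs(en_articles, ja_articles, relevant):
    """Pair terms only within cross-lingually relevant article pairs."""
    pairs = set()
    for en_id, ja_id in relevant:
        pairs.update(product(en_articles[en_id], ja_articles[ja_id]))
    return pairs


# Hypothetical assignment of the example terms to articles.
en_articles = {
    "e1": {"mad cow disease", "infection"},
    "e2": {"Afghan", "reconstruction"},
}
ja_articles = {
    "j1": {"狂牛病", "感染"},
    "j2": {"アフガン", "復興"},
}
relevant = [("e1", "j1"), ("e2", "j2")]  # article pairs judged relevant by CLIR

full = full_pairs(en_articles, ja_articles)
reduced = reduced_pairs(en_articles, ja_articles, relevant)
print(len(full), len(reduced))
```

Under this arrangement a pair such as mad cow disease-復興 appears in the full set but not in the reduced set, because the two terms never occur in a relevant article pair.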

slide-29
SLIDE 29

# of Monolingual Terms and Bilingual Term Pairs

site   article sim    # of monolingual terms   # of candidate bilingual term pairs   ratio
       lower bound    English     Japanese     reduced        full                   (full/reduced)
A      0.3            5,463       8,119        1,639,714      44,354,097             27.1
       0.4            2,684       3,231        427,889        8,672,004              20.3
       0.5            780         737          52,435         574,860                11.0
B      0.4            11,968      8,658        4,074,980      103,618,944            25.4
       0.5            2,468       2,158        494,544        5,325,944              10.8
C      0.4            13,200      9,433        4,367,775      124,515,600            28.5
       0.5            3,760       2,612        638,089        9,821,120              15.4
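
The figures in this table are internally consistent, which a short check makes visible: each "full" count equals the product of the two monolingual term counts, and each ratio is "full" divided by "reduced", to within rounding. The sketch below only re-does that arithmetic; the site labels and the English/Japanese column assignment are a reading of the flattened slide table and should be treated as assumptions (the check itself does not depend on them).

```python
# (site, sim lower bound, English terms, Japanese terms, reduced, full, ratio)
rows = [
    ("A", 0.3, 5463, 8119, 1639714, 44354097, 27.1),
    ("A", 0.4, 2684, 3231, 427889, 8672004, 20.3),
    ("A", 0.5, 780, 737, 52435, 574860, 11.0),
    ("B", 0.4, 11968, 8658, 4074980, 103618944, 25.4),
    ("B", 0.5, 2468, 2158, 494544, 5325944, 10.8),
    ("C", 0.4, 13200, 9433, 4367775, 124515600, 28.5),
    ("C", 0.5, 3760, 2612, 638089, 9821120, 15.4),
]

for site, bound, en, ja, reduced, full, ratio in rows:
    # "full" pairs every English term with every Japanese term.
    product_ok = en * ja == full
    # The reported ratio matches full / reduced up to rounding.
    ratio_ok = abs(full / reduced - ratio) < 0.06
    print(site, bound, product_ok, ratio_ok, f"reduced to 1/{full / reduced:.1f}")
```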

slide-30
SLIDE 30

Term Recognition Criteria (preliminary)

  • Neither statistics-based nor grammar-based intelligent criteria are used
  • English terms: every word sequence (5 words or fewer)
  • Japanese terms: (noun|verb)+ (5 words or fewer)

Frequency Lower Bounds
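
These preliminary criteria could be sketched as below. The sketch rests on several assumptions that are not in the slides: the input is one tokenized sentence at a time, Japanese tokens arrive already tagged by some morphological analyzer with the hypothetical labels "noun" and "verb", and `min_freq` stands in for a frequency lower bound whose actual value is not given.

```python
from collections import Counter


def english_candidates(tokens, max_len=5):
    """Every contiguous word sequence of up to max_len words."""
    return [" ".join(tokens[i:i + n])
            for i in range(len(tokens))
            for n in range(1, max_len + 1)
            if i + n <= len(tokens)]


def japanese_candidates(tagged, max_len=5):
    """Every contiguous run of nouns/verbs of up to max_len words.

    tagged is a list of (surface, pos) pairs; the pos labels are assumed.
    """
    out = []
    for i in range(len(tagged)):
        for n in range(1, max_len + 1):
            span = tagged[i:i + n]
            # Stop when the span runs off the end or hits another part of speech.
            if len(span) < n or span[-1][1] not in ("noun", "verb"):
                break
            out.append("".join(surface for surface, _ in span))
    return out


def apply_frequency_lower_bound(candidates, min_freq):
    """Keep candidates that occur at least min_freq times (hypothetical bound)."""
    counts = Counter(candidates)
    return {term for term, count in counts.items() if count >= min_freq}


en = english_candidates(["mad", "cow", "disease"])
ja = japanese_candidates([("アフガン", "noun"), ("復興", "noun"),
                          ("の", "particle"), ("会議", "noun")])
print(en)
print(ja)
```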

slide-32
SLIDE 32

Effect of CLIR in Bilingual Lexicon Acquisition

  • # of bilingual term pair candidates reduced to 1/10~1/30
  • most of the filtered-out pairs are not correct translations
  • reducing computational complexity of bilingual term correspondence estimation

slide-34
SLIDE 34

Bilingual Term Correspondence Estimation with Statistical Measure

English term: Tokyo District Court
Japanese translation candidates (higher rank first):

Japanese term   freq(tE)   freq(tJ)   freq(tE,tJ)   φ2
東京地裁         11         9          7             0.486   ← Maximum Estimated Value
救済             11         3          3             0.268
被告             11         9          4             0.151
:                :          :          :             :
地方裁判所       11         3          2             0.116
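
Ranking the Japanese translation candidates for one English term amounts to sorting them by the estimated value and reading off the maximum. The sketch below uses the four candidates visible on this slide for "Tokyo District Court", as read from the flattened table (the elided rows are left out); the tuple layout and variable names are illustrative assumptions.

```python
# (Japanese term, freq(tE), freq(tJ), freq(tE,tJ), phi^2), as read from the slide.
candidates = [
    ("被告", 11, 9, 4, 0.151),
    ("地方裁判所", 11, 3, 2, 0.116),
    ("東京地裁", 11, 9, 7, 0.486),
    ("救済", 11, 3, 3, 0.268),
]

# Higher phi^2 means higher rank; rank 1 is the best candidate.
ranked = sorted(candidates, key=lambda row: row[-1], reverse=True)
best_term, *_, max_estimate = ranked[0]

for rank, (term, f_e, f_j, f_ej, score) in enumerate(ranked, start=1):
    print(rank, term, f_e, f_j, f_ej, score)
print("maximum estimated value:", best_term, max_estimate)
```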

slide-35
SLIDE 35

Comparison of Measures for Bilingual Term Correspondence Estimation

(site A, Sim LBD= 0.4, for 200 English terms with highest max estimation values)

contextual similarities: reduced vs. full

slide-36
SLIDE 36

Term Correspondence Estimation based on Contextual Similarity across Languages

[Diagram: a term in the whole English corpus and a term in the whole Japanese corpus are each represented by a context frequency vector; term correspondence estimation compares the two vectors.]

slide-38
SLIDE 38

Comparison of Measures for Bilingual Term Correspondence Estimation

(site A, Sim LBD= 0.4, for 200 English terms with highest max estimation values)

contextual similarities: reduced vs. full

estimated bilingual term pairs mostly overlap

slide-39
SLIDE 39

Numbers of Correct Bilingual Term Pairs

(Manual Evaluation)

(site A, Sim LBD= 0.4, for 200 English terms with highest max estimation values)

[Figure: bar chart. x-axis: rank of Japanese translation candidate for an English term (1, 2-5, 6-10, 11-50, 51-100, 101-500, 501-1000, 1001-); y-axis: # of correct bilingual term pairs (10 to 90); series: contextual similarity (reduced), contextual similarity (full).]

slide-41
SLIDE 41

Comparison of Measures for Bilingual Term Correspondence Estimation

(site A, Sim LBD= 0.4, for 200 English terms with highest max estimation values)

contextual similarities: reduced vs. full

contextual similarities (reduced) vs. φ2 (contingency table)
slide-42
SLIDE 42

Applying φ2 statistic to Bilingual Term Correspondences Estimation from Relevant Articles

For an English term x and a Japanese term y, counted over relevant article pairs (English, Japanese):

  • x∧y ⇒ a
  • x∧¬y ⇒ b
  • ¬x∧y ⇒ c
  • ¬x∧¬y ⇒ d

$$\phi^2(x, y) = \frac{(a \cdot d - b \cdot c)^2}{(a+b)(a+c)(b+d)(c+d)}$$
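
One way to obtain a, b, c, d from relevant articles is to treat each relevant article pair as one observation and record whether x occurs on the English side and y on the Japanese side. The sketch below follows that reading; it is an assumption about the counting unit rather than a statement of the authors' procedure, and the function names and toy article pairs are hypothetical.

```python
def contingency(x, y, article_pairs):
    """Count the four cases over (english_terms, japanese_terms) article pairs."""
    a = b = c = d = 0
    for en_terms, ja_terms in article_pairs:
        has_x, has_y = x in en_terms, y in ja_terms
        if has_x and has_y:
            a += 1          # x and y both present
        elif has_x:
            b += 1          # only x present
        elif has_y:
            c += 1          # only y present
        else:
            d += 1          # neither present
    return a, b, c, d


def phi2(a, b, c, d):
    """phi^2 = (ad - bc)^2 / ((a+b)(a+c)(b+d)(c+d)); 0.0 if a margin is empty."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    return (a * d - b * c) ** 2 / denom if denom else 0.0


# Toy relevant article pairs, invented for illustration.
article_pairs = [
    ({"mad cow disease", "infection"}, {"狂牛病", "感染"}),
    ({"mad cow disease"}, {"狂牛病"}),
    ({"reconstruction"}, {"復興", "アフガン"}),
    ({"reconstruction", "Afghan"}, {"復興"}),
]

good = contingency("mad cow disease", "狂牛病", article_pairs)
weak = contingency("infection", "狂牛病", article_pairs)
print(good, phi2(*good))
print(weak, phi2(*weak))
```

The correct pair scores higher than the pair that merely co-occurs in one article.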

slide-43
SLIDE 43

Comparison of Measures for Bilingual Term Correspondence Estimation

(site A, Sim LBD= 0.4, for 200 English terms with highest max estimation values)

contextual similarities: reduced vs. full

estimated bilingual term pairs mostly overlap

contextual similarities (reduced) vs. φ2 (contingency table)

overlap of the estimated bilingual term pairs is less than 30%

slide-44
SLIDE 44

Numbers of Correct Bilingual Term Pairs

(Manual Evaluation)

(site A, Sim LBD= 0.4, for 200 English terms with highest max estimation values)

[Figure: bar chart. x-axis: rank of Japanese translation candidate for an English term (1, 2-5, 6-10, 11-50, 51-100, 101-500, 501-1000, 1001-); y-axis: # of correct bilingual term pairs (10 to 90); series: contextual similarity (reduced), contextual similarity (full), φ2 (contingency table).]

slide-45
SLIDE 45

Accuracy of N-best Bilingual Term Pair Candidates (Manual Evaluation)

(site A, Sim LBD= 0.4, for 200 English terms with highest max estimation values)

[Figure: bar chart of accuracy (%)]

                                   1-best   10-best
φ2 statistic (contingency table)   42       72
contextual similarity (reduced)    27       48

slide-46
SLIDE 46

Bilingual Term Correspondence Acquisition: Current Results Summary

Effect of CLIR in Bilingual Lexicon Acquisition

  • # of bilingual term pair candidates reduced to 1/10~1/30
  • most of the filtered-out pairs are not correct translations

Metrics for Bilingual Term Correspondence Estimation

  • accuracy
      φ2 (contingency table): 42% (1-best) and 72% (10-best)
      contextual similarity (reduced): 27% (1-best) and 48% (10-best)
  • overlap of the estimated bilingual term pairs is less than 30%
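
An N-best accuracy of this kind can be computed as the fraction of English terms for which a correct translation appears among the top N ranked candidates. That definition is an assumption (the slides report the figures without spelling it out), and the function name, data layout, and toy rankings below are hypothetical and unrelated to the reported 42% / 72% results.

```python
def n_best_accuracy(ranked, gold, n):
    """Share of English terms with a correct translation in the top n candidates.

    ranked maps an English term to its candidates, best first;
    gold maps an English term to the set of translations judged correct.
    """
    hits = sum(
        1
        for en_term, candidates in ranked.items()
        if any(ja in gold.get(en_term, set()) for ja in candidates[:n])
    )
    return hits / len(ranked)


# Toy rankings and manual judgements, invented for illustration.
ranked = {
    "mad cow disease": ["狂牛病", "感染"],
    "reconstruction": ["アフガン", "復興"],
    "infection": ["復興", "アフガン"],
}
gold = {
    "mad cow disease": {"狂牛病"},
    "reconstruction": {"復興"},
    "infection": {"感染"},
}

print(n_best_accuracy(ranked, gold, 1))
print(n_best_accuracy(ranked, gold, 2))
```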

slide-47
SLIDE 47

Conclusion

Source of Translation Knowledge Acquisition: Cross-Lingually Relevant News Articles (on WWW news sites)

Techniques: Integration of CLIR and Translation Knowledge Acquisition from Parallel/Comparable Corpora

Novel Term Correspondences Can Be Discovered

  • 1.4 times those found in an existing bilingual lexicon (0.85M entries)

Demo of Semi-automatic Acquisition Tool at ACL-2003 Exhibition

slide-48
SLIDE 48

Examples of Discovered Correct Term Correspondences

tE                                    tJ                 fE   fJ   fEJ   φ2      rank   TP(tE)   φ̂2(TP(tE))
mad cow disease                       狂牛病             13   16   12    0.681   1      84       0.681
conference on Afghan reconstruction   アフガン復興会議   3    3    3     1.000   1      43       1.000
Kato faction                          加藤派             3    4    3     0.748   2      70       1.000
Hansen's disease                      ハンセン病         3    3    3     1.000   1      36       1.000
Prime Minister Yoshiro Mori           森総理大臣         16   7    5     0.210   3      45       0.482
Yasukuni Shrine                       靖国神社           10   15   7     0.310   11     105      0.432
Cabinet Office                        内閣府             5    7    3     0.249   4      101      0.357
New job offers                        新規求人           3    3    3     1.000   1      64       1.000

Each entry is marked on the slide as either "Found in an existing bilingual lexicon" or "Not found in an existing bilingual lexicon".

slide-49
SLIDE 49

Current and Future Work

  • Incorporating Sophisticated Term Recognition Criteria
  • Integrating Article Similarities and Several Primitive Measures (contingency table and contextual similarities) for more accurate Bilingual Term Correspondence Estimation

slide-50
SLIDE 50

Publications

Takehito Utsuro, Takashi Horiuchi, Yasunobu Chiba, and Takeshi Hamamoto. Semi-automatic Compilation of Bilingual Lexicon Entries from Cross-Lingually Relevant News Articles on WWW News Sites. In S. D. Richardson, editor, Machine Translation: From Research to Real Users, Lecture Notes in Artificial Intelligence, Vol. 2499, pp. 165-176. Springer, October 2002.

Takehito Utsuro, Takashi Horiuchi, Takeshi Hamamoto, Kohei Hino, and Takeaki Nakayama. Effect of Cross-Language IR in Bilingual Lexicon Acquisition from Comparable Corpora. Proc. 9th EACL, pp. 355-362. April 2003.