
IASL System for NTCIR-6 Korean-Chinese CLIR



  1. IASL System for NTCIR-6 Korean-Chinese CLIR
     Yu-Chun Wang, Cheng-Wei Lee, Richard Tzong-Han Tsai, Wen-Lian Hsu, *Min-Yuh Day
     Intelligent Agent Systems Lab. (IASL), Institute of Information Science, Academia Sinica, Taiwan
     NTCIR-6, Tokyo, Japan, May 15-18, 2007

  2. Outline (IASL, IIS, Academia Sinica; NTCIR-6)
     - IASL CLIR System Architecture
     - Query Processing (Korean)
     - Term Translation (Korean to Traditional Chinese)
       - Bilingual Dictionary Translation
       - Person Name Translation
       - Term Disambiguation
     - Document Indexing (Chinese)
     - Document Retrieval (Chinese)
     - NTCIR-6 CLIR Evaluation Result
     - Error Analysis
     - Conclusion and Future Work

  3. CLIR System Architecture
     [Figure: system architecture diagram. Labels recoverable from the slide, grouped by stage:]
     - Query Processing (Korean): Korean Query (Title, Description); Rule-based Term Processing; KLT Term Extractor; Key Terms
     - Term Translation: Bi-lingual Dictionary Translation (Daum Korean-Chinese Dictionary, Korean Wikipedia); People Name Translation (Naver People Search); Term Disambiguation; Translated Chinese Terms
     - Indexing (Traditional Chinese): CIRB 4.0; CKIP AutoTag; Lucene Indexing; Sentence Index; Document Index
     - Document Retrieval: Lucene Query Transformer; Lucene Query; Lucene IR Engine; IR Result

  4. CLIR System Architecture (same diagram as slide 3, with the part marked 1 highlighted)

  5. CLIR System Architecture (same diagram as slide 3, with the part marked 2 highlighted: Term Translation, Daum Korean-Chinese Dictionary, Translated Chinese Terms)

  6. CLIR System Architecture (same diagram as slide 3, with the part marked 3 highlighted)

  7. CLIR System Architecture (same diagram as slide 3, with the part marked 4 highlighted)

  8. CLIR System Architecture (same diagram as slide 3)

  9. Query Processing
     - Pre-defined rules for the title of the query:
       - Chunk the sentence at spaces and punctuation.
       - Remove Josa (postpositional particles) from the end of each term.
     - For the description part of a Korean query:
       - Use the KLT Term Extractor (by Kookmin University) to extract the key words and remove stop words.
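
The two title rules on this slide can be sketched as follows. This is a minimal illustration, not the authors' rule set: the particle list `JOSA` is a short, incomplete sample chosen here, and the function names are hypothetical. Stripping by suffix alone is naive, since a term may legitimately end in one of these syllables.

```python
import re

# Illustrative, incomplete list of Josa (an assumption, not the slide's rules).
JOSA = ("에서", "으로", "은", "는", "이", "가", "을", "를", "의", "에", "와", "과", "로", "도")

def chunk_title(title):
    """Split a title at whitespace and punctuation, dropping empty chunks."""
    return [t for t in re.split(r"[\s\W_]+", title) if t]

def strip_josa(term):
    """Remove one trailing particle, trying longer particles first."""
    for josa in sorted(JOSA, key=len, reverse=True):
        # Never strip a term down to nothing.
        if len(term) > len(josa) and term.endswith(josa):
            return term[: -len(josa)]
    return term

def process_title(title):
    return [strip_josa(t) for t in chunk_title(title)]

print(process_title("한국의 경제, 위기를"))
```

A real system would need a morphological check before stripping, which is presumably why the description field is handed to a dedicated term extractor instead.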

  10. CLIR System Architecture (same diagram as slide 3)

  11. Bilingual Dictionary Translation
      - Dictionary-based translation method:
        - Daum Chinese-Korean online dictionary
        - Korean Wikipedia with inter-language links to Chinese Wikipedia
        - Mapping table to convert simplified Chinese characters to traditional Chinese ones
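
A minimal sketch of how the three resources on this slide might fit together. The entries in `DAUM`, `WIKI_LINKS`, and `S2T` are made-up stand-ins, and the lookup order (dictionary first, then Wikipedia) is an assumption; the slide does not state it.

```python
# Hypothetical stand-ins for the online resources (entries invented for illustration).
DAUM = {"경제": ["经济"]}        # dictionary results, possibly in simplified Chinese
WIKI_LINKS = {"한국": ["韩国"]}  # targets of Korean-to-Chinese inter-language links

# Tiny excerpt of a simplified-to-traditional character table (illustrative only).
S2T = str.maketrans({"经": "經", "济": "濟", "韩": "韓", "国": "國"})

def to_traditional(text):
    """Convert character by character; unmapped characters pass through unchanged."""
    return text.translate(S2T)

def translate_term(term):
    """Return traditional Chinese candidates, or an empty list if nothing is found."""
    raw = DAUM.get(term) or WIKI_LINKS.get(term) or []
    return [to_traditional(candidate) for candidate in raw]

print(translate_term("경제"))
print(translate_term("한국"))
```

A one-to-one character table is a simplification: some simplified characters correspond to more than one traditional character, so a production table needs context or word-level entries.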

  12. The Rules for Splitting Korean Terms
      - Apply the rules (based on the properties of Korean morphemes) to split a long term into several shorter terms.

      Number of characters | Separation
      3                    | ABC → A, BC; ABC → AB, C
      4                    | ABCD → AB, CD; ABCD → A, BCD; ABCD → ABC, D
      5                    | ABCDE → AB, CDE; ABCDE → ABC, DE
      6                    | ABCDEF → AB, CD, EF; ABCDEF → ABC, DEF
      7                    | ABCDEFG → AB, CD, EFG; ABCDEFG → AB, CDE, FG; ABCDEFG → ABC, DE, FG
      8                    | ABCDEFGH → AB, CD, EF, GH
      9                    | ABCDEFGHI → AB, CD, EF, GHI
      10                   | ABCDEFGHIJ → AB, CD, EF, GH, IJ
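
The splitting table can be encoded as segment-length patterns keyed by term length. The sketch below only enumerates the candidate splits. How the system chooses among them is not stated on the slide, so no selection step is shown. The names `SPLIT_PATTERNS` and `split_candidates` are hypothetical.

```python
# Segment lengths for each term length, in the order the table lists them.
SPLIT_PATTERNS = {
    3: [(1, 2), (2, 1)],
    4: [(2, 2), (1, 3), (3, 1)],
    5: [(2, 3), (3, 2)],
    6: [(2, 2, 2), (3, 3)],
    7: [(2, 2, 3), (2, 3, 2), (3, 2, 2)],
    8: [(2, 2, 2, 2)],
    9: [(2, 2, 2, 3)],
    10: [(2, 2, 2, 2, 2)],
}

def split_candidates(term):
    """Return every split of `term` that the table allows for its length."""
    candidates = []
    for pattern in SPLIT_PATTERNS.get(len(term), []):
        pieces, start = [], 0
        for size in pattern:
            pieces.append(term[start:start + size])
            start += size
        candidates.append(pieces)
    return candidates

print(split_candidates("ABCDE"))
```

Terms shorter than 3 or longer than 10 characters fall outside the table and yield no candidates here.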

  13. Person Name Translation
      - Transliteration methods are not appropriate for Korean-Chinese CLIR (unlike Korean-English or Korean-Japanese CLIR):
        - Many Chinese characters have the same pronunciation in Korean.
        - Korean renders Japanese personal names by their Japanese pronunciation.
        - Chinese uses the Japanese Kanji characters directly.
      - Naver People Search is used for person name translation:
        - Naver People Search is a database containing the basic profiles of famous people, including their original names.
        - If the original name is composed of Chinese characters (CJK person names), it is sent to the next stage directly.
        - If the original name is in English, we use the English name translation/transliteration table provided by Taiwan's Central News Agency (CNA) to translate it into Chinese.
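
The routing described on this slide can be sketched as below. `PROFILES` stands in for a Naver People Search lookup and `CNA_TABLE` for the CNA table; both contain invented entries, and the range check for Chinese characters is a simplifying assumption covering only the basic CJK Unified Ideographs block.

```python
import re

# Hypothetical stand-ins: Korean name -> original name, and English -> Chinese.
PROFILES = {"고이즈미 준이치로": "小泉純一郎", "조지 부시": "George Bush"}
CNA_TABLE = {"George Bush": "布希"}

# Basic CJK Unified Ideographs only (an assumption; extension blocks are ignored).
CJK_ONLY = re.compile(r"[\u4e00-\u9fff]+")

def translate_person(korean_name):
    """Return a Chinese form of the name, or None if it cannot be resolved."""
    original = PROFILES.get(korean_name)
    if original is None:
        return None                      # not found in the people database
    if CJK_ONLY.fullmatch(original):
        return original                  # already Chinese characters: pass through
    return CNA_TABLE.get(original)       # English original: look up the table

print(translate_person("고이즈미 준이치로"))
print(translate_person("조지 부시"))
```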

  14. Term Disambiguation
      - Ambiguity in translating Korean to Chinese:
        - Because Hangul is a phonetic alphabet, many different Chinese words are written with the same Hangul characters.
        - For example, the Hangul word "이상" corresponds to four different Chinese words: "理想" (ideal), "異常" (unusual), "以上" (above), and "異狀" (indisposition).
      - Apply mutual information to measure correlation and choose the best translation among the candidates:

        $$\text{MI score}(te_{ij} \mid Q) = \sum_{x=1,\, x \neq i}^{n} \sum_{y=1}^{Z(qt_x)} \frac{\Pr(te_{ij}, te_{xy})}{\Pr(te_{ij})\,\Pr(te_{xy})}$$
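
A small worked sketch of the MI score, under a reading of the symbols that the slide does not spell out: te_ij is taken to be the j-th translation candidate of the i-th query term, n the number of query terms, and Z(qt_x) the number of candidates for query term x. The probabilities in `PR` and `PR_JOINT` are invented toy values, and where the real probabilities come from is not stated on the slide.

```python
# Toy probabilities (invented for illustration only).
PR = {"理想": 0.02, "異常": 0.01, "以上": 0.05, "異狀": 0.001, "氣候": 0.01}
PR_JOINT = {
    frozenset(("異常", "氣候")): 0.002,
    frozenset(("理想", "氣候")): 0.0001,
    frozenset(("以上", "氣候")): 0.0005,
}

def mi_score(i, cand, candidates):
    """Sum Pr(a, b) / (Pr(a) Pr(b)) over all candidates of the other query terms."""
    score = 0.0
    for x, others in enumerate(candidates):
        if x == i:
            continue  # skip the term being disambiguated (x != i)
        for other in others:
            denom = PR.get(cand, 0.0) * PR.get(other, 0.0)
            if denom > 0:
                score += PR_JOINT.get(frozenset((cand, other)), 0.0) / denom
    return score

def best_translation(i, candidates):
    """Pick the candidate of term i with the highest MI score."""
    return max(candidates[i], key=lambda c: mi_score(i, c, candidates))

# Query terms: "이상" (four candidates) and a second term with one candidate.
candidates = [["理想", "異常", "以上", "異狀"], ["氣候"]]
print(best_translation(0, candidates))
```

With these toy numbers, the candidate that co-occurs most strongly with the other query term wins, which is the behaviour the slide describes.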

  15. CLIR System Architecture (same diagram as slide 3)

  16. Chinese Document Indexing and Lucene IR
      - CIRB 4.0 documents are pre-processed to remove noise and then segmented by CKIP AutoTag.
      - Lucene IR engine:
        - Chinese documents are indexed by Chinese character.
        - The Chinese query translated from the original Korean query is transformed into a Lucene query for retrieval.
        - If a term has several translation candidates, the weight of the candidate with the highest mutual information score is increased by 1 using the boost operator ^.
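
One way the query transformation might look, as a sketch only. It assumes that "increased by 1" means a boost of 2 (Lucene's default boost is 1), that the candidates of one term are combined with OR, and that terms are joined by the parser's default operator. None of these choices is confirmed by the slide, and `build_lucene_query` is a hypothetical helper that does no escaping of query-syntax characters.

```python
def build_lucene_query(candidates, best):
    """Build a Lucene query string from per-term candidates and the chosen best ones."""
    clauses = []
    for cands, top in zip(candidates, best):
        if len(cands) == 1:
            # A term with a single translation needs no boost.
            clauses.append(f'"{cands[0]}"')
            continue
        parts = [f'"{c}"^2' if c == top else f'"{c}"' for c in cands]
        clauses.append("(" + " OR ".join(parts) + ")")
    return " ".join(clauses)

query = build_lucene_query([["理想", "異常"], ["氣候"]], best=["異常", "氣候"])
print(query)
```

Keeping the lower-scoring candidates in the query, rather than discarding them, lets a wrong disambiguation still retrieve relevant documents at a lower weight.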
