
Using Unigram and Bigram Language Models for Monolingual and Cross-Language IR

Lixin Shi and Jian-Yun Nie
Dept. d'Informatique et de Recherche Opérationnelle, Université de Montréal

Outline:
1. Motivation
2. Related Work
3. Using Different Indexing Units
4. Using Different Translation Units
5. Conclusion and Future Work


1. Motivation

The difference between East-Asian languages and most European languages is the lack of natural word boundaries. For information retrieval, we have to determine the index units first, either by word segmentation or by cutting the sentence into n-grams.

Word segmentation is based on rules, dictionaries and/or statistics, and it raises problems for information retrieval:

- Segmentation ambiguity, a common problem in East-Asian languages: the same string can be segmented into different words (in Chinese, and in Japanese and Korean to some extent). For example, "发展中国家" (developing country) can be segmented as
    发展中 (developing) / 国家 (country)
    发展 (development) / 中 (middle) / 国家 (country)
    发展 (development) / 中国 (China) / 家 (family)
  as the toy segmenter sketched below illustrates.
- If a document and a query are segmented into different words, there may be a mismatch.
- Two different words may have the same or a related meaning, especially when they share common characters, e.g. 办公室 (office) ↔ 办公楼 (office building).
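A minimal sketch of greedy longest-match segmentation, one of the rule-based strategies the slides allude to. The toy dictionary and function name are assumptions for illustration, not from the presentation:

```python
# Hypothetical illustration: forward longest-match segmentation with a
# toy dictionary. A greedy segmenter commits to one reading and hides
# the alternatives listed on the slide.

TOY_DICT = {"发展", "发展中", "中", "中国", "国家", "家"}
MAX_WORD_LEN = 3  # longest entry in the toy dictionary

def longest_match_segment(text: str) -> list[str]:
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, falling back to one character.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in TOY_DICT:
                words.append(candidate)
                i += length
                break
    return words

print(longest_match_segment("发展中国家"))
# -> ['发展中', '国家']: the greedy choice of 发展中 (developing) hides
#    the alternative reading 发展 / 中国 / 家 from the slide.
```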

Cutting the sentence into n-grams needs no linguistic resource, and the use of unigrams and bigrams has been investigated in several previous studies: it is as effective as using word segmentation.

Limitations of previous studies:
- N-grams were only used in monolingual IR, or only tested for English-Chinese CLIR.
- N-grams and words were integrated in retrieval models other than language modeling (LM), e.g. the vector space model or the probabilistic model.

We focus on:
- Using words and n-grams as index units for monolingual IR under the LM framework.
- Using words and n-grams as translation units in CLIR.

2. Related Work

Monolingual IR on Chinese text starts by segmenting the input into indexing units, either words or overlapping n-grams:
- Various approaches exist for word segmentation (e.g. longest matching).
- Overlapping n-grams require no segmentation at all.

For example, 前年收入有所下降 can be indexed as:
- Unigrams: 前 / 年 / 收 / 入 / 有 / 所 / 下 / 降
- Words: 前年 / 收入 / 有所 / 下降, or 前 / 年收入 / 有所 / 下降
- Bigrams: 前年 / 年收 / 收入 / 入有 / 有所 / 所下 / 下降

The score function in language modeling is then the same as for other languages. A sketch of overlapping n-gram extraction follows.
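A minimal sketch of overlapping character n-gram extraction (an illustration, not code from the presentation), reproducing the unigram and bigram units above:

```python
# Hypothetical sketch: produce all overlapping character n-grams of a
# string, the resource-free indexing units discussed on the slide.

def char_ngrams(text: str, n: int) -> list[str]:
    """Return all overlapping character n-grams of the string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

sentence = "前年收入有所下降"
print(char_ngrams(sentence, 1))  # 前 / 年 / 收 / 入 / 有 / 所 / 下 / 降
print(char_ngrams(sentence, 2))  # 前年 / 年收 / 收入 / 入有 / 有所 / 所下 / 下降
```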

The LM approach to IR

Query-likelihood retrieval model (Ponte & Croft '98; Croft '03): build a language model for each document, then rank documents by the probability that the document model generates the query Q:

    P(Q|D) = \prod_{q_i \in Q} P(q_i|D)

KL-divergence model (Lafferty & Zhai '01, '02): build language models for the document and the query, then score by the negative divergence between them:

    Score(D,Q) = -KL(\theta_Q \| \theta_D) = \sum_{w \in V} P(w|\theta_Q) \log \frac{P(w|\theta_D)}{P(w|\theta_Q)}

The document model is smoothed with the collection model,

    P(w|\theta_D) = \lambda P(w|d) + (1-\lambda) P(w|C),

and the query model is the maximum likelihood estimate

    P(w|\theta_Q) = c(w,q)/|q|.

Cross-Language IR

CLIR requires translation between the query and document languages. The basic approach is to translate the query, using an MT system, a bilingual dictionary, or a parallel corpus: train a probabilistic translation model (TM) from the parallel corpus, then use the TM for CLIR (Nie et al. '99; Gao et al. '01, '02; Jin & Chai '05).

The LM approach to CLIR (Kraaij et al. '03) plugs the translation model into the KL-divergence model by estimating the query model in the document language:

    P(t_i|\theta_Q) = \sum_{s_j} P(s_j, t_i|\theta_Q) = \sum_{s_j} P(t_i|s_j, \theta_Q) P(s_j|\theta_Q) \approx \sum_{s_j} t(t_i|s_j) P(s_j|\theta_Q)

where t_i is a term in the document (target) language, s_j a term in the query (source) language, and t(t_i|s_j) is the translation model. A sketch of both score functions follows.
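A minimal sketch of the formulas above (an illustration, not the authors' implementation; the smoothing weight, helper names, and toy translation table are assumptions):

```python
# Hypothetical sketch of KL-divergence scoring with Jelinek-Mercer
# smoothing, plus the translated query model used for CLIR.
import math
from collections import Counter

LAMBDA = 0.7  # assumed smoothing weight, not from the presentation

def p_w_doc(w, doc_tf, doc_len, coll_tf, coll_len):
    """P(w|theta_D) = lambda*P(w|d) + (1-lambda)*P(w|C)."""
    return (LAMBDA * doc_tf.get(w, 0) / doc_len
            + (1 - LAMBDA) * coll_tf.get(w, 0) / coll_len)

def kl_score(query_terms, doc_tf, doc_len, coll_tf, coll_len):
    """Score(D,Q) = -KL(theta_Q || theta_D), summed over query terms.

    Assumes every query term occurs somewhere in the collection, so
    the smoothed document probability is never zero.
    """
    q_tf, q_len = Counter(query_terms), len(query_terms)
    score = 0.0
    for w, c in q_tf.items():
        p_q = c / q_len  # MLE query model: c(w,q)/|q|
        p_d = p_w_doc(w, doc_tf, doc_len, coll_tf, coll_len)
        score += p_q * math.log(p_d / p_q)
    return score

def translated_query_model(src_model, trans_table):
    """P(t|theta_Q) ~= sum_s t(t|s) * P(s|theta_Q)  (Kraaij et al. '03)."""
    tgt = Counter()
    for s, p_s in src_model.items():                     # source terms
        for t, p_ts in trans_table.get(s, {}).items():   # target terms
            tgt[t] += p_ts * p_s
    return dict(tgt)
```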

3. Using Different Indexing Units

Single index:
- Unigram (single character), U
- Bigram, B
- Word, W

For example, "国企研发投资":
- U: 国 / 企 / 研 / 发 / 投 / 资
- B: 国企 / 企研 / 研发 / 发投 / 投资
- W: 国企 / 研发 / 投资

Problems with a single index:
- Words can be segmented in different ways.
- Closely related words cannot match.

Combining different indexes: combine words with characters, or bigrams with characters.
- Merged indexes:
    WU (Word & Unigram): 国企 / 研发 / 投资 / 国 / 企 / 研 / 发 / 投 / 资
    BU (Bigram & Unigram): 国企 / 企研 / 研发 / 发投 / 投资 / 国 / 企 / 研 / 发 / 投 / 资
- Multiple indexes: B+U interpolates the bigram and unigram scores (see the sketch after this section):

    Score(D,Q) = \sum_i \alpha_i Score_i(D,Q) = -\alpha_U KL(\theta_Q^U \| \theta_D^U) - \alpha_B KL(\theta_Q^B \| \theta_D^B)

Experiment setting:

    Language  NTCIR3/4 collections           #doc(K)  NTCIR5/6 collections               #doc(K)
    Cn        CIRB011, CIRB020               381      CIRB040r                           901
    Jp        Mainichi98/99, Yomiuri98+99    594      Mainichi00/01r, Yomiuri00+01       858
    Kr        Chosunilbo98/99, Hankookilbo   254      Chosunilbo00/01, Hankookilbo00/01  220

Number of topics: NTCIR3: 50, NTCIR4: 60, NTCIR5: 50, NTCIR6: 50.

Monolingual IR on NTCIR4/5 was evaluated by mean average precision (MAP) for the runs U, B, W, BU, WU, and 0.3B+0.7U. (The MAP values in the original table are not legible in this copy.)

- Surprisingly, U is better than B and W for Chinese.
- Interpolating unigram and bigram (B+U) gives the best performance for Chinese and Japanese; however, BU and B are the best for Korean.
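A minimal sketch of the interpolated multiple-index score, as in the 0.3B+0.7U run. The weights are taken from that run's name; the toy score values and function name are assumptions:

```python
# Hypothetical sketch of score interpolation across indexes:
# Score(D,Q) = sum_i alpha_i * Score_i(D,Q).

def interpolated_score(per_index_scores: dict[str, float],
                       weights: dict[str, float]) -> float:
    """Weighted sum of per-index retrieval scores for one document."""
    return sum(weights[i] * per_index_scores[i] for i in weights)

# e.g. the 0.3B+0.7U run combines bigram- and unigram-index KL scores
# (the -4.2 and -3.8 below are toy values, not results from the paper):
score = interpolated_score({"B": -4.2, "U": -3.8},
                           {"B": 0.3, "U": 0.7})
```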
