

SLIDE 1

Models for Retrieval

Berlin Chen 2003

References:

  • 1. Berlin Chen et al., “An HMM/N-gram-based Linguistic Processing Approach for Mandarin Spoken Document Retrieval,” EUROSPEECH 2001

  • 2. M. W. Berry et al., “Using Linear Algebra for Intelligent Information Retrieval,” technical report, 1994
  • 3. Thomas Hofmann, “Unsupervised Learning by Probabilistic Latent Semantic Analysis,” Machine Learning, 2001
Outline:

  • 1. HMM/N-gram-based
  • 2. Latent Semantic Indexing (LSI)
  • 3. Probabilistic Latent Semantic Analysis (PLSA)

SLIDE 2

HMM/N-gram-based Model

  • Model the query as a sequence of input observations (index terms), $Q = q_1 q_2 \ldots q_N$
  • Model the doc $D$ as a discrete HMM composed of distributions of N-gram parameters
  • The relevance measure, $P(Q \mid D \text{ is } R)$, can be estimated by the N-gram probabilities of the index term sequence for the query, $Q = q_1 q_2 \ldots q_N$, predicted by the doc $D$

– A generative model for IR

$$D^* = \arg\max_{D} P(D \text{ is } R \mid Q) = \arg\max_{D} \frac{P(Q \mid D \text{ is } R)\, P(D \text{ is } R)}{P(Q)} \approx \arg\max_{D} P(Q \mid D \text{ is } R)$$

with the assumption that the prior $P(D \text{ is } R)$ is roughly the same for all docs

SLIDE 3

HMM/N-gram-based Model

  • N-gram approximation (Language Model)

– Unigram – Bigram – Trigram – ……..

Chain rule:

$$P(W) = P(w_1 w_2 \ldots w_N) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_1 w_2 \ldots w_{N-1})$$

Unigram:

$$P(W) \approx P(w_1)\, P(w_2)\, P(w_3) \cdots P(w_N)$$

Bigram:

$$P(W) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})$$

Trigram:

$$P(W) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_{N-2} w_{N-1})$$

SLIDE 4

HMM/N-gram-based Model

  • A discrete HMM composed of distributions of N-gram parameters

The doc HMM mixes four N-gram distributions, with mixture weights satisfying $m_1 + m_2 + m_3 + m_4 = 1$:

– $P(q_n \mid D)$ (doc unigram), weight $m_1$
– $P(q_n \mid \text{Corpus})$ (corpus unigram), weight $m_2$
– $P(q_n \mid q_{n-1}, D)$ (doc bigram), weight $m_3$
– $P(q_n \mid q_{n-1}, \text{Corpus})$ (corpus bigram), weight $m_4$

For a query $Q = q_1 q_2 \ldots q_N$:

$$P(Q \mid D \text{ is } R) = \big[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\big] \cdot \prod_{n=2}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D) + m_4 P(q_n \mid q_{n-1}, \text{Corpus})\big]$$

SLIDE 5

HMM/N-gram-based Model

  • Three Types of HMM Structures

– Type I: Unigram-Based (Uni)
– Type II: Unigram/Bigram-Based (Uni+Bi)
– Type III: Unigram/Bigram/Corpus-Based (Uni+Bi*)

Type I (Uni):

$$P(Q \mid D \text{ is } R) = \prod_{n=1}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus})\big]$$

Type II (Uni+Bi):

$$P(Q \mid D \text{ is } R) = \big[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\big] \cdot \prod_{n=2}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D)\big]$$

Type III (Uni+Bi*):

$$P(Q \mid D \text{ is } R) = \big[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\big] \cdot \prod_{n=2}^{N} \big[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D) + m_4 P(q_n \mid q_{n-1}, \text{Corpus})\big]$$

Example (Type III), for the query 陳水扁 總統 視察 阿里山 小火車 (Chen Shui-bian / president / inspects / Alishan / train):

P(陳水扁 總統 視察 阿里山 小火車 | D)
  = [m1 P(陳水扁|D) + m2 P(陳水扁|C)]
  × [m1 P(總統|D) + m2 P(總統|C) + m3 P(總統|陳水扁,D) + m4 P(總統|陳水扁,C)]
  × [m1 P(視察|D) + m2 P(視察|C) + m3 P(視察|總統,D) + m4 P(視察|總統,C)]
  × ……….
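
A minimal sketch of the Type III (Uni+Bi*) scoring formula above, assuming the doc and corpus unigram/bigram probabilities are available as plain Python dicts (all names here are illustrative, not from the original system):

```python
def hmm_ngram_score(query, doc_lm, corpus_lm, m):
    """Score P(Q | D is R) under the Type III (Uni+Bi*) mixture.

    query     : list of index terms [q1, ..., qN]
    doc_lm    : {"uni": {q: P(q|D)},      "bi": {(q_prev, q): P(q|q_prev, D)}}
    corpus_lm : {"uni": {q: P(q|Corpus)}, "bi": {(q_prev, q): P(q|q_prev, Corpus)}}
    m         : mixture weights (m1, m2, m3, m4), summing to 1
    """
    m1, m2, m3, m4 = m
    # First query term: only the unigram distributions apply.
    score = m1 * doc_lm["uni"].get(query[0], 0.0) + m2 * corpus_lm["uni"].get(query[0], 0.0)
    # Remaining terms: interpolate doc/corpus unigrams and bigrams.
    for prev, q in zip(query, query[1:]):
        score *= (m1 * doc_lm["uni"].get(q, 0.0)
                  + m2 * corpus_lm["uni"].get(q, 0.0)
                  + m3 * doc_lm["bi"].get((prev, q), 0.0)
                  + m4 * corpus_lm["bi"].get((prev, q), 0.0))
    return score
```

Documents would then be ranked by this score for a given query; in practice the product is usually accumulated in log space to avoid underflow.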

SLIDE 6

HMM/N-gram-based Model

  • The role of the corpus N-gram probabilities

– Model the general distribution of the index terms

  • Help to solve the zero-frequency problem
  • Help to differentiate the contributions of different missing terms in a doc

– The corpus N-gram probabilities were estimated using an outside corpus

Example: for a doc $D$ = “qa qa qa qb qa qb qb qc qc qd”, the MLE doc unigrams are $P(q_a \mid D) = 0.4$, $P(q_b \mid D) = 0.3$, $P(q_c \mid D) = 0.2$, $P(q_d \mid D) = 0.1$, while $P(q_e \mid D) = P(q_f \mid D) = 0$; the corpus models $P(q_n \mid \text{Corpus})$ and $P(q_n \mid q_{n-1}, \text{Corpus})$ supply nonzero probabilities for the missing terms $q_e$ and $q_f$.

SLIDE 7

HMM/N-gram-based Model

  • Estimation of N-grams (Language Models)

– Maximum likelihood estimation (MLE) for doc N-grams

  • Unigram
  • Bigram

– Similar formulas for corpus N-grams

Doc unigram:

$$P(q_i \mid D) = \frac{C_D(q_i)}{\sum_{q_j} C_D(q_j)} = \frac{C_D(q_i)}{|D|}$$

where $C_D(q_i)$ is the count of term $q_i$ in the doc $D$ and $|D|$ is the length of the doc $D$.

Doc bigram:

$$P(q_i \mid q_j, D) = \frac{C_D(q_j, q_i)}{C_D(q_j)}$$

where $C_D(q_j, q_i)$ is the count of the term pair $(q_j, q_i)$ in the doc $D$.

Corpus N-grams (Corpus: an outside corpus or just the doc collection):

$$P(q_i \mid \text{Corpus}) = \frac{C_{\text{Corpus}}(q_i)}{|\text{Corpus}|}, \qquad P(q_i \mid q_j, \text{Corpus}) = \frac{C_{\text{Corpus}}(q_j, q_i)}{C_{\text{Corpus}}(q_j)}$$
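
A minimal sketch of these MLE estimates, assuming a doc is simply a list of index terms (names are illustrative):

```python
from collections import Counter

def estimate_doc_lm(doc_terms):
    """MLE unigram/bigram language model for one doc (a list of index terms)."""
    uni_counts = Counter(doc_terms)
    bi_counts = Counter(zip(doc_terms, doc_terms[1:]))
    total = len(doc_terms)
    uni = {q: c / total for q, c in uni_counts.items()}                  # P(q|D) = C_D(q) / |D|
    bi = {(p, q): c / uni_counts[p] for (p, q), c in bi_counts.items()}  # P(q|p,D) = C_D(p,q) / C_D(p)
    return {"uni": uni, "bi": bi}

# The corpus model is estimated the same way from a large outside corpus
# (or from the whole doc collection), and both feed the scoring sketch above.
```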

SLIDE 8

HMM/N-gram-based Model

  • Basically, m1, m2, m3, m4 can be estimated by using the Expectation-Maximization (EM) algorithm
– All docs share the same weights here
– The N-gram probability distributions also can be estimated using the EM algorithm instead of maximum likelihood estimation

  • For those docs with training queries, m1, m2, m3, m4 can be estimated by using the Minimum Classification Error (MCE) training algorithm
– The docs can have different weights
(because of the insufficiency of training data)

SLIDE 9

HMM/N-gram-based Model

  • Expectation-Maximization Training

– The weights are tied among the documents
– E.g. m1 of the Type I HMM can be trained using the following equation:

  • Where $\text{TrainSet}_Q$ is the set of training query exemplars, $\text{Doc}_{R,Q}$ is the set of docs that are relevant to a specific training query exemplar $Q$, $|Q|$ is the length of the query $Q$, and $|\text{Doc}_{R,Q}|$ is the total number of docs relevant to the query $Q$

$$\hat{m}_1^{\,new} = \frac{1}{\sum_{Q \in \text{TrainSet}_Q} |\text{Doc}_{R,Q}| \cdot |Q|} \; \sum_{Q \in \text{TrainSet}_Q} \; \sum_{D \in \text{Doc}_{R,Q}} \; \sum_{q_n \in Q} \frac{\hat{m}_1\, P(q_n \mid D)}{\hat{m}_1\, P(q_n \mid D) + \hat{m}_2\, P(q_n \mid \text{Corpus})}$$

($\hat{m}_1$, $\hat{m}_2$ on the right-hand side are the old weights; $\hat{m}_1^{\,new}$ is the new weight. Training set: 819 queries, ≦2265 docs)
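
A minimal sketch of this tied-weight EM update for the Type I mixture, assuming training queries with known relevant docs and the dict-based language models from the earlier sketches (all names illustrative):

```python
def em_update_m1(train_set, doc_lms, corpus_lm, m1, m2):
    """One EM re-estimation of the tied weight m1 for the Type I (Uni) HMM.

    train_set : list of (query_terms, relevant_doc_ids) pairs
    doc_lms   : {doc_id: {"uni": {q: P(q|D)}}}
    corpus_lm : {"uni": {q: P(q|Corpus)}}
    """
    num, denom = 0.0, 0.0
    for query, rel_docs in train_set:
        for doc_id in rel_docs:
            for q in query:
                p_doc = m1 * doc_lms[doc_id]["uni"].get(q, 0.0)
                p_bg = m2 * corpus_lm["uni"].get(q, 0.0)
                if p_doc + p_bg > 0:
                    num += p_doc / (p_doc + p_bg)   # posterior of the doc-unigram state
                denom += 1.0                        # one observation per (Q, D, q_n)
    m1_new = num / denom
    return m1_new, 1.0 - m1_new                     # m2 = 1 - m1 for the Type I HMM
```

Iterating this update until convergence yields the tied weights shared by all docs.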

SLIDE 10

HMM/N-gram-based Model

  • Expectation-Maximization Training

– Empirical derivation: the re-estimated (new) model $\hat{D}$ does not decrease the query likelihood of the old model $D$

$$\log P(Q \mid \hat{D}) - \log P(Q \mid D) = \sum_{q_n \in Q} \log \frac{P(q_n \mid \hat{D})}{P(q_n \mid D)} \;\ge\; \sum_{q_n \in Q} \sum_{k} P(k \mid q_n, D)\, \log \frac{P(q_n, k \mid \hat{D})}{P(q_n, k \mid D)}$$

by Jensen’s inequality (equivalently, $\log x \le x - 1$), where $P(k \mid q_n, D)$ is the posterior of mixture component $k$ under the old model. Hence, if the new model improves the auxiliary function, $\Phi(D, \hat{D}) \ge \Phi(D, D)$, the right-hand side is nonnegative and therefore

$$\log P(Q \mid \hat{D}) \ge \log P(Q \mid D)$$

SLIDE 11

HMM/N-gram-based Model

  • Expectation-Maximization Training

– Q function (auxiliary function), maximized with respect to the new weights $\hat{m}_k$ while the component distributions $P(q_n \mid k, D)$ stay fixed:

$$\Phi(D, \hat{D}) = \sum_{q_n \in Q} \sum_{k} P(k \mid q_n, D)\, \log\big[\hat{m}_k\, P(q_n \mid k, D)\big], \qquad P(k \mid q_n, D) = \frac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}$$

– Add the normalization constraint $\sum_k \hat{m}_k = 1$ with a Lagrange multiplier $l$:

$$\Phi'(D, \hat{D}) = \Phi(D, \hat{D}) + l\Big(1 - \sum_k \hat{m}_k\Big), \qquad \frac{\partial \Phi'(D, \hat{D})}{\partial \hat{m}_k} = 0 \;\Rightarrow\; \hat{m}_k = \frac{G_k}{l}, \quad G_k = \sum_{q_n \in Q} \frac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}$$

– Summing over $k$ gives $l = \sum_s G_s = |Q|$, so

$$\hat{m}_k = \frac{G_k}{|Q|} = \frac{1}{|Q|} \sum_{q_n \in Q} \frac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}$$

($P(k \mid q_n, D)$ plays the role of the empirical (posterior) distribution under the old model; the normalization constraint is handled with the Lagrange multiplier $l$)

SLIDE 12

HMM/N-gram-based Model

  • Experimental results with EM training

– HMM/N-gram-based approach
– Vector space model
– The HMM/N-gram-based approach is consistently better than the vector space model

Average Precision (HMM/N-gram-based approach):

               Word-level                      Syllable-level
               Uni      Uni+Bi   Uni+Bi*       Uni      Uni+Bi   Uni+Bi*
TDT2  TQ/TD    0.6327   0.6069   0.5427        0.4698   0.5220   0.5718
TDT2  TQ/SD    0.5658   0.5702   0.4803        0.4411   0.5011   0.5307
TDT3  TQ/TD    0.6569   0.6542   0.6141        0.5343   0.5970   0.6560
TDT3  TQ/SD    0.6308   0.6361   0.5808        0.5177   0.5678   0.6433

Average Precision (vector space model):

               Word-level                      Syllable-level
               S(N), N=1    S(N), N=1~2        S(N), N=1    S(N), N=1~2
TDT2  TQ/TD    0.5548       0.5623             0.3412       0.5254
TDT2  TQ/SD    0.5122       0.5225             0.3306       0.5077
TDT3  TQ/TD    0.6505       0.6531             0.3963       0.6502
TDT3  TQ/SD    0.6216       0.6233             0.3708       0.6353

SLIDE 13

Review: The EM Algorithm

  • Introduction of EM (Expectation Maximization):

– Why EM?

  • Simple optimization algorithms for the likelihood function rely on the intermediate variables, called latent (hidden) data

In our case here, the state sequence is the latent data

  • Direct access to the data necessary to estimate the parameters is impossible or difficult

– Two Major Steps:

  • E: expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations
  • M: provides a new estimation of the parameters according to ML (or MAP)

SLIDE 14

Review: The EM Algorithm

  • The EM Algorithm is important to HMMs and other learning techniques
– Discover new model parameters that maximize the log-likelihood of the incomplete data, $\log P(\mathbf{o} \mid \lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(\mathbf{o}, \mathbf{s} \mid \lambda)$

  • Example
– The observable training data $\mathbf{o}$
  • We want to maximize $P(\mathbf{o} \mid \lambda)$; $\lambda$ is a parameter vector
– The hidden (unobservable) data $\mathbf{s}$
  • E.g. the component densities of the observable data, or the underlying state sequence in HMMs

The auxiliary function:

$$\Theta(\lambda, \hat{\lambda}) = E\big[\log P(\mathbf{o}, \mathbf{s} \mid \lambda) \,\big|\, \mathbf{o}, \hat{\lambda}\big]$$

SLIDE 15

HMM/N-gram-based Model

  • Minimum Classification Error (MCE) Training

– Given a query $Q$ and a desired relevant doc $D^*$, define the classification error function as:

$$E(Q, D^*) = \frac{1}{|Q|}\Big[-\log P(Q \mid D^* \text{ is } R) + \max_{D' \ne D^*} \log P(Q \mid D' \text{ is } R)\Big]$$

  • $E(Q, D^*) > 0$ means misclassified; $E(Q, D^*) \le 0$ means a correct decision

– Transform the error function to the loss function:

$$L(Q, D^*) = \frac{1}{1 + \exp\big(-\alpha\, E(Q, D^*) + \beta\big)}$$

  • In the range between 0 and 1

SLIDE 16

HMM/N-gram-based Model

  • Minimum Classification Error (MCE) Training

– Apply the loss function to the MCE procedure for iteratively updating the weighting parameters

  • Constraints: $m_k \ge 0$, $\sum_k m_k = 1$
  • Parameter transforms (e.g., Type I HMM):

$$m_1 = \frac{e^{\tilde{m}_1}}{e^{\tilde{m}_1} + e^{\tilde{m}_2}}, \qquad m_2 = \frac{e^{\tilde{m}_2}}{e^{\tilde{m}_1} + e^{\tilde{m}_2}}$$

– Iteratively update $\tilde{m}_1$ (e.g., Type I HMM):

$$\tilde{m}_1(i+1) = \tilde{m}_1(i) - \varepsilon \left.\frac{\partial L(Q, D^*)}{\partial \tilde{m}_1}\right|_{\tilde{m}_1 = \tilde{m}_1(i)}$$

  • Where

$$\nabla_{D^*, \tilde{m}_1}(i) = \varepsilon\, \frac{\partial L(Q, D^*)}{\partial \tilde{m}_1} = \varepsilon\, \frac{\partial L(Q, D^*)}{\partial E(Q, D^*)} \cdot \frac{\partial E(Q, D^*)}{\partial \tilde{m}_1}, \qquad \frac{\partial L(Q, D^*)}{\partial E(Q, D^*)} = \alpha\, L(Q, D^*)\, \big[1 - L(Q, D^*)\big]$$

SLIDE 17

HMM/N-gram-based Model

  • Minimum Classification Error (MCE) Training

– Iteratively update $\tilde{m}_1$ (e.g., Type I HMM)

Only the desired-doc term of $E(Q, D^*)$ depends on $D^*$’s own weights, so with the parameter transform above,

$$\frac{\partial E(Q, D^*)}{\partial \tilde{m}_1} = -\frac{1}{|Q|}\, \frac{\partial}{\partial \tilde{m}_1} \sum_{q_n \in Q} \log\big[m_1 P(q_n \mid D^*) + m_2 P(q_n \mid \text{Corpus})\big] = -\frac{m_1}{|Q|} \sum_{q_n \in Q} \left[\frac{P(q_n \mid D^*)}{m_1 P(q_n \mid D^*) + m_2 P(q_n \mid \text{Corpus})} - 1\right]$$

SLIDE 18

HMM/N-gram-based Model

  • Minimum Classification Error (MCE) Training

– Iteratively update $\tilde{m}_1$ (e.g., Type I HMM)

$$\nabla_{D^*, \tilde{m}_1}(i) = -\varepsilon\, \alpha\, L(Q, D^*)\big[1 - L(Q, D^*)\big]\, \frac{m_1(i)}{|Q|} \sum_{q_n \in Q} \left[\frac{P(q_n \mid D^*)}{m_1(i)\, P(q_n \mid D^*) + m_2(i)\, P(q_n \mid \text{Corpus})} - 1\right]$$

$$\tilde{m}_1(i+1) = \tilde{m}_1(i) - \nabla_{D^*, \tilde{m}_1}(i) \qquad \text{(the new weight from the old weight)}$$

Mapping back through the exponential transform:

$$m_1(i+1) = \frac{e^{\tilde{m}_1(i+1)}}{e^{\tilde{m}_1(i+1)} + e^{\tilde{m}_2(i+1)}} = \frac{e^{\tilde{m}_1(i) - \nabla_{D^*, \tilde{m}_1}(i)}}{e^{\tilde{m}_1(i) - \nabla_{D^*, \tilde{m}_1}(i)} + e^{\tilde{m}_2(i) - \nabla_{D^*, \tilde{m}_2}(i)}} = \frac{m_1(i)\, e^{-\nabla_{D^*, \tilde{m}_1}(i)}}{m_1(i)\, e^{-\nabla_{D^*, \tilde{m}_1}(i)} + m_2(i)\, e^{-\nabla_{D^*, \tilde{m}_2}(i)}}$$

SLIDE 19

HMM/N-gram-based Model

  • Minimum Classification Error (MCE) Training

– Final Equations

  • Iteratively update $m_1$:

$$\nabla_{D^*, \tilde{m}_1}(i) = -\varepsilon\, \alpha\, L(Q, D^*)\big[1 - L(Q, D^*)\big]\, \frac{m_1(i)}{|Q|} \sum_{q_n \in Q} \left[\frac{P(q_n \mid D^*)}{m_1(i)\, P(q_n \mid D^*) + m_2(i)\, P(q_n \mid \text{Corpus})} - 1\right]$$

$$m_1(i+1) = \frac{m_1(i)\, e^{-\nabla_{D^*, \tilde{m}_1}(i)}}{m_1(i)\, e^{-\nabla_{D^*, \tilde{m}_1}(i)} + m_2(i)\, e^{-\nabla_{D^*, \tilde{m}_2}(i)}}$$

  • $m_2$ can be updated in the similar way
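
A minimal sketch of one MCE step on (m1, m2) for a single (query, desired doc) pair, combining the error, loss, and update formulas above and reusing the dict-based language models from the earlier sketches (names and the alpha/beta/eps values are illustrative):

```python
import math

def mce_update_type1(query, doc_lm, corpus_lm, m1, m2,
                     logp_target, logp_best_competitor,
                     alpha=1.0, beta=0.0, eps=0.1):
    """One MCE update of the desired doc's weights under the Type I (Uni) HMM."""
    # Classification error and sigmoid loss.
    err = (-logp_target + logp_best_competitor) / len(query)
    loss = 1.0 / (1.0 + math.exp(-alpha * err + beta))
    # dE/d(m1_tilde): only the desired doc's own score depends on its weights.
    grad_e = 0.0
    for q in query:
        denom = m1 * doc_lm["uni"].get(q, 0.0) + m2 * corpus_lm["uni"].get(q, 0.0)
        if denom > 0:
            grad_e += doc_lm["uni"].get(q, 0.0) / denom - 1.0
    grad_e *= -m1 / len(query)
    # Gradient in the transformed space; with two weights, the m2_tilde gradient is the negative.
    g1 = eps * alpha * loss * (1.0 - loss) * grad_e
    g2 = -g1
    # Map back to the probability simplex through the exponential transform.
    a, b = m1 * math.exp(-g1), m2 * math.exp(-g2)
    return a / (a + b), b / (a + b)
```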

SLIDE 20

HMM/N-gram-based Model

  • Experimental results with MCE training

– The results for the syllable-level index features were significantly improved

Average Precision after MCE training (values in parentheses: before MCE training), TDT2 collection:

            Word-level: Uni     Syllable-level: Uni+Bi*    Fusion
TQ/TD       0.6459 (0.6327)     0.6858 (0.5718)            0.7329
TQ/SD       0.5810 (0.5658)     0.6300 (0.5307)            0.6914

Iterations = 100

[Figure: Average Precision vs. MCE iterations (word-based and syllable-based), for TQ/TD and TQ/SD, compared against the values before MCE training]

SLIDE 21

HMM/N-gram-based Model

  • Advantages

– A formal mathematical framework
– Use collection statistics rather than heuristics
– The retrieval system can be gradually improved through usage

  • Disadvantages

– Only literal term matching (or word-overlap measure)
  • The issue of relevance or aboutness is not taken into consideration
– The implementation of relevance feedback or query expansion is not straightforward

SLIDE 22

Latent Semantic Indexing (LSI)

  • LSI: a technique that projects queries and docs into a space with “latent” semantic dimensions
– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation

  • Closely related to Principal Component Analysis (PCA)

SLIDE 23

Latent Semantic Indexing (LSI)

  • Dimension Reduction and Feature Extraction

– PCA – SVD (in LSI)

– PCA: project onto the leading orthonormal basis vectors $\varphi_1, \ldots, \varphi_k$ of the feature space

$$y_i = \varphi_i^T X, \qquad \hat{X} = \sum_{i=1}^{k} y_i\, \varphi_i, \qquad \min \|X - \hat{X}\|^2 \text{ for a given } k$$

– SVD (in LSI): decompose the $m \times n$ word-document matrix and keep only the $k$ largest singular values

$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \quad r \le \min(m, n); \qquad A'_{m \times n} = U_{m \times k}\, \Sigma_{k \times k}\, V^T_{k \times n}, \qquad \min \|A - A'\|_F^2 \text{ for a given } k$$

The $k$ retained dimensions span the latent semantic space (as opposed to the original feature space).

SLIDE 24

Latent Semantic Indexing (LSI)

– Singular Value Decomposition (SVD) used for the word-document matrix

  • A least-squares method for dimension reduction

SLIDE 25

Latent Semantic Indexing (LSI)

  • Frameworks to circumvent vocabulary mismatch

[Diagram: a doc and a query, each represented by its terms; literal term matching links the two term sets directly, doc expansion and query expansion enrich either side via a structure model, and latent semantic structure retrieval matches doc and query through the latent structure]

SLIDE 26

Latent Semantic Indexing (LSI)

SLIDE 27

Latent Semantic Indexing (LSI)

  • Singular Value Decomposition (SVD)

$$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^T_{r \times n}, \qquad r \le \min(m, n)$$

$$A'_{m \times n} = U_{m \times k}\, \Sigma_{k \times k}\, V^T_{k \times n}, \qquad k \le r, \qquad \|A\|_F^2 \ge \|A'\|_F^2$$

(rows of $A$: words $w_1, \ldots, w_m$; columns of $A$: docs $d_1, \ldots, d_n$)

Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$. Both $U$ and $V$ have orthonormal column vectors: $U^T U = I_{k \times k}$, $V^T V = I_{k \times k}$.
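
A minimal sketch of this rank-k approximation with NumPy, assuming A is a small word-document count (or tf-idf) matrix (the matrix and k are illustrative):

```python
import numpy as np

def truncated_svd(A, k):
    """Return U_k, S_k, Vt_k and the rank-k approximation A' = U_k S_k Vt_k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U diag(s) Vt
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
    return U_k, S_k, Vt_k, U_k @ S_k @ Vt_k

# Example: a tiny 5-word x 4-doc matrix reduced to k = 2 latent dimensions.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.],
              [1., 0., 0., 1.]])
U_k, S_k, Vt_k, A_prime = truncated_svd(A, k=2)
print(np.linalg.norm(A, "fro") >= np.linalg.norm(A_prime, "fro"))   # True, as stated above
```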

SLIDE 28

Latent Semantic Indexing (LSI)

  • Singular Value Decomposition (SVD)

– $A^T A$ is a symmetric $n \times n$ matrix

  • All eigenvalues $\lambda_j$ are nonnegative real numbers
  • All eigenvectors $v_j$ are orthonormal
  • Define the singular values:
– As the square roots of the eigenvalues of $A^T A$
– As the lengths of the vectors $A v_1, A v_2, \ldots, A v_n$

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0, \qquad \sigma_j = \sqrt{\lambda_j}, \; j = 1, \ldots, n$$

$$V = [\, v_1 \; v_2 \; \cdots \; v_n \,], \qquad v_j^T v_j = 1, \qquad V^T V = I_{n \times n}$$

$$\sigma_1 = \|A v_1\|, \quad \sigma_2 = \|A v_2\|, \quad \ldots$$

For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{A v_1, A v_2, \ldots, A v_r\}$ is an orthogonal basis of Col $A$.

SLIDE 29

Latent Semantic Indexing (LSI)

  • {Av1, Av2, …, Avr} is an orthogonal basis of Col A
– Suppose that A (or A^T A) has rank r ≤ n
– Define an orthonormal basis {u1, u2, …, ur} for Col A
  • Extend it to an orthonormal basis {u1, u2, …, um} of Rm

$$(A v_i)^T (A v_j) = v_i^T A^T A v_j = \lambda_j\, v_i^T v_j = 0 \quad (i \ne j)$$

$$\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > \lambda_{r+1} = \cdots = \lambda_n = 0$$

$$u_i = \frac{A v_i}{\|A v_i\|} = \frac{1}{\sigma_i} A v_i \;\Rightarrow\; A v_i = \sigma_i u_i \;\Rightarrow\; A\, [\, v_1 \; v_2 \cdots v_r \,] = [\, u_1 \; u_2 \cdots u_r \,]\, \Sigma$$

$$A V = U \Sigma \;\Rightarrow\; A = U \Sigma V^T$$

$$\|A\|_F^2 = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^2 = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_r^2 \;\; (?)$$

SLIDE 30

Latent Semantic Indexing (LSI)

SLIDE 31

Latent Semantic Indexing (LSI)

  • Fundamental comparisons based on SVD

– The original word-document matrix A (m×n; rows: words w1…wm, columns: docs d1…dn)
  • compare two terms → dot product of two rows of A (or an entry in A A^T)
  • compare two docs → dot product of two columns of A (or an entry in A^T A)
  • compare a term and a doc → each individual entry of A

– The new word-document matrix A’, with U’ = Umxk, Σ’ = Σk, V’ = Vnxk
  • compare two terms → dot product of two rows of U’Σ’
  • compare two docs → dot product of two rows of V’Σ’
  • compare a query and a doc → each individual entry of A’

$$A'A'^T = (U'\Sigma'V'^T)(U'\Sigma'V'^T)^T = U'\Sigma'V'^T V'\Sigma'^T U'^T = (U'\Sigma')(U'\Sigma')^T$$

$$A'^T A' = (U'\Sigma'V'^T)^T (U'\Sigma'V'^T) = V'\Sigma'^T U'^T U'\Sigma'V'^T = (V'\Sigma')(V'\Sigma')^T$$

(U’ and V’ have orthonormal columns, so U’ᵀU’ = V’ᵀV’ = I; the diagonal Σ’ only stretches or shrinks each latent dimension)
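
A minimal sketch of these comparisons in the reduced space, continuing from the truncated_svd example earlier (illustrative):

```python
# Rows of U_k @ S_k represent terms; rows of Vt_k.T @ S_k represent docs.
term_vecs = U_k @ S_k        # rows of U'Σ'
doc_vecs = Vt_k.T @ S_k      # rows of V'Σ'

print(term_vecs[0] @ term_vecs[1])   # compare two terms: dot product of two rows of U'Σ'
print(doc_vecs[0] @ doc_vecs[1])     # compare two docs: dot product of two rows of V'Σ'
print(A_prime[0, 1])                 # compare a term and a doc: an individual entry of A'
```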

SLIDE 32

Latent Semantic Indexing (LSI)

  • Fold-in: find representations for pseudo-docs q

– For objects (new queries or docs) that did not appear in the original analysis

  • Fold-in a new mx1 query (or doc) vector

– Cosine measure between the query and doc vectors in the latent semantic space

$$\hat{q}_{1 \times k} = q^T_{1 \times m}\; U_{m \times k}\; \Sigma_{k \times k}^{-1}$$

(the query is represented by the weighted sum of its constituent term vectors, with the separate dimensions differentially weighted by $\Sigma^{-1}$; $\hat{q}$ is just like a row of $V$)

$$sim(\hat{q}, \hat{d}) = \text{cosine}(\hat{q}\Sigma, \hat{d}\Sigma) = \frac{\hat{q}\, \Sigma^2\, \hat{d}^T}{\|\hat{q}\Sigma\|\; \|\hat{d}\Sigma\|} \qquad (\hat{q}, \hat{d} \text{ are row vectors})$$
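
A minimal sketch of folding in a new query vector and scoring it against the docs, again continuing from the truncated_svd example (the query vector is illustrative):

```python
def cosine(x, y):
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def fold_in_query(q, U_k, S_k):
    """Map an m-dimensional term-count query vector into the k-dim latent space."""
    return q @ U_k @ np.linalg.inv(S_k)          # q_hat = q^T U_k inv(Sigma_k)

q = np.array([1., 0., 0., 1., 0.])               # toy query using words w1 and w4
q_hat = fold_in_query(q, U_k, S_k)
for j, d_hat in enumerate(Vt_k.T):               # each row of V is a doc in the latent space
    print(f"d{j+1}", cosine(q_hat @ S_k, d_hat @ S_k))   # cosine of the Sigma-weighted vectors
```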

SLIDE 33

Latent Semantic Indexing (LSI)

  • Fold-in a new 1xn term vector

$$\hat{t}_{1 \times k} = t_{1 \times n}\; V_{n \times k}\; \Sigma_{k \times k}^{-1}$$

SLIDE 34

Latent Semantic Indexing (LSI)

  • Experimental results

– HMM is consistently better than VSM at all recall levels
– LSI is better than VSM at higher recall levels

[Figure] Recall-Precision curve at 11 standard recall levels evaluated on the TDT-3 SD collection (using word-level indexing terms)

SLIDE 35

Latent Semantic Indexing (LSI)

  • Advantages

– A clean formal framework and a clearly defined optimization criterion (least-squares)
  • Conceptual simplicity and clarity
– Handle synonymy problems (“heterogeneous vocabulary”)
– Good results for high-recall search
  • Take term co-occurrence into account

  • Disadvantages

– High computational complexity
– LSI offers only a partial solution to polysemy
  • E.g. bank, bass, …

SLIDE 36

Probabilistic Latent Semantic Analysis (PLSA)

  • Also called the Aspect Model, or Probabilistic Latent Semantic Indexing (PLSI)

– Can be viewed as a complex HMM Model

Thomas Hofmann 1999

– The query is a sequence of words, $Q = w_1 w_2 \ldots w_J$; each word is generated by first picking a latent topic $T_k$ for the doc $D_i$ (with probability $P(T_k \mid D_i)$, $k = 1, \ldots, K$) and then generating the word from that topic (with probability $P(w_j \mid T_k)$)

$$sim(Q, D_i) = P(D_i \mid Q) = \frac{P(Q, D_i)}{P(Q)} \approx P(Q, D_i) = P(Q \mid D_i)\, P(D_i) \approx P(Q \mid D_i) \;\Rightarrow\; sim(Q, D_i) \approx P(Q \mid D_i)$$

$$sim(Q, D_i) \approx P(Q \mid D_i) = \prod_{w_j \in Q} P(w_j \mid D_i) = \prod_{w_j \in Q} \sum_{k=1}^{K} P(w_j, T_k \mid D_i) = \prod_{w_j \in Q} \left[\sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid D_i)\right]$$

The latent variables => the unobservable class variables $T_k$ (topics or domains)

SLIDE 37

Probabilistic Latent Semantic Analysis (PLSA)

  • Definition

– $P(D_i)$: the prob. of selecting a doc $D_i$
– $P(T_k \mid D_i)$: the prob. of picking a latent class $T_k$ for the doc $D_i$
– $P(w_j \mid T_k)$: the prob. of generating a word $w_j$ from the class $T_k$

SLIDE 38

Probabilistic Latent Semantic Analysis (PLSA)

  • Assumptions

– Bag-of-words: treat docs as a memoryless source; words are generated independently
– Conditional independence: the doc and the word are independent conditioned on the state of the associated latent variable

$$P(w_j, D_i \mid T_k) \approx P(w_j \mid T_k)\, P(D_i \mid T_k)$$

$$P(w_j, D_i) = \sum_{k=1}^{K} P(w_j, T_k, D_i) = \sum_{k=1}^{K} P(w_j, D_i \mid T_k)\, P(T_k) = \sum_{k=1}^{K} P(w_j \mid T_k)\, P(D_i \mid T_k)\, P(T_k) = \sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid D_i)\, P(D_i) = P(D_i) \sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid D_i)$$
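
A minimal sketch of the resulting word probability $P(w_j \mid D_i) = \sum_k P(w_j \mid T_k)\, P(T_k \mid D_i)$ with the parameters stored as NumPy arrays (shapes and values are illustrative):

```python
import numpy as np

# P(w|T): |V| x K matrix;  P(T|D): K x |docs| matrix  (columns sum to 1 in both).
p_w_given_t = np.array([[0.5, 0.1],
                        [0.3, 0.1],
                        [0.1, 0.4],
                        [0.1, 0.4]])          # 4 words, K = 2 topics
p_t_given_d = np.array([[0.8, 0.2, 0.5],
                        [0.2, 0.8, 0.5]])     # 2 topics, 3 docs

p_w_given_d = p_w_given_t @ p_t_given_d       # P(w_j | D_i) = sum_k P(w_j|T_k) P(T_k|D_i)

def plsa_query_likelihood(query_word_ids, doc_index):
    """P(Q | D_i) under the bag-of-words assumption: a product over the query words."""
    return float(np.prod(p_w_given_d[query_word_ids, doc_index]))

print(plsa_query_likelihood([0, 2], doc_index=0))
```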

SLIDE 39

Probabilistic Latent Semantic Analysis (PLSA)

  • Probability estimation using the EM (expectation-maximization) algorithm

– E (expectation) step

$$E[L_C] = \sum_{D_i} \sum_{w_j} n(w_j, D_i)\, E\big[\log P(w_j, T_k, D_i)\big]$$

where $L_C$ is the complete-data log-likelihood and $n(w_j, D_i)$ is the count of word $w_j$ in doc $D_i$. Taking the expectation over the latent topics with the posterior computed from the current (hatted) parameters:

$$E[L_C] = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \hat{P}(T_k \mid w_j, D_i)\, \log P(w_j, T_k \mid D_i)$$

$$\hat{P}(T_k \mid w_j, D_i) = \frac{\hat{P}(w_j \mid T_k)\, \hat{P}(T_k \mid D_i)}{\hat{P}(w_j \mid D_i)} = \frac{\hat{P}(w_j \mid T_k)\, \hat{P}(T_k \mid D_i)}{\sum_{T_k} \hat{P}(w_j \mid T_k)\, \hat{P}(T_k \mid D_i)}$$

$$E[L_C] = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \hat{P}(T_k \mid w_j, D_i)\, \log\big[P(w_j \mid T_k)\, P(T_k \mid D_i)\big] = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \frac{\hat{P}(w_j \mid T_k)\, \hat{P}(T_k \mid D_i)}{\sum_{T_k} \hat{P}(w_j \mid T_k)\, \hat{P}(T_k \mid D_i)}\, \log\big[P(w_j \mid T_k)\, P(T_k \mid D_i)\big]$$

(maximizing this expectation of the complete data likelihood is equivalent to minimizing the cross entropy / Kullback-Leibler divergence between the empirical distribution and the model)

SLIDE 40

Probabilistic Latent Semantic Analysis (PLSA)

  • Probability estimation using EM

– M (maximization) step

$$Q = E[L_C] + \sum_{T_k} \tau_k \Big(1 - \sum_{w_j} P(w_j \mid T_k)\Big) + \sum_{D_i} \rho_i \Big(1 - \sum_{T_k} P(T_k \mid D_i)\Big)$$

$$Q_{P(w_j \mid T_k)} = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \hat{P}(T_k \mid w_j, D_i)\, \log P(w_j \mid T_k) + \sum_{T_k} \tau_k \Big(1 - \sum_{w_j} P(w_j \mid T_k)\Big)$$

$$Q_{P(T_k \mid D_i)} = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \hat{P}(T_k \mid w_j, D_i)\, \log P(T_k \mid D_i) + \sum_{D_i} \rho_i \Big(1 - \sum_{T_k} P(T_k \mid D_i)\Big)$$

(normalization constraints enforced using the Lagrange multipliers $\tau_k$ and $\rho_i$)

SLIDE 41

Probabilistic Latent Semantic Analysis (PLSA)

  • Probability estimation using EM

– M (maximization) step

  • Take differentiation

The training formulas:

$$P(w_j \mid T_k) = \frac{\sum_{D_i} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)}{\sum_{w_j} \sum_{D_i} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)}$$

$$P(T_k \mid D_i) = \frac{\sum_{w_j} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)}{\sum_{T_k} \sum_{w_j} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)} = \frac{\sum_{w_j} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)}{\sum_{w_j} n(w_j, D_i)} = \frac{\sum_{w_j} n(w_j, D_i)\, \hat{P}(T_k \mid w_j, D_i)}{n(D_i)}$$
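
A minimal sketch of one full EM iteration for PLSA with NumPy, assuming n is a word-by-doc count matrix (shapes and names are illustrative):

```python
import numpy as np

def plsa_em_step(n, p_w_t, p_t_d):
    """One EM iteration of PLSA.

    n     : |V| x |D| count matrix, n[j, i] = n(w_j, D_i)
    p_w_t : |V| x K,  p_w_t[j, k] = P(w_j | T_k)
    p_t_d : K x |D|,  p_t_d[k, i] = P(T_k | D_i)
    """
    # E-step: posterior P(T_k | w_j, D_i) for every (word, doc) pair.
    joint = p_w_t[:, :, None] * p_t_d[None, :, :]                  # (V, K, D): P(w|T) P(T|D)
    post = joint / np.maximum(joint.sum(axis=1, keepdims=True), 1e-12)
    # M-step: re-estimate P(w|T) and P(T|D) from expected counts.
    exp_counts = n[:, None, :] * post                              # n(w_j, D_i) * P(T_k | w_j, D_i)
    p_w_t_new = exp_counts.sum(axis=2)                             # sum over docs  -> (V, K)
    p_w_t_new /= p_w_t_new.sum(axis=0, keepdims=True)
    p_t_d_new = exp_counts.sum(axis=0)                             # sum over words -> (K, D)
    p_t_d_new /= np.maximum(n.sum(axis=0, keepdims=True), 1e-12)   # divide by doc lengths n(D_i)
    return p_w_t_new, p_t_d_new
```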

SLIDE 42

Probabilistic Latent Semantic Analysis (PLSA)

  • Latent Probability Spaces

[Figure: example latent classes learned by PLSA, drawn from contexts such as image sequence analysis, medical imaging, contour/boundary detection, and phonetic segmentation]

In the symmetric parametrization,

$$P(w_j, D_i) = \sum_{T_k} P(w_j, T_k, D_i) = \sum_{T_k} P(w_j \mid T_k)\, P(D_i, T_k) = \sum_{T_k} P(w_j \mid T_k)\, P(T_k)\, P(D_i \mid T_k)$$

which has the same form as an SVD:

$$\hat{U} : \big(P(w_j \mid T_k)\big)_{j,k}, \qquad \hat{\Sigma} : \text{diag}\big(P(T_k)\big)_k, \qquad \hat{V} : \big(P(D_i \mid T_k)\big)_{i,k}, \qquad P(W, D) = \hat{U}\, \hat{\Sigma}\, \hat{V}^T$$

Dimensionality K = 128 (latent classes)

SLIDE 43

Probabilistic Latent Semantic Analysis (PLSA)

  • One more example on TDT1 dataset

[Figure: example PLSA topics on the TDT1 dataset, with top words suggesting themes such as aviation, space missions, family love, and Hollywood love]

SLIDE 44

Probabilistic Latent Semantic Analysis (PLSA)

  • Comparison with LSI

– Decomposition/Approximation

  • LSI: least-squares criterion measured on the L2 or Frobenius norm of the word-doc matrices
  • PLSA: maximization of the likelihood function, based on the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model

– Computational complexity

  • LSI: SVD decomposition
  • PLSA: EM training, which is time-consuming over the iterations

SLIDE 45

Probabilistic Latent Semantic Analysis (PLSA)

  • Experimental Results

PLSI-U*

– Two ways to smooth the empirical distribution with PLSI:
  • Combine the cosine score with that of the vector space model (so does LSI)
  • Combine the multinomials individually
– Both provide almost identical performance
– It’s not known if PLSA was used alone

$$P_{PLSI\text{-}U^*}(w_j \mid d_i) = \lambda\, P_{Empirical}(w_j \mid d_i) + (1 - \lambda)\, P_{PLSA}(w_j \mid d_i)$$

$$P_{Empirical}(w_j \mid d_i) = \frac{n(w_j, d_i)}{n(d_i)}$$
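
A minimal sketch of this interpolation, assuming a word-by-doc count matrix and a matrix of PLSA unigrams as in the earlier sketches (lambda is a tuning parameter; value illustrative):

```python
import numpy as np

def plsi_u_smooth(n, p_plsa, lam=0.5):
    """P_PLSI-U*(w|d) = lam * P_Empirical(w|d) + (1 - lam) * P_PLSA(w|d).

    n      : |V| x |D| word-doc count matrix
    p_plsa : |V| x |D| matrix of PLSA unigrams P(w_j | d_i)
    """
    p_empirical = n / np.maximum(n.sum(axis=0, keepdims=True), 1e-12)   # n(w_j, d_i) / n(d_i)
    return lam * p_empirical + (1.0 - lam) * p_plsa
```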

SLIDE 46

Probabilistic Latent Semantic Analysis (PLSA)

  • Experimental Results

PLSI-Q*

– Use the low-dimensional representations $P(T_k \mid Q)$ and $P(T_k \mid D_i)$ (i.e., query and doc viewed in a k-dimensional latent space) to evaluate relevance by means of the cosine measure
– Combine the cosine score with that of the vector space model
– Use an ad hoc approach to reweight the different model components (dimensions) by

$$RW(T_k) = \sum_{w_j} P(w_j \mid T_k)\; idf(w_j)$$

$$sim(Q, D_i) = \frac{\sum_{w_j \in Q} n(w_j, Q) \sum_{T_k} RW(T_k)^2\, P(T_k \mid w_j)\, P(T_k \mid D_i)}{\sqrt{\sum_{w_j \in Q} n(w_j, Q) \sum_{T_k} \big[RW(T_k)\, P(T_k \mid w_j)\big]^2}\; \sqrt{\sum_{T_k} \big[RW(T_k)\, P(T_k \mid D_i)\big]^2}}$$

SLIDE 47

Probabilistic Latent Semantic Analysis (PLSA)

  • Experimental Results