Statistical Modeling Approaches for Information Retrieval - PowerPoint PPT Presentation


slide-1
SLIDE 1

Statistical Modeling Approaches for Information Retrieval

Berlin Chen 2004

References:

  • 1. W. B. Croft and J. Lafferty (Editors). Language Modeling for Information Retrieval. July 2003
  • 2. Berlin Chen et al. A Discriminative HMM/N-Gram-Based Retrieval Approach for Mandarin Spoken Documents. ACM Transactions on Asian Language Information Processing, June 2004
  • 3. Berlin Chen. Exploring the Use of Latent Topical Information for Statistical Chinese Spoken Document Retrieval. 2004
  • 4. M. W. Berry et al. Using Linear Algebra for Intelligent Information Retrieval. Technical report, 1994
  • 5. Thomas Hofmann. Unsupervised Learning by Probabilistic Latent Semantic Analysis. Machine Learning, 2001

Outline:

  • 1. HMM/N-gram-based Model
  • 2. Topical Mixture Model (TMM)
  • 3. Latent Semantic Indexing (LSI)
  • 4. Probabilistic Latent Semantic Analysis (PLSA)
slide-2
SLIDE 2

IR 2004 – Berlin Chen 2

Taxonomy of Classic IR Models

[Diagram: taxonomy of classic IR models]

  • User task: Retrieval (ad hoc, filtering), Browsing
  • Classic models: Boolean, Vector, Probabilistic
  • Set-theoretic models: Fuzzy, Extended Boolean
  • Probability-based models: Inference Network, Belief Network, Hidden Markov Model, Topical Mixture Model, Probabilistic LSI, Language Model
  • Algebraic models: Generalized Vector, Latent Semantic Indexing (LSI), Neural Networks
  • Structured models: Non-Overlapping Lists, Proximal Nodes
  • Browsing: Flat, Structure Guided, Hypertext

slide-3
SLIDE 3

IR 2004 – Berlin Chen 3

Two Perspectives for IR Models

  • Matching Strategy

– Literal term matching

  • Vector Space Model (VSM), Hidden Markov Model (HMM), Language Model (LM)

– Concept matching

  • Latent Semantic Indexing (LSI), Probabilistic Latent Semantic Indexing (PLSI), Topical Mixture Model (TMM)

  • Learning Capability

– Term weight, query expansion, document expansion, etc.

  • Vector Space Model, Latent Semantic Indexing

– Solid statistical foundations

  • Hidden Markov Model, Probabilistic Latent Semantic Indexing (PLSI), Topical Mixture Model (TMM)

slide-4
SLIDE 4

IR 2004 – Berlin Chen 4

Two Perspectives for IR Models (cont.)

  • Literal Term Matching vs. Concept Matching

Example (Chinese broadcast-news transcript, containing recognition errors): "A Hong Kong Sing Tao Daily report quotes military observers as saying that by 2005 Taiwan will have completely lost its air superiority, because mainland Chinese fighters will surpass Taiwan's in both quantity and performance. The report notes that while importing large quantities of advanced Russian weapons, China must also speed up development of its own weapon systems; the improved Flying Leopard fighter from the Xi'an aircraft plant is about to be deployed … According to Japanese media, with a war in the Taiwan Strait possible at any time, Beijing's basic policy is to fight a high-tech local war, so the PLA plans to field two hundred Sukhoi fighters, including the Su-30, before 2004."

Query terms: 中國解放軍 (Chinese PLA), 蘇愷戰機 (Sukhoi fighters), 中共新一代 (China's new generation), 空軍戰力 (air force strength)
slide-5
SLIDE 5

IR 2004 – Berlin Chen 5

HMM/N-gram-based Model

  • Model the query as a sequence of input observations (index terms): $Q = q_1 q_2 \cdots q_N$

  • Model the doc $D$ as a discrete HMM composed of distributions of N-gram parameters

  • The relevance measure, $P(Q \mid D \text{ is } R)$, can be estimated by the N-gram probabilities of the index term sequence for the query, $Q$, predicted by the doc $D$

– A generative model for IR

$D^{*} = \arg\max_{D} P(D \text{ is } R \mid Q) \approx \arg\max_{D} P(Q \mid D \text{ is } R)\, P(D \text{ is } R) \approx \arg\max_{D} P(Q \mid D \text{ is } R)$

with the assumption that …… (e.g., the prior $P(D \text{ is } R)$ is roughly uniform across documents)

slide-6
SLIDE 6

IR 2004 – Berlin Chen 6

HMM/N-gram-based Model (cont.)

  • Given a word sequence, $W = w_1 w_2 \cdots w_N$, of length N

– How to estimate its corresponding probability $P(W)$?

$P(W) = P(w_1 w_2 \cdots w_N) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_1 w_2 \cdots w_{N-1})$

The chain rule is applied. Too complicated to estimate all the necessary probability items!

slide-7
SLIDE 7

IR 2004 – Berlin Chen 7

HMM/N-gram-based Model (cont.)

  • N-gram approximation (Language Model)

– Unigram: $P(W) = P(w_1)\, P(w_2)\, P(w_3) \cdots P(w_N)$

– Bigram: $P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2) \cdots P(w_N \mid w_{N-1})$

– Trigram: $P(W) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_N \mid w_{N-2} w_{N-1})$

– ……

slide-8
SLIDE 8

IR 2004 – Berlin Chen 8

HMM/N-gram-based Model (cont.)

  • A discrete HMM composed of distributions of N-gram parameters (viewed as a language model source)

$Q = q_1 q_2 \cdots q_N$, with mixture weights $m_1 + m_2 + m_3 + m_4 = 1$ attached to the four component distributions $P(q_n \mid D)$, $P(q_n \mid \text{Corpus})$, $P(q_n \mid q_{n-1}, D)$ and $P(q_n \mid q_{n-1}, \text{Corpus})$

Bigram modeling: $P(Q \mid D \text{ is } R) = P(q_1 \mid D) \prod_{n=2}^{N} P(q_n \mid q_{n-1}, D)$

A mixture of N probability distributions:

$P(Q \mid D \text{ is } R) = \left[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\right] \cdot \prod_{n=2}^{N} \left[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D) + m_4 P(q_n \mid q_{n-1}, \text{Corpus})\right]$

smoothing/interpolation, but reasons for what: avoiding zero prob., and …?
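The interpolated scoring rule above maps almost directly onto code. Below is a minimal Python sketch (not the authors' implementation) of the Uni+Bi* scoring for one document, assuming the document and corpus unigram/bigram probabilities have already been estimated and stored in dictionaries; log-probabilities are summed to avoid underflow.

```python
import math

def score_query(query_terms, doc_uni, doc_bi, corpus_uni, corpus_bi,
                m1=0.4, m2=0.3, m3=0.2, m4=0.1):
    """Log P(Q | D is R) under the Uni+Bi* mixture (a sketch, not the original code).

    doc_uni[q]     ~ P(q | D)          doc_bi[(prev, q)]    ~ P(q | prev, D)
    corpus_uni[q]  ~ P(q | Corpus)     corpus_bi[(prev, q)] ~ P(q | prev, Corpus)
    The weights m1 + m2 + m3 + m4 should sum to 1.
    """
    log_p = 0.0
    prev = None
    for q in query_terms:
        p = m1 * doc_uni.get(q, 0.0) + m2 * corpus_uni.get(q, 0.0)
        if prev is not None:  # bigram terms apply only from the second query word on
            p += m3 * doc_bi.get((prev, q), 0.0) + m4 * corpus_bi.get((prev, q), 0.0)
        log_p += math.log(p) if p > 0 else float("-inf")
        prev = q
    return log_p

# Documents are ranked by this score; the corpus components keep p > 0 for unseen terms.
```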

slide-9
SLIDE 9

IR 2004 – Berlin Chen 9

HMM/N-gram-based Model (cont.)

  • Variants: Three Types of HMM Structures

– Type I: Unigram-Based (Uni)

$P(Q \mid D \text{ is } R) = \prod_{n=1}^{N} \left[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus})\right]$

– Type II: Unigram/Bigram-Based (Uni+Bi)

$P(Q \mid D \text{ is } R) = \left[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\right] \prod_{n=2}^{N} \left[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D)\right]$

– Type III: Unigram/Bigram/Corpus-Based (Uni+Bi*)

$P(Q \mid D \text{ is } R) = \left[m_1 P(q_1 \mid D) + m_2 P(q_1 \mid \text{Corpus})\right] \prod_{n=2}^{N} \left[m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus}) + m_3 P(q_n \mid q_{n-1}, D) + m_4 P(q_n \mid q_{n-1}, \text{Corpus})\right]$

Example: P(陳水扁 總統 視察 阿里山 小火車 | D) = [m1·P(陳水扁|D) + m2·P(陳水扁|C)] × [m1·P(總統|D) + m2·P(總統|C) + m3·P(總統|陳水扁,D) + m4·P(總統|陳水扁,C)] × [m1·P(視察|D) + m2·P(視察|C) + m3·P(視察|總統,D) + m4·P(視察|總統,C)] × ……

slide-10
SLIDE 10

IR 2004 – Berlin Chen 10

HMM/N-gram-based Model (cont.)

  • The role of the corpus N-gram probabilities $P(q_n \mid \text{Corpus})$ and $P(q_n \mid q_{n-1}, \text{Corpus})$

– Model the general distribution of the index terms

  • Help to solve the zero-frequency problem, i.e. the case $P(q_n \mid D) = 0$
  • Help to differentiate the contributions of different missing terms in a doc (global information like IDF?)

– The corpus N-gram probabilities were estimated using an outside corpus

Example doc D = "qa qa qa qb qa qb qb qc qc qd": P(qa|D)=0.4, P(qb|D)=0.3, P(qc|D)=0.2, P(qd|D)=0.1, P(qe|D)=0.0, P(qf|D)=0.0

slide-11
SLIDE 11

IR 2004 – Berlin Chen 11

HMM/N-gram-based Model (cont.)

  • Estimation of N-grams (Language Models)

– Maximum likelihood estimation (MLE) for doc N-grams

  • Unigram: $P(q_i \mid D) = \dfrac{C_D(q_i)}{\sum_{q_j} C_D(q_j)} = \dfrac{C_D(q_i)}{|D|}$, where $C_D(q_i)$ is the count of term $q_i$ in the doc D and $|D|$ is the length of (number of terms in) the doc D

  • Bigram: $P(q_i \mid q_j, D) = \dfrac{C_D(q_j, q_i)}{C_D(q_j)}$, where $C_D(q_j, q_i)$ is the count of the term pair $(q_j, q_i)$ in the doc D

– Similar formulas for corpus N-grams:

$P(q_i \mid \text{Corpus}) = \dfrac{C_{\text{Corpus}}(q_i)}{|\text{Corpus}|}$, $\qquad P(q_i \mid q_j, \text{Corpus}) = \dfrac{C_{\text{Corpus}}(q_j, q_i)}{C_{\text{Corpus}}(q_j)}$

Corpus: an outside corpus or just the doc collection
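As a concrete illustration of the MLE formulas above, the following Python sketch (a hypothetical helper, not taken from the original paper) builds the unigram and bigram tables for one tokenized document; the corpus tables are built the same way over the whole collection or an outside corpus.

```python
from collections import Counter

def estimate_ngrams(tokens):
    """Relative-frequency (MLE) unigram and bigram estimates for one document."""
    uni_counts = Counter(tokens)                           # C_D(q_i)
    bi_counts = Counter(zip(tokens[:-1], tokens[1:]))      # C_D(q_j, q_i)
    total = len(tokens)                                    # |D|
    unigram = {w: c / total for w, c in uni_counts.items()}                      # P(q_i | D)
    bigram = {(p, w): c / uni_counts[p] for (p, w), c in bi_counts.items()}      # P(q_i | q_j, D)
    return unigram, bigram

uni, bi = estimate_ngrams("qa qa qa qb qa qb qb qc qc qd".split())
print(uni["qa"])  # 0.4, matching the toy example above
```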

slide-12
SLIDE 12

IR 2004 – Berlin Chen 12

HMM/N-gram-based Model (cont.)

  • Basically, m1, m2, m3, m4 can be estimated by using the Expectation-Maximization (EM) algorithm

– All docs share the same weights mi here
– The N-gram probability distributions also can be estimated using the EM algorithm instead of maximum likelihood (ML) estimation

  • Unsupervised: using the doc itself, ML
  • Supervised: using query exemplars, EM

  • For those docs with training queries, m1, m2, m3, m4 can be estimated by using the Minimum Classification Error (MCE) training algorithm

– The docs can have different weights

(because of the insufficiency of training data)

slide-13
SLIDE 13

IR 2004 – Berlin Chen 13

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– The weights are tied among the documents
– E.g. $m_1$ of the Type I HMM can be trained using the following equation:

$\hat m_1 = \dfrac{\sum_{Q \in \text{TrainSet}_Q}\ \sum_{D \in \text{Doc}_{Q\text{-to-}R}}\ \sum_{q_n \in Q} \dfrac{m_1 P(q_n \mid D)}{m_1 P(q_n \mid D) + m_2 P(q_n \mid \text{Corpus})}}{\sum_{Q \in \text{TrainSet}_Q} |Q| \cdot |\text{Doc}_{Q\text{-to-}R}|}$

(the weights $m_1, m_2$ inside the sum are the old weights; $\hat m_1$ is the new weight)

  • Where $\text{TrainSet}_Q$ is the set of training query exemplars, $\text{Doc}_{Q\text{-to-}R}$ is the set of docs that are relevant to a specific training query exemplar $Q$, $|Q|$ is the length of the query $Q$, and $|\text{Doc}_{Q\text{-to-}R}|$ is the total number of docs relevant to the query $Q$

819 queries, ≦ 2265 docs
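A minimal Python sketch of this tied-weight re-estimation for the Type I (Uni) model is shown below; the training set is assumed to be a list of (query_terms, relevant_docs) pairs, and doc_uni / corpus_uni are hypothetical per-document and corpus unigram tables as in the earlier sketches.

```python
def em_update_weights(train_set, doc_uni, corpus_uni, m1, m2, n_iters=10):
    """Re-estimate the tied mixture weights (m1, m2) of the Type I HMM (a sketch)."""
    for _ in range(n_iters):
        numer, denom = 0.0, 0.0
        for query_terms, relevant_docs in train_set:
            for d in relevant_docs:
                for q in query_terms:
                    p_doc = m1 * doc_uni[d].get(q, 0.0)
                    p_all = p_doc + m2 * corpus_uni.get(q, 0.0)
                    if p_all > 0:
                        numer += p_doc / p_all   # posterior weight of the document mixture
                    denom += 1                    # accumulates sum over |Q| * |Doc_Q-to-R|
        m1 = numer / denom
        m2 = 1.0 - m1
    return m1, m2
```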

slide-14
SLIDE 14

IR 2004 – Berlin Chen 14

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation

  • Write down the log-likelihood expression and take the expectation over the mixture sequences $\mathbf{K}$

$P(Q, \mathbf{K} \mid \hat D) = P(\mathbf{K} \mid Q, \hat D)\, P(Q \mid \hat D)$

$\Rightarrow \log P(Q \mid \hat D) = \log P(Q, \mathbf{K} \mid \hat D) - \log P(\mathbf{K} \mid Q, \hat D)$

$\Rightarrow \log P(Q \mid \hat D) = E_{\mathbf{K} \mid Q, D}\left[\log P(Q, \mathbf{K} \mid \hat D)\right] - E_{\mathbf{K} \mid Q, D}\left[\log P(\mathbf{K} \mid Q, \hat D)\right]$

$\Rightarrow \log P(Q \mid \hat D) = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(Q, \mathbf{K} \mid \hat D) - \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(\mathbf{K} \mid Q, \hat D)$

where $Q = q_1 q_2 \cdots q_N$ is the query word sequence and $\mathbf{K} = k_1 k_2 \cdots k_N$ (e.g. 1 3 2 4 …) is a mixture sequence; the expectation is taken over all possible mixture sequences $\mathbf{K}$ (conditioned on $Q, D$)

slide-15
SLIDE 15

IR 2004 – Berlin Chen 15

HMM/N-gram-based Model (cont.)

  • Explanation

$P(Q \mid \hat D) = \prod_{n=1}^{N} \sum_{k_n=1}^{M} \hat m_{k_n}\, P(q_n \mid k_n, D) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_N=1}^{M} \left[\prod_{n=1}^{N} \hat m_{k_n}\, P(q_n \mid k_n, D)\right]$

where $\hat m_{k_n} = P(k_n \mid \hat D)$

$\Rightarrow P(Q \mid \hat D) = \sum_{k_1} \sum_{k_2} \cdots \sum_{k_N} P(\mathbf{K} \mid \hat D)\, P(Q \mid \mathbf{K}, \hat D) = \sum_{\mathbf{K}} P(Q, \mathbf{K} \mid \hat D)$

(independence assumption: $P(Q \mid \mathbf{K}, \hat D) = \prod_n P(q_n \mid k_n, \hat D)$)

How many kinds of mixture sequences $\mathbf{K} = k_1 k_2 \cdots k_N$ are there? $M^{N}$ kinds, for $M$ mixtures of distributions

Note (sum-product → product-sum): $\prod_{t=1}^{T} \sum_{k=1}^{M} a_{tk} = (a_{11} + a_{12} + \cdots + a_{1M})(a_{21} + \cdots + a_{2M}) \cdots (a_{T1} + \cdots + a_{TM}) = \sum_{k_1=1}^{M} \sum_{k_2=1}^{M} \cdots \sum_{k_T=1}^{M} \prod_{t=1}^{T} a_{t k_t}$

slide-16
SLIDE 16

IR 2004 – Berlin Chen 16

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

  • Express $\log P(Q \mid \hat D)$ using two auxiliary functions

$\log P(Q \mid \hat D) = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(Q, \mathbf{K} \mid \hat D) - \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(\mathbf{K} \mid Q, \hat D) = \Phi(\hat D, D) - \mathrm{H}(\hat D, D)$

where

$\Phi(\hat D, D) = E_{\mathbf{K}}\left[\log P(Q, \mathbf{K} \mid \hat D)\right] = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(Q, \mathbf{K} \mid \hat D)$   (expected log-likelihood of the complete data)

$\mathrm{H}(\hat D, D) = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(\mathbf{K} \mid Q, \hat D)$

slide-17
SLIDE 17

IR 2004 – Berlin Chen 17

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

  • We want $\log P(Q \mid \hat D) \ge \log P(Q \mid D)$

$\log P(Q \mid \hat D) - \log P(Q \mid D) = \left[\Phi(\hat D, D) - \mathrm{H}(\hat D, D)\right] - \left[\Phi(D, D) - \mathrm{H}(D, D)\right] = \left[\Phi(\hat D, D) - \Phi(D, D)\right] + \left[-\mathrm{H}(\hat D, D) + \mathrm{H}(D, D)\right]$

($\hat D$ denotes the unknown (new) model setting, $D$ the current one)

slide-18
SLIDE 18

IR 2004 – Berlin Chen 18

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

  • The term $-\mathrm{H}(\hat D, D) + \mathrm{H}(D, D)$ has the following property:

$-\mathrm{H}(\hat D, D) + \mathrm{H}(D, D) = -\sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(\mathbf{K} \mid Q, \hat D) + \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(\mathbf{K} \mid Q, D) = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log \dfrac{P(\mathbf{K} \mid Q, D)}{P(\mathbf{K} \mid Q, \hat D)} \ge -\sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D) \left(\dfrac{P(\mathbf{K} \mid Q, \hat D)}{P(\mathbf{K} \mid Q, D)} - 1\right) = 0$

Jensen's inequality (note: $\log x \le x - 1$); the middle expression is the Kullback-Leibler (KL) distance $\sum_x P(x)\, \log \dfrac{P(x)}{q(x)}$

slide-19
SLIDE 19

IR 2004 – Berlin Chen 19

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

  • Therefore, for maximizing $\log P(Q \mid \hat D)$, we only need to maximize the Φ-function (auxiliary function)

$\Phi(\hat D, D) = \sum_{\mathbf{K}} P(\mathbf{K} \mid Q, D)\, \log P(Q, \mathbf{K} \mid \hat D)$

  • If the unigram model is used, the Φ-function can be further expressed as

$\Phi(\hat D, D) = \sum_{q_n \in Q} \sum_{k} P(k \mid q_n, D)\, \log P(q_n, k \mid \hat D)$

slide-20
SLIDE 20

IR 2004 – Berlin Chen 20

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

$\Phi(\hat D, D) = \sum_{q_n \in Q} \sum_{k} P(k \mid q_n, D)\, \log P(q_n, k \mid \hat D) = \sum_{q_n \in Q} \sum_{k} P(k \mid q_n, D)\, \log \left[P(k \mid \hat D)\, P(q_n \mid k, \hat D)\right] = \sum_{q_n \in Q} \sum_{k} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}\, \log \left[\hat m_k\, \hat P(q_n \mid k, D)\right]$

Auxiliary function; the empirical distribution uses the old weights $m_k = P(k \mid D)$, and the model uses the new weights $\hat m_k = P(k \mid \hat D)$

slide-21
SLIDE 21

IR 2004 – Berlin Chen 21

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 1: Expectation (cont.)

  • The Φ-function (auxiliary function) can be treated in two parts

$\Phi_{\hat m} = \sum_{q_n \in Q} \sum_{k} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}\, \log \hat m_k$

$\Phi_{\hat P(q \mid k, D)} = \sum_{q_n \in Q} \sum_{k} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}\, \log \hat P(q_n \mid k, D)$

The reestimation of the probabilities $\hat P(q \mid k, D)$ will not be discussed here!

slide-22
SLIDE 22

IR 2004 – Berlin Chen 22

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 2: Maximization

  • Apply a Lagrange multiplier

Suppose that $F = \sum_{j=1}^{N} w_j \log y_j + l \left(1 - \sum_{j=1}^{N} y_j\right)$, with the constraint $\sum_{j=1}^{N} y_j = 1$. Then $\dfrac{\partial F}{\partial y_j} = \dfrac{w_j}{y_j} - l = 0 \Rightarrow w_j = l\, y_j,\ \forall j \Rightarrow \sum_{j=1}^{N} w_j = l \sum_{j=1}^{N} y_j = l \Rightarrow y_j = \dfrac{w_j}{\sum_{j'=1}^{N} w_{j'}}$

Note: $\dfrac{\partial \log y_j}{\partial y_j} = \dfrac{1}{y_j}$

slide-23
SLIDE 23

IR 2004 – Berlin Chen 23

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 2: Maximization (cont.)

  • Apply the Lagrange multiplier to $\Phi_{\hat m}$ (normalization constraint $\sum_i \hat m_i = 1$):

$\Phi_{\hat m} = \sum_{q_n \in Q} \sum_{k} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}\, \log \hat m_k + l \left(1 - \sum_i \hat m_i\right)$

Assume $G_k = \sum_{q_n \in Q} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}$. Setting $\dfrac{\partial \Phi_{\hat m}}{\partial \hat m_k} = \dfrac{G_k}{\hat m_k} - l = 0$ gives $G_k = l\, \hat m_k$, hence $\hat m_k = \dfrac{G_k}{l} = \dfrac{G_k}{G_1 + G_2 + \cdots}$

Note: $\dfrac{\partial \log \hat m_k}{\partial \hat m_k} = \dfrac{1}{\hat m_k}$  (normalization constraints using Lagrange multipliers)

slide-24
SLIDE 24

IR 2004 – Berlin Chen 24

HMM/N-gram-based Model (cont.)

  • Expectation-Maximization Training

– Step 2: Maximization (cont.)
– Extension:

  • Multiple training queries for a doc
  • Weights are tied among docs

$\hat m_k = \dfrac{G_k}{l} = \dfrac{\sum_{q_n \in Q} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}}{\sum_s \sum_{q_n \in Q} \dfrac{m_s\, P(q_n \mid s, D)}{\sum_j m_j\, P(q_n \mid j, D)}} = \dfrac{1}{|Q|} \sum_{q_n \in Q} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}$

and, summing over the training queries and their relevant docs (weights tied):

$\hat m_k = \dfrac{\sum_{Q \in \text{TrainSet}_Q}\ \sum_{D \in \text{Doc}_{Q\text{-to-}R}}\ \sum_{q_n \in Q} \dfrac{m_k\, P(q_n \mid k, D)}{\sum_j m_j\, P(q_n \mid j, D)}}{\sum_{Q \in \text{TrainSet}_Q} |Q| \cdot |\text{Doc}_{Q\text{-to-}R}|}$

slide-25
SLIDE 25

IR 2004 – Berlin Chen 25

HMM/N-gram-based Model (cont.)

  • Experimental results with EM training

– HMM/N-gram-based approach vs. vector space model
– The HMM/N-gram-based approach is consistently better than the vector space model

HMM/N-gram-based approach (Average Precision):

                 Word-level                      Syllable-level
                 Uni      Uni+Bi   Uni+Bi*       Uni      Uni+Bi   Uni+Bi*
  TDT2  TQ/TD    0.6327   0.6069   0.5427        0.4698   0.5220   0.5718
  TDT2  TQ/SD    0.5658   0.5702   0.4803        0.4411   0.5011   0.5307
  TDT3  TQ/TD    0.6569   0.6542   0.6141        0.5343   0.5970   0.6560
  TDT3  TQ/SD    0.6308   0.6361   0.5808        0.5177   0.5678   0.6433

Vector space model (Average Precision):

                 Word-level                      Syllable-level
                 S(N), N=1   S(N), N=1~2         S(N), N=1   S(N), N=1~2
  TDT2  TQ/TD    0.5548      0.5623              0.3412      0.5254
  TDT2  TQ/SD    0.5122      0.5225              0.3306      0.5077
  TDT3  TQ/TD    0.6505      0.6531              0.3963      0.6502
  TDT3  TQ/SD    0.6216      0.6233              0.3708      0.6353

slide-26
SLIDE 26

IR 2004 – Berlin Chen 26

Review: The EM Algorithm

  • Introduction of EM (Expectation Maximization):

– Why EM?

  • Simple optimization algorithms for the likelihood function rely on the intermediate variables, called latent (隱藏的) data; in our case here, the state (mixture) sequence is the latent data
  • Direct access to the data necessary to estimate the parameters is impossible or difficult

– Two Major Steps:

  • E: expectation with respect to the latent data, using the current estimate of the parameters and conditioned on the observations
  • M: provides a new estimation of the parameters according to ML (or MAP)

slide-27
SLIDE 27

IR 2004 – Berlin Chen 27

Review: The EM Algorithm (cont.)

  • The EM Algorithm is important to HMMs and other learning techniques

– Discover new model parameters to maximize the log-likelihood of the incomplete data, $\log P(\mathbf{O} \mid \lambda)$, by iteratively maximizing the expectation of the log-likelihood of the complete data, $\log P(\mathbf{O}, \mathbf{S} \mid \lambda)$

  • Example

– The observable training data $\mathbf{O}$

  • We want to maximize $P(\mathbf{O} \mid \lambda)$, where $\lambda$ is a parameter vector

– The hidden (unobservable) data $\mathbf{S}$

  • E.g. the component densities of the observable data $\mathbf{O}$, or the underlying state sequence in HMMs

$\Phi(\lambda, \hat\lambda) = E_{\mathbf{S}}\left[\log P(\mathbf{O}, \mathbf{S} \mid \hat\lambda) \,\middle|\, \mathbf{O}, \lambda\right]$

slide-28
SLIDE 28

IR 2004 – Berlin Chen 28

HMM/N-gram-based Model (cont.)

  • Minimum Classification Error (MCE) Training

– Given a query $Q$ and a desired relevant doc $D^{*}$, define the classification error function as:

$E(Q, D^{*}) = -\log P(Q \mid D^{*} \text{ is } R) + \max_{D' \text{ is not } R} \log P(Q \mid D')$

“>0”: means misclassified; “<=0”: means a correct decision

– Transform the error function to the loss function

$L(Q, D^{*}) = \dfrac{1}{1 + \exp\left(-\alpha\, E(Q, D^{*}) + \beta\right)}$

  • In the range between 0 and 1

slide-29
SLIDE 29

IR 2004 – Berlin Chen 29

HMM/N-gram-based Model (cont.)

  • Minimum Classification Error (MCE) Training

– Apply the loss function to the MCE procedure for iteratively updating the weighting parameters

  • Constraints: $m_k \ge 0,\ \sum_k m_k = 1$
  • Parameter transformation (e.g., Type I HMM):

$m_1 = \dfrac{e^{\tilde m_1}}{e^{\tilde m_1} + e^{\tilde m_2}}$ and $m_2 = \dfrac{e^{\tilde m_2}}{e^{\tilde m_1} + e^{\tilde m_2}}$

– Iteratively update $\tilde m_1$ (e.g., Type I HMM) by gradient descent:

$\tilde m_1(i+1) = \tilde m_1(i) - \varepsilon \left.\dfrac{\partial L(Q, D^{*})}{\partial \tilde m_1}\right|_{\tilde m_1 = \tilde m_1(i)}$

  • Where

$\nabla_{D^{*}, \tilde m_1}(i) = \varepsilon \left.\dfrac{\partial L(Q, D^{*})}{\partial \tilde m_1}\right|_{\tilde m_1 = \tilde m_1(i)} = \varepsilon \cdot \dfrac{\partial L(Q, D^{*})}{\partial E(Q, D^{*})} \cdot \dfrac{\partial E(Q, D^{*})}{\partial \tilde m_1}$, with $\dfrac{\partial L(Q, D^{*})}{\partial E(Q, D^{*})} = \alpha \cdot L(Q, D^{*}) \cdot \left[1 - L(Q, D^{*})\right]$

Gradient descent

slide-30
SLIDE 30

IR 2004 – Berlin Chen 30

HMM/N-gram-based Model (cont.)

  • Minimum Classification Error (MCE) Training

– Iteratively update $\tilde m_1$ (e.g., Type I HMM): differentiating the (length-normalized) error function with respect to $\tilde m_1$ gives

$\dfrac{\partial E(Q, D^{*})}{\partial \tilde m_1} = -\dfrac{1}{|Q|} \sum_{q_n \in Q} \left[\dfrac{m_1\, P(q_n \mid D^{*})}{m_1\, P(q_n \mid D^{*}) + m_2\, P(q_n \mid \text{Corpus})} - m_1\right]$

Note: $\left[\log f(x)\right]' = \dfrac{f'(x)}{f(x)}$, $\quad \left[\dfrac{f(x)}{g(x)}\right]' = \dfrac{f'(x)\, g(x) - f(x)\, g'(x)}{g(x)^2}$, $\quad \left[f(x)\, g(x)\right]' = f'(x)\, g(x) + f(x)\, g'(x)$

slide-31
SLIDE 31

IR 2004 – Berlin Chen 31

HMM/N-gram-based Model (cont.)

  • Minimum Classification Error (MCE) Training

– Iteratively update $\tilde m_1$ (e.g., Type I HMM)

$\nabla_{D^{*}, \tilde m_1}(i) = -\varepsilon \cdot \alpha \cdot L(Q, D^{*}) \cdot \left[1 - L(Q, D^{*})\right] \cdot \dfrac{1}{|Q|} \sum_{q_n \in Q} \left[\dfrac{m_1(i)\, P(q_n \mid D^{*})}{m_1(i)\, P(q_n \mid D^{*}) + m_2(i)\, P(q_n \mid \text{Corpus})} - m_1(i)\right]$

$\tilde m_1(i+1) = \tilde m_1(i) - \nabla_{D^{*}, \tilde m_1}(i)$

and, transforming back to the original weights,

$m_1(i+1) = \dfrac{e^{\tilde m_1(i+1)}}{e^{\tilde m_1(i+1)} + e^{\tilde m_2(i+1)}} = \dfrac{e^{\tilde m_1(i) - \nabla_{D^{*}, \tilde m_1}(i)}}{e^{\tilde m_1(i) - \nabla_{D^{*}, \tilde m_1}(i)} + e^{\tilde m_2(i) - \nabla_{D^{*}, \tilde m_2}(i)}} = \dfrac{m_1(i)\, e^{-\nabla_{D^{*}, \tilde m_1}(i)}}{m_1(i)\, e^{-\nabla_{D^{*}, \tilde m_1}(i)} + m_2(i)\, e^{-\nabla_{D^{*}, \tilde m_2}(i)}}$

(the new weight $m_1(i+1)$ is obtained from the old weights $m_1(i)$, $m_2(i)$)

slide-32
SLIDE 32

IR 2004 – Berlin Chen 32

HMM/N-gram-based Model (cont.)

  • Minimum Classification Error (MCE) Training

– Final Equations

  • Iteratively update $m_1$:

$\nabla_{D^{*}, \tilde m_1}(i) = -\varepsilon \cdot \alpha \cdot L(Q, D^{*}) \cdot \left[1 - L(Q, D^{*})\right] \cdot \dfrac{1}{|Q|} \sum_{q_n \in Q} \left[\dfrac{m_1(i)\, P(q_n \mid D^{*})}{m_1(i)\, P(q_n \mid D^{*}) + m_2(i)\, P(q_n \mid \text{Corpus})} - m_1(i)\right]$

$m_1(i+1) = \dfrac{m_1(i)\, e^{-\nabla_{D^{*}, \tilde m_1}(i)}}{m_1(i)\, e^{-\nabla_{D^{*}, \tilde m_1}(i)} + m_2(i)\, e^{-\nabla_{D^{*}, \tilde m_2}(i)}}$

  • $m_2$ can be updated in a similar way
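The MCE update amounts to a sigmoid loss over the score difference plus a gradient step in the softmax-transformed weights. The following Python sketch (hypothetical, for a Type I model with only m1/m2; not the original implementation) shows one such update for a single training query; alpha, beta and the step size epsilon are tuning constants as in the slides, and only the relevant-document term of the gradient is shown, as above.

```python
import math

def mce_update(query_terms, rel_doc_uni, best_irrel_doc_uni, corpus_uni,
               m1, m2, alpha=1.0, beta=0.0, epsilon=0.1):
    """One MCE gradient step on the tied Type I weights (a sketch of the procedure above)."""
    def norm_loglik(doc_uni):
        # length-normalized log P(Q | D); assumes the corpus term keeps the mixture positive
        return sum(math.log(m1 * doc_uni.get(q, 0.0) + m2 * corpus_uni.get(q, 0.0))
                   for q in query_terms) / len(query_terms)

    # classification error and sigmoid loss
    err = -norm_loglik(rel_doc_uni) + norm_loglik(best_irrel_doc_uni)
    loss = 1.0 / (1.0 + math.exp(-alpha * err + beta))

    # dE/dm~1 for the relevant-document term of the error function
    dE = -sum((m1 * rel_doc_uni.get(q, 0.0) /
               (m1 * rel_doc_uni.get(q, 0.0) + m2 * corpus_uni.get(q, 0.0))) - m1
              for q in query_terms) / len(query_terms)
    grad = epsilon * alpha * loss * (1.0 - loss) * dE

    # update in the transformed (softmax) domain and map back onto the simplex;
    # for two weights the gradient w.r.t. m~2 is exactly the opposite of grad
    new_m1 = m1 * math.exp(-grad)
    new_m2 = m2 * math.exp(+grad)
    z = new_m1 + new_m2
    return new_m1 / z, new_m2 / z
```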

slide-33
SLIDE 33

IR 2004 – Berlin Chen 33

HMM/N-gram-based Model (cont.)

  • Experimental results with MCE training (iterations = 100)

– The results for the syllable-level index features were significantly improved

TDT2 Average Precision (values in parentheses: before MCE training):

                 Word-level Uni     Syllable-level Uni+Bi*     Fusion
  TQ/TD          0.6459 (0.6327)    0.6858 (0.5718)            0.7329
  TQ/SD          0.5810 (0.5658)    0.6300 (0.5307)            0.6914

[Figure: average precision vs. number of MCE iterations for word-based and syllable-based indexing, TQ/TD and TQ/SD, compared against the values before MCE training]

slide-34
SLIDE 34

IR 2004 – Berlin Chen 34

HMM/N-gram-based Model (cont.)

  • Advantages

– A formal mathematical framework
– Uses collection statistics rather than heuristics
– The retrieval system can be gradually improved through usage

  • Disadvantages

– Only literal term matching (or word overlap measure)

  • The issue of relevance or aboutness is not taken into consideration

– The implementation of relevance feedback or query expansion is not straightforward

slide-35
SLIDE 35

IR 2004 – Berlin Chen 35

Topical Mixture Model (TMM)

  • Perform Concept Matching in the Likelihood Space (under the likelihood criterion)

– Latent topical distributions are shared (tied) among docs

  • Various Theoretically Attractive Model Training Algorithms can be applied

– Maximum likelihood (e.g. EM) or discriminative (e.g. MCE or MMI) training

A document model $D_i$ generates the query $Q = q_1 q_2 \cdots q_N$ through a set of latent topics: topic weights $P(T_1 \mid D_i), P(T_2 \mid D_i), \ldots, P(T_K \mid D_i)$ and topic-conditioned word distributions $P(q_n \mid T_1), P(q_n \mid T_2), \ldots, P(q_n \mid T_K)$

$P(Q \mid D_i) \approx \prod_{n=1}^{N} \sum_{k=1}^{K} P(q_n \mid T_k)\, P(T_k \mid D_i)$
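A minimal Python sketch of this TMM scoring rule is shown below (illustrative, not the paper's implementation); p_w_given_t is a hypothetical list of topic-word tables P(q|T_k) shared by all documents, and p_t_given_d holds the per-document topic weights P(T_k|D_i).

```python
import math

def tmm_score(query_terms, p_w_given_t, p_t_given_d):
    """Log P(Q | D_i) = sum_n log sum_k P(q_n | T_k) P(T_k | D_i)  (a sketch)."""
    log_p = 0.0
    for q in query_terms:
        p = sum(p_w_given_t[k].get(q, 0.0) * p_t_given_d[k]
                for k in range(len(p_t_given_d)))
        log_p += math.log(p) if p > 0 else float("-inf")
    return log_p
```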

slide-36
SLIDE 36

IR 2004 – Berlin Chen 36

Topical Mixture Model (cont.)

  • EM: Supervised Training

– Given a training set of query exemplars with the corresponding query-document relevance information

$\hat P(q_n \mid T_k) = \dfrac{\sum_{Q \in \text{TrainSet}_Q}\ \sum_{D_i \in \text{Doc}_{Q\text{-to-}R}} n(q_n, Q)\, P(T_k \mid q_n, D_i)}{\sum_{Q \in \text{TrainSet}_Q}\ \sum_{D_i \in \text{Doc}_{Q\text{-to-}R}}\ \sum_{q_s \in Q} n(q_s, Q)\, P(T_k \mid q_s, D_i)}$

$\hat P(T_k \mid D_i) = \dfrac{\sum_{Q \in \text{TrainSet}_Q \text{ s.t. } D_i \in \text{Doc}_{Q\text{-to-}R}}\ \sum_{q_s \in Q} n(q_s, Q)\, P(T_k \mid q_s, D_i)}{\sum_{Q \in \text{TrainSet}_Q \text{ s.t. } D_i \in \text{Doc}_{Q\text{-to-}R}} |Q|}$

where $P(T_k \mid q_n, D_i) = \dfrac{P(q_n \mid T_k)\, P(T_k \mid D_i)}{\sum_{l=1}^{K} P(q_n \mid T_l)\, P(T_l \mid D_i)}$ and $n(q_n, Q)$ is the count of term $q_n$ in the query exemplar $Q$

slide-37
SLIDE 37

IR 2004 – Berlin Chen 37

Topical Mixture Model (cont.)

  • EM: Unsupervised Training

– Use each document itself as a query exemplar to train its own document mixture model

$\hat P(T_k \mid D_i) = \dfrac{\sum_{w_s \in D_i} n(w_s, D_i)\, P(T_k \mid w_s, D_i)}{|D_i|}$

$\hat P(w_n \mid T_k) = \dfrac{\sum_{D_i} n(w_n, D_i)\, P(T_k \mid w_n, D_i)}{\sum_{D_i} \sum_{w_s \in D_i} n(w_s, D_i)\, P(T_k \mid w_s, D_i)}$

where $P(T_k \mid w_n, D_i) = \dfrac{P(w_n \mid T_k)\, P(T_k \mid D_i)}{\sum_{l=1}^{K} P(w_n \mid T_l)\, P(T_l \mid D_i)}$, and $n(w_s, D_i)$ is the count of word $w_s$ in the doc $D_i$ (so $|D_i| = \sum_{w_s} n(w_s, D_i)$)
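The unsupervised update can be written directly from these formulas. Below is a compact NumPy sketch (an illustration under the stated assumptions, not the original code): counts is a term-by-document count matrix, and one EM iteration re-estimates P(w|T) and P(T|D).

```python
import numpy as np

def tmm_em_step(counts, p_w_t, p_t_d, eps=1e-12):
    """One unsupervised EM iteration for TMM (sketch).

    counts: (V, N) term-document counts n(w, D_i)
    p_w_t : (V, K) current P(w | T_k)
    p_t_d : (K, N) current P(T_k | D_i)
    """
    # E-step: posterior P(T_k | w, D_i) for every (w, D_i) pair
    joint = p_w_t[:, :, None] * p_t_d[None, :, :]            # (V, K, N)
    post = joint / (joint.sum(axis=1, keepdims=True) + eps)

    # M-step: accumulate posterior-weighted counts
    weighted = counts[:, None, :] * post                      # (V, K, N)
    new_p_t_d = weighted.sum(axis=0) / (counts.sum(axis=0, keepdims=True) + eps)
    new_p_w_t = weighted.sum(axis=2)
    new_p_w_t = new_p_w_t / (new_p_w_t.sum(axis=0, keepdims=True) + eps)
    return new_p_w_t, new_p_t_d
```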

slide-38
SLIDE 38

IR 2004 – Berlin Chen 38

Latent Semantic Indexing (LSI)

  • LSI: a technique that projects queries and docs into a space with “latent” semantic dimensions

– Co-occurring terms are projected onto the same dimensions
– In the latent semantic space (with fewer dimensions), a query and a doc can have high cosine similarity even if they do not share any terms
– Dimensions of the reduced space correspond to the axes of greatest variation

  • Closely related to Principal Component Analysis (PCA)
slide-39
SLIDE 39

IR 2004 – Berlin Chen 39

Latent Semantic Indexing (cont.)

  • Dimension Reduction and Feature Extraction

– PCA: project a feature vector $X$ onto an orthonormal basis $\{\phi_1, \ldots, \phi_n\}$, $y_i = \phi_i^{T} X$, and keep only the first $k$ components; the reconstruction $\hat X = \sum_{i=1}^{k} y_i\, \phi_i$ minimizes $\|X - \hat X\|^{2}$ for a given $k$ (feature space → latent space spanned by the orthonormal basis $\phi_1, \ldots, \phi_k$)

– SVD (in LSI): $A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^{T}_{r \times n}$ with $r \le \min(m, n)$; keeping only the $k$ largest singular values gives $A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^{T}_{k \times n}$, which minimizes $\|A - A'\|_{F}^{2}$ for a given $k$ (the latent semantic space has $k$ dimensions)
slide-40
SLIDE 40

IR 2004 – Berlin Chen 40

Latent Semantic Indexing (cont.)

– Singular Value Decomposition (SVD) used for the word-document matrix

  • A least-squares method for dimension reduction

Projection of a vector $x$ onto an orthonormal basis $\{\varphi_1, \varphi_2, \ldots\}$: $y_1 = \varphi_1^{T} x = \|x\| \cos\theta_1$, where $\|\varphi_1\| = 1$ (and similarly $y_2 = \varphi_2^{T} x$)

slide-41
SLIDE 41

IR 2004 – Berlin Chen 41

Latent Semantic Indexing (cont.)

  • Frameworks to circumvent vocabulary mismatch

[Diagram: a query and a doc are each represented by terms; they can be related by literal term matching, by query expansion or doc expansion on the term level, or by mapping both into a structure model for latent semantic structure retrieval]

slide-42
SLIDE 42

IR 2004 – Berlin Chen 42

Latent Semantic Indexing (cont.)

slide-43
SLIDE 43

IR 2004 – Berlin Chen 43

Latent Semantic Indexing (cont.)

Query: “human computer interaction”

An OOV word

slide-44
SLIDE 44

IR 2004 – Berlin Chen 44

Latent Semantic Indexing (cont.)

  • Singular Value Decomposition (SVD)

$A_{m \times n} = U_{m \times r}\, \Sigma_{r \times r}\, V^{T}_{r \times n}$, where the rows of $A$ correspond to the words $w_1, w_2, \ldots, w_m$, the columns to the docs $d_1, d_2, \ldots, d_n$, and $r \le \min(m, n)$

$A'_{m \times n} = U'_{m \times k}\, \Sigma_{k \times k}\, V'^{T}_{k \times n}$, with $k \le r$

– Docs and queries are represented in a k-dimensional space. The quantities of the axes can be properly weighted according to the associated diagonal values of $\Sigma_k$
– Both U and V have orthonormal column vectors: $U^{T}U = I_{r \times r}$, $V^{T}V = I_{r \times r}$
– $\|A\|_{F}^{2} \ge \|A'\|_{F}^{2}$, where $\|A\|_{F}^{2} = \sum_{i=1}^{m} \sum_{j=1}^{n} a_{ij}^{2}$
– The row space of A lies in $R^{n}$ and the column space of A lies in $R^{m}$
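The reduced-rank decomposition is a one-liner with NumPy; the sketch below (illustrative, with a made-up 4-term by 3-doc count matrix) keeps the k largest singular values and reconstructs A'.

```python
import numpy as np

# toy word-document count matrix A (4 terms x 3 docs), purely illustrative
A = np.array([[2., 0., 1.],
              [1., 1., 0.],
              [0., 3., 1.],
              [0., 1., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt

k = 2                                              # keep the k largest singular values
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation A'

# ||A - A'||_F^2 equals the sum of the discarded squared singular values
print(np.allclose(np.sum((A - A_k) ** 2), np.sum(s[k:] ** 2)))   # True
```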

slide-45
SLIDE 45

IR 2004 – Berlin Chen 45

Latent Semantic Indexing (cont.)

  • Singular Value Decomposition (SVD)

– $A^{T}A$ is a symmetric n×n matrix

  • All eigenvalues $\lambda_j$ are nonnegative real numbers: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n \ge 0$, with eigenvalue matrix $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n)$
  • All eigenvectors $v_j$ are orthonormal (in $R^{n}$): $V = [v_1\ v_2\ \cdots\ v_n]$, $v_j^{T} v_j = 1$, $V^{T}V = I_{n \times n}$
  • Define singular values: $\sigma_j = \sqrt{\lambda_j}$, $j = 1, \ldots, n$

– As the square roots of the eigenvalues of $A^{T}A$
– As the lengths of the vectors $A v_1, A v_2, \ldots, A v_n$: $\sigma_1 = \|A v_1\|$, $\sigma_2 = \|A v_2\|$, …, since $\|A v_i\|^{2} = v_i^{T} A^{T} A\, v_i = \lambda_i\, v_i^{T} v_i = \lambda_i \Rightarrow \|A v_i\| = \sigma_i$

For $\lambda_i \ne 0$, $i = 1, \ldots, r$, $\{A v_1, A v_2, \ldots, A v_r\}$ is an orthogonal basis of Col A

slide-46
SLIDE 46

IR 2004 – Berlin Chen 46

Latent Semantic Indexing (cont.)

  • $\{A v_1, A v_2, \ldots, A v_r\}$ is an orthogonal basis of Col A

– Suppose that A (or $A^{T}A$) has rank $r \le n$: $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r > 0$, $\lambda_{r+1} = \cdots = \lambda_n = 0$
– Orthogonality: $(A v_i)^{T}(A v_j) = v_j^{T} A^{T} A\, v_i = \lambda_i\, v_j^{T} v_i = 0$ for $i \ne j$
– Define an orthonormal basis $\{u_1, u_2, \ldots, u_r\}$ for Col A:

$u_i = \dfrac{1}{\|A v_i\|} A v_i = \dfrac{1}{\sigma_i} A v_i \;\Rightarrow\; \sigma_i u_i = A v_i \;\Rightarrow\; A\,[v_1\ v_2\ \cdots\ v_r] = [u_1\ u_2\ \cdots\ u_r]\,\Sigma$

  • Extend to an orthonormal basis $\{u_1, u_2, \ldots, u_m\}$ of $R^{m}$ (padding $\Sigma$ with zero rows as needed, $\Sigma_{m \times r}$):

$A\,[v_1\ v_2\ \cdots\ v_n] = [u_1\ u_2\ \cdots\ u_m]\,\Sigma \;\Rightarrow\; AV = U\Sigma \;\Rightarrow\; AVV^{T} = U\Sigma V^{T} \;\Rightarrow\; A = U\Sigma V^{T}$

(V is an orthonormal matrix, so $VV^{T} = I_{n \times n}$; $\Sigma$ and V are known in advance from the eigen-decomposition of $A^{T}A$)

$\|A\|_{F}^{2} = \sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^{2} = \sigma_1^{2} + \sigma_2^{2} + \cdots + \sigma_r^{2}$

slide-47
SLIDE 47

IR 2004 – Berlin Chen 47

Latent Semantic Indexing (cont.)

$A_{m \times n} = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} \Sigma_1 & 0 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} V_1^{T} \\ V_2^{T} \end{pmatrix} = U_1 \Sigma_1 V_1^{T}$, so $A V_1 = U_1 \Sigma_1$

[Diagram: the SVD maps between $R^{n}$ and $R^{m}$; the $v_i$ (rows of $V^{T}$) span the row space of A, and the $u_i$ span the column space of A]

slide-48
SLIDE 48

IR 2004 – Berlin Chen 48

Latent Semantic Indexing (cont.)

  • Fundamental comparisons based on SVD

– The original word-document matrix (A), of size m×n with rows $w_1 \ldots w_m$ and columns $d_1 \ldots d_n$:

  • compare two terms → dot product of two rows of A (or an entry in $AA^{T}$)
  • compare two docs → dot product of two columns of A (or an entry in $A^{T}A$)
  • compare a term and a doc → each individual entry of A

– The new word-document matrix (A'), with $U' = U_{m \times k}$, $\Sigma' = \Sigma_k$, $V' = V_{n \times k}$:

  • compare two terms → dot product of two rows of $U'\Sigma'$, since $A'A'^{T} = (U'\Sigma'V'^{T})(U'\Sigma'V'^{T})^{T} = U'\Sigma'V'^{T}V'\Sigma'^{T}U'^{T} = (U'\Sigma')(U'\Sigma')^{T}$
  • compare two docs → dot product of two rows of $V'\Sigma'$, since $A'^{T}A' = (U'\Sigma'V'^{T})^{T}(U'\Sigma'V'^{T}) = V'\Sigma'^{T}U'^{T}U'\Sigma'V'^{T} = (V'\Sigma')(V'\Sigma')^{T}$
  • compare a query word and a doc → each individual entry of A'

($\Sigma'$ acts only by stretching or shrinking along the latent axes)

slide-49
SLIDE 49

IR 2004 – Berlin Chen 49

Latent Semantic Indexing (cont.)

  • Fold-in: find representations for pseudo-docs q

– For objects (new queries or docs) that did not appear in the original analysis

  • Fold-in a new m×1 query (or doc) vector:

$\hat q_{1 \times k} = q^{T}_{1 \times m}\; U_{m \times k}\; \Sigma^{-1}_{k \times k}$

(the query is represented by the weighted sum of its constituent term vectors; the separate dimensions are differentially weighted; the result behaves just like a row of V)

– Cosine measure between the query and doc vectors in the latent semantic space:

$\mathrm{sim}(\hat q, \hat d) = \mathrm{cosine}(\hat q\,\Sigma,\ \hat d\,\Sigma) = \dfrac{(\hat q\,\Sigma)(\hat d\,\Sigma)^{T}}{\|\hat q\,\Sigma\|\ \|\hat d\,\Sigma\|}$  ($\hat q$ and $\hat d$ are row vectors)
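In code, folding in a query and ranking documents by the latent-space cosine looks roughly like the sketch below (continuing the hypothetical NumPy variables U, s, Vt and k from the earlier SVD example; q is a raw m-dimensional term-count vector).

```python
import numpy as np

def fold_in_query(q, U, s, k):
    """q_hat = q^T U_k Sigma_k^{-1}  -> a k-dimensional row vector (sketch)."""
    return q @ U[:, :k] / s[:k]

def latent_cosine(q_hat, d_hat, s, k):
    """Cosine between query and doc in the latent space, with axes weighted by Sigma_k."""
    qs, ds = q_hat * s[:k], d_hat * s[:k]
    return float(qs @ ds / (np.linalg.norm(qs) * np.linalg.norm(ds)))

# document representations are the rows of V_k (i.e. Vt[:k, :].T); rank docs by latent_cosine
```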

slide-50
SLIDE 50

IR 2004 – Berlin Chen 50

Latent Semantic Indexing (cont.)

  • Fold-in a new 1×n term vector:

$\hat t_{1 \times k} = t_{1 \times n}\; V_{n \times k}\; \Sigma^{-1}_{k \times k}$

slide-51
SLIDE 51

IR 2004 – Berlin Chen 51

Latent Semantic Indexing (cont.)

  • Experimental results

– HMM is consistently better than VSM at all recall levels
– LSI is better than VSM at higher recall levels

[Figure: Recall-Precision curve at 11 standard recall levels, evaluated on the TDT-3 SD collection, using word-level indexing terms]

slide-52
SLIDE 52

IR 2004 – Berlin Chen 52

Latent Semantic Indexing (LSI)

  • Advantages

– A clean formal framework and a clearly defined optimization criterion (least-squares)

  • Conceptual simplicity and clarity

– Handles synonymy problems (“heterogeneous vocabulary”)
– Good results for high-recall search

  • Takes term co-occurrence into account

  • Disadvantages

– High computational complexity
– LSI offers only a partial solution to polysemy

  • E.g. bank, bass, …
slide-53
SLIDE 53

IR 2004 – Berlin Chen 53

Probabilistic Latent Semantic Analysis (PLSA)

  • Also called the Aspect Model or Probabilistic Latent Semantic Indexing (PLSI) (Thomas Hofmann, 1999)

– Graphical Model Representation

Without latent topics: $D \rightarrow Q$ with parameters $P(D_i)$ and $P(w_n \mid D_i)$. With latent topics: $D \rightarrow T \rightarrow Q$ with parameters $P(D_i)$, $P(T_k \mid D_i)$ and $P(w_n \mid T_k)$. The latent variables are the unobservable class variables $T_k$ (topics or domains).

$\mathrm{sim}(Q, D_i) = P(D_i \mid Q) = \dfrac{P(Q \mid D_i)\, P(D_i)}{P(Q)} \approx P(Q \mid D_i)\, P(D_i) \approx P(Q \mid D_i)$

$\mathrm{sim}(Q, D_i) \approx P(Q \mid D_i) = \prod_{w_j \in Q} P(w_j \mid D_i) = \prod_{w_j \in Q} \left[\sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid D_i)\right]$

slide-54
SLIDE 54

IR 2004 – Berlin Chen 54

Probabilistic Latent Semantic Analysis (cont.)

  • Definition

– $P(D_i)$: the prob. of selecting a doc $D_i$
– $P(T_k \mid D_i)$: the prob. of picking a latent class $T_k$ for the doc $D_i$
– $P(w_j \mid T_k)$: the prob. of generating a word $w_j$ from the class $T_k$

slide-55
SLIDE 55

IR 2004 – Berlin Chen 55

Probabilistic Latent Semantic Analysis (cont.)

  • Assumptions

– Bag-of-words: treat docs as memoryless sources, words are generated independently
– Conditional independence: the doc $D_i$ and word $w_j$ are independent conditioned on the state of the associated latent variable $T_k$:

$P(w_j, D_i \mid T_k) = P(w_j \mid T_k)\, P(D_i \mid T_k)$

$P(w_j \mid D_i) = \dfrac{P(w_j, D_i)}{P(D_i)} = \dfrac{\sum_{k=1}^{K} P(w_j, D_i, T_k)}{P(D_i)} = \dfrac{\sum_{k=1}^{K} P(w_j, D_i \mid T_k)\, P(T_k)}{P(D_i)} = \dfrac{\sum_{k=1}^{K} P(w_j \mid T_k)\, P(D_i \mid T_k)\, P(T_k)}{P(D_i)} = \sum_{k=1}^{K} P(w_j \mid T_k)\, \dfrac{P(D_i \mid T_k)\, P(T_k)}{P(D_i)} = \sum_{k=1}^{K} P(w_j \mid T_k)\, P(T_k \mid D_i)$

$\mathrm{sim}(Q, D_i) \approx P(Q \mid D_i) = \prod_{w_j \in Q} P(w_j \mid D_i)$

Can be viewed as the topics being tied among the document HMMs

slide-56
SLIDE 56

IR 2004 – Berlin Chen 56

Probabilistic Latent Semantic Analysis (cont.)

  • Probability estimation using the EM (expectation-maximization) algorithm: unsupervised training (without the introduction of query exemplars for training)

– E (expectation) step

  • Define the auxiliary function as the expectation of the complete-data log-likelihood:

$\Phi = E\left[L_C\right] = \sum_{D_i} \sum_{w_j} n(w_j, D_i)\; E_{T_k \mid w_j, D_i}\left[\log \hat P(w_j, T_k \mid D_i)\right] = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} P(T_k \mid w_j, D_i)\, \log \hat P(w_j, T_k \mid D_i)$

  • With the property $\hat P(w_j, T_k \mid D_i) = \hat P(w_j \mid T_k)\, \hat P(T_k \mid D_i)$, so

$\Phi = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} P(T_k \mid w_j, D_i)\, \log\left[\hat P(w_j \mid T_k)\, \hat P(T_k \mid D_i)\right]$

($n(w_j, D_i)$ is the empirical count of word $w_j$ in doc $D_i$; $P(T_k \mid w_j, D_i)$ is computed with the current model, and the hatted terms belong to the new model)

slide-57
SLIDE 57

IR 2004 – Berlin Chen 57

Probabilistic Latent Semantic Analysis (cont.)

  • Probability estimation using the EM (expectation-maximization) algorithm

– E (expectation) step (cont.)

  • The posterior $\hat P(T_k \mid w_j, D_i)$ can be further decomposed as

$\hat P(T_k \mid w_j, D_i) = \dfrac{\hat P(w_j, T_k \mid D_i)}{\hat P(w_j \mid D_i)} = \dfrac{\hat P(w_j \mid T_k)\, \hat P(T_k \mid D_i)}{\sum_{T_{k'}} \hat P(w_j \mid T_{k'})\, \hat P(T_{k'} \mid D_i)}$

  • The auxiliary function then reads

$\Phi = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} \dfrac{P(w_j \mid T_k)\, P(T_k \mid D_i)}{\sum_{T_{k'}} P(w_j \mid T_{k'})\, P(T_{k'} \mid D_i)}\, \log\left[\hat P(w_j \mid T_k)\, \hat P(T_k \mid D_i)\right]$

(closely related to the Kullback-Leibler divergence between the empirical distribution and the model)

slide-58
SLIDE 58

IR 2004 – Berlin Chen 58

Probabilistic Latent Semantic Analysis (cont.)

  • Probability estimation using EM

– M (maximization) step: add the normalization constraints using Lagrange multipliers

$\Phi' = E\left[L_C\right] + \sum_{T_k} \tau_k \left(1 - \sum_{w_j} \hat P(w_j \mid T_k)\right) + \sum_{D_i} \rho_i \left(1 - \sum_{T_k} \hat P(T_k \mid D_i)\right)$

which splits into

$\Phi'_{\hat P(w_j \mid T_k)} = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} P(T_k \mid w_j, D_i)\, \log \hat P(w_j \mid T_k) + \sum_{T_k} \tau_k \left(1 - \sum_{w_j} \hat P(w_j \mid T_k)\right)$

$\Phi'_{\hat P(T_k \mid D_i)} = \sum_{D_i} \sum_{w_j} n(w_j, D_i) \sum_{T_k} P(T_k \mid w_j, D_i)\, \log \hat P(T_k \mid D_i) + \sum_{D_i} \rho_i \left(1 - \sum_{T_k} \hat P(T_k \mid D_i)\right)$

slide-59
SLIDE 59

IR 2004 – Berlin Chen 59

Probabilistic Latent Semantic Analysis (cont.)

  • Probability estimation using EM

– M (maximization) step (cont.)

  • Take the differentiation with respect to each parameter and solve; the training formulas are

$\hat P(w_j \mid T_k) = \dfrac{\sum_{D_i} n(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{w_{j'}} \sum_{D_i} n(w_{j'}, D_i)\, P(T_k \mid w_{j'}, D_i)}$

$\hat P(T_k \mid D_i) = \dfrac{\sum_{w_j} n(w_j, D_i)\, P(T_k \mid w_j, D_i)}{\sum_{T_{k'}} \sum_{w_j} n(w_j, D_i)\, P(T_{k'} \mid w_j, D_i)} = \dfrac{\sum_{w_j} n(w_j, D_i)\, P(T_k \mid w_j, D_i)}{n(D_i)}$

where $n(D_i) = \sum_{w_j} n(w_j, D_i)$ is the length of doc $D_i$

slide-60
SLIDE 60

IR 2004 – Berlin Chen 60

Probabilistic Latent Semantic Analysis (cont.)

  • Latent Probability Space

[Figure: example PLSA latent factors, e.g. image sequence analysis, medical imaging, context of contour/boundary detection, phonetic segmentation; dimensionality K = 128 latent classes]

$P(w_j, D_i) = \sum_{T_k} P(w_j, D_i, T_k) = \sum_{T_k} P(w_j, D_i \mid T_k)\, P(T_k) = \sum_{T_k} P(w_j \mid T_k)\, P(T_k)\, P(D_i \mid T_k)$

In matrix form, $P(W, D) = \hat U\, \hat\Sigma\, \hat V^{T}$, where $\hat U := \left(P(w_j \mid T_k)\right)_{j,k}$, $\hat\Sigma := \mathrm{diag}\left(P(T_k)\right)_k$, and $\hat V := \left(P(D_i \mid T_k)\right)_{i,k}$
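This SVD-like reading of PLSA can be checked numerically. The NumPy sketch below (purely illustrative; p_w_t, p_t_d and p_d are hypothetical trained PLSA tables) rebuilds the factors from the conditional parameterization and confirms that their product equals the joint matrix P(w_j, D_i).

```python
import numpy as np

def plsa_as_factorization(p_w_t, p_t_d, p_d):
    """p_w_t: (V, K) = P(w|T), p_t_d: (K, N) = P(T|D), p_d: (N,) = P(D)."""
    p_t = p_t_d @ p_d                                  # P(T_k) = sum_i P(T_k|D_i) P(D_i)
    p_d_t = (p_t_d * p_d[None, :]).T / p_t[None, :]    # (N, K): P(D_i|T_k) by Bayes' rule
    U_hat, Sigma_hat, V_hat = p_w_t, np.diag(p_t), p_d_t
    # P(w_j, D_i) = sum_k P(w_j|T_k) P(T_k|D_i) P(D_i)
    joint = p_w_t @ (p_t_d * p_d[None, :])
    assert np.allclose(U_hat @ Sigma_hat @ V_hat.T, joint)
    return U_hat, Sigma_hat, V_hat
```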

slide-61
SLIDE 61

IR 2004 – Berlin Chen 61

Probabilistic Latent Semantic Analysis (cont.)

  • Latent Probability Space

[Figure: the m×n matrix $P = \left(P(w_j, D_i)\right)$, with rows $w_1 \ldots w_m$ and columns $D_1 \ldots D_n$, factored as $U_{m \times k}\, \Sigma_k\, V^{T}$ over the latent topics $t_1, \ldots, t_K$]

$P(w_j, D_i) = \sum_{T_k} P(w_j \mid T_k)\, P(T_k)\, P(D_i \mid T_k)$

slide-62
SLIDE 62

IR 2004 – Berlin Chen 62

Probabilistic Latent Semantic Analysis (cont.)

  • One more example on TDT1 dataset

[Figure: example latent topics, e.g. aviation / space missions and family / love / Hollywood]

slide-63
SLIDE 63

IR 2004 – Berlin Chen 63

Probabilistic Latent Semantic Analysis (cont.)

  • Comparison with LSI

– Decomposition/Approximation

  • LSI: least-squares criterion measured on the L2 (Frobenius) norm of the word-doc matrices
  • PLSA: maximization of the likelihood function, based on the cross entropy or Kullback-Leibler divergence between the empirical distribution and the model

– Computational complexity

  • LSI: SVD decomposition
  • PLSA: EM training, which is time-consuming over the iterations (?)
slide-64
SLIDE 64

IR 2004 – Berlin Chen 64

Probabilistic Latent Semantic Analysis (cont.)

  • Experimental Results: PLSI-U*

– Two ways to smooth the empirical distribution with PLSI

  • Combine the cosine score with that of the vector space model (so does LSI)
  • Combine the multinomials individually:

$P_{\text{PLSI-U}^{*}}(w_j \mid d_i) = \lambda\, P_{\text{Empirical}}(w_j \mid d_i) + (1 - \lambda)\, P_{\text{PLSA}}(w_j \mid d_i)$, where $P_{\text{Empirical}}(w_j \mid d_i) = \dfrac{n(w_j, d_i)}{n(d_i)}$

Both provide almost identical performance
– It is not known if PLSA was used alone

slide-65
SLIDE 65

IR 2004 – Berlin Chen 65

Probabilistic Latent Semantic Analysis (cont.)

  • Experimental Results: PLSI-Q*

– Use the low-dimensional representations $P(T_k \mid Q)$ and $P(T_k \mid D_i)$ (viewed in a k-dimensional latent space) to evaluate relevance by means of the cosine measure:

$R_{\text{PLSI-Q}^{*}}(Q, D_i) = \dfrac{\sum_{k} P(T_k \mid Q)\, P(T_k \mid D_i)}{\sqrt{\sum_{k} P(T_k \mid Q)^{2}}\ \sqrt{\sum_{k} P(T_k \mid D_i)^{2}}}$

where $P(T_k \mid Q) = \dfrac{\sum_{q_n \in Q} n(q_n, Q)\, P(T_k \mid q_n, Q)}{|Q|}$; queries are folded in online

– Combine the cosine score with that of the vector space model:

$\tilde R_{\text{PLSI-Q}^{*}}(Q, D_i) = \lambda \cdot R_{\text{PLSI}}(Q, D_i) + (1 - \lambda) \cdot R_{\text{VSM}}(Q, D_i)$

– Use the ad hoc approach to re-weight the different model components (dimensions)
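The PLSI-Q* measure is just a cosine between two K-dimensional topic-posterior vectors. A small Python sketch (illustrative only) is shown below; the folded-in query posteriors P(T_k|Q) are assumed to be given, since in practice they come from a few folding-in EM iterations.

```python
import math

def plsi_q_relevance(p_t_q, p_t_d):
    """Cosine between P(T|Q) and P(T|D_i) in the K-dimensional latent topic space."""
    dot = sum(a * b for a, b in zip(p_t_q, p_t_d))
    nq = math.sqrt(sum(a * a for a in p_t_q))
    nd = math.sqrt(sum(b * b for b in p_t_d))
    return dot / (nq * nd) if nq and nd else 0.0

def combined_score(r_plsi, r_vsm, lam=0.5):
    """Interpolation with the vector space model score, as on the slide."""
    return lam * r_plsi + (1 - lam) * r_vsm
```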
slide-66
SLIDE 66

IR 2004 – Berlin Chen 66

Probabilistic Latent Semantic Analysis (cont.)

  • Why ?

– Reminder: in LSI, the relations between any two docs can be formulated as $A'^{T}A' = (V'\Sigma')(V'\Sigma')^{T}$, i.e.

$\mathrm{sim}(\hat D_i, \hat D_s) = \mathrm{cosine}(\hat D_i\,\Sigma,\ \hat D_s\,\Sigma) = \dfrac{(\hat D_i\,\Sigma)(\hat D_s\,\Sigma)^{T}}{\|\hat D_i\,\Sigma\|\ \|\hat D_s\,\Sigma\|}$  ($\hat D_i$ and $\hat D_s$ are row vectors)

– PLSA mimics LSI in the similarity measure: using $P(D_i \mid T_k)\, P(T_k) = P(T_k \mid D_i)\, P(D_i)$,

$R_{\text{PLSI-Q}^{*}}(D_i, D_s) = \dfrac{\sum_{k} \left[P(D_i \mid T_k)\, P(T_k)\right]\left[P(D_s \mid T_k)\, P(T_k)\right]}{\sqrt{\sum_{k} \left[P(D_i \mid T_k)\, P(T_k)\right]^{2}}\ \sqrt{\sum_{k} \left[P(D_s \mid T_k)\, P(T_k)\right]^{2}}} = \dfrac{\sum_{k} P(T_k \mid D_i)\, P(D_i)\, P(T_k \mid D_s)\, P(D_s)}{\sqrt{\sum_{k} \left[P(T_k \mid D_i)\, P(D_i)\right]^{2}}\ \sqrt{\sum_{k} \left[P(T_k \mid D_s)\, P(D_s)\right]^{2}}} = \dfrac{\sum_{k} P(T_k \mid D_i)\, P(T_k \mid D_s)}{\sqrt{\sum_{k} P(T_k \mid D_i)^{2}}\ \sqrt{\sum_{k} P(T_k \mid D_s)^{2}}}$

slide-67
SLIDE 67

IR 2004 – Berlin Chen 67

Probabilistic Latent Semantic Analysis (cont.)

  • Experimental Results
slide-68
SLIDE 68

IR 2004 – Berlin Chen 68

Comparisons

  • TDT-3 Voice of America Spoken Document Collection

– Measured in mean Average Precision (mAP), using both word- and syllable-level indexing features & MCE training; TMM and PLSA use 256 topics

  Retrieval Model    HMM       VSM       LSI       TMM       PLSA
  TD                 0.7174    0.6505    0.6440    0.7870    0.6513
  SD                 0.7156    0.6216    0.6390    0.7852    0.5989