Chapter 4: Advanced IR Models 4.1 Probabilistic IR 4.2 Statistical - PowerPoint PPT Presentation

Chapter 4: Advanced IR Models 4.1 Probabilistic IR 4.2 Statistical Language Models (LMs) 4.3 Latent-Concept Models 4.3.1 Foundations from Linear Algebra 4.3.2 Latent Semantic Indexing (LSI) 4.3.3 Probabilistic Aspect Model (pLSI) 4-1 IRDM WS 2005

Key Idea of Latent Concept Models Objective: Transformation of document vectors from high-dimensional term vector space into lower-dimensional topic vector space with • exploitation of term correlations (e.g. „Web“ and „Internet“ frequently occur in together) • implicit differentiation of polysems that exhibit different term correlations for different meanings (e.g. „Java“ with „Library“ vs. „Java“ with „Kona Blend“ vs. „Java“ with „Borneo“) mathematically: given: m terms, n docs (usually n > m) and a m × n term-document similarity matrix A, needed: largely similarity-preserving mapping of column vectors of A into k-dimensional vector space (k << m) for given k 4-2 IRDM WS 2005

4.3.1 Foundations from Linear Algebra A set S of vectors is called linearly independent if no x ∈ S can be written as a linear combination of other vectors in S. The rank of matrix A is the maximal number of linearly independent row or column vectors. A basis of an n × n matrix A is a set S of row or column vectors such that all rows or columns are linear combinations of vectors from S. A set S of n × 1 vectors is an orthonormal basis if for all x, y ∈ S: n 2 = = = ⋅ = x : X 1 y and x y 0 i ∑ 2 2 = i 1 4-3 IRDM WS 2005

Eigenvalues and Eigenvectors Let A be a real-valued n × n matrix, x a real-valued n × 1 vector, and λ a real-valued scalar. Solutions x and λ of the equation A × x = λ x are called an Eigenvector and Eigenvalue of A. Eigenvectors of A are vectors whose direction is preserved by the linear transformation described by A. The Eigenvalues of A are the roots (Nullstellen) of the characteristic polynom f( λ ) of A: λ = − λ = f ( ) A I 0 with the determinant (developing the i-th row): n where matrix A (ij) is derived from A by + i j ( ij ) = − A ( 1 ) a A ij ∑ removing the i-th row and the j-th column = j 1 The real-valued n × n matrix A is symmetric if a ij =a ji for all i, j. A is positive definite if for all n × 1 vectors x ≠ 0 : x T × A × x > 0. If A is symmetric then all Eigenvalues of A are A real. If A is symmetric and positive definite then all Eigenvalues are positive. 4-4 IRDM WS 2005

Illustration of Eigenvectors 2 1   =  A Matrix  1 3   describes affine transformation x Ax a Eigenvector x1 = (0.52 0.85) T for Eigenvalue λ 1=3.62 Eigenvector x2 = (0.85 -0.52) T for Eigenvalue λ 2=1.38 4-5 IRDM WS 2005

Principal Component Analysis (PCA) Spectral Theorem: (PCA, Karhunen-Loewe transform): Let A be a symmetric n × n matrix with Eigenvalues λ 1, ..., λ n = x 1 and Eigenvectors x1, ..., xn such that for all i. i 2 The Eigenvectors form an orthonormal basis of A. Then the following holds: D = Q T × A × Q, where D is a diagonal matrix with diagonal elements λ 1, ..., λ n and Q consists of column vectors x1, ..., xn. often applied to covariance matrix of n-dim. data points 4-6 IRDM WS 2005

Singular Value Decomposition (SVD) Theorem: Each real-valued m × n matrix A with rank r can be decomposed into the form A = U × × ∆ × × ∆ ∆ × ∆ × V T × × with an m × × r matrix U with orthonormal column vectors, × × an r × × × × r diagonal matrix ∆ ∆ ∆ , and ∆ an n × × × × r matrix V with orthonormal column vectors. This decomposition is called singular value decomposition and is unique when the elements of ∆ or sorted. Theorem: In the singular value decomposition A = U × ∆ × V T of matrix A the matrices U, ∆ , and V can be derived as follows: • ∆ consists of the singular values of A, i.e. the positive roots of the Eigenvalues of A T × A, • the columns of U are the Eigenvectors of A × A T , • the columns of V are the Eigenvectors of A T × A. 4-7 IRDM WS 2005

SVD for Regression Theorem: Let A be an m × n matrix with rank r, and let A k = U k × × × × ∆ ∆ ∆ ∆ k × × × × V k T , where the k × k diagonal matrix ∆ k contains the k largest singular values of A and the m × k matrix U k and the n × k matrix V k contain the corresponding Eigenvectors from the SVD of A. Among all m × n matrices C with rank at most k A k is the matrix that minimizes the Frobenius norm m n y 2 2 − = − A C ( A C ) ij ij ∑ ∑ F = = i 1 j 1 y‘ Example: x‘ m=2, n=8, k=1 projection onto x‘ axis minimizes „error“ or maximizes „variance“ in k-dimensional space x 4-8 IRDM WS 2005

4.3.2 Latent Semantic Indexing (LSI) [Deerwester et al. 1990] : Applying SVD to Vector Space Model A is the m × n term-document similarity matrix. Then: • U and U k are the m × r and m × k term-topic similarity matrices, • V and V k are the n × r and n × k document-topic similarity matrices, • A × A T and A k × A k T are the m × m term-term similarity matrices, • A T × A and A k T × A k are the n × n document-document similarity matrices latent doc j Σ Σ Σ Σ Σ Σ k topic t Σ Σ V T T U V k U k A doc j .............. .............. .............. .............. σ 1 σ 1 ....... ......... .. ........ 0 ≈ ≈ ≈ ≈ × × × × × × × × 0 × × ........................ ........................ ........ ........... σ k ...................... = latent ...................... term i 0 topic t 0 σ r k × r × r × k r × n k × × n × × × × m × n m × × n × × m × r m × × k × × mapping of m × 1 vectors into latent-topic space: × = T d U d : d ' a j k j j × = T q U q : q' a k T ) *j ) T × q’ scalar-product similarity in latent-topic space: d j ‘ T × q‘ = (( ∆ k V k 4-9 IRDM WS 2005

Indexing and Query Processing T corresponds to a „topic index“ and • The matrix ∆ ∆ ∆ ∆ k V k is stored in a suitable data structure. Instead of ∆ k V k T the simpler index V k T could be used. • Additionally the term-topic mapping U k must be stored. • A query q (an m × 1 column vector) in the term vector space T × × × q (a k × 1 column vector) × is transformed into query q‘= U k and evaluated in the topic vector space (i.e. V k ) T × q‘ or cosine similarity) (e.g. by scalar-product similarity V k • A new document d (an m × 1 column vector) is transformed into T × × d (a k × 1 column vector) and × × d‘ = U k T as an additional column („folding-in“) appended to the „index“ V k 4-10 IRDM WS 2005

Example 1 for Latent Semantic Indexing m=5 (interface, library, Java, Kona, blend), n=7 1 2 1 5 0 0 0 0 . 58 0 . 00       1 2 1 5 0 0 0  0 . 58 0 . 00    9 . 64 0 . 00 0 . 18 0 . 36 0 . 18 0 . 90 0 . 00 0 . 00 0 . 00   = = × × A 1 2 1 5 0 0 0 0 . 58 0 . 00          0 . 00 5 . 29   0 . 00 0 . 00 0 . 00 0 . 00 0 . 53 0 . 80 0 . 27    0 0 0 0 2 3 1   0 . 00 0 . 71         0 0 0 0 2 3 1 0 . 00 0 . 71   ∆   V T     U query q = (0 0 1 0 0) T is transformed into q‘ = U T × q = (0.58 0.00) T and evaluated on V T the new document d8 = (1 1 0 0 0) T is transformed into d8‘ = U T × d8 = (1.16 0.00) T and appended to V T 4-11 IRDM WS 2005

Example 2 for Latent Semantic Indexing n=5 documents m=6 terms d1: How to bake bread without recipes t1: bak(e,ing) d2: The classic art of Viennese Pastry t2: recipe(s) d3: Numerical recipes: the art of t3: bread scientific computing t4: cake d4: Breads, pastries, pies and cakes: t5: pastr(y,ies) quantity baking recipes t6: pie d5: Pastry: a book of best French recipes 0 . 5774 0 . 0000 0 . 0000 0 . 4082 0 . 0000     0 . 5774 0 . 0000 1 . 0000 0 . 4082 0 . 7071   0 . 5774 0 . 0000 0 . 0000 0 . 4082 0 . 0000   = A   0 . 0000 0 . 0000 0 . 0000 0 . 4082 0 . 0000     0 . 0000 1 . 0000 0 . 0000 0 . 4082 0 . 7071     0 . 0000 0 . 0000 0 . 0000 0 . 4082 0 . 0000   4-12 IRDM WS 2005

Example 2 for Latent Semantic Indexing (2) − − 0 . 2670 0 . 2567 0 . 5308 0 . 2847   − −  0 . 7479 0 . 3981 0 . 5249 0 . 0816    − − 0 . 2670 0 . 2567 0 . 5308 0 . 2847 = A   U −   0 . 1182 0 . 0127 0 . 2774 0 . 6394   − 0 . 5198 0 . 8423 0 . 0838 0 . 1158   −   0 . 1182 0 . 0127 0 . 2774 0 . 6394     1 . 6950 0 . 0000 0 . 0000 0 . 0000    0 . 0000 1 . 1158 0 . 0000 0 . 0000  ∆ ×   0 . 0000 0 . 0000 0 . 8403 0 . 0000     0 . 0000 0 . 0000 0 . 0000 0 . 4195     0 . 4366 0 . 3067 0 . 4412 0 . 4909 0 . 5288   − − −  0 . 4717 0 . 7549 0 . 3568 0 . 0346 0 . 2815  × V T   − − 0 . 3688 0 . 0998 0 . 6247 0 . 5711 0 . 3712   − − −   0 . 6715 0 . 2760 0 . 1945 0 . 6571 0 . 0577     4-13 IRDM WS 2005

Chapter 4: Advanced IR Models 4.1 Probabilistic IR 4.2 Statistical - PowerPoint PPT Presentation

Chapter 4: Advanced IR Models 4.1 Probabilistic IR 4.2 Statistical Language Models (LMs) 4.3 Latent-Concept Models 4.3.1 Foundations from Linear Algebra 4.3.2 Latent Semantic Indexing (LSI) 4.3.3 Probabilistic Aspect Model (pLSI) 4-1 IRDM

Topics 11/13/2006 Chapter 11, start Chapter 12 11/20/2006 Chapter 12 11/27/2006 Chapter 13

Topics 11/13/2006 Chapter 11, start Chapter 12 11/20/2006 Chapter 12 Inheritance Concepts

Advanced NLU & Dialog Models Ling575 Spoken Dialog Systems April 21, 2016 Roadmap

1/88 Presentation: Advanced Techniques 2/88 Presentation: Advanced Techniques 3/88

Advanced Nutrition Course Advanced Nutrition Course 6 Week Advanced Nutrition Live Online

Chapter 13 Chapter 13 1 What is this? Chapter 13 2 What is this? Chapter 13 3 What is

Special layout models Chapter 7 (Warehouse Operations) Chapter 10 (Facility Planning Models)

Resource and Application Resource and Application Models for Advanced Grid Models for Advanced

CHAPTER CHAPTER VII CHAPTER CHAPTER VII VII VII MANAGEMENT AND MANAGEMENT AND

Appendix A Chapter 9 versus Chapter 1 1 at a Glance Chapter 9 Chapter 1 1 ( I n) voluntary Cannot

CHAPTER II I CHAPTER I Recurrent Neural Networks Recurrent Neural Networks CHAPTER II : I :

Pushdown Automata Chapter 5 Chapter 5 Chapter 5 Chapter 5

Chapter 6 Programme design and development Lets Recap Chapter 2: Chapter 3: Chapter 1:

OWASP London Chapter Meeting 27th July 2017 London Chapter Chapter Leaders: Sam

Constraint Satisfaction Problem s C t i t S ti f ti P bl Reading: Chapter 6 (3 rd ed );

Chapter 3 Chapter 3 Data Description McGraw-Hill, Bluman, 7 th ed, Chapter 3 1 Ch Chapter 3

Acts Series Lesson #93 December 11, 2012 Dean Bible Ministries www.deanbible.org Dr. Robert L.

CapitaLand Limited 1H 2018 Financial Results 8 August 2018 CapitaLand Limited 1H 2018 Results 1

Outline enhanced diagnostics and potential treatment strategies To understand rare cancers

Moral Hazard and Efficiency J. Parman (College of William & Mary) Regulation of Markets,

De Developme elopment nt (HUFED) (HUFED) Cen Center ter An An NG NGFN W FN Webinar binar

1 James 1:1a James , a servant of God and of the Lord Jesus Christ - James is used 42

DIOCESAN SYNOD SATURDAY 9 MARCH 2019 Walking, Welcoming, Growing DIOCESAN SYNOD 01/19 WELCOMES

& St. Mina Coptic Orthodox Church Pre-Service Training Seminars Lecture 1: THE SPIRITUALITY

Chapter 4: Advanced IR Models 4.1 Probabilistic IR 4.2 Statistical - PowerPoint PPT Presentation

Chapter 4: Advanced IR Models 4.1 Probabilistic IR 4.2 Statistical Language Models (LMs) 4.3 Latent-Concept Models 4.3.1 Foundations from Linear Algebra 4.3.2 Latent Semantic Indexing (LSI) 4.3.3 Probabilistic Aspect Model (pLSI) 4-1 IRDM

Topics 11/13/2006 Chapter 11, start Chapter 12 11/20/2006 Chapter 12 11/27/2006 Chapter 13

Topics 11/13/2006 Chapter 11, start Chapter 12 11/20/2006 Chapter 12 Inheritance Concepts

Advanced NLU &amp; Dialog Models Ling575 Spoken Dialog Systems April 21, 2016 Roadmap

1/88 Presentation: Advanced Techniques 2/88 Presentation: Advanced Techniques 3/88

Advanced Nutrition Course Advanced Nutrition Course 6 Week Advanced Nutrition Live Online

Chapter 13 Chapter 13 1 What is this? Chapter 13 2 What is this? Chapter 13 3 What is

Special layout models Chapter 7 (Warehouse Operations) Chapter 10 (Facility Planning Models)

Resource and Application Resource and Application Models for Advanced Grid Models for Advanced

CHAPTER CHAPTER VII CHAPTER CHAPTER VII VII VII MANAGEMENT AND MANAGEMENT AND

Appendix A Chapter 9 versus Chapter 1 1 at a Glance Chapter 9 Chapter 1 1 ( I n) voluntary Cannot

CHAPTER II I CHAPTER I Recurrent Neural Networks Recurrent Neural Networks CHAPTER II : I :

Pushdown Automata Chapter 5 Chapter 5 Chapter 5 Chapter 5

Chapter 6 Programme design and development Lets Recap Chapter 2: Chapter 3: Chapter 1:

OWASP London Chapter Meeting 27th July 2017 London Chapter Chapter Leaders: Sam

Constraint Satisfaction Problem s C t i t S ti f ti P bl Reading: Chapter 6 (3 rd ed );

Chapter 3 Chapter 3 Data Description McGraw-Hill, Bluman, 7 th ed, Chapter 3 1 Ch Chapter 3

Acts Series Lesson #93 December 11, 2012 Dean Bible Ministries www.deanbible.org Dr. Robert L.

CapitaLand Limited 1H 2018 Financial Results 8 August 2018 CapitaLand Limited 1H 2018 Results 1

Outline enhanced diagnostics and potential treatment strategies To understand rare cancers

Moral Hazard and Efficiency J. Parman (College of William &amp; Mary) Regulation of Markets,

De Developme elopment nt (HUFED) (HUFED) Cen Center ter An An NG NGFN W FN Webinar binar

1 James 1:1a James , a servant of God and of the Lord Jesus Christ - James is used 42

DIOCESAN SYNOD SATURDAY 9 MARCH 2019 Walking, Welcoming, Growing DIOCESAN SYNOD 01/19 WELCOMES

&amp; St. Mina Coptic Orthodox Church Pre-Service Training Seminars Lecture 1: THE SPIRITUALITY

Advanced NLU & Dialog Models Ling575 Spoken Dialog Systems April 21, 2016 Roadmap

Moral Hazard and Efficiency J. Parman (College of William & Mary) Regulation of Markets,

& St. Mina Coptic Orthodox Church Pre-Service Training Seminars Lecture 1: THE SPIRITUALITY