

SLIDE 1

Linear Algebraic Models in Information Retrieval

Nathan Pruitt and Rami Awwad December 12th, 2016

Nathan Pruitt and Rami Awwad Linear Algebraic Models in Information Retrieval December 12th, 2016 1 / 18

SLIDE 2

Information Retrieval In a Nutshell

Information retrieval (I.R.) is the task of finding information relevant to a search in a database containing documents, images, articles, etc. A practical real-life example: finding an article or book in a library through its catalog system, or through the library's database via a search engine. The most common examples are internet search engines such as Google and Yahoo, but I.R. is also used on many other sites, wherever there is a search feature.


SLIDE 3

A Brief History of I.R. in the Digital Domain

S.M.A.R.T. (System for the Mechanical Analysis and Retrieval of Text) was developed at Cornell University in the 1960s. Its legacy includes the development of I.R. models such as the vector space model.


SLIDE 4

The Vector Space Model

A text-based ranking model common in internet search engines in the early 1990s. It works by building a t × d matrix, where t is the number of terms (potentially every term in an English dictionary) and d is the number of documents in the search engine's database.
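As a minimal sketch of how such a t × d matrix is built, consider a bag-of-words count over a toy corpus (the three document strings below are invented for illustration):

```python
from collections import Counter

# Toy corpus; the document strings are made up for illustration.
docs = ["internet graph internet", "graph directed graph", "internet directed"]

# Vocabulary: one row per distinct term, one column per document.
terms = sorted({t for d in docs for t in d.split()})
counts = [Counter(d.split()) for d in docs]

# M[i][j] = number of times term i occurs in document j.
M = [[c[t] for c in counts] for t in terms]
print(terms)  # ['directed', 'graph', 'internet']
print(M)      # [[0, 1, 1], [1, 2, 0], [2, 0, 1]]
```

A real engine would do the same thing at the scale described above, with t in the hundreds of thousands and d in the millions, stored sparsely.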


SLIDE 5

The Vector Space Model

M =
            d1        d2        d3        d4        d5       ⋯   d1,000,000
t1        m(1,1)    m(1,2)    m(1,3)    m(1,4)    m(1,5)    ⋯   m(1, 1,000,000)
t2        m(2,1)    m(2,2)    m(2,3)    m(2,4)    m(2,5)    ⋯   m(2, 1,000,000)
t3        m(3,1)    m(3,2)    m(3,3)    m(3,4)    m(3,5)    ⋯   m(3, 1,000,000)
t4        m(4,1)    m(4,2)    m(4,3)    m(4,4)    m(4,5)    ⋯   m(4, 1,000,000)
t5        m(5,1)    m(5,2)    m(5,3)    m(5,4)    m(5,5)    ⋯   m(5, 1,000,000)
⋮           ⋮         ⋮         ⋮         ⋮         ⋮       ⋱        ⋮
t300,000  m(300,000, 1)  m(300,000, 2)  m(300,000, 3)  m(300,000, 4)  m(300,000, 5)  ⋯  m(300,000, 1,000,000)

Each entry m(i,j) is given a weight based on the number of times term t(i) occurs in document d(j), processed through an arithmetic weighting scheme. The weights allow comparison between documents, and between a document and a query, via the angles between their column vectors.


SLIDE 6

VSM: A Simpler Example

Mexample =
            doc1    doc2    doc3
internet     38      14      20
graph        10      20       5
directed      0       2      10

Query =
            term
internet     1
graph        1
directed     1

The entries are called term frequencies (tf). Term frequencies are processed through an arithmetic weighting scheme because a higher tf does not necessarily mean a more relevant website. The engine treats the query as a "bag of words": the order of the terms is ignored.


SLIDE 7

Length Normalized t × d Matrix and Query Vector

Query∗ =
            term
internet    1/√3
graph       1/√3
directed    1/√3

Mexample∗ =
            doc1    doc2    doc3
internet    0.790   0.630   0.659
graph       0.612   0.676   0.487
directed    0       0.382   0.573

After the arithmetic weighting scheme is applied, the matrix and the query vector are length normalized. This serves to simplify the calculation of the angles between document vectors, and between the document vectors and the query.
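The slides do not name the weighting scheme, but the sublinear weight 1 + log10(tf) (zero when a term is absent) reproduces the normalized values shown above to within rounding; a sketch under that assumption:

```python
import math

# Raw term frequencies from the example (rows: internet, graph, directed;
# columns: doc1, doc2, doc3). "directed" does not occur in doc1.
tf = [[38, 14, 20],
      [10, 20,  5],
      [ 0,  2, 10]]

# Assumed weighting scheme: 1 + log10(tf), 0 for absent terms.
w = [[1 + math.log10(x) if x > 0 else 0.0 for x in row] for row in tf]

# Length-normalize each document (column) to a unit vector.
rows, cols = len(w), len(w[0])
norms = [math.sqrt(sum(w[i][j] ** 2 for i in range(rows))) for j in range(cols)]
M = [[w[i][j] / norms[j] for j in range(cols)] for i in range(rows)]

for term, row in zip(["internet", "graph", "directed"], M):
    print(term, [round(v, 3) for v in row])
```

Running this prints values matching the Mexample∗ table above to about three decimal places.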


SLIDE 8

VSM: The "Cosine Similarity"

cos(doc1, doc2) = (doc1 · doc2) / (‖doc1‖ ‖doc2‖)
               ≈ [(0.790)(0.630) + (0.612)(0.676) + (0)(0.382)] / 1 ≈ 0.912

cos(doc1, doc3) ≈ 0.819        cos(doc2, doc3) ≈ 0.963
cos(Query, doc1) ≈ 0.810       cos(Query, doc2) ≈ 0.975       cos(Query, doc3) ≈ 0.993

These calculations imply the following angles separate each pair of vectors:

∠(doc1, doc2) ≈ arccos(0.912) · (180°/π) ≈ 24.188°
∠(doc1, doc3) ≈ 34.985°        ∠(doc2, doc3) ≈ 15.530°
∠(Query, doc1) ≈ 35.901°       ∠(Query, doc2) ≈ 12.918°       ∠(Query, doc3) ≈ 7.006°
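These cosine and angle computations can be checked directly; because the vectors are unit length, the cosine similarity reduces to a dot product:

```python
import math

# Length-normalized document and query vectors (rows: internet, graph, directed).
doc1 = [0.790, 0.612, 0.0]
doc2 = [0.630, 0.676, 0.382]
doc3 = [0.659, 0.487, 0.573]
query = [1 / math.sqrt(3)] * 3

def cos_sim(u, v):
    # For unit vectors, ||u|| ||v|| = 1, so the cosine is just the dot product.
    return sum(a * b for a, b in zip(u, v))

def angle_deg(u, v):
    # arccos converted from radians to degrees, as on the slide.
    return math.acos(cos_sim(u, v)) * 180 / math.pi

pairs = {"doc1-doc2": (doc1, doc2), "doc1-doc3": (doc1, doc3),
         "doc2-doc3": (doc2, doc3), "query-doc3": (query, doc3)}
for name, (u, v) in pairs.items():
    print(name, round(cos_sim(u, v), 3), round(angle_deg(u, v), 2))
```

doc3 makes the smallest angle with the query, so it would be ranked as the most relevant document.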


SLIDE 9

VSM: Visualization of Document Vectors and their Shared Angles

Figure: Cosine similarity between doc1 and doc2, and between doc2 and doc3. Figure: Cosine similarity between doc1 and doc3.


SLIDE 10

VSM: Visualization of Document Vectors and their Shared Angles with Query Vector

Figure: Cosine similarity between doc2 and the query, and between doc3 and the query. Figure: Cosine similarity between doc1 and the query.


SLIDE 11

PageRank Algorithm

Google's matrix has over 8 billion rows and columns.

[Figure: directed graph on websites 1–7]

This directed graph represents the link structure among the websites and defines a Markov chain. The arrows represent links between different websites; for example, website 1 only links to website 2.


SLIDE 12

PageRank Algorithm

P =
        j1      j2      j3      j4      j5      j6      j7
i1      0       0       0       1/2     0       0       0
i2      1       0       1/2     0       1/2     1/4     0
i3      0       1/3     0       0       0       0       0
i4      0       1/3     1/2     0       0       1/4     0
i5      0       0       0       1/2     0       1/4     0
i6      0       1/3     0       0       1/2     0       0
i7      0       0       0       0       0       1/4     1

This matrix P shows the probabilities of movement between these websites. Because website 1 only links to website 2, there is a 100 percent chance of that move. Matrix P is a transition matrix because each entry describes the probability of a transition from state j to state i.
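The transition matrix can be written down and its column sums checked; exact fractions avoid rounding error. (The placement of the entries here is reconstructed from the link graph and the numeric matrix shown later.)

```python
from fractions import Fraction as F

# Entry P[i][j] is the probability of moving from website j+1 to website i+1.
# Website 7 keeps all of its probability here, before the later adjustment.
P = [
    [0, 0,       0,       F(1, 2), 0,       0,       0],
    [1, 0,       F(1, 2), 0,       F(1, 2), F(1, 4), 0],
    [0, F(1, 3), 0,       0,       0,       0,       0],
    [0, F(1, 3), F(1, 2), 0,       0,       F(1, 4), 0],
    [0, 0,       0,       F(1, 2), 0,       F(1, 4), 0],
    [0, F(1, 3), 0,       0,       F(1, 2), 0,       0],
    [0, 0,       0,       0,       0,       F(1, 4), 1],
]

# Every column sums to exactly 1, i.e. P is column-stochastic.
col_sums = [sum(P[i][j] for i in range(7)) for j in range(7)]
print([int(s) for s in col_sums])  # [1, 1, 1, 1, 1, 1, 1]
```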


SLIDE 13

PageRank Algorithm

Notice that the entries of each column vector in transition matrix P add up to 1. Therefore, all column vectors in P are probability vectors, and our transition matrix is also a stochastic matrix, which describes a Markov chain with some interesting properties. One of these properties states that all stochastic matrices have at least one eigenvalue equal to 1. The eigenvector corresponding to eigenvalue 1 will tell us the rank of our 7 websites, or in Google terms, the PageRank of each website.
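The eigenvalue-1 property follows directly from the column sums. Writing 1 for the all-ones vector, "each column of P sums to 1" says exactly that

1ᵀ P = 1ᵀ   ⟺   Pᵀ 1 = 1,

so 1 is an eigenvalue of Pᵀ; and since P and Pᵀ have the same characteristic polynomial, 1 is an eigenvalue of P as well.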


SLIDE 14

PageRank Algorithm

To approach this eigenvector, we calculate the steady-state vector x_n of our 7-website chain:

x_n = (a_1, …, a_j, …, a_7)ᵀ

All stochastic matrices have a steady-state vector. Our x_n is a probability vector describing the chance of landing on each website after clicking through n links within our chain.


SLIDE 15

PageRank Algorithm

We use this equation to compute steady-state vectors:

lim(n→∞) x_n = lim(n→∞) P_k^n x_0

where x_0 is an initial probability vector and P_k is the adjusted transition matrix introduced on the next slide.


SLIDE 16

Adjustment to Transition Matrix

Google is said to use a damping factor p with a value of 0.85. With website 7's column replaced by uniform entries 1/7, we then retrieve our P_k as follows:

P_k = 0.85 ·

        0       0       0       1/2     0       0       1/7
        1       0       1/2     0       1/2     1/4     1/7
        0       1/3     0       0       0       0       1/7
        0       1/3     1/2     0       0       1/4     1/7
        0       0       0       1/2     0       1/4     1/7
        0       1/3     0       0       1/2     0       1/7
        0       0       0       0       0       1/4     1/7

+ 0.15 ·

        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7

=

        0.02142  0.02142  0.02142  0.44642  0.02142  0.02142  0.14285
        0.87142  0.02142  0.44642  0.02142  0.44642  0.23392  0.14285
        0.02142  0.30476  0.02142  0.02142  0.02142  0.02142  0.14285
        0.02142  0.30476  0.44642  0.02142  0.02142  0.23392  0.14285
        0.02142  0.02142  0.02142  0.44642  0.02142  0.23392  0.14285
        0.02142  0.30476  0.02142  0.02142  0.44642  0.02142  0.14285
        0.02142  0.02142  0.02142  0.02142  0.02142  0.23392  0.14285

SLIDE 17

Final Rank

Iterating P_k to n = 75 from an initial probability vector x_0 gives

x_75 = P_k^75 x_0 ≈

        0.104631
        0.253767
        0.100953
        0.177828
        0.138598
        0.159857
        0.063021

so website 2 has the highest PageRank, followed by websites 4, 6, 5, 1, 3, and 7.


SLIDE 18

Bibliography

1. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval.
2. Michael W. Berry, Zlatko Drmač, Elizabeth R. Jessup. Matrices, Vector Spaces, and Information Retrieval.
3. Raluca Tanase, Remus Radu. The Mathematics of Web Search.
4. M.W. Berry, S.T. Dumais, G.W. O'Brien. Using Linear Algebra for Intelligent Information Retrieval.
5. Howard Anton, Robert C. Busby. Contemporary Linear Algebra.
