

SLIDE 1

Linear Algebraic Models in Information Retrieval

Nathan Pruitt and Rami Awwad December 12th, 2016

Nathan Pruitt and Rami Awwad Linear Algebraic Models in Information Retrieval December 12th, 2016 1 / 18

SLIDE 2

Information Retrieval In a Nutshell

Information retrieval (I.R.) is the task of finding information relevant to a search in a database containing documents, images, articles, etc. A practical real-life example: finding an article or book in a library through its catalog system, or through the library's database via a search engine. The most common examples are internet search engines such as Google and Yahoo, but I.R. is also used on many other sites, wherever there is a search feature.


SLIDE 3

A Brief History of I.R. in the Digital Domain

S.M.A.R.T. (System for the Mechanical Analysis and Retrieval of Text) was developed at Cornell University in the 1960s. Its legacy includes the development of I.R. models such as the vector space model.


SLIDE 4

The Vector Space Model

A text-based ranking model common in internet search engines in the early 1990s. It works by building a t × d matrix, where t is the number of terms (potentially every term in an English dictionary) and d is the number of documents in the search engine's database.
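As a minimal sketch of how such a t × d matrix is built, consider a bag-of-words count over a toy corpus (the three document strings below are invented for illustration):

```python
from collections import Counter

# Toy corpus; the document strings are made up for illustration.
docs = ["internet graph internet", "graph directed graph", "internet directed"]

# Vocabulary: one row per distinct term, one column per document.
terms = sorted({t for d in docs for t in d.split()})
counts = [Counter(d.split()) for d in docs]

# M[i][j] = number of times term i occurs in document j.
M = [[c[t] for c in counts] for t in terms]
print(terms)  # ['directed', 'graph', 'internet']
print(M)      # [[0, 1, 1], [1, 2, 0], [2, 0, 1]]
```

A real engine would do the same thing at the scale described above, with t in the hundreds of thousands and d in the millions, stored sparsely.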


SLIDE 5

The Vector Space Model

M =
            d1        d2        d3        d4        d5       ⋯   d1,000,000
t1        m(1,1)    m(1,2)    m(1,3)    m(1,4)    m(1,5)    ⋯   m(1, 1,000,000)
t2        m(2,1)    m(2,2)    m(2,3)    m(2,4)    m(2,5)    ⋯   m(2, 1,000,000)
t3        m(3,1)    m(3,2)    m(3,3)    m(3,4)    m(3,5)    ⋯   m(3, 1,000,000)
t4        m(4,1)    m(4,2)    m(4,3)    m(4,4)    m(4,5)    ⋯   m(4, 1,000,000)
t5        m(5,1)    m(5,2)    m(5,3)    m(5,4)    m(5,5)    ⋯   m(5, 1,000,000)
⋮           ⋮         ⋮         ⋮         ⋮         ⋮       ⋱        ⋮
t300,000  m(300,000, 1)  m(300,000, 2)  m(300,000, 3)  m(300,000, 4)  m(300,000, 5)  ⋯  m(300,000, 1,000,000)

Each entry m(i,j) is given a weight based on the number of times term t(i) occurs in document d(j), processed through an arithmetic weighting scheme. The weights allow comparison between documents, and between a document and a query, via the angles between their column vectors.


SLIDE 6

VSM: A Simpler Example

Mexample =
            doc1    doc2    doc3
internet     38      14      20
graph        10      20       5
directed      0       2      10

Query =
            term
internet     1
graph        1
directed     1

The entries are called term frequencies (tf). Term frequencies are processed through an arithmetic weighting scheme because a higher tf does not necessarily mean a more relevant website. The engine treats the query as a "bag of words": the order of the terms is ignored.


SLIDE 7

Length Normalized t × d Matrix and Query Vector

Query∗ =
            term
internet    1/√3
graph       1/√3
directed    1/√3

Mexample∗ =
            doc1    doc2    doc3
internet    0.790   0.630   0.659
graph       0.612   0.676   0.487
directed    0       0.382   0.573

After the arithmetic weighting scheme is applied, the matrix and the query vector are length normalized. This serves to simplify the calculation of the angles between document vectors, and between the document vectors and the query.
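The slides do not name the weighting scheme, but the sublinear weight 1 + log10(tf) (zero when a term is absent) reproduces the normalized values shown above to within rounding; a sketch under that assumption:

```python
import math

# Raw term frequencies from the example (rows: internet, graph, directed;
# columns: doc1, doc2, doc3). "directed" does not occur in doc1.
tf = [[38, 14, 20],
      [10, 20,  5],
      [ 0,  2, 10]]

# Assumed weighting scheme: 1 + log10(tf), 0 for absent terms.
w = [[1 + math.log10(x) if x > 0 else 0.0 for x in row] for row in tf]

# Length-normalize each document (column) to a unit vector.
rows, cols = len(w), len(w[0])
norms = [math.sqrt(sum(w[i][j] ** 2 for i in range(rows))) for j in range(cols)]
M = [[w[i][j] / norms[j] for j in range(cols)] for i in range(rows)]

for term, row in zip(["internet", "graph", "directed"], M):
    print(term, [round(v, 3) for v in row])
```

Running this prints values matching the Mexample∗ table above to about three decimal places.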


SLIDE 8

VSM: The "Cosine Similarity"

cos(doc1, doc2) = (doc1 · doc2) / (‖doc1‖ ‖doc2‖)
               ≈ [(0.790)(0.630) + (0.612)(0.676) + (0)(0.382)] / 1 ≈ 0.912

cos(doc1, doc3) ≈ 0.819        cos(doc2, doc3) ≈ 0.963
cos(Query, doc1) ≈ 0.810       cos(Query, doc2) ≈ 0.975       cos(Query, doc3) ≈ 0.993

These calculations imply the following angles separate each pair of vectors:

∠(doc1, doc2) ≈ arccos(0.912) · (180°/π) ≈ 24.188°
∠(doc1, doc3) ≈ 34.985°        ∠(doc2, doc3) ≈ 15.530°
∠(Query, doc1) ≈ 35.901°       ∠(Query, doc2) ≈ 12.918°       ∠(Query, doc3) ≈ 7.006°
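These cosine and angle computations can be checked directly; because the vectors are unit length, the cosine similarity reduces to a dot product:

```python
import math

# Length-normalized document and query vectors (rows: internet, graph, directed).
doc1 = [0.790, 0.612, 0.0]
doc2 = [0.630, 0.676, 0.382]
doc3 = [0.659, 0.487, 0.573]
query = [1 / math.sqrt(3)] * 3

def cos_sim(u, v):
    # For unit vectors, ||u|| ||v|| = 1, so the cosine is just the dot product.
    return sum(a * b for a, b in zip(u, v))

def angle_deg(u, v):
    # arccos converted from radians to degrees, as on the slide.
    return math.acos(cos_sim(u, v)) * 180 / math.pi

pairs = {"doc1-doc2": (doc1, doc2), "doc1-doc3": (doc1, doc3),
         "doc2-doc3": (doc2, doc3), "query-doc3": (query, doc3)}
for name, (u, v) in pairs.items():
    print(name, round(cos_sim(u, v), 3), round(angle_deg(u, v), 2))
```

doc3 makes the smallest angle with the query, so it would be ranked as the most relevant document.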


SLIDE 9

VSM: Visualization of Document Vectors and their Shared Angles

Figure: Cosine similarity between doc1 and doc2, and between doc2 and doc3. Figure: Cosine similarity between doc1 and doc3.


SLIDE 10

VSM: Visualization of Document Vectors and their Shared Angles with Query Vector

Figure: Cosine similarity between doc2 and the query, and between doc3 and the query. Figure: Cosine similarity between doc1 and the query.


SLIDE 11

PageRank Algorithm

Google's matrix has over 8 billion rows and columns.

[Figure: directed graph on websites 1–7]

This directed graph represents the link structure among the websites and defines a Markov chain. The arrows represent links between different websites; for example, website 1 only links to website 2.


SLIDE 12

PageRank Algorithm

P =
        j1      j2      j3      j4      j5      j6      j7
i1      0       0       0       1/2     0       0       0
i2      1       0       1/2     0       1/2     1/4     0
i3      0       1/3     0       0       0       0       0
i4      0       1/3     1/2     0       0       1/4     0
i5      0       0       0       1/2     0       1/4     0
i6      0       1/3     0       0       1/2     0       0
i7      0       0       0       0       0       1/4     1

This matrix P shows the probabilities of movement between these websites. Because website 1 only links to website 2, there is a 100 percent chance of that move. Matrix P is a transition matrix because each entry describes the probability of a transition from state j to state i.
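The transition matrix can be written down and its column sums checked; exact fractions avoid rounding error. (The placement of the entries here is reconstructed from the link graph and the numeric matrix shown later.)

```python
from fractions import Fraction as F

# Entry P[i][j] is the probability of moving from website j+1 to website i+1.
# Website 7 keeps all of its probability here, before the later adjustment.
P = [
    [0, 0,       0,       F(1, 2), 0,       0,       0],
    [1, 0,       F(1, 2), 0,       F(1, 2), F(1, 4), 0],
    [0, F(1, 3), 0,       0,       0,       0,       0],
    [0, F(1, 3), F(1, 2), 0,       0,       F(1, 4), 0],
    [0, 0,       0,       F(1, 2), 0,       F(1, 4), 0],
    [0, F(1, 3), 0,       0,       F(1, 2), 0,       0],
    [0, 0,       0,       0,       0,       F(1, 4), 1],
]

# Every column sums to exactly 1, i.e. P is column-stochastic.
col_sums = [sum(P[i][j] for i in range(7)) for j in range(7)]
print([int(s) for s in col_sums])  # [1, 1, 1, 1, 1, 1, 1]
```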


SLIDE 13

PageRank Algorithm

Notice that the entries of each column vector in transition matrix P add up to 1. Therefore, all column vectors in P are probability vectors, and our transition matrix is also a stochastic matrix, which describes a Markov chain with some interesting properties. One of these properties states that all stochastic matrices have at least one eigenvalue equal to 1. The eigenvector corresponding to eigenvalue 1 will tell us the rank of our 7 websites, or in Google terms, the PageRank of each website.
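The eigenvalue-1 property follows directly from the column sums. Writing 1 for the all-ones vector, "each column of P sums to 1" says exactly that

1ᵀ P = 1ᵀ   ⟺   Pᵀ 1 = 1,

so 1 is an eigenvalue of Pᵀ; and since P and Pᵀ have the same characteristic polynomial, 1 is an eigenvalue of P as well.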


SLIDE 14

PageRank Algorithm

To approach this eigenvector, we calculate the steady-state vector x_n of our 7-website chain:

x_n = (a_1, …, a_j, …, a_7)ᵀ

All stochastic matrices have a steady-state vector. Our x_n is a probability vector describing the chance of landing on each website after clicking through n links within our chain.


SLIDE 15

PageRank Algorithm

We use this equation to compute steady-state vectors:

lim(n→∞) x_n = lim(n→∞) P_k^n x_0

where x_0 is an initial probability vector and P_k is the adjusted transition matrix introduced on the next slide.


SLIDE 16

Adjustment to Transition Matrix

Google is said to use a damping factor p with a value of 0.85. With website 7's column replaced by uniform entries 1/7, we then retrieve our P_k as follows:

P_k = 0.85 ·

        0       0       0       1/2     0       0       1/7
        1       0       1/2     0       1/2     1/4     1/7
        0       1/3     0       0       0       0       1/7
        0       1/3     1/2     0       0       1/4     1/7
        0       0       0       1/2     0       1/4     1/7
        0       1/3     0       0       1/2     0       1/7
        0       0       0       0       0       1/4     1/7

+ 0.15 ·

        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7
        1/7     1/7     1/7     1/7     1/7     1/7     1/7

=

        0.02142  0.02142  0.02142  0.44642  0.02142  0.02142  0.14285
        0.87142  0.02142  0.44642  0.02142  0.44642  0.23392  0.14285
        0.02142  0.30476  0.02142  0.02142  0.02142  0.02142  0.14285
        0.02142  0.30476  0.44642  0.02142  0.02142  0.23392  0.14285
        0.02142  0.02142  0.02142  0.44642  0.02142  0.23392  0.14285
        0.02142  0.30476  0.02142  0.02142  0.44642  0.02142  0.14285
        0.02142  0.02142  0.02142  0.02142  0.02142  0.23392  0.14285

SLIDE 17

Final Rank

Iterating P_k to n = 75 from an initial probability vector x_0 gives

x_75 = P_k^75 x_0 ≈

        0.104631
        0.253767
        0.100953
        0.177828
        0.138598
        0.159857
        0.063021

so website 2 has the highest PageRank, followed by websites 4, 6, 5, 1, 3, and 7.


SLIDE 18

Bibliography

1. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. Introduction to Information Retrieval.
2. Michael W. Berry, Zlatko Drmač, Elizabeth R. Jessup. Matrices, Vector Spaces, and Information Retrieval.
3. Raluca Tanase, Remus Radu. The Mathematics of Web Search.
4. M.W. Berry, S.T. Dumais, G.W. O'Brien. Using Linear Algebra for Intelligent Information Retrieval.
5. Howard Anton, Robert C. Busby. Contemporary Linear Algebra.
