Data Mining and Matrices 10 Graphs II Rainer Gemulla, Pauli - PowerPoint PPT Presentation

Data Mining and Matrices 10 – Graphs II Rainer Gemulla, Pauli Miettinen Jul 4, 2013

Link analysis The web as a directed graph ◮ Set of web pages with associated textual content ◮ Hyperlinks between webpages (potentially with anchor text) → Directed graph Our focus: Which pages are “relevant” (to a query)? ◮ Analysis of link structure instrumental for web search ◮ Assumption: incoming link is a quality signal (endorsement) ◮ Page has high quality ≈ links from/to high-quality pages ◮ (We are ignoring anchor text in this lecture.) Gives rise to HITS and PageRank algorithms Similarly: citations of scientific papers, social networks, . . . v 2 v 4 v 1 v 3 v 5 2 / 45

Outline Background: Power Method 1 HITS 2 Background: Markov Chains 3 PageRank 4 Summary 5 3 / 45

Eigenvectors and diagonalizable matrices Denote by A an n × n real matrix Recap eigenvectors ◮ v is a right eigenvector with eigenvalue λ of A if Av = λ v ◮ v is a left eigenvector with eigenvalue λ of A if vA = λ v ◮ If v is a right eigenvector of A , then v T is a left eigenvector of A T (and vice versa) A is diagonalizable if it has n linearly independent eigenvectors ◮ Some matrices are not diagonalizable (called defective ) ◮ If A is symmetric (our focus), it is diagonalizable ◮ If A is symmetric, v 1 , . . . , v n can be chosen to be real and orthonormal → These eigenvectors then form an orthonormal basis of R n ◮ Denote by λ 1 , . . . , λ n are the corresponding eigenvalues (potentially 0) ◮ Then for every x ∈ R n , there exist c 1 , . . . , c n such that x = c 1 v 1 + c 2 v 2 + · · · + c n v n ◮ And therefore Ax = λ 1 c 1 v 1 + λ 2 c 2 v 2 + · · · + λ n c n v n ◮ Eigenvectors “explain” effect of linear transformation A 4 / 45

Example x ~ x λ 2 v 2 λ 1 v 1 λ 2 v 2 λ 1 v 1 ˜ λ 1 = 2, λ 2 = 1 x = Ax 5 / 45

Power method Simple method to determine the largest eigenvalue λ 1 and the corresponding eigenvector v 1 Algorithm Start at some x 0 1 While not converged 2 Set ˜ x t +1 ← Ax t 1 Normalize: x t +1 ← ˜ x t +1 / � ˜ x t +1 � 2 What happens here? ◮ Observe that x t = A t x 0 / C , where C = � A t x 0 � ◮ Assume that A is real symmetric ◮ Then x t = ( λ t 1 c 1 v 1 + λ t 2 c 2 v 2 + · · · + λ t n c n v n ) / C ◮ If | λ 1 | > | λ 2 | , then � t c 2 λ t � λ 2 2 c 2 lim = lim = 0 λ t λ 1 1 c 1 c 1 t →∞ t →∞ ◮ So as t → ∞ , x t converges to v 1 6 / 45

Power method (example) x ~ 1 x 0 x 1 v 2 v 1 v 2 v 1 v 2 v 1 n = 0 n = 1 n = 1 (normalized) ~ 2 x x ~ 100 x 2 v 2 v 1 v 2 v 1 v 2 v 1 n = 2 n = 2 (normalized) n = 100 7 / 45

Discussion Easy to implement and parallelize We will see: useful for understanding link analysis Convergence ◮ Works if A is real symmetric, | λ 1 | > | λ 2 | , and x 0 �⊥ v 1 (i.e., c 1 � = 0) ◮ Speed depends on eigengap | λ 1 | / | λ 2 | ◮ Also works in many other settings (but not always) 8 / 45

Power method and singular vectors Unit vectors u and v are left and right singular vectors of A if A T u = σ v and Av = σ u σ is the corresponding singular value The SVD decomposition is formed of the singular values ( Σ ) and corresponding left and right singular vectors (columns of U and V ) u is an eigenvector of AA T with eigenvalue σ 2 since AA T u = A σ v = σ Av = σ 2 u Similarly v is an eigenvector of A T A with eigenvalue σ 2 Power method for principal singular vectors u t +1 ← Av t / � Av t � 1 v t +1 ← A T u t +1 / � A T u t +1 � 2 Why does it work? ◮ AA T and A T A are symmetric (and positive semi-definite) ◮ u t +2 = Av t +1 / � Av t +1 � = AA T u t +1 / � AA T u t +1 � 9 / 45

Outline Background: Power Method 1 HITS 2 Background: Markov Chains 3 PageRank 4 Summary 5 10 / 45

Asking Google for search engines 11 / 45

Asking Bing for search engines 12 / 45

Searching the WWW Some difficulties in web search ◮ “search engine”: many of the search engines do not contain phrase “search engine” ◮ “Harvard”: millions of pages contain “Harvard”, but www.harvard.edu may not contain it most often ◮ “lucky”: there is an “I’m feeling lucky” button on google.com , but google.com is (probably) not relevant (popularity) ◮ “automobile”: some pages say “car” instead (synonymy) ◮ “jaguar”: the car or the animal? (polysemy) Query types Specific queries (“name of Michael Jackson’s dog”) 1 → Scarcity problem: few pages contain required information Broad-topic queries (“Java”) 2 → Abundance problem: large number of relevant pages Similar-page queries (“Pages similar to java.com ”) 3 Our focus: broad-topic queries ◮ Goal is to find “most relevant” pages 13 / 45

Hyperlink Induced Topic Search (HITS) HITS analyzes the link structure to mitigate these challenges ◮ Uses links as source of exogenous information ◮ Key idea: If p links to q , p confers “authority” on q → Try to find authorities through links that point to them ◮ HITS aims to balance between relevance to a query (content) and popularity (in-links) HITS uses two notions of relevance ◮ Authority page directly answers information need → Page pointed to by many hubs for the query ◮ Hub page contains link to pages that answer information need → Points to many authorities for the query ◮ Note: circular definition Algorithm Create a focused subgraph of the WWW based on the query 1 Score each page w.r.t. to authority and hub 2 Return the pages with the largest authority scores 3 14 / 45

Hubs and authorities (example) 15 / 45 Manning et al., 2008

Creating a focused subgraph Desiderata Should be small (for efficiency) 1 Should contain most (or many) of the strongest authorities (for recall) 2 Should be rich in relevant pages (for precision) 3 Using all pages that contain query may violate (1) and (2) Construction ◮ Root set : the highest-ranked pages for the query (regular web search) → Satisfies (1) and (3), but often not (2) ◮ Base set : pages that point to or are pointed to from the root set → Increases number of authorities, addressing (2) ◮ Focused subgraph = induced subgraph of base set → Consider all links between pages in the base set 16 / 45

Root set and base set 17 / 45 Kleinberg, 1999

Heuristics Retain efficiency ◮ Focus on t highest ranked pages for the query (e.g., t = 200) → Small root set ◮ Allow each page to bring in at most d pages pointing to it (e.g., d = 50) → Small base set ( ≈ 5000 pages) Try to avoid links that serve a purely navigational function ◮ E.g., link to homepage ◮ Keep transverse links (to different domain) ◮ Ignore intrinsic links (to same domain) Try to avoid links that indicate collusion/advertisement ◮ E.g., “This site is designed by...” ◮ Allow each page to be pointed to at most m times from each domain ( m ≈ 4–8) 18 / 45

Hubs and authorities Simple approach: rank pages by in-degrees in focused subgraph ◮ Works better than on whole web ◮ Still problematic: some pages are “universally popular” regardless of underlying query topic Key idea: weight links from different pages differently ◮ Authoritative pages have high in-degree and a common topic → Considerable overlap in sets of pages that point to authorities ◮ Hub pages “pull together” authorities on a common topic → Considerable overlap in sets of pages that are pointed to by hubs ◮ Mutual reinforcment ⋆ Good hub points to many good authorities ⋆ Good authority is pointed to by many good hubs 19 / 45

Hub and authority scores Denote by G = ( V , E ) the focused subgraph Assign to page p ◮ A non-negative hub weight u p ◮ A non-negative authority weight v p Larger means “better” Authority weight = sum of weights of hubs pointing to the page � v p ← u q ( q , p ) ∈ E Hub weight = sum of weights of authorities pointed to by the page � u p ← v p ( p , q ) ∈ E HITS iterates until it reaches a fixed point ◮ Normalize vectors to length 1 after every iteration (does not affect ranking) 20 / 45

Example � T � u = 0 . 63 0 . 46 0 . 55 0 . 29 0 . 00 0 . 00 0 . 00 (hubs) � T � v = 0 . 00 0 . 00 0 . 00 0 . 21 0 . 42 0 . 46 0 . 75 (authorities) (0 . 63 , 0 . 00) (0 . 00 , 0 . 42) 1 5 (0 . 46 , 0 . 00) (0 . 00 , 0 . 46) 2 6 (0 . 55 , 0 . 00) (0 . 00 , 0 . 75) 3 7 4 (0 . 29 , 0 . 21) 21 / 45

Authorities for Chicago Bulls 22 / 45 Manning et al., 2008

Top-authority for Chicago Bulls 23 / 45 Manning et al., 2008

Hubs for Chicago Bulls 24 / 45 Manning et al., 2008

What happens here? Adjacency matrix A ( A pq = 1 if p links to q ) ◮ v p ← � ( q , p ) ∈ E u q = ( A ∗ p ) T u ◮ Thus: v ← A T u ◮ Similarly u ← Av This is the power method for principal singular vectors ◮ u and v correspond to principal left and right singular vectors of A ◮ u is principal eigenvector of AA T (co-citation matrix) ◮ v is principal eigenvector of A T A (bibliographic coupling matrix) 1 5  0 0 0 0 1 1 1  0 0 0 0 1 0 1   2 6   0 0 0 1 0 1 1     A = 0 0 0 0 0 0 1   3 7   0 0 0 0 0 0 0     0 0 0 0 0 0 0   4 0 0 0 0 0 0 0 25 / 45

Data Mining and Matrices 10 Graphs II Rainer Gemulla, Pauli - PowerPoint PPT Presentation

Data Mining and Matrices 10 Graphs II Rainer Gemulla, Pauli Miettinen Jul 4, 2013 Link analysis The web as a directed graph Set of web pages with associated textual content Hyperlinks between webpages (potentially with anchor text)

Results for different matrices and comparisons Dense Matrices Rectangular Matrices

MATHEMATICS 1 CONTENTS Matrices Special matrices Operations with matrices Matrix

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Data Mining and Matrices 01 Introduction Rainer Gemulla, Pauli Miettinen April 18, 2013

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

JUST THE MATHS SLIDES NUMBER 9.10 MATRICES 10 (Symmetric matrices & quadratic forms)

Introduction What is data mining? to Data mining functionalities Data Mining Major

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

JUST THE MATHS SLIDES NUMBER 9.9 MATRICES 9 (Modal & spectral matrices) by

Transformations and Matrices Transformations I Transformations are functions Matrices

Matrices with Application to Page Rank Markov Matrices Pagerank Anil Maheshwari

Structural Matrices in MDOF Systems Evaluation of Structural Matrices Choice of Property

Structural Matrices in MDOF Systems Structural Matrices Evaluation of Structural Giacomo Boffi

SOCIAL ENGINEERING THREATS & COUNTERMEASURES IN AN OVERLY

Selective Early Request Termination Selective Early Request Termination for Busy Internet

Improving and Proving Marketing ROI with Testing How Shoebuy.com uses cross-site testing to

Systems & Applications: Introduction Ling 573 NLP Systems and Applications April 1, 2014

Host profiling based on remote measurements Robert Kulzer Advisor: Ralph Holz Master Thesis

From subexponentials in linear logic to concurrent constraint programming and back Carlos Olarte

Efficient Processing of Set-Similarity Joins on Large Computer Clusters Rares Vernica

Link Analysis Paolo Boldi DSI LAW (Laboratory for Web Algorithmics) Universit` a degli Studi

Data Mining and Matrices 10 Graphs II Rainer Gemulla, Pauli - PowerPoint PPT Presentation

Data Mining and Matrices 10 Graphs II Rainer Gemulla, Pauli Miettinen Jul 4, 2013 Link analysis The web as a directed graph Set of web pages with associated textual content Hyperlinks between webpages (potentially with anchor text)

Results for different matrices and comparisons Dense Matrices Rectangular Matrices

MATHEMATICS 1 CONTENTS Matrices Special matrices Operations with matrices Matrix

Web Mining Web Mining Web Mining Web Mining Web mining is the use of data mining techniques

Data Mining and Matrices 01 Introduction Rainer Gemulla, Pauli Miettinen April 18, 2013

Introduction What is data mining? to Data Mining: On what kind of data? Data Mining

Web Mining Web Mining Web mining is the use of data mining techniques to automatically

JUST THE MATHS SLIDES NUMBER 9.10 MATRICES 10 (Symmetric matrices &amp; quadratic forms)

Introduction What is data mining? to Data mining functionalities Data Mining Major

Data mining Machine Intelligence Thomas D. Nielsen September 2008 Data mining September 2008

DATA MINING LECTURE 2 What is data? The data mining pipeline What is Data Mining? Data

Data Mining 2020 Frequent Pattern Mining (2) Ad Feelders Universiteit Utrecht October 2, 2020

JUST THE MATHS SLIDES NUMBER 9.9 MATRICES 9 (Modal &amp; spectral matrices) by

Transformations and Matrices Transformations I Transformations are functions Matrices

Matrices with Application to Page Rank Markov Matrices Pagerank Anil Maheshwari

Structural Matrices in MDOF Systems Evaluation of Structural Matrices Choice of Property

Structural Matrices in MDOF Systems Structural Matrices Evaluation of Structural Giacomo Boffi

SOCIAL ENGINEERING THREATS &amp; COUNTERMEASURES IN AN OVERLY

Selective Early Request Termination Selective Early Request Termination for Busy Internet

Improving and Proving Marketing ROI with Testing How Shoebuy.com uses cross-site testing to

Systems &amp; Applications: Introduction Ling 573 NLP Systems and Applications April 1, 2014

Host profiling based on remote measurements Robert Kulzer Advisor: Ralph Holz Master Thesis

From subexponentials in linear logic to concurrent constraint programming and back Carlos Olarte

Efficient Processing of Set-Similarity Joins on Large Computer Clusters Rares Vernica

Link Analysis Paolo Boldi DSI LAW (Laboratory for Web Algorithmics) Universit` a degli Studi

JUST THE MATHS SLIDES NUMBER 9.10 MATRICES 10 (Symmetric matrices & quadratic forms)

JUST THE MATHS SLIDES NUMBER 9.9 MATRICES 9 (Modal & spectral matrices) by

SOCIAL ENGINEERING THREATS & COUNTERMEASURES IN AN OVERLY

Systems & Applications: Introduction Ling 573 NLP Systems and Applications April 1, 2014