
CS345a: Data Mining, Jure Leskovec and Anand Rajaraman, Stanford - PowerPoint PPT Presentation



  1. CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University

  2. - Instead of generic popularity, can we measure popularity within a topic?
       - E.g., computer science, health
     - Bias the random walk
       - When the random walker teleports, he picks a page from a set S of web pages
       - S contains only pages that are relevant to the topic
       - E.g., Open Directory (DMOZ) pages for a given topic (www.dmoz.org)
     - For each teleport set S, we get a different rank vector $r_S$

     1/28/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining

  3. - Let:
       $$A_{ik} = \begin{cases} \beta M_{ik} + (1-\beta)/|S| & \text{if } i \in S \\ \beta M_{ik} & \text{otherwise} \end{cases}$$
     - A is stochastic!
     - We have weighted all pages in the teleport set S equally
       - Could also assign different weights to pages

  4. Suppose S = {1}, β = 0.8

     [Figure: four-node example graph with edge labels 0.2, 0.4, 0.5, 0.8, and 1]

     | Node | Iteration 0 | 1   | 2    | … | stable |
     |------|-------------|-----|------|---|--------|
     | 1    | 1.0         | 0.2 | 0.52 | … | 0.294  |
     | 2    | 0           | 0.4 | 0.08 | … | 0.118  |
     | 3    | 0           | 0.4 | 0.08 | … | 0.327  |
     | 4    | 0           | 0   | 0.32 | … | 0.261  |

     Note how we initialize the PageRank vector differently from the unbiased PageRank case.
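The table above can be reproduced with a few lines of power iteration. This is only a sketch: the slide's graph figure did not survive extraction, so the edge list below (1→2, 1→3, 2→1, 3→4, 4→3) is an assumption inferred from the table values, and the names `edges`, `teleport_set`, and `topic_pagerank` are illustrative.

```python
# Topic-specific PageRank by power iteration (pure-Python sketch).
# ASSUMPTION: this edge list is inferred from the table values; the
# original graph figure is not available in the extracted text.
edges = {1: [2, 3], 2: [1], 3: [4], 4: [3]}
teleport_set = {1}   # S = {1}
beta = 0.8


def topic_pagerank(edges, teleport_set, beta, iterations):
    """Return the list of rank vectors, one per iteration (index 0 = start)."""
    nodes = sorted(edges)
    # Unlike unbiased PageRank, all initial mass sits on the teleport set.
    rank = {n: (1.0 / len(teleport_set) if n in teleport_set else 0.0)
            for n in nodes}
    history = [rank]
    for _ in range(iterations):
        new_rank = {n: 0.0 for n in nodes}
        # Follow out-links with probability beta.
        for src, dests in edges.items():
            for dst in dests:
                new_rank[dst] += beta * rank[src] / len(dests)
        # Teleport with probability 1 - beta, but only into S.
        # (The graph has no dead ends, so the total rank stays 1.)
        for n in teleport_set:
            new_rank[n] += (1.0 - beta) / len(teleport_set)
        rank = new_rank
        history.append(rank)
    return history


history = topic_pagerank(edges, teleport_set, beta, 100)
for step in (0, 1, 2, 100):
    print(step, [round(history[step][n], 3) for n in (1, 2, 3, 4)])
```

Under the assumed edges, iterations 1 and 2 and the converged vector agree with the table rows to three decimals.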

  5. - Experimental results [Haveliwala 2000]
       - Picked 16 topics
         - Teleport sets determined using DMOZ
         - E.g., arts, business, sports, …
       - “Blind study” using volunteers
         - 35 test queries
         - Results ranked using PageRank and TSPR of most closely related topic
         - E.g., bicycling using Sports ranking
       - In most cases volunteers preferred TSPR ranking

  6. - User can pick from a menu
     - Use Naïve Bayes to classify query into a topic
     - Can use the context of the query
       - E.g., query is launched from a web page talking about a known topic
       - History of queries, e.g., “basketball” followed by “Jordan”
     - User context, e.g., user’s My Yahoo settings, bookmarks, …

  7. - Goal:
       - Don’t just find newspapers but also find “experts”: people who link in a coordinated way to many good newspapers
     - Idea: link voting
       - Quality as an expert (hub):
         - Total sum of votes of pages pointed to
       - Quality as content (authority):
         - Total sum of votes of experts
       - Principle of repeated improvement
     - [Figure: example vote counts: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]

  11. Interesting documents fall into two classes:
      1. Authorities are pages containing useful information
         - Newspaper home pages
         - Course home pages
         - Home pages of auto manufacturers
      2. Hubs are pages that link to authorities
         - List of newspapers
         - Course bulletin
         - List of US auto manufacturers

      [Figure: example vote counts: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9]

  12. - A good hub links to many good authorities
      - A good authority is linked from many good hubs
      - Model using two scores for each node:
        - Hub score and Authority score
        - Represented as vectors h and a

  13. - Each page i has 2 kinds of scores:
        - Hub score: $h_i$
        - Authority score: $a_i$
      - Algorithm:
        - Initialize: $a_i = h_i = 1$
        - Then keep iterating:
          - Authority: $a_i = \sum_{j \to i} h_j$
          - Hub: $h_i = \sum_{i \to j} a_j$
          - Normalize: $\sum_i a_i = 1$, $\sum_i h_i = 1$
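The per-page update rules above can be sketched directly on a link dictionary. This is a minimal illustration, not the lecture's code: the function name `hits`, the fixed iteration count, and the toy graph (two list pages and a blog linking to three news sites) are assumptions made for the example, and the hub update here uses the freshly computed authority scores.

```python
def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: a_i = sum of h_j over pages j that link to i.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub: h_i = sum of a_j over pages j that i links to.
        hub = {p: sum(auth[d] for d in links.get(p, [])) for p in pages}
        # Normalize so that each score vector sums to 1.
        a_total = sum(auth.values()) or 1.0
        h_total = sum(hub.values()) or 1.0
        auth = {p: s / a_total for p, s in auth.items()}
        hub = {p: s / h_total for p, s in hub.items()}
    return hub, auth


# Hypothetical toy graph: page names are made up for illustration.
links = {
    "list_a": ["nyt", "wsj", "cnn"],
    "list_b": ["nyt", "wsj"],
    "blog": ["cnn"],
}
hub, auth = hits(links)
print(sorted(auth.items(), key=lambda kv: -kv[1]))
print(sorted(hub.items(), key=lambda kv: -kv[1]))
```

In this toy graph the page that links to the most well-regarded sites ends up as the strongest hub, and the sites linked from both list pages outrank the one that is not.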

  14. - HITS uses adjacency matrix: $A[i, j] = 1$ if page i links to page j, 0 else
      - $A^T$, the transpose of A, is similar to the PageRank matrix M, but $A^T$ has 1’s where M has fractions

  15. [Figure: link graph of Yahoo, Amazon, and M’soft]

      Rows and columns in the order y (Yahoo), a (Amazon), m (M’soft):
      $$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}$$

  16. - Notation:
        - Vectors $a = (a_1, \dots, a_n)$, $h = (h_1, \dots, h_n)$
        - Adjacency matrix (n x n): $A_{ij} = 1$ if $i \to j$
      - Then: $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j A_{ij} a_j$
      - So: $h = A a$
      - Likewise: $a = A^T h$

  17. - The hub score of page i is proportional to the sum of the authority scores of the pages it links to: $h = \lambda A a$
        - Constant λ is a scale factor, $\lambda = 1/\sum_i h_i$
      - The authority score of page i is proportional to the sum of the hub scores of the pages it is linked from: $a = \mu A^T h$
        - Constant μ is a scale factor, $\mu = 1/\sum_i a_i$

  18. - The HITS algorithm:
        - Initialize h, a to all 1’s
        - Repeat:
          - $h = A a$
          - Scale h so that it sums to 1.0
          - $a = A^T h$
          - Scale a so that it sums to 1.0
        - Until h, a converge (i.e., change very little)

  19. [Figure: link graph of Yahoo, Amazon, and M’soft]

      $$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \quad A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

      - a(yahoo) = 1, 1, 1, 1, …, 1
      - a(amazon) = 1, 1, 4/5, 0.75, …, 0.732
      - a(m’soft) = 1, 1, 1, 1, …, 1
      - h(yahoo) = 1, 1, 1, 1, …, 1.000
      - h(amazon) = 1, 2/3, 0.71, 0.73, …, 0.732
      - h(m’soft) = 1, 1/3, 0.29, 0.27, …, 0.268
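The Yahoo/Amazon/M’soft numbers can be checked with a short matrix-form sketch. Two choices here are assumptions made to match the listed values rather than something the slides state: each vector is scaled so that its largest entry is 1 (the listed values have maximum 1 rather than sum 1), and within a round a is updated before h. The helper names are illustrative.

```python
# Adjacency matrix, rows/columns in the order yahoo, amazon, m'soft.
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]


def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def scale_to_max(vector):
    """ASSUMPTION: scale so the largest entry is 1, as in the listed values."""
    largest = max(vector)
    return [x / largest for x in vector]


def hits_rounds(adjacency, rounds):
    """Run the given number of rounds; each round updates a, then h."""
    a_t = transpose(adjacency)
    h = [1.0] * len(adjacency)
    a = [1.0] * len(adjacency)
    for _ in range(rounds):
        a = scale_to_max(mat_vec(a_t, h))        # a = A^T h
        h = scale_to_max(mat_vec(adjacency, a))  # h = A a
    return a, h


for rounds in (1, 2, 3, 50):
    a, h = hits_rounds(A, rounds)
    print(rounds, [round(x, 3) for x in a], [round(x, 3) for x in h])
```

After 50 rounds the Amazon authority score is about 0.732 and the hub scores are about (1, 0.732, 0.268), in line with the final values listed above.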

  20. - Algorithm:
        - Set: $a = h = 1^n$
        - Repeat: $h = M a$, $a = M^T h$
        - Normalize
      - Then: $a = M^T (M a) = (M^T M) a$ (a is being updated in 2 steps: the new h is $M a$, then the new a is $M^T (M a)$)
      - Likewise: $h = M (M^T h) = (M M^T) h$ (h is updated in 2 steps)
      - Thus, in 2k steps: $a = (M^T M)^k a$, $h = (M M^T)^k h$
        - Repeated matrix powering

  21. - $h = \lambda A a$
      - $a = \mu A^T h$
      - $h = \lambda\mu A A^T h$
      - $a = \lambda\mu A^T A a$
      - Under reasonable assumptions about A, the HITS iterative algorithm converges to vectors h* and a*:
        - h* is the principal eigenvector of matrix $A A^T$
        - a* is the principal eigenvector of matrix $A^T A$
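As a worked check of the eigenvector claim, take the three-page Yahoo/Amazon/M’soft adjacency matrix used earlier in the deck (restated here so the computation is self-contained). The closed forms $\sqrt{3}-1 \approx 0.732$ and $2-\sqrt{3} \approx 0.268$ are derived in this note and are not given in the slides.

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad A^T A = \begin{pmatrix} 2 & 1 & 2 \\ 1 & 2 & 1 \\ 2 & 1 & 2 \end{pmatrix}, \qquad A A^T = \begin{pmatrix} 3 & 2 & 1 \\ 2 & 2 & 0 \\ 1 & 0 & 1 \end{pmatrix}$$

$$a^* = \begin{pmatrix} 1 \\ \sqrt{3}-1 \\ 1 \end{pmatrix}: \qquad A^T A\, a^* = \begin{pmatrix} 3+\sqrt{3} \\ 2\sqrt{3} \\ 3+\sqrt{3} \end{pmatrix} = (3+\sqrt{3})\, a^*$$

$$h^* = \begin{pmatrix} 1 \\ \sqrt{3}-1 \\ 2-\sqrt{3} \end{pmatrix}: \qquad A A^T h^* = \begin{pmatrix} 3+\sqrt{3} \\ 2\sqrt{3} \\ 3-\sqrt{3} \end{pmatrix} = (3+\sqrt{3})\, h^*$$

Both vectors share the eigenvalue $3+\sqrt{3} \approx 4.732$, which is the largest eigenvalue of either matrix (the others are $3-\sqrt{3}$ and $0$), so they are the principal eigenvectors.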

  22. [Figure: hubs and authorities; the most densely-connected core (primary core) and a less densely-connected core (secondary core)]

  23. - A single topic can have many bipartite cores
        - Corresponding to different meanings or points of view:
          - abortion: pro-choice, pro-life
          - evolution: darwinian, intelligent design
          - jaguar: auto, Mac, NFL team, panthera onca
      - How to find such secondary cores?

  24. - Once we find the primary core, we can remove its links from the graph
      - Repeat HITS algorithm on residual graph to find the next bipartite core
      - Roughly, these cores correspond to non-primary eigenvectors of $A A^T$ and $A^T A$
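The remove-and-repeat idea can be sketched on a toy graph with two disjoint bipartite cores. The slides do not say how core membership is decided, so the rule used here (pages scoring at least half of the top score) is purely an assumption, as are the function names and the page labels.

```python
def hits(links, pages, iterations=100):
    """HITS on a list of (source, target) links; scores sum to 1."""
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for src, dst in links:
            auth[dst] += hub[src]
        total = sum(auth.values()) or 1.0
        auth = {p: s / total for p, s in auth.items()}
        hub = {p: 0.0 for p in pages}
        for src, dst in links:
            hub[src] += auth[dst]
        total = sum(hub.values()) or 1.0
        hub = {p: s / total for p, s in hub.items()}
    return hub, auth


def top_pages(scores, fraction=0.5):
    """ASSUMPTION: a page belongs to the core if it scores at least
    `fraction` of the best score. The slides give no such rule."""
    cutoff = fraction * max(scores.values())
    return {p for p, s in scores.items() if s >= cutoff}


def find_cores(links, how_many):
    """Find a core, drop its links, and rerun HITS on the residual graph."""
    pages = sorted({p for link in links for p in link})
    remaining = list(links)
    cores = []
    for _ in range(how_many):
        if not remaining:
            break
        hub, auth = hits(remaining, pages)
        core_hubs, core_auths = top_pages(hub), top_pages(auth)
        cores.append((core_hubs, core_auths))
        remaining = [(s, d) for s, d in remaining
                     if not (s in core_hubs and d in core_auths)]
    return cores


# Hypothetical graph: a dense 3x3 core and a smaller 2x2 core.
links = [(h, a) for h in ("h1", "h2", "h3") for a in ("a1", "a2", "a3")]
links += [(g, b) for g in ("g1", "g2") for b in ("b1", "b2")]
cores = find_cores(links, 2)
print(cores)
```

On this graph the first pass isolates the 3x3 core because its scores dominate, and the second pass, run on what is left, recovers the 2x2 core.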

  25. - We need a well-connected graph of pages for HITS to work well: [figure]
