  1. CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu

  2. How to organize/navigate it? First try: human-curated Web directories (Yahoo, DMOZ, LookSmart). 11/8/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

  3. SEARCH! Find relevant docs in a small and trusted set: newspaper articles, patents, etc.
     - Two traditional problems:
       - Synonymy: buy and purchase, sick and ill
       - Polysemy: jaguar

  4. Do more documents mean better results?

  5. What is the “best” answer to the query “Stanford”?
     - Anchor text: “I go to Stanford where I study”
     - What about the query “newspaper”? There is no single right answer.
     - Scarcity (IR) vs. abundance (Web) of information
     - Web: many sources of information. Who to “trust”?
     - Trick: pages that actually know about newspapers might all be pointing to many newspapers
     - Ranking!

  6. The “golden triangle”

  7. Web pages are not equally “important”: www.joe-schmoe.com vs. www.stanford.edu
     - We already know: since there is large diversity in the connectivity of the web graph, we can rank the pages by their link structure

  8. We will cover the following link analysis approaches to computing the importance of nodes in a graph:
     - Hubs and Authorities (HITS)
     - PageRank
     - Topic-Specific (Personalized) PageRank
     Sidenote: various notions of centrality for a node u:
     - Degree centrality = degree of u
     - Betweenness centrality = number of shortest paths passing through u
     - Closeness centrality = average length of the shortest paths from u to all other nodes
     - Eigenvector centrality = like PageRank
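
Two of the centrality notions in the sidenote are simple enough to sketch directly. The following is a minimal illustration, assuming an unweighted, undirected, connected graph stored as an adjacency dictionary; the function names and the three-node path graph are hypothetical and not part of the lecture.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from source to every reachable node (breadth-first search)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(adj, u):
    """Degree centrality as defined on the slide: the degree of u."""
    return len(adj[u])

def closeness_centrality(adj, u):
    """Average shortest-path length from u to all other reachable nodes.

    Assumes u can reach at least one other node.
    """
    dist = bfs_distances(adj, u)
    others = [d for node, d in dist.items() if node != u]
    return sum(others) / len(others)

# Hypothetical undirected path graph: a - b - c
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

print(degree_centrality(graph, "b"))     # middle node has two neighbors
print(closeness_centrality(graph, "b"))  # average distance from the middle
print(closeness_centrality(graph, "a"))  # average distance from an end
```

Under this definition a smaller closeness value means a more central node, since it is an average distance.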

  9. Goal (back to the newspaper example):
     - Don’t just find newspapers. Find “experts”: people who link in a coordinated way to good newspapers
     - Idea: links as votes
       - A page is more important if it has more links
       - In-coming links? Out-going links?
     - Hubs and Authorities: each page has 2 scores:
       - Quality as an expert (hub): total sum of votes of pages pointed to
       - Quality as content (authority): total sum of votes of experts
     - Principle of repeated improvement
     - (Figure labels: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9)

  10. Interesting pages fall into two classes:
      1. Authorities are pages containing useful information
         - Newspaper home pages
         - Course home pages
         - Home pages of auto manufacturers
      2. Hubs are pages that link to authorities
         - List of newspapers
         - Course bulletin
         - List of US auto manufacturers
      (Figure labels: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9)

  14. A good hub links to many good authorities
      - A good authority is linked from many good hubs
      - Model using two scores for each node: hub score and authority score
      - Represented as vectors h and a

  15. [Kleinberg ‘98] Each page $i$ has 2 scores:
      - Authority score: $a_i$
      - Hub score: $h_i$
      HITS algorithm:
      - Initialize: $a_i = h_i = 1$ for all $i$
      - Then keep iterating, for all $i$:
        - Authority: $a_i = \sum_{j \to i} h_j$
        - Hub: $h_i = \sum_{i \to j} a_j$
        - Normalize $a$ and $h$
      (Figures: page $i$ with neighbors $j_1, \dots, j_4$.)
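
The iteration above can be sketched in a few lines of Python. This is a minimal illustration rather than the lecture's implementation: the function names, the edge-list representation, and the tiny example graph are hypothetical, and normalizing each score vector to sum to 1 is an assumption, since the transcript does not preserve the exact normalization formula.

```python
def normalize(scores):
    """Rescale scores to sum to 1 (assumed normalization; see lead-in)."""
    total = sum(scores.values()) or 1.0  # avoid dividing by zero on empty graphs
    return {node: value / total for node, value in scores.items()}

def hits(edges, iterations=50):
    """Run HITS on a list of directed (source, destination) edges."""
    nodes = {node for edge in edges for node in edge}
    auth = {node: 1.0 for node in nodes}
    hub = {node: 1.0 for node in nodes}
    for _ in range(iterations):
        # Authority update: sum the hub scores of pages linking in.
        new_auth = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # Hub update: sum the (just updated) authority scores of pages linked to.
        new_hub = {node: 0.0 for node in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical graph: two list pages pointing at two newspapers.
edges = [("list1", "nyt"), ("list1", "wsj"), ("list2", "nyt")]
auth, hub = hits(edges)
print(auth)
print(hub)
```

In this example "nyt" ends up with the higher authority score because both lists point to it, and "list1" ends up as the better hub because it points to both newspapers.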

  16. [Kleinberg ‘98] HITS converges to a single stable point.
      - Slightly change the notation:
        - Vectors $a = (a_1, \dots, a_n)$, $h = (h_1, \dots, h_n)$
        - Adjacency matrix ($n \times n$): $M_{ij} = 1$ if $i \to j$
      - Then: $h_i = \sum_{i \to j} a_j \iff h_i = \sum_j M_{ij} a_j$
      - So: $h = M a$
      - And likewise: $a = M^T h$

  17. HITS algorithm in new notation:
      - Set: $a = h = 1^n$
      - Repeat: $h = M a$, $a = M^T h$; normalize
      - Then: $a = M^T (M a)$, where the inner $M a$ is the new $h$ and the result is the new $a$
      - $a$ is updated (in 2 steps): $M^T (M a) = (M^T M)\, a$
      - $h$ is updated (in 2 steps): $M (M^T h) = (M M^T)\, h$
      - Thus, in $2k$ steps: $a = (M^T M)^k a$ and $h = (M M^T)^k h$
      - Repeated matrix powering
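
The claim that two HITS steps amount to one multiplication by $M^T M$ can be checked numerically. The sketch below uses a small hypothetical adjacency matrix and plain-Python helper functions (all names are illustrative, and normalization is left out so the integers stay exact).

```python
def transpose(A):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    """Matrix-vector product A x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matmul(A, B):
    """Matrix-matrix product A B."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

# Hypothetical 3-node graph: M[i][j] = 1 if node i links to node j.
M = [[0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]

a0 = [1, 1, 1]                      # a = 1^n
h1 = matvec(M, a0)                  # step 1: h = M a
a1 = matvec(transpose(M), h1)       # step 2: a = M^T h
MtM = matmul(transpose(M), M)       # the combined operator M^T M

print(a1)
print(matvec(MtM, a0))              # same vector as a1
```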

  18. Definition:
      - Let $A x = \lambda x$ for some scalar $\lambda$, vector $x$, matrix $A$
      - Then $x$ is an eigenvector, and $\lambda$ is its eigenvalue
      Fact:
      - If $A$ is symmetric ($A_{ij} = A_{ji}$) (in our case $M^T M$ and $M M^T$ are symmetric)
      - Then $A$ has $n$ orthogonal unit eigenvectors $w_1, \dots, w_n$ that form a basis (coordinate system), with eigenvalues $\lambda_1, \dots, \lambda_n$ ($|\lambda_i| \ge |\lambda_{i+1}|$)

  19. Let’s write $x$ in the coordinate system $w_1, \dots, w_n$: $x = \sum_i \alpha_i w_i$
      - $x$ has coordinates $(\alpha_1, \dots, \alpha_n)$
      - Suppose the eigenvalues are ordered $\lambda_1, \dots, \lambda_n$ with $|\lambda_1| \ge \dots \ge |\lambda_n|$
      - Since $A w_i = \lambda_i w_i$: $A^k x = \sum_i \alpha_i \lambda_i^k w_i$
      - As $k \to \infty$, if we normalize, $A^k x / \lambda_1^k \to \alpha_1 w_1$, because $\lim_{k \to \infty} \lambda_i^k / \lambda_1^k = 0$ whenever $|\lambda_i| < |\lambda_1|$ (the contribution of all other coordinates goes to 0)
      - So authority $a$ is the eigenvector of $M^T M$ associated with the largest eigenvalue $\lambda_1$
      - Similarly: hub $h$ is the eigenvector of $M M^T$
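
The convergence argument can be illustrated with repeated multiplication and normalization on a small symmetric matrix. This is a sketch under stated assumptions: the matrix below is a hypothetical example with eigenvalues 3 and 1 (not taken from the lecture), the helper names are illustrative, and the Euclidean norm is used for normalization.

```python
import math

def power_iteration(A, x, steps):
    """Repeatedly apply A and rescale to unit Euclidean length."""
    for _ in range(steps):
        y = [sum(a * b for a, b in zip(row, x)) for row in A]
        norm = math.sqrt(sum(v * v for v in y))
        x = [v / norm for v in y]
    return x

def rayleigh_quotient(A, x):
    """Estimate the eigenvalue belonging to an (approximate) eigenvector x."""
    Ax = [sum(a * b for a, b in zip(row, x)) for row in A]
    return sum(p * q for p, q in zip(Ax, x)) / sum(q * q for q in x)

# Hypothetical symmetric matrix; its eigenvectors are (1, 1) and (1, -1).
A = [[2.0, 1.0],
     [1.0, 2.0]]

w = power_iteration(A, [1.0, 0.0], 60)
lam = rayleigh_quotient(A, w)
print(w)    # approaches the unit vector along (1, 1)
print(lam)  # approaches the largest eigenvalue
```

The starting vector has a nonzero component along the leading eigenvector, which the argument needs; the component along the other eigenvector shrinks by a factor of 1/3 per step.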

  20. A “vote” from an important page is worth more.
      - A page is important if it is pointed to by other important pages
      - Define a “rank” $r_j$ for node $j$: $r_j = \sum_{i \to j} \frac{r_i}{d_{out}(i)}$
      - (Figure: “The web in 1839”, three pages y, a, m; y sends y/2 to itself and y/2 to a, a sends a/2 to y and a/2 to m, m sends m to a.)
      - Flow equations:
        - $r_y = r_y/2 + r_a/2$
        - $r_a = r_y/2 + r_m$
        - $r_m = r_a/2$
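
The flow equations alone fix the ranks only up to a common scale factor. Assuming the ranks are additionally required to sum to 1, the three-page example can be solved by hand, as the following short derivation sketches:

```latex
\begin{aligned}
r_y = \tfrac{1}{2} r_y + \tfrac{1}{2} r_a \;&\Rightarrow\; r_y = r_a \\
r_m &= \tfrac{1}{2} r_a \\
r_y + r_a + r_m = 1 \;&\Rightarrow\; \tfrac{5}{2}\, r_a = 1 \\
r_y = \tfrac{2}{5}, \quad r_a &= \tfrac{2}{5}, \quad r_m = \tfrac{1}{5}
\end{aligned}
```

The remaining equation is satisfied as well: $r_y/2 + r_m = 1/5 + 1/5 = 2/5 = r_a$.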

  21. Stochastic adjacency matrix $M$
      - Let page $j$ have $d_j$ out-links
      - If $j \to i$, then $M_{ij} = 1/d_j$, else $M_{ij} = 0$
      - $M$ is a column stochastic matrix: columns sum to 1
      - Rank vector $r$: vector with an entry per page
        - $r_i$ is the importance score of page $i$
        - $\sum_i r_i = 1$
      - The flow equations can be written $r = M r$
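
As an illustration of the definition, the sketch below builds $M$ for the three-page y/a/m example, with the links read off the flow equations. The function name and edge-list representation are hypothetical; exact fractions are used so the column sums can be checked without rounding, and every page is assumed to have at least one out-link and no duplicate links.

```python
from fractions import Fraction

def column_stochastic(pages, edges):
    """Build M with M[i][j] = 1/d_j if page j links to page i, else 0."""
    out_degree = {p: 0 for p in pages}
    for src, _ in edges:
        out_degree[src] += 1
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, dst in edges:
        # Column = source page, row = destination page.
        M[index[dst]][index[src]] = Fraction(1, out_degree[src])
    return M

pages = ["y", "a", "m"]
# Links inferred from the flow equations of the y/a/m example.
edges = [("y", "y"), ("y", "a"), ("a", "y"), ("a", "m"), ("m", "a")]

M = column_stochastic(pages, edges)
for row in M:
    print([str(x) for x in row])

# Each column sums to 1, so M is column stochastic.
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
print(column_sums)
```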

  22. Imagine a random web surfer:
      - At any time $t$, the surfer is on some page $u$
      - At time $t+1$, the surfer follows an out-link from $u$ uniformly at random
      - Ends up on some page $v$ linked from $u$
      - Process repeats indefinitely
      Let:
      - $p(t)$ … vector whose $i$-th coordinate is the probability that the surfer is at page $i$ at time $t$
      - $p(t)$ is a probability distribution over pages

  23. Where is the surfer at time $t+1$?
      - Follows a link uniformly at random: $p(t+1) = M p(t)$
      - Suppose the random walk reaches a state where $p(t+1) = M p(t) = p(t)$; then $p(t)$ is a stationary distribution of the random walk
      - Our rank vector $r$ satisfies $r = M r$
      - So, it is a stationary distribution for the random walk
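
The random-surfer view can be tried out with a small simulation. This is a rough sketch, assuming the three-page y/a/m example whose links are inferred from the flow equations; the function name, step count, and fixed seed are illustrative choices. Over a long walk the fraction of time spent on each page should come out close to the ranks that solve $r = M r$ with $\sum_i r_i = 1$, namely 2/5, 2/5, 1/5.

```python
import random

# Out-links of the y/a/m example (inferred from the flow equations).
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow out-links uniformly at random and return visit frequencies."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

freq = simulate_surfer(out_links, "y", 200_000)
print(freq)  # roughly 0.4 for y, 0.4 for a, 0.2 for m
```

A simulation only approximates the stationary distribution; the matrix iteration $p(t+1) = M p(t)$ computes the same quantity without sampling noise.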

  24. Given a web graph with $n$ nodes, where the nodes are pages and the edges are hyperlinks:
      - Assign each node an initial page rank
      - Repeat until convergence: calculate the page rank of each node
        $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$
      - $d_i$ … out-degree of node $i$

  25. Power iteration:
      - Set $r_j = 1/n$
      - Update $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
      - And iterate
      Matrix $M$ for the y, a, m example (column = source page, row = destination page):

             y    a    m
        y    ½    ½    0
        a    ½    0    1
        m    0    ½    0

      Flow equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$

      Example (iteration 0, 1, 2, …):

        r_y:  1/3   1/3   5/12   9/24    …   6/15
        r_a:  1/3   3/6   1/3    11/24   …   6/15
        r_m:  1/3   1/6   3/12   1/6     …   3/15
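
The example iterates can be reproduced with a few lines of Python. The sketch below hard-codes the y/a/m matrix from the slide and uses exact fractions so the early iterations can be compared with the table digit for digit; the function names and the iteration count are illustrative.

```python
from fractions import Fraction

half = Fraction(1, 2)
# Column-stochastic matrix of the y/a/m example (rows and columns ordered y, a, m).
M = [[half, half, 0],
     [half, 0,    1],
     [0,    half, 0]]

def step(M, r):
    """One power-iteration step: r <- M r."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

def power_iterate(M, r, iterations):
    """Return the list of rank vectors r(0), r(1), ..., r(iterations)."""
    history = [r]
    for _ in range(iterations):
        r = step(M, r)
        history.append(r)
    return history

r0 = [Fraction(1, 3)] * 3          # uniform start, r_j = 1/n with n = 3
history = power_iterate(M, r0, 200)

for t in range(4):
    print(t, [str(x) for x in history[t]])
print([float(x) for x in history[-1]])  # close to 6/15, 6/15, 3/15
```

Fractions print in lowest terms, so 3/6 appears as 1/2, 3/12 as 1/4, and 9/24 as 3/8.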
