CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
HITS (Hypertext‐Induced Topic Selection)
- A measure of importance of pages or documents, similar to PageRank
- Proposed at around the same time as PageRank (‘98)
Goal: Say we want to find good newspapers
- Don’t just find newspapers. Find “experts” – people
who link in a coordinated way to good newspapers
Idea: Links as votes
- Page is more important if it has more links
- In‐coming links? Out‐going links?
2/12/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 3
Hubs and Authorities
Each page has 2 scores:
- Quality as an expert (hub):
- Total sum of votes of the authorities it points to
- Quality as content (authority):
- Total sum of votes coming from experts
Principle of repeated improvement
(Example authority votes: NYT: 10, CNN: 8, WSJ: 9, eBay: 3, Yahoo: 3)
Interesting pages fall into two classes:
- 1. Authorities are pages containing
useful information
- Newspaper home pages
- Course home pages
- Home pages of auto manufacturers
- 2. Hubs are pages that link to authorities
- List of newspapers
- Course bulletin
- List of US auto manufacturers
(Note: this is an idealized example. In reality the graph is not bipartite and each page has both a hub and an authority score.)
Each page starts with hub score 1. Authorities collect their votes
Sum of hub scores of nodes pointing to NYT.
Hubs collect authority scores
Sum of authority scores of nodes that the node points to.
Authorities again collect the hub scores
A good hub links to many good authorities; a good authority is linked from many good hubs.
Model using two scores for each node:
- Hub score and Authority score
- Represented as vectors h and a
Each page i has 2 scores:
- Authority score: a_i
- Hub score: h_i
HITS algorithm:
- Initialize: a_j = h_j = 1/√n
- Then keep iterating until convergence:
- Authority: a_i = ∑_{j→i} h_j
- Hub: h_i = ∑_{i→j} a_j
- Normalize a, h such that:
- ∑_i a_i² = 1, ∑_j h_j² = 1
[Kleinberg ‘98]
(Diagram: a_i = ∑ h_j over pages j1…j4 pointing to i; h_i = ∑ a_j over pages j1…j4 that i points to.)
n … number of nodes in the graph
Example (nodes: Yahoo, Amazon, M’soft):

    A = 1 1 1    A^T = 1 1 0
        1 0 1          1 0 1
        0 1 0          1 1 0

Iterates (normalized):
h(yahoo)  = .58  .80  .79  …  .788
h(amazon) = .58  .53  .57  …  .577
h(m’soft) = .58  .27  .23  …  .211

a(yahoo)  = .58  .58  .62  …  .628
a(amazon) = .58  .58  .49  …  .459
a(m’soft) = .58  .58  .62  …  .628
HITS converges to a single stable point. Notation:
- Vectors a = (a_1, …, a_n), h = (h_1, …, h_n)
- Adjacency matrix A (n x n): A_ij = 1 if i→j, else A_ij = 0
Then h_i = ∑_j A_ij · a_j can be rewritten as h = A · a
Similarly, a_i = ∑_j A_ji · h_j can be rewritten as a = A^T · h
[Kleinberg ‘98]
The hub score of page i is proportional to the sum of the authority scores of the pages it links to: h = λ A a
- λ is a scale factor: λ = 1 / ∑_i h_i
The authority score of page i is proportional to the sum of the hub scores of the pages linking to it: a = μ A^T h
- μ is a scale factor: μ = 1 / ∑_i a_i
HITS algorithm in vector notation:
- Set: a_i = h_i = 1/√n
- Repeat until convergence:
- h = A · a
- a = A^T · h
- Normalize h and a
Then:
- a is updated (in 2 steps): a = A^T (A a) = (A^T A) a
- h is updated (in 2 steps): h = A (A^T h) = (A A^T) h
- Repeated matrix powering: in k steps
- a = (A^T A)^k · a^(0)
- h = (A A^T)^k · h^(0)
- Convergence criterion:
- ∑_i (h_i^(t) − h_i^(t−1))² < ε
- ∑_i (a_i^(t) − a_i^(t−1))² < ε
- Under reasonable assumptions about A,
HITS converges to vectors h* and a*:
- h* is the principal eigenvector of matrix A A^T
- a* is the principal eigenvector of matrix A^T A
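The eigenvector claim can be checked directly by repeated matrix powering on M = A Aᵀ, using the same 3-node example matrix as earlier; this is a small verification sketch, with 50 iterations as an arbitrary choice of mine.

```python
import math

# Same 3-node example graph as before (Yahoo, Amazon, M'soft)
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
n = len(A)

# M = A A^T, the matrix whose principal eigenvector is h*
M = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(n)]
     for i in range(n)]

# Repeated matrix powering: h <- M h, renormalized each step
h = [1.0] * n
for _ in range(50):
    h = [sum(M[i][j] * h[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in h))
    h = [x / norm for x in h]

# Rayleigh quotient h^T (M h) estimates the principal eigenvalue
Mh = [sum(M[i][j] * h[j] for j in range(n)) for i in range(n)]
lam = sum(Mh[i] * h[i] for i in range(n))
print(round(lam, 3))  # 4.732 (= 3 + sqrt(3) for this A)
print([round(x, 3) for x in h])  # matches the HITS hub scores
```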
PageRank and HITS are two solutions to the
same problem:
- What is the value of an in‐link from u to v?
- In the PageRank model, the value of the link
depends on the links into u
- In the HITS model, it depends on the value of the other links out of u
The destinies of PageRank and HITS
post‐1998 were very different
We often think of networks being organized
into modules, clusters, communities:
Find micro‐markets by partitioning the
query‐to‐advertiser graph:
advertiser query
[Andersen, Lang: Communities from seed sets, 2006]
Clusters in Movies‐to‐Actors graph:
[Andersen, Lang: Communities from seed sets, 2006]
Discovering social circles, circles of trust:
[McAuley, Leskovec: Discovering social circles in ego networks, 2012]
The graph is large
- Assume the graph fits in main memory
- For example, to work with a 200M-node, 2B-edge
graph one needs approx. 16GB RAM
- But the graph is too big for running anything
more than linear-time algorithms
We will cover a PageRank based algorithm
for finding dense clusters
- The runtime of the algorithm will be proportional
to the cluster size (not the graph size!)
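The 16GB figure can be sanity-checked with back-of-envelope arithmetic, assuming the graph is stored as an edge list with two 4-byte node ids per edge (an assumed layout; real in-memory representations add per-node overhead on top of this).

```python
# Back-of-envelope check of the ~16GB RAM estimate for a 200M-node,
# 2B-edge graph, assuming two 4-byte endpoints per edge.
nodes = 200_000_000          # fits in a 4-byte id, since 2**32 > 2e8
edges = 2_000_000_000
bytes_per_edge = 2 * 4       # two 4-byte endpoints
gb = edges * bytes_per_edge / 2**30
print(round(gb, 1))  # 14.9, roughly the quoted 16GB
```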
Discovering clusters based on seed nodes
- Given: Seed node S
- Compute (approximate) Personalized PageRank
(PPR) around node S (teleport set={S})
- Idea is that if S belongs to a nice cluster, the
random walk will get trapped inside the cluster
Algorithm outline:
- Pick a seed node S of interest
- Run PPR with teleport set = {S}
- Sort the nodes by the decreasing PPR score
- Sweep over the nodes and find good clusters
(Plot: cluster “quality” (lower is better) vs. node rank in decreasing PPR score from the seed node; local minima mark good clusters.)
Undirected graph G(V, E). Partitioning task:
- Divide vertices into 2 disjoint groups A and B = V\A
Question:
- How can we define a “good” cluster in G?
What makes a good cluster?
- Maximize the number of within‐cluster
connections
- Minimize the number of between‐cluster
connections
Express cluster quality as a function of the
“edge cut” of the cluster
Cut: set of edges with only one endpoint in the
cluster:
cut(A) = ∑_{i∈A, j∉A} w_ij
Example: cut(A) = 2
Note: this works for weighted and unweighted (set all w_ij = 1) graphs
Partition quality: Cut score
- Quality of a cluster is the weight of connections
pointing outside the cluster
Degenerate case: minimum cut. Problem:
- Only considers external cluster connections
- Does not consider internal cluster connectivity
(Figure: the minimum cut can differ from the "optimal cut".)
Criterion: Conductance:
Connectivity of the group to the rest of the network relative to the density of the group:
φ(A) = |{(i, j) ∈ E; i ∈ A, j ∉ A}| / min(vol(A), 2m − vol(A))
- vol(A): total weight of the edges with at least
one endpoint in A:
- vol(A) = ∑_{i∈A} d_i
- m … number of edges in the graph
- d_i … degree of node i
Why use this criterion?
Produces more balanced partitions
[Shi‐Malik]
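The conductance formula can be computed directly from an edge list. The toy graph below (two triangles joined by one edge) is an illustration of mine, not from the slides:

```python
def conductance(edges, A):
    """phi(A) = cut(A) / min(vol(A), 2m - vol(A)) for an unweighted,
    undirected graph given as a list of edges (i, j)."""
    A = set(A)
    m = len(edges)
    # cut(A): edges with exactly one endpoint inside A
    cut = sum(1 for i, j in edges if (i in A) != (j in A))
    # vol(A): sum of degrees d_i over nodes i in A
    vol = sum((i in A) + (j in A) for i, j in edges)
    return cut / min(vol, 2 * m - vol)

# Toy graph: triangle {1,2,3} joined to triangle {4,5,6} by the edge (3,4)
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
print(conductance(edges, {1, 2, 3}))  # 1/7 ≈ 0.143 (a good, balanced cluster)
print(conductance(edges, {1, 2}))     # 2/4 = 0.5 (a worse cluster)
```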
Algorithm outline:
- Pick a seed node S of
interest
- Run PPR w/ teleport={S}
- Sort the nodes by the
decreasing PPR score
- Sweep over the nodes
and find good clusters
(Plot: conductance φ(A_i) vs. node rank i in decreasing PPR score; local minima mark good clusters.)
Sweep:
- Sort nodes in decreasing PPR score
- For each i compute φ(A_i), where A_i = {top i nodes}
- Local minima of φ(A_i)
correspond to good clusters
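The sweep can be sketched on the same toy graph; the PPR scores below are made-up values (decaying with distance from seed node 1), standing in for the output of an actual PPR computation.

```python
def conductance(edges, A, m):
    """phi(A) for an unweighted, undirected edge-list graph with m edges."""
    A = set(A)
    cut = sum(1 for i, j in edges if (i in A) != (j in A))
    vol = sum((i in A) + (j in A) for i, j in edges)
    return cut / min(vol, 2 * m - vol)

def sweep(ppr, edges):
    """Sort nodes by decreasing PPR score, then compute phi(A_i) for each
    proper prefix A_i = {top i nodes}; local minima mark good clusters."""
    order = sorted(ppr, key=ppr.get, reverse=True)
    m = len(edges)
    phis = [conductance(edges, order[:i + 1], m) for i in range(len(order) - 1)]
    return order, phis

# Two triangles joined by one edge; made-up PPR scores from seed node 1
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
ppr = {1: .30, 2: .25, 3: .25, 4: .10, 5: .05, 6: .05}
order, phis = sweep(ppr, edges)
print(order)                        # [1, 2, 3, 4, 5, 6]
print([round(p, 2) for p in phis])  # [1.0, 0.5, 0.14, 0.5, 1.0]
```

The local minimum at i = 3 recovers the cluster {1, 2, 3} around the seed.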
How to compute Personalized PageRank (PPR)
without touching the whole graph?
- Power method won’t work since each single iteration
accesses all nodes of the graph:
r = β·M·r + (1−β)·a
- a is the teleport vector: a = [0, …, 0, 1, 0, …, 0] (with the 1 at index S)
- r is the personalized PageRank vector
PageRank‐Nibble [Andersen,Chung, Lang, ‘07]
- A fast method for computing approximate
Personalized PageRank (PPR) with teleport set ={S}
- ApproxPageRank(S, β, ε)
- S … seed node
- β … teleportation parameter
- ε … approximation error parameter
Overview of the approximate PPR
- Lazy random walk: a variant of a random walk
that stays put with probability 1/2 at each time step, and walks to a random neighbor the other half of the time
- Keep track of the residual PPR score q_u = p_u − r_u
- The residual tells us how well the PPR score of u is approximated
- p_u … the ‘true’ PageRank of node u
- r_u … the PageRank estimate at node u
- If the residual q_u of node u is too big (q_u/d_u ≥ ε), then spread
the walk further, else don’t touch the node
A different way to look at PageRank:
[Jeh&Widom. Scaling Personalized Web Search, 2002]
- p … the true PageRank vector
- p(β, a) … the PageRank vector with teleportation
vector a and teleportation parameter β
Gives an idea how to compute PageRank:
- Node u’s “view” of the graph (p_u) is the average of
its out‐neighbors’ views plus u’s own importance
Idea:
- r … approx. PageRank, q … its residual PageRank
- Start with r = 0 and q = [0, …, 0, 1, 0, …, 0] (the 1 at index S)
- Iteratively push PageRank from q
to r until q is small enough
- Maintain invariant:
- r + PPR(q) = p (the true PageRank)
[Andersen, Chung, Lang. Local graph partitioning using PageRank vectors, 2007]
Push(u, r, q): 1 step of a lazy random walk from node u:
- r′_u = r_u + (1 − β)·q_u   (update r)
- q′_u = ½·β·q_u   (do 1 step of a walk: stay at u with prob. ½)
- for each v such that (u→v) ∈ E:   (spread the remaining ½ fraction of q_u as if a single step of random walk were applied to u)
- q′_v = q_v + ½·β·q_u/d_u
- return (r′, q′)
ApproxPageRank(S, β, ε):
- Set r = 0, q = [0, …, 0, 1, 0, …, 0] (the 1 at index S)
- While max_u q_u/d_u ≥ ε:
- Choose any vertex u where q_u/d_u ≥ ε
- Push(u, r, q):
- Update r: r_u = r_u + (1 − β)·q_u   (move a (1 − β) fraction of the prob. from q_u to r_u)
- 1 step of a lazy random walk:
- Stay at u with prob. ½: q_u = ½·β·q_u
- Spread the remaining ½ fraction of q_u as if a single step of random walk were applied to u: for each v such that (u→v) ∈ E: q_v = q_v + ½·β·q_u/d_u
- Return r
Notation: r … PPR vector, r_u … PPR score of u, q … residual PPR vector, q_u … residual of node u, d_u … degree of u
Example run (nodes s, a, b, c; vectors indexed [s, a, b, c]):
- Init: r = [0, 0, 0, 0], q = [1, 0, 0, 0]
- Push(s, r, q): r = [.5, 0, 0, 0], q = [.25, .08, .08, .08]
- Push(s, r, q): r = [.62, 0, 0, 0], q = [.06, .10, .10, .10]
- Push(a, r, q): r = [.62, .05, 0, 0], q = [.09, .03, .10, .10]
- Push(b, r, q): r = [.62, .05, .05, 0], q = [.09, .05, .03, .10]
- …
- r = [.57, .19, .14, .09]
Runtime:
- PageRank‐Nibble computes PPR in time O(1/(ε(1−β))), independent of the graph size,
- with residual error q_u/d_u < ε at every node u
- The power method would take time proportional to the number of edges m per iteration
- Graph cut approximation guarantee:
- If there exists a cut of conductance φ and volume k,
then the method finds a cut of conductance O(√(φ·log k))
- Details in [Andersen, Chung, Lang. Local graph
partitioning using PageRank vectors, 2007]
http://www.math.ucsd.edu/~fan/wp/localpartfull.pdf
The smaller the ε the farther the random
walk will spread!
[Andersen, Lang: Communities from seed sets, 2006]
Algorithm summary:
- Pick a seed node S of interest
- Run PPR with teleport set = {S}
- Sort the nodes by the decreasing PPR score
- Sweep over the nodes and find good clusters
(Plot: cluster “quality” (lower is better) vs. node rank in decreasing PPR score from the seed node; local minima mark good clusters.)
Searching for small communities in
the Web graph
What is the signature of a community /
discussion in a Web graph?
[Kumar et al. ‘99]
Signature: a dense 2-layer graph.
Intuition: many people all talking about the same things.
Use this to define “topics”: what the same people on the left talk about on the right. (Remember HITS!)
A more well‐defined problem:
Enumerate complete bipartite subgraphs Ks,t
- Where Ks,t : s nodes on the “left” where each links
to the same t other nodes on the “right”
(Example: K3,4, fully connected from X to Y, with |X| = s = 3 and |Y| = t = 4.)
Market basket analysis. Setting:
- Market: Universe U of n items
- Baskets: m subsets of U: S1, S2, …, Sm ⊆ U
(Si is a set of items one person bought)
- Support: Frequency threshold f
Goal:
- Find all subsets T s.t. T ⊆ Si for at least f sets Si
(items in T were bought together at least f times)
What’s the connection between the
itemsets and complete bipartite graphs?
[Agrawal-Srikant ‘99]
Frequent itemsets = complete bipartite graphs!
How?
- View each node i as a
set Si of nodes i points to
- Ks,t = a set Y of size t
that occurs in s sets Si
- Looking for Ks,t → set the
frequency threshold to s and look for all frequent itemsets of size t
[Kumar et al. ‘99]
(Example: node i points to a, b, c, d, so Si = {a, b, c, d}.)
s … minimum support (|X| = s), t … itemset size (|Y| = t)
[Kumar et al. ‘99]
View each node i as the set Si of the nodes i points to (e.g. Si = {a, b, c, d}).
Find frequent itemsets (s … minimum support, t … itemset size).
Say we find a frequent itemset Y = {a, b, c} of support s.
So, there are s nodes (say x, y, z) that link to all of {a, b, c}: we found Ks,t!
Itemsets (Si = the nodes i points to):
a = {b, c, d}, b = {d}, c = {b, d, e, f}, d = {e, f}, e = {b, d}, f = {}
Support threshold s = 2:
- {b, d}: support 3 (in Sa, Sc, Se)
- {e, f}: support 2 (in Sc, Sd)
And we just found 2 bipartite subgraphs: K3,2 on X = {a, c, e}, Y = {b, d} and K2,2 on X = {c, d}, Y = {e, f}.
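This example can be checked mechanically with a brute-force sketch (fine for toy sizes; a real miner would use A-Priori-style pruning rather than enumerating every t-item set):

```python
from itertools import combinations

# Out-neighbor sets from the example above
S = {'a': {'b', 'c', 'd'}, 'b': {'d'}, 'c': {'b', 'd', 'e', 'f'},
     'd': {'e', 'f'}, 'e': {'b', 'd'}, 'f': set()}

def frequent_itemsets(S, t, support):
    """Brute force: every t-item set Y contained in at least `support` of
    the baskets Si; each hit is a K_{s,t} with X = the supporting nodes."""
    items = set().union(*S.values())
    found = {}
    for Y in combinations(sorted(items), t):
        X = [i for i, Si in S.items() if set(Y) <= Si]
        if len(X) >= support:
            found[Y] = X
    return found

print(frequent_itemsets(S, t=2, support=2))
# {('b', 'd'): ['a', 'c', 'e'], ('e', 'f'): ['c', 'd']}
```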
Example of a community from a web graph
Nodes on the right Nodes on the left
[Kumar, Raghavan, Rajagopalan, Tomkins: Trawling the Web for emerging cyber-communities 1999]