CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
HITS (Hypertext‐Induced Topic Selection)
- A measure of importance of pages or documents, similar to PageRank
- Proposed at around the same time as PageRank (‘98)
Goal: Say we want to find good newspapers
- Don’t just find newspapers. Find “experts” – people
who link in a coordinated way to good newspapers
Idea: Links as votes
- Page is more important if it has more links
- In‐coming links? Out‐going links?
2/12/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 3
Hubs and Authorities
Each page has 2 scores:
- Quality as an expert (hub):
- Total sum of votes of the authorities it points to
- Quality as content (authority):
- Total sum of votes coming from experts
Principle of repeated improvement
(Example authority votes: NYT: 10, CNN: 8, WSJ: 9, eBay: 3, Yahoo: 3)
Interesting pages fall into two classes:
- 1. Authorities are pages containing
useful information
- Newspaper home pages
- Course home pages
- Home pages of auto manufacturers
- 2. Hubs are pages that link to authorities
- List of newspapers
- Course bulletin
- List of US auto manufacturers
(Note: this is an idealized example. In reality the graph is not bipartite and each page has both a hub and an authority score.)
Each page starts with hub score 1. Authorities collect their votes
Sum of hub scores of nodes pointing to NYT.
Hubs collect authority scores
Sum of authority scores of nodes that the node points to.
Authorities again collect the hub scores
A good hub links to many good authorities; a good authority is linked from many good hubs.
Model using two scores for each node:
- Hub score and Authority score
- Represented as vectors h and a
Each page i has 2 scores:
- Authority score: a_i
- Hub score: h_i
HITS algorithm:
- Initialize: a_j = h_j = 1/√n
- Then keep iterating until convergence:
- Authority: a_i = ∑_{j→i} h_j
- Hub: h_i = ∑_{i→j} a_j
- Normalize a, h such that:
- ∑_i a_i² = 1, ∑_j h_j² = 1
[Kleinberg ‘98]
(Diagram: a_i = ∑ h_j over pages j1…j4 pointing to i; h_i = ∑ a_j over pages j1…j4 that i points to.)
n … number of nodes in the graph
Example (nodes: Yahoo, Amazon, M’soft):

    A = 1 1 1    A^T = 1 1 0
        1 0 1          1 0 1
        0 1 0          1 1 0

Iterates (normalized):
h(yahoo)  = .58  .80  .79  …  .788
h(amazon) = .58  .53  .57  …  .577
h(m’soft) = .58  .27  .23  …  .211

a(yahoo)  = .58  .58  .62  …  .628
a(amazon) = .58  .58  .49  …  .459
a(m’soft) = .58  .58  .62  …  .628
HITS converges to a single stable point. Notation:
- Vectors a = (a_1, …, a_n), h = (h_1, …, h_n)
- Adjacency matrix A (n x n): A_ij = 1 if i→j, else A_ij = 0
Then h_i = ∑_j A_ij · a_j can be rewritten as h = A · a
Similarly, a_i = ∑_j A_ji · h_j can be rewritten as a = A^T · h
[Kleinberg ‘98]
The hub score of page i is proportional to the sum of the authority scores of the pages it links to: h = λ A a
- λ is a scale factor: λ = 1 / ∑_i h_i
The authority score of page i is proportional to the sum of the hub scores of the pages linking to it: a = μ A^T h
- μ is a scale factor: μ = 1 / ∑_i a_i
HITS algorithm in vector notation:
- Set: a_i = h_i = 1/√n
- Repeat until convergence:
- h = A · a
- a = A^T · h
- Normalize h and a
Then:
- a is updated (in 2 steps): a = A^T (A a) = (A^T A) a
- h is updated (in 2 steps): h = A (A^T h) = (A A^T) h
- Repeated matrix powering: in k steps
- a = (A^T A)^k · a^(0)
- h = (A A^T)^k · h^(0)
- Convergence criterion:
- ∑_i (h_i^(t) − h_i^(t−1))² < ε
- ∑_i (a_i^(t) − a_i^(t−1))² < ε
- Under reasonable assumptions about A,
HITS converges to vectors h* and a*:
- h* is the principal eigenvector of matrix A A^T
- a* is the principal eigenvector of matrix A^T A
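The eigenvector claim can be checked directly by repeated matrix powering on M = A Aᵀ, using the same 3-node example matrix as earlier; this is a small verification sketch, with 50 iterations as an arbitrary choice of mine.

```python
import math

# Same 3-node example graph as before (Yahoo, Amazon, M'soft)
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
n = len(A)

# M = A A^T, the matrix whose principal eigenvector is h*
M = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(n)]
     for i in range(n)]

# Repeated matrix powering: h <- M h, renormalized each step
h = [1.0] * n
for _ in range(50):
    h = [sum(M[i][j] * h[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in h))
    h = [x / norm for x in h]

# Rayleigh quotient h^T (M h) estimates the principal eigenvalue
Mh = [sum(M[i][j] * h[j] for j in range(n)) for i in range(n)]
lam = sum(Mh[i] * h[i] for i in range(n))
print(round(lam, 3))  # 4.732 (= 3 + sqrt(3) for this A)
print([round(x, 3) for x in h])  # matches the HITS hub scores
```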
PageRank and HITS are two solutions to the
same problem:
- What is the value of an in‐link from u to v?
- In the PageRank model, the value of the link
depends on the links into u
- In the HITS model, it depends on the value of the other links out of u
The destinies of PageRank and HITS
post‐1998 were very different
We often think of networks being organized
into modules, clusters, communities:
Find micro‐markets by partitioning the
query‐to‐advertiser graph:
advertiser query
[Andersen, Lang: Communities from seed sets, 2006]
Clusters in Movies‐to‐Actors graph:
[Andersen, Lang: Communities from seed sets, 2006]
Discovering social circles, circles of trust:
[McAuley, Leskovec: Discovering social circles in ego networks, 2012]
The graph is large
- Assume the graph fits in main memory
- For example, to work with a 200M-node, 2B-edge
graph one needs approx. 16GB RAM
- But the graph is too big for running anything
more than linear-time algorithms
We will cover a PageRank based algorithm
for finding dense clusters
- The runtime of the algorithm will be proportional
to the cluster size (not the graph size!)
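The 16GB figure can be sanity-checked with back-of-envelope arithmetic, assuming the graph is stored as an edge list with two 4-byte node ids per edge (an assumed layout; real in-memory representations add per-node overhead on top of this).

```python
# Back-of-envelope check of the ~16GB RAM estimate for a 200M-node,
# 2B-edge graph, assuming two 4-byte endpoints per edge.
nodes = 200_000_000          # fits in a 4-byte id, since 2**32 > 2e8
edges = 2_000_000_000
bytes_per_edge = 2 * 4       # two 4-byte endpoints
gb = edges * bytes_per_edge / 2**30
print(round(gb, 1))  # 14.9, roughly the quoted 16GB
```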
Discovering clusters based on seed nodes
- Given: Seed node S
- Compute (approximate) Personalized PageRank
(PPR) around node S (teleport set={S})
- Idea is that if S belongs to a nice cluster, the
random walk will get trapped inside the cluster
Algorithm outline:
- Pick a seed node S of interest
- Run PPR with teleport set = {S}
- Sort the nodes by the decreasing PPR score
- Sweep over the nodes and find good clusters
(Plot: cluster “quality” (lower is better) vs. node rank in decreasing PPR score from the seed node; local minima mark good clusters.)
Undirected graph G(V, E). Partitioning task:
- Divide vertices into 2 disjoint groups A and B = V\A
Question:
- How can we define a “good” cluster in G?
What makes a good cluster?
- Maximize the number of within‐cluster
connections
- Minimize the number of between‐cluster
connections
Express cluster quality as a function of the
“edge cut” of the cluster
Cut: set of edges with only one endpoint in the
cluster:
cut(A) = ∑_{i∈A, j∉A} w_ij
Example: cut(A) = 2
Note: this works for weighted and unweighted (set all w_ij = 1) graphs
Partition quality: Cut score
- Quality of a cluster is the weight of connections
pointing outside the cluster
Degenerate case: minimum cut. Problem:
- Only considers external cluster connections
- Does not consider internal cluster connectivity
(Figure: the minimum cut can differ from the "optimal cut".)
Criterion: Conductance:
Connectivity of the group to the rest of the network relative to the density of the group:
φ(A) = |{(i, j) ∈ E; i ∈ A, j ∉ A}| / min(vol(A), 2m − vol(A))
- vol(A): total weight of the edges with at least
one endpoint in A:
- vol(A) = ∑_{i∈A} d_i
- m … number of edges in the graph
- d_i … degree of node i
Why use this criterion?
Produces more balanced partitions
[Shi‐Malik]
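The conductance formula can be computed directly from an edge list. The toy graph below (two triangles joined by one edge) is an illustration of mine, not from the slides:

```python
def conductance(edges, A):
    """phi(A) = cut(A) / min(vol(A), 2m - vol(A)) for an unweighted,
    undirected graph given as a list of edges (i, j)."""
    A = set(A)
    m = len(edges)
    # cut(A): edges with exactly one endpoint inside A
    cut = sum(1 for i, j in edges if (i in A) != (j in A))
    # vol(A): sum of degrees d_i over nodes i in A
    vol = sum((i in A) + (j in A) for i, j in edges)
    return cut / min(vol, 2 * m - vol)

# Toy graph: triangle {1,2,3} joined to triangle {4,5,6} by the edge (3,4)
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
print(conductance(edges, {1, 2, 3}))  # 1/7 ≈ 0.143 (a good, balanced cluster)
print(conductance(edges, {1, 2}))     # 2/4 = 0.5 (a worse cluster)
```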
Algorithm outline:
- Pick a seed node S of
interest
- Run PPR w/ teleport={S}
- Sort the nodes by the
decreasing PPR score
- Sweep over the nodes
and find good clusters
(Plot: conductance φ(A_i) vs. node rank i in decreasing PPR score; local minima mark good clusters.)
Sweep:
- Sort nodes in decreasing PPR score
- For each i compute φ(A_i), where A_i = {top i nodes}
- Local minima of φ(A_i)
correspond to good clusters
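The sweep can be sketched on the same toy graph; the PPR scores below are made-up values (decaying with distance from seed node 1), standing in for the output of an actual PPR computation.

```python
def conductance(edges, A, m):
    """phi(A) for an unweighted, undirected edge-list graph with m edges."""
    A = set(A)
    cut = sum(1 for i, j in edges if (i in A) != (j in A))
    vol = sum((i in A) + (j in A) for i, j in edges)
    return cut / min(vol, 2 * m - vol)

def sweep(ppr, edges):
    """Sort nodes by decreasing PPR score, then compute phi(A_i) for each
    proper prefix A_i = {top i nodes}; local minima mark good clusters."""
    order = sorted(ppr, key=ppr.get, reverse=True)
    m = len(edges)
    phis = [conductance(edges, order[:i + 1], m) for i in range(len(order) - 1)]
    return order, phis

# Two triangles joined by one edge; made-up PPR scores from seed node 1
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]
ppr = {1: .30, 2: .25, 3: .25, 4: .10, 5: .05, 6: .05}
order, phis = sweep(ppr, edges)
print(order)                        # [1, 2, 3, 4, 5, 6]
print([round(p, 2) for p in phis])  # [1.0, 0.5, 0.14, 0.5, 1.0]
```

The local minimum at i = 3 recovers the cluster {1, 2, 3} around the seed.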
How to compute Personalized PageRank (PPR)
without touching the whole graph?
- Power method won’t work since each single iteration
accesses all nodes of the graph:
r = β·M·r + (1−β)·a
- a is the teleport vector: a = [0, …, 0, 1, 0, …, 0] (with the 1 at index S)
- r is the personalized PageRank vector
PageRank‐Nibble [Andersen,Chung, Lang, ‘07]
- A fast method for computing approximate
Personalized PageRank (PPR) with teleport set ={S}
- ApproxPageRank(S, β, ε)
- S … seed node
- β … teleportation parameter
- ε … approximation error parameter
Overview of the approximate PPR
- Lazy random walk: a variant of a random walk
that stays put with probability 1/2 at each time step, and walks to a random neighbor the other half of the time
- Keep track of the residual PPR score q_u = p_u − r_u
- The residual tells us how well the PPR score of u is approximated
- p_u … the ‘true’ PageRank of node u
- r_u … the PageRank estimate at node u
- If the residual q_u of node u is too big (q_u/d_u ≥ ε), then spread
the walk further, else don’t touch the node
A different way to look at PageRank:
[Jeh&Widom. Scaling Personalized Web Search, 2002]
- p … the true PageRank vector
- p(β, a) … the PageRank vector with teleportation
vector a and teleportation parameter β
Gives an idea how to compute PageRank:
- Node u’s “view” of the graph (p_u) is the average of
its out‐neighbors’ views plus u’s own importance
Idea:
- r … approx. PageRank, q … its residual PageRank
- Start with r = 0 and q = [0, …, 0, 1, 0, …, 0] (the 1 at index S)
- Iteratively push PageRank from q
to r until q is small enough
- Maintain invariant:
- r + PPR(q) = p (the true PageRank)
[Andersen, Chung, Lang. Local graph partitioning using PageRank vectors, 2007]
Push(u, r, q): 1 step of a lazy random walk from node u:
- r′_u = r_u + (1 − β)·q_u   (update r)
- q′_u = ½·β·q_u   (do 1 step of a walk: stay at u with prob. ½)
- for each v such that (u→v) ∈ E:   (spread the remaining ½ fraction of q_u as if a single step of random walk were applied to u)
- q′_v = q_v + ½·β·q_u/d_u
- return (r′, q′)
ApproxPageRank(S, β, ε):
- Set r = 0, q = [0, …, 0, 1, 0, …, 0] (the 1 at index S)
- While max_u q_u/d_u ≥ ε:
- Choose any vertex u where q_u/d_u ≥ ε
- Push(u, r, q):
- Update r: r_u = r_u + (1 − β)·q_u   (move a (1 − β) fraction of the prob. from q_u to r_u)
- 1 step of a lazy random walk:
- Stay at u with prob. ½: q_u = ½·β·q_u
- Spread the remaining ½ fraction of q_u as if a single step of random walk were applied to u: for each v such that (u→v) ∈ E: q_v = q_v + ½·β·q_u/d_u
- Return r
Notation: r … PPR vector, r_u … PPR score of u, q … residual PPR vector, q_u … residual of node u, d_u … degree of u
Example run (nodes s, a, b, c; vectors indexed [s, a, b, c]):
- Init: r = [0, 0, 0, 0], q = [1, 0, 0, 0]
- Push(s, r, q): r = [.5, 0, 0, 0], q = [.25, .08, .08, .08]
- Push(s, r, q): r = [.62, 0, 0, 0], q = [.06, .10, .10, .10]
- Push(a, r, q): r = [.62, .05, 0, 0], q = [.09, .03, .10, .10]
- Push(b, r, q): r = [.62, .05, .05, 0], q = [.09, .05, .03, .10]
- …
- r = [.57, .19, .14, .09]
Runtime:
- PageRank‐Nibble computes PPR in time O(1/(ε(1−β))), independent of the graph size,
- with residual error q_u/d_u < ε at every node u
- The power method would take time proportional to the number of edges m per iteration
- Graph cut approximation guarantee:
- If there exists a cut of conductance φ and volume k,
then the method finds a cut of conductance O(√(φ·log k))
- Details in [Andersen, Chung, Lang. Local graph
partitioning using PageRank vectors, 2007]
http://www.math.ucsd.edu/~fan/wp/localpartfull.pdf
The smaller the ε the farther the random
walk will spread!
[Andersen, Lang: Communities from seed sets, 2006]
Algorithm summary:
- Pick a seed node S of interest
- Run PPR with teleport set = {S}
- Sort the nodes by the decreasing PPR score
- Sweep over the nodes and find good clusters
(Plot: cluster “quality” (lower is better) vs. node rank in decreasing PPR score from the seed node; local minima mark good clusters.)
Searching for small communities in
the Web graph
What is the signature of a community /
discussion in a Web graph?
[Kumar et al. ‘99]
Signature: a dense 2-layer graph.
Intuition: many people all talking about the same things.
Use this to define “topics”: what the same people on the left talk about on the right. (Remember HITS!)
A more well‐defined problem:
Enumerate complete bipartite subgraphs Ks,t
- Where Ks,t : s nodes on the “left” where each links
to the same t other nodes on the “right”
(Example: K3,4, fully connected from X to Y, with |X| = s = 3 and |Y| = t = 4.)
Market basket analysis. Setting:
- Market: Universe U of n items
- Baskets: m subsets of U: S1, S2, …, Sm ⊆ U
(Si is a set of items one person bought)
- Support: Frequency threshold f
Goal:
- Find all subsets T s.t. T ⊆ Si for at least f sets Si
(items in T were bought together at least f times)
What’s the connection between the
itemsets and complete bipartite graphs?
[Agrawal-Srikant ‘99]
Frequent itemsets = complete bipartite graphs!
How?
- View each node i as a
set Si of nodes i points to
- Ks,t = a set Y of size t
that occurs in s sets Si
- Looking for Ks,t → set the
frequency threshold to s and look for all frequent itemsets of size t
[Kumar et al. ‘99]
(Example: node i points to a, b, c, d, so Si = {a, b, c, d}.)
s … minimum support (|X| = s), t … itemset size (|Y| = t)
[Kumar et al. ‘99]
View each node i as the set Si of the nodes i points to (e.g. Si = {a, b, c, d}).
Find frequent itemsets (s … minimum support, t … itemset size).
Say we find a frequent itemset Y = {a, b, c} of support s.
So, there are s nodes (say x, y, z) that link to all of {a, b, c}: we found Ks,t!
Itemsets (Si = the nodes i points to):
a = {b, c, d}, b = {d}, c = {b, d, e, f}, d = {e, f}, e = {b, d}, f = {}
Support threshold s = 2:
- {b, d}: support 3 (in Sa, Sc, Se)
- {e, f}: support 2 (in Sc, Sd)
And we just found 2 bipartite subgraphs: K3,2 on X = {a, c, e}, Y = {b, d} and K2,2 on X = {c, d}, Y = {e, f}.
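This example can be checked mechanically with a brute-force sketch (fine for toy sizes; a real miner would use A-Priori-style pruning rather than enumerating every t-item set):

```python
from itertools import combinations

# Out-neighbor sets from the example above
S = {'a': {'b', 'c', 'd'}, 'b': {'d'}, 'c': {'b', 'd', 'e', 'f'},
     'd': {'e', 'f'}, 'e': {'b', 'd'}, 'f': set()}

def frequent_itemsets(S, t, support):
    """Brute force: every t-item set Y contained in at least `support` of
    the baskets Si; each hit is a K_{s,t} with X = the supporting nodes."""
    items = set().union(*S.values())
    found = {}
    for Y in combinations(sorted(items), t):
        X = [i for i, Si in S.items() if set(Y) <= Si]
        if len(X) >= support:
            found[Y] = X
    return found

print(frequent_itemsets(S, t=2, support=2))
# {('b', 'd'): ['a', 'c', 'e'], ('e', 'f'): ['c', 'd']}
```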
Example of a community from a web graph
Nodes on the right Nodes on the left
[Kumar, Raghavan, Rajagopalan, Tomkins: Trawling the Web for emerging cyber-communities 1999]