HITS (Hypertext Induced Topic Selection) - PowerPoint PPT Presentation



SLIDE 1

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

http://cs246.stanford.edu

SLIDE 2

SLIDE 3

HITS (Hypertext-Induced Topic Selection)

  • A measure of importance of pages or documents, similar to PageRank
  • Proposed at around the same time as PageRank ('98)

Goal: Say we want to find good newspapers

  • Don't just find newspapers. Find "experts": people who link in a coordinated way to good newspapers

Idea: Links as votes

  • A page is more important if it has more links
  • In-coming links? Out-going links?

2/12/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 3

SLIDE 4

Hubs and Authorities

Each page has 2 scores:

  • Quality as an expert (hub): total sum of the votes of the authorities it points to
  • Quality as content (authority): total sum of the votes coming from experts

Principle of repeated improvement

(Example authority scores: NYT: 10, Ebay: 3, Yahoo: 3, CNN: 8, WSJ: 9)

SLIDE 5

Interesting pages fall into two classes:

  1. Authorities are pages containing useful information
     • Newspaper home pages
     • Course home pages
     • Home pages of auto manufacturers
  2. Hubs are pages that link to authorities
     • List of newspapers
     • Course bulletin
     • List of US auto manufacturers

SLIDE 6

Each page starts with hub score 1. Authorities collect their votes.

(Note: this is an idealized example. In reality the graph is not bipartite and each page has both a hub and an authority score.)

SLIDE 7

Sum of hub scores of nodes pointing to NYT.

SLIDE 8

Hubs collect authority scores

Sum of authority scores of nodes that the node points to.

SLIDE 9

Authorities again collect the hub scores

SLIDE 10

A good hub links to many good authorities. A good authority is linked from many good hubs.

Model this using two scores for each node:

  • Hub score and authority score
  • Represented as vectors h and a

SLIDE 11

[Kleinberg '98]

Each page i has 2 scores:

  • Authority score: a_i
  • Hub score: h_i

HITS algorithm:

  • Initialize: a_i = h_i = 1/√n
  • Then keep iterating until convergence:
    • Authority: a_i = Σ_{j : j→i} h_j
    • Hub: h_i = Σ_{j : i→j} a_j
    • Normalize a, h such that: Σ_i (a_i)² = 1, Σ_i (h_i)² = 1

(Figure: node i with in-links from j1…j4 for the authority update, and out-links to j1…j4 for the hub update.)

n … number of nodes in the graph

SLIDE 12

Example (rows of A: Yahoo, Amazon, M'soft):

      1 1 1            1 1 0
  A = 1 0 1      A^T = 1 0 1
      0 1 0            1 1 0

  h(yahoo)  = .58  .80  .79  …  .788
  h(amazon) = .58  .53  .57  …  .577
  h(m'soft) = .58  .27  .23  …  .211

  a(yahoo)  = .58  .58  .62  …  .628
  a(amazon) = .58  .58  .49  …  .459
  a(m'soft) = .58  .58  .62  …  .628

SLIDE 13

[Kleinberg '98]

HITS converges to a single stable point.

Notation:

  • Vectors a = (a_1, …, a_n), h = (h_1, …, h_n)
  • Adjacency matrix A (n x n): A_ij = 1 if i→j, else A_ij = 0

Then h_i = Σ_{j : i→j} a_j can be rewritten as h_i = Σ_j A_ij · a_j

  • So: h = A · a

Similarly, a_i = Σ_{j : j→i} h_j can be rewritten as a_i = Σ_j A_ji · h_j

  • So: a = A^T · h

SLIDE 14

The hub score of page i is proportional to the sum of the authority scores of the pages it links to: h = λ A a

  • λ is a scale factor

The authority score of page i is proportional to the sum of the hub scores of the pages that link to it: a = μ A^T h

  • μ is a scale factor

SLIDE 15

HITS algorithm in vector notation:

  • Set: a_i = h_i = 1/√n
  • Repeat until convergence:
    • h = A · a
    • a = A^T · h
    • Normalize h and a
  • Convergence criterion: Σ_i (h_i^(t) − h_i^(t−1))² < ε and Σ_i (a_i^(t) − a_i^(t−1))² < ε

Then:

  • a is updated (in 2 steps): a = A^T (A · a) = (A^T A) a
  • h is updated (in 2 steps): h = A (A^T · h) = (A A^T) h

Thus, in k steps:

  • h = (A A^T)^k h, a = (A^T A)^k a (repeated matrix powering)
SLIDE 16

Under reasonable assumptions about A, HITS converges to vectors h* and a*:

  • h* is the principal eigenvector of the matrix A A^T
  • a* is the principal eigenvector of the matrix A^T A
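This fixed point can be checked directly with an eigendecomposition. The snippet below (an illustration, reusing the earlier 3-node example graph) compares 50 rounds of the HITS iteration against numpy's principal eigenvectors of A A^T and A^T A:

```python
import numpy as np

A = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# A A^T and A^T A are symmetric, so eigh applies; take the top eigenvector.
wh, Vh = np.linalg.eigh(A @ A.T)
wa, Va = np.linalg.eigh(A.T @ A)
h_star = np.abs(Vh[:, np.argmax(wh)])   # entries are nonnegative up to sign
a_star = np.abs(Va[:, np.argmax(wa)])

# 50 rounds of the HITS iteration
h = np.ones(3) / np.sqrt(3)
for _ in range(50):
    a = A.T @ h; a /= np.linalg.norm(a)
    h = A @ a;  h /= np.linalg.norm(h)
print(h, h_star)  # the iterate matches the principal eigenvector
```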
SLIDE 17

PageRank and HITS are two solutions to the same problem:

  • What is the value of an in-link from u to v?
  • In the PageRank model, the value of the link depends on the links into u
  • In the HITS model, it depends on the value of the other links out of u

The destinies of PageRank and HITS post-1998 were very different.

SLIDE 18

SLIDE 19

We often think of networks as being organized into modules, clusters, communities:

SLIDE 20

SLIDE 21

Find micro-markets by partitioning the query-to-advertiser graph (figure: bipartite graph of advertiser and query nodes).

[Andersen, Lang: Communities from seed sets, 2006]

SLIDE 22

Clusters in the Movies-to-Actors graph:

[Andersen, Lang: Communities from seed sets, 2006]

SLIDE 23

Discovering social circles, circles of trust:

[McAuley, Leskovec: Discovering social circles in ego networks, 2012]

SLIDE 24

Graph is large:

  • Assume the graph fits in main memory
  • For example, to work with a 200M-node, 2B-edge graph one needs approx. 16GB RAM
  • But the graph is too big for running anything more than linear-time algorithms

We will cover a PageRank-based algorithm for finding dense clusters:

  • The runtime of the algorithm will be proportional to the cluster size (not the graph size!)
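The 16GB figure is consistent with a simple back-of-the-envelope count (my accounting, not spelled out on the slide): an edge list with 4-byte node ids stores 2 ids per edge.

```python
# 4-byte (32-bit) ids are enough to address 200M distinct nodes.
edges = 2_000_000_000                  # 2B edges
bytes_per_id = 4
total = edges * 2 * bytes_per_id       # two endpoints per edge
print(total / 10**9)                   # 16.0, i.e. ~16 GB for the edge list
```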

SLIDE 25

Discovering clusters based on seed nodes:

  • Given: seed node S
  • Compute (approximate) Personalized PageRank (PPR) around node S (teleport set = {S})
  • Idea: if S belongs to a nice cluster, the random walk will get trapped inside the cluster

SLIDE 26

Algorithm outline:

  • Pick a seed node S of interest
  • Run PPR with teleport set = {S}
  • Sort the nodes by decreasing PPR score
  • Sweep over the nodes and find good clusters

(Figure: sweep plot; x-axis: node rank in decreasing PPR score; y-axis: cluster "quality", lower is better; local minima mark good clusters.)

SLIDE 27

Undirected graph. Partitioning task:

  • Divide the vertices into 2 disjoint groups A and B = V\A

Question:

  • How can we define a "good" cluster?

(Figure: 6-node example graph with A = {1, 2, 3} and B = V\A = {4, 5, 6}.)

SLIDE 28

What makes a good cluster?

  • Maximize the number of within-cluster connections
  • Minimize the number of between-cluster connections

(Figure: 6-node example graph with groups A and V\A.)

SLIDE 29

Express cluster quality as a function of the "edge cut" of the cluster.

Cut: set of edges with only one endpoint in the cluster:

  cut(A) = Σ_{i∈A, j∉A} w_ij

In the example: cut(A) = 2.

Note: this works for weighted and unweighted (set all w_ij = 1) graphs.

SLIDE 30

Partition quality: cut score

  • The quality of a cluster is the weight of connections pointing outside the cluster
  • Degenerate case: the minimum cut

Problem:

  • Only considers external cluster connections
  • Does not consider internal cluster connectivity

(Figure: the minimum cut can differ from the "optimal cut".)

SLIDE 31

Criterion: Conductance [Shi-Malik]

Connectivity of the group to the rest of the network, relative to the density of the group:

  φ(A) = |{(i, j) ∈ E ; i ∈ A, j ∉ A}| / min(vol(A), 2m − vol(A))

  • vol(A) = Σ_{i∈A} d_i: total weight of the edges with at least one endpoint in A
  • m … number of edges in the graph
  • d_i … degree of node i

Why use this criterion? It produces more balanced partitions.
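As a concrete check of the formula, here is a small sketch. The 6-node graph (two triangles plus two cross edges) is my assumption in the spirit of the earlier example, not the slides' exact figure:

```python
def conductance(edges, A):
    """phi(A) = cut(A) / min(vol(A), 2m - vol(A)) for an unweighted graph."""
    A = set(A)
    m = len(edges)
    cut = sum(1 for u, v in edges if (u in A) != (v in A))  # edges leaving A
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    vol = sum(deg.get(u, 0) for u in A)   # vol(A) = sum of degrees in A
    return cut / min(vol, 2 * m - vol)

# Two triangles {1,2,3} and {4,5,6} joined by cross edges (2,4) and (3,5)
edges = [(1, 2), (1, 3), (2, 3), (4, 5), (4, 6), (5, 6), (2, 4), (3, 5)]
print(conductance(edges, {1, 2, 3}))  # cut=2, vol=8, 2m-vol=8 -> 0.25
```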

SLIDE 32

SLIDE 33

Algorithm outline:

  • Pick a seed node S of interest
  • Run PPR with teleport set = {S}
  • Sort the nodes by decreasing PPR score
  • Sweep over the nodes and find good clusters

Sweep:

  • Sort the nodes in decreasing PPR score
  • For each prefix A_i = {first i nodes}, compute φ(A_i)
  • Local minima of φ(A_i) correspond to good clusters

(Figure: conductance vs. node rank i in decreasing PPR score; local minima mark good clusters.)
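The outline above can be sketched end to end. Everything below (the toy graph, 200 power-iteration steps for the PPR, β = 0.8) is an illustrative assumption rather than the slides' setup:

```python
import numpy as np

# Toy graph (assumed): two triangles {0,1,2} and {3,4,5} joined by the
# bridge edge (2,3); seed S = 0; beta = 0.8 is the link-follow probability.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
n, beta = 6, 0.8

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
d = A.sum(axis=1)
M = A / d[:, None]                    # row-stochastic random-walk matrix

# Personalized PageRank with teleport set {0} via power iteration
s = np.zeros(n); s[0] = 1.0
r = s.copy()
for _ in range(200):
    r = beta * (M.T @ r) + (1 - beta) * s

# Sweep: conductance of each prefix of nodes sorted by decreasing PPR score
order = np.argsort(-r)
m = len(edges)
best, best_phi = None, float("inf")
for k in range(1, n):                 # proper prefixes only
    Ak = set(order[:k].tolist())
    cut = sum(1 for i, j in edges if (i in Ak) != (j in Ak))
    vol = sum(d[i] for i in Ak)
    phi = cut / min(vol, 2 * m - vol)
    if phi < best_phi:
        best, best_phi = Ak, phi
print(best, best_phi)  # the seed's triangle {0, 1, 2}, phi = 1/7
```

The best sweep cut recovers the seed's triangle, since the PPR mass gets trapped on the seed's side of the bridge.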

SLIDE 34

How do we compute Personalized PageRank (PPR) without touching the whole graph?

  • The power method won't work, since each single iteration accesses all nodes of the graph:
      r = β M · r + (1 − β) a
  • a is the teleport vector: a = [0 … 0 1 0 … 0] (with the 1 at index S)
  • r is the personalized PageRank vector

PageRank-Nibble [Andersen, Chung, Lang, '07]

  • A fast method for computing approximate Personalized PageRank (PPR) with teleport set = {S}
  • ApproxPageRank(S, β, ε)
    • S … seed node
    • β … teleportation parameter
    • ε … approximation error parameter

SLIDE 35

Overview of the approximate PPR:

  • Based on the lazy random walk, a variant of the random walk that stays put with probability 1/2 at each time step, and walks to a random neighbor the other half of the time
  • Keep track of a residual PPR score q_u for each node u
    • The residual tells us how well the PPR score of u is approximated
    • p_u … the "true" PageRank of node u
    • r_u … the PageRank estimate of node u
  • If the residual q_u of node u is too big, then spread the walk further, else don't touch the node

SLIDE 36

A different way to look at PageRank: [Jeh&Widom. Scaling Personalized Web Search, 2002]

  • p is the true PageRank vector
  • p_β(a) is the PageRank vector with teleportation vector (set) a and teleportation parameter β

This gives an idea of how to compute PageRank:

  • Node u's "view" of the graph (p_β(u)) is the average of its out-neighbors' views plus u's own importance

SLIDE 37

[Andersen, Chung, Lang. Local graph partitioning using PageRank vectors, 2007]

Idea:

  • r … approximate PageRank, q … its residual PageRank
  • Start with r = 0 and q = a (the teleport vector)
  • Iteratively push PageRank from q to r until q is small enough
  • Maintain the invariant: r + p(q) = p (the true PageRank)

Push(u, r, q): 1 step of a lazy random walk from node u:

  • r'_u = r_u + (1 − β) q_u (update r)
  • q'_u = β q_u / 2 (stay at u with prob. ½)
  • For each v such that (u, v) ∈ E: q'_v = q_v + β q_u / (2 d_u) (spread the remaining ½ fraction of q_u as if a single step of a random walk were applied to u)
  • Return (r', q')

SLIDE 38

ApproxPageRank(S, β, ε):

  • Set r = 0, q = [0 … 0 1 0 … 0] (1 at index S)
  • While max_u q_u / d_u ≥ ε:
    • Choose any vertex u where q_u / d_u ≥ ε
    • Push(u, r, q):
      • r'_u = r_u + (1 − β) q_u (move a (1 − β) fraction of the prob. from q_u to r_u)
      • q'_u = β q_u / 2 (1 step of a lazy random walk: stay at u with prob. ½)
      • For each v such that (u, v) ∈ E: q'_v = q_v + β q_u / (2 d_u) (spread the remaining ½ fraction of q_u as if a single step of a random walk were applied to u)
    • Update r = r', q = q'
  • Return r

Notation: r … PPR vector; r_u … PPR score of u; q … residual PPR vector; q_u … residual of node u; d_u … degree of u.
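A runnable sketch of this pseudocode (my Python rendering; the 4-node graph is assumed, and β is the link-follow probability as in the slides, so 1 − β is the teleport probability):

```python
def approx_pagerank(adj, S, beta=0.5, eps=1e-4):
    """Push-based approximate PPR in the style of ApproxPageRank.
    adj: dict node -> list of neighbors (undirected graph)."""
    r = {u: 0.0 for u in adj}          # PPR estimate
    q = {u: 0.0 for u in adj}          # residual
    q[S] = 1.0
    d = {u: len(adj[u]) for u in adj}
    while True:
        big = [u for u in adj if q[u] >= eps * d[u]]
        if not big:                    # every residual is small: done
            return r, q
        u = big[0]
        qu = q[u]
        r[u] += (1 - beta) * qu        # move a (1 - beta) fraction to r_u
        q[u] = beta * qu / 2           # lazy walk: stay put with prob. 1/2
        for v in adj[u]:               # spread the other half to the neighbors
            q[v] += beta * qu / (2 * d[u])

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}   # assumed toy graph
r, q = approx_pagerank(adj, S=0)
print(r)
```

Each push moves at least (1 − β)·ε·d_u of probability mass from q to r, so the loop touches only nodes near the seed and terminates after O(1/(ε(1 − β))) pushes.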

SLIDE 39

Example run of ApproxPageRank on a 4-node graph with nodes s, a, b, c (seed = s):

  Init:          r = [0, 0, 0, 0]         q = [1, 0, 0, 0]
  Push(s, r, q): r = [.5, 0, 0, 0]        q = [.25, .08, .08, .08]
  Push(s, r, q): r = [.62, 0, 0, 0]       q = [.06, .10, .10, .10]
  Push(a, r, q): r = [.62, .05, 0, 0]     q = [.09, .03, .10, .10]
  Push(b, r, q): r = [.62, .05, .05, 0]   q = [.09, .05, .03, .10]
  …
  Final:         r = [.57, .19, .14, .09]

SLIDE 40

Runtime:

  • PageRank-Nibble computes PPR with residual error ≤ ε in time O(1/(ε(1 − β))), independent of the graph size
  • The power method would take time proportional to the size of the whole graph per iteration

Graph cut approximation guarantee:

  • If there exists a cut of conductance φ and volume k, then the method finds a cut of conductance O(√(φ log k))
  • Details in [Andersen, Chung, Lang. Local graph partitioning using PageRank vectors, 2007]
    http://www.math.ucsd.edu/~fan/wp/localpartfull.pdf

SLIDE 41

The smaller the ε, the farther the random walk will spread!

SLIDE 42

[Andersen, Lang: Communities from seed sets, 2006]

SLIDE 43

SLIDE 44

Algorithm summary:

  • Pick a seed node S of interest
  • Run PPR with teleport set = {S}
  • Sort the nodes by decreasing PPR score
  • Sweep over the nodes and find good clusters

(Figure: sweep plot; x-axis: node rank in decreasing PPR score; y-axis: cluster "quality", lower is better.)

SLIDE 45

SLIDE 46

[Kumar et al. '99]

Searching for small communities in the Web graph.

What is the signature of a community / discussion in a Web graph?

  • A dense 2-layer graph
  • Intuition: many people all talking about the same things
  • Use this to define "topics": what the same people on the left talk about on the right
  • Remember HITS!

SLIDE 47

A more well-defined problem: enumerate complete bipartite subgraphs K_s,t

  • K_s,t: s nodes on the "left", where each links to the same t other nodes on the "right"

Example: K_3,4 with |X| = s = 3 left nodes and |Y| = t = 4 right nodes, fully connected.

SLIDE 48

[Agrawal-Srikant '99]

Market basket analysis. Setting:

  • Market: universe U of n items
  • Baskets: m subsets of U: S1, S2, …, Sm ⊆ U (Si is the set of items one person bought)
  • Support: frequency threshold f

Goal:

  • Find all subsets T such that T ⊆ Si for at least f of the sets Si (the items in T were bought together at least f times)

What's the connection between the itemsets and complete bipartite graphs?

SLIDE 49

[Kumar et al. '99]

Frequent itemsets = complete bipartite graphs!

How?

  • View each node i as the set Si of nodes that i points to (e.g. Si = {a, b, c, d})
  • K_s,t = a set Y of size t that occurs in s of the sets Si
  • Looking for K_s,t: set the frequency threshold to s and look at layer t, i.e. all frequent sets of size t
  • s … minimum support (|X| = s), t … itemset size (|Y| = t)

SLIDE 50

[Kumar et al. '99]

Find frequent itemsets (s … minimum support, t … itemset size):

  • View each node i as the set Si of nodes that i points to (e.g. Si = {a, b, c, d})
  • Say we find a frequent itemset Y = {a, b, c} of support s
  • Then there are s nodes (x, y, z) that all link to all of {a, b, c}: we found K_s,t!

SLIDE 51

Itemsets:

  a = {b,c,d}, b = {d}, c = {b,d,e,f}, d = {e,f}, e = {b,d}, f = {}

Support threshold s = 2; frequent itemsets of size t = 2:

  • {b,d}: support 3 (contained in a, c, e)
  • {e,f}: support 2 (contained in c, d)

And we just found 2 complete bipartite subgraphs: one on ({a, c, e}, {b, d}) and one on ({c, d}, {e, f}).
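The example above can be checked mechanically. A brute-force sketch (fine for toy sizes; a real system would use a frequent-itemset algorithm such as A-Priori):

```python
from itertools import combinations

# The slide's toy graph, as out-neighbor sets S_i
S = {'a': {'b', 'c', 'd'}, 'b': {'d'}, 'c': {'b', 'd', 'e', 'f'},
     'd': {'e', 'f'}, 'e': {'b', 'd'}, 'f': set()}

def complete_bipartite_subgraphs(S, s, t):
    """Every itemset Y of size t contained in >= s of the sets S_i
    gives a K_{s,t} with left side X = {i : Y subset of S_i}."""
    items = sorted(set().union(*S.values()))
    found = {}
    for Y in combinations(items, t):
        X = {i for i, Si in S.items() if set(Y) <= Si}
        if len(X) >= s:                # Y is frequent with support >= s
            found[frozenset(Y)] = X
    return found

print(complete_bipartite_subgraphs(S, s=2, t=2))
# {b,d} occurs in a, c, e; {e,f} occurs in c, d
```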

SLIDE 52

Example of a community from a web graph (figure: nodes on the left link to nodes on the right).

[Kumar, Raghavan, Rajagopalan, Tomkins: Trawling the Web for emerging cyber-communities, 1999]