CS473: Link Analysis Luo Si Department of Computer Science Purdue - - PowerPoint PPT Presentation




slide-1
SLIDE 1

CS473: Link Analysis

Luo Si

Department of Computer Science, Purdue University

Borrowed Slides from Prof. Rong Jin (MSU)

slide-2
SLIDE 2

Citation Analysis

slide-3
SLIDE 3

Web Structure

Web is a graph

– Each web site corresponds to a node
– A link from one site to another site forms a directed edge

What does it look like?

– The web is a small world
– The diameter of the web is 19, i.e., the average number of clicks from one web site to another is 19
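
A minimal Python sketch of how such an "average number of clicks" can be computed, assuming a small hypothetical graph `toy_web` (not real web data) stored as an adjacency dict. It runs a breadth-first search from every site and averages the shortest-path lengths over all ordered pairs where the target is reachable.

```python
from collections import deque

def average_clicks(graph):
    """Mean shortest-path length over ordered pairs (u, v), u != v, with v reachable from u."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph.get(u, []):
                if v not in dist:          # first visit gives the shortest distance
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1             # exclude the source itself
    return total / pairs if pairs else 0.0

# Hypothetical four-site web: A -> B -> C -> A, and C -> D.
toy_web = {"A": ["B"], "B": ["C"], "C": ["A", "D"], "D": []}
print(average_clicks(toy_web))
```

Unreachable pairs (here, everything starting from D) are simply skipped, which is one of several possible conventions.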

slide-4
SLIDE 4

Bowtie Structure

Strongly Connected Component

Broder et al., 2001

slide-5
SLIDE 5

Bowtie Structure

Sites that link towards the ‘center’ of the web

Broder et al., 2001

slide-6
SLIDE 6

Bowtie Structure

Sites that link from the ‘center’ of the web

Broder et al., 2001
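
A rough Python sketch of the three bowtie regions described on these slides, assuming a hypothetical graph `toy_graph` and a `seed` site already known to lie in the strongly connected core. Sites that both reach the seed and are reached from it form the core; the sketch ignores the remaining pieces of the web (tendrils, disconnected sites).

```python
def reachable(graph, start):
    """Set of nodes reachable from start by following edges."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def bowtie(graph, seed):
    # Build the reversed graph so we can walk links backwards.
    reverse = {u: [] for u in graph}
    for u, outs in graph.items():
        for v in outs:
            reverse.setdefault(v, []).append(u)
    forward = reachable(graph, seed)       # sites the core links to
    backward = reachable(reverse, seed)    # sites that link toward the core
    core = forward & backward
    return {"SCC": core, "IN": backward - core, "OUT": forward - core}

# Hypothetical example: in1 -> c1 <-> c2 -> out1
toy_graph = {"in1": ["c1"], "c1": ["c2"], "c2": ["c1", "out1"], "out1": []}
parts = bowtie(toy_graph, "c1")
print(parts)
```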

slide-7
SLIDE 7

Inlinks and Outlinks

Both the in-degree (incoming links) and the out-degree (outgoing links) of sites follow a power law

Broder et al., 2001

slide-8
SLIDE 8

Early Approaches

Basic Assumptions

  • Hyperlinks contain information about the human judgment of a site
  • The more incoming links to a site, the more important it is judged to be

Bray 1996

  • The visibility of a site is measured by the number of other sites pointing to it
  • The luminosity of a site is measured by the number of other sites to which it points
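
A small Python sketch of these two counts, assuming a hypothetical edge list `edges` of distinct (source, target) pairs with no self-links. Visibility is the in-degree and luminosity is the out-degree; every linking site counts equally, whatever its own standing.

```python
from collections import Counter

def visibility_and_luminosity(edges):
    """Return (in-degree, out-degree) counters for a list of (source, target) links."""
    visibility = Counter(dst for _, dst in edges)   # number of sites pointing to each site
    luminosity = Counter(src for src, _ in edges)   # number of sites each site points to
    return visibility, luminosity

# Hypothetical links between four sites.
edges = [("a", "c"), ("b", "c"), ("c", "d"), ("a", "d"), ("a", "b")]
vis, lum = visibility_and_luminosity(edges)
print(vis, lum)
```

In this toy data, sites `c` and `d` get the same visibility even though they are cited by different parents.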

Limitation: failure to capture the relative importance of different parent (child) sites

slide-9
SLIDE 9

HITS - Kleinberg’s Algorithm

  • HITS – Hypertext Induced Topic Selection
  • For each vertex v ∈ V in a subgraph of interest:
      a(v) - the authority of v
      h(v) - the hubness of v
  • A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites
  • Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites

slide-10
SLIDE 10

Authority and Hubness

[Figure: nodes 2, 3, 4 link to node 1; node 1 links to nodes 5, 6, 7]

a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)

slide-11
SLIDE 11

Authority and Hubness: Version 1

HubsAuthorities(G)

    1 ← [1,…,1] ∈ R^|V|
    a_0 ← h_0 ← 1
    t ← 1
    repeat
        for each v in V
            do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
               h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
        t ← t + 1
    until || a_t - a_{t-1} || + || h_t - h_{t-1} || < ε
    return (a_t, h_t)

Recursive dependency:

$$ a(v) \leftarrow \sum_{w \in \mathrm{pa}[v]} h(w), \qquad h(v) \leftarrow \sum_{w \in \mathrm{ch}[v]} a(w) $$
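
A hedged Python sketch of the unnormalized update, run on the seven-site example from the Authority and Hubness slide (sites 2, 3, 4 link to site 1; site 1 links to sites 5, 6, 7). The dict layout `children` and the fixed `steps` argument are assumptions of this sketch; the convergence test is left out on purpose so the raw values can be inspected.

```python
def hits_unnormalized(children, steps):
    """Run `steps` rounds of a(v) <- sum of parents' h, h(v) <- sum of children's a."""
    nodes = list(children)
    parents = {v: [] for v in nodes}
    for u, outs in children.items():
        for v in outs:
            parents[v].append(u)
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(steps):
        # Both updates read the previous round's values.
        new_a = {v: sum(h[w] for w in parents[v]) for v in nodes}
        new_h = {v: sum(a[w] for w in children[v]) for v in nodes}
        a, h = new_a, new_h
    return a, h

# Every site must appear as a key, even if it has no outgoing links.
children = {1: [5, 6, 7], 2: [1], 3: [1], 4: [1], 5: [], 6: [], 7: []}
a1, h1 = hits_unnormalized(children, 1)
a10, h10 = hits_unnormalized(children, 10)
print(a1[1], h1[1], a10[1])
```

On this graph a(1) triples every two rounds, so the raw scores keep growing instead of settling.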

 

 

 

slide-12
SLIDE 12

Authority and Hubness: Version 1


 

 

 

Problems? In general the scores keep growing as t increases, so the stopping condition is never met.

slide-13
SLIDE 13

Authority and Hubness: Version 2

Recursive dependency:

$$ a(v) \leftarrow \sum_{w \in \mathrm{pa}[v]} h(w), \qquad h(v) \leftarrow \sum_{w \in \mathrm{ch}[v]} a(w) $$

+ Normalization:

$$ a(v) \leftarrow \frac{a(v)}{\sum_{w} a(w)}, \qquad h(v) \leftarrow \frac{h(v)}{\sum_{w} h(w)} $$

HubsAuthorities(G)

    1 ← [1,…,1] ∈ R^|V|
    a_0 ← h_0 ← 1
    t ← 1
    repeat
        for each v in V
            do a_t(v) ← Σ_{w ∈ pa[v]} h_{t-1}(w)
               h_t(v) ← Σ_{w ∈ ch[v]} a_{t-1}(w)
        a_t ← a_t / || a_t ||
        h_t ← h_t / || h_t ||
        t ← t + 1
    until || a_t - a_{t-1} || + || h_t - h_{t-1} || < ε
    return (a_t, h_t)
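
A minimal Python sketch of the normalized version, assuming the sum-to-one normalization written above and a hypothetical six-site graph (`p`, `q`, `r` act as hubs; `x`, `y`, `z` as authorities). The names `hits`, `eps`, and `max_iter` are choices of this sketch, not part of the slides.

```python
def hits(children, eps=1e-10, max_iter=1000):
    """HITS with sum-to-one normalization; children maps each site to the sites it links to."""
    nodes = list(children)
    parents = {v: [] for v in nodes}
    for u, outs in children.items():
        for v in outs:
            parents[v].append(u)
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        new_a = {v: sum(h[w] for w in parents[v]) for v in nodes}
        new_h = {v: sum(a[w] for w in children[v]) for v in nodes}
        sum_a = sum(new_a.values()) or 1.0   # guard against a graph with no links
        sum_h = sum(new_h.values()) or 1.0
        new_a = {v: x / sum_a for v, x in new_a.items()}
        new_h = {v: x / sum_h for v, x in new_h.items()}
        change = (sum(abs(new_a[v] - a[v]) for v in nodes)
                  + sum(abs(new_h[v] - h[v]) for v in nodes))
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h

children = {"p": ["x", "y"], "q": ["y", "z"], "r": ["y"], "x": [], "y": [], "z": []}
a, h = hits(children)
print(a, h)
```

Site `y`, cited by all three hubs, ends up with the largest authority, and `p` and `q` outrank `r` as hubs because they also point to a second authority.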

slide-14
SLIDE 14

HITS Example Results

[Figure: authority and hubness weights, two series (Authority, Hubness), horizontal axis 1 to 15]

slide-15
SLIDE 15

Authority and Hubness

Authority score

– Depends not only on the number of incoming links
– But also on the ‘quality’ (i.e., hubness) of the sites those links come from

Hubness score

– Depends not only on the number of outgoing links
– But also on the ‘quality’ (i.e., authority) of the sites those links point to

slide-16
SLIDE 16

Authority and Hub

Column vector a: a_i is the authority score for the i-th site
Column vector h: h_i is the hub score for the i-th site
Matrix M:

$$ M_{i,j} = \begin{cases} 1 & \text{if the } i\text{-th site points to the } j\text{-th site} \\ 0 & \text{otherwise} \end{cases} $$

[Figure: example matrix M]

slide-17
SLIDE 17

Authority and Hub

  • Recursive dependency:

$$ a(v) \leftarrow \sum_{w \in \mathrm{pa}[v]} h(w), \qquad h(v) \leftarrow \sum_{w \in \mathrm{ch}[v]} a(w) $$

slide-18
SLIDE 18

Authority and Hub

  • Recursive dependency in matrix form:

$$ a = M^{T} h, \qquad h = M a $$

slide-19
SLIDE 19

Authority and Hub

  • Iterative update:

$$ a_t = M^{T} h_{t-1}, \qquad h_t = M a_{t-1} $$

  • Normalization procedure: rescale a_t and h_t after each update

slide-20
SLIDE 20

Authority and Hub

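Apply singular value decomposition to matrix M

$$ a_t = M^{T} h_{t-1} = M^{T} M\, a_{t-2}, \qquad h_t = M a_{t-1} = M M^{T} h_{t-2} $$

$$ M = U \Sigma V^{T} = \sum_i \sigma_i u_i v_i^{T} $$

Since $M^{T}M = V\Sigma^{2}V^{T}$ and $MM^{T} = U\Sigma^{2}U^{T}$, the normalized iterates converge to the principal singular vectors:

$$ a \propto v_1, \qquad h \propto u_1 $$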
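
A small pure-Python check of this relationship, assuming a hypothetical 6-site matrix `M` with M[i][j] = 1 when site i points to site j. Power iteration on M^T M gives the authority direction and on M M^T the hub direction; the helper names are inventions of this sketch.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def unit(x):
    norm = sum(v * v for v in x) ** 0.5
    return [v / norm for v in x]

def principal_eigenvector(S, iters=200):
    """Power iteration from the all-ones vector (assumes a unique largest eigenvalue)."""
    x = unit([1.0] * len(S))
    for _ in range(iters):
        x = unit(matvec(S, x))
    return x

# Hypothetical graph: sites 0, 1, 2 link to sites 3, 4, 5.
M = [
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
Mt = transpose(M)
MtM = matmul(Mt, M)
a = principal_eigenvector(MtM)              # authority direction
h = principal_eigenvector(matmul(M, Mt))    # hub direction
sigma_sq = sum(x * y for x, y in zip(a, matvec(MtM, a)))  # Rayleigh quotient = largest singular value squared
print(a, h, sigma_sq)
```

The two vectors are tied together by h ∝ M a and a ∝ M^T h, which is the pairing of left and right singular vectors.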

slide-21
SLIDE 21

PageRank

Introduced by Page et al. (1998)

– The weight is assigned by the rank of parents

Difference with HITS

– HITS takes hubness and authority weights
– A page's rank is proportional to its parents’ rank, but inversely proportional to its parents’ outdegree
slide-22
SLIDE 22

Matrix Notation

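[Figure: example matrices M and B]

$$ M_{i,j} = \begin{cases} 1 & \text{if the } i\text{-th site points to the } j\text{-th site} \\ 0 & \text{otherwise} \end{cases} $$

$$ B_{i,j} = \begin{cases} \dfrac{M_{i,j}}{\sum_{k} M_{i,k}} & \text{if } \sum_{k} M_{i,k} > 0 \\ 0 & \text{otherwise} \end{cases} $$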
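
A short Python sketch of this row normalization, assuming M is a list of 0/1 rows and that a site with no outgoing links keeps an all-zero row. The function name `row_normalize` and the 3-site matrix are hypothetical.

```python
def row_normalize(M):
    """Divide each row of the link matrix by its out-degree (row sum)."""
    B = []
    for row in M:
        out_degree = sum(row)
        B.append([x / out_degree if out_degree else 0.0 for x in row])
    return B

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
M = [
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
B = row_normalize(M)
print(B)
```

Each nonzero row of B sums to 1, so B[i][j] can be read as the share of site i's rank passed on to site j.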

slide-23
SLIDE 23

Matrix Notation

$$ r = \alpha B^{T} r $$

r_i: the rank score for the i-th web page
r: eigenvector of B^T
1/α: the corresponding eigenvalue

Finding PageRank amounts to finding the principal eigenvector of B^T
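
A minimal power-iteration sketch in Python for this eigenvector problem, assuming a hypothetical 3-site matrix B in which every site has at least one outgoing link (so the principal eigenvalue is 1 and α = 1). No damping or teleportation term is included, since the slides do not introduce one here.

```python
def pagerank_power_iteration(B, iters=100):
    """Repeatedly apply r <- B^T r, rescaling so the scores sum to 1."""
    n = len(B)
    r = [1.0 / n] * n
    for _ in range(iters):
        # r_i collects rank from every parent j, weighted by B[j][i].
        r = [sum(B[j][i] * r[j] for j in range(n)) for i in range(n)]
        total = sum(r)
        r = [x / total for x in r]
    return r

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0 (rows already normalized).
B = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
r = pagerank_power_iteration(B)
print(r)
```

Site 1 receives only half of site 0's rank, so it ends with half the score of the other two sites.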

slide-24
SLIDE 24

Matrix Notation

slide-25
SLIDE 25

Random Walk Model

Consider a random walk through the Web graph

B =


slide-26
SLIDE 26

Random Walk Model

Consider a random walk through the Web graph

B =

slide-27
SLIDE 27

Random Walk Model

Consider a random walk through the Web graph

B =

slide-28
SLIDE 28

Random Walk Model

Consider a random walk through the Web graph

B =

T, what is portion of time that the surfer will spend time

  • n each site?
slide-29
SLIDE 29

Random Walk Model

Consider a random walk through the Web graph

B =

p(i): percentage of time that the surfer will stay at the i-th site

$$ p(k) = \sum_i B_{i,k}\, p(i) \quad\Longrightarrow\quad p = B^{T} p $$
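
A hedged simulation sketch in Python: a surfer follows links at random on a hypothetical 3-site graph, and the fraction of time spent at each site is compared with the solution of p = B^T p, which for this graph is (0.4, 0.2, 0.4). The function name, step count, and seed are arbitrary choices of this sketch.

```python
import random

def simulate_surfer(B, steps, seed=0):
    """Random walk on transition matrix B; returns the fraction of steps spent at each site."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    n = len(B)
    visits = [0] * n
    state = 0
    for _ in range(steps):
        # Pick the next site according to the current row of B.
        state = rng.choices(range(n), weights=B[state])[0]
        visits[state] += 1
    return [v / steps for v in visits]

# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
B = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
freq = simulate_surfer(B, steps=100_000)
print(freq)
```

The observed frequencies only approximate the stationary values; the gap shrinks as the number of steps grows.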

slide-30
SLIDE 30

Adding Self Loop

Allow the surfer to decide to stay in the same place

B =

$$ B' = \alpha B + (1 - \alpha) I $$
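
A small Python sketch of why a self loop can help, assuming the mixture form B' = αB + (1 - α)I with α the probability of following a link (the symbol and the value 0.8 are assumptions of this sketch). On a hypothetical two-site cycle the plain walk flips back and forth forever, while the walk with self loops settles at (0.5, 0.5).

```python
def add_self_loops(B, alpha):
    """Return alpha * B + (1 - alpha) * I."""
    n = len(B)
    return [[alpha * B[i][j] + (1 - alpha) * (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def propagate(B, p, steps):
    """Apply p <- B^T p for the given number of steps."""
    n = len(B)
    for _ in range(steps):
        p = [sum(B[j][i] * p[j] for j in range(n)) for i in range(n)]
    return p

# Hypothetical two-site cycle: 0 -> 1 and 1 -> 0.
B = [[0.0, 1.0], [1.0, 0.0]]
start = [1.0, 0.0]
print(propagate(B, start, 10), propagate(B, start, 11))   # keeps oscillating
B_lazy = add_self_loops(B, alpha=0.8)
p_lazy = propagate(B_lazy, start, 60)
print(p_lazy)
```

Any p with p = B^T p also satisfies p = B'^T p, so the self loops change how the walk converges, not where it converges to.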