SLIDE 1

Link Analysis

Stony Brook University CSE545, Fall 2016

SLIDE 2

The Web, circa 1998

SLIDE 3

The Web, circa 1998

SLIDE 4

The Web, circa 1998

Match keywords, language (information retrieval)
Explore a directory

SLIDE 5

The Web, circa 1998

Match keywords, language (information retrieval)
Explore a directory

Keyword matching is easy to game with “term spam”; directories are time-consuming to maintain and not open-ended.

SLIDE 6

Enter PageRank

...

SLIDE 7

PageRank

Key Idea: Consider the citations of the website in addition to keywords.

SLIDE 8

PageRank

Key Idea: Consider the citations of the website in addition to keywords. Who links to it? And what are their citations?

SLIDE 9

PageRank

Key Idea: Consider the citations of the website in addition to keywords. The Web as a directed graph: who links to it? And what are their citations?

SLIDE 10

PageRank

Key Idea: Consider the citations of the website in addition to keywords. Who links to it? And what are their citations?
Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.

SLIDE 11

PageRank

Key Idea: Consider the citations of the website in addition to keywords.
Flow Model: in-links as votes
Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.

SLIDE 12

PageRank

Key Idea: Consider the citations of the website in addition to keywords.
Flow Model: in-links (citations) as votes. But citations from important pages should count more; use recursion to figure out whether each page is important.
Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.

SLIDE 13

Key Idea: Consider the citations of the website in addition to keywords.

PageRank

Flow Model: in-links (citations) as votes. But citations from important pages should count more; use recursion to figure out whether each page is important.
Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.

SLIDE 14

PageRank

Key Idea: Consider the citations of the website in addition to keywords.
Flow Model: in-links (citations) as votes. But citations from important pages should count more; use recursion to figure out whether each page is important.
Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.
How to compute? Each page j has an importance (i.e. its rank, r_j); n_j is the number of j’s out-links.

SLIDE 15

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.
How to compute? Each page j has an importance (i.e. its rank, r_j); n_j is the number of j’s out-links.
[Graph: nodes A, B, C, D]

SLIDE 16

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.
How to compute? Each page j has an importance (i.e. its rank, r_j); n_j is the number of j’s out-links.
[Graph: nodes A, B, C, D, with edge contributions r_A/1, r_B/4, r_C/2]
so r_D = r_A/1 + r_B/4 + r_C/2
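The flow equation can be spot-checked in a few lines of Python. The out-link counts (1, 4, 2) come from the slide's example; the current rank values for A, B, and C below are made-up placeholders, not values from the slides:

```python
# Flow model: page j's rank is the sum of r_i / n_i over its in-links i,
# where n_i is the number of out-links of page i.
r = {"A": 0.2, "B": 0.4, "C": 0.3}   # hypothetical current ranks of D's in-links
n = {"A": 1, "B": 4, "C": 2}         # out-link counts from the slide
r_D = sum(r[p] / n[p] for p in r)    # r_A/1 + r_B/4 + r_C/2
print(r_D)
```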

SLIDE 17

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.
How to compute? Each page j has an importance (i.e. its rank, r_j); n_j is the number of j’s out-links.
[Graph: nodes A, B, C, D]

SLIDE 18

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.
How to compute? Each page j has an importance (i.e. its rank, r_j); n_j is the number of j’s out-links.
[Graph: nodes A, B, C, D]
A system of equations?

SLIDE 19

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.
How to compute? Each page j has an importance (i.e. its rank, r_j); n_j is the number of j’s out-links.
[Graph: nodes A, B, C, D]
A system of equations?

SLIDE 20

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Innovation 2: Not just a page’s own terms but the terms used by its citations.
How to compute? Each page j has an importance (i.e. its rank, r_j); n_j is the number of j’s out-links.
[Graph: nodes A, B, C, D]
A system of equations? Provides intuition, but impractical to solve at scale.

SLIDE 21

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1/2    0      1
B            1/3    0      1/2    0
C            1/3    1/2    0      0
D            1/3    0      1/2    0

“Transition Matrix”, M

SLIDE 22

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]
[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1/2    0      1
B            1/3    0      1/2    0
C            1/3    1/2    0      0
D            1/3    0      1/2    0

“Transition Matrix”, M

SLIDE 23

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]
after first iteration: M·r = [3/8, 5/24, 5/24, 5/24]
[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1/2    0      1
B            1/3    0      1/2    0
C            1/3    1/2    0      0
D            1/3    0      1/2    0

“Transition Matrix”, M

SLIDE 24

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]
after first iteration: M·r = [3/8, 5/24, 5/24, 5/24]
after second iteration: M(M·r) = M²·r = [15/48, 11/48, …]
[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1/2    0      1
B            1/3    0      1/2    0
C            1/3    1/2    0      0
D            1/3    0      1/2    0

“Transition Matrix”, M

SLIDE 25

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]
after first iteration: M·r = [3/8, 5/24, 5/24, 5/24]
after second iteration: M(M·r) = M²·r = [15/48, 11/48, …]
[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1/2    0      1
B            1/3    0      1/2    0
C            1/3    1/2    0      0
D            1/3    0      1/2    0

“Transition Matrix”, M
Power iteration algorithm:
  initialize: t = 0; r[0] = [1/N, …, 1/N]; r[-1] = [0, …, 0]
  while err_norm(r[t], r[t-1]) > min_err:
    r[t+1] = M·r[t]
    t += 1
  solution = r[t]
  (err_norm(v1, v2) = |v1 − v2|, the L1 norm)
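The pseudocode can be made concrete with no dependencies. The matrix below is one consistent reading of the slide's 4-node transition matrix (the exact link placement is partly an assumption, but it is column-stochastic and reproduces the slide's iteration values):

```python
# Power iteration: repeatedly apply the column-stochastic transition
# matrix M until the rank vector stops changing (L1 change below min_err).
M = [
    [0,   1/2, 0,   1],   # to A
    [1/3, 0,   1/2, 0],   # to B
    [1/3, 1/2, 0,   0],   # to C
    [1/3, 0,   1/2, 0],   # to D
]

def power_iteration(M, min_err=1e-12, max_iter=1000):
    n = len(M)
    r = [1.0 / n] * n                  # r[0] = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        err = sum(abs(a - b) for a, b in zip(r_next, r))   # L1 norm
        r = r_next
        if err < min_err:
            break
    return r

r = power_iteration(M)
print(r)
```

The first iteration from the uniform vector gives [3/8, 5/24, 5/24, 5/24], matching the slide.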

SLIDE 26

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]
after first iteration: M·r = [3/8, 5/24, 5/24, 5/24]
after second iteration: M(M·r) = M²·r = [15/48, 11/48, …]
[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1/2    0      1
B            1/3    0      1/2    0
C            1/3    1/2    0      0
D            1/3    0      1/2    0

“Transition Matrix”, M
Power iteration algorithm:
  initialize: t = 0; r[0] = [1/N, …, 1/N]; r[-1] = [0, …, 0]
  while err_norm(r[t], r[t-1]) > min_err:
    r[t+1] = M·r[t]
    t += 1
  solution = r[t]
  (err_norm(v1, v2) = |v1 − v2|, the L1 norm)

SLIDE 27

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]
after first iteration: M·r = [3/8, 5/24, 5/24, 5/24]
after second iteration: M(M·r) = M²·r = [15/48, 11/48, …]
Power iteration algorithm:
  initialize: t = 0; r[0] = [1/N, …, 1/N]; r[-1] = [0, …, 0]
  while err_norm(r[t], r[t-1]) > min_err:
    r[t+1] = M·r[t]
    t += 1
  solution = r[t]
  (err_norm(v1, v2) = |v1 − v2|, the L1 norm)
As err_norm gets smaller we are moving toward r = M·r: we are actually just finding an eigenvector of M.

SLIDE 28

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]
after first iteration: M·r = [3/8, 5/24, 5/24, 5/24]
after second iteration: M(M·r) = M²·r = [15/48, 11/48, …]
Power iteration algorithm:
  initialize: t = 0; r[0] = [1/N, …, 1/N]; r[-1] = [0, …, 0]
  while err_norm(r[t], r[t-1]) > min_err:
    r[t+1] = M·r[t]
    t += 1
  solution = r[t]
  (err_norm(v1, v2) = |v1 − v2|, the L1 norm)
As err_norm gets smaller we are moving toward r = M·r: we are actually just finding an eigenvector of M.
x is an eigenvector of A, with eigenvalue λ, if: A·x = λ·x

SLIDE 29

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]
after first iteration: M·r = [3/8, 5/24, 5/24, 5/24]
after second iteration: M(M·r) = M²·r = [15/48, 11/48, …]
Power iteration algorithm:
  initialize: t = 0; r[0] = [1/N, …, 1/N]; r[-1] = [0, …, 0]
  while err_norm(r[t], r[t-1]) > min_err:
    r[t+1] = M·r[t]
    t += 1
  solution = r[t]
  (err_norm(v1, v2) = |v1 − v2|, the L1 norm)
As err_norm gets smaller we are moving toward r = M·r: we are actually just finding an eigenvector of M.
x is an eigenvector of A, with eigenvalue λ, if: A·x = λ·x
λ = 1, since the columns of M sum to 1; thus 1·r = M·r finds the principal eigenvector of M.

SLIDE 30

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
To start: N = 4 nodes, so r = [¼, ¼, ¼, ¼]
after first iteration: M·r = [3/8, 5/24, 5/24, 5/24]
after second iteration: M(M·r) = M²·r = [15/48, 11/48, …]
Power iteration algorithm:
  initialize: t = 0; r[0] = [1/N, …, 1/N]; r[-1] = [0, …, 0]
  while err_norm(r[t], r[t-1]) > min_err:
    r[t+1] = M·r[t]
    t += 1
  solution = r[t]
  (err_norm(v1, v2) = |v1 − v2|, the L1 norm)
As err_norm gets smaller we are moving toward r = M·r: we are actually just finding an eigenvector of M.
x is an eigenvector of A, with eigenvalue λ, if: A·x = λ·x
λ = 1, since the columns of M sum to 1; thus 1·r = M·r finds the principal eigenvector of M.
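The λ = 1 claim can be checked numerically: when every column of M sums to 1, multiplying by M preserves total rank mass, i.e. the all-ones row vector is a left eigenvector with eigenvalue 1. A small hypothetical 3-node matrix (not from the slides) illustrates this:

```python
# Each column of M sums to 1 (column-stochastic), so sum(M·r) = sum(r)
# for any r: the all-ones row vector satisfies 1ᵀ·M = 1ᵀ.
M = [
    [0.0, 0.5, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0],
]
col_sums = [sum(row[j] for row in M) for j in range(3)]
r = [0.2, 0.3, 0.5]                  # any probability vector
Mr = [sum(M[i][j] * r[j] for j in range(3)) for i in range(3)]
print(col_sums, sum(Mr))
```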

SLIDE 31

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Where is the surfer at time t+1?  p(t+1) = M·p(t)
Suppose p(t+1) = p(t); then p(t) is a stationary distribution of a random walk.
Thus, r is a stationary distribution: the probability of being at a given node.
Power iteration algorithm:
  initialize: t = 0; r[0] = [1/N, …, 1/N]; r[-1] = [0, …, 0]
  while err_norm(r[t], r[t-1]) > min_err:
    r[t+1] = M·r[t]
    t += 1
  solution = r[t]
  (err_norm(v1, v2) = |v1 − v2|, the L1 norm)

SLIDE 32

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Where is the surfer at time t+1?  p(t+1) = M·p(t)
Suppose p(t+1) = p(t); then p(t) is a stationary distribution of a random walk.
Thus, r is a stationary distribution: the probability of being at a given node.
aka a 1st-order Markov process
  • Rich probabilistic theory. One finding:
    ○ The stationary distribution is unique if:

SLIDE 33

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Where is the surfer at time t+1?  p(t+1) = M·p(t)
Suppose p(t+1) = p(t); then p(t) is a stationary distribution of a random walk.
Thus, r is a stationary distribution: the probability of being at a given node.
aka a 1st-order Markov process
  • Rich probabilistic theory. One finding:
    ○ The stationary distribution is unique if:
      ■ No “dead-ends”: a node that can’t propagate its rank
      ■ No “spider traps”: a set of nodes with no way out

SLIDE 34

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Where is the surfer at time t+1?  p(t+1) = M·p(t)
Suppose p(t+1) = p(t); then p(t) is a stationary distribution of a random walk.
Thus, r is a stationary distribution: the probability of being at a given node.
aka a 1st-order Markov process
  • Rich probabilistic theory. One finding:
    ○ The stationary distribution is unique if:
      ■ No “dead-ends”: a node that can’t propagate its rank
      ■ No “spider traps”: a set of nodes with no way out
(technically, the chain needs to be: stochastic, irreducible, and aperiodic)

SLIDE 35

PageRank

Where is the surfer at time t+1?  p(t+1) = M·p(t)
Suppose p(t+1) = p(t); then p(t) is a stationary distribution of a random walk.
Thus, r is a stationary distribution: the probability of being at a given node.
aka a 1st-order Markov process
  • Rich probabilistic theory. One finding:
    ○ The stationary distribution is unique if:
      ■ No “dead-ends”: a node that can’t propagate its rank
      ■ No “spider traps”: a set of nodes with no way out
[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1      0      0
B            1/3    0      0      1
C            1/3    0      0      0
D            1/3    0      0      0

r would eventually converge to [0, 0, …]
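The leak toward zero is easy to demonstrate. The matrix below is the reading of the slide's example in which C has no out-links, so its column is all zeros and some rank mass is lost on every step:

```python
# Column C is all zeros: C is a dead end, so rank "leaks" out of the
# system each iteration and r drifts toward [0, 0, 0, 0].
M = [
    [0,   1, 0, 0],   # A <- B
    [1/3, 0, 0, 1],   # B <- A and D
    [1/3, 0, 0, 0],   # C <- A
    [1/3, 0, 0, 0],   # D <- A
]
r = [0.25] * 4
for _ in range(200):
    r = [sum(M[i][j] * r[j] for j in range(4)) for i in range(4)]
print(sum(r))
```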

SLIDE 36

PageRank

Where is the surfer at time t+1?  p(t+1) = M·p(t)
Suppose p(t+1) = p(t); then p(t) is a stationary distribution of a random walk.
Thus, r is a stationary distribution: the probability of being at a given node.
aka a 1st-order Markov process
  • Rich probabilistic theory. One finding:
    ○ The stationary distribution is unique if:
      ■ No “dead-ends”: a node that can’t propagate its rank
      ■ No “spider traps”: a set of nodes with no way out
[Graph: nodes A, B, C, D]

SLIDE 37

PageRank

Innovation 1: What pages would a “random Web surfer” end up at?
Where is the surfer at time t+1?  p(t+1) = M·p(t)
Suppose p(t+1) = p(t); then p(t) is a stationary distribution of a random walk.
Thus, r is a stationary distribution: the probability of being at a given node.
aka a 1st-order Markov process
  • Rich probabilistic theory. One finding:
    ○ The stationary distribution is unique if:
      ■ No “dead-ends”: a node that can’t propagate its rank
      ■ No “spider traps”: a set of nodes with no way out
(technically, the chain needs to be: stochastic, i.e. the columns sum to 1; irreducible, i.e. a non-zero chance of going to any other node; and aperiodic, i.e. the same node doesn’t repeat at a regular interval)

SLIDE 38

PageRank

No “dead-ends”, no “spider traps”
[Graph: nodes A, B, C, D]

The “Google” PageRank Formulation

Add teleportation: at each step, two choices
1. Follow a random link (probability β ≈ 0.85)
2. Teleport to a random node (probability 1 − β)

SLIDE 39

PageRank

No “dead-ends”, no “spider traps”

The “Google” PageRank Formulation

Add teleportation: at each step, two choices
1. Follow a random link (probability β ≈ 0.85)
2. Teleport to a random node (probability 1 − β)
[Graph: nodes A, B, C, D]

SLIDE 40

PageRank

No “dead-ends”, no “spider traps”

The “Google” PageRank Formulation

Add a teleportation chance at all nodes, i.e. at each step, two choices
1. Follow a random link (probability β ≈ 0.85)
2. Teleport to a random node (probability 1 − β)
Add teleportation from a dead end with probability 1
[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1      0      0
B            1/3    0      0      1
C            1/3    0      0      0
D            1/3    0      0      0

SLIDE 41

PageRank

No “dead-ends”, no “spider traps”

The “Google” PageRank Formulation

Add a teleportation chance at all nodes, i.e. at each step, two choices
1. Follow a random link (probability β ≈ 0.85)
2. Teleport to a random node (probability 1 − β)
Add teleportation from a dead end with probability 1
[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1      ¼      0
B            1/3    0      ¼      1
C            1/3    0      ¼      0
D            1/3    0      ¼      0

SLIDE 42

PageRank

No “dead-ends”, no “spider traps”

The “Google” PageRank Formulation

Add a teleportation chance at all nodes, i.e. at each step, two choices
1. Follow a random link (probability β ≈ 0.85)
2. Teleport to a random node (probability 1 − β)
Add teleportation from a dead end with probability 1

(Brin and Page, 1998)

[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1      0      0
B            1/3    0      0      1
C            1/3    0      0      0
D            1/3    0      0      0

SLIDE 43

PageRank

No “dead-ends”, no “spider traps”

The “Google” PageRank Formulation

Add a teleportation chance at all nodes, i.e. at each step, two choices
1. Follow a random link (probability β ≈ 0.85)
2. Teleport to a random node (probability 1 − β)
Add teleportation from a dead end with probability 1

(Brin and Page, 1998)

[Graph: nodes A, B, C, D]

to \ from    A      B      C      D
A            0      1      0      0
B            1/3    0      0      1
C            1/3    0      0      0
D            1/3    0      0      0

to \ from    A              …
A            ⅘·0 + ⅕·¼     ¼
B            ⅘·⅓ + ⅕·¼     ¼
C            ⅘·⅓ + ⅕·¼     ¼
D            ⅘·⅓ + ⅕·¼     ¼

assume β = ⅘ (Flow Model vs. Matrix Model)

SLIDE 44

PageRank

No “dead-ends”, no “spider traps”

The “Google” PageRank Formulation

Add a teleportation chance at all nodes, i.e. at each step, two choices
1. Follow a random link (probability β ≈ 0.85)
2. Teleport to a random node (probability 1 − β)
Add teleportation from a dead end with probability 1

to \ from    A              …
A            ⅘·0 + ⅕·¼     ¼
B            ⅘·⅓ + ⅕·¼     ¼
C            ⅘·⅓ + ⅕·¼     ¼
D            ⅘·⅓ + ⅕·¼     ¼

assume β = ⅘ (Matrix Model). To apply: run power iteration over M′ instead of M.
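Putting the pieces together, the full formulation can be sketched: build M′ = β·M + (1 − β)/N, with any dead-end column replaced by 1/N (teleport with probability 1), then run power iteration over M′. The function name and defaults below are illustrative, and β = 4/5 matches the slides' worked example rather than the ≈ 0.85 of Brin and Page:

```python
# "Google" PageRank sketch: teleportation fixes spider traps (mix every
# column with 1/N) and dead ends (replace an all-zero column with 1/N).
def pagerank(M, beta=0.8, min_err=1e-12, max_iter=1000):
    n = len(M)
    Mp = [[0.0] * n for _ in range(n)]          # M' = beta*M + (1-beta)/n
    for j in range(n):
        col_sum = sum(M[i][j] for i in range(n))
        for i in range(n):
            link = M[i][j] if col_sum > 0 else 1.0 / n   # patch dead ends
            Mp[i][j] = beta * link + (1 - beta) / n
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(Mp[i][j] * r[j] for j in range(n)) for i in range(n)]
        err = sum(abs(a - b) for a, b in zip(r_next, r))  # L1 norm
        r = r_next
        if err < min_err:
            break
    return r

# Dead-end example graph: A -> B, C, D; B -> A; D -> B; C -> nothing.
M = [
    [0,   1, 0, 0],
    [1/3, 0, 0, 1],
    [1/3, 0, 0, 0],
    [1/3, 0, 0, 0],
]
r = pagerank(M)
print(r)
```

With teleportation, every page keeps a non-zero rank and the total mass stays 1, unlike the plain power iteration on this graph.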