CS6220 – Data Mining Techniques · Fall 2017 · Derbinsky
Link Analysis
Lecture 7
November 29, 2017 Link Analysis 1
1. Let’s Build a Search Engine :)
– The Model
– Problem #1: Fast Text Search
– Problem #2: Ranking Documents
– Enter the Spammers
2. Bringing Order to the Web
– Voting with Your Links
– Problem Representation
– Return of the Spammer (Farms)
3. Related Approaches
– Topic-Specific PageRank
– SimRank
– HITS: Hubs and Authorities
– The Web (set of webpages)
– User query
– Ranked list of the “best” documents related to the query
– Desired properties
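The model above (web and query in, ranked list out) can be sketched as a single function. This is only an illustration of the input/output contract: the scoring rule (counting query-term occurrences), the function name `search`, and the three toy pages are all assumptions, not anything taken from the lecture.

```python
def search(web, query):
    """Return page URLs ranked by a naive score.

    web   : dict mapping URL -> page text (stand-in for "the Web")
    query : user query string
    The score (total occurrences of the query terms) is a placeholder
    assumption; the rest of the lecture is about better ranking.
    """
    terms = query.lower().split()
    scored = []
    for url, text in web.items():
        words = text.lower().split()
        score = sum(words.count(t) for t in terms)
        if score > 0:
            scored.append((score, url))
    # Highest score first; ties broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for _, url in scored]


# Hypothetical toy web, for illustration only.
web = {
    "a.example": "link analysis ranks pages by links",
    "b.example": "cooking with links of sausage",
    "c.example": "page rank and link analysis of the web link graph",
}
results = search(web, "link analysis")
```

A scorer like this is trivially easy to game by repeating query terms, which is one way to see why link-based ranking becomes attractive.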
~ $0.7M/sec
– ~10.4M Blu-ray discs (~7.75 miles, ~15 × Burj Khalifa)
– ~67 CPU/person
– ~$21.6T/year (~1.25 × US GDP)
O_j: number of out-links of page j
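The equation this definition belongs to did not survive extraction. Assuming it is the standard "voting" form of PageRank, in which each page splits its score evenly among its out-links, it would read as follows (the symbols P and In(i) are borrowed from elsewhere in these slides; their use here is an assumption):

```latex
P(i) \;=\; \sum_{j \in \mathrm{In}(i)} \frac{P(j)}{O_j}
```

Here In(i) is the set of pages linking to page i, so a vote from a page with many out-links counts for less than a vote from a page with few.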
– Start uniformly at random
– A Markov Chain
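The random-surfer view can be sketched in a few lines. The four-page link graph below is hypothetical (the graph drawn on the slides is not recoverable from this text), and the helper names are illustrative. The matrix is built so that each column sums to 1: column j spreads page j's probability evenly over its out-links.

```python
# Hypothetical link graph: page -> pages it links to (an assumption).
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["A"], "D": ["B", "C"]}
pages = sorted(links)


def transition_matrix(links, pages):
    """M[i][j] = 1/O_j if page j links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[0.0] * n for _ in range(n)]
    for src, outs in links.items():
        for dst in outs:
            M[index[dst]][index[src]] = 1.0 / len(outs)
    return M


def step(M, p):
    """One move of the random surfer: p' = M p."""
    n = len(p)
    return [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]


M = transition_matrix(links, pages)
p0 = [1.0 / len(pages)] * len(pages)  # start uniformly at random
p1 = step(M, p0)                      # distribution after one click
```

Because the next position depends only on the current page, the sequence p0, p1, ... is exactly the state distribution of a Markov chain with transition matrix M.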
[Figure: example graph with nodes A, B, C, D]
P(X|A): starting at A, probability of arriving at node X
P(A|X): starting at X, “probability”
– The conditions held for our example, but not in the general case for the web (yet!)
– Particularly useful when the matrix in question is large and sparse, as the method doesn’t require any decomposition
– Convergence: typically declared when the residual (the norm of the difference between P and P’) falls below a threshold; this may require many iterations, but typically a few are good enough for the web (according to Google)
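Assuming "the method" here is power iteration, a minimal sketch looks like this. The 4 × 4 column-stochastic matrix is a hypothetical example (pages A, B, C, D with A→B,C,D; B→A,D; C→A; D→B,C), the L1 norm is one possible choice of residual, and dense lists are used only for brevity; a real implementation would use sparse storage.

```python
def power_iteration(M, tol=1e-10, max_iter=1000):
    """Repeat P' = M P until the L1 residual |P' - P| drops below tol."""
    n = len(M)
    p = [1.0 / n] * n  # uniform starting vector
    iterations = 0
    for iterations in range(1, max_iter + 1):
        p_next = [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]
        residual = sum(abs(a - b) for a, b in zip(p_next, p))
        p = p_next
        if residual < tol:
            break
    return p, iterations


# Hypothetical column-stochastic link matrix (rows/columns ordered A, B, C, D).
M = [
    [0.0, 1 / 2, 1.0, 0.0],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 1 / 2, 0.0, 0.0],
]
ranks, iterations = power_iteration(M)
```

For this particular graph the iteration settles on A = 1/3 and B = C = D = 2/9, using nothing more than repeated matrix-vector products.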
[Figure: graph on nodes A, B, C, D] C is termed a dead end
Now substochastic (columns sum to at most 1)
25% loss!
Remaining probability mass: 1 → 0.75 → 0.54 → 0.29
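The leak caused by a dead end is easy to reproduce. The matrix below is an assumption: a four-page graph (A→B,C,D; B→A,D; D→B,C) in which C has no out-links, so its column is all zeros. With this assumed graph the total probability is 0.75 after one step and about 0.54 after two, matching the quoted figures, and about 0.29 after four steps; whether this is the graph actually used on the slides cannot be confirmed from the text.

```python
# Hypothetical substochastic matrix (order A, B, C, D); column C is all
# zeros because C is a dead end.
M_dead = [
    [0.0, 1 / 2, 0.0, 0.0],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 1 / 2, 0.0, 0.0],
]


def step(M, p):
    """One iteration p' = M p."""
    n = len(p)
    return [sum(M[i][j] * p[j] for j in range(n)) for i in range(n)]


p = [0.25, 0.25, 0.25, 0.25]  # uniform start
totals = [sum(p)]             # total probability before any step
for _ in range(4):
    p = step(M_dead, p)
    totals.append(sum(p))

# Rounded to two decimals, totals is 1.0, 0.75, 0.54, 0.4, 0.29
rounded = [round(t, 2) for t in totals]
```

Whatever probability sits on C at each step simply vanishes, so the vector drifts toward all zeros instead of converging to a ranking.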
[Figure: graph on nodes A, B, C, D] Simplest form of spider trap (a trap could also involve multiple nodes)
Stochastic – yes! But strongly connected? [Figure: graph on nodes A, B, C, D]
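Both questions can be checked mechanically. In the sketch below the graph is hypothetical (A→B,C,D; B→A,D; D→B,C, with C linking only to itself as the simplest spider trap), and the function names are illustrative. Every page has out-links, so the transition matrix is stochastic, but nothing is reachable from C except C itself.

```python
# Hypothetical graph with a one-node spider trap at C (an assumption).
links = {"A": ["B", "C", "D"], "B": ["A", "D"], "C": ["C"], "D": ["B", "C"]}
pages = sorted(links)


def reachable(links, start):
    """Set of pages reachable from start by following links (DFS)."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in links[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def is_strongly_connected(links):
    """True if every page can reach every other page."""
    return all(reachable(links, p) == set(links) for p in links)


def surf(links, pages, steps):
    """Random-surfer distribution after the given number of steps."""
    p = {page: 1.0 / len(pages) for page in pages}
    for _ in range(steps):
        nxt = {page: 0.0 for page in pages}
        for src, outs in links.items():
            for dst in outs:
                nxt[dst] += p[src] / len(outs)
        p = nxt
    return p


strongly_connected = is_strongly_connected(links)  # False for this graph
final = surf(links, pages, 200)                    # mass piles up on C
```

No probability is lost this time, but the trap absorbs essentially all of it, so the resulting ranking is useless.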
n: number of nodes
E: ee^T (e is a column vector of all 1’s, so E is the n × n matrix of all 1’s)
Think back to Naïve Bayes: anything seem smoother?
Think about extreme values of d…
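The equation these symbols belong to is missing from the extracted text. Assuming it is the usual damped form, G = (1 − d)·E/n + d·M with M the column-stochastic link matrix (the name M and the function names are assumptions), a sketch of its effect on a spider-trap graph follows. The graph is the same hypothetical one as before: A→B,C,D; B→A,D; C→C; D→B,C.

```python
def google_matrix(M, d):
    """G = (1 - d) * E / n + d * M, where every entry of E is 1."""
    n = len(M)
    return [[d * M[i][j] + (1.0 - d) / n for j in range(n)] for i in range(n)]


def power_iteration(G, tol=1e-12, max_iter=1000):
    """Repeat P' = G P until the L1 residual drops below tol."""
    n = len(G)
    p = [1.0 / n] * n
    for _ in range(max_iter):
        p_next = [sum(G[i][j] * p[j] for j in range(n)) for i in range(n)]
        residual = sum(abs(a - b) for a, b in zip(p_next, p))
        p = p_next
        if residual < tol:
            break
    return p


# Hypothetical spider-trap matrix (order A, B, C, D); C links only to itself.
M_trap = [
    [0.0, 1 / 2, 0.0, 0.0],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 0.0, 1.0, 1 / 2],
    [1 / 3, 1 / 2, 0.0, 0.0],
]
damped = power_iteration(google_matrix(M_trap, 0.8))   # d = 0.8
uniform = power_iteration(google_matrix(M_trap, 0.0))  # d = 0: links ignored
```

With d = 0.8 the trap still ranks highest (95/148) but no longer swallows everything. At the extremes, d = 0 ignores the links entirely and gives every page 1/n, while d = 1 is the undamped walk with all of its dead-end and trap problems. Every entry of G is strictly positive for d < 1, which is the sense in which the matrix has been smoothed.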
Compare to…
(1 – 0.8) / 2 nodes
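Assuming this expression is the teleport share in topic-specific PageRank (damping d = 0.8, with the random jump restricted to a topic set of 2 nodes), each topic page receives (1 − 0.8) / 2 = 0.1 per iteration and all other pages receive nothing from teleportation. The graph, the choice of B and D as the topic set, and the function name below are assumptions for illustration.

```python
def topic_pagerank(M, topic, d=0.8, tol=1e-12, max_iter=1000):
    """PageRank whose random jumps land only on pages in the topic set.

    M     : column-stochastic link matrix (list of rows)
    topic : set of page indices that receive the teleport probability
    """
    n = len(M)
    share = (1.0 - d) / len(topic)  # e.g. (1 - 0.8) / 2 nodes
    p = [1.0 / n] * n
    for _ in range(max_iter):
        p_next = [
            d * sum(M[i][j] * p[j] for j in range(n))
            + (share if i in topic else 0.0)
            for i in range(n)
        ]
        residual = sum(abs(a - b) for a, b in zip(p_next, p))
        p = p_next
        if residual < tol:
            break
    return p


# Hypothetical graph (order A, B, C, D): A→B,C,D; B→A,D; C→A; D→B,C.
M = [
    [0.0, 1 / 2, 1.0, 0.0],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 0.0, 0.0, 1 / 2],
    [1 / 3, 1 / 2, 0.0, 0.0],
]
topic_ranks = topic_pagerank(M, topic={1, 3})  # topic set: B and D
```

Compared with the unbiased result on the same graph (A = 1/3, others 2/9), the topic pages B and D gain rank and A loses some, since the surfer keeps being returned to the topic set.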
p∈In(i)
q∈In(i)