Data Mining Techniques
CS 6220 - Section 2 - Spring 2017
Lecture 9: Link Analysis
Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)
Web search before PageRank
Human-curated (e.g. Yahoo, Looksmart)
(e.g. WebCrawler, Lycos)
(adapted from: Mining of Massive Datasets, http://www.mmds.org)
Not all pages are equally important
Many inbound links
Few/no inbound links
Links from unimportant pages
Links from important pages
[Figure: example graph with PageRank scores: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five unlabeled nodes at 1.6 each]
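Each link's vote is proportional to the importance of its source page
A page's importance is the sum of the votes on its in-links
[Figure: page j receives votes $r_i/3$ from page i and $r_k/4$ from page k, and passes $r_j/3$ along each of its three out-links]
$r_j = r_i/3 + r_k/4$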
Random surfer: follows a link to a new page at random
Importance of page i: fraction of time spent on page i
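$r_j = \sum_{i \to j} \frac{r_i}{d_i}$, where $d_i$ is the out-degree of page i
[Figure: graph on pages y, a, m with links y → y, y → a, a → y, a → m, m → a]
“Flow” equations:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$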
r = M·r
Matrix M is stochastic (i.e. columns sum to one)
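As a minimal sketch of the matrix formulation, the flow equations for the y/a/m example can be written as a column-stochastic matrix and solved directly. The page ordering (y, a, m) and the normalization $\sum_i r_i = 1$ are assumptions made here for illustration; they are not stated on the slide.

```python
import numpy as np

# Column-stochastic link matrix for the y/a/m example, page order (y, a, m).
# Column j spreads the importance of page j evenly over its out-links.
M = np.array([
    [0.5, 0.5, 0.0],  # r_y = r_y/2 + r_a/2
    [0.5, 0.0, 1.0],  # r_a = r_y/2 + r_m
    [0.0, 0.5, 0.0],  # r_m = r_a/2
])

# Every column sums to one, so M is stochastic.
column_sums = M.sum(axis=0)

# Solve (M - I) r = 0 together with the assumed normalization sum(r) = 1.
A = np.vstack([M - np.eye(3), np.ones((1, 3))])
b = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(A, b, rcond=None)

print(column_sums)  # each entry is 1
print(r)            # approximately (0.4, 0.4, 0.2) for (y, a, m)
```

A direct solve like this is only practical for tiny graphs, which is what motivates the iterative approach.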
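The rank vector r is an eigenvector of M with eigenvalue λ = 1
This eigenvector is guaranteed to exist since M is a stochastic matrix (i.e. if a = M b then $\sum_i a_i = \sum_i b_i$)
Problem: how do we solve for r when M has on the order of $10^{10}$ elements?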
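Model for random surfer:
Probabilistic interpretation:
[Figure: graph on pages y, a, m; edges labeled with the probability mass they carry: y/2, y/2, a/2, a/2, m]
$p_t$ converges to $r$. Iterate until $|p_t - p_{t-1}| < \epsilon$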
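The iteration above can be sketched as follows. The update $p_t = M p_{t-1}$, the uniform starting vector, and the use of the L1 norm in the stopping test are assumptions for this sketch; `power_iteration` is a hypothetical helper name.

```python
import numpy as np

def power_iteration(M, eps=1e-10, max_iter=1000):
    """Repeat p_t = M p_{t-1} until the L1 change drops below eps."""
    n = M.shape[0]
    p = np.full(n, 1.0 / n)  # assumed start: surfer is equally likely to be anywhere
    for _ in range(max_iter):
        p_next = M @ p
        if np.abs(p_next - p).sum() < eps:
            return p_next
        p = p_next
    return p

# y/a/m example graph, page order (y, a, m)
M = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
])
r = power_iteration(M)
print(r)  # approximately (0.4, 0.4, 0.2)
```

Each step needs only one matrix-vector product, so the method scales to graphs where forming an eigendecomposition is out of the question.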
Ergodicity
Irreducibility
Stationary distribution (for ergodic chains)
Markov Property
model for individual surfers
in which equal fractions of surfers follow each link at every time
is the same as the asymptotic distribution for an individual random walk
Dead end
Spider trap
links to wider graph
no way out. Not irreducible
[Figure: y/a/m graph in which m is a dead end; edges carry y/2, y/2, a/2, a/2 and m has no out-link]
Probability not conserved
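A small sketch of the leak, assuming the y/a/m graph with m as a dead end (its column in M is all zeros) and a uniform starting vector. The variable names are illustrative only.

```python
import numpy as np

# y/a/m graph where m has no out-links: column m is zero, so M is not stochastic.
M_dead = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
])

p = np.full(3, 1.0 / 3.0)
totals = []
for _ in range(50):
    p = M_dead @ p
    totals.append(p.sum())  # total probability mass left after this step

print(totals[0], totals[-1])  # the total shrinks toward zero
```

Whatever mass sits on m at one step disappears at the next, so the total decreases at every iteration.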
Fixes “probability sink” issue
(teleport at dead ends)
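One way to sketch this fix, assuming "teleport at dead ends" means that a surfer on a page without out-links jumps to a uniformly random page: replace every all-zero column of M with 1/N. `fix_dead_ends` is a hypothetical helper name.

```python
import numpy as np

def fix_dead_ends(M):
    """Return a copy of M in which every all-zero column is replaced by 1/N."""
    M = M.astype(float).copy()
    n = M.shape[0]
    dead = M.sum(axis=0) == 0  # columns belonging to pages without out-links
    M[:, dead] = 1.0 / n
    return M

# y/a/m graph where m is a dead end
M_dead = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
])
M_fixed = fix_dead_ends(M_dead)

p = np.full(3, 1.0 / 3.0)
for _ in range(200):
    p = M_fixed @ p

print(M_fixed.sum(axis=0))  # all columns sum to one again
print(p, p.sum())           # probability is conserved
```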
[Figure: y/a/m graph in which m is a spider trap: its only out-link leads back to m]
Probability accumulates in traps (surfers get stuck)
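A sketch of the trap effect, assuming the y/a/m graph in which the only out-link of m points back to m. The matrix is still column-stochastic, but plain power iteration pushes all probability onto m.

```python
import numpy as np

# y/a/m graph where m links only to itself (assumed spider-trap structure).
M_trap = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
])

p = np.full(3, 1.0 / 3.0)
for _ in range(100):
    p = M_trap @ p

print(p)  # close to (0, 0, 1): the trap has absorbed the surfers
```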
Model for teleporting random surfer:
to a new initial location at random
PageRank Equation [Page & Brin 1998]
$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$
[Figure: graph on pages y, a, m]
(can use power iteration as normal)
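A minimal sketch of power iteration on the PageRank equation. It assumes M is column-stochastic (no dead ends), uses β = 0.8 purely as an illustrative value, and reuses the spider-trap variant of the y/a/m graph; `pagerank` is a hypothetical helper name.

```python
import numpy as np

def pagerank(M, beta=0.8, eps=1e-12, max_iter=1000):
    """Iterate r <- beta * M r + (1 - beta) / N until the L1 change is below eps."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_next = beta * (M @ r) + (1.0 - beta) / n
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# y/a/m graph where m links only to itself (spider trap), page order (y, a, m)
M_trap = np.array([
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
])
r = pagerank(M_trap, beta=0.8)
print(r)  # approximately (7/33, 5/33, 21/33)
```

With teleportation, m still collects the largest score, but y and a keep a nonzero share instead of draining to zero.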
source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23
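The sparse encoding above stores, per source node, its out-degree and its destination list. A minimal sketch of one PageRank update that reads this format follows; the dict-based layout, the helper name `sparse_step`, β = 0.8, and the small y/a/m graph (numbered 0, 1, 2) are assumptions chosen for illustration, and the sketch assumes no dead ends.

```python
def sparse_step(links, r_old, beta=0.8):
    """One update r_new = beta * M r_old + (1 - beta) / N using adjacency lists.

    links maps each source node to its list of destination nodes;
    the out-degree is the length of that list.
    """
    n = len(r_old)
    r_new = [(1.0 - beta) / n] * n
    for src, dests in links.items():
        share = beta * r_old[src] / len(dests)  # mass sent along each out-link
        for dst in dests:
            r_new[dst] += share
    return r_new

# y/a/m graph with y = 0, a = 1, m = 2
links = {0: [0, 1], 1: [0, 2], 2: [1]}
r1 = sparse_step(links, [1.0 / 3.0] * 3)
print(r1)
```

Memory use is proportional to the number of links rather than to N squared, which is the point of the encoding.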
src   degree   destination
0     4        0, 1, 3, 5
1     2        0, 5
2     2        3, 4
[Figure: sparse matrix M with vectors $r^{new}$ and $r^{old}$]
src   degree   destination
0     4        0, 1
1     3        0
2     2        1

0     4        3
2     2        3

0     4        5
1     3        5
2     2        4
[Figure: M split into stripes, with vectors $r^{new}$ and $r^{old}$]
Break M into stripes: each stripe contains only destination nodes in the corresponding block of $r^{new}$
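A minimal in-memory sketch of the stripe idea: destinations are partitioned into blocks, and each stripe keeps, per source node, its full out-degree plus only those destinations that fall in the block. The helper names `make_stripes` and `block_stripe_step`, the dict-based link format, and β = 0.8 are assumptions for illustration; a real implementation would stream stripes from disk.

```python
def make_stripes(links, n, block_size):
    """Split adjacency lists into stripes, one per block of destination nodes."""
    stripes = []
    for start in range(0, n, block_size):
        block = range(start, min(start + block_size, n))
        stripe = []
        for src, dests in links.items():
            in_block = [d for d in dests if d in block]
            if in_block:
                # keep the full out-degree so each link's share stays r/degree
                stripe.append((src, len(dests), in_block))
        stripes.append((start, stripe))
    return stripes

def block_stripe_step(stripes, r_old, block_size, beta=0.8):
    """One PageRank update that fills r_new one block at a time."""
    n = len(r_old)
    r_new = []
    for start, stripe in stripes:
        size = min(block_size, n - start)
        block = [(1.0 - beta) / n] * size  # only this block of r_new is held
        for src, degree, dests in stripe:
            for dst in dests:
                block[dst - start] += beta * r_old[src] / degree
        r_new.extend(block)
    return r_new

# y/a/m graph with y = 0, a = 1, m = 2, split into blocks of two destinations
links = {0: [0, 1], 1: [0, 2], 2: [1]}
stripes = make_stripes(links, n=3, block_size=2)
r1 = block_stripe_step(stripes, [1.0 / 3.0] * 3, block_size=2)
print(stripes)
print(r1)
```

The result matches an unblocked update; the stripes only change the order in which links are read.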
so only search engines would see it
target search engine
underlined to represent the link) and its surrounding text
boost PageRank of a page
(spammer can post links to his pages)
as possible to target page t
multiplier effect
[Figure: link farm with inaccessible pages, accessible pages, and owned pages: target page t and farm pages 1, 2, ..., M]
Millions of farm pages
as teleport set
[Figure panels: Time Series, Histogram, Mixture, Posterior on states]
Estimate from GMM
Estimate from HMM
(more likely to be in same state than different state)
[Figure: graph on pages y, a, m with edge labels y/2, y/2, a/2, a/2, m]
$A = M^\top$
Gaussian Mixture Gaussian HMM
Expectation Maximization
Expectation step for HMM
$z_1 \sim \text{Discrete}(\pi)$
$z_{t+1} \mid z_t = k \sim \text{Discrete}(A_k)$
$x_t \mid z_t = k \sim \text{Normal}(\mu_k, \sigma_k)$
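$\gamma_{t,k} = p(z_t = k \mid x_{1:T}, \theta) = \frac{p(x_{1:t}, z_t = k)\, p(x_{t+1:T} \mid z_t = k)}{p(x_{1:T})} \propto \alpha_{t,k}\, \beta_{t,k}$
$\beta_{t,k} := p(x_{t+1:T} \mid z_t = k) = \sum_l \beta_{t+1,l}\, p(x_{t+1} \mid \mu_l, \sigma_l)\, A_{kl}$
$\alpha_{t,l} := p(x_{1:t}, z_t = l) = \sum_k p(x_t \mid \mu_l, \sigma_l)\, A_{kl}\, \alpha_{t-1,k}$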
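The forward and backward recursions can be sketched as below for a Gaussian HMM. Several choices here are assumptions rather than part of the slides: $\sigma_k$ is treated as a standard deviation, $A_{kl}$ is read as $p(z_{t+1} = l \mid z_t = k)$ with rows summing to one, the base cases are $\alpha_{1,k} = \pi_k\, p(x_1 \mid \mu_k, \sigma_k)$ and $\beta_{T,k} = 1$, and no rescaling is done, so this version would underflow on long sequences. `normal_pdf` and `e_step` are hypothetical helper names.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of Normal(mu, sigma), with sigma taken as a standard deviation."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def e_step(x, pi, A, mu, sigma):
    """Return gamma[t, k] = p(z_t = k | x_{1:T}) via forward-backward."""
    T, K = len(x), len(pi)
    lik = normal_pdf(x[:, None], mu[None, :], sigma[None, :])  # lik[t, k] = p(x_t | k)

    alpha = np.zeros((T, K))
    alpha[0] = pi * lik[0]
    for t in range(1, T):
        # alpha[t, l] = p(x_t | l) * sum_k A[k, l] * alpha[t-1, k]
        alpha[t] = lik[t] * (alpha[t - 1] @ A)

    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        # beta[t, k] = sum_l beta[t+1, l] * p(x_{t+1} | l) * A[k, l]
        beta[t] = A @ (beta[t + 1] * lik[t + 1])

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Toy two-state example with made-up parameters
x = np.array([0.1, 2.9, 3.2])
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu = np.array([0.0, 3.0])
sigma = np.array([1.0, 1.0])
gamma = e_step(x, pi, A, mu, sigma)
print(gamma)  # each row sums to one
```

In practice the recursions are usually run with per-step normalization or in log space to keep the values in floating-point range.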
E-step: Posterior probabilities on Transitions
M-step updates
Handwritten Digits
RNA splicing