SLIDE 1

Data Mining Techniques

CS 6220 - Section 2 - Spring 2017

Lecture 9: Link Analysis

Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

SLIDE 2

Web search before PageRank

  • Human-curated (e.g. Yahoo, Looksmart)
    • Hand-written descriptions
    • Wait time for inclusion
  • Text-search (e.g. WebCrawler, Lycos)
    • Prone to term-spam

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

SLIDE 3

PageRank: Links as Votes

  • Not all pages are equally important
  • Pages with more inbound links are more important
  • Inbound links from important pages carry more weight

[Figure: pages with many vs. few/no inbound links; links from important vs. unimportant pages]

SLIDE 4

Example: PageRank Scores

[Figure: example graph with PageRank scores B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five leaf pages at 1.6 each]

SLIDE 5

PageRank: Recursive Formulation

  • A link's vote is proportional to the importance of its source page
  • If page j with importance r_j has n out-links, each link gets r_j / n votes
  • Page j's own importance is the sum of the votes on its in-links

[Figure: page i with 3 out-links and page k with 4 out-links both link to page j, so r_j = r_i/3 + r_k/4; page j passes r_j/3 along each of its 3 out-links]

SLIDE 6

Equivalent Formulation: Random Surfer

  • At time t a surfer is on some page i
  • At time t+1 the surfer follows an outgoing link to a new page at random
  • Define rank r_i as the fraction of time spent on page i

[Figure: same example as before, with r_j = r_i/3 + r_k/4]

SLIDE 7

PageRank: The "Flow" Model

r_j = Σ_{i→j} r_i / d_i, where d_i is the out-degree of page i

"Flow" equations for the y/a/m example graph:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2

  • 3 equations, 3 unknowns (solutions only determined up to a scale factor)
  • Impose constraint: r_y + r_a + r_m = 1
  • Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5
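The three flow equations above can be checked numerically. A minimal sketch in plain Python (not from the slides), iterating the equations to a fixed point from a uniform start:

```python
# Flow equations for the y/a/m example graph:
#   r_y = r_y/2 + r_a/2,  r_a = r_y/2 + r_m,  r_m = r_a/2
# Repeated substitution converges to the solution with r_y + r_a + r_m = 1.
r = {"y": 1 / 3, "a": 1 / 3, "m": 1 / 3}
for _ in range(100):
    r = {
        "y": r["y"] / 2 + r["a"] / 2,
        "a": r["y"] / 2 + r["m"],
        "m": r["a"] / 2,
    }
# r is now close to the solution r_y = 2/5, r_a = 2/5, r_m = 1/5
```

Note that the total Σ r stays 1 at every step, since each page passes all of its rank along its out-links.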

SLIDE 8

PageRank: The "Flow" Model

"Flow" equations: r_y = r_y/2 + r_a/2, r_a = r_y/2 + r_m, r_m = r_a/2

In matrix form: r = M·r, where M_ji = 1/d_i if i links to j, and 0 otherwise

Matrix M is stochastic (i.e. columns sum to one)

SLIDE 9

PageRank: Eigenvector Problem

  • PageRank: solve for the eigenvector r = M r with eigenvalue λ = 1
  • An eigenvector with λ = 1 is guaranteed to exist since M is a stochastic matrix (i.e. if a = M b then Σ a_i = Σ b_i)
  • Problem: there are billions of pages on the internet. How do we solve for an eigenvector with on the order of 10^10 elements?

SLIDE 10

PageRank: Power Iteration

Model for random surfer:

  • At time t = 0 pick a page at random
  • At each subsequent time t follow an outgoing link at random

Probabilistic interpretation: p_{t+1} = M p_t, where p_t is the distribution over pages at time t

SLIDE 11

PageRank: Power Iteration

[Figure: transition matrix for the y/a/m example graph]

p_t converges to r. Iterate until |p_t − p_{t−1}| < ε.
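The iteration p_{t+1} = M p_t with the stopping rule above can be sketched as follows (plain Python; M is stored column-stochastic as on the slides, so M[j][i] is the probability of stepping from page i to page j):

```python
# Power iteration: repeatedly apply the column-stochastic matrix M to a
# uniform starting distribution until successive iterates differ by < eps (L1).
def power_iteration(M, eps=1e-10, max_iters=10_000):
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(max_iters):
        p_new = [sum(M[j][i] * p[i] for i in range(n)) for j in range(n)]
        done = sum(abs(a - b) for a, b in zip(p_new, p)) < eps
        p = p_new
        if done:
            break
    return p

# y/a/m example: y -> {y, a}, a -> {y, m}, m -> {a}
M = [
    [0.5, 0.5, 0.0],  # into y: half of y's rank, half of a's
    [0.5, 0.0, 1.0],  # into a: half of y's rank, all of m's
    [0.0, 0.5, 0.0],  # into m: half of a's rank
]
r = power_iteration(M)  # converges to (2/5, 2/5, 1/5)
```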

SLIDE 12

Intermezzo: Markov Chains

  • Markov property
  • Irreducibility
  • Ergodicity
  • Stationary distribution (for ergodic chains)

SLIDE 13

Aside: Ergodicity

  • PageRank assumes a random walk model for individual surfers
  • Equivalent assumption: a flow model in which equal fractions of surfers follow each link at every time step
  • Ergodicity: the equilibrium of the flow model is the same as the asymptotic distribution of an individual random walk


SLIDE 15

PageRank: Problems

1. Dead ends
  • Nodes with no outgoing links
  • Where do surfers go next?

2. Spider traps
  • Subgraph with no outgoing links to the wider graph
  • Surfers are "trapped" with no way out: the chain is not irreducible

SLIDE 16

Power Iteration: Dead Ends

[Figure: y/a/m example where m has no out-links; the corresponding column of M is all zeros]

Probability is not conserved.

SLIDE 17

Power Iteration: Dead Ends

[Figure: y/a/m example with teleportation added at the dead end m]

Fixes the "probability sink" issue (teleport at dead ends).

SLIDE 18

Power Iteration: Spider Traps

[Figure: y/a/m example where m links only to itself]

Probability accumulates in traps (surfers get stuck).

SLIDE 19

Solution: Random Teleports

Model for teleporting random surfer:

  • At time t = 0 pick a page at random
  • At each subsequent time t:
    • With probability β follow an outgoing link at random
    • With probability 1 − β teleport to a new location at random

PageRank equation [Page & Brin 1998]:

r_j = Σ_{i→j} β r_i / d_i + (1 − β) / N
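The PageRank equation above translates directly into an update loop. A sketch (hypothetical `pagerank` helper; it assumes the graph has no dead ends, so probability is conserved):

```python
# Teleporting PageRank: r_j = sum_{i->j} beta * r_i / d_i + (1 - beta) / N.
# out_links[i] lists the pages that page i links to; d_i = len(out_links[i]).
def pagerank(out_links, beta=0.85, iters=100):
    n = len(out_links)
    r = [1.0 / n] * n
    for _ in range(iters):
        r_new = [(1 - beta) / n] * n          # teleport term
        for i, dests in enumerate(out_links):
            for j in dests:
                r_new[j] += beta * r[i] / len(dests)
        r = r_new
    return r

# Spider-trap example (slide 18): page 2 links only to itself, but
# teleports keep it from absorbing all of the rank.
ranks = pagerank([[0, 1], [0, 2], [2]])
```

With β = 0.85 the trap page still gets the most rank, but the total remains 1 and the other pages keep a nonzero share.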

SLIDE 20

Power Iteration: Teleports

[Figure: y/a/m example with teleports]

(can use power iteration as normal)


SLIDE 23

Computing PageRank

  • M is sparse: only store the nonzero entries
  • Space is roughly proportional to the number of links
  • Say 10N entries at 4 bytes each: 4 · 10 · 1 billion ≈ 40 GB
  • Still won't fit in memory, but will fit on disk

source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23
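One power-iteration sweep over this sparse representation touches each link exactly once. A sketch (the table rows are the ones on the slide; the page count N and the source ids are assumptions for illustration):

```python
# One sparse update sweep: each table row is (source, degree, destinations).
table = [
    (0, 3, [1, 5, 7]),
    (1, 5, [17, 64, 113, 117, 245]),
    (2, 2, [13, 23]),
]
N = 246            # assumed page count (ids 0..245)
beta = 0.85
r_old = [1.0 / N] * N
r_new = [(1 - beta) / N] * N          # start from the teleport term
for src, degree, dests in table:      # one pass over the sparse table
    for dst in dests:
        r_new[dst] += beta * r_old[src] / degree
```

In a full implementation this sweep repeats until convergence, and pages without a table row (dead ends) need separate handling.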

SLIDE 24

Block-based Update Algorithm

  • Break r_new into k blocks that fit in memory
  • Scan M and r_old once for each block

src   degree   destination
0     4        0, 1, 3, 5
1     2        0, 5
2     2        3, 4

[Figure: blocks of r_new computed from M and r_old]

SLIDE 25

Block-Stripe Update Algorithm

Break M into stripes: each stripe contains only the destination nodes in the corresponding block of r_new.

[Figure: the sparse table for M split into per-block stripes, with the corresponding r_new and r_old blocks]

SLIDE 26

Problems: Term Spam

How do you make your page appear to be about movies?

  • (1) Add the word "movie" 1,000 times to your page
    • Set the text color to the background color, so only search engines see it
  • (2) Or, run the query "movie" on your target search engine
    • See what page comes first in the listings
    • Copy it into your page, and make it "invisible"

These and similar techniques are term spam.

SLIDE 27

Google's Solution to Term Spam

  • Believe what people say about you, rather than what you say about yourself
    • Use words in the anchor text (the words that appear underlined to represent the link) and its surrounding text
  • Use PageRank as a tool to measure the "importance" of Web pages

SLIDE 28

Problems 2: Link Spam

  • Once Google became the dominant search engine, spammers began to work out ways to fool it
  • Spam farms were developed to concentrate PageRank on a single page
  • Link spam: creating link structures that boost the PageRank of a page

SLIDE 29

Link Spamming

Three kinds of web pages from a spammer's point of view:

  • Inaccessible pages
  • Accessible pages
    • e.g., blog comment pages (the spammer can post links to his pages)
  • Owned pages
    • Completely controlled by the spammer
    • May span multiple domain names

SLIDE 30

Link Farms

  • Spammer's goal: maximize the PageRank of target page t
  • Technique:
    • Get as many links as possible from accessible pages to target page t
    • Construct a "link farm" to get a PageRank multiplier effect

SLIDE 31

Link Farms

[Figure: target page t linked from accessible pages and from millions of owned farm pages 1, 2, ..., M]

One of the most common and effective organizations for a link farm.

SLIDE 32

PageRank: Extensions

  • Topic-specific PageRank:
    • Restrict teleportation to a set S of pages related to a specific topic
    • Set p_{0,i} = 1/|S| if i ∈ S, and p_{0,i} = 0 otherwise
  • Trust propagation:
    • Use a set S of trusted pages as the teleport set
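Topic-specific PageRank changes only the teleport distribution. A sketch (hypothetical `topic_pagerank` helper, same no-dead-end assumption as before; teleports land uniformly in S instead of uniformly over all pages):

```python
# Topic-specific PageRank: teleport only into the topic set S.
def topic_pagerank(out_links, S, beta=0.85, iters=100):
    n = len(out_links)
    teleport = [1.0 / len(S) if i in S else 0.0 for i in range(n)]
    r = teleport[:]                        # p_0 as defined on the slide
    for _ in range(iters):
        r_new = [(1 - beta) * t for t in teleport]
        for i, dests in enumerate(out_links):
            for j in dests:
                r_new[j] += beta * r[i] / len(dests)
        r = r_new
    return r

# Three-page example with topic set S = {0}
r = topic_pagerank([[1], [0, 2], [0]], S={0})
```

For trust propagation, S is simply the set of trusted pages rather than topic pages.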

SLIDE 33

Hidden Markov Models

SLIDE 34

Time Series with Distinct States

SLIDE 35

Can we use a Gaussian Mixture Model?

[Figure: time series, histogram, mixture fit, posterior on states]


SLIDE 37

Hidden Markov Models

[Figure: state estimates from a GMM vs. from an HMM]

  • Idea: mixture model + Markov chain for states
  • Can model correlation between subsequent states (more likely to remain in the same state than to switch)


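The idea above, a Markov chain over hidden states with one Gaussian emission per state, can be sketched as a generative sampler (illustrative parameters, not from the slides):

```python
import random

# Sample T steps from a Gaussian HMM: the hidden state follows a Markov
# chain with transition matrix A; each state emits from its own Gaussian.
def sample_hmm(T, A, means, stds, seed=0):
    rng = random.Random(seed)
    K = len(A)
    z = rng.randrange(K)                              # uniform initial state
    states, xs = [], []
    for _ in range(T):
        states.append(z)
        xs.append(rng.gauss(means[z], stds[z]))       # emission x_t | z_t
        z = rng.choices(range(K), weights=A[z])[0]    # transition z_{t+1} | z_t
    return states, xs

# "Sticky" transitions: stay in the current state with probability 0.95,
# so subsequent observations are correlated (unlike a mixture model).
A = [[0.95, 0.05], [0.05, 0.95]]
states, xs = sample_hmm(200, A, means=[-2.0, 2.0], stds=[0.5, 0.5])
```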
SLIDE 38

Reminder: Random Surfers in PageRank

[Figure: y/a/m example graph with transition probabilities]

SLIDE 39

Reminder: Random Surfers in PageRank

[Figure: y/a/m example graph with transition probabilities]

A = M^T: the surfer's transition matrix A (rows sum to one) is the transpose of the column-stochastic matrix M

SLIDE 40

Hidden Markov Models

[Figure: graphical models of a Gaussian mixture vs. a Gaussian HMM]

SLIDE 41

Review: Gaussian Mixtures

Expectation Maximization

  • 1. Update cluster probabilities
  • 2. Update parameters
SLIDE 42

Forward-backward Algorithm

Expectation step for HMM. Model:

z_1 ∼ Discrete(π)
z_{t+1} | z_t = k ∼ Discrete(A_k)
x_t | z_t = k ∼ Normal(μ_k, σ_k)

Posterior state probabilities:

γ_{t,k} = p(z_t = k | x_{1:T}, θ) = p(x_{1:t}, z_t = k) p(x_{t+1:T} | z_t = k) / p(x_{1:T}) ∝ α_{t,k} β_{t,k}

Forward recursion:

α_{t,l} := p(x_{1:t}, z_t = l) = p(x_t | μ_l, σ_l) Σ_k A_{kl} α_{t−1,k}

Backward recursion:

β_{t,k} := p(x_{t+1:T} | z_t = k) = Σ_l β_{t+1,l} p(x_{t+1} | μ_l, σ_l) A_{kl}
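The α and β recursions can be sketched directly (plain-Python Gaussian density; no scaling or log-space tricks, so this sketch is only numerically safe for short sequences):

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of Normal(mu, sigma) at x
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Forward-backward: gamma[t][k] = p(z_t = k | x_{1:T}) ∝ alpha[t][k] * beta[t][k]
def forward_backward(xs, pi, A, means, stds):
    T, K = len(xs), len(pi)
    # forward: alpha[t][l] = p(x_{1:t}, z_t = l)
    alpha = [[pi[l] * normal_pdf(xs[0], means[l], stds[l]) for l in range(K)]]
    for t in range(1, T):
        alpha.append([normal_pdf(xs[t], means[l], stds[l]) *
                      sum(A[k][l] * alpha[t - 1][k] for k in range(K))
                      for l in range(K)])
    # backward: beta[t][k] = p(x_{t+1:T} | z_t = k), with beta[T-1][k] = 1
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(beta[t + 1][l] * normal_pdf(xs[t + 1], means[l], stds[l]) * A[k][l]
                       for l in range(K))
                   for k in range(K)]
    # normalize alpha * beta to get the posterior state probabilities
    gamma = []
    for t in range(T):
        w = [alpha[t][k] * beta[t][k] for k in range(K)]
        s = sum(w)
        gamma.append([v / s for v in w])
    return gamma

# Two well-separated states: the posteriors should be near-certain.
gamma = forward_backward([-2.1, -1.9, 2.0, 2.2], pi=[0.5, 0.5],
                         A=[[0.9, 0.1], [0.1, 0.9]],
                         means=[-2.0, 2.0], stds=[0.5, 0.5])
```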

SLIDE 43

Parameter Updates

  • E-step: posterior probabilities on states and transitions
  • M-step: parameter updates

SLIDE 44

Other Examples for HMMs

Handwritten digits:
  • State 1: Sweeping arc
  • State 2: Horizontal line

RNA splicing:
  • State 1: Exon (relevant)
  • State 2: Splice site
  • State 3: Intron (ignored)