Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis

  1. Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 9: Link Analysis. Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

  2. Web search before PageRank • Human-curated (e.g. Yahoo, Looksmart): hand-written descriptions, wait time for inclusion • Text search (e.g. WebCrawler, Lycos): prone to term spam (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  3. PageRank: Links as Votes • Not all pages are equally important [Figure: pages with few/no inbound links vs. many inbound links; links from unimportant pages vs. links from important pages] • Pages with more inbound links are more important • Inbound links from important pages carry more weight (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  4. Example: PageRank Scores [Figure: example graph with PageRank scores A = 3.3, B = 38.4, C = 34.3, D = 3.9, E = 8.1, F = 3.9, and five unlabeled pages at 1.6 each] (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  5. PageRank: Recursive Formulation [Figure: pages i and k link to page j; the link from i carries $r_i/3$, the link from k carries $r_k/4$, and each of j's three out-links carries $r_j/3$] $r_j = r_i/3 + r_k/4$ • A link’s vote is proportional to the importance of its source page • If page j with importance $r_j$ has n out-links, each link gets $r_j / n$ votes • Page j’s own importance is the sum of the votes on its in-links (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  6. Equivalent Formulation: Random Surfer [Figure: pages i and k link to page j; the link from i carries $r_i/3$, the link from k carries $r_k/4$, and each of j's three out-links carries $r_j/3$] $r_j = r_i/3 + r_k/4$ • At time t a surfer is on some page i • At time t+1 the surfer follows a link to a new page at random • Define rank $r_i$ as the fraction of time spent on page i (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  7. PageRank: The “Flow” Model $r_j = \sum_{i \to j} r_i / d_i$, where $d_i$ is the number of out-links of page i [Figure: graph over pages y, a, m with edge flows $r_y/2$, $r_a/2$, $r_m$] “Flow” equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$ • 3 equations, 3 unknowns • Impose constraint: $r_y + r_a + r_m = 1$ • Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$ (adapted from: Mining of Massive Datasets, http://www.mmds.org)
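
As a quick check of the stated solution, the flow equations can be solved exactly. The sketch below is not from the slides: `solve` is a hypothetical, minimal Gauss-Jordan helper built on Python's `fractions` module, and the third flow equation is replaced by the normalization constraint because it is implied by the other two.

```python
from fractions import Fraction as F

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with exact fractions.

    Hypothetical helper for illustration; raises StopIteration if A is singular.
    """
    n = len(A)
    # Augmented matrix [A | b]
    aug = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot equals 1
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate this column from every other row
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n] for row in aug]

# Unknowns in the order (r_y, r_a, r_m)
A = [
    [F(1, 2), F(-1, 2), 0],  # r_y = r_y/2 + r_a/2
    [F(-1, 2), 1, -1],       # r_a = r_y/2 + r_m
    [1, 1, 1],               # r_y + r_a + r_m = 1
]
b = [0, 0, 1]

r_y, r_a, r_m = solve(A, b)
print(r_y, r_a, r_m)  # 2/5 2/5 1/5
```

The dropped equation $r_m = r_a/2$ holds for the result as well, which is what makes the three flow equations plus the constraint consistent.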

  8. PageRank: The “Flow” Model $r_j = \sum_{i \to j} r_i / d_i$ [Figure: graph over pages y, a, m with edge flows $r_y/2$, $r_a/2$, $r_m$] “Flow” equations: $r_y = r_y/2 + r_a/2$, $r_a = r_y/2 + r_m$, $r_m = r_a/2$ • In matrix form: $r = M \cdot r$ • Matrix M is stochastic (i.e. its columns sum to one) (adapted from: Mining of Massive Datasets, http://www.mmds.org)
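
To make the matrix form concrete, here is a minimal sketch that builds M from out-link lists. The link structure (y links to y and a, a links to y and m, m links to a) is an assumption read off the flow equations, and `transition_matrix` is a hypothetical helper name.

```python
from fractions import Fraction

# Assumed out-links, read off the flow equations
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
nodes = ["y", "a", "m"]

def transition_matrix(out_links, nodes):
    """Return M with M[j][i] = 1/d_i if page i links to page j, else 0."""
    index = {name: k for k, name in enumerate(nodes)}
    n = len(nodes)
    M = [[Fraction(0)] * n for _ in range(n)]
    for src, dests in out_links.items():
        for dst in dests:
            M[index[dst]][index[src]] += Fraction(1, len(dests))
    return M

M = transition_matrix(out_links, nodes)

# Each column sums to one, so M is column-stochastic
column_sums = [sum(M[j][i] for j in range(len(nodes))) for i in range(len(nodes))]

# The flow solution r = (2/5, 2/5, 1/5) satisfies r = M r
r = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
Mr = [sum(M[j][i] * r[i] for i in range(len(nodes))) for j in range(len(nodes))]
```

Column i of M describes where the importance of page i flows, which is why the columns (not the rows) sum to one.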

  9. PageRank: Eigenvector Problem • PageRank: solve for the eigenvector $r = M r$ with eigenvalue $\lambda = 1$ • An eigenvector with $\lambda = 1$ is guaranteed to exist since M is a stochastic matrix (i.e. if $a = M b$ then $\sum_i a_i = \sum_i b_i$) • Problem: there are billions of pages on the internet. How do we solve for an eigenvector with order $10^{10}$ elements?

  10. PageRank: Power Iteration Model for random surfer: • At time t = 0 pick a page at random • At each subsequent time t follow an outgoing link at random Probabilistic interpretation:

  11. PageRank: Power Iteration [Figure: graph over pages y, a, m with edge flows y/2, a/2, m] $p_t$ converges to $r$. Iterate until $|p_t - p_{t-1}| < \varepsilon$ (adapted from: Mining of Massive Datasets, http://www.mmds.org)
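
A minimal sketch of power iteration on the three-page example follows. It assumes the column-stochastic matrix implied by the flow equations (pages ordered y, a, m), a uniform starting vector, and the L1 norm for the stopping rule; the slide does not say which norm is meant, and `power_iteration` is a hypothetical name.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Repeat p <- M p from a uniform start until the L1 change drops below eps."""
    n = len(M)
    p = [1.0 / n] * n  # at t = 0 the surfer picks a page uniformly at random
    for _ in range(max_iter):
        p_next = [sum(M[j][i] * p[i] for i in range(n)) for j in range(n)]
        if sum(abs(x - y) for x, y in zip(p_next, p)) < eps:
            return p_next
        p = p_next
    return p

# Assumed matrix for y -> {y, a}, a -> {y, m}, m -> {a}
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]

r = power_iteration(M)
print([round(x, 4) for x in r])  # approximately the flow solution (2/5, 2/5, 1/5)
```

Each iteration only needs a matrix-vector product, which is what makes the method usable when the full eigenproblem is far too large to solve directly.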

  12. Intermezzo: Markov Chains • Markov property • Irreducibility • Ergodicity • Stationary distribution (for ergodic chains)

  13. Aside: Ergodicity • PageRank assumes a random walk model for individual surfers • Equivalent assumption: a flow model in which equal fractions of surfers follow each link at every time • Ergodicity: the equilibrium of the flow model is the same as the asymptotic distribution for an individual random walk
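
The ergodicity claim can be illustrated by simulating a single surfer and comparing the fraction of time spent on each page with the flow-model equilibrium. This is only a sketch: the three-page graph (y links to y and a, a links to y and m, m links to a) is the one implied by the flow equations, and the step count, seed, and the name `simulate_surfer` are arbitrary choices made here.

```python
import random

# Assumed out-links for the three-page example
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def simulate_surfer(out_links, steps, seed=0):
    """Follow random out-links for `steps` steps; return the fraction of time per page."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    page = rng.choice(sorted(out_links))
    visits = {name: 0 for name in out_links}
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {name: count / steps for name, count in visits.items()}

time_fractions = simulate_surfer(out_links, steps=200_000)
print(time_fractions)  # close to the flow solution y = 0.4, a = 0.4, m = 0.2
```

The match is only approximate for any finite number of steps; the statement on the slide concerns the limit of a long walk.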

  15. PageRank: Problems 1. Dead Ends • Nodes with no outgoing links • Where do surfers go next? 2. Spider Traps • Subgraph with no outgoing links to the wider graph • Surfers are “trapped” with no way out (the chain is not irreducible) (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  16. Power Iteration: Dead Ends [Figure: graph over pages y, a, m with edge flows y/2, a/2, in which m is a dead end] Probability is not conserved (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  17. Power Iteration: Dead Ends [Figure: graph over pages y, a, m with edge flows y/2, a/2, in which dead end m teleports] (teleport at dead ends) Fixes the “probability sink” issue (adapted from: Mining of Massive Datasets, http://www.mmds.org)
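
The leak and its fix can be sketched in a few lines. The matrix below assumes the three-page example with m turned into a dead end (its column is all zeros); `patch_dead_ends` is a hypothetical helper that replaces every all-zero column with a uniform column, which is one way to express "teleport at dead ends".

```python
def step(M, p):
    """One power-iteration step: return M p."""
    n = len(p)
    return [sum(M[j][i] * p[i] for i in range(n)) for j in range(n)]

def iterate(M, steps):
    """Run a fixed number of steps from the uniform distribution."""
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(steps):
        p = step(M, p)
    return p

def patch_dead_ends(M):
    """Replace each all-zero column (a dead end) with a uniform teleport column."""
    n = len(M)
    patched = [row[:] for row in M]
    for i in range(n):
        if all(M[j][i] == 0 for j in range(n)):
            for j in range(n):
                patched[j][i] = 1.0 / n
    return patched

# Assumed example, pages ordered y, a, m: y -> {y, a}, a -> {y, m}, m -> {} (dead end)
M_dead = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0],
]

leaky = iterate(M_dead, 50)
fixed = iterate(patch_dead_ends(M_dead), 50)

print(sum(leaky))  # far below 1: probability has leaked out through m
print(sum(fixed))  # stays at 1 up to rounding
```

After patching, every column sums to one again, so the matrix is stochastic and power iteration conserves probability.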

  18. Power Iteration: Spider Traps [Figure: graph over pages y, a, m with edge flows y/2, a/2, m, in which m links only to itself] Probability accumulates in traps (surfers get stuck) (adapted from: Mining of Massive Datasets, http://www.mmds.org)
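
A short sketch shows the accumulation numerically. It assumes the three-page example with m linking only to itself, so the matrix stays column-stochastic but the trap absorbs all probability; the step count of 100 is an arbitrary choice.

```python
def step(M, p):
    """One power-iteration step: return M p."""
    n = len(p)
    return [sum(M[j][i] * p[i] for i in range(n)) for j in range(n)]

# Assumed example, pages ordered y, a, m: y -> {y, a}, a -> {y, m}, m -> {m} (spider trap)
M_trap = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]

p = [1.0 / 3] * 3
for _ in range(100):
    p = step(M_trap, p)

print([round(x, 4) for x in p])  # almost all probability ends up on m
```

Unlike the dead-end case, total probability is conserved here; the problem is that the resulting ranks say nothing useful about y and a.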

  19. Solution: Random Teleports Model for teleporting random surfer: • At time t = 0 pick a page at random • At each subsequent time t: with probability β follow an outgoing link at random; with probability 1 − β teleport to a new initial location at random • PageRank equation [Page & Brin 1998]: $r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{N}$ (adapted from: Mining of Massive Datasets, http://www.mmds.org)
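
The PageRank equation translates almost directly into code. The sketch below is a minimal illustration under stated assumptions: β = 0.8 (the slides do not fix a value), every page has at least one out-link (dead ends would need the separate fix), an L1 stopping rule, and the spider-trap variant of the three-page example as input. The name `pagerank` is hypothetical.

```python
def pagerank(out_links, beta=0.8, eps=1e-12, max_iter=1000):
    """Iterate r_j = sum_{i -> j} beta * r_i / d_i + (1 - beta) / N.

    Assumes every page has at least one out-link (no dead ends).
    """
    nodes = sorted(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Teleport share: every page receives (1 - beta) / N
        r_new = {v: (1.0 - beta) / n for v in nodes}
        # Link share: page i passes beta * r_i / d_i along each out-link
        for src in nodes:
            dests = out_links[src]
            for dst in dests:
                r_new[dst] += beta * r[src] / len(dests)
        delta = sum(abs(r_new[v] - r[v]) for v in nodes)
        r = r_new
        if delta < eps:
            break
    return r

# Assumed example: m is a spider trap (links only to itself)
trap_graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}
ranks = pagerank(trap_graph)
print(ranks)  # close to y = 7/33, a = 5/33, m = 21/33 for beta = 0.8
```

With teleports the trap still collects the largest share, but y and a keep nonzero ranks, and the fixed point can be checked by substituting 7/33, 5/33, 21/33 into the equation.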

  20. Power Iteration: Teleports [Figure: graph over pages y, a, m with teleports] (can use power iteration as normal) (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  23. Computing PageRank • M is sparse: only store nonzero entries • Space is roughly proportional to the number of links • Say 10N entries for N pages; at 4 bytes per entry and N = 1 billion, that is 4 × 10 × 1 billion = 40 GB • Still won’t fit in memory, but will fit on disk (adapted from: Mining of Massive Datasets, http://www.mmds.org)
      source node | degree | destination nodes
      0           | 3      | 1, 5, 7
      1           | 5      | 17, 64, 113, 117, 245
      2           | 2      | 13, 23
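
The sparse encoding and the space estimate can be written out as a small sketch. The tuple layout `(source, degree, destinations)` mirrors the table on the slide; the constant names are hypothetical, and reading the factor 4 as bytes per stored integer is an assumption.

```python
# Sparse encoding of M: one (source, out-degree, destinations) row per page
rows = [
    (0, 3, [1, 5, 7]),
    (1, 5, [17, 64, 113, 117, 245]),
    (2, 2, [13, 23]),
]

# The stored degree must match the number of listed destinations
degrees_consistent = all(degree == len(dests) for _, degree, dests in rows)

# Back-of-the-envelope space estimate from the slide
BYTES_PER_ENTRY = 4          # assumption: one 4-byte integer per link
LINKS_PER_PAGE = 10          # "say 10N" links for N pages
NUM_PAGES = 1_000_000_000    # 1 billion pages

estimated_bytes = BYTES_PER_ENTRY * LINKS_PER_PAGE * NUM_PAGES
estimated_gb = estimated_bytes / 10**9
print(estimated_gb)  # 40.0
```

Only the nonzero entries of each column are stored; the value of each entry, 1/degree, is recovered from the degree field.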

  24. Block-based Update Algorithm • Break $r^{new}$ into k blocks that fit in memory • Scan M and $r^{old}$ once for each block [Figure: $r^{new}$ and $r^{old}$ with entries 0 to 5, and M as listed below] (adapted from: Mining of Massive Datasets, http://www.mmds.org)
      src | degree | destination
      0   | 4      | 0, 1, 3, 5
      1   | 2      | 0, 5
      2   | 2      | 3, 4
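
A minimal in-memory sketch of the block-based update follows. It uses the three source rows shown on the slide and treats the remaining pages as having no listed out-links, so this is a single update pass for illustration rather than a full PageRank run. The names `block_update` and `direct_update`, the teleport term with β = 0.8, and k = 3 are assumptions.

```python
def direct_update(rows, r_old, n, beta=0.8):
    """Reference: one update pass with all of r_new held in memory."""
    r_new = [(1.0 - beta) / n] * n
    for src, degree, dests in rows:
        for dst in dests:
            r_new[dst] += beta * r_old[src] / degree
    return r_new

def block_update(rows, r_old, n, k, beta=0.8):
    """One update pass that keeps only one block of r_new in memory at a time."""
    block_size = (n + k - 1) // k
    r_new = []
    for b in range(k):
        lo, hi = b * block_size, min((b + 1) * block_size, n)
        block = [(1.0 - beta) / n] * (hi - lo)
        # Full scan of M and r_old for every block
        for src, degree, dests in rows:
            for dst in dests:
                if lo <= dst < hi:  # ignore destinations outside this block
                    block[dst - lo] += beta * r_old[src] / degree
        r_new.extend(block)  # on real data this block would be written to disk
    return r_new

# Rows of M as listed on the slide: (src, degree, destinations)
rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
n = 6
r_old = [1.0 / n] * n

blocked = block_update(rows, r_old, n, k=3)
direct = direct_update(rows, r_old, n)
```

The result is the same as the direct pass; the price is k scans of M and `r_old` instead of one.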

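  25. Block-Stripe Update Algorithm • Break M into stripes: each stripe contains only destination nodes in the corresponding block of $r^{new}$ [Figure: $r^{new}$ and $r^{old}$ with entries 0 to 5, and M split into three stripes as listed below] (adapted from: Mining of Massive Datasets, http://www.mmds.org)
      block of r new | src | degree | destination
      0, 1           | 0   | 4      | 0, 1
      0, 1           | 1   | 3      | 0
      0, 1           | 2   | 2      | 1
      2, 3           | 0   | 4      | 3
      2, 3           | 2   | 2      | 3
      4, 5           | 0   | 4      | 5
      4, 5           | 1   | 3      | 5
      4, 5           | 2   | 2      | 4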
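
The striping step can be sketched as a preprocessing pass over the sparse rows. The input rows below are illustrative values chosen here (a small matrix with consistent degrees), not the exact table from the slide, and `make_stripes` and `stripe_update` are hypothetical names. Note that each stripe row keeps the page's full out-degree, since the vote per link is still r/degree.

```python
def make_stripes(rows, n, k):
    """Split M into k stripes; stripe b holds only destinations in block b of r_new."""
    block_size = (n + k - 1) // k
    stripes = [[] for _ in range(k)]
    for src, degree, dests in rows:
        per_block = {}
        for dst in dests:
            per_block.setdefault(dst // block_size, []).append(dst)
        for b, block_dests in per_block.items():
            # Keep the full out-degree: each link still carries r_old[src] / degree
            stripes[b].append((src, degree, block_dests))
    return stripes

def stripe_update(stripes, r_old, n, beta=0.8):
    """One update pass in which block b of r_new reads only stripe b of M."""
    k = len(stripes)
    block_size = (n + k - 1) // k
    r_new = []
    for b, stripe in enumerate(stripes):
        lo = b * block_size
        hi = min(lo + block_size, n)
        block = [(1.0 - beta) / n] * (hi - lo)
        for src, degree, dests in stripe:
            for dst in dests:
                block[dst - lo] += beta * r_old[src] / degree
        r_new.extend(block)
    return r_new

# Illustrative rows: (src, degree, destinations)
rows = [(0, 4, [0, 1, 3, 5]), (1, 2, [0, 5]), (2, 2, [3, 4])]
n = 6
stripes = make_stripes(rows, n, k=3)
r_new = stripe_update(stripes, [1.0 / n] * n, n)
```

Compared with the block-based scheme, each block now reads only its own stripe, so M is scanned roughly once per pass, at the cost of repeating the source and degree fields in several stripes.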

  26. Problems: Term Spam • How do you make your page appear to be about movies? • (1) Add the word “movie” 1,000 times to your page and set the text color to the background color, so only search engines would see it • (2) Or, run the query “movie” on your target search engine, see what page came first in the listings, copy it into your page, and make it “invisible” • These and similar techniques are term spam (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  27. Google’s Solution to Term Spam • Believe what people say about you, rather than what you say about yourself • Use words in the anchor text (words that appear underlined to represent the link) and its surrounding text • PageRank as a tool to measure the “importance” of Web pages (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  28. Problems 2: Link Spam • Once Google became the dominant search engine, spammers began to work out ways to fool Google • Spam farms were developed to concentrate PageRank on a single page • Link spam: creating link structures that boost the PageRank of a page (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  29. Link Spamming • Three kinds of web pages from a spammer’s point of view: • Inaccessible pages • Accessible pages, e.g. blog comment pages (the spammer can post links to their own pages) • Owned pages: completely controlled by the spammer; may span multiple domain names (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  30. Link Farms • Spammer’s goal: maximize the PageRank of target page t • Technique: • Get as many links as possible from accessible pages to target page t • Construct a “link farm” to get a PageRank multiplier effect (adapted from: Mining of Massive Datasets, http://www.mmds.org)

  31. Link Farms [Figure: inaccessible pages; accessible pages; owned pages consisting of target page t and farm pages 1, 2, ..., M (millions of farm pages)] One of the most common and effective organizations for a link farm (adapted from: Mining of Massive Datasets, http://www.mmds.org)
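
The multiplier effect can be illustrated on a toy graph. Everything in the sketch below is an assumption made for illustration: the farm structure (t links to every farm page and every farm page links back to t), the small ring of inaccessible pages, a single accessible page, 20 farm pages, β = 0.8, and the helper names `pagerank` and `build_web`.

```python
def pagerank(out_links, beta=0.8, eps=1e-12, max_iter=1000):
    """Power iteration with uniform teleports; assumes every page has an out-link."""
    nodes = sorted(out_links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        r_new = {v: (1.0 - beta) / n for v in nodes}
        for src in nodes:
            dests = out_links[src]
            for dst in dests:
                r_new[dst] += beta * r[src] / len(dests)
        delta = sum(abs(r_new[v] - r[v]) for v in nodes)
        r = r_new
        if delta < eps:
            break
    return r

def build_web(farm_size):
    """Toy web: 5 inaccessible pages in a ring, 1 accessible page, target t, optional farm."""
    web = {f"w{i}": [f"w{(i + 1) % 5}"] for i in range(5)}
    web["w0"].append("acc")
    web["acc"] = ["t", "w0"]  # accessible page where the spammer posted a link to t
    if farm_size == 0:
        web["t"] = ["w0"]     # without a farm, t just links back into the web
    else:
        farm = [f"f{i}" for i in range(farm_size)]
        web["t"] = farm       # t links to every farm page ...
        for page in farm:
            web[page] = ["t"]  # ... and every farm page links back to t
    return web

without_farm = pagerank(build_web(0))
with_farm = pagerank(build_web(20))
print(without_farm["t"], with_farm["t"])  # the farm raises t's PageRank substantially
```

The farm pages each collect a small teleport share and funnel it to t, and t's own rank is recycled through the farm, which is where the multiplier comes from.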

  32. PageRank: Extensions • Topic-specific PageRank: restrict teleportation to some set S of pages related to a specific topic; set $p_{0i} = 1/|S|$ if $i \in S$, $p_{0i} = 0$ otherwise • Trust propagation: use a set S of trusted pages as the teleport set
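
Restricting the teleport set is a small change to the usual iteration. The sketch below is an illustration under assumptions: $p_0$ is used both as the teleport distribution and as the starting vector, β = 0.8, every page has an out-link, and the graph and the set S = {y} are made up for the example. The name `topic_pagerank` is hypothetical.

```python
def topic_pagerank(out_links, topic_set, beta=0.8, eps=1e-12, max_iter=1000):
    """PageRank where teleports land only on pages in topic_set.

    Assumes every page has at least one out-link and topic_set is non-empty.
    """
    nodes = sorted(out_links)
    # p0_i = 1/|S| if i is in S, 0 otherwise
    p0 = {v: (1.0 / len(topic_set) if v in topic_set else 0.0) for v in nodes}
    r = dict(p0)
    for _ in range(max_iter):
        r_new = {v: (1.0 - beta) * p0[v] for v in nodes}
        for src in nodes:
            dests = out_links[src]
            for dst in dests:
                r_new[dst] += beta * r[src] / len(dests)
        delta = sum(abs(r_new[v] - r[v]) for v in nodes)
        r = r_new
        if delta < eps:
            break
    return r

# Assumed example graph and topic set
graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
topic_ranks = topic_pagerank(graph, topic_set={"y"})
print(topic_ranks)  # close to y = 17/31, a = 10/31, m = 4/31
```

Pages in or near S receive a larger share than under uniform teleports. Trust propagation uses the same computation with a set of trusted pages as S.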

  33. Hidden Markov Models

  34. Time Series with Distinct States

  35. Can we use a Gaussian Mixture Model? [Figure panels: Time Series, Histogram, Mixture, Posterior on states]

  37. Hidden Markov Models [Figure panels: Estimate from GMM, Estimate from HMM] • Idea: mixture model + Markov chain for states • Can model correlation between subsequent states (more likely to be in the same state than in a different state)

