SLIDE 1

Data Mining Techniques

CS 6220 - Section 2 - Spring 2017

Lecture 9: Link Analysis

Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.)

SLIDE 2

Web search before PageRank

  • Human-curated (e.g. Yahoo, Looksmart)
    • Hand-written descriptions
    • Wait time for inclusion
  • Text-search (e.g. WebCrawler, Lycos)
    • Prone to term-spam

(adapted from: Mining of Massive Datasets, http://www.mmds.org)

SLIDE 3

PageRank: Links as Votes

  • Not all pages are equally important
  • Pages with more inbound links are more important
  • Inbound links from important pages carry more weight

[Figure: pages with many vs. few/no inbound links; links from important vs. unimportant pages]

SLIDE 4

Example: PageRank Scores

[Figure: example graph with PageRank scores B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and five leaf pages at 1.6 each]

SLIDE 5

PageRank: Recursive Formulation

  • A link's vote is proportional to the importance of its source page
  • If page j with importance r_j has n out-links, each link gets r_j / n votes
  • Page j's own importance is the sum of the votes on its in-links

[Figure: page i with 3 out-links and page k with 4 out-links both link to page j, so r_j = r_i/3 + r_k/4; page j passes r_j/3 along each of its 3 out-links]

SLIDE 6

Equivalent Formulation: Random Surfer

  • At time t a surfer is on some page i
  • At time t+1 the surfer follows an outgoing link to a new page at random
  • Define rank r_i as the fraction of time spent on page i

[Figure: same example as before, with r_j = r_i/3 + r_k/4]

SLIDE 7

PageRank: The "Flow" Model

r_j = Σ_{i→j} r_i / d_i, where d_i is the out-degree of page i

"Flow" equations for the y/a/m example graph:
  r_y = r_y/2 + r_a/2
  r_a = r_y/2 + r_m
  r_m = r_a/2

  • 3 equations, 3 unknowns (solutions only determined up to a scale factor)
  • Impose constraint: r_y + r_a + r_m = 1
  • Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5
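The three flow equations above can be checked numerically. A minimal sketch in plain Python (not from the slides), iterating the equations to a fixed point from a uniform start:

```python
# Flow equations for the y/a/m example graph:
#   r_y = r_y/2 + r_a/2,  r_a = r_y/2 + r_m,  r_m = r_a/2
# Repeated substitution converges to the solution with r_y + r_a + r_m = 1.
r = {"y": 1 / 3, "a": 1 / 3, "m": 1 / 3}
for _ in range(100):
    r = {
        "y": r["y"] / 2 + r["a"] / 2,
        "a": r["y"] / 2 + r["m"],
        "m": r["a"] / 2,
    }
# r is now close to the solution r_y = 2/5, r_a = 2/5, r_m = 1/5
```

Note that the total Σ r stays 1 at every step, since each page passes all of its rank along its out-links.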

SLIDE 8

PageRank: The "Flow" Model

"Flow" equations: r_y = r_y/2 + r_a/2, r_a = r_y/2 + r_m, r_m = r_a/2

In matrix form: r = M·r, where M_ji = 1/d_i if i links to j, and 0 otherwise

Matrix M is stochastic (i.e. columns sum to one)

SLIDE 9

PageRank: Eigenvector Problem

  • PageRank: solve for the eigenvector r = M r with eigenvalue λ = 1
  • An eigenvector with λ = 1 is guaranteed to exist since M is a stochastic matrix (i.e. if a = M b then Σ a_i = Σ b_i)
  • Problem: there are billions of pages on the internet. How do we solve for an eigenvector with on the order of 10^10 elements?

SLIDE 10

PageRank: Power Iteration

Model for random surfer:

  • At time t = 0 pick a page at random
  • At each subsequent time t follow an outgoing link at random

Probabilistic interpretation: p_{t+1} = M p_t, where p_t is the distribution over pages at time t

SLIDE 11

PageRank: Power Iteration

[Figure: transition matrix for the y/a/m example graph]

p_t converges to r. Iterate until |p_t − p_{t−1}| < ε.
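The iteration p_{t+1} = M p_t with the stopping rule above can be sketched as follows (plain Python; M is stored column-stochastic as on the slides, so M[j][i] is the probability of stepping from page i to page j):

```python
# Power iteration: repeatedly apply the column-stochastic matrix M to a
# uniform starting distribution until successive iterates differ by < eps (L1).
def power_iteration(M, eps=1e-10, max_iters=10_000):
    n = len(M)
    p = [1.0 / n] * n
    for _ in range(max_iters):
        p_new = [sum(M[j][i] * p[i] for i in range(n)) for j in range(n)]
        done = sum(abs(a - b) for a, b in zip(p_new, p)) < eps
        p = p_new
        if done:
            break
    return p

# y/a/m example: y -> {y, a}, a -> {y, m}, m -> {a}
M = [
    [0.5, 0.5, 0.0],  # into y: half of y's rank, half of a's
    [0.5, 0.0, 1.0],  # into a: half of y's rank, all of m's
    [0.0, 0.5, 0.0],  # into m: half of a's rank
]
r = power_iteration(M)  # converges to (2/5, 2/5, 1/5)
```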

SLIDE 12

Intermezzo: Markov Chains

  • Markov property
  • Irreducibility
  • Ergodicity
  • Stationary distribution (for ergodic chains)

SLIDE 13

Aside: Ergodicity

  • PageRank assumes a random walk model for individual surfers
  • Equivalent assumption: a flow model in which equal fractions of surfers follow each link at every time step
  • Ergodicity: the equilibrium of the flow model is the same as the asymptotic distribution of an individual random walk


SLIDE 15

PageRank: Problems

1. Dead ends
  • Nodes with no outgoing links
  • Where do surfers go next?

2. Spider traps
  • Subgraph with no outgoing links to the wider graph
  • Surfers are "trapped" with no way out: the chain is not irreducible

SLIDE 16

Power Iteration: Dead Ends

[Figure: y/a/m example where m has no out-links; the corresponding column of M is all zeros]

Probability is not conserved.

SLIDE 17

Power Iteration: Dead Ends

[Figure: y/a/m example with teleportation added at the dead end m]

Fixes the "probability sink" issue (teleport at dead ends).

SLIDE 18

Power Iteration: Spider Traps

[Figure: y/a/m example where m links only to itself]

Probability accumulates in traps (surfers get stuck).

SLIDE 19

Solution: Random Teleports

Model for teleporting random surfer:

  • At time t = 0 pick a page at random
  • At each subsequent time t:
    • With probability β follow an outgoing link at random
    • With probability 1 − β teleport to a new location at random

PageRank equation [Page & Brin 1998]:

r_j = Σ_{i→j} β r_i / d_i + (1 − β) / N
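The PageRank equation above translates directly into an update loop. A sketch (hypothetical `pagerank` helper; it assumes the graph has no dead ends, so probability is conserved):

```python
# Teleporting PageRank: r_j = sum_{i->j} beta * r_i / d_i + (1 - beta) / N.
# out_links[i] lists the pages that page i links to; d_i = len(out_links[i]).
def pagerank(out_links, beta=0.85, iters=100):
    n = len(out_links)
    r = [1.0 / n] * n
    for _ in range(iters):
        r_new = [(1 - beta) / n] * n          # teleport term
        for i, dests in enumerate(out_links):
            for j in dests:
                r_new[j] += beta * r[i] / len(dests)
        r = r_new
    return r

# Spider-trap example (slide 18): page 2 links only to itself, but
# teleports keep it from absorbing all of the rank.
ranks = pagerank([[0, 1], [0, 2], [2]])
```

With β = 0.85 the trap page still gets the most rank, but the total remains 1 and the other pages keep a nonzero share.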

SLIDE 20

Power Iteration: Teleports

[Figure: y/a/m example with teleports]

(can use power iteration as normal)


SLIDE 23

Computing PageRank

  • M is sparse: only store the nonzero entries
  • Space is roughly proportional to the number of links
  • Say 10N entries at 4 bytes each: 4 · 10 · 1 billion ≈ 40 GB
  • Still won't fit in memory, but will fit on disk

source node   degree   destination nodes
0             3        1, 5, 7
1             5        17, 64, 113, 117, 245
2             2        13, 23
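One power-iteration sweep over this sparse representation touches each link exactly once. A sketch (the table rows are the ones on the slide; the page count N and the source ids are assumptions for illustration):

```python
# One sparse update sweep: each table row is (source, degree, destinations).
table = [
    (0, 3, [1, 5, 7]),
    (1, 5, [17, 64, 113, 117, 245]),
    (2, 2, [13, 23]),
]
N = 246            # assumed page count (ids 0..245)
beta = 0.85
r_old = [1.0 / N] * N
r_new = [(1 - beta) / N] * N          # start from the teleport term
for src, degree, dests in table:      # one pass over the sparse table
    for dst in dests:
        r_new[dst] += beta * r_old[src] / degree
```

In a full implementation this sweep repeats until convergence, and pages without a table row (dead ends) need separate handling.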

SLIDE 24

Block-based Update Algorithm

  • Break r_new into k blocks that fit in memory
  • Scan M and r_old once for each block

src   degree   destination
0     4        0, 1, 3, 5
1     2        0, 5
2     2        3, 4

[Figure: blocks of r_new computed from M and r_old]

SLIDE 25

Block-Stripe Update Algorithm

Break M into stripes: each stripe contains only the destination nodes in the corresponding block of r_new.

[Figure: the sparse table for M split into per-block stripes, with the corresponding r_new and r_old blocks]

SLIDE 26

Problems: Term Spam

How do you make your page appear to be about movies?

  • (1) Add the word "movie" 1,000 times to your page
    • Set the text color to the background color, so only search engines see it
  • (2) Or, run the query "movie" on your target search engine
    • See what page comes first in the listings
    • Copy it into your page, and make it "invisible"

These and similar techniques are term spam.

SLIDE 27

Google's Solution to Term Spam

  • Believe what people say about you, rather than what you say about yourself
    • Use words in the anchor text (the words that appear underlined to represent the link) and its surrounding text
  • Use PageRank as a tool to measure the "importance" of Web pages

SLIDE 28

Problems 2: Link Spam

  • Once Google became the dominant search engine, spammers began to work out ways to fool it
  • Spam farms were developed to concentrate PageRank on a single page
  • Link spam: creating link structures that boost the PageRank of a page

SLIDE 29

Link Spamming

Three kinds of web pages from a spammer's point of view:

  • Inaccessible pages
  • Accessible pages
    • e.g., blog comment pages (the spammer can post links to his pages)
  • Owned pages
    • Completely controlled by the spammer
    • May span multiple domain names

SLIDE 30

Link Farms

  • Spammer's goal: maximize the PageRank of target page t
  • Technique:
    • Get as many links as possible from accessible pages to target page t
    • Construct a "link farm" to get a PageRank multiplier effect

SLIDE 31

Link Farms

[Figure: target page t linked from accessible pages and from millions of owned farm pages 1, 2, ..., M]

One of the most common and effective organizations for a link farm.

SLIDE 32

PageRank: Extensions

  • Topic-specific PageRank:
    • Restrict teleportation to a set S of pages related to a specific topic
    • Set p_{0,i} = 1/|S| if i ∈ S, and p_{0,i} = 0 otherwise
  • Trust propagation:
    • Use a set S of trusted pages as the teleport set
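Topic-specific PageRank changes only the teleport distribution. A sketch (hypothetical `topic_pagerank` helper, same no-dead-end assumption as before; teleports land uniformly in S instead of uniformly over all pages):

```python
# Topic-specific PageRank: teleport only into the topic set S.
def topic_pagerank(out_links, S, beta=0.85, iters=100):
    n = len(out_links)
    teleport = [1.0 / len(S) if i in S else 0.0 for i in range(n)]
    r = teleport[:]                        # p_0 as defined on the slide
    for _ in range(iters):
        r_new = [(1 - beta) * t for t in teleport]
        for i, dests in enumerate(out_links):
            for j in dests:
                r_new[j] += beta * r[i] / len(dests)
        r = r_new
    return r

# Three-page example with topic set S = {0}
r = topic_pagerank([[1], [0, 2], [0]], S={0})
```

For trust propagation, S is simply the set of trusted pages rather than topic pages.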

SLIDE 33

Hidden Markov Models

SLIDE 34

Time Series with Distinct States

SLIDE 35

Can we use a Gaussian Mixture Model?

[Figure: time series, histogram, mixture fit, posterior on states]


SLIDE 37

Hidden Markov Models

[Figure: state estimates from a GMM vs. from an HMM]

  • Idea: mixture model + Markov chain for states
  • Can model correlation between subsequent states (more likely to remain in the same state than to switch)


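The idea above, a Markov chain over hidden states with one Gaussian emission per state, can be sketched as a generative sampler (illustrative parameters, not from the slides):

```python
import random

# Sample T steps from a Gaussian HMM: the hidden state follows a Markov
# chain with transition matrix A; each state emits from its own Gaussian.
def sample_hmm(T, A, means, stds, seed=0):
    rng = random.Random(seed)
    K = len(A)
    z = rng.randrange(K)                              # uniform initial state
    states, xs = [], []
    for _ in range(T):
        states.append(z)
        xs.append(rng.gauss(means[z], stds[z]))       # emission x_t | z_t
        z = rng.choices(range(K), weights=A[z])[0]    # transition z_{t+1} | z_t
    return states, xs

# "Sticky" transitions: stay in the current state with probability 0.95,
# so subsequent observations are correlated (unlike a mixture model).
A = [[0.95, 0.05], [0.05, 0.95]]
states, xs = sample_hmm(200, A, means=[-2.0, 2.0], stds=[0.5, 0.5])
```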
SLIDE 38

Reminder: Random Surfers in PageRank

[Figure: y/a/m example graph with transition probabilities]

SLIDE 39

Reminder: Random Surfers in PageRank

[Figure: y/a/m example graph with transition probabilities]

A = M^T: the surfer's transition matrix A (rows sum to one) is the transpose of the column-stochastic matrix M

SLIDE 40

Hidden Markov Models

[Figure: graphical models of a Gaussian mixture vs. a Gaussian HMM]

SLIDE 41

Review: Gaussian Mixtures

Expectation Maximization

  • 1. Update cluster probabilities
  • 2. Update parameters
SLIDE 42

Forward-backward Algorithm

Expectation step for HMM. Model:

z_1 ∼ Discrete(π)
z_{t+1} | z_t = k ∼ Discrete(A_k)
x_t | z_t = k ∼ Normal(μ_k, σ_k)

Posterior state probabilities:

γ_{t,k} = p(z_t = k | x_{1:T}, θ) = p(x_{1:t}, z_t = k) p(x_{t+1:T} | z_t = k) / p(x_{1:T}) ∝ α_{t,k} β_{t,k}

Forward recursion:

α_{t,l} := p(x_{1:t}, z_t = l) = p(x_t | μ_l, σ_l) Σ_k A_{kl} α_{t−1,k}

Backward recursion:

β_{t,k} := p(x_{t+1:T} | z_t = k) = Σ_l β_{t+1,l} p(x_{t+1} | μ_l, σ_l) A_{kl}
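The α and β recursions can be sketched directly (plain-Python Gaussian density; no scaling or log-space tricks, so this sketch is only numerically safe for short sequences):

```python
import math

def normal_pdf(x, mu, sigma):
    # Density of Normal(mu, sigma) at x
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Forward-backward: gamma[t][k] = p(z_t = k | x_{1:T}) ∝ alpha[t][k] * beta[t][k]
def forward_backward(xs, pi, A, means, stds):
    T, K = len(xs), len(pi)
    # forward: alpha[t][l] = p(x_{1:t}, z_t = l)
    alpha = [[pi[l] * normal_pdf(xs[0], means[l], stds[l]) for l in range(K)]]
    for t in range(1, T):
        alpha.append([normal_pdf(xs[t], means[l], stds[l]) *
                      sum(A[k][l] * alpha[t - 1][k] for k in range(K))
                      for l in range(K)])
    # backward: beta[t][k] = p(x_{t+1:T} | z_t = k), with beta[T-1][k] = 1
    beta = [[1.0] * K for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(beta[t + 1][l] * normal_pdf(xs[t + 1], means[l], stds[l]) * A[k][l]
                       for l in range(K))
                   for k in range(K)]
    # normalize alpha * beta to get the posterior state probabilities
    gamma = []
    for t in range(T):
        w = [alpha[t][k] * beta[t][k] for k in range(K)]
        s = sum(w)
        gamma.append([v / s for v in w])
    return gamma

# Two well-separated states: the posteriors should be near-certain.
gamma = forward_backward([-2.1, -1.9, 2.0, 2.2], pi=[0.5, 0.5],
                         A=[[0.9, 0.1], [0.1, 0.9]],
                         means=[-2.0, 2.0], stds=[0.5, 0.5])
```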

SLIDE 43

Parameter Updates

  • E-step: posterior probabilities on states and transitions
  • M-step: parameter updates

SLIDE 44

Other Examples for HMMs

Handwritten digits:
  • State 1: Sweeping arc
  • State 2: Horizontal line

RNA splicing:
  • State 1: Exon (relevant)
  • State 2: Splice site
  • State 3: Intron (ignored)