4/9/2008 1
Data Mining:
Concepts and Techniques
Web Mining
Li Xiong
Slides credits: Jiawei Han and Micheline Kamber; Anand Rajaraman, Jeffrey D. Ullman Olfa Nasraoui Bing Liu
Web mining vs. data mining
Structure (or lack of it)
Linkage structure and lack of structure in textual
information
Scale
Data generated per day is comparable to largest
conventional data warehouses
Speed
Often need to react to evolving usage patterns in
real-time (e.g., merchandising)
Structure Mining
Extracting info from topology of the Web (links among
pages)
Content Mining
Extracting info from page content (text, images, audio)
Natural language processing and information retrieval
Usage Mining
Extracting info from user’s usage data on the web
(how user visits the pages or makes transactions)
Web structure mining
Web graph structure and link analysis
Web text mining
Text representation and IR models
Web usage mining
Collaborative filtering
Web as a directed graph
Pages = nodes, hyperlinks = edges
Problem: Understand the macroscopic structure
and evolution of the web graph
Practical implications
Crawling, browsing, computation of link
analysis algorithms
Source: Broder et al., 2000
Problem: exploit the link structure of a graph to order or
prioritize the set of objects within the graph
Application of social network analysis at actor level:
centrality and prestige
Algorithms
PageRank
HITS
Intuition
Web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
Links as citations: a page cited often is more important
www.stanford.edu has 23,400 inlinks; www.joe-schmoe.com has 1 inlink
Recursive model: links from heavily linked pages
weighted more
PageRank is essentially the eigenvector prestige measure in social network analysis
Each link’s vote is proportional to the importance of its
source page
If page P with importance x has n outlinks, each link gets
x/n votes
Page P’s own importance is the sum of the votes on its
inlinks
Example: three pages, Yahoo (y), Amazon (a), M'soft (m)
(Yahoo links to itself and Amazon; Amazon links to Yahoo and M'soft; M'soft links to Amazon)
y = y/2 + a/2
a = y/2 + m
m = a/2
Solving the equations with the constraint y + a + m = 1: y = 2/5, a = 2/5, m = 1/5
r_i = Σ_j M_ij r_j, where
M_ij = 1/|O_j| if (j, i) ∈ E, and M_ij = 0 otherwise
(O_j is the set of outlinks of page j)
The same example as a matrix (rows and columns ordered y, a, m):

         y    a    m
M =  y [ 1/2  1/2   0 ]
     a [ 1/2   0    1 ]
     m [  0   1/2   0 ]

y = y/2 + a/2
a = y/2 + m
m = a/2

In matrix form: [y a m]^T = M [y a m]^T
Solving the equation r = Mr by power iteration:
Suppose there are N web pages
Initialize: r_0 = [1/N, …, 1/N]^T
Iterate: r_{k+1} = M r_k
Stop when |r_{k+1} − r_k|_1 < ε
|x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) can be used
Example iteration (entries ordered y, a, m):
[1/3, 1/3, 1/3] → [1/3, 1/2, 1/6] → [5/12, 1/3, 1/4] → [3/8, 11/24, 1/6] → … → [2/5, 2/5, 1/5]
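The power iteration for r = Mr can be sketched in a few lines of Python (the function name and the ε threshold are our own; the matrix is the three-page example from the slides):

```python
def power_iterate(M, eps=1e-9):
    """Iterate r <- M r from the uniform vector until the L1 change < eps.
    M is a column-stochastic matrix given as a list of rows."""
    n = len(M)
    r = [1.0 / n] * n
    while True:
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next

# Transition matrix for the 3-page example (rows/columns ordered y, a, m):
# column j distributes page j's importance equally over its outlinks.
M = [
    [0.5, 0.5, 0.0],  # y gets half of y's vote and half of a's
    [0.5, 0.0, 1.0],  # a gets half of y's vote and all of m's
    [0.0, 0.5, 0.0],  # m gets half of a's vote
]
r = power_iterate(M)  # converges to about [0.4, 0.4, 0.2] = (2/5, 2/5, 1/5)
```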
At any time t, the surfer is on some page P
At time t+1, the surfer follows an outlink from P uniformly at random
Ends up on some page Q linked from P
Process repeats indefinitely
Let p(t) be the vector whose i-th component is the probability that the surfer is at page i at time t
Where is the surfer at time t+1?
p(t+1) = Mp(t)
Suppose the random walk reaches a state such
that p(t+ 1) = Mp(t) = p(t)
Then p(t) is a stationary distribution for the
random walk
Our rank vector r satisfies r = Mr
Theory of random walks (aka Markov processes):
A finite Markov chain defined by a stochastic transition matrix has a unique stationary distribution if the chain is irreducible and aperiodic
M is the transition matrix of the Web graph
It does not always satisfy
Σ_{i=1}^{n} M_ij = 1
where M_ij = 1/|O_j| if (j, i) ∈ E, and M_ij = 0 otherwise
Many web pages have no out-links; such pages are called dangling pages
Irreducible means that the Web graph G is strongly connected
A directed graph G = (V, E) is strongly connected if and only if, for each pair of nodes u, v ∈ V, there is a path from u to v
A general Web graph is not irreducible because for some pair of nodes u and v there is no path from u to v
A state i in a Markov chain being periodic means that there exists a directed cycle that the chain has to traverse:
state i has period k if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k
If a state is not periodic (i.e., k = 1), it is aperiodic
A Markov chain is aperiodic if all its states are aperiodic
Add a link from each page to every page
At each time step, the random surfer has a small probability of teleporting along one of those links:
With probability β, follow an outlink at random
With probability 1−β, jump to some page uniformly at random
Common values for β are in the range 0.8 to 0.9
Example (M'soft now links only to itself, a spider trap), with β = 0.8:

A = 0.8 · [ 1/2 1/2  0 ]  + 0.2 · [ 1/3 1/3 1/3 ]
          [ 1/2  0   0 ]          [ 1/3 1/3 1/3 ]
          [  0  1/2  1 ]          [ 1/3 1/3 1/3 ]

  =  y [ 7/15 7/15  1/15 ]
     a [ 7/15 1/15  1/15 ]
     m [ 1/15 7/15 13/15 ]

Iterating [y, a, m]^T (scaled to sum to 3):
[1, 1, 1] → [1.00, 0.60, 1.40] → [0.84, 0.60, 1.56] → [0.776, 0.536, 1.688] → … → [7/11, 5/11, 21/11]
The matrix A:
A_ij = βM_ij + (1−β)/N
where M_ij = 1/|O_j| when j → i and M_ij = 0 otherwise
Verify that A is a stochastic matrix (each column sums to 1)
The page rank vector r is the principal
eigenvector of this matrix
satisfying r = Ar
Equivalently, r is the stationary distribution of the
random walk with teleports
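PageRank with teleports can be sketched by forming A = βM + (1−β)/N explicitly and reusing power iteration (the function name and ε are our own; the matrix is the spider-trap example):

```python
def pagerank(M, beta=0.8, eps=1e-9):
    """Iterate r <- A r with A = beta*M + (1-beta)/N (teleport smoothing).
    M is a column-stochastic matrix given as a list of rows."""
    n = len(M)
    A = [[beta * M[i][j] + (1.0 - beta) / n for j in range(n)] for i in range(n)]
    r = [1.0 / n] * n
    while True:
        r_next = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next

# Spider-trap example from the slides: M'soft (m) links only to itself.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
r = pagerank(M)  # about (7/33, 5/33, 21/33); scaled to sum 3: (7/11, 5/11, 21/11)
```

Without the teleport term (β = 1) the iteration would drain all importance into the trap page m; the (1−β)/N term is what guarantees irreducibility and aperiodicity.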
Strengths
Fighting spam: PageRank is a global measure and is query-independent
Computed offline
Criticism: query-independence
It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic
Pages that are widely cited are good authorities Pages that cite many other pages are good hubs
When the user issues a search query, HITS expands the list of
relevant pages returned by a search engine and produces two rankings
Adjacency matrix A
A[i, j] = 1 if page i links to page j, 0 if not
The hub score vector h: a page's hub score is proportional to the sum of the authority scores of the pages it links to
h = λAa, where the constant λ is a scale factor
The authority score vector a: a page's authority score is proportional to the sum of the hub scores of the pages that link to it
a = μA^T h, where the constant μ is a scale factor
Example (rows = source page, rows and columns ordered y, a, m):

         y a m
A =  y [ 1 1 1 ]
     a [ 1 0 1 ]
     m [ 0 1 0 ]
Initialize h, a to all 1's
Repeat:
h = Aa; scale h so that its max entry is 1.0
a = A^T h; scale a so that its max entry is 1.0
Continue until h, a converge
A = [ 1 1 1 ]    A^T = [ 1 1 0 ]
    [ 1 0 1 ]          [ 1 0 1 ]
    [ 0 1 0 ]          [ 1 1 0 ]

(a(yahoo), a(amazon), a(m'soft)): (1, 1, 1) → (1, 4/5, 1) → (1, 0.75, 1) → … → (1, 0.732, 1)
(h(yahoo), h(amazon), h(m'soft)): (1, 1, 1) → (1, 2/3, 1/3) → (1, 0.73, 0.27) → … → (1.000, 0.732, 0.268)
h = λAa and a = μA^T h, so
h = λμ(AA^T)h and a = λμ(A^T A)a
Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:
h* is the principal eigenvector of AA^T
a* is the principal eigenvector of A^T A
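The dual iteration can be sketched directly from the update rules (the function name and iteration count are our own; the adjacency matrix is the example from the slides):

```python
def hits(A, iters=100):
    """HITS: h = A a and a = A^T h, each rescaled so its max entry is 1.0."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        m = max(h)
        h = [x / m for x in h]                       # scale max hub score to 1
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]  # A^T h
        m = max(a)
        a = [x / m for x in a]                       # scale max authority to 1
    return h, a

# Adjacency matrix from the example (rows = source page, order y, a, m).
A = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
h, a = hits(A)
# h -> about (1, 0.732, 0.268); a -> about (1, 0.732, 1)
```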
Strength: its ability to rank pages according to the
query topic, which may be able to provide more relevant authority and hub pages.
Weaknesses:
Easily spammed
Topic drift
Inefficiency at query time
Model
PageRank: depends on the links into S
HITS: depends on the value of the other links out of S
Characteristics
Spam resistance
Query independence
Destinies post-1998
PageRank: trademark of Google
HITS: not commonly used by search engines (Ask.com?)
Web structure mining
Web graph structure Link analysis
Web text mining Web usage mining
Collaborative filtering
Text mining refers to data mining using text
documents as data.
Tasks
Text summarization
Text classification
Text clustering
…
Intersection with Information Retrieval and
Natural Language Processing
Character (character n-grams and sequences)
Words (stop-words, stemming, lemmatization)
Phrases (word n-grams, proximity features)
Part-of-speech tags
Taxonomies / thesauri
Vector-space model
Language models
Full parsing
Cross-modality
Collaborative tagging / Web 2.0
Templates / frames
Ontologies / first-order theories
N-gram: a sub-sequence of n items from a given
sequence.
The items can be characters, words or base pairs
according to the application.
Unigram, bigram, trigram
Example: Google n-gram corpus
4-grams, with counts:
serve as the incoming (92)
serve as the incubator (99)
serve as the independent (794)
serve as the index (223)
serve as the indication (72)
serve as the indicator (120)
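Extracting n-grams is a one-liner over any sequence; this small helper (our own name) works for both word and character n-grams:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Word bigrams
words = "serve as the incoming".split()
print(ngrams(words, 2))  # [('serve', 'as'), ('as', 'the'), ('the', 'incoming')]

# Character trigrams work the same way, treating a string as a token sequence
print(ngrams("mining", 3))  # [('m', 'i', 'n'), ('i', 'n', 'i'), ('n', 'i', 'n'), ('i', 'n', 'g')]
```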
Each document is represented as a vector
Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinctive words/terms in the collection; V is called the vocabulary
A weight wij > 0 is associated with each term ti that appears in document dj; if ti does not appear in dj, wij = 0
TF (Term frequency) IDF (Inverse Document Frequency)
Each document is represented as a vector of weights:
dj = (w1j, w2j, …, w|V|j)
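To make the TF-IDF weighting concrete: the slides name TF and IDF without fixing a formula, so the sketch below assumes the common variant w = tf · log(N / df), where tf is the raw term count in a document and df is the number of documents containing the term:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute w_ij = tf(t_i, d_j) * log(N / df(t_i)) for a list of tokenized docs.
    Returns one {term: weight} dict per document."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    return [
        {t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

docs = [["web", "mining", "web"], ["data", "mining"]]
weights = tfidf(docs)
# "web" occurs twice in doc 0 and in 1 of 2 docs: weight = 2 * log(2)
# "mining" occurs in every document, so its IDF (and weight) is 0
```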
Cosine similarity (dot product) is the most widely used
similarity measure between two document vectors
…calculates the cosine of the angle between document vectors
…efficient to calculate (sum of products of intersecting words)
…similarity value between 0 (different) and 1 (the same)

sim(di, dj) = (di · dj) / (||di|| ||dj||) = Σ_k w_ki w_kj / ( sqrt(Σ_k w_ki²) · sqrt(Σ_k w_kj²) )
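With documents stored as sparse {term: weight} dicts, cosine similarity only needs the intersecting terms for the dot product (function name and toy vectors are our own):

```python
import math

def cosine(d1, d2):
    """Cosine similarity between two term-weight dicts: dot product over
    intersecting terms, divided by the Euclidean norms."""
    dot = sum(w * d2[t] for t, w in d1.items() if t in d2)
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

d1 = {"web": 2.0, "mining": 1.0}
d2 = {"data": 1.0, "mining": 1.0}
print(cosine(d1, d1))  # ~1.0 (identical documents)
print(cosine(d1, d2))  # 1 / (sqrt(5) * sqrt(2)) ~ 0.316
```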
Web structure mining
Web graph structure Link analysis
Web text mining Web usage mining
Collaborative filtering
Web Logs: Low level
Tracks queries, individual pages/items
requested by a Web browser
Application logs: Higher level
When customers check in and check out, items placed in or removed from the shopping cart, etc.
Association rule mining
Discover associations between pages and products
Sequential pattern discovery
Help to discover visit patterns and make predictions
about visit patterns
Clustering
Group similar sessions into clusters which may
correspond to user profiles / modes of usage of the website
Collaborative Filtering
Filter/recommend pages and products based on similar
users
User Perspective
Lots of web pages, online products, books,
movies, etc.
Reduce my choices…please…
Manager Perspective
"If I have 3 million customers on the web, I should have 3 million stores on the web." - CEO of Amazon.com [SCH01]
Collaborative Filtering (CF)
Based on the active user’s history Based on other users’ collective behavior
Content-based Filtering
Based on keywords and other features
Users: U = {u1, u2, …, ui, …, um}
Items: I = {i1, i2, …, ij, …, in}
A partially filled rating matrix: entry rij is user ui's rating of item ij; most entries are unknown
The task:
Q1: Find the unknown ratings rij = ?
Q2: Which items should we recommend to this user?
Unknown function f: U × I → R
User-User Methods
Memory-based: k-NN
Model-based: clustering
Item-Item Method
Correlation analysis
Linear regression
Belief network
Association rule mining
Q1: How to measure similarity? Q2: How to select neighbors? Q3: How to combine?
Pearson correlation coefficient (over the items rated by both users a and i):

w(a, i) = Σ_{j ∈ commonly rated items} (r_aj − r̄_a)(r_ij − r̄_i) / ( sqrt(Σ_j (r_aj − r̄_a)²) · sqrt(Σ_j (r_ij − r̄_i)²) )

Cosine measure: users are vectors in product-dimension space

c(a, i) = Σ_j r_aj r_ij / ( sqrt(Σ_j r_aj²) · sqrt(Σ_j r_ij²) )
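The Pearson measure over commonly rated items can be sketched as follows (function name and toy ratings are our own; ratings are {item: rating} dicts):

```python
import math

def pearson(ra, ri):
    """Pearson correlation between two users, computed only over the
    items rated by both.  ra, ri: dicts mapping item -> rating."""
    common = set(ra) & set(ri)
    if not common:
        return 0.0
    ma = sum(ra[j] for j in common) / len(common)   # mean over common items
    mi = sum(ri[j] for j in common) / len(common)
    num = sum((ra[j] - ma) * (ri[j] - mi) for j in common)
    den = math.sqrt(sum((ra[j] - ma) ** 2 for j in common)) * \
          math.sqrt(sum((ri[j] - mi) ** 2 for j in common))
    return num / den if den else 0.0

a = {"i1": 4, "i2": 2, "i3": 5}
b = {"i1": 5, "i2": 1, "i3": 4}
print(pearson(a, b))  # ~0.84: the users agree on relative preferences
```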
Offline phase:
Do nothing…just store transactions
Online phase:
Identify users highly similar to the active one
The best K ones, or all with a similarity greater than a threshold
Prediction
p_aj = r̄_a + Σ_i w(a, i)(r_ij − r̄_i) / Σ_i |w(a, i)|

r̄_a is user a's neutral (mean) rating; (r_ij − r̄_i) is neighbour i's deviation; the weighted sum is user a's estimated deviation
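The prediction formula p_aj = r̄_a + Σ_i w(a,i)(r_ij − r̄_i) / Σ_i |w(a,i)| can be sketched with precomputed neighbour similarities (the function name, the neighbour triples, and the example weights are our own illustration):

```python
def predict(mean_a, neighbours, item):
    """Predict the active user's rating for `item` as the user's mean rating
    plus the similarity-weighted average of neighbour deviations.
    neighbours: list of (weight, ratings_dict, mean_rating) triples."""
    num = sum(w * (r[item] - m) for w, r, m in neighbours if item in r)
    den = sum(abs(w) for w, r, m in neighbours if item in r)
    return mean_a + num / den if den else mean_a

# Two neighbours who both rated "i3"; the weights are hypothetical similarities.
neighbours = [
    (1.0, {"i1": 4, "i3": 5}, 4.5),   # rates i3 half a point above their mean
    (0.5, {"i2": 1, "i3": 2}, 1.5),   # likewise, with lower similarity weight
]
print(predict(3.0, neighbours, "i3"))  # 3.5: mean 3.0 plus estimated deviation 0.5
```

Note that each neighbour contributes a deviation from its own mean, so users who rate on different scales still combine sensibly.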
Offline phase:
Build clusters: k-means, k-medoids, etc.
Online phase:
Identify the nearest cluster to the active user Prediction:
Use the center of the cluster Weighted average between cluster members
Weights depend on the active user
k-NN using the Pearson measure is slower but more accurate
Clustering is more scalable, but an active user far from the cluster centers may get bad recommendations
References
Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine (PageRank). Computer Networks and ISDN Systems, 1998
Kleinberg, J. Authoritative sources in a hyperlinked environment (HITS). ACM-SIAM Symp. on Discrete Algorithms, 1998
Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. Mining the link structure of the World Wide Web. IEEE Computer, 1999
…, SIGIR 2004
1. …Processing", MIT Press, 1999
2. …, Prentice Hall, 1995
3. …Structured Data", Morgan Kaufmann, 2002
4. …papers on WordNet. Princeton University, August 1993
5. …
6. http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
7. …, 2003
8. A Road Map to Text Mining and Web Mining, University of Texas resource
9. Computational Linguistics and Text Mining Group, IBM Research, http://www.research.ibm.com/dssgrp/
…Categorization", ACM Computing Surveys, Vol. 34, No. 1, March 2002
…, ACM SIGKDD Explorations, 2000
…databases", Information Survey, Use, 4(1), 37-47, 1984
…categorization", Journal of Information Retrieval, 1:67-88, 1999
…methods", Proc. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, 1999
Aggarwal, C., Wolf, J., Wu, K.-L., and Yu, P. Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering. KDD 1999: 201-212
Breese, J., Heckerman, D., and Kadie, C. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proc. 14th Conf. on Uncertainty in Artificial Intelligence, Madison, July 1998
…product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 2003
…based on order responses. KDD 2003: 583-588
…collaborative filtering and association rule mining technique. Expert Systems with Applications, 21(3), October 2001, pp. 131-137
…berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/LinAlvarezRuiz_WebKDD2000.ppt
…association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6:83-105, 2002
Linden, G., Smith, B., and York, J. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, Vol. 7, No. 1, pp. 76-80, 2003
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Analysis of recommendation algorithms for e-commerce. ACM Conf. on Electronic Commerce 2000: 158-167
…reduction in recommender systems: a case study. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. WWW'01
…berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/badrul.ppt
Schafer, J.B., Konstan, J., and Riedl, J. E-Commerce Recommendation Applications. Data Mining and Knowledge Discovery, 5(1/2): 115-153, 2001
…, AAAI Workshop on Recommendation Systems, 1998
…personalized recommender system for the cosmetic business. Expert Systems with Applications, 26(3), April 2004, pp. 427-434
…personalized recommender systems in e-commerce. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000
…instances for efficient accurate collaborative filtering. In Proc. 10th CIKM, pp. 239-246, ACM Press, 2001
http://sifaka.cs.uiuc.edu/course/2003-497CXZ/loc/cf.ppt