
Data Mining: Concepts and Techniques. Web Mining (Li Xiong slides)


  1. Data Mining: Concepts and Techniques. Web Mining. Li Xiong. Slide credits: Jiawei Han and Micheline Kamber; Anand Rajaraman and Jeffrey D. Ullman; Olfa Nasraoui; Bing Liu. 4/9/2008

  2. Web Mining
     - Web mining vs. data mining
       - Structure (or lack of it): linkage structure, and lack of structure in textual information
       - Scale: data generated per day is comparable to the largest conventional data warehouses
       - Speed: often need to react to evolving usage patterns in real time (e.g., merchandising)

  3. Web Mining
     - Structure mining: extracting information from the topology of the Web (links among pages)
     - Content mining: extracting information from page content (text, images, audio, video, etc.); draws on natural language processing and information retrieval
     - Usage mining: extracting information from users' usage data on the Web (how users visit pages or make transactions)

  4. Web Mining (overview diagram)

  5. Web Mining
     - Web structure mining: Web graph structure and link analysis
     - Web text mining: text representation and IR models
     - Web usage mining: collaborative filtering

  6. Structure of the Web Graph
     - Web as a directed graph: pages = nodes, hyperlinks = edges
     - Problem: understand the macroscopic structure and evolution of the Web graph
     - Practical implications: crawling, browsing, computation of link analysis algorithms

  7. Power-Law Degree Distribution (Source: Broder et al., 2000)

  8. Bow-Tie Structure (Broder et al., 2000)

  9. The Daisy Structure (Donato et al., 2005)

  10. Link Analysis
     - Problem: exploit the link structure of a graph to order or prioritize the set of objects within the graph
     - An application of social network analysis at the actor level: centrality and prestige
     - Algorithms: PageRank, HITS

  11. PageRank (Brin & Page '98)
     - Intuition
       - Web pages are not equally "important": www.joe-schmoe.com vs. www.stanford.edu
       - Links as citations: a page cited often is more important (www.stanford.edu has 23,400 in-links; www.joe-schmoe.com has 1 in-link)
       - Recursive model: links from heavily linked pages are weighted more
     - PageRank is essentially the eigenvector prestige measure from social network analysis

  12. Simple Recursive Flow Model
     - Each link's vote is proportional to the importance of its source page
     - If page P with importance x has n out-links, each link gets x/n votes
     - Page P's own importance is the sum of the votes on its in-links
     - Example (Yahoo y, Amazon a, M'soft m):
         y = y/2 + a/2
         a = y/2 + m
         m = a/2
       Solving with the constraint y + a + m = 1 gives y = 2/5, a = 2/5, m = 1/5
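The flow equations above can be checked directly. A minimal Python sketch (the page names y, a, m follow the slide's Yahoo/Amazon/M'soft example):

```python
# Claimed solution of the recursive flow model from the slide
y, a, m = 2/5, 2/5, 1/5

# Each page's importance equals the sum of the votes on its in-links
assert abs(y - (y/2 + a/2)) < 1e-12  # y gets half of y's and half of a's vote
assert abs(a - (y/2 + m)) < 1e-12    # a gets half of y's vote and all of m's
assert abs(m - a/2) < 1e-12          # m gets half of a's vote
assert abs((y + a + m) - 1) < 1e-12  # normalization constraint

print(y, a, m)  # prints: 0.4 0.4 0.2
```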

  13. Matrix Formulation
     - Web link matrix M: one row and one column per web page
         M_ij = 1/|O(j)|  if (j, i) ∈ E
         M_ij = 0         otherwise
       where O(j) is the set of out-links of page j
     - Rank vector r: one entry per web page
     - Flow equation: r = M r, i.e., r_i = Σ_j M_ij r_j
     - So r is an eigenvector of M

  14. Matrix Formulation Example (Yahoo y, Amazon a, M'soft m)

              y    a    m
         y   1/2  1/2   0
         a   1/2   0    1
         m    0   1/2   0

       r = M r is exactly the flow equations:
         y = y/2 + a/2
         a = y/2 + m
         m = a/2

  15. Power Iteration Method
     - Solving the equation r = M r
     - Suppose there are N web pages
     - Initialize: r_0 = [1/N, ..., 1/N]^T
     - Iterate: r_{k+1} = M r_k
     - Stop when |r_{k+1} - r_k|_1 < ε
       - |x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) can be used instead

  16. Power Iteration Example

              y    a    m
         y   1/2  1/2   0
         a   1/2   0    1
         m    0   1/2   0

       Iterates:
         y:  1/3   1/3   5/12   3/8   ...  2/5
         a:  1/3   1/2   1/3   11/24  ...  2/5
         m:  1/3   1/6   1/4    1/6   ...  1/5
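The iteration on this slide can be reproduced with a short Python sketch of the power iteration method. The matrix and the convergence target are taken from the slides; the tolerance `eps` is an assumed value:

```python
def power_iteration(M, eps=1e-10):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [1.0 / n] * n  # r_0 = [1/N, ..., 1/N]
    while True:
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(r_next[i] - r[i]) for i in range(n)) < eps:
            return r_next
        r = r_next

# Link matrix for the Yahoo/Amazon/M'soft example: M[i][j] = 1/|O(j)| if j -> i
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
r = power_iteration(M)
print(r)  # converges to approximately [2/5, 2/5, 1/5]
```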

  17. Random Walk Interpretation
     - Imagine a random web surfer
       - At any time t, the surfer is on some page P
       - At time t+1, the surfer follows an out-link from P uniformly at random
       - The surfer ends up on some page Q linked from P
       - The process repeats indefinitely
     - p(t) is the probability distribution whose i-th component is the probability that the surfer is at page i at time t

  18. The Stationary Distribution
     - Where is the surfer at time t+1? p(t+1) = M p(t)
     - Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t)
     - Then p(t) is a stationary distribution of the random walk
     - Our rank vector r satisfies r = M r, so it is a stationary distribution of this walk

  19. Existence and Uniqueness of the Solution
     - Theory of random walks (aka Markov processes): a finite Markov chain defined by a stochastic matrix has a unique stationary probability distribution if the matrix is irreducible and aperiodic

  20. M Is Not a Stochastic Matrix
     - M is the transition matrix of the Web graph:
         M_ij = 1/|O(j)|  if (j, i) ∈ E
         M_ij = 0         otherwise
     - It does not always satisfy Σ_i M_ij = 1, because many web pages have no out-links
     - Such pages are called dangling pages (CS583, Bing Liu, UIC)
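A quick way to spot dangling pages is to check the column sums of M. A small Python sketch on a hypothetical three-page graph where page 2 has no out-links (the graph itself is an assumption, not from the slides):

```python
# Hypothetical link matrix: page 0 links to page 1; page 1 links to pages 0 and 2;
# page 2 is dangling (no out-links), so its column is all zeros.
M = [[0.0, 0.5, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 0.5, 0.0]]

# A column sums to 1 exactly when the corresponding page has out-links
col_sums = [sum(M[i][j] for i in range(3)) for j in range(3)]
dangling = [j for j, s in enumerate(col_sums) if s == 0]
print(col_sums, dangling)  # prints: [1.0, 1.0, 0.0] [2]
```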

  21. M Is Not Irreducible
     - Irreducible means that the Web graph G is strongly connected
     - Definition: a directed graph G = (V, E) is strongly connected if and only if, for each pair of nodes u, v ∈ V, there is a path from u to v
     - A general Web graph is not irreducible, because for some pairs of nodes u and v there is no path from u to v

  22. M Is Not Aperiodic
     - A state i in a Markov chain is periodic if there is a directed cycle that the chain is forced to traverse
     - Definition: a state i is periodic with period k > 1 if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k
     - If a state is not periodic (i.e., k = 1), it is aperiodic
     - A Markov chain is aperiodic if all of its states are aperiodic

  23. Solution: Random Teleports
     - Add a link from each page to every page, and give the random surfer a small probability of following one of these teleport links:
       - With probability β, follow an out-link chosen at random
       - With probability 1 - β, jump to some page chosen uniformly at random
     - Common values for β are in the range 0.8 to 0.9

  24. Random Teleports Example (β = 0.8)

              1/2  1/2   0          1/3  1/3  1/3
     A = 0.8  1/2   0    0   + 0.2  1/3  1/3  1/3
               0   1/2   1          1/3  1/3  1/3

       (in this example M'soft links only to itself, a spider trap)

              y      a      m
         y   7/15   7/15   1/15
         a   7/15   1/15   1/15
         m   1/15   7/15  13/15

       Power iteration starting from [1, 1, 1]:
         y:  1   1.00   0.84   0.776  ...  7/11
         a:  1   0.60   0.60   0.536  ...  5/11
         m:  1   1.40   1.56   1.688  ...  21/11

  25. Matrix Formulation
     - Matrix A:
         A_ij = β M_ij + (1 - β)/N
       where M_ij = 1/|O(j)| when j → i and M_ij = 0 otherwise
     - Verify that A is a stochastic matrix
     - The PageRank vector r is the principal eigenvector of A, satisfying r = A r
     - Equivalently, r is the stationary distribution of the random walk with teleports
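A minimal Python sketch of PageRank with teleports, applying A = βM + (1 - β)/N without materializing the dense matrix A. The link matrix M is the spider-trap variant from the teleports example, and starting from the all-ones vector follows that slide; the tolerance `eps` is an assumed value:

```python
def pagerank(M, beta=0.8, eps=1e-12):
    """Power-iterate r <- A r where A = beta*M + (1-beta)/N, without building A."""
    n = len(M)
    r = [1.0] * n  # the slide's example starts from all ones (sum preserved at N)
    while True:
        total = sum(r)
        r_next = [beta * sum(M[i][j] * r[j] for j in range(n))
                  + (1 - beta) * total / n
                  for i in range(n)]
        if sum(abs(r_next[i] - r[i]) for i in range(n)) < eps:
            return r_next
        r = r_next

# Spider-trap example: Yahoo -> {y, a}, Amazon -> {y, m}, M'soft -> {m}
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
r = pagerank(M)
print(r)  # converges to approximately [7/11, 5/11, 21/11]
```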

  26. Advantages and Limitations of PageRank
     - Fights spam: PageRank is a global measure and is query-independent
     - Computed offline
     - Criticism: query-independence. It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic

  27. HITS: Capturing Authorities and Hubs (Kleinberg '98)
     - Intuitions
       - Pages that are widely cited are good authorities
       - Pages that cite many other pages are good hubs
     - HITS (Hypertext-Induced Topic Selection): when the user issues a search query, HITS expands the list of relevant pages returned by a search engine and produces two rankings
       1. Authorities: pages containing useful information that are linked to by hubs (e.g., course home pages, home pages of auto manufacturers)
       2. Hubs: pages that link to authorities (e.g., a course bulletin, a list of US auto manufacturers)

  28. Matrix Formulation
     - Transition (adjacency) matrix A: A[i, j] = 1 if page i links to page j, 0 otherwise
     - Hub score vector h: a page's hub score is proportional to the sum of the authority scores of the pages it links to
         h = λ A a   (the constant λ is a scale factor)
     - Authority score vector a: a page's authority score is proportional to the sum of the hub scores of the pages that link to it
         a = μ A^T h   (the constant μ is a scale factor)

  29. Transition Matrix Example (Yahoo y, Amazon a, M'soft m)

                y  a  m
         A =  y 1  1  1
              a 1  0  1
              m 0  1  0

  30. Iterative Algorithm
     - Initialize h and a to all 1's
     - h = A a; scale h so that its max entry is 1.0
     - a = A^T h; scale a so that its max entry is 1.0
     - Continue until h and a converge

  31. Iterative Algorithm Example

               1 1 1           1 1 0
         A  =  1 0 1    A^T =  1 0 1
               0 1 0           1 1 0

       Iterates:
         a(yahoo):   1   1     1      1     ...  1
         a(amazon):  1   4/5   0.75   0.73  ...  0.732
         a(m'soft):  1   1     1      1     ...  1
         h(yahoo):   1   1     1      1     ...  1.000
         h(amazon):  1   2/3   0.71   0.73  ...  0.732
         h(m'soft):  1   1/3   0.29   0.27  ...  0.268
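The hub/authority iteration above can be sketched in a few lines of Python. The adjacency matrix matches the slide's example, and the iterates converge to the slide's values (0.732 is √3 - 1); the fixed iteration count `iters=100` is an assumed cutoff rather than the convergence test on the slide:

```python
def hits(A, iters=100):
    """Alternate h = A a and a = A^T h, scaling each so its max entry is 1.0."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]  # h = A a
        top = max(h)
        h = [x / top for x in h]
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]  # a = A^T h
        top = max(a)
        a = [x / top for x in a]
    return h, a

# Adjacency matrix from the slide: A[i][j] = 1 if page i links to page j
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
h, a = hits(A)
print(h)  # approximately [1, 0.732, 0.268]
print(a)  # approximately [1, 0.732, 1]
```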

  32. Existence and Uniqueness of the Solution

         h = λ A a
         a = μ A^T h
       ⇒ h = λμ A A^T h
         a = λμ A^T A a

     - Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:
       - h* is the principal eigenvector of the matrix A A^T
       - a* is the principal eigenvector of the matrix A^T A
