Data Mining: Concepts and Techniques. Web Mining. Li Xiong. PowerPoint PPT presentation.


SLIDE 1

4/9/2008

Data Mining:

Concepts and Techniques

Web Mining

Li Xiong

Slide credits: Jiawei Han and Micheline Kamber; Anand Rajaraman and Jeffrey D. Ullman; Olfa Nasraoui; Bing Liu

SLIDE 2

Web Mining

Web mining vs. data mining:

• Structure (or lack of it): linkage structure, and lack of structure in textual information
• Scale: data generated per day is comparable to the largest conventional data warehouses
• Speed: often need to react to evolving usage patterns in real time (e.g., merchandising)

SLIDE 3

Web Mining

• Structure Mining: extracting info from the topology of the Web (links among pages)
• Content Mining: extracting info from page content (text, images, audio or video, etc.); natural language processing and information retrieval
• Usage Mining: extracting info from users' usage data on the web (how users visit pages or make transactions)


SLIDE 4

Web Mining

SLIDE 5

Web Mining

• Web structure mining: Web graph structure and link analysis
• Web text mining: text representation and IR models
• Web usage mining: collaborative filtering

SLIDE 6

Structure of Web Graph

• Web as a directed graph: pages = nodes, hyperlinks = edges
• Problem: understand the macroscopic structure and evolution of the web graph
• Practical implications: crawling, browsing, computation of link analysis algorithms

SLIDE 7

Power-law degree distribution

Source: Broder et al., 2000

SLIDE 8

Bow-tie Structure (Broder et al. 00)

SLIDE 9

The Daisy Structure (Donato et al. 05)


SLIDE 10

Link Analysis

• Problem: exploit the link structure of a graph to order or prioritize the set of objects within the graph
• An application of social network analysis at the actor level: centrality and prestige
• Algorithms: PageRank, HITS

SLIDE 11

PageRank (Brin & Page’98)

Intuition:

• Web pages are not equally “important”: www.joe-schmoe.com vs. www.stanford.edu
• Links as citations: a page cited often is more important (www.stanford.edu has 23,400 inlinks; www.joe-schmoe.com has 1 inlink)
• Recursive model: links from heavily linked pages are weighted more
• PageRank is essentially eigenvector prestige in a social network

SLIDE 12

Simple Recursive Flow Model

• Each link’s vote is proportional to the importance of its source page
• If page P with importance x has n outlinks, each link gets x/n votes
• Page P’s own importance is the sum of the votes on its inlinks

Example (Yahoo, Amazon, M’soft; each page splits its vote over its outlinks):

  y = y/2 + a/2
  a = y/2 + m
  m = a/2

Solving with the constraint y + a + m = 1 gives y = 2/5, a = 2/5, m = 1/5.
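The solution above can be checked numerically; a minimal sketch (variable names are mine):

```python
# Check that y = 2/5, a = 2/5, m = 1/5 solves the flow equations
# y = y/2 + a/2, a = y/2 + m, m = a/2, with y + a + m = 1.
y, a, m = 2 / 5, 2 / 5, 1 / 5

assert abs(y - (y / 2 + a / 2)) < 1e-12
assert abs(a - (y / 2 + m)) < 1e-12
assert abs(m - a / 2) < 1e-12
assert abs((y + a + m) - 1) < 1e-12
```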

SLIDE 13

Matrix formulation

• Web link matrix M: one row and one column per web page
• Rank vector r: one entry per web page
• Flow equation: r = Mr
• r is an eigenvector of M

  M_ij = 1/|O(j)| if page j links to page i, and M_ij = 0 otherwise,
  where |O(j)| is the number of outlinks of page j

SLIDE 14

Matrix formulation Example

Yahoo, Amazon, M’soft (rows and columns in the order y, a, m):

      y    a    m
  y [ 1/2  1/2  0 ]
  a [ 1/2  0    1 ]
  m [ 0    1/2  0 ]

Flow equations: y = y/2 + a/2, a = y/2 + m, m = a/2

r = Mr:

  [y]   [1/2  1/2  0] [y]
  [a] = [1/2  0    1] [a]
  [m]   [0    1/2  0] [m]

SLIDE 15

Power Iteration method

Solving the equation r = Mr:

• Suppose there are N web pages
• Initialize: r0 = [1/N, ..., 1/N]^T
• Iterate: r_{k+1} = M r_k
• Stop when |r_{k+1} - r_k|_1 < ε

|x|_1 = ∑_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) can be used.
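The iteration above can be sketched in plain Python (the function name and row-major matrix layout are mine, not from the slides):

```python
def power_iterate(M, eps=1e-8):
    """Iterate r_{k+1} = M r_k from the uniform vector until the
    L1 change falls below eps. M is column-stochastic, stored row-major."""
    n = len(M)
    r = [1.0 / n] * n
    while True:
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next

# Yahoo/Amazon/M'soft example: column j holds page j's outlink votes
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
rank = power_iterate(M)  # approaches [2/5, 2/5, 1/5]
```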

SLIDE 16

Power Iteration Example

      y    a    m
  y [ 1/2  1/2  0 ]
  a [ 1/2  0    1 ]
  m [ 0    1/2  0 ]

  [y]   [1/3]   [1/3]   [5/12]   [3/8  ]         [2/5]
  [a] = [1/3] → [1/2] → [1/3 ] → [11/24] → ... → [2/5]
  [m]   [1/3]   [1/6]   [1/4 ]   [1/6  ]         [1/5]

SLIDE 17

Random Walk Interpretation

• Imagine a random web surfer:
  • At any time t, the surfer is on some page P
  • At time t+1, the surfer follows an outlink from P uniformly at random
  • The surfer ends up on some page Q linked from P
  • The process repeats indefinitely
• p(t) is the probability distribution whose ith component is the probability that the surfer is at page i at time t

SLIDE 18

The stationary distribution

• Where is the surfer at time t+1? p(t+1) = M p(t)
• Suppose the random walk reaches a state such that p(t+1) = M p(t) = p(t)
• Then p(t) is a stationary distribution for the random walk
• Our rank vector r satisfies r = Mr, so it is such a stationary distribution

SLIDE 19

Existence and Uniqueness of the Solution

Theory of random walks (a.k.a. Markov processes): a finite Markov chain defined by a stochastic matrix has a unique stationary probability distribution if the matrix is irreducible and aperiodic.

(Source slides: Mining and Searching Graphs in Graph Databases, April 9, 2008)

SLIDE 20

(Source slides: CS583, Bing Liu, UIC)

M is not a stochastic matrix

• M is the transition matrix of the Web graph:
  M_ij = 1/|O(j)| if page j links to page i, and M_ij = 0 otherwise
• A (column-)stochastic matrix must satisfy ∑_{i=1..n} M_ij = 1 for every column j
• M does not satisfy this: many web pages have no out-links
• Such pages are called dangling pages

SLIDE 21

M is not irreducible

• Irreducible means that the Web graph G is strongly connected
• Definition: a directed graph G = (V, E) is strongly connected if and only if, for each pair of nodes u, v ∈ V, there is a path from u to v
• A general Web graph is not irreducible because for some pairs of nodes u and v there is no path from u to v

SLIDE 22

M is not aperiodic

• A state i in a Markov chain is periodic if there exists a directed cycle that the chain has to traverse
• Definition: a state i is periodic with period k > 1 if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k
• If a state is not periodic (i.e., k = 1), it is aperiodic
• A Markov chain is aperiodic if all its states are aperiodic

SLIDE 23

Solution: Random teleports

• Add a link from each page to every page
• At each time step, the random surfer:
  • with probability β, follows an outlink chosen at random
  • with probability 1-β, jumps to some page uniformly at random
• Common values for β are in the range 0.8 to 0.9

SLIDE 24

Random teleports Example (β = 0.8)

Yahoo, Amazon, M’soft (order y, a, m). Note that here M’soft links only to itself:

        [1/2  1/2  0]         [1/3  1/3  1/3]   [7/15  7/15   1/15]
A = 0.8 [1/2  0    0] + 0.2   [1/3  1/3  1/3] = [7/15  1/15   1/15]
        [0    1/2  1]         [1/3  1/3  1/3]   [1/15  7/15  13/15]

Iterating from [1, 1, 1] (entries keep summing to 3; normalize at the end):

  [y]   [1]   [1.00]   [0.84]   [0.776]         [ 7/11]
  [a] = [1] → [0.60] → [0.60] → [0.536] → ... → [ 5/11]
  [m]   [1]   [1.40]   [1.56]   [1.688]         [21/11]

SLIDE 25

Matrix formulation

• Matrix A: A_ij = βM_ij + (1-β)/N
  where M_ij = 1/|O(j)| when j → i and M_ij = 0 otherwise
• Verify that A is a stochastic matrix
• The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar
• Equivalently, r is the stationary distribution of the random walk with teleports
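The teleport construction A_ij = βM_ij + (1-β)/N can be sketched directly (function name is mine; the example matrix is the three-page graph in which M'soft links only to itself):

```python
def pagerank(M, beta=0.8, eps=1e-10):
    """Power iteration on A = beta*M + (1-beta)/N for column-stochastic M."""
    n = len(M)
    A = [[beta * M[i][j] + (1 - beta) / n for j in range(n)]
         for i in range(n)]
    r = [1.0 / n] * n
    while True:
        r_next = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next

# y, a, m with M'soft linking only to itself (a spider trap)
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
r = pagerank(M)  # approaches [7/33, 5/33, 21/33]
```

Without the teleport term the iteration would drain all weight into the trap; with β = 0.8 the trap still dominates but no longer absorbs everything.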

SLIDE 26

Advantages and Limitations of PageRank

• Advantage: fighting spam. PageRank is a global measure and is query-independent
• Advantage: computed offline
• Criticism: query-independence. PageRank cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic

SLIDE 27

HITS: Capturing Authorities & Hubs (Kleinberg’98)

• Intuitions:
  • Pages that are widely cited are good authorities
  • Pages that cite many other pages are good hubs
• HITS (Hypertext-Induced Topic Selection): when the user issues a search query, HITS expands the list of relevant pages returned by a search engine and produces two rankings:
  1. Authorities: pages containing useful information, linked to by hubs (e.g., course home pages, home pages of auto manufacturers)
  2. Hubs: pages that link to authorities (e.g., a course bulletin, a list of US auto manufacturers)

SLIDE 28

Matrix Formulation

• Transition (adjacency) matrix A: A[i, j] = 1 if page i links to page j, 0 if not
• The hub score vector h: a page’s hub score is proportional to the sum of the authority scores of the pages it links to:
  h = λAa (the constant λ is a scale factor)
• The authority score vector a: a page’s authority score is proportional to the sum of the hub scores of the pages it is linked from:
  a = μA^T h (the constant μ is a scale factor)

SLIDE 29

Transition Matrix Example

Yahoo, Amazon, M’soft (order y, a, m; row i links to column j):

      [1  1  1]
  A = [1  0  1]
      [0  1  0]

SLIDE 30

Iterative algorithm

• Initialize h, a to all 1’s
• h = Aa; scale h so that its max entry is 1.0
• a = A^T h; scale a so that its max entry is 1.0
• Continue until h, a converge
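The steps above can be sketched in Python (function name is mine; the adjacency matrix is the three-page Yahoo/Amazon/M'soft example):

```python
def hits(A, iters=100):
    """Alternate h = A a and a = A^T h, rescaling each so its max entry is 1.0."""
    n = len(A)
    h, a = [1.0] * n, [1.0] * n
    for _ in range(iters):
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        top = max(h)
        h = [x / top for x in h]
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        top = max(a)
        a = [x / top for x in a]
    return h, a

# Yahoo/Amazon/M'soft adjacency matrix (row i links to column j)
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]
h, a = hits(A)  # h -> [1, 0.732, 0.268], a -> [1, 0.732, 1]
```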

SLIDE 31

Iterative Algorithm Example

      [1  1  1]         [1  1  0]
  A = [1  0  1]   A^T = [1  0  1]
      [0  1  0]         [1  1  0]

  h(yahoo)     1     1       1          1.000
  h(amazon) =  1  →  2/3  →  0.73 → ... → 0.732
  h(m’soft)    1     1/3     0.27        0.268

  a(yahoo)     1     1       1          1
  a(amazon) =  1  →  4/5  →  0.75 → ... → 0.732
  a(m’soft)    1     1       1          1

SLIDE 32

Existence and Uniqueness of the Solution

From h = λAa and a = μA^T h:

  h = λμ AA^T h
  a = λμ A^T A a

Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:
  • h* is the principal eigenvector of the matrix AA^T
  • a* is the principal eigenvector of the matrix A^T A
SLIDE 33

Strengths and weaknesses of HITS

• Strength: ranks pages according to the query topic, which may provide more relevant authority and hub pages
• Weaknesses:
  • Easily spammed
  • Topic drift
  • Inefficiency at query time

SLIDE 34

PageRank and HITS

• Model:
  • PageRank depends on the links into S
  • HITS also depends on the value of the other links out of S
• Characteristics: spam resistance, query independence
• Destinies post-1998:
  • PageRank: trademark of Google
  • HITS: not commonly used by search engines (Ask.com?)

SLIDE 35

Web Mining

• Web structure mining: Web graph structure, link analysis
• Web text mining
• Web usage mining: collaborative filtering

SLIDE 36

Text Mining

• Text mining refers to data mining using text documents as data
• Tasks: text summarization, text classification, text clustering, ...
• Intersects with Information Retrieval and Natural Language Processing

SLIDE 37

Levels of text representations

• Character (character n-grams and sequences)
• Words (stop-words, stemming, lemmatization)
• Phrases (word n-grams, proximity features)
• Part-of-speech tags
• Taxonomies / thesauri
• Vector-space model
• Language models
• Full parsing
• Cross-modality
• Collaborative tagging / Web 2.0
• Templates / frames
• Ontologies / first-order theories

SLIDE 38

N-Gram

• N-gram: a sub-sequence of n items from a given sequence
• The items can be characters, words, or base pairs, according to the application
• Unigram, bigram, trigram
• Example: Google n-gram corpus, 4-grams with counts:
  serve as the incoming (92)
  serve as the incubator (99)
  serve as the independent (794)
  serve as the index (223)
  serve as the indication (72)
  serve as the indicator (120)
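N-gram extraction is a one-liner over any sequence; a minimal sketch (function name is mine):

```python
def ngrams(items, n):
    """All contiguous n-item subsequences of a sequence.
    Items can be characters, words, or base pairs."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

words = "serve as the index".split()
bigrams = ngrams(words, 2)  # [('serve', 'as'), ('as', 'the'), ('the', 'index')]
```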

SLIDE 39

Bag-of-Words Document Representation

SLIDE 40

Vector space model

• Each document is represented as a vector
• Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinct words/terms in the collection. V is called the vocabulary
• A weight wij > 0 is associated with each term ti of a document dj. For a term that does not appear in document dj, wij = 0

  dj = (w1j, w2j, ..., w|V|j)

SLIDE 41

TFIDF Weighting

TF (term frequency), IDF (inverse document frequency):

• tf(w): term frequency (number of occurrences of the word in a document)
• df(w): document frequency (number of documents containing the word)
• N: total number of documents
• tfidf(w): relative importance of the word in the document

  tfidf(w) = tf(w) · log(N / df(w))
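The weighting tfidf(w) = tf(w) · log(N/df(w)) as a small sketch, with documents represented as token lists (names and example documents are mine):

```python
import math

def tfidf(word, doc, docs):
    """tf(w) * log(N / df(w)) for `word` in document `doc` (a token list),
    given the whole collection `docs`."""
    tf = doc.count(word)                    # occurrences in this document
    df = sum(1 for d in docs if word in d)  # documents containing the word
    return tf * math.log(len(docs) / df) if df else 0.0

docs = [["web", "mining", "web"], ["data", "mining"], ["web", "graph"]]
w = tfidf("web", docs[0], docs)  # 2 * log(3/2)
```

A word that occurs in every document gets weight 0 (log 1), which is why stop-words score low under this scheme.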

SLIDE 42

Similarity between document vectors

• Each document is represented as a vector of weights
• Cosine similarity (normalized dot product) is the most widely used similarity measure between two document vectors:
  • it calculates the cosine of the angle between document vectors
  • it is efficient to calculate (sum of products over intersecting words)
  • the similarity value is between 0 (different) and 1 (the same)

  Sim(D1, D2) = ∑_i x1i·x2i / ( √(∑_i x1i²) · √(∑_i x2i²) )
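The cosine formula above, as a sketch over dense weight vectors (function name is mine):

```python
import math

def cosine_sim(x1, x2):
    """Cosine of the angle between two document weight vectors."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)

cosine_sim([1.0, 2.0, 0.0], [1.0, 2.0, 0.0])  # identical direction -> 1.0
cosine_sim([1.0, 0.0], [0.0, 1.0])            # no shared terms -> 0.0
```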

SLIDE 43

Web Mining

• Web structure mining: Web graph structure, link analysis
• Web text mining
• Web usage mining: collaborative filtering

SLIDE 44

Web Usage Data

• Web logs (low level): track queries and the individual pages/items requested by a Web browser
• Application logs (higher level): when customers check in and check out, items placed in or removed from the shopping cart, etc.

SLIDE 45

Web Usage Mining

• Association rule mining: discover associations between pages and products
• Sequential pattern discovery: help discover visit patterns and make predictions about visit patterns
• Clustering: group similar sessions into clusters, which may correspond to user profiles / modes of usage of the website
• Collaborative filtering: filter/recommend pages and products based on similar users

SLIDE 46

(Source slides: Data Mining: Principles and Algorithms, 4/9/2008)

Collaborative Filtering: Motivation

• User perspective: lots of web pages, online products, books, movies, etc. “Reduce my choices... please...”
• Manager perspective: “If I have 3 million customers on the web, I should have 3 million stores on the web.” CEO of Amazon.com [SCH01]

SLIDE 47

Basic Approaches

• Collaborative filtering (CF): based on the active user’s history and on other users’ collective behavior
• Content-based filtering: based on keywords and other features

SLIDE 48

Collaborative Filtering: A Framework

• Users: U = {u1, u2, ..., ui, ..., um}; items: I = {i1, i2, ..., ij, ..., in}
• A sparse ratings matrix holds the known ratings (e.g., 3, 1.5, 2, ...); most entries rij are unknown
• Unknown function f: U × I → R
• The task:
  • Q1: find the unknown ratings rij = ?
  • Q2: which items should we recommend to this user?

SLIDE 49

Collaborative Filtering: Main Methods

• User-user methods:
  • Memory-based: k-NN
  • Model-based: clustering
• Item-item methods:
  • Correlation analysis
  • Linear regression
  • Belief networks
  • Association rule mining

SLIDE 50

User-User method: Intuition

Find neighbors similar to the target customer, then predict from their ratings:

• Q1: How to measure similarity?
• Q2: How to select neighbors?
• Q3: How to combine?

SLIDE 51

How to Measure Similarity?

• Pearson correlation coefficient, computed over the items rated by both users a and i:

  w_p(a, i) = ∑_j (r_aj - r̄_a)(r_ij - r̄_i) / ( √(∑_j (r_aj - r̄_a)²) · √(∑_j (r_ij - r̄_i)²) )

  where the sums run over the commonly rated items j, and r̄_a, r̄_i are the users’ average ratings

• Cosine measure: users are vectors in product-dimension space

  w_c(a, i) = r_a · r_i / (|r_a| · |r_i|)
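A sketch of the Pearson weight over commonly rated items (the dict-based rating representation and function name are mine; here the means are taken over the shared items, one common variant):

```python
def pearson(ra, ri):
    """Pearson correlation between two users' ratings (dicts item -> rating),
    computed over the items both users have rated."""
    common = set(ra) & set(ri)
    if len(common) < 2:
        return 0.0
    mean_a = sum(ra[j] for j in common) / len(common)
    mean_i = sum(ri[j] for j in common) / len(common)
    num = sum((ra[j] - mean_a) * (ri[j] - mean_i) for j in common)
    den = (sum((ra[j] - mean_a) ** 2 for j in common)
           * sum((ri[j] - mean_i) ** 2 for j in common)) ** 0.5
    return num / den if den else 0.0

w = pearson({"i1": 1, "i2": 2, "i3": 3}, {"i1": 2, "i2": 4, "i3": 6})  # 1.0
```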

SLIDE 52

Nearest Neighbor Approaches [SAR00a]

• Offline phase: do nothing... just store transactions
• Online phase: identify users highly similar to the active one
  • the best K ones, or all with a similarity measure greater than a threshold
• Prediction: user a’s estimated rating for item j is a’s neutral (average) rating plus a weighted average of the neighbors’ deviations:

  r_aj = r̄_a + ∑_i w(a, i)(r_ij - r̄_i) / ∑_i w(a, i)

SLIDE 53

Clustering [BRE98]

• Offline phase: build clusters (k-means, k-medoid, etc.)
• Online phase:
  • identify the cluster nearest to the active user
  • prediction: use the center of the cluster, or a weighted average over cluster members (weights depend on the active user)

SLIDE 54

Clustering vs. k-NN Approaches

• k-NN using the Pearson measure is slower but more accurate
• Clustering is more scalable, but may give the active user bad recommendations

SLIDE 55

References: Link Analysis

• S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine (PageRank). Computer Networks and ISDN Systems, 1998.
• J. Kleinberg. Authoritative sources in a hyperlinked environment (HITS). ACM-SIAM Symp. on Discrete Algorithms, 1998.
• S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Mining the link structure of the World Wide Web. IEEE Computer, 1999.
• D. Cai, X. He, J. Wen, and W. Ma. Block-level link analysis. SIGIR 2004.

SLIDE 56

References

1. C. D. Manning and H. Schutze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.
2. S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
3. S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002.
4. G. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, and R. Tengi. Five papers on WordNet. Princeton University, August 1993.
5. C. Zhai. Introduction to NLP, lecture notes for CS 397cxz, UIUC, Fall 2003.
6. M. Hearst. Untangling text data mining. ACL ’99, invited paper. http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
7. R. Sproat. Introduction to Computational Linguistics, LING 306, UIUC, Fall 2003.
8. A road map to text mining and Web mining, University of Texas resource page. http://www.cs.utexas.edu/users/pebronia/text-mining/
9. Computational Linguistics and Text Mining Group, IBM Research. http://www.research.ibm.com/dssgrp/

SLIDE 57

References

• Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34(1), March 2002.
• Soumen Chakrabarti. Data mining for hypertext: a tutorial survey. ACM SIGKDD Explorations, 2000.
• Cleverdon. Optimizing convenient online access to bibliographic databases. Information Services and Use, 4(1), pp. 37-47, 1984.
• Yiming Yang. An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1:67-88, 1999.
• Yiming Yang and Xin Liu. A re-examination of text categorization methods. SIGIR ’99, pp. 42-49, 1999.

SLIDE 58

References: Collaborative Filtering

• Charu C. Aggarwal, Joel L. Wolf, Kun-Lung Wu, and Philip S. Yu. Horting hatches an egg: a new graph-theoretic approach to collaborative filtering. KDD 1999: 201-212.
• J. Breese, D. Heckerman, and C. Kadie. Empirical analysis of predictive algorithms for collaborative filtering. Proc. 14th Conf. on Uncertainty in Artificial Intelligence, Madison, July 1998.
• Yoon Ho Cho and Jae Kyeong Kim. Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 2003.
• William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. Advances in Neural Information Processing Systems 10, Denver, CO, 1997.
• Toshihiro Kamishima. Nantonac collaborative filtering: recommendation based on order responses. KDD 2003: 583-588.
• C.-H. Lee, Y.-H. Kim, and P.-K. Rhee. Web personalization expert with combining collaborative filtering and association rule mining technique. Expert Systems with Applications, 21(3), October 2001, pp. 131-137.

SLIDE 59

References: Collaborative Filtering

• W. Lin. Online presentation, 2001. http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/LinAlvarezRuiz_WebKDD2000.ppt
• Weiyang Lin, Sergio A. Alvarez, and Carolina Ruiz. Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6:83-105, 2002.
• G. Linden, B. Smith, and J. York. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, 7(1), pp. 76-80, Jan. 2003.
• Badrul M. Sarwar, George Karypis, Joseph A. Konstan, and John Riedl. Analysis of recommendation algorithms for e-commerce. ACM Conf. on Electronic Commerce 2000: 158-167.
• B. Sarwar, G. Karypis, J. Konstan, and J. Riedl. Application of dimensionality reduction in recommender systems: a case study. ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.
• B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. WWW ’01.

SLIDE 60

References: Collaborative Filtering

• B. Sarwar. Online presentation, 2000. http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/badrul.ppt
• J. Ben Schafer, Joseph A. Konstan, and John Riedl. E-commerce recommendation applications. Data Mining and Knowledge Discovery, 5(1/2): 115-153, 2001.
• L. H. Ungar and D. P. Foster. Clustering methods for collaborative filtering. AAAI Workshop on Recommendation Systems, 1998.
• Yi-Fan Wang, Yu-Liang Chuang, Mei-Hua Hsu, and Huan-Chao Keh. A personalized recommender system for the cosmetic business. Expert Systems with Applications, 26(3), April 2004, pp. 427-434.
• S. Vucetic and Z. Obradovic. A regression-based approach for scaling-up personalized recommender systems in e-commerce. ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.
• Kai Yu, Xiaowei Xu, Martin Ester, and Hans-Peter Kriegel. Selecting relevant instances for efficient and accurate collaborative filtering. Proc. 10th CIKM, pp. 239-246, ACM Press, 2001.
• Cheng Zhai. Spring 2003 online course notes. http://sifaka.cs.uiuc.edu/course/2003-497CXZ/loc/cf.ppt