4/9/2008 1
Data Mining:
Concepts and Techniques
Web Mining
Li Xiong
Slides credits: Jiawei Han and Micheline Kamber; Anand Rajaraman, Jeffrey D. Ullman Olfa Nasraoui Bing Liu
Web mining vs. data mining
Structure (or lack of it)
Linkage structure and lack of structure in textual
information
Scale
Data generated per day is comparable to largest
conventional data warehouses
Speed
Often need to react to evolving usage patterns in
real-time (e.g., merchandising)
Structure Mining
Extracting info from topology of the Web (links among
pages)
Content Mining
Extracting info from page content (text, images, audio)
Natural language processing and information retrieval
Usage Mining
Extracting info from user’s usage data on the web
(how user visits the pages or makes transactions)
Web structure mining
Web graph structure and link analysis
Web text mining
Text representation and IR models
Web usage mining
Collaborative filtering
Web as a directed graph
Pages = nodes, hyperlinks = edges
Problem: Understand the macroscopic structure
and evolution of the web graph
Practical implications
Crawling, browsing, computation of link
analysis algorithms
Source: Broder et al., 2000
Problem: exploit the link structure of a graph to order or
prioritize the set of objects within the graph
Application of social network analysis at actor level:
centrality and prestige
Algorithms
PageRank
HITS
Intuition
Web pages are not equally “important”
www.joe-schmoe.com vs. www.stanford.edu
Links as citations: a page cited often is more important
www.stanford.edu has 23,400 inlinks; www.joe-schmoe.com has 1 inlink
Recursive model: links from heavily linked pages
weighted more
PageRank is essentially the eigenvector prestige measure in social network analysis
Each link’s vote is proportional to the importance of its
source page
If page P with importance x has n outlinks, each link gets
x/n votes
Page P’s own importance is the sum of the votes on its
inlinks
Example: three pages, Yahoo (y), Amazon (a), M'soft (m)
(Yahoo links to itself and Amazon; Amazon links to Yahoo and M'soft; M'soft links to Amazon)
y = y/2 + a/2
a = y/2 + m
m = a/2
Solving the equations with the constraint y + a + m = 1: y = 2/5, a = 2/5, m = 1/5
r_i = Σ_j M_ij r_j, where
M_ij = 1/|O_j| if (j, i) ∈ E, and M_ij = 0 otherwise
(O_j is the set of outlinks of page j)
The same example as a matrix (rows and columns ordered y, a, m):

         y    a    m
M =  y [ 1/2  1/2   0 ]
     a [ 1/2   0    1 ]
     m [  0   1/2   0 ]

y = y/2 + a/2
a = y/2 + m
m = a/2

In matrix form: [y a m]^T = M [y a m]^T
Solving the equation r = Mr by power iteration:
Suppose there are N web pages
Initialize: r_0 = [1/N, …, 1/N]^T
Iterate: r_{k+1} = M r_k
Stop when |r_{k+1} − r_k|_1 < ε
|x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; any other vector norm (e.g., Euclidean) can be used
Example iteration (entries ordered y, a, m):
[1/3, 1/3, 1/3] → [1/3, 1/2, 1/6] → [5/12, 1/3, 1/4] → [3/8, 11/24, 1/6] → … → [2/5, 2/5, 1/5]
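The power iteration for r = Mr can be sketched in a few lines of Python (the function name and the ε threshold are our own; the matrix is the three-page example from the slides):

```python
def power_iterate(M, eps=1e-9):
    """Iterate r <- M r from the uniform vector until the L1 change < eps.
    M is a column-stochastic matrix given as a list of rows."""
    n = len(M)
    r = [1.0 / n] * n
    while True:
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next

# Transition matrix for the 3-page example (rows/columns ordered y, a, m):
# column j distributes page j's importance equally over its outlinks.
M = [
    [0.5, 0.5, 0.0],  # y gets half of y's vote and half of a's
    [0.5, 0.0, 1.0],  # a gets half of y's vote and all of m's
    [0.0, 0.5, 0.0],  # m gets half of a's vote
]
r = power_iterate(M)  # converges to about [0.4, 0.4, 0.2] = (2/5, 2/5, 1/5)
```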
At any time t, the surfer is on some page P
At time t+1, the surfer follows an outlink from P uniformly at random
Ends up on some page Q linked from P
Process repeats indefinitely
Let p(t) be the vector whose i-th component is the probability that the surfer is at page i at time t
Where is the surfer at time t+1?
p(t+1) = Mp(t)
Suppose the random walk reaches a state such
that p(t+ 1) = Mp(t) = p(t)
Then p(t) is a stationary distribution for the
random walk
Our rank vector r satisfies r = Mr
Theory of random walks (aka Markov processes):
A finite Markov chain defined by a stochastic transition matrix has a unique stationary distribution if the chain is irreducible and aperiodic
M is the transition matrix of the Web graph
It does not always satisfy
Σ_{i=1}^{n} M_ij = 1
where M_ij = 1/|O_j| if (j, i) ∈ E, and M_ij = 0 otherwise
Many web pages have no out-links; such pages are called dangling pages
Irreducible means that the Web graph G is strongly connected
A directed graph G = (V, E) is strongly connected if and only if, for each pair of nodes u, v ∈ V, there is a path from u to v
A general Web graph is not irreducible because for some pair of nodes u and v there is no path from u to v
A state i in a Markov chain being periodic means that there exists a directed cycle that the chain has to traverse:
state i has period k if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k
If a state is not periodic (i.e., k = 1), it is aperiodic
A Markov chain is aperiodic if all its states are aperiodic
Add a link from each page to every page
At each time step, the random surfer has a small probability of teleporting along one of those links:
With probability β, follow an outlink at random
With probability 1−β, jump to some page uniformly at random
Common values for β are in the range 0.8 to 0.9
Example (M'soft now links only to itself, a spider trap), with β = 0.8:

A = 0.8 · [ 1/2 1/2  0 ]  + 0.2 · [ 1/3 1/3 1/3 ]
          [ 1/2  0   0 ]          [ 1/3 1/3 1/3 ]
          [  0  1/2  1 ]          [ 1/3 1/3 1/3 ]

  =  y [ 7/15 7/15  1/15 ]
     a [ 7/15 1/15  1/15 ]
     m [ 1/15 7/15 13/15 ]

Iterating [y, a, m]^T (scaled to sum to 3):
[1, 1, 1] → [1.00, 0.60, 1.40] → [0.84, 0.60, 1.56] → [0.776, 0.536, 1.688] → … → [7/11, 5/11, 21/11]
The matrix A:
A_ij = βM_ij + (1−β)/N
where M_ij = 1/|O_j| when j → i and M_ij = 0 otherwise
Verify that A is a stochastic matrix (each column sums to 1)
The page rank vector r is the principal
eigenvector of this matrix
satisfying r = Ar
Equivalently, r is the stationary distribution of the
random walk with teleports
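PageRank with teleports can be sketched by forming A = βM + (1−β)/N explicitly and reusing power iteration (the function name and ε are our own; the matrix is the spider-trap example):

```python
def pagerank(M, beta=0.8, eps=1e-9):
    """Iterate r <- A r with A = beta*M + (1-beta)/N (teleport smoothing).
    M is a column-stochastic matrix given as a list of rows."""
    n = len(M)
    A = [[beta * M[i][j] + (1.0 - beta) / n for j in range(n)] for i in range(n)]
    r = [1.0 / n] * n
    while True:
        r_next = [sum(A[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next

# Spider-trap example from the slides: M'soft (m) links only to itself.
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
r = pagerank(M)  # about (7/33, 5/33, 21/33); scaled to sum 3: (7/11, 5/11, 21/11)
```

Without the teleport term (β = 1) the iteration would drain all importance into the trap page m; the (1−β)/N term is what guarantees irreducibility and aperiodicity.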
Strengths
Fighting spam: PageRank is a global measure and is query-independent
Computed offline
Criticism: query-independence
It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic
Pages that are widely cited are good authorities Pages that cite many other pages are good hubs
When the user issues a search query, HITS expands the list of
relevant pages returned by a search engine and produces two rankings
Adjacency matrix A
A[i, j] = 1 if page i links to page j, 0 if not
The hub score vector h: a page's hub score is proportional to the sum of the authority scores of the pages it links to
h = λAa, where the constant λ is a scale factor
The authority score vector a: a page's authority score is proportional to the sum of the hub scores of the pages that link to it
a = μA^T h, where the constant μ is a scale factor
Example (rows = source page, rows and columns ordered y, a, m):

         y a m
A =  y [ 1 1 1 ]
     a [ 1 0 1 ]
     m [ 0 1 0 ]
Initialize h, a to all 1's
Repeat:
h = Aa; scale h so that its max entry is 1.0
a = A^T h; scale a so that its max entry is 1.0
Continue until h, a converge
A = [ 1 1 1 ]    A^T = [ 1 1 0 ]
    [ 1 0 1 ]          [ 1 0 1 ]
    [ 0 1 0 ]          [ 1 1 0 ]

(a(yahoo), a(amazon), a(m'soft)): (1, 1, 1) → (1, 4/5, 1) → (1, 0.75, 1) → … → (1, 0.732, 1)
(h(yahoo), h(amazon), h(m'soft)): (1, 1, 1) → (1, 2/3, 1/3) → (1, 0.73, 0.27) → … → (1.000, 0.732, 0.268)
h = λAa and a = μA^T h, so
h = λμ(AA^T)h and a = λμ(A^T A)a
Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:
h* is the principal eigenvector of AA^T
a* is the principal eigenvector of A^T A
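The dual iteration can be sketched directly from the update rules (the function name and iteration count are our own; the adjacency matrix is the example from the slides):

```python
def hits(A, iters=100):
    """HITS: h = A a and a = A^T h, each rescaled so its max entry is 1.0."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iters):
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        m = max(h)
        h = [x / m for x in h]                       # scale max hub score to 1
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]  # A^T h
        m = max(a)
        a = [x / m for x in a]                       # scale max authority to 1
    return h, a

# Adjacency matrix from the example (rows = source page, order y, a, m).
A = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
h, a = hits(A)
# h -> about (1, 0.732, 0.268); a -> about (1, 0.732, 1)
```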
Strength: its ability to rank pages according to the
query topic, which may be able to provide more relevant authority and hub pages.
Weaknesses:
Easily spammed
Topic drift
Inefficiency at query time
Model
PageRank: depends on the links into S
HITS: depends on the value of the other links out of S
Characteristics
Spam resistance
Query independence
Destinies post-1998
PageRank: trademark of Google
HITS: not commonly used by search engines (Ask.com?)
Web structure mining
Web graph structure Link analysis
Web text mining Web usage mining
Collaborative filtering
Text mining refers to data mining using text
documents as data.
Tasks
Text summarization
Text classification
Text clustering
…
Intersection with Information Retrieval and
Natural Language Processing
Character (character n-grams and sequences)
Words (stop-words, stemming, lemmatization)
Phrases (word n-grams, proximity features)
Part-of-speech tags
Taxonomies / thesauri
Vector-space model
Language models
Full parsing
Cross-modality
Collaborative tagging / Web 2.0
Templates / frames
Ontologies / first-order theories
N-gram: a sub-sequence of n items from a given
sequence.
The items can be characters, words or base pairs
according to the application.
Unigram, bigram, trigram
Example: Google n-gram corpus
4-grams, with counts:
serve as the incoming (92)
serve as the incubator (99)
serve as the independent (794)
serve as the index (223)
serve as the indication (72)
serve as the indicator (120)
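Extracting n-grams is a one-liner over any sequence; this small helper (our own name) works for both word and character n-grams:

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Word bigrams
words = "serve as the incoming".split()
print(ngrams(words, 2))  # [('serve', 'as'), ('as', 'the'), ('the', 'incoming')]

# Character trigrams work the same way, treating a string as a token sequence
print(ngrams("mining", 3))  # [('m', 'i', 'n'), ('i', 'n', 'i'), ('n', 'i', 'n'), ('i', 'n', 'g')]
```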
Each document is represented as a vector
Given a collection of documents D, let V = {t1, t2, ..., t|V|} be the set of distinctive words/terms in the collection; V is called the vocabulary
A weight wij > 0 is associated with each term ti that appears in document dj; if ti does not appear in dj, wij = 0
TF (Term frequency) IDF (Inverse Document Frequency)
Each document is represented as a vector of weights:
dj = (w1j, w2j, …, w|V|j)
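To make the TF-IDF weighting concrete: the slides name TF and IDF without fixing a formula, so the sketch below assumes the common variant w = tf · log(N / df), where tf is the raw term count in a document and df is the number of documents containing the term:

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute w_ij = tf(t_i, d_j) * log(N / df(t_i)) for a list of tokenized docs.
    Returns one {term: weight} dict per document."""
    N = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    return [
        {t: tf * math.log(N / df[t]) for t, tf in Counter(doc).items()}
        for doc in docs
    ]

docs = [["web", "mining", "web"], ["data", "mining"]]
weights = tfidf(docs)
# "web" occurs twice in doc 0 and in 1 of 2 docs: weight = 2 * log(2)
# "mining" occurs in every document, so its IDF (and weight) is 0
```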
Cosine similarity (dot product) is the most widely used
similarity measure between two document vectors
…calculates the cosine of the angle between document vectors
…efficient to calculate (sum of products of intersecting words)
…similarity value between 0 (different) and 1 (the same)

sim(di, dj) = (di · dj) / (||di|| ||dj||) = Σ_k w_ki w_kj / ( sqrt(Σ_k w_ki²) · sqrt(Σ_k w_kj²) )
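With documents stored as sparse {term: weight} dicts, cosine similarity only needs the intersecting terms for the dot product (function name and toy vectors are our own):

```python
import math

def cosine(d1, d2):
    """Cosine similarity between two term-weight dicts: dot product over
    intersecting terms, divided by the Euclidean norms."""
    dot = sum(w * d2[t] for t, w in d1.items() if t in d2)
    n1 = math.sqrt(sum(w * w for w in d1.values()))
    n2 = math.sqrt(sum(w * w for w in d2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

d1 = {"web": 2.0, "mining": 1.0}
d2 = {"data": 1.0, "mining": 1.0}
print(cosine(d1, d1))  # ~1.0 (identical documents)
print(cosine(d1, d2))  # 1 / (sqrt(5) * sqrt(2)) ~ 0.316
```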
Web structure mining
Web graph structure Link analysis
Web text mining Web usage mining
Collaborative filtering
Web Logs: Low level
Tracks queries, individual pages/items
requested by a Web browser
Application logs: Higher level
When customers check in and check out, items placed in or removed from the shopping cart, etc.
Association rule mining
Discover associations between pages and products
Sequential pattern discovery
Help to discover visit patterns and make predictions
about visit patterns
Clustering
Group similar sessions into clusters which may
correspond to user profiles / modes of usage of the website
Collaborative Filtering
Filter/recommend pages and products based on similar
users
User Perspective
Lots of web pages, online products, books,
movies, etc.
Reduce my choices…please…
Manager Perspective
"If I have 3 million customers on the web, I should have 3 million stores on the web." - CEO of Amazon.com [SCH01]
Collaborative Filtering (CF)
Based on the active user’s history Based on other users’ collective behavior
Content-based Filtering
Based on keywords and other features
Users: U = {u1, u2, …, ui, …, um}
Items: I = {i1, i2, …, ij, …, in}
A partially filled rating matrix: entry rij is user ui's rating of item ij; most entries are unknown
The task:
Q1: Find the unknown ratings rij = ?
Q2: Which items should we recommend to this user?
Unknown function f: U × I → R
User-User Methods
Memory-based: k-NN
Model-based: clustering
Item-Item Method
Correlation analysis
Linear regression
Belief network
Association rule mining
Q1: How to measure similarity? Q2: How to select neighbors? Q3: How to combine?
Pearson correlation coefficient (over the items rated by both users a and i):

w(a, i) = Σ_{j ∈ commonly rated items} (r_aj − r̄_a)(r_ij − r̄_i) / ( sqrt(Σ_j (r_aj − r̄_a)²) · sqrt(Σ_j (r_ij − r̄_i)²) )

Cosine measure: users are vectors in product-dimension space

c(a, i) = Σ_j r_aj r_ij / ( sqrt(Σ_j r_aj²) · sqrt(Σ_j r_ij²) )
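The Pearson measure over commonly rated items can be sketched as follows (function name and toy ratings are our own; ratings are {item: rating} dicts):

```python
import math

def pearson(ra, ri):
    """Pearson correlation between two users, computed only over the
    items rated by both.  ra, ri: dicts mapping item -> rating."""
    common = set(ra) & set(ri)
    if not common:
        return 0.0
    ma = sum(ra[j] for j in common) / len(common)   # mean over common items
    mi = sum(ri[j] for j in common) / len(common)
    num = sum((ra[j] - ma) * (ri[j] - mi) for j in common)
    den = math.sqrt(sum((ra[j] - ma) ** 2 for j in common)) * \
          math.sqrt(sum((ri[j] - mi) ** 2 for j in common))
    return num / den if den else 0.0

a = {"i1": 4, "i2": 2, "i3": 5}
b = {"i1": 5, "i2": 1, "i3": 4}
print(pearson(a, b))  # ~0.84: the users agree on relative preferences
```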
Offline phase:
Do nothing…just store transactions
Online phase:
Identify users highly similar to the active one
The best K ones, or all with a similarity greater than a threshold
Prediction
p_aj = r̄_a + Σ_i w(a, i)(r_ij − r̄_i) / Σ_i |w(a, i)|

r̄_a is user a's neutral (mean) rating; (r_ij − r̄_i) is neighbour i's deviation; the weighted sum is user a's estimated deviation
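The prediction formula p_aj = r̄_a + Σ_i w(a,i)(r_ij − r̄_i) / Σ_i |w(a,i)| can be sketched with precomputed neighbour similarities (the function name, the neighbour triples, and the example weights are our own illustration):

```python
def predict(mean_a, neighbours, item):
    """Predict the active user's rating for `item` as the user's mean rating
    plus the similarity-weighted average of neighbour deviations.
    neighbours: list of (weight, ratings_dict, mean_rating) triples."""
    num = sum(w * (r[item] - m) for w, r, m in neighbours if item in r)
    den = sum(abs(w) for w, r, m in neighbours if item in r)
    return mean_a + num / den if den else mean_a

# Two neighbours who both rated "i3"; the weights are hypothetical similarities.
neighbours = [
    (1.0, {"i1": 4, "i3": 5}, 4.5),   # rates i3 half a point above their mean
    (0.5, {"i2": 1, "i3": 2}, 1.5),   # likewise, with lower similarity weight
]
print(predict(3.0, neighbours, "i3"))  # 3.5: mean 3.0 plus estimated deviation 0.5
```

Note that each neighbour contributes a deviation from its own mean, so users who rate on different scales still combine sensibly.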
Offline phase:
Build clusters: k-means, k-medoids, etc.
Online phase:
Identify the nearest cluster to the active user Prediction:
Use the center of the cluster Weighted average between cluster members
Weights depend on the active user
k-NN using the Pearson measure is slower but more accurate
Clustering is more scalable, but an active user far from the cluster centers may get bad recommendations
References
Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine (PageRank). Computer Networks and ISDN Systems, 1998
Kleinberg, J. Authoritative sources in a hyperlinked environment (HITS). ACM-SIAM Symp. on Discrete Algorithms, 1998
Kumar, R., Raghavan, P., Rajagopalan, S., and Tomkins, A. Mining the link structure of the World Wide Web. IEEE Computer, 1999
…, SIGIR 2004
1. …Processing", MIT Press, 1999
2. …, Prentice Hall, 1995
3. …Structured Data", Morgan Kaufmann, 2002
4. …papers on WordNet. Princeton University, August 1993
5. …
6. http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
7. …, 2003
8. A Road Map to Text Mining and Web Mining, University of Texas resource
9. Computational Linguistics and Text Mining Group, IBM Research, http://www.research.ibm.com/dssgrp/
…Categorization", ACM Computing Surveys, Vol. 34, No. 1, March 2002
…, ACM SIGKDD Explorations, 2000
…databases", Information Survey, Use, 4(1), 37-47, 1984
…categorization", Journal of Information Retrieval, 1:67-88, 1999
…methods", Proc. ACM SIGIR Conf. on Research and Development in Information Retrieval (SIGIR'99), pp. 42-49, 1999
Aggarwal, C., Wolf, J., Wu, K.-L., and Yu, P. Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering. KDD 1999: 201-212
Breese, J., Heckerman, D., and Kadie, C. Empirical Analysis of Predictive Algorithms for Collaborative Filtering. Proc. 14th Conf. on Uncertainty in Artificial Intelligence, Madison, July 1998
…product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 2003
…based on order responses. KDD 2003: 583-588
…collaborative filtering and association rule mining technique. Expert Systems with Applications, 21(3), October 2001, pp. 131-137
…berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/LinAlvarezRuiz_WebKDD2000.ppt
…association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6:83-105, 2002
Linden, G., Smith, B., and York, J. Amazon.com recommendations: item-to-item collaborative filtering. IEEE Internet Computing, Vol. 7, No. 1, pp. 76-80, 2003
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Analysis of recommendation algorithms for e-commerce. ACM Conf. on Electronic Commerce 2000: 158-167
…reduction in recommender systems: a case study. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000
Sarwar, B., Karypis, G., Konstan, J., and Riedl, J. Item-based collaborative filtering recommendation algorithms. WWW'01
…berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/badrul.ppt
Schafer, J.B., Konstan, J., and Riedl, J. E-Commerce Recommendation Applications. Data Mining and Knowledge Discovery, 5(1/2): 115-153, 2001
…, AAAI Workshop on Recommendation Systems, 1998
…personalized recommender system for the cosmetic business. Expert Systems with Applications, 26(3), April 2004, pp. 427-434
…personalized recommender systems in e-commerce. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000
…instances for efficient accurate collaborative filtering. In Proc. 10th CIKM, pp. 239-246, ACM Press, 2001
http://sifaka.cs.uiuc.edu/course/2003-497CXZ/loc/cf.ppt