

SLIDE 1

Slide source: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University, http://www.mmds.org

SLIDE 2

Data Mining: Concepts and Techniques


 #1: C4.5 - Decision Tree - Classification (61 votes)
 #2: K-Means - Clustering (60 votes)
 #3: SVM - Classification (58 votes)
 #4: Apriori - Frequent Itemsets (52 votes)
 #5: EM - Clustering (48 votes)
 #6: PageRank - Link mining (46 votes)
 #7: AdaBoost - Boosting (45 votes)
 #7: kNN - Classification (45 votes)
 #7: Naive Bayes - Classification (45 votes)
 #10: CART - Classification (34 votes)

SLIDE 3

 How to organize the Web?

 First try: Human curated Web directories

  • Yahoo, DMOZ, LookSmart

 Second try: Web Search

  • Content based: Find relevant docs
  • Top-k ranking based on TF-IDF
  • Works well in a small and trusted set
  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


SLIDE 4

SLIDE 5

 Link-based ranking algorithms

  • PageRank
  • HITS

SLIDE 6

 Not all web pages are equally “important”

www.joe-schmoe.com vs. www.stanford.edu

 There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure!

SLIDE 7

 Idea: Links as votes

  • Page is more important if it has more links
  • In-coming links? Out-going links?

 Think of in-links as votes:

  • www.stanford.edu has 23,400 in-links
  • www.joe-schmoe.com has 1 in-link

 Are all in-links equal?

  • Links from important pages count more
  • Recursive question!

SLIDE 8

[Figure: example PageRank scores on a small web graph: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and 1.6 for each of five peripheral nodes.]

SLIDE 9

 Each link’s vote is proportional to the importance of its source page

 If page j with importance r_j has d_j out-links, each link gets r_j / d_j votes

 Page j’s own importance is the sum of the votes on its in-links

[Figure: page j receives r_i/3 from page i and r_k/4 from page k, so r_j = r_i/3 + r_k/4; page j’s own three out-links each carry r_j/3 votes.]

SLIDE 10

 A “vote” from an important page is worth more

 A page is important if it is pointed to by other important pages

 Define a “rank” r_j for page j:

r_j = Σ_{i→j} r_i / d_i

d_i … out-degree of node i

[Figure: “The web in 1839”, three pages y, a, m: y links to y and a; a links to y and m; m links to a.]

“Flow” equations:

r_y = r_y /2 + r_a /2
r_a = r_y /2 + r_m
r_m = r_a /2

SLIDE 11

 3 equations, 3 unknowns, no constants

  • No unique solution
  • All solutions equivalent modulo the scale factor

 Additional constraint forces uniqueness:

  • r_y + r_a + r_m = 1
  • Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5

 Gaussian elimination works for small examples, but we need a better method for large web-size graphs

 We need a new formulation!

Flow equations:

r_y = r_y /2 + r_a /2
r_a = r_y /2 + r_m
r_m = r_a /2
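As a quick sanity check, the flow equations plus the sum-to-one constraint can be solved directly. The sketch below (NumPy assumed available; not part of the original slides) reproduces the unique normalized solution:

```python
import numpy as np

# Flow equations for the 3-page example (y, a, m):
#   r_y = r_y/2 + r_a/2
#   r_a = r_y/2 + r_m
#   r_m = r_a/2
# Written as (M - I) r = 0; one redundant row is replaced
# by the normalization constraint r_y + r_a + r_m = 1.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
A = M - np.eye(3)
A[2] = [1.0, 1.0, 1.0]          # sum-to-1 constraint
b = np.array([0.0, 0.0, 1.0])
r = np.linalg.solve(A, b)
print(r)                        # [0.4, 0.4, 0.2], i.e. 2/5, 2/5, 1/5
```

This is exactly the Gaussian-elimination approach the slide warns does not scale to web-size graphs.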

SLIDE 12

 Stochastic adjacency matrix M

  • Let page j have d_j out-links
  • If j → i, then M_ij = 1/d_j, else M_ij = 0
  • M is a column stochastic matrix: columns sum to 1

 Rank vector r: vector with an entry per page

  • r_i is the importance score of page i
  • Σ_i r_i = 1

 The flow equations can be written

r = M ⋅ r

or, entry by entry, r_j = Σ_{i→j} r_i / d_i
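A minimal sketch of building the column-stochastic matrix M from an adjacency list, using the running y/a/m example (NumPy assumed; the dictionary representation is illustrative):

```python
import numpy as np

# Hypothetical 3-page web graph as an adjacency list: page -> out-links.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
pages = sorted(links)
idx = {p: i for i, p in enumerate(pages)}
N = len(pages)

# Column-stochastic matrix: M[i][j] = 1/d_j if j links to i, else 0.
M = np.zeros((N, N))
for j, outs in links.items():
    for i in outs:
        M[idx[i], idx[j]] = 1.0 / len(outs)

print(M.sum(axis=0))   # every column sums to 1
```

Since m has a single out-link (to a), its column holds one entry equal to 1; y and a each have out-degree 2, so their columns hold two entries of 1/2.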

SLIDE 13

 Remember the flow equation: r_j = Σ_{i→j} r_i / d_i
 Flow equation in the matrix form:

M ⋅ r = r

  • Suppose page i links to 3 pages, including j

[Figure: in M ⋅ r = r, row j of M has the entry 1/3 in column i, so r_j collects r_i/3.]

SLIDE 14

r = M ⋅ r

⎡ r_y ⎤   ⎡ 1/2  1/2   0 ⎤ ⎡ r_y ⎤
⎢ r_a ⎥ = ⎢ 1/2   0    1 ⎥ ⎢ r_a ⎥
⎣ r_m ⎦   ⎣  0   1/2   0 ⎦ ⎣ r_m ⎦

Flow equations:

r_y = r_y /2 + r_a /2
r_a = r_y /2 + r_m
r_m = r_a /2

SLIDE 15

 The flow equations can be written

r = M ⋅ r

 So the rank vector r is an eigenvector of the stochastic web matrix M

  • In fact, its first or principal eigenvector, with corresponding eigenvalue 1
  • Largest eigenvalue of M is 1 since M is column stochastic (with non-negative entries)

 We can now efficiently solve for r! The method is called Power iteration

NOTE: x is an eigenvector with the corresponding eigenvalue λ if: A x = λ x
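One way to check the eigenvector claim numerically, again on the y/a/m matrix: a sketch using NumPy's dense eigensolver (fine for a toy example, impractical for a web-scale M):

```python
import numpy as np

# The y/a/m web matrix from the flow-equation example.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

vals, vecs = np.linalg.eig(M)
k = np.argmax(vals.real)        # principal eigenvalue (should be 1)
r = vecs[:, k].real
r = r / r.sum()                 # rescale so the entries sum to 1
print(vals[k].real, r)          # eigenvalue ~1.0, r ~ [0.4, 0.4, 0.2]
```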

SLIDE 16

 Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks

 Power iteration: a simple iterative scheme

  • Initialize: r(0) = [1/N, …, 1/N]^T
  • Iterate: r(t+1) = M ⋅ r(t), i.e. r_j(t+1) = Σ_{i→j} r_i(t) / d_i
  • Stop when |r(t+1) − r(t)|_1 < ε

d_i … out-degree of node i
|x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; can use any other vector norm, e.g., Euclidean
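The scheme above can be sketched in a few lines (NumPy assumed; the stopping threshold eps is illustrative):

```python
import numpy as np

def power_iterate(M, eps=1e-10):
    """Plain power iteration: r <- M r until the L1 change is < eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)       # r(0) = [1/N, ..., 1/N]
    while True:
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))           # ~ [0.4, 0.4, 0.2] = (6/15, 6/15, 3/15)
```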

SLIDE 17

 Example:

       y    a    m
y  [  1/2  1/2   0 ]
a  [  1/2   0    1 ]
m  [   0   1/2   0 ]

Iteration 0, 1, 2, 3, …

r_y = 1/3, 1/3, 5/12, 9/24, … → 6/15
r_a = 1/3, 3/6, 1/3, 11/24, … → 6/15
r_m = 1/3, 1/6, 3/12, 1/6, … → 3/15

SLIDE 18

 Example: two pages a and b, with a → b and b → a

r_j(t+1) = Σ_{i→j} r_i(t) / d_i

Iteration 0, 1, 2, …

r_a = 1, 0, 1, 0, …
r_b = 0, 1, 0, 1, …

SLIDE 19

 Example: two pages a and b, with a → b and b → b

r_j(t+1) = Σ_{i→j} r_i(t) / d_i

Iteration 0, 1, 2, …

r_a = 1, 0, 0, 0, …
r_b = 0, 1, 1, 1, …

SLIDE 20

 Imagine a random web surfer:

  • At any time t, the surfer is on some page i
  • At time t+1, the surfer follows an out-link from i uniformly at random
  • Ends up on some page j linked from i
  • Process repeats indefinitely

 Let:

  • p(t) … vector whose i-th coordinate is the prob. that the surfer is at page i at time t
  • So, p(t) is a probability distribution over pages

SLIDE 21

 Where is the surfer at time t+1?

  • Follows a link uniformly at random:

p(t+1) = M ⋅ p(t)

 Suppose the random walk reaches a state

p(t+1) = M ⋅ p(t) = p(t)

then p(t) is a stationary distribution of the random walk

 Our original rank vector r satisfies r = M ⋅ r

  • So, r is a stationary distribution for the random walk
SLIDE 22

 Does this converge?
 Does it converge to what we want?

r_j(t+1) = Σ_{i→j} r_i(t) / d_i

or, equivalently, r = M ⋅ r
SLIDE 23

 A central result from the theory of random walks (a.k.a. Markov processes): for graphs that satisfy certain conditions (strongly connected, no dead ends), the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0 is
SLIDE 24

SLIDE 25

2 problems:

 (1) Some pages are dead ends (have no out-links)

  • Random walk has “nowhere” to go to
  • Such pages cause importance to “leak out”

 (2) Spider traps (all out-links are within the group)

  • Random walk gets “stuck” in a trap
  • And eventually spider traps absorb all importance

SLIDE 26

 Example: m is now a dead end, so its column in the matrix is all zeros

       y    a    m
y  [  1/2  1/2   0 ]
a  [  1/2   0    0 ]
m  [   0   1/2   0 ]

Iteration 0, 1, 2, 3, …

r_y = 1/3, 2/6, 3/12, 5/24, … → 0
r_a = 1/3, 1/6, 2/12, 3/24, … → 0
r_m = 1/3, 1/6, 1/12, 2/24, … → 0

Here the PageRank “leaks” out since the matrix is not column stochastic.
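The leak is easy to reproduce numerically; a minimal sketch with the dead-end matrix above (NumPy assumed):

```python
import numpy as np

# Dead-end version of the y/a/m graph: m has no out-links,
# so its column in M is all zeros and M is not column stochastic.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])

r = np.full(3, 1.0 / 3.0)
for _ in range(50):
    r = M @ r
print(r.sum())   # total importance has leaked out toward 0
```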

SLIDE 27

 Teleports: Follow random teleport links from dead ends

  • Adjust the matrix accordingly: replace the all-zero column of the dead end m with 1/3 in every entry

Before:                       After:
      y    a    m                   y    a    m
y  [ 1/2  1/2   0 ]           y  [ 1/2  1/2  1/3 ]
a  [ 1/2   0    0 ]           a  [ 1/2   0   1/3 ]
m  [  0   1/2   0 ]           m  [  0   1/2  1/3 ]

SLIDE 28

 Example: m is a spider trap (its only out-link points to itself)

       y    a    m
y  [  1/2  1/2   0 ]
a  [  1/2   0    0 ]
m  [   0   1/2   1 ]

Iteration 0, 1, 2, 3, …

r_y = 1/3, 2/6, 3/12, 5/24, … → 0
r_a = 1/3, 1/6, 2/12, 3/24, … → 0
r_m = 1/3, 3/6, 7/12, 16/24, … → 1

All the PageRank score gets “trapped” in node m.
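The same experiment with the spider-trap matrix shows the opposite failure mode; a minimal sketch (NumPy assumed):

```python
import numpy as np

# Spider-trap version of the y/a/m graph: m links only to itself,
# so M is column stochastic but all importance drains into m.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])

r = np.full(3, 1.0 / 3.0)
for _ in range(100):
    r = M @ r
print(r)   # ~ [0, 0, 1]: node m has absorbed all the PageRank
```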

SLIDE 29

 The Google solution for spider traps: at each time step, the random surfer has two options

  • With prob. β, follow a link at random
  • With prob. 1 − β, jump to some random page
  • Common values for β are in the range 0.8 to 0.9

 Surfer will teleport out of spider trap within a few time steps

SLIDE 30

 PageRank equation [Brin-Page, ’98]:

r_j = Σ_{i→j} β ⋅ r_i / d_i + (1 − β) ⋅ 1/N

 The Google Matrix A:

A = β M + (1 − β) [1/N]_{N×N}

[1/N]_{N×N} … N-by-N matrix where all entries are 1/N

 We have a recursive problem: r = A ⋅ r
 And the Power method still works!
 What is β?

  • In practice β = 0.8, 0.9 (make 5 steps on avg., then jump)
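A sketch of power iteration on the Google matrix A, built from the spider-trap example with β = 0.8 (NumPy assumed; the fixed point 7/33, 5/33, 21/33 is the one worked out on the next slide):

```python
import numpy as np

beta = 0.8
# y/a/m graph with the spider trap at m.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
N = M.shape[0]

# Google matrix: A = beta*M + (1-beta)*[1/N]_{NxN}
A = beta * M + (1 - beta) / N * np.ones((N, N))

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r
print(r)   # ~ [7/33, 5/33, 21/33]: the trap no longer absorbs everything
```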

SLIDE 31

A = 0.8 M + 0.2 [1/N]_{N×N}:

          ⎡ 1/2  1/2   0 ⎤         ⎡ 1/3  1/3  1/3 ⎤   ⎡ 7/15  7/15   1/15 ⎤
A = 0.8 ⋅ ⎢ 1/2   0    0 ⎥ + 0.2 ⋅ ⎢ 1/3  1/3  1/3 ⎥ = ⎢ 7/15  1/15   1/15 ⎥
          ⎣  0   1/2   1 ⎦         ⎣ 1/3  1/3  1/3 ⎦   ⎣ 1/15  7/15  13/15 ⎦

Iteration 0, 1, 2, 3, …

r_y = 1/3, 0.33, 0.24, 0.26, … → 7/33
r_a = 1/3, 0.20, 0.20, 0.18, … → 5/33
r_m = 1/3, 0.46, 0.52, 0.56, … → 21/33

SLIDE 32

 Input: Graph G and parameter β

  • Directed graph G (can have spider traps and dead ends)
  • Parameter β

 Output: PageRank vector r^new

  • Set: r_j^old = 1/N
  • Repeat while Σ_j |r_j^new − r_j^old| > ε:
  • ∀j: r′_j^new = Σ_{i→j} β ⋅ r_i^old / d_i
    (r′_j^new = 0 if the in-degree of j is 0)
  • Now re-insert the leaked PageRank:
    ∀j: r_j^new = r′_j^new + (1 − S)/N, where S = Σ_j r′_j^new
  • r^old = r^new

If the graph has no dead ends then the amount of leaked PageRank is 1 − β. But since we have dead ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
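The full algorithm can be sketched as a self-contained function (NumPy assumed; the example graph and names are illustrative, and the leaked mass S is re-inserted exactly as described above):

```python
import numpy as np

def pagerank(links, beta=0.8, eps=1e-10):
    """PageRank with teleports; mass leaked via dead ends is re-inserted.

    links: dict page -> list of out-link pages (dead ends allowed).
    """
    pages = sorted(links)
    idx = {p: i for i, p in enumerate(pages)}
    N = len(pages)
    r_old = np.full(N, 1.0 / N)
    while True:
        r_new = np.zeros(N)
        for j, outs in links.items():
            for i in outs:                      # follow link j -> i
                r_new[idx[i]] += beta * r_old[idx[j]] / len(outs)
        S = r_new.sum()                         # < beta if there are dead ends
        r_new += (1.0 - S) / N                  # re-insert leaked PageRank
        if np.abs(r_new - r_old).sum() < eps:
            return dict(zip(pages, r_new))
        r_old = r_new

# Hypothetical graph with a dead end at m:
print(pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": []}))
```

Because the re-insertion step restores whatever mass left the system, each iterate remains a probability distribution, so no explicit normalization pass is needed.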
SLIDE 33

 Measures generic popularity of a page

  • Biased against topic-specific authorities
  • Solution: Topic-Specific PageRank

 Uses a single measure of importance

  • Other models of importance
  • Solution: Hubs-and-Authorities

 Susceptible to link spam

  • Artificial link topologies created in order to boost PageRank
  • Solution: TrustRank