

SLIDE 1

Slide source: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeff Ullman, Stanford University, http://www.mmds.org

SLIDE 2

Data Mining: Concepts and Techniques


 #1: C4.5 - Decision Tree - Classification (61 votes)
 #2: K-Means - Clustering (60 votes)
 #3: SVM - Classification (58 votes)
 #4: Apriori - Frequent Itemsets (52 votes)
 #5: EM - Clustering (48 votes)
 #6: PageRank - Link mining (46 votes)
 #7: AdaBoost - Boosting (45 votes)
 #7: kNN - Classification (45 votes)
 #7: Naive Bayes - Classification (45 votes)
 #10: CART - Classification (34 votes)

SLIDE 3

 How to organize the Web?

 First try: Human curated Web directories

  • Yahoo, DMOZ, LookSmart

 Second try: Web Search

  • Content based: Find relevant docs
  • Top-k ranking based on TF-IDF
  • Works well in a small and trusted set
  • J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org


SLIDE 4

SLIDE 5

 Link-based ranking algorithms

  • PageRank
  • HITS

SLIDE 6

 Not all web pages are equally “important”

www.joe-schmoe.com vs. www.stanford.edu

 There is large diversity in the web-graph node connectivity. Let’s rank the pages by the link structure!

SLIDE 7

 Idea: Links as votes

  • Page is more important if it has more links
  • In-coming links? Out-going links?

 Think of in-links as votes:

  • www.stanford.edu has 23,400 in-links
  • www.joe-schmoe.com has 1 in-link

 Are all in-links equal?

  • Links from important pages count more
  • Recursive question!

SLIDE 8

[Figure: example PageRank scores on a small web graph: B 38.4, C 34.3, E 8.1, F 3.9, D 3.9, A 3.3, and 1.6 for each of five peripheral nodes.]

SLIDE 9

 Each link’s vote is proportional to the importance of its source page

 If page j with importance r_j has d_j out-links, each link gets r_j / d_j votes

 Page j’s own importance is the sum of the votes on its in-links

[Figure: page j receives r_i/3 from page i and r_k/4 from page k, so r_j = r_i/3 + r_k/4; page j’s own three out-links each carry r_j/3 votes.]

SLIDE 10

 A “vote” from an important page is worth more

 A page is important if it is pointed to by other important pages

 Define a “rank” r_j for page j:

r_j = Σ_{i→j} r_i / d_i

d_i … out-degree of node i

[Figure: “The web in 1839”, three pages y, a, m: y links to y and a; a links to y and m; m links to a.]

“Flow” equations:

r_y = r_y /2 + r_a /2
r_a = r_y /2 + r_m
r_m = r_a /2

SLIDE 11

 3 equations, 3 unknowns, no constants

  • No unique solution
  • All solutions equivalent modulo the scale factor

 Additional constraint forces uniqueness:

  • r_y + r_a + r_m = 1
  • Solution: r_y = 2/5, r_a = 2/5, r_m = 1/5

 Gaussian elimination works for small examples, but we need a better method for large web-size graphs

 We need a new formulation!

Flow equations:

r_y = r_y /2 + r_a /2
r_a = r_y /2 + r_m
r_m = r_a /2
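As a quick sanity check, the flow equations plus the sum-to-one constraint can be solved directly. The sketch below (NumPy assumed available; not part of the original slides) reproduces the unique normalized solution:

```python
import numpy as np

# Flow equations for the 3-page example (y, a, m):
#   r_y = r_y/2 + r_a/2
#   r_a = r_y/2 + r_m
#   r_m = r_a/2
# Written as (M - I) r = 0; one redundant row is replaced
# by the normalization constraint r_y + r_a + r_m = 1.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
A = M - np.eye(3)
A[2] = [1.0, 1.0, 1.0]          # sum-to-1 constraint
b = np.array([0.0, 0.0, 1.0])
r = np.linalg.solve(A, b)
print(r)                        # [0.4, 0.4, 0.2], i.e. 2/5, 2/5, 1/5
```

This is exactly the Gaussian-elimination approach the slide warns does not scale to web-size graphs.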

SLIDE 12

 Stochastic adjacency matrix M

  • Let page j have d_j out-links
  • If j → i, then M_ij = 1/d_j, else M_ij = 0
  • M is a column stochastic matrix: columns sum to 1

 Rank vector r: vector with an entry per page

  • r_i is the importance score of page i
  • Σ_i r_i = 1

 The flow equations can be written

r = M ⋅ r

or, entry by entry, r_j = Σ_{i→j} r_i / d_i
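A minimal sketch of building the column-stochastic matrix M from an adjacency list, using the running y/a/m example (NumPy assumed; the dictionary representation is illustrative):

```python
import numpy as np

# Hypothetical 3-page web graph as an adjacency list: page -> out-links.
links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
pages = sorted(links)
idx = {p: i for i, p in enumerate(pages)}
N = len(pages)

# Column-stochastic matrix: M[i][j] = 1/d_j if j links to i, else 0.
M = np.zeros((N, N))
for j, outs in links.items():
    for i in outs:
        M[idx[i], idx[j]] = 1.0 / len(outs)

print(M.sum(axis=0))   # every column sums to 1
```

Since m has a single out-link (to a), its column holds one entry equal to 1; y and a each have out-degree 2, so their columns hold two entries of 1/2.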

SLIDE 13

 Remember the flow equation: r_j = Σ_{i→j} r_i / d_i
 Flow equation in the matrix form:

M ⋅ r = r

  • Suppose page i links to 3 pages, including j

[Figure: in M ⋅ r = r, row j of M has the entry 1/3 in column i, so r_j collects r_i/3.]

SLIDE 14

r = M ⋅ r

⎡ r_y ⎤   ⎡ 1/2  1/2   0 ⎤ ⎡ r_y ⎤
⎢ r_a ⎥ = ⎢ 1/2   0    1 ⎥ ⎢ r_a ⎥
⎣ r_m ⎦   ⎣  0   1/2   0 ⎦ ⎣ r_m ⎦

Flow equations:

r_y = r_y /2 + r_a /2
r_a = r_y /2 + r_m
r_m = r_a /2

SLIDE 15

 The flow equations can be written

r = M ⋅ r

 So the rank vector r is an eigenvector of the stochastic web matrix M

  • In fact, its first or principal eigenvector, with corresponding eigenvalue 1
  • Largest eigenvalue of M is 1 since M is column stochastic (with non-negative entries)

 We can now efficiently solve for r! The method is called Power iteration

NOTE: x is an eigenvector with the corresponding eigenvalue λ if: A x = λ x
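One way to check the eigenvector claim numerically, again on the y/a/m matrix: a sketch using NumPy's dense eigensolver (fine for a toy example, impractical for a web-scale M):

```python
import numpy as np

# The y/a/m web matrix from the flow-equation example.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

vals, vecs = np.linalg.eig(M)
k = np.argmax(vals.real)        # principal eigenvalue (should be 1)
r = vecs[:, k].real
r = r / r.sum()                 # rescale so the entries sum to 1
print(vals[k].real, r)          # eigenvalue ~1.0, r ~ [0.4, 0.4, 0.2]
```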

SLIDE 16

 Given a web graph with N nodes, where the nodes are pages and edges are hyperlinks

 Power iteration: a simple iterative scheme

  • Initialize: r(0) = [1/N, …, 1/N]^T
  • Iterate: r(t+1) = M ⋅ r(t), i.e. r_j(t+1) = Σ_{i→j} r_i(t) / d_i
  • Stop when |r(t+1) − r(t)|_1 < ε

d_i … out-degree of node i
|x|_1 = Σ_{1≤i≤N} |x_i| is the L1 norm; can use any other vector norm, e.g., Euclidean
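The scheme above can be sketched in a few lines (NumPy assumed; the stopping threshold eps is illustrative):

```python
import numpy as np

def power_iterate(M, eps=1e-10):
    """Plain power iteration: r <- M r until the L1 change is < eps."""
    N = M.shape[0]
    r = np.full(N, 1.0 / N)       # r(0) = [1/N, ..., 1/N]
    while True:
        r_next = M @ r
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
print(power_iterate(M))           # ~ [0.4, 0.4, 0.2] = (6/15, 6/15, 3/15)
```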

SLIDE 17

 Example:

       y    a    m
y  [  1/2  1/2   0 ]
a  [  1/2   0    1 ]
m  [   0   1/2   0 ]

Iteration 0, 1, 2, 3, …

r_y = 1/3, 1/3, 5/12, 9/24, … → 6/15
r_a = 1/3, 3/6, 1/3, 11/24, … → 6/15
r_m = 1/3, 1/6, 3/12, 1/6, … → 3/15

SLIDE 18

 Example: two pages a and b, with a → b and b → a

r_j(t+1) = Σ_{i→j} r_i(t) / d_i

Iteration 0, 1, 2, …

r_a = 1, 0, 1, 0, …
r_b = 0, 1, 0, 1, …

SLIDE 19

 Example: two pages a and b, with a → b and b → b

r_j(t+1) = Σ_{i→j} r_i(t) / d_i

Iteration 0, 1, 2, …

r_a = 1, 0, 0, 0, …
r_b = 0, 1, 1, 1, …

SLIDE 20

 Imagine a random web surfer:

  • At any time t, the surfer is on some page i
  • At time t+1, the surfer follows an out-link from i uniformly at random
  • Ends up on some page j linked from i
  • Process repeats indefinitely

 Let:

  • p(t) … vector whose i-th coordinate is the prob. that the surfer is at page i at time t
  • So, p(t) is a probability distribution over pages

SLIDE 21

 Where is the surfer at time t+1?

  • Follows a link uniformly at random:

p(t+1) = M ⋅ p(t)

 Suppose the random walk reaches a state

p(t+1) = M ⋅ p(t) = p(t)

then p(t) is a stationary distribution of the random walk

 Our original rank vector r satisfies r = M ⋅ r

  • So, r is a stationary distribution for the random walk
SLIDE 22

 Does this converge?
 Does it converge to what we want?

r_j(t+1) = Σ_{i→j} r_i(t) / d_i

or, equivalently, r = M ⋅ r
SLIDE 23

 A central result from the theory of random walks (a.k.a. Markov processes): for graphs that satisfy certain conditions (strongly connected, no dead ends), the stationary distribution is unique and will eventually be reached no matter what the initial probability distribution at time t = 0 is
SLIDE 24

SLIDE 25

2 problems:

 (1) Some pages are dead ends (have no out-links)

  • Random walk has “nowhere” to go to
  • Such pages cause importance to “leak out”

 (2) Spider traps (all out-links are within the group)

  • Random walk gets “stuck” in a trap
  • And eventually spider traps absorb all importance

SLIDE 26

 Example: m is now a dead end, so its column in the matrix is all zeros

       y    a    m
y  [  1/2  1/2   0 ]
a  [  1/2   0    0 ]
m  [   0   1/2   0 ]

Iteration 0, 1, 2, 3, …

r_y = 1/3, 2/6, 3/12, 5/24, … → 0
r_a = 1/3, 1/6, 2/12, 3/24, … → 0
r_m = 1/3, 1/6, 1/12, 2/24, … → 0

Here the PageRank “leaks” out since the matrix is not column stochastic.
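The leak is easy to reproduce numerically; a minimal sketch with the dead-end matrix above (NumPy assumed):

```python
import numpy as np

# Dead-end version of the y/a/m graph: m has no out-links,
# so its column in M is all zeros and M is not column stochastic.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])

r = np.full(3, 1.0 / 3.0)
for _ in range(50):
    r = M @ r
print(r.sum())   # total importance has leaked out toward 0
```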

SLIDE 27

 Teleports: Follow random teleport links from dead ends

  • Adjust the matrix accordingly: replace the all-zero column of the dead end m with 1/3 in every entry

Before:                       After:
      y    a    m                   y    a    m
y  [ 1/2  1/2   0 ]           y  [ 1/2  1/2  1/3 ]
a  [ 1/2   0    0 ]           a  [ 1/2   0   1/3 ]
m  [  0   1/2   0 ]           m  [  0   1/2  1/3 ]

SLIDE 28

 Example: m is a spider trap (its only out-link points to itself)

       y    a    m
y  [  1/2  1/2   0 ]
a  [  1/2   0    0 ]
m  [   0   1/2   1 ]

Iteration 0, 1, 2, 3, …

r_y = 1/3, 2/6, 3/12, 5/24, … → 0
r_a = 1/3, 1/6, 2/12, 3/24, … → 0
r_m = 1/3, 3/6, 7/12, 16/24, … → 1

All the PageRank score gets “trapped” in node m.
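The same experiment with the spider-trap matrix shows the opposite failure mode; a minimal sketch (NumPy assumed):

```python
import numpy as np

# Spider-trap version of the y/a/m graph: m links only to itself,
# so M is column stochastic but all importance drains into m.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])

r = np.full(3, 1.0 / 3.0)
for _ in range(100):
    r = M @ r
print(r)   # ~ [0, 0, 1]: node m has absorbed all the PageRank
```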

SLIDE 29

 The Google solution for spider traps: at each time step, the random surfer has two options

  • With prob. β, follow a link at random
  • With prob. 1 − β, jump to some random page
  • Common values for β are in the range 0.8 to 0.9

 Surfer will teleport out of spider trap within a few time steps

SLIDE 30

 PageRank equation [Brin-Page, ’98]:

r_j = Σ_{i→j} β ⋅ r_i / d_i + (1 − β) ⋅ 1/N

 The Google Matrix A:

A = β M + (1 − β) [1/N]_{N×N}

[1/N]_{N×N} … N-by-N matrix where all entries are 1/N

 We have a recursive problem: r = A ⋅ r
 And the Power method still works!
 What is β?

  • In practice β = 0.8, 0.9 (make 5 steps on avg., then jump)
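A sketch of power iteration on the Google matrix A, built from the spider-trap example with β = 0.8 (NumPy assumed; the fixed point 7/33, 5/33, 21/33 is the one worked out on the next slide):

```python
import numpy as np

beta = 0.8
# y/a/m graph with the spider trap at m.
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])
N = M.shape[0]

# Google matrix: A = beta*M + (1-beta)*[1/N]_{NxN}
A = beta * M + (1 - beta) / N * np.ones((N, N))

r = np.full(N, 1.0 / N)
for _ in range(100):
    r = A @ r
print(r)   # ~ [7/33, 5/33, 21/33]: the trap no longer absorbs everything
```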

SLIDE 31

A = 0.8 M + 0.2 [1/N]_{N×N}:

          ⎡ 1/2  1/2   0 ⎤         ⎡ 1/3  1/3  1/3 ⎤   ⎡ 7/15  7/15   1/15 ⎤
A = 0.8 ⋅ ⎢ 1/2   0    0 ⎥ + 0.2 ⋅ ⎢ 1/3  1/3  1/3 ⎥ = ⎢ 7/15  1/15   1/15 ⎥
          ⎣  0   1/2   1 ⎦         ⎣ 1/3  1/3  1/3 ⎦   ⎣ 1/15  7/15  13/15 ⎦

Iteration 0, 1, 2, 3, …

r_y = 1/3, 0.33, 0.24, 0.26, … → 7/33
r_a = 1/3, 0.20, 0.20, 0.18, … → 5/33
r_m = 1/3, 0.46, 0.52, 0.56, … → 21/33

SLIDE 32

 Input: Graph G and parameter β

  • Directed graph G (can have spider traps and dead ends)
  • Parameter β

 Output: PageRank vector r^new

  • Set: r_j^old = 1/N
  • Repeat while Σ_j |r_j^new − r_j^old| > ε:
  • ∀j: r′_j^new = Σ_{i→j} β ⋅ r_i^old / d_i
    (r′_j^new = 0 if the in-degree of j is 0)
  • Now re-insert the leaked PageRank:
    ∀j: r_j^new = r′_j^new + (1 − S)/N, where S = Σ_j r′_j^new
  • r^old = r^new

If the graph has no dead ends then the amount of leaked PageRank is 1 − β. But since we have dead ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S.
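The full algorithm can be sketched as a self-contained function (NumPy assumed; the example graph and names are illustrative, and the leaked mass S is re-inserted exactly as described above):

```python
import numpy as np

def pagerank(links, beta=0.8, eps=1e-10):
    """PageRank with teleports; mass leaked via dead ends is re-inserted.

    links: dict page -> list of out-link pages (dead ends allowed).
    """
    pages = sorted(links)
    idx = {p: i for i, p in enumerate(pages)}
    N = len(pages)
    r_old = np.full(N, 1.0 / N)
    while True:
        r_new = np.zeros(N)
        for j, outs in links.items():
            for i in outs:                      # follow link j -> i
                r_new[idx[i]] += beta * r_old[idx[j]] / len(outs)
        S = r_new.sum()                         # < beta if there are dead ends
        r_new += (1.0 - S) / N                  # re-insert leaked PageRank
        if np.abs(r_new - r_old).sum() < eps:
            return dict(zip(pages, r_new))
        r_old = r_new

# Hypothetical graph with a dead end at m:
print(pagerank({"y": ["y", "a"], "a": ["y", "m"], "m": []}))
```

Because the re-insertion step restores whatever mass left the system, each iterate remains a probability distribution, so no explicit normalization pass is needed.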
SLIDE 33

 Measures generic popularity of a page

  • Biased against topic-specific authorities
  • Solution: Topic-Specific PageRank

 Uses a single measure of importance

  • Other models of importance
  • Solution: Hubs-and-Authorities

 Susceptible to link spam

  • Artificial link topologies created in order to boost PageRank
  • Solution: TrustRank