SLIDE 1

CS345a: Data Mining
Jure Leskovec and Anand Rajaraman
Stanford University

SLIDE 2

We would like to do prediction: learn a function y = f(x), where y can be:

  • Real: Regression
  • Categorical: Classification
  • More complex: Ranking, structured prediction, etc.

The data is labeled: we have many pairs (x, y).

SLIDE 3

We will talk about the following methods:

  • k-Nearest Neighbor (instance-based learning)
  • Perceptron algorithm
  • Support Vector Machines
  • Decision trees (lecture on Thursday by Sugato Basu from Google)

How do we efficiently train (build a model)?

SLIDE 4

Instance-based learning. Example: nearest neighbor.

  • Keep the whole training dataset: pairs (x, y)
  • A query example x' arrives
  • Find the closest example(s) x*
  • Predict y*
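As a minimal brute-force sketch of this procedure (NumPy and the function names are my assumptions, not from the slides):

```python
import numpy as np

def nn_predict(X_train, y_train, x_query):
    """Predict the label of x_query as the label of its nearest training point."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to each example
    return y_train[np.argmin(dists)]                   # predict y* of the closest x*
```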

SLIDE 5

To make things work we need four things:

  • Distance metric: Euclidean
  • How many neighbors to look at? One
  • Weighting function (optional): unused
  • How to fit with the local points? Just predict the same output as the nearest neighbor

SLIDE 6

Suppose x1, …, xm are two-dimensional: x1 = (x11, x12), x2 = (x21, x22), …

One can draw nearest-neighbor regions, and they depend on the metric; for example:

  d(xi, xj) = (xi1 − xj1)² + (xi2 − xj2)²
  d(xi, xj) = (xi1 − xj1)² + (3xi2 − 3xj2)²

(the second rescales the second coordinate, which changes the regions).

SLIDE 7

k-nearest neighbor:

  • Distance metric: Euclidean
  • How many neighbors to look at? k
  • Weighting function (optional): unused
  • How to fit with the local points? Just predict the average output among the k nearest neighbors

(Figure: example with k = 9.)
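A sketch of this k-neighbor variant, averaging the outputs of the k closest examples:

```python
import numpy as np

def knn_predict(X_train, y_train, q, k=9):
    """Predict the average output of the k nearest neighbors (k = 9 as above)."""
    dists = np.linalg.norm(X_train - q, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(dists)[:k]              # indices of the k closest examples
    return y_train[nearest].mean()               # average their outputs
```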

SLIDE 8

Kernel regression:

  • Distance metric: Euclidean
  • How many neighbors to look at? All of them
  • Weighting function: wi = exp(−d(xi, q)² / Kw)
    Nearby points to the query q are weighted more strongly; Kw is the kernel width.
  • How to fit with the local points? Predict the weighted average: Σi wi yi / Σi wi

(Figure: fits for kernel widths K = 10, 20, 80.)
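And a sketch of this kernel-weighted variant; the kernel width default is illustrative:

```python
import numpy as np

def kernel_regression(X_train, y_train, q, Kw=20.0):
    """Kernel-weighted average over ALL training points, as on the slide."""
    d2 = np.sum((X_train - q) ** 2, axis=1)  # squared Euclidean distances d(x_i, q)^2
    w = np.exp(-d2 / Kw)                     # w_i = exp(-d(x_i, q)^2 / Kw)
    return np.sum(w * y_train) / np.sum(w)   # sum_i w_i y_i / sum_i w_i
```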

SLIDE 9

Given: a set P of n points in R^d. Goal: given a query point q:

  • NN: find the nearest neighbor p of q in P
  • Range search: find one/all points in P within distance r from q

SLIDE 10

Main memory:

  • Linear scan
  • Tree-based: quadtree, kd-tree
  • Hashing: Locality-Sensitive Hashing

Secondary storage:

  • R-trees

SLIDE 11

The quadtree: the simplest spatial structure on Earth! Split the space into 2^d equal subsquares, and repeat until done:

  • only one pixel left
  • only one point left
  • only a few points left

Variants:

  • split only one dimension at a time
  • kd-trees (in a moment)

SLIDE 12

Range search:

  • Put the root node on the stack
  • Repeat:
    • pop the next node T from the stack
    • for each child C of T:
      • if C is a leaf, examine the point(s) in C
      • if C intersects the ball of radius r around q, add C to the stack

Nearest neighbor:

  • Start a range search with r = ∞
  • Whenever a point is found, update r
  • Only investigate nodes with respect to the current r

Great in 2 or 3 dimensions. Space issues in higher dimensions: each node has 2^d children. (A Python sketch of the traversal follows.)
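Here is a minimal sketch of that stack-based traversal for a 2-D quadtree; the Node layout (leaves store points, internal nodes store children) is an assumption of mine, not the slides':

```python
import math

class Node:
    def __init__(self, center, half, points=None, children=None):
        self.center = center            # center of this node's square
        self.half = half                # half of the square's side length
        self.points = points            # stored points, if this is a leaf
        self.children = children or []  # child squares, if internal

def intersects_ball(node, q, r):
    # Squared distance from q to the node's square, compared with r^2
    dx = max(abs(q[0] - node.center[0]) - node.half, 0.0)
    dy = max(abs(q[1] - node.center[1]) - node.half, 0.0)
    return dx * dx + dy * dy <= r * r

def range_search(root, q, r):
    """Return all stored points within distance r of query point q."""
    found, stack = [], [root]           # put the root node on the stack
    while stack:
        T = stack.pop()                 # pop the next node T
        if T.points is not None:        # leaf: examine its point(s)
            found += [p for p in T.points if math.dist(p, q) <= r]
        else:
            for C in T.children:        # push children whose square
                if intersects_ball(C, q, r):  # intersects the query ball
                    stack.append(C)
    return found
```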

SLIDE 13

kd-trees. Main ideas [Bentley '75]:

  • Only one-dimensional splits
  • Choose the split "carefully" (many variations)
  • Queries: as for quadtrees

Advantages:

  • no (or less) empty space
  • only linear space

Query time at most: min(dn, exponential in d)

SLIDE 14

R-trees: a "bottom-up" approach [Guttman '84]:

  • Start with a set of points/rectangles
  • Partition the set into groups of small cardinality
  • For each group, find the minimum bounding rectangle (MBR) containing the objects of the group
  • Repeat

Advantages:

  • Supports near(est)-neighbor search (similar to before)
  • Works for points and rectangles
  • Avoids empty spaces

SLIDE 15

R-trees with fan-out 4:

  • group nearby rectangles into parent MBRs

(Figure: rectangles A through J grouped into parent MBRs.)

SLIDE 16

R-trees with fan-out 4:

  • every parent node completely covers its 'children'

(Figure: parent MBRs P1–P4, each completely covering its child rectangles among A–J.)

SLIDE 17

R-trees with fan-out 4:

  • every parent node completely covers its 'children'

(Figure: the corresponding R-tree: root entries P1 P2 P3 P4 with leaf entries F G, D E, H I J, and A B C.)

SLIDE 18

Example of a range search query.

(Figure: a range search query over the same R-tree.)

SLIDE 19

Example of a range search query, continued.

(Figure: the same range search query, one step further down the tree.)

SLIDE 20

Example: spam filtering.

Instance space X:

  • Feature vector of word occurrences (binary or TF-IDF)
  • d features (d ~ 100,000)

Class Y:

  • Spam (+1), Ham (−1)

SLIDE 21

Very loose motivation: the neuron.

  • Inputs are feature values
  • Each feature has a weight wi
  • Activation is the sum: f(x) = Σi wi xi = w·x

If f(x) is:

  • Positive: predict +1
  • Negative: predict −1

(Figures: a neuron with inputs x1…x4, weights w1…w4, and a ">0?" threshold; a plot of features "viagra" vs. "nigeria" with the boundary w·x = 0 separating Spam = +1 from Ham = −1.)

SLIDE 22

If there are more than 2 classes:

  • Keep a weight vector wc for each class
  • Calculate the activation for each class: f(x, c) = Σi wc,i xi = wc·x
  • The highest activation wins: c* = arg maxc f(x, c)

(Figure: weight vectors w1, w2, w3 partition the plane into regions where w1·x, w2·x, or w3·x is biggest.)
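As a one-line illustration of the multiclass rule (stacking one weight vector per class into a matrix W is my choice of layout):

```python
import numpy as np

def predict_class(W, x):
    """W: C x d matrix, one weight vector w_c per class; highest activation wins."""
    return int(np.argmax(W @ x))  # c* = argmax_c f(x, c) = w_c . x
```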

SLIDE 23

Define a model:

  Perceptron: y = sign(w·x)

Define a loss function:

  L(w) = −Σi yi (w·xi)

Minimize the loss: compute the gradient L'(w) and optimize (batch gradient descent):

  wt+1 = wt − ηt L'(w) = wt − ηt Σi dL(yi w·xi)/dw
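To make the batch step concrete: read literally, L(w) = −Σi yi (w·xi) has a constant gradient, so the sum is conventionally taken over the currently misclassified examples; this sketch assumes that reading, and the step size and iteration count are illustrative:

```python
import numpy as np

def batch_gd_perceptron(X, y, eta=0.1, steps=100):
    """Batch gradient descent on the perceptron loss over misclassified points."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        miss = np.sign(X @ w) != y                     # currently misclassified examples
        grad = -(y[miss, None] * X[miss]).sum(axis=0)  # L'(w) = -sum_i y_i x_i over mistakes
        w -= eta * grad                                # w_{t+1} = w_t - eta * L'(w)
    return w
```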

SLIDE 24

Stochastic gradient descent:

  • Examples are drawn from a finite training set
  • Pick a random example xj and update:

    wt+1 = wt − ηt dL(w·xj, yj)/dw

Cost comparison [Bottou-LeCun '04]:

  Method          Cost per iteration   Time to reach accuracy ρ   Time for optimization error < ε
  GD              O(md)                O(mdκ log(1/ρ))            O(d²κ/ε · log²(1/ε))
  2nd-order GD    O(d(d+m))            O(md log log(1/ρ))         O(d²/ε · log(1/ε) · log log(1/ε))
  Stochastic GD   O(d)                 O(dνκ/ρ)                   O(dνκ/ε)

  m … number of examples, d … number of features, κ … condition number

SLIDE 25

The perceptron algorithm:

  • Start with w = 0
  • Pick training examples x one by one
  • Predict the class of x using the current weights: y' = sign(w·x)
  • If y' is correct: no change
  • If y' is wrong: adjust w (see the sketch below):

    wt+1 = wt + η · y · x

    • η is the learning rate parameter
    • x is the training example
    • y is the true class label

(Figure: geometrically, the mistake update moves wt toward y·x, giving wt+1.)
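A minimal Python sketch of this training loop; the number of passes and the learning rate are illustrative choices, not from the slides:

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, epochs=10):
    """Perceptron: on every mistake, move w toward the misclassified example."""
    w = np.zeros(X.shape[1])           # start with w = 0
    for _ in range(epochs):
        for xi, yi in zip(X, y):       # pick examples one by one
            if np.sign(w @ xi) != yi:  # prediction y' = sign(w.x) is wrong
                w += eta * yi * xi     # w_{t+1} = w_t + eta * y * x
    return w
```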

SLIDE 26

Properties of the perceptron:

  • Separability: some parameter setting classifies the training set perfectly
  • Convergence: if the training set is separable, the perceptron will converge (binary case)
  • Mistake bound: the number of mistakes (binary case) is related to the margin, or degree of separability, γ:

    mistakes < 1/γ²

SLIDE 27

Problems with the perceptron:

  • Overfitting
  • Regularization: if the data is not separable, the weights dance around
  • Mediocre generalization: it finds a "barely" separating solution

SLIDE 28

Which is the best linear separator?

SLIDE 29

Maximize the margin γ:

  • Good according to intuition, theory, and practice

  max γ   s.t.  ∀i: yi (w·xi) ≥ γ

Since the attainable margin scales as 1/||w||, this is equivalent to:

  min ½ ||w||²   s.t.  ∀i: yi (w·xi) ≥ 1

This is the SVM with "hard" constraints.

SLIDE 30

If the data is not separable, introduce slack variables ξi:

  min ½ ||w||² + C Σi ξi   s.t.  ∀i: yi (w·xi) ≥ 1 − ξi,  ξi ≥ 0

Or in the "natural" form:

  min ½ ||w||² + C Σi max(0, 1 − yi w·xi)

Here the second term is the empirical loss and the first is the regularization term.

SLIDE 31

Use a quadratic solver:

  • Minimize a quadratic function
  • Subject to linear constraints

Or use stochastic gradient descent (sketched below):

  • Minimize: f(w) = ½ ||w||² + C Σi L(w·xi, yi)
  • Update:

    wt+1 = wt − ηt f'(wt) = wt − ηt (w + C · dL(w·xj, yj)/dw)
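A minimal sketch of that update with the hinge loss L(w·x, y) = max(0, 1 − y w·x); the constant learning rate and the other parameters are illustrative assumptions:

```python
import numpy as np

def sgd_svm(X, y, C=1.0, eta=0.01, epochs=10, seed=0):
    """SGD on f(w) = 0.5*||w||^2 + C * sum_i max(0, 1 - y_i * w.x_i)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for j in rng.permutation(len(y)):  # pick a random example x_j
            grad = w.copy()                # gradient of the 0.5*||w||^2 term
            if y[j] * (w @ X[j]) < 1:      # hinge loss is active for x_j
                grad -= C * y[j] * X[j]    # subgradient of the hinge term
            w -= eta * grad                # w_{t+1} = w_t - eta * grad
    return w
```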

SLIDE 32

Example by Leon Bottou:

  • Reuters RCV1 document corpus
  • m = 781k training examples, 23k test examples
  • d = 50k features

Training time:

SLIDE 33

(Figure: training-time comparison.)

SLIDE 34

What if we subsample the dataset?

  • SGD on the full dataset vs.
  • Conjugate gradient on n training examples

SLIDE 35

Need to choose the learning rate η in:

  wt+1 = wt − ηt L'(wt)

Leon suggests:

  • Select a small subsample
  • Try various rates η
  • Pick the one that most reduces the loss
  • Use that η for the next 100k iterations on the full dataset
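A sketch of that recipe, reusing the hypothetical sgd_svm from Slide 31; the candidate rates and subsample size are arbitrary:

```python
import numpy as np

def pick_learning_rate(X, y, rates=(1.0, 0.1, 0.01, 0.001), sample=1000):
    """Try each rate on a small subsample; keep the one that most reduces the loss."""
    Xs, ys = X[:sample], y[:sample]  # small subsample
    def loss(w):                     # regularized hinge loss f(w)
        return 0.5 * (w @ w) + np.maximum(0.0, 1.0 - ys * (Xs @ w)).sum()
    return min(rates, key=lambda eta: loss(sgd_svm(Xs, ys, eta=eta, epochs=1)))
```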

SLIDE 36

Stopping criteria: how many iterations of SGD?

  • Early stopping with cross-validation:
    • Create a validation set
    • Monitor the cost function on the validation set
    • Stop when the loss stops decreasing
  • Early stopping a priori:
    • Extract two disjoint subsamples A and B of the training data
    • Determine the number of epochs k by training on A, stopping by validating on B
    • Train for k epochs on the full dataset
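A generic early-stopping loop matching the first variant; train_epoch and val_loss are assumed callbacks supplied by the caller:

```python
def early_stopping(train_epoch, val_loss, max_epochs=100, patience=3):
    """Stop SGD when the validation loss stops decreasing; return epochs run."""
    best, bad, epoch = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_epoch()             # one SGD pass over the training data
        current = val_loss()      # monitor the cost on the validation set
        if current < best:
            best, bad = current, 0
        else:
            bad += 1
            if bad >= patience:   # loss stopped decreasing
                break
    return epoch + 1              # number of epochs k actually trained
```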

SLIDE 37

Kernel function: K(xi, xj) = Φ(xi)·Φ(xj). Does the SVM kernel trick still work with SGD? Yes, but not without a price:

  • Represent w by its kernel expansion: w = Σi αi Φ(xi)
  • Usually the loss gradient at example xj is a multiple of Φ(xj): dL(w)/dw = −ℓ·Φ(xj)
  • Then update the expansion coefficients at step t by combining: αt+1 = (1 − ηλ)·αt + η·ℓ

SLIDE 38

[Shalev-Shwartz et al., ICML '07]

(Figure: the Pegasos update as a function of the batch size |At|: |At| = S gives the subgradient method, |At| = 1 gives stochastic gradient, and each step combines a subgradient step with a projection.)

SLIDE 39

[Shalev-Shwartz et al., ICML '07]

Choose |At| = 1 and a linear kernel over R^n. Theorem [Shalev-Shwartz et al. '07]: the run-time required for Pegasos to find an ε-accurate solution with probability ≥ 1 − δ is Õ(n / (λδε)).

  • The run-time depends on the number of features n
  • It does not depend on the number of examples m
  • It depends on the "difficulty" of the problem (λ and ε)