

SLIDE 1

CS345a: Data Mining
Jure Leskovec and Anand Rajaraman

Stanford University

SLIDE 2

 HW3 is out
 Poster session is on the last day of classes:
  • Thu March 11 at 4:15
 Reports are due March 14
 Final is March 18 at 12:15
  • Open book, open notes
  • No laptop

SLIDE 3

 Which is the best linear separator?

 Data:
  • Examples: (x_1, y_1), …, (x_n, y_n)
 Example i:
  • x_i = (x_i^(1), …, x_i^(d))
  • y_i ∈ {-1, +1}
 Inner product: w · x = Σ_j w^(j) x^(j)

[Figure: '+' and '-' examples in the plane with several candidate linear separators]

SLIDE 4

 Confidence of example i:

   γ_i = (w · x_i) y_i

 For all datapoints:

   γ_i ≥ γ

[Figure: '+' and '-' points around the hyperplane w·x = 0]
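To make the definitions concrete, here is a minimal sketch in code; the function name and the toy data are illustrative, not from the slides.

```python
import numpy as np

# A minimal sketch: per-example confidence gamma_i = (w . x_i) y_i and the
# margin gamma = min_i gamma_i of a linear separator w.
def margin(w, X, y):
    conf = X.dot(w) * y          # gamma_i for every example
    return conf.min()            # gamma = min_i gamma_i

# Toy usage on two separable points:
X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1, -1])
print(margin(np.array([1.0, 1.0]), X, y))   # 2.0
```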

SLIDE 5

 Maximize the margin:
  • Good according to intuition, theory & practice

   max_{γ,w} γ
   s.t. ∀i: y_i (w · x_i) ≥ γ

[Figure: hyperplane w·x = 0 with margin γ separating '+' and '-' points]

SLIDE 6

 Canonical hyperplanes:
  • Projection of x_i on the plane w·x = 0:

   x_i' = x_i - γ_i · w/‖w‖

SLIDE 7

 Maximizing the margin:

   max_{γ,w} γ
   s.t. ∀i: y_i (w · x_i) ≥ γ

 Equivalent:

   min_w ‖w‖²
   s.t. ∀i: y_i (w · x_i) ≥ 1

 SVM with "hard" constraints

SLIDE 8

 If the data is not separable, introduce a penalty for mistakes:

   min_w ½ w·w + C · (# of mistakes)
   s.t. ∀i: y_i (w · x_i) ≥ 1

 Choose C based on cross validation
 How to penalize mistakes?

[Figure: nearly separable '+' and '-' points around w·x = 0; a few points fall on the wrong side]

SLIDE 9

 Introduce slack variables ξ_i:

   min_{w,ξ} ½ w·w + C Σ_{i=1}^{n} ξ_i
   s.t. ∀i: y_i (w · x_i) ≥ 1 - ξ_i,  ξ_i ≥ 0

 Hinge loss: for each datapoint,
  • If margin > 1, don't care
  • If margin < 1, pay linear penalty

[Figure: points violating the margin around w·x = 0 pay slack ξ_i]

SLIDE 10

 SVM in the "natural" form:

   argmin_w f(w)

  • Where:

   f(w) = ½ w·w + C Σ_{i=1}^{n} max{0, 1 - y_i (w · x_i)}
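A minimal sketch of f(w) in code; the array shapes and the function name are illustrative.

```python
import numpy as np

# A minimal sketch of f(w) = 1/2 w.w + C * sum_i max{0, 1 - y_i (w . x_i)}.
# X is an n-by-d matrix of examples, y a vector of +/-1 labels.
def svm_objective(w, X, y, C):
    hinge = np.maximum(0.0, 1.0 - y * X.dot(w))   # per-example hinge loss
    return 0.5 * w.dot(w) + C * hinge.sum()
```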

SLIDE 11

 Use a quadratic solver:
  • Minimize a quadratic function
  • Subject to linear constraints

   min_{w,ξ} ½ w·w + C Σ_{i=1}^{n} ξ_i
   s.t. ∀i: y_i (w · x_i) ≥ 1 - ξ_i

 Stochastic gradient descent:
  • Minimize:

   f(w) = ½ w·w + C Σ_{i=1}^{n} max{0, 1 - y_i (w · x_i)}

  • Update:

   f'(w) = w + C Σ_i L'(x_i · w, y_i)
   w_{t+1} = w_t - η f'(w_t)
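A hedged sketch of one SGD epoch for this objective; visiting examples in random order and applying the regularizer gradient at every step is one common variant, not necessarily the exact scheme on the slide.

```python
import numpy as np

# One SGD epoch for f(w) above: at each example take the regularizer
# gradient w plus the hinge subgradient -C y_i x_i when the margin is violated.
def sgd_epoch(w, X, y, C, eta):
    for i in np.random.permutation(X.shape[0]):
        grad = w.copy()                      # gradient of 1/2 w.w
        if y[i] * X[i].dot(w) < 1.0:         # hinge term active?
            grad -= C * y[i] * X[i]
        w = w - eta * grad
    return w
```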

SLIDE 12

 Example by Leon Bottou:
  • Reuters RCV1 document corpus
  • m = 781k training examples, 23k test examples
  • d = 50k features
 Training time:

[Figure: training-time comparison]

SLIDE 13

[Figure]

SLIDE 14

 What if we subsample the dataset?
  • SGD on the full dataset vs.
  • Conjugate gradient on n training examples

SLIDE 15

 Need to choose the learning rate η:

   w_{t+1} = w_t - η_t L'(w_t)

 Leon suggests:
  • Select a small subsample
  • Try various rates η
  • Pick the one that most reduces the loss
  • Use that η for the next 100k iterations on the full dataset
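A hedged sketch of that recipe, reusing the svm_objective and sgd_epoch sketches from the earlier slides; the subsample size and the candidate rates are illustrative assumptions.

```python
import numpy as np

# Try several rates on a small subsample and keep the one that most
# reduces the loss; then use it for the long run on the full dataset.
def pick_eta(X, y, C, candidates=(1e-4, 1e-3, 1e-2, 1e-1)):
    idx = np.random.choice(len(X), size=min(1000, len(X)), replace=False)
    Xs, ys = X[idx], y[idx]                       # small subsample
    best_eta, best_loss = None, np.inf
    for eta in candidates:
        w = sgd_epoch(np.zeros(X.shape[1]), Xs, ys, C, eta)
        loss = svm_objective(w, Xs, ys, C)        # which rate reduced loss most?
        if loss < best_loss:
            best_eta, best_loss = eta, loss
    return best_eta                               # use for ~100k iterations
```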

SLIDE 16

 Stopping criteria: how many iterations of SGD?
  • Early stopping with cross validation:
    • Create a validation set
    • Monitor the cost function on the validation set
    • Stop when the loss stops decreasing
  • Early stopping a priori:
    • Extract two disjoint subsamples A and B of the training data
    • Determine the number of epochs k by training on A, stopping by validating on B
    • Train for k epochs on the full dataset
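A hedged sketch of the a-priori variant, again reusing the earlier sgd_epoch and svm_objective sketches.

```python
import numpy as np

# Early stopping a priori: pick the epoch count k on subsample A by
# validating on B, then train for k epochs on the full dataset.
def epochs_by_validation(XA, yA, XB, yB, C, eta, max_epochs=100):
    w = np.zeros(XA.shape[1])
    best_k, best_loss = 0, np.inf
    for k in range(1, max_epochs + 1):
        w = sgd_epoch(w, XA, yA, C, eta)
        loss = svm_objective(w, XB, yB, C)   # held-out cost after k epochs
        if loss < best_loss:
            best_k, best_loss = k, loss
        else:
            break                            # loss stopped decreasing
    return best_k
```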

SLIDE 17

 Kernel function: K(x_i, x_j) = Φ(x_i) · Φ(x_j)
 Does the SVM kernel trick still work?
 Yes (but not without a price):
  • Represent w with its kernel expansion:

   w = Σ_i α_i Φ(x_i)

  • Usually the gradient of the loss is a multiple of a single Φ(x_j):

   dL(w)/dw ∝ Φ(x_j)

  • Then update w at epoch t by combining:

   α_t = (1 - ηλ) α_t + …
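A hedged sketch of such an update for the hinge loss, keeping w only through its coefficients α over a precomputed Gram matrix K; the decay-all/bump-one form is one standard variant, not necessarily the slide's exact update.

```python
import numpy as np

# SGD in the kernel expansion w = sum_j alpha_j Phi(x_j) for the
# lambda-regularized hinge loss; K is the precomputed Gram matrix.
def kernel_sgd_step(alpha, K, y, i, eta, lam):
    alpha *= (1.0 - eta * lam)       # regularizer: shrink w, i.e. decay alpha
    score = K[i].dot(alpha)          # w . Phi(x_i) under the expansion
    if y[i] * score < 1.0:           # margin violated: add eta*y_i*Phi(x_i),
        alpha[i] += eta * y[i]       # which is a bump to alpha_i
    return alpha
```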

SLIDE 18

[Shalev‐Shwartz et al. ICML '07]

 We had before:

   min_{w,ξ} ½ w·w + C Σ_i ξ_i,  s.t. ∀i: y_i (w · x_i) ≥ 1 - ξ_i

 Can replace C with λ:

   min_w (λ/2) w·w + (1/m) Σ_i max{0, 1 - y_i (w · x_i)}

SLIDE 19

[Shalev‐Shwartz et al. ICML '07]

 Each step works on a subsample A_t of the training data, |A_t| = S:
  • |A_t| = m: subgradient method
  • |A_t| = 1: stochastic gradient
  • Each step is followed by a subgradient projection

SLIDE 20

[Shalev‐Shwartz et al. ICML '07]

 Choosing |A_t| = 1 and a linear kernel over R^n:
 Theorem [Shalev‐Shwartz et al. '07]:
  • Run-time required for Pegasos to find an ε-accurate solution with prob. > 1-δ
 Run-time depends on the number of features n
  • Does not depend on the number of examples m
  • Depends on the "difficulty" of the problem (λ and ε)
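A hedged sketch of Pegasos with |A_t| = 1; the step size 1/(λt) and the projection onto the ball of radius 1/√λ follow the cited paper, the rest is illustrative.

```python
import numpy as np

# Pegasos with one random example per step: hinge subgradient step with
# eta_t = 1/(lam*t), then projection onto the ball of radius 1/sqrt(lam).
def pegasos(X, y, lam, T):
    m, n = X.shape
    w = np.zeros(n)
    for t in range(1, T + 1):
        i = np.random.randint(m)             # A_t = {i}
        eta = 1.0 / (lam * t)
        w = (1.0 - eta * lam) * w            # regularizer step
        if y[i] * X[i].dot(w) < 1.0:         # hinge subgradient active?
            w += eta * y[i] * X[i]
        norm = np.linalg.norm(w)             # projection step
        if norm > 0:
            w *= min(1.0, 1.0 / (np.sqrt(lam) * norm))
    return w
```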

SLIDE 21

 SVM and structured output prediction
 Setting:
  • Assume: data is i.i.d. from some distribution P(X, Y)
  • Given: training sample (x_1, y_1), …, (x_n, y_n)
  • Goal: find a function from input space X to output space Y, where the outputs are complex objects

SLIDE 22

 Examples:
  • Natural language parsing
    • Given a sequence of words x, predict the parse tree y
    • Dependencies come from structural constraints, since y has to be a tree

[Figure: x = "The dog chased the cat"; y = parse tree S → NP VP, with NP → Det N and VP → V NP]

SLIDE 23

 Approach: view as a multi-class classification task
  • Every complex output is one class
 Problems:
  • Exponentially many classes!
  • How to predict efficiently?
  • How to learn efficiently?
  • Potentially huge model!
  • Manageable number of features?

[Figure: x = "The dog chased the cat" scored against candidate parse trees y_1, y_2, …, y_k]

SLIDE 24

 Feature vector Ψ(x, y) describes the match between x and y
 Learn a single weight vector w and rank by w · Ψ(x, y)
 Hard-margin optimization problem:

   min_w ½ ‖w‖²
   s.t. ∀i, ∀y ≠ y_i: w · Ψ(x_i, y_i) ≥ w · Ψ(x_i, y) + 1

SLIDE 25

[Yue et al., SIGIR '07]

 Ranking:
  • Given a query x, predict a ranking y
  • Dependencies between results (e.g. avoid redundant hits)
  • Loss function over rankings (e.g. AvgPrec)

[Figure: query x = "SVM" and a predicted ranking y:
 1. Kernel-Machines  2. SVM-Light  3. Learning with Kernels  4. SV Meppen Fan Club
 5. Service Master & Co.  6. School of Volunteer Management  7. SV Mattersburg Online  …]

SLIDE 26

[Yue et al., SIGIR '07]

 Given:
  • a complete (weak) ranking of documents for a query
 Predict:
  • a ranking for the input query and document set
 The true labeling is a ranking where the relevant documents are all ranked in the front
 An incorrect labeling is any other ranking
 There are intractably many rankings, thus an intractable number of constraints!

SLIDE 27

[Yue et al., SIGIR '07]

 Let x be a set of documents/query examples
 Let y denote a weak ranking (pairwise orderings): y_ij ∈ {-1, +1}
 SVM objective function:

   min_w ½ ‖w‖² + C Σ_i ξ_i

 Constraints are defined for each incorrect ranking y' over the set of documents x:

   ∀y' ≠ y: wᵀΨ(y, x) ≥ wᵀΨ(y', x) + Δ(y', y) - ξ

  • Ψ is the match between target and prediction

SLIDE 28

[Yue et al., SIGIR '07]

 Loss:
  • Average precision is the average of the precision scores at the rank locations of each relevant document.
 Ex: a ranking with relevant documents at ranks 1, 3, and 5 has average precision

   (1/3) · (1/1 + 2/3 + 3/5) ≈ 0.76
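A small sketch that reproduces this number; the function name and the 1-based rank convention are illustrative.

```python
# Average precision: mean of precision-at-rank over the ranks that hold
# relevant documents (ranks are 1-based).
def avg_prec(relevant_ranks, num_relevant):
    return sum((i + 1) / r for i, r in enumerate(sorted(relevant_ranks))) / num_relevant

print(avg_prec([1, 3, 5], 3))   # (1/1 + 2/3 + 3/5) / 3 = 0.7555... ~ 0.76
```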

SLIDE 29

[Yue et al., SIGIR '07]

 Minimize:

   ½ ‖w‖² + C Σ_i ξ_i

 subject to:

   ∀y' ≠ y: wᵀΨ(y, x) ≥ wᵀΨ(y', x) + Δ(y', y) - ξ

 where:

   Ψ(y', x) = Σ_{i: rel} Σ_{j: !rel} y'_ij (x_i - x_j)

 and:

   Δ(y', y) = 1 - AvgPrec(y')

 After learning w, predict by sorting on w · x_i
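A hedged sketch of this Ψ; the list/array conventions are illustrative.

```python
import numpy as np

# Pairwise joint feature map Psi(y', x) = sum over relevant i and
# non-relevant j of y'_ij (x_i - x_j), where y'_ij = +1 if i is ranked
# above j in the ordering, else -1.
def psi(order, X, rel):
    """order: doc indices best-first; rel: set of relevant indices."""
    position = {doc: p for p, doc in enumerate(order)}
    f = np.zeros(X.shape[1])
    for i in rel:
        for j in (d for d in order if d not in rel):
            y_ij = 1.0 if position[i] < position[j] else -1.0
            f += y_ij * (X[i] - X[j])
    return f
```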

SLIDE 30

[Yue et al., SIGIR '07]

 Original SVM problem:
  • Exponentially many constraints
  • Most are dominated by a small set of "important" constraints
 Structural SVM approach:
  • Repeatedly finds the next most violated constraint…
  • …until the set of constraints is a good approximation


SLIDE 34

 Cutting-plane training loop [Jo06] [JoFinYu08]:

   Input: training sample, C, tolerance ε
   REPEAT
     FOR each training example
       Compute the most violated constraint
       IF it is violated by more than ε
         Add the constraint to the working set
         Optimize StructSVM over the working set
       ENDIF
     ENDFOR
   UNTIL the working set has not changed during the iteration

SLIDE 35

 Cutting plane algorithm:
  • STEP 1: Solve the SVM objective function using only the current working set of constraints
  • STEP 2: Using the model learned in STEP 1, find the most violated constraint from the exponential set of constraints
  • STEP 3: If the constraint returned in STEP 2 is more violated than the most violated constraint in the working set by some small constant, add that constraint to the working set
  • Repeat STEPs 1-3 until no additional constraints are added
  • Return the most recent model that was trained in STEP 1
  • STEPs 1-3 are guaranteed to loop for at most a polynomial number of iterations [Tsochantaridis et al. 2005]
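A hedged skeleton of this loop; solve_qp and find_most_violated are placeholder callables (the QP solver of STEP 1 and the oracle of STEP 2), and the constraint objects with a violation method are an illustrative interface, not a real library's API.

```python
# Cutting-plane loop: alternate solving the restricted QP and adding the
# most violated constraint, until nothing new is added.
def cutting_plane(examples, C, eps, solve_qp, find_most_violated, max_iters=1000):
    working_set = []
    w, xi = solve_qp(working_set, C)         # STEP 1 on an empty working set
    for _ in range(max_iters):
        added = False
        for x, y in examples:
            c = find_most_violated(w, x, y)  # STEP 2: the oracle
            if c.violation(w) > xi + eps:    # STEP 3: violated by more than eps?
                working_set.append(c)
                w, xi = solve_qp(working_set, C)
                added = True
        if not added:                        # no new constraints: done
            return w
    return w
```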

SLIDE 36

 Structural SVM is an oracle framework
 Requires a subroutine for finding the most violated constraint
  • Depends on the formulation of the loss function and the joint feature representation
 Exponential number of constraints!
 Efficient algorithm exists in the case of optimizing Mean Avg. Prec. (MAP):
  • MAP is invariant to the order of documents within a relevance class

SLIDE 37

[Yue et al., SIGIR '07]

   H(y'; w) = Δ(y', y) + Σ_{i: rel} Σ_{j: !rel} y'_ij (wᵀx_i - wᵀx_j)

 Observation:
  • MAP is invariant to the order of documents within a relevance class
  • Swapping two relevant or two non-relevant documents does not change MAP
 The joint SVM score is optimized by sorting by document score, w·x
 Reduces to finding an interleaving between two sorted lists of documents

SLIDE 38

   H(y'; w) = Δ(y', y) + Σ_{i: rel} Σ_{j: !rel} y'_ij (wᵀx_i - wᵀx_j)

 Finding the most violated constraint (see the sketch after this list):
  • Start with the perfect ranking
  • Consider swapping adjacent relevant/non-relevant documents
  • Find the best feasible ranking of the non-relevant document
  • Repeat for the next non-relevant document
  • Never want to swap past the previous non-relevant document
  • Repeat until all non-relevant documents have been considered
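It can help to see the oracle stated naively first: a brute-force sketch that enumerates all rankings and maximizes Δ(y', y) + w·Ψ(y', x), reusing the avg_prec and psi sketches from the earlier slides. This is exponential, of course; the greedy interleaving procedure above is what makes the MAP case tractable.

```python
import itertools
import numpy as np

# Brute-force oracle: argmax over all rankings of Delta(y', y) + w . Psi(y', x).
# Only feasible for tiny document sets; illustrates what the greedy
# interleaving computes efficiently.
def most_violated(w, X, rel, docs):
    best, best_val = None, -np.inf
    for order in itertools.permutations(docs):
        ranks = [p + 1 for p, d in enumerate(order) if d in rel]
        delta = 1.0 - avg_prec(ranks, len(rel))      # Delta = 1 - AvgPrec
        val = delta + w.dot(psi(list(order), X, rel))
        if val > best_val:
            best, best_val = order, val
    return best
```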


SLIDE 43

 SVM Formulation:
  • SVMs optimize a tradeoff between model complexity and MAP loss
  • Exponential number of constraints (one for each incorrect ranking)
  • Structural SVMs find a small subset of important constraints
  • Requires a sub-procedure to find the most violated constraint

 Find Most Violated Constraint:
  • Loss function is invariant to re-ordering of relevant documents
  • SVM score imposes an ordering of the relevant documents
  • Finding an interleaving of two sorted lists
  • Loss function has certain monotonic properties
  • Efficient algorithm