CS345a: Data Mining
Jure Leskovec and Anand Rajaraman
Stanford University

Announcements:
HW3 is out.
Poster session is on the last day of classes: Thu March 11 at 4:15.
Reports are due March 14.
Final is March 18 at 12:15.
3/2/2010 Jure Leskovec & Anand Rajaraman, Stanford CS345a: Data Mining 2
Which is the best linear separator?

Setup:
Example i: x_i = (x_i^(1), …, x_i^(d)), with label y_i ∈ {−1, +1}
Inner product: w · x = Σ_j w^(j) x^(j)
The separating hyperplane is the set of points with w · x = 0.
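The prediction rule of a linear separator can be sketched in a few lines (a minimal illustration; the weight vector and example below are made up):

```python
import numpy as np

# A linear classifier predicts the label of x as sign(w . x).
# w and x here are made-up illustrations, not values from the lecture.
w = np.array([1.0, -2.0, 0.5])
x = np.array([3.0, 1.0, 2.0])

score = w @ x                  # inner product: sum_j w^(j) * x^(j)
label = 1 if score >= 0 else -1

print(score)   # 1*3 + (-2)*1 + 0.5*2 = 2.0
print(label)   # 1
```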
Confidence: the confidence of the prediction for datapoint i is γ_i = y_i (w · x_i), its signed distance from the hyperplane w · x = 0 (up to scaling by ||w||).
For all datapoints we want high confidence: y_i (w · x_i) ≥ γ for every i.
Maximize the margin:
max_{γ, w} γ   s.t.   ∀i: y_i (w · x_i) ≥ γ
Canonical hyperplanes: rescale w so that the closest datapoints satisfy |w · x| = 1.
The distance of a point x from the hyperplane w · x = 0 is |w · x| / ||w||, so for a canonical hyperplane the margin is γ = 1 / ||w||.
Maximizing the margin:
max_{γ, w} γ   s.t.   ∀i: y_i (w · x_i) ≥ γ
Equivalent (for a canonical hyperplane γ = 1/||w||, so maximizing γ means minimizing ||w||):
min_w (1/2) ||w||²   s.t.   ∀i: y_i (w · x_i) ≥ 1
This is the SVM with "hard" constraints. If the data are not separable, introduce a penalty on mistakes:
min_w (1/2) w · w + C · (#mistakes)
s.t.   y_i (w · x_i) ≥ 1 for all non-mistake datapoints i
Two questions remain: how to choose C (e.g., based on validation data), and how to penalize mistakes (simply counting them is hard to optimize).
Introduce slack variables ξ_i:
min_{w, ξ} (1/2) w · w + C Σ_{i=1}^n ξ_i
s.t.   ∀i: y_i (w · x_i) ≥ 1 − ξ_i,   ξ_i ≥ 0
This penalty is the hinge loss. For each datapoint: if its margin is > 1, we don't care; if its margin is < 1, we pay a linear penalty.
SVM in the "natural" (unconstrained) form:
min_w (1/2) ||w||² + C Σ_{i=1}^n max(0, 1 − y_i (w · x_i))
A regularization term plus the hinge loss summed over the datapoints.
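The natural form is easy to evaluate directly. A minimal sketch (the data and weights are made up for illustration):

```python
import numpy as np

def svm_objective(w, X, y, C):
    """Soft-margin SVM in its "natural" (unconstrained) form:
    0.5*||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i))."""
    margins = y * (X @ w)                   # y_i * (w . x_i) for each example
    hinge = np.maximum(0.0, 1.0 - margins)  # zero penalty once margin >= 1
    return 0.5 * (w @ w) + C * hinge.sum()

# Two points classified with margin >= 1 incur no hinge penalty,
# so only the regularization term remains.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = np.array([1.0, 0.0])
print(svm_objective(w, X, y, C=1.0))   # 0.5*||w||^2 = 0.5
```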
How do we solve it? Two options.

Use a quadratic solver on the constrained form:
min_{w, ξ} (1/2) w · w + C Σ_{i=1}^n ξ_i   s.t.   ∀i: y_i (w · x_i) ≥ 1 − ξ_i

Or use stochastic gradient descent on the natural form: at step t, pick a datapoint i and update
w_{t+1} = w_t − η_t ∇_w [ (1/2) ||w||² + C max(0, 1 − y_i (w_t · x_i)) ]
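The stochastic gradient option can be sketched as follows (a minimal sketch; the toy dataset and the 1/t learning-rate decay are assumptions for illustration, not taken from the lecture):

```python
import numpy as np

def sgd_svm(X, y, C=1.0, n_epochs=50, eta0=0.1, seed=0):
    """Stochastic (sub)gradient descent on the unconstrained SVM objective.
    For one example i, a subgradient of
    0.5*||w||^2 + C*max(0, 1 - y_i*(w . x_i)) is
    w - C*y_i*x_i if the margin is violated, else just w.
    The eta0/t decay is one common choice (an assumption here)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            t += 1
            eta = eta0 / t
            if y[i] * (X[i] @ w) < 1:        # hinge active: margin violated
                w -= eta * (w - C * y[i] * X[i])
            else:
                w -= eta * w                 # only the regularizer contributes
    return w

# Tiny separable toy problem (made up for illustration).
X = np.array([[2.0, 0.5], [3.0, 1.0], [-2.0, -0.5], [-3.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = sgd_svm(X, y)
print(np.sign(X @ w))   # all four points classified correctly
```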
Example by Leon Bottou: training-time comparison of SVM solvers. (The plots on the original slides are not reproduced here.)
What if we subsample the dataset?
Need to choose the learning rate η_t for the update w_{t+1} = w_t − η_t ∇_w L.
Leon suggests decreasing it over time; a common schedule is η_t ∝ 1/t.
Stopping criteria: how many SGD iterations are enough? Decide by validating on a held-out set B: stop when the validation error no longer improves.
Kernel function: K(x_i, x_j) = Φ(x_i) · Φ(x_j). Does the SVM kernel trick still work with SGD? Yes, but not without a price: w must then be represented implicitly as a combination of the examples seen so far, so updates grow more expensive as training proceeds.
[Shalev-Shwartz et al., ICML '07]
We had before: min_w (1/2) ||w||² + C Σ_i max(0, 1 − y_i (w · x_i))
Can replace C with a regularization parameter λ:
min_w (λ/2) ||w||² + (1/m) Σ_{i=1}^m max(0, 1 − y_i (w · x_i))
[Shalev-Shwartz et al., ICML '07]
At each step the algorithm uses a subset A_t of the examples:
|A_t| = m: subgradient method over the full dataset
|A_t| = 1: stochastic gradient
|A_t| = S, 1 < S < m: mini-batch
Each step is a subgradient update followed by a projection of w onto a ball.
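A minimal sketch of the |A_t| = 1 variant (Pegasos). The step size η_t = 1/(λt) and projection radius 1/√λ follow the cited paper's recipe; the toy dataset is made up:

```python
import numpy as np

def pegasos(X, y, lam=0.1, n_iter=2000, seed=0):
    """Pegasos with |A_t| = 1: pick one random example per step,
    take a subgradient step with eta_t = 1/(lam*t), then project w
    back onto the ball of radius 1/sqrt(lam) (the optional step)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_iter + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)
        if y[i] * (X[i] @ w) < 1:          # margin violated: hinge term active
            w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
        else:
            w = (1.0 - eta * lam) * w      # regularizer shrinkage only
        radius = 1.0 / np.sqrt(lam)
        norm = np.linalg.norm(w)
        if norm > radius:                  # projection onto the ball
            w *= radius / norm
    return w

# Tiny separable toy problem (made up for illustration).
X = np.array([[2.0, 0.5], [3.0, 1.0], [-2.0, -0.5], [-3.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = pegasos(X, y)
print(np.sign(X @ w))   # the toy set is classified correctly
```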
[Shalev-Shwartz et al., ICML '07]
Choosing |A_t| = 1 and a linear kernel over R^n:
Theorem [Shalev-Shwartz et al. '07]: the algorithm finds an ε-accurate solution with probability > 1 − δ in time Õ(n / (λ ε)).
Run-time depends on the number of features n. It does not depend on the number of examples m. It depends on the "difficulty" of the problem (ε and λ).
SVMs and structured output prediction.
Setting: inputs x come with complex, structured outputs y.
Outputs can be complex objects.
Example: x is a sentence, y is its parse tree. For x = "The dog chased the cat":
y = [S [NP [Det The] [N dog]] [VP [V chased] [NP [Det the] [N cat]]]]
Approach: view it as a multi-class classification task, with one class per candidate parse tree y_1, …, y_k for the input x = "The dog chased the cat".
Problem: the number of candidate trees k grows exponentially with sentence length, so one classifier per class does not scale.
A feature vector Φ(x, y) describes the match between input x and output y.
Learn a single weight vector w and rank candidate outputs by w · Φ(x, y).
Hard-margin optimization problem:
min_w (1/2) ||w||²   s.t.   ∀i, ∀y ≠ y_i: w · Φ(x_i, y_i) ≥ w · Φ(x_i, y) + 1
[Yue et al., SIGIR '07]
Ranking example: for the query x = "SVM", the predicted output y is a ranking of documents:
1. Kernel-Machines
2. SVM-Light
3. Learning with Kernels
4. SV Meppen Fan Club
5. Service Master & Co.
6. School of Volunteer Management
7. SV Mattersburg Online
…
[Yue et al., SIGIR '07]
Given: a query and a set of candidate documents.
Predict: a ranking for the input query and document set.
The true labeling is a ranking in which the relevant documents are all ranked above the non-relevant ones.
An incorrect labeling is any other ranking.
There are intractably many rankings, and thus intractably many incorrect labelings to rule out.
[Yue et al., SIGIR '07]
Let x_i be the set of documents for query i, and let y denote a ranking, encoded by pairwise orderings y_jk.
SVM objective function:
min_{w, ξ} (1/2) ||w||² + C Σ_i ξ_i
Constraints are defined for each incorrect ranking y':
∀i, ∀y' ≠ y_i: w^T Φ(y_i, x_i) ≥ w^T Φ(y', x_i) + Δ(y', y_i) − ξ_i
[Yue et al., SIGIR '07]
Loss: Δ(y', y) = 1 − AvgPrec(y')
Example: a ranking with the relevant documents at positions 1, 3, and 5 has average precision (1/3)(1/1 + 2/3 + 3/5) ≈ .76.
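The average-precision computation in the example can be sketched directly:

```python
def average_precision(ranking):
    """ranking: list of 0/1 relevance labels in ranked order.
    AvgPrec averages, over the relevant positions, the precision
    (#relevant so far) / position at each of them."""
    precisions, n_rel = [], 0
    for pos, rel in enumerate(ranking, start=1):
        if rel:
            n_rel += 1
            precisions.append(n_rel / pos)
    return sum(precisions) / len(precisions)

# Relevant documents at ranks 1, 3 and 5, as in the slide's example:
print(round(average_precision([1, 0, 1, 0, 1]), 2))  # (1/1 + 2/3 + 3/5)/3 -> 0.76
```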
[Yue et al., SIGIR '07]
The optimization problem (maximizing the margin):
min_{w, ξ} (1/2) ||w||² + C ξ
s.t.   ∀y' ≠ y: w^T Φ(y, x) ≥ w^T Φ(y', x) + Δ(y', y) − ξ
where
Φ(y', x) = Σ_{i: rel, j: !rel} y'_ij (x_i − x_j)
Δ(y', y) = 1 − AvgPrec(y')
After learning w, predict by sorting documents on w · x_i.
[Yue et al., SIGIR '07]
Original SVM problem: an exponential number of constraints, but most are dominated by a small set of "important" constraints.
Structural SVM approach: repeatedly find the next most violated constraint… until the working set of constraints is a good approximation.
Cutting-plane loop [Jo06] [JoFinYu08]:
Input: training data, C, tolerance ε; start with an empty working set.
REPEAT:
  Solve over the current working set of constraints.
  Find the most violated constraint.
  If it is violated by more than ε, add the constraint to the working set.
UNTIL the working set has not changed during the iteration.
Cutting plane algorithm:
STEP 1: Solve the optimization problem using only the current working set of constraints.
STEP 2: Find the most violated constraint from the exponential set of constraints.
STEP 3: If that constraint is violated more than any constraint in the working set by some small constant, add it to the working set.
Repeat STEPs 1-3 until no additional constraints are added.
STEPs 1-3 are guaranteed to loop for at most a polynomial number of iterations.
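The loop structure can be sketched generically. In this sketch, `solve` and `oracle` are hypothetical problem-specific callbacks (not from the lecture), and the toy instance (lower bounds on a scalar) just illustrates how one "important" constraint can dominate a large set:

```python
def cutting_plane(solve, oracle, eps=1e-6, max_iter=100):
    """Generic cutting-plane loop:
      STEP 1: solve over the current working set of constraints,
      STEP 2: ask the oracle for the most violated constraint,
      STEP 3: if it is violated by more than eps, add it and repeat."""
    working = []
    w = solve(working)
    for _ in range(max_iter):
        constraint, violation = oracle(w)
        if violation <= eps:
            break                      # working set is a good approximation
        working.append(constraint)
        w = solve(working)
    return w, working

# Toy instance: minimize w subject to a large set of constraints w >= b_k.
# Only the largest bound matters -- a single "important" constraint.
bounds = list(range(100))                    # constraints w >= b_k
solve = lambda working: max([0] + working)   # smallest w satisfying working set
oracle = lambda w: (max(bounds), max(0, max(bounds) - w))

w, working = cutting_plane(solve, oracle)
print(w, len(working))   # 99 1
```

Note the loop terminates after adding a single constraint: the other 99 bounds are dominated by it, mirroring the "small set of important constraints" observation.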
Structural SVM is an oracle framework: it requires a subroutine for finding the most violated constraint, given the joint feature representation Φ(x, y).
Despite the exponential number of constraints, an efficient such subroutine exists in the case of optimizing average precision, by exploiting relevance classes.
[Yue et al., SIGIR '07]
Finding the most violated constraint means maximizing over y':
H(y'; w) = Δ(y', y) + w^T Φ(y', x) = Δ(y', y) + Σ_{i: rel, j: !rel} y'_ij (w^T x_i − w^T x_j)
Observation: MAP is invariant to the order of documents within a relevance class; swapping two documents within the same relevance class does not change MAP.
The joint SVM score is optimized by sorting documents by score w · x within each relevance class.
Maximizing H therefore reduces to finding an interleaving between two sorted lists of documents (relevant and non-relevant).
Greedy procedure for maximizing
H(y'; w) = Δ(y', y) + Σ_{i: rel, j: !rel} y'_ij (w^T x_i − w^T x_j):
Start with the perfect ranking.
Consider swapping adjacent relevant/non-relevant documents.
Find the best feasible position for the non-relevant document.
Repeat for the next non-relevant document; we never want to swap it past the previous non-relevant document.
Repeat until all non-relevant documents have been considered.
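The greedy interleaving exploits structure that a brute-force search makes explicit. The sketch below simply enumerates all interleavings of the two score-sorted lists and picks the maximizer of Δ(y', y) + w^T Φ(y', x); it is exponentially slower than the greedy algorithm but illustrates the same search space on tiny, made-up score lists:

```python
from itertools import combinations

def violation_score(rank):
    """rank: list of ('r'/'n', score) pairs in ranked order, where the
    score of a document is w.x. Returns
    Delta(y', y) + sum_{i rel, j nonrel} y'_ij (s_i - s_j),
    with Delta(y', y) = 1 - AvgPrec(y')."""
    prec, n_rel = [], 0
    for pos, (lab, _) in enumerate(rank, start=1):
        if lab == 'r':
            n_rel += 1
            prec.append(n_rel / pos)
    delta = 1.0 - sum(prec) / len(prec)
    pair_term = 0.0
    for a in range(len(rank)):
        for b in range(a + 1, len(rank)):
            (la, sa), (lb, sb) = rank[a], rank[b]
            if la == 'r' and lb == 'n':
                pair_term += sa - sb   # y'_ij = +1: relevant ranked above
            elif la == 'n' and lb == 'r':
                pair_term -= sb - sa   # y'_ij = -1: relevant ranked below
    return delta + pair_term

def most_violated_ranking(rel_scores, nonrel_scores):
    """Brute-force argmax over interleavings of the two sorted lists."""
    rel = sorted(rel_scores, reverse=True)   # within-class order: by score
    non = sorted(nonrel_scores, reverse=True)
    n = len(rel) + len(non)
    best = None
    for slots in combinations(range(n), len(non)):  # non-relevant positions
        rank, ri, ni = [], 0, 0
        for p in range(n):
            if p in slots:
                rank.append(('n', non[ni])); ni += 1
            else:
                rank.append(('r', rel[ri])); ri += 1
        if best is None or violation_score(rank) > violation_score(best):
            best = rank
    return best

# A low-scoring relevant doc vs a high-scoring non-relevant doc:
# the most violated ranking puts the non-relevant document on top.
print(most_violated_ranking([1.0], [2.0]))   # [('n', 2.0), ('r', 1.0)]
```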
Summary.
SVM formulation:
SVMs optimize a tradeoff between model complexity and MAP loss.
There is an exponential number of constraints (one for each incorrect ranking).
Structural SVMs find a small subset of important constraints, which requires a sub-procedure to find the most violated constraint.
Finding the most violated constraint:
The loss function is invariant to re-ordering of the relevant documents, and the SVM score imposes an ordering of the relevant documents.
The search reduces to finding an interleaving of two sorted lists; since the loss function has certain monotonic properties, an efficient greedy algorithm exists.