

  1. CS345a: Data Mining. Jure Leskovec and Anand Rajaraman, Stanford University

  2.  HW3 is out
       Poster session is on the last day of classes: Thu March 11 at 4:15
       Reports are due March 14
       Final is March 18 at 12:15 (open book, open notes, no laptops)

  3.  Which is the best linear separator?
       Data: examples $(x_1, y_1), \dots, (x_n, y_n)$
       Example $i$: $x_i = (x_i^{(1)}, \dots, x_i^{(d)})$, label $y_i \in \{-1, +1\}$
       Inner product: $w \cdot x = \sum_{j=1}^{d} w^{(j)} x^{(j)}$

  4.  Separating hyperplane: $w \cdot x = 0$
       Confidence of datapoint $i$: $\gamma_i = (w \cdot x_i)\, y_i$
       For all datapoints: $\gamma_i \geq \gamma$
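Note: a tiny numpy sketch (not from the slides; the toy data and weight vector are made up) of the confidence computation above:

    import numpy as np

    # toy data: 3 examples in d=2 dimensions, labels in {-1, +1}
    X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, -2.0]])
    y = np.array([+1, +1, -1])
    w = np.array([0.8, 0.6])          # some candidate separating direction

    conf = y * (X @ w)                # gamma_i = (w . x_i) * y_i
    print(conf)                       # all positive: every point is on its correct side
    print(conf.min())                 # the smallest confidence is the margin gamma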

  5.  Maximize the margin: good according to intuition, theory, and practice
       $\max_{\gamma, w} \; \gamma$  subject to  $\forall i: \; y_i (x_i \cdot w) \geq \gamma$

  6.  Canonical hyperplanes
       Projection of $x_i$ onto the plane $w \cdot x = 0$: $x = x_i - \gamma_i \frac{w}{\|w\|}$, which gives the distance $\gamma_i = \frac{w \cdot x_i}{\|w\|}$

  7.  Maximizing the margin:
       $\max_{\gamma, w} \; \gamma$  subject to  $\forall i: \; y_i (x_i \cdot w) \geq \gamma$
       Equivalent: $\min_w \; \frac{1}{2}\|w\|^2$  subject to  $\forall i: \; y_i (x_i \cdot w) \geq 1$
       SVM with "hard" constraints
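Note: on small data this hard-constraint problem can simply be handed to a generic quadratic-programming package. A minimal sketch using cvxpy (a library choice not mentioned in the slides; toy data, and no bias term, since the slides use hyperplanes through the origin):

    import cvxpy as cp
    import numpy as np

    # toy separable data
    X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-1.0, -3.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])

    d = X.shape[1]
    w = cp.Variable(d)

    # minimize (1/2)||w||^2  subject to  y_i (x_i . w) >= 1 for all i
    objective = cp.Minimize(0.5 * cp.sum_squares(w))
    constraints = [cp.multiply(y, X @ w) >= 1]
    cp.Problem(objective, constraints).solve()

    print(w.value)   # maximum-margin separator for the toy data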

  8.  If the data is not separable, introduce a penalty:
       $\min_w \; \frac{1}{2}\|w\|^2 + C \cdot (\text{number of mistakes})$  subject to  $\forall i: \; y_i (x_i \cdot w) \geq 1$
       Choose C based on cross-validation
       How should mistakes be penalized?

  9.  Introduce slack variables $\xi_i$:
       $\min_{w,\, \xi_i \geq 0} \; \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \xi_i$  subject to  $\forall i: \; y_i (x_i \cdot w) \geq 1 - \xi_i$
       Hinge loss: for each datapoint, if the margin is > 1, don't care; if the margin is < 1, pay a linear penalty

  10.  SVM in the "natural" (unconstrained) form:
       $\arg\min_w \; f(w)$, where $f(w) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \max\{0,\; 1 - y_i (x_i \cdot w)\}$
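Note: the natural form is easy to evaluate directly. A short numpy sketch (toy values assumed), reused by the sketches further down:

    import numpy as np

    def svm_objective(w, X, y, C=1.0):
        # f(w) = 1/2 w.w + C * sum_i max(0, 1 - y_i (x_i . w))
        hinge = np.maximum(0.0, 1.0 - y * (X @ w))
        return 0.5 * w @ w + C * hinge.sum()

    X = np.array([[2.0, 1.0], [1.5, -0.5], [-1.0, -2.0]])
    y = np.array([+1.0, +1.0, -1.0])
    print(svm_objective(np.array([0.8, 0.6]), X, y))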

  11.  Use a quadratic solver:
        Minimize a quadratic function subject to linear constraints:
        $\min_{w,\, \xi_i \geq 0} \; \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \xi_i$  subject to  $\forall i: \; y_i (x_i \cdot w) \geq 1 - \xi_i$
       Stochastic gradient descent:
        Minimize $f(w) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \max\{0,\; 1 - y_i (x_i \cdot w)\}$
        Update using one example at a time: $w^{(t+1)} \leftarrow w^{(t)} - \eta_t \, \frac{\partial L(w^{(t)}, x_i, y_i)}{\partial w}$
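Note: a minimal stochastic-gradient-descent sketch for this objective (the constant step size, the C value, and the even split of the regularizer across examples are illustrative assumptions, not from the slides):

    import numpy as np

    def svm_sgd(X, y, C=1.0, eta=0.01, epochs=20, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(epochs):
            for i in rng.permutation(n):
                # subgradient of  (1/(2n))||w||^2 + C * max(0, 1 - y_i (x_i . w)),
                # i.e. the regularizer split evenly across the n per-example terms
                grad = w / n
                if y[i] * (X[i] @ w) < 1.0:
                    grad -= C * y[i] * X[i]
                w -= eta * grad
        return w

The same toy X, y from the previous sketch can be used: w = svm_sgd(X, y).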

  12.  Example by Leon Bottou:
       Reuters RCV1 document corpus
       m = 781k training examples, 23k test examples
       d = 50k features
       Training time: (comparison shown as a figure on the slide)

  13. (figure-only slide, no text)

  14.  What if we subsample the dataset?
       SGD on the full dataset vs. conjugate gradient on n training examples

  15.  Need to choose the learning rate $\eta$:
       $w^{(t+1)} = w^{(t)} - \eta_t \, L'(w^{(t)})$
       Leon suggests:
        Select a small subsample
        Try various rates $\eta$
        Pick the one that most reduces the loss
        Use that $\eta$ for the next 100k iterations on the full dataset
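Note: a sketch of that recipe, reusing the svm_sgd and svm_objective sketches above (the subsample size and candidate rates are arbitrary choices):

    import numpy as np

    def pick_learning_rate(X, y, rates=(1.0, 0.1, 0.01, 0.001), sample_size=1000, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
        Xs, ys = X[idx], y[idx]
        # run one epoch on the subsample for each rate and keep the one with the lowest loss
        losses = {eta: svm_objective(svm_sgd(Xs, ys, eta=eta, epochs=1), Xs, ys)
                  for eta in rates}
        return min(losses, key=losses.get)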

  16.  Stopping criteria: how many iterations of SGD?
       Early stopping with cross-validation:
        Create a validation set
        Monitor the cost function on the validation set
        Stop when the loss stops decreasing
       Early stopping a priori:
        Extract two disjoint subsamples A and B of the training data
        Determine the number of epochs k by training on A and stopping by validating on B
        Train for k epochs on the full dataset
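Note: a sketch of the "a priori" variant, again reusing svm_sgd and svm_objective from above (the 50/50 split, epoch cap, and retraining from scratch for each k are simplifications, not the course's prescription):

    import numpy as np

    def choose_epochs(X, y, max_epochs=50, eta=0.01, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        A, B = idx[: len(X) // 2], idx[len(X) // 2 :]    # two disjoint subsamples
        best_k, best_loss = 1, float("inf")
        for k in range(1, max_epochs + 1):
            w = svm_sgd(X[A], y[A], eta=eta, epochs=k)   # train on A for k epochs
            loss = svm_objective(w, X[B], y[B])          # validate on B
            if loss < best_loss:
                best_k, best_loss = k, loss
        return best_k                                    # then train for best_k epochs on all data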

  17.  Kernel function: $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$
       Does the SVM kernel trick still work with SGD? Yes, but not without a price:
        Represent w with its kernel expansion: $w = \sum_i \alpha_i \, \Phi(x_i)$
        Usually the gradient of the loss involves a single feature vector: $\partial L(w)/\partial w = -\beta \, \Phi(x_j)$
        Then update w at step t by updating the coefficients: $\alpha^{(t+1)} = (1 - \eta)\, \alpha^{(t)} + \eta\, \beta\, e_j$
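Note: a minimal kernelized-SGD sketch that never forms w explicitly and keeps only the coefficients alpha (the RBF kernel, the lambda-regularized form, and the step size are illustrative assumptions):

    import numpy as np

    def rbf_kernel(a, b, gamma=0.5):
        # Gaussian RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def kernel_svm_sgd(X, y, lam=0.01, eta=0.1, epochs=5, kernel=rbf_kernel):
        n = len(X)
        alpha = np.zeros(n)                    # w = sum_i alpha_i * Phi(x_i)
        for _ in range(epochs):
            for j in np.random.permutation(n):
                # decision value w . Phi(x_j) through the kernel expansion
                f_j = sum(alpha[i] * kernel(X[i], X[j]) for i in range(n) if alpha[i] != 0.0)
                alpha *= 1.0 - eta * lam       # the regularizer shrinks every coefficient
                if y[j] * f_j < 1.0:           # a hinge violation touches only alpha_j
                    alpha[j] += eta * y[j]
        return alpha

The "price" is visible here: each update needs kernel evaluations against every example with a nonzero coefficient.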

  18. [Shalev-Shwartz et al., ICML '07]
       We had before: $\min_w \; \frac{1}{2} w \cdot w + C \sum_{i=1}^{n} \max\{0,\; 1 - y_i (x_i \cdot w)\}$
       Can replace C with $\lambda$: $\min_w \; \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max\{0,\; 1 - y_i (x_i \cdot w)\}$

  19. [Shalev-Shwartz et al., ICML '07]
       Pegasos works on a subsample $A_t$ of the training set $S$ at each step:
        $|A_t| = 1$: stochastic gradient
        $|A_t| = |S|$: subgradient method
        With the additional projection step: subgradient projection
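Note: a sketch of the $|A_t| = 1$ case with the optional projection onto the ball of radius $1/\sqrt{\lambda}$; the decreasing step size $\eta_t = 1/(\lambda t)$ follows the Pegasos paper, while the default lambda and iteration count are arbitrary toy choices:

    import numpy as np

    def pegasos(X, y, lam=0.1, iters=10_000, project=True, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for t in range(1, iters + 1):
            eta = 1.0 / (lam * t)                  # decreasing step size
            i = rng.integers(n)                    # |A_t| = 1: one random example
            if y[i] * (X[i] @ w) < 1.0:
                w = (1.0 - eta * lam) * w + eta * y[i] * X[i]
            else:
                w = (1.0 - eta * lam) * w
            if project:
                # optional projection onto the ball of radius 1/sqrt(lam)
                norm = np.linalg.norm(w)
                if norm > 0:
                    w *= min(1.0, (1.0 / np.sqrt(lam)) / norm)
        return w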

  20. [Shalev-Shwartz et al., ICML '07]
       Choosing $|A_t| = 1$ and a linear kernel over $R^n$
       Theorem [Shalev-Shwartz et al. '07]: the run-time required for Pegasos to find an $\epsilon$-accurate solution with probability $> 1 - \delta$ is $\tilde{O}\!\left(\frac{n}{\lambda \epsilon}\right)$
        Run-time depends on the number of features n
        Does not depend on the number of examples m
        Depends on the "difficulty" of the problem ($\lambda$ and $\epsilon$)

  21.  SVM and structured output prediction
       Setting:
        Assume: data is i.i.d. from a distribution $P(X, Y)$
        Given: a training sample $(x_1, y_1), \dots, (x_n, y_n)$
        Goal: find a function $h$ from the input space X to the output space Y, where the outputs are complex objects

  22.  Examples: natural language parsing
        Given a sequence of words x, predict the parse tree y
        Dependencies come from structural constraints, since y has to be a tree
       Example: x = "The dog chased the cat", y = its parse tree (S → NP VP, NP → Det N, VP → V NP)

  23.  Approach: view this as a multi-class classification task
        Every complex output (e.g., every candidate parse tree y_1, y_2, ..., y_k of x) is one class
       Problems:
        Exponentially many classes!
        How to predict efficiently?
        How to learn efficiently?
        Potentially huge model!
        Manageable number of features?

  24.  The feature vector $\Psi(x, y)$ describes the match between x and y
       Learn a single weight vector w and rank candidate outputs by $w \cdot \Psi(x, y)$
       Hard-margin optimization problem:
       $\min_w \; \frac{1}{2}\|w\|^2$  subject to  $\forall i,\; \forall y \neq y_i: \; w \cdot \Psi(x_i, y_i) - w \cdot \Psi(x_i, y) \geq 1$
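Note: a toy sketch of prediction with a joint feature vector: enumerate a small hand-made candidate set and rank by w . Psi(x, y). The feature map, candidates, and weights are invented for illustration (a tag sequence stands in for the parse-tree example); real structured prediction avoids enumeration, e.g. with dynamic programming:

    import numpy as np

    def psi(x, y):
        # joint feature vector describing how well output y matches input x
        return np.array([
            sum(1 for xi, yi in zip(x, y) if xi.istitle() and yi == "N"),  # capitalized word tagged N
            sum(1 for yi in y if yi == "V"),                               # number of V tags
            sum(1 for a, b in zip(y, y[1:]) if a == "Det" and b == "N"),   # Det followed by N
        ], dtype=float)

    def predict(x, candidates, w):
        # pick the candidate output whose joint features score highest under w
        return max(candidates, key=lambda y: w @ psi(x, y))

    x = ["The", "dog", "chased", "the", "cat"]
    candidates = [("Det", "N", "V", "Det", "N"), ("N", "V", "Det", "N", "Det")]
    w = np.array([1.0, 0.5, 2.0])
    print(predict(x, candidates, w))    # -> ("Det", "N", "V", "Det", "N")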

  25. [Yue et al., SIGIR '07]
       Ranking:
        Given a query x, predict a ranking y
        Dependencies between results (e.g., avoid redundant hits)
        Loss function over rankings (e.g., AvgPrec)
       Example: x = "SVM", y = the ranking
        1. Kernel-Machines
        2. SVM-Light
        3. Learning with Kernels
        4. SV Meppen Fan Club
        5. Service Master & Co.
        6. School of Volunteer Management
        7. SV Mattersburg Online
        ...
