Learning Kernel-Based Halfspaces with the Zero-One Loss


  1. Learning Kernel-Based Halfspaces with the Zero-One Loss. Shai Shalev-Shwartz$^1$, Ohad Shamir$^1$ and Karthik Sridharan$^2$. $^1$The Hebrew University, $^2$TTI Chicago. COLT, June 2010.

  2. Halfspaces. Hypothesis class: $\{x \mapsto \phi_{0\text{-}1}(\langle w, x \rangle)\}$. [Figure: plot of the step transfer function $\phi_{0\text{-}1}$ against $\langle w, x \rangle$.] Sample complexity: $O(d/\epsilon^2)$.
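
For concreteness, a minimal sketch (mine, not from the deck) of a halfspace predictor and its empirical zero-one loss, with labels in $\{0, 1\}$ as in the loss used later:

    import numpy as np

    def zero_one_loss(w, X, y):
        # phi_{0-1}(<w, x_i>): predict 1 if <w, x_i> >= 0, else 0
        preds = (X @ w >= 0).astype(int)
        # fraction of examples where the prediction misses the label
        return np.mean(preds != y)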

  3. Kernel-Based Halfspaces. Hypothesis class: $\{x \mapsto \phi_{0\text{-}1}(\langle w, \varphi(x) \rangle)\}$. [Figure: plot of $\phi_{0\text{-}1}$ against $\langle w, \varphi(x) \rangle$.] Sample complexity: $\infty$.

  4. Fuzzy Kernel-Based Halfspaces. Hypothesis class: $\{x \mapsto \phi_{\mathrm{sig}}(\langle w, \varphi(x) \rangle)\}$. [Figure: plot of the sigmoid transfer function $\phi_{\mathrm{sig}}$ against $\langle w, \varphi(x) \rangle$.] Sample complexity: $O(L^2/\epsilon^2)$.

  5. Fuzzy Kernel-Based Halfspaces. Hypothesis class: $\{x \mapsto \phi_{\mathrm{sig}}(\langle w, \varphi(x) \rangle)\}$. [Figure: as above.] Sample complexity: $O(L^2/\epsilon^2)$. Time complexity: ??

  6. Formal Results. Time complexity of learning fuzzy halfspaces. Positive result: can be done in $\mathrm{poly}(1/\epsilon)$ time for any fixed $L$ (worst case): do convex optimization, just use a different kernel. Negative result: cannot be done in $\mathrm{poly}(L, 1/\epsilon)$ time.

  7. Related Work: Surrogates to the 0-1 Loss. Popular fix: replace the 0-1 loss with a convex surrogate (e.g., the hinge loss). No finite-sample approximation guarantees! Asymptotic guarantees exist (Zhang 2004; Bartlett, Jordan, McAuliffe 2006).

  8. Related Work: Surrogates to the 0-1 Loss. Popular fix: replace the 0-1 loss with a convex surrogate (e.g., the hinge loss). No finite-sample approximation guarantees! Asymptotic guarantees exist (Zhang 2004; Bartlett, Jordan, McAuliffe 2006). Ben-David & Simon 2000: by a covering technique, fuzzy halfspaces can be learned in $\exp(O(L^2/\epsilon^2))$ time. Worst case = best case. Exponentially worse than our bound (however, it requires exponentially fewer examples).

  9. Related Work: Directly for the 0-1 Loss. Agnostically learning halfspaces in $\mathrm{poly}(d^{1/\epsilon^4})$ time (Kalai, Klivans, Mansour, Servedio 2005; Blais, O'Donnell, Wimmer 2008). But only under distributional assumptions, and dimension-dependent (problematic for kernels).

  10. Technique Idea. Original class: $H = \{x \mapsto \phi(\langle w, x \rangle) : \|w\| = 1\}$. Loss function: $\mathbb{E}_{\hat{y} \sim \phi(\langle w, x \rangle)}\, \mathbf{1}[\hat{y} \neq y]$.

  11. Technique Idea. Original class: $H = \{x \mapsto \phi(\langle w, x \rangle) : \|w\| = 1\}$. Loss function: $\mathbb{E}_{\hat{y} \sim \phi(\langle w, x \rangle)}\, \mathbf{1}[\hat{y} \neq y] = |\phi(\langle w, x \rangle) - y|$.

  12. Technique Idea. Original class: $H = \{x \mapsto \phi(\langle w, x \rangle) : \|w\| = 1\}$. Loss function: $\mathbb{E}_{\hat{y} \sim \phi(\langle w, x \rangle)}\, \mathbf{1}[\hat{y} \neq y] = |\phi(\langle w, x \rangle) - y|$. Problem: the loss is non-convex w.r.t. $w$. The main idea: work with a larger hypothesis class for which the loss becomes convex, replacing $x \mapsto \phi(\langle w, x \rangle)$ by $x \mapsto \langle v, \psi(x) \rangle$.
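
For $y \in \{0, 1\}$, a one-line check (not on the slide) of the equality above, where $\hat{y}$ is a Bernoulli draw with bias $\phi(\langle w, x \rangle)$:

    \mathbb{E}_{\hat{y} \sim \phi(\langle w, x \rangle)}\, \mathbf{1}[\hat{y} \neq y]
      = \begin{cases}
          \Pr[\hat{y} = 1] = \phi(\langle w, x \rangle) & y = 0, \\
          \Pr[\hat{y} = 0] = 1 - \phi(\langle w, x \rangle) & y = 1
        \end{cases}
      = |\phi(\langle w, x \rangle) - y|.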

  13. Technique Idea. Assume $\|x\| \le 1$, and suppose that $\phi(a)$ is a polynomial $\sum_{j=0}^{\infty} \beta_j a^j$. Then $\phi(\langle w, x \rangle) = \sum_{j=0}^{\infty} \beta_j (\langle w, x \rangle)^j$.

  14. Technique Idea. Assume $\|x\| \le 1$, and suppose that $\phi(a)$ is a polynomial $\sum_{j=0}^{\infty} \beta_j a^j$. Then $\phi(\langle w, x \rangle) = \sum_{j=0}^{\infty} \beta_j (\langle w, x \rangle)^j = \sum_{j=0}^{\infty} \sum_{k_1,\ldots,k_j} (2^{j/2} \beta_j w_{k_1} \cdots w_{k_j})(2^{-j/2} x_{k_1} \cdots x_{k_j})$.

  15. Technique Idea. Assume $\|x\| \le 1$, and suppose that $\phi(a)$ is a polynomial $\sum_{j=0}^{\infty} \beta_j a^j$. Then $\phi(\langle w, x \rangle) = \sum_{j=0}^{\infty} \beta_j (\langle w, x \rangle)^j = \sum_{j=0}^{\infty} \sum_{k_1,\ldots,k_j} (2^{j/2} \beta_j w_{k_1} \cdots w_{k_j})(2^{-j/2} x_{k_1} \cdots x_{k_j}) = \langle v_w, \Psi(x) \rangle$.

  16. Technique Idea. Assume $\|x\| \le 1$, and suppose that $\phi(a)$ is a polynomial $\sum_{j=0}^{\infty} \beta_j a^j$. Then $\phi(\langle w, x \rangle) = \sum_{j=0}^{\infty} \beta_j (\langle w, x \rangle)^j = \sum_{j=0}^{\infty} \sum_{k_1,\ldots,k_j} (2^{j/2} \beta_j w_{k_1} \cdots w_{k_j})(2^{-j/2} x_{k_1} \cdots x_{k_j}) = \langle v_w, \Psi(x) \rangle$. $\Psi$ is the feature mapping of the RKHS corresponding to the infinite-dimensional polynomial kernel $k(x, x') = \frac{1}{1 - \frac{1}{2}\langle x, x' \rangle}$.
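
A small numerical sketch (not on the slides) of why this closed form holds: $\langle \Psi(x), \Psi(x') \rangle = \sum_j 2^{-j} \langle x, x' \rangle^j$ is a geometric series, which sums to the stated kernel whenever $\|x\|, \|x'\| \le 1$. The function names below are illustrative:

    import numpy as np

    def poly_kernel(x, xp):
        # k(x, x') = 1 / (1 - <x, x'>/2), valid for ||x||, ||x'|| <= 1
        return 1.0 / (1.0 - 0.5 * np.dot(x, xp))

    def truncated_series(x, xp, degree=50):
        # partial sum of <Psi(x), Psi(x')> = sum_j (<x, x'> / 2)^j
        s = 0.5 * np.dot(x, xp)
        return sum(s ** j for j in range(degree + 1))

    rng = np.random.default_rng(0)
    x = rng.normal(size=5);  x /= np.linalg.norm(x)        # ||x|| = 1
    xp = rng.normal(size=5); xp /= 2 * np.linalg.norm(xp)  # ||x'|| = 1/2
    print(poly_kernel(x, xp), truncated_series(x, xp))     # nearly equal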

  17. Technique Idea. Therefore, given a sample $(x_1, y_1), \ldots, (x_m, y_m)$, $\min_{w : \|w\| = 1} \frac{1}{m} \sum_{i=1}^{m} |\phi(\langle w, x_i \rangle) - y_i|$ is equivalent to $\min_{v_w : \|w\| = 1} \frac{1}{m} \sum_{i=1}^{m} |\langle v_w, \Psi(x_i) \rangle - y_i|$.

  18. Technique Idea. Therefore, given a sample $(x_1, y_1), \ldots, (x_m, y_m)$, $\min_{w : \|w\| = 1} \frac{1}{m} \sum_{i=1}^{m} |\phi(\langle w, x_i \rangle) - y_i|$ is equivalent to $\min_{v_w : \|w\| = 1} \frac{1}{m} \sum_{i=1}^{m} |\langle v_w, \Psi(x_i) \rangle - y_i|$. Algorithm: $\arg\min_{v : \|v\| \le B} \frac{1}{m} \sum_{i=1}^{m} |\langle v, \Psi(x_i) \rangle - y_i|$, using the infinite-dimensional polynomial kernel.
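
A sketch of one way to solve this convex problem (illustrative, not the authors' exact procedure): by the representer theorem $v = \sum_j \alpha_j \Psi(x_j)$, so $\langle v, \Psi(x_i) \rangle = (K\alpha)_i$ and $\|v\|^2 = \alpha^\top K \alpha$, and projected subgradient descent on $\alpha$ applies:

    import numpy as np

    def learn_fuzzy_halfspace(X, y, B, steps=2000, lr=0.05):
        # Sketch: min_{||v|| <= B} (1/m) sum_i |<v, Psi(x_i)> - y_i|,
        # with v = sum_j alpha_j Psi(x_j) by the representer theorem.
        m = X.shape[0]
        K = 1.0 / (1.0 - 0.5 * (X @ X.T))   # Gram matrix of the kernel above
        alpha = np.zeros(m)
        for _ in range(steps):
            preds = K @ alpha               # <v, Psi(x_i)> for every i
            g = K @ np.sign(preds - y) / m  # subgradient w.r.t. alpha
            alpha -= lr * g
            norm2 = alpha @ K @ alpha       # ||v||^2 = alpha' K alpha
            if norm2 > B ** 2:              # project back onto the B-ball
                alpha *= B / np.sqrt(norm2)
        return alpha

Rows of X are assumed to satisfy $\|x_i\| \le 1$ so the kernel's series converges; a new point $x$ is scored as $\sum_j \alpha_j k(x_j, x)$.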

  19. Technique Idea. Theorem. Let $H_B$ consist of all predictors of the form $x \mapsto \phi(\langle w, x \rangle)$, where $\phi(a) = \sum_{j=0}^{\infty} \beta_j a^j$ with $\sum_{j=0}^{\infty} 2^j \beta_j^2 \le B$. With $O(B/\epsilon^2)$ examples, the returned predictor $\hat{v}$ satisfies, w.h.p., $\mathrm{err}_D(\hat{v}) \le \min_{v \in H_B} \mathrm{err}_D(v) + \epsilon$.

  20. Technique Idea. Algorithm: $\arg\min_{v : \|v\| \le B} \frac{1}{m} \sum_{i=1}^{m} |\langle v, \Psi(x_i) \rangle - y_i|$, using the infinite-dimensional polynomial kernel.

  21. Technique Idea. Algorithm: $\arg\min_{v : \|v\| \le B} \frac{1}{m} \sum_{i=1}^{m} |\langle v, \Psi(x_i) \rangle - y_i|$, using the infinite-dimensional polynomial kernel. The same algorithm is competitive against all $\phi$ with coefficient bound $B$, including the optimal one for the data distribution. [Figure: two example transfer functions on $[-1, 1]$.]

  22. Technique Idea. Algorithm: $\arg\min_{v : \|v\| \le B} \frac{1}{m} \sum_{i=1}^{m} |\langle v, \Psi(x_i) \rangle - y_i|$, using the infinite-dimensional polynomial kernel. The same algorithm is competitive against all $\phi$ with coefficient bound $B$, including the optimal one for the data distribution. [Figure: two example transfer functions on $[-1, 1]$.] In practice, the parameter $B$ is chosen by cross-validation. The algorithm can run much faster, depending on the distribution.

  23. Example: Error Function. $\phi_{\mathrm{erf}}(\langle w, x \rangle) = \frac{1 + \operatorname{erf}(\sqrt{\pi}\, L \langle w, x \rangle)}{2}$. [Figure: plot of $\phi_{\mathrm{erf}}$ against $\langle w, \Psi(x) \rangle$.]

  24. Example: Error Function. $\phi_{\mathrm{erf}}(\langle w, x \rangle) = \frac{1 + \operatorname{erf}(\sqrt{\pi}\, L \langle w, x \rangle)}{2}$. [Figure: plot of $\phi_{\mathrm{erf}}$ against $\langle w, \Psi(x) \rangle$.] $\phi_{\mathrm{erf}}$ can be written as an infinite-degree polynomial, so $x \mapsto \phi_{\mathrm{erf}}(\langle w, x \rangle)$ can be replaced by $x \mapsto \langle v, \psi(x) \rangle$.

  25. Example: Error Function. $\phi_{\mathrm{erf}}(\langle w, x \rangle) = \frac{1 + \operatorname{erf}(\sqrt{\pi}\, L \langle w, x \rangle)}{2}$. [Figure: plot of $\phi_{\mathrm{erf}}$ against $\langle w, \Psi(x) \rangle$.] $\phi_{\mathrm{erf}}$ can be written as an infinite-degree polynomial, so $x \mapsto \phi_{\mathrm{erf}}(\langle w, x \rangle)$ can be replaced by $x \mapsto \langle v, \psi(x) \rangle$. Unfortunately, the resulting bound has a bad dependence on $L$. Can we get a better bound?
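
To make the bad dependence concrete, here is a small numerical sketch (my calculation, not from the deck): plugging the Taylor series $\operatorname{erf}(z) = \frac{2}{\sqrt{\pi}} \sum_{n \ge 0} \frac{(-1)^n z^{2n+1}}{n!\,(2n+1)}$ into $\phi_{\mathrm{erf}}$ gives $\beta_0 = \frac{1}{2}$ and $\beta_{2n+1} = \frac{(-1)^n (\sqrt{\pi} L)^{2n+1}}{\sqrt{\pi}\, n!\, (2n+1)}$, and summing the theorem's coefficient bound $B = \sum_j 2^j \beta_j^2$ shows it exploding with $L$:

    import math

    def erf_coeff_bound(L, terms=200):
        # B = sum_j 2^j * beta_j^2 for phi_erf(a) = (1 + erf(sqrt(pi)*L*a)) / 2
        B = 0.25         # j = 0 contribution: beta_0^2 = 1/4
        t = 2.0 * L * L  # j = 1 contribution: 2 * beta_1^2, with beta_1 = L
        for n in range(terms):
            B += t
            # ratio of consecutive odd-j contributions, from the series above
            r = math.pi * L * L * (2 * n + 1) / ((n + 1) * (2 * n + 3))
            t *= 4.0 * r * r
        return B

    for L in (1, 2, 3, 4):
        print(L, erf_coeff_bound(L))  # blows up roughly like exp(O(L^2))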

  26. Sigmoid Function. $\phi_{\mathrm{sig}}(\langle w, x \rangle) = \frac{1}{1 + \exp(-4L \langle w, x \rangle)}$. [Figure: plot of $\phi_{\mathrm{sig}}$ against $\langle w, \Psi(x) \rangle$.]
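
The deck ends here. For intuition (my comparison, not a claim from the slides), note that both transfer functions are normalized to have slope $L$ at the origin, so $\phi_{\mathrm{erf}}$ can serve as a smooth, polynomial-expressible stand-in for $\phi_{\mathrm{sig}}$; a quick check of how closely they track each other:

    import math

    L = 3.0

    def phi_sig(a):
        return 1.0 / (1.0 + math.exp(-4.0 * L * a))

    def phi_erf(a):
        return (1.0 + math.erf(math.sqrt(math.pi) * L * a)) / 2.0

    # largest gap between the two transfer functions on a grid over [-1, 1]
    grid = [i / 100.0 - 1.0 for i in range(201)]
    print(max(abs(phi_sig(a) - phi_erf(a)) for a in grid))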
