SLIDE 1

Learning Kernel-Based Halfspaces with the Zero-One Loss

Shai Shalev-Shwartz¹, Ohad Shamir¹ and Karthik Sridharan²

¹The Hebrew University  ²TTI Chicago

COLT, June 2010

SLIDE 2

Halfspaces

Hypothesis Class: {x → φ_{0-1}(⟨w, x⟩)}

[Figure: φ_{0-1}(⟨w, x⟩) plotted against ⟨w, x⟩: a step function]

Sample Complexity: O(d/ε²)

SLIDE 3

Kernel-Based Halfspaces

Hypothesis Class: {x → φ_{0-1}(⟨w, ψ(x)⟩)}

[Figure: φ_{0-1}(⟨w, ψ(x)⟩) plotted against ⟨w, ψ(x)⟩: a step function]

Sample Complexity: ∞

SLIDES 4-5

Fuzzy Kernel-Based Halfspaces

Hypothesis Class: {x → φ_sig(⟨w, ψ(x)⟩)}

[Figure: φ_sig(⟨w, ψ(x)⟩) plotted against ⟨w, ψ(x)⟩: a smooth step]

Sample Complexity: O(L²/ε²)
Time Complexity: ??

SLIDE 6

Formal Results

Time complexity of learning fuzzy halfspaces:

Positive Result: can be done in poly(1/ε) time for any fixed L (worst case)

Do convex optimization, just use a different kernel...

Negative Result: can't be done in poly(L, 1/ε) time

SLIDES 7-8

Related Work: Surrogates to the 0-1 Loss

Popular fix: replace the 0-1 loss with a convex loss (e.g., the hinge loss)

No finite-sample approximation guarantees! Asymptotic guarantees exist (Zhang 2004; Bartlett, Jordan & McAuliffe 2006)


Ben-David & Simon 2000: by a covering technique, fuzzy halfspaces can be learned in exp(O(L²/ε²)) time

Worst case = best case. Exponentially worse than our time bound (however, it requires exponentially fewer examples)

SLIDE 9

Related Work: Directly for the 0-1 Loss

Agnostically learning halfspaces in poly(d^{1/ε⁴}) time (Kalai, Klivans, Mansour & Servedio 2005; Blais, O'Donnell & Wimmer 2008)

But only under distributional assumptions. Dimension-dependent (problematic for kernels)

SLIDES 10-12

Technique Idea

Original class: H = {x → φ(⟨w, x⟩) : ‖w‖ = 1}

Loss function: E_{ŷ∼φ(⟨w,x⟩)}[1_{ŷ≠y}] = |φ(⟨w, x⟩) − y|
(predict ŷ = 1 with probability φ(⟨w, x⟩); for y ∈ {0, 1} the expected 0-1 loss is then |φ(⟨w, x⟩) − y|)

Problem: the loss is non-convex w.r.t. w

The main idea: work with a larger hypothesis class for which the loss becomes convex:
x → φ(⟨w, x⟩)  ⇝  x → ⟨v, ψ(x)⟩
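As a quick sanity check on the non-convexity claim, here is a minimal numerical sketch (ours, not from the talk), using the sigmoid transfer defined later in the deck: along a segment in w, the midpoint loss exceeds the average of the endpoint losses.

```python
import numpy as np

L = 3.0
phi_sig = lambda a: 1.0 / (1.0 + np.exp(-4.0 * L * a))

# Loss of a scalar predictor w on a single example (x, y) = (1, 0):
loss = lambda w: abs(phi_sig(w * 1.0) - 0.0)

w1, w2 = 0.0, 1.0
midpoint = loss((w1 + w2) / 2)        # ~0.9975
average = (loss(w1) + loss(w2)) / 2   # ~0.75
assert midpoint > average             # convexity in w is violated
```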

SLIDES 13-16

Technique Idea

Assume ‖x‖ ≤ 1, and suppose that φ(a) is a polynomial Σ_{j=0}^∞ β_j a^j. Then

φ(⟨w, x⟩) = Σ_{j=0}^∞ β_j ⟨w, x⟩^j
          = Σ_{j=0}^∞ Σ_{k₁,…,k_j} (2^{j/2} β_j w_{k₁} ⋯ w_{k_j}) (2^{−j/2} x_{k₁} ⋯ x_{k_j})
          = ⟨v_w, Ψ(x)⟩

Ψ is the feature mapping of the RKHS corresponding to the infinite-dimensional polynomial kernel
k(x, x′) = 1 / (1 − ⟨x, x′⟩ / 2)
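To see why this kernel arises, note that ⟨Ψ(x), Ψ(x′)⟩ = Σ_{j=0}^∞ 2^{−j} ⟨x, x′⟩^j is a geometric series summing to 1/(1 − ⟨x, x′⟩/2). A minimal numerical check (ours; `poly_kernel` is an illustrative name, assuming NumPy):

```python
import numpy as np

def poly_kernel(x, xp):
    """Infinite-dimensional polynomial kernel k(x, x') = 1 / (1 - <x, x'>/2).

    Valid whenever ||x||, ||x'|| <= 1, so that |<x, x'>| <= 1 < 2.
    """
    return 1.0 / (1.0 - 0.5 * np.dot(x, xp))

# Sanity check: k is the closed form of the geometric series
# sum_j 2^{-j} <x, x'>^j, which is exactly <Psi(x), Psi(x')>.
rng = np.random.default_rng(0)
x = rng.normal(size=5); x /= np.linalg.norm(x)    # unit vectors
xp = rng.normal(size=5); xp /= np.linalg.norm(xp)
series = sum(0.5**j * np.dot(x, xp)**j for j in range(200))
assert np.isclose(series, poly_kernel(x, xp))
```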

SLIDES 17-18

Technique Idea

Therefore, given a sample (x₁, y₁), …, (x_m, y_m),

min_{w : ‖w‖=1} (1/m) Σ_{i=1}^m |φ(⟨w, x_i⟩) − y_i|

is equivalent to

min_{v_w : ‖w‖=1} (1/m) Σ_{i=1}^m |⟨v_w, Ψ(x_i)⟩ − y_i|

Algorithm: argmin_{v : ‖v‖≤B} (1/m) Σ_{i=1}^m |⟨v, Ψ(x_i)⟩ − y_i|, using the infinite-dimensional polynomial kernel
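The optimization is convex, and by the representer theorem the minimizer can be written as v = Σ_i α_i Ψ(x_i), so that ⟨v, Ψ(x_i)⟩ = (Kα)_i and ‖v‖² = αᵀKα for the Gram matrix K. A minimal projected-subgradient sketch of the kernelized problem (our illustration under those assumptions, not the authors' implementation; `fit_alpha` is our name):

```python
import numpy as np

def fit_alpha(X, y, B, steps=2000, lr=0.05):
    """Projected subgradient descent on
        (1/m) * sum_i |(K alpha)_i - y_i|   s.t.   alpha^T K alpha <= B^2,
    where K is the Gram matrix of the infinite-dimensional polynomial kernel.
    The learned predictor is x -> sum_i alpha_i * k(x_i, x).
    """
    m = len(y)
    K = 1.0 / (1.0 - 0.5 * X @ X.T)   # Gram matrix; requires ||x_i|| <= 1
    alpha = np.zeros(m)
    for _ in range(steps):
        residual = K @ alpha - y
        # subgradient of (1/m) * sum_i |(K alpha)_i - y_i| w.r.t. alpha
        g = K @ np.sign(residual) / m
        alpha -= lr * g
        # project back onto {alpha : alpha^T K alpha <= B^2}
        norm = np.sqrt(alpha @ K @ alpha)
        if norm > B:
            alpha *= B / norm
    return alpha
```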

SLIDE 19

Technique Idea

Theorem: Let H_B consist of all predictors of the form x → φ(⟨w, x⟩), where φ(a) = Σ_{j=0}^∞ β_j a^j with Σ_{j=0}^∞ 2^j β_j² ≤ B.
With O(B/ε²) examples, the returned predictor v̂ satisfies, w.h.p.,

err_D(v̂) ≤ min_{v ∈ H_B} err_D(v) + ε

SLIDES 20-22

Technique Idea

Algorithm: argmin_{v : ‖v‖≤B} (1/m) Σ_{i=1}^m |⟨v, Ψ(x_i)⟩ − y_i|, using the infinite-dimensional polynomial kernel

The same algorithm is competitive against all φ with coefficient bound B, including the optimal one for the data distribution.

In practice, the parameter B is chosen by cross-validation. The algorithm can work much faster, depending on the distribution. A sketch of the selection step follows below.
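As an illustration of that model-selection step (ours; it reuses `fit_alpha` and the kernel from the earlier snippets, and thresholds the [0, 1]-valued prediction at 1/2 to measure 0-1 error):

```python
import numpy as np

def zero_one_error(alpha, X_train, X_val, y_val):
    """0-1 error of the kernel predictor x -> sum_i alpha_i * k(x_i, x)."""
    K_val = 1.0 / (1.0 - 0.5 * X_val @ X_train.T)
    preds = (K_val @ alpha >= 0.5).astype(float)   # threshold the score at 1/2
    return np.mean(preds != y_val)

def pick_B(X_train, y_train, X_val, y_val, grid=(1, 10, 100, 1000)):
    """Pick the norm bound B with the smallest holdout 0-1 error."""
    scores = {B: zero_one_error(fit_alpha(X_train, y_train, B),
                                X_train, X_val, y_val)
              for B in grid}
    return min(scores, key=scores.get)
```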

SLIDES 23-25

Example - Error Function

φ_erf(⟨w, x⟩) = (1 + erf(√π L ⟨w, x⟩)) / 2

[Figure: φ_erf plotted against its argument: a smooth, L-Lipschitz step]

φ_erf can be written as an infinite-degree polynomial: x → φ_erf(⟨w, x⟩)  ⇝  x → ⟨v, Ψ(x)⟩

Unfortunately, the resulting coefficient bound has a bad dependence on L. Can we get a better bound?
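For reference, the expansion behind this claim is the standard Taylor series of the error function (a textbook identity, not specific to the talk); the odd coefficients grow like (√π L)^{2j+1}/j!, which is what drives the bad dependence on L through the constraint Σ_j 2^j β_j² ≤ B:

```latex
\operatorname{erf}(z)=\frac{2}{\sqrt{\pi}}\sum_{j=0}^{\infty}\frac{(-1)^{j}z^{2j+1}}{j!\,(2j+1)}
\;\Longrightarrow\;
\phi_{\mathrm{erf}}(a)=\frac{1}{2}+\frac{1}{\sqrt{\pi}}\sum_{j=0}^{\infty}\frac{(-1)^{j}(\sqrt{\pi}L)^{2j+1}}{j!\,(2j+1)}\,a^{2j+1}
```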

SLIDES 26-27

Sigmoid Function

φ_sig(⟨w, x⟩) = 1 / (1 + exp(−4L ⟨w, x⟩))

[Figure: φ_sig plotted against its argument: a smooth, L-Lipschitz step]

φ_sig is not a polynomial. However, it can be ε-approximated by a polynomial with coefficient bound B ≤ O(exp(7L log(L/ε))):

  • We use a truncated sum of Chebyshev polynomials
  • Closed-form coefficient bound via tools from complex analysis
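To illustrate the approximation step (our sketch, using NumPy's Chebyshev utilities rather than the paper's construction): interpolate φ_sig at Chebyshev nodes on [−1, 1], which is where ⟨w, x⟩ lives when ‖w‖, ‖x‖ ≤ 1, and watch the uniform error shrink as the degree grows.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

L = 3.0
phi_sig = lambda a: 1.0 / (1.0 + np.exp(-4.0 * L * a))

# Truncated Chebyshev interpolant of phi_sig on [-1, 1]; <w, x> lies in
# this interval when ||w|| = ||x|| = 1.
for deg in (5, 10, 20, 40):
    coeffs = C.chebinterpolate(phi_sig, deg)
    grid = np.linspace(-1, 1, 2001)
    err = np.max(np.abs(C.chebval(grid, coeffs) - phi_sig(grid)))
    print(f"degree {deg:3d}: uniform error {err:.2e}")
```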

SLIDE 28

Sigmoid Function

Worst-Case Guarantee: the fuzzy halfspace class {x → φ_sig(⟨w, x⟩) : ‖w‖ = 1} can be learned in time/sample complexity O(exp(7L log(L/ε))).
Picking φ_sig is just for the analysis; the algorithm is oblivious to the φ used.

SLIDES 29-30

Hardness Result

Better bound? Maybe with some other L-Lipschitz φ?
Proper learning is hard, but here the learner may return any predictor (improper learning).

Theorem: Fuzzy halfspaces with an L-Lipschitz transfer function φ cannot be learned in poly(L, 1/ε) time.

SLIDES 31-34

Proof Idea

Proof by reduction:

Cryptographic assumption: there is no poly-time solution to the Õ(n^{1.5})-unique-shortest-vector problem
⇒ intersections of n^ρ halfspaces over {−1, +1}^n cannot be PAC-learned in poly time (Klivans and Sherstov, 2006)
⇒ single halfspaces over {−1, +1}^n cannot be agnostic-PAC-learned in poly time (otherwise, boosting could be used to learn intersections)
⇒ fuzzy halfspaces over ℝ^n cannot be agnostic-PAC-learned in poly time when L grows polynomially with n

SLIDE 35

Summary

New technique for learning predictors x → φ(⟨w, x⟩), with φ possibly non-convex, under the 0-1 loss

Single algorithm, simultaneously competitive against all φ, including the optimal one for the data distribution

In fact, equivalent to a standard SVM, but composing our kernel on top of the original one