SLIDE 1

Learning Kernel-Based Halfspaces with the Zero-One Loss

Shai Shalev-Shwartz¹, Ohad Shamir¹ and Karthik Sridharan²

¹The Hebrew University  ²TTI Chicago

COLT, June 2010

SLIDE 2

Halfspaces

Hypothesis Class: {x → φ_{0-1}(⟨w, x⟩)}

[Figure: φ_{0-1}(⟨w, x⟩) plotted against ⟨w, x⟩: a step function]

Sample Complexity: O(d/ε²)

SLIDE 3

Kernel-Based Halfspaces

Hypothesis Class: {x → φ_{0-1}(⟨w, ψ(x)⟩)}

[Figure: φ_{0-1}(⟨w, ψ(x)⟩) plotted against ⟨w, ψ(x)⟩: a step function]

Sample Complexity: ∞

SLIDES 4-5

Fuzzy Kernel-Based Halfspaces

Hypothesis Class: {x → φ_sig(⟨w, ψ(x)⟩)}

[Figure: φ_sig(⟨w, ψ(x)⟩) plotted against ⟨w, ψ(x)⟩: a smooth step]

Sample Complexity: O(L²/ε²)
Time Complexity: ??

SLIDE 6

Formal Results

Time complexity of learning fuzzy halfspaces:

Positive Result: can be done in poly(1/ε) time for any fixed L (worst case)

Do convex optimization, just use a different kernel...

Negative Result: can't be done in poly(L, 1/ε) time

SLIDES 7-8

Related Work: Surrogates to the 0-1 Loss

Popular fix: replace the 0-1 loss with a convex loss (e.g., the hinge loss)

No finite-sample approximation guarantees! Asymptotic guarantees exist (Zhang 2004; Bartlett, Jordan & McAuliffe 2006)


Ben-David & Simon 2000: by a covering technique, fuzzy halfspaces can be learned in exp(O(L²/ε²)) time

Worst case = best case. Exponentially worse than our time bound (however, it requires exponentially fewer examples)

SLIDE 9

Related Work: Directly for the 0-1 Loss

Agnostically learning halfspaces in poly(d^{1/ε⁴}) time (Kalai, Klivans, Mansour & Servedio 2005; Blais, O'Donnell & Wimmer 2008)

But only under distributional assumptions. Dimension-dependent (problematic for kernels)

SLIDES 10-12

Technique Idea

Original class: H = {x → φ(⟨w, x⟩) : ‖w‖ = 1}

Loss function: E_{ŷ∼φ(⟨w,x⟩)}[1_{ŷ≠y}] = |φ(⟨w, x⟩) − y|
(predict ŷ = 1 with probability φ(⟨w, x⟩); for y ∈ {0, 1} the expected 0-1 loss is then |φ(⟨w, x⟩) − y|)

Problem: the loss is non-convex w.r.t. w

The main idea: work with a larger hypothesis class for which the loss becomes convex:
x → φ(⟨w, x⟩)  ⇝  x → ⟨v, ψ(x)⟩
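As a quick sanity check on the non-convexity claim, here is a minimal numerical sketch (ours, not from the talk), using the sigmoid transfer defined later in the deck: along a segment in w, the midpoint loss exceeds the average of the endpoint losses.

```python
import numpy as np

L = 3.0
phi_sig = lambda a: 1.0 / (1.0 + np.exp(-4.0 * L * a))

# Loss of a scalar predictor w on a single example (x, y) = (1, 0):
loss = lambda w: abs(phi_sig(w * 1.0) - 0.0)

w1, w2 = 0.0, 1.0
midpoint = loss((w1 + w2) / 2)        # ~0.9975
average = (loss(w1) + loss(w2)) / 2   # ~0.75
assert midpoint > average             # convexity in w is violated
```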

SLIDES 13-16

Technique Idea

Assume ‖x‖ ≤ 1, and suppose that φ(a) is a polynomial Σ_{j=0}^∞ β_j a^j. Then

φ(⟨w, x⟩) = Σ_{j=0}^∞ β_j ⟨w, x⟩^j
          = Σ_{j=0}^∞ Σ_{k₁,…,k_j} (2^{j/2} β_j w_{k₁} ⋯ w_{k_j}) (2^{−j/2} x_{k₁} ⋯ x_{k_j})
          = ⟨v_w, Ψ(x)⟩

Ψ is the feature mapping of the RKHS corresponding to the infinite-dimensional polynomial kernel
k(x, x′) = 1 / (1 − ⟨x, x′⟩ / 2)
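To see why this kernel arises, note that ⟨Ψ(x), Ψ(x′)⟩ = Σ_{j=0}^∞ 2^{−j} ⟨x, x′⟩^j is a geometric series summing to 1/(1 − ⟨x, x′⟩/2). A minimal numerical check (ours; `poly_kernel` is an illustrative name, assuming NumPy):

```python
import numpy as np

def poly_kernel(x, xp):
    """Infinite-dimensional polynomial kernel k(x, x') = 1 / (1 - <x, x'>/2).

    Valid whenever ||x||, ||x'|| <= 1, so that |<x, x'>| <= 1 < 2.
    """
    return 1.0 / (1.0 - 0.5 * np.dot(x, xp))

# Sanity check: k is the closed form of the geometric series
# sum_j 2^{-j} <x, x'>^j, which is exactly <Psi(x), Psi(x')>.
rng = np.random.default_rng(0)
x = rng.normal(size=5); x /= np.linalg.norm(x)    # unit vectors
xp = rng.normal(size=5); xp /= np.linalg.norm(xp)
series = sum(0.5**j * np.dot(x, xp)**j for j in range(200))
assert np.isclose(series, poly_kernel(x, xp))
```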

SLIDES 17-18

Technique Idea

Therefore, given a sample (x₁, y₁), …, (x_m, y_m),

min_{w : ‖w‖=1} (1/m) Σ_{i=1}^m |φ(⟨w, x_i⟩) − y_i|

is equivalent to

min_{v_w : ‖w‖=1} (1/m) Σ_{i=1}^m |⟨v_w, Ψ(x_i)⟩ − y_i|

Algorithm: argmin_{v : ‖v‖≤B} (1/m) Σ_{i=1}^m |⟨v, Ψ(x_i)⟩ − y_i|, using the infinite-dimensional polynomial kernel
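The optimization is convex, and by the representer theorem the minimizer can be written as v = Σ_i α_i Ψ(x_i), so that ⟨v, Ψ(x_i)⟩ = (Kα)_i and ‖v‖² = αᵀKα for the Gram matrix K. A minimal projected-subgradient sketch of the kernelized problem (our illustration under those assumptions, not the authors' implementation; `fit_alpha` is our name):

```python
import numpy as np

def fit_alpha(X, y, B, steps=2000, lr=0.05):
    """Projected subgradient descent on
        (1/m) * sum_i |(K alpha)_i - y_i|   s.t.   alpha^T K alpha <= B^2,
    where K is the Gram matrix of the infinite-dimensional polynomial kernel.
    The learned predictor is x -> sum_i alpha_i * k(x_i, x).
    """
    m = len(y)
    K = 1.0 / (1.0 - 0.5 * X @ X.T)   # Gram matrix; requires ||x_i|| <= 1
    alpha = np.zeros(m)
    for _ in range(steps):
        residual = K @ alpha - y
        # subgradient of (1/m) * sum_i |(K alpha)_i - y_i| w.r.t. alpha
        g = K @ np.sign(residual) / m
        alpha -= lr * g
        # project back onto {alpha : alpha^T K alpha <= B^2}
        norm = np.sqrt(alpha @ K @ alpha)
        if norm > B:
            alpha *= B / norm
    return alpha
```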

SLIDE 19

Technique Idea

Theorem: Let H_B consist of all predictors of the form x → φ(⟨w, x⟩), where φ(a) = Σ_{j=0}^∞ β_j a^j with Σ_{j=0}^∞ 2^j β_j² ≤ B.
With O(B/ε²) examples, the returned predictor v̂ satisfies, w.h.p.,

err_D(v̂) ≤ min_{v ∈ H_B} err_D(v) + ε

SLIDES 20-22

Technique Idea

Algorithm: argmin_{v : ‖v‖≤B} (1/m) Σ_{i=1}^m |⟨v, Ψ(x_i)⟩ − y_i|, using the infinite-dimensional polynomial kernel

The same algorithm is competitive against all φ with coefficient bound B, including the optimal one for the data distribution.

In practice, the parameter B is chosen by cross-validation. The algorithm can work much faster, depending on the distribution. A sketch of the selection step follows below.
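As an illustration of that model-selection step (ours; it reuses `fit_alpha` and the kernel from the earlier snippets, and thresholds the [0, 1]-valued prediction at 1/2 to measure 0-1 error):

```python
import numpy as np

def zero_one_error(alpha, X_train, X_val, y_val):
    """0-1 error of the kernel predictor x -> sum_i alpha_i * k(x_i, x)."""
    K_val = 1.0 / (1.0 - 0.5 * X_val @ X_train.T)
    preds = (K_val @ alpha >= 0.5).astype(float)   # threshold the score at 1/2
    return np.mean(preds != y_val)

def pick_B(X_train, y_train, X_val, y_val, grid=(1, 10, 100, 1000)):
    """Pick the norm bound B with the smallest holdout 0-1 error."""
    scores = {B: zero_one_error(fit_alpha(X_train, y_train, B),
                                X_train, X_val, y_val)
              for B in grid}
    return min(scores, key=scores.get)
```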

SLIDES 23-25

Example - Error Function

φ_erf(⟨w, x⟩) = (1 + erf(√π L ⟨w, x⟩)) / 2

[Figure: φ_erf plotted against its argument: a smooth, L-Lipschitz step]

φ_erf can be written as an infinite-degree polynomial: x → φ_erf(⟨w, x⟩)  ⇝  x → ⟨v, Ψ(x)⟩

Unfortunately, the resulting coefficient bound has a bad dependence on L. Can we get a better bound?
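For reference, the expansion behind this claim is the standard Taylor series of the error function (a textbook identity, not specific to the talk); the odd coefficients grow like (√π L)^{2j+1}/j!, which is what drives the bad dependence on L through the constraint Σ_j 2^j β_j² ≤ B:

```latex
\operatorname{erf}(z)=\frac{2}{\sqrt{\pi}}\sum_{j=0}^{\infty}\frac{(-1)^{j}z^{2j+1}}{j!\,(2j+1)}
\;\Longrightarrow\;
\phi_{\mathrm{erf}}(a)=\frac{1}{2}+\frac{1}{\sqrt{\pi}}\sum_{j=0}^{\infty}\frac{(-1)^{j}(\sqrt{\pi}L)^{2j+1}}{j!\,(2j+1)}\,a^{2j+1}
```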

SLIDES 26-27

Sigmoid Function

φ_sig(⟨w, x⟩) = 1 / (1 + exp(−4L ⟨w, x⟩))

[Figure: φ_sig plotted against its argument: a smooth, L-Lipschitz step]

φ_sig is not a polynomial. However, it can be ε-approximated by a polynomial with coefficient bound B ≤ O(exp(7L log(L/ε))):

  • We use a truncated sum of Chebyshev polynomials
  • Closed-form coefficient bound via tools from complex analysis
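To illustrate the approximation step (our sketch, using NumPy's Chebyshev utilities rather than the paper's construction): interpolate φ_sig at Chebyshev nodes on [−1, 1], which is where ⟨w, x⟩ lives when ‖w‖, ‖x‖ ≤ 1, and watch the uniform error shrink as the degree grows.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

L = 3.0
phi_sig = lambda a: 1.0 / (1.0 + np.exp(-4.0 * L * a))

# Truncated Chebyshev interpolant of phi_sig on [-1, 1]; <w, x> lies in
# this interval when ||w|| = ||x|| = 1.
for deg in (5, 10, 20, 40):
    coeffs = C.chebinterpolate(phi_sig, deg)
    grid = np.linspace(-1, 1, 2001)
    err = np.max(np.abs(C.chebval(grid, coeffs) - phi_sig(grid)))
    print(f"degree {deg:3d}: uniform error {err:.2e}")
```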

SLIDE 28

Sigmoid Function

Worst-Case Guarantee: the fuzzy halfspace class {x → φ_sig(⟨w, x⟩) : ‖w‖ = 1} can be learned in time/sample complexity O(exp(7L log(L/ε))).
Picking φ_sig is just for the analysis; the algorithm is oblivious to the φ used.

SLIDES 29-30

Hardness Result

Better bound? Maybe with some other L-Lipschitz φ?
Proper learning is hard, but here the learner may return any predictor (improper learning).

Theorem: Fuzzy halfspaces with an L-Lipschitz transfer function φ cannot be learned in poly(L, 1/ε) time.

SLIDES 31-34

Proof Idea

Proof by reduction:

Cryptographic assumption: there is no poly-time solution to the Õ(n^{1.5})-unique-shortest-vector problem
⇒ intersections of n^ρ halfspaces over {−1, +1}^n cannot be PAC-learned in poly time (Klivans and Sherstov, 2006)
⇒ single halfspaces over {−1, +1}^n cannot be agnostic-PAC-learned in poly time (otherwise, boosting could be used to learn intersections)
⇒ fuzzy halfspaces over ℝ^n cannot be agnostic-PAC-learned in poly time when L grows polynomially with n

SLIDE 35

Summary

New technique for learning predictors x → φ(⟨w, x⟩), with φ possibly non-convex, under the 0-1 loss

Single algorithm, simultaneously competitive against all φ, including the optimal one for the data distribution

In fact, equivalent to a standard SVM, but composing our kernel on top of the original one