Learnability Beyond Uniform Convergence
Shai Shalev-Shwartz, School of CS and Engineering, The Hebrew University of Jerusalem
Algorithmic Learning Theory, Lyon 2012
Joint work with: N. Srebro, O. Shamir, K. Sridharan


SLIDE 1

Learnability Beyond Uniform Convergence

Shai Shalev-Shwartz

School of CS and Engineering, The Hebrew University of Jerusalem

“Algorithmic Learning Theory”, Lyon 2012

Joint work with:

  • N. Srebro, O. Shamir, K. Sridharan (COLT’09,JMLR’11)
  • A. Daniely, S. Sabato, S. Ben-David (COLT’11)
  • A. Daniely, S. Sabato (NIPS’12)

Shai Shalev-Shwartz (Hebrew U) Learnability Beyond Uniform Convergence Oct’12 1 / 34

SLIDE 2

The Fundamental Theorem of Learning Theory

For Binary Classification

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable ⇒ Finite VC ⇒ Uniform Convergence
(implication labels, in order: trivial, trivial, NFL (W’96), VC’71)

VC = Vapnik and Chervonenkis, W = Wolpert

SLIDE 3

The Fundamental Theorem of Learning Theory

For Regression

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable ⇒ Finite fat-shattering ⇒ Uniform Convergence
(implication labels, in order: trivial, trivial, KS’94/BLW’96/ABCH’97, BLW’96/ABCH’97)

BLW = Bartlett, Long, Williamson; ABCH = Alon, Ben-David, Cesa-Bianchi, Haussler; KS = Kearns and Schapire

SLIDE 4

For general learning problems?

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable
(both implications trivial)

Does the reverse implication hold?

SLIDE 5

For general learning problems?

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable
(both implications trivial)

The reverse implication is NOT true.

SLIDE 6

For general learning problems?

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable
(both implications trivial)

The reverse implication is NOT true:

  • Not true in “convex learning problems”!
  • Not true even in “multiclass categorization”!

What is learnable? How to learn?

SLIDE 7

Outline

1. Definitions
2. Learnability without uniform convergence
3. Characterizing Learnability using Stability
4. Characterizing Multiclass Learnability
5. Analyzing specific, practically relevant, classes
6. Open Questions

SLIDE 8

The General Learning Setting (Vapnik)

  • Hypothesis class H
  • Examples domain Z with unknown distribution D
  • Loss function ℓ : H × Z → R
  • Given: a training set S ∼ D^m
  • Goal: solve min_{h∈H} L(h), where L(h) = E_{z∼D}[ℓ(h, z)], in the Probably (w.p. ≥ 1 − δ) Approximately Correct (up to ε) sense

SLIDE 9

The General Learning Setting (Vapnik)

  • Hypothesis class H
  • Examples domain Z with unknown distribution D
  • Loss function ℓ : H × Z → R
  • Given: a training set S ∼ D^m
  • Goal: solve min_{h∈H} L(h), where L(h) = E_{z∼D}[ℓ(h, z)], in the Probably (w.p. ≥ 1 − δ) Approximately Correct (up to ε) sense
  • Training loss: L_S(h) = (1/m) Σ_{i=1}^m ℓ(h, z_i)
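The training loss above is just an average over the sample. As a minimal sketch (the scalar hypothesis, scalar examples, and squared loss are illustrative choices, not fixed by the slides):

```python
# Empirical risk L_S(h) = (1/m) * sum_{i=1}^m loss(h, z_i).
# Illustrative choices (not from the slides): scalar h, scalar examples,
# squared loss.
def empirical_risk(loss, h, S):
    return sum(loss(h, z) for z in S) / len(S)

squared = lambda h, z: (h - z) ** 2
S = [0.0, 1.0, 2.0]
risk_at_mean = empirical_risk(squared, 1.0, S)  # the sample mean minimizes it
```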

SLIDE 10

Examples

Binary classification:
  • Z = X × {0, 1}; h ∈ H is a predictor h : X → {0, 1}; ℓ(h, (x, y)) = 1[h(x) ≠ y]

Multiclass categorization:
  • Z = X × Y; h ∈ H is a predictor h : X → Y; ℓ(h, (x, y)) = 1[h(x) ≠ y]

k-means clustering:
  • Z = R^d; H ⊂ (R^d)^k specifies k cluster centers; ℓ((µ1, . . . , µk), z) = min_j ‖µj − z‖

Density estimation:
  • h is a parameter of a density p_h(z); ℓ(h, z) = − log p_h(z)

SLIDE 11

Learnability, ERM, Uniform convergence

Uniform Convergence: for m ≥ m^UC(ε, δ),
P_{S∼D^m}[∀h ∈ H, |L_S(h) − L(h)| ≤ ε] ≥ 1 − δ

SLIDE 12

Learnability, ERM, Uniform convergence

Uniform Convergence: for m ≥ m^UC(ε, δ),
P_{S∼D^m}[∀h ∈ H, |L_S(h) − L(h)| ≤ ε] ≥ 1 − δ

Learnable: ∃A s.t. for m ≥ m^PAC(ε, δ),
P_{S∼D^m}[L(A(S)) ≤ min_{h∈H} L(h) + ε] ≥ 1 − δ

SLIDE 13

Learnability, ERM, Uniform convergence

Uniform Convergence: for m ≥ m^UC(ε, δ),
P_{S∼D^m}[∀h ∈ H, |L_S(h) − L(h)| ≤ ε] ≥ 1 − δ

Learnable: ∃A s.t. for m ≥ m^PAC(ε, δ),
P_{S∼D^m}[L(A(S)) ≤ min_{h∈H} L(h) + ε] ≥ 1 − δ

ERM: an algorithm that returns A(S) ∈ argmin_{h∈H} L_S(h)
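For a finite class, the ERM rule can be sketched as an exhaustive search; the threshold hypotheses and data below are illustrative assumptions, not from the slides:

```python
# ERM: return A(S) in argmin_{h in H} L_S(h), by exhaustive search over H.
def erm(H, loss, S):
    return min(H, key=lambda h: sum(loss(h, z) for z in S) / len(S))

# toy binary classification: threshold hypotheses h_t(x) = 1[x >= t]
zero_one = lambda t, z: int((z[0] >= t) != z[1])  # z = (x, y), 0-1 loss
H = [0.0, 0.5, 1.0, 1.5, 2.0]
S = [(0.2, 0), (0.4, 0), (1.2, 1), (1.8, 1)]
h_hat = erm(H, zero_one, S)   # a threshold with zero training error
```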

SLIDE 14

Learnability, ERM, Uniform convergence

Uniform Convergence: for m ≥ m^UC(ε, δ),
P_{S∼D^m}[∀h ∈ H, |L_S(h) − L(h)| ≤ ε] ≥ 1 − δ

Learnable: ∃A s.t. for m ≥ m^PAC(ε, δ),
P_{S∼D^m}[L(A(S)) ≤ min_{h∈H} L(h) + ε] ≥ 1 − δ

ERM: an algorithm that returns A(S) ∈ argmin_{h∈H} L_S(h)

Learnable by arbitrary ERM (with rate m^ERM(ε, δ)): like “Learnable”, but A must be an ERM.

SLIDE 15

For Binary Classification

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable ⇒ Finite VC ⇒ Uniform Convergence
(implication labels, in order: trivial, trivial, NFL (W’96), VC’71)

m^UC(ε, δ) ≈ m^ERM(ε, δ) ≈ m^PAC(ε, δ) ≈ (VC(H) + log(1/δ)) / ε²
SLIDE 16

Outline

1. Definitions
2. Learnability without uniform convergence
3. Characterizing Learnability using Stability
4. Characterizing Multiclass Learnability
5. Analyzing specific, practically relevant, classes
6. Open Questions

SLIDE 17

Counter Example — Stochastic Convex Optimization

Consider the family of problems:
  • H is a convex set with max_{h∈H} ‖h‖ ≤ 1
  • For all z, ℓ(h, z) is convex and Lipschitz w.r.t. h

SLIDE 18

Counter Example — Stochastic Convex Optimization

Consider the family of problems:
  • H is a convex set with max_{h∈H} ‖h‖ ≤ 1
  • For all z, ℓ(h, z) is convex and Lipschitz w.r.t. h

Claim: the problem is learnable by the rule
argmin_{h∈H} (λ_m/2)‖h‖² + (1/m) Σ_{i=1}^m ℓ(h, z_i)
  • No uniform convergence
  • Not learnable by ERM
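A numerical sketch of this regularized rule (the particular loss, data, and optimizer are assumptions for illustration; the slides only assert the rule's form): projected subgradient descent on (λ/2)‖h‖² + L_S(h) over the unit ball, with the convex 1-Lipschitz loss ℓ(h, z) = ‖h − z‖.

```python
import numpy as np

# Projected subgradient descent on  (lam/2)*||h||^2 + (1/m)*sum_i ||h - z_i||
# over the unit ball {||h|| <= 1}.  Loss choice and step sizes are illustrative.
def objective(h, Z, lam=0.1):
    return lam / 2 * h @ h + np.mean(np.linalg.norm(h - Z, axis=1))

def regularized_erm(Z, lam=0.1, steps=2000, lr=0.01):
    m, d = Z.shape
    h = np.zeros(d)
    for _ in range(steps):
        diffs = h - Z
        norms = np.maximum(np.linalg.norm(diffs, axis=1, keepdims=True), 1e-12)
        g = lam * h + np.mean(diffs / norms, axis=0)  # subgradient of objective
        h = h - lr * g
        n = np.linalg.norm(h)
        if n > 1.0:                                   # project onto the ball
            h = h / n
    return h

rng = np.random.default_rng(0)
Z = 0.3 * rng.normal(size=(50, 5))
h_reg = regularized_erm(Z)
```

Because λ > 0 makes the objective strictly convex, its minimizer is unique and depends smoothly on the sample, which is what the stability argument later in the talk exploits.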

SLIDE 19

Counter Example — Stochastic Convex Optimization

Proof (of “not learnable by arbitrary ERM”) 1-Mean + missing features

SLIDE 20

Counter Example — Stochastic Convex Optimization

Proof (of “not learnable by arbitrary ERM”): 1-Mean + missing features
  • z = (α, x), α ∈ {0, 1}^d, x ∈ R^d, ‖x‖ ≤ 1
  • ℓ(h, (α, x)) = Σ_i α_i (h_i − x_i)²
  • Take P[α_i = 1] = 1/2, P[x = µ] = 1
  • Let h^(i) be s.t. h^(i)_j = 1 − µ_j if j = i, and µ_j otherwise
  • If d is large enough, there exists i such that h^(i) is an ERM
  • But L(h^(i)) ≥ 1/√2
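A quick simulation (hypothetical parameter choices; µ = 0) makes the failure concrete: with d ≫ 2^m some coordinate is hidden by every α_j in the sample, so the corresponding h^(i) is an ERM with zero training loss while its true risk under the squared per-coordinate loss above is the constant E[α_i] = 1/2.

```python
import numpy as np

# With m samples and d >> 2^m coordinates, some coordinate i satisfies
# alpha_ji = 0 for every sample j, so h = e_i fits the sample perfectly.
rng = np.random.default_rng(0)
m, d = 10, 100_000                        # d * 2^-m ~ 98 expected hidden coords
alpha = rng.integers(0, 2, size=(m, d))   # which features are observed
x = np.zeros((m, d))                      # P[x = mu] = 1, with mu = 0 here

def emp_loss(h):
    return np.mean(np.sum(alpha * (h - x) ** 2, axis=1))

hidden = np.flatnonzero(alpha.sum(axis=0) == 0)  # never-observed coordinates
i = int(hidden[0])
h_bad = np.zeros(d)
h_bad[i] = 1.0                            # h^(i): wrong on the hidden coordinate
true_risk = 0.5                           # E[alpha_i * (1 - 0)^2] = P[alpha_i = 1]
```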

SLIDE 21

Counter Example — Stochastic Convex Optimization

Proof (of “not even learnable by a unique ERM”): perturb the loss a little bit:
ℓ(h, (α, x)) = Σ_i α_i (h_i − x_i)² + ε Σ_i 2^{−i} (h_i − 1)²
  • Now the loss is strictly convex, so the ERM is unique
  • But the unique ERM does not generalize (as before)

SLIDE 22

For general learning problems?

Uniform Convergence ⇒ Learnable with ERM ⇒ Learnable
(both implications trivial)

The reverse implication is NOT true:

  • Not true in “convex learning problems”! ✓
  • Not true even in “multiclass categorization”!

SLIDE 23

Counter Example — Multiclass

X – a set, Y = {0, 1, 2, . . . , 2^|X| − 1}
Let n : 2^X → Y be defined by binary encoding
H = {h_T : T ⊆ X}, where h_T(x) = n(T) if x ∈ T, and 0 if x ∉ T

SLIDE 24

Counter Example — Multiclass

X – a set, Y = {0, 1, 2, . . . , 2^|X| − 1}
Let n : 2^X → Y be defined by binary encoding
H = {h_T : T ⊆ X}, where h_T(x) = n(T) if x ∈ T, and 0 if x ∉ T

Claim: no uniform convergence: m^UC ≥ |X|/ε
  • Target function is h_∅
  • For any training set S, take T = X \ S
  • L_S(h_T) = 0 but L(h_T) = P[T]

SLIDE 25

Counter Example — Multiclass

X – a set, Y = {0, 1, 2, . . . , 2^|X| − 1}
Let n : 2^X → Y be defined by binary encoding
H = {h_T : T ⊆ X}, where h_T(x) = n(T) if x ∈ T, and 0 if x ∉ T

Claim: H is learnable: m^PAC ≤ 1/ε
  • Let T be the target
  • A(S) = h_T if some (x, n(T)) ∈ S
  • A(S) = h_∅ if S = {(x_1, 0), . . . , (x_m, 0)}
  • In the 1st case, L(A(S)) = 0; in the 2nd case, L(A(S)) = P[T]
  • With high probability, if P[T] > ε then we’ll be in the 1st case
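A small instantiation of this learner in code (the size |X| = 5 is a hypothetical choice): n(T) encodes T as a bitmask, and A decodes T from any nonzero label in the sample.

```python
# X = {0,...,4}, Y = {0,...,2^5 - 1}; n(T) is the binary encoding of T.
X = range(5)

def n(T):
    return sum(1 << x for x in T)

def h(T, x):                      # the hypothesis h_T
    return n(T) if x in T else 0

def A(S):
    """The learner from the slide: decode T from any nonzero label,
    and fall back to the empty set when the whole sample is labeled 0."""
    for x, y in S:
        if y != 0:
            return frozenset(i for i in X if (y >> i) & 1)
    return frozenset()

T = frozenset({1, 3})
S = [(0, h(T, 0)), (3, h(T, 3))]  # the sample hits T at x = 3, revealing n(T)
```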

SLIDE 26

Counter Example — Multiclass

Corollary

m^UC / m^PAC ≈ |X|.

If |X| → ∞ then the problem is learnable but there is no uniform convergence!

SLIDE 27

Outline

1. Definitions
2. Learnability without uniform convergence
3. Characterizing Learnability using Stability
4. Characterizing Multiclass Learnability
5. Analyzing specific, practically relevant, classes
6. Open Questions

SLIDE 28

Characterizing Learnability using Stability

Theorem

A necessary and sufficient condition for learnability is the existence of an Asymptotic ERM (AERM) which is stable.

Uniform Convergence ⇒ ERM is stable ⇒ ∃ stable AERM ⇔ Learnable
(implication labels: RMP’05, MNPR’06; trivial)

SLIDE 29

More formally

Definition (Stability)

We say that A is ε_stable(m)-replace-one stable if for all D,
E_{S,z′,i} |ℓ(A(S^(i)); z′) − ℓ(A(S); z′)| ≤ ε_stable(m),
where S^(i) denotes S with its i-th example replaced by z′.

SLIDE 30

More formally

Definition (Stability)

We say that A is ε_stable(m)-replace-one stable if for all D,
E_{S,z′,i} |ℓ(A(S^(i)); z′) − ℓ(A(S); z′)| ≤ ε_stable(m),
where S^(i) denotes S with its i-th example replaced by z′.

Definition (AERM)

We say that A is an AERM (Asymptotic Empirical Risk Minimizer) with rate ε_erm(m) if for all D:
E_{S∼D^m}[L_S(A(S)) − min_{h∈H} L_S(h)] ≤ ε_erm(m)

SLIDE 31

Proof sketch: (a stable AERM is sufficient and necessary for learnability)

Sufficient:
  • For an AERM: stability ⇒ generalization
  • AERM + generalization ⇒ consistency

Necessary:
  • ∃ consistent A ⇒ ∃ consistent and generalizing A′ (using subsampling)
  • Consistent + generalizing ⇒ AERM
  • AERM + generalizing ⇒ stable

SLIDE 32

Intermediate Summary

Learnability ⇔ ∃ stable AERM
But how do we find one? And is there a combinatorial notion of learnability (like the VC dimension)?

SLIDE 33

Outline

1. Definitions
2. Learnability without uniform convergence
3. Characterizing Learnability using Stability
4. Characterizing Multiclass Learnability
5. Analyzing specific, practically relevant, classes
6. Open Questions

SLIDE 34

Why multiclass learning

  • Practical relevance
  • A simple twist on binary classification

SLIDE 35

The Natarajan Dimension

Natarajan dimension: maximal size of an N-shattered set, where C is N-shattered by H if ∃f1, f2 ∈ H s.t. ∀x ∈ C, f1(x) ≠ f2(x), and for every T ⊆ C there exists h ∈ H with
h(x) = f1(x) if x ∈ T, and f2(x) if x ∈ C \ T

SLIDE 36

The Natarajan Dimension

Natarajan dimension: maximal size of an N-shattered set, where C is N-shattered by H if ∃f1, f2 ∈ H s.t. ∀x ∈ C, f1(x) ≠ f2(x), and for every T ⊆ C there exists h ∈ H with
h(x) = f1(x) if x ∈ T, and f2(x) if x ∈ C \ T

When |Y| = 2, the Natarajan dimension equals the VC dimension.
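The definition can be checked by brute force on tiny classes. A toy setup (not from the slides): hypotheses are tuples of labels over a small domain of indices.

```python
from itertools import product

def subsets(C):
    C = list(C)
    for bits in product([0, 1], repeat=len(C)):
        yield {x for x, b in zip(C, bits) if b}

def n_shatters(H, C):
    """Brute-force check of N-shattering: find f1, f2 disagreeing on all of C
    such that every mixture of f1 and f2 along a subset T is realized in H."""
    for f1 in H:
        for f2 in H:
            if any(f1[x] == f2[x] for x in C):
                continue                      # need f1(x) != f2(x) on all of C
            if all(any(all(g[x] == (f1[x] if x in T else f2[x]) for x in C)
                       for g in H)
                   for T in subsets(C)):
                return True
    return False

# all functions from a 2-point domain into 3 labels N-shatter the whole domain
H_all = list(product([0, 1, 2], repeat=2))
```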

SLIDE 37

Does Natarajan dimension characterize multiclass learnability ?

Theorem (Natarajan’89, Ben-David et al. ’95)

If H is a class of functions with Natarajan dimension d then
(d + ln(1/δ)) / ε ≤ m^PAC(ε, δ) ≤ (d ln(|Y|) ln(1/ε) + ln(1/δ)) / ε.

SLIDE 38

Does Natarajan dimension characterize multiclass learnability ?

Theorem (Natarajan’89, Ben-David et al. ’95)

If H is a class of functions with Natarajan dimension d then
(d + ln(1/δ)) / ε ≤ m^PAC(ε, δ) ≤ (d ln(|Y|) ln(1/ε) + ln(1/δ)) / ε.

Remark:
  • A large gap when Y is large
  • The uniform convergence rate does depend on Y

SLIDE 39

How to design a good ERM algorithm?

Consider again our counterexample: Y = {0, . . . , 2^|X| − 1} and H = {h_T : T ⊆ X} with h_T(x) = n(T) if x ∈ T, and 0 if x ∉ T

SLIDE 40

How to design a good ERM algorithm?

Consider again our counterexample: Y = {0, . . . , 2^|X| − 1} and H = {h_T : T ⊆ X} with h_T(x) = n(T) if x ∈ T, and 0 if x ∉ T

Bad ERM:
If S = (x_1, 0), . . . , (x_m, 0), return h_T with T = X \ {x_1, . . . , x_m}

Good ERM:
If S = (x_1, 0), . . . , (x_m, 0), return h_∅
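Both rules are ERMs on an all-zero sample, but they behave very differently off the sample; a toy run (hypothetical sizes, |X| = 20) makes the gap concrete:

```python
import random

# |X| = 20, target h_empty (all labels 0).  The bad ERM memorizes the unseen
# part of X into T; the good ERM returns h_empty.
random.seed(0)
X = list(range(20))

def n(T):
    return sum(1 << x for x in T)

def h(T, x):
    return n(T) if x in T else 0

sample_xs = random.sample(X, 5)
S = [(x, 0) for x in sample_xs]   # labeled by the target h_empty

T_bad = set(X) - set(sample_xs)   # bad ERM: T = X \ {x_1,...,x_m}
bad = lambda x: h(T_bad, x)
good = lambda x: 0                # good ERM: h_empty

train_err_bad = sum(bad(x) != y for x, y in S)
fresh = [x for x in X if x not in sample_xs]
test_err_bad = sum(bad(x) != 0 for x in fresh)
test_err_good = sum(good(x) != 0 for x in fresh)
```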

SLIDE 41

How to design a good ERM algorithm?

Definition

A has an essential range r if ∀h ∈ H, ∃Y′(h) with |Y′(h)| ≤ r s.t. for all S labeled by h we have A(S) ∈ Y′(h).
A good ERM is an ERM that has a small essential range: a principle for designing good ERMs.

Theorem

If a learner has an essential range r then m^A(ε, δ) ≤ (d ln(r/ε) + ln(1/δ)) / ε

SLIDE 42

Characterizing Multiclass Learnability

Conjecture

For any H of Natarajan dimension d,
(d + ln(1/δ)) / ε ≤ m^PAC(ε, δ) ≤ (d ln(d/ε) + ln(1/δ)) / ε.

SLIDE 43

Characterizing Multiclass Learnability

Conjecture

For any H of Natarajan dimension d,
(d + ln(1/δ)) / ε ≤ m^PAC(ε, δ) ≤ (d ln(d/ε) + ln(1/δ)) / ε.

  • Cannot rely on uniform convergence / arbitrary ERM
  • Maybe there is always an ERM with a small essential range?
  • Holds for symmetric classes

SLIDE 44

Outline

1. Definitions
2. Learnability without uniform convergence
3. Characterizing Learnability using Stability
4. Characterizing Multiclass Learnability
5. Analyzing specific, practically relevant, classes
6. Open Questions

SLIDE 45

Sample Complexity of Specific classes

Enables a rigorous comparison of known multiclass algorithms

Previous analyses (e.g. ASS’01,BL’07): how the binary error translates to multiclass error

Multiclass predictors (all using linear predictors in R^d as the binary classifiers):
  • One-vs-All (OvA)
  • Multiclass SVM (MSVM): argmax_i (Wx)_i
  • Tree Classifiers (TC), with Õ(|Y|) nodes
  • Error Correcting Output Codes (ECOC), with code length Õ(|Y|)
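The MSVM prediction rule above is one line of code; a minimal sketch (the shapes and the identity weight matrix are illustrative assumptions):

```python
import numpy as np

# Multiclass SVM prediction: argmax_i (W x)_i, one linear predictor per class.
def msvm_predict(W, x):
    """W: (k, d) matrix with one row per class; returns the top-scoring class."""
    return int(np.argmax(W @ x))

W = np.eye(3)                 # k = d = 3: class i scores coordinate x_i
pred = msvm_predict(W, np.array([0.2, 0.9, 0.1]))
```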

SLIDE 46

Sample Complexity of Specific classes

Enables a rigorous comparison of known multiclass algorithms

Previous analyses (e.g. ASS’01,BL’07): how the binary error translates to multiclass error

Multiclass predictors (all using linear predictors in R^d as the binary classifiers):
  • One-vs-All (OvA)
  • Multiclass SVM (MSVM): argmax_i (Wx)_i
  • Tree Classifiers (TC), with Õ(|Y|) nodes
  • Error Correcting Output Codes (ECOC), with code length Õ(|Y|)

Theorem

The sample complexity of all the above classes is Θ̃(d |Y|).

SLIDE 47

Comparing Approximation Error

Definition

We say that H essentially contains H′ if for any distribution, the approximation error of H is at most the approximation error of H′. H strictly contains H′ if, in addition, there is a distribution for which the approximation error of H is strictly smaller than that of H′.

SLIDE 48

Comparing Approximation Error

MSVM:     ✓ ✓ ✓
OvA:      ✓ ✓ ✗
TC/ECOC*: ✓ ✗ ✗
(* assuming the tree structure and ECOC code are chosen randomly)

SLIDE 49

Comparing Approximation Error

              TC                       OvA       MSVM    random ECOC
Est. error    d|Y|                     d|Y|      d|Y|    d|Y|
Approx. error ≥ MSVM                   ≥ MSVM    best    incomparable
              (≈ 1/2 if d ≪ |Y|)                         (≈ 1/2 if d ≪ |Y|)

SLIDE 50

Open Questions

  • The equivalence between uniform convergence and learnability breaks even in multiclass problems
  • What characterizes multiclass learnability? What is the corresponding learning rule?
  • What characterizes learnability in the general learning setting? What is the corresponding learning rule?

SLIDE 51

Open Questions

  • The equivalence between uniform convergence and learnability breaks even in multiclass problems
  • What characterizes multiclass learnability? What is the corresponding learning rule?
  • What characterizes learnability in the general learning setting? What is the corresponding learning rule?

THANKS
