SLIDE 1

The Sample-Computational Tradeoff

Shai Shalev-Shwartz

School of Computer Science and Engineering, The Hebrew University of Jerusalem

Optimization and Statistical Learning Workshop, Les Houches, January 2013

SLIDE 2

Collaborators:

Nati Srebro
Ohad Shamir and Eran Tromer (AISTATS’2012)
Satyen Kale and Elad Hazan (COLT’2012)
Aharon Birnbaum (NIPS’2012)
Amit Daniely and Nati Linial (on arXiv)

SLIDE 3

What else can we do with more data?

Big data → reduce error (the traditional answer)

SLIDE 4

What else can we do with more data?

Big data can:
reduce error (the traditional answer)
speed up runtime (training runtime, prediction runtime)
compensate for missing information

SLIDE 5

Agnostic PAC Learning

Hypothesis class H ⊂ Y^X
Loss function ℓ : H × (X × Y) → ℝ
D: an unknown distribution over X × Y
True risk: L_D(h) = E_{(x,y)∼D}[ℓ(h, (x, y))]
Training set: S = (x_1, y_1), . . . , (x_m, y_m), drawn i.i.d. ∼ D^m
Goal: use S to find h_S such that, with high probability, L_D(h_S) ≤ min_{h∈H} L_D(h) + ε
ERM rule: ERM(S) ∈ argmin_{h∈H} L_S(h), where L_S(h) := (1/m) Σ_{i=1}^{m} ℓ(h, (x_i, y_i))
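To make the ERM rule above concrete, here is a minimal Python sketch; the hypothesis class, loss, and sample are toy placeholders, not anything from the talk.

```python
# A minimal sketch of the ERM rule over a small, explicitly enumerated
# hypothesis class (toy thresholds on the real line; all names illustrative).
def zero_one_loss(h, example):
    x, y = example
    return float(h(x) != y)

def empirical_risk(h, S, loss):
    return sum(loss(h, z) for z in S) / len(S)

def erm(H, S, loss):
    """ERM(S): any hypothesis in H minimizing the empirical risk L_S(h)."""
    return min(H, key=lambda h: empirical_risk(h, S, loss))

# Toy usage: threshold classifiers h_t(x) = 1[x >= t].
H = [(lambda x, t=t: int(x >= t)) for t in range(-3, 4)]
S = [(-2.0, 0), (-0.5, 0), (1.0, 1), (3.0, 1)]
h_S = erm(H, S, zero_one_loss)
```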

SLIDE 6

Error Decomposition

h⋆ = argmin_{h∈H} L_D(h) ;  h_S = ERM(S) = argmin_{h∈H} L_S(h)

L_D(h_S) = L_D(h⋆)  [approximation]  +  ( L_D(ERM(S)) − L_D(h⋆) )  [estimation]

Bias-Complexity tradeoff: Larger H decreases approximation error but increases estimation error

SLIDE 7

3-term Error Decomposition (Bottou & Bousquet ’08)

h⋆ = argmin_{h∈H} L_D(h) ;  ERM(S) = argmin_{h∈H} L_S(h)

L_D(h_S) = L_D(h⋆)  [approximation]  +  ( L_D(ERM(S)) − L_D(h⋆) )  [estimation]  +  ( L_D(h_S) − L_D(ERM(S)) )  [optimization]

Bias-Complexity tradeoff: a larger H decreases the approximation error but increases the estimation error. What about the optimization error?

Two resources: samples and runtime.
Sample-Computational complexity (Decatur, Goldreich, Ron ’98)

SLIDE 8

Joint Time-Sample Complexity

Goal: L_D(h_S) ≤ min_{h∈H} L_D(h) + ε

SLIDE 9

Joint Time-Sample Complexity

Goal: L_D(h_S) ≤ min_{h∈H} L_D(h) + ε

Sample complexity: how many examples are needed?
Time complexity: how much time is needed?

SLIDE 10

Joint Time-Sample Complexity

Goal: L_D(h_S) ≤ min_{h∈H} L_D(h) + ε

Sample complexity: how many examples are needed?
Time complexity: how much time is needed?
T_{H,ε}(m) = how much time is needed when |S| = m?

[Plot: the time-sample complexity curve T_{H,ε}(m) as a function of m, from the minimal sample complexity down to the data-laden regime.]

SLIDE 11

Outline

The Sample-Computational tradeoff:
    Agnostic learning of preferences
    Learning margin-based halfspaces
    Formally establishing the tradeoff
More data in partial information settings
Other things we can do with more data:
    Missing information
    Testing time

SLIDE 12

Agnostic Learning of Preferences

The Learning Problem:
X = [d] × [d], Y = {0, 1}
Given (i, j) ∈ X, predict whether i is preferable to j
H is the set of all permutations over [d]
Loss function: zero-one loss

SLIDE 13

Agnostic Learning of Preferences

The Learning Problem:
X = [d] × [d], Y = {0, 1}
Given (i, j) ∈ X, predict whether i is preferable to j
H is the set of all permutations over [d]
Loss function: zero-one loss

Method I: ERM_H. Sample complexity: d/ε²

SLIDE 14

Agnostic Learning of Preferences

The Learning Problem:
X = [d] × [d], Y = {0, 1}
Given (i, j) ∈ X, predict whether i is preferable to j
H is the set of all permutations over [d]
Loss function: zero-one loss

Method I: ERM_H. Sample complexity: d/ε²

Varun Kanade and Thomas Steinke (2011): unless RP = NP, it is not possible to efficiently find an ε-accurate permutation

SLIDE 15

Agnostic Learning of Preferences

The Learning Problem:
X = [d] × [d], Y = {0, 1}
Given (i, j) ∈ X, predict whether i is preferable to j
H is the set of all permutations over [d]
Loss function: zero-one loss

Method I: ERM_H. Sample complexity: d/ε²

Varun Kanade and Thomas Steinke (2011): unless RP = NP, it is not possible to efficiently find an ε-accurate permutation

Claim: if m ≥ d²/ε², it is possible to find a predictor with error ≤ ε in polynomial time

SLIDE 16

Agnostic Learning of Preferences

Let H^(n) be the set of all functions from X to Y
ERM over H^(n) can be computed efficiently
Sample complexity: VC(H^(n))/ε² = d²/ε²
Improper learning: H ⊂ H^(n)
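A minimal sketch of what ERM over H^(n), the class of all functions from X to Y, amounts to for this problem: minimize the empirical risk independently for every pair by taking a majority vote over its observed labels (unseen pairs get an arbitrary default; names are made up for illustration).

```python
from collections import Counter, defaultdict

# ERM over H^(n), the class of ALL functions from X = [d] x [d] to {0, 1}:
# the empirical zero-one risk is minimized separately for every pair (i, j),
# i.e. by predicting the majority label observed for that pair in the sample.
def erm_all_functions(S):
    votes = defaultdict(Counter)            # (i, j) -> label counts
    for (i, j), y in S:
        votes[(i, j)][y] += 1
    def h(pair):
        counts = votes.get(pair)
        if not counts:
            return 0                        # arbitrary default for unseen pairs
        return counts.most_common(1)[0][0]  # majority label (ties broken arbitrarily)
    return h

# Usage: S is a list of ((i, j), y) examples; h predicts a label for any pair.
S = [((0, 1), 1), ((0, 1), 1), ((0, 1), 0), ((2, 3), 0)]
h = erm_all_functions(S)
print(h((0, 1)), h((2, 3)), h((5, 6)))      # -> 1 0 0
```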

SLIDE 17

Sample-Computational Tradeoff

[Plot: runtime against number of samples, with ERM_H and ERM_{H^(n)} as the two extreme points and a question mark for the region in between.]

Method        Samples   Time
ERM_H         d         d!
ERM_{H^(n)}   d²        d²

SLIDE 18

Is this the best we can do?

The analysis is based on upper bounds.
Is it possible to (improperly) learn efficiently with d log(d) examples?
Posed as an open problem by:

Jacob Abernethy (COLT’10)
Kleinberg, Niculescu-Mizil, Sharma (Machine Learning 2010)

SLIDE 19

Is this the best we can do?

The analysis is based on upper bounds.
Is it possible to (improperly) learn efficiently with d log(d) examples?
Posed as an open problem by:

Jacob Abernethy (COLT’10)
Kleinberg, Niculescu-Mizil, Sharma (Machine Learning 2010)

Hazan, Kale, S. (COLT’12): can learn efficiently with d log³(d)/ε² examples

SLIDE 20

Sample-Computational Tradeoff

[Plot: runtime against number of samples, with ERM_H, HKS, and ERM_{H^(n)} marked.]

Method        Samples      Time
ERM_H         d            d!
HKS           d log³(d)    d⁴ log³(d)
ERM_{H^(n)}   d²           d²

SLIDE 21

HKS: Proof idea

Each permutation π can be written as a matrix W such that W(i, j) = 1 if π(i) < π(j), and 0 otherwise.

Definition: a matrix is (β, τ)-decomposable if its symmetrization can be written as P − N, where P, N are PSD, have trace bounded by τ, and diagonal entries bounded by β.

Theorem: there is an efficient online algorithm with regret √(τ β log(d) T) for predicting the elements of (β, τ)-decomposable matrices.

Lemma: permutation matrices are (log(d), d log(d))-decomposable.
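A small illustration of the matrix encoding in the first bullet (a sketch with an arbitrary toy permutation):

```python
import numpy as np

# Encode a permutation pi of [d] as the comparison matrix W with
# W[i, j] = 1 if pi(i) < pi(j) and 0 otherwise, as on the slide.
def comparison_matrix(pi):
    pi = np.asarray(pi)
    return (pi[:, None] < pi[None, :]).astype(float)

pi = [2, 0, 3, 1]            # pi(0)=2, pi(1)=0, pi(2)=3, pi(3)=1
W = comparison_matrix(pi)
# e.g. W[1, 0] == 1 since pi(1)=0 < pi(0)=2; the diagonal is all zeros.
```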

SLIDE 22

Outline

The Sample-Computational tradeoff:
    Agnostic learning of preferences
    Learning margin-based halfspaces
    Formally establishing the tradeoff
Other things we can do with more data:
    Missing information
    Testing time

SLIDE 23

Learning Margin-Based Halfspaces

Prior assumption: min_{w:‖w‖=1} P[y⟨w, x⟩ ≤ γ] is small.

[Figure: a separating halfspace with margin γ.]
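For concreteness, a small sketch of estimating the quantity in the prior assumption, the γ-margin error of a unit-norm w, on a sample; the data below are made up.

```python
import numpy as np

# Empirical gamma-margin error of a unit-norm w on a sample (X, y in {-1,+1}):
# the fraction of examples with y * <w, x> <= gamma.
def margin_error(w, X, y, gamma):
    w = w / np.linalg.norm(w)               # enforce ||w|| = 1
    margins = y * (X @ w)
    return float(np.mean(margins <= gamma))

# Toy usage with synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = np.sign(X @ w_true)
print(margin_error(w_true, X, y, gamma=0.1))
```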

SLIDE 24

Learning Margin-Based Halfspaces

Goal: find h_S : X → {±1} such that P[h_S(x) ≠ y] ≤ (1 + α) · min_{w:‖w‖=1} P[y⟨w, x⟩ ≤ γ] + ε

SLIDE 25

Learning Margin-Based Halfspaces

Goal: find h_S : X → {±1} such that P[h_S(x) ≠ y] ≤ (1 + α) · min_{w:‖w‖=1} P[y⟨w, x⟩ ≤ γ] + ε

Known results:

Method                 α      Samples      Time
Ben-David and Simon    0      1/(γ²ε²)     exp(1/γ²)
SVM (hinge loss)       1/γ    1/(γ²ε²)     poly(1/γ)

Trading approximation factor for runtime: what if α ∈ (0, 1/γ)?

SLIDE 26

Learning Margin-Based Halfspaces

Theorem (Birnbaum and S., NIPS’12)

Can achieve an α-approximation using time and sample complexity of poly(1/γ) · exp(4/(αγ)²)

Corollary

Can achieve α = 1/(γ√log(1/γ)) in polynomial time
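To see how the corollary would follow from the theorem as reconstructed above (taking the constant 4 in the exponent at face value), substitute α = 1/(γ√log(1/γ)):

```latex
\exp\!\left(\frac{4}{(\alpha\gamma)^2}\right)
  = \exp\!\bigl(4\log(1/\gamma)\bigr)
  = (1/\gamma)^4 = \mathrm{poly}(1/\gamma),
\qquad \text{since } (\alpha\gamma)^2 = \tfrac{1}{\log(1/\gamma)}.
```

So the exponential factor collapses to poly(1/γ), and the overall time and sample complexity stays polynomial in 1/γ.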

SLIDE 27

Proof Idea

SVM relies on the hinge loss as a convex surrogate: ℓ(w, (x, y)) = [1 − y⟨w, x⟩/γ]₊

[Plot of the hinge loss [1 − y⟨w, x⟩/γ]₊ against y⟨w, x⟩.]

SLIDE 28

Proof Idea

SVM relies on the hinge loss as a convex surrogate: ℓ(w, (x, y)) = [1 − y⟨w, x⟩/γ]₊

Compose the hinge loss with a polynomial: [1 − y·p(⟨w, x⟩)]₊

[Plot of the hinge loss [1 − y⟨w, x⟩/γ]₊ against y⟨w, x⟩.]

SLIDE 29

Proof Idea

SVM relies on the hinge loss as a convex surrogate: ℓ(w, (x, y)) = [1 − y⟨w, x⟩/γ]₊

Compose the hinge loss with a polynomial: [1 − y·p(⟨w, x⟩)]₊

[Plot of the hinge loss [1 − y⟨w, x⟩/γ]₊ against y⟨w, x⟩.]

But now the loss function is non-convex ...

SLIDE 30

Proof Idea (Cont.)

Let p(x) = Σ_j β_j x^j be the polynomial.
Original class: H = {x ↦ p(⟨w, x⟩) : ‖w‖ = 1}
Define the kernel: k(x, x′) = Σ_j |β_j| ⟨x, x′⟩^j
New class: H^(n) = {x ↦ ⟨v, Ψ(x)⟩ : ‖v‖ ≤ B}, where Ψ is the feature map corresponding to the kernel.
ERM over H^(n) can be computed efficiently (due to convexity).
Sample complexity: B²/ε²
Improper learning: H ⊂ H^(n)
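A minimal sketch of the kernel induced by the polynomial above; the coefficients below are arbitrary placeholders, not the ones used in the construction.

```python
import numpy as np

# Kernel induced by a polynomial p(z) = sum_j beta_j z^j, as on the slide:
# k(x, x') = sum_j |beta_j| * <x, x'>^j.
def poly_induced_kernel(beta):
    def k(x, xp):
        z = float(np.dot(x, xp))
        return sum(abs(b) * z**j for j, b in enumerate(beta))
    return k

# Toy usage: p(z) = 0.5*z + 0.25*z^3  ->  beta = [0, 0.5, 0, 0.25].
k = poly_induced_kernel([0.0, 0.5, 0.0, 0.25])
x, xp = np.array([1.0, 0.0]), np.array([0.5, 0.5])
print(k(x, xp))
```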

SLIDE 31

Can we do better?

Theorem (Daniely, Linial, S. 2012)

For every kernel, SVM cannot obtain α < 1/(γ · poly(log(1/γ))) with poly(1/γ) samples.
A similar lower bound holds for any feature-based mapping (not necessarily kernel-based).

Open problem: lower bounds for other techniques / any technique?

SLIDE 32

Proof ideas

A one-dimensional problem: D = (1 − λ)D₁ + λD₂
Every low-degree polynomial with hinge loss smaller than 1 must have p(γ) ≈ p(−γ).
Pull back the distribution to high dimension.
Use a characterization of Hilbert spaces corresponding to symmetric kernels, from which we can write f using Legendre polynomials and reduce to the one-dimensional case.
By averaging the kernel over the group of linear isometries of ℝᵈ, we relax the assumption that the kernel is symmetric.

SLIDE 33

Outline

The Sample-Computational tradeoff:
    Agnostic learning of preferences
    Learning margin-based halfspaces
    Formally establishing the tradeoff
Other things we can do with more data:
    Missing information
    Testing time

SLIDE 34

Formal Derivation of Gaps

Theorem (Shamir, S., Tromer 2012): assuming one-way permutations exist, there exists an agnostic learning problem such that:

[Step plot of the time-sample complexity T_{H,ε}(m) against the number of examples m: as m grows (axis marks around n/ε² and n·log(n)·1/ε²), the required runtime drops from roughly 2ⁿ + 1/ε², through values above poly(n), down to roughly n³/ε⁶.]

SLIDE 35

Proof: One Way Permutations

P : {0, 1}ⁿ → {0, 1}ⁿ is a one-way permutation if it is one-to-one and:
it is easy to compute w = P(s)
it is hard to compute s = P⁻¹(w)

Goldreich-Levin Theorem: if P is one-way, then for any efficient algorithm A there exists w such that P_r[A(r, P(w)) = ⟨r, w⟩ mod 2] < 1/2 + 1/poly(n)
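For reference, the hardcore predicate in the Goldreich-Levin theorem is just the inner product mod 2; here is a tiny sketch with toy bit vectors.

```python
# Goldreich-Levin hardcore predicate: <r, w> mod 2 over bit strings.
def gl_predicate(r, w):
    assert len(r) == len(w)
    return sum(ri & wi for ri, wi in zip(r, w)) % 2

# The theorem says: if P is one-way, no efficient A given (r, P(w)) can
# predict gl_predicate(r, w) noticeably better than random guessing.
r = [1, 0, 1, 1]
w = [0, 1, 1, 0]
print(gl_predicate(r, w))   # -> 1
```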

SLIDE 36

Proof: One Way Permutations

SLIDE 37

What else can we do with more data?

More data can:
reduce error (the traditional answer)
speed up runtime (training runtime, prediction runtime)
compensate for missing information

SLIDE 38

Online Bandit Multiclass Prediction

A hypothesis class H
For t = 1, 2, . . . , T:
    Receive x_t ∈ ℝᵈ
    Predict ŷ_t ∈ {1, . . . , k}
    Pay 1[ŷ_t ≠ h*(x_t)]

Goal: Minimize number of mistakes

SLIDE 39

Online Bandit Multiclass Prediction

Consider H to be linear predictors with a large margin.
In the full-information setting (i.e., the learner observes h*(x_t)), the Perceptron achieves an error rate of O(1/T).
In the bandit case:
    An error rate of O(1/T) is achievable in exponential time.
    An error rate of O(1/√T) is achievable in linear time.
Main idea: exploration. Guess the label randomly with probability Θ(1/√T).
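A rough sketch of the exploration idea only, not the actual algorithm from the paper; `base_predict`, `true_label`, and the loop structure are placeholders.

```python
import random

def bandit_loop(T, k, xs, base_predict, true_label):
    """Toy bandit multiclass loop: with probability ~1/sqrt(T), explore by
    guessing a uniformly random label; otherwise exploit the current
    predictor. Only bandit feedback (correct / incorrect) is observed."""
    explore_rate = 1.0 / (T ** 0.5)
    mistakes = 0
    for t in range(T):
        x = xs[t]
        if random.random() < explore_rate:
            y_hat = random.randrange(k)        # explore
        else:
            y_hat = base_predict(x)            # exploit
        correct = (y_hat == true_label(x))     # bandit feedback only
        mistakes += int(not correct)
        # a real algorithm would update base_predict from (x, y_hat, correct)
    return mistakes
```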

SLIDE 40

What else can we do with more data?

More data can:
reduce error (the traditional answer)
speed up runtime (training runtime, prediction runtime)
compensate for missing information

SLIDE 41

More data can speed up prediction time

Semi-Supervised Learning: many unlabeled examples, few labeled examples.
Most previous work: how can unlabeled data improve accuracy?
Our goal: how can unlabeled data help construct faster classifiers?
Modeling: Proper Semi-Supervised Learning, i.e., we must output a classifier from a predefined class H (of fast predictors).

SLIDE 42

More data can speed up prediction time

Semi-Supervised Learning: many unlabeled examples, few labeled examples.
Most previous work: how can unlabeled data improve accuracy?
Our goal: how can unlabeled data help construct faster classifiers?
Modeling: Proper Semi-Supervised Learning, i.e., we must output a classifier from a predefined class H (of fast predictors).

A simple two-phase procedure (sketched in code below):
1. Use the labeled examples to learn an arbitrary classifier g (as accurate as possible).
2. Apply g to label the unlabeled examples, and feed the now-labeled examples to a proper supervised learner for H.

The analysis is based on the simple inequality: P[h(x) ≠ f(x)] ≤ P[h(x) ≠ g(x)] + P[g(x) ≠ f(x)]
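A minimal sketch of this two-phase procedure; the particular models (a kernel SVM as the slow, accurate classifier g and logistic regression as the fast proper class H) are illustrative choices, not the ones from the talk.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def two_phase_ssl(X_lab, y_lab, X_unlab):
    # Phase 1: learn an accurate (possibly slow-to-evaluate) classifier g
    # from the few labeled examples.
    g = SVC(kernel="rbf").fit(X_lab, y_lab)
    # Phase 2: use g to pseudo-label the many unlabeled examples, then run a
    # proper supervised learner for the fast class H on the pseudo-labels.
    y_pseudo = g.predict(X_unlab)
    X_all = np.vstack([X_lab, X_unlab])
    y_all = np.concatenate([y_lab, y_pseudo])
    h = LogisticRegression(max_iter=1000).fit(X_all, y_all)  # fast linear predictor
    return h

# Usage: h = two_phase_ssl(X_lab, y_lab, X_unlab); predictions = h.predict(X_test)
```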

SLIDE 43

Demonstration

[Plot: test error against the number of labeled examples (10³ to 10⁴); error axis from 0.04 to 0.18; curves for Linear, Linear with SSL, and Chamfer.]

SLIDE 44

Summary

The Bias-Variance tradeoff is well understood.
We study the Sample-Computational tradeoff.
More data can reduce runtime (both training and testing).
More data can compensate for missing information.

Open Questions:
Other techniques to control the tradeoff
Stronger lower bounds for real-world problems
