SLIDE 1

β€œOptimistic” Rates

Nati Srebro Based on work with Karthik Sridharan and Ambuj Tewari

Examples based on work with Andy Cotter, Elad Hazan, Tomer Koren, Percy Liang, Shai Shalev-Shwartz, Ohad Shamir, Karthik Sridharan

SLIDE 2

Outline

  • What?
  • When?

(How?)

  • Why?
SLIDE 3

Estimating the Bias of a Coin

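A standard Bernstein-type calculation makes the coin example concrete (a sketch; p is the true bias, pΜ‚ the empirical frequency after n flips):

$$
\Pr\left[\,|\hat{p}-p|>\epsilon\,\right]\;\le\;2\exp\!\left(-\frac{n\,\epsilon^{2}}{2\,p(1-p)+\tfrac{2}{3}\epsilon}\right)
\qquad\Longrightarrow\qquad
n\;=\;O\!\left(\frac{p(1-p)+\epsilon}{\epsilon^{2}}\,\log\frac{1}{\delta}\right).
$$

So when the coin is heavily biased (p close to 0 or 1), roughly p/Ρ² flips suffice rather than the worst-case 1/Ρ²; the optimistic bounds below generalize exactly this effect, with the best achievable loss L* playing the role of p.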
SLIDE 4

Optimistic VC bound

(aka L*-bound, multiplicative bound)

  • For a hypothesis class with VC-dim D, w.p. 1-Ξ΄ over n samples:
SLIDE 5

Optimistic VC bound

(aka L*-bound, multiplicative bound)

  • For a hypothesis class with VC-dim D, w.p. 1-Ξ΄ over n samples:
  • Sample complexity to get L(h) ≀ L* + Ξ΅ (a standard form is sketched below):
  • Extends to bounded real-valued loss, D = VC-subgraph dim
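A standard form of this bound (a sketch; constants and log factors vary across statements):

$$
L(h)\;\le\;\hat{L}(h)+O\!\left(\sqrt{\hat{L}(h)\cdot\frac{D\log\frac{n}{D}+\log\frac{1}{\delta}}{n}}\;+\;\frac{D\log\frac{n}{D}+\log\frac{1}{\delta}}{n}\right),
\qquad
n\;=\;\tilde{O}\!\left(\frac{D\,(L^{*}+\epsilon)}{\epsilon^{2}}\right).
$$

The sample complexity interpolates between roughly D/Ξ΅ in the realizable case (L* = 0) and the pessimistic D/Ρ² when L* is a constant.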
SLIDE 6

From Parametric to Scale Sensitive Classes

  • Instead of VC-dim or VC-subgraph-dim (β‰ˆ #params), rely on metric scale to control complexity, e.g.:

  • Learning depends on:
  • Metric complexity measures: fat-shattering dimension, covering numbers, Rademacher complexity
  • Scale sensitivity of loss Ο† (bound on derivatives or β€œmargin”)
  • For β„‹ with Rademacher complexity β„›_n(β„‹), and Ο†β€² ≀ L (see the sketch below):

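A standard form of this Lipschitz/Rademacher bound (a sketch; additionally assuming the loss is bounded by b): with probability 1 βˆ’ Ξ΄, for all h ∈ β„‹,

$$
L(h)\;\le\;\hat{L}(h)+2L\,\mathcal{R}_{n}(\mathcal{H})+b\sqrt{\frac{\log\frac{1}{\delta}}{2n}} .
$$

With β„›_n(β„‹) ≀ √(B/n) this gives sample complexity on the order of L²·B/Ρ², with no improvement when L* is small; closing that gap is what the smooth-loss theorem on the next slide does.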
SLIDE 7

Non-Parametric Optimistic Rate for Smooth Loss

  • Theorem: for any β„‹ with (worst-case) Rademacher complexity β„›_n(β„‹), and any smooth loss with Ο†β€²β€² ≀ H and Ο† ≀ b, w.p. 1 βˆ’ Ξ΄ over n samples:

[Srebro Sridharan Tewari 2010]

  • Sample complexity (a rough form is given below):
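Roughly, the bound of [Srebro Sridharan Tewari 2010] takes the following form (a sketch, up to constants and poly-log factors in n):

$$
L(h)\;\le\;\hat{L}(h)+\tilde{O}\!\left(\sqrt{\hat{L}(h)}\left(\sqrt{H}\,\mathcal{R}_{n}(\mathcal{H})+\sqrt{\frac{b\log\frac{1}{\delta}}{n}}\right)+H\,\mathcal{R}_{n}^{2}(\mathcal{H})+\frac{b\log\frac{1}{\delta}}{n}\right).
$$

With β„›_n(β„‹) ≀ √(B/n), the sample complexity to reach L(h) ≀ L* + Ξ΅ is roughly Γ•(H·B·(L* + Ξ΅)/Ρ²): about H·B/Ξ΅ in the realizable case and H·B·L*/Ρ² in the pessimistic regime.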
SLIDE 8

Proof Ideas

  • Smooth functions are self-bounding: for any H-smooth non-negative f:  |fβ€²(t)| ≀ √(4 H f(t))

  • 2nd-order version of the Lipschitz composition lemma, restricted to predictors with low loss:

Rademacher β†’ fat-shattering β†’ L∞ covering β†’ (compose with loss and use smoothness) β†’ L2 covering β†’ Rademacher

  • Local Rademacher analysis
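The self-bounding step is easy to sanity-check numerically. A minimal sketch, using the logistic loss (which is 1/4-smooth) purely as an illustrative choice:

```python
import numpy as np

# Self-bounding property of smooth non-negative functions:
# if f >= 0 and f' is H-Lipschitz, then |f'(t)| <= sqrt(4 * H * f(t)).
# Checked numerically for the logistic loss f(t) = log(1 + exp(-t)),
# whose second derivative sigma(t) * (1 - sigma(t)) is at most H = 1/4.

H = 0.25
t = np.linspace(-20, 20, 100001)
f = np.log1p(np.exp(-t))            # logistic loss, non-negative
fprime = -1.0 / (1.0 + np.exp(t))   # its derivative

gap = np.abs(fprime) - np.sqrt(4 * H * f)
print("max violation of |f'| <= sqrt(4 H f):", gap.max())   # should not exceed ~0
```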
SLIDE 9

Non-Parametric Optimistic Rate for Smooth Loss

  • Theorem: for any β„‹ with (worst-case) Rademacher complexity β„›_n(β„‹), and any smooth loss with Ο†β€²β€² ≀ H and Ο† ≀ b, w.p. 1 βˆ’ Ξ΄ over n samples:

[Srebro Sridharan Tewari 2010]

  • Sample complexity
SLIDE 10

Parametric vs Non-Parametric

Loss class ↓ / Complexity control β†’ | Parametric: dim(β„‹) ≀ d, Ο† ≀ 1 | Scale-sensitive: β„›_n(β„‹) ≀ √(B/n)
Lipschitz: Ο†β€² ≀ L (e.g. hinge, β„“1) | L·d/n + √(L*·L·d/n) | √(L²·B/n)
Smooth: Ο†β€²β€² ≀ H (e.g. logistic, Huber, smoothed hinge) | H·d/n + √(L*·H·d/n) | H·B/n + √(L*·H·B/n)
Smooth & strongly convex: ΞΌ ≀ Ο†β€²β€² ≀ H (e.g. square loss) | (H/ΞΌ)·H·d/n | H·B/n + √(L*·H·B/n)

Min-max tight up to poly-log factors

SLIDE 11

Optimistic SVM-Type Bounds

Ο†01 ≀ Ο†hinge

  • Optimize
  • Generalize
SLIDE 12

Optimistic SVM-Type Bounds

Ο†01 ≀ Ο†smooth ≀ Ο†hinge

  • Optimize
  • Generalize
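One concrete (non-convex) choice of smooth surrogate that sits pointwise between the 0-1 loss and the hinge, purely as an illustration of the sandwich Ο†01 ≀ Ο†smooth ≀ Ο†hinge above (not necessarily the surrogate intended here):

```python
import numpy as np

# An explicit 4-smooth surrogate sandwiched between the 0-1 loss and the hinge:
#   phi(z) = 1 - z              for z <= 0
#   phi(z) = (1 - z)^2 (1 + z)  for 0 <= z <= 1
#   phi(z) = 0                  for z >= 1
# phi is continuously differentiable with |phi''| <= 4, but it is not convex;
# it only illustrates the pointwise sandwich phi_01 <= phi <= phi_hinge.

def loss_01(z):
    return (z <= 0).astype(float)

def loss_hinge(z):
    return np.maximum(0.0, 1.0 - z)

def loss_smooth(z):
    return np.where(z <= 0, 1.0 - z,
           np.where(z >= 1, 0.0, (1.0 - z) ** 2 * (1.0 + z)))

z = np.linspace(-5, 5, 10001)   # z = y * <w, x> is the margin
assert np.all(loss_01(z) <= loss_smooth(z) + 1e-12)
assert np.all(loss_smooth(z) <= loss_hinge(z) + 1e-12)
print("phi_01 <= phi_smooth <= phi_hinge holds on the grid")
```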
SLIDE 13

Optimistic Learning Guarantees

βœ“ Parametric classes
βœ“ Scale-sensitive classes with smooth loss
βœ“ SVM-type bounds
βœ“ Margin bounds
βœ“ Online learning/optimization with smooth loss
βœ“ Stability-based guarantees with smooth loss
Γ— Non-parametric (scale-sensitive) classes with non-smooth loss
Γ— Online learning/optimization with non-smooth loss

SLIDE 14

Why Optimistic Guarantees?

  • Optimistic regime typically relevant regime:
  • Approximation error L* β‰ˆ Estimation error Ξ΅
  • If Ξ΅ β‰ͺ L*, better to spend energy on lowering approx. error (use a more complex class)

  • Important in understanding statistical learning
SLIDE 15

Training Kernel SVMs

# Kernel evaluations to get excess error Ξ΅:  (B = β€–w*β€–Β²)

  • Using SGD:
  • Using the Stochastic Batch Perceptron [Cotter et al 2012]:

(is this the best possible?)

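To see why kernel evaluations are the natural cost measure: each SGD step on the kernelized objective scores the fresh example against every current support vector, so T steps cost on the order of T² kernel evaluations. A minimal sketch of plain kernelized sub-gradient descent on the hinge loss (the Gaussian kernel, step size, and toy data are illustrative assumptions, not the algorithms referenced on the slide):

```python
import numpy as np

# Kernelized SGD on the (unregularized) hinge loss, counting kernel evaluations.
# Each step scores the new point against all current support vectors, so T steps
# cost O(T^2) kernel evaluations -- the quantity the slide's bounds are phrased in.

def gaussian_kernel(x1, x2, gamma=1.0):
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

def kernel_sgd(X, y, eta=0.1, steps=None):
    if steps is None:
        steps = len(X)
    support, alphas, kernel_evals = [], [], 0
    for t in range(steps):
        x_t, y_t = X[t % len(X)], y[t % len(X)]
        # score the fresh point against the current support set
        score = 0.0
        for x_i, a_i in zip(support, alphas):
            score += a_i * gaussian_kernel(x_i, x_t)
            kernel_evals += 1
        if y_t * score < 1.0:            # hinge sub-gradient is non-zero
            support.append(x_t)
            alphas.append(eta * y_t)
    return support, alphas, kernel_evals

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))
_, _, evals = kernel_sgd(X, y)
print("kernel evaluations for", len(X), "SGD steps:", evals)
```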
SLIDE 16

Training Linear SVMs

Runtime (# feature evaluations):  (B = β€–w*β€–Β²)

  • Using SGD:
  • Using SIMBA [Hazan et al 2011]:

(is this the best possible?)

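The linear analogue counts feature evaluations: each step touches the d features of one example, so T steps cost on the order of T·d. A minimal Pegasos-style sketch (regularization, step size, and toy data are illustrative assumptions, not SIMBA):

```python
import numpy as np

# Pegasos-style SGD for a linear SVM, counting feature evaluations:
# every step reads the d features of one example for the margin and,
# when the hinge is active, once more for the update, so the total is O(T * d).

def linear_svm_sgd(X, y, lam=0.01, steps=1000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    feature_evals = 0
    for t in range(1, steps + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)
        margin = y[i] * (w @ X[i])          # d feature evaluations
        feature_evals += d
        w *= (1.0 - eta * lam)              # shrink from the regularizer
        if margin < 1.0:
            w += eta * y[i] * X[i]          # d more feature evaluations
            feature_evals += d
    return w, feature_evals

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
y = np.sign(X @ rng.normal(size=20))
w, evals = linear_svm_sgd(X, y)
print("feature evaluations:", evals)
```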
SLIDE 17

Mini-Batch SGD

  • Stochastic optimization of a smooth loss L(w) using n training points, doing T = n/b iterations of SGD with mini-batches of size b

  • Pessimistic analysis (ignoring L*):
    β‡’ Can use mini-batches of size b ∝ √n, with T ∝ √n iterations, and get the same error (up to constant factors) as sequential SGD

[Dekel et al 2010][Agarwal Duchi 2011]

  • But taking into account L*:

In Optimistic Regime: Can’t use b>1, no parallelization speedups!

  • Use acceleration to get a speedup in the optimistic regime [Cotter et al 2011]
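A minimal sketch of the scheme being analyzed: one pass over n points in T = n/b steps, averaging b stochastic gradients per step (the logistic loss, step size, and toy data are illustrative assumptions):

```python
import numpy as np

# Mini-batch SGD on a smooth (logistic) loss: one pass over n points in
# T = n/b iterations, each averaging b stochastic gradients.  The question
# on the slide is how large b can be before the final error degrades
# relative to sequential SGD (b = 1).

def minibatch_sgd(X, y, b, eta=0.5):
    n, d = X.shape
    w = np.zeros(d)
    for t in range(n // b):                      # T = n / b iterations
        Xb, yb = X[t * b:(t + 1) * b], y[t * b:(t + 1) * b]
        margins = np.clip(yb * (Xb @ w), -50, 50)
        # averaged gradient of log(1 + exp(-margin)) over the mini-batch
        grad = -(Xb * (yb / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        w -= eta * grad
    return w

rng = np.random.default_rng(0)
n, d = 4096, 10
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.5 * rng.normal(size=n))
for b in (1, 8, 64):                             # 64 = sqrt(n) here
    w = minibatch_sgd(X, y, b)
    print(f"b={b:3d}  T={n // b:5d}  train error={np.mean(np.sign(X @ w) != y):.3f}")
```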
SLIDE 18

Multiple Complexity Controls

[Liang Srebro 2010]

𝑀 π‘₯ = 𝔽 π‘₯, π‘Œ βˆ’ 𝑍 2 , 𝑍 = π‘₯, π‘Œ + π’ͺ(0, 𝜏2) π‘₯ ∈ ℝ𝐸 π‘₯ 2 ≀ 𝑆

π‘€βˆ—/𝐸

𝑆/π‘œ π‘€βˆ—π‘†/π‘œ π‘€βˆ—πΈ/π‘œ

𝔽[𝑍2] π‘€βˆ— 𝑆/𝔽[𝑍2] 𝑆/π‘€βˆ— π‘€βˆ—πΈ2/𝑆

SLIDE 19

Be Optimistic

  • For scale-sensitive non-parametric classes, with smooth loss: [Srebro Sridharan Tewari 2010]

  • Difference vs parametric: not possible with non-smooth loss!
  • Optimistic regime typically relevant regime:
  • Approximation error L* β‰ˆ Estimation error Ξ΅
  • If Ξ΅ β‰ͺ L*, better to spend energy on lowering approx. error (use a more complex class)

  • Important in understanding statistical learning