"Optimistic" Rates
Nati Srebro Based on work with Karthik Sridharan and Ambuj Tewari
Examples based on work with Andy Cotter, Elad Hazan, Tomer Koren, Percy Liang, Shai Shalev-Shwartz, Ohad Shamir, Karthik Sridharan
Scale-sensitive classes: a metric scale controls complexity, e.g. covering numbers, Rademacher complexity.
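As a quick illustration (not from the talk), the empirical Rademacher complexity of a norm-bounded linear class can be estimated by Monte Carlo, since for this class the supremum over predictors has the closed form (B/n)·‖Σᵢ σᵢxᵢ‖₂. All sizes and constants below are assumptions made for the sketch.

```python
# Monte Carlo estimate of the empirical Rademacher complexity of the
# norm-bounded linear class {x -> <w,x> : ||w||_2 <= B}.  For this class
# sup_w (1/n) sum_i s_i <w, x_i> = (B/n) ||sum_i s_i x_i||_2 exactly.
# The parameters n, d, B are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 500, 20, 1.0
X = rng.normal(size=(n, d)) / np.sqrt(d)   # scaled so ||x|| is roughly 1

est, trials = 0.0, 200
for _ in range(trials):
    s = rng.choice([-1.0, 1.0], size=n)    # Rademacher signs
    est += B * np.linalg.norm(s @ X) / n   # closed-form supremum for this draw
est /= trials
print(f"estimated Rademacher complexity ~ {est:.4f}"
      f"  (theory: O(B/sqrt(n)) = {B/np.sqrt(n):.4f})")
```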
Main result [Srebro Sridharan Tewari 2010]: for any class ℋ with Rademacher complexity ℛₙ(ℋ), and any smooth loss with ϕ″ ≤ H and ϕ ≤ b, w.p. 1 − δ over n samples, every h ∈ ℋ satisfies

  L(h) ≤ L̂(h) + Õ( √L̂(h) · ( √H·ℛₙ(ℋ) + √(b·log(1/δ)/n) ) + H·ℛₙ²(ℋ) + b·log(1/δ)/n )

Key lemma: any smooth, non-negative f satisfies f′(u)² ≤ 4·H·f(u).
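The lemma is easy to sanity-check numerically. A minimal sketch (my addition, not from the talk), using the standard smoothness constants H = 1/4 for the logistic loss and H = 2 for the squared loss:

```python
# Numerical check of the self-bounding lemma: any non-negative H-smooth f
# satisfies f'(u)^2 <= 4*H*f(u).  Verified on a grid for the logistic loss
# (H = 1/4) and the squared loss (H = 2), using exact derivatives.
import numpy as np

u = np.linspace(-10, 10, 2001)

# logistic loss of the margin: f(u) = log(1 + exp(-u)), f'(u) = -1/(1 + exp(u))
f_log  = np.log1p(np.exp(-u))
df_log = -1.0 / (1.0 + np.exp(u))
assert np.all(df_log**2 <= 4 * 0.25 * f_log + 1e-12)

# squared loss: f(u) = u^2, f'(u) = 2u, smoothness H = 2
assert np.all((2 * u)**2 <= 4 * 2.0 * u**2 + 1e-12)
print("self-bounding property f'(u)^2 <= 4*H*f(u) holds on the grid")
```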
The bound tightens for predictors with low empirical loss, hence "optimistic" rates.
Proof route: Rademacher → fat shattering → ℓ∞ covering → (compose with loss and use smoothness) → ℓ₂ covering → Rademacher.
Guarantees on the excess error (up to constants and log factors):

                                         | Parametric:          | Scale-sensitive:
                                         | dim(ℋ) ≤ d, ϕ ≤ b    | ℛₙ(ℋ) ≤ √(R²/n)
  Lipschitz: ϕ′ ≤ G                      | b·d/n + √(L*·b·d/n)  | √(G²R²/n)
    (e.g. hinge, ℓ₁)                     |                      |
  Smooth: ϕ″ ≤ H                         | H·d/n + √(L*·H·d/n)  | HR²/n + √(L*·HR²/n)
    (e.g. logistic, Huber, smoothed hinge)|                     |
  Smooth & strongly convex: λ ≤ ϕ″ ≤ H   | (H/λ)·(H·d/n)        | HR²/n + √(L*·HR²/n)
    (e.g. square loss)                   |                      |
Min-max tight up to poly-log factors
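To see the interpolation behind these cells, one can plug illustrative constants into the bounds: the smooth-loss rate behaves like 1/n when L* ≈ 0 and like 1/√n when L* is a constant, while the Lipschitz rate stays at 1/√n. A sketch assuming H = G = R = 1 purely for illustration:

```python
# Compare the smooth-loss "optimistic" bound sqrt(L* H R^2 / n) + H R^2 / n
# with the Lipschitz bound sqrt(G^2 R^2 / n) as L* and n vary.
# H, G, R are illustrative assumptions, not values from the talk.
import numpy as np

H, G, R = 1.0, 1.0, 1.0
for n in [100, 10_000, 1_000_000]:
    for L_star in [0.0, 1e-4, 0.1]:
        smooth = np.sqrt(L_star * H * R**2 / n) + H * R**2 / n
        lipsch = np.sqrt(G**2 * R**2 / n)
        print(f"n={n:>9}  L*={L_star:<6}  smooth={smooth:.2e}  Lipschitz={lipsch:.2e}")
```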
Margin bounds: sandwich a smooth loss between the 0-1 and hinge losses, ℓ01 ≤ ℓsmooth ≤ ℓhinge; applying the optimistic rate to ℓsmooth then bounds the 0-1 error by the empirical hinge loss:

  L01(h) ≤ L̂hinge(h) + Õ( √(L̂hinge(h)·HR²/n) + HR²/n )
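One concrete surrogate realizing the sandwich (an illustrative choice; the talk does not commit to a specific ϕ) is a cubically smoothed hinge whose derivative is 4-Lipschitz, so it is 4-smooth:

```python
# A smooth surrogate sandwiched between the 0-1 and hinge losses: on [0,1]
# take phi(u) = (1-u)^2 (1+u), extended by 1-u below 0 and by 0 above 1.
# Its derivative is 4-Lipschitz (4-smooth), and l01 <= phi <= hinge holds.
import numpy as np

def l01(u):    return (u <= 0).astype(float)
def hinge(u):  return np.maximum(0.0, 1.0 - u)
def smooth_surrogate(u):
    return np.where(u <= 0, 1.0 - u,
           np.where(u >= 1, 0.0, (1.0 - u)**2 * (1.0 + u)))

u = np.linspace(-3, 3, 1201)
assert np.all(l01(u) <= smooth_surrogate(u) + 1e-12)
assert np.all(smooth_surrogate(u) <= hinge(u) + 1e-12)
print("sandwich verified: l01 <= smooth <= hinge")
```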
✓ Parametric classes
✓ Scale-sensitive classes with smooth loss
✓ SVM-type bounds
✓ Margin bounds
✓ Online learning/optimization with smooth loss
✓ Stability-based guarantees with smooth loss
✗ Non-parametric (scale-sensitive) classes with non-smooth loss
✗ Online learning/optimization with non-smooth loss
(To get lower error, use a more complex class.)
# Kernel evaluations needed to reach excess error ε (R = ‖x‖²):
(is this the best possible?)
Runtime (# feature evaluations), with R = ‖x‖²:
(is this the best possible?)
Doing T = n/b iterations of SGD with mini-batches of size b:
⇒ Can use a mini-batch of size b ≪ √n, with T = n/b iterations, and get the same error (up to a constant factor) as sequential SGD
[Dekel et al 2010][Agarwal Duchi 2011]
⇒ In the optimistic regime: can't use b > 1, so no parallelization speedup!
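A minimal sketch of the mini-batch scheme, assuming a logistic (smooth) loss with illustrative data, sizes, and step size; T = n/b updates each average b stochastic gradients, so the total number of gradient evaluations is n for every choice of b:

```python
# Mini-batch SGD sketch: T = n/b updates, each averaging b gradients of the
# logistic loss, so exactly n examples are touched regardless of b.
# Data, step size eta, and all sizes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
n, d, b = 4096, 10, 16
w_true = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))
y = np.sign(X @ w_true + 0.1 * rng.normal(size=n))

w, eta = np.zeros(d), 0.5
for t in range(n // b):                  # T = n/b iterations
    batch = slice(t * b, (t + 1) * b)    # one pass over disjoint mini-batches
    margins = y[batch] * (X[batch] @ w)
    # averaged gradient of the logistic loss log(1 + exp(-margin))
    g = -(y[batch] / (1 + np.exp(margins))) @ X[batch] / b
    w -= eta * g
print("train error:", np.mean(np.sign(X @ w) != y))
```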
[Liang Srebro 2010]
Square loss: L(w) = 𝔼[(⟨w, x⟩ − y)²], with y = ⟨w₀, x⟩ + 𝒩(0, σ²), x ∈ ℝᵈ, 𝔼‖x‖² ≤ R².
[Slide: table of excess-error regimes for this problem: a dimension-controlled regime scaling as σ²d/n, and norm-controlled regimes involving ‖w₀‖ and second moments, with transitions as the sample size grows.]
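A small simulation of this setup (all constants are illustrative assumptions) shows the in-sample excess risk of ordinary least squares tracking the dimension-controlled rate σ²d/n:

```python
# Least squares on y = <x, w0> + N(0, sigma^2): in the dimension-controlled
# regime the excess risk behaves like sigma^2 * d / n.  Constants illustrative.
import numpy as np

rng = np.random.default_rng(2)
d, sigma = 20, 0.5
w0 = rng.normal(size=d) / np.sqrt(d)
for n in [200, 800, 3200]:
    X = rng.normal(size=(n, d))
    y = X @ w0 + sigma * rng.normal(size=n)
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    excess = np.mean((X @ (w_hat - w0))**2)   # in-sample excess-risk proxy
    print(f"n={n:>5}  excess~{excess:.4f}  sigma^2*d/n={sigma**2 * d / n:.4f}")
```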
Optimistic rates with smooth loss: [Srebro Sridharan Tewari 2010]. In the optimistic regime, drive the error down by using a more complex class.