Learnability, Stability and Strong Convexity (PowerPoint presentation)

SLIDE 1

Learnability, Stability and Strong Convexity

Nati Srebro (Toyota Technological Institute, Chicago, 2008-2011)
Ohad Shamir (Weizmann)
Shai Shalev-Shwartz (HUJI)
Karthik Sridharan (Cornell)
Ambuj Tewari (Michigan)

SLIDE 2

Outline

  • Theme: Role of Stability in Learning
  • Story: Necessary and sufficient condition for learnability
  • Characterizing (statistical) learnability
    – Stability as the master property
  • Convex problems
    – Strong convexity as the master property
  • Stability in online learning
    – From Stability to Online Mirror Descent

SLIDE 3

The General Learning Setting

min_{w∈W} F(w) = E_{z~D}[ f(w; z) ]

given an i.i.d. sample z_1, z_2, …, z_n ~ D

  • Known objective function f: W × Z → ℝ, unknown distribution D over z ∈ Z
  • The problem specified by (W, Z, f) is learnable if there exists a learning rule w(z_1, …, z_n) s.t. for every ε > 0 and large enough sample size n(ε), for any distribution D:

    E_{z_1,…,z_n~D}[ F(w(z_1, …, z_n)) ] ≤ inf_{w∈W} F(w) + ε   (where F(w*) = inf_{w∈W} F(w))

[Vapnik 95]; a.k.a. Stochastic Optimization

SLIDE 4

General Learning: Examples

  • Supervised learning:
    z = (x, y); w specifies a predictor h_w: X → Y; f(w; (x, y)) = loss(h_w(x), y)
    e.g. linear prediction: f(w; (x, y)) = loss(⟨w, x⟩, y)
  • Unsupervised learning, e.g. k-means clustering:
    z = x ∈ ℝ^d; w = (μ[1], …, μ[k]) ∈ ℝ^{d×k} specifies k cluster centers; f((μ[1], …, μ[k]); x) = min_j ‖μ[j] − x‖²
  • Density estimation:
    w specifies a probability density p_w(x); f(w; x) = −log p_w(x)
  • Optimization in an uncertain environment, e.g.:
    z = traffic delays on each road segment; w = route chosen (indicator over road segments in the route); f(w; z) = ⟨w, z⟩ = total delay along the route

Minimize F(w) = E_z[ f(w; z) ] based on the sample z_1, z_2, …, z_n (a small code sketch of these objectives follows below).
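The following is a minimal Python sketch of the objectives f(w; z) listed on this slide; the function names, data shapes, and the squared loss used for linear prediction are illustrative assumptions rather than anything specified on the slide.

```python
import numpy as np

# Hypothetical toy instances of the objectives f(w; z) from this slide.

def f_supervised_linear(w, z):
    """Linear prediction, z = (x, y): f(w; (x, y)) = loss(<w, x>, y), here squared loss."""
    x, y = z
    return (np.dot(w, x) - y) ** 2

def f_kmeans(centers, x):
    """k-means clustering: centers is a (k, d) array, f = min_j ||centers[j] - x||^2."""
    return np.min(np.sum((centers - x) ** 2, axis=1))

def f_density(log_p_w, x):
    """Density estimation: f(w; x) = -log p_w(x); log_p_w is the model's log-density."""
    return -log_p_w(x)

def f_route(w, z):
    """Uncertain routing: w = indicator over road segments, z = delays, f(w; z) = <w, z>."""
    return np.dot(w, z)

def empirical_risk(f, w, sample):
    """Estimate F(w) = E_z[f(w; z)] by the empirical average over z_1, ..., z_n."""
    return np.mean([f(w, z) for z in sample])
```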

SLIDE 5

Uniform convergence: sup_{w∈W} |F(w) − F̂(w)| → 0 as n → ∞
  ({ h_w | w ∈ W } has finite fat-shattering dimension)
    ⇓
Learnable using ERM: F(ŵ) → F(w*) as n → ∞,
  where ŵ = arg min_w F̂(w) and F̂(w) = (1/n) Σ_i f(w, z_i)

SLIDE 6

Supervised Classification: f(w; (x, y)) = loss(h_w(x), y)

Learnable (using some rule): F(w̃) → F(w*) as n → ∞
    ⇕
Learnable using ERM: F(ŵ) → F(w*) as n → ∞, where ŵ = arg min_w F̂(w)
    ⇕
Uniform convergence: sup_{w∈W} |F(w) − F̂(w)| → 0 as n → ∞
  ({ h_w | w ∈ W } has finite fat-shattering dimension)

In supervised classification all of these are equivalent [Alon Ben-David Cesa-Bianchi Haussler 93].

SLIDE 7

Beyond Supervised Learning

  • Supervised learning: f(w; (x, y)) = loss(h_w(x), y)
    – Combinatorial necessary and sufficient condition for learnability
    – Uniform convergence necessary and sufficient for learnability
    – ERM is universal (if the problem is learnable at all, it is learnable with ERM)
  • General learning / stochastic optimization: f(w; z): ????

SLIDE 8

Online Learning (Optimization)

  • Known function f(⋅, ⋅)
  • Unknown sequence z_1, z_2, …
  • Online learning rule: w_i(z_1, …, z_{i−1})
  • Goal: small cumulative loss Σ_i f(w_i, z_i)

Differences vs. the stochastic setting:

  • Any sequence, not necessarily i.i.d.
  • No distinction between “train” and “test”

Learner plays w_1, w_2, w_3, …; adversary plays f(⋅; z_1), f(⋅; z_2), f(⋅; z_3), …

SLIDE 9

Online and Stochastic Regret

  • Online regret: for any sequence,
    (1/n) Σ_{i=1}^n f(w_i(z_1, …, z_{i−1}), z_i) ≤ inf_{w∈W} (1/n) Σ_{i=1}^n f(w, z_i) + Reg(n)
  • Statistical regret: for any distribution D,
    E_{z_1,…,z_n~D}[ F_D(w(z_1, …, z_n)) ] ≤ inf_{w∈W} F_D(w) + ε(n)
  • Online-to-batch: take w(z_1, …, z_n) = w_i with probability 1/n; then
    E[ F(w) ] ≤ F(w*) + Reg(n)   (sketched in code below)
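Below is a small Python sketch of the online-to-batch conversion from this slide: run an online rule over the i.i.d. sample and output one of its iterates chosen uniformly at random. The `online_rule` interface is an assumption made for illustration.

```python
import random

def online_to_batch(online_rule, sample, rng=random):
    """Run an online rule over an i.i.d. sample and return w_i chosen with
    probability 1/n, so that E[F(w)] <= F(w*) + Reg(n) as on this slide."""
    iterates, history = [], []
    for z in sample:
        iterates.append(online_rule(history))  # w_i may depend only on z_1, ..., z_{i-1}
        history.append(z)
    return rng.choice(iterates)
```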

SLIDE 10

Supervised Classification: f(w; (x, y)) = loss(h_w(x), y)

Online learnable   ⇒ (via online-to-batch)

Learnable (using some rule): F(w̃) → F(w*) as n → ∞
    ⇕
Learnable using ERM: F(ŵ) → F(w*) as n → ∞, where ŵ = arg min_w F̂(w)
    ⇕
Uniform convergence: sup_{w∈W} |F(w) − F̂(w)| → 0 as n → ∞
  ({ h_w | w ∈ W } has finite fat-shattering dimension)   [Alon Ben-David Cesa-Bianchi Haussler 93]

SLIDE 11

Convex Lipschitz Problems

  • W: a convex, bounded subset of a Hilbert space (or of ℝ^d), with ‖w‖_2 ≤ B for all w ∈ W
  • For each z, f(w, z) is convex and L-Lipschitz w.r.t. w:
    |f(w, z) − f(w′, z)| ≤ L ⋅ ‖w − w′‖_2
  • E.g. f(w; (x, y)) = loss(⟨w, x⟩; y) with |loss′| ≤ 1 and ‖x‖_2 ≤ L
  • Online Gradient Descent: Reg(n) ≤ √( B²L² / n )
  • Stochastic setting:
    – For generalized linear objectives (including supervised): matches the ERM rate
    – For general convex Lipschitz problems?
  • Learnable via online-to-batch (SGD); a sketch follows below
  • Using ERM?
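A minimal sketch of the projected online gradient descent learner referred to above, assuming a Euclidean ball domain ‖w‖₂ ≤ B and a (sub)gradient oracle; the example loss, constants, and data are hypothetical, and averaging the iterates gives the online-to-batch (SGD) output.

```python
import numpy as np

def project_ball(w, B):
    """Euclidean projection onto {w : ||w||_2 <= B}."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def online_gradient_descent(grad_f, zs, d, B, L):
    """Projected OGD with step size eta = B / (L * sqrt(n)), which gives
    Reg(n) = O(sqrt(B^2 L^2 / n)) for convex L-Lipschitz losses."""
    n = len(zs)
    eta = B / (L * np.sqrt(n))
    w = np.zeros(d)
    iterates = []
    for z in zs:
        iterates.append(w.copy())
        w = project_ball(w - eta * grad_f(w, z), B)
    return iterates

# Hypothetical example: f(w; (x, y)) = |<w, x> - y|, subgradient sign(<w, x> - y) * x.
def grad_abs_linear(w, z):
    x, y = z
    return np.sign(np.dot(w, x) - y) * x

rng = np.random.default_rng(0)
zs = [(rng.normal(size=5), rng.normal()) for _ in range(1000)]
iterates = online_gradient_descent(grad_abs_linear, zs, d=5, B=1.0, L=5.0)
w_sgd = np.mean(iterates, axis=0)  # online-to-batch: average of the iterates
```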
SLIDE 12

Center of Mass with Missing Data

f(w; (J, y_J)) = Σ_{j∈J} (w[j] − y[j])²,   J ⊆ [d],   y[j] given for j ∈ J,   ‖y‖ ≤ 1,   w ∈ ℝ^d, ‖w‖ ≤ 1

Consider P(j ∈ J) = 1/2 independently for all j, and y = 0. If d ≫ 2^n (think of d = ∞), then with high probability there is a coordinate j that is never seen in the sample, i.e. j ∉ J_i for all i = 1, …, n. For such a coordinate:

F̂(e_j) = 0   while   F(e_j) = 1/2,   so   sup_{w∈W} |F(w) − F̂(w)| ≥ 1/2

No uniform convergence! Moreover, e_j is an empirical minimizer with F(e_j) = 1/2, far from F(w*) = F(0) = 0 (simulated in the sketch below).
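A quick simulation of this counterexample under the slide's assumptions (P(j ∈ J) = 1/2, y = 0, d ≫ 2^n): it finds a never-observed coordinate j and shows F̂(e_j) = 0 while F(e_j) = 1/2. The specific values of n and d are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4096                             # d >> 2^n, so an unseen coordinate exists w.h.p.
masks = rng.integers(0, 2, size=(n, d))    # J_i: each coordinate observed w.p. 1/2; y = 0

def empirical_risk(w):
    """F_hat(w): average of f(w; (J_i, 0)) = sum_{j in J_i} w[j]^2 over the sample."""
    return np.mean(masks @ (w ** 2))

def population_risk(w):
    """F(w) = E[f(w; (J, 0))] = (1/2) * ||w||^2."""
    return 0.5 * np.sum(w ** 2)

never_seen = np.flatnonzero(masks.sum(axis=0) == 0)
j = never_seen[0]                  # a coordinate no sample ever touches (w.h.p. one exists)
e_j = np.zeros(d)
e_j[j] = 1.0                       # an empirical minimizer: F_hat(e_j) = 0 ...
print(empirical_risk(e_j), population_risk(e_j))   # ... but F(e_j) = 0.5: no uniform convergence
```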

SLIDE 13

General setting vs. supervised learning:

Learnable (using some rule): F(w̃) → F(w*)
    ⇑
Learnable with ERM: F(ŵ) → F(w*)
    ⇑
Uniform convergence: sup_{w∈W} |F(w) − F̂(w)| → 0
  ({ z ↦ f(w; z) | w ∈ W } has finite fat-shattering dimension)

Online learnable

In supervised learning all of these are equivalent; in the general setting only the upward implications hold (the example above is learnable, yet ERM fails and there is no uniform convergence).

SLIDE 14

Stochastic Convex Optimization

  • Empirical minimization might not be consistent
  • Learnable using a specific procedural rule (online-to-batch conversion of online gradient descent)
  • ??????????
SLIDE 15

Strongly Convex Objectives

f(w, z) is λ-strongly convex in w iff:

f( (w + w′)/2, z ) ≤ ( f(w, z) + f(w′, z) ) / 2 − (λ/8) ‖w − w′‖²_2

For twice-differentiable f this is equivalent to ∇²_w f(w, z) ≽ λI.

If f(w, z) is λ-strongly convex and L-Lipschitz w.r.t. w:

  • Online Gradient Descent [Hazan Kalai Kale Agarwal 2006]: Reg(n) ≤ O( L² log(n) / (λn) )
  • Empirical Risk Minimization: E[ F(ŵ) ] ≤ F(w*) + O( L² / (λn) )

Stochastic setting? ERM?

SLIDE 16

Strong Convexity and Stability

  • Definition: a rule w(z_1, …, z_n) is β(n)-stable if:
    f( w(z_1, …, z_{n−1}), z_n ) − f( w(z_1, …, z_n), z_n ) ≤ β(n)
  • For a symmetric rule: w is β-stable ⇒ E[ F(w(z_1, …, z_{n−1})) ] ≤ E[ F̂(w(z_1, …, z_n)) ] + β
  • f is λ-strongly convex and L-Lipschitz ⇒ the ERM is stable with
    f( ŵ(z_1, …, z_{n−1}), z_n ) − f( ŵ(z_1, …, z_n), z_n ) ≤ β = 4L² / (λn)
  • Conclusion: E[ F(ŵ) ] ≤ E[ F̂(ŵ) ] + β(n), and for the ERM E[ F̂(ŵ) ] ≤ E[ F̂(w*) ] = F(w*), so E[ F(ŵ) ] ≤ F(w*) + 4L²/(λn) (illustrated numerically below).
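A rough numerical illustration of the stability claim, using ridge regression as a hypothetical λ-strongly convex instance (squared loss plus (λ/2)‖w‖²): dropping the last example changes its loss on z_n by an amount that shrinks roughly like O(1/(λn)). Only the stability definition comes from the slide; the model and data are assumptions.

```python
import numpy as np

def f(w, z, lam):
    """f(w; (x, y)) = (<w, x> - y)^2 + (lam/2) * ||w||^2  (lam-strongly convex in w)."""
    x, y = z
    return (np.dot(w, x) - y) ** 2 + (lam / 2) * np.dot(w, w)

def erm(zs, lam):
    """Exact minimizer of the empirical average of f, i.e. ridge regression in closed form."""
    X = np.array([x for x, _ in zs])
    y = np.array([t for _, t in zs])
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + (lam / 2) * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(1)
lam, n, d = 1.0, 200, 5
zs = [(rng.normal(size=d), rng.normal()) for _ in range(n)]

w_full = erm(zs, lam)       # w(z_1, ..., z_n)
w_loo = erm(zs[:-1], lam)   # w(z_1, ..., z_{n-1})
z_n = zs[-1]
# Leave-one-out stability on z_n: small, and shrinks roughly like O(1/(lam * n)).
print(f(w_loo, z_n, lam) - f(w_full, z_n, lam))
```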

SLIDE 17

Empirical Minimization Consistent, but is there Uniform Convergence?

Add the strongly convex regularizer to the missing-data objective:

f(w; (J, y_J)) = Σ_{j∈J} (w[j] − y[j])² + λ‖w‖²_2,   J ⊆ [d],   y[j] given for j ∈ J,   ‖y‖ ≤ 1,   w ∈ ℝ^d, ‖w‖ ≤ 1

Consider P(j ∈ J) = 1/2 independently for all j, and y = 0. For a coordinate j that is never seen in the sample:

F̂(t·e_j) = λt²   while   F(t·e_j) = t²/2 + λt²,   so   sup_{w∈W} |F(w) − F̂(w)| ≥ 1/2

Still no uniform convergence (even though, by stability, empirical minimization is now consistent).

SLIDE 18

The learnability diagram, revisited:

Solvable (using some algorithm): F(w̃) → F(w*)
    ⇑
Empirical minimizer is consistent: F(ŵ) → F(w*)
    ⇑
Uniform convergence: sup_{w∈W} |F(w) − F̂(w)| → 0
  ({ z ↦ f(w; z) | w ∈ W } has finite fat-shattering dimension)

Online learnable

In supervised learning these are equivalent; in the general setting only the upward implications hold, and in the strongly convex example uniform convergence fails (not even locally) although the empirical minimizer is consistent.

SLIDE 19

Back to Weak Convexity

f(w, z) convex and L-Lipschitz, ‖w‖_2 ≤ B

  • Use Regularized ERM:
    ŵ_λ = arg min_{w∈W} F̂(w) + (λ/2)‖w‖²_2
  • Setting λ = √( L² / (B²n) ):
    E[ F(ŵ_λ) ] ≤ F(w*) + O( √( L²B² / n ) )
  • Key: the strongly convex regularizer ensures stability (see the sketch below)
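A sketch of the regularized-ERM recipe on this slide, with a hypothetical convex Lipschitz loss (absolute error of a linear predictor), the regularization level λ = √(L²/(B²n)) from the slide, and a plain stochastic subgradient solver (the solver choice is incidental).

```python
import numpy as np

def regularized_erm(zs, L, B, steps=5000, seed=0):
    """Minimize F_hat(w) + (lam/2)||w||_2^2 with lam = sqrt(L^2 / (B^2 * n)) (this slide),
    for the hypothetical Lipschitz loss f(w; (x, y)) = |<w, x> - y|, using stochastic
    subgradient descent with the 1/(lam * t) step size for strongly convex objectives."""
    rng = np.random.default_rng(seed)
    n, d = len(zs), len(zs[0][0])
    lam = np.sqrt(L ** 2 / (B ** 2 * n))
    w = np.zeros(d)
    for t in range(1, steps + 1):
        x, y = zs[rng.integers(n)]
        g = np.sign(np.dot(w, x) - y) * x + lam * w   # subgradient of the regularized loss
        w -= g / (lam * t)
        norm = np.linalg.norm(w)                      # keep w in the domain ||w||_2 <= B
        if norm > B:
            w *= B / norm
    return w, lam
```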
SLIDE 20

The Role of Regularization

  • Structural Risk Minimization view:
    – Adding a regularization term effectively constrains the domain to a lower-complexity domain W_r = { w : ‖w‖ ≤ r }
    – Learning guarantees (e.g. for SVMs, LASSO) are actually for empirical minimization inside W_r, and are based on uniform convergence in W_r
  • In our case:
    – No uniform convergence in W_r, for any r > 0
    – No uniform convergence even of the regularized loss
    – Cannot solve the stochastic optimization problem by restricting to W_r, for any r
    – What regularization buys is stability

SLIDE 21

Stability Characterizes Learnability

Learnable ⟺ ∃ stable AERM;   Learnable with ERM ⟺ stable ERM
(uniform convergence / finite fat-shattering dimension are equivalent to these only in supervised learning)

Theorem: Learnable with a (symmetric) ERM ŵ iff for every D,
  E[ f( ŵ(z_1, …, z_{n−1}), z_n ) − f( ŵ(z_1, …, z_n), z_n ) ] ≤ β(n)
for some β(n) → 0 (i.e. the ERM is stable).

Theorem: Learnable iff there exists a symmetric rule w̃ s.t. for every D:

  • w̃ is an “almost ERM” (AERM):  E[ F̂(w̃) − inf_{w∈W} F̂(w) ] ≤ ε(n)
  • w̃ is stable:  E[ f( w̃(z_1, …, z_{n−1}), z_n ) − f( w̃(z_1, …, z_n), z_n ) ] ≤ β(n)

for some ε(n) → 0, β(n) → 0.

SLIDE 22

Strong Convexity and Stability

  • For any norm ‖w‖, suppose:
    – Ψ(w) ≥ 0 is strongly convex w.r.t. ‖w‖, i.e.  Ψ( (w + w′)/2 ) ≤ ( Ψ(w) + Ψ(w′) )/2 − (1/8)‖w − w′‖²
    – f(w, z) is L-Lipschitz w.r.t. ‖w‖:  |f(w, z) − f(w′, z)| ≤ L ⋅ ‖w − w′‖
    ⇒  ŵ_λ = arg min_w F̂(w) + (λ/2)Ψ(w)  is O( L²/(λn) )-stable
  • With λ = √( L² / (Ψ(w*) n) ):
    E[ F(ŵ_λ) ] ≤ F(w*) + O( √( L²Ψ(w*) / n ) )

SLIDE 23

Convex Lipschitz Problems

  • W: a convex, bounded subset of a normed space (ℝ^d or a Banach space)
  • For each z, f(w, z) is convex and L-Lipschitz w.r.t. ‖w‖:
    |f(w, z) − f(w′, z)| ≤ L ⋅ ‖w − w′‖
  • E.g. f(w; (x, y)) = loss(⟨w, x⟩; y) with |loss′| ≤ 1 and ‖x‖_* ≤ L
  • To learn: need Ψ(w) strongly convex w.r.t. ‖⋅‖; then
    E[ F(ŵ_λ) ] ≤ F(w*) + √( L²B² / n ),   where B² = sup_{w∈W} Ψ(w)
  • Is this universal? Can all Lipschitz problems (for all ‖⋅‖ and W) be learned this way?

SLIDE 24

Stability in Online Learning

  • Reminder: a rule w(z_1, …, z_n) is β(n)-stable if
    f( w(z_1, …, z_{n−1}), z_n ) − f( w(z_1, …, z_n), z_n ) ≤ β(n)
  • Follow The Leader (FTL):  w_n(z_1, …, z_{n−1}) = arg min_w Σ_{i=1}^{n−1} f(w, z_i)
  • Be The Leader (BTL):  w_n(z_1, …, z_{n−1}) = arg min_w Σ_{i=1}^{n} f(w, z_i)
  • If the ERM is β(n)-stable:
    Reg_FTL(n) ≤ Reg_BTL(n) + (1/n) Σ_i β(i) ≤ (1/n) Σ_i β(i)   (since Reg_BTL(n) ≤ 0)
  • Follow The Regularized Leader (FTRL):
    w_n(z_1, …, z_{n−1}) = arg min_w Σ_{i=1}^{n−1} f(w, z_i) + λΨ(w)
  • If f is L-Lipschitz and Ψ is strongly convex w.r.t. ‖⋅‖:
    Reg_FTRL(n) ≤ √( L² sup_w Ψ(w) / n )   (sketched in code below)
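A minimal FTRL sketch matching the definition above, specialized (as an illustrative assumption) to linear losses f(w; z) = ⟨w, z⟩ over a Euclidean ball with Ψ(w) = ½‖w‖²₂, for which the regularized leader has a closed form.

```python
import numpy as np

def ftrl_linear(zs, B, L):
    """FTRL  w_n = arg min_{||w||_2 <= B}  sum_{i<n} f(w, z_i) + lam * Psi(w),
    specialized to linear losses f(w; z) = <w, z> and Psi(w) = 0.5 * ||w||_2^2,
    where the regularized leader is just -sum(z_i)/lam projected onto the ball."""
    n, d = len(zs), len(zs[0])
    lam = (L / B) * np.sqrt(n)     # roughly balances lam * sup Psi against n * L^2 / lam
    g_sum = np.zeros(d)
    plays, total_loss = [], 0.0
    for z in zs:
        w = -g_sum / lam
        norm = np.linalg.norm(w)
        if norm > B:               # projection back into the domain
            w *= B / norm
        plays.append(w)
        total_loss += np.dot(w, z)
        g_sum += z
    return plays, total_loss
```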

SLIDE 25

Strong Convexity is Necessary and Sufficient

  • Theorem: If a convex Lipschitz problem (for some ‖⋅‖ and some convex W) can be online learned with regret √( L²B² / n ), then there exists Ψ(w), strongly convex w.r.t. ‖⋅‖, s.t. sup_{w∈W} Ψ(w) ≤ c·B² for a constant c.
  • More generally: for any problem, Follow The Regularized Leader with some Ψ achieves the optimal online regret (up to a constant factor), and this can be established via stability.

SLIDE 26

From FTRL to Mirror Descent

  • Linearized problem:
    f̃_i(w) ≝ f(w_i, z_i) + ⟨ ∇f(w_i, z_i), w − w_i ⟩
  • Main observation: for convex f,  Reg(n) on f ≤ Reg(n) on f̃
  • Follow the Linearized Regularized Leader (a.k.a. Mirror Descent):
    w_n = arg min_w Σ_{i=1}^{n−1} ⟨ ∇f(w_i, z_i), w ⟩ + λΨ(w)
        = ∇Ψ^{−1}( ∇Ψ(w_{n−1}) − (1/λ) ∇f(w_{n−1}, z_{n−1}) )
    Reg_MD(n) ≤ √( L² sup_w Ψ(w) / n )
  • Conclusion: any convex Lipschitz problem (for any W and ‖⋅‖) that is online learnable is (optimally) learnable with this approach (instantiated in the sketch below).
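A sketch of the mirror-descent update above with a non-Euclidean regularizer: the entropic Ψ(w) = Σ_j w_j log w_j on the probability simplex gives the exponentiated-gradient update (multiplicative step plus renormalization). The linear-loss instance, dimensions, and step size are illustrative assumptions.

```python
import numpy as np

def exponentiated_gradient(grad_f, zs, d, eta):
    """Online mirror descent with the entropic regularizer Psi(w) = sum_j w_j log w_j
    on the simplex: grad Psi(w) = log(w) + 1, so the mirror step
        w_t = grad Psi^{-1}( grad Psi(w_{t-1}) - eta * grad f(w_{t-1}, z_{t-1}) )
    becomes a multiplicative update, followed by renormalization (the Bregman
    projection back onto the simplex)."""
    w = np.full(d, 1.0 / d)
    plays = []
    for z in zs:
        plays.append(w.copy())
        w = w * np.exp(-eta * grad_f(w, z))
        w /= w.sum()
    return plays

# Hypothetical instance: linear losses f(w; z) = <w, z> with delays z in [0, 1]^d
# (so grad f(w, z) = z), and the standard EG step size eta ~ sqrt(log d / n).
rng = np.random.default_rng(2)
d, n = 10, 500
zs = [rng.random(d) for _ in range(n)]
plays = exponentiated_gradient(lambda w, z: z, zs, d, eta=np.sqrt(2 * np.log(d) / n))
```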

SLIDE 27

Strong Convexity as the Master Property

Ψ(w) strongly convex w.r.t. ‖w‖ is the master property:

  • arg min_w F̂(w) + λΨ(w) (regularized ERM) is stable
    ⇒ statistical learnability of { f : ‖∇f‖_* ≤ L } via RERM
  • Uniform convergence of { x ↦ ⟨w, x⟩ : ‖w‖ ≤ B }
    ⇒ statistical learnability of generalized linear problems (including supervised) via ERM
  • FTRL / Online Mirror Descent
    ⇒ online learnability of { f : ‖∇f‖_* ≤ L }