Learnability, Stability and Strong Convexity (PowerPoint presentation)

SLIDE 1

Learnability, Stability and Strong Convexity

Nati Srebro (Toyota Technological Institute, Chicago, 2008-2011)
Ohad Shamir (Weizmann)
Shai Shalev-Shwartz (HUJI)
Karthik Sridharan (Cornell)
Ambuj Tewari (Michigan)

SLIDE 2

Outline

  • Theme: Role of Stability in Learning
  • Story: Necessary and sufficient condition for learnability
  • Characterizing (statistical) learnability
    – Stability as the master property
  • Convex problems
    – Strong convexity as the master property
  • Stability in online learning
    – From Stability to Online Mirror Descent

SLIDE 3

The General Learning Setting

min_{w∈W} F(w) = E_{z~D}[ f(w; z) ]

given an i.i.d. sample z_1, z_2, …, z_n ~ D

  • Known objective function f: W × Z → ℝ, unknown distribution D over z ∈ Z
  • The problem specified by (W, Z, f) is learnable if there exists a learning rule w(z_1, …, z_n) s.t. for every ε > 0 and large enough sample size n(ε), for any distribution D:

    E_{z_1,…,z_n~D}[ F(w(z_1, …, z_n)) ] ≤ inf_{w∈W} F(w) + ε   (where F(w*) = inf_{w∈W} F(w))

[Vapnik 95]; a.k.a. Stochastic Optimization

SLIDE 4

General Learning: Examples

  • Supervised learning:
    z = (x, y); w specifies a predictor h_w: X → Y; f(w; (x, y)) = loss(h_w(x), y)
    e.g. linear prediction: f(w; (x, y)) = loss(⟨w, x⟩, y)
  • Unsupervised learning, e.g. k-means clustering:
    z = x ∈ ℝ^d; w = (μ[1], …, μ[k]) ∈ ℝ^{d×k} specifies k cluster centers; f((μ[1], …, μ[k]); x) = min_j ‖μ[j] − x‖²
  • Density estimation:
    w specifies a probability density p_w(x); f(w; x) = −log p_w(x)
  • Optimization in an uncertain environment, e.g.:
    z = traffic delays on each road segment; w = route chosen (indicator over road segments in the route); f(w; z) = ⟨w, z⟩ = total delay along the route

Minimize F(w) = E_z[ f(w; z) ] based on the sample z_1, z_2, …, z_n (a small code sketch of these objectives follows below).
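The following is a minimal Python sketch of the objectives f(w; z) listed on this slide; the function names, data shapes, and the squared loss used for linear prediction are illustrative assumptions rather than anything specified on the slide.

```python
import numpy as np

# Hypothetical toy instances of the objectives f(w; z) from this slide.

def f_supervised_linear(w, z):
    """Linear prediction, z = (x, y): f(w; (x, y)) = loss(<w, x>, y), here squared loss."""
    x, y = z
    return (np.dot(w, x) - y) ** 2

def f_kmeans(centers, x):
    """k-means clustering: centers is a (k, d) array, f = min_j ||centers[j] - x||^2."""
    return np.min(np.sum((centers - x) ** 2, axis=1))

def f_density(log_p_w, x):
    """Density estimation: f(w; x) = -log p_w(x); log_p_w is the model's log-density."""
    return -log_p_w(x)

def f_route(w, z):
    """Uncertain routing: w = indicator over road segments, z = delays, f(w; z) = <w, z>."""
    return np.dot(w, z)

def empirical_risk(f, w, sample):
    """Estimate F(w) = E_z[f(w; z)] by the empirical average over z_1, ..., z_n."""
    return np.mean([f(w, z) for z in sample])
```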

SLIDE 5

Uniform convergence: sup_{w∈W} |F(w) − F̂(w)| → 0 as n → ∞
  ({ h_w | w ∈ W } has finite fat-shattering dimension)
    ⇓
Learnable using ERM: F(ŵ) → F(w*) as n → ∞,
  where ŵ = arg min_w F̂(w) and F̂(w) = (1/n) Σ_i f(w, z_i)

SLIDE 6

Supervised Classification: f(w; (x, y)) = loss(h_w(x), y)

Learnable (using some rule): F(w̃) → F(w*) as n → ∞
    ⇕
Learnable using ERM: F(ŵ) → F(w*) as n → ∞, where ŵ = arg min_w F̂(w)
    ⇕
Uniform convergence: sup_{w∈W} |F(w) − F̂(w)| → 0 as n → ∞
  ({ h_w | w ∈ W } has finite fat-shattering dimension)

In supervised classification all of these are equivalent [Alon Ben-David Cesa-Bianchi Haussler 93].

SLIDE 7

Beyond Supervised Learning

  • Supervised learning: f(w; (x, y)) = loss(h_w(x), y)
    – Combinatorial necessary and sufficient condition for learnability
    – Uniform convergence necessary and sufficient for learnability
    – ERM is universal (if the problem is learnable at all, it is learnable with ERM)
  • General learning / stochastic optimization: f(w; z): ????

SLIDE 8

Online Learning (Optimization)

  • Known function f(⋅, ⋅)
  • Unknown sequence z_1, z_2, …
  • Online learning rule: w_i(z_1, …, z_{i−1})
  • Goal: small cumulative loss Σ_i f(w_i, z_i)

Differences vs. the stochastic setting:

  • Any sequence, not necessarily i.i.d.
  • No distinction between “train” and “test”

Learner plays w_1, w_2, w_3, …; adversary plays f(⋅; z_1), f(⋅; z_2), f(⋅; z_3), …

SLIDE 9

Online and Stochastic Regret

  • Online regret: for any sequence,
    (1/n) Σ_{i=1}^n f(w_i(z_1, …, z_{i−1}), z_i) ≤ inf_{w∈W} (1/n) Σ_{i=1}^n f(w, z_i) + Reg(n)
  • Statistical regret: for any distribution D,
    E_{z_1,…,z_n~D}[ F_D(w(z_1, …, z_n)) ] ≤ inf_{w∈W} F_D(w) + ε(n)
  • Online-to-batch: take w(z_1, …, z_n) = w_i with probability 1/n; then
    E[ F(w) ] ≤ F(w*) + Reg(n)   (sketched in code below)
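Below is a small Python sketch of the online-to-batch conversion from this slide: run an online rule over the i.i.d. sample and output one of its iterates chosen uniformly at random. The `online_rule` interface is an assumption made for illustration.

```python
import random

def online_to_batch(online_rule, sample, rng=random):
    """Run an online rule over an i.i.d. sample and return w_i chosen with
    probability 1/n, so that E[F(w)] <= F(w*) + Reg(n) as on this slide."""
    iterates, history = [], []
    for z in sample:
        iterates.append(online_rule(history))  # w_i may depend only on z_1, ..., z_{i-1}
        history.append(z)
    return rng.choice(iterates)
```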

SLIDE 10

Supervised Classification: f(w; (x, y)) = loss(h_w(x), y)

Online learnable   ⇒ (via online-to-batch)

Learnable (using some rule): F(w̃) → F(w*) as n → ∞
    ⇕
Learnable using ERM: F(ŵ) → F(w*) as n → ∞, where ŵ = arg min_w F̂(w)
    ⇕
Uniform convergence: sup_{w∈W} |F(w) − F̂(w)| → 0 as n → ∞
  ({ h_w | w ∈ W } has finite fat-shattering dimension)   [Alon Ben-David Cesa-Bianchi Haussler 93]

SLIDE 11

Convex Lipschitz Problems

  • W: a convex, bounded subset of a Hilbert space (or of ℝ^d), with ‖w‖_2 ≤ B for all w ∈ W
  • For each z, f(w, z) is convex and L-Lipschitz w.r.t. w:
    |f(w, z) − f(w′, z)| ≤ L ⋅ ‖w − w′‖_2
  • E.g. f(w; (x, y)) = loss(⟨w, x⟩; y) with |loss′| ≤ 1 and ‖x‖_2 ≤ L
  • Online Gradient Descent: Reg(n) ≤ √( B²L² / n )
  • Stochastic setting:
    – For generalized linear objectives (including supervised): matches the ERM rate
    – For general convex Lipschitz problems?
  • Learnable via online-to-batch (SGD); a sketch follows below
  • Using ERM?
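A minimal sketch of the projected online gradient descent learner referred to above, assuming a Euclidean ball domain ‖w‖₂ ≤ B and a (sub)gradient oracle; the example loss, constants, and data are hypothetical, and averaging the iterates gives the online-to-batch (SGD) output.

```python
import numpy as np

def project_ball(w, B):
    """Euclidean projection onto {w : ||w||_2 <= B}."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def online_gradient_descent(grad_f, zs, d, B, L):
    """Projected OGD with step size eta = B / (L * sqrt(n)), which gives
    Reg(n) = O(sqrt(B^2 L^2 / n)) for convex L-Lipschitz losses."""
    n = len(zs)
    eta = B / (L * np.sqrt(n))
    w = np.zeros(d)
    iterates = []
    for z in zs:
        iterates.append(w.copy())
        w = project_ball(w - eta * grad_f(w, z), B)
    return iterates

# Hypothetical example: f(w; (x, y)) = |<w, x> - y|, subgradient sign(<w, x> - y) * x.
def grad_abs_linear(w, z):
    x, y = z
    return np.sign(np.dot(w, x) - y) * x

rng = np.random.default_rng(0)
zs = [(rng.normal(size=5), rng.normal()) for _ in range(1000)]
iterates = online_gradient_descent(grad_abs_linear, zs, d=5, B=1.0, L=5.0)
w_sgd = np.mean(iterates, axis=0)  # online-to-batch: average of the iterates
```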
SLIDE 12

Center of Mass with Missing Data

f(w; (J, y_J)) = Σ_{j∈J} (w[j] − y[j])²,   J ⊆ [d],   y[j] given for j ∈ J,   ‖y‖ ≤ 1,   w ∈ ℝ^d, ‖w‖ ≤ 1

Consider P(j ∈ J) = 1/2 independently for all j, and y = 0. If d ≫ 2^n (think of d = ∞), then with high probability there is a coordinate j that is never seen in the sample, i.e. j ∉ J_i for all i = 1, …, n. For such a coordinate:

F̂(e_j) = 0   while   F(e_j) = 1/2,   so   sup_{w∈W} |F(w) − F̂(w)| ≥ 1/2

No uniform convergence! Moreover, e_j is an empirical minimizer with F(e_j) = 1/2, far from F(w*) = F(0) = 0 (simulated in the sketch below).
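A quick simulation of this counterexample under the slide's assumptions (P(j ∈ J) = 1/2, y = 0, d ≫ 2^n): it finds a never-observed coordinate j and shows F̂(e_j) = 0 while F(e_j) = 1/2. The specific values of n and d are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 8, 4096                             # d >> 2^n, so an unseen coordinate exists w.h.p.
masks = rng.integers(0, 2, size=(n, d))    # J_i: each coordinate observed w.p. 1/2; y = 0

def empirical_risk(w):
    """F_hat(w): average of f(w; (J_i, 0)) = sum_{j in J_i} w[j]^2 over the sample."""
    return np.mean(masks @ (w ** 2))

def population_risk(w):
    """F(w) = E[f(w; (J, 0))] = (1/2) * ||w||^2."""
    return 0.5 * np.sum(w ** 2)

never_seen = np.flatnonzero(masks.sum(axis=0) == 0)
j = never_seen[0]                  # a coordinate no sample ever touches (w.h.p. one exists)
e_j = np.zeros(d)
e_j[j] = 1.0                       # an empirical minimizer: F_hat(e_j) = 0 ...
print(empirical_risk(e_j), population_risk(e_j))   # ... but F(e_j) = 0.5: no uniform convergence
```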

SLIDE 13

General setting vs. supervised learning:

Learnable (using some rule): F(w̃) → F(w*)
    ⇑
Learnable with ERM: F(ŵ) → F(w*)
    ⇑
Uniform convergence: sup_{w∈W} |F(w) − F̂(w)| → 0
  ({ z ↦ f(w; z) | w ∈ W } has finite fat-shattering dimension)

Online learnable

In supervised learning all of these are equivalent; in the general setting only the upward implications hold (the example above is learnable, yet ERM fails and there is no uniform convergence).

SLIDE 14

Stochastic Convex Optimization

  • Empirical minimization might not be consistent
  • Learnable using a specific procedural rule (online-to-batch conversion of online gradient descent)
  • ??????????
SLIDE 15

Strongly Convex Objectives

f(w, z) is λ-strongly convex in w iff:

f( (w + w′)/2, z ) ≤ ( f(w, z) + f(w′, z) ) / 2 − (λ/8) ‖w − w′‖²_2

For twice-differentiable f this is equivalent to ∇²_w f(w, z) ≽ λI.

If f(w, z) is λ-strongly convex and L-Lipschitz w.r.t. w:

  • Online Gradient Descent [Hazan Kalai Kale Agarwal 2006]: Reg(n) ≤ O( L² log(n) / (λn) )
  • Empirical Risk Minimization: E[ F(ŵ) ] ≤ F(w*) + O( L² / (λn) )

Stochastic setting? ERM?

SLIDE 16

Strong Convexity and Stability

  • Definition: a rule w(z_1, …, z_n) is β(n)-stable if:
    f( w(z_1, …, z_{n−1}), z_n ) − f( w(z_1, …, z_n), z_n ) ≤ β(n)
  • For a symmetric rule: w is β-stable ⇒ E[ F(w(z_1, …, z_{n−1})) ] ≤ E[ F̂(w(z_1, …, z_n)) ] + β
  • f is λ-strongly convex and L-Lipschitz ⇒ the ERM is stable with
    f( ŵ(z_1, …, z_{n−1}), z_n ) − f( ŵ(z_1, …, z_n), z_n ) ≤ β = 4L² / (λn)
  • Conclusion: E[ F(ŵ) ] ≤ E[ F̂(ŵ) ] + β(n), and for the ERM E[ F̂(ŵ) ] ≤ E[ F̂(w*) ] = F(w*), so E[ F(ŵ) ] ≤ F(w*) + 4L²/(λn) (illustrated numerically below).
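A rough numerical illustration of the stability claim, using ridge regression as a hypothetical λ-strongly convex instance (squared loss plus (λ/2)‖w‖²): dropping the last example changes its loss on z_n by an amount that shrinks roughly like O(1/(λn)). Only the stability definition comes from the slide; the model and data are assumptions.

```python
import numpy as np

def f(w, z, lam):
    """f(w; (x, y)) = (<w, x> - y)^2 + (lam/2) * ||w||^2  (lam-strongly convex in w)."""
    x, y = z
    return (np.dot(w, x) - y) ** 2 + (lam / 2) * np.dot(w, w)

def erm(zs, lam):
    """Exact minimizer of the empirical average of f, i.e. ridge regression in closed form."""
    X = np.array([x for x, _ in zs])
    y = np.array([t for _, t in zs])
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + (lam / 2) * np.eye(d), X.T @ y / n)

rng = np.random.default_rng(1)
lam, n, d = 1.0, 200, 5
zs = [(rng.normal(size=d), rng.normal()) for _ in range(n)]

w_full = erm(zs, lam)       # w(z_1, ..., z_n)
w_loo = erm(zs[:-1], lam)   # w(z_1, ..., z_{n-1})
z_n = zs[-1]
# Leave-one-out stability on z_n: small, and shrinks roughly like O(1/(lam * n)).
print(f(w_loo, z_n, lam) - f(w_full, z_n, lam))
```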

SLIDE 17

Empirical Minimization Consistent, but is there Uniform Convergence?

Add the strongly convex regularizer to the missing-data objective:

f(w; (J, y_J)) = Σ_{j∈J} (w[j] − y[j])² + λ‖w‖²_2,   J ⊆ [d],   y[j] given for j ∈ J,   ‖y‖ ≤ 1,   w ∈ ℝ^d, ‖w‖ ≤ 1

Consider P(j ∈ J) = 1/2 independently for all j, and y = 0. For a coordinate j that is never seen in the sample:

F̂(t·e_j) = λt²   while   F(t·e_j) = t²/2 + λt²,   so   sup_{w∈W} |F(w) − F̂(w)| ≥ 1/2

Still no uniform convergence (even though, by stability, empirical minimization is now consistent).

SLIDE 18

The learnability diagram, revisited:

Solvable (using some algorithm): F(w̃) → F(w*)
    ⇑
Empirical minimizer is consistent: F(ŵ) → F(w*)
    ⇑
Uniform convergence: sup_{w∈W} |F(w) − F̂(w)| → 0
  ({ z ↦ f(w; z) | w ∈ W } has finite fat-shattering dimension)

Online learnable

In supervised learning these are equivalent; in the general setting only the upward implications hold, and in the strongly convex example uniform convergence fails (not even locally) although the empirical minimizer is consistent.

SLIDE 19

Back to Weak Convexity

f(w, z) convex and L-Lipschitz, ‖w‖_2 ≤ B

  • Use Regularized ERM:
    ŵ_λ = arg min_{w∈W} F̂(w) + (λ/2)‖w‖²_2
  • Setting λ = √( L² / (B²n) ):
    E[ F(ŵ_λ) ] ≤ F(w*) + O( √( L²B² / n ) )
  • Key: the strongly convex regularizer ensures stability (see the sketch below)
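A sketch of the regularized-ERM recipe on this slide, with a hypothetical convex Lipschitz loss (absolute error of a linear predictor), the regularization level λ = √(L²/(B²n)) from the slide, and a plain stochastic subgradient solver (the solver choice is incidental).

```python
import numpy as np

def regularized_erm(zs, L, B, steps=5000, seed=0):
    """Minimize F_hat(w) + (lam/2)||w||_2^2 with lam = sqrt(L^2 / (B^2 * n)) (this slide),
    for the hypothetical Lipschitz loss f(w; (x, y)) = |<w, x> - y|, using stochastic
    subgradient descent with the 1/(lam * t) step size for strongly convex objectives."""
    rng = np.random.default_rng(seed)
    n, d = len(zs), len(zs[0][0])
    lam = np.sqrt(L ** 2 / (B ** 2 * n))
    w = np.zeros(d)
    for t in range(1, steps + 1):
        x, y = zs[rng.integers(n)]
        g = np.sign(np.dot(w, x) - y) * x + lam * w   # subgradient of the regularized loss
        w -= g / (lam * t)
        norm = np.linalg.norm(w)                      # keep w in the domain ||w||_2 <= B
        if norm > B:
            w *= B / norm
    return w, lam
```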
SLIDE 20

The Role of Regularization

  • Structural Risk Minimization view:
    – Adding a regularization term effectively constrains the domain to a lower-complexity domain W_r = { w : ‖w‖ ≤ r }
    – Learning guarantees (e.g. for SVMs, LASSO) are actually for empirical minimization inside W_r, and are based on uniform convergence in W_r
  • In our case:
    – No uniform convergence in W_r, for any r > 0
    – No uniform convergence even of the regularized loss
    – Cannot solve the stochastic optimization problem by restricting to W_r, for any r
    – What regularization buys is stability

SLIDE 21

Stability Characterizes Learnability

Learnable ⟺ ∃ stable AERM;   Learnable with ERM ⟺ stable ERM
(uniform convergence / finite fat-shattering dimension are equivalent to these only in supervised learning)

Theorem: Learnable with a (symmetric) ERM ŵ iff for every D,
  E[ f( ŵ(z_1, …, z_{n−1}), z_n ) − f( ŵ(z_1, …, z_n), z_n ) ] ≤ β(n)
for some β(n) → 0 (i.e. the ERM is stable).

Theorem: Learnable iff there exists a symmetric rule w̃ s.t. for every D:

  • w̃ is an “almost ERM” (AERM):  E[ F̂(w̃) − inf_{w∈W} F̂(w) ] ≤ ε(n)
  • w̃ is stable:  E[ f( w̃(z_1, …, z_{n−1}), z_n ) − f( w̃(z_1, …, z_n), z_n ) ] ≤ β(n)

for some ε(n) → 0, β(n) → 0.

SLIDE 22

Strong Convexity and Stability

  • For any norm ‖w‖, suppose:
    – Ψ(w) ≥ 0 is strongly convex w.r.t. ‖w‖, i.e.  Ψ( (w + w′)/2 ) ≤ ( Ψ(w) + Ψ(w′) )/2 − (1/8)‖w − w′‖²
    – f(w, z) is L-Lipschitz w.r.t. ‖w‖:  |f(w, z) − f(w′, z)| ≤ L ⋅ ‖w − w′‖
    ⇒  ŵ_λ = arg min_w F̂(w) + (λ/2)Ψ(w)  is O( L²/(λn) )-stable
  • With λ = √( L² / (Ψ(w*) n) ):
    E[ F(ŵ_λ) ] ≤ F(w*) + O( √( L²Ψ(w*) / n ) )

SLIDE 23

Convex Lipschitz Problems

  • W: a convex, bounded subset of a normed space (ℝ^d or a Banach space)
  • For each z, f(w, z) is convex and L-Lipschitz w.r.t. ‖w‖:
    |f(w, z) − f(w′, z)| ≤ L ⋅ ‖w − w′‖
  • E.g. f(w; (x, y)) = loss(⟨w, x⟩; y) with |loss′| ≤ 1 and ‖x‖_* ≤ L
  • To learn: need Ψ(w) strongly convex w.r.t. ‖⋅‖; then
    E[ F(ŵ_λ) ] ≤ F(w*) + √( L²B² / n ),   where B² = sup_{w∈W} Ψ(w)
  • Is this universal? Can all Lipschitz problems (for all ‖⋅‖ and W) be learned this way?

SLIDE 24

Stability in Online Learning

  • Reminder: a rule w(z_1, …, z_n) is β(n)-stable if
    f( w(z_1, …, z_{n−1}), z_n ) − f( w(z_1, …, z_n), z_n ) ≤ β(n)
  • Follow The Leader (FTL):  w_n(z_1, …, z_{n−1}) = arg min_w Σ_{i=1}^{n−1} f(w, z_i)
  • Be The Leader (BTL):  w_n(z_1, …, z_{n−1}) = arg min_w Σ_{i=1}^{n} f(w, z_i)
  • If the ERM is β(n)-stable:
    Reg_FTL(n) ≤ Reg_BTL(n) + (1/n) Σ_i β(i) ≤ (1/n) Σ_i β(i)   (since Reg_BTL(n) ≤ 0)
  • Follow The Regularized Leader (FTRL):
    w_n(z_1, …, z_{n−1}) = arg min_w Σ_{i=1}^{n−1} f(w, z_i) + λΨ(w)
  • If f is L-Lipschitz and Ψ is strongly convex w.r.t. ‖⋅‖:
    Reg_FTRL(n) ≤ √( L² sup_w Ψ(w) / n )   (sketched in code below)
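A minimal FTRL sketch matching the definition above, specialized (as an illustrative assumption) to linear losses f(w; z) = ⟨w, z⟩ over a Euclidean ball with Ψ(w) = ½‖w‖²₂, for which the regularized leader has a closed form.

```python
import numpy as np

def ftrl_linear(zs, B, L):
    """FTRL  w_n = arg min_{||w||_2 <= B}  sum_{i<n} f(w, z_i) + lam * Psi(w),
    specialized to linear losses f(w; z) = <w, z> and Psi(w) = 0.5 * ||w||_2^2,
    where the regularized leader is just -sum(z_i)/lam projected onto the ball."""
    n, d = len(zs), len(zs[0])
    lam = (L / B) * np.sqrt(n)     # roughly balances lam * sup Psi against n * L^2 / lam
    g_sum = np.zeros(d)
    plays, total_loss = [], 0.0
    for z in zs:
        w = -g_sum / lam
        norm = np.linalg.norm(w)
        if norm > B:               # projection back into the domain
            w *= B / norm
        plays.append(w)
        total_loss += np.dot(w, z)
        g_sum += z
    return plays, total_loss
```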

SLIDE 25

Strong Convexity is Necessary and Sufficient

  • Theorem: If a convex Lipschitz problem (for some ‖⋅‖ and some convex W) can be online learned with regret √( L²B² / n ), then there exists Ψ(w), strongly convex w.r.t. ‖⋅‖, s.t. sup_{w∈W} Ψ(w) ≤ c·B² for a constant c.
  • More generally: for any problem, Follow The Regularized Leader with some Ψ achieves the optimal online regret (up to a constant factor), and this can be established via stability.

SLIDE 26

From FTRL to Mirror Descent

  • Linearized problem:
    f̃_i(w) ≝ f(w_i, z_i) + ⟨ ∇f(w_i, z_i), w − w_i ⟩
  • Main observation: for convex f,  Reg(n) on f ≤ Reg(n) on f̃
  • Follow the Linearized Regularized Leader (a.k.a. Mirror Descent):
    w_n = arg min_w Σ_{i=1}^{n−1} ⟨ ∇f(w_i, z_i), w ⟩ + λΨ(w)
        = ∇Ψ^{−1}( ∇Ψ(w_{n−1}) − (1/λ) ∇f(w_{n−1}, z_{n−1}) )
    Reg_MD(n) ≤ √( L² sup_w Ψ(w) / n )
  • Conclusion: any convex Lipschitz problem (for any W and ‖⋅‖) that is online learnable is (optimally) learnable with this approach (instantiated in the sketch below).
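A sketch of the mirror-descent update above with a non-Euclidean regularizer: the entropic Ψ(w) = Σ_j w_j log w_j on the probability simplex gives the exponentiated-gradient update (multiplicative step plus renormalization). The linear-loss instance, dimensions, and step size are illustrative assumptions.

```python
import numpy as np

def exponentiated_gradient(grad_f, zs, d, eta):
    """Online mirror descent with the entropic regularizer Psi(w) = sum_j w_j log w_j
    on the simplex: grad Psi(w) = log(w) + 1, so the mirror step
        w_t = grad Psi^{-1}( grad Psi(w_{t-1}) - eta * grad f(w_{t-1}, z_{t-1}) )
    becomes a multiplicative update, followed by renormalization (the Bregman
    projection back onto the simplex)."""
    w = np.full(d, 1.0 / d)
    plays = []
    for z in zs:
        plays.append(w.copy())
        w = w * np.exp(-eta * grad_f(w, z))
        w /= w.sum()
    return plays

# Hypothetical instance: linear losses f(w; z) = <w, z> with delays z in [0, 1]^d
# (so grad f(w, z) = z), and the standard EG step size eta ~ sqrt(log d / n).
rng = np.random.default_rng(2)
d, n = 10, 500
zs = [rng.random(d) for _ in range(n)]
plays = exponentiated_gradient(lambda w, z: z, zs, d, eta=np.sqrt(2 * np.log(d) / n))
```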

SLIDE 27

Strong Convexity as the Master Property

Ψ(w) strongly convex w.r.t. ‖w‖ is the master property:

  • arg min_w F̂(w) + λΨ(w) (regularized ERM) is stable
    ⇒ statistical learnability of { f : ‖∇f‖_* ≤ L } via RERM
  • Uniform convergence of { x ↦ ⟨w, x⟩ : ‖w‖ ≤ B }
    ⇒ statistical learnability of generalized linear problems (including supervised) via ERM
  • FTRL / Online Mirror Descent
    ⇒ online learnability of { f : ‖∇f‖_* ≤ L }