Learnability, Stability and Strong Convexity

Nati Srebro
Ohad Shamir (Weizmann)
Shai Shalev-Shwartz (HUJI)
Karthik Sridharan (Cornell)
Ambuj Tewari (Michigan)
Toyota Technological Institute—Chicago (2008-2011)

Outline
Theme: the role of stability and strong convexity in learnability
– Stability as the master property
– Strong convexity as the master property
– From Stability to Online Mirror Descent
The general learning setting [Vapnik 95]:

min_{w∈𝒲} F(w) = 𝔼_{z∼𝒟}[ f(w, z) ]

given an i.i.d. sample z₁, z₂, …, z_n ∼ 𝒟, where 𝒟 is an unknown distribution over z ∈ 𝒵.

Learnable: there is a rule w̃(z₁, …, z_n) such that for every ε > 0 and large enough sample size n(ε), for any distribution 𝒟:
𝔼_{z₁,…,z_n∼𝒟}[ F(w̃) ] ≤ inf_{w∈𝒲} F(w) + ε = F(w*) + ε,   where w* ∈ argmin_{w∈𝒲} F(w)
Examples
– Supervised learning: z = (x, y); w specifies a predictor h_w: X → Y; f(w; (x, y)) = loss(h_w(x), y). E.g. linear prediction: f(w; (x, y)) = loss(⟨w, x⟩, y)
– k-means clustering: z = x ∈ ℝ^d; w = (μ[1], …, μ[k]) ∈ ℝ^{d×k} specifies k cluster centers; f((μ[1], …, μ[k]); x) = min_j ‖μ[j] − x‖²
– Density estimation: w specifies a probability density p_w(x); f(w; x) = −log p_w(x)
– Route planning: z = traffic delays on each road segment; w = route chosen (indicator over road segments in the route); f(w; z) = ⟨w, z⟩ = total delay along the route
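To make the setting concrete, here is a minimal Python rendering of f(w; z) for each example (the function names, the squared-loss default, and the Gaussian family are our illustrative choices, not from the slides):

```python
import numpy as np

# Supervised learning: z = (x, y), f(w; (x, y)) = loss(<w, x>, y)
def f_linear(w, z, loss=lambda p, y: (p - y) ** 2):
    x, y = z
    return loss(np.dot(w, x), y)

# k-means: w = (mu[1], ..., mu[k]) lists k cluster centers, z = x in R^d
def f_kmeans(centers, x):
    return min(np.sum((mu - x) ** 2) for mu in centers)

# Density estimation: w parameterizes a density p_w; f(w; x) = -log p_w(x)
def f_density(w, x):                      # here: a 1-d Gaussian family
    mean, var = w
    return 0.5 * np.log(2 * np.pi * var) + (x - mean) ** 2 / (2 * var)

# Route planning: w = 0/1 indicator over road segments, z = segment delays
def f_route(w, z):
    return np.dot(w, z)                   # total delay along the route
```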
Goal: minimize F(w) = 𝔼_z[f(w; z)] based on the sample z₁, z₂, …, z_n.

For supervised learning, f(w, (x, y)) = loss(h_w(x), y), the following are all equivalent [Alon Ben-David Cesa-Bianchi Haussler 93]:
– Learnable (using some rule): some w̃(z₁, …, z_n) satisfies F(w̃) → inf_{w∈𝒲} F(w) as n → ∞
– Learnable using ERM: F(ŵ) → inf_{w∈𝒲} F(w) as n → ∞, where ŵ = argmin_w F̂(w) and F̂(w) = (1/n) Σ_i f(w, zᵢ)
– Uniform convergence: sup_{w∈𝒲} |F(w) − F̂(w)| → 0 as n → ∞
– { h_w | w ∈ 𝒲 } has finite fat-shattering dimension

So for supervised learning:
– there is a combinatorial necessary and sufficient condition for learnability
– uniform convergence is necessary and sufficient for learnability
– ERM is universal (if the problem is learnable at all, it is learnable with ERM)
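In code, the ERM rule reads as below, a minimal sketch assuming a finite class W (finiteness is our simplification for illustration):

```python
import numpy as np

def erm(W, sample, f):
    """Empirical risk minimization: return argmin_{w in W} F_hat(w),
    where F_hat(w) = (1/n) * sum_i f(w, z_i), over a finite class W."""
    return min(W, key=lambda w: np.mean([f(w, z) for z in sample]))
```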
What about the general setting, with an arbitrary f(w, z)? Is there an analogous characterization of learnability?
Online setting
Learner: plays w₁, w₂, …, w_n;  Adversary: picks z₁, z₂, …, z_n.

Online learnable with regret Reg(n) → 0: for any sequence z₁, …, z_n,
(1/n) Σ_{i=1}^n f( wᵢ(z₁, …, z_{i−1}), zᵢ ) ≤ inf_{w∈𝒲} (1/n) Σ_{i=1}^n f(w, zᵢ) + Reg(n)

Online-to-batch: an online learnable problem is also statistically learnable, e.g. via the averaged iterate w̄, with 𝔼[F(w̄)] ≤ inf_{w∈𝒲} F(w) + ε(n) = F(w*) + ε(n).
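A minimal sketch of the protocol and of measuring regret (learner, adversary, and the finite comparator set W are placeholder callables and data, supplied by the user):

```python
import numpy as np

def online_game(learner, adversary, f, n):
    """Run the protocol: the learner plays w_i based on z_1..z_{i-1},
    the adversary then reveals z_i, and the learner pays f(w_i, z_i)."""
    history, losses = [], []
    for _ in range(n):
        w_i = learner(history)            # depends only on past z's
        z_i = adversary(history, w_i)
        losses.append(f(w_i, z_i))
        history.append(z_i)
    return losses, history

def average_regret(losses, history, f, W):
    """Average regret versus the best fixed w in a finite set W."""
    best = min(np.mean([f(w, z) for z in history]) for w in W)
    return np.mean(losses) - best
```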
The supervised learning picture:
Learnable (using some rule) ⇔ Learnable using ERM ⇔ Uniform convergence ⇔ finite fat-shattering dimension [Alon Ben-David Cesa-Bianchi Haussler 93]
New node in the picture: Online learnable. How does it relate to the statistical notions?
Convex Lipschitz problems
– 𝒲 convex, ∀w ∈ 𝒲: ‖w‖₂ ≤ B
– f(w, z) convex and M-Lipschitz in w: |f(w, z) − f(w′, z)| ≤ M·‖w − w′‖₂
E.g. generalized linear problems: f(w; (x, y)) = loss(⟨w, x⟩; y) with |loss′| ≤ 1 and ‖x‖₂ ≤ M.

Online learnable with regret Reg(n) = √(B²M²/n):
– for generalized linear problems (including supervised): matches the ERM rate
– for general convex Lipschitz problems?
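One concrete algorithm achieving a regret of this order is projected online (sub)gradient descent; a minimal sketch, where grad_f(w, z) is an assumed subgradient oracle for f(·, z):

```python
import numpy as np

def ogd(sample, grad_f, d, B, M):
    """Projected online subgradient descent over {||w||_2 <= B}.
    With step size eta = B/(M*sqrt(n)), the average regret is
    O(sqrt(B^2 M^2 / n)) for convex M-Lipschitz losses."""
    n = len(sample)
    eta = B / (M * np.sqrt(n))
    w = np.zeros(d)
    iterates = []
    for z in sample:
        iterates.append(w.copy())        # play w_i before seeing z_i
        w = w - eta * grad_f(w, z)       # gradient step on f(., z_i)
        norm = np.linalg.norm(w)
        if norm > B:
            w *= B / norm                # project back onto the ball
    return iterates                      # w_1, ..., w_n
```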
A convex Lipschitz problem where ERM fails:
z = (J, x) with J ⊆ [d] and ‖x‖ ≤ 1;  f(w; (J, x)) = ‖(w − x)_J‖₂ = √(Σ_{j∈J} (w[j] − x[j])²), which is convex and 1-Lipschitz;  𝒲 = { w ∈ ℝ^d : ‖w‖ ≤ 1 }.
Consider P[j ∈ J] = 1/2 independently for all j, and x = 0. If d ≫ 2ⁿ (think of d = ∞), then with high probability there exists a coordinate j that is never seen in the sample, i.e. j ∉ Jᵢ for all i = 1…n.
For such a coordinate j: F̂(e_j) = 0, so ŵ = e_j is an empirical risk minimizer, yet F(e_j) = P[j ∈ J] = 1/2, while inf_{w∈𝒲} F(w) = F(0) = 0. ERM fails.
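The failure is easy to check numerically; a quick sketch of the construction (the dimensions, seed, and variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10_000, 8                          # d >> 2^n
J = rng.random((n, d)) < 0.5              # masks J_i: P[j in J_i] = 1/2; x = 0

def f(w, mask):                           # f(w; (J, 0)) = ||w_J||_2
    return np.sqrt(np.sum((w * mask) ** 2))

unseen = np.flatnonzero(~J.any(axis=0))   # coordinates in no J_i; nonempty w.h.p.
e_j = np.zeros(d)
e_j[unseen[0]] = 1.0

print(np.mean([f(e_j, m) for m in J]))    # empirical risk of e_j: 0.0
print(0.5)                                # population risk: E||(e_j)_J|| = 1/2
# e_j is an empirical minimizer (like w* = 0), yet F(e_j) = 1/2 > 0 = F(w*).
```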
The same picture in the general setting, for the example above (𝒲 = { w ∈ ℝ^d : ‖w‖ ≤ 1 }):
– Learnable (using some rule): ?
– Learnable with ERM: ✗
– Uniform convergence of { z ↦ f(w; z) | w ∈ 𝒲 }: ✗
– Online learnable: ✓
The supervised-learning equivalences do not carry over to the general setting.
Strong convexity as the master property
f(w, z) is λ-strongly convex in w if
f((w + w′)/2, z) ≤ (f(w, z) + f(w′, z))/2 − (λ/8)‖w − w′‖₂²
(for twice-differentiable f: ∇²_w f(w, z) ≽ λI).

If f(w, z) is λ-strongly convex and M-Lipschitz, the ERM is stable, and stability yields learnability: for the ERM ŵ, 𝔼[F̂(ŵ)] ≤ 𝔼[F̂(w*)] = F(w*), and combining with stability gives 𝔼[F(ŵ)] ≤ F(w*) + O(M²/(λn)).

The example above has no strong convexity, and for a coordinate j that is never seen in the sample:
sup_{w∈𝒲} (F(w) − F̂(w)) ≥ F(e_j) − F̂(e_j) = 1/2, so uniform convergence fails.
No strong convexity? Add it: replace f(w, z) with f(w, z) + (λ/2)‖w‖₂², which is λ-strongly convex.

Back to the example: z = (J, x), J ⊆ [d], ‖x‖ ≤ 1; P[j ∈ J] = 1/2 independently, x = 0; 𝒲 = { w ∈ ℝ^d : ‖w‖ ≤ 1 }.
Along a coordinate j never seen in the sample, with w = u·e_j:
F̂_λ(u·e_j) = (λ/2)u²   (regularized empirical risk, minimized at u = 0)
F_λ(u·e_j) = u/2 + (λ/2)u²
So the regularized empirical minimizer no longer drifts along unseen coordinates.
The general-setting picture so far:
– Solvable (using some algorithm): ✓ (via regularization, next)
– Empirical minimizer is consistent, F(ŵ) → inf_{w∈𝒲} F(w): ✗
– Uniform convergence of { z ↦ f(w; z) | w ∈ 𝒲 }: ✗, not even locally
– Online learnable: ✓
Regularization ⇒ learnability
For f(w, z) M-Lipschitz (and convex) with ‖w‖₂ ≤ B on 𝒲, take the regularized ERM
ŵ_λ = argmin_{w∈𝒲} F̂(w) + (λ/2)‖w‖₂².
Setting λ = √(M²/(B²n)):
𝔼[F(ŵ_λ)] ≤ F(w*) + O(√(M²B²/n))
So every convex Lipschitz problem over an ℓ₂ ball is learnable, even though ERM fails and uniform convergence does not hold.
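A sketch of computing ŵ_λ by projected subgradient descent under the theorem's setting (the optimizer, the iteration count T, and the step-size schedule are our illustrative choices; grad_f is an assumed subgradient oracle):

```python
import numpy as np

def rerm(sample, grad_f, d, B, M, T=2000):
    """Approximate w_lam = argmin_{||w||_2 <= B} F_hat(w) + (lam/2)||w||^2
    with lam = sqrt(M^2 / (B^2 n)) as above, via projected subgradient
    descent; step size 1/(lam*t) suits the lam-strongly-convex objective."""
    n = len(sample)
    lam = np.sqrt(M ** 2 / (B ** 2 * n))
    w = np.zeros(d)
    for t in range(1, T + 1):
        g = np.mean([grad_f(w, z) for z in sample], axis=0) + lam * w
        w = w - g / (lam * t)
        norm = np.linalg.norm(w)
        if norm > B:
            w *= B / norm                # project onto the l2 ball
    return w
```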
In fact, the failure of uniform convergence in the example is not even local: for any r > 0, sup over { w : ‖w − w*‖ ≤ r } of F(w) − F̂(w) does not vanish (take w = r·e_j for an unseen coordinate j).
Stability as the master property
(Picture: Learnable, Learnable with ERM, Uniform convergence, finite fat-shattering dimension; new master node: ∃ stable AERM.)

Theorem: Learnable with (symmetric) ERM ŵ iff the ERM is stable, i.e. for all 𝒟:
𝔼[ f(ŵ(z₁, …, z_{n−1}), z_n) − f(ŵ(z₁, …, z_n), z_n) ] ≤ β(n)
for some β(n) → 0.

Theorem: Learnable iff there exists a symmetric rule w̃ that is a stable AERM, i.e. for all 𝒟:
𝔼[ F̂(w̃) − min_{w∈𝒲} F̂(w) ] ≤ ε(n)   (asymptotic empirical risk minimizer)
𝔼[ f(w̃(z₁, …, z_{n−1}), z_n) − f(w̃(z₁, …, z_n), z_n) ] ≤ β(n)   (stability)
for some ε(n) → 0, β(n) → 0.
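The quantity inside these expectations can be probed numerically; a minimal Monte-Carlo sketch (rule, f, and draw_sample are placeholders for a learning rule, a loss, and a sample generator):

```python
import numpy as np

def stability_gap(rule, f, sample):
    """The random quantity inside the expectation:
    f(rule(z_1..z_{n-1}), z_n) - f(rule(z_1..z_n), z_n),
    i.e. how much training on z_n reduces the loss on z_n itself."""
    w_without = rule(sample[:-1])   # rule trained without z_n
    w_with = rule(sample)           # rule trained with z_n
    return f(w_without, sample[-1]) - f(w_with, sample[-1])

def estimate_stability(rule, f, draw_sample, n, reps=100):
    """Average the gap over fresh samples to estimate beta(n)."""
    return np.mean([stability_gap(rule, f, draw_sample(n))
                    for _ in range(reps)])
```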
Regularization with a general strongly convex Ψ:
– Ψ(w) ≥ 0 is 1-strongly convex w.r.t. a norm ‖·‖: Ψ((w + w′)/2) ≤ (Ψ(w) + Ψ(w′))/2 − (1/8)‖w − w′‖²
– f(w, z) is M-Lipschitz w.r.t. ‖·‖: |f(w, z) − f(w′, z)| ≤ M·‖w − w′‖
Then ŵ_λ = argmin_w F̂(w) + λΨ(w) is O(M²/(λn))-stable, and setting λ = √(M²/(Ψ(w*)n)):
𝔼[F(ŵ_λ)] ≤ F(w*) + O(√(M²Ψ(w*)/n))
E.g. generalized linear problems with general norms: f(w; (x, y)) = loss(⟨w, x⟩; y) with |loss′| ≤ 1 and ‖x‖_* ≤ M, so that |f(w, z) − f(w′, z)| ≤ M·‖w − w′‖. Then
𝔼[F(ŵ_λ)] ≤ F(w*) + O(√(M²B²/n)),   where B² = sup_{w∈𝒲} Ψ(w).
Can all Lipschitz problems (for all norms ‖·‖ and domains 𝒲) be learned this way?
From stability to online learning
A rule w̃(z₁, …, z_n) is β(n)-stable if f(w̃(z₁, …, z_{n−1}), z_n) − f(w̃(z₁, …, z_n), z_n) ≤ β(n).

Follow The Leader: w_n(z₁, …, z_{n−1}) = argmin_w Σ_{i=1}^{n−1} f(w, zᵢ)
(the played w_n is the empirical minimizer of the past rounds, vs ŵ_n = argmin_w Σ_{i=1}^n f(w, zᵢ)).
If the leader is β(n)-stable, then Reg(n) ≤ (1/n) Σ_{i=1}^n β(i), as sketched below.
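The step from per-round stability to regret is the standard be-the-leader telescoping; a worked sketch (here wᵢ is the leader on z₁, …, z_{i−1}, so w_{i+1} is the leader after round i):

$$\sum_{i=1}^n f(w_i, z_i) \;-\; \min_{w \in \mathcal{W}} \sum_{i=1}^n f(w, z_i) \;\le\; \sum_{i=1}^n \bigl[\, f(w_i, z_i) - f(w_{i+1}, z_i) \,\bigr] \;\le\; \sum_{i=1}^n \beta(i),$$

since playing the leader one step early ("be the leader") has non-positive regret, by induction on n, and each bracketed term is exactly the stability gap at sample size i.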
Follow The Regularized Leader: w_n(z₁, …, z_{n−1}) = argmin_w Σ_{i=1}^{n−1} f(w, zᵢ) + λΨ(w)
With Ψ strongly convex w.r.t. ‖·‖ and f M-Lipschitz, the regularized leader is stable, giving
Reg(n) ≤ √(M² sup_{w∈𝒲} Ψ(w) / n) → 0
Theorem: if a convex problem (f(w, z) convex and M-Lipschitz w.r.t. some norm ‖·‖, over some convex 𝒲) can be online learned with regret √(M²B²/n), then there exists Ψ(w), strongly convex w.r.t. ‖·‖, such that sup_{w∈𝒲} Ψ(w) ≤ O(B²).
⇒ Follow The Regularized Leader with some Ψ achieves the optimal online regret (up to a constant factor), and this can be established via stability.
From the regularized leader to Online Mirror Descent: linearize the losses,
f̃ᵢ(w) ≝ f(wᵢ, zᵢ) + ⟨∇f(wᵢ, zᵢ), w − wᵢ⟩
(by convexity f̃ᵢ(w) ≤ f(w, zᵢ) while f̃ᵢ(wᵢ) = f(wᵢ, zᵢ), so the regret on f̃ upper-bounds the regret on f), and run FTRL on f̃:
w_n = argmin_w Σ_{i=1}^{n−1} ⟨∇f(wᵢ, zᵢ), w⟩ + λΨ(w)
    = ∇Ψ⁻¹( ∇Ψ(w_{n−1}) − (1/λ)·∇f(w_{n−1}, z_{n−1}) )
This is Online Mirror Descent, with Reg_MD(n) ≤ √(M² sup_{w∈𝒲} Ψ(w) / n).
⇒ Every convex Lipschitz problem that is online learnable is (optimally) learnable with this approach.
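A runnable instance of this update with the entropic regularizer over the probability simplex (this choice of Ψ is our illustration, exponentiated gradient; with Ψ(w) = ½‖w‖₂² the same scheme reduces to lazy projected gradient descent):

```python
import numpy as np

def omd_entropic(sample, grad_f, d, lam):
    """FTRL on the linearized losses with Psi(w) = sum_j w_j log w_j over
    the simplex: w_n = argmin_w <G_{n-1}, w> + lam * Psi(w), where G_{n-1}
    is the sum of past gradients; closed form w_n ~ exp(-G_{n-1} / lam)."""
    G = np.zeros(d)
    iterates = []
    for z in sample:
        logits = -(G - G.min()) / lam     # shift for numerical stability
        w = np.exp(logits)
        w /= w.sum()                      # normalize onto the simplex
        iterates.append(w)
        G += grad_f(w, z)                 # accumulate grad f(w_i, z_i)
    return iterates

# E.g. for the route-planning loss f(w; z) = <w, z>: grad_f = lambda w, z: z
```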
Summary
– Stability as the master property: learnable iff ∃ stable AERM; the regularized ERM argmin_w F̂(w) + λΨ(w) is stable
– Generalized linear problems (including supervised), { ‖x‖_* ≤ M }: uniform convergence of { x ↦ ⟨w, x⟩ | Ψ(w) ≤ B² } already yields statistical learnability with plain ERM
– General convex Lipschitz problems, { ‖∇f(w, z)‖_* ≤ M }: statistical learnability via stable regularized ERM (RERM), not via uniform convergence or plain ERM
– Online: FTRL / Online Mirror Descent is universal for convex Lipschitz problems, again via stability