RegML 2020 Class 3: Early Stopping and Spectral Regularization
Lorenzo Rosasco (UNIGE-MIT-IIT)
Learning problem
Solve
$$\min_w E(w), \qquad E(w) = \int d\rho(x, y)\, L(w^\top x, y)$$
given $(x_1, y_1), \dots, (x_n, y_n)$.

Beyond linear models: non-linear features and kernels.
Regularization by penalization
Replace
$$\min_w E(w)$$
by
$$\min_w \underbrace{\hat E(w) + \lambda \|w\|^2}_{\hat E_\lambda(w)}$$
◮ $\hat E(w) = \frac{1}{n}\sum_{i=1}^{n} L(w^\top x_i, y_i)$
◮ $\lambda > 0$ regularization parameter
Loss functions and computational methods
◮ Logistic loss: $\log(1 + e^{-y w^\top x})$
◮ Hinge loss: $|1 - y w^\top x|_+$

$$w_{t+1} = w_t - \gamma_t \nabla \hat E_\lambda(w_t), \quad \dots$$
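For concreteness, here is a minimal subgradient-descent sketch for the regularized hinge loss in NumPy; the stepsize rule $\gamma_t = 1/(\lambda t)$ and all names are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np

def subgradient_descent_hinge(X, y, lam, steps=1000):
    """Subgradient descent on the lam-regularized hinge loss.
    Labels y are assumed to be +1/-1."""
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, steps + 1):
        margins = y * (X @ w)
        active = margins < 1
        # subgradient of (1/n) sum |1 - y_i w^T x_i|_+ plus 2*lam*w
        g = -(X[active].T @ y[active]) / n + 2 * lam * w
        w -= g / (lam * t)     # decreasing stepsize gamma_t = 1/(lam*t), an illustrative choice
    return w
```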
Square loss
$$(1 - y w^\top x)^2 = (y - w^\top x)^2$$
$$\hat E_\lambda(w) = \hat E(w) + \lambda\|w\|^2 \quad \text{with} \quad \hat E(w) = \frac{1}{n}\|\hat X w - \hat y\|^2$$
◮ $\hat X$: $n \times d$ data matrix
◮ $\hat y$: $n \times 1$ output vector.
Ridge regression / Tikhonov regression
$$\hat E_\lambda(w) = \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda\|w\|^2$$
Smooth and strongly convex:
$$\nabla \hat E_\lambda(w) = \frac{2}{n}\hat X^\top(\hat X w - \hat y) + 2\lambda w = 0 \;\Longrightarrow\; (\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$$
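A minimal NumPy sketch of this closed-form solution (function and variable names are illustrative):

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge / Tikhonov regression: solve (X^T X + lam*n*I) w = X^T y."""
    n, d = X.shape
    A = X.T @ X + lam * n * np.eye(d)    # d x d, symmetric positive definite
    return np.linalg.solve(A, X.T @ y)   # direct solve, roughly O(d^3)
```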
Linear systems
$$(\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$$
◮ $O(nd^2)$ to form $\hat X^\top \hat X$
◮ roughly $O(d^3)$ to solve the linear system
Representer theorem for square loss
$$f(x) = x^\top w \;\Longrightarrow\; f(x) = \sum_{i=1}^{n} x^\top x_i\, c_i$$
Using the SVD of $\hat X$:
$$w = (\hat X^\top \hat X + \lambda n I)^{-1} \hat X^\top \hat y = \hat X^\top \underbrace{(\hat X \hat X^\top + \lambda n I)^{-1} \hat y}_{c} \;\Longrightarrow\; w = \hat X^\top c = \sum_{i=1}^{n} x_i c_i$$
Beyond linear models
$$f(x) = x^\top w = \sum_{i=1}^{n} x^\top x_i\, c_i, \qquad w = (\hat X^\top \hat X + \lambda n I)^{-1}\hat X^\top \hat y, \qquad c = (\hat X \hat X^\top + \lambda n I)^{-1}\hat y$$
◮ non-linear features: $x \mapsto \phi(x) = (\phi_1(x), \dots, \phi_n(x))$, $f(x) = \phi(x)^\top w$
◮ non-linear kernels: $\hat X \hat X^\top = \hat K$, $f(x) = \sum_{i=1}^{n} K(x, x_i)\, c_i$.
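As an illustration, a minimal kernel ridge regression sketch in NumPy; the Gaussian kernel, its bandwidth, and all names are assumptions made for the example, not part of the slides.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K(a, b) = exp(-||a - b||^2 / (2 sigma^2)) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def kernel_ridge_fit(X, y, lam, sigma=1.0):
    """Solve (K + lam*n*I) c = y for the coefficient vector c."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def kernel_ridge_predict(X_train, c, X_test, sigma=1.0):
    """f(x) = sum_i K(x, x_i) c_i evaluated at each test point."""
    return gaussian_kernel(X_test, X_train, sigma) @ c
```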
Interlude: linear systems and stability
$$Aw = y, \quad A = \mathrm{diag}(a_1, \dots, a_d), \quad c = \frac{a_1}{a_d} < \infty, \quad w = A^{-1}y, \quad A^{-1} = \mathrm{diag}(a_1^{-1}, \dots, a_d^{-1})$$
More generally, $A = U\Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_d)$, and $A^{-1} = U\Sigma^{-1}U^\top$, $\Sigma^{-1} = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_d^{-1})$.
Tikhonov Regularization
$$\frac{1}{n}\sum_{i=1}^{n} (y_i - w^\top x_i)^2 \;\to\; \frac{1}{n}\sum_{i=1}^{n} (y_i - w^\top x_i)^2 + \lambda\|w\|^2$$
$$\hat X^\top \hat X w = \hat X^\top \hat y \;\to\; (\hat X^\top \hat X + \lambda n I)\, w = \hat X^\top \hat y$$
Overfitting and numerical stability.
Beyond Tikhonov: TSVD
$$\hat X^\top \hat X = V\Sigma V^\top, \qquad w_M = (\hat X^\top \hat X)^{-1}_M\, \hat X^\top \hat y$$
◮ $(\hat X^\top \hat X)^{-1}_M = V \Sigma^{-1}_M V^\top$
◮ $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots, 0)$

Also known as principal component regression (PCR).
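A minimal TSVD/PCR sketch in NumPy (an illustrative implementation, with all names assumed for the example):

```python
import numpy as np

def tsvd_regression(X, y, M):
    """TSVD / principal component regression:
    keep only the top-M eigendirections of X^T X when inverting."""
    evals, V = np.linalg.eigh(X.T @ X)      # ascending eigenvalues of the symmetric matrix
    evals, V = evals[::-1], V[:, ::-1]      # sort descending
    inv = np.zeros_like(evals)
    inv[:M] = 1.0 / evals[:M]               # sigma_i^{-1} for i <= M, 0 otherwise
    return V @ (inv * (V.T @ (X.T @ y)))    # w_M = V Sigma_M^{-1} V^T X^T y
```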
Principal component analysis (PCA)
Dimensionality reduction: $\hat X^\top \hat X = V\Sigma V^\top$. The eigenvectors are the directions of
◮ maximum variance
◮ best reconstruction
TSVD and PCA
TSVD ⇔ PCA + ERM. Regularization by projection.
TSVD/PCR beyond linearity
Non-linear function: $f(x) = \sum_{i=1}^{p} w_i\,\phi_i(x) = \Phi(x)^\top w$ with $w = (\hat\Phi^\top \hat\Phi)^{-1}_M\, \hat\Phi^\top \hat y$.

Let $\hat\Phi = (\Phi(x_1), \dots, \Phi(x_n))^\top \in \mathbb{R}^{n \times p}$.
$$\hat\Phi^\top \hat\Phi = V\Sigma V^\top, \qquad (\hat\Phi^\top \hat\Phi)^{-1}_M = V\Sigma^{-1}_M V^\top$$
$$\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_p), \qquad \Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots)$$
TSVD/PCR with kernels
$$f(x) = \sum_{i=1}^{n} K(x, x_i)\, c_i, \qquad c = \hat K^{-1}_M\, \hat y$$
◮ $\hat K_{ij} = K(x_i, x_j)$
◮ $\hat K = U\Sigma U^\top$, $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_n)$
◮ $\hat K^{-1}_M = U\Sigma^{-1}_M U^\top$, $\Sigma^{-1}_M = \mathrm{diag}(\sigma_1^{-1}, \dots, \sigma_M^{-1}, 0, \dots)$
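The kernel version in NumPy, as a sketch (names are assumed; the kernel matrix K is taken as given):

```python
import numpy as np

def kernel_tsvd_fit(K, y, M):
    """Kernel TSVD/PCR: invert the kernel matrix only on its top-M eigendirections."""
    evals, U = np.linalg.eigh(K)            # ascending eigenvalues of the n x n kernel matrix
    evals, U = evals[::-1], U[:, ::-1]      # sort descending
    inv = np.zeros_like(evals)
    inv[:M] = 1.0 / evals[:M]               # sigma_i^{-1} for the top M, 0 elsewhere
    return U @ (inv * (U.T @ y))            # c = K_M^{-1} y
```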
Early stopping regularization
Another example of regularization: early stopping of an iterative procedure applied to noisy data.
Gradient descent for square loss
$$w_{t+1} = w_t - \gamma\, \hat X^\top(\hat X w_t - \hat y)$$
$$\sum_{i=1}^{n} (y_i - w^\top x_i)^2 = \|\hat X w - \hat y\|^2$$
◮ no penalty
◮ stepsize chosen a priori: $\gamma = \frac{2}{\|\hat X^\top \hat X\|}$
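A minimal sketch of gradient descent with early stopping in NumPy; the conservative stepsize $1/\|\hat X^\top \hat X\|$ and the names are illustrative choices.

```python
import numpy as np

def gradient_descent_ls(X, y, t_max):
    """Gradient descent on ||Xw - y||^2 from w = 0; the iteration count t_max
    plays the role of the regularization parameter (early stopping)."""
    n, d = X.shape
    gamma = 1.0 / np.linalg.norm(X.T @ X, 2)   # conservative a-priori stepsize
    w = np.zeros(d)
    path = []
    for _ in range(t_max):
        w = w - gamma * X.T @ (X @ w - y)
        path.append(w.copy())                  # the whole regularization path comes for free
    return w, path
```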
Early stopping at work

[Figures: fit on the training set at iterations #1, #2, #7, and #5000]
Semi-convergence
$$\min_w E(w) \quad \text{vs.} \quad \min_w \hat E(w)$$
Connection to Tikhonov or TSVD
$$w_{t+1} = w_t - \gamma\,\hat X^\top(\hat X w_t - \hat y) = (I - \gamma\,\hat X^\top\hat X)\, w_t + \gamma\,\hat X^\top \hat y$$
By induction,
$$w_t = \gamma \underbrace{\sum_{j=0}^{t-1} (I - \gamma\,\hat X^\top\hat X)^j}_{\text{truncated power series}}\, \hat X^\top \hat y$$
Neumann series
$$\gamma \sum_{j=0}^{t-1} (I - \gamma\,\hat X^\top\hat X)^j$$
◮ for $|a| < 1$: $(1-a)^{-1} = \sum_{j=0}^{\infty} a^j \;\Longrightarrow\; a^{-1} = \sum_{j=0}^{\infty} (1-a)^j$
◮ for $A \in \mathbb{R}^{d\times d}$, $\|A\| < 1$, invertible: $A^{-1} = \sum_{j=0}^{\infty} (I - A)^j$
Stable matrix inversion
Truncated Neumann series:
$$(\hat X^\top\hat X)^{-1} = \gamma \sum_{j=0}^{\infty} (I - \gamma\,\hat X^\top\hat X)^j \;\approx\; \gamma \sum_{j=0}^{t-1} (I - \gamma\,\hat X^\top\hat X)^j$$
Compare to $(\hat X^\top\hat X)^{-1} \approx (\hat X^\top\hat X + \lambda n I)^{-1}$.
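A small numerical check of this approximation on synthetic data (all names and values assumed for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
A = X.T @ X                                   # invertible with probability 1 here
gamma = 1.0 / np.linalg.norm(A, 2)

def neumann_inverse(A, gamma, t):
    """gamma * sum_{j=0}^{t-1} (I - gamma*A)^j, the early-stopping approximation of A^{-1}."""
    d = A.shape[0]
    S, P = np.zeros((d, d)), np.eye(d)
    for _ in range(t):
        S += P
        P = P @ (np.eye(d) - gamma * A)
    return gamma * S

for t in (5, 50, 5000):
    err = np.linalg.norm(neumann_inverse(A, gamma, t) - np.linalg.inv(A))
    print(t, err)   # the error shrinks as t grows: few iterations = heavier regularization
```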
Spectral filtering
Different instances of the same principle:
◮ Tikhonov: $w_\lambda = (\hat X^\top\hat X + \lambda n I)^{-1}\hat X^\top\hat y$
◮ Early stopping: $w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma\,\hat X^\top\hat X)^j\, \hat X^\top\hat y$
◮ TSVD: $w_M = (\hat X^\top\hat X)^{-1}_M\, \hat X^\top\hat y$
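All three can be read as replacing $1/\sigma$ by a filtered version $F(\sigma)$ of the eigenvalues of $\hat X^\top\hat X$; the sketch below compares the three filter functions (a minimal illustration, with names and parameter values assumed):

```python
import numpy as np

def tikhonov_filter(sigma, lam_n):
    return 1.0 / (sigma + lam_n)

def early_stopping_filter(sigma, gamma, t):
    # gamma * sum_{j=0}^{t-1} (1 - gamma*sigma)^j = (1 - (1 - gamma*sigma)**t) / sigma
    return (1.0 - (1.0 - gamma * sigma) ** t) / sigma

def tsvd_filter(sigma, threshold):
    return np.where(sigma >= threshold, 1.0 / sigma, 0.0)

sigma = np.logspace(-4, 0, 5)                 # a few eigenvalues across scales
print(tikhonov_filter(sigma, 1e-2))
print(early_stopping_filter(sigma, 1.0, 100))
print(tsvd_filter(sigma, 1e-2))
```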
Statistics and optimization
$$w_t = \gamma \sum_{j=0}^{t-1} (I - \gamma\,\hat X^\top\hat X)^j\, \hat X^\top\hat y$$
The difference is in the computations: $w_{t+1} = w_t - \gamma\,\hat X^\top(\hat X w_t - \hat y)$
◮ Tikhonov: $O(nd^2 + d^3)$
◮ TSVD: $O(nd^2 + d^2 M)$
◮ GD: $O(ndt)$
Regularization path and warm restart
$$\min_w E(w) \quad \text{vs.} \quad \min_w \hat E(w)$$
Beyond linear models
Non-linear function: $f(x) = \sum_{i=1}^{p} w_i\,\phi_i(x) = \Phi(x)^\top w$
◮ Replace $x$ by $\Phi(x) = (\phi_1(x), \dots, \phi_p(x))$
◮ Replace $\hat X$ by $\hat\Phi = (\Phi(x_1), \dots, \Phi(x_n))^\top \in \mathbb{R}^{n\times p}$
$$w_{t+1} = w_t - \gamma\,\hat\Phi^\top(\hat\Phi w_t - \hat y)$$
Computational cost $O(npt)$.
Early-stopping and kernels
$$f(x) = \sum_{i=1}^{n} K(x_i, x)\, c_i$$
By induction,
$$c_{t+1} = c_t - \gamma(\hat K c_t - \hat y), \qquad \hat K_{ij} = K(x_i, x_j)$$
Computational complexity: $O(n^2 t)$.
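A minimal sketch of this kernel iteration in NumPy (the stepsize choice and names are assumptions for the example):

```python
import numpy as np

def kernel_gd_early_stopping(K, y, t_max):
    """Gradient descent on the coefficients c, stopped after t_max steps.
    K is the n x n kernel matrix, y the n-vector of outputs."""
    n = K.shape[0]
    gamma = 1.0 / np.linalg.norm(K, 2)     # a-priori stepsize, an illustrative choice
    c = np.zeros(n)
    for _ in range(t_max):
        c = c - gamma * (K @ c - y)        # O(n^2) per iteration
    return c
```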
What about other loss functions?
◮ PCA + ERM
◮ Gradient / subgradient descent. Iterations for regularization, not only optimization!
Going big...
Bottleneck of kernel methods: memory. Storing $\hat K$ is $O(n^2)$.
Approaches to large scale
◮ (Random) features: find $\Phi : X \to \mathbb{R}^M$, with $M \ll n$, such that $K(x, x') \approx \Phi(x)^\top \Phi(x')$
◮ Subsampling (Nyström): replace $f(x) = \sum_{i=1}^{n} K(x, x_i)\,c_i$ by $f(x) = \sum_{i=1}^{M} K(x, \tilde x_i)\,c_i$, with $\tilde x_i$ subsampled from the training set, $M \ll n$
◮ Greedy!
◮ Neural nets
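As an example of the first approach, a random Fourier features sketch approximating a Gaussian kernel; the construction follows the standard Rahimi-Recht recipe, and all names and parameter values are illustrative.

```python
import numpy as np

def random_fourier_features(X, M, sigma=1.0, seed=0):
    """Map X (n x d) to M random features such that phi(x)^T phi(x')
    approximates the Gaussian kernel exp(-||x - x'||^2 / (2 sigma^2))."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, M)) / sigma   # random frequencies
    b = rng.uniform(0.0, 2 * np.pi, M)        # random phases
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)

# With these features, any linear method from the previous slides
# (ridge, TSVD, gradient descent with early stopping) runs at O(nM) cost per step.
```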
This class
Regularization beyond penalization:
◮ Regularization by projection
◮ Regularization by early stopping
Next class
Multioutput learning:
◮ Multitask learning
◮ Vector-valued learning
◮ Multiclass learning