Improving L-BFGS Initialization for Trust-Region Methods in Deep Learning

Jacob Rafati
http://rafati.net · jrafatiheravi@ucmerced.edu
Ph.D. Candidate, Electrical Engineering and Computer Science, University of California, Merced

Agenda
Empirical risk minimization:

$\min_{w \in \mathbb{R}^n} L(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w)$
Bottou et al. (2016). Optimization methods for large-scale machine learning. Preprint, arXiv:1606.04838.
Generic iterative scheme: given $w_0$ and step lengths $\alpha_k$, update $w_{k+1} \leftarrow w_k + \alpha_k p_k$, $k = 0, 1, 2, \ldots$, along a search direction $p_k$, until $\|\nabla L_k\| < \epsilon$.

For large $n$ and $N$, evaluating $L(w)$, the gradient $\nabla L$, and especially the Hessian $\nabla^2 L(w)$ is expensive.
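The generic update above can be sketched as a small driver. This is a minimal illustration, not code from the talk; the function and variable names are my own, and steepest descent with a fixed step stands in for whichever direction/step-length rules a given method uses.

```python
import numpy as np

def iterative_minimize(grad, w0, direction, step_size, eps=1e-6, max_iter=1000):
    """Generic scheme: w_{k+1} <- w_k + alpha_k * p_k until ||grad L(w_k)|| < eps."""
    w = w0.copy()
    for k in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) < eps:   # stopping test ||grad L|| < eps
            break
        p = direction(w, g)           # search direction p_k (here: steepest descent)
        w = w + step_size(w, p, k) * p
    return w

# Steepest descent on L(w) = ||w||^2 / 2 (so grad L = w), fixed step alpha_k = 0.5.
w_star = iterative_minimize(
    grad=lambda w: w,
    w0=np.array([4.0, -2.0]),
    direction=lambda w, g: -g,
    step_size=lambda w, p, k: 0.5,
)
```

Plugging in a Newton or quasi-Newton direction only changes the `direction` callback; the outer loop is the same.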
Stochastic optimization methods (from statistics) attack the same problem by subsampling:

$\min_{w \in \mathbb{R}^n} L(w) \triangleq \frac{1}{N} \sum_{i=1}^{N} \ell_i(w)$
Choose a random sample $S_k \subset \{1, 2, \ldots, N\}$ and approximate

$\nabla L(w_k) \approx \nabla L^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k)$
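The subsampled gradient is straightforward to sketch. The least-squares loss below is a toy stand-in of my own choosing, just to make the averaging concrete:

```python
import numpy as np

def subsampled_gradient(grad_i, w, sample):
    """grad L^{(S_k)}(w) = (1/|S_k|) * sum_{i in S_k} grad l_i(w)."""
    sample = list(sample)
    return sum(grad_i(w, i) for i in sample) / len(sample)

# Toy least-squares loss l_i(w) = 0.5 * (x_i^T w - t_i)^2, grad l_i = (x_i^T w - t_i) x_i.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
t = rng.normal(size=100)
grad_i = lambda w, i: (X[i] @ w - t[i]) * X[i]
w = np.zeros(5)
S_k = rng.choice(100, size=20, replace=False)   # random sample S_k
g_sub = subsampled_gradient(grad_i, w, S_k)     # noisy estimate of grad L(w)
```

Taking `sample = range(N)` recovers the full-batch gradient exactly; smaller samples trade accuracy for cost.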
Drawbacks of SGD: it is sensitive to the choice of learning rate and its scaling, and convergence is slow and noisy near the minimum.
Similarly, with $S_k \subset \{1, 2, \ldots, N\}$, the Hessian can be subsampled:

$\nabla^2 L(w_k) \approx \nabla^2 L^{(S_k)}(w_k) \triangleq \frac{1}{|S_k|} \sum_{i \in S_k} \nabla^2 \ell_i(w_k)$
Step length: $\alpha_k = \arg\min_{\alpha} L(w_k + \alpha p_k)$ (exact for the quadratic model used by Newton's method).

Newton's method requires the full $n \times n$ Hessian over all parameters: computing and inverting $\nabla^2 L$ is expensive and requires massive storage, so in large-scale problems it is not practical.
Quadratic Model of the objective function
$\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p$

where $g_k = \nabla L(w_k)$ and $B_k \approx \nabla^2 L(w_k)$. With $s_k = w_{k+1} - w_k$ and $y_k = \nabla L(w_{k+1}) - \nabla L(w_k)$, the BFGS update is

$B_{k+1} = B_k - \frac{B_k s_k s_k^T B_k}{s_k^T B_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$
Only gradient differences are needed to construct quasi-Newton matrices (no Hessian evaluations). Storing and updating the dense $n \times n$ matrix $B_k$ can still be expensive, which is why limited-memory variants (L-BFGS) are used in large-scale problems.
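A single BFGS update is a two-term correction; the sketch below is a minimal dense implementation (illustrative only), checked against the secant condition $B_{k+1} s_k = y_k$ that the update enforces by construction:

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation:
    B_{k+1} = B_k - (B_k s s^T B_k)/(s^T B_k s) + (y y^T)/(y^T s)."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# On a quadratic with Hessian H, curvature pairs satisfy y = H s,
# so y^T s = s^T H s > 0 for H positive definite.
H = np.diag([1.0, 2.0, 3.0, 4.0])
s = np.array([1.0, -1.0, 0.5, 2.0])
y = H @ s
B1 = bfgs_update(np.eye(4), s, y)   # satisfies the secant condition B1 s = y
```

The curvature condition $y^T s > 0$ is what keeps $B_{k+1}$ positive definite; practical implementations skip or damp the update when it fails.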
L-BFGS Compact Representation

Store only the $m$ most recent pairs: $S_k = [\, s_{k-m} \;\; \ldots \;\; s_{k-1} \,]$, $Y_k = [\, y_{k-m} \;\; \ldots \;\; y_{k-1} \,]$. Then

$B_k = B_0 + \Psi_k M_k \Psi_k^T, \qquad \Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}$

where $S_k^T Y_k = L_k + D_k + U_k$ (strictly lower triangular + diagonal + strictly upper triangular parts). With $m \ll n$, limited memory storage is achieved.
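The compact representation above can be sketched directly. This is an illustrative implementation of the standard Byrd–Nocedal–Schnabel formula with $B_0 = \gamma I$ (function names are mine); the point is that $B_k$ is only ever applied to vectors, never formed:

```python
import numpy as np

def lbfgs_compact(S, Y, gamma):
    """Compact representation with B_0 = gamma * I:
    B_k = B_0 + Psi M Psi^T, Psi = [B_0 S, Y],
    M = -inv([[S^T B_0 S, L], [L^T, -D]]), where S^T Y = L + D + U."""
    StY = S.T @ Y
    L = np.tril(StY, k=-1)            # strictly lower triangular part of S^T Y
    D = np.diag(np.diag(StY))         # diagonal part of S^T Y
    Psi = np.hstack([gamma * S, Y])   # B_0 S = gamma * S
    M = -np.linalg.inv(np.block([[gamma * (S.T @ S), L],
                                 [L.T, -D]]))
    return Psi, M

def B_matvec(Psi, M, gamma, v):
    """Apply B_k to a vector in O(nm) work, without forming the n x n matrix."""
    return gamma * v + Psi @ (M @ (Psi.T @ v))

# Pairs from a quadratic with Hessian H (so y_j = H s_j and y_j^T s_j > 0).
H = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
Y = H @ S
Psi, M = lbfgs_compact(S, Y, gamma=0.5)
```

Since the compact form agrees with applying the two-term BFGS update for each stored pair in order, the most recent secant equation $B_k s_{k-1} = y_{k-1}$ holds exactly.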
If $B_k$ is positive definite, the quadratic model

$\min_{p \in \mathbb{R}^n} Q_k(p) \triangleq g_k^T p + \frac{1}{2} p^T B_k p$

has the unique minimizer $p_k = -B_k^{-1} g_k$.
Conn, Gould & Toint (2000). Trust-Region Methods. SIAM.
Trust-region subproblem:

$\min_{p \in \mathbb{R}^n} Q(p) \quad \text{subject to} \quad \|p\|_2 \leq \delta_k$

[Figure: contours of a nonconvex quadratic model showing the Newton step, the global minimizer, and a local minimizer of the subproblem.]
Brust et al. (2017). "On solving L-SR1 trust-region subproblems," Computational Optimization and Applications, vol. 66, pp. 245–266.
Solving the subproblem with the compact representation:

Eigen-decomposition: $B_k = P \begin{bmatrix} \Lambda + \gamma_k I & \\ & \gamma_k I \end{bmatrix} P^T$

Sherman-Morrison-Woodbury formula:

$p_k^* = -\frac{1}{\tau^*} \left[ I - \Psi_k \left( \tau^* M_k^{-1} + \Psi_k^T \Psi_k \right)^{-1} \Psi_k^T \right] g_k, \qquad \tau^* = \gamma_k + \sigma^*$
26th European Signal Processing Conference, Rome, Italy. September 2018.
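For a fixed multiplier $\sigma^*$, the Sherman-Morrison-Woodbury step above is a few lines of linear algebra. A minimal sketch (names mine; assumes $M$ and the small inner matrix are invertible), verified against the definition $(B + \sigma^* I)\,p^* = -g$:

```python
import numpy as np

def smw_step(Psi, M, gamma, sigma, g):
    """p* = -(1/tau) [I - Psi (tau M^{-1} + Psi^T Psi)^{-1} Psi^T] g, tau = gamma + sigma.
    This applies (B + sigma I)^{-1} to -g for B = gamma I + Psi M Psi^T via
    Sherman-Morrison-Woodbury: only small (2m x 2m) systems are solved."""
    tau = gamma + sigma
    inner = tau * np.linalg.inv(M) + Psi.T @ Psi    # 2m x 2m
    return -(g - Psi @ np.linalg.solve(inner, Psi.T @ g)) / tau

# Tiny check problem: n = 6, m = 1 (so Psi has 2 columns).
Psi = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0],
                [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
M = np.diag([2.0, 0.5])
gamma, sigma = 1.0, 0.3
g = np.arange(1.0, 7.0)
p = smw_step(Psi, M, gamma, sigma, g)
```

The cost is dominated by $\Psi_k^T \Psi_k$ and $\Psi_k^T g$, i.e. $O(nm)$ rather than the $O(n^3)$ of a dense solve.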
Spectral estimate of the Hessian (initialization $B_0 = \gamma_k I$):

$\gamma_k = \arg\min_{\gamma} \left\| B_0^{-1} y_{k-1} - s_{k-1} \right\|_2^2, \quad B_0 = \gamma I \quad \Longrightarrow \quad \gamma_k = \frac{y_{k-1}^T y_{k-1}}{s_{k-1}^T y_{k-1}}$
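The closed-form minimizer is a one-liner; the example below (my own) checks the property that motivates calling it a spectral estimate: for a quadratic with $y = Hs$, the value is a generalized Rayleigh quotient lying between the extreme eigenvalues of $H$.

```python
import numpy as np

def spectral_gamma(s_prev, y_prev):
    """gamma_k = (y_{k-1}^T y_{k-1}) / (s_{k-1}^T y_{k-1}),
    the minimizer of ||B_0^{-1} y_{k-1} - s_{k-1}||_2^2 over B_0 = gamma I."""
    return (y_prev @ y_prev) / (s_prev @ y_prev)

# Quadratic with Hessian eigenvalues {1, 4}: the estimate must land in [1, 4].
H = np.diag([1.0, 4.0])
s = np.array([1.0, 1.0])
gamma_k = spectral_gamma(s, H @ s)   # (1 + 16) / (1 + 4) = 3.4
```

This is a Barzilai-Borwein-type estimate (the reciprocal of one of the two classical BB step sizes).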
Consider a quadratic function $L(w) = \frac{1}{2} w^T H w + g^T w$. Then every stored pair satisfies $y_j = H s_j$, so $H S_k = Y_k$ and therefore $S_k^T H S_k = S_k^T Y_k$.
Erway et al. (2018). “Trust-Region Algorithms for Training Responses: Machine Learning Methods Using Indefinite Hessian Approximations,” ArXiv e-prints.
$S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k, \qquad \Psi_k = [\, B_0 S_k \;\; Y_k \,], \qquad M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}$
Since the secant conditions $B_k S_k = Y_k$ hold, we have

$S_k^T H S_k - \gamma_k S_k^T S_k = S_k^T \Psi_k M_k \Psi_k^T S_k$

The left-hand side remains positive semidefinite exactly when $\gamma_k$ does not exceed the smallest eigenvalue $\lambda_{\min}$ of the general eigenvalue problem

$(L_k + D_k + L_k^T)\, z = \lambda \, S_k^T S_k \, z$

so $\lambda_{\min}$ is an upper bound on the initial value $\gamma_k$ that avoids false curvature information.
Recall $\Psi_k = [\, B_0 S_k \;\; Y_k \,]$ and $M_k = -\begin{bmatrix} S_k^T B_0 S_k & L_k \\ L_k^T & -D_k \end{bmatrix}^{-1}$; note that these compact-representation matrices themselves contain $\gamma_k$, through $B_0 = \gamma_k I$.
In the general (non-quadratic) case one obtains a general eigenvalue problem $A^* z = \lambda B^* z$ with

$A^* = L_k + D_k + L_k^T - S_k^T Y_k \tilde{D} Y_k^T S_k - \gamma_{k-1}^2 \, (S_k^T S_k)\, \tilde{A}\, (S_k^T S_k),$

$B^* = S_k^T S_k + (S_k^T S_k)\, \tilde{B}\, Y_k^T S_k + S_k^T Y_k\, \tilde{B}^T (S_k^T S_k),$

where $\tilde{A}$, $\tilde{B}$, $\tilde{D}$ arise from the previous iteration's compact representation. Its smallest eigenvalue again gives an upper bound for the initial value $\gamma_k$ that avoids false curvature information.
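For the quadratic case, the bound is easy to compute explicitly. The sketch below is illustrative (names mine; it assumes $S_k$ has full column rank so $S_k^T S_k$ admits a Cholesky factorization) and reduces the general eigenvalue problem to a standard symmetric one:

```python
import numpy as np

def gamma_upper_bound(S, Y):
    """Quadratic-model case: smallest eigenvalue lambda_min of the general
    eigenvalue problem (L + D + L^T) z = lambda (S^T S) z, where S^T Y = L + D + U.
    Choosing gamma_k below lambda_min avoids false curvature information."""
    StY = S.T @ Y
    L = np.tril(StY, k=-1)
    A = L + np.diag(np.diag(StY)) + L.T                 # L + D + L^T
    R = np.linalg.cholesky(S.T @ S)                     # S^T S = R R^T, R lower triangular
    C = np.linalg.solve(R, np.linalg.solve(R, A).T).T   # R^{-1} A R^{-T}
    return float(np.linalg.eigvalsh(C).min())

# Pairs from a quadratic with Hessian eigenvalues in [1, 5]:
# every generalized eigenvalue is a Rayleigh quotient of H, so the bound
# must land in that interval.
H = np.diag([1.0, 2.0, 3.0, 4.0, 5.0])
S = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
lam_min = gamma_upper_bound(S, H @ S)
```

Any $\gamma_k \in (0, \lambda_{\min})$ then keeps $S_k^T (H - \gamma_k I) S_k$ positive semidefinite.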
Image classification example (MNIST):

[Figure: input features → model (predictor) → output label probabilities over classes 1, 2, 3, …, 9, e.g. digit "3" scored 0.97, 0.03 elsewhere.]

[Figure: LeNet-style convolutional network — input → conv1 → pool1 → conv2 → pool2 → hidden layer → output.]
LeCun et al. (1998). "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324.
Target (one-hot) and output vectors:

$t = [\, 0, 0, 1, 0, \ldots, 0 \,], \qquad y = [\, 0, 0, 0.97, 0.03, \ldots \,]$

Cross-entropy loss: $\ell(t, y) = -t \cdot \log(y) - (1 - t) \cdot \log(1 - y)$

Empirical risk: $L(w) = \frac{1}{N} \sum_{i=1}^{N} \ell(t_i, y_i)$
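The loss above in code, with the numbers from the slide (the `eps` clipping guard is my addition, a standard trick to avoid $\log 0$):

```python
import numpy as np

def cross_entropy(t, y, eps=1e-12):
    """l(t, y) = -t . log(y) - (1 - t) . log(1 - y), summed over outputs."""
    y = np.clip(y, eps, 1.0 - eps)   # guard against log(0)
    return float(-(t * np.log(y) + (1.0 - t) * np.log(1.0 - y)).sum())

t = np.array([0.0, 0.0, 1.0, 0.0])        # one-hot target: class "3"
y = np.array([0.01, 0.01, 0.97, 0.01])    # network output probabilities
loss = cross_entropy(t, y)                # small, since prediction matches target
```

The empirical risk is then the average of this loss over all $N$ training pairs.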
Multi-batch sampling with overlap: consecutive batches $S_k, S_{k+1}, S_{k+2}, \ldots$ drawn from the shuffled data share overlaps $O_k = S_k \cap S_{k+1}$.

[Figure: consecutive batches of shuffled data with shaded overlap regions $O_{k-1}, O_k, O_{k+1}$.]
Berahas et al. (2016). “A multi-batch L-BFGS method for machine learning,” in Advances in Neural Information Processing Systems 29, pp. 1055–1063.
With overlap $O_k = S_k \cap S_{k+1}$, compute the gradient on the full batch but the curvature pair on the overlap:

$g_k = \nabla L^{(S_k)}(w_k) = \frac{1}{|S_k|} \sum_{i \in S_k} \nabla \ell_i(w_k), \qquad y_k = \nabla L^{(O_k)}(w_{k+1}) - \nabla L^{(O_k)}(w_k)$
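The overlap-based curvature pair can be sketched as follows. This is an illustrative fragment (names and the toy loss are mine): both gradients in $y_k$ use the same samples $O_k$, so the difference reflects the change in $w$, not the change in batch.

```python
import numpy as np

def multibatch_pair(grad_i, w_k, w_next, S_k, S_next):
    """Multi-batch curvature pair (after Berahas et al., 2016):
    s_k = w_{k+1} - w_k and y_k evaluated on the overlap O_k = S_k ∩ S_{k+1}."""
    O_k = sorted(set(S_k) & set(S_next))
    g_O = lambda w: sum(grad_i(w, i) for i in O_k) / len(O_k)
    return w_next - w_k, g_O(w_next) - g_O(w_k)

# Toy loss l_i(w) = 0.5 * a_i * ||w||^2, so grad l_i(w) = a_i * w.
a = np.linspace(1.0, 2.0, 10)
grad_i = lambda w, i: a[i] * w
w0, w1 = np.zeros(3), np.ones(3)
s, y = multibatch_pair(grad_i, w0, w1, S_k=[0, 1, 2, 3], S_next=[2, 3, 4, 5])
```

If $y_k$ were instead computed from the two different batches, the quasi-Newton pair would mix sampling noise into the curvature estimate.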
[Figures: training loss and accuracy curves comparing full-batch and stochastic (multi-batch) L-BFGS.]
This research is supported by NSF grants CMMI-1333326, IIS-1741490, and ACI-1429783.