Generalized/Global Abs-Linear Learning (GALL)
Andreas Griewank and Ángel Rojas
Humboldt University (Berlin) and Yachay Tech (Imbabura)
14.12.19, NeurIPS Vancouver
1. Nonsmoothness can be ignored except for step size choice.
2. Stochastic (mini-batch) sampling hides all the problems.
3. Higher dimensions make local minimizers less likely.
4. Difficulty is getting away from saddle points, not minimizers.
5. Precise location of the (almost) global minimizer is unimportant.
6. Network architecture and stepsize selection can be tweaked.
7. Convergence proofs hold only under "unrealistic assumptions".
Notational Zoo (Subspecies in Euclidean and Lipschitzian Habitat):

- Fréchet derivative: $\nabla\varphi(x) \equiv \partial\varphi(x)/\partial x : D \to \mathbb{R}^n \cup \emptyset$
- Limiting gradient: $\partial_L\varphi(\mathring{x}) \equiv \lim_{x \to \mathring{x}} \nabla\varphi(x) : D \rightrightarrows \mathbb{R}^n$
- Clarke gradient: $\partial\varphi(x) \equiv \operatorname{conv}(\partial_L\varphi(x)) : D \rightrightarrows \mathbb{R}^n$
- Bouligand derivative: $\varphi'(x; \Delta x) \equiv \lim_{t \searrow 0} [\varphi(x + t\Delta x) - \varphi(x)]/t : D \times \mathbb{R}^n \to \mathbb{R}$, i.e. $D \to \mathrm{PL}_h(\mathbb{R}^n)$
- Piecewise linearization (PL): $\Delta\varphi(x; \Delta x) : D \times \mathbb{R}^n \to \mathbb{R}$, i.e. $D \to \mathrm{PL}(\mathbb{R}^n)$
Almost everywhere all concepts reduce to Fréchet, except PL!!
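For a concrete instance, take the prototypical kink $\varphi(x) = |x|$ on $D = \mathbb{R}$:
\[
\begin{aligned}
\nabla\varphi(x) &= \operatorname{sign}(x) \quad (x \neq 0), & \nabla\varphi(0) &= \emptyset, \\
\partial_L\varphi(0) &= \{-1, +1\}, & \partial\varphi(0) &= [-1, +1], \\
\varphi'(0; \Delta x) &= |\Delta x|, & \Delta\varphi(0; \Delta x) &= |\Delta x|.
\end{aligned}
\]
Away from the kink at $x = 0$ all concepts collapse to the Fréchet derivative, while the piecewise linearization remains an exact model of $\varphi$ everywhere, since $\varphi$ is itself piecewise linear.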
Fig. 1. Search trajectories with target $c = 0$ and sensitivity $e \in \{0.4, 0.5, 0.67\}$ on the objective function $f = (x_1^2 + x_2^2)/200 + 1 - \cos x_1 \cos(x_2/\sqrt{2})$. Initial point $(40, -35)$. Global minimum at origin marked by +.
gradient and explore the objective function more thoroughly. Simultaneously, the stepsize, which equals the length of the dashes, becomes smaller to ensure an accurate integration. The behavior of the individual trajectories confirms in principle the results of Theorem 5.1(i) applied to the quadratic term $u$, with $d$ being equal to 2. The combination $e = 1/2 = 1/d$ and $c = 0 = f^*$ seems optimal, even though the corresponding trajectory converges to the global solution $x^*$ only from the initial point $(40, -35)$, but not from $(35, -30)$.
Fig. 2. Search trajectories with sensitivity $e = 1/2$ and targets $c \in \{-0.4, 0, 0.4\}$ on the objective function $f = (x_1^2 + x_2^2)/200 + 1 - \cos x_1 \cos(x_2/\sqrt{2})$. Initial point $(35, -30)$. Global minimum at origin marked by +.

In the latter case, as shown in Fig. 2, the trajectory is distracted from $x^*$ by a sequence of suboptimal minima and eventually diverges toward infinity. Overly sensitive trajectories, like the one with $e = 0.67$ in Fig. 1, usually lack the penetration to reach $x^*$ and wander around endlessly, as they cannot escape the attraction of the quadratic term. Insufficiently sensitive trajectories, like the one with $e = 0.4$ in Fig. 1, are likely to pass the global solution $x^*$ at some distance before diverging toward infinity. The same is true of trajectories with appropriate sensitivity $e = 1/2$ but an unattainable target, as we can see from the case $c = -0.4$ in Fig. 2. Trajectories whose target is attainable are likely to achieve their goal, like the one with $c = 0.4$ in Fig. 2, which attains its target close to the suboptimal minimum $\hat{x}_1 \approx -\pi(1, \sqrt{2})^T$, with value $f(\hat{x}_1) \approx 0.15$, after passing through the neighborhood of two unacceptable minima $\hat{x}_2 \approx -2\pi(1, \sqrt{2})^T$ and $\hat{x}_3 \approx -\pi(3, \sqrt{2})^T$.
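For reference, a short sketch (added here; the function name griewank2 is ours, not from the paper) that evaluates this objective, the two-dimensional instance of what is now known as the Griewank function, and checks the values quoted above:

```python
import numpy as np

def griewank2(x1, x2):
    # 2-D test function from the JOTA 1981 excerpt:
    # quadratic bowl plus an oscillatory product term
    return (x1**2 + x2**2) / 200.0 + 1.0 - np.cos(x1) * np.cos(x2 / np.sqrt(2.0))

print(griewank2(0.0, 0.0))       # global minimum at the origin: 0.0
print(griewank2(40.0, -35.0))    # initial point of Fig. 1
x1_hat = -np.pi * np.array([1.0, np.sqrt(2.0)])
print(griewank2(*x1_hat))        # suboptimal minimum, approx 0.15 (= 3*pi^2/200)
```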
1. Every function ϕ(x) that is abs-normal, i.e. evaluated by a sequence of smooth elemental operations and absolute values, admits a piecewise linearization with uniform second-order error: $|\varphi(\mathring{x} + \Delta x) - \varphi(\mathring{x}) - \Delta\varphi(\mathring{x}; \Delta x)| = O(\|\Delta x\|^2)$.
2. The function $y = \Delta\varphi(\mathring{x}; \Delta x)$ is piecewise linear in $\Delta x$ and admits an abs-linear representation.
3. The data $[d, Z, M, L, \mu, a, b, c]$ can be generated automatically by Algorithmic Differentiation.
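As an illustration of the first point, a minimal sketch assuming the standard abs-linearization rule that smooth arguments of an absolute value are linearized while the kink itself is kept, i.e. $\Delta\varphi(\mathring{x}; \Delta x) = |g(\mathring{x}) + g'(\mathring{x})\Delta x| - |g(\mathring{x})|$ for $\varphi = |g|$ with smooth $g$:

```python
import numpy as np

def phi(x):
    # abs-normal test function: smooth expression inside one absolute value
    return abs(x**2 - 1.0)

def delta_phi(x0, dx):
    # piecewise linearization at x0: linearize the smooth argument
    # of |.|, keep the kink
    g, dg = x0**2 - 1.0, 2.0 * x0
    return abs(g + dg * dx) - abs(g)

x0 = 0.9  # base point close to the kink of phi at x = 1
for dx in [0.4, 0.2, 0.1, 0.05]:
    err = phi(x0 + dx) - phi(x0) - delta_phi(x0, dx)
    # the ratio err/dx^2 stays bounded, confirming the second-order bound
    print(f"dx = {dx:5.2f}   error = {err: .2e}   error/dx^2 = {err / dx**2: .3f}")
```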
(a) Tangent mode linearization: the graph of $F(x)$ together with its piecewise linearization $F^{\diamond}$ at the base point $\mathring{x}$.
At each iterate $x_k$, minimize the proximally regularized piecewise linear model
\[
\min_{\Delta x} \Big\{\, \Delta\varphi(x_k; \Delta x) + \frac{q_k}{2} \|\Delta x\|^2 \,\Big\}.
\]
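A one-dimensional illustration, assuming this form of the regularized subproblem: for $\varphi(x) = |x|$ the piecewise linear model is exact, and the step has the closed-form soft-thresholding solution
\[
\Delta x^* = \operatorname*{argmin}_{\Delta x} \Big\{ |x_k + \Delta x| - |x_k| + \frac{q_k}{2} \Delta x^2 \Big\}
= \operatorname{sign}(x_k) \max\Big(|x_k| - \frac{1}{q_k},\, 0\Big) - x_k,
\]
so the iteration lands exactly on the kink $x = 0$ as soon as $|x_k| \le 1/q_k$, instead of oscillating around it like a plain subgradient step with fixed stepsize $1/q_k$ would.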
Figure 1: Decimal digits gained by 4 methods on a single-layer regression problem.
1. Form the piecewise linearization Δϕ of the objective ϕ at the current iterate $x_0$.
2. Select the initial tangent $\dot{x}_0$.
3. Compute and follow the circular segment $x(t)$ in $P_\sigma$.
4. Determine the minimal $t^*$ where $\varphi(x(t^*)) = c$ or $x(t)$ leaves $P_\sigma$, and set $x^* = x(t^*)$.
5. If $\varphi(x^*) \le c$ then lower $c$ and goto step (2). // restart inner loop
6. Else, set $x_0 = x^*$, $\dot{x}_0 = \dot{x}(t^*)$ and goto step (1).

(A schematic sketch of this loop follows below.)
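A schematic rendering of the six steps in Python; the helpers abs_linearize and follow_segment as well as the target-update rule are placeholders invented for this sketch, not part of GALL:

```python
def gall_outer_loop(phi, abs_linearize, follow_segment,
                    x0, xdot0, c, max_outer=50, c_decrease=0.1):
    """Schematic GALL loop. abs_linearize(phi, x) is assumed to return
    a piecewise linear model of phi at x; follow_segment is assumed to
    trace the circular segments x(t) through the polyhedra P_sigma and
    return the first point x_star with model value c (or the endpoint),
    together with the tangent there."""
    for _ in range(max_outer):
        model = abs_linearize(phi, x0)          # step 1: PL model at x0
        x_star, xdot_star = follow_segment(model, x0, xdot0, c)  # steps 2-4
        if phi(x_star) <= c:
            c -= c_decrease                     # step 5: lower the target
            continue                            # and restart the inner loop
        x0, xdot0 = x_star, xdot_star           # step 6: move the base point
    return x0, c
```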
Figure 2: Reached value 0.591576, whereas the target level 0.519984 was unreachable.
1. The matrices $M, L \in \mathbb{R}^{s \times s}$ are strictly lower triangular to yield $z = z(x)$.
2. The ALF reduces to a NN if $M \equiv L$ are block bidiagonal; other sparsity patterns are possible.
3. Note that $\max(u, w) = u + (z + |z|)/2$ with $z = w - u$.
4. ALFs with $\nu \le \bar{\nu}$ …
5. ALFs can be successively abs-linearized with respect to …
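A minimal evaluation sketch, assuming an abs-linear form of the shape $z = c + Zx + Mz + L|z|$, $y = d + a^T x + b^T |z|$ (the exact roles of all entries in $[d, Z, M, L, \mu, a, b, c]$ are not recoverable from the slide). Strict lower triangularity of $M$ and $L$ is what makes a single forward sweep sufficient:

```python
import numpy as np

def eval_abs_linear(x, c, Z, M, L, d, a, b):
    """Evaluate z = c + Z@x + M@z + L@|z| by forward substitution:
    M, L strictly lower triangular means z[i] depends only on z[:i]."""
    s = len(c)
    z = np.zeros(s)
    for i in range(s):
        z[i] = c[i] + Z[i] @ x + M[i] @ z + L[i] @ np.abs(z)
    return d + a @ x + b @ np.abs(z), z

# point 3 above in one line: max(u, w) = u + (z + |z|)/2 with z = w - u,
# which is how ReLU/max units enter the abs-linear framework
u, w = 1.3, 2.1
assert np.isclose(max(u, w), u + ((w - u) + abs(w - u)) / 2)
```

With block-bidiagonal $M \equiv L$ the forward sweep becomes the usual layer-by-layer pass of a hinged (ReLU/max) network, matching point 2.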
Figure: empirical risk (0.11 to 0.16) versus model size (1 to 5), one curve per number of models with s = 3, 4, 5.
Regression on Griewank in 2 dimensions, 50 training data, 8 testing data

s   #w   var.        1           2           3           4           5
3   21    471     303810      353703     1716277      581060      681025
4   31    631    1129639      263007     1015447     1339147     1068608
5   43    793    1153345    22793377    22895320    21241422    16513124

For s = 5 there were 250 equality and 1000 inequality constraints, both linear.
1. SALMIN generates cluster points that are first order minimal.
2. The analytically savvy ball reaches the target level in the convex case.
3. The savvy ball can climb away from undesirable local minimizers.
4. Successive PL allows exact integration of the savvy ball and …
5. Though costly, MIBLOP may provide reference solutions.
6. Stepsize is chosen automatically via kinks and angle bound.
7. Abs-Linear Learning generalizes hinged Neural Nets.
1. Refine the targeting and restarting strategy for SB.
2. Matrix-based implementation for HPC with GPUs.
3. Exploitation of low-rank updates in polyhedral transitions.
4. Mini-batch version in stochastic gradient fashion.
5. Check global optimality of MIBLOP cluster points.
6. Piecewise linearize the "loss" function (e.g. sparsemax).
7. Adaptively enforce sparsity in Abs-Linear Learning.