
Generalized/Global Abs-Linear Learning (GALL). Andreas Griewank and Ángel Rojas, Humboldt University (Berlin) and Yachay Tech (Imbabura). 14.12.19, NeurIPS Vancouver.


1. Generalized/Global Abs-Linear Learning (GALL)
Andreas Griewank and Ángel Rojas
Humboldt University (Berlin) and Yachay Tech (Imbabura)
14.12.19, NeurIPS Vancouver

2. Outline
1. From Heavy to Savvy Ball search trajectory
2. Results in the convex, homogeneous and prox-linear case
3. Successive Piecewise Linearization
4. Mixed Binary Linear Optimization
5. Generalized Abs-Linear Learning
6. Summary, Conclusions and Outlook

3. Folklore and Common Expectations in ML
1. Nonsmoothness can be ignored except for the step size choice.
2. Stochastic (mini-batch) sampling hides all the problems.
3. Higher dimensions make local minimizers less likely.
4. The difficulty is getting away from saddle points, not from minimizers.
5. The precise location of the (almost) global minimizer is unimportant.
6. Network architecture and stepsize selection can be tweaked.
7. Convergence proofs hold only under "unrealistic assumptions".

4. Generalized Gradient Concepts
Notational Zoo (subspecies in the Euclidean and Lipschitzian habitat):
Fréchet Derivative: ∇ϕ(x) ≡ ∂ϕ(x)/∂x : D ↦ Rⁿ ∪ ∅
Limiting Gradient: ∂_L ϕ(x̊) ≡ lim_{x→x̊} ∇ϕ(x) : D ⇒ Rⁿ
Clarke Gradient: ∂ϕ(x) ≡ conv(∂_L ϕ(x)) : D ⇒ Rⁿ
Bouligand: ϕ′(x; Δx) ≡ lim_{t↘0} [ϕ(x + tΔx) − ϕ(x)] / t : D × Rⁿ ↦ R, i.e. D ↦ PL_h(Rⁿ)
Piecewise Linearization (PL): Δϕ(x; Δx) : D × Rⁿ ↦ R, i.e. D ↦ PL(Rⁿ)
Moriarty Effect due to Rademacher (C^{0,1} = W^{1,∞}): almost everywhere all concepts reduce to the Fréchet derivative, except the PL!
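As a concrete illustration (our addition, the standard textbook example): for ϕ(x) = |x| on D = R at the kink x̊ = 0,
  ∇ϕ(0) = ∅,   ∂_L ϕ(0) = {−1, +1},   ∂ϕ(0) = conv{−1, +1} = [−1, +1],
  ϕ′(0; Δx) = |Δx| = Δϕ(0; Δx).
At every x ≠ 0 all the gradient concepts collapse to the Fréchet derivative sign(x), while the piecewise linearization Δϕ(x; Δx) = |x + Δx| − |x| still keeps the kink in view, which is exactly the point of the "except PL" remark.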

5. Lurking in the background: Prof. Moriarty (figure)

6. Filippov solutions of the generalized steepest descent inclusion
The convexity and outer semicontinuity of the subsets ∂ϕ(x(t)) imply that
  −ẋ(t) ∈ ∂ϕ(x(t))   from   x(0) = x₀ ∈ Rⁿ
has (at least) one absolutely continuous Filippov solution trajectory x(t).
Heavy ball (Polyak, 1964):
  −ẍ(t) ∈ ∂ϕ(x(t))   from   x(0) = x₀,  −ẋ(0) ∈ ∂ϕ(x₀).
It picks up speed/momentum going downhill and slows down going uphill.
Savvy ball (Griewank, 1981):
  (d/dt) [ −ẋ(t) / (ϕ(x(t)) − c)^e ] ∈ e ∂ϕ(x(t)) / (ϕ(x(t)) − c)^(e+1) = ∂ [ −1 / (ϕ(x(t)) − c)^e ].
This can be rewritten as a first-order system of a differential equation and an inclusion satisfying the Filippov conditions (spelled out below) ⇒ an absolutely continuous (x(t), ẋ(t)) exists.
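Spelling out that rewriting (our addition, not verbatim from the slides): introduce the scaled velocity v(t) := −ẋ(t) / (ϕ(x(t)) − c)^e. The savvy-ball inclusion then becomes the coupled first-order system
  ẋ(t) = −(ϕ(x(t)) − c)^e v(t),
  v̇(t) ∈ e ∂ϕ(x(t)) / (ϕ(x(t)) − c)^(e+1),
whose right-hand side is convex- and compact-valued and outer semicontinuous as long as ϕ(x(t)) > c, so Filippov's existence theorem applies to (x(t), v(t)) and hence yields an absolutely continuous (x(t), ẋ(t)).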

7. Integrated Form
  v(t) = ẋ(t) / [ϕ(x(t)) − c]^e ∈ ẋ₀ / [ϕ(x₀) − c]^e − e ∫₀ᵗ ∂ϕ(x(τ)) / [ϕ(x(τ)) − c]^(e+1) dτ.
Second-order form:
  ẍ(t) ∈ − [ I − ẋ(t)ẋ(t)^⊤ / ‖ẋ(t)‖² ] e ∂ϕ(x(t)) / [ϕ(x(t)) − c]   with   ‖ẋ(0)‖ = 1.
Idea: the current search direction ẋ(t) is adjusted towards a negative (generalized) gradient direction −∂ϕ(x(t)). The closer the current function value ϕ(x(t)) is to the target level c, the more rapidly the direction is adjusted.
If ϕ is convex, ϕ(x̊) ≤ c and e ≤ 1, the trajectory reaches the level set. On objectives that are homogeneous of degree 1/e, local minimizers below c are accepted and local minimizers above the target level are passed by.
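A minimal numerical sketch of the second-order form (our own illustration, not the authors' code; it assumes a point where ϕ is differentiable, so ∂ϕ(x) is just the gradient, and uses a plain explicit Euler step):

```python
import numpy as np

def savvy_ball_step(phi, grad, x, v, c, e=0.5, h=1e-2):
    """One explicit Euler step of the second-order savvy-ball form:
    x'' = -[I - x' x'^T / ||x'||^2] e grad(phi)(x) / (phi(x) - c)."""
    g = grad(x)
    gap = phi(x) - c                                  # distance to the target level c
    P = np.eye(len(x)) - np.outer(v, v) / (v @ v)     # project out the current direction
    a = -e * (P @ g) / gap                            # turns faster as phi(x) approaches c
    v = v + h * a
    v = v / np.linalg.norm(v)                         # keep ||x'(t)|| = 1 as on the slide
    return x + h * v, v

# Illustration on the 1981 test function quoted on the next two slides.
phi = lambda x: (x[0]**2 + x[1]**2) / 200 + 1 - np.cos(x[0]) * np.cos(x[1] / np.sqrt(2))
grad = lambda x: np.array([
    x[0] / 100 + np.sin(x[0]) * np.cos(x[1] / np.sqrt(2)),
    x[1] / 100 + np.cos(x[0]) * np.sin(x[1] / np.sqrt(2)) / np.sqrt(2),
])
x, v = np.array([40.0, -35.0]), np.array([-1.0, 1.0]) / np.sqrt(2)
for _ in range(200_000):
    if phi(x) < 1e-3:        # effectively at the target level c = 0
        break
    x, v = savvy_ball_step(phi, grad, x, v, c=0.0)
```

This is only meant to show the structure of the direction adjustment; step-size control and the nonsmooth case are not handled here.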

8. JOTA, Vol. 34, No. 1, May 1981, p. 33:
Fig. 1. Search trajectories with target c = 0 and sensitivity e ∈ {0.4, 0.5, 0.67} on the objective function f = (x₁² + x₂²)/200 + 1 − cos x₁ cos(x₂/√2). Initial point (40, −35). Global minimum at origin marked by +.
"… gradient and explore the objective function more thoroughly. Simultaneously, the stepsize, which equals the length of the dashes, becomes smaller to ensure an accurate integration. The behavior of the individual trajectories confirms in principle the results of Theorem 5.1(i) applied to the quadratic term u, with d being equal to 2. The combination e = 1/2 = 1/d and c = 0 = f* seems optimal, even though the corresponding trajectory converges to the global solution x* only from the initial point (40, −35), but not from (35, −30). In the latter case, as shown in Fig. 2, the trajectory is distracted …" (the excerpt continues on the next slide)

9. JOTA, Vol. 34, No. 1, May 1981, p. 34:
Fig. 2. Search trajectories with sensitivity e = 0.5 and target c ∈ {−0.4, 0, 0.4} on the objective function f = (x₁² + x₂²)/200 + 1 − cos x₁ cos(x₂/√2). Initial point (35, −30). Global minimum at origin marked by +.
"… from x* by a sequence of suboptimal minima and eventually diverges toward infinity. Trajectories with sensitivities larger than 0.5, like the one with e = 0.67 in Fig. 1, usually lack the penetration to reach x* and wander around endlessly, as they cannot escape the attraction of the quadratic term u. On the other hand, trajectories with sensitivities less than 0.5, like the one with e = 0.4 in Fig. 1, are likely to pass the global solution x* at some distance before diverging toward infinity. The same is true of trajectories with the appropriate sensitivity e = 1/2 but an unattainable target, as we can see from the case c = −0.4 in Fig. 2. Trajectories whose target is attainable are likely to achieve their goal, like the one with c = 0.4 in Fig. 2, which attains its target close to the suboptimal minimum x̂₁ ≈ −π(1, √2)^⊤ with value f(x̂₁) ≈ 0.15, after passing through the neighborhood of two unacceptable minima x̂₂ ≈ −2π(1, √2)^⊤ and x̂₃ ≈ −π(3, √2)^⊤ …"
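For orientation (our addition): at x̂₁ = −π(1, √2)^⊤ the product cos x₁ cos(x₂/√2) = cos(−π)·cos(−π) = 1, so only the quadratic term survives and f(x̂₁) = 3π²/200 ≈ 0.148, matching the quoted value of about 0.15. The same argument gives f(x̂₂) = 12π²/200 ≈ 0.59 and f(x̂₃) = 11π²/200 ≈ 0.54, both above the target c = 0.4 and therefore "unacceptable".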

10. Closed form solution on a prox-linear function
Lemma (A.G. 1977 & A.R. 2019). For ϕ(x) = b + g^⊤x + (q/2)‖x‖², the savvy-ball equation
  ẍ(t) = − [ I − ẋ(t)ẋ(t)^⊤ ] ∇ϕ(x(t)) / [ϕ(x(t)) − c]
yields the momentum-like trajectory
  x(t) = x₀ + (sin(ωt)/ω) ẋ₀ + ((1 − cos(ωt))/ω²) ẍ₀ ≈ x₀ + t ẋ₀ − (t² / (2(ϕ₀ − c))) g
and
  ϕ(x(t)) = ϕ₀ + (sin(ωt)/ω) (g + qx₀)^⊤ ẋ₀ + ((1 − cos(ωt))/ω²) (q − ω²(ϕ₀ − c)),
where
  ẍ₀ = − [ I − ẋ₀ẋ₀^⊤ ] (g + qx₀) / (ϕ₀ − c)   and   ω = ‖ẍ₀‖.
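A small sketch (our own, with made-up data; the function and variable names are not from the slides) that evaluates the lemma's closed-form trajectory and checks that the stated expression for ϕ(x(t)) agrees with ϕ evaluated directly at x(t):

```python
import numpy as np

def savvy_trajectory_quadratic(g, q, b, x0, v0, c):
    """Closed-form savvy-ball trajectory on phi(x) = b + g^T x + (q/2)||x||^2.
    Returns x(t) and the lemma's expression for phi(x(t))."""
    v0 = v0 / np.linalg.norm(v0)                      # ||x'(0)|| = 1
    phi0 = b + g @ x0 + 0.5 * q * (x0 @ x0)
    r = g + q * x0                                    # gradient of phi at x0
    a0 = -(r - (v0 @ r) * v0) / (phi0 - c)            # x''(0) = -[I - v0 v0^T] r / (phi0 - c)
    w = np.linalg.norm(a0)                            # omega
    x = lambda t: x0 + np.sin(w * t) / w * v0 + (1 - np.cos(w * t)) / w**2 * a0
    phi_t = lambda t: (phi0 + np.sin(w * t) / w * (r @ v0)
                       + (1 - np.cos(w * t)) / w**2 * (q - w**2 * (phi0 - c)))
    return x, phi_t

# made-up data for illustration
g, q, b = np.array([1.0, -2.0]), 0.5, 0.0
x0, v0, c = np.array([3.0, 1.0]), np.array([1.0, 1.0]), -1.0
x, phi_t = savvy_trajectory_quadratic(g, q, b, x0, v0, c)
phi = lambda z: b + g @ z + 0.5 * q * (z @ z)
for t in (0.1, 0.7, 2.0):
    assert abs(phi(x(t)) - phi_t(t)) < 1e-10          # the two expressions in the lemma agree
```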

11. Piecewise-Linearization Approach
1. Every function ϕ(x) that is abs-normal, i.e. evaluated by a sequence of smooth elemental functions and piecewise linear elements like abs, min, max, can be approximated near a reference point x̊ by a piecewise-linear function Δϕ(x̊; Δx) such that
  | ϕ(x̊ + Δx) − ϕ(x̊) − Δϕ(x̊; Δx) | ≤ (q/2) ‖Δx‖².
2. The function y = Δϕ(x̊; x − x̊) can be represented in abs-linear form
  z = d + Zx + Mz + L|z|,
  y = μ + a^⊤x + b^⊤z + c^⊤|z|,
where M and L are strictly lower triangular matrices, so that z = z(x).
3. [d, Z, M, L, μ, a, b, c] can be generated automatically by Algorithmic Piecewise Differentiation, which allows the computational handling of Δϕ in and between the polyhedra
  P_σ = closure{ x ∈ Rⁿ : sgn(z(x)) = σ }   for   σ ∈ {−1, +1}^s
(see the evaluation sketch below).
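A minimal evaluation sketch (our addition; the function name and test case are ours, not from any particular library). Because M and L are strictly lower triangular, the switching vector z = z(x) can be computed by forward substitution, and sgn(z) identifies the polyhedron P_σ containing x:

```python
import numpy as np

def eval_abs_linear(x, d, Z, M, L, mu, a, b, c):
    """Evaluate y from the abs-linear form
       z = d + Z x + M z + L |z|,   y = mu + a^T x + b^T z + c^T |z|.
    M and L are strictly lower triangular, so z follows by forward substitution."""
    s = len(d)
    z = np.zeros(s)
    for i in range(s):
        z[i] = d[i] + Z[i] @ x + M[i, :i] @ z[:i] + L[i, :i] @ np.abs(z[:i])
    y = mu + a @ x + b @ z + c @ np.abs(z)
    return y, np.sign(z)        # the sign vector sigma locates the polyhedron P_sigma

# Example: max(x1, x2) = (x1 + x2 + |x1 - x2|) / 2 with a single switch z1 = x1 - x2.
d, Z = np.zeros(1), np.array([[1.0, -1.0]])
M, L = np.zeros((1, 1)), np.zeros((1, 1))
mu, a, b, c = 0.0, np.array([0.5, 0.5]), np.zeros(1), np.array([0.5])
y, sigma = eval_abs_linear(np.array([2.0, 5.0]), d, Z, M, L, mu, a, b, c)
assert y == 5.0 and sigma[0] == -1.0
```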

12. Figure (a): tangent mode linearization of F(x) at the reference point x̊.

13. SALMIN, defined by the iteration
  x_{k+1} = arglocmin_{Δx} { Δϕ(x_k; Δx) + (q_k/2) ‖Δx‖² }     (1)
where q_k > 0 is adjusted such that eventually q_k ≥ q in the region of interest. It has cluster points x* that are first order minimal (FOM), i.e.
  Δϕ(x*; Δx) ≥ 0   for   Δx ≈ 0.
Drawback: it requires the computation and factorization of active Jacobians.
Coordinate Global Descent (CGD)
f(w; x) is PL w.r.t. x, but ϕ(w) is only multi-piecewise linear w.r.t. w, i.e.
  ϕ(x + t e_j) ≡ ϕ(x) + Δϕ(x; t e_j)   for   t ∈ R.
Along any such coordinate bi-direction we can perform a global univariate minimization efficiently. Cluster points x* of such alternating coordinate searches seem not even to be Clarke stationary, i.e. they need not satisfy 0 ∈ ∂ϕ(x*).
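A schematic of the outer SALMIN loop (our own sketch; `arglocmin_pl` stands for a hypothetical inner solver of the piecewise-linear proximal subproblem (1), and the update rule for q_k is a simple placeholder, not the authors' adjustment strategy):

```python
def salmin(phi, arglocmin_pl, x0, q0=1.0, max_iter=100):
    """Outer loop of iteration (1): x_{k+1} = arglocmin_{dx} { dphi(x_k; dx) + (q_k/2)||dx||^2 }.
    arglocmin_pl(x, q) is assumed to return a local minimizer dx of the
    piecewise-linear model of phi at x plus the proximal term."""
    x, q = x0, q0
    for _ in range(max_iter):
        dx = arglocmin_pl(x, q)
        if phi(x + dx) <= phi(x):      # the model step gave descent: accept, relax q_k
            x, q = x + dx, max(0.5 * q, 1e-8)
        else:                          # overshoot: increase the proximal weight q_k
            q = 2.0 * q
    return x
```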

14. Figure 1: Decimal digits gained by four methods on a single-layer regression problem.
