SLIDE 1

Generalized/Global Abs-Linear Learning (GALL)

Andreas Griewank and Ángel Rojas

Humboldt University (Berlin) and Yachay Tech (Imbabura)

14.12.19, NeurIPS Vancouver

SLIDE 2

Outline

1. From Heavy to Savvy Ball search trajectory
2. Results in the convex, homogeneous, and prox-linear cases
3. Successive Piecewise Linearization
4. Mixed Binary Linear Optimization
5. Generalized Abs-Linear Learning
6. Summary, Conclusions and Outlook

SLIDE 3

Folklore and Common Expectations in ML

1. Nonsmoothness can be ignored except for the step size choice.
2. Stochastic (mini-batch) sampling hides all the problems.
3. Higher dimensions make local minimizers less likely.
4. The difficulty is getting away from saddle points, not minimizers.
5. The precise location of the (almost) global minimizer is unimportant.
6. Network architecture and step size selection can be tweaked.
7. Convergence proofs hold only under "unrealistic assumptions".

SLIDE 4

Generalized Gradient Concepts

Notational Zoo (subspecies in Euclidean and Lipschitzian habitat):

- Fréchet derivative: $\nabla\varphi(x) \equiv \partial\varphi(x)/\partial x : D \to \mathbb{R}^n \cup \emptyset$
- Limiting gradient: $\partial_L\varphi(\mathring{x}) \equiv \lim_{x \to \mathring{x}} \nabla\varphi(x) : D \rightrightarrows \mathbb{R}^n$
- Clarke gradient: $\partial\varphi(x) \equiv \operatorname{conv}(\partial_L\varphi(x)) : D \rightrightarrows \mathbb{R}^n$
- Bouligand derivative: $\varphi'(x; \Delta x) \equiv \lim_{t \searrow 0} [\varphi(x + t\Delta x) - \varphi(x)]/t : D \times \mathbb{R}^n \to \mathbb{R}$, i.e. $D \to \mathrm{PL}_h(\mathbb{R}^n)$
- Piecewise linearization (PL): $\Delta\varphi(x; \Delta x) : D \times \mathbb{R}^n \to \mathbb{R}$, i.e. $D \to \mathrm{PL}(\mathbb{R}^n)$

Moriarty effect due to Rademacher ($C^{0,1} = W^{1,\infty}$):

Almost everywhere all concepts reduce to the Fréchet derivative, except PL!
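To make the zoo concrete, consider the simplest kink $\varphi(x) = |x|$ (a standard illustration, not from the slides). At $\mathring{x} = 0$:

$$\nabla\varphi(0) \text{ undefined}, \qquad \partial_L\varphi(0) = \{-1, +1\}, \qquad \partial\varphi(0) = [-1, 1], \qquad \varphi'(0; \Delta x) = |\Delta x| = \Delta\varphi(0; \Delta x).$$

At any $\mathring{x} \neq 0$, i.e. almost everywhere, the first four concepts collapse to $\{\operatorname{sgn}(\mathring{x})\}$, while the piecewise linearization $\Delta\varphi(\mathring{x}; \Delta x) = |\mathring{x} + \Delta x| - |\mathring{x}|$ still retains the kink.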

SLIDE 5

Lurking in the background: Prof. Moriarty

SLIDE 6

Filippov solutions of generalized steepest descent inclusion

The convexity and outer semi-continuity of the sets $\partial\varphi(x(t))$ imply that $-\dot{x}(t) \in \partial\varphi(x(t))$ from $x(0) = x_0 \in \mathbb{R}^n$ has (at least) one absolutely continuous Filippov solution trajectory $x(t)$.

Heavy ball (Polyak, 1964)

$-\ddot{x}(t) \in \partial\varphi(x(t))$ from $x(0) = x_0$, $-\dot{x}(0) \in \partial\varphi(x_0)$. Picks up speed/momentum going downhill and slows down going uphill.

Savvy ball (Griewank, 1981)

$$\frac{d}{dt}\left[\frac{-\dot{x}(t)}{(\varphi(x(t)) - c)^e}\right] \in \frac{e\,\partial\varphi(x(t))}{(\varphi(x(t)) - c)^{e+1}} = \partial_x\left[\frac{-1}{(\varphi(x(t)) - c)^e}\right].$$

Can be rewritten as a first order system of a differential equation and an inclusion satisfying the Filippov conditions $\Longrightarrow$ an absolutely continuous $(x(t), \dot{x}(t))$ exists.

SLIDE 7

Integrated Form

$$v(t) = \frac{\dot{x}(t)}{[\varphi(x(t)) - c]^e} \in \frac{\dot{x}_0}{[\varphi(x_0) - c]^e} - e \int_0^t \frac{\partial\varphi(x(\tau))}{[\varphi(x(\tau)) - c]^{e+1}}\, d\tau \, .$$

Second order Form

$$\ddot{x}(t) \in -\left[I - \frac{\dot{x}(t)\,\dot{x}(t)^\top}{\|\dot{x}(t)\|^2}\right] \frac{e\,\partial\varphi(x(t))}{\varphi(x(t)) - c} \quad\text{with}\quad \|\dot{x}(0)\| = 1\,.$$

Idea: adjust the current search direction $\dot{x}(t)$ towards a negative gradient direction $-\partial\varphi(x(t))$. The closer the current function value $\varphi(x(t))$ is to the target level $c$, the more rapidly the direction is adjusted. If $\varphi$ is convex, $\varphi(\mathring{x}) \leq c$ and $e \leq 1$, the trajectory reaches the level set. On degree-$(1/e)$ homogeneous objectives, local minimizers below $c$ are accepted and local minimizers above the target level are passed by.
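The second order form can be integrated directly when $\varphi$ is smooth. Below is a minimal sketch, not the authors' implementation: it uses a naive explicit Euler step (step size, initial direction, and target level are illustrative assumptions) on the 2-d test function from the 1981 JOTA figures.

```python
import numpy as np

# Minimal sketch (illustrative, not the authors' code): naive explicit Euler
# integration of the savvy ball second order form
#   xdd = -(I - v v^T / |v|^2) e grad(phi) / (phi - c),   |v(0)| = 1,
# for a smooth phi. Near the target level the system becomes stiff, so a
# real implementation would need event handling and an adaptive integrator.

def phi(x):  # 2-d test objective from the 1981 JOTA figures
    return (x[0]**2 + x[1]**2) / 200 + 1 - np.cos(x[0]) * np.cos(x[1] / np.sqrt(2))

def grad_phi(x):
    return np.array([
        x[0] / 100 + np.sin(x[0]) * np.cos(x[1] / np.sqrt(2)),
        x[1] / 100 + np.cos(x[0]) * np.sin(x[1] / np.sqrt(2)) / np.sqrt(2),
    ])

def savvy_ball(x0, v0, c=0.0, e=0.5, h=1e-3, steps=200_000):
    x, v = np.asarray(x0, float), np.asarray(v0, float)
    v /= np.linalg.norm(v)                       # enforce |v(0)| = 1
    for _ in range(steps):
        g = grad_phi(x)
        a = -(g - v * (v @ g) / (v @ v)) * e / (phi(x) - c)
        x, v = x + h * v, v + h * a              # explicit Euler step
        if phi(x) <= c:                          # target level reached
            break
    return x

x_end = savvy_ball([40.0, -35.0], [-1.0, 1.0], c=0.01, e=0.5)
print(x_end, phi(x_end))
```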

SLIDE 8

(Scanned excerpt, JOTA, Vol. 34, No. 1, May 1981, p. 33.)

Fig. 1. Search trajectories with target $c = 0$ and sensitivities $e \in \{0.4, 0.5, 0.67\}$ on the objective function $f = (x_1^2 + x_2^2)/200 + 1 - \cos x_1 \cos(x_2/\sqrt{2})$. Initial point $(40, -35)$. Global minimum at the origin marked by $+$.

"... gradient and explore the objective function more thoroughly. Simultaneously, the stepsize, which equals the length of the dashes, becomes smaller to ensure an accurate integration. The behavior of the individual trajectories confirms in principle the results of Theorem 5.1(i) applied to the quadratic term $u$, with $d$ being equal to 2. The combination $e = \frac{1}{2} = 1/d$ and $c = 0 = f^*$ seems optimal, even though the corresponding trajectory converges to the global solution $x^*$ only from the initial point $(40, -35)$, but not from $(35, -30)$. In the latter case, as shown in Fig. 2, the trajectory is distracted ..."

SLIDE 9

(Scanned excerpt, JOTA, Vol. 34, No. 1, May 1981, p. 34.)

Fig. 2. Search trajectories with sensitivity $e = 0.5$ and target $c \in \{-0.4, 0, 0.4\}$ on the objective function $f = (x_1^2 + x_2^2)/200 + 1 - \cos x_1 \cos(x_2/\sqrt{2})$. Initial point $(35, -30)$. Global minimum at the origin marked by $+$.

"... from $x^*$ by a sequence of suboptimal minima and eventually diverges toward infinity. Trajectories with sensitivities larger than 0.5, like the one with $e = 0.67$ in Fig. 1, usually lack the penetration to reach $x^*$ and wander around endlessly, as they cannot escape the attraction of the quadratic term $u$. On the other hand, trajectories with sensitivities less than 0.5, like the one with $e = 0.4$ in Fig. 1, are likely to pass the global solution $x^*$ at some distance before diverging toward infinity. The same is true of trajectories with the appropriate sensitivity $e = \frac{1}{2}$, but having an unattainable target, as we can see from the case $c = -0.4$ in Fig. 2. Trajectories whose target is attainable are likely to achieve their goal, like the one with $c = 0.4$ in Fig. 2, which attains its target close to the suboptimal minimum $\hat{x}_1 \approx -\pi(1, \sqrt{2})^\top$, with value $f(\hat{x}_1) \approx 0.15$, after passing through the neighborhood of two unacceptable minima $\hat{x}_2 \approx -2\pi(1, \sqrt{2})^\top$ and $\hat{x}_3 \approx -\pi(3, \sqrt{2})^\top$ ..."

SLIDE 10

Closed form solution on prox-linear function

Lemma (A.G. 1977 & A.R. 2019). For $\varphi(x) = b + g^\top x + \frac{q}{2}\|x\|_2^2$, the savvy ball equation

$$\ddot{x}(t) = -\left[I - \dot{x}(t)\,\dot{x}(t)^\top\right] \frac{\nabla\varphi(x(t))}{\varphi(x(t)) - c}$$

yields the momentum-like trajectory

$$x(t) = x_0 + \frac{\sin(\omega t)}{\omega}\,\dot{x}_0 + \frac{1 - \cos(\omega t)}{\omega^2}\,\ddot{x}_0 \;\approx\; x_0 + t\,\dot{x}_0 - \frac{t^2 g}{2(\varphi_0 - c)}$$

and

$$\varphi(x(t)) = \varphi_0 + (g + q x_0)^\top \dot{x}_0\, \frac{\sin(\omega t)}{\omega} + \left[q - \omega^2(\varphi_0 - c)\right] \frac{1 - \cos(\omega t)}{\omega^2}$$

where $\ddot{x}_0 = -\left[I - \dot{x}_0\,\dot{x}_0^\top\right] \frac{g + q x_0}{\varphi_0 - c}$ and $\omega = \|\ddot{x}_0\|$.
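As a sanity check, the value formula can be verified numerically against direct evaluation of the quadratic along the circular arc. The sketch below uses made-up test data (dimension, seed, and constants are illustrative assumptions, not from the talk):

```python
import numpy as np

# Verify the closed-form phi(x(t)) along the circular arc for a quadratic
# phi(x) = b + g^T x + (q/2)|x|^2 (made-up test data, illustrative only).
rng = np.random.default_rng(0)
n, q, c = 5, 0.7, -1.0
b, g, x0 = 0.3, rng.normal(size=n), rng.normal(size=n)
xdot0 = rng.normal(size=n); xdot0 /= np.linalg.norm(xdot0)    # |xdot0| = 1

phi = lambda x: b + g @ x + 0.5 * q * (x @ x)
phi0, gr0 = phi(x0), g + q * x0                               # value, gradient at x0
xddot0 = -(gr0 - xdot0 * (xdot0 @ gr0)) / (phi0 - c)          # projected, scaled
omega = np.linalg.norm(xddot0)

for t in (0.1, 0.5, 2.0):
    s, r = np.sin(omega * t) / omega, (1 - np.cos(omega * t)) / omega**2
    x_t = x0 + s * xdot0 + r * xddot0                         # point on the arc
    phi_closed = phi0 + (gr0 @ xdot0) * s + (q - omega**2 * (phi0 - c)) * r
    assert np.isclose(phi(x_t), phi_closed)
print("closed-form value along the arc confirmed")
```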

SLIDE 11

Piecewise-Linearization Approach

1. Every function $\varphi(x)$ that is abs-normal, i.e. evaluated by a sequence of smooth elemental functions and piecewise linear elements like abs, min, max, can be approximated near a reference point $\mathring{x}$ by a piecewise-linear function $\Delta\varphi(\mathring{x}; \Delta x)$ s.t.
$$|\varphi(\mathring{x} + \Delta x) - \varphi(\mathring{x}) - \Delta\varphi(\mathring{x}; \Delta x)| \leq \frac{q}{2}\|\Delta x\|^2$$

2. The function $y = \Delta\varphi(\mathring{x}; x - \mathring{x})$ can be represented in abs-linear form
$$z = d + Zx + Mz + L|z|, \qquad y = \mu + a^\top x + b^\top z + c^\top|z|$$
where $M$ and $L$ are strictly lower triangular matrices, so that $z = z(x)$. (A minimal evaluation sketch follows below.)

3. $[d, Z, M, L, \mu, a, b, c]$ can be generated automatically by Algorithmic Piecewise Differentiation, which allows the computational handling of $\Delta\varphi$ in and between the polyhedra
$$P_\sigma = \operatorname{closure}\{x \in \mathbb{R}^n : \operatorname{sgn}(z(x)) = \sigma\} \quad\text{for}\quad \sigma \in \{-1, +1\}^s.$$
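Because $M$ and $L$ are strictly lower triangular, the switching vector $z(x)$ and the value $y$ can be computed by a single forward sweep. A minimal sketch (hypothetical helper names, not the authors' code):

```python
import numpy as np

def eval_abs_linear(x, d, Z, M, L, mu, a, b, c):
    """Evaluate y = mu + a^T x + b^T z + c^T |z| with the switching vector
    solving z = d + Z x + M z + L |z|.  Since M and L are strictly lower
    triangular, z_i depends only on z_1..z_{i-1}: one forward sweep suffices."""
    s = len(d)
    z = np.zeros(s)
    for i in range(s):
        z[i] = d[i] + Z[i] @ x + M[i, :i] @ z[:i] + L[i, :i] @ np.abs(z[:i])
    return mu + a @ x + b @ z + c @ np.abs(z), z

# Example with s = 2 switches: y = |x1| + | |x1| + x2 |
d, Z = np.zeros(2), np.eye(2)
M, L = np.zeros((2, 2)), np.array([[0.0, 0.0], [1.0, 0.0]])
y, z = eval_abs_linear(np.array([-3.0, 1.0]), d, Z, M, L,
                       0.0, np.zeros(2), np.zeros(2), np.ones(2))
assert y == abs(-3.0) + abs(abs(-3.0) + 1.0)   # = 7
```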

SLIDE 12

Figure (a): tangent mode linearization, sketching $F(x)$ and its piecewise linearization $F^\diamond$ at the reference point $\mathring{x}$.

SLIDE 13

SALMIN defined by the iteration

$$x_{k+1} = \operatorname*{arglocmin}_{\Delta x} \left\{ \Delta\varphi(x_k; \Delta x) + \frac{q_k}{2}\|\Delta x\|^2 \right\} \qquad (1)$$

where $q_k > 0$ is adjusted such that eventually $q_k \geq q$ in the region of interest. It has cluster points $x_*$ that are first order minimal (FOM), i.e. $\Delta\varphi(x_*; \Delta x) \geq 0$ for $\Delta x \approx 0$. Drawback: requires computation and factorization of active Jacobians. (A sketch of this outer loop follows at the end of this slide.)

Coordinate Global Descent (CGD)

$f(w; x)$ is PL w.r.t. $x$, but $\varphi(w)$ is only multi-piecewise linear w.r.t. $w$, i.e. $\varphi(x + t e_j) \equiv \varphi(x) + \Delta\varphi(x; t e_j)$ for $t \in \mathbb{R}$. Along any such coordinate bi-direction we can perform a global univariate minimization efficiently. Cluster points $x_*$ of such alternating coordinate searches seem not even to be Clarke stationary, i.e. to satisfy $0 \in \partial\varphi(x_*)$.
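A minimal outer-loop sketch of iteration (1), assuming hypothetical callables `phi`, `pl_model`, and `arglocmin` for the objective, its piecewise linearization, and the proximal subproblem solver; the q-update shown is one plausible trust-region-style rule, not necessarily the paper's:

```python
def salmin(phi, x, pl_model, arglocmin, q=1.0, tol=1e-8, max_iter=100):
    """SALMIN outer loop sketch (illustrative only).

    phi(x)             -> objective value (NumPy vector argument)
    pl_model(x)        -> callable dphi with dphi(dx) ~= phi(x + dx) - phi(x)
    arglocmin(dphi, q) -> local minimizer dx of dphi(dx) + (q/2) dx.dx
    """
    for _ in range(max_iter):
        dphi = pl_model(x)                       # abs-linear model at x
        dx = arglocmin(dphi, q)                  # proximal PL subproblem
        pred = dphi(dx) + 0.5 * q * (dx @ dx)    # predicted decrease (<= 0)
        if pred > -tol:
            break                                # first order minimal point
        if phi(x + dx) - phi(x) <= 0.5 * pred:   # enough actual decrease
            x, q = x + dx, max(q / 2, 1e-6)      # accept, relax prox term
        else:
            q *= 2                               # model too optimistic
    return x
```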

SLIDE 14

Figure 1: Decimal digits gained by 4 methods on a single-layer regression problem.

SLIDE 15

SALGO-SAVVY algorithm

1. Form the piecewise linearization $\Delta\varphi$ of the objective $\varphi$ at the current iterate $\mathring{x}$ and estimate the proximal coefficient $q$; set $x_0 = \mathring{x}$.
2. Select the initial tangent $\dot{x}_0$ and $\sigma = \operatorname{sgn}(z(x_0))$.
3. Compute and follow the circular segment $x(t)$ in $P_\sigma$.
4. Determine the minimal $t_*$ where $\varphi(x(t_*)) = c$ or $x_* = x(t_*)$ lies on the boundary of $P_\sigma$ with some $P_{\tilde\sigma}$.
5. If $\varphi(x_*) \leq c$, then lower $c$ and go to step (2) // restart inner loop
   xor go to step (1) with $\mathring{x} = x_*$ and adjusted $q$ // continue outer loop
   xor terminate the optimization if the user is "happy" or resources are exceeded.
6. Else, set $x_0 = x_*$, $\dot{x}_0 = \dot{x}(t_*)$, $\sigma = \tilde\sigma$ and continue with step (3).

Many other heuristic strategies for retargeting and restarting are possible!
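For orientation, here is a control-flow skeleton of steps (2)-(6), with all geometric routines (`z`, `arc`, `first_event`) stubbed as hypothetical callables; it shows only the loop structure, not the authors' implementation:

```python
import numpy as np

def salgo_savvy_inner(x0, xdot0, c, z, arc, first_event):
    """Inner loop skeleton for steps (2)-(6); illustrative only.

    z(x)                  -> switching vector of the current abs-linear model
    arc(x0, xdot0, sigma) -> closed-form circular segment t -> (x(t), xdot(t))
    first_event(seg, c)   -> minimal t* where the target level c is hit or the
                             segment leaves P_sigma, a hit flag, and the
                             signature of the neighboring polyhedron
    """
    sigma = np.sign(z(x0))                            # step (2)
    while True:
        seg = arc(x0, xdot0, sigma)                   # step (3)
        t_star, hit_target, sigma_new = first_event(seg, c)
        x_star, xdot_star = seg(t_star)               # step (4)
        if hit_target:                                # step (5)
            return x_star   # caller lowers c, relinearizes, or terminates
        x0, xdot0, sigma = x_star, xdot_star, sigma_new   # step (6)
```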

SLIDE 16

Savvy Ball Path

Figure 2: Reached value 0.591576, whereas the target level 0.519984 was unreachable.

SLIDE 17

SAVVY on MNIST, n = 784, m = 10, d = 60000

The resulting accuracy of the one-layer model with smooth-max activation and cross-entropy loss on the test set of 10000 images is the "optimal" 92%.

SLIDE 18

Mixed Binary Linear Optimization

Consider a piecewise linear optimization problem in abs-linear form

$$\min\; a^\top x + b^\top z + c^\top \Sigma z \quad\text{s.t.}\quad z = Zx + Mz + L\Sigma z \;\text{ and }\; \Sigma z \geq 0$$

where $\sigma \in \{-1, 1\}^s$ and $\Sigma = \operatorname{diag}(\sigma)$ are binary variables. This mixed binary linear optimization problem (MIBLOP) can be reformulated as a mixed integer linear program (MILOP), provided $\|z\|_\infty \leq \gamma$, yielding

$$\min_{x, z, h, \sigma}\; a^\top x + b^\top z + c^\top h + \frac{q}{2}\|x\|^2 \quad\text{s.t.}\quad z = Zx + Mz + Lh, \qquad (2)$$
$$-h \leq z \leq h \quad\text{and}\quad h + \gamma(\sigma - e) \leq z \leq -h + \gamma(\sigma + e),$$

so that at any feasible point $h = |z|$ componentwise.
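The last pair of inequalities is a big-M encoding that pins $h$ to $|z|$. A brute-force sanity check on a grid (made-up data, not from the slides):

```python
import numpy as np

# Sketch: for a scalar z with |z| <= gamma, the constraints
#   -h <= z <= h  and  h + gamma*(sig - 1) <= z <= -h + gamma*(sig + 1)
# with sig in {-1, +1} are feasible exactly when h = |z|.
gamma = 10.0

def feasible(z, h, sig):
    return (-h <= z <= h) and (h + gamma * (sig - 1) <= z <= -h + gamma * (sig + 1))

for z in range(-9, 10, 3):
    hs = {h for h in np.arange(0.0, gamma + 0.5, 0.5)
          for sig in (-1, 1) if feasible(z, h, sig)}
    assert hs == {abs(z)}          # only h = |z| survives the constraint set
print("big-M encoding forces h = |z| on the test grid")
```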

Quote by Fischetti and Jo (2018)

"Deep Neural Networks as 0-1 Mixed Integer Linear Programs: A Feasibility Study": PL models are unfortunately not suited for training.

SLIDE 19

Prediction by PL functions in ANF

For $x \in \mathbb{R}^n \to y \in \mathbb{R}^m$:

Continuous PL function $\iff$ Hinged NN $\iff$ Abs-Linear Form (ALF).

Number of layers $\ell \geq \nu$ = switching depth = nilpotency degree of $(I - M)^{-1}L$.

$$z = c + Zx + Mz + L|z| \in \mathbb{R}^s, \qquad y = b + Jx + Nz \in \mathbb{R}^m$$

1. where $M, L \in \mathbb{R}^{s \times s}$ are strictly lower triangular to yield $z = z(x)$;
2. $\equiv$ NN if $M$ and $L$ are block bidiagonal; other sparsity patterns are possible;
3. note that $\max(u, w) = u + (z + |z|)/2$ with $z = w - u$ (verified in the sketch below);
4. ALFs with $\nu \leq \bar\nu$ form an infinite dimensional linear subspace of $C^{0,1}(\mathbb{R}^n)$;
5. ALFs can be successively abs-linearized with respect to $w = [c, Z, M, L, b, J, N]$ for learning = fitting.
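Item 3 is the hinge identity that lets hinged networks be absorbed into an ALF. A self-contained check (illustrative values):

```python
# ReLU as an abs-linear form with a single switch:  z = x,  y = (z + |z|)/2
for x in (-2.0, 0.0, 3.0):
    z = x
    assert 0.5 * z + 0.5 * abs(z) == max(0.0, x)

# General hinge identity of item 3:  max(u, w) = u + (z + |z|)/2,  z = w - u
for (u, w) in ((1.5, -0.7), (-1.0, 2.0), (0.0, 0.0)):
    z = w - u
    assert u + (z + abs(z)) / 2 == max(u, w)
print("hinge identity confirmed")
```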

SLIDE 20

Structured Piecewise linearization (PL) w.r.t. weight vector

Given a reference point $\mathring{w} = [\mathring{c}, \mathring{Z}, \mathring{M}, \mathring{L}, \mathring{b}, \mathring{J}, \mathring{N}]$ we have, Taylor-like, $\tilde{z} = \mathring{z} + \Delta z(\mathring{w}; w - \mathring{w})$ for $x$ fixed, where $\tilde{z}$ can be calculated directly from the abs-linear form

$$\tilde{z} = [c + Zx + \Delta M \mathring{z} + \Delta L |\mathring{z}|] + \mathring{M}\tilde{z} + \mathring{L}|\tilde{z}|$$

with $\Delta M = M - \mathring{M}$, $\Delta L = L - \mathring{L}$. The discrepancy is bounded by

$$\|\tilde{z} - z\|_\infty \leq \frac{q}{2}\,\|[\Delta M, \Delta L]\|_F^2\,.$$

An explicit upper bound on $q$ can be given but seems too conservative.

Reverse Mode AD $\equiv$ Back Propagation yields, at a cost of $\ast = 2\,\mathrm{OPS}(\tilde{z})$, the adjoints

$$[\bar{c}, \bar{Z}, \bar{M}, \bar{L}, \bar{b}, \bar{J}, \bar{N}] = \frac{\partial(\bar{z}^\top \tilde{z})}{\partial\,[c, Z, \Delta M, \Delta L, b, J, N]}\,.$$

SLIDE 21

Objective for successive linearizations for model sizes s = 3, 4, 5

Figure: empirical risk (vertical axis, approximately 0.11 to 0.16) over the five successive models (horizontal axis, 1 to 5) for model sizes s = 3, 4, 5.

SLIDE 22

Simplex Iterations by Gurobi

Regression on the Griewank function in 2 dimensions, 50 training data, 8 testing data, over 5 successive piecewise linearizations.

 s   #w   var.   lin. 1     lin. 2      lin. 3      lin. 4      lin. 5
 3   21   471    303810     353703      1716277     581060      681025
 4   31   631    1129639    263007      1015447     1339147     1068608
 5   43   793    1153345    22793377    22895320    21241422    16513124

For s = 5 there were 250 equality and 1000 inequality constraints, both linear.

Conclusion:

Nice try – but !!!

Question:

Are we overlooking any structure that could/should be exploited?

SLIDE 23

Potential contributions

1. SALMIN generates cluster points that are first order minimal.
2. Analytically, the savvy ball reaches the target level in the convex case.
3. The savvy ball can climb away from undesirable local minimizers.
4. Successive PL allows exact integration of the savvy ball and application of Mixed Binary Linear Optimization (Gurobi).
5. Though costly, MIBLOP may provide reference solutions.
6. The stepsize is chosen automatically via kinks and an angle bound.
7. Abs-Linear Learning generalizes hinged Neural Nets.

SLIDE 24

Improvements and Developments

1. Refine the targeting and restarting strategy for the savvy ball.
2. Matrix based implementation for HPC with GPUs.
3. Exploitation of low-rank updates in polyhedral transitions.
4. Mini-batch version in stochastic gradient fashion.
5. Check global optimality of MIBLOP cluster points.
6. Piecewise linearize the "loss" function (e.g. sparsemax).
7. Adaptively enforce sparsity in Abs-Linear Learning.

SLIDE 25

Many thanks for your attention!!