 
Optimization: why does it work? How many minima are there? Do they control norm and complexity?
Plan
  Background: NNs and SGD on square loss and cross entropy
  Minima: number (Bezout), degeneracy
  SGD and Langevin
  SGD finds global minima
  Stability of equilibrium points: Hessian
  Variations
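Loss functions
Square loss: $L(w) = \sum_{i=1}^{N} (f(x_i; w) - y_i)^2$.
Logistic loss, for binary labels $y_i \in \{-1, +1\}$: $L(w) = \sum_{i=1}^{N} \ln\left(1 + e^{-y_i f(x_i; w)}\right)$.
Cross entropy is the multiclass version of the logistic loss.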
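The two losses above can be written out directly. The following is a minimal sketch, assuming a linear model $f(x; w) = \langle w, x \rangle$ and a three-point toy data set; both are illustrative choices and do not come from the notes.

```python
import math

def f(x, w):
    """Hypothetical model: a linear predictor f(x; w) = <w, x>."""
    return sum(wj * xj for wj, xj in zip(w, x))

def square_loss(w, xs, ys):
    """L(w) = sum_i (f(x_i; w) - y_i)^2."""
    return sum((f(x, w) - y) ** 2 for x, y in zip(xs, ys))

def logistic_loss(w, xs, ys):
    """L(w) = sum_i ln(1 + exp(-y_i f(x_i; w))), labels y_i in {-1, +1}."""
    return sum(math.log1p(math.exp(-y * f(x, w))) for x, y in zip(xs, ys))

# Toy data (assumed for illustration): three points in the plane.
xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ys = [1.0, -1.0, 1.0]

w_zero = (0.0, 0.0)
sq0 = square_loss(w_zero, xs, ys)    # every residual is +1 or -1
lg0 = logistic_loss(w_zero, xs, ys)  # every term equals ln 2
print(sq0, lg0)
```

At $w = 0$ the model outputs zero everywhere, so the square loss is just the number of examples and the logistic loss is $N \ln 2$.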
Gradient descent optimization
GD: $w_{t+1} = w_t - \gamma \nabla_w L(w_t)$, with step size $\gamma$.
SGD: instead of $L = \sum_i \ell_i$, use at each iteration $\ell_i$ for an index $i$ (or a minibatch) selected at random.
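A small sketch of the two update rules side by side, assuming a linear model, the square loss, a toy problem with $N = 2$ examples and $M = 3$ weights, and a step size of 0.1. All of these are assumptions made for illustration.

```python
import random

# Hypothetical toy problem: f(x; w) = <w, x>, N = 2 examples, M = 3 weights.
XS = [(1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
YS = [1.0, 2.0]
GAMMA = 0.1  # step size (assumed value)

def f(x, w):
    return sum(wj * xj for wj, xj in zip(w, x))

def loss(w):
    """Full square loss L(w) = sum_i (f(x_i; w) - y_i)^2."""
    return sum((f(x, w) - y) ** 2 for x, y in zip(XS, YS))

def grad_example(w, i):
    """Gradient of the per-example loss l_i = (f(x_i; w) - y_i)^2."""
    r = f(XS[i], w) - YS[i]
    return [2.0 * r * xj for xj in XS[i]]

def gd(steps):
    """Gradient descent: every step uses the gradient of the full loss."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0, 0.0]
        for i in range(len(XS)):
            g = [a + b for a, b in zip(g, grad_example(w, i))]
        w = [wj - GAMMA * gj for wj, gj in zip(w, g)]
    return w

def sgd(steps, seed=0):
    """SGD: every step uses the gradient of one randomly chosen l_i."""
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        i = rng.randrange(len(XS))
        w = [wj - GAMMA * gj for wj, gj in zip(w, grad_example(w, i))]
    return w

w_gd = gd(500)
w_sgd = sgd(2000)
print(loss(w_gd), loss(w_sgd))
```

Because the toy problem has more weights than examples, both runs can drive the loss to (numerically) zero, although they need not end at the same $w$.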
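Minima
Can we say how many minima there are, and of which kind, independent of GD?
Key fact: DNNs are usually overparametrized, $M \gg N$, where $M$ is the number of weights and $N$ is the number of equations (one per training example).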
Bezout theorem
Instead of $\min_w \sum_{i=1}^{N} (f(x_i; w) - y_i)^2$, consider the equations $f(x_i; w) = y_i$, $i = 1, \dots, N$.
A zero-error solution is easy to find because of overparametrization.
Bezout theorem
If $f$ is a multivariate polynomial in the weights $w$ (for example, if the nonlinearity were a polynomial), then $f(x_i; w) - y_i = 0$, $i = 1, \dots, N$, is a system of $N$ polynomial equations in $M$ variables (e.g. $M \approx 300\mathrm{K}$, $N \approx 60\mathrm{K}$ for CIFAR).
Bezout theorem: a set of $N$ polynomial equations of degree $d$ in $M$ variables has at most $d^N$ isolated solutions if $N = M$ (generically exactly $d^N$ over the complex numbers). If $N < M$, the solutions are degenerate.

Remarks
This is similar to systems of linear equations.
For $N = M$ the number of isolated solutions is very high (larger than the number of protons in the universe).
For $N < M$ the solutions are degenerate, and the larger $M - N$, the more degenerate they are. The degeneracy is what we use next.
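To get a feeling for the size of the Bezout count $d^N$, here is a short arithmetic sketch. The figure of roughly $10^{80}$ protons in the observable universe is a commonly quoted order of magnitude and is used here as an assumption, as are the degree 2 and $N = 300$.

```python
def bezout_bound(degree, n_equations):
    """Bezout count d^N for N equations of degree d in N unknowns."""
    return degree ** n_equations

# Rough order-of-magnitude estimate, used only for comparison (assumption).
PROTONS_IN_UNIVERSE = 10 ** 80

# Two quadratic equations in two unknowns, e.g. x^2 = 1 and y^2 = 4,
# have the four isolated solutions (+-1, +-2).
n_small = bezout_bound(2, 2)

# Already for N = 300 quadratic equations the count exceeds 10^80.
n_large = bezout_bound(2, 300)
digits = len(str(n_large))
print(n_small, digits, n_large > PROTONS_IN_UNIVERSE)
```

The numbers quoted in the notes for real networks are far larger than $N = 300$, so the count there is astronomically bigger still.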
Because $M \gg N$, the global minima, corresponding to zero error for all $x_i$, that is $f(x_i; w) - y_i = 0$, $i = 1, \dots, N$, are degenerate.
What about all minima? The stationary points of the gradient are given by $\nabla_w L = 0$, which means
$\sum_{i=1}^{N} (f(x_i; w) - y_i) \frac{\partial f(x_i; w)}{\partial w_j} = 0$, $j = 1, \dots, M$.
These are $M$ equations in $M$ unknowns. If $f$ is polynomial, these are polynomial equations and the Bezout theorem applies: the solutions are in general not degenerate.
Thus global minima are degenerate, while local minima are in general not degenerate.

SGD $\sim$ GDL
For the next step I need to establish the similarity between SGD and Langevin equations.
$L(w) = \frac{1}{N} \sum_{i=1}^{N} V(f, z_i)$, with $z_i = (x_i, y_i)$.
GD: $w_{t+1} = w_t - \gamma \nabla_w L(w_t)$, a gradient dynamical system.
SGD: $w_{t+1} = w_t - \gamma \nabla_w V(f, z_i)$, with $z_i$ chosen at random.
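One way to see why SGD and GD are related: averaged over a uniformly random index $i$, the per-example gradient $\nabla_w V(f, z_i)$ equals the full gradient $\nabla_w L$. The sketch below checks this numerically, assuming $V$ is the square loss of a linear model on a made-up data set of three examples.

```python
# Hypothetical data: linear model f(x; w) = <w, x>, V(f, z_i) = (f(x_i; w) - y_i)^2.
XS = [(1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (1.0, 1.0, 0.0)]
YS = [1.0, 2.0, -1.0]

def f(x, w):
    return sum(wj * xj for wj, xj in zip(w, x))

def grad_V(w, i):
    """Gradient of the per-example loss V(f, z_i)."""
    r = f(XS[i], w) - YS[i]
    return [2.0 * r * xj for xj in XS[i]]

def grad_L(w):
    """Gradient of L = (1/N) sum_i V(f, z_i)."""
    n = len(XS)
    g = [0.0] * len(w)
    for i in range(n):
        g = [a + b / n for a, b in zip(g, grad_V(w, i))]
    return g

w = [0.5, -1.0, 2.0]
full = grad_L(w)

# Deviation of each per-example gradient from the full gradient.
deviations = [[gv - gl for gv, gl in zip(grad_V(w, i), full)]
              for i in range(len(XS))]

# Averaged over i the deviations cancel, although each one is non-zero.
mean_deviation = [sum(col) / len(XS) for col in zip(*deviations)]
print(full, mean_deviation)
```

Each individual SGD step therefore differs from the GD step, but the difference has zero mean over the choice of example.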
GDL: $dw = -\nabla_w L \, dt + \sqrt{2T} \, dB_t$ (an SDE), where $dB_t$ is the derivative of Brownian motion, that is, white noise with zero mean and Gaussian statistics.
GDL and SGD are similar in simulations. They are also similar if I write SGD as
$w_{t+1} = w_t - \gamma (\nabla_w L + \xi_t)$,
where $\xi_t = \nabla_w V - \nabla_w L$ is a pseudo-noise with $E[\xi_t] = 0$. $\xi_t$ is defined in terms of minibatches, where the CLT applies, so it is Gaussian-like.
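A Langevin equation of this kind can be simulated with the Euler-Maruyama scheme. The sketch below assumes the one-dimensional quadratic loss $L(w) = w^2/2$, a temperature $T = 0.5$ and a time step of 0.01, none of which come from the notes. For this loss the long-run variance of $w$ should be close to $T$, which gives a simple sanity check.

```python
import math
import random

def simulate_langevin(T, dt, steps, seed=0):
    """Euler-Maruyama for dw = -L'(w) dt + sqrt(2T) dB_t with L(w) = w^2 / 2."""
    rng = random.Random(seed)
    w = 0.0
    noise_scale = math.sqrt(2.0 * T * dt)  # std of the Brownian increment term
    samples = []
    for _ in range(steps):
        drift = -w  # -L'(w) for the assumed quadratic loss
        w += drift * dt + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(w)
    return samples

T = 0.5
samples = simulate_langevin(T, dt=0.01, steps=200_000)

# Discard an initial transient, then estimate the variance of w.
kept = samples[20_000:]
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
print(mean, var)
```

The estimate is noisy and carries a small discretization bias, so it should only be expected to match $T$ approximately.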
Let us speak about GDL, which is an SDE: $dw = -\nabla_w L \, dt + \sqrt{2T} \, dB_t$.
Its solution for the stationary probability distribution is $p(w) = \frac{1}{Z} e^{-L(w)/T}$.
This means that $p(w)$ is larger where $L(w)$ is smaller.
Most important, concentration of probability shows that for large dimension $d$ most of the probability is in minima with large volume (see slides).
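The preference for large-volume minima can be seen already in one dimension. The sketch below builds a made-up loss with two zero-loss minima, one narrow and one wide, and integrates the Gibbs weight $e^{-L(w)/T}$ over each basin. The curvatures, the temperature and the grid are assumptions chosen for illustration.

```python
import math

T = 0.1                        # temperature (assumed)
K_NARROW, K_WIDE = 50.0, 0.5   # curvatures of the two basins (assumed)
C_NARROW, C_WIDE = -2.0, 2.0   # locations of the two minima (assumed)
BOUNDARY = -18.0 / 11.0        # point between the minima where the parabolas meet

def loss(w):
    """Piecewise quadratic loss: both minima have L = 0, only the width differs."""
    if w < BOUNDARY:
        return K_NARROW * (w - C_NARROW) ** 2
    return K_WIDE * (w - C_WIDE) ** 2

def gibbs_weight(w):
    """Unnormalized stationary density exp(-L(w) / T)."""
    return math.exp(-loss(w) / T)

# Riemann sum of the Gibbs weight over each basin on the interval [-6, 10].
dx = 0.001
grid = [-6.0 + k * dx for k in range(16001)]
mass_narrow = sum(gibbs_weight(w) for w in grid if w < BOUNDARY) * dx
mass_wide = sum(gibbs_weight(w) for w in grid if w >= BOUNDARY) * dx

ratio = mass_wide / mass_narrow
print(mass_narrow, mass_wide, ratio)
```

Both minima have the same loss value, yet the wide basin carries about $\sqrt{K_{narrow}/K_{wide}} = 10$ times more probability. With $k$ flat directions this ratio is raised to the power $k$, which is the sense in which volume dominates in high dimension.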
The conclusion is that the solution of GDL prefers, with high probability, degenerate minima. Together with the Bezout conclusions this implies that GDL prefers global minima over local ones. Because SGD $\sim$ GDL, this is valid also for SGD.
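The last point in this class, which is also a harbinger of the next class, is about the structure of the solutions of GD with square loss in the overparametrized case.
With $L(w) = \sum_{i=1}^{N} E_i^2$ and $E_i = f(x_i; w) - y_i$, the dynamical system is
$\dot{w} = -\nabla_w L = -2 \sum_{i=1}^{N} E_i \, \nabla_w f(x_i; w)$.
If $E_i = 0$ for all $i$ then $\dot{w} = 0$ ($\dot{w}$ may be zero at other points too).
Are these solutions unique? Are they stable? Let us look at the Hessian of $L$:
$H = \frac{\partial^2 L}{\partial w^2} = 2 \sum_{i=1}^{N} \left( \nabla_w f(x_i; w) \, \nabla_w f(x_i; w)^T + E_i \frac{\partial^2 f(x_i; w)}{\partial w^2} \right)$.
If $E_i = 0$ for all $i$, then $H = 2 \sum_{i=1}^{N} \nabla_w f(x_i; w) \, \nabla_w f(x_i; w)^T \succeq 0$.
If $H \succ 0$ (positive definite), then we have stability. But $H$ is often degenerate ($H \succeq 0$ with some zero eigenvalues): the directions with zero eigenvalue are valleys, as expected from our Bezout analysis.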
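A tiny worked example of such a valley, assuming the model $f(x; w) = w_1 w_2 x$ with a single training example $(x, y) = (1, 1)$, so that $N = 1$ and $M = 2$. The model and the data are illustrative choices, not taken from the notes.

```python
import math

X, Y = 1.0, 1.0  # single training example (assumed)

def grad(w1, w2):
    """Gradient of L = E^2 with E = w1 * w2 * X - Y."""
    e = w1 * w2 * X - Y
    return (2.0 * e * w2 * X, 2.0 * e * w1 * X)

def hessian(w1, w2):
    """H = 2 (grad_f grad_f^T + E * hess_f), returned as (h11, h12, h22)."""
    e = w1 * w2 * X - Y
    g1, g2 = w2 * X, w1 * X       # gradient of f with respect to (w1, w2)
    h11 = 2.0 * g1 * g1           # d^2 f / d w1^2 = 0
    h22 = 2.0 * g2 * g2           # d^2 f / d w2^2 = 0
    h12 = 2.0 * (g1 * g2 + e * X) # d^2 f / d w1 d w2 = X
    return h11, h12, h22

def eigenvalues_2x2(h11, h12, h22):
    """Eigenvalues of the symmetric matrix [[h11, h12], [h12, h22]]."""
    mean = 0.5 * (h11 + h22)
    radius = math.sqrt((0.5 * (h11 - h22)) ** 2 + h12 ** 2)
    return mean - radius, mean + radius

# Every point on the curve w1 * w2 = 1 has zero error, hence zero gradient.
curve = [(t, 1.0 / t) for t in (0.25, 0.5, 1.0, 2.0, 4.0)]
grads_on_curve = [grad(w1, w2) for w1, w2 in curve]

# At the zero-error point (1, 1) the Hessian has one zero eigenvalue.
lam_min, lam_max = eigenvalues_2x2(*hessian(1.0, 1.0))
print(grads_on_curve, lam_min, lam_max)
```

The zero eigenvalue at $(1, 1)$ belongs to the direction $(1, -1)$, which is tangent to the curve $w_1 w_2 = 1$ of global minima: moving along the valley does not change the loss to second order.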