
SLIDE 1

Optimization: why does it work?

Minima: how many? Do they control norm / complexity?

SLIDE 2

Plan

Background: SGD and NN, cross entropy

Minima: number (Bezout), square loss, degeneracy (Bezout)

SGD and Langevin

SGD finds global minima

Stability: equilibrium points, Hessian

Variations

SLIDE 3

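Loss functions

Square loss: $L(w) = \sum_{i=1}^{N} (f(x_i; w) - y_i)^2$

Exponential loss: $L(w) = \sum_{i=1}^{N} e^{-y_i f(x_i; w)}$

Logistic loss: $L(w) = \sum_{i=1}^{N} \ln\left(1 + e^{-y_i f(x_i; w)}\right)$

Cross-entropy is the multi-label version of the logistic loss.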
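A minimal sketch of the square and logistic losses on this slide, in plain Python. The function names, the toy predictions, and the convention that labels take values in {-1, +1} are assumptions made for illustration; they are not from the slides.

```python
import math

def square_loss(preds, targets):
    """Sum over examples of (f(x_i; w) - y_i)^2."""
    return sum((f - y) ** 2 for f, y in zip(preds, targets))

def logistic_loss(preds, labels):
    """Sum over examples of ln(1 + exp(-y_i f(x_i; w))), labels in {-1, +1}."""
    return sum(math.log1p(math.exp(-y * f)) for f, y in zip(preds, labels))

# Hypothetical network outputs f(x_i; w) and labels y_i.
preds = [2.0, -1.0, 0.5]
labels = [1, -1, -1]

sq = square_loss(preds, labels)    # penalizes distance from the label
lg = logistic_loss(preds, labels)  # penalizes small or negative margins y_i * f
print(sq, lg)
```

The square loss grows with any deviation from the label, while the logistic loss keeps decreasing as the margin $y_i f(x_i; w)$ grows.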

SLIDE 4

Gradient descent optimization

$w_{t+1} = w_t - \gamma \nabla_w L(w_t)$, with $L = \sum_{i=1}^{N} \ell_i$

Instead of the full sum $L = \sum_i \ell_i$, use a minibatch of terms $\ell_i$ selected at random for each iteration: SGD.
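The difference between the two update rules can be sketched on a toy problem. The one-parameter model $f(x; w) = wx$, the data, the step size, and the batch size below are all assumptions chosen for illustration, not values from the slides.

```python
import random

def grad_square_loss(w, batch):
    """Gradient in w of sum_i (w * x_i - y_i)^2 over the given examples."""
    return sum(2 * (w * x - y) * x for x, y in batch)

def gd(data, w=0.0, lr=0.01, steps=200):
    """Gradient descent: every step uses the full sum over the data."""
    for _ in range(steps):
        w -= lr * grad_square_loss(w, data)
    return w

def sgd(data, w=0.0, lr=0.01, steps=2000, batch_size=2, seed=0):
    """SGD: every step uses a minibatch drawn at random."""
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        w -= lr * grad_square_loss(w, batch)
    return w

# Noiseless toy data generated by w = 3, so zero error is attainable.
data = [(x, 3.0 * x) for x in (-2.0, -1.0, 0.5, 1.0, 2.0)]
w_gd = gd(data)
w_sgd = sgd(data)
print(w_gd, w_sgd)
```

Because the data can be fit exactly, every minibatch gradient vanishes at the same $w$, so in this sketch both iterations settle at the zero-error solution.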

SLIDE 5

Minima

Can we say how many, and of which kind, independently of GD?

Key fact: DNNs are usually overparametrized, with $M \gg N$, where $M$ is the number of weights and $N$ is the number of equations (one per training example).

SLIDE 6

Bezout theorem

Instead of $\min_w \sum_{i=1}^{N} (f(x_i; w) - y_i)^2$, consider the system

$f(x_i; w) = y_i$, $i = 1, \dots, N$,

because it is easy to find zero error, because of overparametrization.
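To make "easy to find zero error" concrete, here is a small sketch with more weights than equations. The linear model, the two data points, and the one-parameter family of solutions are hypothetical and chosen by hand; a network would not be linear in $w$.

```python
def f(x, w):
    """Hypothetical linear model f(x; w) = w . x with M = 3 weights."""
    return sum(wi * xi for wi, xi in zip(w, x))

# N = 2 training examples (x_i, y_i), so M > N.
data = [((1.0, 0.0, 1.0), 1.0),
        ((0.0, 1.0, 1.0), 2.0)]

def zero_error_solution(t):
    """One-parameter family solving f(x_i; w) = y_i for both examples."""
    return (1.0 - t, 2.0 - t, t)

def max_residual(w):
    """Largest |f(x_i; w) - y_i| over the training examples."""
    return max(abs(f(x, w) - y) for x, y in data)

residuals = [max_residual(zero_error_solution(t)) for t in (-3.0, 0.0, 0.5, 10.0)]
print(residuals)
```

Every value of `t` gives zero error, so the solutions form a line of dimension $M - N = 1$ rather than isolated points.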

SLIDE 7

Bezout theorem

If the nonlinearity $\sigma(x)$ were polynomial, then $f(w)$ would be a multivariate polynomial in the $w$ and in the $x$.

Then $f(x_i; w) = y_i$, $i = 1, \dots, N$, is a system of $N$ polynomial equations in $M$ variables ($N \approx 60\mathrm{k}$ in CIFAR, $M \approx 300\mathrm{k}$).

Bezout theorem: a set of $N$ polynomial equations in $M$ variables, of degree $d$, has at most $d^N$ isolated solutions if $M = N$. If $M > N$, then the solutions

SLIDE 8

are degenerate.

Remarks

This is similar to systems of linear equations.

For the size of NNs today, the number of isolated solutions is very high (more than the number of protons in the universe).

The solutions are in fact more degenerate because of the structure of $f$.

The point is $M$ and $N$, and degeneracy is what we use.
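The comparison with the number of protons can be checked with a few lines of arithmetic. This is only a sketch: the degree $d = 2$ and the dataset size are assumptions (the slides give no degree), and $10^{80}$ is the commonly quoted order of magnitude for the number of protons in the observable universe.

```python
import math

def log10_bezout_bound(degree, n_equations):
    """log10 of degree ** n_equations, the Bezout-type bound on isolated solutions."""
    return n_equations * math.log10(degree)

N = 60_000          # assumed number of equations (training examples)
DEGREE = 2          # assumed polynomial degree; the slides do not give a value
LOG10_PROTONS = 80  # commonly quoted estimate of about 10^80 protons

log10_bound = log10_bezout_bound(DEGREE, N)
print(log10_bound, log10_bound > LOG10_PROTONS)
```

Even with the smallest nontrivial degree, the bound has on the order of eighteen thousand decimal digits.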

SLIDE 9

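Because $M \gg N$, the global minima corresponding to zero error for all $x_i$,

$f(x_i; w) - y_i = 0$, $i = 1, \dots, N$,

are degenerate.

What about all minima? The stationary points of the gradient are $\nabla_w L = 0$, which means

$\sum_{i=1}^{N} (f(x_i; w) - y_i) \frac{\partial f(x_i; w)}{\partial w_j} = 0$, $j = 1, \dots, M$.

These are $M$ equations in $M$ unknowns. If $f$ is polynomial, the equations are polynomial equations and the Bezout theorem applies: the solutions are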

SLIDE 10

in general not degenerate.

Thus global solutions are degenerate; local minima are not degenerate.

SGD and GDL

For the next step I need to establish the similarity between SGD and Langevin equations.

$L(w) = \sum_{i=1}^{N} \ell(f(x_i; w), y_i) = \sum_{i=1}^{N} V(f, z_i)$, with $z_i = (x_i, y_i)$

GD: $\Delta w_t = w_{t+1} - w_t = -\gamma \nabla_w L(w_t)$, a dynamical gradient system.

SGD: $\Delta w_t = -\gamma \nabla_w V(f, z_i)$, with $z_i$ chosen at random.

SLIDE 11

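GDL: $dw_t = -\nabla_w L(w_t)\,dt + \sqrt{2T}\,dB_t$ (an SDE),

where $T$ sets the noise level and $dB_t$ is the derivative of the Brownian motion, that is, zero-mean white noise with Gaussian statistics.

SGD is similar to GDL in simulations, and also if I write SGD as

$\Delta w_t = -\gamma \left( \nabla_w L(w_t) + \xi_t \right)$,

where $\xi_t = \nabla_w V(f, z_i) - \nabla_w L(w_t)$ is a pseudo-noise such that $E[\xi_t] = 0$ (taking $L$ as the average of $V$ over the examples).

$\xi_t$ is defined in terms of minibatches, where the CLT applies, giving $\xi_t$ some Gaussian-like properties.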
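The zero-mean property of the pseudo-noise can be checked by enumerating every possible minibatch. The sketch below assumes a one-parameter model $f = wx$ with square loss, a hand-picked dataset, and $L$ taken as the average of $V$ over the examples; none of these specifics come from the slides.

```python
from itertools import combinations

def grad_example(w, x, y):
    """Gradient in w of V = (w*x - y)^2 for a hypothetical 1-D model f = w*x."""
    return 2.0 * (w * x - y) * x

def mean_grad(w, batch):
    """Average of the per-example gradients over a batch."""
    return sum(grad_example(w, x, y) for x, y in batch) / len(batch)

data = [(-2.0, 1.0), (-1.0, 0.0), (0.5, 2.0), (1.0, -1.0), (2.0, 3.0)]
w = 0.7
full = mean_grad(w, data)  # gradient of L, taken as the average of V

# Pseudo-noise xi = minibatch gradient - full gradient, for every minibatch of size 2.
noises = [mean_grad(w, batch) - full for batch in combinations(data, 2)]
mean_noise = sum(noises) / len(noises)
print(mean_noise)
```

Individual minibatches deviate from the full gradient, but the deviations average to zero (up to rounding) over all equally likely minibatches.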

SLIDE 12

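Let us speak about GDL, which is an SDE:

$dw_t = -\nabla_w L(w_t)\,dt + \sqrt{2T}\,dB_t$

Its solution for the stationary probability distribution is

$p(w) = \frac{1}{Z} e^{-L(w)/T}$

This means that if $L$ is small, $p(w)$ is large; if $L$ is large, $p(w)$ is small.

Important: $p$ shows concentration of probabilities. With large $d$ (the dimension), most of the probability mass is in the large-volume minima of $L$.

See slides.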
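A rough numerical sketch of why the stationary distribution favors large-volume minima. The one-dimensional loss with one wide and one narrow zero-loss minimum, the curvatures, and the temperature are all hypothetical. The extrapolation to $d$ dimensions assumes the same curvature ratio holds independently in every direction.

```python
import math

T = 0.1            # noise level (temperature); an illustrative value
A, B = 1.0, 100.0  # curvatures of the wide and the narrow minimum

def loss(w):
    """Hypothetical 1-D loss with zero-loss minima at w = -2 (wide) and w = +2 (narrow)."""
    return min(A * (w + 2.0) ** 2, B * (w - 2.0) ** 2)

def in_narrow_basin(w):
    """True where the narrow quadratic is the smaller of the two."""
    return B * (w - 2.0) ** 2 < A * (w + 2.0) ** 2

# Riemann sum of the unnormalized density exp(-L/T) over each basin.
step = 1e-3
grid = [-6.0 + k * step for k in range(12001)]
mass_wide = sum(math.exp(-loss(w) / T) for w in grid if not in_narrow_basin(w)) * step
mass_narrow = sum(math.exp(-loss(w) / T) for w in grid if in_narrow_basin(w)) * step

ratio = mass_wide / mass_narrow  # close to sqrt(B / A) in one dimension
ratio_d10 = ratio ** 10          # same curvature ratio in each of d = 10 directions
print(ratio, ratio_d10)
```

Both minima have the same loss, yet the wide one carries about ten times the probability mass in one dimension, and the gap grows exponentially with the dimension.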

SLIDE 13

The conclusion is that the probabilistic solution of GDL prefers, with high probability, degenerate minima.

Together with the Bezout conclusions, this implies that GDL prefers global minima over local ones.

Because GDL $\approx$ SGD, this is valid also for SGD.

SLIDE 14

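The last point in this class, which is also a harbinger of the next class, is about the structure of the solutions of GD with square loss in the overparametrized case.

The dynamical system is

$\dot{w}_j = -\frac{\partial L}{\partial w_j} = -2 \sum_{i=1}^{N} E_i \frac{\partial f(x_i; w)}{\partial w_j}$, with $E_i = f(x_i; w) - y_i$.

If $E_i = 0$ for all $i$, then $\dot{w} = 0$ ($\partial f / \partial w_j$ may be zero too).

Are these solutions stable? Unique? Let us look at the Hessian of $L$:

$\frac{\partial^2 L}{\partial w_j \partial w_k} = 2 \sum_{i=1}^{N} \left[ \frac{\partial f(x_i; w)}{\partial w_j} \frac{\partial f(x_i; w)}{\partial w_k} + E_i \frac{\partial^2 f(x_i; w)}{\partial w_j \partial w_k} \right]$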

SLIDE 15

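If $E_i = 0$ for all $i$, then

$H_{jk} = \frac{\partial^2 L}{\partial w_j \partial w_k} = 2 \sum_{i=1}^{N} \frac{\partial f(x_i; w)}{\partial w_j} \frac{\partial f(x_i; w)}{\partial w_k}$

If $H$ is p.d. (positive definite), then there is stability.

But $H$ is often only positive semidefinite ($H \geq 0$): there are degenerate directions (valleys), as expected from the Bezout analysis.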
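A degenerate direction can be exhibited on the smallest overparametrized example. The sketch assumes a hypothetical polynomial model $f(x; w) = w_1 w_2 x$ with $M = 2$ weights and a single data point ($N = 1$), with square loss. The Hessian is written out by hand for this model only.

```python
import math

def hessian(w1, w2, x, y):
    """Hessian of L = (f - y)^2 for the hypothetical model f(x; w) = w1 * w2 * x."""
    E = w1 * w2 * x - y      # residual E = f - y
    g1, g2 = w2 * x, w1 * x  # df/dw1, df/dw2
    cross = x                # d^2 f / dw1 dw2 (the diagonal second derivatives are 0)
    return [[2 * g1 * g1, 2 * (g1 * g2 + E * cross)],
            [2 * (g1 * g2 + E * cross), 2 * g2 * g2]]

def eigenvalues_sym2(H):
    """Eigenvalues of a symmetric 2x2 matrix, smallest first."""
    a, b, d = H[0][0], H[0][1], H[1][1]
    mid = (a + d) / 2
    rad = math.sqrt(((a - d) / 2) ** 2 + b * b)
    return mid - rad, mid + rad

# One data point (N = 1), two weights (M = 2); w = (1, 1) gives zero error.
H = hessian(1.0, 1.0, x=1.0, y=1.0)
eigs = eigenvalues_sym2(H)
print(H, eigs)
```

At the zero-error point the Hessian has one zero and one positive eigenvalue. The zero eigenvalue belongs to the direction $(1, -1)$, tangent to the valley $w_1 w_2 = 1$ of zero-error solutions.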