Optimization: why does it work? How many minima? Do they control norm (complexity)?
Plan
  Background: NN, SGD, square loss and cross-entropy
  Minima: number (Bezout), degeneracy (Bezout)
  SGD and Langevin
  SGD finds global minima
Hessian
Variations
W
7 Cw
is In euqe
Cross-entropy is the multi-label version of the logistic loss.
We use minibatches selected at random for each iteration: SGD.
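NNs are usually overparametrized, with $M \gg N$, where $M$ is the number of weights $w_i$ and $N$ is the number of equations (one per data point), $i = 1, \dots, N$.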
It is easy to find zero error:
$$f(x_i; w) = y_i, \quad i = 1, \dots, N$$
where $w$ is the vector of weights, is a system of $N$ polynomial equations in $M$ variables.
$N = 60\text{k}$ in [...]; $M \approx 300\text{k}$.
If $M > N$, then the solutions are degenerate. This is similar to systems of linear equations.
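The analogy with linear systems can be checked numerically. The following is a minimal sketch, assuming a random linear system with hypothetical sizes N = 3 equations and M = 5 unknowns (these sizes, and the names `A`, `w0`, `w1`, are illustrative and not from the lecture). It shows that the zero-error solutions form a set of dimension M - N rather than an isolated point.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 3, 5                      # hypothetical sizes: N equations, M unknowns, M > N
A = rng.standard_normal((N, M))
y = rng.standard_normal(N)

# One particular zero-error solution (the minimum-norm least-squares solution).
w0, *_ = np.linalg.lstsq(A, y, rcond=None)

# Null-space basis from the SVD: the last M - rank right singular vectors.
_, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))
null_basis = Vt[rank:].T         # shape (M, M - rank)

# Moving along the null space keeps the error at zero: a degenerate solution set.
w1 = w0 + null_basis @ rng.standard_normal(M - rank)

print("dimension of the solution set:", M - rank)
print("residual at w0:", np.linalg.norm(A @ w0 - y))
print("residual at w1:", np.linalg.norm(A @ w1 - y))
```

Both residuals are zero up to round-off, although `w0` and `w1` are different points.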
For the size of NNs today, the number of isolated solutions is very high (more than the number of protons in the universe), and the solutions are degenerate because $M > N$.
The relation between $M$ and $N$, and the resulting degeneracy, is what we use next.
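To get a feeling for the size of that count, here is a rough sketch. It assumes the Bezout bound for a square polynomial system (at most the product of the degrees isolated solutions), a hypothetical degree of 2 for every equation, and the commonly quoted rough estimate of about 10^80 protons in the universe; only the order of magnitude of M is taken from the notes.

```python
import math

M = 300_000          # number of weights, the order of magnitude quoted in the notes
degree = 2           # hypothetical degree of each polynomial equation
log10_protons = 80   # commonly quoted rough estimate: about 10^80 protons

# Bezout bound for M equations of equal degree in M unknowns: degree ** M.
# Work in log10 to avoid building a huge integer.
log10_bezout = M * math.log10(degree)

print(f"log10(Bezout bound) is about {log10_bezout:.0f}")
print("exceeds proton count:", log10_bezout > log10_protons)
```

Even with the lowest nontrivial degree, the bound has tens of thousands of decimal digits.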
Because $M \gg N$, there are solutions with zero error for all $x_i$:
$$f(x_i; w) - y_i = 0, \quad i = 1, \dots, N.$$
These are global minima.
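The stationary points of the gradient are
$$\nabla_w L = 0$$
which means
$$\frac{\partial L}{\partial w_j} = 0, \quad j = 1, \dots, M.$$
These are $M$ equations in $M$ unknowns. The equations are polynomial, so the Bezout Theorem applies: the solutions are in general not degenerate.
The zero-error solutions are degenerate; local minima are not degenerate.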
SGD and Langevin

For the next step I need to establish a connection between SGD and Langevin.
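$$L(w) = \frac{1}{N}\sum_{i=1}^{N} V(f, z_i), \quad z_i = (x_i, y_i)$$

GD:
$$\Delta w_t = w_{t+1} - w_t = -\gamma\,\nabla_w L(w_t)$$
This is a dynamical gradient system.

SGD:
$$\Delta w_t = -\gamma\,\nabla_w V(f, z_i)$$
with $z_i$ chosen at random.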
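The two update rules can be compared on a toy problem. This is a minimal sketch, assuming a linear model with square loss and hypothetical sizes N = 10, M = 50 (the model, the sizes, the step size `gamma`, and the function names are assumptions made for illustration, not the networks discussed in the lecture).

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 50                       # hypothetical sizes with M > N
X = rng.standard_normal((N, M))
y = rng.standard_normal(N)

def loss(w):
    # L(w) = (1/N) sum_i (x_i . w - y_i)^2
    return np.mean((X @ w - y) ** 2)

def grad_full(w):
    # gradient of L(w), used by GD
    return 2.0 / N * X.T @ (X @ w - y)

def grad_sample(w, i):
    # gradient of the single-example loss V(f, z_i), used by SGD
    return 2.0 * (X[i] @ w - y[i]) * X[i]

gamma, steps = 0.005, 5000
w_gd = np.zeros(M)
w_sgd = np.zeros(M)
for t in range(steps):
    w_gd = w_gd - gamma * grad_full(w_gd)          # GD step
    i = rng.integers(N)                            # z_i chosen at random
    w_sgd = w_sgd - gamma * grad_sample(w_sgd, i)  # SGD step

print("GD  loss:", loss(w_gd))
print("SGD loss:", loss(w_sgd))
```

In this overparametrized setting both runs are expected to reach an essentially zero loss, since a zero-error solution exists and the noise in the SGD step vanishes there.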
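GDL (gradient descent Langevin):
$$dw_t = -\nabla_w L(w_t)\,dt + \sqrt{2\beta^{-1}}\,dB_t \quad \text{(SDE)}$$
where $\beta > 0$ sets the noise level and $dB_t$ is the derivative of the Brownian motion, that is, zero-mean white noise with Gaussian statistics.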
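SGD is similar to GDL in simulations, and also if I write SGD as
$$\Delta w_t = -\gamma\,\bigl(\nabla_w L(w_t) + \xi_t\bigr)$$
where
$$\xi_t = \nabla_w V(f, z_i) - \nabla_w L(w_t)$$
is a pseudo-noise with $E[\xi_t] = 0$. SGD is defined in terms of minibatches, where the CLT applies, giving $\xi_t$ some Gaussianity.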
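The properties of the pseudo-noise can be checked on a small example. The sketch below assumes a linear model with square loss and hypothetical sizes (N = 200, M = 5); it verifies that the noise averages to zero over the data and that its spread shrinks roughly like one over the square root of the minibatch size, which is consistent with the CLT argument but does not by itself test Gaussianity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 200, 5                                    # hypothetical sizes
X = rng.standard_normal((N, M))
y = rng.standard_normal(N)
w = rng.standard_normal(M)                       # an arbitrary point in weight space

per_example = 2.0 * (X @ w - y)[:, None] * X     # row i = gradient of V(f, z_i)
full = per_example.mean(axis=0)                  # gradient of L
xi = per_example - full                          # pseudo-noise for each choice of i

mean_xi = xi.mean(axis=0)                        # zero up to round-off

def minibatch_noise(batch_size, draws=2000):
    # noise of the minibatch gradient, sampling indices with replacement
    idx = rng.integers(N, size=(draws, batch_size))
    return per_example[idx].mean(axis=1) - full  # shape (draws, M)

std_1 = minibatch_noise(1).std(axis=0).mean()
std_64 = minibatch_noise(64).std(axis=0).mean()

print("max |mean of xi|:", np.abs(mean_xi).max())
print("noise std, batch 1 :", std_1)
print("noise std, batch 64:", std_64)
```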
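Let us speak about GDL, which is an SDE:
$$dw_t = -\nabla_w L(w_t)\,dt + \sqrt{2\beta^{-1}}\,dB_t$$
with $\beta > 0$ setting the noise level. Its solution distribution is
$$p(w) = \frac{1}{Z}\,e^{-\beta L(w)}$$
which means that where $L$ is small, $p$ is large.
With large dimension $d$, $p$ shows concentration: most of the mass is in large-volume minima.
See slides.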
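The volume effect can be illustrated with two minima of equal depth. This is a sketch under simplifying assumptions: each minimum is a separable quadratic well, so its mass under the density exp(-beta L) in d dimensions is the 1-D mass raised to the power d; the curvatures 0.1 (wide) and 10 (narrow) and beta = 1 are hypothetical values.

```python
import numpy as np

def well_mass_1d(curvature, beta, half_width=20.0, n=200_001):
    """Numerically integrate exp(-beta * L) for L(w) = curvature * w**2 / 2."""
    w = np.linspace(-half_width, half_width, n)
    dw = w[1] - w[0]
    return np.sum(np.exp(-beta * curvature * w**2 / 2.0)) * dw

beta = 1.0
wide, narrow = 0.1, 10.0                 # hypothetical curvatures, same depth L = 0
ratio_1d = well_mass_1d(wide, beta) / well_mass_1d(narrow, beta)

# For separable wells the mass in d dimensions is the 1-D mass to the power d,
# so the log of the mass ratio grows linearly with d.
log10_ratios = {d: d * np.log10(ratio_1d) for d in (1, 10, 100)}

print("1-D mass ratio (wide / narrow):", ratio_1d)
for d, r in log10_ratios.items():
    print(f"d = {d:3d}: log10(mass ratio) is about {r:.1f}")
```

The analytic 1-D ratio is the square root of the curvature ratio, so the preference for the wide well grows exponentially with the dimension.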
The conclusion is that the solution of GDL prefers, with high probability, degenerate minima. With the previous results, this means global minima over local minima.
Since GDL is similar to SGD, this is valid also for SGD.
The last point in this class, which is also a harbinger of next class, is about the structure of the solutions of GD with square loss in the overparametrized case.
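The dynamical system is
$$\dot{w} = -\nabla_w L(w) = -\frac{2}{N}\sum_{i=1}^{N} E_i\,\nabla_w f(x_i; w), \quad E_i = f(x_i; w) - y_i,$$
with $L(w) = \frac{1}{N}\sum_{i=1}^{N} E_i^2$. If $E_i = 0$ for all $i$, then $\dot{w}$ is zero too.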
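Are these solutions stable? Let us look at the Hessian of $L(w) = \frac{1}{N}\sum_{i=1}^{N} E_i^2$, with $E_i = f(x_i; w) - y_i$:
$$H = \frac{\partial^2 L}{\partial w^2} = \frac{2}{N}\sum_{i=1}^{N}\Bigl[\nabla_w f(x_i; w)\,\nabla_w f(x_i; w)^{\top} + E_i\,\nabla_w^2 f(x_i; w)\Bigr]$$
If $H$ is p.d., then there is stability. $H$ is often only $\geq 0$ (p.s.d., with zero eigenvalues): degenerate directions, valleys, as expected from the Bezout analysis.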
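The degenerate directions can be counted explicitly in the simplest case. The sketch below swaps in a linear model f(x; w) = x . w as a stand-in for the network (an assumption made for illustration), with hypothetical sizes N = 4 and M = 10. For a linear model the second-derivative term vanishes and the Hessian is (2/N) X^T X everywhere, so it has at least M - N zero eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 4, 10                          # hypothetical sizes with M > N
X = rng.standard_normal((N, M))
y = rng.standard_normal(N)

# A zero-error solution of the overparametrized problem.
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = np.linalg.norm(X @ w_star - y)

# Hessian of L(w) = (1/N) sum_i (x_i . w - y_i)^2; constant for a linear model.
H = 2.0 / N * X.T @ X
eig = np.linalg.eigvalsh(H)           # real eigenvalues in ascending order

n_zero = int(np.sum(np.abs(eig) < 1e-9))   # flat directions (valleys)
n_pos = int(np.sum(eig > 1e-9))            # directions with positive curvature

print("residual at w_star:", residual)
print("zero eigenvalues:", n_zero, "(M - N =", M - N, ")")
print("positive eigenvalues:", n_pos)
```

The Hessian is positive semidefinite but not positive definite: the minimum is stable along N directions and flat along the remaining M - N.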