Deep Networks — Some History and Approximation Theory (2018) — PDF document



SLIDE 1

Why deep nets?
Is deep better than shallow?
When? Why?

SLIDE 2

[handwritten sketch; only fragments legible: "Deep", "Shallow"]

SLIDE 3

Some history: '80s … SNN … 2018

SLIDE 4

Approximation Theory: why is depth better?

SLIDE 5

Shallow networks (one hidden layer).

[network diagram: inputs x, one layer of hidden units, linear output]

Main example: a kernel machine,

  f(x) = Σ_{i=1}^{n} c_i K(x, x_i),

with the interpolation conditions f(x_i) = y_i.
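The kernel-machine form above can be sketched in a few lines of code (not from the slides; the Gaussian kernel and the toy data are my own choices for illustration):

```python
import numpy as np

def K(x, xi, gamma=1.0):
    # Gaussian (RBF) kernel K(x, x_i) = exp(-gamma ||x - x_i||^2)
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xi)) ** 2, axis=-1))

# Toy data: n centers x_i with target values y_i
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(20, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2

# Solve the interpolation conditions f(x_i) = y_i for the coefficients c_i
G = np.array([[K(a, b) for b in X] for a in X])   # Gram matrix
c = np.linalg.solve(G, y)

def f(x):
    # f(x) = sum_i c_i K(x, x_i)
    return sum(ci * K(x, xi) for ci, xi in zip(c, X))
```

Because the Gaussian Gram matrix is positive definite for distinct centers, the linear solve enforces f(x_i) = y_i exactly (up to round-off).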

SLIDE 6

Another example: a one-hidden-layer network,

  f(x) = Σ_i c_i σ(⟨w_i, x⟩),

with e.g. the ReLU σ(z) = max(z, 0).

Usually one writes σ(⟨w, x⟩ + b), but I eliminate the bias b by assuming that one of the components of x is constant, x_d = 1, so that ⟨w_i, x⟩ absorbs b_i.
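The bias-elimination trick is easy to check numerically (a minimal sketch, not from the slides):

```python
import numpy as np

def relu(z):
    # sigma(z) = max(z, 0)
    return np.maximum(z, 0.0)

rng = np.random.default_rng(1)
x = rng.normal(size=3)
w = rng.normal(size=3)
b = 0.7

with_bias = relu(w @ x + b)

# Append a constant component x_d = 1 and fold b into the weight vector
x_aug = np.append(x, 1.0)
w_aug = np.append(w, b)
without_bias = relu(w_aug @ x_aug)
```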

SLIDE 7

Deep networks: two (or more) hidden layers.

[network diagram: input x, two hidden layers of units, output]

In components: y_i = V_ik σ(W_kj x_j), using the summation convention.
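In code, the two-layer map y_i = V_ik σ(W_kj x_j) is a pair of matrix products (the sizes here are arbitrary, chosen only for illustration):

```python
import numpy as np

def sigma(z):
    # ReLU nonlinearity
    return np.maximum(z, 0.0)

rng = np.random.default_rng(2)
x = rng.normal(size=4)          # input x_j
W = rng.normal(size=(5, 4))     # first-layer weights W_kj
V = rng.normal(size=(2, 5))     # second-layer weights V_ik

# Summation convention: repeated indices are summed over
h = np.einsum("kj,j->k", W, x)              # h_k = W_kj x_j
y = np.einsum("ik,k->i", V, sigma(h))       # y_i = V_ik sigma(h_k)
```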

SLIDE 8

Networks to approximate / represent functions.

Are deep nets better than shallow ones?

The answer in the '80s was "no". We will see a proof of the above, and a new answer: deep nets can be much better for certain f.

SLIDE 9

Key ideas in approximation theory: functions and approximants.

Example: target functions f ∈ C(ℝ^d); approximants g ∈ V_N, e.g. networks g(x) = Σ_{i=1}^{N} c_i σ(⟨w_i, x⟩).

SLIDE 10

Density: for every compact K ⊂ ℝ^d and every ε > 0, there exists g such that

  sup_{x ∈ K} |f(x) − g(x)| < ε,

where g ranges over V_N, the set of networks.
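The density statement can be illustrated empirically: fit a small shallow net to a smooth target on a compact set and measure the sup-norm error on a grid. The random-feature construction below (fixed inner weights, least-squares outer weights) is my own simplification, not the slides' construction:

```python
import numpy as np

f = np.sin                          # target f, restricted to K = [0, 1]
grid = np.linspace(0.0, 1.0, 500)   # a fine grid standing in for K

# Shallow net g(x) = sum_i c_i sigma(w_i x + b_i) with N units:
# inner weights w_i, b_i fixed at random, outer weights c_i fit by
# least squares on the grid.
N = 50
rng = np.random.default_rng(3)
w = rng.normal(scale=3.0, size=N)
b = rng.normal(scale=3.0, size=N)
features = np.tanh(np.outer(grid, w) + b)            # sigma = tanh
c, *_ = np.linalg.lstsq(features, f(grid), rcond=None)

sup_error = np.max(np.abs(features @ c - f(grid)))   # ~ sup_K |f - g|
```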

slide-11
SLIDE 11

Degreeafapproximation

H f e CCR

d

gnefn I f g la

e

dinge

is N

SLIDE 12

Shallow nets: density and degree of approximation.

Consider target functions f ∈ W_m^d, the Sobolev space of functions of d variables with smoothness m; no structure assumed.

Theorem: for networks g(x) = Σ_i c_i σ(⟨w_i, x⟩): ∀f ∈ W_m^d ∃g ∈ V_N s.t. ‖g − f‖_∞ ≤ ε, with N = O(ε^(−d/m)).

SLIDE 13

Curse of dimensionality. In Bellman's terms:

a) optimization cannot be done by exhaustive search;
b) function approximation requires ε^(−d) evaluations for f Lipschitz (order ε^(−d));
c) integration.

SLIDE 14

Blessings:

1) smoothness (Barron's theorem);
2) compositionality.

SLIDE 15

Examples: P_k^d has (d+k choose k) ≈ k^d monomials.

A function of 10 variables corresponds to a 10-D table. If each dimension is discretized with just 10 partitions, I have a table with 10^10 entries. If d = 100 pixels …

slide-16
SLIDE 16

e

s

If d

  • p els

then

10

entries

If f e Wdm

N

O

Ym

For

E

e

l O

d

too

N Off
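The numbers on the last two slides are easy to reproduce (a quick check, not in the original):

```python
from math import comb

# dim P_k^d = C(d + k, k): number of monomials of degree <= k in d variables
d, k = 10, 4
n_monomials = comb(d + k, k)   # 1001 monomials already for d = 10, k = 4

# A 10-variable function tabulated with 10 partitions per dimension:
table_entries = 10 ** 10

# Degree of approximation for f in W_m^d: N = O(eps^(-d/m))
eps, d, m = 0.1, 100, 1
N = eps ** (-d / m)            # ~1e100 units for eps = 0.1, d = 100, m = 1
```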

SLIDE 17

Summary of the proof:

  Σ_i c_i σ(⟨a_i, x⟩ + b_i) → polynomials p(⟨w, x⟩) → f.

P_k^d: polynomials of degree ≤ k in d variables; dim P_k^d = (k+d choose d) ≈ k^d / d!.

Sobolev: for f ∈ W_m^d ⊂ L_p (derivatives up to order m),

  E(f; P_k; L_p) ≤ C k^(−m),

so ε ≈ k^(−m) requires k ≈ ε^(−1/m), and hence N ≈ k^d ≈ ε^(−d/m) units.
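A quick numeric sanity check of the dimension count and the resulting unit count (my own illustration, not from the slides):

```python
from math import comb, factorial

# dim P_k^d = C(k + d, d), which behaves like k^d / d! for large k
d, k = 3, 20
dim_Pkd = comb(k + d, d)                 # 1771
kd_over_dfact = k ** d / factorial(d)    # ~1333: same order of magnitude

# E(f; P_k; L_p) <= C k^(-m) forces k ~ eps^(-1/m), hence N ~ k^d
eps, m = 0.1, 2
k_needed = eps ** (-1 / m)               # ~3.16
N = k_needed ** d                        # ~31.6, i.e. eps^(-d/m)
```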

SLIDE 18

Logic of the proof:

1) Networks approximate univariate polynomials.
2) Univariate polynomials in ⟨w, x⟩ represent multivariate polynomials.
3) Multivariate polynomials approximate Sobolev functions.

Thus the theorem.

SLIDE 19

Any univariate polynomial p(x) can be represented as a linear combination of smooth ReLUs.

Proof idea: for the unit σ(ax + b),

  d/da σ(ax + b) = x σ′(ax + b),

and higher derivatives in a bring down higher powers of x; at a = 0 the k-th derivative gives x^k σ^(k)(b).

Theorem: if σ ∈ C^∞ is not a polynomial, the closure of N_r (the set of networks with r units and nonlinearity σ) contains the linear space of polynomials Π.

slide-20
SLIDE 20

p

a of

a

Second derivative

which

needs

3

terms

gives

x

Thus

Nr

is dense in

C CH

because of Weierstrass

theorem
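The finite-difference construction on these two slides can be verified numerically. Below the smooth ReLU is taken to be the softplus σ(z) = log(1 + e^z), my choice; the slides only require a smooth non-polynomial σ:

```python
import numpy as np

def softplus(z):
    # a smooth "ReLU": sigma(z) = log(1 + exp(z))
    return np.log1p(np.exp(z))

def softplus_dd(z):
    # sigma''(z) = s(z) (1 - s(z)), with s the logistic function
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

x, b, h = 1.7, 0.3, 1e-3

# Three units form a central finite difference approximating the second
# derivative of a -> sigma(a x + b) at a = 0; by the chain rule that
# derivative is x^2 sigma''(b), i.e. a multiple of x^2.
fd = (softplus(h * x + b) - 2.0 * softplus(b) + softplus(-h * x + b)) / h ** 2
target = x ** 2 * softplus_dd(b)
```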

SLIDE 21

FROM 1-D TO d-D:

univariate polynomials p_i(⟨w_i, x⟩) on ℝ^d; homogeneous polynomials H_k^d of d variables of degree k have

  dim H_k^d = (d + k − 1 choose k);

thus polynomials in P_k^d can be represented by a network with r = (k + d choose d) units.

SLIDE 22

We want to show that if P(x) is a polynomial on ℝ^d, then

  P(x) = Σ_{i=1}^{r} p_i(⟨w_i, x⟩)

for some choice of r, w_i and p_i.

No general proof here, but consider the following: assume a network with r units and univariate polynomials p_i … can I synthesize the polynomial P(x) …
slide-23
SLIDE 23

Can

yuthere

Cl of

d variables

I

can

get

xp

E

how

do

I get

x xz

Well

EE

II

aka Reimann

And

hour pot degree him 2Nd

Pnd

a

4

u

n

din

HE

Ith

r

din Pu

tan
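The polarization trick for the cross term is one line (a quick check, not in the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
x1, x2 = rng.normal(size=2)

# Only powers of linear functions of x are directly available;
# (x1 + x2)^2 - x1^2 - x2^2 = 2 x1 x2 recovers the cross term.
cross = ((x1 + x2) ** 2 - x1 ** 2 - x2 ** 2) / 2.0
```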

SLIDE 24

Define, for f in a space B ⊂ L_p,

  E(f; P_k; L_p) = inf_{p ∈ P_k} ‖p − f‖_{L_p}.

Theorem: for f ∈ W_m^d ∩ L_p,

  E(f; P_k; L_p) ≤ C k^(−m).

Proof: classical. Choosing k ≈ ε^(−1/m) gives r = dim P_k^d ≈ k^d units, i.e. N = O(ε^(−d/m)).

SLIDE 25

Remarks:

1) Even without smoothness, a shallow net can represent exactly the polynomials in P_k^d, with r = (k + d choose d) ≈ k^d units.

SLIDE 26

Depth.

For general functions, shallow and deep nets suffer the curse of dimensionality. But for Local Hierarchical Compositional functions, deep nets, unlike shallow ones, do not have the curse.
slide-27
SLIDE 27

LHC

functions

Swnplesteeample

gftp.flkrkisxn

x

he

4

4h

fg

f Lx Xa

f2

3

h

few

unhfiew

Another eeaus.pk

f

x.kz

AX.Xz

iBx.eCxzJ

shallow not require

2

units
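A compositional function of this kind is just a few nested calls; the particular bivariate h's below are made up for illustration:

```python
import numpy as np

# Each constituent function takes only 2 arguments (locality).
def h1(a, b):
    return np.sin(a + b)

def h2(a, b):
    return np.tanh(a * b)

def h3(a, b):
    return a + b ** 2

def f(x1, x2, x3, x4):
    # binary-tree composition: f = h3(h1(x1, x2), h2(x3, x4))
    return h3(h1(x1, x2), h2(x3, x4))
```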

SLIDE 28

… versus a deep net.

Intuition: for a compositional f of d = 4 variables, a shallow net needs ~ε^(−4/m) units, while a deep net needs ~ε^(−2/m) units for each node, for a total of ~3 · ε^(−2/m) units.

Another example:

  y = … sin(x_1 + x_2) … σ(x_3 + x_4) …,

again of the compositional form h_3(h_1(x_1, x_2), h_2(x_3, x_4)).
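The unit counts in this intuition are easy to tabulate (my own arithmetic on the slide's formulas, for one concrete choice of ε and m):

```python
# For a compositional f of d = 4 variables built from 3 bivariate nodes,
# each of smoothness m: a shallow net needs ~ eps^(-d/m) units, a deep
# net with the same graph ~ eps^(-2/m) units per node.
eps, m, d, nodes = 0.1, 2, 4, 3

shallow_units = eps ** (-d / m)          # eps^(-2) = 100
deep_units = nodes * eps ** (-2 / m)     # 3 * eps^(-1) = 30
```

The gap widens exponentially in d: for d = 8 (7 nodes) the shallow count becomes eps^(-4) = 10,000 while the deep count is only 70.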

SLIDE 29

Theorem: deep nets with the same graph as f approximate compositional functions of d variables (each constituent function in W_m²) with O(ε^(−2/m)) units per node, for a total of O((d − 1) ε^(−2/m)) units.

[binary-tree diagram]

Proof: each h can be approximated with O(ε^(−2/m)) units. We assume …

SLIDE 30

… each h is Lipschitz continuous, that is,

  |h(a) − h(b)| ≤ L |a − b|.

By hypothesis, for each node ‖h − p‖ ≤ ε. Then

  ‖h_3(h_1, h_2) − p_3(p_1, p_2)‖
    ≤ ‖h_3(h_1, h_2) − h_3(p_1, p_2)‖ + ‖h_3(p_1, p_2) − p_3(p_1, p_2)‖
    ≤ L (‖h_1 − p_1‖ + ‖h_2 − p_2‖) + ε
    ≤ c ε,

using the Minkowski (triangle) inequality in the first step and the Lipschitz hypothesis in the second.

More general theorems hold for DAGs.
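The error-propagation bound can be checked numerically with simple stand-ins for the h's and their per-node approximants p (all functions below are made up; h3 is chosen so that |h3(a,b) − h3(a',b')| ≤ |a − a'| + |b − b'|, i.e. L = 1):

```python
import numpy as np

L = 1.0                                  # Lipschitz constant of h3 below
def h1(a, b): return np.sin(a + b)
def h2(a, b): return np.cos(a - b)
def h3(a, b): return a + b

# Per-node approximants, off by a known amount (playing the role of eps_i)
e1, e2, e3 = 0.005, 0.003, 0.008
def p1(a, b): return h1(a, b) + e1
def p2(a, b): return h2(a, b) - e2
def p3(a, b): return h3(a, b) + e3

# Sample the input domain and compare composite outputs
rng = np.random.default_rng(5)
A, B, C, D = rng.uniform(-1, 1, size=(4, 1000))

err = np.max(np.abs(h3(h1(A, B), h2(C, D)) - p3(p1(A, B), p2(C, D))))
bound = L * (e1 + e2) + e3               # L(||h1-p1|| + ||h2-p2||) + eps_3
```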

SLIDE 31

This theorem may explain why deep nets are successful, and why all the really good ones are CNNs.

[diagram: hierarchical network over the input]

Locality is key, not weight sharing; weight sharing helps, but not exponentially.