Monte Carlo Methods and Neural Networks

Alexander Keller, partially joint work with Noah Gamboa

Artificial Neural Networks in a Nutshell

Supervised learning of high dimensional function approximation

input layer a_0, L−1 fully connected hidden layers, and output layer a_L

[figure: fully connected network with units a_{l,0}, a_{l,1}, …, a_{l,n_l−1} in each layer l]

– n_l rectified linear units (ReLU) a_{l,i} = max{0, ∑_j w_{l,j,i} · a_{l−1,j}} in layer l
– backpropagating the error δ_{l−1,i} = ∑_{j: a_{l,j} > 0} δ_{l,j} · w_{l,j,i}
– updating the weights w′_{l,j,i} = w_{l,j,i} − λ · δ_{l,j} · a_{l−1,i} if a_{l,j} > 0
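These three formulas already specify inference and training for a fully connected ReLU layer. The following NumPy sketch makes them concrete; it is a minimal illustration with made-up function names, using a weight matrix `w` whose entry `w[i, j]` connects unit j of layer l−1 to unit i of layer l.

```python
import numpy as np

def forward(w, a_prev):
    # a_{l,i} = max{0, sum_j w_{l,j,i} * a_{l-1,j}}
    return np.maximum(0.0, w @ a_prev)

def backward_and_update(w, a_prev, a, delta, lam):
    # only units with a_{l,j} > 0 propagate an error (derivative of the ReLU)
    gated = delta * (a > 0.0)
    # backpropagated error: delta_{l-1,i} = sum_{j: a_{l,j} > 0} delta_{l,j} * w_{l,j,i}
    delta_prev = w.T @ gated
    # weight update: w'_{l,j,i} = w_{l,j,i} - lambda * delta_{l,j} * a_{l-1,i} if a_{l,j} > 0
    w_new = w - lam * np.outer(gated, a_prev)
    return delta_prev, w_new
```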


Artificial Neural Networks in a Nutshell

Convolutional neural networks: Similarity measures

convolutional layer: feature map defined by a convolution kernel
– identical weights across all neural units of one feature map

max pooling layer: maximum of a tile of neurons in a feature map for subsampling

◮ Gradient based learning applied to document recognition
◮ Quasi-Monte Carlo feature maps for shift-invariant kernels
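As an illustration of the two layer types, here is a minimal sketch written with explicit loops rather than an optimized implementation; it is not the architecture of the cited papers.

```python
import numpy as np

def feature_map(image, kernel):
    # convolutional layer: the same kernel weights are shared by every unit of the feature map
    kh, kw = kernel.shape
    out = np.empty((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(kernel * image[y:y + kh, x:x + kw])
    return out

def max_pool(fmap, tile=2):
    # max pooling layer: subsample by keeping the maximum of each tile
    h = fmap.shape[0] // tile * tile
    w = fmap.shape[1] // tile * tile
    t = fmap[:h, :w].reshape(h // tile, tile, w // tile, tile)
    return t.max(axis=(1, 3))
```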


Relations to Mathematical Objects


Relations to Mathematical Objects

Maximum pooling layers

rectified linear unit ReLU(x) := max{0, x} as a basic non-linearity

for example, the leaky ReLU is ReLU(x) − α · ReLU(−x), which for α = −1 yields the absolute value |x| = ReLU(x) + ReLU(−x)

hence the maximum of two values is

max{x, y} = (x + y)/2 + |x − y|/2 = (1/2) · (x + y + ReLU(x − y) + ReLU(y − x))

which allows one to represent maximum pooling by ReLU functions and introduces skip links
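A few lines of NumPy verify these identities numerically; the test values are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # leaky ReLU written as ReLU(x) - alpha * ReLU(-x)
    return relu(x) - alpha * relu(-x)

def abs_via_relu(x):
    # alpha = -1 yields the absolute value |x| = ReLU(x) + ReLU(-x)
    return relu(x) + relu(-x)

def max_via_relu(x, y):
    # max{x, y} = (x + y)/2 + |x - y|/2
    #           = (x + y + ReLU(x - y) + ReLU(y - x)) / 2
    return 0.5 * (x + y + relu(x - y) + relu(y - x))

x, y = np.array([-2.0, 3.0]), np.array([1.0, -4.0])
assert np.allclose(max_via_relu(x, y), np.maximum(x, y))
assert np.allclose(abs_via_relu(x), np.abs(x))
```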


Relations to Mathematical Objects

Residual layers look like projections onto half spaces

halfspace H⁺ with the weights ω̂ as normal and the bias b as distance from the origin O
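To make the analogy concrete: the Euclidean projection onto the halfspace H⁺ = {z : ω̂ · z ≥ b} with unit normal ω̂ can be written with a single ReLU and a skip connection, which is exactly the shape of a residual unit. This small derivation is an illustration of the slide's statement, not a formula taken from it.

```python
import numpy as np

def project_onto_halfspace(x, omega_hat, b):
    # if omega_hat . x >= b the point is kept unchanged; otherwise it is moved
    # along the unit normal until the constraint holds -- skip link plus ReLU
    return x + np.maximum(0.0, b - omega_hat @ x) * omega_hat
```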


Relations to Mathematical Objects

Residual layers as differential equations

relation to a differential equation by introducing a step size h

a_l = a_{l−1} + h · W_l^(2) · max{0, W_l^(1) · a_{l−1}}

resembles the Euler method

(a_l − a_{l−1}) / h = W_l^(2) · max{0, W_l^(1) · a_{l−1}}

which for h → 0 becomes the ordinary differential equation ȧ = W^(2) · max{0, W^(1) · a}

– select your favorite ordinary differential equation to determine W_l^(1) and W_l^(2)
– use your favorite ordinary differential equation solver for both inference and training

◮ Neural networks motivated by partial differential equations
◮ A radical new neural network design could overcome big challenges in AI
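In code, a residual block literally is one explicit Euler step of that differential equation, and stacking the blocks integrates it. A minimal sketch with illustrative names and randomly chosen weights:

```python
import numpy as np

def residual_block(a, w1, w2, h):
    # a_l = a_{l-1} + h * W2_l @ max{0, W1_l @ a_{l-1}} -- one explicit Euler step
    return a + h * (w2 @ np.maximum(0.0, w1 @ a))

rng = np.random.default_rng(0)
a = rng.standard_normal(16)
for _ in range(8):                       # 8 residual blocks = 8 Euler steps
    w1 = rng.standard_normal((16, 16)) / 4.0
    w2 = rng.standard_normal((16, 16)) / 4.0
    a = residual_block(a, w1, w2, h=0.125)
```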


Relations to Mathematical Objects

Learning integral operator kernels

neural unit with ReLU

a_{l,j} := max{0, ∑_{i=0}^{n_{l−1}−1} w_{l,j,i} · a_{l−1,i}}

or, moving the non-linearity from the output of a unit to its inputs,

a_{l,j} := ∑_{i=0}^{n_{l−1}−1} w_{l,j,i} · max{0, a_{l−1,i}}

written in continuous form

a_l(y) := ∫_0^1 w_l(x, y) · max{0, a_{l−1}(x)} dx

relates to high-dimensional integro-approximation

a recurrent neural network layer in continuous form alludes to an integral equation

a′_l(y) := ∫_0^1 w_l(x, y) · max{0, a_{l−1}(x)} dx + ∫_0^1 w^h_l(x, y) · max{0, a_l(x)} dx

– the weights w^h establish the recurrence, e.g. for processing sequences of data
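Reading a layer as this integral operator is what connects it to Monte Carlo methods in the remainder of the talk: the integral can be estimated by sampling x. The sketch below uses an arbitrary kernel and input function purely for illustration.

```python
import numpy as np

def layer_value(w_l, a_prev, y, n, rng=np.random.default_rng()):
    # Monte Carlo estimate of  a_l(y) = int_0^1 w_l(x, y) * max{0, a_{l-1}(x)} dx
    x = rng.random(n)                    # n uniform samples on [0, 1)
    return np.mean(w_l(x, y) * np.maximum(0.0, a_prev(x)))

# illustrative kernel and previous-layer activation, not taken from the talk
w_l = lambda x, y: np.cos(np.pi * x * y)
a_prev = lambda x: np.sin(2.0 * np.pi * x) - 0.25
estimate = layer_value(w_l, a_prev, y=0.5, n=4096)
```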


Monte Carlo Methods and Neural Networks

Explore algorithms linear in time and space

structural equivalence of integral equations and reinforcement learning

learning integro-approximation from noisy/sampled data

examples of random sampling
– pseudo-random initialization
– training by stochastic gradient descent
– regularization by drop-out and drop-connect
– random binarization
– sampling by generative adversarial networks
– fixed pseudo-random matrices for direct feedback alignment

◮ Learning light transport the reinforced way
◮ Machine learning and integral equations
◮ Noise2Noise: Learning image restoration without clean data


Partition instead of Dropout


Partition instead of Dropout

Guaranteeing coverage of neural units

dropout: drop a neuron if the threshold 1/P > ξ
– ξ by a linear feedback shift register generator (for example)

[figure: fully connected network diagram]


instead, assign each neuron to partition p = ⌊ξ · P⌋ out of P
– fewer random number generator calls

[figure: fully connected network diagram with neural units assigned to partitions]
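A minimal sketch contrasting the two schemes (my own illustration of the slide): dropout draws one uniform number per neuron and may leave some neurons untrained, while the partition assignment reuses the same number to put every neuron into exactly one of the P partitions.

```python
import numpy as np

def dropout_keep_mask(n_neurons, P, rng):
    # classic dropout: drop a neuron whenever the threshold 1/P exceeds xi
    xi = rng.random(n_neurons)
    return xi >= 1.0 / P                 # True marks a neuron that is kept

def partition_index(n_neurons, P, rng):
    # partition instead: p = floor(xi * P) assigns each neuron to one partition,
    # so cycling through the partitions guarantees coverage of all neural units
    xi = rng.random(n_neurons)
    return (xi * P).astype(int)          # value in {0, ..., P-1}

rng = np.random.default_rng(42)
keep = dropout_keep_mask(8, P=3, rng=rng)
parts = partition_index(8, P=3, rng=rng)
```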


Partition instead of Dropout

[figure: training accuracy of LeNet on CIFAR-10 over the training epochs, comparing 3 dropout partitions with 1/3 dropout and 2 dropout partitions with 1/2 dropout]


Simulating Discrete Densities


Simulating Discrete Densities

Stochastic evaluation of scalar product

select inputs proportional to the weight density
– remember to flip the sign accordingly


partition of the unit interval by the prefix sums P_k := ∑_{j=1}^{k} |w_j| of the normalized absolute weights

[figure: unit interval from P_0 = 0 to P_m = 1, partitioned into segments of length |w_1|, |w_2|, …, |w_m| with boundaries P_1, P_2, …, P_{m−1}]

– using a uniform random variable ξ ∈ [0,1), select input i ⇔ P_{i−1} ≤ ξ < P_i, satisfying Prob({P_{i−1} ≤ ξ < P_i}) = |w_i|
– use jittered equidistant samples for ξ

in fact, this is a derivation of quantization to ternary weights in {−1, 0, +1}
– integer weights result from neurons referenced more than once
– relation to drop connect and drop out
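The construction above is easy to state in code. The sketch below samples inputs proportionally to the normalized absolute weights using jittered equidistant samples and, by counting how often each input is referenced with its sign, produces the ternary respectively integer weights; the function names and the scaling comment are illustrative.

```python
import numpy as np

def sample_inputs(w, n_samples, rng=np.random.default_rng()):
    # prefix sums P_k of the normalized absolute weights partition [0, 1)
    cdf = np.cumsum(np.abs(w)) / np.sum(np.abs(w))
    # jittered equidistant samples xi_k = (k + u_k) / n_samples
    xi = (np.arange(n_samples) + rng.random(n_samples)) / n_samples
    # select input i  <=>  P_{i-1} <= xi < P_i, hence Prob(select i) = |w_i| / sum|w|
    idx = np.minimum(np.searchsorted(cdf, xi, side="right"), len(w) - 1)
    return idx, np.sign(w[idx])          # remember to flip the sign accordingly

def sampled_ternary_weights(w, n_samples=1):
    # counting signed references yields weights in {-1, 0, +1} for a single sample
    # and integer weights when an input is referenced more than once
    idx, signs = sample_inputs(np.asarray(w, dtype=float), n_samples)
    q = np.zeros(len(w))
    np.add.at(q, idx, signs)
    return q                             # rescale by sum|w| / n_samples to estimate w . a
```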


Simulating Discrete Densities

Results

97% of the model's accuracy by sampling the most important 12% of the weights

[figure: test accuracy over the number of samples per neuron for a two layer ReLU feedforward network on MNIST]


Neural Networks linear in Time and Space


Neural Networks linear in Time and Space

Complexity

number of neural units n = ∑_{l=1}^{L} n_l, where n_l is the number of neurons in layer l

number of weights n_w = ∑_{l=1}^{L} n_{l−1} · n_l

select a constant number of weights per neuron
– complexity proportional to n_w, which now is linear in n
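As a back of the envelope check, using the 784/300/300/10 network that appears later in the talk; the constant of 32 weights per neuron is an arbitrary illustration.

```python
# layers a_0 ... a_3 of the 784/300/300/10 feedforward network
layers = [784, 300, 300, 10]

n  = sum(layers[1:])                                      # 610 neural units
nw = sum(a * b for a, b in zip(layers[:-1], layers[1:]))  # 328200 weights, fully connected

c = 32                     # a constant number of weights per neuron (illustrative)
nw_sampled = c * n         # 19520 weights -- proportional to n instead of to nw
```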


Neural Networks linear in Time and Space

Results

[figure: test accuracy over the percentage of fully connected layers sampled, for LeNet on MNIST, LeNet on CIFAR-10, AlexNet on CIFAR-10, and top-5 and top-1 accuracy of AlexNet on ILSVRC12]


Neural Networks linear in Time and Space

Sampling paths through networks

complexity bounded by the number of paths times the depth L of the network
– relation to random walks on directed graphs and Markov chains

application after training
– backwards random walks using sampling proportional to the weights of a neuron
– compression and quantization by importance sampling

application before training
– uniform (bidirectional) random walks to connect inputs and outputs
– sparse from scratch (see the sketch below)
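A minimal sketch of how paths could be sampled before training to obtain a network that is sparse from scratch; the uniform walk, the mask representation, and the function name are illustrative choices, not the exact procedure of the talk.

```python
import numpy as np

def sample_path_masks(layers, n_paths, rng=np.random.default_rng()):
    # one boolean mask per pair of consecutive layers; True marks a kept weight
    masks = [np.zeros((nxt, prev), dtype=bool)
             for prev, nxt in zip(layers[:-1], layers[1:])]
    for _ in range(n_paths):
        # a uniform random walk picks one unit per layer, connecting an input to an output
        units = [rng.integers(n) for n in layers]
        for l, mask in enumerate(masks):
            mask[units[l + 1], units[l]] = True
    return masks

# e.g. the 4 layer feedforward network from the results slide
masks = sample_path_masks([784, 300, 300, 10], n_paths=5000)
kept = [m.mean() for m in masks]   # fraction of weights kept per layer
```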


Neural Networks linear in Time and Space

Sampling paths through networks

sparse from scratch

[figure: network diagram with the connections selected by the sampled paths]

– guaranteed connectivity and coverage


Neural Networks linear in Time and Space

Test accuracy for a 4 layer feedforward network (784/300/300/10)

[figure: test accuracy over the number of per pixel paths through the network, for MNIST and Fashion MNIST]


Monte Carlo Methods and Neural Networks


Monte Carlo Methods and Neural Networks

Relations to integro-approximation

dropout partitions
– using a fraction of the random numbers

simulating discrete densities explains {−1, 0, 1} and integer weights
– compression and quantization without retraining

neural networks with linear complexity for both inference and training
– sampling paths through neural networks
– no more dangling neurons
– sparse from scratch

◮ also watch the GTC presentation by Noah Gamboa