Monte Carlo Methods and Neural Networks

Alexander Keller, partially joint work with Noah Gamboa
Artificial Neural Networks in a Nutshell

Supervised learning of high dimensional function approximation

input layer $a_0$, $L-1$ fully connected hidden layers, and output layer $a_L$

[Figure: fully connected network with units $a_{0,0}, \dots, a_{0,n_0-1}$ through $a_{L,0}, \dots, a_{L,n_L-1}$]

– $n_l$ rectified linear units (ReLU) $a_{l,i} = \max\{0, \sum_j w_{l,j,i} \cdot a_{l-1,j}\}$ in layer $l$
– backpropagating the error $\delta_{l-1,i} = \sum_{a_{l,j} > 0} \delta_{l,j} \cdot w_{l,j,i}$, update weights $w'_{l,j,i} = w_{l,j,i} - \lambda \cdot \delta_{l,j} \cdot a_{l-1,i}$ if $a_{l,j} > 0$
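A minimal NumPy sketch of these three steps (ReLU forward pass, gated error backpropagation, and the weight update); the layer sizes and learning rate $\lambda$ below are illustrative assumptions, not values from the talk.

```python
import numpy as np

def forward(weights, a0):
    """Forward pass: a_{l,i} = max{0, sum_j w_{l,j,i} * a_{l-1,j}}."""
    activations = [a0]
    for W in weights:                                    # W has shape (n_{l-1}, n_l)
        activations.append(np.maximum(0.0, activations[-1] @ W))
    return activations

def backward(weights, activations, delta_L, lam=0.01):
    """Backpropagate the error through active units (a_{l,j} > 0), then update weights."""
    delta = delta_L
    for l in range(len(weights) - 1, -1, -1):
        delta = delta * (activations[l + 1] > 0)         # gate: only a_{l,j} > 0 contribute
        delta_prev = weights[l] @ delta                  # delta_{l-1,i} = sum_j delta_{l,j} * w_{l,j,i}
        weights[l] -= lam * np.outer(activations[l], delta)  # w' = w - lam * delta * a
        delta = delta_prev
    return weights

# toy usage with assumed layer sizes
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 8)) * 0.1, rng.standard_normal((8, 3)) * 0.1]
acts = forward(weights, rng.standard_normal(4))
backward(weights, acts, delta_L=acts[-1] - np.array([1.0, 0.0, 0.0]))
```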
Artificial Neural Networks in a Nutshell
Convolutional neural networks: Similarity measures
convolutional layer: feature map defined by a convolution kernel
– identical weights across all neural units of one feature map

max pooling layer: maximum of a tile of neurons in a feature map for subsampling

◮ Gradient based learning applied to document recognition
◮ Quasi-Monte Carlo feature maps for shift-invariant kernels
Relations to Mathematical Objects

Maximum pooling layers

rectified linear unit $\mathrm{ReLU}(x) := \max\{0, x\}$ as a basic non-linearity

for example, leaky ReLU is $\mathrm{ReLU}(x) - \alpha \cdot \mathrm{ReLU}(-x)$, which for $\alpha = -1$ yields the absolute value $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$

hence the maximum of two values is
$$\max\{x, y\} = \frac{x + y}{2} + \frac{|x - y|}{2} = \frac{1}{2} \cdot \big(x + y + \mathrm{ReLU}(x - y) + \mathrm{ReLU}(y - x)\big)$$
which allows one to represent maximum pooling by ReLU functions and introduces skip links
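The identity is easy to verify numerically; a minimal sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def max_via_relu(x, y):
    """max{x, y} = (x + y + ReLU(x - y) + ReLU(y - x)) / 2"""
    return 0.5 * (x + y + relu(x - y) + relu(y - x))

x, y = np.random.standard_normal((2, 5))
assert np.allclose(max_via_relu(x, y), np.maximum(x, y))
```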
Relations to Mathematical Objects
Residual layers look like projections onto half spaces

halfspace $H^+$ with the unit weight vector $\hat{\omega}$ as normal and the bias $b$ as distance from the origin $O$
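One way to make the resemblance concrete (a sketch of my reading of the slide, not a formula from the talk): the projection onto $H^+ = \{x : \hat{\omega} \cdot x \ge b\}$ can itself be written in residual form with a ReLU.

```python
import numpy as np

def project_halfspace(x, omega_hat, b):
    """Project x onto H+ = {x : omega_hat . x >= b}: points inside are fixed,
    points outside move along the normal -- the residual form x + ReLU(...) * omega_hat."""
    return x + np.maximum(0.0, b - omega_hat @ x) * omega_hat

omega_hat = np.array([1.0, 0.0])                                    # unit normal
print(project_halfspace(np.array([-1.0, 2.0]), omega_hat, b=0.5))   # -> [0.5, 2.0]
```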
Relations to Mathematical Objects

Residual layers as differential equations

relation to a differential equation by introducing a step size $h$
$$a_l = a_{l-1} + h \cdot W_l^{(2)} \max\{0, W_l^{(1)} \cdot a_{l-1}\} \quad\Leftrightarrow\quad \frac{a_l - a_{l-1}}{h} = W_l^{(2)} \max\{0, W_l^{(1)} \cdot a_{l-1}\}$$
resembles the Euler method and for $h \to 0$ becomes the ordinary differential equation $\dot{a} = W^{(2)} \max\{0, W^{(1)} \cdot a\}$

– select your favorite ordinary differential equation to determine $W_l^{(1)}$ and $W_l^{(2)}$
◮ Neural networks motivated by partial differential equations
– use your favorite ordinary differential equation solver for both inference and training
◮ A radical new neural network design could overcome big challenges in AI
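A minimal sketch of a residual layer as one explicit Euler step; sharing $W^{(1)}, W^{(2)}$ across all steps is an assumption made here for brevity.

```python
import numpy as np

def residual_euler_step(a, W1, W2, h):
    """One residual layer = one explicit Euler step of da/dt = W2 max{0, W1 a}."""
    return a + h * (W2 @ np.maximum(0.0, W1 @ a))

def integrate(a0, W1, W2, h=0.1, steps=10):
    """Stacking residual layers integrates the ODE; a smaller step size h
    with more steps (a deeper network) refines the trajectory."""
    a = a0
    for _ in range(steps):
        a = residual_euler_step(a, W1, W2, h)
    return a

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((2, 4, 4)) * 0.1
print(integrate(rng.standard_normal(4), W1, W2))
```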
Relations to Mathematical Objects

Learning integral operator kernels

neural unit with ReLU
$$a_{l,j} := \max\Big\{0, \sum_{i=0}^{n_{l-1}-1} w_{l,j,i}\, a_{l-1,i}\Big\} \quad\rightarrow\quad a_{l,j} := \sum_{i=0}^{n_{l-1}-1} w_{l,j,i} \max\{0, a_{l-1,i}\}$$
written in continuous form
$$a_l(y) := \int_0^1 w_l(x, y) \max\{0, a_{l-1}(x)\}\, dx$$
relates to high-dimensional integro-approximation

recurrent neural network layer in continuous form alludes to an integral equation
$$a'_l(y) := \int_0^1 w_l(x, y) \max\{0, a_{l-1}(x)\}\, dx + \int_0^1 w^h_l(x, y) \max\{0, a_l(x)\}\, dx$$
– weights $w^h$ establish recurrence, e.g. for processing sequences of data
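A minimal sketch of estimating the continuous-form layer by uniform Monte Carlo sampling over $[0,1)$; the kernel and previous-layer activation below are hypothetical placeholders, not functions from the talk.

```python
import numpy as np

def mc_layer(a_prev, w, y, n_samples=1024, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of a_l(y) = int_0^1 w(x, y) max{0, a_{l-1}(x)} dx."""
    x = rng.random(n_samples)                     # uniform samples in [0, 1)
    return np.mean(w(x, y) * np.maximum(0.0, a_prev(x)))

a_prev = np.cos                                   # hypothetical previous-layer activation
w = lambda x, y: np.exp(-(x - y) ** 2)            # hypothetical smooth kernel
print(mc_layer(a_prev, w, y=0.3))
```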
Monte Carlo Methods and Neural Networks

Explore algorithms linear in time and space

structural equivalence of integral equations and reinforcement learning
learning integro-approximation from noisy/sampled data

examples of random sampling
– pseudo-random initialization
– training by stochastic gradient descent
– regularization by drop-out and drop-connect
– random binarization
– sampling by generative adversarial networks
– fixed pseudo-random matrices for direct feedback alignment

◮ Learning light transport the reinforced way
◮ Machine learning and integral equations
◮ Noise2Noise: Learning image restoration without clean data
Partition instead of Dropout

Guaranteeing coverage of neural units

drop out a neuron if its threshold $\frac{1}{P} > \xi$
– $\xi$ by linear feedback shift register generator (for example)

instead, assign each neuron to partition $p = \lfloor \xi \cdot P \rfloor$ out of $P$
– fewer random number generator calls

[Figure: fully connected network with units $a_{0,0}, \dots, a_{L,n_L-1}$; units of the selected partition marked]
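A minimal sketch of the two schemes; dropping one partition per training step is my assumption about how the partitions are used.

```python
import numpy as np

def dropout_mask(n_neurons, P, rng):
    """Classic dropout: drop a neuron whenever 1/P > xi, one fresh xi per
    neuron and per training step."""
    return rng.random(n_neurons) >= 1.0 / P            # True = keep

def partition_assignment(n_neurons, P, rng):
    """Partition instead: p = floor(xi * P) places every neuron in exactly one
    of P partitions with a single xi, guaranteeing coverage of all neurons
    across P steps while using fewer random number generator calls."""
    return (rng.random(n_neurons) * P).astype(int)

rng = np.random.default_rng(1)
parts = partition_assignment(8, P=3, rng=rng)
for step in range(3):                                  # each step drops one partition
    keep = parts != step
    print(step, keep)
```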
Partition instead of Dropout

[Figure: Training accuracy with LeNet on CIFAR-10 over 150 epochs; 3 dropout partitions vs. 1/3 dropout, and 2 dropout partitions vs. 1/2 dropout]
Simulating Discrete Densities

Stochastic evaluation of scalar product

select inputs proportional to the weight density
– remember to flip the sign accordingly
Simulating Discrete Densities

Stochastic evaluation of scalar product

partition of the unit interval by the sums $P_k := \sum_{j=1}^{k} |w_j|$ of normalized absolute weights

[Figure: unit interval partitioned at $P_0, P_1, P_2, \dots, P_{m-1}, P_m$ into segments of lengths $|w_1|, |w_2|, \dots, |w_m|$]

– using a uniform random variable $\xi \in [0,1)$ to select input $i \Leftrightarrow P_{i-1} \le \xi < P_i$, satisfying $\mathrm{Prob}(\{P_{i-1} \le \xi < P_i\}) = |w_i|$
– use jittered equidistant samples for $\xi$

in fact a derivation of quantization to ternary weights in $\{-1, 0, +1\}$
– integer weights result from neurons referenced more than once
– relation to drop connect and drop out
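A minimal sketch of the scheme (the explicit renormalization by $\sum_j |w_j|$ is my bookkeeping, not spelled out on the slide): inputs are selected proportional to $|w_i|$ using jittered equidistant samples, and each selected input contributes with weight $\mathrm{sign}(w_i) \in \{-1, +1\}$, so inputs referenced more than once accumulate integer weights.

```python
import numpy as np

def sampled_scalar_product(w, a, n_samples, rng=np.random.default_rng(0)):
    """Estimate sum_i w_i * a_i by sampling inputs proportional to |w_i|."""
    norm = np.sum(np.abs(w))
    cdf = np.cumsum(np.abs(w)) / norm                   # P_1, ..., P_m
    # jittered equidistant samples xi_k = (k + u_k) / n
    xi = (np.arange(n_samples) + rng.random(n_samples)) / n_samples
    # select input i such that P_{i-1} <= xi < P_i
    idx = np.minimum(np.searchsorted(cdf, xi, side='right'), len(w) - 1)
    return norm * np.mean(np.sign(w[idx]) * a[idx])     # ternary contributions

w = np.array([0.5, -1.5, 0.25, 2.0])
a = np.array([1.0, 2.0, 3.0, 4.0])
print(sampled_scalar_product(w, a, n_samples=64), w @ a)
```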
Simulating Discrete Densities

Results

97% of the accuracy of the model by sampling the most important 12% of weights

[Figure: Test accuracy vs. number of samples per neuron (50 to 600), rising from about 0.94 towards 1, for a two layer ReLU feedforward network on MNIST]
Neural Networks linear in Time and Space

Complexity

number of neural units
$$n = \sum_{l=1}^{L} n_l, \quad \text{where } n_l \text{ is the number of neurons in layer } l$$
number of weights
$$n_w = \sum_{l=1}^{L} n_{l-1} \cdot n_l$$
select a constant number of weights per neuron
– complexity linear in $n$ instead of proportional to $n_w$
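The counts for the example network used later in the talk (layer sizes from the 784/300/300/10 slide; the per-neuron budget $k$ is an assumed value):

```python
layer_sizes = [784, 300, 300, 10]                # n_0, ..., n_L
n = sum(layer_sizes[1:])                         # units: 300 + 300 + 10 = 610
n_w = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))  # dense weights: 328200
k = 32                                           # assumed constant weights per neuron
print(n, n_w, k * n)                             # sampled weights grow linearly in n
```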
Neural Networks linear in Time and Space

Results

[Figure: Test accuracy vs. fraction of fully connected layers sampled (0.1 to 1.0) for LeNet on MNIST, LeNet on CIFAR-10, AlexNet on CIFAR-10, and AlexNet top-5 and top-1 accuracy on ILSVRC12]
Neural Networks linear in Time and Space

Sampling paths through networks

complexity bounded by the number of paths times the depth $L$ of the network
– relation to random walks on directed graphs and Markov chains

application after training
– backwards random walks using sampling proportional to the weights of a neuron
– compression and quantization by importance sampling

application before training
– uniform (bidirectional) random walks to connect inputs and outputs
– sparse from scratch (see the sketch after the next figure)
Neural Networks linear in Time and Space

Sampling paths through networks

sparse from scratch

[Figure: animation sampling paths through the network, each connecting an input unit $a_{0,i}$ to an output unit $a_{L,j}$]

– guaranteed connectivity and coverage
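A minimal sketch of sparse-from-scratch connectivity via uniform random walks; starting walks at every input unit guarantees input coverage, while the bidirectional walks of the talk also guarantee coverage of the outputs.

```python
import numpy as np

def sparse_from_scratch(layer_sizes, paths_per_input=2, rng=np.random.default_rng(0)):
    """Sample uniform random walks through the layers; the visited edges form
    the sparse connectivity that is then trained from scratch."""
    edges = [set() for _ in layer_sizes[1:]]       # one edge set per weight layer
    for i in range(layer_sizes[0]):                # cover every input unit
        for _ in range(paths_per_input):
            j = i
            for l, n_next in enumerate(layer_sizes[1:]):
                k = int(rng.integers(n_next))      # uniform step to the next layer
                edges[l].add((j, k))
                j = k
    return edges

edges = sparse_from_scratch([784, 300, 300, 10])
print([len(e) for e in edges])                     # edges per layer, linear in #paths
```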
Neural Networks linear in Time and Space

[Figure: Test accuracy vs. number of per pixel paths through the network (50 to 300) for a 4 layer feedforward network (784/300/300/10) on MNIST and Fashion MNIST]
Monte Carlo Methods and Neural Networks

Relations to integro-approximation

dropout partitions
– using a fraction of the random numbers

simulating discrete densities explains $\{-1, 0, 1\}$ and integer weights
– compression and quantization without retraining

neural networks with linear complexity for both inference and training
– sampling paths through neural networks
– no more dangling neurons
– sparse from scratch

◮ also watch the GTC presentation by Noah Gamboa
dropout partitions – using a fraction of the random numbers simulating discrete densities explains {−1,0,1} and integer weights – compression and quantization without retraining neural networks with linear complexity for both inference and training – sampling paths through neural networks – no more dangling neurons – sparse from scratch ◮ also watch the GTC presentation by Noah Gamboa 24