Monte Carlo Methods and Neural Networks

Alexander Keller, partially joint work with Noah Gamboa
Artificial Neural Networks in a Nutshell

Supervised learning of high dimensional function approximation

input layer $a_0$, $L-1$ fully connected hidden layers, and output layer $a_L$

[Figure: fully connected network with units $a_{0,0}, \dots, a_{0,n_0-1}$ through $a_{L,0}, \dots, a_{L,n_L-1}$]

– $n_l$ rectified linear units (ReLU) $a_{l,i} = \max\{0, \sum_j w_{l,j,i} \cdot a_{l-1,j}\}$ in layer $l$
– backpropagating the error $\delta_{l-1,i} = \sum_{a_{l,j} > 0} \delta_{l,j} \cdot w_{l,j,i}$, update weights $w'_{l,j,i} = w_{l,j,i} - \lambda \cdot \delta_{l,j} \cdot a_{l-1,i}$ if $a_{l,j} > 0$
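A minimal NumPy sketch of these three steps (ReLU forward pass, gated error backpropagation, and the weight update); the layer sizes and learning rate $\lambda$ below are illustrative assumptions, not values from the talk.

```python
import numpy as np

def forward(weights, a0):
    """Forward pass: a_{l,i} = max{0, sum_j w_{l,j,i} * a_{l-1,j}}."""
    activations = [a0]
    for W in weights:                                    # W has shape (n_{l-1}, n_l)
        activations.append(np.maximum(0.0, activations[-1] @ W))
    return activations

def backward(weights, activations, delta_L, lam=0.01):
    """Backpropagate the error through active units (a_{l,j} > 0), then update weights."""
    delta = delta_L
    for l in range(len(weights) - 1, -1, -1):
        delta = delta * (activations[l + 1] > 0)         # gate: only a_{l,j} > 0 contribute
        delta_prev = weights[l] @ delta                  # delta_{l-1,i} = sum_j delta_{l,j} * w_{l,j,i}
        weights[l] -= lam * np.outer(activations[l], delta)  # w' = w - lam * delta * a
        delta = delta_prev
    return weights

# toy usage with assumed layer sizes
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 8)) * 0.1, rng.standard_normal((8, 3)) * 0.1]
acts = forward(weights, rng.standard_normal(4))
backward(weights, acts, delta_L=acts[-1] - np.array([1.0, 0.0, 0.0]))
```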
Artificial Neural Networks in a Nutshell
Convolutional neural networks: Similarity measures
convolutional layer: feature map defined by a convolution kernel
– identical weights across all neural units of one feature map

max pooling layer: maximum of a tile of neurons in a feature map for subsampling

◮ Gradient based learning applied to document recognition
◮ Quasi-Monte Carlo feature maps for shift-invariant kernels
Relations to Mathematical Objects

Maximum pooling layers

rectified linear unit $\mathrm{ReLU}(x) := \max\{0, x\}$ as a basic non-linearity

for example, leaky ReLU is $\mathrm{ReLU}(x) - \alpha \cdot \mathrm{ReLU}(-x)$, which for $\alpha = -1$ yields the absolute value $|x| = \mathrm{ReLU}(x) + \mathrm{ReLU}(-x)$

hence the maximum of two values is
$$\max\{x, y\} = \frac{x + y}{2} + \frac{|x - y|}{2} = \frac{1}{2} \cdot \big(x + y + \mathrm{ReLU}(x - y) + \mathrm{ReLU}(y - x)\big)$$
which allows one to represent maximum pooling by ReLU functions and introduces skip links
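The identity is easy to verify numerically; a minimal sketch:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def max_via_relu(x, y):
    """max{x, y} = (x + y + ReLU(x - y) + ReLU(y - x)) / 2"""
    return 0.5 * (x + y + relu(x - y) + relu(y - x))

x, y = np.random.standard_normal((2, 5))
assert np.allclose(max_via_relu(x, y), np.maximum(x, y))
```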
Relations to Mathematical Objects
Residual layers look like projections onto half spaces

halfspace $H^+$ with the unit weight vector $\hat{\omega}$ as normal and the bias $b$ as distance from the origin $O$
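One way to make the resemblance concrete (a sketch of my reading of the slide, not a formula from the talk): the projection onto $H^+ = \{x : \hat{\omega} \cdot x \ge b\}$ can itself be written in residual form with a ReLU.

```python
import numpy as np

def project_halfspace(x, omega_hat, b):
    """Project x onto H+ = {x : omega_hat . x >= b}: points inside are fixed,
    points outside move along the normal -- the residual form x + ReLU(...) * omega_hat."""
    return x + np.maximum(0.0, b - omega_hat @ x) * omega_hat

omega_hat = np.array([1.0, 0.0])                                    # unit normal
print(project_halfspace(np.array([-1.0, 2.0]), omega_hat, b=0.5))   # -> [0.5, 2.0]
```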
Relations to Mathematical Objects

Residual layers as differential equations

relation to a differential equation by introducing a step size $h$
$$a_l = a_{l-1} + h \cdot W_l^{(2)} \max\{0, W_l^{(1)} \cdot a_{l-1}\} \quad\Leftrightarrow\quad \frac{a_l - a_{l-1}}{h} = W_l^{(2)} \max\{0, W_l^{(1)} \cdot a_{l-1}\}$$
resembles the Euler method and for $h \to 0$ becomes the ordinary differential equation $\dot{a} = W^{(2)} \max\{0, W^{(1)} \cdot a\}$

– select your favorite ordinary differential equation to determine $W_l^{(1)}$ and $W_l^{(2)}$
◮ Neural networks motivated by partial differential equations
– use your favorite ordinary differential equation solver for both inference and training
◮ A radical new neural network design could overcome big challenges in AI
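A minimal sketch of a residual layer as one explicit Euler step; sharing $W^{(1)}, W^{(2)}$ across all steps is an assumption made here for brevity.

```python
import numpy as np

def residual_euler_step(a, W1, W2, h):
    """One residual layer = one explicit Euler step of da/dt = W2 max{0, W1 a}."""
    return a + h * (W2 @ np.maximum(0.0, W1 @ a))

def integrate(a0, W1, W2, h=0.1, steps=10):
    """Stacking residual layers integrates the ODE; a smaller step size h
    with more steps (a deeper network) refines the trajectory."""
    a = a0
    for _ in range(steps):
        a = residual_euler_step(a, W1, W2, h)
    return a

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((2, 4, 4)) * 0.1
print(integrate(rng.standard_normal(4), W1, W2))
```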
Relations to Mathematical Objects

Learning integral operator kernels

neural unit with ReLU
$$a_{l,j} := \max\Big\{0, \sum_{i=0}^{n_{l-1}-1} w_{l,j,i}\, a_{l-1,i}\Big\} \quad\rightarrow\quad a_{l,j} := \sum_{i=0}^{n_{l-1}-1} w_{l,j,i} \max\{0, a_{l-1,i}\}$$
written in continuous form
$$a_l(y) := \int_0^1 w_l(x, y) \max\{0, a_{l-1}(x)\}\, dx$$
relates to high-dimensional integro-approximation

recurrent neural network layer in continuous form alludes to an integral equation
$$a'_l(y) := \int_0^1 w_l(x, y) \max\{0, a_{l-1}(x)\}\, dx + \int_0^1 w^h_l(x, y) \max\{0, a_l(x)\}\, dx$$
– weights $w^h$ establish recurrence, e.g. for processing sequences of data
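A minimal sketch of estimating the continuous-form layer by uniform Monte Carlo sampling over $[0,1)$; the kernel and previous-layer activation below are hypothetical placeholders, not functions from the talk.

```python
import numpy as np

def mc_layer(a_prev, w, y, n_samples=1024, rng=np.random.default_rng(0)):
    """Monte Carlo estimate of a_l(y) = int_0^1 w(x, y) max{0, a_{l-1}(x)} dx."""
    x = rng.random(n_samples)                     # uniform samples in [0, 1)
    return np.mean(w(x, y) * np.maximum(0.0, a_prev(x)))

a_prev = np.cos                                   # hypothetical previous-layer activation
w = lambda x, y: np.exp(-(x - y) ** 2)            # hypothetical smooth kernel
print(mc_layer(a_prev, w, y=0.3))
```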
Monte Carlo Methods and Neural Networks

Explore algorithms linear in time and space

structural equivalence of integral equations and reinforcement learning
learning integro-approximation from noisy/sampled data

examples of random sampling
– pseudo-random initialization
– training by stochastic gradient descent
– regularization by drop-out and drop-connect
– random binarization
– sampling by generative adversarial networks
– fixed pseudo-random matrices for direct feedback alignment

◮ Learning light transport the reinforced way
◮ Machine learning and integral equations
◮ Noise2Noise: Learning image restoration without clean data
Partition instead of Dropout

Guaranteeing coverage of neural units

drop out a neuron if its threshold $\frac{1}{P} > \xi$
– $\xi$ by linear feedback shift register generator (for example)

instead, assign each neuron to partition $p = \lfloor \xi \cdot P \rfloor$ out of $P$
– fewer random number generator calls

[Figure: fully connected network with units $a_{0,0}, \dots, a_{L,n_L-1}$; units of the selected partition marked]
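A minimal sketch of the two schemes; dropping one partition per training step is my assumption about how the partitions are used.

```python
import numpy as np

def dropout_mask(n_neurons, P, rng):
    """Classic dropout: drop a neuron whenever 1/P > xi, one fresh xi per
    neuron and per training step."""
    return rng.random(n_neurons) >= 1.0 / P            # True = keep

def partition_assignment(n_neurons, P, rng):
    """Partition instead: p = floor(xi * P) places every neuron in exactly one
    of P partitions with a single xi, guaranteeing coverage of all neurons
    across P steps while using fewer random number generator calls."""
    return (rng.random(n_neurons) * P).astype(int)

rng = np.random.default_rng(1)
parts = partition_assignment(8, P=3, rng=rng)
for step in range(3):                                  # each step drops one partition
    keep = parts != step
    print(step, keep)
```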
Partition instead of Dropout

[Figure: Training accuracy with LeNet on CIFAR-10 over 150 epochs; 3 dropout partitions vs. 1/3 dropout, and 2 dropout partitions vs. 1/2 dropout]
Simulating Discrete Densities

Stochastic evaluation of scalar product

select inputs proportional to the weight density
– remember to flip the sign accordingly
Simulating Discrete Densities

Stochastic evaluation of scalar product

partition of the unit interval by the sums $P_k := \sum_{j=1}^{k} |w_j|$ of normalized absolute weights

[Figure: unit interval partitioned at $P_0, P_1, P_2, \dots, P_{m-1}, P_m$ into segments of lengths $|w_1|, |w_2|, \dots, |w_m|$]

– using a uniform random variable $\xi \in [0,1)$ to select input $i \Leftrightarrow P_{i-1} \le \xi < P_i$, satisfying $\mathrm{Prob}(\{P_{i-1} \le \xi < P_i\}) = |w_i|$
– use jittered equidistant samples for $\xi$

in fact a derivation of quantization to ternary weights in $\{-1, 0, +1\}$
– integer weights result from neurons referenced more than once
– relation to drop connect and drop out
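A minimal sketch of the scheme (the explicit renormalization by $\sum_j |w_j|$ is my bookkeeping, not spelled out on the slide): inputs are selected proportional to $|w_i|$ using jittered equidistant samples, and each selected input contributes with weight $\mathrm{sign}(w_i) \in \{-1, +1\}$, so inputs referenced more than once accumulate integer weights.

```python
import numpy as np

def sampled_scalar_product(w, a, n_samples, rng=np.random.default_rng(0)):
    """Estimate sum_i w_i * a_i by sampling inputs proportional to |w_i|."""
    norm = np.sum(np.abs(w))
    cdf = np.cumsum(np.abs(w)) / norm                   # P_1, ..., P_m
    # jittered equidistant samples xi_k = (k + u_k) / n
    xi = (np.arange(n_samples) + rng.random(n_samples)) / n_samples
    # select input i such that P_{i-1} <= xi < P_i
    idx = np.minimum(np.searchsorted(cdf, xi, side='right'), len(w) - 1)
    return norm * np.mean(np.sign(w[idx]) * a[idx])     # ternary contributions

w = np.array([0.5, -1.5, 0.25, 2.0])
a = np.array([1.0, 2.0, 3.0, 4.0])
print(sampled_scalar_product(w, a, n_samples=64), w @ a)
```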
Simulating Discrete Densities

Results

97% of the accuracy of the model by sampling the most important 12% of weights

[Figure: Test accuracy vs. number of samples per neuron (50 to 600), rising from about 0.94 towards 1, for a two layer ReLU feedforward network on MNIST]
Neural Networks linear in Time and Space

Complexity

number of neural units
$$n = \sum_{l=1}^{L} n_l, \quad \text{where } n_l \text{ is the number of neurons in layer } l$$
number of weights
$$n_w = \sum_{l=1}^{L} n_{l-1} \cdot n_l$$
select a constant number of weights per neuron
– complexity linear in $n$ instead of proportional to $n_w$
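The counts for the example network used later in the talk (layer sizes from the 784/300/300/10 slide; the per-neuron budget $k$ is an assumed value):

```python
layer_sizes = [784, 300, 300, 10]                # n_0, ..., n_L
n = sum(layer_sizes[1:])                         # units: 300 + 300 + 10 = 610
n_w = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))  # dense weights: 328200
k = 32                                           # assumed constant weights per neuron
print(n, n_w, k * n)                             # sampled weights grow linearly in n
```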
Neural Networks linear in Time and Space

Results

[Figure: Test accuracy vs. fraction of fully connected layers sampled (0.1 to 1.0) for LeNet on MNIST, LeNet on CIFAR-10, AlexNet on CIFAR-10, and AlexNet top-5 and top-1 accuracy on ILSVRC12]
Neural Networks linear in Time and Space

Sampling paths through networks

complexity bounded by the number of paths times the depth $L$ of the network
– relation to random walks on directed graphs and Markov chains

application after training
– backwards random walks using sampling proportional to the weights of a neuron
– compression and quantization by importance sampling

application before training
– uniform (bidirectional) random walks to connect inputs and outputs
– sparse from scratch (see the sketch after the next figure)
Neural Networks linear in Time and Space

Sampling paths through networks

sparse from scratch

[Figure: animation sampling paths through the network, each connecting an input unit $a_{0,i}$ to an output unit $a_{L,j}$]

– guaranteed connectivity and coverage
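A minimal sketch of sparse-from-scratch connectivity via uniform random walks; starting walks at every input unit guarantees input coverage, while the bidirectional walks of the talk also guarantee coverage of the outputs.

```python
import numpy as np

def sparse_from_scratch(layer_sizes, paths_per_input=2, rng=np.random.default_rng(0)):
    """Sample uniform random walks through the layers; the visited edges form
    the sparse connectivity that is then trained from scratch."""
    edges = [set() for _ in layer_sizes[1:]]       # one edge set per weight layer
    for i in range(layer_sizes[0]):                # cover every input unit
        for _ in range(paths_per_input):
            j = i
            for l, n_next in enumerate(layer_sizes[1:]):
                k = int(rng.integers(n_next))      # uniform step to the next layer
                edges[l].add((j, k))
                j = k
    return edges

edges = sparse_from_scratch([784, 300, 300, 10])
print([len(e) for e in edges])                     # edges per layer, linear in #paths
```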
Neural Networks linear in Time and Space

[Figure: Test accuracy vs. number of per pixel paths through the network (50 to 300) for a 4 layer feedforward network (784/300/300/10) on MNIST and Fashion MNIST]
Monte Carlo Methods and Neural Networks

Relations to integro-approximation

dropout partitions
– using a fraction of the random numbers

simulating discrete densities explains $\{-1, 0, 1\}$ and integer weights
– compression and quantization without retraining

neural networks with linear complexity for both inference and training
– sampling paths through neural networks
– no more dangling neurons
– sparse from scratch

◮ also watch the GTC presentation by Noah Gamboa
dropout partitions – using a fraction of the random numbers simulating discrete densities explains {−1,0,1} and integer weights – compression and quantization without retraining neural networks with linear complexity for both inference and training – sampling paths through neural networks – no more dangling neurons – sparse from scratch ◮ also watch the GTC presentation by Noah Gamboa 24