Algorithms in Nature
Neural Networks (NN)
Mimicking the brain
In the early days of AI there was a lot of interest in developing models that can mimic human thinking. While no one knew exactly how the brain works (and, even though there has been a lot of progress since, there is still a lot we do not know), some of the basic computational units were known: neurons.
A neuron performs computations by integrating signals from other neurons; the result of the computation may be transmitted to one or more downstream neurons. The brain solves hard computational problems with networks of such neurons, so copying their behaviour and functionality could provide solutions to problems related to interpretation and optimization.
[Neuron diagram: dendrites, soma (cell body), axon, synapses.]
Synapses are the edges in this network, responsible for transmitting information between the neurons.
Signals arriving along the axon trigger the release of neurotransmitter substances at the synapse. These bind to receptors on the dendrite of the post-synaptic neuron and can produce spikes in the post-synaptic neuron. How strongly the signal is passed on depends on the strength of the synaptic connection.
Input: real-valued variables. Output: one or more real values.
Example: predicting a stock price.
Assume that y and x are related by the following equation: y = wx + ε, where ε is a noise term.
[Scatter plot of the data: x vs. y.]
We can derive a closed-form solution for the parameters of a general linear regression problem. Define:
$$
\Phi = \begin{pmatrix}
\varphi_0(x_1) & \varphi_1(x_1) & \cdots & \varphi_m(x_1) \\
\varphi_0(x_2) & \varphi_1(x_2) & \cdots & \varphi_m(x_2) \\
\vdots & & & \vdots \\
\varphi_0(x_n) & \varphi_1(x_n) & \cdots & \varphi_m(x_n)
\end{pmatrix}
$$
For the model $y = \mathbf{w}^T \varphi(x) + \varepsilon$, minimizing the squared error and solving for w gives:
$$\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$$
To compute $\mathbf{w} = (\Phi^T \Phi)^{-1} \Phi^T \mathbf{y}$ we need to invert a k-by-k matrix (k − 1 is the number of features).
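As an illustration, here is a minimal NumPy sketch of this closed-form solution; the data and the choice of basis functions (a bias and the identity) are made up for the example.

```python
import numpy as np

# Toy data: n = 5 one-dimensional inputs with a noisy linear relationship (made-up numbers).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.9, 3.1, 5.2, 6.8, 9.1])

# Design matrix Phi with basis functions phi_0(x) = 1 (bias) and phi_1(x) = x.
Phi = np.column_stack([np.ones_like(x), x])

# Closed-form least-squares solution: w = (Phi^T Phi)^{-1} Phi^T y.
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(w)          # roughly [1.0, 2.0] for this toy data
print(Phi @ w)    # fitted values
```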
A perceptron computes a function of the form y = f(w^T x), that is, y = f(∑_i w_i x_i).
[Perceptron diagram: input layer x_1 … x_k plus a constant 1, weights w_0, w_1, …, w_k, and a single output unit.]
Rather than computing a closed-form solution, for perceptrons we will use a different strategy: gradient descent.
For a linear unit, y = ∑_i w_i x_i.
[Same perceptron diagram: inputs x_1 … x_k and a constant 1, weights w_0 … w_k.]
[Plot: the squared error z = (f(w) − y)^2 as a function of w.]
Slope = ∂z/∂w. Taking a small step Δw against the slope changes the error by roughly Δz = (∂z/∂w)Δw and leads to a smaller z.
This gives the gradient descent update rule:
$$w \leftarrow w - \lambda \frac{\partial z}{\partial w}$$
where λ is a small constant intended to prevent us from overshooting the optimal w. The update is repeated until no improvement is gained from more updates.
For a single input x, the derivative of the squared error with respect to w_i is:
$$\frac{\partial}{\partial w_i}\Big(y - \sum_k w_k x_k\Big)^2 = -2\Big(y - \sum_k w_k x_k\Big)\, x_i$$
Summing over all n training examples:
$$\frac{\partial}{\partial w_i}\sum_{j=1}^{n}\big(y_j - \mathbf{w}^T \mathbf{x}_j\big)^2 = \sum_{j=1}^{n} -2\big(y_j - \mathbf{w}^T \mathbf{x}_j\big)\, x_{j,i}$$
where x_{j,i} is the i'th value of the j'th input vector. Defining δ_j = y_j − w^T x_j, the gradient descent update becomes:
$$w_i \leftarrow w_i + 2\lambda \sum_{j=1}^{n} \delta_j\, x_{j,i}$$
The resulting gradient descent algorithm:
1. Choose λ
2. Start with a guess for w
3. Compute δ_j for all j
4. For all i set
   $$w_i \leftarrow w_i + 2\lambda \sum_{j=1}^{n} \delta_j\, x_{j,i}$$
5. If no improvement in $\sum_{j=1}^{n}\big(y_j - \mathbf{w}^T\mathbf{x}_j\big)^2$, stop; otherwise go to step 3.
We can also use a perceptron to learn the parameters for a classification problem (labels 1 and −1).
[Plot: data from the two classes and the optimal regression model separating them.]
w^T x ≥ 0 ⇒ classify as 1
w^T x < 0 ⇒ classify as −1
For classification models we replace the linear function with the sigmoid function (for binary classification problems):
$$g(h) = \frac{1}{1+e^{-h}}$$
This lets us model p(y | x; θ), which is always between 0 and 1:
$$p(y = 1 \mid x;\theta) = g(\mathbf{w}^T x) = \frac{e^{\mathbf{w}^T x}}{1+e^{\mathbf{w}^T x}}, \qquad
p(y = 0 \mid x;\theta) = 1 - g(\mathbf{w}^T x) = \frac{1}{1+e^{\mathbf{w}^T x}}$$
We can use the sigmoid function as part of the perceptron when using it for classification.
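A small sketch of how these probabilities would be computed for one input; the weights and input vector are made-up values:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))      # g(h) = 1 / (1 + e^{-h})

w = np.array([0.5, -1.2, 2.0])           # made-up weights (first entry acts as the bias)
x = np.array([1.0, 0.3, 0.8])            # made-up input with a leading 1 for the bias

p1 = sigmoid(w @ x)                       # p(y = 1 | x) = g(w^T x)
p0 = 1.0 - p1                             # p(y = 0 | x) = 1 / (1 + e^{w^T x})
print(p1, p0, "classify as", 1 if w @ x >= 0 else -1)
```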
For the sigmoid unit we minimize
$$\min_{\mathbf{w}} \sum_{j} \big(y_j - g(\mathbf{w}^T \mathbf{x}_j)\big)^2, \qquad g(x) = \frac{1}{1+e^{-x}}$$
Taking the derivative w.r.t. w_i we get:
$$\frac{\partial}{\partial w_i}\sum_{j}\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)^2
= \sum_{j} -2\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)\, g(\mathbf{w}^T\mathbf{x}_j)\big(1 - g(\mathbf{w}^T\mathbf{x}_j)\big)\, x_{j,i}$$
using the fact that g'(x) = g(x)(1 − g(x)).
This convenient derivative is one of the reasons for using the sigmoid function in a NN.
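For reference, the identity used above follows directly from the definition of g:
$$
g(x) = \frac{1}{1+e^{-x}}, \qquad
g'(x) = \frac{e^{-x}}{(1+e^{-x})^2}
      = \frac{1}{1+e^{-x}}\cdot\frac{e^{-x}}{1+e^{-x}}
      = g(x)\,\big(1-g(x)\big).
$$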
With the shorthand δ_j = y_j − g(w^T x_j) and g_j = g(w^T x_j), the derivative can be written compactly as:
$$\frac{\partial}{\partial w_i}\sum_{j}\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)^2 = \sum_{j} -2\,\delta_j\, g_j (1 - g_j)\, x_{j,i}$$
The same algorithm with the sigmoid update:
1. Choose λ
2. Start with a guess for w
3. Compute δ_j for all j
4. For all i set
   $$w_i \leftarrow w_i + 2\lambda \sum_{j=1}^{n} \delta_j\, g_j (1 - g_j)\, x_{j,i}$$
5. If no improvement in $\sum_{j=1}^{n}\big(y_j - g(\mathbf{w}^T\mathbf{x}_j)\big)^2$, stop; otherwise go to step 3.
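A minimal sketch of this procedure for a sigmoid unit, again with made-up data and an arbitrary λ:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_sigmoid_unit(X, y, lam=0.5, max_iter=5000, tol=1e-10):
    """Gradient descent on sum_j (y_j - g(w^T x_j))^2 using the update above."""
    n, k = X.shape
    w = np.zeros(k)                                    # step 2: initial guess
    prev_err = np.inf
    for _ in range(max_iter):
        g = sigmoid(X @ w)                             # g_j = g(w^T x_j)
        delta = y - g                                  # step 3: delta_j
        w = w + 2 * lam * X.T @ (delta * g * (1 - g))  # step 4: w_i <- w_i + 2*lam*sum_j delta_j g_j (1-g_j) x_{j,i}
        err = np.sum((y - sigmoid(X @ w)) ** 2)        # step 5
        if prev_err - err < tol:
            break
        prev_err = err
    return w

# Tiny made-up binary problem: label is 1 when x1 + x2 > 1 (bias column first).
X = np.array([[1, 0.2, 0.1], [1, 0.9, 0.8], [1, 0.4, 0.3], [1, 0.7, 0.9]])
y = np.array([0.0, 1.0, 0.0, 1.0])
print(np.round(sigmoid(X @ train_sigmoid_unit(X, y)), 2))
```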
We can add hidden layers, increasing the set of functions that can be represented using a NN.
[Diagram: input layer (x_1, x_2, and a constant 1), a hidden layer computing v_1 = g(w_1^T x) and v_2 = g(w_2^T x) with weights w_{0,1}, w_{1,1}, w_{2,1}, w_{0,2}, w_{1,2}, w_{2,2}, and an output layer computing y = g(w^T v) with weights w_1, w_2.]
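A short sketch of the forward computation in such a two-layer network; the weights and input are made up, and the bias is handled by a leading 1 in the input vector:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

# Made-up weights: one row per hidden unit, columns = (bias, x1, x2).
W_hidden = np.array([[0.1, 0.8, -0.5],    # weights into v1
                     [-0.3, 0.4, 0.9]])   # weights into v2
w_out = np.array([1.5, -1.0])             # w1, w2: weights from v1, v2 into the output

x = np.array([1.0, 0.6, 0.2])             # input with a leading 1 for the bias
v = sigmoid(W_hidden @ x)                 # hidden layer: v_i = g(w_i^T x)
y = sigmoid(w_out @ v)                    # output: y = g(w^T v)
print(v, y)
```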
The output is now a nonlinear function of the inputs. To train the network we need to learn both the output weights and the hidden layer weights.
Using the same gradient descent idea we can compute the update rule for the output weights w_1 and w_2:
$$w_i \leftarrow w_i + 2\lambda \sum_{j=1}^{n} \delta_j\, g_j (1 - g_j)\, v_{i,j}$$
where
$$\delta_j = y_j - g(\mathbf{w}^T \mathbf{v}_j), \qquad g_j = g(\mathbf{w}^T \mathbf{v}_j)$$
But what is the error associated with each of the hidden layer states?
If we knew the error Δ_{i,j} associated with each hidden state, we could use the same gradient descent to update the hidden layer weights:
$$w_{i,k} \leftarrow w_{i,k} + 2\lambda \sum_{j=1}^{n} \Delta_{i,j}\, g_{i,j} (1 - g_{i,j})\, x_{j,k}$$
where g_{i,j} = v_{i,j} = g(w_i^T x_j) is the value of hidden unit i for input j, and w_{i,k} is the weight from input k to hidden unit i.
The correct error term for each hidden state can be determined by taking the partial derivative, for each hidden weight, of the global error function*:
$$\Delta_{i,j} = \delta_j\, g(\mathbf{w}^T\mathbf{v}_j)\big(1 - g(\mathbf{w}^T\mathbf{v}_j)\big)\, w_i$$
*See RN book for details (pages 746-747)
The complete backpropagation algorithm:
1. Choose λ
2. Start with a guess for all the weights (output weights w_i and hidden weights w_{i,k})
3. Compute the values v_{i,j} for all hidden layer states i and inputs j
4. Compute δ_j for all j:
   $$\delta_j = y_j - g(\mathbf{w}^T \mathbf{v}_j)$$
5. Compute Δ_{i,j} for all i and j
6. For all i set
   $$w_i \leftarrow w_i + 2\lambda \sum_{j=1}^{n} \delta_j\, g_j (1 - g_j)\, v_{i,j}$$
   $$w_{i,k} \leftarrow w_{i,k} + 2\lambda \sum_{j=1}^{n} \Delta_{i,j}\, g_{i,j} (1 - g_{i,j})\, x_{j,k}$$
7. If no improvement in $\sum_{j=1}^{n} \delta_j^2 + \sum_{i=1}^{s} \Delta_{i,j}^2$, stop; otherwise go to step 3.
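A compact sketch of the full algorithm, trained here on the classic XOR problem (which needs a hidden layer). The learning rate, initialization, and iteration count are arbitrary, and the sketch adds an output-layer bias and a third hidden unit that the diagram above does not show, to make the toy problem easier to fit:

```python
import numpy as np

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def train_two_layer(X, y, n_hidden=3, lam=0.5, n_iter=20000, seed=0):
    """Full-batch backprop on sum_j (y_j - g(w^T v_j))^2 for one hidden layer."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.5, size=(n_hidden, X.shape[1]))    # hidden weights w_{i,k}
    w = rng.normal(scale=0.5, size=n_hidden + 1)              # output weights (with a bias)
    for _ in range(n_iter):
        V = np.column_stack([np.ones(len(X)), sigmoid(X @ W.T)])  # step 3: v_{i,j}, plus a bias unit
        g_out = sigmoid(V @ w)                                    # g_j = g(w^T v_j)
        delta = y - g_out                                         # step 4: delta_j
        Delta = np.outer(delta * g_out * (1 - g_out), w[1:])      # step 5: Delta_{i,j} = delta_j g_j (1 - g_j) w_i
        w = w + 2 * lam * V.T @ (delta * g_out * (1 - g_out))     # step 6: output-weight update
        H = V[:, 1:]                                              # hidden activations without the bias unit
        W = W + 2 * lam * (Delta * H * (1 - H)).T @ X             # step 6: hidden-weight update
    return W, w

# XOR, the classic example that needs a hidden layer (bias column first).
X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
y = np.array([0.0, 1.0, 1.0, 0.0])
W, w = train_two_layer(X, y)
V = np.column_stack([np.ones(len(X)), sigmoid(X @ W.T)])
print(np.round(sigmoid(V @ w), 2))   # should approach [0, 1, 1, 0]
```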
Figure 1: Feedforward ANN designed and tested for prediction of tactical air combat maneuvers.
Perceptrons (~1960) used a layer of hand-coded features and tried to recognize objects by learning how to weight these features. – There was a neat learning algorithm for adjusting the weights. – But perceptrons are fundamentally limited in what they can learn to do.
Sketch of a typical perceptron from the 1960's: output units (e.g. class labels), a layer of non-adaptive hand-coded features, and input units (e.g. pixels).
[Diagram: backpropagation in a multi-layer network – feed the input vector through the hidden layers, compare the outputs with the correct answer to get an error signal, and back-propagate the error signal to get derivatives for learning.]
Keep the efficiency and simplicity of using a gradient method for adjusting the weights, but use it for modeling the structure of the sensory input. – Iteratively learn the different layers. – Adjust the weights to maximize the probability that a generative model would have produced the sensory input. – Learn p(image) not p(label | image) for the lower layers.
[Diagram: input layer → hidden layer → reconstruction; a second hidden layer can then be stacked on top.]
The final 50 x 256 weights
Each neuron grabs a different feature.
[Figure: data digits alongside their reconstructions from the activated binary features.]
How well can we reconstruct the digit images from the binary feature activations?
[Figure: reconstructions of new test images from the digit class that the model was trained on vs. images from an unfamiliar digit class (the network tries to see every image as a 2).]
(the main reason RBM’s are interesting)
from the pixels.
they were pixels and learn features of features in a second hidden layer.
It can be proved that each time we add another layer of features we improve a variational lower bound on the log probability of the training data. – The proof is slightly complicated. – But it is based on a neat equivalence between an RBM and a deep directed model (described later).
Samples generated by letting the associative memory run with one label clamped. There are 1000 iterations of alternating Gibbs sampling between samples.
In an RBM the binary units turn on with probability g(x) = 1/(1 + e^{−x}), where x is their total input.
Energy of a joint configuration:
$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i,j} v_i\, h_j\, w_{ij}$$
where E(v, h) is the energy with configuration v on the visible units and h on the hidden units, v_i is the binary state of visible unit i, h_j is the binary state of hidden unit j, and w_{ij} is the weight between units i and j.
The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energy of all other joint configurations. The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it.
$$p(\mathbf{v}, \mathbf{h}) = \frac{e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}, \qquad
p(\mathbf{v}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}$$
The denominator $\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}$ is the partition function.
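Because the RBMs on these slides are tiny, the partition function can be computed exactly by brute force; a sketch with made-up weights for two visible and two hidden units (biases omitted):

```python
import itertools
import numpy as np

# Made-up weights w_ij between 2 visible and 2 hidden units (biases omitted for simplicity).
W = np.array([[1.0, -0.5],
              [0.3,  2.0]])

def energy(v, h):
    return -np.dot(v, W @ h)             # E(v, h) = -sum_ij v_i h_j w_ij

configs = list(itertools.product([0, 1], repeat=2))
Z = sum(np.exp(-energy(np.array(v), np.array(h)))   # partition function:
        for v in configs for h in configs)           # sum over all joint configurations

v = np.array([1, 1])
p_v = sum(np.exp(-energy(v, np.array(h))) for h in configs) / Z   # p(v) = sum_h e^{-E(v,h)} / Z
print(p_v)
```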
Maximum likelihood learning rule for an RBM:
$$\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{\infty}$$
[Diagram: alternating Gibbs sampling at t = 0, 1, 2, …, ∞.]
Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel. The configuration reached at t = ∞ is a "fantasy" drawn from the model's equilibrium distribution.
Contrastive divergence: a quick way to learn an RBM:
$$\Delta w_{ij} = \lambda\big(\langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{1}\big)$$
[Diagram: a single step of alternating Gibbs sampling, t = 0 → t = 1.]
Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction". Update the hidden units again.
This is not following the gradient of the log likelihood. But it works well. It is approximately following the gradient of another objective function.
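A minimal sketch of this contrastive divergence (CD-1) update for a small binary RBM; the sizes, data, and learning rate are arbitrary, and biases are again omitted:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(h):
    return 1.0 / (1.0 + np.exp(-h))

def cd1_update(W, v0, eps=0.1):
    """One CD-1 step: delta w_ij = eps * (<v_i h_j>^0 - <v_i h_j>^1)."""
    # t = 0: sample the hidden units given the training vectors on the visible units
    p_h0 = sigmoid(v0 @ W)                      # p(h_j = 1 | v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # reconstruction: sample the visible units given h0, then recompute the hidden probabilities
    p_v1 = sigmoid(h0 @ W.T)                    # p(v_i = 1 | h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W)                      # t = 1 hidden probabilities
    # learning rule: <v_i h_j>^0 - <v_i h_j>^1, averaged over the batch
    pos = v0.T @ p_h0 / len(v0)
    neg = v1.T @ p_h1 / len(v0)
    return W + eps * (pos - neg)

# Made-up binary "data": a batch of 4 training vectors over 6 visible units, 3 hidden units.
data = rng.integers(0, 2, size=(4, 6)).astype(float)
W = rng.normal(scale=0.1, size=(6, 3))
for _ in range(100):
    W = cd1_update(W, data)
print(np.round(W, 2))
```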