CISC 4631 Data Mining Lecture 11: Neural Networks
Biological Motivation
- Can we simulate the human learning process? Two schools:
  – modeling the biological learning process
  – obtaining highly effective algorithms, independent of whether these algorithms mirror biological processes (this course)
- Biological learning system (brain)
  – a complex network of neurons
- ANNs are loosely motivated by biological neural systems. However, many features of ANNs are inconsistent with biological systems.
2
Neural Speed Constraints
- Neurons have a “switching time” on the order of a few milliseconds,
compared to nanoseconds for current computing hardware.
- However, neural systems can perform complex cognitive tasks
(vision, speech understanding) in tenths of a second.
- There is only time for about 100 serial processing steps in this time frame, compared to orders of magnitude more for current computers.
- Must be exploiting “massive parallelism.”
- The human brain has about 10^11 neurons with an average of 10^4 connections each.
3
Artificial Neural Networks (ANN)
- ANN
  – a network of simple units
  – real-valued inputs & outputs
- Many neuron-like threshold switching units
- Many weighted interconnections among units
- Highly parallel, distributed processing
- Emphasis on tuning weights automatically
4
Neural Network Learning
- Learning approach based on modeling adaptation in biological
neural systems.
- Perceptron: Initial algorithm for learning simple neural networks
(single layer) developed in the 1950’s.
- Backpropagation: More complex algorithm for learning multi-layer
neural networks developed in the 1980’s.
5
Real Neurons
6
How Does our Brain Work?
- A neuron is connected to other neurons via its input and output links
- Each incoming neuron has an activation value and each
connection has a weight associated with it
- The neuron sums the incoming weighted values and this
value is input to an activation function
- The output of the activation function is the output from
the neuron
7
Neural Communication
- Electrical potential across
cell membrane exhibits spikes called action potentials.
- Spike originates in cell
body, travels down axon, and causes synaptic terminals to release neurotransmitters.
- The chemical diffuses across the synapse to the dendrites of other neurons.
- Neurotransmitters can be excitatory or inhibitory.
- If the net input of neurotransmitters to a neuron from other neurons is excitatory and exceeds some threshold, it fires an action potential.
8
Real Neural Learning
- To model the brain we need to model a neuron
- Each neuron performs a simple computation
– It receives signals from its input links and uses these values to compute the activation level (or output) for the neuron.
  – This value is passed to other neurons via its output links.
9
Prototypical ANN
- Units interconnected in layers
– directed, acyclic graph (DAG)
- Network structure is fixed
– learning = weight adjustment
  – backpropagation algorithm
10
Appropriate Problems
- Instances: vectors of attributes
– discrete or real values
- Target function
– discrete, real, or vector
  – ANNs can handle classification & regression
- Noisy data
- Long training times acceptable
- Fast evaluation
- No need to be readable
– It is almost impossible to interpret neural networks except for the simplest target functions
11
Perceptrons
- The perceptron is a type of artificial neural network which can be seen as the simplest kind of feedforward neural network: a linear classifier
- Introduced in the late 50s
- Perceptron convergence theorem (Rosenblatt, 1962):
  – The perceptron will learn to classify any linearly separable set of inputs.
  – (The XOR function, in contrast, has no linear separation.)
- The perceptron is a network:
  – single-layer
  – feed-forward: data only travels in one direction
12
ALVINN drives 70 mph on highways
13
See the ALVINN video.
Artificial Neuron Model
- Model network as a graph with cells as nodes and synaptic
connections as weighted edges from node i to node j, wji
- Model the net input to cell j as: $net_j = \sum_i w_{ji}\, o_i$
- Cell output is a threshold of the net input ($T_j$ is the threshold for unit j):
  $o_j = 1$ if $net_j \ge T_j$, and $o_j = 0$ otherwise
(Figure: unit 1 receives weighted inputs $w_{12}, w_{13}, w_{14}, w_{15}, w_{16}$ from units 2 through 6.)
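For concreteness, a minimal Python sketch of one such threshold unit; the inputs, weights, and threshold below are made-up illustrative values, not taken from the slides.

```python
def threshold_unit(inputs, weights, threshold):
    """Compute net_j = sum_i w_ji * o_i and output 1 if it reaches the threshold T_j, else 0."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= threshold else 0

# Example with made-up values: net = 0.5*1 + 0.3*0 + 0.9*1 = 1.4 >= 1.0, so the unit outputs 1.
print(threshold_unit(inputs=[1, 0, 1], weights=[0.5, 0.3, 0.9], threshold=1.0))
```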
14
Perceptron: Artificial Neuron Model
- Vector notation: model the network as a graph with cells as nodes and synaptic connections as weighted edges $w_{ji}$ from node i to node j
- The input value received by a neuron is calculated by summing the weighted input values from its input links: $\sum_{i=1}^{n} w_i x_i$
- This sum is then passed through a threshold function
15
Different Threshold Functions
- For example, a sign threshold at zero:
  $o(\vec{x}) = \begin{cases} +1 & \text{if } \vec{w} \cdot \vec{x} > 0 \\ -1 & \text{otherwise} \end{cases}$
- Or a threshold at some value $t$:
  $o(\vec{x}) = \begin{cases} +1 & \text{if } \vec{w} \cdot \vec{x} > t \\ -1 & \text{otherwise} \end{cases}$
We should learn the weights w1, …, wn
16
Examples
- A perceptron with a step activation function outputs 1 when $\sum_{i=1}^{n} w_i x_i > t$ (the threshold $t$ can be folded in as a bias weight $w_0 = -t$)
- Example boolean functions it can compute:
  – AND: (In1, In2, Out) = (0,0,0), (0,1,0), (1,0,0), (1,1,1)
  – OR: (In1, In2, Out) = (0,0,0), (0,1,1), (1,0,1), (1,1,1)
  – NOT: (In, Out) = (0,1), (1,0)
17
Neural Computation
- McCulloch and Pitts (1943) showed how such model neurons could compute logical functions and be used to construct finite-state machines
- Can be used to simulate logic gates (a sketch follows below):
  – AND: let all w_ji be T_j/n, where n is the number of inputs
  – OR: let all w_ji be T_j
  – NOT: let the threshold be 0, with a single input that has a negative weight
- Can build arbitrary logic circuits, sequential machines, and
computers with such gates
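A small sketch of these constructions, with the threshold test taken as net ≥ T so the boundary cases fire; the specific values of T are illustrative.

```python
def fires(inputs, weights, T):
    """McCulloch-Pitts style unit: output 1 if the weighted sum reaches the threshold T."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= T else 0

T = 1.0
n = 2
AND = lambda a, b: fires([a, b], [T / n, T / n], T)   # all weights T/n: fires only when every input is 1
OR  = lambda a, b: fires([a, b], [T, T], T)           # all weights T: fires when any input is 1
NOT = lambda a:    fires([a], [-1.0], 0.0)            # threshold 0, single negative weight

assert [AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 0, 0, 1]
assert [OR(a, b)  for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 1]
assert [NOT(a) for a in (0, 1)] == [1, 0]
```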
18
Perceptron Training
- Assume supervised training examples giving the desired output for a unit given a set of known input activations.
- Goal: learn the weight vector (synaptic weights) that
causes the perceptron to produce the correct +/- 1 values
- The perceptron uses an iterative update algorithm to learn a correct set of weights
– Perceptron training rule
  – Delta rule
- Both algorithms are guaranteed to converge to somewhat
different acceptable hypotheses, under somewhat different conditions
19
Perceptron Training Rule
- Update weights by: $w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \eta\,(t - o)\,x_i$
- $\eta$ is the learning rate
  – a small value (e.g., 0.1)
  – sometimes made to decay as the number of weight-tuning operations increases
- $t$: target output for the current training example
- $o$: linear unit output for the current training example
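A minimal sketch of one application of this update rule; the learning rate, weights, and inputs are illustrative.

```python
def perceptron_update(weights, x, target, output, eta=0.1):
    """Apply the perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    return [w + eta * (target - output) * xi for w, xi in zip(weights, x)]

# Example: target was +1 but the unit output -1, so weights on active inputs are increased.
print(perceptron_update(weights=[0.2, -0.4], x=[1.0, 0.5], target=1, output=-1))  # [0.4, -0.3]
```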
20
Perceptron Training Rule
- Equivalent to rules:
– If output is correct, do nothing.
  – If output is high, lower weights on active inputs.
  – If output is low, increase weights on active inputs.
- Can prove it will converge
– if the training data is linearly separable
  – and η is sufficiently small
21
Perceptron Learning Algorithm
- Iteratively update weights until convergence.
- Each execution of the outer loop is typically called an epoch.

  Initialize weights to random values
  Until outputs of all training examples are correct:
    For each training pair, E, do:
      Compute current output o_j for E given its inputs
      Compare current output to target value, t_j, for E
      Update synaptic weights and threshold using the learning rule
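A runnable sketch of this loop, using the training rule from the previous slide; the toy OR dataset, learning rate, epoch cap, and the use of a bias weight in place of an explicit threshold are illustrative choices, not taken from the slides.

```python
import random

def train_perceptron(examples, n_inputs, eta=0.1, max_epochs=100):
    """Perceptron learning: iterate epochs until every example is classified correctly."""
    w = [random.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]  # w[0] is the bias (threshold) weight
    for epoch in range(max_epochs):
        all_correct = True
        for x, t in examples:                      # t is the +1/-1 target
            net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            o = 1 if net > 0 else -1
            if o != t:
                all_correct = False
                w[0] += eta * (t - o)              # bias update (its "input" is always 1)
                for i, xi in enumerate(x):
                    w[i + 1] += eta * (t - o) * xi
        if all_correct:
            break
    return w

# Toy linearly separable data: the OR function with +1/-1 targets.
data = [([0, 0], -1), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
print(train_perceptron(data, n_inputs=2))
```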
22
Delta Rule
- Works reasonably with data that is not linearly separable
- Minimizes error
- Gradient descent method
– basis of the backpropagation method
  – basis for methods working in multidimensional continuous spaces
  – discussion of this rule is beyond the scope of this course
23
Perceptron as a Linear Separator
- Since the perceptron uses a linear threshold function, it searches for a linear separator that discriminates the classes
- For a unit with two inputs $o_2$ and $o_3$, the decision boundary is the line $w_{12}\, o_2 + w_{13}\, o_3 = T_1$, i.e.
  $o_3 = -\frac{w_{12}}{w_{13}}\, o_2 + \frac{T_1}{w_{13}}$
- Or a hyperplane in n-dimensional space
24
Concept Perceptron Cannot Learn
- Cannot learn exclusive-or (XOR), or parity function in
general
(Figure: the four XOR examples plotted in a plane; the + and – points cannot be separated by a single straight line.)
25
General Structure of an ANN
26
(Figure: neuron i takes inputs I1, I2, I3 through weights wi1, wi2, wi3, forms the sum Si, applies the activation function g(Si) with threshold t, and produces output Oi. A full network is shown with an input layer x1, ..., x5, a hidden layer, and an output layer producing y.)
- Perceptrons have no hidden layers; multilayer perceptrons may have many
Learning Power of an ANN
- The perceptron is guaranteed to converge if the data is linearly separable
  – It will learn a hyperplane that separates the classes
  – The XOR function on page 250 of TSK is not linearly separable
- A multilayer ANN has no such guarantee of convergence, but it can learn functions that are not linearly separable
- An ANN with a hidden layer can learn the XOR function by constructing two hyperplanes (see page 253)
27
Multilayer Network Example
The decision surface is highly nonlinear
28
Sigmoid Threshold Unit
- A sigmoid unit is one whose output is a nonlinear, but differentiable, function of its inputs
- We can derive gradient descent rules to train:
  – a single sigmoid unit
  – multilayer networks of sigmoid units (backpropagation)
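For reference, a small sketch of the standard logistic sigmoid and the derivative that gradient descent relies on; the code itself is not from the slides.

```python
import math

def sigmoid(net):
    """Logistic activation: squashes the net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_derivative(net):
    """d(sigmoid)/d(net) = sigmoid(net) * (1 - sigmoid(net)); used in gradient-descent training."""
    s = sigmoid(net)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))  # 0.5 0.25
```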
29
Multi-Layer Networks
- Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult to find
- A typical multi-layer network consists of an input, hidden
and output layer, each fully connected to the next, with activation feeding forward.
- The weights determine the function computed. Given an
arbitrary number of hidden units, any boolean function can be computed with a single hidden layer
30
Logistic function
- Output = $\frac{1}{1 + e^{-\sum_{i=1}^{n} w_i x_i}}$
(Figure: independent variables Age = 34, Gender = 1, Stage = 4 with coefficients .5, .8, .4 feed a single sigmoid unit; the prediction, 0.6, is the "probability of being alive".)
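A small sketch of this computation. Note the slide's figure is illustrative: plugging the raw values 34, 1, and 4 into the coefficients .5, .8, .4 would not literally give 0.6, so the example below assumes the inputs have already been scaled.

```python
import math

def logistic_output(xs, ws):
    """Logistic (sigmoid) output: 1 / (1 + exp(-sum_i w_i * x_i))."""
    net = sum(w * x for w, x in zip(ws, xs))
    return 1.0 / (1.0 + math.exp(-net))

# Hypothetical pre-scaled inputs with the slide's coefficients .5, .8, .4; prints 0.6.
print(round(logistic_output(xs=[0.1, 0.2, 0.5], ws=[0.5, 0.8, 0.4]), 2))
```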
31
Neural Network Model
(Figure: independent variables Age = 34, Gender = 2, Stage = 4 feed two hidden sigmoid units through weights .6, .5, .8, .2, .1, .3, .7, .2; the hidden units connect to the output sigmoid unit with weights .4 and .2, producing the prediction 0.6, the "probability of being alive".)
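A sketch of a forward pass through a network of this shape (one hidden layer of two sigmoid units feeding a sigmoid output unit). The weight values loosely follow the figure and the inputs are assumed to be pre-scaled; none of the specific numbers should be read as the slide's exact computation.

```python
import math

def sigmoid(net):
    return 1.0 / (1.0 + math.exp(-net))

def forward(xs, hidden_weights, output_weights):
    """Feed inputs through one hidden layer of sigmoid units, then a sigmoid output unit."""
    hidden = [sigmoid(sum(w * x for w, x in zip(ws, xs))) for ws in hidden_weights]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

# Illustrative values loosely following the figure (inputs assumed pre-scaled).
xs = [0.34, 0.2, 0.4]                                 # Age, Gender, Stage after hypothetical scaling
hidden_weights = [[0.6, 0.1, 0.7], [0.5, 0.8, 0.2]]   # weights into the two hidden units
output_weights = [0.4, 0.2]                           # hidden-to-output weights
print(round(forward(xs, hidden_weights, output_weights), 2))  # "probability of being alive"
```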
32
Getting an answer from a NN
(Figure: the same network, with the computation traced through one of the hidden sigmoid units; weights shown: .6, .5, .8 and .1, .7; inputs Age = 34, Gender = 2, Stage = 4; prediction 0.6, the "probability of being alive".)
33
Getting an answer from a NN
(Figure: the computation traced through the other hidden sigmoid unit; weights shown: .5, .8, .2, .3, .2; prediction 0.6, the "probability of being alive".)
34
Getting an answer from a NN
(Figure: the full network again, combining both hidden-unit outputs through the output sigmoid unit to produce the final prediction 0.6, the "probability of being alive".)
35
Comments on Training Algorithm
- Not guaranteed to converge to zero training error, may
converge to local optima or oscillate indefinitely.
- However, in practice, does converge to low error for many
large networks on real data.
- Many epochs (thousands) may be required; hours or days of training for large networks.
- To avoid local-minima problems, run several trials starting with different random weights (random restarts).
  – Take the results of the trial with the lowest training-set error.
  – Build a committee of results from multiple trials (possibly weighting votes by training-set accuracy).
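A minimal sketch of random restarts; train_network and training_error are hypothetical stand-ins for whatever trainer and error measure are in use.

```python
import random

def random_restarts(train_network, training_error, data, n_trials=5):
    """Train several networks from different random initializations and keep the best one."""
    best_net, best_err = None, float("inf")
    for trial in range(n_trials):
        random.seed(trial)                 # different random initial weights per trial
        net = train_network(data)          # hypothetical training routine
        err = training_error(net, data)    # hypothetical training-set error measure
        if err < best_err:
            best_net, best_err = net, err
    return best_net
```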
36
Hidden Unit Representations
- Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space; this is a key feature of ANNs
- On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors, etc.
- However, the hidden layer can also become a distributed
representation of the input in which each individual unit is not easily interpretable as a meaningful feature.
37
Expressive Power of ANNs
- Boolean functions: Any boolean function can be
represented by a two-layer network with sufficient hidden units.
- Continuous functions: Any bounded continuous
function can be approximated with arbitrarily small error by a two-layer network.
- Arbitrary function: Any function can be approximated to
arbitrary accuracy by a three-layer network.
38
Sample Learned XOR Network
(Figure: a learned network with inputs X and Y, hidden units A and B, and output O; learned weight magnitudes include 3.11, 7.38, 6.96, 5.24, 3.6, 3.58, 5.57, 5.74, 2.03.)
- Hidden unit A represents: ¬(X ∨ Y)
- Hidden unit B represents: ¬(X ∧ Y)
- Output O represents: ¬A ∧ B = (X ∨ Y) ∧ ¬(X ∧ Y) = X ⊕ Y

IN 1 | IN 2 | OUT
0 | 0 | 0
0 | 1 | 1
1 | 0 | 1
1 | 1 | 0
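A sketch verifying this logical structure with idealized units; it implements the boolean decomposition directly rather than reproducing the learned weights.

```python
def xor_via_hidden_units(x, y):
    """Two hidden units plus an output unit reproduce XOR: O = (not A) and B."""
    a = int(not (x or y))      # hidden unit A: NOR(X, Y)
    b = int(not (x and y))     # hidden unit B: NAND(X, Y)
    return int((not a) and b)  # output O: (X or Y) and not (X and Y) = X xor Y

for x in (0, 1):
    for y in (0, 1):
        assert xor_via_hidden_units(x, y) == (x ^ y)
```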
39
Expressive Power of ANNs
- Universal Function Approximator:
– Given enough hidden units, can approximate any continuous function f
- Why not use millions of hidden units?
– Efficiency (neural network training is slow) – Overfitting
40
Combating Overfitting in Neural Nets
- Many techniques
- Two popular ones:
– Early Stopping
- Use “a lot” of hidden units
- Just don’t over-train
– Cross-validation
- Choose the “right” number of hidden units
41
Determining the Best Number of Hidden Units
- Too few hidden units prevents the network from
adequately fitting the data
- Too many hidden units can result in over-fitting
- Use internal cross-validation to empirically determine an optimal number of hidden units (see the sketch below)
(Figure: error on training data and error on test data plotted against the number of hidden units.)
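A sketch of selecting the hidden-layer size by validation error; a single held-out split stands in for full internal cross-validation, and train_network, error, and the candidate sizes are hypothetical.

```python
def choose_hidden_units(train_network, error, train_data, valid_data, candidates=(2, 5, 10, 20)):
    """Pick the number of hidden units with the lowest error on held-out validation data."""
    best_size, best_err = None, float("inf")
    for n_hidden in candidates:
        net = train_network(train_data, n_hidden=n_hidden)   # hypothetical trainer
        err = error(net, valid_data)                         # hypothetical validation error
        if err < best_err:
            best_size, best_err = n_hidden, err
    return best_size
```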
42
Over-Training Prevention
- Running too many epochs can result in over-fitting.
- Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error (see the sketch below).
- To avoid losing training data for validation:
– Use internal 10-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy.
  – Train the final network on the complete training set for this many epochs.
(Figure: error on training data and error on test data plotted against the number of training epochs.)
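A minimal early-stopping sketch along these lines; train_one_epoch and error are hypothetical stand-ins for the actual per-epoch trainer and error measure.

```python
import copy

def train_with_early_stopping(net, train_one_epoch, error, train_data, valid_data, max_epochs=1000):
    """Stop training when error on a hold-out validation set starts to increase."""
    best_net, best_valid_err = copy.deepcopy(net), float("inf")
    for epoch in range(max_epochs):
        net = train_one_epoch(net, train_data)      # hypothetical: one epoch of weight updates
        valid_err = error(net, valid_data)          # hypothetical validation error measure
        if valid_err < best_valid_err:
            best_net, best_valid_err = copy.deepcopy(net), valid_err
        else:
            break                                   # additional epochs increased validation error
    return best_net
```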
43
Summary of Neural Networks
When are Neural Networks useful?
– Instances represented by attribute-value pairs
- Particularly when attributes are real valued
– The target function is
- Discrete-valued
- Real-valued
- Vector-valued
– Training examples may contain errors
  – Fast evaluation times are necessary
When not?
– Fast training times are necessary
  – Understandability of the function is required
44
Successful Applications
- Text to Speech (NetTalk)
- Fraud detection
- Financial Applications
– HNC (eventually bought by Fair Isaac)
- Chemical Plant Control
– Pavillion Technologies
- Automated Vehicles
- Game Playing
– Neurogammon
- Handwriting recognition
45
Issues in Neural Nets
- Learning the proper network architecture:
– Grow network until able to fit data
- Cascade Correlation
- Upstart
– Shrink large network until unable to fit data
- Optimal Brain Damage
- Recurrent networks that use feedback and can learn
finite state machines with “backpropagation through time.”
46