CISC 4631 Data Mining Lecture 11: Neural Networks


SLIDE 1

CISC 4631 Data Mining

Lecture 11: Neural Networks

SLIDE 2

Biological Motivation

  • Can we simulate the human learning process? Two schools:
    – model the biological learning process
    – obtain highly effective algorithms, independent of whether those algorithms mirror biological processes (this course)
  • Biological learning system (brain): a complex network of neurons
  • ANNs are loosely motivated by biological neural systems; however, many features of ANNs are inconsistent with biological systems

SLIDE 3

Neural Speed Constraints

  • Neurons have a “switching time” on the order of a few milliseconds, compared to nanoseconds for current computing hardware.
  • However, neural systems can perform complex cognitive tasks (vision, speech understanding) in tenths of a second.
  • There is only time for about 100 serial steps in this time frame, compared to orders of magnitude more for current computers.
  • Must be exploiting “massive parallelism.”
  • The human brain has about 10^11 neurons with an average of 10^4 connections each.

SLIDE 4

Artificial Neural Networks (ANN)

  • ANN:
    – network of simple units
    – real-valued inputs & outputs
  • Many neuron-like threshold switching units
  • Many weighted interconnections among units
  • Highly parallel, distributed processing
  • Emphasis on tuning weights automatically

SLIDE 5

Neural Network Learning

  • Learning approach based on modeling adaptation in biological neural systems.
  • Perceptron: initial algorithm for learning simple (single-layer) neural networks, developed in the 1950s.
  • Backpropagation: more complex algorithm for learning multi-layer neural networks, developed in the 1980s.

SLIDE 6

Real Neurons

SLIDE 7

How Does our Brain Work?

  • A neuron is connected to other neurons via its input and output links
  • Each incoming neuron has an activation value, and each connection has a weight associated with it
  • The neuron sums the incoming weighted values, and this value is input to an activation function
  • The output of the activation function is the output from the neuron

SLIDE 8

Neural Communication

  • Electrical potential across the cell membrane exhibits spikes called action potentials.
  • A spike originates in the cell body, travels down the axon, and causes synaptic terminals to release neurotransmitters.
  • The chemical diffuses across the synapse to the dendrites of other neurons.
  • Neurotransmitters can be excitatory or inhibitory.
  • If the net input of neurotransmitters to a neuron from other neurons is excitatory and exceeds some threshold, it fires an action potential.

SLIDE 9

Real Neural Learning

  • To model the brain we need to model a neuron
  • Each neuron performs a simple computation:
    – It receives signals from its input links and uses these values to compute the activation level (or output) of the neuron.
    – This value is passed to other neurons via its output links.

SLIDE 10

Prototypical ANN

  • Units interconnected in layers
    – directed, acyclic graph (DAG)
  • Network structure is fixed
    – learning = weight adjustment
    – backpropagation algorithm

SLIDE 11

Appropriate Problems

  • Instances: vectors of attributes
    – discrete or real values
  • Target function:
    – discrete, real, or vector-valued
    – ANNs can handle classification & regression
  • Noisy data
  • Long training times acceptable
  • Fast evaluation
  • No need to be readable
    – it is almost impossible to interpret neural networks except for the simplest target functions

SLIDE 12

Perceptrons

  • The perceptron is a type of artificial neural network which can be seen as the simplest kind of feedforward neural network: a linear classifier
  • Introduced in the late 1950s
  • Perceptron convergence theorem (Rosenblatt 1962):
    – The perceptron will learn to classify any linearly separable set of inputs. (The XOR function admits no linear separation.)
  • The perceptron is a network:
    – single-layer
    – feed-forward: data only travels in one direction

SLIDE 13

ALVINN drives 70 mph on highways

See the ALVINN video.

SLIDE 14

Artificial Neuron Model

  • Model the network as a graph with cells as nodes and synaptic connections as weighted edges from node i to node j, with weight wji

[Figure: unit 1 receiving inputs from units 2–6 over weights w12, w13, w14, w15, w16]

  • Net input to cell j:  netj = Σi wji oi
  • Cell output:  oj = 1 if netj ≥ Tj, and 0 if netj < Tj  (Tj is the threshold for unit j)
SLIDE 15

Perceptron: Artificial Neuron Model

Vector notation: model the network as a graph with cells as nodes and synaptic connections as weighted edges from node i to node j, wji. The input value received by a neuron is calculated by summing the weighted input values from its input links:

  net = Σ_{i=1}^{n} wi xi

The output is obtained by applying a threshold function to this net input.

SLIDE 16

Different Threshold Functions

         , 1 , 1 ) ( x w x w x

             , 1 , 1 ) ( x w x w x

   

     

  • therwise

t x w x

  • ,

, 1 ) (   

         , 1 , 1 ) ( x w x w x

   

We should learn the weight w1,…, wn
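The two threshold functions differ only in their output range; a small sketch (the weight vector and inputs are invented for illustration):

```python
def sign_output(w, x):
    """Sign threshold: +1 if w.x > 0, else -1."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > 0 else -1

def step_output(w, x, t):
    """Step threshold: 1 if w.x > t, else 0."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if net > t else 0

w = [0.4, -0.6]
print(sign_output(w, [1, 1]))       # net = -0.2 -> -1
print(step_output(w, [1, 1], 0.0))  # net = -0.2 -> 0
```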

SLIDE 17

Examples

(step activation function)

[Truth tables: a two-input unit computing AND, a two-input unit computing OR, and a one-input unit computing NOT]

  Net input: Σ_{i=1}^{n} wi xi, with bias weight w0 = −t
SLIDE 18

Neural Computation

  • McCulloch and Pitts (1943) showed how such model neurons could compute logical functions and be used to construct finite-state machines
  • Can be used to simulate logic gates:
    – AND: let all wji be Tj/n, where n is the number of inputs.
    – OR: let all wji be Tj.
    – NOT: let the threshold be 0, with a single input with a negative weight.
  • Can build arbitrary logic circuits, sequential machines, and computers with such gates
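The gate constructions above can be checked directly. A sketch following the stated recipes (the choice Tj = 1.0 and the helper names are mine):

```python
def threshold_unit(weights, inputs, T):
    """Fire (1) iff the weighted input sum reaches the threshold T."""
    net = sum(w * x for w, x in zip(weights, inputs))
    return 1 if net >= T else 0

T = 1.0

def AND(x1, x2):   # all weights T/n, with n = 2 inputs
    return threshold_unit([T / 2, T / 2], [x1, x2], T)

def OR(x1, x2):    # all weights T
    return threshold_unit([T, T], [x1, x2], T)

def NOT(x):        # threshold 0, single negative weight
    return threshold_unit([-1.0], [x], 0.0)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b))
print(NOT(0), NOT(1))  # 1 0
```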

SLIDE 19

Perceptron Training

  • Assume supervised training examples giving the desired output for a unit given a set of known input activations.
  • Goal: learn the weight vector (synaptic weights) that causes the perceptron to produce the correct +/−1 values
  • The perceptron uses an iterative update algorithm to learn a correct set of weights:
    – Perceptron training rule
    – Delta rule
  • Both algorithms are guaranteed to converge to somewhat different acceptable hypotheses, under somewhat different conditions

SLIDE 20

Perceptron Training Rule

  • Update weights by:

      wi ← wi + Δwi,  where  Δwi = η (t − o) xi

    where η is the learning rate
      – a small value (e.g., 0.1)
      – sometimes made to decay as the number of weight-tuning operations increases
    t – target output for the current training example
    o – output for the current training example

SLIDE 21

Perceptron Training Rule

  • Equivalent to the rules:
    – If the output is correct, do nothing.
    – If the output is high, lower the weights on active inputs.
    – If the output is low, increase the weights on active inputs.
  • Can prove it will converge:
    – if the training data is linearly separable
    – and η is sufficiently small

SLIDE 22

Perceptron Learning Algorithm

  • Iteratively update weights until convergence.
  • Each execution of the outer loop is typically called an epoch.

    Initialize weights to random values
    Until outputs of all training examples are correct:
      For each training pair E:
        Compute current output oj for E given its inputs
        Compare current output to target value tj for E
        Update synaptic weights and threshold using the learning rule
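The loop above can be sketched end to end, with the threshold folded into a bias weight w0. This is a minimal illustration of the perceptron training rule Δwi = η (t − o) xi; the learning rate, dataset (the AND function), and random-seed choices are mine.

```python
import random

def perceptron_train(examples, eta=0.1, max_epochs=100, seed=0):
    """Learn weights [w0, w1, ..., wn] (w0 is the bias) with the perceptron rule."""
    rng = random.Random(seed)
    n = len(examples[0][0])
    w = [rng.uniform(-0.5, 0.5) for _ in range(n + 1)]
    for epoch in range(max_epochs):          # each outer pass is one epoch
        errors = 0
        for x, t in examples:
            xs = [1] + list(x)               # prepend the constant bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) >= 0 else 0
            if o != t:
                errors += 1
                for i in range(len(w)):      # w_i <- w_i + eta * (t - o) * x_i
                    w[i] += eta * (t - o) * xs[i]
        if errors == 0:                      # all training examples correct
            break
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = perceptron_train(AND)

def predict(x):
    return 1 if w[0] + w[1] * x[0] + w[2] * x[1] >= 0 else 0

print([predict(x) for x, _ in AND])  # [0, 0, 0, 1]
```

Since AND is linearly separable, the convergence theorem on the previous slides guarantees this loop terminates with zero training errors.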

SLIDE 23

Delta Rule

  • Works reasonably with data that is not linearly separable
  • Minimizes error
  • Gradient descent method:
    – basis of the backpropagation method
    – basis for methods working in multidimensional continuous spaces
    – discussion of this rule is beyond the scope of this course

SLIDE 24

Perceptron as a Linear Separator

  • Since the perceptron uses a linear threshold function, it searches for a linear separator that discriminates the classes

    Decision boundary:  w12 o2 + w13 o3 = T1,  i.e.,  o3 = −(w12/w13) o2 + T1/w13

  • Or a hyperplane in n-dimensional space

SLIDE 25

Concept Perceptron Cannot Learn

  • Cannot learn exclusive-or (XOR), or the parity function in general

[Figure: the four XOR instances in the plane, with + and − classes arranged so that no single line separates them]
SLIDE 26

General Structure of an ANN

[Figure: neuron i with inputs I1, I2, I3, weights wi1, wi2, wi3, threshold t, net input Si, activation function g(Si), and output Oi; a network with input layer x1–x5, one hidden layer, and output y]

Perceptrons have no hidden layers. Multilayer perceptrons may have many.

SLIDE 27

Learning Power of an ANN

  • The perceptron is guaranteed to converge if the data is linearly separable:
    – It will learn a hyperplane that separates the classes
    – The XOR function on page 250 of TSK is not linearly separable
  • A multilayer ANN has no such guarantee of convergence, but can learn functions that are not linearly separable
  • An ANN with a hidden layer can learn the XOR function by constructing two hyperplanes (see page 253)

SLIDE 28

Multilayer Network Example

The decision surface is highly nonlinear

SLIDE 29

Sigmoid Threshold Unit

  • The sigmoid unit is a unit whose output is a nonlinear function of its inputs, but whose output is also a differentiable function of its inputs
  • We can derive gradient descent rules to train:
    – a sigmoid unit
    – multilayer networks of sigmoid units → backpropagation
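The differentiability mentioned above is what makes gradient descent possible: the sigmoid's derivative has the convenient closed form σ(net)·(1 − σ(net)). A small sketch (function names are mine):

```python
import math

def sigmoid(net):
    """Logistic sigmoid: smooth, differentiable squashing of the net input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

def sigmoid_deriv(net):
    """d(sigma)/d(net) = sigma(net) * (1 - sigma(net)) -- used by gradient descent."""
    s = sigmoid(net)
    return s * (1.0 - s)

print(round(sigmoid(0.0), 3))        # 0.5
print(round(sigmoid_deriv(0.0), 3))  # 0.25 (the derivative's maximum)
```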

SLIDE 30

Multi-Layer Networks

  • Multi-layer networks can represent arbitrary functions, but an effective learning algorithm for such networks was thought to be difficult to find
  • A typical multi-layer network consists of an input, hidden, and output layer, each fully connected to the next, with activation feeding forward.
  • The weights determine the function computed. Given an arbitrary number of hidden units, any boolean function can be computed with a single hidden layer

[Figure: layers labeled input, hidden, and output; activation feeds forward]
SLIDE 31

Logistic function

The logistic unit computes:

  output = 1 / (1 + e^(−Σ_{i=1}^{n} wi xi))

[Figure: inputs (independent variables) Age = 34, Gender = 1, Stage = 4 with coefficients .5, .8, .4 feeding a sigmoid (S) unit; output (prediction) 0.6 = “probability of being alive”]

SLIDE 32

Neural Network Model

[Figure: network with inputs (independent variables) Age = 34, Gender = 2, Stage = 4; weights .6, .5, .8, .2, .1, .3, .7, .2 from the inputs into a hidden layer of sigmoid (S) units; weights .4, .2 from the hidden units into the sigmoid output unit; output (prediction of the dependent variable) 0.6 = “probability of being alive”]

SLIDE 33

Getting an answer from a NN

[Figure: the same network, highlighting the forward pass through one hidden unit (input weights .6, .5, .8; downstream weights .1, .7); output 0.6 = “probability of being alive”]

SLIDE 34

Getting an answer from a NN

[Figure: the same network, highlighting the forward pass through the other hidden unit (input weights .5, .8, .2; downstream weights .3, .2); output 0.6 = “probability of being alive”]

SLIDE 35

Getting an answer from a NN

[Figure: the full network again, with inputs Age = 34, Gender = 1, Stage = 4 and weights .6, .5, .8, .2, .1, .3, .7, .2; output 0.6 = “probability of being alive”]

SLIDE 36

Comments on Training Algorithm

  • Not guaranteed to converge to zero training error; may converge to local optima or oscillate indefinitely.
  • However, in practice, it does converge to low error for many large networks on real data.
  • Many epochs (thousands) may be required; hours or days of training for large networks.
  • To avoid local-minima problems, run several trials starting with different random weights (random restarts).
    – Take the results of the trial with the lowest training-set error.
    – Build a committee of results from multiple trials (possibly weighting votes by training-set accuracy).

SLIDE 37

Hidden Unit Representations

  • Trained hidden units can be seen as newly constructed features that make the target concept linearly separable in the transformed space: a key feature of ANNs
  • On many real domains, hidden units can be interpreted as representing meaningful features such as vowel detectors or edge detectors, etc.
  • However, the hidden layer can also become a distributed representation of the input in which each individual unit is not easily interpretable as a meaningful feature.

SLIDE 38

Expressive Power of ANNs

  • Boolean functions: any boolean function can be represented by a two-layer network with sufficient hidden units.
  • Continuous functions: any bounded continuous function can be approximated with arbitrarily small error by a two-layer network.
  • Arbitrary functions: any function can be approximated to arbitrary accuracy by a three-layer network.

SLIDE 39

Sample Learned XOR Network

3.11 7.38 6.96 5.24 3.6 3.58 5.57 5.74 2.03

A X Y B

Hidden Unit A represents: (X  Y) Hidden Unit B represents: (X  Y) Output O represents: A  B = (X  Y)  (X  Y) = X  Y

O

IN 1 IN 2 OUT 1 1 1 1 1 1
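A network of this shape can be verified with hand-set weights rather than the learned ones: one hidden threshold unit computing NOT(X AND Y), one computing (X OR Y), and an output unit taking their conjunction. The particular weights and thresholds below are my own illustrative choices, not the learned values in the figure.

```python
def threshold(net, T):
    """Linear threshold unit: 1 iff the net input reaches threshold T."""
    return 1 if net >= T else 0

def xor_net(x, y):
    """Two hidden threshold units feeding one output unit (hand-set weights)."""
    a = threshold(-x - y, -1.5)   # A = NOT(x AND y): fails only when both inputs are on
    b = threshold(x + y, 1.0)     # B = x OR y
    return threshold(a + b, 2.0)  # O = A AND B

print([xor_net(x, y) for x, y in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 1, 1, 0]
```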

SLIDE 40

Expressive Power of ANNs

  • Universal function approximator:
    – given enough hidden units, a network can approximate any continuous function f
  • Why not use millions of hidden units?
    – efficiency (neural network training is slow)
    – overfitting

SLIDE 41

Combating Overfitting in Neural Nets

  • Many techniques
  • Two popular ones:
    – Early stopping
      • use “a lot” of hidden units
      • just don’t over-train
    – Cross-validation
      • choose the “right” number of hidden units

SLIDE 42

Determining the Best Number of Hidden Units

  • Too few hidden units prevent the network from adequately fitting the data
  • Too many hidden units can result in over-fitting
  • Use internal cross-validation to empirically determine an optimal number of hidden units

[Figure: error on training data and on test data as a function of the number of hidden units]

SLIDE 43

Over-Training Prevention

  • Running too many epochs can result in over-fitting.
  • Keep a hold-out validation set and test accuracy on it after every epoch. Stop training when additional epochs actually increase validation error.
  • To avoid losing training data to validation:
    – Use internal 10-fold CV on the training set to compute the average number of epochs that maximizes generalization accuracy.
    – Train the final network on the complete training set for this many epochs.

[Figure: error on training data and on test data as a function of the number of training epochs]
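The hold-out procedure above can be sketched on a toy gradient-descent problem. Everything here (the one-parameter model y ≈ w·x, the synthetic data, and the `patience` cutoff) is invented for illustration; the point is the structure of the loop: check validation error every epoch, remember the best weights, and stop when more epochs stop helping.

```python
import random

def early_stopping_fit(train, val, eta=0.05, max_epochs=500, patience=5):
    """Gradient-descent fit of y ~ w*x, stopping when validation error stops improving."""
    w = 0.0
    best_w, best_err, bad_epochs = w, float("inf"), 0
    for epoch in range(max_epochs):
        for x, y in train:                    # one epoch over the training set
            w += eta * (y - w * x) * x        # squared-error gradient step
        val_err = sum((y - w * x) ** 2 for x, y in val) / len(val)
        if val_err < best_err:                # validation improved: keep these weights
            best_w, best_err, bad_epochs = w, val_err, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:        # additional epochs no longer help
                break
    return best_w

rng = random.Random(1)
data = [(x, 2.0 * x + rng.gauss(0, 0.1)) for x in [i / 10 for i in range(20)]]
train, val = data[::2], data[1::2]            # alternate points into train / hold-out
w = early_stopping_fit(train, val)
print(round(w, 2))  # close to the true slope 2.0
```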

SLIDE 44

Summary of Neural Networks

When are neural networks useful?
  – Instances are represented by attribute-value pairs
    • particularly when attributes are real-valued
  – The target function is discrete-valued, real-valued, or vector-valued
  – Training examples may contain errors
  – Fast evaluation times are necessary

When not?
  – Fast training times are necessary
  – Understandability of the function is required

SLIDE 45

Successful Applications

  • Text to Speech (NetTalk)
  • Fraud detection
  • Financial Applications

– HNC (eventually bought by Fair Isaac)

  • Chemical Plant Control

– Pavilion Technologies

  • Automated Vehicles
  • Game Playing

– Neurogammon

  • Handwriting recognition

SLIDE 46

Issues in Neural Nets

  • Learning the proper network architecture:
    – Grow the network until it is able to fit the data
      • Cascade Correlation
      • Upstart
    – Shrink a large network until it is unable to fit the data
      • Optimal Brain Damage
  • Recurrent networks use feedback and can learn finite-state machines with “backpropagation through time.”