SLIDE 1

ARTIFICIAL INTELLIGENCE
Artificial Neural Networks

Lecturer: Silja Renooij

Utrecht University, The Netherlands
INFOB2KI 2019-2020

These slides are part of the INFOB2KI Course Notes available from www.cs.uu.nl/docs/vakken/b2ki/schema.html

SLIDE 2

SLIDE 3

Outline

  • Biological neural networks
  • Artificial NN basics and training:
    – perceptrons
    – multi-layer networks
  • Combination with other ML techniques:
    – NN and Reinforcement Learning (e.g. AlphaGo)
    – NN and Evolutionary Computing

SLIDE 4

(Artificial) Neural Networks

  • Supervised learning technique: error-driven classification
  • Output is a weighted function of the inputs
  • Training updates the weights
  • Used in games, e.g. to:
    – select a weapon
    – select an item to pick up
    – steer a car on a circuit
    – recognize characters
    – recognize faces
    – …

SLIDE 5

Biological Neural Nets

  • Pigeons as art experts (Watanabe et al. 1995)
  • Experiment:
    – Pigeon in a Skinner box
    – Present paintings by two different artists (e.g. Chagall / Van Gogh)
    – Reward for pecking when presented with paintings by a particular artist (e.g. Van Gogh)

SLIDE 6

SLIDE 7

Results from experiment

Pigeons were able to discriminate between Van Gogh and Chagall:

  • with 95% accuracy, when presented with pictures they had been trained on
  • still 85% successful for previously unseen paintings by the artists

SLIDE 8

Praise to neural nets

  • Pigeons have acquired knowledge about art:
    – Pigeons do not simply memorise the pictures
    – They can extract and recognise patterns (the ‘style’)
    – They generalise from what they have already seen to make predictions
  • Pigeons have learned.
  • Can one implement this using an artificial neural network?

SLIDE 9

Inspiration from biology

  • If a pigeon can do it, how hard can it be?
  • ANNs are biologically inspired.
  • ANNs are not duplicates of brains (and don’t try to be)!

SLIDE 10

(Natural) Neurons

[Figure: natural neuron vs. artificial neuron (node)]

Natural neurons:

  • receive signals through synapses (~ inputs)
  • if the signals are strong enough (~ above some threshold), the neuron is activated and emits a signal through its axon (~ output)

SLIDE 11

McCulloch & Pitts model (1943)

w1 w2 wn x1 x2 xn y

  • utput

hard delimiter Linear Combiner

“A logical calculus of the ideas immanent in nervous activity”

  • n binary inputs xi and 1 binary output y
  • n weights wi ϵ {‐1,1}
  • Linear combiner: z = ∑

𝑥𝑦

  • Hard delimiter: unit step function at threshold θ, i.e.

𝑧 1 if 𝑨 𝜄, 𝑧 0 if 𝑨 𝜄

aka:

  • linear threshold gate
  • threshold logic unit

11

SLIDE 12

Rosenblatt’s Perceptron (1958)

[Figure: inputs $x_1 \dots x_n$ and bias $b$ feeding a linear combiner $z$ and a hard delimiter, producing output $y = g(z)$]

  • enhanced version of the McCulloch-Pitts artificial neuron
  • n+1 real-valued inputs: $x_1 \dots x_n$ and 1 bias $b$; binary output $y$
  • real-valued weights $w_i$
  • Linear combiner: $z = \sum_{i=1}^{n} w_i x_i + b$
  • $g(z)$: (hard delimiter) unit step function at threshold 0, i.e. $g(z) = 1$ if $z \geq 0$, $g(z) = 0$ if $z < 0$

SLIDE 13

  • Classification: feedforward

[Figure: worked feedforward example with weights $w_1 = 2$, $w_2 = 4$ and inputs $x_1 = 4$, $x_2 = 3$: weighted input $z = 2 \cdot 4 + 4 \cdot 3 = 8 + 12 = 20$, followed by activation $g(z)$]

The algorithm for computing outputs from inputs in a perceptron is called the feedforward algorithm.
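To make the feedforward step concrete, here is a minimal Python sketch of a perceptron computing its output. The weights and inputs are those of the worked example above; the threshold value is an assumption for illustration, since it is not recoverable from the slide.

```python
def perceptron_output(weights, inputs, theta):
    """Feedforward step: weighted sum followed by a unit step at theta."""
    z = sum(w * x for w, x in zip(weights, inputs))  # linear combiner
    return 1 if z >= theta else 0                    # hard delimiter

# Worked example: w = (2, 4), x = (4, 3) gives z = 2*4 + 4*3 = 20.
# theta = 4 is an assumed threshold, chosen only for illustration.
print(perceptron_output([2, 4], [4, 3], theta=4))    # -> 1, since 20 >= 4
```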

SLIDE 14

Bias & threshold implementation

Bias can be incorporated in three different ways, with same effect on output:

14

1

b

b

w0= 1

θ- b

Alternatively: threshold θ can be incorporated in three different ways, with same effect on output…
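As a small consistency check (using only the definitions above), all three formulations make the same decision:

```latex
% Equivalence of the three bias/threshold formulations:
\[
\sum_{i=1}^{n} w_i x_i + b \;\ge\; \theta
\iff
\sum_{i=0}^{n} w_i x_i \;\ge\; \theta \quad (x_0 = 1,\; w_0 = b)
\iff
\sum_{i=1}^{n} w_i x_i \;\ge\; \theta - b
\]
```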

SLIDE 15

Single layer perceptron

  • alternative hard-limiting activation functions g(z) are possible; e.g. the sign function: $g(z) = +1$ if $z \geq 0$, $g(z) = -1$ if $z < 0$
  • can have multiple independent outputs $y_i$
  • the adjustable weights can be trained using training data
  • the Perceptron learning rule adjusts the weights $w_1 \dots w_n$ such that the inputs $x_1 \dots x_n$ give rise to the desired output(s)
  • Rosenblatt’s perceptron is the building block of the single-layer perceptron
  • which is the simplest feedforward neural network

[Figure: input nodes 1, 2 connected to a single layer of neurons 3, 4 via weights $w_{13}, w_{14}, w_{23}, w_{24}$, producing outputs $y_1, y_2$]

SLIDE 16

Perceptron learning: idea

Idea: minimize the error in the output

  • per output: $e = d - y$ (d = desired output)
  • If $e = +1$ then $z = \sum_i w_i x_i$ should be increased such that it exceeds the threshold
  • If $e = -1$ then $z = \sum_i w_i x_i$ should be decreased such that it falls below the threshold
  ⇒ change $w_i \leftarrow w_i \pm$ a term proportional to the gradient ($x_i$)
  • Proportional change: learning rate $\beta > 0$

NB in the book the learning rate is called Gain, with notation η

SLIDE 17

Perceptron learning

Initialize the weights and threshold (or bias) to random numbers; choose a learning rate $0 < \beta \leq 1$.

For each training input t = <x1,…,xn> (d(t) denotes the desired output; one pass over all training inputs is 1 ‘epoch’):

  • calculate the output y(t) and the error e(t) = d(t) − y(t)
  • adjust all n weights using the perceptron learning rule:
    $w_i \leftarrow w_i + \Delta w_i$, where $\Delta w_i = \beta \, x_i(t) \, e(t)$

Weights changed for any t? Run another epoch. All weights unchanged (or some other stopping rule satisfied)? Ready.
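For concreteness, here is a minimal Python sketch of this procedure (a sketch, not the book’s code), using the step activation and the AND set-up of the next slides: θ = 0.2, β = 0.1, initial weights 0.3 and −0.1.

```python
def train_perceptron(data, weights, theta, beta, max_epochs=100):
    """Perceptron learning: w_i <- w_i + beta * x_i * e(t),
    repeated in epochs until no weight changes."""
    for _ in range(max_epochs):
        changed = False
        for xs, d in data:
            z = sum(w * x for w, x in zip(weights, xs))
            y = 1 if z >= theta else 0              # step activation
            e = d - y                               # error e(t)
            if e != 0:
                # round() keeps the arithmetic exact, as on the slides
                weights = [round(w + beta * x * e, 10)
                           for w, x in zip(weights, xs)]
                changed = True
        if not changed:                             # stopping rule
            break
    return weights

# AND training data: ((x1, x2), d), with the slides' initial values
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
print(train_perceptron(and_data, [0.3, -0.1], theta=0.2, beta=0.1))
# -> [0.1, 0.1] after a few epochs, as on the convergence slide
```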

SLIDE 18

Example: AND-learning (1)

Desired output of logical AND, given 2 binary inputs:

x1  x2  d
0   0   0
0   1   0
1   0   0
1   1   1

[Figure: the four input points in the (x1, x2)-plane; only (1,1) has output 1]

SLIDE 19

Example AND (2)

Init: choose the weights wi and threshold θ randomly in [−0.5, 0.5]: here w1 = 0.3, w2 = −0.1, θ = 0.2; set β = 0.1; use the step function: return 0 if z < θ, 1 if z ≥ θ.

(Alternative: use bias b = −θ with a unit step function at 0.)

    x1  x2  d(t)
t1  0   0   0
t2  0   1   0
t3  1   0   0
t4  1   1   1

t1 = (0, 0): z = 0.3·0 + (−0.1)·0 = 0 < 0.2, so y = 0 and
e(t1) = d(t1) − y = 0 − 0 = 0  ⇒ no weight changes. Done with t1, for now…

SLIDE 20

Example AND (3)

    x1  x2  d(t)
t1  0   0   0
t2  0   1   0
t3  1   0   0
t4  1   1   1

t2 = (0, 1): z = 0.3·0 + (−0.1)·1 = −0.1 < 0.2, so y = 0 and
e(t2) = 0 − 0 = 0  ⇒ no weight changes. Done with t2, for now…

SLIDE 21

Example AND (4)

    x1  x2  d(t)
t1  0   0   0
t2  0   1   0
t3  1   0   0
t4  1   1   1

t3 = (1, 0): z = 0.3·1 + (−0.1)·0 = 0.3 ≥ 0.2, so y = 1 and
e(t3) = 0 − 1 = −1  ⇒ Δw1 = β·x1·e = 0.1·1·(−1) = −0.1, so w1 ← 0.2;
done with t3, for now…
SLIDE 22

Example AND (5)

    x1  x2  d(t)
t1  0   0   0
t2  0   1   0
t3  1   0   0
t4  1   1   1

t4 = (1, 1): z = 0.2·1 + (−0.1)·1 = 0.1 < 0.2, so y = 0 and
e(t4) = 1 − 0 = 1  ⇒ Δw1 = 0.1·1·1 = 0.1 and Δw2 = 0.1·1·1 = 0.1,
so w1 ← 0.3 and w2 ← 0; done with t4 and the first epoch…

SLIDE 23

Example (6): 4 epochs later…

  • the algorithm has converged, i.e. the weights do not change any more: w1 = 0.1, w2 = 0.1, θ = 0.2
  • the algorithm has correctly learned the AND function

SLIDE 24

AND example (7): results

x1  x2  d  y
0   0   0  0
0   1   0  0
1   0   0  0
1   1   1  1

Learned function/decision boundary: $0.1\,x_1 + 0.1\,x_2 = 0.2$, or: $x_1 + x_2 = 2$ ⇒ a linear classifier

[Figure: the line x1 + x2 = 2 separates (1,1) from the other three input points]
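As a quick sanity check, a few lines of Python (a sketch, reusing the step-function set-up from earlier) confirm that the learned weights reproduce the AND column above:

```python
# Learned parameters from the slides: w1 = w2 = 0.1, theta = 0.2
w, theta = [0.1, 0.1], 0.2
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = w[0] * x1 + w[1] * x2
    y = 1 if z >= theta else 0
    print(f"x = ({x1}, {x2}): z = {z:.1f}, y = {y}")  # y = 1 only for (1, 1)
```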

SLIDE 25

Perceptron learning: properties

All linear functions ⇒ a space without local optima.

Complete: yes; if
  – β is sufficiently small (or the initial weights are sufficiently large), and
  – the examples come from a linearly separable function,
then perceptron learning converges to a solution.

Optimal: no (the weights serve to correctly separate ‘seen’ inputs; there are no guarantees for ‘unseen’ inputs close to the decision boundaries).

SLIDE 26

Limitation of perceptron: example

x1  x2  d
0   0   0
0   1   1
1   0   1
1   1   0

[Figure: the four XOR input points in the (x1, x2)-plane; no single line separates the two output classes]

  • Cannot separate the two output classes with a single linear function: XOR is not linearly separable.

SLIDE 27

Solving XOR using 2 single layer perceptrons

[Figure: two single-layer perceptrons, each with threshold ϴ = 1 and weights +1/−1, realize two linear separations in the (x1, x2)-plane; combining them (input nodes 1, 2; hidden nodes 3, 4; output node 5; all thresholds ϴ = 1) yields a two-layer network whose output y computes XOR]
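A minimal Python sketch of one weight assignment consistent with the figure (the exact weights are not fully recoverable from the slide, so these particular values are assumptions): hidden node 3 fires only for (1, 0), hidden node 4 only for (0, 1), and output node 5 ORs them.

```python
def step(z, theta=1):
    """Unit step at threshold theta (theta = 1 everywhere in the figure)."""
    return 1 if z >= theta else 0

def xor_net(x1, x2):
    y3 = step(1 * x1 - 1 * x2)    # fires only for (1, 0)
    y4 = step(-1 * x1 + 1 * x2)   # fires only for (0, 1)
    return step(1 * y3 + 1 * y4)  # OR of the two hidden nodes

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, xor_net(*x))  # -> 0, 1, 1, 0
```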

SLIDE 28

Types of decision regions


SLIDE 29

Multi-layer networks

  • This type of network is also called a feedforward network
  • the hidden layer captures nonlinearities
  • more than 1 hidden layer is possible, but often reducible to 1 hidden layer
  • introduced in the 1950s, but not studied until the 1980s

[Figure: input nodes x1, x2, x3, a hidden layer of neurons, and an output neuron layer producing y1, y2, y3]

SLIDE 30

Multi-Layer Networks

In MLNs:

  • outputs are not based on a simple weighted sum of the inputs
  • weights are shared ⇒ dependent outputs
  • errors must be distributed over the hidden neurons
  • continuous activation functions are used

[Figure: input signals flow forward through the network (x1, x2, x3 → y1, y2, y3); error signals flow backwards]

SLIDE 31

Continuous activation functions

As a continuous activation function g(z) we can use, for example (see the sketch below):

  • a (piecewise) linear function, e.g. ReLU
  • a sigmoid (a smoothed version of the step function), e.g. the logistic sigmoid
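A minimal Python sketch of these two options (the function names are my own):

```python
import math

def relu(z):
    """Piecewise linear activation: max(0, z)."""
    return max(0.0, z)

def logistic(z):
    """Logistic sigmoid: a smooth version of the unit step."""
    return 1.0 / (1.0 + math.exp(-z))

print(relu(-2), relu(3))       # 0.0 3
print(round(logistic(-2), 3))  # 0.119, as in the example below
```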

SLIDE 32

Continuous artificial neurons

[Figure: inputs $x_1 \dots x_n$ with weights $w_1 \dots w_n$, a linear combiner computing the weighted input $z = \sum_{i=1}^{n} w_i x_i$, and a logistic-sigmoid activation $y = g(z) = 1/(1 + e^{-z})$ producing the output]

SLIDE 33

Example

[Figure: weights $w_1 = 2$, $w_2 = 4$ and inputs $x_1 = 3$, $x_2 = -2$: weighted input $z = 2 \cdot 3 + 4 \cdot (-2) = 6 - 8 = -2$; activation (logistic sigmoid) $g(-2) \approx 0.119$]
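Using the `logistic` helper sketched above, the slide’s numbers check out:

```python
z = 2 * 3 + 4 * (-2)             # weighted input: 6 - 8 = -2
print(z, round(logistic(z), 3))  # -2 0.119
```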

SLIDE 34

Error minimization in MLNs: idea

Idea: minimize the error in the output through gradient descent

  • Total error is the sum of squared errors, per output j: $F = \tfrac{1}{2}\sum_j e_j^2$ with $e_j = d_j - y_j$ (d = desired output)
  • change $w_{ij} \leftarrow w_{ij} -$ a term proportional to the gradient $\partial F / \partial w_{ij}$
SLIDE 35

Computing the gradient I

Consider the weight $w_{ij}$ on the link from node i to node j, where $z_j = \sum_i w_{ij}\, y_i$ and $y_j = g(z_j)$; how the gradient is computed depends on the location of node j.

If node j is in the output layer, the chain rule gives

$$\frac{\partial F}{\partial w_{ij}} = \frac{\partial F}{\partial y_j}\,\frac{\partial y_j}{\partial z_j}\,\frac{\partial z_j}{\partial w_{ij}} = -\,e_j\, g'(z_j)\, y_i = -\,\delta_j\, y_i, \qquad \delta_j = g'(z_j)\, e_j$$

SLIDE 36

Computing the gradient II

For a node j in a hidden layer (layer L−1), the error reaches j only through the nodes k of layer L that j feeds into. Writing $\delta_j = g'(z_j) \sum_k w_{jk}\,\delta_k$ (with the $\delta_k$ already computed for layer L), the chain rule gives

$$\frac{\partial F}{\partial w_{ij}} = -\,\delta_j\, y_i$$

The change required for error minimization is thus propagated backwards.
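For the logistic sigmoid, the derivative $g'(z)$ needed above has a convenient closed form in terms of the output $y = g(z)$; this is the $y(1-y)$ factor that appears in the worked XOR computations below:

```latex
% Derivative of the logistic sigmoid g(z) = 1/(1 + e^{-z}):
\[
g'(z) = \frac{e^{-z}}{\left(1 + e^{-z}\right)^2} = g(z)\,\bigl(1 - g(z)\bigr) = y\,(1 - y)
\]
```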

SLIDE 37

Backpropagation for MLNs

Initialize the weights and threshold (or bias) to random numbers; choose a learning rate $0 < \beta \leq 1$.

For each training input t = <x1,…,xn>:

  • calculate the output y(t) and the error e(t) = d(t) − y(t)
  • recursively adjust each weight on the link from node i to node j:
    $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$, where $\Delta w_{ij} = \beta \, y_i \, \delta_j$ and
      $\delta_j = g'(z_j)\, e(t)$ if j is an output node,
      $\delta_j = g'(z_j) \sum_k w_{jk}\, \delta_k$ if j is a hidden node.

Weights changed for any t? Repeat. All weights unchanged (or some other stopping rule satisfied)? Ready.
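A minimal Python sketch of this procedure for the small 2-2-1 network used in the XOR example on the following slides (the shifted-sigmoid activation with θ = 6 is reconstructed from the slides’ numbers; variable names are my own):

```python
import math

def g(z, theta=6):
    """Logistic sigmoid shifted by a threshold, as used on the next
    slides: g(z) = 1 / (1 + exp(-(z - theta)))."""
    return 1.0 / (1.0 + math.exp(-(z - theta)))

def backprop_step(w, x, d, beta=0.9):
    """One backpropagation step for the 2-2-1 XOR network of the
    following slides (inputs 1, 2 -> hidden 3, 4 -> output 5).
    w maps links to weights, e.g. w[(1, 3)] is w13."""
    x1, x2 = x
    # Feedforward pass
    y3 = g(w[(1, 3)] * x1 + w[(2, 3)] * x2)
    y4 = g(w[(1, 4)] * x1 + w[(2, 4)] * x2)
    y5 = g(w[(3, 5)] * y3 + w[(4, 5)] * y4)
    e = d - y5
    # Backward pass: delta_j = g'(z_j) * (error signal), with g' = y(1 - y)
    d5 = y5 * (1 - y5) * e                # output node
    d3 = y3 * (1 - y3) * w[(3, 5)] * d5   # hidden nodes
    d4 = y4 * (1 - y4) * w[(4, 5)] * d5
    # Weight updates: w_ij <- w_ij + beta * y_i * delta_j
    for (i, j), delta, y_i in [((3, 5), d5, y3), ((4, 5), d5, y4),
                               ((1, 3), d3, x1), ((2, 3), d3, x2),
                               ((1, 4), d4, x1), ((2, 4), d4, x2)]:
        w[(i, j)] += beta * y_i * delta
    return e, w
```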

SLIDE 38

Training for XOR

x1  x2  d
0   0   0
0   1   1
1   0   1
1   1   0

Activation function for nodes 3–5: $g(z) = 1/(1 + e^{-(z - \theta)})$ with $\theta = 6$; set $\beta = 0.9$.

[Figure: network with input nodes 1, 2, hidden nodes 3, 4 and output node 5; weights W13 = 10, W23 = −5, W14 = −5, W24 = 10, W35 = 5, W45 = 5. For input (0, 0): hidden outputs y3 = y4 = 0.002 and network output y = 0.003]

e(t) = 0 − 0.003

To simplify computation, if the absolute value of e(t) < 0.1, we consider the outcome correct. With the sigmoid as an approximation of the step function, we consider this outcome correct ⇒ no weight updates required for the first case, for now…

SLIDE 39

Training for XOR

x1  x2  d
0   0   0
0   1   1
1   0   1
1   1   0

Activation function for nodes 3–5: $g(z) = 1/(1 + e^{-(z - \theta)})$ with $\theta = 6$; set $\beta = 0.9$.

[Figure: same network, weights W13 = 10, W23 = −5, W14 = −5, W24 = 10, W35 = 5, W45 = 5. For input (0, 1): y3 = 0.000, y4 = 0.982, output y5 = 0.252]

e(t) = 1 − 0.252 = 0.748

δ5 = y5 * (1 − y5) * e ≈ 0.141
Δw35 = β * y3 * δ5 ≈ 0.000
Δw45 = β * y4 * δ5 ≈ 0.125
δ3 = y3 * (1 − y3) * w35 * δ5 ≈ 0.000
δ4 = y4 * (1 − y4) * w45 * δ5 ≈ 0.012
Δw13 = β * y1 * δ3 = β * x1 * δ3 = 0 = Δw14
Δw23 = β * x2 * δ3 ≈ 0.000
Δw24 = β * x2 * δ4 ≈ 0.011
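Running the `backprop_step` sketch from above on this case reproduces the slide’s numbers:

```python
w = {(1, 3): 10, (2, 3): -5, (1, 4): -5, (2, 4): 10, (3, 5): 5, (4, 5): 5}
e, w = backprop_step(w, x=(0, 1), d=1)
print(round(e, 3), round(w[(4, 5)], 3), round(w[(2, 4)], 3))
# -> 0.748 5.125 10.011
```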

SLIDE 40

Training for XOR

x1  x2  d
0   0   0
0   1   1
1   0   1
1   1   0

Activation function for nodes 3–5: $g(z) = 1/(1 + e^{-(z - \theta)})$ with $\theta = 6$; set $\beta = 0.9$.

Adjust the weights that require changing:
Δw45 ≈ 0.125: update w45 to 5.125
Δw24 ≈ 0.011: update w24 to 10.011

[Figure: same network with updated weights W13 = 10, W23 = −5, W14 = −5, W24 = 10.011, W35 = 5, W45 = 5.125. For input (0, 1): y3 = 0.000, y4 = 0.982, output y5 = 0.276]

e(t) = 1 − 0.276 = 0.724

SLIDE 41

After many training examples

Activation function for nodes 3–5: $g(z) = 1/(1 + e^{-(z - \theta)})$ with $\theta = 6$; set $\beta = 0.9$.

[Figure: network with weights W13 = 12, W23 = −11, W14 = −13, W24 = 13, W35 = 13, W45 = 13. For input (0, 1): y3 = 0.000, y4 = 0.999, output y5 = 0.999]

x1  x2  d  y
0   0   0  0.003
0   1   1  0.999
1   0   1  0.999
1   1   0  0.003

e(t) = 1 − 0.999 = 0.001

e(t) < 0.1 for all cases: we can consider these outcomes correct.

SLIDE 42

Properties of MLNs

  • Boolean functions:
    – Every boolean function f: {0,1}^k → {0,1} can be represented using a single hidden layer
  • Continuous functions:
    – Every bounded piecewise continuous function can be approximated with arbitrarily small error using one hidden layer
    – Any continuous function can be approximated to arbitrary accuracy with two hidden layers
  • Learning:
    – Not efficient (but learning is intractable, regardless of method)
    – Local minima & no guarantee of convergence

SLIDE 43

Example: Voice Recognition

  • Task: learn to discriminate between two different voices saying “Hello”
  • Data
    – Sources:
      • Steve Simpson
      • David Raubenheimer
    – Format:
      • Frequency distribution (60 bins)
      • Analogy: cochlea

SLIDE 44

Example: Voice Recognition

  • Network architecture
    – Feedforward network:
      • 60 input nodes (one for each frequency bin)
      • 6 hidden
      • 2 output (0-1 for “Steve”, 1-0 for “David”)

SLIDE 45

Example: Voice Recognition

  • Presenting the data: feedforward

[Figure: the two “Hello” samples (Steve, David) fed forward through the network]

SLIDE 46

Example: Voice Recognition

  • Presenting the data: feedforward (untrained network)

[Figure: outputs of the untrained network: (0.43, 0.26) for the Steve sample and (0.73, 0.55) for the David sample]

SLIDE 47

Example: Voice Recognition

  • Calculate the error

Steve sample: 0 – 0.43 = –0.43; 1 – 0.26 = 0.74
David sample: 1 – 0.73 = 0.27; 0 – 0.55 = –0.55

SLIDE 48

Example: Voice Recognition

  • Backpropagate the total error and adjust the weights

Steve sample: 0 – 0.43 = –0.43; 1 – 0.26 = 0.74 (total absolute error 1.17)
David sample: 1 – 0.73 = 0.27; 0 – 0.55 = –0.55 (total absolute error 0.82)

SLIDE 49

Example: Voice Recognition

  • Repeat the process (sweep) for all training pairs:
    – Present the data
    – Calculate the error
    – Backpropagate the error
    – Adjust the weights
  • Repeat the process multiple times

[Plot: total error against the number of sweeps]

SLIDE 50

Example: Voice Recognition

  • Presenting the data (trained network)

[Figure: outputs of the trained network: (0.01, 0.99) for the Steve sample and (0.99, 0.01) for the David sample]

SLIDE 51

Example: Voice Recognition

  • Results – Voice Recognition
    – Performance of the trained network:
      • Discrimination accuracy between known “Hello”s: 100%
      • Discrimination accuracy between new “Hello”s: 100%

SLIDE 52

Example: Voice Recognition

  • Results – Voice Recognition (ctnd.)
    – The network has learnt to generalise from the original data
    – Networks with different weight settings can have the same functionality
    – Trained networks ‘concentrate’ on the lower frequencies
    – The network is robust against non-functioning nodes

SLIDE 53

Applications of feed-forward nets

  • Classification, pattern recognition, diagnosis:
    – Character recognition, both printed and handwritten
    – Face recognition, speech recognition
    – Object classification by means of salient features
    – Analysis of signals to determine their nature and source
  • Regression and forecasting:
    – In particular non-linear functions and time series
  • Examples:
    – Sonar mine/rock recognition (Gorman & Sejnowski, 1988)
    – Navigation of a car (Pomerleau, 1989)
    – Stock-market prediction
    – Pronunciation (NETtalk: Sejnowski & Rosenberg, 1987)

SLIDE 54

More Neural Networks

  • Acyclic: feedforward
  • Cyclic: recurrent

SLIDE 55

Deep learning

Usually: multi-layer NNs with more than 1 hidden layer. Often: convolutional NNs:

  • the convolution operation reduces the number of parameters needed
  • no vanishing gradients
  • the network learns filters that were hand-engineered before

Source: NIPS 2015 tutorial by Y. LeCun
SLIDE 56

NN as function approximator

  • A NN can be used as a black box that represents (an approximation of) a function
  • This can be used in combination with other learning methods
  • E.g. use a NN to represent the Q-function in Q-learning, as sketched below
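A minimal sketch of the idea (not the method of any particular system named here; the architecture, feature names, and parameter values are illustrative assumptions): a function approximator maps a state to a Q-value per action, and the Q-learning target r + γ·max_a' Q(s', a') serves as the training signal. A linear approximator is shown as the simplest case; a multi-layer NN would replace `q()` and the gradient below.

```python
# Per-action weights of a linear Q-function approximator (illustrative).
weights = {"left": [0.0, 0.0], "right": [0.0, 0.0]}

def q(state, action):
    """Approximate Q-value: weighted sum of the state features."""
    return sum(w * f for w, f in zip(weights[action], state))

def q_update(state, action, reward, next_state, beta=0.1, gamma=0.9):
    """One Q-learning step: move Q(s, a) towards r + gamma * max_a' Q(s', a')."""
    target = reward + gamma * max(q(next_state, a) for a in weights)
    error = target - q(state, action)
    # Gradient step on the squared error; for this approximator the
    # gradient of Q w.r.t. each weight is just the corresponding feature.
    weights[action] = [w + beta * error * f
                       for w, f in zip(weights[action], state)]

q_update(state=(1.0, 0.5), action="right", reward=1.0, next_state=(0.0, 1.0))
print(weights["right"])  # weights nudged towards the Q-learning target
```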

SLIDE 57

NN + Q-learning


SLIDE 58

AlphaGo (DeepMind/Google)

https://www.youtube.com/watch?v=mzpW10DPHeQ

SLIDE 59

Learning NNs using Evolution

https://www.youtube.com/watch?v=S9Y_I9vY8Qw
https://www.youtube.com/watch?v=TS8QlL-3NXk

SLIDE 60

(Natural) Neurons revisited

  • Humans have $10^{10}$ neurons and $10^{15}$ dendrites. Don’t even think about creating an ANN of this size…
  • Most ANNs do not have feedback loops in the network structure (exception: recurrent NNs).
  • The ANN activation function is (probably) much simpler than what happens in a biological neuron.