SLIDE 1

IAML: Artificial Neural Networks

Chris Williams and Victor Lavrenko, School of Informatics, Semester 1

SLIDE 2

Outline

◮ Why multilayer artificial neural networks (ANNs)?
◮ Representation Power of ANNs
◮ Training ANNs: backpropagation
◮ Learning Hidden Layer Representations
◮ Examples
◮ Recurrent Neural Networks
◮ W & F sec. 6.3: multilayer perceptrons, backpropagation (details on pp 230-232 not required), radial basis function networks

SLIDE 3

Why we need multilayer networks

◮ Networks without hidden units are very limited in the input-output mappings they can represent
◮ More layers of linear units do not help; the result is still linear
◮ Fixed non-linearities φ(x) are problematic: what are good basis functions φ_j to choose in f(x) = g(∑_j w_j φ_j(x))?
◮ We get more power from multiple layers of adaptive non-linear hidden units

SLIDE 4

Artificial Neural Networks (ANNs)

◮ The field of neural networks grew up out of simple models of neurons
◮ Research was done into what networks of these neurons could achieve
◮ Neural networks proved to be a reasonable modelling tool
◮ Which is funny really, as they never were very good models of neurons... or of neural networks
◮ But when understood in terms of learning from data, they proved to be powerful

SLIDE 5

An example network with 2 hidden layers

[Figure: a feedforward network with an input layer (x), hidden layer 1, hidden layer 2, and an output layer]

SLIDE 6

◮ There can be an arbitrary number of hidden layers
◮ Each unit in the first hidden layer computes a non-linear function of the input x
◮ Each unit in a higher hidden layer computes a non-linear function of the outputs of the layer below
◮ Common choices for the hidden-layer non-linearities are the logistic function g(z) = 1/(1 + e−z) or the Gaussian function (both are sketched below)
◮ Logistic nonlinearity → multilayer perceptron (MLP)
◮ Gaussian nonlinearity → radial basis function (RBF) network, normally with only one hidden layer
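A minimal sketch (in NumPy; not part of the original slides) of the two hidden-unit non-linearities mentioned above. The function names and the RBF width parameter are illustrative assumptions.

import numpy as np

def logistic(z):
    # MLP hidden unit: g(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def gaussian_rbf(x, centre, width=1.0):
    # RBF hidden unit: response decays with distance from a centre vector
    return np.exp(-np.sum((x - centre) ** 2) / (2.0 * width ** 2))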

SLIDE 7

◮ Output units compute a linear combination of the outputs of the final hidden layer and pass it through a transfer function g()
◮ g is the identity function for a regression task (cf. linear regression)
◮ g is the logistic function for a two-class classification task (cf. logistic regression); a full forward pass is sketched below
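A sketch of a full forward pass through a network with one hidden layer, covering both output choices above. The argument names, shapes, and the "task" switch are assumptions made for illustration.

import numpy as np

def forward(x, W1, b1, W2, b2, task="regression"):
    # x: input vector; W1, b1: hidden-layer weights and biases; W2, b2: output-layer weights and biases
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))   # logistic hidden units (MLP)
    a = W2 @ h + b2                            # linear combination of final hidden-layer outputs
    if task == "regression":
        return a                               # identity transfer function (cf. linear regression)
    return 1.0 / (1.0 + np.exp(-a))            # logistic transfer function (cf. logistic regression)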

SLIDE 8

Representation Power of ANNs

◮ Boolean functions:
  ◮ Every boolean function can be represented by a network with a single hidden layer,
  ◮ but this might require a number of hidden units exponential in the number of inputs
◮ Continuous functions:
  ◮ Every bounded continuous function can be approximated, with arbitrarily small error, by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
  ◮ Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
◮ Neural networks are universal approximators.

SLIDE 9

ANN predicting 1 of 10 vowel sounds based on formants F1 and F2

Figure from Mitchell (1997)

SLIDE 10

Limitations of Representation Power Results

◮ The fact that a function is representable does not tell us how many hidden units would be required for its approximation
◮ Nor does it tell us whether it is learnable (a search problem)
◮ Nor does it say anything about how much training data would be needed to learn the function
◮ In fact universal approximation is of only limited benefit: we still need an inductive bias

SLIDE 11

Training ANNs

◮ As in linear and logistic regression, we create an error function that measures the agreement between the target y(x) and the prediction f(x)
◮ Linear regression, squared error: E = ∑_{i=1}^{n} (y_i − f(x_i))^2
◮ Logistic regression (0/1 labels): E = ∑_{i=1}^{n} y_i log f(x_i) + (1 − y_i) log(1 − f(x_i))
◮ These are both related to the log likelihood of the data under the relevant model (both are sketched in code below)
◮ For linear and logistic regression the optimization problem for w had a unique optimum; this is no longer the case for ANNs (e.g. hidden-layer neurons can be permuted)
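The two error/score functions above, written out as a NumPy sketch. The function names and the clipping guard are assumptions, not from the slides.

import numpy as np

def squared_error(y, f):
    # Linear regression: E = sum_i (y_i - f(x_i))^2
    return np.sum((y - f) ** 2)

def bernoulli_log_likelihood(y, f, eps=1e-12):
    # Logistic regression with 0/1 labels:
    # E = sum_i y_i log f(x_i) + (1 - y_i) log(1 - f(x_i))
    f = np.clip(f, eps, 1.0 - eps)    # guard against log(0)
    return np.sum(y * np.log(f) + (1 - y) * np.log(1 - f))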

SLIDE 12

Backpropagation

◮ As discussed for logistic regression, we need the gradient of E with respect to all the parameters w, i.e. g(w) = ∂E/∂w
◮ This is in fact an exercise in using the chain rule to compute derivatives; for ANNs this is given the name backpropagation (a sketch follows below)
◮ We make use of the layered structure of the net to compute the derivatives, heading backwards from the output layer to the inputs
◮ Once you have g(w), you can use your favourite optimization routines to minimize E; see the discussion of gradient descent and other methods in the Logistic Regression slides
◮ It can make sense to use a regularization penalty (e.g. λ|w|^2) to help control overfitting
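A minimal backpropagation sketch for a single training case, one hidden layer of logistic units, and a scalar regression output with squared error. The shapes, names, and the choice of a scalar output are illustrative assumptions.

import numpy as np

def backprop(x, y, W1, b1, w2, b2):
    # Forward pass
    a1 = W1 @ x + b1                     # hidden pre-activations
    h = 1.0 / (1.0 + np.exp(-a1))        # logistic hidden units
    f = w2 @ h + b2                      # identity output unit
    E = (y - f) ** 2                     # squared error for this example

    # Backward pass: chain rule, heading from the output back to the inputs
    dE_df = -2.0 * (y - f)
    dE_dw2 = dE_df * h                   # gradient w.r.t. output weights
    dE_db2 = dE_df
    dE_dh = dE_df * w2                   # propagate back through the output layer
    dE_da1 = dE_dh * h * (1.0 - h)       # derivative of the logistic: g'(a) = g(a)(1 - g(a))
    dE_dW1 = np.outer(dE_da1, x)         # gradient w.r.t. hidden weights
    dE_db1 = dE_da1
    return E, (dE_dW1, dE_db1, dE_dw2, dE_db2)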

SLIDE 13

Batch vs online

◮ Batch learning: use all patterns in the training set, and update the weights after calculating ∂E/∂θ = ∑_i ∂E_i/∂θ
◮ On-line learning: adapt the weights after each pattern presentation, using ∂E_i/∂θ (both update schemes are sketched below)
◮ Batch: more powerful optimization methods can be used
◮ Batch: easier to analyze
◮ On-line: more feasible for huge or continually growing datasets
◮ On-line: may have the ability to jump over local optima
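A sketch contrasting the two update schemes with plain gradient descent. The helper grad(theta, x, y), assumed to return ∂E_i/∂θ for one pattern, and the learning rate eta are not specified on the slides.

def batch_update(theta, X, Y, grad, eta=0.1):
    # Batch: accumulate dE/dtheta = sum_i dE_i/dtheta over the whole training set, then take one step
    g = sum(grad(theta, x, y) for x, y in zip(X, Y))
    return theta - eta * g

def online_update(theta, X, Y, grad, eta=0.1):
    # On-line: adapt the weights after each pattern presentation, using dE_i/dtheta
    for x, y in zip(X, Y):
        theta = theta - eta * grad(theta, x, y)
    return theta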

SLIDE 14

Convergence of Backpropagation

◮ Dealing with local minima: train multiple nets from different starting places, and then choose the best (or combine them in some way)
◮ Initialize the weights near zero; the initial networks are therefore near-linear
◮ Increasingly non-linear functions become possible as training progresses

SLIDE 15

Training ANNs: Summary

◮ Optimize over the vector of all weights/biases in a network
◮ All methods considered find local optima
◮ Gradient descent is simple but slow
◮ In practice, second-order methods (conjugate gradients) are used for batch learning
◮ Overfitting can be a problem

SLIDE 16

Fitting this into the general structure for learning algorithms:

◮ Define the task: classification or regression, discriminative
◮ Decide on the model structure: ANN
◮ Decide on the score function: log likelihood
◮ Decide on the optimization/search method to optimize the score function: a numerical optimization routine

SLIDE 17

Hypothesis space and Inductive Bias for ANNs

◮ Hypothesis space: if there are |w| weights and biases, H = {w | w ∈ R^{|w|}}
◮ Inductive bias: hard to characterize; it depends on the search procedure, the regularization, and how weight space spans the space of representable functions
◮ Approximate statement: smooth interpolation between data points

SLIDE 18

Learning Hidden Layer Representations

◮ Backprop can develop intermediate representations of its inputs in the hidden layers
◮ These new features will capture the properties of the input instances that are most relevant to learning the target function
◮ This ability to automatically discover useful hidden-layer representations is a key feature of ANN learning

SLIDE 19

Example 1: Neural Net Language Models

Y Bengio et al, JMLR 3, 1137-1155 (2003)

◮ Predict word w_t given the preceding words w_{t−1}, w_{t−2}, etc.
◮ A simple way is to estimate the trigram model p(w_t = c | w_{t−1} = b, w_{t−2} = a) = count(abc) / ∑_{c′} count(abc′) (a counting sketch is given below)
◮ We can't use a bigger context due to sparse-data problems
◮ But this method uses no sharing across related words; we want a feature-based representation, so that e.g. cat and dog may share some features
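A toy counting sketch of the trigram estimate above. The tokenized-input assumption and the function name are illustrative, not from the slides.

from collections import Counter

def trigram_prob(tokens, a, b, c):
    # p(w_t = c | w_t-1 = b, w_t-2 = a) = count(abc) / sum_c' count(abc')
    counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
    context_total = sum(n for (x, y, _), n in counts.items() if (x, y) == (a, b))
    if context_total == 0:
        return 0.0    # unseen context: the sparse-data problem noted above
    return counts[(a, b, c)] / context_total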

SLIDE 20

Figure credit: Bengio et al., 2003

SLIDE 21

◮ Learned distributed encoding of each context word
◮ These encodings are transformed by a hidden layer, followed by
◮ a softmax distribution over all possible words
◮ Predictive performance is measured by perplexity (the geometric average of 1/p(w_t | context)); a sketch follows below
◮ The neural network is about 24% better on the Brown corpus and 8% better on the AP corpus than the best n-gram results
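A sketch (assumed form) of perplexity as the geometric average of 1/p(w_t | context), computed in log space for numerical stability.

import numpy as np

def perplexity(probs):
    # probs: the model's probabilities p(w_t | context) for each word of a test sequence
    probs = np.asarray(probs, dtype=float)
    return np.exp(-np.mean(np.log(probs)))   # geometric mean of 1/p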

SLIDE 22

Example 2: Le Net

e.g. LeCun and Bengio, 1995

◮ The task is to recognize handwritten digits
◮ "Le Net" is a multilayer backprop net which has many hidden layers
◮ Alternation of convolutional features, followed by subsampling
◮ The final output is a softmax over the 10 classes

Figure credit: LeCun et al., 1995

SLIDE 23

◮ The convolutional approach allows the net to identify certain features, even if they have been shifted in the image
◮ Subsampling affords a small amount of translational invariance at each stage
◮ Convolutional nets give the best performance on the MNIST dataset (the best is now 0.39% error)

SLIDE 24

Recurrent Neural Networks

Connectivity does not have to be feedforward; there can be directed cycles. This can give rise to richer behaviour:

◮ The network can oscillate: good for motor control?
◮ It can converge to a point attractor: good for classification?
◮ It can behave chaotically: but this is usually a bad idea for information processing
◮ It can use its activities as hidden state, to remember things for a long time (a minimal recurrent step is sketched below)
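A minimal sketch of a recurrent step, where the hidden activities serve as state carried forward in time. The tanh non-linearity, the shapes, and the argument names are assumptions; the slides do not specify them.

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state depends on the current input and the previous hidden state (a directed cycle through time)
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

def run_rnn(xs, h0, W_xh, W_hh, b_h):
    # Unrolling over a sequence: the final h can remember information from much earlier inputs
    h = h0
    for x_t in xs:
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)
    return h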

SLIDE 25

[Figure: a two-unit recurrent network (units V1, V2) unrolled through time (V1′, V2′, V1′′, V2′′), with the weights w11, w12, w21, w22 replicated at each step]

◮ Recurrent networks can also be trained using backpropagation

SLIDE 26

ANNs: Summary

◮ Artificial neural networks are a powerful nonlinear modelling tool for classification and regression
◮ They are trained by optimization methods making use of the backpropagation algorithm to compute derivatives
◮ Local optima are present in the optimization, cf. linear and logistic regression (and kernelized versions thereof, e.g. the SVM)
◮ They have the ability to automatically discover useful hidden-layer representations
