slide-1
SLIDE 1

Multi-Layer Networks

  • M. Soleymani

Deep Learning, Sharif University of Technology, Spring 2019. Most slides have been adapted from: Bhiksha Raj, 11-785, CMU 2019; Fei-Fei Li lectures, cs231n, Stanford 2017; and some from Hinton, “NN for Machine Learning”, Coursera, 2015.

1

slide-2
SLIDE 2

Reasons to study neural computation

  • Neuroscience: To understand how the brain actually works.

– It's very big and very complicated and made of stuff that dies when you poke it. So we need to use computer simulations.

  • AI: To solve practical problems by using novel learning algorithms inspired by the brain

– Learning algorithms can be very useful even if they are not how the brain actually works.

2

slide-3
SLIDE 3

3

slide-4
SLIDE 4

A typical cortical neuron

  • Gross physical structure:

– There is one axon that branches
– There is a dendritic tree that collects input from other neurons.

  • Axons typically contact dendritic trees at synapses

– A spike of activity in the axon causes charge to be injected into the post-synaptic neuron.

  • Spike generation:

– There is an axon hillock that generates outgoing spikes whenever enough charge has flowed in at synapses to depolarize the cell membrane.

4

slide-5
SLIDE 5

Binary threshold neurons

  • McCulloch-Pitts (1943): influenced Von Neumann.

– First compute a weighted sum of the inputs.
– Send out a spike of activity if the weighted sum exceeds a threshold.
– McCulloch and Pitts thought that each spike is like the truth value of a proposition, and each neuron combines truth values to compute the truth value of another proposition!

(Figure: inputs input₁, input₂, …, input_N with weights w₁, …, w_N feed Σᵢ wᵢ xᵢ, which is passed through g.)

g: Activation function

5

slide-6
SLIDE 6

A better figure

6

  • A threshold unit

– “Fires” if the weighted sum of inputs and the “bias” T is positive

(Figure: a threshold unit with inputs x₁, …, x_N, weights w₁, …, w_N, and bias −T.)

z = Σᵢ wᵢ xᵢ − T

y = 1 if z ≥ 0, else 0
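A minimal NumPy sketch of this threshold unit, following the equations above (the function and variable names are illustrative, not from the slides):

```python
import numpy as np

def threshold_unit(x, w, T):
    """Fires (returns 1) iff the weighted sum minus the threshold is non-negative."""
    z = np.dot(w, x) - T
    return 1 if z >= 0 else 0

# Example: unit weights and threshold 1.5 implement a 2-input AND
print(threshold_unit(np.array([1, 1]), np.array([1.0, 1.0]), 1.5))  # 1
print(threshold_unit(np.array([1, 0]), np.array([1.0, 1.0]), 1.5))  # 0
```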

slide-7
SLIDE 7

McCulloch-Pitts neuron: binary threshold

7

(Figure: inputs x₁, …, x_N, weights w₁, …, w_N, output z; a second, equivalent network uses a constant input 1 with weight b.)

z = 1 if Σᵢ wᵢ xᵢ ≥ T, else 0        (T: activation threshold)

bias: b = −T

Equivalent to binary McCulloch-Pitts neuron

slide-8
SLIDE 8

Neural nets and the brain

8

  • Neural nets are composed of networks of computational models of

neurons called perceptrons

(Figure: a perceptron with inputs x₁, …, x_N, weights w₁, …, w_N, and bias −T.)

slide-9
SLIDE 9

The perceptron

9

  • A threshold unit

– “Fires” if the weighted sum of inputs exceeds a threshold
– Electrical engineers will call this a threshold gate

  • A basic unit of Boolean circuits

y = 1 if Σᵢ wᵢ xᵢ ≥ T, else 0

(Figure: a perceptron with inputs x₁, …, x_N, weights w₁, …, w_N, and bias −T.)

slide-10
SLIDE 10

The “soft” perceptron (logistic)

10

(Figure: inputs x₁, …, x_N with weights w₁, …, w_N and bias −θ.)

z = Σᵢ wᵢ xᵢ − θ

y = 1 / (1 + exp(−z))

  • A “squashing” function instead of a threshold at the output

– The sigmoid “activation” replaces the threshold

  • Activation: The function that acts on the weighted combination of

inputs (and threshold)


slide-11
SLIDE 11

Sigmoid neurons

  • These give a real-valued output that is a smooth and bounded

function of their total input.

  • Typically they use the logistic function

– They have nice derivatives.

11

slide-12
SLIDE 12

Other “activations”

12

(Figure: a perceptron with inputs x₁, …, x_N, weights w₁, …, w_N, and bias b.)

  • Does not always have to be a squashing function

– We will hear more about activations later

  • We will continue to assume a “threshold” activation in this lecture

tanh: tanh(z)

softplus: log(1 + eᶻ)

sigmoid: 1 / (1 + exp(−z))
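For concreteness, a small NumPy sketch of the activations named on this slide (the function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                  # squashes to (-1, 1)

def softplus(z):
    return np.log(1.0 + np.exp(z))     # smooth version of max(0, z)

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z))
print(tanh(z))
print(softplus(z))
```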

slide-13
SLIDE 13

Adjusting weights

  • Types of single layer networks:

– ADALINE (Widrow and Hoff, 1960)
– Perceptron (Rosenblatt, 1962)

14

slide-14
SLIDE 14

A little bit of history: Widrow

15

  • First known attempt at an analytical solution to training a single layer

network

  • Now famous as the LMS algorithm

– Used everywhere
– Also known as the “delta rule”

slide-15
SLIDE 15

History: ADALINE

16

  • Adaptive linear element

(Widrow and Hoff, 1960)

  • Weighted sum on inputs

and bias passed through a thresholding function

  • ADALINE differs in the learning rule

Using 1-extended vector notation to account for bias

slide-16
SLIDE 16

History: Learning in ADALINE

17

  • During learning, minimize the squared error, treating the output z as real-valued

  • The desired output y is still binary!

Err(w) = ½ (y − z)²

slide-17
SLIDE 17

History: Learning in ADALINE

18

  • If we just have a single training input, the gradient descent update rule is

Err(w) = ½ (y − z)²

Err⁽ⁱ⁾(w) = ½ (y⁽ⁱ⁾ − z⁽ⁱ⁾)² = ½ (y⁽ⁱ⁾ − wᵀx⁽ⁱ⁾)²

w_{k+1} = w_k − η ∇Err⁽ⁱ⁾(w_k)  ⇒  w_{k+1} = w_k + η (y⁽ⁱ⁾ − wᵀx⁽ⁱ⁾) x⁽ⁱ⁾

slide-18
SLIDE 18

The ADALINE learning rule

19

  • Online learning rule
  • After each input x⁽ⁱ⁾ that has target (binary) output y⁽ⁱ⁾, compute the error and update:

ε = y⁽ⁱ⁾ − z⁽ⁱ⁾

w_{k+1} = w_k + η ε x⁽ⁱ⁾

  • This is the famous delta rule

– Also called the LMS update rule
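A sketch of one online LMS/delta-rule pass in NumPy, following the update above (the learning rate, toy data, and the 1-extension for the bias are illustrative choices, not from the slides):

```python
import numpy as np

def adaline_epoch(X, Y, w, eta=0.1):
    """One online LMS pass: for each (x, y), w <- w + eta * (y - w.x) * x."""
    for x_i, y_i in zip(X, Y):
        z = w @ x_i                     # real-valued output
        w = w + eta * (y_i - z) * x_i   # delta-rule update
    return w

# 1-extended inputs: the last component (always 1) plays the role of the bias
X = np.array([[0., 0., 1.], [0., 1., 1.], [1., 0., 1.], [1., 1., 1.]])
Y = np.array([0., 0., 0., 1.])          # binary targets (AND)
w = np.zeros(3)
for _ in range(100):
    w = adaline_epoch(X, Y, w)
print(w, X @ w)                         # real outputs approach the binary targets
```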

slide-19
SLIDE 19

Perceptron

21

z = Σᵢ wᵢ xᵢ + b

(Figure: a perceptron over inputs x₁, …, x_N with bias b, and its decision boundary in the (x1, x2) plane.)

  • Learn this function

– A step function across a hyperplane

slide-20
SLIDE 20

Learning the perceptron

22

  • Given a number of input-output pairs, learn the weights and bias

– Learn W = [w₁, …, w_N] and b, given several (X, y) pairs

(Figure: a perceptron over inputs x₁, …, x_N with weights w₁, …, w_N and bias b.)

y = 1 if Σᵢ wᵢ xᵢ + b ≥ 0, else 0

slide-21
SLIDE 21

Restating the perceptron

23

(Figure: inputs x1, x2, x3, …, xd plus a constant input x_{d+1} = 1 with weight w_{d+1}.)

  • Restating the perceptron equation by adding another dimension to X

y = 1 if Σ_{i=1}^{d+1} wᵢ xᵢ ≥ 0, else 0,   where x_{d+1} = 1

  • Note that the boundary Σ_{i=1}^{d+1} wᵢ xᵢ = 0 is now a hyperplane through the origin

slide-22
SLIDE 22

The Perceptron Problem

24

  • Find the hyperplane Σ_{i=1}^{d+1} wᵢ xᵢ = 0 that perfectly separates the two groups of points

slide-23
SLIDE 23

The Perceptron Problem

  • Find the hyperplane Σ_{i=1}^{d+1} wᵢ xᵢ = 0 that perfectly separates the two groups of points

– Note: w = (w₁, w₂, …, w_{d+1}) is a vector that is orthogonal to the hyperplane

  • In fact, the equation for the hyperplane itself means “the set of all Xs that are orthogonal to W” (Σ_{i=1}^{d+1} wᵢ xᵢ = wᵀx = 0)

25

slide-24
SLIDE 24

Perceptron Learning Algorithm

  • Given N training instances (x⁽¹⁾, y⁽¹⁾), (x⁽²⁾, y⁽²⁾), …, (x⁽ᴺ⁾, y⁽ᴺ⁾)

– y⁽ⁱ⁾ = +1 or −1

26

  • Initialize w
  • Cycle through the training instances
  • While more classification errors

– For i = 1 … N_train:

O(x⁽ⁱ⁾) = sign(wᵀx⁽ⁱ⁾)

  • If O(x⁽ⁱ⁾) ≠ y⁽ⁱ⁾

w = w + y⁽ⁱ⁾ x⁽ⁱ⁾
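A direct NumPy transcription of this loop, assuming labels in {+1, −1} and 1-extended inputs (the epoch cap is my own safeguard, not part of the algorithm):

```python
import numpy as np

def perceptron_train(X, Y, max_epochs=100):
    """Perceptron learning: on each mistake, w <- w + y_i * x_i."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, Y):
            if np.sign(w @ x_i) != y_i:   # sign(0) = 0 also counts as a mistake
                w = w + y_i * x_i
                mistakes += 1
        if mistakes == 0:                 # converged: every instance classified correctly
            break
    return w

# Linearly separable toy data, 1-extended for the bias term
X = np.array([[2., 1., 1.], [1., 3., 1.], [-1., -2., 1.], [-2., -1., 1.]])
Y = np.array([1., 1., -1., -1.])
w = perceptron_train(X, Y)
print(w, np.sign(X @ w))                  # predictions match Y
```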

slide-25
SLIDE 25

Perceptron Algorithm: Summary

27

  • Cycle through the training instances
  • Only update w on misclassified instances
  • If instance misclassified:

– If instance is positive class: w = w + x

– If instance is negative class: w = w − x

slide-26
SLIDE 26

The perceptron convergence procedure

  • Perceptron trains binary output neurons as classifiers
  • Pick training cases (until convergence):

– If the output unit is correct, leave its weights alone.
– If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
– If the output unit incorrectly outputs a 1, subtract the input vector from the weight vector.

  • This is guaranteed to find a set of weights that gets the right answer

for all the training cases if any such set exists.

28

slide-27
SLIDE 27

A Simple Method: The Perceptron Algorithm

29

  • Initialize: Randomly initialize the hyperplane

– i.e. randomly initialize the normal vector w

  • Classification rule: sign(wᵀx)

– Vectors on the same side of the hyperplane as w will be assigned the +1 class, and those on the other side will be assigned −1

  • The random initial plane will make mistakes

(Figure: +1 points in blue, −1 points in red, and the random initial hyperplane with normal vector w.)

slide-28
SLIDE 28

Perceptron Algorithm

30

(Figure: +1 points in blue, −1 points in red; initialization of w.)

slide-29
SLIDE 29

Perceptron Algorithm

31

(Figure: +1 points in blue, −1 points in red; a misclassified positive instance.)

slide-30
SLIDE 30

Perceptron Algorithm

32

Updated weight vector: Misclassified positive instance, add it to W

(Figure: old w and new w after the update.)

slide-31
SLIDE 31

Perceptron Algorithm

33

(Figure: the updated w, with +1 points in blue and −1 points in red.)
slide-32
SLIDE 32

Perceptron Algorithm

34

Updated weight vector: Misclassified negative instance, subtract it from w

(Figure: old w and new w after the update.)

slide-33
SLIDE 33

Perceptron Algorithm

35

(Figure: the final w.)

Perfect classification, no more updates

slide-34
SLIDE 34

Convergence of Perceptron Algorithm

36

  • Guaranteed to converge if classes are linearly separable

– After no more than (R/δ)² misclassifications

  • Specifically when w is initialized to 0

– R is the length of the longest training point
– δ is the best-case closest distance of a training point from the classifier

  • Same as the margin in an SVM

– Intuitively, it takes many increments of size δ to undo an error resulting from a step of size R

slide-35
SLIDE 35

Perceptron Algorithm

37

δ is the best-case margin; R is the length of the longest vector

(Figure: the separating hyperplane with margin δ and data radius R.)

slide-36
SLIDE 36

Adjusting weights

38

  • Weight update for a training pair (x⁽ⁱ⁾, y⁽ⁱ⁾):

– Perceptron: If sign(wᵀx⁽ⁱ⁾) ≠ y⁽ⁱ⁾ then Δw = y⁽ⁱ⁾ x⁽ⁱ⁾, else Δw = 0
– ADALINE: Δw = η (y⁽ⁱ⁾ − wᵀx⁽ⁱ⁾) x⁽ⁱ⁾

  • Widrow-Hoff, LMS, or delta rule

w_{k+1} = w_k − η ∇E⁽ⁱ⁾(w_k),    E⁽ⁱ⁾(w) = (y⁽ⁱ⁾ − wᵀx⁽ⁱ⁾)²

slide-37
SLIDE 37

How to learn the weights: multi class example

40

slide-38
SLIDE 38

How to learn the weights: multi class example

  • If correct: no change
  • If wrong:

– lower the score of the wrong answer (by subtracting the input from the weight vector of the wrong answer)
– raise the score of the target (by adding the input to the weight vector of the target class), as in the sketch below

41
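A small sketch of this multi-class update in NumPy, with one weight vector per class (the names and toy data are illustrative):

```python
import numpy as np

def multiclass_perceptron_step(W, x, target):
    """W has one row of weights per class. If the highest-scoring class is wrong,
    subtract x from the wrong class's row and add x to the target class's row."""
    pred = int(np.argmax(W @ x))
    if pred != target:
        W[pred] -= x        # lower the score of the wrong answer
        W[target] += x      # raise the score of the target class
    return W

W = np.zeros((3, 2))
X = np.array([[1., 0.], [0., 1.], [-1., -1.]])
labels = [0, 1, 2]
for _ in range(5):
    for x, t in zip(X, labels):
        W = multiclass_perceptron_step(W, x, t)
print(np.argmax(W @ X.T, axis=0))   # recovers [0 1 2]
```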


slide-44
SLIDE 44

Single layer networks as template matching

  • Weights for each class as a template (or sometimes also called a

prototype) for that class.

– The winner is the most similar template.

  • The ways in which hand-written digits vary are much too complicated

to be captured by simple template matches of whole shapes.

  • To capture all the allowable variations of a digit we need to learn the

features that it is composed of.

47

slide-45
SLIDE 45

The history of perceptrons

  • They were popularised by Frank Rosenblatt in the early 1960’s.

– They appeared to have a very powerful learning algorithm. – Lots of grand claims were made for what they could learn to do.

  • In 1969, Minsky and Papert published a book called “Perceptrons”

that analyzed what they could do and showed their limitations.

– Many people thought these limitations applied to all neural network models.

48

slide-46
SLIDE 46

What binary threshold neurons cannot do

  • A binary threshold output unit cannot even tell if two single bit

features are the same!

  • A geometric view of what binary threshold neurons cannot do
  • The positive and negative cases cannot be separated by a plane

49

slide-47
SLIDE 47

What binary threshold neurons cannot do

  • Positive cases (same): (1,1)->1; (0,0)->1
  • Negative cases (different): (1,0)->0; (0,1)->0
  • The four input-output pairs give four inequalities that are impossible

to satisfy:

– w1 + w2 ≥ θ
– 0 ≥ θ
– w1 < θ
– w2 < θ

50

slide-48
SLIDE 48

Discriminating simple patterns under translation with wrap-around

  • Suppose we just use pixels as the

features.

  • A binary

decision unit cannot discriminate between patterns with the same number of on pixels

– if the patterns can translate with wrap-around!

51

slide-49
SLIDE 49

Sketch of a proof

  • For pattern A, use training cases in all possible translations.

– Each pixel will be activated by 4 different translations of pattern A. – So the total input received by the decision unit over all these patterns will be four times the sum of all the weights.

  • For pattern B, use training cases in all possible translations.

– Each pixel will be activated by 4 different translations of pattern B. – So the total input received by the decision unit over all these patterns will be four times the sum of all the weights.

  • But to discriminate correctly, every single case of pattern A must provide

more input to the decision unit than every single case of pattern B.

  • This is impossible if the sums over cases are the same.

52

slide-50
SLIDE 50

Networks with hidden units

  • Networks without hidden units are very limited in the input-output

mappings they can learn to model.

– More layers of linear units do not help. It's still linear.
– Fixed output non-linearities are not enough.

  • We need multiple layers of adaptive, non-linear hidden units. But

how can we train such nets?

53

slide-51
SLIDE 51

The multi-layer perceptron

54

  • A network of perceptrons

– Generally “layered”

slide-52
SLIDE 52

Feed-forward neural networks

  • Also called Multi-Layer Perceptron (MLP)

55

slide-53
SLIDE 53

MLP with single hidden layer

56

  • Two-layer MLP (Number of layers of adaptive weights is counted)

o_k(x) = ψ( Σ_{j=0}^{M} w_{jk}^{[2]} z_j )

⇒ o_k(x) = ψ( Σ_{j=0}^{M} w_{jk}^{[2]} φ( Σ_{i=0}^{d} w_{ij}^{[1]} x_i ) )

(Figure: input layer x₀ = 1, x₁, …, x_d; hidden layer z₀ = 1, z₁, …, z_M with activation φ; output layer o₁, …, o_L with activation ψ; first-layer weights w_{ij}^{[1]}, second-layer weights w_{jk}^{[2]}; i = 0, …, d; j = 1, …, M; k = 1, …, L.)
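A sketch of this forward computation in NumPy; the sigmoid is used for both φ and ψ purely for illustration, and the shapes and names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, W2, phi=sigmoid, psi=sigmoid):
    """Two-layer MLP: o = psi(W2 @ [1, phi(W1 @ [1, x])])."""
    x_ext = np.concatenate(([1.0], x))   # bias input x0 = 1
    z = phi(W1 @ x_ext)                  # hidden activations z1..zM
    z_ext = np.concatenate(([1.0], z))   # hidden bias z0 = 1
    return psi(W2 @ z_ext)               # outputs o1..oL

d, M, L = 3, 4, 2                        # input, hidden, and output sizes
rng = np.random.default_rng(0)
W1 = rng.normal(size=(M, d + 1))         # first-layer weights (bias column included)
W2 = rng.normal(size=(L, M + 1))         # second-layer weights (bias column included)
print(mlp_forward(rng.normal(size=d), W1, W2))
```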

slide-54
SLIDE 54

Beyond linear models

57

f = W x        f = W⁽²⁾ φ(W⁽¹⁾ x)

slide-55
SLIDE 55

Beyond linear models

58

f = W x        f = W⁽²⁾ φ(W⁽¹⁾ x)        f = W⁽³⁾ φ(W⁽²⁾ φ(W⁽¹⁾ x))

slide-56
SLIDE 56

Defining “depth”

60

  • What is a “deep” network
slide-57
SLIDE 57

Deep Structures

61

  • In any directed network of computational elements with input source

nodes and output sink nodes, “depth” is the length of the longest path from a source to a sink

  • “Deep”: Depth > 2
  • Left: Depth =2.

Right: Depth =3

slide-58
SLIDE 58

The multi-layer perceptron

63

N.Net

  • Inputs are real or Boolean stimuli
  • Outputs are real or Boolean values

– Can have multiple outputs for a single input

  • What can this network compute?

– What kinds of input/output relationships can it model?

slide-59
SLIDE 59

MLPs approximate functions

64

(Figures: a small threshold-unit MLP over Boolean inputs X, Y, Z, and an MLP over a real input x whose hidden units h₁, …, h_n compose a real-valued output.)

  • MLPs can compose Boolean functions
  • MLPs can compose real-valued functions
  • What are the limitations?
slide-60
SLIDE 60

Multi-layer Perceptrons as universal Boolean functions

65

slide-61
SLIDE 61

The perceptron as a Boolean gate

67

(Figure: e.g. X AND Y as a perceptron with weights 1, 1 and threshold 2; NOT X with weight −1; X OR Y with weights 1, 1 and threshold 1.)

  • A perceptron can model any simple binary Boolean gate

slide-62
SLIDE 62

Perceptron as aBoolean gate

68

(Figure: weights +1 on X1 .. XL, weights −1 on XL+1 .. XN, threshold L.)

Will fire only if X1 .. XL are all 1 and XL+1 .. XN are all 0

  • The universal AND gate

– AND any number of inputs

  • Any subset of which may be negated
slide-63
SLIDE 63

Perceptron as aBoolean gate

69

(Figure: weights +1 on X1 .. XL, weights −1 on XL+1 .. XN, threshold L − N + 1.)

Will fire only if any of X1 .. XL are 1 or any of XL+1 .. XN are 0

  • The universal OR gate

– OR any number of inputs

  • Any subset of which may be negated
slide-64
SLIDE 64

Perceptron as aBoolean gate

70

(Figure: weights 1 on all inputs, threshold K.)

Will fire only if at least K inputs are 1

  • Generalized majority gate

– Fire if at least K inputs are of the desired polarity

slide-65
SLIDE 65

Perceptron as aBoolean gate

71

Will fire only if the total number of X1 .. XL that are 1 or XL+1 .. XN that are 0 is at least K

  • Generalized majority gate

– Fire if at least K inputs are of the desired polarity

(Figure: weights +1 on X1 .. XL, weights −1 on XL+1 .. XN, threshold L − N + K.)
slide-66
SLIDE 66

The perceptron is not enough

72

X

? ? ?

Y

  • Cannot compute an XOR
slide-67
SLIDE 67

Multi-layer perceptron XOR

73

(Figure: XOR from two hidden threshold units over X and Y, e.g. X OR Y (threshold 1) and X AND Y (threshold 2), combined with weights +1 and −1 at the output unit.)

Hidden Layer

  • An XOR takes three perceptrons
slide-68
SLIDE 68

Multi-layer perceptron XOR

74

  • With 2 neurons

– 5 weights and two thresholds

(Figure: a hidden unit H fires if X + Y ≥ 1.5; the output fires if X + Y − 2H ≥ 0.5.)
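A quick check of this two-neuron construction; the thresholds 1.5 and 0.5 and the weight −2 on the hidden unit are as read off the figure, so treat them as one valid choice rather than the only one:

```python
def step(z):
    return int(z >= 0)

def xor_net(x, y):
    """XOR with one hidden threshold unit h and one output threshold unit."""
    h = step(x + y - 1.5)              # h = AND(x, y)
    return step(x + y - 2 * h - 0.5)   # fires iff exactly one input is 1

for x in (0, 1):
    for y in (0, 1):
        print(x, y, "->", xor_net(x, y))
```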

slide-69
SLIDE 69

Multi-layer perceptron

75

(Figure: a multi-layer threshold network over Boolean inputs X, Y, Z computing a more complex Boolean function.)

  • MLPs can compute more complex Boolean functions
  • MLPs can compute any Boolean function

– Since they can emulate individual gates

  • MLPs are universal Boolean functions
slide-70
SLIDE 70

MLP as Boolean Functions

76

(Figure: a multi-layer threshold network over Boolean inputs X, Y, Z.)

  • MLPs are universal Boolean functions

– Any function over any number of inputs and any number of outputs

  • But how many “layers” will they need?
slide-71
SLIDE 71

How many layers for aBoolean MLP?

77

(Truth table over inputs X1, X2, X3, X4, X5.)

Truth table shows all input combinations for which output is 1

  • A Boolean function is just a truth table
slide-72
SLIDE 72

How many layers for aBoolean MLP?

78

(Truth table over inputs X1, X2, X3, X4, X5.)

Truth table shows all input combinations for which output is 1

  • Expressed in disjunctive normal form

y = X̄₁X̄₂X₃X₄X̄₅ + X̄₁X₂X̄₃X₄X₅ + X̄₁X₂X₃X̄₄X̄₅ + X₁X̄₂X̄₃X̄₄X₅ + X₁X̄₂X₃X₄X₅ + X₁X₂X̄₃X̄₄X₅

slide-73
SLIDE 73

How many layers for aBoolean MLP?

79

(Truth table over inputs X1, X2, X3, X4, X5.)

Truth table shows all input combinations for which output is 1

  • Expressed in disjunctive normal form

y = X̄₁X̄₂X₃X₄X̄₅ + X̄₁X₂X̄₃X₄X₅ + X̄₁X₂X₃X̄₄X̄₅ + X₁X̄₂X̄₃X̄₄X₅ + X₁X̄₂X₃X₄X₅ + X₁X₂X̄₃X̄₄X₅

(Figure: a one-hidden-layer network over X₁ … X₅ implementing this DNF, one hidden unit per term.)

slide-74
SLIDE 74

How many layers for aBoolean MLP?

80

X

1

X

2

X

3

X

4

X

5

Y 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Truth Table

Truth table shows all input combinations for which output is 1

  • Expressed in disjunctive normal form

y = 𝑌 ”&𝑌 ”'X0X–𝑌 ”— +𝑌 ”&X'𝑌 ”0X–X— +𝑌 ”&X'X0𝑌 ”–𝑌 ”—+ X&𝑌 ”'𝑌 ”0𝑌 ”–X— + X&𝑌 ”'X0X–X— + X&X'𝑌 ”0𝑌 ”–X—

X' X0 X– X— X&

slide-75
SLIDE 75

How many layers for aBoolean MLP?

81

X

1

X

2

X

3

X

4

X

5

Y 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Truth Table

Truth table shows all input combinations for which output is 1

  • Expressed in disjunctive normal form

y = 𝑌 ”&𝑌 ”'X0X–𝑌 ”— +𝑌 ”&X'𝑌 ”0X–X— +𝑌 ”&X'X0𝑌 ”–𝑌 ”—+ X&𝑌 ”'𝑌 ”0𝑌 ”–X— + X&𝑌 ”'X0X–X— + X&X'𝑌 ”0𝑌 ”–X—

X' X0 X– X— X&

slide-76
SLIDE 76

How many layers for aBoolean MLP?

82

X

1

X

2

X

3

X

4

X

5

Y 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Truth Table

Truth table shows all input combinations for which output is 1

  • Expressed in disjunctive normal form

y = 𝑌 ”&𝑌 ”'X0X–𝑌 ”— +𝑌 ”&X'𝑌 ”0X–X— +𝑌 ”&X'X0𝑌 ”–𝑌 ”—+ X&𝑌 ”'𝑌 ”0𝑌 ”–X— + X&𝑌 ”'X0X–X— + X&X'𝑌 ”0𝑌 ”–X—

X' X0 X– X— X&

slide-77
SLIDE 77

How many layers for aBoolean MLP?

83

X

1

X

2

X

3

X

4

X

5

Y 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Truth Table

Truth table shows all input combinations for which output is 1

  • Expressed in disjunctive normal form

y = 𝑌 ”&𝑌 ”'X0X–𝑌 ”— +𝑌 ”&X'𝑌 ”0X–X— +𝑌 ”&X'X0𝑌 ”–𝑌 ”—+ X&𝑌 ”'𝑌 ”0𝑌 ”–X— + X&𝑌 ”'X0X–X— + X&X'𝑌 ”0𝑌 ”–X—

X' X0 X– X— X&

slide-78
SLIDE 78

How many layers for aBoolean MLP?

84

X

1

X

2

X

3

X

4

X

5

Y 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Truth Table

Truth table shows all input combinations for which output is 1

  • Expressed in disjunctive normal form

y = 𝑌 ”&𝑌 ”'X0X–𝑌 ”— +𝑌 ”&X'𝑌 ”0X–X— +𝑌 ”&X'X0𝑌 ”–𝑌 ”—+ X&𝑌 ”'𝑌 ”0𝑌 ”–X— + X&𝑌 ”'X0X–X— + X&X'𝑌 ”0𝑌 ”–X—

X' X0 X– X— X&

slide-79
SLIDE 79

How many layers for aBoolean MLP?

85

X

1

X

2

X

3

X

4

X

5

Y 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 Truth Table

Truth table shows all input combinations for which output is 1

  • Expressed in disjunctive normal form

y = 𝑌 ”&𝑌 ”'X0X–𝑌 ”— +𝑌 ”&X'𝑌 ”0X–X— +𝑌 ”&X'X0𝑌 ”–𝑌 ”—+ X&𝑌 ”'𝑌 ”0𝑌 ”–X— + X&𝑌 ”'X0X–X— + X&X'𝑌 ”0𝑌 ”–X—

X' X0 X– X— X&

slide-80
SLIDE 80

How many layers for aBoolean MLP?

86

(Truth table over inputs X1, X2, X3, X4, X5.)

  • Any truth table can be expressed in this manner!
  • A one-hidden-layer MLP is a Universal Boolean Function
  • But what is the largest number of perceptrons required in the

single hidden layer for an N-input-variable function?

y = X̄₁X̄₂X₃X₄X̄₅ + X̄₁X₂X̄₃X₄X₅ + X̄₁X₂X₃X̄₄X̄₅ + X₁X̄₂X̄₃X̄₄X₅ + X₁X̄₂X₃X₄X₅ + X₁X₂X̄₃X̄₄X₅
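A sketch of the construction implied here, using plain NumPy threshold units: one hidden AND unit per truth-table row whose output is 1, with an OR unit on top (the example minterms are arbitrary, not the table from the slide):

```python
import numpy as np

def dnf_mlp(minterms):
    """Build one-hidden-layer threshold-MLP weights from the input patterns
    (minterms) on which the target function is 1."""
    W1, b1 = [], []
    for m in minterms:                               # one AND unit per minterm
        W1.append([1.0 if bit else -1.0 for bit in m])
        b1.append(-float(sum(m)))                    # fires only on exactly this pattern
    return np.array(W1), np.array(b1)

def evaluate(x, W1, b1):
    hidden = (W1 @ x + b1 >= 0).astype(int)          # hidden AND units
    return int(hidden.sum() >= 1)                    # output OR unit

minterms = [(1, 0, 1), (0, 1, 1), (1, 1, 0)]         # rows of a toy truth table with y = 1
W1, b1 = dnf_mlp(minterms)
for x in [(1, 0, 1), (0, 0, 1), (1, 1, 0), (1, 1, 1)]:
    print(x, evaluate(np.array(x, dtype=float), W1, b1))
```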

slide-81
SLIDE 81

Worst case

87

  • Which truth tables cannot be reduced further simply?
  • Largest width needed for a single-layer Boolean network on N inputs

– Worst case: 2^(N−1)

  • Example: Parity function

(Karnaugh map of X ⊕ Y ⊕ Z ⊕ W, rows XY ∈ {00, 01, 10, 11} and columns WZ ∈ {00, 01, 10, 11}: the eight 1s form a checkerboard, so no adjacent cells can be grouped.)

slide-82
SLIDE 82

Boolean functions

88

  • Input: N Boolean variables
  • How many neurons are required in a one-hidden-layer MLP?
  • More compact representation of a Boolean function

– “Karnaugh Map”

  • representing the truth table as a grid
  • Grouping adjacent boxes to reduce the complexity of the Disjunctive Normal Form (DNF)

formula

(Karnaugh map grid over rows XY and columns WZ.)

slide-83
SLIDE 83

How many neurons in the hidden layer?

89

  • X̄ȲW̄Z̄ ∨ X̄YW̄Z̄ ∨ XȲW̄Z̄ ∨ XYW̄Z̄ ∨ X̄ȲWZ ∨ XȲWZ̄ ∨ XYWZ̄ ∨ XȲWZ

  • reduces to W̄Z̄ ∨ ȲWZ ∨ XWZ̄

(Karnaugh map grid over rows XY and columns WZ showing the grouped cells.)

slide-84
SLIDE 84

Width of a deep MLP

92

(Karnaugh maps over YZ × WX and YZ × UV.)

slide-85
SLIDE 85

Using deep network: Parity function on N inputs

93

  • Simple MLP with one hidden layer:

2^(N−1) hidden units;  (N + 2)·2^(N−1) + 1 weights and biases

slide-86
SLIDE 86

Using deep network: Parity function on N inputs

94

  • Simple MLP with one hidden layer:
  • f = X₁ ⊕ X₂ ⊕ ⋯ ⊕ X_N

(Figure: a deep pairwise-XOR network over X₁, X₂, X₃, X₄, ….)

Deep (pairwise) network: 3(N − 1) nodes, 9(N − 1) weights and biases.  One-hidden-layer MLP: 2^(N−1) hidden units, (N + 2)·2^(N−1) + 1 weights and biases.

The actual number of parameters in a network is the number that really matters in software or hardware implementations

slide-87
SLIDE 87

A better architecture

95

  • Only requires 2 log(N) layers
  • f = X₁ ⊕ X₂ ⊕ X₃ ⊕ X₄ ⊕ X₅ ⊕ X₆ ⊕ X₇ ⊕ X₈

(Figure: the inputs X₁ … X₈ are XORed pairwise in a binary tree.)
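A sketch contrasting the two constructions: the wide one-hidden-layer network needs 2^(N−1) units, whereas reducing the inputs pairwise through a binary tree of XOR sub-networks (each built from a few threshold gates) needs only O(N) units and O(log N) layers. Plain Python; the helper names are mine:

```python
def xor_gate(a, b):
    """XOR built from threshold gates: (a OR b) AND NOT (a AND b)."""
    or_ab = int(a + b - 1 >= 0)
    and_ab = int(a + b - 2 >= 0)
    return int(or_ab - and_ab - 1 >= 0)   # fires iff or_ab = 1 and and_ab = 0

def parity_tree(bits):
    """Reduce the inputs pairwise, layer by layer: depth is about log2(N)."""
    layer = list(bits)
    while len(layer) > 1:
        nxt = [xor_gate(layer[i], layer[i + 1]) for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:                 # an odd leftover passes through unchanged
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

print(parity_tree([1, 0, 1, 1, 0, 1, 0, 1]))   # 1 (five ones)
print(parity_tree([1, 1, 0, 0]))               # 0
```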

slide-88
SLIDE 88

The challenge of depth

96

(Figure: a deep network computing Z = X₁ ⊕ ⋯ ⊕ X_N.)

  • Using only K hidden layers will require O(2^(CN)) neurons in the k-th layer, where C = 2^(−k/2)

– Because the output can be shown to be the XOR of all the outputs of the (k−1)-th hidden layer
– i.e. reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully
– A network with fewer than the minimum required number of neurons cannot model the function

slide-89
SLIDE 89

Caveat 1: Not all Boolean functions..

99

  • Not all Boolean circuits have such a clear depth-vs-size tradeoff
  • Shannon’s theorem: For N > 2, there is a Boolean function of N variables that requires at least 2^N / N gates

– More correctly, for large N, almost all N-input Boolean functions need more than 2^N / N gates

  • Regardless of depth
  • Note: if all Boolean functions over N inputs could be computed using a circuit of size that is polynomial in N, then P = NP!

slide-90
SLIDE 90

Caveat 2

100

  • Used a simple “Boolean circuit” analogy for explanation
  • We actually have threshold circuits (TC), not just Boolean circuits (AC)

– Specifically composed of threshold gates

  • More versatile than Boolean gates (can compute the majority function)
  • E.g. “at least K inputs are 1” is a single TC gate, but an exponential-size AC
  • For fixed depth, Boolean circuits ⊂ threshold circuits (strict subset)

– A depth-2 TC parity circuit can be composed with O(n²) weights

  • But a network of depth log(n) requires only O(n) weights

  • Other formal analyses typically view neural networks as arithmetic

circuits

– Circuits which compute polynomials over any field

  • So lets consider functions over the field of reals
slide-91
SLIDE 91

Summary: Wide vs. deep network

101

  • MLP with a single hidden layer is a universal Boolean function
  • However, a single-layer network might need an exponential number of

hidden units w.r.t. the number of inputs

  • Deeper networks may require far fewer neurons than shallower

networks to express the same function

– Could be exponentially smaller

  • Optimal width and depth depend on the number of variables and the

complexity of the Boolean function

– Complexity: minimal number of terms in DNF formula to represent it

slide-92
SLIDE 92

MLPs as universal classifiers

102

slide-93
SLIDE 93

The MLPas a classifier

103

(Figure: an MLP mapping a 784-dimensional MNIST input to 2 outputs.)

  • MLP as a function over real inputs
  • MLP as a function that finds a complex “decision boundary” over a

space of reals

slide-94
SLIDE 94

A Perceptron onReals

104

  • A perceptron operates on

real-valued vectors

– This is a linear classifier

Σᵢ wᵢ xᵢ ≥ T

(Figure: a perceptron over x1, …, xN; in two dimensions its decision boundary is the line w₁x₁ + w₂x₂ = T.)

slide-95
SLIDE 95

Boolean functions with areal perceptron

105

  • Boolean perceptrons are also linear classifiers

– Purple regions are 1

(Figure: the four points (0,0), (0,1), (1,0), (1,1) in the X-Y plane, with a linear boundary for each Boolean gate.)

slide-96
SLIDE 96

Composing complicated “decision” boundaries

106

  • Build a network of units with a single output that fires if the input is

in the coloured area

x1 x2

Can now be composed into “networks” to compute arbitrary classification “boundaries”

slide-97
SLIDE 97

Booleans over the reals

107

  • The network must fire if the input is in the coloured area

(Figure: the coloured region in the (x1, x2) plane and the network that detects it.)


slide-102
SLIDE 102

Booleans over the reals

112

  • The network must fire if the input is in the coloured area

AND

(Figure: five half-plane units y1 … y5 over (x1, x2) feed an AND unit that fires when Σ_{i=1}^{5} yᵢ ≥ 5.)

slide-103
SLIDE 103

More complex decision boundaries

113

  • Network to fire if the input is in the yellow area

– “OR” two polygons – A third layer is required

x2 x1

AND AND OR

x1 x2

slide-104
SLIDE 104

Complex decision boundaries

114

AND OR x1 x2

  • Can compose arbitrarily complex decision boundaries

– With only one hidden layer! – How?

slide-105
SLIDE 105

115

MLP with Different Number of Layers

Structure | Type of decision regions | Interpretation
Single layer (no hidden layer) | Half space | Region found by a hyperplane
Two layers (one hidden layer) | Polyhedral (open or closed) region | Intersection of half spaces
Three layers (two hidden layers) | Arbitrary regions | Union of polyhedra
(For an MLP with unit-step activation function; the decision region is the region found by an output unit.)

slide-106
SLIDE 106

Exercise: compose this withone hidden layer

116

  • How would you compose the decision boundary to the left with only one hidden layer?

x1 x2 x2 x1

slide-107
SLIDE 107

Composing a squaredecision boundary

117

  • The polygon net: four half-plane units y1 … y4 over (x1, x2)

Σ_{i=1}^{4} yᵢ ≥ 4 ?

(Figure: the sum is 4 inside the square and smaller outside.)
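A sketch of the polygon net in NumPy: each side is a half-plane threshold unit, and the output unit fires only when every side-unit fires. The square used here (side positions, test points) is my own toy example:

```python
import numpy as np

def polygon_net(points, normals, offsets):
    """Each row of normals/offsets defines a half-plane unit y_i = [n_i . x >= c_i];
    the output fires iff the sum of the y_i reaches the number of sides."""
    y = (points @ normals.T >= offsets).astype(int)   # (num_points, num_sides)
    return (y.sum(axis=1) >= normals.shape[0]).astype(int)

# Square of side 2 centred at the origin: x1 >= -1, -x1 >= -1, x2 >= -1, -x2 >= -1
normals = np.array([[1., 0.], [-1., 0.], [0., 1.], [0., -1.]])
offsets = np.array([-1., -1., -1., -1.])

pts = np.array([[0., 0.], [0.5, -0.5], [2., 0.], [0., 1.5]])
print(polygon_net(pts, normals, offsets))   # [1 1 0 0]: inside, inside, outside, outside
```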

slide-108
SLIDE 108

Composing a squaredecision boundary

118

  • The polygon net

Σ_{i=1}^{5} yᵢ ≥ 5 ?

(Figure: five half-plane units y1 … y5 over (x1, x2); the sum is 5 inside the polygon and 4, 3, or 2 in the regions outside.)

slide-109
SLIDE 109

Composing a pentagon

119

  • The polygon net

Σ_{i=1}^{6} yᵢ ≥ 6 ?

(Figure: six half-plane units y1 … y6 over (x1, x2); the sum is 6 inside the polygon and 5, 4, or 3 outside.)

slide-110
SLIDE 110

16 sides

120

  • What are the sums in the different regions?
slide-111
SLIDE 111

64 sides

121

  • What are the sums in the different regions?
slide-112
SLIDE 112

1000 sides

122

  • What are the sums in the different regions?
slide-113
SLIDE 113

Polygon net

123

  • Increasing the number of sides reduces the area outside the

polygon that has N/2 < sum < N

slide-114
SLIDE 114

In the limit

124

  • For small radius, it’s a near-perfect cylinder

– N in the cylinder, N/2 outside

Σᵢ yᵢ = N (1 − (1/π) arccos(min(1, radius / |x − center|)))

slide-115
SLIDE 115

Composing a circle

125

N N/2

  • The circle net

– Very large number of neurons – Sum is N inside the circle, N/2 outside almost everywhere – Circle can be at any location

Σ_{i=1}^{N} yᵢ ≥ N ?

slide-116
SLIDE 116

Composing a circle

126

  • The circle net

– Very large number of neurons – Sum is N/2 inside the circle, 0 outside almost everywhere – Circle can be at any location

Σ_{i=1}^{N} yᵢ − N/2 ≥ 0 ?

(Figure: a constant input 1 with weight −N/2 shifts the sum, so the net response is N/2 inside the circle and 0 almost everywhere outside.)

slide-117
SLIDE 117

Adding circles

127

  • The “sum” of two circles sub nets is exactly N/2 inside either circle, and 0

almost everywhere outside

Σ_{i=1}^{N} yᵢ − N/2 ≥ 0 ?

slide-118
SLIDE 118

Composing an arbitraryfigure

128

Σ_{i=1}^{N} yᵢ − N/2 ≥ 0 ?

  • Just fit in an arbitrary number of circles

– More accurate approximation with greater number of smaller circles – Can achieve arbitrary precision

slide-119
SLIDE 119

MLP: Universal classifier

129

Σ_{i=1}^{N} yᵢ − N/2 ≥ 0 ?

  • MLPs can capture any classification boundary
  • A one-hidden-layer MLP can model any classification boundary
  • MLPs are universal classifiers
slide-120
SLIDE 120

Depth and theuniversal classifier

130

x2 x1 x1 x2

  • Deeper networks can require far fewer neurons
slide-121
SLIDE 121

Optimal depth in generic nets

  • We look at a different pattern:

– “worst case” decision boundaries

  • For threshold-activation networks

– Generalizes to other nets

134

slide-122
SLIDE 122

Optimal depth

135

  • A naïve one-hidden-layer neural network will require infinitely many hidden neurons

Σ_{i=1}^{N} yᵢ − N/2 ≥ 0 ?

slide-123
SLIDE 123

Optimal depth

136

  • Two hidden-layer network: 56 hidden neurons
slide-124
SLIDE 124

Optimal depth

137

  • Two layer network: 56 hidden neurons

– 16 neurons in hidden layer 1

(Figure: the outputs Y₁, Y₂, …, Y₁₆ of the 16 first-layer neurons.)

slide-125
SLIDE 125

Optimal depth

138

  • Two-layer network: 56 hidden neurons

– 16 in hidden layer 1 – 40 in hidden layer 2 – 57 total neurons, including output neuron

slide-126
SLIDE 126

Optimal depth

139

  • But this is just

(Figure: the first-layer outputs Y₁ … Y₁₆.)

slide-127
SLIDE 127

Optimal depth

140

  • But this is just
  • The XOR net will require 16+15×3 = 61 neurons
  • 46 neurons if we use a two-gate XOR
slide-128
SLIDE 128

Optimal depth

141

  • A naïve one-hidden-layer neural network will require infinitely many hidden neurons

Σ_{i=1}^{N} yᵢ − N/2 ≥ 0 ?

slide-129
SLIDE 129

Actual linear units

142

(Figure: the outputs Y₁, Y₂, …, Y₆₄ of the 64 first-layer units.)

  • 64 basic linear feature detectors
slide-130
SLIDE 130

Optimal depth

143

  • Two hidden layers: 608 hidden neurons

– 64 in layer 1 – 544 in layer 2

  • 609 total neurons (including output neuron)
slide-131
SLIDE 131

Optimal depth

144


  • XOR network (12 hidden layers): 253 neurons

– 190 neurons with 2-gate XOR

  • The difference in size between the deeper optimal (XOR) net and

shallower nets increases with increasing pattern complexity and input dimension

slide-132
SLIDE 132

Depth: Summary

  • The number of neurons required in a shallow network is potentially

exponential in the dimensionality of the input

– (this is the worst case) – Alternately, exponential in the number of statistically independent features

146

slide-133
SLIDE 133

Summary

  • Multi-layer perceptrons are Universal Boolean Machines

– Even a network with a single hidden layer is a universal Boolean machine

  • Multi-layer perceptrons are Universal Classification Functions

– Even a network with a single hidden layer is a universal classifier

  • But a single-layer network may require an exponentially larger

number of perceptrons than a deep one

  • Deeper networks may require far fewer neurons than shallower

networks to express the same function

– Could be exponentially smaller – Deeper networks are more expressive

147

slide-134
SLIDE 134

MLPs as universal approximators

148

slide-135
SLIDE 135

MLP as a continuous-valued regression

149

(Figure: two threshold units over x, one with threshold T1 and one with threshold T2, are summed with weights +1 and −1; the result f(x) is 1 for T1 ≤ x < T2 and 0 elsewhere.)

  • A simple 3-unit MLP with a “summing” output unit can

generate a “square pulse” over an input

– Output is 1 only if the input lies between T1 and T2 – T1 and T2 can be arbitrarily specified

slide-136
SLIDE 136

MLP as a continuous-valued regression

150

(Figure: many pairs of threshold units over x, each producing a square pulse between its own T1 and T2; a weighted sum of such pulses traces out f(x).)

  • A simple 3-unit MLP can generate a “square pulse” over an input
  • An MLP with many units can model an arbitrary function over an

input

– To arbitrary precision

  • Simply make the individual pulses narrower
  • A one-hidden-layer MLP can model an arbitrary function of a single input
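A sketch of this pulse construction in NumPy: each pair of threshold units produces a square pulse over [T1, T2), and a weighted sum of many narrow pulses traces the target function (the target sin(x), the grid, and the pulse count are illustrative choices):

```python
import numpy as np

def pulse(x, t1, t2):
    """Square pulse from two threshold units: [x >= t1] - [x >= t2]."""
    return (x >= t1).astype(float) - (x >= t2).astype(float)

def approximate(x, f, n_pulses=50, lo=0.0, hi=2 * np.pi):
    """Sum of scaled pulses; each pulse's height is f at its bin's left edge."""
    edges = np.linspace(lo, hi, n_pulses + 1)
    out = np.zeros_like(x)
    for t1, t2 in zip(edges[:-1], edges[1:]):
        out += f(t1) * pulse(x, t1, t2)
    return out

x = np.linspace(0.0, 2 * np.pi, 1000)
err = np.max(np.abs(approximate(x, np.sin) - np.sin(x)))
print(err)   # the error shrinks as n_pulses grows (the pulses get narrower)
```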
slide-137
SLIDE 137

For higher dimensions

151

(Figure: many half-space units over the input plane summed by a + unit.)

  • An MLP can compose a cylinder

– N/2 in the circle, 0 outside

slide-138
SLIDE 138

MLP as a continuous-valued function

152


  • MLPs can actually compose arbitrary functions in any number of

dimensions!

– Even with only one layer

  • As sums of scaled and shifted cylinders

– To arbitrary precision

  • By making the cylinders thinner

– The MLP is a universal approximator!

slide-139
SLIDE 139

Caution: MLPs with additive output units are universal approximators

153


  • MLPs can actually compose arbitrary functions in any number of

dimensions!

  • But explanation so far only holds if the output unit only performs

summation

– i.e. does not have an additional “activation”

  • output = Σ_{i=1}^{N} hᵢ yᵢ

slide-140
SLIDE 140

“Proper” networks: Outputswith activations

154

(Figure: an output neuron over x1 … xN with a sigmoid or tanh activation.)

  • Output neuron may have actual “activation”

– Threshold, sigmoid, tanh, softplus, rectifier, etc.

  • What is the property of such networks?
slide-141
SLIDE 141

155

f: {0,1}^N → {0,1}  Boolean
f: R^N → {0,1}  Threshold
f: R^N → (0,1)  Sigmoid
f: R^N → (−1,1)  Tanh
f: R^N → (0, +∞)  Soft-rectifier, Rectifier

  • Output unit with activation function
  • Threshold or Sigmoid, or any other
  • The network is actually a universal map from the entire domain of input values to the

entire range of the output activation

  • All values that the activation function of the output neuron can produce
  • The MLP is a Universal Approximator for the entire class of functions (maps) it represents!
slide-142
SLIDE 142

A discussion of optimal depth and width

156

slide-143
SLIDE 143

The issue of depth

  • Previous discussion showed that a single-layer MLP is a universal

function approximator

– Can approximate any function to arbitrary precision – But may require infinite neurons in the layer

  • More generally, deeper networks will require far fewer neurons for

the same approximation error

– The network is a generic map

  • The same principles that apply for Boolean networks apply here

– Can be exponentially fewer than the 1-layer network

157

slide-144
SLIDE 144

Sufficiency of architecture

158

… ..

A network with 16 or more neurons in the first layer is capable of representing the figure to the right perfectly

  • A neural network can represent any function provided it has sufficient

capacity

– I.e. sufficiently broad and deep to represent the function

  • Not all architectures can represent any function
slide-145
SLIDE 145

Sufficiency of architecture

159

… ..

A network with 16 or more neurons in the first layer is capable of representing the figure to the right perfectly

  • A neural network can represent any function provided it has sufficient

capacity

– I.e. sufficiently broad and deep to represent the function

  • Not all architectures can represent any function

A network with less than 16 neurons in the first layer cannot represent this pattern exactly

  • With caveats..
slide-146
SLIDE 146

Sufficiency of architecture

160

… ..

A network with 16 or more neurons in the first layer is capable of representing the figure to the right perfectly

  • A neural network can represent any function provided it has sufficient

capacity

– I.e. sufficiently broad and deep to represent the function

  • Not all architectures can represent any function

A network with less than 16 neurons in the first layer cannot represent this pattern exactly

  • With caveats..

We will revisit this idea shortly

slide-149
SLIDE 149

Sufficiency of architecture

163

… ..

A network with 16 or more neurons in the first layer is capable of representing the figure to the right perfectly

  • A neural network can represent any function provided
  • it has sufficient capacity

– I.e. sufficiently broad and deep to represent the function

  • Not all architectures can represent any function

A network with less than 16 neurons in the first layer cannot represent this pattern exactly

  • With caveats..

A 2-layer network with 16 neurons in the first layer cannot represent the pattern with less than 41 neurons in the second layer

slide-150
SLIDE 150

Sufficiency of architecture

164

… ..

A network with 16 or more neurons in the first layer is capable of representing the figure to the right perfectly A network with less than 16 neurons in the first layer cannot represent this pattern exactly

  • With caveats..
slide-151
SLIDE 151

Sufficiency of architecture

165

  • This effect is because we use the

threshold activation

  • It gates information in the input

from later layers

  • The pattern of outputs within any

colored region is identical

  • Subsequent layers do not obtain

enough information to partition them

slide-152
SLIDE 152

Sufficiency of architecture

166

  • This effect is because we use the

threshold activation

  • It gates information in the input

from later layers

  • Continuous activation functions result in graded output at the layer
  • The gradation provides information to subsequent layers, to

capture information “missed” by the lower layer (i.e. it “passes” information to subsequent layers).

slide-153
SLIDE 153

Sufficiency of architecture

167

  • This effect is because we use the

threshold activation

  • It gates information in the input

from later layers

  • Continuous activation functions result in graded output at the layer
  • The gradation provides information to subsequent layers, to

capture information “missed” by the lower layer (i.e. it “passes” information to subsequent layers).

  • Activations with more gradation (e.g. ReLU) pass more information
slide-154
SLIDE 154

Width vs. Activations vs.Depth

  • Narrow layers can still pass information to subsequent layers if the

activation function is sufficiently graded

  • But will require greater depth, to permit later layers to capture

patterns

168

slide-155
SLIDE 155

Sufficiency of architecture

169

  • The capacity of a network has various definitions

– Information or Storage capacity: how many patterns can it remember – VC dimension

  • bounded by the square of the number of weights in the network

– From our perspective: largest number of disconnected convex regions it can represent

  • A network with insufficient capacity cannot exactly

model a function that requires a greater minimal number of convex hulls than the capacity of the network

– But can approximate it with error

slide-156
SLIDE 156

Summary

  • MLPs are universal Boolean functions
  • MLPs are universal classifiers
  • MLPs are universal function approximators
  • A single-layer MLP can approximate anything to arbitrary precision

– But could be exponentially or even infinitely wide in its input size

  • Deeper MLPs can achieve the same precision with far fewer

neurons

– Deeper networks are more expressive

171