Neural Networks: What can a network represent
Deep Learning, Spring 2018
Recap: Neural networks have taken over AI
– Tasks that are made possible by NNs, aka deep learning
Recap: NNets and the brain
– In their basic form, NNets are networks of computational models of neurons called perceptrons
The perceptron
– Fires if the weighted combination of inputs (and threshold) exceeds zero: z = Σ_i w_i x_i + b, passed through an activation
– We will hear more about activations later
Common activations:
– sigmoid: 1 / (1 + exp(−z))
– tanh: tanh(z)
– softplus: log(1 + e^z)
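A minimal sketch of such a unit in NumPy (my own illustration; the function names and the set of activations included are assumptions, not code from the lecture):

```python
import numpy as np

ACTIVATIONS = {
    "threshold": lambda z: (z >= 0).astype(float),   # classical perceptron
    "sigmoid":   lambda z: 1.0 / (1.0 + np.exp(-z)),
    "tanh":      np.tanh,
    "softplus":  lambda z: np.log1p(np.exp(z)),
}

def perceptron(x, w, b, activation="threshold"):
    """Affine combination of inputs followed by an activation: f(w.x + b)."""
    z = np.dot(w, x) + b
    return ACTIVATIONS[activation](z)

# Example: a two-input unit that fires only when both inputs are 1
print(perceptron(np.array([1.0, 0.0]), w=np.array([1.0, 1.0]), b=-1.5))  # 0.0
print(perceptron(np.array([1.0, 1.0]), w=np.array([1.0, 1.0]), b=-1.5))  # 1.0
```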
The multi-layer perceptron (MLP)
– Can have multiple outputs for a single input
– What kinds of input/output relationships can it model?
Perceptrons can model Boolean gates:
– AND of X and Y: weights 1, 1 with threshold 2
– OR of X and Y: weights 1, 1 with threshold 1
– NOT of X: a single negative weight with threshold 0
Generalized AND gate (weights +1 on X1 .. XL, −1 on XL+1 .. XN, threshold L):
– Will fire only if X1 .. XL are all 1 and XL+1 .. XN are all 0
Generalized OR gate (same weights, threshold L-N+1):
– Will fire only if any of X1 .. XL is 1 or any of XL+1 .. XN is 0
Generalized majority gate (same weights, threshold L-N+K):
– Will fire only if the total number of X1 .. XL that are 1 and XL+1 .. XN that are 0 is at least K
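A small brute-force check of these three constructions (my own illustration; names are arbitrary). The weights are always +1 on the first L inputs and −1 on the remaining N−L; only the threshold changes:

```python
from itertools import product

def gate(x, L, theta):
    """Threshold gate: +1 weights on x[:L], -1 weights on x[L:], fires if the sum >= theta."""
    s = sum(x[:L]) - sum(x[L:])
    return int(s >= theta)

N, L, K = 5, 3, 2
for x in product([0, 1], repeat=N):
    g_and = gate(x, L, L)              # fires iff x1..xL all 1 and the rest all 0
    g_or  = gate(x, L, L - N + 1)      # fires iff any of x1..xL is 1 or any of the rest is 0
    g_maj = gate(x, L, L - N + K)      # fires iff (#ones in x1..xL) + (#zeros in the rest) >= K
    assert g_and == int(all(x[:L]) and not any(x[L:]))
    assert g_or  == int(any(x[:L]) or not all(x[L:]))
    assert g_maj == int(sum(x[:L]) + sum(1 - v for v in x[L:]) >= K)
print("all generalized-gate conditions verified")
```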
The perceptron is not enough: XOR
– No single perceptron can compute the XOR of X and Y: no choice of two weights and a threshold works
An MLP with one hidden layer can, e.g. with hidden units computing OR(X, Y) and NOT-AND(X, Y), combined by an AND at the output
– Two hidden units (plus the output unit) suffice
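A sketch of one such two-hidden-unit construction (the specific weights here are one standard choice, not necessarily the ones on the slide):

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

def xor_mlp(x):
    """XOR via a single hidden layer of threshold units: (X OR Y) AND NOT(X AND Y)."""
    W1 = np.array([[1.0, 1.0],     # h1 = OR(X, Y):   fires if  X + Y - 1 >= 0
                   [-1.0, -1.0]])  # h2 = NAND(X, Y): fires if -X - Y + 1 >= 0
    b1 = np.array([-1.0, 1.0])
    h = step(W1 @ x + b1)
    return step(np.array([1.0, 1.0]) @ h - 2.0)  # output = AND(h1, h2)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, int(xor_mlp(np.array(x, dtype=float))))  # 0, 1, 1, 0
```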
MLPs are universal Boolean functions
– Since they can emulate individual gates, MLPs can compute any Boolean function
– Any function over any number of inputs and any number of outputs
(Figure: an MLP of threshold gates computing a Boolean function of inputs X, Y, Z, A.)
In fact, one hidden layer is sufficient
(Figure: a truth table over X1 .. X5, listing all input combinations for which the output is 1.)
– Express the function in disjunctive normal form (DNF): one hidden perceptron fires for each input combination with output 1, and the output perceptron ORs them together
– So any Boolean function over N variables can be modeled exactly by an MLP with a single hidden layer
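As an illustration of this single-hidden-layer construction (my own sketch, reusing the generalized AND gate from earlier): one hidden unit per row of the truth table whose output is 1, ORed by the output unit.

```python
from itertools import product

def dnf_network(truth_rows, n):
    """One hidden threshold unit per true row; the output unit ORs the hidden units."""
    def net(x):
        hidden = []
        for row in truth_rows:
            # AND gate for this row: +1 weight where the row has a 1, -1 where it has a 0
            s = sum((1 if r else -1) * xi for r, xi in zip(row, x))
            hidden.append(int(s >= sum(row)))  # threshold = number of 1s in the row
        return int(sum(hidden) >= 1)           # OR of the hidden units
    return net

# Example: parity (XOR) of 3 variables, the worst case for this construction
target = lambda x: sum(x) % 2
rows = [x for x in product([0, 1], repeat=3) if target(x)]
net = dnf_network(rows, 3)
assert all(net(x) == target(x) for x in product([0, 1], repeat=3))
print(f"{len(rows)} hidden units used")  # 2^(N-1) = 4 hidden units for 3-input parity
```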
But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function?
This is a "Karnaugh map": it represents a truth table as a grid. Filled boxes represent input combinations for which the output is 1; blank boxes have output 0. Adjacent boxes can be "grouped" to reduce the complexity of the DNF formula for the table.
(Figure: a sequence of 4-variable Karnaugh maps, rows WX and columns YZ each indexed 00, 01, 11, 10, showing successive groupings.)
– The basic DNF formula will require 7 terms; grouping adjacent boxes reduces this count
The width of the one-hidden-layer network
– Worst case: a Karnaugh map in which no adjacent boxes can be grouped (the "checkerboard" pattern of the XOR/parity function) needs one hidden perceptron per filled box, i.e. up to 2^(N-1) perceptrons
– In this example there are 30 such terms
(Figure: Karnaugh map over the 6 variables U, V, W, X, Y, Z.)
Can we do better with a deeper MLP for this Boolean function of 6 variables?
– How many neurons and weights will this network require?
A deeper MLP: compose the function from 2-input XORs
(Figure: the 2-input XOR MLP from earlier — 3 perceptrons, 9 parameters — chained over successive pairs of variables.)
– XOR of W, X, Y, Z: 9 perceptrons
– 27 parameters
– XOR of U, V, W, X, Y, Z: 15 perceptrons
– 45 parameters (45 weights)
More generally, the XOR of N variables will require 3(N-1) perceptrons (and 9(N-1) weights)
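A sketch of this chained construction (my own illustration), reusing the 2-input XOR unit from earlier and comparing perceptron counts:

```python
from itertools import product

def xor2(a, b):
    """2-input XOR as 3 threshold perceptrons: OR, NAND, then AND."""
    h1 = int(a + b >= 1)        # OR
    h2 = int(-a - b >= -1)      # NAND
    return int(h1 + h2 >= 2)    # AND

def xor_chain(x):
    """XOR of N inputs by chaining N-1 two-input XOR units: 3(N-1) perceptrons."""
    out = x[0]
    for v in x[1:]:
        out = xor2(out, v)
    return out

N = 6
for x in product([0, 1], repeat=N):
    assert xor_chain(x) == sum(x) % 2
print(f"deep net: {3 * (N - 1)} perceptrons, {9 * (N - 1)} weights; "
      f"a single hidden layer would need {2 ** (N - 1)} perceptrons")
```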
The need for depth
(Figure: the deep pairwise-XOR network, whose intermediate outputs feed the following layers.)
– If the network is restricted to fewer layers: the output can be shown to be the XOR of all the outputs of the (K-1)-th hidden layer
– I.e. reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully
– A network with fewer than the minimum required number of neurons cannot model the function
Network size and depth: theory
– Computing the parity (XOR) of N inputs with a constant-depth circuit of AND/OR gates requires exponential size
– Furst, Saxe and Sipser, "Parity, Circuits, and the Polynomial-Time Hierarchy", Mathematical Systems Theory, 1984
– Alternately stated: the set of constant-depth, polynomial-size circuits of unbounded fan-in elements does not include parity
The depth-size tradeoff
– For every n, there is a Boolean function of n variables that requires on the order of 2^n/n gates
– More correctly, for large n, almost all n-input Boolean functions need more than 2^n/n gates
– If all Boolean functions over n inputs could be computed using circuits of size polynomial in n, then P = NP!
A network can compute any Boolean function provided:
– It is sufficiently wide
– It is sufficiently deep
– Depth can be traded off for (sometimes) exponential growth of the width of the network
Optimal width and depth depend on the complexity of the Boolean function
– Complexity: minimal number of terms in the DNF formula to represent it
Summary: the MLP is a universal Boolean machine
– Even a network with a single hidden layer can compute any Boolean function
– But a single-layer network may require an exponentially large number of perceptrons
– Deeper networks may require far fewer neurons than shallower networks to express the same function
– Could be exponentially smaller
Caveat: these networks are "threshold circuits" (TC), not standard Boolean circuits
– Specifically composed of threshold gates rather than AND/OR/NOT gates
– E.g. "at least K inputs are 1" is a single TC gate, but requires an exponential-size AND/OR circuit (AC)
– For fixed depth, Boolean circuits ⊂ threshold circuits (strict subset)
– A depth-2 TC parity circuit can be composed with O(n^2) weights, but a network of depth log(n) requires only O(n) weights
– But more generally, for large n, for most Boolean functions, a threshold circuit that is polynomial in n at optimal depth becomes exponentially large at smaller depths
– Other formal analyses typically view neural networks as arithmetic circuits
– Circuits which compute polynomials over any field
The MLP as a classifier
– E.g. classifying MNIST digits: the input has 784 dimensions
A single perceptron is a linear classifier
– It fires when w1 x1 + w2 x2 ≥ T; the decision boundary is the line (more generally, the hyperplane) w1 x1 + w2 x2 = T
(Figure: Boolean functions as classification problems over the four points (0,0), (0,1), (1,0), (1,1); AND and OR are linearly separable, XOR is not.)
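A tiny sketch of this linear decision rule (the particular weights and threshold are illustrative, not from the slides):

```python
import numpy as np

def linear_classifier(x, w, T):
    """Single perceptron: classify by which side of the hyperplane w.x = T the point falls on."""
    return int(np.dot(w, x) >= T)

w, T = np.array([1.0, 1.0]), 1.5
# The boundary x1 + x2 = 1.5 separates AND's positive corner (1,1) from the other three points
for p in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(p, linear_classifier(np.array(p, dtype=float), w, T))
```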
Composing complex decision boundaries
– Perceptrons can now be composed into "networks" to compute arbitrary classification "boundaries"
– Each hidden perceptron defines a linear boundary (a half-plane over inputs x1, x2); an AND at the output carves out the intersection of the half-planes, a convex region
(Figure: a sequence of examples building up a convex polygon, e.g. a pentagon, from five linear boundaries.)
Composing a polygon
– The AND of the five half-plane units fires only inside the pentagon: the sum of the hidden outputs y1 .. y5 is 5 inside the pentagon and 4 or less outside
Composing arbitrary figures
– "OR" two (or more) polygons
– A third layer is required: layer 1 computes half-planes, layer 2 ANDs them into polygons, layer 3 ORs the polygons
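A sketch of this three-layer construction (half-planes → AND → OR); the particular polygons, two axis-aligned squares, are arbitrary illustrations:

```python
import numpy as np

step = lambda z: (z >= 0).astype(float)

def convex_region(x, W, b):
    """Layers 1-2: AND of half-planes w_i.x + b_i >= 0; fires only inside the convex polygon."""
    h = step(W @ x + b)            # one unit per polygon edge
    return step(h.sum() - len(b))  # fires only if every edge unit fires

def union_of_regions(x, regions):
    """Layer 3: OR of the convex-region detectors, giving an arbitrary (non-convex) figure."""
    r = np.array([convex_region(x, W, b) for W, b in regions])
    return step(r.sum() - 1)

# Two illustrative squares, each the AND of 4 half-planes
square = lambda x0, x1, y0, y1: (np.array([[1, 0], [-1, 0], [0, 1], [0, -1]], float),
                                 np.array([-x0, x1, -y0, y1], float))
regions = [square(0, 1, 0, 1), square(2, 3, 2, 3)]
for p in [(0.5, 0.5), (2.5, 2.5), (1.5, 1.5)]:
    print(p, union_of_regions(np.array(p), regions))  # 1, 1, 0
```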
In fact, the same arbitrary decision boundaries can be composed with only one hidden layer!
– How?
One hidden layer: add instead of AND
– Replace the output AND with a simple sum of the hidden outputs y1 .. yN
(Figures: for a square, pentagon, hexagon, …, the sum equals N inside the polygon and decreases in the regions outside it.)
In the limit, the polygon becomes a circle
– With a very large number of neurons, the sum of the hidden outputs is N inside the circle and N/2 outside almost everywhere ("N in the cylinder, N/2 outside")
– The circle can be at any location
– Subtracting N/2 leaves a "cylinder" of height N/2 inside the circle and 0 outside almost everywhere
Composing a circular region with one hidden layer
– The output unit computes $\sum_{i=1}^{N} y_i - \frac{N}{2} > 0\,?$ — this fires inside the circle and (almost) nowhere outside
– Repeating the construction and adding: the adjusted sum is N/2 inside either circle, and 0 almost everywhere outside
– With two circles the sum runs over 2N hidden units; with K circles, over KN
– More accurate approximation with a greater number of smaller circles
– Can achieve arbitrary precision
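A numerical illustration of the "N inside, N/2 outside" claim (my own sketch): N threshold units whose linear boundaries are tangent to a circle; their sum is N inside the circle and close to N/2 far away.

```python
import numpy as np

def circle_unit_sum(p, center, radius, N=1000):
    """Sum of N threshold units, each firing on the inner side of a tangent line to the circle."""
    angles = 2 * np.pi * np.arange(N) / N
    normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)   # outward normals
    # Unit i fires if the point is on the inner side of the tangent line at angle i
    fires = normals @ (np.asarray(p) - np.asarray(center)) <= radius
    return fires.sum()

center, radius, N = (0.0, 0.0), 1.0, 1000
print(circle_unit_sum((0.0, 0.0), center, radius, N))   # N: inside the circle
print(circle_unit_sum((10.0, 0.0), center, radius, N))  # roughly N/2: far outside
# Output unit: fire if this sum minus N/2 is (sufficiently) positive
```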
Depth in continuous networks: sum-product analyses
– These analyses view networks as arithmetic circuits, which compute polynomials over any field
– But they only consider two-input units; generalized by Mhaskar et al. to all functions that can be expressed as a binary tree
– Depth/size analyses of arithmetic circuits are still a research problem
Depth helps: Olivier Delalleau and Yoshua Bengio, "Shallow vs. Deep Sum-Product Networks" (2011)
– For networks where layers alternately perform either sums or products, a deep network may require exponentially fewer units than a shallow one to express the same function
An example: a deep (two-hidden-layer) network for a complex decision pattern
– 16 neurons in hidden layer 1
– 40 in hidden layer 2
– 57 total neurons, including the output neuron
A more complex pattern
– 64 neurons in layer 1
– 544 in layer 2
The difference in size between the deeper and shallower nets increases with increasing pattern complexity
A caveat on the previous example
– The size of the one-hidden-layer network was quadratic in the number of lines
– Not exponential, even though the pattern is an XOR — why?
– There are only two fully independent features: the pattern is exponential in the dimension of the input (two)!
– For K mutually intersecting hyperplanes in D dimensions, we will need on the order of K^D units (illustrated numerically below)
– Increasing the input dimension can increase the worst-case size of the shallower network exponentially, but not that of the deeper XOR-style net
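A small numeric illustration of that last point (the region-count formula for hyperplanes in general position is a standard fact, not taken from the slides):

```python
from math import comb

def regions(K, D):
    """Number of regions that K hyperplanes in general position cut R^D into: sum_{i<=D} C(K, i)."""
    return sum(comb(K, i) for i in range(D + 1))

K = 16
for D in (2, 3, 8):
    print(f"D={D}: {regions(K, D)} regions from {K} hyperplanes "
          f"(a shallow net needs on this order of units; the deep XOR-style net stays polynomial in K)")
```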
Summary
– Even a network with a single hidden layer is a universal Boolean machine
– Even a network with a single hidden layer is a universal classifier
– But deeper networks may require far fewer neurons than shallower networks to express the same function
– Could be exponentially smaller
– Deeper networks are more expressive
MLPs as continuous-valued regression
A pair of threshold units can generate a "square pulse" over an input
– Output is 1 only if the input lies between T1 and T2
– T1 and T2 can be arbitrarily specified
(Figure: subtracting a unit that fires for x ≥ T2 from one that fires for x ≥ T1 gives a pulse over [T1, T2].)
Scaled pulses can be added to approximate any one-dimensional function f(x)
– To arbitrary precision
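A sketch of this pulse-based approximation (the target function and the bin width are illustrative choices):

```python
import numpy as np

step = lambda z: (z >= 0).astype(float)

def pulse(x, t1, t2):
    """Square pulse from two threshold units: 1 for t1 <= x < t2, else 0."""
    return step(x - t1) - step(x - t2)

def pulse_approx(x, f, lo, hi, n_pulses=100):
    """Approximate f with n_pulses adjacent pulses, each scaled by f at the bin centre."""
    edges = np.linspace(lo, hi, n_pulses + 1)
    centres = 0.5 * (edges[:-1] + edges[1:])
    return sum(f(c) * pulse(x, a, b) for c, a, b in zip(centres, edges[:-1], edges[1:]))

x = np.linspace(0, 2 * np.pi, 1000)
f = np.sin
print(np.max(np.abs(pulse_approx(x, f, 0, 2 * np.pi) - f(x))))  # error shrinks as n_pulses grows
```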
The same construction generalizes to functions of any number of input dimensions!
– Even with only one layer
– To arbitrary precision
– The MLP is a universal approximator!
Caveat: here the output unit is a simple weighted sum
– I.e. does not have an additional "activation"
What if the output neuron does have an activation?
– Threshold or sigmoid, or any other
(Figure: an MLP over inputs x1 .. xN with a sigmoid/tanh output unit.)
– The network output is then restricted to the range of possible output values of the activation function of the output neuron
– The output activation limits the set of functions the network represents!
MLPs as universal approximators: summary
– A one-hidden-layer MLP is a universal function approximator
– Can approximate any function to arbitrary precision
– But may require infinite neurons in the layer
– Deeper networks may require far fewer neurons for the same approximation error
– Can be exponentially fewer than the 1-layer network
– The network is a generic map; the specific function is determined by its parameters
Sufficiency of architecture
– A network can represent a given function only if it has sufficient capacity
– I.e. it is sufficiently broad and deep to represent the function
A network with 16 or more neurons in the first layer is capable of representing the figure to the right perfectly
A network with fewer than 16 neurons in the first layer cannot represent this pattern exactly
– With caveats.. we will revisit this idea shortly
– Why?
A 2-layer network with 16 neurons in the first layer cannot represent the pattern with fewer than 41 neurons in the second layer
Why the limit?
– This effect arises because we use the threshold activation: it gates information in the input from later layers
– The pattern of first-layer outputs within any colored region is identical, so subsequent layers do not obtain enough information to partition them
– Continuous activation functions result in graded outputs at the layer
– The gradation provides information to subsequent layers, to capture information "missed" by the lower layer (i.e. it "passes" information to subsequent layers)
– Activations with more gradation (e.g. RELU) pass more information
The "capacity" of a network
– Information or storage capacity: how many patterns can it remember
– VC dimension
– From our perspective: the largest number of disconnected convex regions it can represent
– A network cannot exactly model a pattern that requires a greater minimal number of convex hulls than the capacity of the network
– But it can approximate it with error
VC dimension of MLPs
– Koiran and Sontag (1998): for "linear" or threshold units, the VC dimension is proportional to the number of weights
– For non-linear units, it can grow as the square of the number of weights
– Harvey, Liaw, Mehrabian, "Nearly-tight VC-dimension bounds for piecewise linear neural networks" (2017): there exist ReLU networks with L layers and W weights whose VC dimension is on the order of WL log(W/L)
– A related 2017 bound is stated in terms of the overall number of hidden neurons and the number of weights per neuron
A note on RBF networks
– In an RBF (radial basis function) unit, the output depends on the distance of the input from a "center"
– The "center" is the parameter specifying the unit
– The most common activation is the exponent of the negated, scaled squared distance: $\exp(-\beta \|x - c\|^2)$
– But other similar activations may also be used
– An RBF network can capture a circular decision region with a single unit, with an appropriate choice of bandwidth (or activation function)
– As opposed to many units for the linear perceptron
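A minimal sketch of such a unit (the parameter names β and c are my own notation):

```python
import numpy as np

def rbf_unit(x, c, beta=1.0):
    """Radial basis function unit: response decays with distance from the center c."""
    return np.exp(-beta * np.sum((np.asarray(x) - np.asarray(c)) ** 2))

# A single RBF unit responds strongly near its center and weakly far away,
# so one unit can carve out a (soft) circular region.
print(rbf_unit([0.1, 0.0], c=[0.0, 0.0], beta=2.0))  # close to 1
print(rbf_unit([2.0, 2.0], c=[0.0, 0.0], beta=2.0))  # close to 0
```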
Lessons
– MLPs are universal Boolean machines, universal classifiers, and universal approximators
– But a single-hidden-layer network could be exponentially or even infinitely wide in its input size
– Deeper networks may require far fewer neurons
– Deeper networks are more expressive
In practice we want networks that model functions like:
– E.g. a function that takes an image as input and outputs the labels of the objects in it
– E.g. a function that takes speech input and outputs the labels of all phonemes in it
– Etc…