Neural Networks: What can a network represent
Deep Learning, Fall 2020
Recap: Neural networks have taken over AI
– Tasks that are made possible by NNs, aka deep learning
– Tasks that were once assumed to be purely in the human domain of expertise
– Functions that take an input and produce an output
– What’s in these functions?
[Figure: voice signal → N.Net → transcription; image → N.Net → text caption; game state → N.Net → next move]
Neural networks are built from computational models of neurons called perceptrons
– “Fires” if the weighted sum of inputs exceeds a threshold
– Electrical engineers will call this a threshold gate
[Figure: a perceptron over inputs x1 … xN]
A perceptron applies an activation function to the weighted combination of inputs (and threshold)
– We will hear more about activations later
Common activations applied to the weighted combination $z = \sum_i w_i x_i + b$:
– sigmoid: $\frac{1}{1 + \exp(-z)}$
– tanh: $\tanh(z)$
– softplus: $\log(1 + e^z)$
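A minimal sketch in Python of a single perceptron with these activations (helper names are illustrative, not from the slides):

```python
import numpy as np

# A single "soft" perceptron: z is the weighted combination of inputs
# (plus a bias/threshold term b); an activation f is applied to z.

def weighted_sum(x, w, b):
    return np.dot(w, x) + b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))  # log(1 + e^z)

x = np.array([1.0, 0.5, -0.2])   # inputs
w = np.array([0.4, -0.3, 0.9])   # weights
b = -0.1                         # bias (negative threshold)
z = weighted_sum(x, w, b)
print(sigmoid(z), np.tanh(z), softplus(z))
```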
– Perceptrons “feed” other perceptrons
– We will give the “formal” definition of a layer later
In a network with input source nodes and output sink nodes, “depth” is the length of the longest path from a source to a sink
– A “source” node in a directed graph is a node that has only outgoing edges
– A “sink” node is a node that has only incoming edges
– The input nodes are “sources”; the output nodes are “sinks”
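As a small illustration, depth can be computed as the longest source-to-sink path in the network’s directed acyclic graph (the graph below is a made-up example):

```python
from functools import lru_cache

edges = {            # adjacency list: node -> nodes it feeds
    "x": ["h1", "h2"],
    "h1": ["h3"],
    "h2": ["h3", "y"],
    "h3": ["y"],
    "y": [],
}

@lru_cache(maxsize=None)
def longest_path_from(node):
    outs = edges[node]
    if not outs:                      # a "sink": only incoming edges
        return 0
    return 1 + max(longest_path_from(n) for n in outs)

# "sources" are nodes with only outgoing edges
sources = set(edges) - {v for outs in edges.values() for v in outs}
depth = max(longest_path_from(s) for s in sources)
print(depth)  # depth of this example network: 3
```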
[Figure: a layered network; input: black, layer 1: red, layer 2: green, layer 3: yellow, layer 4: blue]
– Can have multiple outputs for a single input
– What kinds of input/output relationships can it model?
[Figure: a network of threshold-gate perceptrons computing a Boolean function of inputs X, Y, Z, A]
[Figure: perceptrons as Boolean gates: X AND Y (weights 1, 1; threshold 2), X OR Y (weights 1, 1; threshold 1), NOT X (weight -1; threshold 0)]
Values in the circles are thresholds; values on edges are weights
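A sketch of these gates as threshold perceptrons, with the weights and thresholds read off the figure above:

```python
# Boolean gates as threshold perceptrons.

def perceptron(inputs, weights, threshold):
    """Fires (1) iff the weighted sum of inputs meets the threshold."""
    return int(sum(w * x for w, x in zip(weights, inputs)) >= threshold)

def AND(x, y): return perceptron([x, y], [1, 1], 2)   # both must be 1
def OR(x, y):  return perceptron([x, y], [1, 1], 1)   # at least one 1
def NOT(x):    return perceptron([x], [-1], 0)        # fires iff x == 0

for x in (0, 1):
    for y in (0, 1):
        print(x, y, AND(x, y), OR(x, y), NOT(x))
```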
Generalized AND: weights of 1 on X1 … XL, -1 on XL+1 … XN, threshold L
– Will fire only if X1 … XL are all 1 and XL+1 … XN are all 0
Generalized OR: the same weights, threshold L-N+1
– Will fire only if any of X1 … XL are 1 or any of XL+1 … XN are 0
Generalized majority gates:
– With weights of 1 and threshold K: will fire only if at least K inputs are 1
– With weights of 1 on X1 … XL, -1 on XL+1 … XN and threshold L-N+K: will fire only if the total number of X1 … XL that are 1 and XL+1 … XN that are 0 is at least K
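A sketch of these generalized gates (a hedged reconstruction of the figures; the function names are mine):

```python
# Generalized threshold gates: +1 weights on X1..XL, -1 on XL+1..XN.

def threshold_gate(xs, L, threshold):
    s = sum(xs[:L]) - sum(xs[L:])   # weighted sum with +1/-1 weights
    return int(s >= threshold)

def generalized_and(xs, L):
    # fires only if X1..XL are all 1 and XL+1..XN are all 0
    return threshold_gate(xs, L, L)

def generalized_or(xs, L):
    # fires only if any of X1..XL is 1 or any of XL+1..XN is 0
    return threshold_gate(xs, L, L - len(xs) + 1)

def at_least_K(xs, L, K):
    # fires only if at least K of the conditions
    # (X1..XL being 1, XL+1..XN being 0) hold
    return threshold_gate(xs, L, L - len(xs) + K)

print(generalized_and([1, 1, 0, 0], L=2))  # 1
print(generalized_or([0, 0, 1, 1], L=2))   # 0
print(at_least_K([1, 0, 0, 1], L=2, K=2))  # 1: X1=1 and X3=0 hold
```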
[Figure: a single perceptron over X and Y with its weights and threshold marked “?”: can it compute XOR?]
[Figure: XOR with one hidden layer: hidden unit 1 computes X OR Y (weights 1, 1; threshold 1), hidden unit 2 computes X AND Y (weights 1, 1; threshold 2), and the output unit fires if h1 - h2 ≥ 1]
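A sketch of this XOR network, assuming the weights and thresholds read off the figure above:

```python
# XOR from threshold units: OR but not AND.

def fires(weighted_sum, threshold):
    return int(weighted_sum >= threshold)

def xor(x, y):
    h1 = fires(x + y, 1)        # X OR Y
    h2 = fires(x + y, 2)        # X AND Y
    return fires(h1 - h2, 1)    # fires iff OR holds but AND does not

for x in (0, 1):
    for y in (0, 1):
        print(x, y, xor(x, y))  # outputs 0, 1, 1, 0
```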
[Figure: an alternative XOR network with thresholds 0.5 and 1.5; thanks to Gerald Friedland]
MLPs can compute any Boolean function
– Since they can emulate individual gates
– Any function over any number of inputs and any number of outputs
[Truth table over inputs X1 … X5 and output Y]
The truth table shows all input combinations for which the output is 1
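These slides build toward the standard one-hidden-layer construction: one hidden “row detector” per input combination with output 1, OR-ed at the output. A sketch (the table rows below are made up for illustration):

```python
# One-hidden-layer network from a truth table (DNF construction).

rows_with_output_1 = [            # each row: values of X1..X5 where Y = 1
    (0, 0, 1, 1, 0),
    (1, 1, 0, 0, 1),
    (1, 0, 1, 1, 1),
]

def row_detector(xs, row):
    # Generalized AND: +1 weight where the row has 1, -1 where it has 0;
    # fires only on exactly this input pattern.
    L = sum(row)
    s = sum(+x if r == 1 else -x for x, r in zip(xs, row))
    return int(s >= L)

def f(xs):
    # The output unit ORs the hidden detectors.
    return int(sum(row_detector(xs, r) for r in rows_with_output_1) >= 1)

print(f((0, 0, 1, 1, 0)))  # 1: matches a row of the table
print(f((0, 0, 0, 0, 0)))  # 0
```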
But what is the largest number of perceptrons required in the single hidden layer for an N-input-variable function?
This is a “Karnaugh map”: it represents a truth table as a grid; filled boxes represent input combinations for which the output is 1, and blank boxes those for which it is 0
Adjacent boxes can be “grouped” to reduce the complexity of the DNF formula for the table
[Karnaugh map with rows and columns indexed 00, 01, 11, 10]
Basic DNF formula will require 7 terms
– Find groups
– Express as reduced DNF
– The Boolean network for this function needs only 3 hidden units
Reducing the DNF reduces the size of the one-hidden-layer network
The worst case is a “checkerboard” Karnaugh map, i.e. the XOR of all inputs: no boxes can be grouped, so the DNF for N variables has $2^{N-1}$ terms, and a one-hidden-layer network needs $2^{N-1}$ hidden perceptrons
[Karnaugh maps for four variables (WX × YZ) and six variables (WX × YZ × UV); red = 0, white = 1]
The XOR of W, X, Y, Z computed by a deep network: 9 perceptrons
The XOR of U, V, W, X, Y, Z computed by a deep network: 15 perceptrons
More generally, the XOR of N variables will require 3(N-1) perceptrons!!
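A sketch of the deep construction: chaining (or treeing) two-input XOR subnetworks, each built from 3 perceptrons, gives 3(N-1) perceptrons for the N-variable XOR:

```python
# XOR of N variables from 2-input XOR subnetworks (3 perceptrons each).

def fires(s, t):
    return int(s >= t)

def xor2(x, y):                 # 3 perceptrons per pairwise XOR
    return fires(fires(x + y, 1) - fires(x + y, 2), 1)

def xor_n(xs):
    acc = xs[0]
    for x in xs[1:]:            # N-1 pairwise XORs -> 3(N-1) perceptrons
        acc = xor2(acc, x)
    return acc

print(xor_n([1, 0, 1, 1, 0, 1]))  # parity of the inputs: 0
```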
[Figure: the N-variable XOR computed as a tree of pairwise XOR subnetworks]
Depth matters: a network with fewer than the minimum required number of layers requires exponentially many neurons
– Because the output can be shown to be the XOR of all the outputs of the (K-1)-th hidden layer
– I.e. reducing the number of layers below the minimum will result in an exponentially sized network to express the function fully
– A network with fewer than the minimum required number of neurons cannot model the function
The actual number of parameters in a network is the number of connections
– In this example there are 30
– Shallower implementations of the same function may require an exponential number of weights
[Figure: a deep XOR network over inputs X1 … X5]
[Figure: the “size” of a network, over inputs X1 … X5 with hidden neurons a, b, c, d, e, f]
– Furst, Saxe and Sipser, “Parity, Circuits, and the Polynomial-Time Hierarchy”, Mathematical Systems Theory 1984
– Alternately stated: parity cannot be computed by constant-depth, polynomial-size circuits of unbounded fan-in elements
The network size/depth tradeoff
– Shannon (1949): for every $n$, there is a Boolean function of $n$ variables that requires at least $2^n/n$ Boolean gates
– More correctly, for large $n$, almost all $n$-input Boolean functions need more than $2^n/n$ Boolean gates
– If any Boolean function over $n$ inputs could be computed using a circuit of size that is polynomial in $n$, P = NP!
An MLP is a universal Boolean machine, provided:
– It is sufficiently wide
– It is sufficiently deep
– Depth can be traded off for (sometimes) exponential growth of the width of the network
Optimal depth and width depend on the complexity of the Boolean function
– Complexity: the minimal number of terms in a DNF formula to represent it
An MLP with a single hidden layer is a universal Boolean machine
– But a single-layer network may require an exponentially large number of perceptrons
Deeper networks may require far fewer neurons than shallower networks to express the same function
– Could be exponentially smaller
Caveat: networks of perceptrons are threshold circuits (TC), not simply Boolean circuits
– Specifically composed of threshold gates
– E.g. “at least K inputs are 1” is a single TC gate, but an exponential-size AC (AND/OR/NOT) circuit
– For fixed depth, Boolean circuits ⊂ threshold circuits (strict subset)
– A depth-2 TC parity circuit can be composed with $\mathcal{O}(n^2)$ weights
– But a network of depth $\log(n)$ requires only $\mathcal{O}(n)$ weights
– More generally, for large $n$, for most Boolean functions, a threshold circuit that is polynomial in $n$ at optimal depth may become exponentially large at lesser depths
Other analyses consider arithmetic circuits
– Circuits which compute polynomials over any field
MLPs as classifiers: inputs are real-valued vectors, e.g. 784 dimensions (MNIST)
– A single perceptron is a linear classifier
[Figure: a perceptron's decision boundary $w_1 x_1 + w_2 x_2 = T$ in the $(x_1, x_2)$ plane; the XOR labeling of the points (0,0), (0,1), (1,0), (1,1) is not linearly separable]
Perceptrons can now be composed into “networks” to compute arbitrary classification “boundaries”
[Figure: an AND of five half-plane perceptrons y1 … y5 captures a pentagonal decision region]
Composing more complex decision regions:
– “OR” two polygons
– A third layer is required
[Figure: a network over x1, x2 with two AND units, each capturing a polygon, combined by an OR unit at the output]
In fact, arbitrary decision regions can be composed
– With only one hidden layer!
– How?
[Figure: a square region from 4 half-plane perceptrons y1 … y4; the summed response is 4 inside the square and at most 2 almost everywhere outside]
[Figure: a pentagon from 5 half-plane perceptrons y1 … y5; the summed response is 5 inside, 4 adjacent to the edges, and 3 or less further out]
[Figure: a hexagon from 6 half-plane perceptrons y1 … y6; the summed response is 6 inside, 5 adjacent to the edges, and smaller further out]
As the number of sides N increases, the polygon approaches a circle and the response outside flattens out
– Value of the sum at the output unit, as a function of distance from center, as N increases: N in the cylinder, N/2 outside
[Figure: the response profile over the input plane, with levels N and N/2]
– Very large number of neurons
– Sum is N inside the circle, N/2 almost everywhere outside
– The circle can be at any location
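A numerical sketch of this construction (the center, radius, and N below are illustrative): N line perceptrons tangent to a circle, each firing on its inner side, are summed; the sum is N inside the circle and roughly N/2 far outside.

```python
import numpy as np

N = 512
center, radius = np.array([0.0, 0.0]), 1.0
angles = np.linspace(0, 2 * np.pi, N, endpoint=False)
normals = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # unit normals

def line_sum(p):
    # unit j fires iff p is on the inner side of the tangent line:
    # normals[j] . p <= normals[j] . (center + radius * normals[j])
    biases = normals @ center + radius
    return int(np.sum(normals @ p <= biases))

print(line_sum(np.array([0.0, 0.0])))   # ~N inside the circle
print(line_sum(np.array([10.0, 0.0])))  # ~N/2 far outside
```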
– Very large number of neurons
– With a bias of -N/2, the sum is N/2 inside the circle and 0 almost everywhere outside
– The circle can be at any location
[Figure: the output unit computes $\bigl(\sum_{j=1}^{N} z_j\bigr) - \frac{N}{2} \geq 0\,?$ over the line-perceptron outputs $z_j$]
Summing two such circle subnetworks: the output is high inside either circle, and 0 almost everywhere outside
[Figure: two circle subnetworks; the output unit thresholds $\sum_{j=1}^{2N} z_j$ in the same way]
– More accurate approximation with a greater number of smaller circles
– Can achieve arbitrary precision
[Figure: K circle subnetworks tiling an arbitrary decision region in the $(x_1, x_2)$ plane; the output unit thresholds $\sum_{j=1}^{KN} z_j$]
Analogous depth analyses exist for arithmetic circuits
– Circuits which compute polynomials over any field
– The majority of functions are very high (possibly infinite) order polynomials, and so benefit from depth
– But the analysis only considers two-input units; generalized by Mhaskar et al. to all functions that can be expressed as a binary tree
– Depth/size analyses of arithmetic circuits are still a research problem
Sum-product networks: Olivier Delalleau and Yoshua Bengio, “Shallow vs. Deep Sum-Product Networks” (2011)
– For networks where layers alternately perform either sums or products, a deep network may be exponentially smaller than a shallow one expressing the same function
E.g. a network capturing one such decision boundary requires:
– 16 neurons in hidden layer 1
– 40 in hidden layer 2
– 57 total neurons, including the output neuron
A more complex pattern requires a larger shallow network:
– 64 neurons in layer 1
– 544 in layer 2
– But only 190 neurons with 2-gate XOR (a deeper composition)
The difference between deep and shallow nets increases with increasing pattern complexity and input dimension
Note that the shallow network here was quadratic in the number of lines
– Not exponential
– Even though the pattern is an XOR
– Why?
Because there are only two fully independent features
– The pattern is exponential in the dimension of the input (two)!
More generally, for K mutually intersecting hyperplanes in D dimensions, we will need on the order of $K^D$ perceptrons in a single hidden layer (one per region the hyperplanes carve out)
– Increasing input dimensions can increase the worst-case size of the shallower network exponentially, but not the XOR net
To summarize:
– Even a network with a single hidden layer is a universal Boolean machine
– Even a network with a single hidden layer is a universal classifier
– Deeper networks may require far fewer neurons than shallower networks to express the same function
– Could be exponentially smaller
– Deeper networks are more expressive
MLPs as universal approximators: a pair of threshold units can generate a “square pulse” over an input
– Output is 1 only if the input lies between T1 and T2
– T1 and T2 can be arbitrarily specified
[Figure: subtracting a unit that fires for x ≥ T2 from one that fires for x ≥ T1 yields a pulse of height 1 on (T1, T2); a weighted sum of such pulses approximates f(x)]
An MLP with one hidden layer can thus approximate any function f(x)
– To arbitrary precision
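A sketch of the pulse construction (the target function and pulse widths below are illustrative):

```python
import numpy as np

# Approximate f(x) by a sum of "square pulses", each built from a pair
# of threshold units (fires for T1 <= x < T2), scaled by f's value.

def pulse(x, t1, t2):
    return (x >= t1).astype(float) - (x >= t2).astype(float)

f = np.sin                                   # illustrative target function
edges = np.linspace(0, np.pi, 51)            # 50 pulses over [0, pi]

def approx_f(x):
    total = np.zeros_like(x)
    for t1, t2 in zip(edges[:-1], edges[1:]):
        total += f(0.5 * (t1 + t2)) * pulse(x, t1, t2)
    return total

x = np.linspace(0, np.pi, 1000)
print(np.max(np.abs(approx_f(x) - f(x))))    # shrinks as pulses narrow
```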
[Figure: the approximation $\sum_i h_i \cdot \mathrm{pulse}_i(x)$ of f(x), where $h_i$ is the height of pulse $i$; narrower pulses give a finer approximation]
This generalizes to functions of any number of dimensions!
– Even with only one hidden layer
– To arbitrary precision
– The MLP is a universal approximator!
For real-valued outputs the output unit may simply be linear
– I.e. does not have an additional “activation”
[Figure: an output neuron over x1 … xN with a sigmoid or tanh activation]
With a threshold or sigmoid (or any other) output activation, the network output can span the entire range of the output activation
– I.e. all values the activation function of the output neuron can produce
The MLP is a Universal Approximator for the entire class of functions (maps) it represents!
A one-hidden-layer MLP is a universal function approximator
– Can approximate any function to arbitrary precision
– But may require infinitely many neurons in the layer
Deeper networks may require far fewer neurons for the same approximation error
– Can be exponentially fewer than the 1-hidden-layer network
– The network is a generic map
A network can represent a given function only if it has sufficient capacity
– I.e. it is sufficiently broad and deep to represent the function
A network with 16 or more neurons in the first layer is capable of representing the figure to the right perfectly
A network with fewer than 16 neurons in the first layer cannot represent this pattern exactly
– With caveats…
A 2-layer network with 16 neurons in the first layer cannot represent the pattern with fewer than 40 neurons in the second layer
Why? We will revisit this idea shortly
The pattern of outputs within any colored region is identical, so subsequent layers do not obtain enough information to partition them
– This effect arises because we use the threshold activation: it gates information in the input from later layers
Continuous activation functions result in graded outputs at the layer
– The gradation provides information to subsequent layers, to capture information “missed” by the lower layer (i.e. it “passes” information to subsequent layers)
– Activations with more gradation (e.g. RELU) pass more information
The “capacity” of a network has several definitions:
– Information or storage capacity: how many patterns can it remember
– VC dimension
– From our perspective: the largest number of disconnected convex regions it can represent
A network cannot exactly model a function that requires a greater minimal number of convex hulls than the capacity of the network
– But can approximate it with error
– Koiran and Sontag (1998): for “linear” or threshold units, the VC dimension is proportional to the number of weights; for some other units it can be proportional to the square of the number of weights
– Bartlett, Harvey, Liaw, Mehrabian, “Nearly-tight VC-dimension bounds for piecewise linear neural networks” (2017): for every $W$ and $L$ (subject to mild conditions), there exists a RELU network with at most $L$ layers and $W$ weights whose VC dimension is at least $\frac{WL}{C} \log\frac{W}{L}$ for a constant $C$
– Friedland and Krell, “A Capacity Scaling Law for Artificial Neural Networks” (2017): bounds capacity in terms of the overall number of hidden neurons and the weights per neuron
A one-hidden-layer MLP is a universal function approximator
– But could be exponentially or even infinitely wide in its input size
Deeper networks may require far fewer neurons
– Deeper networks are more expressive
– More graded activation functions result in more expressive networks
A different kind of unit: the radial basis function (RBF) unit, whose response depends on the distance of the input from a “center”
– The “center” is the parameter specifying the unit
– The most common activation is the exponent, e.g. $\exp(-\beta \, \lVert \mathbf{x} - \boldsymbol{\mu} \rVert^2)$ for center $\boldsymbol{\mu}$
– But other similar activations may also be used
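A sketch of a single RBF unit (the Gaussian form and names are illustrative):

```python
import numpy as np

# A radial basis function unit: response depends on the distance of the
# input from the unit's "center", passed through a Gaussian activation.

def rbf_unit(x, center, bandwidth):
    d2 = np.sum((x - center) ** 2)
    return np.exp(-d2 / (2 * bandwidth ** 2))   # the "exponent" activation

center = np.array([0.0, 0.0])
print(rbf_unit(np.array([0.0, 0.0]), center, 1.0))  # 1.0 at the center
print(rbf_unit(np.array([3.0, 0.0]), center, 1.0))  # ~0 far from it
```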
An RBF network can capture a localized (e.g. circular) decision region with a single unit with appropriate choice of bandwidth (or activation function)
– As opposed to the many units required by the linear-perceptron construction
To summarize: a one-hidden-layer MLP is a universal function approximator
– But could be exponentially or even infinitely wide in its input size
– Deeper networks may require far fewer neurons
– Deeper networks are more expressive
These are all functions:
– E.g. a function that takes an image as input and outputs a text caption
– E.g. a function that takes speech input and outputs the labels of all phonemes in it
– Etc…