Deep Neural Networks
Scanning for patterns (aka convolutional networks)
Bhiksha Raj
Story so far:
– MLPs are universal function approximators: Boolean functions, classifiers, and regressions
– MLPs can be trained through variations of gradient descent
An MLP operates on a fixed input layer (or, more generally, a vector input). As a classifier it composes complex decision regions from simple pieces:
– The linear decision boundaries in this example, one per first-layer unit
– Are combined into the polygons that bound the target class
[Figure: linear boundaries combined by AND units, whose outputs are combined by an OR unit]
MLPs capture patterns hierarchically:
– Each layer learns to “detect” simple combinations of the patterns detected by earlier layers
– The individual units are typically linear classifiers or thresholding units, incapable of individually holding complex patterns
Consider the individual perceptron:
– The perceptron is fully represented by its weights; for illustration, we consider a simple threshold activation
– The perceptron “fires” if the inner product between the weights and the inputs exceeds a threshold
With inputs $x_1, x_2, \ldots, x_N$:
$$y = \begin{cases} 1 & \text{if } \sum_j w_j x_j \ge T \\ 0 & \text{else} \end{cases}
\qquad\text{i.e.}\qquad
y = \begin{cases} 1 & \text{if } \mathbf{w}^T\mathbf{x} \ge T \\ 0 & \text{else} \end{cases}$$
The weight vector acts as a template:
– The perceptron fires if the input pattern matches the weight pattern closely enough
– For inputs of fixed length, the set of firing inputs represents a convex region on the surface of the sphere!
$$\mathbf{w}^T\mathbf{x} > T \;\Rightarrow\; \cos\theta > \frac{T}{|\mathbf{w}|\,|\mathbf{x}|} \;\Rightarrow\; \theta < \cos^{-1}\!\frac{T}{|\mathbf{w}|\,|\mathbf{x}|}$$
where $\theta$ is the angle between the weight vector $\mathbf{w}$ and the input $\mathbf{x} = (x_1, x_2, \ldots, x_N)$: the neuron fires when the input lies within a cone around the weight vector.
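A minimal NumPy sketch of this equivalence (all values illustrative): the threshold test and the angle test are the same condition.

```python
import numpy as np

rng = np.random.default_rng(0)
w, x, T = rng.standard_normal(8), rng.standard_normal(8), 0.5

# Threshold test: the perceptron fires iff w^T x >= T
fires = np.dot(w, x) >= T

# Angle test: cos(theta) >= T / (|w||x|), i.e. x lies within a cone around w
norms = np.linalg.norm(w) * np.linalg.norm(x)
fires_by_angle = (np.dot(w, x) / norms) >= (T / norms)

assert fires == fires_by_angle  # the two tests always agree
```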
[Figure: a weight template W compared against two input patterns X: correlation 0.57 vs. 0.82; the closer match produces the stronger response]
The perceptron with threshold activation,
$$y = \begin{cases} 1 & \text{if } \sum_j w_j x_j \ge T \\ 0 & \text{else,} \end{cases}$$
can therefore:
– Detect if certain patterns have occurred in the input
Example: “DIGIT OR NOT?” – an MLP whose output fires if the input image contains a digit.
MLPs as pattern detectors:
– They represent Boolean functions over linear boundaries, and can represent arbitrary boundaries
– They detect patterns in the input: patterns composed of lower-complexity patterns
– The network will fire if the combination of the detected basic features matches an “acceptable” pattern for a desired class of signal
But the detection is position-specific: an MLP trained to find a pattern where it occurs in the top recording will not find it in the lower one, where it occurs at a different time
– Unless trained with both
– Which will require a very large network and a large amount of training data to cover every case
The location of the pattern within the input layer should not matter:
– Only the presence of the pattern matters
– Yet moving it by one component results in an entirely different input that the MLP won’t recognize
The fix: scan the input with a small network and combine the outputs from all positions. The combination can be:
– The maximum of all the outputs (the equivalent of a Boolean OR)
– Or a proper softmax/logistic, which is better behaved in the presence of noise
– Or a perceptron, or even an MLP, over the position-wise outputs
[Figure: a shared network scanning the input (the pixel data), with a MAX unit over all positions]
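A minimal sketch of this scan-and-max arrangement, assuming a tiny one-hidden-layer net and illustrative sizes (function names are not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def small_mlp(window, W1, b1, w2, b2):
    """The shared pattern detector, applied to one window of the input."""
    h = np.maximum(0.0, W1 @ window + b1)          # hidden layer (ReLU)
    return 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))    # "pattern present" score

def scan_and_max(x, K, params):
    """Slide the same MLP over every length-K window; OR the results with a max."""
    return max(small_mlp(x[t:t + K], *params) for t in range(len(x) - K + 1))

K, H = 8, 4                                        # window size, hidden width
params = (rng.standard_normal((H, K)), np.zeros(H),
          rng.standard_normal(H), 0.0)
x = rng.standard_normal(100)                       # a 1-D input, e.g. audio
print(scan_and_max(x, K, params))                  # high if the pattern occurs anywhere
```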
The whole assembly is trained as usual:
– Provide many training examples
– Use gradient descent to minimize the total divergence between predicted and desired outputs
This is a shared-parameter network. Consider the simplest case: a weight $w_{ij}^{(k)}$ is required to be identical to the weight $w_{mn}^{(l)}$; call their common value $w_{\mathcal{S}}$.
Any change to the common value perturbs both weights identically:
– Each of these perturbations will individually influence the divergence $\mathrm{Div}(d, y)$

By the chain rule,
$$\frac{d\,\mathrm{Div}}{d w_{\mathcal{S}}}
= \frac{d\,\mathrm{Div}}{d w_{ij}^{(k)}}\frac{d w_{ij}^{(k)}}{d w_{\mathcal{S}}}
+ \frac{d\,\mathrm{Div}}{d w_{mn}^{(l)}}\frac{d w_{mn}^{(l)}}{d w_{\mathcal{S}}}
= \frac{d\,\mathrm{Div}}{d w_{ij}^{(k)}} + \frac{d\,\mathrm{Div}}{d w_{mn}^{(l)}}$$
since $dw_{ij}^{(k)}/dw_{\mathcal{S}} = dw_{mn}^{(l)}/dw_{\mathcal{S}} = 1$. Both terms on the right are computed via backpropagation.
[Influence diagram: $w_{\mathcal{S}} \to \{w_{ij}^{(k)}, w_{mn}^{(l)}\} \to y \to \mathrm{Div}(d, y)$]
More generally, for any set of weights $\mathcal{S} = \{e_1, e_2, \ldots, e_N\}$ constrained to share the common weight $w_{\mathcal{S}}$ of the set
– E.g. the set of all red weights in the figure
$$\frac{d\,\mathrm{Div}}{d w_{\mathcal{S}}} = \sum_{e \in \mathcal{S}} \frac{d\,\mathrm{Div}}{d w_{e}}$$
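A one-line sketch of this accumulation, assuming backprop has already produced the per-instance derivatives (the dict layout is illustrative):

```python
# grads maps each weight instance e to dDiv/dw_e, as computed by backprop.
# The gradient for the shared value w_S is simply the sum over the set S,
# e.g. S = the set of all "red" weight instances in the figure.
def shared_gradient(grads, S):
    return sum(grads[e] for e in S)
```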
Gradient descent with weight sharing:
– Initialize all weights $W_1, W_2, \ldots, W_K$, and set every member of a shared set to the common value: $w_{i,j}^{(k)} = w_{\mathcal{S}}$ for all $(i,j,k) \in \mathcal{S}$
– For each training instance, accumulate the gradient of the shared weight over its set, each term computed by backprop:
$$\nabla_{w_{\mathcal{S}}}\mathrm{Div} \mathrel{+}= \frac{d\,\mathrm{Div}}{d w_{i,j}^{(k)}} \quad \text{for all } (i,j,k) \in \mathcal{S}$$
– Compute the total gradient $\nabla_{w_{\mathcal{S}}}\mathrm{Err} = d\,\mathrm{Err}/d w_{\mathcal{S}}$ over the training set, update the common value, and redistribute it:
$$w_{\mathcal{S}} = w_{\mathcal{S}} - \eta\,\nabla_{w_{\mathcal{S}}}\mathrm{Err}, \qquad w_{i,j}^{(k)} = w_{\mathcal{S}} \ \text{for all } (i,j,k) \in \mathcal{S}$$
– Unshared weights update as usual: $w_{i,j}^{(l)} = w_{i,j}^{(l)} - \eta\, d\,\mathrm{Err}/d w_{i,j}^{(l)}$
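Putting the algorithm together, a minimal sketch; backprop() is a hypothetical routine that returns dDiv/dw for every individual weight instance:

```python
def train_shared(data, w_shared, shared_sets, eta, backprop, epochs=10):
    """Gradient descent where shared_sets maps each shared weight S to the
       list of weight instances (i, j, k) constrained to equal w_shared[S]."""
    for _ in range(epochs):
        # Distribute: every instance in a set takes the common value w_S
        weights = {inst: w_shared[S]
                   for S, insts in shared_sets.items() for inst in insts}
        grad = {S: 0.0 for S in shared_sets}
        for x, d in data:
            g = backprop(weights, x, d)            # dDiv/dw per instance
            for S, insts in shared_sets.items():
                grad[S] += sum(g[inst] for inst in insts)  # sum over the set
        for S in shared_sets:                      # one update per shared weight
            w_shared[S] -= eta * grad[S]
    return w_shared
```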
This machinery lets us build networks that work by scanning:
– 1-D scanning for sound
– 2-D scanning for images
– 3-D and higher-dimensional scans for higher-dimensional data
The resulting large network has shared subnets, and:
– Training is ordinary backpropagation, modified to combine gradients from parameters that share the same value
– The principle applies in general for networks with shared parameters
Scanning an image: the network looks at the input (the pixel data) through a small box, interpreting the picture in the box as part of the classification for that region. Each first-level neuron computes
$$\text{activation}\Big(\sum_{i,j} w_{ij}\, p_{ij} + b\Big)$$
over the pixels $p_{ij}$ in its box.
– We could arrange the outputs of the neurons for each position correspondingly to the original picture: each neuron’s outputs, collected across every scanned position, form a rectangle (a “map”) that is proportional in size to the original picture
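A sketch of one such first-level map, assuming a square K×K box and a tanh activation (both illustrative):

```python
import numpy as np

def first_layer_map(picture, W, b):
    """Evaluate one neuron at every KxK position of the picture and arrange
       the outputs into a map proportional in size to the picture."""
    K = W.shape[0]
    H, Wd = picture.shape
    out = np.empty((H - K + 1, Wd - K + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = picture[r:r + K, c:c + K]
            out[r, c] = np.tanh(np.sum(W * patch) + b)  # activation(sum w_ij p_ij + b)
    return out
```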
– The second-level neurons too are “scanning” the rectangular output maps of the first level
– Like the first level they scan, but unlike it, they jointly scan multiple “pictures”: each second-level neuron looks at the corresponding locations from the outputs of all the first-level neurons
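A matching sketch of a second-level neuron that jointly scans all N1 first-level maps (shapes illustrative, tanh again assumed):

```python
import numpy as np

def second_layer_map(maps, W, b):
    """maps: stacked first-level outputs, shape (N1, H, W).
       W: one second-level neuron's weights, shape (N1, K, K): it reads the
       same KxK window from every first-level map jointly."""
    N1, H, Wd = maps.shape
    K = W.shape[-1]
    out = np.empty((H - K + 1, Wd - K + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            block = maps[:, r:r + K, c:c + K]   # same locations, all maps
            out[r, c] = np.tanh(np.sum(W * block) + b)
    return out
```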
The objective is to detect whether the picture contains a flower:
– Not to “detect the location of a flower”
To detect the picture, the final stage:
– Finds the max output from all the positions
– Or, instead of the max, could use multiple MLPs, or a single larger MLP with multiple outputs
This scanning network computes a distributed representation:
– The first layer evaluates smaller blocks of pixels and extracts block-level features
– Subsequent layers only perform classification over these block-level features; they never look at the pixels directly, and each only looks at a single location in its input maps
– The next layer evaluates blocks of outputs from the first layer; this effectively evaluates the larger block of the original image
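A quick check of the region of the original image a higher-layer neuron effectively evaluates, under the (assumed) non-overlapping-block arrangement:

```python
# With non-overlapping blocks, each layer multiplies the side of the region
# of the original image that a neuron effectively sees.
blocks = [4, 2, 2]          # hypothetical block sides M1, M2, M3
side = 1
for M in blocks:
    side *= M
    print(f"effective region: {side} x {side} pixels")
# -> 4x4, 8x8, 16x16: the third layer evaluates 16x16 blocks of the image
```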
[Figure: colors indicate neurons with shared parameters. Layer 1: each arrow represents an entire set of weights, and the pattern of weights going out of any cell is identical to that from any other cell. Layer 2: no sharing at this level within a block; all weights shown are unique.]
We are effectively evaluating the yellow block with the shared-parameter net shown to the right, and every block is evaluated using the same net in the overall computation:
– The individual subnets are themselves shared-parameter nets
– All other blocks are scanned by the same net
Counting parameters (ignoring bias terms): a plain MLP with $N_i$ units per layer has $(N_{i-1} + 1)N_i$ weights in each subsequent $i$-th layer. For an $L \times L$ block scanned in $M \times M$ cells, with $N_1$ first-layer and $N_2$ second-layer units:
– The first layer needs $N_1(M^2 + 1)$ weights
– The second layer sees $(L/M)^2$ outputs per first-layer map, so it needs $\left((L/M)^2 N_1 + 1\right)N_2$ weights
– Total parameters: $O\!\left(M^2 N_1 + (L/M)^2 N_1 N_2 + N_2 N_3 + \cdots\right)$
[Figure: colors indicate neurons with shared parameters; $N_1$ groups; no sharing at layer 2 within a block; $L \times L$ block, $M \times M$ cell]
Distributing further, the first layer scans blocks of $M_1^2$ pixels and the second layer scans $M_2 \times M_2$ blocks of first-layer outputs:
– First layer: $N_1(M_1^2 + 1)$ weights
– Second-layer neurons: $(M_2^2 N_1 + 1)N_2$ weights
– Third layer: $\left((L/(M_1 M_2))^2 N_2 + 1\right)N_3$ weights
– Total parameters: $O\!\left(M_1^2 N_1 + M_2^2 N_1 N_2 + (L/(M_1 M_2))^2 N_2 N_3 + \cdots\right)$
Example: $L = 16$, $N_1 = 4$, $N_2 = 2$, $N_3 = 1$:
– Conventional MLP, not distributed: dominated by the first layer’s $L^2 N_1$ weights
– Distributed (3 layers): $M_1^2 N_1 + M_2^2 N_1 N_2 + (L/(M_1 M_2))^2 N_2 N_3$
In general, for a network with $n_{\text{conv}}$ scanned layers:
$$O\!\left(M_1^2 N_1 + \sum_{i < n_{\text{conv}}-1} M_i^2 N_i N_{i+1} + \Big(\frac{L}{\prod_i \text{hop}_i}\Big)^{\!2} N_{n_{\text{conv}}-1} N_{n_{\text{conv}}} + \cdots\right)$$
These terms dominate.
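A worked comparison using these example sizes; the block sides M1 = 4, M2 = 2 are assumed for illustration and are not from the slide:

```python
# Parameter counts, ignoring biases. L, N1, N2, N3 from the example above;
# M1, M2 are assumed block sides.
L, M1, M2, N1, N2, N3 = 16, 4, 2, 4, 2, 1

conventional = L**2 * N1 + N1 * N2 + N2 * N3        # full MLP on the LxL block
distributed = (M1**2 * N1 + M2**2 * N1 * N2
               + (L // (M1 * M2))**2 * N2 * N3)     # 3-layer distributed scan

print(conventional, distributed)   # 1034 vs 104: about a 10x reduction
```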
The distributed version is:
– More generalizable
– A large (sometimes order-of-magnitude) reduction in parameters
We can still view this network as “scanning” the picture with an MLP:
– The only difference is the manner in which parameters are shared in the MLP
– The first-layer neurons look at small blocks of the input: sufficient to detect, say, petals
– Higher layers combine the outputs of lower ones, to put the petals together into a flower; this corresponds to looking at a larger region of the original input image
– The squares show the sizes of the receptive fields for the first-, second- and third-layer neurons
In practice the weights are learned and must be calculated, so:
– What patterns in the input do the neurons actually respond to?
– They will not actually be simple, identifiable patterns like “petal” and “inflorescence”
Handling jitter:
– If a pattern shifts by one pixel, is it still a petal? A small jitter is acceptable
– So, instead of the value at a single position, take the maximum over a small region around it: “max filtering” or “max pooling”
If the max is computed over non-overlapping regions, only one value is kept per region:
– This will result in a shrinking of the map
– The operation is usually called “pooling”
[Figure: max pool with 2x2 filters and stride 2]
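A minimal NumPy sketch of this operation:

```python
import numpy as np

def max_pool(m, size=2, stride=2):
    """Max pool a 2-D map with size x size filters and the given stride."""
    H, W = m.shape
    out_h, out_w = (H - size) // stride + 1, (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = m[r * stride:r * stride + size,
                          c * stride:c * stride + size].max()
    return out

m = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool(m))   # [[ 5.  7.] [13. 15.]]: the map shrinks by the stride
```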
The full network interleaves scanned layers with max pooling, including max pooling (and the accompanying reduction of the maps) before the final MLP.
Terminology:
– The individual perceptrons at any “scanning” or “convolutive” layer are called “filters”
– As mentioned, the individual max operations are also called max pooling or max filters
[Figure: input image → first-layer filters → first-layer max pooling → second-layer filters → second-layer max pooling]
The same architecture applies to sound: the spectrographic time-frequency components are the input layer, scanned along time.
Story so far:
– Neural networks learn patterns hierarchically, from simple to complex
– Pattern-detection tasks (“does this picture contain a flower?”) are best performed by scanning for the target pattern
– Scanning is equivalent to a large shared-parameter network, and can be decomposed into scanning with individual neurons:
– First-level neurons scan the input
– Higher-level neurons scan the “maps” formed by lower-level neurons
– A final “decision” layer (which may be a max, a perceptron, or an MLP) makes the final decision
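To close, an end-to-end sketch of the pipeline just summarized (scan → pool → scan → pool → max decision); all shapes and the tanh activation are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def scan(maps, W, b):
    """Scan input maps (Nin, H, W) with Nout filters of shape (Nin, K, K)."""
    Nin, H, Wd = maps.shape
    Nout, _, K, _ = W.shape
    out = np.empty((Nout, H - K + 1, Wd - K + 1))
    for n in range(Nout):
        for r in range(out.shape[1]):
            for c in range(out.shape[2]):
                out[n, r, c] = np.tanh(np.sum(W[n] * maps[:, r:r + K, c:c + K]) + b[n])
    return out

def pool(maps, s=2):
    """2x2 max pooling with stride 2 on a stack of maps."""
    N, H, Wd = maps.shape
    return maps[:, :H // s * s, :Wd // s * s].reshape(N, H // s, s, Wd // s, s).max(axis=(2, 4))

img = rng.standard_normal((1, 16, 16))                       # one-channel image
h1 = pool(scan(img, rng.standard_normal((4, 1, 3, 3)), np.zeros(4)))
h2 = pool(scan(h1, rng.standard_normal((2, 4, 3, 3)), np.zeros(2)))
decision = h2.max()          # "pattern anywhere?" -- the Boolean-OR-like max
print(h1.shape, h2.shape, decision)                          # (4, 7, 7) (2, 2, 2)
```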