Deep Learning with Neural Networks
The Structure and Optimization of Deep Neural Networks Allan Zelener Machine Learning Reading Group January 7th 2016 The Graduate Center, CUNY
Objectives
- Explain some of the trends of deep learning and machine learning research.
- Neural network structure and training.
- Deep learning and related fields.
- Group discussions.
Universal approximation: a network with a single hidden layer can approximate any continuous function $g$ to accuracy $\vartheta$,
$$G(\mathbf{y}) = \sum_{j=1}^{N} w_j\, \varrho(U_j \mathbf{y} + c_j), \qquad |G(\mathbf{y}) - g(\mathbf{y})| < \vartheta.$$
Intuitively, $\varrho(U_j \mathbf{y} + c_j)$ should be 1 if $g(\mathbf{y}) \approx w_j$ and 0 otherwise.
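As a sanity check, here is a minimal numpy sketch of this idea (my own illustration, not from the talk): fix random hidden parameters $U_j$, $c_j$, use a sigmoid for $\varrho$, and fit the outer weights $w_j$ by least squares so that $G(y)$ approximates a target $g(y) = \sin(2\pi y)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Target function g to approximate on [0, 1].
g = lambda y: np.sin(2 * np.pi * y)

N = 200                                 # number of hidden units
U = rng.normal(scale=20.0, size=N)      # random hidden weights U_j
c = rng.uniform(-20.0, 20.0, size=N)    # random hidden biases c_j

y = np.linspace(0.0, 1.0, 500)
Phi = sigmoid(np.outer(y, U) + c)       # Phi[i, j] = rho(U_j * y_i + c_j)

# Fit the outer weights w_j by least squares: G(y) = sum_j w_j * rho(U_j y + c_j).
w, *_ = np.linalg.lstsq(Phi, g(y), rcond=None)
G = Phi @ w

print("max |G(y) - g(y)| =", np.abs(G - g(y)).max())  # small for large N
```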
- Deep networks can represent some functions using fewer parameters than shallow networks.
- Composition of features, feature sharing, and distributed representation.
- The analogy to the brain is loose, but there may be some similarities.
- Deep architectures have become standard in the field.
We want to learn interesting functions $g: Y \rightarrow Z$ that describe mappings from inputs to targets.
For a linear layer $X\mathbf{y} + \mathbf{c}$: $\frac{\partial (X\mathbf{y} + \mathbf{c})}{\partial X} = \mathbf{y}$ and $\frac{\partial (X\mathbf{y} + \mathbf{c})}{\partial \mathbf{c}} = 1$.
The rectified linear unit $\mathrm{ReLU}(y) = \max(0, y)$ has derivative
$$\frac{\partial\,\mathrm{ReLU}(y)}{\partial y} = \begin{cases} 1, & y > 0 \\ 0, & y \le 0. \end{cases}$$
[Plot: $y$ vs. $\mathrm{ReLU}(y)$.]
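A quick numerical check of the ReLU derivative above (a standalone sketch; the test points are arbitrary):

```python
import numpy as np

relu = lambda y: np.maximum(0.0, y)
relu_grad = lambda y: (y > 0).astype(float)   # 1 if y > 0, else 0

y = np.array([-2.0, -0.5, 0.3, 1.7])
eps = 1e-6
numeric = (relu(y + eps) - relu(y - eps)) / (2 * eps)   # central differences
print(np.allclose(numeric, relu_grad(y)))               # True away from y = 0
```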
A deep network stacks these layers, e.g.
$$\mathbf{h}_1 = \mathrm{ReLU}(X_0\,\mathbf{y} + \mathbf{c}_1),\quad \mathbf{h}_2 = \mathrm{ReLU}(X_1\,\mathbf{h}_1 + \mathbf{c}_2),\quad \dots,\quad \mathbf{h}_{l+1} = \mathrm{ReLU}(X_l\,\mathbf{h}_l + \mathbf{c}_{l+1}).$$
[Diagram: input $\mathbf{y}$ feeding hidden layers $\mathbf{h}_1, \mathbf{h}_2, \dots, \mathbf{h}_{l+1}$.]
For classification the final layer is a softmax over $L$ classes,
$$\mathrm{softmax}(\mathbf{f}(\mathbf{y}))_j = \frac{e^{f(\mathbf{y})_j}}{\sum_{k=1}^{L} e^{f(\mathbf{y})_k}} = q(z = j \mid \mathbf{y}),$$
and the predicted class is $\mathrm{class}(\mathbf{y}) = \arg\max_j\, q(z = j \mid \mathbf{y})$.
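Putting the pieces together, a hedged numpy sketch of the forward pass, from input through stacked ReLU layers to softmax class probabilities (the layer sizes and random weights are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(a):
    return np.maximum(0.0, a)

def softmax(a):
    e = np.exp(a - a.max())       # shift for numerical stability
    return e / e.sum()

# Two hidden layers then an L-way softmax classifier.
sizes = [10, 32, 32, 5]           # input dim, hidden dims, number of classes L
Xs = [rng.normal(scale=0.1, size=(n_out, n_in))
      for n_in, n_out in zip(sizes[:-1], sizes[1:])]
cs = [np.zeros(n_out) for n_out in sizes[1:]]

y = rng.normal(size=sizes[0])     # input vector
h = y
for X, c in zip(Xs[:-1], cs[:-1]):
    h = relu(X @ h + c)           # h_{k+1} = ReLU(X_k h_k + c_{k+1})
q = softmax(Xs[-1] @ h + cs[-1])  # class probabilities q(z = j | y)
print("predicted class:", int(np.argmax(q)), "probabilities:", np.round(q, 3))
```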
Let $z$ be the "ground truth" target for $\mathbf{y}$, and let $M(z, \hat{z})$ be the loss for our prediction $\hat{z} = \mathrm{nn}(\mathbf{y}, x)$; if $\hat{z} = z$ then this should be 0. Common choices are the squared error $\|\hat{z} - z\|_2^2$ and the cross-entropy $-\sum_j z_j \log \hat{z}_j$.

Training minimizes the total loss over training pairs $(\mathbf{y}, z)$ with respect to the weight parameters:
$$x^{*} = \arg\min_x K(x) = \arg\min_x \sum_{(\mathbf{y}, z) \in U} M\!\left(z, \mathrm{nn}(\mathbf{y}, x)\right).$$
Gradient descent takes steps against the gradient with learning rate $\theta$:
$$x^{(u+1)} = x^{(u)} - \theta\, \nabla K(x^{(u)}).$$
[Figure: one gradient descent step on $K(x)$, moving from $x^{(u)}$ by $-\theta \nabla K(x^{(u)})$.]
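A minimal gradient descent loop on a toy quadratic objective (my own example; $\theta$ plays the role of the learning rate in the update above):

```python
import numpy as np

# Toy objective K(x) = ||A x - b||^2 / 2 with gradient A^T (A x - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def K(x):
    r = A @ x - b
    return 0.5 * r @ r

def grad_K(x):
    return A.T @ (A @ x - b)

theta = 0.05                      # learning rate
x = np.zeros(2)
for u in range(200):
    x = x - theta * grad_K(x)     # x^(u+1) = x^(u) - theta * grad K(x^(u))
print("x* =", x, "K(x*) =", K(x))
```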
Backpropagation applies the chain rule to propagate gradients backwards through the network. For a composition $\mathrm{nn} = g_1 \circ g_2 \circ \cdots \circ g_o$,
$$\frac{\partial\, \mathrm{nn}}{\partial \mathbf{y}} = \frac{\partial g_1}{\partial g_2} \cdot \frac{\partial g_2}{\partial g_3} \cdots \frac{\partial g_o}{\partial \mathbf{y}}.$$
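To make the chain rule concrete, a tiny scalar example (not from the slides) composing three functions and multiplying their local derivatives, checked against finite differences:

```python
import numpy as np

# Composition nn(y) = g1(g2(g3(y))) with g3 = 2y + 1, g2 = ReLU, g1 = tanh.
def nn(y):
    return np.tanh(max(0.0, 2.0 * y + 1.0))

y = 0.7
a = 2.0 * y + 1.0                         # g3(y)
h = max(0.0, a)                           # g2(g3(y))
# Chain rule: d nn / dy = g1'(h) * g2'(a) * g3'(y)
analytic = (1.0 - np.tanh(h) ** 2) * (1.0 if a > 0 else 0.0) * 2.0

eps = 1e-6
numeric = (nn(y + eps) - nn(y - eps)) / (2 * eps)
print(analytic, numeric)                  # the two should agree closely
```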
Layer-by-layer derivatives for a one-layer softmax classifier, $\mathbf{h} = \mathrm{ReLU}(X\mathbf{y} + \mathbf{c})$, $\mathbf{b}(\mathbf{h}) = \mathrm{softmax}(\mathbf{h})$, $K = -\sum_l z_l \log b_l(\mathbf{h})$:
$$\frac{\partial K}{\partial h_j} = -\sum_l \frac{z_l}{b_l} \frac{\partial b_l}{\partial h_j}, \qquad \frac{\partial b_j}{\partial h_j} = b_j(1 - b_j), \qquad \frac{\partial b_j}{\partial h_k} = -b_j b_k \ \text{for}\ j \neq k,$$
$$\frac{\partial h_j}{\partial X_j} = \mathbb{1}_{h_j > 0} \cdot \mathbf{y}, \qquad \frac{\partial h_j}{\partial c_j} = \mathbb{1}_{h_j > 0}.$$
Homework: Prove that $\frac{\partial K}{\partial h_j} = b_j - z_j$ and verify the softmax derivatives.
Applying the chain rule through the softmax, ReLU, and linear layers gives
$$\frac{\partial K}{\partial X_j} = \frac{\partial K}{\partial h_j}\,\frac{\partial h_j}{\partial \mathrm{Linear}_j}\,\frac{\partial \mathrm{Linear}_j}{\partial X_j} = (b_j - z_j)\,\mathbb{1}_{h_j > 0}\cdot \mathbf{y},$$
$$\frac{\partial K}{\partial c_j} = \frac{\partial K}{\partial h_j}\,\frac{\partial h_j}{\partial \mathrm{Linear}_j}\,\frac{\partial \mathrm{Linear}_j}{\partial c_j} = (b_j - z_j)\,\mathbb{1}_{h_j > 0}\cdot 1.$$
[Diagram: forward pass $\mathbf{y} \rightarrow X\mathbf{y} + \mathbf{c} \rightarrow \mathbf{h} \rightarrow -\mathbf{z}\cdot\log \mathbf{b}(\mathbf{h})$, with backward messages $\frac{\partial K}{\partial \mathbf{h}}$, $\mathbf{b}(\mathbf{h}) - \mathbf{z}$, and $\mathbb{1}_{\mathbf{h}>0}\,(\mathbf{b}(\mathbf{h}) - \mathbf{z})$, yielding $\nabla K(X, \mathbf{c})$.]
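A numpy sketch of this single-layer example, assuming the structure above ($\mathbf{h} = \mathrm{ReLU}(X\mathbf{y} + \mathbf{c})$, $\mathbf{b} = \mathrm{softmax}(\mathbf{h})$); it checks the derived gradient $(b_j - z_j)\,\mathbb{1}_{h_j > 0}$ against finite differences. The sizes and data are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def forward(X, c, y, z):
    h = np.maximum(0.0, X @ y + c)        # h = ReLU(X y + c)
    b = softmax(h)                        # b(h)
    return h, b, -np.sum(z * np.log(b))   # K = -sum_j z_j log b_j(h)

D, L = 6, 4                               # input dim, number of classes
X = rng.normal(scale=0.5, size=(L, D))
c = rng.normal(scale=0.5, size=L)
y = rng.normal(size=D)
z = np.eye(L)[1]                          # one-hot ground truth target

h, b, K = forward(X, c, y, z)
delta = (b - z) * (h > 0)                 # (b_j - z_j) * 1[h_j > 0]
dX = np.outer(delta, y)                   # dK/dX_j = (b_j - z_j) * 1[h_j > 0] * y
dc = delta                                # dK/dc_j = (b_j - z_j) * 1[h_j > 0]

# Finite-difference check of dK/dX.
eps, num = 1e-6, np.zeros_like(X)
for i in range(L):
    for j in range(D):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps
        Xm[i, j] -= eps
        num[i, j] = (forward(Xp, c, y, z)[2] - forward(Xm, c, y, z)[2]) / (2 * eps)
print("max gradient error:", np.abs(num - dX).max())
```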
Computing the exact gradient requires a sum over the entire training set:
$$K(x) = \sum_{(\mathbf{y}, z) \in U} M\!\left(z, \mathrm{nn}(\mathbf{y}, x)\right).$$
Stochastic gradient descent instead approximates the objective on a small mini-batch $C$:
$$K(x) \approx \sum_{(\mathbf{y}, z) \in C \subset U} M\!\left(z, \mathrm{nn}(\mathbf{y}, x)\right).$$
This also enables batching several examples into one matrix multiplication, e.g. $H = XY$.
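A sketch of the mini-batch approximation with a batched matrix multiply in the forward pass; the linear model, squared loss, and batch size here are placeholder choices, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy dataset U of (y, z) pairs and a linear "network" nn(y, x) = X y.
Y = rng.normal(size=(1000, 8))            # inputs, one row per example
Z = rng.normal(size=(1000, 3))            # targets
X = np.zeros((3, 8))                      # parameters x

theta, batch_size = 0.01, 32
for step in range(500):
    idx = rng.choice(len(Y), size=batch_size, replace=False)   # sample C subset of U
    Yb, Zb = Y[idx], Z[idx]
    H = Yb @ X.T                          # batched predictions: one matrix multiply
    grad = (H - Zb).T @ Yb / batch_size   # gradient of the mean of ||X y - z||^2 / 2
    X -= theta * grad
print("final mean squared error:", float(np.mean((Y @ X.T - Z) ** 2)))
```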
[Figure: optimization trajectories of GD, SGD, and SGD + Momentum.]
Momentum accumulates past gradients into a velocity, $v^{(u+1)} = \mu\, v^{(u)} - \theta\, \nabla K(x^{(u)})$ and $x^{(u+1)} = x^{(u)} + v^{(u+1)}$, so that parameters update more if there have been few changes in gradient direction.
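A sketch of the momentum update on the same toy quadratic, assuming the standard velocity formulation ($\mu$ is the momentum coefficient):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_K = lambda x: A.T @ (A @ x - b)      # gradient of K(x) = ||A x - b||^2 / 2

theta, mu = 0.02, 0.9                     # learning rate and momentum coefficient
x = np.zeros(2)
v = np.zeros(2)                           # velocity (accumulated update)
for u in range(300):
    v = mu * v - theta * grad_K(x)        # v^(u+1) = mu v^(u) - theta grad K(x^(u))
    x = x + v                             # x^(u+1) = x^(u) + v^(u+1)
print("x* =", x)
```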
Parameter averaging: after a burn-in of $u_0$ steps, keep a running average of the iterates,
$$\bar{x}^{(u)} = \frac{1}{u - u_0} \sum_{j = u_0 + 1}^{u} x^{(j)} = \bar{x}^{(u-1)} + \frac{1}{u - u_0}\left(x^{(u)} - \bar{x}^{(u-1)}\right).$$
Gradient clipping bounds the size of each step: if $\|\nabla K(x^{(u)})\| > \iota$ then
$$x^{(u+1)} = x^{(u)} - \theta\,\iota\, \frac{\nabla K(x^{(u)})}{\|\nabla K(x^{(u)})\|}.$$
[Figure: a clipped gradient step on $K(x)$.]
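A sketch combining gradient clipping and iterate averaging as written above (toy objective again; $\iota$ is the clipping threshold and $u_0$ the burn-in, with made-up values):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad_K = lambda x: A.T @ (A @ x - b)

theta, iota, u0 = 0.05, 1.0, 50           # step size, clip threshold, burn-in
x = np.array([10.0, -10.0])
x_avg = None
for u in range(1, 301):
    g = grad_K(x)
    norm = np.linalg.norm(g)
    if norm > iota:                       # clip: step of length at most theta * iota
        g = iota * g / norm
    x = x - theta * g
    if u > u0:                            # running average of iterates after u0
        k = u - u0
        x_avg = x.copy() if x_avg is None else x_avg + (x - x_avg) / k
print("last iterate:", x, "averaged iterate:", x_avg)
```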
(With $L = 2$ classes the softmax reduces to the sigmoid.) The entropy of the prediction, $-\sum_j \hat{z}_j \log \hat{z}_j$, reflects the peakiness of the softmax: high temperature, low peakiness (many small weights, small gradients); low temperature, high peakiness (few big weights, big gradients).
Underfitting: the model is too simple to capture the target function. Overfitting: the model fits every little detail in the training data. Regularization constrains the parameters to generalize better.
A common choice is an $L_2$ penalty (weight decay) on the parameters:
$$\widetilde{M}(\mathbf{y}, z; x) = M\!\left(z, \mathrm{nn}(\mathbf{y}, x)\right) + \frac{\mu}{2}\|x\|_2^2.$$
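Weight decay only adds $\mu x$ to the gradient; a sketch on the earlier toy objective, with $\mu$ chosen arbitrarily:

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

mu, theta = 0.5, 0.05
def grad_reg(x):
    # gradient of ||A x - b||^2 / 2 + (mu / 2) ||x||^2
    return A.T @ (A @ x - b) + mu * x

x = np.zeros(2)
for _ in range(300):
    x = x - theta * grad_reg(x)
print("regularized solution:", x)         # shrunk toward 0 compared with mu = 0
```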
Starting with a single hidden layer is reasonable, then scale up as needed.
Shlens 2015)
The convolution of two functions is
$$(f * g)(u) = \int_{-\infty}^{\infty} f(\upsilon)\, g(u - \upsilon)\, d\upsilon.$$
Represent your data as a tensor rather than a flattened feature vector; convolution exploits the spatial structure and roughly preserves it, e.g. image dimensions. For a 320 x 240 RGB input (230,400 values) mapped to one output per pixel location, a fully connected $X$ is 230,400 x 76,800. If we instead convolve with a 5x5 filter then $X'$ is 75 x 1 and is applied 76,800 times.
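A sketch contrasting the parameter counts and sliding one 5x5x3 filter over a 320x240 RGB image with plain numpy loops (a real implementation would call an optimized convolution routine):

```python
import numpy as np

H, W, C, k = 240, 320, 3, 5
fc_params = (H * W * C) * (H * W)         # fully connected: 230,400 x 76,800 weights
conv_params = k * k * C                   # one 5x5x3 filter: 75 weights
print(f"fully connected: {fc_params:,} weights, conv filter: {conv_params} weights")

image = np.random.rand(H, W, C)
filt = np.random.rand(k, k, C)
out = np.zeros((H - k + 1, W - k + 1))    # 'valid' convolution output
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = np.sum(image[i:i + k, j:j + k, :] * filt)   # same 75 weights everywhere
print("output size:", out.shape)
```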
[Diagram: a recurrent cell combining the current input $y_u$ with the previous state $t_{u-1}$.]
For the gating structure of LSTMs (input, forget, and output gates acting on a memory cell), see Chris Olah's blog post on LSTMs.
[Diagram: an LSTM cell, with gates mixing $y_u$, $t_{u-1}$, and the previous cell state $D_{u-1}$ to produce $t_u$ and $D_u$.]
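A minimal sketch of a plain recurrent step, not the full LSTM gating that Olah's post covers; $t_u$ is the hidden state carried between steps and the sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(4)

D_in, D_hid, T = 8, 16, 20
W_in = rng.normal(scale=0.1, size=(D_hid, D_in))
W_rec = rng.normal(scale=0.1, size=(D_hid, D_hid))
c = np.zeros(D_hid)

ys = rng.normal(size=(T, D_in))           # input sequence y_1, ..., y_T
t = np.zeros(D_hid)                       # initial state t_0
for u in range(T):
    # t_u = tanh(W_in y_u + W_rec t_{u-1} + c): the same weights are reused at every step
    t = np.tanh(W_in @ ys[u] + W_rec @ t + c)
print("final state norm:", np.linalg.norm(t))
```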
Most recent successes have been limited to supervised learning, despite the initial neural net revival being prompted by the generative model of (Mohamed, Dahl, Hinton 2009).
Further reading:
- The Deep Learning textbook.
- CS231n: Convolutional Neural Networks for Visual Recognition by Fei-Fei Li and Andrej Karpathy.
- Slides presented at the Machine Learning Summer School 2015.
- Tutorials from CVPR 2014.