SLIDE 1

Deep Learning with Neural Networks

The Structure and Optimization of Deep Neural Networks
Allan Zelener
Machine Learning Reading Group, January 7th 2016
The Graduate Center, CUNY

SLIDE 2

Objectives

  • Explain some of the trends of deep learning and neural networks in machine learning research.
  • Give a theoretical and practical understanding of neural network structure and training.
  • Provide a baseline for reading neural network papers in machine learning and related fields.
  • Take a brief look at some major work that could be used in further reading group discussions.

SLIDE 3

Why are neural networks back again?

  • State-of-the-art performance on benchmark perception datasets.
  • TIMIT – (Mohamed, Dahl, Hinton 2009)
  • 23.0% phoneme error rate vs 24.4% ensemble method.
  • 17.7% with LSTM RNN (Graves, Mohamed, Hinton 2013)
  • ImageNet – (Krizhevsky, Sutskever, Hinton 2012)
  • 16% top-5 error vs 25% for competing methods.
  • In 2015 deep nets can achieve ~3.5% top-5 error.
  • Larger datasets and faster computation.
  • Good enough that industry is now investing resources.
  • A few innovations: ReLU, Dropout.
SLIDE 4

Why should neural networks work?

  • No strong and useful theoretical guarantees yet.
  • Universal approximation theorems
  • Taylor’s theorem (differentiable functions)
  • Stone-Weierstrass theorem (continuous functions)
  • $F(x) = \sum_{j=1}^{N} v_j\, \varphi\!\left(w_j^\top x + b_j\right)$, with $\left|F(x) - f(x)\right| < \epsilon$
  • $F(x)$ is a piecewise constant approximation of $f(x)$.
  • The neural network unit $\varphi\!\left(w_j^\top x + b_j\right)$ should be 1 if $f(x) \approx v_j$ and 0 otherwise.
  • Optimization of neural networks
  • “Many equally good local minima” for simplified ideal models.
  • Saddle point problem in non-convex optimization. (Dauphin et al. 2014)
  • Loss surface of multilayer neural networks. (Choromanska et al. 2015)
SLIDE 5

Why go deeper?

  • Deep Neural Networks
  • The universal approximation theorem is for single-layer networks.
  • For complicated functions we may need very large $N$ (the number of hidden units).
  • Empirically, deep networks learn better representations than shallow networks using fewer parameters.
  • For applications where data is highly structured, e.g. vision, depth facilitates composition of features, feature sharing, and distributed representations.

  • Caveat: Deep nets can be compressed. (Ba and Caruana 2014)
SLIDE 6

Why go deeper?

  • Deep Learning for Representation Learning
  • Classic pipeline
  • Raw Measurements → Features → Prediction.
  • Replace human heuristic feature engineering with learned representations.
  • New pipeline
  • Raw Measurements → Prediction. (The representation is learned inside the network.)
  • End-to-end optimization, but not necessarily a neural network.
  • Caveat: Replaces feature engineering with pipeline engineering.
  • Deep Learning as composition of classical models
  • Feed-forward neural network ≡ Recursive generalized linear model.
SLIDE 7

Why neural?

  • Loosely biologically inspired architecture.
  • LeNet CNN inspired by cat and monkey visual cortex. (Hubel and Wiesel 1968)
  • Real neurons respond to simple structures like edges.
  • Probably not actually a good model for how brains work, although there may be some similarities.

  • Pitfall: Mistaking neural networks for neuroscience.
  • I will try to avoid neural-inspired jargon where possible, but it has become standard in the field.

SLIDE 8

The Architecture

  • In machine learning we want to find good approximations to interesting functions $f: X \to Y$ that describe mappings from observations to useful predictions.
  • The approximation function should be:
  • Computationally tractable
  • Nonlinear (if $f$ is nonlinear, of course)
  • Parameterizable (so we can learn parameters given training data)
SLIDE 9

The Architecture - Tractable

  • Step 1: Start with a linear function.
  • $y = w^\top \boldsymbol{x} + b$ – Linear unit.
  • $\boldsymbol{y} = W\boldsymbol{x} + \boldsymbol{b}$ – Linear layer.
  • Efficient to compute, optimized and stable. BLAS libraries and GPUs.
  • $Y = WX + \boldsymbol{b}$ – Linear layer with batch input.
  • Easily differentiable, and thus optimizable.
  • $\dfrac{dy}{dw} = \boldsymbol{x}$, $\quad \dfrac{dy}{db} = 1$
  • Many parameters, $O(nm)$ for $n$ input dimensions and $m$ outputs (see the sketch below).
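A minimal numpy sketch of the batch linear layer $Y = WX + \boldsymbol{b}$, assuming the columns of $X$ are input examples; the function and variable names are illustrative, not from the slides.

```python
import numpy as np

def linear_layer(X, W, b):
    # Batch linear layer: each column of X is one input example.
    return W @ X + b[:, None]   # bias broadcast across the batch

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))     # 4 examples with 3 features each
W = rng.normal(size=(2, 3))     # weight matrix: 3 inputs -> 2 outputs
b = np.zeros(2)
print(linear_layer(X, W, b).shape)  # (2, 4)
```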
SLIDE 10

The Architecture - Nonlinear

  • Step 2: Add a non-linearity.
  • $\boldsymbol{y} = \varphi(W\boldsymbol{x} + \boldsymbol{b})$
  • $\varphi(\cdot)$ is some nonlinear function, historically a sigmoid.
  • Logistic function $\sigma: \mathbb{R} \to (0, 1)$
  • Hyperbolic tangent $\tanh: \mathbb{R} \to (-1, 1)$
  • ReLU (Rectified Linear Unit) is a popular choice now.
  • $\mathrm{ReLU}(x) = \max(0, x)$
  • Computationally efficient and surprisingly just works.
  • $\dfrac{\partial\, \mathrm{ReLU}(x)}{\partial x} = \begin{cases} 1, & x > 0 \\ 0, & x \le 0 \end{cases}$
  • Note: ReLU is not differentiable at $x = 0$, but we take 0 for the subgradient.
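A tiny numpy sketch of ReLU and the subgradient convention above (illustrative, not from the slides):

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x).
    return np.maximum(0.0, x)

def relu_grad(x):
    # Subgradient: 1 where x > 0, 0 otherwise (0 chosen at x == 0).
    return (x > 0).astype(float)
```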

SLIDE 11

The Architecture – Parameterizable

  • Step 3: Repeat until deep.
  • Multilayer Perceptron
  • Parameters for linear functions, but entire network is nonlinear.
  • Each intermediate output $\boldsymbol{h}_j$ is called an activation. Internal layers are called hidden.
  • The final activation can be used as a linear regression.
  • Differentiable using backpropagation.

[Diagram: $\boldsymbol{x} \to W_0\boldsymbol{x} + \boldsymbol{b}_0 \to \boldsymbol{h}_1 \to W_1\boldsymbol{h}_1 + \boldsymbol{b}_1 \to \boldsymbol{h}_2 \to \cdots \to W_k\boldsymbol{h}_k + \boldsymbol{b}_k \to \boldsymbol{h}_{k+1}$]
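A short numpy sketch of the multilayer perceptron forward pass above, assuming ReLU between layers and a linear final layer; the names are illustrative, not from the slides.

```python
import numpy as np

def mlp_forward(x, layers):
    # layers: list of (W, b) pairs; ReLU after every layer but the last.
    h = x
    for i, (W, b) in enumerate(layers):
        h = W @ h + b
        if i < len(layers) - 1:
            h = np.maximum(0.0, h)   # hidden activations
    return h

rng = np.random.default_rng(0)
layers = [(rng.normal(size=(16, 8)), np.zeros(16)),
          (rng.normal(size=(4, 16)), np.zeros(4))]
print(mlp_forward(rng.normal(size=8), layers).shape)  # (4,)
```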

SLIDE 12

The Architecture

  • (From Vincent Vanhoucke’s slides.)
SLIDE 13

The Architecture – Classification

  • Softmax regression (aka multinomial logistic regression)
  • $\boldsymbol{a}(\boldsymbol{x}) = \mathrm{softmax}(\text{evidence}) = \mathrm{normalize}\!\left(e^{\boldsymbol{x}}\right) = \dfrac{e^{\boldsymbol{x}}}{\lVert e^{\boldsymbol{x}} \rVert_1}$
  • $a_j = \dfrac{e^{x_j}}{\sum_{k=1}^{K} e^{x_k}} = p(y = j \mid \boldsymbol{x})$, where $\mathrm{class}(\boldsymbol{x}) = y$.
  • Exponentiate $\boldsymbol{x}$ to exaggerate differences between features.
  • Normalize so $\boldsymbol{a}$ is a probability distribution over $K$ classes.
  • Softmax is a differentiable approximation to the indicator function (see the numeric sketch below).
  • $\mathbb{1}_{\mathrm{class}(\boldsymbol{x})}(k) = 1$ if $k = \arg\max(x_1, \ldots, x_K) = \mathrm{class}(\boldsymbol{x})$, and 0 otherwise.
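A numerically stable softmax sketch matching the definition above; subtracting the max before exponentiating is an added stability trick, not shown on the slide.

```python
import numpy as np

def softmax(x):
    # Softmax is invariant to adding a constant, so subtract the max
    # before exponentiating to avoid overflow.
    e = np.exp(x - np.max(x))
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1])
print(softmax(scores))         # approx. [0.659, 0.242, 0.099]
print(softmax(scores).sum())   # 1.0
```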
SLIDE 14

The Architecture - Objective

  • How far away is the network?
  • Let $y = \mathrm{nn}(x, w)$ be the network's prediction on $x$.
  • Let $\hat{y}$ be the "ground truth" target for $x$.
  • Let $L(\hat{y}, y)$ be the loss for our prediction on $x$.
  • If $y = \hat{y}$ then this should be 0.
  • Squared Euclidean distance / L2 loss: $\lVert \hat{y} - y \rVert_2^2$
  • Cross-entropy / negative log likelihood loss: $-\hat{y} \cdot \log y$
  • The objective function sums $L(\hat{y}, y)$ over training pairs $(x, \hat{y})$.
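A small sketch of the two losses above, assuming a one-hot target vector; the function names are illustrative.

```python
import numpy as np

def squared_error(y_hat, y):
    # Squared Euclidean (L2) loss.
    return np.sum((y_hat - y) ** 2)

def cross_entropy(y_hat, y, eps=1e-12):
    # Negative log likelihood; y is a predicted probability vector,
    # y_hat is a one-hot target. eps guards against log(0).
    return -np.sum(y_hat * np.log(y + eps))

y_hat = np.array([0.0, 1.0, 0.0])   # true class is index 1
y = np.array([0.2, 0.7, 0.1])       # predicted distribution
print(squared_error(y_hat, y), cross_entropy(y_hat, y))
```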

SLIDE 15

Learning Algorithm for Neural Networks

  • Training is the process of minimizing the objective function with respect to the weight parameters.
    $w^* = \arg\min_w J(w) = \arg\min_w \sum_{(x, \hat{y}) \in T} L\!\left(\hat{y}, \mathrm{nn}(x, w)\right)$
  • This optimization is done by iterative steps of gradient descent.
    $w^{(t+1)} = w^{(t)} - \eta\, \nabla J\!\left(w^{(t)}\right)$
  • $\nabla J(w)$ is the gradient direction.
  • $\eta$ is the learning rate / step size.
  • Needs to be "small enough" for convergence.

[Figure: a single gradient descent step on the curve $J(w)$, moving from $w^{(t)}$ by $-\eta\, \nabla J(w^{(t)})$.]
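A toy gradient descent loop on an assumed objective $J(w) = \lVert w - 3 \rVert^2$, purely illustrative:

```python
import numpy as np

def grad_J(w):
    # Gradient of the toy objective J(w) = ||w - 3||^2.
    return 2.0 * (w - 3.0)

w = np.array([0.0])
eta = 0.1                       # learning rate
for t in range(100):
    w = w - eta * grad_J(w)     # w_(t+1) = w_(t) - eta * grad J(w_(t))
print(w)                        # close to the minimizer 3.0
```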

SLIDE 16

Learning Algorithm for Neural Networks

  • $J(w)$ is highly non-linear and non-convex, but it is differentiable.
  • The backpropagation algorithm applies the chain rule for differentiation backwards through the network.
    $\dfrac{\partial\, (f_1 \circ f_2 \circ \cdots \circ f_n)}{\partial x} = \dfrac{\partial f_1}{\partial f_2} \cdot \dfrac{\partial f_2}{\partial f_3} \cdots \dfrac{\partial f_n}{\partial x}$

SLIDE 17

Backpropagation Example

  • Let $a(\cdot) = \mathrm{softmax}(\cdot)$
  • $\dfrac{\partial J}{\partial h_j} = -\sum_{l} \dfrac{\hat{y}_l}{a_l}\, \dfrac{\partial a_l}{\partial h_j}$
  • $\dfrac{\partial a_j}{\partial h_j} = a_j (1 - a_j)$, $\quad \dfrac{\partial a_j}{\partial h_k} = -a_j a_k$ for $j \ne k$
  • $\dfrac{\partial h_j}{\partial W_j} = \mathbb{1}[h_j > 0] \cdot \boldsymbol{x}$, $\quad \dfrac{\partial h_j}{\partial b_j} = \mathbb{1}[h_j > 0]$

[Diagram: $\boldsymbol{x} \to W\boldsymbol{x} + \boldsymbol{b} \to \boldsymbol{h} \to -\hat{\boldsymbol{y}} \cdot \log a(\boldsymbol{h})$]

Homework: Prove $\dfrac{\partial J}{\partial h_j} = a_j - \hat{y}_j$ and verify the softmax derivatives.

SLIDE 18

Backpropagation Example

  • 𝜖𝐾

𝜖𝑋𝑗 = 𝜖𝐾 𝜖ℎ𝑗 𝜖ℎ𝑗 𝜀𝑆𝑓𝑀𝑉𝑗 𝜖𝑆𝑓𝑀𝑉𝑗 𝜖Linear𝑗 𝜖Linear𝑗 𝜖𝑋𝑗

= 𝑏𝑗 − 𝑧𝑗 𝟐ℎ𝑗>0 ⋅ 𝒚

  • 𝜖𝐾

𝜖𝑐𝑗 = 𝜖𝐾 𝜖ℎ𝑗 𝜖ℎ𝑗 𝜀𝑆𝑓𝑀𝑉𝑗 𝜖𝑆𝑓𝑀𝑉𝑗 𝜖Linear𝑗 𝜖Linear𝑗 𝜖𝑐𝑗

= 𝑏𝑗 − 𝑧𝑗 𝟐ℎ𝑗>0 ⋅ 1

  • Homework: Work out backprop with two linear layers and a batch
  • f inputs 𝑌.

𝒚 𝑋𝒚 + 𝒄 𝒊 − 𝑧 log 𝑏(h) 𝜖𝐾 𝜖ℎ 𝑏 𝒊 , 𝒛 𝑧 𝜖ℎ 𝜖𝑆𝑓𝑀𝑉 𝑏(𝒊) − 𝒛 𝜖𝑆𝑓𝑀𝑉 𝜖Linear 𝟐ℎ>0𝑏(𝒊) − 𝒛 𝛼𝐾(𝑋, 𝑐)
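A numpy sketch of this backprop example, assuming a one-hot target $\hat{y}$; the structure follows the slide's chain rule, but the code itself is an illustrative reconstruction, not the presenter's.

```python
import numpy as np

def forward_backward(W, b, x, y_hat):
    lin = W @ x + b                        # Linear layer
    h = np.maximum(0.0, lin)               # h = ReLU(Wx + b)
    e = np.exp(h - h.max())
    a = e / e.sum()                        # a = softmax(h)
    loss = -np.sum(y_hat * np.log(a + 1e-12))

    dh = a - y_hat                         # dJ/dh for a one-hot target
    dlin = dh * (lin > 0)                  # through the ReLU subgradient
    dW = np.outer(dlin, x)                 # dJ/dW
    db = dlin                              # dJ/db
    return loss, dW, db
```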

SLIDE 19

The Data is Too Damn Big

  • The objective function for gradient descent requires summing over the entire training set:
    $J(w) = \sum_{(x, \hat{y}) \in T} L\!\left(\hat{y}, \mathrm{nn}(x, w)\right)$
  • This is too costly for big datasets, so we need to approximate.
  • Stochastic Gradient Descent uses small batches of the training set (see the loop sketch below).
    $J(w) \approx \sum_{(x, \hat{y}) \in B \subset T} L\!\left(\hat{y}, \mathrm{nn}(x, w)\right)$
  • After every batch has been used (one epoch), the training data is randomly permuted.
  • Poor estimates, but repeated many times and smoothed.
  • Online SGD (batch size = 1) might be great if we didn't lose the low-level efficiency of batching several examples into matrix multiplications, e.g. $H = WX$.
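A minibatch SGD sketch matching the description above; grad_loss is an assumed helper returning the gradient of the batch loss, and rows of X are examples.

```python
import numpy as np

def sgd(w, X, Y, grad_loss, eta=0.01, batch_size=32, epochs=10):
    n = X.shape[0]
    for epoch in range(epochs):
        order = np.random.permutation(n)           # re-permute every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # one minibatch
            w = w - eta * grad_loss(w, X[idx], Y[idx])
    return w
```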

SLIDE 20

SGD Tricks

  • Momentum
  • $v^{(t+1)} = \mu\, v^{(t)} + \nabla J\!\left(w^{(t)}\right)$
  • $w^{(t+1)} = w^{(t)} - \eta\, v^{(t+1)}$
  • Typically large $\mu$, e.g. $\mu = 0.9$.
  • Assume we're going the right way over time.
  • Reduce the impact of noisy estimates.

[Figure: trajectories of GD, SGD, and SGD + Momentum on a loss surface.]
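The momentum update above as a small sketch; the velocity name v and default values are assumptions.

```python
def momentum_step(w, v, grad, eta=0.01, mu=0.9):
    # v accumulates a running direction, so noisy minibatch gradients
    # partially cancel out; mu is typically large, e.g. 0.9.
    v = mu * v + grad          # v_(t+1) = mu * v_(t) + grad J(w_(t))
    w = w - eta * v            # w_(t+1) = w_(t) - eta * v_(t+1)
    return w, v
```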

SLIDE 21

SGD Tricks

  • Learning Rate Decay
  • $\eta = \eta_0\, e^{-\beta t}$ – Exponentially decaying learning rate.
  • Loss surface can have structures at varying scales.
  • Anneal learning rate to explore various scales.
SLIDE 22

SGD Tricks

  • AdaGrad (Duchi, Hazan, Singer 2011)
  • Adaptive per-parameter learning rate decay.
  • Keep a history of the squared L2 norm of the gradient for each parameter of $w$.
  • $r^{(t+1)} = r^{(t)} + \left(\nabla J\!\left(w^{(t)}\right)\right)^2$
  • $w^{(t+1)} = w^{(t)} - \dfrac{\eta}{\sqrt{r^{(t+1)}}}\, \nabla J\!\left(w^{(t)}\right)$
  • Intuitively, update less if there have been many changes and update more if there have been few changes.
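A sketch of the AdaGrad update above; the small eps term is an added safeguard, not from the slide.

```python
import numpy as np

def adagrad_step(w, r, grad, eta=0.01, eps=1e-8):
    # r accumulates elementwise squared gradients; parameters that have
    # already changed a lot get a smaller effective learning rate.
    r = r + grad ** 2
    w = w - eta / (np.sqrt(r) + eps) * grad
    return w, r
```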

SLIDE 23

SGD Tricks

  • Averaged SGD - Parameter averaging
  • $\bar{w}^{(t)} = \dfrac{1}{t - t_0} \sum_{j=t_0+1}^{t} w^{(j)} = \bar{w}^{(t-1)} + \dfrac{1}{t - t_0}\left(w^{(t)} - \bar{w}^{(t-1)}\right)$
  • Smooths fluctuations in the model over time.
  • SGD may be biased to most recent batches.
  • Like an ensemble of models through time.
  • Can be computed without storing all parameters over time.
  • Note: Use average parameters only at test time, not training.
  • Why?
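The incremental running-average update above, as a one-function sketch:

```python
def averaged_sgd_update(w_bar, w, t, t0=0):
    # Running average of the iterates w^(t0+1), ..., w^(t).
    # Use w_bar only at test time; keep training with the raw w.
    return w_bar + (w - w_bar) / (t - t0)
```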
SLIDE 24

SGD Tricks

  • Gradient Clipping
  • If $\left\lVert \nabla J\!\left(w^{(t)}\right) \right\rVert > \theta$ then $w^{(t+1)} = w^{(t)} - \eta\theta\, \dfrac{\nabla J\!\left(w^{(t)}\right)}{\left\lVert \nabla J\!\left(w^{(t)}\right) \right\rVert}$
  • Prevent big jumps from exploding gradients.

[Figure: a clipped gradient step of length $\eta\theta$ on the curve $J(w)$, compared with the full step $-\eta\, \nabla J(w^{(t)})$.]
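A sketch of gradient clipping by norm, as described above; the threshold name theta is an assumption.

```python
import numpy as np

def clipped_step(w, grad, eta=0.01, theta=5.0):
    # Rescale the step when the gradient norm exceeds the threshold.
    norm = np.linalg.norm(grad)
    if norm > theta:
        grad = theta * grad / norm
    return w - eta * grad
```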

SLIDE 25

SGD Tricks

  • Other popular variants on SGD update:
  • Nesterov Accelerated Gradient
  • Adadelta
  • RMSprop
  • ADAM
  • Visual comparison of some of these methods:
  • http://imgur.com/a/Hqolp
  • Second-order alternatives to SGD
  • Conjugate Gradient, L-BFGS
SLIDE 26

Initialization

  • Weight initialization
  • Weights need to be numerically stable.
  • Saturation: Units stuck outputting extremal values.
  • Exploding and shrinking gradients.
  • Uncentered input can make gradients always positive or negative.
  • Units with identical weights produce same results, break symmetries.
  • Keep norm of weights throughout network roughly equal.
  • Initialize $w \sim \mathcal{N}(0, \sigma)$. Set $\sigma$ based on $\dim(\boldsymbol{x})^{-1/2}$ (for sigmoid).
  • Initialize $\boldsymbol{b}$ with small positive values so ReLU outputs will be nonzero.
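A sketch of the initialization recipe above: weights scaled by the inverse square root of the fan-in, biases small and positive for ReLU; the function name is illustrative.

```python
import numpy as np

def init_layer(n_in, n_out, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    # 1/sqrt(fan-in) keeps activation norms roughly equal across layers.
    W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_out, n_in))
    b = np.full(n_out, 0.01)   # small positive bias keeps ReLUs active
    return W, b
```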
SLIDE 27

Initialization

  • Weights at the Loss Layer
  • $\mathcal{L} = -\hat{\boldsymbol{y}} \cdot \log \boldsymbol{y}$
  • The prediction $\boldsymbol{y}$ has a "peakiness" or "temperature" as a probability distribution.
  • Determines the magnitude of the gradients backpropagated.
  • Bigger peaks = big errors = bigger gradients.
  • Initialize with small weights, giving a low-peak distribution.
  • Anneals to a more peaked distribution as the classifier gets more confident.

[Figure: high temperature / low peakiness (many small weights, small gradients) vs low temperature / high peakiness (few big weights, big gradients).]

SLIDE 28

The Most Important Training Tip

  • Lower your learning rate by 10x and try again.
  • 0.1 is often a recommended default.
  • Going down to 0.0001 or lower may be needed sometimes.
  • Faster learning rate ≠ better training.
SLIDE 29

Regularization

  • Underfitting
  • The model does not have enough parameters to represent the complexity of the target function.
  • Overfitting
  • The model has too many degrees of freedom and tries to represent every little detail in the training data.
  • Poor performance on a held-out dataset relative to the training set.
  • Solution: Take the biggest model we can train, but nudge the parameters to generalize better.

SLIDE 30

Regularization

  • L2 Norm Regularization
  • Penalize large weights by augmenting loss function.
  • Large weights describe extreme examples on the decision boundary.
  • $\mathcal{L}(x, \hat{y}; w) = L\!\left(\hat{y}, \mathrm{nn}(x, w)\right) + \dfrac{\lambda}{2}\, \lVert w \rVert_2^2$
  • Hyperparameter $\lambda$ controls the contribution of the regularization term.
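A one-function sketch of the L2-regularized objective above (lambda is spelled lam since lambda is a Python keyword):

```python
import numpy as np

def l2_regularized_loss(data_loss, w, lam=1e-4):
    # Adds (lambda/2) * ||w||^2; its gradient contribution is lam * w,
    # i.e. plain weight decay.
    return data_loss + 0.5 * lam * np.sum(w ** 2)
```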
SLIDE 31

Regularization

  • Dropout
  • Probability $p$ (= 0.5) of setting the output of a unit to 0 during training.
  • At test time multiply each output by the keep probability (here 0.5).
  • Force other units to learn redundant representations.
  • Prevent units from relying on input from other specific units.
  • Approximates an ensemble method using one network.
  • Each redundant path is like a weak classifier.
  • Multiplying by the keep probability is like taking the geometric mean of the ensemble.
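A dropout sketch following the description above, with drop probability p and test-time scaling by the keep probability:

```python
import numpy as np

def dropout(h, p=0.5, train=True, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    if train:
        mask = rng.random(h.shape) >= p   # keep each unit with prob 1 - p
        return h * mask
    return h * (1.0 - p)                  # scale at test time
```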
SLIDE 32

Starting Out on a New Problem

  • First try logistic regression, random forests, or gradient boosting.
  • Good performance with fewer hyperparameters.
  • Try feasibility of data/problem on simple models and a small dataset.
  • For a neural net, starting with 2–3 layers of 128–1024 units per layer is reasonable; then scale up as needed.
  • Very recent work on transferring parameters to wider and deeper nets:
  • Net2Net: Accelerating Learning via Knowledge Transfer (Chen, Goodfellow, Shlens 2015)

SLIDE 33

Common Variations

  • Convolutional Neural Networks
  • Spatially tied (reused) weights for imagery.
  • Recurrent Neural Networks
  • Temporally tied weights for recurrent connections and sequences.
  • May be stateful, next output depends on previous inputs.
  • Embeddings
  • Find a semantics-preserving learned representation.
  • Simple math operations in feature space have semantic properties.
  • Generative Models
  • Learn $p(x, y)$ or $p(x \mid y)$ instead of just $p(y \mid x)$.
SLIDE 34

Convolution/Cross-Correlation

  • Convolution: $(f * g)(t) = \displaystyle\int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$

  • Can be thought of as a sliding dot product.
  • Overlap of two functions as they slide across each other.
  • Many applications and interpretations exist.
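A discrete 1-D version of the sliding dot product above (the continuous integral becomes a sum; "valid" region only):

```python
import numpy as np

def conv1d(f, g):
    # out[t] = sum_k f[t + k] * g_flipped[k]; flipping g makes this a
    # convolution rather than a cross-correlation.
    k = len(g)
    g_flipped = g[::-1]
    return np.array([np.dot(f[t:t + k], g_flipped)
                     for t in range(len(f) - k + 1)])

print(conv1d(np.array([1., 2., 3., 4.]), np.array([1., 0., -1.])))
# matches np.convolve(f, g, mode="valid")
```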
SLIDE 35

Convolutional Neural Networks

  • High dimensional $\boldsymbol{x}$, but similar structure.
  • Balloons can occur in many places in an image.
  • Replace $W\boldsymbol{x}$ with $W' * \boldsymbol{x}$, where $W'$ is much smaller than $W$.
SLIDE 36

A Note on Tensors (Multidimensional Arrays)

  • For many types of input (images, video, etc.) it's helpful to think of your data as a tensor rather than a feature vector.
  • Each layer of a network transforms one tensor into another.
  • Convolution uses assumptions on the original tensor structure and roughly preserves it, e.g. image dimensions.

  • But mostly still just doing dot products with tensor components.
  • Other operations are possible.
  • See Recursive Neural Tensor Networks (Socher et al. 2013)
SLIDE 37

Convolutional Layer Example

  • If $\boldsymbol{x}$ is a 320x240x3 RGB image and we want a single output per pixel, then $W$ is 230,400 x 76,800. If we convolve with a 5x5 filter then $W'$ is 75 x 1 and is applied 76,800 times.
  • ~18 billion add/mul operations vs ~5 million — about 3200 times more.
  • Fewer parameters, fewer operations, models stationarity in space.
  • Often used with pooling for further reduction and spatial invariance.
SLIDE 38

Recurrent Neural Networks

  • Allow a layer to take input from itself.
  • Input can also vary, e.g. a sequence $x_1, x_2, \ldots, x_n$.
  • In practice, approximated by discrete number of time steps.
  • Can simply backpropagate through unrolled time steps.
  • Reuse weights at each time step.
  • $s_t = \left[\, U \mid W \,\right] \begin{bmatrix} x_t \\ s_{t-1} \end{bmatrix}$
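A sketch of an unrolled simple RNN reusing the same weights at every time step; the tanh nonlinearity is an assumption, not shown on the slide.

```python
import numpy as np

def rnn_forward(xs, U, W, s0):
    # xs: sequence of input vectors; U maps the input to the state,
    # W maps the previous state to the next state.
    s = s0
    states = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        states.append(s)
    return states
```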

SLIDE 39

LSTMs (Long Short-Term Memory)

  • LSTMs are a type of RNN that work well in practice.
  • Conceptually adds a "memory" cell $C$ that can be controlled over time.
  • Three control “gates”: Input, Forget, Output.
  • Each gate is a linear layer with sigmoid (0,1) nonlinearity.
  • Differentiable control so we can learn these parameters using backprop.
  • Outputs $s_t$ and $C_t$ are a function of $x_t$, $s_{t-1}$, and $C_{t-1}$.
  • Many variations exist; for a more detailed breakdown see Christopher Olah's blog post on LSTMs.

SLIDE 40

LSTMs

  • Each gate uses the current input $x_t$ and previous output $s_{t-1}$.
  • Input gate $I$ controls use of the RNN result on $x_t$ and $s_{t-1}$.
  • Forget gate $F$ controls use of the previous cell state $C_{t-1}$.
  • Next cell state $C_t$ is determined by the RNN result, $C_{t-1}$, $I$, and $F$.
  • Output gate $O$ controls the next output $s_t$ as given by $C_t$.

[Figure: LSTM cell diagram with gates $I$, $F$, $O$, weights $U$, $W$, inputs $x_t$, $s_{t-1}$, $C_{t-1}$, and outputs $s_t$, $C_t$.]
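A single LSTM step sketch following the gate description above; the parameter layout and names are illustrative assumptions, not the slide's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, s_prev, C_prev, params):
    z = np.concatenate([x_t, s_prev])             # gates see x_t and s_{t-1}
    I = sigmoid(params["Wi"] @ z + params["bi"])  # input gate
    F = sigmoid(params["Wf"] @ z + params["bf"])  # forget gate
    O = sigmoid(params["Wo"] @ z + params["bo"])  # output gate
    C_tilde = np.tanh(params["Wc"] @ z + params["bc"])  # RNN-style candidate
    C_t = F * C_prev + I * C_tilde                # next cell state
    s_t = O * np.tanh(C_t)                        # next output
    return s_t, C_t
```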

SLIDE 41

Embeddings and Generative Models

  • In practice, unsupervised learning has had limited success compared to supervised learning, despite the initial neural net revival being prompted by the generative model of (Mohamed, Dahl, Hinton 2009).
  • However, interesting research is ongoing and will hopefully be a topic of future ML Reading Group meetings!
SLIDE 42

Neural Network Libraries

  • Torch (Lua)
  • Caffe (Protobuf, Python, C++)
  • Theano + Keras/Blocks/Lasagne (Python)
  • TensorFlow (Python, C++)
  • And more…
SLIDE 43

Further Reading

  • Michael Nielsen's Neural Networks and Deep Learning online textbook.
  • Course notes for Convolutional Neural Networks for Visual Recognition by Fei-Fei Li and Andrej Karpathy.

  • Winter 2016 class just began, now with videos!
  • http://deeplearning.net/
  • Reddit.com/r/MachineLearning
SLIDE 44

References

  • Many slides based on Vincent Vanhoucke's Large Scale Deep Learning slides presented at the Machine Learning Summer School 2015.

  • http://www.iip.ist.i.kyoto-u.ac.jp/mlss15
  • More on scaling up and parallelizing neural networks.
  • Some derivations from Marc'Aurelio Ranzato's deep learning tutorial from CVPR 2014.

  • More on ConvNets and tips on training.
  • Geoff Hinton’s 2015 keynote at the Royal Society: Youtube
  • More on unsupervised nets and history of neural net research.