Lecture 13: Introduction to Deep Learning and Deep Convolutional Neural Networks - PowerPoint PPT Presentation



slide-1
SLIDE 1

Lecture 13:

− Introduction to Deep Learning
− Deep Convolutional Neural Networks

Aykut Erdem

November 2016 Hacettepe University

slide-2
SLIDE 2

Administrative

  • Assignment 3 is out!

− It is due November 30, 2016
− You will implement a 2-layer Neural Network

2

slide-3
SLIDE 3

An update on course projects

  • From now on, regular (weekly) blog posts about your progress on the course projects!
  • We will use medium.com

3

slide-4
SLIDE 4

Last time… Computational Graph

[Figure: computational graph — inputs x and W feed a * node producing scores s; hinge loss and regularization R combine at a + node to give the loss L. During backprop, each node f combines incoming gradients with its “local gradient” of outputs w.r.t. activations.]

4

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson

slide-5
SLIDE 5

Last time… Training Neural Networks

5

Mini-batch SGD

Loop:
1. Sample a batch of data
2. Forward prop it through the graph, get loss
3. Backprop to calculate the gradients
4. Update the parameters using the gradient

slide by Fei-Fei Li & Andrej Karpathy & Justin Johnson
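The four-step loop above can be sketched end-to-end on a toy linear model. This is a minimal illustration, not part of the lecture: the synthetic dataset, learning rate, batch size, and squared-error loss are all assumptions chosen to keep the example self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (stand-in for a real labeled dataset).
X = rng.normal(size=(256, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.01 * rng.normal(size=256)

w = np.zeros(5)              # model parameters
lr, batch_size = 0.1, 32

for step in range(200):
    # 1. Sample a batch of data
    idx = rng.choice(len(X), size=batch_size, replace=False)
    xb, yb = X[idx], y[idx]
    # 2. Forward prop it through the graph, get loss (here: mean squared error)
    preds = xb @ w
    loss = np.mean((preds - yb) ** 2)
    # 3. Backprop to calculate the gradient of the loss w.r.t. the parameters
    grad = 2 * xb.T @ (preds - yb) / batch_size
    # 4. Update the parameters using the gradient
    w -= lr * grad
```

After a couple hundred steps, `w` approaches `true_w` and the batch loss drops toward the noise floor; real networks replace the analytic gradient with backprop through the computational graph.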

slide-6
SLIDE 6

This week

  • Introduction to Deep Learning
  • Deep Convolutional Neural Networks


6

slide-7
SLIDE 7

What is deep learning?

“Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction.”

− Yann LeCun, Yoshua Bengio and Geoff Hinton

  • Y. LeCun, Y. Bengio, G. Hinton, “Deep Learning”, Nature, Vol. 521, 28 May 2015

7

slide-8
SLIDE 8

1943 – 2006: 
 A Prehistory of Deep Learning

8

slide-9
SLIDE 9

1943: Warren McCulloch and Walter Pitts

  • First computational model
  • Neurons as logic gates

(AND, OR, NOT)

  • A neuron model that sums binary inputs and outputs 1 if the sum exceeds a certain threshold value, and otherwise outputs 0

9

slide-10
SLIDE 10

1958: Frank Rosenblatt’s Perceptron

  • A computational model of a single neuron
  • Solves a binary classification problem
  • Simple training algorithm
  • Built using specialized hardware

10

  • F. Rosenblatt, “The perceptron: A probabilistic model for information storage and organization in the brain”,

Psychological Review, Vol. 65, 1958
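Rosenblatt's "simple training algorithm" is easy to sketch: on each misclassified example, nudge the weights toward (or away from) that example. The AND-style toy dataset and epoch count below are illustrative assumptions, not from the slide.

```python
import numpy as np

def perceptron_train(X, y, epochs=20):
    """Rosenblatt's rule: on every mistake, add (or subtract) the
    misclassified example to the weight vector. Labels are in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi
                b += yi
    return w, b

# A tiny linearly separable problem (AND-like), labels in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = perceptron_train(X, y)
preds = np.sign(X @ w + b)   # all four points end up correctly classified
```

By the perceptron convergence theorem this loop terminates on any linearly separable problem; on XOR-style data it would cycle forever, which is exactly the limitation Minsky and Papert pointed out (next slide).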

slide-11
SLIDE 11

1969: Marvin Minsky and Seymour Papert

“No machine can learn to recognize 
 X unless it possesses, at least 
 potentially, some scheme for 
 representing X.” (p. xiii)

  • Perceptrons can only represent linearly separable functions; they cannot represent functions such as XOR.
  • Wrongly attributed as the reason behind the AI winter, a period of reduced funding and interest in AI research

11

slide-12
SLIDE 12

1990s

  • Multi-layer perceptrons can theoretically learn any function (Cybenko, 1989; Hornik, 1991)
  • Training multi-layer perceptrons
    − Back-propagation (Rumelhart, Hinton, Williams, 1986)
    − Back-propagation through time (BPTT) (Werbos, 1988)
  • New neural architectures
    − Convolutional neural nets (LeCun et al., 1989)
    − Long short-term memory networks (LSTM) (Hochreiter and Schmidhuber, 1997)

12

slide-13
SLIDE 13

Why it failed then

  • Too many parameters to learn from few labeled examples.
  • “I know my features are better for this task.”
  • Non-convex optimization? No, thanks.
  • Black-box model, no interpretability.
  • Very slow and inefficient.
  • Overshadowed by the success of SVMs (Cortes and Vapnik, 1995)

13

Adapted from Joan Bruna

slide-14
SLIDE 14

A major breakthrough in 2006

14

slide-15
SLIDE 15

2006 Breakthrough: 
 Hinton and Salakhutdinov

  • The first solution to the vanishing gradient problem
  • Build the model in a layer-by-layer fashion using unsupervised learning
    − The features in early layers are already initialized or “pretrained” with some suitable features (weights).
    − Pretrained features in early layers only need to be adjusted slightly during supervised learning to achieve good results.

15

  • G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks”, Science, Vol. 313, 28 July 2006.

slide-16
SLIDE 16

The 2012 revolution

16

slide-17
SLIDE 17

ImageNet Challenge

17

Image classification

  • Large Scale Visual Recognition Challenge (ILSVRC)
  • 1.2M training images with 1K categories
  • Measure top-5 classification error

[Figure: easiest vs. hardest classes, with example outputs such as scale, T-shirt, steel drum, giant panda, drumstick, mud turtle]

  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database”, CVPR 2009.
  • O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge”, Int. J. Comput. Vis., Vol. 115, Issue 3, pp. 211-252, 2015.

slide-18
SLIDE 18

ILSVRC 2012 Competition

2012 Teams             %Error
Supervision (Toronto)    15.3
ISI (Tokyo)              26.1
VGG (Oxford)             26.9
XRCE/INRIA               27.0
UvA (Amsterdam)          29.6
INRIA/LEAR               33.4

  • A. Krizhevsky, I. Sutskever, G.E. Hinton “ImageNet Classification with Deep Convolutional Neural Networks”, NIPS 2012
  • The success of AlexNet, a deep convolutional network
    − 7 hidden layers (not counting some max pooling layers)
    − 60M parameters
    − Combined several tricks: ReLU activation function, data augmentation, dropout

18

[Chart legend: CNN based vs. non-CNN based entries]

slide-19
SLIDE 19

2012 – now
 A Cambrian explosion in deep learning

19

slide-20
SLIDE 20

  • Speech recognition: Amodei et al., “Deep Speech 2: End-to-End Speech Recognition in English and Mandarin”, In CoRR 2015
  • Machine translation: M.-T. Luong et al., “Effective Approaches to Attention-based Neural Machine Translation”, EMNLP 2015
    [Figure: sequence-to-sequence translation of “Je suis étudiant” to “I am a student”]
  • Self-driving cars: M. Bojarski et al., “End to End Learning for Self-Driving Cars”, In CoRR 2016
  • Game playing: D. Silver et al., “Mastering the game of Go with deep neural networks and tree search”, Nature 529, 2016
  • Robotics: L. Pinto and A. Gupta, “Supersizing Self-supervision: Learning to Grasp from 50K Tries and 700 Robot Hours”, ICRA 2015
  • Genomics: H. Y. Xiong et al., “The human splicing code reveals new insights into the genetic determinants of disease”, Science 347, 2015
  • Audio generation: M. Ramona et al., “Capturing a Musician's Groove: Generation of Realistic Accompaniments from Single Song Recordings”, In IJCAI 2015

And many more… 20

slide-21
SLIDE 21

Why now?

21

slide-22
SLIDE 22

22

Slide credit: Neil Lawrence

slide-23
SLIDE 23

Datasets vs. Algorithms

23

Year | Breakthrough in AI | Dataset (First Available) | Algorithm (First Proposed)
1994 | Human-level spontaneous speech recognition | Spoken Wall Street Journal articles and other texts (1991) | Hidden Markov Model (1984)
1997 | IBM Deep Blue defeated Garry Kasparov | 700,000 Grandmaster chess games, aka “The Extended Book” (1991) | Negascout planning algorithm (1983)
2005 | Google's Arabic- and Chinese-to-English translation | 1.8 trillion tokens from Google Web and News pages (collected in 2005) | Statistical machine translation algorithm (1988)
2011 | IBM Watson became the world Jeopardy! champion | 8.6 million documents from Wikipedia, Wiktionary, and Project Gutenberg (updated in 2010) | Mixture-of-Experts (1991)
2014 | Google's GoogLeNet object classification at near-human performance | ImageNet corpus of 1.5 million labeled images and 1,000 object categories (2010) | Convolutional Neural Networks (1989)
2015 | Google's DeepMind achieved human parity in playing 29 Atari games by learning general control from video | Arcade Learning Environment dataset of over 50 Atari games (2013) | Q-learning (1992)

Average No. of Years to Breakthrough: Datasets 3 years, Algorithms 18 years

Table credit: Quant Quanto

slide-24
SLIDE 24
Powerful Hardware

  • CPU vs. GPU

24

slide-25
SLIDE 25

25

slide-26
SLIDE 26
Working ideas on how to train deep architectures

  • Better Learning Regularization (e.g. Dropout)

  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, JMLR Vol. 15, No. 1, 2014

26

slide-27
SLIDE 27

Working ideas on how to train deep architectures

  • Better Optimization Conditioning (e.g. Batch Normalization)

  • S. Ioffe, C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”, In ICML 2015

27

slide-28
SLIDE 28

Working ideas on how to train deep architectures

  • Better neural architectures (e.g. Residual Nets)

  • K. He, X. Zhang, S. Ren, J. Sun, “Deep Residual Learning for Image Recognition”, In CVPR 2016

28

slide-29
SLIDE 29

So what is deep learning?

29

slide-30
SLIDE 30

Three key ideas

  • (Hierarchical) Compositionality
    − Cascade of non-linear transformations
    − Multiple layers of representations
  • End-to-End Learning
    − Learning (goal-driven) representations
    − Learning to feature extract
  • Distributed Representations
    − No single neuron “encodes” everything
    − Groups of neurons work together

30

slide by Dhruv Batra

slide-31
SLIDE 31

Three key ideas

  • (Hierarchical) Compositionality
    − Cascade of non-linear transformations
    − Multiple layers of representations
  • End-to-End Learning
    − Learning (goal-driven) representations
    − Learning to feature extract
  • Distributed Representations
    − No single neuron “encodes” everything
    − Groups of neurons work together

31

slide by Dhruv Batra

slide-32
SLIDE 32

Traditional Machine Learning

VISION: hand-crafted features SIFT/HOG (fixed) → your favorite classifier (learned) → “car”
SPEECH: hand-crafted features MFCC (fixed) → your favorite classifier (learned) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → hand-crafted features Bag-of-words (fixed) → your favorite classifier (learned) → “+”

slide by Marc'Aurelio Ranzato, Yann LeCun

32

slide-33
SLIDE 33

It's an old paradigm

  • The first learning machine: the Perceptron
  • Built at Cornell in 1960
  • The Perceptron was a linear classifier on top of a simple feature extractor
  • The vast majority of practical applications of ML today use glorified linear classifiers or glorified template matching.
  • Designing a feature extractor requires considerable effort by experts.

y = sign( Σ_{i=1}^{N} W_i F_i(X) + b )

[Figure: input A → Feature Extractor → features F_i(X), combined with weights W_i]

33

slide by Marc'Aurelio Ranzato, Yann LeCun

slide-34
SLIDE 34

Hierarchical Compositionality

VISION: pixels → edge → texton → motif → part → object
SPEECH: sample → spectral band → formant → motif → phone → word
NLP: character → word → NP/VP/.. → clause → sentence → story

slide by Marc'Aurelio Ranzato, Yann LeCun

34

slide-35
SLIDE 35

Building A Complicated Function

Given a library of simple functions, compose them into a complicated function

slide by Marc’Aurelio Ranzato, Yann LeCun

35

slide-36
SLIDE 36

Building A Complicated Function

Given a library of simple functions

Idea 1: Linear Combinations

  • Boosting
  • Kernels

Compose into a complicated function

slide by Marc’Aurelio Ranzato, Yann LeCun

36

slide-37
SLIDE 37

Building A Complicated Function

Given a library of simple functions

Idea 2: Compositions

  • Deep Learning
  • Grammar models
  • Scattering transforms…

Compose into a complicated function

slide by Marc’Aurelio Ranzato, Yann LeCun

37

slide-38
SLIDE 38

Building A Complicated Function

Given a library of simple functions

Idea 2: Compositions

  • Deep Learning
  • Grammar models
  • Scattering transforms…

Compose into a complicated function

slide by Marc’Aurelio Ranzato, Yann LeCun

38

slide-39
SLIDE 39

“car”

slide by Marc’Aurelio Ranzato, Yann LeCun

Deep Learning = Hierarchical Compositionality

39

slide-40
SLIDE 40

Low-Level Feature → Mid-Level Feature → High-Level Feature → Trainable Classifier

Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

“car”

Deep Learning = Hierarchical Compositionality

slide by Marc’Aurelio Ranzato, Yann LeCun

40

slide-41
SLIDE 41

41 Sparse DBNs [Lee et al. ICML ‘09] Figure courtesy: Quoc Le

slide by Dhruv Batra

slide-42
SLIDE 42

Three key ideas

  • (Hierarchical) Compositionality
    − Cascade of non-linear transformations
    − Multiple layers of representations
  • End-to-End Learning
    − Learning (goal-driven) representations
    − Learning to feature extract
  • Distributed Representations
    − No single neuron “encodes” everything
    − Groups of neurons work together

42

slide by Dhruv Batra

slide-43
SLIDE 43

Traditional Machine Learning

VISION: hand-crafted features SIFT/HOG (fixed) → your favorite classifier (learned) → “car”
SPEECH: hand-crafted features MFCC (fixed) → your favorite classifier (learned) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → hand-crafted features Bag-of-words (fixed) → your favorite classifier (learned) → “+”

slide by Marc'Aurelio Ranzato, Yann LeCun

43

slide-44
SLIDE 44

Traditional Machine Learning (more accurately)

VISION: SIFT/HOG (fixed) → K-Means/pooling (unsupervised, “Learned”) → classifier (supervised) → “car”
SPEECH: MFCC (fixed) → Mixture of Gaussians (unsupervised, “Learned”) → classifier (supervised) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic (fixed) → n-grams (unsupervised, “Learned”) → classifier (supervised) → “+”

slide by Marc'Aurelio Ranzato, Yann LeCun

44

slide-45
SLIDE 45

Deep Learning = End-to-End Learning

VISION: SIFT/HOG (fixed) → K-Means/pooling (unsupervised, “Learned”) → classifier (supervised) → “car”
SPEECH: MFCC (fixed) → Mixture of Gaussians (unsupervised, “Learned”) → classifier (supervised) → \ˈd ē p\
NLP: “This burrito place is yummy and fun!” → Parse Tree Syntactic (fixed) → n-grams (unsupervised, “Learned”) → classifier (supervised) → “+”

slide by Marc'Aurelio Ranzato, Yann LeCun

45

slide-46
SLIDE 46
Deep Learning = End-to-End Learning

  • A hierarchy of trainable feature transforms
  • Each module transforms its input representation into a higher-level one.
  • High-level features are more global and more invariant
  • Low-level features are shared among categories

Trainable Feature-Transform/Classifier → Trainable Feature-Transform/Classifier → Trainable Feature-Transform/Classifier (Learned Internal Representations)

slide by Marc'Aurelio Ranzato, Yann LeCun

46

slide-47
SLIDE 47
“Shallow” vs Deep Learning

  • “Shallow” models: hand-crafted Feature Extractor (fixed) → “Simple” Trainable Classifier (learned)
  • Deep models: Trainable Feature-Transform/Classifier → Trainable Feature-Transform/Classifier → Trainable Feature-Transform/Classifier (Learned Internal Representations)

slide by Marc'Aurelio Ranzato, Yann LeCun

47

slide-48
SLIDE 48

Three key ideas

  • (Hierarchical) Compositionality
    − Cascade of non-linear transformations
    − Multiple layers of representations
  • End-to-End Learning
    − Learning (goal-driven) representations
    − Learning to feature extract
  • Distributed Representations
    − No single neuron “encodes” everything
    − Groups of neurons work together

48

slide by Dhruv Batra

slide-49
SLIDE 49

Localist representations

  • The simplest way to represent things with neural networks is to dedicate one neuron to each thing.
    − Easy to understand.
    − Easy to code by hand; often used to represent inputs to a net
    − Easy to learn; this is what mixture models do: each cluster corresponds to one neuron
    − Easy to associate with other representations or responses.
  • But localist models are very inefficient whenever the data has componential structure.

49 Image credit: Moontae Lee

slide by Geoff Hinton

slide-50
SLIDE 50

Distributed Representations

  • Each neuron must represent something, so this must be a local representation.
  • Distributed representation means a many-to-many relationship between two types of representation (such as concepts and neurons).
    − Each concept is represented by many neurons
    − Each neuron participates in the representation of many concepts

50

Local Distributed

slide by Geoff Hinton

Image credit: Moontae Lee

slide-51
SLIDE 51

Power of distributed representations!

  • Possible internal representations:
  • Objects
  • Scene attributes
  • Object parts
  • Textures

51

  • B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Object Detectors Emerge in Deep Scene CNNs”, ICLR 2015

[Figure: scene classification examples, e.g. bedroom, mountain]

slide by Bolei Zhou

slide-52
SLIDE 52

Deep Convolutional 
 Neural Networks

52

slide-53
SLIDE 53

Convolutions

slide by Yisong Yue

53

slide-54
SLIDE 54

Convolution Filters

54

slide by Yisong Yue

slide-55
SLIDE 55

Gabor Filters

55

slide by Yisong Yue

slide-56
SLIDE 56

Gaussian Blur Filters

56

slide by Yisong Yue

slide-57
SLIDE 57

Convolutional Neural Networks

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

57

slide-58
SLIDE 58

Convolution Layer

32x32x3 image: width 32, height 32, depth 3

58

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-59
SLIDE 59

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”

59

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-60
SLIDE 60

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve the filter with the image, i.e. “slide over the image spatially, computing dot products”

Filters always extend the full depth of the input volume.

60

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-61
SLIDE 61

Convolution Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. 5*5*3 = 75-dimensional dot product + bias)

61

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-62
SLIDE 62

Convolution Layer

32x32x3 image, 5x5x3 filter

Convolve (slide) over all spatial locations → a 28x28x1 activation map

62

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
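The "slide over the image, computing dot products" description above can be written as a naive loop. This is a minimal sketch, not how real frameworks implement it (they vectorize heavily), and like most CNN libraries it computes a sliding dot product (cross-correlation) without flipping the filter:

```python
import numpy as np

def conv2d_single(image, filt, bias=0.0):
    """Naive 'valid' convolution of one filter over one image.
    image: (H, W, D); filt: (F, F, D) -- spans the full input depth."""
    H, W, D = image.shape
    F = filt.shape[0]
    out = np.zeros((H - F + 1, W - F + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # one number per location: a 5*5*3 = 75-dimensional dot product + bias
            patch = image[i:i+F, j:j+F, :]
            out[i, j] = np.sum(patch * filt) + bias
    return out

rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 3))   # 32x32x3 input, as on the slide
filt = rng.normal(size=(5, 5, 3))      # 5x5x3 filter
amap = conv2d_single(image, filt)      # 28x28 activation map
```

Running several different filters and stacking their maps along the depth axis gives the "new image" described a few slides later.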

slide-63
SLIDE 63

Convolution Layer

32x32x3 image, 5x5x3 filter

Consider a second (green) filter: convolving (sliding) it over all spatial locations gives a second 28x28x1 activation map.

63

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-64
SLIDE 64

Convolution Layer

For example, if we had 6 5x5 filters, we'll get 6 separate 28x28 activation maps.

We stack these up to get a “new image” of size 28x28x6!

64

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-65
SLIDE 65

Preview: a ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

32x32x3 → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6

65

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-66
SLIDE 66

Preview: ConvNet is a sequence of Convolutional Layers, interspersed with activation functions

32x32x3 → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → CONV, ReLU → ….

66

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-67
SLIDE 67

Preview

[From recent Yann LeCun slides]

67

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-68
SLIDE 68

[From recent Yann LeCun slides]

Preview

68

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-69
SLIDE 69

We call the layer convolutional because it is related to convolution of two signals: elementwise multiplication and sum of a filter and the signal (image).

One filter => one activation map

example 5x5 filters (32 total)

69

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-70
SLIDE 70

Preview

70

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-71
SLIDE 71

A closer look at spatial dimensions:

32x32x3 image, 5x5x3 filter

Convolve (slide) over all spatial locations → a 28x28x1 activation map

71

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-72
SLIDE 72

7x7 input (spatially), assume 3x3 filter

72

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-73
SLIDE 73

7x7 input (spatially), assume 3x3 filter

73

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-74
SLIDE 74

7x7 input (spatially), assume 3x3 filter

74

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-75
SLIDE 75

7x7 input (spatially), assume 3x3 filter

75

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-76
SLIDE 76

7x7 input (spatially), assume 3x3 filter => 5x5 output

76

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-77
SLIDE 77

7x7 input (spatially), assume 3x3 filter applied with stride 2

77

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-78
SLIDE 78

7x7 input (spatially), assume 3x3 filter applied with stride 2

78

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-79
SLIDE 79

7x7 input (spatially), assume 3x3 filter applied with stride 2 => 3x3 output!

79

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-80
SLIDE 80

7x7 input (spatially), assume 3x3 filter applied with stride 3?

80

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-81
SLIDE 81

7x7 input (spatially), assume 3x3 filter applied with stride 3? Doesn't fit! Cannot apply 3x3 filter on 7x7 input with stride 3.

81

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

A closer look at spatial dimensions:

slide-82
SLIDE 82

Output size: (N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\ (doesn't fit)

82

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
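The output-size formula, including its zero-padded generalization (N + 2P - F)/stride + 1 used a few slides later, fits in a small helper that also flags the "doesn't fit" case. The function name and its `None` convention are illustrative choices:

```python
def conv_output_size(N, F, stride, pad=0):
    """Spatial output size of a conv layer: (N - F + 2*pad) / stride + 1.
    Returns None when the filter does not tile the input evenly."""
    span = N - F + 2 * pad
    if span % stride != 0:
        return None   # e.g. a 3x3 filter with stride 3 on a 7x7 input: doesn't fit!
    return span // stride + 1

conv_output_size(7, 3, 1)          # 5  (stride 1)
conv_output_size(7, 3, 2)          # 3  (stride 2)
conv_output_size(7, 3, 3)          # None: (7 - 3)/3 + 1 = 2.33
conv_output_size(32, 5, 1, pad=2)  # 32: zero-padding with (F-1)/2 preserves size
```

The last call reproduces the 32x32 example worked through on the later "Examples time" slides.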

slide-83
SLIDE 83

In practice: Common to zero pad the border

e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?

(recall: (N - F) / stride + 1)

83

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-84
SLIDE 84

In practice: Common to zero pad the border

e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?

7x7 output!

84

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-85
SLIDE 85

In practice: Common to zero pad the border

e.g. input 7x7, 3x3 filter applied with stride 1, pad with 1 pixel border => what is the output?

7x7 output! In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 (will preserve size spatially), e.g.
F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3

85

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-86
SLIDE 86

Remember back to… e.g. 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially! (32 -> 28 -> 24 ...). Shrinking too fast is not good, doesn't work well.

32x32x3 → CONV, ReLU (e.g. 6 5x5x3 filters) → 28x28x6 → CONV, ReLU (e.g. 10 5x5x6 filters) → 24x24x10 → CONV, ReLU → ….

86

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-87
SLIDE 87

Recap: Convolution Layer

(No padding, no strides) Convolving a 3 × 3 kernel over a 4 × 4 input using unit strides 
 (i.e., i = 4, k = 3, s = 1 and p = 0).

Image credit: Vincent Dumoulin and Francesco Visin 87

slide-88
SLIDE 88

Computing the output values of a 2D discrete convolution 
 i1 = i2 = 5, k1 = k2 = 3, s1 = s2 = 2, and p1 = p2 = 1

Image credit: Vincent Dumoulin and Francesco Visin

88

slide-89
SLIDE 89

Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size: ?

89

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-90
SLIDE 90

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size: (32+2*2-5)/1+1 = 32 spatially, so 32x32x10

Examples time:

90

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-91
SLIDE 91

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?

Examples time:

91

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-92
SLIDE 92

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?
Each filter has 5*5*3 + 1 = 76 params (+1 for bias) => 76*10 = 760

Examples time:

92

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
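The parameter count above generalizes directly: each filter carries F*F*depth weights plus one bias, regardless of stride or padding. A tiny helper (the function name is an illustrative choice) makes this checkable:

```python
def conv_params(num_filters, F, input_depth):
    """Parameters in a conv layer: each filter has F*F*depth weights + 1 bias."""
    per_filter = F * F * input_depth + 1
    return num_filters * per_filter

conv_params(10, 5, 3)   # 10 * (5*5*3 + 1) = 760, matching the slide
conv_params(6, 5, 3)    # the earlier 6-filter example: 6 * 76 = 456
```

Note that the count is independent of the 32x32 spatial size: parameter sharing is what keeps conv layers so much smaller than fully connected ones.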

slide-93
SLIDE 93

93

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-94
SLIDE 94

Common settings: K = (powers of 2, e.g. 32, 64, 128, 512)

  • F = 3, S = 1, P = 1
  • F = 5, S = 1, P = 2
  • F = 5, S = 2, P = ? (whatever fits)
  • F = 1, S = 1, P = 0

94

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-95
SLIDE 95

(btw, 1x1 convolution layers make perfect sense)

56x56x64 input → 1x1 CONV with 32 filters → 56x56x32

(each filter has size 1x1x64, and performs a 64-dimensional dot product)

95

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
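Because each 1x1 filter is just a 64-dimensional dot product at every spatial position, the whole layer reduces to a matrix multiply over the depth axis. A sketch with random data (shapes taken from the slide, values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(56, 56, 64))   # 56x56x64 input volume
W = rng.normal(size=(64, 32))       # 32 filters, each of size 1x1x64

# A 1x1 convolution = a 64-dim dot product at every spatial position,
# i.e. one matrix multiply applied independently at each (row, col).
out = x @ W                          # shape (56, 56, 32)
```

This is why 1x1 convolutions are a cheap way to change the depth of a volume (e.g. in Network-in-Network and GoogLeNet-style bottlenecks) without touching the spatial layout.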

slide-96
SLIDE 96

Example: CONV layer in Torch

96

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-97
SLIDE 97

Example: CONV layer in Caffe

97

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-98
SLIDE 98

Example: CONV layer in Lasagne

98

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-99
SLIDE 99

The brain/neuron view of CONV Layer

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and this part of the image (i.e. 5*5*3 = 75-dimensional dot product)

99

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-100
SLIDE 100

32x32x3 image, 5x5x3 filter

1 number: the result of taking a dot product between the filter and this part of the image (i.e. 5*5*3 = 75-dimensional dot product). It's just a neuron with local connectivity...

100

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

The brain/neuron view of CONV Layer

slide-101
SLIDE 101

An activation map is a 28x28 sheet of neuron outputs:
1. Each is connected to a small region in the input
2. All of them share parameters

“5x5 filter” -> “5x5 receptive field for each neuron”

101

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

The brain/neuron view of CONV Layer

slide-102
SLIDE 102

E.g. with 5 filters, the CONV layer consists of neurons arranged in a 3D grid (28x28x5).

There will be 5 different neurons all looking at the same region in the input volume.

102

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

The brain/neuron view of CONV Layer

slide-103
SLIDE 103

103

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

Activation Functions

slide-104
SLIDE 104

Activation Functions

Sigmoid | tanh: tanh(x) | ReLU: max(0,x)

104

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
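The three activation functions compared on the next few slides are one-liners; the sample input values below are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes to (0, 1); not zero-centered

def tanh(x):
    return np.tanh(x)                 # squashes to (-1, 1); zero-centered

def relu(x):
    return np.maximum(0.0, x)         # no saturation for x > 0; very cheap

x = np.array([-2.0, 0.0, 2.0])
sigmoid(x), tanh(x), relu(x)          # relu(x) -> [0., 0., 2.]
```

Plotting these over a range like [-5, 5] makes the saturation problem visible: sigmoid and tanh flatten out at both ends (killing gradients), while ReLU stays linear on the positive side.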

slide-105
SLIDE 105

Activation Functions: Sigmoid

  • Squashes numbers to range [0,1]
  • Historically popular since they have a nice interpretation as a saturating “firing rate” of a neuron

3 problems:
1. Saturated neurons “kill” the gradients
2. Sigmoid outputs are not zero-centered
3. exp() is a bit compute expensive

105

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-106
SLIDE 106

Activation Functions

tanh(x)

  • Squashes numbers to range [-1,1]
  • zero centered (nice)
  • still kills gradients when saturated :(

[LeCun et al., 1991]

106

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-107
SLIDE 107

Activation Functions: ReLU (Rectified Linear Unit)

Computes f(x) = max(0,x)

  • Does not saturate (in + region)
  • Very computationally efficient
  • Converges much faster than sigmoid/tanh in practice (e.g. 6x)

[Krizhevsky et al., 2012]

107

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-108
SLIDE 108

two more layers to go: POOL/FC

108

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-109
SLIDE 109

Pooling layer

  • makes the representations smaller and more manageable
  • operates over each activation map independently

109

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-110
SLIDE 110

Max Pooling

Single depth slice (x: width, y: height):

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

max pool with 2x2 filters and stride 2 =>

6 8
3 4

110

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
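Max pooling is just the conv-layer sliding window with `max` instead of a dot product. The transcript of the slide's 4x4 example is partly garbled, so the matrix values below are assumed so that the stated 2x2 output (6, 8 / 3, 4) is reproduced:

```python
import numpy as np

def max_pool(x, F=2, stride=2):
    """Max pooling over a single 2D depth slice."""
    H, W = x.shape
    out = np.zeros((1 + (H - F) // stride, 1 + (W - F) // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # take the max over each FxF window
            out[i, j] = x[i*stride:i*stride+F, j*stride:j*stride+F].max()
    return out

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]], dtype=float)
max_pool(x)   # [[6. 8.]
              #  [3. 4.]]
```

With F = 2, S = 2 the spatial size halves and, unlike a conv layer, the pooling layer adds zero parameters.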

slide-111
SLIDE 111

111

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-112
SLIDE 112

Common settings:
F = 2, S = 2
F = 3, S = 2

112

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson

slide-113
SLIDE 113

Fully Connected Layer (FC layer)

  • Contains neurons that connect to the entire input volume, as in ordinary Neural Networks

113

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson
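Concretely, "connects to the entire input volume" means flattening the last conv/pool output and applying an ordinary matrix multiply. The 4x4x10 volume and 10 output classes below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the last conv/pool stage produced a 4x4x10 volume.
x = rng.normal(size=(4, 4, 10))

# The FC layer flattens it and connects every input value to every
# output neuron, exactly as in an ordinary neural network.
W = rng.normal(size=(4 * 4 * 10, 10))   # e.g. 10 class scores
b = np.zeros(10)
scores = x.reshape(-1) @ W + b           # shape (10,)
```

This one layer already has 4*4*10*10 + 10 = 1,610 parameters, which is why FC layers (not conv layers) dominate the parameter count of classic architectures like AlexNet.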

slide-114
SLIDE 114

http://cs.stanford.edu/people/karpathy/convnetjs/demo/cifar10.html

[ConvNetJS demo: training on CIFAR-10]

114

slide by Fei-Fei Li, Andrej Karpathy & Justin Johnson