SLIDE 1

Deep Learning Basics Lecture 1: Feedforward

Princeton University COS 495 Instructor: Yingyu Liang

SLIDE 2

Motivation I: representation learning

SLIDE 3

Machine learning 1-2-3

  • Collect data and extract features
  • Build model: choose hypothesis class 𝓗 and loss function β„“
  • Optimization: minimize the empirical loss
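
A minimal sketch of this 1-2-3 recipe, assuming a linear hypothesis class and squared loss; the toy data, step size, and plain gradient-descent loop below are illustrative assumptions, not part of the slides:

    import numpy as np

    # 1. Collect data and extract features (here: random toy data, identity features)
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))                   # 100 examples, 3 features phi(x)
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    # 2. Build model: hypothesis class = linear predictors w^T phi(x), loss = squared error
    def empirical_loss(w):
        return np.mean((X @ w - y) ** 2)

    # 3. Optimization: minimize the empirical loss by plain gradient descent
    w = np.zeros(3)
    for _ in range(500):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= 0.1 * grad

    print(empirical_loss(w))                        # should end up near the noise level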
SLIDE 4

Features

[Pipeline: input image x β†’ extract features Ο†(x), e.g., a color histogram over the red, green, and blue channels β†’ build hypothesis y = w^T Ο†(x)]

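A sketch of this fixed-feature pipeline, assuming the image is an HΓ—WΓ—3 RGB array; the color_histogram helper, the bin count, and the random weights are illustrative assumptions:

    import numpy as np

    def color_histogram(image, bins=8):
        """phi(x): concatenate a histogram of each of the R, G, B channels."""
        hists = [np.histogram(image[..., c], bins=bins, range=(0, 256), density=True)[0]
                 for c in range(3)]
        return np.concatenate(hists)                 # feature vector of length 3 * bins

    # Build hypothesis y = w^T phi(x) on top of the fixed features
    image = np.random.randint(0, 256, size=(32, 32, 3))   # toy "image"
    phi = color_histogram(image)
    w = np.random.randn(phi.size)                          # weights to be learned
    y = w @ phi
    print(y)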
SLIDE 5

Features: part of the model

𝑧 = π‘₯π‘ˆπœš 𝑦

build hypothesis

Linear model Nonlinear model

SLIDE 6

Example: Polynomial kernel SVM

y = sign(w^T Ο†(x) + b), with a fixed feature map Ο†(x)

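A short sketch using scikit-learn's SVC with a polynomial kernel, which corresponds to a fixed (implicit) polynomial feature map Ο†(x); the toy data, degree, and coef0 below are illustrative assumptions:

    import numpy as np
    from sklearn.svm import SVC

    # Toy binary classification data with a nonlinear decision boundary
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2))
    y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 1)

    # Polynomial kernel SVM: phi(x) is fixed, only w and b are learned
    clf = SVC(kernel="poly", degree=2, coef0=1.0)
    clf.fit(X, y)
    print(clf.score(X, y))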
SLIDE 7

Motivation: representation learning

  • Why don’t we also learn Ο†(x)?

[Diagram: y = w^T Ο†(x), where both the feature map Ο†(x) and the weights w are learned]

SLIDE 8

Feedforward networks

  • View each dimension of Ο†(x) as something to be learned

[Diagram: input x β†’ learned features Ο†(x) β†’ output y = w^T Ο†(x)]

SLIDE 9

Feedforward networks

  • Linear functions Ο†_i(x) = ΞΈ_i^T x don’t work: need some nonlinearity (see the check below)

[Diagram: input x β†’ features Ο†(x) β†’ output y = w^T Ο†(x)]

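A quick numerical check of why purely linear features add nothing: if Ο†(x) = Θx for some matrix Θ, then w^T Ο†(x) = (Θ^T w)^T x is still just a linear function of x. The matrices below are arbitrary illustrative values:

    import numpy as np

    rng = np.random.default_rng(0)
    Theta = rng.normal(size=(4, 3))        # linear "features" phi(x) = Theta @ x
    w = rng.normal(size=4)
    x = rng.normal(size=3)

    y_two_step = w @ (Theta @ x)           # w^T phi(x) with linear phi
    y_one_step = (Theta.T @ w) @ x         # the same model collapsed to one linear map
    print(np.allclose(y_two_step, y_one_step))   # True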
SLIDE 10

Feedforward networks

  • Typically, set Ο†_i(x) = r(ΞΈ_i^T x) where r(Β·) is some nonlinear function (see the sketch below)

[Diagram: input x β†’ features Ο†(x) β†’ output y = w^T Ο†(x)]

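A minimal sketch of this one-hidden-layer network in NumPy, assuming tanh as the nonlinearity r(Β·); all sizes and the random weights are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.normal(size=5)                 # input
    Theta = rng.normal(size=(10, 5))       # rows are theta_i, one per learned feature
    w = rng.normal(size=10)

    phi = np.tanh(Theta @ x)               # phi_i(x) = r(theta_i^T x), here r = tanh
    y = w @ phi                            # y = w^T phi(x)
    print(y)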
SLIDE 11

Feedforward deep networks

  • What if we go deeper? (see the sketch below)

[Diagram: input x β†’ hidden layers h^1, h^2, …, h^L β†’ output y]

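Going deeper repeats the same pattern: each layer applies a nonlinearity to a linear map of the previous layer. A sketch, with the layer widths, tanh, and random weights all chosen purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    widths = [5, 8, 8, 8, 1]                # input, h^1, h^2, h^3, output
    Ws = [rng.normal(size=(m, n)) for n, m in zip(widths[:-1], widths[1:])]

    h = rng.normal(size=widths[0])          # input x
    for W in Ws[:-1]:
        h = np.tanh(W @ h)                  # hidden layers: h^(k) = r(W^(k) h^(k-1))
    y = Ws[-1] @ h                          # linear output layer
    print(y)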
SLIDE 12

Figure from Deep learning, by Goodfellow, Bengio, Courville. Dark boxes are things to be learned.

SLIDE 13

Motivation II: neurons

SLIDE 14

Motivation: neurons

Figure from Wikipedia

SLIDE 15

Motivation: abstract neuron model

  • Neuron activated when the correlation between the input and a pattern w exceeds some threshold b
  • y = threshold(w^T x βˆ’ b)
  • Or y = r(w^T x βˆ’ b)
  • r(Β·) is called the activation function (see the sketch below)

[Diagram: inputs x_1, x_2, …, x_d β†’ output y]

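A tiny sketch of this abstract neuron; the pattern w, threshold b, input, and the choice of tanh as the smooth activation are all illustrative:

    import numpy as np

    w = np.array([0.5, -1.0, 2.0])     # pattern the neuron responds to
    b = 0.3                            # activation threshold
    x = np.array([1.0, 0.2, 0.4])      # input

    y_hard = float(w @ x - b >= 0)     # y = threshold(w^T x - b)
    y_soft = np.tanh(w @ x - b)        # y = r(w^T x - b) with a smooth activation
    print(y_hard, y_soft)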
SLIDE 16

Motivation: artificial neural networks

SLIDE 17

Motivation: artificial neural networks

  • Put into layers: feedforward deep networks

[Diagram: input x β†’ hidden layers h^1, h^2, …, h^L β†’ output y]

SLIDE 18

Components in Feedforward networks

SLIDE 19

Components

  • Representations:
    • Input
    • Hidden variables
  • Layers/weights:
    • Hidden layers
    • Output layer
SLIDE 20

Components

[Diagram: input x β†’ first layer β†’ hidden variables h^1, h^2, …, h^L β†’ output layer β†’ y]

SLIDE 21

Input

  • Represented as a vector
  • Sometimes requires some preprocessing (see the sketch below), e.g.,
    • Subtract mean
    • Normalize to [-1,1]


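A sketch of these two preprocessing steps on a toy data matrix; the shapes and the per-feature min-max scheme for mapping to [-1,1] are illustrative assumptions:

    import numpy as np

    X = np.random.default_rng(0).normal(loc=5.0, scale=2.0, size=(100, 3))

    # Subtract mean (per feature)
    X_centered = X - X.mean(axis=0)

    # Normalize to [-1, 1] (per feature, via min-max scaling)
    lo, hi = X.min(axis=0), X.max(axis=0)
    X_scaled = 2 * (X - lo) / (hi - lo) - 1
    print(X_scaled.min(axis=0), X_scaled.max(axis=0))   # ~[-1, -1, -1] and [1, 1, 1]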
SLIDE 22

Output layers

  • Regression: y = w^T h + b
  • Linear units: no nonlinearity

[Diagram: last hidden layer h β†’ output layer β†’ y]

SLIDE 23

Output layers

  • Multi-dimensional regression: y = W^T h + b
  • Linear units: no nonlinearity

[Diagram: last hidden layer h β†’ output layer β†’ y]

SLIDE 24

Output layers

  • Binary classification: y = Οƒ(w^T h + b)
  • Corresponds to using logistic regression on h

[Diagram: last hidden layer h β†’ output layer β†’ y]

SLIDE 25

Output layers

  • Multi-class classification: y = softmax(z) where z = W^T h + b
  • Corresponds to using multi-class logistic regression on h (see the sketch below)

[Diagram: last hidden layer h β†’ output layer z β†’ y]

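A sketch of the four output-layer choices from this and the previous three slides; the random weights and the last hidden layer h are chosen purely for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    h = rng.normal(size=6)                        # last hidden layer

    # Regression: y = w^T h + b (linear unit)
    w, b = rng.normal(size=6), 0.1
    y_reg = w @ h + b

    # Multi-dimensional regression: y = W^T h + b (linear units)
    W, b_vec = rng.normal(size=(6, 3)), rng.normal(size=3)
    y_multireg = W.T @ h + b_vec

    # Binary classification: y = sigmoid(w^T h + b)
    y_binary = 1.0 / (1.0 + np.exp(-(w @ h + b)))

    # Multi-class classification: y = softmax(z), z = W^T h + b
    z = W.T @ h + b_vec
    y_multiclass = np.exp(z - z.max()) / np.exp(z - z.max()).sum()
    print(y_reg, y_multireg, y_binary, y_multiclass)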
SLIDE 26

Hidden layers

  • Each neuron takes a weighted linear combination of the previous layer
  • So it can be thought of as outputting one value for the next layer

[Diagram: layer h^j β†’ layer h^(j+1)]

SLIDE 27

Hidden layers

  • y = r(w^T x + b)
  • Typical activation functions r (see the sketch below)
    • Threshold: t(z) = 𝕀[z β‰₯ 0]
    • Sigmoid: Οƒ(z) = 1/(1 + exp(βˆ’z))
    • Tanh: tanh(z) = 2Οƒ(2z) βˆ’ 1

[Diagram: input x β†’ r(Β·) β†’ output y]

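A sketch of these three activation functions in NumPy; the test values are arbitrary:

    import numpy as np

    def threshold(z):
        return (z >= 0).astype(float)          # t(z) = 1[z >= 0]

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))        # sigma(z)

    def tanh(z):
        return 2 * sigmoid(2 * z) - 1          # same as np.tanh(z)

    z = np.array([-2.0, 0.0, 3.0])
    print(threshold(z), sigmoid(z), tanh(z))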
SLIDE 28

Hidden layers

  • Problem: saturation (see the numeric check below)

[Plot: a sigmoid-like activation r(Β·) flattens out for large |z|; in these saturated regions the gradient is too small. Figure borrowed from Pattern Recognition and Machine Learning, Bishop]

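A quick numeric check of the saturation problem, using the sigmoid from the previous slide: its derivative Οƒ(z)(1 βˆ’ Οƒ(z)) is nearly zero for inputs far from 0 (the sample values are arbitrary):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.array([0.0, 2.0, 5.0, 10.0])
    grad = sigmoid(z) * (1 - sigmoid(z))   # derivative of the sigmoid
    print(grad)   # ~[0.25, 0.105, 0.0066, 0.000045]: the gradient vanishes in saturation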
SLIDE 29

Hidden layers

  • Activation function ReLU (rectified linear unit)
  • ReLU(z) = max{z, 0}

Figure from Deep learning, by Goodfellow, Bengio, Courville.

SLIDE 30

Hidden layers

  • Activation function ReLU (rectified linear unit)
  • ReLU(z) = max{z, 0} (see the sketch below)

[Plot: ReLU(z); gradient 0 for z < 0, gradient 1 for z > 0]

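A sketch of ReLU and its (sub)gradient, matching the two regions in the plot above; the sample inputs are arbitrary:

    import numpy as np

    def relu(z):
        return np.maximum(z, 0.0)              # ReLU(z) = max{z, 0}

    def relu_grad(z):
        return (z > 0).astype(float)           # gradient: 0 for z < 0, 1 for z > 0

    z = np.array([-3.0, -0.5, 0.5, 3.0])
    print(relu(z), relu_grad(z))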
SLIDE 31

Hidden layers

  • Generalizations of ReLU: gReLU(z) = max{z, 0} + Ξ± min{z, 0} (see the sketch below)
    • Leaky-ReLU(z) = max{z, 0} + 0.01 min{z, 0}
    • Parametric-ReLU(z): Ξ± learnable

[Plot: gReLU(z) vs. z]
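
A sketch of the generalized ReLU family: Ξ± = 0.01 gives Leaky-ReLU, and making Ξ± a learned parameter gives Parametric-ReLU (the sample inputs and the third Ξ± value are arbitrary):

    import numpy as np

    def g_relu(z, alpha):
        return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

    z = np.array([-3.0, -0.5, 0.5, 3.0])
    print(g_relu(z, alpha=0.0))     # plain ReLU
    print(g_relu(z, alpha=0.01))    # Leaky-ReLU
    print(g_relu(z, alpha=0.2))     # Parametric-ReLU with a (hypothetical) learned alpha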