

SLIDE 1

CS489/698 Lecture 9: Feb 1, 2017

Multi-layer Neural Networks, Error Backpropagation

Readings: [D] Chap. 10, [HTF] Chap. 11, [B] Sec. 5.2, 5.3, [M] Sec. 16.5, [RN] Sec. 18.7

CS489/698 (c) 2017 P. Poupart

SLIDE 2

Quick Recap: Linear Models

  • Linear regression
  • Linear classification

SLIDE 3

Quick Recap: Non-linear Models

  • Non-linear classification
  • Non-linear regression

SLIDE 4

Non-linear Models

  • Convenient modeling assumption: linearity
  • Extension: non-linearity can be obtained by mapping the inputs to a
    non-linear feature space of basis functions (see the sketch below)
  • Limit: the basis functions are chosen a priori and are fixed
  • Question: can we work with unrestricted non-linear models?

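For concreteness, this is the familiar fixed-basis formulation (a sketch in standard notation; the slide's own formulas were not recoverable):

```latex
% Linear-in-the-parameters model: only the weights w are learned,
% the basis functions \phi_j are chosen a priori and fixed
f(\mathbf{x}) \;=\; \sum_{j=1}^{M} w_j \, \phi_j(\mathbf{x}) \;=\; \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}),
\qquad \text{e.g. } \phi_j(x) = x^j \ \text{(polynomial basis)}
```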

SLIDE 5

Flexible Non-Linear Models

  • Idea 1: Select basis functions that correspond to the training data
    and retain only a subset of them (e.g., Support Vector Machines)
  • Idea 2: Learn non-linear basis functions (e.g., multi-layer neural
    networks)

SLIDE 6

Two-Layer Architecture

  • Feed-forward neural network
  • Hidden units: z_j = h(a_j), where a_j = Σ_i w_ji^(1) x_i
  • Output units: y_k = g(b_k), where b_k = Σ_j w_kj^(2) z_j
  • Overall: y_k(x) = g( Σ_j w_kj^(2) h( Σ_i w_ji^(1) x_i ) ), as sketched below
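A minimal NumPy sketch of this forward computation (hypothetical names; biases omitted, matching the formulas above):

```python
import numpy as np

def forward(x, W1, W2, h=np.tanh, g=lambda b: b):
    """Two-layer feed-forward network.

    x:  input vector, shape (d,)
    W1: hidden-layer weights, shape (m, d)
    W2: output-layer weights, shape (k, m)
    h:  hidden activation; g: output activation
    """
    a = W1 @ x   # pre-activations a_j = sum_i w_ji x_i
    z = h(a)     # hidden outputs z_j = h(a_j)
    b = W2 @ z   # pre-activations b_k = sum_j w_kj z_j
    return g(b)  # outputs y_k = g(b_k)

# Usage: 2 inputs, 3 hidden units, 1 output
y = forward(np.array([0.5, -1.0]), np.ones((3, 2)), np.ones((1, 3)))
```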

SLIDE 7

Common activation functions

  • Threshold: h(a) = 1 if a ≥ 0, else 0
  • Sigmoid: h(a) = σ(a) = 1 / (1 + e^(−a))
  • Gaussian: h(a) = exp(−a² / (2σ²))
  • Tanh: h(a) = tanh(a) = (e^a − e^(−a)) / (e^a + e^(−a))
  • Identity: h(a) = a
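In code, using the standard definitions of these functions (the slide's own formulas were images, so these are the usual textbook forms):

```python
import numpy as np

def threshold(a):
    return (a >= 0).astype(float)       # hard step: 0 or 1

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))     # squashes to (0, 1)

def gaussian(a, s=1.0):
    return np.exp(-a**2 / (2 * s**2))   # radial-basis-style bump

def tanh(a):
    return np.tanh(a)                   # squashes to (-1, 1)

def identity(a):
    return a                            # typical for regression outputs
```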

SLIDE 8

Adaptive non-linear basis functions

  • Non-linear regression:
    – h: non-linear function, g: identity
  • Non-linear classification:
    – h: non-linear function, g: sigmoid
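Written out in the notation of the two-layer architecture (a sketch; the slide's formulas were not recoverable):

```latex
% Regression: identity output unit g
y(\mathbf{x}) = \sum_j w_j^{(2)} \, h\Big(\sum_i w_{ji}^{(1)} x_i\Big)
% Classification: sigmoid output unit g
y(\mathbf{x}) = \sigma\Big(\sum_j w_j^{(2)} \, h\Big(\sum_i w_{ji}^{(1)} x_i\Big)\Big)
```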

SLIDE 9

Weight training

  • Parameters: all the weights w of the network
  • Objectives:
    – Error minimization
    – Maximum likelihood
    – Maximum a posteriori
    – Bayesian learning
  • Algorithm: backpropagation (aka “backprop”)

SLIDE 10

Least squared error

  • Error function: E(w) = ½ Σ_n ( f(x_n; w) − t_n )²
  • When the output unit g is the identity, we are optimizing a linear
    combination of non-linear basis functions, as written out below: the
    output weights form the linear combination, and the hidden units act
    as learned basis functions
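In symbols (a reconstruction consistent with the architecture above):

```latex
% Squared-error objective for identity output g:
E(\mathbf{w}) \;=\; \tfrac{1}{2} \sum_n \Big(
\underbrace{\textstyle\sum_j w_j^{(2)}\,}_{\text{linear combo}}
\underbrace{h\big(\textstyle\sum_i w_{ji}^{(1)} x_{ni}\big)}_{\text{non-linear basis functions}}
\;-\; t_n \Big)^2
```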

SLIDE 11

Sequential Gradient Descent

  • For each example (x_n, t_n), adjust the weights as follows:
    w_ji ← w_ji − η ∂E_n/∂w_ji
  • How can we compute the gradient efficiently given an arbitrary
    network structure?
  • Answer: the backpropagation algorithm (a sketch of the update loop
    follows below)
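A minimal sketch of the per-example update loop, assuming a `gradient(w, x, t)` routine that returns ∂E_n/∂w (a hypothetical stand-in for backprop, which the next slides derive):

```python
def sequential_gradient_descent(w, data, gradient, eta=0.1, epochs=10):
    """Adjust the weights after every training example."""
    for _ in range(epochs):
        for x, t in data:
            w = w - eta * gradient(w, x, t)  # w_ji <- w_ji - eta * dE_n/dw_ji
    return w
```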

SLIDE 12

Backpropagation Algorithm

  • Two phases:
    – Forward phase: compute the output of each unit
    – Backward phase: compute the delta δ at each unit

SLIDE 13

Forward phase

  • Propagate the inputs forward to compute the output of each unit
  • Output z_j at unit j: z_j = h(a_j), where a_j = Σ_i w_ji z_i
    (the sum ranges over the units i that feed into unit j)

SLIDE 14

Backward phase

  • Use the chain rule to recursively compute the gradient:
    – For each weight w_ji: ∂E_n/∂w_ji = (∂E_n/∂a_j)(∂a_j/∂w_ji)
    – Let δ_j ≡ ∂E_n/∂a_j; then ∂E_n/∂w_ji = δ_j z_i
    – Since a_j = Σ_i w_ji z_i, we have ∂a_j/∂w_ji = z_i
    – Recursively: δ_j = h′(a_j) Σ_k w_kj δ_k, where k ranges over the
      units that unit j feeds into (see the sketch below)
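One possible generic formulation of the backward phase for a layered network (a sketch under assumed conventions: Ws[l] maps layer l−1 outputs to layer l pre-activations, and hprime is h′):

```python
import numpy as np

def backward(As, Ws, delta_out, hprime):
    """Compute deltas for every layer, from output back to input.

    As:        list of per-layer pre-activations [a_1, ..., a_L]
    Ws:        list of weight matrices [W_1, ..., W_L]
    delta_out: delta at the output layer, dE_n/da_L
    hprime:    derivative h'(a) of the hidden activation
    """
    deltas = [delta_out]
    # delta_j = h'(a_j) * sum_k w_kj delta_k, one layer at a time
    for W, a in zip(reversed(Ws[1:]), reversed(As[:-1])):
        deltas.append(hprime(a) * (W.T @ deltas[-1]))
    return deltas[::-1]
```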

SLIDE 15

Simple Example

  • Consider a network with two layers:
    – Hidden nodes: z_j = tanh(a_j)
      • Tip: tanh′(a) = 1 − tanh²(a)
    – Output node: y = Σ_j w_j^(2) z_j (identity)
  • Objective: squared error E_n = ½ (y − t_n)²

SLIDE 16

Simple Example

  • Forward propagation:
    – Hidden units: a_j = Σ_i w_ji^(1) x_i and z_j = tanh(a_j)
    – Output units: y_k = Σ_j w_kj^(2) z_j
  • Backward propagation:
    – Output units: δ_k = y_k − t_k
    – Hidden units: δ_j = (1 − z_j²) Σ_k w_kj^(2) δ_k
  • Gradients (a runnable sketch follows below):
    – Hidden layer: ∂E_n/∂w_ji^(1) = δ_j x_i
    – Output layer: ∂E_n/∂w_kj^(2) = δ_k z_j
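Putting the whole example together (a runnable sketch with hypothetical shapes, matching slide 17's setup of tanh hidden units and an identity output):

```python
import numpy as np

def backprop(x, t, W1, W2):
    """One forward/backward pass; returns per-example gradients.

    x: input (d,), t: target (k,)
    W1: hidden weights (m, d), W2: output weights (k, m)
    """
    # Forward phase
    a = W1 @ x                  # a_j = sum_i w_ji x_i
    z = np.tanh(a)              # z_j = tanh(a_j)
    y = W2 @ z                  # y_k = sum_j w_kj z_j (identity output)

    # Backward phase
    delta_out = y - t                            # delta_k = y_k - t_k
    delta_hid = (1 - z**2) * (W2.T @ delta_out)  # (1 - z_j^2) sum_k w_kj delta_k

    # Gradients of E_n = 0.5 * ||y - t||^2
    dW2 = np.outer(delta_out, z)  # dE_n/dw_kj = delta_k z_j
    dW1 = np.outer(delta_hid, x)  # dE_n/dw_ji = delta_j x_i
    return dW1, dW2

# Usage: one sequential gradient-descent step (3 hidden units, 1 output)
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 2)), rng.normal(size=(1, 3))
x, t, eta = np.array([0.5, -1.0]), np.array([0.2]), 0.1
dW1, dW2 = backprop(x, t, W1, W2)
W1, W2 = W1 - eta * dW1, W2 - eta * dW2
```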

SLIDE 17

Non-linear regression examples

  • Two-layer network:
    – 3 tanh hidden units and 1 identity output unit

SLIDE 18

Analysis

  • Efficiency:
    – Fast gradient computation: linear in the number of weights
  • Convergence:
    – Slow convergence (linear rate)
    – May get trapped in local optima
  • Prone to overfitting
    – Solutions: early stopping, regularization (add a penalty term to the
      objective, as sketched below)
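For instance, the standard quadratic weight penalty (a common form of the regularization mentioned above; the slide does not specify one):

```latex
% L2-regularized objective and its gradient
\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\,\|\mathbf{w}\|^2
\quad\Rightarrow\quad
\nabla \tilde{E}(\mathbf{w}) = \nabla E(\mathbf{w}) + \lambda\,\mathbf{w}
```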