SLIDE 1

Backpropagation and Gradient Descent

Brian Carignan, Dec 5 2016

SLIDE 2

Overview

▪ Notation/background
  | Neural networks
  | Activation functions
  | Vectorization
  | Cost functions
▪ Introduction
▪ Algorithm Overview
▪ Four fundamental equations
  | Definitions (all 4) and proofs (1 and 2)
▪ Example from thesis related work

SLIDE 3

Neural Networks 1

SLIDE 4

Neural Networks 2

▪ a – activation of a neuron; related to the activations in the previous layer
▪ b – bias of a neuron
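The relation on the slide was an image; in the notation of Nielsen (2015), which this deck follows, the activation of neuron j in layer l reads:

  a^l_j = \sigma\Big( \sum_k w^l_{jk} a^{l-1}_k + b^l_j \Big)

where w^l_{jk} is the weight from neuron k in layer l-1 to neuron j in layer l, and \sigma is the activation function.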

SLIDE 5

Activation Functions

▪ Similar to an ON/OFF switch
▪ Required properties
  | Nonlinear
  | Continuously differentiable
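As a concrete example, the sigmoid (whose derivative appears later on slide 13) satisfies both properties:

  \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \sigma'(z) = \sigma(z)\,(1 - \sigma(z))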

SLIDE 6

Vectorization

▪ Represent each layer as a vector
  | Simplifies notation
  | Leads to faster computation by exploiting vector math
▪ z – weighted input vector
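In vectorized form (again Nielsen's notation), the weighted input and activation of layer l are:

  z^l = w^l a^{l-1} + b^l, \qquad a^l = \sigma(z^l)

with the activation function applied elementwise.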

SLIDE 7

Cost Function

▪ Objective function
▪ Optimization problem
▪ Assumptions
  | Can be written as an average over per-example costs Cx
  | Is a function of the network's outputs
▪ x – individual training examples (fixed)
▪ Example:
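The example on the slide was an image; a standard cost satisfying both assumptions, and the one Nielsen (2015) uses in this context, is the quadratic cost, shown here as a plausible reconstruction:

  C = \frac{1}{2n} \sum_x \lVert y(x) - a^L(x) \rVert^2, \qquad C_x = \frac{1}{2} \lVert y(x) - a^L(x) \rVert^2

where y(x) is the desired output and a^L(x) the network's output for training example x.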

SLIDE 8

Introduction

▪ Backpropagation
  | Backward propagation of errors
  | Calculates gradients
  | One way to train neural networks
▪ Gradient Descent
  | Optimization method
  | Finds a local minimum
  | Takes steps proportional to the negative of the gradient at the current point
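Written out, one gradient-descent step with learning rate \eta updates every weight and bias as:

  w \rightarrow w' = w - \eta \frac{\partial C}{\partial w}, \qquad b \rightarrow b' = b - \eta \frac{\partial C}{\partial b}

Backpropagation supplies these partial derivatives.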

SLIDE 9

Algorithm Overview
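The slide itself was a diagram; below is a minimal NumPy sketch of the algorithm, assuming sigmoid activations and quadratic cost as in Nielsen (2015). The function names (backprop, gradient_descent_step) and shapes are illustrative, not from the slides; comments flag where the four fundamental equations (BP1–BP4, defined on the next slides) enter.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)

    def backprop(weights, biases, x, y):
        """One forward/backward pass for a single example (x, y).
        weights[l] maps layer l to layer l+1; quadratic cost assumed."""
        # Forward pass: store weighted inputs z and activations a per layer.
        a, activations, zs = x, [x], []
        for w, b in zip(weights, biases):
            z = w @ a + b
            zs.append(z)
            a = sigmoid(z)
            activations.append(a)
        # BP1: output-layer error, with nabla_a C = (a^L - y) for quadratic cost.
        delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
        nabla_b = [np.zeros_like(b) for b in biases]
        nabla_w = [np.zeros_like(w) for w in weights]
        nabla_b[-1] = delta                             # BP3
        nabla_w[-1] = np.outer(delta, activations[-2])  # BP4
        # BP2: push the error backwards through the transposed weight matrices.
        for l in range(2, len(weights) + 1):
            delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
            nabla_b[-l] = delta                              # BP3
            nabla_w[-l] = np.outer(delta, activations[-l - 1])  # BP4
        return nabla_w, nabla_b

    def gradient_descent_step(weights, biases, batch, eta):
        """One gradient-descent step using the average gradient over a batch."""
        grad_w = [np.zeros_like(w) for w in weights]
        grad_b = [np.zeros_like(b) for b in biases]
        for x, y in batch:
            nabla_w, nabla_b = backprop(weights, biases, x, y)
            grad_w = [gw + nw for gw, nw in zip(grad_w, nabla_w)]
            grad_b = [gb + nb for gb, nb in zip(grad_b, nabla_b)]
        # Step proportional to the negative gradient (slide 8).
        weights = [w - (eta / len(batch)) * gw for w, gw in zip(weights, grad_w)]
        biases = [b - (eta / len(batch)) * gb for b, gb in zip(biases, grad_b)]
        return weights, biases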

SLIDE 10

Equation 1

▪ Definition of error:
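The formulas on the slide were images; in Nielsen's notation, the error of neuron j in layer l is defined as

  \delta^l_j \equiv \frac{\partial C}{\partial z^l_j}

and the first fundamental equation gives the error in the output layer L:

  \delta^L = \nabla_a C \odot \sigma'(z^L) \qquad \text{(BP1)}

where \odot denotes the elementwise (Hadamard) product.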

SLIDE 11

Equation 2

▪ Key difference
  | Transpose of the weight matrix
▪ Pushes the error backwards
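In Nielsen's notation, the second fundamental equation expresses the error of layer l in terms of the error of layer l+1:

  \delta^l = \big( (w^{l+1})^T \delta^{l+1} \big) \odot \sigma'(z^l) \qquad \text{(BP2)}

Applying the transposed weight matrix is what moves the error one layer back through the network.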

SLIDE 12

Equation 3

▪ Note that the previous equations computed the error; this equation relates it to the cost gradient
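The third fundamental equation states that the gradient of the cost with respect to any bias equals the error of that neuron:

  \frac{\partial C}{\partial b^l_j} = \delta^l_j \qquad \text{(BP3)}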

SLIDE 13

Equation 4

▪ Describes the rate at which weights learn
▪ General insights
  | Learning is slow when:
  | the input activation approaches 0
  | the output activation approaches 0 or 1 (from the derivative of the sigmoid)
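The fourth fundamental equation gives the gradient of the cost with respect to any weight:

  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \, \delta^l_j \qquad \text{(BP4)}

The factor a^{l-1}_k explains the slow learning for small input activations, and the \sigma'(z^l_j) inside \delta^l_j vanishes when the sigmoid output saturates near 0 or 1.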

SLIDE 14

Proof – Equation 1

▪ Steps (sketched below)
  • 1. Definition of error
  • 2. Chain rule
  • 3. k = j
  • 4. BP1 (components)
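The slide's algebra was an image; a reconstruction of the four steps in Nielsen's notation:

  \delta^L_j = \frac{\partial C}{\partial z^L_j}                                              (1. definition of error)
             = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j} (2. chain rule)
             = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}        (3. only the k = j term survives)
             = \frac{\partial C}{\partial a^L_j} \, \sigma'(z^L_j)                            (4. BP1 in components)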

SLIDE 15

Proof – Equation 2

▪ Steps (sketched below)
  • 1. Definition of error
  • 2. Chain rule
  • 3. Substitute definition of error
  • 4. Derivative of weighted input vector
  • 5. BP2 (components)

▪ Recall: z^{l+1}_k = \sum_j w^{l+1}_{kj} \, \sigma(z^l_j) + b^{l+1}_k
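A reconstruction of the five steps in Nielsen's notation, using the recalled expression for z^{l+1}_k:

  \delta^l_j = \frac{\partial C}{\partial z^l_j}                                                     (1. definition of error)
             = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} (2. chain rule)
             = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}                        (3. substitute definition of error)
             = \sum_k \delta^{l+1}_k \, w^{l+1}_{kj} \, \sigma'(z^l_j)                                (4. differentiate z^{l+1}_k)

Step 5 recognizes this as the component form of BP2, \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l).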

SLIDE 16

Example – Thesis Related Work

SLIDE 17

References

▪ Michael A. Nielsen, "Neural Networks and Deep Learning", Determination Press, 2015.
▪ Antoine Bordes et al., "Translating Embeddings for Modeling Multi-Relational Data", NIPS'13, 2013.
 

