
SLIDE 1

Deep Learning for Classification

CS293S, Yang, 2017

SLIDE 2

Computational graph for classification

[Diagram: a single linear unit. The features f1, f2, f3 are multiplied by weights w1, w2, w3, summed ($\Sigma$), and thresholded (>0?) to produce the predicted label.]

  • Objective: classification accuracy

$$l_{\text{acc}}(w) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}\left[\operatorname{sign}\left(w^\top f(x^{(i)})\right) = y^{(i)}\right]$$

– Issue: how do we find these parameters?
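
As a concrete illustration (not from the slides), the objective in NumPy; the array names are my own:

    import numpy as np

    def l_acc(w, X, y):
        # X: (m, d) matrix whose rows are the feature vectors f(x^(i));
        # y: (m,) labels in {-1, +1}; w: (d,) weight vector.
        preds = np.sign(X @ w)          # sign(w^T f(x^(i)))
        return np.mean(preds == y)      # fraction of correct predictions

Note that l_acc is piecewise constant in w, so its gradient is zero almost everywhere; this is why the next slide replaces it with a differentiable soft-max objective.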

SLIDE 3

Neural Net with Soft-Max

  • Score for y = 1: $w^\top f(x)$. Score for y = -1: $-w^\top f(x)$
  • Probability of label:

$$p(y = 1 \mid f(x); w) = \frac{e^{w^\top f(x)}}{e^{w^\top f(x)} + e^{-w^\top f(x)}} \qquad p(y = -1 \mid f(x); w) = \frac{e^{-w^\top f(x)}}{e^{w^\top f(x)} + e^{-w^\top f(x)}}$$

  • Objective:

$$l(w) = \prod_{i=1}^{m} p(y = y^{(i)} \mid f(x^{(i)}); w)$$

  • Log:

$$ll(w) = \sum_{i=1}^{m} \log p(y = y^{(i)} \mid f(x^{(i)}); w)$$
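
A minimal NumPy sketch of these formulas (illustrative only; the function and argument names are my own):

    import numpy as np

    def p_label(w, f_x, label):
        # p(y = label | f(x); w) for label in {-1, +1}
        s = w @ f_x                                   # score w^T f(x)
        return np.exp(label * s) / (np.exp(s) + np.exp(-s))

    def log_likelihood(w, X, y):
        # ll(w) = sum_i log p(y = y^(i) | f(x^(i)); w)
        return sum(np.log(p_label(w, f_x, y_i)) for f_x, y_i in zip(X, y))

Maximizing ll(w) is equivalent to maximizing l(w), since log is monotone, and the sum is far better behaved numerically than the product.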

SLIDE 4

Two-Layer Neural Network

[Diagram: three hidden units, each computing a weighted sum ($\Sigma$) of the features f1, f2, f3 with its own weights $w_1^{(k)}, w_2^{(k)}, w_3^{(k)}$ for k = 1, 2, 3, followed by a nonlinearity (>0?); their outputs feed a final unit with weights w1, w2, w3.]

Hidden-unit nonlinearity:

$$z \rightarrow \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
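
A sketch of the forward pass this diagram describes, with tanh as the hidden nonlinearity; the shapes and names are assumptions for illustration:

    import numpy as np

    def two_layer_score(f, W_hidden, w_out):
        # f: (3,) features; W_hidden: (3, 3), row k holding w_1^(k)..w_3^(k);
        # w_out: (3,) weights of the final unit.
        h = np.tanh(W_hidden @ f)   # hidden unit k: tanh(sum_j w_j^(k) f_j)
        return w_out @ h            # final score: sum_k w_k h_k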

SLIDE 5

N-Layer Neural Network

[Diagram: the same construction stacked N layers deep (… … …): every unit computes a weighted sum ($\Sigma$) of the previous layer's outputs followed by the nonlinearity (>0?), layer after layer, down to a final output unit.]
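
Generalizing the two-layer sketch above, the N-layer forward pass is just a loop; `weights` is a hypothetical list of per-layer matrices:

    import numpy as np

    def n_layer_forward(f, weights):
        # weights: one matrix per layer; each layer maps the previous
        # layer's activations through a weighted sum and a nonlinearity.
        h = f
        for W in weights:
            h = np.tanh(W @ h)
        return h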

SLIDE 6

Convolutional Network (AlexNet)

[Figure: the AlexNet pipeline, from input image through the network weights to the loss.]

(Slide credit: Fei-Fei Li & Andrej Karpathy & Justin Johnson, Lecture 5, 20 Jan 2016.)

SLIDE 7

Activation Functions

  • Sigmoid
  • tanh: tanh(x)
  • ReLU: max(0, x)
  • Leaky ReLU: max(0.1x, x)
  • Maxout
  • ELU
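
The listed activations as NumPy one-liners (a sketch; the sigmoid and ELU formulas are the standard definitions rather than ones printed on the slide, and Maxout is omitted because it combines several linear inputs instead of transforming a single value):

    import numpy as np

    sigmoid    = lambda x: 1.0 / (1.0 + np.exp(-x))
    tanh       = np.tanh
    relu       = lambda x: np.maximum(0.0, x)
    leaky_relu = lambda x: np.maximum(0.1 * x, x)
    elu        = lambda x, alpha=1.0: np.where(x > 0, x, alpha * (np.exp(x) - 1))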

SLIDE 8

Multi-class Softmax

  • 3-class softmax
    – classes A, B, C
    – 3 weight vectors: $w_A, w_B, w_C$
  • Probability of label A (similar for B, C):

$$p(y = A \mid f(x); w) = \frac{e^{w_A^\top f(x)}}{e^{w_A^\top f(x)} + e^{w_B^\top f(x)} + e^{w_C^\top f(x)}}$$

  • Objective:

$$l(w) = \prod_{i=1}^{m} p(y = y^{(i)} \mid f(x^{(i)}); w)$$

  • Log:

$$ll(w) = \sum_{i=1}^{m} \log p(y = y^{(i)} \mid f(x^{(i)}); w)$$
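
The three-class probability in NumPy (illustrative; stacking the three weight vectors as rows of one matrix is my choice):

    import numpy as np

    def p_classes(f_x, W):
        # W: rows are w_A, w_B, w_C; returns [p(A), p(B), p(C)].
        scores = W @ f_x                        # w_A^T f(x), w_B^T f(x), w_C^T f(x)
        exp_s = np.exp(scores - scores.max())   # shift by max for numerical stability
        return exp_s / exp_s.sum()              # the shift cancels in the ratio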

SLIDE 9

Multi-class Two-Layer Neural Network

[Diagram: the two-layer network of Slide 4 with three output units instead of one. The shared hidden layer (weights $w_j^{(k)}$, nonlinearity >0? / tanh) feeds three output units with weights $w_1^A, w_2^A, w_3^A$, $w_1^B, w_2^B, w_3^B$, and $w_1^C, w_2^C, w_3^C$, producing a score for A, a score for B, and a score for C.]

Hidden-unit nonlinearity:

$$z \rightarrow \tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$
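
Putting the last two slides together, a sketch of the hidden layer plus three-way softmax (shapes and names are assumptions):

    import numpy as np

    def two_layer_class_probs(f, W_hidden, W_out):
        # W_hidden: (3, 3) hidden-layer weights; W_out: (3, 3) with rows
        # holding the output weights for classes A, B, C.
        h = np.tanh(W_hidden @ f)               # shared hidden layer
        scores = W_out @ h                      # score for A, B, C
        exp_s = np.exp(scores - scores.max())
        return exp_s / exp_s.sum()              # softmax over the three scores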

SLIDE 10

Gradient Descent Method for Optimization

  • How to find parameters that minimize an objective function?
  • Idea:

– Start somewhere
– Repeat: take a step in the steepest descent direction

[Figure: gradient-descent steps on the contours of an objective function. Figure source: Mathworks]

SLIDE 11

Generally, Steepest Direction

  • Steepest direction = direction of the gradient

$$\nabla g = \begin{bmatrix} \partial g / \partial w_1 \\ \partial g / \partial w_2 \\ \vdots \\ \partial g / \partial w_n \end{bmatrix}$$

§ Gradient Descent

  • Init:
  • For i = 1, 2, …

$$w \leftarrow w - \alpha \cdot \nabla g(w)$$
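
The update rule as code, on a toy objective (the quadratic, step size, and iteration count are illustrative choices, not from the slides):

    import numpy as np

    def gradient_descent(grad_g, w, alpha=0.1, steps=100):
        for _ in range(steps):
            w = w - alpha * grad_g(w)    # w <- w - alpha * grad g(w)
        return w

    # Example: g(w) = ||w||^2 has gradient 2w; the minimizer is the origin.
    w_star = gradient_descent(lambda w: 2 * w, np.array([3.0, -2.0]))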

SLIDE 12

What is the Steepest Descent Direction?

  • First-order Taylor expansion:

$$g(w + \Delta) \approx g(w) + \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$

  • Steepest descent direction:

$$\min_{\Delta:\, \Delta_1^2 + \Delta_2^2 \le \epsilon} g(w + \Delta) \;\rightarrow\; \min_{\Delta:\, \Delta_1^2 + \Delta_2^2 \le \epsilon} \frac{\partial g}{\partial w_1}\Delta_1 + \frac{\partial g}{\partial w_2}\Delta_2$$

  • Recall: $\nabla g = \begin{bmatrix} \partial g / \partial w_1 \\ \partial g / \partial w_2 \end{bmatrix}$, and $\min_{a:\, \|a\| \le \epsilon} a^\top b$ is attained at $a = -\frac{\epsilon}{\|b\|}\, b$

  • Hence, solution: $\Delta = -\frac{\epsilon}{\|\nabla g\|}\, \nabla g$
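
A quick numerical check of this solution (a toy, not from the slides): among many random steps of the same length, none improves on the linearized decrease achieved by the step along $-\nabla g$.

    import numpy as np

    w = np.array([1.0, 1.0])
    grad = np.array([2 * w[0], 6 * w[1]])       # gradient of g(w) = w_1^2 + 3 w_2^2
    eps = 1e-3
    steepest = -eps * grad / np.linalg.norm(grad)

    rng = np.random.default_rng(0)
    dirs = rng.normal(size=(1000, 2))
    dirs = eps * dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

    # first-order change g(w + d) - g(w) ~ grad . d for each candidate step
    assert grad @ steepest <= (dirs @ grad).min()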

SLIDE 13

How to Calculate a Partial Derivative in a Computational Graph

Given a function f(x, y, z) = (x + y)z, what is the partial derivative of f with respect to x, y, and z?

SLIDES 14-25

[Worked example (slide figures not recoverable): the graph for f(x, y, z) = (x + y)z is evaluated at x = -2, y = 5, z = -4 (the values come from a training example). Forward pass: the intermediate node q = x + y = 3, and f = qz = -12. Want: ∂f/∂x, ∂f/∂y, ∂f/∂z. Chain rule, walking backward through the graph: ∂f/∂z = q = 3, ∂f/∂q = z = -4, and since ∂q/∂x = ∂q/∂y = 1, we get ∂f/∂x = ∂f/∂y = -4.]
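
The whole walkthrough fits in a few lines of Python; this just reproduces the slide arithmetic (q is my name for the intermediate node):

    # forward pass
    x, y, z = -2, 5, -4
    q = x + y           # q = 3
    f = q * z           # f = -12

    # backward pass: chain rule from the output back to the inputs
    df_dz = q           # d(q*z)/dz = q          -> 3
    df_dq = z           # d(q*z)/dq = z          -> -4
    df_dx = df_dq * 1   # chain rule, dq/dx = 1  -> -4
    df_dy = df_dq * 1   # chain rule, dq/dy = 1  -> -4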

SLIDES 26-31

[Figure sequence (images not recoverable): a single gate f inside a larger graph. In the forward pass the gate computes its output from its input activations. In the backward pass it receives the gradient of the final output with respect to its own output, multiplies it by its "local gradient" (the derivative of its output with respect to each input), and passes the resulting gradients back to its inputs.]
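
A sketch of what any such gate implements, shown for a hypothetical multiply gate: the forward pass caches its activations, and the backward pass multiplies the incoming gradient by the local gradient for each input.

    class MultiplyGate:
        def forward(self, a, b):
            self.a, self.b = a, b       # cache activations for the backward pass
            return a * b

        def backward(self, grad_out):
            # local gradients: d(a*b)/da = b and d(a*b)/db = a,
            # each multiplied by the gradient flowing in from above
            return self.b * grad_out, self.a * grad_out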

SLIDES 32-45

[Worked example (images not recoverable): backpropagation through a larger graph, one gate at a time, each step computing [local gradient] x [its gradient]. Steps recoverable from the slides:

– (-1) * (-0.20) = 0.20
– [local gradient] x [its gradient]: [1] x [0.2] = 0.2 and [1] x [0.2] = 0.2 (both inputs: an add gate routes the same gradient to each of its inputs)
– [local gradient] x [its gradient]: x0: [2] x [0.2] = 0.4; w0: [-1] x [0.2] = -0.2]

SLIDES 46-47

[Figures: the chain of elementary gates computing the sigmoid function can be collapsed into a single "sigmoid gate". Its local gradient has the closed form dσ(x)/dx = σ(x)(1 - σ(x)); at the forward value 0.73 shown on the slide, this gives (0.73) * (1 - 0.73) ≈ 0.2.]
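
The slide images are not recoverable here, but the printed numbers (forward value 0.73; gradients 0.4 for x0 and -0.2 for w0) are consistent with the example f(w, x) = sigmoid(w0*x0 + w1*x1 + w2) at w0 = 2, x0 = -1, w1 = -3, x1 = -2, w2 = -3. Assuming those values, a sketch of the full pass through the single sigmoid gate:

    import numpy as np

    w0, x0, w1, x1, w2 = 2.0, -1.0, -3.0, -2.0, -3.0

    # forward pass
    s = w0 * x0 + w1 * x1 + w2        # 1.0
    f = 1.0 / (1.0 + np.exp(-s))      # sigmoid(1.0) ~= 0.73

    # backward pass: the sigmoid gate's local gradient is f * (1 - f)
    ds = f * (1 - f)                  # ~= 0.20
    dw0, dx0 = x0 * ds, w0 * ds       # ~= -0.2, 0.4
    dw1, dx1 = x1 * ds, w1 * ds       # ~= -0.4, -0.6
    dw2 = ds                          # ~= 0.2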

SLIDE 48

Gradients add at branches

[Diagram: a value feeding two downstream branches; during backpropagation, the gradients arriving from the branches are summed (+).]
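
A minimal illustration of the rule (hypothetical example): when x feeds two branches, its gradient is the sum of the contributions flowing back from each branch.

    # x feeds two branches: f = x*y + x*z
    x, y, z = 3.0, 4.0, 5.0
    df_dx = y + z                     # branch 1 contributes y, branch 2 contributes z

    # finite-difference check
    f = lambda x: x * y + x * z
    h = 1e-6
    assert abs((f(x + h) - f(x)) / h - df_dx) < 1e-4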

SLIDE 49

Summary

  • Deep learning

– New direction for text processing, given its success in image/audio processing
– Frameworks and software

  • TensorFlow (Google).
  • Others: Theano, Torch, Caffe, Computation Graph Toolkit (CGT)