SLIDE 1

Neural Networks 2/2


Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2019s/

Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Emily Fox (UW); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books)
SLIDE 2

Logistics

  • Project 1: Keep going!
  • Recitation tonight (Monday): hands-on intro to neural nets, with automatic differentiation
  • HW4 out on Wed, due in TWO WEEKS

SLIDE 3

Objectives Today: Neural Networks Unit 2/2

  • Review: learning feature representations
  • Feed-forward neural nets (MLPs)
  • Activation functions
  • Loss functions for multi-class classification
  • Training via gradient descent
  • Back-propagation = gradient descent + chain rule
  • Automatic differentiation

SLIDE 4

What will we learn?

[Diagram: the three paradigms: Supervised Learning, Unsupervised Learning, Reinforcement Learning. Supervised learning is defined by a task (data x, label y), a performance measure, and data–label pairs $\{x_n, y_n\}_{n=1}^N$, used in Training, Prediction, and Evaluation.]
SLIDE 5

Task: Binary Classification, an instance of Supervised Learning.

[Figure: scatter plot over features $x_1, x_2$; the label $y$ is a binary variable (red or blue).]
SLIDE 6

Feature, Label Pairs

Feature Transform Pipeline

Data–label pairs $\{x_n, y_n\}_{n=1}^N$ (with a task and a performance measure) pass through a feature transform $\phi(x)$, yielding the transformed pairs $\{\phi(x_n), y_n\}_{n=1}^N$ used for training.
SLIDE 7

Logistic Regr. Network Diagram

Credit: Emily Fox (UW) https://courses.cs.washington.edu/courses/cse416/18sp/slides/

[Diagram: logistic regression drawn as a one-layer network; the output is 0 or 1.]
SLIDE 8

Multi-class Classification


How to do this?

SLIDE 9

>>> yhat_N = model.predict(x_NF)
>>> yhat_N[:5]
[0, 0, 1, 0, 1]

Binary Prediction

Goal: Predict label (0 or 1) given features x

  • Input: feature vector $x_i \triangleq [x_{i1}, x_{i2}, \ldots, x_{if}, \ldots, x_{iF}]$ ("features", "covariates", "attributes"). Entries can be real-valued or other numeric types (e.g. integer, binary).
  • Output: binary label $y_i \in \{0, 1\}$ ("responses" or "labels").
SLIDE 10

>>> yproba_N2 = model.predict_proba(x_NF)
>>> yproba1_N = model.predict_proba(x_NF)[:, 1]
>>> yproba1_N[:5]
[0.143, 0.432, 0.523, 0.003, 0.994]

Binary Proba. Prediction

Goal: Predict probability of label given features

  • Input: feature vector $x_i \triangleq [x_{i1}, x_{i2}, \ldots, x_{if}, \ldots, x_{iF}]$ ("features", "covariates", "attributes"). Entries can be real-valued or other numeric types (e.g. integer, binary).
  • Output: "probability" $\hat{p}_i \triangleq p(Y_i = 1 \mid x_i)$, a value between 0 and 1, e.g. 0.001, 0.513, 0.987.
SLIDE 11

>>> yhat_N = model.predict(x_NF)
>>> yhat_N[:6]
[0, 3, 1, 0, 0, 2]

Multi-class Prediction

Goal: Predict one of C classes given features x

  • Input: feature vector $x_i \triangleq [x_{i1}, x_{i2}, \ldots, x_{if}, \ldots, x_{iF}]$ ("features", "covariates", "attributes"). Entries can be real-valued or other numeric types (e.g. integer, binary).
  • Output: integer label (0 or 1 or … or C−1): $y_i \in \{0, 1, 2, \ldots, C-1\}$ ("responses" or "labels").
SLIDE 12

>>> yproba_NC = model.predict_proba(x_NF)
>>> yproba_c_N = model.predict_proba(x_NF)[:, c]
>>> np.sum(yproba_NC, axis=1)
[1.0, 1.0, 1.0, 1.0]

Multi-class Proba. Prediction

Goal: Predict probability of each label given features

  • Input: feature vector $x_i \triangleq [x_{i1}, x_{i2}, \ldots, x_{if}, \ldots, x_{iF}]$ ("features", "covariates", "attributes"). Entries can be real-valued or other numeric types (e.g. integer, binary).
  • Output: "probability" vector of C non-negative values that sums to one: $\hat{p}_i \triangleq [p(Y_i = 0 \mid x_i) \;\; p(Y_i = 1 \mid x_i) \;\; \ldots \;\; p(Y_i = C-1 \mid x_i)]$
SLIDE 13

From Real Value to Probability

$\text{sigmoid}(z) = \dfrac{1}{1 + e^{-z}}$

maps any real-valued score $z$ to a probability between 0 and 1.
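In code (a minimal sketch):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps any real z into (0, 1)

print(sigmoid(np.array([-2.0, 0.0, 3.0])))   # approx [0.119, 0.5, 0.953]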

SLIDE 14

From Vector of Reals to Vector of Probabilities

$z_i = [z_{i1} \; z_{i2} \; \ldots \; z_{ic} \; \ldots \; z_{iC}]$

$\hat{p}_i = \left[ \dfrac{e^{z_{i1}}}{\sum_{c=1}^{C} e^{z_{ic}}} \;\; \dfrac{e^{z_{i2}}}{\sum_{c=1}^{C} e^{z_{ic}}} \;\; \ldots \;\; \dfrac{e^{z_{iC}}}{\sum_{c=1}^{C} e^{z_{ic}}} \right]$

This map is called the "softmax" function.
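In code (a sketch; subtracting the max before exponentiating is a standard overflow guard and does not change the result):

import numpy as np

def softmax(z_C):
    e_C = np.exp(z_C - np.max(z_C))   # shift for numerical stability
    return e_C / np.sum(e_C)

p_C = softmax(np.array([2.0, 1.0, 0.1]))
print(p_C, np.sum(p_C))               # non-negative entries summing to 1.0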

SLIDE 15

Representing multi-class labels

Encode as a length-C one-hot binary vector

Examples (assume C=4 labels):

class 0: [1 0 0 0]
class 1: [0 1 0 0]
class 2: [0 0 1 0]
class 3: [0 0 0 1]

$y_n \in \{0, 1, 2, \ldots, C-1\} \;\longmapsto\; \bar{y}_n = [\bar{y}_{n1} \; \bar{y}_{n2} \; \ldots \; \bar{y}_{nc} \; \ldots \; \bar{y}_{nC}]$
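A quick sketch of the encoding:

import numpy as np

y_N = np.array([0, 3, 1, 0])     # integer labels, with C = 4
onehot_NC = np.eye(4)[y_N]       # row y_n of the CxC identity is one-hot
print(onehot_NC[1])              # class 3 -> [0. 0. 0. 1.]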

SLIDE 16

“Neuron” for Binary Prediction


Linear function with weights w, followed by a logistic sigmoid activation function

Credit: Emily Fox (UW)

Probability of class 1
SLIDE 17

Neurons for Multi-class Prediction

  • Can you draw it?

SLIDE 18

Recall: Binary log loss

$\text{log loss}(y, \hat{p}) = -y \log \hat{p} - (1 - y) \log(1 - \hat{p})$

$\text{error}(y, \hat{y}) = \begin{cases} 1 & \text{if } y \neq \hat{y} \\ 0 & \text{if } y = \hat{y} \end{cases}$

Plot assumes:

  • True label is 1
  • Threshold is 0.5
  • Log base 2
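In code (a sketch; natural log here, so values differ from the base-2 plot only by a constant factor):

import numpy as np

def binary_log_loss(y, p):
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

print(binary_log_loss(1, 0.9))   # confident and correct: ~0.105
print(binary_log_loss(1, 0.1))   # confident and wrong: ~2.303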
SLIDE 19

Multi-class log loss

$\text{log loss}(\bar{y}_n, \hat{p}_n) = -\sum_{c=1}^{C} \bar{y}_{nc} \log \hat{p}_{nc}$

Input: two vectors of length C
Output: a scalar value (non-negative)

Justifications carry over from the binary case:

  • Interpret as upper bound on the error rate
  • Interpret as cross entropy of multi-class discrete random variable
  • Interpret as log likelihood of multi-class discrete random variable
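In code, the formula above is a one-liner (a sketch):

import numpy as np

def multiclass_log_loss(ybar_C, p_C):
    return -np.sum(ybar_C * np.log(p_C))

ybar_C = np.array([0, 0, 1, 0])           # one-hot: true class is 2
p_C = np.array([0.1, 0.2, 0.6, 0.1])      # predicted probabilities
print(multiclass_log_loss(ybar_C, p_C))   # -log(0.6), about 0.511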
SLIDE 20

MLP: Multi-Layer Perceptron. One or more hidden layers, followed by one output layer.

20

Mike Hughes - Tufts COMP 135 - Spring 2019

slide-21
SLIDE 21

Linear decision boundaries cannot solve XOR


X_1   X_2   y
0     0     0
0     1     1
1     0     1
1     1     0
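A quick check (a sketch, assuming scikit-learn is installed): a linear model cannot separate XOR, but an MLP with one hidden layer can.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

x_N2 = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_N = np.array([0, 1, 1, 0])     # XOR labels

linear = LogisticRegression().fit(x_N2, y_N)
print(linear.score(x_N2, y_N))   # any linear boundary misses at least one point

mlp = MLPClassifier(hidden_layer_sizes=(8,), activation='tanh',
                    solver='lbfgs', random_state=0).fit(x_N2, y_N)
print(mlp.score(x_N2, y_N))      # typically 1.0: hidden units bend the boundary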

SLIDE 22

Diagram of an MLP

[Diagram: input data x → f1(x, w1) → f2(·, w2) → f3(·, w3) → output]
SLIDE 23

Each Layer Extracts “Higher Level” Features

SLIDE 24

MLPs with 1 hidden layer can approximate any continuous function, given enough hidden units (the universal approximation theorem)!

SLIDE 25

Which Activation Function?


Linear function with weights w, followed by a non-linear activation function

Credit: Emily Fox (UW)

SLIDE 26

Activation Functions
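The slide's figure is not reproduced in this text; as a sketch, two common choices besides the logistic sigmoid defined earlier:

import numpy as np

def tanh(z):
    return np.tanh(z)             # squashes to (-1, 1), zero-centered

def relu(z):
    return np.maximum(0.0, z)     # zero for z < 0, identity for z >= 0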

SLIDE 27

How to train Neural Nets? Just like logistic regression:

  • Set up a loss function
  • Apply Gradient Descent!

SLIDE 28

Review: LR notation

  • Feature vector with first entry constant
  • Weight vector (first entry is the “bias”)
  • “Score” value z (real number, -inf to +inf)

$w = [w_0 \; w_1 \; w_2 \; \ldots \; w_F], \qquad z_n(w) = w \cdot x_n = \sum_{f=0}^{F} w_f x_{nf}$

(since the feature vector has constant first entry $x_{n0} = 1$, the weight $w_0$ acts as the bias)
SLIDE 29

Review: Gradient of LR

Log likelihood:

$J(z_n(w)) = y_n z_n - \log(1 + e^{z_n})$

Gradient w.r.t. weight on feature f:

$\dfrac{d}{dw_f} J(z_n(w)) = \dfrac{d}{dz_n} J(z_n) \cdot \dfrac{d}{dw_f} z_n(w)$

chain rule!
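A worked sketch of that chain rule (names are illustrative), with a finite-difference sanity check:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_J_wrt_w(w_F, x_F, y):
    z = np.dot(w_F, x_F)          # score: inner product of weights and features
    dJ_dz = y - sigmoid(z)        # d/dz of [y*z - log(1 + e^z)]
    dz_dw = x_F                   # d/dw_f of (w . x) is x_f
    return dJ_dz * dz_dw          # chain rule: multiply the two factors

w = np.array([0.5, -1.0, 2.0]); x = np.array([1.0, 0.3, -0.7]); y = 1.0
J = lambda w: y * np.dot(w, x) - np.log(1 + np.exp(np.dot(w, x)))
eps = 1e-6
approx = np.array([(J(w + eps*e) - J(w - eps*e)) / (2*eps) for e in np.eye(3)])
print(np.allclose(grad_J_wrt_w(w, x, y), approx))  # True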

SLIDE 30

MLP: Composable Functions

[Diagram: input data x → f1(x, w1) → f2(·, w2) → f3(·, w3)]
SLIDE 31

Output as function of x

[Diagram: input data x → f1(x, w1) → f2(·, w2) → f3(·, w3) → output]

$\text{output}(x) = f_3(f_2(f_1(x, w_1), w_2), w_3)$
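As a sketch (the layer sizes and the tanh non-linearity are illustrative), the composition is just a loop over layers:

import numpy as np

def layer(h, W, b):
    return np.tanh(W @ h + b)       # one layer: linear map + non-linearity

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 1]                # input dim, two hidden widths, output dim
params = [(rng.normal(size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

h = rng.normal(size=4)              # input data x
for W, b in params:                 # apply f1, then f2, then f3
    h = layer(h, W, b)
print(h)                            # f3(f2(f1(x, w1), w2), w3)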

SLIDE 32

Minimizing loss for composable functions

$\min_{w_1, w_2, w_3} \; \sum_{n=1}^{N} \text{loss}\big(y_n,\; f_3(f_2(f_1(x_n, w_1), w_2), w_3)\big)$

Loss can be:

  • Squared error for regression problems
  • Log loss for multi-way classification problems
  • … many others possible!
SLIDE 33

Compute loss via Forward Propagation

[Diagram: two-layer network with parameters w(1), b(1), w(2), b(2).]
SLIDE 34

Compute loss via Forward Propagation

[Diagram: same network; forward pass, Step 2.]
SLIDE 35

Compute loss via Forward Propagation

[Diagram: same network; forward pass, Steps 2 and 3.]
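Putting these steps into code — a minimal sketch, assuming a tanh hidden layer and a squared-error loss (illustrative choices; the slides' exact activation and loss aren't reproduced in this text):

import numpy as np

def forward(x, w1, b1, w2, b2, y):
    a1 = w1 @ x + b1                       # hidden-layer linear scores
    h1 = np.tanh(a1)                       # hidden-layer activations
    yhat = w2 @ h1 + b2                    # output-layer prediction
    return 0.5 * np.sum((y - yhat) ** 2)   # loss comparing prediction to label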

SLIDE 36

Compute gradient via Back Propagation

[Diagram: gradients flow backward through the same network, one d(loss)/d term for each of w(1), b(1), w(2), b(2).]

Visual Demo: https://google-developers.appspot.com/machine-learning/crash-course/backprop-scroll/
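The matching backward pass, a sketch paired with forward() above: apply the chain rule from the loss back to every parameter (here 'd_' means d(loss)/d(...)).

import numpy as np

def backward(x, w1, b1, w2, b2, y):
    a1 = w1 @ x + b1                   # forward pass, caching intermediates
    h1 = np.tanh(a1)
    yhat = w2 @ h1 + b2
    d_yhat = yhat - y                  # from 0.5 * sum((y - yhat)^2)
    d_w2 = np.outer(d_yhat, h1)        # gradient for output weights
    d_b2 = d_yhat                      # gradient for output bias
    d_h1 = w2.T @ d_yhat               # chain rule back through output layer
    d_a1 = d_h1 * (1.0 - h1 ** 2)      # tanh'(a) = 1 - tanh(a)^2
    d_w1 = np.outer(d_a1, x)           # gradient for hidden weights
    d_b1 = d_a1                        # gradient for hidden bias
    return d_w1, d_b1, d_w2, d_b2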

SLIDE 37

Network view of Backprop

SLIDE 38

Is symbolic differentiation the only way to compute derivatives?

Credit: Justin Domke (UMass) https://people.cs.umass.edu/~domke/courses/sml/09autodiff_nnets.pdf

SLIDE 39

Another view: Computation graph

SLIDE 40

Forward Propagation

SLIDE 41

Back propagation

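Deriving backward() by hand is exactly the bookkeeping automatic differentiation does for us: record the computation graph during the forward pass, then replay it in reverse. A sketch using the autograd package (one option; JAX or PyTorch work similarly):

import autograd.numpy as np        # drop-in NumPy wrapper that records ops
from autograd import grad

def loss(w1, b1, w2, b2, x, y):    # same two-layer network as above
    h1 = np.tanh(np.dot(w1, x) + b1)
    yhat = np.dot(w2, h1) + b2
    return 0.5 * np.sum((y - yhat) ** 2)

grad_w1 = grad(loss, 0)            # function returning d(loss)/d(w1)
grad_b1 = grad(loss, 1)            # ... and likewise for the other parameters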