Machine Learning 2 (DS 4420, Spring 2020): Neural Networks & backprop


SLIDE 1

Machine Learning 2

DS 4420 - Spring 2020

Neural Networks & backprop

Byron C Wallace

SLIDE 2

Neural Networks!

• In 2020, neural networks are the dominant technology in machine learning (for better or worse)!

• Today, we’ll go over some of the fundamentals of NNs and modern libraries (we saw a preview last week, with auto-diff)!

• This will also serve as a refresher on gradient descent
SLIDE 5

Gradient Descent in Linear Models

Last time we thought in probabilistic terms and discussed maximum likelihood estimation for “generative” models. Today we’ll take the view of learning as search/optimization. We’ll start with linear models, review gradient descent, and then talk about neural nets + backprop.


SLIDE 8

Loss

The simplest loss is probably 0/1 loss: 0 if we’re correct, 1 if we’re wrong.

What’s an algorithm that minimizes this?

SLIDE 9

The Perceptron!

SLIDE 10

Consider a simple linear model with parameters w:

$$\hat{y} = \begin{cases} +1 & \text{if } w \cdot x > 0 \\ -1 & \text{otherwise} \end{cases}$$

Training data: $\{(x_n, y_n)\}$

(assumes bias term moved into x or omitted)

The learning problem is to estimate w. What is our criterion for a good w? Minimal loss.

SLIDE 14

Perceptron!

Algorithm 5 PerceptronTrain(D, MaxIter)

  w_d ← 0, for all d = 1 … D                    // initialize weights
  b ← 0                                          // initialize bias
  for iter = 1 … MaxIter do
    for all (x, y) ∈ D do
      a ← ∑_{d=1}^{D} w_d x_d + b                // compute activation for this example
      if y a ≤ 0 then
        w_d ← w_d + y x_d, for all d = 1 … D     // update weights
        b ← b + y                                // update bias
      end if
    end for
  end for
  return w_0, w_1, …, w_D, b

Fig and Alg from CIML [Daume]
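As a sketch of what this looks like in code, here is a minimal NumPy version (the function name and defaults are mine; labels are assumed to be ±1):

```python
import numpy as np

def perceptron_train(X, y, max_iter=100):
    """Train a perceptron on examples X (n x d) with labels y in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)                      # initialize weights
    b = 0.0                              # initialize bias
    for _ in range(max_iter):
        for x_i, y_i in zip(X, y):
            a = np.dot(w, x_i) + b       # compute activation for this example
            if y_i * a <= 0:             # mistake (or on the boundary)
                w += y_i * x_i           # update weights
                b += y_i                 # update bias
    return w, b
```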

SLIDE 16

Problems with 0/1 loss

• If we’re wrong by .0001 it is “as bad” as being wrong by .9999

• Because it is discrete, optimization is hard if the instances are not linearly separable

SLIDE 17

Smooth loss

Idea: introduce a “smooth” loss function to make optimization easier. Example: hinge loss.

[Hand-drawn figure: 0/1 loss and the hinge loss $\max\{0, 1 - y(w \cdot x)\}$ plotted against the signed margin (the raw output times y); loss is incurred on the “wrong” side and vanishes on the “correct” side]

SLIDE 18

Losses

Zero/one: $\ell^{(0/1)}(y, \hat{y}) = \mathbf{1}[y \hat{y} \leq 0]$

Hinge: $\ell^{(\mathrm{hin})}(y, \hat{y}) = \max\{0,\, 1 - y \hat{y}\}$

Logistic: $\ell^{(\mathrm{log})}(y, \hat{y}) = \frac{1}{\log 2} \log\left(1 + \exp[-y \hat{y}]\right)$

Exponential: $\ell^{(\mathrm{exp})}(y, \hat{y}) = \exp[-y \hat{y}]$

Squared: $\ell^{(\mathrm{sqr})}(y, \hat{y}) = (y - \hat{y})^2$

Fig and Eq’s from CIML [Daume]
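These are one-liners in code; a NumPy sketch (function names are mine):

```python
import numpy as np

def zero_one_loss(y, y_hat):
    return float(y * y_hat <= 0)

def hinge_loss(y, y_hat):
    return max(0.0, 1.0 - y * y_hat)

def logistic_loss(y, y_hat):
    # logaddexp(0, z) = log(1 + exp(z)), computed stably
    return np.logaddexp(0.0, -y * y_hat) / np.log(2)

def exponential_loss(y, y_hat):
    return np.exp(-y * y_hat)

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2
```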

SLIDE 19

Regularization

$$\min_{w,b} \; \sum_n \ell(y_n,\ w \cdot x_n + b) + \lambda R(w, b)$$

Prevent w from “getting too crazy”

SLIDE 21

Gradient descent

[Figure: illustration of gradient descent. By Gradient_descent.png: the original uploader was Olegalexandrov at English Wikipedia; derivative work: Zerodamage. Public Domain, https://commons.wikimedia.org/w/index.php?curid=20569355]

SLIDE 22

Algorithm 21 GradientDescent(F, K, η_1, …)

  z^(0) ← ⟨0, 0, …, 0⟩                           // initialize variable we are optimizing
  for k = 1 … K do
    g^(k) ← ∇_z F |_{z^(k−1)}                    // compute gradient at current location
    z^(k) ← z^(k−1) − η^(k) g^(k)                // take a step down the gradient
  end for
  return z^(K)

Alg from CIML [Daume]
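A minimal NumPy sketch of this loop, assuming the gradient of F is supplied as a function grad_F and the step size is held fixed (both assumptions are mine):

```python
import numpy as np

def gradient_descent(grad_F, dim, K, eta=0.1):
    """Minimize F by taking K steps down its gradient."""
    z = np.zeros(dim)                    # initialize variable we are optimizing
    for _ in range(K):
        g = grad_F(z)                    # compute gradient at current location
        z = z - eta * g                  # take a step down the gradient
    return z

# Example: F(z) = ||z||^2 has gradient 2z; the minimizer is the origin.
z_star = gradient_descent(lambda z: 2 * z, dim=2, K=100)
```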

SLIDE 23

Gradient of the exponential loss with L2 regularization:

$$\begin{aligned}
\nabla_w \mathcal{L} &= \nabla_w \sum_n \exp\big[{-y_n}(w \cdot x_n + b)\big] + \nabla_w \frac{\lambda}{2} \|w\|^2 \\
&= \sum_n \Big(\nabla_w\, {-y_n}(w \cdot x_n + b)\Big) \exp\big[{-y_n}(w \cdot x_n + b)\big] + \lambda w \\
&= -\sum_n y_n x_n \exp\big[{-y_n}(w \cdot x_n + b)\big] + \lambda w
\end{aligned}$$
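The final expression is straightforward to evaluate with matrix operations; a NumPy sketch (names are mine):

```python
import numpy as np

def exp_loss_grad(w, b, X, y, lam):
    """Gradient w.r.t. w of: sum_n exp[-y_n (w . x_n + b)] + (lam / 2) ||w||^2."""
    margins = y * (X @ w + b)            # y_n (w . x_n + b), one entry per example
    coeffs = np.exp(-margins)            # exp[-y_n (w . x_n + b)]
    return -(y * coeffs) @ X + lam * w   # -sum_n y_n x_n exp[...] + lam * w
```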

SLIDE 26

Limitations of linear models

SLIDE 27

Neural networks

[Hand-drawn figure: inputs x feeding through weights W1 into hidden units h, which feed through weights W2 to the output y]

Idea: basically, stack together a bunch of linear models, with (non-linear) activation functions in between. This introduces hidden units, which are neither observations (x) nor outputs (y).

The challenge: how do we update the weights associated with each node in this multi-layer regime?
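Written out for one hidden layer (notation is mine; biases omitted, with tanh standing in for whatever activation is used):

$$h = \tanh\!\left(W^{(1)} x\right), \qquad \hat{y} = w^{(2)} \cdot h$$

Without the non-linearity, $w^{(2)} \cdot (W^{(1)} x)$ would collapse back into a single linear model, which is why the activation functions matter.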

SLIDE 30

back-propagation = gradient descent + chain rule
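In symbols: for an edge weight $w_{u,v}$ feeding the activation $a_v = \sum_{u \in \mathrm{par}(v)} w_{u,v} h_u$ (as in the algorithms below), the chain rule factors the loss gradient as

$$\frac{\partial \mathcal{L}}{\partial w_{u,v}} = \frac{\partial \mathcal{L}}{\partial a_v} \cdot \frac{\partial a_v}{\partial w_{u,v}} = \frac{\partial \mathcal{L}}{\partial a_v}\, h_u$$

so each weight’s gradient is a “local” derivative times an error signal passed back from above.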

SLIDE 31

Algorithm 27 ForwardPropagation(x)

  for all input nodes u do
    h_u ← corresponding feature of x
  end for
  for all nodes v in the network whose parents are computed do
    a_v ← ∑_{u ∈ par(v)} w_{(u,v)} h_u
    h_v ← tanh(a_v)
  end for
  return a_y

Tanh is another common activation function
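For a layered network this reduces to a couple of matrix products; a NumPy sketch for one hidden layer (the weight names W1 and w2 are mine):

```python
import numpy as np

def forward(x, W1, w2):
    """One-hidden-layer forward pass with tanh hidden units."""
    a = W1 @ x                           # hidden activations: a_v = sum_u w_(u,v) h_u
    h = np.tanh(a)                       # hidden unit values: h_v = tanh(a_v)
    return w2 @ h                        # output activation a_y
```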

SLIDE 32

Algorithm 28 BackPropagation(x, y)

  run ForwardPropagation(x) to compute activations
  e_y ← y − a_y                                  // compute overall network error
  for all nodes v in the network whose error e_v is computed do
    for all u ∈ par(v) do
      g_{u,v} ← −e_v h_u                         // compute gradient of this edge
      e_u ← e_u + e_v w_{u,v} (1 − tanh²(a_u))   // compute the “error” of the parent node
    end for
  end for
  return all gradients g_e
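Putting Algorithms 27 and 28 together for a one-hidden-layer network: a sketch, assuming squared error $\frac{1}{2}(y - a_y)^2$ and my own weight names W1 and w2:

```python
import numpy as np

def backprop(x, y, W1, w2):
    """Gradients of 0.5 * (y - a_y)^2 for a one-hidden-layer tanh network."""
    # Forward propagation (Algorithm 27)
    a = W1 @ x                           # hidden activations
    h = np.tanh(a)                       # hidden unit values
    a_y = w2 @ h                         # output activation
    # Back-propagation (Algorithm 28)
    e_y = y - a_y                        # overall network error
    g_w2 = -e_y * h                      # gradient of each output edge: -e_y h_u
    e_h = e_y * w2 * (1 - h ** 2)        # "errors" of the hidden nodes
    g_W1 = -np.outer(e_h, x)             # gradient of each input->hidden edge
    return g_W1, g_w2
```

Stepping the weights down these gradients with the GradientDescent loop from earlier is exactly what ties this back to the gradient descent picture.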

SLIDE 33

What are we doing with these gradients again?

SLIDE 34

Gradient descent

[Figure: illustration of gradient descent. By Gradient_descent.png: the original uploader was Olegalexandrov at English Wikipedia; derivative work: Zerodamage. Public Domain, https://commons.wikimedia.org/w/index.php?curid=20569355]

SLIDE 35

Neural Networks!

If you’re interested in learning more…