

SLIDE 1

CS 6316 Machine Learning

Neural Networks

Yangfeng Ji

Department of Computer Science University of Virginia

SLIDE 2

Overview

1. From Logistic Regression to Neural Networks
2. Expressive Power of Neural Networks
3. Learning Neural Networks
4. Computation Graph

SLIDE 3

From Logistic Regression to Neural Networks


SLIDE 5

Logistic Regression

◮ A unified form for y ∈ {−1, +1}:

p(Y = +1 | x) = 1 / (1 + exp(−⟨w, x⟩))  (1)

◮ The sigmoid function σ(a) with a ∈ ℝ:

σ(a) = 1 / (1 + exp(−a))  (2)
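Equations (1)–(2) translate directly into a few lines of Python; a minimal sketch (the function names are my own):

```python
import math

def sigmoid(a):
    """Sigmoid function, Equation (2): sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def lr_prob(w, x):
    """Logistic regression probability, Equation (1): p(Y = +1 | x)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
```

Note that sigmoid(0) = 0.5 and the function increases monotonically toward 1, which is why ⟨w, x⟩ = 0 marks the decision boundary.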

SLIDE 6

Graphical Representation

◮ A specific example of LR:

p(Y = 1 | x) = σ(∑_{j=1}^{4} w_j x_{·,j})  (3)

◮ The graphical representation of this LR model:

[Figure: input layer x₁, x₂, x₃, x₄ fully connected to output layer y]

SLIDE 7

Capacity of a LR

Logistic regression gives a linear decision boundary.

[Figure: a linear decision boundary in the (x₁, x₂) plane]


SLIDE 9

From LR to Neural Networks

Building upon logistic regression, a simple neural network can be constructed as

z_k = σ(∑_{j=1}^{d} w^{(1)}_{k,j} x_{·,j}),  k ∈ [K]  (4)

P(y = 1 | x) = σ(∑_{k=1}^{K} w^{(o)}_k z_k)  (5)

◮ x ∈ ℝ^d: d-dimensional input
◮ y ∈ {−1, +1} (binary classification problem)
◮ {w^{(1)}_{k,j}} and {w^{(o)}_k} are the two sets of parameters
◮ K is the number of hidden units, each of which has the same form as a LR.
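Equations (4)–(5) can be sketched in pure Python (names are mine: `W1` is a list of K weight rows, `w_o` the output weights):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def two_layer_forward(W1, w_o, x):
    """P(y = 1 | x) per Equations (4)-(5): a hidden layer of K
    logistic-regression-like units followed by one output unit."""
    # Equation (4): z_k = sigma(sum_j w1[k][j] * x[j]) for each hidden unit k
    z = [sigmoid(sum(wkj * xj for wkj, xj in zip(row, x))) for row in W1]
    # Equation (5): P(y = 1 | x) = sigma(sum_k w_o[k] * z[k])
    return sigmoid(sum(wk * zk for wk, zk in zip(w_o, z)))
```

With symmetric weights that cancel in the output layer, e.g. `two_layer_forward([[1, 0], [0, 1]], [0.5, -0.5], [2, 2])`, the output is exactly σ(0) = 0.5, which is an easy sanity check.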


SLIDE 11

Mathematical Formulation

◮ Element-wise formulation:

z_k = σ(∑_{j=1}^{d} w^{(1)}_{k,j} x_{·,j}),  k ∈ [K]  (6)

P(y = +1 | x) = σ(∑_{k=1}^{K} w^{(o)}_k z_k)  (7)

◮ Matrix-vector formulation:

z = σ(W^{(1)} x)  (8)

P(y = +1 | x) = σ((w^{(o)})^T z)  (9)

where W^{(1)} ∈ ℝ^{K×d} and w^{(o)} ∈ ℝ^K
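The matrix-vector form (8)–(9) is how the network would typically be written with NumPy; a sketch assuming NumPy is available (shapes are noted in the comments):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W1, w_o, x):
    """Equations (8)-(9): z = sigma(W1 x), P(y = +1 | x) = sigma(w_o^T z)."""
    z = sigmoid(W1 @ x)      # W1: (K, d), x: (d,)  ->  z: (K,)
    return sigmoid(w_o @ z)  # w_o: (K,)            ->  scalar

# Example shapes: K = 3 hidden units, d = 2 inputs
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 2))
w_o = rng.normal(size=3)
x = np.array([1.0, -1.0])
p = forward(W1, w_o, x)
```

The matrix form computes exactly the same quantity as the element-wise sums in Equations (6)–(7), just vectorized over the K hidden units.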

SLIDE 12

Graphical Representation

[Figure: input layer x_{·,1}, …, x_{·,4}; hidden layer z_1, …, z_5; output layer y]

◮ Depth: 2 (two-layer neural network)
◮ Width: 5 (the maximal number of units in each layer)

SLIDE 13

Hypothesis Space

The hypothesis space of neural networks is usually defined by the architecture of the network, which includes

◮ the nodes in the network,
◮ the connections in the network, and
◮ the activation function (e.g., σ)

[Figure: the same two-layer network, with input x_{·,1}, …, x_{·,4}, hidden units z_1, …, z_5, and output y]


SLIDE 16

Other Activation Functions

[Figures: (a) Sign function, (b) Tanh function, (c) ReLU function [Jarrett et al., 2009]]
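The three activations in the figure can be sketched as follows (for the sign function I use the convention sign(0) = 0, which varies by text):

```python
import math

def sign(a):
    """Sign function (here with the convention sign(0) = 0)."""
    return (a > 0) - (a < 0)

def tanh(a):
    """Tanh: squashes inputs into (-1, 1), zero-centered unlike sigmoid."""
    return math.tanh(a)

def relu(a):
    """ReLU [Jarrett et al., 2009]: max(0, a)."""
    return max(0.0, a)
```

Unlike the sigmoid, sign is non-differentiable and tanh/ReLU have different saturation behavior, which matters for the gradient-based learning discussed later.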

SLIDE 17

Another Network/Hypothesis Space

By simply increasing the number of layers or the number of hidden units, we can create another hypothesis space.

[Figure: a network with input layer x_{·,1}, …, x_{·,4}, two hidden layers, and output layer y]

SLIDE 18

Expressive Power of Neural Networks


SLIDE 20

Two-layer NNs with Sign Function

Consider a neural network defined by the following functions

z_k = sign(∑_{j=1}^{d} w^{(1)}_{k,j} x_{·,j}),  k ∈ [K]  (10)

h(x) = sign(∑_{k=1}^{K} w^{(o)}_k z_k)  (11)

where sign(a) is the sign function. h(x) can be rewritten as

h(x) = sign(∑_{k=1}^{K} w^{(o)}_k · sign(∑_{j=1}^{d} w^{(1)}_{k,j} x_{·,j}))  (12)

SLIDE 21

Decision Boundary

h(x) is defined by a combination of K linear predictors.

[Figure: a piecewise-linear decision boundary in the (x₁, x₂) plane]

A similar conclusion applies to other activation functions. [Shalev-Shwartz and Ben-David, 2014, Page 274]


SLIDE 23

Universal Approximation Theorem

Restrict the inputs to be binary: x_{·,j} ∈ {−1, +1} for all j ∈ [d].

Universal Approximation Theorem

For every d, there exists a two-layer neural network (Equations 10–11) such that its hypothesis space contains all functions from {−1, +1}^d to {−1, +1}.

◮ The minimal size of a network that satisfies the theorem is exponential in d
◮ Similar results hold for σ as the activation function

[Shalev-Shwartz and Ben-David, 2014, Section 20.3]
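To make the construction behind the theorem concrete, here is a toy instance of my own (not from the slides): a two-layer sign network realizing XOR on {−1, +1}², with the bias absorbed into the weighted sum as a constant, a common convention. One hidden unit fires only on each positive example, and the output unit ORs them; extending this per-positive-example construction to an arbitrary f: {−1, +1}^d → {−1, +1} is exactly why the network size can grow exponentially in d.

```python
def sign(a):
    return 1 if a >= 0 else -1  # never evaluated at exactly 0 below

def xor_net(x1, x2):
    """Two-layer sign network in the spirit of Equations (10)-(11),
    computing XOR on {-1, +1}. Each hidden unit fires (+1) only when
    the input equals one positive example; the output unit takes an OR."""
    z1 = sign(x1 - x2 - 1)   # +1 only for x = (+1, -1)
    z2 = sign(-x1 + x2 - 1)  # +1 only for x = (-1, +1)
    return sign(z1 + z2 + 1)  # +1 iff at least one hidden unit fired
```

No single linear predictor can separate these four points, so this is the smallest illustration of a sign network exceeding the capacity of a single LR-style unit.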

SLIDE 24

Learning Neural Networks


SLIDE 26

Neural Network Predictions

Consider a binary classification problem with Y ∈ {−1, +1}.

◮ A two-layer neural network gives the following prediction:

P(Y = +1 | x) = σ((w^{(o)})^T σ(W^{(1)} x))  (13)

where {w^{(o)}, W^{(1)}} are the parameters

◮ Assuming the ground-truth label is y, let's introduce an empirical distribution

q(Y = y′ | x) = δ(y′, y) = 1 if y′ = y, and 0 otherwise  (14)


SLIDE 29

Cross Entropy

Given one data point, the loss function of a neural network is usually defined as the cross entropy between the prediction distribution p and the empirical distribution q:

H(q, p) = −q(Y = +1 | x) log p(Y = +1 | x) − q(Y = −1 | x) log p(Y = −1 | x)  (15)

Since q is defined with a delta function, depending on y we have

H(q, p) = −log p(Y = +1 | x) if y = +1, and −log p(Y = −1 | x) if y = −1  (16)

It is equivalent to the negative log-likelihood (NLL) function used in learning LR.
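Equations (15)–(16) in code: since q puts all its mass on the true label, the cross entropy collapses to the NLL of the observed label (a sketch; the function name is mine):

```python
import math

def cross_entropy(p_plus, y):
    """H(q, p) per Equation (16). p_plus is p(Y = +1 | x); q puts all
    mass on the true label y, so H(q, p) = -log p(Y = y | x)."""
    return -math.log(p_plus) if y == +1 else -math.log(1.0 - p_plus)
```

At p_plus = 0.5 the loss is log 2 regardless of y; a confident correct prediction drives the loss toward 0, a confident wrong one blows it up.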


SLIDE 32

ERM

◮ Given a set of training examples S = {(x_i, y_i)}_{i=1}^{m}, the loss function is defined as

L(θ) = −∑_{i=1}^{m} log p(y_i | x_i)  (17)

where θ indicates all the parameters in a network.

◮ For example, θ = {w^{(o)}, W^{(1)}} for the previously defined two-layer neural network
◮ Just like learning a LR, we can use a gradient-based learning algorithm


SLIDE 35

Gradient-based Learning

A simple sketch of gradient-based learning¹:

1. Compute the gradient of θ: ∂L(θ)/∂θ
2. Update the parameters with the gradient

θ^{(new)} ← θ^{(old)} − η · ∂L(θ)/∂θ |_{θ = θ^{(old)}}  (18)

where η is the learning rate

3. Go back to step 1 until convergence

¹ More detail will be discussed in the next lecture
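The three steps above can be sketched as a loop. The toy one-parameter loss L(θ) = (θ − 3)², with gradient 2(θ − 3), is my own choice purely so the result is easy to verify; any differentiable loss and its gradient slot in the same way:

```python
def gradient_descent(grad, theta, eta=0.1, steps=200):
    """Steps 1-3: compute the gradient, apply the update of Equation (18)
    with learning rate eta, repeat (a fixed step budget stands in for a
    convergence check)."""
    for _ in range(steps):
        theta = theta - eta * grad(theta)  # Equation (18)
    return theta

# Toy loss L(theta) = (theta - 3)^2, so dL/dtheta = 2 * (theta - 3)
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta=0.0)
```

Each update is a contraction toward the minimizer θ = 3 here; for neural networks the loss is non-convex, so the same loop only finds a local optimum.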


SLIDE 38

Gradient Computation

Consider the two-layer neural network with one training example (x, y). To further simplify the computation, we assume y = +1:

log p(y | x) = log σ((w^{(o)})^T σ(W^{(1)} x))  (19)

The gradient with respect to w^{(o)} is

∂L(θ)/∂w^{(o)} = (∂ log σ(·)/∂σ(·)) · (∂σ((w^{(o)})^T σ(W^{(1)} x)) / ∂((w^{(o)})^T σ(W^{(1)} x))) · (∂((w^{(o)})^T σ(W^{(1)} x)) / ∂w^{(o)})
= (1 − σ((w^{(o)})^T σ(W^{(1)} x))) · σ(W^{(1)} x)  (20)

which is in a similar form as the LR updating equation.
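Equation (20) can be sanity-checked against a finite-difference approximation; a sketch with NumPy (assumed available), taking L(θ) = log p(y = +1 | x) as on this slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_p(w_o, W1, x):
    """Equation (19): log sigma(w_o^T sigma(W1 x))."""
    return np.log(sigmoid(w_o @ sigmoid(W1 @ x)))

rng = np.random.default_rng(1)
W1, w_o, x = rng.normal(size=(3, 2)), rng.normal(size=3), rng.normal(size=2)

# Equation (20): (1 - sigma(w_o^T sigma(W1 x))) * sigma(W1 x)
z = sigmoid(W1 @ x)
grad_analytic = (1.0 - sigmoid(w_o @ z)) * z

# Central finite differences on each coordinate of w_o
eps = 1e-6
grad_numeric = np.array([
    (log_p(w_o + eps * e, W1, x) - log_p(w_o - eps * e, W1, x)) / (2 * eps)
    for e in np.eye(3)
])
```

This kind of numerical gradient check is a standard way to catch errors in hand-derived backward passes.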


SLIDE 40

Gradient Computation (II)

The gradient with respect to W^{(1)} is

∂L(θ)/∂W^{(1)} = (∂ log σ(·)/∂σ(·)) · (∂σ((w^{(o)})^T σ(W^{(1)} x)) / ∂((w^{(o)})^T σ(W^{(1)} x))) · (∂((w^{(o)})^T σ(W^{(1)} x)) / ∂σ(W^{(1)} x)) · (∂σ(W^{(1)} x) / ∂(W^{(1)} x)) · (∂(W^{(1)} x) / ∂W^{(1)})  (21)

◮ Both of them are applications of the chain rule from calculus, plus some derivatives of basic functions
◮ In the neural network literature, this is called the back-propagation algorithm [Rumelhart et al., 1986]

SLIDE 41

Computation Graph

SLIDE 42

Forward Operations

Consider the example of a two-layer neural network:

P(Y = +1 | x) = σ((w^{(o)})^T σ(W^{(1)} x))  (22)

A neural network is a composition of some basic functions and operations, for example

◮ σ(·)
◮ matrix transpose: (w^{(o)})^T
◮ matrix-vector multiplication: W^{(1)} x

SLIDE 43

Forward Graph

The computation graph of the two-layer neural network²:

x → W^{(1)} x → σ → (w^{(o)})^T z → σ → p(Y | x)

with the parameters W^{(1)} and w^{(o)} feeding into the corresponding nodes.

² For simplicity, the transpose operation is omitted from the graph

SLIDE 44

Backward Operations

Similarly, the gradients of neural network parameters are computed with a series of backward operations, each associated with the derivative of some basic function. For example:

∂σ(x)/∂x = σ(x)(1 − σ(x))

∂(a^T x)/∂x = a

∂ log(x)/∂x = 1/x

∂(Wx)/∂W: the gradient of (Wx)_k with respect to the k-th row of W is x^T, i.e., a matrix with x^T stacked in each row
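These basic derivatives in code, each easy to check by hand or by finite differences (a sketch; names are mine):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_sigmoid(a):
    """d sigma / d a = sigma(a) * (1 - sigma(a))."""
    return sigmoid(a) * (1.0 - sigmoid(a))

def d_dot_wrt_x(a):
    """Gradient of a^T x with respect to x is simply a."""
    return list(a)

def d_log(x):
    """d log(x) / d x = 1 / x."""
    return 1.0 / x
```

Back-propagation multiplies exactly these local derivatives along the path from the loss back to each parameter.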

SLIDE 45

Backward Graph

With the chain rule, the gradient of the loss function with respect to any parameter can be computed backward, step by step along the path:

x → W^{(1)} x → σ → (w^{(o)})^T z → σ → −log p(Y | x)

with backward edges ∂(W^{(1)} x), ∂σ, ∂((w^{(o)})^T z), and ∂σ carrying the gradients down to ∂W^{(1)} and ∂w^{(o)}.


SLIDE 47

Computation Graph

Perform the forward/backward steps with a graph of basic operations (e.g., PyTorch, TensorFlow).

[Figure: the forward graph and the backward graph from the previous two slides, shown together]

◮ Modular implementation: implement each module with its forward/backward operations together
◮ Automatic differentiation: automatically run the backward step
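A minimal sketch of the "modular implementation" idea in pure Python (my own toy design, not the PyTorch or TensorFlow API): each module stores during its forward pass whatever its backward pass needs, and gradients flow back through the chain of modules.

```python
import math

class Sigmoid:
    """Module with paired forward/backward operations."""
    def forward(self, a):
        self.out = 1.0 / (1.0 + math.exp(-a))
        return self.out
    def backward(self, grad_out):
        # local derivative: d sigma / d a = sigma(a) * (1 - sigma(a))
        return grad_out * self.out * (1.0 - self.out)

class Dot:
    """Computes w^T x for fixed weights w; backward returns the
    gradient with respect to w (the input gradient is omitted here)."""
    def __init__(self, w):
        self.w = w
    def forward(self, x):
        self.x = x  # cache the input for the backward pass
        return sum(wi * xi for wi, xi in zip(self.w, self.x))
    def backward(self, grad_out):
        return [grad_out * xi for xi in self.x]

# Forward: p = sigma(w^T x); backward: dp/dw, computed module by module
dot, sig = Dot([0.5, -0.25]), Sigmoid()
p = sig.forward(dot.forward([1.0, 2.0]))
grad_w = dot.backward(sig.backward(1.0))
```

Automatic differentiation frameworks generalize exactly this: they record the graph of forward operations and replay it in reverse, calling each module's backward.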

SLIDE 48

What is Deep Learning?

Definition

Deep Learning is building a system by assembling parameterized modules into a (possibly dynamic) computation graph, and training it to perform a task by optimizing the parameters using a gradient-based method. [LeCun, 2020, AAAI 2020 Keynote]

SLIDE 49

Reference

Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). What is the best multi-stage architecture for object recognition? In Proceedings of the 12th International Conference on Computer Vision, pages 2146–2153. IEEE.

LeCun, Y. (2020). Self-supervised learning. AAAI 2020 Keynote.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088):533–536.

Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.