

SLIDE 1

Neural Networks - II

Henrik I Christensen

Robotics & Intelligent Machines @ GT Georgia Institute of Technology, Atlanta, GA 30332-0280 hic@cc.gatech.edu


SLIDE 2

Outline

1. Introduction
2. Mixture Density Networks
3. Bayesian Neural Networks
4. Summary


SLIDE 3

Introduction

Last lecture:

Neural networks as a layered regression problem
Feed-forward networks
Linear models with activation functions
Global optimization

This lecture:

Coverage of multi-modal networks
Bayesian models for neural networks


SLIDE 4

Outline

1. Introduction
2. Mixture Density Networks
3. Bayesian Neural Networks
4. Summary


SLIDE 5

Motivation

The models so far have assumed a Gaussian distribution.

How about multi-modal distributions?
How about inverse problems?

Mixture models are one possible solution.



SLIDE 8

Simple Robot Example

[Figure: a two-link robot arm with link lengths L1, L2 and joint angles θ1, θ2. The same end-effector position (x1, x2) can be reached in two configurations, "elbow down" and "elbow up", so the inverse kinematics problem is multi-valued.]


SLIDE 9

Simple Functional Approximation Example

[Figure: toy function-approximation data; the forward mapping is fit well, while the multi-valued inverse mapping is fit poorly by a least-squares network, which averages the branches.]


SLIDE 10

Basic Formulation

Objective: approximation of p(t|x). A generic model:

$$p(t \mid x) = \sum_{k=1}^{K} \pi_k(x)\, \mathcal{N}\big(t \mid \mu_k(x), \sigma_k^2(x)\big)$$

Here a Gaussian mixture is used, but any distribution could be the basis. The parameters to be estimated are $\pi_k(x)$, $\mu_k(x)$ and $\sigma_k^2(x)$.

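As a minimal sketch (not from the slides; function and variable names are my own), the mixture density for a scalar target t could be evaluated as follows, assuming the network has already produced the K parameter arrays for a given input x:

```python
import numpy as np

def mixture_density(t, pi, mu, sigma):
    """Evaluate p(t|x) = sum_k pi_k * N(t | mu_k, sigma_k^2) for a scalar target t.

    pi, mu, sigma are length-K arrays holding the mixture parameters the
    network produced for one input x (a hypothetical interface)."""
    norm = np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return float(np.sum(pi * norm))
```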

SLIDE 11

The mixture density network

[Figure: schematic of a mixture density network; inputs x1, ..., xD feed a feed-forward network whose output activations θ1, ..., θM parameterize the mixture model for p(t|x).]


SLIDE 12

The Model Parameters

The mixing coefficients must satisfy

$$\sum_{k=1}^{K} \pi_k(x) = 1 \qquad 0 \le \pi_k(x) \le 1$$

which is achieved using a softmax:

$$\pi_k(x) = \frac{\exp(a_k^{\pi})}{\sum_{l=1}^{K} \exp(a_l^{\pi})}$$

The variance must be positive, so a good choice is

$$\sigma_k(x) = \exp(a_k^{\sigma})$$

The means can be represented by direct activations:

$$\mu_{kj}(x) = a_{kj}^{\mu}$$

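A hedged sketch of this mapping (function and variable names are my own, not from the slides); the three kinds of output activations are converted to valid mixture parameters:

```python
import numpy as np

def activations_to_params(a_pi, a_sigma, a_mu):
    """Map raw output activations to valid mixture parameters.

    Softmax keeps the mixing coefficients non-negative and summing to one,
    exp keeps the scales positive, and the means take the activations directly."""
    z = a_pi - np.max(a_pi)                 # shift for numerical stability
    pi = np.exp(z) / np.sum(np.exp(z))      # pi_k = exp(a_k^pi) / sum_l exp(a_l^pi)
    sigma = np.exp(a_sigma)                 # sigma_k = exp(a_k^sigma) > 0
    mu = a_mu                               # mu_kj = a_kj^mu, unconstrained
    return pi, sigma, mu
```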

SLIDE 13

The Energy Equation(s)

The error function is then, as seen before:

$$E(w) = -\sum_{n=1}^{N} \ln \left\{ \sum_{k=1}^{K} \pi_k(x_n, w)\, \mathcal{N}\big(t_n \mid \mu_k(x_n, w), \sigma_k^2(x_n, w)\big) \right\}$$

Computing the derivatives, we can minimize E(w). Let us use

$$\gamma_{nk} = \gamma_k(t_n \mid x_n) = \frac{\pi_k N_{nk}}{\sum_l \pi_l N_{nl}}$$

The derivatives are then

$$\frac{\partial E_n}{\partial a_k^{\pi}} = \pi_k - \gamma_{nk} \qquad \frac{\partial E_n}{\partial a_{kl}^{\mu}} = \gamma_{nk}\, \frac{\mu_{kl} - t_{nl}}{\sigma_k^2} \qquad \frac{\partial E_n}{\partial a_k^{\sigma}} = \gamma_{nk} \left( L - \frac{\| t_n - \mu_k \|^2}{\sigma_k^2} \right)$$

where L is the dimensionality of the target vector t.
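A minimal sketch of the per-sample error and its activation gradients for a scalar target (so L = 1), following the formulas above; this is my own illustration, not code from the lecture, and it assumes the parameter arrays come from a mapping like the one sketched earlier:

```python
import numpy as np

def mdn_loss_and_grads(t, pi, mu, sigma):
    """Per-sample error E_n and gradients w.r.t. the raw activations,
    for a scalar target t (so the target dimension L is 1)."""
    norm = np.exp(-0.5 * ((t - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    weighted = pi * norm
    total = np.sum(weighted)
    En = -np.log(total)                                # E_n = -ln sum_k pi_k N_nk
    gamma = weighted / total                           # responsibilities gamma_nk
    d_a_pi = pi - gamma                                # dE_n/da_k^pi
    d_a_mu = gamma * (mu - t) / sigma ** 2             # dE_n/da_k^mu
    d_a_sigma = gamma * (1.0 - (t - mu) ** 2 / sigma ** 2)  # dE_n/da_k^sigma (L=1)
    return En, gamma, d_a_pi, d_a_mu, d_a_sigma
```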

SLIDE 14

A Toy Example

[Figure: four panels (a)-(d) illustrating the mixture density network fitted to the toy multi-valued data.]

SLIDE 15

Mixture density networks

The net is optimizing a mixture of parameters
Different parts correspond to different components
Each part has its own set of "energy terms" and gradients
Illustrates the flexibility but also the complications


SLIDE 16

Outline

1. Introduction
2. Mixture Density Networks
3. Bayesian Neural Networks
4. Summary


SLIDE 17

Introductory Remarks

What if the output was a probability distribution? Could we optimize over the posterior distribution p(t|x)? Assume it is Gaussian to enable processing:

$$p(t \mid x, w, \beta) = \mathcal{N}\big(t \mid y(x, w), \beta^{-1}\big)$$

Let's consider how we can analyze the problem.


SLIDE 18

The Laplace Approximation - I

Sometimes the posterior is no longer Gaussian:

This challenges integration
Closed-form solutions might not be available

How can we generate an approximation? Obviously, a Gaussian approximation would be helpful: this is the Laplace approximation. Consider for now

$$p(z) = \frac{f(z)}{\int f(a)\, da}$$

The denominator is merely for normalization and is considered unknown. Assume the mode $z_0$ has been determined, so that $\left. df(z)/dz \right|_{z=z_0} = 0$.


SLIDE 19

The Laplace Approximation - II

A Taylor expansion of ln f is then

$$\ln f(z) \approx \ln f(z_0) - \frac{1}{2} A (z - z_0)^2 \qquad \text{where } A = -\left.\frac{d^2}{dz^2} \ln f(z)\right|_{z = z_0}$$

Taking the exponential,

$$f(z) \approx f(z_0) \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$$

which can be normalized to give

$$q(z) = \left(\frac{A}{2\pi}\right)^{1/2} \exp\left\{ -\frac{A}{2} (z - z_0)^2 \right\}$$

The extension to multivariate distributions is straightforward (see book).

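To make the recipe concrete, here is a small numerical sketch of my own (assuming log_f evaluates ln f and z0 is an already-located mode): estimate A with a finite-difference second derivative and return the matching Gaussian.

```python
import numpy as np

def laplace_approx_1d(log_f, z0, eps=1e-4):
    """Laplace approximation q(z) = N(z | z0, 1/A) around a mode z0,
    with A = -d^2/dz^2 ln f(z) at z0 estimated by finite differences."""
    A = -(log_f(z0 + eps) - 2.0 * log_f(z0) + log_f(z0 - eps)) / eps ** 2
    return z0, 1.0 / A          # mean and variance of the approximating Gaussian

# Example: f(z) proportional to z^2 exp(-z) has its mode at z0 = 2,
# and the exact curvature there is A = 0.5, i.e. variance 2.
mean, var = laplace_approx_1d(lambda z: 2.0 * np.log(z) - z, 2.0)
```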

SLIDE 20

Posterior Parameter Distribution

Back to the Bayesian networks. For an i.i.d. dataset with target values $\mathbf{t} = \{t_1, \ldots, t_N\}$ we have

$$p(\mathbf{t} \mid w, \beta) = \prod_{n=1}^{N} \mathcal{N}\big(t_n \mid y(x_n, w), \beta^{-1}\big)$$

The posterior is then

$$p(w \mid \mathbf{t}, \alpha, \beta) \propto p(w \mid \alpha)\, p(\mathbf{t} \mid w, \beta)$$

As usual we have

$$\ln p(w \mid \mathbf{t}) = -\frac{\alpha}{2} w^T w - \frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2 + \text{const}$$

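As a one-line sketch of this expression (names are mine; y_pred is assumed to already hold the network outputs y(x_n, w) for the given w):

```python
import numpy as np

def log_posterior_unnorm(w, alpha, beta, y_pred, t):
    """ln p(w|t) up to an additive constant:
    -alpha/2 * w^T w - beta/2 * sum_n (y(x_n, w) - t_n)^2."""
    return -0.5 * alpha * (w @ w) - 0.5 * beta * np.sum((y_pred - t) ** 2)
```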

SLIDE 21

Posterior Parameter Distribution - II

We can use the Laplace approximation to estimate the distribution:

$$A = -\nabla^2 \ln p(w \mid \mathbf{t}, \alpha, \beta) = \alpha I + \beta H$$

The approximation is then

$$q(w \mid \mathbf{t}) = \mathcal{N}\big(w \mid w_{MAP}, A^{-1}\big)$$

In turn we have

$$p(t \mid x, \mathbf{t}, \alpha, \beta) = \mathcal{N}\big(t \mid y(x, w_{MAP}), \sigma^2(x)\big)$$

where

$$\sigma^2(x) = \beta^{-1} + g^T A^{-1} g \qquad g = \left.\nabla_w y(x, w)\right|_{w = w_{MAP}}$$

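A minimal sketch of the predictive variance under these approximations, assuming A, g, and beta have already been computed at w_MAP (names are my own):

```python
import numpy as np

def predictive_variance(beta, A, g):
    """sigma^2(x) = 1/beta + g^T A^{-1} g, with A = alpha*I + beta*H
    and g = grad_w y(x, w) evaluated at w_MAP."""
    return 1.0 / beta + g @ np.linalg.solve(A, g)
```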

SLIDE 22

Optimization of Hyper-parameters

How do we estimate α and β? We can consider the marginal likelihood

$$p(\mathbf{t} \mid \alpha, \beta) = \int p(\mathbf{t} \mid w, \beta)\, p(w \mid \alpha)\, dw$$

From linear regression we have the eigenvalue equation

$$\beta H u_i = \lambda_i u_i$$

where H is the Hessian of the error E. As with regression, we have

$$\alpha = \frac{\gamma}{w_{MAP}^T w_{MAP}}$$

where γ is the effective rank of the Hessian. Similarly, β can be derived to be

$$\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \{ y(x_n, w_{MAP}) - t_n \}^2$$

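A hedged sketch of one re-estimation step, assuming the eigenvalues λ_i of βH are available and taking γ = Σ_i λ_i/(α + λ_i), the usual evidence-framework definition of the effective parameter count (not spelled out on the slide):

```python
import numpy as np

def update_hyperparams(alpha, w_map, residuals, lam):
    """One evidence-style re-estimation of alpha and beta.

    lam holds the eigenvalues lambda_i of beta*H at w_MAP;
    residuals holds y(x_n, w_MAP) - t_n for the N training points."""
    gamma = np.sum(lam / (alpha + lam))          # effective number of parameters
    alpha_new = gamma / (w_map @ w_map)          # alpha = gamma / (w_MAP^T w_MAP)
    N = residuals.size
    beta_new = (N - gamma) / np.sum(residuals ** 2)  # from 1/beta = sum/(N-gamma)
    return alpha_new, beta_new, gamma
```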

SLIDE 23

Bayesian Neural Networks

Modelling of the system as a probabilistic generator
Use standard techniques to generate w_MAP
We can in addition generate estimates for the precision/variance


SLIDE 24

Outline

1. Introduction
2. Mixture Density Networks
3. Bayesian Neural Networks
4. Summary


SLIDE 25

Summary

With neural nets we have a general functional estimator
Can be applied both for regression and discrimination
The basis functions can be a broad set of functions
NNs can also be used for estimation of mixture systems
Estimation of probability distributions is also possible for Gaussians (approximation with w_MAP, β)
Neural nets are a rich area with a long history.
