SLIDE 1

T-61.182 Information Theory and Machine Learning

  • 38. Introduction to Neural Networks
  • 40. Capacity of a Single Neuron
  • 41. Learning as Inference

Presented by Yang, Zhi-rong on 22 April 2004

SLIDE 2

Contents

  • Introduction to Neural Networks
    – Memories
    – Terminology
  • Capacity of a Single Neuron
    – Neural network learning as communication
    – The capacity of a single neuron
    – Counting threshold functions
  • Learning as Inference
    – Neural network learning as inference
    – Beyond optimization: making predictions
    – Implementation by Monte Carlo method
    – Implementation by Gaussian approximations

SLIDE 3

Memories

  • Address-based memory scheme
    – not associative
    – not robust or fault-tolerant
    – not distributed
  • Biological memory systems
    – content addressable
    – error-tolerant and robust
    – parallel and distributed

SLIDE 4

Terminology

  • Architecture
  • Activity rule
  • Learning rule
  • Supervised neural networks
  • Unsupervised neural networks

SLIDE 5

NN learning as communication

  • 1. Obtain adapted weights:

    $\{x_n\}_{n=1}^{N},\; \{t_n\}_{n=1}^{N} \;\longrightarrow\; \text{Learning algorithm} \;\longrightarrow\; w$

  • 2. Communication:

    $\{x_n\}_{n=1}^{N},\; w \;\longrightarrow\; \{\hat{t}_n\}_{n=1}^{N}$

SLIDE 6

The capacity of a single neuron

General position

Definition 1: A set of points $\{x_n\}$ in $K$-dimensional space is in general position if any subset of size $\le K$ is linearly independent.

The linear threshold function:

$y = f\!\left(\sum_{k=1}^{K} w_k x_k\right), \qquad f(a) = \begin{cases} 1 & a > 0 \\ 0 & a \le 0 \end{cases}$
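
To make the activity rule concrete, here is a minimal Octave sketch; it is an assumption for illustration only (the function name threshold_neuron and the example values are not from the slides).

% Minimal sketch (assumption): activity rule of a linear threshold neuron.
function y = threshold_neuron(w, x)
  a = w' * x;           % activation a = sum_k w_k x_k
  y = double(a > 0);    % f(a) = 1 if a > 0, 0 if a <= 0
endfunction

% Example (commented out):
% w = [1; -2];  x = [0.5; 0.1];
% threshold_neuron(w, x)   % a = 0.3 > 0, so y = 1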

SLIDE 7

Counting threshold functions

Denote by T(N, K) the number of distinct threshold functions of N points in general position in K dimensions. In this section, the author tries to derive a formula for T(N, K). To start with, let us work out a few cases by hand.

  • K = 1: for any N, T(N, 1) = 2
  • N = 1: for any K, T(1, K) = 2
  • K = 2: T(N, 2) = 2N

The labelling of the points given by the XOR function is not realizable by a single threshold neuron.

SLIDE 8

Counting threshold functions

Final Result

$T(N, K) = \begin{cases} 2^{N} & K \ge N \\ 2 \sum_{k=0}^{K-1} \binom{N-1}{k} & K < N \end{cases}$

Vapnik-Chervonenkis dimension (VC dimension)

[Plot: the fraction T(N, K)/2^N as a function of N/K]
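
A small Octave sketch of this formula may help; the helper name count_threshold_functions is an assumption for illustration, not something from the slides.

% Sketch (assumption): evaluate T(N, K), the quantity plotted above as T(N,K)/2^N.
function T = count_threshold_functions(N, K)
  if (K >= N)
    T = 2^N;
  else
    T = 2 * sum(bincoeff(N - 1, 0:(K - 1)));   % 2 * sum_{k=0}^{K-1} C(N-1, k)
  endif
endfunction

% Example: in K = 2 dimensions T(N, 2) = 2N
% count_threshold_functions(5, 2)    % returns 10
% count_threshold_functions(3, 3)    % K >= N, so 2^3 = 8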

SLIDE 9

NN learning as inference

Objective function to be minimized

$M(w) = G(w) + \alpha E_W(w)$

with error function

$G(w) = -\sum_{n} \Bigl[ t^{(n)} \ln y(x^{(n)}; w) + (1 - t^{(n)}) \ln\bigl(1 - y(x^{(n)}; w)\bigr) \Bigr]$

and a regularizer

$E_W(w) = \frac{1}{2} \sum_{i} w_i^{2}$
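
As a concrete illustration, the sketch below (an assumption, not part of the slides) writes out this objective and its gradient for a single sigmoid neuron. The names findM and gradM match the routines called by the Langevin code on Slide 14, while the global variables X (an N x K input matrix), t (an N x 1 target vector) and alpha are illustrative assumptions.

% Sketch (assumption): M(w) = G(w) + alpha*E_W(w) and its gradient for a
% single sigmoid neuron; X, t and alpha are assumed global data/hyperparameter.
function M = findM(w)
  global X t alpha
  y = 1 ./ (1 + exp(-X * w));                      % y(x^(n); w)
  G = -sum(t .* log(y) + (1 - t) .* log(1 - y));   % error function G(w)
  M = G + alpha * 0.5 * sum(w.^2);                 % add regularizer alpha*E_W(w)
endfunction

function g = gradM(w)
  global X t alpha
  y = 1 ./ (1 + exp(-X * w));
  g = X' * (y - t) + alpha * w;                    % gradient of G plus alpha*w
endfunction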

SLIDE 10

NN learning as inference

Finally

\begin{align}
P(w \mid D, \alpha) &= \frac{P(D \mid w)\, P(w \mid \alpha)}{P(D \mid \alpha)} \tag{1} \\
&= \frac{e^{-G(w)}\, e^{-\alpha E_W(w)} / Z_W(\alpha)}{P(D \mid \alpha)} \tag{2} \\
&= \frac{1}{Z_M} \exp\bigl(-M(w)\bigr) \tag{3}
\end{align}

SLIDE 11

NN learning as inference

Denote

$y(x; w) \equiv P(t = 1 \mid x, w)$

Then

$P(t \mid x, w) = y^{t}(1 - y)^{1-t} = \exp\bigl[t \ln y + (1 - t)\ln(1 - y)\bigr]$

The likelihood can be expressed in terms of the error function

$P(D \mid w) = \exp[-G(w)]$

Similarly for the regularizer

$P(w \mid \alpha) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W)$

SLIDE 12

Making predictions

Over-confident prediction (example)

[Figure: two panels of data points (*), each with locations A and B marked]

SLIDE 13

Bayesian prediction: marginalizing

Take into account the whole posterior ensemble

$P(t^{(N+1)} \mid x^{(N+1)}, D, \alpha) = \int d^{K}w\; P(t^{(N+1)} \mid x^{(N+1)}, w, \alpha)\, P(w \mid D, \alpha)$

We then need a way of computing this integral:

$P(t^{(N+1)} = 1 \mid x^{(N+1)}, D, \alpha) = \int d^{K}w\; P(t^{(N+1)} = 1 \mid x^{(N+1)}, w, \alpha)\, \frac{1}{Z_M} \exp(-M(w))$
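
Once samples w^(1), ..., w^(R) from the posterior are available, for instance from the Monte Carlo methods on the next slides, the integral is approximated by a sample average. The Octave sketch below is an assumption for illustration; predictive_prob and wsamples are not names from the slides.

% Sketch (assumption): approximate the predictive integral by averaging
% y(x; w) over R posterior samples stored as the rows of wsamples (R x K).
function pt1 = predictive_prob(x, wsamples)
  a = wsamples * x;            % activations w^(r) . x, one per sample
  y = 1 ./ (1 + exp(-a));      % y(x; w^(r)) = P(t = 1 | x, w^(r), alpha)
  pt1 = mean(y);               % (1/R) sum_r y(x; w^(r))
endfunction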

SLIDE 14

The Langevin Monte Carlo Method

g = gradM(w);                        % gradient of M at the initial w
M = findM(w);                        % objective function M(w)
for l = 1:L
  p = randn(size(w));                % initial momentum ~ Normal(0, 1)
  H = p' * p / 2 + M;                % evaluate H(w, p)
  p = p - epsilon * g / 2;           % half-step in p
  wnew = w + epsilon * p;            % step in w
  gnew = gradM(wnew);                % new gradient
  p = p - epsilon * gnew / 2;        % second half-step in p
  Mnew = findM(wnew);
  Hnew = p' * p / 2 + Mnew;
  dH = Hnew - H;
  if (dH < 0 || rand() < exp(-dH))   % Metropolis accept/reject
    g = gnew; w = wnew; M = Mnew;
  endif
endfor

SLIDE 15

The Langevin Monte Carlo Method

‘gradient descent with added noise’

$\Delta w = -\tfrac{1}{2}\epsilon^{2} g + \epsilon p$

Speedup by Hamiltonian Monte Carlo (the single leapfrog step above is replaced by Tau leapfrog steps):

wnew = w;  gnew = g;
for tau = 1:Tau
  p = p - epsilon * gnew / 2;    % half-step in p
  wnew = wnew + epsilon * p;     % step in w
  gnew = gradM(wnew);
  p = p - epsilon * gnew / 2;    % half-step in p
endfor

SLIDE 16

Gaussian approximations

Taylor expand M(w)

$M(w) \simeq M(w_{\mathrm{MP}}) + \tfrac{1}{2}(w - w_{\mathrm{MP}})^{\mathsf{T}} A\, (w - w_{\mathrm{MP}}) + \cdots$

with Hessian matrix

$A_{ij} \equiv \left. \frac{\partial^{2}}{\partial w_i \partial w_j} M(w) \right|_{w = w_{\mathrm{MP}}}$

The Gaussian approximation is defined as:

$Q(w;\, w_{\mathrm{MP}}, A) = \bigl[\det(A/2\pi)\bigr]^{1/2} \exp\!\Bigl(-\tfrac{1}{2}(w - w_{\mathrm{MP}})^{\mathsf{T}} A\, (w - w_{\mathrm{MP}})\Bigr)$

SLIDE 17

Gaussian approximations

The second derivative of M(w) with respect to w is given by

$\frac{\partial^{2}}{\partial w_i \partial w_j} M(w) = \sum_{n=1}^{N} f'(a^{(n)})\, x_i^{(n)} x_j^{(n)} + \alpha\, \delta_{ij}$

where

$f(a) \equiv \frac{1}{1 + e^{-a}}, \qquad a^{(n)} = \sum_{j} w_j x_j^{(n)}$
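
For concreteness, here is an Octave sketch of this Hessian (an assumption, not from the slides), reusing the global X and alpha assumed in the earlier findM/gradM sketch.

% Sketch (assumption): A_ij = sum_n f'(a^(n)) x_i^(n) x_j^(n) + alpha*delta_ij.
function A = hessM(wMP)
  global X alpha
  a = X * wMP;                              % activations a^(n) = sum_j w_j x_j^(n)
  fprime = exp(-a) ./ (1 + exp(-a)).^2;     % f'(a) for the logistic f
  A = X' * (X .* fprime) + alpha * eye(length(wMP));
endfunction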

SLIDE 18

Gaussian approximations

$P(a \mid x, D, \alpha) = \mathrm{Normal}(a_{\mathrm{MP}}, s^{2}) = \frac{1}{\sqrt{2\pi s^{2}}} \exp\!\left(-\frac{(a - a_{\mathrm{MP}})^{2}}{2 s^{2}}\right)$

where

$a_{\mathrm{MP}} = a(x; w_{\mathrm{MP}})$

and

$s^{2} = x^{\mathsf{T}} A^{-1} x$

SLIDE 19

Gaussian approximations

Therefore the marginalized output is:

$P(t = 1 \mid x, D, \alpha) = \psi(a_{\mathrm{MP}}, s^{2}) \equiv \int da\, f(a)\, \mathrm{Normal}(a;\, a_{\mathrm{MP}}, s^{2})$

A further approximation can be applied:

$\psi(a_{\mathrm{MP}}, s^{2}) \simeq \phi(a_{\mathrm{MP}}, s^{2}) \equiv f\bigl(\kappa(s)\, a_{\mathrm{MP}}\bigr)$

where

$\kappa(s) = 1 \big/ \sqrt{1 + \pi s^{2}/8}$
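
The sketch below (an assumption, not from the slides) puts these pieces together: it computes s^2 = x' A^{-1} x and compares the integral psi, evaluated by numerical quadrature, with the shortcut phi = f(kappa(s) aMP). The function name demo_marginalized_output is illustrative.

% Sketch (assumption): compare psi(aMP, s^2) with phi(aMP, s^2) = f(kappa(s)*aMP).
function demo_marginalized_output(x, wMP, A)
  f   = @(a) 1 ./ (1 + exp(-a));            % logistic function f(a)
  aMP = x' * wMP;                           % aMP = a(x; wMP)
  s2  = x' * (A \ x);                       % s^2 = x' * inv(A) * x
  normalpdf = @(a) exp(-(a - aMP).^2 / (2*s2)) / sqrt(2*pi*s2);
  psi = quad(@(a) f(a) .* normalpdf(a), aMP - 10*sqrt(s2), aMP + 10*sqrt(s2));
  kappa = 1 / sqrt(1 + pi*s2/8);            % kappa(s) = 1/sqrt(1 + pi*s^2/8)
  phi = f(kappa * aMP);                     % closed-form approximation
  printf("psi = %.4f  phi = %.4f\n", psi, phi);
endfunction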

SLIDE 20

Exercises

  • Practice on counting threshold functions: Ex. 40.6 (page 490)
  • Prove the approximation of the Hessian matrix: Ex. 41.1 (page 501)
