SLIDE 1

T-61.182 Information Theory and Machine Learning

  • 38. Introduction to Neural Networks
  • 40. Capacity of a Single Neuron
  • 41. Learning as Inference

Presented by Yang, Zhi-rong on 22 April 2004

SLIDE 2

Contents

  • Introduction to Neural Networks
    – Memories
    – Terminology
  • Capacity of a Single Neuron
    – Neural network learning as communication
    – The capacity of a single neuron
    – Counting threshold functions
  • Learning as Inference
    – Neural network learning as inference
    – Beyond optimization: making predictions
    – Implementation by Monte Carlo method
    – Implementation by Gaussian approximations

SLIDE 3

Memories

  • Address-based memory scheme
    – not associative
    – not robust or fault-tolerant
    – not distributed
  • Biological memory systems
    – content addressable
    – error-tolerant and robust
    – parallel and distributed

SLIDE 4

Terminology

  • Architecture
  • Activity rule
  • Learning rule
  • Supervised neural networks
  • Unsupervised neural networks

SLIDE 5

NN learning as communication

  • 1. Obtain adapted weights:

    $\{x_n\}_{n=1}^{N},\; \{t_n\}_{n=1}^{N} \;\longrightarrow\; \text{Learning algorithm} \;\longrightarrow\; w$

  • 2. Communication:

    $\{x_n\}_{n=1}^{N},\; w \;\longrightarrow\; \{\hat{t}_n\}_{n=1}^{N}$

SLIDE 6

The capacity of a single neuron

General position

Definition 1: A set of points $\{x_n\}$ in $K$-dimensional space is in general position if any subset of size $\le K$ is linearly independent.

The linear threshold function:

$y = f\!\left(\sum_{k=1}^{K} w_k x_k\right), \qquad f(a) = \begin{cases} 1 & a > 0 \\ 0 & a \le 0 \end{cases}$
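
To make the activity rule concrete, here is a minimal Octave sketch; it is an assumption for illustration only (the function name threshold_neuron and the example values are not from the slides).

% Minimal sketch (assumption): activity rule of a linear threshold neuron.
function y = threshold_neuron(w, x)
  a = w' * x;           % activation a = sum_k w_k x_k
  y = double(a > 0);    % f(a) = 1 if a > 0, 0 if a <= 0
endfunction

% Example (commented out):
% w = [1; -2];  x = [0.5; 0.1];
% threshold_neuron(w, x)   % a = 0.3 > 0, so y = 1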

SLIDE 7

Counting threshold functions

Denote by T(N, K) the number of distinct threshold functions of N points in general position in K dimensions. In this section, the author tries to derive a formula for T(N, K). To start with, let us work out a few cases by hand.

  • K = 1: for any N, T(N, 1) = 2
  • N = 1: for any K, T(1, K) = 2
  • K = 2: T(N, 2) = 2N

The labelling of the points given by the XOR function is not realizable by a single threshold neuron.

SLIDE 8

Counting threshold functions

Final Result

$T(N, K) = \begin{cases} 2^{N} & K \ge N \\ 2 \sum_{k=0}^{K-1} \binom{N-1}{k} & K < N \end{cases}$

Vapnik-Chervonenkis dimension (VC dimension)

[Plot: the fraction T(N, K)/2^N as a function of N/K]
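
A small Octave sketch of this formula may help; the helper name count_threshold_functions is an assumption for illustration, not something from the slides.

% Sketch (assumption): evaluate T(N, K), the quantity plotted above as T(N,K)/2^N.
function T = count_threshold_functions(N, K)
  if (K >= N)
    T = 2^N;
  else
    T = 2 * sum(bincoeff(N - 1, 0:(K - 1)));   % 2 * sum_{k=0}^{K-1} C(N-1, k)
  endif
endfunction

% Example: in K = 2 dimensions T(N, 2) = 2N
% count_threshold_functions(5, 2)    % returns 10
% count_threshold_functions(3, 3)    % K >= N, so 2^3 = 8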

SLIDE 9

NN learning as inference

Objective function to be minimized

$M(w) = G(w) + \alpha E_W(w)$

with error function

$G(w) = -\sum_{n} \Bigl[ t^{(n)} \ln y(x^{(n)}; w) + (1 - t^{(n)}) \ln\bigl(1 - y(x^{(n)}; w)\bigr) \Bigr]$

and a regularizer

$E_W(w) = \frac{1}{2} \sum_{i} w_i^{2}$
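
As a concrete illustration, the sketch below (an assumption, not part of the slides) writes out this objective and its gradient for a single sigmoid neuron. The names findM and gradM match the routines called by the Langevin code on Slide 14, while the global variables X (an N x K input matrix), t (an N x 1 target vector) and alpha are illustrative assumptions.

% Sketch (assumption): M(w) = G(w) + alpha*E_W(w) and its gradient for a
% single sigmoid neuron; X, t and alpha are assumed global data/hyperparameter.
function M = findM(w)
  global X t alpha
  y = 1 ./ (1 + exp(-X * w));                      % y(x^(n); w)
  G = -sum(t .* log(y) + (1 - t) .* log(1 - y));   % error function G(w)
  M = G + alpha * 0.5 * sum(w.^2);                 % add regularizer alpha*E_W(w)
endfunction

function g = gradM(w)
  global X t alpha
  y = 1 ./ (1 + exp(-X * w));
  g = X' * (y - t) + alpha * w;                    % gradient of G plus alpha*w
endfunction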

SLIDE 10

NN learning as inference

Finally

\begin{align}
P(w \mid D, \alpha) &= \frac{P(D \mid w)\, P(w \mid \alpha)}{P(D \mid \alpha)} \tag{1} \\
&= \frac{e^{-G(w)}\, e^{-\alpha E_W(w)} / Z_W(\alpha)}{P(D \mid \alpha)} \tag{2} \\
&= \frac{1}{Z_M} \exp\bigl(-M(w)\bigr) \tag{3}
\end{align}

SLIDE 11

NN learning as inference

Denote

$y(x; w) \equiv P(t = 1 \mid x, w)$

Then

$P(t \mid x, w) = y^{t}(1 - y)^{1-t} = \exp\bigl[t \ln y + (1 - t)\ln(1 - y)\bigr]$

The likelihood can be expressed in terms of the error function

$P(D \mid w) = \exp[-G(w)]$

Similarly for the regularizer

$P(w \mid \alpha) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W)$

SLIDE 12

Making predictions

Over-confident prediction (example)

[Figure: two panels of data points (*), each with locations A and B marked]

SLIDE 13

Bayesian prediction: marginalizing

Take into account the whole posterior ensemble

$P(t^{(N+1)} \mid x^{(N+1)}, D, \alpha) = \int d^{K}w\; P(t^{(N+1)} \mid x^{(N+1)}, w, \alpha)\, P(w \mid D, \alpha)$

We then need a way of computing this integral:

$P(t^{(N+1)} = 1 \mid x^{(N+1)}, D, \alpha) = \int d^{K}w\; P(t^{(N+1)} = 1 \mid x^{(N+1)}, w, \alpha)\, \frac{1}{Z_M} \exp(-M(w))$
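
Once samples w^(1), ..., w^(R) from the posterior are available, for instance from the Monte Carlo methods on the next slides, the integral is approximated by a sample average. The Octave sketch below is an assumption for illustration; predictive_prob and wsamples are not names from the slides.

% Sketch (assumption): approximate the predictive integral by averaging
% y(x; w) over R posterior samples stored as the rows of wsamples (R x K).
function pt1 = predictive_prob(x, wsamples)
  a = wsamples * x;            % activations w^(r) . x, one per sample
  y = 1 ./ (1 + exp(-a));      % y(x; w^(r)) = P(t = 1 | x, w^(r), alpha)
  pt1 = mean(y);               % (1/R) sum_r y(x; w^(r))
endfunction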

SLIDE 14

The Langevin Monte Carlo Method

g = gradM(w);                        % gradient of M at the initial w
M = findM(w);                        % objective function M(w)
for l = 1:L
  p = randn(size(w));                % initial momentum ~ Normal(0, 1)
  H = p' * p / 2 + M;                % evaluate H(w, p)
  p = p - epsilon * g / 2;           % half-step in p
  wnew = w + epsilon * p;            % step in w
  gnew = gradM(wnew);                % new gradient
  p = p - epsilon * gnew / 2;        % second half-step in p
  Mnew = findM(wnew);
  Hnew = p' * p / 2 + Mnew;
  dH = Hnew - H;
  if (dH < 0 || rand() < exp(-dH))   % Metropolis accept/reject
    g = gnew; w = wnew; M = Mnew;
  endif
endfor

SLIDE 15

The Langevin Monte Carlo Method

‘gradient descent with added noise’

$\Delta w = -\tfrac{1}{2}\epsilon^{2} g + \epsilon p$

Speedup by Hamiltonian Monte Carlo (the single leapfrog step above is replaced by Tau leapfrog steps):

wnew = w;  gnew = g;
for tau = 1:Tau
  p = p - epsilon * gnew / 2;    % half-step in p
  wnew = wnew + epsilon * p;     % step in w
  gnew = gradM(wnew);
  p = p - epsilon * gnew / 2;    % half-step in p
endfor

SLIDE 16

Gaussian approximations

Taylor expand M(w)

$M(w) \simeq M(w_{\mathrm{MP}}) + \tfrac{1}{2}(w - w_{\mathrm{MP}})^{\mathsf{T}} A\, (w - w_{\mathrm{MP}}) + \cdots$

with Hessian matrix

$A_{ij} \equiv \left. \frac{\partial^{2}}{\partial w_i \partial w_j} M(w) \right|_{w = w_{\mathrm{MP}}}$

The Gaussian approximation is defined as:

$Q(w;\, w_{\mathrm{MP}}, A) = \bigl[\det(A/2\pi)\bigr]^{1/2} \exp\!\Bigl(-\tfrac{1}{2}(w - w_{\mathrm{MP}})^{\mathsf{T}} A\, (w - w_{\mathrm{MP}})\Bigr)$

SLIDE 17

Gaussian approximations

The second derivative of M(w) with respect to w is given by

$\frac{\partial^{2}}{\partial w_i \partial w_j} M(w) = \sum_{n=1}^{N} f'(a^{(n)})\, x_i^{(n)} x_j^{(n)} + \alpha\, \delta_{ij}$

where

$f(a) \equiv \frac{1}{1 + e^{-a}}, \qquad a^{(n)} = \sum_{j} w_j x_j^{(n)}$
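
For concreteness, here is an Octave sketch of this Hessian (an assumption, not from the slides), reusing the global X and alpha assumed in the earlier findM/gradM sketch.

% Sketch (assumption): A_ij = sum_n f'(a^(n)) x_i^(n) x_j^(n) + alpha*delta_ij.
function A = hessM(wMP)
  global X alpha
  a = X * wMP;                              % activations a^(n) = sum_j w_j x_j^(n)
  fprime = exp(-a) ./ (1 + exp(-a)).^2;     % f'(a) for the logistic f
  A = X' * (X .* fprime) + alpha * eye(length(wMP));
endfunction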

SLIDE 18

Gaussian approximations

$P(a \mid x, D, \alpha) = \mathrm{Normal}(a_{\mathrm{MP}}, s^{2}) = \frac{1}{\sqrt{2\pi s^{2}}} \exp\!\left(-\frac{(a - a_{\mathrm{MP}})^{2}}{2 s^{2}}\right)$

where

$a_{\mathrm{MP}} = a(x; w_{\mathrm{MP}})$

and

$s^{2} = x^{\mathsf{T}} A^{-1} x$

SLIDE 19

Gaussian approximations

Therefore the marginalized output is:

$P(t = 1 \mid x, D, \alpha) = \psi(a_{\mathrm{MP}}, s^{2}) \equiv \int da\, f(a)\, \mathrm{Normal}(a;\, a_{\mathrm{MP}}, s^{2})$

A further approximation can be applied:

$\psi(a_{\mathrm{MP}}, s^{2}) \simeq \phi(a_{\mathrm{MP}}, s^{2}) \equiv f\bigl(\kappa(s)\, a_{\mathrm{MP}}\bigr)$

where

$\kappa(s) = 1 \big/ \sqrt{1 + \pi s^{2}/8}$
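
The sketch below (an assumption, not from the slides) puts these pieces together: it computes s^2 = x' A^{-1} x and compares the integral psi, evaluated by numerical quadrature, with the shortcut phi = f(kappa(s) aMP). The function name demo_marginalized_output is illustrative.

% Sketch (assumption): compare psi(aMP, s^2) with phi(aMP, s^2) = f(kappa(s)*aMP).
function demo_marginalized_output(x, wMP, A)
  f   = @(a) 1 ./ (1 + exp(-a));            % logistic function f(a)
  aMP = x' * wMP;                           % aMP = a(x; wMP)
  s2  = x' * (A \ x);                       % s^2 = x' * inv(A) * x
  normalpdf = @(a) exp(-(a - aMP).^2 / (2*s2)) / sqrt(2*pi*s2);
  psi = quad(@(a) f(a) .* normalpdf(a), aMP - 10*sqrt(s2), aMP + 10*sqrt(s2));
  kappa = 1 / sqrt(1 + pi*s2/8);            % kappa(s) = 1/sqrt(1 + pi*s^2/8)
  phi = f(kappa * aMP);                     % closed-form approximation
  printf("psi = %.4f  phi = %.4f\n", psi, phi);
endfunction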

SLIDE 20

Exercises

  • Practice on counting threshold functions: Ex. 40.6 (page 490)
  • Prove the approximation of the Hessian matrix: Ex. 41.1 (page 501)
