

SLIDE 1

Pseudo Orthogonal Bases Give the Optimal Generalization Capability in Neural Network Learning

Masashi Sugiyama, Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Japan

SPIE's 44th Annual Meeting and Exhibition
Wavelet Applications in Signal and Image Processing VII

SLIDE 2

Pseudo Orthogonal Bases (POBs)

Definition

Let H be a finite-dimensional Hilbert space and M ≥ dim(H). A set {φ_m}_{m=1}^M of elements in H is called a POB if any f in H is expressed as

f = \sum_{m=1}^{M} \langle f, \phi_m \rangle \phi_m,

where ⟨·, ·⟩ denotes the inner product in H.

[Figure: a POB in H = R^2 with M = 3: three vectors φ_1, φ_2, φ_3 with components involving 1/2, 1/√2, and −1/2.]

  • If M = dim(H), a POB reduces to an orthonormal basis (ONB).

  • A POB is a tight frame with frame bound 1:

\|f\|^2 = \sum_{m=1}^{M} |\langle f, \phi_m \rangle|^2.

  • If \|\phi_1\| = \|\phi_2\| = \cdots = \|\phi_M\|, then {φ_m}_{m=1}^M is called a pseudo orthonormal basis (PONB).
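
The POB expansion and the frame-bound-1 identity are easy to check numerically. Below is a minimal NumPy sketch using a standard equal-norm tight frame of three vectors in R^2; the vectors and names are illustrative and not necessarily those of the slide's figure.

```python
import numpy as np

# Three equal-norm vectors in R^2 forming a POB (tight frame with
# frame bound 1).  The scaling sqrt(2/3) makes sum_m phi_m phi_m^T = I.
angles = 2 * np.pi * np.arange(3) / 3
phis = np.sqrt(2 / 3) * np.stack([np.cos(angles), np.sin(angles)], axis=1)

f = np.array([0.7, -1.3])                     # arbitrary element of H = R^2
coeffs = phis @ f                             # <f, phi_m>
reconstruction = coeffs @ phis                # sum_m <f, phi_m> phi_m

print(np.allclose(reconstruction, f))         # True: POB expansion holds
print(np.isclose(np.sum(coeffs**2), f @ f))   # True: tight frame, bound 1
```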

SLIDE 3

Frame, POB, PBOB, · · ·

  • Frame
    – Duffin and Schaeffer (1952)
    – Young (1980)

  • Pseudo orthogonal basis (POB)
    – Ogawa and Iijima (1973)

f = \sum_{m=1}^{M} \langle f, \phi_m \rangle \phi_m

  • Pseudo biorthogonal basis (PBOB)
    – Ogawa (1978)

f = \sum_{m=1}^{M} \langle f, \phi_m^* \rangle \phi_m

Applications: signal restoration, computerized tomography, neural network learning, . . .

SLIDE 4

Learning in Neural Networks

[Figure: a three-layer neural network with inputs ξ_1, ξ_2, . . . , ξ_L, hidden neurons u_1, u_2, . . . , u_N, synaptic weights v_11, . . . , v_LN, modifiable weights w_1, . . . , w_N, and output y = f_0(x).]

Purpose of NN Learning

Modify the weights by using training examples

{(x_m, y_m) | y_m = f(x_m) + n_m}_{m=1}^M,

and obtain the underlying input-output rule.

[Figure: a target function f and a learning result f_0 passing near the training examples (x_1, y_1), (x_2, y_2), (x_3, y_3).]

SLIDE 5

NN Learning as an Inverse Problem

[Diagram: the target function f in the function space H is mapped by the sampling operator A to C^M, noise n is added to give the sample value vector y, and the learning operator X maps y back to the learning result f_0 in H.]

sampling:

y = (y_1, . . . , y_M)^T = Af + n, where Af = (f(x_1), f(x_2), . . . , f(x_M))^T

learning:

f_0 = Xy

Representation of the sampling operator A:

A = \sum_{m=1}^{M} (e_m \otimes \psi_m),

where ψ_m(x) = K(x, x_m), K(x, x') is the reproducing kernel of H, and ⟨f, ψ_m⟩ = f(x_m).
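
To make A concrete, the sketch below represents it as an M × (2N+1) matrix with respect to the orthonormal basis {exp(inx)} of the trigonometric polynomial space defined on the next slide; row m then maps the coefficient vector of f to the sample f(x_m). This matrix representation and the variable names are assumptions of the sketch, not part of the slides.

```python
import numpy as np

# Minimal sketch: A[m, n] = exp(i n x_m) maps the coefficient vector of
# f(x) = sum_n a_n exp(i n x), n = -N..N, to the samples f(x_m).
N, M = 3, 21
ns = np.arange(-N, N + 1)
sample_points = np.random.uniform(-np.pi, np.pi, size=M)
A = np.exp(1j * np.outer(sample_points, ns))

a = np.random.randn(2 * N + 1) + 1j * np.random.randn(2 * N + 1)
f = lambda x: np.sum(a * np.exp(1j * ns * x))
print(np.allclose(A @ a, [f(x) for x in sample_points]))  # True
```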

SLIDE 6

Trigonometric Polynomial Space

A Hilbert space H is called a trigonometric polynomial space of order N if H is spanned by {exp(inx)}_{n=−N}^{N} defined on [−π, π] and the inner product in H is defined as

\langle f, g \rangle = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) \overline{g(x)} \, dx.

The reproducing kernel is

K(x, x') =
\begin{cases}
\sin\!\big(\tfrac{(2N+1)(x-x')}{2}\big) \,/\, \sin\!\big(\tfrac{x-x'}{2}\big) & (x \neq x') \\
2N + 1 & (x = x')
\end{cases}

[Figure: profile of the reproducing kernel of a trigonometric polynomial space of order 5 (x' = 0).]
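
As a numerical sanity check, the following sketch evaluates this kernel and verifies the reproducing property ⟨f, ψ_m⟩ = f(x_m); the quadrature grid, test function f(x) = exp(2ix), and sample point are arbitrary choices.

```python
import numpy as np

# Reproducing kernel (Dirichlet kernel) of the order-N trigonometric
# polynomial space, with the removable singularity handled explicitly.
def K(x, xp, N):
    d = x - xp
    if np.isclose(np.sin(d / 2), 0.0):
        return 2 * N + 1
    return np.sin((2 * N + 1) * d / 2) / np.sin(d / 2)

# Check <f, psi_m> = f(x_m) for f(x) = exp(2ix), N = 5, x_m = 0.8.
# The rectangle rule is exact for trigonometric polynomials.
N, x_m = 5, 0.8
xs = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
psi = np.array([K(x, x_m, N) for x in xs])
inner = np.mean(np.exp(2j * xs) * psi)  # (1/2pi) * int f psi dx (psi is real)
print(np.isclose(inner, np.exp(2j * x_m)))  # True
```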

SLIDE 7

Process of NN Learning

[Diagram: the same inverse-problem diagram as on Slide 5.]

  • 1. (Active Learning) Sample points {x_m}_{m=1}^M are determined.

  • 2. Sample values {y_m}_{m=1}^M are gathered.

  • 3. X and f_0 are calculated: Projection Learning

When the noise covariance matrix is σ^2 I, X = A^†, where A^† is the Moore-Penrose generalized inverse of A.

Our goal

We give the optimal solution to active learning.
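
A sketch of step 3 under σ^2 I noise, with X taken as the Moore-Penrose inverse of the matrix representation of A from Slide 5; the noise level and variable names are illustrative.

```python
import numpy as np

# Projection learning with noise covariance sigma^2 I: the learning
# operator is A^+, so the learned coefficient vector is pinv(A) @ y.
N, M, sigma = 3, 21, 0.3
ns = np.arange(-N, N + 1)
x = np.random.uniform(-np.pi, np.pi, size=M)
A = np.exp(1j * np.outer(x, ns))

a_true = np.random.randn(2 * N + 1) + 1j * np.random.randn(2 * N + 1)
y = A @ a_true + sigma * np.random.randn(M)   # noisy samples

a0 = np.linalg.pinv(A) @ y                    # learning: f0 = X y, X = A^+
print(np.linalg.norm(a0 - a_true))            # small when the noise is small
```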

SLIDE 8

Active Learning

Find a set {x_m}_{m=1}^M of sample points which minimizes the generalization error

J_G = E_n \|f_0 - f\|^2,

where E_n denotes the ensemble average over the noise. If the noise covariance matrix is σ^2 I, then J_G becomes

J_G = \underbrace{\|P_{N(A)} f\|^2}_{\text{bias}} + \underbrace{\sigma^2 \, \mathrm{tr}\big((AA^*)^\dagger\big)}_{\text{variance}},

where N(A) denotes the null space of A. The bias of f_0 is 0 ⟺ N(A) = {0}. This leads to the following strategy.

Strategy

Find a set {x_m}_{m=1}^M of sample points which minimizes

J_G = \sigma^2 \, \mathrm{tr}\big((AA^*)^\dagger\big)

under the constraint N(A) = {0}.
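
The variance term is directly computable for any candidate design. A minimal sketch follows, with σ^2 = 1; the equispaced design anticipates Example 1 on Slide 11.

```python
import numpy as np

# Variance term J_G = sigma^2 tr((A A*)^+) for a given sample design,
# assuming N(A) = {0} so that the bias term vanishes.
def generalization_error(sample_points, N, sigma2):
    ns = np.arange(-N, N + 1)
    A = np.exp(1j * np.outer(sample_points, ns))
    return sigma2 * np.trace(np.linalg.pinv(A @ A.conj().T)).real

N, M = 3, 21
equispaced = -np.pi + 2 * np.pi * np.arange(M) / M
random_pts = np.random.uniform(-np.pi, np.pi, size=M)
print(generalization_error(equispaced, N, 1.0))  # (2N+1)/M = 1/3
print(generalization_error(random_pts, N, 1.0))  # larger in general
```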

SLIDE 9

Main Theorem

Suppose the noise covariance matrix is σ^2 I with σ^2 > 0. Then J_G is minimized under the constraint N(A) = {0} if and only if {\frac{1}{\sqrt{M}} \psi_m}_{m=1}^M forms a PONB in H. In this case, the minimum value of J_G is

J_G = \frac{\sigma^2 (2N + 1)}{M}.

That is, the sample points must satisfy

f = \sum_{m=1}^{M} \Big\langle f, \tfrac{1}{\sqrt{M}} \psi_m \Big\rangle \tfrac{1}{\sqrt{M}} \psi_m  for all f ∈ H,

\|\psi_1\| = \|\psi_2\| = \cdots = \|\psi_M\|,

where ψ_m(x) = K(x, x_m) and K(x, x') is the reproducing kernel given on Slide 6.

SLIDE 10

Interpretation

When {\frac{1}{\sqrt{M}} \psi_m}_{m=1}^M forms a PONB in H,

\|Af\| = \sqrt{M} \, \|f\|.

Decompose the learning result as

f_0 = Xy = A^\dagger A f + A^\dagger n_1 + A^\dagger n_2,

where n_1 is the component of the noise n in R(A) and n_2 the component orthogonal to R(A). Then:

A^\dagger A f = f  ⇐  N(A) = {0}

A^\dagger n_2 = 0  ⇐  X : projection learning

\|A^\dagger n_1\| = \frac{1}{\sqrt{M}} \|n_1\|  ⇐  {\frac{1}{\sqrt{M}} \psi_m}_{m=1}^M : PONB

[Diagram: A amplifies the signal by √M, while A^† attenuates the in-range noise component n_1 by 1/√M and removes n_2 entirely.]
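
These amplification factors can be checked numerically for the equispaced design of Example 1, a sketch in the matrix representation used earlier; the identity A*A = M I is what makes the factors exact.

```python
import numpy as np

# With equispaced points {psi_m / sqrt(M)} is a PONB, so A*A = M I:
# ||Af|| = sqrt(M) ||f||, and A^+ shrinks in-range noise by 1/sqrt(M).
N, M = 3, 21
ns = np.arange(-N, N + 1)
x = -np.pi + 2 * np.pi * np.arange(M) / M
A = np.exp(1j * np.outer(x, ns))

print(np.allclose(A.conj().T @ A, M * np.eye(2 * N + 1)))  # A*A = M I

a = np.random.randn(2 * N + 1)
print(np.isclose(np.linalg.norm(A @ a), np.sqrt(M) * np.linalg.norm(a)))

n1 = A @ np.random.randn(2 * N + 1)            # noise component in R(A)
print(np.isclose(np.linalg.norm(np.linalg.pinv(A) @ n1),
                 np.linalg.norm(n1) / np.sqrt(M)))
```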

SLIDE 11

Examples of PONB –1–

Example 1

Let M ≥ 2N + 1 (= dim(H)) and let c satisfy −π ≤ c ≤ −π + 2π/M. If we put {x_m}_{m=1}^M as

x_m = c + \frac{2\pi}{M}(m - 1),

then {\frac{1}{\sqrt{M}} \psi_m}_{m=1}^M forms a PONB in H.

[Figure: sample points x_1, x_2, . . . , x_M equally spaced on [−π, π].]

M sample points are fixed at 2π/M intervals and sample values are gathered once at each point; here ψ_m(x) = K(x, x_m) with the reproducing kernel K of Slide 6.
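
A sketch verifying Example 1 in the coefficient representation; M, N, and c are arbitrary admissible choices. The PONB condition reduces to (1/M) Σ_m ψ_m ⊗ ψ_m being the identity, plus equal norms.

```python
import numpy as np

# Example 1: with x_m = c + 2*pi*(m-1)/M, {psi_m / sqrt(M)} is a PONB.
N, M, c = 4, 11, -2.9              # any M >= 2N+1, c in [-pi, -pi + 2pi/M]
ns = np.arange(-N, N + 1)
x = c + 2 * np.pi * np.arange(M) / M
Psi = np.exp(-1j * np.outer(x, ns))      # row m: coefficients of psi_m

S = Psi.conj().T @ Psi / M               # (1/M) sum_m psi_m (x) psi_m^*
print(np.allclose(S, np.eye(2 * N + 1))) # True: POB expansion holds
norms = np.linalg.norm(Psi, axis=1) / np.sqrt(M)
print(np.allclose(norms, np.sqrt((2 * N + 1) / M)))  # True: equal norms
```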

SLIDE 12

Examples of PONB –2–

Let M = k(2N + 1), where k is a positive integer. For a general finite-dimensional Hilbert space H, {φ_m}_{m=1}^M becomes a PONB if {\sqrt{k} \, \phi_m}_{m=1}^M consists of k sets of ONBs in H.

Example 2

Let c satisfy −π ≤ c ≤ −π + 2π/(2N + 1). If we put {x_m}_{m=1}^M as

x_m = c + \frac{2\pi p}{2N + 1}, where p = m − 1 (mod 2N + 1),

then {\frac{1}{\sqrt{M}} \psi_m}_{m=1}^M forms a PONB in H.

[Figure: the 2N + 1 equispaced sample points are visited k times: x_1, . . . , x_{2N+1}, then x_{2N+2}, . . . , x_{2(2N+1)}, . . . , up to x_{M−2N}, . . . , x_M.]

(2N + 1) sample points are fixed at 2π/(2N + 1) intervals and sample values are gathered k times at each point.
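
The same check as before adapts directly; k and c are illustrative choices.

```python
import numpy as np

# Example 2: 2N+1 equispaced points, each sampled k times (M = k(2N+1)).
# {psi_m / sqrt(M)} is again a PONB, since {sqrt(k) psi_m / sqrt(M)}
# consists of k copies of an ONB.
N, k = 3, 4
M = k * (2 * N + 1)
c = -np.pi + 0.1                          # any c in [-pi, -pi + 2pi/(2N+1)]
ns = np.arange(-N, N + 1)
p = np.arange(M) % (2 * N + 1)            # p = m - 1 (mod 2N+1)
x = c + 2 * np.pi * p / (2 * N + 1)
Psi = np.exp(-1j * np.outer(x, ns))

print(np.allclose(Psi.conj().T @ Psi / M, np.eye(2 * N + 1)))  # True: PONB
```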

SLIDE 13

Computer Simulation 1

N = 3 (dim(H) = 7), M = 21

[Figure: target function and learning result on [−π, π].
(A) Optimal sampling: J_G = 0.333
(B) Random sampling: J_G = 1.202]

SLIDE 14

Computer Simulation 2

[Figure: J_G versus the number of training examples (M = 7, 14, . . . , 70) for optimal sampling and for random sampling (average of 100 trials).]
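
A sketch reproducing the flavor of this comparison, with σ^2 = 1; the exact random-sampling values depend on the draws, and the noise model and target used in the slide's experiment are not specified here.

```python
import numpy as np

# J_G for optimal (equispaced) vs. random sampling as M grows.
N = 3
ns = np.arange(-N, N + 1)

def JG(x):
    A = np.exp(1j * np.outer(x, ns))
    return np.trace(np.linalg.pinv(A @ A.conj().T)).real

rng = np.random.default_rng(0)
for M in range(7, 71, 7):
    opt = JG(-np.pi + 2 * np.pi * np.arange(M) / M)   # equals (2N+1)/M
    rnd = np.mean([JG(rng.uniform(-np.pi, np.pi, M)) for _ in range(100)])
    # Note: near-degenerate random designs at small M can inflate rnd.
    print(f"M={M:2d}  optimal {opt:.3f}  random {rnd:.3f}")
```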

SLIDE 15

Conclusions

  • 1. We showed that pseudo orthogonal bases (POBs) give the optimal solution to active learning in neural networks.

  • 2. By utilizing properties of POBs, we clarified the mechanism of achieving the optimal generalization.

  • 3. We gave two construction methods of PONBs.

SLIDE 16

Active Learning in Neural Networks

SLIDE 17

Projection Learning

Decompose the learning result as

f_0 = \underbrace{XAf}_{\text{signal component}} + \underbrace{Xn}_{\text{noise component}}.

Projection learning minimizes E_n \|Xn\|^2 under the constraint XAf = P_{R(A^*)} f, i.e., the signal component is the projection of f onto the approximation space R(A^*).

The projection learning operator is

X = V^\dagger A^* U^\dagger + Y(I − UU^\dagger),

where
Q : noise covariance matrix
A^* : adjoint operator of A
U = AA^* + Q
U^\dagger : Moore-Penrose generalized inverse of U
V = A^* U^\dagger A
Y : arbitrary operator
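
A sketch of this operator with Y = 0, in the matrix representation used throughout. For Q = σ^2 I projection learning reduces to X = A^† (Slide 7), which the last line checks numerically; the equispaced design is chosen only to keep A well conditioned.

```python
import numpy as np

# General projection-learning operator with Y = 0:
# X = V^+ A* U^+, where U = A A* + Q and V = A* U^+ A.
N, M, sigma2 = 3, 21, 0.25
ns = np.arange(-N, N + 1)
x = -np.pi + 2 * np.pi * np.arange(M) / M
A = np.exp(1j * np.outer(x, ns))
Q = sigma2 * np.eye(M)                        # noise covariance matrix

U = A @ A.conj().T + Q
V = A.conj().T @ np.linalg.pinv(U) @ A
X = np.linalg.pinv(V) @ A.conj().T @ np.linalg.pinv(U)

print(np.allclose(X, np.linalg.pinv(A)))      # True for Q = sigma^2 I
```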