Pseudo Orthogonal Bases Give the Optimal Generalization Capability in Neural Network Learning
Masashi Sugiyama, Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Japan
SPIE’s 44th Annual Meeting and Exhibition, Wavelet Applications in Signal and Image Processing VII
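Pseudo Orthogonal Bases (POBs)
Definition
Let $H$ be a finite-dimensional Hilbert space and $M \ge \dim(H)$. A set $\{\varphi_m\}_{m=1}^{M}$ of elements in $H$ is called a POB if any $f$ in $H$ is expressed as
$$f = \sum_{m=1}^{M} \langle f, \varphi_m \rangle \varphi_m,$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product in $H$.
[Figure: an example of a POB $\{\varphi_1, \varphi_2, \varphi_3\}$ with $H = \mathbb{R}^2$, $M = 3$.]
- If $M = \dim(H)$, a POB reduces to an ONB.
- A POB is a tight frame with frame bound 1:
$$\|f\|^2 = \sum_{m=1}^{M} |\langle f, \varphi_m \rangle|^2.$$
If $\|\varphi_1\| = \|\varphi_2\| = \cdots = \|\varphi_M\|$, then $\{\varphi_m\}_{m=1}^{M}$ is called a pseudo orthonormal basis (PONB).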
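The definition can be checked numerically. The following is a minimal Python/NumPy sketch, assuming three equal-norm vectors in $\mathbb{R}^2$ placed 120 degrees apart; this set is chosen for illustration only and is not necessarily the one drawn in the figure. The helper name `is_pob` is hypothetical.

```python
import numpy as np

def is_pob(phi, tol=1e-12):
    """Return True if the rows of `phi` satisfy f = sum_m <f, phi_m> phi_m for all f.

    For real vectors this is equivalent to sum_m phi_m phi_m^T = I.
    """
    phi = np.asarray(phi, dtype=float)
    outer_sum = phi.T @ phi  # sum over m of the outer products phi_m phi_m^T
    return np.allclose(outer_sum, np.eye(phi.shape[1]), atol=tol)

# Illustrative assumption: three vectors 120 degrees apart, each of norm sqrt(2/3).
angles = np.pi / 2 + 2 * np.pi * np.arange(3) / 3
phi = np.sqrt(2.0 / 3.0) * np.column_stack([np.cos(angles), np.sin(angles)])

f = np.array([0.3, -1.7])                            # arbitrary test vector
reconstruction = sum(np.dot(f, p) * p for p in phi)  # sum_m <f, phi_m> phi_m
norms = np.linalg.norm(phi, axis=1)

print(is_pob(phi))                                   # POB expansion holds
print(np.allclose(reconstruction, f))                # f is recovered
print(np.isclose(np.sum((phi @ f) ** 2), f @ f))     # tight frame with bound 1
print(np.allclose(norms, norms[0]))                  # equal norms, hence a PONB
```

Since $M = 3 > \dim(H) = 2$, the three vectors are linearly dependent, yet the expansion has the same form as for an orthonormal basis.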
Frame, POB, PBOB, ...
- Frame
  – Duffin and Schaeffer (1952)
  – Young (1980)
- Pseudo orthogonal basis (POB)
  – Ogawa and Iijima (1973)
$$f = \sum_{m=1}^{M} \langle f, \varphi_m \rangle \varphi_m$$
- Pseudo biorthogonal basis (PBOB)
  – Ogawa (1978)
$$f = \sum_{m=1}^{M} \langle f, \varphi_m^* \rangle \varphi_m$$
Applications: signal restoration, computerized tomography, neural network learning, ...
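Learning in Neural Networks
[Figure: a neural network with inputs $x = (\xi_1, \xi_2, \ldots, \xi_L)$, units $u_1, u_2, \ldots, u_N$, weights $v_{11}, v_{12}, \ldots, v_{LN}$ and $w_1, w_2, \ldots, w_N$, and output $y = f_0(x)$. Neurons are connected by synapses with modifiable weights.]
Purpose of NN Learning
Modify the weights by using training examples
$$\{(x_m, y_m) \mid y_m = f(x_m) + n_m\}_{m=1}^{M},$$
and obtain the underlying input-output rule.
[Figure: target function $f$ and learning result $f_0$, with sample points $x_1, x_2, x_3$ and sample values $y_1, y_2, y_3$.]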
NN Learning as an Inverse Problem
[Figure: the sampling operator $A$ maps the target function $f$ in the function space $H$ to the sample value space $\mathbb{C}^M$, where the noise $n$ is added to give $y$; the learning operator $X$ maps $y$ back to the learning result $f_0$ in $H$.]
Sampling:
$$y = (y_1, \ldots, y_M)^\top = Af + n, \qquad Af = (f(x_1), f(x_2), \ldots, f(x_M))^\top \ \text{(sample value vector)}$$
Learning:
$$f_0 = Xy$$
Representation of the sampling operator $A$
$$A = \sum_{m=1}^{M} (e_m \otimes \psi_m)$$
$\psi_m(x) = K(x, x_m)$, where $K(x, x')$ is the reproducing kernel, so that $\langle f, \psi_m \rangle = f(x_m)$.
Trigonometric Polynomial Space
A Hilbert space $H$ is called a trigonometric polynomial space of order $N$ if $H$ is spanned by $\{\exp(inx)\}_{n=-N}^{N}$, which are defined on $[-\pi, \pi]$, and the inner product in $H$ is defined as
$$\langle f, g \rangle = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x)\overline{g(x)}\,dx.$$
$$K(x, x') = \begin{cases} \dfrac{\sin\frac{(2N+1)(x - x')}{2}}{\sin\frac{x - x'}{2}} & (x \ne x') \\[2ex] 2N + 1 & (x = x') \end{cases}$$
[Figure: profile of the reproducing kernel of a trigonometric polynomial space of order 5 ($x' = 0$).]
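The kernel formula and the reproducing property $\langle f, K(\cdot, x') \rangle = f(x')$ can be checked numerically. The sketch below is a minimal Python/NumPy illustration; it assumes the inner product uses the complex conjugate of the second argument, and it evaluates the integral as an average over a uniform grid, which is exact here because the integrand is a trigonometric polynomial of low degree. The names `kernel`, `coef`, `grid` and `x0` are hypothetical.

```python
import numpy as np

def kernel(x, xp, N):
    """Reproducing kernel of the trigonometric polynomial space of order N."""
    d = np.asarray(x, dtype=float) - xp
    num = np.sin((2 * N + 1) * d / 2)
    den = np.sin(d / 2)
    small = np.isclose(den, 0.0)
    # Use the limiting value 2N + 1 where the denominator vanishes; divide elsewhere.
    return np.where(small, 2 * N + 1, num / np.where(small, 1.0, den))

N = 5
n = np.arange(-N, N + 1)
rng = np.random.default_rng(0)
coef = rng.standard_normal(2 * N + 1) + 1j * rng.standard_normal(2 * N + 1)

def f(x):
    """A random element of H: f(x) = sum_n coef_n exp(i n x)."""
    return np.exp(1j * np.outer(np.atleast_1d(x), n)) @ coef

grid = -np.pi + 2 * np.pi * np.arange(64) / 64   # uniform grid on [-pi, pi)
x0 = 0.7                                         # arbitrary evaluation point

# The closed form agrees with the sum of exp(i n (x - x0)) over n = -N, ..., N.
dirichlet_sum = np.exp(1j * np.outer(grid - x0, n)).sum(axis=1)
print(np.allclose(kernel(grid, x0, N), dirichlet_sum))

# <f, K(., x0)> = (1 / 2 pi) * integral of f(x) conj(K(x, x0)) dx = f(x0)
inner = np.mean(f(grid) * np.conj(kernel(grid, x0, N)))
print(np.isclose(inner, f(x0)[0]))
```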
Process of NN Learning
[Figure: $A$ maps $f \in H$ to $Af = (f(x_1), f(x_2), \ldots, f(x_M))^\top \in \mathbb{C}^M$; the noise $n$ is added to give $y$; $X$ maps $y$ to the learning result $f_0 \in H$.]
1. (Active learning) Sample points $\{x_m\}_{m=1}^{M}$ are determined.
2. Sample values $\{y_m\}_{m=1}^{M}$ are gathered.
3. $X$ and $f_0$ are calculated: projection learning. When the noise covariance matrix is $\sigma^2 I$, $X = A^\dagger$, where $A^\dagger$ is the Moore-Penrose generalized inverse of $A$.
Our goal
We give the optimal solution to active learning.
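Active Learning
Find a set $\{x_m\}_{m=1}^{M}$ of sample points which minimizes the generalization error
$$J_G = E_n \|f_0 - f\|^2,$$
where $E_n$ denotes the ensemble average over the noise. If the noise covariance matrix is $\sigma^2 I$, then $J_G$ becomes
$$J_G = \underbrace{\|P_{\mathcal{N}(A)} f\|^2}_{\text{bias}} + \underbrace{\sigma^2 \mathrm{tr}\big((AA^*)^\dagger\big)}_{\text{variance}},$$
where $\mathcal{N}(A)$ denotes the null space of $A$ and $P_{\mathcal{N}(A)}$ the orthogonal projection onto it.
The bias of $f_0$ is 0 $\iff$ $\mathcal{N}(A) = \{0\}$.
Strategy
Find a set $\{x_m\}_{m=1}^{M}$ of sample points which minimizes
$$J_G = \sigma^2 \mathrm{tr}\big((AA^*)^\dagger\big)$$
under the constraint $\mathcal{N}(A) = \{0\}$.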
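Main Theorem
Suppose the noise covariance matrix is $\sigma^2 I$ with $\sigma^2 > 0$. $J_G$ is minimized under the constraint $\mathcal{N}(A) = \{0\}$ if and only if $\{\frac{1}{\sqrt{M}}\psi_m\}_{m=1}^{M}$ forms a PONB in $H$. In this case, the minimum value of $J_G$ is $\frac{\sigma^2 (2N + 1)}{M}$.
$$f = \sum_{m=1}^{M} \Big\langle f, \tfrac{1}{\sqrt{M}}\psi_m \Big\rangle \tfrac{1}{\sqrt{M}}\psi_m \quad \text{for all } f \in H,$$
$$\|\psi_1\| = \|\psi_2\| = \cdots = \|\psi_M\|.$$
$\psi_m(x) = K(x, x_m)$, where $K(x, x')$ is the reproducing kernel:
$$K(x, x') = \begin{cases} \dfrac{\sin\frac{(2N+1)(x - x')}{2}}{\sin\frac{x - x'}{2}} & (x \ne x') \\[2ex] 2N + 1 & (x = x') \end{cases}$$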
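The minimum value stated in the theorem can be compared against a concrete computation. The sketch below is a minimal Python/NumPy illustration, not the authors' code. It assumes $A$ is represented in the orthonormal basis $\{\exp(inx)\}_{n=-N}^{N}$, so that $A_{mn} = \exp(inx_m)$, and it uses the fact that $\mathrm{tr}((AA^*)^\dagger) = \mathrm{tr}((A^*A)^{-1})$ whenever $\mathcal{N}(A) = \{0\}$, since $AA^*$ and $A^*A$ share their nonzero eigenvalues. The function names, the random seed, and the choice of equally spaced points starting at $-\pi$ are assumptions for illustration.

```python
import numpy as np

def sampling_matrix(x, N):
    """Matrix of A in the orthonormal basis {exp(inx)}: A[m, n] = exp(i n x_m)."""
    n = np.arange(-N, N + 1)
    return np.exp(1j * np.outer(x, n))

def variance_term(x, N, sigma2=1.0):
    """sigma^2 tr((A A*)^dagger), computed as sigma^2 tr((A* A)^{-1}); needs N(A) = {0}."""
    A = sampling_matrix(x, N)
    return sigma2 * np.trace(np.linalg.inv(A.conj().T @ A)).real

N, M, sigma2 = 3, 21, 1.0

# Equally spaced sample points, 2 pi / M apart.
x_equal = -np.pi + 2 * np.pi * np.arange(M) / M

# Randomly drawn sample points for comparison (seed chosen arbitrarily).
rng = np.random.default_rng(0)
x_random = rng.uniform(-np.pi, np.pi, size=M)

J_equal = variance_term(x_equal, N, sigma2)
J_random = variance_term(x_random, N, sigma2)

print(J_equal, sigma2 * (2 * N + 1) / M)   # the two values should agree
print(J_random >= J_equal)                 # random points cannot do better
```

With $N = 3$, $M = 21$ and $\sigma^2 = 1$, the theorem's minimum is $7/21 \approx 0.333$.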
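Interpretation
When $\{\frac{1}{\sqrt{M}}\psi_m\}_{m=1}^{M}$ forms a PONB in $H$,
$$\|Af\| = \sqrt{M}\,\|f\|.$$
$$f_0 = Xy = A^\dagger A f + A^\dagger n_1 + A^\dagger n_2.$$
- $A^\dagger A f = f$ $\Longleftarrow$ $\mathcal{N}(A) = \{0\}$
- $A^\dagger n_2 = 0$ $\Longleftarrow$ $X$: projection learning
- $\|A^\dagger n_1\| = \frac{1}{\sqrt{M}}\|n_1\|$ $\Longleftarrow$ $\{\frac{1}{\sqrt{M}}\psi_m\}_{m=1}^{M}$: PONB
[Figure: $A$ maps $f \in H$ to $Af \in \mathcal{R}(A) \subset \mathbb{C}^M$ with amplification $\times\sqrt{M}$; the noise $n$ in $y = Af + n$ is split into components $n_1$ and $n_2$; $X = A^\dagger$ maps back to $f_0 \in H$ with amplification $\times\frac{1}{\sqrt{M}}$.]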
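The three relations can be observed on a small example. The following Python/NumPy sketch assumes equally spaced sample points, a representation of $A$ in the orthonormal basis $\{\exp(inx)\}_{n=-N}^{N}$, and a split of the noise into its component $n_1$ in $\mathcal{R}(A)$ and the remainder $n_2$ orthogonal to $\mathcal{R}(A)$. All variable names and the random seed are hypothetical choices for illustration.

```python
import numpy as np

N, M = 3, 21
n = np.arange(-N, N + 1)
x = -np.pi + 2 * np.pi * np.arange(M) / M       # equally spaced sample points
A = np.exp(1j * np.outer(x, n))                 # A[m, n] = exp(i n x_m)
A_pinv = np.linalg.pinv(A)                      # X = A^dagger

rng = np.random.default_rng(1)
f = rng.standard_normal(2 * N + 1) + 1j * rng.standard_normal(2 * N + 1)
noise = rng.standard_normal(M) + 1j * rng.standard_normal(M)

P = A @ A_pinv                                  # orthogonal projector onto R(A)
n1 = P @ noise                                  # noise component inside R(A)
n2 = noise - n1                                 # noise component orthogonal to R(A)

# ||Af|| = sqrt(M) ||f||: sampling amplifies the signal by sqrt(M).
print(np.isclose(np.linalg.norm(A @ f), np.sqrt(M) * np.linalg.norm(f)))
# A^dagger A f = f: no bias, because N(A) = {0}.
print(np.allclose(A_pinv @ A @ f, f))
# A^dagger n2 = 0: the component outside R(A) is discarded.
print(np.allclose(A_pinv @ n2, 0))
# ||A^dagger n1|| = ||n1|| / sqrt(M): the remaining noise is shrunk by sqrt(M).
print(np.isclose(np.linalg.norm(A_pinv @ n1), np.linalg.norm(n1) / np.sqrt(M)))
```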
Examples of PONB –1–
Example 1
Let $M \ge 2N + 1$ ($= \dim(H)$) and let $c$ satisfy $-\pi \le c \le -\pi + \frac{2\pi}{M}$. If we put $\{x_m\}_{m=1}^{M}$ as
$$x_m = c + \frac{2\pi}{M}(m - 1),$$
then $\{\frac{1}{\sqrt{M}}\psi_m\}_{m=1}^{M}$ forms a PONB in $H$.
[Figure: sample points $x_1, x_2, \ldots, x_M$ on $[-\pi, \pi]$.] $M$ sample points are fixed at $2\pi/M$ intervals and sample values are gathered once at each point.
$\psi_m(x) = K(x, x_m)$, where $K(x, x')$ is the reproducing kernel:
$$K(x, x') = \begin{cases} \dfrac{\sin\frac{(2N+1)(x - x')}{2}}{\sin\frac{x - x'}{2}} & (x \ne x') \\[2ex] 2N + 1 & (x = x') \end{cases}$$
Examples of PONB –2–
Let $M = k(2N + 1)$, where $k$ is a positive integer. For a general finite-dimensional Hilbert space $H$, $\{\varphi_m\}_{m=1}^{M}$ becomes a PONB if $\{\sqrt{k}\,\varphi_m\}_{m=1}^{M}$ consists of $k$ sets of ONBs in $H$.
Example 2
Let $c$ satisfy $-\pi \le c \le -\pi + \frac{2\pi}{2N + 1}$. If we put $\{x_m\}_{m=1}^{M}$ as
$$x_m = c + \frac{2\pi p}{2N + 1}, \qquad p = m - 1 \pmod{2N + 1},$$
then $\{\frac{1}{\sqrt{M}}\psi_m\}_{m=1}^{M}$ forms a PONB in $H$.
[Figure: sample points on $[-\pi, \pi]$, with $x_1, \ldots, x_{2N+1}$, then $x_{2N+2}, \ldots, x_{2(2N+1)}$, and so on up to $x_{M-2N}, \ldots, x_M$, repeated $k$ times at the same locations.] $(2N + 1)$ sample points are fixed at $2\pi/(2N + 1)$ intervals and sample values are gathered $k$ times at each point.
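Both constructions can be generated and checked in a few lines. The sketch below is a minimal Python/NumPy illustration under the assumption that $A$ is represented in the orthonormal basis $\{\exp(inx)\}_{n=-N}^{N}$; in these coordinates the expansion $f = \sum_m \langle f, \frac{1}{\sqrt{M}}\psi_m \rangle \frac{1}{\sqrt{M}}\psi_m$ is equivalent to $\frac{1}{M}A^*A = I$. The function names and the default $c = -\pi$ are hypothetical.

```python
import numpy as np

def is_ponb_sampling(x, N, tol=1e-10):
    """Check whether {psi_m / sqrt(M)} is a PONB, i.e. whether (1/M) A* A = I.

    Every psi_m has the same norm sqrt(2N + 1), so only the expansion is checked.
    """
    x = np.asarray(x, dtype=float)
    n = np.arange(-N, N + 1)
    A = np.exp(1j * np.outer(x, n))              # A[m, n] = exp(i n x_m)
    return np.allclose(A.conj().T @ A / len(x), np.eye(2 * N + 1), atol=tol)

def example1_points(N, M, c=-np.pi):
    """x_m = c + (2 pi / M)(m - 1) for m = 1, ..., M, intended for M >= 2N + 1."""
    return c + 2 * np.pi * np.arange(M) / M

def example2_points(N, k, c=-np.pi):
    """x_m = c + 2 pi p / (2N + 1), p = (m - 1) mod (2N + 1), with M = k(2N + 1)."""
    p = np.arange(k * (2 * N + 1)) % (2 * N + 1)
    return c + 2 * np.pi * p / (2 * N + 1)

N = 3
print(is_ponb_sampling(example1_points(N, M=10), N))   # Example 1, M >= 2N + 1
print(is_ponb_sampling(example2_points(N, k=3), N))    # Example 2, M = 3(2N + 1)
print(is_ponb_sampling(example1_points(N, M=6), N))    # M < 2N + 1 cannot work
```

Example 1 allows any $M \ge 2N + 1$, while Example 2 needs only $2N + 1$ distinct sampling locations, at the cost of restricting $M$ to multiples of $2N + 1$.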
Computer Simulation 1
$N = 3$ ($\dim(H) = 7$), $M = 21$
[Figure: target function and learning result in two panels. (A) Optimal sampling: $J_G = 0.333$. (B) Random sampling: $J_G = 1.202$.]
Computer Simulation 2
[Figure: $J_G$ versus the number of training examples (7 to 70) for optimal sampling and random sampling (average of 100 trials).]
Conclusions
1. We showed that pseudo orthogonal bases (POBs) give the optimal solution to active learning in neural networks.
2. By utilizing properties of POBs, we clarified the mechanism of achieving the optimal generalization.
3. We gave two construction methods of PONBs.
Active Learning in Neural Networks
Projection learning
$$f_0 = \underbrace{XAf}_{\text{signal component}} + \underbrace{Xn}_{\text{noise component}}$$
Minimize $E_n\|Xn\|^2$ under the constraint $XAf = P_{\mathcal{R}(A^*)} f$.
[Figure: $f$ in $H$, the approximation space $\mathcal{R}(A^*)$, and the learning result $f_0$.]
Projection learning operator
$$X = V^\dagger A^* U^\dagger + Y(I - UU^\dagger)$$
- $Q$: noise covariance matrix
- $A^*$: adjoint operator of $A$
- $U = AA^* + Q$
- $U^\dagger$: Moore-Penrose generalized inverse of $U$
- $V = A^* U^\dagger A$
- $Y$: arbitrary operator
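The operator formula can be evaluated directly for finite matrices. The sketch below is a minimal Python/NumPy illustration, assuming the arbitrary operator $Y$ is set to zero and $A$ is the sampling matrix in the orthonormal basis $\{\exp(inx)\}_{n=-N}^{N}$ for randomly drawn sample points. It checks that $X$ coincides with $A^\dagger$ when $Q = \sigma^2 I$. The function name, the seed, and the value of $\sigma^2$ are hypothetical.

```python
import numpy as np

def projection_learning_operator(A, Q):
    """X = V^dagger A* U^dagger, taking the arbitrary operator Y = 0 (an assumption)."""
    A_star = A.conj().T
    U = A @ A_star + Q                       # U = A A* + Q
    U_pinv = np.linalg.pinv(U)               # Moore-Penrose generalized inverse of U
    V = A_star @ U_pinv @ A                  # V = A* U^dagger A
    return np.linalg.pinv(V) @ A_star @ U_pinv

N, M, sigma2 = 3, 21, 0.5
rng = np.random.default_rng(0)
x = rng.uniform(-np.pi, np.pi, size=M)       # illustrative random sample points
A = np.exp(1j * np.outer(x, np.arange(-N, N + 1)))

X = projection_learning_operator(A, sigma2 * np.eye(M))

print(np.allclose(X, np.linalg.pinv(A)))     # X = A^dagger when Q = sigma^2 I
print(np.allclose(X @ A, np.eye(2 * N + 1))) # X A = P_{R(A*)} = I when N(A) = {0}
```

Setting $Y = 0$ picks one member of the family of operators; the term $Y(I - UU^\dagger)$ vanishes anyway here because $U$ is invertible when $Q = \sigma^2 I$ with $\sigma^2 > 0$.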