

SLIDE 1

Pseudo Orthogonal Bases Give the Optimal Generalization Capability in Neural Network Learning

Masashi Sugiyama, Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Japan

SPIE's 44th Annual Meeting and Exhibition
Wavelet Applications in Signal and Image Processing VII

SLIDE 2

Pseudo Orthogonal Bases (POBs)

Definition

Let H be a finite-dimensional Hilbert space and M ≥ dim(H). A set {φ_m}_{m=1}^M of elements in H is called a POB if any f in H is expressed as

f = \sum_{m=1}^{M} \langle f, \phi_m \rangle \phi_m,

where ⟨·, ·⟩ denotes the inner product in H.

[Figure: a POB in H = R^2 with M = 3: three vectors φ_1, φ_2, φ_3 with components involving 1/2, 1/√2, and −1/2.]

  • If M = dim(H), a POB reduces to an orthonormal basis (ONB).

  • A POB is a tight frame with frame bound 1:

\|f\|^2 = \sum_{m=1}^{M} |\langle f, \phi_m \rangle|^2.

  • If \|\phi_1\| = \|\phi_2\| = \cdots = \|\phi_M\|, then {φ_m}_{m=1}^M is called a pseudo orthonormal basis (PONB).
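
The POB expansion and the frame-bound-1 identity are easy to check numerically. Below is a minimal NumPy sketch using a standard equal-norm tight frame of three vectors in R^2; the vectors and names are illustrative and not necessarily those of the slide's figure.

```python
import numpy as np

# Three equal-norm vectors in R^2 forming a POB (tight frame with
# frame bound 1).  The scaling sqrt(2/3) makes sum_m phi_m phi_m^T = I.
angles = 2 * np.pi * np.arange(3) / 3
phis = np.sqrt(2 / 3) * np.stack([np.cos(angles), np.sin(angles)], axis=1)

f = np.array([0.7, -1.3])                     # arbitrary element of H = R^2
coeffs = phis @ f                             # <f, phi_m>
reconstruction = coeffs @ phis                # sum_m <f, phi_m> phi_m

print(np.allclose(reconstruction, f))         # True: POB expansion holds
print(np.isclose(np.sum(coeffs**2), f @ f))   # True: tight frame, bound 1
```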

SLIDE 3

Frame, POB, PBOB, · · ·

  • Frame
    – Duffin and Schaeffer (1952)
    – Young (1980)

  • Pseudo orthogonal basis (POB)
    – Ogawa and Iijima (1973)

f = \sum_{m=1}^{M} \langle f, \phi_m \rangle \phi_m

  • Pseudo biorthogonal basis (PBOB)
    – Ogawa (1978)

f = \sum_{m=1}^{M} \langle f, \phi_m^* \rangle \phi_m

Applications: signal restoration, computerized tomography, neural network learning, . . .

SLIDE 4

Learning in Neural Networks

[Figure: a three-layer neural network with inputs ξ_1, ξ_2, . . . , ξ_L, hidden neurons u_1, u_2, . . . , u_N, synaptic weights v_11, . . . , v_LN, modifiable weights w_1, . . . , w_N, and output y = f_0(x).]

Purpose of NN Learning

Modify the weights by using training examples

{(x_m, y_m) | y_m = f(x_m) + n_m}_{m=1}^M,

and obtain the underlying input-output rule.

[Figure: a target function f and a learning result f_0 passing near the training examples (x_1, y_1), (x_2, y_2), (x_3, y_3).]

SLIDE 5

NN Learning as an Inverse Problem

[Diagram: the target function f in the function space H is mapped by the sampling operator A to C^M, noise n is added to give the sample value vector y, and the learning operator X maps y back to the learning result f_0 in H.]

sampling:

y = (y_1, . . . , y_M)^T = Af + n, where Af = (f(x_1), f(x_2), . . . , f(x_M))^T

learning:

f_0 = Xy

Representation of the sampling operator A:

A = \sum_{m=1}^{M} (e_m \otimes \psi_m),

where ψ_m(x) = K(x, x_m), K(x, x') is the reproducing kernel of H, and ⟨f, ψ_m⟩ = f(x_m).
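
To make A concrete, the sketch below represents it as an M × (2N+1) matrix with respect to the orthonormal basis {exp(inx)} of the trigonometric polynomial space defined on the next slide; row m then maps the coefficient vector of f to the sample f(x_m). This matrix representation and the variable names are assumptions of the sketch, not part of the slides.

```python
import numpy as np

# Minimal sketch: A[m, n] = exp(i n x_m) maps the coefficient vector of
# f(x) = sum_n a_n exp(i n x), n = -N..N, to the samples f(x_m).
N, M = 3, 21
ns = np.arange(-N, N + 1)
sample_points = np.random.uniform(-np.pi, np.pi, size=M)
A = np.exp(1j * np.outer(sample_points, ns))

a = np.random.randn(2 * N + 1) + 1j * np.random.randn(2 * N + 1)
f = lambda x: np.sum(a * np.exp(1j * ns * x))
print(np.allclose(A @ a, [f(x) for x in sample_points]))  # True
```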

SLIDE 6

Trigonometric Polynomial Space

A Hilbert space H is called a trigonometric polynomial space of order N if H is spanned by {exp(inx)}_{n=−N}^{N} defined on [−π, π] and the inner product in H is defined as

\langle f, g \rangle = \frac{1}{2\pi} \int_{-\pi}^{\pi} f(x) \overline{g(x)} \, dx.

The reproducing kernel is

K(x, x') =
\begin{cases}
\sin\!\big(\tfrac{(2N+1)(x-x')}{2}\big) \,/\, \sin\!\big(\tfrac{x-x'}{2}\big) & (x \neq x') \\
2N + 1 & (x = x')
\end{cases}

[Figure: profile of the reproducing kernel of a trigonometric polynomial space of order 5 (x' = 0).]
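
As a numerical sanity check, the following sketch evaluates this kernel and verifies the reproducing property ⟨f, ψ_m⟩ = f(x_m); the quadrature grid, test function f(x) = exp(2ix), and sample point are arbitrary choices.

```python
import numpy as np

# Reproducing kernel (Dirichlet kernel) of the order-N trigonometric
# polynomial space, with the removable singularity handled explicitly.
def K(x, xp, N):
    d = x - xp
    if np.isclose(np.sin(d / 2), 0.0):
        return 2 * N + 1
    return np.sin((2 * N + 1) * d / 2) / np.sin(d / 2)

# Check <f, psi_m> = f(x_m) for f(x) = exp(2ix), N = 5, x_m = 0.8.
# The rectangle rule is exact for trigonometric polynomials.
N, x_m = 5, 0.8
xs = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
psi = np.array([K(x, x_m, N) for x in xs])
inner = np.mean(np.exp(2j * xs) * psi)  # (1/2pi) * int f psi dx (psi is real)
print(np.isclose(inner, np.exp(2j * x_m)))  # True
```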

SLIDE 7

Process of NN Learning

[Diagram: the same inverse-problem diagram as on Slide 5.]

  • 1. (Active Learning) Sample points {x_m}_{m=1}^M are determined.

  • 2. Sample values {y_m}_{m=1}^M are gathered.

  • 3. X and f_0 are calculated: Projection Learning

When the noise covariance matrix is σ^2 I, X = A^†, where A^† is the Moore-Penrose generalized inverse of A.

Our goal

We give the optimal solution to active learning.
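
A sketch of step 3 under σ^2 I noise, with X taken as the Moore-Penrose inverse of the matrix representation of A from Slide 5; the noise level and variable names are illustrative.

```python
import numpy as np

# Projection learning with noise covariance sigma^2 I: the learning
# operator is A^+, so the learned coefficient vector is pinv(A) @ y.
N, M, sigma = 3, 21, 0.3
ns = np.arange(-N, N + 1)
x = np.random.uniform(-np.pi, np.pi, size=M)
A = np.exp(1j * np.outer(x, ns))

a_true = np.random.randn(2 * N + 1) + 1j * np.random.randn(2 * N + 1)
y = A @ a_true + sigma * np.random.randn(M)   # noisy samples

a0 = np.linalg.pinv(A) @ y                    # learning: f0 = X y, X = A^+
print(np.linalg.norm(a0 - a_true))            # small when the noise is small
```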

SLIDE 8

Active Learning

Find a set {x_m}_{m=1}^M of sample points which minimizes the generalization error

J_G = E_n \|f_0 - f\|^2,

where E_n denotes the ensemble average over the noise. If the noise covariance matrix is σ^2 I, then J_G becomes

J_G = \underbrace{\|P_{N(A)} f\|^2}_{\text{bias}} + \underbrace{\sigma^2 \, \mathrm{tr}\big((AA^*)^\dagger\big)}_{\text{variance}},

where N(A) denotes the null space of A. The bias of f_0 is 0 ⟺ N(A) = {0}. This leads to the following strategy.

Strategy

Find a set {x_m}_{m=1}^M of sample points which minimizes

J_G = \sigma^2 \, \mathrm{tr}\big((AA^*)^\dagger\big)

under the constraint N(A) = {0}.
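
The variance term is directly computable for any candidate design. A minimal sketch follows, with σ^2 = 1; the equispaced design anticipates Example 1 on Slide 11.

```python
import numpy as np

# Variance term J_G = sigma^2 tr((A A*)^+) for a given sample design,
# assuming N(A) = {0} so that the bias term vanishes.
def generalization_error(sample_points, N, sigma2):
    ns = np.arange(-N, N + 1)
    A = np.exp(1j * np.outer(sample_points, ns))
    return sigma2 * np.trace(np.linalg.pinv(A @ A.conj().T)).real

N, M = 3, 21
equispaced = -np.pi + 2 * np.pi * np.arange(M) / M
random_pts = np.random.uniform(-np.pi, np.pi, size=M)
print(generalization_error(equispaced, N, 1.0))  # (2N+1)/M = 1/3
print(generalization_error(random_pts, N, 1.0))  # larger in general
```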

SLIDE 9

Main Theorem

Suppose the noise covariance matrix is σ^2 I with σ^2 > 0. Then J_G is minimized under the constraint N(A) = {0} if and only if {\frac{1}{\sqrt{M}} \psi_m}_{m=1}^M forms a PONB in H. In this case, the minimum value of J_G is

J_G = \frac{\sigma^2 (2N + 1)}{M}.

That is, the sample points must satisfy

f = \sum_{m=1}^{M} \Big\langle f, \tfrac{1}{\sqrt{M}} \psi_m \Big\rangle \tfrac{1}{\sqrt{M}} \psi_m  for all f ∈ H,

\|\psi_1\| = \|\psi_2\| = \cdots = \|\psi_M\|,

where ψ_m(x) = K(x, x_m) and K(x, x') is the reproducing kernel given on Slide 6.

SLIDE 10

Interpretation

When {\frac{1}{\sqrt{M}} \psi_m}_{m=1}^M forms a PONB in H,

\|Af\| = \sqrt{M} \, \|f\|.

Decompose the learning result as

f_0 = Xy = A^\dagger A f + A^\dagger n_1 + A^\dagger n_2,

where n_1 is the component of the noise n in R(A) and n_2 the component orthogonal to R(A). Then:

A^\dagger A f = f  ⇐  N(A) = {0}

A^\dagger n_2 = 0  ⇐  X : projection learning

\|A^\dagger n_1\| = \frac{1}{\sqrt{M}} \|n_1\|  ⇐  {\frac{1}{\sqrt{M}} \psi_m}_{m=1}^M : PONB

[Diagram: A amplifies the signal by √M, while A^† attenuates the in-range noise component n_1 by 1/√M and removes n_2 entirely.]
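
These amplification factors can be checked numerically for the equispaced design of Example 1, a sketch in the matrix representation used earlier; the identity A*A = M I is what makes the factors exact.

```python
import numpy as np

# With equispaced points {psi_m / sqrt(M)} is a PONB, so A*A = M I:
# ||Af|| = sqrt(M) ||f||, and A^+ shrinks in-range noise by 1/sqrt(M).
N, M = 3, 21
ns = np.arange(-N, N + 1)
x = -np.pi + 2 * np.pi * np.arange(M) / M
A = np.exp(1j * np.outer(x, ns))

print(np.allclose(A.conj().T @ A, M * np.eye(2 * N + 1)))  # A*A = M I

a = np.random.randn(2 * N + 1)
print(np.isclose(np.linalg.norm(A @ a), np.sqrt(M) * np.linalg.norm(a)))

n1 = A @ np.random.randn(2 * N + 1)            # noise component in R(A)
print(np.isclose(np.linalg.norm(np.linalg.pinv(A) @ n1),
                 np.linalg.norm(n1) / np.sqrt(M)))
```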

SLIDE 11

Examples of PONB –1–

Example 1

Let M ≥ 2N + 1 (= dim(H)) and let c satisfy −π ≤ c ≤ −π + 2π/M. If we put {x_m}_{m=1}^M as

x_m = c + \frac{2\pi}{M}(m - 1),

then {\frac{1}{\sqrt{M}} \psi_m}_{m=1}^M forms a PONB in H.

[Figure: sample points x_1, x_2, . . . , x_M equally spaced on [−π, π].]

M sample points are fixed at 2π/M intervals and sample values are gathered once at each point; here ψ_m(x) = K(x, x_m) with the reproducing kernel K of Slide 6.
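
A sketch verifying Example 1 in the coefficient representation; M, N, and c are arbitrary admissible choices. The PONB condition reduces to (1/M) Σ_m ψ_m ⊗ ψ_m being the identity, plus equal norms.

```python
import numpy as np

# Example 1: with x_m = c + 2*pi*(m-1)/M, {psi_m / sqrt(M)} is a PONB.
N, M, c = 4, 11, -2.9              # any M >= 2N+1, c in [-pi, -pi + 2pi/M]
ns = np.arange(-N, N + 1)
x = c + 2 * np.pi * np.arange(M) / M
Psi = np.exp(-1j * np.outer(x, ns))      # row m: coefficients of psi_m

S = Psi.conj().T @ Psi / M               # (1/M) sum_m psi_m (x) psi_m^*
print(np.allclose(S, np.eye(2 * N + 1))) # True: POB expansion holds
norms = np.linalg.norm(Psi, axis=1) / np.sqrt(M)
print(np.allclose(norms, np.sqrt((2 * N + 1) / M)))  # True: equal norms
```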

SLIDE 12

Examples of PONB –2–

Let M = k(2N + 1), where k is a positive integer. For a general finite-dimensional Hilbert space H, {φ_m}_{m=1}^M becomes a PONB if {\sqrt{k} \, \phi_m}_{m=1}^M consists of k sets of ONBs in H.

Example 2

Let c satisfy −π ≤ c ≤ −π + 2π/(2N + 1). If we put {x_m}_{m=1}^M as

x_m = c + \frac{2\pi p}{2N + 1}, where p = m − 1 (mod 2N + 1),

then {\frac{1}{\sqrt{M}} \psi_m}_{m=1}^M forms a PONB in H.

[Figure: the 2N + 1 equispaced sample points are visited k times: x_1, . . . , x_{2N+1}, then x_{2N+2}, . . . , x_{2(2N+1)}, . . . , up to x_{M−2N}, . . . , x_M.]

(2N + 1) sample points are fixed at 2π/(2N + 1) intervals and sample values are gathered k times at each point.
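
The same check as before adapts directly; k and c are illustrative choices.

```python
import numpy as np

# Example 2: 2N+1 equispaced points, each sampled k times (M = k(2N+1)).
# {psi_m / sqrt(M)} is again a PONB, since {sqrt(k) psi_m / sqrt(M)}
# consists of k copies of an ONB.
N, k = 3, 4
M = k * (2 * N + 1)
c = -np.pi + 0.1                          # any c in [-pi, -pi + 2pi/(2N+1)]
ns = np.arange(-N, N + 1)
p = np.arange(M) % (2 * N + 1)            # p = m - 1 (mod 2N+1)
x = c + 2 * np.pi * p / (2 * N + 1)
Psi = np.exp(-1j * np.outer(x, ns))

print(np.allclose(Psi.conj().T @ Psi / M, np.eye(2 * N + 1)))  # True: PONB
```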

SLIDE 13

Computer Simulation 1

N = 3 (dim(H) = 7), M = 21

[Figure: target function and learning result on [−π, π].
(A) Optimal sampling: J_G = 0.333
(B) Random sampling: J_G = 1.202]

SLIDE 14

Computer Simulation 2

[Figure: J_G versus the number of training examples (M = 7, 14, . . . , 70) for optimal sampling and for random sampling (average of 100 trials).]
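
A sketch reproducing the flavor of this comparison, with σ^2 = 1; the exact random-sampling values depend on the draws, and the noise model and target used in the slide's experiment are not specified here.

```python
import numpy as np

# J_G for optimal (equispaced) vs. random sampling as M grows.
N = 3
ns = np.arange(-N, N + 1)

def JG(x):
    A = np.exp(1j * np.outer(x, ns))
    return np.trace(np.linalg.pinv(A @ A.conj().T)).real

rng = np.random.default_rng(0)
for M in range(7, 71, 7):
    opt = JG(-np.pi + 2 * np.pi * np.arange(M) / M)   # equals (2N+1)/M
    rnd = np.mean([JG(rng.uniform(-np.pi, np.pi, M)) for _ in range(100)])
    # Note: near-degenerate random designs at small M can inflate rnd.
    print(f"M={M:2d}  optimal {opt:.3f}  random {rnd:.3f}")
```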

SLIDE 15

Conclusions

  • 1. We showed that pseudo orthogonal bases (POBs) give the optimal solution to active learning in neural networks.

  • 2. By utilizing properties of POBs, we clarified the mechanism of achieving the optimal generalization.

  • 3. We gave two construction methods of PONBs.

SLIDE 16

Active Learning in Neural Networks

SLIDE 17

Projection Learning

Decompose the learning result as

f_0 = \underbrace{XAf}_{\text{signal component}} + \underbrace{Xn}_{\text{noise component}}.

Projection learning minimizes E_n \|Xn\|^2 under the constraint XAf = P_{R(A^*)} f, i.e., the signal component is the projection of f onto the approximation space R(A^*).

The projection learning operator is

X = V^\dagger A^* U^\dagger + Y(I − UU^\dagger),

where
Q : noise covariance matrix
A^* : adjoint operator of A
U = AA^* + Q
U^\dagger : Moore-Penrose generalized inverse of U
V = A^* U^\dagger A
Y : arbitrary operator
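
A sketch of this operator with Y = 0, in the matrix representation used throughout. For Q = σ^2 I projection learning reduces to X = A^† (Slide 7), which the last line checks numerically; the equispaced design is chosen only to keep A well conditioned.

```python
import numpy as np

# General projection-learning operator with Y = 0:
# X = V^+ A* U^+, where U = A A* + Q and V = A* U^+ A.
N, M, sigma2 = 3, 21, 0.25
ns = np.arange(-N, N + 1)
x = -np.pi + 2 * np.pi * np.arange(M) / M
A = np.exp(1j * np.outer(x, ns))
Q = sigma2 * np.eye(M)                        # noise covariance matrix

U = A @ A.conj().T + Q
V = A.conj().T @ np.linalg.pinv(U) @ A
X = np.linalg.pinv(V) @ A.conj().T @ np.linalg.pinv(U)

print(np.allclose(X, np.linalg.pinv(A)))      # True for Q = sigma^2 I
```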