1. COMPLETE STATISTICAL THEORY OF LEARNING: LEARNING USING STATISTICAL INVARIANTS
Vladimir Vapnik

2. PART I: VC THEORY OF GENERALIZATION

3. THE MAIN QUESTION OF LEARNING THEORY
QUESTION: When, in a set of functions $\{f(x)\}$, can we minimize the functional
$$R(f) = \int L(y, f(x)) \, dP(x, y), \quad f(x) \in \{f(x)\},$$
if the measure $P(x, y)$ is unknown but we are given $\ell$ i.i.d. pairs $(x_1, y_1), \ldots, (x_\ell, y_\ell)$?
ANSWER: We can minimize the functional $R(f)$ using data if and only if the VC-dimension $h$ of the set $\{f(x)\}$ is finite.

4. DEFINITION OF VC DIMENSION
Let $\{\theta(f(x))\}$ be a set of indicator functions (here $\theta(u) = 1$ if $u \ge 0$ and $\theta(u) = 0$ if $u < 0$).
• The VC-dimension of a set of indicator functions $\{\theta(f(x))\}$ is equal to $h$ if $h$ is the maximal number of vectors $x_1, \ldots, x_h$ that can be shattered (separated into all $2^h$ possible subsets) using indicator functions from $\{\theta(f(x))\}$. If such vectors exist for any number $h$, the VC-dimension of the set is infinite.
• The VC-dimension of a set of real-valued functions $\{f(x)\}$ is the VC-dimension of the set of indicator functions $\{\theta(f(x) + b)\}$.
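The shattering condition above can be checked mechanically for a finite sample of functions. The sketch below is only an illustration (the 1-D threshold class and the brute-force search are assumptions made here, not part of the slides): it asks whether a candidate class realizes all $2^h$ labelings of a given point set.

```python
import numpy as np

def can_shatter(points, functions):
    """True if the indicator functions theta(f(x)) realize all 2^h labelings of `points`."""
    h = len(points)
    achievable = {tuple(f(x) >= 0 for x in points) for f in functions}
    return len(achievable) == 2 ** h

# Toy class: 1-D thresholds f_b(x) = x - b; its VC dimension is 1.
thresholds = [lambda x, b=b: x - b for b in np.linspace(-2, 2, 401)]
print(can_shatter([0.3], thresholds))        # True: a single point can be shattered
print(can_shatter([0.3, 0.7], thresholds))   # False: the labeling (1, 0) is unreachable
```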

5. TWO THEOREMS OF VC THEORY
Theorem 1. If the set $\{f(x)\}$ has VC dimension $h$, then with probability $1 - \eta$, for all functions $f(x)$ the bound
$$R(f) \le R_{emp}^{\ell}(f) + \sqrt{e^2 + 4 e R_{emp}^{\ell}(f)}$$
holds true, where
$$R_{emp}^{\ell}(f) = \frac{1}{\ell} \sum_{i=1}^{\ell} L(y_i, f(x_i)), \quad e = O\!\left(\frac{h - \ln \eta}{\ell}\right).$$
Theorem 2. Let $x, w \in R^n$. The VC dimension $h$ of the set of linear indicator functions $\{\theta(x^T w) : \|x\|^2 \le 1, \|w\|^2 \le C\}$ is
$$h \le \min(C, n) + 1.$$
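To get a feel for Theorem 1, the sketch below evaluates the bound numerically, taking $e$ to be the bare ratio $(h - \ln \eta)/\ell$; the constant hidden in the $O(\cdot)$ is assumed to be 1, which is an assumption of this illustration, not part of the theorem.

```python
import numpy as np

def vc_bound(r_emp, h, ell, eta=0.05):
    """Upper bound of Theorem 1 with e = (h - ln eta)/ell (O(.) constant assumed to be 1)."""
    e = (h - np.log(eta)) / ell
    return r_emp + np.sqrt(e ** 2 + 4 * e * r_emp)

# The guarantee tightens as the sample size grows, for fixed empirical risk and capacity.
for ell in (100, 1_000, 10_000, 100_000):
    print(ell, round(vc_bound(r_emp=0.10, h=50, ell=ell), 3))
```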

6. STRUCTURAL RISK MINIMIZATION PRINCIPLE
To find the desired approximation $f_\ell(x)$ in a set $\{f(x)\}$:
FIRST, introduce a structure on the set of functions $\{f(x)\}$:
$$\{f(x)\}_1 \subset \{f(x)\}_2 \subset \cdots \subset \{f(x)\}_m \subset \{f(x)\}$$
with corresponding VC-dimensions $h_k$:
$$h_1 \le h_2 \le \cdots \le h_m \le \infty.$$
SECOND, choose the function $f_\ell(x)$ that minimizes the bound
$$R(f) \le R_{emp}^{\ell}(f) + \sqrt{e^2 + 4 e R_{emp}^{\ell}(f)}, \quad e = O\!\left(\frac{h_k - \ln \eta}{\ell}\right),$$
1. over the elements $\{f(x)\}_k$ (with VC dimension $h_k$), and
2. over the functions in $\{f(x)\}_k$, taking the function $f_\ell(x)$ with the smallest loss $R_{emp}^{\ell}(f)$ in $\{f(x)\}_k$.
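A minimal sketch of the SRM selection rule under the same assumption on $e$ as in the previous sketch: for each element of the structure take its best empirical loss, then keep the element with the smallest bound. The inputs `empirical_risks` and `vc_dims` are hypothetical values chosen only for illustration.

```python
import numpy as np

def vc_bound(r_emp, h, ell, eta=0.05):
    e = (h - np.log(eta)) / ell              # O(.) constant assumed to be 1
    return r_emp + np.sqrt(e ** 2 + 4 * e * r_emp)

def srm_select(empirical_risks, vc_dims, ell, eta=0.05):
    """empirical_risks[k]: smallest empirical loss found in element {f(x)}_k of VC dimension vc_dims[k]."""
    bounds = [vc_bound(r, h, ell, eta) for r, h in zip(empirical_risks, vc_dims)]
    k = int(np.argmin(bounds))
    return k, bounds[k]

# Richer elements fit the data better but pay a larger capacity term.
k, bound = srm_select([0.20, 0.12, 0.08, 0.07], [5, 20, 100, 500], ell=2_000)
print("chosen element:", k, "bound:", round(bound, 3))
```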

7. FOUR QUESTIONS TO COMPLETE LEARNING THEORY
1. How to choose the loss function $L(y, f)$ in the functional $R(f)$?
2. How to select an admissible set of functions $\{f(x)\}$?
3. How to construct a structure on the admissible set?
4. How to minimize the functional on the constructed structure?
The talk answers these questions for the pattern recognition problem.

8. PART II: TARGET FUNCTIONAL FOR MINIMIZATION

9. SETTING OF PROBLEM: GOD PLAYS DICE
[Diagram: Nature generates $x$ according to $P(x)$ and an Object returns $y$ according to $P(y|x)$; the Learning Machine observes the pairs $(x_1, y_1), \ldots, (x_\ell, y_\ell)$ and selects a function $f(x, \alpha)$, $\alpha \in \Lambda$.]
Given $\ell$ i.i.d. observations $(x_1, y_1), \ldots, (x_\ell, y_\ell)$, $x \in X$, $y \in \{0, 1\}$, generated by an unknown $P(x, y) = P(y|x) P(x)$, find the rule $r(x) = \theta(f_0(x))$ which minimizes, in a set $\{f(x)\}$, the probability of misclassification
$$R_\theta(f) = \int |y - \theta(f(x))| \, dP(x, y).$$

10. STANDARD REPLACEMENT OF BASIC SETTING
Using data $(x_1, y_1), \ldots, (x_\ell, y_\ell)$, $x \in X$, $y \in \{0, 1\}$, minimize in the set of functions $\{f(x)\}$ the functional
$$R(f) = \int (y - f(x))^2 \, dP(x, y)$$
(instead of the functional $R_\theta(f) = \int |y - \theta(f(x))| \, dP(x, y)$).
The minimizer $f_0(x)$ of $R(f)$ estimates the conditional probability function $f_0(x) = P(y = 1 | x)$. Use the classification rule
$$r(x) = \theta(f_0(x) - 0.5) = \theta(P(y = 1 | x) - 0.5).$$
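A minimal sketch of this standard replacement on synthetic data: fit $f(x)$ to the 0/1 labels with squared loss and classify with $\theta(f(x) - 0.5)$. The logistic-shaped $P(y=1|x)$ and the linear model are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
ell = 500
x = rng.uniform(-1, 1, size=(ell, 1))
p = 1 / (1 + np.exp(-4 * x[:, 0]))           # assumed true P(y=1|x)
y = (rng.uniform(size=ell) < p).astype(float)

# Least-squares fit of f(x) = w*x + b to the 0/1 labels approximates P(y=1|x).
X = np.hstack([x, np.ones((ell, 1))])
w, b = np.linalg.lstsq(X, y, rcond=None)[0]
f = X @ np.array([w, b])

rule = (f - 0.5 >= 0).astype(int)            # r(x) = theta(f(x) - 0.5)
print("training error:", np.mean(rule != y))
```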

11. PROBLEM WITH STANDARD REPLACEMENT
Minimization of the functional $R(f)$ in the set $\{f(x)\}$ is equivalent to minimization of the expression
$$R(f) = \int (y - f(x))^2 \, dP(x, y) = \int [(y - f_0(x)) + (f_0(x) - f(x))]^2 \, dP(x, y),$$
where $f_0(x)$ minimizes $R(f)$. This is equivalent to minimization of
$$R(f) = \int (y - f_0(x))^2 \, dP(x, y) + \int (f_0(x) - f(x))^2 \, dP(x) + 2 \int (y - f_0(x))(f_0(x) - f(x)) \, dP(x, y).$$
THE ACTUAL GOAL IS: USING $\ell$ OBSERVATIONS, TO MINIMIZE THE SECOND INTEGRAL, NOT THE SUM OF THE LAST TWO INTEGRALS.

12. DIRECT ESTIMATION OF CONDITIONAL PROBABILITY
1. When $y \in \{0, 1\}$, the conditional probability $P(y = 1 | x)$ is defined by some real-valued function $0 \le f(x) \le 1$.
2. From the Bayes formula $P(y = 1 | x) \, p(x) = p(y = 1, x)$ it follows that any function $G(x - x') \in L_2$ defines the equation
$$\int G(x - x') f(x') \, dP(x') = \int G(x - x') \, dP(y = 1, x') \qquad (*)$$
whose solution is the conditional probability $f(x) = P(y = 1 | x)$.
3. To estimate the conditional probability means to solve equation (*) when $P(x)$ and $P(y = 1, x)$ are unknown but data $(x_1, y_1), \ldots, (x_\ell, y_\ell)$, generated according to $P(y, x)$, are given.
4. Solution of equation (*) is an ill-posed problem.

13. MAIN INDUCTIVE STEP IN STATISTICS
Replace the unknown Cumulative Distribution Function (CDF) $P(x)$, $x = (x^1, \ldots, x^n)^T \in R^n$, with its estimate $P_\ell(x)$: the Empirical Cumulative Distribution Function (ECDF)
$$P_\ell(x) = \frac{1}{\ell} \sum_{i=1}^{\ell} \theta\{x - x_i\}, \quad \theta\{x - x_i\} = \prod_{k=1}^{n} \theta\{x^k - x_i^k\},$$
obtained from the data $x_1, \ldots, x_\ell$, $x_i = (x_i^1, \ldots, x_i^n)^T$.
The main theorem of statistics claims that the ECDF converges to the actual CDF uniformly with a fast rate of convergence. The following inequality holds true:
$$P\left\{\sup_x |P(x) - P_\ell(x)| > \varepsilon\right\} < 2 \exp\{-2 \varepsilon^2 \ell\}, \quad \forall \varepsilon.$$
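A quick one-dimensional check of this inductive step: build the ECDF from a sample with a known CDF and compare the observed $\sup_x |P(x) - P_\ell(x)|$ with the $2\exp\{-2\varepsilon^2\ell\}$ tail. The uniform distribution on $[0, 1]$ is chosen here only so the true CDF is known.

```python
import numpy as np

rng = np.random.default_rng(1)
ell = 1_000
sample = rng.uniform(0, 1, ell)               # assumed P(x): uniform on [0, 1], so P(x) = x

grid = np.linspace(0, 1, 2_001)
ecdf = np.searchsorted(np.sort(sample), grid, side="right") / ell
sup_dev = np.max(np.abs(ecdf - grid))         # sup_x |P_l(x) - P(x)|, evaluated on the grid

eps = 0.05
print("observed sup deviation:", round(sup_dev, 4))
print("bound on P{sup > eps}: ", round(2 * np.exp(-2 * eps ** 2 * ell), 4))
```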

14. TWO CONSTRUCTIVE SETTINGS OF CLASSIFICATION PROBLEM
1. Standard constructive setting: minimization of the functional
$$R_{emp}(f) = \int (y - f(x))^2 \, dP_\ell(x, y)$$
in a set $\{f(x)\}$ using data $(x_1, y_1), \ldots, (x_\ell, y_\ell)$ leads to
$$R_{emp}(f) = \frac{1}{\ell} \sum_{i=1}^{\ell} (y_i - f(x_i))^2, \quad f(x) \in \{f(x)\}.$$
2. New constructive setting: solution of the equation
$$\int G(x - x') f(x') \, dP_\ell(x') = \int G(x - x') \, dP_\ell(y = 1, x')$$
using data leads to solving, in $\{f(x)\}$, the equation
$$\frac{1}{\ell} \sum_{i=1}^{\ell} G(x - x_i) f(x_i) = \frac{1}{\ell} \sum_{j=1}^{\ell} y_j G(x - x_j), \quad f(x) \in \{f(x)\}.$$
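The two settings differ only in the data-based expression they produce: a mean of squared residuals versus a kernel-weighted balance equation that must hold at every $x$. The sketch below writes both down for a Gaussian $G$ and arbitrary candidate values $f(x_i)$; it is an illustration, not the final algorithm of the talk.

```python
import numpy as np

def empirical_risk(f_vals, y):
    """Standard setting: (1/l) * sum_i (y_i - f(x_i))^2."""
    return np.mean((y - f_vals) ** 2)

def v_equation_residual(x, xs, f_vals, y, delta=3.0):
    """New setting: left minus right side of
    (1/l) sum_i G(x - x_i) f(x_i) = (1/l) sum_j y_j G(x - x_j), Gaussian G assumed."""
    G = np.exp(-0.5 * delta ** 2 * (x - xs) ** 2)
    return np.mean(G * f_vals) - np.mean(G * y)

xs = np.array([-0.8, -0.2, 0.1, 0.6])
y = np.array([0.0, 0.0, 1.0, 1.0])
f_vals = np.array([0.1, 0.3, 0.6, 0.9])       # values f(x_i) of some candidate function
print(empirical_risk(f_vals, y))
print(v_equation_residual(0.0, xs, f_vals, y))
```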

15. NADARAYA-WATSON ESTIMATOR OF CONDITIONAL PROBABILITY
The well-known Nadaraya-Watson estimator of $P(y = 1 | x)$ is
$$f(x) = \frac{\sum_{i=1}^{\ell} y_i G(x - x_i)}{\sum_{i=1}^{\ell} G(x - x_i)},$$
where special kernels $G(x - x_i)$ (say, Gaussian) are used.
This estimator is the solution of the "corrupted" equation
$$\frac{1}{\ell} \sum_{i=1}^{\ell} G(x - x_i) f(x) = \frac{1}{\ell} \sum_{i=1}^{\ell} y_i G(x - x_i)$$
(which uses a special kernel) rather than of the obtained equation
$$\frac{1}{\ell} \sum_{i=1}^{\ell} G(x - x_i) f(x_i) = \frac{1}{\ell} \sum_{j=1}^{\ell} y_j G(x - x_j)$$
(which is defined for any kernel $G(x - x')$ from $L_2$).
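The estimator itself is a few lines of code; the sketch below uses a Gaussian kernel with an arbitrary bandwidth.

```python
import numpy as np

def nadaraya_watson(x, xs, ys, delta=3.0):
    """Estimate P(y=1|x) as sum_i y_i G(x - x_i) / sum_i G(x - x_i), Gaussian kernel G."""
    G = np.exp(-0.5 * delta ** 2 * (x - xs) ** 2)
    return np.sum(ys * G) / np.sum(G)

xs = np.array([-0.9, -0.4, -0.1, 0.2, 0.5, 0.8])
ys = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
print(round(nadaraya_watson(0.0, xs, ys), 3))   # roughly 0.5 near the class boundary
```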

16. WHAT IT MEANS TO SOLVE THE EQUATION
To solve the equation
$$\frac{1}{\ell} \sum_{i=1}^{\ell} G(x - x_i) f(x_i) = \frac{1}{\ell} \sum_{j=1}^{\ell} y_j G(x - x_j)$$
means to find the function in $\{f(x)\}$ minimizing the $L_2$-distance
$$R(f) = \int \left[ \sum_{i=1}^{\ell} G(x - x_i) f(x_i) - \sum_{j=1}^{\ell} y_j G(x - x_j) \right]^2 d\mu(x).$$
Simple algebra leads to the expression
$$R_V(f) = \sum_{i,j=1}^{\ell} (y_i - f(x_i))(y_j - f(x_j)) \, v(x_i, x_j),$$
where the values $v(x_i, x_j)$ are
$$v(x_i, x_j) = \int G(x - x_i) G(x - x_j) \, d\mu(x), \quad i, j = 1, \ldots, \ell.$$
The values $v(x_i, x_j)$ form the $V$-matrix.
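The identity between the $L_2$-distance and the quadratic form can be checked numerically: approximate the integral over $\mu$ by a Monte Carlo sum and build the $V$-matrix from the same sample. The Gaussian $G$, the uniform $\mu$ on $(-1, 1)$, and the particular points below are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)
xs = np.array([-0.6, -0.1, 0.4])
y = np.array([0.0, 1.0, 1.0])
f_vals = np.array([0.2, 0.5, 0.8])            # values f(x_i) of some candidate function
delta = 2.0

def G(a, b):
    return np.exp(-0.5 * delta ** 2 * (a - b) ** 2)

# Monte Carlo approximation of the L2-distance, with mu = Lebesgue measure on (-1, 1).
t = rng.uniform(-1, 1, 200_000)
lhs = sum(G(t, xi) * fi for xi, fi in zip(xs, f_vals))
rhs = sum(G(t, xj) * yj for xj, yj in zip(xs, y))
R_l2 = 2 * np.mean((lhs - rhs) ** 2)          # factor 2 = length of the interval (-1, 1)

# The same quantity via the V-matrix: v_ij = int G(x - x_i) G(x - x_j) d mu(x).
V = np.array([[2 * np.mean(G(t, xi) * G(t, xj)) for xj in xs] for xi in xs])
R_V = (y - f_vals) @ V @ (y - f_vals)
print(round(R_l2, 6), round(R_V, 6))          # the two numbers coincide, confirming the identity
```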

17. THE V-MATRIX ESTIMATE
1. For $\mu(x) = P(x)$, the elements $v(x_i, x_j)$ of the $V$-matrix are
$$v(x_i, x_j) = \int G(x - x_i) G(x - x_j) \, dP(x).$$
Using the empirical estimate $P_\ell(x)$ instead of $P(x)$, we obtain the following estimates of the elements of the $V$-matrix:
$$v(x_i, x_j) = \frac{1}{\ell} \sum_{s=1}^{\ell} G(x_s - x_i) G(x_s - x_j).$$
2. For $\mu(x) = x$, $x \in (-1, 1)$, and $G(x - x') = \exp\{-0.5 \, \delta^2 (x - x')^2\}$,
$$v(x_i, x_j) = \exp\{-\delta^2 (x_i - x_j)^2\} \left[ \mathrm{erf}(\delta(1 + 0.5(x_i + x_j))) + \mathrm{erf}(\delta(1 - 0.5(x_i + x_j))) \right].$$
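The empirical estimate in item 1 is straightforward to implement; a sketch for a one-dimensional sample and a Gaussian kernel is below (the kernel width is arbitrary). The closed-form expression of item 2 is not reproduced here.

```python
import numpy as np

def empirical_v_matrix(xs, G):
    """v(x_i, x_j) = (1/l) * sum_s G(x_s - x_i) G(x_s - x_j), with the sum taken over the sample itself."""
    K = G(xs[:, None] - xs[None, :])          # K[s, i] = G(x_s - x_i)
    return K.T @ K / len(xs)

xs = np.array([-0.7, -0.2, 0.3, 0.9])
gaussian = lambda d: np.exp(-0.5 * (3.0 * d) ** 2)   # G(x - x') with width delta = 3 (arbitrary)
V = empirical_v_matrix(xs, gaussian)
print(np.round(V, 3))                          # symmetric l x l matrix
```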

18. LEAST V-QUADRATIC FORM METHOD AND LEAST SQUARES METHOD
Let $(x_1, y_1), \ldots, (x_\ell, y_\ell)$ be the training data. Using the notations
$$Y = (y_1, \ldots, y_\ell)^T, \quad F(f) = (f(x_1), \ldots, f(x_\ell))^T, \quad V = \|v(x_i, x_j)\|,$$
we can rewrite the functional
$$R_V(f) = \sum_{i,j=1}^{\ell} (y_i - f(x_i))(y_j - f(x_j)) \, v(x_i, x_j)$$
in matrix form:
$$R_V(f) = (Y - F(f))^T V (Y - F(f)).$$
We call this functional the least $V$-quadratic functional. Using the identity matrix $I$ instead of $V$ gives the least squares functional
$$R_I(f) = (Y - F(f))^T (Y - F(f)).$$
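If the admissible functions are linear in their parameters, $F(f) = \Phi w$, the least $V$-quadratic functional has the closed-form minimizer $w = (\Phi^T V \Phi)^{-1} \Phi^T V Y$, and $V = I$ recovers ordinary least squares. The sketch below implements that weighted solve under this linearity assumption; it is not the full method of the talk.

```python
import numpy as np

def minimize_v_quadratic(Phi, Y, V):
    """Minimize (Y - Phi w)^T V (Y - Phi w) over w; V = identity recovers least squares."""
    return np.linalg.solve(Phi.T @ V @ Phi, Phi.T @ V @ Y)

rng = np.random.default_rng(3)
xs = rng.uniform(-1, 1, 30)
Y = (xs > 0).astype(float)                     # toy 0/1 labels
Phi = np.column_stack([xs, np.ones_like(xs)])  # linear model f(x) = w1 * x + w0

gaussian = lambda d: np.exp(-0.5 * (3.0 * d) ** 2)
K = gaussian(xs[:, None] - xs[None, :])
V = K.T @ K / len(xs)                          # empirical V-matrix, as in the previous sketch

print("least V-quadratic solution:", minimize_v_quadratic(Phi, Y, V))
print("least squares solution:   ", minimize_v_quadratic(Phi, Y, np.eye(len(xs))))
```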

19. PART III: SELECTION OF ADMISSIBLE SET OF FUNCTIONS
