SLIDE 1

COMPLETE STATISTICAL THEORY OF LEARNING
LEARNING USING STATISTICAL INVARIANTS

Vladimir Vapnik

SLIDE 2

PART I VC THEORY OF GENERALIZATION

SLIDE 3

THE MAIN QUESTION OF LEARNING THEORY

QUESTION: When, in a set of functions {f(x)}, can we minimize the functional

  R(f) = ∫ L(y, f(x)) dP(x, y),   f(x) ∈ {f(x)},

if the measure P(x, y) is unknown but we are given ℓ i.i.d. pairs (x1, y1), ..., (xℓ, yℓ)?

ANSWER: We can minimize the functional R(f) using data if and only if the VC dimension h of the set {f(x)} is finite.

SLIDE 4

DEFINITION OF VC DIMENSION

Let {θ(f(x))} be a set of indicator functions (here θ(u) = 1 if u ≥ 0 and θ(u) = 0 if u < 0).

  • The VC dimension of a set of indicator functions {θ(f(x))} equals h if h is the maximal number of vectors x1, ..., xh that can be shattered (separated into all 2^h possible subsets) using indicator functions from {θ(f(x))}. If such vectors exist for any number h, the VC dimension of the set is infinite.

  • The VC dimension of a set of real-valued functions {f(x)} is the VC dimension of the set of indicator functions {θ(f(x) + b)}.

SLIDE 5

TWO THEOREMS OF VC THEORY

Theorem 1. If the set {f(x)} has VC dimension h, then with probability 1 − η, for all functions f(x) the bound

  R(f) ≤ R^ℓ_emp(f) + √(e² + 4e·R^ℓ_emp(f))

holds true, where

  R^ℓ_emp(f) = (1/ℓ) Σ_{i=1}^{ℓ} L(yi, f(xi)),   e = O((h − ln η)/ℓ).

Theorem 2. Let x, w ∈ R^n. The VC dimension h of the set of linear indicator functions {θ(x^T w) : ||x||² ≤ 1, ||w||² ≤ C} is

  h ≤ min(C, n) + 1.

SLIDE 6

STRUCTURAL RISK MINIMIZATION PRINCIPLE

To find the desired approximation fℓ(x) in a set {f(x)}:

FIRST, introduce a structure on the set of functions {f(x)}:

  {f(x)}_1 ⊂ {f(x)}_2 ⊂ · · · ⊂ {f(x)}_m ⊂ {f(x)}

with corresponding VC dimensions h_k:

  h_1 ≤ h_2 ≤ · · · ≤ h_m ≤ ∞.

SECOND, choose the function fℓ(x) that minimizes the bound

  R(f) ≤ R^ℓ_emp(f) + √(e² + 4e·R^ℓ_emp(f)),   e = O((h_k − ln η)/ℓ),

  • 1. over the elements {f(x)}_k (with VC dimension h_k), and
  • 2. over the functions fℓ(x) in {f(x)}_k (taking the one with the smallest loss R^ℓ_emp(f)).

SLIDE 7

FOUR QUESTIONS TO COMPLETE LEARNING THEORY

  • 1. How to choose loss function L(y, f) in functional R(f)?
  • 2. How to select an admissible set of functions {f(x)}?
  • 3. How to construct structure on admissible set?
  • 4. How to minimize functional on constructed structure?

The talk answers these questions for the pattern recognition problem.

SLIDE 8

PART II TARGET FUNCTIONAL FOR MINIMIZATION

SLIDE 9

SETTING OF PROBLEM: GOD PLAYS DICE

[Diagram: an Object emits y according to Q(y); Nature returns z according to Q(z|y); the Learning Machine observes pairs (y1, z1), ..., (ym, zm) and implements functions g(y, β), β ∈ Λ.]

Given ℓ i.i.d. observations (x1, y1), ..., (xℓ, yℓ), x ∈ X, y ∈ {0, 1}, generated by unknown P(x, y) = P(y|x)P(x), find the rule r(x) = θ(f0(x)) which minimizes, in a set {f(x)}, the probability of misclassification

  Rθ(f) = ∫ |y − θ(f(x))| dP(x, y).
SLIDE 10

STANDARD REPLACEMENT OF BASIC SETTING

Using data (x1, y1), ..., (xℓ, yℓ), x ∈ X, y ∈ {0, 1}, minimize in the set of functions {f(x)} the functional

  R(f) = ∫ (y − f(x))² dP(x, y)

(instead of the functional Rθ(f) = ∫ |y − θ(f(x))| dP(x, y)).

The minimizer f0(x) of R(f) estimates the conditional probability function f0(x) = P(y = 1|x). Use the classification rule

  r(x) = θ(f0(x) − 0.5) = θ(P(y = 1|x) − 0.5).

SLIDE 11

PROBLEM WITH STANDARD REPLACEMENT

Minimization of the functional R(f) in the set {f(x)} is equivalent to minimization of the expression

  R(f) = ∫ (y − f(x))² dP(x, y) = ∫ [(y − f0(x)) + (f0(x) − f(x))]² dP(x, y),

where f0(x) minimizes R(f). This is equivalent to minimization of

  R(f) = ∫ (y − f0(x))² dP(x, y) + ∫ (f0(x) − f(x))² dP(x) + 2 ∫ (y − f0(x))(f0(x) − f(x)) dP(x, y).

THE ACTUAL GOAL IS: USING ℓ OBSERVATIONS, TO MINIMIZE THE SECOND INTEGRAL, NOT THE SUM OF THE LAST TWO INTEGRALS.

SLIDE 12

DIRECT ESTIMATION OF CONDITIONAL PROBABILITY

  • 1. When y ∈ {0, 1}, the conditional probability P(y = 1|x) is defined by some real-valued function 0 ≤ f(x) ≤ 1.

  • 2. From the Bayes formula P(y = 1|x)p(x) = p(y = 1, x) it follows that any function G(x − x′) ∈ L2 defines the equation

      ∫ G(x − x′) f(x′) dP(x′) = ∫ G(x − x′) dP(y = 1, x′)   (∗)

    whose solution is the conditional probability f(x) = P(y = 1|x).

  • 3. To estimate the conditional probability means to solve equation (∗) when P(x) and P(y = 1, x) are unknown but data (x1, y1), ..., (xℓ, yℓ), generated according to P(y, x), are given.

  • 4. Solution of equation (∗) is an ill-posed problem.
SLIDE 13

MAIN INDUCTIVE STEP IN STATISTICS

Replace the unknown Cumulative Distribution Function (CDF) P(x), x = (x^1, ..., x^n)^T ∈ R^n, with its estimate Pℓ(x): the Empirical Cumulative Distribution Function (ECDF)

  Pℓ(x) = (1/ℓ) Σ_{i=1}^{ℓ} θ{x − xi},   θ{x − xi} = Π_{k=1}^{n} θ{x^k − x^k_i},

obtained from data x1, ..., xℓ, xi = (x^1_i, ..., x^n_i)^T.

The main theorem of statistics claims that the ECDF converges to the actual CDF uniformly, with a fast rate of convergence. The following inequality holds true:

  P{sup_x |P(x) − Pℓ(x)| > ε} < 2 exp{−2ε²ℓ},   ∀ε.
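A minimal sketch (my own illustration, not from the slides), assuming NumPy: the multivariate ECDF Pℓ(x) built from an i.i.d. sample, as defined above.

```python
# Multivariate ECDF: P_l(x) = (1/l) * sum_i prod_k theta(x^k - x_i^k)
import numpy as np

def ecdf(data: np.ndarray):
    """Return the function P_l(.) for a sample `data` of shape (l, n)."""
    def P_l(x: np.ndarray) -> float:
        # indicator theta{x - x_i} = 1 iff every coordinate of x_i is <= the one of x
        return np.mean(np.all(data <= x, axis=1))
    return P_l

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(size=(1000, 2))      # l = 1000 points in R^2
    P_l = ecdf(sample)
    print(P_l(np.array([0.0, 0.0])))         # ~0.25 for independent N(0,1) coordinates
```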

SLIDE 14

TWO CONSTRUCTIVE SETTINGS OF CLASSIFICATION PROBLEM

1. Standard constructive setting: Minimization of the functional

  Remp(f) = ∫ (y − f(x))² dPℓ(x, y)

in a set {f(x)} using data (x1, y1), ..., (xℓ, yℓ) leads to

  Remp(f) = (1/ℓ) Σ_{i=1}^{ℓ} (yi − f(xi))²,   f(x) ∈ {f(x)}.

2. New constructive setting: Solution of the equation

  ∫ G(x − x′) f(x′) dPℓ(x′) = ∫ G(x − x′) dPℓ(y = 1, x′)

using data leads to solving, in {f(x)}, the equation

  (1/ℓ) Σ_{i=1}^{ℓ} G(x − xi) f(xi) = (1/ℓ) Σ_{j=1}^{ℓ} yj G(x − xj),   f(x) ∈ {f(x)}.

SLIDE 15

NADARAYA-WATSON ESTIMATOR OF CONDITIONAL PROBABILITY

The well-known Nadaraya-Watson estimator of P(y = 1|x) is

  f(x) = Σ_{i=1}^{ℓ} yi G(x − xi) / Σ_{i=1}^{ℓ} G(x − xi),

where special kernels G(x − xi) (say, Gaussian) are used. This estimator is the solution of the "corrupted" equation

  (1/ℓ) Σ_{i=1}^{ℓ} G(x − xi) f(x) = (1/ℓ) Σ_{i=1}^{ℓ} yi G(x − xi)

(which uses a special kernel) rather than of the obtained equation

  (1/ℓ) Σ_{i=1}^{ℓ} G(x − xi) f(xi) = (1/ℓ) Σ_{j=1}^{ℓ} yj G(x − xj)

(which is defined for any kernel G(x − x′) from L2).
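A minimal sketch (my own illustration): the Nadaraya-Watson estimate of P(y = 1|x) with a Gaussian kernel, as referenced on this slide; the bandwidth parameter is an assumption of the sketch.

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x_query, bandwidth=1.0):
    """Estimate P(y=1|x) at each row of x_query from binary labels y_train."""
    # G(x - x_i) = exp(-||x - x_i||^2 / (2 * bandwidth^2))
    diffs = x_query[:, None, :] - x_train[None, :, :]                # (q, l, n)
    G = np.exp(-np.sum(diffs**2, axis=2) / (2 * bandwidth**2))       # (q, l)
    return (G @ y_train) / np.clip(G.sum(axis=1), 1e-12, None)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(200, 1))
    y = (rng.uniform(size=200) < 0.5 * (X[:, 0] + 1)).astype(float)  # P(y=1|x) = (x+1)/2
    print(nadaraya_watson(X, y, np.array([[0.0]]), bandwidth=0.2))   # approx. 0.5
```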

SLIDE 16

WHAT IT MEANS TO SOLVE THE EQUATION

To solve the equation

  (1/ℓ) Σ_{i=1}^{ℓ} G(x − xi) f(xi) = (1/ℓ) Σ_{j=1}^{ℓ} yj G(x − xj)

means to find the function in {f(x)} minimizing the L2-distance

  R(f) = ∫ [ Σ_{i=1}^{ℓ} G(x − xi) f(xi) − Σ_{j=1}^{ℓ} yj G(x − xj) ]² dµ(x).

Simple algebra leads to the expression

  RV(f) = Σ_{i,j=1}^{ℓ} (yi − f(xi))(yj − f(xj)) v(xi, xj),

where the values v(xi, xj) are

  v(xi, xj) = ∫ G(x − xi) G(x − xj) dµ(x),   i, j = 1, ..., ℓ.

The values v(xi, xj) form the V-matrix.

SLIDE 17

THE V-MATRIX ESTIMATE

  • 1. For µ(x) = P(x), the elements v(xi, xj) of the V-matrix are

      v(xi, xj) = ∫ G(x − xi) G(x − xj) dP(x).

    Using the empirical estimate Pℓ(x) instead of P(x), we obtain the following estimates of the elements of the V-matrix:

      v(xi, xj) = (1/ℓ) Σ_{s=1}^{ℓ} G(xs − xi) G(xs − xj).

  • 2. For µ(x) = x, x ∈ (−1, 1), and G(x − x′) = exp{−0.5δ²(x − x′)²},

      v(xi, xj) = exp{−δ²(xi − xj)²} {erf[δ(1 + 0.5(xi + xj))] + erf[δ(1 − 0.5(xi + xj))]}.
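A minimal sketch (my own illustration): the empirical V-matrix v(xi, xj) = (1/ℓ) Σs G(xs − xi) G(xs − xj) from item 1, with a Gaussian kernel; the kernel width δ is an assumption of the sketch.

```python
import numpy as np

def v_matrix(X: np.ndarray, delta: float = 1.0) -> np.ndarray:
    """Empirical V-matrix of shape (l, l) for a sample X of shape (l, n)."""
    # G[s, i] = exp(-0.5 * delta^2 * ||x_s - x_i||^2)
    sq = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    G = np.exp(-0.5 * delta**2 * sq)
    # v(x_i, x_j) = (1/l) * sum_s G[s, i] * G[s, j]
    return (G.T @ G) / X.shape[0]

if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(5, 2))
    V = v_matrix(X, delta=1.0)
    print(V.shape, np.allclose(V, V.T))   # (5, 5) True: V is symmetric
```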

SLIDE 18

LEAST V-QUADRATIC FORM METHOD AND LEAST SQUARES METHOD

Let (x1, y1), ..., (xℓ, yℓ) be training data. Using the notations Y = (y1, ..., yℓ)^T, F(f) = (f(x1), ..., f(xℓ))^T, V = ||v(xi, xj)||, we can rewrite the functional

  RV(f) = Σ_{i,j=1}^{ℓ} (yi − f(xi))(yj − f(xj)) v(xi, xj)

in matrix form:

  RV(f) = (Y − F(f))^T V (Y − F(f)).

We call this the Least V-quadratic functional. The identity matrix I instead of V gives the Least Squares functional

  RI(f) = (Y − F(f))^T (Y − F(f)).

SLIDE 19

PART III SELECTION OF ADMISSIBLE SET OF FUNCTIONS

SLIDE 20

STRONG AND WEAK CONVERGENCE

Functions fℓ(x) ∈ L2 have two modes of convergence:

  • 1. Strong mode of convergence (convergence of functions):

      lim_{ℓ→∞} ∫ (fℓ(x) − f0(x))² dµ(x) = 0.

  • 2. Weak mode of convergence (convergence of functionals):

      lim_{ℓ→∞} ∫ fℓ(x) φ(x) dµ(x) = ∫ f0(x) φ(x) dµ(x),   ∀φ(x) ∈ L2

    (convergence for all possible functions φ(x) ∈ L2).

  • Strong convergence implies weak convergence, by the Cauchy-Schwarz inequality:

      [∫ (fℓ(x) − f0(x)) φ(x) dµ(x)]² ≤ ∫ (fℓ(x) − f0(x))² dµ(x) · ∫ φ²(x) dµ(x).

  • For functions fℓ(x) belonging to a compact set, weak convergence implies strong convergence.

SLIDE 21

WEAK CONVERGENCE TO CONDITIONAL PROBABILITY FUNCTION P(y = 1|x)

Weak convergence of a sequence of functions fℓ(x) to the function f0(x) = P(y = 1|x) means the equalities

  lim_{ℓ→∞} ∫ φ(x) fℓ(x) dP(x) = ∫ φ(x) P(y = 1|x) dP(x) = ∫ φ(x) dP(y = 1, x)

for all φ(x) ∈ L2.

Let us call a set of m functions φ1(x), ..., φm(x) from L2 the chosen predicates. Let us call the subset of functions {f(x)} for which the following m equalities hold true,

  ∫ φk(x) f(x) dP(x) = ∫ φk(x) dP(y = 1, x),   k = 1, ..., m,

the admissible set of functions (defined by the predicates).

SLIDE 22

ADMISSIBLE SUBSETS FOR ESTIMATION OF CONDITIONAL PROBABILITY FUNCTION

Replacing P(x), P(y = 1, x) with Pℓ(x), Pℓ(y = 1, x), we obtain

  (1/ℓ) Σ_{i=1}^{ℓ} φk(xi) f(xi) = (1/ℓ) Σ_{i=1}^{ℓ} yi φk(xi),   k = 1, ..., m.

In the matrix notations Y = (y1, ..., yℓ)^T, F(f) = (f(x1), ..., f(xℓ))^T, Φk = (φk(x1), ..., φk(xℓ))^T, we obtain that the admissible set of functions {f(x)} satisfies the equalities

  Φk^T F(f) = Φk^T Y,   k = 1, ..., m.

We call these equalities statistical invariants for P(y = 1|x).
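A minimal sketch (my own illustration): evaluating the invariant residuals Φk^T F(f) − Φk^T Y for a list of predicates; the predicates and the crude estimate of P(y = 1|x) below are hypothetical choices for the example only.

```python
import numpy as np

def invariant_residuals(X, y, f_values, predicates):
    """Return Phi_k^T F - Phi_k^T Y for each predicate phi_k (0 means the invariant holds)."""
    res = []
    for phi in predicates:
        Phi = np.array([phi(x) for x in X])   # (phi_k(x_1), ..., phi_k(x_l))
        res.append(Phi @ f_values - Phi @ y)
    return np.array(res)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    y = (X[:, 0] > 0).astype(float)
    f_values = np.full(100, y.mean())         # crude constant estimate of P(y=1|x)
    preds = [lambda x: 1.0,                   # phi(x) = 1   (class-size invariant)
             lambda x: x[0]]                  # phi(x) = x^1 (center-of-mass invariant)
    print(invariant_residuals(X, y, f_values, preds))
```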

SLIDE 23

DUCK TEST, STATISTICAL INVARIANTS, PREDICATES, AND FEATURES

THE DUCK TEST LOGIC
"If it looks like a duck, swims like a duck, and quacks like a duck, then it probably is a duck." (English proverb.)

STATISTICAL INVARIANTS

  (1/ℓ) Σ_{i=1}^{ℓ} φk(xi) f(xi) = (1/ℓ) Σ_{i=1}^{ℓ} yi φk(xi),   k = 1, ..., m

(or Φk^T F(f) = Φk^T Y, k = 1, ..., m, in vector notation) collect the set of admissible functions {f(x)}, which "identify" an animal as a duck if it "looks, swims, and quacks like a duck".

PREDICATES AND FEATURES
The concepts of predicates and features are very different:

  • With an increasing number of predicates, the VC dimension of the admissible set of functions {f(x)} DECREASES.
  • With an increasing number of features, the VC dimension of the admissible set of functions {f(x)} INCREASES.
SLIDE 24

EXACT SETTING OF COMPLETE LEARNING PROBLEM

  • The complete solution of the classification problem requires, in a given set of functions {f(x)}, minimizing the functional

      RV(f) = (Y − F(f))^T V (Y − F(f))

    subject to the constraints (statistical invariants)

      Φk^T F(f) = Φk^T Y,   k = 1, ..., m.

    We call this conditional-minimization model of learning Learning Using Statistical Invariants (LUSI).

  • Classical methods require, in a given (specially constructed) subset of functions {f(x)}, minimizing the functional

      RI(f) = (Y − F(f))^T (Y − F(f)).

SLIDE 25

APPROXIMATE SETTING OF COMPLETE LEARNING PROBLEM

In this setting, minimization of the functional

  RV(f) = (Y − F(f))^T V (Y − F(f))

on the set of functions {f(x)} satisfying the m constraints

  Φs^T F(f) = Φs^T Y,   s = 1, ..., m,

is replaced with minimization of the functional

  RVP(f) = τ1 (Y − F(f))^T V (Y − F(f)) + (τ2/m) Σ_{s=1}^{m} [Φs^T F(f) − Φs^T Y]²,

where τ1, τ2 ≥ 0, τ1 + τ2 = 1. This functional can be rewritten as

  RVP(f) = (Y − F(f))^T (τ1 V + τ2 P)(Y − F(f)),

where the (ℓ × ℓ) matrix P defines the predicate covariance

  P = (1/m) Σ_{s=1}^{m} Φs Φs^T.
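A minimal sketch (my own illustration): the predicate matrix P = (1/m) Σs Φs Φs^T and the combined functional RVP(f); the two weights follow this slide's convention of nonnegative coefficients summing to one.

```python
import numpy as np

def p_matrix(X, predicates):
    """P = (1/m) * sum_s Phi_s Phi_s^T, where Phi_s = (phi_s(x_1), ..., phi_s(x_l))^T."""
    Phi = np.array([[phi(x) for x in X] for phi in predicates])   # (m, l)
    return Phi.T @ Phi / len(predicates)

def r_vp(Y, F, V, P, tau1=0.5, tau2=0.5):
    """R_VP(f) = (Y - F)^T (tau1*V + tau2*P) (Y - F)."""
    r = Y - F
    return float(r @ (tau1 * V + tau2 * P) @ r)
```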

SLIDE 26

PART IV COMPLETE SOLUTION IN REPRODUCING KERNEL HILBERT SPACE (RKHS)

SLIDE 27

IMPORTANT FACTS FROM RKHS 1

  • 1. An RKHS is a set of functions {f(x)} for which

      (K(x, x′), f(x′)) = f(x)

    (K(x, x′) is a Mercer kernel).

  • 2. A Mercer kernel is defined by orthonormal functions ψk(x):

      K(x, x′) = Σ_{i=1}^{∞} λi ψi(x) ψi(x′),   λi > 0,   λt → 0 as t → ∞.

  • 3. The set of functions

      fc(x) = Σ_{i=1}^{∞} ci ψi(x)

    with inner product (and norm)

      (fc(x), fc*(x)) = Σ_{i=1}^{∞} ci ci* / λi,   ||fc(x)||² = Σ_{i=1}^{∞} ci² / λi

    forms the RKHS of the kernel K(x, x′).
SLIDE 28

IMPORTANT FACTS FROM RKHS 2

  • 4. REPRESENTER THEOREM. The minimum of the functional

      RV(f) = (F(f) − Y)^T V (F(f) − Y)

    in the subset of an RKHS with ||f(x)||² ≤ C has the representation

      f0(x) = Σ_{i=1}^{ℓ} ai K(xi, x) = A^T K(x),   (∗)

    where we denoted A = (a1, ..., aℓ)^T, K(x) = (K(x1, x), ..., K(xℓ, x))^T.

  • 5. The square of the norm of a function f(x) in form (∗) is

      ||f(x)||² = A^T K A,   K = ||K(xi, xj)||,   F(f) = KA.

  • 6. A subset of functions from an RKHS with bounded norm A^T K A ≤ C has finite VC dimension (the smaller C, the smaller the VC dimension). By controlling C, one controls both the VC dimension of the subset of functions and their smoothness. The structure defined by C is the key to implementing the SRM principle for functions belonging to an RKHS.
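A minimal sketch (my own illustration): the representer-theorem form f(x) = A^T K(x) and its squared RKHS norm A^T K A, using a Gaussian kernel as an assumed example of a Mercer kernel.

```python
import numpy as np

def gaussian_kernel(X1, X2, delta=1.0):
    """Gram matrix K with K[i, j] = exp(-0.5 * delta^2 * ||x1_i - x2_j||^2)."""
    sq = np.sum((X1[:, None, :] - X2[None, :, :])**2, axis=2)
    return np.exp(-0.5 * delta**2 * sq)

def kernel_expansion(A, X_train, x, delta=1.0):
    """f(x) = sum_i a_i K(x_i, x), evaluated at the rows of x."""
    return gaussian_kernel(x, X_train, delta) @ A

def rkhs_norm_sq(A, X_train, delta=1.0):
    """||f||^2 = A^T K A."""
    K = gaussian_kernel(X_train, X_train, delta)
    return float(A @ K @ A)
```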

SLIDE 29

CONDITIONAL MINIMIZATION IN RKHS: EXACT LUSI SOLUTION

For an RKHS we have F(f) = KA. The minimum of the functional

  RV(f) = (KA − Y)^T V (KA − Y),

subject to the m constraints

  Φk^T KA = Φk^T Y,   k = 1, ..., m,

and the constraint A^T K A ≤ C, has a unique solution of the form

  fℓ(x) = A_LUSI^T K(x),

where

  A_LUSI = AV − Σ_{s=1}^{m} µs As,   AV = (VK + γc I)^{-1} V Y,   As = (VK + γc I)^{-1} Φs.

The parameters µs are the solution of the linear equations

  Σ_{s=1}^{m} µs As^T K Φk = (K AV − Y)^T Φk,   k = 1, ..., m.

SLIDE 30

UNCONDITIONAL MINIMIZATION IN RKHS (SOLUTION OF APPROXIMATE SETTING)

The minimum of the functional

  RVP(f) = (KA − Y)^T (τ1 V + τ2 P)(KA − Y)

in the set of functions {f(x)} belonging to the RKHS of kernel K(x, x′) with bounded norm A^T K A ≤ C has the unique solution of the form

  f0(x) = A_VP^T K(x),

where

  A_VP = ((τ1 V + τ2 P) K + γc I)^{-1} (τ1 V + τ2 P) Y.
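A minimal sketch (my own illustration) of the closed-form coefficients A_VP given on this slide; gaussian_kernel, v_matrix, and p_matrix refer to the helper sketches shown after the earlier slides, and the values of the weights and of the regularization constant are assumptions of the example.

```python
import numpy as np

def lusi_approximate_fit(K, V, P, Y, tau1=0.5, tau2=0.5, gamma=1e-3):
    """A_VP = ((tau1*V + tau2*P) K + gamma*I)^{-1} (tau1*V + tau2*P) Y."""
    W = tau1 * V + tau2 * P
    return np.linalg.solve(W @ K + gamma * np.eye(len(Y)), W @ Y)

def predict(A, K_query_train):
    """f(x) = A^T K(x) at query points, given the kernel values K(x_query, x_train)."""
    return K_query_train @ A
```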

SLIDE 31

SVM AND LUSI-SVM ESTIMATES IN RKHS

  • SVM: Given data (x1, y1), ..., (xℓ, yℓ), find in the RKHS the function f(x) = A^T K(x) with norm

      ||f(x)||² = A^T K A ≤ C   (∗)

    that minimizes the loss

      L(A) = Σ_{i=1}^{ℓ} |yi − A^T K(xi)|.

  • LUSI-SVM: Given data, find in the RKHS the function f(x) = A^T K(x) with bounded norm (∗) that minimizes

      L(A) = τ2 Σ_{s=1}^{m} |A^T K Φs − Y^T Φs| + τ1 Σ_{i=m+1}^{m+ℓ} |yi − A^T K(xi)|,

    where τ1 + τ2 = 1, τ1 > 0, τ2 > 0.

SLIDE 32

LUSI-SVM ESTIMATOR

The LUSI-SVM method selects in the set A^T K A ≤ C the function

  f(x) = Σ_{i=1}^{ℓ} ai K(xi, x) = A^T K(x),

where

  A = Σ_{t=1}^{m} δt Φt + Σ_{t=m+1}^{m+ℓ} δt Φt.

To find the δt one has to maximize the functional

  R(δ) = Σ_{i=1}^{m+ℓ} δi Φi^T Y − (1/2) Σ_{r,s=1}^{m+ℓ} δr Φr^T K Φs δs

subject to the constraints

  −τ1 γ*_c ≤ δt ≤ τ1 γ*_c,   t = m + 1, ..., m + ℓ,
  −τ2 γ*_c ≤ δt ≤ τ2 γ*_c,   t = 1, ..., m,
  τ1 + τ2 = 1,

where we denoted Φt = (0, ..., 1, ..., 0)^T (the unit vector with 1 in position t − m) for t = m + 1, ..., m + ℓ.

SLIDE 33

LEARNING DOES NOT REQUIRE BIG DATA

According to the Representer Theorem, the optimal solution of the learning problem in an RKHS has the following properties:

  • 1. It is defined by linear parametric functions in the form of an expansion over kernel functions (i.e., the optimal solution belongs to a one-layer network, not a multi-layer network).

  • 2. The observation vectors x1, ..., xℓ and the kernel K(x, x′) define the basis of the linear expansion for the optimal ℓ-parametric solution.

  • 3. SVM: to control the VC dimension, it uses the data to find both the basis of the expansion and the parameters of the solution.

  • 4. LUSI-SVM: to estimate the unknown parameters of the solution, it adds to the ℓ training pairs m pairs (KΦs, Y^T Φs) obtained using predicates. When τ2 ≈ 1 it uses just these m pairs.

  • 5. Since any functions from Hilbert space can be used as predicates φs(x), there exist one or several "smart" predicates defining pairs (KΦs, Y^T Φs) that form the optimal solution.

SLIDE 34

ILLUSTRATION

I: 0.3756   V: 0.1432   I&I: 0.2166   V&I: 0.1017

SLIDE 35

ILLUSTRATION

I: 0.3212   V: 0.1207   I&I: 0.1808   V&I: 0.0778

SLIDE 36

ILLUSTRATION

I: 0.1672   V: 0.0689   I&I: 0.1072   V&I: 0.0609

SLIDE 37

MULTIDIMENSIONAL EXAMPLES

TABLE 1
Data set         Training  Features  SVM     V&I
Diabetes         562       8         25.94%  22.73%
MAGIC            1005      10        19.03%  15.10%
WPBC             134       33        25.48%  23.02%
Bank Marketing   445       16        12.06%  10.58%

TABLE 2
Diabetes                            MAGIC
Training  SVM     V&I               Training  SVM     V&I
71        28.42%  27.52%            242       20.51%  17.35%
151       26.97%  24.56%            491       20.93%  15.91%
304       26.35%  23.78%            955       18.89%  15.19%
612       25.43%  22.60%            1903      18.03%  14.25%

SLIDE 38

NEW INVARIANT FOR DIABETES

  ψ*(x) = 1 if x ∈ B (selected box), 0 otherwise.

[Figure: scatter plot of Glucose vs. BMI for healthy and sick patients, with the selected box B.]

I&I(+∗) decreases the error rate from 22.73% to 22.07%.

SLIDE 39

WAY TO FIND A NEW INVARIANT

Find a situation (the box B in the figure) where the existing solution (the approximation Pℓ(y = 1|x)) contradicts the evidence (contradicts the invariant for the predicate φ(x) = 1 inside the box), and then modify the solution (obtain a new approximation P(n+1)(y = 1|x)) which resolves this contradiction.

This is the same principle that is used in physics to discover the laws of Nature. To discover laws of Nature, physicists first try to find a situation where the existing theory contradicts observations (the invariants fail: theoretical predictions are not supported by experiments). Then they try to reconstruct the theory to remove the contradictions. They construct a new approximation of the theory which does not contradict the observed reality, one that keeps all invariants. The most important (and most difficult) part of scientific discovery is to find the contradictory situation.

SLIDE 40

PART V LUSI APPROACH IN NEURAL NETWORKS

SLIDE 41

VP-BACK PROPAGATION ALGORITHM

  • Neural networks search for the minimum of the functional

      RI(f) = (F(f) − Y)^T (F(f) − Y)

    in the set of piecewise linear functions {f} realized by the neural network. They use a gradient descent procedure of minimization (called back propagation). The procedure has three steps: 1. forward propagation, 2. backward propagation, 3. update of parameters.

  • To minimize, in the same set of functions, the VP-form

      RVP(f) = (F(f) − Y)^T (τ1 V + τ2 P)(F(f) − Y)

    using the back propagation technique, one has to modify just the backward step: instead of the vector E = ((y1 − u1), ..., (yℓ − uℓ))^T (where u1, ..., uℓ are the outputs of the last layer (last unit) on vectors x1, ..., xℓ), one has to back propagate the modified vector

      E* = (τ1 V + τ2 P) E.

SLIDE 42

SCHEME OF VP-BACK PROPAGATION ALGORITHM

  • 1. Forward propagation step. Given initial weights w of the net, propagate the training vectors xi through all hidden layers.

  • 2. Boundary conditions for back propagation. Let ui be the value corresponding to vector xi propagated to the last layer (unit), and let ei = (yi − ui) be the difference between the target value yi and the obtained value ui. Consider the vector E = (e1, ..., eℓ)^T.

  • 3. Back propagation step. Back propagate the vector E* = (τ1 V + τ2 P) E, where E = (e1, ..., eℓ)^T.

  • 4. Weight updating step. Compute the gradient with respect to the weights and update the weights of the network.
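A minimal sketch (my own illustration, not the authors' code): one gradient step for a single-hidden-layer network in which the only VP-specific change is that the error vector E = Y − U is replaced by E* = (τ1 V + τ2 P) E before the backward pass; the network architecture and learning rate are assumptions of the sketch.

```python
import numpy as np

def vp_backprop_step(X, Y, W1, b1, W2, b2, V, P, tau1=0.5, tau2=0.5, lr=0.1):
    """One step of gradient descent on (Y - U)^T (tau1*V + tau2*P) (Y - U)."""
    # 1. Forward propagation
    H = np.maximum(0.0, X @ W1 + b1)           # ReLU hidden layer, shape (l, h)
    U = H @ W2 + b2                            # network outputs, shape (l,)

    # 2. Boundary condition: ordinary error vector
    E = Y - U

    # 3. Back propagation of the MODIFIED error vector E* = (tau1*V + tau2*P) E
    E_star = (tau1 * V + tau2 * P) @ E
    dU = -2.0 * E_star                         # d(loss)/dU for the VP quadratic form
    dW2 = H.T @ dU
    db2 = dU.sum()
    dH = np.outer(dU, W2) * (H > 0)            # propagate through the ReLU
    dW1 = X.T @ dH
    db1 = dH.sum(axis=0)

    # 4. Weight update
    return W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2
```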

SLIDE 43

EXAMPLE: MNIST DIGIT RECOGNITION

Minimization of R(f) = (Y − F(f))^T (τ1 V + τ2 P)(Y − F(f)). 2D image of digit ui(x1, x2). Predicate: φ(ui) = 1.
Experiment settings: V = I, ℓ = 1,000 (100 per class). Batch 6.

[Plot of training curves; legend: T = 0, 96.9%; T = 0.05, 97.1%.]

Error rate: DNNet 3.1%, VP-NNet 2.9%

SLIDE 44

EXAMPLE: MNIST DIGIT RECOGNITION

Minimization of R(f) = (Y − F(f))^T (τ1 V + τ2 P)(Y − F(f)).
Predicate: φ(ui) = ∫∫ ui(x1, x2) cos(2πx1) dx1 dx2 (u(x1, x2) is a digit).
Experiment settings: V = I, ℓ = 1,000 (100 per class). Batch 6.

[Plot of training curves; legend: avg-grad(6), Tau(10)*P[cos 2*Pi*x*pca[0]].]

Error rate: DNNet 3.4%, VP-NNet 3.3%

SLIDE 45

EXAMPLE: MNIST DIGIT RECOGNITION

Minimization of R(f) = (Y − F(f))^T (τ1 V + τ2 P)(Y − F(f)).
Predicates: φm,n(ui) = ∫∫ ui(x1, x2) cos(mπx1) cos(nπx2) dx1 dx2, m, n = 1, 2, 3, 4.
Experiment settings: V = I, ℓ = 1,000 (100 per class). Batch 6.

[Plot of training curves; legend: avg-grad(6), Tau(10)*P[FFT 4x4].]

Error rate: DNNet 3.4%, VP-NNet 2.8%

SLIDE 46

STATISTICAL PART OF LEARNING THEORY IS COMPLETED

The theory found that:

  • 1. The functional for minimization is the V-quadratic form

      R(f) = (Y − F(f))^T V (Y − F(f)).   (1)

  • 2. In an RKHS, where F(f) = KA, the admissible set of functions is defined by invariants for the given m predicate functions φk:

      Φk^T KA = Φk^T Y,   k = 1, ..., m.   (2)

  • 3. For an RKHS, the structure in the SRM method is defined by the values of the norm of functions from the RKHS,

      A^T K A ≤ C,   (3)

    which satisfy (2).

  • 4. There exists a unique (closed-form) solution for the problem of minimization of (1) subject to constraints (2) and (3).

The only question left is: "How to choose the set of predicates?" The answer to this question forms the intelligent content of learning.

SLIDE 47

PART VI EXAMPLES OF PREDICATES

SLIDE 48

EXAMPLES OF GENERAL TYPE PREDICATES

  Σ_{i=1}^{ℓ} Pℓ(y = 1|xi) φ(xi) = Σ_{i=1}^{ℓ} yi φ(xi).   (∗)

  • 1. The predicate φ(x) = 1 in (∗) collects functions for which the expected number of elements of class y = 1, computed using Pℓ(y = 1|x), equals the number of training examples of the first class.

  • 2. The predicate φ(x) = x in (∗) collects functions for which the expected center of mass of vectors x of class y = 1, computed using Pℓ(y = 1|x), coincides with the center of mass of the training examples of the first class.

  • 3. The predicate φ(x) = xx^T, x ∈ R^n, collects functions for which the expected 0.5n(n + 1) values of the covariance matrix, computed using Pℓ(y = 1|x), coincide with the values of the covariance matrix computed for the vectors x of the first class.

SLIDE 49

EXAMPLES OF PREDICATES FOR 2D IMAGES {u(x1, x2)}

Let 2D functions u(x1, x2), 0 ≤ x1, x2 ≤ π, describe images, and let ℓ pairs (u1(x1, x2), y1), ..., (uℓ(x1, x2), yℓ) form the training set.

  • 1. The predicates

      φr,s(ui) = ∫₀^π ∫₀^π ui(x1, x2) cos(r x1) cos(s x2) dx1 dx2,   r, s = 1, ..., N,

    define the coefficients ar,s of the cosine expansion of the image ui(x1, x2).

  • 2. For a given function g(x1, x2), the predicate

      φ(ui, xµ, xν) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ui(x1, x2) g(x1 − x1_µ, x2 − x2_ν) dx1 dx2

    defines the value of the convolution at the point (x1_µ, x2_ν).
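A minimal sketch (my own illustration): a discrete approximation of the cosine-expansion predicates φr,s(u) for an image given as a pixel array on the square [0, π] × [0, π]; the image used in the example is a random stand-in, not MNIST data.

```python
import numpy as np

def cosine_predicates(image: np.ndarray, N: int = 4) -> np.ndarray:
    """Return the N x N matrix of predicate values phi_{r,s}(image), r, s = 1..N."""
    n1, n2 = image.shape
    x1 = np.linspace(0.0, np.pi, n1)
    x2 = np.linspace(0.0, np.pi, n2)
    d1, d2 = x1[1] - x1[0], x2[1] - x2[0]
    phi = np.empty((N, N))
    for r in range(1, N + 1):
        for s in range(1, N + 1):
            basis = np.outer(np.cos(r * x1), np.cos(s * x2))
            phi[r - 1, s - 1] = np.sum(image * basis) * d1 * d2   # Riemann sum over the grid
    return phi

if __name__ == "__main__":
    img = np.random.default_rng(0).random((28, 28))   # stand-in for a digit image
    print(cosine_predicates(img, N=4))
```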

SLIDE 50

INSTRUMENTS FOR SPECIAL PREDICATES

LIE DERIVATIVES
Let an image be defined by a differentiable 2D function u(x1, x2). Consider small linear transformations of the 2D space (x1, x2) ∈ R²:

  ta: (x1, x2) ⇒ (x1 + a1 x1 + a2 x2 + a3, x2 + a4 x2 + a5 x1 + a6).

For small ak the function u(ta(x1, x2)) has the following representation in the non-transformed space (x1, x2):

  u(ta(x1, x2)) ≈ u(x1, x2) + Σ_{k=1}^{6} ak Lk u(x1, x2),

where Lk u(x1, x2) are the so-called Lie derivatives.

SLIDE 51

ILLUSTRATION

[Figure: digit 2 in the transformed space and in the original space corrected by the Lie operator of rotation. From the paper by P. Simard et al., "Transformation Invariance...".]

SLIDE 52

LIE OPERATORS

  • 1. Horizontal translation ta(x1, x2) → (x1 + a, x2). Lie operator: L1 = ∂/∂x1.

  • 2. Vertical translation ta(x1, x2) → (x1, x2 + a). Lie operator: L2 = ∂/∂x2.

  • 3. Rotation tα(x1, x2) → (x1 cos α − x2 sin α, x1 sin α + x2 cos α). Lie operator: L3 = x2 ∂/∂x1 − x1 ∂/∂x2.

  • 4. Scaling ta(x1, x2) → (x1 + a x1, x2 + a x2). Lie operator: L4 = x1 ∂/∂x1 + x2 ∂/∂x2.

  • 5. Parallel hyperbolic transformation ta(x1, x2) → (x1 + a x1, x2 − a x2). Lie operator: L5 = x1 ∂/∂x1 − x2 ∂/∂x2.

  • 6. Diagonal hyperbolic transformation ta(x1, x2) → (x1 + a x2, x2 + a x1). Lie operator: L6 = x2 ∂/∂x1 + x1 ∂/∂x2.
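A minimal sketch (my own illustration): the six Lie derivatives of a discrete image, approximating ∂/∂x1 and ∂/∂x2 with finite differences via numpy.gradient; pixel indices stand in for the continuous coordinates.

```python
import numpy as np

def lie_derivatives(u: np.ndarray) -> dict:
    """Return the six Lie derivatives L1 u, ..., L6 u of an image u(x1, x2)."""
    du_dx1, du_dx2 = np.gradient(u)            # axis 0 ~ x1, axis 1 ~ x2
    x1, x2 = np.meshgrid(np.arange(u.shape[0]), np.arange(u.shape[1]), indexing="ij")
    return {
        "L1 (horizontal translation)": du_dx1,
        "L2 (vertical translation)":   du_dx2,
        "L3 (rotation)":               x2 * du_dx1 - x1 * du_dx2,
        "L4 (scaling)":                x1 * du_dx1 + x2 * du_dx2,
        "L5 (parallel hyperbolic)":    x1 * du_dx1 - x2 * du_dx2,
        "L6 (diagonal hyperbolic)":    x2 * du_dx1 + x1 * du_dx2,
    }
```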

SLIDE 53

ILLUSTRATION

[Figure: digit 3 and its transformations using 5 Lie operators (scaling, rotation, expansion-compression, diagonal expansion-compression, thickening). From the paper by P. Simard et al., "Transformation Invariance...".]

SLIDE 54

INVARIANTS WITH RESPECT TO LINEAR TRANSFORMATIONS

Let ui(x1, x2) be an image (an (n × n) matrix in the pixel space (x1, x2)). Consider six predicate matrices

  φk(ui) = Lk ui(x1, x2),   k = 1, ..., 6,

and the corresponding six sets of invariants

  Σ_{i=1}^{ℓ} φk(ui) Pℓ(y = 1|xi) = Σ_{i=1}^{ℓ} yi φk(ui),   k = 1, ..., 6.

Adding these equalities to the LUSI constraints, one tries to estimate a rule Pℓ(y = 1|x) which keeps the invariants with respect to the Lie transformations.

SLIDE 55

TANGENT DISTANCE

Consider two image functions u1(x1, x2) and u2(x1, x2). Let us introduce two six-parametric sets of functions

  {u1(x1, x2)}a = u1(x1, x2) + Σ_{k=1}^{6} ak Lk u1(x1, x2),
  {u2(x1, x2)}b = u2(x1, x2) + Σ_{k=1}^{6} bk Lk u2(x1, x2),

defined by small parameters ak and bk, k = 1, ..., 6. The tangent distance between the functions u1 and u2 is the value

  dtang(u1, u2) = min_{a,b} ||{u1(x1, x2)}a − {u2(x1, x2)}b||
                = min_{a,b} ||u1(x1, x2) + Σ_{k=1}^{6} ak Lk u1(x1, x2) − u2(x1, x2) − Σ_{k=1}^{6} bk Lk u2(x1, x2)||.

The predicate dtang(u, u0) defines the tangent distance from u to u0.
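A minimal sketch (my own illustration): the tangent distance computed by minimizing over the 2 × 6 small parameters ak, bk with ordinary least squares; lie_derivatives() is the finite-difference sketch given after the Lie-operator slide, and unconstrained least squares is an assumed stand-in for "small parameters".

```python
import numpy as np

def tangent_distance(u1: np.ndarray, u2: np.ndarray) -> float:
    """d_tang(u1, u2) = min_{a,b} ||u1 + sum_k a_k L_k u1 - u2 - sum_k b_k L_k u2||."""
    T1 = np.stack([d.ravel() for d in lie_derivatives(u1).values()], axis=1)   # (p, 6)
    T2 = np.stack([d.ravel() for d in lie_derivatives(u2).values()], axis=1)   # (p, 6)
    diff = (u1 - u2).ravel()
    # residual = diff + [T1, -T2] @ [a; b]  ->  linear least-squares problem in 12 unknowns
    M = np.hstack([T1, -T2])                                                    # (p, 12)
    coeff, *_ = np.linalg.lstsq(M, -diff, rcond=None)
    return float(np.linalg.norm(diff + M @ coeff))
```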
SLIDE 56

EXAMPLES OF PREDICATES THAT DEFINE DEGREE OF SYMMETRY

  • 1. Predicate defining the degree of vertical symmetry of u(x):

      u(x) = [ x11 ··· x1n ; ... ; xn1 ··· xnn ],   uT1(x) = [ xn1 ··· xnn ; ... ; x11 ··· x1n ]

    (rows in reverse order). The predicate defines the tangent distance dtang(u, uT1).

  • 2. Predicate defining the degree of horizontal symmetry of u(x):

      uT2(x) = [ x1n ··· x11 ; ... ; xnn ··· xn1 ]

    (columns in reverse order). The predicate defines the tangent distance dtang(u, uT2).

  • 3. Predicate defining the degree of horizontal antisymmetry of u(x):

      uT3(x) = [ xnn ··· xn1 ; ... ; x1n ··· x11 ]

    (rows and columns in reverse order). The predicate defines the tangent distance dtang(u, uT3).
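A minimal sketch (my own illustration): the three flipped images used by the symmetry predicates, with the predicate value taken as the tangent distance to the flipped image; tangent_distance() is the sketch given after the tangent-distance slide.

```python
import numpy as np

def symmetry_predicates(u: np.ndarray) -> dict:
    """Degrees of vertical/horizontal symmetry and horizontal antisymmetry of image u."""
    return {
        "vertical symmetry":       tangent_distance(u, np.flipud(u)),                # rows reversed (u_T1)
        "horizontal symmetry":     tangent_distance(u, np.fliplr(u)),                # columns reversed (u_T2)
        "horizontal antisymmetry": tangent_distance(u, np.flipud(np.fliplr(u))),     # both reversed (u_T3)
    }
```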

SLIDE 57

CONCLUSIVE REMARKS

  • Complete statistical methods of learning require, using the structural risk minimization principle, minimizing in a given set of functions {f(x)} the functional

      RV(f) = (Y − F(f))^T V (Y − F(f))

    subject to the invariant constraints Φs^T F(f) = Φs^T Y.

  • The LUSI method provides a unique solution of this problem for functions from an RKHS and an approximation for neural nets.

  • Further progress in learning theory goes beyond statistical reasoning. It goes in the direction of searching for predicates which form a basis for understanding the problems existing in the World (see the Plato-Hegel-Wigner line of philosophy).

  • Predicates are abstract ideas, while invariants that are built using them form elements of the solution. These two concepts reflect the essence of intelligence, not just its imitation.

SLIDE 58

THE CHALLENGE

Using 60,000 training examples of the MNIST digit recognition problem (6,000 per class), DNNs achieved ≈ 0.5% test error.

  • 1. Find predicates which will allow you to achieve the same level of test error using just 600 examples (60 per class).

  • 2. Find a small set of basic predicates to achieve this goal.
SLIDE 59

PLATO-HEGEL-WIGNER LINE OF PHILOSOPHY

In 1928 Vladimir Propp published the book "Morphology of the Folktale", where he described 31 predicates that allow one to synthesize Russian folk tales. Later his morphology was successfully applied to other types of narrative, be it in literature, theater, film, television series, games, etc. (although Propp applied it only to the wonder or fairy tale). (See Wikipedia: Vladimir Propp.)

The idea is that the World of Ideas contains a small number of ideas (predicates) that can be translated into the World of Things by many different invariants. Propp found 31 predicates which describe different actions of people in the Real World. Probably there exists a small number of predicates that describe 2D Real World images. The challenge is to find them (to understand the World of 2D images).