SLIDE 1

A Geometric Approach to Statistical Learning Theory Shahar Mendelson Centre for Mathematics and its Applications The Australian National University Canberra, Australia

SLIDE 2

What is a learning problem

  • A class of functions F on a probability space (Ω, µ)
  • A random variable Y one wishes to estimate
  • A loss functional ℓ
  • The information we have: a sample (Xi, Yi)ⁿᵢ₌₁

Our goal: with high probability, find a good approximation to Y in F with respect to ℓ, that is

  • Find f ∈ F such that Eℓ(f(X), Y) is “almost optimal”.
  • f is selected according to the sample (Xi, Yi)ⁿᵢ₌₁.

Shahar Mendelson: A Geometric Approach to Statistical Learning Theory

SLIDE 3

Example

Consider the following setup:

  • The random variable Y is given by a fixed function T : Ω → [0, 1] (that is, Yi = T(Xi)).
  • The loss functional is ℓ(u, v) = (u − v)².

Hence, the goal is to find some f ∈ F for which Eℓ(f(X), T(X)) = E(f(X) − T(X))² = ∫ (f(t) − T(t))² dµ(t) is as small as possible. To select f we use the sample X1, ..., Xn and the values T(X1), ..., T(Xn).

SLIDE 4

A variation

Consider the following excess loss: ℓ̄f(x) = (f(x) − T(x))² − (f∗(x) − T(x))² = ℓf(x) − ℓf∗(x), where f∗ minimizes Eℓf = E(f(X) − T(X))² in the class. The difference between the two cases:

SLIDE 5

Our Goal

  • Given a fixed sample size, how close to the optimal can one get using empirical data?
  • How does the specific choice of the loss influence the estimate?
  • What parameters of the class F are important?
  • Although one has access to a random sample, the measure which generates the data is not known.

SLIDE 6

The algorithm

Given a sample (X1, ..., Xn), select f̂ ∈ F which satisfies

f̂ = argmin_{f∈F} n⁻¹ Σⁿᵢ₌₁ ℓf(Xi),

that is, f̂ is the “best function” in the class on the data. The hope is that with high probability

E(ℓf̂ | X1, ..., Xn) = ∫ ℓf̂(t) dµ(t)

is close to the optimal. In other words, hopefully, with high probability, the empirical minimizer of the loss is “almost” the best function in the class with respect to the loss.
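The procedure on this slide can be sketched in a few lines. The threshold class, the target, and the sample size below are illustrative choices, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: Omega = [0, 1] with the uniform measure, squared loss, a small
# finite class F of threshold functions, and a target T inside the class.
def make_f(theta):
    return lambda x: (x >= theta).astype(float)

F = [make_f(theta) for theta in np.linspace(0.05, 0.95, 19)]
T = make_f(0.5)                     # the "true" function

n = 200
X = rng.uniform(0, 1, size=n)
Y = T(X)                            # noiseless labels Y_i = T(X_i)

# Empirical minimization: pick f minimizing n^-1 * sum_i (f(X_i) - Y_i)^2.
emp_risks = [np.mean((f(X) - Y) ** 2) for f in F]
f_hat = F[int(np.argmin(emp_risks))]

# The quantity we actually care about: E(f_hat(X) - T(X))^2 under mu,
# approximated here on a fine grid.
grid = np.linspace(0, 1, 10_001)
true_risk = np.mean((f_hat(grid) - T(grid)) ** 2)
print(min(emp_risks), true_risk)
```

Here the minimizer of the empirical loss also has small “real” loss; the rest of the talk is about when and how fast this happens.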

SLIDE 7

Back to the squared loss

In the case of the squared excess loss ℓ̄f(x) = (f(x) − T(x))² − (f∗(x) − T(x))², since the second term is the same for every f ∈ F, empirical minimization selects

argmin_{f∈F} Σⁿᵢ₌₁ (f(Xi) − T(Xi))²,

and the question is how to relate this empirical distance to the “real” distance we are interested in.

SLIDE 8

Analyzing the algorithm

For a second, let’s forget the loss, and from here on, to simplify notation, denote by G the loss class. We shall attempt to connect n⁻¹ Σⁿᵢ₌₁ g(Xi) (i.e., the random, empirical structure on G) to Eg. We shall examine various notions of similarity of the structures.

Note: in the case of an excess loss, 0 ∈ G and our aim is to be as close to 0 as possible; otherwise, our aim is to approach the minimizer g∗.

SLIDE 9

A road map

  • Asking the “correct” question - beware of loose methods of attack.
  • Properties of the loss and their significance.
  • Estimating the complexity of a class.

SLIDE 10

A little history

Originally, the study of {0, 1}-valued classes (e.g. Perceptrons) used the uniform law of large numbers:

Pr ( ∃g ∈ G : | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg | ≥ ε ),

which is a uniform measure of similarity. If this probability is small, then for every g ∈ G the empirical structure is “close” to the real one. In particular, this is true for the empirical minimizer ĝ, and thus, on the good event, Eĝ ≤ n⁻¹ Σⁿᵢ₌₁ ĝ(Xi) + ε.

In a minute: this approach is suboptimal!

SLIDE 11

Why is this bad?

Consider the excess loss case.

  • We hope that the algorithm will get us close to 0...
  • So, it seems likely that we would only need to control the part of G which is not too far from 0.
  • There is no need to control functions which are far away, while in the ULLN we control every function in G.

SLIDE 12

Why would this lead to a better bound?

Well, first, the set is smaller... More important:

  • functions close to 0 in expectation are likely to have a small variance (under mild assumptions)...

On the other hand,

  • because of the CLT, for every fixed function g ∈ G and n large enough, with probability 1/2,

    | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg | ≥ c √( var(g) / n ).
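A quick simulation makes this CLT-scale fluctuation visible. The choice g(x) = x² with X uniform on [0, 1] is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Typical size of |n^-1 sum_i g(X_i) - Eg| for one fixed function.
def g(x):
    return x ** 2

Eg = 1.0 / 3.0                          # E X^2 for X ~ U[0, 1]
var_g = 1.0 / 5.0 - 1.0 / 9.0           # var(X^2) = E X^4 - (E X^2)^2 = 4/45

results = {}
for n in (100, 400, 1600):
    X = rng.uniform(0, 1, size=(5000, n))          # 5000 independent samples
    devs = np.abs(g(X).mean(axis=1) - Eg)          # deviations of the empirical mean
    results[n] = np.median(devs)
    print(n, results[n], np.sqrt(var_g / n))       # median deviation vs sqrt(var/n)
```

The median deviation tracks √(var(g)/n) at every n, so any method that controls a fixed g with var(g) > 0 cannot beat the 1/√n scale.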

SLIDE 13

So?

  • Control over the entire class =⇒ control over functions with nonzero variance =⇒ the rate of convergence can’t be better than c/√n.
  • If g∗ = 0, we can’t hope to get a faster rate than c/√n using this method.
  • This shows the statistical limitation of the loss.

SLIDE 14

What does this tell us about the loss?

  • To get faster convergence rates one has to consider the excess loss.
  • We also need a condition implying that if the expectation is small, then the variance is small (e.g. Eℓf² ≤ B·Eℓf, a Bernstein condition). It turns out that this condition is connected to convexity properties of ℓ at 0.
  • One has to connect the richness of G to that of F (which follows from a Lipschitz condition on ℓ).

SLIDE 15

Localization - Excess loss

There are several ways to localize.

  • It is enough to bound

    Pr ( ∃g ∈ G : n⁻¹ Σⁿᵢ₌₁ g(Xi) ≤ ε and Eg ≥ 2ε ).

    The probability of this event upper bounds the probability that the algorithm fails. If this probability is small, then since n⁻¹ Σⁿᵢ₌₁ ĝ(Xi) ≤ ε, we get Eĝ ≤ 2ε.
  • Another (similar) option: relative bounds:

    Pr ( ∃g ∈ G : | (1/√n) Σⁿᵢ₌₁ (g(Xi) − Eg) | / √var(g) ≥ ε ).

SLIDE 16

Comparing structures

Suppose that one could find rn for which, with high probability, for every g ∈ G with Eg ≥ rn,

(1/2) Eg ≤ n⁻¹ Σⁿᵢ₌₁ g(Xi) ≤ (3/2) Eg

(here, 1/2 and 3/2 can be replaced by 1 − ε and 1 + ε). Then if ĝ was produced by the algorithm it can either

  • have a “large expectation”, Eĝ ≥ rn, =⇒ the structures are similar and thus Eĝ ≤ (2/n) Σⁿᵢ₌₁ ĝ(Xi), or
  • have a “small expectation” =⇒ Eĝ ≤ rn.
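The role of the threshold rn can be seen in a toy simulation: for a {0, 1}-valued g with Eg = r, how often does the two-sided event (1/2)Eg ≤ n⁻¹ Σⁿᵢ₌₁ g(Xi) ≤ (3/2)Eg fail? The Bernoulli model below is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

# Failure frequency of (1/2)Eg <= n^-1 sum g(X_i) <= (3/2)Eg for a single
# {0, 1}-valued g with Eg = r, as r shrinks.
n, reps = 500, 20_000
fails = {}
for r in (0.2, 0.05, 0.005):
    means = rng.binomial(n, r, size=reps) / n        # empirical means of g
    fails[r] = np.mean((means < 0.5 * r) | (means > 1.5 * r))
    print(r, fails[r])
```

The two-sided comparison is reliable only above some level; below it the relative fluctuations blow up, which is exactly why the analysis stops at rn.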

SLIDE 17

Comparing structures II

Thus, with high probability,

Eĝ ≤ max { rn, (2/n) Σⁿᵢ₌₁ ĝ(Xi) }.

This result is based on a ratio limit theorem, because we would like to show that

sup_{g∈G, Eg≥rn} | n⁻¹ Σⁿᵢ₌₁ g(Xi) / Eg − 1 | ≤ ε.

This normalization is possible if Eg² can be bounded using Eg (which is a property of the loss). Otherwise, one needs a slightly different localization.

SLIDE 18

Star shape

If G is star-shaped around 0 (that is, λg ∈ G for every g ∈ G and λ ∈ [0, 1]), its “relative richness” at level r increases as r becomes smaller.

SLIDE 19

Why is this better?

Thanks to the star-shape assumption, our aim is to find the smallest r = rn such that, with high probability,

sup_{g∈G, Eg=r} | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg | ≤ r/2.

This would imply that the error of the algorithm is at most r. For the non-localized result, to obtain the same error, one needs to show

sup_{g∈G} | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg | ≤ r,

where the supremum is over a much larger set, which includes functions with a “large” variance.

SLIDE 20

BTW, even this is not optimal...

It turns out that a structural approach (uniform or localized) does not give the best result one could get on the error of the EM algorithm. A sharp bound follows from a direct analysis of the algorithm (under mild assumptions) and depends on the behavior of the (random) function

φ̂n(r) = sup_{g∈G, Eg=r} | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg |.

SLIDE 21

Application of concentration

Suppose that we can show that with high probability, for every r,

sup_{g∈G, Eg=r} | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg | ∼ E sup_{g∈G, Eg=r} | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg | ≡ φn(r),

i.e. that the expectation is a good estimate of the random variable. Then,

  • The uniform estimate: the error is close to φn(1).
  • The localized estimate: the error is close to the fixed point r∗ satisfying φn(r∗) = r∗/2.
  • The direct analysis ......
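The gap between the uniform and the localized estimates can be sketched numerically for a stylized complexity function. The form φn(r) = √(rd/n) below is an illustrative assumption (typical for VC-type classes), not a formula from the talk:

```python
import math

# Stylized localized complexity: phi_n(r) ~ sqrt(r * d / n) (an assumption).
def phi(r, d, n):
    return math.sqrt(r * d / n)

def fixed_point(d, n, tol=1e-12):
    # Solve phi_n(r) = r / 2 by bisection on (0, 1]; phi(r) - r/2 changes
    # sign exactly once on this interval for this phi.
    lo, hi = 1e-15, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid, d, n) > mid / 2:   # still below the fixed point
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

d, n = 10, 10_000
uniform_rate = phi(1.0, d, n)          # the non-localized estimate, sqrt(d/n)
local_rate = fixed_point(d, n)         # the localized estimate
print(uniform_rate, local_rate, 4 * d / n)
```

Under this assumption the fixed point is r∗ = 4d/n, so localization improves the rate from √(d/n) to order d/n.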

SLIDE 22

A short summary

  • Bounding the right quantity - one needs to understand

    Pr ( sup_{g∈A} | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg | ≥ t ).

  • For an estimate on EM, A ⊂ G. The smaller we can take A, the better the bound!
  • Loss class vs. excess loss: being close to 0 (hopefully) implies small variance (a property of the loss).
  • One has to connect the “complexity” of A ⊂ G to the complexity of the subset of the base class F that generated it.
  • If one considers the excess loss (better statistical error), there is a question of the approximation error.

SLIDE 23

Estimating the complexity

  • Concentration:

    sup_{g∈A} | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg | ∼ E sup_{g∈A} | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg |.

  • Symmetrization:

    E sup_{g∈A} | n⁻¹ Σⁿᵢ₌₁ g(Xi) − Eg | ∼ EX Eε sup_{g∈A} | n⁻¹ Σⁿᵢ₌₁ εi g(Xi) | = Rn(A).

  • The εi are independent random signs, with Pr(εi = 1) = Pr(εi = −1) = 1/2.
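For a small finite class, Rn(A) can be estimated directly by Monte Carlo from this definition. The four functions below are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo estimate of R_n(A) = E_X E_eps sup_{g in A} |n^-1 sum_i eps_i g(X_i)|
# for a small finite class A of functions on [0, 1].
A = [np.sin, np.cos, np.square, lambda x: x]

def rademacher_complexity(n, n_mc=2000):
    vals = np.empty(n_mc)
    for k in range(n_mc):
        X = rng.uniform(0, 1, size=n)                # the sample
        eps = rng.choice([-1.0, 1.0], size=n)        # symmetric random signs
        proj = np.stack([g(X) for g in A])           # coordinate projection of A
        vals[k] = np.max(np.abs(proj @ eps)) / n     # sup over the finite class
    return vals.mean()

results = {n: rademacher_complexity(n) for n in (50, 200, 800)}
print(results)
```

For a finite class Rn(A) decreases like √(log|A| / n); the point here is only the mechanics of the definition.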

SLIDE 24

Estimating the Complexity II

  • For σ = (X1, ..., Xn), consider the coordinate projection PσA = {(g(X1), ..., g(Xn)) : g ∈ A}.
  • Then

    Rn(A) = EX Eε sup_{v∈PσA} | n⁻¹ Σⁿᵢ₌₁ εi vi |.

  • For every (random) coordinate projection V = PσA, the complexity parameter Eε sup_{v∈V} | n⁻¹ Σⁿᵢ₌₁ εi vi | measures the correlation of V with a “random noise”.
  • The noise model: a random point (ε1, ..., εn) of {−1, 1}ⁿ (the n-dimensional combinatorial cube). Rn(A) measures how strongly V is correlated with this noise.

SLIDE 25

Example - A large set

  • If the class of functions is bounded by 1, then for any sample (X1, ..., Xn), PσA ⊂ [−1, 1]ⁿ.
  • For V = [−1, 1]ⁿ (or V = {−1, 1}ⁿ),

    n⁻¹ Eε sup_{v∈V} Σⁿᵢ₌₁ εi vi = 1.

    (If there are many “large” coordinate projections, Rn does not converge to 0 as n → ∞!)
  • Question: what subsets of [−1, 1]ⁿ are big in the context of this complexity parameter?

SLIDE 26

Combinatorial dimensions

Consider a class of {−1, 1}-valued functions. Define the Vapnik-Chervonenkis dimension of A by

vc(A) = sup { |σ| : PσA = {−1, 1}^|σ| }.

In other words, vc(A) is the largest dimension of a coordinate projection of A which is the entire (combinatorial) cube. There is a real-valued analog of the Vapnik-Chervonenkis dimension, which is called the combinatorial dimension.
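The definition can be computed by brute force for a tiny class. The threshold class and the grid of points below are illustrative:

```python
from itertools import combinations

# Brute-force VC dimension: vc(A) is the largest |sigma| such that the
# coordinate projection P_sigma A is the full cube {-1, 1}^|sigma|.
# Illustrative class: thresholds x -> sign(x - theta) on a finite grid.
points = [0.1 * i for i in range(10)]                 # candidate sample points
thresholds = [0.05 + 0.1 * i for i in range(11)]
A = [[1 if x >= t else -1 for x in points] for t in thresholds]

def vc_dimension(A, n_points):
    vc = 0
    for k in range(1, n_points + 1):
        for sigma in combinations(range(n_points), k):
            proj = {tuple(g[i] for i in sigma) for g in A}
            if len(proj) == 2 ** k:                   # projection is the full cube
                vc = k
                break
        else:
            return vc                                 # no shattered set of size k
    return vc

print(vc_dimension(A, len(points)))
```

Thresholds can realize both signs on a single point but never the pattern (+1, −1) on an ordered pair, so the computation returns 1.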

SLIDE 27

Combinatorial dimensions II

The combinatorial dimension: For every ε, it measures the largest dimension |σ| of a “cube” of side length ε that can be found in a coordinate projection PσA.

SLIDE 28

Some connections between the parameters

  • If vc(A) ≤ d, then Rn(A) ≤ c √(d/n).
  • If vc(A, ε) is the combinatorial dimension of A at scale ε, then

    Rn(A) ≤ (C/√n) ∫₀^∞ √( vc(A, ε) ) dε.

Note: these bounds on Rn are (again) not optimal and can be improved in various ways. For example:

  • The bounds take into account the worst-case projection, not the average projection.
  • The bounds do not take into account the diameter of A.
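The first bound can be checked numerically for the class of thresholds x ↦ 1{x ≥ θ} on [0, 1], which has VC dimension 1 (an illustrative choice; on a sorted sample a threshold picks out a suffix, so only the signs matter):

```python
import numpy as np

rng = np.random.default_rng(4)

# Empirical check of R_n(A) <= c * sqrt(d/n) for thresholds (d = 1).
def rn_thresholds(n, n_mc=1000):
    # The coordinate projection of the threshold class on any (sorted) sample
    # is the set of suffix indicator vectors, so E_X is trivial here.
    vals = np.empty(n_mc)
    for k in range(n_mc):
        eps = rng.choice([-1.0, 1.0], size=n)
        suffix = np.concatenate(([0.0], np.cumsum(eps[::-1])))   # all suffix sums
        vals[k] = np.max(np.abs(suffix)) / n                     # sup over the class
    return vals.mean()

results = {}
for n in (100, 400, 1600):
    results[n] = rn_thresholds(n)
    print(n, results[n], np.sqrt(1.0 / n))
```

The estimates stay within a constant factor of √(1/n) = √(d/n) at every n, as the first bound predicts.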

SLIDE 29

Important

The VC dimension (combinatorial dimension) and other related complexity parameters (e.g. covering numbers) are only ways of upper bounding Rn. Sometimes such a bound is good, but at times it is not. Although the connections between the various complexity parameters are very interesting and nontrivial, for SLT it is always best to try to bound Rn directly. Again, the difficulty of the learning problem is captured by the “richness” of a random coordinate projection of the loss class.

SLIDE 30

Example: Error rates for VC classes

Let

  • F be a class of {0, 1}-valued functions with vc(F) ≤ d and T ∈ F (proper learning),
  • ℓ be the squared loss and G the corresponding loss class. Note, 0 ∈ G!
  • H = star(G, 0) = {λg | g ∈ G, 0 ≤ λ ≤ 1} be the star-shaped hull of G.

SLIDE 31

Example: Error rates for VC classes II

Then:

  • Since ℓf(x) ≥ 0 and functions in F are {0, 1}-valued, Eh² ≤ Eh for every h ∈ H.
  • The error rate is upper bounded by the fixed point of

    Rn(Hr) = E sup_{h∈H, Eh=r} | n⁻¹ Σⁿᵢ₌₁ εi h(Xi) |,

    i.e. the r for which Rn(Hr) = r/2.
  • The next step is to relate the complexity of Hr to the complexity of F.

SLIDE 32

Bounding Rn(Hr)

  • F is small in the appropriate sense.
  • ℓ is a Lipschitz function, and thus G = ℓ(F) is not much larger than F.
  • The star-shaped hull of G is not much larger than G.

In particular, for every n ≥ d, with probability larger than 1 − (ed/n)^{c′d}: if En ĝ ≤ inf_{g∈G} En g + ρ (where En denotes the empirical mean), then

Eĝ ≤ c max { (d/n) log( n / ed ), ρ }.
