RegML 2020 Class 1: Statistical Learning Theory - Lorenzo Rosasco
  1. RegML 2020 Class 1 Statistical Learning Theory Lorenzo Rosasco UNIGE-MIT-IIT

  2. All starts with DATA
     - Supervised: $\{(x_1, y_1), \ldots, (x_n, y_n)\}$,
     - Unsupervised: $\{x_1, \ldots, x_m\}$,
     - Semi-supervised: $\{(x_1, y_1), \ldots, (x_n, y_n)\} \cup \{x_1, \ldots, x_m\}$.

  3. Learning from examples

  4. Setting for the supervised learning problem
     - $X \times Y$ probability space, with measure $\rho$.
     - $S_n = (x_1, y_1), \ldots, (x_n, y_n) \sim \rho^n$, i.e. sampled i.i.d.
     - $L : Y \times Y \to [0, \infty)$, measurable loss function.
     - Expected risk: $\mathcal{E}(f) = \int_{X \times Y} L(y, f(x))\, d\rho(x, y)$.
     Problem: solve $\min_{f : X \to Y} \mathcal{E}(f)$, given only $S_n$ ($\rho$ fixed, but unknown).
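Since $\rho$ is unknown, the expected risk is in practice approximated by its empirical counterpart on the sample $S_n$. A minimal Python/NumPy sketch (the helper name, toy predictor, and data below are illustrative additions, not part of the slides):

```python
import numpy as np

def empirical_risk(f, X, Y, loss):
    """Average loss of predictor f over the sample S_n = (X, Y);
    a Monte Carlo estimate of the expected risk E(f)."""
    predictions = np.array([f(x) for x in X])
    return np.mean(loss(Y, predictions))

# Toy example with the square loss and a linear predictor (illustrative only).
square_loss = lambda y, y_pred: (y - y_pred) ** 2
f = lambda x: 2.0 * x

rng = np.random.default_rng(0)
X = rng.normal(size=100)
Y = 2.0 * X + 0.1 * rng.normal(size=100)
print(empirical_risk(f, X, Y, square_loss))
```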

  5. Data space: $X$ input space, $Y$ output space.

  6. Input space. $X$ input space:
     - linear spaces, e.g. vectors, functions, matrices/operators;
     - "structured" spaces, e.g. strings, probability distributions, graphs.

  7. Output space. $Y$ output space:
     - linear spaces, e.g.
       - $Y = \mathbb{R}$, regression,
       - $Y = \mathbb{R}^T$, multi-task regression,
       - $Y$ Hilbert space, functional regression;
     - "structured" spaces, e.g.
       - $Y = \{+1, -1\}$, classification,
       - $Y = \{1, \ldots, T\}$, multi-class classification,
       - strings, probability distributions, graphs.

  8. Probability distribution. Reflects uncertainty and stochasticity of the learning problem: $\rho(x, y) = \rho_X(x)\, \rho(y \mid x)$,
     - $\rho_X$ marginal distribution on $X$,
     - $\rho(y \mid x)$ conditional distribution on $Y$ given $x \in X$.

  9. Conditional distribution and noise. [Figure: data points $(x_1, y_1), \ldots, (x_5, y_5)$ scattered around $f^*$.] Regression model: $y_i = f^*(x_i) + \epsilon_i$,
     - $f^* : X \to Y$ a fixed function,
     - $\epsilon_1, \ldots, \epsilon_n$ zero-mean random variables,
     - $x_1, \ldots, x_n$ random.
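A short sketch of generating data from this regression model (Python/NumPy; the particular $f^*$ and noise level are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed (but in practice unknown) target function f*; illustrative choice.
f_star = lambda x: np.sin(2 * np.pi * x)

n = 50
x = rng.uniform(0.0, 1.0, size=n)       # x_1, ..., x_n random inputs
eps = rng.normal(0.0, 0.1, size=n)      # zero-mean noise eps_1, ..., eps_n
y = f_star(x) + eps                     # y_i = f*(x_i) + eps_i
```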

  10. Conditional distribution and misclassification. Classification: $\rho(y \mid x) = \{\rho(1 \mid x), \rho(-1 \mid x)\}$. [Figure: plot of the conditional class probabilities.] Noise in classification: overlap between the classes, $\Delta_t = \{\, x \in X : |\rho(1 \mid x) - \rho(-1 \mid x)| \le t \,\}$.
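A sketch of the same idea for classification (Python/NumPy; the specific $\rho(1 \mid x)$ is an illustrative assumption): labels are drawn from the conditional distribution, and class overlap corresponds to inputs where $\rho(1 \mid x)$ is close to $1/2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative conditional probability rho(1|x); classes overlap where it is near 1/2.
rho_plus = lambda x: 1.0 / (1.0 + np.exp(-4.0 * x))

n = 200
x = rng.uniform(-1.0, 1.0, size=n)
y = np.where(rng.uniform(size=n) < rho_plus(x), 1, -1)   # y ~ rho(y|x)

# Fraction of inputs in the noisy region Delta_t, using |rho(1|x) - rho(-1|x)| = |2 rho(1|x) - 1|.
t = 0.2
print(np.mean(np.abs(2.0 * rho_plus(x) - 1.0) <= t))
```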

  11. Marginal distribution and sampling. $\rho_X$ takes into account uneven sampling of the input space.

  12. Marginal distribution, densities and manifolds: $p(x) = \frac{d\rho_X(x)}{dx} \;\to\; p(x) = \frac{d\rho_X(x)}{d\,\mathrm{vol}(x)}$. [Figure: two scatter plots of samples over $[-1, 1]^2$.]

  13. Loss functions. $L : Y \times Y \to [0, \infty)$,
     - the cost of predicting $f(x)$ in place of $y$,
     - part of the problem definition: $\mathcal{E}(f) = \int L(y, f(x))\, d\rho(x, y)$,
     - measures the pointwise error.

  14. Losses for regression: $L(y, y') = L(y - y')$
     - Square loss: $L(y, y') = (y - y')^2$,
     - Absolute loss: $L(y, y') = |y - y'|$,
     - $\epsilon$-insensitive loss: $L(y, y') = \max(|y - y'| - \epsilon, 0)$.
     [Figure: square, absolute and $\epsilon$-insensitive losses as functions of $y - y'$.]
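The three regression losses written out in code (a minimal Python/NumPy sketch, added for illustration):

```python
import numpy as np

def square_loss(y, y_pred):
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    return np.abs(y - y_pred)

def eps_insensitive_loss(y, y_pred, eps=0.1):
    # max(|y - y'| - eps, 0): errors smaller than eps are not penalized.
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)
```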

  15. Losses for classification: $L(y, y') = L(-yy')$
     - 0-1 loss: $L(y, y') = \mathbf{1}_{\{-yy' > 0\}}$,
     - Square loss: $L(y, y') = (1 - yy')^2$,
     - Hinge loss: $L(y, y') = \max(1 - yy', 0)$,
     - Logistic loss: $L(y, y') = \log(1 + \exp(-yy'))$.
     [Figure: 0-1, square, hinge and logistic losses as functions of the margin $yy'$.]
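And the classification losses, all functions of the margin $yy'$ (illustrative Python/NumPy sketch):

```python
import numpy as np

def zero_one_loss(y, y_pred):
    # 1 if the signs disagree (negative margin), 0 otherwise.
    return (y * y_pred < 0).astype(float)

def square_loss(y, y_pred):
    return (1.0 - y * y_pred) ** 2

def hinge_loss(y, y_pred):
    return np.maximum(1.0 - y * y_pred, 0.0)

def logistic_loss(y, y_pred):
    return np.log(1.0 + np.exp(-y * y_pred))
```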

  16. Losses for structured prediction. The loss is specific to each learning task, e.g.
     - Multi-class: square loss, weighted square loss, logistic loss, ...
     - Multi-task: weighted square loss, absolute loss, ...
     - ...

  17. Expected risk: $\mathcal{E}(f) = \mathcal{E}_L(f) = \int_{X \times Y} L(y, f(x))\, d\rho(x, y)$; note that $f \in \mathcal{F}$, where $\mathcal{F} = \{ f : X \to Y \mid f \text{ measurable} \}$.
     Example: $Y = \{-1, +1\}$, $L(y, f(x)) = \mathbf{1}_{\{-yf(x) > 0\}}$, then $\mathcal{E}(f) = P(\{(x, y) \in X \times Y \mid f(x) \neq y\})$.

  18. Target function: $f_\rho = \arg\min_{f \in \mathcal{F}} \mathcal{E}(f)$, can be derived for many loss functions...

  19. Target functions in regression.
     - Square loss: $f_\rho(x) = \int_Y y \, d\rho(y \mid x)$,
     - Absolute loss: $f_\rho(x) = \mathrm{median}\,\rho(y \mid x)$, where $\mathrm{median}\, p(\cdot) = y$ s.t. $\int_{-\infty}^{y} dp(t) = \int_{y}^{+\infty} dp(t)$.
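A quick numerical check of these two facts (Python/NumPy; the skewed distribution is an illustrative assumption): over a grid of constant predictions, the empirical square risk is minimized near the mean and the empirical absolute risk near the median.

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from a skewed "conditional" distribution rho(y|x) at a fixed x.
y = rng.exponential(scale=1.0, size=100_000)

c = np.linspace(0.0, 3.0, 301)                    # candidate constant predictions
sq = [np.mean((y - ci) ** 2) for ci in c]         # empirical square risk
ab = [np.mean(np.abs(y - ci)) for ci in c]        # empirical absolute risk

print(c[np.argmin(sq)], y.mean())      # minimizer of the square risk ~ mean
print(c[np.argmin(ab)], np.median(y))  # minimizer of the absolute risk ~ median
```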

  20. Target functions in classification.
     - 0-1 loss: $f_\rho(x) = \mathrm{sign}(\rho(1 \mid x) - \rho(-1 \mid x))$,
     - Square loss: $f_\rho(x) = \rho(1 \mid x) - \rho(-1 \mid x)$,
     - Logistic loss: $f_\rho(x) = \log\dfrac{\rho(1 \mid x)}{\rho(-1 \mid x)}$,
     - Hinge loss: $f_\rho(x) = \mathrm{sign}(\rho(1 \mid x) - \rho(-1 \mid x))$.

  21. Learning algorithms: $S_n \to \hat f_n = \hat f_{S_n}$; $\hat f_n$ estimates $f_\rho$ given the observed examples $S_n$. How to measure the error of an estimator?

  22. Excess risk.
     - Excess risk: $\mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f)$,
     - Consistency: for any $\epsilon > 0$, $\lim_{n \to \infty} P\big(\mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) > \epsilon\big) = 0$.
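A simulation sketch of consistency (Python/NumPy; the linear model, noise level, and choice of least squares are illustrative assumptions, not the slides' algorithm): the excess risk over the linear class shrinks as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)

d, w_star, sigma = 5, np.ones(5), 0.5   # illustrative linear model y = <w*, x> + noise

def sample(n):
    X = rng.normal(size=(n, d))
    return X, X @ w_star + sigma * rng.normal(size=n)

# Large test set to approximate the expected risk; the best linear predictor is w*.
X_test, y_test = sample(100_000)
best_risk = np.mean((y_test - X_test @ w_star) ** 2)

for n in [10, 100, 1000, 10_000]:
    X, y = sample(n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]             # least-squares estimator
    excess = np.mean((y_test - X_test @ w_hat) ** 2) - best_risk
    print(n, excess)                                          # decreases as n grows
```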

  23. Tail bounds, sample complexity and error bounds.
     - Tail bounds: for any $\epsilon > 0$, $n \in \mathbb{N}$, $P\big(\mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) > \epsilon\big) \le \delta(n, \mathcal{F}, \epsilon)$.
     - Sample complexity: for any $\epsilon > 0$, $\delta \in (0, 1]$, when $n \ge n_0(\epsilon, \delta, \mathcal{F})$, $P\big(\mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) > \epsilon\big) \le \delta$.
     - Error bounds: for any $\delta \in (0, 1]$, $n \in \mathbb{N}$, with probability at least $1 - \delta$, $\mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f) \le \epsilon(n, \mathcal{F}, \delta)$.

  24. Error bounds and the no-free-lunch theorem. Theorem: for any $\hat f$, there exists a problem for which $\mathbb{E}\big[\mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f)\big] > 0$.

  25. No-free-lunch theorem (continued). Theorem: for any $\hat f$, there exists a $\rho$ such that $\mathbb{E}\big[\mathcal{E}(\hat f) - \inf_{f \in \mathcal{F}} \mathcal{E}(f)\big] > 0$. This motivates restricting $\mathcal{F} \to \mathcal{H}$, the hypothesis space.

  26. Hypothesis space: $\mathcal{H} \subset \mathcal{F}$. E.g. $X = \mathbb{R}^d$, $\mathcal{H} = \{\, f(x) = \langle w, x \rangle = \sum_{j=1}^{d} w_j x_j \mid w \in \mathbb{R}^d, \ \forall x \in X \,\}$, then $\mathcal{H} \simeq \mathbb{R}^d$.
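In code, an element of this $\mathcal{H}$ is just a parameter vector $w \in \mathbb{R}^d$ (illustrative Python/NumPy sketch, not part of the slides):

```python
import numpy as np

def linear_hypothesis(w):
    """Return the function f(x) = <w, x> identified by the parameter w."""
    return lambda x: np.dot(w, x)

w = np.array([1.0, -2.0, 0.5])          # a point of R^d, i.e. an element of H
f = linear_hypothesis(w)
print(f(np.array([1.0, 1.0, 1.0])))     # <w, x> = -0.5
```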

  27. Finite dictionaries: $\mathcal{D} = \{ \phi_i : X \to \mathbb{R} \mid i = 1, \ldots, p \}$, $\mathcal{H} = \{\, f(x) = \sum_{j=1}^{p} w_j \phi_j(x) \mid w_1, \ldots, w_p \in \mathbb{R}, \ \forall x \in X \,\}$, i.e. $f(x) = w^\top \Phi(x)$ with $\Phi(x) = (\phi_1(x), \ldots, \phi_p(x))$.
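A minimal sketch of a finite dictionary and its feature map $\Phi$ (Python/NumPy; the monomial dictionary on $X = \mathbb{R}$ is an illustrative choice):

```python
import numpy as np

# Dictionary of p = 4 features phi_j : R -> R (illustrative monomials).
dictionary = [lambda x: 1.0, lambda x: x, lambda x: x**2, lambda x: x**3]

def feature_map(x):
    """Phi(x) = (phi_1(x), ..., phi_p(x))."""
    return np.array([phi(x) for phi in dictionary])

def f(x, w):
    """f(x) = w^T Phi(x): an element of H, parametrized by w in R^p."""
    return w @ feature_map(x)

w = np.array([0.5, -1.0, 0.0, 2.0])
print(f(1.5, w))   # 0.5 - 1.5 + 0.0 + 6.75 = 5.75
```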

  28. This class. Learning theory ingredients:
     - Data space/distribution
     - Loss function, risks and target functions
     - Learning algorithms and error estimates
     - Hypothesis space

  29. Next class:
     - Regularized learning algorithms: penalization
     - Statistics and computations
     - Nonparametrics and kernels
