  1. MIT 9.520/6.860, Fall 2019 Statistical Learning Theory and Applications Class 02: Statistical Learning Setting Lorenzo Rosasco

  2. Learning from examples rather than being explicitly programmed.
     - Machine Learning deals with systems that are trained from data.
     - Here we describe the framework considered in statistical learning theory.

  3. All starts with data.
     - Supervised: $\{(x_1, y_1), \dots, (x_n, y_n)\}$.
     - Unsupervised: $\{x_1, \dots, x_m\}$.
     - Semi-supervised: $\{(x_1, y_1), \dots, (x_n, y_n)\} \cup \{x_1, \dots, x_m\}$.

  4. Supervised learning. Problem: given $S_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$, find $f$ such that $f(x_{\text{new}}) \approx y_{\text{new}}$.

  5. The supervised learning problem.
     - $X \times Y$ a probability space, with measure $P$ (fixed, but unknown).
     - $\ell: Y \times Y \to [0, \infty)$ a measurable loss function.
     Define the expected risk:
       $L(f) = \mathbb{E}_{(x,y) \sim P}[\ell(y, f(x))]$.
     Problem: solve
       $\min_{f: X \to Y} L(f)$,
     given only $S_n = (x_1, y_1), \dots, (x_n, y_n) \sim P^n$, i.e. $n$ i.i.d. samples w.r.t. $P$.
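
     Since $P$ is unknown, the expected risk of a candidate $f$ can only be approximated from samples. A minimal Python sketch of this setting (the particular $P$, the candidate predictors, and the square loss are illustrative assumptions, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_P(n):
    """Draw n i.i.d. pairs (x_i, y_i) from a simulated P (an assumed toy choice)."""
    x = rng.uniform(-1.0, 1.0, size=n)
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.1, size=n)
    return x, y

def empirical_risk(f, x, y, loss=lambda y, fx: (y - fx) ** 2):
    """Average loss on the sample; for large n this approximates L(f) = E[loss(y, f(x))]."""
    return np.mean(loss(y, f(x)))

x, y = sample_P(n=1000)
print(empirical_risk(lambda x: np.sin(np.pi * x), x, y))  # candidate close to the truth
print(empirical_risk(lambda x: np.zeros_like(x), x, y))   # the constant-zero predictor
```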

  6. Data space: $X$ the input space, $Y$ the output space.

  7. Input space. $X$ input space:
     - Linear spaces, e.g. vectors, functions, matrices/operators.
     - "Structured" spaces, e.g. strings, probability distributions, graphs.

  8. Output space. $Y$ output space:
     - Linear spaces, e.g.
       - $Y = \mathbb{R}$: regression,
       - $Y = \mathbb{R}^T$: multitask regression,
       - $Y$ a Hilbert space: functional regression.
     - "Structured" spaces, e.g.
       - $Y = \{-1, 1\}$: classification,
       - $Y = \{1, \dots, T\}$: multicategory classification,
       - strings, probability distributions, graphs.

  9. Probability distribution. Reflects the uncertainty and stochasticity of the learning problem:
       $P(x, y) = P_X(x)\, P(y \mid x)$,
     - $P_X$ marginal distribution on $X$,
     - $P(y \mid x)$ conditional distribution on $Y$ given $x \in X$.

  10. Conditional distribution and noise. Regression:
        $y_i = f^*(x_i) + \epsilon_i$.
      - Let $f^*: X \to Y$ be a fixed function,
      - $\epsilon_1, \dots, \epsilon_n$ zero-mean random variables, $\epsilon_i \sim \mathcal{N}(0, \sigma)$,
      - $x_1, \dots, x_n$ random, so that $P(y \mid x) = \mathcal{N}(f^*(x), \sigma)$.
      [Figure: data points $(x_1, y_1), \dots, (x_5, y_5)$ scattered around the graph of $f^*$.]
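
      A short sketch of this additive-noise model; the particular $f^*$ and $\sigma$ below are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.cos(2 * np.pi * x)    # assumed fixed target function f*
sigma = 0.2                                 # assumed noise level

x = rng.uniform(0.0, 1.0, size=100)         # x_1, ..., x_n random
eps = rng.normal(0.0, sigma, size=x.shape)  # zero-mean Gaussian noise eps_i ~ N(0, sigma)
y = f_star(x) + eps                         # y_i = f*(x_i) + eps_i, so P(y|x) = N(f*(x), sigma)
```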

  11. Conditional distribution and misclassification. Classification:
        $P(y \mid x) = \{P(1 \mid x), P(-1 \mid x)\}$.
      Noise in classification: overlap between the classes,
        $\Delta_\delta = \{\, x \in X : |P(1 \mid x) - 1/2| \le \delta \,\}$.
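
      A sketch of label noise in binary classification: labels are drawn from $P(y \mid x)$, and the region $\Delta_\delta$ collects the inputs where the classes overlap the most. The particular $P(1 \mid x)$ below is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def p_plus(x):
    """Assumed conditional probability P(y = +1 | x); a smooth overlap around x = 0."""
    return 1.0 / (1.0 + np.exp(-4.0 * x))

x = rng.uniform(-1.0, 1.0, size=500)
y = np.where(rng.uniform(size=x.shape) < p_plus(x), 1, -1)  # labels drawn from P(y|x)

# Region of maximal class overlap: Delta_delta = {x : |P(1|x) - 1/2| <= delta}
delta = 0.1
in_overlap = np.abs(p_plus(x) - 0.5) <= delta
print("fraction of sampled points in Delta_delta:", in_overlap.mean())
```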

  12. Marginal distribution and sampling. $P_X$ takes into account uneven sampling of the input space.

  13. Marginal distribution, densities and manifolds.
        $dP_X(x) = p(x)\, dx \;\Rightarrow\; dP_X(x) = p(x)\, d\mathrm{vol}(x)$,
      i.e. the density $p$ is taken with respect to the Lebesgue measure on $X$, or with respect to the volume measure when the data lie on a lower dimensional manifold.
      [Figure: two sample sets on $[-1, 1] \times [-1, 1]$ illustrating different marginal distributions.]

  14. Loss functions. $\ell: Y \times Y \to [0, \infty)$:
      - Cost of predicting $f(x)$ in place of $y$.
      - Measures the pointwise error $\ell(y, f(x))$.
      - Part of the problem definition, since $L(f) = \int_{X \times Y} \ell(y, f(x))\, dP(x, y)$.
      Note: sometimes it is useful to consider losses of the form $\ell: Y \times G \to [0, \infty)$ for some space $G$, e.g. $G = \mathbb{R}$.

  15. Loss for regression. $\ell(y, y') = V(y - y')$, with $V: \mathbb{R} \to [0, \infty)$.
      - Square loss: $\ell(y, y') = (y - y')^2$.
      - Absolute loss: $\ell(y, y') = |y - y'|$.
      - $\epsilon$-insensitive loss: $\ell(y, y') = \max(|y - y'| - \epsilon, 0)$.
      [Figure: plot of the square, absolute and $\epsilon$-insensitive losses as functions of $y - y'$.]
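
      The three regression losses above, written directly as functions of the residual $y - y'$ (a plain sketch; the default $\epsilon$ is an arbitrary choice):

```python
import numpy as np

def square_loss(y, y_pred):
    """Square loss: (y - y')^2."""
    return (y - y_pred) ** 2

def absolute_loss(y, y_pred):
    """Absolute loss: |y - y'|."""
    return np.abs(y - y_pred)

def eps_insensitive_loss(y, y_pred, eps=0.1):
    """Epsilon-insensitive loss: max(|y - y'| - eps, 0)."""
    return np.maximum(np.abs(y - y_pred) - eps, 0.0)
```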

  16. Loss for classification. $\ell(y, y') = V(-y y')$, with $V: \mathbb{R} \to [0, \infty)$.
      - 0-1 loss: $\ell(y, y') = \Theta(-y y')$, where $\Theta(a) = 1$ if $a \ge 0$ and $0$ otherwise.
      - Square loss: $\ell(y, y') = (1 - y y')^2$.
      - Hinge loss: $\ell(y, y') = \max(1 - y y', 0)$.
      - Logistic loss: $\ell(y, y') = \log(1 + \exp(-y y'))$.
      [Figure: plot of the 0-1, square, hinge and logistic losses as functions of the margin $y y'$.]
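
      The same four classification losses as functions of the margin $y y'$, where $y \in \{-1, +1\}$ and the prediction is real-valued (a plain sketch):

```python
import numpy as np

def zero_one_loss(y, y_pred):
    """0-1 loss: Theta(-y y'), i.e. 1 whenever y y' <= 0."""
    return np.where(y * y_pred <= 0, 1.0, 0.0)

def square_loss(y, y_pred):
    """Square loss on the margin: (1 - y y')^2."""
    return (1.0 - y * y_pred) ** 2

def hinge_loss(y, y_pred):
    """Hinge loss: max(1 - y y', 0)."""
    return np.maximum(1.0 - y * y_pred, 0.0)

def logistic_loss(y, y_pred):
    """Logistic loss: log(1 + exp(-y y'))."""
    return np.log1p(np.exp(-y * y_pred))
```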

  17. Loss functions for structured prediction. The loss is specific to each learning task, e.g.
      - Multiclass: square loss, weighted square loss, logistic loss, ...
      - Multitask: weighted square loss, absolute loss, ...
      - ...

  18. Expected risk.
        $L(f) = \mathbb{E}_{(x,y) \sim P}[\ell(y, f(x))] = \int_{X \times Y} \ell(y, f(x))\, dP(x, y)$,
      with $f \in \mathcal{F}$, $\mathcal{F} = \{ f: X \to Y \mid f \text{ measurable} \}$.
      Example: $Y = \{-1, +1\}$ and $\ell(y, f(x)) = \Theta(-y f(x))$, with $\Theta(a) = 1$ if $a \ge 0$ and $0$ otherwise. Then
        $L(f) = P(\{ (x, y) \in X \times Y \mid f(x) \neq y \})$,
      i.e. the probability of misclassification.
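
      A quick Monte Carlo check of the identity $L(f) = P(\{f(x) \neq y\})$ under the 0-1 loss, on an assumed toy distribution (both $P$ and the candidate classifier are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.uniform(-1.0, 1.0, size=n)
p_plus = 1.0 / (1.0 + np.exp(-4.0 * x))            # assumed P(1|x)
y = np.where(rng.uniform(size=n) < p_plus, 1, -1)  # labels drawn from P(y|x)

f = lambda x: np.sign(x)                           # some candidate classifier
risk_01 = np.mean(y * f(x) <= 0)                   # empirical 0-1 risk on a large sample
print(risk_01)                                     # close to P({f(x) != y})
```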

  19. Target function. $f_P = \arg\min_{f \in \mathcal{F}} L(f)$ can be derived for many loss functions. Write
        $L(f) = \int dP(x, y)\, \ell(y, f(x)) = \int dP_X(x)\, \underbrace{\int \ell(y, f(x))\, dP(y \mid x)}_{L_x(f(x))}$.
      It is possible to show that:
      - $\inf_{f \in \mathcal{F}} L(f) = \int dP_X(x)\, \inf_{a \in \mathbb{R}} L_x(a)$,
      - minimizers of $L(f)$ can be derived "pointwise" from the inner risk $L_x(f(x))$,
      - measurability of this pointwise definition of $f_P$ can be ensured.

  20. Target functions in regression. $f_P(x) = \arg\min_{a \in \mathbb{R}} L_x(a)$.
      - Square loss: $f_P(x) = \int y\, dP(y \mid x)$, the conditional mean.
      - Absolute loss: $f_P(x) = \mathrm{median}(P(y \mid x))$, where $\mathrm{median}(p(\cdot)) = y$ s.t. $\int_{-\infty}^{y} dp(t) = \int_{y}^{+\infty} dp(t)$.
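
      A standard one-line check of the square-loss case (not spelled out on the slide): differentiating the inner risk $L_x(a) = \int (y - a)^2\, dP(y \mid x)$ with respect to $a$ and setting the derivative to zero gives

      $$ \frac{d}{da} \int (y - a)^2\, dP(y \mid x) = -2 \int (y - a)\, dP(y \mid x) = 0 \;\Longrightarrow\; f_P(x) = \int y\, dP(y \mid x). $$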

  21. Target functions in classification.
      - Misclassification loss: $f_P(x) = \mathrm{sign}(P(1 \mid x) - P(-1 \mid x))$.
      - Square loss: $f_P(x) = P(1 \mid x) - P(-1 \mid x)$.
      - Logistic loss: $f_P(x) = \log \frac{P(1 \mid x)}{P(-1 \mid x)}$.
      - Hinge loss: $f_P(x) = \mathrm{sign}(P(1 \mid x) - P(-1 \mid x))$.

  22. Different loss, different target.
      - Each loss function defines a different optimal target function. Learning enters the picture when the latter is impossible or hard to compute (as in simulations).
      - As we will see in the following, loss functions also differ in terms of the computations they induce.

  23. Learning algorithms. Solve
        $\min_{f \in \mathcal{F}} L(f)$,
      given only $S_n = (x_1, y_1), \dots, (x_n, y_n) \sim P^n$.
      A learning algorithm is a map
        $S_n \mapsto \hat{f}_n = \hat{f}_{S_n}$,
      where $\hat{f}_n$ estimates $f_P$ given the observed examples $S_n$. How do we measure the error of such an estimate?

  24. Excess risk. Excess risk:
        $L(\hat{f}) - \min_{f \in \mathcal{F}} L(f)$.
      Consistency: for any $\epsilon > 0$,
        $\lim_{n \to \infty} P\big( L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) > \epsilon \big) = 0$.

  25. Other forms of consistency.
      Consistency in expectation:
        $\lim_{n \to \infty} \mathbb{E}\big[ L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) \big] = 0$.
      Consistency almost surely:
        $P\big( \lim_{n \to \infty} L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) = 0 \big) = 1$.
      Note: the different notions of consistency correspond to different notions of convergence for random variables: weak (in probability), in expectation, and almost sure.

  26. Sample complexity, tail bounds and error bounds.
      - Sample complexity: for any $\epsilon > 0$, $\delta \in (0, 1]$, when $n \ge n_{P,\mathcal{F}}(\epsilon, \delta)$,
          $P\big( L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) \ge \epsilon \big) \le \delta$.
      - Tail bounds: for any $\epsilon > 0$, $n \in \mathbb{N}$,
          $P\big( L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) \ge \epsilon \big) \le \delta_{P,\mathcal{F}}(n, \epsilon)$.
      - Error bounds: for any $\delta \in (0, 1]$, $n \in \mathbb{N}$,
          $P\big( L(\hat{f}) - \min_{f \in \mathcal{F}} L(f) \le \epsilon_{P,\mathcal{F}}(n, \delta) \big) \ge 1 - \delta$.
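
      As a side note (not stated on the slide), under mild monotonicity assumptions the three formulations are essentially equivalent: they arise from solving the same inequality for $\delta$, $\epsilon$, or $n$,

      $$ \delta = \delta_{P,\mathcal{F}}(n, \epsilon) \;\Longleftrightarrow\; \epsilon = \epsilon_{P,\mathcal{F}}(n, \delta) \;\Longleftrightarrow\; n = n_{P,\mathcal{F}}(\epsilon, \delta). $$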

  27. No free lunch theorem. A good algorithm should have small sample complexity for many distributions $P$. Is it possible to have an algorithm with small (finite) sample complexity for all problems? The no free lunch theorem provides a negative answer: given an algorithm, there exists a problem for which its learning performance is arbitrarily bad.

  28. Algorithm design: complexity and regularization. The design of most algorithms proceeds as follows:
      - Pick a (possibly large) class of functions $\mathcal{H}$, ideally such that $\min_{f \in \mathcal{H}} L(f) = \min_{f \in \mathcal{F}} L(f)$.
      - Define a procedure $A_\gamma(S_n) = \hat{f}_\gamma \in \mathcal{H}$ to explore the space $\mathcal{H}$.

  29. Bias and variance. Key error decomposition: let $f_\gamma$ be the solution obtained with an infinite number of examples. Then
        $L(\hat{f}_\gamma) - \min_{f \in \mathcal{H}} L(f) = \underbrace{L(\hat{f}_\gamma) - L(f_\gamma)}_{\text{Variance / Estimation}} + \underbrace{L(f_\gamma) - \min_{f \in \mathcal{H}} L(f)}_{\text{Bias / Approximation}}$.
      A small bias leads to a good data fit, while a high variance leads to possible instability.
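
      An illustrative sketch (not from the slides) of how the two terms trade off in practice: ridge regression on polynomial features, where the regularization strength plays the role of the parameter $\gamma$ selecting the effective hypothesis space. All modeling choices below are assumptions made for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
f_star = lambda x: np.sin(2 * np.pi * x)   # assumed target function

def make_data(n, sigma=0.3):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, f_star(x) + rng.normal(0.0, sigma, size=n)

def features(x, degree=9):
    """Polynomial features 1, x, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

def ridge_fit(x, y, lam):
    """Closed-form ridge regression on the polynomial features."""
    Phi = features(x)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

x_tr, y_tr = make_data(30)
x_te, y_te = make_data(2000)
for lam in [1e-6, 1e-2, 1e2]:   # small lam: low bias / high variance, and vice versa
    w = ridge_fit(x_tr, y_tr, lam)
    test_err = np.mean((features(x_te) @ w - y_te) ** 2)
    print(f"lambda={lam:g}  test error={test_err:.3f}")
```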

  30. ERM and structural risk minimization. A classical example: consider $(\mathcal{H}_\gamma)_\gamma$ such that $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \dots \subset \mathcal{H}_\gamma \subset \dots \subset \mathcal{H}$. Then let
        $\hat{f}_\gamma = \arg\min_{f \in \mathcal{H}_\gamma} \hat{L}(f), \qquad \hat{L}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, f(x_i))$.
      Example: $\mathcal{H}_\gamma$ are functions $f(x) = w^\top x$ (or $f(x) = w^\top \Phi(x)$), such that $\|w\| \le \gamma$; a small sketch follows below.
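
      A minimal sketch of ERM over the nested classes $\mathcal{H}_\gamma = \{ f(x) = w^\top x : \|w\| \le \gamma \}$, via projected gradient descent on the empirical square loss. The optimizer, step size, and synthetic data are illustrative choices, not prescribed by the slide.

```python
import numpy as np

def erm_norm_ball(X, y, gamma, steps=500, lr=0.1):
    """Approximate ERM of the square loss over {w : ||w|| <= gamma}."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        grad = 2.0 / n * X.T @ (X @ w - y)   # gradient of (1/n) sum (w^T x_i - y_i)^2
        w = w - lr * grad
        norm = np.linalg.norm(w)
        if norm > gamma:                     # project back onto the ball ||w|| <= gamma
            w = w * (gamma / norm)
    return w

# Usage on synthetic data; larger gamma explores a larger hypothesis space.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0.0, 0.1, size=100)
for gamma in [0.5, 2.0, 10.0]:
    w_hat = erm_norm_ball(X, y, gamma)
    print(gamma, np.round(w_hat, 2))
```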
