  1. Overview of statistical learning theory. Daniel Hsu, Columbia TRIPODS Bootcamp.

  2. Statistical model for machine learning

  3. Basic goal of machine learning
     Goal: Predict outcome y from a set of possible outcomes Y, on the basis of an observation x from a feature space X.
     ◮ Examples:
       1. x = email message, y = spam or ham
       2. x = image of handwritten digit, y = digit
       3. x = medical test results, y = disease status
     Learning algorithm:
     ◮ Receives training data (x_1, y_1), ..., (x_n, y_n) ∈ X × Y and returns a prediction function f̂ : X → Y.
     ◮ On a (new) test example (x, y), predict f̂(x).

  4. Assessing the quality of predictions
     Loss function: ℓ : Y × Y → R_+
     ◮ Prediction is ŷ, true outcome is y.
     ◮ Loss ℓ(ŷ, y) measures how bad ŷ is as a prediction of y.
     Examples:
       1. Zero-one loss: ℓ(ŷ, y) = 1{ŷ ≠ y}, i.e., 0 if ŷ = y and 1 if ŷ ≠ y.
       2. Squared loss (for Y ⊆ R): ℓ(ŷ, y) = (ŷ - y)².
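
The two example losses are straightforward to write in code. Below is a minimal Python sketch (not from the slides); the function names and example values are purely illustrative.

```python
# Minimal sketch of the two example loss functions from the slide above.

def zero_one_loss(y_hat, y):
    """Zero-one loss: 0 if the prediction is correct, 1 otherwise."""
    return 0.0 if y_hat == y else 1.0

def squared_loss(y_hat, y):
    """Squared loss for real-valued outcomes: (y_hat - y)^2."""
    return (y_hat - y) ** 2

# Hypothetical example values:
print(zero_one_loss("spam", "ham"))   # 1.0 (wrong label)
print(squared_loss(2.5, 3.0))         # 0.25
```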

  5. Why is this possible?
     ◮ Only input provided to the learning algorithm is the training data (x_1, y_1), ..., (x_n, y_n).
     ◮ To be useful, the training data must be related to the test example (x, y). How can we formalize this?

  6. Basic statistical model for data
     IID model of data: Regard training data and test example as independent and identically distributed (X × Y)-valued random variables:
       (X_1, Y_1), ..., (X_n, Y_n), (X, Y) ~ iid P.
     Can use tools from probability to study the behavior of learning algorithms under this model.

  7. Risk
     Loss ℓ(f(X), Y) is random, so study average-case performance.
     Risk of a prediction function f: R(f) = E[ℓ(f(X), Y)], where the expectation is taken with respect to the test example (X, Y).
     Examples:
       1. Mean squared error: ℓ = squared loss, R(f) = E[(f(X) - Y)²].
       2. Error rate: ℓ = zero-one loss, R(f) = P(f(X) ≠ Y).
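
Since the risk is an expectation over a fresh draw (X, Y) ~ P, it can be approximated by averaging the loss over a large iid sample whenever P can be sampled from. The sketch below does this for a toy distribution P and a fixed prediction function f; both are assumptions made only for this example, not anything specified on the slides.

```python
# Illustrative sketch: estimating R(f) = E[loss(f(X), Y)] by Monte Carlo
# for a toy distribution P chosen here for demonstration purposes.
import numpy as np

rng = np.random.default_rng(0)

def sample_P(n):
    """Toy P: X ~ N(0, 1), Y = 2*X + Gaussian noise with std 0.5."""
    X = rng.normal(size=n)
    Y = 2.0 * X + rng.normal(scale=0.5, size=n)
    return X, Y

def f(x):
    """A fixed prediction function whose risk we want to estimate."""
    return 1.5 * x

# Average squared loss over many fresh draws from P approximates the
# mean squared error E[(f(X) - Y)^2], which is 0.5 for this toy P.
X, Y = sample_P(100_000)
risk_estimate = np.mean((f(X) - Y) ** 2)
print(risk_estimate)   # roughly 0.5
```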

  8. Comparison to classical statistics
     How (classical) learning theory differs from classical statistics:
     ◮ Typically, the data distribution P is allowed to be arbitrary.
       ◮ E.g., not from a parametric family {P_θ : θ ∈ Θ}.
     ◮ Focus on prediction rather than general estimation of P.
     Now: Much overlap between machine learning and statistics.

  9. Inductive bias

  10. Is predictability enough?
      Requirements for learning:
      ◮ Relationship between training data and test example.
        ◮ Formalized by the iid model for data.
      ◮ Relationship between Y and X.
        ◮ Example: X and Y are non-trivially correlated.
      Is this enough?

  11. No free lunch
      For any n ≤ |X|/2 and any learning algorithm, there is a distribution, from which the n training data and the test example are drawn iid, such that:
        1. There is a function f* : X → Y with P(f*(X) ≠ Y) = 0.
        2. The learning algorithm returns a function f̂ : X → Y with P(f̂(X) ≠ Y) ≥ 1/4.

  12. How to pay for lunch
      Must make some assumption about the learning problem in order for a learning algorithm to work well.
      ◮ Called the inductive bias of the learning algorithm.
      Common approach:
      ◮ Assume there is a good prediction function in a restricted function class F ⊂ Y^X.
      ◮ Goal: find f̂ : X → Y with small excess risk
          R(f̂) - min_{f ∈ F} R(f),
        either in expectation or with high probability over the random draw of the training data.

  13. Examples

  14. Example #1: Threshold functions
      X = R, Y = {0, 1}.
      ◮ Threshold functions F = {f_θ : θ ∈ R}, where f_θ is defined by
          f_θ(x) = 1{x > θ} = 0 if x ≤ θ, 1 if x > θ.
      ◮ Learning algorithm (see the Python sketch below):
        1. Sort training examples by x_i-value.
        2. Consider candidate threshold values that are (i) equal to x_i-values, (ii) equal to values midway between consecutive but non-equal x_i-values, and (iii) a value smaller than all x_i-values.
        3. Among candidate thresholds, pick θ̂ such that f_θ̂ incorrectly classifies the smallest number of examples in the training data.
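
A direct Python sketch of this threshold-learning algorithm follows. It assumes labels in {0, 1}; tie-breaking among equally good candidate thresholds is arbitrary here, which the slide leaves unspecified.

```python
# Sketch of the threshold-learning algorithm described above: enumerate
# candidate thresholds and keep the one with the fewest training mistakes.
import numpy as np

def learn_threshold(x, y):
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=int)
    xs = np.sort(x)                            # 1. sort by x-value

    # 2. candidate thresholds: a value below all x-values, the x-values
    #    themselves, and midpoints between consecutive distinct x-values
    candidates = [xs.min() - 1.0]
    candidates.extend(xs)
    for a, b in zip(xs[:-1], xs[1:]):
        if a != b:
            candidates.append((a + b) / 2.0)

    # 3. pick the candidate threshold with the fewest training errors
    def errors(theta):
        return np.sum((x > theta).astype(int) != y)
    return min(candidates, key=errors)

# Example usage on a tiny synthetic training set:
x_train = [0.2, 1.5, 0.7, 2.3, 1.9]
y_train = [0,   1,   0,   1,   1  ]
theta_hat = learn_threshold(x_train, y_train)
print(theta_hat)   # 0.7 here; any threshold in [0.7, 1.5) has zero training error
```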

  15. Example #2: Linear functions
      X = R^d, Y = R, ℓ = squared loss.
      ◮ Linear functions F = {f_w : w ∈ R^d}, where f_w is defined by f_w(x) = w^T x.
      ◮ Learning algorithm (“Ordinary Least Squares”): return a solution ŵ to the system of linear equations
          (1/n) Σ_{i=1}^n x_i x_i^T ŵ = (1/n) Σ_{i=1}^n y_i x_i
        (see the sketch below).
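
The sketch below solves these normal equations with NumPy. Using a least-squares solver for the linear system is a choice made here so the code also handles the case where the solution is not unique; the synthetic data is only for illustration.

```python
# Sketch of the ordinary-least-squares rule stated above: solve the normal
# equations (1/n) sum_i x_i x_i^T w = (1/n) sum_i y_i x_i for w.
import numpy as np

def ols(X, y):
    """X: n-by-d array of feature vectors; y: length-n array of outcomes."""
    A = X.T @ X / len(y)          # (1/n) sum_i x_i x_i^T
    b = X.T @ y / len(y)          # (1/n) sum_i y_i x_i
    # lstsq also handles a singular A (i.e., a non-unique solution)
    w_hat, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w_hat

# Example with synthetic data: y is a noisy linear function of x.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + rng.normal(scale=0.1, size=200)
print(ols(X, y))   # should be close to w_true
```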

  16. Example #3: Linear classifiers
      X = R^d, Y = {-1, +1}.
      ◮ Linear classifiers F = {f_w : w ∈ R^d}, where f_w is defined by
          f_w(x) = sign(w^T x) = -1 if w^T x ≤ 0, +1 if w^T x > 0.
      ◮ Learning algorithm (“Support Vector Machine”): return a solution ŵ to the following optimization problem (see the sketch below):
          min_{w ∈ R^d}  (λ/2) ‖w‖₂² + (1/n) Σ_{i=1}^n [1 - y_i w^T x_i]_+ .
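
The slide does not say how this optimization problem is solved. The sketch below uses plain subgradient descent with a 1/(λt) step size, one standard choice (in the style of Pegasos-type solvers) but an assumption of this example; the data, λ, and iteration count are likewise illustrative.

```python
# Sketch of one way to approximately solve the SVM objective above,
#     min_w  (lambda/2) ||w||^2 + (1/n) sum_i [1 - y_i w^T x_i]_+ ,
# using subgradient descent with step size 1/(lambda * t).
import numpy as np

def svm_subgradient(X, y, lam=0.1, n_steps=1000):
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, n_steps + 1):
        margins = y * (X @ w)                  # y_i * w^T x_i
        active = margins < 1                   # examples with positive hinge loss
        # subgradient of the objective at the current w
        grad = lam * w - (y[active][:, None] * X[active]).sum(axis=0) / n
        w -= grad / (lam * t)                  # decreasing step size
    return w

# Example usage on linearly separable synthetic data with labels in {-1, +1}:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
w_hat = svm_subgradient(X, y)
print(np.mean(np.sign(X @ w_hat) != y))   # training error rate, near 0
```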

  17. Over-fitting and generalization

  18. Over-fitting
      Over-fitting: Phenomenon where the learning algorithm returns f̂ that “fits” the training data well, but does not give accurate predictions on test examples.
      ◮ Empirical risk of f (on training data (X_1, Y_1), ..., (X_n, Y_n)):
          R_n(f) = (1/n) Σ_{i=1}^n ℓ(f(X_i), Y_i).
      ◮ Over-fitting: R_n(f̂) small, but R(f̂) large.
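
The gap between R_n(f̂) and R(f̂) is easy to see numerically. The sketch below fits a high-degree polynomial (a deliberately rich class) to a small training set; the particular data-generating process and polynomial degree are assumptions of this example, not anything from the slides.

```python
# Sketch illustrating over-fitting: the empirical risk R_n(f_hat) on the
# training data can be much smaller than the risk R(f_hat), estimated here
# on a large fresh sample from the same distribution.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-1, 1, size=n)
    y = np.sin(3 * x) + rng.normal(scale=0.3, size=n)
    return x, y

x_train, y_train = sample(15)
x_test, y_test = sample(10_000)

# Fit a degree-12 polynomial to only 15 points (a very "rich" class).
coefs = np.polyfit(x_train, y_train, deg=12)
f_hat = np.poly1d(coefs)

empirical_risk = np.mean((f_hat(x_train) - y_train) ** 2)   # R_n(f_hat)
test_risk = np.mean((f_hat(x_test) - y_test) ** 2)          # estimate of R(f_hat)
print(empirical_risk, test_risk)   # empirical risk tiny, test risk typically much larger
```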

  19. Generalization: how to avoid over-fitting
      “Theorem”: R(f̂) - R_n(f̂) is likely to be small, if the learning algorithm chooses f̂ from an F that is “not too rich” relative to n.
      ◮ ⇒ Observed performance on training data (i.e., empirical risk) generalizes to expected performance on the test example (i.e., risk).
      ◮ Justifies learning algorithms based on minimizing empirical risk.
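
One standard way to make this “Theorem” concrete, for the special case of a finite class F and a loss bounded in [0, 1] (a case chosen here for illustration, not necessarily the one the slide has in mind), is Hoeffding's inequality combined with a union bound over F:

```latex
% With probability at least 1 - \delta over the n iid training examples,
% for a finite class F and a loss taking values in [0, 1]:
\[
  \sup_{f \in \mathcal{F}} \bigl| R(f) - R_n(f) \bigr|
  \;\le\; \sqrt{\frac{\ln\!\bigl(2|\mathcal{F}|/\delta\bigr)}{2n}} .
\]
% So if \ln|\mathcal{F}| is small relative to n, the empirical risk of every
% f in F (in particular, of the empirical risk minimizer \hat{f}) is close
% to its risk, which justifies minimizing R_n over F.
```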

  20. Other issues

  21. Risk decomposition
          R(f̂) =   inf_{g : X→Y} R(g)                          (inherent unpredictability)
                 + [inf_{f ∈ F} R(f) - inf_{g : X→Y} R(g)]      (approximation gap)
                 + [inf_{f ∈ F} R_n(f) - inf_{f ∈ F} R(f)]      (estimation gap)
                 + [R_n(f̂) - inf_{f ∈ F} R_n(f)]                (optimization gap)
                 + [R(f̂) - R_n(f̂)].                             (more estimation gap)
      ◮ Approximation:
        ◮ Which function classes F are “rich enough” for a broad class of learning problems?
        ◮ E.g., neural networks, Reproducing Kernel Hilbert Spaces.
      ◮ Optimization:
        ◮ Often, finding a minimizer of R_n is computationally hard. What can we do instead?

  22. Alternative model: online learning
      Alternative to the iid model for data:
      ◮ Examples arrive in a stream, one at a time.
      ◮ At time t:
        ◮ Nature reveals x_t.
        ◮ Learner makes prediction ŷ_t.
        ◮ Nature reveals y_t.
        ◮ Learner incurs loss ℓ(ŷ_t, y_t).
      Relationship between past and future:
      ◮ No statistical assumption on the data.
      ◮ Just assume there exists f* ∈ F with small (empirical) risk (1/n) Σ_{t=1}^n ℓ(f*(x_t), y_t).
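
The protocol itself is just a loop. The sketch below runs it with a deliberately naive learner (predict the majority label seen so far) and a synthetic label stream; both are stand-ins chosen for this example, not methods from the slides.

```python
# Sketch of the online protocol above, with a simple stand-in learner.
import numpy as np

rng = np.random.default_rng(0)

class MajorityLearner:
    """Predicts the most common label observed so far (ignores x_t; ties -> 0)."""
    def __init__(self):
        self.counts = {0: 0, 1: 0}
    def predict(self, x_t):
        return max(self.counts, key=self.counts.get)
    def update(self, x_t, y_t):
        self.counts[y_t] += 1

learner = MajorityLearner()
total_loss = 0.0
for t in range(1000):
    x_t = rng.normal(size=3)                 # nature reveals x_t
    y_hat = learner.predict(x_t)             # learner makes a prediction
    y_t = int(rng.random() < 0.7)            # nature reveals y_t (70% ones in this stream)
    total_loss += float(y_hat != y_t)        # learner incurs zero-one loss
    learner.update(x_t, y_t)                 # learner may update its state

print(total_loss / 1000)   # average loss approaches 0.3 for this stream
```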
