

SLIDE 1

Introduction to Machine Learning CMU-10701

  • 10. Risk Minimization

Barnabás Póczos

SLIDE 2

  • 10. Risk Minimization

SLIDE 3

What have we seen so far?


Several algorithms that seem to work fine on training datasets:

  • Linear regression
  • Naïve Bayes classifier
  • Perceptron
  • Support Vector Machines

  • How good are these algorithms on unknown test sets?
  • How many training samples do we need to achieve small error?
  • What is the smallest possible error we can achieve?

⇒ Learning Theory

SLIDE 4

Outline

  • Risk and loss

    – Loss functions
    – Risk
    – Empirical risk vs. true risk
    – Empirical risk minimization

  • Underfitting and Overfitting
  • Classification
  • Regression


SLIDE 5

Supervised Learning Setup

Generative model of the data (train and test data): the pairs $(X_i, Y_i)$ are drawn i.i.d. from an unknown distribution $P(X, Y)$.

Regression: $Y \in \mathbb{R}$. Classification: $Y \in \{0, 1\}$.

SLIDE 6

Loss

Loss function: $L(f(x), y) \ge 0$.

It measures how good we are on a particular $(x, y)$ pair.

SLIDE 7

Loss Examples

Classification loss: $L(f(x), y) = \mathbf{1}\{f(x) \neq y\}$

L2 loss for regression: $L(f(x), y) = (f(x) - y)^2$

L1 loss for regression: $L(f(x), y) = |f(x) - y|$

Example (regression): predict house prices.

Minimizing the expected L2 loss ⇒ mean of $p(y \mid x)$; minimizing the expected L1 loss ⇒ median of $p(y \mid x)$.
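To see where the mean and median claims come from, here is the standard pointwise argument for the L2 case (a sketch; the L1/median case follows similarly via a subgradient argument):

```latex
% Minimize the expected L2 loss pointwise at x over the prediction c:
\mathbb{E}\left[ (Y - c)^2 \mid X = x \right]
  = \mathbb{E}\left[ Y^2 \mid X = x \right]
    - 2c \, \mathbb{E}\left[ Y \mid X = x \right] + c^2 .
% Setting the derivative in c to zero gives the conditional mean:
\frac{\partial}{\partial c} \, \mathbb{E}\left[ (Y - c)^2 \mid X = x \right]
  = 2c - 2 \, \mathbb{E}\left[ Y \mid X = x \right] = 0
  \;\Longrightarrow\; c^* = \mathbb{E}\left[ Y \mid X = x \right].
```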

SLIDE 8

Squared (L2) loss: $L(f(x), y) = (f(x) - y)^2$

[Figure: plot of the squared loss. Picture from Alex]

SLIDE 9

L1 loss: $L(f(x), y) = |f(x) - y|$

[Figure: plot of the L1 loss. Picture from Alex]

SLIDE 10

ε-insensitive loss: $L(f(x), y) = \max(0, |f(x) - y| - \varepsilon)$

[Figure: plot of the ε-insensitive loss. Picture from Alex]

SLIDE 11

Huber’s robust loss: quadratic for small residuals, linear for large ones. With residual $r = f(x) - y$:

$L(f(x), y) = \begin{cases} \frac{1}{2} r^2 & \text{if } |r| \le \delta \\ \delta \left( |r| - \frac{\delta}{2} \right) & \text{otherwise} \end{cases}$

[Figure: plot of Huber’s loss. Picture from Alex]
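For concreteness, a minimal NumPy sketch of the four regression losses above (the function names and the default parameters eps=0.1 and delta=1.0 are our own choices, not from the slides):

```python
import numpy as np

def l2_loss(pred, y):
    """Squared (L2) loss: (f(x) - y)^2."""
    return (pred - y) ** 2

def l1_loss(pred, y):
    """Absolute (L1) loss: |f(x) - y|."""
    return np.abs(pred - y)

def eps_insensitive_loss(pred, y, eps=0.1):
    """epsilon-insensitive loss: zero inside the eps-tube, linear outside."""
    return np.maximum(0.0, np.abs(pred - y) - eps)

def huber_loss(pred, y, delta=1.0):
    """Huber's robust loss: quadratic for |r| <= delta, linear beyond."""
    r = pred - y
    return np.where(np.abs(r) <= delta,
                    0.5 * r ** 2,
                    delta * (np.abs(r) - 0.5 * delta))
```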

SLIDE 12

Risk

Risk of the classification/regression function $f$ = the expected loss:

$R(f) = \mathbb{E}_{(X, Y) \sim P}\left[ L(f(X), Y) \right]$

Why do we care about this?

SLIDE 13

Why do we care about risk?

Risk of the classification/regression function $f$: $R(f) = \mathbb{E}[L(f(X), Y)]$, the expected loss.

Our true goal is to minimize the loss on the test points! Usually we don’t know the test points and their labels in advance…, but by the law of large numbers (LLN) the average loss over $m$ i.i.d. test points converges to the risk:

$\frac{1}{m} \sum_{i=1}^{m} L(f(X_i^{test}), Y_i^{test}) \to R(f) \quad (m \to \infty)$

That is why our goal is to minimize the risk.
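A quick simulation of the LLN statement: the average loss on a growing i.i.d. test sample approaches the risk. The toy model (Y = X + Gaussian noise) and the predictor f(x) = x are illustrative choices, not from the slides; the L2 risk of this predictor is exactly the noise variance, 0.25:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # A fixed predictor (illustrative choice); under Y = X + noise,
    # its L2 risk is the noise variance, 0.5**2 = 0.25.
    return x

for m in [10, 100, 10_000, 1_000_000]:
    x = rng.uniform(0, 1, size=m)
    y = x + rng.normal(0, 0.5, size=m)   # toy model: Y = X + Gaussian noise
    avg_loss = np.mean((f(x) - y) ** 2)  # average L2 loss on m test points
    print(f"m = {m:>9}: average test loss = {avg_loss:.4f}  (risk = 0.25)")
```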

SLIDE 14

Risk Examples

Risk: $R(f) = \mathbb{E}[L(f(X), Y)]$, the expected loss.

Classification loss: $L(f(x), y) = \mathbf{1}\{f(x) \neq y\}$. Risk of the classification loss: $R(f) = P(f(X) \neq Y)$, the probability of misclassification.

L2 loss for regression: $L(f(x), y) = (f(x) - y)^2$. Risk of the L2 loss: $R(f) = \mathbb{E}\left[ (f(X) - Y)^2 \right]$.

SLIDE 15

Bayes Risk

Definition: Bayes risk = the smallest possible expected loss,

$R^* = \inf_{f} R(f)$

where we consider all possible functions $f$ here.

We don’t know $P$, but we have i.i.d. training data $D = \{(X_i, Y_i)\}_{i=1}^{n}$ sampled from $P$!

Goal of learning: produce a function whose risk is close to the Bayes risk. The learning algorithm constructs this function $f_D$ from the training data.

SLIDE 16

Consistency of learning methods

The risk is a random variable: $R(f_D)$ depends on the random training set $D$.

Definition: a learning method is (universally) consistent if $R(f_D) \to R^*$ as the sample size $n \to \infty$, for every distribution $P(X, Y)$.

Stone’s theorem (1977): Many classification and regression algorithms are universally consistent for certain loss functions under certain conditions: kNN, Parzen kernel regression, SVM, …

Yayyy!!!

Wait! This doesn’t tell us anything about the rates…

SLIDE 17

No Free Lunch!

Devroye (1982): For every consistent learning method and for every fixed convergence rate $a_n$, there exists a distribution $P(X, Y)$ such that the convergence rate of this learning method on $P(X, Y)$-distributed data is slower than $a_n$.

What can we do now?

SLIDE 18

What do we mean by rate?

Notation (stochastic rate, stochastic little o and big O):

Definition (stochastically bounded): $X_n = O_P(a_n)$ if for every $\varepsilon > 0$ there is an $M > 0$ such that $\sup_n P\left( |X_n / a_n| > M \right) < \varepsilon$. Similarly, $X_n = o_P(a_n)$ if $X_n / a_n \to 0$ in probability.

Example (CLT): if $X_1, \dots, X_n$ are i.i.d. with mean $\mu$ and finite variance, then $\bar{X}_n - \mu = O_P(n^{-1/2})$.

SLIDE 19

Empirical Risk and True Risk

SLIDE 20

Empirical Risk

True risk of $f$ (deterministic): $R(f) = \mathbb{E}[L(f(X), Y)]$. Bayes risk: $R^* = \inf_f R(f)$.

We cannot compute the true risk, because $P$ is unknown; let us use its empirical counterpart instead.

Empirical risk: $\widehat{R}_n(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(X_i), Y_i)$, the average loss on the training set $D = \{(X_i, Y_i)\}_{i=1}^{n}$.

Shorthand: we write $\widehat{R}_n(f)$ for the empirical risk of $f$.

SLIDE 21

Empirical Risk Minimization

Empirical risk minimization (ERM): $\widehat{f}_n = \arg\min_{f \in \mathcal{F}} \widehat{R}_n(f)$

Law of Large Numbers: for each fixed $f$, $\widehat{R}_n(f) \to R(f)$ almost surely as $n \to \infty$.

The empirical risk converges to the true risk, so minimizing the empirical risk is a natural surrogate for minimizing the (unknown) true risk.
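A minimal sketch of ERM over a small finite class, assuming a toy data model of our own (threshold classifiers on [0, 1]; the labels are flipped with probability 0.1, so the Bayes risk here is 0.1):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model (our choice): X ~ Uniform[0,1], Y = 1{X > 0.3}, with labels
# flipped with probability 0.1, so the Bayes risk is 0.1.
n = 200
X = rng.uniform(0, 1, size=n)
flips = rng.random(n) < 0.1
Y = ((X > 0.3).astype(int) + flips.astype(int)) % 2

# Finite class: threshold classifiers f_t(x) = 1{x > t} on a grid of t's.
thresholds = np.linspace(0.0, 1.0, 101)

def empirical_risk(t):
    """Empirical 0/1 risk of f_t on the training sample."""
    return np.mean((X > t).astype(int) != Y)

# ERM: pick the classifier in the class with the smallest empirical risk.
risks = np.array([empirical_risk(t) for t in thresholds])
t_hat = thresholds[np.argmin(risks)]
print(f"ERM threshold = {t_hat:.2f}, empirical risk = {risks.min():.3f}")
```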

SLIDE 22

Overfitting in Classification with ERM

Generative model:

Bayes classifier:

Bayes risk:

[Figure: picture from David Pal]

SLIDE 23

Overfitting in Classification with ERM

Function class: $n$-order thresholded polynomials. With a high enough order, ERM drives the empirical risk to zero on the training set, while the true risk stays far above the Bayes risk.

[Figure: picture from David Pal]

SLIDE 24

Overfitting in Regression with ERM

Is the following predictor a good one?

  • What is its empirical risk (performance on the training data)? Zero!
  • What about its true risk? Greater than zero.
  • It will predict very poorly on a new random test point: large generalization error!

SLIDE 25

Overfitting in Regression

If we allow very complicated predictors, we could overfit the training data.

Examples: regression with polynomials of degree $k-1$ ($k$ parameters): $k=1$ constant, $k=2$ linear, $k=3$ quadratic, $k=7$ sixth order.

[Figure: four panels of polynomial fits for k = 1, 2, 3, 7 (constant, linear, quadratic, 6th order)]
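A sketch of the polynomial-overfitting experiment in the spirit of the figure; the noisy sine data are our own illustrative choice, and numpy.polyfit fits the degree-(k-1) polynomial for each k:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(m):
    # Noisy sine data (illustrative choice, in the spirit of the figure).
    x = rng.uniform(0, 1, size=m)
    y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=m)
    return x, y

x_train, y_train = sample(8)       # a small training set invites overfitting
x_test, y_test = sample(1000)

for k in [1, 2, 3, 7]:             # k parameters = polynomial of degree k - 1
    coeffs = np.polyfit(x_train, y_train, deg=k - 1)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"k = {k}: train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")
```

With only 8 training points, the 6th-order fit drives the training error toward zero while the test error blows up, exactly the ERM overfitting effect described above.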

SLIDE 26

Solutions to Overfitting

SLIDE 27

Solutions to Overfitting: Structural Risk Minimization

Notation: true risk $R(f)$, empirical risk $\widehat{R}_n(f)$, function class $\mathcal{F}$.

1st issue: if the function class $\mathcal{F}$ is too small, even the best $f \in \mathcal{F}$ can have a risk far above the Bayes risk (model error, approximation error).

Solution: Structural Risk Minimization (SRM): minimize the empirical risk over a nested sequence of function classes $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$ whose complexity grows with the sample size, trading empirical risk against class complexity.

SLIDE 28

Approximation error, Estimation error, PAC framework

Risk of the classifier $f_D$ relative to the Bayes risk $R^*$:

$R(f_D) - R^* = \underbrace{R(f_D) - \inf_{f \in \mathcal{F}} R(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}} R(f) - R^*}_{\text{approximation error}}$

Probably Approximately Correct (PAC) learning framework: bound the estimation error with high probability, i.e. show $P\left( R(f_D) - \inf_{f \in \mathcal{F}} R(f) > \varepsilon \right) \le \delta$.

SLIDE 29

Solution to Overfitting

2nd issue: the empirical risk with the 0/1 classification loss is difficult to minimize directly (non-convex, non-smooth).

Solution: replace the classification loss with a convex surrogate, e.g. the hinge loss or the quadratic loss (next slide).

SLIDE 30

Approximation with the Hinge loss and quadratic loss

With labels $y \in \{-1, +1\}$: hinge loss $L(f(x), y) = \max(0, 1 - y f(x))$; quadratic loss $L(f(x), y) = (1 - y f(x))^2$. Both are convex upper bounds on the 0/1 loss.

[Figure: the 0/1 loss together with its hinge and quadratic approximations. Picture is taken from R. Herbrich]

SLIDE 31

Effect of Model Complexity

If we allow very complicated predictors, we could overfit the training data: beyond some complexity, the empirical risk is no longer a good indicator of the true risk.

[Figure: prediction error vs. model complexity for a fixed number of training data; the prediction error on the training data keeps decreasing, while the true risk eventually rises]

SLIDE 32

Underfitting

Bayes risk = 0.1

SLIDE 33

Underfitting

Best linear classifier:

The empirical risk of the best linear classifier:

SLIDE 34

Underfitting

Best quadratic classifier:

Its risk is the same as the Bayes risk ⇒ good fit!

SLIDE 35

Classification using the classification loss

SLIDE 36

The Bayes Classifier

Let $\eta(x) = P(Y = 1 \mid X = x)$. The Bayes classifier is $g^*(x) = \mathbf{1}\{\eta(x) \ge 1/2\}$.

Lemma I: $R(g^*) = \mathbb{E}\left[ \min\left( \eta(X), 1 - \eta(X) \right) \right]$

Lemma II: for any classifier $g$, $R(g) - R(g^*) = \mathbb{E}\left[ \left| 2\eta(X) - 1 \right| \, \mathbf{1}\{ g(X) \neq g^*(X) \} \right] \ge 0$

SLIDE 37

Proofs

Lemma I: trivial from the definition. Lemma II: a surprisingly long calculation.
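For reference, the “trivial from definition” calculation behind Lemma I in its standard form (using $\eta(x) = P(Y = 1 \mid X = x)$ as above):

```latex
% Condition on X and expand the 0/1 risk of an arbitrary classifier g:
R(g) = P\left( g(X) \neq Y \right)
     = \mathbb{E}\left[ \eta(X) \, \mathbf{1}\{g(X) = 0\}
                      + (1 - \eta(X)) \, \mathbf{1}\{g(X) = 1\} \right].
% Pointwise, the integrand is minimized by predicting the more likely label,
% i.e. by g^*(x) = \mathbf{1}\{\eta(x) \ge 1/2\}, which yields Lemma I:
R(g^*) = \mathbb{E}\left[ \min\left( \eta(X), \, 1 - \eta(X) \right) \right].
```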

SLIDE 38

The Bayes Classifier

We will need these definitions, please copy them!

  • $f_n = \arg\min_{f \in \mathcal{F}} \widehat{R}_n(f)$: this is what the learning algorithm (ERM) produces
  • $f_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} R(f)$: the best classifier in the class $\mathcal{F}$
  • $f^* = \arg\min_{f} R(f)$: the Bayes classifier, with Bayes risk $R^* = R(f^*)$

SLIDE 39

The Bayes Classifier

Theorem I:

$R(f_n) - \inf_{f \in \mathcal{F}} R(f) \le 2 \sup_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right|$

($R(f_n)$: the true risk of what the learning algorithm produces; $f_n$: what the learning algorithm produces.)

SLIDE 40

The Bayes Classifier

Theorem II:

$R(f_n) - R^* = \left( R(f_n) - \inf_{f \in \mathcal{F}} R(f) \right) + \left( \inf_{f \in \mathcal{F}} R(f) - R^* \right)$

(estimation error plus approximation error; $f_n$ is what the learning algorithm produces.)

SLIDE 41

Proofs

Theorem I: not-so-long calculations. Theorem II: trivial.

Main message: it’s enough to derive upper bounds for $\sup_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right|$.

Corollary: any high-probability upper bound on this supremum yields a high-probability upper bound on the estimation error $R(f_n) - \inf_{f \in \mathcal{F}} R(f)$.
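The “not-so-long calculation” behind a bound of Theorem I’s type is usually this three-term telescoping argument (a sketch, assuming Theorem I is the standard ERM deviation bound):

```latex
R(f_n) - R(f_{\mathcal{F}})
  = \underbrace{R(f_n) - \widehat{R}_n(f_n)}_{\le \, \sup_{f} |\widehat{R}_n(f) - R(f)|}
  + \underbrace{\widehat{R}_n(f_n) - \widehat{R}_n(f_{\mathcal{F}})}_{\le \, 0 \ \text{(ERM minimizes } \widehat{R}_n\text{)}}
  + \underbrace{\widehat{R}_n(f_{\mathcal{F}}) - R(f_{\mathcal{F}})}_{\le \, \sup_{f} |\widehat{R}_n(f) - R(f)|}
  \;\le\; 2 \sup_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right| .
```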

SLIDE 42

Illustration of the Risks

SLIDE 43

Let us see why we have learned the tail bounds!

It’s enough to derive upper bounds for $\sup_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right|$.

SLIDE 44

Hoeffding’s inequality (1963)

Let $Z_1, \dots, Z_n$ be independent with $Z_i \in [a_i, b_i]$ almost surely. Then for all $\varepsilon > 0$,

$P\left( \left| \frac{1}{n}\sum_{i=1}^{n} (Z_i - \mathbb{E} Z_i) \right| \ge \varepsilon \right) \le 2 \exp\left( - \frac{2 n^2 \varepsilon^2}{\sum_{i=1}^{n} (b_i - a_i)^2} \right)$

Special case ($Z_i \in [0, 1]$): $P\left( \left| \bar{Z}_n - \mathbb{E} \bar{Z}_n \right| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}$
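A quick numerical sanity check of the special case via simulation (the parameters p, n, ε and the trial count are our own choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative parameters (our choice): Bernoulli(p) variables, n per trial.
p, n, eps, trials = 0.3, 100, 0.1, 100_000

# Empirical frequency of a large deviation of the sample mean from p.
means = rng.binomial(n, p, size=trials) / n
empirical = np.mean(np.abs(means - p) >= eps)

# Hoeffding's bound for [0, 1]-valued variables: 2 exp(-2 n eps^2).
bound = 2 * np.exp(-2 * n * eps ** 2)
print(f"empirical tail = {empirical:.4f}, Hoeffding bound = {bound:.4f}")
```

The empirical tail probability comes out well below the bound, as it must: Hoeffding holds for every distribution on [0, 1], so it is rarely tight for any particular one.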

SLIDE 45

Binomial distributions

Our goal is to bound $\left| \widehat{R}_n(f) - R(f) \right|$ for a fixed classifier $f$.

With the 0/1 loss, the losses $L(f(X_i), Y_i)$ are i.i.d. Bernoulli($p$) with $p = R(f)$, so $n \widehat{R}_n(f)$ is Binomial($n$, $p$).

Therefore, from Hoeffding we have:

$P\left( \left| \widehat{R}_n(f) - R(f) \right| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}$

Yuppie!!!

SLIDE 46

Inversion

From Hoeffding we have: $P\left( \left| \widehat{R}_n(f) - R(f) \right| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}$.

Set $\delta = 2 e^{-2 n \varepsilon^2}$ and solve for $\varepsilon$. Therefore, with probability at least $1 - \delta$,

$\left| \widehat{R}_n(f) - R(f) \right| \le \sqrt{\frac{\ln(2/\delta)}{2n}}$

SLIDE 47

Union Bound

Our goal is to bound the worst-case deviation over a finite class $\mathcal{F} = \{f_1, \dots, f_N\}$. We already know, for each fixed $f$: $P\left( |\widehat{R}_n(f) - R(f)| \ge \varepsilon \right) \le 2 e^{-2 n \varepsilon^2}$.

Theorem [tail bound on the ‘deviation’ in the worst case]:

$P\left( \max_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right| \ge \varepsilon \right) \le 2 N e^{-2 n \varepsilon^2}$

Proof: apply the union bound over the $N$ classifiers.

Note: this is not the worst classifier in terms of classification accuracy! Worst case means that the empirical risk of classifier $f$ is the furthest from its true risk.

SLIDE 48

Inversion of Union Bound

We already know: $P\left( \max_{f \in \mathcal{F}} |\widehat{R}_n(f) - R(f)| \ge \varepsilon \right) \le 2 N e^{-2 n \varepsilon^2}$.

Therefore, with probability at least $1 - \delta$,

$\max_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right| \le \sqrt{\frac{\ln(2N/\delta)}{2n}}$

SLIDE 49

Inversion of Union Bound

  • The larger the N, the looser the bound.
  • This result is distribution-free: true for all P(X,Y) distributions.
  • It is useless if N is big or infinite… (e.g. all possible hyperplanes)

We will see later how to fix that. (Hint: We haven’t used McDiarmid yet)
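Meanwhile, the inverted bound is easy to use numerically; a small helper sketch (our own wrapper functions, not from the lecture):

```python
import numpy as np

def deviation_bound(n, N, delta):
    """With prob. >= 1 - delta: max_f |R_hat_n(f) - R(f)| <= returned value."""
    return np.sqrt(np.log(2 * N / delta) / (2 * n))

def samples_needed(eps, N, delta):
    """Smallest n making the deviation bound <= eps (by inverting it)."""
    return int(np.ceil(np.log(2 * N / delta) / (2 * eps ** 2)))

print(deviation_bound(n=1000, N=100, delta=0.05))   # about 0.064
print(samples_needed(eps=0.05, N=100, delta=0.05))  # 1659
```

Note the mild dependence on N: it enters only through a logarithm, so a much bigger class costs relatively few extra samples.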

SLIDE 50

The Expected Error

Our goal is to bound the expected worst-case deviation: $\mathbb{E}\left[ \max_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right| \right]$.

We already know the tail bound (concentration inequality): $P\left( \max_{f \in \mathcal{F}} |\widehat{R}_n(f) - R(f)| \ge \varepsilon \right) \le 2 N e^{-2 n \varepsilon^2}$.

Theorem [expected ‘deviation’ in the worst case]:

$\mathbb{E}\left[ \max_{f \in \mathcal{F}} \left| \widehat{R}_n(f) - R(f) \right| \right] \le \sqrt{\frac{\ln(2N)}{2n}}$

Proof: integrate the tail bound we already know (from that we actually get a bit weaker inequality… oh well).

Note: this is not the worst classifier in terms of classification accuracy! Worst case means that the empirical risk of classifier $f$ is the furthest from its true risk.
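The “bit weaker inequality” remark refers to obtaining the expectation by integrating the tail bound; a sketch of that route, writing $Z$ for the worst-case deviation:

```latex
% Z := max_{f in F} |R_hat_n(f) - R(f)| has tail P(Z >= t) <= 2N e^{-2 n t^2}.
\mathbb{E}[Z] = \int_0^\infty P(Z \ge t) \, dt
  \le u + \int_u^\infty 2N e^{-2 n t^2} \, dt
  \le u + \frac{2N \, e^{-2 n u^2}}{4 n u}
  \qquad \text{(for any } u > 0\text{)},
% and the choice u = \sqrt{\ln(2N)/(2n)} makes 2N e^{-2 n u^2} = 1, so
\mathbb{E}[Z] \le \sqrt{\frac{\ln(2N)}{2n}} + \frac{1}{2\sqrt{2 n \ln(2N)}} .
```

The extra additive term is what makes this route slightly weaker than the clean $\sqrt{\ln(2N)/(2n)}$ in the theorem statement.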

SLIDE 51

Thanks for your attention!