
Machine Learning (CSE 446): Concepts & the i.i.d. Supervised Learning Paradigm - PowerPoint PPT Presentation



  1. Machine Learning (CSE 446): Concepts & the “i.i.d.” Supervised Learning Paradigm. Sham M Kakade, © 2018 University of Washington. cse446-staff@cs.washington.edu

  2. Review

  3. Decision Tree: Making a Prediction

  [Figure: a binary decision tree; the root (counts n:p) tests ϕ1 and branches on values 0/1 to children n0:p0 and n1:p1; the n1:p1 child tests ϕ2, branching to n10:p10 and n11:p11, which test ϕ3 and ϕ4 respectively, branching to leaves n100:p100, n101:p101, n110:p110, n111:p111.]

  Data: decision tree t, input example x
  Result: predicted class
  if t has the form Leaf(y) then
      return y;
  else
      # t.φ is the feature associated with t;
      # t.child(v) is the subtree for value v;
      return DTreeTest(t.child(t.φ(x)), x);
  end
  Algorithm 1: DTreeTest
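A minimal Python sketch of Algorithm 1, under an assumed representation (the names Leaf, Node, and dtree_test are ours, not from the slides): a tree is either a Leaf holding a class or a Node holding a feature function phi and a dict of children keyed by feature value.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Leaf:
    y: Any                    # the class predicted at this leaf

@dataclass
class Node:
    phi: Callable             # feature function: maps example x to 0 or 1
    children: Dict[Any, Any]  # subtree for each feature value

def dtree_test(t, x):
    """Algorithm 1: walk from the root to a leaf along x's feature values."""
    if isinstance(t, Leaf):
        return t.y
    # t.phi is the feature associated with t;
    # t.children[v] is the subtree for value v (t.child(v) on the slide).
    return dtree_test(t.children[t.phi(x)], x)

# Toy usage: a depth-2 tree that predicts 1 iff x[0] == 1 and x[1] == 1.
tree = Node(phi=lambda x: x[0],
            children={0: Leaf(0),
                      1: Node(phi=lambda x: x[1],
                              children={0: Leaf(0), 1: Leaf(1)})})
print(dtree_test(tree, (1, 1)))  # -> 1
```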

  4. (review) Greedily Building a Decision Tree (Binary Features)

  Data: data D, feature set Φ
  Result: decision tree
  if all examples in D have the same label y, or Φ is empty and y is the best guess, then
      return Leaf(y);
  else
      for each feature φ in Φ do
          partition D into D0 and D1 based on φ-values;
          let mistakes(φ) = (non-majority answers in D0) + (non-majority answers in D1);
      end
      let φ* be the feature with the smallest number of mistakes;
      return Node(φ*, {0 → DTreeTrain(D0, Φ \ {φ*}), 1 → DTreeTrain(D1, Φ \ {φ*})});
  end
  Algorithm 2: DTreeTrain
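A matching sketch of Algorithm 2, reusing the Leaf and Node classes from the sketch above. It assumes D is a non-empty list of (x, y) pairs and features is a set of binary feature functions; the empty-partition fallback is our own addition, since the pseudocode leaves that case implicit.

```python
from collections import Counter

def majority(labels):
    """The most common label (the 'best guess')."""
    return Counter(labels).most_common(1)[0][0]

def num_mistakes(labels):
    """Non-majority answers: how many labels disagree with the majority."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]

def dtree_train(D, features):
    """Algorithm 2 for binary features: grow the tree greedily by mistakes."""
    labels = [y for _, y in D]
    if len(set(labels)) == 1 or not features:
        return Leaf(majority(labels))  # pure node, or no features left
    # Score each feature by the mistakes its split would make.
    def split_mistakes(phi):
        return (num_mistakes([y for x, y in D if phi(x) == 0]) +
                num_mistakes([y for x, y in D if phi(x) == 1]))
    best = min(features, key=split_mistakes)
    rest = features - {best}
    children = {}
    for v in (0, 1):
        Dv = [(x, y) for x, y in D if best(x) == v]
        # An empty side falls back to a leaf with D's majority label.
        children[v] = dtree_train(Dv, rest) if Dv else Leaf(majority(labels))
    return Node(phi=best, children=children)
```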

  5. Danger: Overfitting

  [Figure: error rate (lower is better) versus depth of the decision tree; the error on training data keeps falling with depth, while the error on unseen data eventually rises. The widening gap is overfitting.]

  6. Today’s Lecture

  7. The “i.i.d.” Supervised Learning Setup
  ◮ Let ℓ be a loss function; ℓ(y, ŷ) is our loss for predicting ŷ when y is the correct output.
  ◮ Let D(x, y) define the (unknown) underlying probability of the input/output pair (x, y) in “nature.” We never “know” this distribution.
  ◮ The training data D = ⟨(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)⟩ are assumed to be independent, identically distributed (i.i.d.) samples from D.
  ◮ We care about our expected error (i.e., the expected loss, the “true” loss, ...) with respect to the underlying distribution D.
  ◮ Goal: find a hypothesis which has “low” expected error, using the training set.
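To make the setup concrete, here is a toy, entirely invented stand-in for the unknown D(x, y); the names sample_from_D and D_train are our assumptions, reused in the sketches below.

```python
import random

def sample_from_D():
    """One i.i.d. draw (x, y): x uniform on [0, 1]; y = 1 with prob. 0.9
    when x > 0.5, else with prob. 0.1 (a made-up 'nature')."""
    x = random.random()
    p_one = 0.9 if x > 0.5 else 0.1
    y = 1 if random.random() < p_one else 0
    return x, y

N = 100
D_train = [sample_from_D() for _ in range(N)]  # the training set: N i.i.d. samples
```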

  8. Concepts and Terminology
  ◮ The learning algorithm maps the training set D to some hypothesis f̂.
  ◮ We often have a “hypothesis class” F, where our algorithm chooses f̂ ∈ F.
  ◮ The training error of f is the loss of f on the training set.
  ◮ Overfitting! (and underfitting) Also: the generalization error usually refers to the difference between the expected error of f̂ and the training error of f̂.
  ◮ Ways to check/avoid overfitting:
    ◮ use a test set, i.i.d. data sampled from D, to estimate the expected error;
    ◮ use a “development set,” i.i.d. from D, for hyperparameter tuning (or cross validation).
  ◮ We really just get sampled data, and we can break it up as we like.

  9. Loss Functions
  ◮ ℓ(y, ŷ) is our loss for outputting ŷ when y is the correct output.
  ◮ Many loss functions:
    ◮ for binary classification, where y ∈ {0, 1}: ℓ(y, ŷ) = ⟦y ≠ ŷ⟧;
    ◮ for multi-class classification, where y is one of k outcomes: ℓ(y, ŷ) = ⟦y ≠ ŷ⟧;
    ◮ for regression, where y ∈ ℝ, we often use the square loss: ℓ(y, ŷ) = (y − ŷ)².
  ◮ Classifier f’s true expected error (or loss):
    ε(f) = Σ_{(x,y)} D(x, y) · ℓ(y, f(x)) = E_{(x,y)∼D}[ℓ(y, f(x))]
  Sometimes, when clear from context, “the loss” or “the error” refers to the expected loss.
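The losses translate directly into Python. The expected_error helper is a Monte Carlo estimate of ε(f) and assumes a sampler like sample_from_D above; since the true D is unknowable, this only works for synthetic distributions.

```python
def zero_one_loss(y, y_hat):
    """⟦y ≠ ŷ⟧: 1 if the prediction is wrong, 0 if right (binary or multi-class)."""
    return 1 if y != y_hat else 0

def square_loss(y, y_hat):
    """(y − ŷ)²: the usual loss for regression."""
    return (y - y_hat) ** 2

def expected_error(f, loss, sampler, num_samples=100_000):
    """Monte Carlo estimate of ε(f) = E_{(x,y)∼D}[ℓ(y, f(x))]."""
    draws = (sampler() for _ in range(num_samples))
    return sum(loss(y, f(x)) for x, y in draws) / num_samples
```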

  10. Training Error
  ◮ Goal: we want to find an f which has low ε(f). But we don’t know ε(f).
  ◮ The training error of hypothesis f is f’s average error on the training data:
    ε̂(f) = (1/N) Σ_{n=1}^{N} ℓ(y_n, f(x_n))
  ◮ In contrast, classifier f’s true expected loss is:
    ε(f) = E_{(x,y)∼D}[ℓ(y, f(x))]
  ◮ Idea: use the training error ε̂(f) as an empirical approximation to ε(f), and hope that this approximation is good!
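A direct sketch of ε̂(f), reusing D_train and zero_one_loss from the sketches above; f here is the natural threshold rule for the toy distribution, whose true error is 0.1 by construction.

```python
def training_error(f, D_train, loss):
    """ε̂(f) = (1/N) Σ_n ℓ(y_n, f(x_n)): average loss on the training data."""
    return sum(loss(y, f(x)) for x, y in D_train) / len(D_train)

# The natural threshold classifier for the toy distribution above:
f = lambda x: 1 if x > 0.5 else 0
print(training_error(f, D_train, zero_one_loss))  # near 0.1, its true error
```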

  11. The Training Error and the LLN
  ◮ For a fixed f (which does not depend on the training set D), the training error is an unbiased estimate of the expected error. Proof: taking an expectation over the dataset D,
    E_D[ε̂(f)] = E[(1/N) Σ_n ℓ(y_n, f(x_n))] = (1/N) Σ_n E[ℓ(y_n, f(x_n))] = (1/N) Σ_n ε(f) = ε(f)
  ◮ LLN: for a fixed f (not a function of D) and for large N, ε̂(f) → ε(f); e.g., for any fixed classifier, you can get a good estimate of its mistake rate with a large dataset.
  ◮ This suggests: finding an f which makes the training error small is a good approach?
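A quick empirical check of the LLN claim, using the fixed threshold classifier f and the toy distribution sketched above.

```python
for N in (10, 100, 10_000):
    data = [sample_from_D() for _ in range(N)]
    eps_hat = sum(zero_one_loss(y, f(x)) for x, y in data) / N
    print(N, eps_hat)  # converges toward the true error ε(f) = 0.1 as N grows
```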

  12. What Could Go Wrong?
  ◮ A learning algorithm which “memorizes” the data is easy to construct: while such algorithms have 0 training error, they often have true expected error no better than guessing.
  ◮ What went wrong?
    ◮ For a given f, we just need a training set to estimate the bias of a coin (for binary classification); this is easy.
    ◮ BUT there is a (“very small”) chance this approximation fails (even for “large” N).
    ◮ Try enough hypotheses and, by chance alone, one will look good (see the sketch below).
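A fully synthetic demonstration of the last bullet: label a small dataset with fair coin flips, score many random classifiers on it, and keep the best. Its training error looks impressive, yet its true error is 0.5 by construction.

```python
import random

random.seed(0)
N = 20
ys = [random.randint(0, 1) for _ in range(N)]    # labels are pure coin flips
best_err = min(
    sum(random.randint(0, 1) != y for y in ys) / N  # one random "hypothesis"
    for _ in range(10_000)                          # ... tried 10,000 times
)
print(best_err)  # typically far below 0.5, even though the true error is 0.5
```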

  13. Overfitting, More Formally
  ◮ Let f̂ be the output of the training algorithm.
  ◮ It is almost never true that ε̂(f̂), the training error of f̂, is an unbiased estimate of ε(f̂), the expected loss of f̂. It is usually a gross underestimate.
  ◮ The generalization error of our algorithm is ε(f̂) − ε̂(f̂). Large generalization error means we have overfit.
  ◮ We would like both:
    ◮ our training error, ε̂(f̂), to be small;
    ◮ our generalization error to be small.
  ◮ If both occur, then we have low expected error. :)
  ◮ It is usually easy to get one of these two to be small.
  ◮ Overfitting is the fundamental problem of ML.
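The “memorizer” from the previous slide, made concrete with the toy distribution and helpers above: zero training error, chance-level true error, hence a large generalization error ε(f̂) − ε̂(f̂). (The default guess of 0 for unseen x is an arbitrary choice of ours.)

```python
memory = dict(D_train)               # memorize every training pair (x -> y)
f_hat = lambda x: memory.get(x, 0)   # unseen x: an arbitrary default guess

train_err = sum(zero_one_loss(y, f_hat(x)) for x, y in D_train) / len(D_train)
test = [sample_from_D() for _ in range(10_000)]
test_err = sum(zero_one_loss(y, f_hat(x)) for x, y in test) / len(test)
print(train_err, test_err, test_err - train_err)  # ~0.0, ~0.5, ~0.5
```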

  14. Danger: Overfitting

  [Figure: the error-rate-versus-tree-depth plot from slide 5, repeated: training error falls with depth while error on unseen data rises.]

  15. Test Sets and Dev Sets
  ◮ Checking for overfitting:
    ◮ use a test set, i.i.d. data sampled from D, to estimate the expected error;
    ◮ this gives an unbiased estimate of the true error (and an accurate one for “reasonable” N);
    ◮ we should never use the test set during training, as this violates the approximation quality.
  ◮ Hyperparameters (“def”: parameters of our algorithm/pseudocode):
    1. usually they monotonically lower the training error, e.g., a decision tree’s maximal width and maximal depth;
    2. sometimes not; we just don’t know how to set them (e.g., learning rates).
  ◮ How do we set hyperparameters? For case 1 (a data-splitting sketch follows below):
    ◮ use a dev set, i.i.d. from D, for hyperparameter tuning (or cross validation);
    ◮ learn with the training set (using different hyperparameters); then check on your dev set.
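One way to “break up the sampled data as we like”: shuffle once, then carve out train, dev, and test splits. The dev set picks hyperparameters; the test set is touched exactly once, at the very end. The split fractions here are arbitrary choices, not prescribed by the lecture.

```python
import random

def split_data(data, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve the i.i.d. sample into train/dev/test."""
    data = data[:]                     # don't mutate the caller's list
    random.Random(seed).shuffle(data)
    n_dev = int(len(data) * dev_frac)
    n_test = int(len(data) * test_frac)
    dev = data[:n_dev]
    test = data[n_dev:n_dev + n_test]
    train = data[n_dev + n_test:]
    return train, dev, test
```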

  16. Back to decision trees...

  17. Avoiding Overfitting by Stopping Early
  ◮ Set a maximum tree depth d_max (you also need to set a maximum width w).
  ◮ Only consider splits that decrease error by at least some Δ.
  ◮ Only consider splitting a node with more than N_min examples.
  In each case, we have a hyperparameter (d_max, w, Δ, N_min), which you should tune on development data.
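As an aside (not the lecture’s own code): scikit-learn’s tree learner exposes analogous knobs, so dev-set tuning can be sketched in a few lines. This assumes arrays X_train, y_train, X_dev, y_dev already exist from a split like the one above; note that min_impurity_decrease thresholds impurity rather than raw error, so it is only an analogue of Δ.

```python
from sklearn.tree import DecisionTreeClassifier

best_depth, best_dev_acc = None, -1.0
for d_max in (1, 2, 4, 8, 16):
    clf = DecisionTreeClassifier(max_depth=d_max,             # d_max
                                 min_samples_split=10,        # role of N_min
                                 min_impurity_decrease=1e-3)  # role of Δ
    clf.fit(X_train, y_train)
    dev_acc = clf.score(X_dev, y_dev)  # accuracy on the dev set
    if dev_acc > best_dev_acc:
        best_depth, best_dev_acc = d_max, dev_acc
```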

  18. Avoiding Overfitting by Pruning
  ◮ Build a big tree (i.e., let it overfit); call it t_0.
  ◮ For i ∈ {1, ..., |t_0|}: greedily choose the set of sibling leaves in t_{i−1} whose collapse increases error the least, and collapse it to produce t_i. (Alternately, collapse the split whose contingency table is least surprising under chance assumptions.)
  ◮ Choose the t_i that performs best on development data.
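Hand-rolling the greedy sibling-leaf collapse is longer than fits here, but a closely related off-the-shelf alternative is scikit-learn’s cost-complexity pruning: grow a big tree, enumerate its pruned subtrees, and pick on dev data. Same caveats as above: X_train, y_train, X_dev, y_dev are assumed arrays, and this is not the slide’s exact procedure.

```python
from sklearn.tree import DecisionTreeClassifier

# Each alpha on the path corresponds to one pruned subtree of the big tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    X_train, y_train)
pruned = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
          for a in path.ccp_alphas]
best = max(pruned, key=lambda t: t.score(X_dev, y_dev))  # pick on dev data
```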

  19. More Things to Know
  ◮ Instead of using the number of mistakes, we often use information-theoretic quantities to choose the next feature.
  ◮ For continuous-valued features, we use thresholds, e.g., φ(x) ≤ τ. In this case, you must choose τ. If the sorted values of φ are ⟨v_1, v_2, ..., v_N⟩, you only need to consider the N − 1 midpoints between consecutive feature values: τ ∈ {(v_n + v_{n+1})/2 : n = 1, ..., N − 1}.
  ◮ For continuous-valued outputs, what value makes sense as the prediction at a leaf? What loss should we use instead of ⟦y ≠ ŷ⟧?
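A small sketch of the midpoint rule for candidate thresholds: only a midpoint between consecutive sorted feature values can change which examples fall on each side of the split, so the N − 1 midpoints suffice.

```python
def candidate_thresholds(values):
    """The N − 1 midpoints between consecutive sorted feature values."""
    v = sorted(values)
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

print(candidate_thresholds([3.0, 1.0, 2.0]))  # -> [1.5, 2.5]
```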
