Overfitting and Model Selection
Aarti Singh
Machine Learning 10-701/15-781
Feb 20, 2014

True vs. Empirical Error
True Error: target performance measure, i.e. performance on a random test point (X, Y)
• Classification: probability of misclassification
• Regression: mean squared error
Empirical Error: performance on training data
• Classification: proportion of misclassified examples
• Regression: average squared error
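
As a minimal sketch of the two empirical error measures above, the following Python computes the proportion of misclassified examples and the average squared error on a training set. The toy data and the two predictors (`threshold_classifier`, `identity_regressor`) are made up for illustration and do not come from the slides.

```python
def misclassification_rate(predict, xs, ys):
    """Proportion of examples whose predicted label differs from the true label."""
    return sum(predict(x) != y for x, y in zip(xs, ys)) / len(xs)


def average_squared_error(predict, xs, ys):
    """Average of (prediction - target)^2 over the examples."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Toy training data (hypothetical values).
xs = [0.0, 1.0, 2.0, 3.0]
labels = [0, 0, 1, 1]
targets = [0.0, 1.0, 2.0, 5.0]

threshold_classifier = lambda x: int(x > 0.5)  # misclassifies x = 1.0
identity_regressor = lambda x: x               # off by 2 at x = 3.0

clf_error = misclassification_rate(threshold_classifier, xs, labels)
reg_error = average_squared_error(identity_regressor, xs, targets)
print(clf_error, reg_error)
```

The true error would replace these averages over the training set with expectations over a fresh random test point (X, Y), which cannot be computed from the training data alone.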

Overfitting
Is the following predictor a good one?
[predictor shown on the slide]
What is its empirical error (performance on training data)? Zero!
What about its true error? Greater than zero.
It will predict very poorly on a new random test point: poor generalization!

Overfitting
If we allow very complicated predictors, we could overfit the training data.
Example: classification (1-NN classifier)
[Figure: two scatter plots with Height and Weight axes, points labeled "Football player?" Yes/No]
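
The 1-NN example can be sketched in a few lines of Python. The heights (cm), weights (kg) and labels below are invented for illustration, with one deliberately atypical training point; they are not the data behind the slide's figure. The sketch shows the pattern the slide describes: zero error on the training data, but errors on new points that fall near the atypical example.

```python
def nearest_neighbor_label(train, point):
    """Label of the training example closest to `point` (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    _, label = min(train, key=lambda example: sq_dist(example[0], point))
    return label


def error_rate(train, data):
    """Proportion of (point, label) pairs in `data` that 1-NN on `train` gets wrong."""
    return sum(nearest_neighbor_label(train, x) != y for x, y in data) / len(data)


# ((height, weight), football player?) with 1 = yes, 0 = no. Hypothetical data.
train = [
    ((170, 65), 0),
    ((175, 70), 0),
    ((190, 110), 1),
    ((195, 120), 1),
    ((172, 68), 1),   # atypical example sitting among the "no" points
]
test = [((171, 67), 0), ((173, 69), 0), ((192, 115), 1)]

train_error = error_rate(train, train)  # every point is its own nearest neighbor
test_error = error_rate(train, test)    # the two "no" points land next to the atypical example
print(train_error, test_error)
```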

Overfitting
If we allow very complicated predictors, we could overfit the training data.
Example: regression (polynomial of order k, i.e. degree up to k-1)
[Figure: four panels of polynomial fits for k=1, k=2, k=3 and k=7 on x in [0, 1]; the k=7 panel's vertical axis runs from about -45 to 5, the others stay between about -0.2 and 1.5]
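
A rough way to reproduce the effect in the polynomial example is to fit polynomials of order k = 1, 2, 3, 7 (degree k-1) to a small noisy sample and compare training and test error. This is only a sketch under assumptions: the sine-shaped target, the noise level, and the sample sizes are invented here, since the slide does not state how its data were generated. It relies on NumPy's `polyfit` and `polyval`.

```python
import numpy as np

rng = np.random.default_rng(0)


def target(x):
    # Hypothetical ground-truth function (an assumption, not from the slides).
    return np.sin(2 * np.pi * x)


n = 10
x_train = np.linspace(0.0, 1.0, n)
y_train = target(x_train) + rng.normal(scale=0.2, size=n)

x_test = rng.uniform(0.0, 1.0, size=200)
y_test = target(x_test) + rng.normal(scale=0.2, size=200)

orders = [1, 2, 3, 7]  # same k values as the four panels
train_mse, test_mse = {}, {}
for k in orders:
    coeffs = np.polyfit(x_train, y_train, deg=k - 1)  # order k means degree k-1
    train_mse[k] = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    test_mse[k] = float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))

for k in orders:
    print(f"k={k}: train MSE={train_mse[k]:.4f}, test MSE={test_mse[k]:.4f}")
```

Because the polynomial classes are nested, the training error can only go down as k grows; the test error is under no such constraint.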

Effect of Model Complexity
If we allow very complicated predictors, we could overfit the training data.
[Figure: true error and empirical error versus model complexity, for a fixed number of training data]
Empirical error is no longer a good indicator of true error.

Examples of Model Spaces
Model spaces with increasing complexity:
• Nearest-neighbor classifiers with varying neighborhood sizes k = 1, 2, 3, …
  Small neighborhood => higher complexity
• Decision trees with depth k or with k leaves
  Higher depth / more leaves => higher complexity
• Regression with polynomials of order k = 0, 1, 2, …
  Higher degree => higher complexity
• Kernel regression with bandwidth h
  Small bandwidth => higher complexity

Restricting Model Complexity
True error/risk: $R(f) = \mathbb{E}_{XY}[\mathrm{loss}(f(X), Y)]$
Empirical error/risk: $\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{loss}(f(X_i), Y_i)$
Optimal predictor: $f^* = \arg\min_{f} R(f)$
Empirical risk minimizer over class $\mathcal{F}$: $\hat{f}_n = \arg\min_{f \in \mathcal{F}} \hat{R}(f)$
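
To make empirical risk minimization over a restricted class concrete, here is a small sketch in which the class F is a finite set of threshold classifiers and the loss is the 0/1 loss. The data, the threshold grid, and the helper names are all hypothetical; the slides do not fix a particular class.

```python
def empirical_risk(predict, data):
    """Average 0/1 loss of `predict` over (x, y) pairs."""
    return sum(predict(x) != y for x, y in data) / len(data)


def make_threshold_classifier(t):
    """Classifier in the class F: predicts 1 when x > t, else 0."""
    return lambda x: int(x > t)


# Hypothetical training data (x, label).
data = [(0.1, 0), (0.35, 0), (0.4, 0), (0.6, 1), (0.8, 1), (0.9, 1)]

# The restricted model class: one classifier per candidate threshold.
thresholds = [0.0, 0.25, 0.5, 0.75]

risks = {t: empirical_risk(make_threshold_classifier(t), data) for t in thresholds}
best_threshold = min(risks, key=risks.get)  # empirical risk minimizer over F
print(risks, best_threshold)
```

Enlarging the threshold grid makes the class richer, which lowers the best achievable empirical risk but leaves more room to fit noise.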

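Effect of Model Complexity
Want $\hat{f}_n$ to be as good as the optimal predictor $f^*$.
Excess risk:
$$R(\hat{f}_n) - R(f^*) = \underbrace{R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f)}_{\text{estimation error}} + \underbrace{\inf_{f \in \mathcal{F}} R(f) - R(f^*)}_{\text{approximation error}}$$
• Estimation error: due to randomness of training data (finite sample size)
• Approximation error: due to restriction of model class
[Figure: excess risk split into estimation error and approximation error, above the level $R(f^*)$]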

Effect of Model Complexity
$$R(\hat{f}_n) - R(f^*) = \Big( R(\hat{f}_n) - \inf_{f \in \mathcal{F}} R(f) \Big) + \Big( \inf_{f \in \mathcal{F}} R(f) - R(f^*) \Big)$$

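Bias-Variance Tradeoff
Regression: notice that the optimal predictor does not have zero error, $R(f^*) > 0$.
With $\mathbb{E}_D$ denoting expectation over the training data and $\sigma^2 = R(f^*)$:
$$\mathbb{E}_D[R(\hat{f}_n)] = \underbrace{\mathbb{E}\big[(\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)])^2\big]}_{\text{variance}} + \underbrace{\mathbb{E}\big[(\mathbb{E}_D[\hat{f}_n(X)] - f^*(X))^2\big]}_{\text{bias}^2} + \underbrace{\sigma^2}_{\text{noise var}}$$
• Variance: the random component, ≡ estimation error
• Bias²: ≡ approximation error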

Bias-Variance Tradeoff
3 independent training datasets
Large bias, small variance: poor approximation but robust/stable
[Figure: three panels, one per training dataset, x in [0, 1]]
Small bias, large variance: good approximation but unstable
[Figure: three panels, one per training dataset, x in [0, 1]]
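
The two regimes can be explored with a small simulation: draw many independent training datasets, fit a low-order and a high-order polynomial to each, and estimate bias² and variance of the fitted values. This is a sketch under assumptions that are not in the slides: a sine-shaped `f_star`, Gaussian noise with `sigma = 0.3`, a fixed design of 15 inputs, 200 datasets, and polynomial orders 1 and 7.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, n_datasets = 15, 0.3, 200
x = np.linspace(0.0, 1.0, n)      # fixed inputs; only the noise changes per dataset
f_star = np.sin(2 * np.pi * x)    # hypothetical optimal predictor


def simulate(order):
    """Fit a polynomial of the given order (degree order-1) to many noisy datasets."""
    preds = np.empty((n_datasets, n))
    for d in range(n_datasets):
        y = f_star + rng.normal(scale=sigma, size=n)
        coeffs = np.polyfit(x, y, deg=order - 1)
        preds[d] = np.polyval(coeffs, x)
    mean_pred = preds.mean(axis=0)                        # estimate of E_D[f_hat(x)]
    bias_sq = float(np.mean((mean_pred - f_star) ** 2))   # averaged over x
    variance = float(np.mean(preds.var(axis=0)))          # averaged over x
    mse = float(np.mean((preds - f_star) ** 2))           # E[(f_hat - f_star)^2]
    return bias_sq, variance, mse


bias_lo, var_lo, mse_lo = simulate(order=1)  # constant fit
bias_hi, var_hi, mse_hi = simulate(order=7)  # degree-6 fit
print(f"order 1: bias^2={bias_lo:.4f}, variance={var_lo:.4f}")
print(f"order 7: bias^2={bias_hi:.4f}, variance={var_hi:.4f}")
```

In each case `mse` equals `bias_sq + variance` up to rounding, which is the sample version of the bias-variance identity for the distance between the fitted predictor and `f_star`.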

Bias-Variance Tradeoff: Derivation
Regression: notice that the optimal predictor does not have zero error, $R(f^*) > 0$.
As in the HW1 solution, we can write the MSE of any function f as
$$\begin{aligned}
R(f) &= \mathbb{E}[(f(X) - Y)^2] \\
&= \mathbb{E}[(f(X) - f^*(X) + f^*(X) - Y)^2] \\
&= \mathbb{E}[(f(X) - f^*(X))^2 + (f^*(X) - Y)^2 + 2(f(X) - f^*(X))(f^*(X) - Y)] \\
&= \mathbb{E}[(f(X) - f^*(X))^2] + \mathbb{E}[(f^*(X) - Y)^2] + 2\,\mathbb{E}[(f(X) - f^*(X))(f^*(X) - Y)]
\end{aligned}$$
The last term is 0, since
$$\begin{aligned}
\mathbb{E}_{XY}[(f(X) - f^*(X))(f^*(X) - Y)] &= \mathbb{E}_X\big[\mathbb{E}_{Y|X}[(f(X) - f^*(X))(f^*(X) - Y) \mid X]\big] \\
&= \mathbb{E}_X\big[(f(X) - f^*(X))\,\mathbb{E}_{Y|X}[(f^*(X) - Y) \mid X]\big] = 0
\end{aligned}$$
where the inner expectation vanishes because $f^*(X) = \mathbb{E}[Y \mid X]$ under squared loss.

Bias-Variance Tradeoff: Derivation
Regression: notice that the optimal predictor does not have zero error, $R(f^*) > 0$.
As in the HW1 solution, we can write the MSE of any function f as
$$\begin{aligned}
R(f) &= \mathbb{E}[(f(X) - Y)^2] \\
&= \mathbb{E}[(f(X) - f^*(X) + f^*(X) - Y)^2] \\
&= \mathbb{E}[(f(X) - f^*(X))^2 + (f^*(X) - Y)^2 + 2(f(X) - f^*(X))(f^*(X) - Y)] \\
&= \mathbb{E}[(f(X) - f^*(X))^2] + \underbrace{\mathbb{E}[(f^*(X) - Y)^2]}_{R(f^*)} + 2\,\mathbb{E}[(f(X) - f^*(X))(f^*(X) - Y)]
\end{aligned}$$
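
The decomposition can be checked exactly on a small discrete distribution, where every expectation is a finite sum. The joint distribution `joint` and the candidate predictor `f` below are made-up examples; `f_star` is computed as the conditional mean of Y given X, which is the optimal predictor under squared loss.

```python
# Joint distribution p(x, y); the values are hypothetical and sum to 1.
joint = {
    (0, 0.0): 0.2, (0, 1.0): 0.2,
    (1, 1.0): 0.1, (1, 3.0): 0.3,
    (2, 2.0): 0.2,
}

# Marginal p(x) and conditional mean f*(x) = E[Y | X = x].
px = {}
for (x, _), p in joint.items():
    px[x] = px.get(x, 0.0) + p
f_star = {
    x: sum(p * y for (xx, y), p in joint.items() if xx == x) / px[x]
    for x in px
}


def expect(g):
    """Exact expectation of g(X, Y) under the joint distribution."""
    return sum(p * g(x, y) for (x, y), p in joint.items())


def f(x):
    """An arbitrary candidate predictor (hypothetical)."""
    return float(x)


risk_f = expect(lambda x, y: (f(x) - y) ** 2)                    # R(f)
risk_star = expect(lambda x, y: (f_star[x] - y) ** 2)            # R(f*)
distance = expect(lambda x, y: (f(x) - f_star[x]) ** 2)          # E[(f - f*)^2]
cross = expect(lambda x, y: (f(x) - f_star[x]) * (f_star[x] - y))  # cross term
print(risk_f, distance + risk_star, cross)
```

Here R(f) = 1.4, E[(f − f*)²] = 1.0 and R(f*) = 0.4, and the cross term is zero up to floating-point rounding.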

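Bias-Variance Tradeoff: Derivation
Regression: notice that the optimal predictor does not have zero error, $R(f^*) > 0$.
Now $\hat{f}_n(X)$, and hence $R(\hat{f}_n)$, is random as it depends on the training data. With $\mathbb{E}_D$ denoting expectation over the training data and $\sigma^2 = R(f^*)$:
$$\begin{aligned}
\mathbb{E}_D[R(\hat{f}_n)] - \sigma^2 &= \mathbb{E}[(\hat{f}_n(X) - f^*(X))^2] \\
&= \mathbb{E}[(\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)] + \mathbb{E}_D[\hat{f}_n(X)] - f^*(X))^2] \\
&= \mathbb{E}[(\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)])^2 + (\mathbb{E}_D[\hat{f}_n(X)] - f^*(X))^2 \\
&\qquad + 2(\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)])(\mathbb{E}_D[\hat{f}_n(X)] - f^*(X))] \\
&= \mathbb{E}[(\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)])^2] + \mathbb{E}[(\mathbb{E}_D[\hat{f}_n(X)] - f^*(X))^2] \\
&\qquad + \underbrace{2\,\mathbb{E}[(\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)])(\mathbb{E}_D[\hat{f}_n(X)] - f^*(X))]}_{0}
\end{aligned}$$
The last term is 0 because, for each fixed X, $\mathbb{E}_D[\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)]] = 0$.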

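Bias-Variance Tradeoff: Derivation
Regression: notice that the optimal predictor does not have zero error, $R(f^*) > 0$.
Now $\hat{f}_n(X)$, and hence $R(\hat{f}_n)$, is random as it depends on the training data. With $\mathbb{E}_D$ denoting expectation over the training data and $\sigma^2 = R(f^*)$:
$$\begin{aligned}
\mathbb{E}_D[R(\hat{f}_n)] - \sigma^2 &= \mathbb{E}[(\hat{f}_n(X) - f^*(X))^2] \\
&= \mathbb{E}[(\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)] + \mathbb{E}_D[\hat{f}_n(X)] - f^*(X))^2] \\
&= \mathbb{E}[(\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)])^2 + (\mathbb{E}_D[\hat{f}_n(X)] - f^*(X))^2 \\
&\qquad + 2(\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)])(\mathbb{E}_D[\hat{f}_n(X)] - f^*(X))] \\
&= \underbrace{\mathbb{E}[(\hat{f}_n(X) - \mathbb{E}_D[\hat{f}_n(X)])^2]}_{\text{Variance}} + \underbrace{\mathbb{E}[(\mathbb{E}_D[\hat{f}_n(X)] - f^*(X))^2]}_{\text{Bias}^2}
\end{aligned}$$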

Model Selection
Setup: model classes $\{\mathcal{F}_\lambda\}_{\lambda \in \Lambda}$ of increasing complexity
We can select the right complexity model in a data-driven/adaptive way:
• Hold-out
• Cross-validation
• Complexity regularization
• Information criteria: AIC, BIC, Minimum Description Length (MDL)

Hold-out method
We would like to pick the model that has the smallest generalization error.
We can judge generalization error by using an independent sample of data.
Hold-out procedure (n data points available):
1) Split into two sets: a training dataset $D_T$ and a validation dataset $D_V$ (NOT test data!!)
2) Use $D_T$ for training a predictor from each model class:
$$\hat{f}_\lambda = \arg\min_{f \in \mathcal{F}_\lambda} \hat{R}_T(f), \qquad \lambda \in \Lambda$$
where $\hat{R}_T$ is the empirical risk evaluated on the training dataset $D_T$.

Hold-out method
3) Use $D_V$ to select the model class which has the smallest empirical error on $D_V$:
$$\hat{\lambda} = \arg\min_{\lambda \in \Lambda} \hat{R}_V(\hat{f}_\lambda)$$
where $\hat{f}_\lambda$ is the predictor trained on $D_T$ from model class $\lambda$ and $\hat{R}_V$ is the empirical risk evaluated on the validation dataset $D_V$.
4) Hold-out predictor: $\hat{f} = \hat{f}_{\hat{\lambda}}$
Intuition: small error on one set of data will not imply small error on a randomly sub-sampled second set of data. This ensures the method is "stable".
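
The four hold-out steps can be sketched as follows. This is an illustration under assumptions rather than the lecture's own code: the model classes are kernel regressors indexed by bandwidth h (one of the model spaces listed earlier), a Gaussian-kernel Nadaraya-Watson estimate stands in for "the predictor trained on D_T", and the data, split sizes and candidate bandwidths are invented.

```python
import math
import random


def kernel_regressor(train, h):
    """Nadaraya-Watson estimate with a Gaussian kernel of bandwidth h."""
    def predict(x):
        weights = [math.exp(-0.5 * ((x - xi) / h) ** 2) for xi, _ in train]
        total = sum(weights)
        if total == 0.0:  # all weights underflowed: fall back to the training mean
            return sum(yi for _, yi in train) / len(train)
        return sum(w * yi for w, (_, yi) in zip(weights, train)) / total
    return predict


def mean_squared_error(predict, data):
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


# Hypothetical data: n noisy samples of a sine curve.
random.seed(0)
n = 60
data = []
for _ in range(n):
    x = random.random()
    data.append((x, math.sin(2 * math.pi * x) + random.gauss(0.0, 0.3)))

# 1) Split into a training set D_T and a validation set D_V (not test data).
random.shuffle(data)
d_train, d_val = data[:40], data[40:]

# 2) Use D_T to build one predictor per model class (here: per bandwidth).
bandwidths = [0.01, 0.05, 0.1, 0.5]
predictors = {h: kernel_regressor(d_train, h) for h in bandwidths}
train_errors = {h: mean_squared_error(predictors[h], d_train) for h in bandwidths}

# 3) Use D_V to select the class with the smallest empirical error on D_V.
val_errors = {h: mean_squared_error(predictors[h], d_val) for h in bandwidths}
best_h = min(val_errors, key=val_errors.get)

# 4) The hold-out predictor is the one trained with the selected bandwidth.
holdout_predictor = predictors[best_h]
print(best_h, train_errors, val_errors)
```

Selecting on `train_errors` instead would favor the smallest bandwidth, since the most complex class fits the training data best; the validation error is what guards against that.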
