
Introduction to Statistical Learning - Jean-Philippe Vert (PowerPoint Presentation)



  1. Introduction to Statistical Learning
     Jean-Philippe Vert (Jean-Philippe.Vert@ensmp.fr)
     Mines ParisTech and Institut Curie
     Master Course, 2011

  2. Outline
     1. Introduction
     2. Linear methods for regression
     3. Linear methods for classification
     4. Nonlinear methods with positive definite kernels

  3. Motivations
     Predict the risk of a second heart attack from demographic, diet and clinical measurements
     Predict the future price of a stock from company performance measures
     Recognize a ZIP code from an image
     Identify the risk factors for prostate cancer
     ... and many more applications in many areas of science, finance and industry where large amounts of data are collected.

  4. Learning from data
     Supervised learning
     An outcome measurement (target or response variable), which can be quantitative (regression) or categorical (classification), and which we want to predict based on a set of features (or descriptors, or predictors)
     We have a training set with features and outcome
     We build a prediction model, or learner, to predict the outcome from the features for new, unseen objects
     Unsupervised learning
     No outcome measurement
     Describe how the data are organized or clustered
     Examples: Fig 1.1-1.3

  5. Machine learning / data mining vs statistics
     They share many concepts and tools, but in ML:
     Prediction is more important than modelling (understanding, causality)
     There is no settled philosophy or theoretical framework
     We are ready to use ad hoc methods if they seem to work on real data
     We often have many features, and sometimes large training sets
     We focus on efficient algorithms, with little or no human intervention
     We often use complex nonlinear models

  6. Organization
     Focus on supervised learning (regression and classification)
     Reference: "The Elements of Statistical Learning" by Hastie, Tibshirani and Friedman (HTF), available online at http://www-stat.stanford.edu/~tibs/ElemStatLearn/
     Practical sessions using R

  7. Notations
     $Y \in \mathcal{Y}$: the response (usually $\mathcal{Y} = \{-1, 1\}$ or $\mathbb{R}$)
     $X \in \mathcal{X}$: the input (usually $\mathcal{X} = \mathbb{R}^p$)
     $x_1, \ldots, x_N$: observed inputs, stored in the $N \times p$ matrix $\mathbf{X}$
     $y_1, \ldots, y_N$: observed responses, stored in the vector $\mathbf{Y} \in \mathcal{Y}^N$

  8. Simple method 1: Linear least squares
     Parametric model for $\beta \in \mathbb{R}^{p+1}$:
     $$f_\beta(X) = \beta_0 + \sum_{i=1}^p \beta_i X_i = X^\top \beta$$
     Estimate $\hat\beta$ from the training data to minimize
     $$\mathrm{RSS}(\beta) = \sum_{i=1}^N \left( y_i - f_\beta(x_i) \right)^2$$
     See Fig 2.1. Good if the model is correct...
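
Since the practical sessions use R, here is a minimal R sketch of least-squares fitting on simulated data; the data-generating model and variable names are illustrative, not from the slides:

```r
# A toy dataset: N = 100 observations, p = 3 features, linear truth plus noise
set.seed(1)
N <- 100; p <- 3
X <- matrix(rnorm(N * p), N, p)
beta.true <- c(2, 0, -1)
y <- as.vector(1 + X %*% beta.true + rnorm(N))

# lm() estimates beta by minimizing RSS(beta) = sum_i (y_i - f_beta(x_i))^2
fit <- lm(y ~ X)
coef(fit)  # estimates of (beta_0, beta_1, ..., beta_p)
```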

  9. Simple method 2: Nearest neighbor methods (k-NN)
     Prediction based on the k nearest neighbors:
     $$\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$
     Depends on k
     Fewer assumptions than linear regression, but more risk of overfitting
     Fig 2.2-2.4
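
A hand-rolled k-NN regressor in the same spirit, reusing the simulated X and y from the sketch above (packages such as class or FNN provide optimized versions); a sketch, not the course's reference code:

```r
# k-NN regression: predict at x as the average response of the k nearest
# training points (Euclidean distance)
knn.reg <- function(x, X.train, y.train, k = 5) {
  d <- sqrt(colSums((t(X.train) - x)^2))  # distances from x to each training point
  mean(y.train[order(d)[1:k]])            # average y over the k nearest neighbors
}

knn.reg(X[1, ], X, y, k = 10)  # prediction at the first training point
```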

  10. Statistical decision theory
      Joint distribution $\Pr(X, Y)$
      Loss function $L(Y, f(X))$, e.g. squared error loss $L(Y, f(X)) = (Y - f(X))^2$
      Expected prediction error (EPE): $\mathrm{EPE}(f) = \mathbb{E}_{(X,Y) \sim \Pr(X,Y)}\, L(Y, f(X))$
      The minimizer is $f(X) = \mathbb{E}(Y \mid X)$ (the regression function)
      Bayes classifier for 0/1 loss in classification (Fig 2.5)
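
A quick Monte Carlo check that the regression function E(Y|X) minimizes the squared-error EPE; the model here is an assumed illustration, not from the slides:

```r
# Model: X ~ N(0,1), Y = sin(X) + noise with sd 0.3, so E(Y|X) = sin(X)
set.seed(2)
x <- rnorm(1e5)
y.sim <- sin(x) + rnorm(1e5, sd = 0.3)
mean((y.sim - sin(x))^2)  # EPE of the regression function: close to 0.09, the noise variance
mean((y.sim - x)^2)       # any other rule, here f(x) = x, has a larger EPE
```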

  11. Least squares and k-NN
      Least squares assumes f(x) is linear, and pools over all values of X to estimate the best parameters. Stable but biased.
      k-NN assumes f(x) is well approximated by a locally constant function, and pools over local sample data to approximate the conditional expectation. Less stable but less biased.

  12. Local methods in high dimension
      If N is large enough, k-NN seems always optimal (universally consistent)
      But when p is large, the curse of dimensionality strikes:
      No method can be "local" (Fig 2.6)
      Training samples sparsely populate the input space, which can lead to large bias or variance (eq. 2.25 and Fig 2.7-2.8)
      If structure is known (e.g., a linear regression function), we can reduce both variance and bias (Fig 2.9)
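
A small simulation in the spirit of Fig 2.6, with assumed settings, showing why neighborhoods stop being local in high dimension:

```r
# With N points uniform on [0,1]^p, the distance from the origin to its
# nearest neighbor grows quickly with p: "local" neighborhoods become global
set.seed(3)
nn.dist <- function(p, N = 1000) {
  Z <- matrix(runif(N * p), N, p)
  min(sqrt(rowSums(Z^2)))  # nearest-neighbor distance from the origin
}
sapply(c(1, 2, 10, 100), nn.dist)  # increases steeply with p
```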

  13. Bias-variance trade-off
      Assume $Y = f(X) + \epsilon$, on a fixed design. $Y(x)$ is random because of $\epsilon$; $\hat{f}(X)$ is random because of variations in the training set $\mathcal{T}$. Then
      $$\mathbb{E}_{\epsilon, \mathcal{T}} \left( Y - \hat{f}(X) \right)^2 = \mathbb{E} Y^2 + \mathbb{E} \hat{f}(X)^2 - 2\, \mathbb{E} Y \hat{f}(X)$$
      $$= \mathrm{Var}(Y) + \mathrm{Var}\left(\hat{f}(X)\right) + \left( \mathbb{E} Y - \mathbb{E} \hat{f}(X) \right)^2$$
      $$= \text{noise} + \mathrm{bias}(\hat{f})^2 + \mathrm{variance}(\hat{f})$$
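
The decomposition can be checked by simulation. A sketch with an assumed setup (f = sin, noise sd 0.3, a cubic polynomial learner; none of these choices come from the slides):

```r
# Estimate noise, squared bias and variance of a learner at a fixed point x0
set.seed(4)
x0 <- 0.5; f <- sin; sigma <- 0.3
preds <- replicate(2000, {
  x <- runif(50, -2, 2)                         # a fresh training set T each time
  y.tr <- f(x) + rnorm(50, sd = sigma)
  predict(lm(y.tr ~ poly(x, 3)), data.frame(x = x0))
})
c(noise    = sigma^2,                    # irreducible error
  bias2    = (mean(preds) - f(x0))^2,    # squared bias of the learner at x0
  variance = var(preds))                 # variance over training sets
```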

  14. Structured regression and model selection
      Define a family of function classes $\mathcal{F}_\lambda$, where $\lambda$ controls the "complexity", e.g.:
      Ball of radius $\lambda$ in a metric function space
      Bandwidth of the kernel in a kernel estimator
      Number of basis functions
      For each $\lambda$, define $\hat{f}_\lambda = \mathrm{argmin}_{f \in \mathcal{F}_\lambda} \mathrm{EPE}(f)$
      Select $\hat{f} = \hat{f}_{\hat\lambda}$ to minimize the bias-variance trade-off (Fig 2.11).

  15. Cross-validation
      A simple and systematic procedure to estimate the risk (and to optimize the model's parameters):
      1. Randomly divide the training set (of size N) into K (almost) equal portions, each of size N/K
      2. For each portion, fit the model with different parameters on the K-1 other groups and test its performance on the left-out group
      3. Average the performance over the K groups, and take the parameter with the smallest average error.
      Taking K = 5 or 10 is recommended as a good default choice.
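
A minimal K-fold cross-validation loop in R, reusing the simulated data and the knn.reg sketch above to select k; the candidate grid for k is illustrative:

```r
set.seed(5)
K <- 5
folds <- sample(rep(1:K, length.out = nrow(X)))  # random fold assignment
cv.mse <- sapply(c(1, 5, 10, 20), function(k) {
  mean(sapply(1:K, function(fold) {
    test <- which(folds == fold)
    # fit on the K-1 other groups, predict on the left-out group
    pred <- apply(X[test, , drop = FALSE], 1, knn.reg,
                  X.train = X[-test, , drop = FALSE], y.train = y[-test], k = k)
    mean((y[test] - pred)^2)  # squared error on the left-out portion
  }))
})
cv.mse  # pick the k with the smallest cross-validated error
```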

  16. Summary
      To learn complex functions in high dimension from limited training sets, we need to optimize a bias-variance trade-off. We will typically do that by:
      1. Defining a family of learners of various complexities (e.g., the dimension of a linear predictor)
      2. Defining an estimation procedure for each learner (e.g., least squares or empirical risk minimization)
      3. Defining a procedure to tune the complexity of the learner (e.g., cross-validation)

  17. Outline
      1. Introduction
      2. Linear methods for regression
      3. Linear methods for classification
      4. Nonlinear methods with positive definite kernels

  18. Linear least squares
      Parametric model for $\beta \in \mathbb{R}^{p+1}$:
      $$f_\beta(X) = \beta_0 + \sum_{i=1}^p \beta_i X_i = X^\top \beta$$
      Estimate $\hat\beta$ from the training data to minimize
      $$\mathrm{RSS}(\beta) = \sum_{i=1}^N \left( y_i - f_\beta(x_i) \right)^2$$
      Solution if $\mathbf{X}^\top \mathbf{X}$ is non-singular:
      $$\hat\beta = \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{Y}$$
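
The closed-form solution, computed directly on the toy data from the first sketch; solve(A, b) is used rather than an explicit matrix inverse for numerical stability:

```r
Xd <- cbind(1, X)  # design matrix with an intercept column prepended
beta.hat <- solve(t(Xd) %*% Xd, t(Xd) %*% y)  # (X^T X)^{-1} X^T Y
max(abs(beta.hat - coef(lm(y ~ X))))          # agrees with lm() up to rounding
```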

  19. Fitted values
      Fitted values on the training set:
      $$\hat{\mathbf{Y}} = \mathbf{X} \hat\beta = \mathbf{X} \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{Y} = H \mathbf{Y}, \quad \text{with } H = \mathbf{X} \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \mathbf{X}^\top$$
      Geometrically: H projects $\mathbf{Y}$ onto the span of $\mathbf{X}$ (Fig 3.2)
      If $\mathbf{X}$ is singular, $\hat\beta$ is not uniquely defined, but $\hat{\mathbf{Y}}$ is
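
The hat matrix can be formed explicitly on the same toy data to verify that it is a projection; a sketch for intuition only, since forming H is wasteful for large N:

```r
H <- Xd %*% solve(t(Xd) %*% Xd) %*% t(Xd)  # H = X (X^T X)^{-1} X^T
y.hat <- as.vector(H %*% y)                # fitted values Y.hat = H Y
all.equal(y.hat, unname(fitted(lm(y ~ X))))  # TRUE: same as lm()'s fitted values
all.equal(H %*% H, H)                        # TRUE: H is idempotent (a projection)
```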

  20. Inference on coefficients
      Assume $\mathbf{Y} = \mathbf{X} \beta + \epsilon$, with $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$
      Then $\hat\beta \sim \mathcal{N} \left( \beta,\ \left( \mathbf{X}^\top \mathbf{X} \right)^{-1} \sigma^2 \right)$
      Estimating the variance: $\hat\sigma^2 = \| \mathbf{Y} - \hat{\mathbf{Y}} \|^2 / (N - p - 1)$
      Statistics on coefficients: $\frac{\hat\beta_j - \beta_j}{\hat\sigma \sqrt{v_j}} \sim t_{N-p-1}$, where $v_j$ is the $j$-th diagonal element of $\left( \mathbf{X}^\top \mathbf{X} \right)^{-1}$
      This allows testing the hypothesis $H_0: \beta_j = 0$, and gives the confidence intervals $\hat\beta_j \pm t_{\alpha/2, N-p-1}\, \hat\sigma \sqrt{v_j}$
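
In R, summary(lm) and confint report exactly these quantities; a sketch on the toy data from above:

```r
fit <- lm(y ~ X)
summary(fit)$coefficients  # beta.hat_j, std. errors sigma.hat*sqrt(v_j), t values, p-values for H0: beta_j = 0
confint(fit, level = 0.95) # beta.hat_j +/- t_{alpha/2, N-p-1} * sigma.hat * sqrt(v_j)
```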
