  1. Machine Learning Theory CS 446

  2. 1. SVM risk

  3. SVM risk. [Figure: train and test misclassification rate versus the regularization parameter $C$ (over $10^{-1}$ to $10^{1}$), for affine ("aff"), quadratic ("quad"), degree-10 polynomial ("poly10"), RBF(1) ("rbf1"), and RBF(0.1) ("rbf01") SVMs.] Consider the empirical and true/population risk of SVM: given $f$,
     \[ \widehat{R}(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(y_i f(x_i)\big), \qquad R(f) = \mathbb{E}\,\ell\big(Y f(X)\big), \]
     and furthermore define the excess risk $R(f) - \widehat{R}(f)$. What's going on here? (I just tricked you into caring about theory.)

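A rough sketch of how one might reproduce the figure's curves with scikit-learn. Everything concrete here is an assumption rather than the lecture's setup: the make_moons data, the mapping of legend names to kernels (reading "rbf1"/"rbf01" as Gaussian kernels with widths 1 and 0.1), and the grid of $C$ values.

```python
# Hypothetical reproduction of the slide's figure: train/test misclassification
# rate of SVMs with several kernels as the regularization parameter C varies.
# Dataset, kernel parameters, and the C grid are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=600, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# "aff" = linear (affine), "quad"/"poly10" = polynomial, "rbf1"/"rbf01" = Gaussian.
models = {
    "aff":    dict(kernel="linear"),
    "quad":   dict(kernel="poly", degree=2, gamma="scale"),
    "poly10": dict(kernel="poly", degree=10, gamma="scale"),
    "rbf1":   dict(kernel="rbf", gamma=1.0),
    "rbf01":  dict(kernel="rbf", gamma=0.1),
}

for name, kwargs in models.items():
    for C in np.logspace(-1, 1, 5):           # C from 10^{-1} to 10^{1}
        clf = SVC(C=C, **kwargs).fit(X_tr, y_tr)
        tr_err = 1.0 - clf.score(X_tr, y_tr)  # training misclassification rate
        te_err = 1.0 - clf.score(X_te, y_te)  # test misclassification rate
        print(f"{name:7s} C={C:5.2f}  train={tr_err:.3f}  test={te_err:.3f}")
```
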
  4. Decomposing excess risk. $\widehat{f}$ is an approximate ERM: $\widehat{R}(\widehat{f}) \approx \min_{f \in \mathcal{F}} \widehat{R}(f)$. Let's also define the true/population risk minimizer $\bar{f} := \arg\min_{f \in \mathcal{F}} R(f)$.
     (Question: what is $\mathcal{F}$? Answer: depends on the kernel!)
     (Question: is $\bar{f} = \arg\min_{f \in \mathcal{F}} \widehat{R}(f)$? Answer: no; in general $\widehat{R}(\bar{f}) \ge \widehat{R}(\widehat{f})$!)
     Nature labels according to some $g$ (not necessarily inside $\mathcal{F}$!):
     \[
       \begin{aligned}
       R(\widehat{f}) ={}& R(g) && \text{(inherent unpredictability)} \\
         &{}+ R(\bar{f}) - R(g) && \text{(approximation gap)} \\
         &{}+ \widehat{R}(\bar{f}) - R(\bar{f}) && \text{(estimation gap)} \\
         &{}+ \widehat{R}(\widehat{f}) - \widehat{R}(\bar{f}) && \text{(optimization gap)} \\
         &{}+ R(\widehat{f}) - \widehat{R}(\widehat{f}) && \text{(generalization gap)}.
       \end{aligned}
     \]
     Let's go through this step by step.

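To make a few of these terms concrete, here is a rough Monte Carlo sketch (not from the lecture): the data distribution, the linear function class, the hinge loss, and the subgradient-descent ERM are all assumptions. Since $g$ and $R(g)$ are left implicit, only the estimation, optimization, and generalization gaps are estimated, with a large fresh sample standing in for the population risk, and the telescoping identity is checked in its collapsed form.

```python
# Monte Carlo sketch of (part of) the decomposition above, under illustrative
# assumptions: linear predictors, hinge loss, subgradient-descent ERM, and a
# large held-out sample used as a proxy for the population risk R.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Nature: X uniform on [-1,1]^2, noisy labels from a nonlinear rule, so the
    # labeling function g need not lie inside the linear class F.
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = np.sign(X[:, 0] ** 2 - X[:, 1])
    flip = rng.random(n) < 0.1
    return X, np.where(flip, -y, y)

def hinge_risk(w, X, y):
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def approx_erm(X, y, steps=2000, lr=0.1):
    # Approximate ERM over F = {x -> w.x} via subgradient descent on the hinge loss.
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        active = (y * (X @ w)) < 1.0
        grad = -(y[active, None] * X[active]).sum(axis=0) / len(y)
        w -= lr * grad
    return w

X_tr, y_tr = sample(200)          # training sample: defines the empirical risk
X_pop, y_pop = sample(100_000)    # large fresh sample: proxy for the population risk

w_hat = approx_erm(X_tr, y_tr)    # approximate empirical risk minimizer ("f hat")
w_bar = approx_erm(X_pop, y_pop)  # proxy for the population risk minimizer ("f bar")

def R(w):     return hinge_risk(w, X_pop, y_pop)   # population risk (estimated)
def R_hat(w): return hinge_risk(w, X_tr, y_tr)     # empirical risk

est = R_hat(w_bar) - R(w_bar)        # estimation gap
opt = R_hat(w_hat) - R_hat(w_bar)    # optimization gap
gen = R(w_hat) - R_hat(w_hat)        # generalization gap
print(f"estimation gap     {est:+.4f}")
print(f"optimization gap   {opt:+.4f}")
print(f"generalization gap {gen:+.4f}")
# R(g) plus the approximation gap sum to R(f bar), so the decomposition
# collapses to the identity checked here:
print(np.isclose(R(w_hat), R(w_bar) + est + opt + gen))
```
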
  5. Inherent unpredictability. Nature labels according to some $g$ (not necessarily inside $\mathcal{F}$!): $R(g)$ (inherent unpredictability).
     ◮ If $g$ is the function with lowest classification error, we can write down an explicit form: $g(x) := \operatorname{sign}\big(\Pr[Y = +1 \mid X = x] - 1/2\big)$.
     ◮ If $g$ minimizes $R$ with a convex $\ell$, we can again write down $g$ pointwise via $\Pr[Y = +1 \mid X = x]$.

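A small sanity check on the first bullet, under an assumed toy distribution where $\Pr[Y = +1 \mid X = x]$ is known exactly: the plug-in rule can be written down and its risk estimated, and no classifier can do better than that floor.

```python
# Toy illustration of inherent unpredictability: when Pr[Y=+1 | X=x] is known,
# the lowest-misclassification-error rule is g(x) = sign(Pr[Y=+1 | X=x] - 1/2),
# and its risk R(g) is a floor no classifier can beat.  The distribution below
# is an assumption chosen for the sketch.
import numpy as np

rng = np.random.default_rng(0)

def eta(x):
    # Pr[Y = +1 | X = x] for scalar x.
    return 1.0 / (1.0 + np.exp(-4.0 * x))

n = 200_000
X = rng.uniform(-1.0, 1.0, size=n)
Y = np.where(rng.random(n) < eta(X), 1.0, -1.0)

g = np.sign(eta(X) - 0.5)                    # the Bayes classifier
print(f"estimated R(g)               = {np.mean(g != Y):.4f}")
print(f"risk of always predicting +1 = {np.mean(Y != 1.0):.4f}")
```
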
  6. Approximation gap. $\bar{f}$ minimizes $R$ over $\mathcal{F}$, and $g$ is chosen by nature; consider $R(\bar{f}) - R(g)$ (the approximation gap).
     ◮ We've shown that if $R$ is misclassification risk, $\mathcal{F}$ is the affine classifiers, and $g$ is quadratic, the gap can be $1/4$.
     ◮ We can make this gap arbitrarily small if $\mathcal{F}$ is: a wide 2-layer network, an RBF-kernel SVM, a polynomial classifier of arbitrary degree, ...
     ◮ What is $\mathcal{F}$ for SVM?

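A quick numerical illustration of the first two bullets (an assumed toy construction, not the lecture's $1/4$ example): when labels follow a quadratic rule, affine classifiers are stuck with a large approximation gap, while quadratic and RBF kernels can drive it toward zero.

```python
# Toy illustration of the approximation gap (an assumed construction, not the
# lecture's 1/4 example): labels follow a quadratic rule, so affine classifiers
# cannot represent the boundary, while richer classes can.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(12_000, 2))
y = np.sign(X[:, 0] ** 2 + X[:, 1] ** 2 - 0.5)   # quadratic ("circle") labels

X_tr, y_tr, X_te, y_te = X[:2000], y[:2000], X[2000:], y[2000:]
for name, clf in [
    ("affine",    SVC(kernel="linear", C=1.0)),
    ("quadratic", SVC(kernel="poly", degree=2, gamma="scale", C=1.0)),
    ("rbf",       SVC(kernel="rbf", gamma=1.0, C=1.0)),
]:
    err = np.mean(clf.fit(X_tr, y_tr).predict(X_te) != y_te)
    print(f"{name:9s} misclassification rate ~ {err:.3f}")
```
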
  7. Approximation gap. Consider SVM with no kernel. Can we only say $\mathcal{F} := \{ x \mapsto w^\top x : w \in \mathbb{R}^d \}$?
     Note, for $\widehat{w} := \arg\min_w \widehat{R}(w) + \frac{\lambda}{2}\|w\|^2$,
     \[ \frac{\lambda}{2}\|\widehat{w}\|^2 \le \widehat{R}(\widehat{w}) + \frac{\lambda}{2}\|\widehat{w}\|^2 \le \widehat{R}(0) + \frac{\lambda}{2}\|0\|^2 = \frac{1}{n}\sum_{i=1}^{n}\big[1 - y_i\, 0^\top x_i\big]_+ = 1, \]
     and so SVM is working with the finer set $\mathcal{F}_\lambda := \big\{ x \mapsto w^\top x : \|w\|^2 \le \tfrac{2}{\lambda} \big\}$.

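The norm bound above is easy to check numerically. The sketch below is one possible check, not the lecture's code: it uses the fact that LinearSVC minimizes $\frac{1}{2}\|w\|^2 + C \sum_i \text{hinge}_i$, so choosing $C = 1/(n\lambda)$ yields the same minimizer as $\widehat{R}(w) + \frac{\lambda}{2}\|w\|^2$, and then compares $\|\widehat{w}\|^2$ with $2/\lambda$ on assumed synthetic data.

```python
# Numerical check of the bound above: for
#   w_hat = argmin_w (1/n) sum_i max(0, 1 - y_i w.x_i) + (lambda/2) ||w||^2,
# we should observe ||w_hat||^2 <= 2/lambda.  LinearSVC minimizes
# (1/2)||w||^2 + C sum_i hinge_i, so C = 1/(n*lambda) gives the same minimizer.
# The synthetic data below is an arbitrary assumption.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n, d = 500, 5
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.3 * rng.normal(size=n))

for lam in [0.01, 0.1, 1.0]:
    clf = LinearSVC(loss="hinge", C=1.0 / (n * lam), fit_intercept=False,
                    dual=True, tol=1e-8, max_iter=50_000)
    clf.fit(X, y)
    w = clf.coef_.ravel()
    print(f"lambda={lam:5.2f}   ||w_hat||^2 = {w @ w:8.4f}   2/lambda = {2.0 / lam:8.4f}")
```
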
  8. Approximation gap. What about kernel SVM? Now working with
     \[ \mathcal{F}_k := \Big\{ x \mapsto \sum_{i=1}^{n} \alpha_i y_i k(x_i, x) : \alpha \in \mathbb{R}^n \Big\}, \]
     which is a random variable! ($((x_i, y_i))_{i=1}^n$ given by data.) This function class is called a reproducing kernel Hilbert space (RKHS). We can use it to develop a refined notion $\mathcal{F}_{k,\lambda}$.
     Going forward: we always try to work with the tightest possible function class defined by the data and algorithm.

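To see what a member of $\mathcal{F}_k$ looks like, here is a minimal sketch that builds one function $x \mapsto \sum_i \alpha_i y_i k(x_i, x)$ explicitly, with an assumed Gaussian kernel, random data, and an arbitrary $\alpha$; the point is only the form of the functions and their dependence on the sample.

```python
# One member of F_k = { x -> sum_i alpha_i y_i k(x_i, x) : alpha in R^n },
# built explicitly with an assumed Gaussian kernel, random data, and an
# arbitrary alpha; the point is only the form of the functions.
import numpy as np

rng = np.random.default_rng(0)
n, d, gamma = 50, 2, 1.0
X = rng.normal(size=(n, d))              # (x_i): the class itself depends on the sample
y = rng.choice([-1.0, 1.0], size=n)      # (y_i)
alpha = rng.normal(size=n)               # any alpha in R^n selects one function in F_k

def k(u, v):
    # Gaussian (RBF) kernel k(u, v) = exp(-gamma * ||u - v||^2), broadcast over rows of u.
    return np.exp(-gamma * np.sum((u - v) ** 2, axis=-1))

def f(x):
    # f(x) = sum_i alpha_i y_i k(x_i, x): the form kernel SVM predictors take.
    return np.sum(alpha * y * k(X, x))

print(f(np.zeros(d)))                    # evaluate this member of F_k at one point
```
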
  9. Estimation gap. $\bar{f}$ minimizes $R$ over $\mathcal{F}$; consider $\widehat{R}(\bar{f}) - R(\bar{f})$ (the estimation gap).
     ◮ If $((x_i, y_i))_{i=1}^n$ is drawn IID from the same distribution as the expectation in $R$, then by the central limit theorem $\widehat{R}(\bar{f}) \to R(\bar{f})$ as $n \to \infty$.
     ◮ Next week, we'll discuss high-probability bounds for finite $n$.

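The first bullet is easy to see numerically for a fixed function: under an assumed toy distribution and an assumed fixed linear $f$, the sketch below shows the empirical risk tightening around (a large-sample proxy for) the population risk roughly at rate $1/\sqrt{n}$.

```python
# Estimation gap for a fixed function: by the LLN/CLT, R_hat(f) concentrates
# around R(f) at rate roughly 1/sqrt(n).  Distribution and f are assumptions.
import numpy as np

rng = np.random.default_rng(0)
w_fixed = np.array([1.0, -0.5])        # a fixed f(x) = w.x, chosen before seeing any data

def sample(n):
    X = rng.normal(size=(n, 2))
    y = np.sign(X @ np.array([1.0, 1.0]) + 0.5 * rng.normal(size=n))
    return X, y

def hinge_risk(X, y):
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w_fixed)))

X_pop, y_pop = sample(1_000_000)
R_f = hinge_risk(X_pop, y_pop)         # large-sample proxy for the population risk R(f)

for n in [10, 100, 1_000, 10_000]:
    gaps = [hinge_risk(*sample(n)) - R_f for _ in range(200)]
    print(f"n={n:6d}   mean |R_hat(f) - R(f)| = {np.mean(np.abs(gaps)):.4f}"
          f"   std = {np.std(gaps):.4f}")
```
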
  10. Optimization gap. $\widehat{f} \in \mathcal{F}$ minimizes $\widehat{R}$, and $\bar{f} \in \mathcal{F}$ minimizes $R$; consider $\widehat{R}(\widehat{f}) - \widehat{R}(\bar{f})$ (the optimization gap).
     ◮ This is algorithmic: we reduce this number by optimizing better.
     ◮ We've advocated the use of gradient descent.
     ◮ Many of these problems are NP-hard even in trivial cases. (Linear separators with noise and 3-node neural networks are NP-hard.)
     ◮ If $\widehat{R}$ uses a convex loss and $\widehat{f}$ has at least one training mistake, relating $\widehat{R}$ and test-set misclassifications can be painful.
     Specifically considering SVM:
     ◮ This is a convex optimization problem.
     ◮ We can solve it in many ways (primal, dual, projected gradient descent, coordinate descent, Newton, etc.); it doesn't really matter so long as we end up close; the primal solutions are unique.

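Since the gap is purely algorithmic, one can watch it shrink as the optimizer runs. The sketch below (assumed synthetic data, $\lambda$, and step-size schedule) applies subgradient descent to the regularized hinge objective and tracks the objective value; because the problem is convex, any reasonable solver ends at essentially the same value.

```python
# The optimization gap is algorithmic: better optimization drives R_hat(f_hat)
# toward min over F of R_hat.  Sketch: subgradient descent on the (convex) SVM
# primal; data, lambda, and the 1/(lambda*t) step-size schedule are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 400, 10, 0.1
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.3 * rng.normal(size=n))

def objective(w):
    # (1/n) sum_i max(0, 1 - y_i w.x_i) + (lambda/2) ||w||^2
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w))) + 0.5 * lam * (w @ w)

def subgradient(w):
    active = (y * (X @ w)) < 1.0
    return -(y[active, None] * X[active]).sum(axis=0) / n + lam * w

w = np.zeros(d)
for t in range(1, 5001):
    w -= (1.0 / (lam * t)) * subgradient(w)
    if t in (1, 10, 100, 1000, 5000):
        print(f"iter {t:5d}   objective = {objective(w):.5f}")
# Convexity is why the solver choice "doesn't really matter": any reasonable
# method that gets close ends up at essentially the same objective value.
```
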
  11. Generalization. $\widehat{f}$ is returned by ERM; consider $R(\widehat{f}) - \widehat{R}(\widehat{f})$. This quantity is the excess risk; when it is small, we say we generalize, otherwise we overfit.
     ◮ Before, we said "by the CLT, $\widehat{R}(\bar{f}) \to R(\bar{f})$ as $n \to \infty$". Is this quantity the same?

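That question can be probed numerically. In the assumed setup below (pure-noise labels, a small linear class, an unregularized approximate ERM), the gap $R(f) - \widehat{R}(f)$ for a fixed $f$ averages to about zero over training draws, while the same gap evaluated at the data-dependent $\widehat{f}$ is systematically positive; this is why the CLT argument for a fixed $\bar{f}$ does not transfer directly to $\widehat{f}$.

```python
# Why the CLT argument for a fixed f does not transfer to f_hat: f_hat is chosen
# using the training sample, so R(f_hat) - R_hat(f_hat) is biased upward.
# The setup (pure-noise labels, linear class, unregularized ERM) is an assumption.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    X = rng.normal(size=(n, 5))
    y = np.where(rng.random(n) < 0.5, 1.0, -1.0)   # labels independent of X
    return X, y

def hinge_risk(w, X, y):
    return np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def approx_erm(X, y, steps=500, lr=0.1):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        active = (y * (X @ w)) < 1.0
        w += lr * (y[active, None] * X[active]).sum(axis=0) / len(y)
    return w

X_pop, y_pop = sample(200_000)                     # proxy for the population
w_fixed = np.ones(5)                               # fixed function, chosen without data
R_fixed = hinge_risk(w_fixed, X_pop, y_pop)

gap_fixed, gap_erm = [], []
for _ in range(100):
    X_tr, y_tr = sample(50)
    w_hat = approx_erm(X_tr, y_tr)
    gap_fixed.append(R_fixed - hinge_risk(w_fixed, X_tr, y_tr))
    gap_erm.append(hinge_risk(w_hat, X_pop, y_pop) - hinge_risk(w_hat, X_tr, y_tr))

print(f"mean gap at fixed f   : {np.mean(gap_fixed):+.4f}")   # ~ 0
print(f"mean gap at ERM f_hat : {np.mean(gap_erm):+.4f}")     # systematically > 0
```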