Kernel Methods For Regression and Classification - Mike Hughes, Tufts COMP 135, Fall 2020



SLIDE 1


Summary of Unit 5: Kernel Methods For Regression and Classification

Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

SLIDE 2


SVM vs. Logistic Regression

  • Loss: hinge (SVM) vs. cross entropy / log loss (Logistic Regression)
  • Sensitive to outliers: less sensitive (SVM) vs. more sensitive (Logistic Regression)
  • Probabilistic? No (SVM) vs. Yes (Logistic Regression)
  • Multi-class? Only via a separate model for each class, one-vs-all (SVM) vs. easy, using softmax (Logistic Regression)
  • Kernelizable? (covered next class) Yes, with speed benefits from sparsity (SVM) vs. Yes (Logistic Regression)
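For intuition, a minimal sketch (my addition, not from the slides) of the two loss functions, written as a function of the margin y * f(x) under the usual +/-1 label convention:

```python
import numpy as np

def hinge_loss(margin):
    # SVM hinge loss: zero once an example is correctly classified beyond the margin
    return np.maximum(0.0, 1.0 - margin)

def log_loss(margin):
    # logistic regression cross entropy (log loss), written in terms of the margin
    return np.log1p(np.exp(-margin))

margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
print(hinge_loss(margins))  # [3. 2. 1. 0. 0.]
print(log_loss(margins))    # strictly positive everywhere, even for well-classified points
```

The hinge loss is exactly zero for margins of at least 1 (which is what produces sparse support vectors), while the log loss keeps penalizing, and rewarding, every example.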

SLIDE 3

Multi-class SVMs

  • How do we extend the idea of margin to more than 2 classes? Not so elegant. Two options (see the sketch after this list):
    • One vs rest: need to fit C separate models; pick the class with the largest f(x)
    • One vs one: need to fit C(C-1)/2 models; pick the class with the most f(x) “wins”
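A minimal sketch of both strategies using scikit-learn's multi-class wrappers around a linear SVM (my example; the slides only describe the two options):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

# toy problem with C = 4 classes
X, y = make_classification(n_samples=400, n_features=6, n_informative=4,
                           n_classes=4, random_state=0)

# One vs rest: fits C separate models; predicts the class with the largest f(x)
ovr = OneVsRestClassifier(LinearSVC(max_iter=5000)).fit(X, y)

# One vs one: fits C(C-1)/2 models; predicts the class with the most "wins"
ovo = OneVsOneClassifier(LinearSVC(max_iter=5000)).fit(X, y)

print(len(ovr.estimators_))  # 4
print(len(ovo.estimators_))  # 4 * 3 / 2 = 6
```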


SLIDE 4

Multi-class Logistic Regression

  • How do we extend LR to more than 2 classes?
  • Elegant: we can train the weights using the same prediction function we’ll use at test time (the softmax below)


$$\hat{p}(x) = \mathrm{softmax}\big(w_1^T x,\; w_2^T x,\; \ldots,\; w_C^T x\big)$$
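A minimal NumPy sketch of this prediction function (my illustration; here W stacks the per-class weight vectors w_1, ..., w_C as rows):

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()        # shift for numerical stability
    e = np.exp(scores)
    return e / e.sum()

def predict_proba(x, W):
    # W has shape (C, F): one weight vector per class; returns C probabilities
    return softmax(W @ x)

W = np.array([[ 1.0, -0.5],
              [ 0.2,  0.3],
              [-1.0,  0.8]])               # C = 3 classes, F = 2 features
x = np.array([0.5, 2.0])
p = predict_proba(x, W)
print(p, p.sum())                          # probabilities sum to 1
```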
SLIDE 5

Kernel methods

  • Use kernel functions (similarity functions with special properties) to obtain flexible high-dimensional feature transformations without computing explicit features
  • Solve the “dual” problem (for parameters alpha), not the “primal” problem (for weights w)
  • Can use the “kernel trick” for:
    • regression
    • classification (Logistic Regression or SVM)
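As one concrete instance of solving the dual problem, here is a minimal kernel ridge regression sketch (my illustration; the slides do not fix a specific estimator). The dual parameter alpha has one entry per training example, and prediction needs only kernel evaluations, never explicit high-dimensional features:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), a positive semidefinite similarity function
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

def fit_dual(X_train, y_train, lam=0.1):
    # dual solution: alpha = (K + lam * I)^{-1} y, one coefficient per training example
    K = rbf_kernel(X_train, X_train)
    return np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)

def predict(X_new, X_train, alpha):
    # prediction uses only kernel values between new and training points
    return rbf_kernel(X_new, X_train) @ alpha

X = np.linspace(0.0, 6.0, 30)[:, None]
y = np.sin(X[:, 0])
alpha = fit_dual(X, y)
print(predict(np.array([[1.5]]), X, alpha))  # roughly sin(1.5) ~ 0.997
```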


SLIDE 6

Kernel Methods for Regression

Kernels exist for:

  • Periodic regression (see the sketch after this list)
  • Histograms
  • Strings
  • Graphs
  • And more!
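For instance, a sketch of one common periodic kernel (the exp-sine-squared form; an illustration, since the slides only note that such kernels exist):

```python
import numpy as np

def periodic_kernel(a, b, length_scale=1.0, period=2.0 * np.pi):
    # similarity depends on the distance modulo the period, so inputs one full
    # period apart look identical to a kernel regressor
    d = np.abs(a - b)
    return np.exp(-2.0 * np.sin(np.pi * d / period) ** 2 / length_scale ** 2)

print(periodic_kernel(0.0, 2.0 * np.pi))  # ~1.0: a full period apart
print(periodic_kernel(0.0, np.pi))        # smaller: half a period apart
```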
SLIDE 7

Review: Key concepts in supervised learning

  • Parametric vs nonparametric methods
  • Bias vs variance


SLIDE 8

Parametric vs Nonparametric

  • Parametric methods
    • Complexity of the decision function is fixed in advance and specified by a finite, fixed number of parameters, regardless of training data size
  • Nonparametric methods
    • Complexity of the decision function can grow as more training data is observed


Examples: linear regression, logistic regression, decision trees, ensembles of trees, nearest neighbor methods, neural networks.

SLIDE 9


Credit: Scott Fortmann-Roe http://scott.fortmann-roe.com/docs/BiasVariance.html

Bias & Variance

$y$: known “true” response. $\hat{y}$: estimate (a random variable).

SLIDE 10


Decompose into Bias & Variance

$y$ is the known “true” response value at a given heldout input $x$. $\hat{y}$ is a random variable, obtained by fitting the estimator to a random sample of N training data examples and then predicting at $x$. Define the average prediction $\bar{y} \triangleq \mathbb{E}[\hat{y}]$.

Bias: error from the average model to the truth. How far the average prediction of our model (averaged over all possible training sets of size N) is from the true response:

$$\mathrm{Bias}^2 = (\bar{y} - y)^2$$

Variance: deviation over model samples. How far predictions based on a single training set are from the average prediction:

$$\mathrm{Var}(\hat{y}) = \mathbb{E}\big[(\hat{y} - \bar{y})^2\big] = \mathbb{E}\big[\hat{y}^2\big] - \bar{y}^2$$

SLIDE 11


Total Error = Bias^2 + Variance. The expected value is over samples of the observed training set:

$$\begin{aligned}
\mathbb{E}\big[(\hat{y}(x_{tr}, y_{tr}) - y)^2\big]
&= \mathbb{E}\big[(\hat{y} - y)^2\big] \\
&= \mathbb{E}\big[\hat{y}^2 - 2\hat{y}y + y^2\big] \\
&= \mathbb{E}\big[\hat{y}^2\big] - 2\bar{y}y + y^2 \\
&= \mathbb{E}\big[\hat{y}^2\big] - \bar{y}^2 + \bar{y}^2 - 2\bar{y}y + y^2 \\
&= \mathrm{Var}(\hat{y}) + (\bar{y} - y)^2
\end{aligned}$$
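A quick simulation can check this decomposition numerically (my sketch, not from the slides): repeatedly draw training sets, fit an estimator, predict at one fixed heldout x, and compare the average squared error against bias^2 + variance.

```python
import numpy as np

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(3.0 * x)
x_test = 0.5
y_test = true_f(x_test)                 # known "true" response at heldout x
N, trials, degree = 20, 2000, 3

preds = []
for _ in range(trials):
    # a fresh random size-N training set each trial
    x_tr = rng.uniform(0.0, 1.0, N)
    y_tr = true_f(x_tr) + rng.normal(0.0, 0.3, N)
    coef = np.polyfit(x_tr, y_tr, degree)      # fit a small polynomial model
    preds.append(np.polyval(coef, x_test))     # predict at the heldout input
preds = np.array(preds)

bias_sq = (preds.mean() - y_test) ** 2
variance = preds.var()
total = ((preds - y_test) ** 2).mean()
print(total, bias_sq + variance)               # the two quantities agree
```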
SLIDE 12


[Toy example: ISL Fig. 6.5, curves of bias, variance, and total error as a function of model flexibility: less flexible models underfit, more flexible models overfit.]

Bias: error due to the inability of the typical fit (averaged over training sets) to capture the true predictive relationship.
Variance: error due to estimating from a single finite-size training set.

All supervised learning methods must manage the bias/variance tradeoff. Hyperparameter search is key.

SLIDE 13

Dimensionality Reduction & Embedding

Tufts COMP 135: Introduction to Machine Learning https://www.cs.tufts.edu/comp/135/2020f/

Many ideas/slides attributable to: Liping Liu (Tufts), Emily Fox (UW), Matt Gormley (CMU)

Prof. Mike Hughes
SLIDE 14


What will we learn?

[Course overview diagram: data examples {x_n}_{n=1}^N feed into three paradigms (Supervised Learning, Unsupervised Learning, Reinforcement Learning), each defined by a task summary and a performance measure.]

SLIDE 15


Task: Embedding

[Diagram: among Supervised, Unsupervised, and Reinforcement Learning, embedding is an unsupervised learning task; data with features x1, x2 are mapped to a lower-dimensional embedding.]

SLIDE 16
Dim. Reduction/Embedding: Unit Objectives

  • Goals of dimensionality reduction
    • Reduce feature vector size (keep signal, discard noise)
    • “Interpret” features: visualize/explore/understand
  • Common approaches
    • Principal Component Analysis (PCA)
    • word2vec and other neural embeddings
  • Evaluation Metrics
    • Storage size
    • Reconstruction error
    • “Interpretability”


SLIDE 17

Example: 2D viz. of movies


SLIDE 18


Example: Genes vs. geography

Where possible, we based the geographic origin on the observed country data for grandparents. We used a ‘strict consensus’ approach: if all observed grandparents originated from a single country, we used that country as the origin. If an individual’s observed grandparents originated from different countries, we excluded the individual. Where grandparental data were unavailable, we used the individual’s country of birth.

Total sample size after exclusion: 1,387 subjects. Features: over half a million variable DNA sites in the human genome. (Nature, 2008)

SLIDE 19


Example: Genes vs. geography

Nature, 2008

SLIDE 20

Example: Eigen Clothing


SLIDE 21


SLIDE 22

Centering the Data

Goal: each feature’s mean = 0.0
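A minimal NumPy sketch of centering (assuming the data is arranged as an N x F array):

```python
import numpy as np

X = np.array([[2.0, 10.0],
              [4.0, 14.0],
              [6.0, 18.0]])          # N = 3 examples, F = 2 features

m = X.mean(axis=0)                   # per-feature mean vector, size F
X_centered = X - m                   # subtract the mean from every example

print(m)                             # [ 4. 14.]
print(X_centered.mean(axis=0))       # [0. 0.]  each feature's mean is now 0.0
```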


SLIDE 23

Why center?

  • Think of the mean vector as the simplest possible “reconstruction” of a dataset
  • No example-specific parameters, just one F-dim vector


$$\min_{m \in \mathbb{R}^F} \sum_{n=1}^{N} (x_n - m)^T (x_n - m), \qquad m^* = \mathrm{mean}(x_1, \ldots, x_N)$$
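A quick numeric check of this claim (my sketch): the per-feature mean achieves a lower total squared reconstruction error than any other single vector m.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # N = 100 examples, F = 5 features
m_star = X.mean(axis=0)

def total_sq_error(m):
    # sum over n of (x_n - m)^T (x_n - m)
    return ((X - m) ** 2).sum()

print(total_sq_error(m_star))               # the minimum
print(total_sq_error(m_star + 0.1))         # any perturbation does worse
print(total_sq_error(np.zeros(5)))          # so does the zero vector
```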

SLIDE 24

Mean reconstruction


[Figure: original examples vs. their reconstructions using only the mean vector.]

SLIDE 25

Principal Component Analysis


SLIDE 26

Linear Projection to 1D


SLIDE 27

Reconstruction from 1D to 2D


SLIDE 28

2D Orthogonal Basis


SLIDE 29

Which 1D projection is best?


Idea: Minimize reconstruction error
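A sketch of that idea (my illustration): score candidate unit-length directions by the squared error of projecting to 1D and reconstructing back, then keep the best direction.

```python
import numpy as np

rng = np.random.default_rng(0)
# 2D data with most of its spread along the first axis
X = rng.normal(size=(200, 2)) * np.array([3.0, 0.5])
X = X - X.mean(axis=0)

def reconstruction_error(w):
    w = w / np.linalg.norm(w)        # unit-length projection direction
    z = X @ w                        # project each example down to 1D
    X_hat = np.outer(z, w)           # reconstruct back up to 2D
    return ((X - X_hat) ** 2).sum()

angles = np.linspace(0.0, np.pi, 180, endpoint=False)
candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)
errors = [reconstruction_error(w) for w in candidates]
print(candidates[int(np.argmin(errors))])   # close to [1, 0], the direction of largest spread
```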

SLIDE 30

K-dim Reconstruction with PCA


$$x_i = W z_i + m$$

Here $x_i$ is an F-dim high-dimensional data vector, $z_i$ is a K-dim low-dimensional vector, $W$ is an F x K weight matrix, and $m$ is the F-dim “mean” vector.

Problem: this is over-parameterized, with too many possible solutions. If we scale $z$ by 2 and scale $W$ by 1/2, we get an equivalent reconstruction. We need to constrain the magnitude of the weights: make each of the K weight vectors a unit vector, $\|w_k\|_2 = 1$.

SLIDE 31

Principal Component Analysis

  • Input:
    • X : training data, N x F
      • N high-dim. example vectors
    • K : int, number of components
      • Satisfies 1 <= K <= F
  • Output: Trained parameters for PCA
    • m : mean vector, size F
    • W : learned basis of weight vectors, F x K
      • One F-dim. vector (magnitude 1) for each component
      • Each of the K vectors is orthogonal to every other


Training step: .fit()
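A minimal sketch of this training step using an SVD of the centered data (my outline of the idea; in practice sklearn.decomposition.PCA provides .fit()):

```python
import numpy as np

def pca_fit(X, K):
    """Return (m, W): mean vector of size F and an orthonormal basis W of shape (F, K)."""
    m = X.mean(axis=0)                          # F-dim mean vector
    Xc = X - m                                  # center the data
    # right singular vectors of the centered data are the principal components
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:K].T                                # F x K; columns are unit length and orthogonal
    return m, W

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
m, W = pca_fit(X, K=2)
print(np.round(W.T @ W, 6))                     # identity matrix: orthonormal components
```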

SLIDE 32

Principal Component Analysis

  • Input:
    • X : training data, N x F
      • N high-dim. example vectors
    • Trained PCA “model”:
      • m : mean vector, size F
      • W : learned basis of eigenvectors, F x K
        • One F-dim. vector (magnitude 1) for each component
        • Each of the K vectors is orthogonal to every other
  • Output:
    • Z : projected data, N x K


Transformation step: .transform()
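The matching transformation step, as a self-contained sketch (a hand-made orthonormal basis stands in for a trained W):

```python
import numpy as np

def pca_transform(X, m, W):
    # project N x F data onto the K learned components; Z has shape (N, K)
    return (X - m) @ W

m = np.array([1.0, 2.0, 3.0])                   # "trained" mean vector, F = 3
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])                      # F x K basis, columns orthonormal, K = 2

X = np.array([[2.0, 2.0, 3.0],
              [1.0, 4.0, 5.0]])
Z = pca_transform(X, m, W)                      # low-dimensional codes
X_hat = Z @ W.T + m                             # reconstruction in the original space
print(Z)
print(X_hat)                                    # second example loses its third-axis detail
```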

SLIDE 33

Example: EigenFaces


SLIDE 34


Word Embeddings

SLIDE 35

Word Embeddings (word2vec)


Goal: map each word in vocabulary to an embedding vector

  • Preserve semantic meaning in this new vector space

vec(swimming) – vec(swim) + vec(walk) = vec(walking)
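A sketch of how such an analogy query is typically answered (toy hand-made vectors of my own; real word2vec embeddings are learned from text): form vec(swimming) - vec(swim) + vec(walk), then return the nearest vocabulary word by cosine similarity.

```python
import numpy as np

# toy 3-dimensional embeddings, illustrative values only
emb = {
    "swim":     np.array([1.0, 0.0, 0.0]),
    "swimming": np.array([1.0, 1.0, 0.0]),
    "walk":     np.array([0.0, 0.0, 1.0]),
    "walking":  np.array([0.0, 1.0, 1.0]),
    "taco":     np.array([5.0, 0.0, 5.0]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

query = emb["swimming"] - emb["swim"] + emb["walk"]
best = max((w for w in emb if w not in {"swimming", "swim", "walk"}),
           key=lambda w: cosine(emb[w], query))
print(best)   # "walking"
```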

SLIDE 36


Word Embeddings (word2vec)

Goal: map each word in vocabulary to an embedding vector

  • Preserve semantic meaning in this new vector space
SLIDE 37

How to embed?

Training: reward embeddings that predict nearby words in the sentence.

Goal: learn the embedding weights W.

[Diagram: an embedding lookup table with one row per word in a fixed vocabulary (typically 1,000-100,000 words; e.g. tacos, staff, dinosaur, hammer); each row is an embedding vector with typically 100-1,000 dimensions.]

Credit: https://www.tensorflow.org/tutorials/representation/word2vec
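A sketch of the lookup table those numbers describe (illustrative sizes; the slides give only the typical ranges): the embedding layer is just a V x D weight matrix W, and embedding a word means selecting its row.

```python
import numpy as np

vocab = ["tacos", "staff", "dinosaur", "hammer"]   # fixed vocabulary (typically 1k-100k words)
word_to_id = {w: i for i, w in enumerate(vocab)}

D = 8                                              # embedding dimension (typically 100-1000)
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), D))               # the weights learned during training

def embed(word):
    # embedding lookup: select the row of W for this word's id
    return W[word_to_id[word]]

print(embed("tacos").shape)                        # (8,)
```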

SLIDE 38
Dim. Reduction/Embedding: Unit Objectives

  • Goals of dimensionality reduction
    • Reduce feature vector size (keep signal, discard noise)
    • “Interpret” features: visualize/explore/understand
  • Common approaches
    • Principal Component Analysis (PCA)
    • word2vec and other neural embeddings
  • Evaluation Metrics
    • Storage size
    • Reconstruction error
    • “Interpretability”
