Introduction to Machine Learning
Non-linear prediction with kernels
- Prof. Andreas Krause
Learning and Adaptive Systems (las.ethz.ch)
Solving non-linear classification tasks
How can we find nonlinear classification boundaries? As in regression, we can use non-linear transformations of the feature vectors, followed by linear classification.
Need O(d^k) dimensions to represent (multivariate) polynomials of degree k on d features. Example: d = 10,000, k = 2 ⇒ need ~100M dimensions. In the following, we will see how we can efficiently operate implicitly in such high-dimensional feature spaces (i.e., without ever explicitly computing the transformation).
The "kernel trick":
Express the problem such that it only depends on inner products
Replace inner products by kernels
Example: the Perceptron. We will see further examples later.
Kernelized Perceptron optimization problem:
$$\hat{\alpha}_{1:n} = \arg\min_{\alpha_{1:n}} \frac{1}{n}\sum_{i=1}^{n}\max\Big\{0,\; -y_i \sum_{j=1}^{n}\alpha_j y_j\, x_i^T x_j\Big\}$$
Replacing the inner products $x_i^T x_j$ by a kernel $k(x_i, x_j)$:
$$\hat{\alpha}_{1:n} = \arg\min_{\alpha_{1:n}} \frac{1}{n}\sum_{i=1}^{n}\max\Big\{0,\; -y_i \sum_{j=1}^{n}\alpha_j y_j\, k(x_i, x_j)\Big\}$$
Kernelized Perceptron
Training:
Initialize $\alpha_1 = \dots = \alpha_n = 0$
For $t = 1, 2, \dots$:
  Pick a data point $(x_i, y_i)$ uniformly at random
  Predict $\hat{y} = \mathrm{sign}\big(\sum_{j=1}^{n} \alpha_j y_j\, k(x_j, x_i)\big)$
  If $\hat{y} \neq y_i$, set $\alpha_i \leftarrow \alpha_i + \eta_t$
Prediction: for a new point $x$, predict $\hat{y} = \mathrm{sign}\big(\sum_{j=1}^{n} \alpha_j y_j\, k(x_j, x)\big)$
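A minimal NumPy sketch of this procedure (my own illustration, not the lecture's code; the function and parameter names, the Gaussian kernel choice, and the constant step size in place of $\eta_t$ are assumptions):

```python
import numpy as np

def gaussian_kernel(x, z, h=1.0):
    # k(x, x') = exp(-||x - x'||_2^2 / h^2)
    return np.exp(-np.sum((x - z) ** 2) / h ** 2)

def train_kernel_perceptron(X, y, kernel=gaussian_kernel, steps=1000, eta=1.0, seed=0):
    """X: (n, d) data matrix, y: (n,) labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n = len(y)
    alpha = np.zeros(n)                                   # alpha_1 = ... = alpha_n = 0
    K = np.array([[kernel(a, b) for b in X] for a in X])  # precomputed Gram matrix
    for _ in range(steps):
        i = rng.integers(n)                               # pick (x_i, y_i) uniformly at random
        y_hat = np.sign(np.sum(alpha * y * K[:, i]))      # sign(sum_j alpha_j y_j k(x_j, x_i))
        if y_hat != y[i]:                                 # on a mistake, increase alpha_i
            alpha[i] += eta
    return alpha

def kernel_perceptron_predict(alpha, X, y, x_new, kernel=gaussian_kernel):
    # y_hat = sign(sum_j alpha_j y_j k(x_j, x_new))
    return np.sign(sum(a * yj * kernel(xj, x_new) for a, yj, xj in zip(alpha, y, X)))
```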
What are valid kernels?
How can we select a good kernel for our problem?
Can we use kernels beyond the perceptron?
Kernels work in very high-dimensional spaces. Doesn't this lead to overfitting?
Definition: let $X$ be the data space. A kernel is a function $k : X \times X \to \mathbb{R}$ satisfying
1) Symmetry: for any $x, x' \in X$ it must hold that $k(x, x') = k(x', x)$
2) Positive semi-definiteness: for any $n$ and any set $S = \{x_1, \dots, x_n\} \subseteq X$, the kernel (Gram) matrix
$$K = \begin{pmatrix} k(x_1, x_1) & \dots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \dots & k(x_n, x_n) \end{pmatrix}$$
must be positive semi-definite
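As a quick numerical illustration (my own addition, not from the slides): build the Gram matrix of a candidate kernel on a random sample and check symmetry and that its eigenvalues are non-negative.

```python
import numpy as np

def gram_matrix(kernel, X):
    """Kernel (Gram) matrix K with K[i, j] = k(x_i, x_j)."""
    return np.array([[kernel(a, b) for b in X] for a in X])

rbf = lambda x, z: np.exp(-np.sum((x - z) ** 2))    # Gaussian kernel with h = 1
X = np.random.default_rng(0).normal(size=(20, 3))   # a random sample from the data space
K = gram_matrix(rbf, X)

print(np.allclose(K, K.T))                          # symmetry
print(np.linalg.eigvalsh(K).min())                  # >= 0 (up to numerical error) for a valid kernel
```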
Examples of kernels:
Linear kernel: $k(x, x') = x^T x'$
Polynomial kernel: $k(x, x') = (x^T x' + 1)^d$
Gaussian (RBF, squared exponential) kernel: $k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$
Laplacian kernel: $k(x, x') = \exp(-\|x - x'\|_1 / h)$
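Written directly in NumPy (a sketch; the parameter names follow the formulas above):

```python
import numpy as np

def linear_kernel(x, z):
    return x @ z                                       # x^T x'

def polynomial_kernel(x, z, d=3):
    return (x @ z + 1.0) ** d                          # (x^T x' + 1)^d

def gaussian_kernel(x, z, h=1.0):
    return np.exp(-np.sum((x - z) ** 2) / h ** 2)      # exp(-||x - x'||_2^2 / h^2)

def laplacian_kernel(x, z, h=1.0):
    return np.exp(-np.sum(np.abs(x - z)) / h)          # exp(-||x - x'||_1 / h)
```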
Given a kernel $k$, predictors (for kernelized classification) have the form
$$\hat{y} = \mathrm{sign}\Big(\sum_{j=1}^{n} \alpha_j y_j\, k(x_j, x)\Big)$$
[Figures: effect of the bandwidth $h$ of the Gaussian kernel $k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$ on the predictor $f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x)$, shown for $h = 0.1$, $h = 0.3$, and $h = 1$]
Can define kernels on a variety of objects: sequence kernels, graph kernels, diffusion kernels, kernels on probability distributions, ...
Can define a kernel for measuring similarity between graphs by comparing random walks on both graphs (not further defined here)
[Borgwardt et al.]
Can measure similarity among nodes in a graph via diffusion kernels (not defined here)
Composition rules: suppose we have two kernels $k_1 : X \times X \to \mathbb{R}$ and $k_2 : X \times X \to \mathbb{R}$ defined on data space $X$. Then the following functions are valid kernels:
$k(x, x') = k_1(x, x') + k_2(x, x')$
$k(x, x') = c\, k_1(x, x')$ for $c > 0$
$k(x, x') = k_1(x, x')\, k_2(x, x')$
$k(x, x') = f(k_1(x, x'))$, where $f$ is a polynomial with positive coefficients or the exponential function
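A small numerical sanity check of these rules (my own illustration): combine two valid Gram matrices by sum, positive scaling, elementwise product, and elementwise exp, and verify that the smallest eigenvalue stays non-negative.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 2))

k1 = lambda x, z: x @ z                                # linear kernel
k2 = lambda x, z: np.exp(-np.sum((x - z) ** 2))        # Gaussian kernel, h = 1
gram = lambda k: np.array([[k(a, b) for b in X] for a in X])

K1, K2 = gram(k1), gram(k2)
for K in (K1 + K2, 3.0 * K1, K1 * K2, np.exp(K1)):     # k1+k2, c*k1 (c>0), k1*k2, f = exp
    print(np.linalg.eigvalsh(K).min())                 # all >= 0 (up to numerical error)
```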
[Figures: example payoff functions plotted over actions × contexts]
May want to use kernels to model pairwise data (users x products; genes x patients; ...)
Recap: we've seen how to kernelize the perceptron, and discussed properties and examples of kernels. Next questions:
What kind of predictors / decision boundaries do kernel methods entail?
Can we use the kernel trick beyond the perceptron?
Recall the Perceptron (and SVM) classification rule $\hat{y} = \mathrm{sign}\big(\sum_{i=1}^{n} \alpha_i y_i\, k(x_i, x)\big)$, and consider the Gaussian kernel
$$k(x, x') = \exp(-\|x - x'\|_2^2 / h^2)$$
k-Nearest Neighbor classifier: for a data point $x$, predict the majority of the labels of the $k$ nearest neighbors,
$$\hat{y} = \mathrm{sign}\Big(\sum_{i=1}^{n} y_i\, [x_i \text{ among } k \text{ nearest neighbors of } x]\Big)$$
How to choose $k$?
Comparison:
k-Nearest Neighbor: $\hat{y} = \mathrm{sign}\big(\sum_{i=1}^{n} y_i\, [x_i \text{ among } k \text{ nearest neighbors of } x]\big)$
Kernel Perceptron: $\hat{y} = \mathrm{sign}\big(\sum_{i=1}^{n} \alpha_i y_i\, k(x_i, x)\big)$
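A minimal sketch of this k-NN rule in NumPy (my own illustration; Euclidean distance and the names are assumptions):

```python
import numpy as np

def knn_predict(X, y, x_new, k=5):
    """Majority vote over the k nearest neighbors of x_new.
    X: (n, d) training points, y: (n,) labels in {-1, +1}."""
    dists = np.linalg.norm(X - x_new, axis=1)    # distances to all training points
    nearest = np.argsort(dists)[:k]              # indices of the k nearest neighbors
    return np.sign(np.sum(y[nearest]))           # majority vote, written as a sign of a sum
```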
Comparison of methods:
k-NN — Advantages: no training necessary. Disadvantages: prediction depends on all of the data ⇒ inefficient.
Kernelized Perceptron — Advantages: optimized weights can lead to improved performance; can capture "global trends" with suitable kernels; depends on "wrongly classified" examples only. Disadvantages: requires training.
Parametric models have a finite set of parameters. Examples: linear regression, the linear Perceptron, ...
Nonparametric models grow in complexity with the size of the data: potentially much more expressive, but also more computationally complex – why? Examples: the kernelized Perceptron, k-NN, ...
Kernels provide a principled way of deriving nonparametric models from parametric ones.
Recap: we've seen how to kernelize the perceptron, and discussed properties and examples of kernels. Next question:
Can we use the kernel trick beyond the perceptron?
The support vector machine can also be kernelized.
Original (parametric) SVM optimization problem:
$$\hat{w} = \arg\min_{w} \frac{1}{n}\sum_{i=1}^{n}\max\{0,\; 1 - y_i\, w^T x_i\} + \lambda \|w\|_2^2$$
As for the perceptron, the optimal $w$ lies in the span of the data, $w = \sum_{i=1}^{n} \alpha_i y_i x_i$; substituting this and replacing the inner products $x_i^T x_j$ by $k(x_i, x_j)$ yields the kernelized SVM.
Kernelized SVM
Learning: solve the optimization problem
SVM: $\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n}\sum_{i=1}^{n}\max\{0,\; 1 - y_i\, \alpha^T k_i\} + \lambda\, \alpha^T D_y K D_y \alpha$
Perceptron (for comparison): $\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n}\sum_{i=1}^{n}\max\{0,\; -y_i\, \alpha^T k_i\}$
where $k_i = \big(y_1 k(x_i, x_1), \dots, y_n k(x_i, x_n)\big)^T$ and $D_y = \mathrm{diag}(y_1, \dots, y_n)$.
Prediction: for a data point $x$, predict the label $y$ as $\hat{y} = \mathrm{sign}\big(\sum_{j=1}^{n} \alpha_j y_j\, k(x_j, x)\big)$
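One simple way to solve this problem is (sub)gradient descent on the objective. The sketch below is my own minimal illustration (full-batch subgradient steps with a fixed step size), not the lecture's solver; it takes a precomputed kernel matrix K and labels y, and uses the explicit-sum form $f_i = \sum_j \alpha_j y_j k(x_i, x_j)$.

```python
import numpy as np

def train_kernel_svm(K, y, lam=0.1, lr=0.01, iters=1000):
    """Subgradient descent on
    (1/n) * sum_i max{0, 1 - y_i f_i} + lam * alpha^T Dy K Dy alpha,
    where f_i = sum_j alpha_j y_j K[i, j]."""
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(iters):
        f = K @ (alpha * y)                           # f_i = sum_j alpha_j y_j k(x_i, x_j)
        violated = (1.0 - y * f) > 0.0                # margin violations (hinge loss is active)
        grad_hinge = -(y * (K @ (y * violated))) / n  # subgradient of the averaged hinge term
        grad_reg = 2.0 * lam * y * (K @ (y * alpha))  # gradient of lam * alpha^T Dy K Dy alpha
        alpha -= lr * (grad_hinge + grad_reg)
    return alpha

def kernel_svm_predict(alpha, y, k_new):
    """k_new[j] = k(x_j, x_new); predict sign(sum_j alpha_j y_j k(x_j, x_new))."""
    return np.sign(np.sum(alpha * y * k_new))
```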
From linear to nonlinear regression: we can also kernelize linear regression. The predictor has the form
$$f(x) = \sum_{i=1}^{n} \alpha_i\, k(x_i, x)$$
[Figures: linear vs. nonlinear fit $f(x)$ through a scatter of data points]
Original (parametric) linear optimization problem (ridge regression):
$$\hat{w} = \arg\min_{w} \frac{1}{n}\sum_{i=1}^{n}\big(w^T x_i - y_i\big)^2 + \lambda \|w\|_2^2$$
As for the perceptron, the optimal $w$ lies in the span of the data:
$$\hat{w} = \sum_{i=1}^{n} \alpha_i x_i$$
Kernelized optimization problem:
$$\hat{\alpha} = \arg\min_{\alpha_{1:n}} \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=1}^{n}\alpha_j\, k(x_i, x_j) - y_i\Big)^2 + \lambda\, \alpha^T K \alpha$$
with the kernel (Gram) matrix
$$K = \begin{pmatrix} k(x_1, x_1) & \dots & k(x_1, x_n) \\ \vdots & \ddots & \vdots \\ k(x_n, x_1) & \dots & k(x_n, x_n) \end{pmatrix}$$
Kernelized linear regression
Learning: solve the least squares problem
$$\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n}\,\|K\alpha - y\|_2^2 + \lambda\, \alpha^T K \alpha$$
Closed-form solution: $\hat{\alpha} = (K + n\lambda I)^{-1} y$
Prediction: for a data point $x$, predict the response $y$ as $\hat{y} = f(x) = \sum_{i=1}^{n}\hat{\alpha}_i\, k(x_i, x)$
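A compact NumPy sketch of this closed-form solution (my own illustration; the formula $\hat{\alpha} = (K + n\lambda I)^{-1} y$ follows from setting the gradient of the objective to zero, assuming $K$ is invertible):

```python
import numpy as np

def fit_kernel_ridge(K, y, lam=0.1):
    """Solve (1/n)||K alpha - y||_2^2 + lam * alpha^T K alpha in closed form:
    alpha_hat = (K + n * lam * I)^{-1} y."""
    n = len(y)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def kernel_ridge_predict(alpha, k_new):
    """k_new[i] = k(x_i, x_new); f(x) = sum_i alpha_i k(x_i, x)."""
    return alpha @ k_new
```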
What if $k(x, x') = x^T x'$? With the linear kernel, kernelized regression reduces to ordinary (ridge) linear regression.
Often, parametric models are too "rigid", and nonparametric models fail to extrapolate. Solution: use an additive combination of a linear and a non-linear kernel function, e.g.
$$k(x, x') = c_1 \exp\!\big(-\|x - x'\|_2^2 / h^2\big) + c_2\, x^T x'$$
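Such a combined kernel is straightforward to write down; this is an illustration, with $c_1$, $c_2$, and $h$ as tuning parameters chosen for the example:

```python
import numpy as np

def combined_kernel(x, z, c1=1.0, c2=1.0, h=1.0):
    """Additive combination of a Gaussian part (captures local, non-linear structure)
    and a linear part (lets the predictor extrapolate global linear trends)."""
    rbf = np.exp(-np.sum((x - z) ** 2) / h ** 2)   # c1 * exp(-||x - x'||^2 / h^2)
    linear = x @ z                                 # c2 * x^T x'
    return c1 * rbf + c2 * linear
```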
Application example: protein engineering [with Phil Romero, Frances Arnold, PNAS '13]
[Figures: parent sequences and candidate chimera designs; measured thermostability of candidate sequences]
Identification of new thermostable P450 chimeras: 5.3°C more stable than the best published sequence!
How should we select suitable kernels? And for a given kernel, how should we choose its parameters?
Domain knowledge (dependent on the data type)
"Brute force" (or heuristic) search
Use cross-validation (a sketch follows below)
Learning kernels: much research on automatically selecting good kernels (Multiple Kernel Learning; Hyperkernels; etc.)
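For instance, the bandwidth $h$ of the Gaussian kernel can be chosen by k-fold cross-validation. The sketch below is my own illustration (kernel ridge regression as the underlying method, squared validation error as the score); the candidate grid and fold count are arbitrary choices.

```python
import numpy as np

def cv_select_bandwidth(X, y, candidates=(0.1, 0.3, 1.0, 3.0), folds=5, lam=0.1):
    """Return the bandwidth h with the lowest k-fold cross-validation error."""
    rng = np.random.default_rng(0)
    splits = np.array_split(rng.permutation(len(y)), folds)

    def gram(A, B, h):                                   # K[i, j] = exp(-||a_i - b_j||^2 / h^2)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / h ** 2)

    scores = {}
    for h in candidates:
        errs = []
        for f in range(folds):
            val = splits[f]
            tr = np.concatenate([splits[g] for g in range(folds) if g != f])
            K = gram(X[tr], X[tr], h)
            alpha = np.linalg.solve(K + len(tr) * lam * np.eye(len(tr)), y[tr])
            pred = gram(X[val], X[tr], h) @ alpha        # f(x) = sum_i alpha_i k(x_i, x)
            errs.append(np.mean((pred - y[val]) ** 2))
        scores[h] = float(np.mean(errs))
    return min(scores, key=scores.get), scores
```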
Kernels map to (very) high-dimensional spaces. Why do we hope to be able to learn?
First attempt at an answer: (typically) # parameters << # dimensions. Why? The number of parameters equals the number of data points ("non-parametric learning").
Second attempt at an answer: overfitting can of course happen (if we choose poor parameters), but we can combat overfitting by regularization.
This is already built into kernelized linear regression (and SVMs), but not the kernelized Perceptron
KLR: $\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n}\,\|K\alpha - y\|_2^2 + \lambda\, \alpha^T K \alpha$
SVM: $\hat{\alpha} = \arg\min_{\alpha} \frac{1}{n}\sum_{i=1}^{n}\max\{0,\; 1 - y_i\, \alpha^T k_i\} + \lambda\, \alpha^T D_y K D_y \alpha$
Summary:
Kernels are (efficient, implicit) inner products; positive (semi-)definite functions; many examples (linear, polynomial, Gaussian/RBF, ...)
The "kernel trick": reformulate the learning algorithm so that inner products appear, then replace inner products by kernels
k-Nearest Neighbor classifier (and its relation to the Perceptron)
How to choose kernels (kernel engineering etc.)
Applications: kernelized Perceptron / SVM; kernelized linear regression
[Overview diagram relating the methods:
Least squares regression —(l2-regularizer)→ Ridge regression —(kernels)→ Kernelized regression; —(l1-regularizer)→ Lasso
Perceptron —(l2-regularizer)→ Linear SVM —(kernels)→ Kernelized SVM; —(l1-regularizer)→ l1-SVM; Perceptron —(kernels)→ Kernelized Perceptron, with k-NN as a "special case"
Regression and classification methods are connected by the choice of loss function]
Model / loss function + regularization: squared loss, 0/1 loss, Perceptron loss, hinge loss; L2 norm, L1 norm
Method: exact solution, gradient descent, (mini-batch) SGD, convex programming, …
Model selection: k-fold cross-validation, Monte Carlo CV
Representation / features: linear hypotheses; nonlinear hypotheses with nonlinear feature transforms; kernels
Evaluation metric: mean squared error, accuracy