

SLIDE 1

Machine Learning

Fall 2017

Professor Liang Huang

Kernels

(Kernels, Kernelized Perceptron and SVM)

(Chap. 12 of CIML)

SLIDE 2

Nonlinear Features

  • Concatenated (combined) features
  • XOR: x = (x1, x2, x1x2)
  • income: add “degree + major”
  • Perceptron
  • Map data into feature space: x → φ(x)
  • Solution in span of φ(xi)

[Figure: XOR data points x1: +1, x2: −1, x3: +1, x4: −1 mapped into feature space]
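A quick numerical check of the XOR bullet, taking the points at (±1, ±1): with the product feature x1x2 appended, the four XOR points become linearly separable. A minimal sketch; the separator w = (0, 0, 1) is an illustrative choice, not from the slides.

```python
# XOR points (x1, x2) with labels y; phi appends the product feature x1*x2
points = {(1, 1): +1, (-1, -1): +1, (1, -1): -1, (-1, 1): -1}

def phi(x1, x2):
    # concatenated (combined) features from the slide
    return (x1, x2, x1 * x2)

w = (0, 0, 1)  # in feature space, sign(x1*x2) labels XOR perfectly
for (x1, x2), y in points.items():
    f = sum(wi * fi for wi, fi in zip(w, phi(x1, x2)))
    assert (1 if f > 0 else -1) == y  # all four points correctly classified
```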

SLIDE 3

Quadratic Features

  • Separating surfaces are circles, hyperbolae, parabolae

SLIDE 4

Kernels as dot products

Problem: Extracting features can sometimes be very costly. Example: second-order features in 1000 dimensions. This leads to 5 · 10^5 numbers. For higher-order polynomial features it is much worse.

Solution: Don't compute the features; try to compute dot products implicitly. For some features this works . . .

Definition: A kernel function k : X × X → ℝ is a symmetric function in its arguments for which the following property holds: k(x, x′) = ⟨Φ(x), Φ(x′)⟩ for some feature map Φ. If k(x, x′) is much cheaper to compute than Φ(x) . . .

SLIDE 5

Quadratic Kernel

Quadratic Features in ℝ²:
  Φ(x) := (x1², √2 x1x2, x2²)

Dot Product:
  ⟨Φ(x), Φ(x′)⟩ = ⟨(x1², √2 x1x2, x2²), (x1′², √2 x1′x2′, x2′²)⟩ = ⟨x, x′⟩² = k(x, x′)

Insight: the trick works for any polynomial of order d via ⟨x, x′⟩^d.

For x ∈ ℝⁿ with the quadratic φ, the naive route costs O(n²) to compute φ(x) and O(n²) for φ(x)·φ(x′); the kernel k(x, x′) costs only O(n).
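A numerical check of the identity ⟨Φ(x), Φ(x′)⟩ = ⟨x, x′⟩², as a minimal sketch for x ∈ ℝ²:

```python
import numpy as np

def phi(x):
    # explicit quadratic feature map (O(n^2) features in general)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def k(x, xp):
    # quadratic kernel: one dot product, O(n)
    return np.dot(x, xp) ** 2

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)), k(x, xp))  # both print 1.0
```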

SLIDE 6

The Perceptron on features

  • Nothing happens if classified correctly
  • Weight vector is a linear combination of the φ(xi)
  • Classifier is (implicitly) a linear combination of inner products

initialize w = 0, b = 0
repeat
  pick (xi, yi) from data
  if yi(w · Φ(xi) + b) ≤ 0 then
    w ← w + yi Φ(xi)
    b ← b + yi
until yi(w · Φ(xi) + b) > 0 for all i

w = Σ_{i∈I} αi φ(xi)        f(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩
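The pseudocode above as a runnable sketch; the feature map phi and the data are placeholders to be supplied.

```python
import numpy as np

def perceptron_primal(X, y, phi, max_epochs=100):
    """Perceptron on explicit features: w and b live in feature space."""
    F = np.array([phi(x) for x in X])        # precompute Phi(x_i) for all examples
    w, b = np.zeros(F.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for fi, yi in zip(F, y):
            if yi * (w @ fi + b) <= 0:       # misclassified (or on the boundary)
                w += yi * fi                 # w <- w + y_i Phi(x_i)
                b += yi                      # b <- b + y_i
                mistakes += 1
        if mistakes == 0:                    # all margins positive: converged
            break
    return w, b
```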

SLIDE 7

Kernelized Perceptron (Functional Form)

  • instead of updating w, now update αi
  • weight vector is a linear combination: w = Σ_{i∈I} αi φ(xi)
  • classifier is a linear combination of inner products:
    f(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)

initialize f = 0
repeat
  pick (xi, yi) from data
  if yi f(xi) ≤ 0 then
    f(·) ← f(·) + yi k(xi, ·) + yi
until yi f(xi) > 0 for all i

update: αi ← αi + yi (increase its vote by 1)
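The same algorithm in dual form, as a sketch: we store α and the Gram matrix instead of w (function and variable names are mine).

```python
import numpy as np

def perceptron_dual(X, y, k, max_epochs=100):
    """Kernelized perceptron: update alpha_i instead of w."""
    n = len(X)
    alpha, b = np.zeros(n), 0.0
    # Gram matrix K[i][j] = k(x_i, x_j), computed once
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            f = alpha @ K[:, i] + b          # f(x_i) = sum_j alpha_j k(x_j, x_i) + b
            if y[i] * f <= 0:                # mistake on x_i
                alpha[i] += y[i]             # one more "vote" for x_i
                b += y[i]
                mistakes += 1
        if mistakes == 0:                    # converged
            break
    return alpha, b
```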

SLIDE 8

Kernelized Perceptron

  • Nothing happens if classified correctly
  • Weight vector is a linear combination
  • Classifier is a linear combination of inner products

Primal Form
  update weights: w ← w + yi φ(xi)
  classify: f(x) = w · φ(x)

Dual Form (implicitly equivalent)
  update linear coefficients: αi ← αi + yi
  implicitly: w = Σ_{i∈I} αi φ(xi)
  classify: f(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)

SLIDE 9

Kernelized Perceptron

Primal Form
  update weights: w ← w + yi φ(xi)
  classify: f(x) = w · φ(x)

Dual Form (implicitly equivalent)
  update linear coefficients: αi ← αi + yi
  implicitly: w = Σ_{i∈I} αi φ(xi)
  classify: f(x) = w · φ(x) = [Σ_{i∈I} αi φ(xi)] · φ(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)

(evaluating via ⟨φ(xi), φ(x)⟩ is slow, O(d²); via k(xi, x) it is fast, O(d))

SLIDE 10

Kernelized Perceptron (Dual Form)

initialize αi = 0 for all i
repeat
  pick (xi, yi) from data
  if yi f(xi) ≤ 0 then
    αi ← αi + yi        (update linear coefficients)
until yi f(xi) > 0 for all i

implicitly: w = Σ_{i∈I} αi φ(xi)
classify: f(x) = w · φ(x) = [Σ_{i∈I} αi φ(xi)] · φ(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)
(evaluating via ⟨φ(xi), φ(x)⟩ is slow, O(d²); via k(xi, x) it is fast, O(d))

If #features >> #examples, the dual is easier; otherwise the primal is easier.

SLIDE 11

Kernelized Perceptron

Primal Perceptron
  update weights: w ← w + yi φ(xi)
  classify: f(x) = w · φ(x)

Dual Perceptron
  update linear coefficients: αi ← αi + yi
  implicitly: w = Σ_{i∈I} αi φ(xi)

If #features >> #examples, the dual is easier; otherwise the primal is easier.

Q: when is #features >> #examples?
A: with higher-order polynomial kernels, or exponential kernels (infinite-dimensional).
SLIDE 12

Kernelized Perceptron: Pros/Cons of the Kernel in the Dual

  • pros:
    • no need to compute φ(x) (time)
    • no need to store φ(x) and w (memory)
  • cons:
    • sum over all misclassified training examples at test time
    • need to store all misclassified training examples (memory)
    • this set is called the “support vector set”
    • SVM will minimize this set!

dual update: αi ← αi + yi
implicitly: w = Σ_{i∈I} αi φ(xi)
classify: f(x) = w · φ(x) = [Σ_{i∈I} αi φ(xi)] · φ(x) = Σ_{i∈I} αi ⟨φ(xi), φ(x)⟩ = Σ_{i∈I} αi k(xi, x)
(evaluating via ⟨φ(xi), φ(x)⟩ is slow, O(d²); via k(xi, x) it is fast, O(d))

SLIDE 13

Kernelized Perceptron: Dual vs. Primal Example

Data (linear kernel, i.e. identity feature map):
  x1 = (0, 1): −1
  x2 = (2, 1): +1
  x3 = (0, −1): +1

Primal Perceptron: update w on each new mistake
  x1: −1  ⇒  w = (0, −1)
  x2: +1  ⇒  w = (2, 0)
  x3: +1  ⇒  w = (2, −1)

Dual Perceptron: update α on each new mistake (w implicit)
  x1: −1  ⇒  α = (−1, 0, 0)   w = −x1
  x2: +1  ⇒  α = (−1, 1, 0)   w = −x1 + x2
  x3: +1  ⇒  α = (−1, 1, 1)   w = −x1 + x2 + x3

final implicit w = (2, −1)

Geometric interpretation of dual classification: the summed dot products with x2 & x3 must exceed the dot product with x1 (agreement with positives > agreement with negatives).
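A standalone check of this trace: one pass of the dual perceptron with the linear kernel (no bias term, as in the slide) recovers α = (−1, 1, 1) and the implicit w = (2, −1).

```python
import numpy as np

X = np.array([[0.0, 1.0], [2.0, 1.0], [0.0, -1.0]])
y = np.array([-1, 1, 1])

# one pass of the dual perceptron, linear kernel k(x, x') = x . x'
alpha, K = np.zeros(3), X @ X.T
for i in range(3):
    if y[i] * (alpha @ K[:, i]) <= 0:   # mistake on x_i
        alpha[i] += y[i]
print(alpha)                            # [-1.  1.  1.]
print(alpha @ X)                        # implicit w = -x1 + x2 + x3 = [ 2. -1.]
```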

SLIDE 14

XOR Example

(XOR data: x1: +1, x2: −1, x3: +1, x4: −1)

Dual Perceptron: update α on each new mistake (w implicit)
  x1: +1  ⇒  α = (+1, 0, 0, 0)    w = φ(x1)
  x2: −1  ⇒  α = (+1, −1, 0, 0)   w = φ(x1) − φ(x2)

classification rule in dual/geometry:
  (x · x1)² > (x · x2)²  ⇒  cos²θ1 > cos²θ2  ⇒  |cos θ1| > |cos θ2|

in dual/algebra, writing the test point as x = (x1, x2) and taking the data points x1 = (1, 1), x2 = (1, −1):
  (x · x1)² > (x · x2)²  ⇒  (x1 + x2)² > (x1 − x2)²  ⇒  x1x2 > 0

also verify in the primal:
  k(x, x′) = (x · x′)²  ⇔  φ(x) = (x1², x2², √2 x1x2)
  w = φ(x1) − φ(x2) = (0, 0, 2√2)
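Verifying numerically, with the XOR points taken at (±1, ±1): after the two updates the dual classifier f(x) = k(x1, x) − k(x2, x) gets all four points right.

```python
import numpy as np

k = lambda x, xp: np.dot(x, xp) ** 2          # quadratic kernel

x1, x2 = np.array([1, 1]), np.array([1, -1])  # the two updated examples
data = {(1, 1): +1, (-1, -1): +1, (1, -1): -1, (-1, 1): -1}

for p, label in data.items():
    x = np.array(p)
    f = k(x1, x) - k(x2, x)                   # alpha = (+1, -1, 0, 0)
    print(p, label, "f(x) =", f)              # sign(f) matches every label
```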

SLIDE 15

Circle Example??

Dual Perceptron: update α on each new mistake (w implicit)
  x1: +1  ⇒  α = (+1, 0, 0, 0)    w = φ(x1)
  x2: −1  ⇒  α = (+1, −1, 0, 0)   w = φ(x1) − φ(x2)

k(x, x′) = (x · x′)²  ⇔  φ(x) = (x1², x2², √2 x1x2)

SLIDE 16

Polynomial Kernels

Idea: We want to extend k(x, x′) = ⟨x, x′⟩² to k(x, x′) = (⟨x, x′⟩ + c)^d where c > 0 and d ∈ ℕ. Prove that such a kernel corresponds to a dot product.

Proof strategy: Simple and straightforward: compute the explicit sum given by the kernel, i.e.
  k(x, x′) = (⟨x, x′⟩ + c)^d = Σ_{i=0}^{d} (d choose i) ⟨x, x′⟩^i c^{d−i}
Individual terms ⟨x, x′⟩^i are dot products for some Φi(x).

+c just augments the space. Simpler proof: add a constant feature x0 = √c to each example, so ⟨x, x′⟩ + c is an ordinary dot product in the augmented space.

SLIDE 17

Circle Example

Dual Perceptron: update α on each new mistake (w implicit)
  x1: +1  ⇒  α = (+1, 0, 0, 0, 0)     w = φ(x1)
  x2: −1  ⇒  α = (+1, −1, 0, 0, 0)    w = φ(x1) − φ(x2)
  x3: −1  ⇒  α = (+1, −1, −1, 0, 0)   w = φ(x1) − φ(x2) − φ(x3)

k(x, x′) = (x · x′)²      ⇔  φ(x) = (x1², x2², √2 x1x2)
k(x, x′) = (x · x′ + 1)²  ⇔  φ(x) = ?

[Figure: five points x1 … x5 with labels ±1]
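One answer to “φ(x) = ?”, for x ∈ ℝ² and c = 1: a sketch using the augment-by-√c construction from Slide 16, easy to check numerically.

```python
import numpy as np

def phi(x):
    # explicit map for k(x, x') = (x . x' + 1)^2 in R^2:
    # quadratic terms, linear terms scaled by sqrt(2), and a constant 1
    s = np.sqrt(2)
    return np.array([x[0]**2, x[1]**2, s*x[0]*x[1], s*x[0], s*x[1], 1.0])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(x), phi(xp)))   # 4.0
print((np.dot(x, xp) + 1) ** 2)  # 4.0, identical
```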

SLIDE 18

Examples

Examples of kernels k(x, x′):
  Linear:            ⟨x, x′⟩
  Laplacian RBF:     exp(−λ‖x − x′‖)
  Gaussian RBF:      exp(−λ‖x − x′‖²)
  Polynomial:        (⟨x, x′⟩ + c)^d,  c ≥ 0, d ∈ ℕ
  B-Spline:          B_{2n+1}(x − x′)
  Cond. Expectation: E_c[p(x|c) p(x′|c)]

Simple trick for checking Mercer’s condition: compute the Fourier transform of the kernel and check that it is nonnegative.

You only need to know the polynomial and Gaussian kernels.

(RBF kernels distort distance; polynomial kernels distort angle.)
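The first four kernels from the table, written out as plain functions (parameter names are mine):

```python
import numpy as np

def linear(x, xp):
    return np.dot(x, xp)

def polynomial(x, xp, c=1.0, d=3):
    return (np.dot(x, xp) + c) ** d

def gaussian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.sum((x - xp) ** 2))

def laplacian_rbf(x, xp, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - xp))
```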

SLIDE 19

Kernel Summary

  • For a feature map φ, find a magic function k such that:
  • the dot product φ(x)·φ(x′) = k(x, x′)
  • this k(x, x′) should be much faster to compute than φ(x)
  • k(x, x′) should be computable in O(n) if x ∈ ℝⁿ
  • φ(x) is much slower: O(n^d) for a polynomial of degree d, more for Gaussian
  • But for any function k, is there a φ s.t. φ(x)·φ(x′) = k(x, x′)?

SLIDE 20

Mercer’s Theorem

The Theorem: For any symmetric function k : X × X → ℝ which is square integrable in X × X and which satisfies
  ∫∫_{X×X} k(x, x′) f(x) f(x′) dx dx′ ≥ 0 for all f ∈ L²(X),
there exist φi : X → ℝ and numbers λi ≥ 0 such that
  k(x, x′) = Σ_i λi φi(x) φi(x′) for all x, x′ ∈ X.

Interpretation: The double integral is the continuous version of a vector-matrix-vector multiplication. For positive semidefinite matrices we have
  Σ_i Σ_j k(xi, xj) αi αj ≥ 0.
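The matrix version of Mercer's condition is easy to check numerically: on any finite sample, the Gram matrix must have no negative eigenvalues. A sketch:

```python
import numpy as np

def is_psd_gram(k, X, tol=1e-9):
    """Discrete Mercer check: all eigenvalues of the Gram matrix K are >= 0."""
    n = len(X)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))

X = [np.random.randn(3) for _ in range(20)]
print(is_psd_gram(lambda x, xp: (np.dot(x, xp) + 1) ** 2, X))  # True
print(is_psd_gram(lambda x, xp: -np.dot(x, xp), X))            # False: not a kernel
```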

SLIDE 21

Properties

Distance in Feature Space: distances between points in feature space via
  d(x, x′)² := ‖Φ(x) − Φ(x′)‖² = ⟨Φ(x), Φ(x)⟩ − 2⟨Φ(x), Φ(x′)⟩ + ⟨Φ(x′), Φ(x′)⟩ = k(x, x) + k(x′, x′) − 2k(x, x′)

Kernel Matrix: To compare observations we compute dot products, so we study the matrix K given by Kij = ⟨Φ(xi), Φ(xj)⟩ = k(xi, xj), where the xi are the training patterns.

Similarity Measure: The entries Kij tell us the overlap between Φ(xi) and Φ(xj), so k(xi, xj) is a similarity measure.
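The distance formula means we can measure feature-space distances without ever forming Φ. A sketch with the Gaussian RBF kernel, for which k(x, x) = 1 and so d(x, x′)² = 2 − 2k(x, x′):

```python
import numpy as np

def kernel_dist(k, x, xp):
    # d(x, x')^2 = k(x, x) + k(x', x') - 2 k(x, x')
    return np.sqrt(k(x, x) + k(xp, xp) - 2 * k(x, xp))

rbf = lambda x, xp: np.exp(-np.sum((x - xp) ** 2))
x, xp = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(kernel_dist(rbf, x, xp))  # ~sqrt(2): distant points saturate at sqrt(2)
```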

SLIDE 22

Kernelized Pegasos for SVM

For HW2, you don’t need to randomly choose training examples: just go over all training examples in the original order, and call that an epoch (same as HW1).
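A minimal sketch of the kernelized Pegasos update under the HW2 convention above (fixed visiting order, one pass = one epoch). The α-counting update follows the published dual formulation of Pegasos (Shalev-Shwartz et al.); the exact HW2 interface may differ.

```python
import numpy as np

def kernelized_pegasos(X, y, k, lam=0.01, epochs=5):
    """Kernelized Pegasos: alpha[i] counts how many updates x_i triggered."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[k(X[i], X[j]) for j in range(n)] for i in range(n)])
    t = 0
    for _ in range(epochs):                       # one ordered pass = one epoch
        for i in range(n):
            t += 1
            f = (alpha * y) @ K[:, i] / (lam * t)  # current decision value at x_i
            if y[i] * f < 1:                       # margin violated
                alpha[i] += 1                      # x_i gets one more vote
    return alpha
```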

SLIDE 23

[Figure: Gaussian RBF SVM decision surface, σ = 1.0, C = ∞, with contours f(x) = 1, f(x) = 0, f(x) = −1]

f(x) = Σ_{i=1}^{N} αi yi exp(−‖x − xi‖² / 2σ²) + b

Gaussian RBF kernel (default in sklearn)

SLIDE 24

[Figure: same data, σ = 1.0, C = 100]

Decreasing C gives a wider (soft) margin.

SLIDE 25

[Figure: same data, σ = 1.0, C = 10]

SLIDE 26

[Figure: same data, σ = 1.0, C = ∞]

SLIDE 27

[Figure: same data, σ = 0.25, C = ∞]

Decreasing σ moves towards a nearest-neighbour classifier.

SLIDE 28

[Figure: same data, σ = 0.1, C = ∞]
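Plots like these can be reproduced with sklearn's SVC. Note that sklearn parametrizes the RBF kernel as exp(−γ‖x − x′‖²), so γ = 1/(2σ²), and C = ∞ is approximated by a very large C; the toy dataset below is a stand-in, not the slides' data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)  # circular true boundary

sigma, C = 1.0, 1e6                # huge C approximates the hard-margin C = "infinity"
clf = SVC(kernel="rbf", gamma=1.0 / (2 * sigma ** 2), C=C).fit(X, y)
print(clf.n_support_)              # number of support vectors per class
# decreasing sigma (i.e. increasing gamma) bends the boundary toward nearest-neighbor
```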

SLIDE 29

Polynomial Kernels

This is in contrast with C: smaller C ⇒ wider margin (underfitting); larger C ⇒ narrower margin (overfitting).

SLIDE 30

Overfitting vs. Underfitting

SLIDE 31

From SVM to Nearest Neighbor

  • for each test example x, decide its label by the training example closest to x
  • decision boundary highly non-linear (Voronoi)
  • k-nearest neighbor (k-NN): smoother boundaries (see the sketch below)
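A bare-bones version of this classifier: k = 1 is exactly “closest training example wins”; larger k takes a majority vote over labels in {−1, +1}.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=1):
    """Label x by majority vote among its k nearest training examples."""
    d = np.sum((X_train - x) ** 2, axis=1)           # squared Euclidean distances
    nearest = np.argsort(d)[:k]                      # indices of the k closest points
    return 1 if np.sum(y_train[nearest]) > 0 else -1  # labels assumed in {-1, +1}
```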
SLIDE 32

[Figure: k-NN decision boundaries on training data (left) and testing data (right), for K = 1, 3, 7, 21]

  K = 1:  training error = 0.0,    testing error = 0.15
  K = 3:  training error = 0.0760, testing error = 0.1340
  K = 7:  training error = 0.1320, testing error = 0.1110
  K = 21: training error = 0.1120, testing error = 0.0920

SLIDE 33

[Figure: same k-NN plots as Slide 32]

  K = 1:  training error = 0.0,    testing error = 0.15
  K = 3:  training error = 0.0760, testing error = 0.1340
  K = 7:  training error = 0.1320, testing error = 0.1110
  K = 21: training error = 0.1120, testing error = 0.0920

small k: overfitting; large k: underfitting
what about k = N?

SLIDE 34

SVM vs. Nearest Neighbor

  support vectors: few (SVM) vs. all (nearest neighbor)

SLIDE 35

SLIDE 36

[Figure: points labeled a, b, c, d, e, f]