Max Margin Classifier. Oliver Schulte, CMPT 726. Bishop PRML Ch. 7.

  1. Max Margin Classifier. Oliver Schulte, CMPT 726. Bishop PRML Ch. 7.

  2. Outline: Maximum Margin Criterion, Math, Maximizing the Margin, Non-Separable Data.

  3. Kernels and Non-linear Mappings • Where does the maximization problem come from? • The intuition comes from the primal version, which is based on a feature mapping φ. • Theorem: every valid kernel k(x, y) is the dot product φ(x)^T φ(y) for some set of basis functions (feature mapping) φ. • The feature space φ(x) can be high-dimensional, even infinite-dimensional. • This is useful because data that are not separable in the original input space x may be separable in the feature space φ(x). • We can think about how to find a good linear separator using dot products in the high-dimensional feature space, then transfer this back to kernels in the original input space.
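As an illustration (not from the slides), here is a minimal sketch of the separability point, using the assumed homogeneous quadratic feature mapping φ(x) = (x_1^2, x_2^2, √2 x_1 x_2), whose dot product is the kernel k(x, z) = (x^T z)^2: points inside versus outside the unit circle are not linearly separable in the 2-D input space, but a simple linear rule separates them in this feature space.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.5, 1.5, size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)  # circular boundary: not linearly separable in x

def phi(x):
    # explicit feature mapping for the homogeneous quadratic kernel (x^T z)^2
    x1, x2 = x
    return np.array([x1 ** 2, x2 ** 2, np.sqrt(2.0) * x1 * x2])

Phi = np.array([phi(x) for x in X])

# in feature space the classes are separated by the linear rule w^T phi(x) + b > 0
# with w = (1, 1, 0) and b = -1, i.e. x1^2 + x2^2 - 1 > 0
w, b = np.array([1.0, 1.0, 0.0]), -1.0
pred = (Phi @ w + b > 0).astype(int)
print("agreement with labels:", (pred == y).mean())  # 1.0: perfectly separated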

  7. Why Kernels? • If we can use dot products with explicit features, why bother with kernels? • It is often easier to specify how similar two objects are (a dot product) than to construct an explicit feature space φ, e.g. for graphs, sets, and strings (NIPS 2009 best student paper award). • There are high-dimensional (even infinite-dimensional) feature spaces that have efficient-to-compute kernels.
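As an aside (not on the slides), the Gaussian (RBF) kernel is a standard example of the last point: its implicit feature space is infinite-dimensional, yet evaluating the kernel costs only one squared distance in the original input space. A minimal sketch:

import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    # Gaussian (RBF) kernel exp(-gamma * ||x - z||^2); its implicit feature
    # space is infinite-dimensional, but the evaluation stays O(d)
    diff = x - z
    return np.exp(-gamma * (diff @ diff))

x = np.array([1.0, 2.0, 3.0])
z = np.array([1.5, 1.0, 2.5])
print(rbf_kernel(x, z))  # a number in (0, 1]; equals 1 exactly when x == z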

  8. Kernel Trick • In previous lectures on linear models, we explicitly computed φ(x_i) for each data point and ran the algorithm in feature space. • For some feature spaces, the dot product φ(x_i)^T φ(x_j) can be computed efficiently. • The efficient computation is packaged as a kernel function k(x_i, x_j) = φ(x_i)^T φ(x_j). • The kernel trick is to rewrite an algorithm so that the inputs x enter only in the form of dot products. • The menu: kernel trick examples, then kernel functions.
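A minimal sketch (helper names and data are illustrative, not from the slides) of the bookkeeping this implies: a kernelized algorithm only ever needs the Gram matrix K[i, j] = k(x_i, x_j), which can be filled in without ever constructing φ(x_i).

import numpy as np

def gram_matrix(X, kernel):
    # K[i, j] = kernel(x_i, x_j) for all pairs of rows of X; this matrix is
    # all a kernelized algorithm needs to see
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

quadratic = lambda x, z: (1.0 + x @ z) ** 2  # the kernel used later in the slides

X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, -1.0]])
print(gram_matrix(X, quadratic))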

  12. A Kernel Trick • Let's look at the nearest-neighbour classification algorithm. • For an input point x_i, find the point x_j with the smallest distance: ||x_i − x_j||^2 = (x_i − x_j)^T (x_i − x_j) = x_i^T x_i − 2 x_i^T x_j + x_j^T x_j. • If we used a non-linear feature space φ(·): ||φ(x_i) − φ(x_j)||^2 = φ(x_i)^T φ(x_i) − 2 φ(x_i)^T φ(x_j) + φ(x_j)^T φ(x_j) = k(x_i, x_i) − 2 k(x_i, x_j) + k(x_j, x_j). • So nearest-neighbour can be done in a high-dimensional feature space without actually moving to it.
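A minimal sketch (names and data are illustrative, not from the slides) of nearest-neighbour classification done entirely through the identity above: feature-space distances are computed from kernel values only, so φ is never formed.

import numpy as np

def kernel_nn_predict(x, X_train, y_train, kernel):
    # squared feature-space distance k(x, x) - 2 k(x, z) + k(z, z) to every
    # training point; return the label of the nearest one
    d2 = np.array([kernel(x, x) - 2.0 * kernel(x, z) + kernel(z, z) for z in X_train])
    return y_train[np.argmin(d2)]

quadratic = lambda a, b: (1.0 + a @ b) ** 2

X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
y_train = np.array([0, 1, 1])
print(kernel_nn_predict(np.array([0.9, 1.1]), X_train, y_train, quadratic))  # -> 1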

  15. Example: The Quadratic Kernel Function • Consider again the kernel function k(x, z) = (1 + x^T z)^2. • With x, z ∈ R^2: k(x, z) = (1 + x_1 z_1 + x_2 z_2)^2 = 1 + 2 x_1 z_1 + 2 x_2 z_2 + x_1^2 z_1^2 + 2 x_1 z_1 x_2 z_2 + x_2^2 z_2^2 = (1, √2 x_1, √2 x_2, x_1^2, √2 x_1 x_2, x_2^2) · (1, √2 z_1, √2 z_2, z_1^2, √2 z_1 z_2, z_2^2) = φ(x)^T φ(z). • So this particular kernel function does correspond to a dot product in a feature space (it is a valid kernel). • Computing k(x, z) directly is faster than explicitly computing φ(x)^T φ(z). • In higher input dimensions and with larger exponents, the saving is much greater.
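A quick numerical check of the expansion above (a sketch, not from the slides), with φ(x) = (1, √2 x_1, √2 x_2, x_1^2, √2 x_1 x_2, x_2^2):

import numpy as np

def phi(x):
    # explicit feature vector for the quadratic kernel (1 + x^T z)^2 in 2-D
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

rng = np.random.default_rng(1)
x, z = rng.standard_normal(2), rng.standard_normal(2)

k_direct = (1.0 + x @ z) ** 2   # O(d) work, no feature vectors needed
k_feature = phi(x) @ phi(z)     # builds the 6-dimensional feature vectors first
print(np.isclose(k_direct, k_feature))  # True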
