Kernel Methods for regression and classification - Prof. Mike Hughes


  1. Tufts COMP 135: Introduction to Machine Learning
  https://www.cs.tufts.edu/comp/135/2019s/
  Kernel Methods for regression and classification
  Prof. Mike Hughes
  Many ideas/slides attributable to: Dan Sheldon (U.Mass.); James, Witten, Hastie, Tibshirani (ISL/ESL books)

  2. Objectives for Day 19: Kernels
  Big idea: use kernel functions (similarity functions with special properties) to obtain flexible high-dimensional feature transformations without explicit features.
  • From linear regression (LR) to kernelized LR
  • What is a kernel function?
    • Basic properties
    • Example: Polynomial kernel
    • Example: Squared Exponential kernel
  • Kernels for classification
    • Logistic Regression
    • SVMs

  3. Task: Regression & Classification
  Supervised Learning (contrast with Unsupervised Learning and Reinforcement Learning).
  [Figure: plot of y vs. x, where y is a numeric variable, e.g. sales in $$]

  4. Keys to Regression Success
  • Feature transformation + linear model
  • Penalized weights to avoid overfitting

  5. Can fit linear functions to nonlinear features
  A nonlinear function of x:
  $$\hat{y}(x_i) = \theta_0 + \theta_1 x_i + \theta_2 x_i^2 + \theta_3 x_i^3$$
  With features $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; x_i^3]$, this can be written as a linear function of the parameters:
  $$\hat{y}(x_i) = \sum_{g=1}^{4} \theta_g \, \phi_g(x_i) = \theta^T \phi(x_i)$$
  “Linear regression” means linear in the parameters (weights, biases). Features can be arbitrary transforms of raw data.
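To make the cubic example concrete, here is a minimal sketch (not from the slides; the toy data and library choices are my own) that builds the feature vector phi(x) = [1, x, x^2, x^3] explicitly and fits an ordinary linear regression on it:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy 1-D regression data (assumed, for illustration only).
rng = np.random.RandomState(0)
x = rng.uniform(-2, 2, size=100)
y = np.sin(x) + 0.1 * rng.randn(100)

# Explicit cubic feature transform: phi(x) = [1, x, x^2, x^3].
Phi = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Linear in the parameters theta, even though the fit is nonlinear in x.
model = LinearRegression(fit_intercept=False).fit(Phi, y)
print(model.coef_)  # estimates of theta_0 .. theta_3
```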

  6. What feature transform to use?
  • Anything that works for your data!
  • sin / cos for periodic data
  • Polynomials for high-order dependencies: $\phi(x_i) = [1 \;\; x_i \;\; x_i^2 \;\; \ldots]$
  • Interactions between feature dimensions: $\phi(x_i) = [1 \;\; x_{i1} x_{i2} \;\; x_{i3} x_{i4} \;\; \ldots]$
  • Many other choices possible

  7. Review: Linear Regression
  Prediction: a linear transform of the G-dimensional features
  $$\hat{y}(x_i, \theta) = \theta^T \phi(x_i) = \sum_{g=1}^{G} \theta_g \cdot \phi(x_i)_g$$
  Training: solve the optimization problem
  $$\min_{\theta} \; \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, \theta) \right)^2 \;\; (+\; \text{optional L2 penalty})$$
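For reference, a small sketch (my own, not the course code) of solving this training problem with the optional L2 penalty in closed form; `Phi` is the N x G feature matrix and `lam` the penalty strength, both assumed names:

```python
import numpy as np

def fit_ridge(Phi, y, lam=1.0):
    """Minimize sum_n (y_n - theta^T phi(x_n))^2 + lam * ||theta||^2 in closed form."""
    G = Phi.shape[1]
    # Normal equations, with the lam * I term coming from the L2 penalty.
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(G), Phi.T @ y)

def predict(Phi_new, theta):
    # Prediction is the linear transform theta^T phi(x_i) for each row.
    return Phi_new @ theta
```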

  8. Problems with high-dim features
  • Feature transformation + linear model
  How expensive is this transformation? (Runtime and storage)

  9. Thought Experiment
  Suppose that the optimal weight vector can be exactly constructed via a linear combination of the training-set feature vectors:
  $$\theta^* = \alpha_1 \phi(x_1) + \alpha_2 \phi(x_2) + \ldots + \alpha_N \phi(x_N)$$
  Each alpha is a scalar; each feature vector is a vector of size G.

  10. Justification? Is the optimal theta a linear combination of feature vectors?
  Stochastic gradient descent, with 1 example per batch, can be seen as creating a weight vector of this form:
  • Starting with the all-zero vector
  • In each step, adding a scalar weight times a feature vector
  Each update step:
  $$\theta^t \leftarrow \theta^{t-1} - \eta \cdot \frac{d}{d\theta} \, \mathrm{loss}\!\left( y_n, \theta^T \phi(x_n) \right)$$
  Let’s simplify this via the chain rule!

  11. Justification?
  Stochastic gradient descent, with 1 example per batch, can be seen as creating a weight vector of this form:
  • Starting with the all-zero vector
  • In each step, adding a scalar weight times a feature vector
  Each update step (after the chain rule):
  $$\theta^t \leftarrow \theta^{t-1} - \eta \cdot \frac{d}{da} \mathrm{loss}(y_n, a) \cdot \frac{d}{d\theta} \theta^T \phi(x_n)$$
  ($\eta$ and the loss derivative are scalars; the last factor is a vector of size G)

  12. Justification?
  Stochastic gradient descent, with 1 example per batch, can be seen as creating a weight vector of this form:
  • Starting with the all-zero vector
  • In each step, adding a scalar weight times a feature vector
  Each update step (simplified):
  $$\theta^t \leftarrow \theta^{t-1} - \eta \cdot \frac{d}{da} \mathrm{loss}(y_n, a) \cdot \phi(x_n)$$
  ($\eta$ and the loss derivative are scalars; $\phi(x_n)$ is a vector of size G)
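A short numerical illustration of this justification (my own sketch, assuming squared-error loss and the hypothetical names `Phi`, `y`, `eta`): run SGD from the zero vector, record the scalar multiplier applied to each phi(x_n), and check that theta equals the alpha-weighted sum of training feature vectors.

```python
import numpy as np

def sgd_as_linear_combo(Phi, y, eta=0.01, n_epochs=50):
    """SGD for squared-error loss, tracking theta = sum_n alpha_n * phi(x_n)."""
    N, G = Phi.shape
    theta = np.zeros(G)
    alpha = np.zeros(N)          # scalar weight accumulated per training example
    for _ in range(n_epochs):
        for n in range(N):
            a = theta @ Phi[n]                 # prediction theta^T phi(x_n)
            dloss_da = 2.0 * (a - y[n])        # d/da of (a - y_n)^2
            step = -eta * dloss_da             # scalar multiplier of phi(x_n) this step
            theta += step * Phi[n]
            alpha[n] += step
    # theta is exactly the alpha-weighted sum of feature vectors (up to float error).
    assert np.allclose(theta, alpha @ Phi)
    return theta, alpha
```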

  13. How to Predict in this thought experiment
  $$\theta^* = \alpha_1 \phi(x_1) + \alpha_2 \phi(x_2) + \ldots + \alpha_N \phi(x_N)$$
  Prediction: $\hat{y}(x_i, \theta) = \theta^T \phi(x_i)$, so
  $$\hat{y}(x_i, \theta^*) = \left( \sum_{n=1}^{N} \alpha_n \phi(x_n) \right)^{\!T} \phi(x_i)$$

  14. How to Predict in this thought experiment
  $$\theta^* = \alpha_1 \phi(x_1) + \alpha_2 \phi(x_2) + \ldots + \alpha_N \phi(x_N)$$
  Prediction:
  $$\hat{y}(x_i, \theta^*) = \sum_{n=1}^{N} \alpha_n \, \phi(x_n)^T \phi(x_i)$$
  Inner product of the test feature vector with each training feature vector!

  15. Kernel Function
  $$k(x_i, x_j) = \phi(x_i)^T \phi(x_j)$$
  Input: any two vectors $x_i$ and $x_j$
  Output: a real scalar
  Interpretation: a similarity function for $x_i$ and $x_j$
  Properties: larger output values mean i and j are more similar; symmetric

  16. Kernelized Linear Regression
  Prediction:
  $$\hat{y}(x_i, \alpha, \{x_n\}_{n=1}^{N}) = \sum_{n=1}^{N} \alpha_n \, k(x_n, x_i)$$
  Training:
  $$\min_{\alpha} \; \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, \alpha, X) \right)^2$$
  Can do all needed operations with only access to the kernel (no feature vectors).
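One concrete way to carry out this training step is the kernel ridge regression solution, which solves a small linear system in the dual weights alpha; the sketch below assumes that formulation (the slides leave the penalty optional), with `K` the training kernel matrix defined later and `lam` an assumed penalty strength:

```python
import numpy as np

def fit_kernel_regression(K, y, lam=1e-3):
    """Fit dual weights alpha for kernelized regression.

    K   : (N, N) kernel matrix, K[m, n] = k(x_m, x_n)
    lam : small ridge penalty; lam = 0 recovers the unpenalized problem
          when K is invertible (kernel ridge regression solution).
    """
    N = K.shape[0]
    return np.linalg.solve(K + lam * np.eye(N), y)

def predict_kernel_regression(K_test_train, alpha):
    """K_test_train : (M, N) matrix of k(x_test_m, x_train_n)."""
    return K_test_train @ alpha
```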

  17. Compare: Linear Regression
  Prediction: a linear transform of the G-dimensional features
  $$\hat{y}(x_i, \theta) = \theta^T \phi(x_i) = \sum_{g=1}^{G} \theta_g \cdot \phi(x_i)_g$$
  Training: solve the optimization problem
  $$\min_{\theta} \; \sum_{n=1}^{N} \left( y_n - \hat{y}(x_n, \theta) \right)^2 \;\; (+\; \text{optional L2 penalty})$$

  18. Why is the kernel trick a good idea?
  Before: the training problem optimizes a vector of size G; prediction cost scales linearly with G (num. high-dim features).
  After: the training problem optimizes a vector of size N; prediction cost scales linearly with N (num. train examples) and requires N evaluations of the kernel.
  So we get savings in runtime/storage if G is bigger than N AND we can compute k faster than the explicit inner product.

  19. Example: From Features to Kernels
  $$x = [x_1 \;\; x_2] \qquad z = [z_1 \;\; z_2]$$
  $$\phi(x) = [1 \;\; x_1^2 \;\; x_2^2 \;\; \sqrt{2}\, x_1 \;\; \sqrt{2}\, x_2 \;\; \sqrt{2}\, x_1 x_2]$$
  $$k(x, z) = (1 + x_1 z_1 + x_2 z_2)^2$$
  Compare: what is the relationship between $\phi(x)^T \phi(z)$ and $k(x, z)$?

  20. Example: From Features to Kernels
  With the same $\phi$ and $k$ as above:
  $$k(x, z) = \phi(x)^T \phi(z)$$
  Punchline: we can sometimes find faster ways to compute a high-dimensional inner product.
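A quick numeric check of this identity (my own sketch; the test points are arbitrary):

```python
import numpy as np

def phi(v):
    """Explicit degree-2 polynomial feature map for a 2-D input."""
    v1, v2 = v
    return np.array([1.0, v1**2, v2**2,
                     np.sqrt(2) * v1, np.sqrt(2) * v2, np.sqrt(2) * v1 * v2])

def k_poly2(x, z):
    """Degree-2 polynomial kernel: (1 + x . z)^2."""
    return (1.0 + x @ z) ** 2

x = np.array([0.5, -1.2])   # arbitrary test points (assumed)
z = np.array([2.0, 0.3])
print(phi(x) @ phi(z))      # explicit 6-dim inner product
print(k_poly2(x, z))        # same value, computed directly from x and z
```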

  21. Cost comparison
  With the same $\phi$ and $k$ as above, compare:
  • The number of add and multiply ops to compute $\phi(x)^T \phi(z)$
  • The number of add and multiply ops to compute $k(x, z)$

  22. Example: Kernel cheaper than inner product
  With the same $\phi$ and $k$ as above:
  • Computing $\phi(x)^T \phi(z)$ from the feature vectors: 6 multiplies and 5 adds
  • Computing $k(x, z) = (1 + x_1 z_1 + x_2 z_2)^2$ directly: 3 multiplies (including the square) and 2 adds

  23. Squared Exponential Kernel
  Assume x is a scalar:
  $$k(x, z) = e^{-(x - z)^2}$$
  [Figure: k(x, z) as a function of x, with its maximum at x = z]
  Also called the “radial basis function (RBF)” kernel.
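A tiny sketch (my own) that evaluates this kernel on a grid and confirms the similarity peaks at x = z:

```python
import numpy as np

def k_sqexp(x, z):
    """Squared exponential (RBF) kernel for scalar inputs."""
    return np.exp(-(x - z) ** 2)

z = 1.5                                  # arbitrary reference point (assumed)
xs = np.linspace(-3, 6, 901)
vals = k_sqexp(xs, z)
print(xs[np.argmax(vals)])               # ~1.5: similarity peaks at x = z
print(k_sqexp(z, z))                     # 1.0: the maximum possible value
```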

  24. Squared Exponential Kernel
  Assume x is a scalar:
  $$k(x, z) = e^{-(x - z)^2} = e^{-x^2 - z^2 + 2xz} = e^{-x^2} \, e^{-z^2} \, e^{2xz}$$

  25. Recall: Taylor series for e^x
  $$e^x = \sum_{k=0}^{\infty} \frac{1}{k!} x^k = 1 + x + \tfrac{1}{2} x^2 + \ldots$$
  $$e^{2xz} = \sum_{k=0}^{\infty} \frac{2^k}{k!} x^k z^k$$

  26. Squared Exponential Kernel
  $$k(x, z) = e^{-(x-z)^2} = e^{-x^2} e^{-z^2} \sum_{k=0}^{\infty} \frac{2^k}{k!} x^k z^k = \sum_{k=0}^{\infty} \left( \sqrt{\tfrac{2^k}{k!}} \, x^k e^{-x^2} \right) \left( \sqrt{\tfrac{2^k}{k!}} \, z^k e^{-z^2} \right) = \phi(x)^T \phi(z)$$
  Corresponds to an INFINITE-DIMENSIONAL feature vector:
  $$\phi(x) = \left[ \; \sqrt{\tfrac{2^0}{0!}} \, x^0 e^{-x^2} \quad \sqrt{\tfrac{2^1}{1!}} \, x^1 e^{-x^2} \quad \ldots \quad \sqrt{\tfrac{2^k}{k!}} \, x^k e^{-x^2} \quad \ldots \; \right]$$
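To see the infinite-dimensional feature map at work, here is a sketch (my own) that truncates the expansion at K terms and checks that the finite inner product approaches the exact kernel value as K grows; scalar inputs are assumed, as on the slide.

```python
import numpy as np
from math import factorial

def phi_truncated(x, K):
    """First K entries of the infinite feature map for the sq.-exp. kernel."""
    ks = np.arange(K)
    facts = np.array([factorial(k) for k in ks], dtype=float)
    return np.sqrt(2.0**ks / facts) * x**ks * np.exp(-x**2)

def k_sqexp(x, z):
    return np.exp(-(x - z) ** 2)

x, z = 0.7, -0.4                       # arbitrary scalar inputs (assumed)
for K in (2, 5, 10, 20):
    approx = phi_truncated(x, K) @ phi_truncated(z, K)
    print(K, approx, k_sqexp(x, z))    # approximation converges to the exact kernel
```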

  27. Kernelized Regression Demo

  28. Linear Regression

  29. Kernel Matrix for training set
  K: an N x N symmetric matrix
  $$K = \begin{bmatrix} k(x_1, x_1) & k(x_1, x_2) & \ldots & k(x_1, x_N) \\ k(x_2, x_1) & k(x_2, x_2) & \ldots & k(x_2, x_N) \\ \vdots & & & \vdots \\ k(x_N, x_1) & k(x_N, x_2) & \ldots & k(x_N, x_N) \end{bmatrix}$$
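A short sketch (my own) of building this matrix for an arbitrary kernel; `kernel_fn` is a hypothetical function of two input vectors:

```python
import numpy as np

def kernel_matrix(X, kernel_fn):
    """Build the N x N training kernel matrix K[m, n] = kernel_fn(X[m], X[n])."""
    N = X.shape[0]
    K = np.empty((N, N))
    for m in range(N):
        for n in range(m, N):
            K[m, n] = kernel_fn(X[m], X[n])
            K[n, m] = K[m, n]        # symmetry: k(x_m, x_n) = k(x_n, x_m)
    return K

# Example with the degree-2 polynomial kernel from earlier slides.
X = np.random.RandomState(0).randn(5, 2)
K = kernel_matrix(X, lambda a, b: (1.0 + a @ b) ** 2)
```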

  30. Linear Regression with Kernel
  100 training examples in x_train; 505 test examples in x_test

  31. Linear Regression with Kernel

  32. Polynomial Kernel, deg. 5

  33. Polynomial Kernel, deg. 12

  34. Gaussian kernel (aka sq. exp.)

  35. Kernel Regression in sklearn
  Demo will use kernel='precomputed'
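The demo itself is not reproduced here, but a minimal sketch of the kernel='precomputed' pattern in scikit-learn might look like the following; KernelRidge is one estimator that accepts a precomputed kernel, and the toy data, gamma, and regularization values are assumptions, not the demo's actual settings.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import rbf_kernel

# Toy 1-D data (assumed; the demo's actual data is not shown here).
rng = np.random.RandomState(0)
x_train = rng.uniform(-3, 3, size=(100, 1))
y_train = np.sin(x_train[:, 0]) + 0.1 * rng.randn(100)
x_test = np.linspace(-3, 3, 505).reshape(-1, 1)

# Precompute kernel matrices: (N, N) for training, (M, N) for test vs. train.
K_train = rbf_kernel(x_train, x_train, gamma=1.0)
K_test = rbf_kernel(x_test, x_train, gamma=1.0)

# Note: KernelRidge's `alpha` argument is the regularization strength,
# not the dual weight vector alpha from the slides.
model = KernelRidge(alpha=1e-3, kernel='precomputed')
model.fit(K_train, y_train)
y_pred = model.predict(K_test)
```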
