
Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 4



  1. Data Mining Techniques CS 6220 - Section 2 - Spring 2017 Lecture 4. Jan-Willem van de Meent (credit: Yijun Zhao, Arthur Gretton, Rasmussen & Williams, Percy Liang)

  2. Kernel Regression

  3. Basis Function Regression: linear regression $y = w^\top x$; basis function regression $y = w^\top \phi(x)$ (for $N$ samples, $y = \Phi w$ with design matrix $\Phi$); polynomial regression, where $\phi(x) = (1, x, x^2, \ldots, x^M)^\top$.
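A minimal sketch of basis function regression with a polynomial basis; the toy data, degree, and helper names are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

# Sketch (not from the slides): map scalar inputs to polynomial features
# phi(x) = (1, x, ..., x^M) and fit the weights by ordinary least squares.

def polynomial_features(x, M):
    """Return the N x (M+1) design matrix Phi for scalar inputs x."""
    return np.vander(x, M + 1, increasing=True)

def fit_least_squares(Phi, y):
    """Solve w = argmin_w ||Phi w - y||^2."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

# Toy data: a noisy sine, fit with degree M = 3 as in the figure on the next slide.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.shape)
Phi = polynomial_features(x, M=3)
w = fit_least_squares(Phi, y)
y_hat = Phi @ w   # predictions on the training inputs
```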

  4. Basis Function Regression. [Figure: polynomial fit with $M = 3$; target $t$ plotted against input $x$.]

  5. The Kernel Trick. Define a kernel function $k(x, x') := \langle \phi(x), \phi(x') \rangle$ such that $k$ can be cheaper to evaluate than $\phi$!
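A small sketch of the kernel trick for the quadratic kernel $k(x, x') = \langle x, x' \rangle^2$; the explicit feature map used for comparison is an assumption for illustration, not the lecture's example.

```python
import numpy as np
from itertools import product

# Sketch (not from the slides): for the quadratic kernel k(x, x') = <x, x'>^2,
# the explicit feature map contains every degree-2 monomial x_i * x_j, so the
# kernel can be evaluated in O(d) time while the feature map costs O(d^2).

def phi_quadratic(x):
    """Explicit feature map: all pairwise products x_i * x_j."""
    d = len(x)
    return np.array([x[i] * x[j] for i, j in product(range(d), repeat=2)])

x  = np.array([1.0, 2.0, 3.0])
xp = np.array([0.5, -1.0, 2.0])

via_features = phi_quadratic(x) @ phi_quadratic(xp)  # O(d^2) work
via_kernel   = (x @ xp) ** 2                          # O(d) work
assert np.isclose(via_features, via_kernel)
```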

  6. Kernel Ridge Regression. MAP / expected value for the weights (requires inversion of a $D \times D$ matrix): $A := \Phi^\top \Phi + \lambda I$, $\mathbb{E}[w \mid y] = A^{-1} \Phi^\top y$, where $\Phi := \Phi(X)$. Alternate representation (requires inversion of an $N \times N$ matrix): $A^{-1} \Phi^\top = \Phi^\top (K + \lambda I)^{-1}$ with $K := \Phi \Phi^\top$. Predictive posterior (using the kernel function): $\mathbb{E}[f(x_*) \mid y] = \phi(x_*)^\top \mathbb{E}[w \mid y] = \phi(x_*)^\top \Phi^\top (K + \lambda I)^{-1} y$.

  7. Kernel Ridge Regression (continued). As on the previous slide, with the predictive mean written element-wise in terms of the kernel: $\mathbb{E}[f(x_*) \mid y] = \phi(x_*)^\top \Phi^\top (K + \lambda I)^{-1} y = \sum_{n,m} k(x_*, x_n) \left[(K + \lambda I)^{-1}\right]_{nm} y_m$.
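A brief sketch of the $N \times N$ (kernelized) computation above; the Gaussian/RBF kernel and the values of $\lambda$ and $\gamma$ are illustrative assumptions, not the lecture's settings.

```python
import numpy as np

# Sketch of the "dual" form of kernel ridge regression: only the N x N kernel
# matrix K is needed, never the feature matrix Phi.

def rbf_kernel(A, B, gamma=10.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam, gamma=10.0):
    """Return alpha = (K + lam I)^{-1} y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, alpha, X_star, gamma=10.0):
    """E[f(x*) | y] = sum_n k(x*, x_n) * alpha_n."""
    return rbf_kernel(X_star, X_train, gamma) @ alpha

rng = np.random.default_rng(1)
X = np.linspace(-0.5, 1.5, 30)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(30)
alpha = kernel_ridge_fit(X, y, lam=0.1)
f_star = kernel_ridge_predict(X, alpha, np.array([[0.3]]))
```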

  8. Kernel Ridge Regression. Equivalent RKHS formulation with a closed-form solution: $f^* = \arg\min_{f \in \mathcal{H}} \sum_{i=1}^{n} \left(y_i - \langle f, \phi(x_i) \rangle_{\mathcal{H}}\right)^2 + \lambda \|f\|^2_{\mathcal{H}}$. [Figure: closed-form solutions for $\sigma = 0.6$ with $\lambda = 0.1$, $\lambda = 10$, and $\lambda = 10^{-7}$.]

  9. Gaussian Processes (a.k.a. Kernel Ridge Regression with Variance Estimates). [Figure: two panels of output $f(x)$ against input $x$.] Predictive distribution: $p(y_* \mid x_*, X, y) \sim \mathcal{N}\!\left(k(x_*, x)^\top [K + \sigma^2_{\mathrm{noise}} I]^{-1} y,\; k(x_*, x_*) + \sigma^2_{\mathrm{noise}} - k(x_*, x)^\top [K + \sigma^2_{\mathrm{noise}} I]^{-1} k(x_*, x)\right)$. Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/
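A short sketch of how the predictive mean and variance above can be computed; the squared-exponential kernel and the lengthscale/noise values are assumptions for illustration, not the settings used in the lecture figures.

```python
import numpy as np

# Sketch of the Gaussian process predictive distribution from the slide.

def se_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel matrix between rows of A and rows of B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * sq / lengthscale**2)

def gp_predict(X, y, X_star, sigma_noise=0.1, lengthscale=1.0):
    """Posterior mean and marginal variance at the test inputs X_star."""
    K      = se_kernel(X, X, lengthscale)
    K_star = se_kernel(X_star, X, lengthscale)          # rows are k(x*, x)^T
    A      = K + sigma_noise**2 * np.eye(len(y))
    mean   = K_star @ np.linalg.solve(A, y)
    cov    = (se_kernel(X_star, X_star, lengthscale)
              + sigma_noise**2 * np.eye(len(X_star))
              - K_star @ np.linalg.solve(A, K_star.T))
    return mean, np.diag(cov)

rng = np.random.default_rng(2)
X = rng.uniform(-5, 5, (20, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(20)
mean, var = gp_predict(X, y, np.linspace(-5, 5, 100)[:, None])
```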

  10. Choosing Kernel Hyperparameters (Characteristic Lengthscales). [Figure: the mean posterior predictive function for three different lengthscales, labelled "too long", "about right", and "too short"; function value $y$ against input $x$.] The kernel is $k(x, x') = v^2 \exp\!\left(-\frac{(x - x')^2}{2\ell^2}\right) + \sigma^2_{\mathrm{noise}}\,\delta_{xx'}$. Adapted from: Carl Rasmussen, Probabilistic Machine Learning 4f13, http://mlg.eng.cam.ac.uk/teaching/4f13/1617/

  11. Intermezzo: Kernels. Borrowing from: Arthur Gretton (Gatsby, UCL)

  12. Hilbert Spaces. Definition (Inner product). Let $\mathcal{H}$ be a vector space over $\mathbb{R}$. A function $\langle \cdot, \cdot \rangle_{\mathcal{H}} : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ is an inner product on $\mathcal{H}$ if (1) linear: $\langle \alpha_1 f_1 + \alpha_2 f_2, g \rangle_{\mathcal{H}} = \alpha_1 \langle f_1, g \rangle_{\mathcal{H}} + \alpha_2 \langle f_2, g \rangle_{\mathcal{H}}$; (2) symmetric: $\langle f, g \rangle_{\mathcal{H}} = \langle g, f \rangle_{\mathcal{H}}$; (3) $\langle f, f \rangle_{\mathcal{H}} \geq 0$, and $\langle f, f \rangle_{\mathcal{H}} = 0$ if and only if $f = 0$. Norm induced by the inner product: $\|f\|_{\mathcal{H}} := \sqrt{\langle f, f \rangle_{\mathcal{H}}}$.
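A standard example (added here, not on the slide): on $\mathcal{H} = \mathbb{R}^n$, the Euclidean dot product $\langle f, g \rangle := \sum_{i=1}^{n} f_i g_i$ satisfies all three properties, and the induced norm is the usual Euclidean norm $\|f\| = \left(\sum_{i=1}^{n} f_i^2\right)^{1/2}$.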

  13. Example: Fourier Bases

  14. Example: Fourier Bases

  15. Example: Fourier Bases

  16. Example: Fourier Bases Fourier modes define a vector space

  17. Kernels. Definition. Let $\mathcal{X}$ be a non-empty set. A function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a kernel if there exists an $\mathbb{R}$-Hilbert space $\mathcal{H}$ and a map $\phi : \mathcal{X} \to \mathcal{H}$ such that $\forall x, x' \in \mathcal{X}$, $k(x, x') := \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$. Almost no conditions on $\mathcal{X}$ (e.g., $\mathcal{X}$ itself doesn't need an inner product; e.g., documents). A single kernel can correspond to several possible feature maps. A trivial example for $\mathcal{X} := \mathbb{R}$: $\phi_1(x) = x$ and $\phi_2(x) = \left(x/\sqrt{2},\; x/\sqrt{2}\right)^\top$.
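Worked check (added, not on the slide): both feature maps yield the same kernel, since $\langle \phi_2(x), \phi_2(x') \rangle = \frac{x x'}{2} + \frac{x x'}{2} = x x' = \langle \phi_1(x), \phi_1(x') \rangle$, illustrating that a single kernel can have several feature maps.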

  18. Sums, Transformations, Products. Theorem (Sums of kernels are kernels). Given $\alpha > 0$ and $k$, $k_1$, and $k_2$ all kernels on $\mathcal{X}$, then $\alpha k$ and $k_1 + k_2$ are kernels on $\mathcal{X}$. (Proof via positive definiteness: later!) A difference of kernels may not be a kernel (why?). Theorem (Mappings between spaces). Let $\mathcal{X}$ and $\tilde{\mathcal{X}}$ be sets, and define a map $A : \mathcal{X} \to \tilde{\mathcal{X}}$. Define the kernel $k$ on $\tilde{\mathcal{X}}$. Then $k(A(x), A(x'))$ is a kernel on $\mathcal{X}$. Example: $k(x, x') = x^2 (x')^2$. Theorem (Products of kernels are kernels). Given $k_1$ on $\mathcal{X}_1$ and $k_2$ on $\mathcal{X}_2$, then $k_1 \times k_2$ is a kernel on $\mathcal{X}_1 \times \mathcal{X}_2$. If $\mathcal{X}_1 = \mathcal{X}_2 = \mathcal{X}$, then $k := k_1 \times k_2$ is a kernel on $\mathcal{X}$. Proof: main idea only!

  19. Polynomial Kernels. Theorem (Polynomial kernels). Let $x, x' \in \mathbb{R}^d$ for $d \geq 1$, let $m \geq 1$ be an integer and $c \geq 0$ a non-negative real. Then $k(x, x') := \left(\langle x, x' \rangle + c\right)^m$ is a valid kernel. To prove: expand into a sum (with non-negative scalars) of kernels $\langle x, x' \rangle$ raised to integer powers. These individual terms are valid kernels by the product rule.
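As a worked instance of the proof idea (added here, taking $m = 2$): $\left(\langle x, x' \rangle + c\right)^2 = \langle x, x' \rangle^2 + 2c\, \langle x, x' \rangle + c^2$, a sum with non-negative coefficients of integer powers of the linear kernel $\langle x, x' \rangle$ plus a constant, each of which is a kernel by the product and sum rules.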

  20. Infinite Sequences. Definition. The space $\ell^2$ (square-summable sequences) comprises all sequences $a := (a_i)_{i \geq 1}$ for which $\|a\|^2_{\ell^2} = \sum_{i=1}^{\infty} a_i^2 < \infty$. Definition. Given a sequence of functions $(\phi_i(x))_{i \geq 1}$ in $\ell^2$, where $\phi_i : \mathcal{X} \to \mathbb{R}$ is the $i$-th coordinate of $\phi(x)$, then $k(x, x') := \sum_{i=1}^{\infty} \phi_i(x)\, \phi_i(x')$ defines a kernel.

  21. Infinite Sequences. Why square-summable? By Cauchy-Schwarz, $\left| \sum_{i=1}^{\infty} \phi_i(x)\, \phi_i(x') \right| \leq \|\phi(x)\|_{\ell^2}\, \|\phi(x')\|_{\ell^2}$, so the sum defining the inner product converges for all $x, x' \in \mathcal{X}$.

  22. Taylor Series Kernels. Definition (Taylor series kernel). For $r \in (0, \infty]$, with $a_n \geq 0$ for all $n \geq 0$, let $f(z) = \sum_{n=0}^{\infty} a_n z^n$ for $|z| < r$, $z \in \mathbb{R}$. Define $\mathcal{X}$ to be the $\sqrt{r}$-ball in $\mathbb{R}^d$, so $\|x\| < \sqrt{r}$. Then $k(x, x') = f\left(\langle x, x' \rangle\right) = \sum_{n=0}^{\infty} a_n \langle x, x' \rangle^n$ is a kernel. Example (Exponential kernel): $k(x, x') := \exp\left(\langle x, x' \rangle\right)$.
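Worked check for the example (added, not on the slide): $\exp\left(\langle x, x' \rangle\right) = \sum_{n=0}^{\infty} \frac{1}{n!} \langle x, x' \rangle^n$, so $a_n = 1/n! \geq 0$ and the radius of convergence is $r = \infty$, matching the definition above.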

  23. Gaussian Kernel (also known as the Radial Basis Function (RBF) kernel). Example (Gaussian kernel). The Gaussian kernel on $\mathbb{R}^d$ is defined as $k(x, x') := \exp\left(-\gamma \|x - x'\|^2\right)$. Proof: an exercise! Use the product rule, mapping rule, and exponential kernel.
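One possible sketch of the exercise (added here): write $\exp\left(-\gamma \|x - x'\|^2\right) = \exp\left(-\gamma \|x\|^2\right) \exp\left(-\gamma \|x'\|^2\right) \exp\left(2\gamma \langle x, x' \rangle\right)$. The last factor is the exponential (Taylor series) kernel applied to scaled inputs (mapping rule), the first two factors have the form $f(x)\,f(x')$, which is a kernel with the one-dimensional feature map $\phi(x) = f(x)$, and the product rule then gives the result.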

  24. Gaussian Kernel (also known as the Radial Basis Function (RBF) kernel). As on the previous slide: $k(x, x') := \exp\left(-\gamma \|x - x'\|^2\right)$; the proof is an exercise using the product rule, mapping rule, and exponential kernel. Variants: Squared Exponential (SE) and Automatic Relevance Determination (ARD).

  25. Products of Kernels. Base kernels: Squared-exp (SE): $k(x, x') = \sigma_f^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right)$; Periodic (Per): $k(x, x') = \sigma_f^2 \exp\left(-\frac{2 \sin^2\left(\pi (x - x')/p\right)}{\ell^2}\right)$; Linear (Lin): $k(x, x') = \sigma_f^2 (x - c)(x' - c)$. [Figure: plots of the base kernels and of the products Lin × Lin, SE × Per, Lin × SE, Lin × Per, as functions of $x - x'$ (or of $x$ with $x' = 1$).] Source: David Duvenaud (PhD thesis)
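A short sketch of the base kernels above and of one composite kernel built by pointwise multiplication; the parameter values and function names are illustrative assumptions.

```python
import numpy as np

# Sketch of the base kernels from the slide and a product kernel.

def se(x, xp, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel."""
    return sigma_f**2 * np.exp(-(x - xp)**2 / (2 * ell**2))

def per(x, xp, sigma_f=1.0, ell=1.0, p=1.0):
    """Periodic kernel."""
    return sigma_f**2 * np.exp(-2 * np.sin(np.pi * np.abs(x - xp) / p)**2 / ell**2)

def lin(x, xp, sigma_f=1.0, c=0.0):
    """Linear kernel."""
    return sigma_f**2 * (x - c) * (xp - c)

def lin_times_per(x, xp):
    """A product of kernels is again a kernel (e.g. Lin x Per)."""
    return lin(x, xp) * per(x, xp)
```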

  26. Positive Definiteness. Definition (Positive definite functions). A symmetric function $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is positive definite if $\forall n \geq 1$, $\forall (a_1, \ldots, a_n) \in \mathbb{R}^n$, $\forall (x_1, \ldots, x_n) \in \mathcal{X}^n$, $\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) \geq 0$. The function $k(\cdot, \cdot)$ is strictly positive definite if, for mutually distinct $x_i$, equality holds only when all the $a_i$ are zero.
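A quick numerical illustration (added, not from the slides): on any finite set of points, the Gram matrix $K_{ij} = k(x_i, x_j)$ of a valid kernel should have no eigenvalue below zero, up to floating-point round-off. The Gaussian kernel and sample size here are arbitrary choices.

```python
import numpy as np

# Sanity check of positive definiteness on a random finite sample.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
K = np.exp(-0.5 * sq)                    # Gaussian kernel Gram matrix
eigenvalues = np.linalg.eigvalsh(K)      # K is symmetric, so use eigvalsh
assert eigenvalues.min() > -1e-10        # non-negative up to round-off
```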

  27. Mercer's Theorem. Theorem. Let $\mathcal{H}$ be a Hilbert space, $\mathcal{X}$ a non-empty set, and $\phi : \mathcal{X} \to \mathcal{H}$. Then $\langle \phi(x), \phi(y) \rangle_{\mathcal{H}} =: k(x, y)$ is positive definite. Proof: $\sum_{i=1}^{n} \sum_{j=1}^{n} a_i a_j k(x_i, x_j) = \left\langle \sum_{i=1}^{n} a_i \phi(x_i), \sum_{j=1}^{n} a_j \phi(x_j) \right\rangle_{\mathcal{H}} = \left\| \sum_{i=1}^{n} a_i \phi(x_i) \right\|^2_{\mathcal{H}} \geq 0$. The reverse also holds: a positive definite $k(x, x')$ is an inner product in a unique $\mathcal{H}$ (Moore-Aronszajn: coming later!).

  28. DIMENSIONALITY REDUCTION. Borrowing from: Percy Liang (Stanford)

  29. Linear Dimensionality Reduction. Idea: project a high-dimensional vector onto a lower-dimensional space, e.g. $x \in \mathbb{R}^{361}$, $z = U^\top x$, $z \in \mathbb{R}^{10}$.

  30. Problem Setup. Given $n$ data points in $d$ dimensions: $x_1, \ldots, x_n \in \mathbb{R}^d$, $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$ (the transpose of the $X$ used in regression!).

  31. Problem Setup. Given $n$ data points in $d$ dimensions: $x_1, \ldots, x_n \in \mathbb{R}^d$, $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Want to reduce dimensionality from $d$ to $k$: choose $k$ directions $u_1, \ldots, u_k$, $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$.

  32. Problem Setup. Given $n$ data points in $d$ dimensions: $x_1, \ldots, x_n \in \mathbb{R}^d$, $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Want to reduce dimensionality from $d$ to $k$: choose $k$ directions $u_1, \ldots, u_k$, $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$. For each $u_j$, compute the "similarity" $z_j = u_j^\top x$.

  33. Problem Setup. Given $n$ data points in $d$ dimensions: $x_1, \ldots, x_n \in \mathbb{R}^d$, $X = (x_1 \cdots x_n) \in \mathbb{R}^{d \times n}$. Want to reduce dimensionality from $d$ to $k$: choose $k$ directions $u_1, \ldots, u_k$, $U = (u_1 \cdots u_k) \in \mathbb{R}^{d \times k}$. For each $u_j$, compute the "similarity" $z_j = u_j^\top x$. Project $x$ down to $z = (z_1, \ldots, z_k)^\top = U^\top x$. How to choose $U$?

  34. Principal Component Analysis. E.g. $x \in \mathbb{R}^{361}$, $z = U^\top x$, $z \in \mathbb{R}^{10}$. Optimize two equivalent objectives: 1. minimize the reconstruction error; 2. maximize the projected variance.

  35. PCA Objective 1: Reconstruction Error. $U$ serves two functions: Encode: $z = U^\top x$, $z_j = u_j^\top x$. Decode: $\tilde{x} = U z = \sum_{j=1}^{k} z_j u_j$.
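A compact sketch of PCA in the notation of these slides (columns of $X$ are data points, $U$ holds the top-$k$ directions); the data and dimensions are illustrative assumptions.

```python
import numpy as np

# Sketch: principal directions from the SVD of the centered data matrix,
# then encode (z = U^T x), decode (x ~ U z + mean), and the reconstruction error.

rng = np.random.default_rng(0)
d, n, k = 361, 100, 10
X = rng.standard_normal((d, n))                  # n points of dimension d, as columns

mu = X.mean(axis=1, keepdims=True)               # center the data
U_full, S, _ = np.linalg.svd(X - mu, full_matrices=False)
U = U_full[:, :k]                                # top-k directions u_1, ..., u_k

Z = U.T @ (X - mu)                               # encode: z = U^T x
X_hat = mu + U @ Z                               # decode: reconstruction from z
reconstruction_error = np.sum((X - X_hat) ** 2)  # objective 1 on the slide
```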
