COMS 4721: Machine Learning for Data Science, Lecture 10 (2/21/2017)


  1. COMS 4721: Machine Learning for Data Science, Lecture 10, 2/21/2017. Prof. John Paisley, Department of Electrical Engineering & Data Science Institute, Columbia University.

  2. FEATURE EXPANSIONS

  3. FEATURE EXPANSIONS
     Feature expansions (also called basis expansions) are names given to a technique we've already discussed and made use of.
     Problem: A linear model on the original feature space x ∈ R^d doesn't work.
     Solution: Map the features to a higher-dimensional space φ(x) ∈ R^D, where D > d, and do linear modeling there.
     Examples
     ◮ For polynomial regression on R, we let φ(x) = (x, x^2, ..., x^p).
     ◮ For jump discontinuities, φ(x) = (x, 1{x < a}).
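A minimal sketch of these two expansions in Python; the degree p and threshold a below are illustrative defaults, not values from the lecture.

```python
import numpy as np

def poly_map(x, p=3):
    """phi(x) = (x, x^2, ..., x^p) for scalar x; p is an assumed default."""
    return np.array([x ** k for k in range(1, p + 1)])

def jump_map(x, a=0.0):
    """phi(x) = (x, 1{x < a}) for scalar x; a is an assumed default."""
    return np.array([x, float(x < a)])

print(poly_map(2.0), jump_map(-1.5))
```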

  4. MAPPING EXAMPLE FOR REGRESSION
     [Figure: (a) data for linear regression; (b) the same data mapped to a higher dimension. Axes: x, cos(x), y.]
     High-dimensional maps can transform the data so output is linear in inputs.
     Left: Original x ∈ R and response y. Right: x mapped to R^2 using φ(x) = (x, cos x)^T.

  5. MAPPING EXAMPLE FOR REGRESSION
     Using the mapping φ(x) = (x, cos x)^T, learn the linear regression model
     y ≈ w_0 + φ(x)^T w = w_0 + w_1 x + w_2 cos x.
     [Figure: axes x, cos(x), and y.]
     Left: Learn (w_0, w_1, w_2) to approximate the data on the left with a plane. Right: For each point x, map to φ(x) and predict y. Plot as a function of x.
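A minimal sketch of fitting this model by least squares; the toy data below is an assumption standing in for the plotted data.

```python
import numpy as np

# Fit the plane w0 + w1*x + w2*cos(x) to toy data (an assumption).
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 0.5 * x + 2.0 * np.cos(x) + 0.1 * rng.normal(size=100)

Phi = np.column_stack([np.ones_like(x), x, np.cos(x)])   # (1, x, cos x)
w = np.linalg.lstsq(Phi, y, rcond=None)[0]               # (w0, w1, w2)
y_hat = Phi @ w                                          # predictions as a function of x
```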

  6. MAPPING EXAMPLE FOR CLASSIFICATION
     [Figure: (e) data for binary classification; (f) the same data mapped to a higher dimension. Axes: x_1, x_2 and x_1^2, x_1 x_2, x_2^2.]
     High-dimensional maps can transform data so it becomes linearly separable.
     Left: Original data in R^2. Right: Data mapped to R^3 using φ(x) = (x_1^2, x_1 x_2, x_2^2)^T.

  7. MAPPING EXAMPLE FOR CLASSIFICATION
     Using the mapping φ(x) = (x_1^2, x_1 x_2, x_2^2)^T, learn a linear classifier
     y = sign(w_0 + φ(x)^T w) = sign(w_0 + w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2).
     [Figure: axes x_1, x_2 and the mapped coordinates x_1^2, x_1 x_2, x_2^2.]
     Left: Learn (w_0, w_1, w_2, w_3) to linearly separate the classes with a hyperplane. Right: For each point x, map to φ(x) and classify. Color the decision regions in R^2.
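A minimal sketch of the mapped classifier; the weights (w_0, w) are assumed to come from any linear classifier trained on the mapped data.

```python
import numpy as np

def phi(x):
    """The quadratic map from the slide: (x1^2, x1*x2, x2^2)."""
    return np.array([x[0] ** 2, x[0] * x[1], x[1] ** 2])

def classify(x, w0, w):
    """Linear rule in the mapped space = quadratic boundary in R^2."""
    return np.sign(w0 + phi(x) @ w)

# Example with assumed weights.
print(classify(np.array([1.0, -2.0]), w0=-1.0, w=np.array([1.0, 0.0, 1.0])))
```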

  8. FEATURE EXPANSIONS AND DOT PRODUCTS
     What expansion should I use? This is not obvious. The illustrations required knowledge about the data that we likely won't have (especially if it's in high dimensions).
     One approach is to use the "kitchen sink": If you can think of it, then use it. Select the useful features with an ℓ1 penalty,
     w_ℓ1 = arg min_w Σ_{i=1}^n f(y_i, φ(x_i), w) + λ ‖w‖_1.
     We know that this will find a sparse subset of the dimensions of φ(x) to use.
     Often we only need to work with dot products φ(x_i)^T φ(x_j) ≡ K(x_i, x_j). This is called a kernel and can produce some interesting results.
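A hedged sketch of the "kitchen sink" idea for the squared-error case of f, using scikit-learn's Lasso (an assumed dependency); the candidate features are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Stack many candidate features and let the l1 penalty pick a sparse subset.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

Phi = np.column_stack([x, x**2, x**3, np.cos(x), np.sin(x), (x < 0).astype(float)])
model = Lasso(alpha=0.05).fit(Phi, y)   # lambda corresponds to alpha (up to scaling)
print(model.coef_)                      # most entries end up (near) zero
```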

  9. KERNELS

  10. PERCEPTRON (SOME MOTIVATION)
     Perceptron classifier: Let x_i ∈ R^{d+1} and y_i ∈ {-1, +1} for i = 1, ..., n observations. We saw that the Perceptron constructs the hyperplane from data,
     w = Σ_{i∈M} y_i x_i   (assume η = 1 and M has no duplicates),
     where M is the sequentially constructed set of misclassified examples.
     Predicting new data: We also discussed how we can predict the label y_0 for a new observation x_0:
     y_0 = sign(x_0^T w) = sign(Σ_{i∈M} y_i x_0^T x_i).
     We've taken feature expansions for granted, but we can explicitly write it as
     y_0 = sign(φ(x_0)^T w) = sign(Σ_{i∈M} y_i φ(x_0)^T φ(x_i)).
     We can represent the decision using dot products between data points.
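A minimal sketch of the dot-product form of this prediction, assuming X, y, and the mistake index set M come from an already-trained Perceptron.

```python
import numpy as np

def perceptron_predict(x0, X, y, M):
    # w = sum_{i in M} y_i x_i, so x0^T w = sum_{i in M} y_i x0^T x_i
    return np.sign(sum(y[i] * (x0 @ X[i]) for i in M))
```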

  11. KERNELS
     Kernel definition: A kernel K(·, ·): R^d × R^d → R is a symmetric function defined as follows:
     Definition: If for any n points x_1, ..., x_n ∈ R^d, the n × n matrix K, where K_ij = K(x_i, x_j), is positive semidefinite, then K(·, ·) is a "kernel."
     Intuitively, this means K satisfies the properties of a covariance matrix.
     Mercer's theorem: If the function K(·, ·) satisfies the above properties, then there exists a mapping φ: R^d → R^D (D can equal ∞) such that K(x_i, x_j) = φ(x_i)^T φ(x_j).
     If we first define φ(·) and then K, this is obvious. However, sometimes we first define K(·, ·) and avoid ever using φ(·).
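A small numerical check of the definition, under an assumed explicit map φ: building K as Φ Φ^T guarantees positive semidefiniteness, which the eigenvalue check confirms.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
Phi = np.column_stack([X, X ** 2])        # an assumed feature expansion phi(x)
K = Phi @ Phi.T                           # K_ij = phi(x_i)^T phi(x_j)
print(np.linalg.eigvalsh(K).min() >= -1e-10)   # PSD up to floating-point error
```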

  12. GAUSSIAN KERNEL (RADIAL BASIS FUNCTION)
     The most popular kernel is the Gaussian kernel, also called the radial basis function (RBF),
     K(x, x') = a exp(-(1/b) ‖x - x'‖^2).
     ◮ This is a good, general-purpose kernel that usually works well.
     ◮ It takes into account proximity in R^d. Things close together in space have larger value (as defined by the kernel width b).
     In this case, the mapping φ(x) that produces the RBF kernel is infinite-dimensional (it's a continuous function instead of a vector). Therefore
     K(x, x') = ∫ φ_t(x) φ_t(x') dt.
     ◮ φ_t(x) can be thought of as a function of t with parameter x that also has a Gaussian form.
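A sketch of this kernel as a matrix computation; a and b are the amplitude and width from the slide, with arbitrary default values.

```python
import numpy as np

def rbf_kernel(X1, X2, a=1.0, b=1.0):
    """K_ij = a * exp(-||x_i - x_j'||^2 / b) for rows of X1 and X2."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return a * np.exp(-sq / b)

X = np.random.default_rng(0).normal(size=(5, 2))
print(rbf_kernel(X, X))   # 5 x 5, with ones on the diagonal when a = 1
```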

  13. KERNELS
     Another kernel
     Map: φ(x) = (1, √2 x_1, ..., √2 x_d, x_1^2, ..., x_d^2, ..., √2 x_i x_j, ...)
     Kernel: φ(x)^T φ(x') = K(x, x') = (1 + x^T x')^2
     In fact, we can show K(x, x') = (1 + x^T x')^b, for b > 0, is a kernel as well.
     Kernel arithmetic: Certain functions of kernels can produce new kernels. Let K_1 and K_2 be any two kernels; then constructing K in the following ways produces a new kernel (among many other ways):
     K(x, x') = K_1(x, x') K_2(x, x')
     K(x, x') = K_1(x, x') + K_2(x, x')
     K(x, x') = exp{K_1(x, x')}
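A quick numerical check, on random vectors, that the explicit degree-2 map above reproduces (1 + x^T x')^2; the dimension and seed are arbitrary choices.

```python
import numpy as np

def phi2(x):
    """Explicit map: (1, sqrt(2) x_i, x_i^2, sqrt(2) x_i x_j for i < j)."""
    d = len(x)
    cross = [np.sqrt(2) * x[i] * x[j] for i in range(d) for j in range(i + 1, d)]
    return np.concatenate(([1.0], np.sqrt(2) * x, x ** 2, cross))

rng = np.random.default_rng(1)
x, xp = rng.normal(size=5), rng.normal(size=5)
print(phi2(x) @ phi2(xp), (1 + x @ xp) ** 2)   # the two values agree
```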

  14. KERNELIZED PERCEPTRON
     Returning to the Perceptron: We write the feature-expanded decision as
     y_0 = sign(Σ_{i∈M} y_i φ(x_0)^T φ(x_i)) = sign(Σ_{i∈M} y_i K(x_0, x_i)).
     We can pick the kernel we want to use. Let's pick the RBF (set a = 1). Then
     y_0 = sign(Σ_{i∈M} y_i e^{-(1/b) ‖x_0 - x_i‖^2}).
     Notice that we never actually need to calculate φ(x). What is this doing?
     ◮ Notice 0 < K(x_0, x_i) ≤ 1, with bigger values when x_0 is closer to x_i.
     ◮ This is like a "soft voting" among the data picked by the Perceptron.
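A minimal sketch of this decision rule with the RBF kernel (a = 1); X, y, and the mistake set M are assumed to come from training, and b is an assumed width.

```python
import numpy as np

def rbf(x, xp, b=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / b)

def kernel_perceptron_predict(x0, X, y, M, b=1.0):
    # Soft vote over the stored mistakes; phi(x) is never computed.
    return np.sign(sum(y[i] * rbf(x0, X[i], b) for i in M))
```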

  15. KERNELIZED PERCEPTRON
     Learning the kernelized Perceptron
     Recall: Given a current vector w^(t) = Σ_{i∈M_t} y_i x_i, we update it as follows:
     1. Find a new x' such that y' ≠ sign(x'^T w^(t)).
     2. Add the index of x' to M and set w^(t+1) = Σ_{i∈M_{t+1}} y_i x_i.
     Again we only need dot products, meaning these steps are equivalent to
     1. Find a new x' such that y' ≠ sign(Σ_{i∈M_t} y_i K(x', x_i)).
     2. Add the index of x' to M, but don't bother calculating w^(t+1).
     The trick is to realize that we never need to work with φ(x).
     ◮ We don't need φ(x) to do Step 1 above.
     ◮ We don't need φ(x) to classify new data (previous slide).
     ◮ We only ever need to calculate K(x, x') between two points.
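A sketch of the kernelized update loop under these assumptions (η = 1, labels in {-1, +1}); the pass count is an arbitrary choice, and any kernel function, such as the RBF sketch above, can be plugged in.

```python
import numpy as np

def train_kernel_perceptron(X, y, kernel, passes=10):
    M = []                                     # indices of stored mistakes
    for _ in range(passes):
        for j in range(len(y)):
            score = sum(y[i] * kernel(X[j], X[i]) for i in M)
            if np.sign(score) != y[j]:         # sign(0) = 0 also counts as a mistake
                M.append(j)
    return M                                   # only M is needed to predict
```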

  16. KERNEL k-NN
     An extension: We can generalize the kernelized Perceptron to soft k-NN with a simple change. Instead of summing over the misclassified data M, sum over all the data:
     y_0 = sign(Σ_{i=1}^n y_i e^{-(1/b) ‖x_0 - x_i‖^2}).
     Next, notice the decision doesn't change if we divide by a positive constant.
     Let: Z = Σ_{j=1}^n e^{-(1/b) ‖x_0 - x_j‖^2}
     Construct: the vector p(x_0), where p_i(x_0) = (1/Z) e^{-(1/b) ‖x_0 - x_i‖^2}
     Declare: y_0 = sign(Σ_{i=1}^n y_i p_i(x_0))
     ◮ We let all the data vote for the label based on a "confidence score" p(x_0).
     ◮ Set b so that most p_i(x_0) ≈ 0, to focus only on the neighborhood around x_0.
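A sketch of the soft-voting rule, with the kernel width b as an assumed parameter.

```python
import numpy as np

def soft_knn_predict(x0, X, y, b=1.0):
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / b)   # unnormalized RBF weights
    p = w / w.sum()                                  # confidence scores p_i(x0)
    return np.sign(p @ y)                            # weighted vote over all data
```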

  17. KERNEL REGRESSION
     Nadaraya-Watson model: The developments are almost limitless. Here's a regression example almost identical to the kernelized k-NN:
     Before: y ∈ {-1, +1}.  Now: y ∈ R.
     Using the RBF kernel, for a new (x_0, y_0) predict
     y_0 = Σ_{i=1}^n y_i K(x_0, x_i) / Σ_{j=1}^n K(x_0, x_j).
     What is this doing? We're taking a locally weighted average of all y_i for which x_i is close to x_0 (as decided by the kernel width).
     Gaussian processes are another option ...
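A sketch of the Nadaraya-Watson prediction with the RBF kernel; b is again an assumed width.

```python
import numpy as np

def nadaraya_watson_predict(x0, X, y, b=1.0):
    k = np.exp(-np.sum((X - x0) ** 2, axis=1) / b)   # K(x0, x_i) for all i
    return k @ y / k.sum()                           # locally weighted average of y_i
```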

  18. GAUSSIAN PROCESSES

  19. KERNELIZED BAYESIAN LINEAR REGRESSION
     Regression setup: For n observations, with response vector y ∈ R^n and their feature matrix X, we define the likelihood and prior
     y ~ N(Xw, σ^2 I),   w ~ N(0, λ^{-1} I).
     Marginalizing: What if we integrate out w? We can solve this,
     p(y | X) = ∫ p(y | X, w) p(w) dw = N(0, σ^2 I + λ^{-1} X X^T).
     Kernelization: Notice that (X X^T)_ij = x_i^T x_j. Replace each x with φ(x), after which we can say [φ(X) φ(X)^T]_ij = K(x_i, x_j). We can define K directly, so
     p(y | X) = ∫ p(y | X, w) p(w) dw = N(0, σ^2 I + λ^{-1} K).
     This is called a Gaussian process. We never use w or φ(x), but just K(x_i, x_j).
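A sketch of evaluating this kernelized marginal, log p(y | X) = log N(y; 0, σ^2 I + λ^{-1} K), directly from a kernel matrix; the values of σ^2 and λ are assumptions.

```python
import numpy as np

def log_marginal_likelihood(y, K, sigma2=0.1, lam=1.0):
    """Log density of y under N(0, sigma^2 I + K / lam)."""
    C = sigma2 * np.eye(len(y)) + K / lam
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (quad + logdet + len(y) * np.log(2 * np.pi))
```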

  20. GAUSSIAN PROCESSES
     Definition
     • Let f(x) ∈ R and x ∈ R^d.
     • Define the kernel K(x, x') between two points x and x'.
     • Then f(x) is a Gaussian process and y(x) the noise-added process if, for n observed pairs (x_1, y_1), ..., (x_n, y_n), where x ∈ X and y ∈ R,
       f ~ N(0, K),   y | f ~ N(f, σ^2 I)   ⟺   y ~ N(0, σ^2 I + K),
       where y = (y_1, ..., y_n)^T and K is n × n with K_ij = K(x_i, x_j).
     Comments:
     ◮ We assume λ = 1 to reduce notation.
     ◮ Typical breakdown: f(x) is the GP and y(x) equals f(x) plus i.i.d. noise.
     ◮ The kernel is what keeps this from being "just a Gaussian."
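A sketch of drawing from this model: sample f ~ N(0, K) on a grid of inputs and add i.i.d. noise, using the RBF kernel (λ = 1 as on the slide). The grid, kernel width, noise level, and jitter term are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 5, 100)[:, None]                       # grid of inputs
K = np.exp(-((x - x.T) ** 2) / 1.0)                       # K_ij = K(x_i, x_j)
f = rng.multivariate_normal(np.zeros(len(x)), K + 1e-8 * np.eye(len(x)))  # f ~ N(0, K)
y = f + 0.1 * rng.normal(size=len(x))                     # noise-added process y
```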
