

1. COMS 4721: Machine Learning for Data Science
Lecture 14, 3/21/2017
Prof. John Paisley
Department of Electrical Engineering & Data Science Institute
Columbia University

2. UNSUPERVISED LEARNING

3. SUPERVISED LEARNING
Framework of supervised learning
Given: Pairs (x_1, y_1), ..., (x_n, y_n). Think of x as input and y as output.
Learn: A function f(x) that accurately predicts y_i ≈ f(x_i) on this data.
Goal: Use the function f(x) to predict a new y_0 given x_0.
Probabilistic motivation
If we think of (x, y) as a random variable with joint distribution p(x, y), then supervised learning seeks to learn the conditional distribution p(y | x). This can be done either directly or indirectly:
Directly: e.g., with logistic regression, where p(y | x) is given by a sigmoid function.
Indirectly: e.g., with a Bayes classifier,
    y = arg max_k p(y = k | x) = arg max_k p(x | y = k) p(y = k).
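As a concrete illustration of the indirect route, here is a minimal sketch of a Bayes classifier that assumes Gaussian class-conditional densities p(x | y = k) fit by maximum likelihood; the function names and the Gaussian assumption are ours, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_bayes_classifier(X, y, K):
    """Estimate the class priors p(y = k) and Gaussian class-conditionals p(x | y = k)."""
    priors, means, covs = [], [], []
    for k in range(K):
        Xk = X[y == k]
        priors.append(len(Xk) / len(X))          # p(y = k)
        means.append(Xk.mean(axis=0))            # mean of p(x | y = k)
        covs.append(np.cov(Xk, rowvar=False))    # covariance of p(x | y = k)
    return priors, means, covs

def bayes_predict(x0, priors, means, covs):
    """Return arg max_k p(x0 | y = k) p(y = k) for a new point x0."""
    scores = [multivariate_normal.pdf(x0, mean=m, cov=c) * p
              for p, m, c in zip(priors, means, covs)]
    return int(np.argmax(scores))
```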

4. UNSUPERVISED LEARNING
Some motivation
◮ The Bayes classifier factorizes the joint density as p(x, y) = p(x | y) p(y).
◮ The joint density can also be written as p(x, y) = p(y | x) p(x).
◮ Unsupervised learning focuses on the term p(x). What should this be?
◮ Learning p(x | y) on a class-specific subset of the data has the same "feel," but that implies an underlying classification task, and often there isn't one.
Unsupervised learning
Given: A data set x_1, ..., x_n, where x_i ∈ X, e.g., X = R^d.
Define: Some model of the data (probabilistic or non-probabilistic).
Goal: Learn structure within the data set as defined by the model.
◮ Supervised learning has a clear performance metric: accuracy.
◮ Unsupervised learning is often (but not always) more subjective.

5. SOME TYPES OF UNSUPERVISED LEARNING
Overview of second half of course
We will discuss a few types of unsupervised learning approaches in the second half of the course.
Clustering models: Learn a partition of data x_1, ..., x_n into groups.
◮ Image segmentation, data quantization, preprocessing for other models
Matrix factorization: Learn an underlying dot-product representation.
◮ User preference modeling, topic modeling
Sequential models: Learn a model based on sequential information.
◮ Learning how to rank objects, target tracking
As will become evident, an unsupervised model can often be interpreted as a supervised model, or very easily turned into one.

6. CLUSTERING
Problem
◮ Given data x_1, ..., x_n, partition it into groups called clusters.
◮ Find the clusters, given only the data.
◮ Observations in the same group ⇒ "similar"; in different groups ⇒ "different."
◮ We will set how many clusters we learn.
Cluster assignment representation
For K clusters, encode the cluster assignments with indicators c_i ∈ {1, ..., K}, where
    c_i = k  ⟺  x_i is assigned to cluster k.
Clustering feels similar to classification in that we "label" an observation by its cluster assignment. The difference is that there is no ground truth.
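A tiny illustration of this encoding; the arrays and values below are made up for the example.

```python
import numpy as np

# Five observations assigned to K = 3 clusters; c[i] = k means x_i belongs to cluster k.
c = np.array([1, 3, 3, 2, 1])          # cluster assignments, 1-indexed as on the slide
in_cluster_3 = np.where(c == 3)[0]     # indices of the observations in cluster 3 -> [1, 2]
print(in_cluster_3)
```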

7. THE K-MEANS ALGORITHM

8. CLUSTERING AND K-MEANS
K-means is the simplest and most fundamental clustering algorithm.
Input: x_1, ..., x_n, where x_i ∈ R^d.
Output: Vector c of cluster assignments, and K mean vectors µ.
◮ c = (c_1, ..., c_n), with c_i ∈ {1, ..., K}.
  • If c_i = c_j = k, then x_i and x_j are clustered together in cluster k.
◮ µ = (µ_1, ..., µ_K), with µ_k ∈ R^d (the same space as x_i).
  • Each µ_k (called a centroid) defines a cluster.
As usual, we need to define an objective function. We pick one that:
1. Tells us what good values of c and µ are, and
2. Is easy to optimize.

9. K-MEANS OBJECTIVE FUNCTION
The K-means objective function can be written as
    µ*, c* = arg min_{µ, c} Σ_{i=1}^n Σ_{k=1}^K 1{c_i = k} ||x_i − µ_k||^2.
Some observations:
◮ K-means uses the squared Euclidean distance of x_i to the centroid µ_k.
◮ It only penalizes the distance of x_i to the centroid it is assigned to by c_i:
    L = Σ_{i=1}^n Σ_{k=1}^K 1{c_i = k} ||x_i − µ_k||^2 = Σ_{k=1}^K Σ_{i : c_i = k} ||x_i − µ_k||^2.
◮ The objective function is non-convex.
◮ This means that we can't actually find the optimal µ* and c*.
◮ We can only derive an algorithm for finding a local optimum (more later).
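A direct translation of this objective into code, as a sketch under our own conventions: X is an (n, d) array of data, mu is a (K, d) array of centroids, and c holds 0-indexed assignments (the slides use 1-indexing).

```python
import numpy as np

def kmeans_objective(X, c, mu):
    """L = sum_i ||x_i - mu_{c_i}||^2: each point is only charged for its assigned centroid."""
    return np.sum(np.linalg.norm(X - mu[c], axis=1) ** 2)
```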

10. OPTIMIZING THE K-MEANS OBJECTIVE
Gradient-based optimization
We can't optimize the K-means objective function exactly by taking derivatives and setting them to zero, so we use an iterative algorithm. However, the algorithm we will use is different from gradient methods:
    w ← w − η ∇_w L    (gradient descent)
Recall: With gradient descent, when we update a parameter w we move in the direction that decreases the objective function, but
◮ It will almost certainly not move to the best value for that parameter.
◮ It may not even move to a better value if the step size η is too big.
◮ We also need the parameter w to be continuous-valued.
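For contrast with the coordinate updates that follow, here is the gradient step above applied to a toy differentiable objective; the quadratic L(w) = ||w − 3||^2 and the step size are made up purely for illustration.

```python
import numpy as np

def grad_L(w):
    """Gradient of the toy objective L(w) = ||w - 3||^2."""
    return 2.0 * (w - 3.0)

w, eta = np.array([0.0]), 0.1
for _ in range(50):
    w = w - eta * grad_L(w)    # w <- w - eta * grad_w L
# w has moved close to the minimizer at 3.0, but no single step was optimal,
# and the recipe requires w to be continuous-valued (unlike the assignments c).
```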

11. K-MEANS AND COORDINATE DESCENT
Coordinate descent
We will discuss a new and widely used optimization procedure in the context of K-means clustering. We want to minimize the objective function
    L = Σ_{i=1}^n Σ_{k=1}^K 1{c_i = k} ||x_i − µ_k||^2.
We split the variables into two unknown sets, µ and c. We can't find their best values at the same time to minimize L. However, we will see that
◮ Fixing µ, we can find the best c exactly.
◮ Fixing c, we can find the best µ exactly.
This optimization approach is called coordinate descent: hold one set of parameters fixed and optimize the other set, then switch which set is fixed.

12. COORDINATE DESCENT
Coordinate descent (in the context of K-means)
Input: x_1, ..., x_n, where x_i ∈ R^d. Randomly initialize µ = (µ_1, ..., µ_K).
◮ Iterate back and forth between the following two steps:
  1. Given µ, find the best value c_i ∈ {1, ..., K} for i = 1, ..., n.
  2. Given c, find the best vector µ_k ∈ R^d for k = 1, ..., K.
There's a circular way of thinking about why we need to iterate:
  1. Given a particular µ, we may be able to find the best c, but once we change c we can probably find a better µ.
  2. Then find the best µ for the new-and-improved c found in #1, but now that we've changed µ, there is probably a better c.
We have to iterate because the values of µ and c depend on each other. This happens very frequently in unsupervised models.

13. K-MEANS ALGORITHM: UPDATING c
Assignment step
Given µ = (µ_1, ..., µ_K), update c = (c_1, ..., c_n).
By rewriting L, we notice the independence of each c_i given µ:
    L = Σ_{k=1}^K 1{c_1 = k} ||x_1 − µ_k||^2 + · · · + Σ_{k=1}^K 1{c_n = k} ||x_n − µ_k||^2,
where the first term is the distance of x_1 to its assigned centroid and the last term is the distance of x_n to its assigned centroid.
We can minimize L with respect to each c_i by minimizing each term above separately. The solution is to assign x_i to the closest centroid,
    c_i = arg min_k ||x_i − µ_k||^2.
Because there are only K options for each c_i, there are no derivatives. Simply calculate all the possible values for c_i and pick the best (smallest) one.
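In code, the assignment step could look like the following sketch, with the same assumed conventions as before: X is (n, d), mu is (K, d), and assignments are 0-indexed.

```python
import numpy as np

def update_assignments(X, mu):
    """For each x_i, set c_i = arg min_k ||x_i - mu_k||^2."""
    # (n, K) matrix of squared distances from every point to every centroid.
    sq_dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2) ** 2
    return np.argmin(sq_dists, axis=1)
```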

14. K-MEANS ALGORITHM: UPDATING µ
Update step
Given c = (c_1, ..., c_n), update µ = (µ_1, ..., µ_K).
For a given c, we can break L into K clusters defined by c so that each µ_k is independent:
    L = Σ_{i=1}^n 1{c_i = 1} ||x_i − µ_1||^2 + · · · + Σ_{i=1}^n 1{c_i = K} ||x_i − µ_K||^2,
where the kth term is the sum of squared distances of the data in cluster #k to its centroid.
For each k, we then optimize. Let n_k = Σ_{i=1}^n 1{c_i = k}. Then
    µ_k = arg min_µ Σ_{i=1}^n 1{c_i = k} ||x_i − µ||^2   ⟶   µ_k = (1/n_k) Σ_{i=1}^n x_i 1{c_i = k}.
That is, µ_k is the mean of the data assigned to cluster k.
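And the centroid update, again as a sketch; if some cluster ends up empty (n_k = 0), this version simply keeps that centroid where it was, which is one common convention rather than something specified on the slide.

```python
import numpy as np

def update_centroids(X, c, mu_old, K):
    """Set mu_k to the mean of the points currently assigned to cluster k."""
    mu = mu_old.copy()
    for k in range(K):
        members = X[c == k]
        if len(members) > 0:            # n_k > 0; otherwise leave mu_k unchanged
            mu[k] = members.mean(axis=0)
    return mu
```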

15. K-MEANS CLUSTERING ALGORITHM
Algorithm: K-means clustering
Given: x_1, ..., x_n, where each x_i ∈ R^d.
Goal: Minimize L = Σ_{i=1}^n Σ_{k=1}^K 1{c_i = k} ||x_i − µ_k||^2.
◮ Randomly initialize µ = (µ_1, ..., µ_K).
◮ Iterate until c and µ stop changing:
  1. Update each c_i:  c_i = arg min_k ||x_i − µ_k||^2.
  2. Update each µ_k:  Set n_k = Σ_{i=1}^n 1{c_i = k} and µ_k = (1/n_k) Σ_{i=1}^n x_i 1{c_i = k}.
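Putting the two steps together, a minimal sketch of the full loop; it reuses the hypothetical update_assignments and update_centroids helpers above and initializes the centroids at K randomly chosen data points, which is one common choice the slide leaves unspecified.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Alternate the assignment and update steps until the assignments stop changing."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initialize at K random data points
    c = np.full(len(X), -1)
    for _ in range(max_iters):
        c_new = update_assignments(X, mu)                # step 1: update c given mu
        mu = update_centroids(X, c_new, mu, K)           # step 2: update mu given c
        if np.array_equal(c_new, c):                     # converged: no assignment changed
            break
        c = c_new
    return c, mu
```

For example, c, mu = kmeans(X, K=3) would partition X into three clusters under these conventions.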

16. K-MEANS ALGORITHM: EXAMPLE RUN (a) A random initialization [scatter plot, axes from −2 to 2]

17. K-MEANS ALGORITHM: EXAMPLE RUN (b) Iteration 1: Assign data to clusters [scatter plot]

18. K-MEANS ALGORITHM: EXAMPLE RUN (c) Iteration 1: Update the centroids [scatter plot]

19. K-MEANS ALGORITHM: EXAMPLE RUN (d) Iteration 2: Assign data to clusters [scatter plot]

20. K-MEANS ALGORITHM: EXAMPLE RUN (e) Iteration 2: Update the centroids [scatter plot]

21. K-MEANS ALGORITHM: EXAMPLE RUN (f) Iteration 3: Assign data to clusters [scatter plot]

22. K-MEANS ALGORITHM: EXAMPLE RUN (g) Iteration 3: Update the centroids [scatter plot]

23. K-MEANS ALGORITHM: EXAMPLE RUN (h) Iteration 4: Assign data to clusters [scatter plot]

24. K-MEANS ALGORITHM: EXAMPLE RUN (i) Iteration 4: Update the centroids [scatter plot]

25. CONVERGENCE OF K-MEANS
[Figure: the objective L plotted against iteration number, showing the value of the objective function after
◮ the "assignment" step (blue: corresponding to c), and
◮ the "update" step (red: corresponding to µ).]
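To reproduce this kind of trace, one can record the objective after each half-step; the sketch below reuses the hypothetical helpers from earlier (kmeans_objective, update_assignments, update_centroids) and the same made-up initialization.

```python
import numpy as np

def kmeans_with_trace(X, K, max_iters=20, seed=0):
    """Run K-means and record L after the assignment step and after the update step."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]
    assign_L, update_L = [], []                         # the blue and red points, respectively
    for _ in range(max_iters):
        c = update_assignments(X, mu)
        assign_L.append(kmeans_objective(X, c, mu))     # L right after updating c
        mu = update_centroids(X, c, mu, K)
        update_L.append(kmeans_objective(X, c, mu))     # L right after updating mu
    return assign_L, update_L
```

Since each half-step can only lower (or leave unchanged) the objective, the recorded values should be non-increasing, which matches the monotone curve on the slide.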
