
Week 7 Video 1: Clustering


  1. Week 7 Video 1 Clustering

  2. Clustering
     - A type of Structure Discovery algorithm
     - This type of method is sometimes also referred to as Dimensionality Reduction, after one of its common applications

  3. Clustering
     - You have a large number of data points
     - You want to find what structure there is among the data points
     - You don’t know anything a priori about the structure
     - Clustering tries to find data points that “group together”

  4. Trivial Example
     - Let’s say your data has two variables
       - Probability the student knows the skill, from Bayesian Knowledge Tracing (BKT): Pknow
       - Unitized Time
     - Note: clustering works for (and is effective in) large feature spaces
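
To make the two-variable setup concrete, here is a minimal sketch of how such data points might be assembled. It assumes that "unitized" means rescaled to mean 0 and standard deviation 1, which would fit a time axis running from -3 to +3, but the slides do not spell this out. The raw values and the `unitize` helper are hypothetical.

```python
from statistics import mean, pstdev

def unitize(values):
    """Rescale values to mean 0 and standard deviation 1 (assumed meaning of 'unitized')."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical raw data: seconds spent per step, and BKT Pknow estimates.
raw_time = [12.0, 30.0, 45.0, 8.0, 60.0]
pknow = [0.10, 0.55, 0.80, 0.20, 0.95]

# Each data point is a (pknow, unitized time) pair.
points = list(zip(pknow, unitize(raw_time)))
```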

  5. [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  6. k-Means Clustering Algorithm
     [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  7. Not the only clustering algorithm
     - Just the simplest
     - We’ll discuss fancier ones as the week goes on

  8. How did we get these clusters?
     - First we decided how many clusters we wanted: 5
       - How did we do that? More on this in the next lecture
     - We picked starting values for the “centroids” of the clusters
       - Usually chosen randomly
       - Sometimes there are good reasons to start with specific initial values
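
As an illustration of random starting values, the sketch below picks k of the data points at random to serve as initial centroids. Sampling from the data is only one common way to choose random starts and is an assumption here; the slides say only that the values are usually chosen randomly. The function name `init_centroids` and the sample data are hypothetical.

```python
import random

def init_centroids(points, k, seed=None):
    """Pick k distinct data points at random to serve as starting centroids."""
    rng = random.Random(seed)
    return rng.sample(points, k)

# Hypothetical (pknow, unitized time) data points.
points = [(0.1, -2.0), (0.2, -1.5), (0.5, 0.0), (0.8, 1.0), (0.9, 2.5), (0.4, 0.3)]
centroids = init_centroids(points, k=5, seed=0)
```

Passing a fixed `seed` makes a run repeatable, which helps when comparing outcomes from different starting positions.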

  9. [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  10. Then…
      - We classify every point according to which centroid it’s closest to
        - This defines the clusters
        - Typically visualized as a Voronoi diagram
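
The assignment step can be sketched as follows. This assumes "closest" means Euclidean distance, which the slides do not state explicitly; the `assign` helper and the data are hypothetical.

```python
import math

def assign(points, centroids):
    """For each point, return the index of its nearest centroid (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

# Hypothetical (pknow, unitized time) points and two centroids.
points = [(0.1, -2.0), (0.2, -1.5), (0.8, 1.0), (0.9, 2.5)]
centroids = [(0.0, -2.0), (1.0, 2.0)]
labels = assign(points, centroids)  # one cluster index per point
```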

  11. [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  12. Then…
      - We re-fit the centroids as the center (mean) of the points in each cluster
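
A minimal sketch of the re-fitting step: each centroid moves to the coordinate-wise mean of the points currently assigned to it. Leaving a centroid where it is when its cluster has no points is an assumption made for this sketch, not something the slides prescribe. The `update` helper and the data are hypothetical.

```python
def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            # Coordinate-wise mean of the cluster's members.
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
    return new_centroids

points = [(0.0, -2.0), (0.2, -1.0), (0.8, 1.0), (1.0, 3.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, -2.0), (1.0, 2.0)]
new_centroids = update(points, labels, centroids)
```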

  13. [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  14. Then…
      - Repeat the process until the centroids stop moving
      - “Convergence”
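
Putting the steps together, the whole loop might look like the sketch below. It assumes Euclidean distance, random starting centroids sampled from the data, and an iteration cap (`max_iter`) as a safety net; none of these details are specified in the slides, and the function name `k_means` is hypothetical.

```python
import math
import random

def k_means(points, k, seed=None, max_iter=100):
    """Alternate assignment and re-fitting until the centroids stop moving."""
    centroids = random.Random(seed).sample(points, k)  # random starting values
    labels = []
    for _ in range(max_iter):
        # Assignment: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        # Re-fit: each centroid becomes the mean of its members.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])  # assumption: empty cluster stays put
        if updated == centroids:  # convergence: no centroid moved
            break
        centroids = updated
    return centroids, labels

# Hypothetical data: two well-separated lumps of (pknow, unitized time) points.
points = [(0.1, -2.0), (0.2, -1.8), (0.15, -2.2), (0.9, 2.0), (0.8, 2.2), (0.85, 1.9)]
centroids, labels = k_means(points, k=2, seed=1)
```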

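  15. [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  16. [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  17. [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  18. [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  19. [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  20. Note that there are some outliers
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  21. What if we start with these points?
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  22. Not very good clusters
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]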

  23. What happens?
      - What happens if your starting points are in strange places?
      - Not trivial to avoid, considering the full span of possible data distributions

  24. One Solution
      - Run several times, using different starting points
      - cf. Conati & Amershi (2009)
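
One way to sketch the "run several times" idea is shown below. The slides do not say how to choose among the runs (that question is deferred to the next lecture), so scoring each run by its within-cluster sum of squared distances and keeping the lowest is an assumption made here for illustration. The names `k_means`, `within_cluster_ss`, and `best_of_restarts` are hypothetical.

```python
import math
import random

def k_means(points, k, seed, max_iter=100):
    """Plain k-means from a random start; returns (centroids, labels)."""
    centroids = random.Random(seed).sample(points, k)
    labels = []
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(centroids[j])
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def within_cluster_ss(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run k-means from several random starts; keep the lowest-scoring run (assumed criterion)."""
    runs = [k_means(points, k, seed=seed + r) for r in range(n_restarts)]
    return min(runs, key=lambda run: within_cluster_ss(points, *run))

# Hypothetical data: three lumps of (pknow, unitized time) points.
points = [(0.1, -2.0), (0.15, -2.2), (0.2, -1.8),
          (0.5, 0.0), (0.55, 0.2), (0.45, -0.1),
          (0.9, 2.0), (0.85, 2.3), (0.8, 1.9)]
best_centroids, best_labels = best_of_restarts(points, k=3)
```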

  25. Exercises
      - Take the following examples
      - (The slides will be available in course materials so you can work through them)
      - And execute k-means for them
      - Do this by hand…
      - Focus on getting the concept rather than the exact right answer…
      - (Solutions are by hand rather than actually using code, and are not guaranteed to be perfect)

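  26. Exercise 7-1-1
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  27. Pause Here with In-Video Quiz
      - Do this yourself if you want to
      - Only quiz option: go ahead

  28. Solution Step 1
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  29. Solution Step 2
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  30. Solution Step 3
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  31. Solution Step 4
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  32. Solution Step 5
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  33. No points switched: convergence
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  34. Notes
      - k-means did pretty reasonably here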

  35. Exercise 7-1-2
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  36. Pause Here with In-Video Quiz
      - Do this yourself if you want to
      - Only quiz option: go ahead

  37. Solution Step 1
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  38. Solution Step 2
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  39. Solution Step 3
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  40. Solution Step 4
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  41. Solution Step 5
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  42. Notes
      - The three clusters in the same data lump might move around for a little while
      - But really, what we have here is one cluster and two outliers…
      - k should be 3 rather than 5
        - See next lecture to learn more

  43. Exercise 7-1-3
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  44. Pause Here with In-Video Quiz
      - Do this yourself if you want to
      - Only quiz option: go ahead

  45. Solution
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  46. Notes
      - The bottom-right cluster is actually empty!
      - At no step was that centroid the closest one to any data point
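
The empty-cluster situation can be reproduced with a small sketch: if a starting centroid sits far from all the data, no point is ever assigned to it. The data and centroid positions below are hypothetical, and Euclidean distance is assumed.

```python
import math
from collections import Counter

# Hypothetical (pknow, unitized time) points, all in the upper-left region.
points = [(0.1, 2.0), (0.2, 2.5), (0.15, 1.8), (0.3, 2.2)]
# The last centroid starts in the bottom-right, far from every point.
centroids = [(0.1, 2.0), (0.3, 2.3), (0.9, -2.5)]

# Assign each point to its nearest centroid.
labels = [
    min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
    for p in points
]

# Count members per cluster and list the clusters that received none.
sizes = Counter(labels)
empty = [j for j in range(len(centroids)) if sizes[j] == 0]
```

Here `empty` lists the indices of centroids that attracted no points, which is a cheap check to run after each assignment step.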

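  47. Exercise 7-1-4
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  48. Pause Here with In-Video Quiz
      - Do this yourself if you want to
      - Only quiz option: go ahead

  49. Solution Step 1
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  50. Solution Step 2
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  51. Solution Step 3
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  52. Solution Step 4
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  53. Solution Step 5
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  54. Solution Step 6
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  55. Solution Step 7
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  56. Approximate Solution
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  57. Notes
      - Kind of a weird outcome
      - By unlucky initial positioning
        - One data lump at left became three clusters
        - Two clearly distinct data lumps at right became one cluster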

  58. Exercise 7-1-5
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  59. Pause Here with In-Video Quiz
      - Do this yourself if you want to
      - Only quiz option: go ahead

  60. Exercise 7-1-5
      [Figure: scatter plot; axes: time (-3 to +3), pknow (0 to 1)]

  61. Notes
      - That actually kind of came out OK…

  62. As you can see
      - A lot depends on initial positioning
      - And on the number of clusters
      - How do you pick which final position and number of clusters to go with?

  63. Next lecture
      - Clustering – Validation and Selection of k
