Week 7 Video 1: Clustering



slide-1
SLIDE 1

Clustering

Week 7 Video 1

slide-2
SLIDE 2

Clustering

¨ A type of Structure Discovery algorithm
¨ This type of method is also referred to as Dimensionality Reduction, based on a common application

slide-3
SLIDE 3

Clustering

¨ You have a large number of data points
¨ You want to find what structure there is among the data points
¨ You don’t know anything a priori about the structure
¨ Clustering tries to find data points that “group together”

slide-4
SLIDE 4

Trivial Example

¨ Let’s say your data has two variables

¤ Probability the student knows the skill, from BKT (Pknow)
¤ Unitized Time

¨ Note: clustering works for (and is effective in) large feature spaces
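A minimal sketch of how the two-variable data set might be assembled. It assumes "unitized" means expressed in standard deviations from the mean (a z-score), which the slides do not spell out; the raw values and the `unitize` helper are hypothetical.

```python
from statistics import mean, pstdev

def unitize(values):
    """Convert raw values to z-scores: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

# Hypothetical raw data: one pknow estimate and one response time (seconds) per action.
pknow = [0.15, 0.80, 0.55, 0.95]
seconds = [10.0, 20.0, 30.0, 40.0]

unit_time = unitize(seconds)
points = list(zip(pknow, unit_time))  # each point is (pknow, unitized time)
```

Putting time on a unitless scale keeps it from dominating the distance calculation just because raw seconds span a wider numeric range than a probability.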

slide-5
SLIDE 5

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]
slide-6
SLIDE 6

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

k-Means Clustering Algorithm

slide-7
SLIDE 7

Not the only clustering algorithm

¨ Just the simplest
¨ We’ll discuss fancier ones as the week goes on

slide-8
SLIDE 8

How did we get these clusters?

¨ First we decided how many clusters we wanted: 5
¤ How did we do that? More on this in the next lecture
¨ We picked starting values for the “centroids” of the clusters…
¤ Usually chosen randomly
¤ Sometimes there are good reasons to start with specific initial values…
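One way to pick random starting centroids can be sketched as follows. This is an illustration under assumptions, not the procedure used for the slides: it draws each centroid uniformly inside the bounding box of the data, and the function name, the seed parameter, and the sample points are hypothetical.

```python
import random

def random_centroids(points, k, seed=None):
    """Draw k starting centroids uniformly at random inside the data's bounding box."""
    rng = random.Random(seed)
    dims = range(len(points[0]))
    lows = [min(p[d] for p in points) for d in dims]
    highs = [max(p[d] for p in points) for d in dims]
    return [tuple(rng.uniform(lows[d], highs[d]) for d in dims) for _ in range(k)]

# Hypothetical (pknow, time) points.
points = [(0.1, -2.0), (0.2, -1.5), (0.9, 2.0), (0.8, 2.5)]
starts = random_centroids(points, 5, seed=7)
```

Another common choice is to use k randomly selected data points as the starting centroids, which guarantees that every centroid begins near some data.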

slide-9
SLIDE 9

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]
slide-10
SLIDE 10

Then…

¨ We assign every point to the centroid it’s closest to
¤ This defines the clusters
¤ Typically visualized as a Voronoi diagram
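The assignment step can be sketched in a few lines. The sketch assumes Euclidean distance, which the slides imply but do not state; the `assign` name and the sample points are hypothetical.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid (Euclidean distance)."""
    return [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]

# Hypothetical (pknow, time) points and two centroids.
points = [(0.1, -2.0), (0.2, -1.5), (0.9, 2.0), (0.8, 2.5)]
centroids = [(0.0, -2.0), (1.0, 2.0)]
labels = assign(points, centroids)
```

Each label is a cluster index, so the set of points sharing a label corresponds to one cell of the Voronoi diagram around that centroid.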

slide-11
SLIDE 11

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]
slide-12
SLIDE 12

Then…

¨ We re-fit the centroids as the center (mean) of the points in each cluster
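The re-fitting step might look like the sketch below. Keeping the old centroid when a cluster has no points is an assumption made here for simplicity; the function name and sample data are hypothetical.

```python
def refit(points, labels, centroids):
    """Move each centroid to the mean of the points currently assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            new_centroids.append(old)  # assumption: an empty cluster keeps its old centroid
            continue
        new_centroids.append(
            tuple(sum(col) / len(members) for col in zip(*members))
        )
    return new_centroids

# Hypothetical (pknow, time) points with their current cluster labels.
points = [(0.0, -2.0), (0.2, -1.0), (1.0, 2.0)]
labels = [0, 0, 1]
new = refit(points, labels, [(0.5, 0.0), (0.5, 1.0)])
```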

slide-13
SLIDE 13

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]
slide-14
SLIDE 14

Then…

¨ Repeat the process until the centroids stop moving
¨ “Convergence”
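Putting the steps together, the whole loop can be sketched as below. This is a bare-bones illustration, not a production implementation: it assumes Euclidean distance, leaves empty clusters where they are, and adds a hypothetical `max_iter` cap as a safety net.

```python
import math

def kmeans(points, start, max_iter=100):
    """Alternate assignment and re-fitting until no centroid moves."""
    centroids = [tuple(c) for c in start]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Re-fit step: each centroid moves to the mean of its members.
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(old)  # assumption: empty clusters stay put
        if updated == centroids:  # convergence: the centroids stopped moving
            break
        centroids = updated
    return centroids, labels

# Hypothetical (pknow, time) points and two starting centroids.
points = [(0.1, -2.0), (0.2, -1.5), (0.9, 2.0), (0.8, 2.5)]
centroids, labels = kmeans(points, [(0.5, -3.0), (0.5, 3.0)])
```

Checking that the centroids are unchanged is equivalent to checking that no point switched clusters, since identical assignments produce identical means.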

slide-15
SLIDE 15

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]
slide-16
SLIDE 16

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]
slide-17
SLIDE 17

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]
slide-18
SLIDE 18

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]
slide-19
SLIDE 19

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]
slide-20
SLIDE 20

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Note that there are some outliers

slide-21
SLIDE 21

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

What if we start with these points?

slide-22
SLIDE 22

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Not very good clusters

slide-23
SLIDE 23

What happens?

¨ What happens if your starting points are in strange places?
¨ Not trivial to avoid, considering the full span of possible data distributions

slide-24
SLIDE 24

One Solution

¨ Run several times, using different starting points
¨ cf. Conati & Amershi (2009)
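A sketch of the multiple-restart idea follows. Two choices here are assumptions rather than anything stated on the slides: each run starts from k randomly chosen data points, and runs are compared by within-cluster sum of squared distances (lower is better), which is one common criterion. All names and data are hypothetical.

```python
import math
import random

def kmeans(points, start, max_iter=100):
    """Plain k-means: assign points, re-fit centroids, stop when nothing moves."""
    centroids = [tuple(c) for c in start]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def sse(points, centroids, labels):
    """Within-cluster sum of squared distances to the assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means from several random starts and keep the lowest-SSE result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        start = rng.sample(points, k)  # assumption: start from k distinct data points
        centroids, labels = kmeans(points, start)
        score = sse(points, centroids, labels)
        if best is None or score < best[0]:
            best = (score, centroids, labels)
    return best

# Hypothetical data: two well-separated lumps of (pknow, time) points.
points = [(0.1, -2.0), (0.2, -1.5), (0.15, -1.8),
          (0.9, 2.0), (0.8, 2.5), (0.85, 2.2)]
score, centroids, labels = best_of_restarts(points, k=2)
```

Comparing runs by this score only makes sense for a fixed k, because the sum of squares always shrinks as more clusters are added.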

slide-25
SLIDE 25

Exercises

¨ Take the following examples
¨ (The slides will be available in course materials so you can work through them)
¨ And execute k-means for them
¨ Do this by hand…
¨ Focus on getting the concept rather than the exact right answer…
¨ (Solutions are by hand rather than actually using code, and are not guaranteed to be perfect)

slide-26
SLIDE 26

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Exercise 7-1-1

slide-27
SLIDE 27

Pause Here with In-Video Quiz

¨ Do this yourself if you want to
¨ Only quiz option: go ahead

slide-28
SLIDE 28

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 1

slide-29
SLIDE 29

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 2

slide-30
SLIDE 30

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 3

slide-31
SLIDE 31

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 4

slide-32
SLIDE 32

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 5

slide-33
SLIDE 33

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

No points switched -- convergence

slide-34
SLIDE 34

Notes

¨ k-means did pretty reasonably here

slide-35
SLIDE 35

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Exercise 7-1-2

slide-36
SLIDE 36

Pause Here with In-Video Quiz

¨ Do this yourself if you want to
¨ Only quiz option: go ahead

slide-37
SLIDE 37

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 1

slide-38
SLIDE 38

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 2

slide-39
SLIDE 39

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 3

slide-40
SLIDE 40

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 4

slide-41
SLIDE 41

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 5

slide-42
SLIDE 42

Notes

¨ The three clusters in the same data lump might move around for a little while
¨ But really, what we have here is one cluster and two outliers…
¨ k should be 3 rather than 5
¤ See next lecture to learn more

slide-43
SLIDE 43

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Exercise 7-1-3

slide-44
SLIDE 44

Pause Here with In-Video Quiz

¨ Do this yourself if you want to
¨ Only quiz option: go ahead

slide-45
SLIDE 45

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution
slide-46
SLIDE 46

Notes

¨ The bottom-right cluster is actually empty!
¨ At no step was that centroid actually the closest one to any point
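An empty cluster is easy to detect by counting how many points each centroid wins. The sketch below uses hypothetical (pknow, time) points and centroids, not the exercise data, and assumes Euclidean distance.

```python
import math
from collections import Counter

def cluster_sizes(points, centroids):
    """Count how many points are nearest to each centroid (0 means an empty cluster)."""
    labels = [
        min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
        for p in points
    ]
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(len(centroids))]

# Hypothetical data: the third centroid sits far away from every point.
points = [(0.9, -2.0), (0.8, -1.5), (0.5, 0.0), (0.6, 0.5)]
centroids = [(0.85, -1.8), (0.55, 0.2), (0.05, 3.0)]
sizes = cluster_sizes(points, centroids)
empty = [c for c, n in enumerate(sizes) if n == 0]
```

A centroid with no members has no mean to move to, so it stays wherever it started unless the implementation re-seeds it.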

slide-47
SLIDE 47

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Exercise 7-1-4

slide-48
SLIDE 48

Pause Here with In-Video Quiz

¨ Do this yourself if you want to
¨ Only quiz option: go ahead

slide-49
SLIDE 49

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 1

slide-50
SLIDE 50

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 2

slide-51
SLIDE 51

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 3

slide-52
SLIDE 52

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 4

slide-53
SLIDE 53

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 5

slide-54
SLIDE 54

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 6

slide-55
SLIDE 55

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Solution Step 7

slide-56
SLIDE 56

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Approximate Solution

slide-57
SLIDE 57

Notes

¨ Kind of a weird outcome
¨ By unlucky initial positioning
¤ One data lump at left became three clusters
¤ Two clearly distinct data lumps at right became one cluster
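The effect of unlucky starting positions can be reproduced on a tiny example. The four points and both sets of starting centroids below are hypothetical, chosen so that one start converges immediately to a poor grouping; the sum of squared distances used to compare the two outcomes is an assumption, not something the slides define.

```python
import math

def kmeans(points, start, max_iter=100):
    """Plain k-means: assign points, re-fit centroids, stop when nothing moves."""
    centroids = [tuple(c) for c in start]
    labels = []
    for _ in range(max_iter):
        labels = [min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(col) / len(members) for col in zip(*members))
                           if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

def sse(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

# Hypothetical (pknow, time) points: two tight pairs, far apart in time.
points = [(0.0, -2.0), (1.0, -2.0), (0.0, 2.0), (1.0, 2.0)]

unlucky = kmeans(points, [(0.0, 0.0), (1.0, 0.0)])   # starts midway between the pairs
lucky = kmeans(points, [(0.5, -2.0), (0.5, 2.0)])    # starts one per pair

unlucky_sse = sse(points, *unlucky)
lucky_sse = sse(points, *lucky)
```

Both runs have converged in the sense that no centroid moves any more, yet the unlucky run groups points that are far apart in time, which is why convergence alone says little about cluster quality.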

slide-58
SLIDE 58

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Exercise 7-1-5

slide-59
SLIDE 59

Pause Here with In-Video Quiz

¨ Do this yourself if you want to
¨ Only quiz option: go ahead

slide-60
SLIDE 60

[Figure: scatter plot with axes pknow (maximum 1) and time (-3 to +3)]

Exercise 7-1-5

slide-61
SLIDE 61

Notes

¨ That actually kind of came out ok…

slide-62
SLIDE 62

As you can see

¨ A lot depends on initial positioning
¨ And on the number of clusters
¨ How do you pick which final position and number of clusters to go with?

slide-63
SLIDE 63

Next lecture

¨ Clustering – Validation and Selection of k