Department of Computer Science CSCI 5622: Machine Learning Chenhao - - PowerPoint PPT Presentation

SLIDE 1

Department of Computer Science CSCI 5622: Machine Learning Chenhao Tan Lecture 18: Clustering Slides adapted from Jordan Boyd-Graber, Chris Ketelsen

SLIDE 2

Learning objectives

  • Learn about general clustering
  • Learn about the K-Means algorithm
  • Learn about Gaussian Mixture Models

SLIDE 3

Supervised learning

  • Data: X
  • Labels: Y

Unsupervised learning

  • Data: X
  • Latent structure: Z

SLIDE 4

Clustering

  • One important unsupervised method is clustering
  • Goal: Organize data in classes

SLIDE 5

Clustering applications – Microarray Gene Expression data

From: “Skin layer-specific transcriptional profiles in normal and recessive yellow (Mc1re/Mc1re) mice” by April and Barsh, Pigment Cell Research (2006)

SLIDE 6

Clustering applications – Medical Imaging

SLIDE 7

Clustering applications – Community detection

SLIDE 8

News Media

SLIDE 9

Clustering

  • One important unsupervised method is clustering
  • Goal: Organize data in classes
  • Classes are hard to define
  • Different data representations may lead to different clusterings

SLIDE 10

Clustering

  • One important unsupervised method is clustering
  • Goal: Organize data in classes
  • Data have high in-class similarity
  • Data have low out-of-class similarity
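The two similarity criteria above can be checked numerically on a toy example. The sketch below assumes Euclidean distance as the (dis)similarity measure, which the slides do not specify; the function names and the four sample points are made up for illustration.

```python
import math
from itertools import combinations

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_pairwise_distances(points, labels):
    """Return (mean within-class distance, mean between-class distance).

    Assumes at least one same-class pair and one cross-class pair exist.
    """
    within, between = [], []
    for i, j in combinations(range(len(points)), 2):
        d = euclidean(points[i], points[j])
        (within if labels[i] == labels[j] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)

# Hypothetical toy data: two tight groups that are far apart.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels = [0, 0, 1, 1]
within, between = mean_pairwise_distances(points, labels)
print(within, between)
```

A good clustering under this measure has a small within-class distance (high in-class similarity) and a large between-class distance (low out-of-class similarity).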

SLIDE 11

Clustering - Similarity

SLIDE 13

K-Means

  • Simplest clustering method
  • Iterative in nature
  • Reasonably fast
  • Very popular in practice (though with more bells and whistles)
  • Requires real-valued data
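The iterative procedure can be sketched as follows. This is a minimal version of the standard Lloyd's-algorithm formulation of K-Means, not necessarily the exact variant shown in the lecture figures. The initialization at randomly chosen data points, the stopping rule, and all names are assumptions made for illustration.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iters=100, seed=0):
    """Alternate between an assignment step and a mean-update step."""
    rng = random.Random(seed)
    # Assumption: initialize centroids at k data points chosen at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(n_iters):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centroids are stable
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical toy data: two well-separated groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(centroids, assignments)
```

The update step takes a coordinate-wise mean, which is why the method needs real-valued data.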

SLIDE 14

K-Means

SLIDE 23

More K-means

  • Animations:

http://shabal.in/visuals/kmeans/4.html

SLIDE 24

K-Means in numbers

SLIDE 36

K-Means

SLIDE 38

K-Means

  • Weaknesses
  • Doesn't really work with categorical data
  • Usually only converges to local minimum
  • Have to determine number of clusters
  • Can be sensitive to outliers
  • Only generates convex clusters

SLIDE 39

K-means - Weaknesses

  • Doesn't really work with categorical data

SLIDE 40

K-means - Weaknesses

  • Doesn't really work with categorical data
  • Fix: Do K-Modes instead
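K-Modes keeps the K-Means loop but swaps its two numeric ingredients: the distance becomes a count of mismatched attributes, and the cluster center becomes the per-attribute mode instead of the mean. The sketch below shows only those two ingredients; the helper names and the sample records are hypothetical.

```python
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two categorical records disagree."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(records):
    """Per-attribute most frequent value; plays the role of the mean in K-Means."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*records))

# Hypothetical cluster of categorical records: (color, size).
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
center = mode_of(cluster)
print(center, hamming(("red", "large"), center))
```

Neither operation needs arithmetic on the attribute values, so categorical data can be used directly.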

SLIDE 41

K-means - Weaknesses

  • Usually only converges to local minimum

SLIDE 42

K-means - Weaknesses

  • Usually only converges to local minimum
  • Fix: Do several runs with random initializations and choose the best (lowest final loss)
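A minimal sketch of the restart strategy, assuming "best" means the run with the lowest sum of squared distances to the assigned centroids. The compact `kmeans_once` helper, the fixed iteration count, and the toy data are illustrative assumptions rather than the lecture's implementation.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))

def kmeans_once(points, k, rng, n_iters=50):
    """One K-Means run from a random initialization; returns (loss, centroids, labels)."""
    centroids = rng.sample(points, k)
    for _ in range(n_iters):
        labels = [nearest(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    labels = [nearest(p, centroids) for p in points]
    loss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return loss, centroids, labels

def best_of_restarts(points, k, n_restarts=10, seed=0):
    """Run K-Means n_restarts times and keep the run with the lowest loss."""
    rng = random.Random(seed)
    runs = [kmeans_once(points, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[0])

# Hypothetical toy data: three small groups.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
          (5.0, 5.0), (5.4, 4.8), (4.9, 5.3),
          (10.0, 0.0), (10.3, 0.4), (9.8, -0.2)]
best_loss, best_centroids, best_labels = best_of_restarts(points, k=3)
print(round(best_loss, 3))
```

Restarts do not guarantee the global minimum; they only make a poor local minimum less likely to be the one reported.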

SLIDE 43

K-means - Weaknesses

  • Have to determine number of clusters

SLIDE 44

K-means - Weaknesses

  • Have to determine number of clusters
  • Fix: Use the elbow method

Run K-Means for several values of k, plot the loss against k, and pick the k where the curve bends and further increases in k yield little improvement
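The elbow procedure can be sketched as a loop over k that records the final loss of each run. The loss here is assumed to be the sum of squared distances from each point to its nearest centroid. The basic K-Means helper, the seed, and the toy data are assumptions for illustration.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_loss(points, k, seed=0, n_iters=50):
    """Run a basic K-Means and return the final loss."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: random data points as initial centroids
    for _ in range(n_iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(d) / len(members) for d in zip(*members))
    return sum(min(sq_dist(p, c) for c in centroids) for p in points)

# Hypothetical toy data with three visible groups.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
          (5.0, 5.0), (5.4, 4.8), (4.9, 5.3),
          (10.0, 0.0), (10.3, 0.4), (9.8, -0.2)]
losses = {k: kmeans_loss(points, k) for k in range(1, 7)}
for k, loss in losses.items():
    print(k, round(loss, 2))
```

The loss generally keeps shrinking as k grows, so the minimum loss is not a useful criterion by itself; the elbow is where the drop levels off. Since a single run can land in a local minimum, each k is often run with several restarts in practice.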

SLIDE 51

Gaussian Mixture Models

SLIDE 60

Recap

  • K-means is the most commonly used clustering algorithm
  • We learned the Gaussian Mixture Model’s generative story
  • We will learn the EM algorithm next week
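As a reminder of the generative story, here is a minimal sampling sketch for a one-dimensional Gaussian mixture: first draw a latent component z from the mixing weights, then draw the observation x from that component's Gaussian. The parameter values and the function name are hypothetical, and the lecture's own notation may differ.

```python
import random

def sample_gmm(n, weights, means, stds, seed=0):
    """Draw n (z, x) pairs from a 1-D Gaussian mixture.

    For each pair: pick component z with probability weights[z],
    then draw x from a normal distribution with mean means[z] and std stds[z].
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        z = rng.choices(range(len(weights)), weights=weights)[0]  # latent cluster
        x = rng.gauss(means[z], stds[z])  # observed value
        samples.append((z, x))
    return samples

# Hypothetical parameters: two components with mixing weights 0.3 and 0.7.
data = sample_gmm(1000, weights=[0.3, 0.7], means=[-2.0, 3.0], stds=[0.5, 1.0])
print(data[:3])
```

In clustering, only the x values are observed; recovering z and the parameters from them is the inference problem that the EM algorithm addresses.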
