A Novel LTM-based Method for Multi-partition Clustering

SLIDE 1

A Novel LTM-based Method for Multi-partition Clustering

Tengfei Liu¹, Nevin L. Zhang¹, Kin Man Poon¹, Hua Liu¹, Yi Wang²

¹ The Hong Kong University of Science and Technology
{liutf, lzhang, lkmpoon, aprillh}@cse.ust.hk

² National University of Singapore
wangy@comp.nus.edu.sg

September 13, 2012

T.F Liu, N.L Zhang, K.M Poon, H. Liu, Y. Wang (HKUST) A Novel LTM-based Method for Multi-partition Clustering

September 13, 2012

1 / 29

SLIDE 2

Outline

1. What is multi-partition clustering?
2. What are latent tree models?
   Introduction to latent tree models
   Results on real-world data
   Bridged Islands Algorithm
3. Experiment Results
4. Conclusion

SLIDE 4

What is multi-partition clustering?

What is multi-partition clustering: an example

How to cluster these?

By object
By style
Multi-partition clustering: by object and by style at the same time

SLIDE 8

What is multi-partition clustering?

What is multi-partition clustering: more examples

Other examples:

Student population:
  course grades
  extracurricular activities

Movie reviews:
  sentiment (positive or negative)
  genre (comedy, action, war, etc.)

Social survey:
  demographic information
  views on social issues

SLIDE 10

What are latent tree models? Introduction to latent tree models

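What is a latent tree model?

[Figure: an example latent tree; the legend distinguishes latent variables from observed variables.]

Latent tree models:

Tree-structured Bayesian networks.
Encode a joint distribution over n observed variables X1, ..., Xn and m latent variables Y1, ..., Ym, with one factor per node given its parent (the root has no parent, so its factor is a marginal distribution):

$$P(X_1, \dots, X_n, Y_1, \dots, Y_m) = \prod_{i=1}^{n} P(X_i \mid \mathrm{parent}(X_i)) \prod_{j=1}^{m} P(Y_j \mid \mathrm{parent}(Y_j))$$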
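
The factorization can be made concrete with a minimal Python sketch. It multiplies one conditional probability per node, latent nodes included, as in any Bayesian network. The five-node tree and every probability value below are hypothetical and chosen only for illustration; they do not come from the slides.

```python
from itertools import product

# Hypothetical latent tree: Y1 is the root; Y2 and X1 are children of Y1;
# X2 and X3 are children of Y2. All variables are binary (states 0 and 1).
parent = {"Y1": None, "Y2": "Y1", "X1": "Y1", "X2": "Y2", "X3": "Y2"}

# Illustrative parameters (assumed, not from the slides).
# cpt[v][parent_state][state]; the root uses the key None for its marginal.
cpt = {
    "Y1": {None: [0.6, 0.4]},
    "Y2": {0: [0.7, 0.3], 1: [0.2, 0.8]},
    "X1": {0: [0.9, 0.1], 1: [0.3, 0.7]},
    "X2": {0: [0.8, 0.2], 1: [0.1, 0.9]},
    "X3": {0: [0.5, 0.5], 1: [0.4, 0.6]},
}


def joint(assignment):
    """Product of P(v | parent(v)) over every node v of the tree."""
    p = 1.0
    for v, pa in parent.items():
        pa_state = None if pa is None else assignment[pa]
        p *= cpt[v][pa_state][assignment[v]]
    return p


def marginal_observed(x1, x2, x3):
    """P(X1, X2, X3): sum the joint over all states of the latent variables."""
    return sum(
        joint({"Y1": y1, "Y2": y2, "X1": x1, "X2": x2, "X3": x3})
        for y1, y2 in product([0, 1], repeat=2)
    )


# Sanity check: the marginal over the observed variables sums to 1.
total = sum(marginal_observed(*xs) for xs in product([0, 1], repeat=3))
```

Summing out the latent variables, as `marginal_observed` does by brute force, is what turns the model into a distribution over the data; real implementations use tree-based inference instead of enumeration.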

SLIDE 11

What are latent tree models? Introduction to latent tree models

What is a latent tree model: another perspective

[Figure: a latent class model (latent variable Y1; observed variables X1, X2, X6, X7) is generalized to a latent tree model (latent variables Y1, Y2, Y3; observed variables X1 to X7).]

LCM → LTM
One latent variable → Multiple latent variables
One clustering → Multiple clusterings

SLIDE 12

What are latent tree models? Results on real-world data

LTMs for multi-partition clustering: survey data

ICAC data: ICAC is the anti-corruption agency of Hong Kong. Survey respondents are asked about their:
  attitude towards corruption;
  perception of the ICAC’s performance.

Sample question: Are you willing to report corruption?
  A. Willing
  B. Unwilling
  C. Depending on circumstances
  D. Don’t know / no opinion

SLIDE 13

What are latent tree models? Results on real-world data

LTMs for multi-partition clustering: survey data

Figure 1: The structure of the LTM obtained for the ICAC data. Abbreviations: C – Corruption, I – ICAC, Y – Year, Gov – Government, Bus – Business Sector. Meanings of manifest variables: Tolerance-C-Gov means ‘tolerance towards corruption in the government’; C-City means ‘level of corruption in the city’; C-NextY means ‘change in the level of corruption next year’; I-Effectiveness means ‘effectiveness of ICAC’s work’; I-Powers means ‘ICAC powers’; Confid-I means ‘confidence in ICAC’; etc.

The edge widths visually show the strength of correlation between variables. They are computed from the probability distributions of the model.

SLIDE 16

What are latent tree models? Results on real-world data

Inspect individual clustering: CCPDs of Y2

Table 2: The class conditional probability distributions (CCPDs) of Y2. For each class, the entries of P(. | Y2) are listed in the row order Income, Age, Education, Sex, over the states s0 to s6; blank cells of the original table are omitted.

P(Y2 = s1) = .37: .04 .1 .42 .28 .14 .02 .05 .35 .39 .17 .03 .04 .41 .09 .09 .37 .57 .43
P(Y2 = s2) = .24: .29 .24 .04 .43 .03 .08 .41 .35 .13 .05 .29 .35 .26 .04 .01 1
P(Y2 = s3) = .22: .11 .17 .25 .31 .07 .1 .07 .22 .4 .3 .02 .29 .43 .19 .05 .01 .8 .2
P(Y2 = s4) = .17: .78 .08 .09 .03 .02 .99 .01 .08 .47 .21 .1 .16 .5 .5

States of the manifest variables

             s0      s1        s2      s3      s4       s5       s6
Income       none    –4k       4–7k    7–10k   10–20k   20–40k   40k–
Age          15–24   25–34     35–44   45–54   55–
Education    none    primary   f1-3    f4-5    f6-7     diploma  degree
Sex          m       f

Interpretation of the classes:
  Y2 = s1: people with good education and good income
  Y2 = s2: a class of women with poor education
  Y2 = s3: people with poor education and average income
  Y2 = s4: a class of young people with low income
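
As a small worked example of how such a table is read, the sketch below applies Bayes' rule to the class priors P(Y2) and the Sex entries of the table to get a soft cluster assignment given a respondent's sex alone. It rests on one assumption: the single entry "1" in the Sex row of class s2 is taken to belong to Sex = f, because that class is described as a class of women. The function name is hypothetical.

```python
# Prior P(Y2) and P(Sex | Y2) read from the table of CCPDs.
# Assumption: the single entry "1" for class s2 belongs to Sex = f,
# because that class is described as a class of women.
prior = {"s1": 0.37, "s2": 0.24, "s3": 0.22, "s4": 0.17}
p_sex = {
    "s1": {"m": 0.57, "f": 0.43},
    "s2": {"m": 0.00, "f": 1.00},
    "s3": {"m": 0.80, "f": 0.20},
    "s4": {"m": 0.50, "f": 0.50},
}


def posterior_given_sex(sex):
    """Bayes' rule: P(Y2 | Sex) is proportional to P(Y2) * P(Sex | Y2)."""
    unnorm = {c: prior[c] * p_sex[c][sex] for c in prior}
    z = sum(unnorm.values())
    return {c: w / z for c, w in unnorm.items()}


post_f = posterior_given_sex("f")
post_m = posterior_given_sex("m")
```

Under these numbers a female respondent is most likely to fall in class s2 and a male respondent in class s1. With more attributes observed, the same rule multiplies in one factor per attribute.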

SLIDE 17

What are latent tree models? Results on real-world data

Relationship between clusterings: conditional probability

P(Y3|Y1)

Figure 2: The structure of the LTM obtained for the ICAC data. Abbreviations: C – Corruption, I – ICAC, Y – Year, Gov – Government, Bus – Business Sector. Meanings of manifest variables: Tolerance-C-Gov means ‘tolerance towards corruption in the government’; C-City means ‘level of corruption in the city’; C-NextY means ‘change in the level of corruption next year’; I-Effectiveness means ‘effectiveness of ICAC’s work’; I-Powers means ‘ICAC powers’; Confid-I means ‘confidence in ICAC’; etc.

SLIDE 18

What are latent tree models? Results on real-world data

Relationship between clusterings

Cluster information of Y3: a clustering about people’s tolerance towards corruption.
  Y3 = 1: people who find corruption intolerable
  Y3 = 2: people who find corruption tolerable
  Y3 = 3: people who find corruption totally tolerable

Consider the conditional probability P(Y3|Y1), which is associated with the edge Y1 → Y3:

P(Y3 | Y1)   Y3 = 1   Y3 = 2   Y3 = 3
Y1 = 1       .31      .07      .62
Y1 = 2       .14      .18      .68
Y1 = 3       .12      .39      .49
Y1 = 4       .40      .17      .42

Annotations on the slide: good education; poor education; tolerable.

The information may be interesting or useful for people who conduct the survey.
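
A short Python sketch of how the table of P(Y3 | Y1) can be queried. The numbers are copied from the table; the helper name is hypothetical and the sketch only reads the table, it does not reproduce any analysis from the slides.

```python
# P(Y3 | Y1) as given in the table: rows are states of Y1, columns states of Y3.
p_y3_given_y1 = {
    1: {1: 0.31, 2: 0.07, 3: 0.62},
    2: {1: 0.14, 2: 0.18, 3: 0.68},
    3: {1: 0.12, 2: 0.39, 3: 0.49},
    4: {1: 0.40, 2: 0.17, 3: 0.42},
}

# Each row is a distribution over Y3, so it should sum to 1 up to rounding.
row_sums = {y1: sum(row.values()) for y1, row in p_y3_given_y1.items()}


def y1_class_most_likely_in(y3_state):
    """State of Y1 whose members most often fall in the given state of Y3."""
    return max(p_y3_given_y1, key=lambda y1: p_y3_given_y1[y1][y3_state])


most_tolerant_y1 = y1_class_most_likely_in(2)    # Y3 = 2: corruption tolerable
most_intolerant_y1 = y1_class_most_likely_in(1)  # Y3 = 1: corruption intolerable
```

Comparing rows in this way is how the relationship between two clusterings is read off an edge of the tree: members of Y1 = 3 have the highest probability (.39) of falling in the "tolerable" class Y3 = 2.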

SLIDE 19

What are latent tree models? Results on real-world data

LTMs for multi-partition clustering: text data

Figure 3: Latent tree model produced by BI on WebKB data

SLIDE 20

What are latent tree models? Results on real-world data

LTMs for multi-partition clustering: text data

Table 3: Part of the clusterings found by the latent tree model. Each latent variable is listed with its associated words.

Y51: wisc, madison, wi, dayton, wisconsin, street
Y48: washington, seattle, wa, box, usa
Y40: intelligence, artificial, learning, ai, knowledge, planning, machine, neural
Y66: programming, oriented, object, language, languages, program, programs, compiler
Y14: april, march, february, conference, symposium, proceedings, january, international
Y10: image, images, visual, pattern, vision, model, developed

Y18: performance, high, architecture, design, hardware, memory, implementation, instruction, processors
Y76: assignment, hours, class, pm, grade, thursday, lecture, instructor, grading
Y35: journal, pp, vol, conference, proceedings, international, acm, symposium, ieee
Y11: management, database, storage, databases, query, large, application, support, system
Y56: ph, research, professor, fax, publications, interests, university, computer, department
Y75: phone, office, email, hall, hours, teaching, appointment, monday, wednesday

SLIDE 21

What are latent tree models? Bridged Islands Algorithm

How to Learn a Latent Tree Model

[Figure: data → learn? → a latent tree with latent variables Y1, Y2, Y3 and observed variables X1 to X7.]

To learn an LTM, we need to determine:
  the number of latent variables;
  the number of states of each latent variable;
  the connections between the variables;
  the model parameters.

SLIDE 22

What are latent tree models? Bridged Islands Algorithm

Bridged Islands Algorithm: a greedy method

BI Algorithm

1: Run UD-test to partition the set of attributes into sibling clusters;
2: For each sibling cluster, learn a latent class model;
3: Determine the connections among the latent variables so that they form a tree;
4: Refine the model.
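
The slide does not say how step 3 links the latent variables. One standard way to connect variables into a tree is the Chow-Liu approach: take a maximum spanning tree over pairwise mutual information. The Python sketch below illustrates that idea only; it is an assumption about how the step might be realized, not a statement of the authors' procedure, and the mutual-information values are made up.

```python
def max_spanning_tree(nodes, weight):
    """Kruskal's algorithm on edges sorted by decreasing weight."""
    root = {v: v for v in nodes}  # union-find parent pointers

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]  # path halving
            v = root[v]
        return v

    tree = []
    for (a, b), _w in sorted(weight.items(), key=lambda kv: -kv[1]):
        ra, rb = find(a), find(b)
        if ra != rb:  # adding the edge does not create a cycle
            root[ra] = rb
            tree.append((a, b))
    return tree


# Hypothetical pairwise mutual information between four latent variables.
mi = {
    ("Y1", "Y2"): 0.30, ("Y1", "Y3"): 0.25, ("Y1", "Y4"): 0.05,
    ("Y2", "Y3"): 0.10, ("Y2", "Y4"): 0.20, ("Y3", "Y4"): 0.02,
}
edges = max_spanning_tree(["Y1", "Y2", "Y3", "Y4"], mi)
```

With these weights the three strongest compatible edges are kept (Y1-Y2, Y1-Y3, Y2-Y4), which bridges the separately learned latent class models ("islands") into a single tree.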

SLIDE 27

Experiment Results

Time Cost

Time cost:

        Data 1 (81×3021)   Data 2 (108×2763)   Data 3 (15×1000)
EAST    6 days             24 days             485 seconds
BI      35 min             69 min              22 seconds

EAST (Chen et al. 2012): previous fastest algorithm for learning general LTMs.
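
The speedups implied by these running times can be worked out with a few lines of Python. The sketch only converts the reported times to seconds and divides; no timing was re-run.

```python
# Running times from the table, converted to seconds.
MIN, DAY = 60, 24 * 60 * 60
times = {
    "Data 1": {"EAST": 6 * DAY, "BI": 35 * MIN},
    "Data 2": {"EAST": 24 * DAY, "BI": 69 * MIN},
    "Data 3": {"EAST": 485, "BI": 22},
}

# Ratio of EAST time to BI time on each data set.
speedup = {name: t["EAST"] / t["BI"] for name, t in times.items()}
for name, s in speedup.items():
    print(f"{name}: BI is about {s:.0f}x faster than EAST")
```

On these figures BI is roughly 247 times faster on Data 1, 501 times faster on Data 2, and 22 times faster on Data 3.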

SLIDE 28

Experiment Results

Clustering Performances

Toy Data (15×1000):

      DK        SAC       OP        EAST      BI
Y1    .78±.00   .73±.04   .73±.02   .91±.00   .91±.00
Y2    .48±.00   .57±.04   .55±.04   .86±.00   .86±.00
Y3    .97±.00   .59±.49   .91±.20   .98±.00   .98±.00

WebKB Data (336×1041): (EAST did not finish in 14 days.)

              DK        SAC       OP        BI
course        .43±.01   .47±.01   .47±.02   .63±.02
faculty       .18±.04   .17±.07   .18±.01   .30±.01
project       .04±.00   .04±.00   .05±.04   .07±.00
student       .18±.00   .20±.01   .20±.01   .25±.01
cornell       .22±.15   .09±.02   .36±.24   .34±.01
texas         .31±.18   .20±.20   .45±.23   .61±.02
washington    .22±.13   .41±.23   .56±.25   .59±.12
wisconsin     .38±.12   .16±.12   .45±.13   .55±.11

Other comparison methods:

Decorrelated-Kmeans (DK) (Jain et al. 2008), Orthogonal Projection (OP) (Cui et al. 2007), Singular Alternative Clustering (SAC) (Qi and Davidson 2009).

SLIDE 30

Conclusion

Conclusion

A latent tree model is a tree-structured Bayesian network.
LTMs are a generalization of latent class models.
LTMs can produce multiple clusterings and find the relationships between them.
The Bridged Islands (BI) algorithm provides a fast way to learn an LTM.
More about latent tree models: http://www.cse.ust.hk/~lzhang/ltm/index.htm

SLIDE 31

Conclusion

References I

Chen T., Zhang N.L., Liu T.F., Poon K.M., Wang Y. (2012) Model-based multidimensional clustering of categorical data. Artificial Intelligence, 2011, pp 2246–2269

Cui Y., Fern X.Z., Dy J.G. (2007) Non-redundant multi-view clustering via orthogonalization. Proceedings of the IEEE International Conference on Data Mining, 2007 (ICDM 2007), pp 133–142

Jain P., Meka R., Dhillon I.S. (2008) Simultaneous unsupervised learning of disparate clusterings. Statistical Analysis and Data Mining, 2008, pp 195–210

Qi Z., Davidson I. (2009) A principled and flexible framework for finding alternative clusterings. Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2009), pp 717–726

SLIDE 33

Conclusion

Inspect Individual Clustering: Information Curves

[Figure: information curves of Y2. Horizontal axis: attributes Income, Age, Education, Sex, Tolerance-C-Bus. Left axis: mutual information (ticks 0.23, 0.47, 0.70, 0.94, 1.17). Right axis: percent (ticks 20 to 100). Two curves: pairwise and cumulative.]

Cumulative curve, for example at Education:
Left: $I(Y_2; \text{Income}, \text{Age}, \text{Education})$; Right: $\dfrac{I(Y_2; \text{Income}, \text{Age}, \text{Education})}{I(Y_2; \text{all attributes})}$

Pairwise curve, for example at Education:
Left: $I(Y_2; \text{Education})$; Right: $\dfrac{I(Y_2; \text{Education})}{I(Y_2; \text{all attributes})}$

SLIDE 34

Conclusion

To partition attributes: a greedy technique

Illustration of UD-test:

[Figure: model m1, with one latent variable Y1(3) over X1 to X5, versus model m2, with two latent variables Y1(3) and Y2(2) over X1 to X5. Labels: highest MI; sibling cluster.]

During each step, choose the attribute X closest to the test set S:

$$I(X; S) = \max_{Z \in S} I(X; Z)$$

UD-test fails when

$$\mathrm{BIC}(m_2 \mid D_p) - \mathrm{BIC}(m_1 \mid D_p) \ge \delta$$
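
The selection rule and the failure condition can be sketched in Python as follows. Mutual information is estimated from empirical counts; the toy data, the function names, and the BIC values passed to `ud_test_fails` are hypothetical. Learning the two models m1 and m2 and computing their BIC scores is outside the scope of this sketch.

```python
from collections import Counter
from math import log


def mutual_information(xs, zs):
    """Empirical mutual information (in nats) between two discrete columns."""
    n = len(xs)
    pxz = Counter(zip(xs, zs))
    px, pz = Counter(xs), Counter(zs)
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (pz[z] / n)))
        for (x, z), c in pxz.items()
    )


def closest_attribute(data, test_set):
    """Attribute X outside S that maximizes I(X; S) = max over Z in S of I(X; Z)."""
    def score(x):
        return max(mutual_information(data[x], data[z]) for z in test_set)

    candidates = [x for x in data if x not in test_set]
    return max(candidates, key=score)


def ud_test_fails(bic_m1, bic_m2, delta):
    """Failure condition from the slide: BIC(m2 | Dp) - BIC(m1 | Dp) >= delta."""
    return bic_m2 - bic_m1 >= delta


# Hypothetical toy data: X3 copies X1; X4 is only weakly related to the others.
data = {
    "X1": [0, 0, 1, 1, 0, 1, 0, 1],
    "X2": [0, 0, 1, 1, 0, 1, 1, 0],
    "X3": [0, 0, 1, 1, 0, 1, 0, 1],
    "X4": [0, 1, 0, 1, 1, 0, 0, 1],
}
picked = closest_attribute(data, test_set={"X1", "X2"})
```

Here X3 is picked because it shares the most information with a member of the test set. In the full procedure the test set would keep growing this way until the BIC comparison signals that a single latent variable no longer suffices.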
