Intuitive Parameterization of Distance-Based Clustering Techniques



slide-1
SLIDE 1

Intuitive Parameterization of Distance-Based Clustering Techniques

Altobelli de Brito Mantuan amantuan@ic.uff.br Leandro A. F. Fernandes laffernandes@ic.uff.br

Many Faces of Distances - Campinas, Brazil - 2014

slide-2
SLIDE 2

Conventional Pipeline


Input → Incidence Matrix → Apriori → Mined Rules

Incidence matrix: transactions T1, T2, ..., TN as rows, items A to F as columns; a 1 marks an item present in a transaction.

Mined rules: {A, C, F} → {B}, {A, D} → {F}

The typical example

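Table: transactions T1 to T5 as rows, items Milk, Bread, Butter, and Beer as columns; a 1 marks a purchased item. T1 contains two items, T2 and T3 one each, T4 three, and T5 one.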

Association rule: {Butter, Bread} → {Milk}, with support = 20% and confidence = 50%
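The support and confidence of this rule can be checked with a short sketch. The transaction contents below are hypothetical: the exact placement of the 1s is an assumption, chosen so that the rule reproduces the stated 20% and 50%. The helper names `support` and `confidence` are illustrative only.

```python
from fractions import Fraction

# Hypothetical transactions (assumed item placement, see the note above).
transactions = {
    "T1": {"Bread", "Butter"},
    "T2": {"Beer"},
    "T3": {"Milk"},
    "T4": {"Milk", "Bread", "Butter"},
    "T5": {"Bread"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)

# Rule {Butter, Bread} -> {Milk}
rule_support = support({"Butter", "Bread", "Milk"}, transactions)
rule_confidence = confidence({"Butter", "Bread"}, {"Milk"}, transactions)
print(rule_support, rule_confidence)
```

Under this assumed table the antecedent {Butter, Bread} occurs in two of the five transactions and Milk accompanies it in one of them, which is where the 20% and 50% come from.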

Performance and scalability issues

slide-3
SLIDE 3

Proposed Approach


Input → Incidence Matrix → Dual Scaling → Clustering & Pruning (in Response Style Space) → Apriori → Mined Rules

Incidence matrix: transactions T1, T2, ..., TN as rows, items A to F as columns; a 1 marks an item present in a transaction.

Mined rules: {A, C, F} → {B}, {A, D} → {F}

Euclidean distance is not an intuitive way to parameterize clustering techniques

slide-4
SLIDE 4

Defining the Response Style Space

  • Dual scaling [Nishisato 1993]

▪ Versatile method typically applied in marketing research
▪ Analysis of preferences of human subjects
▪ Graphical representation of

  • Response-style patterns among surveyed transactions
  • The preferences over a set of items
Nishisato, S. "On quantifying different types of categorical data". In: Psychometrika, 58(4), pp. 617-629, 1993.

slide-5
SLIDE 5

Points in Response Style Space

[Figure: points in response-style space; legend: mapped transactions (labeled T1 to T27), mapped items]

A space where transactions and items are represented as points

** Fictitious data

Transaction T is more related to (i.e., prefers) item A than to item E

slide-6
SLIDE 6

Emerging Contexts


A context emerges from the existence of groups of items having similar preferences

A set of transactions with similar preferences

[Figure: response-style space; legend: mapped transactions, mapped items]

Elements in the same context are likely to be part of significant itemsets

** Fictitious data

slide-7
SLIDE 7

Dendrogram of Items in Response Style Space


[Figure: dendrogram; axes: items, Euclidean distance in response-style space]
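Cutting such a dendrogram requires a threshold expressed in raw Euclidean distance. The sketch below is one possible illustration of why that is hard to choose: it runs a plain single-linkage clustering (a swapped-in, generic technique, not necessarily the one used by the authors) on hypothetical 2-D item coordinates, and shows that the same threshold gives different groupings once the coordinates are rescaled. The names `single_linkage_clusters`, `items`, and `scaled` are illustrative.

```python
import math
from itertools import combinations

def single_linkage_clusters(points, max_distance):
    """Merge the two closest clusters until their gap exceeds max_distance."""
    clusters = [[name] for name in points]

    def gap(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[p], points[q]) for p in a for q in b)

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: gap(clusters[ij[0]], clusters[ij[1]]))
        if gap(clusters[i], clusters[j]) > max_distance:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return sorted(sorted(c) for c in clusters)

# Hypothetical item positions; not taken from the slides.
items = {"A": (0.0, 0.0), "B": (0.1, 0.0), "C": (1.0, 1.0),
         "D": (1.1, 1.0), "E": (3.0, 0.0)}
scaled = {name: (10 * x, 10 * y) for name, (x, y) in items.items()}

clusters = single_linkage_clusters(items, max_distance=0.5)
clusters_scaled = single_linkage_clusters(scaled, max_distance=0.5)
clusters_scaled_fixed = single_linkage_clusters(scaled, max_distance=5.0)
```

With the original coordinates a threshold of 0.5 groups A with B and C with D; after multiplying every coordinate by 10 the same threshold merges nothing, and the threshold has to be rescaled to 5.0 to recover the grouping.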

slide-8
SLIDE 8

Using Uncertainty Propagation

  • We treat each item as an independent Bernoulli variable with parameter $q_j$

  • Uncertainty is propagated from input data to the response-style space

  • Advantages

▪ Easy to compute
▪ Domain-independent interpretation
▪ Intuitive parameterization

  • We are investigating two approaches:

▪ Sampling-based approach
▪ First-order error propagation based approach
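As a minimal sketch of the Bernoulli model for a single item: the incidence column below is hypothetical, and estimating the parameter as the item's relative frequency is an assumption, since the slides do not say how the parameter is obtained. The helper `sample_column` is illustrative.

```python
import random

# Hypothetical incidence column for one item: 1 = item present in the transaction.
item_column = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

# Assumption: the Bernoulli parameter is the item's relative frequency.
q_j = sum(item_column) / len(item_column)

# Variance of a Bernoulli(q_j) variable: the input uncertainty to propagate.
variance = q_j * (1.0 - q_j)

def sample_column(q, size, rng):
    """Draw one resampled incidence column with independent Bernoulli(q) entries."""
    return [1 if rng.random() < q else 0 for _ in range(size)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [sample_column(q_j, len(item_column), rng) for _ in range(1000)]

# Average presence over all resampled entries; should be close to q_j.
mean_presence = sum(map(sum, samples)) / (len(samples) * len(item_column))
```

The variance `q_j * (1 - q_j)` is what an error-propagation approach would carry forward analytically, while a sampling approach would map each resampled column into response-style space.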


slide-9
SLIDE 9

Sampling-Based Approach

  • Dual scaling maximizes the squared correlation ratio ($\theta_j^2$) of each column of the input matrix $G$

$$\theta_j^2 = \frac{y_j^T G^T E_s^{-1} G \, y_j}{y_j^T E_d \, y_j}$$

  • We are interested in the coefficients of $y_j$, i.e., the location of the items in response-style space

  • A sample $\hat{t}_j^l$ is produced using

$$\hat{t}_j^l = Y G^{-1} t_j^l, \quad \text{for } Y = [y_1, y_2, \cdots, y_n]$$

  • The samples of a given item define a symmetric distribution in response-style space
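The squared correlation ratio that dual scaling maximizes can be evaluated for any candidate weight vector. The sketch below does only that evaluation; it does not solve the maximization. It assumes that the two diagonal matrices in the ratio hold the row totals and column totals of the incidence matrix, as in the standard dual scaling formulation; the slides do not define them. The matrix and weight vectors are hypothetical.

```python
def squared_correlation_ratio(G, y):
    """Evaluate (y^T G^T R^-1 G y) / (y^T C y) for item weights y.

    Assumption: R and C are diagonal matrices of the row and column
    totals of the incidence matrix G.
    """
    row_totals = [sum(row) for row in G]
    col_totals = [sum(col) for col in zip(*G)]
    # Score of each transaction: sum of the weights of the items it contains.
    row_scores = [sum(g * w for g, w in zip(row, y)) for row in G]
    between = sum(s * s / r for s, r in zip(row_scores, row_totals) if r > 0)
    total = sum(c * w * w for c, w in zip(col_totals, y))
    return between / total

# Hypothetical incidence matrix with two clean groups of items.
G = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Weights that separate the two groups versus weights that mix them.
theta_blocks = squared_correlation_ratio(G, [1, 1, -1, -1])
theta_mixed = squared_correlation_ratio(G, [1, -1, 1, -1])
```

Weights that place co-occurring items at the same coordinate reach the maximum ratio of 1 on this toy matrix, while weights that ignore the co-occurrence structure give 0.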

Input incidence matrix $G$: transactions T1, T2, ..., TN as rows, items A to F as columns; a 1 marks an item present in a transaction.

slide-10
SLIDE 10

Sampling-Based Approach

[Figure: response-style space; legend: mapped samples, mapped items; item labels: Company: 1 to 4, Shift: Morn./After./Night, Oil Loss: Yes/No, Gas Loss: Yes/No, Daylight Sav.: Yes/No]

The samples characterize symmetric distributions around the original item in response-style space

** Fictitious data

slide-11
SLIDE 11

Dendrogram of Overlapping Distributions

[Figure: dendrogram; axes: items, uncertainty]

Sampling is computationally expensive. Defining the proper number of samples is difficult.

Complementary Bhattacharyya distance between distributions
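For reference, the Bhattacharyya distance has a closed form for normal distributions. The sketch below uses the univariate case for brevity, whereas the response-style space is multidimensional. The slides do not define "complementary"; reading it as one minus the Bhattacharyya coefficient is an assumption, and `complementary_overlap` is a hypothetical name.

```python
import math

def bhattacharyya_distance(mean_a, var_a, mean_b, var_b):
    """Bhattacharyya distance between two univariate normal distributions."""
    var_sum = var_a + var_b
    mean_term = 0.25 * (mean_a - mean_b) ** 2 / var_sum
    spread_term = 0.5 * math.log(var_sum / (2.0 * math.sqrt(var_a * var_b)))
    return mean_term + spread_term

def complementary_overlap(mean_a, var_a, mean_b, var_b):
    """Assumed reading of 'complementary': 1 - Bhattacharyya coefficient.

    0 means the two distributions coincide; values near 1 mean they
    barely overlap.
    """
    coefficient = math.exp(-bhattacharyya_distance(mean_a, var_a, mean_b, var_b))
    return 1.0 - coefficient

same = complementary_overlap(0.0, 1.0, 0.0, 1.0)
near = complementary_overlap(0.0, 1.0, 2.0, 1.0)
far = complementary_overlap(0.0, 1.0, 10.0, 1.0)
```

Under this reading the measure is bounded between 0 and 1 regardless of the scale of the space, which is the kind of domain-independent threshold the deck argues for.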

slide-12
SLIDE 12

First-Order Error Propagation

$y = g(c_1, c_2, \cdots, c_l)$

Function that maps items to response-style space

$\bar{y} = g(\bar{c}_1, \bar{c}_2, \cdots, \bar{c}_l)$

Compute the resulting expectation of the distribution

$\Sigma_Y \approx J_Y \, \Sigma_C \, J_Y^T$

Compute the resulting covariance matrix of the distribution

($J_Y$: Jacobian matrix of $g$, $\Sigma$: covariance matrix)
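A minimal sketch of first-order error propagation follows. The mapping `toy_map` is a hypothetical stand-in for the function that maps items to response-style space (the real mapping is dual scaling), and the Jacobian is approximated by central differences, which is an implementation choice not taken from the slides. The input covariance assumes independent Bernoulli variables, so it is diagonal with entries q(1 - q).

```python
def numerical_jacobian(g, c_mean, step=1e-6):
    """Central-difference approximation of the Jacobian of g at c_mean."""
    n_out = len(g(c_mean))
    jac = [[0.0] * len(c_mean) for _ in range(n_out)]
    for k in range(len(c_mean)):
        hi, lo = list(c_mean), list(c_mean)
        hi[k] += step
        lo[k] -= step
        g_hi, g_lo = g(hi), g(lo)
        for r in range(n_out):
            jac[r][k] = (g_hi[r] - g_lo[r]) / (2.0 * step)
    return jac

def propagate(g, c_mean, c_cov):
    """Return the propagated mean g(c_mean) and covariance J * c_cov * J^T."""
    J = numerical_jacobian(g, c_mean)
    n_out, n_in = len(J), len(c_mean)
    JS = [[sum(J[r][k] * c_cov[k][m] for k in range(n_in)) for m in range(n_in)]
          for r in range(n_out)]
    y_cov = [[sum(JS[r][m] * J[s][m] for m in range(n_in)) for s in range(n_out)]
             for r in range(n_out)]
    return g(c_mean), y_cov

def toy_map(c):
    """Hypothetical mapping used only to exercise the formula."""
    return [c[0] + c[1], c[0] * c[1]]

q = [0.5, 0.2]  # hypothetical Bernoulli parameters of two items
c_cov = [[q[0] * (1 - q[0]), 0.0],
         [0.0, q[1] * (1 - q[1])]]  # independent items: diagonal covariance

y_mean, y_cov = propagate(toy_map, q, c_cov)
```

Unlike sampling, this needs only one evaluation of the mapping per perturbed input, at the cost of a linear approximation around the mean.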


slide-13
SLIDE 13

Final Remarks

  • Ongoing work
  • Contributions

▪ A divide-and-conquer approach to alleviate the combinatorial issue of association rule learning
▪ The use of uncertainty propagation to develop an intuitive parameterization for distance-based clustering

  • Easy to compute, domain-independent interpretation, intuitive
  • Synthetic and real databases

▪ ~1000 items and ~3000 transactions
