Turning Clusters into Patterns: Rectangle-based Discriminative Data Description

Byron J. Gao, Martin Ester
School of Computing Science, Simon Fraser University, Canada
bgao@cs.sfu.ca, ester@cs.sfu.ca

Abstract

The ultimate goal of data mining is to extract knowledge from massive data. Knowledge is ideally represented as human-comprehensible patterns from which end-users can gain intuitions and insights. Yet not all data mining methods produce such readily understandable knowledge, e.g., most clustering algorithms output sets of points as clusters. In this paper, we perform a systematic study of cluster description that generates interpretable patterns from clusters. We introduce and analyze novel description formats leading to more expressive power, motivate and define novel description problems specifying different trade-offs between interpretability and accuracy. We also present effective heuristic algorithms together with their empirical evaluations.

1. Introduction

The ultimate goal of data mining is to discover useful knowledge, ideally represented as human-comprehensible patterns, in large databases. Clustering is one of the major data mining tasks, grouping objects together into clusters that exhibit internal cohesion and external isolation. Unfortunately, most clustering methods simply represent clusters as sets of points and do not generalize them into patterns that provide interpretability, intuitions, and insights.

So far, the database and data mining literature lacks systematic study of cluster description that transforms clusters into human-understandable patterns. For numerical data, hyper-rectangles generalize multi-dimensional points, and a standard approach in database systems is to describe a set of points with a set of isothetic hyper-rectangles [1, 16, 18]. Due to the property of being axis-parallel, such rectangles can be specified in an intuitive manner; e.g., “3.80 ≤ GPA ≤ 4.33 and 0.1 ≤ visual acuity ≤ 0.5 and 0 ≤ minutes in gym per week ≤ 30” intuitively describes a group of “nerds”.

Patterns are models with generalization capacity, as well as templates that can be used to make or to generate things. The rectangle-based expressions are interpretable models; as another practical application, they can also be used as search conditions in SELECT query statements to retrieve (generate) cluster contents, supporting query-based iterative mining [13] and interactive exploration of clusters.

To be understandable, cluster descriptions should appear short in length and simple in format. Sum of Rectangles (SOR), simply taking the union of a set of rectangles, has been the canonical format for cluster descriptions in the database literature. However, this relatively restricted format may produce unnecessarily lengthy descriptions. We introduce two novel description formats, leading to more expressive power yet still simple enough to be intuitively understandable. The SOR− format describes a cluster as the difference of its bounding box and a SOR description of the non-cluster points within the box. The kSOR± format allows describing different parts of a cluster separately, using either SOR or SOR− descriptions. We prove that the kSOR±-based description language is equivalently expressive to the (most general) propositional language [18].

Meanwhile, cluster descriptions should cover cluster contents accurately, which conflicts with the goal of minimizing description length. The Pareto front for the bicriteria problem of optimizing description accuracy and length, as illustrated in Figure 3, offers the best trade-offs between accuracy and interpretability for a given format. To solve the bicriteria problem, we introduce the novel Maximum Description Accuracy (MDA) problem with the objective of maximizing description accuracy at a given description length. The optimal solutions to the MDA problems with different length specifications up to a maximal length constitute the Pareto front. The maximal length to specify (20 in Figure 3) is determined by the optimal solution to the Minimum Description Length (MDL) problem, which aims at finding some shortest perfectly accurate description that covers a cluster completely and exclusively. Previous research only considered the MDL problem; however, perfectly accurate descriptions can become very lengthy and hard to interpret for arbitrary shape clusters. The MDA problem allows trading accuracy for interpretability so that users can zoom in and out to view the clusters.

The description problems are NP-hard. We present heuristic algorithms Learn2Cover for the MDL problem to
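
As an illustration of the description formats named in the introduction, the following minimal Python sketch models an axis-parallel rectangle, tests membership under SOR and SOR−, and renders a description as a search condition for a SELECT statement. It is a sketch under stated assumptions, not the authors' implementation: the class and function names, the column names (gpa, visual_acuity, gym_minutes), and the reading of kSOR± as a plain union of separately described parts are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple

Point = Dict[str, float]


@dataclass(frozen=True)
class Rectangle:
    """Axis-parallel hyper-rectangle: one closed interval per attribute."""
    bounds: Tuple[Tuple[str, float, float], ...]  # (attribute, low, high)

    def covers(self, p: Point) -> bool:
        # A point is covered when every attribute lies inside its interval.
        return all(lo <= p[a] <= hi for a, lo, hi in self.bounds)

    def to_sql(self) -> str:
        # Render the rectangle as a conjunction of range conditions.
        return " AND ".join(f"{lo} <= {a} AND {a} <= {hi}"
                            for a, lo, hi in self.bounds)


def sor_covers(rects: Sequence[Rectangle], p: Point) -> bool:
    """SOR: the union of a set of rectangles."""
    return any(r.covers(p) for r in rects)


def sor_minus_covers(box: Rectangle, holes: Sequence[Rectangle], p: Point) -> bool:
    """SOR-: the bounding box minus a SOR description of non-cluster points."""
    return box.covers(p) and not sor_covers(holes, p)


def ksor_pm_covers(parts, p: Point) -> bool:
    """kSOR+/- (assumed here to be a union of parts).

    Each part is ('+', rects) for SOR or ('-', box, holes) for SOR-.
    """
    for part in parts:
        if part[0] == "+" and sor_covers(part[1], p):
            return True
        if part[0] == "-" and sor_minus_covers(part[1], part[2], p):
            return True
    return False


def sor_to_sql(rects: Sequence[Rectangle]) -> str:
    return " OR ".join(f"({r.to_sql()})" for r in rects)


def sor_minus_to_sql(box: Rectangle, holes: Sequence[Rectangle]) -> str:
    if not holes:
        return f"({box.to_sql()})"
    return f"({box.to_sql()}) AND NOT ({sor_to_sql(holes)})"


# The "nerds" rectangle from the text, with hypothetical column names.
nerds = Rectangle((("gpa", 3.8, 4.33),
                   ("visual_acuity", 0.1, 0.5),
                   ("gym_minutes", 0, 30)))

# A SOR- description: a 10 x 10 bounding box with a 2 x 2 hole in the middle.
box = Rectangle((("x", 0, 10), ("y", 0, 10)))
hole = Rectangle((("x", 4, 6), ("y", 4, 6)))
```

The string returned by `nerds.to_sql()` can be placed after `WHERE` in a SELECT statement to retrieve the described points. In this sketch the length of a description is simply the number of rectangles it uses, so the SOR− example above has length 2 (one box, one hole); that counting convention is also an assumption of the example.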

