Data-driven Clustering via Parameterized Lloyd's Families

Travis Dick
Joint work with Maria-Florina Balcan and Colin White
Carnegie Mellon University
NeurIPS 2018

Data-driven Clustering

  • Clustering aims to divide a dataset into self-similar clusters.
  • Goal: find some unknown natural clustering.
  • However, most clustering algorithms minimize a clustering cost function.
  • Hope that low-cost clusterings recover the natural clusters.
  • There are many algorithms and many objectives.

How do we choose the best algorithm for a specific application? Can we automate this process?

Learning Model

  • An unknown distribution 𝒟 over clustering instances.
  • Given a sample V₁, …, Vₘ ∼ 𝒟 annotated by their target clusterings.
  • Find an algorithm A that produces clusterings similar to the target clusterings.
  • Want A to also work well for new instances from 𝒟! (One formalization is sketched after this list.)
  • In this work:
  1. Introduce a large parametric family of clustering algorithms, (α, β)-Lloyds.
  2. Give efficient procedures for finding the best parameters on a sample.
  3. Generalization: optimal parameters on the sample are nearly optimal on 𝒟.

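One way to state this goal as an objective. This is a hedged formalization consistent with the bullets above, not notation from the paper: A_{α,β} denotes the algorithm with parameters (α, β), Y(V) the target clustering of instance V, and dist a distance between clusterings.

```latex
% Pick the parameters whose algorithm minimizes the expected disagreement
% with the target clustering over the instance distribution \mathcal{D}.
% (A_{\alpha,\beta}, Y, and dist are assumed notation, not the paper's.)
\[
  \min_{\alpha,\beta} \;
  \mathbb{E}_{V \sim \mathcal{D}}
  \Big[ \operatorname{dist}\big( A_{\alpha,\beta}(V),\, Y(V) \big) \Big]
\]
```
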
Lloyd’s Method

  • Maintains k centers c₁, …, cₖ that define the clusters.
  • Performs local search to improve the k-means cost of the centers (a code sketch follows below):
  1. Assign each point to its nearest center.
  2. Update each center to be the mean of its assigned points.
  3. Repeat until convergence.

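A minimal NumPy sketch of the three steps above. The function name lloyds_method and its signature are my own; it assumes X is an (n, d) array of points and centers an initial (k, d) array, and it is not the implementation from the paper.

```python
import numpy as np

def lloyds_method(X, centers, max_iters=100):
    """Sketch of Lloyd's method: local search on the k-means cost."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iters):
        # 1. Assign each point to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)
        # 2. Update each center to be the mean of its assigned points.
        new_centers = centers.copy()
        for j in range(len(centers)):
            members = X[assignment == j]
            if len(members) > 0:  # keep an empty cluster's center in place
                new_centers[j] = members.mean(axis=0)
        # 3. Repeat until convergence (centers stop moving).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, assignment
```
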
Initial Centers are Important!

  • Lloyd’s method can get stuck if the initial centers are chosen poorly.
  • Initialization is a well-studied problem with many proposed procedures (e.g., k-means++).
  • The best method depends on properties of the clustering instances.

The (", $)-Lloyds Family

Initialization: Parameter "

  • Use &'-sampling (generalizing &(-sampling of )-means++)

" = 0: random initialization " = 2: )-means++ " = ∞: farthest first Local search: Second parameter $ tweaks the local search. Details in paper. Question: For a distribution . over tasks, what parameters give best performance?

  • Choose initial centers from dataset / randomly.
  • Probability that point 0 ∈ / is center 23 is proportional to & 0, 24, … , 2364

'.
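The slides give no code for the initialization, so the following is a minimal NumPy sketch under my own name d_alpha_sampling, for finite α only (α = ∞, farthest-first, would replace the random draw with an argmax). X is assumed to be an (n, d) array of points.

```python
import numpy as np

def d_alpha_sampling(X, k, alpha, rng=None):
    """Sketch of d^alpha-sampling: alpha = 0 gives uniform random
    initialization and alpha = 2 recovers k-means++ seeding."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    centers = [X[rng.integers(n)]]  # first center uniformly at random
    for _ in range(k - 1):
        # Distance from each point to its nearest already-chosen center.
        diffs = X[:, None, :] - np.asarray(centers)[None, :, :]
        dists = np.linalg.norm(diffs, axis=2).min(axis=1)
        # Draw the next center with probability proportional to d(...)^alpha.
        weights = dists ** alpha
        total = weights.sum()
        probs = weights / total if total > 0 else np.full(n, 1.0 / n)
        centers.append(X[rng.choice(n, p=probs)])
    return np.asarray(centers)
```
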

Results

Efficient Tuning on Sample:

  • Efficient algorithm for finding the parameters on a sample with the best agreement to the targets (an illustrative sketch follows at the end of this section).
  • “Algorithmically feasible to tune parameters on a sample.”

Generalization Guarantee:

  • Analyze the intrinsic complexity of the (α, β)-Lloyds family.
  • Show that only roughly O((k log n) / ε²) clustering instances are needed to ensure that the empirical cost for all parameters is within ε of the expected cost.
  • “Parameters tuned on the sample will work well for new instances!”

Experiments: Evaluate the (α, β)-Lloyds family on real and synthetic data: CIFAR-10, MNIST, Mixture of Gaussians, and CNAE-9.

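The paper's efficient tuning procedure is not reproduced here. Purely as an illustration of tuning on a sample, here is a brute-force grid search over α that reuses the d_alpha_sampling and lloyds_method sketches above; the pair-counting agreement score is my stand-in for the paper's cost measure, not its actual definition.

```python
import numpy as np

def agreement(assignment, target):
    """Rand-index-style score: fraction of point pairs that the two
    clusterings treat the same way (together vs. apart)."""
    a, t = np.asarray(assignment), np.asarray(target)
    same_a = a[:, None] == a[None, :]
    same_t = t[:, None] == t[None, :]
    return float((same_a == same_t).mean())

def tune_alpha(instances, k, alphas, rng=None):
    """Brute-force sketch: return the alpha from `alphas` whose seeded
    Lloyd's method best agrees with the targets on the sample.

    instances: list of (X, target) pairs, i.e. V_1, ..., V_m annotated
    with their target clusterings."""
    rng = np.random.default_rng() if rng is None else rng
    best_alpha, best_score = None, -np.inf
    for alpha in alphas:
        score = 0.0
        for X, target in instances:
            centers = d_alpha_sampling(X, k, alpha, rng=rng)
            _, assignment = lloyds_method(X, centers)
            score += agreement(assignment, target)
        if score > best_score:
            best_alpha, best_score = alpha, score
    return best_alpha

# Example (hypothetical data): pick alpha from a small grid.
# best = tune_alpha([(X1, y1), (X2, y2)], k=10, alphas=[0, 1, 2, 4])
```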