
Clustering Data Streams

Sudipto Guha*    Nina Mishra†    Rajeev Motwani‡    Liadan O'Callaghan§

Abstract

We study clustering under the data stream model of computation where, given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The data stream model is relevant to new classes of applications involving massive data sets, such as web click stream analysis and multimedia data analysis. We give constant-factor approximation algorithms for the k-Median problem in the data stream model of computation in a single pass. We also show negative results implying that our algorithms cannot be improved in a certain sense.

1 Introduction

A data stream is an ordered sequence of points that can be read only once or a small number of times. Formally, a data stream is a sequence of points x_1, ..., x_i, ..., x_n, read in increasing order of the indices i. The performance of an algorithm that operates on data streams is measured by the number of passes the algorithm must make over the stream, when constrained in terms of available memory, in addition to the more conventional measures. The data stream model is motivated by emerging applications involving massive data sets; e.g., customer click streams, telephone records, large sets of web pages, multimedia data, and sets of retail chain transactions can all be modeled as data streams.

*Department of Computer Science, Stanford University, CA 94305. Email: sudipto@cs.stanford.edu. Research supported by an IBM Research Fellowship and NSF Grant IIS-9811904.
†Hewlett-Packard Laboratories, Palo Alto, CA 94304. Email: nmishra@hpl.hp.com.
‡Department of Computer Science, Stanford University, CA 94305. Email: rajeev@cs.stanford.edu. Research supported in part by NSF Grant IIS-9811904.
§Department of Computer Science, Stanford University, CA 94305. Email: loc@cs.stanford.edu. Research supported in part by an NSF Graduate Fellowship, ARO MURI Grant DAAH04-96-1-0007, and NSF Grant IIS-9811904.

These data sets are far too large to fit in main memory and are typically stored in secondary storage devices, making access, particularly random access, very expensive. Data stream algorithms access the input only via linear scans, without random access, and require only a few (hopefully, one) such scans over the data. Furthermore, since the amount of data far exceeds the amount of space (main memory) available to the algorithm, it is not possible for the algorithm to "remember" too much of the data scanned in the past. This scarcity of space necessitates the design of a novel kind of algorithm that stores only a summary of past data, leaving enough memory for the processing of future data. We remark that this is not the same as the model of online algorithms.

Clustering has recently been widely studied across several disciplines, but only a few of the techniques developed scale to support clustering of very large data sets. A common formulation of clustering is the k-Median problem: find k centers in a set of n points so as to minimize the sum of distances from data points to their closest cluster centers. Most algorithms for k-Median have large space requirements and involve random access to the input data. We give constant-factor approximation algorithms for the k-Median problem that naturally fit into this data stream setting. Our algorithms make a single pass over the data and use small space. We first give a randomized constant-factor approximation algorithm for k-Median, which makes one pass over the data using n^ε memory (for ε < 1) and requires only Õ(nk) time. We also prove that any deterministic k-Median algorithm that achieves a constant-factor approximation cannot run in time less than Ω(nk). Finally, we give a deterministic Õ(nk)-time, polylog(n)-approximation single-pass algorithm that uses n^ε space, for ε < 1.

Related Work on Data Streams One of the first results on data streams was that of Munro and Paterson [16], who studied the space requirement of selection and sorting as a function of the number of passes over the data. The model was formalized by Henzinger, Raghavan, and Rajagopalan [7], who gave several algorithms and complexity results related to graph-theoretic problems and their applications. Other recent results on data streams can be found in [4, 13, 14, 6].

Related Work on Clustering In this paper we shall consider models in which clusters have a distinguished point, or "center." In the k-Median problem, the objective is to minimize the average distance from data points to their closest cluster centers. The 1-Median problem was first posed by Weber [17]. In the k-Center problem, the objective is to minimize the maximum radius of a cluster. The above problems are all NP-hard, so we will be concerned with approximation algorithms. We will assume that the domain space of points is discrete, i.e., the cluster centers must be among the input points. The continuous case is related to the discrete problem by small factors (see Theorem 2.1). Throughout the paper we also assume that the input points are drawn from a metric space. In the recent past, several approximation algorithms have been proposed for the k-Median problem [3, 10, 2]. These algorithms require O(n^2) space to compute the dual variables or primal constraints. We will be interested in algorithms which use more than k medians but run in linear space [12, 2, 9]. Charikar, Chekuri, Feder, and Motwani [1] gave a constant-factor algorithm for the incremental k-Center problem, which is also a single-pass algorithm requiring O(nk log k) time and O(k) space. There is a large difference, however, between the k-Center and the k-Median problem, since a set of k + 1 suitably separated points provides a lower bound for the k-Center problem. These points can be thought of as a proof of the goodness of the clustering. For the k-Median problem, allowing weighted points, no such succinct proof exists, and the optimization problem takes on a more global character.

Our Results We begin by giving an algorithm that requires small space, and then later address the issue of clustering in one pass. In Section 2 we give a simple algorithm based on divide-and-conquer that achieves a constant-factor approximation in small space. Elements of the algorithm and its analysis form the basis for the constant-factor algorithm given in Section 3. This algorithm runs in time O(n^{1+ε}), uses O(n^ε) memory, and makes a single pass over the data. Next, in Section 4, using randomization, we show how to reduce the running time to Õ(nk) without requiring more than a single pass. In Section 5 we show that it is not possible to obtain any bounded approximation ratio in deterministic o(nk) time; we also show how to achieve a poly-log n approximation ratio in a single pass in deterministic Õ(nk) time.

2 Clustering in Small Space

One of the first requisites of clustering a data stream is that the computation be carried out in small space. Our first goal will be to show that clustering can be carried out in small (n^ε, for n data points) space, without being concerned with the number of passes. Subsequently we will see how to implement the algorithm in one pass.

In order to cluster in small space, we investigate algorithms that examine the data in a piecemeal fashion. In particular, we study the performance of a divide-and-conquer algorithm, called Small-Space, that divides the data into pieces, clusters each of these pieces, and then again clusters the centers obtained (where each center is weighted by the number of points closer to it than to any other center). We show that this piecemeal approach is good in the following sense: if we had a constant-factor approximation algorithm, running it in divide-and-conquer fashion would still yield a (slightly worse) constant-factor approximation. We then propose another algorithm (Smaller-Space) that is similar to the piecemeal approach except that instead of reclustering only once, it repeatedly reclusters weighted centers. For this algorithm, we prove that if we recluster a constant number of times, a constant-factor approximation is still obtained, although, as expected, the constant factor worsens with each successive reclustering. The advantage of Small(er)-Space is that we sacrifice somewhat the quality of the clustering approximation to obtain an algorithm that uses much less memory.

2.1 Simple Divide-and-Conquer and Separability Theorems

We start with the version of the algorithm that reclusters only once. Elements of the algorithm and its analysis will be used in a black-box manner in the algorithms in the rest of the paper.

Algorithm Small-Space(S)

1. Divide S into ℓ disjoint pieces X_1, ..., X_ℓ.

2. For each i, find O(k) centers in X_i. Assign each point in X_i to its closest center.

3. Let X′ be the O(ℓk) centers obtained in (2), where each center c is weighted by the number of points assigned to it.

4. Cluster X′ to find k centers.
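To make the control flow concrete, here is a minimal Python sketch of Small-Space on one-dimensional points (distance |x - y|). The helper cluster_O_k is a hypothetical stand-in for the bicriteria subroutines the paper actually uses (primal-dual or local search); it is a simple weighted farthest-point heuristic, included only so the sketch runs end to end.

def assign(points, weights, centers):
    # Assign each weighted point to its nearest center; return (cost, labels).
    cost, labels = 0.0, []
    for p, w in zip(points, weights):
        d, j = min((abs(p - c), j) for j, c in enumerate(centers))
        cost += w * d
        labels.append(j)
    return cost, labels

def cluster_O_k(points, weights, k):
    # Hypothetical stand-in for the paper's bicriteria subroutine: a weighted
    # farthest-point traversal (really a k-Center heuristic), used here only
    # to keep the sketch self-contained and runnable.
    centers = [points[0]]
    while len(centers) < min(k, len(points)):
        i = max(range(len(points)),
                key=lambda i: weights[i] * min(abs(points[i] - c) for c in centers))
        centers.append(points[i])
    return centers

def small_space(points, k, l):
    # Steps 1-4 of Algorithm Small-Space.
    pieces = [points[i::l] for i in range(l)]            # Step 1: l disjoint pieces
    centers, weights = [], []
    for piece in pieces:                                 # Step 2: O(k) centers per piece
        if not piece:
            continue
        cs = cluster_O_k(piece, [1] * len(piece), k)
        _, labels = assign(piece, [1] * len(piece), cs)
        for j, c in enumerate(cs):                       # Step 3: weight = points assigned
            centers.append(c)
            weights.append(labels.count(j))
    return cluster_O_k(centers, weights, k)              # Step 4: recluster X'

Only abs(p - c) would need to change for a different metric space; the divide/weight/recluster structure is metric-agnostic.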


Since we are interested in clustering in small space, ℓ will be set so that both the pieces X_i and the set of centers X′ fit in main memory, if possible. If S is very large, no such ℓ may exist; we will address this issue later.

Before analyzing algorithm Small-Space, we describe the relationship between the discrete and continuous clustering problems. The following is folklore and is included for completeness.

Theorem 2.1 Given an instance of the k-Median problem with a solution of cost C, where the medians may not belong to the set of input points, there exists a solution of cost 2C where all the medians belong to the set of input points.

Proof: Consider the solution of cost C, and let the points j_1, ..., j_p be assigned to median i. Since median i may not be in the input, consider the point j_1 which is closest to i as the median (instead of i). Thus the assignment distance of every point j_r at most doubles, since c_{j_r j_1} can be bounded by c_{j_r i} + c_{j_1 i} (where c_{xy} denotes the distance from x to y). Over all n points in the original set, the assignment distance can at most double, summing to at most 2C. □

The following separability theorem sets the stage for a divide-and-conquer algorithm. This theorem carries over to other clustering metrics, such as the sum of squared distances.

Theorem 2.2 Consider any set of n points arbitrarily partitioned into disjoint sets X_1, ..., X_ℓ. The sum of the optimum solution values for the k-Median problem on the ℓ sets of points is at most twice the cost of the optimum k-Median problem solution for all n points.¹

Proof: Consider the medians used for the optimum k-Median solution. If each partition uses these medians, the cost of the solution will be exactly the cost of the optimal solution. This follows since the objective function for k-Median is the sum of distances to the nearest median for every point. However, the set of medians chosen by the optimum solution need not be present in a partition. In the case where the medians can be arbitrary points in the space, this proves the theorem. In case we have to choose the medians from the given set of points, the medians used by the optimum solution will not be available to every partition; in this case we use Theorem 2.1 to construct a solution which is at most 2 times the cost of the optimum solution. □

¹The factor 2 is avoided in the Euclidean case if we allow that medians can be arbitrary points in space, rather than requiring that they be points from the original data set.

Next we show that the new instance, where all the points i that have median i′ shift their weight to the point i′ (i.e., the weighted O(ℓk) centers X′ in Step 2 of Algorithm Small-Space), has a good feasible clustering solution. Notice that the set of points in the new instance is much smaller and may not even contain the medians of the optimum solution.

Theorem 2.3 If the sum of the costs of the ℓ optimum k-Median solutions for X_1, ..., X_ℓ is C, and if C* is the cost of the optimum k-Median solution for the entire set S, then there exists a solution of cost at most 2(C + C*) to the new weighted instance X′.

Proof: As in the proof of the previous theorem, we will consider the k medians in the optimum continuous solution. Let the median to which i′ is assigned in the optimum continuous solution for X′ be τ(i′), and let d_{i′} be the number of points assigned to the median i′. The cost of X′ can be expressed as Σ_{i′} c_{i′τ(i′)} d_{i′} (where again c_{xy} is the distance from x to y). Each point i′ in the new instance X′ can be viewed as a collection of points, namely those points i assigned to the median i′; thus the cost of X′ can also be expressed as Σ_i c_{i′τ(i′)}. Let the median to which i is assigned in the optimum continuous solution for S be σ(i). The cost of the new instance X′ is no more than Σ_i c_{i′σ(i)}, since τ is optimum for X′. This sum is in turn bounded by Σ_i (c_{i′i} + c_{iσ(i)}). The first term, summed over all points i, evaluates to C, and the second term evaluates to C*. Thus we have exhibited an assignment to the medians of the optimal solution at cost C + C*. Using Theorem 2.1, the theorem follows. (Note that the theorem can also be shown to hold when the original points in S are weighted.) □
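The chain of bounds in the proof, restated in display form (same notation; i ranges over the original points and i′ is the first-level median of i):

\[
\mathrm{cost}(X') \le \sum_i c_{i'\sigma(i)} \le \sum_i \bigl( c_{i'i} + c_{i\sigma(i)} \bigr) = C + C^*,
\]

after which Theorem 2.1 turns this continuous assignment into one using input points at a further factor of 2, giving 2(C + C*).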

We now show that if we run a bicriteria (a, b)-approximation algorithm (where at most ak medians are output with cost at most b times the optimum k-Median solution) in Step 2 of Algorithm Small-Space, and we run a c-approximation algorithm in Step 4, then the resulting approximation by Small-Space can be suitably bounded.

Theorem 2.4 The algorithm Small-Space has an approximation factor of 2c(1 + 2b) + 2b.

Proof: Let the optimal k-Median solution be of cost C*. Then the cost C of the solution at the end of the first stage is at most 2bC*. This is true due to Theorem 2.2, since we are adding the cost of the solutions to each partition, each of which is a b-approximation


for that partition. Now, by Theorem 2.3, there exists a solution to the k-Median problem on the modified instance of cost 2(C + C*).² Since we have a c-approximation, we have a solution of cost 2c(1 + 2b)C* to the modified instance. The theorem is obtained by summing the two costs. □

²Again, the factor 2 is avoided if we use the Euclidean distance and allow medians to be arbitrary points.

The black-box nature of this algorithm will allow us to devise a new divide-and-conquer algorithm.

2.2 Divide-and-Conquer Strategy

We now generalize Small-Space so that the algorithm recursively calls itself on a successively smaller set of weighted centers.

Algorithm Smaller-Space(S, i)

1. Divide S into ℓ disjoint pieces X_1, ..., X_ℓ.

2. For each j, find O(k) centers in X_j. Assign each point in X_j to its closest center.

3. Let X′ be the O(ℓk) centers obtained in (2), where each center c is weighted by the number of points assigned to it.

4. Call Algorithm Smaller-Space(X′, i − 1).
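A recursive rendering of Smaller-Space, under the same assumptions as the Small-Space sketch above (one-dimensional points; cluster_O_k and assign are the hypothetical helpers defined there):

def smaller_space(points, weights, k, l, i):
    # Algorithm Smaller-Space: recluster weighted centers i times, then
    # return the final k centers. Reuses cluster_O_k/assign from the
    # Small-Space sketch.
    if i == 0 or len(points) <= k:                      # recursion bottoms out
        return cluster_O_k(points, weights, k)
    centers, cweights = [], []
    for s in range(l):                                  # Step 1: l disjoint pieces
        piece, pw = points[s::l], weights[s::l]
        if not piece:
            continue
        cs = cluster_O_k(piece, pw, k)                  # Step 2: O(k) centers per piece
        _, labels = assign(piece, pw, cs)
        for j, c in enumerate(cs):                      # Step 3: weight by assigned weight
            centers.append(c)
            cweights.append(sum(w for lb, w in zip(labels, pw) if lb == j))
    return smaller_space(centers, cweights, k, l, i - 1)  # Step 4: recurse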

We can claim the following.

Theorem 2.5 For constant i, Algorithm Smaller-Space(S, i) gives a constant-factor approximation to the k-Median problem.

Proof: Assume that the approximation factor for the jth level is A_j. From Theorem 2.2 we know that the cost of the solution at the first level is 2b times optimal. From Theorem 2.4 we get that the approximation factor A_j satisfies the simple recurrence A_j = 2A_{j−1}(2b + 1) + 2b. The solution of this recurrence is c · (2(2b + 1))^j, which is O(1) given that j is a constant. □
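Unrolling the recurrence makes the geometric growth explicit (a routine expansion, added for completeness):

\[
A_j = 2(2b+1)A_{j-1} + 2b = \bigl(2(2b+1)\bigr)^2 A_{j-2} + 2b\bigl(1 + 2(2b+1)\bigr) = \cdots = O\bigl((2(2b+1))^j\bigr),
\]

since the additive 2b terms form a geometric series dominated by the leading term.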

Since the intermediate medians in X′ must be stored in memory, the number of subsets ℓ that we partition S into is limited. In particular, if the size of main memory is M, then we would need to partition S into ℓ subsets so that each subset fits in main memory, i.e., n/ℓ ≤ M, and so that the weighted ℓk centers in X′ also fit in main memory, i.e., ℓk ≤ M. Such an ℓ may not always exist. In the next section we will see a way to get around this problem; in fact, we will be able to implement the hierarchical scheme more cleverly and obtain a clustering algorithm for an interesting model of computation. We have two themes to develop this idea. The first is to do away with the storage of the intermediate medians, and the second is to design a more interesting recursive algorithm. We take up the former and relegate the second to a later section.

3 The Data Stream Model

Under the data stream model, computation takes place within bounded space M, and the data can only be accessed via linear scans (i.e., a data point can be seen only once in a scan, and points must be viewed in order).

In this section we will modify the multi-level algorithm to operate on data streams. We will present a one-pass O(1)-approximation in this model, assuming that the bounded memory M is not too small; more specifically, M = n^ε, where n denotes the size of the stream. This model and the line of analysis have similarities to incremental clustering and online models; however, our approach will be a bit different. We will maintain a forest of assignments, which we will complete to k trees, with all the nodes in a tree assigned to the median denoted by the root of the tree. First we will show how to solve the problem of storing intermediate medians; next we will inspect the space requirements and running time.

Data Stream Algorithm To achieve this, we will modify our multi-level algorithm slightly. The algorithm will be the following (a code sketch follows the analysis below):

1. Input the first m points; use a bicriteria algorithm to reduce these to O(k) (say 2k) points. As usual, the weight of each intermediate median is the number of points assigned to it in the bicriteria clustering. (Assume m is a multiple of 2k.) This requires O(f(m)) space, which for a primal-dual algorithm can be O(m²). We will see an O(mk)-space algorithm later.

2. Repeat the above till we have seen m²/(2k) of the original data points. At this point we have m intermediate medians.

3. Cluster these m first-level medians into 2k second-level medians and proceed.

4. In general, maintain at most m level-i medians, and, on seeing m, generate 2k level-(i + 1) medians, with the weight of a new median as the sum of the weights of the intermediate medians assigned to it.

5. When we have seen all the original data points (or we want to have a clustering of the points we have seen so far), we cluster all the intermediate medians into k final medians.

Note that this algorithm is identical to the multi-level algorithm described before. The number of levels required by this algorithm is at most O(log(n/m)/log(m/k)). If we have k ≪ m and m = O(n^ε) for some constant ε < 1, we have an O(1)-approximation. Using linear programming or primal-dual algorithms we will have m = O(√M), where M is the memory size (ignoring factors due to maintaining intermediate medians of different levels). We argued that the number of levels would be a constant when m = n^ε, and hence when M = n^{2ε} for some ε < 1/2.
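A compact sketch of the pipeline just described, under the same assumptions as the earlier snippets (one-dimensional points; cluster_O_k and assign are the hypothetical helpers from the Small-Space sketch, standing in for the bicriteria subroutine):

def stream_k_median(stream, k, m):
    # One-pass multi-level clustering of an iterable of points.
    # levels[0] buffers raw points; levels[i] for i >= 1 buffers the
    # (point, weight) pairs that are the level-i intermediate medians.
    levels = [[]]

    def compress(pairs):
        # Reduce a full batch to 2k weighted medians.
        pts = [p for p, _ in pairs]
        ws = [w for _, w in pairs]
        cs = cluster_O_k(pts, ws, 2 * k)
        _, labels = assign(pts, ws, cs)
        wsum = [0] * len(cs)
        for lb, w in zip(labels, ws):
            wsum[lb] += w
        return list(zip(cs, wsum))

    for x in stream:
        levels[0].append((x, 1))
        i = 0
        while len(levels[i]) >= m:            # a level is full: push 2k medians up
            medians = compress(levels[i])
            levels[i] = []
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].extend(medians)
            i += 1

    pairs = [pw for level in levels for pw in level]   # final clustering on demand
    return cluster_O_k([p for p, _ in pairs], [w for _, w in pairs], k)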

Linear Space Clustering The approximation quality which we can prove (and, intuitively, the actual quality of clustering obtained on an instance) depends heavily on the number of levels we have. From this perspective it is profitable to use a space-efficient algorithm. We can use the local search algorithm in [2] to provide a bicriteria approximation in space linear in m, the number of points clustered at a time. The advantage of this algorithm is that it maintains only an assignment and therefore uses linear space. The complication is that, for this algorithm to achieve a bounded bicriteria approximation, we need to set a "cost" for each median used, so that we are penalized if many more than k medians are used. The algorithm solves a facility location problem after setting the cost of each median to be used. This can be done by guessing the cost in powers of (1 + γ) for some 0 < γ < 1/6 and choosing the best solution with at most 2k medians; a sketch of the guessing loop appears below. In the last step, to get k medians, we use a two-step process to reduce the number of medians to 2k and then use [10, 2] to reduce to k. This allows us to cluster with m = M points at a time, provided k² ≤ M.
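A minimal sketch of that guessing loop. Here facility_location(points, weights, z) is a hypothetical subroutine, standing in for the local search algorithm of [2], that for a uniform facility cost z returns a pair (medians, assignment_cost):

def bicriteria_by_guessing(points, weights, k, gamma, z_lo, z_hi):
    # Guess the per-median cost in powers of (1 + gamma), 0 < gamma < 1/6,
    # and keep the cheapest solution that opens at most 2k medians.
    best = None
    z = z_lo
    while z <= z_hi:
        medians, cost = facility_location(points, weights, z)  # hypothetical helper
        if len(medians) <= 2 * k and (best is None or cost < best[1]):
            best = (medians, cost)
        z *= 1 + gamma                    # geometrically spaced guesses
    return best                           # None if no guess opened <= 2k medians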

The Running Time The running time of this clustering is dominated by the contribution from the first level. The local search algorithm is quadratic, and the total running time is O(n^{1+ε}) where M = n^ε. We argued before, however, that ε will not be very small, and hence the approximation factor which we can prove will remain small. We therefore claim the following theorem.

Theorem 3.1 We can solve the k-Median problem on a data stream with time O(n^{1+ε}) and space Θ(n^ε), up to a factor 2^{O(1/ε)}.

We have two avenues to pursue. The running time is lower-bounded by the space we require, and we improve this bottleneck to get linear-space clustering; but first, to achieve scalability, our goal will be to get clustering in time Õ(nk). This will mean an amortized update of O(k·polylog(n)). In the next section we will motivate how to achieve this, and provide evidence that ours is a hard bound for the running time of a clustering algorithm.

The second issue is to present an algorithm with an approximation guarantee which is polynomial in 1/ε. We will show how to achieve this in Section 5.

4 Clustering Data Streams in Õ(nk) Time

Let us recall the algorithm we have developed so far. We have k² ≪ M, and we are applying an alternate implementation of a multi-level algorithm. We are clustering m = O(M) points at a time (assuming M = O(n^ε) for constant ε > 0) and storing 2k medians to "compress" the description of these data points. We use the local search-based algorithm in [2]. We keep repeating this procedure till we see m of these descriptors, or intermediate medians, and compress them further into 2k. Finally, when we are required to output a clustering, we compress all the intermediate medians (over all the levels there will be at most O(M) of them) and get O(k) penultimate medians, which we cluster into exactly k using the primal-dual algorithm as in [10, 2].

4.1 Earlier Work on Clustering in Õ(nk) Time

We will use the results in [9] on metric space algorithms that are subquadratic. The algorithm as defined will consist of two passes and will have constant probability of success; for high-probability results, the algorithm will make O(log n) passes. As stated, the algorithm will only work if the original data points are unweighted. Consider the following algorithm (a code sketch follows the list):

1. Draw a sample of size s = √(nk).

2. Find k medians from these s points using the primal-dual algorithm in [10].

3. Assign each of the n original points to its closest median.

4. Collect the n/s points with the largest assignment distance.

5. Find k medians from among these n/s points.

6. We have at this point 2k medians.
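In the style of the earlier sketches (one-dimensional points; cluster_O_k stands in for the primal-dual subroutine of [10]; the sample size s = √(nk) is an assumption, consistent with the Õ(nk) bound), this is roughly:

import math
import random

def sampled_2k_medians(points, k):
    # Two-phase sampling scheme of [9], following the steps listed above.
    n = len(points)
    s = min(n, max(k, math.isqrt(n * k)))                    # step 1
    sample = random.sample(points, s)
    med1 = cluster_O_k(sample, [1] * s, k)                   # step 2
    dist = [min(abs(p - c) for c in med1) for p in points]   # step 3
    worst = sorted(range(n), key=lambda i: -dist[i])[:max(1, n // s)]  # step 4
    outliers = [points[i] for i in worst]
    med2 = cluster_O_k(outliers, [1] * len(outliers), k)     # step 5
    return med1 + med2                                       # step 6: 2k medians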

Theorem 4.1 [9] The above algorithm gives an O(1) approximation with 2k medians, with constant probability.


The above algorithm³ provides a constant-factor approximation for the k-Median problem (using 2k medians) with constant probability; repeating the experiment O(log n) times gives high probability. We will not run this algorithm by itself, but as a substep in our algorithm. The algorithm requires Õ(nk) time and space; using it with the local search tradeoff results in [2] reduces the space requirement to O(√(nk)). Alternate sampling-based results exist for the k-Median measure that do extend to the weighted case [15]; however, these results assume Euclidean space.

4.2 Extension to the Weighted Case

We need this sampling-based algorithm to work on weighted input. It is necessary to draw a random sample based on the weights of the points; otherwise the medians with respect to the sample do not convey much information. The simple idea of sampling points with respect to their weights does not help, however. The philosophy of the above method is that a random sample will be reasonable for most points, that there will not be many outliers (at most n divided by the sample size, up to constants), and that in the second phase it is sufficient to account for these outliers. If the points have weights, however, in the first step we may only eliminate k points. Therefore sampling according to weights does not carry through. Contrast this with the algorithm in [5], where the points were in Euclidean space and the measure was the sum of squares of distances; both of these facts were crucial for their algorithm.

We suggest the following modification. The basic idea is scaling: we round the weights to the nearest power of (1 + ε) for ε > 0. Within each group we can then ignore the weights and lose a (1 + ε) factor. Since we have an O(nk) algorithm, summing over all groups, the running time is still Õ(nk). The correct way to implement this is to compute the exponent values of the weights and use only those groups which exist; otherwise the running time would depend on the largest weight. (A sketch of this grouping appears below.)
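A sketch of that grouping, reusing sampled_2k_medians from above (hypothetical helpers as before; how the per-group candidate medians are merged back down to 2k is elided, since the text does not spell it out):

import math
from collections import defaultdict

def weighted_2k_medians(points, weights, k, eps=0.5):
    # Round each (positive) weight to a power of (1 + eps) and bucket points
    # by that exponent; only exponents that actually occur get a bucket.
    groups = defaultdict(list)
    for p, w in zip(points, weights):
        groups[math.floor(math.log(w, 1 + eps))].append(p)
    medians = []
    for pts in groups.values():            # each bucket: an unweighted subproblem
        medians.extend(sampled_2k_medians(pts, k))
    return medians                         # merging down to 2k is elided here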

4.3 The Full Algorithm

We will use this sampling-based scheme to develop a one-pass, Õ(nk)-time algorithm that requires only O(n^ε) space.

³The algorithm presented here, without the last step, is essentially the same as in [9]; however, the primal-dual algorithm, which requires O(n²) time to solve the k-Median problem, was not known when that result was published. The result proved therein used the O(n²k²) local search algorithm in [12], which was a bicriteria approximation.

• Input the first O(M/k) points, and use the randomized algorithm above to cluster these to 2k intermediate median points.

• Use a local search algorithm to cluster O(M) intermediate medians of level i to 2k medians of level i + 1.⁴

• Use the primal-dual algorithm of Jain and Vazirani [10]

to cluster the final O(k) medians to k medians.

Notice that the algorithm remains one pass, since the O(log n) iterations of the randomized subalgorithm just add to the running time. Thus, over the first phase, the contribution to the running time is Õ(nk). Over the next level we have 2nk²/M points, and if we cluster O(M) of these at a time, taking O(M²) time, the total time for the second phase is Õ(nk) again. The contribution from the rest of the levels decreases geometrically, so the running time is Õ(nk). As shown in the previous sections, the number of levels in this algorithm is O(log(n/m)/log(m/k)) = O(1), and so we have a constant-factor approximation for k ≪ M = Θ(n^ε) for some small ε.

⁴We could have used the sampling-based algorithm in the intermediate steps as well; however, such a recursive, sampling-based algorithm would have greater errors, in theory and very likely in practice.

Thus we claim the following theorem.

Theorem 4.2 The k-Median problem has a constant-factor approximation algorithm running in time O(nk log n), in one pass over the data set, using n^ε memory, for small k.

5 Lower Bounds and Deterministic Algorithms

In this section we explore whether our algorithms can be sped up further and whether randomization is needed. For the former, note that we have a clustering algorithm that requires time Õ(nk), and a natural question is whether we could have done better. We show that we could not have done much better, since a deterministic lower bound for k-Median is Ω(nk); thus, modulo randomization, our time bounds essentially match the lower bound. For the latter, we show one way to get rid of randomization that yields a single-pass, small-memory k-Median algorithm with a poly-log n approximation. Thus we do also have a deterministic algorithm, but with more loss of clustering quality.

5.1 Lower Bounds

We now show that any constant-factor deterministic approximation algorithm requires Ω(nk) time. We


measure the running time by the number of times the algorithm queries the distance function.

We consider a restricted family of sets of points where there exists a k-clustering with the property that the distance between any pair of points in the same cluster is 0 and the distance between any pair of points in different clusters is 1. Since the optimum k-clustering has value 0 (where the value is the sum of distances from points to their nearest centers), any algorithm that does not discover the optimum k-clustering does not find a constant-factor approximation.

Note that the above problem is equivalent to the following Graph k-Partition Problem: given a graph G which is a complete k-partite graph for some k, find the k-partition of the vertices of G into independent sets. The equivalence can be easily realized as follows: the set of points {s_1, ..., s_n} to be clustered naturally translates to the set of vertices {v_1, ..., v_n}, and there is an edge between v_i and v_j iff dist(s_i, s_j) > 0. Observe that a constant-factor k-clustering can be computed with t queries to the distance function iff a graph k-partition can be computed with t queries to the adjacency matrix of G.

Kavraki, Latombe, Motwani, and Raghavan [8] show that any deterministic algorithm that finds a Graph k-Partition requires Ω(nk) queries to the adjacency matrix of G. This result establishes a deterministic lower bound for k-Median.

Theorem 5.1 A deterministic k-Median algorithm must make Ω(nk) queries to the distance function to achieve a constant-factor approximation.

5.2 Deterministic Algorithms Requiring Õ(nk) Time

One natural question we can ask is what we can achieve without randomization. We have already seen how to get an O(n^{1+ε})-time clustering algorithm that uses n^ε space and gives a constant-factor approximation. However, this constant factor grows as 2^{1/ε}, and if we were to ask for an Õ(nk)-time algorithm we would have an approximation factor polynomial in n/k. Modifying our approach slightly, we can show the following:

Theorem 5.2 In Õ(nk) deterministic time, we have a poly-log n approximation for the k-Median problem in n^ε space and a single pass.

Proof: First we will construct an algorithm that runs in time Õ(nk); then we can reduce the space required in the same way as for the previously described randomized algorithm.

Consider the primal-dual algorithm that gives a constant-factor (say c) approximation for the k-Median problem. This algorithm takes time (and space) an² for some constant a. Consider the following algorithm, which we will call A₁: partition the n original points into p₁ equal-size subsets, apply the primal-dual algorithm to each of these subsets, and then apply it to the p₁k weighted points so obtained, to get k final medians. If we choose p₁ = (n/k)^{2/3}, the running time of A₁ is 2an^{4/3}k^{2/3}, and the space required is 2an^{4/3}k^{2/3} as well. By Theorem 2.4 we have an approximation of 4c² + 4c. Now define A₂ to split the dataset into p₂ partitions and apply A₁ on each of them and on the resulting intermediate medians (notice we can easily ensure an implementation that yields a one-pass algorithm). Solving to minimize the running time yields p₂ = (n/k)^{4/5}, and the running time and space required both become 4an^{16/15}k^{14/15}.
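As a numerical aside (using only the optimization just described): minimizing p(n/p)^e + (pk)^e over p maps an exponent e of n to e²/(2e - 1), and a few lines of Python show how quickly it approaches 1.

from fractions import Fraction

e = Fraction(4, 3)              # exponent of n in the running time of A_1
for i in range(1, 5):
    print(i, e)                 # prints 4/3, 16/15, 256/255, 65536/65535
    e = e * e / (2 * e - 1)     # exponent for A_{i+1}

These values match the closed form 1 + 1/(2^{2^i} − 1) appearing in the bound below.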

If we continue this process, so that A_i calls A_{i−1} on p_i partitions, we can prove without much difficulty that the running time and the space required by the algorithm will both be a2^i n^{1 + 1/(2^{2^i} − 1)} k^{1 − 1/(2^{2^i} − 1)}. However, the approximation factor c_i grows as c_i = 4c_{i−1}² + 4c_{i−1}. To get the exponent of n in the running time to be 1, it is sufficient to have i = O(log log log n). This makes the running time nk (hiding poly log log n factors) and gives an O(log^p n) approximation, since the approximation factor is 4^{2^i}. Thus we have a poly-log n approximation in Õ(nk) space and time. Now we can use this in our previous algorithm to get an O(log^p n)

approximation in n^ε space and Õ(nk) time, without using randomization. □

The above actually shows that we have an O(n^{1+ε})-time clustering with approximation guarantee polynomial in 1/ε. Combining this with Theorem 3.1, we get the following.

Theorem 5.3 The k-Median problem can be approximated in time O(n^{1+ε}) and space O(n^ε), up to a factor of O(poly(1/ε)).

Acknowledgments

We thank Umesh Dayal, Aris Gionis, Meichun Hsu, Piotr Indyk, Dan Oblinger, and Bin Zhang for numerous fruitful discussions.

References

[1] M. Charikar, C. Chekuri, T. Feder, and R. Motwani. Incremental clustering and dynamic information retrieval. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, 1997.

[2] M. Charikar and S. Guha. Improved combinatorial algorithms for the facility location and k-Median problems. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science, pages 378-388, 1999.

[3] M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant factor approximation algorithm for the k-Median problem. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 1-10, 1999.

[4] P. Flajolet and G. N. Martin. Probabilistic counting. In Proceedings of the 24th Annual IEEE Symposium on Foundations of Computer Science, pages 76-82, 1983.

[5] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low rank approximations. In Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science, 1998.

[6] J. Feigenbaum, S. Kannan, M. Strauss, and M. Vishwanathan. An approximate L1-difference algorithm for massive data sets. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science, pages 501-511, 1999.

[7] M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, Digital Equipment Corporation, Systems Research Center, May 1998.

[8] L. E. Kavraki, J. C. Latombe, R. Motwani, and P. Raghavan. Randomized query processing in robot path planning. Journal of Computer and System Sciences, special issue, vol. 57, pages 50-60, 1998.

[9] P. Indyk. Sublinear time algorithms for metric space problems. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing, pages 428-434, 1999.

[10] K. Jain and V. Vazirani. Primal-dual approximation algorithms for metric facility location and k-Median problems. In Proceedings of the 40th Annual IEEE Symposium on Foundations of Computer Science, pages 1-10, 1999.

[11] V. Kann, S. Khanna, J. Lagergren, and A. Panconesi. On the hardness of approximating MAX k-CUT and its dual.

[12] M. R. Korupolu, C. G. Plaxton, and R. Rajaraman. Analysis of a local search heuristic for facility location problems. In Proceedings of the 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1-10, 1998.

[13] G. S. Manku, S. Rajagopalan, and B. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 426-435, 1998.

[14] G. S. Manku, S. Rajagopalan, and B. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large databases. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pages 251-262, 1999.

[15] N. Mishra, D. Oblinger, and L. Pitt. Way-sublinear time approximate (PAC) clustering. Manuscript, 2000.

[16] J. I. Munro and M. S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, vol. 12, pages 315-323, 1980.

[17] A. Weber. Über den Standort der Industrien. Erster Teil. Reine Theorie der Standorte. Mit einem mathematischen Anhang von G. Pick. Verlag J. C. B. Mohr, Tübingen, Germany, 1909. (In German.)