http://www.mmds.org In many data mining situations, we do not know - PowerPoint PPT Presentation

Note to other teachers and users of these slides: We would be delighted if you found this material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org

Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University
http://www.mmds.org

- In many data mining situations, we do not know the entire data set in advance
- Stream management is important when the input rate is controlled externally:
  - Google queries
  - Twitter or Facebook status updates
- We can think of the data as infinite and non-stationary (the distribution changes over time)

J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org

- Input elements enter at a rapid rate, at one or more input ports (i.e., streams)
  - We call elements of the stream tuples
- The system cannot store the entire stream accessibly
- Q: How do you make critical calculations about the stream using a limited amount of (secondary) memory?

- Sensor data
  - E.g., millions of temperature sensors deployed in the ocean
- Image data from satellites, or even from surveillance cameras
  - E.g., London
- Internet and Web traffic
  - Millions of streams of IP packets
- Web data
  - Search queries to Google, clicks on Bing, etc.

- Types of queries one wants to answer on a data stream:
  - Filtering a data stream
    - Select elements with property x from the stream
  - Counting distinct elements
    - Number of distinct elements in the last n elements of the stream
  - Estimating moments
    - Estimate the average / standard deviation of the last n elements
  - Finding frequent elements
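As a concrete illustration of the "estimating moments" query, here is a minimal Python sketch that tracks the mean and standard deviation of the last n elements. The class name `WindowStats` and its methods are hypothetical, not from the slides. It is the exact baseline that stores the whole window, so it needs O(n) memory; it is not a small-space approximation.

```python
from collections import deque
import math


class WindowStats:
    """Exact mean / std. dev. over the last n stream elements (hypothetical helper)."""

    def __init__(self, n):
        if n <= 0:
            raise ValueError("window size must be positive")
        self.n = n
        self.window = deque()   # holds at most n elements
        self.total = 0          # running sum of the window
        self.total_sq = 0       # running sum of squares of the window

    def add(self, x):
        # Admit the new element, then evict the oldest one if the window is full.
        self.window.append(x)
        self.total += x
        self.total_sq += x * x
        if len(self.window) > self.n:
            old = self.window.popleft()
            self.total -= old
            self.total_sq -= old * old

    def mean(self):
        # Assumes at least one element has been added.
        return self.total / len(self.window)

    def std(self):
        # Population standard deviation: sqrt(E[x^2] - E[x]^2).
        m = self.mean()
        var = self.total_sq / len(self.window) - m * m
        return math.sqrt(max(var, 0.0))  # clamp tiny negative rounding error


stats = WindowStats(3)
for x in [32, 112, 14, 9, 37, 83, 115, 2]:
    stats.add(x)
print(stats.mean(), stats.std())  # statistics of the last three elements
```

The running sum-of-squares update keeps each step O(1), though with floating-point inputs it can lose precision when the variance is small relative to the mean.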

- Mining query streams
  - Google wants to know what queries are more frequent today than yesterday
- Mining click streams
  - Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour
- Mining social network news feeds
  - E.g., look for trending topics on Twitter, Facebook

- Sensor networks
  - Many sensors feeding into a central controller
- IP packets monitored at a switch
  - Gather information for optimal routing
  - Detect denial-of-service attacks

- Input: a sequence of T elements a_1, a_2, …, a_T from a known universe U, where |U| = u.
- Goal: perform a computation on the input in a single left-to-right pass, while:
  - Processing elements in real time
  - Not storing the full data, which means keeping storage minimal: just enough to maintain a working "summary"

Example stream: 32, 112, 14, 9, 37, 83, 115, 2, …

Some functions are easy: min, max, sum, …

We use a single register s with a simple update:
- Maximum: initialize s ← 0. For each element x, s ← max(s, x)
- Sum: initialize s ← 0. For each element x, s ← s + x
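The single-register updates can be sketched in Python as follows. The function names are illustrative, not from the slides. One deliberate deviation: the maximum starts from `None` rather than 0, so the sketch also works when all elements are negative (initializing to 0 implicitly assumes non-negative values).

```python
def running_max(stream):
    """One pass, one register: the largest element seen so far."""
    s = None  # assumption: start empty instead of 0, so negative inputs work
    for x in stream:
        s = x if s is None else max(s, x)
    return s


def running_sum(stream):
    """One pass, one register: the sum of all elements seen so far."""
    s = 0
    for x in stream:
        s += x
    return s


stream = [32, 112, 14, 9, 37, 83, 115, 2]
print(running_max(iter(stream)))  # 115
print(running_sum(iter(stream)))  # 404
```

Both functions consume an iterator, so they never need the stream to be held in memory.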

Example stream: 32, 12, 14, 32, 7, 12, 32, 7, 32, 12, 4, …

Some applications:
- Determining popular products
- Computing frequent search queries
- Identifying heavy TCP flows
- Identifying volatile stocks

Example stream: 32, 12, 14, 32, 7, 12, 32, 7, 6, 12, 4, …

Applications:
- IP packet streams: number of distinct IP addresses or IP flows (source + destination IP, port, protocol)
  - Anomaly detection, traffic monitoring
- Search: find how many distinct search queries were issued to a search engine (on a certain topic) yesterday
- Web services: how many distinct users (cookies) searched/browsed a certain term/item
  - Advertising, marketing, trends

Example stream: 32, 12, 14, 32, 7, 6, 12, 4, 12, 32, 7, …

- Want to compute the number of distinct keys in the stream
- How can you do this without storing all the elements?

- Cool applications of probability (and hashing)
- Can compute interesting global properties of a long stream, with only one pass over the data, while maintaining only a small amount of information about it. We call this small amount of information a sketch.
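To make the idea of a sketch concrete, here is a minimal Python example of one well-known hashing-based sketch for counting distinct keys, the k-minimum-values (KMV) estimator. This is a swapped-in illustration and not necessarily the technique these slides go on to present; the class name `KMVSketch`, the helper `_hash01`, and the default k = 256 are all assumptions. The idea: hash every key to a pseudo-random number in [0, 1) and keep only the k smallest hash values. If the k-th smallest is v, the stream holds roughly (k - 1) / v distinct keys.

```python
import hashlib
import heapq


def _hash01(item):
    """Map an item to a pseudo-random float in [0, 1) (hypothetical helper)."""
    digest = hashlib.sha1(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KMVSketch:
    """Keeps the k smallest hash values seen; memory is O(k), not O(#distinct)."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes: a max-heap of the k smallest
        self._members = set()  # the same hashes, to skip repeated keys

    def add(self, item):
        h = _hash01(item)
        if h in self._members:
            return  # key (or a colliding hash) already counted
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # New hash beats the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct keys: exact
        return (self.k - 1) / (-self._heap[0])


sketch = KMVSketch()
for key in [32, 12, 14, 32, 7, 6, 12, 4, 12, 32, 7]:
    sketch.add(key)
print(sketch.estimate())  # 6.0: exact here, since there are fewer than k keys
```

For streams with more than k distinct keys the answer is approximate, with relative error on the order of 1/sqrt(k), so k trades memory for accuracy.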

Special case: a majority element. Is there a one-pass algorithm that uses sublinear auxiliary space?

counter := 0; current := NULL
for i := 1 to n do
    if counter == 0 then
        current := A[i]; counter++
    else if A[i] == current then
        counter++
    else
        counter--
return current
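The pseudocode above translates directly to Python. This is a minimal sketch; the function name `majority_candidate` is illustrative. It uses two registers (a candidate and a counter) and a single pass. Note that the returned value is guaranteed to be the majority element only if one exists; otherwise it is an arbitrary element, and a second pass counting its occurrences is needed to verify it.

```python
def majority_candidate(stream):
    """One pass, O(1) space: return the majority element if the stream has one."""
    counter = 0
    current = None
    for x in stream:
        if counter == 0:
            # No candidate is currently ahead: adopt x.
            current = x
            counter = 1
        elif x == current:
            counter += 1
        else:
            # A different element cancels one vote for the candidate.
            counter -= 1
    return current


print(majority_candidate([32, 12, 32, 7, 32]))  # 32 (appears 3 times out of 5)
```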

Provably impossible in sublinear space. So what do we do?

Example stream: 32, 12, 14, 32, 7, 6, 12, 4, 12, 32, 7, …

- The number of distinct keys in the stream

[Figure: the stream-processing model. Several streams enter over time (e.g., "… 1, 5, 2, 7, 0, 9, 3", "… a, r, v, t, y, h, b", "… 0, 0, 1, 0, 1, 1, 0"); each stream is composed of elements/tuples. A Processor answers standing queries and ad-hoc queries and produces output, using limited working storage and archival storage.]
